
Information Retrieval Models:

Vector Space Models


ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign


Empirical IR vs. Model-based IR
• Empirical IR:
– heuristic approaches
– solely rely on empirical evaluation
– assumptions not always clearly stated
– findings: empirical observations; may or may not generalize well

• Model-based IR:
– theoretical approaches
– rely more on mathematics
– assumptions are explicitly stated
– findings: principles, models that may work well or not work well;
generalize better

• Boundary may not be clear and a combination is generally necessary
History of Research on IR Models
• 1960: First probabilistic model [Maron & Kuhns 60]

• 1970s: Active research on retrieval models started
– Vector-space model [Salton et al. 75]
– Classic probabilistic model [Robertson & Sparck Jones 76]
– Probability Ranking Principle [Robertson 77]

• 1980s: Further development of different models


– Non-classic logic model [Rijsbergen 86]
– Extended Boolean [Salton et al. 83]
– Early work on learning to rank [Fuhr 89]

History of Research on IR Models (cont.)
• 1990s: retrieval model research driven by TREC
– Inference network [Turtle & Croft 91]
– BM25/Okapi [Robertson et al. 94]
– Pivoted length normalization [Singhal et al. 96]
– Language model [Ponte & Croft 98]

• 2000s-present: retrieval models influenced by machine learning and Web search
– Further development of language models [Zhai & Lafferty 01, Lavrenko &
Croft 01]

– Divergence from randomness [Amati et al. 02]


– Axiomatic model [Fang et al. 04]
– Markov Random Field [Metzler & Croft 05]
– Further development of learning to rank [Joachims 02, Burges et al. 05]
Modeling Relevance:
Roadmap for Retrieval Models
[Diagram: roadmap of retrieval models, grouped by how relevance is modeled; relevance constraints [Fang et al. 04] and divergence from randomness (Amati & Rijsbergen 02) attach directly to the notion of relevance]

• Similarity-based models: f(q,d) = Δ(Rep(q), Rep(d)), with different representations and similarity measures
– Vector space model (Salton et al., 75)
– Prob. distr. model (Wong & Yao, 89)

• Probability of relevance: P(r=1|q,d), r ∈ {0,1}
– Regression model (Fuhr 89)
– Learning to rank (Joachims 02, Burges et al. 05)
– Generative models:
  • Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
  • Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)

• Probabilistic inference: P(d→q) or P(q→d), with different inference systems
– Prob. concept space model (Wong & Yao, 95)
– Inference network model (Turtle & Croft, 91)
1. Vector Space Models
The Basic Question
Given a query, how do we know if
document A is more relevant than B?

One Possible Answer

If document A uses more query words than document B
(word usage in document A is more similar to that in the query)
Relevance = Similarity
• Assumptions
– Query and document are represented similarly
– A query can be regarded as a “document”
– Relevance(d,q) ≈ similarity(d,q)
• R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = Δ(Rep(q), Rep(d))
• Key issues
– How to represent query/document?
– How to define the similarity measure Δ?

Vector Space Model
• Represent a doc/query by a term vector
– Term: basic concept, e.g., word or phrase
– Each term defines one dimension
– N terms define a high-dimensional space
– Element of vector corresponds to term weight
– E.g., d=(x1,…,xN), xi is “importance” of term i
• Measure relevance by the distance between the
query vector and document vector in the vector
space
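As a minimal sketch of this representation (assuming a toy whitespace tokenizer and raw counts as the term weights; later slides refine the weighting), a doc or query maps to a sparse term vector:

```python
from collections import Counter

def term_vector(text):
    """Map text to a sparse term vector: term -> raw count (one dimension per term)."""
    return Counter(text.lower().split())

# Toy example: the document and the query become vectors in the same term space.
doc = term_vector("news about presidential campaign news")
query = term_vector("presidential campaign news")
print(doc, query)
```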

VS Model: illustration

[Figure: documents D1-D11 and a query shown as points in a three-dimensional term space with dimensions "Java", "Starbucks", and "Microsoft"; "??" marks documents whose relative relevance to the query is unclear]
What the VS model doesn’t say
• How to define/select the “basic concept”
– Concepts are assumed to be orthogonal
• How to assign weights
– Weight in query indicates importance of term
– Weight in doc indicates how well the term
characterizes the doc
• How to define the similarity/distance measure

What’s a good “basic concept”?
• Orthogonal
– Linearly independent basis vectors
– “Non-overlapping” in meaning
• No ambiguity
• Weights can be assigned automatically and
hopefully accurately
• Many possibilities: Words, stemmed words,
phrases, “latent concept”, …
• “Bag of words” representation works
“surprisingly” well!
How to Assign Weights?
• Very very important!
• Why weighting
– Query side: Not all terms are equally important
– Doc side: Some terms carry more information about contents

• How?
– Two basic heuristics
• TF (Term Frequency) = Within-doc-frequency
• IDF (Inverse Document Frequency)

– Document length normalization

TF Weighting

• Idea: A term is more important if it occurs more frequently in a document
• Formulas: Let f(t,d) be the frequency count of term
t in doc d
– Raw TF: TF(t,d) = f(t,d)
– Log TF: TF(t,d)=log ( f(t,d) +1)
– Maximum frequency normalization:
TF(t,d) = 0.5 +0.5*f(t,d)/MaxFreq(d)
– “Okapi/BM25 TF”:
TF(t,d) = k f(t,d)/(f(t,d)+k(1-b+b*doclen/avgdoclen))
• Normalization of TF is very important!
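A minimal sketch of these TF variants (f is the raw count f(t,d); the k and b defaults are common illustrative values, not prescribed by the slide):

```python
import math

def tf_raw(f):
    """Raw TF: the frequency count itself."""
    return f

def tf_log(f):
    """Log TF: dampens the effect of repeated occurrences."""
    return math.log(f + 1)

def tf_maxnorm(f, max_freq):
    """Maximum frequency normalization: maps present terms into [0.5, 1.0]."""
    return 0.5 + 0.5 * f / max_freq

def tf_bm25(f, doclen, avgdoclen, k=1.2, b=0.75):
    """Okapi/BM25-style TF with pivoted length normalization, as written on the slide."""
    return k * f / (f + k * (1 - b + b * doclen / avgdoclen))
```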
TF Normalization
• Why?
– Document length variation
– “Repeated occurrences” are less informative than
the “first occurrence”
• Two views of document length
– A doc is long because it uses more words
– A doc is long because it has more contents
• Generally penalize long doc, but avoid over-
penalizing (e.g., pivoted normalization)

TF Normalization (cont.)
[Figure: normalized TF plotted against raw TF for different values of b]

"Pivoted normalization": using the average doc length to regularize normalization:

  normalizer = 1 - b + b*doclen/avgdoclen, where b varies from 0 to 1

Normalization interacts with the similarity measure.
IDF Weighting
• Idea: A term is more discriminative/important if it occurs in fewer documents
• Formula:
IDF(t) = 1+ log(n/k)
n – total number of docs
k -- # docs with term t (doc freq)
• Other variants:
– IDF(t) = log((n+1)/k)
– IDF(t)=log ((n+1)/(k+0.5))
• What are the maximum and minimum values of
IDF?
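A minimal sketch of these IDF variants (n = total number of docs, k = document frequency of the term):

```python
import math

def idf_basic(n, k):
    """IDF(t) = 1 + log(n/k)."""
    return 1 + math.log(n / k)

def idf_variant1(n, k):
    """IDF(t) = log((n+1)/k)."""
    return math.log((n + 1) / k)

def idf_variant2(n, k):
    """IDF(t) = log((n+1)/(k+0.5)); the 0.5 smooths the document frequency."""
    return math.log((n + 1) / (k + 0.5))
```

For the basic formula the maximum is 1 + log(n) (a term appearing in only one doc) and the minimum is 1 (a term appearing in every doc).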
Non-Linear Transformation in IDF
[Figure: IDF(t) = 1 + log(n/k) plotted against document frequency k, compared with a linear penalization; the curve starts at 1 + log(n) when k = 1 and falls to 1 when k = n, the total number of docs in the collection]

Is this transformation optimal?
TF-IDF Weighting
• TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
– Common in doc → high TF → high weight
– Rare in collection → high IDF → high weight
• Imagine a word count profile, what kind of terms
would have high weights?
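A small sketch combining the two heuristics (using the log TF and the 1 + log(n/k) IDF variants from the previous slides; the function name is illustrative):

```python
import math

def tf_idf_weight(f, n, k):
    """weight(t,d) = TF(t,d) * IDF(t), here with log TF and IDF(t) = 1 + log(n/k)."""
    return math.log(f + 1) * (1 + math.log(n / k))
```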

Empirical distribution of words

• There are stable, language-independent patterns in how people use natural languages
• A few words occur very frequently; most occur
rarely. E.g., in news articles,
– Top 4 words: 10~15% word occurrences
– Top 50 words: 35~40% word occurrences
• The most frequent word in one corpus may be
rare in another

Zipf’s Law

• rank * frequency ≈ constant:  F(w) = C / r(w)^α,  with α ≈ 1, C ≈ 0.1

[Figure: word frequency plotted against word rank (by frequency); a region of the curve is labeled "high entropy words"]

• Generalized Zipf's law: F(w) = C / [r(w) + B]^α, applicable in many domains

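As a rough sketch (assuming the idealized constants α ≈ 1 and C ≈ 0.1 from above), the predicted share of word occurrences at a given rank:

```python
def zipf_freq(rank, C=0.1, alpha=1.0):
    """Zipf's law: relative frequency F(w) = C / r(w)**alpha."""
    return C / rank ** alpha

# Under this idealized law the top 4 ranks together account for about
# 0.1 * (1 + 1/2 + 1/3 + 1/4) ~= 21% of occurrences; real news corpora
# show roughly 10-15%, so the constants are only approximate.
print(sum(zipf_freq(r) for r in range(1, 5)))
```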
How to Measure Similarity?

D_i = (w_i1, ..., w_iN),  Q = (w_q1, ..., w_qN),  with w = 0 if a term is absent

• Dot product similarity:  sim(Q, D_i) = Σ_{j=1..N} w_qj * w_ij

• Cosine:  sim(Q, D_i) = [Σ_{j=1..N} w_qj * w_ij] / [sqrt(Σ_j w_qj^2) * sqrt(Σ_j w_ij^2)]
  (= normalized dot product)

• How about Euclidean?  sim(Q, D_i) = sqrt(Σ_{j=1..N} (w_qj - w_ij)^2)
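A minimal sketch of the dot product and cosine measures over sparse term-weight vectors (Python dicts mapping term to weight; the names and example weights are illustrative):

```python
import math

def dot(q, d):
    """Dot product over sparse vectors: absent terms have weight 0."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def cosine(q, d):
    """Cosine similarity: the dot product normalized by the two vector lengths."""
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot(q, d) / (norm_q * norm_d)

# Hypothetical TF-IDF weights for a query and a document:
query = {"java": 1.0, "starbucks": 1.0}
doc = {"java": 2.3, "microsoft": 0.7}
print(dot(query, doc), cosine(query, doc))
```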
What Works the Best?

• Use single words
• Use stat. phrases
• Remove stop words
• Stemming
• Others(?)

(Singhal 2001; Singhal et al. 1996)
Relevance Feedback in VS
• Basic setting: Learn from examples
– Positive examples: docs known to be relevant
– Negative examples: docs known to be non-relevant
– How do you learn from this to improve performance?
• General method: Query modification
– Adding new (weighted) terms
– Adjusting weights of old terms
– Doing both
• The most well-known and effective approach is Rocchio
[Rocchio 1971]

Rocchio Feedback: Illustration
[Figure: relevant (+) and non-relevant (-) documents scattered in the vector space; the original query q is moved toward the centroid of the relevant documents and away from the centroid of the non-relevant documents, yielding the modified query q_m]
Rocchio Feedback: Formula

New query = weighted combination of the original query, the centroid of the rel docs, and the centroid of the non-rel docs (the standard Rocchio form):

  q_m = α*q + β * (1/|D_r|) * Σ_{d ∈ D_r} d  -  γ * (1/|D_n|) * Σ_{d ∈ D_n} d

where q is the original query, D_r the rel docs, D_n the non-rel docs, and α, β, γ are the parameters.
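A minimal sketch of this update over sparse term-weight vectors (the α, β, γ defaults and the clipping of negative weights to zero are common practical choices, not prescribed by the slide):

```python
def centroid(docs):
    """Average the term-weight vectors of a set of documents."""
    c = {}
    for d in docs:
        for t, w in d.items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(docs) for t, w in c.items()}

def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q + beta*centroid(rel) - gamma*centroid(nonrel), keeping only positive weights."""
    pos = centroid(rel_docs) if rel_docs else {}
    neg = centroid(nonrel_docs) if nonrel_docs else {}
    q_m = {}
    for t in set(query) | set(pos) | set(neg):
        w = alpha * query.get(t, 0.0) + beta * pos.get(t, 0.0) - gamma * neg.get(t, 0.0)
        if w > 0:
            q_m[t] = w
    return q_m
```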
Rocchio in Practice
• How can we optimize the parameters?
• Can it be used for both relevance feedback and pseudo
feedback?
• How does Rocchio feedback affect the efficiency of
scoring documents? How can we improve the
efficiency?

Advantages of VS Model

• Empirically effective! (Top TREC performance)
• Intuitive
• Easy to implement
• Well-studied/Most evaluated
• The Smart system
– Developed at Cornell: 1960-1999
– Still widely used
• Warning: Many variants of TF-IDF!
Disadvantages of VS Model
• Assume term independence
• Assume query and document are represented in the same way
• Lack of “predictive adequacy”
– Arbitrary term weighting
– Arbitrary similarity measure
• Lots of parameter tuning!

What You Should Know
• Basic idea of the vector space model
• TF-IDF weighting
• Pivoted length normalization (read [Singhal et
al. 1996] to know more)
• BM25/Okapi retrieval function (particularly TF
weighting)
• How Rocchio feedback works

