Introduction to Information Retrieval
Lecture 5: Scoring, Term Weighting and the Vector Space Model
This lecture
Ranked retrieval
Scoring documents
Term frequency
Collection statistics
Weighting schemes
Vector space scoring
Ranked retrieval (Ch. 6)
Thus far, our queries have all been Boolean.
Documents either match or don’t.
Good for expert users with precise understanding of
their needs and the collection.
Not good for the majority of users.
Most users are incapable of writing Boolean queries (or they are capable, but think it’s too much work).
Most users don’t want to wade through 1000s of results.
This is particularly true of web search.
Term frequency tf (Ch. 6)
The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores. But how?
Raw term frequency is not what we want:
A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with
term frequency.
NB: frequency = count in IR
Log-frequency weighting (Sec. 6.2)
The log frequency weight of term t in d is

w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}
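As a minimal sketch in Python (the helper name log_tf_weight is mine, not from the lecture):

```python
import math

def log_tf_weight(tf: int) -> float:
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

# Dampening in action: 10x the raw count adds only 1 to the weight.
print(log_tf_weight(1))     # 1.0
print(log_tf_weight(10))    # 2.0
print(log_tf_weight(1000))  # 4.0
print(log_tf_weight(0))     # 0.0
```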
Document frequency
Rare terms are more informative than frequent terms
Recall stop words
Consider a term in the query that is rare in the
collection (e.g., arachnocentric)
A document containing this term is very likely to be
relevant to the query arachnocentric
→ We want a high weight for rare terms like
arachnocentric.
idf weight (Sec. 6.2.1)
df_t is the document frequency of t: the number of documents that contain t
df_t is an inverse measure of the informativeness of t
df_t ≤ N
We define the idf (inverse document frequency) of t by

\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)

We use \log_{10}(N/\mathrm{df}_t) instead of N/\mathrm{df}_t to “dampen” the effect of idf.
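A small sketch of the computation; the collection size and document frequencies below are illustrative, not from the lecture:

```python
import math

def idf(N: int, df: int) -> float:
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

N = 1_000_000  # hypothetical collection size
# Rare terms get high idf; ubiquitous terms get idf near 0.
for term, df_t in [("arachnocentric", 1), ("under", 100_000), ("the", 1_000_000)]:
    print(term, idf(N, df_t))  # 6.0, 1.0, 0.0
```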
tf-idf weighting (Sec. 6.2.1)
The tf-idf weight of a term is the product of its tf weight and its idf weight.

w_{t,d} = \log(1 + \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)

\mathrm{Score}(q, d) = \sum_{t \in q \cap d} \mathrm{tf.idf}_{t,d}
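Putting tf and idf together, a hedged sketch of scoring; the function names and toy statistics are assumptions of this example, not the lecture's code:

```python
import math

def tf_idf(tf: int, df: int, N: int) -> float:
    """tf-idf weight: log(1 + tf) * log10(N / df).
    The slide leaves the base of the first log unspecified; natural log here."""
    return math.log(1 + tf) * math.log10(N / df)

def score(query_terms: set, doc_tf: dict, df: dict, N: int) -> float:
    """Score(q, d): sum tf-idf over terms present in both query and document."""
    return sum(tf_idf(doc_tf[t], df[t], N) for t in query_terms if t in doc_tf)

N = 1000  # toy collection size
df = {"arachnocentric": 2, "spider": 40}
doc_tf = {"arachnocentric": 3, "spider": 1}
print(score({"arachnocentric", "spider"}, doc_tf, df, N))  # ≈ 4.71
```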
Documents as vectors (Sec. 6.3)
So we have a |V|-dimensional vector space
Terms are axes of the space
Documents are points or vectors in this space
Very high-dimensional: tens of millions of dimensions
when you apply this to a web search engine
These are very sparse vectors: most entries are zero.
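Because of this sparsity, an implementation would typically store only the nonzero coordinates, e.g. as a term → weight mapping; a sketch with made-up terms and weights:

```python
# Sparse document vector: only nonzero entries are stored. With |V| in the
# tens of millions, a dense array per document would be prohibitive; a dict
# keeps memory proportional to the number of distinct terms in the document.
doc_vector = {
    "antony": 5.25,  # hypothetical term weights
    "brutus": 1.21,
    "caesar": 8.59,
}
weight = doc_vector.get("calpurnia", 0.0)  # absent term -> weight 0.0
print(weight)
```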
Queries as vectors (Sec. 6.3)
Key idea 1: Do the same for queries: represent them
as vectors in the space
Key idea 2: Rank documents according to their
proximity to the query in this space
proximity = similarity of vectors
proximity ≈ inverse of distance
Recall: We do this because we want to get away from
the you’re-either-in-or-out Boolean model.
Instead: rank more relevant documents higher than
less relevant documents
cosine(query,document) (Sec. 6.3)
\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2} \, \sqrt{\sum_{i=1}^{|V|} d_i^2}}

The numerator is the dot product of \vec{q} and \vec{d}; dividing by the lengths turns them into unit vectors.
For length-normalized q, d: \cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d}.
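A sketch of the computation for sparse vectors (function names mine, weights illustrative):

```python
import math

def norm(v: dict) -> float:
    """Euclidean length of a sparse term -> weight vector."""
    return math.sqrt(sum(w * w for w in v.values()))

def cosine(q: dict, d: dict) -> float:
    """cos(q, d) = (q . d) / (|q| |d|); only shared terms contribute."""
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    return dot / (norm(q) * norm(d))

q = {"affection": 1.0, "jealous": 1.0}
d = {"affection": 3.06, "jealous": 2.0, "gossip": 1.30}
print(cosine(q, d))  # ≈ 0.92
```

If document vectors are length-normalized at index time, the denominator is 1 and cosine reduces to a plain dot product, as the slide notes.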
Worked example (Sec. 6.3): length-normalized log-frequency vectors for three novels, SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights).
cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0
≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
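The figures can be checked directly. The SaS and PaP coordinates are read off the product above; the WH weights are quoted from the table accompanying this example in the textbook (not reproduced in this extract), so treat them as an assumption here:

```python
# Length-normalized log-frequency vectors over the example's four terms
# (affection, jealous, gossip, wuthering).
SaS = [0.789, 0.515, 0.335, 0.0]
PaP = [0.832, 0.555, 0.0, 0.0]
WH = [0.524, 0.465, 0.405, 0.588]  # quoted from the textbook table (assumption)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

print(round(dot(SaS, PaP), 2))  # 0.94
print(round(dot(SaS, WH), 2))   # 0.79
print(round(dot(PaP, WH), 2))   # 0.69
```

SaS and PaP are both Jane Austen novels, so their term distributions resemble each other more than either resembles Brontë's WH.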