Lec 4
A test collection consists of:
1. Document Collection
2. Topics (sample of information needs)
3. Relevance judgments (qrels)
How can we get relevance judgments?
For web search, companies run their own studies to assess the
performance of their search engines.
Web-search performance is monitored by:
● Traffic
● User clicks and session logs
● Labelling results for a sample of users’ queries
Scoring Example: Jaccard coefficient
A commonly used measure of the overlap of two sets A and B:

jaccard(A, B) = |A ∩ B| / |A ∪ B|

In our context, the two sets are the document's terms (after preprocessing) and the query's terms.
Note that:
● jaccard(A, A) = 1
● jaccard(A, B) = 0 if A ∩ B = ∅
● A and B don’t have to be of the same size.
● Jaccard always assigns a number between 0 and 1.
Example
For the query "the arab world" and the document "fifa world cup in arab
country", what is the Jaccard similarity (after removing stop words)?
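This computation can be sketched in a few lines of Python (the two-word stop list below is an assumption for this example):

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

# Assumed minimal stop list for this example.
stop_words = {"the", "in"}
query = [t for t in "the arab world".split() if t not in stop_words]
doc = [t for t in "fifa world cup in arab country".split() if t not in stop_words]

# |{arab, world} ∩ {fifa, world, cup, arab, country}| = 2, union size = 5
print(jaccard(query, doc))  # 0.4
```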
TF-IDF ranking
Term-Document Count Matrix
Each document is represented as a count (frequency) vector of raw term frequencies (tf):

Term       | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony     |        157           |      73       |      0      |   0    |    0    |    0
Brutus     |          4           |     157       |      0      |   1    |    0    |    0
Caesar     |        232           |     227       |      0      |   2    |    1    |    1
Calpurnia  |          0           |      10       |      0      |   0    |    0    |    0
Cleopatra  |         57           |       0       |      0      |   0    |    0    |    0
mercy      |          2           |       0       |      3      |   5    |    5    |    1
worser     |          2           |       0       |      1      |   1    |    1    |    0
Log-Frequency Weighting
The log-frequency weight of term t in d (the term-weighting function) is:

w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
w_{t,d} = 0                     otherwise (if tf_{t,d} = 0)
0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
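The weighting can be sketched as:

```python
import math

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

# Reproduces the mapping above: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
```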
Inverse Document Frequency, idf
idf (inverse document frequency) of t, where N is the number of documents in the collection and df_t is the number of documents that contain t:

idf_t = log10( N / df_t )
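A sketch of the idf computation with a base-10 log (the collection size and df values below are hypothetical):

```python
import math

def idf(df_t, N):
    """idf_t = log10(N / df_t): the rarer the term, the higher its weight."""
    return math.log10(N / df_t)

N = 1_000_000  # hypothetical collection size
for term, df_t in [("rare-term", 1), ("common-term", 10_000), ("stop-word", 1_000_000)]:
    print(term, idf(df_t, N))
```

A term that occurs in every document gets idf 0, so it contributes nothing to the ranking.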
Solve this example (each student should do a detailed solution):
A collection of 5 documents is shown, where each document is a set of colored balls (balls = terms).
[Figure: documents D1–D5 drawn as groups of colored balls]
Query: "the destructive storm"
Document frequencies:
● df(yellow) = 5
● df(red) = 3
● df(green) = 3
Start your solution with:
● Which is the most relevant document?
● Which is the least relevant document?
Vector Space Model
Binary → Count → Weight Matrix
[Table: the same terms and plays as the count matrix above, now with tf-idf weights]
Sec. 6.3
Documents as Vectors
According to the figure, if we assume 2 terms then the space has 2 dimensions, and each document (e.g., d4) is a point (vector) in that space.
Note that:
● The space is very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
● These are very sparse vectors – most entries are zero.
This representation leads to the Vector Space Model.
Queries as Vectors
Key idea 1: Do the same for queries: represent them as vectors in the space.
(We treat the query as if it were a small document and represent it as a vector in the same space.)
3. Length Normalization
A vector can be (length-) normalized by dividing each of its
components by its length – for this we use the L2 norm:

‖x‖₂ = √( Σᵢ xᵢ² )

Dividing a vector by its L2 norm makes it a unit-length vector.
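A sketch of L2 normalization:

```python
import math

def l2_normalize(vec):
    """Divide each component by the L2 norm ||v|| = sqrt(sum of v_i^2)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

v = [3.0, 4.0]          # ||v|| = 5
print(l2_normalize(v))  # [0.6, 0.8] -- a unit-length vector
```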
Cosine “Similarity” (Query, Document)

cos(q, d) = (q · d) / (‖q‖ ‖d‖) = Σᵢ qᵢ dᵢ / ( √(Σᵢ qᵢ²) · √(Σᵢ dᵢ²) )
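The cosine score can be sketched as follows (the toy term weights below are illustrative, not from the slides):

```python
import math

def cosine(q, d):
    """cos(q, d) = (q . d) / (||q|| ||d||) over aligned term-weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(x * x for x in q))
    nd = math.sqrt(sum(x * x for x in d))
    return dot / (nq * nd) if nq and nd else 0.0

q = [1.0, 1.0, 0.0]   # query weights in a toy 3-term space
d1 = [2.0, 1.0, 0.0]  # shares two terms with the query
d2 = [0.0, 0.0, 3.0]  # shares no terms with the query
print(cosine(q, d1))  # close to 1: similar direction
print(cosine(q, d2))  # 0.0: orthogonal
```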
Sec. 6.4
Many search engines allow for different weightings for queries vs.
documents.
SMART notation: a combined scheme is written ddd.qqq, where the three letters give the tf, df, and normalization choices for documents (ddd) and for queries (qqq), using the acronyms from the SMART table.
A very standard weighting scheme is lnc.ltc: documents use logarithmic tf, no idf, and cosine normalization; queries use logarithmic tf, idf (t), and cosine normalization.
BM25 ranking
Okapi BM25 Ranking Function
[Figure: two plots. Left: Okapi TF vs. raw TF for 𝐿𝑑/𝐿 = 0.5, 1.0, 2.0 – the Okapi TF component saturates as raw TF grows, and longer documents receive smaller weights. Right: Classic IDF vs. Okapi IDF as a function of raw DF.]
𝐿𝑑: Length of d
𝐿: average doc length in collection
Okapi BM25 Ranking Function

score(q, d) = Σ_{t ∈ q} idf_t · [ tf_{t,d} · (k1 + 1) ] / [ tf_{t,d} + k1 · (1 − b + b · 𝐿𝑑 / 𝐿) ]

𝐿𝑑: Length of d
𝐿: average doc length in collection
k1, b: parameters (tf saturation and length normalization, respectively)
A typical setting is k1 = 2, b = 0.75.
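A minimal sketch of the scoring function, assuming the classic log10(N/df) idf; the toy collection statistics below are hypothetical:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_len, df, N, k1=2.0, b=0.75):
    """Okapi BM25: sum over query terms of
    idf_t * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue  # a term absent from the document contributes nothing
        idf_t = math.log10(N / df[t])
        score += idf_t * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return score

# Hypothetical statistics for a 5-document collection.
df = {"storm": 2, "destructive": 1}
doc_tf = {"storm": 3, "destructive": 1}
print(bm25_score(["the", "destructive", "storm"], doc_tf,
                 doc_len=10, avg_len=8, df=df, N=5))
```

Note how b controls the length penalty: with b = 0, 𝐿𝑑/𝐿 drops out entirely.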
Summary – Vector Space Ranking
Represent the query as a term-weighted (e.g., tf-idf) vector.
Represent each document as a term-weighted (e.g., tf-idf) vector.
Compute the cosine similarity score for the query vector and
each document vector.
Rank documents with respect to the query by score.
Return the top K (e.g., K = 10) to the user.
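The whole pipeline above can be sketched end to end (the two-document collection and whitespace tokenizer are assumptions for illustration):

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N, vocab):
    """One tf-idf weight per vocabulary term: (1 + log10 tf) * log10(N / df_t)."""
    tf = Counter(tokens)
    return [(1 + math.log10(tf[t])) * math.log10(N / df[t]) if tf[t] > 0 else 0.0
            for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Tiny illustrative collection.
docs = {"d1": "arab world cup".split(), "d2": "storm news".split()}
N = len(docs)
vocab = sorted({t for toks in docs.values() for t in toks})
df = {t: sum(t in toks for toks in docs.values()) for t in vocab}

q_vec = tfidf_vector("arab world".split(), df, N, vocab)
ranked = sorted(docs, reverse=True,
                key=lambda d: cosine(q_vec, tfidf_vector(docs[d], df, N, vocab)))
print(ranked[:10])  # top-K document ids by cosine score
```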