Lec 4

Reusable test collections are used for information retrieval evaluation. They consist of a document collection, topics representing information needs, and relevance judgments indicating which documents are relevant to each topic. Such test collections allow researchers to develop and evaluate new retrieval methods in a standardized way.

Reusable Test Collections

Reusable test collections (for evaluation purposes) consist of:

1. Document Collection
2. Topics (sample of information needs)
3. Relevance judgments (qrels)
How can we get it?
 For web search, companies run their own studies to assess the performance of their search engines.
 Web-search performance is monitored by:
● Traffic
● User clicks and session logs
● Labelling results for selected users’ queries
 In academia (or lab settings), test collections are obtained in two ways:
● someone goes out and builds them (expensive), or
● they arise as a byproduct of large-scale evaluations (a collaborative effort).
 IR evaluation campaigns were created for this reason.
IR Evaluation Campaigns
 IR test collections are provided for scientific communities to develop
better IR methods.
 Collections and queries are provided, relevance judgements are built
during the campaign.
 TREC = Text REtrieval Conference (http://trec.nist.gov/)
● Main IR evaluation campaign, sponsored by NIST (US gov).
● Series of annual evaluations, started in 1992.
 Other evaluation campaigns
● CLEF: European version (since 2000)
● NTCIR: Asian version (since 1999)
● FIRE: Indian version (since 2008)
TREC Tracks and Tasks
 TREC (and other campaigns) is organized as a set of tracks; each track covers one or more search tasks.
● Each track/task is about searching a set of documents of a given genre and domain.
 Examples
● TREC Web track
● TREC Medical track
● TREC Legal track
● CLEF-IP track
● NTCIR patent mining track
● TREC Microblog track
• Adhoc search task
• Filtering task
1. TREC Collection
A set of hundreds of thousands or millions of documents
● 1 billion documents in the case of web search (TREC ClueWeb09)
 The typical format of a document:
<DOC>
<DOCNO> 1234 </DOCNO>
<TEXT>
This is the document.
Multilines of plain text.
</TEXT>
</DOC>
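As an illustration only (not part of the original slides, and not the official TREC tooling), a minimal Python sketch that extracts DOCNO/TEXT pairs from a collection in this format might look like the following; the helper name parse_trec_docs is hypothetical.

import re

# Illustrative sketch: extract (DOCNO, TEXT) pairs from a string in the
# <DOC>/<DOCNO>/<TEXT> format shown above.
DOC_RE = re.compile(
    r"<DOC>\s*<DOCNO>\s*(.*?)\s*</DOCNO>.*?<TEXT>\s*(.*?)\s*</TEXT>\s*</DOC>",
    re.DOTALL,
)

def parse_trec_docs(raw):
    """Yield (docno, text) for every document in the raw collection string."""
    for match in DOC_RE.finditer(raw):
        yield match.group(1), match.group(2)

sample = "<DOC>\n<DOCNO> 1234 </DOCNO>\n<TEXT>\nThis is the document.\n</TEXT>\n</DOC>"
print(list(parse_trec_docs(sample)))   # [('1234', 'This is the document.')]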
2. TREC Topic
 Topic: a statement of information need
 Multiple topics (~50) developed (mostly) at NIST for a collection.
 Developed by experts and associated with additional details.
● Title: the query text
● Description: description of what is meant by the query.
● Narrative: what should be considered relevant.
<num>189</num>
<title>Health and Computer Terminals</title>
<desc>Is it hazardous to the health of individuals to work with computer terminals on a
daily basis?</desc>
<narr>Relevant documents would contain any information that expands on any
physical disorder/problems that may be associated with the daily working with
computer terminals. Such things as carpel tunnel, cataracts, and fatigue have been said
to be associated, but how widespread are these or other problems and what is being
done to alleviate any health problems</narr>
3. Relevance Judgments
 For each topic, the set of relevant documents needs to be known for an effective evaluation:
a) Exhaustive assessment is usually impractical
● TREC usually has 50 topics
● Collection usually has >1 million documents
b) Random sampling won’t work
● If relevant docs are rare, none may be found!
c) IR systems can help focus the sample (Pooling)
● Each system finds some relevant documents
● Different systems find different relevant documents
● Together, enough systems will find most of them
Pooled Assessment Methodology
1. Systems submit top 1000 documents per topic (Ranked)
2. Top 100 documents from each are manually judged
• Single pool, duplicates removed, arbitrary order
• Judged by the person who developed the topic
3. Treat unevaluated documents as not relevant
4. Compute MAP (or others) down to 1000 documents

 To make pooling work:


● Good number of participating systems
● Systems must do reasonably well
● Systems must be different (not all “do the same thing”)
Example
In one of the TREC tracks, 3 teams (T1, T2, and T3) participated and were asked to retrieve up to 15 documents per query. In reality (with exhaustive judgments), a query Q has 9 relevant documents in the collection: A, B, C, D, E, F, G, H, and I.
The submitted ranked lists are as follows:

Rank  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
T1    A  M  Y  R  K  L  B  Z  E  N  D  C  W
T2    Y  A  J  R  N  Z  M  C  G  B  X  P  D  K  W
T3    G  B  Y  K  E  A  Z  L  N  C  H  K  W  X

We constructed the judging pool for Q using only the top 5 documents of each of the submitted ranked lists. What is the Average Precision of each of the 3 teams?

Solution
Given: the pool is the top 5 documents of each run, and judging is done only on the pooled documents.
Relevant documents w.r.t. the pool: A, G, B, E (the other relevant documents never enter the pool, so they are treated as not relevant).
AP of team 1 = (1/1 + 2/7 + 3/9) / 4
We can easily repeat the same methodology for the other teams.
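As an illustration (not part of the slides), a small Python sketch that builds the depth-5 pool and computes AP for each run, normalizing by the number of pooled relevant documents as in the solution above:

# Minimal sketch, assuming pooling depth 5 and AP normalized by the number
# of pooled relevant documents (as in the slide's solution).
runs = {
    "T1": list("AMYRKLBZENDCW"),
    "T2": list("YAJRNZMCGBXPDKW"),
    "T3": list("GBYKEAZLNCHKWX"),
}
true_relevant = set("ABCDEFGHI")   # exhaustive judgments (unknown in practice)

# Depth-5 pool: union of the top 5 documents of every run.
pool = {doc for ranking in runs.values() for doc in ranking[:5]}
qrels = pool & true_relevant        # judged relevant: {'A', 'B', 'E', 'G'}

def average_precision(ranking, qrels):
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in qrels:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(qrels) if qrels else 0.0

for team, ranking in runs.items():
    print(team, round(average_precision(ranking, qrels), 3))
# T1: (1/1 + 2/7 + 3/9) / 4 ≈ 0.405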
Ranked Retrieval (vector space model & BM25)
Ranked Retrieval
 Typical queries: free-text queries (no Boolean operators)
 Results are “ranked” with respect to a query
 Large result sets are not an issue
● We just show the top k (≈ 10) results
● We don’t overwhelm the user
 Top-ranked documents are the most likely to satisfy the user’s query.
 Assign a score, say in [0, 1], to each document.
 Score(d, q) measures how well doc d matches query q.
Scoring Example: Jaccard coefficient
 A commonly-used measure of the overlap of two sets A and B:

   jaccard(A, B) = |A ∩ B| / |A ∪ B|

 In our context, the two sets are the terms of the document (after preprocessing) and the terms of the query.
 Note that:
 jaccard(A, A) = 1
 jaccard(A, B) = 0 if A ∩ B = ∅
 A and B don’t have to be of the same size.
 Jaccard always assigns a number between 0 and 1.
Example
For the query "the arab world" and the document "fifa world cup in arab country", what is the Jaccard similarity (after removing stop words)?

Solution: after removing the stop words (the, in):
query = {arab, world}, document = {fifa, world, cup, arab, country}
intersection = {arab, world} (2 terms), union = {fifa, world, cup, arab, country} (5 terms)
Jaccard = 2/5
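As a quick sketch (not from the slides; the stop-word list is illustrative), the same computation in Python:

# Minimal Jaccard scoring over token sets.
STOP_WORDS = {"the", "in", "is", "a", "of"}

def tokens(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

q = tokens("the arab world")
d = tokens("fifa world cup in arab country")
print(jaccard(q, d))   # 2/5 = 0.4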
Issues and Problems with Jaccard for Scoring
1) Term frequency
 It doesn’t consider term frequency (how many times a term occurs in a document).
2) Term importance
 It treats all terms equally!
● What about rare terms in a collection? They are more informative than frequent terms.
3) Length
 It needs a more sophisticated way of normalizing for document length.
TF-IDF ranking
Term-Document Count Matrix
 Each document is represented as a count (frequency) vector:

term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0

 Each value in the matrix represents a term frequency (tf).
 The bag-of-words model doesn’t consider the ordering of words in a document:
● “John is quicker than Mary” and “Mary is quicker than John” have the same vectors (information lost by the bag-of-words assumption).
1. Frequent Terms in a Document
Term Frequency
 tft,d: the number of times that term t occurs in doc d.
 We want to use tf when computing query-document match
scores. But how?

 Raw term frequency?


● A document with 10 occurrences of the term is more relevant than a
document with 1 occurrence of the term.
● But not 10 times more relevant.
 Relevance does not increase linearly with tf.
Log-Frequency Weighting
 [Plot: the weight of a term in a document as a function of raw tf (sublinear growth).]
 The log-frequency weight of term t in d (the term-weighting function) is:

   w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
   w_{t,d} = 0                     otherwise

 Examples: tf 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
 Score for a document-query pair: the sum of the weights over terms t that appear in both q and d:

   Score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10(tf_{t,d}))

 The score is 0 if none of the query terms is present in the document.
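For example (a sketch, not from the slides, with toy counts), the weighting and the overlap score can be written as:

from math import log10

# Minimal sketch of log-frequency weighting and the overlap score above.
def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_term_counts):
    """Sum the log-tf weights of the query terms that occur in the document."""
    return sum(log_tf_weight(doc_term_counts.get(t, 0)) for t in query_terms)

doc = {"brutus": 4, "caesar": 232, "mercy": 2}   # toy tf counts
print(score(["brutus", "caesar", "calpurnia"], doc))  # 1.602 + 3.366 + 0 ≈ 4.97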
2. Informative Terms in a Collection
 Rare terms are more informative than frequent terms
● Recall stop words
 We want a high weight for rare terms.
 Collection Frequency cf_t:
● the number of occurrences of term t in the collection
 Document Frequency df_t:
● the number of documents that contain t
● an inverse measure of the informativeness of t
● df_t ≤ N
Inverse Document Frequency (idf)
 The idf (inverse document frequency) of term t:

   idf_t = log10(N / df_t)

● N is the number of documents in the collection
● We use log10(N/df_t) instead of N/df_t to “dampen” the effect of idf.
● The higher the idf, the more informative the term.

 Suppose N = 1 million:

   term        df_t        idf_t
   calpurnia   1           6
   animal      100         4
   sunday      1,000       3
   fly         10,000      2
   under       100,000     1
   the         1,000,000   0

 “calpurnia” is the most informative term (its idf is the largest); if a term such as “the” is in all documents, its idf is 0.
Summary
 idf: a property of a term relative to the collection
 tf: a property of a term relative to a document

tf.idf Term Weighting
 The tf-idf weight of a term is the product of its tf weight and its idf weight:

   w_{t,d} = (1 + log10(tf_{t,d})) × log10(N / df_t)

 It is one of the best-known weighting schemes in IR:
● it increases with the number of occurrences within a document,
● it increases with the rarity of the term in the collection.
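As an illustration (not from the slides), a small Python sketch that combines the two weights on a toy corpus:

from math import log10
from collections import Counter

# Minimal tf-idf sketch: log-frequency tf combined with log10(N/df) idf.
docs = {
    "d1": "antony and cleopatra met caesar",
    "d2": "brutus killed caesar",
    "d3": "calpurnia warned caesar",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)

def tf_idf(term, doc_id):
    """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    count = tf[doc_id][term]
    if count == 0 or df[term] == 0:
        return 0.0
    return (1 + log10(count)) * log10(N / df[term])

print(tf_idf("caesar", "d1"))     # 0.0: "caesar" occurs in every doc (idf = 0)
print(tf_idf("calpurnia", "d3"))  # rare term, higher weight (≈ 0.48)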
Solve this Example (each student should do a detailed solution)
 Collection of 5 documents, D1–D5, where each document is a set of colored balls (balls = terms). [Figure of the 5 documents omitted.]
 Query: "the destructive storm"
 Which is the most relevant document? Which is the least relevant document?
 Start your solution with:
• df(yellow) = 5
• df(red) = 3
• df(green) = 3
 What next?
Vector Space Model
Binary → Count → Weight Matrix
 Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V| (the dictionary = the vocabulary V):

term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      5.25                   3.18            0             0        0         0.35
Brutus      1.21                   6.1             0             1        0         0
Caesar      8.59                   2.54            0             1.51     0.25      0
Calpurnia   0                      1.54            0             0        0         0
Cleopatra   2.85                   0               0             0        0         0
mercy       1.51                   0               1.9           0.12     5.25      0.88
worser      1.37                   0               0.11          4.15     0.25      1.95

Documents as Vectors
 As an illustration (see the figure in the original slides), with only 2 terms the space has 2 dimensions and each document is a point in the plane.
 In general, the dimension of the space = the number of distinct terms in the whole collection:
● a |V|-dimensional vector space, where |V| is the vocabulary size
● terms are the axes of the space
● documents are points or vectors in this space
 Note that:
 This is very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
 These are very sparse vectors: most entries are zero.
 This leads to the Vector Space Model.

Queries as Vectors
 Key idea 1: do the same for queries, i.e., represent them as vectors in the space.
   (We treat the query as if it were a small document and represent it as a vector in the same space.)
 Key idea 2: rank documents according to their proximity to the query in this space.
   (Using linear algebra, we can measure the distance between the query vector and a document vector to see how close they are.)
 proximity = similarity of vectors

Is Euclidean Distance Suitable Here?
 Euclidean distance is the distance between the end points of the two vectors.
 It is large for vectors of different lengths.
 Thought experiment: take a document d and append it to itself; call this document d′.
● “Semantically”, d and d′ have the same content.
● Yet their Euclidean distance can be quite large.

Angle Instead of Distance
 An angle of 0 between the two vectors corresponds to maximal similarity.
   (The smaller the angle between q and d, the more relevant the document, because the similarity is higher; the function that captures this inverse relationship is the cosine.)
 Key idea: rank documents according to their angle with the query.
● Rank documents in increasing order of the angle with the query,
● equivalently, in decreasing order of cosine(query, document).
 Cosine is a monotonically decreasing function on the interval [0°, 180°].

3. Length Normalization
 A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm:

   ||x||_2 = sqrt( Σ_i x_i² )

 Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
 Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
● Long and short documents now have comparable weights.
Cosine “Similarity” (Query, Document)

   cos(q, d) = (q · d) / (‖q‖ ‖d‖) = Σ_i q_i d_i / ( sqrt(Σ_i q_i²) · sqrt(Σ_i d_i²) )

 q_i is the [tf-idf] weight of term i in the query.
 d_i is the [tf-idf] weight of term i in the document.
 For length-normalized vectors, the cosine is simply the dot product: cos(q, d) = q · d = Σ_i q_i d_i.

Computing Cosine Scores
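The computation can be sketched in Python as follows (an illustrative sketch with toy vectors; real systems score sparse vectors via an inverted index rather than dense lists):

from math import sqrt

# Minimal cosine-scoring sketch over dense term-weight vectors.
def l2_normalize(vec):
    norm = sqrt(sum(w * w for w in vec))
    return [w / norm for w in vec] if norm else vec

def cosine(q, d):
    """Cosine similarity = dot product of the L2-normalized vectors."""
    qn, dn = l2_normalize(q), l2_normalize(d)
    return sum(qi * di for qi, di in zip(qn, dn))

# Toy tf-idf vectors over a 3-term vocabulary.
query = [1.0, 0.0, 2.0]
docs = {"d1": [0.5, 0.0, 1.0], "d2": [0.0, 3.0, 0.0]}

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)   # ['d1', 'd2'] — d1 points in the same direction as the query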

Variants of tf-idf Weighting

 Many search engines allow for different weightings for queries vs. documents.
 SMART notation: a weighting combination is written as ddd.qqq, using the acronyms from the SMART table (not reproduced here).
 A very standard weighting scheme is lnc.ltc.
BM25 ranking
Okapi BM25 Ranking Function
 [Plots: the Okapi tf component as a function of raw tf, for L_d/L = 0.5, 1.0, and 2.0, and the Okapi idf component vs. the classic idf as a function of raw df.]
 L_d: length of d
 L: average document length in the collection
 K_1, b: tuning parameters
 A common form of the BM25 scoring function using these quantities is:

   score(d, q) = Σ_{t ∈ q} idf_t · ( tf_{t,d} · (K_1 + 1) ) / ( tf_{t,d} + K_1 · (1 − b + b · L_d / L) )

 Typical parameter values: K_1 = 2, b = 0.75.
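A small Python sketch (not from the slides) of this common BM25 form, with illustrative toy data; the log10(N/df) idf is a simplification, since TREC-style BM25 often uses a smoothed idf variant:

from math import log10

# Minimal BM25 sketch using the common form above.
def bm25_score(query_terms, doc_tf, doc_len, avg_len, N, df, k1=2.0, b=0.75):
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or df.get(t, 0) == 0:
            continue
        idf = log10(N / df[t])
        norm = k1 * (1 - b + b * doc_len / avg_len)
        score += idf * (tf * (k1 + 1)) / (tf + norm)
    return score

# Toy usage: the same term counts in a short and a long document.
df = {"storm": 10}
N, avg_len = 1000, 100
short = bm25_score(["storm"], {"storm": 3}, doc_len=50, avg_len=avg_len, N=N, df=df)
long_ = bm25_score(["storm"], {"storm": 3}, doc_len=300, avg_len=avg_len, N=N, df=df)
print(short > long_)   # True: length normalization rewards the shorter document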
Summary – Vector Space Ranking
 Represent the query as a term-weighted (e.g., tf-idf) vector.
 Represent each document as a term-weighted (e.g., tf-idf) vector.
 Compute the cosine similarity score for the query vector and
each document vector.
 Rank documents with respect to the query by score
 Return the top K (e.g., K = 10) to the user.

