Introduction to
Information Retrieval
CS276
Information Retrieval and Web Search
Christopher Manning and Prabhakar Raghavan
Lecture 8: Evaluation
This lecture
How do we know if our results are any good?
Evaluating a search engine
Benchmarks
Precision and recall
Results summaries:
Making our good results usable to a user
Evaluating an IR system
Note: the information need is translated into a query
Relevance is assessed relative to the information need, not the query
E.g., Information need: I'm looking for information on
whether drinking red wine is more effective at
reducing your risk of heart attacks than white wine.
Query: wine red white heart attack effective
You evaluate whether the doc addresses the
information need, not whether it has these words
Contingency table of retrieved vs. relevant documents:

                 Relevant   Nonrelevant
Retrieved        tp         fp
Not Retrieved    fn         tn
[Screenshot: a search interface that answers a query with "0 matching results found."]
Precision/Recall
You can get high recall (but low precision) by
retrieving all docs for all queries!
Recall is a non-decreasing function of the number
of docs retrieved
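
As a concrete illustration (my own sketch, not lecture code; the toy document IDs are invented), precision and recall can be computed directly from the retrieved set and the judged-relevant set, and retrieving everything indeed pushes recall to 1.0 while precision collapses:

```python
# Hypothetical sketch: precision and recall computed from a retrieved set
# and a set of judged-relevant documents.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant docs that were retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Retrieving every document drives recall to 1.0 while precision collapses.
all_docs = [f"d{i}" for i in range(1000)]
relevant = ["d1", "d2", "d3"]
print(precision_recall(all_docs, relevant))   # (0.003, 1.0)
```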
A combined measure: F
Combined measure that assesses precision/recall
tradeoff is F measure (weighted harmonic mean):
$$F = \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^{2}+1)PR}{\beta^{2}P + R}$$
People usually use balanced F1 measure
i.e., with β = 1 (equivalently, α = ½)
Harmonic mean is a conservative average
See CJ van Rijsbergen, Information Retrieval
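
A minimal sketch of the weighted F measure defined above (my own code, using the β-form of the formula); β = 1 gives the balanced F1:

```python
# Hypothetical sketch of the weighted F measure above.
# beta = 1 (equivalently alpha = 1/2) gives the balanced F1.

def f_measure(precision, recall, beta=1.0):
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.6, 0.9))            # balanced F1 = 0.72
print(f_measure(0.6, 0.9, beta=2.0))  # F2, weighting recall higher, ≈ 0.82
```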
[Figure: value of each combined measure (minimum, maximum, arithmetic mean, geometric mean, harmonic mean) as a function of precision, with recall fixed at 70%]
A precision-recall curve
[Figure: a precision-recall curve; precision (0.0–1.0) on the y-axis, recall (0.0–1.0) on the x-axis]
Interpolated precision
Idea: If locally precision increases with increasing
recall, then you should get to count that…
So take the maximum of the precisions at all recall levels to the right of (at or above) that value
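
One possible realization (a sketch following the definition above, not code from the lecture): interpolated precision at recall level r is the maximum precision observed at any recall level ≥ r:

```python
# Hypothetical sketch: interpolated precision at recall level r is the
# maximum precision observed at any recall level >= r.

def interpolate(pr_points):
    """pr_points: list of (recall, precision) pairs, in any order."""
    pts = sorted(pr_points)                  # sort by recall
    best, interpolated = 0.0, []
    for recall, precision in reversed(pts):  # sweep from high recall to low
        best = max(best, precision)
        interpolated.append((recall, best))
    return list(reversed(interpolated))

curve = [(0.1, 1.0), (0.2, 0.5), (0.3, 0.67), (0.4, 0.5)]
print(interpolate(curve))
# [(0.1, 1.0), (0.2, 0.67), (0.3, 0.67), (0.4, 0.5)]
```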
Evaluation
Graphs are good, but people want summary measures!
Precision at fixed retrieval level
Precision-at-k: Precision of top k results
Perhaps appropriate for most of web search: all people want are
good matches on the first one or two results pages
But: averages badly and has an arbitrary parameter of k
11-point interpolated average precision
The standard measure in the early TREC competitions: you take the interpolated precision at 11 recall levels, 0.0, 0.1, …, 1.0 (the value at recall 0 is always interpolated!), and average them
Evaluates performance at all recall levels
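
A small sketch of both summary measures (my own code with invented function names and toy data, not TREC's official trec_eval):

```python
# Hypothetical sketch of precision-at-k and 11-point interpolated average precision.
# ranking: doc IDs in ranked order; relevant: set of relevant doc IDs.

def precision_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / k

def eleven_point_interpolated_ap(ranking, relevant):
    # (recall, precision) after each rank position
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    # interpolated precision at recall levels 0.0, 0.1, ..., 1.0
    levels = [j / 10 for j in range(11)]
    interp = [max((p for r, p in points if r >= level), default=0.0)
              for level in levels]
    return sum(interp) / len(interp)

ranking = ["d3", "d9", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(ranking, relevant, 3))             # ≈ 0.33
print(eleven_point_interpolated_ap(ranking, relevant))  # 0.4
```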
[Figure: a typical 11-point interpolated average precision curve; precision (0–0.8) vs. recall (0–1)]
Variance
For a test collection, it is usual that a system does
crummily on some information needs (e.g., MAP =
0.1) and excellently on others (e.g., MAP = 0.7)
Indeed, it is usually the case that the variance in
performance of the same system across queries is
much greater than the variance of different systems
on the same query.
Test Collections
Unit of Evaluation
We can compute precision, recall, F, and ROC curve
for different units.
Possible units
Documents (most common)
Facts (used in some TREC evaluations)
Entities (e.g., car companies)
May produce different results. Why?
Kappa Measure: Example

Number of docs   Judge 1        Judge 2
300              Relevant       Relevant
70               Nonrelevant    Nonrelevant
20               Relevant       Nonrelevant
10               Nonrelevant    Relevant

P(A)? P(E)?
Kappa Example
P(A) = 370/400 = 0.925
P(nonrelevant) = (10 + 20 + 70 + 70)/800 = 0.2125
P(relevant) = (10 + 20 + 300 + 300)/800 = 0.7875
P(E) = 0.2125² + 0.7875² = 0.665
Kappa = (0.925 − 0.665)/(1 − 0.665) = 0.776
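
The arithmetic above can be reproduced mechanically; the sketch below (my own, with the table's counts hard-coded) computes P(A), P(E), and kappa:

```python
# Hypothetical sketch reproducing the kappa arithmetic above.
# counts: (judge 1 label, judge 2 label) -> number of documents
counts = {
    ("relevant", "relevant"): 300,
    ("nonrelevant", "nonrelevant"): 70,
    ("relevant", "nonrelevant"): 20,
    ("nonrelevant", "relevant"): 10,
}
total = sum(counts.values())                                      # 400

# P(A): observed proportion of documents the judges agree on
p_agree = sum(n for (j1, j2), n in counts.items() if j1 == j2) / total

# Pooled marginals over both judges, then P(E): agreement expected by chance
rel_judgments = sum(n for (j1, j2), n in counts.items()
                    for j in (j1, j2) if j == "relevant")
p_rel = rel_judgments / (2 * total)                               # 0.7875
p_chance = p_rel ** 2 + (1 - p_rel) ** 2                          # ≈ 0.665

kappa = (p_agree - p_chance) / (1 - p_chance)
print(round(p_agree, 3), round(p_chance, 3), round(kappa, 3))     # 0.925 0.665 0.776
```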
TREC
TREC Ad Hoc task from first 8 TRECs is standard IR task
50 detailed information needs a year
Human evaluation of pooled results returned
More recently, other related tracks: Web track, HARD track
A TREC query (TREC 5)
<top>
<num> Number: 225
<desc> Description:
What is the main function of the Federal Emergency Management
Agency (FEMA) and the funding level provided to meet emergencies?
Also, what resources are available to FEMA such as people,
equipment, facilities?
</top>
A/B testing
Purpose: Test a single innovation
Prerequisite: You have a large search engine up and running.
Have most users use the old system
Divert a small proportion of traffic (e.g., 1%) to the new
system that includes the innovation
Evaluate with an “automatic” measure like clickthrough on
first result
Now we can directly see if the innovation does improve user
happiness.
Probably the evaluation methodology that large search
engines trust most
In principle less powerful than doing a multivariate regression
analysis, but easier to understand
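
As an illustration only (invented traffic numbers and a simple two-proportion z-test, which the lecture does not prescribe), comparing clickthrough-on-first-result rates between the two buckets might look like this:

```python
# Hypothetical sketch: compare clickthrough-on-first-result rates between
# the old system and the small bucket running the innovation,
# using a simple two-proportion z-test.
import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se        # z-score for the difference in rates

z = two_proportion_z(clicks_a=41_000, n_a=100_000,   # old system
                     clicks_b=430,    n_b=1_000)     # 1% bucket with innovation
print(f"z = {z:.2f}")  # |z| > 1.96 would suggest a real difference at ~95% confidence
```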
RESULTS PRESENTATION
Result Summaries
Having ranked the documents matching a query, we
wish to present a results list
Most commonly, a list of the document titles plus a
short summary, aka “10 blue links”
Summaries
The title is often automatically extracted from document
metadata. What about the summaries?
This description is crucial.
User can identify good/relevant hits based on description.
Two basic kinds:
Static
Dynamic
A static summary of a document is always the same,
regardless of the query that hit the doc
A dynamic summary is a query-dependent attempt to explain
why the document was retrieved for the query at hand
Static summaries
In typical systems, the static summary is a subset of
the document
Simplest heuristic: the first 50 (or so – this can be
varied) words of the document
Summary cached at indexing time
More sophisticated: extract from each document a set
of “key” sentences
Simple NLP heuristics to score each sentence
Summary is made up of top-scoring sentences.
Most sophisticated: NLP used to synthesize a summary
Seldom used in IR; cf. text summarization work
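
A minimal sketch of the simplest heuristic above (first ~50 words, cached at indexing time); the function name and sample text are mine:

```python
# Hypothetical sketch of the simplest static-summary heuristic:
# cache the first ~50 words of the document at indexing time.

def static_summary(text, max_words=50):
    words = text.split()
    summary = " ".join(words[:max_words])
    return summary + (" ..." if len(words) > max_words else "")

doc_text = "The Federal Emergency Management Agency coordinates the federal " \
           "response to disasters and provides funding to meet emergencies."
print(static_summary(doc_text, max_words=10))
```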
Dynamic summaries
Present one or more “windows” within the document that
contain several of the query terms
“KWIC” snippets: Keyword in Context presentation
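
A toy sketch of a KWIC-style dynamic summary (my own; the window size and formatting are arbitrary choices, not a prescribed algorithm):

```python
# Hypothetical sketch of a KWIC-style dynamic summary: short windows of the
# document around positions where query terms occur.

def kwic_windows(text, query_terms, width=5, max_windows=2):
    words = text.split()
    terms = {t.lower() for t in query_terms}
    windows = []
    for i, w in enumerate(words):
        if w.lower().strip(".,") in terms:
            lo, hi = max(0, i - width), i + width + 1
            windows.append("... " + " ".join(words[lo:hi]) + " ...")
            if len(windows) == max_windows:
                break
    return windows

doc = ("Studies suggest moderate consumption of red wine may reduce the risk "
       "of heart attack more than white wine, though the evidence is debated.")
print(kwic_windows(doc, ["red", "heart"]))
```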
Quicklinks
For a navigational query such as united airlines, the user's need is likely satisfied on www.united.com
Quicklinks provide navigational cues on that home
page