IR Lecture 6b

Lecture 9: Query Expansion

This lecture
• Improving results
  - For high recall. E.g., searching for aircraft doesn’t match with plane, nor thermodynamic with heat
• Options for improving results…
  - Focus on relevance feedback
  - The complete landscape
    - Global methods: query expansion (thesauri, automatic thesaurus generation)
    - Local methods: relevance feedback, pseudo relevance feedback
Query expansion
Relevance Feedback
• Relevance feedback: user feedback on relevance of docs in an initial set of results
  - User issues a (short, simple) query
  - The user marks returned documents as relevant or non-relevant.
  - The system computes a better representation of the information need based on the feedback.
  - Relevance feedback can go through one or more iterations.
• Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate
Relevance Feedback: Example
• Image search engine: http://nayana.ece.ucsb.edu/imsearch/imsearch.html
• (Screenshots: results for initial query, relevance feedback, results after relevance feedback)
Rocchio Algorithm
• The Rocchio algorithm incorporates relevance feedback information into the vector space model.
• Want to maximize sim(Q, Cr) − sim(Q, Cnr)
• The optimal query vector for separating relevant and non-relevant documents (with cosine similarity):

  $$\vec{Q}_{opt} = \frac{1}{|C_r|}\sum_{\vec{d}_j \in C_r}\vec{d}_j \;-\; \frac{1}{N-|C_r|}\sum_{\vec{d}_j \notin C_r}\vec{d}_j$$

  Qopt = optimal query; Cr = set of relevant doc vectors; N = collection size
• Unrealistic: we don’t know the relevant documents.
The Theoretically Best Query
[Figure: the optimal query vector separates relevant documents (o) from non-relevant documents (x) in the vector space]
Rocchio 1971 Algorithm (SMART)
• Used in practice:

  $$\vec{q}_m = \alpha\,\vec{q}_0 \;+\; \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j$$

  qm = modified query vector; q0 = original query vector; α, β, γ: weights (hand-chosen or set empirically); Dr = set of known relevant doc vectors; Dnr = set of known non-relevant doc vectors
• New query moves toward relevant documents and away from non-relevant documents
• Tradeoff α vs. β/γ: if we have a lot of judged documents, we want a higher β/γ.
• Term weights can go negative (see the sketch below)
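To make the update concrete, here is a minimal Python sketch of the Rocchio formula above, assuming queries and documents are already sparse term-weight dictionaries; the function name, weights, and example data are illustrative, not from a specific system.

# A minimal sketch of the Rocchio (1971) update on sparse term->weight dicts;
# alpha/beta/gamma values are illustrative defaults.
from collections import defaultdict

def rocchio_update(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    qm = defaultdict(float)
    for term, w in q0.items():                 # alpha * original query
        qm[term] += alpha * w
    if rel_docs:                               # + beta * centroid of known relevant docs
        for d in rel_docs:
            for term, w in d.items():
                qm[term] += beta * w / len(rel_docs)
    if nonrel_docs:                            # - gamma * centroid of known non-relevant docs
        for d in nonrel_docs:
            for term, w in d.items():
                qm[term] -= gamma * w / len(nonrel_docs)
    # Weights can go negative (as noted above); here they are simply dropped.
    return {t: w for t, w in qm.items() if w > 0}

# Example: query "aircraft" with one judged relevant and one judged non-relevant document.
q0 = {"aircraft": 1.0}
rel = [{"aircraft": 0.5, "plane": 0.8}]
nonrel = [{"aircraft": 0.2, "insurance": 0.9}]
print(rocchio_update(q0, rel, nonrel))   # "plane" enters the query with a positive weight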
Relevance feedback on initial query
[Figure: initial query and revised query after feedback; o = known relevant documents, x = known non-relevant documents; the revised query moves toward the relevant documents]
Relevance Feedback in vector spaces
• We can modify the query based on relevance feedback and apply the standard vector space model.
• Use only the docs that were marked.
• Relevance feedback can improve recall and precision
• Relevance feedback is most useful for increasing recall in situations where recall is important
  - Users can be expected to review results and to take time to iterate
Positive vs Negative Feedback
• Positive feedback is more valuable than negative feedback (so, set γ < β; e.g., γ = 0.25, β = 0.75).
• Many systems only allow positive feedback (γ = 0). Why?
Probabilistic relevance feedback
• Rather than reweighting in a vector space…
• If the user has told us some relevant and irrelevant documents, then we can proceed to build a classifier, such as a Naive Bayes model:
  - P(tk|R) = |Drk| / |Dr|
  - P(tk|NR) = (Nk − |Drk|) / (N − |Dr|)
  - tk = term in document; Drk = set of known relevant docs containing tk; Nk = total number of docs containing tk; Dr = set of known relevant docs; N = collection size
  - This is effectively another way of changing the query term weights (see the sketch below)
  - But note: the above proposal preserves no memory of the original weights
Relevance Feedback: Assumptions
• A1: User has sufficient knowledge for the initial query.
• A2: Relevance prototypes are “well-behaved”.
  - Term distribution in relevant documents will be similar
  - Term distribution in non-relevant documents will be different from those in relevant documents
  - Either: all relevant documents are tightly clustered around a single prototype.
  - Or: there are different prototypes, but they have significant vocabulary overlap.
  - Similarities between relevant and irrelevant documents are small
Violation of A1
• User does not have sufficient initial knowledge.
• Examples:
  - Misspellings (Brittany Speers).
  - Cross-language information retrieval
  - Mismatch of searcher’s vocabulary vs. collection vocabulary
    - Cosmonaut/astronaut
Violation of A2
• There are several relevance prototypes.
• Examples:
  - Burma/Myanmar
  - Contradictory government policies
• Often: instances of a general concept
• Good editorial content can address the problem
  - Report on contradictory government policies
Relevance Feedback: Problems
• Why do most search engines not use relevance feedback?
Relevance Feedback: Problems
• Long queries are inefficient for a typical IR engine.
  - Long response times for the user.
  - High cost for the retrieval system. Why?
  - Partial solution:
    - Only reweight certain prominent terms
    - Perhaps top 20 by term frequency
• Users are often reluctant to provide explicit feedback
• It’s often harder to understand why a particular document was retrieved after applying relevance feedback
Evaluation of relevance feedback strategies
• Use q0 and compute a precision-recall graph
• Use qm and compute a precision-recall graph
  - Assess on all documents in the collection
    - Spectacular improvements, but… it’s cheating!
    - Partly due to known relevant documents being ranked higher
    - Must evaluate with respect to documents not seen by the user
  - Use documents in the residual collection (set of documents minus those assessed relevant); see the sketch below
    - Measures are usually then lower than for the original query
    - But a more realistic evaluation
    - Relative performance can be validly compared
• Empirically, one round of relevance feedback is often very useful. Two rounds are sometimes marginally useful.
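A minimal sketch of the residual-collection idea, assuming we have the ranked list for qm and the set of documents already assessed in the feedback round (names and data are illustrative):

# Sketch: precision at k on the residual collection (assessed docs removed from the ranking).
def residual_precision_at_k(ranking, assessed, relevant, k=10):
    residual = [d for d in ranking if d not in assessed]   # residual collection
    top_k = residual[:k]
    return sum(1 for d in top_k if d in relevant) / max(len(top_k), 1)

# Toy example: d1 and d2 were assessed in the feedback round and are excluded.
ranking = ["d3", "d1", "d7", "d2", "d9"]
assessed = {"d1", "d2"}
relevant = {"d2", "d3", "d7"}
print(residual_precision_at_k(ranking, assessed, relevant, k=3))   # 2/3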
Relevance Feedback on the Web
• Some search engines offer a similar/related pages feature (this is a trivial form of relevance feedback)
  - Google (link-based)
  - (What would α/β/γ be here?)
  - Altavista
  - Stanford WebBase
• But some don’t, because it’s hard to explain to the average user:
  - Alltheweb
  - msn
  - Yahoo
• Excite initially had true relevance feedback, but abandoned it due to lack of use.
Excite Relevance Feedback
Spink et al. 2000
• Only about 4% of query sessions used the relevance feedback option
  - Expressed as a “More like this” link next to each result
• But about 70% of users only looked at the first page of results and didn’t pursue things further
  - So 4% is about 1/8 of people extending their search
• Relevance feedback improved results about 2/3 of the time
Other Uses of Relevance Feedback
• Following a changing information need
• Maintaining an information filter (e.g., for a news feed)
Relevance Feedback: Summary
• Relevance feedback has been shown to be very effective at improving relevance of results.
  - Requires enough judged documents, otherwise it’s unstable (≥ 5 recommended)
  - Requires queries for which the set of relevant documents is medium to large
• Full relevance feedback is painful for the user.
• Full relevance feedback is not very efficient in most IR systems.
• Other types of interactive retrieval may improve relevance by as much with less work.
The complete landscape
• Global methods
  - Query expansion/reformulation
    - Thesauri (or WordNet)
    - Automatic thesaurus generation
  - Global indirect relevance feedback
• Local methods
  - Relevance feedback
  - Pseudo relevance feedback
Query Reformulation: Vocabulary Tools
• Feedback
  - Information about stop lists, stemming, etc.
  - Numbers of hits on each term or phrase
• Suggestions
  - Thesaurus
  - Controlled vocabulary
  - Browse lists of terms in the inverted index
Query Expansion
• In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents
• In query expansion, users give additional input (good/bad search term) on words or phrases.
Query Expansion: Example
Also see: www.altavista.com, www.teoma.com
Types of Query Expansion
• Global Analysis (static; of all documents in collection)
  - Controlled vocabulary
    - Maintained by editors (e.g., medline)
  - Manual thesaurus
    - E.g. MedLine: physician, syn: doc, doctor, MD, medico
  - Automatically derived thesaurus
    - (co-occurrence statistics)
  - Refinements based on query log mining
    - Common on the web
• Local Analysis (dynamic)
  - Analysis of documents in result set
Controlled Vocabulary
Thesaurus-based Query Expansion
• This doesn’t require user input
• For each term t in a query, expand the query with synonyms and related words of t from the thesaurus (see the sketch below)
  - feline → feline cat
• May weight added terms less than original query terms.
• Generally increases recall.
• Widely used in many science/engineering fields
• May significantly decrease precision, particularly with ambiguous terms.
  - “interest rate” → “interest rate fascinate evaluate”
• There is a high cost of manually producing a thesaurus
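A minimal sketch of this expansion step, assuming a small hand-built thesaurus dictionary; the entries and the down-weighting factor are made up for illustration.

# Sketch: expand each query term with thesaurus synonyms at a reduced weight.
THESAURUS = {                      # illustrative entries, not a real thesaurus
    "feline": ["cat"],
    "interest": ["fascinate"],
    "rate": ["evaluate"],
}

def expand_query(query_terms, thesaurus=THESAURUS, added_weight=0.5):
    weights = {t: 1.0 for t in query_terms}          # original terms keep full weight
    for t in query_terms:
        for syn in thesaurus.get(t, []):
            weights.setdefault(syn, added_weight)    # added terms weighted less
    return weights

print(expand_query(["feline"]))            # {'feline': 1.0, 'cat': 0.5}
print(expand_query(["interest", "rate"]))  # shows the precision risk with ambiguous terms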
Automatic Thesaurus Generation
• Attempt to generate a thesaurus automatically by analyzing the collection of documents
• Two main approaches
  - Co-occurrence based (co-occurring words are more likely to be similar)
  - Shallow analysis of grammatical relations
    - Entities that are grown, cooked, eaten, and digested are more likely to be food items.
• Co-occurrence based is more robust; grammatical relations are more accurate.
Co-occurrence Thesaurus
• Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the m × n term-document matrix (m terms ti, n documents dj).
• wi,j = (normalized) weighted count of term ti in document dj
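A small NumPy sketch of the C = AAᵀ computation; the toy matrix and term weights below are made up for illustration.

import numpy as np

# Toy term-document matrix A (m = 3 terms, n = 3 documents); weights are illustrative.
terms = ["aircraft", "plane", "interest"]
A = np.array([
    [0.8, 0.0, 0.6],   # aircraft
    [0.7, 0.1, 0.5],   # plane
    [0.0, 0.9, 0.0],   # interest
])

C = A @ A.T            # term-term similarity matrix

# Most similar term for each term (ignoring the diagonal).
for i, t in enumerate(terms):
    row = C[i].copy()
    row[i] = -np.inf
    print(t, "->", terms[int(np.argmax(row))])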
Automatic Thesaurus Generation: Example
Automatic Thesaurus Generation: Discussion
• Quality of associations is usually a problem.
• Term ambiguity may introduce irrelevant, statistically correlated terms.
  - “Apple computer” → “Apple red fruit computer”
• Problems:
  - False positives: words deemed similar that are not
  - False negatives: words deemed dissimilar that are similar
• Since terms are highly correlated anyway, expansion may not retrieve many additional documents.
Query Expansion: Summary
• Query expansion is often effective in increasing recall.
  - Not always with general thesauri
  - Fairly successful for subject-specific collections
• In most cases, precision is decreased, often significantly.
• Overall, not as useful as relevance feedback; may be as good as pseudo-relevance feedback
Pseudo Relevance Feedback
• Mostly works (perhaps better than global analysis!)
  - Found to improve performance in the TREC ad-hoc task
  - Danger of query drift
• Pseudo relevance feedback is relevance feedback without user intervention
• Automatic local analysis
• Pseudo relevance feedback attempts to automate the manual part of relevance feedback (see the sketch below):
  - Retrieve an initial set of documents.
  - Assume that the top m ranked documents are relevant.
  - Do relevance feedback
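A minimal sketch of the loop, assuming a search(query_vec, k) function and sparse term-weight vectors; the function, parameter values, and the choice to add only the strongest terms are hypothetical illustrations, and only positive feedback is used.

# Sketch of pseudo relevance feedback: assume the top m results are relevant,
# add their strongest terms to the query, and re-run the search.
from collections import defaultdict

def pseudo_relevance_feedback(query_vec, search, m=10, beta=0.5, n_terms=20):
    initial = search(query_vec, k=m)          # initial retrieval; top m assumed relevant
    centroid = defaultdict(float)
    for doc_vec in initial:                   # centroid of the assumed-relevant docs
        for term, w in doc_vec.items():
            centroid[term] += w / m
    top_terms = sorted(centroid, key=centroid.get, reverse=True)[:n_terms]
    expanded = dict(query_vec)
    for term in top_terms:                    # add expansion terms at reduced weight
        expanded[term] = expanded.get(term, 0.0) + beta * centroid[term]
    return search(expanded, k=100)            # final retrieval with the expanded query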
There are two well-known models for applying relevance feedback:
• (i) Ide’s method [69]: This method adds all the terms of the relevant documents in the feedback set and removes the terms of the first irrelevant document in the set. The modified query vector is constructed as:

  $$\vec{q}_{new} = \vec{q}_{old} + \sum_{d_i \in R}\vec{d}_i - \vec{s}$$

  where qold is the original query vector, qnew is the new query, di is the vector of a relevant document in the set R, and s is the vector of the first irrelevant document in the feedback set.
• (ii) Rocchio’s method [127]: Rocchio’s method consists of moving the initial query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant documents. It attempts to estimate “the optimal” user query through relevance feedback. This can be described by the following equation:

  $$\vec{q}_{new} = \alpha\,\vec{q}_{old} + \beta\sum_{d_i \in R}\vec{d}_i - \gamma\sum_{d_i \in I}\vec{d}_i$$

  where di is a relevant or irrelevant document obtained by manual or automatic feedback during initial retrieval, R is the set of relevant documents, I is the set of irrelevant documents, and α, β and γ are coefficients, set by trial and error; e.g. α = 1, β = 0.75 and γ = 0.15.
• In pseudo-relevance feedback, the modified query vector is calculated by dropping the negative terms appearing in Ide’s and Rocchio’s equations (see the sketch below).
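For comparison with the Rocchio sketch earlier, here is a minimal sketch of Ide’s update and its positive-only variant used for pseudo-relevance feedback, again on sparse term-weight dictionaries; all names are illustrative.

# Sketch of Ide's update: add all terms of the relevant feedback documents and
# subtract the first non-relevant document; the pseudo-feedback variant simply
# omits the subtraction (no negative terms).
from collections import defaultdict

def ide_update(q_old, rel_docs, first_nonrel=None):
    q_new = defaultdict(float, q_old)
    for d in rel_docs:                       # add every relevant document vector
        for term, w in d.items():
            q_new[term] += w
    if first_nonrel is not None:             # subtract the first non-relevant document
        for term, w in first_nonrel.items():
            q_new[term] -= w
    return dict(q_new)

def ide_pseudo_update(q_old, top_docs):
    # pseudo-relevance feedback: treat the top-ranked documents as relevant, drop negatives
    return ide_update(q_old, top_docs, first_nonrel=None)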
Query drift
• Query expansion following retrieval feedback may degrade performance if the top-ranked documents retrieved during the initial run are not relevant.
• Expanding the query by adding terms from irrelevant documents, or adding terms from relevant documents that are not closely related to the query terms, will move the query representation away from what may be the “optimal” query representation.
• This results in an alteration of the focus of the query, i.e. ‘query drift’.
Pseudo relevance feedback: Cornell SMART at TREC 4
• Results show the number of relevant documents out of the top 100 for 50 queries (so out of 5000)
• Results contrast two length normalization schemes (L vs. l) and pseudo relevance feedback (PsRF), done by adding 20 terms:
  - lnc.ltc: 3210
  - lnc.ltc-PsRF: 3634
  - Lnu.ltu: 3709
  - Lnu.ltu-PsRF: 4350
Indirect relevance feedback
• On the web, DirectHit introduced a form of indirect relevance feedback.
• DirectHit ranked documents higher that users looked at more often.
  - Clicked-on links are assumed likely to be relevant
  - Assuming the displayed summaries are good, etc.
• Globally: not user- or query-specific.
• This is the general area of clickstream mining
