Query Languages and Query Operation: Chapter Seven
1 04/29/21
Common types of query
A. Keyword-based querying
Queries are combinations of words.
The document collection is searched for documents that
contain these words.
Word queries are intuitive, easy to express and provide fast
ranking.
Here, the concept of a word must be defined.
A word is a sequence of letters terminated by a separator (period,
comma, space, etc).
Definition of letter and separator is flexible; e.g., hyphen could be
defined as a letter or as a separator.
Usually, common words (such as “a”, “the”, “of”, …) are ignored.
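The word definition above can be sketched as a small tokenizer. This is a minimal illustration, not a reference implementation: the function name `tokenize`, the `hyphen_is_letter` flag, and the three-word stop list are assumptions made for the example.

```python
import re

# Assumed stop list; the slides only name "a", "the", "of" as examples.
STOP_WORDS = {"a", "the", "of"}

def tokenize(text, hyphen_is_letter=False):
    """Split text into lowercase words and drop common words."""
    # A word is a maximal run of letters; every other character is a separator.
    # The flag decides whether the hyphen counts as a letter or as a separator.
    pattern = r"[A-Za-z-]+" if hyphen_is_letter else r"[A-Za-z]+"
    words = [w.lower() for w in re.findall(pattern, text)]
    return [w for w in words if w not in STOP_WORDS]

text = "The state-of-the-art, of a system."
print(tokenize(text))                         # hyphen treated as a separator
print(tokenize(text, hyphen_is_letter=True))  # hyphen treated as a letter
```

With the hyphen as a separator, "state-of-the-art" breaks into several words and its common words are dropped; with the hyphen as a letter it survives as a single word.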
B. Single-word queries
A query is a single word: Usually used for searching in
document images
Simplest form of query.
All documents that include this word are retrieved.
Documents may be ranked by the frequency of this
word in the document.
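A single-word query with frequency ranking might look like the sketch below. The toy collection and the function name `single_word_search` are hypothetical, and splitting on whitespace stands in for a real word definition.

```python
def single_word_search(word, docs):
    """Return (doc_id, count) pairs for documents containing `word`,
    most frequent first."""
    word = word.lower()
    hits = []
    for doc_id, text in docs.items():
        count = text.lower().split().count(word)
        if count > 0:
            hits.append((doc_id, count))
    # Higher frequency ranks first; doc_id keeps ties in a stable order.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical three-document collection.
docs = {
    "d1": "red bird on a red roof",
    "d2": "blue bird",
    "d3": "red red red",
}
print(single_word_search("red", docs))
```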
C. Phrase queries
In this case a query is a sequence of words treated as a
single unit.
Also called “literal string” or “exact phrase” query.
Phrase is usually surrounded by quotation marks.
All documents that include this phrase are retrieved.
Usually, separators (commas, colons, etc.) and common
words (e.g., “a”, “the”, “of”, “for”…) in the phrase are
ignored.
In effect, this query is for a set of words that must appear
in sequence.
Allows users to specify a context and thus gain more precision.
Example: “Information Processing for Document
Retrieval”.
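Phrase matching as described above can be sketched as follows: separators and common words are dropped from both the phrase and the document, and the remaining words must appear in sequence. The stop list and the helper names `content_words` and `contains_phrase` are assumptions for illustration.

```python
import re

# Assumed stop list, taken from the examples on the slide.
STOP_WORDS = {"a", "the", "of", "for"}

def content_words(text):
    """Lowercase words with separators and common words removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def contains_phrase(phrase, document):
    """True if the phrase's content words occur consecutively in the document."""
    needle = content_words(phrase)
    haystack = content_words(document)
    n = len(needle)
    if n == 0:
        return False
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))

phrase = "Information Processing for Document Retrieval"
print(contains_phrase(phrase, "Notes on information processing, for the document retrieval course."))
print(contains_phrase(phrase, "Document retrieval and information processing"))
```

The second document contains every query word but not in sequence, so it does not match; this is the extra precision the phrase query provides.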
D. Multiple-word queries
In this case a query is a set of words (or phrases).
Two options: A document is retrieved if it includes:
Any of the query words, or
Each of the query words.
Documents are ranked by the number of query words
they contain:
A document containing n query words is ranked higher than a
document containing m < n query words.
Documents are ranked in decreasing order of matched query words:
Those containing all the query words are ranked at the top;
those containing only one query word are ranked at the bottom.
Frequency counts may be used to break ties among documents
that contain the same query words.
Example: what is the result for the query “Red Bird”?
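One way to work through the “Red Bird” example is the sketch below. The four documents are invented for illustration (the slides do not give a collection), and `rank_multi_word` with its `require_all` flag is a hypothetical helper covering both retrieval options.

```python
def rank_multi_word(query, docs, require_all=False):
    """Rank documents by number of distinct query words matched,
    breaking ties by total frequency of the matched words."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        matched = [t for t in terms if t in words]
        # "Any" option: at least one match; "each" option: every term matches.
        if not matched or (require_all and len(matched) < len(terms)):
            continue
        total_freq = sum(words.count(t) for t in matched)
        scored.append((doc_id, len(matched), total_freq))
    scored.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return [doc_id for doc_id, _, _ in scored]

# Hypothetical collection.
docs = {
    "d1": "the red bird sang",
    "d2": "a red car and a red door",
    "d3": "red bird red bird",
    "d4": "a blue fish",
}
print(rank_multi_word("Red Bird", docs))                    # any of the words
print(rank_multi_word("Red Bird", docs, require_all=True))  # each of the words
```

Here d3 and d1 both contain the two query words, and the frequency count places d3 first; d2 contains only “red” and ranks last; d4 is not retrieved.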
E. Boolean queries
Based on concepts from logic.
A Boolean query describes the information need by relating multiple
words with Boolean operators.
Operators: AND, OR, NOT
Semantics: For each query word w a corresponding set Dw
is constructed that includes the documents that contain w.
The Boolean expression is then interpreted as an
expression on the corresponding document sets with
corresponding set operators:
AND: Finds only documents containing all of the
specified words or phrases.
OR: Finds documents containing at least one of the
specified words or phrases.
NOT: Excludes documents containing the specified word
or phrase.
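The set semantics above map directly onto set operations. The sketch below uses a hypothetical four-document collection; `doc_set` plays the role of the set Dw for a word w.

```python
# Hypothetical collection: document id -> text.
docs = {
    1: "computer server rack",
    2: "computer laptop",
    3: "server farm",
    4: "garden hose",
}

def doc_set(word):
    """Dw: ids of the documents that contain the word w."""
    return {doc_id for doc_id, text in docs.items() if word in text.split()}

all_docs = set(docs)

and_result = doc_set("computer") & doc_set("server")  # AND -> intersection
or_result = doc_set("computer") | doc_set("server")   # OR  -> union
not_result = all_docs - doc_set("computer")           # NOT -> complement

print(and_result, or_result, not_result)
```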
Boolean Queries
Precedence (order of operations): NOT, then AND, then OR.
Use parentheses to override precedence.
Process left to right among operators with the same precedence.
Truth Table
P  Q  NOT P  P AND Q  P OR Q
0  0  TRUE   FALSE    FALSE
0  1  TRUE   FALSE    TRUE
1  0  FALSE  FALSE    TRUE
1  1  FALSE  TRUE     TRUE
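The truth table and the precedence rule can be checked with a short script. Python's own `not`, `and`, `or` follow the same NOT, AND, OR precedence, so they serve as a stand-in here; the helper name `truth_table` and the sample values of p, q, r are assumptions for the example.

```python
from itertools import product

def truth_table():
    """Rows of (P, Q, NOT P, P AND Q, P OR Q) for every combination of P and Q."""
    return [(p, q, not p, p and q, p or q)
            for p, q in product([False, True], repeat=2)]

for row in truth_table():
    print(*[int(v) for v in row])

# Precedence: NOT binds tightest, then AND, then OR.
p, q, r = True, False, True
default = p or q and not r      # parsed as p or (q and (not r))
grouped = (p or q) and not r    # parentheses override the default order
print(default, grouped)
```

The two expressions use the same operands and operators but give different results, which is why parentheses matter in Boolean queries.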
Examples: Boolean queries
1. computer OR server
Finds documents containing either computer, server, or both.
Penalizing documents
When interpreting queries, some models demote (rank lower)
documents that include keywords that were not requested.
Example: Assume the vector model with the
cosine measure and the simple case that both
documents and queries use binary values.
Consider these two documents and a query:
d1 = (0,1,0,1,0), d2 = (0,1,1,1,0), q = (0,1,0,1,0)
sim(q, d1) = 2/(√2·√2) = 1.0, sim(q, d2) = 2/(√2·√3) ≈ 0.82
d2 is demoted because it includes an extra keyword
not requested by q.
In contrast, the Boolean model does not
“penalize” documents with extra (non-requested)
keywords
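The two similarity values in the example can be reproduced with a few lines. The function name `cosine` is chosen for this sketch; the vectors are the ones given on the slide.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

q = (0, 1, 0, 1, 0)
d1 = (0, 1, 0, 1, 0)
d2 = (0, 1, 1, 1, 0)

print(round(cosine(q, d1), 2))  # d1 matches the query exactly
print(round(cosine(q, d2), 2))  # d2 has one extra keyword, so its norm is larger
```

The dot product is 2 for both documents; only the length of d2 grows with the extra keyword, which is what lowers its score.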
Query Operations
Relevance Feedback & Query Expansion
Problems with Keywords
May not retrieve relevant documents that include
synonymous terms.
◦ “restaurant” vs. “café”
◦ “PRC” vs. “China”
May retrieve irrelevant documents that include
ambiguous terms.
◦ “bat” (baseball vs. mammal)
◦ “Apple” (company vs. fruit)
◦ “bit” (unit of data vs. act of eating)
Query operations
Users often lack detailed knowledge of the collection and the
retrieval environment,
which makes it difficult to formulate queries that are well designed for retrieval.
Effective retrieval often needs several formulations of the query.
First formulation: often a naïve attempt to retrieve
relevant information.
Documents initially retrieved:
Can be examined for relevance information, either by the user or
automatically by the system.
This information is used to improve the query formulation and
retrieve additional relevant documents.
Query reformulation
Two basic techniques revise the query to account for
feedback:
Query expansion: Add new terms from relevant documents
to the original query.
Term reweighting in the expanded query: Modify term weights
based on user relevance judgements.
Increase the weight of terms in relevant documents.
Decrease the weight of terms in irrelevant documents.
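Both techniques can be combined in a single update. The sketch below uses a Rocchio-style formula, which is one common choice; the slides do not name a specific formula, so the function name `reformulate`, the coefficients alpha, beta, gamma, and the toy documents are all assumptions.

```python
from collections import Counter

def reformulate(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update (assumed formula, not taken from the slides).

    query: dict mapping term -> weight.
    relevant / irrelevant: lists of documents, each a list of terms.
    """
    new_query = {term: alpha * w for term, w in query.items()}
    # Terms from relevant documents gain weight (this also adds new terms);
    # terms from irrelevant documents lose weight.
    for docs, coeff in ((relevant, beta), (irrelevant, -gamma)):
        for doc in docs:
            for term, tf in Counter(doc).items():
                new_query[term] = new_query.get(term, 0.0) + coeff * tf / len(docs)
    # Terms whose weight is not positive are dropped from the query.
    return {term: w for term, w in new_query.items() if w > 0}

new_q = reformulate(
    {"jaguar": 1.0},
    relevant=[["jaguar", "car", "speed"], ["jaguar", "car"]],
    irrelevant=[["jaguar", "cat", "jungle"]],
)
print(new_q)
```

In this toy run the query gains "car" and "speed" from the relevant documents, while "cat" and "jungle" end up with negative weight and are left out.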
Approaches for Relevance Feedback
Approaches based on user relevance feedback
Relevance feedback with user input
Description of cluster built interactively with user assistance
Approaches based on pseudo relevance feedback
Use relevance feedback methods without explicit user
involvement.
Obtain cluster description automatically
Identify terms related to query terms
e.g. synonyms, stemming variations, terms close to query terms in text
User Relevance Feedback
Most popular query reformulation strategy
Cycle:
User presented with list of retrieved documents
After initial retrieval results are presented, allow the user to provide
feedback on the relevance of one or more of the retrieved documents.
User marks those which are relevant
In practice: top 10-20 ranked documents are examined
Use this feedback information to reformulate the query.
Select important terms from documents assessed relevant by users
Enhance importance of these terms in a new query
Produce new results based on reformulated query.
Allows more interactive, multi-pass process.
Expected:
New query moves towards relevant documents and away from
non-relevant documents
User Relevance Feedback Architecture
[Figure: the query string is sent to the IR system, which searches the document corpus and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). User feedback on these rankings drives query reformulation; the revised query is run again and yields re-ranked documents (1. Doc2, 2. Doc4, 3. Doc5, ...).]
Pseudo Relevance Feedback
Just assume the top m retrieved documents are relevant,
and use them to reformulate the query.
Allows for query expansion that includes terms that are
correlated with the query terms.
Two strategies:
Local strategies: Approaches based on information
derived from set of initially retrieved documents (local set
of documents)
Global strategies: Approaches based on global
information derived from document collection.
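A local strategy can be sketched as follows: rank the collection, assume the top m documents are relevant, and add their most frequent non-query terms to the query. The scoring function, the parameters `m` and `k`, and the toy collection are assumptions made for this illustration.

```python
from collections import Counter

def score(query_terms, doc_tokens):
    """Simple relevance score: total occurrences of the query terms."""
    return sum(doc_tokens.count(t) for t in query_terms)

def pseudo_feedback(query_terms, docs, m=2, k=1):
    """Expand the query with the k most frequent new terms
    found in the top m retrieved documents."""
    ranked = sorted(docs, key=lambda d: (-score(query_terms, docs[d]), d))
    top = ranked[:m]  # assumed relevant, with no user involvement
    counts = Counter()
    for d in top:
        counts.update(t for t in docs[d] if t not in query_terms)
    # Sort by descending count, then alphabetically for a stable result.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return list(query_terms) + [term for term, _ in best]

# Hypothetical collection of tokenized documents.
docs = {
    "d1": ["apple", "iphone", "launch", "iphone"],
    "d2": ["apple", "iphone", "store"],
    "d3": ["banana", "fruit", "market"],
    "d4": ["apple", "pie", "recipe"],
}
print(pseudo_feedback(["apple"], docs))
```

Because the top documents are only assumed to be relevant, a poor initial ranking would pull unrelated terms into the query.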
Pseudo Feedback Architecture
[Figure: the query string is sent to the IR system, which searches the document corpus and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). Pseudo feedback taken from these rankings, without user input, drives query reformulation; the revised query is run again and yields re-ranked documents (1. Doc2, 2. Doc4, 3. Doc5, ...).]
Query Expansion Conclusions
Expansion of queries with related terms
can improve performance, particularly
recall.
However, must select similar terms very
carefully to avoid problems, such as loss
of precision.
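The recall gain and the precision risk can both be seen in a tiny example. The thesaurus, the three documents, and the helper names `expand` and `retrieve` are hypothetical and exist only to illustrate the trade-off.

```python
# Hypothetical thesaurus of related terms.
THESAURUS = {"restaurant": ["cafe"], "china": ["prc"]}

def expand(query_terms, thesaurus=THESAURUS):
    """Append related terms from the thesaurus to the query."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

def retrieve(terms, docs):
    """Ids of documents containing at least one of the terms."""
    return sorted(d for d, text in docs.items()
                  if any(t in text.split() for t in terms))

# Hypothetical collection.
docs = {
    "d1": "best restaurant in town",
    "d2": "quiet cafe near the park",
    "d3": "cafe racer motorcycle parts",
}
print(retrieve(["restaurant"], docs))          # original query
print(retrieve(expand(["restaurant"]), docs))  # expanded query
```

The expanded query now finds d2, which the original query missed, but it also pulls in d3, which uses "cafe" in an unrelated sense.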