Week 2 - Information Retrieval Basics
(CIS3-535)
2022 - 2023
Part 1: Information Retrieval
1. Information Retrieval - Overview
1.1 Introduction to Information Retrieval
1.2 Basic Information Retrieval
1.2.1 Text-Based Information Retrieval
1.2.2 Boolean Retrieval
1.2.3 Vector Space Retrieval
1.2.4 Evaluating Information Retrieval
1.2.5 Probabilistic Information Retrieval
1.2.6 Query Expansion
1.2.6.1 User Relevance Feedback
1.2.6.2 Global Query Expansion
1.3 Indexing for Information Retrieval
1.3.1 Inverted Index
1.3.2 Web-scale Indexing: Map-Reduce
1.3.3 Distributed Retrieval
1.3.3.1 Fagin's algorithm
1.3.3.2 Threshold algorithm
1.4 Embedding Techniques
1.4.1 Latent Semantic Indexing
1.4.2 Latent Dirichlet Allocation
1.4.3 Word Embeddings – skip-gram, CBOW
1.4.4 Fasttext
1.4.5 Glove
1.5 Link-Based Ranking
1.5.1 PageRank
1.5.2 Hyperlink-Induced Topic Search (HITS)
1.5.3 Link Indexing
1.1 INTRODUCTION TO INFORMATION RETRIEVAL
What is Information Retrieval?
Information retrieval (IR) is the task of finding, in a large collection of documents, those that satisfy the information needs of a user
Examples
– Searching documents in a library
– Searching the Web
Different Types of Information Retrieval
Documents can be
– unstructured data like texts, images, audio, and video
Queries can be both structured and unstructured
– Boolean expressions
– Free text, sample documents
Results can be sorted or unsorted
– Results sets
– Ranked lists
Basic Information Retrieval Approach
[Figure: the basic information retrieval pipeline. Unstructured information items undergo feature extraction, yielding a structured document representation; an information need is formulated as an (unstructured) query and mapped to a structured query representation. Similarity matching between the two representations produces a ranked or binary result. The retrieval system is responsible for efficiency, the retrieval model for relevance.]
Example: Text Retrieval
[Figure: the pipeline instantiated for text retrieval. Web documents (text content) undergo feature extraction into terms/words (e.g., "web retrieval"); a Web search need is formulated as keywords (e.g., "web information retrieval"). Matching is based on the occurrence of query terms in documents, using a retrieval model such as Boolean or vector space retrieval, and produces a ranked list of Web documents. Retrieval systems: Google, Bing, etc.]
Formally
The document collection is a subset of all possible documents: C ⊆ D

F_C : P(D) → Rep_C (feature extraction from the collection)
F_D : D × Rep_C → Rep_D (document representation)
F_Q : Q × Rep_C → Rep_Q (query representation)

rep_d = F_D(d, F_C(C))
rep_q = F_Q(q, F_C(C))

The representation depends on the individual document as well as on the features of the collection!
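The following minimal Python sketch illustrates these definitions; the function names F_C and F_D mirror the formulas above, and the binary bag-of-words vector is just one illustrative choice of representation.

# Minimal sketch of the formal model (illustrative choices throughout).

def F_C(collection):
    # Collection feature extraction: here, build the vocabulary.
    return sorted({term for doc in collection for term in doc.split()})

def F_D(doc, rep_c):
    # Document representation: binary term-occurrence vector over the
    # collection features rep_c (the vocabulary).
    terms = set(doc.split())
    return [1 if t in terms else 0 for t in rep_c]

C = ["web information retrieval", "web search"]
rep_d = F_D(C[0], F_C(C))   # rep_d = F_D(d, F_C(C))
print(rep_d)                # [1, 1, 0, 1] for the alphabetically sorted vocabulary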
Retrieval Model
The retrieval model determines
– the structure of the document representation
– the structure of the query representation
– the similarity matching function
Relevance
– determined by the similarity matching function
– should reflect the right topic, user needs, authority, recency
– there is no objective measure of relevance
What does the Similarity Function compute?
Two basic models
1. Boolean Retrieval: sim(q, d) ∈ {0, 1}
2. Ranked Retrieval: sim(q, d) ∈ [0, 1] or sim(q, d) ∈ ℝ
General wisdom
– Boolean retrieval is suitable for
- experts: can formulate accurate queries
- machines: can consume large results
– Ranked retrieval is good for ordinary users
Information Filtering
[Figure: the retrieval pipeline adapted to filtering. Information items (content) undergo feature extraction into a structured document representation; long-term information needs are captured in a query profile with a structured query representation. Similarity matching decides whether to disseminate an item if it is relevant. The filtering system is responsible for efficiency, the retrieval model for relevance.]
Information Retrieval and Browsing
Retrieval
– Produce a ranked result from a user request
– Interpretation of the information by the system
Browsing
– Let the user navigate in the information set
– Relevance feedback by the human
Other tasks
In a more general sense Information Retrieval is used
for a number of different types of tasks, such as
– Information filtering
– Document summarization
– Question answering
– Recommendation
– Document classification
IR is an Information Management Task
[Figure: IR as an information management task involving a model M]
1.2 BASIC INFORMATION RETRIEVAL
1.2.1 Text-based Information Retrieval
Most of the information needs and content are expressed in
natural language
– Library and document management systems
– Web Search Engines
Basic approach: use the words that occur in a text as
features for the interpretation of the content
– This is called the "full text" or "bag of words" retrieval approach
– Ignore grammar, meaning etc.
– Simplification that has proven successful
– Document structure, layout and metadata may be considered
additionally (e.g., PageRank/Google)
Architecture of Text Retrieval Systems
[Figure: architecture of a text retrieval system. The user interface accepts the user need and text; text operations perform feature extraction on documents and queries (tokenization, stop word elimination, stemming), possibly complemented by manual indexing; document structure, layout, and metadata may also be used; searching is performed against the index.]
Text Retrieval - Basic Concepts and Notations
Document d: expresses ideas about some topic in a natural language
Query q: expresses an information need for documents pertaining
to some topic
Index term: a semantic unit, a word, short phrase, or potentially the root of a word
Database DB: collection of n documents dj ∈ DB, j = 1, …, n
Vocabulary T: collection of m index terms ki ∈ T, i = 1, …, m
The IR system assigns a similarity coefficient sim(q, dj) as an estimate for the relevance of a document dj ∈ DB for a query q.
Example: Documents
B1 A Course on Integral Equations
B2 Attractors for Semigroups and Evolution Equations
B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
B4 Geometrical Aspects of Partial Differential Equations
B5 Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative
Algebra
B6 Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
B7 Knapsack Problems: Algorithms and Computer Implementations
B8 Methods of Solving Singular Systems of Ordinary Differential Equations
B9 Nonlinear Systems
B10 Ordinary Differential Equations
B11 Oscillation Theory for Neutral Differential Equations with Delay
B12 Oscillation Theory of Delay Differential Equations
B13 Pseudodifferential Operators and Nonlinear Partial Differential Equations
B14 Sinc Methods for Quadrature and Differential Equations
B15 Stability of Stochastic Differential Equations with Respect to Semi-Martingales
B16 The Boundary Integral Approach to Static and Dynamic Contact Problems
B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory
Term-Document Matrix
Matrix of weights wij indicating the occurrence of term ki in document dj
In this example:
• the vocabulary contains only terms that occur multiple times, and no stop words
• all weights are set to 1 (equal importance)
Implementation in Python
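A minimal Python sketch of building such a binary term-document matrix; the three sample titles and the hand-picked vocabulary are illustrative, and no stemming or stop word removal is applied.

# Minimal sketch of a binary term-document matrix (illustrative data).
import numpy as np

docs = {
    "B1": "a course on integral equations",
    "B3": "automatic differentiation of algorithms theory implementation and application",
    "B10": "ordinary differential equations",
}
vocab = ["algorithms", "application", "differential", "equations", "theory"]

def weights(text):
    # wij = 1 if term ki occurs in document dj, 0 otherwise
    terms = set(text.split())
    return [1 if t in terms else 0 for t in vocab]

matrix = np.array([weights(text) for text in docs.values()])
print(matrix)
# [[0 0 0 1 0]
#  [1 1 0 0 1]
#  [0 0 1 1 0]]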
A retrieval model attempts to capture …
1. the interface by which a user is accessing information
2. the importance a user gives to a piece of information
for a query
3. the formal correctness of a query formulation by the user
4. the structure by which a document is organised
Full-text retrieval refers to the fact that …
1. the document text is grammatically fully analyzed for
indexing
2. queries can be formulated as texts
3. all words of a text are considered as potential index
terms
4. grammatical variations of a word are considered as the
same index terms
The entries of a term-document matrix indicate …
1.2.2 Boolean Retrieval
Users specify which terms should be present in the documents
– Simple, based on set-theory, precise meaning
– Frequently used in (old) library systems
– Still many applications, e.g., web harvesting
Example query
– "application" AND "theory" → answer: B3, B17
Retrieval Language
expr ::= term | (expr) | NOT expr | expr AND expr | expr OR expr
"Similarity" Computation in Boolean Retrieval
Step 1:
Transform the query into disjunctive normal form, i.e., into a disjunction of conjunctive terms
Step 2:
For each conjunctive term ct create its query weight vector vec(ct)
– vec(ct) = (w1, …, wm):
wi = 1 if ki occurs in ct
wi = -1 if NOT ki occurs in ct
wi = 0 otherwise
"Similarity" Computation in Boolean Retrieval
Step 3:
If one weight vector of a conjunctive term ct in q matches
the document weight vector dj = (w1j, …,wmj) of a document
dj , then the document dj is relevant, i.e.,
sim(dj, q) = 1
– vec(ct) matches dj if:
wi = 1 ⟹ wij = 1
wi = -1 ⟹ wij = 0
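The following minimal Python sketch implements Steps 1-3, assuming the query has already been transformed into disjunctive normal form; the dict-based encoding of conjunctive terms is an illustrative choice.

# Minimal sketch of Boolean retrieval over weight vectors.
# A conjunctive term ct is a dict: term -> True (term must occur)
# or False (NOT term, i.e. term must not occur).

def vec(ct, vocab):
    # Query weight vector of a conjunctive term ct
    return [(1 if ct[k] else -1) if k in ct else 0 for k in vocab]

def matches(ct_vec, doc_vec):
    # wi = 1 requires wij = 1; wi = -1 requires wij = 0
    return all((wi != 1 or wij == 1) and (wi != -1 or wij == 0)
               for wi, wij in zip(ct_vec, doc_vec))

def sim(query_dnf, doc_vec, vocab):
    # sim(d, q) = 1 if any conjunctive term of the query matches d
    return int(any(matches(vec(ct, vocab), doc_vec) for ct in query_dnf))

vocab = ["application", "algorithm", "theory"]
# "application" AND ("algorithm" OR NOT "theory") in disjunctive normal form:
q = [{"application": True, "algorithm": True},
     {"application": True, "theory": False}]
print(sim(q, [1, 0, 0], vocab))  # 1: contains "application" but not "theory"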
Example
Index terms {application, algorithm, theory}
Query "application" AND ("algorithm" OR NOT "theory")
In disjunctive normal form: ("application" AND "algorithm") OR ("application" AND NOT "theory"), i.e., the weight vectors {(1, 1, 0), (1, 0, -1)}
Similarity Computation in Vector Space Retrieval
[Figure: query vector q and document vector dj in the m-dimensional space spanned by the index terms k1, …, km; the angle θ between the vectors determines their similarity]
sim(q, dj) = cos(θ) = Σi wij · wiq / (√(Σi wij²) · √(Σi wiq²))
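A minimal Python sketch of this cosine similarity; it reproduces the value sim(q, B5) = 1/√2 from the example below.

# Minimal sketch of cosine similarity between weight vectors.
import math

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

# Query "application algorithms" = (1, 1) vs. document "algorithms" = (0, 1):
print(cosine([1, 1], [0, 1]))  # 0.7071... = 1/sqrt(2)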
Vector Space Retrieval - Properties
Properties
– Ranking of documents according to similarity value
– Documents can be retrieved even if they don’t contain some
query keyword
Example
query vector: "application algorithms" → q = (1, 1)
document vector "algorithms" (B5, B7) → (0, 1): sim(q, B5) = 1/√2 ≈ 0.71
document vector "application algorithms" (B3) → (1, 1): sim(q, B3) = 1
Example
Vocabulary T = {information, retrieval, agency}
Query q = (information, retrieval) = (1, 1, 0)
Result:
D1: "information … retrieval … information … retrieval" → D1 = (1, 1, 0), sim(q, D1) = 1
D2: "retrieval … retrieval … retrieval … retrieval" → D2 = (0, 1, 0), sim(q, D2) = 0.7071…
D3: "agency … information … retrieval … agency" → D3 = (0.5, 0.5, 1), sim(q, D3) = 0.5773…
D4: "retrieval … agency … retrieval … agency" → D4 = (0, 1, 1), sim(q, D4) = 0.5
Inverse Document Frequency
We have to consider not only how frequently a term occurs within a document (a measure for similarity), but also how frequent the term is in the document collection of size n (a measure for distinctiveness)
Inverse document frequency of term ki:
idf(i) = log(n / ni) ∈ [0, log(n)]
where ni is the number of documents in which ki occurs
Result for the same documents, now with tf-idf weights (query q = (1, 1, 0)):
D1 = (log(2), 0, 0), sim(q, D1) = 0.7071…
D2 = (0, 0, 0), sim(q, D2) = 0
Since "retrieval" occurs in every document, its idf is 0 and it no longer distinguishes the documents.
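A minimal Python sketch of tf-idf weighting, assuming the term frequency is normalized by the maximal frequency of all terms in the document.

# Minimal sketch of tf-idf weighting; documents are lists of terms.
import math

def idf(term, docs):
    n = len(docs)
    ni = sum(1 for d in docs if term in d)
    return math.log(n / ni) if ni else 0.0

def tf(term, doc):
    # term frequency, normalized by the most frequent term in the document
    return doc.count(term) / max(doc.count(t) for t in set(doc))

docs = [["information", "retrieval", "information", "retrieval"],
        ["retrieval", "retrieval", "retrieval", "retrieval"]]
w = tf("information", docs[0]) * idf("information", docs)
print(w)  # tf = 2/2 = 1, idf = log(2/1), so w = log(2) = 0.693...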
Let the query be represented by {(1, 0, -1), (0, -1, 1)}
The term frequency of a term is normalized …
1. by the maximal frequency of all terms in the document
2. by the maximal frequency of the term in the document collection
3. by the maximal frequency of any term in the vocabulary
4. by the maximal term frequency of any document in the
collection
The inverse document frequency of a term can increase …
1. by adding the term to a document that contains the term
2. by removing a document from the document collection that does
not contain the term
3. by adding a document to the document collection that contains
the term
4. by adding a document to the document collection that does not
contain the term
The Role of Document Length
When computing cosine similarity, document vectors are
normalized
Result for the query "information": doc2 will be favored because it is shorter
doc1: information, agency, retrieval, system → (0.25, 0.25, 0.25, 0.25)
doc2: information, agency → (0.5, 0.5, 0, 0)
(Singhal 1996)
Normalization of Document Vector
Renormalize the document vector
Standard normalization: divide the weights by the norm of the document vector, e.g., wij → wij / √(Σk wkj²)
Compensating Bias towards Short Documents
[Figure: the pivoted normalization factor as a function of the original factor: a line with slope s crossing the identity line at the pivot, with intercept (1 – s) · pivot]
New normalization
Length Normalization
Weighting scheme: divide the weights by the pivoted normalization factor
N(n) = (1 – s) · pivot + s · n
where s = slope and n = original normalization factor, i.e., the norm of the document vector
Result
• If n < pivot, then N(n) > n and therefore weights will be smaller
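A minimal Python sketch of the pivoted normalization factor; the parameter values are illustrative.

# Minimal sketch of pivoted length normalization (Singhal et al. 1996).

def pivoted_norm(n, pivot, slope):
    # n is the original normalization factor of the document
    return (1.0 - slope) * pivot + slope * n

# Documents below the pivot get a larger factor (hence smaller weights),
# documents above the pivot a smaller factor (hence larger weights):
print(pivoted_norm(5.0, pivot=10.0, slope=0.75))   # 6.25 > 5
print(pivoted_norm(20.0, pivot=10.0, slope=0.75))  # 17.5 < 20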
Pivoted Unique Query Normalization
Practical implementation of the approach
Weighting scheme: use the number of unique terms ud of document d as the normalization variable
N(ud) = (1 – s) · pivot + s · ud
with pivot = the average number of unique terms per document in the collection
Variants of Vector Space Retrieval Model
The vector model with tf-idf weights is a good ranking strategy for
general collections
– many alternative weighting schemes exist, but are not fundamentally
different
Discussion of Vector Space Retrieval Model
Advantages
– term-weighting improves quality of the answer set
– partial matching allows retrieval of docs that approximate the
query conditions
– cosine ranking formula sorts documents according to degree of
similarity to the query
Disadvantages
– assumes independence of index terms; not clear that this is a
disadvantage
– no theoretical justification for why the model works
1.2.4 Evaluating Information Retrieval
Quality of a retrieval model depends on how well it
matches user needs!
Evaluating Information Retrieval
Test collections with test queries, for which the relevant documents have been identified manually, are used to determine the quality of an IR system (e.g., TREC)
Recall and Precision
Recall is the fraction of all relevant documents in the collection that are retrieved:
R = tp / (tp + fn)
Precision is the fraction of the retrieved documents (the answer set) that are relevant:
P = tp / (tp + fp)

                     Relevant                  Non-relevant
Retrieved            true positives (tp)       false positives (fp)
Not retrieved        false negatives (fn)      true negatives (tn)
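A minimal Python sketch computing precision and recall from these four counts.

# Minimal sketch of precision and recall from a contingency table.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# 45 relevant retrieved, 20 non-relevant retrieved, 5 relevant missed:
print(precision(45, 20))  # 0.69...
print(recall(45, 5))      # 0.9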
Recall and Precision – A Tradeoff
Suppose you search for “Theory of Relativity”.
Accuracy
Accuracy is the fraction of items that are classified correctly:
A = (tp + tn) / (tp + fp + fn + tn)
Accuracy - Pitfall
Classifier 1:
                        Class: Fraud    Class: ¬Fraud
Classified Fraud              5              10
Classified ¬Fraud             5              80
A = 85/100 = 0.85

Classifier "Always ¬Fraud":
                        Class: Fraud    Class: ¬Fraud
Classified Fraud              0               0
Classified ¬Fraud            10              90
A = 90/100 = 0.90
Which is the “best” classifier?
Classifier 1:
                        Class: A    Class: B
Classified A                45          20
Classified B                 5          30

Classifier 2:
                        Class: A    Class: B
Classified A                40          10
Classified B                10          40

A. Classifier 1
B. Classifier 2
C. Both are equally good
Which is the “best” classifier?
Classifier 1:
                        Class: Cancer    Class: ¬Cancer
Classified Cancer             45               20
Classified ¬Cancer             5               30

Classifier 2:
                        Class: Cancer    Class: ¬Cancer
Classified Cancer             40               10
Classified ¬Cancer            10               40

A. Classifier 1
B. Classifier 2
C. Both are equally good
Precision and Recall: Example
Classifier 1:
                        Class: Cancer    Class: ¬Cancer
Classified Cancer             45               20
Classified ¬Cancer             5               30
P1 = 45/65 = 0.69, R1 = 45/50 = 0.9

Classifier 2:
                        Class: Cancer    Class: ¬Cancer
Classified Cancer             40               10
Classified ¬Cancer            10               40
P2 = 40/50 = 0.8, R2 = 40/50 = 0.8

F-score (harmonic mean of precision and recall): F = 2PR / (P + R)
F1 = 2 · (0.69 · 0.9) / (0.69 + 0.9) = 0.78
F2 = 2 · (0.8 · 0.8) / (0.8 + 0.8) = 0.8

Classifier "Everybody has cancer":
                        Class: Cancer    Class: ¬Cancer
Classified Cancer             50               50
Classified ¬Cancer             0                0
F = 2 · (0.5 · 1) / (0.5 + 1) = 0.66
F-alpha-Score: Example (alpha = 1/5)
F_α = 1 / (α/P + (1 – α)/R); with α = 1/5, F = 5PR / (4P + R), weighting recall more heavily

Classifier 1:
                        Class: Cancer    Class: ¬Cancer
Classified Cancer             45               20
Classified ¬Cancer             5               30
F1 = 5 · (0.69 · 0.9) / (4 · 0.69 + 0.9) = 0.84

Classifier 2:
                        Class: Cancer    Class: ¬Cancer
Classified Cancer             40               10
Classified ¬Cancer            10               40
F2 = 5 · (0.8 · 0.8) / (4 · 0.8 + 0.8) = 0.8

Classifier "Everybody has cancer":
                        Class: Cancer    Class: ¬Cancer
Classified Cancer             50               50
Classified ¬Cancer             0                0
F = 5 · (0.5 · 1) / (4 · 0.5 + 1) = 0.83
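A minimal Python sketch of this weighted harmonic mean; α = 1/2 recovers the balanced F1-score.

# Minimal sketch of the F-alpha score (weighted harmonic mean of P and R).

def f_score(p, r, alpha=0.5):
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

print(f_score(0.69, 0.9))             # alpha = 1/2 (F1): 0.78...
print(f_score(0.69, 0.9, alpha=1/5))  # weights recall more: 0.84...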
Precision/Recall Tradeoff in Ranked Retrieval
An IR system ranks documents by a similarity coefficient,
allowing the user to trade off between precision and recall
by choosing the cutoff level
Precision depends on the number of results retrieved:
P@k = precision for the top-k documents
[Figure: precision/recall behavior of a hypothetical ideal IR system versus realistic IR systems]
Evaluating Ranked Retrieval
Recall-Precision Plot
Example ranking (R = relevant, N = non-relevant):
R N R N R R N N R R R N R N R R
(10 relevant documents)
Interpolated Precision
Interpolated precision: the maximum precision at any recall level r′ ≥ r:
P_int(r) = max{ P(r′) : r′ ≥ r }
This makes the precision-recall curve monotonically non-increasing.
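A minimal Python sketch of this interpolation over a list of precision values at successive recall levels (the example values are illustrative).

# Minimal sketch of interpolated precision: scan from the highest recall
# level backwards, keeping the running maximum.

def interpolated(precisions):
    out, best = [], 0.0
    for p in reversed(precisions):
        best = max(best, p)
        out.append(best)
    return list(reversed(out))

print(interpolated([1.0, 0.5, 0.67, 0.5]))  # [1.0, 0.67, 0.67, 0.5]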
Mean Average Precision (MAP)
Given a set of queries Q
For each query q ∈ Q, let {d1, …, dmq} be the set of relevant documents and Rqk the top-ranked results up to the k-th relevant document dk; Precision(Rqk) is the interpolated precision of this result
MAP(Q) = (1/|Q|) · Σq∈Q (1/mq) · Σk=1..mq Precision(Rqk)
Example
Assume 4 results are returned for a query q: R N R N
If these are the only two relevant documents, the average precision is (1/1 + 2/3) / 2 = 5/6
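A minimal Python sketch of average precision for one ranked result list; it uses the precision at each relevant rank (the non-interpolated variant) and assumes the total number of relevant documents is known.

# Minimal sketch of average precision; rel[i] is True if the document
# at rank i+1 is relevant.

def average_precision(rel, num_relevant):
    hits, total = 0, 0.0
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            total += hits / k  # precision at this relevant rank
    return total / num_relevant

print(average_precision([True, False, True, False], 2))  # (1 + 2/3)/2 = 5/6

MAP is then the mean of these values over all queries in Q.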
ROC Curve
[Figure: ROC curve plotting the rate of true positives against the rate of false positives]
Specificity S = tn / (tn + fp)
1 – S, the false positive rate, gives information about how many of the true negatives have been retrieved as false positives
• The steeper the curve rises at the beginning, the better
• The larger the area under the curve, the better (AUC)
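A minimal Python sketch that derives the ROC points from a ranked result list and computes the area under the curve with the trapezoidal rule.

# Minimal sketch of ROC points and AUC for a ranked result list;
# rel[i] = True if the document at rank i+1 is relevant.

def roc_points(rel):
    pos, neg = sum(rel), len(rel) - sum(rel)  # assumes pos > 0 and neg > 0
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for r in rel:
        tp, fp = tp + r, fp + (not r)
        pts.append((fp / neg, tp / pos))  # (false positive rate, true positive rate)
    return pts

def auc(pts):
    # trapezoidal rule over the ROC curve
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

print(auc(roc_points([True, False, True, False])))  # 0.75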
If the top 100 documents contain 50 relevant
documents …
1. the precision of the system at 50 is 0.25
2. the precision of the system at 100 is 0.5
3. the recall of the system is 0.5
4. All of the above
If retrieval system A has a higher precision at k
than system B …
1. the top k documents of A will have higher similarity values than
the top k documents of B
2. the top k documents of A will contain more relevant documents
than the top k documents of B
3. A will recall more documents above a given similarity threshold
than B
4. the top k relevant documents in A will have higher similarity
values than in B
Let the first four documents retrieved be
R N N R. Then the MAP is
1. 1/2
2. 3/4
3. 2/3
4. 5/6
References
Course material based on
– Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval (ACM Press Series), Addison-Wesley, 1999.
– Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008 (http://www-nlp.stanford.edu/IR-book/).
– Course Information Retrieval by TU Munich (http://www.cis.lmu.de/~hs/teach/14s/ir/).
– Singhal, A., Salton, G., Mitra, M., & Buckley, C. (1996). Document length normalization. Information Processing & Management, 32(5), 619-633.