1. Introduction: Information Retrieval
Information Retrieval: Fundamentals
• Indexing: inverted index, index construction and compression
• Scoring and Ranking
• Tolerant Search
• Query Expansion
Web Search Engines
• Architectures
• Crawling and Feeds: Near Duplicate Page Detection
• Link Analysis
• Spam
Special Applications and Technologies
• Machine Learning for Information Retrieval
• Recommender Systems
• ChatGPT and Retrieval-Augmented Generation (Gemini, New Bing, etc.)
Course Materials
• A number of recent papers on Dense Retrieval Methods and Retrieval-Augmented Generation
• IIR book
• SE Book
Course Work (Tentative)
• (Light-weight) Programming Assignments: 2 (40%)
• 1 announced at week 5 (due in 2 weeks)
• 1 announced at week 9 (due in 2 weeks)
• Attendance: 10%
Contact Information
IR Applications
Most Visible Applications: Web Search Engines
IR is more than just Web Search Engines
IR is more than just Web Search Engines – ChatGPT and Gemini
IR and its sister fields
[Figure: IR and its sister fields: Databases, Big Data, Distributed Systems, Recommendation, Question Answering, and Natural Language Processing. Legible labels: IR works with unstructured data and NL queries; Databases work with structured data and SQL queries; “inverted index”.]
Source: adapted from Grace Hui Yang’s graph
Brief History of IR -- Pre-computer Era
• The roots of IR lie in early library practices
Brief History of IR -- Early Developments
• 1945: Vannevar Bush’s groundbreaking essay “As We May Think”,
which describes a hypothetical machine known as the “memex” that
would hold all of mankind’s knowledge in an indexed form for retrieval
• 1950: The term “Information Retrieval” was coined by Calvin Mooers
Brief History of IR -- Early Developments
• 1960s:
• The development of early IR systems,
such as the SMART Information Retrieval System
by Gerard Salton at Cornell University
Brief History of IR – The Rise of the Web and Search Engines (1990-2000)
• Explosion of Information: The World Wide Web led to an
unprecedented amount of digital information, making efficient
search crucial
• Birth of Modern Search Engines: Web crawlers, indexing techniques,
and ranking algorithms (like Google’s PageRank) revolutionized how
we find information
• IR as a Mainstream Tool: Information retrieval transitioned from a
specialized field to a core element of everyday life
Brief History of IR – Modern Era (2000s-Present)
• Beyond Text: IR now encompasses images, videos, and other
multimedia data
• Machine Learning and AI: AI technologies play a significant role in
personalizing search, ranking results, and understanding complex
queries
• Conversational AI: Search interfaces are becoming more dialog-based
and intuitive
• Proactive Information Curation: Systems will anticipate information
needs, delivering relevant content before a user even searches
What is Information Retrieval?
Basic assumptions of IR
Semi-structured data
• In fact, almost no data is “unstructured”
• Documents have titles, sections, subsections, lists, etc.
• Linguistic structure
• E.g. parse trees
• “Semi-structured” searches
• E.g. 1: Title contains “data” AND Bullets contain “search”
• E.g. 2:
• Title is about Object Oriented Programming AND
• Author is something like stro*rup
• where * is the wild-card operator
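The second example can be sketched in a few lines. This is only an illustration: the toy collection `DOCS`, the field names, and the helper `search` are hypothetical, and "is about" is approximated here by requiring every topic word to appear in the title.

```python
from fnmatch import fnmatchcase

# Hypothetical toy collection: each document is a dict of named fields.
DOCS = [
    {"title": "Notes on Object Oriented Programming", "author": "Stroustrup"},
    {"title": "Object Oriented Programming Exercises", "author": "Strong"},
    {"title": "Notes on Database Systems", "author": "Stroustrup"},
]

def search(docs, topic, author_pattern):
    """Return titles whose title contains every topic word AND whose
    author field matches a wild-card pattern such as 'stro*rup'."""
    topic_terms = topic.lower().split()
    hits = []
    for doc in docs:
        title_terms = doc["title"].lower().split()
        title_ok = all(term in title_terms for term in topic_terms)
        # fnmatchcase treats '*' as "any run of characters".
        author_ok = fnmatchcase(doc["author"].lower(), author_pattern.lower())
        if title_ok and author_ok:
            hits.append(doc["title"])
    return hits

print(search(DOCS, "object oriented programming", "stro*rup"))
```

Only the first document satisfies both field conditions; the second fails the wild-card test on the author and the third fails the title test.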
misconception
2. Info Need: info about antivirus software without paying money
misformulation
3. Query: how to download free antivirus software
4. Search Engine (over Collections)
5. Results
6. Query Refinement
Slide from Christopher Manning
Issues in Information Retrieval
• How to measure the relevance between a query and a document?
• How to evaluate an Information Retrieval System?
• How to uncover the user need?
Issues in Information Retrieval: Relevance
Query: Cats
Returned results from Google
• A relevant document contains the
information that a person was looking
for when she submitted a query
• Simple Matching is not enough
Issues in Information Retrieval: Interaction
• User judgements are the ultimate
judges of quality
The Big Issues
Information Retrieval:
Relevance
- Effective ranking
Evaluation
- Testing and measuring
Information needs
- User interaction
Search Engines:
Performance
- Response time, query throughput and indexing speed
Incorporating new data
- Coverage and freshness
Scalability
- Growing with data and users
Adaptability
- Tuning for applications
Specific Problems
- E.g. spam
The Vector Space Retrieval Model
• Query (Document) = a bag of words
• (without connecting search operators such as
AND, OR)
• Task: compute a score
• score = sum of the match scores between each query term and the
document
• How? each term in a document is
associated with a weight
• Simple weighting method: term frequency
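The scoring idea above can be sketched with raw term frequency as the weight. The function names `tf_score` and `rank`, the whitespace tokenizer, and the sample documents are assumptions made for illustration.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and split on whitespace; no stemming or stop words.
    return text.lower().split()

def tf_score(query, document):
    """Sum, over the query's bag of words, of each term's frequency in the document."""
    tf = Counter(tokenize(document))
    return sum(tf[term] for term in tokenize(query))

def rank(query, docs):
    """Return the documents ordered by decreasing term-frequency score."""
    return sorted(docs, key=lambda d: tf_score(query, d), reverse=True)

SAMPLE_DOCS = [
    "cats chase mice and cats sleep",
    "dogs chase cats",
    "birds sing",
]

for doc in rank("cats chase", SAMPLE_DOCS):
    print(tf_score("cats chase", doc), doc)
```

Raw term frequency favors documents that repeat a query term, which is why practical systems usually dampen it and combine it with other weights.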
The Probabilistic Retrieval Model
• Similarities are computed as probabilities that a document is
relevant for a given query
• Example: BM25
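A minimal sketch of one common BM25 variant is given here, since the slide only names it. The function name `bm25_scores`, whitespace tokenization, the IDF form with 1 added inside the logarithm, and the parameter values k1 = 1.5 and b = 0.75 are assumptions for illustration. The resulting scores are used for ranking and are not themselves probabilities.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a BM25-style formula.

    k1 controls term-frequency saturation and b controls document-length
    normalization; both defaults are conventional choices, assumed here.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

BM25_DOCS = [
    "cats chase mice",
    "dogs chase cats and dogs chase birds",
    "birds sing",
]
print(bm25_scores("cats", BM25_DOCS))
```

Both of the first two documents contain "cats" once, but the shorter one scores higher because of length normalization; the third scores zero.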
Remarks
• Applications of IR
• IR definition
• IR and the related fields
• Issues in Information Retrieval
• Based on their mathematical foundations, IR models can be divided
into three types:
• Set-theoretic based methods
• Algebraic based methods
• Probabilistic based methods
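As a sketch of the first family, Boolean retrieval treats each term as the set of documents that contain it and answers queries with set operations. The helper names and the toy documents are assumptions for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents containing all of the terms (set intersection)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Documents containing at least one of the terms (set union)."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

BOOL_DOCS = ["cats chase mice", "dogs chase cats", "birds sing"]
INDEX = build_inverted_index(BOOL_DOCS)
print(boolean_and(INDEX, "cats", "chase"))
print(boolean_or(INDEX, "mice", "birds"))
```

Unlike the vector space and probabilistic models, this kind of matching returns an unranked set: a document either satisfies the query or it does not.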
Read More
• Chapters 1 and 2, IIR
• Chapter 1, SE
Acknowledgements
Many slides in this section are adapted from the slides of Prof.
Christopher Manning (Stanford)