1. Introduction: Information Retrieval


Information Retrieval and Search Engines
Course Overview & Introduction

Cam-Tu Nguyen, Ph.D
Email: [email protected]

Course Objectives
• Learn the principles of search engines such as Google and Baidu
• Learn how to apply IR methods to recommendation systems and question-answering systems
  • E.g. New Bing, Google Bard (Gemini)
• Improve your programming skills
  • Be able to build a mini search engine
• Improve your English
  • Learn AI technical terms
Course Overview

Information Retrieval: Fundamentals
• Indexing: inverted index, index construction and compression
• Scoring and Ranking
• Tolerant Search
• Query Expansion

Web Search Engines
• Architectures
• Crawling and Feeds: Near Duplicate Page Detection
• Link Analysis
• Spam

Special Applications and Technologies
• Machine Learning for Information Retrieval
• Recommender Systems
• ChatGPT and Retrieval-Augmented Generation (Gemini, New Bing, etc.)
Course Materials
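• IIR book: Introduction to Information Retrieval (Manning, Raghavan, Schütze)
• SE book: Search Engines: Information Retrieval in Practice (Croft, Metzler, Strohman)
• A number of recent papers on Dense Retrieval Methods and Retrieval-Augmented Generation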
Course Work (Tentative)
• (Lightweight) programming assignments: 2 (40%)
  • One announced in week 5 (due in 2 weeks)
  • One announced in week 9 (due in 2 weeks)
• Final project: 1 (50%)
  • Guidelines announced around week 7
  • Build a mini search engine for a domain of your interest
    • E.g. papers in Machine Learning, Data Mining
  • Groups of 2 people (larger groups must build a search engine with more features)
  • The system demo is in week 17 (tentatively)
  • Bonus for advanced features such as a (text-based) conversational interface (conversational QA) or a summarization feature
• Attendance: 10%
Contact Information

• Teaching Cube (教学立方) course number: application code MFT8935W
  • Course materials, homework submissions, etc.
• QQ Group ID: 663010245
• Contact: [email protected]
Introduction to Information Retrieval
Definition, History, The Big Issues, Search Engines

IR Applications
Most Visible Applications: Web Search Engines

IR is more than just Web Search Engines

IR is more than just Web Search Engines: ChatGPT and Gemini

IR and its sister fields
[Figure: IR shown alongside its sister fields: Databases, Recommendation, Big Data and Distributed Systems, Question Answering, and Natural Language Processing. Recoverable labels: IR works with unstructured data and natural-language (NL) queries, while Databases work with structured data and SQL queries; Recommendation has no query but a user profile; "Inverted index" appears as a further label. The remaining edge labels are not recoverable from the extracted text.]
Source: adapted from Grace Hui Yang's graph
Brief History of IR -- Pre-computer Era
• The roots of IR lie in early library practices

[Images: Index Card (Wikipedia); Library Catalogs (Wikimedia Commons)]

Brief History of IR -- Early Developments
• 1945: Vannevar Bush's groundbreaking essay "As We May Think" describes a hypothetical machine, the "memex", that would hold all of mankind's knowledge in an indexed form for retrieval
• 1950: The term "Information Retrieval" was coined by Calvin Mooers

Brief History of IR -- Early Developments

• 1960s:
  • Development of early IR systems, like the SMART Information Retrieval System by Gerard Salton at Cornell University
    [Image: Gerard "Gerry" Salton (8 March 1927 – 28 August 1995)]
  • The Cranfield experiments were a series of experimental studies in Information Retrieval
    • Cranfield 1400 benchmark

Brief History of IR – The Rise of the Web and Search Engines (1990-2000)
• Explosion of Information: The World Wide Web led to an unprecedented amount of digital information, making efficient search crucial
• Birth of Modern Search Engines: Web crawlers, indexing techniques, and ranking algorithms (like Google's PageRank) revolutionized how we find information
• IR as a Mainstream Tool: Information retrieval transitioned from a specialized field to a core element of everyday life

Brief History of IR – Modern Era (2000s-Present)
• Beyond Text: IR now encompasses images, videos, and other multimedia data
• Machine Learning and AI: AI technologies play a significant role in personalizing search, ranking results, and understanding complex queries
• Conversational AI: Search interfaces are becoming more dialog-based and intuitive
• Proactive Information Curation: Systems will anticipate information needs, delivering relevant content before a user even searches

What is Information Retrieval?

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
(Source: Christopher Manning Book)

Basic assumptions of IR
• Collection: A set of documents
  • Assume it is a static collection for the moment
• Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task

Semi-structured data
• In fact, almost no data is "unstructured"
• Documents have titles, sections, subsections, lists, etc.
• Linguistic structure
  • E.g. parse trees
• "Semi-structured" searches
  • E.g. 1: Title contains data AND Bullets contains search
  • E.g. 2:
    • Title is about Object Oriented Programming AND
    • Author something like stro*rup
    • where * is the wild-card operator

Slide from Christopher Manning
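
The second semi-structured query above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy collection `DOCS`, the field names, and the helpers `field_contains` and `field_matches` are hypothetical, "is about" is approximated by a plain substring test, and the wild-card is handled with the standard-library `fnmatch` module.

```python
from fnmatch import fnmatchcase

# Hypothetical toy collection: each document is a dict of named fields.
DOCS = [
    {"title": "Object Oriented Programming in C++", "author": "Stroustrup"},
    {"title": "Object Oriented Programming in Java", "author": "Bloch"},
    {"title": "Introduction to Information Retrieval", "author": "Manning"},
]

def field_contains(doc, field, phrase):
    """True if the field contains the phrase (case-insensitive substring test)."""
    return phrase.lower() in doc.get(field, "").lower()

def field_matches(doc, field, pattern):
    """True if any word of the field matches the wild-card pattern (* = any characters)."""
    words = doc.get(field, "").lower().split()
    return any(fnmatchcase(word, pattern.lower()) for word in words)

def search(docs):
    """Title is about Object Oriented Programming AND author is like stro*rup."""
    return [
        d for d in docs
        if field_contains(d, "title", "object oriented programming")
        and field_matches(d, "author", "stro*rup")
    ]

results = search(DOCS)
print([d["author"] for d in results])
```

A real system would answer such queries from per-field indexes rather than by scanning every document; the scan here just keeps the sketch short.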


The classic search model
1. User task: Get rid of computer viruses in economical ways
   (misconception: a possible error between steps 1 and 2)
2. Info need: Info about antivirus software without paying money
   (misformalization: a possible error between steps 2 and 3)
3. Query: how to download free antivirus software
4. Search engine (searches the collections)
5. Results
6. Query refinement (leads back to the query)
Slide from Christopher Manning
Issues in Information Retrieval
• How to measure the relevance between a query and a document?
• How to evaluate an Information Retrieval System?
• How to uncover the user need?

Issues in Information Retrieval: Relevance
• A relevant document contains the information that a person was looking for when she submitted a query
• Simple matching is not enough
• Topical relevance vs user relevance
• Retrieval models: formalizing the process of finding relevant documents
[Figure: returned results from Google for the query "Cats"]

Text from W. Bruce Croft's Book


Issues in Information Retrieval: Evaluation
• The quality of a document ranking depends on how well it matches a
person’s expectations
• Evaluation measures: precision & recall
• Experimental procedures
• A test collection includes documents, a list of typical queries, and relevance judgements (which documents are relevant to each query)
• Search engines also use clickthrough data in addition to relevance
judgements
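
As a minimal sketch of the two measures named above, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name `precision_recall` and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids returned by the system
    relevant:  ids judged relevant in the test collection
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant documents that were retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 judged relevant, 2 in common.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Both measures ignore the order of the results; measures that reward placing relevant documents near the top are built on the same counts.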

Issues in Information Retrieval: Interaction
• Users are the ultimate judges of quality
• Study how people interact with search engines, and help them express their information needs
• Query suggestion, query expansion, relevance feedback

The Big Issues
Information Retrieval
Relevance
- Effective ranking
Evaluation
- Testing and measuring
Information needs
- User interaction

Search Engines
Performance
- Response time, query throughput and indexing speed
Incorporating new data
- Coverage and freshness
Scalability
- Growing with data and users
Adaptability
- Tuning for applications
Specific Problems
- E.g. spam

Figure from W. Bruce Croft's Book


Some Dimensions of the Field of IR

Examples of Content    Examples of Applications    Examples of Tasks
Text                   Web search                  Ad hoc search
Images                 Vertical search             Filtering
Video                  Enterprise search           Classification
Scanned documents      Desktop search              Question Answering
Audio                  Peer-to-peer search
Music

Table from W. Bruce Croft's Book


Information Retrieval Models

[Figure: categorization of IR models]
Source: Wikipedia, Categorization of IR-models (translated from German entry, original source Dominik Kuropka).
Boolean Retrieval Model
• The Boolean retrieval model is a model for information retrieval in
which
• Any query is in the form of a Boolean expression of terms, that
is, in which terms are combined with the operators AND, OR, and
NOT
• The model views each document as just a set of words
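
A small sketch of this model in Python, assuming a made-up three-document collection: each term maps to the set of documents that contain it, and AND, OR, and NOT become set intersection, union, and difference. The names `DOCS`, `build_index`, and `postings` are illustrative only.

```python
# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "information retrieval and web search",
    2: "database systems and sql queries",
    3: "web search engines rank documents",
}

def build_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):  # a document is just a set of words
            index.setdefault(term, set()).add(doc_id)
    return index

INDEX = build_index(DOCS)
ALL_DOCS = set(DOCS)

def postings(term):
    """Set of documents containing the term (empty set if the term is unseen)."""
    return INDEX.get(term, set())

# web AND search AND NOT engines
result_and_not = (postings("web") & postings("search")) - postings("engines")

# information OR sql
result_or = postings("information") | postings("sql")

# NOT web (complement with respect to the whole collection)
result_not = ALL_DOCS - postings("web")

print(result_and_not, result_or, result_not)
```

A document either matches the expression or it does not, so the result is an unranked set.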

The Vector Space Retrieval Model
• Query (Document) = a bag of words
• (without connecting search operators such as
AND, OR)
• Task: compute a score
• score = sum of the match scores between each query term and the document
• How? each term in a document is
associated with a weight
• Simple weighting method: term frequency
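
The scoring idea above, with term frequency as the weight, might be sketched as follows. The documents, the query, and the names `tokenize` and `tf_score` are assumptions made for illustration; tokenization is plain lowercasing and whitespace splitting.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def tf_score(query, document):
    """Sum, over the query terms, of each term's frequency in the document."""
    weights = Counter(tokenize(document))  # term -> term frequency
    return sum(weights[term] for term in tokenize(query))

# Hypothetical toy collection and query.
docs = {
    "d1": "cats and dogs and more cats",
    "d2": "dogs chase cars",
    "d3": "cats sleep all day",
}
query = "cats dogs"

# Rank documents by descending score.
ranking = sorted(docs, key=lambda d: tf_score(query, docs[d]), reverse=True)
print([(d, tf_score(query, docs[d])) for d in ranking])
```

Unlike the Boolean model, every document receives a score, so the results can be ranked.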

The Probabilistic Retrieval Model
• Similarities are computed as probabilities that a document is
relevant for a given query
• Example: BM25

• Probabilistic theorems such as Bayes' theorem are often used in these models
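
The slide names BM25 without giving its formula. The sketch below uses one commonly used variant, with a smoothed IDF and the usual parameters `k1` and `b`; the parameter values, the function name `bm25_scores`, and the toy documents are assumptions rather than the course's own definition. It also assumes a non-empty collection.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against the query with a common BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # a term absent from the document contributes nothing
            # Smoothed inverse document frequency (always positive).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy collection.
docs = [
    "cats are small domestic cats",
    "dogs are loyal animals",
    "the musical cats opened in london",
]
scores = bm25_scores("cats", docs)
print(scores)
```

Compared with raw term frequency, repeated occurrences of a term add less and less to the score, and longer documents are penalized relative to the average document length.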

Remarks
• Applications of IR
• IR definition
• IR and the related fields
• Issues in Information Retrieval
• Based on their mathematical basis, IR models can be divided into three types:
  • Set-theoretic methods
  • Algebraic methods
  • Probabilistic methods

Read More
• Chapters 1 and 2, IIR
• Chapter 1, SE

Acknowledgements
Many slides in this section are adapted from the slides of Prof. Christopher Manning (Stanford)
