Introduction to
Information Retrieval
Course Outline
• Course No: IS418
• Course Title: Information Storage and Retrieval
• Hours/week: 2 Lecture, 2 Lab
• Year: 2023/2024 – 4th Year
• Semester: First
• Exam Hours: 2
Assessment Methods:
• Midterm Exam: 20%
• Oral Examination & Lab: 10%
• Practical Examination: 10%
• Final-term Examination: 60%
• Total: 100%
Course Resources
• Textbook:
– Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, "Introduction to Information Retrieval", Cambridge University Press, Cambridge, England, 2009.
• Additional Materials:
– Lecture Slides.
Course Content
• Chap. 1: Introducing Information Retrieval and Web Search
• Chap. 2: The term vocabulary and postings lists
• Chap. 3: Dictionaries and tolerant retrieval
• Chap. 6: Scoring, term weighting and the vector space model
• Chap. 8: Evaluation in information retrieval
Introduction to
Information Retrieval
Introducing Information Retrieval
and Web Search
Introduction
[Slide image: Google web search]
Basic assumptions of Information Retrieval
• Collection: A set of documents over which we perform retrieval
– Sometimes referred to as a corpus
– Assume it is a static collection for the moment
• Information Need: the topic about which the user desires to know more; it is differentiated from a query.
• Query: what the user conveys to the computer in an attempt to communicate the information need.
• Relevance: a document is relevant if the user perceives it as containing information of value with respect to their personal information need.
The problem
• Goal = find documents relevant to the user's information need from a large document set

[Diagram: the user expresses an information need as a query to the IR system; the system performs retrieval over the document collection and returns an answer list.]
Information Retrieval
• Information Retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).
– These days we frequently think first of web search, but there are many
other cases:
• E-mail search
• Searching your laptop
• Legal information retrieval
Possible Approaches to Information Retrieval
• Grep: the simplest form of document retrieval is for a computer to do a linear scan through the documents (called grepping through text; see the sketch below).
– grep is a Unix command that performs this process.
• String matching (linear search in documents):
• Slow
• Difficult to improve
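As a toy illustration (not from the textbook), a grep-style linear scan in Python over a hypothetical two-document collection; every query must re-read every document in full, which is what makes the approach slow:

```python
def grep_search(documents, word):
    """Linear scan: read every document in full for every query."""
    hits = []
    for name, text in documents:
        if word.lower() in text.lower():
            hits.append(name)
    return hits

docs = [
    ("doc1", "Brutus killed Caesar in the Capitol."),
    ("doc2", "Calpurnia warned Caesar of the Ides of March."),
]
print(grep_search(docs, "Brutus"))  # ['doc1']
```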
Main issues in IR
• Query evaluation (or retrieval process)
– To what extent does a document correspond to a
query?
• System evaluation
– How good is a system?
– Are the retrieved documents relevant? (precision)
– Are all the relevant documents retrieved? (recall)
How good are the retrieved docs?
▪ Effectiveness: the quality of an IR system's search results.
▪ A user usually wants to know two key statistics about the system's results for a query:
▪ Precision: the fraction of retrieved docs that are relevant to the user's information need.
▪ Recall: the fraction of relevant docs in the collection that are retrieved.
▪ More precise definitions and measurements to follow later; a toy calculation is sketched below.
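A toy calculation of the two statistics, using hypothetical sets of retrieved and relevant docIDs:

```python
retrieved = {1, 2, 4, 7}   # docIDs the system returned (hypothetical)
relevant  = {2, 3, 4, 5}   # docIDs the user would judge relevant (hypothetical)

hits = retrieved & relevant             # relevant docs that were retrieved
precision = len(hits) / len(retrieved)  # 2/4 = 0.5
recall    = len(hits) / len(relevant)   # 2/4 = 0.5
print(precision, recall)
```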
Introduction to
Information Retrieval
Structured vs. Unstructured Data
IR vs. databases:
Structured vs unstructured data
• Structured data tends to refer to information in “tables”
Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000
Ivy Smith 50000
Typically allows numerical range and exact match (for text) queries,
e.g.,
Salary < 60000 AND Manager = Smith.
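As a sketch of what "numerical range plus exact match" means operationally, here is that query evaluated over the table above, with a hypothetical Python list of records standing in for a database engine:

```python
employees = [
    {"Employee": "Smith", "Manager": "Jones", "Salary": 50000},
    {"Employee": "Chang", "Manager": "Smith", "Salary": 60000},
    {"Employee": "Ivy",   "Manager": "Smith", "Salary": 50000},
]

# Salary < 60000 AND Manager = Smith
matches = [e["Employee"] for e in employees
           if e["Salary"] < 60000 and e["Manager"] == "Smith"]
print(matches)  # ['Ivy']
```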
Unstructured data
• Typically refers to free text
• Allows
– Keyword queries including operators
– More sophisticated "concept" queries, e.g.:
• find all web pages dealing with drug abuse
• Classic model for searching text documents
Semi-structured data
• In fact, almost no data is “unstructured”
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• … to say nothing of linguistic structure
• IR is also used to facilitate "semi-structured" search, such as
– Finding a document where the Title contains data AND Bullets contain search
– Title contains Java AND Body contains threading
• Or even
– Title is about Object Oriented Programming AND Author something like stro*rup
– where * is the wild-card operator
Unstructured (text) vs. structured (database) data in 1996
[Chart not reproduced]
Unstructured (text) vs. structured (database) data in 2009
[Chart not reproduced]
Unstructured (text) vs. structured (database) data today
[Chart not reproduced]
Introduction to
Information Retrieval
Term-document incidence matrices
An example information retrieval problem
• Which plays of Shakespeare contain the words Brutus AND Caesar but
NOT Calpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then
strip out lines containing Calpurnia?
• Why is that not the answer?
– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Other operations (e.g., find the word Romans near countrymen) not feasible
– Ranked retrieval (best documents to return)
An example information retrieval solution
• The way to avoid linearly scanning the texts for each query is to INDEX the documents in advance.
• We will use the index to introduce the basics of the Boolean retrieval model.
Boolean retrieval model
• Suppose we record, for each document (a play of Shakespeare's), whether it contains each word out of all the words Shakespeare used (about 32,000 different words).
• Boolean Retrieval Model: a model for information retrieval in which we can pose any query that takes the form of a Boolean expression of terms, i.e., terms combined with the operators AND, OR, and NOT.
– The model views each document as a set of words.
• The result is a binary term-document "incidence matrix".
• Terms are the indexed units.
– Terms are usually words; some terms are phrases.
Term-document incidence matrices
• Query: Brutus AND Caesar BUT NOT Calpurnia
[Matrix of terms (rows) × Shakespeare plays (columns); each cell is 1 if the play contains the word, 0 otherwise.]
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complement the last), then bitwise AND them, as in the sketch below.
– 110100 AND 110111 AND 101111 = 100100
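A minimal sketch of this query in Python, assuming the textbook's six-play ordering (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth), with each incidence vector stored as a six-bit integer:

```python
vectors = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}
MASK = 0b111111  # six plays, one bit per play

# Brutus AND Caesar AND NOT Calpurnia: complement the last vector,
# then bitwise-AND all three.
result = vectors["Brutus"] & vectors["Caesar"] & (~vectors["Calpurnia"] & MASK)
print(f"{result:06b}")  # 100100 -> Antony and Cleopatra, Hamlet
```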
Answers to query
• Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar I was killed i’ the
Capitol; Brutus killed me.
Bigger collections
• Consider a corpus of N = 1 million documents,
• each document with about 1,000 words,
• averaging 6 bytes per word, including spaces and punctuation.
• Size of corpus = 1 million × 1,000 × 6 bytes = 6 GB.
• Number of distinct terms: M = 500,000 distinct terms among these documents.
• Number of cells in the term-document matrix = 1 million × 500,000 = 0.5 trillion (far too much for memory).
• Can we cut down on the space? (See the back-of-envelope check below.)
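The same back-of-envelope numbers as a quick check:

```python
N = 1_000_000   # documents
L = 1_000       # words per document
B = 6           # bytes per word (including spaces/punctuation)
M = 500_000     # distinct terms

print(N * L * B / 1e9)   # corpus size: 6.0 GB
print(N * M / 1e12)      # matrix cells: 0.5 trillion
```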
Can’t build the term-document matrix
• A 500K × 1M matrix has half a trillion 0's and 1's (at most 0.2% of the cells can hold a 1).
• Too many to fit in a computer's memory.
• But it has no more than one billion 1's. Why?
– Each document has about 1,000 words, so there are at most 1 million × 1,000 = 1 billion (term, document) incidences.
– The matrix is extremely sparse: at minimum, 99.8% of the cells are zero.
• What's a better representation?
– Record only the things that do occur, i.e., only the 1 positions.
Introduction to
Information Retrieval
The Inverted Index
The key data structure underlying
modern IR
Inverted index
• It is sometimes called an inverted file.
• It keeps a dictionary of terms (sometimes referred to as a vocabulary or lexicon).
– We use dictionary for the data structure and vocabulary for the set of terms.
• Postings list (inverted list): a list that records which documents a term occurs in.
• Posting: each item in such a list, recording that a term appeared in a document (often along with its positions in the document).
• All the postings lists taken together are referred to as the postings.
• The dictionary is sorted alphabetically and each postings list is sorted by document ID.
Inverted index
• For each term t, we must store a list of all the documents that contain t.
– Identify each doc by a docID: a unique serial number known as the document identifier.
• Can we use fixed-size arrays for this?
Brutus 1 2 4 11 31 45 173 174
Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101
What happens if the word Caesar is added to document 14?
Inverted index
• We need variable-size postings lists
– On disk, a contiguous run of postings is normal and best.
– In memory, we can use linked lists or variable-length arrays.
• There are tradeoffs in size and ease of insertion.

Dictionary        Postings (sorted by docID; more on this later)
Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Inverted index construction
1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings. (A minimal sketch follows.)
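A minimal end-to-end sketch of these four steps, assuming a hypothetical two-document in-memory corpus and only punctuation-stripping plus lowercasing as the "linguistic preprocessing":

```python
from collections import defaultdict

docs = {                      # 1. collect the documents (hypothetical corpus)
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

index = defaultdict(list)     # dictionary: term -> postings list
for doc_id in sorted(docs):
    tokens = docs[doc_id].split()                        # 2. tokenize (naively)
    terms = {t.strip(".,;:'").lower() for t in tokens}   # 3. normalize; the set
    for term in sorted(terms):                           #    merges duplicates
        index[term].append(doc_id)   # 4. postings stay sorted by docID

print(index["brutus"])        # [1, 2]
print(len(index["caesar"]))   # document frequency of 'caesar' -> 2
```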
Inverted index construction

Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
        ↓ Indexer
Inverted index:           friend     → 2 4
                          roman      → 1 2
                          countryman → 13 16
Initial stages of text processing
• Tokenization
– Cut character sequence into word tokens
• Deal with “John’s”, a state-of-the-art solution
• Normalization
– Map text and query term to same form
• You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
• authorize, authorization
• Stop words
– We may omit very common words (or not)
• the, a, to, of
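A toy sketch of these stages, with a deliberately crude suffix-stripping "stemmer" and a small stop-word list standing in for real linguistic modules (both are illustrative assumptions, not a real stemming algorithm):

```python
STOP_WORDS = {"the", "a", "to", "of"}

def crude_stem(term):
    # Toy stand-in for a real stemmer (e.g., Porter): strip a few suffixes.
    for suffix in ("ization", "ize", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    tokens = text.split()                                  # tokenization
    tokens = [t.strip(".,!?'\"").lower() for t in tokens]  # normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    return [crude_stem(t) for t in tokens]                 # stemming

print(preprocess("The authorization to authorize the users"))
# ['author', 'author', 'user']
```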
Indexer steps: Token sequence
• Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Indexer steps: Sort
• Sort by terms
– and then by docID
• This is the core indexing step.
Indexer steps: Dictionary & Postings
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings.
• Document frequency information is added.
– Document frequency: the number of documents that contain each term (which is the length of the term's postings list).
• Why frequency? We will discuss this later.
Where do we pay in storage?
• Terms and counts (the dictionary)
• Lists of docIDs (the postings)
• Pointers from dictionary entries to postings lists

IR system implementation questions:
• How do we index efficiently?
• How much storage do we need?
What data structure should be used for a postings list?
• A fixed-length array would be wasteful: some words occur in many documents, others in very few.
• Two good alternatives are linked lists and variable-length arrays.
• We can use a hybrid scheme, with a linked list of fixed-length arrays for each term, as in the sketch below.
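A minimal sketch of the hybrid scheme, assuming fixed-size blocks of four docIDs chained together (the block size is an illustrative choice): appends are cheap, and each block keeps a run of docIDs contiguous in memory.

```python
class PostingsBlock:
    """Fixed-size array of docIDs plus a pointer to the next block."""
    SIZE = 4

    def __init__(self):
        self.doc_ids = []   # holds up to SIZE docIDs
        self.next = None    # link to the next block, if any

class PostingsList:
    def __init__(self):
        self.head = self.tail = PostingsBlock()

    def append(self, doc_id):
        if len(self.tail.doc_ids) == PostingsBlock.SIZE:
            self.tail.next = PostingsBlock()   # chain a fresh block
            self.tail = self.tail.next
        self.tail.doc_ids.append(doc_id)       # docIDs appended in sorted order

    def __iter__(self):
        block = self.head
        while block is not None:
            yield from block.doc_ids
            block = block.next

plist = PostingsList()
for d in [1, 2, 4, 11, 31, 45, 173, 174]:   # Brutus's postings from the slide
    plist.append(d)
print(list(plist))  # [1, 2, 4, 11, 31, 45, 173, 174]
```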