CSCI 7000
Modern Information Retrieval
Lecture 1: Introduction
Information Retrieval
Information retrieval is the science of searching for information in
documents, searching for documents themselves, searching for
metadata which describe documents, or searching within databases,
whether relational stand-alone databases or hypertextually-networked
databases such as the World Wide Web.
Wikipedia
Finding material of an unstructured nature that satisfies an information
need from within large collections.
Manning et al., 2008
The study of methods and structures used to represent and access
information.
Witten et al.
[Image: Salton's textbook, in which the classic IR definition can be found]
IR deals with the representation, storage, organization of, and access to
information items.
Salton
Information retrieval is the term conventionally, though somewhat
inaccurately, applied to the type of activity discussed in this volume.
van Rijsbergen
IR is now largely what Google does…
Ad hoc retrieval is the core task from which
modern IR systems start:
One-shot information-seeking attempts by
ignorant users
Ignorant of the structure of the collection
Ignorant of how the system works
Ignorant of how to formulate queries
Typically textual documents, but video and
audio are becoming more prevalent.
Collections are heterogeneous in nature.
But...
The real action right now lies in Web 2.0
issues...
Dealing with User Generated Content
Discussion forums
Blogs
Microblogs
To deal with:
Sentiment, opinions, etc.
Social networks
Tribes, influencers
Other Hot Topics
Image search
How to index images
With and without additional information like
captions
Multilingual issues
Cross-language search and indexing
Spoken language issues
ASR for indexing videos
Manning…
Most of today’s slides were stolen/adapted
from Chris Manning…
[Chart: Unstructured (text) vs. structured (database) data in 1996]
[Chart: Unstructured (text) vs. structured (database) data in 2006]
Boulder players
Course Plan
Cover the basics of IR technology in the
first part of the course
Read papers/investigate newer topics in
the latter part
Use case studies of real companies
throughout the semester
Project presentations and discussions for
the last section of the class.
I expect informed participation.
Last year...
We followed one company in the tech news
quite a bit...
Powerset
NLP-based search technology
Most of us were pretty skeptical. It seemed
like a lot of hype and little sensible work.
Acquired by MS for $100M last month...
Shows you what I know
This year
Cuil (pronounced cool)
Go to the web
Unstructured Data Scenario
Which plays of Shakespeare contain the
words Brutus AND Caesar but NOT
Calpurnia?
One could grep all of Shakespeare’s plays
for Brutus and Caesar, then strip out
lines containing Calpurnia (see the sketch
after this list). This is problematic:
Slow (for large corpora)
NOT Calpurnia is non-trivial
Other operations (e.g., find the word
Romans near countrymen) are not feasible
No ranked retrieval (no way to return the best documents first)
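As a rough illustration, here is a minimal Python sketch of the grep-style approach; the plays/ directory, file layout, and whole-text substring matching are all assumptions for the example:

    # Naive linear scan ("grep") over every play for each query.
    import glob

    def plays_matching(must_have, must_not_have):
        """Return files containing every word in must_have and
        no word in must_not_have (naive substring matching)."""
        hits = []
        for path in glob.glob("plays/*.txt"):
            text = open(path, encoding="utf-8").read().lower()
            if all(w in text for w in must_have) and \
               not any(w in text for w in must_not_have):
                hits.append(path)
        return hits

    print(plays_matching(["brutus", "caesar"], ["calpurnia"]))

Every play is re-read in full on every query, which is exactly why this approach is slow for large corpora.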
Term-Document Matrix
(entry is 1 if the play contains the word, 0 otherwise)

           Antony &   Julius   The
           Cleopatra  Caesar   Tempest  Hamlet  Othello  Macbeth
Antony        1         1        0        0       0        1
Brutus        1         1        0        1       0        0
Caesar        1         1        0        1       1        1
Calpurnia     0         1        0        0       0        0
Cleopatra     1         0        0        0       0        0
mercy         1         0        1        1       1        1
worser        1         0        1        1       1        0

Query: Brutus AND Caesar but NOT Calpurnia
Incidence vectors
So we have a 0/1 vector for each term.
To answer query: take the vectors for
Brutus, Caesar and Calpurnia
(complemented) ➨ bitwise AND.
110100 AND 110111 AND 101111 =
100100.
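A minimal sketch of this bitwise AND in Python, using the vectors from the matrix above (the variable names and the mask trick are mine):

    # Incidence vectors, one bit per play, ordered as in the matrix:
    # Antony&Cleopatra, Julius Caesar, Tempest, Hamlet, Othello, Macbeth.
    brutus    = 0b110100
    caesar    = 0b110111
    calpurnia = 0b010000

    NUM_PLAYS = 6
    mask = (1 << NUM_PLAYS) - 1                # 0b111111, for complementing

    result = brutus & caesar & (~calpurnia & mask)
    print(format(result, "06b"))               # -> 100100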
Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the
Capitol; Brutus killed me.
Bigger corpora
Consider N = 1M documents, each with
about 1K terms.
Avg 6 bytes/term including spaces and
punctuation
1M docs × 1K terms × 6 bytes = 6GB of data in the documents.
Say there are m = 500K distinct terms
among these.
Types vs. Tokens
Can’t build the matrix explicitly
500K x 1M matrix has half-a-trillion 0’s
and 1’s.
But it has no more than one billion 1’s. Why?
(1M documents × at most 1K terms each = at most 10^9 term occurrences.)
The matrix is extremely sparse.
What’s a better representation?
We only record the 1 positions.
Inverted index
For each term T, we must store a list of all
documents that contain T.
Use an array or a list for this?
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16
What happens if the word Caesar
is added to document 14?
Inverted index
Linked lists generally preferred to arrays
Dynamic space allocation
Insertion of terms into documents easy
Space overhead of pointers

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

The dictionary (the terms, on the left) maps each term to its postings
list (the docIDs, on the right); each docID in a list is called a posting.
Postings are sorted by docID (more later on why).
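A minimal sketch of this structure in Python, using a dict of sorted lists in place of linked lists (all names are mine); the out-of-order add at the end shows what happens when Caesar is added to document 14:

    from collections import defaultdict

    # Dictionary maps each term to its postings list (sorted docIDs).
    index = defaultdict(list)

    def add(term, doc_id):
        """Record that doc_id contains term, keeping postings sorted."""
        postings = index[term]
        if not postings or postings[-1] < doc_id:
            postings.append(doc_id)    # common case: docIDs arrive in order
        elif doc_id not in postings:
            postings.append(doc_id)    # out-of-order insertion...
            postings.sort()            # ...requires a re-sort (or a splice)

    add("Brutus", 2); add("Brutus", 4)
    add("Caesar", 13); add("Caesar", 16)
    add("Caesar", 14)                  # the question from the previous slide
    print(index["Caesar"])             # -> [13, 14, 16]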
Inverted index construction

Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
        ↓ Linguistic modules (more on these later)
Modified tokens:          friend  roman  countryman
        ↓ Indexer
Inverted index:           friend     → 2 4
                          roman      → 1 2
                          countryman → 13 16
Indexer steps

Doc 1: I did enact Julius Caesar I was killed
       i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble
       Brutus hath told you Caesar was ambitious

Sequence of (Modified token, Document ID) pairs:

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2
Sort by terms (the core indexing step):

Before            After
Term       Doc #  Term       Doc #
I          1      ambitious  2
did        1      be         2
enact      1      brutus     1
julius     1      brutus     2
caesar     1      capitol    1
I          1      caesar     1
was        1      caesar     2
killed     1      caesar     2
i'         1      did        1
the        1      enact      1
capitol    1      hath       2
brutus     1      I          1
killed     1      I          1
me         1      i'         1
so         2      it         2
let        2      julius     1
it         2      killed     1
be         2      killed     1
with       2      let        2
caesar     2      me         1
the        2      noble      2
noble      2      so         2
brutus     2      the        1
hath       2      the        2
told       2      told       2
you        2      you        2
caesar     2      was        1
was        2      was        2
ambitious  2      with       2
Multiple term entries in a single document are merged,
and frequency information is added.
(We’ll see why frequency matters later.)

Term       Doc #  Term freq
ambitious  2      1
be         2      1
brutus     1      1
brutus     2      1
capitol    1      1
caesar     1      1
caesar     2      2
did        1      1
enact      1      1
hath       2      1
I          1      2
i'         1      1
it         2      1
julius     1      1
killed     1      2
let        2      1
me         1      1
noble      2      1
so         2      1
the        1      1
the        2      1
told       2      1
you        2      1
was        1      1
was        2      1
with       2      1
The result is split into a Dictionary file
and a Postings file.

Dictionary                       Postings
Term       N docs  Coll freq     (Doc #, Freq)
ambitious  1       1             (2, 1)
be         1       1             (2, 1)
brutus     2       2             (1, 1) (2, 1)
capitol    1       1             (1, 1)
caesar     2       3             (1, 1) (2, 2)
did        1       1             (1, 1)
enact      1       1             (1, 1)
hath       1       1             (2, 1)
I          1       2             (1, 2)
i'         1       1             (1, 1)
it         1       1             (2, 1)
julius     1       1             (1, 1)
killed     1       2             (1, 2)
let        1       1             (2, 1)
me         1       1             (1, 1)
noble      1       1             (2, 1)
so         1       1             (2, 1)
the        2       2             (1, 1) (2, 1)
told       1       1             (2, 1)
you        1       1             (2, 1)
was        2       2             (1, 1) (2, 1)
with       1       1             (2, 1)
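These indexer steps fit in a few lines of Python; a minimal sketch (the tokenizer here is simplistic and lowercases everything, unlike the slides, and all names are mine):

    from collections import Counter, defaultdict

    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
    }

    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(tok.strip(".;,").lower(), doc_id)
             for doc_id, text in docs.items()
             for tok in text.split()]

    # Step 2: sort by term, then docID (the core indexing step).
    pairs.sort()

    # Step 3: merge duplicate (term, docID) entries; record term frequencies.
    freqs = Counter(pairs)                 # (term, docID) -> term freq

    # Step 4: split into dictionary statistics and postings lists.
    postings = defaultdict(list)           # term -> [(docID, freq), ...]
    for (term, doc_id), tf in sorted(freqs.items()):
        postings[term].append((doc_id, tf))

    for term, plist in postings.items():
        n_docs, coll_freq = len(plist), sum(tf for _, tf in plist)
        print(term, n_docs, coll_freq, plist)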
Storage costs?

[Same dictionary and postings tables as above, with callouts marking
the Terms (the dictionary entries) and the Pointers from the
dictionary into the postings file.]
Distributed Systems
How would you duplicate/partition/distribute
this index if you were operating a large, parallel,
distributed, high-availability system?
I.e., what would Google do?
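One common answer, not spelled out on the slide, is document partitioning: shard the documents across machines, build a full inverted index per shard, and merge per-shard results at query time. A toy sketch, with every name my own:

    NUM_SHARDS = 4

    def shard_for(doc_id):
        """Assign each document to a shard (simple modular hashing)."""
        return doc_id % NUM_SHARDS

    def search(term, shard_indexes):
        """Broadcast the query to every shard's index (a dict of
        term -> postings) and merge the results; in practice the
        per-shard lookups would be parallel RPCs, and each shard
        would be replicated for availability."""
        results = []
        for index in shard_indexes:
            results.extend(index.get(term, []))
        return sorted(results)

The alternative, term partitioning (splitting the dictionary across machines), is also possible, but multi-term queries then have to cross machine boundaries.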
Administrivia
Work/Grading:
Problem sets and programming exercises: 50%
Quizzes: 20%
Group Project: 30%
Textbooks:
Introduction to Information Retrieval ---
Manning, Raghavan and Schütze
Programming Collective Intelligence --- Toby Segaran
Administrivia
The exercises (and group project) will use
Lucene (lucene.apache.org)
Open-source full-text indexing system
Guest lectures from local industry
Umbria (J.D. Power)
Google
Lijit
Collective Intellect
Administrivia
Professor: Jim Martin
[email protected] ECOT 735
Office hours TBA
www.cs.colorado.edu/~martin/csci7000/
Next time
Read Chapter 1 of both texts for next time