CSCI 7000
Modern Information Retrieval
Lecture 1: Introduction
Information Retrieval
Information retrieval is the science of searching for information in
documents, searching for documents themselves, searching for
metadata which describe documents, or searching within databases,
whether relational stand-alone databases or hypertextually-networked
databases such as the World Wide Web.
Wikipedia
Finding material of an unstructured nature that satisfies an information
need from within large collections.
Manning et al., 2008
The study of methods and structures used to represent and access
information.
Witten et al.
[Image: Salton's textbook, in which the classic IR definition can be found]
IR deals with the representation, storage, organization of, and access to
information items.
Salton
Information retrieval is the term conventionally, though somewhat
inaccurately, applied to the type of activity discussed in this volume.
van Rijsbergen
IR is now largely what Google does…
Ad hoc retrieval is the core task from which
modern IR systems start:
One-shot information-seeking attempts by
ignorant users
Ignorant of the structure of the collection
Ignorant of how the system works
Ignorant of how to formulate queries
Typically textual documents, but video and
audio are becoming more prevalent.
Collections are heterogeneous in nature.
But...
The real action right now lies in Web 2.0
issues...
Dealing with User Generated Content
Discussion forums
Blogs
Microblogs
To deal with:
Sentiment, opinions, etc.
Social networks
Tribes, influencers
Other Hot Topics
Image search
How to index images
With and without additional information like
captions
Multilingual issues
Cross-language search and indexing
Spoken language issues
ASR for indexing videos
Manning…
Most of today’s slides were stolen/adapted
from Chris Manning…
[Chart: Unstructured (text) vs. structured (database) data in 1996]
[Chart: Unstructured (text) vs. structured (database) data in 2006]
Boulder players
Course Plan
Cover the basics of IR technology in the
first part of the course
Read papers/investigate newer topics in
the latter part
Use case studies of real companies
throughout the semester
Project presentations and discussions for
the last section of the class.
I expect informed participation.
Last year...
We followed one company in the tech news
quite a bit...
Powerset
NLP-based search technology
Most of us were pretty skeptical. It seemed
like a lot of hype and little sensible work.
Acquired by MS for $100M last month...
Shows you what I know
This year
Cuil (pronounced cool)
Go to the web
Unstructured Data Scenario
Which plays of Shakespeare contain the
words Brutus AND Caesar but NOT
Calpurnia?
One could grep all of Shakespeare’s plays
for Brutus and Caesar, then strip out
lines containing Calpurnia (see the sketch
after this list). This is problematic:
Slow (for large corpora)
NOT Calpurnia is non-trivial
Other operations (e.g., find the word
Romans near countrymen) are not feasible
No ranked retrieval (no way to return the best documents first)
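As a rough illustration, here is a minimal Python sketch of the grep-style approach; the plays/ directory, file layout, and whole-text substring matching are all assumptions for the example:

    # Naive linear scan ("grep") over every play for each query.
    import glob

    def plays_matching(must_have, must_not_have):
        """Return files containing every word in must_have and
        no word in must_not_have (naive substring matching)."""
        hits = []
        for path in glob.glob("plays/*.txt"):
            text = open(path, encoding="utf-8").read().lower()
            if all(w in text for w in must_have) and \
               not any(w in text for w in must_not_have):
                hits.append(path)
        return hits

    print(plays_matching(["brutus", "caesar"], ["calpurnia"]))

Every play is re-read in full on every query, which is exactly why this approach is slow for large corpora.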
Term-Document Matrix
(entry is 1 if the play contains the word, 0 otherwise)

           Antony &   Julius   The
           Cleopatra  Caesar   Tempest  Hamlet  Othello  Macbeth
Antony        1         1        0        0       0        1
Brutus        1         1        0        1       0        0
Caesar        1         1        0        1       1        1
Calpurnia     0         1        0        0       0        0
Cleopatra     1         0        0        0       0        0
mercy         1         0        1        1       1        1
worser        1         0        1        1       1        0

Query: Brutus AND Caesar but NOT Calpurnia
Incidence vectors
So we have a 0/1 vector for each term.
To answer query: take the vectors for
Brutus, Caesar and Calpurnia
(complemented) ➨ bitwise AND.
110100 AND 110111 AND 101111 =
100100.
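A minimal sketch of this bitwise AND in Python, using the vectors from the matrix above (the variable names and the mask trick are mine):

    # Incidence vectors, one bit per play, ordered as in the matrix:
    # Antony&Cleopatra, Julius Caesar, Tempest, Hamlet, Othello, Macbeth.
    brutus    = 0b110100
    caesar    = 0b110111
    calpurnia = 0b010000

    NUM_PLAYS = 6
    mask = (1 << NUM_PLAYS) - 1                # 0b111111, for complementing

    result = brutus & caesar & (~calpurnia & mask)
    print(format(result, "06b"))               # -> 100100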
Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the
Capitol; Brutus killed me.
Bigger corpora
Consider N = 1M documents, each with
about 1K terms.
Avg 6 bytes/term including spaces and
punctuation
1M docs × 1K terms × 6 bytes = 6GB of data in the documents.
Say there are m = 500K distinct terms
among these.
Types vs. Tokens
Can’t build the matrix explicitly
500K x 1M matrix has half-a-trillion 0’s
and 1’s.
But it has no more than one billion 1’s. Why?
(1M documents × at most 1K terms each = at most 10^9 term occurrences.)
The matrix is extremely sparse.
What’s a better representation?
We only record the 1 positions.
Inverted index
For each term T, we must store a list of all
documents that contain T.
Use an array or a list for this?
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16
What happens if the word Caesar
is added to document 14?
Inverted index
Linked lists generally preferred to arrays
Dynamic space allocation
Insertion of terms into documents easy
Space overhead of pointers

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

The dictionary (the terms, on the left) maps each term to its postings
list (the docIDs, on the right); each docID in a list is called a posting.
Postings are sorted by docID (more later on why).
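A minimal sketch of this structure in Python, using a dict of sorted lists in place of linked lists (all names are mine); the out-of-order add at the end shows what happens when Caesar is added to document 14:

    from collections import defaultdict

    # Dictionary maps each term to its postings list (sorted docIDs).
    index = defaultdict(list)

    def add(term, doc_id):
        """Record that doc_id contains term, keeping postings sorted."""
        postings = index[term]
        if not postings or postings[-1] < doc_id:
            postings.append(doc_id)    # common case: docIDs arrive in order
        elif doc_id not in postings:
            postings.append(doc_id)    # out-of-order insertion...
            postings.sort()            # ...requires a re-sort (or a splice)

    add("Brutus", 2); add("Brutus", 4)
    add("Caesar", 13); add("Caesar", 16)
    add("Caesar", 14)                  # the question from the previous slide
    print(index["Caesar"])             # -> [13, 14, 16]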
Inverted index construction

Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
        ↓ Linguistic modules (more on these later)
Modified tokens:          friend  roman  countryman
        ↓ Indexer
Inverted index:           friend     → 2 4
                          roman      → 1 2
                          countryman → 13 16
Indexer steps

Doc 1: I did enact Julius Caesar I was killed
       i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble
       Brutus hath told you Caesar was ambitious

Sequence of (Modified token, Document ID) pairs:

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2
Sort by terms (the core indexing step):

Before            After
Term       Doc #  Term       Doc #
I          1      ambitious  2
did        1      be         2
enact      1      brutus     1
julius     1      brutus     2
caesar     1      capitol    1
I          1      caesar     1
was        1      caesar     2
killed     1      caesar     2
i'         1      did        1
the        1      enact      1
capitol    1      hath       2
brutus     1      I          1
killed     1      I          1
me         1      i'         1
so         2      it         2
let        2      julius     1
it         2      killed     1
be         2      killed     1
with       2      let        2
caesar     2      me         1
the        2      noble      2
noble      2      so         2
brutus     2      the        1
hath       2      the        2
told       2      told       2
you        2      you        2
caesar     2      was        1
was        2      was        2
ambitious  2      with       2
Multiple term entries in a single document are merged,
and frequency information is added.
(We’ll see why frequency matters later.)

Term       Doc #  Term freq
ambitious  2      1
be         2      1
brutus     1      1
brutus     2      1
capitol    1      1
caesar     1      1
caesar     2      2
did        1      1
enact      1      1
hath       2      1
I          1      2
i'         1      1
it         2      1
julius     1      1
killed     1      2
let        2      1
me         1      1
noble      2      1
so         2      1
the        1      1
the        2      1
told       2      1
you        2      1
was        1      1
was        2      1
with       2      1
The result is split into a Dictionary file
and a Postings file.

Dictionary                       Postings
Term       N docs  Coll freq     (Doc #, Freq)
ambitious  1       1             (2, 1)
be         1       1             (2, 1)
brutus     2       2             (1, 1) (2, 1)
capitol    1       1             (1, 1)
caesar     2       3             (1, 1) (2, 2)
did        1       1             (1, 1)
enact      1       1             (1, 1)
hath       1       1             (2, 1)
I          1       2             (1, 2)
i'         1       1             (1, 1)
it         1       1             (2, 1)
julius     1       1             (1, 1)
killed     1       2             (1, 2)
let        1       1             (2, 1)
me         1       1             (1, 1)
noble      1       1             (2, 1)
so         1       1             (2, 1)
the        2       2             (1, 1) (2, 1)
told       1       1             (2, 1)
you        1       1             (2, 1)
was        2       2             (1, 1) (2, 1)
with       1       1             (2, 1)
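These indexer steps fit in a few lines of Python; a minimal sketch (the tokenizer here is simplistic and lowercases everything, unlike the slides, and all names are mine):

    from collections import Counter, defaultdict

    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
    }

    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(tok.strip(".;,").lower(), doc_id)
             for doc_id, text in docs.items()
             for tok in text.split()]

    # Step 2: sort by term, then docID (the core indexing step).
    pairs.sort()

    # Step 3: merge duplicate (term, docID) entries; record term frequencies.
    freqs = Counter(pairs)                 # (term, docID) -> term freq

    # Step 4: split into dictionary statistics and postings lists.
    postings = defaultdict(list)           # term -> [(docID, freq), ...]
    for (term, doc_id), tf in sorted(freqs.items()):
        postings[term].append((doc_id, tf))

    for term, plist in postings.items():
        n_docs, coll_freq = len(plist), sum(tf for _, tf in plist)
        print(term, n_docs, coll_freq, plist)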
Storage costs?

[Same dictionary and postings tables as above, with callouts marking
the Terms (the dictionary entries) and the Pointers from the
dictionary into the postings file.]
Distributed Systems
How would you duplicate/partition/distribute
this index if you were operating a large, parallel,
distributed, high-availability system?
I.e., what would Google do?
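One common answer, not spelled out on the slide, is document partitioning: shard the documents across machines, build a full inverted index per shard, and merge per-shard results at query time. A toy sketch, with every name my own:

    NUM_SHARDS = 4

    def shard_for(doc_id):
        """Assign each document to a shard (simple modular hashing)."""
        return doc_id % NUM_SHARDS

    def search(term, shard_indexes):
        """Broadcast the query to every shard's index (a dict of
        term -> postings) and merge the results; in practice the
        per-shard lookups would be parallel RPCs, and each shard
        would be replicated for availability."""
        results = []
        for index in shard_indexes:
            results.extend(index.get(term, []))
        return sorted(results)

The alternative, term partitioning (splitting the dictionary across machines), is also possible, but multi-term queries then have to cross machine boundaries.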
Administrivia
Work/Grading:
Problem sets and programming exercises: 50%
Quizzes: 20%
Group Project: 30%
Textbooks:
Introduction to Information Retrieval ---
Manning, Raghavan and Schütze
Programming Collective Intelligence --- Toby Segaran
Administrivia
The exercises (and group project) will use
Lucene (lucene.apache.org)
Open-source full-text indexing system
Guest lectures from local industry
Umbria (J.D. Power)
Google
Lijit
Collective Intellect
Administrivia
Professor: Jim Martin
[email protected] ECOT 735
Office hours TBA
www.cs.colorado.edu/~martin/csci7000/
Next time
Read Chapter 1 of both texts for next time