Lecture 1
Techniques
Objectives
• Introduce the field of Information Retrieval (IR).
Hyderabad
The Internet is enormous
http://www.nature.com/nature/webmatters/tomog/tomfigs/fig1.htm
Prototype two-dimensional image depicting global connectivity among ISPs as viewed from
skitter host.
Why IR?
• Web Sites Increasing Sharply
• Google
• 3 billion documents indexed
• 10-20 TB of text on web
• ~ 1000 TB of information produced every year
How big is Facebook’s data?
• 2.5 billion content items shared per day (status
updates + wall posts + photos + videos + comments)
• 2.7 billion Likes per day, 300 million photos uploaded
per day
• 100+ petabytes of disk space in one of FB’s largest Hadoop (HDFS) clusters
• 105 terabytes of data scanned via Hive, every 30
minutes
• 70,000 queries executed on these databases per day
Keywords in IR
• A large repository of documents is stored on computers (Corpus).
• There is a topic about which I want to get information (Information need).
• Some of the documents may contain the
information that satisfies my need (Relevance).
• How do I retrieve these documents?
Database Systems
Unstructured
• Unstructured data does not have a clear, overt semantic structure (e.g., free text on a web page, audio, video).
• Allows less expressive queries of the form:
[Figure: a query and a document collection are the inputs to the IR retrieval system, which returns an answer list.]
Example
Web
An example IR problem
• Document corpus = Shakespeare’s written plays (37).
• Works for Shakespeare’s plays, but may not work on huge document collections (billions / trillions of words).
• Can we cut down the time?
            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony               1                   1              0           0        0         1
Brutus               1                   1              0           1        0         0
Caesar               1                   1              0           1        1         1
Calpurnia            0                   1              0           0        0         0
Cleopatra            1                   0              0           0        0         0
mercy                1                   0              1           1        1         1
worser               1                   0              1           1        1         0
• Brutus 110100
• Caesar 110111
• Calpurnia (complemented) 101111
• Bitwise AND
• 110100 AND 110111 AND 101111 = 100100
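The bitwise computation above can be reproduced in a few lines. This is a minimal sketch, assuming the query implied by the bullets is Brutus AND Caesar AND NOT Calpurnia; the names `PLAYS`, `INCIDENCE`, `MASK`, and `matching_plays` are illustrative and not part of the lecture.

```python
# Row vectors copied from the term-document matrix, one bit per play
# (leftmost bit = Antony and Cleopatra, rightmost bit = Macbeth).
PLAYS = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]
INCIDENCE = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}
MASK = (1 << len(PLAYS)) - 1  # 0b111111, keeps the complement to six bits


def matching_plays(bits):
    """Return the play titles whose bit is set, reading bits left to right."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if (bits >> (n - 1 - i)) & 1]


# Complement Calpurnia's vector, then AND the three vectors together.
not_calpurnia = ~INCIDENCE["Calpurnia"] & MASK
result = INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & not_calpurnia

print(format(not_calpurnia, "06b"))  # complemented Calpurnia vector
print(format(result, "06b"))         # answer vector
print(matching_plays(result))        # plays that satisfy the query
```

Reading the set bits of the answer vector against the column order of the matrix gives the plays that satisfy the query.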
Answers to query
http://www.rhymezone.com/shakespeare/
Realistic Scenario
• Consider N = 1 million documents, each with about 1000 words.
• 500,000 distinct words.
A realistic scenario
• Number of cells in the term-document matrix = 500,000 × 10^6 = 0.5 × 10^12 = 0.5 trillion (too much for memory)
Antony      1 1 0 0 0 1
Brutus      1 1 0 1 0 0
Caesar      1 1 0 1 1 1
Calpurnia   0 1 0 0 0 0
Cleopatra   1 0 0 0 0 0
mercy       1 0 1 1 1 1
worser      1 0 1 1 1 0
• Max % of 1’s possible in a column vector = (1000 / 500,000) × 100 = 10^5 / 500,000 = 1/5 = 0.2%
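The arithmetic behind these figures is easy to check. The sketch below only restates the numbers from the slides (1 million documents, about 1000 words each, 500,000 distinct words); the variable names are illustrative, and the bound assumes every word in a document is distinct.

```python
num_docs = 1_000_000      # N, from the slide
words_per_doc = 1_000     # approximate length of each document
vocab_size = 500_000      # number of distinct words (terms)

# One cell per (term, document) pair.
cells = vocab_size * num_docs

# A document's column has one row per term. With about 1000 words in the
# document, at most 1000 of those rows can hold a 1.
max_percent_ones = 100 * words_per_doc / vocab_size

# Upper bound on the number of 1s in the whole matrix.
max_ones_total = num_docs * words_per_doc

print(cells)             # total cells in the matrix
print(max_percent_ones)  # maximum percentage of 1s in a column
print(max_ones_total)    # at most this many 1s overall
```

Since at least 99.8% of every column is 0, storing only the positions of the 1s is far cheaper than storing the full matrix.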
[Figure: inverted index with a Dictionary and Postings; each postings list is sorted by docID.]
4/19/2019 SS ZG537 (INFORMATION RETRIEVAL) Pilani Campus
Query processing: AND
• Consider processing the query:
– Brutus AND Caesar
– Locate Brutus in the Dictionary;
• Retrieve its postings.
– Locate Caesar in the Dictionary;
• Retrieve its postings.
– “Merge” the two postings:
2 4 8 16 32 64 128 Brutus
1 2 3 5 8 13 21 34 Caesar
The merge
• Walk through the two postings simultaneously,
in time linear in the total number of postings
entries
[Figure: pointers p1 and p2 step through the two postings lists; each docID found in both lists is added to the answer.]
2 4 8 16 32 64 128 Brutus (p1)
1 2 3 5 8 13 21 34 Caesar (p2)
Answer: 2 8
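The walk-through above can be sketched as a two-pointer intersection. This is an illustrative version, assuming both postings lists are already sorted by docID; the function name `intersect` and the index variables are not taken from the lecture.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID.

    Each step advances at least one pointer, so the running time is
    linear in the total number of postings entries.
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # the smaller docID cannot match anything later in p2
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

The sorted order is what makes the single pass possible: once a docID is smaller than the one under the other pointer, it can be skipped for good.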
Union of the two postings lists: 1 2 3 4 5 8 13 16 21 32 34 64 128
2 4 8 16 32 64 128 Brutus
1 3 5 6 7 9 …
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Why frequency?
Will discuss later.
[Figure: the dictionary holds terms and counts, with pointers to postings lists of docIDs.]
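To make the structure concrete, here is a rough sketch of building such an index from the two example documents. The tokenizer (lowercase, letters only) is a simplifying assumption, not the lecture's method, and the names `DOCS`, `tokenize`, and `build_index` are illustrative.

```python
import re
from collections import defaultdict

# The two example documents, keyed by docID.
DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def tokenize(text):
    """Assumed tokenizer: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each term to its postings list (docIDs in sorted order)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sort the terms for the dictionary and the docIDs within each list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_index(DOCS)
for term in ("brutus", "caesar", "killed", "ambitious"):
    # The count stored with each term here is the length of its postings list.
    print(term, len(index[term]), index[term])
```

In this sketch the count kept alongside each term is the number of documents that contain it, which can be read directly off the postings list.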
Query Optimization
• Consider a query that is an AND of t terms.
• For each of the t terms, get its postings list, then AND them together.
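One way to AND t postings lists together is sketched below. Processing the lists from shortest to longest is a common heuristic that the bullets above do not state, so treat it as an assumption; the third postings list (`calpurnia`) is hypothetical sample data, and the function names are illustrative.

```python
def intersect(p1, p2):
    """Two-pointer intersection of postings lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_lists):
    """AND together t postings lists.

    Assumption (not from the slides): start with the shortest list so the
    intermediate result stays as small as possible.
    """
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = list(ordered[0])
    for postings in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, postings)
    return result


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
calpurnia = [2, 31, 54, 101]  # hypothetical postings list for illustration
print(intersect_many([brutus, caesar, calpurnia]))
```

The intermediate result can never be longer than the shortest list processed so far, which is why the order of the intersections affects the total work.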
www.westlaw.com/
• Largest commercial legal search service in terms of
number of paying subscribers.
• Over half a million subscribers performing
millions of searches a day over tens of
terabytes of text data.
• The service was started in 1975.