Lecture 1

Uploaded by

Manimaran

Information Retrieval Techniques

Objectives
• Introduce the field of Information Retrieval (IR).
• Cover its basic foundations, principles, and methods.
• Survey trends and frontier topics.
• Prepare you to take up challenges in IR or related fields.

Reasons to take this course
• Information changes people's lives.
• Users need to find useful information in a huge amount of data.
• Industries need to earn money through search engines: Google, Yahoo, Bing, Baidu, SoSo, Sogou, etc.
• Many interesting problems in the IR field have not been solved.


The Internet is enormous
[Figure: Prototype two-dimensional image depicting global connectivity among ISPs as viewed from skitter host. Source: http://www.nature.com/nature/webmatters/tomog/tomfigs/fig1.htm]

Why IR?
• The number of web sites is increasing sharply.
• The number of Internet users is increasing continuously.
• Current Web: 1 billion users, more than 1000 billion pages.
• Google: 3 billion documents indexed.
• 10-20 TB of text on the web.
• ~1000 TB of information produced every year.

How big is Facebook's data?
• 2.5 billion content items shared per day (status updates + wall posts + photos + videos + comments)
• 2.7 billion Likes per day, 300 million photos uploaded per day
• 100+ petabytes of disk space in one of FB's largest Hadoop (HDFS) clusters
• 105 terabytes of data scanned via Hive every 30 minutes
• 70,000 queries executed on these databases per day

Keywords in IR
• A large repository of documents is stored on computers (Corpus).
• There is a topic about which I want to get information (Information need).
• Some of the documents may contain the information that satisfies my need (Relevance).
• How do I retrieve these documents?

• I communicate my information need in the form of a query (Query).

Structured data
• How the query is expressed will depend on whether the data in the document corpus is structured or unstructured.
• Structured data tends to refer to information in "tables" and has a clear, overt semantic structure.

Employee  Manager  Salary
Smith     Jones    50000000
Chang     Smith    60000
Ivy       Smith    50000

Structured data
• Structured data allows for expressive queries like:
– Give me the social security numbers of all the employees who have stayed with the company for more than 5 years, and whose yearly salaries are three standard deviations above the average salary.

Employee  Manager  Salary
Smith     Jones    50000000
Chang     Smith    60000
Ivy       Smith    50000

Queries of this kind are the domain of database systems.
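
As a rough illustration of what "expressive" means here, the sketch below loads the three-row table into an in-memory SQLite database and runs a condition over two columns. The table name `employees`, the column names, and the query itself (employees managed by Smith earning more than 55000) are assumptions made for this example; the slide's own query needs fields such as tenure that the sample table does not contain.

```python
import sqlite3

# In-memory database holding the sample table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee TEXT, manager TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Smith", "Jones", 50000000), ("Chang", "Smith", 60000), ("Ivy", "Smith", 50000)],
)

# Hypothetical structured query: who reports to Smith and earns more than 55000?
rows = conn.execute(
    "SELECT employee FROM employees WHERE manager = ? AND salary > ? ORDER BY employee",
    ("Smith", 55000),
).fetchall()
names = [row[0] for row in rows]
print(names)
conn.close()
```

The point is only that a schema lets a query combine typed conditions on named fields, which a bag of keywords cannot express.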

Unstructured data
• Unstructured data does not have a clear, overt semantic structure (e.g., free text on a web page, audio, video).
• Allows less expressive queries of the form:
– Give me all documents that have the keywords 'These Romans are crazy'.

Structured data    ->  Database systems
Unstructured data  ->  Information retrieval

[Figure: Unstructured (text) vs. structured (database) data in 1996]
[Figure: Unstructured (text) vs. structured (database) data in 2009]

Information Retrieval
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need.

The problem of IR
Goal = find documents relevant to an information need from a large document set.
[Figure: Info. need -> Query -> IR system -> Answer list; the IR system performs retrieval over the document collection]

Example
[Figure: the same diagram with Google as the IR system and the Web as the document collection]

An example IR problem
• Document corpus = Shakespeare's written plays (37).
• Information need = Which plays of Shakespeare contain the words BRUTUS AND CAESAR but NOT CALPURNIA?

Naïve solution
• Linear scan through all of Shakespeare's plays ('grepping').
• Need to repeat this for every query.
• Works for Shakespeare's plays, but may not work on huge document collections (billions / trillions of words).
• Can we cut down the time?
• Better solution: preprocess the corpus in advance and organize the information about the occurrence of different words in a way that speeds up query processing.
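
The naïve approach can be sketched in a few lines. The function name `grep_query` and the three one-line "plays" below are made up for illustration (they are not Shakespeare's text); the point is that every query re-reads every document in full.

```python
# Toy corpus: title -> text. The texts are invented stand-ins, not the real plays.
plays = {
    "Julius Caesar": "Caesar Brutus Calpurnia Antony",
    "Hamlet": "Brutus killed Caesar",
    "The Tempest": "mercy worser",
}

def grep_query(corpus, must_have, must_not_have):
    """Linear scan: read every document in full for each query."""
    hits = []
    for title, text in corpus.items():
        words = set(text.lower().split())
        has_all = all(w in words for w in must_have)
        has_none = not any(w in words for w in must_not_have)
        if has_all and has_none:
            hits.append(title)
    return hits

# BRUTUS AND CAESAR but NOT CALPURNIA
result = grep_query(plays, ["brutus", "caesar"], ["calpurnia"])
print(result)
```

The cost grows with the total size of the corpus on every query, which is what preprocessing into an index is meant to avoid.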

Term-document incidence

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                  1             0          0       0        1
Brutus              1                  1             0          1       0        0
Caesar              1                  1             0          1       1        1
Calpurnia           0                  1             0          0       0        0
Cleopatra           1                  0             0          0       0        0
mercy               1                  0             1          1       1        1
worser              1                  0             1          1       1        0

Query: Brutus AND Caesar BUT NOT Calpurnia
An entry is 1 if the play contains the word, 0 otherwise.

Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query Brutus AND Caesar AND NOT Calpurnia, take the vectors for:
– Brutus 110100
– Caesar 110111
– Calpurnia (complemented) 101111
• Bitwise AND:
– 110100 AND 110111 AND 101111 = 100100
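
The bitwise computation above can be reproduced with Python integers used as bit vectors. This is a minimal sketch: the variable names and the 6-bit `mask` used to keep the complement within six plays are choices made for this example, while the vectors themselves are taken from the incidence table.

```python
# Column order of the incidence table (leftmost bit = first play).
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Rows of the incidence matrix as 6-bit integers.
incidence = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

n = len(plays)
mask = (1 << n) - 1  # 0b111111, so NOT stays within n bits

not_calpurnia = ~incidence["Calpurnia"] & mask  # 101111
bits = incidence["Brutus"] & incidence["Caesar"] & not_calpurnia

# Map the set bits back to play titles.
answer = [play for i, play in enumerate(plays) if bits & (1 << (n - 1 - i))]
print(format(bits, "06b"), answer)
```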

Answers to query
• The result vector 100100 selects the first and fourth plays: Antony and Cleopatra, and Hamlet.
• Source for the play texts: http://www.rhymezone.com/shakespeare/

Realistic scenario
• Consider N = 1 million documents, each with about 1000 words.
• Average of 6 bytes per word, including spaces/punctuation
– 6 GB of data in the documents.
• Number of distinct terms ~ 500,000
[Figure: term-document matrix with 1 million columns (documents) and 500,000 rows (distinct words)]

A realistic scenario
• Number of cells in the term-document matrix = 500,000 × 10^6 = 0.5 × 10^12 = 0.5 trillion (too much for memory).
• If we want to store 0.5 trillion cells in memory, how much memory do we need?
• If each cell takes 1 byte, we need 0.5 × 10^12 bytes (i.e., 500 GB) of main memory (no common PC on the market has this much RAM).
• If each cell takes 1 bit, we need 500 GB / 8 = 62.5 GB
– still not feasible.
• We run out of space for 1 million news documents.
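
The arithmetic for this scenario can be checked directly. The variable names are illustrative, and GB is taken as 10^9 bytes, which is the convention the slide's 500 GB figure implies.

```python
num_docs = 1_000_000        # N = 1 million documents
words_per_doc = 1000        # about 1000 words each
bytes_per_word = 6          # average, including spaces/punctuation
num_terms = 500_000         # distinct terms

GB = 10**9                  # decimal gigabyte (assumption matching the slide)

corpus_gb = num_docs * words_per_doc * bytes_per_word / GB
cells = num_terms * num_docs                 # term-document matrix cells
matrix_gb_one_byte = cells / GB              # 1 byte per cell
matrix_gb_one_bit = cells / 8 / GB           # 1 bit per cell

print(corpus_gb, cells, matrix_gb_one_byte, matrix_gb_one_bit)
```

Even the 1-bit encoding needs roughly ten times the size of the raw text, which motivates a representation that stores only the 1s.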

Term-document matrix is sparse

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                  1             0          0       0        1
Brutus              1                  1             0          1       0        0
Caesar              1                  1             0          1       1        1
Calpurnia           0                  1             0          0       0        0
Cleopatra           1                  0             0          0       0        0
mercy               1                  0             1          1       1        1
worser              1                  0             1          1       1        0

• Max % of 1's possible in a column vector (one document) = (1000 / 500,000) × 100 = 10^5 / 500,000 = 1/5 = 0.2%
• Max number of 1's = (#documents) × (#words in each document) = 1 million × 1000 = 1 billion.
• Since the total #cells in the matrix = 0.5 trillion, at most 0.2% of the cells can have a 1.
• Max % of 1's possible in the matrix = (10^9 / (0.5 × 10^12)) × 100 = 1/5 = 0.2%
• Boolean retrieval
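
The sparsity bound can be verified with the same numbers. This is only a check of the slide's arithmetic; the names are illustrative, and the bound assumes every word in a document is distinct, which overestimates the number of 1s.

```python
num_docs = 1_000_000
words_per_doc = 1000
num_terms = 500_000

# A document with 1000 words can set at most 1000 of its 500,000 cells to 1.
max_pct_per_column = words_per_doc / num_terms * 100

# Across the whole matrix: at most one 1 per word occurrence.
max_ones = num_docs * words_per_doc
total_cells = num_terms * num_docs
max_pct_matrix = max_ones / total_cells * 100

print(max_ones, max_pct_per_column, max_pct_matrix)
```

So at least 99.8% of the cells are 0, and recording only the positions of the 1s is far cheaper than storing the full matrix.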

Inverted index
• For each term t, we must store a list of all documents that contain t.
– Identify each by a docID, a document serial number.
• Can we use fixed-size arrays for this?

Brutus     ->  1 2 4 11 31 45 173 174
Caesar     ->  1 2 4 5 6 16 57 132
Calpurnia  ->  2 31 54 101

What happens if the word Caesar is added to document 14?

Inverted index
• We need variable-size postings lists.
– On disk, a continuous run of postings is normal and best.
– In memory, can use linked lists or variable-length arrays.
• Some tradeoffs in size/ease of insertion.

Dictionary     Postings
Brutus     ->  1 2 4 11 31 45 173 174
Caesar     ->  1 2 4 5 6 16 57 132
Calpurnia  ->  2 31 54 101

Each docID entry in a list is a posting. Postings lists are sorted by docID.
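
One way to picture the in-memory variant is a dictionary mapping each term to a variable-length, sorted list of docIDs. The sketch below uses the postings from the slide; the helper `add_posting` is a hypothetical name, and a Python list stands in for the "variable length array" option.

```python
import bisect

# Dictionary -> postings lists, each sorted by docID (values from the slide).
index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings list, keeping it sorted and duplicate-free."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)

# "What happens if the word Caesar is added to document 14?"
add_posting(index, "Caesar", 14)
print(index["Caesar"])
```

With a fixed-size array the Caesar list would have no room for the new entry; a variable-length list simply grows, at the cost of shifting the later elements on insertion.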
4/19/2019 SS ZG537 (INFORMATION RETRIEVAL) Pilani Campus

Query processing: AND
• Consider processing the query: Brutus AND Caesar
– Locate Brutus in the Dictionary;
  • Retrieve its postings.
– Locate Caesar in the Dictionary;
  • Retrieve its postings.
– "Merge" the two postings lists:

Brutus  ->  2 4 8 16 32 64 128
Caesar  ->  1 2 3 5 8 13 21 34

The merge
• Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

Brutus  ->  2 4 8 16 32 64 128
Caesar  ->  1 2 3 5 8 13 21 34
Result  ->  2 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
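
The walk described above can be sketched as a two-pointer intersection. The function name `intersect` is chosen for this example; it relies on both lists being sorted by docID, as the slide requires.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2)) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID appears in both lists: part of the AND result.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer sitting on the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
merged = intersect(brutus, caesar)
print(merged)
```

Each step advances at least one pointer, so the loop runs at most x + y times; without sorted lists, every docID in one list would have to be searched for in the other.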