Lecture 1

Uploaded by

Manimaran
Information Retrieval Techniques
Objectives
• Introduce the field of Information Retrieval (IR).

• Basic foundations, principles, and methods.

• Trends: Frontier Topics

• Prepare you to take up challenges in IR or related fields.
Reasons to take this course
• Information changes people's lives.

• Users need to find useful information in a very large amount of data.
• Industries need to earn money through search engines: Google, Yahoo, Bing, Baidu, SoSo, Sogou, etc.
• Many interesting problems in the IR field haven't been solved.

The Internet is enormous
http://www.nature.com/nature/webmatters/tomog/tomfigs/fig1.htm

Prototype two-dimensional image depicting global connectivity among ISPs as viewed from a skitter host.
Why IR?
• Web Sites Increasing Sharply

• Internet Users Increasing Continuously

• Current Web: 1 billion users, more than 1000 billion pages

• Google
• 3 billion documents indexed
• 10-20 TB of text on web
• ~ 1000 TB of information produced every year
How big is Facebook's data?
• 2.5 billion content items shared per day (status updates + wall posts + photos + videos + comments)
• 2.7 billion Likes per day, 300 million photos uploaded per day
• 100+ petabytes of disk space in one of FB's largest Hadoop (HDFS) clusters
• 105 terabytes of data scanned via Hive every 30 minutes
• 70,000 queries executed on these databases per day
Keywords in IR
• A large repository of documents is stored on computers (corpus).
• There is a topic about which I desire to get information (information need).
• Some of the documents may contain the information that satisfies my need (relevance).
• How do I retrieve these documents?

• I communicate my information need in


Structured data
• How the query is expressed will depend on whether the data in the document corpus is structured or unstructured.

• Structured data tends to refer to information in "tables" and has a clear, overt semantic structure.

Employee  Manager  Salary
Smith     Jones    50000000
Chang     Smith    60000
Ivy       Smith    50000
Structured data
• Structured data allows for expressive queries like:

• Give me the social security numbers of all the employees who have stayed with the company for more than 5 years, and whose yearly salaries are three standard deviations above the average salary.

Employee  Manager  Salary
Smith     Jones    50000000
Chang     Smith    60000
Ivy       Smith    50000

Database systems
Unstructured data
• Unstructured data does not have a clear, overt semantic structure (e.g., free text on a web page, audio, video).
• Allows less expressive queries of the form:

• Give me all documents that have the keywords 'These Romans are crazy'.

Structured data: Database systems
Unstructured data: Information retrieval

Unstructured (text) vs. structured (database) data in 1996
Unstructured (text) vs. structured (database) data in 2009
Information Retrieval
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need.
The problem of IR
Goal = find documents relevant to an information need from a large document set.
[Figure: info. need, query, IR system, document collection, retrieval, answer list]
Example
[Figure: Google, Web]
An example IR problem
• Document corpus = Shakespeare's written plays (37).

• Information need = Which plays of Shakespeare contain the words BRUTUS AND CAESAR but NOT CALPURNIA?
Naïve solution
• Linear scan through all of Shakespeare's plays ('grepping').
• Need to repeat this for every query.

• Works for Shakespeare's plays, but may not work on huge document collections (billions / trillions of words).
• Can we cut down the time?

• Better solution: preprocess the corpus in advance and organize the information about the occurrence of different words in a way that speeds up query processing.
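The linear scan described above can be sketched in a few lines. This is only an illustration: the function name `linear_scan` and the three toy "plays" are hypothetical stand-ins, not the real Shakespeare texts. Note that the whole corpus is re-read for every query, which is exactly the cost that preprocessing aims to avoid.

```python
def linear_scan(plays, must_have, must_not_have):
    """Return titles whose text has every must_have word and no must_not_have word."""
    hits = []
    for title, text in plays.items():
        # Re-tokenizing every document for every query is the expensive part.
        words = set(text.lower().split())
        has_all = all(w in words for w in must_have)
        has_none = not any(w in words for w in must_not_have)
        if has_all and has_none:
            hits.append(title)
    return hits


# Toy stand-in texts (hypothetical, for illustration only).
plays = {
    "Play A": "brutus spoke to caesar",
    "Play B": "brutus caesar and calpurnia",
    "Play C": "mercy for all",
}

result = linear_scan(plays, ["brutus", "caesar"], ["calpurnia"])
print(result)
```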
Term-document incidence

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

Query: Brutus AND Caesar BUT NOT Calpurnia
Cell value: 1 if play contains word, 0 otherwise
Incidence vectors
• So we have a 0/1 vector for each term.

• To answer the query Brutus AND Caesar AND NOT Calpurnia, take the vectors for:

• Brutus 110100
• Caesar 110111
• Calpurnia (complemented) 101111

• Bitwise AND
• 110100 AND 110111 AND 101111 = 100100
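As a rough sketch, the same bitwise computation can be done with Python integers used as bit vectors. The leftmost bit corresponds to the first play in the table; the names `PLAYS`, `incidence` and `MASK` are illustrative choices, not part of the lecture.

```python
# Column order of the term-document incidence table (leftmost bit = first play).
PLAYS = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# One 0/1 vector per term, written as a binary literal.
incidence = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

# Python's ~ yields a negative number, so mask the complement to 6 bits.
MASK = (1 << len(PLAYS)) - 1
not_calpurnia = ~incidence["Calpurnia"] & MASK

answer = incidence["Brutus"] & incidence["Caesar"] & not_calpurnia

# Decode the set bits back into play titles.
matching = [
    play for i, play in enumerate(PLAYS)
    if (answer >> (len(PLAYS) - 1 - i)) & 1
]

print(format(answer, "06b"), matching)
```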
Answers to query

http://www.rhymezone.com/shakespeare/
Realistic scenario
• Consider N = 1 million documents, each with about 1000 words.

• Average of 6 bytes/word, including spaces/punctuation
– 6 GB of data in the documents.
• Number of distinct terms ≈ 500,000
[Figure: term-document matrix with 1 million columns (documents) and 500,000 rows (distinct words)]
A realistic scenario
• Number of cells in the term-document matrix = 500,000 × 10^6 = 0.5 × 10^12 = 0.5 trillion (too much for memory).

• If we want to store 0.5 trillion cells in memory, how much memory do we need?

• If each cell takes 1 byte, then we need 0.5 × 10^12 bytes (i.e., 500 GB) of main memory (no common PC on the market has this much RAM).

• If each cell takes 1 bit, then we need 500 GB / 8 = 62.5 GB: still not feasible.
• Run out of space for 1 million news
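The arithmetic on this slide can be checked with a short script. The variable names are illustrative, and 1 GB is taken as 10^9 bytes, which is the convention the slide's "500 GB" figure implies.

```python
n_docs = 1_000_000     # N = 1 million documents (matrix columns)
n_terms = 500_000      # distinct terms (matrix rows)

cells = n_docs * n_terms            # one cell per (term, document) pair

GB = 1e9                            # assuming 1 GB = 10^9 bytes
gb_one_byte_per_cell = cells / GB   # each cell stored as a full byte
gb_one_bit_per_cell = cells / 8 / GB  # each cell stored as a single bit

print(cells, gb_one_byte_per_cell, gb_one_bit_per_cell)
```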
Term-document matrix is sparse

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

Max % of 1's possible in a column vector = (1000 / 500,000) × 100 = 10^5 / 500,000 = 1/5 = 0.2%

• Max number of 1's = (#documents) × (#words in each document) = 1 million × 1000 = 1 billion.
• Since the total #cells in the matrix = 0.5 trillion, at most 0.2% of the cells can have a 1: (10^9 / (0.5 × 10^12)) × 100 = 1/5 = 0.2%.
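A quick numeric check of the sparsity bound, using the same numbers as the slide. The variable names are illustrative. The bound is an upper limit because it assumes every word in a document is distinct.

```python
n_docs = 1_000_000
words_per_doc = 1_000
n_terms = 500_000

# Upper bound on 1-cells: every word occurrence marks a different cell.
max_ones = n_docs * words_per_doc
cells = n_docs * n_terms

max_fill_percent = 100 * max_ones / cells            # whole matrix
max_column_percent = 100 * words_per_doc / n_terms   # a single document column

print(max_ones, max_fill_percent, max_column_percent)
```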
• Boolean retrieval
Inverted index
• For each term t, we must store a list of all documents that contain t.
– Identify each by a docID, a document serial number.

• Can we use fixed-size arrays for this?

Brutus     1 2 4 11 31 45 173 174
Caesar     1 2 4 5 6 16 57 132
Calpurnia  2 31 54 101

What happens if the word Caesar is added to document 14?
Inverted index
• We need variable-size postings lists
– On disk, a continuous run of postings is normal and best
– In memory, can use linked lists or variable-length arrays
• Some tradeoffs in size/ease of insertion

Dictionary   Postings
Brutus       1 2 4 11 31 45 173 174
Caesar       1 2 4 5 6 16 57 132
Calpurnia    2 31 54 101

Each docID entry in a list is a posting. Postings lists are sorted by docID.
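A minimal in-memory sketch of this structure, assuming a Python dict as the dictionary and a sorted list (a variable-length array) as each postings list. The helper name `add_posting` is hypothetical. It also answers the earlier question about adding Caesar to document 14: the list simply grows by one entry in sorted position.

```python
from bisect import bisect_left

# Dictionary (term -> postings list); postings are kept sorted by docID.
index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}


def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings list, keeping it sorted and duplicate-free."""
    postings = index.setdefault(term, [])
    pos = bisect_left(postings, doc_id)          # binary search for the insert position
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)             # O(n) shift: the ease-of-insertion tradeoff


add_posting(index, "Caesar", 14)
print(index["Caesar"])
```

A linked list would make the insertion itself cheaper but costs extra pointer space, which is the size/ease-of-insertion tradeoff the slide mentions.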
4/19/2019 SS ZG537 (INFORMATION RETRIEVAL) Pilani Campus
Query processing: AND
• Consider processing the query:
– Brutus AND Caesar
– Locate Brutus in the Dictionary;
• Retrieve its postings.
– Locate Caesar in the Dictionary;
• Retrieve its postings.
– “Merge” the two postings:

Brutus  2 4 8 16 32 64 128
Caesar  1 2 3 5 8 13 21 34
The merge
• Walk through the two postings simultaneously, in time linear in the total number of postings entries.

Brutus  2 4 8 16 32 64 128
Caesar  1 2 3 5 8 13 21 34
Result  2 8

If list lengths are x and y, merge takes O(x+y) operations. Crucial: postings sorted by docID.



Intersecting two postings lists (a “merge” algorithm)
p1: pointer to current location in list 1
p2: pointer to current location in list 2

Brutus  2 4 8 16 32 64 128
Caesar  1 2 3 5 8 13 21 34
Answer  2 8

Postings sorted by docID.
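The two-pointer walk can be written out as follows. This is a sketch using Python lists and integer indices `i` and `j` in place of the pointers p1 and p2; it relies on both lists being sorted by docID.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) steps."""
    answer = []
    i = j = 0                      # i plays the role of p1, j of p2
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID is in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

Each loop iteration advances at least one pointer, so the loop runs at most x + y times for lists of length x and y.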



More query processing
Brutus OR Caesar
NOT Brutus
Brutus AND NOT Caesar
Brutus OR NOT Caesar



More query processing
Brutus OR Caesar

Brutus  2 4 8 16 32 64 128
Caesar  1 2 3 5 8 13 21 34

Result  1 2 3 4 5 8 13 16 21 32 34 64 128
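OR can be handled by the same kind of linear walk, this time keeping every docID that appears in either list. The function name `union` is an illustrative choice.

```python
def union(p1, p2):
    """Merge two sorted postings lists into one sorted list without duplicates."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # present in both: emit once
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    # One list is exhausted; the rest of the other is already sorted.
    answer.extend(p1[i:])
    answer.extend(p2[j:])
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(union(brutus, caesar))
```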



More query processing
NOT Brutus

Brutus  2 4 8 16 32 64 128
Result  1 3 5 6 7 9 …
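NOT needs to know the full set of docIDs in the collection. The sketch below assumes docIDs run from 1 to `n_docs` with no gaps; that assumption, the function name `complement`, and the collection size of 128 are made up for illustration.

```python
def complement(postings, n_docs):
    """Return all docIDs in 1..n_docs that are absent from the sorted postings list."""
    answer = []
    j = 0
    for doc_id in range(1, n_docs + 1):
        if j < len(postings) and postings[j] == doc_id:
            j += 1                  # doc contains the term: skip it
        else:
            answer.append(doc_id)   # doc does not contain the term
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
not_brutus = complement(brutus, n_docs=128)   # hypothetical collection size
print(not_brutus[:6])
```

The result is almost as large as the whole collection, which is why NOT is usually combined with AND (as in Brutus AND NOT Caesar) rather than evaluated on its own.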



Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
Tokenizer
Token stream: Friends Romans Countrymen
Linguistic modules: depluralization, case folding
Modified tokens: friend roman countryman
Indexer
Inverted index:
friend      2 4
roman       1 2
countryman  13 16
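The tokenizer and linguistic-module stages might look like the sketch below. The depluralization rule here is a deliberately crude stand-in (a one-entry lookup table plus stripping a trailing "s"), assumed only for this example; real systems use a proper stemmer or lemmatizer.

```python
import re

# Toy irregular-plural table (an assumption for this example only).
IRREGULAR_PLURALS = {"countrymen": "countryman"}


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z']+", text)


def normalize(token):
    """Case folding followed by a crude depluralization."""
    t = token.lower()
    if t in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[t]
    if t.endswith("s") and len(t) > 3:
        return t[:-1]
    return t


tokens = tokenize("Friends, Romans, countrymen.")
modified_tokens = [normalize(t) for t in tokens]
print(tokens, modified_tokens)
```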
Indexer steps: Token sequence
• Sequence of (modified token, document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Indexer steps: Sort
• Sort by terms
– And then docID

Core indexing step

Indexer steps: Dictionary & Postings
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings.
• Doc. frequency information is added.

Why frequency? Will discuss later.
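The three indexer steps (token sequence, sort, dictionary and postings) can be sketched end to end on the two example documents. Normalization is reduced to lowercasing here to keep the example short; the variable names are illustrative.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs (only case folding applied here).
pairs = [
    (token.lower(), doc_id)
    for doc_id, text in docs.items()
    for token in re.findall(r"[A-Za-z']+", text)
]

# Step 2: sort by term, then by docID (tuple ordering does both at once).
pairs.sort()

# Step 3: merge duplicate entries and split into dictionary and postings.
postings = {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    postings[term] = sorted({doc_id for _, doc_id in group})

# Document frequency = number of documents containing the term.
doc_freq = {term: len(plist) for term, plist in postings.items()}

print(postings["caesar"], doc_freq["caesar"], postings["killed"])
```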



Where do we pay in storage?
Lists of docIDs
Terms and counts
Pointers
Query optimization
• Consider a query that is an AND of t terms.
• For each of the t terms, get its postings list, then AND them together.

Brutus     1 2 4 11 31 45 173 174
Caesar     1 2 4 5 6 16 57 132
Calpurnia  2 31 54 101

QUERY: Brutus AND Caesar AND Calpurnia



Query optimization
• Process in order of increasing document frequency.
• Start by intersecting the two smallest postings lists.
• All intermediate results will be no bigger than the smallest postings list, so we are likely to minimize the work.

Brutus     1 2 4 11 31 45 173 174
Caesar     1 2 4 5 6 16 57 132
Calpurnia  2 31 54 101

QUERY: Brutus AND Caesar AND Calpurnia
Execute the query as (Calpurnia AND Brutus) AND Caesar, since Calpurnia has the shortest postings list. This is why the doc freq is stored.
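A sketch of this ordering heuristic: sort the postings lists by length (their document frequency) and intersect from shortest to longest, stopping early if the intermediate result becomes empty. The function names are illustrative, and `intersect` is the same two-pointer merge used for a two-term AND.

```python
def intersect(p1, p2):
    """Two-pointer intersection of postings lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_lists):
    """AND together several postings lists, shortest (lowest doc freq) first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)   # increasing document frequency
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:                          # nothing left to match: stop early
            break
        result = intersect(result, postings)
    return result


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
calpurnia = [2, 31, 54, 101]
print(intersect_many([brutus, caesar, calpurnia]))
```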



Boolean retrieval model
• The Boolean retrieval model can answer any query that is a Boolean expression.
• Boolean queries are queries using AND, OR and NOT to join query terms.
• Views each document as a set of terms.

• Is precise: document matches condition or not.

• Primary commercial retrieval tool for 3 decades.

• Many professional searchers (e.g., lawyers) still like Boolean queries:
• You know exactly what you're getting.
Example of a Boolean retrieval model
• Commercially successful Boolean retrieval: Westlaw

www.westlaw.com/
• Largest commercial legal search service in terms of number of paying subscribers.
• Over half a million subscribers performing millions of searches a day over tens of terabytes of text data.
• The service was started in 1975.



Westlaw example queries
• Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company
• “trade secret” /s disclos! /s prevent /s employe!

• Information need: Requirements for disabled people to be able to access a workplace
• disab! /p access! /s work-site work-place (employment /3 place)
• Information need: Cases about a host's responsibility for drunk guests
• host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest
Westlaw example queries (2)
• /s = within same sentence

• /p = within same paragraph

• /n = within n words

• Space is disjunction, not conjunction (this was the default in search pre-Google).
• ! is a trailing wildcard query



Limitations of the Boolean retrieval model
• Not tolerant of spelling mistakes.

• Phrase search (“Stanford University”) and proximity search (Gates /s Microsoft) require the index to be augmented.
• Should more weight be given to documents containing a higher number of instances of the terms?
• No ranking of returned results.



How does a user interact with an IR system?



How to evaluate the performance of an IR system?
Precision: fraction of retrieved documents that are relevant to the user's information need.

Recall: fraction of relevant documents in the collection that are retrieved.
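Both measures can be computed from two sets of docIDs: what the system retrieved and what is actually relevant. The docIDs below are made up for illustration, and returning 0.0 for an empty set is a convention chosen for this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)              # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant in the collection, 2 in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10])
print(precision, recall)
```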



Summary
• Course related stuff.
• IR is a current research trend due to the tsunami of unstructured data.
• A simple IR example for Shakespeare's written plays.
• In an IR system, a term incidence matrix is built a priori and queries are answered using the Boolean retrieval model.
• The problem with the term incidence matrix is that it is sparse, and hence the inverted index is built instead.



Summary

• The Boolean retrieval model can answer any query that is a Boolean expression.
• Boolean queries are queries that use AND, OR and NOT to join query terms.
• Views each document as a set of terms.
• Is precise: document matches condition or not.
• Primary commercial retrieval tool for 3 decades.
• Many professional searchers (e.g., lawyers) still like Boolean queries.
• You know exactly what you are getting.
• When are Boolean queries the best way of searching? Depends on: information need, searcher, document collection, . . .
