0% found this document useful (0 votes)
15 views

Lect 3 Inverted Index

The document discusses the concept of an Inverted Index, a crucial structure in information retrieval that maps terms to the documents in which they appear. It details the construction process, including tokenization, normalization, and indexing, as well as the efficiency of query processing using Boolean retrieval methods. Additionally, it emphasizes the importance of query optimization to minimize processing time when handling multiple terms.

Uploaded by

golanihimanshu2
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views

Lect 3 Inverted Index

The document discusses the concept of an Inverted Index, a crucial structure in information retrieval that maps terms to the documents in which they appear. It details the construction process, including tokenization, normalization, and indexing, as well as the efficiency of query processing using Boolean retrieval methods. Additionally, it emphasizes the importance of query optimization to minimize processing time when handling multiple terms.

Uploaded by

golanihimanshu2
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 24

Inverted Index

Dr. Subrat Kumar Nayak


Associate Professor
Dept. of CSE, ITER, SOADU
Inverted Index
 This is the first major concepts of information retrieval.
 The name is actually redundant: an index always maps back from
terms to the parts of a document where they occur.
 Inverted Index, sometimes coined as Inverted file.
 Inverted Index is used keep a dictionary of terms. Then for each
term, we have a list that records which documents the term occurs
in.
 Each item in the list, which records that a term appeared in a
document is conventionally called a posting.
 The list is then called a postings list (or inverted list), and all the
postings lists taken together are referred to as the postings.
dictionary term is used for the data structure and vocabulary for the set
of terms
Inverted Index
 For each term t, we must store a list of all documents that contain t.
Identify each doc by a docID, a document serial number
 Can we used fixed-size arrays for this?

Brutus 1 2 4 11 31 45 173 174


Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101

What happens if the word Caesar is added


to document 14?
Inverted Index
 We need variable-size postings lists
On disk, a continuous run of postings is normal and best
In memory, can use linked lists or variable length arrays
Some tradeoffs in size/ease of insertion Posting

Brutus 1 2 4 11 31 45 173 174


Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101

Dictionary Postings
Sorted by docID (more later on why).
Inverted Index (Construction)
Documents to Friends, Romans, countrymen.
be indexed

Tokenizer

Token stream Friends Romans Countrymen

Linguistic modules

Modified tokens friend roman countryman

Indexer friend 2 4
roman 1 2
Inverted index
countryman 13 16
Initial stages of text processing
 Tokenization
 Cut character sequence into word tokens
Deal with “John’s”, a state-of-the-art solution
 Normalization
 Map text and query term to same form
You want U.S.A. and USA to match
 Stemming
 We may wish different forms of a root to match
authorize, authorization
 Stop words
 We may omit very common words (or not)
the, a, to, of
Indexer steps: Token sequence

 Sequence of (Modified token, Document ID) pairs.

Doc 1 Doc 2

I did enact Julius So let it be with


Caesar I was killed Caesar. The noble
i’ the Capitol; Brutus hath told you
Brutus killed me. Caesar was ambitious
Indexer steps: Sort

Sort by terms
At least conceptually
And then docID

Core indexing step


Indexer steps: Dictionary & Postings

 Multiple term entries in a


single document are
merged.
 Split into Dictionary and
Postings
 Doc. frequency information
is added.

Why frequency?
Will discuss later.
Where do we pay in storage?

Lists of
docIDs

Terms
and
counts
IR system
implementation
• How do we index
efficiently?
• How much
storage do we
need?

Pointers 10
Inverted Index

 Inverted index works much better than the Boolean retrieval method.
 Sorting based inverted indexing is more efficient than the inverted indexing
method since least work needs to be done.
Query processing with an inverted index
How do we process a query? Our focus

Later – what kinds of queries can we process?


Query processing: AND
 Consider processing the query:
Brutus AND Caesar
 Locate Brutus in the Dictionary;
 Retrieve its postings.
 Locate Caesar in the Dictionary;
 Retrieve its postings.
 “Merge” the two postings (intersect the document sets):

2 4 8 16 32 64 128 Brutus
1 2 3 5 8 13 21 34 Caesar
The merge
 Walk through the two postings simultaneously, in time linear
in the total number of postings entries

2 4 8 16 32 64 128 Brutus
1 2 3 5 8 13 21 34 Caesar

If the list lengths are x and y, the merge takes O(x+y)


operations.
Crucial: postings sorted by docID.
Intersecting two postings lists
(a “merge” algorithm)
The merge
 Walk through the two postings simultaneously, in time linear
in the total number of postings entries

2 4 8 16 32 64 128 Brutus
2 8
1 2 3 5 8 13 21 34 Caesar

If the list lengths are x and y, the merge takes O(x+y)


operations.
Crucial: postings sorted by docID.
Boolean queries: Exact match
 The Boolean retrieval model is being able to ask a query that is a
Boolean expression:
 Boolean Queries are queries using AND, OR and NOT to join query terms
 Views each document as a set of words
 Is precise: document matches condition or not.
 Perhaps the simplest model to build an IR system on
 Primary commercial retrieval tool for 3 decades.
 Many search systems you still use are Boolean:
 Email, library catalog, macOS Spotlight
Query optimization
 Query optimization is the process of selecting how to organize the work of
answering a query so that the least total amount of work needs to be done by the
system.
 What is the best order for query processing?
 Consider a query that is an AND of n terms.
 For each of the n terms, get its postings, then AND them together.
Brutus 2 4 8 16 32 64 128
Caesar 1 2 3 5 8 16 21 34
Calpurnia 13 16

Query: Brutus AND Calpurnia AND Caesar


Query optimization example
 Process in order of increasing freq:
 start with smallest set, then keep cutting further.

This is why we kept


document freq. in dictionary

Brutus 2 4 8 16 32 64 128
Caesar 1 2 3 5 8 16 21 34
Calpurnia 13 16

Execute the query as (Calpurnia AND Brutus) AND Caesar.


More general optimization
e.g., (madding OR crowd) AND (ignoble OR strife)
 Get doc. freq.’s for all terms.
 Estimate the size of each OR by the sum of its doc. freq.’s
(conservative).
 Process in increasing order of OR sizes.
Algorithm to Intersect n terms.

You might also like