IR Unit 2 Dictionaries and Query Processing
Q. Components of Index
Components of Index
Inverted Index: Commonly used in text retrieval, it maps terms (words) to their occurrences in a
document or dataset. It consists of two main components:
1. Dictionary: The set of all unique terms (the vocabulary) found in the collection.
2. Posting List: For each term in the dictionary, a list of documents where the term appears, often
including information like the term's position in the document.
B-trees and Hash Tables: Other data structures used in databases to build indexes. These allow for
fast lookup, insertion, and deletion operations.
Terms:
• The individual words or elements that are indexed. In a dictionary, these are the words being
defined, while in a search engine or database, these are the search terms or keys used to
query data.
Posting Lists:
• Document IDs: In an inverted index, each term is associated with a list of document IDs
where the term appears.
• Positions: Sometimes, the position of the term within the document is also stored, which
helps in phrase searching or proximity queries.
Document Identifier:
Each document is assigned a unique identifier, which allows quick access to the documents that
contain a specific term.
Frequency Information:
The number of times a term appears in a document (term frequency) or across the entire corpus.
This is used in ranking the relevance of documents during query processing.
1. Index Construction:
Collection of Data: Gather the raw data or documents that need to be indexed. This could be text
documents, database entries, or any other structured or unstructured data.
Tokenization: Split the text of each document into individual tokens (words).
Normalization: Standardize the tokens by converting them to a common format, such as lowercasing
all words, removing punctuation, or applying stemming and lemmatization to reduce words to their
root forms.
Filtering: Remove stop words (common words like "and," "the") that do not contribute to the search
relevance.
Indexing: Create the actual index, typically an inverted index, where each term is mapped to a list of
documents in which it appears.
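The construction steps above (collect, tokenize, normalize, filter, index) can be sketched in Python; the documents and the stop-word list here are toy assumptions, not part of the original text:

```python
import re
from collections import defaultdict

# Toy document collection (assumed for illustration).
docs = {
    1: "The cat sat on the mat.",
    2: "The dog barked at the cat!",
}

STOP_WORDS = {"the", "and", "on", "at"}  # assumed stop-word list

def tokenize(text):
    """Normalization: lowercase and strip punctuation, then split into tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Indexing: map each term to the list of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(tokenize(docs[doc_id]))):
            if term not in STOP_WORDS:       # filtering: drop stop words
                index[term].append(doc_id)   # posting list, sorted by doc ID
    return dict(index)

index = build_index(docs)
print(index["cat"])  # [1, 2]
```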
2. Index Optimization:
Compression: Apply techniques to reduce the size of the index, making it more efficient to store and
process.
Sorting: Organize the posting lists to ensure that they are in a logical order, typically sorted by
document ID.
Skip Pointers and Auxiliary Structures: Add additional data structures, such as skip pointers, to
speed up query processing by allowing quick jumps within the posting lists.
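Skip pointers can be illustrated with a posting-list intersection that jumps ahead when it is safe to do so; the square-root skip distance used here is a common heuristic, assumed for illustration:

```python
import math

def intersect_with_skips(a, b):
    """Intersect two doc-ID-sorted posting lists using square-root skips."""
    skip_a = int(math.sqrt(len(a))) or 1
    skip_b = int(math.sqrt(len(b))) or 1
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # jump ahead only if the skip target is still <= the other list's value
            if i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

print(intersect_with_skips([1, 3, 5, 7, 9, 11], [3, 9, 12]))  # [3, 9]
```

With skips of roughly the square root of the list length, a mismatch can advance several postings at once, which is the quick-jump behaviour described above.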
3. Index Deployment:
Loading into Memory: Depending on the size and usage requirements, the index or parts of it are
loaded into memory for fast access.
Integration with Query Engine: The index is connected to the query processing engine, which will
use it to retrieve documents or data in response to search queries.
4. Query Processing:
Query Parsing: Break down the user query into tokens, similar to how the index was constructed.
Index Lookup: Use the query terms to search the index and retrieve the corresponding posting lists
or documents.
Ranking and Scoring: Apply algorithms (like TF-IDF, BM25, etc.) to rank the retrieved documents by
their relevance to the query.
Result Formatting: Compile and format the results in a way that is meaningful and useful to the end
user.
5. Index Retirement:
Archiving: When an index is no longer needed, it may be archived for historical reference or backup
purposes. This ensures that the data is not lost but is stored in a more compact and less accessible
form.
Deletion: If the index is obsolete or irrelevant, it is permanently deleted to free up storage and
reduce maintenance overhead.
Migration: In cases where a new indexing system or architecture is adopted, the old index may be
migrated or transformed into the new format before being retired.
Q. Static Inverted Index
A Static Inverted Index is a type of inverted index that is created once and is not intended to be
updated or modified frequently. It is typically used in environments where the dataset (such as a
collection of documents) does not change often, or where updates to the index are not required in
real-time.
Terms:
• The unique words or tokens found in the entire document collection. Each term is a key in
the inverted index.
Posting List:
• For each term, the index maintains a list of documents where the term appears. This list is
known as a posting list.
Dictionary:
• The dictionary holds all the unique terms (the vocabulary) and provides access to the
corresponding posting lists.
Advantages:
1. Performance:
o Because the index is not changing, it can be highly optimized for search speed,
including using aggressive compression techniques.
2. Simplicity:
o Because no updates need to be supported, the index structure and the code that
maintains it are simpler.
3. Storage Efficiency:
o The index can be designed to use minimal storage, thanks to the ability to compress
data effectively.
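Posting-list compression is commonly achieved by storing gaps between successive document IDs and encoding each gap with a variable-byte code; the text does not prescribe a scheme, so this sketch assumes that particular one:

```python
def encode_gaps(doc_ids):
    """Delta-gap + variable-byte encoding of a sorted posting list."""
    out = bytearray()
    prev = 0
    for doc in doc_ids:
        gap = doc - prev
        prev = doc
        # variable-byte: 7 payload bits per byte, high bit marks the final byte
        chunk = []
        while True:
            chunk.append(gap & 0x7F)
            gap >>= 7
            if not gap:
                break
        chunk[0] |= 0x80              # flag the terminating byte
        out.extend(reversed(chunk))
    return bytes(out)

def decode_gaps(data):
    """Inverse of encode_gaps: rebuild the original doc-ID list."""
    doc_ids, doc, n = [], 0, 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:               # last byte of this gap
            doc += n
            doc_ids.append(doc)
            n = 0
    return doc_ids

encoded = encode_gaps([1, 7, 1000, 1001])
print(decode_gaps(encoded))  # [1, 7, 1000, 1001]
```

Small gaps fit in a single byte, so a dense posting list compresses to far fewer bytes than fixed-width integers would need.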
Disadvantages:
1. Lack of Flexibility:
o The main drawback is the lack of flexibility in handling updates. If the underlying
data changes, a new index needs to be created from scratch.
2. Re-indexing Overhead:
o When updates are necessary, the entire collection needs to be re-indexed, which can
be time-consuming and resource-intensive.
Example:
o Suppose Doc 1 is "The cat sat on the mat" and Doc 2 is "The dog barked" (illustrative
documents).
o For each unique word, create a list (posting list) of all documents where the word
appears, for example:
• "mat": [Doc 1]
• "barked": [Doc 2]
Q. Sort-based dictionary
A sort-based dictionary organizes and stores its entries (terms or keys) in a sorted order. This sorting
allows for efficient searching, as well as easier integration and merging of new data.
Extracting Terms:
• Extract terms from documents. For example, from the sentence "The cat sat on the mat," the
terms might be: ["cat", "sat", "mat"].
Sorting Terms:
• Sort the extracted terms. In the case of multiple documents, all unique terms from all
documents are collected and then sorted.
Building Posting Lists:
• For each sorted term, a posting list is created that links the term to the documents in which
it appears. This posting list may include additional information, such as the frequency of the
term in each document.
• Example:
o "mat" → [Doc 1]
o "sat" → [Doc 1]
Storage:
• The sorted terms (along with their posting lists) are stored in a way that allows fast binary
search.
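The binary-search lookup over a sorted term array can be sketched as follows (the terms and posting lists are assumed toy data):

```python
import bisect

# Sorted term array with parallel posting lists (toy data, assumed).
terms    = ["cat", "mat", "sat"]
postings = [[1, 2], [1], [1]]

def lookup(term):
    """Binary-search the sorted term list: O(log n) comparisons."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []  # term not in the dictionary

print(lookup("mat"))  # [1]
print(lookup("dog"))  # []
```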
Advantages of Sort-Based Dictionaries
1. Efficient Search: Sorted terms support fast binary-search lookups.
2. Easy Merging: Sorted order makes merging new terms or multiple dictionaries straightforward.
3. Compression: Adjacent sorted terms often share prefixes, which allows compact storage.
Example:
Step 1: Tokenization: "The cat sat on the mat" → ["cat", "sat", "mat"]
Step 2: Sorting: ["cat", "mat", "sat"]
Step 3: Posting Lists: "cat" → [Doc 1], "mat" → [Doc 1], "sat" → [Doc 1]
Q. Hash-based dictionary
A Hash-Based Dictionary is a data structure that uses a hash function to map keys (like words or
terms) to their corresponding values (such as document IDs or posting lists in information retrieval).
Hash Function:
• A hash function is used to convert a key (e.g., a word) into a numerical value (called a hash
code). This hash code determines the index where the value associated with that key is
stored in the dictionary.
• Example: If the word "apple" is passed through a hash function, it might generate a hash
code like 42, and the corresponding value will be stored at index 42.
Buckets:
• The dictionary is typically implemented as an array, where each index or "bucket" can store
one or more key-value pairs.
• If two keys produce the same hash code (a situation known as a hash collision), multiple key-
value pairs might be stored in the same bucket.
Key-Value Mapping:
• In the context of an inverted index, the key would be a term (like a word), and the value
would be the posting list (which contains document IDs or other related data).
• Example:
o "apple" → [Doc 1, Doc 2]
1. Tokenization:
o Extract terms from documents as usual. For instance, from the sentence "The cat sat
on the mat," you get tokens like ["cat", "sat", "mat"].
2. Hashing:
• Each term is passed through a hash function to determine its storage location in the
dictionary.
3. Storage:
• At each calculated index (or bucket), the dictionary stores the term along with its posting list
(the list of documents where the term appears).
4. Handling Collisions:
• If two different terms generate the same hash code and thus map to the same bucket, the
dictionary needs to handle this collision. A common method is chaining, where all colliding
elements are stored in a linked list at that bucket.
• Example:
o If both "dog" and "cat" map to index 15, the bucket at index 15 would store both
entries chained together, e.g., ("cat" → [Doc 1]) and ("dog" → [Doc 2]).
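A minimal hash-based dictionary with chaining might look like this; the bucket count and the data are illustrative assumptions:

```python
class HashDictionary:
    """Minimal hash-based dictionary with chaining (illustrative sketch)."""

    def __init__(self, n_buckets=16):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, term):
        # hash code modulo bucket count selects the bucket
        return self.buckets[hash(term) % len(self.buckets)]

    def add(self, term, doc_id):
        chain = self._bucket(term)
        for entry in chain:
            if entry[0] == term:          # term already present: extend its posting list
                if doc_id not in entry[1]:
                    entry[1].append(doc_id)
                return
        chain.append((term, [doc_id]))    # colliding terms simply share the chain

    def get(self, term):
        for t, posting in self._bucket(term):
            if t == term:
                return posting
        return []

d = HashDictionary()
d.add("cat", 1)
d.add("dog", 2)
d.add("cat", 2)
print(d.get("cat"))  # [1, 2]
```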
Advantages:
• Speed: Lookups take constant time on average.
• Efficiency: No sorting step is required when building the dictionary.
• Scalability: Average lookup cost stays flat as the number of terms grows.
Disadvantages:
• Collisions: Different terms can hash to the same bucket and must be resolved, e.g., by chaining.
• Memory overhead: Buckets are kept partly empty to limit collisions, which wastes some space.
Example Scenario
Consider a small dictionary for a search engine that indexes two documents containing the terms
"apple", "banana", and "orange":
1. The term "apple" is hashed and stored in a bucket (e.g., index 17).
2. The term "banana" is hashed and stored in another bucket (e.g., index 22).
3. The term "orange" is hashed and stored in a third bucket (e.g., index 35).
A lookup for a term then returns its posting list, e.g.:
• "orange" → [Doc 2]
Posting List
A posting list is an ordered list that stores information about the documents in which a particular
term appears.
For each term in the index (also known as a vocabulary term), there is a corresponding posting list
that contains:
• Document IDs (DocIDs): Identifiers for documents where the term appears.
• Term Frequency (optional): The number of times the term appears in each document.
• Position Information (optional): The specific positions within the document where the term
appears, useful for phrase searching or proximity queries.
The posting lists for each term might look like this (illustrative values):
• "cat" → [(Doc 1, tf=2), (Doc 3, tf=1)]
Interleaving
Instead of keeping document IDs, frequencies, and positions in separate parallel structures, a
posting list can interleave them, so that each document ID is immediately followed by that
document's frequency and position information.
Example of Interleaving
• "cat" → [Doc 1, tf=2, positions (3, 7); Doc 3, tf=1, position (5)]
Q. Index Construction Techniques
In-memory index construction refers to building the index entirely within the system's
RAM. This approach is suitable for smaller datasets or scenarios where fast access is
crucial.
Process
1. Data Collection:
o Load the entire dataset into memory. This includes all documents that need to
be indexed.
2. Tokenization and Normalization:
o Process the documents to extract terms and normalize them (e.g., lowercasing,
stemming).
3. Index Construction:
o Create the inverted index by mapping terms to their corresponding documents
and storing this data in memory. The index typically consists of:
▪ A dictionary (or vocabulary) mapping terms to their posting lists.
▪ Posting lists that store document IDs and possibly term frequencies.
4. Query Processing:
o Queries are processed using the in-memory index, providing very fast lookup
and retrieval.
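The in-memory process above can be sketched as a plain Python dictionary mapping each term to per-document frequencies (the toy documents are assumed):

```python
from collections import Counter, defaultdict

# Toy collection held entirely in RAM (assumed data).
docs = {
    1: "the cat sat",
    2: "the dog barked",
    3: "the cat and the dog",
}

# Build: term -> {doc_id: term frequency}, all in memory.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def query(term):
    """Fast in-memory dictionary lookup; returns {doc_id: tf}."""
    return index.get(term, {})

print(query("cat"))  # {1: 1, 3: 1}
```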
Advantages
Disadvantages
• Memory Limitations: Limited by the available RAM, making it unsuitable for very
large datasets.
• Volatility: Data is lost if the system is restarted or crashes, unless persistent storage
mechanisms are used.
Sort-based index construction involves creating the index by first sorting the terms and
then building the posting lists.
Process
1. Data Collection:
o Gather and process all documents to extract terms.
2. Tokenization and Normalization:
o Tokenize and normalize the documents as needed.
3. Term Extraction:
o Extract all unique terms from the documents.
4. Sorting:
o Sort the terms lexicographically or based on other criteria (e.g., term
frequency).
5. Building Posting Lists:
o For each sorted term, create a posting list that includes the documents where
the term appears.
6. Index Storage:
o Store the sorted terms and their posting lists in a way that allows efficient
querying.
Advantages
• Efficient Merging: When merging multiple indexes, sorted terms make it easier to
merge posting lists by simply scanning through the sorted lists.
• Optimized Query Performance: Sorted terms facilitate fast binary search for
individual terms.
Disadvantages
• Sorting Overhead: Sorting all extracted terms can be expensive for large collections.
• Costly Updates: Adding new terms to an already sorted structure may require rebuilding it.
Merge-based index construction involves dividing the index-building process into smaller
segments and then merging them to create the final index. This method is particularly
useful for large datasets.
Process
1. Data Collection:
o Divide the dataset into manageable chunks or partitions.
2. Local Index Construction:
o Build a separate index for each chunk using either in-memory or sort-based
methods.
3. Merge Process:
o Merge these local indexes into a single global index. This typically involves:
▪ Merging posting lists for terms that appear in multiple chunks.
▪ Ensuring that the final index maintains the correct term-document
mappings.
4. Final Index Construction:
o The result is a unified index that integrates all the individual local indexes.
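The merge step can be sketched as combining per-chunk indexes, merging the sorted posting lists of any term that appears in more than one chunk (the chunk contents are assumed):

```python
import heapq

def merge_indexes(partials):
    """Merge per-chunk indexes (term -> sorted doc-ID list) into one global index."""
    merged = {}
    for part in partials:
        for term, postings in part.items():
            if term in merged:
                # two sorted lists -> one sorted list (heapq.merge scans both)
                merged[term] = list(heapq.merge(merged[term], postings))
            else:
                merged[term] = list(postings)
    return merged

part1 = {"cat": [1, 2], "dog": [2]}   # local index of chunk 1 (toy data)
part2 = {"cat": [5], "fox": [4]}      # local index of chunk 2 (toy data)
print(merge_indexes([part1, part2])["cat"])  # [1, 2, 5]
```

Because both posting lists are already sorted by document ID, the merge is a single linear scan, which is what makes merge-based construction efficient.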
Advantages
• Scalability: Allows handling of very large datasets by breaking them into smaller,
more manageable parts.
• Efficiency: Merging is generally efficient, especially with sorted lists and optimized
merge algorithms.
Disadvantages
• Merge Cost: The final merge pass adds extra I/O and processing time.
• Temporary Storage: Partial indexes must be kept until the merge completes.
Disk-based index construction is used for very large datasets where in-memory indexing
is not feasible due to size constraints. This approach involves constructing and managing
the index using disk storage.
Process
1. Data Collection:
o Read data from disk and process it in chunks.
2. Tokenization and Normalization:
o Process documents to extract and normalize terms.
3. Chunk-Based Indexing:
o Build partial indexes for each chunk of data, storing them on disk.
4. Merging Indexes:
o Merge these partial indexes into a final, global index stored on disk. This
involves:
▪ Merging posting lists and ensuring the final index is coherent and
optimized.
5. Final Storage:
o The final index is stored on disk and optimized for efficient retrieval and
querying.
Advantages
• Handles Large Datasets: Suitable for very large datasets that exceed memory
capacity.
• Persistence: Index is persistent across system restarts and crashes.
Disadvantages
• Slower Access: Disk access is slower than in-memory operations, which can impact
query performance.
• Complexity: Managing disk storage and handling I/O operations adds complexity to
the indexing process.
Q. Dynamic Indexing
Dynamic Indexing refers to the process of updating an existing index in real-time or near-
real-time to accommodate new or modified documents without having to rebuild the entire
index from scratch.
Dynamic indexing allows the index to be updated incrementally, meaning only the new or
changed documents are processed rather than re-indexing all documents. This approach is
efficient and reduces the amount of work required.
Real-Time or Near-Real-Time Indexing:
• Changes to the data (e.g., new documents, updates, or deletions) are reflected in the
index as soon as possible, providing users with up-to-date search results
Q. Query Processing for Ranked Retrieval
Query processing for ranked retrieval involves searching an index to retrieve documents
based on their relevance to a given query, with the results ranked according to a relevance
score. This process is fundamental to search engines and information retrieval systems,
aiming to return the most relevant results at the top of the list.
Query Parsing
Term Weighting
• Term Frequency (TF): Measures how often a term appears in a document. More
frequent terms typically indicate higher relevance.
• Inverse Document Frequency (IDF): Measures how important a term is by
considering how often it appears across all documents. Rare terms across documents
are weighted higher
• TF-IDF: Combines TF and IDF to calculate the relevance of a term in a document.
The formula is:
TF-IDF(t, d) = TF(t, d) × IDF(t)
IDF(t) = log(N / DF(t))
where N is the total number of documents, and DF(t) is the number of documents
containing term t.
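The TF-IDF formula above can be computed directly; the tokenized documents here are assumed for illustration, and the natural logarithm is used:

```python
import math

def idf(term, docs):
    """IDF(t) = log(N / DF(t)), with N documents and DF(t) containing the term."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc_id, docs):
    """TF-IDF(t, d) = TF(t, d) * IDF(t)."""
    tf = docs[doc_id].count(term)
    return tf * idf(term, docs)

# Toy tokenized collection (assumed for illustration).
docs = {
    1: ["the", "cat", "sat"],
    2: ["the", "dog", "barked"],
    3: ["cat", "and", "dog"],
}
print(round(tf_idf("cat", 1, docs), 3))  # log(3/2) ≈ 0.405
```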
Document Scoring
• Score Calculation: Each document's relevance score is calculated based on the query
terms and their TF-IDF weights.
Ranking
• Sorting: Documents are sorted based on their relevance scores, with higher scores
indicating greater relevance.
Post-Processing
• Relevance Feedback: Adjustments are made based on user feedback to refine search
results.
Documents:
Steps:
1. Query Parsing:
o Tokens: ["cat", "dog"]
2. Term Weighting:
o Calculate TF-IDF for each term in each document.
▪ "cat":
▪ Doc 1: 0.5
▪ Doc 2: 0.5
▪ Doc 3: 0.8
▪ "dog":
▪ Doc 1: 0
▪ Doc 2: 0.8
▪ Doc 3: 0.8
3. Document Scoring:
o Score Calculation for each document based on TF-IDF weights:
▪ Doc 1: Score = 0.5 (cat) + 0 (dog) = 0.5
▪ Doc 2: Score = 0.5 (cat) + 0.8 (dog) = 1.3
▪ Doc 3: Score = 0.8 (cat) + 0.8 (dog) = 1.6
4. Ranking:
o Rank Documents based on their scores:
▪ Doc 3: 1.6
▪ Doc 2: 1.3
▪ Doc 1: 0.5
5. Post-Processing:
o Relevance Feedback: If users indicate that Doc 2 is more relevant,
adjustments may be made to improve future retrievals.
Q. Document-at-a-Time (DAAT) Query Processing
1. Parse the Query: The query is parsed to extract terms and their weights.
2. Retrieve Posting Lists: For each query term, retrieve the posting list from the
inverted index. A posting list contains document IDs and, in some cases, term
frequency information for each document.
3. Combine Posting Lists: If the query has multiple terms, combine their posting lists
(union for OR semantics, intersection for AND) to compile the list of candidate
documents, i.e., those containing at least one of the query terms.
4. Score Documents:
• Score Each Document: For each document in the list, calculate its relevance score
based on the query terms. This typically involves:
o Term Frequency (TF): How often the query terms appear in the document.
o Inverse Document Frequency (IDF): How common the query terms are
across all documents.
o TF-IDF Calculation: For each query term in the document, compute the TF-
IDF score and aggregate the scores to get the document's overall score.
5.Rank Documents:
• Sort by Score: Sort the documents based on their computed relevance scores in
descending order.
6.Return Results:
• Present Top Documents: Return the top-ranked documents as the search results
Example
Documents:
Steps:
1. Query Processing:
o Query Terms: ["quick", "dog"]
2. Inverted Index Lookup:
o Posting List for "quick": [Doc 1, Doc 3]
o Posting List for "dog": [Doc 2, Doc 3]
3. Combine Posting Lists:
o Documents to Process: Combine the lists to get [Doc 1, Doc 2, Doc 3]
4. Score Documents:
o Doc 1:
o TF("quick") = 1, TF("dog") = 0
o Score = TF-IDF("quick") + TF-IDF("dog") = TF-IDF("quick")
o Doc 2:
o TF("quick") = 0, TF("dog") = 1
o Score = TF-IDF("dog")
o Doc 3:
o TF("quick") = 1, TF("dog") = 1
o Score = TF-IDF("quick") + TF-IDF("dog")
5. Rank Documents: Doc 3 will have the highest score because it contains both query
terms.
6. Return Results: Return Doc 3, followed by Doc 1 and Doc 2 based on their scores.
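The document-at-a-time flow can be sketched as follows, scoring each candidate document completely before moving to the next; the index mirrors the example's posting lists, and the term frequencies are assumed:

```python
import math

# Toy index: term -> {doc_id: term frequency} (assumed data).
index = {
    "quick": {1: 1, 3: 1},
    "dog":   {2: 1, 3: 1},
}
N = 3  # total number of documents

def idf(term):
    return math.log(N / len(index[term]))

def daat_search(query_terms):
    """Document-at-a-time: visit each candidate document once,
    summing TF-IDF contributions from every query term."""
    candidates = sorted(set().union(*(index[t] for t in query_terms)))
    scores = []
    for doc in candidates:                  # one document at a time
        score = sum(index[t].get(doc, 0) * idf(t) for t in query_terms)
        scores.append((doc, score))
    return sorted(scores, key=lambda s: -s[1])

for doc, score in daat_search(["quick", "dog"]):
    print(doc, round(score, 3))             # Doc 3 ranks first
```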
Disadvantages:
• Less Efficient for Large Queries: For queries with many terms, processing each
document for every query term can be less efficient compared to Term-at-a-Time,
which processes terms across all documents.
• Higher Overhead: Can have higher overhead if documents are large and contain
many terms.
Q. Term-at-a-Time (TAAT) Query Processing
1. Query Parsing:
• Parse the query to extract its terms.
2. Inverted Index Lookup:
• Retrieve Posting Lists: For each query term, retrieve its posting list from the inverted
index. A posting list contains document IDs and, in some cases, term frequency
information for each document.
3. Term Scoring
• Calculate Term Relevance: For each term, calculate its relevance in the context of
the documents it appears in. This involves:
o Term Frequency (TF): How often the term appears in a document.
o Inverse Document Frequency (IDF): How important the term is across all
documents.
o TF-IDF or Similar Metrics: Compute the TF-IDF score or other relevance
scores for each term in each document.
4. Combine Results
• Aggregate Scores: Combine the scores of documents for each term to determine the
overall relevance. This may involve summing or averaging term scores.
5.Rank Documents
• Sort by Score: Rank documents based on the combined scores from all query terms.
6. Return Results
• Present Top Documents: Return the documents with the highest relevance scores.
Example
Documents:
Steps:
1. Query Parsing:
o Query Terms: ["quick", "dog"]
2. Inverted Index Lookup:
o Posting List for "quick": [Doc 1, Doc 3]
o Posting List for "dog": [Doc 2, Doc 3]
3. Term Scoring: Compute the TF-IDF score of each query term in each document.
4. Combine Results:
• Doc 1:
o TF-IDF("quick") = 0.5
o TF-IDF("dog") = 0 (not present)
o Total Score = 0.5
• Doc 2:
o TF-IDF("quick") = 0 (not present)
o TF-IDF("dog") = 0.8
o Total Score = 0.8
• Doc 3:
o TF-IDF("quick") = 0.8
o TF-IDF("dog") = 0.8
o Total Score = 1.6
5. Rank Documents:
• Doc 3: 1.6
• Doc 2: 0.8
• Doc 1: 0.5
6. Return Results:
• Return Doc 3 first, followed by Doc 2 and then Doc 1, based on their scores.
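The term-at-a-time flow can be sketched with per-document score accumulators, processing one posting list fully before moving to the next; the index mirrors the example's posting lists, and the term frequencies are assumed:

```python
import math
from collections import defaultdict

# Toy index: term -> {doc_id: term frequency} (assumed data).
index = {
    "quick": {1: 1, 3: 1},
    "dog":   {2: 1, 3: 1},
}
N = 3  # total number of documents

def taat_search(query_terms):
    """Term-at-a-time: process one posting list completely before the next,
    accumulating partial scores per document."""
    acc = defaultdict(float)                 # one score accumulator per document
    for term in query_terms:                 # one term at a time
        w = math.log(N / len(index[term]))   # IDF weight of this term
        for doc, tf in index[term].items():
            acc[doc] += tf * w               # add this term's contribution
    return sorted(acc.items(), key=lambda s: -s[1])

print([doc for doc, _ in taat_search(["quick", "dog"])])  # [3, 1, 2]
```

Compared with document-at-a-time, this touches each posting list exactly once but needs memory for an accumulator per candidate document.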