0% found this document useful (0 votes)

934 views20 pages

Inverted File Structures in IR

The document describes file structures and inverted indexes used for information retrieval. It discusses lexicographical indices, inverted files, sorted array implementations of inverted files, and partitioning an inverted file index into multiple loads to fit in memory. The FAST-INV algorithm avoids explicit sorting by using a binary tree to build inverted file indexes in partitions that fit in memory, traversing the tree and postings lists to construct the final index.

Uploaded by

kidoseno85

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

934 views20 pages

Inverted File Structures in IR

Uploaded by

kidoseno85

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

File Structures

Information Retrieval: Data Structures and Algorithms

by W.B. Frakes and R. Baeza-Yates (Eds.)

Englewood Cliffs, NJ: Prentice Hall, 1992.

(Chapters 3-5)

File Structures for IR

lexicographical indices
indices that are sorted e.g. inverted files e.g. Patricia (PAT) trees

cluster file structures indices based on hashing

signature files

Inverted Files
Information Retrieval: Data Structures and Algorithms

by W.B. Frakes and R. Baeza-Yates (Eds.)

Englewood Cliffs, NJ: Prentice Hall, 1992.

(Chapters 3)

Inverted Files

Each document is assigned a list of keywords or attributes. Each keyword (attribute) is associated with operational relevance weights. An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword. Penalty
the size of inverted files ranges from 10% to 100% of more of the size of the text itself need to update the index as the data set changes

Indexing Restrications

A controlled vocabulary which is the collection of keywords that will be indexed. Words in the text that are not in the vocabulary will not be indexed A list of stopwords that for reasons of volume will not be included in the index A set of rules that decide the beginning of a word or a piece of text that is indexable A list of character sequences to be indexed (or not indexed)

Sorted array implementation of an inverted file

Structures used in Inverted Files

Sorted Arrays
store the list of keywords in a sorted array using a standard binary search advantage: easy to implement disadvantage: updating the index is expensive

Hashing Structures Tries (digital search trees) Combinations of these structures

Sorted Arrays
1. The input text is parsed into a list of words along with their location in the text. (time and storage consuming operation) 2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order. 3. Add term weights, or reorganize or compress the files.

Inversion of Word List

Dictionary and postings file

Idea: the file to be searched should be as short as possible split a single file into two pieces

e.g. data set: 38,304 records, 250,000 unique terms

(document #, frequency)

Producing an Inverted File for Large Data Sets without Sorting

Idea: avoid the use of an explicit sort by using a right-threaded binary tree

current number of term postings & the storage location of postings list

traverse the binary tree and the linked postings list

A Fast Inversion Algorithm

Principle 1
the large primary memories are available If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.

Principle 2
the inherent order of the input data It is very expensive to use polynomial or even nlogn sorting algorithms for large files

FAST-INV algorithm
concept postings/ pointers See p. 13.

Sample document vector

document number concept number (one concept number for each unique word) Similar to the documentword list shown in p. 7. The concept numbers are sorted within document numbers, and document numbers are sorted within collection

Preparation

Terminology
HCN= highest concept number in dictionary, or the number of words to be indexed L= number of document/concept pairs in the collection M= available primary memory size

Assumption
M>>HCN M<L

: the range of concepts for each primary load

(Doc,Con) ConLoad Load

Load FileCONPTR Offset

Preparation
1. Allocate an array, con_entries_cnt, of size HCN. 2. For each <doc#, con#> entry in the document vector file: increment con_entries_cnt[con#]

0 (1,1), (1,4).. 2 (2,3) .. 3 (3,1), (3,2), (3,5) ... 6 (4,2), (4,3) . 8 (con#, doc#)

Preparation (continued)
5. For each <con#,count> pair obtained from con_entries_cnt: if there is no room for documents with this concept to fit in the current load, then created an entry in the load table and initialize the next load entry; otherwise update information for the current load table entry.

Building Load Table

Terminology
LL= length of current load S= spread of concept numbers in the current load 8 bytes = space needed for each concept/weight pair 4 bytes = space needed for each concept to store count of postings for it

Constraints
8*LL+4*S<M

: the range of concepts for each primary load

(Doc,Con) ConLoad Load

Load FileCONPTR Offset

Inverted File Structures Overview
No ratings yet
Inverted File Structures Overview
10 pages
Multimedia Information Retrieval Overview
No ratings yet
Multimedia Information Retrieval Overview
38 pages
Document and Term Clustering Overview
No ratings yet
Document and Term Clustering Overview
53 pages
Indexing in Information Retrieval Systems
No ratings yet
Indexing in Information Retrieval Systems
25 pages
Machine Learning Viva Questions & Answers
100% (1)
Machine Learning Viva Questions & Answers
5 pages
Internz Learn: Internship Program Overview
No ratings yet
Internz Learn: Internship Program Overview
7 pages
Digital Search Trees Explained
No ratings yet
Digital Search Trees Explained
15 pages
Item Clustering in Automatic Indexing
No ratings yet
Item Clustering in Automatic Indexing
30 pages
DeltaX Associate Product Engineer Interview Guide
No ratings yet
DeltaX Associate Product Engineer Interview Guide
3 pages
Bioinformatics and Information Display Overview
No ratings yet
Bioinformatics and Information Display Overview
5 pages
Understanding Big Data Analytics
No ratings yet
Understanding Big Data Analytics
29 pages
Morphology in NLP Explained
No ratings yet
Morphology in NLP Explained
59 pages
Cataloging and Indexing Objectives
No ratings yet
Cataloging and Indexing Objectives
16 pages
Understanding MSXML2 DOMDocument60 Methods
100% (1)
Understanding MSXML2 DOMDocument60 Methods
57 pages
Distributed Object-Based Systems Overview
No ratings yet
Distributed Object-Based Systems Overview
40 pages
Automatic Indexing Techniques Overview
No ratings yet
Automatic Indexing Techniques Overview
28 pages
Minimum Edit Distance in NLP
No ratings yet
Minimum Edit Distance in NLP
57 pages
Haze Removal Techniques Overview
No ratings yet
Haze Removal Techniques Overview
34 pages
Understanding Double Line Graphs
No ratings yet
Understanding Double Line Graphs
24 pages
Aneka and Comet Cloud Architectures
No ratings yet
Aneka and Comet Cloud Architectures
67 pages
Objective Type Exam on Information Retrieval
No ratings yet
Objective Type Exam on Information Retrieval
9 pages
Objectives of Query Processing in DDBMS
No ratings yet
Objectives of Query Processing in DDBMS
2 pages
Robotics Process Automation Course Guide
No ratings yet
Robotics Process Automation Course Guide
110 pages
Transaction and Data Flow Testing Overview
No ratings yet
Transaction and Data Flow Testing Overview
36 pages
Language Modeling in NLP Explained
No ratings yet
Language Modeling in NLP Explained
28 pages
Transaction Flow Testing in Software
No ratings yet
Transaction Flow Testing in Software
15 pages
Default Baseline Models in Deep Learning
No ratings yet
Default Baseline Models in Deep Learning
2 pages
IRS Lab Manual: Unit 4 Overview
No ratings yet
IRS Lab Manual: Unit 4 Overview
13 pages
Data Structures Lab Manual - Anurag University
No ratings yet
Data Structures Lab Manual - Anurag University
76 pages
Automatic Indexing and Clustering Techniques
No ratings yet
Automatic Indexing and Clustering Techniques
48 pages
Global Cloud Resource Management
No ratings yet
Global Cloud Resource Management
20 pages
KRR Unit2 (Text Book)
No ratings yet
KRR Unit2 (Text Book)
81 pages
Computational Lexicons in NLP
No ratings yet
Computational Lexicons in NLP
9 pages
Overview of Energy Storage Systems
No ratings yet
Overview of Energy Storage Systems
54 pages
Understanding Hive as a NoSQL Database
No ratings yet
Understanding Hive as a NoSQL Database
9 pages
HCI in Software Development Lifecycle
No ratings yet
HCI in Software Development Lifecycle
15 pages
Overview of Clustering Techniques
No ratings yet
Overview of Clustering Techniques
11 pages
Software Testing Techniques Overview
No ratings yet
Software Testing Techniques Overview
51 pages
M2M Communication and IoT Interoperability
100% (1)
M2M Communication and IoT Interoperability
14 pages
GIS in Health Care Applications
No ratings yet
GIS in Health Care Applications
35 pages
Data Integration and Transformation Techniques
No ratings yet
Data Integration and Transformation Techniques
3 pages
Ids Unit-4
No ratings yet
Ids Unit-4
33 pages
Hashing Techniques in DBMS Explained
No ratings yet
Hashing Techniques in DBMS Explained
7 pages
Sna 5
No ratings yet
Sna 5
6 pages
Understanding XML and Its Applications
No ratings yet
Understanding XML and Its Applications
14 pages
Understanding Quality of Service in Networking
No ratings yet
Understanding Quality of Service in Networking
25 pages
Internet Search and Hypertext in IRS
100% (1)
Internet Search and Hypertext in IRS
1 page
Key Concepts in Natural Language Processing
No ratings yet
Key Concepts in Natural Language Processing
33 pages
User Search Techniques in IRS
No ratings yet
User Search Techniques in IRS
23 pages
Getting Lost in Reinforcement Learning
No ratings yet
Getting Lost in Reinforcement Learning
30 pages
OpenCV Multiscale Detection Guide
No ratings yet
OpenCV Multiscale Detection Guide
77 pages
Naive Bayes for Word Sense Disambiguation
No ratings yet
Naive Bayes for Word Sense Disambiguation
7 pages
Ambiguity Resolution in Parsing Models
No ratings yet
Ambiguity Resolution in Parsing Models
15 pages
Overview of Computing Paradigms
No ratings yet
Overview of Computing Paradigms
3 pages
JNTUH R22 Computer Networks Notes
No ratings yet
JNTUH R22 Computer Networks Notes
82 pages
Text Search Algorithms in IRS
No ratings yet
Text Search Algorithms in IRS
62 pages
NLTK Text Processing and TF-IDF
No ratings yet
NLTK Text Processing and TF-IDF
42 pages
Transaction Flow Testing Techniques
No ratings yet
Transaction Flow Testing Techniques
27 pages
Completed UNIT-III 20.9.17
No ratings yet
Completed UNIT-III 20.9.17
61 pages
Understanding Indexing Structures
No ratings yet
Understanding Indexing Structures
145 pages
Binary - Search & Binary - Insert A - Into Sorted Array
No ratings yet
Binary - Search & Binary - Insert A - Into Sorted Array
2 pages
B Tree Notes
No ratings yet
B Tree Notes
9 pages
2011 Dawson Stemmer
No ratings yet
2011 Dawson Stemmer
7 pages
Preprocessing Stemin JI
No ratings yet
Preprocessing Stemin JI
3 pages
TF-IDF for Feature Selection in Text Mining
No ratings yet
TF-IDF for Feature Selection in Text Mining
4 pages
S Teeming Porter
No ratings yet
S Teeming Porter
6 pages
Setting Koneksi VB Ke Mysql
No ratings yet
Setting Koneksi VB Ke Mysql
14 pages
eCommerce Application Impact Analysis
No ratings yet
eCommerce Application Impact Analysis
5 pages
JSP Actions and Their Usage
No ratings yet
JSP Actions and Their Usage
7 pages
BadStore 1.2.3 Pentesting Guide
No ratings yet
BadStore 1.2.3 Pentesting Guide
12 pages
DF Unit01 QB Answers
No ratings yet
DF Unit01 QB Answers
15 pages
AR Transaction Completion Check
No ratings yet
AR Transaction Completion Check
6 pages
Lucas Silva Nascimento: Python Developer
No ratings yet
Lucas Silva Nascimento: Python Developer
3 pages
SNS College Bus Booking System Report
No ratings yet
SNS College Bus Booking System Report
78 pages
Manage Tekla Structures
No ratings yet
Manage Tekla Structures
426 pages
DBMS Question Paper BR23
No ratings yet
DBMS Question Paper BR23
2 pages
Mibokeh Camera Device Status Report
No ratings yet
Mibokeh Camera Device Status Report
177 pages
What Is RMAN
No ratings yet
What Is RMAN
6 pages
DBMS Handwritten Notes PDF
No ratings yet
DBMS Handwritten Notes PDF
135 pages
Introduction to Database Systems Concepts
No ratings yet
Introduction to Database Systems Concepts
30 pages
Cybersecurity Risk Assessment Report
No ratings yet
Cybersecurity Risk Assessment Report
9 pages
Java Task Manager Application Guide
No ratings yet
Java Task Manager Application Guide
9 pages
WorldSkills Cloud Computing Standards
No ratings yet
WorldSkills Cloud Computing Standards
30 pages
Nav Doc
100% (2)
Nav Doc
652 pages
Internship Report at Anvistar ITS Pvt. Ltd.
No ratings yet
Internship Report at Anvistar ITS Pvt. Ltd.
17 pages
Remote vs Onsite Audit Insights
No ratings yet
Remote vs Onsite Audit Insights
8 pages
VB.Net Framework Overview and Requirements
No ratings yet
VB.Net Framework Overview and Requirements
69 pages
Python SQL Database Programming Guide
No ratings yet
Python SQL Database Programming Guide
28 pages
Data Dictionary for ER Diagram
No ratings yet
Data Dictionary for ER Diagram
3 pages
Assignment 1
No ratings yet
Assignment 1
1 page
PC Troubleshooting and Maintenance Guide
100% (1)
PC Troubleshooting and Maintenance Guide
16 pages
Luniverse: Next-Gen Blockchain Platform
No ratings yet
Luniverse: Next-Gen Blockchain Platform
32 pages
Data Analytics for Business Insights
No ratings yet
Data Analytics for Business Insights
38 pages
SAP Application Log Overview and Usage
No ratings yet
SAP Application Log Overview and Usage
60 pages
Emme Software Training Courses Overview
No ratings yet
Emme Software Training Courses Overview
3 pages
Pega Platform Overview and Best Practices
No ratings yet
Pega Platform Overview and Best Practices
3 pages
Contents:: Introduction To ABAP Objects
No ratings yet
Contents:: Introduction To ABAP Objects
73 pages

Inverted File Structures in IR

Uploaded by

Inverted File Structures in IR

Uploaded by

File Structures

Information Retrieval: Data Structures and Algorithms

by W.B. Frakes and R. Baeza-Yates (Eds.)

File Structures for IR

cluster file structures indices based on hashing

by W.B. Frakes and R. Baeza-Yates (Eds.)

Sorted array implementation of an inverted file

Structures used in Inverted Files

Hashing Structures Tries (digital search trees) Combinations of these structures

Inversion of Word List

Dictionary and postings file

e.g. data set: 38,304 records, 250,000 unique terms

Producing an Inverted File for Large Data Sets without Sorting

traverse the binary tree and the linked postings list

A Fast Inversion Algorithm

Sample document vector

: the range of concepts for each primary load

(Doc,Con) ConLoad Load

Load FileCONPTR Offset

Building Load Table

: the range of concepts for each primary load

(Doc,Con) ConLoad Load

Load FileCONPTR Offset

You might also like