
UNIT-2

1. Cataloging and Indexing:


• History and Objectives of Indexing
• Indexing Process
• Automatic Indexing
• Information Extraction
2. Data Structures:
• Introduction to Data Structure
• Stemming Algorithms
• Inverted File Structure
• N-Gram Data Structures
• PAT Data Structure
• Signature File Structure
• Hypertext and XML Data Structures
• Hidden Markov Models
3.1 History and Objectives of Indexing

OVERVIEW

• The indexing process determines which terms (concepts) can represent a particular item.
• Search: once the searchable data structure has been created, techniques must be defined that correlate the user-entered query statement to the set of items in the database to determine the items to be returned to the user.
• Information extraction extracts specific information to be normalized and entered into a structured database (DBMS).
3.1.1 History
• Indexing (originally called cataloging) is the oldest technique for identifying the contents of items to assist in their retrieval.
• The indexing process is typically performed by professional
indexers associated with library organizations.
• Throughout the history of libraries, this has been the most
important and most difficult processing step.
• Most items are retrieved based upon what the item is about.
• The user's ability to find items on a particular subject is limited
by the indexer creating index terms for that subject.
• The user, instead of searching through physical cards in a card catalog, now performs a search on a computer and electronically displays the card equivalents.
3.1.2 Objectives
• The objectives of indexing have changed with the evolution
of Information Retrieval Systems
• The full text searchable data structures for items in the Document File provide a new class of indexing called total document indexing.
• All of the words within the item are potential index
descriptors of the Subjects of the item.
• Users may want to constrain the search by their Private Index file to increase the precision of the search.
• Public Indexing of the concept adds additional index terms
over the words in the item to achieve abstraction.
• The index file uses fewer terms than are found in the items because it indexes only the important concepts.
• Private Index files are even more focused, limiting the number
of items indexed to those that have value to the user and
within items only the concepts bounded by the specific user's
interest domain.
• There is overlap between the Private and Public Index files,
but the Private Index file is indexing fewer concepts in an item
than the Public Index file
3.2 Indexing Process
• When an organization with multiple indexers decides to create a public or private index, procedural decisions on how to create the index terms assist the indexers and end users in knowing what to expect in the index file.
• The first decision is the scope of the indexing to define
what level of detail the subject index will contain.
• This is based upon usage scenarios of the end users.
• The other decision is the need to link index terms
together in a single index for a particular concept.
• Linking index terms is needed when there are multiple
independent concepts found within an item.
3.2.1 Scope of Indexing
• When performed manually, the process of reliably and
consistently determining the bibliographic terms that represent
the concepts in an item is extremely difficult.
• Problems arise from interaction of two sources:
1. The author
2. The indexer.
• The vocabulary domain of the author may be different than that
of the indexer, causing the indexer to misinterpret the emphasis
and possibly even the concepts being presented.
• The indexer is not an expert on all areas and has different levels of
knowledge in the different areas being presented in the item.
• This results in different quality levels of indexing.
• There are two factors involved in deciding on what
level to index the concepts in an item
1. Exhaustivity
2. Specificity
• Exhaustivity of indexing is the extent to which the different concepts in the item are indexed.
• Specificity relates to the preciseness of the index terms used in indexing.
• Indexing an item only on the most important concept in it and using general index terms yields low exhaustivity and specificity.
• What portions of an item should be indexed: only the title, or the title plus the abstract?
• Weighting of index terms is not common in manual indexing
• Weighting is the process of assigning an importance to an
index term’s use in an item
• The weight should represent the degree to which the concept
associated with the index term is represented in the item.
• The manual process of assigning weights adds additional
overhead on the indexer and requires a more complex data
structure to store the weights.
3.2.2 Precoordination and Linkages
• Linkages are used to correlate related attributes
associated with concepts discussed in an item.
• This process of creating term linkages at index creation
time is called precoordination.
• When index terms are not coordinated at index time, the
coordination occurs at search time. This is called
postcoordination.
• Postcoordination is implemented by "AND"ing index terms together, which finds only those items whose indexes contain all of the search terms.
• Assume that an item discusses the "drilling of oil wells in Mexico by CITGO" and the introduction of "oil refineries in Peru by the U.S."
• When the linked capability is added, the system does not
erroneously relate Peru and Mexico since they are not in the
same set of linked items.
• It still does not have the ability to discriminate between which
country is introducing oil refineries into the other country.
• Introducing roles in the last two examples removes this
ambiguity.
• Positional roles treat the data as a vector allowing only one
value per position.
3.3 AUTOMATIC INDEXING
• Automatic indexing is the capability for the system to
automatically determine the index terms to be assigned
to an item.
• The simplest case is when all words in the document are
used as possible index terms.
• The advantages of human indexing are the ability to
determine concept abstraction and judge the value of a
concept.
• The disadvantages of human indexing over automatic
indexing are cost, processing time and consistency.
• Automatic indexing requires only a few seconds or less of
computer time based upon the size of the processor and
the complexity of the algorithms to generate the index.
• Human indexers typically generate different indexing for the
same document.
• Indexes resulting from automated indexing fall into two classes:
– weighted
– unweighted
• In a weighted indexing system, an attempt is made to place a
value on the index term's representation of its associated
concept in the document.
• An index term's weight is based upon a function associated
with the frequency of occurrence of the term in the item
• In an unweighted indexing system, the existence of an index
term in a document and sometimes its word location are kept as
part of the searchable data structure
3.3.1 Indexing by Term:
• When the terms of the original item are used as a basis of the
index process, there are two major techniques for creation of the
index:
o statistical
o natural language.
• Statistical techniques can be based upon vector models and
probabilistic models with a special case being Bayesian models.
• They are classified as statistical because their calculation of weights uses statistical information such as the frequency of occurrence of words and their distributions in the searchable database.
• Natural language techniques also use some statistical information,
but perform more complex parsing to define the final set of index
concepts.
• Often weighted systems are discussed as vectorized
information systems
• The system emphasizes weights as a foundation for
information detection and stores these weights in a vector
form.
• Each vector represents a document and each position in a
vector represents a different unique word in the database.
• The value assigned to each position is the weight of that
term in the document.
• A value of zero indicates that the word was not in the
document.
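• A minimal Python sketch of this representation (the documents and raw term-frequency weights here are illustrative assumptions, not from the original text):

```python
# Build document vectors over the unique words in a tiny database.
# Each vector position corresponds to one vocabulary word; the value
# is the term frequency in that document, and zero means "absent."
from collections import Counter

docs = {
    1: "oil wells in mexico",
    2: "oil refineries in peru",
}

# One vector position per unique word in the database.
vocab = sorted({w for text in docs.values() for w in text.split()})

def doc_vector(text: str) -> list[int]:
    counts = Counter(text.split())
    return [counts.get(word, 0) for word in vocab]

print(vocab)                # ['in', 'mexico', 'oil', 'peru', 'refineries', 'wells']
print(doc_vector(docs[1]))  # [1, 1, 1, 0, 0, 1]
print(doc_vector(docs[2]))  # [1, 0, 1, 1, 1, 0]
```
• Real systems typically replace the raw frequency with a weighting function that also discounts words common to most items in the database.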

• In addition to a vector model, the other dominant approach uses a probabilistic model.
• The model that has been most successful in this area
is the Bayesian approach.
• This approach is natural to information systems and is
based upon the theories of evidential reasoning
• A Bayesian network is a directed acyclic graph in
which each node represents a random variable and
the arcs between the nodes represent a probabilistic
dependence between the node and its parents
3.3.2 Indexing by Concept
• The basis for concept indexing is that there are many ways
to express the same idea and increased retrieval
performance comes from using a single representation.
• Concept indexing determines a canonical set of concepts
based upon a test set of terms and uses them as a basis for
indexing all items
• This is also called Latent Semantic Indexing because it is
indexing the latent semantic information in items.
• The determined set of concepts does not have a label associated with each concept but is a mathematical representation (e.g., a vector).
• Two neural networks are used.
• One neural network learning algorithm generates stem
context vectors that are sensitive to similarity of use
and another one performs query modification based
upon user feedback.
• Word stems, items and queries are represented by high
dimensional vectors called context vectors.
• Each dimension in a vector could be viewed as an
abstract concept class.
• For any word stem k, its context vector V_k is an n-dimensional vector, with each component j interpreted as follows:
• The interpretation of components for concept vectors
is exactly the same as weights in neural networks.
• Each of the n features is viewed as an abstract
concept class.
• Then each word stem is mapped to how strongly it
reflects each concept in the items in the corpus.
3.4 Information Extraction
• There are two processes associated with information
extraction:
1. Determination of facts to go into structured fields in a
database
• In this case only a subset of the important facts in an item
may be identified and extracted.
2. Extraction of text that can be used to summarize an item.
– In summarization all of the major concepts in the item
should be represented in the summary.
• The process of extracting facts to go into indexes is
called Automatic File Build
• Its goal is to process incoming items and extract index
terms that will go into a structured database.
• The term “slot” is used to define a particular category of
information to be extracted.
• Metrics used are overgeneration and fallout.
• Overgeneration measures the amount of irrelevant information that is extracted.
• Fallout measures how much a system assigns incorrect slot fillers
as the number of potential incorrect slot fillers increases.
• Different algorithms produce different summaries.
• Just as different humans create different abstracts for the same item, automated techniques that generate different summaries do not intrinsically imply major deficiencies between the summaries.
• Most automated algorithms approach summarization by calculating a score for each sentence and then extracting the sentences with the highest scores (a sketch follows the feature list below).
• They selected the following five feature sets as a basis
for their algorithm:
• Sentence Length Feature that requires a sentence to be over five words in length
• Fixed Phrase Feature that looks for the existence of phrase "cues" (e.g., "in conclusion")
• Paragraph Feature that places emphasis on the first ten
and last five paragraphs in an item and also the location
of the sentences within the paragraph
• Thematic Word Feature that uses word frequency
• Uppercase Word Feature that places emphasis on
proper names and acronyms.
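• A hedged sketch of sentence scoring along these lines, using simplified versions of three of the five features (sentence length, fixed-phrase cues, thematic word frequency); the cue list and feature weights are illustrative assumptions:

```python
# Extractive summarization by scoring sentences and keeping the top n.
import re
from collections import Counter

CUE_PHRASES = ("in conclusion", "in summary")   # fixed-phrase cues (assumed)

def summarize(text: str, top_n: int = 2) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))  # thematic frequencies

    scored = []
    for s in sentences:
        tokens = re.findall(r"[a-z]+", s.lower())
        if len(tokens) <= 5:                    # sentence length feature
            continue
        score = sum(freq[t] for t in tokens) / len(tokens)
        if any(cue in s.lower() for cue in CUE_PHRASES):
            score += 10.0                       # fixed-phrase bonus (assumed weight)
        scored.append((score, s))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:top_n]]
```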
DATA STRUCTURES
1. Introduction to Data Structures
2. Stemming Algorithms
– Porter Stemming Algorithm
– Dictionary Look-up Stemmers
– Successor Stemmers
3. Inverted File Structure
4. N-Gram Data Structure
5. PAT Data Structure
6. Signature File Structure
7. Hypertext Data Structure
4.1 Introduction to Data Structures
• There are usually two major data structures in any
information system.
• One structure stores and manages the received
items in their normalized form.
• The process supporting this structure is called the
"document manager."
• The other major data structure contains the
processing tokens and associated data to
support search.
• The results of a search are references to the items
that satisfy the search statement, which are passed
to the document manager for retrieval.
• One of the first transformations often applied
to data before placing it in the searchable data
structure is stemming
• The risk with stemming is that concept
discrimination information may be lost
• On the positive side, stemming has the
potential to improve recall
4.2 Stemming Algorithms
4.2.1 Introduction to the Stemming Process
• Stemming algorithms are used to improve the
efficiency of the information system and to improve
recall
• For example, the stem "comput" could associate "computable, computability, computation, computational, computed, computing, computer, computerese, computerize" to one compressed word.
• Stemming “blindly” strips off known affixes (prefixes
and suffixes) in an iterative fashion.
• Stemming provides compression and savings in storage.
• Stemming improves recall
• The stemming process has to categorize a word prior to making the decision to stem it.
• Proper names and acronyms should not be stemmed, as they are not related to a common core concept.
• The stemming process in NLP causes some loss of information.
1. Porter Stemming Algorithm
2. Dictionary Look-up Stemmers
3. Successor stemmers
Porter Stemming Algorithm
• The Porter algorithm consists of a set of condition/action rules.
• The conditions fall into three classes:
– Conditions on the stem
– Conditions on the suffix
– Conditions on rules
• Conditions on the stem:
1. The measure, denoted m, of a stem is based on its alternating vowel-consonant sequences: [C](VC)^m[V], where C denotes a sequence of consonants, V a sequence of vowels, and the bracketed parts are optional (a sketch computing m follows this list).
m=0: TR, EE, TREE, Y, BY
m=1: TROUBLE, OATS, TREES, IVY
m=2: TROUBLES, PRIVATE, OATEN
2. *<X> --- the stem ends with a given letter X
3. *v* --- the stem contains a vowel
4. *d --- the stem ends in a double consonant
5. *o --- the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x or y
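• A small Python sketch (an illustration, not the full Porter algorithm) that computes the measure m from the [C](VC)^m[V] pattern; it reproduces the table values above:

```python
# Compute the Porter measure m of a stem by mapping it to a C/V
# pattern (collapsing runs of the same class) and counting "VC" pairs.
# Per the usual Porter convention, 'y' counts as a vowel when it
# follows a consonant.
def is_vowel(word: str, i: int) -> bool:
    if word[i] in "aeiou":
        return True
    return word[i] == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem: str) -> int:
    pattern = ""
    for i in range(len(stem)):
        cls = "V" if is_vowel(stem, i) else "C"
        if not pattern or pattern[-1] != cls:
            pattern += cls          # collapse runs: "troubles" -> "CVCVC"
    return pattern.count("VC")      # m is the number of VC alternations

for w in ("tree", "by", "trouble", "oats", "troubles", "private"):
    print(w, measure(w))  # tree 0, by 0, trouble 1, oats 1, troubles 2, private 2
```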
• Further measure examples:
m=0: free, why
m=1: frees, whose
m=2: prologue, compute
• Conditions on the suffix take the form: (current_suffix == pattern)
• Conditions on rules: rules are divided into steps that define the order in which the rules are applied.
• Given the word "duplicatable," the following are the steps in the stemming process:
duplicatable → duplicat (rule 4)
duplicat → duplicate (rule 1b1)
duplicate → duplic (rule 2)
• The application of another rule in step 4, removing "ic," cannot occur since only one rule from each step is allowed to be applied.

• Exercise: apply the Porter stemming steps to the following words: irreplaceable, informative, activation, and triplicate.
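• One way to check your answers, assuming the NLTK package is installed (pip install nltk); note that NLTK's PorterStemmer includes some later refinements to the original 1980 algorithm, so edge cases may differ from hand-worked results:

```python
# Run an off-the-shelf Porter stemmer over the exercise words.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ("irreplaceable", "informative", "activation", "triplicate"):
    print(word, "->", stemmer.stem(word))
```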
Dictionary Look-Up Stemmers
• In this approach, simple stemming rules still may be
applied.
• The rules are taken from those that have the fewest
exceptions.
• The original term or stemmed version of the term is
looked up in a dictionary and replaced by the stem
that best represents it.
• This technique has been implemented in the INQUERY and RetrievalWare systems.
• The INQUERY system uses a stemming technique
called Kstem.
• It tries to avoid collapsing words with different meanings into the same root. For example, "memorial" and "memorize" would reduce to "memory," but they are not synonyms of "memory," so Kstem avoids that conflation.

• Kstem, like other stemmers associated with Natural Language Processors and dictionaries, returns words instead of truncated word forms.
• Kstem requires a word to be in the dictionary before it reduces one word form to another.
• The Kstem system uses the following six major data files to control and
limit the stemming process:
1) Dictionary of words (lexicon)
2) Supplemental list of words for the dictionary
3) Exceptions list for those words that should retain an "e" at the end
(e.g., "suites" to "suite" but "suited" to "suit")
4) Direct_Conflation - allows definition of direct conflation via word pairs
that override the stemming algorithm
5) Country_Nationality - conflations between nationalities and countries
("British" maps to "Britain")
6) Proper Nouns - a list of proper nouns that should not be stemmed.
Successor Stemmers
• Successor stemmers are based upon the length of
prefixes that optimally stem expansions of additional
suffixes
• The process determines the successor varieties for a
word, uses this information to divide a word into
segments and selects one of the segments as the
stem.
• The successor variety for any prefix of a word is the
number of children that are associated with the node
in the symbol tree representing that prefix.
• For example, the successor variety for the first letter
"b" is three. The successor variety for the prefix "ba" is
two.
• Symbol tree for the terms bag, barn, bring, both, box, and bottle.
• The successor varieties of a word are used to segment the word by applying one of the following four methods (a sketch of the computation follows the list):
• 1. Cutoff method: a cutoff value is selected to define
stem length. The value varies for each possible set of
words.
• 2. Peak and Plateau: a segment break is made after a
character whose successor variety exceeds that of the
character immediately preceding it and the character
immediately following it.
• 3. Complete word method: break on boundaries of
complete words.
• 4. Entropy method: uses the distribution of successor
variety letters.
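• A minimal Python sketch of successor variety and the peak-and-plateau method, using the symbol-tree corpus above (the segmentation rule is a simplified reading of the method, for illustration):

```python
CORPUS = ["bag", "barn", "bring", "both", "box", "bottle"]

def successor_variety(prefix: str, corpus: list[str]) -> int:
    # Distinct letters that can follow this prefix in the corpus,
    # i.e. the number of children of the prefix's symbol-tree node.
    successors = {w[len(prefix)] for w in corpus
                  if w.startswith(prefix) and len(w) > len(prefix)}
    return len(successors)

def peak_and_plateau_segments(word: str, corpus: list[str]) -> list[str]:
    # Break after a character whose successor variety exceeds that of
    # the characters immediately before and after it.
    v = [successor_variety(word[:i + 1], corpus) for i in range(len(word))]
    segments, start = [], 0
    for i in range(1, len(word) - 1):
        if v[i] > v[i - 1] and v[i] > v[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments

print(successor_variety("b", CORPUS))   # 3, as in the tree example above
print(successor_variety("ba", CORPUS))  # 2
```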
4.3 Inverted File Structure
• The most common data structure used in both database
management and Information Retrieval Systems is the
inverted file structure.
• Inverted file structures are composed of three basic files
– The document file,
– The inversion lists (sometimes called posting files)
– The dictionary.
• Each document in the system is given a unique numerical
identifier.
• It is that identifier that is stored in the inversion list.
• The way to locate the inversion list for a particular word is via
the Dictionary.
• The Dictionary is typically a sorted list of all unique words in
the system and a pointer to the location of its inversion list
• Thus if the word "bit" was the tenth, twelfth and eighteenth word in document #1, then the inversion list would appear:
bit → 1(10), 1(12), 1(18)
• Weights can also be stored in inversion lists.
Words with special characteristics are
frequently stored in their own dictionaries to
allow for optimum internal representation and
manipulation
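• A minimal Python sketch of these components, mirroring the "bit" example above (the two sample documents are assumptions for illustration):

```python
# The dictionary maps each unique word to its inversion (posting) list
# of (document id, word position) pairs; the document file itself is
# the `documents` mapping keyed by unique numerical identifier.
from collections import defaultdict

documents = {
    1: "the bit was a bit odd",
    2: "every bit helps",
}

dictionary = defaultdict(list)          # word -> inversion list
for doc_id, text in documents.items():
    for position, word in enumerate(text.split(), start=1):
        dictionary[word].append((doc_id, position))

print(dictionary["bit"])  # [(1, 2), (1, 5), (2, 2)]
```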
4.4 N-Gram Data Structures
• N-Grams can be viewed as a special technique for conflation
(stemming) and as a unique data structure in information
systems
• N-Grams are a fixed length consecutive series of "n" characters
• N-Grams do not care about semantics.
• Instead they are algorithmically based upon a fixed number of
characters.
• The searchable data structure is transformed into overlapping
n-grams, which are then used to create the searchable
database.
• Some systems allow interword symbols to be part of the n-gram set, usually excluding the single character with the interword symbol option.
• The symbol # is used to represent the interword symbol, which is any one of a set of symbols (e.g., blank, period, semicolon).
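• A short Python sketch of overlapping n-gram generation with the interword symbol # (the sample phrase is an illustration):

```python
# Slide a window of length n over the text, with spaces replaced by
# the interword symbol '#'.
def ngrams(text: str, n: int = 3, interword: str = "#") -> list[str]:
    s = text.replace(" ", interword)
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(ngrams("sea colony"))
# ['sea', 'ea#', 'a#c', '#co', 'col', 'olo', 'lon', 'ony']
```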
4.4.1 History
• Another major use of n-grams (in particular trigrams)
is in spelling error detection and correction
• Most approaches look at the statistics on the probability of occurrence of n-grams (trigrams in most approaches) in the English vocabulary and flag any word that contains non-existent or seldom-used n-grams as a potentially erroneous word.
• In information retrieval, trigrams have been used for text compression and to manipulate the length of index terms.
4.4.2 N-Gram Data Structure
• Trigrams were determined to be the optimal
length, trading off information versus size of
data structure.
• The advantage of n-grams is that they place a
finite limit on the number of searchable
tokens
4.5 PAT Data Structure
• A different view of addressing a continuous text
input data structure comes from PAT trees and
PAT arrays.
• The input stream is transformed into a
searchable data structure consisting of
substrings.
• The original concepts of PAT tree data
structures were described as Patricia trees.
• The name PAT is short for Patricia Trees
(PATRICIA stands for Practical Algorithm To
Retrieve Information Coded In Alphanumerics.)
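• A hedged sketch of the substrings ("sistrings," semi-infinite strings) that a PAT structure indexes; a real PAT tree stores them in a binary digital (Patricia) trie, whereas sorting them here merely gives a suffix-array-like view for illustration:

```python
# Every position in the input stream starts one sistring; searching a
# PAT structure amounts to locating sistrings by prefix.
text = "home sales top"

sistrings = [text[i:] for i in range(len(text))]   # one per position
for s in sorted(sistrings):
    print(repr(s))
```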
4.6 Signature File Structure
• The goal of a signature file structure is to provide a
fast test to eliminate the majority of items that are
not related to a query.
• The text of the items is represented in a highly
compressed form that facilitates the fast test.
• Because the file structure is highly compressed and unordered, it requires significantly less space.
• Since items are seldom deleted from information
data bases, it is typical to leave deleted items in place
and mark them as deleted
• Signature files provide a practical solution for storing
and locating information in a number of different
situations.
• The coding is based upon words in the item. The words are mapped into
a "word signature."
• A word signature is a fixed length code with a fixed number of bits set to "1."
• The bit positions that are set to one are determined via a hash function
of the word.
• The word signatures are OR ed together to create the signature of an
item.
• Longer code lengths reduce the probability of collision in hashing the
words
• Search of the signature matrix requires O(N) search time.
• To reduce the search time the signature matrix is partitioned
horizontally.
• If a query has fewer words than the number of words in a block, it maps to a number of possible slots rather than just one.
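• A hedged Python sketch of signature creation and the fast test (the code length, bits per word, and hash construction are illustrative assumptions):

```python
# Each word hashes to a fixed-length code with a fixed number of bits
# set; word signatures are ORed together into the item signature.
import hashlib

CODE_LENGTH = 16    # bits per signature (assumed)
BITS_PER_WORD = 3   # bits set to "1" per word signature (assumed)

def word_signature(word: str) -> int:
    sig = 0
    for k in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{word}:{k}".encode()).digest()
        bit = int.from_bytes(digest[:4], "big") % CODE_LENGTH
        sig |= 1 << bit     # bit collisions within a word can occur; harmless
    return sig

def item_signature(words: list[str]) -> int:
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

item = item_signature("oil wells in mexico".split())
query = word_signature("oil")
# Fast test: the item can match only if every query bit is set in it
# (false drops are possible and are resolved by a full text check).
print((item & query) == query)   # True
```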
4.7 Hypertext Data Structure
• The advent of the Internet and its exponential
growth and wide acceptance as a new global
information network has introduced a new
mechanism for representing information.
• This structure is called hypertext and differs
from traditional information storage data
structures in format and use.
• The hypertext is stored in Hypertext Markup
Language (HTML).
4.7.1 Definition of Hypertext Structure
• The Hypertext data structure is used extensively in the
Internet environment and requires an electronic media
storage for the item
• Hypertext allows one item to reference another item via an embedded pointer.
• Each separate item is called a node and the reference
pointer is called a link.
• The referenced item can be of the same or a different
data type than the original
• Hypertext Markup Language (HTML) defines the internal
structure for information exchange across the World
Wide Web on the Internet
• Tags are formatting or structural keywords contained between less-than and greater-than symbols (e.g., <title>, or <strong> meaning display prominently).
• "href" is the hypertext reference, and "a" and "/a" are the anchor start tag and anchor end tag.
• Hypertext is a non-sequential directed graph
structure, where each node contains its own
information
4.7.2 Hypertext History
• The term "hypertext" came from Ted Nelson in 1965
• Hypertext became more available in the early 1990's via its
use in CDROMs for a variety of educational and
entertainment products.
• The lack of cost effective computers with sufficient speed
and memory to implement hypertext effectively was one of
the main inhibitors to its development.
• One of the first commercial uses of a hypertext system was the mainframe Hypertext Editing System, developed at Brown University by Andries van Dam and later sold to the Houston Manned Spacecraft Center, where it was used for Apollo mission documentation.
