INFORMATION RETRIEVAL
SYSTEMS
III B.TECH - I SEMESTER
COMPUTER SCIENCE AND ENGINEERING
https://www.youtube.com/watch?v=F5--OFgRWr4
SVSV PRASAD SANABOINA
UNIT-III
Automatic Indexing
Classes of Automatic Indexing
Statistical Indexing
Natural Language
Concept Indexing
Hypertext Linkages
Summary
Document and Term Clustering:
Introduction to Clustering,
Thesaurus Generation,
Item Clustering,
Hierarchy of Clusters
The previous chapter introduced the concept and objectives of
indexing along with its history.
This chapter focuses on the process and algorithms to perform
indexing. The indexing process is a transformation of an item that
extracts the semantics of the topics discussed in the item.
The extracted information is used to create the processing tokens and
the searchable data structure. The semantics of the item not only
refers to the subjects discussed in the item but also in weighted
systems, the depth to which the subject is discussed.
The index can be based on the full text of the item, automatic or
manual generation of a subset of terms/phrases to represent the item,
natural language representation of the item or abstraction to concepts
in the item.
Classes of Automatic Indexing
Automatic indexing is the process of analyzing an item to extract the
information to be permanently kept in an index.
This process is associated with the generation of the searchable data
structures associated with an item. Figure Data Flow in an
Information Processing System is reproduced here as Figure 5.1 to
show where the indexing process is in the overall processing of an
item.
The figure is expanded to show where the search process relates to
the indexing process. The left side of the figure including Identify
Processing Tokens, Apply Stop Lists, Characterize tokens, Apply
Stemming and Create Searchable Data Structure is all part of the
indexing process.
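The indexing steps named above (identify processing tokens, apply stop lists, apply stemming, create a searchable data structure) can be sketched as a minimal pipeline. The stop list, suffix rules, and sample items below are illustrative assumptions, not any particular system's implementation:

```python
from collections import defaultdict

STOP_LIST = {"the", "a", "an", "of", "in", "is", "are"}  # assumed stop list

def stem(token):
    # Naive suffix stripping; real systems use Porter or similar stemmers.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_items(items):
    # items: {item_id: text} -> inverted index {stem: {item_id: frequency}}
    inverted = defaultdict(lambda: defaultdict(int))
    for item_id, text in items.items():
        for raw in text.lower().split():         # identify processing tokens
            token = raw.strip('.,;:()"\'')
            if not token or token in STOP_LIST:  # apply stop list
                continue
            inverted[stem(token)][item_id] += 1  # apply stemming and count
    return inverted
```

The inverted index produced here is one simple form of the searchable data structure the indexing process creates.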
Contd.
All systems go through an initial stage of zoning (described in
Section 1.3.1) and identifying the processing tokens used to
create the index. Some systems automatically divide the
document up into fixed length passages or localities, which
become the item unit that is indexed.
Filters, such as stop lists and stemming algorithms, are frequently
applied to reduce the number of tokens to be processed. The next
step depends upon the search strategy of a particular system.
Search strategies can be classified as statistical, natural language,
and concept. An index is the data structure created to support the
search strategy. Statistical strategies cover the broadest range of
indexing techniques and are the most prevalent in commercial
systems.
Contd.
The basis for a statistical approach is use of frequency of
occurrence of events. The events usually are related to occurrences
of processing tokens (words/phrases) within documents and within
the database. The words/phrases are the domain of searchable
values.
The statistics that are applied to the event data are probabilistic,
Bayesian, vector space, and neural net. The static approach stores a
single statistic, such as how often each word occurs in an item, that
is used in generating relevance scores after a standard Boolean
search.
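The static approach above, a standard Boolean search followed by term-frequency ranking of the Hit file, can be sketched as follows; the toy inverted index and query are illustrative assumptions:

```python
# term -> {item_id: occurrences}; a toy inverted index for illustration
index = {
    "retrieval": {1: 4, 2: 1, 3: 2},
    "systems":   {1: 1, 3: 5},
}

def boolean_and(index, terms):
    # The Hit file: items containing ALL query terms.
    hits = None
    for term in terms:
        postings = set(index.get(term, {}))
        hits = postings if hits is None else hits & postings
    return hits or set()

def rank_by_tf(index, terms, hits):
    # Relevance score = total frequency of the query terms in the item.
    scores = {item: sum(index[t].get(item, 0) for t in terms) for item in hits}
    return sorted(scores, key=scores.get, reverse=True)

hits = boolean_and(index, ["retrieval", "systems"])          # items 1 and 3
ranked = rank_by_tf(index, ["retrieval", "systems"], hits)   # 3 before 1
```

Note that the statistic plays no role in selecting the Hit file, only in ordering it afterwards, which is exactly the limitation of these older systems.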
Contd.
Finally, a special class of indexing can be defined by creation of
hypertext linkages.
These linkages provide virtual threads of concepts between items
versus directly defining the concept within an item. Each
technique has its own strengths and weaknesses.
Current evaluations from TREC conferences (see Chapter 11) show
that to maximize location of relevant items, applying several
different algorithms to the same corpus provides the optimum
results, but the storage and processing overhead is significant.
Statistical Indexing
Statistical indexing uses frequency of occurrence of events to
calculate a number that is used to indicate the potential relevance of
an item.
One approach, used in older systems, does not use the statistics
to aid in the initial selection, but uses them to assist in
calculating a relevance value for each item for ranking.
The documents are found by a normal Boolean search, and then
statistical calculations are performed on the Hit file to rank the
output (e.g., term frequency algorithms), since the index does not
contain any special ranking data.
Contd.
Probabilistic systems attempt to calculate a probability value that should
be invariant to both calculation method and text corpora. This allows
easy integration of the final results when searches are performed across
multiple databases that use different search algorithms.
A probability of 50 per cent would mean that if enough items are
reviewed, on average one half of the reviewed items are relevant.
The Bayesian and Vector approaches calculate a relative relevance value
(e.g., confidence level) that a particular item is relevant. Quite often term
distributions across the searchable database are used in the calculations.
An issue that continues to be researched is how to merge results, even
from the same search algorithm, from multiple databases. The problem is
compounded when an attempt is made to merge the results from different
search algorithms. This would not be a problem if true probabilities were
calculated.
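The merging problem above disappears if true, invariant probabilities are available: combining results from multiple databases reduces to sorting the union of the scored items. A minimal sketch with assumed per-database probability estimates:

```python
# Per-database relevance scores, assumed here to be true probabilities,
# so that scores from different databases are directly comparable.
db1 = {"a": 0.9, "b": 0.4}
db2 = {"c": 0.7, "d": 0.2}

# With invariant probabilities, merging is just sorting the combined results.
merged = sorted({**db1, **db2}.items(), key=lambda kv: kv[1], reverse=True)
```

When the scores are not true probabilities (e.g., algorithm-specific similarity values), this simple sort is no longer valid, which is why result merging remains a research issue.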
Probabilistic Weighting
The probabilistic approach is based upon direct application of the theory of
probability to information retrieval systems. This has the advantage of being able to
use the developed formal theory of probability to direct the algorithmic
development. It also leads to an invariant result that facilitates integration of results
from different databases. The use of probability theory is a natural choice because it
is the basis of evidential reasoning (i.e., drawing conclusions from evidence). This
is summarized by the Probability Ranking Principle (PRP) and its Plausible
Corollary
HYPOTHESIS: If a reference retrieval system’s response to each request is a
ranking of the documents in the collection in order of decreasing probability of
usefulness to the user who submitted the request, where the probabilities are
estimated as accurately as possible on the basis of whatever data is available for
this purpose, then the overall effectiveness of the system to its users is the best
obtainable on the basis of that data.
PLAUSIBLE COROLLARY: The most promising source of techniques for
estimating the probabilities of usefulness for output ranking in IR is standard
probability theory and statistics.
https://www.youtube.com/watch?v=mWLQn4G266g
Concept Indexing
Natural language processing starts with a basis of the terms within
an item and extends the information kept on an item to phrases and
higher level concepts such as the relationships between concepts.
In the DR-LINK system, terms within an item are replaced by an
associated Subject Code.
Use of subject codes or some other controlled vocabulary is one
way to map from specific terms to more general terms. Often the
controlled vocabulary is defined by an organization to be
representative of the concepts they consider important
representations of their data. Concept indexing takes the
abstraction a level further. Its goal is to gain the implementation
advantages of an index term system but use concepts instead of
terms as the basis for the index, producing a reduced dimension
vector space.
Contd.
Rather than a priori defining a set of concepts that the terms in an
item are mapped to, concept indexing can start with a number of
unlabeled concept classes and let the information in the items
define the concept classes created.
The process of mapping from a specific term to a concept that the
term represents is complex because a term may represent multiple
different concepts to different degrees. A term such as
“automobile” could be associated with concepts such as “vehicle,”
“transportation,” “mechanical device,” “fuel,” and “environment.”
The term “automobile” is strongly related to “vehicle,” less
strongly to “transportation,” and much less to the other terms.
Thus a term in an item needs to be represented by many concept
codes with different weights for a particular item.
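The weighted mapping described above can be sketched as follows. The concept classes for “automobile” follow the example in the text, but the numeric weights are assumptions for illustration:

```python
# Assumed weighted mapping from a term to the concepts it represents.
concept_codes = {
    "automobile": {
        "vehicle": 0.9,          # strongly related
        "transportation": 0.6,   # less strongly related
        "mechanical device": 0.3,
        "fuel": 0.2,
        "environment": 0.1,
    }
}

def concept_vector(terms):
    # Accumulate weighted concept codes over all terms in an item,
    # producing a reduced-dimension concept-space representation.
    vec = {}
    for term in terms:
        for concept, weight in concept_codes.get(term, {}).items():
            vec[concept] = vec.get(concept, 0.0) + weight
    return vec
```

Indexing the concept vector rather than the raw terms is what yields the reduced dimension vector space mentioned above.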
Hypertext Linkages
A new class of information representation, described in Chapter 4
as the hypertext data structure, is evolving on the Internet.
Hypertext data structures must be generated manually although
user interface tools may simplify the process.
Very little research has been done on the information retrieval
aspects of hypertext linkages and automatic mechanisms to use the
information of item pointers in creating additional search
structures. In effect, hypertext linkages are creating an additional
information retrieval dimension. Traditional items can be viewed as
two dimensional constructs. The text of the items is one dimension
representing the information in the items.
Contd.
Hypertext, with its linkages to additional electronic items, can be
viewed as networking between items that extends the contents. To
understand the total subject of an item it is necessary to follow
these additional information concept paths.
The embedding of the linkage allows the user to go immediately to
the linked item for additional information. The issue is how to use
this additional dimension to locate relevant information. The
easiest approach is to do nothing and let the user follow these paths
to view items. But this avoids one of the challenges in
information systems: creating techniques to assist the user in
finding relevant information.
Contd.
Looking at the Internet at the current time, there are three classes of
mechanisms to help find information: manually generated indexes,
automatically generated indexes, and web crawlers (intelligent
agents). YAHOO (http://www.yahoo.com) is an example of the first
case where information sources (home pages) are indexed
case where information sources (home pages) are indexed
manually into a hyperlinked hierarchy. The user can navigate
through the hierarchy by expanding the hyperlink on a particular
topic to see the more detailed subtopics.
Document and Term Clustering
• The previous chapters introduced indexing associated with
representation of the semantics of an item.
• Our information database can be viewed as being
composed of a number of independent items indexed by
a series of index terms.
• This model lends itself to two types of clustering:
clustering index terms to create a statistical thesaurus
and clustering items to create document clusters.
• In the first case clustering is used to increase recall by
expanding searches with related terms.
• In document clustering the search can retrieve items
similar to an item of interest, even if the query would
not have retrieved the item.
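The first case above, expanding a query with related terms from a statistical thesaurus to increase recall, can be sketched as follows; the thesaurus classes here are illustrative assumptions:

```python
# Assumed statistical thesaurus: term -> statistically related terms.
thesaurus = {
    "computer": ["processor", "hardware"],
    "processor": ["computer", "microprocessor"],
}

def expand_query(terms, thesaurus):
    # Add every related term for each original query term.
    expanded = set(terms)
    for term in terms:
        expanded.update(thesaurus.get(term, []))
    return expanded

expand_query({"computer"}, thesaurus)
```

The expanded query matches items that use a related term instead of the original one, which is how recall increases (usually at some cost in precision).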
Document and Term Clustering (Contd.)
• The clustering process is not precise, and care must be taken in
the use of clustering techniques to minimize the negative impact
that misuse can have. The techniques can be categorized as those
that use the complete database to perform the clustering and those
that start with some initial structure.
•A class of clustering algorithms creates a hierarchical output.
The hierarchy of clusters usually reflects more abstract concepts
in the higher levels and more detailed specific items in the lower
levels.
•Given the large data sets in information retrieval systems, it is
essential to optimize the clustering process in terms of time and
required processing power.
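The hierarchical output described above can be sketched with a minimal agglomerative procedure: repeatedly merge the two most similar clusters, so that early merges form the detailed lower levels and later merges the more abstract higher levels. The toy term-weight vectors and dot-product similarity are illustrative assumptions:

```python
def similarity(a, b):
    # Dot product over shared terms; one simple similarity choice.
    shared = set(a) & set(b)
    return sum(a[t] * b[t] for t in shared)

def agglomerate(items):
    # items: {item_id: {term: weight}}; returns the merge history,
    # i.e., the hierarchy of clusters from bottom (detailed) to top.
    clusters = {(i,): dict(v) for i, v in items.items()}
    history = []
    while len(clusters) > 1:
        pairs = [(similarity(clusters[x], clusters[y]), x, y)
                 for x in clusters for y in clusters if x < y]
        _, x, y = max(pairs)                 # merge the closest pair
        merged = clusters.pop(x)
        for term, weight in clusters.pop(y).items():
            merged[term] = merged.get(term, 0) + weight
        clusters[x + y] = merged
        history.append((x, y))
    return history
```

This naive version recomputes all pairwise similarities on each pass, which illustrates why optimizing time and processing power is essential for the large data sets in information retrieval systems.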
https://www.youtube.com/watch?v=cBLqi1wae8Y
Introduction to Clustering
Purpose:
Efficiency
Effectiveness
The concept of clustering has been around as long as
there have been libraries. One of the first uses of
clustering was an attempt to cluster items discussing
the same subject. The goal of the clustering was to
assist in the location of information. This eventually
led to indexing schemes used in organization of
items in libraries and standards associated with use
of electronic indexes.
a. Define the domain for the clustering effort. If a
thesaurus is being created, this equates to determining
the scope of the thesaurus such as “medical terms.” If
document clustering is being performed, it is
determination of the set of items to be clustered. This
can be a subset of the database or the complete
database. Defining the domain for the clustering
identifies the objects to be used in the clustering
process and reduces the potential for erroneous data that
could induce errors in the clustering process.
b. Once the domain is determined, determine
the attributes of the objects to be clustered. If a
thesaurus is being generated, determine the
specific words in the objects to be used in the
clustering process. Similarly, if documents are
being clustered, the clustering process may focus
on specific zones within the items (e.g., Title and
abstract only, main body of the item but not the
references, etc.) that are to be used to determine
similarity. The objective, as with the first step (a),
is to reduce erroneous associations.
c. Determine the strength of the
relationships between the attributes whose
co-occurrence in objects suggest those
objects should be in the same class. For
thesauri this is determining which words are
synonyms and the strength of their term
relationships. For documents it may be
defining a similarity function based upon
word co-occurrences that determine the
similarity between two items.
d. At this point, the total set of
objects and the strengths of the
relationships between the objects
have been determined. The final
step is applying some algorithm to
determine the class(es) to which
each item will be assigned.
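Steps a through d above can be illustrated with a minimal single-pass assignment algorithm, one of many possible choices for the final step; the similarity function, threshold, and term-weight vectors are assumptions for illustration:

```python
THRESHOLD = 0.5  # assumed minimum similarity to join an existing class

def similarity(a, b):
    # Dot product over shared attributes (step c's strength measure).
    shared = set(a) & set(b)
    return sum(a[t] * b[t] for t in shared)

def assign_classes(items):
    # items: {item_id: {term: weight}} (the domain and attributes from
    # steps a and b); returns the classes as lists of item ids (step d).
    classes = []  # each class keeps a representative centroid and members
    for item_id, vec in items.items():
        best = max(classes, key=lambda c: similarity(c["centroid"], vec),
                   default=None)
        if best and similarity(best["centroid"], vec) >= THRESHOLD:
            best["members"].append(item_id)   # join the closest class
        else:
            classes.append({"centroid": dict(vec), "members": [item_id]})
    return [c["members"] for c in classes]
```

The guidelines that follow (well-defined semantics, comparable class sizes, no dominating object) are properties to check on the classes such an algorithm produces, not constraints the algorithm enforces.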
There are
guidelines (not hard constraints) on the characteristics
of the classes:
A well-defined semantic definition should exist for
each class. There is a risk that the name assigned to
the semantic definition of the class could also be
misleading. In some systems numbers are assigned to
classes to reduce the misinterpretation that a name
attached to each class could have. A clustering of
items into a class called “computer” could mislead a
user into thinking that it includes items on main
memory that may actually reside in another class
called “hardware.”
The size of the classes should be within the same order of
magnitude. One of the primary uses of the classes is to expand
queries or expand the resultant set of retrieved items. If a
particular class contains 90 per cent of the objects, that class is
not useful for either purpose. It also places in question the
utility of the other classes that are distributed across 10 per cent
of the remaining objects.
Within a class, one object should not dominate the class. For
example, assume a thesaurus class called “computer” exists and
it contains the objects (words/word phrases) “microprocessor,”
“286-processor,” “386-processor” and “pentium.” If the term
“microprocessor” is found 85 per cent of the time and the other
terms are used 5 per cent each, there is a strong possibility that
using “microprocessor” as a synonym for “286-processor” will
introduce too many errors. It may be better to place
“microprocessor” into its own class.
https://www.youtube.com/watch?v=DFX-sOnUljI (item clustering)
https://www.youtube.com/watch?v=tcKCI278myM
Thank You