INFORMATION RETRIEVAL
SYSTEMS
III B.TECH - I SEMESTER
COMPUTER SCIENCE AND ENGINEERING
https://www.youtube.com/watch?v=F5--OFgRWr4
SVSV PRASAD SANABOINA
UNIT-III
Automatic Indexing
Classes of Automatic Indexing
Statistical Indexing
Natural Language
Concept Indexing
Hypertext Linkages
Summary
Document and Term Clustering:
Introduction to Clustering,
Thesaurus Generation,
Item Clustering,
Hierarchy of Clusters
The previous chapter introduced the concept and objectives of
indexing along with its history.
This chapter focuses on the process and algorithms to perform
indexing. The indexing process is a transformation of an item that
extracts the semantics of the topics discussed in the item.
The extracted information is used to create the processing tokens and
the searchable data structure. The semantics of the item not only
refers to the subjects discussed in the item but also in weighted
systems, the depth to which the subject is discussed.
The index can be based on the full text of the item, automatic or
manual generation of a subset of terms/phrases to represent the item,
natural language representation of the item or abstraction to concepts
in the item.
Classes of Automatic Indexing
Automatic indexing is the process of analyzing an item to extract the
information to be permanently kept in an index.
This process is associated with the generation of the searchable data
structures associated with an item. Figure Data Flow in an
Information Processing System is reproduced here as Figure 5.1 to
show where the indexing process is in the overall processing of an
item.
The figure is expanded to show where the search process relates to
the indexing process. The left side of the figure including Identify
Processing Tokens, Apply Stop Lists, Characterize tokens, Apply
Stemming and Create Searchable Data Structure is all part of the
indexing process.
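The indexing steps named above (identify processing tokens, apply stop lists, apply stemming, create a searchable data structure) can be sketched as a minimal pipeline. The stop list, suffix rules, and sample items below are illustrative assumptions, not any particular system's implementation:

```python
from collections import defaultdict

STOP_LIST = {"the", "a", "an", "of", "in", "is", "are"}  # assumed stop list

def stem(token):
    # Naive suffix stripping; real systems use Porter or similar stemmers.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_items(items):
    # items: {item_id: text} -> inverted index {stem: {item_id: frequency}}
    inverted = defaultdict(lambda: defaultdict(int))
    for item_id, text in items.items():
        for raw in text.lower().split():         # identify processing tokens
            token = raw.strip('.,;:()"\'')
            if not token or token in STOP_LIST:  # apply stop list
                continue
            inverted[stem(token)][item_id] += 1  # apply stemming and count
    return inverted
```

The inverted index produced here is one simple form of the searchable data structure the indexing process creates.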
Contd.
All systems go through an initial stage of zoning (described in
Section 1.3.1) and identifying the processing tokens used to
create the index. Some systems automatically divide the
document up into fixed length passages or localities, which
become the item unit that is indexed.
Filters, such as stop lists and stemming algorithms, are frequently
applied to reduce the number of tokens to be processed. The next
step depends upon the search strategy of a particular system.
Search strategies can be classified as statistical, natural language,
and concept. An index is the data structure created to support the
search strategy. Statistical strategies cover the broadest range of
indexing techniques and are the most prevalent in commercial
systems.
Contd.
The basis for a statistical approach is use of frequency of
occurrence of events. The events usually are related to occurrences
of processing tokens (words/phrases) within documents and within
the database. The words/phrases are the domain of searchable
values.
The statistics that are applied to the event data are probabilistic,
Bayesian, vector space, and neural net. The static approach stores a
single statistic, such as how often each word occurs in an item, that
is used in generating relevance scores after a standard Boolean
search.
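The static approach above, a standard Boolean search followed by term-frequency ranking of the Hit file, can be sketched as follows; the toy inverted index and query are illustrative assumptions:

```python
# term -> {item_id: occurrences}; a toy inverted index for illustration
index = {
    "retrieval": {1: 4, 2: 1, 3: 2},
    "systems":   {1: 1, 3: 5},
}

def boolean_and(index, terms):
    # The Hit file: items containing ALL query terms.
    hits = None
    for term in terms:
        postings = set(index.get(term, {}))
        hits = postings if hits is None else hits & postings
    return hits or set()

def rank_by_tf(index, terms, hits):
    # Relevance score = total frequency of the query terms in the item.
    scores = {item: sum(index[t].get(item, 0) for t in terms) for item in hits}
    return sorted(scores, key=scores.get, reverse=True)

hits = boolean_and(index, ["retrieval", "systems"])          # items 1 and 3
ranked = rank_by_tf(index, ["retrieval", "systems"], hits)   # 3 before 1
```

Note that the statistic plays no role in selecting the Hit file, only in ordering it afterwards, which is exactly the limitation of these older systems.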
Contd.
Finally, a special class of indexing can be defined by creation of
hypertext linkages.
These linkages provide virtual threads of concepts between items
versus directly defining the concept within an item. Each
technique has its own strengths and weaknesses.
Current evaluations from TREC conferences (see Chapter 11) show
that to maximize location of relevant items, applying several
different algorithms to the same corpus provides the optimum
results, but the storage and processing overhead is significant.
Statistical Indexing
Statistical indexing uses frequency of occurrence of events to
calculate a number that is used to indicate the potential relevance of
an item.
One approach, used in older systems, does not use the statistics
to aid in the initial selection, but uses them to assist in
calculating a relevance value for each item for ranking.
The documents are found by a normal Boolean search, and then
statistical calculations are performed on the Hit file to rank the
output (e.g., term frequency algorithms), since the index does not
contain any special ranking data.
Contd.
Probabilistic systems attempt to calculate a probability value that should
be invariant to both calculation method and text corpora. This allows
easy integration of the final results when searches are performed across
multiple databases that use different search algorithms.
A probability of 50 per cent would mean that if enough items are
reviewed, on average one half of the reviewed items are relevant.
The Bayesian and Vector approaches calculate a relative relevance value
(e.g., confidence level) that a particular item is relevant. Quite often term
distributions across the searchable database are used in the calculations.
An issue that continues to be researched is how to merge results, even
from the same search algorithm, from multiple databases. The problem is
compounded when an attempt is made to merge the results from different
search algorithms. This would not be a problem if true probabilities were
calculated.
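The merging problem above disappears if true, invariant probabilities are available: combining results from multiple databases reduces to sorting the union of the scored items. A minimal sketch with assumed per-database probability estimates:

```python
# Per-database relevance scores, assumed here to be true probabilities,
# so that scores from different databases are directly comparable.
db1 = {"a": 0.9, "b": 0.4}
db2 = {"c": 0.7, "d": 0.2}

# With invariant probabilities, merging is just sorting the combined results.
merged = sorted({**db1, **db2}.items(), key=lambda kv: kv[1], reverse=True)
```

When the scores are not true probabilities (e.g., algorithm-specific similarity values), this simple sort is no longer valid, which is why result merging remains a research issue.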
Probabilistic Weighting
The probabilistic approach is based upon direct application of the theory of
probability to information retrieval systems. This has the advantage of being able to
use the developed formal theory of probability to direct the algorithmic
development. It also leads to an invariant result that facilitates integration of results
from different databases. The use of probability theory is a natural choice because it
is the basis of evidential reasoning (i.e., drawing conclusions from evidence). This
is summarized by the Probability Ranking Principle (PRP) and its Plausible
Corollary
HYPOTHESIS: If a reference retrieval system’s response to each request is a
ranking of the documents in the collection in order of decreasing probability of
usefulness to the user who submitted the request, where the probabilities are
estimated as accurately as possible on the basis of whatever data is available for
this purpose, then the overall effectiveness of the system to its users is the best
obtainable on the basis of that data.
PLAUSIBLE COROLLARY: The most promising source of techniques for
estimating the probabilities of usefulness for output ranking in IR is standard
probability theory and statistics.
https://www.youtube.com/watch?v=mWLQn4G266g
Concept Indexing
Natural language processing starts with a basis of the terms within
an item and extends the information kept on an item to phrases and
higher level concepts such as the relationships between concepts.
In the DR-LINK system, terms within an item are replaced by an
associated Subject Code.
Use of subject codes or some other controlled vocabulary is one
way to map from specific terms to more general terms. Often the
controlled vocabulary is defined by an organization to be
representative of the concepts they consider important
representations of their data. Concept indexing takes the
abstraction a level further. Its goal is to gain the implementation
advantages of an index term system but use concepts instead of
terms as the basis for the index, producing a reduced dimension
vector space.
Contd.
Rather than a priori defining a set of concepts that the terms in an
item are mapped to, concept indexing can start with a number of
unlabeled concept classes and let the information in the items
define the concept classes created.
The process of mapping from a specific term to a concept that the
term represents is complex because a term may represent multiple
different concepts to different degrees. A term such as
“automobile” could be associated with concepts such as “vehicle,”
“transportation,” “mechanical device,” “fuel,” and “environment.”
The term “automobile” is strongly related to “vehicle,” less
strongly to “transportation,” and much less to the other terms.
Thus a term in an item needs to be represented by many concept
codes with different weights for a particular item.
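The weighted mapping described above can be sketched as follows. The concept classes for “automobile” follow the example in the text, but the numeric weights are assumptions for illustration:

```python
# Assumed weighted mapping from a term to the concepts it represents.
concept_codes = {
    "automobile": {
        "vehicle": 0.9,          # strongly related
        "transportation": 0.6,   # less strongly related
        "mechanical device": 0.3,
        "fuel": 0.2,
        "environment": 0.1,
    }
}

def concept_vector(terms):
    # Accumulate weighted concept codes over all terms in an item,
    # producing a reduced-dimension concept-space representation.
    vec = {}
    for term in terms:
        for concept, weight in concept_codes.get(term, {}).items():
            vec[concept] = vec.get(concept, 0.0) + weight
    return vec
```

Indexing the concept vector rather than the raw terms is what yields the reduced dimension vector space mentioned above.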
Hypertext Linkages
A new class of information representation, described in Chapter 4
as the hypertext data structure, is evolving on the Internet.
Hypertext data structures must be generated manually although
user interface tools may simplify the process.
Very little research has been done on the information retrieval
aspects of hypertext linkages and automatic mechanisms to use the
information of item pointers in creating additional search
structures. In effect, hypertext linkages are creating an additional
information retrieval dimension. Traditional items can be viewed as
two dimensional constructs. The text of the items is one dimension
representing the information in the items.
Contd.
Hypertext, with its linkages to additional electronic items, can be
viewed as networking between items that extends the contents. To
understand the total subject of an item it is necessary to follow
these additional information concept paths.
The embedding of the linkage allows the user to go immediately to
the linked item for additional information. The issue is how to use
this additional dimension to locate relevant information. The
easiest approach is to do nothing and let the user follow these paths
to view items. But this avoids one of the challenges in
information systems: creating techniques to assist the user in
finding relevant information.
Contd.
Looking at the Internet at the current time, there are three classes of
mechanisms to help find information: manually generated indexes,
automatically generated indexes, and web crawlers (intelligent
agents). YAHOO (http://www.yahoo.com) is an example of the first
case where information sources (home pages) are indexed
case where information sources (home pages) are indexed
manually into a hyperlinked hierarchy. The user can navigate
through the hierarchy by expanding the hyperlink on a particular
topic to see the more detailed subtopics.
Document and Term Clustering
• The previous chapters introduced indexing associated with
representation of the semantics of an item.
• Our information database can be viewed as being
composed of a number of independent items indexed by
a series of index terms.
• This model lends itself to two types of clustering:
clustering index terms to create a statistical thesaurus
and clustering items to create document clusters.
• In the first case clustering is used to increase recall by
expanding searches with related terms.
• In document clustering the search can retrieve items
similar to an item of interest, even if the query would
not have retrieved the item.
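The first case above, expanding a query with related terms from a statistical thesaurus to increase recall, can be sketched as follows; the thesaurus classes here are illustrative assumptions:

```python
# Assumed statistical thesaurus: term -> statistically related terms.
thesaurus = {
    "computer": ["processor", "hardware"],
    "processor": ["computer", "microprocessor"],
}

def expand_query(terms, thesaurus):
    # Add every related term for each original query term.
    expanded = set(terms)
    for term in terms:
        expanded.update(thesaurus.get(term, []))
    return expanded

expand_query({"computer"}, thesaurus)
```

The expanded query matches items that use a related term instead of the original one, which is how recall increases (usually at some cost in precision).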
Document and Term Clustering (Contd.)
• The clustering process is not precise, and care must be taken in
the use of clustering techniques to minimize the negative impact
that misuse can have. The techniques can be categorized as those
that use the complete database to perform the clustering and those
that start with some initial structure.
•A class of clustering algorithms creates a hierarchical output.
The hierarchy of clusters usually reflects more abstract concepts
in the higher levels and more detailed specific items in the lower
levels.
•Given the large data sets in information retrieval systems, it is
essential to optimize the clustering process in terms of time and
required processing power.
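The hierarchical output described above can be sketched with a minimal agglomerative procedure: repeatedly merge the two most similar clusters, so that early merges form the detailed lower levels and later merges the more abstract higher levels. The toy term-weight vectors and dot-product similarity are illustrative assumptions:

```python
def similarity(a, b):
    # Dot product over shared terms; one simple similarity choice.
    shared = set(a) & set(b)
    return sum(a[t] * b[t] for t in shared)

def agglomerate(items):
    # items: {item_id: {term: weight}}; returns the merge history,
    # i.e., the hierarchy of clusters from bottom (detailed) to top.
    clusters = {(i,): dict(v) for i, v in items.items()}
    history = []
    while len(clusters) > 1:
        pairs = [(similarity(clusters[x], clusters[y]), x, y)
                 for x in clusters for y in clusters if x < y]
        _, x, y = max(pairs)                 # merge the closest pair
        merged = clusters.pop(x)
        for term, weight in clusters.pop(y).items():
            merged[term] = merged.get(term, 0) + weight
        clusters[x + y] = merged
        history.append((x, y))
    return history
```

This naive version recomputes all pairwise similarities on each pass, which illustrates why optimizing time and processing power is essential for the large data sets in information retrieval systems.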
https://www.youtube.com/watch?v=cBLqi1wae8Y
Introduction to Clustering
Purpose:
Efficiency
Effectiveness
The concept of clustering has been around as long as
there have been libraries. One of the first uses of
clustering was an attempt to cluster items discussing
the same subject. The goal of the clustering was to
assist in the location of information. This eventually
led to indexing schemes used in organization of
items in libraries and standards associated with use
of electronic indexes.
a. Define the domain for the clustering effort. If a
thesaurus is being created, this equates to determining
the scope of the thesaurus such as “medical terms.” If
document clustering is being performed, it is
determination of the set of items to be clustered. This
can be a subset of the database or the complete
database. Defining the domain for the clustering
identifies the objects to be used in the clustering
process and reduces the potential for erroneous data that
could induce errors in the clustering process.
b. Once the domain is determined, determine
the attributes of the objects to be clustered. If a
thesaurus is being generated, determine the
specific words in the objects to be used in the
clustering process. Similarly, if documents are
being clustered, the clustering process may focus
on specific zones within the items (e.g., Title and
abstract only, main body of the item but not the
references, etc.) that are to be used to determine
similarity. The objective, as with the first step (a),
is to reduce erroneous associations.
c. Determine the strength of the
relationships between the attributes whose
co-occurrence in objects suggest those
objects should be in the same class. For
thesauri this is determining which words are
synonyms and the strength of their term
relationships. For documents it may be
defining a similarity function based upon
word co-occurrences that determine the
similarity between two items.
d. At this point, the total set of
objects and the strengths of the
relationships between the objects
have been determined. The final
step is applying some algorithm to
determine the class(es) to which
each item will be assigned.
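Steps a through d above can be illustrated with a minimal single-pass assignment algorithm, one of many possible choices for the final step; the similarity function, threshold, and term-weight vectors are assumptions for illustration:

```python
THRESHOLD = 0.5  # assumed minimum similarity to join an existing class

def similarity(a, b):
    # Dot product over shared attributes (step c's strength measure).
    shared = set(a) & set(b)
    return sum(a[t] * b[t] for t in shared)

def assign_classes(items):
    # items: {item_id: {term: weight}} (the domain and attributes from
    # steps a and b); returns the classes as lists of item ids (step d).
    classes = []  # each class keeps a representative centroid and members
    for item_id, vec in items.items():
        best = max(classes, key=lambda c: similarity(c["centroid"], vec),
                   default=None)
        if best and similarity(best["centroid"], vec) >= THRESHOLD:
            best["members"].append(item_id)   # join the closest class
        else:
            classes.append({"centroid": dict(vec), "members": [item_id]})
    return [c["members"] for c in classes]
```

The guidelines that follow (well-defined semantics, comparable class sizes, no dominating object) are properties to check on the classes such an algorithm produces, not constraints the algorithm enforces.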
There are
guidelines (not hard constraints) on the characteristics
of the classes:
A well-defined semantic definition should exist for
each class. There is a risk that the name assigned to
the semantic definition of the class could also be
misleading. In some systems numbers are assigned to
classes to reduce the misinterpretation that a name
attached to each class could have. A clustering of
items into a class called “computer” could mislead a
user into thinking that it includes items on main
memory that may actually reside in another class
called “hardware.”
The size of the classes should be within the same order of
magnitude. One of the primary uses of the classes is to expand
queries or expand the resultant set of retrieved items. If a
particular class contains 90 per cent of the objects, that class is
not useful for either purpose. It also places in question the
utility of the other classes that are distributed across 10 per cent
of the remaining objects.
Within a class, one object should not dominate the class. For
example, assume a thesaurus class called “computer” exists and
it contains the objects (words/word phrases) “microprocessor,”
“286-processor,” “386-processor” and “pentium.” If the term
“microprocessor” is found 85 per cent of the time and the other
terms are used 5 per cent each, there is a strong possibility that
using “microprocessor” as a synonym for “286-processor” will
introduce too many errors. It may be better to place
“microprocessor” into its own class.
https://www.youtube.com/watch?v=DFX-sOnUljI (item clustering)
https://www.youtube.com/watch?v=tcKCI278myM
Thank You