
IRS Unit 3 by Krishna

The document discusses automatic indexing in information retrieval systems, detailing the indexing process, various search strategies, and the importance of probabilistic weighting and vector weighting. It covers techniques like term frequency, inverse document frequency, and signal weighting, emphasizing their roles in improving search accuracy and relevance. Challenges such as data limitations and the complexity of continuous relevance are also addressed, alongside practical applications of these concepts.


INFORMATION RETRIEVAL SYSTEMS


UNIT-3 PART-1

AUTOMATIC INDEXING

Indexing transforms items to extract semantic information, creating searchable data


structures. This process, categorized into statistical, natural language, concept, and
hypertext linkages, influences search techniques.

Classes of Automatic Indexing


Automatic Indexing refers to the process where a system analyzes documents to
extract key information and create a searchable structure called an index. This index
is essential for enabling efficient search strategies.

Indexing Process

1. Zoning: Divides the document into meaningful parts (e.g., passages).


2. Token Processing: Identifies important words or phrases (tokens).
◦ Stop Lists: Filters out common words (e.g., "the," "and") to reduce
noise.
◦ Stemming: Reduces words to their base form (e.g., "running" → "run").
3. Searchable Structure: The resulting index supports different search strategies.
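The token-processing steps above can be sketched as a small pipeline. The stop list and the suffix-stripping stemmer below are deliberately crude, illustrative stand-ins (matching the "running" → "run" example), not a standard implementation:

```python
# Minimal sketch of token processing: tokenize a passage (zoning assumed
# already done), drop stop words, then apply a crude stemmer.
STOP_WORDS = {"the", "and", "a", "an", "of", "in", "is"}

def stem(word):
    # Very rough stemming: strip a few common suffixes so that, e.g.,
    # "running" reduces to "run". Real systems use Porter or similar.
    for suffix in ("ning", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_tokens(passage):
    tokens = [w.lower().strip(".,") for w in passage.split()]
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(index_tokens("The runner is running in the park"))
```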

Search Strategies in Indexing

1. Statistical Strategies:
◦ Use the frequency of words/phrases in documents.
◦ Examples:
▪ Static Methods: Store simple statistics like word counts.
▪ Probabilistic Indexing: Estimate how likely a document is
relevant to a query.
▪ Bayesian & Vector Models: Rank documents by their confidence
in relevance.
▪ Neural Networks: Learn patterns and concepts dynamically.

2. Natural Language Strategies:


◦ Analyze grammar and context.
◦ Recognize abstract meanings (e.g., identifying actions or relationships).
◦ Store this extra information to improve search precision.
3. Concept Indexing:
◦ Links words to broader ideas (concepts).
◦ Example: "Apple" may be associated with "fruit" or "technology."
◦ Concepts may not always have specific labels but rely on statistical
significance.
4. Hypertext Linkages:
◦ Create virtual links between documents based on shared concepts.
◦ Useful for navigating related topics.

Key Observations

• Strengths: Combining multiple indexing techniques improves accuracy and


relevance.
• Challenges: Requires significant storage and computational resources.

Statistical Indexing

• Statistical Indexing: Uses frequency of occurrence to calculate relevance, often


after initial Boolean search.
• Probabilistic Systems: Aim to calculate a probability of relevance, enabling
integration across multiple databases and search algorithms.
• Result Merging: Challenges arise in merging results from multiple databases,
especially when different search algorithms are used.

Probabilistic Weighting

Probabilistic Weighting is a method in information retrieval (IR) based on


probability theory. It applies the theory of probability to rank documents based on
how likely they are to be useful for a given query.

Key Concepts:

• Probability Ranking Principle (PRP): This principle suggests that documents


should be ranked by the probability of their relevance to the user’s query. If we
calculate these probabilities accurately, the system should be as effective as
possible with the available data.

• Plausible Corollary: The most promising way to estimate these probabilities


is by using standard probability theory and statistics.

Challenges of Probabilistic Weighting:

• Binary Relevance: Probabilistic models often treat relevance as binary


(relevant or not), but in real-life systems, relevance is continuous (some
documents are slightly relevant, others are very relevant).

• Complexity: To deal with the continuous nature of relevance, more advanced


models like expected utility theory are needed.

• Heuristics: Sometimes, domain-specific heuristics (rules based on experience)


might perform better than a probabilistic ranking in certain areas. However,
these cases are usually niche.

• Data Limitations: Probabilistic models may not always be very accurate due
to a lack of precise data, which leads to the need for simplifying assumptions.
Even with this, probabilistic models remain competitive in performance.

Applications of Probabilistic Weighting:

1. Logistic Regression: A common technique where probabilities of relevance


are calculated based on several factors such as:

◦ Term frequency in the query, document, and the database.


◦ A model is trained with known data (query-document pairs with
relevance judgments), and the coefficients of this model are used to
predict relevance for other query-document pairs.

2. Log-Odds Formula: This formula is used to predict the odds of a document


being relevant based on specific statistical features.

Formula: log(O(R | Q)) = −5.138 + Σ (zj + 5.138), summed over j = 1 to q

3. Evaluation: In tests (e.g., on the Cranfield Collection), logistic inference


methods performed better than other traditional methods like the SMART
vector system.
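As an illustration of the logistic-regression approach described above, the sketch below computes a log-odds score from a handful of features and converts it to a probability of relevance. The coefficients and feature names are invented for the example; a real system would fit them on query-document pairs with known relevance judgments:

```python
# Illustrative logistic-regression relevance estimate. All coefficient
# values and feature names here are assumptions, not trained values.
import math

COEFFS = {"intercept": -5.138, "qtf": 1.2, "dtf": 0.8, "idf": 0.9}

def log_odds(features):
    # Linear combination of features plus intercept gives the log-odds.
    z = COEFFS["intercept"]
    for name, value in features.items():
        z += COEFFS[name] * value
    return z

def prob_relevant(features):
    # The logistic function converts log-odds to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-log_odds(features)))

p = prob_relevant({"qtf": 2.0, "dtf": 1.5, "idf": 2.1})
print(round(p, 3))
```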

Issues with Combining Probabilistic Techniques:

• There have been efforts to combine different probabilistic methods (like


averaging Log-Odds) to improve accuracy, but these combinations often
perform worse than individual methods.

In summary, the probabilistic approach to information retrieval, based on probability

theory, aims to rank documents based on their probability of usefulness to the user.
While challenges arise from continuous relevance levels and potential suboptimal
ranking compared to domain-specific heuristics, the benefits of accurate probability
values for integrating results from multiple sources outweigh these issues. Logistic
regression is highlighted as an example of applying probabilistic methods to
information retrieval, demonstrating competitive performance against other
approaches.

Vector Weighting
The Vector Weighting model represents items in an information retrieval system
as vectors. Each vector consists of values corresponding to specific processing tokens
(typically terms) in the document. The main ideas are the following:

1. Binary vs. Weighted Vectors:


◦ Binary Vectors: Represent the presence or absence of terms. If a term is
present in a document, the corresponding value in the vector is 1,
otherwise 0.
◦ Weighted Vectors: Assign positive real numbers to each term,
representing the relative importance of each term in the document. These
weights allow for more nuanced ranking of documents.

Type      Petroleum  Mexico  Oil  Taxes  Refineries  Shipping

Binary    1          1       1    0      1           0

Weighted  2.8        1.6     3.5  0.3    3.1         0.1

2. Threshold for Binary Vectors:


◦ When using binary vectors, a decision process is involved to determine
whether the presence of a term in a document is significant enough to be
included in the vector. For instance, in a document about petroleum
refineries, a term like "taxation" may be below the threshold of
importance and therefore not included in the binary vector.
3. Vector Representation in Space:
◦ The vector space model treats each term as a separate dimension in a
multi-dimensional space. A query can also be represented as a vector in
the same space, allowing for mathematical manipulation to rank
documents based on their relevance to the query.
4. Handling Transcription Errors:
◦ A technique to handle errors in audio transcriptions (e.g., from broadcast
news or conversational speech) involves using the transcribed document
as a query against an existing database. This helps in retrieving a set of
results, from which important terms are extracted and added to the
document to improve retrieval accuracy.
5. Normalization and Weighting:
◦ Algorithms are used to calculate the weights for processing tokens. One
common issue that arises is normalization, which accounts for variations
such as differing word counts across documents. Normalization helps
ensure that documents are weighted and compared fairly during
retrieval.
In summary, this section covers the vector representation of documents in
information retrieval, discussing both binary and weighted approaches, and how these
can be used to rank documents based on relevance. It also touches on techniques for
improving document retrieval when faced with errors in sources like audio
transcriptions.
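A minimal sketch of the vector representation discussed above, using the binary and weighted vectors from the earlier table and cosine similarity to score a query against the weighted document vector (the query vector here is an assumption for illustration):

```python
# Binary vs. weighted document vectors and a cosine-similarity score.
# Dimension order: petroleum, mexico, oil, taxes, refineries, shipping.
import math

binary_doc = [1, 1, 1, 0, 1, 0]
weighted_doc = [2.8, 1.6, 3.5, 0.3, 3.1, 0.1]
query = [1, 0, 1, 0, 0, 0]  # hypothetical query: "petroleum oil"

def cosine(u, v):
    # Dot product divided by the product of vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(round(cosine(query, weighted_doc), 3))
```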

Vector weighting involves several sub-concepts:


1. Simple Term Frequency Algorithm
2. Inverse Document Frequency
3. Signal Weighting
4. Discrimination Value

Simple Term Frequency Algorithm


This section describes the Simple Term Frequency (TF) Algorithm used in
automatic indexing for information retrieval. In both unweighted and weighted
approaches, the algorithm calculates the weight assigned to a processing token in a
document. Here's an overview:

1. Basic Term Frequency (TF):


◦ The simplest approach for calculating weights is to set the weight equal
to the term frequency (TF), i.e., the number of times a term occurs in a
document. For example, if the word "computer" appears 15 times in a
document, its TF weight is 15.
◦ The challenge with this approach is that longer documents will naturally
have higher TF values, which could bias the ranking. Thus,
normalization is required.
2. Normalization:
◦ Maximum Term Frequency (MTF): One method of normalization
divides the term frequency by the maximum frequency of the word in
any document. This results in a normalized TF value between 0 and 1.
However, this technique can reduce the significance of terms in shorter
documents.
◦ Logarithmic Term Frequency: Another approach uses the logarithm of
the term frequency (log(TF + constant)) to normalize values, especially
in cases where term frequencies vary widely across documents of
different lengths. This normalization reduces the impact of large
variations in term frequency.
3. Cosine Similarity:
◦ The Cosine Similarity measure can be used to normalize term weights
by considering the vector representation of documents and dividing each
term's weight by the vector's length. This ensures that the document

vector has a maximum length of 1. However, this method can distort the
importance of terms when a document contains multiple topics.
4. Pivot Point Normalization:
◦ To address the over-penalization of longer documents, pivot point
normalization was introduced. This normalization adjusts the weight
based on document length and aims to reduce the bias that favors shorter
documents.
◦ The "pivot point" is de ned as the length at which the probability of
relevance equals the probability of retrieval. Using this, a slope is
generated, and the normalization is adjusted accordingly.
5. Final Algorithm:
◦ The final weighting algorithm involves using the logarithmic term
frequency divided by a pivoted normalization function to adjust for
document length.
◦ This approach has been shown to outperform simple term frequency
(TF/MAX(TF)) or vector length normalization in certain datasets, such
as TREC data.
6. Practical Benefits:
◦ The normalization process, when properly adjusted, compensates for
biases introduced by document length and term frequency variations,
providing a more balanced approach to weighting terms in information
retrieval.
Summary

This section explains the different approaches to calculating the term frequency (TF)
and adjusting for variations in document length and term occurrence across a dataset.
The goal is to derive a weight for each term that accurately reflects its importance
within the document and the entire database. By normalizing these values using
techniques like logarithmic functions and pivot point adjustments, the algorithm
ensures that document relevance is determined more effectively.
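A sketch of the combined approach summarized above: logarithmic term frequency divided by a pivoted length normalization. The pivot and slope values are illustrative assumptions (in practice the pivot is set where the probability of relevance equals the probability of retrieval):

```python
# Logarithmic TF with pivoted length normalization (illustrative values).
import math

def pivoted_weight(tf, doc_len, pivot=120.0, slope=0.25):
    if tf == 0:
        return 0.0
    # Logarithmic TF dampens large raw counts.
    log_tf = 1.0 + math.log(tf)
    # Pivoted normalization: interpolate between the pivot length and
    # the actual document length to avoid over-penalizing long items.
    norm = (1.0 - slope) * pivot + slope * doc_len
    return log_tf / norm

# The same raw TF counts for less in a much longer document.
short = pivoted_weight(15, doc_len=100)
long_ = pivoted_weight(15, doc_len=1000)
print(short > long_)
```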

Inverse Document Frequency (IDF)


The Inverse Document Frequency (IDF) is a key component of the formula for
calculating the weight of a term in a document. It is designed to reduce the weight of
terms that appear in many documents and increase the weight of terms that appear in
fewer documents. This helps to give more importance to words that are distinctive
and potentially more meaningful in understanding the content of a document.

Inverse Document Frequency in the Context of the Formula:

The formula for WEIGHT is:

WEIGHTij = TFij × (log2(n) − log2(DFj ) + 1)

Here, IDF is represented by the expression:


IDFj = log2(n) − log2(DFj ) + 1
Let’s break it down:
• DFj : Document Frequency of term j. This is the number of documents in the
database that contain the term j.
• n: The total number of documents in the database.

The term IDF gives us a measure of how important the term j is across all the
documents in the database. The IDF increases when the term appears in fewer
documents, and decreases when the term appears in many documents.

Explanation of the IDF formula:

1. Logarithmic Component:
◦ log2(n): This is the logarithmic factor based on the total number of
documents. A higher number of documents means the term will be less
important.
◦ log2(DFj ): This is the logarithmic factor based on the number of
documents containing the term. If the term is found in many documents
(higher DF), the weight will be reduced.

2. Adjustment with +1: The formula adds 1 to the logarithmic difference to


ensure that the IDF never becomes zero, even for terms that appear in every
document (i.e., when DF equals n).

How IDF Affects Term Weight:

• Low IDF: If a term appears in many documents (high DF), the IDF value will
be low. This means the term is not very unique or distinctive across the entire
database. The weight given to such common terms will be reduced.

• High IDF: If a term appears in only a few documents (low DF), the IDF value
will be high. This means the term is more unique and could be more
informative, so it is given a higher weight in the document.

Practical Example:

Let’s say we have 2048 documents in the database and the term "oil" appears in 128
documents, the term "Mexico" appears in 16 documents, and the term "refinery"
appears in 1024 documents.

For the term “oil" (with an assumed TF of 4):

WEIGHToil = 4 × (log2(2048) − log2(128) + 1) = 4 × (11 − 7 + 1) = 20

For the term “Mexico" (with an assumed TF of 8):

WEIGHTMexico = 8 × (log2(2048) − log2(16) + 1) = 8 × (11 − 4 + 1) = 64

For the term “refinery” (with an assumed TF of 10):

WEIGHTrefinery = 10 × (log2(2048) − log2(1024) + 1) = 10 × (11 − 10 + 1) = 20

This shows that "Mexico" has the highest IDF (8, versus 5 for "oil" and 2 for
"refinery") and will receive the highest weight compared to "oil" and "refinery."
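The worked example above can be checked directly in code, assuming term frequencies of 4, 8, and 10 for the three terms:

```python
# WEIGHT = TF * (log2(n) - log2(DF) + 1), with n = 2048 documents.
import math

def weight(tf, df, n=2048):
    # IDF factor rises as the term appears in fewer documents.
    return tf * (math.log2(n) - math.log2(df) + 1)

print(weight(4, 128))    # oil
print(weight(8, 16))     # Mexico
print(weight(10, 1024))  # refinery
```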

Conclusion:

• IDF helps to distinguish between common and rare terms in the database.
• Terms that are rare across documents are given more importance, while
common terms that appear frequently across the documents are down-
weighted.
• This is why the formula uses IDF to adjust the weight of a term based on its
rarity in the dataset, giving more meaningful weight to terms that are less
common and hence more informative.

Signal Weighting
Signal Weighting builds upon the concept of inverse document frequency (IDF) by
addressing a limitation: while IDF adjusts the weight of a term based on how many
documents contain it, it does not consider how frequently the term appears within
those documents. The term's frequency distribution across documents can
significantly affect the ability to rank items accurately.

Key Points:

1. Limitation of IDF:
◦ IDF alone considers only how many documents a term appears in but not
how it is distributed within those documents.
◦ For example, if two terms, “SAW” and “DRILL,” both appear in five
items and have the same total frequency (50 occurrences), IDF would
assign them equal importance.
◦ However, “SAW” is evenly distributed across the items, whereas
“DRILL” has a highly uneven distribution. This uneven distribution
might indicate that "DRILL" is more relevant to certain items.
2. Impact of Frequency Distribution:
◦ Uniform distribution of a term (e.g., “SAW”) across items provides less
insight into the relative importance of the term in any specific item.
◦ A term with a non-uniform distribution (e.g., “DRILL”) may be more
indicative of item relevance and should be weighted higher.
Item distribution:

Item  SAW  DRILL
A     10   2
B     10   2
C     10   18
D     10   10
E     10   18

3. Theoretical Foundation:
◦ Signal Weighting leverages Shannon's Information Theory, where the
information content of an event is inversely proportional to its
probability of occurrence:

INFORMATION = −log2(p)

◦ A rare event contains more information than a frequent one. For


example:

▪ A term occurring in 0.5% of cases provides more information


(−log2(0.005)≈7.64) than a term occurring in 50% of cases
(−log2(0.5)=1).

4. Weight Calculation:
◦ Signal Weighting accounts for the term's distribution within the
documents by using a formula derived from average information value
(AVE_INFO):
AVE_INFO = − Σ pi log2(pi), summed over i = 1 to n

Where pi is the relative frequency of the term in the ith document.
◦ The Signal Weight is calculated as the inverse of AVE_INFO, giving
higher weights to terms with non-uniform distributions. This ensures that
terms like “DRILL” receive higher importance than evenly distributed
terms like “SAW.”
5. Example from Figure 5.5:
◦ Terms "SAW" and "DRILL" both appear 50 times across five
documents, but their distributions differ:
▪ "SAW": Equal distribution across all documents (e.g., 10 times in
each).
▪ "DRILL": Uneven distribution (e.g., higher in some documents
like 18 in C and E).
◦ The Signal Weighting formula assigns higher weight to "DRILL" due to
its non-uniform distribution.
6. Practical Use:
◦ Signal Weighting can work alone or in combination with IDF and other
algorithms to improve precision (ranking relevant items higher).
◦ However, its practical implementation requires additional data and
computations, and its effectiveness compared to simpler methods has not
been conclusively demonstrated.
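A sketch of the signal computation for the SAW/DRILL example above. The signal here is taken as log2(TOTF) − AVE_INFO, one common form of the measure; treat the exact combination as an assumption:

```python
# Signal weighting sketch: terms with uneven distributions score higher.
import math

def signal(freqs):
    total = sum(freqs)
    # AVE_INFO = -sum(p_i * log2(p_i)) over the term's distribution.
    ave_info = -sum((f / total) * math.log2(f / total) for f in freqs if f)
    # Signal rises as the distribution deviates from uniform.
    return math.log2(total) - ave_info

saw = signal([10, 10, 10, 10, 10])  # perfectly uniform across 5 items
drill = signal([2, 2, 18, 10, 18])  # uneven across the same 5 items
print(drill > saw)
```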

In conclusion, Signal Weighting refines document retrieval algorithms by


emphasizing terms with non-uniform distributions, leveraging concepts from
Information Theory to improve ranking precision.

Discrimination value
Discrimination Value is a term-weighting method that measures how effectively a
term helps distinguish between different items in a database. The goal is to prioritize
terms that improve search accuracy by making items more distinguishable.

Concept
If all items in a database appear similar, it becomes difficult to identify relevant items.
The discrimination value evaluates how much a term contributes to differentiating
items.

Formula
The discrimination value for a term i is calculated as:
DISCRIMi = AVESIMi − AVESIM
• AVESIM: The average similarity between all items in the database.
• AVESIMi : The average similarity between items after removing term i from
all items.

Interpretation of Results
• Positive DISCRIMi :
If the value is positive, removing term i increases the similarity between items.
This means that term i helps differentiate items and should be given higher
weight.
• DISCRIMi ≈ 0:
A value near zero indicates that removing or keeping the term does not impact
item similarity. The term has little effect on search relevance.
• Negative DISCRIMi :
A negative value suggests that removing the term decreases the similarity
between items. This means the term actually causes items to appear more
similar and is not helpful for discrimination.

Normalization for Weighting


After computing the discrimination value, it is normalized to a positive value and
integrated into the standard weighting formula:
Weighti = TFi ⋅ DISCRIMi
• TFi : Term frequency of term i.
• Weighti : Final weight used for ranking items in search results.

Purpose
The discrimination value helps assign higher importance to terms that make items
more distinct, improving the precision of search results.
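A toy sketch of the discrimination value, using cosine similarity over made-up term-frequency vectors. The sign convention is chosen so that a positive value marks a term whose removal raises average similarity, i.e., a good discriminator (an assumption consistent with the interpretation above):

```python
# Discrimination value: compare average pairwise similarity with and
# without a given term. The toy vectors below are assumptions.
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ave_sim(items):
    # Average similarity over all distinct pairs of items (AVESIM).
    pairs = list(combinations(items, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def discrim(items, i):
    # Positive when removing term i makes items look MORE alike,
    # meaning the term was helping to tell them apart.
    without = [v[:i] + v[i + 1:] for v in items]
    return ave_sim(without) - ave_sim(items)

# Term 0 appears uniformly in every item, so it is a poor discriminator:
# dropping it makes the items look LESS alike, giving a negative value.
items = [[5, 5, 0], [5, 0, 5], [5, 1, 1]]
print(discrim(items, 0) < 0)
```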

Problems With Weighting Schemes


Weighting schemes in information retrieval systems often rely on token distribution
data within a database. Common methods like Inverse Document Frequency (IDF)
and Signal Weighting depend on term frequency across items, making them sensitive
to dynamic changes in the database due to new additions, modi cations, or deletions
of items.
Challenges in Weighting Schemes
1. Dynamic Databases:
Constant updates in the database change token distributions, impacting term
weights.
2. System Overhead:
Continuously recalculating term weights can be resource-intensive, especially
for large databases.
3. Fluctuating Search Results:
Changes in term frequencies can lead to inconsistent search results over time.
4. Time-Sensitive Information:
The relevance of information often decreases over time, requiring strategies to
manage older data.

Approaches to Address These Challenges


1. Periodic Rebuilding:
◦ Strategy: Ignore changes and calculate term weights using current
values. Rebuild the entire database periodically.
◦ Advantage: Minimizes system overhead during regular operations.
◦ Disadvantage: Rebuilding large databases can be costly and time-
consuming.
2. Threshold-Based Updates:
◦ Strategy: Use fixed weight values but monitor for significant changes.
Once a threshold is crossed, update the weights and vectors.
◦ Advantage: Reduces frequent recalculations and spreads out updates.
◦ Disadvantage: May delay updates, impacting search relevance in the
short term.
3. Dynamic Weight Calculation:
◦ Strategy: Store static values (like term frequency within items) and
calculate weights during each query.
◦ Advantage: Provides the most accurate and updated weights.
◦ Disadvantage: Increases runtime computation overhead, although
minimal when using inverted file structures.

Additional Concerns
• Changing Query Results Over Time:
As new terms are introduced and their frequency stabilizes, search results can
change, causing inconsistency in search rankings.
• Time-Based Partitioning:
Databases may be split by time periods (e.g., yearly), allowing users to search
specific time frames. However, this complicates how term weights are
aggregated across partitions and how results are merged.
• Integration Across Databases:
Ideally, users should be able to search multiple databases with different
weighting algorithms and receive a unified, ranked result set. This integration
challenge is addressed in more detail in Chapter 7.

Problems with the Vector Model


The vector model, while widely used in information retrieval systems, has several
inherent limitations that arise due to its structure and assumptions. Below is an
explanation of the key issues highlighted in the text.

1. Lack of Contextual Associations


• Issue: The vector model treats each processing token (term) as independent
and does not allow for associations between terms.
• Example:
An item discussing "oil in Mexico" and "coal in Pennsylvania" will produce
high relevance scores for a query like "coal in Mexico."
◦ This occurs because the model lacks a mechanism to associate terms
("coal" with "Pennsylvania" and "oil" with "Mexico").
◦ Root Cause: Each dimension in the vector space is independent of the
others, preventing the establishment of correlations between terms (also
referred to as precoordination in Chapter 3).

2. Absence of Positional Information


• Issue: The vector model does not account for the positional relationship of
terms within the text.
• Example:
A query such as "term A within 10 words of term B" cannot be resolved
because the vector model only stores a scalar weight for each term in an item,
without any information about where the term appears.
• Impact: Proximity searches and precise contextual understanding are not
supported.

3. Multi-Topic Items
• Issue: When an item discusses multiple distinct topics, the vector model
cannot differentiate between them.
• Example:
If an item discusses multiple energy sources ("oil" and "coal") in different
geographic locations ("Mexico" and "Pennsylvania"), the model fails to
partition these discussions.
◦ Instead, it treats the item as a whole, leading to imprecise search results
when queries involve overlapping topics.

4. Inability to Restrict Searches to Subsets of Items


• Issue: The model operates on entire items and does not allow for searches
restricted to specific portions of an item.

• Solution:
By enabling searches in subsets of items, the system can focus on sections
discussing specific topics, improving precision.
◦ Benefit: This addresses the multi-topic problem by isolating and
evaluating relevant subsets of the text.

Conclusion
While the vector model provides a mathematical framework for representing
documents, it struggles with semantic associations, positional information, and multi-
topic distinctions. Addressing these limitations often requires enhancements, such as
incorporating positional data or enabling searches in subsets of items, which are
discussed further in later sections.

Bayesian Model
The Bayesian model is introduced as a way to overcome some of the limitations of
the vector model by using conditional probabilities to determine relevance and
assign weights to processing tokens in a database. Below is a breakdown of the
concepts involved in the Bayesian approach.

1. Bayesian Approach Overview


• The Bayesian model uses conditional probabilities to make decisions based
on prior knowledge and the evidence at hand.
◦ A general formula used is:
P(REL | DOCi, Queryj)
This represents the probability of relevance (REL) given a particular
document (DOCi) and query (Queryj).
▪ REL refers to the relevance of a document to a query.
▪ The idea is to predict how likely a document is relevant based on
the query, utilizing past data and the relationships between them.
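As a minimal illustration of the conditional-probability idea, the sketch below applies Bayes' rule over the two hypotheses (relevant vs. not relevant); the prior and likelihood values are made up for the example:

```python
# Bayes' rule for P(REL | evidence) over two hypotheses: relevant / not.
def posterior_relevance(p_rel, p_evidence_given_rel, p_evidence_given_nonrel):
    # Numerator: likelihood of the evidence under relevance times prior.
    num = p_evidence_given_rel * p_rel
    # Denominator: total probability of the evidence under both hypotheses.
    den = num + p_evidence_given_nonrel * (1.0 - p_rel)
    return num / den

# Prior P(REL) = 0.1; observed tokens are 6x likelier under relevance.
print(round(posterior_relevance(0.1, 0.6, 0.1), 3))
```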

2. Bayesian Term Weighting


• Objective: The goal is to use the Bayesian model to assign weights to
processing tokens (terms) in an item, so as to represent the semantic content
accurately.

• Process Overview:
◦ A Bayesian network is used to determine which processing tokens (also
called "topics") should represent the content of the document and how
relevant they are.
◦ The figure shows that for a particular document, P(Topic i) represents
the relevance of topic i in the document, and P(Token j) represents a
statistic related to the occurrence of processing token j in the document.

3. Assumption of Binary Independence


• The Bayesian model operates on the Assumption of Binary Independence,
which simplifies the model:
◦ Independence of Topics: The existence of one topic does not affect the
existence of other topics.
◦ Independence of Tokens: The presence of one processing token does
not influence the presence of others.
• This assumption is often not true in real-world scenarios because topics and
tokens often interact or depend on each other (e.g., "Politics" and "Economics"
might frequently overlap in many contexts, while in others, they are entirely
unrelated).

4. Handling Dependencies Between Topics and Tokens


There are two main approaches to handle dependencies in the Bayesian model:
a. Ignoring Dependencies
• Approach: Assume that the errors caused by the independence assumption do
not significantly impact the relevance determination or ranking of items.
• Outcome: This is a common approach used in practical systems, as it reduces
complexity and computational overhead.
b. Extending the Network for Dependencies
• Approach: Extend the Bayesian network to account for dependencies between
topics and processing tokens.
◦ New Layers: Add additional layers for Independent Topics (ITs) and
Independent Processing Tokens (IPs) above the existing layers.
▪ These new layers capture interdependencies between tokens and
topics, refining the model's representation of document semantics.

◦ Impact: While this approach is more mathematically precise, it


introduces complexity and may reduce the precision of the semantic
representation due to a smaller number of concepts.

5. Summary
The Bayesian model introduces an advanced way to represent document relevance
and token weighting by using probabilities and dependencies. It offers a more
nuanced method than the vector model, especially in handling complex dependencies.
However, practical implementations often use simplifications (like ignoring
dependencies) to balance accuracy with computational efficiency. Extending the
model to account for these dependencies can enhance precision but at the cost of
model complexity.

Natural Language
• Natural Language Processing Goal: Enhance indexing precision by utilizing
semantic information in addition to statistical information.
• Semantic Information Extraction: Extracted from language processing, going
beyond treating words as independent entities.
• Output of Natural Language Processing: Generates phrases as indexes and
potentially thematic representations of events.

Index Phrase Generation


Purpose of Indexing:
The main goal of indexing in an information retrieval system is to represent the main
ideas (semantic concepts) of items (like documents) to help users find relevant
information.
• Single Words are sometimes too broad and can return too many irrelevant
results.
• Phrases (groups of words) give more precise information, making search
results more relevant.
Example:
• The word "field" can mean a grassy area or a magnetic field.
• But "grass field" or "magnetic field" makes the meaning clear.

Statistical Phrase Detection


Early Method by Salton:
• Salton introduced the COHESION factor to decide if two words should be
combined into a phrase.
• It checks how often two words appear together (e.g., side-by-side, in the same
sentence).
Formula:
COHESIONk,h = SIZE-FACTOR × (PAIR-FREQk,h / (TOTFk × TOTFh))
SMART System Improvements:
1. Pairs of adjacent important words (non-stop words) can form phrases.
2. The pair must appear in at least 25 items to be considered.
3. Phrase importance is measured using a modified version of the single-word
method.
4. Results are adjusted by dividing by the length of the word list to normalize.
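The COHESION factor above can be sketched as a function; the size factor and the toy frequencies below are assumptions:

```python
# COHESION grows when two words co-occur more often than their
# individual total frequencies would suggest.
def cohesion(pair_freq, totf_k, totf_h, size_factor=1000.0):
    return size_factor * pair_freq / (totf_k * totf_h)

# A tight pair (e.g., "magnetic field") versus a loose, incidental pair.
print(cohesion(40, 50, 60) > cohesion(5, 200, 300))
```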

Natural Language Processing (NLP) for Phrase Detection


Why NLP is Better:
• NLP can create longer and more meaningful phrases.
• Example: For the text "industrious intelligent students":
◦ Statistical methods might create: "industrious intelligent"
and "intelligent students".
◦ NLP could create: "industrious student", "intelligent
student", and "industrious intelligent student".
How NLP Works:
1. Lexical Analysis: Identifies parts of speech (nouns, adjectives, etc.).
2. Proper Noun Detection: Recognizes names, places, and organizations
(important for indexing).
3. Semantic Hierarchies: Builds relationships between concepts.
◦ Example: "nuclear reactor fusion" can be split into
"nuclear reactor" and "nuclear fusion".
4. Normalization: Different ways of saying the same thing are grouped together.
◦ Example: "blind Venetian" and "Venetian who is
blind" are indexed the same.
Example of NLP in Practice
New York University System:
• Processes text with a fast syntactic analyzer.
• Adds both single words and phrases to the index.
• Uses statistical analysis to link similar phrases and categorize their
relationships (e.g., synonyms, opposites).
Tagged Text Parser (TTP):
• Breaks sentences into parts (subject, object, etc.).
• Identifies the main idea (header) and details (modifiers).
• Example sentence:
◦ "The former Soviet President has been a local hero ever since a
Russian tank invaded Wisconsin."
◦ TTP identifies "Soviet President" and "local hero" as important
phrases.
Measuring Phrase Importance
Informational Contribution (IC):
• Measures how closely two words are related.
• Higher IC means a stronger connection between words.
Weighting Phrases:
• Phrases occur less frequently than single words, so their importance must be
adjusted.
• The inverse document frequency (IDF) method is adapted to avoid
undervaluing rare but meaningful phrases.
• In the NYU system, more weight is given to the top phrases based on their IDF
scores.
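A minimal sketch of the standard IDF idea referred to above (the base-2 logarithm and the collection sizes are illustrative assumptions, not the NYU system's exact formula):

```python
import math

def idf(total_items, items_with_term):
    """Classic inverse document frequency: rarer terms or phrases score higher."""
    return math.log2(total_items / items_with_term)

# A phrase found in 2 of 1024 items far outweighs a word found in 512 of them.
print(idf(1024, 2))    # 9.0
print(idf(1024, 512))  # 1.0
```

Because phrases occur in far fewer items than single words, their IDF values are naturally higher, which is why the weighting must be adapted rather than applied blindly.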
Summary
• Indexing phrases improves search accuracy compared to single words.
• Statistical methods use word co-occurrence, but NLP understands grammar
and meaning for better phrase detection.
• Proper weighting ensures that rare but important phrases help users find
relevant results.
Natural Language Processing
Natural Language Processing (NLP) is a field of artificial intelligence that involves
analyzing, understanding, and generating human language in a way that is meaningful
and useful. The text you provided explains a process for advanced NLP, particularly
in the context of indexing and understanding relationships between concepts in
documents.
Here's a breakdown of the key components from the text:
1. Lexical Analysis and Term Phrases: Before performing higher-level analysis,
basic tasks like identifying verb tense, plurality, and parts of speech are done.
Term phrases are generated as indexes, which are essential for retrieving
relevant information later.
2. Semantic Analysis: NLP systems go beyond simple indexing by identifying
relationships between concepts. The goal is to process text to understand its
deeper meanings and relationships, such as how events are linked or the causes
of specific actions.
3. The DR-LINK System: The DR-LINK system is an NLP system that includes
processes for:
◦ Relationship Concept Detectors: These identify relationships between
concepts in text.
◦ Conceptual Graph Generators: These create graphs representing
concepts and their relationships.
◦ Conceptual Graph Matchers: These match different conceptual graphs
to identify how concepts in the document relate.
4. Mapping to Subject Codes: In this phase, words are assigned codes that
represent specific topics or categories, which come from a predefined
dictionary (Longman’s Dictionary of Common English). These codes help
disambiguate terms and assign them to the most likely meaning based on
statistical relationships.
5. Text Structuring: The system attempts to categorize parts of a document (e.g.,
news stories) into thematic areas like "Evaluation," "Main Event," or
"Expectations." These categories help the system understand the structure of
the text, which could be further refined based on user preferences or the
specific query being searched.
6. Topic and Time Frame Identification: The system also attempts to identify
not just the topics of a document but also their time frames (e.g., past, present,
or future). This is particularly useful for analyzing time-sensitive data like
news articles, where the understanding of timing can change the meaning of
the content.
7. News Schema Components: The system uses a structured model (like a News
Schema) to identify different components within a news article, such as
"Circumstance," "Consequence," "History," and others. This structure helps
evaluate which parts of a document should be emphasized based on the user's
query.
8. Semantic Processing: Beyond identifying topics, the system classifies the
intent behind the terms and identifies the relationships between different
concepts. For example, in the context of national elections and guerrilla
warfare, the system needs to identify whether elections were caused by warfare
or vice versa. This relationship is essential for accurately understanding the
context.
9. Relationships and Weights: The relationships between concepts are
visualized as triples (Concept 1, Relationship, Concept 2), and each
relationship is assigned a weight based on its relevance and how it is expressed
in the text (e.g., using active vs. passive verbs).
10. Data Structures: Additional data structures store the semantic information,
which can be used for natural language-based search queries or to respond to
specific user requests for deeper information about the relationships between
terms.
In Summary, Natural language processing enhances indexing accuracy by
identifying semantic relationships between concepts. Systems like DR-LINK use
statistical methods, predefined subject codes, and discourse analysis to categorize text
and assign weights to different components. These components, along with inter-
concept relationships, are stored for use in natural language-based search queries.
Concept Indexing
Concept Indexing is a technique used in natural language processing (NLP) to
map specific terms in a document to higher-level concepts rather than individual
words or terms. This approach is aimed at creating more abstract and semantically
meaningful representations of documents, making it easier to categorize and retrieve
relevant information based on broader ideas rather than just specific terms.
1. Mapping Terms to Concepts
• In Concept Indexing, rather than indexing a document based on individual
words or terms, the system maps these terms to broader concepts. These
concepts are not just synonyms or related terms; they represent larger, more
general ideas or categories that the words might belong to.
• For instance, the term "automobile" could be associated with multiple
concepts like vehicle, transportation, mechanical device, fuel, and
environment, each of which may have different levels of relevance or
association to the term.
2. Use of Controlled Vocabularies or Subject Codes
• Subject Codes or controlled vocabularies are predefined systems of terms that
represent key concepts in a domain (e.g., "Science," "Technology," "Health").
These vocabularies help map specific terms in documents to more general
concepts that are deemed important by an organization.
• In the DR-LINK system mentioned, terms within a document are replaced by
associated Subject Codes, which represent the broader concept the term refers
to.
3. Automatic Creation of Concept Classes
• Instead of manually defining a set of concepts, Concept Indexing can use
machine learning algorithms to automatically create concept classes. These
concept classes group terms with similar meanings or contexts based on the
content of the documents themselves.
• The process of creating concept classes is similar to generating a thesaurus or
a taxonomy. It allows the system to identify which concepts are related based
on how terms appear together in various contexts within the data.
4. Complexity of Mapping Terms to Concepts
• One challenge in Concept Indexing is that a single term can map to multiple
concepts in different contexts. For example, the term "automobile" might
refer strongly to the concept "vehicle", but less strongly to concepts like
"fuel" or "environment."
• To handle this, terms are typically associated with several concept codes, each
with a weight that indicates the strength of its relationship to the concept.
These weights help the system determine how relevant each concept is in
relation to the specific context of the document or item.
5. Neural Network Models and Contextual Relationships
• The Convectis System uses a neural network approach to identify
relationships between terms based on their proximity and context within a
document. This means that terms that appear in similar contexts are likely to be
related to the same or similar concepts.
• In this system, terms that are close to one another in context (e.g., not too far
apart in a document) are considered more strongly related, and this proximity
can help define the concept class they belong to.
6. Vector Representation of Concepts
• Concept Indexing can represent both terms and documents as vectors in a
high-dimensional space. Each term is represented by a vector of concepts, and
these vectors are weighted according to the relevance of each concept.
• A document can be represented as a weighted average of the concept vectors of
the terms it contains. This enables the system to compare documents not just
by the specific words they contain, but by the broader concepts they represent.
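A minimal sketch of this weighted-average representation (the terms, concepts, and weights below are invented for illustration):

```python
# Hypothetical concept vectors: each term maps to weighted broader concepts.
term_concepts = {
    "automobile": {"vehicle": 0.9, "fuel": 0.3, "environment": 0.2},
    "highway":    {"vehicle": 0.4, "transportation": 0.8},
}

def document_vector(term_weights):
    """Represent a document as the weighted average of its terms' concept vectors."""
    doc = {}
    total = sum(term_weights.values())
    for term, weight in term_weights.items():
        for concept, strength in term_concepts[term].items():
            doc[concept] = doc.get(concept, 0.0) + weight * strength / total
    return doc

vec = document_vector({"automobile": 2.0, "highway": 1.0})
print({c: round(w, 3) for c, w in vec.items()})
```

Two documents can then be compared in concept space even when they share no literal terms.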
7. Dimensionality Reduction
• Techniques like Latent Semantic Indexing (LSI) are used to reduce the
complexity of the high-dimensional space by finding a smaller set of
underlying concepts (or latent factors) that can still represent the significant
relationships between terms and documents.
• LSI applies singular value decomposition (SVD) to break down large term-
document matrices into smaller, orthogonal concept vectors that capture the
main semantic structures in the data, while reducing noise and irrelevant
details.
• Probabilistic Latent Semantic Analysis (PLSA) is a modification of LSI that
adds a probabilistic framework, improving the system's ability to model
relationships between terms.
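Full LSI applies SVD to the complete term-document matrix; as a rough stdlib-only illustration of the underlying idea, the sketch below uses power iteration to extract the dominant singular direction of a tiny invented matrix and builds a rank-1 approximation of it:

```python
import math

# Tiny term-document matrix (rows = terms, columns = documents); values invented.
A = [[1.0, 1.0, 1.0],
     [2.0, 2.0, 2.1]]

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Power iteration on A^T A converges to the dominant right singular vector.
v = [1.0] * len(A[0])
for _ in range(100):
    w = matvec(transpose(A), matvec(A, v))
    v = [x / norm(w) for x in w]

sigma = norm(matvec(A, v))               # dominant singular value
u = [x / sigma for x in matvec(A, v)]    # dominant left singular vector

# Rank-1 "concept" approximation of A: sigma * u * v^T
approx = [[sigma * ui * vj for vj in v] for ui in u]
err = math.sqrt(sum((A[i][j] - approx[i][j]) ** 2
                    for i in range(len(A)) for j in range(len(A[0]))))
fro = math.sqrt(sum(x * x for row in A for x in row))
print(err / fro)  # small ratio: one latent dimension captures almost all of A
```

Because the two rows of A are nearly proportional, a single latent "concept" reproduces the matrix almost exactly; real LSI keeps the top k such directions rather than just one.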
8. Trade-offs and Challenges
• There are challenges with finding a perfect set of concepts that can represent all
possible relationships between terms. In practice, dimensionality reduction
often means that some degree of overlap between concepts remains, and terms
may be mapped to multiple concepts with varying degrees of association.
• Choosing the correct dimensionality (the number of concepts used) is crucial
for the system's performance. Too few concepts will result in poor
discrimination, with many documents being falsely matched. Too many
concepts will result in reduced effectiveness, as the system will be closer to a
standard vector space model and lose the benefits of concept-based indexing.
9. Applications of Concept Indexing
• Latent Semantic Indexing (LSI) and other vector space models can use
concept indexing to improve document retrieval by finding the latent semantic
structure between terms.
• Systems like Convectis use these techniques to automatically group and map
terms to concept classes, making the search and retrieval process more
effective by focusing on the relationships between broader concepts, rather
than just matching specific terms.
In summary, Concept indexing aims to map terms to concepts, reducing the
dimensionality of the index space. Approaches like Convectis and Latent Semantic
Indexing (LSI) use neural networks and singular value decomposition to identify
related terms and create concept classes. LSI, in particular, reduces the term-
document matrix to a smaller set of orthogonal factors, approximating the original
matrix and enabling efficient information retrieval.
HyperText Linkages
Hypertext Linkages refer to the connections between different items or pieces of
information on the Internet (or in hypertext systems) that allow users to navigate
between them, creating a network of interlinked documents. In this text, hypertext
linkages are discussed in the context of information retrieval (IR) systems, and how
these linkages can enhance or complicate the process of indexing and searching for
information. Here's a breakdown of the key points:
1. Hypertext as a Data Structure
• Definition: Hypertext allows for the linking of content (text, images,
multimedia), forming a network of interconnected information.
• Usage: Primarily used on the Internet, where hyperlinks connect webpages or
related topics.
• Potential: Hyperlinks, if properly analyzed, could improve indexing and
searching, though current research and applications are limited.
2. Adding a Retrieval Dimension
• Traditional IR Systems: Typically consider:
◦ Content Dimension: The text or information in a document.
◦ Reference Dimension: Embedded references (e.g., citations,
hyperlinks).
• Hypertext Enhancement: Hyperlinks act as an additional dimension, linking
related documents and expanding the context of searches.
3. Challenges with Hyperlinks in Indexing
• Current Tools: Use manual indexes, automated crawlers, or web search
engines (e.g., Lycos, AltaVista).
• Limitations: Tools often overlook the semantic relationships hyperlinks create
between content.
• Solution Needed: Advanced algorithms that:
◦ Treat hyperlinks as contextual extensions of a document.
◦ Consider linked content’s relevance and its impact on search results.
4. Hyperlinks as Content Extensions
• Concept: Links provide additional insights, enriching a document’s context.
• Proposed Model: Include concepts from linked items in a document's index
but assign reduced weight to prevent irrelevant content from dominating.
◦ Link Strength: Determines how much weight a link should contribute
based on its relevance.
◦ Weighted Indexing: Balances main content and linked content
effectively.
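A minimal sketch of this weighted-indexing idea (the link_strength value, term names, and weights are assumptions for illustration):

```python
def merge_index(main_weights, linked_weights, link_strength=0.25):
    """Combine a document's own term weights with reduced-weight terms from
    an item it links to; link_strength < 1 keeps linked content from dominating."""
    merged = dict(main_weights)
    for term, w in linked_weights.items():
        merged[term] = merged.get(term, 0.0) + link_strength * w
    return merged

# Hypothetical example: a Louisiana-economy page linking to a drought report.
doc = {"louisiana": 0.8, "economy": 0.6}
linked = {"drought": 0.9, "economy": 0.4}
index = merge_index(doc, linked)
print(index)  # "drought" enters the index, but at reduced weight
```

A query on droughts can now retrieve the economy page even though it never mentions droughts itself.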
5. Automatic Hyperlink Generation
• Objective: Dynamically generate hyperlinks by analyzing document content
for similarities.
• Techniques:
◦ Clustering: Groups similar documents for potential linking.
◦ Thresholding: Links are created when document similarity exceeds a
set threshold.
• Challenges: Parsing errors, segmentation issues, and managing dynamic
databases.
6. Weighting and Normalization
• Weight Assignment: Hyperlinks are less important than main content but still
play a significant role.
• Normalization: Adjusts the relevance of linked content based on the strength
and context of links.
7. Enhancing Search Relevance
• Impact of Hyperlinks: Linking related documents improves search results,
even if the primary document doesn’t explicitly cover all related topics.
• Example: A query on "droughts in Louisiana" might retrieve:
◦ A document on Louisiana's economy.
◦ Linked content about crop damage due to droughts, thanks to hyperlink
context.
Conclusion
Hypertext linkages represent a largely untapped resource in IR systems. By
incorporating hyperlinks into indexing algorithms, search systems can enhance
relevance and provide a richer, more interconnected retrieval experience. This would
elevate hyperlinks from navigational aids to integral components of concept-based
information retrieval.
INFORMATION RETRIEVAL SYSTEMS
UNIT-3 PART-2
DOCUMENT AND TERM CLUSTERING
Clustering techniques are discussed for index terms and items in information retrieval
systems. These techniques aim to improve recall and retrieve similar items, but
require careful use to avoid negative impacts.
Introduction to Clustering
Clustering is the process of grouping similar objects (e.g., words, documents, or
terms) into classes or categories to enhance the organization and retrieval of
information. This approach has been widely applied in libraries, thesauri creation, and
modern information systems. Here's an overview of clustering based on your text:
1. Purpose of Clustering
• Historical Context: Initially used in libraries to group items on the same
subject, enabling easier information retrieval. This evolved into indexing
schemes and standards for electronic indexes.
• Definition: Clustering groups similar objects under a general title (class) and
allows linkages between clusters.
• Applications: Includes thesauri development, document organization, and
query expansion in information retrieval.
2. Steps in Clustering
a. Defining the Domain:
• Identify the scope of items to be clustered (e.g., medical terms or a specific
subset of a database).
• Reduces erroneous data and ensures a focused clustering process.
b. Determining Attributes:
• Decide which features of the objects will be used to assess similarity.
• For documents, focus on specific zones (e.g., title, abstract) to avoid irrelevant
associations.
c. Measuring Relationships:
• Establish similarity functions or metrics to determine the strength of
relationships.
• For thesauri, this involves identifying synonyms and other term relationships.
For documents, word co-occurrence can define similarity.
d. Applying Clustering Algorithms:
• Use algorithms to assign objects to appropriate clusters based on their
relationships.
• Results in well-organized classes with meaningful groupings.
3. Characteristics of Effective Clusters
• Semantic Clarity: Each class should have a clear, well-defined meaning.
Misleading names or class labels can confuse users.
• Balanced Sizes: Classes should be similar in size to ensure utility. Overly large
or dominant classes can dilute the effectiveness of the clustering.
• Avoiding Dominance: No single object should dominate a cluster, as it may
skew results and introduce errors.
• Multiple Class Membership: Objects can belong to multiple classes, allowing
flexibility but adding complexity.
4. Clustering in Thesauri
• Relationships Between Words:
◦ Equivalence: Identifying synonyms.
◦ Hierarchical: Grouping under a general term (e.g., "computer" with
subclasses like "microprocessor" and "Pentium").
◦ Non-Hierarchical: Relationships like object-attribute pairs (e.g.,
"employee" and "job title").
• Word Coordination: Determines if clustering is based on phrases or
individual terms.
• Homograph Resolution: Distinguishes between words with multiple
meanings (e.g., "field" as agriculture or electronics).
5. Challenges in Clustering
• Ambiguity of Language: Inherent imprecision in clustering algorithms and
natural language ambiguities can complicate the process.
• Recall vs. Precision: While clustering improves recall (finding more relevant
items), it can reduce precision (retrieving irrelevant items).
• Algorithm Selection: The success of clustering depends on choosing effective
similarity measures and algorithms.
6. Importance of Similarity and Algorithms
• Similarity Measures: Key to determining which items belong in the same
class.
• Algorithm Selection: Determines how items are grouped into clusters.
• Hierarchical Clustering: Can organize items into a hierarchy but may reduce
recall due to over-specified groupings.
Conclusion
Clustering, a technique used in information organization, groups similar objects
(terms or items) into classes under general titles. The process involves defining the
domain, determining attributes, assessing relationship strength, and applying an
algorithm to assign items to classes. While clustering improves recall, it may decrease
precision, necessitating careful selection of similarity measures and algorithms to
ensure manageable retrieval results.
Thesaurus Generation
• Thesaurus Generation Methods: Hand-crafted, co-occurrence, and header-modifier based.
• Co-occurrence Thesaurus: Terms are grouped based on their co-occurrence
patterns in a corpus.
• Header-modifier Based Thesaurus: Terms are grouped based on their linguistic
relationships, such as co-occurrence in similar grammatical contexts.
Thesaurus generation is done in 2 ways.
1. Manual clustering
2. Automatic Term clustering
1. Manual Clustering
Manual clustering is a structured, human-driven process that aligns with the broader
steps of clustering outlined in Section 6.1. It focuses on carefully curating terms for a
thesaurus, employing human judgment and contextual understanding to create
meaningful groupings. Here's a detailed breakdown:
1. Steps in Manual Clustering
1. Defining the Domain
◦ Establish the scope or subject area for clustering (e.g., medical terms,
data processing machines).
◦ This helps reduce ambiguities caused by homographs (e.g., "field" as a
domain term).
◦ Sources like existing thesauri, concordances, and dictionaries serve as
starting points for gathering potential terms.
2. Using Concordances
◦ A concordance is an alphabetical listing of words, including their
frequency and references to items where they appear.
◦ It helps identify candidate terms for inclusion in the thesaurus.
◦ Care is taken to exclude:
▪ Words unrelated to the domain.
▪ High-frequency terms with low informational value (e.g.,
"computer" in a thesaurus for data processing).
3. Tools for Contextual Analysis
◦ Tools like KWOC, KWIC, and KWAC aid in understanding word
usage and meaning:
▪ KWOC (Key Word Out of Context): Displays a concordance-
like list of keywords without their context.
▪ KWIC (Key Word In Context): Shows keywords within their
sentence or phrase, making it easier to identify meaning. For
instance, "memory chips" helps distinguish "chips" from "wood
chips."
▪ KWAC (Key Word And Context): Displays keywords followed
by more context, providing a detailed understanding of usage.
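A minimal KWIC-style generator can be sketched as follows (the window size and bracket formatting are illustrative choices, not a standard):

```python
def kwic(text, keyword, window=3):
    """Key Word In Context: list each occurrence of keyword with up to
    `window` words of context on either side."""
    words = text.lower().split()
    lines = []
    for i, w in enumerate(words):
        if w == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left} [{w}] {right}".strip())
    return lines

lines = kwic("memory chips are faster than wood chips", "chips")
for line in lines:
    print(line)
```

The surrounding words make it easy for a human editor to decide which sense of "chips" each occurrence carries.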
2. Term Selection
• Art of Selection:
◦ Requires expertise to identify meaningful terms while avoiding
irrelevant or overly common words.
◦ Homographs and ambiguous terms are resolved using tools like KWIC
and KWAC for context.
3. Clustering Terms
• Word Relationship Guidelines:
◦ Terms are grouped based on predefined relationships (e.g., synonyms,
hierarchical structures).
◦ Strength of relationships is interpreted by the human editor.
• Judgment-Based Process:
◦ The human analyst plays a key role in assessing relationships and
organizing terms into clusters.
4. Quality Assurance
• Review Process:
◦ The thesaurus undergoes multiple quality checks by additional editors.
◦ Guidelines, such as those for term relationships and class sizes, ensure
precision and usability.
Conclusion
Manual clustering relies heavily on human expertise, contextual understanding, and
judgment. Tools like concordances, KWOC, KWIC, and KWAC provide support, but
the effectiveness of the process depends on careful term selection, relationship
interpretation, and iterative quality assurance.
2. Automatic Term Clustering
• Term Clustering Techniques: Various techniques exist for automatic term
clustering, differing in completeness and computational overhead.
• Thesaurus Generation: Automatic thesaurus generation involves creating clusters
of terms based on their co-occurrence frequency in a set of items representing the
target domain.
• Clustering Algorithm: Automated clustering of documents is based on polythetic
clustering, where clusters are defined by sets of words and phrases, and items are
assigned to clusters based on the similarity of their words and phrases to those in
the cluster.
Automatic term clustering has 3 methods
1. Complete term relation method
2. Clustering using existing clusters
3. One pass Assignment
Complete Term Relation Method
The Complete Term Relation Method is a clustering approach that calculates the
similarity between all term pairs to form clusters. This method involves detailed
matrix operations and multiple algorithms to group terms based on their relationships.
1. Core Concept
• Vector Model Representation:
◦ A matrix represents the items (rows) and terms (columns).
◦ The values in the matrix indicate the strength of association between a
term and an item.
◦ Example: Figure 6.2 shows a sample database with five items and eight
terms.
Term-1 Term-2 Term-3 Term-4 Term-5 Term-6 Term-7 Term-8
Item 1 0 4 0 0 0 2 1 3
Item 2 3 1 4 3 1 2 0 1
Item 3 3 0 0 0 3 0 3 0
Item 4 0 1 0 3 0 0 2 0
Item 5 2 2 2 3 1 4 0 2
• Similarity Measure:
◦ Determines how closely two terms are related.
Similarity(Termi, Termj) = Σk (Termk,i × Termk,j)
◦ Summation across all items of the product of term pair values from the
matrix.
◦ Results in a symmetric Term-Term Matrix (Figure 6.3) where each cell
represents the similarity score between two terms.
Term1 Term2 Term3 Term4 Term5 Term6 Term7 Term8
Term1 - 7 16 15 14 14 9 7
Term2 7 - 8 12 3 18 6 17
Term3 16 8 - 18 6 16 0 8
Term4 15 12 18 - 6 18 6 9
Term5 14 3 6 6 - 6 9 3
Term6 14 18 16 18 6 - 2 16
Term7 9 6 0 6 9 2 - 3
Term8 7 17 8 9 3 16 3 -
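The Term-Term matrix can be reproduced directly from the item-term matrix of Figure 6.2; a short sketch in Python:

```python
# Item-term matrix from Figure 6.2 (rows = items, columns = Term 1..Term 8).
items = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]

def term_similarity(i, j):
    """Sum over all items of the product of the two terms' weights."""
    return sum(row[i] * row[j] for row in items)

n = len(items[0])
sim = [[term_similarity(i, j) for j in range(n)] for i in range(n)]
print(sim[0][1], sim[0][2])  # 7 16 -- matches the Term-Term matrix
```

The resulting matrix is symmetric, so in practice only the upper triangle needs to be computed.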
• Threshold for Clustering:
◦ A similarity threshold (e.g., 10) is set to determine which terms are
closely related.
◦ A Term Relationship Matrix (Figure 6.4) is created, showing binary
relationships (1 for similar terms, 0 otherwise).
Term1 Term2 Term3 Term4 Term5 Term6 Term7 Term8
Term1 1 0 1 1 1 1 0 0
Term2 0 1 0 1 0 1 0 1
Term3 1 0 1 1 0 1 0 0
Term4 1 1 1 1 0 1 0 0
Term5 1 0 0 0 1 0 0 0
Term6 1 1 1 1 0 1 0 1
Term7 0 0 0 0 0 0 1 0
Term8 0 1 0 0 0 1 0 1
2. Algorithms for Clustering
Multiple algorithms are used to determine clusters from the Term Relationship
Matrix, each with unique constraints and outputs.
A. Clique Technique
• Definition:
◦ All terms in a cluster must be similar to every other term in that cluster.
• Characteristics:
◦ High precision, strong relationships, and concept-specific clusters.
◦ Produces more clusters as strict similarity constraints reduce cluster size.
• Example Classes (Figure 6.4):
◦ Class 1: Term 1, Term 3, Term 4, Term 6
◦ Class 2: Term 1, Term 5
◦ Class 3: Term 2, Term 4, Term 6
◦ Class 4: Term 2, Term 6, Term 8
◦ Class 5: Term 7
B. Single Link Clustering
• Definition:
◦ A term is added to a cluster if it is similar to any term already in the
cluster.
• Characteristics:
◦ Looser constraints result in fewer clusters with weaker relationships.
◦ Suitable for sparse Term Relationship Matrices.
• Example Classes (Figure 6.4):
◦ Class 1: Term 1, Term 3, Term 4, Term 5, Term 6, Term 2, Term 8
◦ Class 2: Term 7
C. Star Technique
• Definition:
◦ Select a term as the cluster core, then add all terms related to it.
◦ Repeat with unclustered terms as new cores.
• Characteristics:
◦ Terms may belong to multiple clusters unless constraints are applied.
• Example Classes (Figure 6.4):
◦ Class 1: Term 1, Term 3, Term 4, Term 5, Term 6
◦ Class 2: Term 2, Term 4, Term 8, Term 6
◦ Class 3: Term 7
D. String Technique
• Definition:
◦ Begin with a term and iteratively add one related term at a time, forming
a chain.
◦ Stop when no more terms can be added, then start a new class.
• Characteristics:
◦ Sequential, produces smaller, linear clusters.
• Example Classes (Figure 6.4):
◦ Class 1: Term 1, Term 3, Term 4, Term 2, Term 6, Term 8
◦ Class 2: Term 5
◦ Class 3: Term 7
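Using the similarity values of Figure 6.3 and a threshold of 10, the single-link classes above can be reproduced as connected components of the term relationship graph (a sketch; the union-find bookkeeping is an implementation choice):

```python
# Term-Term similarities from Figure 6.3 (upper triangle, 1-based term ids).
sims = {
    (1, 2): 7,  (1, 3): 16, (1, 4): 15, (1, 5): 14, (1, 6): 14, (1, 7): 9,
    (1, 8): 7,  (2, 3): 8,  (2, 4): 12, (2, 5): 3,  (2, 6): 18, (2, 7): 6,
    (2, 8): 17, (3, 4): 18, (3, 5): 6,  (3, 6): 16, (3, 7): 0,  (3, 8): 8,
    (4, 5): 6,  (4, 6): 18, (4, 7): 6,  (4, 8): 9,  (5, 6): 6,  (5, 7): 9,
    (5, 8): 3,  (6, 7): 2,  (6, 8): 16, (7, 8): 3,
}
THRESHOLD = 10
edges = {pair for pair, s in sims.items() if s >= THRESHOLD}

# Single link: connected components of the relationship graph (union-find).
parent = {t: t for t in range(1, 9)}
def find(t):
    while parent[t] != t:
        t = parent[t]
    return t
for a, b in edges:
    parent[find(a)] = find(b)

clusters = {}
for t in range(1, 9):
    clusters.setdefault(find(t), set()).add(t)
classes = sorted(sorted(c) for c in clusters.values())
print(classes)  # [[1, 2, 3, 4, 5, 6, 8], [7]]
```

Term 7 has no similarity of 10 or more with any other term, so it remains a singleton class, exactly as in the single-link example above.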
3. Visual Representation
• Network Diagram:
◦ Nodes represent terms, and arcs indicate similarity relationships.
◦ Identifies cliques (subnetworks), single links, and other structures.
◦ Example: Figure 6.5 demonstrates network-based clustering.
4. Algorithm Selection Considerations
• Matrix Density:
◦ Sparse Matrix: Use Single Link for broader clusters.
◦ Dense Matrix: Use Cliques for precision and compact clusters.
• Thesaurus Objectives:
◦ Query Expansion: Cliques for high precision.
◦ Maximizing Recall: Single Link for broad coverage.
5. Strengths and Limitations
Algorithm Strengths Limitations
Clique High precision, concept-specific More clusters, computationally intensive
Single Link High recall, fewer clusters Weaker relationships, less concept-specific
Star Flexible, allows overlap of clusters Overlap may cause redundancy
String Simplistic and efficient Limited by sequential nature
Conclusion
The Complete Term Relation Method provides a comprehensive approach to term
clustering, with various algorithms catering to different needs and data structures.
The choice of algorithm depends on the density of relationships, computational
resources, and the intended use of the thesaurus.
Clustering Using Existing Clusters
Clustering using existing clusters is an iterative methodology aimed at reducing
computational complexity compared to the Complete Term Relation Method. This
approach refines initial arbitrary cluster assignments by revalidating and reallocating
terms until cluster assignments stabilize.
1. Key Concepts
Centroids:
• A centroid represents the "center of mass" of a cluster in an N-dimensional
space, where N is the number of items in the dataset.
• It is calculated as the average vector of all terms in a cluster.
Iterative Process:
1. Initial Assignment:
◦ Terms are arbitrarily assigned to clusters, defining initial centroids.
2. Reallocation Based on Similarity:
◦ Similarity between each term and cluster centroids is recalculated.
◦ Terms are reassigned to clusters with the highest similarity.
◦ New centroids are computed for updated clusters.
3. Stabilization:
◦ The process repeats until minimal movement occurs between clusters.
Efficiency:
• Computational complexity is reduced to O(n), as similarity calculations are
performed only between terms and cluster centroids, not every pair of terms.
2. Example Process
Initial Setup:
Using the data from the table below:
Term-1 Term-2 Term-3 Term-4 Term-5 Term-6 Term-7 Term-8
Item 1 0 4 0 0 0 2 1 3
Item 2 3 1 4 3 1 2 0 1
Item 3 3 0 0 0 3 0 3 0
Item 4 0 1 0 3 0 0 2 0
Item 5 2 2 2 3 1 4 0 2
• Arbitrary Assignments:
◦ Class 1: Term 1, Term 2
◦ Class 2: Term 3, Term 4
◦ Class 3: Term 5, Term 6
Centroid Calculation:
• Class Centroids:
◦ Each centroid value is the average of the term weights for the respective
cluster.
◦ Example: For Class 1, the centroid for Item 1 is calculated as:

Centroid(Item 1) = (Weight(Term 1, Item 1) + Weight(Term 2, Item 1)) / 2
Reassignment Based on Similarity:
1. Calculate the similarity of each term to all centroids using the similarity
measure from 6.2.2.1.
2. Assign terms to the cluster with the highest similarity.
Example Assignment (Figure 6.7):
• Terms are reassigned based on similarity scores.
• For ties, additional heuristics (e.g., majority similarity within a cluster) are
used.
• Example: Term 5 was assigned to Class 3 because of closer overall similarity.
Term-1 Term-2 Term-3 Term-4 Term-5 Term-6 Term-7 Term-8
Class 1 29/2 29/2 24/2 27/2 17/2 32/2 15/2 24/2
Class 2 31/2 20/2 38/2 45/2 12/2 34/2 6/2 17/2
Class 3 28/2 21/2 22/2 24/2 17/2 30/2 11/2 19/2
Average Class 2 Class 1 Class 2 Class 2 Class 3 Class 2 Class 1 Class 1
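The reassignment step can be sketched as follows, reproducing a few of the values in the table (e.g., Term 1 scores 29/2 = 14.5 against the Class 1 centroid but 31/2 = 15.5 against Class 2, so it moves to Class 2):

```python
# Term vectors are the columns of the item-term matrix above (1-based ids).
terms = {
    1: [0, 3, 3, 0, 2], 2: [4, 1, 0, 1, 2], 3: [0, 4, 0, 0, 2],
    4: [0, 3, 0, 3, 3], 5: [0, 1, 3, 0, 1], 6: [2, 2, 0, 0, 4],
    7: [1, 0, 3, 2, 0], 8: [3, 1, 0, 0, 2],
}

def centroid(cluster):
    vecs = [terms[t] for t in cluster]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

classes = {1: [1, 2], 2: [3, 4], 3: [5, 6]}   # arbitrary initial assignment
cents = {c: centroid(members) for c, members in classes.items()}

# Similarity of Term 1 to each class centroid (the 29/2, 31/2, 28/2 column):
sims = {c: dot(terms[1], cents[c]) for c in cents}
best = max(sims, key=sims.get)
print(sims, best)  # Term 1 is reassigned to Class 2
```

Repeating this for every term, recomputing centroids, and iterating until no term moves yields the stabilized clusters shown next.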
3. Iterative Refinement
• New Centroids:
◦ Recalculate centroids for updated clusters after reassignment.
• Adjustments:
◦ Example: Term 7 moves from Class 1 to Class 3 in the next iteration due
to its weaker relation to Class 1 terms.
Final Stabilized Clusters:
• Class 1: Terms most strongly associated with Centroid 1.
• Class 2: Terms related to Centroid 2.
• Class 3: Terms with the highest similarity to Centroid 3.
Term-1 Term-2 Term-3 Term-4 Term-5 Term-6 Term-7 Term-8
Class 1 23/3 45/3 16/3 27/3 15/3 36/3 23/3 34/3
Class 2 67/4 45/4 70/4 78/4 33/4 72/4 17/4 40/4
Class 3 12/1 3/1 6/1 6/1 11/1 6/1 9/1 3/1
Average Class 2 Class 1 Class 2 Class 2 Class 3 Class 2 Class 3 Class 1
4. Strengths and Limitations
Strengths:
1. Efficiency:
◦ Reduces the number of similarity calculations compared to complete
term relation methods.
2. Iterative Improvement:
◦ Initial arbitrary assignments are refined for better cluster accuracy.
3. Applicability:
◦ Useful for datasets where the number of clusters is predefined.
Limitations:
1. Fixed Number of Clusters:
◦ The number of clusters is predetermined and cannot increase during the
process.
2. Forced Assignments:
◦ All terms must belong to a cluster, even if their similarity to any cluster
is weak.
3. Dependence on Initialization:
◦ Poor initial assignments may lead to suboptimal clustering.
5. Summary
Clustering using existing clusters is an efficient and iterative approach to term
clustering. While it offers computational benefits, its fixed cluster count and forced
assignments can limit its adaptability. This method is particularly effective in
scenarios where the number of clusters is known in advance and computational
resources are a concern.
One Pass Assignments
The One Pass Assignments technique is a clustering method that minimizes
overhead by requiring only a single pass through all terms. In this approach, terms are
sequentially assigned to clusters based on their similarity to existing cluster centroids.
It is a quick method, with computational complexity on the order of O(n), but it
may not always produce optimal clusters.

1. Key Concepts
Process Overview:
1. Initial Term Assignment:
◦ The first term is assigned to the first class.
2. Subsequent Term Assignments:
◦ For each subsequent term:
▪ It is compared to the centroids of existing classes.
▪ A threshold value is set to decide if the term is similar enough to
be added to an existing class.
▪ If the term's similarity to the closest centroid exceeds the
threshold, it is assigned to that class, and a new centroid is
calculated for the class.
▪ If no similarity exceeds the threshold, the term starts a new class.
3. Iteration:
◦ The process continues until all terms are assigned to a class.
Centroid Update:
• Each time a new term is added to a class, the centroid for that class is
recalculated as the average of the weights of all terms in the class.

2. Example:
Data Setup (from Figure 6.3):
Term-1 Term-2 Term-3 Term-4 Term-5 Term-6 Term-7 Term-8
Item 1 0 4 0 0 0 2 1 3
Item 2 3 1 4 3 1 2 0 1
Item 3 3 0 0 0 3 0 3 0
Item 4 0 1 0 3 0 0 2 0
Item 5 2 2 2 3 1 4 0 2

• Terms: Term 1, Term 2, Term 3, Term 4, Term 5, Term 6, Term 7, Term 8


• Threshold: 10
Class Assignments:
1. Class 1:
◦ Term 1, as the first term processed, starts Class 1.
◦ Term 3 is compared to Class 1's centroid and added to Class 1 because
its similarity meets the threshold.
◦ Term 4 is compared to Class 1's centroid and also added to Class 1.
◦ Class 1 = Term 1, Term 3, Term 4
2. Class 2:
◦ Term 2 is compared to Class 1's centroid and doesn't meet the threshold,
so it starts Class 2; Term 6 and Term 8 are later added when their
similarity to the Class 2 centroid exceeds that of any other class.
◦ Class 2 = Term 2, Term 6, Term 8
3. Class 3:
◦ Term 5 does not meet the similarity threshold with any of the existing
classes, so it starts a new class.
◦ Class 3 = Term 5
4. Class 4:
◦ Term 7 doesn't meet the threshold for any class and is assigned to its
own new class.
◦ Class 4 = Term 7
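The one pass procedure can be sketched as below, using the columns of the Figure 6.3 matrix as the term vectors, dot-product similarity, and the threshold of 10; the helper names are ours. With these assumptions the code reproduces the class assignments above:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def one_pass(terms, threshold):
    """Assign each term (in order) to the most similar existing class whose
    centroid similarity meets the threshold; otherwise start a new class."""
    classes = []   # each class holds the (0-based) indices of its terms
    members = []   # parallel list of the term vectors in each class
    for i, term in enumerate(terms):
        best, best_sim = None, threshold
        for c, vecs in enumerate(members):
            sim = dot(term, centroid(vecs))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:   # no centroid met the threshold: start a new class
            classes.append([i])
            members.append([term])
        else:              # join the most similar qualifying class
            classes[best].append(i)
            members[best].append(term)
    return classes

# Columns of the item/term matrix in Figure 6.3 (Term 1 ... Term 8)
terms = [(0, 3, 3, 0, 2), (4, 1, 0, 1, 2), (0, 4, 0, 0, 2), (0, 3, 0, 3, 3),
         (0, 1, 3, 0, 1), (2, 2, 0, 0, 4), (1, 0, 3, 2, 0), (3, 1, 0, 0, 2)]
clusters = one_pass(terms, threshold=10)
print(clusters)   # → [[0, 2, 3], [1, 5, 7], [4], [6]], i.e. {T1,T3,T4}, {T2,T6,T8}, {T5}, {T7}
```

Note how the result depends on processing order and on the running centroids: Term 6 is similar enough to Class 1 (16) but joins Class 2 because its similarity there (18) is higher.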

3. Centroid Calculations:
The centroid values for the classes are updated as terms are added:
1. Class 1 (Term 1, Term 3):
Centroid (Class 1) = (Term 1 + Term 3) / 2
2. Class 1 (Term 1, Term 3, Term 4):
Centroid (Class 1) = (Term 1 + Term 3 + Term 4) / 3
3. Class 2 (Term 2, Term 6):
Centroid (Class 2) = (Term 2 + Term 6) / 2
4. Strengths and Limitations
Strengths:
1. Minimal Computational Overhead:
◦ Requires only one pass through the terms, making it efficient with
complexity O(n).
2. Simplicity:
◦ Easy to implement and understand, with no need for complex iterative
calculations.
Limitations:
1. Non-optimal Clusters:
◦ The clusters may not be ideal because the order of term processing
affects the resulting clusters.
◦ Terms that should be in the same cluster may end up in different clusters
because centroids are averaged as new terms are added.
2. Order Sensitivity:
◦ The final clusters can vary depending on the order in which the terms are
processed.
3. Threshold Dependency:
◦ The threshold value heavily influences cluster formation and must be
carefully chosen to ensure meaningful clusters.

5. Summary
The One Pass Assignments method is a fast and efficient clustering technique with
minimal computational complexity. However, it may not generate the most optimal
clusters, as the clustering outcome depends on the order in which terms are
processed. The simplicity of the method makes it useful for large datasets where
computational efficiency is a priority, but it is less suitable when precise clustering
results are needed.

Item Clustering
Item clustering is the process of grouping items based on their similarity. It is similar
to term clustering, as both techniques aim to categorize elements based on shared
characteristics. The key difference lies in the focus: item clustering groups
documents or items, while term clustering organizes terms used within those items.

1. Manual Item Clustering:


• Traditional Manual Clustering:
In traditional settings (such as libraries or filing systems), items are manually
categorized into classes based on their content. A person reads the item and
assigns it to one or more relevant categories. Items are typically placed in a
primary category but may also appear in other categories defined by index
terms.
• Electronic Clustering:
With the rise of electronic records, item clustering can be automated, using
algorithms to calculate similarity between items based on shared terms. This
removes the need for manual assignment of items to categories, which is
especially useful when dealing with large collections of items.

2. Similarity between Items:


• Similarity between items is determined by the terms they share. In other words,
items are similar if they have overlapping terms.
• The similarity function is calculated by comparing the rows in the item-term
matrix (each row representing an item with its associated terms).


Formula: Similarity(Itemi, Itemj) = Σk (Termi,k) ⋅ (Termj,k), where the sum
runs over all terms k.

• Item-Item Matrix:
The first step in item clustering is to create an Item-Item matrix, which
calculates the similarity between all pairs of items based on shared terms. For
example, if two items share a large number of terms, their similarity value will
be high.

3. Example of Item Clustering:


1. Item-Item Similarity Matrix (Figure 6.9):
This matrix holds the similarity score for every pair of items, computed by
comparing the rows of the item/term matrix. The diagonal is left blank, since
an item is not compared with itself.
        Item 1  Item 2  Item 3  Item 4  Item 5
Item 1    -       11      3       6      22
Item 2   11       -      12      10      36
Item 3    3      12       -       6       9
Item 4    6      10       6       -      11
Item 5   22      36       9      11       -

2. Item Relationship Matrix (Figure 6.10):
After calculating similarities, an Item Relationship matrix is generated by
applying a threshold (here, 10) to the similarity scores: an entry is 1 if the
similarity between the two items meets the threshold and 0 otherwise.
        Item 1  Item 2  Item 3  Item 4  Item 5
Item 1    -       1       0       0       1
Item 2    1       -       1       1       1
Item 3    0       1       -       0       0
Item 4    0       1       0       -       1
Item 5    1       1       0       1       -
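Both matrices can be reproduced from the Figure 6.3 item/term matrix: the similarity matrix is the row-by-row dot product, and the relationship matrix applies a threshold (10, judging by the values above) to it. A minimal sketch:

```python
# Rows of the item/term matrix in Figure 6.3 (Item 1 ... Item 5)
items = [(0, 4, 0, 0, 0, 2, 1, 3),
         (3, 1, 4, 3, 1, 2, 0, 1),
         (3, 0, 0, 0, 3, 0, 3, 0),
         (0, 1, 0, 3, 0, 0, 2, 0),
         (2, 2, 2, 3, 1, 4, 0, 2)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

n = len(items)
# Item-item similarity matrix (Figure 6.9); the diagonal is not used
sim = [[dot(items[i], items[j]) for j in range(n)] for i in range(n)]

# Binary item relationship matrix (Figure 6.10), threshold 10
threshold = 10
rel = [[1 if i != j and sim[i][j] >= threshold else 0 for j in range(n)]
       for i in range(n)]

print(sim[0][4], rel[0][4])   # → 22 1: Items 1 and 5 are strongly related
```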

4. Clustering Algorithms:
Several clustering techniques can be applied to the Item Relationship matrix to group
items based on similarity. Below are a few algorithms applied to the same item data:
Clique Algorithm:
• This method generates the following clusters:
◦ Class 1: Item 1, Item 2, Item 5
◦ Class 2: Item 2, Item 3
◦ Class 3: Item 2, Item 4, Item 5

Single Link Algorithm:


• This method connects all items that are similar, resulting in one large cluster:
◦ Class 1: Item 1, Item 2, Item 5, Item 3, Item 4
Star Technique:
• This method starts by selecting the lowest non-assigned item and progressively
assigns items to clusters:
◦ Class 1: Item 1, Item 2, Item 5
◦ Class 2: Item 3, Item 2
◦ Class 3: Item 4, Item 2, Item 5
String Technique:
• This method iterates until all items are assigned to clusters:
◦ Class 1: Item 1, Item 2, Item 3
◦ Class 2: Item 4, Item 5
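The single link result can be verified mechanically: treating each 1 in the relationship matrix as an edge, the single link clusters are the connected components of that graph. A sketch (the matrix is Figure 6.10 with the diagonal set to 0):

```python
def single_link(rel):
    """Single link clustering = connected components of the similarity graph."""
    n = len(rel)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:   # depth-first search from an unvisited item
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(u for u in range(n) if rel[v][u] and u not in comp)
        seen |= comp
        clusters.append(sorted(comp))
    return clusters

rel = [[0, 1, 0, 0, 1],
       [1, 0, 1, 1, 1],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 1],
       [1, 1, 0, 1, 0]]
print(single_link(rel))   # → [[0, 1, 2, 3, 4]]: one cluster of all five items
```

This is exactly why single link tends to produce a few large clusters: one chain of pairwise links is enough to pull everything together.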

5. Challenges in Item Clustering:


• Homographs and Ambiguity:
In the vocabulary domain, homographs (words with the same spelling but
different meanings) can create ambiguities. Similarly, in item clustering,
documents containing multiple topics (e.g., "Politics in America" and
"Economics in Mexico") can lead to incorrect clustering if the content is not
pre-coordinated.

6. Item Clustering with Existing Clusters:


Similar to term clustering, item clustering can begin with pre-defined clusters. The
process involves calculating centroids for each initial cluster and reassigning items
based on their similarity to these centroids.
• Centroid Calculation:
The centroid for each class is the average of the term vectors for the items in
that class.
• Recalculation of Similarities:
After the initial assignments, similarities between items and centroids are
recalculated. Items are then reassigned to the class whose centroid they are
most similar to. The process continues until no reassignment is needed.
Example of Initial Clusters:

• Class 1: Item 1, Item 3


• Class 2: Item 2, Item 4
The centroids for these initial clusters would be computed, and then the items would
be reassigned based on their similarity to the centroids. If no further reassignment
occurs, the process is considered to have stabilized.
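Using the Figure 6.3 rows as item vectors and dot-product similarity (our assumptions), the centroids of the two initial classes and the assignment of the remaining item can be checked directly:

```python
def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Rows of the item/term matrix in Figure 6.3
item1 = (0, 4, 0, 0, 0, 2, 1, 3)
item2 = (3, 1, 4, 3, 1, 2, 0, 1)
item3 = (3, 0, 0, 0, 3, 0, 3, 0)
item4 = (0, 1, 0, 3, 0, 0, 2, 0)
item5 = (2, 2, 2, 3, 1, 4, 0, 2)

c1 = centroid([item1, item3])   # centroid of initial Class 1
c2 = centroid([item2, item4])   # centroid of initial Class 2

# The unassigned item joins whichever centroid it is most similar to
print(dot(item5, c1), dot(item5, c2))   # → 15.5 23.5, so Item 5 joins Class 2
```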

7. Use of N-grams for Clustering:


Instead of using words, the Acquaintance system uses n-grams (a sequence of n
consecutive terms) to cluster items. This method allows for more nuanced clustering
and can also handle multiple languages within the same dataset.
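A sliding window illustrates the idea; here the units are characters, but the same windowing applies to sequences of terms. The function name is illustrative:

```python
def ngrams(sequence, n):
    """Return all length-n sliding windows over the sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

print(ngrams("cluster", 3))   # → ['clu', 'lus', 'ust', 'ste', 'ter']
```

Because n-grams require no language-specific tokenization, the same routine works across languages, which is what makes the approach attractive for multilingual datasets.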

Summary:
Item clustering is a vital process for grouping similar items in large datasets,
particularly for digital libraries, document retrieval, or information organization.
Various clustering algorithms can be applied, and while techniques like Clique,
Single Link, Star, and String each have their strengths, challenges like ambiguity
and multiple topics in items must be addressed. The use of n-grams and pre-existing
clusters can enhance the accuracy of clustering results.

Hierarchical Clustering for Items:


Hierarchical clustering is a key method used in Information Retrieval (IR) to organize
items into a structure that can be effectively navigated and searched. This section
primarily focuses on Hierarchical Agglomerative Clustering Methods (HACM),
which is a bottom-up approach to clustering.

1. Agglomerative vs. Divisive Clustering:


• Agglomerative (Bottom-Up):
This approach begins with individual, unclustered items and progressively
merges them based on their similarities. In other words, clusters are created by
merging pairs of items or clusters iteratively. Each iteration combines the most
similar clusters.
• Divisive (Top-Down):
The divisive approach starts with a single, large cluster and splits it into
smaller clusters, progressively dividing the data into finer groups.

2. Objectives of Hierarchical Clustering:


The purpose of creating a hierarchy of clusters includes:
1. Reducing Search Overhead:
A hierarchical structure enables more ef cient searches by focusing on the
centroids of clusters. A top-down search through the hierarchy trims branches
that are not relevant, improving the retrieval process.
2. Visual Representation of the Information Space:
Hierarchical clustering can be represented visually (e.g., through a
dendrogram), making it easier for users to navigate through the database and
explore different clusters and items within them.
3. Expanding Retrieval of Relevant Items:
Once an item of interest is identified, users can easily browse through related
clusters or drill down into more specific child clusters for greater specificity.

3. Dendrogram Representation:
A dendrogram is a tree-like diagram used to represent the hierarchy of clusters. It
allows users to visually determine which clusters are related and choose alternate
paths for browsing. The diagram also indicates the strength of relationships
(through line styles, such as dashed lines for weaker connections).

4. Similarity and Dissimilarity Measures:


Most HACM approaches rely on a similarity function to merge or split clusters. The
Lance-Williams dissimilarity update formula is commonly used to calculate the
dissimilarity between clusters as they are merged. The key is to measure the distance
or dissimilarity between clusters at each step.

5. Ward’s Method:
• Ward’s Method (Ward-63) is a specific technique used in HACM. It
minimizes the sum of squared Euclidean distances between points (usually
centroids of clusters) in order to form clusters.
• At each step, Ward’s method calculates the increase in variance I that each
candidate merge would cause and chooses the merge with the minimum value:
I = (n1 × n2 / (n1 + n2)) × distance
where n1 and n2 are the numbers of items in the two clusters being merged
and distance represents the squared Euclidean distance between their centroids.
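Under the usual formulation of Ward's criterion, the increase in variance caused by merging two clusters is (n1 × n2)/(n1 + n2) times the squared Euclidean distance between their centroids. A sketch with illustrative 2-D points:

```python
def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def ward_cost(cluster_a, cluster_b):
    """Increase in total within-cluster variance if the two clusters merge."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    na, nb = len(cluster_a), len(cluster_b)
    return (na * nb) / (na + nb) * sq_dist

a = [(0.0, 0.0), (0.0, 2.0)]   # centroid (0, 1)
b = [(4.0, 0.0)]               # centroid (4, 0)
print(ward_cost(a, b))         # (2·1/3)·(16 + 1) = 34/3 ≈ 11.33
```

An agglomerative loop would evaluate `ward_cost` for every pair of current clusters and merge the cheapest pair at each iteration.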

6. Hierarchical Clustering for Items:


Hierarchical clustering can be applied to items in the same way it is applied to terms.
The centroids of clusters can be calculated using vectors of terms (rows in the
Item/Term matrix). Once initial clusters are formed, centroids are recalculated, and
the process continues until a desired number of clusters is reached.
• Centroids for Clusters:
After each iteration, a centroid (representing the average or typical item in the
cluster) is recalculated. This helps improve computational efficiency by
focusing on representative points rather than the entire set of items.
• Scatter/Gather System:
An example of hierarchical item clustering is the Scatter/Gather system
(Hearst-96), which works by generating an initial set of clusters, then
progressively re-clustering those clusters into higher-level clusters, continuing
this process until individual items remain at the lowest level.

7. Monolithic vs. Polythetic Clusters:


• Monolithic Clusters:
A monolithic cluster is one where all items within the cluster share a specific,
well-defined attribute. These clusters are easy for users to interpret, as they
focus on a single attribute (e.g., a specific topic). For example, Yahoo uses a
monolithic cluster structure.
• Polythetic Clusters:
In contrast, polythetic clusters are formed using multiple attributes. They are
more complex and may require additional descriptive information to help users
understand the main themes of the cluster.

8. Hierarchical Concept Construction:


• Sanderson and Croft’s Methodology (Sanderson-99):
In building a concept hierarchy, they focused on extracting terms from
documents and representing them in a hierarchical structure. Their
methodology involves:
◦ Parent-Child Relationships: Parent terms represent more general
concepts, while child terms are related subtopics.
◦ Directed Acyclic Graph (DAG): Relationships between terms can be
represented as a DAG, allowing for multiple meanings and relationships
(unlike a strict hierarchical structure).

• Subsumption Testing:
One key method for identifying hierarchical relationships is subsumption,
where a parent term subsumes a child term if the documents containing the
child term are a large subset of the documents containing the parent term
(typically 80%).
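The subsumption test can be phrased directly over the sets of documents containing each term; the 80% cutoff follows the text, while the terms and document sets below are made up:

```python
def subsumes(parent_docs, child_docs, cutoff=0.8):
    """Parent subsumes child if most documents containing the child
    term also contain the parent term."""
    if not child_docs:
        return False
    overlap = len(parent_docs & child_docs) / len(child_docs)
    return overlap >= cutoff

vehicle = {1, 2, 3, 4, 5, 6}   # documents mentioning "vehicle"
sedan = {2, 3, 4, 5}           # documents mentioning "sedan"
print(subsumes(vehicle, sedan))   # → True: every sedan document mentions vehicle
```

The test is deliberately asymmetric: "vehicle" subsumes "sedan", but not the reverse, since only 4 of the 6 vehicle documents mention sedans.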
• WordNet Representation:
Similar to WordNet (Miller-95), which uses relationships like synonyms,
antonyms, hyponyms (subtypes), and meronyms (parts of), Sanderson and
Croft propose representing term hierarchies with similar relationships.

9. Techniques for Building Hierarchies:


Several methods can be used for building hierarchical relationships:
• Grefenstette’s Context Similarity:
Similarity between terms based on the context in which they appear is used to
identify synonyms.
• Key Phrases and Relationships:
Key phrases (e.g., "such as", "and other") help identify hierarchical
relationships like hyponyms and hypernyms.
• Cohesion Metrics:
Techniques using cohesion statistics measure the degree of association
between terms to help build clusters with stronger relationships.

10. Conclusion:
Hierarchical clustering plays an important role in organizing documents or terms into
manageable clusters that can be more easily navigated and searched. By reducing the
overhead of search, providing a visual representation (e.g., dendrograms), and
expanding the retrieval of relevant items, hierarchical clustering methods enhance the
retrieval process. Various algorithms like Ward’s Method and HACM can be
applied to items and terms, with the goal of grouping related elements and
simplifying the information retrieval process.
