IRS Unit 3 by Krishna
AUTOMATIC INDEXING
Indexing Process
1. Statistical Strategies:
◦ Use the frequency of words/phrases in documents.
◦ Examples:
▪ Static Methods: Store simple statistics like word counts.
▪ Probabilistic Indexing: Estimate how likely a document is
relevant to a query.
▪ Bayesian & Vector Models: Rank documents by their con dence
in relevance.
▪ Neural Networks: Learn patterns and concepts dynamically.
Key Observations
Statistical Indexing
Probabilistic Weighting
Key Concepts:
• Data Limitations: Probabilistic models may not always be very accurate due
to a lack of precise data, which leads to the need for simplifying assumptions.
Even with this, probabilistic models remain competitive in performance.
Formula: log(O(R | Q)) = −5.138 + ∑ (zj + 5.138), summed over the query terms
Vector Weighting
The Vector Weighting model represents items in an information retrieval system as vectors. Each vector consists of values corresponding to specific processing tokens (typically terms) in the document. The main ideas are the following:
A binary vector (e.g., 1 1 1 0 1 0) records only the presence or absence of each term. With vector length normalization, each vector has a maximum length of 1. However, this method can distort the importance of terms when a document contains multiple topics.
4. Pivot Point Normalization:
◦ To address the over-penalization of longer documents, pivot point
normalization was introduced. This normalization adjusts the weight
based on document length and aims to reduce the bias that favors shorter
documents.
◦ The "pivot point" is de ned as the length at which the probability of
relevance equals the probability of retrieval. Using this, a slope is
generated, and the normalization is adjusted accordingly.
5. Final Algorithm:
◦ The final weighting algorithm involves using the logarithmic term
frequency divided by a pivoted normalization function to adjust for
document length.
◦ This approach has been shown to outperform simple term frequency
(TF/MAX(TF)) or vector length normalization in certain datasets, such
as TREC data.
6. Practical Benefits:
◦ The normalization process, when properly adjusted, compensates for
biases introduced by document length and term frequency variations,
providing a more balanced approach to weighting terms in information
retrieval.
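To make the pivoted scheme concrete, here is a minimal Python sketch of the final algorithm described above: logarithmic term frequency divided by a pivoted normalization factor. The slope value and the use of average document length as the pivot are illustrative assumptions, not values prescribed by the text.

```python
import math

def pivoted_tf_weight(tf, doc_length, pivot, slope=0.25):
    """Logarithmic term frequency divided by a pivoted normalization
    factor, so longer documents are not over-penalized."""
    if tf <= 0:
        return 0.0
    log_tf = 1.0 + math.log(tf)                        # dampened term frequency
    norm = (1.0 - slope) * pivot + slope * doc_length  # pivoted length factor
    return log_tf / norm

# Example: the same raw tf in a short and a long document.
# pivot would typically be the average document length in the collection.
print(pivoted_tf_weight(tf=5, doc_length=100, pivot=300))
print(pivoted_tf_weight(tf=5, doc_length=900, pivot=300))
```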
Summary
This section explains the different approaches to calculating the term frequency (TF)
and adjusting for variations in document length and term occurrence across a dataset.
The goal is to derive a weight for each term that accurately reflects its importance
within the document and the entire database. By normalizing these values using
techniques like logarithmic functions and pivot point adjustments, the algorithm
ensures that document relevance is determined more effectively.
The term IDF gives us a measure of how important the term j is across all the documents in the database; it is typically computed as IDFj = log2(n) − log2(DFj) = log2(n / DFj). The IDF increases when the term appears in fewer documents, and decreases when the term appears in many documents.
1. Logarithmic Component:
◦ log2(n): This is the logarithmic factor based on the total number of
documents. A higher number of documents means the term will be less
important.
◦ log2(DFj ): This is the logarithmic factor based on the number of
documents containing the term. If the term is found in many documents
(higher DF), the weight will be reduced.
• Low IDF: If a term appears in many documents (high DF), the IDF value will
be low. This means the term is not very unique or distinctive across the entire
database. The weight given to such common terms will be reduced.
• High IDF: If a term appears in only a few documents (low DF), the IDF value
will be high. This means the term is more unique and could be more
informative, so it is given a higher weight in the document.
Practical Example:
Let’s say we have 2048 documents in the database and the term "oil" appears in 128 documents, the term "Mexico" appears in 16 documents, and the term "refinery" appears in 1024 documents.
This shows that "Mexico" has the highest IDF and will receive the highest weight compared to "oil" and "refinery."
Conclusion:
• IDF helps to distinguish between common and rare terms in the database.
• Terms that are rare across documents are given more importance, while
common terms that appear frequently across the documents are down-
weighted.
• This is why the formula uses IDF to adjust the weight of a term based on its
rarity in the dataset, giving more meaningful weight to terms that are less
common and hence more informative.
Signal Weighting
Signal Weighting builds upon the concept of inverse document frequency (IDF) by
addressing a limitation: while IDF adjusts the weight of a term based on how many
documents contain it, it does not consider how frequently the term appears within
those documents. The term's frequency distribution across documents can
significantly affect the ability to rank items accurately.
Key Points:
1. Limitation of IDF:
◦ IDF alone considers only how many documents a term appears in but not
how it is distributed within those documents.
◦ For example, if two terms, “SAW” and “DRILL,” both appear in five items and have the same total frequency (50 occurrences), IDF would assign them equal importance.
◦ However, “SAW” is evenly distributed across the items, whereas
“DRILL” has a highly uneven distribution. This uneven distribution
might indicate that "DRILL" is more relevant to certain items.
2. Impact of Frequency Distribution:
◦ Uniform distribution of a term (e.g., “SAW”) across items provides less
insight into the relative importance of the term in any specific item.
◦ A term with a non-uniform distribution (e.g., “DRILL”) may be more
indicative of item relevance and should be weighted higher.
Distribution of SAW and DRILL across items:

Item    SAW   DRILL
A        10     2
B        10     2
C        10    18
D        10    10
E        10    18
3. Theoretical Foundation:
◦ Signal Weighting leverages Shannon's Information Theory, where the
information content of an event is inversely proportional to its
probability of occurrence:
INFORMATION = −log2(p)
4. Weight Calculation:
◦ Signal Weighting accounts for the term's distribution within the
documents by using a formula derived from average information value
(AVE_INFO):
AVE_INFO = −∑ pi log2(pi), summed over i = 1 … n

where pi is the relative frequency of the term in the ith document.
◦ The Signal Weight is calculated as the inverse of AVE_INFO, giving higher weights to terms with non-uniform distributions. This ensures that terms like “DRILL” receive higher importance than evenly distributed terms like “SAW” (a computational sketch follows this list).
5. Example from Figure 5.5:
◦ Terms "SAW" and "DRILL" both appear 50 times across ve
documents, but their distributions differ:
▪ "SAW": Equal distribution across all documents (e.g., 10 times in
each).
▪ "DRILL": Uneven distribution (e.g., higher in some documents
like 18 in C and E).
◦ The Signal Weighting formula assigns higher weight to "DRILL" due to
its non-uniform distribution.
6. Practical Use:
◦ Signal Weighting can work alone or in combination with IDF and other
algorithms to improve precision (ranking relevant items higher).
◦ However, its practical implementation requires additional data and
computations, and its effectiveness compared to simpler methods has not
been conclusively demonstrated.
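A small sketch of the calculation on the SAW/DRILL data, computing AVE_INFO from the distribution table. The final signal weight is shown here as log2(TOTF) − AVE_INFO, one common formulation; treat that exact form as an assumption rather than the text's definition.

```python
import math

def ave_info(freqs):
    """AVE_INFO = -sum(p_i * log2(p_i)), where p_i is the term's
    relative frequency in the ith item."""
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total) for f in freqs if f > 0)

saw   = [10, 10, 10, 10, 10]  # uniform distribution across items A-E
drill = [2, 2, 18, 10, 18]    # uneven distribution across items A-E

for name, freqs in [("SAW", saw), ("DRILL", drill)]:
    ai = ave_info(freqs)
    signal = math.log2(sum(freqs)) - ai  # assumed form of the signal weight
    print(f"{name}: AVE_INFO={ai:.3f}, signal={signal:.3f}")
# SAW's uniform spread maximizes AVE_INFO, so its signal weight is lower;
# DRILL's skewed spread yields the higher signal weight.
```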
Discrimination value
Discrimination Value is a term-weighting method that measures how effectively a
term helps distinguish between different items in a database. The goal is to prioritize
terms that improve search accuracy by making items more distinguishable.
Concept
If all items in a database appear similar, it becomes difficult to identify relevant items.
The discrimination value evaluates how much a term contributes to differentiating
items.
Formula
The discrimination value for a term i is calculated as:
DISCRIMi = AVESIMi − AVESIM
• AVESIM: The average similarity between all items in the database.
• AVESIMi: The average similarity between items after removing term i from all items.
Interpretation of Results
• Positive DISCRIMi :
If the value is positive, removing term i increases the similarity between items.
This means that term i helps differentiate items and should be given higher
weight.
• DISCRIMi ≈ 0:
A value near zero indicates that removing or keeping the term does not impact
item similarity. The term has little effect on search relevance.
• Negative DISCRIMi :
A negative value suggests that removing the term decreases the similarity
between items. This means the term actually causes items to appear more
similar and is not helpful for discrimination.
Purpose
The discrimination value helps assign higher importance to terms that make items
more distinct, improving the precision of search results.
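A minimal sketch of the discrimination value, assuming cosine similarity as the item similarity measure (the text does not fix a particular similarity function). It uses a small example item-term matrix — the same five-item, eight-term data used later in the clustering examples.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ave_sim(items):
    """Average pairwise similarity over all items."""
    pairs = list(combinations(items, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def discrim(items, i):
    """AVESIMi - AVESIM: positive means term i helps discriminate."""
    without_i = [[w for k, w in enumerate(item) if k != i] for item in items]
    return ave_sim(without_i) - ave_sim(items)

# Item-term matrix (rows = items, columns = terms).
items = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]
for i in range(8):
    print(f"Term {i + 1}: DISCRIM = {discrim(items, i):+.4f}")
```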
Additional Concerns
• Changing Query Results Over Time:
As new terms are introduced and their frequency stabilizes, search results can
change, causing inconsistency in search rankings.
• Time-Based Partitioning:
Databases may be split by time periods (e.g., yearly), allowing users to search
specific time frames. However, this complicates how term weights are
aggregated across partitions and how results are merged.
• Integration Across Databases:
Ideally, users should be able to search multiple databases with different
weighting algorithms and receive a unified, ranked result set. This integration
challenge is addressed in more detail in Chapter 7.
3. Multi-Topic Items
• Issue: When an item discusses multiple distinct topics, the vector model
cannot differentiate between them.
• Example:
If an item discusses multiple energy sources ("oil" and "coal") in different
geographic locations ("Mexico" and "Pennsylvania"), the model fails to
partition these discussions.
◦ Instead, it treats the item as a whole, leading to imprecise search results
when queries involve overlapping topics.
• Solution:
By enabling searches in subsets of items, the system can focus on sections
discussing specific topics, improving precision.
◦ Benefit: This addresses the multi-topic problem by isolating and
evaluating relevant subsets of the text.
Conclusion
While the vector model provides a mathematical framework for representing
documents, it struggles with semantic associations, positional information, and multi-
topic distinctions. Addressing these limitations often requires enhancements, such as
incorporating positional data or enabling searches in subsets of items, which are
discussed further below.
Bayesian Model
The Bayesian model is introduced as a way to overcome some of the limitations of
the vector model by using conditional probabilities to determine relevance and
assign weights to processing tokens in a database. Below is a breakdown of the
concepts involved in the Bayesian approach.
• Process Overview:
◦ A Bayesian network is used to determine which processing tokens (also
called "topics") should represent the content of the document and how
relevant they are.
◦ The figure shows that for a particular document, P(Topic i) represents
the relevance of topic i in the document, and P(Token j) represents a
statistic related to the occurrence of processing token j in the document.
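A minimal sketch of the simplified ("independence") version of this idea, i.e., naive Bayes scoring of topics from tokens. All topics, tokens, and probabilities below are invented for illustration; a real system would estimate them from training data.

```python
import math

# Hypothetical priors P(Topic) and likelihoods P(Token | Topic).
priors = {"energy": 0.5, "politics": 0.5}
likelihood = {
    "energy":   {"oil": 0.30, "coal": 0.20, "election": 0.05},
    "politics": {"oil": 0.05, "coal": 0.05, "election": 0.40},
}

def topic_scores(tokens):
    """Log-space naive Bayes: assumes tokens are independent given the
    topic -- the simplification of the full Bayesian network noted below."""
    scores = {}
    for topic, prior in priors.items():
        s = math.log(prior)
        for tok in tokens:
            s += math.log(likelihood[topic].get(tok, 1e-6))  # floor for unseen tokens
        scores[topic] = s
    return scores

print(topic_scores(["oil", "coal"]))      # favors "energy"
print(topic_scores(["election", "oil"]))  # mixed evidence
```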
5. Summary
The Bayesian model introduces an advanced way to represent document relevance
and token weighting by using probabilities and dependencies. It offers a more
nuanced method than the vector model, especially in handling complex dependencies.
However, practical implementations often use simplifications (like ignoring dependencies) to balance accuracy with computational efficiency. Extending the
model to account for these dependencies can enhance precision but at the cost of
model complexity.
Natural Language
• Natural Language Processing Goal: Enhance indexing precision by utilizing
semantic information in addition to statistical information.
• Semantic Information Extraction: Extracted from language processing, going
beyond treating words as independent entities.
• Output of Natural Language Processing: Generates phrases as indexes and
potentially thematic representations of events.
6. Temporal Information: Timing is particularly important in news articles, where the understanding of timing can change the meaning of the content.
7. News Schema Components: The system uses a structured model (like a News
Schema) to identify different components within a news article, such as
"Circumstance," "Consequence," "History," and others. This structure helps
evaluate which parts of a document should be emphasized based on the user's
query.
8. Semantic Processing: Beyond identifying topics, the system classifies the intent behind the terms and identifies the relationships between different
concepts. For example, in the context of national elections and guerrilla
warfare, the system needs to identify whether elections were caused by warfare
or vice versa. This relationship is essential for accurately understanding the
context.
9. Relationships and Weights: The relationships between concepts are
visualized as triples (Concept 1, Relationship, Concept 2), and each
relationship is assigned a weight based on its relevance and how it is expressed
in the text (e.g., using active vs. passive verbs).
10. Data Structures: Additional data structures store the semantic information,
which can be used for natural language-based search queries or to respond to
specific user requests for deeper information about the relationships between
terms.
In Summary, Natural language processing enhances indexing accuracy by
identifying semantic relationships between concepts. Systems like DR-LINK use
statistical methods, predefined subject codes, and discourse analysis to categorize text
and assign weights to different components. These components, along with inter-
concept relationships, are stored for use in natural language-based search queries.
Concept Indexing
Concept Indexing is a technique used in natural language processing (NLP) to map specific terms in a document to higher-level concepts rather than individual words or terms. This approach is aimed at creating more abstract and semantically meaningful representations of documents, making it easier to categorize and retrieve relevant information based on broader ideas rather than just specific terms.
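As a rough illustration, a concept index can be sketched as a mapping from terms to concept classes; the map and counts below are hypothetical, and real systems derive such mappings from training corpora or curated resources.

```python
# Hypothetical term-to-concept map; a real system would learn or curate this.
concept_map = {
    "oil": "energy_source", "coal": "energy_source", "gas": "energy_source",
    "Mexico": "location", "Pennsylvania": "location",
}

def concept_index(tokens):
    """Accumulate evidence for each higher-level concept in an item."""
    counts = {}
    for tok in tokens:
        concept = concept_map.get(tok)
        if concept:
            counts[concept] = counts.get(concept, 0) + 1
    return counts

print(concept_index(["oil", "coal", "Mexico", "refinery"]))
# {'energy_source': 2, 'location': 1}
```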
HyperText Linkages
Hypertext Linkages refer to the connections between different items or pieces of
information on the Internet (or in hypertext systems) that allow users to navigate
between them, creating a network of interlinked documents. In this text, hypertext
linkages are discussed in the context of information retrieval (IR) systems, and how
these linkages can enhance or complicate the process of indexing and searching for
information.
Conclusion
Hypertext linkages represent a largely untapped resource in IR systems. By
incorporating hyperlinks into indexing algorithms, search systems can enhance
relevance and provide a richer, more interconnected retrieval experience. This would
elevate hyperlinks from navigational aids to integral components of concept-based
information retrieval.
Clustering techniques are discussed for index terms and items in information retrieval
systems. These techniques aim to improve recall and retrieve similar items, but
require careful use to avoid negative impacts.
Introduction to Clustering
Clustering is the process of grouping similar objects (e.g., words, documents, or
terms) into classes or categories to enhance the organization and retrieval of
information. This approach has been widely applied in libraries, thesauri creation, and
modern information systems.
1. Purpose of Clustering
• Historical Context: Initially used in libraries to group items on the same
subject, enabling easier information retrieval. This evolved into indexing
schemes and standards for electronic indexes.
• Definition: Clustering groups similar objects under a general title (class) and
allows linkages between clusters.
• Applications: Includes thesauri development, document organization, and
query expansion in information retrieval.
2. Steps in Clustering
a. Defining the Domain:
• Identify the scope of items to be clustered (e.g., medical terms or a specific
subset of a database).
• Reduces erroneous data and ensures a focused clustering process.
b. Determining Attributes:
• Decide which features of the objects will be used to assess similarity.
• For documents, focus on specific zones (e.g., title, abstract) to avoid irrelevant
associations.
c. Measuring Relationships:
• Establish similarity functions or metrics to determine the strength of
relationships.
• For thesauri, this involves identifying synonyms and other term relationships.
For documents, word co-occurrence can define similarity.
d. Applying Clustering Algorithms:
• Use algorithms to assign objects to appropriate clusters based on their
relationships.
• Results in well-organized classes with meaningful groupings.
4. Clustering in Thesauri
• Relationships Between Words:
◦ Equivalence: Identifying synonyms.
◦ Hierarchical: Grouping under a general term (e.g., "computer" with
subclasses like "microprocessor" and "Pentium").
◦ Non-Hierarchical: Relationships like object-attribute pairs (e.g.,
"employee" and "job title").
• Word Coordination: Determines if clustering is based on phrases or
individual terms.
• Homograph Resolution: Distinguishes between words with multiple meanings (e.g., “field” as agriculture or electronics).
5. Challenges in Clustering
• Ambiguity of Language: Inherent imprecision in clustering algorithms and
natural language ambiguities can complicate the process.
• Recall vs. Precision: While clustering improves recall (finding more relevant
items), it can reduce precision (retrieving irrelevant items).
• Algorithm Selection: The success of clustering depends on choosing effective
similarity measures and algorithms.
Conclusion
Clustering, a technique used in information organization, groups similar objects
(terms or items) into classes under general titles. The process involves defining the
domain, determining attributes, assessing relationship strength, and applying an
algorithm to assign items to classes. While clustering improves recall, it may decrease
precision, necessitating careful selection of similarity measures and algorithms to
ensure manageable retrieval results.
Thesaurus Generation
• Thesaurus Generation Methods: Hand-crafted, co-occurrence, and header-modifier based.
• Co-occurrence Thesaurus: Terms are grouped based on their co-occurrence
patterns in a corpus.
• Header-modifier Based Thesaurus: Terms are grouped based on their linguistic
relationships, such as co-occurrence in similar grammatical contexts.
1. Manual Clustering
Manual clustering is a structured, human-driven process that aligns with the broader
steps of clustering outlined in Section 6.1. It focuses on carefully curating terms for a
thesaurus, employing human judgment and contextual understanding to create
meaningful groupings. Here's a detailed breakdown:
2. Term Selection
• Art of Selection:
◦ Requires expertise to identify meaningful terms while avoiding
irrelevant or overly common words.
◦ Homographs and ambiguous terms are resolved using tools like KWIC
and KWAC for context.
3. Clustering Terms
• Word Relationship Guidelines:
◦ Terms are grouped based on predefined relationships (e.g., synonyms,
hierarchical structures).
◦ Strength of relationships is interpreted by the human editor.
• Judgment-Based Process:
◦ The human analyst plays a key role in assessing relationships and
organizing terms into clusters.
4. Quality Assurance
• Review Process:
◦ The thesaurus undergoes multiple quality checks by additional editors.
◦ Guidelines, such as those for term relationships and class sizes, ensure
precision and usability.
Conclusion
Manual clustering relies heavily on human expertise, contextual understanding, and
judgment. Tools like concordances, KWOC, KWIC, and KWAC provide support, but
the effectiveness of the process depends on careful term selection, relationship
interpretation, and iterative quality assurance.
Complete Term Relation Method
1. Core Concept
• Vector Model Representation:
◦ A matrix represents the items (rows) and terms (columns).
◦ The values in the matrix indicate the strength of association between a
term and an item.
◦ Example: Figure 6.2 shows a sample database with five items and eight
terms.
• Similarity Measure:
◦ Determines how closely two terms are related.
Similarity(Termi, Termj) = ∑k (Termk,i) ⋅ (Termk,j)
◦ Summation across all items of the product of term pair values from the
matrix.
◦ Results in a symmetric Term-Term Matrix (Figure 6.3) where each cell
represents the similarity score between two terms.
        Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
Term1     -      7     16     15     14     14      9      7
Term2     7      -      8     12      3     18      6     17
Term3    16      8      -     18      6     16      0      8
Term4    15     12     18      -      6     18      6      9
Term5    14      3      6      6      -      6      9      3
Term6    14     18     16     18      6      -      2     16
Term7     9      6      0      6      9      2      -      3
Term8     7     17      8      9      3     16      3      -
Applying a threshold of 10 to the Term-Term Matrix gives the binary term relationship matrix:

        Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
Term1     1      0      1      1      1      1      0      0
Term2     0      1      0      1      0      1      0      1
Term3     1      0      1      1      0      1      0      0
Term4     1      1      1      1      0      1      0      0
Term5     1      0      0      0      1      0      0      0
Term6     1      1      1      1      0      1      0      1
Term7     0      0      0      0      0      0      1      0
Term8     0      1      0      0      0      1      0      1
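Both matrices can be reproduced directly from the Figure 6.2 item-term data; a short sketch (the threshold of 10 for the binary matrix is inferred from the values shown):

```python
# Item-term matrix from Figure 6.2 (rows = items, columns = terms).
M = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]
n_terms = len(M[0])

# Term-Term similarity: sum over items of the product of the two terms' weights.
sim = [[sum(row[i] * row[j] for row in M) for j in range(n_terms)]
       for i in range(n_terms)]
for i in range(n_terms):
    print([sim[i][j] for j in range(n_terms) if j != i])  # matches Figure 6.3

# Thresholding at >= 10 yields the binary Term-Term matrix.
binary = [[1 if (i == j or sim[i][j] >= 10) else 0 for j in range(n_terms)]
          for i in range(n_terms)]
```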
3. Visual Representation
• Network Diagram:
◦ Nodes represent terms, and arcs indicate similarity relationships.
◦ Identifies cliques (subnetworks), single links, and other structures.
◦ Example: Figure 6.5 demonstrates network-based clustering.
Conclusion
The Complete Term Relation Method provides a comprehensive approach to term
clustering, with various algorithms catering to different needs and data structures.
The choice of algorithm depends on the density of relationships, computational
resources, and the intended use of the thesaurus.
Clustering Using Existing Clusters
Clustering using existing clusters is an iterative methodology aimed at reducing
computational complexity compared to the Complete Term Relation Method. This
approach refines initial arbitrary cluster assignments by revalidating and reallocating
terms until cluster assignments stabilize.
1. Key Concepts
Centroids:
• A centroid represents the "center of mass" of a cluster in an N-dimensional
space, where N is the number of items in the dataset.
• It is calculated as the average vector of all terms in a cluster.
Iterative Process:
1. Initial Assignment:
◦ Terms are arbitrarily assigned to clusters, defining initial centroids.
2. Reallocation Based on Similarity:
◦ Similarity between each term and cluster centroids is recalculated.
◦ Terms are reassigned to clusters with the highest similarity.
◦ New centroids are computed for updated clusters.
3. Stabilization:
◦ The process repeats until minimal movement occurs between clusters.
Efficiency:
• Computational complexity is reduced to O(n), as similarity calculations are
performed only between terms and cluster centroids, not every pair of terms.
2. Example Process
Initial Setup:
Using the data from the table below:
Term-1 Term-2 Term-3 Term-4 Term-5 Term-6 Term-7 Term-8
Item 1 0 4 0 0 0 2 1 3
Item 2 3 1 4 3 1 2 0 1
Item 3 3 0 0 0 3 0 3 0
Item 4 0 1 0 3 0 0 2 0
Item 5 2 2 2 3 1 4 0 2
• Arbitrary Assignments:
◦ Class 1: Term 1, Term 2
◦ Class 2: Term 3, Term 4
◦ Class 3: Term 5, Term 6
Centroid Calculation:
• Class Centroids:
◦ Each centroid value is the average of the term weights for the respective cluster.
◦ Example: For Class 1 (Term 1, Term 2), the centroid value for Item 1 is (0 + 4)/2 = 2.
3. Iterative Refinement
• New Centroids:
◦ Recalculate centroids for updated clusters after reassignment.
• Adjustments:
◦ Example: Term 7 moves from Class 1 to Class 3 in the next iteration due
to its weaker relation to Class 1 terms.
Final Stabilized Clusters:
• Class 1: Terms most strongly associated with Centroid 1.
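A compact sketch of this iterative reallocation, using the dot product as the similarity measure and the arbitrary initial classes from the example (term indices are 0-based; the iteration cap is an assumption):

```python
M = [  # item-term matrix (rows = items, columns = terms) from the example
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]

def term_vec(t):
    return [M[r][t] for r in range(len(M))]  # a term is a column of M

def centroid(cluster):
    return [sum(M[r][t] for t in cluster) / len(cluster) for r in range(len(M))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

clusters = [[0, 1], [2, 3], [4, 5]]  # arbitrary initial assignment
for _ in range(10):                  # iterate until assignments stabilize
    cents = [centroid(c) for c in clusters]
    new = [[] for _ in clusters]
    for t in range(len(M[0])):       # reassign every term to its closest centroid
        best = max(range(len(cents)), key=lambda c: dot(term_vec(t), cents[c]))
        new[best].append(t)
    if new == clusters:
        break
    clusters = [c for c in new if c]  # drop empty clusters (a simplification)
print(clusters)
```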
5. Summary
Clustering using existing clusters is an efficient and iterative approach to term clustering. While it offers computational benefits, its fixed cluster count and forced assignments can limit its adaptability. This method is particularly effective when computational efficiency is the priority.
One Pass Assignments
1. Key Concepts
Process Overview:
1. Initial Term Assignment:
◦ The first term is assigned to the first class.
2. Subsequent Term Assignments:
◦ For each subsequent term:
▪ It is compared to the centroids of existing classes.
▪ A threshold value is set to decide if the term is similar enough to
be added to an existing class.
▪ If the term's similarity to the closest centroid exceeds the
threshold, it is assigned to that class, and a new centroid is
calculated for the class.
▪ If no similarity exceeds the threshold, the term starts a new class.
3. Iteration:
◦ The process continues until all terms are assigned to a class.
Centroid Update:
• Each time a new term is added to a class, the centroid for that class is
recalculated as the average of the weights of all terms in the class.
2. Example:
Data Setup (item-term matrix from Figure 6.2):
Term-1 Term-2 Term-3 Term-4 Term-5 Term-6 Term-7 Term-8
Item 1 0 4 0 0 0 2 1 3
Item 2 3 1 4 3 1 2 0 1
Item 3 3 0 0 0 3 0 3 0
Item 4 0 1 0 3 0 0 2 0
Item 5 2 2 2 3 1 4 0 2
3. Centroid Calculations:
The centroid values for the classes are updated as terms are added:
1. Class 1 (Term 1, Term 3):
   Centroid (Class 1) = (Term 1 + Term 3) / 2
2. Class 1 (Term 1, Term 3, Term 4):
   Centroid (Class 1) = (Term 1 + Term 3 + Term 4) / 3
3. Class 2 (Term 2, Term 6):
   Centroid (Class 2) = (Term 2 + Term 6) / 2
4. Strengths and Limitations
Strengths:
1. Minimal Computational Overhead:
◦ Requires only one pass through the terms, making it efficient with complexity O(n).
2. Simplicity:
◦ Easy to implement and understand, with no need for complex iterative
calculations.
Limitations:
1. Non-optimal Clusters:
◦ The clusters may not be ideal because the order of term processing
affects the resulting clusters.
◦ Terms that should be in the same cluster may end up in different clusters
because centroids are averaged as new terms are added.
2. Order Sensitivity:
◦ The final clusters can vary depending on the order in which the terms are
processed.
3. Threshold Dependency:
◦ The threshold value heavily influences cluster formation and must be
carefully chosen to ensure meaningful clusters.
5. Summary
The One Pass Assignments method is a fast and efficient clustering technique with minimal computational complexity. However, it may not generate the most optimal clusters, as the clustering outcome depends on the order in which terms are processed. The simplicity of the method makes it useful for large datasets where computational efficiency is a priority, but it is less suitable when precise clustering
results are needed.
Item Clustering
Item clustering is the process of grouping items based on their similarity. It is similar
to term clustering, as both techniques aim to categorize elements based on shared
characteristics. The key difference lies in the focus: item clustering groups
documents or items, while term clustering organizes terms used within those items.
Formula: Similarity(Itemi, Itemj) = ∑k (Termi,k) ⋅ (Termj,k)
• Item-Item Matrix:
The first step in item clustering is to create an Item-Item matrix, which
calculates the similarity between all pairs of items based on shared terms. For
example, if two items share a large number of terms, their similarity value will
be high.
4. Clustering Algorithms:
Several clustering techniques can be applied to the Item Relationship matrix to group
items based on similarity. Below are a few algorithms applied to the same item data:
Clique Algorithm:
• This method generates the following clusters:
◦ Class 1: Item 1, Item 2, Item 5
◦ Class 2: Item 2, Item 3
◦ Class 3: Item 2, Item 4, Item 5
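These classes can be reproduced by building a graph from the Item-Item matrix and enumerating maximal cliques; a sketch using networkx, where the similarity threshold of 10 is an inferred assumption that happens to reproduce the classes above:

```python
import itertools
import networkx as nx

M = [  # item-term matrix from the running example
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]

G = nx.Graph()
G.add_nodes_from(range(1, len(M) + 1))
for i, j in itertools.combinations(range(len(M)), 2):
    sim = sum(a * b for a, b in zip(M[i], M[j]))  # Item-Item similarity
    if sim >= 10:                                 # assumed threshold
        G.add_edge(i + 1, j + 1)

print(sorted(sorted(c) for c in nx.find_cliques(G)))
# [[1, 2, 5], [2, 3], [2, 4, 5]]
```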
Summary:
Item clustering is a vital process for grouping similar items in large datasets,
particularly for digital libraries, document retrieval, or information organization.
Various clustering algorithms can be applied, and while techniques like Clique,
Single Link, Star, and String each have their strengths, challenges like ambiguity
and multiple topics in items must be addressed. The use of n-grams and pre-existing
clusters can enhance the accuracy of clustering results.
3. Dendrogram Representation:
A dendrogram is a tree-like diagram used to represent the hierarchy of clusters. It
allows users to visually determine which clusters are related and choose alternate
paths for browsing. The diagram also indicates the strength of relationships
(through line styles, such as dashed lines for weaker connections).
5. Ward’s Method:
• Ward’s Method (Ward-63) is a specific technique used in HACM. It
minimizes the sum of squared Euclidean distances between points (usually
centroids of clusters) in order to form clusters.
• Ward’s method calculates the increase in variance, I, that each candidate merge would cause and chooses the minimum-variance option for merging clusters:
I = (ni ⋅ nj / (ni + nj)) × distance
where ni and nj represent the numbers of items in the clusters being merged and distance represents the squared Euclidean distance between their centroids.
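In practice, Ward’s method is available off the shelf; a sketch using SciPy's hierarchical clustering and a dendrogram plot (requires scipy and matplotlib; the item vectors are the example item-term matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Items as term-weight vectors (the example item-term matrix).
X = np.array([
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
])

# Ward linkage: each step merges the pair of clusters whose union gives
# the minimum increase in total within-cluster variance.
Z = linkage(X, method="ward")

dendrogram(Z, labels=[f"Item {i}" for i in range(1, 6)])
plt.show()
```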
• Subsumption Testing:
One key method for identifying hierarchical relationships is subsumption, where a parent term subsumes a child term if the documents containing the child term are a large subset of the documents containing the parent term (typically 80%); a minimal sketch follows this list.
• WordNet Representation:
Similar to WordNet (Miller-95), which uses relationships like synonyms,
antonyms, hyponyms (subtypes), and meronyms (parts of), Sanderson and
Croft propose representing term hierarchies with similar relationships.
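Subsumption testing reduces to a set-overlap check on posting lists; a minimal sketch (the document IDs below are hypothetical):

```python
def subsumes(parent_docs, child_docs, threshold=0.8):
    """Parent subsumes child if most documents containing the child
    term also contain the parent term."""
    child_docs = set(child_docs)
    overlap = len(child_docs & set(parent_docs))
    return overlap / len(child_docs) >= threshold

# Hypothetical posting lists (document IDs per term).
docs_computer = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
docs_pentium = {2, 3, 4, 5, 11}
print(subsumes(docs_computer, docs_pentium))  # True: 4/5 = 0.8
```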
10. Conclusion:
Hierarchical clustering plays an important role in organizing documents or terms into
manageable clusters that can be more easily navigated and searched. By reducing the
overhead of search, providing a visual representation (e.g., dendrograms), and
expanding the retrieval of relevant items, hierarchical clustering methods enhance the
retrieval process. Various algorithms like Ward’s Method and HACM can be
applied to items and terms, with the goal of grouping related elements and
simplifying the information retrieval process.