
Unit 4

Clustering
Basic Concept and Terminologies

Rupak Raj Ghimire

COM 315 (Advanced Programming Techniques)
2024 BBIS, KU
Objective

Basic Concept of Clustering

What is Cluster Analysis?

Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups

What is Cluster Analysis?

A cluster of data objects can be treated collectively as one
group and so may be considered as a form of data
compression.

The process of grouping a set of physical or abstract objects
into classes of similar objects is called clustering.

A cluster is a collection of data objects that are similar to one
another within the same cluster and are dissimilar to the
objects in other clusters.

Clustering vs. Classification

Although classification is an effective means for distinguishing
groups or classes of objects, it requires the often costly
collection and labeling of a large set of training tuples or
patterns, which the classifier uses to model each group.

It is often more desirable to proceed in the reverse direction:
First partition the set of data into groups based on data
similarity (e.g., using clustering), and then assign labels to the
relatively small number of groups.

Additional advantages of such a clustering-based process are
that it is adaptable to changes and helps single out useful
features that distinguish different groups
Applications of Cluster Analysis

Land use detection

Crop / forest type identification

Data segmentation

Fault isolation

Outlier detection

What is not Cluster Analysis?

Supervised classification
– Have class label information

Simple segmentation
– Dividing students into different registration groups alphabetically, by
last name

Results of a query
– Groupings are a result of an external specification

Graph partitioning
– Some mutual relevance and synergy, but areas are not identical

Challenges of Clustering

Scalability:
– Many clustering algorithms work well on small data sets containing
fewer than several hundred data objects; however, a large database
may contain millions of objects.
– Clustering on a sample of a given large data set may lead to biased
results.
– Highly scalable clustering algorithms are needed.

Challenges of Clustering

Ability to deal with different types of attributes
– Many algorithms are designed to cluster interval-based (numerical)
data.
– However, applications may require clustering other types of data, such
as binary, categorical (nominal), and ordinal data, or mixtures of these
data types.

Challenges of Clustering

Discovery of clusters with arbitrary shape
– Many clustering algorithms determine clusters based on Euclidean or
Manhattan distance measures.
– Algorithms based on such distance measures tend to find spherical
clusters with similar size and density.
– However, a cluster could be of any shape. It is important to develop
algorithms that can detect clusters of arbitrary shape

Challenges of Clustering

Ability to deal with noisy data
– Most real-world databases contain outliers or missing, unknown, or
erroneous data.
– Some clustering algorithms are sensitive to such data and may lead to
clusters of poor quality.

Challenges of Clustering

Incremental clustering and insensitivity to the order of input
records
– Some clustering algorithms cannot incorporate newly inserted data
(i.e., database updates) into existing clustering structures and, instead,
must determine a new clustering from scratch.
– Some clustering algorithms are sensitive to the order of input data.

That is, given a set of data objects, such an algorithm may return
dramatically different clusterings depending on the order of presentation of
the input objects.
– It is important to develop incremental clustering algorithms and
algorithms that are insensitive to the order of input.

Challenges of Clustering

High dimensionality
– A database or a data warehouse can contain several dimensions or
attributes. Many clustering algorithms are good at handling
low-dimensional data, involving only two to three dimensions.
– Human eyes are good at judging the quality of clustering for up to
three dimensions.
– Finding clusters of data objects in high dimensional space is
challenging, especially considering that such data can be sparse and
highly skewed.

Challenges of Clustering

Constraint-based clustering
– Real-world applications may need to perform clustering under various
kinds of constraints. Suppose that your job is to choose the locations
for a given number of new automatic banking machines (ATMs) in a
city.
– To decide upon this, you may cluster households while considering
constraints such as the city’s rivers and highway networks, and the
type and number of customers per cluster.
– A challenging task is to find groups of data with good clustering
behavior that satisfy specified constraints.

Challenges of Clustering

Interpretability and usability
– Users expect clustering results to be interpretable, comprehensible,
and usable.
– That is, clustering may need to be tied to specific semantic
interpretations and applications.
– It is important to study how an application goal may influence the
selection of clustering features and methods.

Types of Data in Cluster Analysis

Suppose that a data set to be clustered contains n objects,
which may represent persons, houses, documents, countries,
and so on.

Main memory-based clustering algorithms typically operate on
either of the following two data structures:
– Data Matrix: object-by-variable structure
– Dissimilarity Matrix: object-by-object structure

Data Matrix

This represents n objects, such as persons, with p variables
(also called measurements or attributes), such as age, height,
weight, gender, and so on.

The structure is in the form of a relational table, or n-by-p
matrix (n objects × p variables):
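(Reconstructed from the standard definition; the slide showed the matrix as a figure.)

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$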

Dissimilarity Matrix

This stores a collection of proximities that are available for all
pairs of n objects. It is often represented by an n-by-n table
– where d(i, j) is the measured difference or dissimilarity between
objects i and j.
– In general, d(i, j) is a nonnegative number that is close to 0 when
objects i and j are highly similar or “near” each other, and becomes
larger the more they differ.
– Since d(i, j) = d(j, i), and d(i, i) = 0, we have the matrix shown below.
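(Reconstructed from the standard definition.)

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$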

Interval-Scaled Variables

Interval-scaled variables are continuous measurements of a
roughly linear scale

Typical examples include weight and height, latitude and
longitude coordinates (e.g., when clustering houses), and
weather temperature

The measurement unit used can affect the clustering analysis.
For example, changing measurement units from meters to
inches for height, or from kilograms to pounds for weight, may
lead to a very different clustering structure

Interval-Scaled Variables

To help avoid dependence on the choice of measurement
units, the data should be standardized.

Standardizing measurements attempts to give all variables an
equal weight.

Standardization

To standardize measurements, one choice is to convert the
original measurements to unitless variables.

Given measurements for a variable f, this can be performed
as follows:
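(Reconstructed from the standard procedure, which uses the mean absolute deviation; the slide's formulas were figures.)

1. Calculate the mean absolute deviation, $s_f$:

$$ s_f = \frac{1}{n}\left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right) $$

where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$ is the mean value of variable f.

2. Calculate the standardized measurement, or z-score:

$$ z_{if} = \frac{x_{if} - m_f}{s_f} $$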

Important Note

Standardization may or may not be useful in a particular
application. Thus the choice of whether and how to perform
standardization should be left to the user

Dissimilarity Measure

Euclidean Distance

Manhattan (or City Block) Distance

Minkowski Distance

Euclidean Distance

The most popular distance measure is Euclidean distance:
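$$ d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2} $$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects. (Reconstructed from the standard definition.)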

Manhattan Distance

Another well-known metric is Manhattan (or city block) distance,
defined as
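$$ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| $$

(Reconstructed from the standard definition.)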


Both the Euclidean distance and Manhattan distance satisfy the
following mathematical requirements of a distance function:
1. d(i, j) ≥ 0: Distance is a nonnegative number.
2. d(i, i) = 0: The distance of an object to itself is 0.
3. d(i, j) = d(j, i): Distance is a symmetric function.
4. d(i, j) ≤ d(i, h) + d(h, j): Going directly from object i to object j in space is no
more than making a detour over any other object h (triangle inequality).
Minkowski Distance

Minkowski distance is a generalization of both Euclidean
distance and Manhattan distance. It is defined as
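$$ d(i, j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{im} - x_{jm}|^p \right)^{1/p} $$

(Reconstructed from the standard definition; m is the number of variables, written as p on earlier slides but renamed here to avoid clashing with the order p.)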


where p is a positive integer. Such a distance is also called the
Lp norm in some literature.

It represents the Manhattan distance when p = 1 (i.e., L1
norm) and Euclidean distance when p = 2 (i.e., L2 norm).
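A minimal NumPy sketch (not from the slides) showing how p = 1 and p = 2 recover the Manhattan and Euclidean distances:

import numpy as np

def minkowski(x, y, p):
    # Lp distance between two equal-length vectors:
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance.
    return float((np.abs(x - y) ** p).sum() ** (1.0 / p))

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(minkowski(x, y, 1))  # 7.0 (Manhattan: |1-4| + |2-6|)
print(minkowski(x, y, 2))  # 5.0 (Euclidean: sqrt(9 + 16))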

Weighted Euclidean distance

If each variable is assigned a weight according to its perceived
importance, the weighted Euclidean distance can be
computed as
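$$ d(i, j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \cdots + w_p (x_{ip} - x_{jp})^2} $$

(Reconstructed from the standard definition; $w_f$ is the weight assigned to variable f.)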


Weighting can also be applied to the Manhattan and
Minkowski distances

Categories of the Clustering Methods

Many clustering algorithms exist in the literature.

It is difficult to provide a crisp categorization of clustering
methods because these categories may overlap, so that a
method may have features from several categories:
– Partitioning Methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods

Partitioning Methods

Given a database of n objects or data tuples
– a partitioning method constructs k partitions of the data, where each
partition represents a cluster and k ≤ n.

It classifies the data into k groups, which together satisfy the
following requirements:
– each group must contain at least one object, and
– each object must belong to exactly one group.

Partitioning Methods

Given k, the number of partitions to construct, a partitioning
method creates an initial partitioning.

It then uses an iterative relocation technique that attempts
to improve the partitioning by moving objects from one group
to another

Hierarchical methods

A hierarchical method creates a hierarchical decomposition of
the given set of data objects.

A hierarchical method can be classified as being either
agglomerative or divisive, based on how the hierarchical
decomposition is formed.

The agglomerative approach, also called the bottom-up
approach, starts with each object forming a separate group and
successively merges the groups closest to one another, until all
the groups are merged into one or a termination condition holds.

Hierarchical methods

How it works? (This describes the divisive, or top-down, approach.)
– Starts with all of the objects in the same cluster.
– In each successive iteration, a cluster is split up into smaller clusters,
until eventually each object is in its own cluster, or until a termination
condition holds.

Density-based methods

Distance-based methods are unable to discover clusters of
arbitrary shape.

Other clustering methods have been developed based on the
notion of density.

Their general idea is to continue growing the given cluster as
long as the density (number of objects or datapoints) in the
“neighborhood” exceeds some threshold;
– that is, for each data point within a given cluster, the neighborhood of a
given radius has to contain at least a minimum number of points.

Such a method can be used to filter out noise (outliers) and
discover clusters of arbitrary shape.
Grid-based methods

Grid-based methods quantize the object space into a finite
number of cells that form a grid structure.

All of the clustering operations are performed on the grid
structure (i.e., on the quantized space).
– The main advantage of this approach is its fast processing time, which
is typically independent of the number of data objects and dependent
only on the number of cells in each dimension in the quantized space.

Model-based Method

Model-based methods hypothesize a model for each of the
clusters and find the best fit of the data to the given model.

A model-based algorithm may locate clusters by constructing
a density function that reflects the spatial distribution of the
data points.

It also leads to a way of automatically determining the number
of clusters based on standard statistics, taking “noise” or
outliers into account and thus yielding robust clustering
methods.

Example: EM, COBWEB, SOM (Neural network based)

Partitioning Methods

Given D, a data set of n objects, and k, the number of clusters
to form,
– a partitioning algorithm organizes the objects into k partitions (k ≤ n),
where each partition represents a cluster.
– The clusters are formed to optimize an objective partitioning criterion,
such as a dissimilarity function based on distance, so that the objects
within a cluster are “similar,” whereas the objects of different clusters
are “dissimilar” in terms of the data set attributes.

Centroid-Based Technique

The k-Means Method


The k-means algorithm takes the input parameter, k, and
partitions a set of n objects into k clusters so that the
resulting intra-cluster similarity is high but the inter-cluster
similarity is low.

Cluster similarity is measured in regard to the mean value of
the objects in a cluster, which can be viewed as the cluster’s
centroid or center of gravity.

The k-Means Method

(The original slides presented the k-means algorithm steps and a worked example as figures.)
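As a concrete sketch of the method described above, a minimal NumPy implementation (illustrative, not the slides' own code; it assumes numeric data and that no cluster ever becomes empty):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # X: (n, p) data matrix; k: desired number of clusters.
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest center
        # (Euclidean distance), maximizing intra-cluster similarity.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean (centroid) of its objects.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers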
Limitations of K-Means

Can be applied only when the mean of a cluster is defined
– When data has categorical attributes, k-means cannot be applied.

The necessity for users to specify k, the number of clusters, in
advance can be seen as a disadvantage

It is sensitive to noise and outlier data points because a small
number of such data can substantially influence the mean
value.

K-Modes Method

Another variant of k-means is the k-modes method.

It extends the k-means paradigm to cluster categorical data
by replacing the means of clusters with modes, using new
dissimilarity measures to deal with categorical objects and a
frequency-based method to update the modes of clusters (see
the sketch below).
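A common choice (illustrative, not taken from the slide) is the simple matching dissimilarity: for two categorical objects X and Y with m attributes,

$$ d(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j), \qquad \delta(x_j, y_j) = \begin{cases} 0 & \text{if } x_j = y_j \\ 1 & \text{otherwise} \end{cases} $$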


The k-means and the k-modes methods can be integrated to
cluster data with mixed numeric and categorical values.

EM algorithm

The EM (Expectation-Maximization) algorithm extends the k-
means paradigm in a different way.

Whereas the k-means algorithm assigns each object to a
cluster, in EM each object is assigned to each cluster
according to a weight representing its probability of
membership. In other words, there are no strict boundaries
between clusters.

Therefore, new means are computed based on weighted
measures, as sketched below.
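As a sketch (the standard mixture-model formulation; the slide does not spell it out): in the E-step each object $x_i$ receives a membership weight $w_{ij} = P(C_j \mid x_i)$ for cluster $C_j$, and in the M-step each cluster mean is recomputed as the weighted average

$$ m_j = \frac{\sum_{i=1}^{n} w_{ij}\, x_i}{\sum_{i=1}^{n} w_{ij}} $$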


Thank you
