Outliers Detection Based on Partitioning Around Medoids

Ei Ei Htwe
Computer University (Mandalay)
eieihtwe007@[Link]

Abstract

Clustering is the process of grouping a set of objects into classes or clusters so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. Clustering analysis is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes. This system is intended to cluster Iris plants (Setosa, Versicolor and Virginica) using the Partitioning Around Medoids (PAM) clustering algorithm. The PAM algorithm follows a stepwise method for selecting the initial medoids. It also calculates the distance matrix once and uses it for finding new medoids at every iterative step. On further analysis of the Iris data, the researchers found that Setosa is a clearly separable cluster, while the other two clusters, Versicolor and Virginica, overlap significantly with each other. All clustering methods were able to identify Setosa more or less correctly, but made mistakes on Versicolor and Virginica. It is therefore necessary to detect the inconsistent data, that is, the outliers, in the clusters. By using this system, the user can learn about Iris plants and, at the same time, learn outlier detection with the PAM clustering algorithm.

1. Introduction

Outliers are the set of objects that are considerably dissimilar from the remainder of the data. Outlier detection is an extremely important problem with direct application in a wide variety of domains, including fraud detection, identifying computer network intrusions and bottlenecks, criminal activities in e-commerce, and detecting suspicious activities. Clustering is a popular technique used to group similar data points or objects into groups or clusters, and it is an important tool for outlier analysis. Several clustering-based outlier detection techniques have been developed. Most of these techniques rely on the key assumption that normal objects belong to large and dense clusters, while outliers form very small clusters [4]. This system uses Partitioning Around Medoids (PAM). PAM attempts to determine k partitions for n objects. The algorithm uses the most centrally located object in a cluster (called the medoid) instead of the cluster mean. The system uses the Iris plants dataset. There are fifty plants of each species, with four measurements on each plant: petal length, petal width, sepal length, and sepal width [3]. PAM is more robust than the k-means algorithm in the presence of noise and outliers, because the medoids produced by PAM are robust representations of the cluster centers and are less influenced by outliers and other extreme values than the means. Small clusters are determined and considered outlier clusters. To detect the outliers in the rest of the clusters, the system computes the Absolute Distance between the Medoid, μ, of the current cluster and each one of the Points, pi, in the same cluster (i.e., |pi − μ|). The produced value is termed ADMP [4].

2. Related Work

Clustering has been studied extensively for more than 40 years, and researchers have proposed many clustering algorithms owing to its wide range of applications.

Among the partitioning methods, the k-means algorithm is the simplest and most commonly used clustering algorithm employing a square-error criterion. In k-means, each cluster is represented by the center of the cluster. It is relatively scalable and efficient in clustering large data sets, and it is by far the most popular clustering tool used in scientific and industrial applications. K-means is computationally fast and iteratively partitions a data set into k disjoint clusters, where the value of k is an algorithmic input. The goal is to obtain the partition with the smallest square error. Researchers later developed further partitioning clustering algorithms as extensions of k-means, because k-means has drawbacks such as sensitivity to noise and outliers. In the k-medoids algorithm PAM, each cluster is represented by its most centrally located object, called the medoid. PAM is more robust than k-means in the presence of noise and outliers
because a medoid is less influenced by outliers or other extreme values than a mean.

Clustering-based approaches consider clusters of small sizes as clustered outliers. In these approaches, small clusters (i.e., clusters containing significantly fewer points than other clusters) are considered outliers. The advantage of the clustering-based approaches is that they do not have to be supervised. Moreover, clustering-based techniques are capable of being used in an incremental mode (i.e., after learning the clusters, new points can be inserted into the system and tested for outliers) [4].

3. Theory Background

3.1 Outliers

Data mining, in general, deals with the discovery of non-trivial, hidden and interesting knowledge from different types of data. With the development of information technologies, the number of databases, as well as their dimension and complexity, grows rapidly. Automated analysis of such great amounts of information is therefore necessary. The analysis results are then used for making a decision by a human or a program. One of the basic problems of data mining is outlier detection [6].

An outlier is an observation of the data that deviates from other observations so much that it arouses suspicion that it was generated by a different mechanism than the majority of the data. Outlier detection has many applications, such as data cleaning, fraud detection and network intrusion. The existence of outliers can indicate individuals or groups whose behavior is very different from that of most of the individuals in the dataset. Frequently, outliers are removed to improve the accuracy of estimators. But sometimes the presence of an outlier has a certain meaning, whose explanation can be lost if the outlier is deleted.

3.2 Outliers in Clustering

The outlier detection problem is in some cases similar to the classification problem. The main concern of clustering-based outlier detection algorithms is to find clusters and outliers, which are often regarded as noise that should be removed in order to make the clustering more reliable. Some noisy points may be far away from the data points, whereas others may be close. The far-away noisy points affect the result more significantly because they differ more from the data points. It is desirable to identify and remove the outliers, which are far away from all the other points in the cluster. So, to improve the clustering, such algorithms use the same process and functionality to solve both clustering and outlier discovery [6].

3.3 Partitioning Around Medoids (PAM) Algorithm

Input: The number of clusters k and a database containing n objects.
Output: A set of k clusters that minimizes the sum of the dissimilarities of all the objects to their nearest medoid.
Method: Use real objects to represent the clusters.
1. Select k representative objects arbitrarily.
2. For each pair of a non-selected object h and a selected object i (see Process 1), calculate the total swapping cost TCih (see Process 2).
3. For each pair of i and h, if TCih < 0, replace i by h. Then assign each non-selected object to the most similar representative object.
4. Repeat steps 2-3 until there is no change [2].

Process 1: To determine whether a non-medoid object, h, is a good replacement for a current medoid, i, the following four cases are examined for each of the non-medoid objects, j.

Case 1: j currently belongs to medoid i. If i is replaced by h as a medoid and j is closest to one of the other medoids t, t ≠ i, then j is reassigned to t.
Case 2: j currently belongs to medoid i. If i is replaced by h as a medoid and j is closest to h, then j is reassigned to h.
Case 3: j currently belongs to medoid t, t ≠ i. If i is replaced by h as a medoid and j is still closest to t, then the assignment does not change.
Case 4: j currently belongs to medoid t, t ≠ i. If i is replaced by h as a medoid and j is closest to h, then j is reassigned to h.

Process 2: PAM Clustering: Total Swapping Cost

TCih = Σj Cjih

• i is a current medoid, h is a non-selected object.
• Assume that i is replaced by h in the set of medoids.
• TCih = 0;
• For each non-selected object j ≠ h: TCih += d(j, new_medj) − d(j, prev_medj)
• new_medj = the closest medoid to j after i is replaced by h
• prev_medj = the closest medoid to j before i is replaced by h
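The swap search of Process 1 and the total swapping cost of Process 2 can be sketched in Python. This is an illustrative sketch under our own naming, not the paper's implementation; `total_swap_cost`, `pam`, and the toy `points` data are all assumptions made for the example:

```python
import math
import random

def total_swap_cost(data, medoids, i, h, dist):
    """TCih: change in total assignment cost if medoid i is replaced
    by the non-selected object h (this single sum covers Cases 1-4)."""
    new_medoids = [m for m in medoids if m != i] + [h]
    cost = 0.0
    for j in range(len(data)):
        if j == h:
            continue  # the sum runs over non-selected objects j != h
        prev_med = min(medoids, key=lambda m: dist(data[j], data[m]))
        new_med = min(new_medoids, key=lambda m: dist(data[j], data[m]))
        cost += dist(data[j], data[new_med]) - dist(data[j], data[prev_med])
    return cost

def pam(data, k, dist, seed=0):
    """Select k medoids arbitrarily, then keep swapping a medoid with a
    non-selected object while some swap has negative total cost."""
    medoids = random.Random(seed).sample(range(len(data)), k)
    improved = True
    while improved:
        improved = False
        for i in list(medoids):
            for h in range(len(data)):
                if h not in medoids and total_swap_cost(data, medoids, i, h, dist) < 0:
                    medoids[medoids.index(i)] = h
                    improved = True
                    break
            if improved:
                break
    # finally, assign each object to its most similar representative object
    clusters = {m: [] for m in medoids}
    for j in range(len(data)):
        clusters[min(medoids, key=lambda m: dist(data[j], data[m]))].append(j)
    return medoids, clusters

# two well-separated groups: the medoids should end up one per group
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, clusters = pam(points, 2, dist=math.dist)
print(sorted(sorted(c) for c in clusters.values()))   # [[0, 1, 2], [3, 4, 5]]
```

Every accepted swap strictly decreases the total cost, so the loop terminates; this mirrors step 4 of the method ("repeat until there is no change").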
3.4 Euclidean Distance

The system uses the Euclidean distance measure between two points:

d(i, j) = √(|xi1 − xj1|² + |xi2 − xj2|² + … + |xip − xjp|²)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two points. The distance is calculated from sepal length, sepal width, petal length and petal width.

3.5 The Purpose of Studying Outliers

Outliers can provide useful information about the process. An outlier can be created by a shift in the location (mean) or in the scale (variability) of the process. Though an observation in a particular sample might be a candidate outlier, the process might simply have shifted. Sometimes the spurious result is a gross recording error or a measurement error; measurement systems should be shown to be capable for the process they measure. Outliers also come from incorrect specifications that are based on the wrong distributional assumptions at the time the specifications are generated [5].

4. Architecture of the System

First, perform the PAM algorithm, producing a set of clusters and a set of medoids (cluster centers). Small clusters are determined and considered outlier clusters. A small cluster is defined as a cluster with fewer points than half the average number of points in the k clusters.

To detect the outliers in the rest of the clusters, compute the Absolute Distances between the Medoid, μ, of the current cluster and each one of the Points, pi, in the same cluster (i.e., |pi − μ|). The produced value will be termed ADMP. If the ADMP value is greater than a calculated threshold, T, then the point is considered an outlier; otherwise, it is not. The value of T is calculated as the average of all ADMP values of the same cluster multiplied by 1.5. The basic structure of the proposed method is as follows [4]:

Step 1. Perform the PAM clustering algorithm to produce a set of k clusters.
Step 2. Determine small clusters and consider the points (objects) that belong to these clusters as outliers.
For the rest of the clusters (not determined in Step 2):
Begin
Step 3. For each cluster j, compute the ADMPj and Tj values.
Step 4. For each point i in cluster j, if ADMPij > Tj then classify point i as an outlier; otherwise not.
End.

The functionalities of the system are explained in the next section.

Figure 1: System Flow Diagram (start → read the number of clusters and the database → perform PAM: assign each remaining object to the cluster with the nearest medoid, randomly select a non-medoid object, compute the total cost of swapping, and swap the medoid with the non-medoid object while the total cost is negative → display the k clusters → compute the ADMP values → points with ADMP > T are reported as outliers; the rest are not outliers → end)

4.1 Functionalities of the System

There are four processes in the system.
Process 1: Perform the PAM clustering algorithm.
Process 2: Determine small clusters.
Process 3: Compute the ADMP values and T (for the rest of the clusters).
Process 4: If ADMPij > Tj then classify point i as an outlier; otherwise not.

Process 1: In PAM, k partitions for n objects are formed.
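Steps 2-4 of the method above (small-cluster removal plus the ADMP test) can be sketched as follows. This is an illustrative sketch, not the paper's code; it takes |pi − μ| to be the sum of absolute coordinate differences, which is how the worked examples in Section 4.2 evaluate it, and the names `abs_dist` and `detect_outliers` are our own:

```python
def abs_dist(p, q):
    # |p - mu| evaluated as the sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def detect_outliers(clusters, data):
    """clusters: {medoid id: [member ids, including the medoid]}.
    Returns the ids classified as outliers by Steps 2-4."""
    avg_size = sum(len(m) for m in clusters.values()) / len(clusters)
    outliers = set()
    for medoid, members in clusters.items():
        # Step 2: a small cluster is treated as an outlier cluster as a whole
        if len(members) < avg_size / 2:
            outliers.update(members)
            continue
        # Steps 3-4: ADMP of each point; threshold T = 1.5 * average ADMP
        admp = {p: abs_dist(data[p], data[medoid]) for p in members}
        t = 1.5 * sum(admp.values()) / len(members)
        outliers.update(p for p, v in admp.items() if v > t)
    return outliers
```

Note that a cluster at least half the average size never triggers Step 2 and falls through to the ADMP test, so the two criteria are mutually exclusive per cluster.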
Initially, k medoids are chosen at random from the set of objects. The medoid representing a cluster is the most centrally located object in the cluster. Each remaining object is clustered with the medoid to which it is the most similar, based on the Euclidean distance between the object and the medoid. The strategy then replaces one of the medoids by one of the non-medoids as long as the quality of the resulting clustering improves. This quality is estimated using a cost function that measures the average dissimilarity between an object and the medoid of its cluster.

In the iterative process, a non-medoid object is randomly chosen for replacement with a current medoid. Each replacement causes movement of some objects from one cluster to another. Each time a reassignment occurs, a difference in square error E is contributed to the cost function. The cost function therefore calculates the difference in square-error value if a non-medoid object replaces a current medoid. The total cost of swapping is the sum of the costs incurred by all non-medoid objects. If the total cost is negative, then the replacement of the medoid with the non-medoid object is good, since the actual square error would be reduced. The process is iterated until good replacements of medoids are found. In the end, the k medoids are returned [1].

Process 2: A small cluster is defined as a cluster with fewer points than half the average number of points in the k clusters.

Process 3: Compute the Absolute Distances between the Medoid, μ, of the current cluster and each one of the points, pi, in the same cluster (i.e., |pi − μ|). The value of T is calculated as the average of all ADMP values of the same cluster multiplied by 1.5.

Process 4: If the ADMP value is greater than the calculated threshold, T, then the point is considered an outlier; otherwise it is not.

4.2 Result Set

Table 4.1 Sample Iris Dataset

No   Sepal length   Sepal width   Petal length   Petal width
1    5.4            3.9           1.7            0.4
2    4.6            3.4           1.4            0.3
3    4.8            3.4           1.6            0.2
4    4.3            3.0           1.1            0.1
5    4.9            2.4           3.3            1.0
6    6.6            2.9           4.6            1.3
7    5.0            2.0           3.5            1.0
8    5.9            3.0           4.2            1.5
9    7.6            3.0           6.6            2.1
10   4.9            2.5           4.5            1.7
11   6.5            3.2           5.1            2.0
12   6.4            2.7           5.3            1.9

Iris data consists of three species of Iris: Iris Setosa, Iris Versicolor and Iris Virginica. The Iris dataset is obtained from the UCI Machine Learning Repository. There are fifty plants of each species, with the following four measurements on each plant: sepal length, sepal width, petal length and petal width.

Process 1: Perform PAM clustering algorithm

Table 3.2 Centroids from Iris Dataset

No   Sepal length   Sepal width   Petal length   Petal width
1    5.4            3.9           1.7            0.4
5    4.9            2.4           3.3            1.0
9    7.6            3.0           6.6            2.1

The Euclidean distance is defined as

d(i, j) = √((xi1 − xj1)² + (xi2 − xj2)² + … + (xip − xjp)²)

d(2,1) = √((5.4 − 4.6)² + (3.9 − 3.4)² + (1.7 − 1.4)² + (0.4 − 0.3)²) = 0.995
d(2,5) = √((4.9 − 4.6)² + (2.4 − 3.4)² + (3.3 − 1.4)² + (1.0 − 0.3)²) = 2.3
d(2,9) = √((7.6 − 4.6)² + (3.0 − 3.4)² + (6.6 − 1.4)² + (2.1 − 0.3)²) = 6.3
d(3,1) = √((5.4 − 4.8)² + (3.9 − 3.4)² + (1.7 − 1.6)² + (0.4 − 0.2)²) = 0.8
d(3,5) = √((4.9 − 4.8)² + (2.4 − 3.4)² + (3.3 − 1.6)² + (1.0 − 0.2)²) = 2.1
d(3,9) = √((7.6 − 4.8)² + (3.0 − 3.4)² + (6.6 − 1.6)² + (2.1 − 0.2)²) = 6.1
d(4,1) = √((5.4 − 4.3)² + (3.9 − 3.0)² + (1.7 − 1.1)² + (0.4 − 0.1)²) = 1.6
d(4,5) = √((4.9 − 4.3)² + (2.4 − 3.0)² + (3.3 − 1.1)² + (1.0 − 0.1)²) = 2.5
d(4,9) = √((7.6 − 4.3)² + (3.0 − 3.0)² + (6.6 − 1.1)² + (2.1 − 0.1)²) = 6.72
d(6,1) = √((5.4 − 6.6)² + (3.9 − 2.9)² + (1.7 − 4.6)² + (0.4 − 1.3)²) = 3.4
d(6,5) = √((4.9 − 6.6)² + (2.4 − 2.9)² + (3.3 − 4.6)² + (1.0 − 1.3)²) = 2.22
d(6,9) = √((7.6 − 6.6)² + (3.0 − 2.9)² + (6.6 − 4.6)² + (2.1 − 1.3)²) = 2.38
d(7,1) = √((5.4 − 5.0)² + (3.9 − 2.0)² + (1.7 − 3.5)² + (0.4 − 1.0)²) = 2.7
d(7,5) = √((4.9 − 5.0)² + (2.4 − 2.0)² + (3.3 − 3.5)² + (1.0 − 1.0)²) = 0.5
d(7,9) = √((7.6 − 5.0)² + (3.0 − 2.0)² + (6.6 − 3.5)² + (2.1 − 1.0)²) = 4.3
d(8,1) = √((5.4 − 5.9)² + (3.9 − 3.0)² + (1.7 − 4.2)² + (0.4 − 1.5)²) = 2.92
d(8,5) = √((4.9 − 5.9)² + (2.4 − 3.0)² + (3.3 − 4.2)² + (1.0 − 1.5)²) = 1.6
d(8,9) = √((7.6 − 5.9)² + (3.0 − 3.0)² + (6.6 − 4.2)² + (2.1 − 1.5)²) = 3
d(10,1) = √((5.4 − 4.9)² + (3.9 − 2.5)² + (1.7 − 4.5)² + (0.4 − 1.7)²) = 3.4
d(10,5) = √((4.9 − 4.9)² + (2.4 − 2.5)² + (3.3 − 4.5)² + (1.0 − 1.7)²) = 1.4
d(10,9) = √((7.6 − 4.9)² + (3.0 − 2.5)² + (6.6 − 4.5)² + (2.1 − 1.7)²) = 3.5
d(11,1) = √((5.4 − 6.5)² + (3.9 − 3.2)² + (1.7 − 5.1)² + (0.4 − 2.0)²) = 3.98
d(11,5) = √((4.9 − 6.5)² + (2.4 − 3.2)² + (3.3 − 5.1)² + (1.0 − 2.0)²) = 2.73
d(11,9) = √((7.6 − 6.5)² + (3.0 − 3.2)² + (6.6 − 5.1)² + (2.1 − 2.0)²) = 1.9
d(12,1) = √((5.4 − 6.4)² + (3.9 − 2.7)² + (1.7 − 5.3)² + (0.4 − 1.9)²) = 4.2
d(12,5) = √((4.9 − 6.4)² + (2.4 − 2.7)² + (3.3 − 5.3)² + (1.0 − 1.9)²) = 2.7
d(12,9) = √((7.6 − 6.4)² + (3.0 − 2.7)² + (6.6 − 5.3)² + (2.1 − 1.9)²) = 1.81

1, 5 and 9 are the centroids in these clusters.

Figure 3.1 Clusters based on Euclidean distance of Iris Plants (cluster with centroid 1: points 1, 2, 3, 4; centroid 5: points 5, 6, 7, 8, 10; centroid 9: points 9, 11, 12)

The iterative replacement of medoids described in Process 1 is then applied until no further cost-reducing swaps are found.

Table 3.3 Final Centroids from Iris Dataset

No   Sepal length   Sepal width   Petal length   Petal width
3    4.8            3.4           1.6            0.2
7    5.0            2.0           3.5            1.0
11   6.5            3.2           5.1            2.0
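Given the final medoids in Table 3.3 and the cluster memberships {1, 2, 3, 4}, {5, 7, 8, 10} and {6, 9, 11, 12} of the final clustering, the ADMP thresholds and the outlier list can be reproduced with a short Python sketch (illustrative only; the `data` dict mirrors Table 4.1, and ADMP is evaluated as the sum of absolute coordinate differences, as in the worked computations):

```python
data = {
     1: (5.4, 3.9, 1.7, 0.4),  2: (4.6, 3.4, 1.4, 0.3),
     3: (4.8, 3.4, 1.6, 0.2),  4: (4.3, 3.0, 1.1, 0.1),
     5: (4.9, 2.4, 3.3, 1.0),  6: (6.6, 2.9, 4.6, 1.3),
     7: (5.0, 2.0, 3.5, 1.0),  8: (5.9, 3.0, 4.2, 1.5),
     9: (7.6, 3.0, 6.6, 2.1), 10: (4.9, 2.5, 4.5, 1.7),
    11: (6.5, 3.2, 5.1, 2.0), 12: (6.4, 2.7, 5.3, 1.9),
}
# final medoids (Table 3.3) with their cluster members
clusters = {3: [1, 2, 3, 4], 7: [5, 7, 8, 10], 11: [6, 9, 11, 12]}

def admp(p, m):
    # ADMP of point p: sum of absolute coordinate differences to medoid m
    return sum(abs(a - b) for a, b in zip(data[p], data[m]))

outliers = []
for m, members in clusters.items():
    t = 1.5 * sum(admp(p, m) for p in members) / len(members)  # threshold T
    print(f"T{m} = {round(t, 4)}")   # T3 = 1.275, T7 = 2.2875, T11 = 2.025
    outliers += [p for p in members if admp(p, m) > t]

print(sorted(outliers))              # [1, 4, 8, 9, 10]
```

The printed thresholds and outlier ids match the hand computations that follow.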
3, 7 and 11 are the centroids in these clusters.

Figure 3.2 Clusters Using PAM algorithm (cluster with centroid 3: points 1, 2, 3, 4; centroid 7: points 5, 7, 8, 10; centroid 11: points 6, 9, 11, 12)

Process 2: There are no small clusters in this sample dataset.

For Process 3 and Process 4:

ADMP values in the cluster with centroid 3:
|1 − 3| = |5.4 − 4.8| + |3.9 − 3.4| + |1.7 − 1.6| + |0.4 − 0.2| = 1.4
|2 − 3| = |4.6 − 4.8| + |3.4 − 3.4| + |1.4 − 1.6| + |0.3 − 0.2| = 0.5
|4 − 3| = |4.3 − 4.8| + |3.0 − 3.4| + |1.1 − 1.6| + |0.1 − 0.2| = 1.5
The total of the ADMP values is 3.4.
Threshold T3 = 3.4 × 1.5 / 4 = 1.275
ADMP1 > T3, ADMP4 > T3
So, 1 and 4 are outliers in the cluster with centroid 3.

ADMP values in the cluster with centroid 7:
|5 − 7| = |4.9 − 5.0| + |2.4 − 2.0| + |3.3 − 3.5| + |1.0 − 1.0| = 0.7
|8 − 7| = |5.9 − 5.0| + |3.0 − 2.0| + |4.2 − 3.5| + |1.5 − 1.0| = 3.1
|10 − 7| = |4.9 − 5.0| + |2.5 − 2.0| + |4.5 − 3.5| + |1.7 − 1.0| = 2.3
The total of the ADMP values is 6.1.
Threshold T7 = 6.1 × 1.5 / 4 = 2.2875
ADMP8 > T7, ADMP10 > T7
So, 8 and 10 are outliers in the cluster with centroid 7.

ADMP values in the cluster with centroid 11:
|6 − 11| = |6.6 − 6.5| + |2.9 − 3.2| + |4.6 − 5.1| + |1.3 − 2.0| = 1.6
|9 − 11| = |7.6 − 6.5| + |3.0 − 3.2| + |6.6 − 5.1| + |2.1 − 2.0| = 2.9
|12 − 11| = |6.4 − 6.5| + |2.7 − 3.2| + |5.3 − 5.1| + |1.9 − 2.0| = 0.9
The total of the ADMP values is 5.4.
Threshold T11 = 5.4 × 1.5 / 4 = 2.025
ADMP9 > T11
So, 9 is an outlier in the cluster with centroid 11.

5. Conclusion

This system describes the Partitioning Around Medoids (PAM) clustering algorithm, which belongs to the family of partitional clustering algorithms. A PAM-based clustering approach for outlier detection is presented. We first perform the PAM clustering algorithm. Small clusters are then determined and considered outlier clusters. A small cluster is defined as a cluster with fewer points than half the average number of points in the k clusters. The remaining outliers (if any) are then found in the remaining clusters by calculating the absolute distances between the medoid of the current cluster and each of the points in the same cluster. PAM works effectively for small data sets, but does not scale well for large data sets. To deal with large data sets, a sampling-based method called CLARA (Clustering LARge Applications) can be used. CLARANS has been experimentally shown to be more effective than both PAM and CLARA. The performance of CLARANS can be further improved by exploring spatial data structures, such as R*-trees, and some focusing techniques.

6. References

[1] D.K. Swami and R.C. Jain, "Partitioning Around Medoids for Classification", Department of Computer Applications.
[2] Kaufman and Rousseeuw, "Cluster Analysis Part 1", 1987.
[3] Mahesh Kumar and James B. Orlin, "Scale-invariant Clustering with Minimum Volume Ellipsoids", 28 September 2005.
[4] Moh'd Belal Al-Zoubi, "An Effective Clustering-Based Approach for Outlier Detection", Computer Information Systems Department, University of Jordan.
[5] Steven Walfish, "A Review of Statistical Outlier Methods", [Link], 2006.
[6] Svetlana Cherednichenko, "Outlier Detection in Clustering", Department of Computer Science, University of Joensuu, [Link], 2005.
[7] C.L. Blake and C.J. Merz, UCI Repository of Machine Learning Databases, [Link], Department of Information and Computer Sciences, University of California, Irvine, 1998.