Outlier Detection Using PAM Clustering
Ei Ei Htwe
Computer University (Mandalay)
eieihtwe007@[Link]
Outliers can provide useful information about the process. An outlier can be created by a shift in the location (mean) or in the scale (variability) of the process. Although an observation in a particular sample might be a candidate outlier, the process itself might have shifted. Sometimes the spurious result is a gross recording error or a measurement error. Measurement systems should be shown to be capable for the process they measure. Outliers also come from incorrect specifications that are based on the wrong distributional assumptions at the time the specifications are generated [5].

4. Architecture of the System

First, the PAM algorithm is performed, producing a set of clusters and a set of medoids (cluster centers). Small clusters are then determined and considered as outlier clusters. A small cluster is defined as a cluster with fewer points than half the average number of points in the k clusters.
To detect the outliers in the rest of the clusters, compute the Absolute Distances between the Medoid, μ, of the current cluster and each one of the Points, pi, in the same cluster (i.e., |pi - μ|). The produced value is termed ADMP. If the ADMP value is greater than a calculated threshold, T, then the point is considered an outlier; otherwise, it is not. The value of T is calculated as the average of all ADMP values of the same cluster multiplied by 1.5. The basic structure of the proposed method is as follows [4]:
Step 1. Perform PAM clustering algorithm to produce a set of k clusters.
Step 2. Determine small clusters and consider the points (objects) that belong to these clusters as outliers.
For the rest of the clusters (not determined in Step 2):
Step 3. For each cluster j, compute the ADMPj and Tj values.

[Flow diagram: assign each remaining object to the cluster with the nearest medoid; randomly select a non-medoid object; compute the total cost of swapping; if total cost < 0, swap the medoid object with the non-medoid object; otherwise display the k clusters; compute the ADMP value; if ADMP > T, report the point as an outlier, otherwise it is not an outlier in its cluster; end.]
Figure 1: System Flow Diagram

4.1 Functionalities of the System

There are four processes in the system.
Process 1: Perform PAM clustering algorithm.
Process 2: Determine small clusters.
Process 3: Compute the ADMP values and T (for the rest of the clusters).
Process 4: If ADMPij > Tj then classify point i as an outlier; otherwise not.

Process 1: In PAM, k partitions of n objects are formed. Initially, k medoids are chosen at random from the set of objects. The medoid representing a cluster is the most centrally located object in the cluster. Each remaining object is clustered with the medoid to which it is most similar, based on the Euclidean distance between the object and the medoid. The strategy then replaces one of the medoids by one of the non-medoids as long as the quality of the resulting clustering improves. This quality is estimated using a cost function that measures the average dissimilarity between an object and the medoid of its cluster.
In the iterative process, a non-medoid object is randomly chosen to replace a current medoid. Each replacement causes some objects to move from one cluster to another. Each time a reassignment occurs, a difference in square error E is contributed to the cost function. The cost function therefore calculates the difference in square error value if a non-medoid object replaces a current medoid. The total cost of swapping is the sum of the costs incurred by all non-medoid objects. If the total cost is negative, then replacing the medoid with the non-medoid object is good, since the actual square error would be reduced. The process is iterated until no further good replacements of medoids are found. In the end, the k medoids are returned [1].
Process 2: A small cluster is defined as a cluster with fewer points than half the average number of points in the k clusters.
Process 3: Compute the Absolute Distances between the Medoid, μ, of the current cluster and each one of the points, pi, in the same cluster (i.e., |pi - μ|). The value of T is calculated as the average of all ADMP values of the same cluster multiplied by 1.5.
Process 4: If the ADMP value is greater than the calculated threshold, T, then the point is considered an outlier; otherwise it is not.

The Iris data consists of three species of Iris: Iris Setosa, Iris Versicolor and Iris Virginica. The Iris dataset is obtained from the UCI Machine Learning Repository. There are fifty plants of each species, with the following four measurements on each plant: sepal length, sepal width, petal length and petal width.

Process 1: Perform PAM clustering algorithm

Table 3.2 Centroids from Iris Dataset
No.   Sepal length   Sepal width   Petal length   Petal width
1     5.4            3.9           1.7            0.4
5     4.9            2.4           3.3            1.0
9     7.6            3.0           6.6            2.1

The distance between two objects is the Euclidean distance, which is defined as
$$d(i,j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \dots + (x_{ip}-x_{jp})^2}$$
For objects 2 and 3 and the three centroids above:
$$d(2,1) = \sqrt{(5.4-4.6)^2 + (3.9-3.4)^2 + (1.7-1.4)^2 + (0.4-0.3)^2} = 0.995$$
$$d(2,5) = \sqrt{(4.9-4.6)^2 + (2.4-3.4)^2 + (3.3-1.4)^2 + (1.0-0.3)^2} = 2.3$$
$$d(2,9) = \sqrt{(7.6-4.6)^2 + (3.0-3.4)^2 + (6.6-1.4)^2 + (2.1-0.3)^2} = 6.3$$
$$d(3,1) = \sqrt{(5.4-4.8)^2 + (3.9-3.4)^2 + (1.7-1.6)^2 + (0.4-0.2)^2} = 0.8$$
$$d(3,5) = \sqrt{(4.9-4.8)^2 + (2.4-3.4)^2 + (3.3-1.6)^2 + (1.0-0.2)^2} \approx 2.13$$
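The Euclidean distance and the swap test used in Process 1 can be checked with a short Python sketch. The coordinates are those of the twelve sample objects as they appear in the worked calculations of this paper. The function names (`euclidean`, `d`, `total_cost`, `swap_cost`) are illustrative only, and the cost is assumed here to be the sum of each object's distance to its nearest medoid, which is one common reading of the PAM cost and not necessarily the exact cost function of the system.

```python
import math

# Sample objects (sepal length, sepal width, petal length, petal width),
# keyed by the object numbers used in the worked example.
OBJECTS = {
    1: (5.4, 3.9, 1.7, 0.4),
    2: (4.6, 3.4, 1.4, 0.3),
    3: (4.8, 3.4, 1.6, 0.2),
    4: (4.3, 3.0, 1.1, 0.1),
    5: (4.9, 2.4, 3.3, 1.0),
    6: (6.6, 2.9, 4.6, 1.3),
    7: (5.0, 2.0, 3.5, 1.0),
    8: (5.9, 3.0, 4.2, 1.5),
    9: (7.6, 3.0, 6.6, 2.1),
    10: (4.9, 2.5, 4.5, 1.7),
    11: (6.5, 3.2, 5.1, 2.0),
    12: (6.4, 2.7, 5.3, 1.9),
}


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def d(i, j):
    """Distance between sample objects i and j."""
    return euclidean(OBJECTS[i], OBJECTS[j])


def total_cost(medoids):
    """Assumed cost: sum over all objects of the distance to the nearest medoid."""
    return sum(min(d(i, m) for m in medoids) for i in OBJECTS)


def swap_cost(medoids, old, new):
    """Change in total cost if medoid `old` is replaced by object `new`."""
    candidate = [new if m == old else m for m in medoids]
    return total_cost(candidate) - total_cost(medoids)
```

For example, `round(d(2, 1), 3)` evaluates to 0.995, and `swap_cost([1, 5, 9], old=1, new=3)` is negative, so under this assumed cost replacing medoid 1 with object 3 would be accepted.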
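The distances from objects 6, 7, 8, 11 and 12 to the three centroids are:
$$d(6,1) = \sqrt{(5.4-6.6)^2 + (3.9-2.9)^2 + (1.7-4.6)^2 + (0.4-1.3)^2} = 3.4$$
$$d(6,5) = \sqrt{(4.9-6.6)^2 + (2.4-2.9)^2 + (3.3-4.6)^2 + (1.0-1.3)^2} = 2.22$$
$$d(6,9) = \sqrt{(7.6-6.6)^2 + (3.0-2.9)^2 + (6.6-4.6)^2 + (2.1-1.3)^2} = 2.38$$
$$d(7,1) = \sqrt{(5.4-5.0)^2 + (3.9-2.0)^2 + (1.7-3.5)^2 + (0.4-1.0)^2} = 2.7$$
$$d(7,5) = \sqrt{(4.9-5.0)^2 + (2.4-2.0)^2 + (3.3-3.5)^2 + (1.0-1.0)^2} = 0.5$$
$$d(7,9) = \sqrt{(7.6-5.0)^2 + (3.0-2.0)^2 + (6.6-3.5)^2 + (2.1-1.0)^2} = 4.3$$
$$d(8,1) = \sqrt{(5.4-5.9)^2 + (3.9-3.0)^2 + (1.7-4.2)^2 + (0.4-1.5)^2} = 2.92$$
$$d(11,5) = \sqrt{(4.9-6.5)^2 + (2.4-3.2)^2 + (3.3-5.1)^2 + (1.0-2.0)^2} = 2.73$$
$$d(11,9) = \sqrt{(7.6-6.5)^2 + (3.0-3.2)^2 + (6.6-5.1)^2 + (2.1-2.0)^2} = 1.9$$
$$d(12,1) = \sqrt{(5.4-6.4)^2 + (3.9-2.7)^2 + (1.7-5.3)^2 + (0.4-1.9)^2} = 4.2$$
$$d(12,5) = \sqrt{(4.9-6.4)^2 + (2.4-2.7)^2 + (3.3-5.3)^2 + (1.0-1.9)^2} = 2.7$$
$$d(12,9) = \sqrt{(7.6-6.4)^2 + (3.0-2.7)^2 + (6.6-5.3)^2 + (2.1-1.9)^2} = 1.81$$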
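Processes 2 to 4, as described in Section 4.1, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the twelve sample objects are assumed to be grouped into three clusters with medoids 3, 7 and 11, ADMP is taken as the sum of absolute coordinate differences between a point and its medoid, and the average used for T is taken over all members of the cluster, medoid included. The names `small_clusters`, `admp` and `detect_outliers` are hypothetical.

```python
# Sample objects (sepal length, sepal width, petal length, petal width).
OBJECTS = {
    1: (5.4, 3.9, 1.7, 0.4),
    2: (4.6, 3.4, 1.4, 0.3),
    3: (4.8, 3.4, 1.6, 0.2),
    4: (4.3, 3.0, 1.1, 0.1),
    5: (4.9, 2.4, 3.3, 1.0),
    6: (6.6, 2.9, 4.6, 1.3),
    7: (5.0, 2.0, 3.5, 1.0),
    8: (5.9, 3.0, 4.2, 1.5),
    9: (7.6, 3.0, 6.6, 2.1),
    10: (4.9, 2.5, 4.5, 1.7),
    11: (6.5, 3.2, 5.1, 2.0),
    12: (6.4, 2.7, 5.3, 1.9),
}

# Assumed clustering: medoid number -> member object numbers (medoid included).
CLUSTERS = {
    3: [1, 2, 3, 4],
    7: [5, 7, 8, 10],
    11: [6, 9, 11, 12],
}


def small_clusters(clusters):
    """Process 2: medoids of clusters with fewer points than half the average size."""
    average = sum(len(members) for members in clusters.values()) / len(clusters)
    return [m for m, members in clusters.items() if len(members) < average / 2]


def admp(point, medoid):
    """Process 3: sum of absolute coordinate differences between point and medoid."""
    return sum(abs(p - q) for p, q in zip(OBJECTS[point], OBJECTS[medoid]))


def detect_outliers(clusters, factor=1.5):
    """Processes 2 to 4: return (sorted outlier numbers, threshold T per medoid)."""
    small = set(small_clusters(clusters))
    outliers, thresholds = [], {}
    for medoid, members in clusters.items():
        if medoid in small:
            # Every point of a small cluster is treated as an outlier.
            outliers.extend(members)
            continue
        values = {p: admp(p, medoid) for p in members}
        # T is the average ADMP of the cluster multiplied by the factor 1.5.
        thresholds[medoid] = factor * sum(values.values()) / len(members)
        outliers.extend(p for p in members if values[p] > thresholds[medoid])
    return sorted(outliers), thresholds
```

With the assumed clusters, `small_clusters(CLUSTERS)` returns an empty list, and `detect_outliers(CLUSTERS)` flags objects 1, 4, 8, 9 and 10, with thresholds of about 1.275, 2.2875 and 2.025 for the clusters with medoids 3, 7 and 11.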
[Figure: the twelve sample objects grouped into three clusters: {1, 2, 3, 4}, {5, 7, 8, 10} and {6, 9, 11, 12}.]
Figure 3.2 Clusters Using PAM algorithm

Process 2: There are no small clusters in this sample dataset.
For Process 3 and Process 4:
ADMP values in the cluster with centroid 3:
$$\mathrm{ADMP}_1 = |5.4-4.8| + |3.9-3.4| + |1.7-1.6| + |0.4-0.2| = 1.4$$
$$\mathrm{ADMP}_2 = |4.6-4.8| + |3.4-3.4| + |1.4-1.6| + |0.3-0.2| = 0.5$$
$$\mathrm{ADMP}_4 = |4.3-4.8| + |3.0-3.4| + |1.1-1.6| + |0.1-0.2| = 1.5$$
The total of the ADMP values is 3.4.
$$T_3 = \frac{3.4 \times 1.5}{4} = 1.275$$
ADMP1 > T3, ADMP4 > T3
So, 1 and 4 are outliers in the cluster with centroid 3.

ADMP values in the cluster with centroid 7:
$$\mathrm{ADMP}_5 = |4.9-5.0| + |2.4-2.0| + |3.3-3.5| + |1.0-1.0| = 0.7$$
$$\mathrm{ADMP}_8 = |5.9-5.0| + |3.0-2.0| + |4.2-3.5| + |1.5-1.0| = 3.1$$
$$\mathrm{ADMP}_{10} = |4.9-5.0| + |2.5-2.0| + |4.5-3.5| + |1.7-1.0| = 2.3$$
The total of the ADMP values is 6.1.
$$T_7 = \frac{6.1 \times 1.5}{4} = 2.2875$$
ADMP8 > T7, ADMP10 > T7
So, 8 and 10 are outliers in the cluster with centroid 7.

ADMP values in the cluster with centroid 11:
$$\mathrm{ADMP}_6 = |6.6-6.5| + |2.9-3.2| + |4.6-5.1| + |1.3-2.0| = 1.6$$
$$\mathrm{ADMP}_9 = |7.6-6.5| + |3.0-3.2| + |6.6-5.1| + |2.1-2.0| = 2.9$$
$$\mathrm{ADMP}_{12} = |6.4-6.5| + |2.7-3.2| + |5.3-5.1| + |1.9-2.0| = 0.9$$
The total of the ADMP values is 5.4.
$$T_{11} = \frac{5.4 \times 1.5}{4} = 2.025$$
ADMP9 > T11
So, 9 is an outlier in the cluster with centroid 11.

5. Conclusion

A small cluster is defined as a cluster with fewer points than half the average number of points in the k clusters. The rest of the outliers are then found (if any) in the remaining clusters by calculating the absolute distances between the medoid of the current cluster and each of the points in the same cluster. PAM works effectively for small data sets, but does not scale well for large data sets. To deal with large data sets, a sampling-based method called CLARA (Clustering LARge Applications) can be used. CLARANS has been experimentally shown to be more effective than both PAM and CLARA. The performance of CLARANS can be further improved by exploring spatial data structures, such as R*-trees, and some focusing techniques.

6. References

[1] D.K. Swami and R.C. Jain, Department of Computer Applications, “Partitioning Around Medoids for Classification”.
[2] Kaufman and Rousseeuw, “Cluster Analysis Part 1”, 1987.
[3] Mahesh Kumar and James B. Orlin, “Scale-invariant Clustering with Minimum Volume Ellipsoids”, 28 September 2005.
[4] Moh’d Belal Al-Zoubi, “An Effective Clustering-Based Approach for Outlier Detection”, Computer Information Systems Department, University of Jordan.
[5] Steven Walfish, “A Review of Statistical Outlier Methods”, [Link].2006.
[6] Svetlana Cherednichenko, “Outlier Detection In Clustering”, University of Joensuu, Department of Computer Science, [Link].2005.
[7] Blake, C. L. & C. J. Merz, 1998. UCI Repository of Machine Learning Databases, [Link] University of California, Irvine, Department of Information and Computer Sciences.