Anomaly Detection: Lecture Notes for Chapter 9, Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar
Natural variation
– Unusually tall people
Data errors
– 200 pound 2 year old
Evaluation
– How do you measure performance?
– Supervised vs. unsupervised situations
Efficiency
Context
– Professional basketball team
Proximity-based
– Anomalies are points far away from other points
– Can detect this graphically in some cases
Density-based
– Low density points are outliers
Pattern matching
– Create profiles or templates of atypical but important
events or objects
– Algorithms to detect these patterns are usually simple
and efficient
Limitations
– Not automatic
– Subjective
[Figure: probability density of a one-dimensional Gaussian and of a two-dimensional Gaussian (axes x and y)]
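As a rough illustration of the Gaussian view of anomalies, the sketch below fits a one-dimensional Gaussian to a sample and flags values whose density is low, measured here by the z-score. The function names, the sample heights, and the threshold are illustrative assumptions, not part of the lecture notes.

```python
import math
import statistics


def gaussian_density(x, mean, std):
    """Probability density of a 1-D Gaussian with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def gaussian_outliers(values, threshold=3.0):
    """Return (value, z_score) pairs whose |z-score| exceeds the threshold.

    The mean and standard deviation are estimated from the same sample,
    so an extreme value inflates both (an assumption of this simple sketch).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical: nothing can be flagged
    return [(v, (v - mean) / std) for v in values if abs(v - mean) / std > threshold]


# Hypothetical heights in cm; the last value is unusually large.
heights = [170, 172, 168, 171, 169, 173, 167, 170, 171, 169, 230]
for value, z in gaussian_outliers(heights, threshold=2.5):
    print(f"{value} looks anomalous (z = {z:.2f})")
```

Because the estimates are computed from contaminated data, a small sample limits how large a z-score can get, which is why the example uses a threshold below 3.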
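Data distribution, D = (1 – λ) M + λ A, where λ is the mixing weight of the anomalous distribution A
M is a probability distribution estimated from data
– Can be based on any modeling method (naïve Bayes,
maximum entropy, etc.)
A is initially assumed to be a uniform distribution
Likelihood at time t, where M_t and A_t are the sets of objects assigned to M and A at time t:
$$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$$
$$LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$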
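The log-likelihood above can be evaluated directly once M and A are chosen. The sketch below assumes a one-dimensional Gaussian for M (refit to the current normal set) and a uniform distribution on a known interval [lo, hi] for A; the function name, the data, and the mixing weight of 0.1 are hypothetical. Comparing the value before and after moving a point from the normal set to the anomalous set is one way such a quantity might be used.

```python
import math
import statistics


def log_likelihood(normal, anomalies, lam, lo, hi):
    """Log-likelihood of a split of the data into a normal and an anomalous set.

    Assumptions of this sketch: M is a 1-D Gaussian fit to `normal`
    (which must have nonzero spread), A is uniform on [lo, hi],
    and `lam` is the mixing weight of A.
    """
    mean = statistics.fmean(normal)
    std = statistics.pstdev(normal)
    # |M_t| log(1 - lam) + sum of log P_M(x_i) over the normal set
    ll = len(normal) * math.log(1 - lam)
    for x in normal:
        z = (x - mean) / std
        ll += -0.5 * z * z - math.log(std * math.sqrt(2 * math.pi))
    # |A_t| log(lam) + sum of log P_A(x_i) over the anomalous set
    if anomalies:
        ll += len(anomalies) * math.log(lam)
        ll += len(anomalies) * math.log(1.0 / (hi - lo))
    return ll


data = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 8.0]
before = log_likelihood(data, [], lam=0.1, lo=0.0, hi=10.0)
after = log_likelihood(data[:-1], [data[-1]], lam=0.1, lo=0.0, hi=10.0)
print(f"gain from treating 8.0 as an anomaly: {after - before:.2f}")
```

A large positive gain suggests the moved point is better explained by A than by M.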
[Figure: data points shaded by outlier score; point D is labeled]
06/06/2022 Introduction to Data Mining, 2nd Edition 20
One Nearest Neighbor - Two Outliers
[Figure: data points shaded by outlier score; point D is labeled]
Five Nearest Neighbors - Small Cluster
[Figure: data points shaded by outlier score; point D is labeled]
Five Nearest Neighbors - Differing Density
[Figure: data points shaded by outlier score; point D is labeled]
Strengths/Weaknesses of Distance-Based Approaches
Simple
Expensive – O(n^2)
Sensitive to parameters
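A minimal sketch of a distance-based score, assuming the outlier score of a point is the distance to its k-th nearest neighbor. The brute-force loop makes the O(n^2) cost visible, and the example data (hypothetical) shows the sensitivity to k: two outliers that sit close together hide each other when k = 1.

```python
import math


def knn_outlier_scores(points, k=1):
    """Score each point by the distance to its k-th nearest neighbor.

    Brute force: every point is compared with every other point,
    so the running time grows as O(n^2).
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# A small square of normal points plus two nearby outliers (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 5.1)]
print(knn_outlier_scores(points, k=1))  # the two outliers get low scores
print(knn_outlier_scores(points, k=2))  # the two outliers get the highest scores
```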
[Figure: data points shaded by outlier score; labeled points C, D, and A, with annotated scores 6.85, 1.40, and 1.33]
Density-based: LOF approach
In the NN approach, p2 is not considered an outlier,
while the LOF approach finds both p1 and p2 to be outliers.
[Figure: data set with labeled points p1 and p2]
Simple
Expensive – O(n^2)
Sensitive to parameters
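The sketch below illustrates the idea behind a density-based score with a simplified relative density, not the full LOF definition (it omits reachability distances). It assumes density is the inverse of the average distance to the k nearest neighbors, and the score is the neighbors' mean density divided by the point's own density. Function names and data are hypothetical, and the points are assumed to be distinct.

```python
import math


def k_nearest(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbors of points[i]."""
    dists = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    return dists[:k]


def relative_density_scores(points, k=2):
    """Simplified relative-density outlier score (higher means more anomalous).

    Scores near 1 mean a point is about as dense as its neighbors;
    scores well above 1 mean it sits in a much sparser spot than they do.
    """
    neighbors = [k_nearest(points, i, k) for i in range(len(points))]
    # Density: inverse of the average distance to the k nearest neighbors.
    density = [k / sum(d for d, _ in nb) for nb in neighbors]
    scores = []
    for i, nb in enumerate(neighbors):
        mean_neighbor_density = sum(density[j] for _, j in nb) / k
        scores.append(mean_neighbor_density / density[i])
    return scores


dense = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1)]
sparse = [(10, 10), (12, 10), (10, 12), (12, 12)]
p2 = (1, 1)  # close to the dense cluster in absolute terms
scores = relative_density_scores(dense + sparse + [p2], k=2)
print(scores)
```

In this example the distance from p2 to its second nearest neighbor is smaller than the spacing inside the sparse cluster, so a pure nearest-neighbor distance would rank p2 below the sparse points, while the relative score singles it out.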
Clustering-based Outlier: An
object is a cluster-based outlier if
it does not strongly belong to any
cluster
– For prototype-based clusters, an
object is an outlier if it is not close
enough to a cluster center
– For density-based clusters, an object
is an outlier if its density is too low
– For graph-based clusters, an object
is an outlier if it is not well connected
Other issues include the impact of
outliers on the clusters and the
number of clusters
Distance of Points from Closest Centroids
[Figure: data points shaded by outlier score (distance from closest centroid); labeled points C, D, and A, with annotated values 4.6, 0.17, and 1.2]
Relative Distance of Points from Closest Centroid
[Figure: data points shaded by outlier score (relative distance from closest centroid)]
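To contrast the two centroid-based scores, the sketch below computes both the distance of each point from its closest centroid and a relative distance. It assumes the centroids are already known (for example from a prior clustering run) and that "relative" means dividing by the median distance of the points assigned to the same centroid; the function name and data are hypothetical.

```python
import math
import statistics


def centroid_scores(points, centroids):
    """Return (absolute, relative) outlier scores for each point.

    absolute: distance to the closest centroid.
    relative: that distance divided by the median distance of all points
    assigned to the same centroid (assumed to be nonzero here).
    """
    assigned = []
    for p in points:
        dist, idx = min((math.dist(p, c), i) for i, c in enumerate(centroids))
        assigned.append((dist, idx))
    median = {}
    for idx in range(len(centroids)):
        dists = [d for d, c in assigned if c == idx]
        if dists:
            median[idx] = statistics.median(dists)
    absolute = [d for d, _ in assigned]
    relative = [d / median[c] for d, c in assigned]
    return absolute, relative


centroids = [(0, 0), (10, 10)]
tight = [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1), (0.5, 0)]
loose = [(12, 10), (8, 10), (10, 12), (10, 8), (10, 13)]
absolute, relative = centroid_scores(tight + loose, centroids)
print(absolute)
print(relative)
```

With clusters of differing spread, the absolute score favors points on the edge of the loose cluster, while the relative score highlights the point that is unusual for its own tight cluster.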
Strengths/Weaknesses of Clustering-Based Approaches
Simple