Anomaly Detection: Lecture Notes for Chapter 9, Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar

The document provides an overview of anomaly detection. It defines anomalies as data points that are considerably different from most of the data. Although anomalies are rare, they can be important to identify. The document discusses causes of anomalies, the distinction between anomalies and noise, general issues such as the number of attributes and anomaly scoring, and modeling approaches including statistical, distance-based, density-based, and clustering-based methods. Visual approaches such as boxplots and statistical tests such as Grubbs' test are also covered.

Uploaded by Andrean Sergio

Anomaly Detection

Lecture Notes for Chapter 9

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar

06/06/2022 Introduction to Data Mining, 2nd Edition 1


Anomaly/Outlier Detection

• What are anomalies/outliers?
– The set of data points that are considerably different than the remainder of the data

• Natural implication is that anomalies are relatively rare
– One in a thousand occurs often if you have lots of data
– Context is important, e.g., freezing temps in July

• Can be important or a nuisance
– 10 foot tall 2 year old
– Unusually high blood pressure

Importance of Anomaly Detection

Ozone Depletion History

• In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels

• Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?

• The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!

Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html


Causes of Anomalies

• Data from different classes
– Measuring the weights of oranges, but a few grapefruit are mixed in

• Natural variation
– Unusually tall people

• Data errors
– 200 pound 2 year old


Distinction Between Noise and Anomalies

• Noise is erroneous, perhaps random, values or contaminating objects
– Weight recorded incorrectly
– Grapefruit mixed in with the oranges

• Noise doesn’t necessarily produce unusual values or objects
• Noise is not interesting
• Anomalies may be interesting if they are not a result of noise
• Noise and anomalies are related but distinct concepts


General Issues: Number of Attributes

• Many anomalies are defined in terms of a single attribute
– Height
– Shape
– Color

• Can be hard to find an anomaly using all attributes
– Noisy or irrelevant attributes
– Object is only anomalous with respect to some attributes

• However, an object may not be anomalous in any one attribute


General Issues: Anomaly Scoring

• Many anomaly detection techniques provide only a binary categorization
– An object is an anomaly or it isn’t
– This is especially true of classification-based approaches

• Other approaches assign a score to all points
– This score measures the degree to which an object is an anomaly
– This allows objects to be ranked

• In the end, you often need a binary decision
– Should this credit card transaction be flagged?
– Still useful to have a score

• How many anomalies are there?

Other Issues for Anomaly Detection

• Find all anomalies at once or one at a time
– Swamping (normal objects wrongly flagged as anomalies)
– Masking (anomalies hidden by the presence of other anomalies)

• Evaluation
– How do you measure performance?
– Supervised vs. unsupervised situations

• Efficiency

• Context
– Professional basketball team (unusual height is not anomalous there)


Variants of Anomaly Detection Problems

• Given a data set D, find all data points x ∈ D with anomaly scores greater than some threshold t

• Given a data set D, find all data points x ∈ D having the top-n largest anomaly scores

• Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
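
The first two variants differ only in how a list of anomaly scores is turned into a decision. The following is a minimal sketch, assuming the scores have already been computed by some detector; the function names and the example scores are hypothetical and not from the slides.

```python
from typing import Dict, List


def anomalies_above_threshold(scores: Dict[str, float], t: float) -> List[str]:
    """Variant 1: ids of all points whose anomaly score exceeds threshold t."""
    return [pid for pid, s in scores.items() if s > t]


def top_n_anomalies(scores: Dict[str, float], n: int) -> List[str]:
    """Variant 2: ids of the n points with the largest anomaly scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]


# Hypothetical anomaly scores, for illustration only.
scores = {"a": 0.1, "b": 0.9, "c": 0.4, "d": 0.7}
flagged = anomalies_above_threshold(scores, t=0.5)
top2 = top_n_anomalies(scores, n=2)
print(flagged, top2)
```

The threshold form needs a meaningful scale for the scores, while the top-n form only needs a ranking.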


Model-Based Anomaly Detection

• Build a model for the data and see
– Unsupervised
  • Anomalies are those points that don’t fit well
  • Anomalies are those points that distort the model
  • Examples:
    – Statistical distribution
    – Clusters
    – Regression
    – Geometric
    – Graph
– Supervised
  • Anomalies are regarded as a rare class
  • Need to have training data


Additional Anomaly Detection Techniques

• Proximity-based
– Anomalies are points far away from other points
– Can detect this graphically in some cases
• Density-based
– Low density points are outliers
• Pattern matching
– Create profiles or templates of atypical but important events or objects
– Algorithms to detect these patterns are usually simple and efficient


Visual Approaches

• Boxplots or scatter plots

• Limitations
– Not automatic
– Subjective
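
A boxplot can be made less subjective by computing its whisker limits directly. The sketch below assumes the common convention of flagging values more than 1.5 times the interquartile range beyond the quartiles; that multiplier, the function name, and the sample data are assumptions for illustration and are not stated in the slides.

```python
from statistics import quantiles


def boxplot_outliers(data, whisker=1.5):
    """Return values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _, q3 = quantiles(data, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [x for x in data if x < low or x > high]


# Hypothetical measurements with one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 30]
outliers = boxplot_outliers(data)
print(outliers)
```

The rule is automatic, but the choice of the whisker multiplier remains a judgment call.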


Statistical Approaches

Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.
• Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Apply a statistical test that depends on
– Data distribution
– Parameters of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
• Issues
– Identifying the distribution of a data set
  • Heavy-tailed distribution
– Number of attributes
– Is the data a mixture of distributions?

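
The probabilistic definition can be illustrated for a single attribute. This is a minimal sketch, assuming the data follow a normal distribution whose mean and standard deviation are estimated from the data itself; the cutoff `alpha`, the function name, and the sample values are hypothetical.

```python
from statistics import NormalDist


def low_probability_points(data, alpha=0.01):
    """Flag values whose two-sided tail probability under a fitted normal is below alpha."""
    model = NormalDist.from_samples(data)  # estimates mean and sample standard deviation
    standard = NormalDist()                # standard normal, used for tail probabilities
    flagged = []
    for x in data:
        z = abs(x - model.mean) / model.stdev
        tail = 2.0 * (1.0 - standard.cdf(z))  # P(|Z| >= z)
        if tail < alpha:
            flagged.append(x)
    return flagged


# Hypothetical readings clustered near 50, with one value at 75.
data = [50, 51, 49, 52, 48, 50, 51, 49, 50, 52, 48, 51, 49, 50, 50, 75]
flagged = low_probability_points(data)
print(flagged)
```

Note that the extreme value is itself included when the mean and standard deviation are estimated, so it inflates the spread and makes itself look less extreme than it is.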
Normal Distributions

[Figure: a one-dimensional Gaussian and a two-dimensional Gaussian; the two-dimensional plot shows probability density over the x and y axes]


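Grubbs’ Test

• Detect outliers in univariate data
• Assume data comes from normal distribution
• Detects one outlier at a time, remove the outlier, and repeat
– H0: There is no outlier in data
– HA: There is at least one outlier
• Grubbs’ test statistic:
\[ G = \frac{\max |X - \bar{X}|}{s} \]
• Reject H0 if:
\[ G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}} \]
where \bar{X} is the sample mean, s is the sample standard deviation, N is the number of data points, and t_{(\alpha/N, N-2)} is the critical value of the t distribution with N - 2 degrees of freedom at significance level \alpha/N
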
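
One round of Grubbs' test can be sketched as follows. To stay within the Python standard library, the critical t value is passed in by the caller rather than computed; the value 3.355 used below is the upper 0.005 critical value of the t distribution with 8 degrees of freedom (alpha = 0.05, N = 10, using the alpha/N level from the slide). The function names and the sample data are hypothetical.

```python
import math
from statistics import mean, stdev


def grubbs_statistic(xs):
    """Return (G, suspect): the test statistic and the value farthest from the mean."""
    m, s = mean(xs), stdev(xs)  # stdev is the sample standard deviation
    suspect = max(xs, key=lambda x: abs(x - m))
    return abs(suspect - m) / s, suspect


def grubbs_threshold(n, t_crit):
    """Rejection threshold for G, given the critical t value for (alpha/N, N-2)."""
    t2 = t_crit ** 2
    return (n - 1) / math.sqrt(n) * math.sqrt(t2 / (n - 2 + t2))


# Hypothetical measurements; t_crit is taken from a t table (assumption, see above).
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 10.1, 10.0, 14.0]
g, suspect = grubbs_statistic(data)
threshold = grubbs_threshold(len(data), t_crit=3.355)
reject_h0 = g > threshold
print(suspect, round(g, 3), round(threshold, 3), reject_h0)
```

To detect further outliers, the suspect value would be removed and the test repeated on the remaining data with the critical value for the new N.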
Statistical-based – Likelihood Approach

• Assume the data set D contains samples from a mixture of two probability distributions:
– M (majority distribution)
– A (anomalous distribution)
• General Approach:
– Initially, assume all the data points belong to M
– Let L_t(D) be the log likelihood of D at time t
– For each point x_t that belongs to M, move it to A
  • Let L_{t+1}(D) be the new log likelihood.
  • Compute the difference, Δ = L_{t+1}(D) – L_t(D), the gain in log likelihood from the move
  • If Δ > c (some threshold), then x_t is declared as an anomaly and moved permanently from M to A

Statistical-based – Likelihood Approach

• Data distribution, D = (1 – λ) M + λ A
• M is a probability distribution estimated from data
– Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
• A is initially assumed to be uniform distribution
• Likelihood at time t:
\[ L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \Bigl( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \Bigr) \Bigl( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \Bigr) \]
\[ LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log\lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i) \]
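
The likelihood approach can be sketched for one-dimensional data. The sketch below makes several assumptions that are not in the slides: M is modeled as a Gaussian refit by maximum likelihood each time a point is moved, A is a uniform density over the observed data range, and the mixing weight `lam` and threshold `c` are arbitrary illustrative values. It measures the change as the new log likelihood minus the old one, so a large positive value means the data are explained better with the point treated as anomalous.

```python
import math
from statistics import mean, pstdev


def log_likelihood(m_pts, a_pts, lam, p_uniform):
    """LL_t(D): Gaussian model for M (refit on m_pts), uniform density for A."""
    mu, sigma = mean(m_pts), pstdev(m_pts)  # assumes m_pts are not all identical
    ll = len(m_pts) * math.log(1.0 - lam)
    for x in m_pts:
        ll += -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)
    if a_pts:
        ll += len(a_pts) * (math.log(lam) + math.log(p_uniform))
    return ll


def likelihood_anomalies(data, lam=0.05, c=2.0):
    """Move a point from M to A permanently when the log likelihood gain exceeds c."""
    p_uniform = 1.0 / (max(data) - min(data))  # uniform density over the data range
    m_pts, a_pts = list(data), []
    for x in data:
        current = log_likelihood(m_pts, a_pts, lam, p_uniform)
        trial_m = list(m_pts)
        trial_m.remove(x)          # tentatively move x out of M
        trial_a = a_pts + [x]      # and into A
        gain = log_likelihood(trial_m, trial_a, lam, p_uniform) - current
        if gain > c:
            m_pts, a_pts = trial_m, trial_a
    return a_pts


# Hypothetical data: values near 10 with a single value at 30.
anomalies = likelihood_anomalies([10, 11, 9, 10, 30, 10, 11, 9])
print(anomalies)
```

Moving an ordinary point lowers the log likelihood, because the point is described worse by the uniform density and pays the log(lam) penalty, while moving the extreme point lets the Gaussian for M tighten around the remaining values.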


Strengths/Weaknesses of Statistical Approaches

• Firm mathematical foundation

• Can be very efficient

• Good results if distribution is known

• In many cases, data distribution may not be known

• For high dimensional data, it may be difficult to estimate the true distribution

• Anomalies can distort the parameters of the distribution


Distance-Based Approaches

• Several different techniques

• An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998)
– Some statistical definitions are special cases of this

• The outlier score of an object is the distance to its kth nearest neighbor
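
Both distance-based definitions can be sketched with a brute-force search. This is an illustrative sketch assuming Euclidean distance and small data; the function names, the sample points, and the parameter values are hypothetical.

```python
import math


def knn_outlier_scores(points, k):
    """Outlier score of each point: distance to its kth nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def is_fraction_distance_outlier(points, i, fraction, distance):
    """True if at least `fraction` of the other points lie more than `distance` away."""
    others = [q for j, q in enumerate(points) if j != i]
    far = sum(1 for q in others if math.dist(points[i], q) > distance)
    return far / len(others) >= fraction


# Hypothetical data: a tight square of four points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = knn_outlier_scores(points, k=1)
print(scores)
print(is_fraction_distance_outlier(points, 4, fraction=0.9, distance=3.0))
```

Comparing every point with every other point is what makes the cost grow quadratically with the number of points.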


One Nearest Neighbor - One Outlier

[Figure: scatter plot with points colored by outlier score (distance to the nearest neighbor); point D is labeled]

One Nearest Neighbor - Two Outliers

[Figure: scatter plot with points colored by outlier score (distance to the nearest neighbor); point D is labeled]

Five Nearest Neighbors - Small Cluster

[Figure: scatter plot with points colored by outlier score (distance to the fifth nearest neighbor); point D is labeled]

Five Nearest Neighbors - Differing Density

[Figure: scatter plot with points colored by outlier score (distance to the fifth nearest neighbor); point D is labeled]

Strengths/Weaknesses of Distance-Based Approaches

• Simple

• Expensive – O(n^2)

• Sensitive to parameters

• Sensitive to variations in density

• Distance becomes less meaningful in high-dimensional space


Density-Based Approaches

• Density-based Outlier: The outlier score of an object is the inverse of the density around the object.
– Density can be defined in terms of the k nearest neighbors
– One definition of density: inverse of the distance to the kth nearest neighbor
– Another definition of density: inverse of the average distance to the k nearest neighbors
– DBSCAN definition (number of objects within a given radius)

• If there are regions of different density, this approach can have problems

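
The two nearest-neighbor density definitions can be sketched as follows. The sketch assumes Euclidean distance and no duplicate points (a zero distance would cause a division by zero); the function names and sample points are hypothetical.

```python
import math


def knn_distances(points, i, k):
    """Sorted distances from points[i] to its k nearest neighbors."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[:k]


def density_kth(points, i, k):
    """Density as the inverse of the distance to the kth nearest neighbor."""
    return 1.0 / knn_distances(points, i, k)[-1]


def density_avg(points, i, k):
    """Density as the inverse of the average distance to the k nearest neighbors."""
    d = knn_distances(points, i, k)
    return 1.0 / (sum(d) / len(d))


def outlier_score(points, i, k, density=density_avg):
    """Outlier score as the inverse of the chosen density."""
    return 1.0 / density(points, i, k)


# Hypothetical data: a tight square of four points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = [outlier_score(points, i, k=2) for i in range(len(points))]
print(scores)
```

With these definitions the density-based score is just a nearest-neighbor distance, which is why it shares the distance-based weakness when regions differ in density.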
Relative Density

• Consider the density of a point relative to that of its k nearest neighbors
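
A relative density score can be sketched by comparing a point's density with the average density of its neighbors. The choices below are assumptions for illustration: density is the inverse of the average distance to the k nearest neighbors, and the score is the neighbors' average density divided by the point's own density, so that larger values mean more anomalous. The function names and sample points are hypothetical.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of points[i]."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def density(points, i, k):
    """Inverse of the average distance to the k nearest neighbors (no duplicates assumed)."""
    nbrs = k_nearest(points, i, k)
    return len(nbrs) / sum(math.dist(points[i], points[j]) for j in nbrs)


def relative_density_score(points, i, k):
    """Average density of the neighbors divided by the density of the point itself."""
    nbrs = k_nearest(points, i, k)
    neighbor_avg = sum(density(points, j, k) for j in nbrs) / len(nbrs)
    return neighbor_avg / density(points, i, k)


# Hypothetical data: a tight square of four points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
score_inside = relative_density_score(points, 0, k=2)
score_far = relative_density_score(points, 4, k=2)
print(score_inside, score_far)
```

A point inside a uniformly dense region scores close to 1 regardless of how dense that region is, which is what makes the measure usable when densities differ across the data.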


Relative Density Outlier Scores

[Figure: scatter plot with points colored by relative density outlier score; labeled points C (6.85), D (1.40), and A (1.33)]

Density-based: LOF approach

• For each point, compute the density of its local neighborhood
• Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
• Outliers are points with the largest LOF value

In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers

[Figure: scatter plot with points p1 and p2 labeled]
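
The slide describes LOF only in words. For reference, a commonly used formulation is written out below; it is an assumption here, since the slides do not give the formula, and the symbols N_k(p) (the k nearest neighbors of p), reach-dist_k, lrd_k (local reachability density), and d (the distance function) are not from the slides.

```latex
\[
\text{reach-dist}_k(p, o) = \max\bigl( k\text{-distance}(o),\; d(p, o) \bigr)
\]
\[
\mathrm{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1}
\]
\[
\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}
\]
```

Under this formulation, a value near 1 means p is about as dense as its neighbors, and a value well above 1 means p lies in a sparser region than its neighbors.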


Strengths/Weaknesses of Density-Based Approaches

• Simple

• Expensive – O(n^2)

• Sensitive to parameters

• Density becomes less meaningful in high-dimensional space


Clustering-Based Approaches

• Clustering-based Outlier: An object is a cluster-based outlier if it does not strongly belong to any cluster
– For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center
– For density-based clusters, an object is an outlier if its density is too low
– For graph-based clusters, an object is an outlier if it is not well connected
• Other issues include the impact of outliers on the clusters and the number of clusters

Distance of Points from Closest Centroids

[Figure: scatter plot with points colored by outlier score (distance from the closest centroid); labeled points C (4.6), D (0.17), and A (1.2)]

Relative Distance of Points from Closest Centroid

[Figure: scatter plot with points colored by outlier score (relative distance from the closest centroid)]

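
The two centroid-based scores can be sketched as follows. The sketch assumes the centroids are already available (for example from K-means) and takes the relative distance to be a point's distance to its closest centroid divided by the median distance of all points assigned to that centroid; the slides do not spell out this normalization, so it is an assumption. The function names, centroids, and points are hypothetical.

```python
import math
from statistics import median


def closest_centroid(p, centroids):
    """Index of the centroid nearest to point p."""
    return min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))


def centroid_outlier_scores(points, centroids):
    """Return (distances, relative distances) of each point to its closest centroid."""
    assign = [closest_centroid(p, centroids) for p in points]
    dist = [math.dist(p, centroids[a]) for p, a in zip(points, assign)]
    # Median distance within each cluster; assumes it is non-zero.
    med = {c: median(d for d, a in zip(dist, assign) if a == c) for c in set(assign)}
    rel = [d / med[a] for d, a in zip(dist, assign)]
    return dist, rel


# Hypothetical centroids: a compact cluster near (0, 0) and a looser one near (10, 10).
centroids = [(0, 0), (10, 10)]
points = [(0, 1), (1, 0), (0, -1), (3, 0), (10, 12), (12, 10), (10, 8), (10, 16)]
dist, rel = centroid_outlier_scores(points, centroids)
print(dist)
print(rel)
```

In this example the raw distance ranks (10, 16) well above (3, 0), while the relative distance scores both equally, because each is three times farther from its centroid than the typical point in its own cluster.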
Strengths/Weaknesses of Clustering-Based Approaches

• Simple

• Many clustering techniques can be used

• Can be difficult to decide on a clustering technique

• Can be difficult to decide on number of clusters

• Outliers can distort the clusters
