Anomaly Detection
Adapted from slides by Jing Gao, SUNY Buffalo
Definition of Anomalies
• An anomaly is a pattern in the data that does not
conform to the expected behavior
• Also referred to as outliers, exceptions,
peculiarities, surprises, etc.
• Anomalies translate to significant (often critical)
real life entities
– Cyber intrusions
– Credit card fraud
Real World Anomalies
• Credit Card Fraud
– An abnormally high purchase
made on a credit card
• Cyber Intrusions
– Computer virus spread over
Internet
Simple Example
[Figure: 2-D scatter plot with two dense regions N1 and N2, two
isolated points o1 and o2, and a small cluster O3]
• N1 and N2 are regions of normal behavior
• Points o1 and o2 are anomalies
• Points in region O3 are anomalies
Related problems
• Rare Class Mining
• Chance Discovery
• Novelty Detection
• Exception Mining
• Noise Removal
Key Challenges
• Defining a representative normal region is challenging
• The boundary between normal and outlying behavior is
often not precise
• The exact notion of an outlier is different for different
application domains
• Limited availability of labeled data for
training/validation
• Malicious adversaries
• Data might contain noise
• Normal behavior keeps evolving
Aspects of Anomaly Detection Problem
• Nature of input data
• Availability of supervision
• Type of anomaly: point, contextual, collective
• Output of anomaly detection
• Evaluation of anomaly detection techniques
Data Labels
• Supervised Anomaly Detection
– Labels available for both normal data and anomalies
– Similar to skewed (imbalanced) classification
• Semi-supervised Anomaly Detection
– Limited amount of labeled data
– Combine supervised and unsupervised techniques
• Unsupervised Anomaly Detection
– No labels assumed
– Based on the assumption that anomalies are very rare
compared to normal data
Type of Anomalies
• Point Anomalies
• Contextual Anomalies
• Collective Anomalies
Point Anomalies
• An individual data instance is anomalous w.r.t.
the data
[Figure: the same 2-D scatter plot; o1, o2, and the points in O3 are
anomalous with respect to the rest of the data]
Contextual Anomalies
• An individual data instance is anomalous within a context
• Requires a notion of context
• Also referred to as conditional anomalies
[Figure: time series in which one point is labeled anomalous in its
context although the same value is normal elsewhere]
Collective Anomalies
• A collection of related data instances is anomalous
• Requires a relationship among data instances
– Sequential Data
– Spatial Data
– Graph Data
• The individual instances within a collective anomaly are not
anomalous by themselves
[Figure: time series with an anomalous subsequence highlighted]
Output of Anomaly Detection
• Label
– Each test instance is given a normal or anomaly label
– This is especially true of classification-based
approaches
• Score
– Each test instance is assigned an anomaly score
• Allows the output to be ranked
• Requires an additional threshold parameter
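The score-based output above can be sketched in a few lines; the scores and the 0.5 cutoff here are purely illustrative:

```python
# Hypothetical anomaly scores for four test instances.
scores = {"a": 0.95, "b": 0.40, "c": 0.87, "d": 0.10}

# Score output: rank instances from most to least anomalous.
ranked = sorted(scores, key=scores.get, reverse=True)

# Label output: requires the additional threshold parameter.
threshold = 0.5
labels = {x: ("anomaly" if s > threshold else "normal")
          for x, s in scores.items()}
```

The ranking needs no threshold; turning scores into normal/anomaly labels does.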
Metrics for Performance Evaluation
                 PREDICTED CLASS
                   +        -
 ACTUAL     +   a (TP)   b (FN)
 CLASS      -   c (FP)   d (TN)

• Measure used in classification:

  Accuracy = (a + d) / (a + b + c + d)
           = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy
• Anomaly detection
– Number of negative examples = 9990
– Number of positive examples = 10
• If model predicts everything to be class 0,
accuracy is 9990/10000 = 99.9 %
– Accuracy is misleading because model does not
detect any positive examples
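The arithmetic above can be checked directly; this sketch uses a dummy prediction vector standing in for a model that always predicts class 0:

```python
# 9990 negatives, 10 positives, and a trivial "always class 0" model.
n_neg, n_pos = 9990, 10
y_true = [0] * n_neg + [1] * n_pos
y_pred = [0] * (n_neg + n_pos)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / n_pos

print(accuracy)  # 0.999 -- looks excellent
print(recall)    # 0.0   -- yet not a single anomaly is detected
```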
Cost Matrix
                 PREDICTED CLASS
 C(i|j)           +         -
 ACTUAL     +   C(+|+)    C(-|+)
 CLASS      -   C(+|-)    C(-|-)

C(i|j): Cost of misclassifying a class j example as class i
Computing Cost of Classification
Cost matrix:
                 PREDICTED CLASS
 C(i|j)           +       -
 ACTUAL     +    -1      100
 CLASS      -     1        0

Model M1:                      Model M2:
       PREDICTED CLASS                PREDICTED CLASS
         +      -                       +      -
ACTUAL + 150    40             ACTUAL + 250    45
CLASS  -  60   250             CLASS  -   5   200

Accuracy = 80%                 Accuracy = 90%
Cost = 3910                    Cost = 4255
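The costs in the example can be recomputed by summing count × C(i|j) over all four cells:

```python
# Cost matrix from the slide, keyed by (actual, predicted).
cost = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}

def total_cost(counts):
    """counts maps (actual, predicted) -> number of instances."""
    return sum(n * cost[cell] for cell, n in counts.items())

m1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
m2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5, ("-", "-"): 200}

print(total_cost(m1))  # 3910 -- lower cost despite lower accuracy
print(total_cost(m2))  # 4255
```

M1 wins under this cost matrix because missed positives (cost 100) dominate, and M1 misses fewer of them.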
Cost-Sensitive Measures
• Precision (p) = a / (a + c)

• Recall (r) = a / (a + b)

• F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

• Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
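The measures above, computed from confusion-matrix cells a (TP), b (FN), c (FP), d (TN); the counts reuse model M1 from the cost example:

```python
# Confusion-matrix cells for model M1: a=TP, b=FN, c=FP, d=TN.
a, b, c, d = 150, 40, 60, 250

precision = a / (a + c)
recall = a / (a + b)
f_measure = 2 * recall * precision / (recall + precision)

# The two F-measure formulas agree: 2rp/(r+p) == 2a/(2a+b+c).
assert abs(f_measure - 2 * a / (2 * a + b + c)) < 1e-12
```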
ROC (Receiver Operating Characteristic)
• ROC curve plots TPR (Recall) on the y-axis
against FPR (FP/#N) on the x-axis
• Performance of each classifier represented as
a point on the ROC curve
– changing the algorithm's threshold, sample
distribution, or cost matrix changes the location of
the point
ROC Curve
- 1-dimensional data set containing 2 classes (positive and negative)
- any point located at x > t is classified as positive
At threshold t:
TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88
ROC Curve
(TPR,FPR):
• (0,0): declare everything
to be negative class
• (1,1): declare everything
to be positive class
• (1,0): ideal
• Diagonal line:
– Random guessing
– Below diagonal line:
• prediction is opposite of the
true class
Using ROC for Model Comparison
• Comparing two models:
– M1 is better for small FPR
– M2 is better for large FPR
• Area under the ROC curve (AUC)
– Ideal: Area = 1
– Random guess: Area = 0.5
How to Construct an ROC curve
• Calculate the outlier scores of the given instances
• Sort the instances according to the scores in decreasing order
• Apply a threshold at each unique value of the score
• Count the number of TP, FP, TN, FN at each threshold
– TP rate, TPR = TP / (TP + FN)
– FP rate, FPR = FP / (FP + TN)

Instance   Score   Label
    1      0.95      +
    2      0.93      +
    3      0.87      -
    4      0.85      -
    5      0.85      -
    6      0.85      +
    7      0.76      -
    8      0.53      +
    9      0.43      -
   10      0.25      +
How to construct an ROC curve

Class          +     -     +     -     -     -     +     -     +     +
Score        0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95

Threshold >= 0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1    0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2    0
FPR            1     1    0.8   0.8   0.6   0.4   0.2   0.2    0     0     0

[ROC curve plotted from the (FPR, TPR) points above]

Area under the ROC curve = probability that a randomly sampled positive
example will score higher than a randomly sampled negative example
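The walkthrough above can be reproduced in a short sketch: sweep a threshold over the unique scores, count TP/FP to get the (FPR, TPR) points, and check the probabilistic reading of AUC (the fraction of positive-negative pairs ranked correctly, with ties counting half):

```python
# The ten scored instances from the slide, as (score, class) pairs.
data = [(0.95, "+"), (0.93, "+"), (0.87, "-"), (0.85, "-"), (0.85, "-"),
        (0.85, "+"), (0.76, "-"), (0.53, "+"), (0.43, "-"), (0.25, "+")]

P = sum(1 for _, c in data if c == "+")
N = len(data) - P

# One (FPR, TPR) point per unique threshold (score >= t means positive).
roc = []
for t in sorted({s for s, _ in data} | {1.00}):
    tp = sum(1 for s, c in data if s >= t and c == "+")
    fp = sum(1 for s, c in data if s >= t and c == "-")
    roc.append((fp / N, tp / P))

# AUC as a pairwise ranking probability (ties count 0.5).
pos = [s for s, c in data if c == "+"]
neg = [s for s, c in data if c == "-"]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))  # 0.56 for this data set
```

Note that tied scores (the three 0.85s) collapse into a single threshold here, whereas the slide's table lists them column by column.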
Applications of Anomaly Detection
• Network intrusion detection
• Insurance / Credit card fraud detection
• Healthcare Informatics / Medical diagnostics
• Image Processing / Video surveillance
• …
Anomaly Detection Schemes
• General Steps
– Build a profile of the “normal” behavior
• Profile can be patterns or summary statistics for the overall population
– Use the “normal” profile to detect anomalies
• Anomalies are observations whose characteristics
differ significantly from the normal profile
• Methods
– Statistical-based
– Distance-based
– Model-based
Statistical Approaches
• Assume a parametric model describing the
distribution of the data (e.g., normal distribution)
• Apply a statistical test that depends on
– Data distribution
– Parameter of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
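A minimal parametric sketch, assuming the data follow a single normal distribution; the sample and the z > 2 cutoff are illustrative (the cutoff plays the role of the confidence limit):

```python
import statistics

# Illustrative sample; 25.0 is an injected anomaly.
data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 25.0]

# Estimate the distribution parameters from the data.
mu = statistics.mean(data)
sigma = statistics.stdev(data)

# Flag points whose z-score exceeds the chosen confidence limit.
outliers = [x for x in data if abs(x - mu) / sigma > 2]
```

One caveat visible even here: the anomaly itself inflates the estimated mean and standard deviation, which is why a stricter 3-sigma cutoff would miss it in this tiny sample.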
Limitations of Statistical Approaches
• Most of the tests are for a single attribute
• In many cases, data distribution may not be
known
• For high dimensional data, it may be difficult
to estimate the true distribution
Distance-based Approaches
• Data is represented as a vector of features
• Three major approaches
– Nearest-neighbor based
– Density based
– Clustering based
Nearest-Neighbor Based Approach
• Approach:
– Compute the distance between every pair of data
points
– There are various ways to define outliers:
• Data points for which there are fewer than p
neighboring points within a distance D
• The top n data points whose distance to the k-th
nearest neighbor is greatest
• The top n data points whose average distance to the k
nearest neighbors is greatest
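The second definition above (top n points by distance to the k-th nearest neighbor) can be sketched on a toy 2-D data set with Euclidean distance; the points, k, and n are illustrative:

```python
import math

# Four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]

def kth_nn_dist(p, k):
    """Distance from p to its k-th nearest neighbor."""
    dists = sorted(math.dist(p, q) for q in points if q != p)
    return dists[k - 1]

k, n = 2, 1
ranked = sorted(points, key=lambda p: kth_nn_dist(p, k), reverse=True)
outliers = ranked[:n]  # the isolated point (10, 10)
```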
Distance-Based Outlier Detection
• For each object o, examine the # of other objects in the r-
neighborhood of o, where r is a user-specified distance
threshold
• An object o is an outlier if most (taking π as a fraction
threshold) of the objects in D are far away from o, i.e., not in
the r-neighborhood of o
• An object o is a DB(r, π) outlier if
  |{o′ : dist(o, o′) ≤ r}| / |D| ≤ π
• Equivalently, one can check the distance between o and its k-
th nearest neighbor o_k, where k = ⌈π·|D|⌉. o is an outlier if
dist(o, o_k) > r
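A sketch of the DB(r, π) test above on a toy 2-D data set (the data, r, and π are illustrative):

```python
import math

# Small illustrative data set D.
D = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]

def is_db_outlier(o, r, pi):
    """o is a DB(r, pi) outlier if at most a fraction pi of D
    lies within its r-neighborhood (o itself included)."""
    inside = sum(1 for x in D if math.dist(o, x) <= r)
    return inside / len(D) <= pi
```

For example, `is_db_outlier((10, 10), 2, 0.3)` holds because only the point itself falls in its r-neighborhood, while `(0, 0)` has four of the five points nearby and is not an outlier.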
Density-based Approach
• For each point, compute the density of its local neighborhood
• Points whose local density is significantly lower than their
nearest neighbors' local densities are considered outliers

Example: in the nearest-neighbor approach, p2 is not considered an
outlier, while a density-based approach may find both p1 and p2 to
be outliers.
[Figure: p1 isolated far from a dense cluster; p2 near a dense
cluster but in a sparse region; nearest-neighbor distances of p2
and p3 shown for comparison]
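A minimal relative-density sketch in the spirit of this idea (simpler than full LOF): score each point by its k-NN distance divided by the average k-NN distance of its k nearest neighbors, so scores well above 1 indicate a locally sparse point. The data set and k are illustrative:

```python
import math

# A tight cluster near the origin, one sparse point, and a second cluster.
points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1), (1, 1),
          (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1)]

def knn(p, k):
    return sorted((q for q in points if q != p),
                  key=lambda q: math.dist(p, q))[:k]

def knn_dist(p, k):
    return math.dist(p, knn(p, k)[-1])

def density_score(p, k=3):
    """Ratio of p's k-NN distance to its neighbors' average k-NN
    distance; >> 1 means p is much sparser than its neighbors."""
    neighbors = knn(p, k)
    return knn_dist(p, k) / (sum(knn_dist(q, k) for q in neighbors) / k)
```

Here `(1, 1)` gets a score far above 1 because its neighbors sit in a dense cluster, while cluster members score close to 1.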
Clustering-Based
• Basic idea:
– Cluster the data into groups of different density
– Choose points in small clusters as candidate outliers
– Compute the distance between candidate points and
non-candidate clusters
• If candidate points are far from all other non-candidate
points, they are outliers
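A crude sketch of the steps above, assuming a naive single-pass grouping (a real implementation would use a proper clustering algorithm and merge overlapping groups); the data, eps, and min_size are illustrative:

```python
import math

points = [(0, 0), (0, 1), (1, 0), (1, 1), (20, 20)]
eps, min_size = 2.0, 2

# Naive grouping: attach each point to the first group containing a
# point within eps, otherwise start a new group.
clusters = []
for p in points:
    for c in clusters:
        if any(math.dist(p, q) <= eps for q in c):
            c.append(p)
            break
    else:
        clusters.append([p])

# Points in small clusters are candidate outliers; confirm them by
# their distance to every large (non-candidate) cluster.
big = [c for c in clusters if len(c) >= min_size]
candidates = [p for c in clusters if len(c) < min_size for p in c]
outliers = [p for p in candidates
            if all(min(math.dist(p, q) for q in c) > eps for c in big)]
```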
Classification-Based Methods
• Idea: Train a classification model that can distinguish “normal”
data from outliers
• Consider a training set that contains samples labeled as
“normal” and others labeled as “outlier”
– But, the training set is typically heavily biased: # of “normal”
samples likely far exceeds # of outlier samples
• Handle the imbalanced distribution
– Oversampling positives and/or undersampling negatives
– Alter decision threshold
– Cost-sensitive learning
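The "alter decision threshold" option can be sketched with hypothetical classifier scores (not from a real model): with rare positives, lowering the cutoff below the default 0.5 trades precision for recall.

```python
# Hypothetical (score, true label) pairs; label 1 marks an outlier.
scored = [(0.9, 1), (0.6, 0), (0.4, 1), (0.35, 1), (0.2, 0), (0.1, 0)]

def recall_at(threshold):
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    return tp / sum(1 for _, y in scored if y == 1)

print(recall_at(0.5))  # 1/3 -- default cutoff misses two outliers
print(recall_at(0.3))  # 1.0 -- lowered cutoff recovers all three
```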
One-Class Model
• A classifier is built to describe only the normal class
• Learn the decision boundary of the normal class using
classification methods such as one-class SVM
• Any samples that do not belong to the normal class (not
within the decision boundary) are declared outliers
• Adv: can detect new outliers that may not appear close to
any outlier objects in the training set
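A toy one-class sketch, standing in for a real one-class SVM: fit a very simple boundary (center plus radius) to normal-only training data and declare anything outside it an outlier. The training points and the slack factor of 2 are illustrative; an actual one-class SVM learns a far more flexible boundary.

```python
import math, statistics

# Training data contains only normal samples.
normal_train = [(0, 0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.2)]

# Boundary: a circle around the centroid, with some slack.
cx = statistics.mean(x for x, _ in normal_train)
cy = statistics.mean(y for _, y in normal_train)
radius = 2 * max(math.dist((cx, cy), p) for p in normal_train)

def is_outlier(p):
    """Anything outside the learned boundary is declared an outlier."""
    return math.dist((cx, cy), p) > radius
```

Because the model describes only the normal class, a point like `(5, 5)` is flagged even though nothing like it appeared during training.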
Take-away Message
• Definition of outlier detection
• Applications of outlier detection
• Evaluation of outlier detection techniques
• Unsupervised approaches (statistical, distance,
density-based)