Anomaly Detection
Jing Gao
SUNY Buffalo
Anomaly Detection
• Anomalies
  – objects that are considerably dissimilar from the remainder of the data
  – occur relatively infrequently
  – when they do occur, their consequences can be quite dramatic, and quite often negative

“Mining needle in a haystack. So much hay and so little time.”
Definition of Anomalies
Real World Anomalies
• Cyber Intrusions
  – Computer viruses spreading over the Internet
Simple Example
• N1 and N2 are regions of normal behavior
• Points o1 and o2, and the points in region O3, are anomalies

[Figure: two-dimensional data with normal regions N1 and N2, isolated anomalies o1 and o2, and an anomalous region O3]
Related problems
Key Challenges
Input Data
• The most common form of data handled by anomaly detection techniques is record data
  – Multivariate (e.g., the network connection records below)

  Tid  Source IP       Start time  Dest IP          Dest Port  Bytes  Attack?
  1    206.135.38.95   11:07:20    160.94.179.223   139        192    No
  2    206.163.37.95   11:13:56    160.94.179.219   139        195    No
  5    206.163.37.95   11:14:32    160.94.179.254   139        19     Yes
Input Data – Complex Data Types
– Temporal (e.g., a DNA sequence)
– Spatial
– Spatio-temporal
– Graph

[Figure: example DNA sequence GGTTCCGCCTTCAGCCCCGCGCC CGCAGGGCCCGCCCCGCGCCGTC GCTCATTAGGCGGCAGCGGACAG GCCAAGTAGAACACGCGAAGCGC TGGGCTGCCTGCTGCGACCAGGG]
Data Labels
Types of Anomalies
• Point Anomalies
• Contextual Anomalies
• Collective Anomalies
Point Anomalies
• An individual data instance is anomalous with respect to the rest of the data

[Figure: the two-dimensional example again — normal regions N1 and N2, point anomalies o1 and o2, anomalous region O3]
Contextual Anomalies
• An individual data instance is anomalous only within a specific context

[Figure: time series in which the same value is labeled “Normal” in one context and “Anomaly” in another]
Collective Anomalies
• A collection of related data instances is anomalous even if the individual instances are normal

[Figure: time series with an anomalous subsequence highlighted]
Output of Anomaly Detection
• Label
– Each test instance is given a normal or anomaly label
– This is especially true of classification-based
approaches
• Score
– Each test instance is assigned an anomaly score
• Allows the output to be ranked
• Requires an additional threshold parameter
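A minimal sketch of the two output styles, with illustrative scores and an assumed threshold of 0.5:

```python
# Scores allow ranking; a threshold converts them into normal/anomaly labels
scores = {"x1": 0.93, "x2": 0.12, "x3": 0.78}

ranking = sorted(scores, key=scores.get, reverse=True)   # score output, ranked
labels = {k: ("anomaly" if v > 0.5 else "normal")        # label output via a threshold
          for k, v in scores.items()}
print(ranking)   # ['x1', 'x3', 'x2']
print(labels)    # {'x1': 'anomaly', 'x2': 'normal', 'x3': 'anomaly'}
```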
Metrics for Performance Evaluation
• Confusion Matrix
                     PREDICTED CLASS
                     +            -
  ACTUAL      +      a (TP)       b (FN)
  CLASS       -      c (FP)       d (TN)
• Anomaly detection example
  – Number of negative examples = 9990
  – Number of positive examples = 10
  – A model that predicts everything as negative reaches 9990/10000 = 99.9% accuracy yet detects no anomalies, so accuracy alone is misleading here
Cost Matrix
                     PREDICTED CLASS
  C(i|j)             +            -
  ACTUAL      +      C(+|+)       C(-|+)
  CLASS       -      C(+|-)       C(-|-)

C(i|j): the cost of predicting class i when the actual class is j
Computing Cost of Classification
Cost Matrix:
                     PREDICTED CLASS
  C(i|j)             +            -
  ACTUAL      +      -1           100
  CLASS       -      1            0
Precision: $p = \frac{a}{a+c}$

Recall: $r = \frac{a}{a+b}$

F-measure: $F = \frac{2rp}{r+p} = \frac{2a}{2a+b+c}$

Weighted Accuracy: $\frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$
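A minimal sketch computing these metrics, with illustrative confusion-matrix counts and the cost matrix above:

```python
# Illustrative counts: a = TP, b = FN, c = FP, d = TN
a, b, c, d = 8, 2, 5, 9985

precision = a / (a + c)
recall    = a / (a + b)
f_measure = 2 * a / (2 * a + b + c)      # same as 2*r*p / (r + p)

# Weighted accuracy with assumed weights w1..w4 (all 1 here, i.e. plain accuracy)
w1, w2, w3, w4 = 1, 1, 1, 1
weighted_acc = (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

# Total cost using the cost matrix above: C(+|+)=-1, C(-|+)=100, C(+|-)=1, C(-|-)=0
cost = -1 * a + 100 * b + 1 * c + 0 * d

print(precision, recall, f_measure, weighted_acc, cost)
```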
ROC (Receiver Operating Characteristic)
ROC Curve
- 1-dimensional data set containing 2 classes (positive and negative)
- any point located at x > t is classified as positive

At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88
ROC Curve
(TPR,FPR):
• (0,0): declare everything
to be negative class
• (1,1): declare everything
to be positive class
• (1,0): ideal
• Diagonal line:
– Random guessing
– Below diagonal line:
• prediction is opposite of the
true class
Using ROC for Model Comparison
• Comparing two models
  – M1 is better for small FPR
  – M2 is better for large FPR
How to Construct an ROC curve
• Calculate the outlier scores of the given instances
• Sort the instances according to the scores in decreasing order
• Apply a threshold at each unique value of the score
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP/(TP+FN)
• FP rate, FPR = FP/(FP+TN)

  Instance  Score  Label
  1         0.95   +
  2         0.93   +
  3         0.87   -
  4         0.85   -
  5         0.85   -
  6         0.85   +
  7         0.76   -
  8         0.53   +
  9         0.43   -
  10        0.25   +

  Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  TP            5     4     4     3     3     3     3     2     2     1     0
  FP            5     5     4     4     3     2     1     1     0     0     0
  TN            0     0     1     1     2     3     4     4     5     5     5
  FN            0     1     1     2     2     2     2     3     3     4     5
  TPR           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
  FPR           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve: [figure plotting the (FPR, TPR) pairs above]
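A minimal sketch of this construction, using the scores and labels from the table above:

```python
import numpy as np

scores = np.array([0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25])
labels = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 1])   # 1 = positive, 0 = negative

P = labels.sum()                          # total positives
N = len(labels) - P                       # total negatives
for t in sorted(set(scores)) + [1.00]:    # one threshold per unique score, plus 1.00
    pred = scores >= t                    # classify as positive when score >= threshold
    tp = int(np.sum(pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)))
    print(f"t={t:.2f}  TP={tp}  FP={fp}  TPR={tp / P:.1f}  FPR={fp / N:.1f}")
```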
Applications of Anomaly Detection
Intrusion Detection
• Intrusion Detection
– Process of monitoring the events occurring in a computer system or
network and analyzing them for intrusions
– Intrusions are defined as attempts to bypass the security mechanisms
of a computer or network
• Challenges
– Traditional signature-based intrusion detection
systems are based on signatures of known
attacks and cannot detect emerging cyber threats
– Substantial latency in deployment of newly
created signatures across the computer system
• Anomaly detection can alleviate these
limitations
Fraud Detection
Healthcare Informatics
Image Processing
• Used in …
Anomaly Detection Schemes
• General Steps
– Build a profile of the “normal” behavior
• Profile can be patterns or summary statistics for the overall population
– Use the “normal” profile to detect anomalies
• Anomalies are observations whose characteristics
differ significantly from the normal profile
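A minimal sketch of these steps with a simple statistical profile (the data, the Gaussian profile, and the 2-standard-deviation threshold are all illustrative choices):

```python
import numpy as np

data = np.array([4.9, 5.0, 5.1, 5.2, 4.8, 5.0, 9.7])

# Build a profile of the "normal" behavior: summary statistics of the population
mu, sigma = data.mean(), data.std(ddof=1)

# Use the profile to detect anomalies: observations far from the profile
z = np.abs(data - mu) / sigma
print(data[z > 2])        # flags 9.7
```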
• Methods
– Statistical-based
– Distance-based
– Model-based
Statistical Approaches
• Assume a parametric model describing the
distribution of the data (e.g., normal distribution)
Grubbs’ Test
• Detect outliers in univariate data
• Assume data comes from normal distribution
• Detects one outlier at a time, remove the outlier,
and repeat
– H0: There is no outlier in data
– HA: There is at least one outlier
• Grubbs’ test statistic: $G = \frac{\max_i |X_i - \bar{X}|}{s}$
• Reject H0 if: $G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N-2+t^2_{(\alpha/N,\,N-2)}}}$
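A minimal sketch of one iteration of the test, assuming SciPy for the t distribution (the data and α are illustrative):

```python
import numpy as np
from scipy import stats

def grubbs_most_extreme(x, alpha=0.05):
    """Test the most extreme point in x; return (index, is_outlier)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    i = int(np.argmax(np.abs(x - x.mean())))
    G = abs(x[i] - x.mean()) / x.std(ddof=1)
    # Critical value built from t_(alpha/N, N-2), as in the formula above
    t2 = stats.t.ppf(1 - alpha / N, N - 2) ** 2
    G_crit = (N - 1) / np.sqrt(N) * np.sqrt(t2 / (N - 2 + t2))
    return i, G > G_crit

data = [5.1, 4.9, 5.0, 5.2, 9.8, 5.1, 4.8]
print(grubbs_most_extreme(data))   # index 4 (the value 9.8) should be flagged
```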
Statistical-based – Likelihood Approach
• Assume the data set D contains samples from a
mixture of two probability distributions:
– M (majority distribution)
– A (anomalous distribution)
• General Approach:
– Initially, assume all the data points belong to M
– Let Lt(D) be the log likelihood of D at time t
– For each point xt that belongs to M, move it to A
• Let Lt+1 (D) be the new log likelihood.
• Compute the difference, Δ = Lt(D) – Lt+1(D)
• If Δ > c (some threshold), then xt is declared as an anomaly
and moved permanently from M to A
Statistical-based – Likelihood Approach
• Data distribution, D = (1 – λ) M + λ A
• M is a probability distribution estimated from data
– Can be based on any modeling method, e.g., mixture
model
• A can be assumed to be a uniform distribution
• Likelihood at time t:

$$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$$
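A minimal sketch of this approach, assuming a Gaussian majority distribution M and a uniform anomalous distribution A over the data range (the data, λ = 0.1, and c = 1.0 are illustrative; here a point stays in A when moving it there changes the log likelihood by more than c):

```python
import numpy as np

def log_likelihood(M, A, lam, width):
    """|M| log(1-lam) + sum log P_M(x) + |A| log lam + |A| log P_A."""
    m = np.asarray(M)
    mu, sd = m.mean(), m.std() + 1e-9                 # Gaussian profile for M
    log_pm = -0.5 * ((m - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))
    log_pa = -np.log(width)                           # uniform density for A
    return (len(M) * np.log(1 - lam) + log_pm.sum()
            + len(A) * (np.log(lam) + log_pa))

data = [4.8, 5.0, 5.1, 5.2, 4.9, 12.0]               # 12.0 is the planted anomaly
width = max(data) - min(data)
M, A, lam, c = list(data), [], 0.1, 1.0
for x in list(M):
    before = log_likelihood(M, A, lam, width)
    M.remove(x); A.append(x)                          # tentatively move x from M to A
    if log_likelihood(M, A, lam, width) - before <= c:
        A.remove(x); M.append(x)                      # change too small: undo the move
print("anomalies:", A)                                # expected: [12.0]
```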
Limitations of Statistical Approaches
Distance-based Approaches
Nearest-Neighbor Based Approach
• Approach:
– Compute the distance between every pair of data
points
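A minimal sketch of a distance-based score built from the pairwise distances (k, the toy data, and using the k-th nearest-neighbor distance as the score are illustrative choices):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.9], [5.0, 5.0]])
k = 2

# Compute the distance between every pair of data points
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(D, np.inf)                 # ignore each point's distance to itself

score = np.sort(D, axis=1)[:, k - 1]        # distance to the k-th nearest neighbor
print(score)                                # the point [5, 5] gets the largest score
```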
Density-based Approach
– Example: the Local Outlier Factor (LOF) approach, described next
Density-based: LOF approach
Local Outlier Factor: LOF
• Reachability distance from o’ to o: $reachdist_k(o \leftarrow o') = \max\{dist_k(o), dist(o, o')\}$
• The higher the local reachability distance of o, and the higher the local reachability density of the kNN of o, the higher LOF
• This captures a local outlier whose local density is relatively low compared to the local densities of its kNN
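A minimal sketch using scikit-learn's LocalOutlierFactor (assuming scikit-learn is available; the data and n_neighbors are illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.9], [5.0, 5.0]])

lof = LocalOutlierFactor(n_neighbors=3)
print(lof.fit_predict(X))                  # -1 marks outliers: [5, 5] is flagged
print(-lof.negative_outlier_factor_)       # the LOF scores themselves
```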
Clustering-Based
• Basic idea:
– Cluster the data into
groups of different density
– Choose points in small
cluster as candidate
outliers
– Compute the distance
between candidate points
and non-candidate
clusters.
• If candidate points are far
from all other non-
candidate points, they are
outliers
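A minimal sketch of this idea with k-means (the clustering method, cluster-size cutoff, and distance threshold are all illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

rs = np.random.RandomState(0)
X = np.vstack([rs.normal(0, 0.3, (50, 2)),           # a large cluster
               rs.normal(4, 0.3, (50, 2)),           # another large cluster
               [[10.0, 10.0], [10.2, 9.8]]])         # a tiny, far-away cluster

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_)

# Points in small clusters are candidate outliers
candidates = np.isin(km.labels_, np.flatnonzero(sizes < 5))

# Distance from each candidate to the nearest large-cluster center
centers = km.cluster_centers_[sizes >= 5]
d = np.linalg.norm(X[candidates][:, None] - centers[None], axis=-1).min(axis=1)
print(np.flatnonzero(candidates)[d > 3.0])           # flagged outlier indices
```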
Detecting Outliers in Small Clusters
• Example: In the figure, o is an outlier since its closest large cluster is C1, but the similarity between o and C1 is small. For any point in C3, its closest large cluster is C2, but its similarity to C2 is low; in addition, |C3| = 3 is small
Classification-Based Methods
• Idea: Train a classification model that can distinguish “normal”
data from outliers
• Consider a training set that contains samples labeled as
“normal” and others labeled as “outlier”
– But, the training set is typically heavily biased: # of “normal”
samples likely far exceeds # of outlier samples
• Handle the imbalanced distribution
– Oversampling positives and/or undersampling negatives
– Alter decision threshold
– Cost-sensitive learning
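A minimal sketch of the cost-sensitive option using scikit-learn's class_weight (the classifier, data, and weighting scheme are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rs = np.random.RandomState(0)
X = np.vstack([rs.normal(0, 1, (995, 2)),     # heavily biased: 995 "normal" samples
               rs.normal(4, 1, (5, 2))])      # only 5 "outlier" samples
y = np.r_[np.zeros(995), np.ones(5)].astype(int)

# "balanced" weights errors inversely to class frequency (cost-sensitive learning)
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(clf.predict([[4.2, 3.8]]))              # should predict the outlier class (1)
```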
One-Class Model
• One-class model: A classifier is built to describe only the normal class
• Learn the decision boundary of the normal class using classification methods such as one-class SVM
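A minimal sketch with scikit-learn's OneClassSVM, trained on normal data only (the data and the ν parameter are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rs = np.random.RandomState(0)
X_normal = rs.normal(0, 1, (200, 2))          # training data: the normal class only

oc = OneClassSVM(kernel="rbf", nu=0.05).fit(X_normal)
print(oc.predict([[0.1, -0.2], [6.0, 6.0]]))  # 1 = inside the boundary, -1 = outlier
```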
Semi-Supervised Learning
Take-away Message