Anomaly Detection

Jing Gao
SUNY Buffalo

1
Anomaly Detection
• Anomalies
  – a set of objects that are considerably dissimilar from the remainder of the data
  – occur relatively infrequently
  – when they do occur, their consequences can be quite dramatic, and quite often in a negative sense
  “Mining needle in a haystack. So much hay and so little time”

2
Definition of Anomalies

• An anomaly is a pattern in the data that does not conform to the expected behavior
• Also referred to as outliers, exceptions, peculiarities, surprises, etc.
• Anomalies translate to significant (often critical) real-life entities
  – Cyber intrusions
  – Credit card fraud

3
Real World Anomalies

• Credit Card Fraud
  – An abnormally high purchase made on a credit card

• Cyber Intrusions
  – A computer virus spreading over the Internet

4
Simple Example
• N1 and N2 are regions of normal behavior
• Points o1 and o2 are anomalies
• Points in region O3 are anomalies

[Figure: scatter plot in the X–Y plane showing normal regions N1 and N2, isolated anomalies o1 and o2, and an anomalous region O3]

5
Related problems

• Rare Class Mining


• Chance discovery
• Novelty Detection
• Exception Mining
• Noise Removal

6
Key Challenges

• Defining a representative normal region is challenging
• The boundary between normal and outlying behavior is often not precise
• The exact notion of an outlier is different for different application domains
• Limited availability of labeled data for training/validation
• Malicious adversaries
• Data might contain noise
• Normal behavior keeps evolving
7
Aspects of Anomaly Detection Problem

• Nature of input data


• Availability of supervision
• Type of anomaly: point, contextual, structural
• Output of anomaly detection
• Evaluation of anomaly detection techniques

8
Input Data

• Most common form of data handled by anomaly detection techniques is record data
  – Univariate
  – Multivariate

  Tid  SrcIP          Start time  Dest IP          Dest Port  Number of bytes  Attack
  1    206.135.38.95  11:07:20    160.94.179.223   139        192              No
  2    206.163.37.95  11:13:56    160.94.179.219   139        195              No
  3    206.163.37.95  11:14:29    160.94.179.217   139        180              No
  4    206.163.37.95  11:14:30    160.94.179.255   139        199              No
  5    206.163.37.95  11:14:32    160.94.179.254   139        19               Yes
  6    206.163.37.95  11:14:35    160.94.179.253   139        177              No
  7    206.163.37.95  11:14:36    160.94.179.252   139        172              No
  8    206.163.37.95  11:14:38    160.94.179.251   139        285              Yes
  9    206.163.37.95  11:14:41    160.94.179.250   139        195              No
  10   206.163.37.95  11:14:44    160.94.179.249   139        163              Yes

9
Input Data – Complex Data Types
• Relationship among data instances
  – Sequential
    • Temporal
  – Spatial
  – Spatio-temporal
  – Graph

  Example (sequence data):
  GGTTCCGCCTTCAGCCCCGCGCC
  CGCAGGGCCCGCCCCGCGCCGTC
  GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC
  CCAACCGAGTCCGACCAGGTGCC
  CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG
  GCCAAGTAGAACACGCGAAGCGC
  TGGGCTGCCTGCTGCGACCAGGG

10
Data Labels

• Supervised Anomaly Detection


– Labels available for both normal data and anomalies
– Similar to skewed (imbalanced) classification
• Semi-supervised Anomaly Detection
– Limited amount of labeled data
– Combine supervised and unsupervised techniques
• Unsupervised Anomaly Detection
– No labels assumed
– Based on the assumption that anomalies are very rare
compared to normal data

11
Type of Anomalies

• Point Anomalies

• Contextual Anomalies

• Collective Anomalies

12
Point Anomalies

• An individual data instance is anomalous w.r.t. the data

[Figure: the earlier X–Y scatter plot, where o1, o2 and the points in region O3 are point anomalies relative to the normal regions N1 and N2]
13
Contextual Anomalies

• An individual data instance is anomalous within a context


• Requires a notion of context
• Also referred to as conditional anomalies

[Figure: example data with values labeled “Normal” and one instance labeled “Anomaly”]

14
Collective Anomalies

• A collection of related data instances is anomalous


• Requires a relationship among data instances
– Sequential Data
– Spatial Data
– Graph Data
• The individual instances within a collective anomaly are not
anomalous by themselves

[Figure: sequence with an anomalous subsequence highlighted]

15
Output of Anomaly Detection

• Label
– Each test instance is given a normal or anomaly label
– This is especially true of classification-based
approaches
• Score
– Each test instance is assigned an anomaly score
• Allows the output to be ranked
• Requires an additional threshold parameter

16
Metrics for Performance Evaluation
• Confusion Matrix

              PREDICTED CLASS
                 +      -
ACTUAL    +      a      b
CLASS     -      c      d

a: TP (true positive)    b: FN (false negative)
c: FP (false positive)   d: TN (true negative)
Metrics for Performance Evaluation

              PREDICTED CLASS
                 +        -
ACTUAL    +    a (TP)   b (FN)
CLASS     -    c (FP)   d (TN)

• Measure used in classification:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
18
Limitation of Accuracy

• Anomaly detection
– Number of negative examples = 9990
– Number of positive examples = 10

• If model predicts everything to be class 0,


accuracy is 9990/10000 = 99.9 %
– Accuracy is misleading because model does not
detect any positive examples

19
Cost Matrix

              PREDICTED CLASS
  C(i|j)         +        -
ACTUAL    +    C(+|+)   C(-|+)
CLASS     -    C(+|-)   C(-|-)

C(i|j): cost of misclassifying a class j example as class i

20
Computing Cost of Classification
Cost matrix (rows = actual class, columns = predicted class):

  C(i|j)      +      -
     +       -1    100
     -        1      0

Model M1 (confusion matrix):        Model M2 (confusion matrix):

            +      -                            +      -
     +    150     40                     +    250     45
     -     60    250                     -      5    200

  Accuracy = 80%                         Accuracy = 90%
  Cost = 3910                            Cost = 4255
21
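The cost and accuracy figures above follow directly from the two confusion matrices. As an illustration (not part of the original slides), here is a minimal Python sketch, with variable names of our own choosing, that weights each confusion-matrix cell by the corresponding cost entry:

```python
import numpy as np

# Cost matrix from the slide, rows = actual class (+, -), columns = predicted class (+, -):
# C(+|+) = -1, C(-|+) = 100, C(+|-) = 1, C(-|-) = 0
cost = np.array([[-1, 100],
                 [ 1,   0]])

# Confusion matrices for the two models, same layout (rows = actual, columns = predicted)
m1 = np.array([[150,  40],
               [ 60, 250]])
m2 = np.array([[250,  45],
               [  5, 200]])

def accuracy(cm):
    return np.trace(cm) / cm.sum()

def total_cost(cm, cost):
    # Weight each confusion-matrix cell by its per-instance cost and sum everything up
    return int((cm * cost).sum())

for name, cm in [("M1", m1), ("M2", m2)]:
    print(name, f"accuracy = {accuracy(cm):.0%}", f"cost = {total_cost(cm, cost)}")
# Expected output: M1 accuracy = 80% cost = 3910, M2 accuracy = 90% cost = 4255
```

M2 wins on accuracy but loses on cost, because its 45 false negatives fall in the expensive cell (100 per mistake).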
Cost-Sensitive Measures

  Precision (p) = a / (a + c)

  Recall (r) = a / (a + b)

  F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

  Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)

22
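For readers who prefer code to formulas, a small sketch (ours, not from the slides) computes these measures from the confusion-matrix counts a, b, c, d defined earlier:

```python
def cost_sensitive_measures(a, b, c, d):
    """a = TP, b = FN, c = FP, d = TN, following the confusion-matrix notation above."""
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)   # equals 2a / (2a + b + c)
    return precision, recall, f_measure

# Hypothetical usage with model M1 from the previous slide: a=150, b=40, c=60, d=250
print(cost_sensitive_measures(150, 40, 60, 250))
```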
ROC (Receiver Operating Characteristic)

• ROC curve plots TPR (on the y-axis) against


FPR (on the x-axis)
• Performance of each classifier represented as
a point on the ROC curve
– changing the threshold of algorithm, sample
distribution or cost matrix changes the location of
the point

23
ROC Curve
- 1-dimensional data set containing 2 classes (positive and negative)
- any point located at x > t is classified as positive

At threshold t:
TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88
24
ROC Curve
(TPR,FPR):
• (0,0): declare everything
to be negative class
• (1,1): declare everything
to be positive class
• (1,0): ideal

• Diagonal line:
– Random guessing
– Below diagonal line:
• prediction is opposite of the
true class

25
Using ROC for Model Comparison
• Comparing two models
  – M1 is better for small FPR
  – M2 is better for large FPR

• Area Under the ROC Curve (AUC)
  – Ideal: area = 1
  – Random guess: area = 0.5

26
How to Construct an ROC curve
  Instance   Score   Label
  1          0.95    +
  2          0.93    +
  3          0.87    -
  4          0.85    -
  5          0.85    -
  6          0.85    +
  7          0.76    -
  8          0.53    +
  9          0.43    -
  10         0.25    +

• Calculate the outlier scores of the given instances
• Sort the instances according to the scores in decreasing order
• Apply a threshold at each unique value of the score
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP / (TP + FN)
• FP rate, FPR = FP / (FP + TN)
27
How to construct an ROC curve
  Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  Class          +     -     +     -     -     -     +     -     +     +
  TP             5     4     4     3     3     3     3     2     2     1     0
  FP             5     5     4     4     3     2     1     1     0     0     0
  TN             0     0     1     1     2     3     4     4     5     5     5
  FN             0     1     1     2     2     2     2     3     3     4     5
  TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
  FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

[Figure: the resulting ROC curve, plotting TPR against FPR]

28
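The TPR/FPR rows of this table can be reproduced with a few lines of Python. The sketch below is illustrative (names are ours); duplicate scores collapse to a single threshold, so the three 0.85 columns above become one line here:

```python
import numpy as np

# Outlier scores and true labels (+ = anomaly) from the worked example above
scores = np.array([0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25])
labels = np.array(['+', '+', '-', '-', '-', '+', '-', '+', '-', '+'])

P = np.sum(labels == '+')            # total actual positives
N = np.sum(labels == '-')            # total actual negatives

# Apply a threshold at every unique score value (plus one value above the maximum)
for t in list(np.unique(scores)) + [1.00]:
    predicted_pos = scores >= t
    tp = np.sum(predicted_pos & (labels == '+'))
    fp = np.sum(predicted_pos & (labels == '-'))
    print(f"threshold >= {t:.2f}: TP={tp} FP={fp} TPR={tp / P:.1f} FPR={fp / N:.1f}")

# Plotting the (FPR, TPR) pairs, sorted by FPR, traces the ROC curve; the area under it is the AUC.
```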
Applications of Anomaly Detection

• Network intrusion detection


• Insurance / Credit card fraud detection
• Healthcare Informatics / Medical diagnostics
• Image Processing / Video surveillance
• …

29
Intrusion Detection

• Intrusion Detection
– Process of monitoring the events occurring in a computer system or
network and analyzing them for intrusions
– Intrusions are defined as attempts to bypass the security mechanisms
of a computer or network
• Challenges
– Traditional signature-based intrusion detection
systems are based on signatures of known
attacks and cannot detect emerging cyber threats
– Substantial latency in deployment of newly
created signatures across the computer system
• Anomaly detection can alleviate these
limitations

30
Fraud Detection

• Fraud detection refers to detection of criminal activities


occurring in commercial organizations
– Malicious users might be the actual customers of the organization or
might be posing as a customer (also known as identity theft).
• Types of fraud
– Credit card fraud
– Insurance claim fraud
– Mobile / cell phone fraud
– Insider trading
• Challenges
– Fast and accurate real-time detection
– Misclassification cost is very high

31
Healthcare Informatics

• Detect anomalous patient records


– Indicate disease outbreaks, instrumentation errors,
etc.
• Key Challenges
– Misclassification cost is very high
– Data can be complex: spatio-temporal

32
Image Processing

• Detecting outliers in an image monitored over time
• Detecting anomalous regions within an image
• Used in
  – mammography image analysis
  – video surveillance
  – satellite image analysis
• Key Challenges
  – Detecting collective anomalies
  – Data sets are very large

[Figure: example image with an anomalous region marked “Anomaly”]

33
Anomaly Detection Schemes
• General Steps
– Build a profile of the “normal” behavior
• Profile can be patterns or summary statistics for the overall population
– Use the “normal” profile to detect anomalies
• Anomalies are observations whose characteristics
differ significantly from the normal profile

• Methods
– Statistical-based
– Distance-based
– Model-based

34
Statistical Approaches
• Assume a parametric model describing the
distribution of the data (e.g., normal distribution)

• Apply a statistical test that depends on


– Data distribution
– Parameter of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)

35
Grubbs’ Test
• Detect outliers in univariate data
• Assume data comes from normal distribution
• Detects one outlier at a time, remove the outlier,
and repeat
– H0: There is no outlier in data
– HA: There is at least one outlier
• Grubbs’ test statistic:

    G = max_i | X_i − X̄ | / s

• Reject H0 if:

    G > ((N − 1) / √N) · √( t²(α/N, N−2) / (N − 2 + t²(α/N, N−2)) )

  where t(α/N, N−2) is the critical value of the t-distribution with N − 2 degrees of
  freedom at significance level α/N

36
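As an illustration, here is a small Python sketch of one round of Grubbs’ test, assuming SciPy is available; it follows the one-sided α/N form of the critical value shown above (a two-sided test would use α/(2N)):

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """One round of Grubbs' test; returns the statistic, the critical value, and the verdict."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)          # G = max |X_i - mean| / s
    t = stats.t.ppf(1 - alpha / n, n - 2)                     # t_(alpha/N, N-2), as on the slide
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit, g > g_crit                              # True -> reject H0: an outlier is present

# Illustrative usage: flag, remove, and repeat until no further outlier is found
print(grubbs_test([5.1, 4.9, 5.0, 5.2, 4.8, 9.7]))
```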
Statistical-based – Likelihood
Approach
• Assume the data set D contains samples from a
mixture of two probability distributions:
– M (majority distribution)
– A (anomalous distribution)
• General Approach:
– Initially, assume all the data points belong to M
  – Let L_t(D) be the log likelihood of D at time t
  – For each point x_t that belongs to M, move it to A
    • Let L_{t+1}(D) be the new log likelihood
    • Compute the difference, Δ = L_t(D) − L_{t+1}(D)
    • If Δ > c (some threshold), then x_t is declared as an anomaly and moved permanently from M to A

37
Statistical-based – Likelihood
Approach
• Data distribution: D = (1 − λ)·M + λ·A
• M is a probability distribution estimated from data
  – Can be based on any modeling method, e.g., mixture model
• A can be assumed to be a uniform distribution
• Likelihood at time t:

    L_t(D) = ∏_{i=1..N} P_D(x_i)
           = ( (1 − λ)^|M_t| · ∏_{x_i ∈ M_t} P_{M_t}(x_i) ) · ( λ^|A_t| · ∏_{x_i ∈ A_t} P_{A_t}(x_i) )

38
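A toy sketch of this idea (ours, not from the slides): M is refit as a single Gaussian on the points currently assigned to it, A is a uniform density over the observed range, and a point is declared anomalous when moving it from M to A changes the log likelihood by more than a threshold c. Sign conventions for the difference vary across write-ups; here we test the gain in log likelihood produced by the move:

```python
import numpy as np
from scipy import stats

def log_likelihood(data, in_M, lam, a_density):
    """Log likelihood of the mixture (1 - lam)*M + lam*A: M is refit as a Gaussian to the
    points currently assigned to it, A is a fixed uniform density over the data range."""
    M, A = data[in_M], data[~in_M]
    mu, sigma = M.mean(), M.std(ddof=1) + 1e-12
    ll = len(M) * np.log(1 - lam) + stats.norm.logpdf(M, mu, sigma).sum()
    ll += len(A) * (np.log(lam) + np.log(a_density))
    return ll

def likelihood_anomalies(data, lam=0.05, c=5.0):
    data = np.asarray(data, dtype=float)
    a_density = 1.0 / (data.max() - data.min())      # uniform A over the observed range
    in_M = np.ones(len(data), dtype=bool)            # initially every point belongs to M
    for i in range(len(data)):
        before = log_likelihood(data, in_M, lam, a_density)
        in_M[i] = False                              # tentatively move x_i from M to A
        after = log_likelihood(data, in_M, lam, a_density)
        if after - before <= c:                      # change in log likelihood too small:
            in_M[i] = True                           # move the point back to M
    return np.where(~in_M)[0]                        # indices of points declared anomalous

print(likelihood_anomalies([2.0, 2.1, 1.9, 2.2, 2.0, 9.5]))   # the last point stands out
```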
Limitations of Statistical Approaches

• Most of the tests are for a single attribute

• In many cases, data distribution may not be


known

• For high dimensional data, it may be difficult


to estimate the true distribution

39
Distance-based Approaches

• Data is represented as a vector of features

• Three major approaches


– Nearest-neighbor based
– Density based
– Clustering based

40
Nearest-Neighbor Based Approach

• Approach:
– Compute the distance between every pair of data
points

– There are various ways to define outliers:


• Data points for which there are fewer than p
neighboring points within a distance D

• The top n data points whose distance to the k-th


nearest neighbor is greatest

• The top n data points whose average distance to the k


nearest neighbors is greatest
41
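A minimal sketch of the second definition above (score each point by its distance to its k-th nearest neighbor); the data and parameter values are purely illustrative:

```python
import numpy as np

def knn_distance_scores(X, k=3):
    """Outlier score = distance to the k-th nearest neighbor (second definition above);
    a larger score means the point is more isolated."""
    X = np.asarray(X, dtype=float)
    # Full pairwise Euclidean distance matrix (fine for small data; use a KD-tree for large data)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbor
    return np.sort(dists, axis=1)[:, k - 1]

# Toy usage: two small clusters plus one isolated point
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 5], [5, 6], [6, 5], [20, 20]])
scores = knn_distance_scores(X, k=3)
print(np.argsort(scores)[::-1][:2])                 # indices of the top-2 outlier candidates
```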
Distance-Based Outlier Detection
• For each object o, examine the # of other objects in the r-
neighborhood of o, where r is a user-specified distance
threshold
• An object o is an outlier if most (taking π as a fraction
threshold) of the objects in D are far away from o, i.e., not in
the r-neighborhood of o

• An object o is a DB(r, π) outlier if

    ‖{ o′ | dist(o, o′) ≤ r }‖ / ‖D‖ ≤ π

• Equivalently, one can check the distance between o and its k-th nearest neighbor o_k,
  where k = ⌈π · ‖D‖⌉. o is an outlier if dist(o, o_k) > r

42
Density-based Approach

• Local Outlier Factor (LOF) approach

  – Example:
    • In the NN approach, p2 is not considered an outlier, while the LOF approach
      finds both p1 and p2 as outliers
    • The NN approach may consider p3 as an outlier, but the LOF approach does not

[Figure: points p1, p2 and p3, annotated with the distance from p2 and from p3 to their nearest neighbors]

43
Density-based: LOF approach

• For each point, compute the density of its local


neighborhood
• Compute local outlier factor (LOF) of a sample p as
the average of the ratios of the density of sample p
and the density of its nearest neighbors
• Outliers are points with largest LOF value

44
Local Outlier Factor: LOF
• Reachability distance from o′ to o:

    reachdist_k(o′ ← o) = max{ dist_k(o′), dist(o′, o) }

  – where k is a user-specified parameter and dist_k(o′) is the distance from o′ to its
    k-th nearest neighbor

• Local reachability density of o:

    lrd_k(o) = ‖N_k(o)‖ / Σ_{o′ ∈ N_k(o)} reachdist_k(o′ ← o)

• LOF (Local Outlier Factor) of an object o is the average of the ratios of the local
  reachability densities of o’s k-nearest neighbors to that of o:

    LOF_k(o) = ( Σ_{o′ ∈ N_k(o)} lrd_k(o′) / lrd_k(o) ) / ‖N_k(o)‖

• The lower the local reachability density of o, and the higher the local reachability
  densities of o’s k-nearest neighbors, the higher the LOF
• This captures a local outlier whose local density is relatively low compared to the
  local densities of its k-nearest neighbors
45
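In practice one rarely implements LOF by hand; the sketch below uses scikit-learn's LocalOutlierFactor (assuming scikit-learn is installed) on a toy data set where one point lies between two dense clusters:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [0.0, 0.2],
              [3.0, 3.0], [3.1, 3.0], [3.0, 3.1], [3.1, 3.1],
              [1.5, 1.6]])                      # one point sitting between the two clusters

lof = LocalOutlierFactor(n_neighbors=3)         # k nearest neighbors used for the local density
labels = lof.fit_predict(X)                     # -1 = outlier, +1 = inlier
scores = -lof.negative_outlier_factor_          # larger value = more outlying

print(labels)
print(np.round(scores, 2))
```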
Clustering-Based
• Basic idea:
– Cluster the data into
groups of different density
– Choose points in small
cluster as candidate
outliers
– Compute the distance
between candidate points
and non-candidate
clusters.
• If candidate points are far
from all other non-
candidate points, they are
outliers

46
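A rough sketch of this clustering-based recipe using k-means (all names, cluster counts, and thresholds are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),        # dense cluster near (0, 0)
               rng.normal(5, 0.3, size=(50, 2)),        # dense cluster near (5, 5)
               [[2.5, 8.0], [2.6, 8.1]]])               # a tiny, far-away group

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_)

small = np.flatnonzero(sizes < 0.1 * len(X))            # clusters holding < 10% of the data
candidates = np.flatnonzero(np.isin(km.labels_, small)) # their members are outlier candidates

# Distance from each candidate to the nearest *large* cluster centre
large_centres = km.cluster_centers_[np.flatnonzero(sizes >= 0.1 * len(X))]
d = np.linalg.norm(X[candidates][:, None, :] - large_centres[None, :, :], axis=-1).min(axis=1)

outliers = candidates[d > 3.0]                          # illustrative distance threshold
print(outliers)                                         # expected: the two far-away points (100, 101)
```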
Detecting Outliers in Small Clusters

• FindCBLOF: Detect outliers in small clusters


– Find clusters, and sort them in decreasing size
– To each data point, assign a cluster-based local outlier
factor (CBLOF):
  – If object p belongs to a large cluster, CBLOF = cluster size × similarity between p and its cluster
  – If p belongs to a small cluster, CBLOF = cluster size × similarity between p and the closest large cluster

• Ex. In the figure, o is an outlier since its closest large cluster is C1, but the
  similarity between o and C1 is small. For any point in C3, its closest large cluster
  is C2, but its similarity to C2 is low; in addition, |C3| = 3 is small

47
Classification-Based Methods
• Idea: Train a classification model that can distinguish “normal”
data from outliers
• Consider a training set that contains samples labeled as
“normal” and others labeled as “outlier”
– But, the training set is typically heavily biased: # of “normal”
samples likely far exceeds # of outlier samples
• Handle the imbalanced distribution
– Oversampling positives and/or undersampling negatives
– Alter decision threshold
– Cost-sensitive learning

48
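A brief sketch of two of the remedies listed above, cost-sensitive class weights and a lowered decision threshold, on a synthetic 990-vs-10 data set (mirroring the earlier accuracy example); the classifier choice is ours, not the slides':

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(990, 2)),      # 990 "normal" records
               rng.normal(4, 1, size=(10, 2))])      # 10 "outlier" records
y = np.r_[np.zeros(990), np.ones(10)]

# Cost-sensitive learning: weight the rare outlier class more heavily instead of resampling
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Alter the decision threshold: flag an outlier whenever P(outlier) exceeds a low cutoff
proba = clf.predict_proba(X)[:, 1]
flagged = proba > 0.2                                 # cutoff chosen for illustration
print(int(flagged.sum()), "records flagged as outliers")
```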
One-Class Model
• One-class model: a classifier is built to describe only the normal class
  – Learn the decision boundary of the normal class using classification methods such as SVM
  – Any samples that do not belong to the normal class (not within the decision boundary) are declared as outliers
  – Adv: can detect new outliers that may not appear close to any outlier objects in the training set

49
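A minimal one-class SVM sketch using scikit-learn (a library choice of ours; the slides do not prescribe one): the model is trained on normal data only and then queried on new samples:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(200, 2))       # training data: the normal class only

# nu bounds the fraction of training points allowed to fall outside the learned boundary
oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

X_test = np.array([[0.1, -0.2],                 # close to the normal region
                   [6.0, 6.0]])                 # far outside it
print(oc_svm.predict(X_test))                   # +1 = normal, -1 = outlier
```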
Semi-Supervised Learning

• Semi-supervised learning: Combining classification-


based and clustering-based methods
• Method
– Using a clustering-based approach, find a large
cluster, C, and a small cluster, C1
– Since some objects in C carry the label “normal”,
treat all objects in C as normal
– Use the one-class model of this cluster to
identify normal objects in outlier detection
– Since some objects in cluster C1 carry the label
“outlier”, declare all objects in C1 as outliers
  – Any object that does not fall into the model for C (such as point a in the figure) is considered an outlier as well

50
Take-away Message

• Definition of outlier detection


• Applications of outlier detection
• Evaluation of outlier detection techniques
• Unsupervised approaches (statistical, distance,
density-based)
• Supervised and semi-supervised approaches

51
