Outlier Detection

The document discusses outlier detection and anomaly detection in data mining. It defines outliers as data points that are considerably different from the majority of data points. It describes different types of anomaly detection problems and applications. It also outlines several approaches to anomaly detection, including graphical, statistical, distance-based, and clustering-based methods. Each approach has its own advantages and limitations for detecting outliers in datasets.

Outlier Discovery/Anomaly Detection

Anomaly/Outlier Detection

- What are anomalies/outliers?
  - The set of data points that are considerably different from the remainder of the data

- Variants of anomaly/outlier detection problems (a small scoring sketch follows below):
  - Given a database D, find all data points x ∈ D with anomaly scores greater than some threshold t
  - Given a database D, find all data points x ∈ D having the top-n largest anomaly scores f(x)
  - Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D

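To make the three problem variants concrete, here is a minimal sketch in Python. The scoring function anomaly_score is a hypothetical stand-in (simple distance to the mean) and is not from the slides; any of the schemes discussed later could be plugged in instead.

import numpy as np

def anomaly_score(x, D):
    # Hypothetical scoring function: distance of x to the mean of D.
    return np.linalg.norm(x - D.mean(axis=0))

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 2))                       # the database D
scores = np.array([anomaly_score(x, D) for x in D])

# Variant 1: all points with anomaly score greater than a threshold t
t = 2.5
outliers_threshold = D[scores > t]

# Variant 2: the n points with the largest anomaly scores f(x)
n = 5
outliers_top_n = D[np.argsort(scores)[-n:]]

# Variant 3: anomaly score of a single test point x with respect to D
x_test = np.array([4.0, 4.0])
print(anomaly_score(x_test, D))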


Applications

- Credit card fraud detection
- Telecommunication fraud detection
- Network intrusion detection
- Fault detection
- Many more



Anomaly Detection

- Challenges
  - How many outliers are there in the data?
  - The method is unsupervised
  - Validation can be quite challenging (just as for clustering)
  - Finding a needle in a haystack

- Working assumption:
  - There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
Anomaly Detection Schemes
- General steps
  - Build a profile of the “normal” behavior
    - The profile can be patterns or summary statistics for the overall population
  - Use the “normal” profile to detect anomalies
    - Anomalies are observations whose characteristics differ significantly from the normal profile

- Types of anomaly detection schemes
  - Graphical and statistical-based
  - Distance-based
  - Model-based


Graphical Approaches

- Boxplot (1-D), scatter plot (2-D), spin plot (3-D); a 1-D boxplot/IQR sketch follows below

- Limitations
  - Time consuming
  - Subjective

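As a concrete 1-D check, the boxplot's whisker rule flags points outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch of that rule (the 1.5 multiplier is the conventional boxplot default, an assumption rather than something stated on the slide):

import numpy as np

def boxplot_outliers(x, k=1.5):
    # Values outside the whiskers [Q1 - k*IQR, Q3 + k*IQR] are flagged.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x < q1 - k * iqr) | (x > q3 + k * iqr)]

x = np.array([2.1, 2.4, 2.2, 2.5, 2.3, 9.7, 2.6])
print(boxplot_outliers(x))   # -> [9.7]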


Convex Hull Method

- Extreme points are assumed to be outliers

- Use the convex hull method to detect extreme values (see the sketch below)

- What if the outlier occurs in the middle of the data?
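A minimal sketch of the convex hull idea using scipy.spatial.ConvexHull: the hull vertices are the extreme points and become the candidate outliers. The 2-D toy data are an illustrative assumption.

import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))

hull = ConvexHull(X)
extreme_idx = hull.vertices        # indices of points on the convex hull
print("candidate outliers:", extreme_idx)

# Limitation noted above: a point with an unusual combination of otherwise
# ordinary attribute values can sit in the interior of the data, is never a
# hull vertex, and is therefore missed by this method.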
Statistical Approaches
- Assume a parametric model describing the distribution of the data (e.g., a normal distribution)

- Apply a statistical test that depends on
  - The data distribution
  - The parameters of the distribution (e.g., mean, variance)
  - The number of expected outliers (confidence limit)



Grubbs’ Test
- Detects outliers in univariate data
- Assumes the data come from a normal distribution
- Detects one outlier at a time: remove that outlier and repeat
- H0: there is no outlier in the data
- HA: there is at least one outlier

- Grubbs’ test statistic (a code sketch follows below):

  G = \frac{\max_i |X_i - \bar{X}|}{s}

- Reject H0 if:

  G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/N,\,N-2}}{N-2+t^2_{\alpha/N,\,N-2}}}

  where s is the sample standard deviation and t_{\alpha/N,\,N-2} is the critical value of the t-distribution with N-2 degrees of freedom at significance level \alpha/N.
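A minimal sketch of one Grubbs iteration in Python, using scipy.stats.t for the critical value in the α/N form shown above; the significance level, the helper name, and the toy data are illustrative assumptions:

import numpy as np
from scipy import stats

def grubbs_step(x, alpha=0.05):
    # One iteration of Grubbs' test: returns (reject H0?, index of the suspect point).
    x = np.asarray(x, dtype=float)
    N = len(x)
    mean, s = x.mean(), x.std(ddof=1)
    idx = np.argmax(np.abs(x - mean))           # most extreme observation
    G = np.abs(x[idx] - mean) / s               # Grubbs' statistic
    t_crit = stats.t.ppf(1 - alpha / N, N - 2)  # t critical value
    G_crit = (N - 1) / np.sqrt(N) * np.sqrt(t_crit**2 / (N - 2 + t_crit**2))
    return G > G_crit, idx

x = [5.1, 4.9, 5.0, 5.2, 5.1, 9.8]
print(grubbs_step(x))   # the last value is flagged

Because the test finds one outlier at a time, it is applied repeatedly: remove the flagged value and rerun until H0 is no longer rejected.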
Statistical-based – Likelihood Approach

- Assume the data set D contains samples from a mixture of two probability distributions:
  - M (the majority distribution)
  - A (the anomalous distribution)
- General approach:
  - Initially, assume all the data points belong to M
  - Let LLt(D) be the log likelihood of D at time t
  - For each point xt that belongs to M, move it to A
    - Let LLt+1(D) be the new log likelihood
    - Compute the difference, Δ = LLt(D) – LLt+1(D)
    - If Δ > c (some threshold), then xt is declared an anomaly and moved permanently from M to A



Statistical-based – Likelihood Approach

- Data distribution: D = (1 - λ) M + λ A
- M is a probability distribution estimated from the data
  - Can be based on any modeling method
- A is initially assumed to be a uniform distribution
- Likelihood and log likelihood at time t (a code sketch follows below):

  L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)

  LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)

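A minimal sketch of the procedure described in the two slides above, assuming a Gaussian model for M (re-estimated as points are moved) and a uniform A over the observed range. λ, the threshold c, the Gaussian choice, and the rule "declare x an anomaly when moving it from M to A raises the total log likelihood by more than c" (one reading of the Δ rule above) are all illustrative assumptions:

import numpy as np
from scipy import stats

def total_log_likelihood(M, A, lam, log_p_A):
    # LL_t(D): Gaussian model estimated from the majority set M,
    # uniform model with log-density log_p_A for the anomaly set A.
    M = np.asarray(M, dtype=float)
    mu, sigma = M.mean(), M.std(ddof=1) + 1e-9
    ll_M = len(M) * np.log(1 - lam) + stats.norm.logpdf(M, mu, sigma).sum()
    ll_A = len(A) * (np.log(lam) + log_p_A)
    return ll_M + ll_A

def likelihood_outliers(D, lam=0.05, c=3.0):
    D = np.asarray(D, dtype=float)
    log_p_A = -np.log(D.max() - D.min())       # A: uniform over the observed range
    normal = list(range(len(D)))               # indices currently in M
    anomalies = []                             # indices currently in A
    for i in range(len(D)):
        trial = [j for j in normal if j != i]  # tentatively move point i to A
        gain = (total_log_likelihood(D[trial], D[anomalies + [i]], lam, log_p_A)
                - total_log_likelihood(D[normal], D[anomalies], lam, log_p_A))
        if gain > c:                           # point i is far better explained by A than by M
            normal, anomalies = trial, anomalies + [i]
    return D[anomalies]

print(likelihood_outliers([2.0, 2.1, 1.9, 2.2, 2.0, 2.1, 8.5]))   # -> [8.5]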


Limitations of Statistical Approaches

- Most of the tests are for a single attribute

- In many cases, the data distribution may not be known

- For multi-dimensional data, it may be difficult to estimate the true distribution



Distance-based Approaches

- Data are represented as a vector of features

- Three major approaches
  - Nearest-neighbor based
  - Density based
  - Clustering based



Nearest-Neighbor Based Approach
- Approach:
  - Compute the distance between every pair of data points
  - There are various ways to define outliers (a sketch of the k-th-nearest-neighbor definition follows below):
    - Data points for which there are fewer than p neighboring points within a distance D
    - The top n data points whose distance to the k-th nearest neighbor is greatest
    - The top n data points whose average distance to the k nearest neighbors is greatest

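A minimal sketch of the second definition above (the top-n points by distance to their k-th nearest neighbor), using scikit-learn's NearestNeighbors; k, n, and the toy data are illustrative choices:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)),              # bulk of the data
               np.array([[6.0, 6.0], [7.0, -5.0]])])   # two injected outliers

k, n = 5, 2
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own nearest neighbor
dists, _ = nn.kneighbors(X)
kth_dist = dists[:, -1]                          # distance to the k-th nearest neighbor

print(np.argsort(kth_dist)[-n:])                 # indices of the n most isolated points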


Density-based: LOF approach
- For each point, compute the density of its local neighborhood
- Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
- Outliers are points with the largest LOF values

- In the nearest-neighbor approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 to be outliers

[Figure: scatter plot with a distant point p1 and a point p2 lying just outside a dense cluster]



LOF

The local outlier factor LOF_k is defined as follows (a usage sketch follows below):

  LOF_k(p) = \frac{\sum_{o \in N_k(p)} \frac{lrd_k(o)}{lrd_k(p)}}{|N_k(p)|}

where N_k(p) is the set of k-nearest neighbors of p, the local reachability density is

  lrd_k(p) = \frac{|N_k(p)|}{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}

and the reachability distance is

  \text{reach-dist}_k(p, o) = \max\{ k\text{-dist}(o), \text{dist}(p, o) \}

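A minimal usage sketch with scikit-learn's LocalOutlierFactor, which implements this definition; k = 20 and the toy data are illustrative choices (scikit-learn exposes the score as negative_outlier_factor_, i.e. the negated LOF):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, size=(100, 2)),   # dense cluster
               rng.normal(5, 2.0, size=(100, 2)),   # sparse cluster
               np.array([[1.5, 1.5]])])             # isolated point near the dense cluster

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 for outliers, 1 for inliers
lof_scores = -lof.negative_outlier_factor_   # larger LOF -> more anomalous

print(np.argsort(lof_scores)[-3:])           # indices of the 3 points with the largest LOF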


Clustering-Based
- Basic idea (see the sketch below):
  - Cluster the data into groups of different density
  - Choose points in small clusters as candidate outliers
  - Compute the distance between candidate points and non-candidate clusters
  - If candidate points are far from all other non-candidate points, they are outliers

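A minimal sketch of the clustering-based idea using k-means: points that fall in a very small cluster and lie far from every large cluster's centroid are flagged. The number of clusters, the small-cluster cutoff, and the distance threshold are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),        # large cluster
               rng.normal(6, 0.5, size=(80, 2)),         # large cluster
               rng.normal([15, 15], 0.3, size=(3, 2))])  # small, distant cluster

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels, centers = km.labels_, km.cluster_centers_
sizes = np.bincount(labels, minlength=3)

small = sizes < 10                           # "small cluster" cutoff (assumed)
candidates = np.where(small[labels])[0]      # candidate outliers: members of small clusters

large_centers = centers[~small]
dist_to_large = np.linalg.norm(X[candidates, None, :] - large_centers[None, :, :],
                               axis=2).min(axis=1)
print(candidates[dist_to_large > 3.0])       # far from all large clusters -> outliers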


Outliers in Lower Dimensional Projection

- Divide each attribute into φ equal-depth intervals
  - Each interval contains a fraction f = 1/φ of the records

- Consider a k-dimensional cube created by picking grid ranges from k different dimensions
  - If the attributes are independent, we expect the region to contain a fraction f^k of the records
  - If there are N points, we can measure the sparsity of a cube D as

    S(D) = \frac{n(D) - N \cdot f^k}{\sqrt{N \cdot f^k (1 - f^k)}}

    where n(D) is the number of points falling in the cube
  - Negative sparsity indicates that the cube contains fewer points than expected

- To detect the sparse cells, all cells have to be considered, which is exponential in the dimensionality k; heuristics can be used to find them
Example

- N = 100, φ = 5, f = 1/φ = 0.2, so the expected number of points in a 2-D cell is N · f² = 4 (computed in the sketch below)

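A minimal sketch of the sparsity computation for this example; the observed cell count n(D) = 1 is an assumed illustrative value:

import math

def sparsity(n_D, N, f, k):
    # Sparsity coefficient S(D) of a k-dimensional cell containing n_D of the N points.
    expected = N * f**k
    return (n_D - expected) / math.sqrt(expected * (1 - f**k))

N, phi, k = 100, 5, 2
f = 1 / phi                              # each equal-depth interval holds a fraction f of the records
print(N * f**k)                          # expected count per 2-D cell: 4.0
print(sparsity(n_D=1, N=N, f=f, k=k))    # about -1.53: sparser than expected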
