Outlier Detection

Uploaded by

wasimrajaa


Outlier Discovery/Anomaly Detection

Data Mining: Concepts and Techniques
December 10, 2024
Anomaly/Outlier Detection
• What are anomalies/outliers?
  - The set of data points that are considerably different from the remainder of the data
• Variants of anomaly/outlier detection problems
  - Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  - Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
  - Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
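The three problem variants above can be sketched in a few lines of Python. This is a minimal illustration that assumes the anomaly scores f(x) have already been computed by some scoring function; the names `above_threshold`, `top_n`, `score_against`, and the toy score `mean_abs_gap` are hypothetical and do not come from the slides.

```python
import heapq


def above_threshold(scores, t):
    """Variant 1: indices of points whose anomaly score exceeds t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Variant 2: indices of the n points with the largest scores."""
    return heapq.nlargest(n, range(len(scores)), key=scores.__getitem__)


def score_against(D, x, f):
    """Variant 3: score a test point x with respect to a reference set D."""
    return f(x, D)


def mean_abs_gap(x, D):
    """Hypothetical score: mean absolute difference between x and D."""
    return sum(abs(x - d) for d in D) / len(D)


scores = [0.1, 0.9, 0.3, 0.7]
flagged = above_threshold(scores, t=0.5)   # points 1 and 3
ranked = top_n(scores, n=2)                # point 1 first, then point 3
```

Variants 1 and 2 differ only in how the cut is made: a fixed threshold can return any number of points, while top-n always returns exactly n.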

Applications
• Credit card fraud detection
• Telecommunication fraud detection
• Network intrusion detection
• Fault detection
• Many more

Anomaly Detection
• Challenges
  - How many outliers are there in the data?
  - Method is unsupervised
    - Validation can be quite challenging (just like for clustering)
  - Finding a needle in a haystack
• Working assumption:
  - There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
Anomaly Detection Schemes
• General steps
  - Build a profile of the “normal” behavior
    - The profile can be patterns or summary statistics for the overall population
  - Use the “normal” profile to detect anomalies
    - Anomalies are observations whose characteristics differ significantly from the normal profile
• Types of anomaly detection schemes
  - Graphical & statistical-based
  - Distance-based
  - Model-based
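The two general steps (build a profile, then compare against it) can be illustrated with the simplest possible profile: a mean and a standard deviation. The sketch below is only one way to read those steps. The function names and the cutoff `max_z=3.0` are assumptions made for illustration and are not taken from the slides.

```python
from statistics import mean, stdev


def build_profile(normal_values):
    """Step 1: summarise 'normal' behavior with summary statistics."""
    return {"mean": mean(normal_values), "std": stdev(normal_values)}


def is_anomaly(x, profile, max_z=3.0):
    """Step 2: flag x if it lies more than max_z standard deviations
    from the profile mean (assumes the profile's std is non-zero)."""
    z = abs(x - profile["mean"]) / profile["std"]
    return z > max_z


profile = build_profile([10, 11, 9, 10, 12, 8, 10])
far_point = is_anomaly(25, profile)      # far from the profile
near_point = is_anomaly(10.5, profile)   # well inside the profile
```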
Graphical Approaches
• Boxplot (1-D), scatter plot (2-D), spin plot (3-D)
• Limitations
  - Time consuming
  - Subjective
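The judgment a boxplot supports by eye can be approximated numerically. The sketch below uses the common whisker convention of 1.5 times the interquartile range. That rule is an assumption here, since the slide names the boxplot but not a cutoff, and `boxplot_outliers` is a hypothetical helper.

```python
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Return the values that fall outside the boxplot whiskers."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles of the sample
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


outliers = boxplot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

This removes some of the subjectivity of reading a plot, but it is still a 1-D rule and the whisker length remains a choice.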
Convex Hull Method
• Extreme points are assumed to be outliers
• Use the convex hull method to detect extreme values
• What if the outlier occurs in the middle of the data?
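A small 2-D sketch makes both the idea and its weakness concrete. It computes the hull with Andrew's monotone chain algorithm, chosen here for brevity; the slides do not name a specific hull algorithm. The two toy data sets are invented for illustration.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Hull vertices of a set of 2-D points (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


# An extreme point lands on the hull and is flagged.
cluster_plus_extreme = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (5, 5)]
hull_a = set(convex_hull(cluster_plus_extreme))

# An isolated point in the middle of a ring is never on the hull.
ring_plus_middle = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
hull_b = set(convex_hull(ring_plus_middle))
```

In the second data set the point (2, 2) is far from every other point, yet it lies inside the hull, which is exactly the case the last bullet asks about.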
Statistical Approaches
• Assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Apply a statistical test that depends on
  - Data distribution
  - Parameters of the distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit)

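Grubbs’ Test
• Detects outliers in univariate data
• Assumes the data come from a normal distribution
• Detects one outlier at a time; remove the outlier and repeat
  - H_0: There is no outlier in the data
  - H_A: There is at least one outlier
• Grubbs’ test statistic:

  $$G = \frac{\max_i \lvert X_i - \bar{X} \rvert}{s}$$

• Reject H_0 if:

  $$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}$$

  where \bar{X} and s are the sample mean and standard deviation, N is the number of data points, and t_{(\alpha/N, N-2)} is the critical value of the t distribution with N − 2 degrees of freedom at significance level \alpha/N.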
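The statistic and rejection rule translate directly into code. To stay within the standard library, this sketch takes the t critical value as an input (for example from a t table) rather than computing it; the function names are hypothetical. It follows the slide in using significance level α/N. Note that two-sided formulations of the test commonly use α/(2N) instead.

```python
from math import sqrt
from statistics import mean, stdev


def grubbs_statistic(xs):
    """G = max |x - mean| / s, with s the sample standard deviation."""
    m, s = mean(xs), stdev(xs)
    return max(abs(x - m) for x in xs) / s


def grubbs_critical(n, t_crit):
    """Rejection threshold for G given the t critical value t_crit
    (upper alpha/N point of the t distribution with n - 2 d.o.f.)."""
    t2 = t_crit ** 2
    return (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2))


def has_outlier(xs, t_crit):
    """True if H_0 (no outlier) is rejected for the sample xs."""
    return grubbs_statistic(xs) > grubbs_critical(len(xs), t_crit)


# N = 5 and alpha = 0.05 give alpha/N = 0.01 with 3 degrees of freedom;
# the tabulated upper 0.01 point of that t distribution is 4.541.
# If SciPy is available, scipy.stats.t.ppf(1 - 0.01, 3) gives the same value.
T_CRIT = 4.541
suspicious = has_outlier([0.0, 0.0, 0.0, 0.0, 10.0], T_CRIT)
clean = has_outlier([9.8, 10.0, 10.1, 9.9, 10.2], T_CRIT)
```

To apply the "one outlier at a time" procedure, one would remove the point with the largest deviation whenever `has_outlier` returns true, look up the critical value for the reduced N, and test again.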
Statistical-based – Likelihood Approach
• Assume the data set D contains samples from a mixture of two probability distributions:
  - M (majority distribution)
  - A (anomalous distribution)
• General approach:
  - Initially, assume all the data points belong to M
  - Let L_t(D) be the log likelihood of D at time t
  - For each point x_t that belongs to M, move it to A
    - Let L_{t+1}(D) be the new log likelihood
    - Compute the difference, Δ = L_t(D) − L_{t+1}(D)
    - If Δ > c (some threshold), then x_t is declared an anomaly and moved permanently from M to A
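A one-dimensional sketch of this procedure follows. Several choices in it are assumptions, because the slide does not specify them: M is modelled as a normal distribution refitted to its current members, A as a uniform distribution over the data range, and the mixing weight `lam` and threshold `c` are arbitrary illustrative values. The sketch also measures the change as the improvement L_{t+1}(D) − L_t(D) and flags a point when moving it to A raises the log likelihood by more than c. That sign convention is an assumption about the intended reading of Δ.

```python
from math import log, pi


def gaussian_loglik(xs):
    """Log likelihood of xs under a normal fitted to xs itself."""
    n = len(xs)
    mu = sum(xs) / n
    ss = sum((x - mu) ** 2 for x in xs)
    var = max(ss / n, 1e-9)              # floor avoids log(0)
    return -0.5 * n * log(2 * pi * var) - ss / (2 * var)


def total_loglik(M, A, lam, log_uniform):
    """Mixture log likelihood: (1 - lam) * P_M for M, lam * P_A for A."""
    ll = len(M) * log(1 - lam) + gaussian_loglik(M)
    ll += len(A) * (log(lam) + log_uniform)
    return ll


def likelihood_anomalies(data, lam=0.05, c=3.0):
    """Single pass over the data; assumes the values are not all equal."""
    log_uniform = -log(max(data) - min(data))
    M, A = list(data), []
    ll = total_loglik(M, A, lam, log_uniform)
    for x in data:
        if len(M) <= 2:                  # keep enough points to fit M
            break
        trial_M = list(M)
        trial_M.remove(x)                # tentatively move x to A
        new_ll = total_loglik(trial_M, A + [x], lam, log_uniform)
        if new_ll - ll > c:              # improvement exceeds threshold
            M, A, ll = trial_M, A + [x], new_ll
    return A


data = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 9.9, 50.0]
anomalies = likelihood_anomalies(data)
```

Moving the value 50.0 out of M lets the fitted normal tighten sharply around 10, so the log likelihood jumps. Moving any of the other points has the opposite effect.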

Limitations of Statistical Approaches
• Most of the tests are for a single attribute
• In many cases, the data distribution may not be known
• For multi-dimensional data, it may be difficult to estimate the true distribution

Distance-based Approaches
• Data is represented as a vector of features
• Three major approaches
  - Nearest-neighbor based
  - Density based
  - Clustering based

Nearest-Neighbor Based Approach
• Approach:
  - Compute the distance between every pair of data points
  - There are various ways to define outliers:
    - Data points for which there are fewer than p neighboring points within a distance D
    - The top n data points whose distance to the kth nearest neighbor is greatest
    - The top n data points whose average distance to the k nearest neighbors is greatest
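The three outlier definitions above can be written as three small functions over the same pairwise distance matrix. This is a brute-force sketch using Euclidean distance, which is an assumption since the slide does not fix a metric. The function names are hypothetical.

```python
from math import dist


def neighbor_distances(points):
    """For each point, the sorted distances to all other points."""
    return [sorted(dist(p, q) for j, q in enumerate(points) if j != i)
            for i, p in enumerate(points)]


def sparse_neighborhood(points, p, D):
    """Definition 1: fewer than p neighbors within distance D."""
    return [i for i, ds in enumerate(neighbor_distances(points))
            if sum(1 for d in ds if d <= D) < p]


def top_n_by_kth_nn(points, n, k):
    """Definition 2: largest distance to the kth nearest neighbor."""
    nd = neighbor_distances(points)
    order = sorted(range(len(points)), key=lambda i: nd[i][k - 1],
                   reverse=True)
    return order[:n]


def top_n_by_avg_knn(points, n, k):
    """Definition 3: largest average distance to the k nearest neighbors."""
    nd = neighbor_distances(points)
    order = sorted(range(len(points)), key=lambda i: sum(nd[i][:k]) / k,
                   reverse=True)
    return order[:n]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
by_count = sparse_neighborhood(points, p=2, D=1.5)
by_kth = top_n_by_kth_nn(points, n=1, k=2)
by_avg = top_n_by_avg_knn(points, n=1, k=2)
```

All three definitions agree on this toy data set. Computing every pairwise distance costs time quadratic in the number of points, so this direct form is only practical for small data sets.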

Density-based: LOF approach
• For each point, compute the density of its local neighborhood
• Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of its nearest neighbors to the density of sample p
• Outliers are points with the largest LOF value

[Figure: points p1 and p2 relative to the data clusters.] In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
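The sketch below implements a simplified variant of this idea, not the full LOF definition. Density is taken as the inverse of the average distance to the k nearest neighbors (the original LOF formulation uses reachability distances instead), and the score is the average neighbor density divided by the point's own density. The function names are hypothetical, and the code assumes no point coincides with all k of its nearest neighbors.

```python
from math import dist


def knn(points, i, k):
    """Indices of the k nearest neighbors of point i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: dist(points[i], points[j]))
    return others[:k]


def density(points, i, k):
    """Inverse of the average distance from point i to its k neighbors."""
    return k / sum(dist(points[i], points[j]) for j in knn(points, i, k))


def lof_scores(points, k):
    """Average neighbor density divided by own density, per point."""
    dens = [density(points, i, k) for i in range(len(points))]
    return [sum(dens[j] for j in knn(points, i, k)) / (k * dens[i])
            for i in range(len(points))]


# A 3 x 3 grid with unit spacing plus one distant point (index 9).
grid = [(x, y) for x in range(3) for y in range(3)]
points = grid + [(10, 10)]
scores = lof_scores(points, k=3)
```

Points inside the grid score close to 1 because their neighbors are about as dense as they are. The distant point scores far higher because its neighbors sit in a much denser region than it does.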
Clustering-Based
• Basic idea:
  - Cluster the data into groups of different density
  - Choose points in small clusters as candidate outliers
  - Compute the distance between candidate points and non-candidate clusters
    - If candidate points are far from all other non-candidate points, they are outliers
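The last two steps can be sketched once cluster labels are available. The code below assumes the clustering has already been done by some other method and takes the labels as input. It also measures the distance to a cluster by the distance to its centroid. The parameters `min_size` and `far`, like the function names, are hypothetical choices made for illustration.

```python
from collections import defaultdict
from math import dist


def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))


def cluster_outliers(points, labels, min_size, far):
    """Indices of points in small clusters that are farther than `far`
    from the centroid of every non-candidate (large) cluster."""
    clusters = defaultdict(list)
    for i, lab in enumerate(labels):
        clusters[lab].append(i)
    large = [idx for idx in clusters.values() if len(idx) >= min_size]
    centers = [centroid([points[i] for i in idx]) for idx in large]
    candidates = [i for idx in clusters.values() if len(idx) < min_size
                  for i in idx]
    return sorted(i for i in candidates
                  if all(dist(points[i], c) > far for c in centers))


points = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9), (1.5, 1.5)]
labels = [0, 0, 0, 0, 1, 2]          # assumed output of a clustering step
outliers = cluster_outliers(points, labels, min_size=3, far=3.0)
```

Both singleton clusters are candidates, but only the point at (9, 9) is far from the large cluster. The point at (1.5, 1.5) sits next to it and is not reported.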

