10. Anomaly Detection
Anomaly Detection
(Notes: sudden changes; outlier detection, e.g. box plot)
§ It is the identification of unexpected events, observations, or items that differ significantly from the norm
§ Often applied to unlabelled data by data scientists in a process called unsupervised anomaly detection
§ Any type of anomaly detection rests upon two basic assumptions:
§ The features of data anomalies are significantly different from those of normal instances
§ Typically, anomalous data is linked to some sort of problem or rare event such as hacking, bank fraud,
§ For this reason, identifying actual anomalies rather than false positives or data noise is essential from a business perspective
§ Anomaly detection is the identification of rare events, items, or observations which are suspicious
§ Anomalies in data are also called deviations, outliers, noise, novelties, and exceptions
§ In the network anomaly detection/network intrusion and abuse detection context, interesting events are often not rare, just unusual
§ For example, unexpected jumps in activity are typically notable, although such a spurt in activity may
§ Many outlier detection methods, especially unsupervised techniques, do not detect this kind of sudden jump in activity as an outlier or rare object. However, these types of micro clusters can often be
§ Semi-supervised
§ Supervised
§ Essentially, the correct anomaly detection method depends on the available labels in the dataset
§ Supervised anomaly detection techniques demand a data set with a complete set of “normal” and “abnormal” labels
§ This is similar to traditional pattern recognition, except that with outlier detection there is a naturally strong imbalance between the classes
§ Not all statistical classification algorithms are well-suited for the inherently unbalanced nature of anomaly detection
§ The advantage of supervised models is that they may offer a higher rate of detection than unsupervised techniques. This is because they can return a confidence score with model output, incorporate both data and prior knowledge, and encode interdependencies between variables.
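The class imbalance mentioned above can be made concrete with a small worked example. The sketch below is illustrative only: the 990/10 split, the two "classifiers" and every name in it are hypothetical, chosen to show why plain accuracy can mislead when anomalies are rare.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN, treating label 1 as 'anomaly' and 0 as 'normal'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy_precision_recall(y_true, y_pred):
    """Return (accuracy, precision, recall) for the anomaly class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labelled set: 990 normal instances followed by 10 anomalies.
y_true = [0] * 990 + [1] * 10

# A degenerate "classifier" that always answers "normal".
always_normal = [0] * 1000

# A hypothetical detector: 20 false alarms, 8 of the 10 anomalies caught.
detector = [1] * 20 + [0] * 970 + [1] * 8 + [0] * 2

print("always normal:", accuracy_precision_recall(y_true, always_normal))
print("detector:     ", accuracy_precision_recall(y_true, detector))
```

The always-normal baseline scores higher on accuracy than the detector while finding no anomalies at all, which is why precision and recall on the anomaly class are usually the more informative numbers here.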
[Diagram: stream → model → anomaly → notification]
§ Semi-supervised anomaly detection techniques use a normal, labelled training data set to construct a model representing normal behavior
§ They then use that model to detect anomalies by testing how likely the model is to generate any one instance encountered
§ A semi-supervised anomaly detection algorithm might also work with a data set that is partially flagged
§ It will then build a classification algorithm on just that flagged subset of data, and use that model to predict the status of the remaining data
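The first semi-supervised idea above (model normal behavior, then ask how likely the model is to generate a new instance) might be sketched as follows. This is a minimal illustration, assuming a single numeric feature and a Gaussian model of normal behavior; the training values, the 3-sigma threshold and the function names are all hypothetical.

```python
import math
from statistics import NormalDist


def fit_normal_model(normal_values):
    """Fit a Gaussian to training data that is known to be normal."""
    return NormalDist.from_samples(normal_values)


def log_likelihood(model, x):
    """Log-density of x under the fitted Gaussian."""
    z = (x - model.mean) / model.stdev
    return -0.5 * z * z - math.log(model.stdev * math.sqrt(2 * math.pi))


def is_anomaly(model, x, threshold):
    """Flag x when the model is unlikely to have generated it."""
    return log_likelihood(model, x) < threshold


# Hypothetical "normal only" training set, e.g. response times in ms.
normal_training = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
model = fit_normal_model(normal_training)

# Assumed cut-off: the log-likelihood of a point 3 standard deviations out.
threshold = log_likelihood(model, model.mean + 3 * model.stdev)

for value in (101, 105, 120):
    print(value, "anomaly" if is_anomaly(model, value, threshold) else "normal")
```

Real data is rarely one-dimensional or Gaussian, so the model family and the threshold would need to be chosen to suit the data at hand.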
§ Unsupervised methods of anomaly detection detect anomalies in an unlabelled test set of data based solely on the intrinsic properties of that data
§ The working assumption is that, as in most cases, the large majority of the instances in the data set will be normal
§ The anomaly detection algorithm will then detect instances that appear to fit least well with the rest of the data
§ The most popular unsupervised anomaly detection algorithms include Autoencoders, K-means, hypothesis-test-based analysis, and PCA
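As a minimal sketch of the unsupervised setting, the snippet below flags the values that fit least well with the rest of an unlabelled sample. It uses a robust "modified z-score" (median and median absolute deviation), which is a simple stand-in and not one of the algorithms named above; the data, the 3.5 cut-off and the function names are assumptions for illustration.

```python
from statistics import median


def modified_z_scores(values):
    """Robust z-scores based on the median and the median absolute deviation."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: more than half of the values are identical.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores.
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, cutoff=3.5):
    """Return one boolean per value; True marks a suspected anomaly."""
    return [abs(z) > cutoff for z in modified_z_scores(values)]


# Hypothetical unlabelled measurements; no labels are used anywhere.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 45]
print(flag_outliers(data))
```

Because the median and MAD are barely affected by a few extreme values, the bulk of the data defines "normal" without any labels, which matches the working assumption that most instances are normal.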
§ Network anomalies
§ To detect network anomalies, network owners must have a concept of expected or normal behavior
§ Detection of anomalies in network behavior demands the continuous monitoring of a network for unexpected trends or events
§ These systems observe application function, collecting data on all problems, including supporting
§ When anomalies are detected, rate limiting is triggered and admins are notified about the source of the issue with the problematic data
§ These include any other anomalous or suspicious web application behavior that might impact security such
§ It is critical for network admins to be able to identify and react to changing operational conditions
§ Any nuances in the operational conditions of data centers or cloud applications can signal
§ Therefore, anomaly detection is central to extracting essential business insights and maintaining core operations
§ Consider these patterns, all of which demand the ability to discern between normal and abnormal
§ An IT security team must prevent hacking and needs to detect abnormal login patterns and user behaviors
§ A cloud provider has to allot traffic and services and has to assess changes to infrastructure in light of existing patterns in traffic and past resource failures
§ An evidence-based, well-constructed behavioral model can not only represent data behavior, but also
§ Static alerts and thresholds are not enough, because of the overwhelming scale of the operational parameters, and because it’s too easy to miss anomalies among false positives or negatives
§ To address these kinds of operational constraints, newer systems use smart algorithms for identifying outliers in seasonal time series data and accurately forecasting periodic data patterns
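To illustrate why a static threshold can miss outliers in seasonal data, here is a minimal sketch that removes a per-phase median profile and then scores the residuals robustly. It is not the method of any particular product; the period, the synthetic series, the cut-off and the function names are hypothetical.

```python
from statistics import median


def seasonal_residuals(series, period):
    """Subtract the median value observed at each phase of the cycle."""
    profile = [median(series[phase::period]) for phase in range(period)]
    return [x - profile[i % period] for i, x in enumerate(series)]


def seasonal_outliers(series, period, cutoff=3.5):
    """Indices whose seasonally adjusted value has a large modified z-score."""
    residuals = seasonal_residuals(series, period)
    med = median(residuals)
    mad = median([abs(r - med) for r in residuals])
    if mad == 0:
        # No spread at all: anything off the median residual stands out.
        return [i for i, r in enumerate(residuals) if r != med]
    return [
        i for i, r in enumerate(residuals)
        if abs(0.6745 * (r - med) / mad) > cutoff
    ]


# Synthetic series: a repeating 4-step pattern plus small deterministic jitter.
pattern = [10, 20, 30, 20]
jitter = [0, 1, -1]
series = [pattern[i % 4] + jitter[i % 3] for i in range(24)]

# Inject a jump at a low point of the cycle. The resulting value (20) is
# ordinary for the series as a whole, so a single static threshold misses it.
series[12] += 10

print(seasonal_outliers(series, period=4))
```

The injected point is only unusual relative to what is expected at that phase of the cycle, which is the kind of context a fixed alert threshold cannot express.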
§ In searching data for anomalies that are relatively rare, it is inevitable that the user will encounter relatively high levels of noise that could be similar to abnormal behavior
§ This is because the line between abnormal and normal behavior is typically imprecise, and may
§ Furthermore, because many data patterns are based on time and seasonality, there is additional
§ The need to break down multiple trends over time, for example, demands more sophisticated
§ For all of these reasons, there are various anomaly detection techniques
§ Depending on the circumstances, one might be better than others for a particular user or data set
§ A generative approach creates a model based solely on examples of normal data from training and then evaluates each test case to see how well it fits the model
§ Clustering-based anomaly detection rests upon the assumption that similar data points tend to cluster together in groups, as determined by their proximity to local centroids
§ K-means, a commonly-used clustering algorithm, creates ‘k’ similar clusters of data points. Users can then set systems to mark data instances that fall outside of these groups as data anomalies. As an unsupervised technique, clustering does not require any data labelling.
§ Clustering algorithms might be deployed to capture an anomalous class of data. The algorithm has already created many data clusters on the training set in order to calculate the threshold for an anomalous event. It can then use this rule to create new clusters, presumably capturing new anomalous data.
§ However, clustering does not always work for time series data. This is because the data depicts evolution over time, yet the technique produces a fixed set of clusters.
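The clustering idea above can be sketched in a few lines: cluster the training set with K-means, derive a distance threshold from it, and flag new points that lie too far from every centroid. This is a toy illustration under stated assumptions: the deterministic initialisation, the "3 times the largest training distance" rule, the data and all function names are hypothetical.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=distances.__getitem__)
    return idx, distances[idx]


def k_means(points, k, iterations=20):
    """Plain Lloyd's algorithm with a naive, deterministic initialisation."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]  # assumed init scheme
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx, _ = nearest(p, centroids)
            clusters[idx].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids


def distance_threshold(points, centroids, factor=3.0):
    """Assumed rule: a multiple of the largest training-set distance."""
    return factor * max(nearest(p, centroids)[1] for p in points)


# Hypothetical training set with two tight groups.
train = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (11, 11)]
centroids = k_means(train, k=2)
threshold = distance_threshold(train, centroids)

for candidate in [(0.8, 0.3), (5, 5)]:
    _, dist = nearest(candidate, centroids)
    print(candidate, "anomaly" if dist > threshold else "normal")
```

Production code would normally use a library implementation with a better initialisation (for example k-means++) and a threshold validated on held-out data.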
§ Density-based anomaly detection methods rest upon the assumption that normal data points tend to occur in a dense neighbourhood, while anomalies pop up far away and sparsely
§ There are two types of algorithms for this type of data anomaly evaluation:
§ K-nearest neighbor (k-NN) is a basic, non-parametric, supervised machine learning technique that can be used to either regress or classify data based on distance metrics such as Euclidean, Hamming, Manhattan, or Minkowski distance.
§ Local outlier factor (LOF), also called the relative density of data, is based on reachability distance
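A compact sketch of the density-based idea is given below: it finds each point's k nearest neighbours by Euclidean distance and then computes a local outlier factor from reachability distances. It is a simplified teaching version (exactly k neighbours are used even when distances tie, and duplicate points are not handled); the data, the value of k and the function names are assumptions.

```python
import math


def k_nearest(points, i, k):
    """(distance, index) pairs for the k nearest neighbours of points[i]."""
    candidates = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    return candidates[:k]


def local_outlier_factors(points, k):
    """LOF per point; values well above 1 suggest a sparse neighbourhood."""
    n = len(points)
    neighbours = [k_nearest(points, i, k) for i in range(n)]
    k_distance = [nbrs[-1][0] for nbrs in neighbours]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach(i, j) = max(k_distance(j), distance(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: how much denser the neighbours are than the point itself.
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical data: five points in a tight group and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
lof = local_outlier_factors(points, k=3)
for p, score in zip(points, lof):
    print(p, round(score, 2))
```

Points inside the dense group score close to 1, while the isolated point scores far higher because its neighbours are much denser than it is.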
§ A support vector machine (SVM) is typically used in supervised settings, but SVM extensions can also
§ An SVM is a maximum-margin classifier that is well-suited for classifying linearly separable binary patterns; the better the separation is, the clearer the results.
§ Such anomaly detection algorithms may learn a softer boundary, depending on the goals, to cluster the data instances and identify the abnormalities properly
§ Depending on the situation, an anomaly detector like this might output numeric scalar values for various uses
§ An anomaly-based intrusion detection system (IDS) is any system designed to identify and prevent malicious activity in a computer network
§ A single computer may have its own IDS, called a Host Intrusion Detection System (HIDS), and such a system can also be scaled up to cover large networks. At that scale it is called a Network Intrusion Detection System (NIDS)
§ This is also sometimes called network behavior anomaly detection, and this is the kind of ongoing monitoring network behavior anomaly detection tools are designed to provide
§ Most IDS depend on signature-based or anomaly-based detection methods, but since signature-based IDS are ill-equipped to detect unique attacks, anomaly-based detection techniques remain more popular
§ Fraud detection
§ Fraud in banking (credit card transactions, tax return claims, etc.), insurance claims (automobile, health, etc.), telecommunications, and other areas is a significant issue for both private business and governments
§ Fraud detection demands adaptation, detection, and prevention, all with data in real-time
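The real-time requirement above can be illustrated with a streaming detector that scores each new value against running statistics and only then updates them. This is a minimal sketch rather than a production fraud system: it uses Welford's online mean/variance update with a z-score rule, and the class name, cut-off, warm-up length and transaction amounts are all hypothetical.

```python
import math


class StreamingDetector:
    """Running mean/variance (Welford's algorithm) with a z-score alert rule."""

    def __init__(self, z_cutoff=4.0, warmup=10):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.z_cutoff = z_cutoff
        self.warmup = max(warmup, 2)  # at least 2 values are needed for a variance

    def score(self, x):
        """Absolute z-score of x against the history seen so far."""
        if self.n < self.warmup:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def update(self, x):
        """Fold x into the running statistics in O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def observe(self, x):
        """Detect first, then adapt only if the value looked normal."""
        suspicious = self.score(x) > self.z_cutoff
        if not suspicious:
            self.update(x)
        return suspicious


# Hypothetical stream of transaction amounts for one account.
amounts = [20, 25, 22, 19, 24, 21, 23, 20, 22, 24, 21, 950, 23]
detector = StreamingDetector()
flags = [detector.observe(a) for a in amounts]
print(flags)
```

Skipping the update for flagged values keeps a single large fraudulent amount from inflating the baseline, at the cost of adapting more slowly to a genuine change in behavior.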
Sunbeam Infotech www.sunbeaminfo.com
Anomaly Detection Use Cases