
Unit-3: Unsupervised Learning Algorithms

Unsupervised learning is a type of machine learning in which models are trained on
unlabeled datasets and are allowed to act on that data without any supervision.

The working of unsupervised learning can be summarized as follows: we take unlabeled
input data, meaning it is not categorized and the corresponding outputs are not given.
This unlabeled input data is fed to the machine learning model in order to train it.
The model first interprets the raw data to find hidden patterns in it, and then applies
a suitable algorithm such as k-means clustering, hierarchical clustering, etc.

Advantages of Unsupervised Learning


o Unsupervised learning is used for more complex tasks than supervised learning
because, in unsupervised learning, we don't need labeled input data.
o Unsupervised learning is often preferable because unlabeled data is much easier
to obtain than labeled data.

Disadvantages of Unsupervised Learning


o Unsupervised learning is intrinsically more difficult than supervised learning
because there is no corresponding output to learn from.
o The result of an unsupervised learning algorithm may be less accurate since the
input data is not labeled, and the algorithm does not know the exact output in advance.

In unsupervised learning, we don't have a predetermined result; the machine tries
to find useful insights from a huge amount of data. Unsupervised learning can be
further classified into two categories of algorithms:
o Clustering: Clustering is a method of grouping objects into clusters
such that objects with the most similarities remain in one group and have few
or no similarities with the objects of another group. Cluster analysis finds
the commonalities between data objects and categorizes them according to
the presence or absence of those commonalities.
o Association: An association rule is an unsupervised learning method which
is used for finding relationships between variables in large databases.
It determines the sets of items that occur together in the dataset.
Association rules make marketing strategies more effective; for example,
people who buy item X (say, bread) also tend to purchase item Y (butter or
jam). A typical example of an association rule is Market Basket Analysis.

Unsupervised Learning algorithms:


Below is a list of some popular unsupervised learning algorithms:

o K-means clustering
o Hierarchical clustering
o DBSCAN (density-based clustering)
o Anomaly detection
o Neural networks
o Principal Component Analysis
o Independent Component Analysis

What is Clustering?
The task of grouping data points based on their similarity with each other is called
Clustering or Cluster Analysis. This method falls under the branch of
Unsupervised Learning, which aims at gaining insights from unlabeled data points; that
is, unlike supervised learning, we don't have a target variable.
Clustering aims at forming groups of homogeneous data points from a heterogeneous
dataset. It evaluates similarity based on a metric like Euclidean distance, Cosine
similarity, Manhattan distance, etc., and then groups the points with the highest
similarity scores together.
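
To make these metrics concrete, here is a minimal Python sketch (the two sample points are made up for illustration) that computes the three measures named above:

import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of the absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors (1.0 means same direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

p, q = (1.0, 2.0), (4.0, 6.0)   # two illustrative 2-D data points
print(euclidean(p, q))          # 5.0
print(manhattan(p, q))          # 7.0
print(cosine_similarity(p, q))  # ~0.992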

For example, imagine a scatter plot in which three circular clusters form clearly on
the basis of distance.

It is not necessary that the clusters formed are circular in shape; the shape of
clusters can be arbitrary, and there are many algorithms that work well at detecting
arbitrarily shaped clusters.

Applications of Clustering in different fields:


1. Marketing: It can be used to characterize & discover customer segments for
marketing purposes.
2. Biology: It can be used for classification among different species of plants and
animals.
3. Libraries: It is used in clustering different books on the basis of topics and
information.
4. Insurance: It is used to understand customers and their policies, and to
identify frauds.
5. City Planning: It is used to make groups of houses and to study their values
based on their geographical locations and other factors present.
6. Earthquake studies: By learning the earthquake-affected areas we can
determine the dangerous zones.
7. Image Processing: Clustering can be used to group similar images together,
classify images based on content, and identify patterns in image data.
8. Genetics: Clustering is used to group genes that have similar expression
patterns and identify gene networks that work together in biological processes.
9. Finance: Clustering is used to identify market segments based on customer
behavior, identify patterns in stock market data, and analyze risk in investment
portfolios.
10. Customer Service: Clustering is used to group customer inquiries and
complaints into categories, identify common issues, and develop targeted
solutions.

What Is Hierarchical Clustering?


Hierarchical clustering, or hierarchical clustering analysis, is a cluster
analysis technique that creates a hierarchy of clusters from points in a
dataset.

With clustering, data points are put into groups — known as clusters —
based on similarities like color, shape or other features. In hierarchical
clustering, each cluster is placed within a nested tree-like hierarchy,
where clusters are grouped and break down further into smaller
clusters depending on similarities. Here, the closer clusters are
together in the hierarchy, the more similar they are to each other.

While clustering analyses like k-means visualize data points as distinct, flat
groups, hierarchical clustering visualizes data groups in relation to one
another, with multiple levels of similarity.

Hierarchical clustering is used to help find patterns and related occurrences
within datasets, especially those that are complex or multifaceted.
How Does Hierarchical Clustering Work?
The hierarchical clustering process repeatedly finds the two closest (most
similar) data points or clusters and combines them. After repeating this
process until all data points are grouped into clusters, the end result is a
hierarchical tree of related groups known as a dendrogram.

Hierarchical clustering is based on the core idea that similar objects lie
near each other in the data space while dissimilar ones lie far apart. It uses
distance functions to find nearby data points and groups them together as
clusters.

There are different types of clustering algorithms, including centroid-based
clustering algorithms, connectivity-based clustering algorithms (hierarchical
clustering), distribution-based clustering algorithms and density-based
clustering algorithms. The two main types of hierarchical clustering include
agglomerative clustering and divisive clustering.

Types of Hierarchical Clustering


There are two major types of hierarchical clustering approaches:

 Agglomerative clustering: Start with each data point as its own cluster,
then merge clusters step by step as the distance between them decreases.
 Divisive clustering: Start with all the data points combined in a single
cluster, then split clusters step by step as the distance between them increases.

1. Agglomerative Clustering
Agglomerative clustering is a bottom-up approach. It starts by treating each
individual data point as a single cluster, then continuously merges the most
similar clusters until one big cluster containing all objects is formed. It is
good at identifying small clusters.

How Does Agglomerative Hierarchical Clustering Work?


The working of the AHC algorithm can be explained using the steps below:
o Step-1: Treat each data point as a single cluster. If there are N data
points, the number of clusters will also be N.

o Step-2: Take the two closest data points or clusters and merge them to form one
cluster. There will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together to form
one cluster. There will be N-2 clusters.

o Step-4: Repeat Step 3 until only one cluster is left.
o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram and cut it to divide the clusters as per the problem. A sketch of
this process in code follows the note below.

Note: To better understand hierarchical clustering, it is advised to have a look
at k-means clustering.
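
As a minimal sketch of these steps (the six sample points are an illustrative assumption), scipy's linkage function performs the repeated merging, and fcluster cuts the resulting tree, which is what a dendrogram visualizes:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six illustrative 2-D points forming two loose groups.
X = np.array([[1.0, 1.0], [1.5, 1.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 8.0], [8.0, 8.5]])

# Each row of Z records one merge (Steps 2-4 above); 'ward' merges the
# pair of clusters whose union causes the smallest increase in variance.
Z = linkage(X, method="ward")

# Cut the tree so that exactly 2 clusters remain (Step 5).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]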

Divisive Clustering
Divisive clustering works just the opposite of agglomerative clustering.
It starts by considering all the data points as one big single cluster and
then keeps splitting it into smaller clusters until every data point is in
its own cluster. Divisive methods are therefore good at identifying large
clusters. The approach is top-down; however, due to its implementation
complexity, it does not have a predefined implementation in any of the
major machine learning frameworks.
Advantages of Hierarchical clustering
o It is simple to implement and gives the best output in some cases.
o It is easy to use and results in a hierarchy, a structure that contains more
information than a flat set of clusters.
o It does not require us to pre-specify the number of clusters.

Disadvantages of hierarchical clustering


o It can break large clusters.
o It is difficult to handle clusters of different sizes and convex shapes.
o It is sensitive to noise and outliers.
o Once a merge or split is performed, it can never be undone or changed later.

Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where k defines
the number of pre-defined groups. The cluster centers are chosen in such a way
that each data point is closer to its own cluster's centroid than to the
centroid of any other cluster.
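
As a minimal sketch of partitioning clustering (the sample points and the choice of k = 2 are illustrative assumptions), scikit-learn's KMeans implements exactly this centroid-based scheme:

import numpy as np
from sklearn.cluster import KMeans

# Six illustrative 2-D points forming two obvious groups.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# k (n_clusters) must be chosen in advance for partitioning methods.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster index assigned to every point
print(kmeans.cluster_centers_)  # the two learned centroids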

Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters,
so arbitrarily shaped clusters can be formed as long as the dense regions can
be connected. The algorithm does this by identifying different dense regions in
the dataset and connecting areas of high density into clusters. The dense areas
in data space are separated from each other by sparser areas; points in these
sparse areas are not assigned to any cluster but are instead identified as noise.

These algorithms can face difficulty in clustering the data points if the dataset
has varying densities and high dimensionality.

Introduction

DBSCAN is the abbreviation for Density-Based Spatial Clustering of Applications
with Noise. It is an unsupervised clustering algorithm. DBSCAN can find clusters
of arbitrary shape and size in huge amounts of data and can work with datasets
containing a significant amount of noise. It is based on the criterion of a
minimum number of points within a region.

What is DBSCAN Algorithm?

The DBSCAN algorithm can efficiently cluster densely grouped points into one
cluster. It can identify local density in the data points among large datasets,
and it handles outliers very effectively. An advantage of DBSCAN over the K-means
algorithm is that the number of clusters need not be known beforehand.

 The DBSCAN algorithm depends upon two parameters: epsilon and minPoints.
 Epsilon is defined as the radius of the circle drawn around each data point,
within which density is considered.
 minPoints is the minimum number of points required within that radius for the
data point to become a core point.
 In higher dimensions the circle becomes a hypersphere.

Parameters Required For DBSCAN Algorithm

 eps: It defines the neighborhood around a data point: if the distance
between two points is lower than or equal to eps, they are considered
neighbors. If the eps value is chosen too small, a large part of the data
will be treated as outliers. If it is chosen very large, clusters will
merge and the majority of the data points will end up in the same cluster.
One way to find a good eps value is the k-distance graph, sketched just below.
 MinPts: The minimum number of neighbors (data points) within the eps radius.
The larger the dataset, the larger the value of MinPts that should be chosen.
As a general rule, MinPts can be derived from the number of dimensions D in
the dataset as MinPts >= D + 1, and it should be at least 3.
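
Here is a minimal sketch of the k-distance heuristic mentioned above (the random data and the choice k = 4 are illustrative assumptions): sort every point's distance to its k-th nearest neighbor and look for the "elbow" in the sorted curve, which suggests a good eps.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))      # illustrative 2-D data

k = 4                              # often set equal to MinPts
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)    # per row: distances to the k nearest
                                   # neighbors (the point itself is the
                                   # first, at distance 0)

k_distances = np.sort(distances[:, -1])  # distance to the k-th neighbor
# Plotting k_distances and picking the "elbow" gives a candidate eps.
print(k_distances[:5], k_distances[-5:])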

In this algorithm, we have 3 types of data points.

 Core Point: A point is a core point if it has at least MinPts points within eps.
 Border Point: A point that has fewer than MinPts points within eps but lies in
the neighborhood of a core point.
 Noise or outlier: A point that is neither a core point nor a border point.
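
Putting the two parameters together, here is a minimal DBSCAN sketch using scikit-learn (the data, eps and min_samples values are illustrative assumptions; min_samples plays the role of MinPts):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense illustrative groups plus one far-away outlier.
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [7, 8],
              [25, 80]])

db = DBSCAN(eps=3, min_samples=2).fit(X)

# Core and border points get a cluster index; noise points get -1.
print(db.labels_)  # [0 0 0 1 1 1 -1]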

Applications of DBSCAN
 It is used in satellite imagery.
 It is used in X-ray crystallography.
 It is used for anomaly detection, e.g., in temperature data.

Unsupervised Learning Applications

 Natural language processing (NLP). Google News is known to leverage
unsupervised learning to categorize articles about the same story from various
news outlets. For instance, results of the football transfer window can all be
categorized under football.
 Image and video analysis. Visual perception tasks such as object
recognition leverage unsupervised learning.
 Anomaly detection. Unsupervised learning is used to identify data
points, events, and/or observations that deviate from a dataset's
normal behavior.
 Customer segmentation. Interesting buyer persona profiles can be
created using unsupervised learning. This helps businesses understand their
customers' common traits and purchasing habits, enabling them to align their
products with customer needs more closely.
 Recommendation Engines. Past purchase behavior coupled with
unsupervised learning can be used to help businesses discover data
trends that they could use to develop effective cross-selling strategies.

Association Rule Learning


Association rule learning is a type of unsupervised learning technique that
checks for the dependency of one data item on another data item and maps them
accordingly so that the result can be used more profitably. It uses different
rules to discover interesting relations or associations among the variables of
a dataset or database.

Association rule learning is one of the very important concepts of machine
learning, and it is employed in Market Basket Analysis, web usage mining,
continuous production, etc. Market basket analysis is a technique used by
various big retailers to discover associations between items. We can understand
it by taking the example of a supermarket, where products that are frequently
purchased together are placed together.

For example, if a customer buys bread, he is also likely to buy butter, eggs, or
milk, so these products are stored on the same shelf or nearby.
Association rule learning can be divided into three types of algorithms:

1. Apriori
2. Eclat
3. F-P Growth Algorithm

These three algorithms are explained in more detail below.

How does Association Rule Learning work?


Association rule learning works on the concept of if-then statements, such as:
if A, then B.

Here the "if" element is called the antecedent, and the "then" element is called
the consequent. A relationship in which we can find an association between two
single items is known as single cardinality; as the number of items in a rule
increases, the cardinality increases accordingly. So, to measure the
associations between thousands of data items, there are several metrics. These
metrics are given below:

o Support
o Confidence
o Lift
Let's understand each of them:

Support
Support is the frequency of an itemset, i.e., how frequently it appears in the
dataset. It is defined as the fraction of the transactions T that contain the
itemset X:

Support(X) = Freq(X) / |T|

where Freq(X) is the number of transactions containing X and |T| is the total
number of transactions.
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how
often items X and Y occur together in the dataset given that X occurs. It is
the ratio of the number of transactions that contain both X and Y to the number
of transactions that contain X:

Confidence(X -> Y) = Freq(X ∪ Y) / Freq(X) = Support(X ∪ Y) / Support(X)

Lift
Lift measures the strength of a rule. It is the ratio of the observed support
to the support that would be expected if X and Y were independent of each other:

Lift(X -> Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It has three ranges of values (a small worked example follows this list):

o Lift = 1: The occurrences of the antecedent and the consequent are
independent of each other.
o Lift > 1: The two itemsets are positively dependent on each other; the
greater the lift, the stronger the association.
o Lift < 1: One item is a substitute for the other, which means one item has a
negative effect on the occurrence of the other.
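
To tie the three metrics together, here is a minimal pure-Python sketch (the toy transactions are made up for illustration) that computes support, confidence, and lift for the rule {bread} -> {butter}:

transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"butter"}
sup_xy = support(X | Y)
confidence = sup_xy / support(X)
lift = sup_xy / (support(X) * support(Y))

print(support(X))  # 4/5 = 0.8
print(sup_xy)      # 3/5 = 0.6
print(confidence)  # 0.6 / 0.8 = 0.75
print(lift)        # 0.6 / (0.8 * 0.6) = 1.25 -> positive association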

Types of Association Rule Learning


Association rule learning can be divided into three algorithms:

Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is
designed to work on databases that contain transactions. It uses a breadth-first
search and a Hash Tree to count itemsets efficiently.

It is mainly used for market basket analysis and helps in understanding which
products can be bought together. It can also be used in the healthcare field to
find drug reactions for patients.
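
Here is a minimal pure-Python sketch of Apriori's breadth-first idea (the transactions and the min_support threshold are illustrative assumptions; production implementations add the hash-tree counting and further candidate pruning mentioned above):

def apriori(transactions, min_support):
    # Returns every itemset whose support meets the threshold.
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent individual items.
    level = {frozenset([item]) for t in transactions for item in t}
    level = {c for c in level if sup(c) >= min_support}
    frequent = {c: sup(c) for c in level}

    k = 2
    while level:
        # Breadth-first candidate generation: only unions of frequent
        # (k-1)-itemsets can possibly be frequent k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = {c for c in candidates if sup(c) >= min_support}
        frequent.update({c: sup(c) for c in level})
        k += 1
    return frequent

transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

for itemset, s in sorted(apriori(transactions, 0.4).items(),
                         key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
# ['bread'] 0.8, ['bread', 'butter'] 0.6, ['butter'] 0.6,
# ['eggs'] 0.4, ['milk'] 0.4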

Eclat Algorithm
Eclat stands for Equivalence Class Transformation. This algorithm uses
a depth-first search technique to find frequent itemsets in a transaction
database, and it typically executes faster than the Apriori algorithm.

F-P Growth Algorithm


The FP-Growth algorithm stands for Frequent Pattern growth, and it is an
improved version of the Apriori algorithm. It represents the database in the
form of a tree structure known as a frequent pattern tree (FP-tree). The
purpose of this tree is to extract the most frequent patterns.

Applications of Association Rule Learning


It has various applications in machine learning and data mining. Below are some
popular applications of association rule learning:

o Market Basket Analysis: This is one of the most popular applications of
association rule mining. The technique is commonly used by big retailers to
determine associations between items.
o Medical Diagnosis: Association rules help in identifying the probability of
illness for a particular disease, which supports diagnosing and treating
patients more easily.
o Protein Sequence: Association rules help in determining the synthesis of
artificial proteins.
o It is also used for Catalog Design, Loss-leader Analysis, and many other
applications.
