Cluster

Uploaded by

sahasayak2002
Copyright
© All Rights Reserved
Unsupervised

Learning

Dr. Soumi Dutta


Introduction
In the real world, we are surrounded by humans who can learn from
experience, and by computers or machines that work on our instructions.
But can a machine also learn from experience or past data as a human
does? This is where Machine Learning comes in.

How does Machine Learning work
A machine learning system learns from previous data, builds prediction
models, and predicts the output whenever it receives new data. The accuracy
of the predicted output depends on the amount of data: more data helps to
build a better model, which in turn predicts the output more accurately.

The operation of a machine learning algorithm is depicted in the following
block diagram:

Classification of Machine Learning
At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning
In supervised learning, sample labelled data are provided to the machine
learning system for training, and the system then predicts the output based on
the training data. The system uses the labelled data to build a model that
learns the relationship between the inputs and their labels. After the training
and processing are done, we test the model with sample data to see if it can
accurately predict the output.

The objective of supervised learning is to map the input data to the output
data. Supervised learning depends on supervision, much as a student learns
under the guidance of a teacher. Spam filtering is an example of supervised
learning.

Supervised learning can be further grouped into two categories of algorithms:

•Regression: In regression problems, the goal is to predict a continuous output
or value. For example, predicting the price of a house based on its features,
such as the number of bedrooms, square footage, and location.
•Classification: In classification problems, the goal is to assign input data to
one of a set of predefined classes or categories, such as spam or not spam.
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision. The machine is trained on a set of data that has not
been labelled, classified, or categorized, and the algorithm needs to act on that data
without any supervision. The goal of unsupervised learning is to restructure the input
data into new features or groups of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to
find useful insights from a huge amount of data. Unsupervised learning can be further
classified into three categories of algorithms:

1. Clustering: Clustering algorithms aim to group similar data points into clusters
based on some similarity metric. K-means clustering and hierarchical clustering are
examples of unsupervised clustering techniques.
2. Dimensionality Reduction: These techniques aim to reduce the number of
features (or dimensions) in the data while preserving its essential information.
Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding
(t-SNE) are examples of dimensionality reduction methods.
3. Association: Association rule learning is used to discover interesting relationships
between variables in large datasets.
Supervised learning vs. unsupervised
learning

Applications of Machine learning
Machine learning is a buzzword in today's technology, and it is growing rapidly.
We use machine learning in our daily lives, often without knowing it, in products
such as Google Maps, Google Assistant, and Alexa. Below are some of the most popular
real-world applications of machine learning:

Machine learning Life cycle
Machine learning gives computer systems the ability to learn automatically
without being explicitly programmed. The machine learning life cycle is a cyclic
process for building an efficient machine learning project.

The machine learning life cycle involves seven major steps, which are given below:

•Gathering Data

•Data preparation

•Data Wrangling

•Analyse Data

•Train the model

•Test the model

•Deployment

1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this
step is to identify and obtain all the data related to the problem.

In this step, we need to identify the different data sources, as data can be collected
from various sources such as files, databases, the internet, or mobile devices. It is one of
the most important steps of the life cycle. The quantity and quality of the collected
data determine the quality of the output. In general, the more data we have, the more
accurate the prediction will be.

This step includes the following tasks:

•Identify various data sources
•Collect data
•Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called a
dataset. It will be used in the further steps.
2. Data preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is
the step where we put our data into a suitable place and prepare it for use in
machine learning training.

In this step, we first put all the data together and then randomize its ordering.

This step can be further divided into two processes:

Data exploration:
It is used to understand the nature of the data that we have to work with. We need to
understand the characteristics, format, and quality of the data.
A better understanding of the data leads to a more effective outcome. Here we look for
correlations, general trends, and outliers.

Data pre-processing:
The next step is to pre-process the data for analysis.

3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable
format. It involves cleaning the data, selecting the variables to use, and
transforming the data into a proper format to make it more suitable for analysis in the
next step. It is one of the most important steps of the complete process. Cleaning of
data is required to address quality issues.

Not all of the data we have collected is necessarily useful. In real-world
applications, collected data may have various issues,
including:

•Missing Values
•Duplicate data
•Invalid data
•Noise
So, we use various filtering techniques to clean the data.

It is essential to detect and remove the above issues because they can negatively affect
the quality of the outcome.
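
The cleaning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are hypothetical (age, income) pairs, the validity rule (age between 0 and 120, non-negative income) is invented for the example, and noise handling is left out because it depends on the data.

```python
import math

def clean_records(records):
    """Drop rows with missing, invalid, or duplicate values.

    `records` is a list of (age, income) tuples. The column meaning and the
    validity rule used here are assumptions made for this example.
    """
    seen = set()
    cleaned = []
    for row in records:
        # Missing values: None or NaN in any field.
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in row):
            continue
        age, income = row
        # Invalid data: values outside a plausible range.
        if not (0 <= age <= 120) or income < 0:
            continue
        # Duplicate data: keep only the first occurrence.
        if row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned

raw = [(25, 40000.0), (25, 40000.0), (None, 52000.0), (-3, 30000.0), (41, 61000.0)]
print(clean_records(raw))  # [(25, 40000.0), (41, 61000.0)]
```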
4. Data Analysis

Now the cleaned and prepared data is passed on to the analysis step. This step
involves:

•Selection of analytical techniques

•Building models

•Review the result

The aim of this step is to build a machine learning model that analyzes the data using
various analytical techniques, and to review the outcome. It starts with determining
the type of problem, where we select a machine learning technique such as
classification, regression, cluster analysis, or association; we then build the model using
the prepared data and evaluate it.

5. Train Model

The next step is to train the model. In this step, we train our model to improve its
performance and obtain a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms. Training
a model is required so that it can learn the various patterns, rules, and features.

6. Test Model
Once our machine learning model has been trained on a given dataset, we
test the model. In this step, we check the accuracy of our model by providing a
test dataset to it.

Testing the model determines its percentage accuracy with respect to the
requirements of the project or problem.
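
As a rough illustration of the percentage accuracy mentioned above, the sketch below compares predicted labels with true labels. The label lists are made up for the example, and `accuracy` is a hypothetical helper, not a function from any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical test-set labels and model predictions.
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]
print(f"{accuracy(y_true, y_pred):.0%}")  # 80%
```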
7. Deployment
The last step of the machine learning life cycle is deployment, where we deploy the model
in a real-world system.

If the model prepared above produces accurate results as per our requirements
with acceptable speed, then we deploy it in the real system. But before
deploying the project, we check whether it is improving its performance using the
available data. The deployment phase is similar to making the final report for a
project.

Clustering
Clustering is a technique for finding similarity groups in data,
called clusters. That is, it groups data instances that are similar to
(near) each other into one cluster and data instances that are very
different (far away) from each other into different clusters.

Clustering is often called an unsupervised learning task because
no class values denoting an a priori grouping of the data
instances are given, as they are in supervised learning.

Real Life Application of Clustering

Aspects of clustering
• A clustering algorithm
  • Partitioning Clustering
  • Density-Based Clustering
  • Distribution Model-Based Clustering
  • Hierarchical Clustering
  • Fuzzy Clustering

• A distance (similarity, or dissimilarity) function

• Clustering quality
  • Inter-cluster distance ⇒ maximized
  • Intra-cluster distance ⇒ minimized

The quality of a clustering result depends on the algorithm, the distance
function, and the application.
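
To make the two quality criteria concrete, here is a minimal sketch that measures them on two hand-made clusters. Euclidean distance, the mean pairwise distance as the intra-cluster measure, and the centroid-to-centroid distance as the inter-cluster measure are all assumptions chosen for the illustration; other definitions are equally valid.

```python
import math
from itertools import combinations

def mean_intra_distance(cluster):
    """Average pairwise distance inside one cluster (needs at least 2 points)."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def centroid(cluster):
    """Mean vector of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

# Two hypothetical, well-separated clusters.
a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

intra_a = mean_intra_distance(a)
inter_ab = math.dist(centroid(a), centroid(b))
print(intra_a < inter_ab)  # True: tight clusters that are far apart
```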

Partitioning Clustering
• It is a type of clustering that divides the data into non-hierarchical groups. It
is also known as the centroid-based method. The most common example
of partitioning clustering is the K-Means Clustering algorithm.

• In this type, the dataset is divided into a set of K groups, where K is the
number of pre-defined groups. The cluster centers are chosen in
such a way that each data point is closer to the centroid of its own cluster
than to the centroid of any other cluster.

K-means clustering
The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not be points from the input
dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Compute the mean of each cluster and place a new centroid there.
Step-5: Repeat the third step, which means reassigning each data point to the
new closest centroid.
Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.
Step-7: The model is ready.
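
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, takes the initial centroids as an argument instead of drawing them at random (so the run is repeatable), and leaves an empty cluster's centroid where it was.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Run k-means from the given initial centroids.

    Returns the final centroids and the cluster index of each point.
    """
    assignment = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign each point to its closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop when no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each centroid to the mean of its cluster.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(old)  # assumption: keep an empty cluster's centroid
        centroids = new_centroids
    return centroids, assignment

# Hypothetical data with two obvious groups; K = 2 seeds picked by hand.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
final_centroids, labels = k_means(points, [(1.0, 1.0), (8.0, 8.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the seeds would usually be drawn at random, for example with `random.sample(points, k)`.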
K-means is a partitional clustering algorithm.
Let the set of data points (or instances) $D$ be
$\{x_1, x_2, \dots, x_n\}$,

where $x_i = (x_{i1}, x_{i2}, \dots, x_{ir})$ is a vector in a real-valued space
$X \subseteq \mathbb{R}^r$, and $r$ is the number of attributes (dimensions) in the data.
The k-means algorithm partitions the given data into k clusters.
Each cluster has a cluster center, called the centroid.
k is specified by the user.

Stopping/convergence criterion

1. No (or minimum) re-assignments of data points to different clusters,
2. No (or minimum) change of centroids, or
3. Minimum decrease in the sum of squared error (SSE),

$$SSE = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2 \qquad (1)$$

where $C_j$ is the $j$th cluster, $m_j$ is the centroid of cluster $C_j$ (the mean
vector of all the data points in $C_j$), and $\mathrm{dist}(x, m_j)$ is the
distance between data point $x$ and centroid $m_j$.
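
Equation (1) translates almost directly into code. The sketch below assumes Euclidean distance for dist and uses two small hand-made clusters whose centroids are their mean vectors.

```python
import math

def sse(clusters, centroids):
    """Sum of squared distances from each point to its cluster's centroid."""
    return sum(
        math.dist(x, m) ** 2
        for cluster, m in zip(clusters, centroids)
        for x in cluster
    )

# Hypothetical clusters C_1, C_2 and their centroids m_1, m_2.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(5.0, 5.0), (7.0, 5.0)]]
centroids = [(0.0, 1.0), (6.0, 5.0)]
print(sse(clusters, centroids))  # 4.0 (every point is at distance 1)
```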
An Example

An Example Distance Function

Strengths of k-means
Strengths:
• Simple: easy to understand and to implement.
• Efficient: the time complexity is O(tkn), where n is the number of data points,
k is the number of clusters, and t is the number of iterations.
• Since both k and t are usually small, k-means is considered a linear algorithm.

K-means is the most popular clustering algorithm.

Note that it terminates at a local optimum if SSE is used.
The global optimum is hard to find due to complexity.
Weaknesses of k-means
• The algorithm is only applicable if the mean is defined.
  • For categorical data, use k-modes, in which the centroid is represented by the most
frequent values.

• The user needs to specify k.

• The algorithm is sensitive to outliers.
  • Outliers are data points that are very far away from other data points.
  • Outliers could be errors in the data recording or some special data
points with very different values.

Weaknesses of k-means: To deal with outliers
• One method is to remove some data points in the clustering
process that are much further away from the centroids than other
data points.
  • To be safe, we may want to monitor these possible outliers over a few
iterations and then decide whether to remove them.

• Another method is to perform random sampling. Since in sampling
we only choose a small subset of the data points, the chance of
selecting an outlier is very small.
  • Assign the rest of the data points to the clusters by distance or
similarity comparison, or by classification.

Weaknesses of k-means
• The algorithm is sensitive to initial seeds.
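
A small experiment illustrates this sensitivity. The sketch below runs a compact k-means (Euclidean distance assumed) twice on the same four hypothetical points, once from each pair of hand-picked seeds, and reports the final SSE. The second run settles in a poorer local optimum.

```python
import math

def k_means_sse(points, centroids, max_iter=100):
    """Run k-means from the given seeds and return the final SSE."""
    for _ in range(max_iter):
        # Assign every point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Recompute centroids; an empty cluster keeps its old centroid.
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # converged: centroids no longer move
            break
        centroids = updated
    return sum(
        math.dist(p, m) ** 2
        for members, m in zip(clusters, centroids)
        for p in members
    )

# Four corners of a 4 x 1 rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good = k_means_sse(points, [(0.0, 0.0), (4.0, 0.0)])  # seeds on opposite sides
bad = k_means_sse(points, [(0.0, 0.0), (0.0, 1.0)])   # seeds on the same side
print(good, bad)  # 1.0 16.0
```

A common remedy is to run the algorithm from several different seeds and keep the result with the lowest SSE.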

K-means Summary
Despite its weaknesses, k-means is still the most popular
algorithm due to its simplicity and efficiency.
Other clustering algorithms have their own lists of weaknesses.
There is no clear evidence that any other clustering algorithm
performs better in general,
although they may be more suitable for some specific types of
data or applications.

Comparing different clustering algorithms is a difficult task.
No one knows the correct clusters!

K-medoids
K-medoids is a classical partitioning technique that
clusters a data set of n objects into k clusters, where k is known a priori. It
is more robust to noise and outliers than k-means
because it minimizes a sum of pairwise dissimilarities instead of
a sum of squared Euclidean distances.

A medoid can be defined as the object of a cluster whose average
dissimilarity to all the objects in the cluster is minimal, i.e. it is the
most centrally located point in the cluster. The most common
realisation of k-medoid clustering is the Partitioning Around
Medoids (PAM) algorithm.
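
The definition of a medoid, and its robustness to outliers, can be illustrated with a short sketch. This is not the PAM algorithm itself, only the medoid computation for a single cluster, with Euclidean distance assumed as the dissimilarity and a made-up cluster that contains one outlier.

```python
import math

def medoid(cluster):
    """Return the member with the smallest total distance to all members.

    Minimizing the total is equivalent to minimizing the average.
    """
    return min(cluster, key=lambda c: sum(math.dist(c, p) for p in cluster))

def mean_point(cluster):
    """Mean vector of the cluster, as used by k-means."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

# Four nearby points plus one hypothetical outlier at (50, 50).
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (50.0, 50.0)]
print(medoid(cluster))      # (2.0, 2.0): stays inside the dense group
print(mean_point(cluster))  # roughly (11.2, 11.2): dragged toward the outlier
```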

Density-Based Clustering
The density-based clustering method connects highly dense
areas into clusters, so clusters of arbitrary shape can be
formed as long as the dense regions can be connected. The
algorithm identifies the different clusters in the dataset
by connecting areas of high density. The dense
areas in the data space are separated from each other by sparser
areas. These algorithms can have difficulty clustering the data
points if the dataset has varying densities or high dimensionality.

Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is
divided based on the probability that a data point belongs to a
particular distribution. The grouping is done by assuming some
distribution, most commonly the Gaussian distribution.

An example of this type is the Expectation-Maximization
clustering algorithm, which uses Gaussian Mixture Models (GMM).
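
As a rough illustration of Expectation-Maximization with a Gaussian mixture, the sketch below fits a mixture to one-dimensional data. It is a simplified version under several assumptions: the data are 1-D, the initial means are supplied by the caller, every component starts with variance 1 and an equal weight, and a small variance floor (`min_var`, a hypothetical parameter) guards against degenerate components.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, initial_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns means, variances, weights."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)
            weights[j] = nj / len(data)
    return means, variances, weights

# Hypothetical data drawn around 0 and around 10.
data = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
means, variances, weights = em_gmm_1d(data, initial_means=[1.0, 9.0])
print([round(m, 3) for m in means])
```

Unlike k-means, each point ends up with a probability of belonging to every cluster rather than a single hard label.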

Hierarchical Clustering
Hierarchical clustering is another unsupervised machine learning algorithm, which
is used to group unlabeled datasets into clusters; it is also known as
hierarchical cluster analysis or HCA. In this algorithm, we develop the hierarchy of
clusters in the form of a tree, and this tree-shaped structure is known as the
dendrogram. Sometimes the results of K-means clustering and hierarchical
clustering may look similar, but the two differ in how they work: in hierarchical
clustering there is no requirement to predetermine the number of clusters as we did
in the K-Means algorithm.

The hierarchical clustering technique has two approaches:

Agglomerative: Agglomerative is a bottom-up approach, in which the
algorithm starts by taking all data points as single clusters and merges
them until one cluster is left.

Divisive: The divisive algorithm is the reverse of the agglomerative
algorithm, as it is a top-down approach.
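
The agglomerative (bottom-up) approach can be sketched as below. Two choices here are assumptions made for the illustration: clusters are compared by single linkage (the distance between their two closest members), and merging stops at a requested number of clusters, which corresponds to cutting the dendrogram at that level instead of merging down to one cluster.

```python
import math

def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering down to n_clusters clusters."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Hypothetical 1-D points, written as 1-tuples.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
print(agglomerative(points, 3))
# [[(0.0,), (1.0,)], [(5.0,), (6.0,)], [(20.0,)]]
```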
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong
to more than one group or cluster. Each data object has a set of membership
coefficients, which depend on its degree of membership in a
cluster. The Fuzzy C-means algorithm is an example of this type of clustering;
it is sometimes also known as the Fuzzy k-means algorithm.
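
The membership coefficients can be illustrated with the standard Fuzzy C-means membership formula, in which a point's membership in a cluster falls as its distance to that cluster's center grows. The sketch below only computes memberships for fixed centers; it is not the full iterative algorithm. Euclidean distance, the fuzzifier value m = 2, and the two centers are assumptions for the example.

```python
import math

def memberships(point, centers, m=2.0):
    """Degree of membership of `point` in each cluster (values sum to 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    if 0.0 in dists:
        hits = dists.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]

# Hypothetical cluster centers.
centers = [(0.0, 0.0), (10.0, 0.0)]
u = memberships((2.0, 0.0), centers)
print([round(v, 3) for v in u])  # [0.941, 0.059]
```

The point at (2, 0) belongs mostly to the first cluster but keeps a small membership in the second, which is what distinguishes soft clustering from the hard assignments of k-means.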
