Cluster Learning
How does Machine Learning work?
A machine learning system builds prediction models: it learns from previous
data and predicts the output for new data whenever it receives it. The more
data that is available, the better the model that can be built and the more
accurate its predictions become.
Classification of Machine Learning
At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
In supervised learning, labelled sample data are provided to the machine
learning system for training, and the system then predicts the output based on
the training data. The system uses the labelled data to build a model that
understands the datasets and learns about each of them. After training and
processing are done, we test the model with sample data to see whether it can
accurately predict the output.
The objective of supervised learning is to map the input data to the output
data. Supervised learning depends on supervision: it is comparable to a student
learning under the guidance of a teacher. Spam filtering is an example of
supervised learning.
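As a concrete illustration of the spam-filtering example, the following is a
minimal supervised-learning sketch. It assumes scikit-learn is installed, and
the tiny messages and labels are invented purely for illustration.

# Minimal supervised-learning sketch for spam filtering (illustrative data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "Win a free prize now",           # spam
    "Meeting rescheduled to Monday",  # not spam
    "Claim your free reward today",   # spam
    "Lunch at noon tomorrow?",        # not spam
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Turn the labelled text into word-count features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Train (supervise) the model on the labelled data.
model = MultinomialNB()
model.fit(X, labels)

# Predict the output for new, unseen data.
new_message = vectorizer.transform(["Free prize waiting for you"])
print(model.predict(new_message))  # expected: [1] (spam)

The model only ever sees the labelled training messages; it is then judged on
text it has not seen before.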
2) Unsupervised Learning
In unsupervised learning, the system is given unlabelled data and must find
structure in it on its own. Common unsupervised tasks include:
1. Clustering: Clustering algorithms aim to group similar data points into clusters
based on some similarity metric. K-means clustering and hierarchical clustering are
examples of unsupervised clustering techniques.
2. Dimensionality Reduction: These techniques aim to reduce the number of
features (or dimensions) in the data while preserving its essential information.
Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding
(t-SNE) are examples of dimensionality reduction methods (see the sketch after
this list).
3. Association: Association rule learning is used to discover interesting relationships
between variables in large datasets, such as items that are frequently purchased
together.
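The following is a minimal dimensionality-reduction sketch with PCA, referenced
in item 2 above. It assumes scikit-learn and NumPy are available, and the random
4-dimensional data are only a stand-in.

# Project 4-dimensional points down to 2 dimensions with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))    # 100 samples, 4 features

pca = PCA(n_components=2)        # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component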
Supervised learning vs. unsupervised learning
Applications of Machine Learning
Machine learning is a buzzword in today's technology, and it is growing rapidly
day by day. We use machine learning in our daily lives, often without knowing it,
through applications such as Google Maps, Google Assistant, and Alexa.
Machine Learning Life Cycle
Machine learning gives computer systems the ability to learn automatically
without being explicitly programmed. The machine learning life cycle is a cyclic
process for building an efficient machine learning project.
The machine learning life cycle involves seven major steps, which are given below:
• Gathering data
• Data preparation
• Data wrangling
• Data analysis
• Train model
• Test model
• Deployment
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of
this step is to identify the different data sources and collect the data needed
for the problem, since data can come from various sources such as files,
databases, the internet, or mobile devices. It is one of the most important
steps of the life cycle: the quantity and quality of the collected data
determine the efficiency of the output, and the more data we have, the more
accurate the predictions can be.
By performing this task, we get a coherent set of data, also called a dataset,
which will be used in the further steps.
2. Data Preparation
After collecting the data, we need to prepare it for the further steps. Data
preparation is the step where we put our data into a suitable place and prepare
it for use in machine learning training.
In this step, we first put all the data together and then randomize its ordering.
Data exploration:
This is used to understand the nature of the data we have to work with. We need
to understand its characteristics, format, and quality. A better understanding
of the data leads to a more effective outcome. Here we look for correlations,
general trends, and outliers.
Data pre-processing:
The next step is pre-processing the data for analysis.
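A small sketch of this step with pandas (an assumption); the two tiny DataFrames
stand in for data collected from different sources, and the column names are
hypothetical.

# Combine data from different sources, randomize the ordering, then explore it.
import pandas as pd

part_a = pd.DataFrame({"age": [25, 32, 47], "income": [30000, 45000, 52000]})
part_b = pd.DataFrame({"age": [51, 29], "income": [61000, 38000]})

# Put all the data together...
data = pd.concat([part_a, part_b], ignore_index=True)

# ...then randomize the ordering of the rows.
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# Data exploration: characteristics, format, quality, trends, correlations.
data.info()               # column types and non-null counts
print(data.describe())    # general trends (mean, spread, potential outliers)
print(data.corr())        # correlations between numeric columns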
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable
format. It involves cleaning the data, selecting the variables to use, and
transforming the data into a proper format to make it more suitable for analysis
in the next step. It is one of the most important steps of the complete process,
because cleaning the data is required to address quality issues.
The data we have collected is not always of use to us, as some of it may not be
useful. In real-world applications, collected data may have various issues,
including:
• Missing values
• Duplicate data
• Invalid data
• Noise
So we use various filtering techniques to clean the data. It is essential to
detect and correct these issues, because they can negatively affect the quality
of the outcome.
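A data-wrangling sketch with pandas (an assumption), addressing the issues
listed above; the small DataFrame is invented so that each issue appears once.

# Clean missing values, duplicates, invalid values, and noise.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 25, np.nan, 230, 31, 45],                    # a missing and an invalid value
    "income": [30000, 30000, 42000, 50000, 39000, 1_000_000],   # last value is noisy
})

# Duplicate data: keep only the first occurrence of each record.
df = df.drop_duplicates()

# Missing values: fill the gap with the column median (or drop the row).
df["age"] = df["age"].fillna(df["age"].median())

# Invalid data: filter out values that are impossible for the domain.
df = df[(df["age"] >= 0) & (df["age"] <= 120)]

# Noise: clip extreme values to a reasonable range (simple smoothing).
df["income"] = df["income"].clip(upper=df["income"].quantile(0.95))

print(df)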
4. Data Analysis
The cleaned and prepared data is now passed on to the analysis step. The aim of
this step is to build a machine learning model that analyses the data using
various analytical techniques, and to review the outcome. It starts with
determining the type of problem, where we select a machine learning technique
such as classification, regression, cluster analysis, or association; we then
build the model using the prepared data and evaluate it.
5. Train Model
The next step is to train the model. In this step, we train the model to improve
its performance and obtain a better outcome for the problem.
We use datasets to train the model with various machine learning algorithms.
Training a model is required so that it can learn the various patterns, rules,
and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, we test the
model. In this step, we check the accuracy of our model by providing a test
dataset to it.
Testing the model determines the percentage accuracy of the model as per the
requirements of the project or problem.
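A sketch of the train and test steps using scikit-learn (an assumption); the
built-in iris dataset stands in for the prepared data from the earlier steps.

# Hold back part of the data so the test measures performance on unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)   # stand-in dataset for illustration

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 5. Train the model on the training portion.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# 6. Test the model: percentage accuracy on the held-out test dataset.
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2%}")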
7. Deployment
The last step of the machine learning life cycle is deployment, where we deploy
the model in a real-world system.
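One common way to do this, sketched below under the assumption that joblib and
scikit-learn are available, is to persist the trained model so that a production
system can load it and serve predictions; the file name is hypothetical.

# Save a trained model during development, load it in the production system.
import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

joblib.dump(model, "model.joblib")           # save during training

loaded_model = joblib.load("model.joblib")   # load inside the deployed system
print(loaded_model.predict(X[:5]))           # serve predictions on incoming data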
Clustering
Clustering is a technique for finding similarity groups in data, called
clusters. That is, it groups data instances that are similar to (near) each
other into one cluster, and data instances that are very different from (far
away from) each other into different clusters.
Real-Life Applications of Clustering
Aspects of clustering
A clustering algorithm:
• Partitioning clustering
• Density-based clustering
• Distribution model-based clustering
• Hierarchical clustering
• Fuzzy clustering
Clustering quality:
• Inter-cluster distance maximized
• Intra-cluster distance minimized
Partitioning Clustering
• Partitioning clustering divides the data into non-hierarchical groups. It is
also known as the centroid-based method. The most common example of
partitioning clustering is the K-means clustering algorithm.
• In this type, the dataset is divided into a set of k groups, where k is the
pre-defined number of groups. The cluster centers are created in such a way
that the distance from data points to their own cluster centroid is minimal
compared with the distance to the other cluster centroids.
K-means clustering
The working of the K-means algorithm is explained in the steps below:
Step 1: Select the number K to decide the number of clusters.
Step 2: Select K random points as the initial centroids (they need not be
points from the input dataset).
Step 3: Assign each data point to its closest centroid, which forms the K
predefined clusters.
Step 4: Calculate the variance and place the new centroid of each cluster.
Step 5: Repeat Step 3, i.e., reassign each data point to the new closest
centroid.
Step 6: If any reassignment occurred, go back to Step 4; otherwise finish.
Step 7: The model is ready.
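A from-scratch NumPy sketch that follows the steps above. It is a simplified
illustration under the stated steps, not a production implementation (empty
clusters are not handled); in practice scikit-learn's KMeans would normally be
used instead.

# Simplified k-means following Steps 2-6.
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop when the centroids (and assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Usage on a small synthetic dataset with two obvious groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = k_means(X, k=2)
print(centroids)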
K-means clustering
K-means is a partitional clustering algorithm.
Let the set of data points (or instances) D be {x1, x2, …, xn}, where each xi is
a vector in an r-dimensional real-valued space. The k-means algorithm partitions
D into k clusters, each with a cluster center called a centroid; k is specified
by the user.
Stopping/convergence criterion
The iterations stop when no (or only a minimal number of) data points are
reassigned to different clusters, i.e., when the centroids no longer change.
Strengths of k-means
Strengths:
• Simple: easy to understand and to implement.
• Efficient: its time complexity is O(tkn), where n is the number of data
points, k is the number of clusters, and t is the number of iterations. Since
k and t are usually small compared with n, k-means is effectively linear in
the number of data points.
Weaknesses of k-means: dealing with outliers
One method is to remove some data points in the clustering
process that are much further away from the centroids than other
data points.
To be safe, we may want to monitor these possible outliers over a few
iterations and then decide to remove them.
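A sketch of this idea: flag points whose distance to their assigned centroid is
much larger than is typical (here, more than 3 standard deviations above the
mean distance). scikit-learn's KMeans is assumed just to obtain the labels and
centroids, and the injected far-away point is invented for illustration.

# Detect possible outliers by their distance to their own centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               [[30.0, 30.0]]])          # one far-away point acting as an outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_own_centroid = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

threshold = dist_to_own_centroid.mean() + 3 * dist_to_own_centroid.std()
possible_outliers = np.where(dist_to_own_centroid > threshold)[0]
print("possible outliers:", possible_outliers)
# These points could be monitored over a few iterations before deciding
# whether to remove them and re-run the clustering without them.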
Weaknesses of k-means
• The algorithm is sensitive to initial seeds.
K-means Summary
Despite its weaknesses, k-means is still the most popular clustering algorithm
due to its simplicity and efficiency; other clustering algorithms have their own
lists of weaknesses.
There is no clear evidence that any other clustering algorithm performs better
in general, although other algorithms may be more suitable for some specific
types of data or applications.
K-medoids
K-medoids is a classical partitioning clustering technique that clusters a data
set of n objects into k clusters, with k known a priori. It is more robust to
noise and outliers than k-means because it minimizes a sum of pairwise
dissimilarities instead of a sum of squared Euclidean distances, and because
each cluster center (medoid) is an actual data point.
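A simplified k-medoids sketch in NumPy, alternating between assignment and
medoid update. It is only meant to show the idea of minimizing pairwise
dissimilarities with actual data points as cluster centers; real projects would
more likely use a dedicated implementation such as the PAM algorithm.

# Simplified k-medoids with Manhattan dissimilarities.
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    # Pairwise (Manhattan) dissimilarities between all points.
    dissim = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    for _ in range(max_iter):
        # Assign each point to its nearest medoid.
        labels = dissim[:, medoid_idx].argmin(axis=1)
        # For each cluster, pick the member that minimizes the total
        # dissimilarity to all other members (the new medoid).
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                costs = dissim[np.ix_(members, members)].sum(axis=1)
                new_idx[j] = members[costs.argmin()]
        if np.array_equal(np.sort(new_idx), np.sort(medoid_idx)):
            break
        medoid_idx = new_idx
    return labels, X[medoid_idx]

# Usage on a small synthetic dataset.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
labels, medoids = k_medoids(X, k=2)
print(medoids)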
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters,
and arbitrarily shaped distributions can be formed as long as the dense regions
are connected. The algorithm works by identifying different clusters in the
dataset and connecting the areas of high density into clusters, while the dense
areas in the data space are separated from each other by sparser areas. These
algorithms can have difficulty clustering the data points if the dataset has
varying densities or high dimensionality.
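A density-based sketch using DBSCAN from scikit-learn (an assumption). Points
in sparse regions receive the label -1 (noise), so arbitrarily shaped dense
regions become clusters; the two-moons data are synthetic.

# DBSCAN on two crescent-shaped clusters.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("clusters found:", len(set(labels) - {-1}))
print("noise points:", np.sum(labels == -1))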
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on
the probability that each data point belongs to a particular distribution. The
grouping is done by assuming some distributions, most commonly the Gaussian
distribution.
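A distribution-model-based sketch using a Gaussian mixture model from
scikit-learn (an assumption). Each point is assigned to the Gaussian component
it most probably belongs to, and soft membership probabilities are also
available; the two-blob data are synthetic.

# Fit a two-component Gaussian mixture and read off memberships.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
hard_labels = gmm.predict(X)        # most likely component per point
soft_probs = gmm.predict_proba(X)   # probability of belonging to each component
print(soft_probs[:3].round(3))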
Hierarchical Clustering
Hierarchical clustering is another unsupervised machine learning algorithm used
to group unlabelled datasets into clusters; it is also known as hierarchical
cluster analysis (HCA). In this algorithm, we develop the hierarchy of clusters
in the form of a tree, and this tree-shaped structure is known as the
dendrogram. Sometimes the results of K-means clustering and hierarchical
clustering may look similar, but they differ in how they work, and there is no
requirement to predetermine the number of clusters as there is in the K-means
algorithm.
The hierarchical clustering technique has two approaches:
Agglomerative: Agglomerative clustering is a bottom-up approach, in which the
algorithm starts by taking every data point as a single cluster and keeps
merging clusters until only one cluster is left.
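An agglomerative (bottom-up) sketch using scikit-learn and SciPy (both
assumptions). The dendrogram is the tree-shaped structure described above;
cutting it at a chosen height yields clusters without fixing the number of
clusters in advance, and the two-blob data are synthetic.

# Agglomerative clustering plus the full merge tree behind the dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])

# Flat clusters obtained by stopping the bottom-up merging at 2 clusters.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# Build the full merge tree; dendrogram(merge_tree) can be plotted with matplotlib.
merge_tree = linkage(X, method="ward")
print(merge_tree.shape)  # (n_samples - 1, 4): one row per merge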