
Unit III: Clustering

What is Clustering ?

The task of grouping data points based on their similarity with each other is called Clustering or
Cluster Analysis. This method is defined under the branch of Unsupervised Learning, which aims at
gaining insights from unlabelled data points, that is, unlike supervised learning we don’t have a target
variable.

Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates the similarity based on a metric like Euclidean distance, Cosine similarity, Manhattan distance, etc., and then groups the points with the highest similarity scores together.

For example, in the graph given below, we can clearly see that there are 3 circular clusters forming on the basis of distance.

It is not necessary that the clusters formed are circular in shape; the shape of clusters can be arbitrary, and there are many algorithms that work well at detecting arbitrarily shaped clusters.

For example, in the graph given below we can see that the clusters formed are not circular in shape.

Types of Clustering

Broadly speaking, there are 2 types of clustering that can be performed to group similar data points:

• Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or it does not. For example, let's say there are 4 data points and we have to cluster them into 2 clusters; each data point will then belong either to cluster 1 or to cluster 2.

Data Point    Cluster
A             C1
B             C2
C             C2
D             C1
• Soft Clustering: In this type of clustering, instead of assigning each data point to exactly one cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For example, let's say there are 4 data points and we have to cluster them into 2 clusters; we then evaluate, for every data point, a probability of belonging to each of the two clusters (a small code sketch after the table below illustrates this).

Data Point    Probability of C1    Probability of C2
A             0.91                 0.09
B             0.30                 0.70
C             0.17                 0.83
D             1.00                 0.00
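As an illustration of soft clustering (not part of the original notes), the sketch below fits a Gaussian mixture model with scikit-learn and prints one probability per cluster for every point, much like the table above. The four 2-D points and the choice of 2 components are assumptions made only for the example.

# Hypothetical sketch: soft clustering with a Gaussian mixture model.
# The points A-D and the number of clusters (2) are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.2],   # A
              [4.0, 4.1],   # B
              [4.2, 3.9],   # C
              [0.9, 1.0]])  # D

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# predict_proba returns, for every point, the probability of belonging to
# each cluster -- the soft-clustering table above expressed in code.
print(gmm.predict_proba(X).round(2))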

Uses of Clustering

Before we begin with the types of clustering algorithms, we will go through the main use cases of clustering. Clustering algorithms are mainly used for:

• Market Segmentation – Businesses use clustering to group their customers and use targeted
advertisements to attract more audience.

• Market Basket Analysis – Shop owners analyze their sales and figure out which items are most often bought together by customers. For example, according to a well-known study in the USA, diapers and beer were often bought together by fathers.

• Social Network Analysis – Social media sites use your data to understand your browsing
behaviour and provide you with targeted friend recommendations or content
recommendations.

• Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic images like X-
rays.

• Anomaly Detection – Clustering can be used to find outliers in a real-time data stream or to flag potentially fraudulent transactions.

• Simplify working with large datasets – Each cluster is given a cluster ID after clustering is complete. An entire feature vector can then be condensed into its cluster ID. Clustering is effective when it can represent a complicated case with a straightforward cluster ID; applying the same principle, clustering can make complex datasets simpler.

There are many more use cases for clustering, but these are some of the major and common ones. Moving forward, we will discuss clustering algorithms that will help you perform the above tasks.

Types of Clustering Algorithms

At the surface level, clustering helps in the analysis of unstructured data. Graphing, the shortest
distance, and the density of the data points are a few of the elements that influence cluster formation.
Clustering is the process of determining how related the objects are based on a metric called the
similarity measure. Similarity metrics are easier to locate in smaller sets of features. It gets harder to
create similarity measures as the number of features increases. Depending on the type of clustering
algorithm being utilized in data mining, several techniques are employed to group the data from the
datasets. In this part, the clustering techniques are described. Various types of clustering algorithms
are:

1. Centroid-based Clustering (Partitioning methods)

2. Density-based Clustering (Model-based methods)

3. Connectivity-based Clustering (Hierarchical clustering)

4. Distribution-based Clustering

We will be going through each of these types in brief.

1. Centroid-based Clustering (Partitioning methods)

Partitioning methods are the simplest clustering algorithms. They group data points on the basis of their closeness. Generally, the similarity measures chosen for these algorithms are Euclidean distance, Manhattan distance or Minkowski distance. The dataset is separated into a predetermined number of clusters, and each cluster is referenced by a vector of values (its centre); each input data point is compared against these vectors and joins the cluster whose vector it is closest to.

The primary drawback of these algorithms is the requirement that we establish the number of clusters, "k", either intuitively or scientifically (using the Elbow Method) before the algorithm starts allocating the data points. Despite this, it is still the most popular type of clustering. K-means and K-medoids clustering are examples of this type of clustering.
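As a minimal sketch of centroid-based clustering (assuming scikit-learn; the synthetic data and the choice k=3 are illustrative assumptions, not values from the notes):

# Minimal sketch: centroid-based (partitioning) clustering with k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)   # one centroid (vector of values) per cluster
print(km.labels_[:10])       # cluster assignment of the first ten points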

2. Density-based Clustering (Model-based methods)

Density-based clustering, a model-based method, finds groups based on the density of data points.
Contrary to centroid-based clustering, which requires that the number of clusters be predefined and is
sensitive to initialization, density-based clustering determines the number of clusters automatically
and is less susceptible to beginning positions. They are great at handling clusters of different sizes and
forms, making them ideally suited for datasets with irregularly shaped or overlapping clusters. These
methods manage both dense and sparse data regions by focusing on local density and can distinguish
clusters with a variety of morphologies.

In contrast, centroid-based clustering, like k-means, has trouble finding arbitrarily shaped clusters. Because the number of clusters must be preset and the method is highly sensitive to the initial positioning of centroids, the outcomes can vary. Furthermore, the tendency of centroid-based approaches to
produce spherical or convex clusters restricts their capacity to handle complicated or irregularly
shaped clusters. In conclusion, density-based clustering overcomes the drawbacks of centroid-based
techniques by autonomously choosing cluster sizes, being resilient to initialization, and successfully
capturing clusters of various sizes and forms. The most popular density-based clustering algorithm
is DBSCAN.

3. Connectivity-based Clustering (Hierarchical clustering)

A method for assembling related data points into hierarchical clusters is called hierarchical clustering.
Each data point is initially taken into account as a separate cluster, which is subsequently combined
with the clusters that are the most similar to form one large cluster that contains all of the data points.

Think about how you may arrange a collection of items based on how similar they are. Each object
begins as its own cluster at the base of the tree when using hierarchical clustering, which creates a
dendrogram, a tree-like structure. The closest pairings of clusters are then combined into larger
clusters after the algorithm examines how similar the objects are to one another. When every object is
in one cluster at the top of the tree, the merging process has finished. Exploring various granularity
levels is one of the fun things about hierarchical clustering. To obtain a given number of clusters, you
can select to cut the dendrogram at a particular height. The more similar two objects are within a
cluster, the closer they are. It’s comparable to classifying items according to their family trees, where
the nearest relatives are clustered together and the wider branches signify more general connections.
There are 2 approaches for Hierarchical clustering:

• Divisive Clustering: It follows a top-down approach; here we consider all data points to be part of one big cluster, and then this cluster is divided into smaller groups.

• Agglomerative Clustering: It follows a bottom-up approach; here we consider each data point to be an individual cluster, and then these clusters are merged together until one big cluster containing all the data points remains.
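As a small sketch of agglomerative (bottom-up) hierarchical clustering, assuming SciPy/scikit-learn and synthetic data, the merge tree can be built with linkage and then cut at a chosen number of clusters; scipy.cluster.hierarchy.dendrogram could be used to plot the tree itself.

# Minimal sketch: agglomerative hierarchical clustering with Ward linkage.
# The data and the cut into 3 clusters are illustrative assumptions.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

Z = linkage(X, method="ward")                      # build the merge tree (dendrogram data)
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print(labels)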

4. Distribution-based Clustering

In distribution-based clustering, data points are grouped according to their likelihood of belonging to the same probability distribution (such as a Gaussian or binomial distribution) within the data. The data elements are grouped using a probability-based approach founded on statistical distributions: data objects with a higher likelihood of belonging to a distribution are included in the corresponding cluster. The further a data point is from a cluster's central point (which exists in every cluster), the less likely it is to be included in that cluster.

A notable drawback of density and boundary-based approaches is the need to specify the clusters a
priori for some algorithms, and primarily the definition of the cluster form for the bulk of algorithms.
There must be at least one tuning or hyper-parameter selected, and while doing so should be simple,
getting it wrong could have unanticipated repercussions. Distribution-based clustering has a definite
advantage over proximity and centroid-based clustering approaches in terms of flexibility, accuracy,
and cluster structure. The key issue is that, in order to avoid overfitting, many clustering methods only
work with simulated or manufactured data, or when the bulk of the data points certainly belong to a
preset distribution. The most popular distribution-based clustering algorithm is Gaussian Mixture
Model.
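As a hedged sketch of distribution-based clustering (assuming scikit-learn; the synthetic data and the candidate range of 1-6 components are assumptions), a Gaussian Mixture Model can be fitted for several component counts and the number of clusters chosen with an information criterion such as BIC:

# Minimal sketch: Gaussian Mixture Model clustering with BIC-based model selection.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, random_state=7)

bic = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=7).fit(X)
    bic[k] = gmm.bic(X)            # lower BIC = better fit/complexity trade-off

best_k = min(bic, key=bic.get)
print(best_k, sorted(bic.items()))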

Applications of Clustering in different fields:

1. Marketing: It can be used to characterize & discover customer segments for marketing
purposes.

2. Biology: It can be used for classification among different species of plants and animals.

3. Libraries: It is used in clustering different books on the basis of topics and information.

4. Insurance: It is used to understand customers and their policies and to identify fraud.

5. City Planning: It is used to make groups of houses and to study their values based on their
geographical locations and other factors present.

6. Earthquake studies: By clustering earthquake-affected areas, we can determine the dangerous zones.

7. Image Processing: Clustering can be used to group similar images together, classify images
based on content, and identify patterns in image data.

8. Genetics: Clustering is used to group genes that have similar expression patterns and identify
gene networks that work together in biological processes.

9. Finance: Clustering is used to identify market segments based on customer behavior, identify
patterns in stock market data, and analyze risk in investment portfolios.

10. Customer Service: Clustering is used to group customer inquiries and complaints into
categories, identify common issues, and develop targeted solutions.

11. Manufacturing: Clustering is used to group similar products together, optimize production
processes, and identify defects in manufacturing processes.

12. Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases,
which helps in making accurate diagnoses and identifying effective treatments.

13. Fraud detection: Clustering is used to identify suspicious patterns or anomalies in financial
transactions, which can help in detecting fraud or other financial crimes.

14. Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak hours,
routes, and speeds, which can help in improving transportation planning and infrastructure.

15. Social network analysis: Clustering is used to identify communities or groups within social
networks, which can help in understanding social behavior, influence, and trends.

16. Cybersecurity: Clustering is used to group similar patterns of network traffic or system
behavior, which can help in detecting and preventing cyberattacks.
17. Climate analysis: Clustering is used to group similar patterns of climate data, such as
temperature, precipitation, and wind, which can help in understanding climate change and its
impact on the environment.

18. Sports analysis: Clustering is used to group similar patterns of player or team performance
data, which can help in analyzing player or team strengths and weaknesses and making
strategic decisions.

19. Crime analysis: Clustering is used to group similar patterns of crime data, such as location,
time, and type, which can help in identifying crime hotspots, predicting future crime trends,
and improving crime prevention strategies.

Measures of Distance in Data Mining

Clustering consists of grouping objects that are similar to each other, and it can be used to decide whether two items are similar or dissimilar in their properties. In a data mining sense, the similarity measure is a distance whose dimensions describe object features. That means that if the distance between two data points is small, there is a high degree of similarity between the objects, and vice versa. Similarity is subjective and depends heavily on the context and application; for example, similarity among vegetables can be determined from their taste, size, colour, etc. Most clustering approaches use distance measures to assess the similarities or differences between a pair of objects. The most popular distance measures used are:

1. Euclidean Distance:

Euclidean distance is considered the traditional metric for problems with geometry. It can be simply explained as the ordinary distance between two points, and it is one of the most widely used distance measures in cluster analysis. One of the algorithms that uses this measure is K-Means. Mathematically, it computes the square root of the sum of squared differences between the coordinates of the two objects:

d(p, q) = d(q, p) = sqrt((q1 − p1)² + (q2 − p2)² + ⋯ + (qn − pn)²) = sqrt(Σ_{i=1}^{n} (qi − pi)²)

Figure: Euclidean distance between two points.

2. Manhattan Distance:

This determines the absolute difference between the pair of coordinates. Suppose we have two points P and Q; to determine the distance between these points we simply calculate the perpendicular distances of the points from the X-axis and Y-axis. In a plane with P at coordinate (x1, y1) and Q at (x2, y2):

Manhattan distance between P and Q = |x1 − x2| + |y1 − y2|

In the corresponding figure, the total length of the red (axis-parallel) path gives the Manhattan distance between both points.
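The two measures can be written directly in code; the helper functions and example points below are assumptions made only for illustration.

# Minimal sketch: Euclidean and Manhattan distance between two points p and q.
import numpy as np

def euclidean(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((q - p) ** 2))   # root of the summed squared differences

def manhattan(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(np.abs(q - p))           # sum of the absolute coordinate differences

print(euclidean([1, 2], [4, 6]))   # 5.0
print(manhattan([1, 2], [4, 6]))   # 7.0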

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning and data science. In this topic, we will learn what the K-means clustering algorithm is and how the algorithm works, along with a Python sketch of k-means clustering.

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.

o Assigns each data point to its closest k-center. The data points nearest to a particular k-center form a cluster.

Hence each cluster contains data points with some commonalities and is kept apart from the other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as initial centroids. (They may be points other than those in the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurs, go to Step-4; otherwise go to FINISH.

Step-7: The model is ready.
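The steps above can be expressed as a short sketch in plain NumPy; the data, the value of k and the iteration cap are assumptions, and empty clusters are not handled, so this is an illustration rather than a production implementation.

# Minimal sketch of the K-Means steps: random centroids, assign, recompute, repeat.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # Step 2: random initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Steps 3/5: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # Step 6: no change -> finish
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
centroids, labels = kmeans(X, k=2)
print(centroids)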

Let's understand the above steps by considering the visual plots:


Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:

o Let's take the number of clusters, i.e., K=2, to identify the dataset and put the points into different clusters. It means that here we will try to group the dataset into two different clusters.

o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by applying the mathematics studied earlier for calculating the distance between two points. So, we will draw a median line between both centroids. Consider the below image:

From the above image, it is clear that points on the left side of the line are nearer to the K1 or blue centroid, and points to the right of the line are closer to the yellow centroid. Let's colour them blue and yellow for clear visualization.
o As we need to find the closest clusters, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the centre of gravity of these clusters and find the new centroids as below:

o Next, we will reassign each data point to the new centroids. For this, we will repeat the same process of finding a median line. The median will be as in the below image:

From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the centre of gravity of the clusters, so the new centroids will be as shown in the below image:
o As we have got the new centroids, we will again draw the median line and reassign the data points. The image will be:

o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends upon the highly efficient clusters that it forms, but choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters; here we discuss the most commonly used method, the Elbow Method, for finding the number of clusters or value of K. A sketch of the method is given below:
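A minimal sketch of the Elbow Method (assuming scikit-learn, matplotlib and synthetic data; the candidate range 1-10 is an assumption) fits k-means for several values of K and plots the within-cluster sum of squares (inertia), looking for the "elbow" where the curve stops dropping sharply.

# Minimal sketch of the Elbow Method for choosing K.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("within-cluster sum of squares (inertia)")
plt.show()   # the 'elbow' of the curve suggests a suitable K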

K-Medoids clustering

K-Medoids and K-Means are two types of clustering mechanisms in partition clustering. Clustering is the process of breaking down an abstract group of data points/objects into classes of similar objects, such that all the objects in one cluster have similar traits; a group of n objects is broken down into k clusters based on their similarities.

Two statisticians, Leonard Kaufman, and Peter J. Rousseeuw came up with this method.

K-medoids is an unsupervised method for clustering unlabelled data. It is an improved version of the K-Means algorithm, mainly designed to deal with K-Means' sensitivity to outlier data. Compared to other partitioning algorithms, the algorithm is simple, fast, and easy to implement.

The partitioning will be carried out such that:

1. Each cluster must have at least one object.

2. An object must belong to only one cluster.

In the K-Means algorithm, given the value of k and unlabelled data:

1. Choose k random points (data points from the data set or some other points). These points are also called "centroids" or "means".

2. Assign all the data points in the data set to the closest centroid by applying a distance formula such as Euclidean distance, Manhattan distance, etc.

3. Now, choose new centroids by calculating the mean of all the data points in each cluster and go to step 2.

4. Continue step 3 until no data point changes classification between two iterations.

The problem with the K-Means algorithm is that it cannot handle outlier data well. An outlier is a point very different from the rest of the points. Outlier data points tend to show up in a cluster of their own and attract other clusters to merge with them, since an outlier can shift the mean of a cluster considerably. Hence, K-Means clustering is highly affected by outlier data.

K-Medoids:

Medoid: A Medoid is a point in the cluster from which the sum of distances to other data points
is minimal.

(or)

A Medoid is a point in the cluster from which dissimilarities with all the other points in the
clusters are minimal.

Instead of centroids as reference points in K-Means algorithms, the K-Medoids algorithm takes a
Medoid as a reference point.

There are three types of algorithms for K-Medoids clustering:

1. PAM (Partitioning Around Medoids)

2. CLARA (Clustering LARge Applications)

3. CLARANS (Clustering Large Applications based upon RANdomized Search)

PAM is the most powerful of the three algorithms but has the disadvantage of time complexity. The following K-Medoids procedure is performed using PAM. In the further parts, we'll see what CLARA and CLARANS are.

Algorithm:

Given the value of k and unlabelled data:

1. Choose k random points from the data and assign these k points to k clusters. These are the initial medoids.

2. For all the remaining data points, calculate the distance from each medoid and assign each point to the cluster with the nearest medoid.

3. Calculate the total cost (the sum of the distances from all the data points to their medoids).

4. Select a random point as a new medoid and swap it with a previous medoid. Repeat steps 2 and 3.

5. If the total cost with the new medoid is less than that with the previous medoid, make the new medoid permanent and repeat step 4.

6. If the total cost with the new medoid is greater than the cost with the previous medoid, undo the swap and repeat step 4.

7. The repetitions continue until the medoids no longer change the classification of the data points.
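A simplified, randomized variant of the swap step above can be sketched in NumPy as follows; the Manhattan distance, the data, k and the swap budget are assumptions, and a full PAM implementation would evaluate every possible medoid/non-medoid swap rather than random ones.

# Minimal PAM-style K-Medoids sketch with random swap proposals.
import numpy as np

def total_cost(X, medoids):
    # Steps 2-3: Manhattan distance of every point to each medoid, nearest-medoid
    # assignment, and the total cost (sum of distances to the nearest medoid).
    d = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum(), d.argmin(axis=1)

def k_medoids(X, k, max_swaps=200, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))     # Step 1: initial medoids
    best_cost, labels = total_cost(X, medoids)
    for _ in range(max_swaps):                                    # Steps 4-7: propose swaps
        i = int(rng.integers(k))                                  # medoid position to replace
        candidate = int(rng.integers(len(X)))                     # random non-medoid point
        if candidate in medoids:
            continue
        trial = medoids.copy()
        trial[i] = candidate
        cost, trial_labels = total_cost(X, trial)
        if cost < best_cost:                                      # keep the swap only if cost drops
            medoids, best_cost, labels = trial, cost, trial_labels
    return medoids, labels, best_cost

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + [6, 6]])
print(k_medoids(X, k=2)[0])   # indices of the chosen medoid points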

Advantages of using K-Medoids:

1. Deals with noise and outlier data effectively

2. Easily implementable and simple to understand

3. Faster compared to other partitioning algorithms

Disadvantages:

1. Not suitable for Clustering arbitrarily shaped groups of data points.

2. As the initial medoids are chosen randomly, the results might vary based on the choice in
different runs.

K-Means and K-Medoids:

Common to both:
• Both methods are types of partition clustering.
• Both are unsupervised, iterative algorithms.
• Both deal with unlabelled data.
• Both group n objects into k clusters based on similar traits, where k is pre-defined.
• Inputs to both: unlabelled data and the value of k.

Differences:
• Metric of similarity: K-Means typically uses Euclidean distance; K-Medoids typically uses Manhattan distance.
• Clustering is done based on distance from centroids (K-Means) versus distance from medoids (K-Medoids).
• A centroid can be a data point or some other point in the cluster; a medoid is always an actual data point in the cluster.
• K-Means cannot cope well with outlier data; K-Medoids can manage outlier data too.
• In K-Means, outlier sensitivity can sometimes turn out to be useful; K-Medoids has a tendency to ignore meaningful clusters in outlier data.

Density-Based Spatial Clustering Of Applications With Noise (DBSCAN)

Clusters are dense regions in the data space, separated by regions of the lower density of points.
The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is that for
each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of
points.
Why DBSCAN?

Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding spherical-
shaped clusters or convex clusters. In other words, they are suitable only for compact and well-separated
clusters. Moreover, they are also severely affected by the presence of noise and outliers in the data.

Real-life data may contain irregularities, like:

1. Clusters can be of arbitrary shape such as those shown in the figure below.

2. Data may contain noise.


The figure above shows a data set containing non-convex shape clusters and outliers. Given such data,
the k-means algorithm has difficulties in identifying these clusters with arbitrary shapes.

Parameters Required For DBSCAN Algorithm

1. eps: It defines the neighbourhood around a data point, i.e. if the distance between two points is lower than or equal to 'eps' then they are considered neighbours. If the eps value is chosen too small, a large part of the data will be considered outliers. If it is chosen very large, the clusters will merge and the majority of the data points will end up in the same cluster. One way to find the eps value is based on the k-distance graph.

2. MinPts: Minimum number of neighbours (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, a minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D+1. MinPts should be chosen to be at least 3.

In this algorithm, we have 3 types of data points.

Core Point: A point is a core point if it has at least MinPts points (including itself) within distance eps.
Border Point: A point which has fewer than MinPts points within eps but lies in the neighbourhood of a core point.
Noise or outlier: A point which is neither a core point nor a border point.
Steps Used In DBSCAN Algorithm

1. Find all the neighbour points within eps of every point and identify the core points, i.e. points with at least MinPts neighbours.

2. For each core point, if it is not already assigned to a cluster, create a new cluster.

3. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
Two points a and b are said to be density-connected if there exists a point c which has a sufficient number of points in its neighbourhood and both a and b lie within eps distance of it. This is a chaining process: if b is a neighbour of c, c is a neighbour of d, and d is a neighbour of e, which in turn is a neighbour of a, then b is connected to a through this chain.

4. Iterate through the remaining unvisited points in the dataset. The points that do not belong to any cluster are noise.
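As a minimal sketch (assuming scikit-learn), DBSCAN can be run on a non-convex "two moons" dataset; the eps and min_samples values below are untuned assumptions.

# Minimal sketch: DBSCAN on non-convex data.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids; the label -1 marks noise/outlier points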
Difference Between DBSCAN and K-Means

• Number of clusters: In DBSCAN we need not specify the number of clusters; K-Means is very sensitive to the number of clusters, which needs to be specified in advance.

• Cluster shape: Clusters formed by DBSCAN can be of any arbitrary shape; clusters formed by K-Means are spherical or convex in shape.

• Outliers: DBSCAN works well with datasets having noise and outliers; K-Means does not work well with outlier data, and outliers can skew the clusters in K-Means to a very large extent.

• Parameters: In DBSCAN two parameters (eps and MinPts) are required for training the model; in K-Means only one parameter (k) is required for training.

Support Vector Machine (SVM) Algorithm

A Support Vector Machine (SVM) is a powerful machine learning algorithm widely used for both linear
and nonlinear classification, as well as regression and outlier detection tasks. SVMs are highly
adaptable, making them suitable for various applications such as text classification, image
classification, spam detection, handwriting identification, gene expression analysis, face detection,
and anomaly detection.

SVMs are particularly effective because they focus on finding the maximum separating
hyperplane between the different classes in the target feature, making them robust for both binary and
multiclass classification. In this outline, we will explore the Support Vector Machine (SVM) algorithm,
its applications, and how it effectively handles both linear and nonlinear classification, as well
as regression and outlier detection tasks.
Support Vector Machine

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for
both classification and regression tasks. While it can be applied to regression problems, SVM is best
suited for classification tasks. The primary objective of the SVM algorithm is to identify the optimal
hyperplane in an N-dimensional space that can effectively separate data points into different classes in
the feature space. The algorithm ensures that the margin between the closest points of different classes,
known as support vectors, is maximized.

The dimension of the hyperplane depends on the number of features. For instance, if there are two
input features, the hyperplane is simply a line, and if there are three input features, the hyperplane
becomes a 2-D plane. As the number of features increases beyond three, the complexity of visualizing
the hyperplane also increases.

Consider two independent variables, x1 and x2, and one dependent variable represented as either a
blue circle or a red circle.

• In this scenario, the hyperplane is a line because we are working with two features (x1 and x2).

• There are multiple lines (or hyperplanes) that can separate the data points.

• The challenge is to determine the best hyperplane that maximizes the separation margin
between the red and blue circles.

Linearly Separable Data points

From the figure above it’s very clear that there are multiple lines (our hyperplane here is a line because
we are considering only two input features x1, x2) that segregate our data points or do a classification
between red and blue circles. So how do we choose the best line or in general the best hyperplane that
segregates our data points?
How does Support Vector Machine Algorithm Work?

One reasonable choice for the best hyperplane in a Support Vector Machine (SVM) is the one that
maximizes the separation margin between the two classes. The maximum-margin hyperplane, also
referred to as the hard margin, is selected based on maximizing the distance between the hyperplane
and the nearest data point on each side.

Multiple hyperplanes separate the data from two classes

So we choose the hyperplane whose distance from it to the nearest data point on each side is
maximized. If such a hyperplane exists it is known as the maximum-margin hyperplane/hard margin. So
from the above figure, we choose L2. Let’s consider a scenario like shown below
Selecting hyperplane for data with outlier

Here we have one blue ball inside the boundary of the red balls. So how does SVM classify the data? It's simple! The blue ball within the boundary of the red ones is an outlier of the blue balls. The SVM algorithm has the ability to ignore the outlier and find the best hyperplane that maximizes the margin. SVM is robust to outliers.

Hyperplane which is the most optimized one

So in this type of data, what SVM does is find the maximum margin, as with the previous data sets, and in addition add a penalty each time a point crosses the margin. The margins in these cases are called soft margins. When there is a soft margin, the SVM tries to minimize (1/margin + λ·Σ penalty). Hinge loss is a commonly used penalty: if there are no violations, there is no hinge loss; if there are violations, the hinge loss is proportional to the distance of the violation.

Till now, we were talking about linearly separable data(the group of blue balls and red balls are
separable by a straight line/linear line). What to do if data are not linearly separable?

Original 1D dataset for classification

Say our data is as shown in the figure above. SVM solves this by creating a new variable using a kernel. We take a point xi on the line and create a new variable yi as a function of the distance of xi from the origin o. If we plot this, we get something like what is shown below.

Mapping 1D data to 2D to become able to separate the two classes


In this case, the new variable y is created as a function of distance from the origin. A non-linear function
that creates a new variable is referred to as a kernel.

Support Vector Machine Terminology


Hyperplane: The hyperplane is the decision boundary used to separate data points of different classes in a feature space. For linear classification, this is a linear equation represented as w^T x + b = 0.

Support Vectors: Support vectors are the closest data points to the hyperplane. These points are critical in determining the hyperplane and the margin in Support Vector Machine (SVM).

Margin: The margin refers to the distance between the support vectors and the hyperplane. The primary goal of the SVM algorithm is to maximize this margin, as a wider margin typically results in better classification performance.

Kernel: The kernel is a mathematical function used in SVM to map input data into a higher-dimensional feature space. This allows the SVM to find a hyperplane in cases where data points are not linearly separable in the original space. Common kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid.

Hard Margin: A hard margin refers to the maximum-margin hyperplane that perfectly
separates the data points of different classes without any misclassifications.

Soft Margin: When data contains outliers or is not perfectly separable, SVM uses the
soft margin technique. This method introduces a slack variable for each data point
to allow some misclassifications while balancing between maximizing the margin
and minimizing violations.

C: The C parameter in SVM is a regularization term that balances margin


maximization and the penalty for misclassifications. A higher C value imposes a
stricter penalty for margin violations, leading to a smaller margin but fewer
misclassifications.

Hinge Loss: The hinge loss is a common loss function in SVMs. It penalizes
misclassified points or margin violations and is often combined with a regularization
term in the objective function.

Dual Problem: The dual problem in SVM involves solving for the Lagrange multipliers
associated with the support vectors. This formulation allows for the use of the
kernel trick and facilitates more efficient computation.
Mathematical Computation: SVM
Consider a binary classification problem with two classes, labeled as +1 and -1. We
have a training dataset consisting of input feature vectors X and their corresponding
class labels Y.
The equation for the linear hyperplane can be written as:

w^T x + b = 0

The vector w represents the normal vector to the hyperplane, i.e. the direction perpendicular to the hyperplane. The parameter b represents the offset or distance of the hyperplane from the origin along the normal vector w.
The distance between a data point x_i and the decision boundary can be calculated as:

d_i = (w^T x_i + b) / ||w||

where ||w|| represents the Euclidean norm of the weight vector w (the normal vector).
For a linear SVM classifier, the prediction is:

ŷ = 1 if w^T x + b ≥ 0
ŷ = 0 if w^T x + b < 0

Optimization:
For the hard-margin linear SVM classifier:

minimize over w, b:   (1/2) w^T w
subject to:           t_i (w^T x_i + b) ≥ 1   for i = 1, 2, 3, ⋯ , m

The target variable or label for the i-th training instance is denoted by t_i: t_i = −1 for negative instances (when y_i = 0) and t_i = 1 for positive instances (when y_i = 1). This is because we require a decision boundary that satisfies the constraint t_i (w^T x_i + b) ≥ 1.

For the soft-margin linear SVM classifier:

minimize over w, b:   (1/2) w^T w + C Σ_{i=1}^{m} ζ_i
subject to:           t_i (w^T x_i + b) ≥ 1 − ζ_i   and   ζ_i ≥ 0   for i = 1, 2, 3, ⋯ , m


Dual Problem: A dual problem of this optimisation problem, which requires locating the Lagrange multipliers related to the support vectors, can be used to solve the SVM. The optimal Lagrange multipliers α_i are the ones that maximize the following dual objective function:

maximize over α:   Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j t_i t_j K(x_i, x_j)

where α_i is the Lagrange multiplier associated with the i-th training sample and K(x_i, x_j) is the kernel function that computes the similarity between two samples x_i and x_j. The kernel allows SVM to handle nonlinear classification problems by implicitly mapping the samples into a higher-dimensional feature space. The term Σ α_i represents the sum of all Lagrange multipliers.

The SVM decision boundary can be described in terms of these optimal Lagrange multipliers and the support vectors once the dual problem has been solved. The training samples with α_i > 0 are the support vectors, and the decision function is given by:

f(x) = Σ_{i=1}^{m} α_i t_i K(x_i, x) + b

The offset b can be recovered from any support vector x_i, for which t_i (w^T x_i + b) = 1, i.e. b = t_i − w^T x_i (using t_i² = 1).

Types of Support Vector Machine


Based on the nature of the decision boundary, Support Vector Machines (SVM) can be
divided into two main parts:

Linear SVM: Linear SVMs use a linear decision boundary to separate the data points
of different classes. When the data can be precisely linearly separated, linear SVMs
are very suitable. This means that a single straight line (in 2D) or a hyperplane (in
higher dimensions) can entirely divide the data points into their respective classes.
A hyperplane that maximizes the margin between the classes is the decision
boundary.
Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel
functions, nonlinear SVMs can handle nonlinearly separable data. The original input
data is transformed by these kernel functions into a higher-dimensional feature
space, where the data points can be linearly separated. A linear SVM is used to
locate a nonlinear decision boundary in this modified space.
Popular kernel functions in SVM
The SVM kernel is a function that takes low-dimensional input space and transforms it
into higher-dimensional space, ie it converts nonseparable problems to separable
problems. It is mostly useful in non-linear separation problems. Simply put the kernel,
does some extremely complex data transformations and then finds out the process to
separate the data based on the labels or outputs defined.
Linear:        K(x_i, x_j) = x_i^T x_j
Polynomial:    K(x_i, x_j) = (γ x_i^T x_j + b)^N
Gaussian RBF:  K(x_i, x_j) = exp(−γ ||x_i − x_j||²)
Sigmoid:       K(x_i, x_j) = tanh(α x_i^T x_j + b)
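As a small sketch of the effect of the kernel (assuming scikit-learn; the dataset and the C/gamma values are illustrative assumptions), a linear kernel and an RBF kernel can be compared on data that is not linearly separable:

# Minimal sketch: linear vs RBF-kernel SVM on concentric-circle data.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale").fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))   # the RBF kernel should separate the circles far better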

Advantages and Disadvantages of Support Vector Machine (SVM)

Advantages of Support Vector Machine (SVM)


1. High-Dimensional Performance: SVM excels in high-dimensional spaces, making it
suitable for image classification and gene expression analysis.
2. Nonlinear Capability: Utilizing kernel functions like RBF and polynomial, SVM
effectively handles nonlinear relationships.
3. Outlier Resilience: The soft margin feature allows SVM to ignore outliers,
enhancing robustness in spam detection and anomaly detection.
4. Binary and Multiclass Support: SVM is effective for both binary classification and
multiclass classification, suitable for applications in text classification.
5. Memory Efficiency: SVM focuses on support vectors, making it memory efficient
compared to other algorithms.

Disadvantages of Support Vector Machine (SVM)

1. Slow Training: SVM can be slow for large datasets, affecting performance in data mining tasks.
2. Parameter Tuning Difficulty: Selecting the right kernel and adjusting parameters
like C requires careful tuning, impacting SVM algorithms.
3. Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes,
limiting effectiveness in real-world scenarios.
4. Limited Interpretability: The complexity of the hyperplane in higher dimensions
makes SVM less interpretable than other models.
5. Feature Scaling Sensitivity: Proper feature scaling is essential; otherwise, SVM
models may perform poorly.

What is Web Mining?

Web mining is the practice of sifting through the vast amount of data available on the World Wide Web to find and extract pertinent information as per requirements. One distinguishing feature of web mining is the wide range of data types involved: web pages are made up of text; they are connected by hyperlinks; and web server logs allow the monitoring of user behaviour. Because these different elements of the web call for different mining methods, web mining combines methods from data mining, machine learning, artificial intelligence, statistics, and information retrieval, making it an interdisciplinary field. Analyzing user behaviour and website traffic is one basic example of web mining.

Applications of Web Mining

Web mining is the process of discovering patterns, structures, and relationships in web data. It involves using data mining techniques to analyze web data and extract valuable insights. The applications of web mining are wide-ranging and include:

• Personalized marketing: Web mining can be used to analyze customer behavior on websites and social media platforms. This information can be used to create personalized marketing campaigns that target customers based on their interests and preferences.

• E-commerce: Web mining can be used to analyze customer behavior on e-commerce websites. This information can be used to improve the user experience and increase sales by recommending products based on customer preferences.
• Search engine optimization: Web mining can be used to analyze search engine queries
and search engine results pages (SERPs). This information can be used to improve the
visibility of websites in search engine results and increase traffic to the website.

• Fraud detection: Web mining can be used to detect fraudulent activity on websites.
This information can be used to prevent financial fraud, identity theft, and other types
of online fraud.

• Sentiment analysis: Web mining can be used to analyze social media data and extract
sentiment from posts, comments, and reviews. This information can be used to
understand customer sentiment towards products and services and make informed
business decisions.

• Web content analysis: Web mining can be used to analyze web content and extract
valuable information such as keywords, topics, and themes. This information can be
used to improve the relevance of web content and optimize search engine rankings.

• Customer service: Web mining can be used to analyze customer service interactions
on websites and social media platforms. This information can be used to improve the
quality of customer service and identify areas for improvement.

• Healthcare: Web mining can be used to analyze health-related websites and extract
valuable information about diseases, treatments, and medications. This information
can be used to improve the quality of healthcare and inform medical research.

Process of Web Mining

Web mining can be broadly divided into three different types of mining techniques: Web Content Mining, Web Structure Mining, and Web Usage Mining. These are explained below.

Categories of Web Mining

• Web Content Mining: Web content mining is the application of extracting useful information from the content of web documents. Web content consists of several types of data: text, images, audio, video, etc. Content data is the collection of facts that a web page is designed to convey, and it can provide effective and interesting patterns about user needs. Mining text documents draws on text mining, machine learning and natural language processing, so this mining is also known as text mining. This type of mining performs scanning and mining of the text, images and groups of web pages according to the content of the input.

• Web Structure Mining: Web structure mining is the application of discovering structure information from the web. The structure of the web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Structure mining basically shows the structured summary of a particular website. It identifies relationships between web pages linked by information or by direct link connections. Web structure mining can be very useful for determining the connection between two commercial websites.

• Web Usage Mining: Web usage mining is the application of identifying or discovering interesting usage patterns from large data sets, and these patterns enable us to understand user behaviour. In web usage mining, users' access data on the web is collected in the form of logs, so web usage mining is also called log mining.

Comparison between Data Mining and Web Mining

• Definition: Data Mining is the process that attempts to discover patterns and hidden knowledge in large data sets in any system. Web Mining is the process of applying data mining techniques to automatically discover and extract information from web documents.

• Application: Data Mining is very useful for web page analysis. Web Mining is very useful for a particular website and e-services.

• Target Users: Data Mining – data scientists and data engineers. Web Mining – data scientists along with data analysts.

• Structure: In Data Mining, information is obtained from an explicit structure. In Web Mining, information is obtained from structured, unstructured and semi-structured web pages.

• Problem Type: Data Mining – clustering, classification, regression, prediction, optimization and control. Web Mining – web content mining, web structure mining.

• Tools: Data Mining includes tools like machine learning algorithms. Special tools for web mining are Scrapy, PageRank and Apache logs.

• Skills: Data Mining includes approaches for data cleansing, machine learning algorithms, statistics and probability. Web Mining includes application-level knowledge, data engineering, and mathematical modules like statistics and probability.

What is Text Mining?

Text mining is a component of data mining that deals specifically with unstructured
text data. It involves the use of natural language processing (NLP) techniques to
extract useful information and insights from large amounts of unstructured text
data. Text mining can be used as a preprocessing step for data mining or as a
standalone process for specific tasks.
Text Mining in Data Mining?

Text mining in data mining is mostly used to transform unstructured text data into structured data that can then be used for data mining tasks such as classification, clustering, and association rule mining. This allows organizations to gain insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.

Text Mining Process

Conventional Process of Text Mining

• Gathering unstructured information from various sources available in various document formats, for example plain text, web pages, PDF records, etc.

• Pre-processing and data cleansing tasks are performed to identify and eliminate inconsistencies in the data. The data cleansing process makes sure to capture the genuine text; it includes removing stop words, stemming (the process of identifying the root of a word) and indexing the data.

• Processing and controlling tasks are applied to review and further clean the data set.

• Pattern analysis is implemented in Management Information System.


• Information processed in the above steps is utilized to extract important and
applicable data for a powerful and convenient decision-making process and trend
analysis.

Common Methods for Analyzing Text Mining

• Text Summarization: To automatically extract partial content that reflects the whole content of a text.

• Text Categorization: To assign a category to a text from among categories predefined by users.

• Text Clustering: To segment texts into several clusters, depending on their substantial (topical) relevance; a minimal code sketch follows this list.
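As a minimal sketch of text clustering (assuming scikit-learn; the tiny document list and k=2 are assumptions), documents can be turned into TF-IDF vectors and clustered with k-means:

# Minimal sketch: text clustering with TF-IDF features and k-means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the stock market fell sharply today",
    "investors worry about interest rates",
    "the team won the championship game",
    "the striker scored two goals in the match",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents about the same topic should share a cluster id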

Text Mining Applications

• Digital Library: Various text mining strategies and tools are used to extract patterns and trends from journals and proceedings stored in text database repositories. These information resources help in research. Libraries are a good resource for text data in digital form; they give a novel technique for getting useful data in such a way that makes it possible to access millions of records online.
The Greenstone international digital library, which supports numerous languages and multilingual interfaces, provides a flexible method for extracting documents in various formats, i.e. Microsoft Word, PDF, PostScript, HTML, scripting languages, and email. It additionally supports the extraction of audiovisual and image formats along with text documents. Text mining processes perform different activities such as document collection, selection, enrichment, information extraction, entity handling, and producing summarizations.

• Academic and Research Field: In the education field, different text-mining tools and strategies are utilized to examine the educational patterns in a specific region or research field. The main purpose of using text mining in the research field is to help discover and organize research papers and relevant material from various fields on one platform.
For this, k-means clustering and other strategies help to distinguish the properties of significant data. Also, student performance in various subjects can be assessed, and how various qualities impact the selection of subjects can be evaluated by this mining.

• Life Science: The life science and healthcare industries are producing an enormous volume of textual and numerical data regarding patient records, sicknesses, medicines, symptoms, and treatments of diseases. It is a major issue to filter relevant data and text in order to make decisions from a biological data repository. Clinical records contain variable data which is unpredictable and lengthy, and text mining can help to manage such data. Text mining is also used in biomarker discovery, the pharmaceutical industry, clinical trial analysis, clinical studies, and patent competitive intelligence.

• Social-Media: Text mining is accessible for dissecting and analyzing web-based media
applications to monitor and investigate online content like the plain text from internet
news, web journals, emails, blogs, etc. Text mining devices help to distinguish and
investigate the number of posts, likes, and followers on the web-based media network.
This kind of analysis shows individuals’ responses to various posts, and news and how
it spread around. It shows the behavior of people who belong to a specific age group
and variations in views about the same post.

• Business Intelligence: Text mining plays an important role in business intelligence, helping different organizations and enterprises analyze their customers and competitors to make better decisions. It gives an accurate understanding of the business and provides information on how to improve consumer satisfaction and gain competitive benefits, using text mining tools such as IBM Text Analytics.
This mining can be used in the telecom sector, commerce, and customer chain management systems.

Advantages of Text Mining


• Large Amounts of Data: Text mining allows organizations to extract insights from large amounts of unstructured text data.

• Variety of Applications: Text mining has a wide range of applications, including sentiment analysis, named entity recognition, and topic modeling.

• Improved Decision Making: The extracted insights support faster, better-informed decisions.

• Cost-effective: Text mining can be a cost-effective way to process text, as it eliminates the need for manual data entry.

Disadvantages of Text Mining

• Complexity: Text mining can be a complex process requiring advanced skills in natural
language processing and machine learning.

• Quality of Data: The quality of text data can vary, affecting the accuracy of the insights
extracted from text mining.

• High Computational Cost: Text mining requires high computational resources, and it
may be difficult for smaller organizations to afford the technology.

• Limited to Text Data: Text mining is limited to extracting insights from unstructured
text data and cannot be used with other data types.

• Noise in text mining results: Text mining of documents may result in mistakes. It’s
possible to find false links or to miss others. In most situations, if the noise (error rate)
is sufficiently low, the benefits of automation exceed the chance of a larger mistake
than that produced by a human reader.

• Lack of transparency: Text mining is frequently viewed as a mysterious process in which large corpora of text documents are input and new information is produced. Text mining is in fact opaque when researchers lack the technical know-how or expertise to comprehend how it operates, or when they lack access to the corpora or text mining tools.
Procedures for Analyzing Text Mining

Text Mining Techniques

Information Retrieval

In the process of information retrieval, we try to process the available documents and text data into a structured form so that we can apply different pattern recognition and analytical processes. It is a process of extracting relevant and associated patterns according to a given set of words or text documents.
For this, we use processes such as tokenization of the document and stemming, in which we try to extract the base word, or let's say the root word, present there.

Information Extraction

It is a process of extracting meaningful words from documents.

• Feature Extraction – In this process, we try to develop some new features from
existing ones. This objective can be achieved by parsing an existing feature or
combining two or more features based on some mathematical operation.

• Feature Selection – In this process, we try to reduce the dimensionality of the dataset
which is generally a common issue while dealing with the text data by selecting a
subset of features from the whole dataset.

Natural Language Processing

Natural Language Processing includes tasks that are accomplished by using Machine
Learning and Deep Learning methodologies. It concerns the automatic processing
and analysis of unstructured text information.

• Named Entity Recognition (NER): Identifying and classifying named entities such as
people, organizations, and locations in text data.

• Sentiment Analysis: Identifying and extracting the sentiment (e.g. positive, negative,
neutral) of text data.

• Text Summarization: Creating a condensed version of a text document that captures the main points.
