Unit III Clustering
What is Clustering?
The task of grouping data points based on their similarity with each other is called Clustering or
Cluster Analysis. This method is defined under the branch of Unsupervised Learning, which aims at
gaining insights from unlabelled data points, that is, unlike supervised learning we don’t have a target
variable.
Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates similarity based on a metric like Euclidean distance, Cosine similarity, Manhattan distance, etc., and then groups the points with the highest similarity scores together.
For example, in the graph given below, we can clearly see that there are 3 circular clusters forming on the basis of distance.
It is not necessary that the clusters formed must be circular in shape. The shape of clusters can be arbitrary, and there are many algorithms that work well at detecting arbitrarily shaped clusters.
For example, in the graph given below, the clusters formed are not circular in shape.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group similar data points:
• Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or not at all. For example, let's say there are 4 data points and we have to cluster them into 2 clusters. Each data point will then belong to either cluster 1 or cluster 2.
Data Point | Cluster
A | C1
B | C2
C | C2
D | C1
• Soft Clustering: In this type of clustering, instead of assigning each data point to exactly one cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For example, let's say there are 4 data points and we have to cluster them into 2 clusters. We then evaluate, for every data point, the probability of it belonging to each of the two clusters.
Data Point | Probability of C1 | Probability of C2
A | 0.91 | 0.09
B | 0.3 | 0.7
C | 0.17 | 0.83
D | 1 | 0
Uses of Clustering
Before we discuss the types of clustering algorithms, we will go through the main use cases of clustering. Clustering algorithms are majorly used for:
• Market Segmentation – Businesses use clustering to group their customers and use targeted
advertisements to attract more audience.
• Market Basket Analysis – Shop owners analyze their sales and figure out which items are majorly bought together by customers. For example, in the USA, according to a study, diapers and beer were often bought together by fathers.
• Social Network Analysis – Social media sites use your data to understand your browsing
behaviour and provide you with targeted friend recommendations or content
recommendations.
• Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic images like X-
rays.
• Simplify working with large datasets – Each cluster is given a cluster ID after clustering is complete. An entire feature set can then be reduced to its cluster ID. Clustering is effective when it can represent a complicated case with a straightforward cluster ID, and in the same way, clustering can make complex datasets simpler to work with.
There are many more use cases for clustering, but these are some of the major and most common ones. Moving forward, we will discuss clustering algorithms that will help you perform the above tasks.
At the surface level, clustering helps in the analysis of unstructured data. Graphing, the shortest
distance, and the density of the data points are a few of the elements that influence cluster formation.
Clustering is the process of determining how related the objects are based on a metric called the
similarity measure. Similarity metrics are easier to locate in smaller sets of features. It gets harder to
create similarity measures as the number of features increases. Depending on the type of clustering
algorithm being utilized in data mining, several techniques are employed to group the data from the
datasets. In this part, the clustering techniques are described. Various types of clustering algorithms
are:
1. Centroid-based Clustering (Partitioning Methods)
Partitioning methods are the simplest clustering algorithms. They group data points on the basis of their closeness. Generally, the similarity measures chosen for these algorithms are Euclidean distance, Manhattan distance or Minkowski distance. The dataset is separated into a predetermined number of clusters, and each cluster is referenced by a vector of values; each input data point is compared with these reference vectors and joins the cluster it is closest to.
The primary drawback for these algorithms is the requirement that we establish the number of
clusters, “k,” either intuitively or scientifically (using the Elbow Method) before any clustering machine
learning system starts allocating the data points. Despite this, it is still the most popular type of
clustering. K-means and K-medoids clustering are examples of this type of clustering.
2. Density-based Clustering
Density-based clustering, a model-based method, finds groups based on the density of data points.
Contrary to centroid-based clustering, which requires that the number of clusters be predefined and is
sensitive to initialization, density-based clustering determines the number of clusters automatically
and is less susceptible to beginning positions. They are great at handling clusters of different sizes and
forms, making them ideally suited for datasets with irregularly shaped or overlapping clusters. These
methods manage both dense and sparse data regions by focusing on local density and can distinguish
clusters with a variety of morphologies.
In contrast, centroid-based clustering, like k-means, has trouble finding arbitrarily shaped clusters. Due
to its preset number of cluster requirements and extreme sensitivity to the initial positioning of
centroids, the outcomes can vary. Furthermore, the tendency of centroid-based approaches to
produce spherical or convex clusters restricts their capacity to handle complicated or irregularly
shaped clusters. In conclusion, density-based clustering overcomes the drawbacks of centroid-based
techniques by autonomously choosing cluster sizes, being resilient to initialization, and successfully
capturing clusters of various sizes and forms. The most popular density-based clustering algorithm
is DBSCAN.
3. Hierarchical Clustering
A method for assembling related data points into hierarchical clusters is called hierarchical clustering.
Each data point is initially taken into account as a separate cluster, which is subsequently combined
with the clusters that are the most similar to form one large cluster that contains all of the data points.
Think about how you may arrange a collection of items based on how similar they are. Each object
begins as its own cluster at the base of the tree when using hierarchical clustering, which creates a
dendrogram, a tree-like structure. The closest pairings of clusters are then combined into larger
clusters after the algorithm examines how similar the objects are to one another. When every object is
in one cluster at the top of the tree, the merging process has finished. Exploring various granularity
levels is one of the fun things about hierarchical clustering. To obtain a given number of clusters, you
can select to cut the dendrogram at a particular height. The more similar two objects are within a
cluster, the closer they are. It’s comparable to classifying items according to their family trees, where
the nearest relatives are clustered together and the wider branches signify more general connections.
There are 2 approaches for Hierarchical clustering:
• Divisive Clustering: It follows a top-down approach; here we consider all data points to be part of one big cluster, and then this cluster is divided into smaller groups.
• Agglomerative Clustering: It follows a bottom-up approach; here we consider each data point to be an individual cluster, and then these clusters are merged together until one big cluster containing all data points remains (a sketch of agglomerative clustering follows below).
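To make the agglomerative (bottom-up) approach concrete, here is a minimal sketch using SciPy's hierarchical-clustering utilities; the toy points, the Ward linkage choice, and the cut at 2 clusters are illustrative assumptions rather than part of the notes above.

```python
# Minimal agglomerative hierarchical clustering sketch (assumes SciPy and matplotlib).
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = [[1, 2], [2, 2], [8, 8], [8, 9], [0, 1], [9, 8]]   # toy 2-D points

Z = linkage(X, method="ward")                    # repeatedly merge the closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")  # "cut" the dendrogram into 2 clusters
print(labels)

dendrogram(Z)                                    # the tree-like structure of the merges
plt.title("Dendrogram")
plt.show()
```

Cutting the dendrogram at a different height (or asking for a different number of clusters) gives the coarser or finer groupings described above.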
4. Distribution-based Clustering
In distribution-based clustering, data points are grouped according to their propensity to fall into the same probability distribution (such as a Gaussian, binomial, or other distribution) within the data. The data elements are grouped using a probability-based distribution that is based on statistical distributions; data objects that have a higher likelihood of belonging to a cluster are included in it. Every cluster has a central point, and the further a data point is from that central point, the less likely it is to be included in the cluster.
A notable drawback of density and boundary-based approaches is the need to specify the clusters a
priori for some algorithms, and primarily the definition of the cluster form for the bulk of algorithms.
There must be at least one tuning or hyper-parameter selected, and while doing so should be simple,
getting it wrong could have unanticipated repercussions. Distribution-based clustering has a definite
advantage over proximity and centroid-based clustering approaches in terms of flexibility, accuracy,
and cluster structure. The key issue is that, in order to avoid overfitting, many clustering methods only
work with simulated or manufactured data, or when the bulk of the data points certainly belong to a
preset distribution. The most popular distribution-based clustering algorithm is Gaussian Mixture
Model.
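As a hedged illustration of distribution-based (soft) clustering, the sketch below fits a Gaussian Mixture Model with scikit-learn; the synthetic blob data and the choice of three components are assumptions made for the example.

```python
# Minimal Gaussian Mixture Model sketch (assumes scikit-learn is installed).
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # illustrative data

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
hard_labels = gmm.predict(X)        # hard assignment: most likely component
soft_labels = gmm.predict_proba(X)  # soft assignment: probability per component
print(hard_labels[:5])
print(soft_labels[:5].round(2))
```

The predict_proba output is exactly the kind of per-cluster probability table shown in the soft-clustering example earlier in this unit.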
Applications of Clustering in Different Fields
1. Marketing: It can be used to characterize and discover customer segments for marketing purposes.
2. Biology: It can be used for classification among different species of plants and animals.
3. Libraries: It is used in clustering different books on the basis of topics and information.
4. Insurance: It is used to acknowledge the customers, their policies and identifying the frauds.
5. City Planning: It is used to make groups of houses and to study their values based on their
geographical locations and other factors present.
7. Image Processing: Clustering can be used to group similar images together, classify images
based on content, and identify patterns in image data.
8. Genetics: Clustering is used to group genes that have similar expression patterns and identify
gene networks that work together in biological processes.
9. Finance: Clustering is used to identify market segments based on customer behavior, identify
patterns in stock market data, and analyze risk in investment portfolios.
10. Customer Service: Clustering is used to group customer inquiries and complaints into
categories, identify common issues, and develop targeted solutions.
11. Manufacturing: Clustering is used to group similar products together, optimize production
processes, and identify defects in manufacturing processes.
12. Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases,
which helps in making accurate diagnoses and identifying effective treatments.
13. Fraud detection: Clustering is used to identify suspicious patterns or anomalies in financial
transactions, which can help in detecting fraud or other financial crimes.
14. Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak hours,
routes, and speeds, which can help in improving transportation planning and infrastructure.
15. Social network analysis: Clustering is used to identify communities or groups within social
networks, which can help in understanding social behavior, influence, and trends.
16. Cybersecurity: Clustering is used to group similar patterns of network traffic or system
behavior, which can help in detecting and preventing cyberattacks.
17. Climate analysis: Clustering is used to group similar patterns of climate data, such as
temperature, precipitation, and wind, which can help in understanding climate change and its
impact on the environment.
18. Sports analysis: Clustering is used to group similar patterns of player or team performance
data, which can help in analyzing player or team strengths and weaknesses and making
strategic decisions.
19. Crime analysis: Clustering is used to group similar patterns of crime data, such as location,
time, and type, which can help in identifying crime hotspots, predicting future crime trends,
and improving crime prevention strategies.
Similarity Measures
Clustering consists of grouping certain objects that are similar to each other; it can be used to decide whether two items are similar or dissimilar in their properties. In a data mining sense, the similarity measure is a distance with dimensions describing object features. That means that if the distance between two data points is small, there is a high degree of similarity between the objects, and vice versa. The similarity is subjective and depends heavily on the context and application. For example, similarity among vegetables can be determined from their taste, size, colour, etc. Most clustering approaches use distance measures to assess the similarities or differences between a pair of objects; the most popular distance measures used are:
1. Euclidean Distance:
Euclidean distance is considered the traditional metric for problems with geometry. It can be simply explained as the ordinary straight-line distance between two points. It is one of the most used metrics in cluster analysis, and one of the algorithms that uses this formula is K-means. Mathematically, it is computed as:
d(p, q) = d(q, p) = sqrt((q1 − p1)^2 + (q2 − p2)^2 + ⋯ + (qn − pn)^2) = sqrt(Σ_{i=1}^{n} (qi − pi)^2)
Figure: Euclidean Distance
2. Manhattan Distance:
This determines the absolute difference between the pair of coordinates. Suppose we have two points P and Q; to determine the distance between these points, we simply calculate the perpendicular distances of the points from the X-axis and Y-axis. In a plane with P at coordinate (x1, y1) and Q at (x2, y2):
Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|
Figure: Manhattan Distance (the total length of the red path between the two points gives the Manhattan distance).
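A tiny sketch, assuming NumPy is available, of how the Euclidean and Manhattan distances above can be computed for two illustrative points:

```python
# Euclidean vs Manhattan distance for two example points (assumes NumPy).
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((q - p) ** 2))   # square root of the sum of squared differences
manhattan = np.sum(np.abs(q - p))           # sum of absolute coordinate differences
print(euclidean, manhattan)                 # 5.0 7.0
```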
K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning or data science. In this topic, we will learn what the K-means clustering algorithm is and how the algorithm works, along with a Python sketch of k-means clustering.
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties. It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best values for the K center points or centroids through an iterative process.
o Assigns each data point to its closest k-center. The data points which are near a particular k-center form a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The working of the K-means Clustering Algorithm is explained in the following steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, go back to Step-4; otherwise, the model is ready.
o Let's take the number of clusters as K=2, to identify the dataset and put the points into different clusters. It means here we will try to group the dataset into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. Here we select two points as the k points, which are not part of our dataset.
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We compute this by applying the mathematics studied above for calculating the distance between two points, and we draw a median line between both centroids.
The points on the left side of this line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. We colour them blue and yellow for clear visualization.
o As we need to find the closest cluster, we repeat the process by choosing new centroids. To choose the new centroids, we compute the centre of gravity of the points currently assigned to each centroid and place the new centroids there.
o Next, we reassign each data point to its new closest centroid. For this, we repeat the same process of finding a median line. After redrawing the line, one yellow point falls on the left side of the line and two blue points fall to its right, so these three points are assigned to new centroids.
As reassignment has taken place, we again go to step-4, which is finding new centroids or K-points.
o We repeat the process by finding the centre of gravity of the clusters, which gives new centroids.
o With the new centroids, we again draw the median line and reassign the data points.
o Now there are no data points on the wrong side of the line, which means our model has converged.
As our model is ready, we can remove the assumed centroids, and we are left with the two final clusters.
The performance of the K-means clustering algorithm depends upon the quality of the clusters that it forms, and choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters or value of K; the most widely used is the Elbow Method, sketched below.
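A minimal sketch of K-Means together with the Elbow Method, assuming scikit-learn and matplotlib are installed; the synthetic blob data and the range K = 1..10 are illustrative assumptions.

```python
# K-Means plus the Elbow Method (assumes scikit-learn and matplotlib).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # illustrative data

wcss = []                                   # within-cluster sum of squares for each K
for k in range(1, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker="o")    # the "elbow" suggests a good value of K
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()
```

The K at which the curve stops dropping sharply (the elbow) is taken as the number of clusters.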
K-Medoids Clustering
K-Medoids and K-Means are two types of clustering mechanisms in partition clustering. Clustering is the process of breaking down an abstract group of data points/objects into classes of similar objects, such that all the objects in one cluster have similar traits; a group of n objects is broken down into k clusters based on their similarities.
Two statisticians, Leonard Kaufman and Peter J. Rousseeuw, came up with the K-Medoids method.
The K-Means algorithm works as follows:
1. Choose k random points (data points from the data set or some other points). These points are also called "centroids" or "means".
2. Assign all the data points in the data set to the closest centroid by applying a distance formula such as Euclidean distance, Manhattan distance, etc.
3. Now, choose new centroids by calculating the mean of all the data points in each cluster and go to step 2.
4. Continue step 3 until no data point changes classification between two iterations.
The problem with the K-Means algorithm is that it does not handle outlier data well. An outlier is a point that is very different from the rest of the points. Outliers either end up in their own cluster or attract other clusters to merge with them, and a single outlier can shift the mean of a cluster substantially. Hence, K-Means clustering is highly affected by outlier data.
K-Medoids:
Medoid: A Medoid is a point in the cluster from which the sum of distances to other data points
is minimal.
(or)
A Medoid is a point in the cluster from which dissimilarities with all the other points in the
clusters are minimal.
Instead of centroids as reference points in K-Means algorithms, the K-Medoids algorithm takes a
Medoid as a reference point.
PAM (Partitioning Around Medoids) is the most powerful of the three common K-Medoids algorithms (PAM, CLARA, and CLARANS) but has the disadvantage of high time complexity. The following K-Medoids clustering is performed using PAM; in further parts, we'll see what CLARA and CLARANS are.
Algorithm:
1. Choose k number of random points from the data and assign these k points to k number of
clusters. These are the initial medoids.
2. For all the remaining data points, calculate the distance from each medoid and assign it to the
cluster with the nearest medoid.
3. Calculate the total cost (the sum of the distances from all the data points to their medoids).
4. Select a random non-medoid point as a new medoid and swap it with one of the previous medoids. Repeat steps 2 and 3.
5. If the total cost with the new medoid is less than that with the previous medoid, make the new medoid permanent and repeat step 4.
6. If the total cost with the new medoid is greater than the cost with the previous medoid, undo the swap and repeat step 4.
7. The repetitions continue until new medoids produce no change in the classification of the data points. (A small sketch of this swap procedure follows below.)
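The following is a minimal, illustrative sketch of the swap procedure described in the steps above, written with NumPy; the random-swap search, the Manhattan distance choice, and the toy data are assumptions for the example, not a production K-Medoids implementation.

```python
# A minimal K-Medoids (PAM-style swap) sketch (assumes NumPy); for illustration only.
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: choose k random points from the data as the initial medoids.
    medoid_idx = rng.choice(n, size=k, replace=False)

    def total_cost(medoids):
        # Steps 2-3: assign each point to its nearest medoid (Manhattan distance)
        # and sum all point-to-medoid distances.
        d = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2)
        return d.min(axis=1).sum(), d.argmin(axis=1)

    best_cost, labels = total_cost(medoid_idx)
    for _ in range(n_iter):
        # Step 4: try swapping a randomly chosen medoid with a non-medoid point.
        trial = medoid_idx.copy()
        trial[rng.integers(k)] = rng.choice(np.setdiff1d(np.arange(n), medoid_idx))
        cost, lab = total_cost(trial)
        # Steps 5-6: keep the swap only if it lowers the total cost, otherwise undo it.
        if cost < best_cost:
            best_cost, medoid_idx, labels = cost, trial, lab
    return medoid_idx, labels

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
medoids, labels = k_medoids(X, k=2)
print("Medoid rows:", medoids, "labels:", labels)
```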
Disadvantages:
• As the initial medoids are chosen randomly, the results might vary based on that choice in different runs.
K-Means vs K-Medoids:
• K-Means: a centroid can be a data point or some other point in the cluster. K-Medoids: a medoid is always an actual data point in the cluster.
• K-Means: can't cope with outlier data. K-Medoids: can manage outlier data too.
• K-Means: sometimes, outlier sensitivity can turn out to be useful. K-Medoids: has a tendency to ignore meaningful clusters in outlier data.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Clusters are dense regions in the data space, separated by regions of lower point density. The DBSCAN algorithm is based on this intuitive notion of "clusters" and "noise". The key idea is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding spherical-
shaped clusters or convex clusters. In other words, they are suitable only for compact and well-separated
clusters. Moreover, they are also severely affected by the presence of noise and outliers in the data.
Real-life data, on the other hand, may contain clusters of arbitrary shape as well as noise.
The DBSCAN algorithm uses two parameters:
1. eps: It defines the neighborhood around a data point, i.e., if the distance between two points is lower than or equal to 'eps' then they are considered neighbors. If the eps value is chosen too small, a large part of the data will be considered outliers. If it is chosen very large, the clusters will merge and the majority of the data points will be in the same cluster. One way to find the eps value is based on the k-distance graph.
2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1, and MinPts must be at least 3.
1. Find all the neighbor points within eps and identify the core points, i.e., points with more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density-connected points and assign them to the same cluster as the core
point.
Points a and b are said to be density-connected if there exists a point c which has a sufficient number of points in its neighborhood and both a and b are within eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to
any cluster are noise.
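A minimal DBSCAN sketch, assuming scikit-learn is installed; the two-moons data and the eps/min_samples values are illustrative and would normally be tuned (e.g., via the k-distance graph mentioned above).

```python
# DBSCAN on non-spherical (two-moons) data (assumes scikit-learn).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # illustrative data

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                       # cluster IDs; -1 marks noise points
print("Clusters found:", len(set(labels) - {-1}))
print("Noise points:", list(labels).count(-1))
```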
A Support Vector Machine (SVM) is a powerful machine learning algorithm widely used for both linear
and nonlinear classification, as well as regression and outlier detection tasks. SVMs are highly
adaptable, making them suitable for various applications such as text classification, image
classification, spam detection, handwriting identification, gene expression analysis, face detection,
and anomaly detection.
SVMs are particularly effective because they focus on finding the maximum separating
hyperplane between the different classes in the target feature, making them robust for both binary and
multiclass classification. In this outline, we will explore the Support Vector Machine (SVM) algorithm,
its applications, and how it effectively handles both linear and nonlinear classification, as well
as regression and outlier detection tasks.
Support Vector Machine
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for
both classification and regression tasks. While it can be applied to regression problems, SVM is best
suited for classification tasks. The primary objective of the SVM algorithm is to identify the optimal
hyperplane in an N-dimensional space that can effectively separate data points into different classes in
the feature space. The algorithm ensures that the margin between the closest points of different classes,
known as support vectors, is maximized.
The dimension of the hyperplane depends on the number of features. For instance, if there are two
input features, the hyperplane is simply a line, and if there are three input features, the hyperplane
becomes a 2-D plane. As the number of features increases beyond three, the complexity of visualizing
the hyperplane also increases.
Consider two independent variables, x1 and x2, and one dependent variable represented as either a
blue circle or a red circle.
• In this scenario, the hyperplane is a line because we are working with two features (x1 and x2).
• There are multiple lines (or hyperplanes) that can separate the data points.
• The challenge is to determine the best hyperplane that maximizes the separation margin
between the red and blue circles.
From the figure above it’s very clear that there are multiple lines (our hyperplane here is a line because
we are considering only two input features x1, x2) that segregate our data points or do a classification
between red and blue circles. So how do we choose the best line or in general the best hyperplane that
segregates our data points?
How does Support Vector Machine Algorithm Work?
One reasonable choice for the best hyperplane in a Support Vector Machine (SVM) is the one that
maximizes the separation margin between the two classes. The maximum-margin hyperplane, also
referred to as the hard margin, is selected based on maximizing the distance between the hyperplane
and the nearest data point on each side.
So we choose the hyperplane whose distance from it to the nearest data point on each side is
maximized. If such a hyperplane exists it is known as the maximum-margin hyperplane/hard margin. So
from the above figure, we choose L2. Let’s consider a scenario like shown below
Selecting hyperplane for data with outlier
Here we have one blue ball inside the boundary of the red balls. So how does SVM classify the data? It's simple: the blue ball within the boundary of the red ones is an outlier of the blue balls. The SVM algorithm has the characteristic of ignoring such outliers and finding the best hyperplane that maximizes the margin, so SVM is robust to outliers.
For this kind of data, SVM finds the maximum margin as it did for the previous datasets, and in addition it adds a penalty each time a point crosses the margin. The margins in such cases are called soft margins. When there is a soft margin, the SVM tries to minimize (1/margin + λ·Σ penalty). Hinge loss is a commonly used penalty: if there are no violations there is no hinge loss, and if there are violations the hinge loss is proportional to the distance of the violation (see the small sketch below).
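As a small sketch of the hinge-loss penalty just described, assuming NumPy and labels y in {−1, +1}; the weight vector, bias, and sample point are made-up illustrative values.

```python
# Hinge loss for a single point: max(0, 1 - y * (w.x + b)) (assumes NumPy).
import numpy as np

w = np.array([0.5, -0.3])   # illustrative weights
b = 0.1                     # illustrative bias
x = np.array([1.0, 2.0])    # illustrative data point
y = 1                       # true label in {-1, +1}

margin = y * (np.dot(w, x) + b)
hinge_loss = max(0.0, 1.0 - margin)   # zero if no violation, grows with the violation
print(hinge_loss)
```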
Till now, we were talking about linearly separable data (where the group of blue balls and the group of red balls are separable by a straight line). What do we do if the data are not linearly separable?
Say our data is as shown in the figure above. SVM solves this by creating a new variable using a kernel. We call a point xi on the line and we create a new variable yi as a function of its distance from the origin o; if we plot this, we get something like what is shown below.
Margin: The margin refers to the distance between the support vectors and the hyperplane. The primary goal of the SVM algorithm is to maximize this margin, as a wider margin typically results in better classification performance.
Kernel: The kernel is a mathematical function used in SVM to map input data into
a higher-dimensional feature space. This allows the SVM to find a hyperplane in
cases where data points are not linearly separable in the original space. Common
kernel functions include linear, polynomial, radial basis function (RBF), and
sigmoid.
Hard Margin: A hard margin refers to the maximum-margin hyperplane that perfectly
separates the data points of different classes without any misclassifications.
Soft Margin: When data contains outliers or is not perfectly separable, SVM uses the
soft margin technique. This method introduces a slack variable for each data point
to allow some misclassifications while balancing between maximizing the margin
and minimizing violations.
Hinge Loss: The hinge loss is a common loss function in SVMs. It penalizes
misclassified points or margin violations and is often combined with a regularization
term in the objective function.
Dual Problem: The dual problem in SVM involves solving for the Lagrange multipliers
associated with the support vectors. This formulation allows for the use of the
kernel trick and facilitates more efficient computation.
Mathematical Computation: SVM
Consider a binary classification problem with two classes, labeled as +1 and -1. We
have a training dataset consisting of input feature vectors X and their corresponding
class labels Y.
The equation for the linear hyperplane can be written as:
w^T x + b = 0
The vector w represents the normal vector to the hyperplane, i.e., the direction perpendicular to the hyperplane. The parameter b represents the offset or distance of the hyperplane from the origin along the normal vector w.
The distance between a data point x_i and the decision boundary can be calculated as:
d_i = (w^T x_i + b) / ||w||
where ||w|| represents the Euclidean norm of the weight vector w (the normal vector).
For a linear SVM classifier, the prediction is:
ŷ = 1 if w^T x + b ≥ 0
ŷ = 0 if w^T x + b < 0
Optimization:
For the hard-margin linear SVM classifier:
minimize over w, b:   (1/2) w^T w
subject to:   t_i (w^T x_i + b) ≥ 1   for i = 1, 2, 3, ⋯, m
Here the target variable or label for the i-th training instance is denoted by t_i, with t_i = −1 for negative instances (when y_i = 0) and t_i = +1 for positive instances (when y_i = 1), because we require a decision boundary that satisfies the constraint t_i (w^T x_i + b) ≥ 1.
For the soft-margin linear SVM classifier, slack variables ζ_i are added to allow some violations:
minimize over w, b, ζ:   (1/2) w^T w + C Σ_{i=1}^{m} ζ_i
subject to:   t_i (w^T x_i + b) ≥ 1 − ζ_i  and  ζ_i ≥ 0   for i = 1, 2, 3, ⋯, m
The dual problem is:
maximize over α:   Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j t_i t_j K(x_i, x_j)
where α_i is the Lagrange multiplier associated with the i-th training sample, and K(x_i, x_j) is the kernel function that computes the similarity between two samples x_i and x_j. The kernel allows SVM to handle nonlinear classification problems by implicitly mapping the samples into a higher-dimensional feature space. The term Σ α_i represents the sum of all the Lagrange multipliers.
Once the dual problem has been solved and the optimal Lagrange multipliers have been found, the SVM decision boundary can be described in terms of these multipliers and the support vectors. The training samples with α_i > 0 are the support vectors, and the decision function is given by:
f(x) = Σ_{i=1}^{m} α_i t_i K(x_i, x) + b
where the offset b can be obtained from any support vector x_i using t_i (w^T x_i − b) = 1, i.e., b = w^T x_i − t_i.
Linear SVM: Linear SVMs use a linear decision boundary to separate the data points
of different classes. When the data can be precisely linearly separated, linear SVMs
are very suitable. This means that a single straight line (in 2D) or a hyperplane (in
higher dimensions) can entirely divide the data points into their respective classes.
A hyperplane that maximizes the margin between the classes is the decision
boundary.
Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel
functions, nonlinear SVMs can handle nonlinearly separable data. The original input
data is transformed by these kernel functions into a higher-dimensional feature
space, where the data points can be linearly separated. A linear SVM is used to
locate a nonlinear decision boundary in this modified space.
Popular kernel functions in SVM
The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e., it converts non-separable problems into separable problems. It is mostly useful in non-linear separation problems. Simply put, the kernel does some extremely complex data transformations and then finds the process to separate the data based on the defined labels or outputs.
Linear: K(w, x) = w^T x + b
Polynomial: K(w, x) = (γ w^T x + b)^N
Gaussian RBF: K(x_i, x_j) = exp(−γ ||x_i − x_j||^2)
Sigmoid: K(x_i, x_j) = tanh(α x_i^T x_j + b)
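To make the kernel choices concrete, here is a minimal sketch that trains scikit-learn SVC classifiers with each of the kernels listed above; the two-moons dataset, the C and gamma values, and the use of a StandardScaler pipeline are illustrative assumptions.

```python
# SVM classifiers with different kernels (assumes scikit-learn).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)    # illustrative data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # Feature scaling matters for SVMs, hence the StandardScaler in the pipeline.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
    clf.fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))
```

On data like this, the RBF kernel usually separates the classes best, which is exactly the nonlinear behaviour the kernel trick is meant to provide.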
Disadvantages of Support Vector Machines:
1. Slow Training: SVM can be slow for large datasets, affecting performance in data mining tasks.
2. Parameter Tuning Difficulty: Selecting the right kernel and adjusting parameters
like C requires careful tuning, impacting SVM algorithms.
3. Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes,
limiting effectiveness in real-world scenarios.
4. Limited Interpretability: The complexity of the hyperplane in higher dimensions
makes SVM less interpretable than other models.
5. Feature Scaling Sensitivity: Proper feature scaling is essential; otherwise, SVM
models may perform poorly.
Web Mining
Web mining is the practice of sifting through the vast amount of data available on the World Wide Web to find and extract pertinent information as per requirements. One unique feature of web mining is its ability to deliver a wide range of required data types in the mining process. There are various elements of the web that lead to diverse methods for the mining process: for example, web pages are made up of text, they are connected by hyperlinks, and web server logs allow for the monitoring of user behaviour. Combining methods from data mining, machine learning, artificial intelligence, statistics, and information retrieval, web mining is an interdisciplinary field. Analyzing user behaviour and website traffic is one basic example of web mining.
• Fraud detection: Web mining can be used to detect fraudulent activity on websites.
This information can be used to prevent financial fraud, identity theft, and other types
of online fraud.
• Sentiment analysis: Web mining can be used to analyze social media data and extract
sentiment from posts, comments, and reviews. This information can be used to
understand customer sentiment towards products and services and make informed
business decisions.
• Web content analysis: Web mining can be used to analyze web content and extract
valuable information such as keywords, topics, and themes. This information can be
used to improve the relevance of web content and optimize search engine rankings.
• Customer service: Web mining can be used to analyze customer service interactions
on websites and social media platforms. This information can be used to improve the
quality of customer service and identify areas for improvement.
• Healthcare: Web mining can be used to analyze health-related websites and extract
valuable information about diseases, treatments, and medications. This information
can be used to improve the quality of healthcare and inform medical research.
• Web Content Mining: Web content mining is the application of extracting useful information from the content of web documents. Web content consists of several types of data: text, images, audio, video, etc. Content data are the facts that a web page was designed to convey, and they can provide effective and interesting patterns about user needs. Mining of text documents is related to text mining, machine learning and natural language processing, so this mining is also known as text mining. This type of mining performs scanning and mining of the text, images and groups of web pages according to the content of the input.
• Web Usage Mining: Web usage mining is the application of identifying or discovering interesting usage patterns from large data sets, and these patterns enable you to understand user behaviour. In web usage mining, users' access data on the web is collected in the form of logs, so web usage mining is also called log mining (a tiny sketch of log analysis follows below).
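A tiny, illustrative sketch of the log-mining idea: counting how often each page is requested in web server logs. The log lines and their format are hypothetical examples, not real data.

```python
# Counting page requests per URL from (hypothetical) web server log lines.
from collections import Counter

log_lines = [
    '192.168.1.5 - - [10/Jan/2024:10:00:01] "GET /home HTTP/1.1" 200',
    '192.168.1.7 - - [10/Jan/2024:10:00:04] "GET /products HTTP/1.1" 200',
    '192.168.1.5 - - [10/Jan/2024:10:00:09] "GET /home HTTP/1.1" 200',
]

# Take the requested path (the part after GET) from each log line and count it.
page_counts = Counter(line.split('"')[1].split()[1] for line in log_lines)
print(page_counts.most_common())   # most frequently requested pages
```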
Difference between Data Mining and Web Mining:
• Problem Type – Data Mining: clustering, classification, regression, prediction, optimization and control. Web Mining: web content mining, web structure mining.
• Skills – Data Mining: it includes approaches for data cleansing and machine learning algorithms, plus statistics and probability. Web Mining: it includes application-level knowledge and data engineering with mathematical modules like statistics and probability.
Text mining is a component of data mining that deals specifically with unstructured
text data. It involves the use of natural language processing (NLP) techniques to
extract useful information and insights from large amounts of unstructured text
data. Text mining can be used as a preprocessing step for data mining or as a
standalone process for specific tasks.
What is Text Mining in Data Mining?
In data mining, text mining is mostly used to transform unstructured text data into structured data that can be used for data mining tasks such as classification, clustering, and association rule mining. This allows organizations to gain insights from a wide range of data sources, such as customer feedback, social media posts, and news articles.
• Pre-processing and data cleansing tasks are performed to identify and eliminate inconsistency in the data. The data cleansing process makes sure the genuine text is captured; it eliminates stop words, performs stemming (the process of identifying the root of a word) and indexes the data.
• Processing and controlling tasks are applied to review and further clean the data set.
• Text Summarization: to automatically extract partial content that reflects the whole content of a text.
• Text Clustering: to segment texts into several clusters, depending on their substantial relevance.
• Digital Library: Various text mining strategies and tools are used to extract patterns and trends from journals and proceedings stored in text database repositories. These information resources help in the research field. Libraries are a good resource for text data in digital form, and text mining gives a novel technique for getting useful data in such a way that makes it conceivable to access millions of records online.
Greenstone, an international digital library that supports numerous languages and multilingual interfaces, gives a flexible method for extracting documents in various formats, i.e., Microsoft Word, PDF, PostScript, HTML, scripting languages, and email. It additionally supports the extraction of audiovisual and image formats along with text documents. Text mining processes perform different activities like document collection, determination, enhancement, data extraction, entity handling, and producing summaries.
• Academic and Research Field: In the education field, different text-mining tools and strategies are utilized to examine the educational patterns in a specific region or research field. The main purpose of text mining in the research field is to help discover and organize research papers and relevant material from various fields on one platform.
For this, k-means clustering and other strategies are used to help distinguish the properties of significant data. Student performance in various subjects can also be assessed, and how various qualities impact the selection of subjects can be evaluated using this mining.
• Life Science: Life science and healthcare industries are producing an enormous volume
of textual and mathematical data regarding patient records, sicknesses, medicines,
symptoms, and treatments of diseases, etc. It is a major issue to filter relevant data and text in order to make decisions from a biological data repository. Clinical records contain variable data which is unpredictable and lengthy, and text mining can help manage such kinds of data. Text mining is also used in biomarker discovery, the pharmacy industry, clinical trade analysis, clinical studies, and patent competitive intelligence.
• Social Media: Text mining is used for dissecting and analyzing web-based media applications to monitor and investigate online content like plain text from internet news, web journals, emails, blogs, etc. Text mining tools help to identify and investigate the number of posts, likes, and followers on a social media network. This kind of analysis shows individuals' responses to various posts and news and how they spread, and it reveals the behaviour of people belonging to a specific age group as well as variations in views about the same post.
• Business Intelligence: Text mining plays an important role in business intelligence, helping different organizations and enterprises analyze their customers and competitors to make better decisions. It gives an accurate understanding of the business and provides data on how to improve consumer satisfaction and gain competitive benefits. Text mining tools such as IBM Text Analytics are used for this purpose. This mining can be used in the telecom sector, commerce, and customer chain management systems.
• Cost-effective: Text mining can be a cost-effective approach, as it eliminates the need for manual data entry.
• Complexity: Text mining can be a complex process requiring advanced skills in natural
language processing and machine learning.
• Quality of Data: The quality of text data can vary, affecting the accuracy of the insights
extracted from text mining.
• High Computational Cost: Text mining requires high computational resources, and it
may be difficult for smaller organizations to afford the technology.
• Limited to Text Data: Text mining is limited to extracting insights from unstructured
text data and cannot be used with other data types.
• Noise in text mining results: Text mining of documents may result in mistakes. It’s
possible to find false links or to miss others. In most situations, if the noise (error rate)
is sufficiently low, the benefits of automation exceed the chance of a larger mistake
than that produced by a human reader.
• Information Retrieval
• Information Extraction
• Feature Extraction – In this process, we try to develop some new features from
existing ones. This objective can be achieved by parsing an existing feature or
combining two or more features based on some mathematical operation.
• Feature Selection – In this process, we try to reduce the dimensionality of the dataset
which is generally a common issue while dealing with the text data by selecting a
subset of features from the whole dataset.
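Building on the feature extraction and feature selection ideas above, here is a minimal sketch that turns raw text into numeric TF-IDF features, assuming scikit-learn is installed; the example documents and the max_features limit are illustrative assumptions.

```python
# TF-IDF feature extraction with simple feature selection (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Text mining extracts useful information from unstructured text.",
    "Clustering groups similar documents together.",
    "TF-IDF turns text documents into numeric feature vectors.",
]

# stop_words removes common words; max_features keeps only the top terms (feature selection).
vectorizer = TfidfVectorizer(stop_words="english", max_features=10)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # the selected vocabulary
print(X.toarray().round(2))                 # one TF-IDF feature vector per document
```

The resulting matrix can be fed directly to the clustering or classification algorithms discussed earlier in this unit.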
Natural Language Processing includes tasks that are accomplished by using Machine
Learning and Deep Learning methodologies. It concerns the automatic processing
and analysis of unstructured text information.
• Named Entity Recognition (NER): Identifying and classifying named entities such as
people, organizations, and locations in text data.
• Sentiment Analysis: Identifying and extracting the sentiment (e.g. positive, negative,
neutral) of text data.
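As a hedged illustration of the sentiment-analysis task above, the sketch below uses NLTK's VADER analyzer; it assumes nltk is installed and downloads the vader_lexicon resource, and the two example sentences are made up.

```python
# Rule-based sentiment analysis with NLTK's VADER (assumes nltk is installed).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon download
sia = SentimentIntensityAnalyzer()

for text in ["The product quality is excellent!", "Terrible service, very disappointed."]:
    scores = sia.polarity_scores(text)       # neg / neu / pos / compound scores
    print(text, "->", scores["compound"])    # compound > 0 is positive, < 0 negative
```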