6 - Machine Learning and Unlabeled Data
Unlabeled data
As mentioned in Chapter 01, unsupervised learning is based on unlabeled data: the model
learns from this data in order to make predictions on new and unseen data. It cannot be
directly applied to a regression or classification problem because, unlike supervised
learning, we have the input data but no corresponding output data. The goal of
unsupervised learning is to find the underlying structure of a dataset, group the data
according to some similarities, and/or represent the dataset in a compressed format.
Hence, unsupervised learning can be divided into two main categories: Clustering and
Association.
Association
An association rule is used for finding relationships between variables in a large
database. It determines the sets of items that occur together in the dataset. For
example, in market basket analysis, where a company or market owner studies how
customers tend to use their products, people who buy item X (say, bread) also tend to
purchase item Y (butter).
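As an illustration of these notions, the short sketch below computes the support and confidence of the hypothetical rule {bread} -> {butter} on a small, made-up list of transactions (both the data and the rule are only an example, not taken from a real dataset).
# support/confidence of the rule {bread} -> {butter} on made-up transactions
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)
support = both / n          # how often bread and butter appear together
confidence = both / bread   # how often butter appears when bread is bought
print(f"support = {support:.2f}, confidence = {confidence:.2f}")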
Clustering
We may further classify clustering based on the criterion used, such as: partition,
hierarchy, density, distribution, graph or fuzzy theory, and the neighborhood of the
data points.
Hierarchical Clustering
The Hierarchical type tries to create a tree or a hierarchy of clusters, called
a Dendrogram. The most similar documents are grouped into clusters at the
lowest levels, while the less similar documents are grouped into clusters at the
highest levels.
Depending on how the hierarchy is created, these algorithms can further be divided into
two types: divisive or agglomerative. In the divisive (partitioning) approach, we try to
divide a large cluster into 2 smaller ones (top-down approach). In the agglomerative
(grouping) approach, we try to merge 2 clusters into a larger one (bottom-up approach).
Agglomerative clustering
• Single Linkage:
o Definition: The distance between two clusters is the minimum distance
between any two points in the two clusters.
o Characteristics: Can detect elongated clusters but is sensitive to noise
and outliers.
• Complete Linkage:
o Definition: The distance between two clusters is the maximum distance
between any two points in the two clusters.
o Characteristics: Tends to produce compact, spherical clusters. Less
sensitive to outliers.
• Average Linkage:
o Definition: The distance between two clusters is the average
distance between all pairs of points in the two clusters.
o Characteristics: Strikes a balance between single and complete
linkage. Less sensitive to outliers.
• Centroid Linkage:
o Definition: The distance between two clusters is the distance
between their centroids (mean points).
o Characteristics: Can produce well-balanced clusters. Sensitive to
outliers.
• Ward Linkage:
o Definition: Minimizes the variance within clusters. It measures the
increase in variance that results from merging two clusters.
o Characteristics: Tends to produce compact, spherical clusters.
Suitable for minimizing the overall variance.
Choosing the Linkage Method:
• The choice of linkage method depends on the nature of the data and the
desired characteristics of the clusters.
• Single linkage is sensitive to noise but can detect elongated clusters.
• Complete linkage is less sensitive to outliers and noise, forming compact
clusters.
• Average linkage provides a balance between the extremes of single and
complete linkage.
• Centroid linkage calculates distances based on centroids and can be
effective for various cluster shapes.
• Ward linkage minimizes the variance within clusters and is suitable for
balanced, compact clusters.
Agglomerative Implementation
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('dark')
# the X1 points clustered in the rest of this example
X1 = np.array([[1, 1], [3, 2], [9, 1], [3, 7], [7, 2],
               [9, 7], [4, 8], [8, 3], [1, 4]])
plt.scatter(X1[:, 0], X1[:, 1], s=50, c='b')
plt.xlabel('x coordinate')
plt.ylabel('y coordinate')
plt.title('Scatter Plot of the data')
plt.xlim([0, 10]), plt.ylim([0, 10])
plt.xticks(range(10)), plt.yticks(range(10))
plt.grid()
plt.show()
Using Dendrogram
The fcluster function gives the corresponding cluster for each element of the array,
placing it at the same index. In this case, you can see that there are two clusters.
Output:
Clusters: [2 2 1 2 1 1 2 1 2]
The elements of the X1 array are grouped as follows:
X1      Cluster
[1 1] --> 2
[3 2] --> 2
[9 1] --> 1
[3 7] --> 2
[7 2] --> 1
[9 7] --> 1
[4 8] --> 2
[8 3] --> 1
[1 4] --> 2
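The dendrogram and the flat clusters above are typically produced with SciPy, but the corresponding code is not shown in these notes. Below is a minimal sketch assuming the X1 points listed above and Ward linkage (both assumptions); the exact label numbering returned by fcluster may differ from the output shown.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
X1 = np.array([[1, 1], [3, 2], [9, 1], [3, 7], [7, 2],
               [9, 7], [4, 8], [8, 3], [1, 4]])
# build the hierarchy (Ward linkage assumed) and draw the dendrogram
Z = linkage(X1, method='ward')
dendrogram(Z)
plt.title('Dendrogram')
plt.show()
# cut the tree into 2 flat clusters; one label per point, same index as X1
clusters = fcluster(Z, t=2, criterion='maxclust')
print("Clusters:", clusters)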
Using sklearn
# use the AgglomerativeClustering function within the scikit-learn library to find the clusters for Ward linkage
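The comment above announces the scikit-learn code but the code itself is not included. A minimal sketch, assuming the same X1 points as before and two clusters:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
X1 = np.array([[1, 1], [3, 2], [9, 1], [3, 7], [7, 2],
               [9, 7], [4, 8], [8, 3], [1, 4]])
# Ward linkage, 2 clusters
model = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = model.fit_predict(X1)
print("Clusters:", labels)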
K-Means Algorithm
The K-means algorithm starts with a first group of randomly selected centroids, which
are used as the starting points for every cluster, and then performs iterative
(repetitive) calculations to optimize the positions of the centroids.
It stops creating and optimizing clusters when either:
• The centroids have stabilized, meaning there is no change in their values because the
clustering has been successful.
• The defined number of iterations has been reached.
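To make the algorithm and its two stopping conditions concrete, here is a minimal NumPy sketch of the K-means loop. It is only illustrative (it is not the scikit-learn implementation and does not handle empty clusters).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # start from k randomly selected points as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):                      # stop after max_iter at the latest
        # assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids have stabilized
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.rand(50, 2), 3 + np.random.rand(50, 2)])
centroids, labels = kmeans(X, k=2)
print(centroids)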
The Elbow Method
The elbow graph shows the WCSS values (within-cluster sum of squares, on the y-axis)
corresponding to different values of K (on the x-axis). When we see an elbow shape in
the graph, we pick the K value at which the elbow is created. We can call this point
the Elbow point. Beyond the Elbow point, increasing the value of K does not lead to a
significant reduction in WCSS.
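A short sketch of how such an elbow graph can be produced, assuming a toy dataset similar to the one used in the implementation below; in scikit-learn the WCSS of a fitted model is exposed as inertia_.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# toy data: two blobs
X = -2 * np.random.rand(100, 2)
X[50:100, :] = 1 + 2 * np.random.rand(50, 2)
# WCSS for K = 1..9
wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1, 10), wcss, marker='o')
plt.xlabel('K')
plt.ylabel('WCSS')
plt.title('Elbow graph')
plt.show()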
Implementation using Sklearn
#importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#create a random dataset
X= -2 * np.random.rand(100,2)
X1 = 1 + 2 * np.random.rand(50,2)
X[50:100, :] = X1
#scatter plot of the dataset
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()
#import the KMeans algorithm
from sklearn.cluster import KMeans
#set the number of clusters k to 2
Kmean = KMeans(n_clusters=2)
#fit the data to the kmeans model
Kmean.fit(X)
Kmean.cluster_centers_
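As a follow-up to the snippet above, the sketch below (which recreates the same toy dataset) shows the learned labels, plots the centroids on top of the data, and assigns a new, unseen point to one of the two clusters using predict.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# same toy dataset as above
X = -2 * np.random.rand(100, 2)
X[50:100, :] = 1 + 2 * np.random.rand(50, 2)
Kmean = KMeans(n_clusters=2, n_init=10).fit(X)
# labels of the training points and the two learned centroids
print(Kmean.labels_)
centers = Kmean.cluster_centers_
plt.scatter(X[:, 0], X[:, 1], s=50, c=Kmean.labels_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='red', marker='*')
plt.show()
# assign a new, unseen point to one of the two clusters
print(Kmean.predict(np.array([[-1.0, -1.0]])))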
Advantages
• It is commonly used and easy to understand.
• It delivers training results quickly.
Disadvantages
• Its performance is usually not as competitive as that of other, more sophisticated
clustering techniques. Slight variations in the data could lead to high variance.
• Furthermore, clusters are assumed to be spherical and evenly sized, which may reduce
the accuracy of the K-means clustering algorithm.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Parameters:
o eps (Epsilon): The maximum distance between two points for one to be
considered as in the neighborhood of the other.
o min_samples: The minimum number of data points required to form a
dense region (including the point itself).
• Core Points:
o A data point is a core point if there are at least `min_samples` points
(including itself) within a distance of `eps` from it.
• Border Points:
o A data point is a border point if it has fewer than `min_samples` points
within `eps` of it but is reachable from a core point.
• Noise Points:
o A data point is a noise point if it is neither a core nor a border point.
• Cluster Formation:
o Connect core points that are within `eps` distance of each other.
o Assign each border point to the cluster of its reachable core point.
• Repeat:
o Repeat the process until all points are assigned to a cluster or
labeled as noise.
Implementation using Sklearn
This approach works very well when a distance between examples is defined. The learning
speed is slow when the training set is large and the distance calculation is nontrivial.
import numpy as np
from sklearn.neighbors import NearestNeighbors
samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]
# fit a nearest-neighbour index on the samples
neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
neigh.fit(samples)
# indices of the 2 nearest samples to the query point
neigh.kneighbors([[0, 0, 1.3]], 2, return_distance=False)
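The snippet above is a plain nearest-neighbour query, which is the kind of eps-neighbourhood search DBSCAN relies on. For the DBSCAN algorithm itself, here is a minimal scikit-learn sketch on a toy two-moons dataset; the eps and min_samples values are illustrative, not taken from the original notes.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# two interleaved half-moons: a shape K-means struggles with but DBSCAN handles
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
# labels_ holds the cluster of each point; -1 marks noise points
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Noise points:", np.sum(labels == -1))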
Dimensionality Reduction
Introduction
Principal Component Analysis (PCA) represents the data with a small number of new axes,
the principal components, that capture as much of its variance as possible. The
explained variance ratio is a measure that indicates the proportion of the dataset's
variance that is captured by each principal component. It helps in understanding the
importance of each principal component in representing the overall variability of the
data.
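The printed results below assume a small PCA example whose setup is not shown in these notes. Here is a minimal sketch of that setup, using hypothetical student scores in Algebra, Calculus and Lang (made up so that Algebra and Calculus are in opposition, roughly matching the interpretation discussed below), scaled between 0 and 1 as mentioned later in this section, and reduced to two principal components.
import numpy as np
import pandas as pd  # used further below to inspect the component loadings
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
# hypothetical scores (Algebra, Calculus, Lang)
data = np.array([[90, 40, 60],
                 [35, 95, 70],
                 [80, 30, 90],
                 [45, 85, 40],
                 [60, 60, 75]], dtype=float)
# scale each feature to [0, 1], then keep two principal components
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)
pca = PCA(n_components=2)
transformed_data = pca.fit_transform(data_scaled)
explained_variance_ratio = pca.explained_variance_ratio_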
print("Original Data:")
print(data)
print(transformed_data)
print(explained_variance_ratio)
Next, we look at what proportion of each original feature's variance is captured by
each principal component.
This helps to interpret the meaning of each PC. For example, in this case we can say
that PC1 represents the opposition between how good the student is in Calculus and
Algebra (opposition meaning that if a student is good in one, he is bad in the other;
a student who is good in both of them might be closer to zero on PC1).
PC2, on the other hand, represents only how good a student is in Lang.
Note: any value that is too close to zero cannot be reliably interpreted.
# Convert the 2D array of components to a DataFrame
df = pd.DataFrame(pca.components_.T,
                  columns=["PC1", "PC2"],
                  index=["Algebra", "Calculus", "Lang"])
# Display the DataFrame
print("DataFrame:")
print(df)
If you want to add a new point, you just have to take its original coordinates, scale
them between 0 and 1, and pass them to pca.transform, which returns the coordinates of
the new point in terms of PC1 and PC2.
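A short sketch of that step, repeating the hypothetical setup from the Introduction; the new student's scores are made up for illustration.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
# same hypothetical scores (Algebra, Calculus, Lang) as in the setup sketch above
data = np.array([[90, 40, 60], [35, 95, 70], [80, 30, 90],
                 [45, 85, 40], [60, 60, 75]], dtype=float)
scaler = MinMaxScaler().fit(data)
pca = PCA(n_components=2).fit(scaler.transform(data))
# a new student's scores, scaled with the same scaler, then projected onto PC1/PC2
new_point = np.array([[80, 35, 70]], dtype=float)
new_pc = pca.transform(scaler.transform(new_point))
print("New point in (PC1, PC2):", new_pc)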
Just like in supervised machine learning, neural networks can be used for unsupervised
learning, thanks to their wide variety of architectures and algorithms, which can be
deployed in different real-world problems.