Machine Learning & Unlabeled Data
As mentioned in Chapter 01, unsupervised learning is based on unlabeled data: the model learns from such data in order to provide predictions on new, unseen data. It cannot be applied directly to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of a dataset, group the data according to some similarities, and/or represent the dataset in a compressed format. Hence, unsupervised learning can be divided into two main categories: Clustering and Association.
Association
An association rule is used to find relationships between variables in a large database. It determines sets of items that occur together in the dataset. For example, in market basket analysis, a company or market owner studies how customers tend to use their products; one might find that people who buy item X (say, bread) also tend to purchase item Y (butter).
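As a minimal illustration of the idea (hypothetical transactions and plain Python rather than a dedicated association-rule library such as mlxtend), the sketch below counts how often pairs of items are bought together and estimates the support and confidence of the rule {bread} -> {butter}:

from itertools import combinations
from collections import Counter

# hypothetical market-basket transactions (each basket is a set of items)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

# count single items and item pairs across all baskets
item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))

# support and confidence of the rule {bread} -> {butter}
pair = frozenset({"bread", "butter"})
support = pair_counts[pair] / len(transactions)
confidence = pair_counts[pair] / item_counts["bread"]
print(f"support = {support:.2f}, confidence = {confidence:.2f}")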
Clustering

Clustering is a method of grouping objects into clusters: objects with the most similarities remain in the same group and have few or no similarities with the objects of another group. In doing so, we want to achieve one of the following goals:

- The number of clusters is much smaller than the number of input data points, so the search process can be sped up.
- If one document is relevant to a query, then similar documents are more likely to be relevant too; clustering can thus also be seen as a means of expansion.
- System responses can be grouped together rather than listed individually. The advantage of this presentation of results is that the user can quickly get an overall idea of what the system has found.
Clustering Types
Clustering algorithms may be classified into four main types:
• Exclusive Clustering: If a certain data point belongs to a definite cluster, it cannot be included in another cluster.
• Overlapping Clustering: Uses fuzzy sets to cluster data, so that each point may
belong to two or more clusters with different degrees of membership.
• Hierarchical Clustering: Is based on the union between the two nearest
clusters. The beginning condition is realized by setting every data point as a
cluster. After a few iterations it reaches the final clusters wanted.
• Probabilistic Clustering (Distribution-based Clustering): Uses a probabilistic
approach. It assumes data is composed of distributions, such as Gaussian
Distribution.

We may further classify clustering based on the criterion used: partition, hierarchy, density, distribution, graph or fuzzy theory, and the neighborhood of the data points.
Hierarchical Clustering
The Hierarchical type tries to create a tree or a hierarchy of clusters, called
a Dendrogram. The most similar documents are grouped into clusters at the
lowest levels, while the less similar documents are grouped into clusters at the
highest levels.

Depending on how the hierarchy is created, this type of algorithm can be further divided into two: divisive or agglomerative. In the divisive (partitioning) approach, we try to split a large cluster into two smaller ones (top-down). In the agglomerative (grouping) approach, we try to merge two clusters into a larger one (bottom-up).
Agglomerative clustering

Hierarchical clustering is a type of clustering algorithm that builds a hierarchy of clusters. The agglomerative approach starts with each data point as a single cluster and then merges the closest pairs of clusters until only one cluster remains. This process creates a tree-like structure known as a dendrogram, which represents the relationships between clusters.
How does Agglomerative Clustering work?
1. Initialization:
• Start with each data point as a separate cluster.
• Calculate the proximity (distance) between all pairs of clusters.
2. Merge:
• Identify the two closest clusters based on the chosen proximity metric.
• Merge these clusters into a single cluster.
• Recalculate the proximity matrix.
3. Repeat:
• Repeat the merging step until only one cluster remains.
4. Dendrogram:
• Visualize the hierarchy of clusters using a dendrogram.
• The vertical lines represent clusters, and the height of the vertical lines
indicates the distance at which clusters were merged.
Linkage Methods
In hierarchical clustering, the linkage method determines how the distance
between clusters is calculated during the merging process. Different linkage
methods can result in distinct cluster structures. Here are the common linkage
methods:
• Single Linkage:
o Definition: The distance between two clusters is the minimum distance
between any two points in the two clusters.
o Characteristics: Tends to produce long, elongated clusters. Sensitive to
outliers and noise.

• Complete Linkage:
o Definition: The distance between two clusters is the maximum distance
between any two points in the two clusters.
o Characteristics: Tends to produce compact, spherical clusters. Less
sensitive to outliers.
• Average Linkage:
o Definition: The distance between two clusters is the average
distance between all pairs of points in the two clusters.
o Characteristics: Strikes a balance between single and complete
linkage. Less sensitive to outliers.
• Centroid Linkage:
o Definition: The distance between two clusters is the distance
between their centroids (mean points).
o Characteristics: Can produce well-balanced clusters. Sensitive to
outliers.
• Ward Linkage:
o Definition: Minimizes the variance within clusters. It measures the increase in within-cluster variance that results from merging two clusters.
o Characteristics: Tends to produce compact, spherical clusters. Suitable for minimizing the overall variance.
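For reference, the first three linkage criteria can be written compactly as follows, where d(x, y) is the chosen point-to-point distance (e.g. Euclidean) and A, B are two clusters (a small formal restatement of the definitions above, not part of the original list):

$$d_{\text{single}}(A,B) = \min_{x \in A,\, y \in B} d(x,y), \qquad
d_{\text{complete}}(A,B) = \max_{x \in A,\, y \in B} d(x,y), \qquad
d_{\text{average}}(A,B) = \frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} d(x,y)$$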
Choosing the Linkage Method:

• The choice of linkage method depends on the nature of the data and the
desired characteristics of the clusters.
• Single linkage is sensitive to noise but can detect elongated clusters.
• Complete linkage is less sensitive to outliers and noise, forming compact
clusters.
• Average linkage provides a balance between the extremes of single and
complete linkage.
• Centroid linkage calculates distances based on centroids and can be
effective for various cluster shapes.
• Ward linkage minimizes the variance within clusters and is suitable for
balanced, compact clusters.
Agglomerative Implementation

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('dark')

# define a small sample dataset
X1 = np.array([[1,1], [3,2], [9,1], [3,7], [7,2], [9,7], [4,8], [8,3], [1,4]])

# scatter plot of the data sample
plt.figure(figsize=(6, 6))
plt.scatter(X1[:,0], X1[:,1], c='r')

# create numbered labels for each point
for i in range(X1.shape[0]):
    plt.annotate(str(i), xy=(X1[i,0], X1[i,1]), xytext=(3, 3),
                 textcoords='offset points')

plt.xlabel('x coordinate')
plt.ylabel('y coordinate')
plt.title('Scatter Plot of the data')
plt.xlim([0,10]), plt.ylim([0,10])
plt.xticks(range(10)), plt.yticks(range(10))
plt.grid()
plt.show()
Using Dendrogram

The example below uses dendrograms. A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters.
#import linkage and dendrogram functions from the SciPy library
from scipy.cluster.hierarchy import dendrogram, linkage

Z1 = linkage(X1, method='single', metric='euclidean')
Z2 = linkage(X1, method='complete', metric='euclidean')
Z3 = linkage(X1, method='average', metric='euclidean')
Z4 = linkage(X1, method='ward', metric='euclidean')

#pass the linkage matrices to the dendrogram function to plot the different linkages
plt.figure(figsize=(15, 10))
plt.subplot(2,2,1), dendrogram(Z1), plt.title('Single')
plt.subplot(2,2,2), dendrogram(Z2), plt.title('Complete')
plt.subplot(2,2,3), dendrogram(Z3), plt.title('Average')
plt.subplot(2,2,4), dendrogram(Z4), plt.title('Ward')
plt.show()
Using scipy cluster

#use the fcluster function to extract the clusters for the Ward linkage
from scipy.cluster.hierarchy import fcluster

f1 = fcluster(Z4, 2, criterion='maxclust')
print(f"Clusters: {f1}")

The fcluster function assigns the corresponding cluster to each element of the array and places it at the same index. In this case, you can see that there are two clusters.

Output:
Clusters: [2 2 1 2 1 1 2 1 2]

The elements of the X1 array are grouped as follows:

X1       Cluster
[1 1] --> 2
[3 2] --> 2
[9 1] --> 1
[3 7] --> 2
[7 2] --> 1
[9 7] --> 1
[4 8] --> 2
[8 3] --> 1
[1 4] --> 2
Using sklearn
#use the AgglomerativeClustering class from the Scikit-learn library
#to find the clusters for the Ward linkage
from sklearn.cluster import AgglomerativeClustering

Z1 = AgglomerativeClustering(n_clusters=2, linkage='ward')
Z1.fit_predict(X1)
print(Z1.labels_)

Like the fcluster function, AgglomerativeClustering assigns the corresponding cluster to each element of the array and places it at the same index. In this case, you can see that there are also two clusters, but they are numbered cluster 0 and cluster 1.

Output:
[0 0 1 0 1 1 0 1 0]

The elements of the X1 array are grouped as follows:

X1       Cluster
[1 1] --> 0
[3 2] --> 0
[9 1] --> 1
[3 7] --> 0
[7 2] --> 1
[9 7] --> 1
[4 8] --> 0
[8 3] --> 1
[1 4] --> 0
#draw the scatter plot to visualize the clusters using their labels
fig, ax = plt.subplots(figsize=(6, 6))
scatter = ax.scatter(X1[:,0], X1[:,1], c=Z1.labels_, cmap='rainbow')
legend = ax.legend(*scatter.legend_elements(),
                   title="Clusters", bbox_to_anchor=(1, 1))
ax.add_artist(legend)
plt.title('Scatter plot of clusters')
plt.show()
Non-Hierarchical Clustering
Non-hierarchical clustering forms new clusters by merging or splitting the clusters. It does not follow a tree-like structure. This technique groups the data in order to maximize or minimize some evaluation criterion.
K-Means
K-means groups similar data points together and discovers underlying patterns. To achieve this objective, K-means looks for a fixed number k of clusters in a dataset, where K refers to the number of centroids (centers) of the clusters. In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest cluster, while keeping each cluster as compact as possible.

The K-means algorithm starts with a first group of randomly selected centroids, which are used as the starting points for every cluster, and then performs iterative (repeated) calculations to optimize the positions of the centroids.
It stops creating and optimizing clusters when either:
• The centroids have stabilized: their values no longer change because the clustering has been successful.
• The defined number of iterations has been reached.
K-Means Algorithm

The recipe for k-means is quite straightforward:

1. Decide how many clusters you want, i.e. choose k
2. Randomly assign a centroid to each of the k clusters
3. Calculate the distance of all observations to each of the k centroids
4. Assign observations to the closest centroid
5. Find the new location of each centroid by taking the mean of all the observations in its cluster
6. Repeat steps 3-5 until the centroids no longer change position
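A minimal NumPy sketch of these steps is shown below (illustrative only: the initialization strategy, iteration cap and toy data are assumptions, not the scikit-learn implementation used later):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: pick k random observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # steps 3-4: distance of every observation to every centroid, then assign to the closest
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 5: move each centroid to the mean of its assigned observations
        # (assumes no cluster ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 6: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# toy data: two well-separated blobs
X = np.vstack([np.random.rand(50, 2), 3 + np.random.rand(50, 2)])
labels, centroids = kmeans(X, k=2)
print(centroids)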
Note: To find the optimal K, the Elbow method can be used. It is a graphical method that works by computing the WCSS (Within-Cluster Sum of Squares), i.e. the sum of the squared distances between the points in a cluster and the cluster centroid.

The elbow graph shows the WCSS values (on the y-axis) corresponding to different values of K (on the x-axis). When we see an elbow shape in the graph, we pick the K value at which the elbow is created; we can call this point the elbow point. Beyond the elbow point, increasing the value of K does not lead to a significant reduction in WCSS.
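A short sketch of the elbow method with scikit-learn (the toy data and the range of K values tried here are arbitrary choices; inertia_ is scikit-learn's name for the WCSS of a fitted KMeans model):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# toy data: two well-separated blobs
X = np.vstack([np.random.rand(50, 2), 3 + np.random.rand(50, 2)])

# compute the WCSS (inertia) for a range of K values
wcss = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# the "elbow" of this curve suggests a good value of K
plt.plot(ks, wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS (inertia)')
plt.show()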
Implementation using Sklearn
#importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#create a random dataset
X= -2 * np.random.rand(100,2)
X1 = 1 + 2 * np.random.rand(50,2)
X[50:100, :] = X1
#scatter plot of the dataset
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()
#import the KMeans algorithm
from sklearn.cluster import KMeans
#fix the number of clusters k to 2
Kmean = KMeans(n_clusters=2)
#fit the data to the kmeans model
Kmean.fit(X)
Kmean.cluster_centers_

#plot the data points and the cluster centroids returned by KMeans
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
centers = Kmean.cluster_centers_
plt.scatter(centers[0, 0], centers[0, 1], s=200, c='g', marker='s')
plt.scatter(centers[1, 0], centers[1, 1], s=200, c='r', marker='s')
plt.show()
Kmean.labels_
#create a testing point
sample_test=np.array([-3.0,-3.0])
second_test=sample_test.reshape(1, -1)
#predict the cluster for the testing point
Kmean.predict(second_test)
Advantages and Disadvantages

Advantages:
• It is commonly used and easy to understand.
• It delivers training results quickly.

Disadvantages:
• Its performance is usually not as competitive as that of other, more sophisticated clustering techniques; slight variations in the data can lead to high variance.
• Furthermore, clusters are assumed to be spherical and evenly sized, which may reduce the accuracy of the K-means clustering algorithm.
DBSCAN (Density-Based Spatial Clustering
of Applications with Noise)

DBSCAN is a density-based clustering algorithm that divides a dataset into clusters based on the density of data points. Unlike K-Means, DBSCAN does not require specifying the number of clusters in advance and can identify clusters of arbitrary shapes. It classifies points as core points, border points, or noise points, allowing for the detection of outliers.
DBSCAN Algorithm

• Parameters:
o eps (Epsilon):The maximum distance between two points for one to be
considered as in the neighborhood of the other.
o min_samples: The minimum number of data points required to form a
dense region (including the point itself).
• Core Points:
o A data point is a core point if there are at least `min_samples` points
(including itself) within a distance of `eps` from it.
• Border Points:
o A data point is a border point if it has fewer than `min_samples` points
within `eps` of it but is reachable from a core point.
• Noise Points:
o A data point is a noise point if it is neither a core nor a border point.
• Cluster Formation:
o Connect core points that are within `eps` distance of each other.
o Assign each border point to the cluster of its reachable core point.
• Repeat:
o Repeat the process until all points are assigned to a cluster or
labeled as noise.
Implementation using Sklearn

from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets as datasets

X, Y = datasets.make_moons(n_samples=400, noise=0.09, random_state=1)
print('Dataset Size : ', X.shape, Y.shape)

with plt.style.context("ggplot"):
    plt.scatter(X[:,0], X[:,1], c=Y, cmap="rainbow", marker="o", s=50)
    plt.title("Original Data")

db = DBSCAN(eps=0.2, min_samples=5)
Y_preds = db.fit_predict(X)

plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.scatter(X[:,0], X[:,1], c=Y, cmap="rainbow", marker=".", s=50)
plt.title("Original Data")
plt.subplot(1,2,2)
plt.scatter(X[:,0], X[:,1], c=Y_preds, cmap="rainbow", marker=".", s=50)
plt.title("Clustering Algorithm Prediction")
plt.show()
Neighborhood approaches
Different neighborhood approaches are used to group unlabeled data. K-nearest neighbours (KNN) is one of the simplest. It differs from other machine learning techniques in that it does not produce a model: it is a simple algorithm that stores all available cases and classifies new instances based on a similarity measure. For this reason, it can also be considered for unsupervised learning problems.

It works well whenever a meaningful distance can be defined between examples. The learning speed is slow when the training set is large, and the distance calculation is nontrivial.

Sklearn has an unsupervised version of KNN. Unlike k-means, unsupervised KNN does not associate a label with instances. All it can do is tell you which instances in your training data are the k nearest to the point you are querying. Unsupervised KNN is about the distance to the neighbors of each data point, whereas k-means is about the distance to centroids (clustering).
Implementation using Sklearn

import numpy as np
from sklearn.neighbors import NearestNeighbors

samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]
neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
neigh.fit(samples)
print(neigh.kneighbors([[0, 0, 1.3]], n_neighbors=2, return_distance=False))
Dimensionality Reduction
Introduction

In both statistics and machine learning, dimensionality refers to the number of attributes, features or input variables of a dataset. For example, a simple dataset containing two attributes called Height and Weight is a 2-dimensional dataset, and any observation of this dataset can be plotted in a 2D plot.

Real-world datasets have many attributes. The observations of those datasets lie in a high-dimensional space, which is hard to visualize. In a tabular dataset containing rows and columns, the columns represent the dimensions of the n-dimensional feature space and the rows are the data points lying in that space. Dimensionality reduction simply refers to the process of reducing the number of attributes in a dataset while keeping as much of the variation in the original dataset as possible. There are several dimensionality reduction methods that can be used with different types of data and for different requirements.

The concept behind dimensionality reduction is that high-dimensional data are dominated by a small number of simple variables. This way, we can find a subset of variables to represent the same level of information in the data, or transform the variables into a new set of variables, without losing much information.
Types of Dimensionality reduction
Principal Component Analysis (PCA)

PCA is a linear dimensionality reduction algorithm that helps us extract a new set of variables from an existing high-dimensional dataset. The idea is to reduce the dimensionality of a dataset while retaining as much variance as possible.

PCA is also an unsupervised algorithm that creates linear combinations of the original features, called principal components. Principal components are learned in such a way that the first principal component explains the maximum variance in the dataset, the second principal component tries to explain the remaining variance while being uncorrelated with the first one, and so on.
Instead of simply choosing useful features and discarding others, PCA uses linear combinations of the existing features in the dataset and constructs new features that are an alternative representation of the original data. In simple words, PCA measures data in terms of its principal components rather than on a normal x-y axis, where the principal components are the directions of greatest variance, i.e. the directions in which the data is most spread out.
PCA Implementation
Here we implement PCA using sklearn on a randomly generated dataset. Notice how the values are between 0 and 1: it is important to normalize the values before doing a PCA, for example with a MinMax scaler or a Standard scaler. Here we did not use either of them because the generated values were already between 0 and 1.

The explained variance ratio is a measure that indicates the proportion of the dataset's variance that is captured by each principal component. It helps in understanding the importance of each principal component in representing the overall variability of the data.

Specifically, the explained variance ratio of a principal component is the ratio of the variance attributed to that principal component to the total variance of the dataset. It quantifies the amount of information (variance) retained by each principal component.
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd
# Generate example data
np.random.seed(42)
data = np.random.rand(5, 3) # 5 samples, 3 features
data = pd.DataFrame(data,columns=["Algebra","Calculus","Lang"])
# Apply PCA
pca = PCA(n_components=2) # We chose 2 PCs only
transformed_data = pca.fit_transform(data)
# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

print("Original Data:")
print(data)

print("\nTransformed Data (2 Principal Components):")
print(transformed_data)

print("\nExplained Variance Ratio:")
print(explained_variance_ratio)
Next, we look at how much each original feature contributes to each principal component (the component loadings, given by pca.components_).

This helps to interpret the meaning of each PC. For example, in this case we can say that PC1 represents the opposition between how good the student is in Calculus and in Algebra (opposition meaning that if the student is good in one, they are weak in the other; a student who is good in both might be closer to zero on PC1). PC2, on the other hand, represents only how good a student is in Lang.

Note: any value that is too close to zero cannot really be interpreted.
# Put the component loadings (pca.components_.T) into a DataFrame
df = pd.DataFrame(pca.components_.T,
                  columns=["PC1", "PC2"],
                  index=["Algebra", "Calculus", "Lang"])
# Display the DataFrame
print("DataFrame:")
print(df)
If you want to add a new point, you just have to take its original coordinates, scaled between 0 and 1, and pass them to pca.transform, which returns the coordinates of the new point in terms of PC1 and PC2.

# New data point to be transformed
new_data_point = np.array([[0, 1, 0]])  # Replace this with your new data point
# Transform the new data point using the fitted PCA model
transformed_new_data_point = pca.transform(new_data_point)
transformed_new_data_point
Next, we plot the transformed_data, the transformed_new_data_point and the original feature directions on a plot where PC1 is the x-axis and PC2 is the y-axis. This helps with the interpretation of the points.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 10))
ax.scatter(transformed_data[:,0], transformed_data[:,1])
# draw each original feature as a loading vector in the PC1-PC2 plane
ax.plot([0, pca.components_[0,0]], [0, pca.components_[1,0]], c='r', label="Algebra")
ax.plot([0, pca.components_[0,1]], [0, pca.components_[1,1]], c='g', label="Calculus")
ax.plot([0, pca.components_[0,2]], [0, pca.components_[1,2]], c='orange', label="Lang")
ax.scatter(transformed_new_data_point[0,0], transformed_new_data_point[0,1], label="New Point")
ax.set_aspect('equal')
ax.grid(True, which='both')
ax.legend()
ax.axhline(y=0, color='k')
ax.axvline(x=0, color='k')
plt.show()
Other Unsupervised Learning Models
Autoencoders

Just like in supervised machine learning, neural networks can be used for unsupervised learning, thanks to their wide variety of architectures and algorithms, which can be deployed on different real-world problems.

One of the well-known architectures for unsupervised learning is the autoencoder. The aim of an autoencoder is to learn a lower-dimensional representation (encoding) of higher-dimensional data, typically for dimensionality reduction, by training the network to capture the most important parts of the input.

When we think of dimensionality reduction, we tend to think of methods like PCA. However, PCA can only build linear relationships, while methods such as undercomplete autoencoders can learn non-linear relationships and therefore perform better at dimensionality reduction. This form of nonlinear dimensionality reduction, where the autoencoder learns a non-linear manifold, is also termed manifold learning.
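As a minimal sketch of an undercomplete autoencoder (using Keras, which is not part of the course code so far; the layer sizes, toy data and training settings are illustrative assumptions):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# toy data: 200 samples with 20 features, values in [0, 1]
X = np.random.rand(200, 20).astype("float32")

encoding_dim = 3  # size of the compressed representation (the "bottleneck")

# encoder: maps the input to a lower-dimensional code
encoder = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(encoding_dim, activation="relu"),
])

# decoder: reconstructs the input from the code
decoder = keras.Sequential([
    keras.Input(shape=(encoding_dim,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(20, activation="sigmoid"),
])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# train the network to reproduce its own input (no labels needed)
autoencoder.fit(X, X, epochs=20, batch_size=16, verbose=0)

# the encoder output is the learned low-dimensional representation
codes = encoder.predict(X)
print(codes.shape)  # (200, 3)

Because the bottleneck is smaller than the input, the network is undercomplete and is forced to learn a compressed, non-linear representation of the data.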
