ML Chapter 4
Unsupervised Learning
✔ Group unlabeled data according to similarities and distinct patterns in the
dataset
✔ How do we know if results are meaningful since no answer labels are available?
Some applications of unsupervised machine
learning techniques include:
Clustering -- automatically split the dataset into groups according to
similarity.
Anomaly detection -- discovers unusual data points in your dataset.
Association mining -- identifies sets of items that frequently occur together
in your dataset.
Dimensionality reduction -- latent variable models are commonly used for
data preprocessing.
What Is a Good Cluster?
❖ The quality of a clustering result depends on:
✔ The similarity measure used by the method and its implementation.
[Figure: K-means flowchart]
K-means algorithm pseudocode:
1. Start: input K and randomly initialize the K cluster centers.
2. Calculate the distance from each data point to each centroid.
3. Group the data points based on the minimum distance.
4. Recalculate the centroid of each group.
5. If any data point changed cluster compared with the previous iteration, repeat from step 2; if there is no change, end.
Assume Euclidean distance
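A minimal NumPy sketch of this loop, assuming Euclidean distance (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(points, k, initial_centroids, max_iters=100):
    """Plain k-means (Lloyd's algorithm) with Euclidean distance."""
    centroids = np.asarray(initial_centroids, dtype=float)
    assignments = None
    for _ in range(max_iters):
        # Distance of every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)  # index of the nearest centroid
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # no point changed cluster -> converged
        assignments = new_assignments
        # Recompute each centroid as the mean of the points assigned to it.
        centroids = np.array([points[assignments == j].mean(axis=0) for j in range(k)])
    return centroids, assignments

# The slide example: k = 2, initial centroids A(1, 1) and C(0, 2).
data = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]], dtype=float)
centroids, labels = kmeans(data, k=2, initial_centroids=[[1, 1], [0, 2]])
print(centroids)  # approx. (0.7, 1) and (2.5, 4.5)
print(labels)     # [0 0 0 1 1] -> clusters {A, B, C} and {D, E}
```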
Example
- Start by picking k, the number of clusters.
- Initialize the clusters by picking one point per cluster.
- Let k = 2, and choose observations A and C as the two initial cluster means
(centroids).
After the first assignment with centroids A(1, 1) and C(0, 2), Cluster-01 = {A, B} and Cluster-02 = {C, D, E}, so the updated means are (1, 0.5) and (1.7, 3.7). Recomputing the distances:

Data Point | Distance from Cluster-01 center (1, 0.5) | Distance from Cluster-02 center (1.7, 3.7) | Assigned Cluster
A(1, 1) | 0.5 | 2.7 | 1
B(1, 0) | 0.5 | 3.7 | 1
C(0, 2) | 1.8 | 2.4 | 1
D(2, 4) | 3.6 | 0.5 | 2
E(3, 5) | 4.9 | 1.8 | 2
After the second iteration, A, B and C are in cluster 1, and D and E are in cluster 2.
Therefore C1 = {A, B, C} = {(1, 1), (1, 0), (0, 2)}, so the new centroid is ((1 + 1 + 0)/3, (1 + 0 + 2)/3) = (0.7, 1),
and C2 = {D, E} = {(2, 4), (3, 5)}, so the new centroid is ((2 + 3)/2, (4 + 5)/2) = (2.5, 4.5).
Next, recalculate the distance of each point from the new cluster means.

Data Point | Distance from Cluster-01 center (0.7, 1) | Distance from Cluster-02 center (2.5, 4.5) | Assigned Cluster
A(1, 1) | 0.3 | 3.8 | 1
B(1, 0) | 1.0 | 4.7 | 1
C(0, 2) | 1.2 | 3.5 | 1
D(2, 4) | 3.3 | 0.7 | 2
E(3, 5) | 4.6 | 0.7 | 2

No point changes cluster, so the algorithm has converged: C1 = {A, B, C} and C2 = {D, E}.
Home study
❖ Cluster the following eight points (with (x, y) representing locations) into
three clusters:
❖ A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
❖ The initial cluster centers are A1(2, 10), A4(5, 8), and A7(1, 2). Use both
Euclidean and Manhattan distance.
Hierarchical Clustering
✔ Produces a set of nested clusters organized as a hierarchical tree
✔ The endpoint is a set of clusters in which each cluster is distinct from the
others, and the objects within each cluster are broadly similar to one
another.
Dendrogram
Types of Hierarchical Clustering
Agglomerative:
✔ Start with the points as individual clusters– uses bottom up strategy
✔ At each step, merge the closest pair of clusters until only one cluster left
Divisive:
✔ Starts by placing all objects in one cluster– employs a top-down strategy
✔ At each step, split a cluster until each cluster contains a single point
✔ Less widely used due to its complexity compared with the agglomerative
approach.
Steps in Agglomerative Clustering
1. Consider each data point as an individual cluster.
2. Determine the similarity between each cluster and all other clusters (find the proximity matrix).
3. Merge the most similar (closest) clusters.
4. Recalculate the proximity matrix for the new clusters.
5. Repeat steps 3 and 4 until only a single cluster remains.
Types of HAC
Single - nearest distance (single linkage)
✔ This is the distance between the closest members of the cluster
Complete – farthest distance (Complete linkage)
✔ This is the distance between members that are farthest apart
Average – average distance (average linkage)
✔ This method looks at the distances between all pairs of points across the
two clusters and averages them. It is also called the unweighted pair group
method with arithmetic mean (UPGMA).
Find the clusters using single linkage. Use Euclidean distance to
calculate the proximity matrices and, finally, draw the dendrogram.
Example: the distance between P1(0.40, 0.53) and P2(0.22, 0.38) is
sqrt((0.40 − 0.22)² + (0.53 − 0.38)²) ≈ 0.23.
The distance matrix:

     P1    P2    P3    P4    P5    P6
P1  0.00  0.23  0.22  0.37  0.34  0.23
P2  0.23  0.00  0.15  0.20  0.14  0.25
P3  0.22  0.15  0.00  0.15  0.28  0.11
P4  0.37  0.20  0.15  0.00  0.28  0.22
P5  0.34  0.14  0.28  0.28  0.00  0.39
P6  0.23  0.25  0.11  0.22  0.39  0.00
From the distance matrix, find the minimum value and merge the corresponding pair.
The minimum value is 0.11, found between P3 and P6; therefore we merge P3 and P6 into one
cluster.
To update the distance matrix: dist({P3, P6}, P1) = MIN[dist(P3, P1), dist(P6, P1)]
✔ The two values are {0.22, 0.23}.
✔ Therefore, the minimum value is 0.22.
To update the distance matrix: dist({P3, P6}, P2) = MIN[dist(P3, P2), dist(P6, P2)]
✔ The two values are {0.15, 0.25}.
✔ Therefore, the minimum value is 0.15.
To update the distance matrix: dist({P3, P6}, P4) = MIN[dist(P3, P4), dist(P6, P4)]
✔ The two values are {0.15, 0.22}.
✔ Therefore, the minimum value is 0.15.
To update the distance matrix: dist({P3, P6}, P5) = MIN[dist(P3, P5), dist(P6, P5)]
✔ The two values are {0.28, 0.39}.
✔ Therefore, the minimum value is 0.28.
In the updated matrix, the minimum value is 0.14, found between P2 and P5; therefore we merge P2 and P5 into one cluster.
To update the distance matrix: dist({P2, P5}, P1) = MIN[dist(P2, P1), dist(P5, P1)]
✔ The two values are {0.23, 0.34}.
✔ Therefore, the minimum value is 0.23.
To update the distance matrix: dist({P2, P5}, {P3, P6}) = MIN[dist(P2, {P3, P6}), dist(P5, {P3, P6})]
✔ The two values are {0.15, 0.28}.
✔ Therefore, the minimum value is 0.15.
To update the distance matrix: dist({P2, P5}, P4) = MIN[dist(P2, P4), dist(P5, P4)]
✔ The two values are {0.20, 0.28}.
✔ Therefore, the minimum value is 0.20.
The minimum value is now 0.15, which occurs twice: between {P2, P5} and {P3, P6}, and
between P4 and {P3, P6}. Since the two candidates tie, we simply choose the first one
and merge {P2, P5} with {P3, P6}.
To update the distance matrix: dist({P2, P3, P5, P6}, P1) = MIN[dist({P2, P5}, P1), dist({P3, P6}, P1)]
✔ The two values are {0.23, 0.22}.
✔ Therefore, the minimum value is 0.22.
To update the distance matrix: dist({P2, P3, P5, P6}, P4) = MIN[dist({P2, P5}, P4), dist({P3, P6}, P4)]
✔ The two values are {0.20, 0.15}.
✔ Therefore, the minimum value is 0.15.
The minimum value is again 0.15, found between {P2, P3, P5, P6} and P4; therefore we merge P4 into that cluster.
To update the distance matrix: dist({P2, P3, P4, P5, P6}, P1) = MIN[dist({P2, P3, P5, P6}, P1), dist(P4, P1)]
✔ The two values are {0.22, 0.37}.
✔ Therefore, the minimum value is 0.22.
Finally, the two remaining clusters {P2, P3, P4, P5, P6} and P1 merge at distance 0.22, leaving a single cluster. The dendrogram records the merge heights 0.11, 0.14, 0.15, 0.15, and 0.22.
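The merges above can be checked with SciPy. A short sketch, where the matrix entries are the pairwise distances used in the worked example:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Pairwise Euclidean distances between P1..P6 from the worked example.
D = np.array([
    [0.00, 0.23, 0.22, 0.37, 0.34, 0.23],
    [0.23, 0.00, 0.15, 0.20, 0.14, 0.25],
    [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
    [0.37, 0.20, 0.15, 0.00, 0.28, 0.22],
    [0.34, 0.14, 0.28, 0.28, 0.00, 0.39],
    [0.23, 0.25, 0.11, 0.22, 0.39, 0.00],
])

# linkage() expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="single")
print(Z)  # each row: [cluster_i, cluster_j, merge_distance, new_cluster_size]
# Merge distances come out as 0.11, 0.14, 0.15, 0.15, 0.22 -- matching the steps above.

dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
plt.show()
```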
Association Rule Mining
Association analysis is useful for discovering interesting relationships
hidden in large data sets --- frequent co-occurrence of items in a dataset.
For example, a retailer might use association rule mining to discover that customers who
purchase a mobile phone are also likely to purchase a safety cover. This information can be
used to optimize product placement and promotions to increase sales.
Customer Segmentation
Association rule mining can also be used to segment customers based on their purchasing
habits.
For example, a company might use association rule mining to discover that customers who
purchase certain types of products are more likely to be younger. Similarly, they could learn that
customers who purchase certain combinations of products are more likely to be located in
specific geographic regions.
Fraud Detection
You can also use association rule mining to detect fraudulent activity. For example, a credit
card company might use association rule mining to identify patterns of fraudulent transactions,
such as multiple purchases from the same merchant within a short period of time. This
information can then be used to flag potentially fraudulent activity and take preventative
measures to protect customers.
Recommendation systems
Association rule mining can be used to suggest items that a customer might be interested in
based on their past purchases or browsing history. For example, a music streaming service
might use association rule mining to recommend new artists or albums to a user based on their
listening history.
The following rule can be extracted from the data set shown in the table below:
{Mobile} → {Safety cover}
❖ The rule suggests that a strong relationship exists between the sale of safety covers and
mobile phones, because many customers who buy a mobile also buy a safety cover.
TID Items
1 {Mobile charger, Earphone}
2 {Mobile charger, Mobile, Safety cover, Screen glass}
3 {Earphone, Mobile, Safety cover, Mobile Charger}
4 {Mobile charger, Earphone, Mobile, Safety cover}
5 {Mobile charger, Earphone, Mobile, Selfie Stick}
Item Set and Support Counts
✔ An itemset is a collection of one or more items, e.g. {Mobile, Safety cover}.
✔ The support count σ(X) is the number of transactions that contain itemset X, and
support(X) = σ(X) / N, where N is the total number of transactions.
Metrics for Evaluating Association Rules
❖ Confidence(X → Y) = support(X ∪ Y) / support(X): of the transactions that contain X,
the fraction that also contain Y.
❖ High confidence indicates that the presence of the first item is a strong predictor of
the presence of the second item.
Lift
Lift is a measure of the strength of the association between two items, taking into account
the frequency of both items in the dataset.
It is calculated as the confidence of the rule divided by the support of the second item:
lift(X → Y) = confidence(X → Y) / support(Y).
Lift > 1: the two itemsets are positively dependent on each other; they occur together
more often than expected by chance.
Lift < 1: one item is a substitute for the other, meaning one item has a negative effect
on the presence of the other.
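A small sketch computing support, confidence and lift for the rule {Mobile} → {Safety cover} over the five transactions in the table below (variable names are illustrative):

```python
# Transactions from the electronics-shop example.
transactions = [
    {"Mobile charger", "Earphone"},
    {"Mobile charger", "Mobile", "Safety cover", "Screen glass"},
    {"Earphone", "Mobile", "Safety cover", "Mobile charger"},
    {"Mobile charger", "Earphone", "Mobile", "Safety cover"},
    {"Mobile charger", "Earphone", "Mobile", "Selfie stick"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / N

# Rule: {Mobile} -> {Safety cover}
s_antecedent = support({"Mobile"})                     # 4/5 = 0.8
s_both       = support({"Mobile", "Safety cover"})     # 3/5 = 0.6
confidence   = s_both / s_antecedent                   # 0.75
lift         = confidence / support({"Safety cover"})  # 0.75 / 0.6 = 1.25 > 1

print(confidence, lift)
```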
Association Rule Mining Tasks
Example
Observations:
All the rules below are binary partitions of the same itemset; for example,
{Mobile, Safety cover} yields both {Mobile} → {Safety cover} and {Safety cover} → {Mobile}.
Rules originating from the same itemset have identical support but can have
different confidence.
Rule generation: the extracted rules must satisfy both minimum support and minimum
confidence.
TID Items
1 {Mobile charger, Earphone}
2 {Mobile charger, Mobile, Safety cover, Screen glass}
3 {Earphone, Mobile, Safety cover, Mobile Charger}
4 {Mobile charger, Earphone, Mobile, Safety cover}
5 {Mobile charger, Earphone, Mobile, Selfie Stick}
1-itemsets:
Item | Freq
Charger | 5
Earphone | 4
Mobile | 4
Safety cover | 3
Screen glass | 1
Selfie stick | 1

2-itemsets:
Items | Freq
Charger, Earphone | 4
Charger, Mobile | 4
Charger, Safety cover | 3
Earphone, Mobile | 3
Earphone, Safety cover | 2
Mobile, Safety cover | 3

3-itemsets:
Items | Freq
Charger, Earphone, Mobile | 3
Charger, Earphone, Safety cover | 2
Charger, Mobile, Safety cover | 3
Earphone, Mobile, Safety cover | 2

4-itemsets:
Items | Freq
Charger, Earphone, Mobile, Safety cover | 2
❖ The minimum support count is 3, so we prune the items whose frequency is < 3:
Screen glass and Selfie stick.
❖ Candidates are generated using a brute-force approach.
❖ From the 2-itemsets we also remove {Earphone, Safety cover}, and from the
3-itemsets {Charger, Earphone, Safety cover} and {Earphone, Mobile, Safety cover},
because they do not satisfy the minimum support.
Frequent Itemsets
Therefore, the itemsets that survive pruning are the frequent itemsets purchased by
customers of the electronics shop.
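A brute-force sketch of the candidate generation and pruning described above (illustrative code, not from the slides):

```python
from itertools import combinations

transactions = [
    {"Mobile charger", "Earphone"},
    {"Mobile charger", "Mobile", "Safety cover", "Screen glass"},
    {"Earphone", "Mobile", "Safety cover", "Mobile charger"},
    {"Mobile charger", "Earphone", "Mobile", "Safety cover"},
    {"Mobile charger", "Earphone", "Mobile", "Selfie stick"},
]
min_support = 3  # minimum support count used on the slide
items = sorted(set().union(*transactions))

def support_count(itemset):
    """Number of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions)

# Brute force: enumerate every candidate itemset of every size and keep
# those meeting the minimum support count.
for k in range(1, len(items) + 1):
    frequent = [(set(c), support_count(set(c)))
                for c in combinations(items, k)
                if support_count(set(c)) >= min_support]
    if not frequent:
        break  # no frequent k-itemsets, so no larger ones exist either
    for itemset, count in frequent:
        print(k, itemset, count)
```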
❖ To speed up the learning process, the training set is divided into
subsets known as batches.
Example
Suppose a machine learning model is to be trained on 5,000 training examples. This
large dataset can be broken down into smaller parts called batches.
If the batch size is 500, then 10 batches are created, and it takes ten iterations
to complete one epoch.
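In code, the relationship is simple arithmetic (variable names are illustrative):

```python
import math

num_examples = 5000  # size of the training set
batch_size = 500     # examples processed per iteration

# Iterations (batches) needed to see every example once, i.e. one epoch.
iterations_per_epoch = math.ceil(num_examples / batch_size)
print(iterations_per_epoch)  # 10
```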
Hyperparameter for Specific Models
❖ Hyperparameters that are involved in the structure of the model are known as
hyperparameters for specific models. These are given below:
❖ Number of hidden units: hidden units are part of neural networks; they are the
components making up the layers of processors between the input and output
units of a neural network.
✔ It is important to specify the number-of-hidden-units hyperparameter for the
neural network. A common rule of thumb is to keep it between the size of the
input layer and the size of the output layer; more specifically, about 2/3 of
the size of the input layer plus the size of the output layer.
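A one-line sketch of that rule of thumb (purely illustrative; it is a heuristic, not a guarantee):

```python
def suggested_hidden_units(n_inputs: int, n_outputs: int) -> int:
    """Rule of thumb: 2/3 of the input layer size plus the output layer size."""
    return round(2 / 3 * n_inputs + n_outputs)

print(suggested_hidden_units(30, 3))  # 23
```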
True Positive: a value that is actually positive and is correctly predicted as positive.
True Negative: a value that is actually negative and is correctly predicted as negative.
False Positive: a value that is actually negative but is incorrectly predicted as positive.
False Negative: a value that is actually positive but is incorrectly predicted as negative.
Precision: out of the total predicted positive values, how many are actually positive: TP / (TP + FP).
Accuracy: the fraction of all predictions that are correct: (TP + TN) / (TP + TN + FP + FN).
Recall (TPR / Sensitivity): out of the total actual positive values, how many are correctly predicted as
positive: TP / (TP + FN).
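A small sketch computing these metrics from confusion-matrix counts (the numbers are illustrative, not from the slides):

```python
# Confusion-matrix counts (illustrative values).
tp, tn, fp, fn = 40, 45, 5, 10

precision = tp / (tp + fp)                   # 0.889
recall    = tp / (tp + fn)                   # 0.8
accuracy  = (tp + tn) / (tp + tn + fp + fn)  # 0.85
f1        = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} f1={f1:.3f}")
```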
❖ What if both false positive and false negative have great impact on
model performance?
❖ F-beta score:
F_β = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
✔ β = 1 gives the F1 score; β > 1 weights recall more heavily, while β < 1 weights
precision more heavily.
Imbalanced Datasets
Suppose that you are working at a company and you are asked to create a model
that, based on various measurements at your disposal, predicts whether a product is
defective. You decide to use your favorite classifier, train it on the data, and get
96.2% accuracy!
Your boss is astonished and decides to use your model without any further tests. A few
weeks later he enters your office and underlines the uselessness of your model: indeed,
the model has not found a single defective product since it has been used in production.
After some investigation, you find out that only around 3.8% of the products made by
your company are defective, and your model simply always answers "not defective",
leading to 96.2% accuracy.
Approaches to dealing with the imbalanced dataset problem
Review Questions
3. In what form is an association rule represented? What are the items on the left and
right sides called, respectively?
7. Why are batch size and epoch crucial in machine learning? Give an example.
8. Assume the table shown below is a 2x2 confusion matrix for binary classification.
Which letters are TP, FP, FN, and TN, respectively?