
Unsupervised Learning

Unsupervised Learning
✔ Group unstructured data according to its similarities and distinct patterns in the
dataset

✔ No labels are given to the learning algorithms

✔ Can be a goal in itself or a means toward an end

✔ Unsupervised learning is harder than supervised learning tasks.

✔ How do we know if results are meaningful since no answer labels are available?
Some applications of unsupervised machine
learning techniques include:
Clustering -- automatically splits the dataset into groups according to similarity.
Anomaly detection -- discovers unusual data points in your dataset.
Association mining -- identifies sets of items that frequently occur together in your dataset.
Latent variable models -- commonly used for data preprocessing, e.g. dimensionality reduction.
What is a Good Cluster?
❖ The quality of a clustering result depends on:
✔ The similarity measure used by the method and its implementation.

✔ Its ability to discover some or all of the hidden patterns.

❖ A good clustering method will produce high-quality clusters in which:

✔ The intra-class (within-cluster) similarity is high.

✔ The inter-class (between-cluster) similarity is low.


Basic Steps in Clustering
❖ Feature Selection– minimal information redundancy
❖ Proximity measure
✔ Similarity of two feature vectors
❖ Clustering criterion
✔ Expressed via a cost function or some rules
❖ Clustering algorithms – choice
❖ Validation of the result
❖ Interpretation of the result – integration with application
Types of Clustering
❖ Partitioning methods: Given a set of n objects, a partitioning method constructs k
partitions of the data
✔ Each partition represents a cluster and k <= n.
✔ Typical methods: K-means, K-medoids, CLARANS
❖ Hierarchical methods:
✔ Creates a hierarchical decomposition of the given set of data objects
✔ Can be classified as being either agglomerative or divisive
✔ Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
❖ Density-based approach: based on connectivity and density functions
✔ Typical methods: DBSCAN, OPTICS, DenClue
❖ Grid-based approach: based on a multiple-level granularity structure
✔ Typical methods: STING, WaveCluster, CLIQUE
Distance Measures
Assume a k-dimensional Euclidean space; the distance between two points x = [x1, x2, …, xk] and
y = [y1, y2, …, yk] is computed as shown below.
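Written out in LaTeX notation, this is the standard Euclidean distance:

$d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$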
Distance between clusters
✔ Single link: smallest distance between an element in one cluster and an element in the other
✔ Complete link: largest distance between an element in one cluster and an element in the other
✔ Average link: average distance between an element in one cluster and an element in the other
✔ Centroid: distance between the centroids of two clusters
✔ Medoid: distance between the medoids of two clusters
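As an illustration (not part of the original slides), these inter-cluster distance measures can be written as small helper functions; the sketch below assumes points are given as 2-D coordinates and uses NumPy:

    import numpy as np

    def euclidean(a, b):
        return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

    def single_link(c1, c2):
        # Smallest pairwise distance between the two clusters.
        return min(euclidean(a, b) for a in c1 for b in c2)

    def complete_link(c1, c2):
        # Largest pairwise distance between the two clusters.
        return max(euclidean(a, b) for a in c1 for b in c2)

    def average_link(c1, c2):
        # Average of all pairwise distances between the two clusters.
        return sum(euclidean(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    def centroid_link(c1, c2):
        # Distance between the centroids (mean points) of the two clusters.
        return euclidean(np.mean(c1, axis=0), np.mean(c2, axis=0))

    # Example: single_link([(1, 1), (1, 0)], [(2, 4), (3, 5)]) is about 3.16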
Centroid-based clustering techniques
✔ Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
✔ Uses the centroid (center point) of a cluster to represent that cluster.
✔ Typical methods: k-means, k-medoids
How does the k-means algorithm work?
✔ Randomly selects k of the objects in D, each of which initially represents a cluster mean
or center
✔ ‘Closeness’ is measured by Euclidean distance
✔ The k-means algorithm then iteratively reduces the within-cluster variation
✔ Most of the convergence happens in the first few iterations.

❑ What is the complexity of the k-means clustering algorithm?


K-means algorithm flowchart

Start → Input K (randomly initialize centers) → Calculate centroids → Calculate the distance of each
data point to each centroid → Group data points based on minimum distance → If any data point
changes cluster, recalculate the centroids and repeat; if there is no change in clusters compared
with the previous iteration, End.
K-means algorithm Pseudocode
Assume Euclidean distance
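As an illustrative sketch of the procedure (not the exact pseudocode from the slides), here is a minimal k-means in Python using Euclidean distance; the function name and convergence check are my own choices, and it assumes no cluster ever becomes empty:

    import numpy as np

    def kmeans(points, k, init_centers, max_iter=100):
        points = np.asarray(points, dtype=float)
        centers = np.asarray(init_centers, dtype=float)
        for _ in range(max_iter):
            # Assign each point to its nearest center (Euclidean distance).
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each center as the mean of its assigned points.
            new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):   # stop when the centers no longer move
                break
            centers = new_centers
        return labels, centers

    # The worked example below: points A..E with initial centers A(1, 1) and C(0, 2).
    pts = [(1, 1), (1, 0), (0, 2), (2, 4), (3, 5)]
    labels, centers = kmeans(pts, k=2, init_centers=[(1, 1), (0, 2)])
    print(labels)    # A, B, C end up in one cluster; D, E in the other
    print(centers)   # final centroids approximately (0.67, 1.0) and (2.5, 4.5)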
Example:
- Start by picking k, the number of clusters.
- Initialize clusters by picking one point per cluster.
- Let k = 2 and choose observations A and C as the two initial cluster means (centroids).

Data points:
  Point   x   y
  A       1   1
  B       1   0
  C       0   2
  D       2   4
  E       3   5

First iteration distances:
  Data Point   Distance from center (1, 1)   Distance from center (0, 2)   Point belongs
               of Cluster-01                 of Cluster-02                 to Cluster
  A(1, 1)      0                             1.4                           1
  B(1, 0)      1                             2.2                           1
  C(0, 2)      1.4                           0                             2
  D(2, 4)      3.2                           2.8                           2
  E(3, 5)      4.5                           4.2                           2
After the first iteration of computing the distances:
B is grouped in cluster 1 since 1 < 2.2, while C, D and E are grouped in cluster 2.
Therefore C1 = {A, B} = {(1, 1), (1, 0)}, and the new centroid is {(1 + 1)/2, (1 + 0)/2} = (1, 0.5),
and C2 = {C, D, E} = {(0, 2), (2, 4), (3, 5)}, and the new centroid is {(0 + 2 + 3)/3, (2 + 4 + 5)/3} = (1.7, 3.7).
Next recalculate the distance of each point from the cluster mean.

  Data Point   Distance from center (1, 0.5)   Distance from center (1.7, 3.7)   Point belongs
               of Cluster-01                   of Cluster-02                     to Cluster
  A(1, 1)      0.5                             2.7                               1
  B(1, 0)      0.5                             3.7                               1
  C(0, 2)      1.8                             2.4                               1
  D(2, 4)      3.6                             0.5                               2
  E(3, 5)      4.9                             1.8                               2
After the second iteration:
A, B and C are in cluster 1, while D and E are in cluster 2.
Therefore C1 = {A, B, C} = {(1, 1), (1, 0), (0, 2)}, and the new centroid is {(1 + 1 + 0)/3, (1 + 0 + 2)/3} = (0.7, 1),
and C2 = {D, E} = {(2, 4), (3, 5)}, and the new centroid is {(2 + 3)/2, (4 + 5)/2} = (2.5, 4.5).
Next recalculate the distance of each point from the cluster mean.
  Data Point   Distance from center (0.7, 1)   Distance from center (2.5, 4.5)   Point belongs
               of Cluster-01                   of Cluster-02                     to Cluster
  A(1, 1)      0.3                             3.8                               1
  B(1, 0)      1                               4.7                               1
  C(0, 2)      1.2                             3.5                               1
  D(2, 4)      3.3                             0.7                               2
  E(3, 5)      4.6                             0.7                               2

Since no point changes its cluster assignment, the algorithm has converged and stops here.
Home study
❖ Cluster the following eight points (with (x, y) representing locations) into
three clusters:

❖ A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

❖ The initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2). Use both Euclidean and
Manhattan distance.
Hierarchical Clustering
✔ Produces a set of nested clusters organized as a hierarchical tree

✔ Can be visualized as a dendrogram

✔ A tree like diagram that records the sequences of merges

✔ Hierarchical clustering starts with k = N clusters and proceeds by merging the two closest
objects into one cluster, obtaining k = N − 1 clusters.

✔ This process is repeated until we reach the desired number of clusters K.
✔ Hierarchical clustering refers to an unsupervised learning procedure that determines
successive clusters based on previously defined clusters.

✔ It works by grouping data into a tree of clusters.

✔ Hierarchical clustering starts by treating each data point as an individual cluster.

✔ The endpoint is a set of clusters, where each cluster is distinct from the others and the
objects within each cluster are broadly similar to one another.
Dendrogram
Types of Hierarchical Clustering
Agglomerative:
✔ Starts with the points as individual clusters – uses a bottom-up strategy
✔ At each step, merge the closest pair of clusters until only one cluster is left
Divisive:
✔ Starts by placing all objects in one cluster – employs a top-down strategy
✔ At each step, a cluster is split until each cluster contains a single point
✔ Less widely used than the agglomerative approach due to its complexity
Steps in Agglomerative Clustering
1. Consider each data point as an individual cluster.
2. Determine the similarity between each cluster and all other clusters (compute the proximity matrix).
3. Combine the most similar clusters.
4. Recalculate the proximity matrix for the new clusters.
5. Repeat steps 3 and 4 until only a single cluster remains.
Types of HAC
Single - nearest distance (single linkage)
✔ This is the distance between the closest members of the cluster
Complete – farthest distance (Complete linkage)
✔ This is the distance between members that are farthest apart
Average – Average distance ( average linkage)
✔ This method involves looking at the distances between all pairs of points across the two
clusters and averaging these distances. Also called the unweighted pair group method using
arithmetic mean (UPGMA).
Example: find the clusters using single linkage. Use Euclidean distance to calculate the
proximity matrices and, finally, draw the dendrogram.
For instance, the distance between P1 = (0.40, 0.53) and P2 = (0.22, 0.38) is calculated as shown below.
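Substituting the coordinates into the Euclidean distance formula:

$d(P_1, P_2) = \sqrt{(0.40 - 0.22)^2 + (0.53 - 0.38)^2} = \sqrt{0.0324 + 0.0225} \approx 0.23$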
The distance matrices
From the distance matrix we find the minimum value and merge the corresponding points into a cluster.
The minimum value is 0.11, found between P3 and P6. Therefore we combine P3 and P6 into one
cluster.
To update the distance matrix: MIN[dist((P3, P6), P1)]
✔ Values for {P3, P1} and {P6, P1} = {0.22, 0.23}; therefore the minimum value is 0.22.
To update the distance matrix: MIN[dist((P3, P6), P2)]
✔ Values for {P3, P2} and {P6, P2} = {0.15, 0.25}; therefore the minimum value is 0.15.
To update the distance matrix: MIN[dist((P3, P6), P4)]
✔ Values for {P3, P4} and {P6, P4} = {0.15, 0.22}; therefore the minimum value is 0.15.
To update the distance matrix: MIN[dist((P3, P6), P5)]
✔ Values for {P3, P5} and {P6, P5} = {0.28, 0.39}; therefore the minimum value is 0.28.
The next minimum value is 0.14, found between P2 and P5.
To update the distance matrix: MIN[dist((P2, P5), P1)]
✔ Values for {P2, P1} and {P5, P1} = {0.23, 0.34}; therefore the minimum value is 0.23.
To update the distance matrix: MIN[dist((P2, P5), (P3, P6))]
✔ Values for {P2, (P3, P6)} and {P5, (P3, P6)} = {0.15, 0.28}; therefore the minimum value is 0.15.
To update the distance matrix: MIN[dist((P2, P5), P4)]
✔ Values for {P2, P4} and {P5, P4} = {0.20, 0.28}; therefore the minimum value is 0.20.
The next minimum value is 0.15, which appears both between (P2, P5) and (P3, P6) and between P4
and (P3, P6). Since there are two entries with value 0.15, we simply choose the first one:
{(P2, P5), (P3, P6)}.
To update the distance matrix: MIN[dist((P2, P5, P3, P6), P1)]
✔ Values for {(P2, P5), P1} and {(P3, P6), P1} = {0.23, 0.22}; therefore the minimum value is 0.22.
To update the distance matrix: MIN[dist((P2, P5, P3, P6), P4)]
✔ Values for {(P2, P5), P4} and {(P3, P6), P4} = {0.20, 0.15}; therefore the minimum value is 0.15.
The next minimum value is 0.15, found between (P2, P3, P5, P6) and P4, so P4 joins that cluster.
To update the distance matrix: MIN[dist((P2, P3, P5, P6), P1), dist(P4, P1)]
✔ Values for {(P2, P3, P5, P6), P1} and {P4, P1} = {0.22, 0.37}; therefore the minimum value is
0.22, and the final merge joins P1 to the rest at distance 0.22.
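For reference, the same single-linkage merges can be reproduced with SciPy using the pairwise distances quoted above (the use of SciPy/Matplotlib here is an assumption about tooling, not something prescribed by the slides):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram
    from scipy.spatial.distance import squareform

    # Symmetric pairwise distance matrix for P1..P6, assembled from the values above.
    D = np.array([
        [0.00, 0.23, 0.22, 0.37, 0.34, 0.23],
        [0.23, 0.00, 0.15, 0.20, 0.14, 0.25],
        [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
        [0.37, 0.20, 0.15, 0.00, 0.28, 0.22],
        [0.34, 0.14, 0.28, 0.28, 0.00, 0.39],
        [0.23, 0.25, 0.11, 0.22, 0.39, 0.00],
    ])

    # Single-linkage clustering on the condensed distance matrix.
    Z = linkage(squareform(D), method="single")
    print(Z)  # merge order: (P3, P6) at 0.11, (P2, P5) at 0.14, then joins at 0.15, ...

    dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
    plt.show()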
Association Rule Mining
Association analysis is useful for discovering interesting relationships hidden in large data
sets --- frequent co-occurrences of items in a dataset.

The uncovered relationships can be represented in the form of association rules or sets of
frequent items.

Due to their good scalability characteristics, association rules are an essential data mining
tool for extracting knowledge from data.
Application Areas
❖ Market-basket data analysis,
❖ Catalog design
❖ Customizing store layout
❖ Data preprocessing
❖ Personalization and recommendation systems e.g. for browsing web pages
❖ Analysis of genomic data
Application Areas
Market Basket Analysis
involves analyzing the items customers purchase together to understand their purchasing habits
and preferences.

For example, a retailer might use association rule mining to discover that customers who
purchase a mobile phone are also likely to purchase a safety cover. The retailer can use this
information to optimize product placements and promotions to increase sales.
Customer Segmentation
Association rule mining can also be used to segment customers based on their purchasing
habits.

For example, a company might use association rule mining to discover that customers who
purchase certain types of products are more likely to be younger. Similarly, they could learn that
customers who purchase certain combinations of products are more likely to be located in
specific geographic regions.
Fraud Detection

You can also use association rule mining to detect fraudulent activity. For example, a credit
card company might use association rule mining to identify patterns of fraudulent transactions,
such as multiple purchases from the same merchant within a short period of time. This
information can then be used to flag potentially fraudulent activity and take preventative
measures to protect customers.

Recommendation systems

Association rule mining can be used to suggest items that a customer might be interested in
based on their past purchases or browsing history. For example, a music streaming service
might use association rule mining to recommend new artists or albums to a user based on their
listening history.
The following rule can be extracted from the data set shown in the table below:
{Mobile} → {Safety Cover}

❖ Mobile is referred to as the antecedent and Safety Cover as the consequent.

❖ The rule suggests that a strong relationship exists between the sale of safety covers and
mobiles, because many customers who buy a mobile also buy a safety cover.
TID Items
1 {Mobile charger, Earphone}
2 {Mobile charger, Mobile, Safety cover, Screen glass}
3 {Earphone, Mobile, Safety cover, Mobile Charger}
4 {Mobile charger, Earphone, Mobile, Safety cover}
5 {Mobile charger, Earphone, Mobile, Selfie Stick}
Item Set and Support Counts

Metrics for Evaluating Association Rules

❖ In association rule mining, several metrics are commonly used to evaluate the quality and
importance of the discovered association rules.

❖ Goal: to select the most relevant rules for a given application.

❖ Interpreting the results of association rule mining requires understanding the meaning and
implications of each metric, as well as how to use them to evaluate the quality and importance
of the discovered rules.
Support

❖ Support is a measure of how frequently an item or itemset appears in the dataset.

❖ It is calculated as the number of transactions containing the item(s) divided by the total
number of transactions in the dataset.

❖ High support indicates that an item or itemset is common in the dataset, while low support
indicates that it is rare.
Confidence
❖ Confidence is a measure of the strength of the association between two items.

❖ It is calculated as the number of transactions containing both items divided by the number
of transactions containing the first item.

❖ High confidence indicates that the presence of the first item is a strong predictor of the
presence of the second item.
Lift
Lift is a measure of the strength of the association between two items, taking into account
the frequency of both items in the dataset.
It is calculated as the confidence of the association divided by the support of the second item.

✔ Lift = 1: the occurrences of the antecedent and the consequent are independent of each other.

✔ Lift > 1: the two itemsets are positively dependent on each other; the greater the lift, the
stronger the dependence.

✔ Lift < 1: one item is a substitute for the other, meaning one item has a negative effect on
the occurrence of the other.
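In formula form, with σ(·) denoting the support count (the number of transactions containing an itemset) and N the total number of transactions:

$\mathrm{support}(X \Rightarrow Y) = \dfrac{\sigma(X \cup Y)}{N}$

$\mathrm{confidence}(X \Rightarrow Y) = \dfrac{\sigma(X \cup Y)}{\sigma(X)}$

$\mathrm{lift}(X \Rightarrow Y) = \dfrac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)}$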

TID Items
1 {Mobile charger, Earphone}
2 {Mobile charger, Mobile, Safety cover, Screen glass}
3 {Earphone, Mobile, Safety cover, Mobile Charger}
4 {Mobile charger, Earphone, Mobile, Safety cover}
5 {Mobile charger, Earphone, Mobile, Selfie Stick}
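As a quick check against this table: for the rule {Mobile} → {Safety cover}, σ({Mobile}) = 4, σ({Safety cover}) = 3 and σ({Mobile, Safety cover}) = 3 out of N = 5 transactions, so support = 3/5 = 0.6, confidence = 3/4 = 0.75, and lift = 0.75 / (3/5) = 1.25 > 1, indicating that the two items are positively associated.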
Association Rule Mining Tasks
Example
Observations:

All the rules below are binary partitions of the same itemset {Earphone, Mobile, Safety cover}.

Rules originating from the same itemset have identical support but can have different
confidence.

{Earphone, Mobile} → {Cover} (S = 0.4, C = 0.67)
{Earphone, Cover} → {Mobile} (S = 0.4, C = 1)
{Cover, Mobile} → {Earphone} (S = 0.4, C = 0.67)
{Earphone} → {Mobile, Cover} (S = 0.4, C = 0.5)
{Mobile} → {Cover, Earphone} (S = 0.4, C = 0.5)
{Cover} → {Earphone, Mobile} (S = 0.4, C = 0.67)
Mining Association Rules
Steps in the ARM approach:

Frequent itemset generation: generate all itemsets whose support ≥ minsup.

Rule generation: from the frequent itemsets, generate rules that satisfy minimum support and
minimum confidence.

Frequent itemset generation is still computationally expensive.
Apriori algorithm
❖ Support counting is the process of determining the frequency of occurrence of every candidate
itemset that survives the candidate pruning step of the apriori-gen function.
❑ Scan the database of transactions to determine the support of each candidate itemset.
❑ To reduce the number of comparisons, store the candidates in a hash structure.
✔ Instead of matching each transaction against every candidate, match it against the
candidates contained in the hashed buckets.
How does the Apriori algorithm work?
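The sketch below is an illustrative, simplified Python version of the frequent-itemset step of Apriori applied to the running transaction table; the candidate generation here uses simple set unions rather than the full apriori-gen join-and-prune step, so it is a teaching aid rather than the exact algorithm from the slides.

    # Transactions from the running example.
    transactions = [
        {"Mobile charger", "Earphone"},
        {"Mobile charger", "Mobile", "Safety cover", "Screen glass"},
        {"Earphone", "Mobile", "Safety cover", "Mobile charger"},
        {"Mobile charger", "Earphone", "Mobile", "Safety cover"},
        {"Mobile charger", "Earphone", "Mobile", "Selfie stick"},
    ]
    min_support_count = 3

    def support_count(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent individual items.
    items = {item for t in transactions for item in t}
    levels = [{frozenset([i]) for i in items
               if support_count(frozenset([i])) >= min_support_count}]

    # Level k: join frequent (k-1)-itemsets, then keep those meeting minimum support.
    k = 2
    while levels[-1]:
        prev = levels[-1]
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        levels.append({c for c in candidates
                       if support_count(c) >= min_support_count})
        k += 1

    for level in levels:
        for itemset in sorted(level, key=sorted):
            print(sorted(itemset), support_count(itemset))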
Rules Generation in Apriori Algorithm
❖ First, identify the frequent itemsets that satisfy minimum support.
❖ Second, from each obtained frequent itemset, generate its non-empty subsets.
❖ Finally, form association rules from the non-empty subsets:
S → (I − S)
where S is the antecedent and I is the frequent itemset.
Example: {Earphone, Mobile, Cover} is a frequent itemset.
Therefore, S → (I − S) = {Earphone} → ({Earphone, Mobile, Cover} − {Earphone}) = {Earphone} → {Mobile, Cover}
Practical Example
From the transactions given below, find the itemsets sold together at the super electronics
shop, where the minimum support count is 3 and the minimum confidence is 60%. Finally, generate
the valid association rules from the obtained frequent itemset(s).

TID Items
1 {Mobile charger, Earphone}
2 {Mobile charger, Mobile, Safety cover, Screen glass}
3 {Earphone, Mobile, Safety cover, Mobile Charger}
4 {Mobile charger, Earphone, Mobile, Safety cover}
5 {Mobile charger, Earphone, Mobile, Selfie Stick}
1-itemset
  Item           Freq
  Charger        5
  Earphone       4
  Mobile         4
  Safety cover   3
  Screen glass   1
  Selfie stick   1

2-itemset
  Items                     Freq
  Charger, Earphone         4
  Charger, Mobile           4
  Charger, Safety cover     3
  Earphone, Mobile          3
  Earphone, Safety cover    2
  Mobile, Safety cover      3

3-itemset
  Items                              Freq
  Charger, Earphone, Mobile          3
  Charger, Earphone, Safety cover    2
  Charger, Mobile, Safety cover      3
  Earphone, Mobile, Safety cover     2

4-itemset
  Items                                       Freq
  Charger, Earphone, Mobile, Safety cover     2
❖ The minimum support count is 3; therefore we prune the items whose frequency is < 3:
Screen glass and Selfie stick.
❖ Generate candidate itemsets using a brute-force approach.
❖ From the 2-itemset and 3-itemset tables we likewise remove the candidates that do not satisfy
the minimum support (for example {Earphone, Safety cover}).
Frequent Itemsets
Therefore, these itemsets ({Charger, Earphone, Mobile} and {Charger, Mobile, Safety cover}) are
the frequent item sets purchased together by customers of the super electronics shop.

Let's generate association rules from the first frequent itemset.

Before that, we obtain its non-empty proper subsets:
Subsets = {Charger}, {Earphone}, {Mobile}, {Charger, Earphone}, {Charger, Mobile}, {Earphone, Mobile}
Association Rule Generation
minSup = 3 and minConf = 60%

Rule 1: {Charger} → {Earphone, Mobile}
▪ S = 3/5 = 0.6 and C = 3/5 = 60%
Rule 2: {Earphone} → {Charger, Mobile}
▪ S = 3/5 = 0.6 and C = 3/4 = 75%
Rule 3: {Mobile} → {Charger, Earphone}
▪ S = 3/5 = 0.6 and C = 3/4 = 75%
Rule 4: {Mobile, Earphone} → {Charger}
▪ S = 3/5 = 0.6 and C = 3/3 = 100%
Rule 5: {Charger, Mobile} → {Earphone}
▪ S = 3/5 = 0.6 and C = 3/4 = 75%
Rule 6: {Earphone, Charger} → {Mobile}
▪ S = 3/5 = 0.6 and C = 3/4 = 75%

All six rules are valid, since each satisfies the minimum confidence of 60%.
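These support and confidence values can also be checked mechanically; the snippet below is an illustrative sketch that simply reuses the support counts from the itemset tables above:

    from itertools import combinations

    # Support counts taken from the itemset tables above (N = 5 transactions).
    sigma = {
        ("Charger",): 5, ("Earphone",): 4, ("Mobile",): 4,
        ("Charger", "Earphone"): 4, ("Charger", "Mobile"): 4,
        ("Earphone", "Mobile"): 3,
        ("Charger", "Earphone", "Mobile"): 3,
    }
    N = 5
    itemset = ("Charger", "Earphone", "Mobile")

    # Each non-empty proper subset S yields a candidate rule S -> (I - S).
    for r in range(1, len(itemset)):
        for s in combinations(itemset, r):
            rest = tuple(i for i in itemset if i not in s)
            support = sigma[itemset] / N
            confidence = sigma[itemset] / sigma[s]
            print(f"{set(s)} -> {set(rest)}: S = {support:.1f}, C = {confidence:.0%}")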
Hyperparameters and Parameters in Machine Learning
Hyperparameters
❖ Are those parameters that are explicitly defined by the user to control the learning process.
❖ Are used to improve the learning of the model; their values are set before the learning
process of the model starts.
❖ These are usually defined manually by the machine learning engineer.
❖ The best value can be determined either by a rule of thumb or by trial and error.
Some examples of Hyperparameters in Machine Learning
▪ The k in kNN or K-Nearest Neighbour algorithm
▪ Learning rate for training a neural network
▪ Train-test split ratio
▪ Batch Size
▪ Number of Epochs
▪ Branches in Decision Tree
▪ Number of clusters in Clustering Algorithm
Model parameters
❖ Model parameters are configuration variables that are internal to the model,
and a model learns them on its own.
❖ For example, the weights or coefficients of independent variables in an SVM, the weights
and biases of a neural network, and the cluster centroids in clustering.
Some key points for model parameters are as follows:
▪ They are used by the model for making predictions.
▪ They are learned by the model from the data itself
▪ These are usually not set manually.
▪ These are the part of the model and key to a machine learning Algorithm.
Hyperparameter for Optimization
Learning Rate:
❖ Controls how much the model changes in response to the estimated error each time the
model's weights are updated.
❖ It is one of the crucial parameters when building a neural network.
❖ Selecting the optimal learning rate is a challenging task: if the learning rate is too small,
it may slow down the training process; on the other hand, if the learning rate is too large,
the model may not be optimized properly.
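For a single weight $w$, loss $L$ and learning rate $\eta$, a gradient-descent update has the form $w \leftarrow w - \eta \, \partial L / \partial w$: a very small $\eta$ gives tiny steps (slow training), while a very large $\eta$ gives steps that can overshoot and prevent proper convergence.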
Batch Size:

❖ To enhance the speed of the learning process, the training set is divided into subsets
known as batches.

❖ Batch size is a hyperparameter that defines the number of samples worked through before the
model's internal parameters are updated.

❖ A training dataset can be broken down into multiple batches.

❖ Epoch: one complete pass of the entire training dataset through the algorithm.

Processing one batch is called an iteration, so the number of batches equals the number of
iterations for one epoch.

Example

Say a machine learning model is to be trained on 5000 training examples. This large dataset
can be broken down into smaller bits called batches.

Suppose the batch size is 500; hence, 10 batches are created and it takes ten iterations to
complete one epoch.
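The same bookkeeping in a small, illustrative Python snippet:

    import math

    n_examples, batch_size = 5000, 500
    batches_per_epoch = math.ceil(n_examples / batch_size)   # 10 batches
    iterations_per_epoch = batches_per_epoch                  # one parameter update per batch
    print(batches_per_epoch, iterations_per_epoch)            # 10 10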
Hyperparameter for Specific Models
❖ Hyperparameters that are involved in the structure of the model are known as hyperparameters
for specific models. These are given below:

❖ Number of hidden units: hidden units are the components that make up the layers of
processing between the input and output units of a neural network.
✔ It is important to specify the number of hidden units for the neural network. A common rule
of thumb is that it should be between the size of the input layer and the size of the output
layer; more specifically, about 2/3 of the size of the input layer plus the size of the output
layer.

❖ Number of Layers: Input, Hidden and Output layer


Model Performance
❖ Imagine a hospital that is trying to predict whether a patient will have diabetes in the
future based on his or her medical condition.
❖ They have built a binary classification model from the past data of their patients and are
comparing the predicted results with the actual results.
❖ Here the algorithm is trying to predict the probability of possible diabetes. Hence, 1 is
coded as the "yes" case (the person may have diabetes).
Some common Terminologies
True Positive: is the value which is actually positive and also correctly predicted as positive.

True Negative: is the value which is actually negative and also correctly predicted as negative.

False Positive: is the value which is actually negative, but incorrectly predicted as positive.

False Negative: is the value which is actually positive, but incorrectly predicted as negative

Precision: out of the total predicted positive values, how many are actually positive.

Accuracy: the number of correctly predicted values (both positive and negative) divided by the
total number of values.

Recall (TPR/Sensitivity): out of the total actual positive values, how many are correctly
predicted as positive.

F-Beta score: the weighted harmonic mean of precision and recall.

Confusion Matrix: an n×n matrix which depicts these correct classifications and
misclassifications in tabular form.
When do we use each metric?
Assume you have a diabetes dataset with 1000 records.
Scenario 1: 500 records labeled as diabetic and 500 records labeled as non-diabetic.
Scenario 2: 600 records labeled as diabetic and 400 records labeled as non-diabetic.
Scenario 3: 700 records labeled as diabetic and 300 records labeled as non-diabetic.
Scenario 4: 900 records labeled as diabetic and 100 records labeled as non-diabetic.
Cont..
❖ For scenarios 1, 2 and 3 you can use accuracy as well as the other performance evaluation
metrics, but for scenario 4 you cannot rely on accuracy. Why? Because the scenario 4 dataset
is not balanced; there is an imbalance between the two classes. So what should we use instead?
❖ If false positives have a great impact on the model's performance, we must reduce false
positives. For instance, assume that you applied for a job vacancy announced by the xyz
company and you passed the exam and the interview. Finally, the company sends you an email
that says "congratulations, you successfully passed everything; please start the hiring
process as soon as possible". However, the mail was detected as spam even though it is
actually not spam (a false positive). In this case, we must use precision.
Recall
❖ Suppose a person has lung cancer and the model predicts that the person has no cancer. That
is a disastrous false negative.

❖ If the impact of false negatives is great, then we must use recall/sensitivity.

❖ What if both false positives and false negatives have a great impact on model performance?

❖ Use the F-beta score.
Formula
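The standard definitions of these metrics in terms of TP, TN, FP and FN are:

$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$

$\mathrm{Recall} = \dfrac{TP}{TP + FN}$

$F_\beta = \dfrac{(1 + \beta^2)\,\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}}$

With β = 1 this is the usual F1 score, the harmonic mean of precision and recall.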
Imbalanced Datasets
Suppose that you are working at a company and you are asked to create a model that, based on
various measurements at your disposal, predicts whether a product is defective or not. You
decide to use your favourite classifier, train it on the data, and get 96.2% accuracy!
Your boss is astonished and decides to use your model without any further tests. A few weeks
later he enters your office and points out the uselessness of your model: indeed, the model
you created has not found a single defective product since it was put into production.
After some investigation, you find out that only around 3.8% of the products made by your
company are defective, and your model simply always answers "not defective", leading to a
96.2% accuracy.
Approaches to deal with the imbalanced dataset problem

1. Choose a Proper Evaluation Metric

❖ Other metrics are more informative than accuracy here: precision measures how accurate the
classifier's prediction of a specific class is, and recall measures the classifier's ability
to identify a class.

❖ For an imbalanced dataset, the F1 score is a more appropriate metric.

2. Resampling (Oversampling and Undersampling)
❖ When we are using an imbalanced dataset, we can oversample the minority class with
replacement. This technique is called oversampling.
❖ Similarly, we can randomly delete rows from the majority class to match the minority class,
which is called undersampling.
❖ After resampling the data we get a balanced dataset for both the majority and minority
classes.
❖ So, when both classes have a similar number of records in the dataset, we can assume that
the classifier will give equal importance to both classes.
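A minimal illustrative sketch of random oversampling with scikit-learn's resample utility (the dataset and class sizes here are made-up placeholders, not data from the slides):

    import numpy as np
    from sklearn.utils import resample

    # Hypothetical imbalanced dataset: 950 negatives, 50 positives.
    X = np.random.randn(1000, 5)
    y = np.array([0] * 950 + [1] * 50)

    X_min, y_min = X[y == 1], y[y == 1]
    X_maj, y_maj = X[y == 0], y[y == 0]

    # Oversample the minority class with replacement to match the majority class size.
    X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                                  n_samples=len(y_maj), random_state=42)

    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.concatenate([y_maj, y_min_up])
    print(X_bal.shape, np.bincount(y_bal))   # balanced classes: 950 and 950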
Are there any comments or suggestions regarding my teaching style? I will genuinely accept
them and improve in my future career.

The End! Thank you.


Choose only 5 questions and write the correct answers.
1. What is association rule mining?

2. What are support and confidence?

3. In what form is an association rule represented? What are the items on the left and right
sides of a rule called, respectively?

4. How do you handle an imbalanced dataset in machine learning?

5. List the different types of performance evaluation metrics in machine learning.

6. What is the difference between hyperparameters and parameters in machine learning?

7. Why are batch size and epoch crucial in machine learning? Give an example.

8. Assume the table shown below is a 2×2 confusion matrix for binary classification. Which
letter is TP, FP, FN and TN, respectively?
