Unit 2
Both approaches require labeled data for training but differ in their objectives—
classification aims to find decision boundaries that separate classes, whereas regression focuses
on finding the best-fitting line to predict numerical outcomes. Understanding these distinctions
helps in selecting the right approach for specific machine learning tasks.
For example, a classifier can determine whether an email is spam or not, classify images as "cat"
or "dog," or predict weather conditions like "sunny," "rainy," or "cloudy" by learning a decision
boundary. Regression models, in contrast, are used to predict house prices based on features like
size and location, or to forecast stock prices over time by fitting a best-fit line.
Decision Boundary vs Best-Fit Line
1. Decision Boundary in Classification
A linear decision boundary might separate two classes in a 2D space with a straight line
(e.g., logistic regression).
A more complex model may create non-linear boundaries to better fit intricate datasets.
During training, the classifier learns to partition the feature space by finding a boundary that
minimizes classification errors.
For binary classification, this boundary separates data points into two groups (e.g.,
spam vs. non-spam emails).
In multi-class classification, multiple boundaries are created to separate more than two
classes.
The decision boundary is not inherent to the training data but rather depends on the classifier
used.
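As a concrete sketch of a linear decision boundary (the synthetic data and the use of scikit-learn's LogisticRegression are assumptions for illustration, not part of these notes), the boundary can be read off from the fitted coefficients:

# A minimal sketch: fitting a linear decision boundary with logistic regression
# on synthetic 2D data (dataset assumed for illustration).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)

# The learned boundary is the line w1*x1 + w2*x2 + b = 0 in feature space.
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]
print(f"Decision boundary: {w1:.2f}*x1 + {w2:.2f}*x2 + {b:.2f} = 0")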
2. Best-Fit Line in Regression
In regression, a best-fit line (or regression line) represents the relationship between
independent variables (inputs) and a dependent variable (output). It is used to predict
continuous numerical values by capturing trends and relationships within the data. The
best-fit line can be linear or non-linear:
The plot demonstrates Regression, where both Linear and Polynomial models are used to
predict continuous target values based on the input feature, in contrast to Classification,
which would create decision boundaries to separate discrete classes.
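As a minimal sketch of a linear best-fit line (the data points below are assumed for illustration):

# Fitting a linear best-fit line and predicting a continuous value.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # input feature
y = np.array([1.2, 1.9, 3.1, 4.0, 5.1])   # continuous target

model = LinearRegression().fit(X, y)
print(f"Best-fit line: y = {model.coef_[0]:.2f}*x + {model.intercept_:.2f}")
print(model.predict([[6]]))  # predict a continuous value for a new input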
Classification Algorithms
There are different types of classification algorithms that have been developed over
time to give the best results for classification tasks.
Logistic Regression
Decision Tree
Random Forest
K-Nearest Neighbors
Support Vector Machine
Naive Bayes
Regression Algorithms
There are different types of regression algorithms that have been developed over time
to give the best results for regression tasks.
Lasso Regression
Ridge Regression
XGBoost Regressor
LGBM Regressor
BINARY CLASSIFICATION
A classification task can fall under one of these two categories:
Binary classification, where the number of classes is two. For example, the email spam
classification that we saw earlier.
Multi-class classification, where the number of classes is more than two. For example,
classifying eCommerce inquiry emails into three types: shipping, returns, or tracking.
Example:
Let’s stick with the spam email classification, where our task is to classify a list of
emails into one of two classes: Spam or Not Spam. We’ll represent Spam with the integer 1 (or
Positive) and Not Spam with 0 (or Negative).
Here we have a dataset containing 20 email titles. We put each data point through a
binary classifier to get the predicted class and then compare it with its actual class.
1. Accuracy
Accuracy is then given as the number of correct predictions divided by the total number
of predictions:
Accuracy = Number of correct predictions / Total number of predictions
From the spam classifier output above, we have 15 correct predictions and 5
incorrect predictions, which gives us an Accuracy of 75%.
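The same computation can be sketched with scikit-learn; the label vectors below are assumed, chosen only so that the counts match this example (15 correct out of 20, with 17 actual Not Spam emails):

# Accuracy = number of correct predictions / total number of predictions.
from sklearn.metrics import accuracy_score

y_true = [1, 1, 1, 0, 0, 0] + [0] * 14  # 3 spam, 17 not spam (assumed)
y_pred = [1, 0, 0, 1, 1, 1] + [0] * 14  # 15 predictions match y_true
print(accuracy_score(y_true, y_pred))   # 0.75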
Accuracy is often used as the measure of classification performance because it is simple
to compute and easy to interpret. However, it can turn out to be misleading in some cases.
This is especially true when dealing with imbalanced data, a scenario in which certain
classes contain far more data points than the others.
Let's go back to our dataset to understand this. Notice that if the classifier had not been
learning anything and was simply classifying all the outputs to be 0 (Not Spam), we would get
17 out of 20 correct classifications, which translates to a very high Accuracy of 85%! Clearly,
something isn’t right.
If you haven’t noticed yet, our dataset is indeed imbalanced. We have way more emails
that are not spam than emails that are spam.
The issue of imbalanced datasets is common in the real world. This is why we need a
better way to measure a classifier's performance than Accuracy alone.
2. Confusion Matrix
The other three metrics can provide a more balanced view of a classifier’s true
performance. But before we can see them in action, we need to first understand the Confusion
Matrix.
The Confusion Matrix takes the classification results and groups them into four
categories:
True Positive (TP): when both the actual and predicted values are 1.
True Negative (TN): when both the actual and predicted values are 0.
False Positive (FP): when the actual value is 0 but the predicted value is 1.
False Negative (FN): when the actual value is 1 but the predicted value is 0.
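A short sketch of computing these four categories with scikit-learn's confusion_matrix, reusing the assumed label vectors from the Accuracy example:

# sklearn orders the binary confusion matrix as [[TN, FP], [FN, TP]]
# for labels [0, 1], so ravel() unpacks in that order.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0] + [0] * 14
y_pred = [1, 0, 0, 1, 1, 1] + [0] * 14
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=1, TN=14, FP=3, FN=2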
Recall that in our case, we refer to the event we want to capture (1 - Spam) as Positive
and non-event (0 - Not Spam) as Negative.
The Confusion Matrix for binary classification is a 2-by-2 matrix, where each column
represents one predicted class and each row one actual class, as follows:

             Predicted 1   Predicted 0
Actual 1         TP            FN
Actual 0         FP            TN
We can also map the Confusion Matrix to the Accuracy formula that we saw earlier, as
follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
We can now see via this matrix why Accuracy can sometimes hide the nuance of
imbalanced datasets. The reason is that in these kinds of datasets, the True Negative category
dominates, diluting the effect of the rest.
So even if the classifier were to perform poorly in the other three categories, its
Accuracy would still look good, masking its deficiencies.
3. Precision
The first of the three remaining metrics that provide a more balanced view of a
classifier's performance is Precision. Precision is calculated as follows:
Precision = TP / (TP + FP)
Notice what just happened? Now, the True Negatives are not even part of the
calculation. Precision focuses on the True Positives and False Positives, therefore providing a
representation that may be missed via Accuracy. Whereas Accuracy looked impressive at 75%
earlier, we now see that Precision is far lower at just 25%.
4. Recall
Recall uses the same principle as Precision, except the focus is now on the False
Negatives instead of the False Positives. Again, the True Negatives are not part of the
consideration. Recall is calculated as follows:
Recall = TP / (TP + FN)
Between Precision and Recall though, there is a tradeoff. It is hard to optimize for both
simultaneously as optimizing for the False Positives (thereby improving Precision) comes at
the expense of the False Negatives (thereby deteriorating Recall), and vice versa.
Which then brings the question: which metric should you prioritize—Precision or
Recall?
The answer is that it depends on the nature of your task. Let's see why.
Suppose the spam classifier achieved high Precision and low Recall (Scenario A). This
would result in fewer non-spam emails flagged as spam (False Positive). But this would also
mean more of the actual spam emails went undetected (False Negative).
Conversely, if the classifier achieved high Recall and low Precision (Scenario B), there
would be fewer undetected spam emails (False Negative), but it comes at the expense of more
non-spam emails being flagged as spam (False Positive).
For a spam classification task, it's probably more desirable to avoid important emails
being moved into the spam folder than to have the occasional spam email land in the inbox.
So for this task, we will want to prioritize Precision over Recall.
5. F1
What if both Precision and Recall are important to you and you need the classifier to
do well in both? The answer is to use the final metric of the four: F1.
F1 provides the balance between Precision and Recall; it is their harmonic mean:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
There are different versions of the 'F-score' family if you want to go further, for example
assigning a bigger weight to either Precision or Recall, but F1 is a good enough option to
get started.
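Putting the four metrics together, here is a sketch on the same assumed labels (TP=1, TN=14, FP=3, FN=2, consistent with the 75% Accuracy and 25% Precision discussed above):

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0] + [0] * 14
y_pred = [1, 0, 0, 1, 1, 1] + [0] * 14

print(accuracy_score(y_true, y_pred))   # 0.75
print(precision_score(y_true, y_pred))  # 0.25
print(recall_score(y_true, y_pred))     # 0.33 (1 of 3 spam emails caught)
print(f1_score(y_true, y_pred))         # 0.29 (harmonic mean of the two)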
CLASS PROBABILITY ESTIMATION: ASSESSING CLASS PROBABILITY ESTIMATES
Class Probability Estimation in Machine Learning refers to the process of predicting
the probability that a given data instance belongs to a particular class. This is essential in tasks
where understanding the confidence of predictions is as important as the prediction itself.
1. Logistic Regression
o Well-calibrated by design.
2. Naive Bayes
4. Decision Trees
5. Random Forest
o Aggregates tree probabilities to improve stability and accuracy.
6. Neural Networks
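Most scikit-learn classifiers expose these estimates through predict_proba; a minimal sketch (model and toy data assumed):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression().fit(X, y)

# Each row gives the estimated P(class 0) and P(class 1) for one instance.
print(clf.predict_proba(X[:3]))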
To evaluate how well these probabilities reflect real-world outcomes, consider these
metrics:
1. Brier Score
o Measures the mean squared error between predicted probabilities and actual
outcomes.
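A small sketch of the Brier Score using scikit-learn's brier_score_loss (the outcomes and probabilities below are assumed; lower scores are better):

from sklearn.metrics import brier_score_loss

y_true = [0, 1, 1, 0, 1]            # actual binary outcomes
y_prob = [0.1, 0.9, 0.8, 0.3, 0.6]  # predicted probabilities of class 1
print(brier_score_loss(y_true, y_prob))  # mean of (p_i - o_i)^2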
REGRESSION: ASSESSING PERFORMANCE OF REGRESSION - ERROR MEASURES
Regression algorithms are used to predict continuous numerical values based on input
features. In scikit-learn, we can use numerous regression algorithms, such as Linear
Regression, Decision Trees, Random Forests, and Support Vector Machines (SVM),
among others.
In regression, we have two sets of values to compare: the actual target values (true
values) and the values predicted by our model (predicted values). The performance of the
model is assessed by measuring the similarity between these two sets.
2. Evaluation Metrics:
Regression metrics are quantitative measures used to evaluate the quality of a regression
model. Scikit-learn provides several metrics, each with its own strengths and limitations,
to assess how well a model fits the data.
1. Mean Absolute Error (MAE)
Mathematical Formula
The formula to calculate MAE for a dataset with "n" data points is:
MAE = (1/n) Σ |xi − yi|
Where:
xi represents the actual or observed values for the i-th data point.
yi represents the predicted value for the i-th data point.
2. Mean Squared Error (MSE)
The Mean Squared Error (MSE) is a popular metric in statistics and machine learning. It
measures the average of the squared differences between a dataset's actual values and
predicted values. MSE is frequently used in regression problems to assess how well
predictive models work.
Mathematical Formula
For a dataset containing 'n' data points, the MSE calculation formula is:
MSE = (1/n) Σ (xi − yi)²
where:
xi represents the actual or observed value for the i-th data point.
yi represents the predicted value for the i-th data point.
3. R-squared (R²) Score
The R² score, also referred to as the coefficient of determination, is a statistical metric
frequently used to assess the goodness of fit of a regression model. It quantifies the
proportion of the variation in the dependent variable that is explained by the model's
independent variables. R² is a useful statistic for evaluating the overall effectiveness and
explanatory power of a regression model.
Mathematical Formula
R² = 1 − (SSR / SST)
Where:
R² is the R-squared score.
SSR represents the sum of squared residuals between the predicted values and actual
values.
SST represents the total sum of squares, which measures the total variance in the
dependent variable.
4. Root Mean Squared Error (RMSE)
RMSE stands for Root Mean Squared Error. It is a commonly used metric in regression
analysis and machine learning to measure the accuracy or goodness of fit of a predictive model,
especially when the predictions are continuous numerical values.
The RMSE quantifies how well the predicted values from a model align with the actual
observed values in the dataset. Here's how it works:
1. Calculate the Squared Differences: For each data point, subtract the predicted value
from the actual (observed) value, square the result, and sum up these squared
differences.
2. Compute the Mean: Divide the sum of squared differences by the number of data
points to get the mean squared error (MSE).
3. Take the Square Root: To obtain the RMSE, simply take the square root of the MSE.
Mathematical Formula
The formula for RMSE for a dataset with 'n' data points is as follows:
RMSE = √( (1/n) Σ (xi − yi)² )
Where:
xi represents the actual or observed value for the i-th data point.
yi represents the predicted value for the i-th data point.
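All four error measures can be computed in a few lines with scikit-learn; this sketch uses assumed toy values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_predicted = np.array([2.5, 5.5, 7.0, 9.0])

mae = mean_absolute_error(y_actual, y_predicted)
mse = mean_squared_error(y_actual, y_predicted)
rmse = np.sqrt(mse)  # RMSE is simply the square root of MSE
r2 = r2_score(y_actual, y_predicted)
print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R2={r2:.3f}")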
OVERFITTING AND UNDERFITTING
Overfitting occurs when a model learns the training data too closely, noise included. As a
result, the model works great on training data but fails when tested on new data.
Overfitting models are like students who memorize answers instead of understanding
the topic. They do well in practice tests (training) but struggle in real exams (testing).
Underfitting is the opposite problem. For example, imagine drawing a straight line to fit
points that actually follow a curve. The line misses most of the pattern.
In this case, the model doesn’t work well on either the training or testing data.
Underfitting models are like students who don’t study enough. They don’t do well in practice
tests or real exams.
Underfitting: a straight line through curved data, missing most of the pattern.
Overfitting: a squiggly curve passing through all training points, failing to generalize; it
performs well on training data but poorly on test data.
Appropriate fitting: a curve that follows the data trend without overcomplicating, capturing
the true patterns in the data.
CASE STUDY OF POLYNOMIAL REGRESSION
Introduction
Polynomial Regression is a type of regression analysis where the relationship between
the independent variable x and the dependent variable y is modeled as an nth-degree
polynomial. While linear regression assumes a straight-line relationship between variables,
polynomial regression can capture more complex, non-linear relationships. This case study
explores how polynomial regression can be applied to predict real-world data, providing
insights into its usage, benefits, and limitations.
Problem Statement
We are tasked with predicting the growth of a plant based on the number of days since
planting. We have historical data that includes the number of days and corresponding plant
heights, and we wish to model the plant growth using polynomial regression.
Data Collection
We have collected a dataset recording the number of days since planting and the
corresponding plant heights.
Plotting the Data: The data appears to have a quadratic pattern, meaning it follows a
curve rather than a straight line. We can visualize this by plotting the data points on a
scatter plot.
The fitted polynomial has the form y = b0 + b1x + b2x² + … + bnxⁿ, where x is the
independent variable (Days Since Planting) and y is the dependent variable (Plant Height).
In Python, this can be done using the PolynomialFeatures class from the sklearn.preprocessing
module.
1. Transform the original feature (Days Since Planting) into polynomial features.
2. Fit a linear regression model on the polynomial features.
We also compare the results of fitting models with different polynomial degrees (e.g.,
2nd, 3rd, and 4th degrees) to see which one performs best.
For example, if a second-degree polynomial provides a good fit, the curve will closely
follow the quadratic shape of the data. If higher-degree polynomials are overfitting, the curve
will start to oscillate unnecessarily.
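A sketch of the workflow just described, using PolynomialFeatures and LinearRegression; the plant-growth values below are assumed stand-ins for the case study's dataset:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

days = np.arange(1, 11).reshape(-1, 1)  # Days Since Planting (assumed)
height = np.array([1.2, 2.1, 4.3, 7.9, 12.1,
                   17.8, 24.5, 32.2, 41.0, 50.7])  # Plant Height (assumed)

for degree in (2, 3, 4):
    # Step 1: transform the original feature into polynomial features.
    X_poly = PolynomialFeatures(degree=degree).fit_transform(days)
    # Step 2: fit a linear regression model on the polynomial features.
    model = LinearRegression().fit(X_poly, height)
    print(degree, r2_score(height, model.predict(X_poly)))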
Step 7: Conclusion
Model Choice: Based on the evaluation metrics, we choose the polynomial regression
model that provides the best fit. In this case, the second-degree polynomial model may
be sufficient to accurately model the data, as the relationship between the variables
appears quadratic.
Overfitting: Higher-degree polynomials may lead to overfitting, where the model
performs well on training data but poorly on unseen data.
Interpretability: Polynomial regression provides a way to model non-linear
relationships, but the complexity of the polynomial increases with higher degrees,
making it harder to interpret.
Advantages
1. Flexibility: Polynomial regression is highly flexible and can model a wide range of
non-linear relationships.
2. Simple to Implement: It is relatively easy to implement using libraries like Scikit-learn,
making it a good option for small to medium-sized datasets.
3. Improved Prediction: Polynomial regression can improve prediction accuracy
compared to linear regression when the underlying data has a non-linear relationship.
Disadvantages
1. Overfitting: Higher-degree polynomials can lead to overfitting, where the model learns
the noise in the training data instead of the underlying relationship.
2. Interpretability: As the degree of the polynomial increases, the model becomes more
difficult to interpret.
3. Computational Complexity: Higher-degree models can be computationally
expensive, especially with large datasets.
Conclusion
Polynomial regression is a powerful tool for modeling non-linear relationships. In this
case study, we showed how polynomial regression can be applied to predict plant growth over
time. While polynomial regression provides an accurate model for many real-world problems,
careful consideration must be given to the degree of the polynomial used to avoid overfitting
and maintain model interpretability.
GENERALIZATION AND THE EFFECTIVE NUMBER OF HYPOTHESES
One way to understand generalization is through the concept of the effective number
of hypotheses. This concept stems from statistical learning theory and helps us understand how
the complexity of a model affects its ability to generalize.
Key Concepts
1. Hypothesis Space: The hypothesis space refers to the set of all possible models or
functions that could be considered by the learning algorithm to fit the data. In machine
learning, we typically search for a function within this space that minimizes some form
of error (like mean squared error in regression or cross-entropy in classification).
2. Effective Number of Hypotheses: The effective number of hypotheses refers to the
"effective" size of the hypothesis space that a model is able to select from based on the
data. It is often smaller than the total number of hypotheses available in the hypothesis
space because not all hypotheses are consistent with the data. In other words, this
number indicates how many hypotheses are effectively being considered when training
the model, based on the constraints imposed by the training data.
o The idea behind the effective number of hypotheses is that a smaller effective
number of hypotheses leads to better generalization. The more hypotheses there
are in the hypothesis space that fit the training data, the higher the risk of
overfitting to noise in the data and the poorer the model's ability to generalize.
Overfitting: If a model has access to a very large hypothesis space, it may be able to
perfectly fit the training data, including any noise. This results in overfitting, where the
model has too many hypotheses that fit the training data exactly, but fails to generalize
to new data. In this case, the effective number of hypotheses is large, but the model's
generalization ability is poor.
Underfitting: If the hypothesis space is too small, the model may not be able to capture
the underlying patterns in the data. It may underfit, resulting in poor performance both
on the training data and on unseen data. Here, the effective number of hypotheses is
small, which might hinder the model's ability to find the true underlying relationships.
Model Complexity: The complexity of the model determines the size of the hypothesis
space. For example, a linear regression model (with a single linear function) has a
23 | P a g e
Notes By: Dr. Deepali G. Chaudhari
Subject: Machine Learning
relatively small hypothesis space compared to a deep neural network with many layers
and parameters. The hypothesis space size is proportional to the flexibility of the
model—more parameters lead to more hypotheses being possible. However, too many
parameters can lead to overfitting if the model is not regularized or controlled.
L2 regularization (Ridge) penalizes large weights, which reduces the capacity of the
model and shrinks the effective number of hypotheses.
L1 regularization (Lasso) can lead to sparse solutions, effectively reducing the number
of active features (hypotheses).
Dropout in neural networks forces the network to learn redundant representations by
randomly dropping units during training, reducing the effective number of hypotheses.
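As a brief illustration of the first two points, here is a sketch (synthetic data and penalty strengths assumed) showing how Ridge shrinks weights while Lasso drives many of them to exactly zero:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # L2: shrinks all weights toward zero
lasso = Lasso(alpha=5.0).fit(X, y)   # L1: zeroes out many weights entirely

print("OLS   max |w|:", np.abs(ols.coef_).max())
print("Ridge max |w|:", np.abs(ridge.coef_).max())
print("Lasso nonzero weights:", np.count_nonzero(lasso.coef_))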
Imagine you are trying to fit a curve to a set of data points. If you use a very flexible
curve (like a high-degree polynomial), the number of ways you can fit the curve to the
data increases significantly, leading to many possible hypotheses. This makes it more
likely that your model fits noise in the training data and does not generalize well to
unseen data.
On the other hand, if you use a simple linear model, there are fewer possible curves that
can fit the data. In this case, the effective number of hypotheses is smaller, and the
model may generalize better because it is less likely to overfit to noise.
The balance between model complexity (size of the hypothesis space) and the effective
number of hypotheses that the model actually uses is key to achieving good generalization.
Theoretical Formulation of the Effective Number of Hypotheses
In a more formal sense, the effective number of hypotheses can be related to the VC
dimension (Vapnik-Chervonenkis dimension) of the hypothesis space. The VC dimension is a
measure of the capacity or complexity of a model, i.e., how well it can fit different types of
data. A model with a higher VC dimension has a larger hypothesis space, meaning it has more
hypotheses to choose from and is more prone to overfitting.
The effective number of hypotheses is related to the degrees of freedom of the model.
In the context of linear regression with L2 regularization of strength λ, it can be
approximated by the effective degrees of freedom:
df(λ) = trace( X (XᵀX + λI)⁻¹ Xᵀ )
Where X is the design matrix of features and λ is the regularization strength. This formula
provides a way of estimating how much flexibility the model has based on the input features
and their interactions.
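A numeric sketch of this approximation (the random design matrix is assumed): as the penalty λ grows, the trace shrinks, reflecting a smaller effective number of hypotheses:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))  # 50 samples, 10 features (assumed)

for lam in (0.0, 1.0, 100.0):
    # df(lam) = trace( X (X'X + lam*I)^-1 X' )
    H = X @ np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T
    print(lam, np.trace(H))  # lam=0 recovers the full 10 degrees of freedom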
Generalization is about a model's ability to perform well on unseen data, and the
effective number of hypotheses impacts generalization.
Overfitting occurs when a model has too many possible hypotheses that fit the training
data too closely, leading to poor performance on new data.
Underfitting happens when the hypothesis space is too small, and the model fails to
capture the underlying patterns in the data.
In simple terms, the effective number of hypotheses determines how many possible
models are effectively considered by the learning algorithm based on the data, and controlling
this number is key to balancing model complexity and generalization.
THE VC DIMENSION
Key Concepts
1. Hypothesis Space: This refers to the set of all possible models or functions that a
learning algorithm can choose from. Each model in the hypothesis space represents a
different way of mapping inputs to outputs.
2. Shattering: A model is said to shatter a set of points if, for every possible labeling of
those points (i.e., all combinations of positive/negative labels in classification), there
exists at least one hypothesis in the hypothesis space that perfectly classifies those
points according to the given labels.
3. VC Dimension: The VC dimension of a model or hypothesis space is defined as the
largest set of points that can be shattered by the model. It represents the model’s ability
to adapt to different datasets.
o If a model can shatter k points, then its VC dimension is at least k.
o If no set of k+1 points can be shattered, then the VC dimension is at most k.
Importance of VC Dimension in Generalization
The VC dimension plays a critical role in understanding the generalization ability of
a machine learning model:
High VC Dimension: A model with a high VC dimension has more capacity to fit
complex patterns in data, which means it can potentially overfit the training data. This
leads to poor generalization on unseen data.
Low VC Dimension: A model with a low VC dimension has limited capacity to fit data,
which may result in underfitting. It might not be flexible enough to capture the true
underlying patterns in the data.
Examples
A classic example: a linear classifier in two dimensions has a VC dimension of 3. It can
shatter any 3 non-collinear points, since every one of the 8 possible labelings is linearly
separable, but no arrangement of 4 points can be shattered, because an XOR-style labeling
can never be separated by a single line.
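This can be checked empirically; the sketch below (the point sets and the use of a near-hard-margin linear SVM are assumptions for illustration) tries every labeling of a point set and reports whether all labelings are realizable:

from itertools import product
import numpy as np
from sklearn.svm import SVC

def can_shatter(points):
    # True if a linear classifier can realize every labeling of the points.
    for labels in product([0, 1], repeat=len(points)):
        if len(set(labels)) < 2:
            continue  # all-same labelings are trivially achievable
        clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
        clf.fit(points, list(labels))
        if not np.array_equal(clf.predict(points), labels):
            return False
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])             # non-collinear triple
four_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])  # XOR arrangement
print(can_shatter(three))     # True: 3 points can be shattered
print(can_shatter(four_xor))  # False: the XOR labeling is not separable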