
UNIT 1: CLASSIFICATION AND REGRESSION

INTRODUCTION TO CLASSIFICATION AND REGRESSION


Classification and regression are two primary tasks in supervised machine learning, where the key difference lies in the nature of the output: classification deals with discrete outcomes (e.g., yes/no, categories), while regression handles continuous values (e.g., price, temperature).

Both approaches require labeled data for training but differ in their objectives—
classification aims to find decision boundaries that separate classes, whereas regression focuses
on finding the best-fitting line to predict numerical outcomes. Understanding these distinctions
helps in selecting the right approach for specific machine learning tasks.

For example, classification can determine whether an email is spam or not, classify images as "cat" or "dog," or predict weather conditions like "sunny," "rainy," or "cloudy," using a decision boundary. Regression models, by contrast, are used to predict house prices based on features like size and location, or to forecast stock prices over time with a best-fit line.

Decision Boundary vs Best-Fit Line

When teaching the difference between classification and regression in machine learning, a key concept to focus on is the decision boundary (used in classification) versus the best-fit line (used in regression). These are fundamental tools that help models make predictions, but they serve distinctly different purposes.

1. Decision Boundary in Classification


It is a surface or line that separates data points into different classes in a feature space.
It can be linear (a straight line) or non-linear (a curve), depending on the complexity of the data
and the algorithm used. For example:

 A linear decision boundary might separate two classes in a 2D space with a straight line
(e.g., logistic regression).
 A more complex model may create non-linear boundaries to better fit intricate datasets.

During training, the classifier learns to partition the feature space by finding a boundary that minimizes classification errors.

 For binary classification, this boundary separates data points into two groups (e.g.,
spam vs. non-spam emails).
 In multi-class classification, multiple boundaries are created to separate more than two
classes.

The decision boundary is not inherent to the training data but rather depends on the classifier
used.
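As a quick illustration (not part of the original notes, using an assumed synthetic dataset), a logistic regression classifier learns a linear decision boundary that can be read off from its coefficients:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumed synthetic 2D, two-class dataset for illustration
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

clf = LogisticRegression().fit(X, y)

# The learned boundary is the line w0*x0 + w1*x1 + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
print(f"Decision boundary: {w[0]:.2f}*x0 + {w[1]:.2f}*x1 + {b:.2f} = 0")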

2. Best-Fit Line in Regression
In regression, a best-fit line (or regression line) represents the relationship between independent variables (inputs) and a dependent variable (output). It is used to predict continuous numerical values, capturing trends and relationships within the data to allow accurate predictions. The best-fit line can be linear or non-linear:

 A straight line is used for linear regression.


 Curves are used for more complex regressions, like polynomial regression.

A typical plot of this comparison shows both linear and polynomial models predicting continuous target values based on the input feature, in contrast to classification, which would create decision boundaries to separate discrete classes.
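As a minimal sketch (with an assumed synthetic dataset), the code below fits both a linear and a quadratic regression model to a single input feature, contrasting the straight best-fit line with a curved fit:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Assumed data with a non-linear trend
rng = np.random.RandomState(0)
x = np.linspace(0, 5, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + rng.normal(scale=0.5, size=30)

linear_model = LinearRegression().fit(x, y)               # straight best-fit line
x_poly = PolynomialFeatures(degree=2).fit_transform(x)    # adds a column for x^2
poly_model = LinearRegression().fit(x_poly, y)            # curved (quadratic) fit

print("Linear R^2:    ", linear_model.score(x, y))
print("Polynomial R^2:", poly_model.score(x_poly, y))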

Classification Algorithms

There are different types of classification algorithms that have been developed over
time to give the best results for classification tasks.

 Logistic Regression
 Decision Tree
 Random Forest
 K – Nearest Neighbors
 Support Vector Machine
 Naive Bayes

Regression Algorithms

There are different types of regression algorithms that have been developed over time
to give the best results for regression tasks.

 Lasso Regression
 Ridge Regression
 XGBoost Regressor
 LGBM Regressor

Comparison between Classification and Regression

Feature | Classification | Regression
Output type | Discrete categories (e.g., "spam" or "not spam"); the target variable is discrete. | Continuous numerical value (e.g., price, temperature).
Goal | To predict which category a data point belongs to. | To predict an exact numerical value based on input data.
Example problems | Email spam detection, image recognition, customer sentiment analysis. | House price prediction, stock market forecasting, sales prediction.
Evaluation metrics | Precision, Recall, and F1-Score. | Mean Squared Error, R2-Score, MAPE and RMSE.
Decision boundary | Clearly defined boundaries between different classes. | No distinct boundaries; focuses on finding the best-fit line.
Common algorithms | Logistic Regression, Decision Trees, Support Vector Machines (SVM). | Linear Regression, Polynomial Regression, Decision Trees (with regression objective).

BINARY CLASSIFICATION
A classification task can fall under one of these two categories:

 Binary classification, where the number of classes is two. For example, the email spam
classification that we saw earlier.

 Multi-class classification, where the number of classes is more than two. For example,
classifying eCommerce inquiry emails into three types: shipping, returns, or tracking.

Example:

Let’s stick with the spam email classification, where our task is to classify a list of
emails into one of two classes: Spam or Not Spam. We’ll represent Spam with the integer 1 (or
Positive) and Not Spam with 0 (or Negative).

Here we have a dataset containing 20 email titles. We put each data point through a
binary classifier to get the predicted class and then compare it with its actual class.

The classifier returns the following outcome:

ASSESSING CLASSIFICATION PERFORMANCE


1. Accuracy
The most straightforward way to measure a classifier’s performance is using the
Accuracy metric. Here, we compare the actual and predicted class of each data point, and each
match counts for one correct prediction.

Accuracy is then given as the number of correct predictions divided by the total number
of predictions. From the spam classifier output above, we have 15 correct predictions and 5
incorrect predictions, which gives us an Accuracy of 75%.

Accuracy is often used as the measure of classification performance because it is simple
to compute and easy to interpret. However, it can turn out to be misleading in some cases.

This is especially true when dealing with imbalanced data, a scenario when certain
classes contain way more data points than the others.

Let's go back to our dataset to understand this. Notice that if the classifier had not been
learning anything and was simply classifying all the outputs to be 0 (Not Spam), we would get
17 out of 20 correct classifications, which translates to a very high Accuracy of 85%! Clearly,
something isn’t right.

If you haven’t noticed yet, our dataset is indeed imbalanced. We have way more emails
that are not spam than emails that are spam.

The issue of imbalanced datasets is common in the real world, so there must be a better way to measure a classifier's performance than using Accuracy alone.

2. Confusion Matrix
The other three metrics can provide a more balanced view of a classifier’s true
performance. But before we can see them in action, we need to first understand the Confusion
Matrix.

The Confusion Matrix takes the classification results and groups them into four
categories:

 True Positive (TP): when both the actual and predicted values are 1.
 True Negative (TN): when both the actual and predicted values are 0.
 False Positive (FP): when the actual value is 0 but the predicted value is 1.
 False Negative (FN): when the actual value is 1 but the predicted value is 0.
Recall that in our case, we refer to the event we want to capture (1 - Spam) as Positive
and non-event (0 - Not Spam) as Negative.

The Confusion Matrix for binary classification is a 2-by-2 matrix, where each column
represents one class, as follows:
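One common layout (rows: actual classes, columns: predicted classes) is:

                         Predicted: 1 (Spam)    Predicted: 0 (Not Spam)
Actual: 1 (Spam)               TP                        FN
Actual: 0 (Not Spam)           FP                        TN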

Applied to our dataset, we get the following values:

 True Positive (TP): 1


 True Negative (TN): 14
 False Positive (FP): 3
 False Negative (FN): 2

We can populate these values in the Confusion Matrix, as follows:
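Using the same layout, the populated matrix is:

                         Predicted: 1 (Spam)    Predicted: 0 (Not Spam)
Actual: 1 (Spam)             TP = 1                    FN = 2
Actual: 0 (Not Spam)         FP = 3                    TN = 14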

We can also map the Confusion Matrix to the Accuracy formula that we saw earlier, as
follows:
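Accuracy = (TP + TN) / (TP + TN + FP + FN) = (1 + 14) / 20 = 75%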

We can now see via this matrix why Accuracy can sometimes hide the nuances of imbalanced datasets. The reason is that in these kinds of datasets, the True Negative category dominates, diluting the effect of the rest.

So even if the classifier were to perform poorly in the other three categories, its
Accuracy will still look good, masking its deficiencies.

3. Precision
Of the remaining three metrics that provide a more balanced view of a classifier's performance, the first is Precision. Precision is calculated as follows:
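Precision = TP / (TP + FP) = 1 / (1 + 3) = 25%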

Notice what just happened? Now, the True Negatives are not even part of the
calculation. Precision focuses on the True Positives and False Positives, therefore providing a

representation that may be missed when relying on Accuracy alone. Whereas Accuracy looked impressive at 75% earlier, we now see that Precision is far lower, at only 25%.

4. Recall
Recall uses the same principle as Precision, except the focus is now on the False
Negatives instead of the False Positives. Again, the True Negatives are not part of the
consideration. Recall is calculated as follows:
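Recall = TP / (TP + FN) = 1 / (1 + 2) ≈ 33%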

Between Precision and Recall though, there is a tradeoff. It is hard to optimize for both
simultaneously as optimizing for the False Positives (thereby improving Precision) comes at
the expense of the False Negatives (thereby deteriorating Recall), and vice versa.

Which then brings the question: which metric should you prioritize—Precision or
Recall?

The answer is that it depends on the nature of your task. Let's see why.

Suppose the spam classifier achieved high Precision and low Recall (Scenario A). This
would result in fewer non-spam emails flagged as spam (False Positive). But this would also
mean more of the actual spam emails went undetected (False Negative).

Conversely, if the classifier achieved high Recall and low Precision (Scenario B), there
would be fewer undetected spam emails (False Negative), but it comes at the expense of more
non-spam emails being flagged as spam (False Positive).

For a spam classification task, it’s probably more desirable to avoid important emails
being moved into the spam folder than to have the occasional spam emails going into the inbox.
So for this task, we will want to prioritize Precision over Recall.

5. F1
What if both Precision and Recall are important to you and you need the classifier to do well in both? The answer is to use the final metric of the four: F1.

F1 takes into consideration both Precision and Recall. It is calculated as follows:
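F1 = 2 × (Precision × Recall) / (Precision + Recall) ≈ 2 × (0.25 × 0.33) / (0.25 + 0.33) ≈ 0.29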

F1 provides a balance between Precision and Recall. There are other versions in the 'F-score' family, for example ones that assign a bigger weight to either Precision or Recall, but F1 is a good enough option to get started.
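As a quick sketch of how these four metrics are computed in practice (the label arrays below are assumed values chosen to reproduce the example's counts TP = 1, TN = 14, FP = 3, FN = 2), scikit-learn provides ready-made functions:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Assumed labels reproducing the example: 20 emails, 3 actual spam (1), 17 not spam (0)
y_true = [1, 1, 1] + [0] * 17
y_pred = [1, 0, 0] + [1, 1, 1] + [0] * 14    # gives TP=1, FN=2, FP=3, TN=14

print(confusion_matrix(y_true, y_pred))              # [[TN FP], [FN TP]] = [[14 3], [2 1]]
print("Accuracy :", accuracy_score(y_true, y_pred))  # 0.75
print("Precision:", precision_score(y_true, y_pred)) # 0.25
print("Recall   :", recall_score(y_true, y_pred))    # ~0.33
print("F1       :", f1_score(y_true, y_pred))        # ~0.29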

CLASS PROBABILITY ESTIMATION AND ASSESSING CLASS PROBABILITY ESTIMATES
Class Probability Estimation in Machine Learning refers to the process of predicting
the probability that a given data instance belongs to a particular class. This is essential in tasks
where understanding the confidence of predictions is as important as the prediction itself.

Why Assess Class Probability Estimates?

1. Decision Making: Helps in risk-sensitive decisions (e.g., medical diagnosis, fraud detection).

2. Threshold Tuning: Probabilities enable flexible decision thresholds for different applications.

3. Uncertainty Handling: Provides insights into model confidence.

Key Techniques for Class Probability Estimation

1. Logistic Regression

o Directly outputs class probabilities.

o Well-calibrated by design.

2. Naive Bayes

o Computes probabilities using Bayes' theorem.

o Often requires calibration for improved accuracy.

3. k-Nearest Neighbors (k-NN)

o Probability is estimated by the proportion of points in the neighborhood belonging to each class.

4. Decision Trees

o Leaf node proportions provide class probabilities.

o May require calibration for better reliability.

5. Random Forest

o Aggregates tree probabilities to improve stability and accuracy.

6. Neural Networks

o Softmax activation in the output layer estimates class probabilities.
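Most scikit-learn classifiers expose these estimates through predict_proba. A minimal sketch (with an assumed synthetic dataset):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)    # assumed toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)    # column 0: P(class 0), column 1: P(class 1)
print(proba[:5])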

Assessment of Probability Estimates

To evaluate how well these probabilities reflect real-world outcomes, consider these
metrics:

1. Brier Score

o Measures the mean squared error between predicted probabilities and actual
outcomes.

o Lower values indicate better probability estimation.

2. Log Loss (Cross-Entropy Loss)

o Penalizes confident but incorrect predictions more severely.

3. Calibration Plot (Reliability Diagram)

o Visual comparison of predicted probabilities vs. actual outcomes.

o Ideal calibration shows points lying on the diagonal.

4. Expected Calibration Error (ECE)

o Measures the gap between predicted probabilities and observed frequencies.
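A hedged sketch (again with assumed synthetic data) of how the first three of these can be computed with scikit-learn:

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=1)    # assumed toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p = clf.predict_proba(X_test)[:, 1]        # predicted probability of class 1

print("Brier score:", brier_score_loss(y_test, p))                # lower is better
print("Log loss   :", log_loss(y_test, clf.predict_proba(X_test)))

# Reliability diagram data: ideally frac_pos ≈ mean_pred in every bin
frac_pos, mean_pred = calibration_curve(y_test, p, n_bins=10)
print(np.round(frac_pos, 2))
print(np.round(mean_pred, 2))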

REGRESSION: ASSESSING PERFORMANCE OF REGRESSION - ERROR MEASURES
Regression algorithms are used to predict continuous numerical values based on input features. In scikit-learn, we can use various regression algorithms, such as Linear Regression, Decision Trees, Random Forests, and Support Vector Machines (SVM), among others.

1. True Values and Predicted Values:

In regression, we have two sets of values to compare: the actual target values (true values) and the values predicted by our model (predicted values). The performance of the model is assessed by measuring how close the predictions are to the true values.

2. Evaluation Metrics:

Regression metrics are quantitative measures used to evaluate the quality of a regression model. Scikit-learn provides several metrics, each with its own strengths and limitations, to assess how well a model fits the data.

TYPES OF REGRESSION METRICS


Some common regression metrics in scikit-learn are:

1. Mean Absolute Error (MAE)


2. Mean Squared Error (MSE)
3. R-squared (R²) Score
4. Root Mean Squared Error (RMSE)

1. Mean Absolute Error (MAE)


In the fields of statistics and machine learning, MAE is a frequently used metric. It measures the average absolute difference between a dataset's actual values and predicted values.

Mathematical Formula

The formula to calculate MAE for a data with "n" data points is:
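MAE = (1/n) Σ |xi − yi|, where the sum runs over i = 1, …, n.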

Where:
 xi represents the actual or observed values for the i-th data point.
 yi represents the predicted value for the i-th data point.
2. Mean Squared Error (MSE)
The Mean Squared Error (MSE) is a popular metric in statistics and machine learning. It measures the average of the squared differences between a dataset's actual values and predicted values. MSE is frequently used in regression problems to assess how well predictive models work.

Mathematical Formula

For a dataset containing 'n' data points, the MSE calculation formula is:
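MSE = (1/n) Σ (xi − yi)²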

where:

 xi represents the actual or observed value for the i-th data point.
 yi represents the predicted value for the i-th data point.
3. R-squared (R²) Score
A statistical metric frequently used to assess the goodness of fit of a regression model is the R² score, also referred to as the coefficient of determination. It quantifies the proportion of the variance in the dependent variable that is explained by the model's independent variables. R² is a useful statistic for evaluating the overall effectiveness and explanatory power of a regression model.

Mathematical Formula

The formula to calculate the R-squared score is as follows:
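R² = 1 − (SSR / SST)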

Where:

 R2 is the R-Squared.

 SSR represents the sum of squared residuals between the predicted values and actual
values.
 SST represents the total sum of squares, which measures the total variance in the
dependent variable.
4. Root Mean Squared Error (RMSE)
RMSE stands for Root Mean Squared Error. It is a commonly used metric in regression analysis and machine learning to measure the accuracy or goodness of fit of a predictive model, especially when the predictions are continuous numerical values.

The RMSE quantifies how well the predicted values from a model align with the actual
observed values in the dataset. Here's how it works:

1. Calculate the Squared Differences: For each data point, subtract the predicted value
from the actual (observed) value, square the result, and sum up these squared
differences.
2. Compute the Mean: Divide the sum of squared differences by the number of data
points to get the mean squared error (MSE).
3. Take the Square Root: To obtain the RMSE, simply take the square root of the MSE.

Mathematical Formula

The formula for RMSE for a data with 'n' data points is as follows:
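RMSE = √MSE = √[ (1/n) Σ (xi − yi)² ]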

Where:

 RMSE is the Root Mean Squared Error.


 xi represents the actual or observed value for the i-th data point.
 yi represents the predicted value for the i-th data point.
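A short sketch (with made-up actual and predicted values) showing how these four metrics can be computed with scikit-learn and NumPy:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_actual = np.array([2.3, 4.1, 6.0, 8.2, 10.5])   # assumed observed values
y_pred   = np.array([2.0, 4.5, 5.8, 8.0, 10.9])   # assumed model predictions

mae  = mean_absolute_error(y_actual, y_pred)
mse  = mean_squared_error(y_actual, y_pred)
rmse = np.sqrt(mse)                                # RMSE is the square root of MSE
r2   = r2_score(y_actual, y_pred)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")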

OVERFITTING – CATALYSTS FOR OVERFITTING


Overfitting in Machine Learning
Overfitting happens when a model learns too much from the training data, including
details that don’t matter (like noise or outliers).
For example, imagine fitting a very complicated curve to a set of points. The curve will
go through every point, but it won’t represent the actual pattern.

As a result, the model works great on training data but fails when tested on new data.

Overfitting models are like students who memorize answers instead of understanding
the topic. They do well in practice tests (training) but struggle in real exams (testing).

Reasons for Overfitting:

 High variance and low bias.


 The model is too complex.
 The training dataset is too small.
Underfitting in Machine Learning

Underfitting is the opposite of overfitting. It happens when a model is too simple to capture what's going on in the data.

For example, imagine drawing a straight line to fit points that actually follow a curve.
The line misses most of the pattern.

In this case, the model doesn’t work well on either the training or testing data.
Underfitting models are like students who don’t study enough. They don’t do well in practice
tests or real exams.

Reasons for Underfitting:


1. The model is too simple, so it may not be able to represent the complexities in the data.
2. The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
3. The training dataset is too small.
4. Excessive regularization is used to prevent overfitting, which constrains the model from capturing the data well.
5. Features are not scaled.
Note:
 Underfitting: A straight line trying to fit a curved dataset cannot capture the data's patterns, leading to poor performance on both training and test sets.

 Overfitting: A squiggly curve passing through all training points, failing to generalize; it performs well on training data but poorly on test data.
 Appropriate Fitting: A curve that follows the data trend without overcomplicating, capturing the true patterns in the data. (A short sketch illustrating all three cases follows.)
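A hedged sketch of all three cases (assumed noisy quadratic data; degree 1 underfits, degree 2 fits appropriately, and a high degree tends to overfit), comparing training and test scores:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=60)   # assumed noisy quadratic data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 2, 10):   # underfit, appropriate fit, likely overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          "train R^2:", round(model.score(X_train, y_train), 3),
          "test R^2:", round(model.score(X_test, y_test), 3))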
CASE STUDY OF POLYNOMIAL REGRESSION
Introduction
Polynomial Regression is a type of regression analysis where the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. While linear regression assumes a straight-line relationship between variables, polynomial regression can capture more complex, non-linear relationships. This case study explores how polynomial regression can be applied to predict real-world data, providing insights into its usage, benefits, and limitations.

Problem Statement
We are tasked with predicting the growth of a plant based on the number of days since
planting. We have historical data that includes the number of days and corresponding plant
heights, and we wish to model the plant growth using polynomial regression.

Data Collection
We have collected the following dataset:

Days Since Planting (X) Plant Height (Y)


1 2.3
2 4.1
3 6.0
4 8.2
5 10.5
6 12.8
7 14.7
8 17.1
9 19.0
10 20.1
Steps in the Case Study
Step 1: Exploratory Data Analysis (EDA)
Before applying polynomial regression, we conduct an exploratory data analysis to
understand the structure of the data.

 Plotting the Data: The data appears to have a quadratic pattern, meaning it follows a
curve rather than a straight line. We can visualize this by plotting the data points on a
scatter plot.

Step 2: Model Selection and Setup


We want to fit a polynomial regression model to this data. We decide to start by fitting
a second-degree polynomial (quadratic), but we also consider trying higher-degree polynomials
to see if they improve model performance.

The polynomial regression model is represented by:
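y = β0 + β1·x + β2·x² + … + βn·xⁿ + ε, where β0 … βn are coefficients and ε is the error term.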

Where x is the independent variable (Days Since Planting) and y is the dependent
variable (Plant Height).

Step 3: Preprocessing the Data


Before applying polynomial regression, we need to transform the input data into the
form that the model can work with.

 Feature Transformation: To apply polynomial regression, we need to create polynomial features from the input data. For a quadratic model, this involves adding x² as a feature along with x. For higher-degree models, we create features for x³, x⁴, etc.

In Python, this can be done using the PolynomialFeatures class from the sklearn.preprocessing
module.
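For example (a sketch using the day values from the table above):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

days = np.arange(1, 11).reshape(-1, 1)    # Days Since Planting, 1..10
X_quad = PolynomialFeatures(degree=2, include_bias=False).fit_transform(days)
print(X_quad[:3])    # each row is [x, x^2]: [1, 1], [2, 4], [3, 9]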

Step 4: Model Fitting


We apply the polynomial regression using the transformed features and fit the model.
The steps involved are:

1. Transform the original feature (Days Since Planting) into polynomial features.
2. Fit a linear regression model on the polynomial features.
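A minimal sketch of this fitting step, reusing the plant-growth data from the case study:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

days    = np.arange(1, 11).reshape(-1, 1)
heights = np.array([2.3, 4.1, 6.0, 8.2, 10.5, 12.8, 14.7, 17.1, 19.0, 20.1])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_quad = poly.fit_transform(days)                 # step 1: polynomial features
model = LinearRegression().fit(X_quad, heights)   # step 2: linear fit on those features

print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
print("Predicted height on day 11:", model.predict(poly.transform([[11]])))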

Step 5: Evaluation and Model Testing


To assess the performance of the polynomial regression model, we split the data into
training and test sets. We then evaluate the model using metrics like the Mean Squared Error
(MSE) and R-squared.

We also compare the results of fitting models with different polynomial degrees (e.g.,
2nd, 3rd, and 4th degrees) to see which one performs best.
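A sketch of this comparison (with an assumed train/test split; the dataset is very small, so the numbers are only illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

days    = np.arange(1, 11).reshape(-1, 1)
heights = np.array([2.3, 4.1, 6.0, 8.2, 10.5, 12.8, 14.7, 17.1, 19.0, 20.1])
X_train, X_test, y_train, y_test = train_test_split(days, heights,
                                                    test_size=0.3, random_state=0)

for degree in (2, 3, 4):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"degree {degree}: MSE={mean_squared_error(y_test, pred):.3f}, "
          f"R2={r2_score(y_test, pred):.3f}")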

Step 6: Model Results


After fitting the models, we plot the predicted values against the actual data points. This
allows us to visually assess how well the model fits the data. The model’s predictions should
closely follow the curve of the actual data points.

For example, if a second-degree polynomial provides a good fit, the curve will closely
follow the quadratic shape of the data. If higher-degree polynomials are overfitting, the curve
will start to oscillate unnecessarily.

Step 7: Conclusion

 Model Choice: Based on the evaluation metrics, we choose the polynomial regression
model that provides the best fit. In this case, the second-degree polynomial model may
be sufficient to accurately model the data, as the relationship between the variables
appears quadratic.
 Overfitting: Higher-degree polynomials may lead to overfitting, where the model
performs well on training data but poorly on unseen data.
 Interpretability: Polynomial regression provides a way to model non-linear
relationships, but the complexity of the polynomial increases with higher degrees,
making it harder to interpret.

Benefits of Polynomial Regression

1. Flexibility: Polynomial regression is highly flexible and can model a wide range of
non-linear relationships.

2. Simple to Implement: It is relatively easy to implement using libraries like Scikit-learn, making it a good option for small to medium-sized datasets.
3. Improved Prediction: Polynomial regression can improve prediction accuracy
compared to linear regression when the underlying data has a non-linear relationship.

Limitations of Polynomial Regression

1. Overfitting: Higher-degree polynomials can lead to overfitting, where the model learns
the noise in the training data instead of the underlying relationship.
2. Interpretability: As the degree of the polynomial increases, the model becomes more
difficult to interpret.
3. Computational Complexity: Higher-degree models can be computationally
expensive, especially with large datasets.

Conclusion
Polynomial regression is a powerful tool for modeling non-linear relationships. In this
case study, we showed how polynomial regression can be applied to predict plant growth over
time. While polynomial regression provides an accurate model for many real-world problems,
careful consideration must be given to the degree of the polynomial used to avoid overfitting
and maintain model interpretability.

Polynomial regression can be an effective choice in scenarios where the relationship between variables is inherently non-linear, but it is essential to balance model complexity and performance.

THEORY OF GENERALIZATION: EFFECTIVE NUMBER OF HYPOTHESES
Generalization refers to the ability of a machine learning model to perform well on
unseen, out-of-sample data. In essence, a model is said to generalize well if it can make accurate
predictions on new data points that were not part of the training set. The concept of
generalization is central to the success of machine learning algorithms, as it ensures that the
model doesn’t just memorize the training data (overfitting), but instead learns the underlying
patterns that can be applied to new, unseen examples.

One way to understand generalization is through the concept of the effective number
of hypotheses. This concept stems from statistical learning theory and helps us understand how
the complexity of a model affects its ability to generalize.

Key Concepts

1. Hypothesis Space: The hypothesis space refers to the set of all possible models or
functions that could be considered by the learning algorithm to fit the data. In machine
learning, we typically search for a function within this space that minimizes some form
of error (like mean squared error in regression or cross-entropy in classification).
2. Effective Number of Hypotheses: The effective number of hypotheses refers to the
"effective" size of the hypothesis space that a model is able to select from based on the
data. It is often smaller than the total number of hypotheses available in the hypothesis
space because not all hypotheses are consistent with the data. In other words, this
number indicates how many hypotheses are effectively being considered when training
the model, based on the constraints imposed by the training data.
o The idea behind the effective number of hypotheses is that a smaller effective
number of hypotheses leads to better generalization. The more hypotheses there
are in the hypothesis space that fit the training data, the higher the risk of
overfitting to noise in the data and the poorer the model's ability to generalize.

Understanding the Effective Number of Hypotheses


To intuitively understand this, let’s break down a few points:

 Overfitting: If a model has access to a very large hypothesis space, it may be able to
perfectly fit the training data, including any noise. This results in overfitting, where the
model has too many hypotheses that fit the training data exactly, but fails to generalize
to new data. In this case, the effective number of hypotheses is large, but the model's
generalization ability is poor.
 Underfitting: If the hypothesis space is too small, the model may not be able to capture
the underlying patterns in the data. It may underfit, resulting in poor performance both
on the training data and on unseen data. Here, the effective number of hypotheses is
small, which might hinder the model's ability to find the true underlying relationships.
 Model Complexity: The complexity of the model determines the size of the hypothesis
space. For example, a linear regression model (with a single linear function) has a
relatively small hypothesis space compared to a deep neural network with many layers
and parameters. The hypothesis space size is proportional to the flexibility of the
model—more parameters lead to more hypotheses being possible. However, too many
parameters can lead to overfitting if the model is not regularized or controlled.

The Role of Regularization


Regularization techniques such as L2 regularization (Ridge), L1 regularization
(Lasso), and Dropout in neural networks aim to control the hypothesis space by penalizing
overly complex models. Regularization reduces the number of hypotheses that the model can
effectively select, thereby limiting overfitting and improving generalization.

 L2 regularization (Ridge) penalizes large weights, which reduces the capacity of the
model and shrinks the effective number of hypotheses.
 L1 regularization (Lasso) can lead to sparse solutions, effectively reducing the number
of active features (hypotheses).
 Dropout in neural networks forces the network to learn redundant representations by
randomly dropping units during training, reducing the effective number of hypotheses.
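As a brief hedged sketch (assumed synthetic data), increasing the regularization strength alpha in Ridge shrinks the coefficients, while Lasso drives some of them exactly to zero, effectively reducing the hypotheses the model can use:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)    # assumed toy data

for alpha in (0.01, 1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>6}: sum|ridge coefs|={np.abs(ridge.coef_).sum():.1f}, "
          f"lasso non-zero coefs={np.count_nonzero(lasso.coef_)}")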

The Effective Number of Hypotheses in Practice


To practically understand how the effective number of hypotheses relates to
generalization, consider the following simplified analogy:

 Imagine you are trying to fit a curve to a set of data points. If you use a very flexible
curve (like a high-degree polynomial), the number of ways you can fit the curve to the
data increases significantly, leading to many possible hypotheses. This makes it more
likely that your model fits noise in the training data and does not generalize well to
unseen data.
 On the other hand, if you use a simple linear model, there are fewer possible curves that
can fit the data. In this case, the effective number of hypotheses is smaller, and the
model may generalize better because it is less likely to overfit to noise.

The balance between model complexity (size of the hypothesis space) and the effective
number of hypotheses that the model actually uses is key to achieving good generalization.

Theoretical Formulation of the Effective Number of Hypotheses
In a more formal sense, the effective number of hypotheses can be related to the VC
dimension (Vapnik-Chervonenkis dimension) of the hypothesis space. The VC dimension is a
measure of the capacity or complexity of a model, i.e., how well it can fit different types of
data. A model with a higher VC dimension has a larger hypothesis space, meaning it has more
hypotheses to choose from and is more prone to overfitting.

The effective number of hypotheses is related to the degrees of freedom of the model.
In the context of linear regression with regularization, it can be approximated as:
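df(λ) = trace( X (XᵀX + λI)⁻¹ Xᵀ )
(the standard effective-degrees-of-freedom expression for ridge-regularized linear regression, where λ is the regularization strength and I is the identity matrix)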

Where X is the design matrix of features. This formula provides a way of estimating
how much flexibility the model has based on the input features and their interactions.

The theory of the effective number of hypotheses in generalization highlights the delicate balance between model complexity and the ability of a model to generalize well to unseen data.
A model that has a large hypothesis space but is poorly regularized may overfit the training
data, leading to poor generalization. On the other hand, a model with a small effective
hypothesis space may underfit the data and fail to capture important patterns.

Understanding and controlling the effective number of hypotheses is crucial for building models that generalize well, and it is an essential part of the theoretical foundation of
machine learning. Regularization techniques play a pivotal role in limiting the number of
hypotheses a model can effectively use, ensuring that it learns the most relevant patterns while
avoiding overfitting.

BOUNDING THE GROWTH FUNCTION


The effective number of hypotheses refers to the "effective" size of the hypothesis space
that a model can choose from based on the training data. A hypothesis space represents all
possible models a learning algorithm might consider. The effective number of hypotheses is
smaller than the total number of hypotheses, as not all of them are consistent with the data.

 Generalization is about a model's ability to perform well on unseen data, and the
effective number of hypotheses impacts generalization.

 Overfitting occurs when a model has too many possible hypotheses that fit the training
data too closely, leading to poor performance on new data.
 Underfitting happens when the hypothesis space is too small, and the model fails to capture the underlying patterns in the data.

Regularization techniques like L1 and L2 regularization, and Dropout in neural networks, help control the effective number of hypotheses by penalizing complex models, reducing overfitting, and improving generalization.

In simple terms, the effective number of hypotheses determines how many possible
models are effectively considered by the learning algorithm based on the data, and controlling
this number is key to balancing model complexity and generalization.

VC DIMENSION (VAPNIK-CHERVONENKIS DIMENSION)


The Vapnik-Chervonenkis (VC) dimension is a fundamental concept in statistical
learning theory. It measures the capacity or flexibility of a hypothesis space — that is, how
well a model can fit different sets of data. Specifically, the VC dimension quantifies the
maximum number of data points that a model can shatter (or perfectly classify) in all possible
ways.

Key Concepts

1. Hypothesis Space: This refers to the set of all possible models or functions that a
learning algorithm can choose from. Each model in the hypothesis space represents a
different way of mapping inputs to outputs.
2. Shattering: A model is said to shatter a set of points if, for every possible labeling of
those points (i.e., all combinations of positive/negative labels in classification), there
exists at least one hypothesis in the hypothesis space that perfectly classifies those
points according to the given labels.
3. VC Dimension: The VC dimension of a model or hypothesis space is defined as the
largest set of points that can be shattered by the model. It represents the model’s ability
to adapt to different datasets.
o If a model can shatter k points, then its VC dimension is at least k.
o If no set of k+1 points can be shattered, then the VC dimension is at most k.

Importance of VC Dimension in Generalization
The VC dimension plays a critical role in understanding the generalization ability of
a machine learning model:

 High VC Dimension: A model with a high VC dimension has more capacity to fit
complex patterns in data, which means it can potentially overfit the training data. This
leads to poor generalization on unseen data.
 Low VC Dimension: A model with a low VC dimension has limited capacity to fit data,
which may result in underfitting. It might not be flexible enough to capture the true
underlying patterns in the data.

VC Dimension and Model Complexity


The VC dimension is related to the complexity of a model. More complex models (e.g.,
deep neural networks, high-degree polynomials) have higher VC dimensions, meaning they
have a larger capacity to fit diverse data sets, but they also have a higher risk of overfitting.
Simpler models (e.g., linear classifiers) have lower VC dimensions and are less prone to
overfitting, but they might underfit if the true relationship is more complex.

Examples

1. Linear Classifiers: The VC dimension of a linear classifier in a 2D space is 3. This means there is a set of 3 points (in general position) that a linear classifier can shatter, but no set of 4 points in a 2D space can be shattered.
2. Decision Trees: The VC dimension of a decision tree depends on its depth. A deeper
tree can shatter more data points (higher VC dimension), but it also risks overfitting.

VC Dimension and Sample Size


The VC dimension is also used in deriving bounds for the generalization error. The
generalization error refers to how well a model’s performance on training data matches its
performance on unseen data.

 Upper Bound: The VC dimension helps to establish an upper bound on the generalization error. If a model has a high VC dimension, the model can fit more complex patterns, but we need more data to ensure that it generalizes well.
 Sample Complexity: A model with a higher VC dimension typically requires a larger
sample size to learn effectively and avoid overfitting.
The VC dimension is an essential concept for understanding the complexity of machine
learning models and their potential for overfitting or underfitting. It helps balance model
capacity and generalization ability, guiding the selection of appropriate models based on the
amount of available data. Models with a high VC dimension are more expressive but can
overfit, while models with a low VC dimension might underfit if they are too simple to capture
the underlying patterns.

