
UNIT 1: CLASSIFICATION AND REGRESSION

INTRODUCTION TO CLASSIFICATION AND REGRESSION


Classification and regression are two primary tasks in supervised machine learning, where the key difference lies in the nature of the output: classification deals with discrete outcomes (e.g., yes/no, categories), while regression handles continuous values (e.g., price, temperature).

Both approaches require labeled data for training but differ in their objectives—
classification aims to find decision boundaries that separate classes, whereas regression focuses
on finding the best-fitting line to predict numerical outcomes. Understanding these distinctions
helps in selecting the right approach for specific machine learning tasks.

For example, classification can determine whether an email is spam or not, classify images as "cat" or "dog," or predict weather conditions like "sunny," "rainy," or "cloudy," using a decision boundary. Regression models, by contrast, are used to predict house prices based on features like size and location, or to forecast stock prices over time with a best-fit line.

Decision Boundary vs Best-Fit Line

When teaching the difference between classification and regression in machine learning, a key concept to focus on is the decision boundary (used in classification) versus the best-fit line (used in regression). These are fundamental tools that help models make predictions, but they serve distinctly different purposes.

1. Decision Boundary in Classification


It is a surface or line that separates data points into different classes in a feature space.
It can be linear (a straight line) or non-linear (a curve), depending on the complexity of the data
and the algorithm used. For example:

 A linear decision boundary might separate two classes in a 2D space with a straight line
(e.g., logistic regression).
 A more complex model may create non-linear boundaries to better fit intricate datasets.

During training, the classifier learns to partition the feature space by finding a boundary that minimizes classification errors.

 For binary classification, this boundary separates data points into two groups (e.g.,
spam vs. non-spam emails).
 In multi-class classification, multiple boundaries are created to separate more than two
classes.

The decision boundary is not inherent to the training data but rather depends on the classifier
used.
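As a quick illustration (not part of the original notes, using an assumed synthetic dataset), a logistic regression classifier learns a linear decision boundary that can be read off from its coefficients:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumed synthetic 2D, two-class dataset for illustration
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

clf = LogisticRegression().fit(X, y)

# The learned boundary is the line w0*x0 + w1*x1 + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
print(f"Decision boundary: {w[0]:.2f}*x0 + {w[1]:.2f}*x1 + {b:.2f} = 0")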

2. Best-Fit Line in Regression
In regression, a best-fit line (or regression line) represents the relationship between independent variables (inputs) and a dependent variable (output). It is used to predict continuous numerical values, capturing trends and relationships within the data to allow accurate predictions. The best-fit line can be linear or non-linear:

 A straight line is used for linear regression.


 Curves are used for more complex regressions, like polynomial regression.

A typical plot of this comparison shows both linear and polynomial models predicting continuous target values based on the input feature, in contrast to classification, which would create decision boundaries to separate discrete classes.
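As a minimal sketch (with an assumed synthetic dataset), the code below fits both a linear and a quadratic regression model to a single input feature, contrasting the straight best-fit line with a curved fit:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Assumed data with a non-linear trend
rng = np.random.RandomState(0)
x = np.linspace(0, 5, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + rng.normal(scale=0.5, size=30)

linear_model = LinearRegression().fit(x, y)               # straight best-fit line
x_poly = PolynomialFeatures(degree=2).fit_transform(x)    # adds a column for x^2
poly_model = LinearRegression().fit(x_poly, y)            # curved (quadratic) fit

print("Linear R^2:    ", linear_model.score(x, y))
print("Polynomial R^2:", poly_model.score(x_poly, y))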

Classification Algorithms

There are different types of classification algorithms that have been developed over
time to give the best results for classification tasks.

 Logistic Regression
 Decision Tree
 Random Forest
 K – Nearest Neighbors
 Support Vector Machine
 Naive Bayes

Regression Algorithms

There are different types of regression algorithms that have been developed over time
to give the best results for regression tasks.

 Lasso Regression
 Ridge Regression
 XGBoost Regressor
 LGBM Regressor

Comparison between Classification and Regression

Feature | Classification | Regression
Output type | Discrete categories (e.g., "spam" or "not spam"); the target variable is discrete. | Continuous numerical value (e.g., price, temperature).
Goal | To predict which category a data point belongs to. | To predict an exact numerical value based on input data.
Example problems | Email spam detection, image recognition, customer sentiment analysis. | House price prediction, stock market forecasting, sales prediction.
Evaluation metrics | Precision, Recall, and F1-Score. | Mean Squared Error, R2-Score, MAPE and RMSE.
Decision boundary | Clearly defined boundaries between different classes. | No distinct boundaries; focuses on finding the best-fit line.
Common algorithms | Logistic Regression, Decision Trees, Support Vector Machines (SVM). | Linear Regression, Polynomial Regression, Decision Trees (with regression objective).

BINARY CLASSIFICATION
A classification task can fall under one of these two categories:

 Binary classification, where the number of classes is two. For example, the email spam
classification that we saw earlier.

 Multi-class classification, where the number of classes is more than two. For example,
classifying eCommerce inquiry emails into three types: shipping, returns, or tracking.

Example:

Let’s stick with the spam email classification, where our task is to classify a list of
emails into one of two classes: Spam or Not Spam. We’ll represent Spam with the integer 1 (or
Positive) and Not Spam with 0 (or Negative).

Here we have a dataset containing 20 email titles. We put each data point through a
binary classifier to get the predicted class and then compare it with its actual class.

The classifier returns the following outcome:

ASSESSING CLASSIFICATION PERFORMANCE


1. Accuracy
The most straightforward way to measure a classifier’s performance is using the
Accuracy metric. Here, we compare the actual and predicted class of each data point, and each
match counts for one correct prediction.

Accuracy is then given as the number of correct predictions divided by the total number
of predictions. From the spam classifier output above, we have 15 correct predictions and 5
incorrect predictions, which gives us an Accuracy of 75%.

Accuracy is often used as the measure of classification performance because it is simple
to compute and easy to interpret. However, it can turn out to be misleading in some cases.

This is especially true when dealing with imbalanced data, a scenario when certain
classes contain way more data points than the others.

Let's go back to our dataset to understand this. Notice that if the classifier had not been
learning anything and was simply classifying all the outputs to be 0 (Not Spam), we would get
17 out of 20 correct classifications, which translates to a very high Accuracy of 85%! Clearly,
something isn’t right.

If you haven’t noticed yet, our dataset is indeed imbalanced. We have way more emails
that are not spam than emails that are spam.

The issue of imbalanced datasets is common in the real world, so there must be a better way to measure a classifier's performance than using Accuracy alone.

2. Confusion Matrix
The other three metrics can provide a more balanced view of a classifier’s true
performance. But before we can see them in action, we need to first understand the Confusion
Matrix.

The Confusion Matrix takes the classification results and groups them into four
categories:

 True Positive (TP): when both the actual and predicted values are 1.
 True Negative (TN): when both the actual and predicted values are 0.
 False Positive (FP): when the actual value is 0 but the predicted value is 1.
 False Negative (FN): when the actual value is 1 but the predicted value is 0.
Recall that in our case, we refer to the event we want to capture (1 - Spam) as Positive
and non-event (0 - Not Spam) as Negative.

The Confusion Matrix for binary classification is a 2-by-2 matrix, where each column
represents one class, as follows:
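One common layout (rows: actual classes, columns: predicted classes) is:

                         Predicted: 1 (Spam)    Predicted: 0 (Not Spam)
Actual: 1 (Spam)               TP                        FN
Actual: 0 (Not Spam)           FP                        TN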

Applied to our dataset, we get the following values:

 True Positive (TP): 1


 True Negative (TN): 14
 False Positive (FP): 3
 False Negative (FN): 2

We can populate these values in the Confusion Matrix, as follows:
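Using the same layout, the populated matrix is:

                         Predicted: 1 (Spam)    Predicted: 0 (Not Spam)
Actual: 1 (Spam)             TP = 1                    FN = 2
Actual: 0 (Not Spam)         FP = 3                    TN = 14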

We can also map the Confusion Matrix to the Accuracy formula that we saw earlier, as
follows:
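Accuracy = (TP + TN) / (TP + TN + FP + FN) = (1 + 14) / 20 = 75%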

We can now see via this matrix why Accuracy can sometimes hide the nuances of imbalanced datasets. The reason is that in these kinds of datasets, the True Negative category dominates, diluting the effect of the rest.

So even if the classifier were to perform poorly in the other three categories, its
Accuracy will still look good, masking its deficiencies.

3. Precision
Of the remaining three metrics that provide a more balanced view of a classifier's performance, the first is Precision. Precision is calculated as follows:
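Precision = TP / (TP + FP) = 1 / (1 + 3) = 25%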

Notice what just happened? Now, the True Negatives are not even part of the
calculation. Precision focuses on the True Positives and False Positives, therefore providing a

representation that may be missed when relying on Accuracy alone. Whereas Accuracy looked impressive at 75% earlier, we now see that Precision is far lower, at only 25%.

4. Recall
Recall uses the same principle as Precision, except the focus is now on the False
Negatives instead of the False Positives. Again, the True Negatives are not part of the
consideration. Recall is calculated as follows:
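Recall = TP / (TP + FN) = 1 / (1 + 2) ≈ 33%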

Between Precision and Recall though, there is a tradeoff. It is hard to optimize for both
simultaneously as optimizing for the False Positives (thereby improving Precision) comes at
the expense of the False Negatives (thereby deteriorating Recall), and vice versa.

Which then brings the question: which metric should you prioritize—Precision or
Recall?

The answer is that it depends on the nature of your task. Let's see why.

Suppose the spam classifier achieved high Precision and low Recall (Scenario A). This
would result in fewer non-spam emails flagged as spam (False Positive). But this would also
mean more of the actual spam emails went undetected (False Negative).

Conversely, if the classifier achieved high Recall and low Precision (Scenario B), there
would be fewer undetected spam emails (False Negative), but it comes at the expense of more
non-spam emails being flagged as spam (False Positive).

For a spam classification task, it’s probably more desirable to avoid important emails
being moved into the spam folder than to have the occasional spam emails going into the inbox.
So for this task, we will want to prioritize Precision over Recall.

5. F1
What if both Precision and Recall are important to you and you need the classifier to do well in both? The answer is to use the final metric of the four: F1.

F1 takes into consideration both Precision and Recall. It is calculated as follows:
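F1 = 2 × (Precision × Recall) / (Precision + Recall) ≈ 2 × (0.25 × 0.33) / (0.25 + 0.33) ≈ 0.29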

F1 provides a balance between Precision and Recall. There are other versions in the 'F-score' family, for example ones that assign a bigger weight to either Precision or Recall, but F1 is a good enough option to get started.
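As a quick sketch of how these four metrics are computed in practice (the label arrays below are assumed values chosen to reproduce the example's counts TP = 1, TN = 14, FP = 3, FN = 2), scikit-learn provides ready-made functions:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Assumed labels reproducing the example: 20 emails, 3 actual spam (1), 17 not spam (0)
y_true = [1, 1, 1] + [0] * 17
y_pred = [1, 0, 0] + [1, 1, 1] + [0] * 14    # gives TP=1, FN=2, FP=3, TN=14

print(confusion_matrix(y_true, y_pred))              # [[TN FP], [FN TP]] = [[14 3], [2 1]]
print("Accuracy :", accuracy_score(y_true, y_pred))  # 0.75
print("Precision:", precision_score(y_true, y_pred)) # 0.25
print("Recall   :", recall_score(y_true, y_pred))    # ~0.33
print("F1       :", f1_score(y_true, y_pred))        # ~0.29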

CLASS PROBABILITY ESTIMATION AND ASSESSING CLASS PROBABILITY ESTIMATES
Class Probability Estimation in Machine Learning refers to the process of predicting
the probability that a given data instance belongs to a particular class. This is essential in tasks
where understanding the confidence of predictions is as important as the prediction itself.

Why Assess Class Probability Estimates?

1. Decision Making: Helps in risk-sensitive decisions (e.g., medical diagnosis, fraud detection).

2. Threshold Tuning: Probabilities enable flexible decision thresholds for different applications.

3. Uncertainty Handling: Provides insights into model confidence.

Key Techniques for Class Probability Estimation

1. Logistic Regression

o Directly outputs class probabilities.

o Well-calibrated by design.

2. Naive Bayes

o Computes probabilities using Bayes' theorem.

o Often requires calibration for improved accuracy.

3. k-Nearest Neighbors (k-NN)

o Probability is estimated by the proportion of points in the neighborhood belonging to each class.

4. Decision Trees

o Leaf node proportions provide class probabilities.

o May require calibration for better reliability.

5. Random Forest

o Aggregates tree probabilities to improve stability and accuracy.

6. Neural Networks

o Softmax activation in the output layer estimates class probabilities.
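Most scikit-learn classifiers expose these estimates through predict_proba. A minimal sketch (with an assumed synthetic dataset):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)    # assumed toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)    # column 0: P(class 0), column 1: P(class 1)
print(proba[:5])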

Assessment of Probability Estimates

To evaluate how well these probabilities reflect real-world outcomes, consider these
metrics:

1. Brier Score

o Measures the mean squared error between predicted probabilities and actual
outcomes.

o Lower values indicate better probability estimation.

2. Log Loss (Cross-Entropy Loss)

o Penalizes confident but incorrect predictions more severely.

3. Calibration Plot (Reliability Diagram)

o Visual comparison of predicted probabilities vs. actual outcomes.

o Ideal calibration shows points lying on the diagonal.

4. Expected Calibration Error (ECE)

o Measures the gap between predicted probabilities and observed frequencies.
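A hedged sketch (again with assumed synthetic data) of how the first three of these can be computed with scikit-learn:

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=1)    # assumed toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p = clf.predict_proba(X_test)[:, 1]        # predicted probability of class 1

print("Brier score:", brier_score_loss(y_test, p))                # lower is better
print("Log loss   :", log_loss(y_test, clf.predict_proba(X_test)))

# Reliability diagram data: ideally frac_pos ≈ mean_pred in every bin
frac_pos, mean_pred = calibration_curve(y_test, p, n_bins=10)
print(np.round(frac_pos, 2))
print(np.round(mean_pred, 2))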

REGRESSION: ASSESSING PERFORMANCE OF REGRESSION - ERROR MEASURES
Regression algorithms are used to predict continuous numerical values based on input features. In scikit-learn, we can use various regression algorithms, such as Linear Regression, Decision Trees, Random Forests, and Support Vector Machines (SVM), among others.

1. True Values and Predicted Values:

In regression, we have two sets of values to compare: the actual target values (true values) and the values predicted by our model (predicted values). The performance of the model is assessed by measuring how close the predictions are to the true values.

2. Evaluation Metrics:

Regression metrics are quantitative measures used to evaluate the quality of a regression model. Scikit-learn provides several metrics, each with its own strengths and limitations, to assess how well a model fits the data.

TYPES OF REGRESSION METRICS


Some common regression metrics in scikit-learn are:

1. Mean Absolute Error (MAE)


2. Mean Squared Error (MSE)
3. R-squared (R²) Score
4. Root Mean Squared Error (RMSE)

1. Mean Absolute Error (MAE)


In the fields of statistics and machine learning, MAE is a frequently used metric. It measures the average absolute difference between a dataset's actual values and predicted values.

Mathematical Formula

The formula to calculate MAE for a data with "n" data points is:
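MAE = (1/n) Σ |xi − yi|, where the sum runs over i = 1, …, n.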

Where:
 xi represents the actual or observed values for the i-th data point.
 yi represents the predicted value for the i-th data point.
2. Mean Squared Error (MSE)
The Mean Squared Error (MSE) is a popular metric in statistics and machine learning. It measures the average of the squared differences between a dataset's actual values and predicted values. MSE is frequently used in regression problems to assess how well predictive models work.

Mathematical Formula

For a dataset containing 'n' data points, the MSE calculation formula is:
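MSE = (1/n) Σ (xi − yi)²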

where:

 xi represents the actual or observed value for the i-th data point.
 yi represents the predicted value for the i-th data point.
3. R-squared (R²) Score
A statistical metric frequently used to assess the goodness of fit of a regression model is the R² score, also referred to as the coefficient of determination. It quantifies the proportion of the variance in the dependent variable that is explained by the model's independent variables. R² is a useful statistic for evaluating the overall effectiveness and explanatory power of a regression model.

Mathematical Formula

The formula to calculate the R-squared score is as follows:
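R² = 1 − (SSR / SST)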

Where:

 R2 is the R-Squared.

 SSR represents the sum of squared residuals between the predicted values and actual
values.
 SST represents the total sum of squares, which measures the total variance in the
dependent variable.
4. Root Mean Squared Error (RMSE)
RMSE stands for Root Mean Squared Error. It is a commonly used metric in regression analysis and machine learning to measure the accuracy or goodness of fit of a predictive model, especially when the predictions are continuous numerical values.

The RMSE quantifies how well the predicted values from a model align with the actual
observed values in the dataset. Here's how it works:

1. Calculate the Squared Differences: For each data point, subtract the predicted value
from the actual (observed) value, square the result, and sum up these squared
differences.
2. Compute the Mean: Divide the sum of squared differences by the number of data
points to get the mean squared error (MSE).
3. Take the Square Root: To obtain the RMSE, simply take the square root of the MSE.

Mathematical Formula

The formula for RMSE for a data with 'n' data points is as follows:
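RMSE = √MSE = √[ (1/n) Σ (xi − yi)² ]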

Where:

 RMSE is the Root Mean Squared Error.


 xi represents the actual or observed value for the i-th data point.
 yi represents the predicted value for the i-th data point.
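A short sketch (with made-up actual and predicted values) showing how these four metrics can be computed with scikit-learn and NumPy:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_actual = np.array([2.3, 4.1, 6.0, 8.2, 10.5])   # assumed observed values
y_pred   = np.array([2.0, 4.5, 5.8, 8.0, 10.9])   # assumed model predictions

mae  = mean_absolute_error(y_actual, y_pred)
mse  = mean_squared_error(y_actual, y_pred)
rmse = np.sqrt(mse)                                # RMSE is the square root of MSE
r2   = r2_score(y_actual, y_pred)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")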

OVERFITTING – CATALYSTS FOR OVERFITTING


Overfitting in Machine Learning
Overfitting happens when a model learns too much from the training data, including
details that don’t matter (like noise or outliers).
For example, imagine fitting a very complicated curve to a set of points. The curve will
go through every point, but it won’t represent the actual pattern.

As a result, the model works great on training data but fails when tested on new data.

Overfitting models are like students who memorize answers instead of understanding
the topic. They do well in practice tests (training) but struggle in real exams (testing).

Reasons for Overfitting:

 High variance and low bias.


 The model is too complex.
 The training dataset is too small.
Underfitting in Machine Learning

Underfitting is the opposite of overfitting. It happens when a model is too simple to capture what's going on in the data.

For example, imagine drawing a straight line to fit points that actually follow a curve.
The line misses most of the pattern.

In this case, the model doesn’t work well on either the training or testing data.
Underfitting models are like students who don’t study enough. They don’t do well in practice
tests or real exams.

Reasons for Underfitting:


1. The model is too simple, so it may not be able to represent the complexities in the data.
2. The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
3. The training dataset is too small.
4. Excessive regularization is used to prevent overfitting, which constrains the model from capturing the data well.
5. Features are not scaled.
Note:
 Underfitting: A straight line trying to fit a curved dataset cannot capture the data's patterns, leading to poor performance on both training and test sets.

 Overfitting: A squiggly curve passing through all training points, failing to generalize; it performs well on training data but poorly on test data.
 Appropriate Fitting: A curve that follows the data trend without overcomplicating, capturing the true patterns in the data. (A short sketch illustrating all three cases follows.)
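A hedged sketch of all three cases (assumed noisy quadratic data; degree 1 underfits, degree 2 fits appropriately, and a high degree tends to overfit), comparing training and test scores:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=60)   # assumed noisy quadratic data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 2, 10):   # underfit, appropriate fit, likely overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          "train R^2:", round(model.score(X_train, y_train), 3),
          "test R^2:", round(model.score(X_test, y_test), 3))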
CASE STUDY OF POLYNOMIAL REGRESSION
Introduction
Polynomial Regression is a type of regression analysis where the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. While linear regression assumes a straight-line relationship between variables, polynomial regression can capture more complex, non-linear relationships. This case study explores how polynomial regression can be applied to predict real-world data, providing insights into its usage, benefits, and limitations.

Problem Statement
We are tasked with predicting the growth of a plant based on the number of days since
planting. We have historical data that includes the number of days and corresponding plant
heights, and we wish to model the plant growth using polynomial regression.

Data Collection
We have collected the following dataset:

Days Since Planting (X) Plant Height (Y)


1 2.3
2 4.1
3 6.0
4 8.2
5 10.5
6 12.8
7 14.7
8 17.1
9 19.0
10 20.1
Steps in the Case Study
Step 1: Exploratory Data Analysis (EDA)
Before applying polynomial regression, we conduct an exploratory data analysis to
understand the structure of the data.

 Plotting the Data: The data appears to have a quadratic pattern, meaning it follows a
curve rather than a straight line. We can visualize this by plotting the data points on a
scatter plot.

Step 2: Model Selection and Setup


We want to fit a polynomial regression model to this data. We decide to start by fitting
a second-degree polynomial (quadratic), but we also consider trying higher-degree polynomials
to see if they improve model performance.

The polynomial regression model is represented by:
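y = β0 + β1·x + β2·x² + … + βn·xⁿ + ε, where β0 … βn are coefficients and ε is the error term.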

Where x is the independent variable (Days Since Planting) and y is the dependent
variable (Plant Height).

Step 3: Preprocessing the Data


Before applying polynomial regression, we need to transform the input data into the
form that the model can work with.

 Feature Transformation: To apply polynomial regression, we need to create polynomial features from the input data. For a quadratic model, this involves adding x² as a feature along with x. For higher-degree models, we create features for x³, x⁴, etc.

In Python, this can be done using the PolynomialFeatures class from the sklearn.preprocessing
module.
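For example (a sketch using the day values from the table above):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

days = np.arange(1, 11).reshape(-1, 1)    # Days Since Planting, 1..10
X_quad = PolynomialFeatures(degree=2, include_bias=False).fit_transform(days)
print(X_quad[:3])    # each row is [x, x^2]: [1, 1], [2, 4], [3, 9]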

Step 4: Model Fitting


We apply the polynomial regression using the transformed features and fit the model.
The steps involved are:

1. Transform the original feature (Days Since Planting) into polynomial features.
2. Fit a linear regression model on the polynomial features.
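A minimal sketch of this fitting step, reusing the plant-growth data from the case study:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

days    = np.arange(1, 11).reshape(-1, 1)
heights = np.array([2.3, 4.1, 6.0, 8.2, 10.5, 12.8, 14.7, 17.1, 19.0, 20.1])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_quad = poly.fit_transform(days)                 # step 1: polynomial features
model = LinearRegression().fit(X_quad, heights)   # step 2: linear fit on those features

print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
print("Predicted height on day 11:", model.predict(poly.transform([[11]])))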

Step 5: Evaluation and Model Testing


To assess the performance of the polynomial regression model, we split the data into
training and test sets. We then evaluate the model using metrics like the Mean Squared Error
(MSE) and R-squared.

We also compare the results of fitting models with different polynomial degrees (e.g.,
2nd, 3rd, and 4th degrees) to see which one performs best.
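A sketch of this comparison (with an assumed train/test split; the dataset is very small, so the numbers are only illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

days    = np.arange(1, 11).reshape(-1, 1)
heights = np.array([2.3, 4.1, 6.0, 8.2, 10.5, 12.8, 14.7, 17.1, 19.0, 20.1])
X_train, X_test, y_train, y_test = train_test_split(days, heights,
                                                    test_size=0.3, random_state=0)

for degree in (2, 3, 4):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"degree {degree}: MSE={mean_squared_error(y_test, pred):.3f}, "
          f"R2={r2_score(y_test, pred):.3f}")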

Step 6: Model Results


After fitting the models, we plot the predicted values against the actual data points. This
allows us to visually assess how well the model fits the data. The model’s predictions should
closely follow the curve of the actual data points.

For example, if a second-degree polynomial provides a good fit, the curve will closely
follow the quadratic shape of the data. If higher-degree polynomials are overfitting, the curve
will start to oscillate unnecessarily.

Step 7: Conclusion

 Model Choice: Based on the evaluation metrics, we choose the polynomial regression
model that provides the best fit. In this case, the second-degree polynomial model may
be sufficient to accurately model the data, as the relationship between the variables
appears quadratic.
 Overfitting: Higher-degree polynomials may lead to overfitting, where the model
performs well on training data but poorly on unseen data.
 Interpretability: Polynomial regression provides a way to model non-linear
relationships, but the complexity of the polynomial increases with higher degrees,
making it harder to interpret.

Benefits of Polynomial Regression

1. Flexibility: Polynomial regression is highly flexible and can model a wide range of
non-linear relationships.

2. Simple to Implement: It is relatively easy to implement using libraries like Scikit-learn, making it a good option for small to medium-sized datasets.
3. Improved Prediction: Polynomial regression can improve prediction accuracy
compared to linear regression when the underlying data has a non-linear relationship.

Limitations of Polynomial Regression

1. Overfitting: Higher-degree polynomials can lead to overfitting, where the model learns
the noise in the training data instead of the underlying relationship.
2. Interpretability: As the degree of the polynomial increases, the model becomes more
difficult to interpret.
3. Computational Complexity: Higher-degree models can be computationally
expensive, especially with large datasets.

Conclusion
Polynomial regression is a powerful tool for modeling non-linear relationships. In this
case study, we showed how polynomial regression can be applied to predict plant growth over
time. While polynomial regression provides an accurate model for many real-world problems,
careful consideration must be given to the degree of the polynomial used to avoid overfitting
and maintain model interpretability.

Polynomial regression can be an effective choice in scenarios where the relationship between variables is inherently non-linear, but it is essential to balance model complexity and performance.

THEORY OF GENERALIZATION: EFFECTIVE NUMBER OF HYPOTHESES
Generalization refers to the ability of a machine learning model to perform well on
unseen, out-of-sample data. In essence, a model is said to generalize well if it can make accurate
predictions on new data points that were not part of the training set. The concept of
generalization is central to the success of machine learning algorithms, as it ensures that the
model doesn’t just memorize the training data (overfitting), but instead learns the underlying
patterns that can be applied to new, unseen examples.

One way to understand generalization is through the concept of the effective number
of hypotheses. This concept stems from statistical learning theory and helps us understand how
the complexity of a model affects its ability to generalize.

Key Concepts

1. Hypothesis Space: The hypothesis space refers to the set of all possible models or
functions that could be considered by the learning algorithm to fit the data. In machine
learning, we typically search for a function within this space that minimizes some form
of error (like mean squared error in regression or cross-entropy in classification).
2. Effective Number of Hypotheses: The effective number of hypotheses refers to the
"effective" size of the hypothesis space that a model is able to select from based on the
data. It is often smaller than the total number of hypotheses available in the hypothesis
space because not all hypotheses are consistent with the data. In other words, this
number indicates how many hypotheses are effectively being considered when training
the model, based on the constraints imposed by the training data.
o The idea behind the effective number of hypotheses is that a smaller effective
number of hypotheses leads to better generalization. The more hypotheses there
are in the hypothesis space that fit the training data, the higher the risk of
overfitting to noise in the data and the poorer the model's ability to generalize.

Understanding the Effective Number of Hypotheses


To intuitively understand this, let’s break down a few points:

 Overfitting: If a model has access to a very large hypothesis space, it may be able to
perfectly fit the training data, including any noise. This results in overfitting, where the
model has too many hypotheses that fit the training data exactly, but fails to generalize
to new data. In this case, the effective number of hypotheses is large, but the model's
generalization ability is poor.
 Underfitting: If the hypothesis space is too small, the model may not be able to capture
the underlying patterns in the data. It may underfit, resulting in poor performance both
on the training data and on unseen data. Here, the effective number of hypotheses is
small, which might hinder the model's ability to find the true underlying relationships.
 Model Complexity: The complexity of the model determines the size of the hypothesis
space. For example, a linear regression model (with a single linear function) has a
relatively small hypothesis space compared to a deep neural network with many layers
and parameters. The hypothesis space size is proportional to the flexibility of the
model—more parameters lead to more hypotheses being possible. However, too many
parameters can lead to overfitting if the model is not regularized or controlled.

The Role of Regularization


Regularization techniques such as L2 regularization (Ridge), L1 regularization
(Lasso), and Dropout in neural networks aim to control the hypothesis space by penalizing
overly complex models. Regularization reduces the number of hypotheses that the model can
effectively select, thereby limiting overfitting and improving generalization.

 L2 regularization (Ridge) penalizes large weights, which reduces the capacity of the
model and shrinks the effective number of hypotheses.
 L1 regularization (Lasso) can lead to sparse solutions, effectively reducing the number
of active features (hypotheses).
 Dropout in neural networks forces the network to learn redundant representations by
randomly dropping units during training, reducing the effective number of hypotheses.
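As a brief hedged sketch (assumed synthetic data), increasing the regularization strength alpha in Ridge shrinks the coefficients, while Lasso drives some of them exactly to zero, effectively reducing the hypotheses the model can use:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)    # assumed toy data

for alpha in (0.01, 1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>6}: sum|ridge coefs|={np.abs(ridge.coef_).sum():.1f}, "
          f"lasso non-zero coefs={np.count_nonzero(lasso.coef_)}")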

The Effective Number of Hypotheses in Practice


To practically understand how the effective number of hypotheses relates to
generalization, consider the following simplified analogy:

 Imagine you are trying to fit a curve to a set of data points. If you use a very flexible
curve (like a high-degree polynomial), the number of ways you can fit the curve to the
data increases significantly, leading to many possible hypotheses. This makes it more
likely that your model fits noise in the training data and does not generalize well to
unseen data.
 On the other hand, if you use a simple linear model, there are fewer possible curves that
can fit the data. In this case, the effective number of hypotheses is smaller, and the
model may generalize better because it is less likely to overfit to noise.

The balance between model complexity (size of the hypothesis space) and the effective
number of hypotheses that the model actually uses is key to achieving good generalization.

Theoretical Formulation of the Effective Number of Hypotheses
In a more formal sense, the effective number of hypotheses can be related to the VC
dimension (Vapnik-Chervonenkis dimension) of the hypothesis space. The VC dimension is a
measure of the capacity or complexity of a model, i.e., how well it can fit different types of
data. A model with a higher VC dimension has a larger hypothesis space, meaning it has more
hypotheses to choose from and is more prone to overfitting.

The effective number of hypotheses is related to the degrees of freedom of the model.
In the context of linear regression with regularization, it can be approximated as:
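df(λ) = trace( X (XᵀX + λI)⁻¹ Xᵀ )
(the standard effective-degrees-of-freedom expression for ridge-regularized linear regression, where λ is the regularization strength and I is the identity matrix)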

Where X is the design matrix of features. This formula provides a way of estimating
how much flexibility the model has based on the input features and their interactions.

The theory of the effective number of hypotheses in generalization highlights the delicate balance between model complexity and the ability of a model to generalize well to unseen data.
A model that has a large hypothesis space but is poorly regularized may overfit the training
data, leading to poor generalization. On the other hand, a model with a small effective
hypothesis space may underfit the data and fail to capture important patterns.

Understanding and controlling the effective number of hypotheses is crucial for building models that generalize well, and it is an essential part of the theoretical foundation of
machine learning. Regularization techniques play a pivotal role in limiting the number of
hypotheses a model can effectively use, ensuring that it learns the most relevant patterns while
avoiding overfitting.

BOUNDING THE GROWTH FUNCTION


The effective number of hypotheses refers to the "effective" size of the hypothesis space
that a model can choose from based on the training data. A hypothesis space represents all
possible models a learning algorithm might consider. The effective number of hypotheses is
smaller than the total number of hypotheses, as not all of them are consistent with the data.

 Generalization is about a model's ability to perform well on unseen data, and the
effective number of hypotheses impacts generalization.

 Overfitting occurs when a model has too many possible hypotheses that fit the training
data too closely, leading to poor performance on new data.
 Underfitting happens when the hypothesis space is too small, and the model fails to capture the underlying patterns in the data.

Regularization techniques like L1 and L2 regularization, and Dropout in neural networks, help control the effective number of hypotheses by penalizing complex models, reducing overfitting, and improving generalization.

In simple terms, the effective number of hypotheses determines how many possible
models are effectively considered by the learning algorithm based on the data, and controlling
this number is key to balancing model complexity and generalization.

VC DIMENSION (VAPNIK-CHERVONENKIS DIMENSION)


The Vapnik-Chervonenkis (VC) dimension is a fundamental concept in statistical
learning theory. It measures the capacity or flexibility of a hypothesis space — that is, how
well a model can fit different sets of data. Specifically, the VC dimension quantifies the
maximum number of data points that a model can shatter (or perfectly classify) in all possible
ways.

Key Concepts

1. Hypothesis Space: This refers to the set of all possible models or functions that a
learning algorithm can choose from. Each model in the hypothesis space represents a
different way of mapping inputs to outputs.
2. Shattering: A model is said to shatter a set of points if, for every possible labeling of
those points (i.e., all combinations of positive/negative labels in classification), there
exists at least one hypothesis in the hypothesis space that perfectly classifies those
points according to the given labels.
3. VC Dimension: The VC dimension of a model or hypothesis space is defined as the
largest set of points that can be shattered by the model. It represents the model’s ability
to adapt to different datasets.
o If a model can shatter k points, then its VC dimension is at least k.
o If no set of k+1 points can be shattered, then the VC dimension is at most k.

Importance of VC Dimension in Generalization
The VC dimension plays a critical role in understanding the generalization ability of
a machine learning model:

 High VC Dimension: A model with a high VC dimension has more capacity to fit
complex patterns in data, which means it can potentially overfit the training data. This
leads to poor generalization on unseen data.
 Low VC Dimension: A model with a low VC dimension has limited capacity to fit data,
which may result in underfitting. It might not be flexible enough to capture the true
underlying patterns in the data.

VC Dimension and Model Complexity


The VC dimension is related to the complexity of a model. More complex models (e.g.,
deep neural networks, high-degree polynomials) have higher VC dimensions, meaning they
have a larger capacity to fit diverse data sets, but they also have a higher risk of overfitting.
Simpler models (e.g., linear classifiers) have lower VC dimensions and are less prone to
overfitting, but they might underfit if the true relationship is more complex.

Examples

1. Linear Classifiers: The VC dimension of a linear classifier in a 2D space is 3. This means there is a set of 3 points (in general position) that a linear classifier can shatter, but no set of 4 points in a 2D space can be shattered.
2. Decision Trees: The VC dimension of a decision tree depends on its depth. A deeper
tree can shatter more data points (higher VC dimension), but it also risks overfitting.

VC Dimension and Sample Size


The VC dimension is also used in deriving bounds for the generalization error. The
generalization error refers to how well a model’s performance on training data matches its
performance on unseen data.

 Upper Bound: The VC dimension helps to establish an upper bound on the generalization error. If a model has a high VC dimension, the model can fit more complex patterns, but we need more data to ensure that it generalizes well.
 Sample Complexity: A model with a higher VC dimension typically requires a larger
sample size to learn effectively and avoid overfitting.
The VC dimension is an essential concept for understanding the complexity of machine
learning models and their potential for overfitting or underfitting. It helps balance model
capacity and generalization ability, guiding the selection of appropriate models based on the
amount of available data. Models with a high VC dimension are more expressive but can
overfit, while models with a low VC dimension might underfit if they are too simple to capture
the underlying patterns.

