Chapter 2
Syllabus
• UNIT-II
Supervised Learning with Regression and Classification Techniques - 1: Linear Regression, Multiple Regression, Bias-Variance Dichotomy, Model Validation Approaches, Evaluation of the Performance of an Algorithm: Mean Squared Error, Root Mean Squared Error.
Linear Regression
• Linear regression is a basic and commonly used type of predictive analysis. Regression analysis is generally used to examine two things:
(1) Does a set of predictor variables do a good job of predicting an outcome (dependent) variable?
(2) Which variables in particular are significant predictors of the outcome variable, and in what way do they (as indicated by the magnitude and sign of the beta estimates) impact the outcome variable?
Linear Regression
• These regression estimates are used to explain the relationship between one dependent
variable and one or more independent variables.
• Naming the Variables. There are many names for a regression’s dependent variable. It
may be called an outcome variable, criterion variable, endogenous variable, or
regressand.
Uses of Linear Regression
• Three major uses for regression analysis are (1) determining the strength of predictors,
(2) forecasting an effect, and (3) trend forecasting.
• First, the regression might be used to identify the strength of the effect that the
independent variable(s) have on a dependent variable. Typical questions are what is the
strength of relationship between dose and effect, sales and marketing spending, or age
and income.
• Second, it can be used to forecast effects or impact of changes. That is, the regression
analysis helps us to understand how much the dependent variable changes with a change
in one or more independent variables. A typical question is, “how much additional sales
income do I get for each additional $1000 spent on marketing?”
• Third, regression analysis predicts trends and future values. The regression analysis can
be used to get point estimates. A typical question is, “what will the price of gold be in 6
months?”
Types of Linear Regression
• Simple linear regression
1 dependent variable (interval or ratio), 1 independent variable (interval or ratio or
dichotomous)
• Multiple linear regression
1 dependent variable (interval or ratio) , 2+ independent variables (interval or ratio or
dichotomous)
• Logistic regression
1 dependent variable (dichotomous), 2+ independent variable(s) (interval or ratio or
dichotomous)
• Ordinal regression
1 dependent variable (ordinal), 1+ independent variable(s) (nominal or dichotomous)
• Multinomial regression
1 dependent variable (nominal), 1+ independent variable(s) (interval or ratio or
dichotomous)
• Discriminant analysis
1 dependent variable (nominal), 1+ independent variable(s) (interval or ratio)
Simple Linear Regression
• Linear regression models are used to show or predict the relationship between
two variables or factors.
• The factor that is being predicted (the factor that the equation solves for) is called
the dependent variable.
• The factors that are used to predict the value of the dependent variable are called the
independent variables.
• In linear regression, each observation consists of two values.
• One value is for the dependent variable and one value is for the independent variable.
• In this simple model, a straight line approximates the relationship between the dependent
variable and the independent variable
Formula for Simple Linear Regression
• The two factors that are involved in simple linear regression analysis are
designated x and y.
• The equation that describes how y is related to x is known as the regression
model.
• The simple linear regression model is represented by:
• y = β0 + β1x + ε
• The linear regression model contains an error term that is represented by ε.
• The error term is used to account for the variability in y that cannot be explained
by the linear relationship between x and y.
• If ε were not present, that would mean that knowing x would provide enough
information to determine the value of y.
Formula for Simple Linear Regression
• There are also parameters that represent the population being studied.
• These parameters of the model are represented by β0 and β1.
• The simple linear regression equation is graphed as a straight line, where:
• β0 is the y-intercept of the regression line.
• β1 is the slope.
• E(y) is the mean or expected value of y for a given value of x.
• A regression line can show a positive linear relationship, a negative linear relationship, or no relationship.
Formula for Simple Linear Regression
• No relationship: The graphed line in a simple linear regression is flat (not
sloped). There is no relationship between the two variables.
• Positive relationship: The regression line slopes upward with the lower end of
the line at the y-intercept (axis) of the graph and the upper end of the line
extending upward into the graph field, away from the x-intercept (axis). There is a
positive linear relationship between the two variables: as the value of one
increases, the value of the other also increases.
• Negative relationship: The regression line slopes downward with the upper end
of the line at the y-intercept (axis) of the graph and the lower end of the line
extending downward into the graph field, toward the x-intercept (axis). There is a
negative linear relationship between the two variables: as the value of one
increases, the value of the other decreases.
Formula for Simple Linear Regression
• If the parameters of the population were known, the simple linear regression equation could be used to compute the mean value of y for a known value of x:
• E(y) = β0 + β1x
• In practice, however, parameter values generally are not known, so they must be estimated by using data from a sample of the population.
• The population parameters are estimated by using sample statistics.
• The sample statistics are represented by b0 and b1.
• When the sample statistics are substituted for the population parameters, the estimated regression equation is formed.
Formula for Simple Linear Regression
• The estimated simple linear regression equation is ŷ = b0 + b1x, where b0 and b1 are the sample estimates of β0 and β1 and ŷ is the predicted value of y for a given value of x.
RSS Residual Sum of Squares
• In statistics, the residual sum of squares (RSS), also known as the sum of squared
residuals (SSR) or the sum of squared errors of prediction (SSE), is the sum of the
squares of residuals (deviations of predicted from actual empirical values of data).
• Residual Sum of Squares (RSS) is defined and given by the following function:
RSS = Σ (yᵢ - ŷᵢ)² = Σ (yᵢ - (m·xᵢ + c))², where the sum runs over the n samples i = 1, …, n
Where
y = dependent variable
x = independent variable
m = slope of the line
c = y-intercept of the line
n = number of samples
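A minimal Python sketch of this definition (the helper name rss and the inputs are purely illustrative):

```python
import numpy as np

def rss(x, y, m, c):
    """Residual sum of squares for the fitted line y_hat = m*x + c."""
    y_hat = m * np.asarray(x) + c          # predicted values
    residuals = np.asarray(y) - y_hat      # deviations of actual from predicted
    return float(np.sum(residuals ** 2))   # sum of squared residuals
```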
Simple Numerical
• Formula for Linear Regression is given by:
y = mx + c
Where
y= dependent variable
x= independent variable
m= slope of the line
c= y-intercept of the line
Simple Numerical
• Formulas to calculate m and c are given by:
m (slope) = (n·Σxy - Σx·Σy) / (n·Σx² - (Σx)²)
c (intercept) = (Σy - m·Σx) / n
Where
y = dependent variable
x = independent variable
m = slope of the line
c = y-intercept of the line
n = number of samples
Simple Numerical
• Find linear regression equation for the following two sets of data:
x: 2 4 6 8
y: 3 7 5 10

x      y      x²     x·y
2      3      4      6
4      7      16     28
6      5      36     30
8      10     64     80
Σx = 20   Σy = 25   Σx² = 120   Σxy = 144
Simple Numerical
• Substituting the sums into the formulas for m and c:
m (slope) = (n·Σxy - Σx·Σy) / (n·Σx² - (Σx)²) = (4 × 144 - 20 × 25) / (4 × 120 - 20²) = 76 / 80 = 0.95
c (intercept) = (Σy - m·Σx) / n = (25 - 0.95 × 20) / 4 = 6 / 4 = 1.5
• The regression line is therefore y = 0.95x + 1.5.
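As a quick check, the same computation can be done in NumPy (a small sketch; the array names are illustrative):

```python
import numpy as np

x = np.array([2, 4, 6, 8])
y = np.array([3, 7, 5, 10])
n = len(x)

# m = (n*Σxy - Σx*Σy) / (n*Σx² - (Σx)²)
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
# c = (Σy - m*Σx) / n
c = (np.sum(y) - m * np.sum(x)) / n

print(m, c)   # expected: 0.95 and 1.5
```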
Simple Numerical
• Find RSS:
x: 2 4 6 8
y: 3 7 5 10
• m = 0.95
• c = 1.5
• RSS = Σ (yᵢ - (m·xᵢ + c))² = (3 - 3.4)² + (7 - 5.3)² + (5 - 7.2)² + (10 - 9.1)² = 0.16 + 2.89 + 4.84 + 0.81 = 8.70
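A short, self-contained sketch of this RSS computation in Python (values assume the fitted line above):

```python
import numpy as np

x = np.array([2, 4, 6, 8])
y = np.array([3, 7, 5, 10])
m, c = 0.95, 1.5

y_hat = m * x + c                       # predictions: 3.4, 5.3, 7.2, 9.1
rss = float(np.sum((y - y_hat) ** 2))   # (-0.4)² + 1.7² + (-2.2)² + 0.9²
print(rss)                              # ≈ 8.70
```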
Multiple Linear Regression
• Multiple linear regression is a statistical technique that uses two or more independent variables to predict the outcome of a dependent variable.
• The technique enables analysts to determine the variation of the model and the relative contribution of each independent variable to the total variance.
• Multiple regression can take two forms, i.e., linear regression and non-linear regression.
Multiple Linear Regression
• Multiple linear regression formula:
yᵢ = β0 + β1·xᵢ1 + β2·xᵢ2 + … + βp·xᵢp + ϵ
• Where:
• yᵢ is the dependent or predicted variable
• β0 is the y-intercept, i.e., the value of yᵢ when all the independent variables are 0.
• β1 and β2 are the regression coefficients that represent the change in y relative to a one-unit change in xᵢ1 and xᵢ2, respectively.
• βp is the slope coefficient for each independent variable
• ϵ is the model’s random error (residual) term.
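As an illustration, such a model can be fitted with scikit-learn's LinearRegression (a sketch only; the feature matrix X and target y below are made-up placeholder values, not data from these slides):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data: two independent variables (columns) and one dependent variable.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 8.1, 10.8])

model = LinearRegression().fit(X, y)
print(model.intercept_)   # estimate of β0
print(model.coef_)        # estimates of β1 and β2
```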
Multiple Linear Regression-Assumptions
• A linear relationship between the dependent and independent variables
The first assumption of multiple linear regression is that there is a linear relationship
between the dependent variable and each of the independent variables.
The best way to check the linear relationships is to create scatterplots and then visually
inspect the scatterplots for linearity.
If the relationship displayed in the scatterplot is not linear, then the analyst will need to run
a non-linear regression or transform the data using statistical software, such as SPSS.
• The independent variables are not highly correlated with each other
The data should not show multicollinearity, which occurs when the independent variables
are highly correlated to one another.
When independent variables show multicollinearity, there will be problems in figuring out
the specific variable that contributes to the variance in the dependent variable.
Multiple Linear Regression-Assumptions
• The variance of the residuals is constant
Multiple linear regression assumes that the amount of error in the residuals is similar at
each point of the linear model.
When analyzing the data, the analyst should plot the standardized residuals against the
predicted values to determine if the points are distributed fairly across all the values of
independent variables.
• Independence of observation
The model assumes that the observations should be independent of one another.
Simply put, the model assumes that the values of residuals are independent.
• Multivariate normality
Multivariate normality occurs when residuals are normally distributed.
To test this assumption, look at how the values of residuals are distributed.
It can also be tested using two main methods, i.e., a histogram with a superimposed normal
curve or the Normal Probability Plot method.
Logistic Regression
• Logistic regression was used in the biological sciences in the early twentieth century. It was then used in many social science applications.
• For example:
• To predict whether an email is spam (1) or not (0)
• Whether a tumor is malignant (1) or not (0)
Model
• Output = 0 or 1
• Hypothesis => Z = WX + B
• hΘ(x) = sigmoid(Z) = 1 / (1 + e^(-Z)), which maps Z to a value between 0 and 1
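A minimal sketch of this hypothesis in Python (the weights W, bias B, and example input are illustrative placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # maps any real z into (0, 1)

def predict_proba(x, W, B):
    z = x @ W + B                        # Z = WX + B
    return sigmoid(z)                    # hΘ(x) = sigmoid(Z)

# Illustrative weights and input
W = np.array([0.8, -0.4])
B = 0.1
x = np.array([1.5, 2.0])
p = predict_proba(x, W, B)
print(p, "spam" if p >= 0.5 else "not spam")   # threshold at 0.5
```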
Logistic Regression-types
• 1. Binary Logistic Regression
• The categorical response has only two possible outcomes. Example: spam or not spam.
• Say, if predicted_value ≥ 0.5, then classify the email as spam, else as not spam.
Decision Boundary
• Say, if predicted_value ≥ 0.5, then classify the email as spam, else as not spam.
Errors in Machine Learning
We can describe an error as an action which is inaccurate or wrong. In machine learning, error is used to see how accurately our model can predict on the data it uses to learn, as well as on new, unseen data. Based on our error, we choose the machine learning model which performs best for a particular dataset. There are two main types of errors present in any machine learning model: reducible errors and irreducible errors.
• Irreducible errors are errors which will always be present in a machine learning model, because of unknown variables, and whose values cannot be reduced.
• Reducible errors are those errors whose values can be further reduced to improve a model. They are caused because our model’s output function does not match the desired output function and can be optimized.
We can further divide reducible errors into
two: Bias and Variance
Bias and Variance
• What is Bias?
• To make predictions, our model will analyze our data and find patterns in it. Using these
patterns, we can make generalizations about certain instances in our data. Our model
after training learns these patterns and applies them to the test set to predict them.
• Bias is the difference between our actual and predicted values. Bias refers to the simple assumptions that our model makes about the data in order to be able to predict new data.
• When the bias is high, the assumptions made by our model are too basic, and the model can’t capture the important features of our data. This means that our model hasn’t captured patterns in the training data and hence cannot perform well on the testing data either. If this is the case, our model cannot perform on new data and cannot be sent into production.
• This instance, where the model cannot find patterns in our training set and hence fails for both seen and unseen data, is called Underfitting.
• The figure below shows an example of Underfitting. As we can see, the model has found no patterns in our data and the line of best fit is a straight line that does not pass through any of the data points. The model has failed to train properly on the data given and cannot predict new data either.
What is Variance
• Variance is the very opposite of bias. During training, the model is allowed to ‘see’ the data a certain number of times in order to find patterns in it. If it does not work on the data for long enough, it will not find patterns, and bias occurs. On the other hand, if our model is allowed to view the data too many times, it will learn very well for only that data. It will capture most patterns in the data, but it will also learn from the unnecessary data present, or from the noise.
• We can define variance as the model’s sensitivity to fluctuations in the data. Our model
may learn from noise. This will cause our model to consider trivial features as
important.
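To make the bias-variance idea concrete, here is a small sketch on synthetic data (the data, random seed, and polynomial degrees are all assumptions made only for this demonstration): a degree-1 polynomial typically underfits (high bias), while a very high-degree polynomial fits the training points closely but generalizes worse (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 3, 30)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=x.size)   # noisy nonlinear data

x_train, y_train = x[::2], y[::2]      # even-indexed points for training
x_test, y_test = x[1::2], y[1::2]      # odd-indexed points for testing

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)          # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_mse, 3), round(test_mse, 3))
# Typically: degree 1 gives high error on both sets (underfitting / high bias);
# degree 9 gives much lower training error than test error (overfitting / high variance).
```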
Model validation
• What is Model Validation?
• Machine learning is all about the data: its quality, its quantity, and how we work with it. Most of the time, after collecting the data we have to clean it, preprocess it, and then apply the appropriate algorithm to get the best-fit model out of it. But after getting a model, the task is not done; model validation is as important as training.
• Directly training and then deploying a model would not work. In sensitive areas such as healthcare, a huge amount of risk is associated with the model and real-life predictions have to be made; in such cases there should not be errors in the model, as they can be very costly.
• Advantages of Model Validation
Here are some of the advantages that model validation provides.
• Quality of the Model
The first and foremost advantage of model validation is insight into the quality of the model; by validating the model we can quickly get an idea about its performance and quality.
• Flexibility of the Model
Secondly, validating the model makes it easy to get an idea about its flexibility. Model validation also helps make the model more flexible.
Validation
• The process of deciding whether the numerical results quantifying hypothesized relationships between variables are acceptable as descriptions of the data is known as validation.
• Generally, an error estimation for the model is made after training, better known as evaluation of residuals.
• In this process, a numerical estimate of the difference between the predicted and original responses is made, also called the training error.
• However, this only gives us an idea of how well our model does on the data used to train it.
• Now it is possible that the model is underfitting or overfitting the data.
• So, the problem with this evaluation technique is that it does not give an indication of how well the learner will generalize to an independent/unseen data set.
• The technique used to get this idea about our model is known as cross validation.
Cross Validation
• To evaluate the performance of any machine learning model, we need to test it on some unseen data.
• Based on the model’s performance on unseen data, we can say whether our model is under-fitting, over-fitting, or well generalised.
• Cross validation (CV) is one of the techniques used to test the effectiveness of a machine learning model; it is also a re-sampling procedure used to evaluate a model when we have limited data.
• To perform CV we need to keep aside a sample/portion of the data which we do not use to train the model, and later use this sample for testing/validating.
Cross Validation
• In machine learning, we cannot simply fit the model on the training data and claim that the model will work accurately on real data.
• For this, we must ensure that our model has learnt the correct patterns from the data and is not picking up too much noise.
• For this purpose, we use the cross-validation technique.
• Cross-validation is a technique in which we train our model using a subset of the data-set and then evaluate it using the complementary subset of the data-set.
Steps
• Reserve some portion of the sample data-set.
• Train the model using the rest of the data-set.
• Test the model using the reserved portion of the data-set.
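A hedged sketch of these three steps with scikit-learn's train_test_split (the synthetic data and the choice of LinearRegression are placeholders for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Placeholder data: 100 samples, 1 feature, linear relationship plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=1.0, size=100)

# 1. Reserve a portion of the data-set (here 20%) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Train the model on the remaining data
model = LinearRegression().fit(X_train, y_train)

# 3. Test the model on the reserved portion
print(mean_squared_error(y_test, model.predict(X_test)))
```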
Methods of Cross validation
Hold Out Cross Validation
• A basic remedy for this involves removing a part of the training data and using it to get predictions from the model trained on the rest of the data.
• The error estimation then tells how our model is doing on unseen data, or the validation set.
• This is a simple kind of cross validation technique, also known as the holdout method.
• Although this method doesn’t take any overhead to compute and is better than traditional validation, it still suffers from issues of high variance.
• This is because it is not certain which data points will end up in the validation set, and the result might be entirely different for different sets.
K-fold Cross Validation
• The procedure has a single parameter called k that refers to the number of groups that a
given data sample is to be split into.
• As such, the procedure is often called k-fold cross-validation.
• When a specific value for k is chosen, it may be used in place of k in the reference to the
model, such as k=10 becoming 10-fold cross-validation.
• If k=5 the dataset will be divided into 5 equal parts and the below process will run 5
times, each time with a different holdout set.
• 1. Take one group as a holdout or test data set
• 2. Take the remaining groups as a training data set
• 3. Fit a model on the training set and evaluate it on the test set
• 4. Retain the evaluation score and discard the model
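A sketch of this procedure with scikit-learn's KFold (the dataset, model, and metric are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=1.0, size=50)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # fit on k-1 folds
    scores.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))  # evaluate on the holdout fold
print(np.mean(scores))   # average score across the 5 folds
```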
K-fold Cross Validation
• The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
• A value of k = 10 is very common in the field of applied machine learning, and is recommended if you are struggling to choose a value for your dataset.
• If a value for k is chosen that does not evenly split the data sample, then one group will contain a remainder of the examples.
• It is preferable to split the data sample into k groups with the same number of samples, such that the samples of model skill scores are all equivalent.
Leave p Out Cross Validation (LPOCV)
• This approach leaves p data points out of the training data, i.e. if there are n data points in the original sample, then n - p samples are used to train the model and p points are used as the validation set.
• This is repeated for all combinations in which the original sample can be separated this way, and then the error is averaged over all trials to give the overall effectiveness.
• The number of possible combinations is the number of ways of choosing p points out of the n, i.e. the binomial coefficient nCp.
Leave p Out Cross Validation
• This method is exhaustive in the sense that it needs to train and validate the model for all
possible combinations, and for moderately large p, it can become computationally
infeasible.
• A particular case of this method is when p = 1. This is known as Leave one out cross
validation.
• This method is generally preferred over the previous one because it does not suffer from such intensive computation, as the number of possible combinations is equal to the number of data points in the original sample, i.e. n.
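scikit-learn provides LeavePOut and LeaveOneOut splitters for both schemes; a tiny sketch (the dataset size n = 6 is chosen only to keep the number of splits small):

```python
import numpy as np
from sklearn.model_selection import LeavePOut, LeaveOneOut

X = np.arange(6).reshape(-1, 1)                  # a tiny placeholder dataset of n = 6 points

print(sum(1 for _ in LeavePOut(p=2).split(X)))   # nC2 = 15 train/validation splits
print(sum(1 for _ in LeaveOneOut().split(X)))    # n = 6 splits
```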
Evaluation of the performance of an algorithm
• When evaluating the performance of an algorithm, there are several key factors to
consider. Here are some important aspects to assess:
• Correctness: Determine if the algorithm produces the correct outputs for a given set of
inputs. This can be done by comparing the algorithm's results against known, expected
outputs.
• Efficiency: Evaluate the algorithm's efficiency in terms of time and space complexity.
Measure how quickly the algorithm runs and how much memory it requires to perform
its tasks. This can be done by analyzing the algorithm's asymptotic complexity (e.g., Big
O notation) or by benchmarking its execution time and memory usage on different input
sizes.
• Scalability: Assess how the algorithm's performance scales with increasing input sizes.
Determine if the algorithm can handle larger datasets or more complex problems
without a significant decrease in performance. This is particularly important if the
algorithm will be used in real-world scenarios with potentially large or growing datasets.
Evaluation of the performance of an algorithm
• Robustness: Test the algorithm's robustness by evaluating its behavior with various types of input data.
This includes testing edge cases, boundary conditions, and extreme or unexpected inputs. Check if the
algorithm handles these cases correctly and does not produce unexpected errors or crashes.
• Comparison with alternative methods: Compare the algorithm's performance against other existing
algorithms or approaches that solve the same problem. Evaluate factors such as accuracy, efficiency, and
scalability to determine if the algorithm outperforms or is comparable to other solutions.
• Bias and fairness: If the algorithm involves decision-making or prediction, assess its fairness and potential
biases. Evaluate if the algorithm treats different groups of individuals fairly and does not discriminate
against any particular demographic.
• Real-world performance: Consider the algorithm's performance in practical applications or real-world
scenarios. Evaluate its usability, reliability, and whether it meets the desired objectives and requirements.
It's important to note that the evaluation process may vary depending on the specific algorithm, problem
domain, and context. Therefore, it's crucial to adapt the evaluation criteria to suit the specific
requirements and goals of the algorithm being assessed.
Evaluation of the performance of an algorithm
• Accuracy: Accuracy measures the proportion of correctly classified instances out of the total number of
instances. It provides an overall assessment of the algorithm's performance. However, accuracy alone may
not be sufficient if the dataset is imbalanced or if different types of errors have varying levels of
importance.
• Error Rate: Error rate is the complement of accuracy and represents the proportion of misclassified
instances. It is calculated by subtracting the accuracy from 1. Error rate provides an alternative
perspective on the algorithm's performance.
• Precision: Precision is the proportion of true positives (correctly classified positive instances) out of the
total instances predicted as positive. Precision focuses on the accuracy of positive predictions and is
useful when minimizing false positives is important. It is calculated as:
Precision = True Positives / (True Positives + False Positives)
• Recall (Sensitivity or True Positive Rate): Recall measures the proportion of true positives out of all actual
positive instances. It is useful when identifying all positive instances is crucial, as it focuses on minimizing
false negatives. Recall is calculated as:
Recall = True Positives / (True Positives + False Negatives)
Evaluation of the performance of an algorithm
• Specificity: Specificity measures the proportion of true negatives (correctly classified negative instances)
out of all actual negative instances. It complements recall and is relevant when minimizing false negatives
is less critical. Specificity is calculated as:
• Specificity = True Negatives / (True Negatives + False Positives)
• These metrics are commonly used in binary classification problems. However, they may need to be
adapted or extended for multi-class classification or other specific tasks. It is important to choose
appropriate evaluation metrics that align with the specific goals and requirements of the problem at
hand. Additionally, considering a combination of these metrics can provide a more comprehensive
evaluation of the algorithm's performance.
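A short sketch computing these metrics directly from the confusion-matrix counts (the true and predicted labels below are made-up placeholder values):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # actual labels (placeholder)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # predicted labels (placeholder)

tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))    # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))    # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))    # true negatives

accuracy = (tp + tn) / len(y_true)
error_rate = 1 - accuracy
precision = tp / (tp + fp)
recall = tp / (tp + fn)                       # sensitivity / true positive rate
specificity = tn / (tn + fp)
print(accuracy, error_rate, precision, recall, specificity)
```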
Mean Squared Error, Root Mean Squared Error
• Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are common
evaluation metrics used in regression tasks to measure the performance of a predictive
model. They quantify the average squared difference between the predicted values and
the actual values.
• Here's an explanation of both metrics:
• Mean Squared Error (MSE): MSE is calculated by taking the average of the squared
differences between the predicted values and the actual values. It measures the average
squared distance between the predicted and actual values, giving higher weights to
larger errors.
• The formula for MSE is as follows:
• MSE = (1/n) * Σ(yᵢ - ŷᵢ)²
• Where:
• n is the number of data points
• yᵢ is the actual value of the i-th data point
• ŷᵢ is the predicted value of the i-th data point
• MSE provides a measure of the overall model performance, with a higher value
indicating larger errors.
Mean Squared Error, Root Mean Squared Error
• Root Mean Squared Error (RMSE): RMSE is simply the square root of MSE. It is preferred
over MSE when you want to interpret the error metric in the same unit as the target
variable. RMSE gives a measure of the average difference between the predicted and
actual values, with the same scale as the target variable.
• The formula for RMSE is:
• RMSE = √(MSE)
• RMSE is useful because it is easily interpretable and allows for direct comparison with
the range of the target variable. Smaller RMSE values indicate better model
performance, with a value of zero indicating a perfect match between predicted and
actual values.
• Both MSE and RMSE penalize larger errors more than smaller ones because of the
squaring operation. These metrics are widely used to evaluate and compare regression
models, allowing for a quantitative assessment of their predictive accuracy.
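A minimal sketch computing both metrics in Python (the actual and predicted values below reuse the earlier numerical example and are otherwise just placeholders):

```python
import numpy as np

y_actual = np.array([3.0, 7.0, 5.0, 10.0])      # actual values (placeholder)
y_predicted = np.array([3.4, 5.3, 7.2, 9.1])    # predicted values (placeholder)

mse = np.mean((y_actual - y_predicted) ** 2)    # MSE = (1/n) * Σ(yᵢ - ŷᵢ)²
rmse = np.sqrt(mse)                             # RMSE = √MSE
print(mse, rmse)                                # ≈ 2.175 and ≈ 1.475
```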
THANK YOU