
CS550: Machine Learning

Regression

Dr. Gagan Gupta


Slides based on Aurélien Géron's book, the ISL and ESL books, and Harvard's CS109A
11/24/2023 Regression 1
Lecture Objectives
• What will we learn in this lecture?
• Regression Analysis and examples
• Machine Learning Methodology
• How to assess goodness of the models
• Understand the bias-variance trade-off
• Understand the method of K nearest neighbors
• Understand the method of linear (least-squares) regression

11/24/2023 Regression 2
Prediction problem motivation
• The variable we'd like to predict may be:
• more difficult to measure,
• more important than the other(s), or
• directly or indirectly influenced by the values of the other variable(s)
• Thus, we'd like to define two categories of variables:
• variables whose value we want to predict
• variables whose values we use to make our prediction

11/24/2023 Regression 3
Regression Analysis
• Definition: A class of techniques that seeks to make predictions about
unknown continuous target variables given observed input variables.
• Applications:
• Predicting a person’s height given the height of their parents.
• Predicting the amount of time someone will take to pay back a loan
given their credit history.
• Predicting what time a package will arrive given current weather and
traffic conditions.
• Predicting the production of a particular crop given the rainfall

11/24/2023 Regression 4
Response and Predictor Variables
• We are observing numerical variables and we are making sets of
observations.

• We call the variable we'd like to predict the outcome or response variable; typically, we denote this variable by Y and the individual measurements by y_i.

• The variables we use in making the predictions are called the features or predictor variables; typically, we denote these variables by X and the individual measurements by x_ij.

Note: i indexes the observation (i = 1, …, n) and j indexes the j-th predictor variable (j = 1, …, J).
Total number of predictor variables, J = p

11/24/2023 Regression 5
True vs. Statistical Model
• We will assume that the response variable, Y, relates to the predictors, X, through some unknown function 'f', expressed generally as:

Y = f(X) + ε

• Here, f is the unknown function expressing an underlying rule for relating Y to X, and ε is the random amount (unrelated to X) by which Y differs from the rule f(X)

• A statistical model is any algorithm that estimates f. We denote the estimated function as f̂

11/24/2023 Regression 6
Machine Learning
• ML algorithms have the objective of generalization, i.e. they use one dataset to generate models that perform well on data that they have not seen.
• Thus, they prove to be effective in generating predictive models.
• There are many ML techniques, and ML models have several parameters
• How to choose the best one?

Data Science Process: Ask an interesting question → Data preparation → Explore the Data → Model the Data → Communicate/Visualize the Results
11/24/2023 Regression 7
Machine Learning Methodology
• The input dataset is divided by a random split (80/20 or 90/10) to be used for training and testing, respectively
• For each split, a model is generated and tested for accuracy
• This is repeated 5 or 10 times and the average error is computed
• The best model is selected and used in the real-world scenario

[Diagram: Train Dataset (Historical) → random samples → Training set / Test set → apply ML algorithms → Best Model → validate results on Test Data (Real)]
11/24/2023 Regression 8
Flexibility vs. Interpretability Tradeoff
• There are many methods of
regression (that estimate f)
• Some are less flexible but
more interpretable
• These are useful for
inference problems where
we want to study the
relationships between
predictor variables
• But highly flexible methods
can also lead to over-fitting!
11/24/2023 Regression 9
Error Evaluation
In order to quantify how well a model performs, we define a loss or error function.
A common loss function for quantitative outcomes is the Mean Squared Error (MSE):

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

The quantity yᵢ − ŷᵢ is called a residual and measures the error at the i-th prediction.
The square root of the MSE is the RMSE: RMSE = √MSE
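
A minimal Python sketch of computing these metrics; the observed and predicted values below are made up for illustration:

import numpy as np

def mse(y, y_hat):
    # mean squared error between observed y and predictions y_hat
    return np.mean((y - y_hat) ** 2)

y = np.array([22.1, 10.4, 9.3, 18.5, 12.9])        # observed responses (illustrative)
y_hat = np.array([20.5, 12.0, 10.1, 17.0, 14.2])   # model predictions (illustrative)
print("MSE:", mse(y, y_hat), "RMSE:", np.sqrt(mse(y, y_hat)))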

10
R-squared Error

• R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²
• If our model is as good as the mean value, ȳ, then R² = 0
• If our model is perfect, then R² = 1
• R² can be negative if the model is worse than the average. This can happen when
we evaluate the model on the real-life test set.

11
Bias Variance Tradeoff

• Total Error = Bias² + Variance + Irreducible Error

Bias is the average distance of the estimate f̂(x) from the true mean of f(x)
Variance is the squared deviation of the estimate around its mean
11/24/2023 Regression 12
Bias Variance Tradeoff
“All models are wrong, but some models are useful.” : George Box (1919-2013)
• Occam’s razor: This philosophical principle states that “the
simplest explanation is best”.
• Bias is error from erroneous assumptions in the model, like
making it linear/simplistic. (underfitting)
• Variance is error from sensitivity to small fluctuations in the
training set, indicating it will not work in real world. (overfitting)
• First-principle models are likely to suffer from bias, while data-driven
models are in greater danger of overfitting.
11/24/2023 Regression 13
Example Problem (Advertising)

The Advertising data set consists of the sales of a product in 200
different markets, along with advertising budgets for the product in each
of those markets for three different media: TV, radio, and newspaper.
Everything is given in units of $1,000.
Some of the figures in this presentation are taken from ISL book: "An Introduction to Statistical Learning, with applications in R"
(Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani "
11/24/2023 Regression 14
Response vs. Predictor Variables
X (predictors / features / covariates)        Y (outcome / response variable / dependent variable)

   TV      radio   newspaper   sales
  230.1    37.8      69.2       22.1
   44.5    39.3      45.1       10.4
   17.2    45.9      69.3        9.3
  151.5    41.3      58.5       18.5
  180.8    10.8      58.4       12.9

(n observations as rows; the p predictors are TV, radio and newspaper; sales is the response)

11/24/2023 Regression
k-Nearest Neighbors
The k-Nearest Neighbor (kNN) model is an intuitive way to predict a
quantitative response variable:

to predict a response for a set of observed predictor values, we use


the responses of other observations most similar to it

Note: this strategy can also be applied in classification to predict a


categorical variable.

11/24/2023 Regression 16
k-Nearest Neighbors
For a fixed value of k, the predicted response for the i-th observation is the
average of the observed responses of the k closest observations:

ŷᵢ = (1/k) Σ_{x ∈ N_k(xᵢ)} y(x)

where N_k(xᵢ) is the set of k observations most similar to xᵢ ('similar' refers to a notion of
distance between predictors).
Usually, Euclidean distance is chosen (the square root of the sum of squared coordinate
differences).

Python: sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
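
A hedged sketch of fitting a kNN regressor with scikit-learn; it assumes the advertising data is already loaded in a pandas DataFrame named ads with columns TV, radio, newspaper and sales:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

X = ads[["TV", "radio", "newspaper"]].values
y = ads["sales"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsRegressor(n_neighbors=3)   # k = 3 nearest neighbors, Euclidean distance by default
knn.fit(X_train, y_train)
rmse = np.sqrt(np.mean((knn.predict(X_test) - y_test) ** 2))   # test-set RMSE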
11/24/2023 Regression 17
4-Nearest Neighbors

• Very few assumptions are made here about the nature of 'f'
• Equal weights are given to the values of y, regardless of their distance from x
• As dimensionality increases, it becomes hard to find neighbors close by, and f
may change significantly
11/24/2023 Regression 18
Model Comparison

• Do the same for all k's and compare the RMSEs
• Which k is best?
• Q. What is the reason for the discontinuity?
• Q. Where is the bias-variance tradeoff?
19
kNN: Kernel Regression
• What if we took all the points, not just the k nearest points, and
introduced a weighting function that weights by distance (so that we
weight the value of closer points more)?

• Traditional kNN method can be seen as a 0/1 weighting model


• Examples of kernels: Epanechnikov quadratic kernel, Tri-cube Kernel,
and Gaussian Kernel.
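
A toy sketch of kernel-weighted (Nadaraya-Watson style) regression with a Gaussian kernel; the bandwidth h and the 1-D synthetic data are illustrative assumptions, not part of the slides:

import numpy as np

def gaussian_kernel(d, h):
    # weight decays smoothly with distance d; h controls the bandwidth
    return np.exp(-0.5 * (d / h) ** 2)

def kernel_predict(x0, X, y, h=0.5):
    w = gaussian_kernel(np.abs(X - x0), h)   # weight every training point by its distance to x0
    return np.sum(w * y) / np.sum(w)         # weighted average of all responses

X = np.linspace(0, 10, 50)
y = np.sin(X) + 0.1 * np.random.randn(50)
print(kernel_predict(2.0, X, y))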
11/24/2023 Regression 20
Comparison of Kernels

11/24/2023 Regression 21
Linear Models
• In the kNN approach, we didn’t assume a form of the function ‘f’
• Such approaches are called non-parametric approaches
• In the linear regression approach, we assume that the response is a
linear function of the predictor variables
• Note that this technique can be easily extended by creating extra
predictor variables (features) from a combination (transformation) of
the original predictor variables.
• So let's assume: y = β₀ + β₁x + ε
Linear Regression

• … then it follows that our estimate is: ŷ = β̂₀ + β̂₁x

• where β̂₀ and β̂₁ are estimates of β₀ and β₁, respectively, that we compute
using the n observations.

23
Estimate of the regression coefficients

For a given data set

24
Estimate of the regression coefficients (cont)

Which of the above three lines fits the data points the best?
a. The one which goes through the maximum number of points
b. The one with the least slope
c. The one from which no point is too far, i.e. it is approximately in the middle of
all points
25
Estimate of the regression coefficients (cont)
To compute the best fit, we first calculate the residuals, eᵢ = yᵢ − ŷᵢ = yᵢ − (β̂₀ + β̂₁xᵢ),
and then choose β̂₀ and β̂₁ to minimize the sum of squared residuals.
26
Python package

11/24/2023 Regression 27
So how do Linear Regression solvers work?
• Matrix Methods
• Exact methods that solve the set of linear equations
• Involve computation of matrix inverse or pseudoinverse (more efficient)
• Gradient Descent
• A generic method of solving optimization problems
• Begin with a random point and reach the optimal solution through a
sequence of improvements
• Faster improvements can be achieved using stochastic methods

11/24/2023 Regression 28
Matrix Algebra for n-dimensions
• Loss (L) = MSE(β) = (1/n) (Xβ − y)ᵀ(Xβ − y), where X is an n×(p+1) matrix with each row as an input
vector (including '1' for the intercept) and y is an n-dimensional
vector of the outputs in the training set
• To minimize, we differentiate with respect to β and we get
• Xᵀ(Xβ − y) = 0
• If XᵀX is non-singular, meaning its inverse exists, then
• β = (XᵀX)⁻¹ Xᵀ y
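
A small numpy sketch of this closed-form solution; X_raw (an n-by-p array of predictors) and y (a length-n array of responses) are assumed to exist:

import numpy as np

n, p = X_raw.shape
X = np.hstack([np.ones((n, 1)), X_raw])   # prepend a column of 1s for the intercept
beta = np.linalg.inv(X.T @ X) @ X.T @ y   # β = (XᵀX)⁻¹ Xᵀ y
beta_pinv = np.linalg.pinv(X) @ y         # Moore-Penrose pseudoinverse: works even if XᵀX is singular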

11/24/2023 Regression 29
Matrix Algebra for n-dimensions
• Computational complexity of computing the matrix inverse: O(p^2.4) to
O(p^3), depending on the implementation
• Scikit-learn's LinearRegression class uses an SVD approach: O(p^2)
• SVD stands for singular value decomposition
• It uses the pseudoinverse approach (Moore-Penrose): numpy.linalg.pinv()
• β = X⁺ y, where X⁺ is the Moore-Penrose pseudoinverse of X
• Both have linear complexity in terms of the number of
instances, n, but at least quadratic complexity in p
• So, we need to look at alternate techniques if p is very large, e.g.,
100,000
11/24/2023 Regression 30
Gradient Descent Approach
• Start from a random point, i.e. generate a random β
1. Determine which direction will reduce the MSE
2. Compute the slope of the function (its derivative) at
this point and go in the reverse direction:
   β^(i+1) = β^(i) − λ dL/dβ
3. λ is the learning rate parameter
4. Go to step 1 until convergence, i.e. the MSE is minimized
• For linear regression, the MSE is a convex function
• There is no local minimum, just a global minimum
• It is continuous, with a slope that never changes abruptly
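
A minimal batch gradient-descent sketch for linear regression; the learning rate and iteration count are illustrative choices, and X is assumed to already include the intercept column of 1s:

import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    n, d = X.shape
    beta = np.random.randn(d)                     # start from a random β
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ beta - y)   # gradient of the MSE with respect to β
        beta -= lr * grad                         # step in the direction that reduces the MSE
    return beta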
11/24/2023 Regression 31
Stochastic Approaches
• Batch GD update equation
• Uses the whole batch of training data at each gradient step
• Stochastic GD
• Picks only a random instance of training data to update gradients
• Causes irregular descent, but better chance of finding global minimum
• Simulated annealing: Reduce the learning rate gradually to reduce
irregularity
• Mini-batch GD
• Small set of random instances of training data are used
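
A hedged scikit-learn sketch of stochastic gradient descent for regression; the hyper-parameter values are illustrative, and X_train / y_train are assumed from an earlier split. SGDRegressor's default invscaling learning-rate schedule gradually decays the step size, in the spirit of simulated annealing:

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

sgd = make_pipeline(
    StandardScaler(),                                    # SGD is sensitive to feature scale
    SGDRegressor(max_iter=1000, eta0=0.01, random_state=0),
)
sgd.fit(X_train, y_train)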

11/24/2023 Regression 32
Parametric or Non-Parametric?
Linear Regression (parametric) vs. k-NN Approach (non-parametric):

• Assumption on function f: linear regression assumes a linear function; k-NN can work even if the function is non-linear, but it has to be locally constant
• High dimensions: linear regression has complexity problems, which can be overcome by efficient algorithms; for k-NN it is difficult to find neighbors nearby, which can cause errors
• Bias: low for linear regression; for k-NN, small K => low bias, large K => high bias
• Variance: depends on the problem for linear regression; for k-NN, small K => high variance, large K => low variance
• Computations: linear regression computes once, during the model-fitting phase, after which predictions are quick; k-NN looks at all the training points every time a prediction has to be made

11/24/2023 Regression 33
Lecture Objectives
• Understanding the outputs of a Linear Model
• Limitations of Linear Models and their extensions
• How to reduce over-fitting/variance via regularization
• Support Vector Machines (SVM)

11/24/2023 Regression 34
Brief review of the linear model

11/24/2023 Regression 35
Possible Questions
• How accurately do we know our model parameters?
• Is at least one predictor variable useful in the prediction?
• We have to examine the p-values
• Which subset of the predictor variables are important?
• There are several techniques of predictor variable/feature selection
• What would be the accuracy of predictions on unseen data?
• We can generate confidence intervals on our estimates
• Cross-validation gives us an estimate.
• Do I need more predictor variables/features?
• Look at patterns in the residual errors
11/24/2023 Regression 36
Confidence intervals for predictor estimators
• What causes errors in the estimation of β̂?

• we do not know the exact form of f

• limited sample size
• The variance of β̂ is called the standard error, SE(β̂)
• To estimate the SE, we use bootstrapping
• sampling from the training data (X, Y) to estimate its statistical properties
• in our case, we can sample with replacement
• compute β̂ multiple times by random sampling
• the variance of the multiple estimates approximates the true variance
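
A sketch of bootstrapping the standard error of the slope in simple linear regression; x and y are assumed to be 1-D numpy arrays, and B is the number of bootstrap resamples:

import numpy as np

def bootstrap_se_slope(x, y, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)               # sample indices with replacement
        slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]   # refit and keep the slope estimate
    return slopes.std(ddof=1)                          # spread of the estimates approximates SE(β̂₁)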
11/24/2023 Regression 37
Standard Errors Intuition from Formulae
• Better model: smaller residual variance σ²
• More data: larger n and larger Σᵢ (xᵢ − x̄)²
• Larger coverage: a larger spread, or a wider range, of the predictor values
• Better data: less measurement noise

General formula: SE(β̂)² = σ² (XᵀX)⁻¹

38
Significance of predictor variables
• As we saw, there are inherent uncertainties in estimation of β
• We evaluate the importance of predictors using hypothesis testing, using
the t-statistics and p-values (Small p-value(<0.05) => significant)
• Null hypothesis is that βᵢ = 0

The test statistic here would be t = β̂ᵢ / SE(β̂ᵢ),

which measures the distance of the
estimated mean from zero in units of standard
deviation.

11/24/2023 Regression 39
Sample Results

import statsmodels.api as sm
X2 = sm.add_constant(X)   # add an intercept column to the predictor matrix X
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())     # reports coefficients, standard errors, t-statistics and p-values

11/24/2023 Regression 40
Subset Selection Techniques
• Total number of subsets of a set of size J = ?
• Goal: All the variables in the model should have sufficiently low p-
values, and all the variables outside the model should have a large p-
value if added to the model.
• Three possible approaches
• Forward selection
• Backward selection
• Mixed selection

11/24/2023 Regression 41
Subset Selection Techniques
• Forward selection (a sketch follows below):
• Begin with a null (empty) set, S
• Perform J linear regressions, each with exactly one variable
• Add the variable that results in the lowest cross-validation error to the set, S
• Again, perform J−1 linear regressions, each with 2 variables
• Add the variable that results in the lowest cross-validation error to the set, S
• Continue until some stopping criterion is reached, e.g. the CV error is no longer decreasing
• Backward selection begins with all the variables and removes the
variable with the highest p-value at successive steps
• Mixed selection is similar to forward selection, but it may also remove
a variable if it doesn't yield any improvement to the model
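
A rough sketch of forward selection using cross-validated MSE as the selection criterion; the DataFrame ads and its column names are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

candidates = ["TV", "radio", "newspaper"]
selected, best_err = [], np.inf
while candidates:
    errs = {}
    for c in candidates:
        X = ads[selected + [c]].values
        errs[c] = -cross_val_score(LinearRegression(), X, ads["sales"].values,
                                   cv=5, scoring="neg_mean_squared_error").mean()
    best_c = min(errs, key=errs.get)
    if errs[best_c] >= best_err:       # stop when the CV error no longer decreases
        break
    best_err = errs[best_c]
    selected.append(best_c)
    candidates.remove(best_c)
print(selected)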
11/24/2023 Regression 42
Do I need more predictors/change of model?
• When we estimated the variance of ϵ, we assumed that the residuals
were uncorrelated and normally distributed with mean 0 and fixed
variance.
• These assumptions need to be verified using the data. In residual
analysis, we typically create two types of plots:
1. a plot of the residuals eᵢ with respect to xᵢ or ŷᵢ. This allows us to compare the
distribution of the noise at different values of x.
2. a histogram of the residuals eᵢ. This allows us to explore the distribution of the
noise independent of x or ŷ.

11/24/2023 Regression 43
Patterns in Residuals

• Residuals are easier to interpret than the model

• We plot the residuals (eᵢ) against ŷᵢ, so the graph is always 2-D
11/24/2023 Regression 44
Confidence intervals on predictions of y

• Depends on our confidence in β̂
• Different values of β̂ => different values of ŷ
• Given x, examine the distribution of ŷ over the different estimates of β̂, and determine its mean and standard deviation
• For each such β̂, compute the prediction ŷ for x; the spread of these predictions gives the confidence interval
11/24/2023 Regression 45
Potential problems of Linear Models
• Non-linearity
• Can use polynomial linear regression or design better features
• Outliers
• Disturbs the models because of quadratic penalty, Discard outliers carefully
• High-leverage points
• Outliers in the predictor variables
• Collinearity (2 or more predictor variables have high correlation)
• Keep one of them or design a good combined feature
• Correlation of error terms, non-constant variance of error terms
• These give falsely high confidence in the model; we can't trust the CIs on the model parameters
11/24/2023 Regression 46
Polynomial Regression
• The simplest non-linear model we can consider, for a
response Y and a predictor X, is a polynomial model of
degree M:
  y = β₀ + β₁x + β₂x² + … + β_M x^M + ε
• Just as in the case of linear regression with cross terms,
polynomial regression is a special case of linear regression:
we treat each power of x as a separate predictor. Thus, we can write
  ŷ = β̂₀ + β̂₁x + β̂₂x² + … + β̂_M x^M
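
A sketch of polynomial regression as plain linear regression on expanded features; the degree and the 1-D arrays x and y are illustrative assumptions:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

poly_model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),   # builds the x, x^2, x^3 columns
    LinearRegression(),                                 # then fits an ordinary linear model on them
)
poly_model.fit(x.reshape(-1, 1), y)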

11/24/2023 Regression 47
Polynomial Regression

• Which of the above three is the best model?


• Check RMSE
• Check R2
• Remember bias and variance??
11/24/2023 Regression 48
Benefit of Cross-Validation

CV(Model) = (1/K) Σᵢ₌₁ᴷ L(Model trained without fold i, evaluated on fold i)

• Using cross-validation, we validate the models on a portion
of the training data which our learning algorithm has never seen.
• The leave-one-out method is used when the number of sample points is
very small.
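
A sketch of using K-fold cross-validation to compare model choices, here the polynomial degree; the arrays x and y and the range of degrees are assumptions for illustration:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x.reshape(-1, 1), y,
                             cv=5, scoring="neg_mean_squared_error")
    print(degree, np.sqrt(-scores.mean()))   # average cross-validated RMSE for each degree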
11/24/2023 Regression 49
Regularization of Linear Models
• Goal: Reduce over-fitting of the data by reducing degrees of freedom
• For a linear model, regularization is typically achieved by constraining
the weights of the model

L_reg = L + λ R(β),

where λ is a scalar that gives the weight (or importance) of the

regularization term R(β).

• Fitting the model using the modified loss function Lreg would result in
model parameters with desirable properties (specified by R).

11/24/2023 Regression 50
Ridge Regression
• Alternatively, we can choose a regularization term that penalizes the
squares of the parameter magnitudes. Then, our regularized loss function
is:

L_ridge = (1/n) Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ βⱼ²

• Works best when the least-squares estimates have high variance

• As λ increases, flexibility decreases, variance decreases, and bias increases
slightly

• Note that Σⱼ βⱼ² is the squared l2 norm of the vector β
51
Ridge Regression
• We often say that Lridge is the loss function for l2 regularization.

• Finding the model parameters β ridge that minimize the l2 regularized loss
function is called ridge regression.

52
LASSO (least absolute shrinkage and selection operator) Regression
• Ridge regression reduces the parameter values but doesn't force them
to go to zero. LASSO is very effective in doing that.
• It uses the following regularized loss function:

L_LASSO = (1/n) Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ|

• Note that Σⱼ |βⱼ| is the l1 norm of the vector β
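
A hedged scikit-learn sketch comparing ridge and LASSO fits; alpha plays the role of λ above, its values are illustrative, and X_train / y_train are assumed from an earlier split:

from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # l2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # l1 penalty: can drive some coefficients exactly to zero
print(ridge.coef_)
print(lasso.coef_)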

53
LASSO Regression
• Hence, we often say that LLASSO is the loss function for l1 regularization.

• Finding the model parameters β LASSO that minimize the l1 regularized loss function
is called LASSO regression.

54
Choosing λ
• In both ridge and LASSO regression, we see that the larger our choice of the
regularization parameter λ, the more heavily we penalize large values in β
• If λ is close to zero, we recover the MSE, i.e. ridge and LASSO regression are just
ordinary regression.
• If λ is sufficiently large, the MSE term in the regularized loss function will be
insignificant and the regularization term will force β_ridge and β_LASSO to be close
to zero.
• To avoid ad-hoc choices, we should select λ using cross-validation.
• Once the model is trained, we use the unregularized performance measure to
evaluate the model's performance.
55
Elastic Net
• Middle ground between ridge and LASSO regression
• The regularization term is a simple mix of the l1 and l2 penalties, controlled by a mix parameter 'r' (r = 1 gives LASSO, r = 0 gives ridge)
• Elastic Net has better convergence properties than LASSO

from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
11/24/2023 Regression 56
SVM (Support Vector Machines)
• Uses a different approach to regression
• Instead of thinking of the fit as a line, let us think of it as a channel
• Fit as many instances as possible within the channel while limiting the
margin violations (i.e. instances off the channel)
• Width of the channel is the hyper-parameter ‘ε’
• Adding more training instances within the channel doesn’t change
the model parameters
• Hence these models are more robust against over-fitting

11/24/2023 Regression 57
SVM Regression

11/24/2023 Regression 58
ε insensitive Loss function

The ε-insensitive loss is zero for errors inside the band [−ε, ε] and grows linearly outside it:

L_ε(y, ŷ) = max(0, |y − ŷ| − ε)
11/24/2023 Regression 59
SVM Regression

from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon=1, C=2)
11/24/2023 Regression 60
Parameters in SVM regression
• Parameter ε controls the width of the channel and can affect the
number of support vectors used to construct the regression function.
• Adding more training vectors within the channel does not change the model
• Bigger ε => fewer support vectors
• Parameter C determines the trade-off between the model complexity
and the degree to which the deviations larger than ε can be tolerated
• It is interpreted as a traditional regularization parameter that can be
estimated by Cross Validation, for example

11/24/2023 Regression 61
Non-linear data
• SVMs allow for a computationally efficient method of transforming
the dataset to higher dimensions using the kernel trick.
• Common kernels that are used are
• Linear, polynomial, Gaussian RBF, Sigmoid
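
A sketch of kernelized SVM regression with an RBF kernel in scikit-learn; the hyper-parameter values and the reuse of X_train / y_train are illustrative assumptions:

from sklearn.svm import SVR

svr_rbf = SVR(kernel="rbf", C=10, gamma=0.1, epsilon=0.5)   # the kernel trick handles the non-linear mapping
svr_rbf.fit(X_train, y_train)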

11/24/2023 Regression 62
