CS550 Regression
Regression
Prediction problem motivation
• The variable we'd like to predict may be:
• more difficult to measure,
• more important than the other(s), or
• directly or indirectly influenced by the values of the other variable(s).
• Thus, we'd like to define two categories of variables:
• variables whose value we want to predict
• variables whose values we use to make our prediction
Regression Analysis
• Definition: A class of techniques that seeks to make predictions about
unknown continuous target variables given observed input variables.
• Applications:
• Predicting a person’s height given the height of their parents.
• Predicting the amount of time someone will take to pay back a loan
given their credit history.
• Predicting what time a package will arrive given current weather and
traffic conditions.
• Predicting the production of a particular crop given the rainfall
Response and Predictor Variables
• We are observing numerical variables, and we are making sets of observations.
• We call the variable we'd like to predict the outcome or response variable; typically, we denote this variable by Y and the individual measurements by yᵢ.
True vs. Statistical Model
• We will assume that the response variable, Y, relates to the predictors, X, through some unknown function f, expressed generally as:
Y = f(X) + ε
where ε is a random error term.
Machine Learning
Data Science Process: Ask an interesting question -> Data preparation -> Explore the Data -> Model the Data -> Communicate/Visualize the Results
• ML algorithms have the objective of generalization, i.e., use one dataset to generate models that perform well on data that they have not seen.
• Thus, they prove to be effective in generating predictive models.
• There are many ML techniques, and ML models have several parameters.
• How to choose the best one?
Machine Learning Methodology
(Diagram: Train Dataset (Historical) -> random samples -> Training set + Test set -> Apply ML algos -> Validate Results -> Best Model -> applied to Test Data (Real))
• The input dataset is divided into a random split (80/20 or 90/10) to be used for training and testing, respectively.
• For each split, a model is generated and tested for accuracy.
• This is repeated 5 or 10 times, and the average error is computed.
• The best model is selected and used in a real-world scenario.
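A minimal sketch of this split-train-evaluate loop, using scikit-learn and synthetic data (the dataset, model, and repeat count below are illustrative assumptions, not part of the original slide):

# Repeat random 80/20 splits; fit on the training set, measure error on the test set,
# and report the average test MSE used to compare candidate models.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def average_test_error(model, X, y, n_repeats=10, test_size=0.2):
    errors = []
    for seed in range(n_repeats):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model.fit(X_train, y_train)
        errors.append(mean_squared_error(y_test, model.predict(X_test)))
    return np.mean(errors)

# Illustrative synthetic data standing in for the input dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)
print(average_test_error(LinearRegression(), X, y))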
Flexibility vs. Interpretability Tradeoff
• There are many methods of
regression (that estimate f)
• Some are less flexible but
more interpretable
• These are useful for
inference problems where
we want to study the
relationships between
predictor variables
• But highly flexible methods
can also lead to over-fitting!
Error Evaluation
In order to quantify how well a model performs, we define a loss or error function.
A common loss function for quantitative outcomes is the Mean Squared Error (MSE):
MSE = (1/n) Σᵢ (yᵢ - ŷᵢ)²
The quantity yᵢ - ŷᵢ is called a residual and measures the error at the i-th prediction.
The square root of the MSE is the RMSE:
RMSE = sqrt( (1/n) Σᵢ (yᵢ - ŷᵢ)² )
R-squared Error
R² = 1 - Σᵢ (yᵢ - ŷᵢ)² / Σᵢ (yᵢ - ȳ)²
R² measures the proportion of the variance in the response that is explained by the model.
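An illustrative computation of these metrics with scikit-learn; the y_true and y_pred arrays below are placeholder values:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # observed responses (illustrative)
y_pred = np.array([2.8, 5.4, 7.0, 9.5])    # model predictions (illustrative)

mse = mean_squared_error(y_true, y_pred)   # average squared residual
rmse = np.sqrt(mse)                        # same units as the response
r2 = r2_score(y_true, y_pred)              # 1 - RSS/TSS
print(mse, rmse, r2)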
Bias Variance Tradeoff
The Advertising data set consists of the sales of a particular product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. Everything is given in units of $1000.
Some of the figures in this presentation are taken from the ISL book: "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Response vs. Predictor Variables
• X (the p predictors): also called predictors, features, or covariates
• Y: also called the outcome, response variable, or dependent variable
k-Nearest Neighbors
The k-Nearest Neighbor (kNN) model is an intuitive way to predict a
quantitative response variable:
k-Nearest Neighbors
For a fixed value of k, the predicted response for the i-th observation is the average of the observed responses of the k closest observations:
ŷᵢ = (1/k) Σ_{j ∈ Nᵢ} yⱼ, where Nᵢ is the set of the k observations closest to xᵢ
Python: sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
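A minimal usage sketch of the scikit-learn class named above, fit on synthetic data for illustration:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 1))                   # one illustrative predictor
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=50)

knn = KNeighborsRegressor(n_neighbors=3)   # k = 3 nearest neighbors
knn.fit(X, y)
print(knn.predict([[2.5]]))                # average of the 3 closest observed y values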
4-Nearest Neighbors
Linear Models
• In the kNN approach, we didn’t assume a form of the function ‘f’
• Such approaches are called non-parametric approaches
• In the linear regression approach, we assume that the response is a
linear function of the predictor variables
• Note that this technique can be easily extended by creating extra
predictor variables (features) from a combination (transformation) of
the original predictor variables.
• So let's assume: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε
Linear Regression
Estimate of the regression coefficients
Estimate of the regression coefficients (cont)
Which of the above three lines fits the data points best?
a. The one that goes through the maximum number of points
b. The one with the least slope
c. The one from which no point is too far, i.e., it is approximately in the middle of all the points
Estimate of the regression coefficients (cont)
To compute the best fit, we first calculate the residuals eᵢ = yᵢ - ŷᵢ and then choose the coefficients that minimize the sum of squared residuals.
Python package
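The original slide body is not preserved here; a typical scikit-learn linear-regression fit (on illustrative synthetic data) might look like:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 4.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.2, size=100)

reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_)   # estimates of beta_0 and (beta_1, beta_2)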
So how do Linear Regression solvers work?
• Matrix Methods
• Exact methods that solve the set of linear equations
• Involve computation of matrix inverse or pseudoinverse (more efficient)
• Gradient Descent
• A generic method of solving optimization problems
• Begin with a random point and reach the optimal solution through a
sequence of improvements
• Faster improvements can be achieved with stochastic methods
Matrix Algebra for n-dimensions
• Loss (L) = MSE(β) = (1/n) ||Xβ - y||²
• MSE(β) = (1/n) (Xβ - y)ᵀ(Xβ - y), where X is an n x (p+1) matrix with each row an input vector (including a '1' for the intercept) and y is the n-dimensional vector of the outputs in the training set
• To minimize, we differentiate with respect to β and we get
• (2/n) Xᵀ(Xβ - y) = 0
• If XᵀX is non-singular, meaning its inverse exists, then
• β = (XᵀX)⁻¹ Xᵀ y
Matrix Algebra for n-dimensions
• Computational complexity of computing the matrix inverse: O(p^2.4) to O(p^3), depending on the implementation.
• Scikit-learn's LinearRegression class uses an SVD approach, roughly O(p^2)
• SVD stands for singular value decomposition
• Uses the pseudoinverse approach (Moore-Penrose): numpy.linalg.pinv()
• β = X⁺y, where X⁺ is the pseudoinverse of X
• Both have linear complexity in the number of instances, n, but are at least quadratic in p
• So we need to look at alternative techniques if p is very large, e.g., 100,000
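A sketch comparing the matrix solutions above on synthetic data: the normal equation, the Moore-Penrose pseudoinverse, and scikit-learn's solver (the data and coefficients are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

Xb = np.c_[np.ones(len(X)), X]                     # prepend the intercept column of 1s
beta_normal = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y  # (X^T X)^-1 X^T y
beta_pinv = np.linalg.pinv(Xb) @ y                 # Moore-Penrose pseudoinverse (SVD-based)
beta_sklearn = LinearRegression().fit(X, y)

print(beta_normal)
print(beta_pinv)
print(beta_sklearn.intercept_, beta_sklearn.coef_)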
Gradient Descent Approach
• Start from a random point, i.e., generate a random β
1. Determine which direction will reduce the MSE
2. Compute the slope of the function (its derivative) at this point and go in the reverse direction: β^(i+1) = β^(i) - λ (dL/dβ)
3. λ is the learning rate parameter
4. Go to #1 until convergence, i.e., MSE is minimized
• For linear regression, MSE is a convex function:
• there is no local minimum, just a global minimum
• it is continuous, with a slope that never changes abruptly
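A minimal NumPy sketch of batch gradient descent for linear regression, following the update rule above; the data, learning rate, and iteration count are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 1))
y = 2.0 + 5.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

Xb = np.c_[np.ones(len(X)), X]     # include the intercept column
beta = rng.normal(size=2)          # start from a random beta
lr = 0.1                           # learning rate (lambda)

for _ in range(1000):
    gradient = (2 / len(Xb)) * Xb.T @ (Xb @ beta - y)   # dMSE/dbeta
    beta = beta - lr * gradient

print(beta)   # should approach the true coefficients (2.0, 5.0)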
Stochastic Approaches
• Batch GD update equation
• Uses the whole batch of training data at each gradient step
• Stochastic GD
• Picks only a random instance of training data to update gradients
• Causes irregular descent, but better chance of finding global minimum
• Simulated annealing: Reduce the learning rate gradually to reduce
irregularity
• Mini-batch GD
• Small set of random instances of training data are used
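Stochastic gradient descent for linear models is available in scikit-learn as SGDRegressor; this is only an illustrative, untuned configuration on synthetic data:

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# One random instance per gradient step; the learning rate decays over iterations
sgd = SGDRegressor(max_iter=1000, learning_rate="invscaling", eta0=0.01)
sgd.fit(X, y)
print(sgd.intercept_, sgd.coef_)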
Parametric or Non-Parametric?
• Assumption on function f: Linear Regression (parametric) assumes a linear function; k-NN (non-parametric) can work even if the function is non-linear, but it has to be locally constant.
• High dimensions: Linear Regression has complexity problems, which can be overcome by efficient algorithms; for k-NN it is difficult to find nearby neighbors, which can cause errors.
• Bias: low for Linear Regression; for k-NN, small k => low bias, large k => high bias.
• Variance: depends on the problem for Linear Regression; for k-NN, small k => high variance, large k => low variance.
• Computations: Linear Regression computes once, during the model fitting phase, after which predictions are quick; k-NN looks at all the training points every time a prediction has to be made.
Lecture Objectives
• Understanding the outputs of a Linear Model
• Limitations of Linear Models and their extensions
• How to reduce over-fitting/variance via regularization
• Support Vector Machines (SVM)
Brief review of the linear model
Possible Questions
• How accurately do we know our model parameters?
• Is at least one predictor variable useful in the prediction?
• We have to examine the p-values
• Which subset of the predictor variables are important?
• There are several techniques of predictor variable/feature selection
• What would be the accuracy of predictions on unseen data?
• We can generate confidence intervals on our estimates
• Cross-validation gives us an estimate.
• Do I need more predictor variables/features?
• Look at patterns in the residual errors
Confidence intervals for predictor estimators
• What causes errors in the estimation of β?
Significance of predictor variables
• As we saw, there are inherent uncertainties in estimation of β
• We evaluate the importance of predictors using hypothesis testing, using
the t-statistics and p-values (Small p-value(<0.05) => significant)
• Null hypothesis is that βi=0
Sample Results
import statsmodels.api as sm

# X is the matrix of predictors and y the response from the training data
X2 = sm.add_constant(X)   # add the intercept column expected by OLS
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())     # reports t-statistics and p-values per coefficient
Subset Selection Techniques
• Total number of subsets of a set of size J = ?
• Goal: All the variables in the model should have sufficiently low p-
values, and all the variables outside the model should have a large p-
value if added to the model.
• Three possible approaches
• Forward selection
• Backward selection
• Mixed selection
Subset Selection Techniques
• Forward selection:
• Begin with a null set, S
• Perform J linear regressions, each with exactly one variable
• Add the variable that results in lowest Cross-validation error to the set, S
• Again, perform J-1 linear regressions with 2 variables
• Add the variable that results in lowest Cross-validation error to the set, S
• Continue until some stopping criterion is reached, e.g., the CV error stops decreasing
• Backward selection begins with all the variables and removes the
variable with highest p-value at successive steps
• Mixed selection is similar to Forward Selection, but it may also remove
a variable if it doesn’t yield any improvement to the model
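One way to run these greedy procedures in practice is scikit-learn's SequentialFeatureSelector, which adds (or removes) the predictor that most improves the cross-validated score; the data and number of features to select below are illustrative:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected predictors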
Do I need more predictors/change of model?
• When we estimated the variance of ϵ, we assumed that the residuals
were uncorrelated and normally distributed with mean 0 and fixed
variance.
• These assumptions need to be verified using the data. In residual
analysis, we typically create two types of plots:
1. a plot of the residuals with respect to the predictor values or the fitted values. This allows us to compare the distribution of the noise at different values of the predictors.
2. a histogram of the residuals. This allows us to explore the distribution of the noise independent of the predictor or fitted values.
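A sketch of the two residual-analysis plots described above, using matplotlib; the observed and fitted values are synthetic placeholders:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
y_hat = rng.uniform(0, 10, size=200)                 # stand-in fitted values
y = y_hat + rng.normal(scale=0.5, size=200)          # stand-in observed values
residuals = y - y_hat

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(y_hat, residuals, s=10)   # residuals vs. fitted values
ax1.axhline(0, color="red")
ax1.set_xlabel("fitted values")
ax1.set_ylabel("residuals")
ax2.hist(residuals, bins=20)          # distribution of the residuals
ax2.set_xlabel("residuals")
plt.show()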
Patterns in Residuals
• Depends on the confidence in β
• Different β => different values of ŷ
• Given x, examine the distribution of ŷ and determine its mean and standard deviation.
• For each of these, the prediction for a given x can be computed.
Potential problems of Linear Models
• Non-linearity
• Can use polynomial linear regression or design better features
• Outliers
• Disturb the model because of the quadratic penalty; discard outliers carefully
• High-leverage points
• Outliers in the predictor variables
• Collinearity (2 or more predictor variables have high correlation)
• Keep one of them or design a good combined feature
• Correlation of error terms, Non-constant variance of error terms
• Gives falsely high confidence in the model; we can't trust the confidence intervals on the model parameters
Polynomial Regression
• The simplest non-linear model we can consider, for a response Y and a predictor X, is a polynomial model of degree M:
y = β₀ + β₁x + β₂x² + ... + β_M x^M
• Just as in the case of linear regression with cross terms, polynomial regression is a special case of linear regression: we treat each power of x (x, x², ..., x^M) as a separate predictor. Thus, we can write
y = β₀ + β₁x₁ + β₂x₂ + ... + β_M x_M, where xⱼ = xʲ
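A sketch of polynomial regression as linear regression on expanded features, using a scikit-learn pipeline on synthetic data (the degree and data are illustrative):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.2, size=100)

# PolynomialFeatures creates x, x^2, x^3 as separate predictors for the linear model
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))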
Polynomial Regression
• K-fold cross-validation error of a candidate model:
CV(Model) = (1/K) Σᵢ₌₁ᴷ L(model fit without fold i, evaluated on fold i)
• Fitting the model using a modified loss function L_reg (the original loss plus a regularization term R) would result in model parameters with desirable properties (specified by R).
Ridge Regression
• Alternatively, we can choose a regularization term that penalizes the squares of the parameter magnitudes. Then, our regularized loss function is:
L_ridge(β) = MSE(β) + λ Σⱼ₌₁ᵖ βⱼ²
Ridge Regression
• We often say that Lridge is the loss function for l2 regularization.
• Finding the model parameters β ridge that minimize the l2 regularized loss
function is called ridge regression.
LASSO (least absolute shrinkage and selection operator) Regression
• Ridge regression reduces the parameter values but doesn’t force them
to go to zero. LASSO is very effective in doing that.
• It uses the following regularized loss function:
L_LASSO(β) = MSE(β) + λ Σⱼ₌₁ᵖ |βⱼ|
LASSO Regression
• Hence, we often say that LLASSO is the loss function for l1 regularization.
• Finding the model parameters β LASSO that minimize the l1 regularized loss function
is called LASSO regression.
Choosing λ
• In both ridge and LASSO regression, we see that the larger our choice of the regularization parameter λ, the more heavily we penalize large values in β.
• If λ is close to zero, we recover the MSE, i.e., ridge and LASSO regression are just ordinary regression.
• If λ is sufficiently large, the MSE term in the regularized loss function will be insignificant and the regularization term will force β_ridge and β_LASSO to be close to zero.
• To avoid ad-hoc choices, we should select λ using cross-validation.
• Once the model is trained, we use the unregularized performance measure to
evaluate the model’s performance.
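A sketch of selecting λ by cross-validation with scikit-learn, where alpha is the library's name for λ; the alpha grid and data are illustrative:

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)        # chooses alpha along its own path
print(ridge.alpha_, lasso.alpha_)      # selected regularization strengths
print(lasso.coef_)                     # many coefficients are driven to exactly zero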
Elastic Net
• Middle ground between Ridge and Lasso regression
• Regularization term is a simple mix with parameter ‘r’
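In scikit-learn's ElasticNet, the l1_ratio argument plays the role of the mix parameter 'r'; a minimal illustrative fit on synthetic data:

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# l1_ratio=1.0 behaves like Lasso, l1_ratio=0.0 like Ridge; 0.5 is an even mix
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)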
SVM Regression
ε-insensitive Loss Function
(Figure: the loss is zero for errors between -ε and +ε, and grows linearly outside this tube.)
SVM Regression
Non-linear data
• SVMs allow for a computationally efficient method of transforming the dataset to higher dimensions using the kernel trick.
• Commonly used kernels are:
• linear, polynomial, Gaussian RBF, sigmoid
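A minimal SVM-regression sketch with an RBF kernel; epsilon sets the width of the insensitive tube, and the data and hyperparameters are illustrative:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(11)
X = rng.uniform(0, 5, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)   # kernel trick handles the non-linearity
svr.fit(X, y)
print(svr.predict([[2.0]]))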