Lecture 3

This document provides an overview of linear regression and gradient descent algorithms for machine learning. It discusses supervised machine learning approaches for predicting a target variable based on other features in the data. Linear regression finds the best-fitting linear relationship between variables by minimizing the sum of squared errors. Gradient descent is an optimization algorithm that can be used when an analytical solution is not possible, by iteratively taking steps in the direction of steepest descent on the loss function. Python code examples are provided to demonstrate linear regression and gradient descent.

Uploaded by

Ruben Kempter
Copyright
© © All Rights Reserved

Advanced Data Analytics

Lecture 3

Simon Scheidegger
Today’s Roadmap
1. Supervised Machine Learning – the general idea

2. Linear Regression (1 Variable)


Video 1
3. Again – Gradient descent

4. Linear Regression (multiple variables)

5. A Probabilistic Interpretation of Linear Regression

6. Polynomial Regression Video 2


7. Tuning Model Complexity

The next two notebooks are supposed to be self-study (no pre-recordings)!

8. An Introduction to Pandas (with a Jupyter Notebook)


Lecture_3b_pandas/pandas/Pandas_intro.ipynb

9. Stock Market Prediction (with “live data”) with Linear regression


see Stock_prediction_ML_Lecture3.ipynb
1. Supervised Machine Learning
Supervised methods assume that training data is available from which they can
learn to predict a target feature based on other features (e.g., fuel consumption
of a car as a function of car power [horse power]).

→ Given data like this, how can we learn to predict the fuel consumption of other
cars?

*data: https://archive.ics.uci.edu/ml/datasets/Auto+MPG
Supervised Machine Learning (II)
- x(i) : “input” variables (Horse Power in this example), also called input features.

- y(i) : “output” or target variable that we are trying to predict (Fuel consumption).

- A pair (x(i) , y(i)) is called a training example.

- The dataset that we’ll be using to learn—a list of m training examples

{(x(i), y (i) ); i = 1, . . . , m}

is called a training set.

- Our goal is, given a training set, to learn a function h : X → Y
so that h(x) is a “good” predictor for the corresponding value of y.

- For historical reasons, this function h is called a hypothesis.


Classification versus Regression

- When the target variable that we’re trying to predict is continuous, such
as in our car example, we call the learning problem a regression problem.

- When y can take on only a small number of discrete values (such as if, given
the fuel consumption, we wanted to predict whether a car is an SUV or a small
city car), we call it a classification problem.
2. Linear Regression
Model hypothesis and parameters

How can we predict the value of a numerical feature y based on the value of
another numerical feature x?

First, we need to make some assumption about their relationship, i.e., how x
influences y.

ŷ = w0 + w1 x


Our model thus assumes that there is a linear relationship between the two
features x and y.

→ How can we determine the coefficients w0 and w1?


Assume a Linear Model
Data: https://archive.ics.uci.edu/ml/datasets/Auto+MPG

ŷ = w0 + w1 x

- x: independent feature / variable
- ŷ: dependent feature / variable, as predicted by the model
- w0, w1: coefficients / parameters of the model

→ Different values of w0 and w1 correspond to different
lines in our plot.

Which ones are best?

Based on the linear equation that
we defined previously, linear
regression can be understood as
finding the best-fitting straight line
through the sample points, as
shown in the following figure.
Digression – The UCI ML Data Set
https://archive.ics.uci.edu/ml/datasets.html
Let’s look at an Example Data Set
Example by https://archive.ics.uci.edu/ml/datasets/Auto+MPG and K. Beberich (2017)

→ We determine the values of the coefficients w0 and w1 based on the training data
that is available to us.

Our training data consists of n data points (xi, yi). In our example, these are
pairs of power (in hp) and fuel consumption (in mpg) of individual cars.

Attribute Information (demo/auto-mpg.names.txt):
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

Sample rows (demo/auto-mpg.data.txt):
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
15.0 8 429.0 198.0 4341. 10.0 70 1 "ford galaxie 500"
Recall from Last week:
Mean, Variance, Standard Deviation

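We define the mean of the features x and y in our training data as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

The variance of the features x and y in our training data is defined as (using the 1/n convention)

$$\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$$

→ The values σx and σy are referred to as the standard deviations of the features x
and y.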
Recall from Last week: Covariance
→ The covariance covx,y measures the degree of joint variability between the
two features x and y (using the 1/n convention):

$$\mathrm{cov}_{x,y} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

A covariance of large magnitude suggests that the two features vary jointly:

- a positive value indicates that they tend to deviate
from their respective means in the same direction

- a negative value indicates that they tend to deviate
from their respective means in opposite directions.
Correlation

The correlation coefficient (also: Pearson’s r) is a normalized measure of
linear correlation between the two features x and y:

$$r_{x,y} = \frac{\mathrm{cov}_{x,y}}{\sigma_x \, \sigma_y}$$

The correlation coefficient takes values in [-1, +1]:

- a value of -1 indicates a perfect negative linear correlation

- a value of 0 indicates that there is no linear correlation

- a value of +1 indicates a perfect positive linear correlation
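
As a small illustration of covariance and correlation, the sketch below computes both for the horsepower and mpg values of the six sample cars listed earlier. It assumes the 1/n normalization; the function names `covariance` and `pearson_r` are chosen here for illustration and do not come from the lecture's demo files.

```python
import numpy as np

# Horsepower and mpg of the six sample cars listed earlier in these notes.
hp = np.array([130.0, 165.0, 150.0, 150.0, 140.0, 198.0])
mpg = np.array([18.0, 15.0, 18.0, 16.0, 17.0, 15.0])

def covariance(x, y):
    """Mean product of the deviations from the means (1/n convention, an assumption)."""
    return np.mean((x - x.mean()) * (y - y.mean()))

def pearson_r(x, y):
    """Covariance normalized by both standard deviations."""
    return covariance(x, y) / (x.std() * y.std())  # np.std uses 1/n by default

r = pearson_r(hp, mpg)
print(f"cov = {covariance(hp, mpg):.3f}, r = {r:.3f}")
```

The sign of r is negative here: among these six cars, higher power goes along with fewer miles per gallon.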
Example Correlations
demo/predict_prices.py
Basic Analytics – Scatter Plot
demo/predict_prices.py
Our Data Set
demo/predict_prices.py
Anscombe’s Quartet
https://en.wikipedia.org/wiki/Wikipedia:Featured_picture_candidates/Anscombe%27s_quartet

→ All four datasets have (nearly) the same mean, variance, correlation coefficient, and
optimal regression line, although they look very different when plotted.
Loss function / Cost function


A loss/cost function measures how well our model, for a specific choice of
coefficients w0 and w1, describes the training data

“how much we lose by using the model”

→ The residual for data point (xi, yi) measures how much the observed value yi
differs from the prediction of our model:

$$e_i = y_i - \hat{y}_i = y_i - (w_0 + w_1 x_i)$$
Loss Function /Cost Function

Ordinary least squares (OLS) uses the sum of squared residuals (also:
sum of squared errors) as a loss function:

$$L(w_0, w_1) = \sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2$$

→ Since we’re interested in finding the coefficients w0 and w1
that minimize the loss, we obtain the optimization problem

$$\underset{w_0,\, w_1}{\arg\min} \; L(w_0, w_1)$$
Minimize the Cost Function
Optimal values for the coefficients w0 and w1 can be determined
analytically in the case of OLS:

→ compute the partial derivatives of the loss function w.r.t. w0 and w1

→ identify their common zero by solving the resulting system of equations

The optimal coefficients

→ We obtain the following closed-form solutions to compute optimal values of
the coefficients based on our data:

$$w_1 = \frac{\mathrm{cov}_{x,y}}{\sigma_x^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
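
A minimal sketch of the closed-form OLS solution, applied to the six sample cars listed earlier (horsepower as x, mpg as y). The function name `fit_simple_ols` is an illustrative choice, not the name used in demo/predict_prices.py.

```python
import numpy as np

# Horsepower and mpg of the six sample cars listed earlier in these notes.
hp = np.array([130.0, 165.0, 150.0, 150.0, 140.0, 198.0])
mpg = np.array([18.0, 15.0, 18.0, 16.0, 17.0, 15.0])

def fit_simple_ols(x, y):
    """Closed-form OLS: w1 = cov(x, y) / var(x), w0 = mean(y) - w1 * mean(x)."""
    x_mean, y_mean = x.mean(), y.mean()
    # The 1/n factors of covariance and variance cancel in the ratio.
    w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    w0 = y_mean - w1 * x_mean
    return w0, w1

w0, w1 = fit_simple_ols(hp, mpg)
print(f"w0 = {w0:.4f}, w1 = {w1:.4f}")
```

With only six cars the coefficients differ from the full-dataset values reported in the lecture; the point is the formula, not the numbers.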
R²: Coefficient of Determination


The R² coefficient of determination (short: ”R squared”) measures how
well the determined regression line approximates the data.


Put differently: how much of the variation observed in the data is
explained by it.
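
The sketch below assumes the usual definition R² = 1 − SS_res / SS_tot (residual sum of squares over total sum of squares); the slides do not spell out the formula, so treat this as one common convention. The helper `r_squared` is a hypothetical name.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot (assumed definition)."""
    ss_res = np.sum((y - y_hat) ** 2)        # variation left unexplained by the model
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Horsepower and mpg of the six sample cars listed earlier in these notes.
hp = np.array([130.0, 165.0, 150.0, 150.0, 140.0, 198.0])
mpg = np.array([18.0, 15.0, 18.0, 16.0, 17.0, 15.0])

w1, w0 = np.polyfit(hp, mpg, deg=1)  # least-squares line; highest degree comes first
r2 = r_squared(mpg, w0 + w1 * hp)
print(f"R^2 = {r2:.3f}")
```

For a straight-line fit with an intercept, this value equals the squared correlation coefficient of x and y.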
Linear Regression in Python
demo/predict_prices.py
Regression Plot

Parameters:
w0: 39.974309

w1: -0.158120

R2: 0.606089
3. Recall – Gradient Descent

Unfortunately, it is not always possible to determine optimal coefficients
for a loss function analytically.

Gradient descent is an optimization algorithm that we can use to
determine the minimum of a loss function.

Basic Idea:
- start with a random choice of the coefficients (here: w0 and w1)
- repeat for a specified number of rounds or until convergence:
  - compute the gradient for this choice of coefficients
  - update the coefficients based on the gradient
Gradient Descent (II)

Intuition: Think of the loss function as a surface on which you want to
find the lowest point


start your journey at a random position

repeat the following

identify direction with steepest descent

walk a few steps in the identified direction

Fig: https://en.wikipedia.org/wiki/Gradient_descent
Gradient Descent in Python
demo/gradient_descent.py

The gradient descent algorithm is applied to find a local minimum of the
function f(x) = x⁴ − 3x³ + 2, with derivative f′(x) = 4x³ − 9x².

cur_x = 6  # the algorithm starts at x = 6
gamma = 0.01  # step size multiplier
precision = 0.00001  # stop once a step is smaller than this
previous_step_size = 1
max_iters = 10000  # maximum number of iterations
iters = 0  # iteration counter

df = lambda x: 4 * x**3 - 9 * x**2  # derivative of f

while previous_step_size > precision and iters < max_iters:
    prev_x = cur_x
    cur_x -= gamma * df(prev_x)  # step against the gradient
    previous_step_size = abs(cur_x - prev_x)
    iters += 1

print("The local minimum occurs at", cur_x)
# Output: The local minimum occurs at 2.2499646074278457
The LMS algorithm (least mean square)
The gradient of a multivariate function (e.g., the loss function J(θ)) is
defined as the vector of its partial derivatives; when evaluated at a
specific point, it indicates the direction of steepest ascent.

For linear regression, the loss function is

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Specifically, let’s consider the gradient descent algorithm, which starts
with some initial θ, and repeatedly performs the update:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

This update is simultaneously performed for all values of j = 1, . . . , n.

Here, α is called the learning rate (0 < α ≤ 1). This is a very natural
algorithm that repeatedly takes a step in the direction of steepest
decrease of J.
The LMS algorithm (II)

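In order to implement this algorithm, we have to work out the
partial derivative term on the right-hand side. Let’s first work it out for the
case where we have only one training example (x, y), so that we can
neglect the sum in the definition of J. We have:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2}\left(h_\theta(x) - y\right)^2 = \left(h_\theta(x) - y\right) \cdot \frac{\partial}{\partial \theta_j}\left(\sum_{k=0}^{n} \theta_k x_k - y\right) = \left(h_\theta(x) - y\right) x_j$$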
The LMS algorithm (III)

For a single training example, this gives the update rule:

$$\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$

This rule has several properties that seem natural and intuitive.


For instance, the magnitude of the update is proportional to the error
term (y(i) − hθ (x(i) )).


Thus, for instance, if we are encountering a training example on which
our prediction nearly matches the actual value of y (i) , then we find that
there is little need to change the parameters.

In contrast, a larger change to the parameters will be made if our
prediction hθ (x(i) ) has a large error (i.e., if it is very far from y(i)).
The LMS Algorithm (IV)


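For a training set with more than one example, the update sums over all m
examples and is repeated until convergence:

$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)} \quad \text{(for every } j\text{)}$$

This method looks at every example in the entire training set on every
step, and is called batch gradient descent.

Note that, while gradient descent can be susceptible to local minima in
general, the optimization problem we have posed here for linear
regression has only one global optimum and no other local optima.

Thus, because J is a convex quadratic function, gradient descent always
converges to the global minimum, provided the learning rate α is not too large.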
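
A minimal sketch of batch gradient descent for linear regression. The data is synthetic and noise-free (points on the line y = 1 + 2x), and the learning rate and iteration count are illustrative assumptions chosen so that the iteration is stable on this data; none of these values come from the lecture notebook.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=5000):
    """Batch LMS: theta_j += alpha * sum_i (y_i - h_theta(x_i)) * x_ij."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        errors = y - X @ theta           # y(i) - h_theta(x(i)) for every example
        theta += alpha * (X.T @ errors)  # sum over all examples, all j at once
    return theta

# Synthetic, noise-free data on the line y = 1 + 2x (an illustrative assumption).
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])  # first column is the intercept term
y = 1.0 + 2.0 * x

theta = batch_gradient_descent(X, y)
print(theta)  # close to [1, 2]
```

If alpha is chosen too large for the data at hand, the same loop diverges instead of converging, which is why the convergence statement depends on the learning rate.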
Stochastic Gradient Descent
(relevant e.g. for Neural Nets)


Gradient ascent is the counterpart of gradient descent for maximizing functions.

Stochastic gradient descent (SGD) does not compute the true gradient,
but approximates the gradient based on a single or a few randomly
chosen data points in each round.

The gradient can also be approximated when the function at hand is
non-differentiable or when its partial derivatives are expensive to compute.
Stochastic Gradient Descent (II)

In this algorithm, we repeatedly run through the training set, and each time we encounter
a training example, we update the parameters according to the gradient of the error with
respect to that single training example only.

Whereas batch gradient descent has to scan through the entire training set before taking
a single step—a costly operation if m is large—stochastic gradient descent can start
making progress right away, and continues to make progress with each example it looks
at.

Often, stochastic gradient descent gets θ “close” to the minimum much faster than batch
gradient descent. (Note however that it may never “converge” to the minimum, and the
parameters θ will keep oscillating around the minimum of J(θ); but in practice most of the
values near the minimum will be reasonably good approximations to the true minimum)
→ particularly when the training set is large, stochastic gradient descent is often
preferred over batch gradient descent.
Gradient Descent in parameter space
Loss Optimization
1. Randomly pick an initial (w0, w1).
2. Compute the gradient, which points in the direction of steepest ascent.
3. Take a small step in the opposite direction of the gradient.
Mini Batch Gradient Descent

In actual practice we use an approach called Mini batch
gradient descent.

This approach uses random samples but in batches.

What this means is that we do not calculate the gradients for
each observation but for a group of observations which
results in a faster optimization.

A simple way to implement is to shuffle the observations and
then create batches and then proceed with gradient descent
using batches.
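
The shuffle-then-batch recipe can be sketched as follows. The data is synthetic and noise-free, the gradient is averaged over each batch (a rescaling of the learning rate), and the values of `batch_size`, `alpha` and `n_epochs` are illustrative assumptions. Setting `batch_size=1` gives stochastic gradient descent with one example per update.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size=8, alpha=0.5, n_epochs=300, seed=0):
    """Linear regression by mini-batch gradient descent on the squared error."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(n_epochs):
        order = rng.permutation(m)              # shuffle the observations
        for start in range(0, m, batch_size):   # then walk through the batches
            idx = order[start:start + batch_size]
            errors = y[idx] - X[idx] @ theta
            # Gradient step averaged over the observations in this batch.
            theta += alpha * (X[idx].T @ errors) / len(idx)
    return theta

# Synthetic, noise-free data on the line y = 1 + 2x (an illustrative assumption).
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta_minibatch = minibatch_gradient_descent(X, y, batch_size=8)
theta_sgd = minibatch_gradient_descent(X, y, batch_size=1)
print(theta_minibatch, theta_sgd)
```

On noisy data the iterates would keep fluctuating around the minimum instead of settling exactly, as described for stochastic gradient descent above.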
Let’s run the Jupyter Notebook

demo/GradientDescent_StochasticGradientDescent.ipynb

X: Chirps/Second Y: Temperature (º F)
4. Regression with multiple variables

How can we predict the value of a numerical feature y based on
multiple other numerical features x1, ..., xm?


We assume that there is a linear relationship between the target
feature and the other features

ŷ = w0 + w1 x1 + . . . + wm xm


Given that we now deal with multiple data points and multiple
features, it is easier to formulate our optimization problem using
matrices and vectors.
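Minimize the Cost Function
Given a training set of m examples (here m counts training examples, not features),
define the design matrix X to contain the training examples’ input values in its rows:

$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$

Also, let $\vec{y}$ be the m-dimensional vector containing all the target values from
the training set:

$$\vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

Now, since hθ (x(i) ) = (x(i) )T θ (where θ is the vector of coefficients), we can
easily verify that

$$X\theta - \vec{y} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}$$
Minimize the Cost Function (II)
Thus, using the fact that for a vector z, we have that $z^T z = \sum_i z_i^2$:

$$\frac{1}{2}(X\theta - \vec{y})^T (X\theta - \vec{y}) = \frac{1}{2}\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = J(\theta)$$
Minimize the Cost Function (III)

To minimize J, we set its derivatives to zero, and obtain the normal equations:

$$X^T X \theta = X^T \vec{y}$$

Thus, the value of θ that minimizes J(θ) is given in closed form by the equation

$$\theta = (X^T X)^{-1} X^T \vec{y}$$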
Multivariate Linear Regression in Python
https://archive.ics.uci.edu/ml/datasets/Auto+MPG

→Let’s apply this to our car dataset and try to predict fuel consumption
based on horsepower and weight.

demo/multi_par_reg.py
Multivariate Linear Regression in Python

Here, the last line computes the mean squared error (MSE) as another
widely used measure for assessing the prediction quality of a regression
model.
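
Since the listing of demo/multi_par_reg.py is not reproduced here, the following is only a sketch of the idea under stated assumptions: it solves the normal equations for the six sample cars listed earlier, with horsepower and weight as features, and computes the MSE in the last line. Rescaling the features (hp/100, weight/1000) is a choice made here for numerical conditioning, and `normal_equations` is a hypothetical helper name.

```python
import numpy as np

# Horsepower, weight and mpg of the six sample cars listed earlier in these notes.
hp = np.array([130.0, 165.0, 150.0, 150.0, 140.0, 198.0])
weight = np.array([3504.0, 3693.0, 3436.0, 3433.0, 3449.0, 4341.0])
mpg = np.array([18.0, 15.0, 18.0, 16.0, 17.0, 15.0])

# Design matrix: intercept column plus rescaled features (rescaling is an assumption).
X = np.column_stack([np.ones_like(hp), hp / 100.0, weight / 1000.0])

def normal_equations(X, y):
    """Solve X^T X theta = X^T y; requires X to have full column rank."""
    return np.linalg.solve(X.T @ X, X.T @ y)

theta = normal_equations(X, mpg)
mse = np.mean((X @ theta - mpg) ** 2)  # mean squared error on the training rows
print(theta, mse)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of XᵀX explicitly, which is usually preferable numerically.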
Inspect the regression – Hyperplane
Handling non-numerical Features


How can we make non-numerical (i.e., nominal and ordinal)
features accessible to regression analysis?

Nominal features (e.g., origin of a car) can be converted using one-
hot encoding: for each value of the original feature, a binary feature is
introduced, indicating whether a data point has the corresponding value
for the feature.
Handling non-numerical Features

Ordinal features (e.g., energy efficiency class) can be mapped to
integer values preserving their order.


Note that this mapping implicitly assumes that the differences between
adjacent values are uniform, i.e., have the same magnitude.
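
Both encodings can be sketched in a few lines. The origin values and the efficiency classes "A" to "C" below are hypothetical examples, and `one_hot` is an illustrative helper, not code from demo/multi_par_reg_add_features.py.

```python
import numpy as np

def one_hot(values):
    """Return the sorted categories and one binary column per category."""
    categories = sorted(set(values))
    matrix = np.array([[1.0 if v == c else 0.0 for c in categories] for v in values])
    return categories, matrix

# Nominal feature: hypothetical origin values for four cars.
origin = ["USA", "Japan", "Europe", "USA"]
categories, encoded = one_hot(origin)
print(categories)  # ['Europe', 'Japan', 'USA']
print(encoded)     # exactly one 1.0 per row

# Ordinal feature: hypothetical efficiency classes mapped to order-preserving integers.
efficiency_rank = {"C": 0, "B": 1, "A": 2}
ordinal = [efficiency_rank[c] for c in ["A", "C", "B"]]
print(ordinal)     # [2, 0, 1]
```

When an intercept term is present, one of the binary columns is often dropped, because the columns sum to one and are then collinear with the intercept.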
Predict Miles/Gallon with Origin as Feature
demo/multi_par_reg_add_features.py

→ Origin as an additional feature reduces the mean squared error
and allows the model to encode knowledge about the fuel efficiency of
cars from different origins (U.S.A. least efficient, Japan most efficient).
5. A Probabilistic Interpretation of
Linear Regression

When faced with a regression problem, why might
linear regression, and specifically the
least-squares cost function J, be a reasonable choice?

→ We now look at a set of probabilistic assumptions
under which least-squares regression
is derived as a very natural algorithm.

Fig. from E. Alpaydin (2018)

Let us assume that the target variables and the inputs are related via the
equation

$$y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}$$

where ε(i) is an error term that captures, for instance, random noise.
Probability and regression

Let us further assume that the ε(i) are distributed i.i.d. (independently and
identically distributed) according to a Gaussian distribution with
mean zero and some variance σ². We can write this assumption as

ε(i) ∼ N (0, σ²),

i.e., the density of ε(i) is given by

$$p(\varepsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}\right)$$

Probability and regression (II)

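This implies that

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
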
The notation “p(y(i) |x(i) ; θ)” indicates that this is the distribution of y(i)
given x(i) and parameterized by θ.

Note that we should not condition on θ (“p(y(i) |x(i) , θ)”), since θ is not a
random variable.

We can also write the distribution of y(i) as y(i) | x(i) ; θ ∼ N (θTx(i), σ²).
Likelihood function

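Given X (the design matrix, which contains all the x(i)’s) and θ, what
is the distribution of the y(i)’s?

The probability of the data is given by $p(\vec{y} \mid X; \theta)$.

This quantity is typically viewed as a function of $\vec{y}$
(and perhaps X), for a fixed value of θ.

When we wish to explicitly view this as a function of θ, we will instead call
it the likelihood function:

$$L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \mid X; \theta)$$
Likelihood function (II)

Note that by the independence assumption on the ε(i) ’s (and hence also
the y(i) ’s given the x(i) ’s), this can also be written

$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$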
Likelihood function (III)

Now, given this probabilistic model relating the y(i)’s and the x(i)’s, what is
a reasonable way of choosing our best guess of the parameters θ?

The principle of maximum likelihood says that we should choose θ so as
to make the data as high probability as possible – that is, we should
choose θ to maximize L(θ).

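Instead of maximizing L(θ), we can also maximize any strictly increasing
function of L(θ). In particular, the derivations will be a bit simpler if we
instead maximize the log likelihood l(θ):

$$l(\theta) = \log L(\theta) = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
Maximum Likelihood

Hence, maximizing l(θ) gives the same answer as minimizing

$$\frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

which we recognize to be J(θ), our original least-squares cost function.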



Under the previous probabilistic assumptions on the data,
least-squares regression corresponds to finding the maximum likelihood
estimate of θ.

This is thus one set of assumptions under which least-squares
regression can be justified as a very natural method that’s just doing
maximum likelihood estimation.
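
As a numerical sanity check of this equivalence, the sketch below evaluates the Gaussian log likelihood at the least-squares solution and at a few perturbed parameter vectors. The data is synthetic (a line plus Gaussian noise with an assumed σ = 0.3), and `log_likelihood` is an illustrative helper.

```python
import numpy as np

def log_likelihood(theta, X, y, sigma):
    """l(theta) = m * log(1 / (sqrt(2*pi) * sigma)) - sum(residuals^2) / (2 * sigma^2)."""
    residuals = y - X @ theta
    m = len(y)
    return m * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma)) - np.sum(residuals ** 2) / (2.0 * sigma ** 2)

# Synthetic data: y = 1 + 2x plus Gaussian noise (all values are assumptions).
rng = np.random.default_rng(0)
sigma = 0.3
x = np.linspace(0.0, 1.0, 30)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, sigma, size=x.size)

theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimate
best = log_likelihood(theta_ols, X, y, sigma)

for delta in ([0.1, 0.0], [0.0, -0.1], [0.05, 0.05]):
    perturbed = log_likelihood(theta_ols + np.array(delta), X, y, sigma)
    print(delta, best > perturbed)  # the least-squares theta has the higher likelihood
```

The value of σ shifts and rescales the log likelihood but does not change which θ maximizes it.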
Newton’s Method – another way of
maximizing functions

Let’s now talk about a different algorithm for maximizing, e.g., l(θ).

Let’s consider Newton’s method for finding a zero of a function.

Suppose we have some function f : R → R, and we wish to find a value of θ
such that f (θ) = 0. Here, θ is a real number.

Newton’s method performs the following update:

$$\theta := \theta - \frac{f(\theta)}{f'(\theta)}$$
Newton’s Method

Newton’s method has a natural interpretation.

→ We can think of it as approximating the function f via a linear function
that is tangent to f at the current guess θ, solving for where that
linear function equals zero, and letting the next guess for θ be
that point.

Fig. from https://en.wikibooks.org/wiki/Calculus/Newton%27s_Method


Newton’s Method (II)


Newton’s method gives a way of getting to f (θ) = 0.

What if we want to use it to maximize some function l? The maxima
of l correspond to points where its first derivative l′ (θ) is zero.

So, by letting f (θ) = l ′ (θ), we can use the same algorithm to maximize
l, and we obtain the update rule:

$$\theta := \theta - \frac{l'(\theta)}{l''(\theta)}$$
Newton’s Method for multiple variables

The generalization of Newton’s method to a multidimensional setting
(also called the Newton-Raphson method) is given by

$$\theta := \theta - H^{-1} \nabla_\theta l(\theta)$$

→ Here, ∇θ l(θ) is, as usual, the vector of partial derivatives of l(θ) with
respect to the θi’s; and H is an n-by-n matrix (actually, n + 1-by-n + 1,
assuming that we include the intercept term) called the Hessian, whose
entries are given by

$$H_{ij} = \frac{\partial^2 l(\theta)}{\partial \theta_i \, \partial \theta_j}$$

Newton’s method typically enjoys faster convergence than
gradient descent, and requires fewer iterations to get very close to the
optimum. One iteration of Newton’s method can, however, be more expensive
than one iteration of gradient descent, since it requires finding and
inverting an n-by-n Hessian; but so long as n is not too large, it is
usually much faster overall.
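
To make the multidimensional update concrete, the sketch below applies one Newton step to the least-squares cost J(θ), which is minimized rather than maximized; the update has the same form. Because J is quadratic, its Hessian XᵀX is constant and a single step from any starting point lands on the least-squares solution. The four data points are made up for illustration.

```python
import numpy as np

# Small made-up data set: intercept column plus one feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 4.0, 8.0])

def gradient(theta):
    """Gradient of J(theta) = 1/2 * ||X theta - y||^2."""
    return X.T @ (X @ theta - y)

def hessian(theta):
    """Hessian of J; constant because J is quadratic in theta."""
    return X.T @ X

def newton_step(theta):
    # Solve H d = gradient instead of forming the inverse of H explicitly.
    return theta - np.linalg.solve(hessian(theta), gradient(theta))

theta = newton_step(np.zeros(2))
print(theta, gradient(theta))  # the gradient is (numerically) zero after one step
```

For non-quadratic objectives, such as the log likelihood of other models, the Hessian changes with θ and several steps are needed.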
Newton’s Method: Pros and Cons

When it converges, Newton's method usually converges very quickly, and this
is its main advantage. However, Newton's method is not guaranteed to
converge, which is a big disadvantage, especially compared to bracketing
methods such as bisection (or bracketing variants of the secant method),
which are guaranteed to converge to a solution provided they start with an
interval containing a root.

Newton's method also requires computing values of the derivative of the
function in question. This is potentially a disadvantage if the derivative is
difficult to compute.

The stopping criterion for Newton's method differs from that of the bracketing
methods. In those methods, we know how close we are to a solution
because we are computing intervals which contain a solution. In Newton's
method, we don't know how close we are to a solution. All we can compute
is the value f(x), and so we implement a stopping criterion based on f(x).

Finally, because there is no guarantee that the method converges to a solution,
we should set a maximum number of iterations so that our implementation ends if
we don't find a solution.
Newton’s Method: Example

Let's write a function called newton which takes 5 input
parameters: f, Df, x0, epsilon and max_iter, and returns
an approximation of a solution of f(x) = 0 by Newton's method. The
function may terminate in 3 ways:

- If abs(f(xn)) < epsilon, the algorithm has found an
approximate solution and returns xn.

- If f'(xn) == 0, the algorithm stops and returns None.

- If the number of iterations exceeds max_iter, the algorithm stops
and returns None.
Example code
demo/newton_test.py
demo/newton_solver.py
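
The contents of demo/newton_solver.py are not reproduced here; the following is a sketch of a `newton` function that follows the specification above (same parameter names, same three ways of terminating). The final call uses the function from the first test below.

```python
def newton(f, Df, x0, epsilon, max_iter):
    """Approximate a solution of f(x) = 0 by Newton's method.

    Returns xn once abs(f(xn)) < epsilon. Returns None if the derivative
    vanishes at an iterate or if max_iter iterations are exceeded.
    """
    xn = x0
    for _ in range(max_iter):
        fxn = f(xn)
        if abs(fxn) < epsilon:
            return xn            # approximate solution found
        Dfxn = Df(xn)
        if Dfxn == 0:
            return None          # tangent is horizontal, no next iterate
        xn = xn - fxn / Dfxn     # Newton update
    return None                  # maximum number of iterations exceeded

root = newton(lambda x: x**3 - x**2 - 1, lambda x: 3 * x**2 - 2 * x, 1, 1e-10, 10)
print(root)  # approximately 1.46557
```

Returning None for both failure modes keeps the interface small; a caller that needs to distinguish them could raise exceptions instead.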
Newton solver – test 1

f = lambda x: x**3 - x**2 - 1


Df = lambda x: 3*x**2 - 2*x
approx = newton(f,Df,1,1e-10,10)
print(approx)
Newton solver – divergent example

f = lambda x: x**(1/3)
Df = lambda x: (1/3)*x**(-2/3)
approx = newton(f,Df,0.1,1e-2,100)
print(approx)  # None: each Newton step maps x to -2x, so the iterates move away from the root at 0
6. Polynomial Regression

What if the relationship between our target feature y
and the independent features is “more complicated”?

Polynomial regression allows us to estimate the optimal
coefficients of a polynomial having degree d

hθ(x) = w0 + w1 x + w2 x² + . . . + wd x^d

→ To estimate the coefficients wi, we can pre-compute the powers x^i and
treat them just like other numerical features – no other changes are
required!

Minimize again the cost function
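
The "pre-compute the powers" idea can be sketched as follows: the powers of x become the columns of the design matrix, and the fit is an ordinary least-squares problem. The data is synthetic (an exact quadratic) and the helper names are illustrative; this is not the content of demo/poly_reg.py.

```python
import numpy as np

def polynomial_features(x, degree):
    """Columns x^0, x^1, ..., x^degree, treated as ordinary numerical features."""
    return np.column_stack([x ** k for k in range(degree + 1)])

def fit_polynomial(x, y, degree):
    """Least-squares estimate of the coefficients w0, ..., wd."""
    X = polynomial_features(x, degree)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Synthetic, noise-free data from y = 1 - 2x + 3x^2 (an illustrative assumption).
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 - 2.0 * x + 3.0 * x ** 2

w = fit_polynomial(x, y, degree=2)
print(w)  # close to [1, -2, 3]
```

The model stays linear in the coefficients, which is why the linear regression machinery applies unchanged.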
Polynomial Regression in Python
demo/poly_reg.py
A much better prediction
Polynomial regression & Probability

As before, we assume that the target
variable t is given by a deterministic
function y(x, w) with additive Gaussian
noise:

$$t = y(x, w) + \epsilon$$

where the noise ε is a zero-mean
Gaussian random variable with precision
(inverse variance) β. Thus we can write

$$p(t \mid x, w, \beta) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right)$$

→ To determine the parameters, we again maximize the likelihood.

See Bishop (2006), Chapter 3 for more details.


7. Tuning Model Complexity

Plot of a training data set of N = 10 points,
shown as blue circles,
each comprising an observation
of the input variable x along with
the corresponding target variable t.

The green curve shows the function
sin(2πx) used to generate the data.

→ Our goal is to predict the value of t for
some new value of x, without
knowledge of the green curve.

See Bishop (2006), Chapter 1 and 3 for more details.


Over-fitting

Plots of polynomials having various
orders M, shown as red curves, fitted to the data set shown
above.

See Bishop (2006), Chapter 1 and 3 for more details.


Over-fitting (II)

So far, we’ve assessed the prediction quality of our model based on
the same data that we used for training.

This is a very bad idea, since we cannot accurately measure how
well our model works for previously unseen data points (e.g., for
cars not in our dataset).

Our model may over-fit to the training data and lose its ability to make
predictions for unseen data.

→ Next, we’ll see some best practices for evaluating machine learning
models
Over-fitting (III)


Over-fitting occurs when our model describes the training data very
accurately, but fails to make predictions for previously unseen data
points.


When the number of features is large in comparison to the number of
data points available for training, over-fitting is likely to occur.


In that case, we learn a model that uses many features,
and is thus more complex, but fails to generalize.
Occam’s Razor

Occam's razor is a logical principle attributed to the mediaeval
philosopher William of Occam (or Ockham).

The principle states that one should not make more assumptions
than the minimum needed. This principle is often called the 
principle of parsimony.

It underlies all scientific modeling and theory building. It admonishes
us to choose from a set of otherwise equivalent models of a given
phenomenon the simplest one.

In any given model, Occam's razor helps us to "shave off" those
concepts, variables or constructs that are not really needed to explain
the phenomenon.


By doing that, developing the model will become much easier, and
there is less chance of introducing inconsistencies, ambiguities and
redundancies.
How to avoid Over-fitting


To avoid over-fitting, it is good practice to assess the quality
of a model based on test data that must not be used for training the
model.


The key idea is to split the available data (randomly)
into training, validation, and test data.
Splitting the Data

One common approach to reliably assess the quality of a machine
learning model and avoid over-fitting is to randomly split the available
data into

- training data (~70% of the data) is used for determining optimal coefficients.

- validation data (~20% of the data) is used for model selection
(e.g., fixing the degree of a polynomial, selecting a subset of features, etc.)

- test data (~10% of the data) is used to measure the quality that is
reported.
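
A minimal sketch of such a random split, returning index arrays rather than copies of the data. The function name, the default fractions and the fixed seed are illustrative assumptions.

```python
import numpy as np

def train_val_test_split(n_samples, fractions=(0.7, 0.2, 0.1), seed=0):
    """Randomly partition the indices 0..n_samples-1 into three disjoint sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # random order of all data points
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # whatever remains is test data
    return train, val, test

train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 70 20 10
```

The test indices should be set aside and used only once, after model selection on the validation data is finished.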
Model Selection: k-Fold Cross-Validation
demo/k-fold_cross_validation.py


Another common approach, especially suitable
when only limited data is available, is
k-fold cross-validation.

Data is (randomly) split into k folds of (about)
equal size.

k rounds of training and validation are performed, in which

(k-1) folds serve as training data

one fold serves as validation/test data


In the end the mean of the quality measure (e.g., MSE) is reported
as an estimate of the overall quality.
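
The procedure can be sketched with NumPy alone. The model in each round is a plain least-squares fit and the data is synthetic; demo/k-fold_cross_validation.py may well use a library routine instead, so the helper `k_fold_mse` is only an assumption about how one could write it.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Mean validation MSE of a least-squares fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)  # k folds of about equal size
    scores = []
    for i in range(k):
        val_idx = folds[i]                               # one fold for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        scores.append(np.mean((X[val_idx] @ theta - y[val_idx]) ** 2))
    return float(np.mean(scores))

# Synthetic data: a line with and without Gaussian noise (illustrative assumption).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 40)
X = np.column_stack([np.ones_like(x), x])
y_clean = 1.0 + 2.0 * x
y_noisy = y_clean + rng.normal(0.0, 0.2, size=x.size)

print(k_fold_mse(X, y_clean), k_fold_mse(X, y_noisy))
```

Every data point is used for validation exactly once, which is what makes the approach attractive when data is scarce.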
Cross-validation in Python

demo/k-fold_cross_validation.py
Regularization – Ridge Regression
One technique that is often used to control the over-fitting phenomenon
in such cases is that of regularization, which involves adding a penalty
term to the error function in order to discourage the coefficients from
reaching large values.

The simplest such penalty term takes the form of a sum of squares of all
of the coefficients, leading to a modified error function of the form

$$\tilde{E}(w) = \frac{1}{2}\sum_{n=1}^{N} \left\{y(x_n, w) - t_n\right\}^2 + \frac{\lambda}{2}\lVert w \rVert^2$$

Ridge Regression

where $\lVert w \rVert^2 \equiv w^T w = w_0^2 + w_1^2 + \ldots + w_M^2$, and the coefficient λ governs
the relative importance of the regularization term compared with the
sum-of-squares error term.
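
For a model that is linear in its coefficients, this regularized error has the closed-form minimizer w = (XᵀX + λI)⁻¹ Xᵀt. The sketch below applies it to a degree-5 polynomial fitted to 10 synthetic points generated from sin(2πx) plus noise; the noise level, the seed and the two λ values are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Minimize 1/2 * ||X w - t||^2 + lam/2 * ||w||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ t)

# Ten synthetic points from sin(2*pi*x) with Gaussian noise (illustrative assumption).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=x.size)

X = np.column_stack([x ** k for k in range(6)])  # degree-5 polynomial features

w_weak = ridge_fit(X, t, lam=1e-3)
w_strong = ridge_fit(X, t, lam=1.0)
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))  # larger lam shrinks w
```

As in the error function above, all coefficients including w0 are penalized here; many implementations leave the intercept unpenalized.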
Ridge Regression in Python
demo/ridge_regression.py

→We consider our car dataset


and learn coefficients for
polynomial ridge regression with
degree 5 based on a subset of
10 randomly chosen cars.
Ridge Regression in Python (II)
Regularization in general

A more general regularizer is sometimes used, for which the
regularized error takes the form

$$\frac{1}{2}\sum_{n=1}^{N} \left\{t_n - w^T \phi(x_n)\right\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M} |w_j|^q$$

The case of q = 1 is known as the LASSO in the statistics literature.
It has the property that if λ is sufficiently large, some of the coefficients
wj are driven to zero, leading to a sparse model in which the
corresponding basis functions play no role.

Fig: Contours of the regularization term in the above equation for various values of the parameter q.
Some Literature for this lecture
- S. Raschka, Python Machine Learning, PACKT Publishing, 2017
- C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
