Lecture 3
Simon Scheidegger
Today’s Roadmap
1. Supervised Machine Learning – the general idea
→ Given data like this, how can we learn to predict the fuel consumption of other
cars?
*data: https://archive.ics.uci.edu/ml/datasets/Auto+MPG
Supervised Machine Learning (II)
- x(i) : “input” variables (Horse Power in this example), also called input features.
- y(i) : “output” or target variable that we are trying to predict (Fuel consumption).
Training set: {(x(i), y(i)); i = 1, ..., m}
- When the target variable that we’re trying to predict is continuous, such
as in our car example, we call the learning problem a regression problem.
- When y can take on only a small number of discrete values (such as if, given
the fuel consumption, we wanted to predict if a car is an SUV or a small
city car), we call it a classification problem.
2. Linear Regression
Model hypothesis and parameters
How can we predict the value of a numerical feature y based on the value of
another numerical feature x?
First, we need to make some assumption about their relationship, i.e., how x
influences y.
ŷ = w0 + w1 x
Our model thus assumes that there is a linear relationship between the two
features x and y.
Independent feature/variable: x
Dependent feature/variable: y
Model: ŷ = w0 + w1 x
Training data: pairs (xi, yi); in our example these are pairs of power (in hp) and fuel consumption (in mpg)
of individual cars.
Attribute Information:
(demo/auto-mpg.names.txt, demo/auto-mpg.data.txt)
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
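The co-variance of features x and y in our training data is defined as
$$\mathrm{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
where x̄ and ȳ denote the means of x and y over the n data points.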
Large co-variance suggests that the two features vary jointly
a positive value indicates that they tend to deviate
from their respective mean in the same direction
a negative value indicates that they tend to deviate
from their respective mean in opposite directions.
Correlation
The correlation coefficient (also: Pearson’s r) is a normalized measure of
linear correlation between the two features x and y
The correlation coefficient takes values in [-1, +1]
a value of -1 indicates a negative linear correlation
a value of 0 indicates that there is no linear correlation
a value of +1 indicates a positive linear correlation
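As a rough illustration of the two measures, the sketch below computes the co-variance and Pearson's r in plain Python. The data values are made up for illustration (they are not taken from the Auto MPG file), and the 1/n (population) normalization is an assumption; some texts divide by n - 1 instead, which does not change r.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Average product of the deviations from the respective means (1/n form)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)

def pearson_r(xs, ys):
    """Co-variance normalized by both standard deviations; lies in [-1, +1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical toy data: more power tends to go with fewer miles per gallon.
horsepower = [70.0, 90.0, 110.0, 150.0, 200.0]
mpg = [34.0, 29.0, 25.0, 18.0, 13.0]

cov_xy = covariance(horsepower, mpg)
r_xy = pearson_r(horsepower, mpg)
print(f"cov = {cov_xy:.2f}, r = {r_xy:.3f}")
```

The negative sign of both values reflects that the two features deviate from their means in opposite directions.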
Example Correlations
demo/predict_prices.py
Basic Analytics – Scatter Plot
demo/predict_prices.py
Our Data Set
demo/predict_prices.py
Anscombe’s Quartet
https://en.wikipedia.org/wiki/Wikipedia:Featured_picture_candidates/Anscombe%27s_quartet
→ All four datasets have the same mean, variance, correlation coefficient, and
optimal regression line.
Loss function / Cost function
A loss/cost function measures how well our model, for a specific choice of
coefficients w0 and w1, describes the training data.
The residual for data point (xi, yi) measures how much the observed value yi
differs from the prediction of our model:
$$r_i = y_i - \hat{y}_i = y_i - (w_0 + w_1 x_i)$$
Ordinary least squares (OLS) uses the sum of squared residuals (also:
sum of squared errors) as a loss function:
$$L(w_0, w_1) = \sum_{i=1}^{n} r_i^2$$
The R² coefficient of determination (short: ”R squared”) measures how
well the determined regression line approximates the data.
Put differently: how much of the variation observed in the data is
explained by it.
$$R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$$
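For a single input feature, the coefficients that minimize the sum of squared residuals have a well-known closed form. The sketch below is one possible plain-Python version, not the code in demo/predict_prices.py; the function names and the toy data are assumptions made for illustration.

```python
def fit_simple_ols(xs, ys):
    """Return (w0, w1) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w1 = s_xy / s_xx            # slope
    w0 = y_bar - w1 * x_bar     # intercept
    return w0, w1

def r_squared(xs, ys, w0, w1):
    """Share of the variation in ys explained by the fitted line."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data (power in hp, fuel consumption in mpg).
horsepower = [70.0, 90.0, 110.0, 150.0, 200.0]
mpg = [34.0, 29.0, 25.0, 18.0, 13.0]

w0, w1 = fit_simple_ols(horsepower, mpg)
r2 = r_squared(horsepower, mpg, w0, w1)
print(f"w0 = {w0:.3f}, w1 = {w1:.4f}, R2 = {r2:.3f}")
```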
Linear Regression in Python
demo/predict_prices.py
Regression Plot
Parameters:
w0: 39.974309
w1: -0.158120
R2: 0.606089
3. Recall – Gradient Descent
Unfortunately, it is not always possible to determine optimal coefficients
for a loss function analytically.
Gradient descent is an optimization algorithm that we can use to
determine the minimum of a loss function.
Basic Idea:
start with a random choice of the coefficients (here: w0 and w1)
repeat for a specified number of rounds or until convergence
compute the gradient for this choice of coefficients
update coefficients based on the gradient
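The steps above can be sketched for the one-feature model ŷ = w0 + w1 x as follows. This is a minimal illustration, not the code in demo/gradient_descent.py: the mean squared residual is used as the loss (a rescaled version of the sum of squared residuals), and the learning rate, the number of rounds, and the toy data are assumptions.

```python
import random

def gradient_descent(xs, ys, learning_rate=0.05, rounds=2000, seed=0):
    rng = random.Random(seed)
    # Start with a random choice of the coefficients.
    w0, w1 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    n = len(xs)
    for _ in range(rounds):
        residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
        # Gradient of the mean squared residual with respect to w0 and w1.
        grad_w0 = -2.0 / n * sum(residuals)
        grad_w1 = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        # Step against the gradient.
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return w0, w1

# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]
w0, w1 = gradient_descent(xs, ys)
print(w0, w1)
```

With a learning rate that is too large for the data at hand, the same loop diverges, so the value usually has to be tuned.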
Gradient Descent (II)
Intuition: Think of the loss function as a surface on which you want to
find the lowest point
start your journey at a random position
repeat the following
identify direction with steepest descent
walk a few steps in the identified direction
Fig: https://en.wikipedia.org/wiki/Gradient_descent
Gradient Descent in Python
demo/gradient_descent.py
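The LMS update rule
$$\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)},$$
where α is the learning rate, has several properties that seem natural and intuitive.
For instance, the magnitude of the update is proportional to the error
term (y(i) − hθ(x(i))).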
Thus, for instance, if we are encountering a training example on which
our prediction nearly matches the actual value of y (i) , then we find that
there is little need to change the parameters.
In contrast, a larger change to the parameters will be made if our
prediction hθ (x(i) ) has a large error (i.e., if it is very far from y(i)).
The LMS Algorithm (IV)
This method looks at every example in the entire training set on every
step, and is called batch gradient descent.
Note that, while gradient descent can be susceptible to local minima in
general, the optimization problem we have posed here for linear
regression has only one global optimum and no other local optima.
Thus gradient descent always converges to the global minimum (assuming the
learning rate is not too large), as J is a convex quadratic function.
Stochastic Gradient Descent
(relevant e.g. for Neural Nets)
Gradient ascent is the counterpart used to maximize functions.
Stochastic gradient descent (SGD) does not compute the true gradient,
but approximates the gradient based on a single or few randomly
chosen data points in each round.
The gradient can also be approximated when the function at hand is
non-differentiable or when partial derivatives are expensive to compute.
Stochastic Gradient Descent (II)
In this algorithm, we repeatedly run through the training set, and each time we encounter
a training example, we update the parameters according to the gradient of the error with
respect to that single training example only.
Whereas batch gradient descent has to scan through the entire training set before taking
a single step—a costly operation if m is large—stochastic gradient descent can start
making progress right away, and continues to make progress with each example it looks
at.
Often, stochastic gradient descent gets θ “close” to the minimum much faster than batch
gradient descent. (Note however that it may never “converge” to the minimum, and the
parameters θ will keep oscillating around the minimum of J(θ); but in practice most of the
values near the minimum will be reasonably good approximations to the true minimum)
→ particularly when the training set is large, stochastic gradient descent is often
preferred over batch gradient descent.
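A minimal sketch of the per-example update for the one-feature model is shown below. It is not the notebook code; the learning rate, the number of passes, and the noise-free toy data are assumptions. Because the toy data contain no noise, the parameters settle very close to the true values; with noisy data they would keep oscillating around the minimum as described above.

```python
import random

def sgd(xs, ys, learning_rate=0.1, epochs=200, seed=0):
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)               # visit the examples in random order
        for i in indices:
            error = ys[i] - (w0 + w1 * xs[i])
            # Update using the gradient of the error on this single example.
            w0 += learning_rate * error
            w1 += learning_rate * error * xs[i]
    return w0, w1

# Hypothetical noise-free data generated from y = 1 + 2x on [0, 1].
xs = [i / 19 for i in range(20)]
ys = [1.0 + 2.0 * x for x in xs]
w0, w1 = sgd(xs, ys)
print(w0, w1)
```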
Gradient Descent in parameter space
Loss Optimization
→ Steepest ascent
Take a small step in the opposite direction of the gradient.
Mini Batch Gradient Descent
In practice, we often use an approach called mini-batch
gradient descent.
This approach uses random samples, but in batches.
This means that we do not calculate the gradient for
each observation separately but for a group of observations, which
results in a faster optimization.
A simple way to implement this is to shuffle the observations,
split them into batches, and then proceed with gradient descent
batch by batch.
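The shuffle-then-batch recipe can be sketched as follows for the one-feature model. The batch size, learning rate, number of epochs, and toy data are assumptions chosen for illustration, not values from the course notebook.

```python
import random

def minibatch_gd(xs, ys, batch_size=4, learning_rate=0.2, epochs=300, seed=0):
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                                # shuffle the observations
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]       # one mini-batch
            errors = [ys[i] - (w0 + w1 * xs[i]) for i in batch]
            # Average the per-example gradients over the batch.
            w0 += learning_rate * sum(errors) / len(batch)
            w1 += learning_rate * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
    return w0, w1

# Hypothetical noise-free data generated from y = 1 + 2x on [0, 1].
xs = [i / 19 for i in range(20)]
ys = [1.0 + 2.0 * x for x in xs]
w0, w1 = minibatch_gd(xs, ys)
print(w0, w1)
```

Setting batch_size to 1 gives stochastic gradient descent, and setting it to the size of the training set gives batch gradient descent.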
Let’s run the Jupyter Notebook
demo/GradientDescent_StochasticGradientDescent.ipynb
X: Chirps/Second Y: Temperature (º F)
4. Regression with multiple variables
How can we predict the value of a numerical feature y based on
multiple other numerical features x1, ..., xm?
We assume that there is a linear relationship between the target
feature and the other features
ŷ = w0 + w1 x1 + . . . + wm xm
Given that we now deal with multiple data points and multiple
features, it is easier to formulate our optimization problem using
matrices and vectors.
Minimize the Cost Function
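Define the design matrix X as the matrix that contains the training examples’ input values in its rows:
$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$
Also, let $\vec{y}$ be the m-dimensional vector containing all the target values from
the training set:
$$\vec{y} = \left(y^{(1)}, y^{(2)}, \dots, y^{(m)}\right)^T$$
To minimize J, we set its derivatives to zero, and obtain the normal equations:
$$X^T X \theta = X^T \vec{y}$$
Thus, the value of θ that minimizes J(θ) is given in closed form by the equation
$$\theta = (X^T X)^{-1} X^T \vec{y}$$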
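As a small numerical check of the normal equations, the sketch below builds a design matrix with a leading column of ones for the intercept and solves for θ with NumPy. The data are made up and noise-free, so the known coefficients should be recovered; solving the linear system is used here instead of forming the inverse explicitly, which is a common numerical choice rather than something the slides prescribe.

```python
import numpy as np

# Hypothetical toy data generated from y = 1 + 2*x1 - 3*x2 (no noise).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 1.0 + 2.0 * x1 - 3.0 * x2

# Design matrix: one row per training example, first column is the intercept term.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve X^T X theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)
```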
Multivariate Linear Regression in Python
https://archive.ics.uci.edu/ml/datasets/Auto+MPG
→ Let’s apply this to our car dataset and try to predict fuel consumption
based on horsepower and weight.
demo/multi_par_reg.py
Multivariate Linear Regression in Python
Here, the last line computes the mean squared error (MSE) as another
widely used measure for assessing the prediction quality of a regression
model.
Inspect the regression – Hyperplane
Handling non-numerical Features
How can we make non-numerical (i.e., nominal and ordinal)
features accessible to regression analysis?
Nominal features (e.g., origin of a car) can be converted using one-
hot encoding: for each value of the original feature, a binary feature is
introduced, indicating whether a data point has the corresponding value
for the feature.
Handling non-numerical Features
Ordinal features (e.g., energy efficiency class) can be mapped to
integer values preserving their order.
Note that this mapping implicitly assumes that the differences between
adjacent values are uniform, i.e., have the same magnitude.
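Both conversions can be sketched in a few lines of plain Python. The category names ("usa", "japan", "europe") and the efficiency classes ("A", "B", "C") are hypothetical examples, and the helper functions are not part of demo/multi_par_reg_add_features.py.

```python
def one_hot_encode(values):
    """Return the sorted category names and one binary row per data point."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

def ordinal_encode(values, ordered_levels):
    """Map each value to its position in ordered_levels (lowest first)."""
    rank = {level: position for position, level in enumerate(ordered_levels)}
    return [rank[value] for value in values]

# Nominal feature: no order, so each value gets its own binary column.
origins = ["usa", "japan", "europe", "usa"]
categories, origin_rows = one_hot_encode(origins)

# Ordinal feature: the order C < B < A is preserved by the integer codes.
classes = ["B", "A", "C", "A"]
class_codes = ordinal_encode(classes, ordered_levels=["C", "B", "A"])

print(categories, origin_rows)
print(class_codes)
```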
Predict Miles/Gallon with Origin as Feature
demo/multi_par_reg_add_features.py
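Probability and regression
Let us assume that the target variables and the inputs are related via the
equation
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
where ε(i) is an error term that captures for instance random noise.
Let us further assume that the ε(i) are distributed i.i.d. (independently and
identically distributed) according to a Gaussian distribution with mean zero and variance σ², so that
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$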
The notation “p(y(i) |x(i) ; θ)” indicates that this is the distribution of y(i)
given x(i) and parameterized by θ.
Note that we should not condition on θ (“p(y(i) |x(i) , θ)”), since θ is not a
random variable.
We can also write the distribution of y(i) as y(i) | x(i) ; θ ∼ N (θTx(i), σ²).
Likelihood function
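Given X (the design matrix, which contains all the x(i)’s) and θ, what
is the distribution of the y(i)’s? Viewed as a function of θ, the probability of the data is called the likelihood function:
$$L(\theta) = p(\vec{y} \mid X; \theta)$$
Note that by the independence assumption on the ε(i)’s (and hence also
the y(i)’s given the x(i)’s), this can also be written
$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$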
Likelihood function (III)
Now, given this probabilistic model relating the y(i)’s and the x(i)’s, what is
a reasonable way of choosing our best guess of the parameters θ?
The principle of maximum likelihood says that we should choose θ so as
to make the data as high probability as possible – that is, we should
choose θ to maximize L(θ).
Instead of maximizing L(θ), we can also maximize any strictly increasing
function of L(θ). In particular, the derivations will be a bit simpler if we
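instead maximize the log likelihood l(θ):
$$l(\theta) = \log L(\theta) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
Maximum Likelihood
Hence, maximizing l(θ) gives the same answer as minimizing
$$\frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2,$$
which is the least-squares cost function J(θ).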
Suppose we have some function f : R → R, and we wish to find a value of θ
such that f (θ) = 0. Here, θ is a real number.
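Newton’s method performs the following update:
$$\theta := \theta - \frac{f(\theta)}{f'(\theta)}$$
Newton’s Method
Newton’s method has a natural interpretation: it approximates f by the line
tangent to f at the current guess θ, and takes the point where that line
crosses zero as the next guess.
Newton’s method gives a way of getting to f (θ) = 0.
What if we want to use it to maximize some function l? The maxima
of l correspond to points where its first derivative l′ (θ) is zero.
So, by letting f (θ) = l ′ (θ), we can use the same algorithm to maximize
l, and we obtain the update rule:
$$\theta := \theta - \frac{l'(\theta)}{l''(\theta)}$$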
Newton’s Method for multiple variables
The generalization of Newton’s method to a multidimensional setting
(also called the Newton-Raphson method) is given by
$$\theta := \theta - H^{-1} \nabla_\theta l(\theta)$$
Here, ∇θ l(θ) is, as usual, the vector of partial derivatives of l(θ) with
respect to the θi’s; and H is an n-by-n matrix (actually, (n + 1)-by-(n + 1),
assuming that we include the intercept term) called the Hessian, whose
entries are given by
$$H_{ij} = \frac{\partial^2 l(\theta)}{\partial \theta_i \, \partial \theta_j}$$
Newton’s method typically enjoys faster convergence than
gradient descent, and requires fewer iterations to get very close to the
minimum. One iteration of Newton’s method can, however, be more expensive
than one iteration of gradient descent, since it requires finding and
inverting an n-by-n Hessian; but so long as n is not too large, it is
usually much faster overall.
Newton’s Method: Pros and Cons
When it converges, Newton's method usually converges very quickly, and this
is its main advantage. However, Newton's method is not guaranteed to
converge, which is a significant disadvantage, especially compared to
bracketing methods such as bisection (and bracketing variants of the secant
method), which are guaranteed to converge to a solution
provided they start with an interval containing a root.
Newton's method also requires computing values of the derivative of the
function in question. This is potentially a disadvantage if the derivative is
difficult to compute.
The stopping criterion for Newton's method differs from that of the bisection and
secant methods. In those methods, we know how close we are to a solution
because we are computing intervals which contain a solution. In Newton's
method, we don't know how close we are to a solution. All we can compute
is the value f(x), and so we implement a stopping criterion based on f(x).
Finally, because there is no guarantee that the method converges to a solution, we
should set a maximum number of iterations so that our implementation ends if
we don't find a solution.
Newton’s Method: Example
Let's write a function called newton which takes 5 input
parameters: f, Df, x0, epsilon and max_iter and returns
an approximation of a solution of f(x) = 0 by Newton's method. The
function may terminate in 3 ways:
If the number of iterations exceeds max_iter, the algorithm stops
and returns None.
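One possible sketch of such a function is given below; it is not necessarily identical to demo/newton_solver.py. Only the max_iter case is spelled out in the text above, so the other two exits used here (return the current iterate once |f(x)| falls below epsilon, and return None when the derivative is zero) are assumptions based on the stopping criterion discussed earlier.

```python
def newton(f, Df, x0, epsilon, max_iter):
    """Approximate a solution of f(x) = 0 by Newton's method."""
    xn = x0
    for _ in range(max_iter):
        fxn = f(xn)
        if abs(fxn) < epsilon:
            return xn              # assumed exit: |f(x)| is small enough
        Dfxn = Df(xn)
        if Dfxn == 0:
            return None            # assumed exit: the Newton step is undefined
        xn = xn - fxn / Dfxn       # Newton update
    return None                    # maximum number of iterations reached

# Usage: the positive root of x^2 - x - 1 is the golden ratio.
f = lambda x: x**2 - x - 1
Df = lambda x: 2 * x - 1
approx = newton(f, Df, 1.0, 1e-10, 20)
print(approx)
```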
Example code
demo/newton_test.py
demo/newton_solver.py
Newton solver – test 1
f = lambda x: x**(1/3)
Df = lambda x: (1/3)*x**(-2/3)
approx = newton(f,Df,0.1,1e-2,100)
6. Polynomial Regression
What if the relationship between our target feature y
and the independent features is “more complicated”?
Polynomial regression allows us to estimate the optimal
coefficients of a polynomial having degree d
hθ(x) = w0 + w1 x + w2 x^2 + ... + wd x^d
Minimize again the cost function
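Because the polynomial is still linear in the coefficients, it can be fitted by least squares on a design matrix whose columns are the powers of x. The sketch below uses NumPy for this; it is not the code in demo/poly_reg.py, and the helper name and the noise-free toy data are assumptions.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares coefficients [w0, w1, ..., wd] of a degree-d polynomial."""
    # Columns are x^0, x^1, ..., x^degree.
    A = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

# Hypothetical noise-free data generated from y = 1 - 2x + 0.5x^2.
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 - 2.0 * x + 0.5 * x**2

w = fit_polynomial(x, y, degree=2)
print(w)
```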
Polynomial Regression in Python
demo/poly_reg.py
A much better prediction
Polynomial regression & Probability
As before, we assume that the target
variable t is given by a deterministic
function y(x, w) with additive Gaussian
noise:
$$t = y(x, w) + \epsilon$$
where ε is zero-mean Gaussian noise.
Our model may over-fit to the training data and lose its ability to make
predictions for unseen data.
→ Next, we’ll see some best practices for evaluating machine learning
models
Over-fitting (III)
Over-fitting occurs when our model describes the training data very
accurately, but fails to make predictions for previously unseen data
points.
When the number of features is large in comparison to the number of
data points available for training, over-fitting is likely to occur.
In that case, we learn a model that uses many features,
and is thus more complex, but fails to generalize.
Occam’s Razor
Occam's razor is a logical principle attributed to the mediaeval
philosopher William of Occam (or Ockham).
The principle states that one should not make more assumptions
than the minimum needed. This principle is often called the
principle of parsimony.
It underlies all scientific modeling and theory building. It admonishes
us to choose from a set of otherwise equivalent models of a given
phenomenon the simplest one.
In any given model, Occam's razor helps us to "shave off" those
concepts, variables or constructs that are not really needed to explain
the phenomenon.
By doing that, developing the model will become much easier, and
there is less chance of introducing inconsistencies, ambiguities and
redundancies.
How to avoid Over-fitting
To avoid over-fitting, it is good practice to assess the quality
of a model based on test data that must not be used for training the
model.
The key idea is to split the available data (randomly)
into training, validation, and test data.
Splitting the Data
One common approach to reliably assess the quality of a machine
learning model and avoid over-fitting is to randomly split the available
data into
training data (~70% of the data)
is used for determining optimal coefficients.
validation data (~20% of the data) is used for model selection
(e.g., fixing degree of polynomial, selecting a subset of features, etc.)
test data (~10% of the data) is used to measure the quality that is
reported.
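A random 70/20/10 split can be sketched with the standard library alone. The function name, the fixed seed, and the use of integers as stand-ins for data points are assumptions for illustration.

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.1, seed=0):
    """Shuffle a copy of the data and cut it into three disjoint parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]      # everything that is left (~70%)
    return train, val, test

# Hypothetical data set of 100 items.
data = list(range(100))
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

The test part should be set aside and only used once, after model selection on the validation part is finished.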
Model Selection: k-Fold Cross-Validation
demo/k-fold_cross_validation.py
Another common approach, especially suitable
when only limited data is available, is
k-fold cross-validation.
Data is (randomly) split into k folds of (about)
equal size.
k rounds of training and validation are performed, in which
(k-1) folds serve as training data
one fold serves as validation/test data
In the end the mean of the quality measure (e.g., MSE) is reported
as an estimate of the overall quality.
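The procedure can be sketched in plain Python with a one-feature linear model as the learner. This is an illustration under stated assumptions, not the code in demo/k-fold_cross_validation.py: the helper names, the interleaved way of forming folds after shuffling, and the toy data (a line with a small alternating perturbation) are all made up.

```python
import random

def fit_line(points):
    """Closed-form least-squares fit of y = w0 + w1 * x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    w1 = s_xy / s_xx
    return y_bar - w1 * x_bar, w1

def mse(points, w0, w1):
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in points) / len(points)

def k_fold_mse(points, k=5, seed=0):
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]      # k folds of about equal size
    scores = []
    for i in range(k):
        validation = folds[i]                        # one fold for validation
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        w0, w1 = fit_line(training)                  # train on the other k-1 folds
        scores.append(mse(validation, w0, w1))
    return sum(scores) / k                           # mean quality over all rounds

# Hypothetical data: y = 3 - 0.5x plus an alternating perturbation of +/- 0.1.
points = [(float(x), 3.0 - 0.5 * x + 0.1 * (-1) ** x) for x in range(20)]
score = k_fold_mse(points, k=5)
print(score)
```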
Cross-validation in Python
demo/k-fold_cross_validation.py
Regularization – Ridge Regression
One technique that is often used to control the over-fitting phenomenon
in such cases is that of regularization, which involves adding a penalty
term to the error function in order to discourage the coefficients from
reaching large values.
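The simplest such penalty term takes the form of a sum of squares of all
of the coefficients, leading to a modified error function of the form
$$\tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, w) - t_n \right\}^2 + \frac{\lambda}{2} \|w\|^2$$
where the coefficient λ governs the relative importance of the penalty term compared with the sum-of-squares error term.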
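For a model that is linear in its coefficients, a sum-of-squares penalty leads to a closed-form solution, w = (XᵀX + λI)⁻¹ Xᵀt. The sketch below illustrates this with NumPy on made-up data; penalizing every coefficient, including the constant term, is a simplifying assumption (in practice the constant term is often left unpenalized), and the function name is hypothetical.

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Minimize 0.5 * ||X w - t||^2 + 0.5 * lam * ||w||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ t)

# Hypothetical data: 10 samples of sin(pi * x), fitted with a degree-5 polynomial.
x = np.linspace(-1.0, 1.0, 10)
X = np.vander(x, 6, increasing=True)    # columns x^0 ... x^5
t = np.sin(np.pi * x)

w_ols = ridge_fit(X, t, lam=0.0)        # no penalty: ordinary least squares
w_ridge = ridge_fit(X, t, lam=1.0)      # penalty shrinks the coefficients

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

Increasing λ trades a worse fit on the training data for smaller coefficients.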
Ridge Regression
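A more general penalty term raises the absolute value of each coefficient to a power q, giving the regularized error
$$\frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, w) - t_n \right\}^2 + \frac{\lambda}{2} \sum_{j} |w_j|^q$$
where q = 2 corresponds to the sum-of-squares penalty of ridge regression.
The case of q = 1 is known as the LASSO in the statistics literature.
It has the property that if λ is sufficiently large, some of the coefficients
wj are driven to zero, leading to a sparse model in which the
corresponding basis functions play no role.
Fig: Contours of the regularization term in the equation above for various values of the parameter q.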
Some Literature for this lecture
Python Machine Learning
S. Raschka
PACKT Publishing, 2017