Lecture04. Training Models (Regression in Chapter 4)
Training Models
Chapter 4
SGD
Mini-Batch
Generalized Regression
-------------
Logistic Regression
Softmax Regression
2
Linear Regression
3
Goal: find the best-fitting line
The optimal weights are those that minimize the squared errors (least squares)
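For reference, the quantity being minimized is the mean squared error of the linear model over the m training samples (standard definition):
MSE(θ) = (1/m) * Σ_{i=1..m} (θ^T x^(i) − y^(i))^2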
4
Solution 1.1: Analytical Solution
1.1 Use NumPy to compute the matrix inverse and products:
import numpy as np
# Normal equation: theta = (X^T X)^{-1} X^T y, where X_b is X with a bias column of ones
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
--------------------------------------------
y = Xθ
X^T y = X^T X θ
(X^T X)^{-1} X^T y = (X^T X)^{-1} (X^T X) θ  =>  θ = (X^T X)^{-1} X^T y
Prediction: ŷ = X_new θ
5
Prediction with found weights
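A minimal sketch of predicting with the fitted weights (assumes theta_best from the normal equation above and a single input feature; the X_new values are hypothetical):
import numpy as np

X_new = np.array([[0.0], [2.0]])          # hypothetical new inputs
X_new_b = np.c_[np.ones((2, 1)), X_new]   # prepend the bias column of ones
y_predict = X_new_b.dot(theta_best)       # y_hat = X_new_b . theta_best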
6
Solution 1.2: Analytical solution using scikit-learn
1.2 Use scikit-learn:
from sklearn.linear_model import LinearRegression
# 3 steps: model/fitting/result
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
# predict
lin_reg.predict(X_new)
7
Matrix inverse vs. SVD
https://slideplayer.com/slide/5189063/
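Note: inverting X^T X fails when the matrix is singular and is numerically unstable when it is nearly so; an SVD-based pseudoinverse avoids this. A minimal sketch (X_b and y as above):
import numpy as np

# Moore-Penrose pseudoinverse computed via SVD; works even when X_b.T @ X_b is not invertible
theta_best_svd = np.linalg.pinv(X_b).dot(y)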
9
Singular Values
10
Solution 2. Minimize the Error
Minimize the cost function iteratively, starting from random initial weights
11
Solution 2. Gradient Descent
Two key parameters:
Direction: the negative gradient
Step size: the learning rate
12
Gradient Descent
Local minimum and plateau
13
Data Scaling
In 2-D: when features 1 and 2 have the same scale (left plot), gradient descent heads roughly straight toward the minimum; when their scales differ, it takes a long, zig-zag path. Scale the features before using gradient descent.
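A minimal sketch of standardizing the features before gradient descent (assumes a feature matrix X):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each feature rescaled to zero mean and unit variance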
14
Batch Gradient Descent-1
We have m training samples, each with n features; for each weight θ_j (j = 1, …, n) we need the partial derivative of the cost function.
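The gradient component for each weight θ_j is the standard partial derivative of the MSE:
∂MSE(θ)/∂θ_j = (2/m) * Σ_{i=1..m} (θ^T x^(i) − y^(i)) x_j^(i)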
15
Batch Gradient Descent-2
Move to the next step: update the weights by stepping in the direction of the negative gradient, scaled by the learning rate η.
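In symbols, the standard batch gradient descent update is:
θ_next = θ − η * ∇_θ MSE(θ)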
17
Batch Gradient Descent: summary
The MSE cost function for linear regression is convex, so with a suitable learning rate batch gradient descent converges to the global minimum.
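A minimal NumPy sketch of batch gradient descent for linear regression (assumed names: X_b with a bias column of ones and a column-vector target y; the learning rate and iteration count are illustrative):
import numpy as np

eta = 0.1                                   # learning rate (illustrative)
n_iterations = 1000
m = len(X_b)                                # number of training samples

theta = np.random.randn(X_b.shape[1], 1)    # random initialization of the weights
for _ in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)   # gradient of MSE over the full batch
    theta = theta - eta * gradients                      # step in the negative-gradient direction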
18
Stochastic Gradient Descent
Stochastic Gradient Descent:
picks one random instance from the training set at every step and computes the gradient on that single instance
19
SGD: single sample each time
Problem:
because each step is based on a single sample, the cost function bounces up and down, decreasing only on average and never settling exactly at the minimum
20
scikit-learn: SGD method
SGDRegressor
from sklearn.linear_model import SGDRegressor
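A minimal usage sketch (the hyperparameter values are illustrative, and X, y are assumed to be defined):
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.1)   # illustrative settings
sgd_reg.fit(X, y.ravel())            # SGDRegressor expects a 1-D target array
sgd_reg.intercept_, sgd_reg.coef_    # learned bias and weights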
21
Best Choice: mini-batch
Mini-batch Gradient Descent:
computes the gradients on small random sets of instances called mini-batches
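A minimal NumPy sketch of mini-batch gradient descent (assumed names: X_b with a bias column and target y; batch size, learning rate, and epoch count are illustrative):
import numpy as np

eta, batch_size, n_epochs = 0.1, 32, 50
m = len(X_b)
theta = np.random.randn(X_b.shape[1], 1)        # random initialization

for epoch in range(n_epochs):
    indices = np.random.permutation(m)          # reshuffle the data each epoch
    for start in range(0, m, batch_size):
        batch = indices[start:start + batch_size]
        xi, yi = X_b[batch], y[batch]
        gradients = 2 / len(batch) * xi.T.dot(xi.dot(theta) - yi)  # gradient on the mini-batch
        theta = theta - eta * gradients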
22
Simulated annealing for the learning rate
Even though SGD jumps around, it can still have a hard time settling at the global optimum.
Method: gradually decrease the learning rate over training (a learning schedule).
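A minimal sketch of SGD with a decaying learning rate (the schedule constants t0 and t1 and the epoch count are illustrative; X_b and y as before):
import numpy as np

t0, t1 = 5, 50                                  # illustrative schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)                        # learning rate shrinks as training proceeds

m = len(X_b)
theta = np.random.randn(X_b.shape[1], 1)
for epoch in range(50):
    for i in range(m):
        idx = np.random.randint(m)                      # pick one instance at random
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)    # gradient on a single sample
        eta = learning_schedule(epoch * m + i)          # decayed learning rate
        theta = theta - eta * gradients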
23
Comparison of algorithms for Linear Regression
24
Challenges
Direction:
Gradient Descent => improved variants (see https://www.ruder.io/optimizing-gradient-descent/)
Step length:
learning rate => decay schedules or adaptive learning rates
25
Linear vs. Non-linear problem
Linear:
analytical solution
gradient descent
Non-linear:
exponential, or other forms
26
Polynomial Regression
Polynomial: Y = w_0 + w_1 X + w_2 X^2 + w_3 X^3
More general (degree d):
Y = w_0 + w_1 X + w_2 X^2 + w_3 X^3 + … + w_d X^d
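A minimal scikit-learn sketch of polynomial regression by expanding the features first (the degree is illustrative; X and y are assumed):
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly_features = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly_features.fit_transform(X)        # adds X^2 and X^3 columns
lin_reg = LinearRegression().fit(X_poly, y)    # ordinary linear regression on the expanded features
lin_reg.intercept_, lin_reg.coef_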
28
Higher order: Overfitting data
Linear model: underfits the data
High-degree polynomial: how to choose d?
e.g., d = 300: overfits the data
29
The Bias/Variance Trade-off
Generalization ERROR = the sum of three very different errors:
Bias: due to wrong assumptions (e.g., an overly simple model)
Variance: due to excessive sensitivity to small variations in the training data
Irreducible error: due to the noisiness of the data itself
Trade-off:
Less complexity increases the model's bias and reduces its variance.
More complexity increases its variance and reduces its bias.
30
Learning Curve
31
Learning curve
Training vs. validation error => is the model still underfitting?
(training error < validation error)
Homework 1 uses only part of the data
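A minimal sketch of plotting learning curves by training on growing subsets of the data (assumed names: a regression model, feature matrix X, and target y):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])    # train on the first m samples only
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    plt.plot(np.sqrt(train_errors), "r-+", label="train")       # RMSE on the growing training set
    plt.plot(np.sqrt(val_errors), "b-", label="validation")
    plt.legend()

# e.g. plot_learning_curves(LinearRegression(), X, y)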
32
Early Stopping: over epochs
Train the model for many epochs, tracking the validation error after each one; keep the weights from the epoch with the lowest validation error:
if val_error < minimum_val_error: save the model and update minimum_val_error
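A minimal early-stopping sketch using warm starts so each fit() call runs one more epoch (assumed names: X_train, y_train, X_val, y_val; hyperparameters are illustrative):
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, warm_start=True, eta0=0.0005, tol=None)

minimum_val_error = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train.ravel())       # continues training from the previous state
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < minimum_val_error:           # validation error improved
        minimum_val_error = val_error
        best_model = deepcopy(sgd_reg)          # keep a copy of the best model so far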
33
Plateau Check
One step:
relative change = (Error_{n+1} − Error_n) / Error_n
e.g., (Error_101 − Error_100) / Error_100
= (650 000 − 650 100) / 650 100 ≈ −0.00015, and |−0.00015| < tolerance = 0.01  =>  plateau
Multiple steps: apply the same criterion over several consecutive steps before stopping
34
Generalized Regression
You can imagine:
exponential regression,
logarithmic regression
Generalized regression (generalized linear models):
link(Y) = Z = WX  (a link function connects the target to a linear combination of the features)
35
Regularized Linear Models
Ridge
LASSO
Elastic Net
36
Ridge
Larger alpha => stronger shrinkage of the weights (θ)
Increasing α leads to flatter (i.e., less extreme, more reasonable) predictions, thus reducing the model's variance but increasing its bias.
(Figure: plain linear model vs. polynomial, d = 10)
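A minimal scikit-learn sketch (the alpha value is illustrative; X, y, X_new assumed):
from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1.0)    # larger alpha => stronger shrinkage of the weights
ridge_reg.fit(X, y)
ridge_reg.predict(X_new)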
37
LASSO
Increasing alpha => eliminates some weights entirely (drives them to zero)
(Figure: plain linear model vs. polynomial, d = 10)
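A minimal scikit-learn sketch (the alpha value is illustrative; X, y, X_new assumed):
from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)    # the L1 penalty can drive some weights exactly to zero
lasso_reg.fit(X, y)
lasso_reg.predict(X_new)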
38
Ridge vs. LASSO
Lasso Regression (L1 penalty):
tends to eliminate the weights of the least important features (i.e., set them to zero).
In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).
Ridge Regression (L2 penalty): shrinks the weights toward zero but rarely makes them exactly zero.
39
Optimization
40
Example: Cancer data with 30 features
41
Elastic Net
Combines both the L1 and L2 penalties, with mix ratio r
When r = 0, it is equivalent to Ridge; when r = 1, it is equivalent to Lasso
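A minimal scikit-learn sketch (alpha and the mix ratio l1_ratio, i.e. r, are illustrative; X, y, X_new assumed):
from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)   # l1_ratio is the mix ratio r between L1 and L2
elastic_net.fit(X, y)
elastic_net.predict(X_new)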
42
Summary
Models:
Linear Regression
Multiple Linear Regression
Optimization
Cost: minimize the mean squared error (MSE = mean(error^2))
Analytical Solution