
Lecture 4

Training Models
Chapter 4

CSC 484 / 584, DA 515


Chapter 4 Training (Main Points)
Regression Models
• Linear Regression: an optimization problem
  • Batch
  • SGD
  • Mini-batch
• Polynomial, exponential, logarithmic, ...
• Generalized Regression
• Regularization (L1, L2, or Elastic Net)
-------------
• Logistic Regression
• Softmax Regression

Linear Regression

Goal: best fitting line
• The optimal weights minimize the squared errors (least squares)

Solution 1.1: Analytical Solution
• 1.1 Use NumPy to find the inverse and the matrix products:

# use NumPy to solve the normal equation
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
--------------------------------------------
y = Xθ
X⊺y = X⊺Xθ
(X⊺X)⁻¹X⊺y = (X⊺X)⁻¹(X⊺X)θ  =>  θ = (X⊺X)⁻¹X⊺y
Prediction: ŷ = X_new θ
(a worked NumPy sketch follows)

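A minimal end-to-end sketch of the normal equation on synthetic data; the toy data and the names X_b, X_new_b are assumptions for illustration:

import numpy as np

# toy data (assumed): y = 4 + 3x + noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]                    # add x0 = 1 (bias column)
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y  # normal equation

X_new = np.array([[0.0], [2.0]])
X_new_b = np.c_[np.ones((2, 1)), X_new]              # bias column for the new samples
y_predict = X_new_b @ theta_best                     # prediction: y = X_new θ
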
Prediction with found weights

Solution 1.2: using sk-learn
• 1.2 Use scikit-learn:
from sklearn.linear_model import LinearRegression

# 3 steps: model/fitting/result
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_

# predict
lin_reg.predict(X_new)

Matrix inverse vs. SVD

• The matrix-inverse approach sometimes has problems:
  • X⊺X is not always invertible (singular**, e.g., with redundant features).
  • X⊺X is an (n + 1) × (n + 1) matrix (where n is the number of features);
    the computational complexity of inverting such a matrix is typically about
    O(n^2.4) to O(n^3).
  • Too much data may not fit into memory.
  ** A square matrix that has no inverse; a matrix is singular iff its determinant is 0.
• SVD: Singular Value Decomposition
  decomposes the training set matrix X into the product of three matrices: X = U Σ V⊺
  Good:
  • If m < n (fewer data samples m than features n), the SVD approach still works.
  • If m >> n, SVD is more efficient, especially for large datasets.
  • scikit-learn uses SVD to find the solution (W in our case).
• The SVD used in Scikit-Learn's LinearRegression class is about O(n^2).
  (a NumPy sketch follows)
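A minimal sketch of the SVD route, assuming X_b and y from the earlier example; np.linalg.pinv computes the pseudoinverse via SVD, and np.linalg.lstsq is an SVD-based least-squares solver:

import numpy as np

# pseudoinverse (computed via SVD): theta = pinv(X_b) @ y
theta_pinv = np.linalg.pinv(X_b) @ y

# SVD-based least-squares solver, same solution
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)
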


Singular Value Decomposition
• A is the m × n data matrix (the X we have used):

https://slideplayer.com/slide/5189063/

Singular Values

Solution 3. Minimize the Error
• Minimize the cost function: start from random initial weights

Solution 3. Gradient Descent
• Two key parameters:
  • Direction: the negative gradient
  • Step length: the learning rate (η)

Gradient Descent
• Local minima and plateaus

Data Scaling
• In 2-D:
  • features 1 and 2 have the same scale (on the left)
  • feature 1 has much smaller values than feature 2 (on the right)
  (a StandardScaler sketch follows)

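A minimal sketch of standardizing the features before gradient descent, using scikit-learn's StandardScaler; the toy feature matrix is an assumption for illustration:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])        # toy data (assumed): feature 1 << feature 2

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each feature now has mean 0 and std 1
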
Batch Gradient Descent-1
• We have m data samples, each with n features (j = 1, …, n):
• The partial derivative along one feature direction, over all samples:

  ∂MSE(θ)/∂θ_j = (2/m) Σ_{i=1..m} (θ⊺x^(i) − y^(i)) · x_j^(i)

• The full gradient, for all features and all samples:

  ∇θ MSE(θ) = (2/m) X⊺(Xθ − y)

Batch Gradient Descent-2
• Move to the next step: update the weights
  θ ← θ − η · ∇θ MSE(θ)

eta = 0.05                    # learning rate (optionally multiplied by a decay factor)
n_iterations = 1000
m = 100                       # sample size
theta = np.random.randn(2, 1) # random initialization
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)  # batch gradient over all m samples
    theta = theta - eta * gradients                  # gradient descent step

Batch Gradient Descent: Learning Rates
• Uses all samples and all features
• At most 1000 steps (or until converged)
• Different learning rates:

Batch Gradient Descent: summary
• The MSE cost function:
  • is convex
  • its slope does not change abruptly
• Batch Gradient Descent:
  • with a fixed learning rate, it eventually converges to the optimal solution
  • but you may have to wait a while:
    it uses the whole training set to compute the gradients at every step,
    so it is very slow when the training set is large.

Stochastic Gradient Descent
• Stochastic Gradient Descent:
  • picks a random instance in the training set at every step
  • computes the gradients based only on that single instance
  • the cost function will bounce up and down, decreasing only on average;
    over time it will end up very close to the minimum, but once it gets there
    it will continue to bounce around, never settling down

SGD: single sample each time
• Problem:
  • the cost function bounces up and down, decreasing only on average
  • it ends up very close to the minimum, but never settles exactly on it
• Good for irregular cost functions:
  • it can escape from local optima

SK-learn: SGD method
• SGDRegressor
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3,
                       penalty=None, eta0=0.1)
# max_iter: at most 1000 epochs
# eta0: initial learning rate 0.1
# tol: stop early once the loss improves by less than 1e-3
sgd_reg.fit(X, y.ravel())

Best Choice: mini-batch
• Mini-batch Gradient Descent:
  computes the gradients on small random sets of instances called mini-batches
  (a sketch follows)

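A minimal mini-batch gradient descent sketch, assuming X_b and y from the batch example; the batch size, epoch count, and learning rate are illustrative assumptions:

import numpy as np

n_epochs, minibatch_size, eta = 50, 20, 0.05    # illustrative hyperparameters
m = len(X_b)
theta = np.random.randn(2, 1)                   # random initialization

for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)         # new random order each epoch
    X_shuffled, y_shuffled = X_b[shuffled], y[shuffled]
    for start in range(0, m, minibatch_size):
        xi = X_shuffled[start:start + minibatch_size]
        yi = y_shuffled[start:start + minibatch_size]
        gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)  # gradient on the mini-batch only
        theta = theta - eta * gradients
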
Simulated Annealing for the Learning Rate
• Even though SGD jumps around, it can still have a hard time finding the global optimum.
• Method: gradually decrease the learning rate (a learning-schedule sketch follows)

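A minimal learning-schedule sketch for SGD, assuming X_b and y from before; the schedule constants t0 and t1 are illustrative assumptions:

import numpy as np

n_epochs = 50
t0, t1 = 5, 50                      # learning-schedule hyperparameters (assumed)
m = len(X_b)

def learning_schedule(t):
    return t0 / (t + t1)            # learning rate gradually decreases

theta = np.random.randn(2, 1)
for epoch in range(n_epochs):
    for i in range(m):
        idx = np.random.randint(m)                   # one random instance (SGD)
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
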
Comparison of algorithms for Linear Regression

Challenges
• Direction:
  • Gradient Descent => more than one gradient
  https://www.ruder.io/optimizing-gradient-descent/
• Step length:
  • learning rate => decay
  • adaptive

Linear vs. Non-linear problem
• Linear:
  • analytical solution
  • SVD: Singular Value Decomposition
  • gradient descent
• Non-linear => transform into a linear problem:
  • polynomial
  • exponential
  • or others

Polynomial Regression
• Polynomial: Y = w0 + w1·X + w2·X² + w3·X³
• For d = 3, we can define: X1 = X, X2 = X², X3 = X³
  Now we can still use multiple linear regression:
  Y = w0 + w1·X1 + w2·X2 + w3·X3
• More generally (with Xk = X^k):
  Y = w0 + w1·X1 + w2·X2 + w3·X3 + … + wd·Xd
• HOMEWORK 2: given data, find the best d (a scikit-learn sketch follows)

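A minimal polynomial-regression sketch with scikit-learn's PolynomialFeatures; the toy data and the degree are assumptions for illustration:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# toy data (assumed): quadratic relationship plus noise
np.random.seed(0)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)

poly = PolynomialFeatures(degree=3, include_bias=False)   # builds the X, X^2, X^3 columns
X_poly = poly.fit_transform(X)

lin_reg = LinearRegression().fit(X_poly, y)   # plain multiple linear regression on the new features
print(lin_reg.intercept_, lin_reg.coef_)
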
Polynomial Curve

Higher order: Overfitting data
• Linear: underfitting
• High-degree polynomial: which d?
  • d = 300: overfitting

The Bias/Variance Trade-off
• ERROR = the sum of three very different errors:
  • Bias: due to wrong assumptions (e.g., the wrong model)
  • Variance: due to sensitivity to variations in the training data
  • Irreducible error: due to the noisiness of the data and outliers
• Trade-off:
  • Less complexity: increases bias and reduces variance
  • More complexity: increases variance and reduces bias

Learning Curve

Learning curve
• Training curve vs. validation curve => still underfitting?
  (train error < validation error)
• Homework 1 uses only part of the data (a learning-curve sketch follows)

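A minimal learning-curve sketch with scikit-learn (model, X, y are assumed to be defined as in the earlier examples); learning_curve retrains the model on growing training-set sizes and returns train/validation scores:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

model = LinearRegression()                     # assumed model
train_sizes, train_scores, valid_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring="neg_mean_squared_error")

plt.plot(train_sizes, -train_scores.mean(axis=1), "r-+", label="train")
plt.plot(train_sizes, -valid_scores.mean(axis=1), "b-", label="validation")
plt.xlabel("training set size"); plt.ylabel("MSE"); plt.legend()
plt.show()
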
Early Stopping: over epochs
• Train the model for many epochs, tracking the validation error:
  if val_error < minimum_val_error: save the model and keep training;
  stop once the validation error no longer improves (a sketch follows)

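A minimal early-stopping sketch using a warm-started SGDRegressor; X_train, y_train, X_val, y_val are assumed to be pre-split (and scaled), and the epoch count and learning rate are illustrative:

import copy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train.ravel())      # warm_start=True: continues where it left off
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < minimum_val_error:          # keep the best model seen so far
        minimum_val_error = val_error
        best_model = copy.deepcopy(sgd_reg)
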
Plateau Check
• One step: relative change
  • (Error_{n+1} − Error_n) / Error_n
  • Example: (Error_101 − Error_100) / Error_100
    = (650 000 − 650 100) / 650 100 ≈ −0.00015, so |change| < tolerance = 0.01
• Multiple steps: check the change over several consecutive steps (a helper sketch follows)

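A small illustrative helper for the one-step relative-change check above (the function name and tolerance default are assumptions, not from the slides):

def plateau_reached(prev_error, curr_error, tolerance=0.01):
    """Return True when the relative change in the error is below the tolerance."""
    relative_change = (curr_error - prev_error) / prev_error
    return abs(relative_change) < tolerance

# example from the slide: error goes from 650 100 to 650 000
print(plateau_reached(650_100, 650_000))   # True: |-0.00015| < 0.01
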
Generalized Regression
• You can imagine:
  • exponential regression
  • logarithmic regression
• Generalized regression:
  • link(Y) = Z = WX
  • For example: y = 2^(WX)
    log₂(y) = WX
    so if we use z = log₂(y), then we have z = WX, a linear problem
  (a sketch follows)

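A minimal sketch of the log-link idea: transform y, then fit an ordinary linear model (the toy data are an assumption; this is the transform-then-fit trick, not a dedicated GLM API):

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data (assumed): y = 2**(1.5*x + 0.3), with a little multiplicative noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(100, 1))
y = 2 ** (1.5 * X + 0.3) * (1 + 0.01 * rng.standard_normal((100, 1)))

z = np.log2(y)                              # link: z = log2(y) = WX is linear in X
lin_reg = LinearRegression().fit(X, z)
print(lin_reg.coef_, lin_reg.intercept_)    # close to 1.5 and 0.3
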
Regularized Linear Models
• Ridge
• LASSO: Least Absolute Shrinkage and Selection Operator Regression
• Elastic Net

Ridge
• Bigger alpha => shrinks the weights (θ):
  increasing α leads to flatter (i.e., less extreme, more reasonable) predictions,
  thus reducing the model's variance but increasing its bias.
• (Figure panels: linear model | polynomial, d = 10)
  (a Ridge sketch follows)

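A minimal Ridge sketch (X, y as in the earlier examples; the alpha value and solver are illustrative assumptions):

from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1.0, solver="cholesky")   # larger alpha shrinks the weights more
ridge_reg.fit(X, y)
print(ridge_reg.intercept_, ridge_reg.coef_)
print(ridge_reg.predict([[1.5]]))
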
LASSO
• Different alpha values => some weights are eliminated (set to zero)
• (Figure panels: linear model | polynomial, d = 10)

Ridge vs. LASSO
• Lasso Regression (L1):
  • tends to eliminate the weights of the least important features (i.e., set them to zero)
  • in other words, Lasso Regression automatically performs feature selection and
    outputs a sparse model (i.e., with few nonzero feature weights)
• Ridge Regression (L2):
  • shrinks the weights to reduce overfitting
  (a Lasso sparsity sketch follows)

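A minimal sketch of Lasso's sparsity on polynomial features (X, y as in the polynomial example; the degree and alpha are illustrative assumptions):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso

model = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    StandardScaler(),                        # scaling matters for regularized models
    Lasso(alpha=0.1, max_iter=10_000),
)
model.fit(X, y.ravel())

coef = model.named_steps["lasso"].coef_
print(coef)                                  # several weights are driven exactly to 0
print(np.sum(coef != 0), "nonzero weights")
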
Optimization

Example: Cancer data with 30 features

Elastic Net

• Combines both penalties, with mix ratio r
  • When r = 0, it is equivalent to Ridge
  • When r = 1, it is equivalent to Lasso Regression
• Ridge is a good default
• Prefer Lasso (or Elastic Net) if only a few features are useful
  (a sketch follows)

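A minimal Elastic Net sketch (X, y as before; note that scikit-learn calls the mix ratio r l1_ratio, and the alpha value is an illustrative assumption):

from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)   # l1_ratio = 0.5: halfway between Ridge and Lasso
elastic_net.fit(X, y.ravel())
print(elastic_net.intercept_, elastic_net.coef_)
print(elastic_net.predict([[1.5]]))
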
Summary
• Models:
  • Linear Regression
  • Multiple Linear Regression
  • Generalized Linear Regression
• Optimization:
  • Cost: minimize mean(ERROR²), i.e., the MSE
  • Analytical solution
  • SGD, Batch, or mini-batch (how many samples at once?)
  • Training stop: evaluate => plateau (how many epochs?)
• Regularization: L2 for overcoming overfitting,
  L1 for feature selection



END

• Read the book, Chapter 4
• Practice the code from this chapter
• Do your homework (HW2)
• Next lecture: Decision Trees (Chapter 6)
