
Lecture 3

Regression Analysis

Ziyi Cao
University of Texas at Dallas
[email protected]
Spring 2025
Agenda

• Linear Regression
• Gradient Descent
• Polynomial Regression
• Python Practice
Linear Regression

Concepts
Least Squares Method
Linear Regression Concepts

• Simple linear regression

• Models the linear relationship between a numeric target variable and a single
explanatory variable

• Y: target variable / dependent variable / outcome variable

• X: explanatory variable / independent variable / predictor / regressor

Y = β₀ + β₁X + ε
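A minimal sketch (not part of the original slides) of fitting the simple regression Y = β₀ + β₁X + ε in Python, assuming NumPy and scikit-learn are available; the data is simulated for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # explanatory variable (100 x 1)
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 1, 100)    # target with random error

model = LinearRegression().fit(X, y)
print("intercept (beta0):", model.intercept_)      # close to 2
print("slope (beta1):", model.coef_[0])            # close to 3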
Linear Regression Concepts

• Multiple linear regression

• Models the linear relationship between a numeric target variable and a set of
explanatory variables

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

• Y: outcome
• β₀, …, βₖ: parameters (to be estimated)
• X₁, …, Xₖ: predictors
• ε: random error (unexplained part)
Linear Regression Concept

• Linear relationship
• Linearity is defined on the parameters, NOT on the predictors
• The X values are observed (treat them as constants); the parameters are the unknowns to solve for
• You can always apply nonlinear transformations to variables before fitting the model (see the sketch below)

Both of the following are linear regressions:

Y = β₀ + β₁X + ε          Y = β₀ + β₁X + β₂X² + ε

For the second, define a new variable Z = X²; then Y = β₀ + β₁X + β₂Z + ε is still linear in the parameters.
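A short sketch (added for illustration, assuming NumPy and scikit-learn) of the point above: a model with an X² term is still a linear regression once Z = X² is treated as just another column:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 200)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 0.5, 200)  # quadratic ground truth

features = np.column_stack([x, x**2])   # columns: X and the new variable Z = X^2
model = LinearRegression().fit(features, y)
print(model.intercept_, model.coef_)    # estimates of beta0, beta1, beta2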


Are They Linear Regressions?

• Suppose i = 1, 2, …, 5, and the parameter is β


Linear Regression Estimation

• Parameters (the β's) need to be estimated

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

• Y: outcome
• β₀, …, βₖ: parameters
• X₁, …, Xₖ: predictors
• ε: random error (unexplained part)
Linear Regression Estimation – Intuition

Y = β₀ + β₁X + ε

• β₀: intercept
• β₁: slope

For datapoint i:
• Fitted value Ŷᵢ: the value predicted from the regression
• Error (residual): eᵢ = Yᵢ − Ŷᵢ
• The difference between an observed value and a predicted value
• A key component for performance measures
Linear Model Estimation

• Goal: minimize the total error

• We want each eᵢ to be close to zero => transform the errors so they are non-negative before adding them up

• Potential methods

• Mean (sum) of absolute errors: MAE = (1/n)(|e₁| + |e₂| + … + |eₙ|)

• Mean (sum) of squared errors: MSE = (1/n)(e₁² + e₂² + … + eₙ²)

Note: for both methods, n is the number of observations

Minimizing the MSE is equivalent to minimizing the sum of squared errors, since n is a constant.
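A small illustrative sketch (not from the slides) computing residuals, MAE, MSE, and the sum of squared errors with NumPy, using made-up observed and fitted values:

import numpy as np

y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.3, 6.9, 9.4])      # fitted values from some regression

e = y_obs - y_hat                            # residuals e_i = y_i - y_hat_i
mae = np.mean(np.abs(e))                     # mean of absolute errors
mse = np.mean(e ** 2)                        # mean of squared errors
sse = np.sum(e ** 2)                         # sum of squared errors (n * MSE)
print(mae, mse, sse)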
Linear Model Estimation – Least Squares Method

• Criterion: minimize the sum (mean) of squared errors

min_β MSE(β) ⇔ min_β Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

• β in bold represents a vector

• Solving for β (calculus & linear algebra):

• For each βₖ, take the partial derivative and set it to zero: ∂MSE(β)/∂βₖ = 0
• That gives K+1 equations; solve for the K+1 unknown β's
• The solution can be put in matrix form:

β = (X′X)⁻¹ X′Y

where X is an N×(K+1) matrix, Y is an N×1 vector, and β is a (K+1)×1 vector.
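A sketch of the closed-form solution above, assuming NumPy and simulated data; the intercept is handled by adding a column of ones to X:

import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 2
X_raw = rng.normal(size=(n, k))
y = 1.0 + X_raw @ np.array([2.0, -3.0]) + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), X_raw])   # N x (K+1) design matrix
beta = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X'X) beta = X'Y, i.e. beta = (X'X)^-1 X'Y
print(beta)                                # roughly [1, 2, -3]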
Gradient Descent
Linear Regression: An Example

• Recall the optimization problem (cost function): min_β Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

• Method 1: closed-form solution
• β = (X′X)⁻¹ X′Y
• Solvable directly
• But it can be computationally costly, and complicated models may have no closed-form solution

• Method 2: search for it
• Gradient Descent (an optimization algorithm)
• Solves an optimization problem iteratively
• Used in conjunction with neural networks, regressions, SVMs, …
Gradient Descent – Intuition

• Analogy: a river flowing down a mountain

• The cost function is a surface; we search for a minimum of that surface
• Move in the direction that descends most steeply

• Start somewhere (point A)
• Initial point/value

• Flow down in an (adjusted) direction
• Direction – gradient (slope)
• Distance – learning rate (step size)

• Keep flowing until the terrain is flat or you reach a lake (point B)
• Convergence – a local minimum
Gradient Descent – Concepts

• Search for the minimum point and the corresponding parameter values that minimize the cost function

• Initial point
• Gradient
• Partial derivatives w.r.t. all variables

• Learning rate
• How far to move along the chosen direction

• Next step: update the parameters by moving against the gradient, scaled by the learning rate (e.g., β ← β − η · gradient)
• Repeat until convergence
• You need to set: the initial point, the learning rate, and the convergence criteria (see the sketch below)
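A minimal sketch (added for illustration) of the gradient descent loop on a one-dimensional cost function; the function, initial point, learning rate, and tolerance here are all made up:

def gradient_descent(grad, x0, learning_rate=0.1, tol=1e-6, max_iter=10_000):
    x = x0
    for _ in range(max_iter):
        step = learning_rate * grad(x)   # how far to move against the gradient
        x = x - step
        if abs(step) < tol:              # convergence criterion: updates are tiny
            break
    return x

# Cost f(x) = (x - 3)^2 has gradient f'(x) = 2 * (x - 3) and its minimum at x = 3
print(gradient_descent(grad=lambda x: 2 * (x - 3), x0=10.0))   # approaches 3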
Learning Rate (Step Size)

• Large learning rate <=> large step size

• Small learning rate: slow convergence


• Large learning rate: divergence (cannot converge)
Pros and Cons

• Advantages
• Simple and usually effective on ML tasks

• Disadvantages
• May get stuck in a local minimum if the cost function is non-convex
• Mitigation: try multiple initial points
Batch Gradient Descent

• Use the whole dataset when computing the gradient – best for small datasets or simple models

• Example (linear regression):

• Cost function: min_β (1/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)²
• Compute the gradient (partial derivatives) of the cost function (MSE) w.r.t. each β
• Select a set of initial β values
• Compute the gradient using the whole dataset – plug in all X and y observations and the initial β
• Update β: move against the gradient, scaled by the learning rate (see the sketch below)
• Learning rate: pre-determined
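An illustrative sketch (not the course's exact code) of batch gradient descent for simple linear regression; every update uses the full dataset to compute the MSE gradient:

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, 200)
y = 4.0 + 1.5 * x + rng.normal(0, 0.5, 200)

b0, b1 = 0.0, 0.0            # initial parameter values
eta = 0.05                   # learning rate (pre-determined)
for _ in range(2000):
    y_hat = b0 + b1 * x
    # gradients of MSE = (1/n) * sum((y_hat - y)^2) w.r.t. b0 and b1
    g0 = 2 * np.mean(y_hat - y)
    g1 = 2 * np.mean((y_hat - y) * x)
    b0 -= eta * g0
    b1 -= eta * g1
print(b0, b1)                # should approach 4 and 1.5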
Stochastic Gradient Descent

• Huge dataset – still computationally expensive


• Do we need to use every sample to calculate the gradient and update the parameters at every
step (iterating until convergence)?
• How about pulling a subset of the data?

• In the extreme case => SGD:

• Pick one random instance (sample) to compute the gradient
• Update the parameters

• Adds noise to the gradient


• Less likely to be stuck at local minima
• More iterations to converge
Mini Batch Gradient Descent

• Batch Gradient Descent vs. Stochastic Gradient Descent


• Smooth gradient estimates (BGD) vs. scalability to large datasets (SGD)

• Balance the two


• Calculate gradient based on a mini-batch (e.g., size = 32 records)
• Not too much noise (compared to SGD)
• More efficient (compared to BGD)

• Need to decide on batch size

• Very commonly used
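A sketch of mini-batch gradient descent for the same simple regression (illustrative; NumPy assumed). Setting batch_size = 1 gives plain SGD, while batch_size = len(x) recovers batch gradient descent:

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, 1000)
y = 4.0 + 1.5 * x + rng.normal(0, 0.5, 1000)

b0, b1 = 0.0, 0.0
eta, batch_size = 0.05, 32
for epoch in range(200):
    order = rng.permutation(len(x))              # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        y_hat = b0 + b1 * xb
        b0 -= eta * 2 * np.mean(y_hat - yb)      # gradient from the mini-batch only
        b1 -= eta * 2 * np.mean((y_hat - yb) * xb)
print(b0, b1)                                    # close to 4 and 1.5 (a bit noisy)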


Polynomial Regressions
Polynomial Regression

• Generate new features consisting of polynomial combinations of the original features

• Captures non-linear patterns between Y and X

• But the model is still linear (linear in the parameters)

• Example (poly = 2): features (X₁, X₂) expand to (1, X₁, X₂, X₁², X₁X₂, X₂²) – see the sketch below
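A minimal sketch (assuming scikit-learn) of the feature expansion above using PolynomialFeatures with degree = 2, followed by an ordinary linear regression on the expanded features; the data is made up:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = np.array([2.0, 3.0, 11.0, 13.0, 32.0])            # made-up target values

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)                         # columns: 1, x1, x2, x1^2, x1*x2, x2^2
print(poly.get_feature_names_out())

model = LinearRegression().fit(X_poly, y)              # still linear in the parameters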
Example I: Polynomial Features

• Hyperbolic tangent, tanh(x):

• An S-shaped function
• Asymptotically approaches 1 and −1
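A sketch of this example (illustrative, assuming scikit-learn): approximate the S-shaped tanh curve with polynomial features of increasing degree and compare the in-sample fit:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.tanh(x).ravel()                      # levels off near -1 and 1

for degree in (1, 3, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    print(degree, model.score(x, y))        # R^2 improves as the degree grows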
Example I: Polynomial Features
Overfitting

• When overfitting occurs:

• The model captures the noise in the training data instead of the underlying structure
• Usually caused by an overly complicated model (e.g., too many variables, too complex a structure)

• Consequences:
• Does the model have low bias? Yes (low)
• Does the model have high variance? Yes (high: small changes in x produce large changes in predicted y)
• Predictive power? No
• Useful for interpretation? No

• Solutions / prevention
• Splitting the data
• Regularization
• Cross-validation
• …
Data Splitting – Intuition

• A model with an overfitting problem

• Nice performance on the data in hand
• Poor predictive accuracy on a new dataset

(Figure) Blue points: data collected; orange point: new data point; grey points: predicted values
• Error on the data collected: error = 0
• Error on the new data point: high error

What if the orange point is intentionally held out during estimation?
Data Splitting to Address Overfitting

• Nice performance on the data in hand
• Poor predictive accuracy on a new dataset

Split the data into two groups:

                    Training Set          Test Set
Hypothetically      “In-hand” data        “New” data (untouched)
Purpose             Train the model       Show performance
Under overfitting   Very low error        High error (poor performance)

ALWAYS split the data first!
Use the test set for the performance measure (a code sketch follows).
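A sketch of the workflow above (illustrative; scikit-learn assumed): split first, train a deliberately complex model on the training set only, and measure error on the untouched test set:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, 60).reshape(-1, 1)
y = np.tanh(x).ravel() + rng.normal(0, 0.1, 60)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# a deliberately complex model that is prone to overfitting
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(x_train, y_train)

print("train MSE:", mean_squared_error(y_train, model.predict(x_train)))
print("test MSE:", mean_squared_error(y_test, model.predict(x_test)))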
PYTHON PRACTICE
