Lecture 3
Regression Analysis
Ziyi Cao
University of Texas at Dallas
[email protected]
Spring 2025
Agenda
• Linear Regression
• Gradient Descent
• Polynomial Regression
• Python Practice
Linear Regression
Concepts
Least Squares Method
Linear Regression Concepts
• Simple linear regression
• Models the linear relationship between a numeric target variable and an
explanatory variable
• $Y$: target variable / dependent variable / outcome variable
• $X$: explanatory variable / independent variable / predictor / regressor
$Y = \beta_0 + \beta_1 X + \varepsilon$
Linear Regression Concepts
• Multiple linear regression
• Models the linear relationship between a numeric target variable and a set of
explanatory variables
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$
• $Y$: outcome
• $\beta_0, \beta_1, \dots, \beta_k$: parameters (to be estimated)
• $X_1, \dots, X_k$: predictors
• $\varepsilon$: random error (unexplained part)
Linear Regression Concepts
• Linear Relationship
• Linearity is defined with respect to the parameters, NOT the predictors
• The X values are observed (treated as constants); the parameters are the unknowns (to be solved for)
• You can always apply nonlinear transformations to the variables before fitting the model
Both of the following are linear regressions:
$Y = \beta_0 + \beta_1 X + \varepsilon$
$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$
Define a new variable $Z = X^2$; then the second model becomes $Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon$, which is linear in the parameters.
Are They Linear Regressions?
• Suppose $i = 1, 2, \dots, 5$ and the parameter is $\beta$
Linear Regression Estimation
• Parameters ($\beta$s) need to be estimated
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$
• $Y$: outcome
• $\beta_0, \dots, \beta_k$: parameters
• $X_1, \dots, X_k$: predictors
• $\varepsilon$: random error (unexplained part)
Linear Regression Estimation – Intuition
$Y = \beta_0 + \beta_1 X + \varepsilon$
• $\beta_0$: Intercept
• $\beta_1$: Slope
For datapoint $i$ with observed values $(X_i, Y_i)$:
• Fitted value: $\hat{Y}_i$
• Predicted from the regression
• Error (residual): $e_i = Y_i - \hat{Y}_i$
• Difference between an observed value and a predicted value
• Key component for performance measures
Linear Model Estimation
• Goal: Minimize the total error
• We want $e_i$ close to zero for each $i$ ⇒ transform the errors so that they are non-negative
• Potential methods
• Mean (sum) of absolute errors: $MAE = \frac{1}{n}(|e_1| + |e_2| + \dots + |e_n|)$
• Mean (sum) of squared errors: $MSE = \frac{1}{n}(e_1^2 + e_2^2 + \dots + e_n^2)$
Note: For both methods, n is the number of observations
Minimizing MSE is equivalent to minimizing the sum of squared errors; the factor $1/n$ does not change the minimizer.
Linear Model Estimation – Least Squares Method
• Criteria: minimize the sum (mean) of squared errors
$\min_{\boldsymbol{\beta}} MSE(\boldsymbol{\beta}) \;\Leftrightarrow\; \min_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• $\boldsymbol{\beta}$ in bold represents a vector
• Solving for $\boldsymbol{\beta}$ (calculus & linear algebra):
• For each $\beta_k$, take the partial derivative and set $\frac{\partial MSE(\boldsymbol{\beta})}{\partial \beta_k} = 0$
• This gives K+1 equations; solve for the K+1 unknown $\beta$s
• The solution can be put in matrix form: $\boldsymbol{\beta} = (X'X)^{-1} X'Y$
X is an N*(K+1) matrix; Y is an N*1 vector; $\boldsymbol{\beta}$ is a (K+1)*1 vector
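As a minimal sketch (not from the slides), the closed-form solution can be computed directly with NumPy; the data below are made up for illustration.

```python
import numpy as np

# Hypothetical data: N = 5 observations, K = 2 predictors
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])
y = np.array([6.1, 5.9, 12.2, 11.8, 16.0])

# Prepend a column of ones so beta_0 (the intercept) is estimated too: X is N x (K+1)
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Least squares solution: beta = (X'X)^{-1} X'Y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta)  # [beta_0, beta_1, beta_2]
```

In practice, np.linalg.lstsq is numerically safer than forming the inverse explicitly, but the line above mirrors the matrix formula on the slide.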
Gradient Descent
Linear Regression: An Example
• Recall the optimization problem: $\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (the cost function)
• Method 1: closed-form solution
• $\boldsymbol{\beta} = (X'X)^{-1} X'Y$
• Solvable, but computationally costly; complicated models may have no closed-form solution
• Method 2: search for it
• Gradient Descent (Optimization Algorithm)
• Solve an optimization problem
• In conjunction with neural networks, regressions, SVMs, …
Gradient Descent – Intuition
• River flowing down a mountain
• The cost function defines a surface (a hypersurface over the parameters)
• Searching for minima of cost function
• The lowest direction
• Start somewhere (A)
• Initial point/value
• Flow down to some (adjusted) directions
• Direction – gradient (slope)
• Distance – Learning rate (step size)
• Keep flowing until the ground is flat or you reach a lake (B)
• Convergence – local minima
Gradient Descent – Concepts
• Search for the minimum of the cost function and the parameter values that achieve it
• Initial point
• Gradient
• Partial derivative w.r.t all variables
• Learning rate
• Length to move along the direction
• Next step: move from the current point in the direction of the negative gradient, by a distance governed by the learning rate
• Repeat until convergence
• You need to set: initial point, learning rate, convergence criteria
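As a toy sketch of these three choices (not taken from the slides), here is gradient descent on the one-dimensional cost function $f(x) = (x - 3)^2$; the initial point, learning rate, and tolerance are arbitrary.

```python
# Gradient descent on f(x) = (x - 3)^2, whose gradient is f'(x) = 2(x - 3)
def grad(x):
    return 2 * (x - 3)

x = 10.0          # initial point (chosen arbitrarily)
lr = 0.1          # learning rate (step size)
tol = 1e-6        # convergence criterion on the update size

for step in range(1000):
    x_new = x - lr * grad(x)   # move against the gradient
    if abs(x_new - x) < tol:   # converged: the update is tiny
        break
    x = x_new

print(step, x)  # x should be close to the minimizer 3
```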
Learning Rate (Step Size)
• Large learning rate <=> large step size
• Small learning rate: slow convergence
• Large learning rate: divergence (the updates may overshoot and never converge)
Pros and Cons
• Advantages
• Simple and usually effective on ML tasks
• Disadvantages
• May get stuck in a local minimum if the cost function is non-convex
• Mitigation: try multiple initial points
Batch Gradient Descent
• Use the whole data set when computing the gradient – suitable for small data sets or simple models
• Example (Linear Regression):
• Compute the gradient (partial derivatives) of the cost function MSE w.r.t. each $\beta_k$
• Select a set of initial $\beta$s
• Gradient: partial derivatives of $MSE(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2$ w.r.t. each $\beta_k$
• Compute the gradient using the whole dataset – plug in all X and y observations and the initial $\beta$s
• Update $\boldsymbol{\beta}$: $\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} - \eta \, \nabla MSE(\boldsymbol{\beta})$
• Learning rate $\eta$: pre-determined
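A hedged sketch of batch gradient descent for simple linear regression; the synthetic data (roughly $y = 2 + 3x$), learning rate, and iteration count are assumptions made for illustration.

```python
import numpy as np

# Hypothetical data roughly following y = 2 + 3x
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=100)
y = 2 + 3 * x + rng.normal(scale=0.5, size=100)

b0, b1 = 0.0, 0.0   # initial parameters
lr = 0.05           # learning rate
n = len(x)

for _ in range(2000):
    y_hat = b0 + b1 * x
    # Gradients of MSE = (1/n) * sum((y_hat - y)^2) w.r.t. b0 and b1,
    # computed over the whole dataset (batch gradient descent)
    g0 = (2 / n) * np.sum(y_hat - y)
    g1 = (2 / n) * np.sum((y_hat - y) * x)
    b0 -= lr * g0
    b1 -= lr * g1

print(b0, b1)  # should end up close to 2 and 3
```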
Stochastic Gradient Descent
• For a huge dataset, batch gradient descent is still computationally expensive
• Do we really need every sample to compute the gradient and update the parameters at every step (iterating until convergence)?
• How about pulling a subset of data?
• In extreme cases => SGD:
• Picks one random instance (sample) to compute the gradient
• Update the parameters
• Adds noise to the gradient
• Less likely to be stuck at local minima
• More iterations to converge
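One possible illustration uses scikit-learn's SGDRegressor; the choice of library, the synthetic data, and all hyperparameters here are assumptions, not prescribed by the slides.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical data roughly following y = 2 + 3x
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(1000, 1))
y = 2 + 3 * X[:, 0] + rng.normal(scale=0.5, size=1000)

# SGDRegressor updates the parameters sample by sample (stochastic gradient descent)
sgd = SGDRegressor(loss="squared_error", learning_rate="constant",
                   eta0=0.01, max_iter=1000, tol=1e-4, random_state=0)
sgd.fit(X, y)
print(sgd.intercept_, sgd.coef_)  # roughly 2 and 3
```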
Mini Batch Gradient Descent
• Batch Gradient Descent vs. Stochastic Gradient Descent
• Smooth (but slow) updates vs. handling large data sets (but noisy updates)
• Mini-batch gradient descent balances the two
• Calculate gradient based on a mini-batch (e.g., size = 32 records)
• Not too much noise (compared to SGD)
• More efficient (compared to BGD)
• Need to decide on batch size
• Very commonly used
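A minimal sketch of the mini-batch update loop with batch size 32, as in the example above; the synthetic data and hyperparameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=1000)
y = 2 + 3 * x + rng.normal(scale=0.5, size=1000)

b0, b1 = 0.0, 0.0
lr, batch_size = 0.05, 32

for epoch in range(50):
    idx = rng.permutation(len(x))              # shuffle each epoch
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        y_hat = b0 + b1 * xb
        # Gradient of MSE computed on the mini-batch only
        b0 -= lr * (2 / len(xb)) * np.sum(y_hat - yb)
        b1 -= lr * (2 / len(xb)) * np.sum((y_hat - yb) * xb)

print(b0, b1)  # should end up close to 2 and 3
```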
Polynomial Regression
Polynomial Regression
• Generate new features consisting of polynomial combinations of the original
features
• Captures the non-linear pattern between Y and X
• But still a linear model
• Example (degree = 2): see the sketch below
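A small sketch of generating degree-2 polynomial features with scikit-learn's PolynomialFeatures; the two-feature input is a made-up example, and the use of scikit-learn is an assumption.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two original features x1 and x2 for three hypothetical observations
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Generated columns: x1, x2, x1^2, x1*x2, x2^2
print(poly.get_feature_names_out(["x1", "x2"]))
print(X_poly)
```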
Example I: Polynomial Features
• Hyperbolic tangent: tanh(x):
• S-shaped function
• Asymptotically approaches 1 and −1
Example I: Polynomial Features
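A possible reconstruction of this example (the polynomial degree and sample size are assumptions): fit a linear regression on polynomial features to data generated from tanh(x).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Generate data from the hyperbolic tangent
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.tanh(x).ravel()

# Polynomial regression: still a linear model in the parameters,
# but fit on polynomial features of x (degree chosen arbitrarily)
model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
model.fit(x, y)

print(model.predict([[0.5]]), np.tanh(0.5))  # fitted vs. true value
```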
Overfitting
• When overfitting occurs:
• The model captures the noise in the training data instead of the underlying structure
• Overly complicated model (e.g., too many variables, too complicated a structure)
• Consequences:
• Bias? Low
• Variance? High (small changes in x lead to large changes in predicted y)
• Predictive power on new data? Poor
• Significance for interpretation? No
• Solution / Preventions
• Splitting the data
• Regularization
• Cross-validation (a sketch follows below)
• …
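As a hedged sketch of the cross-validation idea (assuming scikit-learn), cross_val_score evaluates the model on several held-out folds rather than on the data it was trained on; the data here are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = 2 + 3 * X[:, 0] + rng.normal(scale=0.5, size=200)

# 5-fold cross-validation: each fold is held out once for evaluation
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(-scores.mean())  # average held-out MSE
```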
Data Splitting – Intuition
• Model with overfitting problem
• Nice performance for data in hand
• Poor predictive accuracy for new dataset
[Figure: an overfitted fit. Blue points: data collected; orange point: a new data point; grey points: predicted values. The error on the collected data is 0, but the error on the new data point is high. What if the orange point were intentionally held out during estimation?]
Data Splitting to Address Overfitting
• Nice performance for data in hand
• Poor predictive accuracy for new dataset
Split the data into two groups:
• Training Set – hypothetically the “in-hand” data; purpose: train the model
• Test Set – hypothetically the “new” data, kept untouched; purpose: show performance
• Under overfitting: very low error on the training set, but high error (poor performance) on the test set
ALWAYS split the data first!
Test Set for Performance Measure
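A minimal sketch of splitting the data before any modeling, assuming scikit-learn's train_test_split; the 80/20 split and the synthetic data are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = 2 + 3 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Split FIRST: the test set stays untouched until performance measurement
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("Train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```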
PYTHON PRACTICE