Lecture 5 Regression


Linear Regression

A regression attempts to fit a function to observed data to make predictions on new
data. A linear regression fits a straight line to observed data, attempting to
demonstrate a linear relationship between variables and make predictions on new
data yet to be observed.

Model: ŷ = ax + b, where a is the slope and b is the intercept.
The fitted regression line ŷ is the line that minimises the distance between the
observed data and the fitted line, i.e. the residuals.
ε = residual error (the vertical distance between an observed point and the line)
The Least Squares (Regression) Line
A good line is one that minimizes the sum of squared differences between the points and the line.
Let us compare two lines for the data points (1, 2), (2, 4), (3, 1.5) and (4, 3.2).

The first line predicts 1, 2, 3 and 4 at x = 1, 2, 3, 4 (the line y = x):
Sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89

The second line is horizontal at y = 2.5:
Sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
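A quick numeric check of these two sums, as a minimal sketch in plain Python (the data points and candidate lines are taken from the example above):

```python
# Data points from the example: (x, y)
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]

def sum_squared_diff(predict, xs, ys):
    """Sum of squared differences between observed y and the line's prediction."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))

line1 = lambda x: x       # the first line: y = x
line2 = lambda x: 2.5     # the second (horizontal) line: y = 2.5

print(sum_squared_diff(line1, xs, ys))  # ≈ 7.89
print(sum_squared_diff(line2, xs, ys))  # ≈ 3.99
```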
Sum of Squares for Errors
• This is the sum of squared differences between the points and the regression line.
• It can serve as a measure of how well the line fits the data. SSE is defined by

SSE = Σ (yᵢ − ŷᵢ)²   (summed over i = 1, …, n)

– A shortcut formula (valid when b₀ and b₁ are the least-squares intercept and slope):

SSE = Σ yᵢ² − b₀ Σ yᵢ − b₁ Σ xᵢyᵢ
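A brief check that the shortcut formula agrees with the direct definition, as a sketch assuming b₀ and b₁ are the least-squares estimates (obtained here with numpy.polyfit):

```python
import numpy as np

xs = np.array([1, 2, 3, 4], dtype=float)
ys = np.array([2, 4, 1.5, 3.2])

# Least-squares slope (b1) and intercept (b0); polyfit returns highest degree first.
b1, b0 = np.polyfit(xs, ys, 1)

y_hat = b0 + b1 * xs
sse_direct = np.sum((ys - y_hat) ** 2)
sse_shortcut = np.sum(ys**2) - b0 * np.sum(ys) - b1 * np.sum(xs * ys)

print(np.isclose(sse_direct, sse_shortcut))  # True
```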
Least Squares Regression
To find the best line we must minimise the sum of the squares of the residuals
(the vertical distances from the data points to our line).

Model line: ŷ = ax + b, where a = slope and b = intercept
Residual: ε = y − ŷ, where ŷ is the predicted value and y is the true value
Sum of squares of residuals = Σ (y − ŷ)²

We must find the values of a and b that minimise Σ (y − ŷ)².
Finding b
First we find the value of b that gives the minimum sum of squares.
Trying different values of b is equivalent to shifting the line up and down the scatter plot.
Finding a
Now we find the value of a that gives the minimum sum of squares.
Trying out different values of a is equivalent to changing the slope of the line, while b stays constant.
Testing the slope
• When no linear relationship exists between two variables, the regression line should be horizontal.
• Linear relationship: different inputs (x) yield different outputs (y); the slope is not equal to zero.
• No linear relationship: different inputs (x) yield the same output (y); the slope is equal to zero.
Minimising sums of squares
• We need to minimise Σ (y − ŷ)², and ŷ = ax + b, so we need to minimise the sum of squares
  S = Σ (y − ax − b)²
• If we plot the sum of squares S for all different values of a and b we get a bowl-shaped (parabolic) surface, because S is a sum of squared terms.
• The minimum sum of squares is at the bottom of the curve, where the gradient is zero.
• So we can find the a and b that give the minimum sum of squares by taking partial derivatives of Σ (y − ax − b)² with respect to a and b separately.
• We then set these derivatives to zero and solve, which gives the values of a and b that minimise the sum of squares.
Doing this gives the following equation for a:

a = r · sy / sx

where r = correlation coefficient of x and y, sy = standard deviation of y, sx = standard deviation of x.

◼ From this you can see that:
▪ A low correlation coefficient gives a flatter slope (small value of a).
▪ A large spread of y, i.e. a high standard deviation sy, results in a steeper slope (larger value of a).
▪ A large spread of x, i.e. a high standard deviation sx, results in a flatter slope (smaller value of a).
The solution cont.
• Our model equation is ŷ = ax + b.
• This line must pass through the mean point (x̄, ȳ), so:
  ȳ = a·x̄ + b, which gives b = ȳ − a·x̄
◼ Substituting our equation for a into this gives:

b = ȳ − (r · sy / sx) · x̄

where r = correlation coefficient of x and y, sy = standard deviation of y, sx = standard deviation of x.
◼ The smaller the correlation, the closer the intercept is to the mean of y.
Back to the model

ŷ = ax + b = (r · sy / sx) · x + ȳ − (r · sy / sx) · x̄

This rearranges to: ŷ = (r · sy / sx) · (x − x̄) + ȳ

• If the correlation is zero, we will simply predict the mean of y for every value of x, and our regression line is just a flat horizontal line at height ȳ.
• But this isn't very useful.
• We can calculate the regression line for any data, but the important question is how well this line fits the data, or how good it is at predicting y from x.
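A short sketch of these formulas with NumPy; the data values are illustrative, and the variable names a, b, r, sx, sy mirror the notation above:

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 4, 1.5, 3.2])

r = np.corrcoef(x, y)[0, 1]                      # correlation coefficient of x and y
sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)    # sample standard deviations

a = r * sy / sx               # slope
b = y.mean() - a * x.mean()   # intercept: the line passes through (x̄, ȳ)

y_hat = a * x + b             # predictions from ŷ = ax + b
print(a, b)
```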
Regression Sums of Squares
Sum of squares due to the regression: the difference between TSS and SSE, i.e. SSR = TSS − SSE.

SSR = Σ (yᵢ − ȳ)² − Σ (yᵢ − ŷᵢ)² = Σ (ŷᵢ − ȳ)²   (summed over i = 1, …, n)

SSR measures how much variability in the response is explained by the regression.
Graphical View
Linear model: ŷᵢ = β̂₀ + β̂₁xᵢ
Mean model: ŷᵢ = ȳ

TSS = SSR + SSE

Total variability in the y-values = variability accounted for by the regression + unexplained variability
• If the regression model fits well, then SSR approaches TSS and SSE gets small.
• If the regression model adds little, then SSR approaches 0 and SSE approaches TSS.
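A sketch of this decomposition with NumPy (illustrative data; the fitted line comes from numpy.polyfit):

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 4, 1.5, 3.2])

a, b = np.polyfit(x, y, 1)               # least-squares slope and intercept
y_hat = a * x + b

tss = np.sum((y - y.mean()) ** 2)        # total variability in y
sse = np.sum((y - y_hat) ** 2)           # unexplained variability
ssr = np.sum((y_hat - y.mean()) ** 2)    # variability explained by the regression

print(np.isclose(tss, ssr + sse))        # True: TSS = SSR + SSE
print(ssr / tss)                         # proportion of variability explained
```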
Basic Linear Regression with SciPy – Intercept Determination
Python code (a sketch follows below).
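A minimal sketch of a basic linear regression with scipy.stats.linregress, using illustrative data; the slope and intercept attributes of the result give a and b:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 4, 1.5, 3.2])

# linregress returns the slope, intercept, correlation coefficient r,
# p-value and standard error of the fit.
result = stats.linregress(x, y)

print("slope (a):", result.slope)
print("intercept (b):", result.intercept)
print("r:", result.rvalue)

# Predict new values with ŷ = a·x + b
y_hat = result.slope * x + result.intercept
```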
Closed Form Equation
For a simple linear regression with only one input and one output variable, here are the closed-form equations to calculate m and b.
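For data points (xᵢ, yᵢ), i = 1, …, n, the standard closed-form least-squares solution is:

m = (n Σ xᵢyᵢ − Σ xᵢ Σ yᵢ) / (n Σ xᵢ² − (Σ xᵢ)²)

b = (Σ yᵢ − m Σ xᵢ) / n = ȳ − m·x̄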
Calculating m and b using Python
Basic Linear Regression with m and b Calculation (a sketch follows below).
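A minimal sketch of the closed-form calculation in plain Python (no SciPy), with illustrative data:

```python
# Example data (illustrative)
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Closed-form least-squares solution for slope m and intercept b
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

print("m =", m, "b =", b)
```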
What defines a “best fit”?
How do we get to that “best fit”?
Visualizing the sum of squares – the sum of all areas where each square has a side length equal to the residual

• We minimize the squares, or more specifically the sum of the squared residuals.
• Draw any line through the points. The residual is the numeric difference between the line and the points.
• Points above the line have a positive residual, and points below the line have a negative residual. In other words, the residual is the difference between the predicted y-values (derived from the line) and the actual y-values (which came from the data).
• Another name for residuals is errors, because they reflect how wrong our line is in predicting the data.
Calculating the residuals for a given line and data
Calculating the sum of squares for a given line and data
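A sketch of both steps as small Python functions, where the line is described by its slope m and intercept b (names chosen for illustration):

```python
def residuals(m, b, xs, ys):
    """Residuals for a given line y = m*x + b and data: actual y minus predicted y."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]

def sum_of_squares(m, b, xs, ys):
    """Sum of squared residuals for a given line and data."""
    return sum(r ** 2 for r in residuals(m, b, xs, ys))

# Example usage with the small data set used earlier
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
print(residuals(0, 2.5, xs, ys))        # residuals against the horizontal line y = 2.5
print(sum_of_squares(0, 2.5, xs, ys))   # ≈ 3.99
```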
Linear Regression – Inverse Matrix Techniques

• We can use transposed and inverse matrices to fit a linear regression.
• We calculate a vector of coefficients b given a matrix of input variable values X and a vector of output variable values y.
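In matrix form this is the normal equation b = (XᵀX)⁻¹Xᵀy. Below is a minimal sketch with NumPy, where the first column of X is all ones so that the intercept is included in b (data values are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 4, 1.5, 3.2])

# Design matrix X: a column of ones (for the intercept) next to the input values.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: b = (Xᵀ X)⁻¹ Xᵀ y
b = np.linalg.inv(X.T @ X) @ X.T @ y
intercept, slope = b

print("intercept:", intercept, "slope:", slope)
```

In practice, numpy.linalg.lstsq (or a QR decomposition) is preferred over an explicit matrix inverse because it is more numerically stable, but the explicit form above mirrors the equation.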
