2. Linear Regression
Linear Regression - Sample Data . . .
y = β0 + β1 x    (1)
sales ≈ β0 + β1 × TV
Once we have used our training data to produce estimates β̂0 and
β̂1 for the model coefficients, we can predict future sales on the
basis of a particular value of TV advertising by computing
ŷ = β̂0 + β̂1 x    (2)
The model relating Y to X also includes a random error term ε:
Y = β0 + β1 X + ε    (3)
The inclusion of the random error term ε allows (x, y) to fall either
above the true regression line (when ε > 0) or below the line
(when ε < 0).
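To make the role of ε concrete, the sketch below simulates data from the model; the coefficient values, noise level, and sample size are assumed purely for illustration and are not taken from the slides. Points with ε > 0 land above the true line and points with ε < 0 land below it.

    import numpy as np

    rng = np.random.default_rng(0)
    beta0, beta1, sigma = 7.0, 0.05, 1.5      # assumed true coefficients and noise sd

    x = rng.uniform(0, 300, size=50)          # e.g. TV advertising budgets
    eps = rng.normal(0, sigma, size=50)       # random error term
    y = beta0 + beta1 * x + eps               # Y = beta0 + beta1*X + eps

    print((eps > 0).sum(), "points above the true line,", (eps < 0).sum(), "below")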
The point estimates of β0 and β1, denoted by β̂0 and β̂1 and called
the least squares estimates, are those values that minimize
f(b0, b1). That is, β̂0 and β̂1 are such that f(β̂0, β̂1) ≤ f(b0, b1) for
any b0 and b1. The estimated regression line or least squares line is
then the line whose equation is
y = β̂0 + β̂1 x
f(b0, b1) = Σ [yi − (b0 + b1 xi)]²    (sum over i = 1, …, n)
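As a sketch, the criterion f can be written directly in code (x and y are assumed to be NumPy arrays of the observed values):

    import numpy as np

    def f(b0, b1, x, y):
        """Sum of squared vertical deviations of the points from the line b0 + b1*x."""
        return np.sum((y - (b0 + b1 * x)) ** 2)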
Setting the partial derivatives of f(b0, b1) equal to zero gives the normal equations.
Slope:
∂f(b0, b1)/∂b1 = ∂/∂b1 Σ(yi − b0 − b1 xi)² = Σ 2(yi − b0 − b1 xi)(−xi) = −2 Σ xi(yi − b0 − b1 xi) = 0.
Intercept:
∂f(b0, b1)/∂b0 = −2 Σ(yi − b0 − b1 xi) = 0, which after multiplying by −1/2 and distributing the sum gives
0 = Σyi − Σb0 − b1 Σxi
N b0 = Σyi − b1 Σxi
b0 = Σyi/N − b1 Σxi/N = ȳ − b1 x̄
Substituting b0 = Σyi/N − b1 Σxi/N into the slope equation Σxiyi − b0 Σxi − b1 Σxi² = 0 and solving for b1:
b1 Σxi² = Σxiyi − (Σxi)(Σyi)/N + b1 (Σxi)²/N
b1 [Σxi² − (Σxi)²/N] = Σxiyi − (Σxi)(Σyi)/N
b1 = [Σxiyi − (Σxi)(Σyi)/N] / [Σxi² − (Σxi)²/N]
Equivalently,
b1 = β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = Sxy/Sxx
where
Sxy = Σxiyi − (Σxi)(Σyi)/N    (the numerator)
Sxx = Σxi² − (Σxi)²/N    (the denominator)
and
b0 = β̂0 = (Σyi − β̂1 Σxi)/N = ȳ − β̂1 x̄
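As a sketch, the closed-form estimates translate directly into code (assuming x and y are NumPy arrays of the observations; the function name is an illustrative choice, not from the slides):

    import numpy as np

    def least_squares_fit(x, y):
        """Return (b0, b1) minimizing the sum of squared deviations."""
        n = len(x)
        Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # the numerator
        Sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n         # the denominator
        b1 = Sxy / Sxx                                    # slope estimate (beta1-hat)
        b0 = np.mean(y) - b1 * np.mean(x)                 # intercept estimate (beta0-hat)
        return b0, b1

The centred form Σ(xi − x̄)(yi − ȳ)/Σ(xi − x̄)² gives the same slope and is numerically more stable when the xi are large.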
Sxy = 71,347.30 − (1307.5)(779.2)/14 = −1424.41429
The estimated slope of the true regression line (i.e., the slope of the
least squares line) is
β̂1 = Sxy/Sxx = −1424.41429/6802.7693 = −.20938742
We estimate that the expected change in true average cetane number
associated with a 1 g increase in iodine value is −.209, i.e., a decrease
of .209. Since x̄ = 93.392857 and ȳ = 55.657143, the estimated
intercept of the true regression line (i.e., the intercept of the least
squares line) is
β̂0 = ȳ − β̂1 x̄ = 55.657143 − (−.20938742)(93.392857) = 75.2122432
The equation of the estimated regression line (least squares line) is
y = 75.212 − .2094x, exactly that reported in the cited article.
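The arithmetic can be checked directly from the summary statistics quoted in this example (n = 14, Σxi = 1307.5, Σyi = 779.2, Σxiyi = 71,347.30, Sxx = 6802.7693); this is only a verification sketch, since the raw data are not reproduced on the slides:

    n = 14
    sum_x, sum_y, sum_xy = 1307.5, 779.2, 71347.30
    Sxx = 6802.7693

    Sxy = sum_xy - sum_x * sum_y / n      # -1424.414...
    b1 = Sxy / Sxx                        # -0.2093874...
    b0 = sum_y / n - b1 * sum_x / n       # 75.212...  (ybar - b1*xbar)
    print(Sxy, b1, b0)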
The estimated regression line can immediately be used for two
different purposes. For a fixed x value x∗, β̂0 + β̂1 x∗ (the height of the
line above x∗ ) gives either (1) a point estimate of the expected value
of Y when x = x∗ or (2) a point prediction of the Y value that will
result from a single new observation made at x = x∗ . For instance a
point estimate of true average cetane number for all biofuels whose
iodine value is 100 is
µ̂Y·100 = β̂0 + β̂1(100) = 75.212 − .2094(100) = 54.27
If a single biofuel sample whose iodine value is 100 is to be selected,
54.27 is also a point prediction for the resulting cetane number.
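For instance, the point estimate (and point prediction) at x∗ = 100 is just the height of the fitted line there:

    b0, b1 = 75.212, -0.2094     # rounded least squares estimates from above
    print(b0 + b1 * 100)         # 54.272, i.e. about 54.27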
The least squares line should not be used to make a prediction for an
x value much beyond the range of the data, such as x = 40 or x = 150
in the Example given above. The danger of extrapolation is that
the fitted relationship (a line here) may not be valid for such x values.
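One simple safeguard is to refuse to predict when x∗ lies outside the observed range of x values; the helper below is a hypothetical sketch, not something defined on the slides:

    def predict_within_range(x_star, x, b0, b1):
        lo, hi = min(x), max(x)
        if not lo <= x_star <= hi:
            # extrapolation: the fitted line may not be valid out here
            raise ValueError(f"x* = {x_star} is outside the observed range [{lo}, {hi}]")
        return b0 + b1 * x_star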
β̂1 = 776.434/18,921.8295 = .04103377 ≈ .041
β̂0 = 78.74 − (.04103377)(140.895) = 72.958547 ≈ 72.96
Fitted Values . . .
The least squares line is y = 72.96 + .041x.
The fitted values are calculated from ŷi = 72.958547 + .04103377 xi, e.g.
ŷ1 = 72.958547 + .04103377(125.3) ≈ 78.100,  y1 − ŷ1 ≈ −.200
The computational formula SSE = Σyi² − β̂0 Σyi − β̂1 Σxiyi is obtained by
substituting ŷi = β̂0 + β̂1 xi into Σ(yi − ŷi)², squaring the
summand, carrying through the sum to the resulting three terms, and
simplifying.
These computational formulas are especially sensitive to the effects of
rounding in β̂0 and β̂1, so carrying as many digits as possible in
intermediate computations will protect against round-off error.
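A sketch of the fitted-value and SSE computations, using the rounded estimates from this example; the arrays x and y stand in for the full data set, which is not listed on the slides:

    import numpy as np

    b0, b1 = 72.958547, 0.04103377          # estimates carried to many digits

    # First fitted value quoted above (x1 = 125.3): about 78.100
    print(b0 + b1 * 125.3)

    def sse_two_ways(x, y, b0, b1):
        y_hat = b0 + b1 * x
        sse_def = np.sum((y - y_hat) ** 2)                              # definitional form
        sse_comp = np.sum(y**2) - b0 * np.sum(y) - b1 * np.sum(x * y)   # computational form
        return sse_def, sse_comp    # agree up to round-off in b0, b1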
Figure: Using the model to explain y variation: (a) data for which all
variation is explained; (b) data for which most variation is explained; (c)
data for which little variation is explained
A quantitative measure of the total amount of variation in observed y
values is given by the total sum of squares
SST = Syy = Σ(yi − ȳ)² = Σyi² − (Σyi)²/n
1 The total sum of squares is the sum of squared deviations about the
sample mean of the observed y values.
2 The same number ȳ is subtracted from each yi in SST, whereas
SSE involves subtracting each different predicted value ŷi from
the corresponding observed yi.
3 SSE is the sum of squared deviations about the least squares line
y = β̂0 + β̂1 x.
4 SST is the sum of squared deviations about the horizontal line at
height ȳ (since then the vertical deviations are yi − ȳ).
The sum of squared deviations about the least squares line is smaller
than the sum of squared deviations about any other line, so SSE < SST
unless the horizontal line itself is the least squares line.
The ratio SSE/SST is the proportion of total variation that cannot be
explained by the simple linear regression model, and 1 − SSE/SST
(a number between 0 and 1) is the proportion of observed y
variation explained by the model.
The coefficient of determination
r² = 1 − SSE/SST
It is interpreted as the proportion of observed y variation that can be
explained by the simple linear regression model (attributed to an
approximate linear relationship between y and x).
The higher the value of r², the more successful the simple
linear regression model is in explaining y variation.
When regression analysis is done by a statistical computer
package, either r² or 100r² (the percentage of variation explained
by the model) is a prominent part of the output.
If r² is small, an analyst will usually want to search for an
alternative model (either a nonlinear model or a multiple
regression model that involves more than a single independent
variable) that can more effectively explain y variation.
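Putting the pieces together, a minimal sketch that computes SST, SSE, and r² for paired data; the function name and the simulated data at the end are assumptions for illustration only:

    import numpy as np

    def r_squared(x, y):
        n = len(x)
        b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
        b0 = np.mean(y) - b1 * np.mean(x)
        y_hat = b0 + b1 * x
        sse = np.sum((y - y_hat) ** 2)        # variation left unexplained by the line
        sst = np.sum((y - np.mean(y)) ** 2)   # total variation about the mean of y
        return 1 - sse / sst                  # coefficient of determination

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 30)
    y = 2 + 0.5 * x + rng.normal(0, 1, 30)
    print(r_squared(x, y))                    # close to 1 when the linear fit is good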