4 Regression models in R
Recall that
In a cause and effect relationship, the independent variable is the cause, and the
dependent variable is the effect.
Here, we focus on the case where there is only one independent variable. This is called
simple regression (as opposed to multiple regression, which handles two or more
independent variables).
Least squares linear regression is a method for predicting the value of a dependent
variable Y, based on the value of an independent variable X.
Simple linear regression is appropriate when the following conditions are satisfied.
The dependent variable Y has a linear relationship to the independent variable X. To
check this, make sure that the XY scatterplot is linear and that the residual plot shows
a random pattern, as in the sketch below.
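As a quick illustration, here is a minimal sketch of both checks in base R, using the five-student data set from the worked example later in this section:
> # Sketch: visual checks for linearity (data from the example below)
> x <- c(95, 85, 80, 70, 60)
> y <- c(85, 95, 70, 65, 70)
> plot(x, y)                    # scatterplot should look roughly linear
> fit <- lm(y ~ x)
> plot(x, resid(fit))           # residual plot should show a random pattern
> abline(h = 0, lty = 2)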
Linear regression finds the straight line, called the least squares regression line or LSRL,
that best represents observations in a bivariate data set.
Suppose Y is a dependent variable, and X is an independent variable. The population
regression line is:
$$Y = B_0 + B_1 X$$
where $B_0$ is a constant, $B_1$ is the regression coefficient, X is the value of the independent
variable, and Y is the value of the dependent variable.
Given a random sample of observations, the population regression line is estimated by:
$$\hat{y} = b_0 + b_1 x$$
For observation i, the model is
$$y_i = \alpha + \beta x_i + \varepsilon_i.$$
To find regression estimates $b_0$ and $b_1$, one has to solve the following minimization
problem:
$$\min_{b_0,\, b_1} Q(b_0, b_1), \quad \text{where } Q(b_0, b_1) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2$$
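To make the criterion concrete, the following minimal sketch minimizes Q numerically with R's general-purpose optimizer optim() (x and y as in the earlier sketch); the result should approximately match the closed-form estimates given next.
> # Sketch: minimizing Q(b0, b1) numerically (illustration only)
> Q <- function(b) sum((y - b[1] - b[2] * x)^2)
> optim(c(0, 0), Q)$par         # approximately c(26.78, 0.644)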
The solution is
$$b_1 = r_{xy} \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
where $r_{xy}$ is the sample correlation coefficient between x and y; and $s_x$ and $s_y$ are the
sample standard deviations of x and y. A horizontal bar over a quantity indicates the
average value of that quantity.
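These formulas can be evaluated directly in R (a minimal sketch; x and y as above):
> b1 <- cor(x, y) * sd(y) / sd(x)   # b1 = r_xy * s_y / s_x
> b0 <- mean(y) - b1 * mean(x)      # b0 = y-bar - b1 * x-bar
> round(c(b0, b1), 4)
[1] 26.7808  0.6438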
When the regression parameters ($b_0$ and $b_1$) are defined as described above, the regression line
has the following properties.
The line minimizes the sum of squared differences between observed values (the y values)
and predicted values (the $\hat{y}$ values computed from the regression equation).
The regression line passes through the mean of the X values ($\bar{x}$) and through the mean of
the Y values ($\bar{y}$), as verified in the sketch below.
The regression constant ($b_0$) is equal to the y intercept of the regression line.
The regression coefficient ($b_1$) is the average change in the dependent variable (Y) for a
1-unit change in the independent variable (X). It is the slope of the regression line.
The least squares regression line is the only straight line that has all of these properties.
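The pass-through-the-means property is easy to verify numerically (a sketch; fit, x, and y as in the earlier snippets):
> # The fitted value at x-bar equals y-bar
> predict(fit, data.frame(x = mean(x)))
 1 
77 
> mean(y)
[1] 77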
Coefficient of determination
The coefficient of determination ($R^2$) for a linear regression model with one independent
variable is:
$$R^2 = \left( \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N \sigma_x \sigma_y} \right)^2$$
where N is the number of observations used to fit the model, $x_i$ is the x value for observation i,
$\bar{x}$ is the mean x value, $y_i$ is the y value for observation i, $\bar{y}$ is the mean y value, $\sigma_x$ is the
standard deviation of x, and $\sigma_y$ is the standard deviation of y.
If you know the linear correlation (r) between two variables, then the coefficient of
determination ($R^2$) is easily computed using the following formula: $R^2 = r^2$.
The standard error about the regression line (often denoted by SE) is a measure of the
average amount by which the regression equation over- or under-predicts. The higher the
coefficient of determination, the lower the standard error, and the more accurate the
predictions are likely to be. Both quantities are computed in the sketch below.
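A minimal sketch computing both quantities for the example data (x and y as above; this assumes the residual standard error reported by summary() is the SE intended here):
> cor(x, y)^2                    # coefficient of determination, r^2
[1] 0.4803218
> summary(lm(y ~ x))$r.squared   # same value from the fitted model
[1] 0.4803218
> summary(lm(y ~ x))$sigma       # standard error about the regression line
[1] 10.44665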
Last year, five randomly selected students took a math aptitude test before they began their
statistics course. Their aptitude scores (x) and subsequent statistics grades (y) were:
x = 95, 85, 80, 70, 60 and y = 85, 95, 70, 65, 70. The Statistics Department has three questions.
What linear regression equation best predicts statistics performance, based on math
aptitude scores?
If a student made an 80 on the aptitude test, what grade would we expect her to make in
statistics?
How well does the regression equation fit the data?
Then
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{470}{730} = 0.644$$
$$b_0 = \bar{y} - b_1 \bar{x} = 77 - 0.644 \cdot 78 = 26.768$$
so the estimated regression equation is $\hat{y} = 26.768 + 0.644x$.
Once you have the regression equation, using it is a snap. Choose a value for the
independent variable (x), perform the computation, and you have an estimated value ($\hat{y}$)
for the dependent variable.
In our example, the independent variable is the student's score on the aptitude test. The
dependent variable is the student's statistics grade. If a student made an 80 on the
aptitude test, the estimated statistics grade would be:
$$\hat{y} = 26.768 + 0.644 \cdot 80 = 78.288$$
Warning: When you use a regression equation, do not use values for the independent
variable that are outside the range of values used to create the equation. Such
an extrapolation can produce unreasonable estimates.
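In R, the same prediction is obtained with predict() (a sketch; fit as in the earlier snippets):
> predict(fit, data.frame(x = 80))
       1 
78.28767 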
A coefficient of determination equal to 0.48 indicates that about 48% of the variation in
statistics grades (the dependent variable) can be explained by the relationship to math
aptitude scores (the independent variable). This would be considered a good fit to the
data, in the sense that it would substantially improve an educator's ability to predict
student performance in statistics class.
Example
> x <- c(95, 85, 80, 70, 60)
> y <- c(85, 95, 70, 65, 70)
> lmMod <- lm(y ~ x)
> lmMod
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
26.7808 0.6438
> summary(lmMod)
Call:
lm(formula = y ~ x)
Residuals:
1 2 3 4 5
-2.945 13.493 -8.288 -6.849 4.589
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.7808 30.5182 0.878 0.445
x 0.6438 0.3866 1.665 0.194
There are many ways to transform variables to achieve linearity for regression analysis. Some
common methods are summarized below.
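For example, when y grows exponentially with x, regressing log(y) on x gives a linear relationship. A minimal sketch with synthetic (assumed) data:
> # Sketch: linearizing an exponential trend via a log transformation
> set.seed(1)                         # synthetic, assumed example data
> x2 <- 1:10
> y2 <- exp(0.3 * x2 + rnorm(10, sd = 0.1))
> logMod <- lm(log(y2) ~ x2)          # linear on the log scale
> coef(logMod)                        # slope should be close to 0.3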