In Class Exercise Linear Regression in R
You'll need two files to do this exercise: linearRegression.r (the R script file) and mtcars.csv
(the data file). Both of those files can be found on the course site. The data was extracted
from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of
automobile design and performance for 32 automobiles.
Download both files and save them to the folder where you keep your R files.
This is the raw data for our analysis. It is a comma-separated values (CSV) file, which just
means that each data value is separated by a comma.
Now look at the contents of the file. The first line contains the names of the fields (think
of them like columns in a spreadsheet). You can see the first field is called model, the
second field is called mpg, the third field is called cyl, and so on. The remaining lines of
the file contain the data for each car model.
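You can preview the data in R before running the script. R also ships a built-in copy of this same data set (also called mtcars), so the preview below runs even without the CSV file; reading your downloaded file with read.csv is shown as a comment. (In the built-in copy, the model names are stored as row names rather than as a model column.)

```r
# Option 1: read the downloaded CSV (assumes it is in your working directory)
# cars <- read.csv("mtcars.csv")

# Option 2: use R's built-in copy of the same data set
head(mtcars)    # first six rows of the data
names(mtcars)   # the field (column) names: mpg, cyl, disp, ...
```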
Here is the full list of the variables:

Variable   Description
mpg        Miles/(US) gallon (fuel efficiency)
cyl        Number of cylinders
disp       Displacement (cu. in.)
hp         Gross horsepower
drat       Rear axle ratio
wt         Weight (1000 lbs)
qsec       1/4 mile time
vs         Engine shape (0 = V-shaped, 1 = straight)
am         Transmission (0 = automatic, 1 = manual)
gear       Number of forward gears
carb       Number of carburetors
We will use this data set to predict the miles per gallon (mpg) based on any combination of
the remaining variables (i.e., cyl, wt, etc.).
mpg is a typical outcome variable for regression analysis because it is a continuous
value.
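A quick way to confirm that mpg is continuous is to look at its summary statistics (again using R's built-in copy of the data):

```r
# mpg is a numeric, continuous variable: summary() shows its range and quartiles
summary(mtcars$mpg)
str(mtcars$mpg)   # num [1:32] ...
```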
Value                  Description
mtcars.csv             (data file)
RegressionOutput.txt   (output file)
3) One piece of good news about this analysis: we do not need to install any additional
packages.
4) Now let's look at the simple linear regression model with only one predictor. Scroll down
in the script file linearRegression.r.
5) Now let's look at the multiple linear regression model with more than one predictor.
The only change compared to the previous one is that now we have more than one
predictor (i.e., wt, disp, and cyl). Specifically, we are now looking at the effect of not just
weight, but also the number of cylinders and the volume, or displacement, of the car, on
fuel efficiency.
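The multiple regression just described can be fit like this (mfit matches the object name used later in this handout; the built-in mtcars data is used so the snippet is self-contained):

```r
# Multiple linear regression: mpg explained by weight, displacement, and cylinders
mfit <- lm(mpg ~ wt + disp + cyl, data = mtcars)
summary(mfit)   # coefficient table, R-squared, F-statistic
```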
This output contains a lot of information. Let's look at a few parts of it.
(1) Briefly, it first shows the Call: that's the way that the function was called; miles per
gallon (y) explained by weight (x) using the mtcars data. The regression equation we
would like to fit is

    predicted mpg = b0 + b1 * wt
(2) This next part summarizes the residuals: that's how much the model got each of
those predictions wrong, i.e., how different the predictions were from the actual results.
(3) This table, the most interesting part, shows the coefficients: the actual
predictors and the significance of each.
First, the intercept (b0): for a hypothetical car with a weight of 0, the
predicted miles per gallon based on our linear model would be
37.2851.
Then we can see the effect of the weight variable on miles per gallon (b1),
also called the coefficient or the slope of the weight. This shows that there's a
negative relationship, where increasing the weight decreases the miles per
gallon. In particular, it shows that increasing the weight by 1000 pounds
decreases the efficiency by about 5.3 miles per gallon.
    predicted mpg = 37.2851 - 5.3445 * wt
You can then use this equation to predict the gas mileage of a car that has a
weight of, say, 4500 pounds.
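For example, since wt is measured in 1000s of pounds, a 4500 lb car corresponds to wt = 4.5, and you can get the prediction either by hand or with R's predict function:

```r
fit <- lm(mpg ~ wt, data = mtcars)            # the simple model from above
predict(fit, newdata = data.frame(wt = 4.5))  # predicted mpg for a 4500 lb car
# By hand: 37.2851 - 5.3445 * 4.5, i.e., about 13.2 mpg
```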
This second column is called the standard error: we won't examine it here, but
in short, it represents the amount of uncertainty in our estimate of the slope.
(4) The Multiple R-squared: higher is better, with 1 being the best. It corresponds to the
amount of variability in what you're predicting that is explained by the model. In this
instance, 75% of the variation in mpg can be explained by the car's weight.
(5) Adjusted R-squared: a version of R-squared that penalizes additional predictors;
prefer it when comparing models with different numbers of predictors.
Reference: the parts of the summary() output.

Residuals: The residuals are the difference between the actual values of the outcome
variable and the predicted values from your regression (y - y-hat).

Coefficient Estimates: These are the estimated coefficients of the model, the values we
would like to get. They measure the marginal importance of each predictor variable on
the outcome variable.

Standard Error of the Coefficient Estimate (Std. Error): Measure of the variability in the
estimate for the coefficient. Lower means better, but this number is relative to the value
of the coefficient. As a rule of thumb, you'd like this value to be at least an order of
magnitude less than the coefficient estimate.

t value of the Coefficient Estimate: Score that measures whether or not the coefficient
for this variable is meaningful for the model. You probably won't use this value itself, but
know that it is used to calculate the p-value and the significance levels.

Pr(>|t|) (i.e., the variable's p-value): Another score that measures whether or not the
coefficient for this variable is meaningful for the model. You want this number to be as
small as possible. If the number is really small, R will display it in scientific notation. In
our example, 2e-16 means that the odds that the variable is meaningless are about
1 in 5,000,000,000,000,000.

Significance Legend: The more punctuation there is next to your variables, the better.
Blank = bad, dots = pretty good, stars = good, more stars = very good.

Residual Std Error / Degrees of Freedom: The Residual Std Error is just the standard
deviation of your residuals. You'd like this number to be proportional to the quantiles of
the residuals above. For normally distributed residuals, the 1st and 3rd quartiles should
be roughly two-thirds of the std error in magnitude. The Degrees of Freedom is the
difference between the number of observations included in your training sample and
the number of variables used in your model (the intercept counts as a variable).

R-squared: Metric for evaluating the goodness of fit of your model. Higher is better,
with 1 being the best. It corresponds to the amount of variability in what you're
predicting that is explained by the model. In the simple model, ~75% of the variation in
mpg is explained by the car's weight.
WARNING: While a high R-squared indicates good correlation, correlation does not
always imply causation.

F-statistic & resulting p-value: Performs an F-test on the model. This takes the
parameters of our model (in the simple case we have only one) and compares it to a
model that has fewer parameters. In theory, the model with more parameters should fit
better. If the model with more parameters (your model) doesn't perform better than the
model with fewer parameters, the F-test will have a high p-value (the improvement is
probably NOT significant). If the model with more parameters is better than the model
with fewer parameters, you will have a lower p-value.
The DF, or degrees of freedom, pertains to how many variables are in the model. In our
simple model there is one predictor, so there is one degree of freedom.
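All of the quantities in the reference above can also be pulled out of the summary object directly; for example, for the simple model:

```r
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)
s$coefficients    # estimates, std. errors, t values, Pr(>|t|)
s$sigma           # residual standard error
s$r.squared       # multiple R-squared (about 0.75)
s$adj.r.squared   # adjusted R-squared
s$fstatistic      # F-statistic with its degrees of freedom
```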
Try it:
Look at the results returned by summary(mfit) and try to interpret the output.
> summary(mfit)
Call:
lm(formula = mpg ~ wt + disp + cyl, data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max
-4.4035 -1.4028 -0.4955  1.3387  6.0722
Coefficients:
Questions:
(1) Which predictor variables are statistically significant in predicting mpg?
(2) How does this model's prediction compare to the simple linear regression model?
Answers:
(1) wt and cyl
(2) The R-squared is 0.8326, which is larger than the simple model's 0.7528, indicating a
better fit to the data.
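You can verify both answers in R; the anova call performs an F-test of whether the extra predictors give a significant improvement over the simple model:

```r
fit  <- lm(mpg ~ wt, data = mtcars)               # simple model
mfit <- lm(mpg ~ wt + disp + cyl, data = mtcars)  # multiple model
summary(fit)$r.squared    # about 0.7528
summary(mfit)$r.squared   # about 0.8326
anova(fit, mfit)          # F-test comparing the two models
```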