Multiple Linear Regression Model Report On Possum Data
Executive Summary
For the final model, the dependent variable chosen was total length and the
independent variables were head length, tail length, and site 1, with a final adjusted R-squared of
0.786. The model was evaluated using mean squared error, root mean squared error, mean
absolute error, and mean absolute percentage error (MAPE). The MAPE of 1.49%, when subtracted
from 100%, indicates that our final model was 98.51% accurate in predicting a possum’s total
length from the independent variables chosen. This rests on the assumptions that our dataset was
large enough to build a scalable model, the dependent and independent variables are linearly
related, the residuals are normally distributed, and the independent variables are not correlated
with one another.
Table of Contents
Executive Summary
Introduction
Data Description & Bivariate Analysis
Model Building
Final Model
Assumptions
Conclusion
Appendix
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 15
Introduction
This paper will explore a dataset consisting of measurements and characteristic data
collected from observing possums in Australia. There are 104 observations, each with an
observation number and thirteen variables: site location, population (presented as a binary
variable with a value of either Victoria or other), sex, age, head length, skull width, total length,
tail length, foot length, and ear conch, eye, chest, and belly measurements. The objective of the
statistical model is to predict the total length of a possum. We will accomplish this by fitting
a multiple linear regression model that uses the most useful of the twelve other variables
contained in the data. These variables will be selected via backward stepwise regression and
verified by ordinary least squares (OLS) stepwise regression.
Model Building
We used backward stepwise regression to determine which variables to use
in the final model. We started by comparing each of the independent variables to our dependent
variable, total length. Our first model included site, population, sex, age, head length, skull
width, tail length, foot length, ear conch, eye, chest, and belly. This model had an adjusted
R-squared value of 0.7443. We then removed the independent variable with the largest p-value
from each successive model. A large p-value indicates that there is little evidence that the
coefficient of the associated variable differs from zero, so a change in that variable is not
expected to produce a significant change in the predicted value.
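Because adjusted R-squared is the yardstick used to compare the successive models, a minimal sketch of how it is computed may help. The function name and the R-squared value of 0.80 below are illustrative assumptions, not figures from this report; the point is only that dropping a predictor that adds nothing to the raw R-squared raises the adjusted value.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical illustration: the raw R^2 stays at 0.80 while one predictor is dropped.
full = adjusted_r_squared(0.80, n_obs=104, n_predictors=12)
reduced = adjusted_r_squared(0.80, n_obs=104, n_predictors=11)
print(round(full, 4), round(reduced, 4))
```

The penalty term (n - 1) / (n - p - 1) shrinks as predictors are removed, which is why removing a variable with a large p-value can leave the adjusted R-squared unchanged or slightly higher.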
Age was the first independent variable to be removed. The resulting model had the
same adjusted R-squared value, indicating that age did not contribute to the model. In the third
model, we removed belly, giving an adjusted R-squared value of 0.747. The fourth
model removed the eye variable, giving an adjusted R-squared of 0.7495. Population was then
removed, increasing the adjusted R-squared value to 0.7519. When skull width was removed, the
adjusted R-squared increased further to 0.7543. Removing ear conch gave an
adjusted R-squared of 0.756. Next, we removed foot length, which decreased the adjusted
R-squared slightly to 0.755, but its p-value was still too high to keep this variable in the model.
Our next model removed chest, which brought the adjusted R-squared up to 0.7552. In the tenth and
final model, we removed sex, whose p-value was again too large to keep, giving us a model with
an adjusted R-squared of 0.7484. In this tenth model, site, head length, and tail length remained
as the independent variables used to predict the dependent variable, total length.
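The elimination procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for this report: the data are synthetic, the column names and coefficients are hypothetical, and only numeric predictors are handled. Within a single fitted model every coefficient shares the same residual degrees of freedom, so the variable with the largest p-value is the one with the smallest absolute t-statistic, and a cutoff of |t| < 2 stands in roughly for p > 0.05.

```python
import math
import random


def invert(matrix):
    """Gauss-Jordan inverse of a small square matrix, with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_t_stats(columns, y):
    """Fit y ~ intercept + columns by OLS; return {name: t-statistic}."""
    names = list(columns)
    n, k = len(y), len(names) + 1
    X = [[1.0] + [columns[name][i] for name in names] for i in range(n)]
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [y[i] - sum(X[i][a] * beta[a] for a in range(k)) for i in range(n)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    return {
        name: beta[j + 1] / math.sqrt(sigma2 * inv[j + 1][j + 1])
        for j, name in enumerate(names)
    }


def backward_eliminate(columns, y, t_threshold=2.0):
    """Drop the weakest predictor until every remaining |t| >= t_threshold."""
    kept = dict(columns)
    while kept:
        t = ols_t_stats(kept, y)
        weakest = min(t, key=lambda name: abs(t[name]))
        if abs(t[weakest]) >= t_threshold:
            break
        del kept[weakest]
    return list(kept)


# Synthetic data: total length depends on two columns; "unrelated" is pure noise.
rng = random.Random(0)
n = 104
head = [rng.gauss(92.0, 3.5) for _ in range(n)]
tail = [rng.gauss(37.0, 2.0) for _ in range(n)]
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
total = [10.0 + 0.5 * h + 0.8 * t + rng.gauss(0.0, 1.5) for h, t in zip(head, tail)]
selected = backward_eliminate(
    {"head_length": head, "tail_length": tail, "unrelated": noise}, total
)
print(selected)
```

A categorical variable with several levels, such as site, would need a joint test across its dummy columns rather than a single t-statistic, so this sketch does not cover that case.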
We then used ordinary least squares (OLS) stepwise regression to verify that we had determined
the correct independent variables for the multiple linear regression model. We ran forward
selection first, then backward elimination, and finally stepwise selection in both directions. This
confirmed our independent variables of site, head length, and tail length. To further confirm the
model, we checked the pairwise correlations among the chosen independent variables for signs of
multicollinearity. No multicollinearity was detected, since no two independent variables had a
correlation higher than 0.5. As a second check, we computed the variance inflation factor (VIF)
for each variable, and the results were consistent with the correlation check.
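The two multicollinearity checks can be illustrated with a short sketch. It assumes numeric predictor columns and uses made-up values; the helper names are hypothetical. The VIF for each predictor is taken from the diagonal of the inverse of the predictors' correlation matrix, which equals 1 / (1 - R²) from regressing that predictor on the others.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Gauss-Jordan inverse of a small square matrix, with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Return {name: VIF}, read off the diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


# Hypothetical values, for illustration only (not taken from the possum data).
predictors = {
    "head_length": [93.1, 91.5, 94.0, 89.2, 95.3, 90.8],
    "tail_length": [36.0, 38.5, 35.5, 37.0, 39.0, 36.5],
    "site_1": [1, 0, 1, 0, 0, 1],
}
names = list(predictors)
# Flag any pair whose absolute correlation exceeds the 0.5 cutoff used in this report.
flagged = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(predictors[a], predictors[b])) > 0.5
]
print(flagged)
print(vif(predictors))
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values mean more of its variance is explained by the remaining predictors.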
After we confirmed the model, we created dummy variables for the site variable, since
site is a categorical variable. We again used backward stepwise modeling, eliminating the site
dummy with the largest p-value at each step. Site 1 was the last site left in the model, giving an
adjusted R-squared value of 0.786. This indicates that site 1 is the only site that has a
statistically significant bearing on the prediction of a possum’s total length. We again checked
the chosen independent variables using OLS stepwise regression. The results varied slightly, as
the OLS method suggested additional sites. However, after trying different combinations of sites,
the lowest MAPE value was obtained when using head length, tail length, and site 1 as the
independent variables in the model. The final independent variables were checked for
multicollinearity using the correlation and VIF methods. No significant correlation was found
among head length, tail length, and site 1.
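As a minimal sketch of the dummy-variable step, a categorical site column can be expanded into one 0/1 indicator per site. The site labels and the `site_dummies` helper below are hypothetical and are not taken from the report's data or code.

```python
def site_dummies(sites):
    """Return {'site_<k>': [0 or 1, ...]} with one indicator column per distinct site."""
    levels = sorted(set(sites))
    return {f"site_{k}": [1 if s == k else 0 for s in sites] for k in levels}


# Hypothetical site labels for seven observations.
sites = [1, 1, 2, 5, 7, 1, 2]
dummies = site_dummies(sites)
site_1 = dummies["site_1"]  # the only indicator retained in the final model
print(site_1)
```

When all indicators enter a model that also has an intercept, one level is normally dropped as the reference category, because the indicators sum to 1 in every row. Keeping only the site 1 indicator, as in the final model, contrasts site 1 against all other sites combined.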
Final Model
The final model has total length as the dependent variable and head length, tail length, and
site 1 as the independent variables, with an adjusted R-squared of 0.786. To predict the total
length of a possum, we split the original data into training and test sets. The model was trained on
75% of the data and then run on a test set of the remaining 25% to predict the total
length of the possum. We evaluated the accuracy of the model using four error measures: mean
squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and mean
absolute percentage error (MAPE). The model has an MSE value of 2.75.
Mean squared error is the average squared difference between the predicted and observed values,
so it measures how close the fitted model is to the data. The RMSE value
of the model is 1.66. Root mean squared error is the square root of the MSE and gives the error in
the same unit of measure as the data, which in this case is centimeters. The MAE value is also
given in centimeters, 1.31 cm. The MAPE of the model is 1.49%. This mean absolute percentage
error value, when subtracted from 100%, indicates that the final model is 98.51% accurate in
predicting a possum’s total length from head length, tail length, and the site 1 indicator.
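The evaluation steps can be made concrete with a short sketch of the 75/25 split and the four error measures. The function names, the random seed, and the actual and predicted lengths are illustrative assumptions, not the report's code or results. The reported RMSE of 1.66 is consistent with the reported MSE of 2.75, since the square root of 2.75 is about 1.66.

```python
import math
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def regression_errors(actual, predicted):
    """Return MSE, RMSE, MAE (in the units of the data) and MAPE (in percent)."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape}


# 104 observations split 75/25 gives 78 training rows and 26 test rows.
train, test = train_test_split(range(104))

# Hypothetical actual and predicted total lengths in centimeters.
actual = [85.0, 90.0, 88.0, 80.0]
predicted = [86.0, 88.0, 88.5, 81.0]
metrics = regression_errors(actual, predicted)
accuracy = 100.0 - metrics["MAPE"]  # the "accuracy" figure used in this report
print(metrics, accuracy)
```

MAPE divides each error by the observed value, so it is only meaningful when the observed values are well away from zero, as body lengths are.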
Assumptions
Perhaps the biggest assumption of this model is that the sample size of 104 observations
is large enough to build a scalable model. The final model is 98.51% accurate when run on
a test set drawn from the given data, so this assumption appears to be reasonable. The second
assumption is that the dependent variable, total length, is linearly related to the independent
variables in the final model: head length, tail length, and the binary variable site 1. This is one
of the key assumptions of a multiple linear regression model. If the dependent and independent
variables are not linearly related, the model will not be accurate in predicting unknown values.
For our model, this assumption is supported by the scatter plot correlation analysis shown in
Figure 1 and Figure 3. Our third assumption is that the residuals are normally distributed. We
ran a Lilliefors test on the residuals from the final model and plotted the results, which can be
seen in Figure 15. This test indicated that the residuals are approximately normally distributed.
Our fourth assumption is that the residuals show constant variance, which is supported by the
residual plot seen in Figure 17. Another major assumption of any multiple linear regression
model is that the residuals are not correlated with one another. The final assumption is that the
independent variables are not correlated. We checked this assumption in two ways, a correlation
analysis and a variance inflation factor (VIF) analysis, and found no
significant correlation between the independent variables used in the final model.
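For readers unfamiliar with the Lilliefors test, the sketch below computes its test statistic: the largest gap between the empirical distribution of the standardized residuals and the standard normal distribution, with the mean and standard deviation estimated from the residuals themselves. The residuals are synthetic, and the cutoff of 0.886 / sqrt(n) is a commonly tabulated large-sample approximation for the 5% level that is assumed here for illustration. A library routine (for example the `lilliefors` function in statsmodels) would normally be used to obtain a p-value.

```python
import math
import random


def lilliefors_statistic(residuals):
    """Kolmogorov-Smirnov distance to a normal with estimated mean and sd."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))
    z = sorted((r - mean) / sd for r in residuals)
    # Standard normal CDF evaluated at each standardized residual.
    cdf = [0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z]
    d_plus = max((i + 1) / n - c for i, c in enumerate(cdf))
    d_minus = max(c - i / n for i, c in enumerate(cdf))
    return max(d_plus, d_minus)


# Synthetic residuals, for illustration only (not the residuals of the final model).
rng = random.Random(0)
residuals = [rng.gauss(0.0, 1.6) for _ in range(104)]
statistic = lilliefors_statistic(residuals)
# Assumed approximate 5% critical value for n > 30; a statistic below it
# gives no evidence against normality.
critical_5pct = 0.886 / math.sqrt(len(residuals))
print(statistic, critical_5pct)
```

A small statistic does not prove normality; it only means the test found no evidence against it at the chosen level.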
Conclusion
The study concludes that the total length of a possum can be predicted from head
length, tail length, and site location (site 1) with 98.51% accuracy. To meet the objective of the
study, we first conducted a bivariate analysis (scatter plots and boxplots), which
gave us an indication of the relationships between the independent and
dependent variables. To build our final model, we used backward stepwise regression,
removing the variable with the largest p-value at each stage and tracking the adjusted
R-squared value to arrive at a simplified model that best explains the data.
Furthermore, to confirm the reliability of the chosen independent variables, we
ran least-squares stepwise regression (forward, backward, and both directions)
and checked for multicollinearity using correlation and VIF, which found no collinearity
between the independent variables. Hence, based on the MAPE, we conclude that the
model is 98.51% accurate in predicting a possum’s total length from head
length, tail length, and site location.
Appendix
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 15
Figure 16
Figure 17