Multiple Linear Regression Model Report On Possum Data

For the graduate course on Statistical Analysis, we performed a Multiple Linear Regression model to analyze Possum Data.


Multiple Linear Regression Model for Possum Data

SHAINAN AGRAWAL, TAJA CASTILLEJA, MARCUS GLEINSER, SALMAN BIN HABIB, CHELSEA PAYTON

QMST 5323 - Musal


Executive Summary
The objective of our statistical model was to predict the total length of a possum
using a dataset of measurements from real, observed possums in Australia. The measured
variables were site location, population, sex, age, head length, skull width, total length, tail
length, foot length, ear conch, eye, chest, and belly. We implemented a multiple linear regression
model and conducted a bivariate analysis to examine how each variable relates to total length. In
the bivariate analysis, we found that head length, skull width, tail length, chest, and belly all
had moderate to strong positive relationships with total length, suggesting they are candidate
predictors of possum total length.

Next, we implemented a backward stepwise regression model to determine which variables
to use in our final model. Starting with all twelve independent variables as predictors of total
length, the model had an adjusted R-squared value of 0.7443. After removing variables one by one,
dropping the variable with the largest p-value at each step, we were left with an adjusted
R-squared of 0.7484. To check that we had identified the correct variables to use in our multiple
linear regression model, we used ordinary least squares stepwise regression. This confirmed that
our current variables (site, head length, and tail length) were the ones to carry forward. When
checking the variables’ correlation with one another, no two had a correlation higher than 0.5, so
no multicollinearity was detected. This was double-checked with the variance inflation factor
(VIF) method, which produced consistent results. After these confirmations, we created dummy
variables for site and used backward stepwise modeling once again, leaving site 1 as the only
statistically significant site for predicting the total length of a possum. One more check for
correlation among the independent variables was done using correlation and VIF methods, leaving
us with the variables head length, tail length, and site 1.

For the final model, the dependent variable chosen was total length and the
independent variables were head length, tail length, and site 1, with a final adjusted R-squared
of 0.786. The model was evaluated based on mean squared error, root mean squared error, mean
absolute error, and mean absolute percentage error. The mean absolute percentage error, when
subtracted from 100%, indicated that our final model was 98.51% accurate in predicting a possum’s
total length from the independent variables chosen. This is all based on the assumptions that our
dataset was large enough to build a scalable model, the dependent and independent variables are
linearly related, the residuals are normally distributed, and the independent variables are not
correlated.
Table of Contents
Executive Summary
Introduction
Data Description & Bivariate Analysis
Model Building
Final Model
Assumptions
Conclusion
Appendix
  Figure 1
  Figure 2
  Figure 3
  Figure 4
  Figure 5
  Figure 6
  Figure 7
  Figure 8
  Figure 9
  Figure 10
  Figure 11
  Figure 12
  Figure 13
  Figure 14
  Figure 15
Introduction
This paper explores a dataset consisting of measurements and characteristic data
collected from observing possums in Australia. There are 104 observations, each with an
observation number and thirteen variables: site location, population (presented as a binary
variable with a value of either Victoria or other), sex, age, head length, skull width, total length,
tail length, foot length, ear conch, eye, chest, and belly. The objective of the
statistical model is to predict the total length of a possum. We accomplish this by implementing
a multiple linear regression model using the best of the twelve other variables contained in the
data. The best variables are determined via backward stepwise regression and verified by
ordinary least squares stepwise regression.

Data Description & Bivariate Analysis


The data set contains the following variables: total length, head length, skull width, tail
length, foot length, ear conch, eye, chest, belly, sex, population, site location, and age. We
plotted the twelve independent variables against the dependent variable of total length. For eight
of the independent variables, we used scatter plots, and for sex, population, site, and age we used
boxplots.
The first independent variable we examined was head length. The line of best fit slopes
upward, indicating a positive relationship: as head length increases, so does total length. The
relationship is strong, since most points lie close to the line of best fit (Figure 1). With skull
width, we again see a positive relationship, but it is only moderately strong, with outliers
farther from the line (Figure 2). Tail length also shows a moderately strong positive
relationship; the points are not very close together but trend upward (Figure 3). Foot length’s
relationship with total length is positive, but its line of best fit is much flatter than those of
the previous three variables, and the relationship is weaker (Figure 4). Foot length was also
one of the variables with a missing value, which was imputed using the forest library when we
ran the model. The next variable is ear conch, which also had a positive relationship but an
almost flat line of best fit. Ear conch had a weak relationship with total length, with many of
the points scattered far from the line (Figure 5). Eye, the distance from the medial canthus to
the lateral canthus of the right eye, has a positive yet weak relationship with total length
(Figure 6). Chest has a positive and stronger relationship with total length (Figure 7). Belly
also has a positive, moderately strong relationship with total length (Figure 8).
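The strength of each scatter-plot relationship described above can be summarized with a Pearson correlation coefficient. The following is a minimal sketch using only the Python standard library; the function name pearson_r is an illustrative choice, and the measurements are made-up values, not rows from the possum data set.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative measurements, NOT values from the possum data set.
head_length = [85.1, 90.0, 92.5, 94.1, 96.3]
total_length = [81.0, 85.5, 88.0, 89.5, 93.0]

r = pearson_r(head_length, total_length)
print(f"r = {r:.3f}")
```

A value of r close to 1 corresponds to points lying near an upward-sloping line of best fit, while a value near 0 corresponds to the nearly flat, widely scattered plots.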
We used boxplots to compare sex and total length. The interquartile range for females
sits higher, spanning about 86 cm to 91 cm, and females have a higher median. Females also
have some outliers. Males have an interquartile range that spans from 84 cm to 89 cm, with no
outliers (Figure 9). The population boxplot shows that the population in Victoria has a higher
interquartile range, along with some outliers, compared to the other population (Figure 10). For
the site variable, we see a variety of different interquartile ranges across the seven different sites.
In sites 1, 4, and 5 we have outliers present in the data set (Figure 11). With the age variable, we
see that the interquartile ranges are much closer across the nine different ages. We see outliers
at ages 1 and 6 (Figure 12). It should be noted that age also has missing values, which were
imputed with the forest library as well.
After we had compared the various independent variables with total length, we plotted
bar charts to see the proportion of males and females in the different sites and in the population
of Victoria and others. Only site 1 had a greater number of females compared to males. The other
six sites had more males than females (Figure 13). In the two populations, we see that Victoria
has more females compared to males, while the other population contains more males (Figure
14).
Before we started creating the various plots, we looked through the data. The variable
foot length had one missing value, while the variable age had two missing values. This should be
kept in mind when interpreting outliers. These missing values were imputed using the forest
library when we started creating our model.

Model Building
We implemented a backward stepwise regression model to determine the variables to use
in the final model. We started by regressing our dependent variable, total length, on all of the
independent variables. Our first model included site, population, sex, age, head length, skull
width, tail length, foot length, ear conch, eye, chest, and belly. This model had an adjusted
R-squared value of 0.7443. We then began to remove the independent variable with the largest
p-value in each model. This is done because a large p-value indicates little evidence that the
associated variable’s coefficient differs from zero once the other variables are in the model.
Age was the first independent variable to be removed. This resulted in a model with the
same adjusted R-squared value, meaning age was not a significant variable. In the third model,
we removed belly, resulting in a model with an adjusted R-squared value of 0.747. The fourth
model removed the eye variable, giving an adjusted R-squared of 0.7495. Population was then
removed, increasing the adjusted R-squared value to 0.7519. When skull width was removed, the
adjusted R-squared further increased to a value of 0.7543. Ear conch was then taken out, giving an
adjusted R-squared of 0.756. Next, we removed foot length, which slightly decreased the adjusted
R-squared to 0.755, but its p-value was still too high to keep the variable in the model. Our next
model removed chest, which brought the adjusted R-squared up to 0.7552. In the tenth and final
model, we removed sex, which again had too large a p-value to keep, giving us a model with
an adjusted R-squared of 0.7484. In this tenth model, site, head length, and tail length were left
as the independent variables to predict the dependent variable of total length.
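Adjusted R-squared can rise when a variable is removed because it penalizes the number of estimated coefficients. The sketch below shows the standard formula, 1 - (1 - R²)(n - 1)/(n - p - 1). The function name is illustrative, and the R-squared value of 0.80 is a made-up input, not a figure from our models; only the sample size of 104 comes from the data.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept.

    n_predictors counts the estimated slope coefficients (a categorical
    variable contributes one coefficient per dummy variable).
    """
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical R-squared of 0.80 held fixed to isolate the penalty term.
many = adjusted_r_squared(0.80, n_obs=104, n_predictors=12)
few = adjusted_r_squared(0.80, n_obs=104, n_predictors=3)
print(f"12 predictors: {many:.4f}, 3 predictors: {few:.4f}")
```

With the same raw R-squared, the smaller model scores higher, which is why dropping a variable that explains almost nothing can increase the adjusted value.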
We then used ordinary least squares stepwise regression to verify that we had determined
the correct independent variables for the multiple linear regression model. We first ran forward
stepwise regression, then backward, and ended with stepwise regression in both directions. This
confirmed our independent variables of site, head length, and tail length. To further confirm the
model with the chosen independent variables, we checked their pairwise correlations for signs of
multicollinearity. No multicollinearity was detected, since no two independent variables had a
correlation higher than 0.5. To verify, we checked for multicollinearity again with the variance
inflation factor (VIF) method, and the results were consistent with the previous checks.
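The link between the correlation check and the VIF check can be illustrated for the simplest case. In general, the VIF of a predictor is 1/(1 - R²), where R² comes from regressing that predictor on the other predictors. With exactly two predictors, that R² equals their squared correlation. The sketch below covers only this two-predictor simplification; our model has more predictors, so it illustrates the idea and does not reproduce our calculation.

```python
def vif_two_predictors(r):
    """VIF shared by both predictors when a model has exactly two
    predictors with pairwise correlation r (simplified special case)."""
    return 1.0 / (1.0 - r ** 2)

# A correlation of 0.5, the cutoff used in the text, gives a small VIF.
for r in (0.0, 0.5, 0.9):
    print(f"r = {r:.1f} -> VIF = {vif_two_predictors(r):.2f}")
```

Under this simplification, a pairwise correlation of 0.5 corresponds to a VIF of about 1.33, so it is unsurprising that the two checks agreed.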
After we confirmed the model, we created dummy variables for the site variable, since
site is a categorical variable. Again, we used backward stepwise modeling, eliminating the site
with the largest p-value. Site 1 was the last site left in the model, giving an adjusted R-squared
value of 0.786. This indicates that site 1 is the only site that has a statistically significant bearing
on the prediction of total length for a possum. We again checked the chosen independent
variables using ordinary least squares stepwise regression techniques. The results varied
slightly, as the OLS method suggested additional sites. However, after testing different sites, the
lowest MAPE value was obtained when using head length, tail length, and site 1 as independent
variables in the model. The final independent variables were checked for
multicollinearity using correlation and VIF methods. No significant correlation was found
between head length, tail length, and site 1.
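Creating dummy variables for site means replacing the single categorical column with one 0/1 indicator per site. The following is a minimal sketch in plain Python; the function name site_dummies and the column names such as site_1 are illustrative assumptions, and the site values shown are made up.

```python
def site_dummies(sites, levels):
    """Return one dict of 0/1 indicators per observation, one key per level."""
    return [
        {f"site_{level}": int(site == level) for level in levels}
        for site in sites
    ]

# Made-up site labels for four observations; the data set has sites 1 to 7.
observed_sites = [1, 2, 7, 1]
rows = site_dummies(observed_sites, levels=range(1, 8))

# The final model keeps only the site 1 indicator.
site_1_column = [row["site_1"] for row in rows]
print(site_1_column)
```

Keeping only the site 1 indicator treats all other sites as a single reference group, which matches a final model containing head length, tail length, and site 1.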

Final Model
The final model has total length as the dependent variable and head length, tail length,
and site 1 as the independent variables, with an adjusted R-squared of 0.786. To predict the total
length of a possum, we split the original data into training and test sets. The model was trained on
75% of the data and then run on a test set of the remaining 25% to predict the total
length of each possum. We evaluated the accuracy of the model using four types of error: mean
squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and mean
absolute percentage error (MAPE). We determined that the model has an MSE value of 2.75.
Mean squared error is the average squared difference between the predicted and observed
values. The RMSE value of the model is 1.66. Root mean squared error gives the error in the same
unit of measure as the data, which in this case is centimeters. The MAE value is also given in
centimeters like the data, 1.31 cm. The MAPE of the model is 1.49%. This mean absolute percentage
error value, when subtracted from 100%, indicates that the final model is 98.51% accurate in
predicting a possum’s total length from head length, tail length, and site location.
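The four error measures can be computed directly from the observed and predicted test-set values. The sketch below uses standard definitions with only the Python standard library; the function name and the four pairs of lengths are made up for illustration and are not our test set.

```python
import math

def regression_errors(actual, predicted):
    """Return MSE, RMSE, MAE, and MAPE (in percent) for paired values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides each absolute error by the observed value.
    mape = 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    return {"mse": mse, "rmse": rmse, "mae": mae, "mape": mape}

# Made-up total lengths in cm, NOT values from the possum test set.
actual = [85.0, 90.0, 88.0, 92.0]
predicted = [86.0, 88.5, 88.0, 93.5]

metrics = regression_errors(actual, predicted)
accuracy = 100.0 - metrics["mape"]
print(metrics, f"accuracy = {accuracy:.2f}%")
```

Because RMSE is the square root of MSE, the reported RMSE of 1.66 follows from the reported MSE of 2.75, and the 98.51% figure is 100% minus the reported MAPE of 1.49%.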

Assumptions
Perhaps the biggest assumption of this model is that the sample size of 104 observations
is large enough to build a scalable model. The final model is 98.51% accurate when executed on
a test set of the given data, so this assumption appears to be valid. There is also the assumption
that the dependent variable, total length, is linearly related to the independent variables that are
in the final model: head length, tail length, and the binary variable site 1. This is one of the key
assumptions of a multiple linear regression model. If the dependent and independent variables
are not linearly related, the model will not be accurate in predicting unknown values. For our
model, this assumption is validated by the scatter plot correlation analysis as shown in Figure 1
and Figure 3. Our third assumption is that the residuals of the model are normally distributed. We
ran a Lilliefors test on the residuals from the final model and plotted the results, which can be
seen in Figure 15. This test indicated that the residuals are approximately normally distributed.
Our fourth assumption is that the residuals show constant variance, which is validated in the
residual plot seen in Figure 17. Another major assumption of any multiple linear regression
model is that the residuals of the data are not correlated. The final assumption is that the
independent variables are not correlated. We verified this assumption in two ways: a correlation
analysis and a variance inflation factor (VIF) analysis and determined that there was no
significant correlation between the independent variables used in the final model.
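For the normality check, the Lilliefors test statistic is the largest gap between the empirical distribution of the residuals and a normal distribution whose mean and standard deviation are estimated from those same residuals. The sketch below computes only that statistic, using the standard library; it does not produce a p-value, which requires Lilliefors critical values or simulation. The function name and the residuals are made up for illustration.

```python
import math
import statistics

def lilliefors_statistic(residuals):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    n = len(residuals)
    mean = statistics.mean(residuals)
    sd = statistics.stdev(residuals)  # sample standard deviation (n - 1)
    d = 0.0
    for i, value in enumerate(sorted(residuals), start=1):
        z = (value - mean) / sd
        cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        # Compare the fitted CDF with the empirical step just above and below.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Made-up residuals in cm, NOT the residuals from our final model.
residuals = [-2.1, -1.3, -0.7, -0.2, 0.1, 0.4, 0.9, 1.5, 2.0]
d_stat = lilliefors_statistic(residuals)
print(f"D = {d_stat:.3f}")
```

A small statistic means the residuals track the fitted normal curve closely; whether it is small enough depends on the sample size and the chosen significance level.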

Conclusion
The study concludes that the total length of a possum can be predicted from the head
length, tail length, and site location with 98.51% accuracy. To satisfy the objective of the
study, we conducted a bivariate analysis (scatter plots and boxplots), which gave us an
indication of the relationship between the independent and dependent variables, and built a
multiple linear regression model. To build our final model, we used backward stepwise
regression. Using p-values and tracking the adjusted R-squared value, we gradually removed
variables from the regression model at each stage to arrive at a simplified model that best
explains the data. Furthermore, to confirm the reliability of the chosen independent variables,
we conducted least-squares stepwise regression (forward, backward, and both directions)
and checked for multicollinearity using correlation and VIF, which found no collinearity
between the independent variables. Hence, based on the MAPE, we conclude that the
model is 98.51% accurate in predicting a possum’s total length from the head
length, tail length, and site location.
Appendix
Figure 1

Figure 2
Figure 3

Figure 4
Figure 5

Figure 6
Figure 7

Figure 8
Figure 9

Figure 10
Figure 11

Figure 12
Figure 13

Figure 14
Figure 15

Figure 16
Figure 17
