Multiple Regression
Except where otherwise noted, this work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Introduction
Observing Variables
In the module on Linear Correlation, Regression, and Prediction, we discussed determining the correlation and possible regression relationships between an independent variable X and a dependent variable Y. Specifically, in regression, the discussion was based on how a change in one variable (X) produced an effect on another (Y). This is the essence of regression. But from experience, we know that often multiple causes interact to produce a certain result. For example, yield from a crop depends on the amount of water a plant has to use, the soil fertility of the field, the potential of the seed to produce a plant, pest and pathogen pressures, and numerous other factors. In this lesson, we'll explore how we can determine linear relationships between multiple independent variables and a single dependent variable.
Exploring Multiple Variables
Multiple regression functionally relates several continuous independent variables to one dependent variable. For example, barley yield per plot (Y) can be modeled as a function of the percentage of plants in the plot affected by rust (X1) and the days to maturity required by the cultivar grown in a particular plot (X2). Yield is then modeled as a linear combination of these two X variables in a response surface. Before we can relate the dependent variable Y to the independent X variables, we need to know the interrelationships between all of the variables. Multiple correlation and partial correlation provide measures of the linear relationship among the variables.
Separating an individual factor's effect on the whole result, such as the effect of rust infection or the number of days a particular cultivar requires to reach maturity, can be difficult and at times confounding. The objective of this module is to explain and illustrate the principles discussed in the module on Linear Correlation, Regression and Prediction for correlating two variables or enumerating the effect of one variable on another, now expanded to multiple variables.
Objectives
• To define correlation relationships among several variables
• To separate the individual relationships of multiple independent variables with a dependent variable
• To test the significance of multiple independent variables and to determine their usefulness in regression analysis
• To recognize some of the potential problems resulting from improper regression analysis
Multiple Correlation and Regression
Simple Correlation
The correlation of multiple variables is similar to the correlation between two variables. The same assumptions apply: the sampled Y's should be independent and of equal variance. Error (variance) is associated with the Y's, while the X's have no error or the error is small. But now, since there are multiple factors involved, the correlations are somewhat more complex and interactions between the Xi variables are expected. A note on notation: we now include a subscript with the "X" to indicate the independent variable to which we are referring. Three levels of correlation are used in determining the multi-faceted relationships: simple correlation, partial correlation, and total correlation.
Simple Correlation
The simple correlation between one of the Xi's and Y is computed just as for a simple correlation of X and Y. This calculation assumes a direct relationship between the particular Xi and Y. It is also useful in stating the simple relationship between two Xi's in the multiple correlation. When determining the significance of regression coefficients, the variable with the largest simple correlation with Y is usually the starting point. Some interaction among numerous X and Y variables is likely to occur. The fact that two Xi's have large simple correlations with the resulting Y does not necessarily indicate that their relationships to Y are independent of each other. They may be measuring the same effect on Y. The number of hours of sunlight (cloud-free skies) and growing degree days (GDDs) both have a good (simple) correlation to the rate of crop development, but their effects would not be additive. There would be a significant interaction between these two variables in describing crop development. The two variables measure two different factors, light and temperature, but amount of sunlight and temperature are generally highly correlated during the summer, so there would be a significant relationship between the two variables. These individual effects can be separated using the partial correlation.
Partial Correlation
Quantifying which continuous X variables are best correlated with the continuous Y variable requires an understanding of the interactions between the Xi's. Breaking down the interaction requires partial correlation coefficients. These use the simple correlation coefficients to explain the correlation of two variables with all other variables held constant. One such example is, "How much yield will result from nitrogen applications, assuming the seasonal amount of rainfall will be average?" Here, rainfall is held constant and the effect of nitrogen on yield would be used for a partial correlation. This relationship is given for the partial coefficient of determination between Y and X1 where two X's are involved, as shown in Equation 1.
Equation 1

r²YX1·X2 = (rYX1 − rYX2 rX1X2)² / [(1 − r²YX2)(1 − r²X1X2)]
To calculate the partial coefficient of determination between Y and X2, just reverse the equation, i.e., swap rYX1 and rYX2. Don't panic! You will not be asked to hand-calculate this on the homework or exam. That is why we use R!
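For reference, here is a minimal R sketch of this hand calculation, using the simple correlations from the example matrix shown in the Correlation Matrix section below; the object names are ours, and pcor() from the 'ppcor' package (used in Exercise 1) returns the same quantity directly from raw data.

# Simple correlations taken from the example matrix below (X1, X2, Y)
r_yx1  <- 0.693
r_yx2  <- 0.354
r_x1x2 <- 0.462

# Partial correlation of Y with X1, holding X2 constant; square it to get Equation 1
r_yx1.x2 <- (r_yx1 - r_yx2 * r_x1x2) / sqrt((1 - r_yx2^2) * (1 - r_x1x2^2))
r_yx1.x2^2   # partial coefficient of determination between Y and X1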
Correlation Matrix
The value rYX1 is the simple correlation between Y and X1. The whole equation describes the correlation between Y and X1 with X2 held constant. The relationship between X1 and Y is displayed with the effect of the interaction removed. Partial correlations can be calculated for all variables involved. They can also be calculated for more than three variables, but the equation becomes more complex. Often, the simple and partial correlations are calculated and displayed in a table with the individual Xi's and Y listed across the top and down the left side. The correlations for each variable pair are displayed at the intersection of the variables.
X1 X2 Y
X1 1 0.462 0.693
X2 0.462 1 0.354
Y 0.693 0.354 1
Total Correlation
The combination of these partial effects leads to a multiple correlation coefficient, R, which states how related Y is to the combined effects of the Xi's. For X1, X2, and Y the total correlation is determined once again using the simple correlations:
Equation 2

R²Y·X1X2 = (r²YX1 + r²YX2 − 2 rYX1 rYX2 rX1X2) / (1 − r²X1X2)
In this equation, r²YX1 and r²YX2 are just the squares of the simple correlation coefficients.
The calculations for two Xi's are relatively straightforward, but for three or more variables the calculations involve a large number of terms with the different correlations among individual variables. Consequently, total correlation is calculated with computer programs, such as R.

Similar to the linear correlation coefficient, the total correlation coefficient, when squared, produces the multiple coefficient of determination, R². This value explains the proportion of the Y variation which can be accounted for by a multiple regression relationship. The partial correlation coefficients squared produce the partial coefficients of determination, r², or that proportion of variance which can be described by one variable, while the partial coefficients will be used in testing individual regression coefficients for significance.
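Continuing the sketch above, Equation 2 can be evaluated from the same three simple correlations; the values are again taken from the example correlation matrix above.

# Multiple (total) coefficient of determination for Y on X1 and X2 (Equation 2)
R2 <- (r_yx1^2 + r_yx2^2 - 2 * r_yx1 * r_yx2 * r_x1x2) / (1 - r_x1x2^2)
R2          # proportion of the variation in Y explained jointly by X1 and X2
sqrt(R2)    # the multiple correlation coefficient R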
Calculating the Correlation
Graphing data to visualize the correlation relationships in multiple dimensions is difficult. Graphing data involving two Xi's with Y is possible in 3-dimensional space. Using the variables mentioned, the regression equation would be a plane in the X1, X2, Y space (Fig. 1). The partial regression coefficients for X1 with Y and X2 with Y in this space could be used to produce lines where the plane intersects a certain X value. For example, the following equation would produce a plane on a graph.

Setting X1 equal to 0 would reduce the above equation of a plane to a linear equation:

Y = 3.9X2 − 7.1

Either X1 or X2 could be set to any value, producing any number of different linear relationships in the plane. With more than two X variables, graphing the relationship in 3 dimensions is not easily done. Instead of graphing, interpreting the data numerically and conceptually is the preferred method.
Ex. 1: Correlation-Multiple Regression Analysis
This exercise contains the following pages:
Ex. 1, Step 1
R CODE FUNCTIONS
• cor
• cor.test
• install.packages
• library
• pcor
Multiple regression functionally relates several continuous independent variables (X), to one dependent
variable, Y. For example, we could carry out multiple regression with yield as the dependent response variable
(Y), X1 as an independent variable indicating the amount of fertilizer applied, and X2 as an independent variable
indicating the amount of water each plot received. In this example, we model yield as a linear combination of
the amount of water and fertilizer applied to each plot in multiple regression. However, before we can relate Y to
the other variables, we need to know the interrelationships of all the variables. Multiple correlation provides measures of the linear relationship among variables.
head(data)
cor(data$perc.inf, data$Yield)
cor(data)
Great! Now we have the simple correlation matrix showing the correlations between RIL (Line), days to maturity (dtm), infection rate (perc.inf), and yield. The correlation matrix returned by R is constructed with the variables listed as both row and column headings. The number at the intersection of a row and column is the correlation coefficient for those two variables. For example, the simple correlation between yield and perc.inf is -0.94750681.
Ex. 1, Step 2
First, calculate the p-value for the simple correlation of perc.inf and yield.
cor.test(data$perc.inf, data$Yield)
R returns
The p-value for the correlation between perc.inf and Yield is 2.2 × 10⁻¹⁶, which is extremely low. This low p-value tells us that the correlation between the two variables (yield and perc.inf) is highly significant.
Ex. 1, Step 3
Now, let’s calculate the p-value for the correlation of dtm and yield.
cor.test(data$dtm, data$yield)
R returns
Ex. 1, Step 4
Calculate the p-value for the correlation between DTM and perc.inf.
cor.test(data$dtm, data$perc.inf)
R returns
Based on the extremely high p-value, the correlation between perc.inf and DTM is not significant.
Ex. 1, Step 5
Which of the variables are the most correlated? Which will contribute the most to the final regression of yield on dtm and infection rate perc.inf? The first question can be answered by looking at the simple correlation matrix that we created in Step 1. perc.inf and yield have a simple correlation of -0.94750681, and dtm and yield have a simple correlation of -0.22688955. The correlation of dtm and yield has a smaller absolute magnitude; thus, infection rate (perc.inf) will contribute the most to the regression equation when we calculate it.
Before we construct a regression model for yield, we need to analyze how days to maturity (dtm) interacts with infection rate (perc.inf) in the multiple regression. Despite the simple correlation between dtm and perc.inf not being statistically significant, calculating the partial correlation between these two variables may help explain a possible relationship between them. Simple correlations are the basis for calculating the additional correlation relationships.
Ex. 1, Step 6
Now, let's calculate the partial correlation matrix for the 3 variables. To do this, we'll first need to get the package 'ppcor'.
install.packages('ppcor')
library(ppcor)
pcor(data)
R returns
$estimate
$p.value
$statistic
Did the partial correlation follow the simple correlation in magnitude? The partial correlation of dtm with yield (with perc.inf held constant) was -0.6991729, while that of perc.inf with yield (dtm held constant) is -0.9720637. The squared partial correlation coefficients (the partial coefficients of determination) are used in calculating the contribution of each variable to the regression analysis. These values are calculated as in the equations above for the simplest case of multiple regression, where there are two X's and one Y. More complex equations result when there are more than two X variables.
1. Set your working directory to the folder containing the data file barley.csv
2. Read the �le into the R data frame, calling it data.
data<-read.csv("barley.csv", header=T)
3. Check the head of the data to make sure it was read in correctly.
4. Calculate the correlation between the fusarium infection rate (perc.inf) and barley yield.
5. Calculate the correlation between DTM and yield.
6. Install the package ‘ppcor’.
7. Load the package.
8. Calculate the partial correlation between yield, dtm, and perc.inf.
Multiple Regression
Multiple regression determines the nature of relationships among multiple variables. The resulting Y is based on the effect of several X's. How much of an effect each has must be quantified to determine the equation (below). The degree of effect each X has on the Y is related through partial regression coefficients. The b-value estimate of each regression coefficient can be determined by solving simultaneous equations. Usually, computer programs determine these coefficients from the data supplied.
Equation 3

Y = a + b1X1 + b2X2 + ... + bkXk
The a is the Y-intercept, or the Y estimate when all of the X's are 0. The b's are estimates of the true partial regression coefficients β, the weighting of each variable's effect on the resulting Y. The b's are interpreted as the effect of a change in that X variable on Y assuming the other X's are held constant. These can be tested for significance. The weighting of effects now will be based on regression techniques.

The simplest example of multiple linear regression is where two X's are used in the regression. The technique of estimating b1 and b2 minimizes the error sums of squares of the actual Y's from the estimated Y's. The variability of the data (Y's) can be partitioned into that caused by different X variables or into error.
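To make "solving simultaneous equations" concrete, the sketch below obtains the least-squares estimates from the normal equations and checks them against lm(); the data and object names are invented for illustration.

# Invented data: two X variables and one Y
set.seed(1)
x1 <- 1:10
x2 <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)
y  <- 3 + 0.5*x1 + 1.2*x2 + rnorm(10, sd = 0.5)

# Solve the normal equations (X'X) b = X'y for the intercept, b1, and b2
X <- cbind(1, x1, x2)
b <- solve(t(X) %*% X, t(X) %*% y)
b

# lm() produces the same estimates
coef(lm(y ~ x1 + x2))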
Example of Multiple Correlation and Regression
The simplest example of multiple correlation involves two X's. Calculations with more variables follow a similar method, but become more complex. Computer programs have eased the computational problems. Proper analysis of the data and interpretation of analyses are still necessary and follow similar procedures.
The following two-variable research data were gathered relating the yield of inbred maize to the amount of
nitrogen applied and the seasonal rainfall data (Table 2).
Table 2 Yield of inbred maize with nitrogen applied and seasonal rainfall.

Yield (bu/acre)   Nitrogen (lb/acre)   Rainfall (in)
50                5                    5
57                10                   10
60                12                   15
62                18                   20
63                25                   25
65                30                   25
68                36                   30
70                40                   30
69                45                   25
66                48                   30
Review the Data
The first issue is to review how highly correlated the data are. Since visualization of multiple data is more difficult, numerical relationships must be emphasized. The first step is to examine the correlations among the variables. The simple correlations (calculated as in the module on Linear Correlation, Regression and Prediction) may be computed for the three variables (see below).
Simple Correlations (computed from the Table 2 data): rYX1 = 0.89, rYX2 = 0.94, rX1X2 = 0.905
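These correlations can be reproduced in R by entering the Table 2 data directly; a minimal sketch (the data frame and column names are ours):

# Table 2: maize yield, nitrogen applied, and seasonal rainfall
maize <- data.frame(
  yield    = c(50, 57, 60, 62, 63, 65, 68, 70, 69, 66),
  nitrogen = c(5, 10, 12, 18, 25, 30, 36, 40, 45, 48),
  rainfall = c(5, 10, 15, 20, 25, 25, 30, 30, 25, 30)
)

# Simple correlation matrix for the three variables
cor(maize)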
Study Questions 1
Which of the X variables is best correlated with Y?
Rainfall (X2)
Fertilizer (X1)
Partial Coefficients of Determination
All are highly correlated. But these simple correlations include the interactions among variables. To determine individual relationships, calculations of partial coefficients of determination are helpful (below).
Equation 4
These values are the additional variability which can be explained by a variable, such as that by X1, after the variability of X2 alone has been accounted for. These values are used in computing the ANOVA for multiple regression. The partial correlations may be found by taking the square root of these partial coefficients of determination.
Total Coefficients of Determination
The R²-value is the total coefficient of determination, which combines the X's to describe how well their combined effects are associated with the Y's. This is determined by the equation below.
Equation 5
The R² value is the proportion of variance in Y that is explained by the regression equation. This can be used to partition the variability in the ANOVA. The square root of this value gives the correlation of the X's with Y. It is obvious that the correlations are not additive. The simple correlations are all greater than 0.8, and the correlation between X1 and X2 is 0.905. This is where partial correlation comes into play.
Partial Regression Coefficients
Before we can create an ANOVA and test the regression, we need a regression equation as determined by R. The estimate of the regression relationship is found to be the equation below.

Ŷ = 49.5 + 0.089X1 + 0.516X2
The partial regression coefficients indicate that, for the data gathered here, each additional pound of nitrogen applied per acre would produce an additional 0.089 bushels of maize per acre, and each additional inch of rainfall an additional 0.516 bushels per acre. An estimate of the yield is determined by entering the amount of nitrogen applied to the field and the amount of rainfall into the equation. The number produced is the regression equation estimate of the yield based on the data gathered.
The next issue is deciding if this equation is useful and explains the relationship in the gathered data. The sums of squares are partitioned in an ANOVA table, and the significance of the regression equation as a whole and the individual regression coefficient estimates are tested in the next section.
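A sketch of the corresponding analysis in R, continuing with the 'maize' data frame entered above (the object name is ours):

# Multiple regression of yield on nitrogen and rainfall
maize.lm <- lm(yield ~ nitrogen + rainfall, data = maize)
summary(maize.lm)   # partial regression coefficients (about 0.089 and 0.516), t-tests, R-squared
anova(maize.lm)     # partitions the sums of squares used in the ANOVA discussion that follows

# Estimated yield for, say, 30 lb/acre of nitrogen and 20 inches of rainfall
predict(maize.lm, list(nitrogen = 30, rainfall = 20))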
Ex. 2: Multiple Regression and Anova Using R (1)
This exercise contains the following pages:
R CODE FUNCTIONS
• anova
• summary
• lm
• pf
• ppcor
Multiple regression is used to determine the nature of relationships among multiple variables. The response variable (Y) is modeled as a combination of the effects of several explanatory variables (X's). The level of effect each X has on the Y variable must be quantified before a regression equation can be constructed (i.e., Equation 3). The degree of effect each X has on the Y is related through partial regression coefficients. The coefficient estimate of each explanatory variable can be determined by solving simultaneous equations. Usually, computer programs such as R determine these coefficients from the data supplied.
In Equation 3, the a term is the Y-intercept, or the estimate of Y when all of the X's are 0. The b with each X is an estimate of the true partial regression coefficient β for that X variable; the weighting of each variable's effect on the resulting Y. The b's are interpreted as the effect of a change in that X variable on Y, assuming the other X's are held constant. These coefficients can also be tested for significance. The weighting of effects will now be based on regression techniques.

The simplest example of multiple linear regression is where two X variables are used in the regression. The technique of estimating b1 and b2 via multiple regression minimizes the error sums of squares of the actual data from the estimated Y's. The variability in the data can be partitioned into that which is caused by different X variables, or that which is caused by error.
In the file "QM-Mod13-ex2.csv", we have yield data from one inbred maize line under all factorial combinations
of 9 different levels of nitrogen treatment, and 9 different levels of drought treatment. We’ll use these data to
investigate correlations between the variables, to do a multiple regression analysis, and to carry out an analysis
of variance (ANOVA).
Read the dataset into R, and have a look at the structure of the data.
data<-read.csv("ex2_data.csv", header=T)
head(data)
R returns
drought N yield
1 -4 0 1886.792
2 -4 28.025 2590.756
3 -4 56.05 3743.000
4 -4 84.075 4910.937
5 -4 112.1 5656.499
6 -4 140.125 5689.165
The data contain entries for yield (kg/ha), level of nitrogen applied (kg/ha), and a “drought” score to indicate the
level of drought stress applied (i.e. a level of -4 is the maximum drought stress applied and a value of 4 is the
minimum level of drought stress).
Note: even though we have fixed treatments assigned to each test plot, we will run the analyses in this module as if they were random treatments (i.e., keeping the values for drought and N as numeric). This will allow us to investigate simple and partial correlations.
Simple Correlation
The correlation between X and Y, rXY = cov(X, Y) / (sX sY), is calculated as the covariance of X and Y divided by the product of the standard deviations of X and Y.
The first step is to review how highly correlated the data are. Since visualization of multiple data is more difficult, numerical relationships must be emphasized. Let's examine the correlations among the variables. Calculate the simple correlations between the 3 variables by entering the following into the console window.
cor(data)
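As a quick check of that definition, the same value can be computed from the covariance and standard deviations for any pair of columns, for example:

# cor() matches cov(X, Y) divided by the product of the standard deviations
cov(data$N, data$yield) / (sd(data$N) * sd(data$yield))
cor(data$N, data$yield)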
Partial Correlation
Taking X1 to be nitrogen, X2 to be drought, and Y to be yield, we can list the simple correlation variables.
Simple Correlations
Both nitrogen and drought are correlated with yield, but these simple correlations include interactions among the variables. To determine individual relationships, calculations of partial correlation coefficients are helpful. Squared partial correlation coefficients give the additional variability in the response variable that can be explained by an independent variable, such as X1, after the variability of another independent variable, such as X2, alone has been accounted for. These values are also used in computing the ANOVA for multiple regression.
We’ll now do a quick investigation of the partial correlations between the variables in the dataset. If you haven’t
already, load the ‘ppcor’ package. Then, use the pcor command to obtain the matrix of partial correlations
between all variables in the data set.
library(ppcor)
pcor(data)
R returns 3 matrices: a matrix with the partial correlation coefficient estimates ($estimate), a matrix with the test statistic for the estimate ($statistic), and a matrix for the p-value of the test statistic ($p.value).
$estimate
          drought          N     yield
drought 1.0000000 -0.2393630 0.7540868
N      -0.2393630  1.0000000 0.3174210
yield   0.7540868  0.3174210 1.0000000
$p.value
drought N yield
drought 0.000000e+00 0.029458897 3.658519e-24
N 2.945890e-02 0.000000000 3.113831e-03
yield 3.658519e-22 0.003113831 0.000000e+00
$statistic
drought N yield
Let's calculate the partial correlation coefficient for nitrogen on yield by hand to check the calculation returned by R in the estimate matrix.

You can see that the value in the estimate matrix for the partial correlation coefficient between nitrogen and yield is identical to the value obtained by our hand-calculation. Also, based on the p-value matrix, all of the partial-correlation estimates are statistically significant.
The test-statistic matrix contains values calculated from the standard normal distribution (with a mean of 0 and standard deviation of 1). The test statistic for the partial correlation of nitrogen on yield is 2.95621. We can check that this value is correct by calculating the p-value for this value from the standard normal distribution by entering
(1-pnorm(2.95621, mean=0, sd=1))*2
R returns
[1]0.00311834
The p-value we calculated by hand matches the p-value given for this partial correlation coefficient in the output of the pcor function.
The R² value is the total coefficient of determination, which combines the explanatory variables (X's) to describe how well their combined effects are associated with the response variable (Y). This is determined by the following equation (Equation 2 above):

R² = (r²YX1 + r²YX2 − 2 rYX1 rYX2 rX1X2) / (1 − r²X1X2)

The R² is very useful for interpreting how well a regression model fits. Its value is the proportion of variance in Y that is explained by the regression equation. The closer to 1.0, the better the fit; a value of 1 would mean all of the data points fall on the regression line. The square root of this value gives the correlation of the X's with Y. It is obvious that the correlations are not additive. This is where partial correlation comes into play.
The drawback of relying on the R² value as a measure of fit for a model is that the value of R² increases with each additional term added to the regression model, regardless of how important the term is in predicting the value of the dependent variable. The Adjusted R² value (or R²Adj) is a way to correct for this modeling issue. The formula for R²Adj is:

R²Adj = 1 − [(1 − R²)(n − 1) / (n − k − 1)]

where R² is the coefficient of determination, n is the sample size, and k is the number of terms in the regression model. The R-squared value increases with each additional term added to the regression model, so taken by itself it can be misleading. The R²Adj takes this into account and is used to balance the cost of adding more terms; i.e., it penalizes the R² for each additional term (k) in the model. The R²Adj value is most important for comparing and selecting from a set of models with different numbers of regression terms. It is not of great concern until you are faced with choosing one model to describe a relationship over another. We'll carry out hand calculations for both R² and R²Adj after we run the regression in R.
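A small helper for that hand calculation (the function name is ours, not from any package); n = 81 observations and k = 2 model terms match this exercise.

# Adjusted R-squared from R-squared, sample size n, and number of model terms k
adj_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

# Example with the R-squared reported later in this exercise
adj_r2(0.5885, n = 81, k = 2)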
Ex. 2: Multiple Regression and Anova Using R (10)
The testing of the regression equation partitions the total sum of squares using the total coefficient of determination, R². Note that this is the same as the square of the total correlation.

Initially, the null hypothesis being tested is that the whole regression relationship is not significantly different from 0.
The F-test for multiple linear regression uses the regression mean square to determine the amount of variability explained by the whole regression equation. If the regression mean square is significant at your specified level, the null hypothesis that all of the regression coefficients are equal to 0 is rejected. This F-test does not differentiate between coefficients; all are significant or none are, according to the test.
Individual regression coefficients (b1, b2, etc.) may be tested for significance. The simple coefficient of determination between each X and Y explains the sum of squares associated with each regression coefficient including interactions with other X's. The partial coefficient of determination between each X and Y explains the additional variability without interaction. These can be tested with the residual error not explained by the regression model to test the significance of each X.

Each coefficient may also be tested with a t-test; R does this automatically when you run a multiple regression model using the lm function.
MULTI-LINEAR REGRESSION
Let’s run a multiple regression analysis where yield is the response variable and drought and nitrogen are the
explanatory variables. We will keep nitrogen and drought as numeric variables for this analysis, but later will run
the same analysis with these variables as factors.
summary(lm(data=data,yield~drought+N))
1. data = data indicates that we want to run the linear model with the dataset 'data'
2. yield ~ drought + N specifies that the regression equation we are analyzing is yield = 'the amount of drought applied' + 'the amount of nitrogen applied'
3. lm indicates to R that we want to run a linear regression model
4. summary indicates that we want R to return all of the useful information from the regression analysis back to us.
R returns
Call:
Residuals:
Coe�cients:
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The R² value is given at the bottom of the R output as 0.5885. This means that the model explains 58.85% of the variation in yield. Let's calculate the R² value by hand using the simple correlation coefficient matrix from above.
The value for R² obtained by our hand calculation is identical to the value returned by R.

Now, let's calculate the R²Adj for the model by hand. Use the value of R² from the R output (0.5885).

This is the same value for R²Adj as given in the R output (under "Adjusted R-squared").
Ex. 2: Multiple Regression and Anova Using R (15)
VARIABLE INTERACTION
Should we include a term in the linear model indicating the interaction between nitrogen and drought? Let's run the regression again, this time adding a variable accounting for the interaction between the two independent variables (i.e., the amounts of drought and nitrogen applied) into the model. The interaction variable is specified using a multiplication sign (*) with the explanatory variables that you are analyzing for interaction.
summary(lm(data=data,yield~N+drought+N*drought))
R returns
Call:
Residuals:
Coe�cients:
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compare the R²Adj value and the F-statistic of the model including the interaction to those of the model not including the interaction. Which model fits the data better?

The model without the interaction between nitrogen and drought has a slightly better fit for these data than the model including the interaction. Also, the regression coefficient on the interaction term has a very high p-value, indicating that it is not statistically significant. Save the model without the interaction as 'm1'.
m1<-lm(data=data,yield~N+drought)
Now carry out the ANOVA for the model without the interaction.
anova(lm(data=data,yield~drought+N))
Response: yield
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table lists the sources of variation for the model terms and error, along with their respective degrees of freedom (df), sums of squares, mean squares, and an F-test. Each model parameter has 1 df. The total df is 1 less than the total number of observations, in this case 80 (i.e., 1 + 1 + 78); the single df that is subtracted is the correction for the intercept. We are most interested in the F-test for the model, which is calculated by dividing the model MS by the error MS. The F-statistic and p-value of the F-statistic for the model are listed at the bottom of the R output that we obtained from running the multiple regression model. The model MS is not listed in the anova table R returned to us; however, because each term has 1 df, we can calculate the F-statistic for the whole model from the anova output as the mean of the F-statistics for the model parameters. The p-value for the model can also be calculated from the anova table. The F-statistic value we obtain by averaging the drought and N F-statistics is 55.78.
To get the p-value for this F-statistic, in the R console window enter
1-pf(55.78,2,78)
The value returned is 3.725908 × 10⁻¹³. The probability of an F-statistic of 55.78 occurring by chance is incredibly small, so we conclude that the model we have developed explains a significant proportion of the variation in the data set.
R² can be calculated from the anova table as the model sum of squares (SS) divided by the corrected total SS. This is the same value as was reported for R² in the regression output.
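A sketch of that check in R, assuming the no-interaction model was saved as m1 above:

# R-squared = model SS / corrected total SS, from the anova table of m1
a <- anova(m1)
ss_model <- sum(a[["Sum Sq"]][1:2])   # SS for drought plus SS for N
ss_total <- sum(a[["Sum Sq"]])        # model SS plus residual SS
ss_model / ss_total                   # should match the R-squared in summary(m1)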
Let's run the same multiple regression model again, but this time having N and drought as factors instead of numbers. We must tell R that we want entries for these variables to be considered factors, and not numbers. As factors, there are 9 specified treatment amounts for each of the 2 independent variables, and 81 possible combinations between the 2 factor variables.
Convert the data for N and drought into factor variables.
data$N<-as.factor(data$N)
data$drought<-as.factor(data$drought)
is.factor(data$N)
R returns
[1]TRUE
Great, now let’s run the multiple regression. Save this model as ‘m2’.
m2<-lm(data=data,yield~drought+N)
summary(m2)
R returns
Call:
Residuals:
Coe�cients:
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explain how these results differ from the linear regression with our explanatory variables as numbers (how do the R² values differ?).

Under the "Coefficients" heading, in the "Estimate" column, we find the intercept, as well as the X variable coefficients for the multiple regression equation. You'll notice that the variables for drought = -4 and N = 0 are not listed in the "Coefficients" output. The reason for this is that the "intercept" encapsulates these variables, meaning N = 0 and drought = -4 is the baseline in the regression model. All of the other effects of variable combinations on yield are quantified with respect to this baseline.
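You can confirm which levels the intercept represents by checking the first level of each factor (a quick sketch, assuming the as.factor() conversions above):

# The first level of each factor is the baseline absorbed into the intercept
levels(data$drought)[1]   # expected: "-4"
levels(data$N)[1]         # expected: "0"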
Write the equation from the linear regression output of the model yield ~ drought + N for N = 25 and drought = 0, with "drought" and "N" as numeric variables. Then write out the equation for the same linear model and parameters, but with "drought" and "N" as factors. Compare the predicted yields.
#numeric model (m1)
predict(m1,list(N=25,drought=0))
R returns
8030.944 [kg/hectare]
#factored model (m2) (notice the quotes around the numbers to indicate factors)
predict(m2,list(N="28.025",drought="0"))
R returns
8557.915 [kg/hectare]
Ex. 3: Correlation, Multiple Regression and Anova (1)
This exercise contains the following pages:
R CODE FUNCTIONS
• anova
• summary
• lm
• install.packages
• library
• cor
• pcor
You are a maize breeder in charge of developing an inbred line for use as the ‘female’ parent in a hybrid cross.
Yield of the inbred female parent is a major factor affecting hybrid seed production; a high level of seed
production from the hybrid cross leads to more hybrid seed that can be sold. Only 2 lines remain in your
breeding program, and your boss wants you to determine which of the two lines has the best yield-response to
variable Nitrogen fertilizer (N) applications under several different drought levels. The three-variable dataset
relating the yield (per plot) of the 2 inbred lines to the amount of N and level of drought applied to each plot can
be found in the file 13_ex3.csv.
Determine the simple and partial correlation amongst yield and the amount of nitrogen fertilizer applied, and
drought for each of the lines. Then, develop a regression equation to predict yield from the independent
variables. Test to see if an interaction between drought and N should be included in the linear model. Decide on
a model to evaluate these data and decide which of the 2 lines should be selected.
data<-read.csv("ex3.csv",header=TRUE)
Check the head of the data to make sure the file was read into R correctly.
head(data)
All data should be of the numeric class (that is, R recognizes all entries for all explanatory variables as
numbers). Calculate the simple correlation matrix for the data.
cor(data)
Drought has a very high simple correlation with yield, and N has a moderate correlation with yield. Keeping in mind that all variable data are classified as numbers, what does the positive correlation between line and yield imply (think about how line is coded in the data)?

There are only 2 lines in the data. The positive correlation between yield and line means that the two variables move in the same direction; a higher value for line (i.e., 2) corresponds to higher values of yield, and vice versa. This positive correlation provides evidence for line 2 being the higher yielding line.
If there were more than 2 lines and more than 2 reps in these data, could we analyze the data in the same way (i.e., could 'line' and 'rep' be classified as numbers in the analysis)? Could we calculate the correlation between yield and line, and rep and yield? If we had more than 2 lines and reps in these data, we'd have to reclassify the 'line' and 'rep' variables as factors. We would then not be able to calculate the correlation between yield and line, and rep and yield.
You should've already installed the package 'ppcor'. If you have, ignore the 'install.packages' command, and simply load the package using the 'library' command.
install.packages('ppcor')
library(ppcor)
pcor(data)
$estimate
$p.value
$statistic
What linear model would you use to analyze these data? Should you include the interaction between N and drought? Should you include rep? Should you include line? Test some possible models, then explain which model you think is best and why.

The coefficient on the interaction (drought*N) is significant (barely) at α = 0.1. The coefficient on rep is not significant. Thus, 'rep' should be excluded and the interaction term should be included.
summary(lm(data=data,yield~line+drought+N+drought*N))
Call:
Residuals:
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Run the anova for the linear model you chose in the previous question.
anova(lm(data=data,yield~line+drought+N+N*drought))
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the regression equation, what yield value would you predict to obtain for each line under N = 140.125 and drought = 0?
#Note: line=1 is used for 'line 1' and line=2 is used for 'line 2'.
predict(m1, list(N=140.125,drought=0,line=1))
9783.187
predict(m1,list(N=140.125,drought=0,line=2))
10338.84
Interpret the results, which line would you choose and why?
Designate 'line' and 'rep' as factors, and run the linear regression again with the same model.
data$line<-as.factor(data$line)
data$rep<-as.factor(data$rep)
summary(lm(data=data,yield~line+rep+drought+N+N*drought))
Interpret the results of the linear regression output with line as a factor (i.e. why is ‘line2’ listed and ‘line1’ not
listed in the output?).
’line2’ indicates that if we are predicting a yield for line 2 based on the linear regression, we need to add 555.6503
kg/ha to the predicted yield. If a yield value for ‘line1’ is being predicted, we do not add anything to the predicted
yield value based on the line. In effect, the ‘Intercept’ includes ‘line1’. ‘line1’ can be considered the baseline, and
the ‘line2’ a deviation from the baseline.
Call:
Residuals:
Coe�cients:
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because multiple effects are involved in multiple regression, determining which terms and variables are of importance adds a level of difficulty to the analysis. Not only are there direct effects from certain variables, but also combinations of effects among separate variables. These are caused by interaction between several variables. The effects are estimated by using the associated regression coefficients.
The initial test is to determine if the total regression equation is significant. As in linear regression, "Does the regression relationship explain enough of the variability in Y to be significant?" Partitioning the sums of squares into an ANOVA table can be used to resolve the hypothesis test. The ANOVA table for multiple regression is similar to that in linear regression. Additional regression degrees of freedom are included for each X variable. Two df are used for a regression relationship with two variables.
The testing of the regression equation partitions the total sum of squares using the total coefficient of determination, R² (equation below). Note that this is the same as the square of the total correlation, as given in "Total Correlation".

Equation 6

R² = regression SS / total SS, so regression SS = R² × total SS
The Whole Regression Relationship
The hypothesis being tested initially is that the whole regression relationship is not significantly different from 0 (Equation 7).

Equation 7

H0: β1 = β2 = ... = βk = 0
The F-test uses the regression mean square, RegMS, to determine the amount of variability explained by the whole regression equation. If the RegMS is significant at your alpha level, the null hypothesis that all of the partial regression coefficients equal 0 is rejected. This F-test does not differentiate any coefficients; all are significant or none are, according to the test.
The total sum of squares in this data set can be calculated as 338 (see Exercise 3). The R² was calculated in the last section as 0.900. The ANOVA table (Table 3) with two regression degrees of freedom is calculated below.

Table 3

Source       df   SS      MS      F      P
Regression   2    304.2   152.1   31.5   <0.01
Residual     7    33.8    4.83
Total        9    338.0
The complete regression model is significant at a probability much less than 0.01. The regression equation is significant, explaining sufficient variability in the data.
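The entries in Table 3 follow directly from the total SS and R²; a short sketch of the arithmetic in R:

# Partition the total SS using R-squared = 0.900 and total SS = 338
ss_total <- 338
r2       <- 0.900
ss_reg   <- r2 * ss_total          # regression SS, with 2 df
ss_res   <- ss_total - ss_reg      # residual SS, with 10 - 2 - 1 = 7 df
(ss_reg / 2) / (ss_res / 7)        # F statistic; compare with qf(0.99, 2, 7), about 9.55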
Regression Coefficient Significance
Individual regression coefficients (b1, b2, etc.) may be tested for significance. The simple coefficient of determination between each X and Y explains the sum of squares associated with each regression coefficient including interactions with other X's. The partial coefficient of determination between each X and Y explains the additional variability without interaction. These can be tested with the residual error not explained by the regression to test the significance of each b.

Each coefficient may also be tested with a t-test, and this is done in the coefficients table of the lm() output. Tested individually, X2 is significant while the coefficient for X1 is not. The nitrogen term would probably be dropped because it explains little additional variance beyond that from the rainfall. The final equation would be the simpler linear equation obtained by regressing yield on rainfall alone, not including N fertilizer. This equation is:
Equation 8
You will note from the original regression equation that the X1 coefficient was small. This does not necessarily mean that small regression coefficients are not significant; they need to be tested to determine their significance. The purpose of testing is not to remove terms for its own sake, but to remove terms which add complexity without explaining variability in the regression analysis. Dropping terms from an equation is not always done. All coefficients in the regression equation may be significant and may be kept to explain the variability in the response.
Ex. 4: Non-Linear Regression and Model Comparison (1)
R CODE FUNCTIONS
• anova
• summary
• lm
• install.packages
• library
• cor
• pcor
• ggplot
• reshape2
If the assumptions necessary for multiple regression are not met, a number of problems can arise. These problems can usually be seen when examining the residuals: the differences of the actual Y's from the predicted Y's.
First; if the Y’s are not independent, serial correlation (or auto-correlation) problems can result. These can be
seen if the residuals are plotted versus the X values, showing a consistently positive or negative trend over
portions of the data. When collecting data over a period of time, this can be a problem, since temporal data has
some relationship to the value at the previous time. For instance, a temperature measurement 5 minutes after a
previous one is going to be strongly correlated with the previous measurement because temperatures do not
change that rapidly. These problems can be overcome by analyzing the data using different techniques. One of
these is to take the difference of the value at the current time step from the value at the last time step as the Y
value instead of the measured value.
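A sketch of that differencing in R, using an invented temperature series (the data and object names are illustrative only):

# Invented 5-minute temperature readings that drift slowly, so successive values are correlated
set.seed(42)
temps <- 20 + cumsum(rnorm(50, mean = 0, sd = 0.2))

# Correlation between each reading and the next one is high...
cor(temps[-length(temps)], temps[-1])

# ...so analyze the step-to-step differences instead of the raw values
dtemps <- diff(temps)
plot(dtemps, type = "b", xlab = "Time step", ylab = "Change in temperature")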
Ex. 4: Non-Linear Regression and Model Comparison (2)
Second, violating the equal variances assumption leads to heteroscedasticity. Here the variance changes for changing values of X. This shows up as a plot of residuals whose spread gradually increases toward lower or higher X's.
The third problem is multicollinearity. Here two or more independent variables (X's) are strongly correlated (for example, growing degree days (GDD) and hours of sunlight). The individual effects are hard to separate and lead to greater variability in the regression. Large R² values with insignificant regression coefficients are seen with this problem. Eliminating the least significant variable, after testing, will often solve this problem without changing the R² very much.
Ex. 4, Non-Linear Regression and Model Comparison (3)
POLYNOMIAL FUNCTION
A set of functions which can be useful for describing quantitative responses are the various orders of polynomial functions. Polynomial functions have the general form

Y = a + bX + cX² + dX³ + ...

A horizontal line is a polynomial function of order 0. Linear relationships are polynomial functions of the first order. The highest exponent of X in the function determines the order of the polynomial (0 for a horizontal line, 1 for a simple linear regression equation). Each order has a distinctive shape. First order polynomials produce a straight line, second order polynomials produce a parabola, third order polynomials produce a curve with one inflection point, and fourth order polynomials produce a curve with up to two inflection points. Graphs of the first 4 orders are shown below.
Ex. 4, Non-Linear Regression and Model Comparison (4)
As with the other functions, an infinite number of curves may be created by varying the coefficients. A polynomial function can usually be fit to most sets of data. The value of such relationships can be questioned at very high orders, though. Important in most functional relationships is the physical or biological relationship represented in the data. Higher order relationships sometimes produce detailed equations which have relatively limited physical or biological relevance.

Each additional order should be tested for significance using the hypothesis H0: highest order coefficient = 0. This can be tested using Equation 10 (see "Polynomial Relationships" below).
Now we'll look at some very simple data and try to find the best model to fit the data. In the file QM-Mod13-ex4.csv, you'll find a very small data set giving the rate of runoff (m³/sec) for various amounts of rainfall (mm). Read the file QM-Mod13-ex4.csv into R and take a look at it (there are only 10 entries, so don't use the "head" command).
data<-read.csv("ex4.csv",header=T)
data
Ex. 4, Non-Linear Regression and Model Comparison (6)
R returns
Rainfall Runoff
1 3.00 0.00
2 12.00 1.00
3 14.00 2.50
4 14.50 3.25
5 15.00 8.50
6 15.50 9.50
7 16.00 12.50
8 17.50 13.50
9 19.00 16.00
10 19.25 19.00
Let’s plot the data quickly to see if we visualize any obvious trends.
library(ggplot2)
qplot(data=data,x=Rainfall,y=Runoff)
Ex. 4, Non-Linear Regression and Model Comparison (7)
R returns
Fig. 2
Let’s run the regression models of the 1st and 2nd order (i.e. Runoff ~ Rainfall and Runoff ~ Rainfall^2) and
compare them, visually and statistically. We’ll plot the predictive function given by each regression model
output on a scatterplot with these data and compare the models visually.
We use "I" in front of the "x" variable in the "lm" command to indicate to R that we want the higher order of x included in the model (i.e., for a second order model, we would indicate x² by entering I(x^2)).
Ex. 4, Non-Linear Regression and Model Comparison (8)
Enter the models into the R console. Call the “Rainfall” variable x, and the “Runoff” variable y.
x<-data$Rainfall
y<-data$Runoff
m1<-lm(y~x,data=data) #1st order
m2<-lm(y~x+I(x^2),data=data) #2nd order
m3<-lm(y~x+I(x^2)+I(x^3),data=data) #3rd order
Here, we create the points on the line or parabola given by each model. Because the distance between these points is so small, they will appear as a line on our figure.
ld<-data.frame(x=seq(0,20,by=0.5))
result<-ld
result$m1<-predict(m1,newdata=ld)
result$m2<-predict(m2,newdata=ld)
result$m3<-predict(m3,newdata=ld)
Ex. 4, Non-Linear Regression and Model Comparison (9)
Here, we use the package “reshape2” to change the format of the data to facilitate graphing in the next step.
library(reshape2)
library(ggplot2)
result<-melt(result,id.vars="x",variable.name="model",value.name="fitted")
names(result)[1:3]<-c("rainfall","order","runoff")
levels(result$order)[1:3]<-c("1st","2nd","3rd")
Finally, we are ready to plot the 1st, 2nd, and 3rd order regression models on top of the original data.
ggplot(result,aes(x=rainfall,y=runoff))+
  geom_point(data=data,aes(x=x,y=y))+
  xlab("Rainfall (mm)")+
  ylab("Runoff (m^3/sec)")+
  geom_line(aes(colour=order),size=1)
Ex. 4, Non-Linear Regression and Model Comparison (10)
R returns
Fig. 3
Let’s take a look at the output for the 1st order regression model and anova.
summary(m1)
anova(m1)
Ex. 4, Non-Linear Regression and Model Comparison (11)
R outputs,
Call:
Residuals:
Coe�cients:
(Intercept)
x **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response: y
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The equation given by the linear model is Runoff = -8.3336 + 1.1601 × Rainfall. The intercept is not statistically significant, but the x variable is. The r² value is 0.6519, and the linear regression is significant, but there is scatter about the regression line. The anova shows a regression SS of 261.11 and a residual SS of 139.4 for the 1st order model.
Ex. 4, Non-Linear Regression and Model Comparison (13)
summary(m2)
anova(m2)
R outputs,
Call:
Residuals:
Coe�cients:
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.115 on 7 degrees of freedom
Response: y
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here you can see that the R² value increased, indicating that more of the variance in the data is explained by the regression equation. Testing the reduction using the F-test produces a very significant decrease in unexplained variability, as the residual SS drops from 139.4 to 31.316. The regression line follows the data closely.
Ex. 4, Non-Linear Regression and Model Comparison (15)
summary(m3)
anova(m3)
Call:
Residuals:
Coe�cients:
Response: y
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, you see that not much more information about the response has been gained. The R² (and R²Adj) increases little, and very little additional variability is explained in the third order regression. In the anova table, the F-value for the third order regression is not significant at even the 0.10 level. The second order polynomial, therefore, is the best polynomial equation for describing the response. Physically, we are trying to fit a relationship of rainfall to runoff. The negative runoff, or infiltration, predicted after rain begins makes sense. The x² relationship may be explainable since we are considering a volume of runoff from a depth of rainfall. The equation does fit the data well. Again, this fits only the data gathered. Use of this relationship beyond the scope of this dataset would be improper.
Problems in Multiple Regression
Examining Problems
Recall the assumptions for regression discussed at the beginning of the lesson and in the module on Mean
Comparisons. If the assumptions necessary for multiple regression are not met, a number of problems can
arise. These problems can usually be seen when examining the residuals, the difference of the actual Y's from
the predicted Y's.
First, if the Y's are not independent, serial correlation or auto-correlation problems can result. These can be
seen if the residuals are plotted versus the X values, showing a consistently positive or negative trend over
portions of the data. When collecting data over a period of time, this can be a problem, since temporal data has
some relationship to the value at the previous time. For instance, a temperature measurement 5 minutes after a
previous one is going to be strongly correlated with the previous measurement because temperatures do not
change that rapidly. These problems can be overcome by analyzing the data using different techniques. One of
these is to take the difference of the value at the current time step from the value at the last time step as the Y
value instead of the measured value.
Second, violating the equal variances assumption leads to heteroscedasticity. Here the variance changes for changing values of X. This shows up as a plot of residuals whose spread gradually increases toward lower or higher X's. The residual plot from the replicated data regression in the module on Linear Correlation, Regression and Prediction shows a hint of this. Notice how the residuals start to spread slightly as X increases (Fig. 4).
Fig. 4 Residuals, or deviation of each data point from the calculated regression equation.
Multicollinearity
Fig. 5 Residuals, or deviation of each data point from the calculated regression equation.
The third problem is multicollinearity, as we discussed in the first part of the unit. Here two or more independent variables (X's) are strongly correlated (for example, the GDD and hours of sunlight variables). The individual effects are hard to separate and lead to greater variability in the regression. Large R² values with insignificant regression coefficients are seen with this problem. Eliminating the least significant variable, after testing, will often solve this problem without changing the R² very much. The example just discussed showed such a property, where the X1 and X2 values were strongly correlated (r = 0.905). The insignificant coefficient can be eliminated, usually solving the problem.
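A brief sketch of how multicollinearity shows up and is handled in R, using invented data in which the two predictors are nearly redundant:

# Invented data: x2 is almost a copy of x1, so the two predictors are strongly correlated
set.seed(7)
x1 <- runif(30, 0, 10)
x2 <- x1 + rnorm(30, sd = 0.1)
y  <- 2 + 1.5*x1 + rnorm(30)

cor(x1, x2)                 # correlation between the predictors is close to 1
summary(lm(y ~ x1 + x2))    # large R-squared, but inflated standard errors on b1 and b2
summary(lm(y ~ x1))         # dropping the redundant variable changes R-squared very little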
Polynomial Functions
A set of functions which can be useful for describing quantitative responses are the various orders of polynomial functions. A horizontal line is a polynomial function of order 0. Linear relationships are polynomial functions of the first order. The highest exponent of X in the function determines the order of the polynomial (0 for a horizontal line, 1 for a simple linear regression equation). Each order has a distinctive shape. First order polynomials produce a straight line. Second order polynomials produce a parabola. Graphs of the first 4 orders have similar shapes to those in Fig. 6.
As with the other functions, an infinite number of curves may be created by varying the coefficients. A polynomial function can usually be fit to most sets of data. The value of such relationships can be questioned at very high orders, though. Important in most functional relationships is the physical or biological relationship represented in the data. Higher order relationships sometimes produce detailed equations which have relatively limited physical or biological relevance.
Equation 9

Y = a + bX + cX² + dX³ + ...
These equations, which are linear in the parameters (a, b, c, . . .), are used to fit experimental data in a manner similar to the methods described earlier in this unit.
Polynomial Relationships
Polynomial equations are generally fit sequentially, with terms X, X², X³, etc. successively included.

Polynomial relationships are calculated to reduce the variability around the regression line, whatever the order. The usual technique is to begin with a linear equation. If the deviation from this line is significant, add a term to reduce the sum of squares about the line. Adding another order to the polynomial reduces the sums of squares. When the reduction of the sum of squares by adding another order becomes small, the limit of the equation has been reached. Enough terms can be added to fit any data set. Generally, a third order equation is the upper limit of terms in an equation that have any relevance. More terms often simply fit the error scatter of the data into the equation without adding additional relevance.
Each additional order should be tested for significance using the hypothesis H0: highest order coefficient = 0. This can be tested using the equation below.

Equation 10

F = [residual SS (lower order model) − residual SS (higher order model)] / residual MS (higher order model)

where:

Numerator df = 1
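In R, this test is most easily run by comparing the nested polynomial fits with anova(); a sketch, assuming the first and second order models m1 and m2 fitted as in Exercise 4:

# F-test of H0: the highest-order (here, quadratic) coefficient equals 0
anova(m1, m2)

# The same test by hand from the residual sums of squares of the two fits
rss1 <- sum(resid(m1)^2)
rss2 <- sum(resid(m2)^2)
F_val <- (rss1 - rss2) / (rss2 / df.residual(m2))
1 - pf(F_val, 1, df.residual(m2))   # p-value for the added quadratic term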
Polynomial Example
Let's use the example from the module on Linear Correlation, Regression and Prediction. The data set was
approximated using a linear model (Fig. 7).
Fig. 7 Linear regression applied to runoff from a �eld based on rainfall data.
The R² value is 0.62 with a regression SS of 242.1 and a residual SS of 149.2. The linear regression is significant, but there is scatter about the regression line. Fitting the same data with a second order polynomial produces:
df SS MS
Total 9 391.3
critical F = 12.25; P = 0.01
Here you can see that the R2 value increased, indicating that more of the variance in the data is explained by the
regression equation. Testing the reduction using the F-test produces a very signi�cant decrease in unexplained
variability as the residual SS drops from 149.2 to 29.1. The regression line follows the data closely (Fig. 8).
Fig. 8 Linear regression applied to runoff from a �eld based on rainfall data.
Going a step further to assure that most of the variance is explained by the regression equation, we fit a third order polynomial (Table 5).
Total 9 391.3
critical F = 3.29; P = 0.01
Here, you see that not much more information about the response has been gained. The R² increases little and very little additional variability is explained in the third order regression. The F-value for the third order regression is not significant at even the 0.10 level. The second order polynomial, therefore, is the best polynomial equation for describing the response. Physically, we are trying to fit a relationship of rainfall to run-off. The negative run-off, or infiltration, after rain begins makes sense. The X² relationship may be explainable since we are considering a volume of run-off from a depth of rainfall. The equation does fit the data well. Again, this fits only the data gathered. Use of this relationship beyond the scope of this data set would be improper.
Ex. 5: Non-Linear Multiple Regression Analysis (1)
R CODE FUNCTIONS
• anova
• summary
• lm
• install.packages
• library
• cor
• pcor
You are a maize breeder in charge of developing an inbred line for use as the ‘female’ parent in a hybrid cross.
Yield of the inbred female parent is a major factor affecting hybrid seed production; a high level of seed
production from the hybrid cross leads to more hybrid seed that can be sold. Only 3 lines remain in your
breeding program, and your boss wants you to determine 1. Which is the best model to use to analyze the data,
and 2. Which of the three lines should be selected for advancement in the breeding program. The three-variable
data-set relating the yield (per plot) of the 3 inbred lines (evaluated in 3 reps) to the amount of N and level of
drought applied to each plot can be found in the file ex5.csv.
Answers:
Students should test models on their own to find the best one.
Ex. 5: Non-Linear Multiple Regression Analysis (3)
The correct model is: yield ~ N + drought + line + rep + N*drought + I(N^2) + I(drought^2). "rep" and "line" should be factors, as the numeric values (1 to 3) are identifiers only and don't indicate a treatment amount.

The model including drought and N as 2nd order terms is the best. The model that has drought as a 2nd order polynomial and N as a 3rd order polynomial technically has a better R²Adj; however, since the difference between the R²Adj values of the 2 models is incredibly small AND since the N² coefficient is not significant in the model with the higher order polynomial, we choose the simpler of the 2.
data$line<-as.factor(data$line)
data$rep<-as.factor(data$rep)
summary(lm(data=data,yield~N+drought+line+rep+N*drought+I(N^2)+I(drought^2)))
Ex. 5: Non-Linear Multiple Regression Analysis (4)
R outputs,
Call:
lm(formula = yield ~ N + drought + line + rep + N * drought + I(N^2) + I(drought^2), data = data)
Residuals:
Coe�cients:
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m2)
Response: yield
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multiple Correlation
Polynomial Regression
1. In your own words, write a short summary (< 150 words) for this module.
2. What is the most valuable concept that you learned from the module? Why is this concept valuable to
you?
3. What concepts in the module are still unclear/the least clear to you?
Acknowledgements
This module was developed as part of the Bill & Melinda Gates Foundation Contract No. 24576 for Plant
Breeding E-Learning in Africa.
Quantitative Methods Multiple Regression Authors: Ron Mowers, Dennis Todey, Ken Moore, and Laura Merrick (ISU)
Multimedia Developers: Gretchen Anderson, Todd Hartnell, and Andy Rohrback (ISU)
How to cite this module: Mowers, R., D. Todey, K. Moore, and L. Merrick. 2016. Multiple Regression. In Quantitative
Methods, interactive e-learning courseware. Plant Breeding E-Learning in Africa. Retrieved from
https://pbea.agron.iastate.edu.
Source URL: https://pbea.agron.iastate.edu/course-materials/quantitative-methods/multiple-regression-0?cover=1