Simple Linear Regression Analysis
Regression analysis is a statistical technique that attempts to explore and model the relationship between two or more variables. For
example, an analyst may want to know if there is a relationship between road accidents and the age of the driver. Regression analysis
forms an important part of the statistical analysis of the data obtained from designed experiments and is discussed briefly in this chapter.
Every experiment analyzed in a Weibull++ DOE folio includes regression results for each of the responses. These results,
along with the results from the analysis of variance (explained in the One Factor Designs and General Full Factorial Designs chapters),
provide information that is useful to identify significant factors in an experiment and explore the nature of the relationship between these
factors and the response. Regression analysis forms the basis for all Weibull++ DOE folio calculations
related to the sum of squares used in the analysis of variance. The reason for this is explained in Appendix B. Additionally, DOE folios
also include a regression tool to see if two or more variables are related, and to explore the nature of the relationship between them.
This chapter discusses simple linear regression analysis while a subsequent chapter focuses on multiple linear regression analysis.
As an example, consider an experiment in which the yield of a chemical process is recorded at a number of temperature settings. This data can be entered in the DOE folio as shown in the following figure:
$$y = \beta_0 + \beta_1 x + \epsilon$$

The above equation is the linear regression model that can be used to explain the relation between $x$ and $y$ that is seen on the scatter plot above. In this model, the mean value of $y$ (abbreviated as $E(y)$) is assumed to follow the linear relation:

$$E(y) = \beta_0 + \beta_1 x$$

The actual values of $y$ (which are observed as yield from the chemical process from time to time and are random in nature) are assumed to be the sum of the mean value, $E(y)$, and a random error term, $\epsilon$:

$$y = E(y) + \epsilon = \beta_0 + \beta_1 x + \epsilon$$

The regression model here is called a simple linear regression model because there is just one independent variable, $x$, in the model. In regression models, the independent variables are also referred to as regressors or predictor variables. The dependent variable, $y$, is also referred to as the response. The slope, $\beta_1$, and the intercept, $\beta_0$, of the line are called regression coefficients. The slope, $\beta_1$, can be interpreted as the change in the mean value of $y$ for a unit change in $x$.

The random error term, $\epsilon$, is assumed to follow the normal distribution with a mean of 0 and variance of $\sigma^2$. Since $y$ is the sum of this random term and the mean value, $E(y)$, which is a constant, the variance of $y$ at any given value of $x$ is also $\sigma^2$. Therefore, at any given value of $x$, say $x_i$, the dependent variable $y$ follows a normal distribution with a mean of $\beta_0 + \beta_1 x_i$ and a standard deviation of $\sigma$. This is illustrated in the following figure.
The true regression line is usually not known. However, the regression line can be estimated by estimating the coefficients $\beta_1$ and $\beta_0$ for the observed data. The estimates, $\hat{\beta}_1$ and $\hat{\beta}_0$, are calculated using least squares:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where $\bar{y}$ is the mean of all the observed values and $\bar{x}$ is the mean of all values of the predictor variable at which the observations were taken. $\hat{\beta}_1$ is calculated using the first of these relations and $\hat{\beta}_0$ is then calculated using the second.

Once $\hat{\beta}_0$ and $\hat{\beta}_1$ are known, the fitted regression line can be written as:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where $\hat{y}$ is the fitted or estimated value based on the fitted regression model. It is an estimate of the mean value, $E(y)$. The fitted value, $\hat{y}_i$, for a given value of the predictor variable, $x_i$, may be different from the corresponding observed value, $y_i$. The difference between the two values is called the residual, $e_i$:

$$e_i = y_i - \hat{y}_i$$
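To make the estimation procedure concrete, the following minimal sketch computes the least squares estimates, fitted values and residuals with NumPy. It is not part of the original article, and the x and y arrays are hypothetical stand-ins for the temperature and yield data in the preceding table:

```python
import numpy as np

# Hypothetical stand-ins for the temperature (x) and yield (y) data
x = np.array([50.0, 60.0, 70.0, 80.0, 90.0, 100.0])
y = np.array([122.0, 139.0, 155.0, 173.0, 188.0, 205.0])

x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates: beta1_hat = Sxy / Sxx, beta0_hat = y_bar - beta1_hat * x_bar
S_xy = np.sum((x - x_bar) * (y - y_bar))
S_xx = np.sum((x - x_bar) ** 2)
beta1_hat = S_xy / S_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Fitted values and residuals
y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat

print(f"beta0_hat = {beta0_hat:.4f}, beta1_hat = {beta1_hat:.4f}")
```

Later sketches in this chapter reuse the names defined here (x, y, y_hat, residuals, S_xx, beta0_hat, beta1_hat).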
The least squares estimates of the regression coefficients can be obtained for the data in the preceding table as follows:
Once the fitted regression line is known, the fitted value of $y$ corresponding to any observed data point can be calculated. For example, the fitted value corresponding to the 21st observation in the preceding table is:

$$\hat{y}_{21} = \hat{\beta}_0 + \hat{\beta}_1 x_{21}$$

The observed response at this point is $y_{21}$. Therefore, the residual at this point is:

$$e_{21} = y_{21} - \hat{y}_{21}$$
In DOE folios, fitted values and residuals can be calculated. The values are shown in the figure below.
t Tests
The $t$ tests are used to conduct hypothesis tests on the regression coefficients obtained in simple linear regression. A statistic based on the $t$ distribution is used to test the two-sided hypothesis that the true slope, $\beta_1$, equals some constant value, $\beta_{1,0}$. The statements for the hypothesis test are expressed as:

$$H_0: \beta_1 = \beta_{1,0}$$
$$H_1: \beta_1 \neq \beta_{1,0}$$

The test statistic, $T_0$, follows a $t$ distribution with $(n - 2)$ degrees of freedom, where $n$ is the total number of observations. The null hypothesis, $H_0$, is accepted if the calculated value of the test statistic is such that:

$$-t_{\alpha/2, n-2} < T_0 < t_{\alpha/2, n-2}$$

where $-t_{\alpha/2, n-2}$ and $t_{\alpha/2, n-2}$ are the critical values for the two-sided hypothesis. $t_{\alpha/2, n-2}$ is the percentile of the $t$ distribution corresponding to a cumulative probability of $(1 - \alpha/2)$ and $\alpha$ is the significance level.
If the value of $\beta_{1,0}$ used is zero, then the hypothesis tests for the significance of regression. In other words, the test indicates if the fitted regression model is of value in explaining variations in the observations or if you are trying to impose a regression model when no true relationship exists between $x$ and $y$. Failure to reject $H_0: \beta_1 = 0$ implies that no linear relationship exists between $x$ and $y$.
This result may be obtained when the scatter plots of $y$ against $x$ are as shown in (a) and (b) of the following figure. (a) represents the case where no model exists for the observed data. In this case you would be trying to fit a regression model to noise or random variation. (b) represents the case where the true relationship between $x$ and $y$ is not linear. (c) and (d) represent the cases when $H_0: \beta_1 = 0$ is rejected, implying that a model does exist between $x$ and $y$. (c) represents the case where the linear model is sufficient, while (d) represents the case where a higher order model may be needed.
The test statistic for this test is:

$$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{se(\hat{\beta}_1)}$$

where $\hat{\beta}_1$ is the least squares estimate of $\beta_1$, and $se(\hat{\beta}_1)$ is its standard error, which is calculated using:

$$se(\hat{\beta}_1) = \sqrt{\frac{MS_E}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
Example
The test for the significance of regression for the data in the preceding table is illustrated in this example. The test is carried out using the $t$ test on the coefficient $\beta_1$. The hypothesis to be tested is $H_0: \beta_1 = 0$. To calculate the statistic to test $H_0$, the estimate, $\hat{\beta}_1$, and the standard error, $se(\hat{\beta}_1)$, are needed. The value of $\hat{\beta}_1$ was obtained in this section. The standard error can be calculated as follows:

Then, the test statistic can be calculated using the following equation:

The $p$ value corresponding to this statistic based on the $t$ distribution with 23 ($n - 2 = 25 - 2 = 23$) degrees of freedom can be obtained as follows:
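The same calculation can be reproduced in code. This sketch continues from the least squares snippet above (reusing x, residuals, S_xx and beta1_hat); the values it prints depend on the hypothetical data used there, not on the table in the article:

```python
import numpy as np
from scipy import stats

# Test H0: beta1 = 0 using T0 = beta1_hat / se(beta1_hat)
n = len(x)                               # total number of observations
MSE = np.sum(residuals ** 2) / (n - 2)   # error mean square, estimates sigma^2
se_beta1 = np.sqrt(MSE / S_xx)           # standard error of the slope estimate

T0 = beta1_hat / se_beta1

# Two-sided p value from the t distribution with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(T0), df=n - 2)
print(f"T0 = {T0:.3f}, p value = {p_value:.4f}")
```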
In Weibull++ DOE folios, information related to the $t$ test is displayed in the Regression Information table as shown in the following figure. In this table the test for $\beta_1$ is displayed in the row for the term Temperature because $\beta_1$ is the coefficient that represents the variable temperature in the regression model. The columns labeled Standard Error, T Value and P Value represent the standard error, the test statistic for the $t$ test and the $p$ value for the test, respectively. These values have been calculated for $\beta_1$ in this example. The Coefficient column represents the estimate of the regression coefficients. The Effect column represents values obtained by multiplying the coefficients by a factor of 2. This value is useful in the case of two level factorial experiments and is explained in Two Level Factorial Experiments. The Low Confidence and High Confidence columns represent the limits of the confidence intervals for the regression coefficients and are explained in Confidence Interval on Regression Coefficients.
Analysis of Variance Approach to Test the Significance of Regression

The analysis of variance (ANOVA) is another method to test for the significance of regression. As the name implies, this approach uses the variance of the observed data to determine if a regression model can be applied to the observed data. The observed variance is partitioned into components that are then used in the test for significance of regression.
Sum of Squares
The total variance (i.e., the variance of all of the observed data) is estimated using the observed data. As mentioned in Statistical Background, the variance of a population can be estimated using the sample variance, which is calculated using the following relationship:

$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1}$$

The quantity in the numerator of the previous equation is called the sum of squares. It is the sum of the squares of the deviations of all the observations, $y_i$, from their mean, $\bar{y}$. In the context of ANOVA this quantity is called the total sum of squares (abbreviated $SS_T$) because it relates to the total variance of the observations. Thus:

$$SS_T = \sum_{i=1}^{n}(y_i - \bar{y})^2$$
When you attempt to fit a regression model to the observations, you are trying to explain some of the variation of the observations using this model. If the regression model is such that the resulting fitted regression line passes through all of the observations, then you would have a "perfect" model (see (a) of the figure below). In this case the model would explain all of the variability of the observations. Therefore, the model sum of squares (also referred to as the regression sum of squares and abbreviated $SS_R$) equals the total sum of squares; i.e., the model explains all of the observed variance:

$$SS_R = SS_T$$

For the perfect model, the regression sum of squares, $SS_R$, equals the total sum of squares, $SS_T$, because all estimated values, $\hat{y}_i$, will equal the corresponding observations, $y_i$. $SS_R$ can be calculated using a relationship similar to the one for obtaining $SS_T$ by replacing $y_i$ with $\hat{y}_i$ in the relationship for $SS_T$. Therefore:

$$SS_R = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$
Based on the preceding discussion of ANOVA, a perfect regression model exists when the fitted regression line passes through all
observed points. However, this is not usually the case, as seen in (b) of the following figure.
The number of degrees of freedom associated with $SS_T$, $dof(SS_T)$, is $(n - 1)$. The total variability of the observed data (i.e., the total sum of squares, $SS_T$) can be written using the portion of the variability explained by the model, $SS_R$, and the portion unexplained by the model, $SS_E$, as:

$$SS_T = SS_R + SS_E$$

The above equation is also referred to as the analysis of variance identity and can be expanded as follows:

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
As mentioned previously, mean squares are obtained by dividing the sum of squares by the respective degrees of freedom. For example, the error mean square, $MS_E$, can be obtained as:

$$MS_E = \frac{SS_E}{dof(SS_E)} = \frac{SS_E}{n - 2}$$

The error mean square is an estimate of the variance, $\sigma^2$, of the random error term, $\epsilon$, and can be written as:

$$\hat{\sigma}^2 = MS_E$$

Similarly, the regression mean square, $MS_R$, can be obtained by dividing the regression sum of squares by the respective degrees of freedom as follows:

$$MS_R = \frac{SS_R}{dof(SS_R)} = \frac{SS_R}{1}$$
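In code, the ANOVA decomposition follows directly from the fitted values. This sketch continues the earlier snippets (reusing y, y_hat and n; the data remain hypothetical):

```python
import numpy as np

# ANOVA identity: SST = SSR + SSE
SST = np.sum((y - y.mean()) ** 2)       # total sum of squares, dof = n - 1
SSR = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares, dof = 1
SSE = np.sum((y - y_hat) ** 2)          # error sum of squares, dof = n - 2

MSR = SSR / 1         # regression mean square
MSE = SSE / (n - 2)   # error mean square, an estimate of sigma^2
print(f"SST = {SST:.2f}, SSR = {SSR:.2f}, SSE = {SSE:.2f}")
```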
F Test
To test the hypothesis $H_0: \beta_1 = 0$, the statistic used is based on the $F$ distribution. It can be shown that if the null hypothesis is true, then the statistic:

$$F_0 = \frac{MS_R}{MS_E}$$

follows the $F$ distribution with 1 degree of freedom in the numerator and $(n - 2)$ degrees of freedom in the denominator. $H_0$ is rejected if the calculated statistic, $F_0$, is such that $F_0 > f_{\alpha, 1, n-2}$, where $f_{\alpha, 1, n-2}$ is the percentile of the $F$ distribution corresponding to a cumulative probability of $(1 - \alpha)$ and $\alpha$ is the significance level.
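The corresponding calculation in code, continuing from the sums of squares sketch above (MSR, MSE and n are reused; this is a hedged sketch, not the DOE folio's implementation):

```python
from scipy import stats

# F test for significance of regression: reject H0: beta1 = 0 when F0
# exceeds the (1 - alpha) percentile of the F(1, n - 2) distribution
F0 = MSR / MSE
p_value_F = stats.f.sf(F0, dfn=1, dfd=n - 2)   # P(F > F0)
print(f"F0 = {F0:.3f}, p value = {p_value_F:.4f}")
```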
Example
The analysis of variance approach to test the significance of regression can be applied to the yield data in the preceding table. To calculate the statistic, $F_0$, for the test, the sums of squares have to be obtained. They can be calculated as shown next. The total sum of squares can be calculated as:

Knowing the sums of squares, the statistic to test $H_0: \beta_1 = 0$ can be calculated as follows:
Assuming that the desired significance level is 0.1, since the $p$ value < 0.1, $H_0: \beta_1 = 0$ is rejected, implying that a relation does exist between temperature and yield for the data in the preceding table. Using this result along with the scatter plot of the above figure, it can be concluded that the relationship between temperature and yield is linear. This result is displayed in the ANOVA table as shown in the following figure. Note that this is the same result that was obtained from the $t$ test in the section t Tests. The ANOVA and Regression Information tables in Weibull++ DOE folios represent two different ways to test for the significance of the regression model. In the case of multiple linear regression models these tables are expanded to allow tests on the individual variables used in the model. This is done using extra sum of squares. Multiple linear regression models and the application of extra sum of squares in the analysis of these models are discussed in Multiple Linear Regression Analysis.
Confidence Intervals in Simple Linear Regression

A 100($1 - \alpha$) percent confidence interval on the mean response at a given value of the predictor variable, $x_0$, is:

$$\hat{y}_0 \pm t_{\alpha/2, n-2}\sqrt{MS_E\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}$$

It can be seen that the width of the confidence interval depends on the value of $x_0$: the interval is narrowest at $x_0 = \bar{x}$ and widens as $|x_0 - \bar{x}|$ increases.
For the data in the preceding table, assume that a new value of the yield is observed after the regression model is fit to the data. This new observation is independent of the observations used to obtain the regression model. If $x_p$ is the level of the temperature at which the new observation was taken, then the estimate for this new value based on the fitted regression model is:

$$\hat{y}_p = \hat{\beta}_0 + \hat{\beta}_1 x_p$$

If a confidence interval needs to be obtained on $\hat{y}_p$, then this interval should include both the error from the fitted model and the error associated with future observations. This is because $\hat{y}_p$ represents the estimate for a value of $y$ that was not used to obtain the regression model. The confidence interval on $\hat{y}_p$ is referred to as the prediction interval. A 100($1 - \alpha$) percent prediction interval on a new observation is obtained as follows:

$$\hat{y}_p \pm t_{\alpha/2, n-2}\sqrt{MS_E\left(1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{S_{xx}}\right)}$$
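As a sketch of how these limits can be computed, the snippet below continues the earlier ones (reusing beta0_hat, beta1_hat, x_bar, S_xx, MSE and n); the new temperature level x_p is hypothetical:

```python
import numpy as np
from scipy import stats

# 95% prediction interval for a new observation at a hypothetical level x_p
x_p = 85.0
y_p_hat = beta0_hat + beta1_hat * x_p   # point estimate for the new value

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
margin = t_crit * np.sqrt(MSE * (1 + 1 / n + (x_p - x_bar) ** 2 / S_xx))
print(f"95% prediction interval: ({y_p_hat - margin:.2f}, {y_p_hat + margin:.2f})")
```

Dropping the leading 1 inside the square root gives the narrower confidence interval on the mean response instead.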
Example

To illustrate the calculation of confidence intervals, the 95% confidence interval on the mean response at a given temperature level, $x_0$, for the data in the preceding table is obtained in this example. A 95% prediction interval is also obtained assuming that a new observation for the yield was made at the same temperature level.

The lower and upper 95% limits on the mean response are 199.95 and 205.2, respectively. The estimated value based on the fitted regression model for the new observation at this level is:
Coefficient of Determination (R-sq)

The coefficient of determination, $R^2$, is a measure of the amount of variability in the data accounted for by the regression model. As mentioned previously, the total variability of the data is measured by the total sum of squares, $SS_T$. The amount of this variability explained by the regression model is the regression sum of squares, $SS_R$. The coefficient of determination is the ratio of the regression sum of squares to the total sum of squares:

$$R^2 = \frac{SS_R}{SS_T}$$
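Continuing the running sketch (SSR and SST come from the ANOVA snippet above, so the printed value reflects the hypothetical data, not the table):

```python
# Coefficient of determination: the fraction of total variability
# explained by the regression model
R2 = SSR / SST
print(f"R-sq = {R2:.4f}")
```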
Therefore, 98% of the variability in the yield data is explained by the regression model, indicating a very good fit of the model. It may appear that larger values of $R^2$ indicate a better fitting regression model. However, $R^2$ should be used cautiously as this is not always the case. The value of $R^2$ increases as more terms are added to the model, even if the new term does not contribute significantly to the model. Therefore, an increase in the value of $R^2$ cannot be taken as a sign that the new model is superior to the older model. Adding a new term may make the regression model worse if the error mean square, $MS_E$, for the new model is larger than the $MS_E$ of the older model, even though the new model will show an increased value of $R^2$. In the results obtained from the DOE folio, $R^2$ is displayed as R-sq under the ANOVA table (as shown in the figure below), which displays the complete analysis sheet for the data in the preceding table.
The other values displayed with $R^2$ are S, R-sq(adj), PRESS and R-sq(pred). These values measure different aspects of the adequacy of the regression model. For example, the value of S is the square root of the error mean square, $\sqrt{MS_E}$, and represents the "standard error of the model." A lower value of S indicates a better fitting model. The values of S, R-sq and R-sq(adj) indicate how well the model fits the observed data. The values of PRESS and R-sq(pred) are indicators of how well the regression model predicts new observations. R-sq(adj), PRESS and R-sq(pred) are explained in Multiple Linear Regression Analysis.
Residual Analysis

In the simple linear regression model the true error terms, $\epsilon_i$, are never known. The residuals, $e_i$, may be thought of as the observed error terms that are similar to the true error terms. Since the true error terms, $\epsilon_i$, are assumed to be normally distributed with a mean of zero and a variance of $\sigma^2$, in a good model the observed error terms (i.e., the residuals, $e_i$) should also follow these assumptions. Thus the residuals in the simple linear regression should be normally distributed with a mean of zero and a constant variance of $\sigma^2$. Residuals are usually plotted against the fitted values, $\hat{y}_i$, against the predictor variable values, $x_i$, and against time or run-order sequence, in addition to the normal probability plot. Plots of residuals are used to check for the following:

1. Residuals follow the normal distribution.
2. Residuals have a constant variance.
3. The regression function is appropriate (the residuals show no systematic pattern).
4. Residuals are independent of one another (no time or run-order effects).
Examples of residual plots are shown in the following figure. (a) is a satisfactory plot with the residuals falling in a horizontal band with no systematic pattern. Such a plot indicates an appropriate regression model. (b) shows residuals falling in a funnel shape. Such a plot indicates an increase in the variance of the residuals, so the assumption of constant variance is violated here. A transformation on $y$ may be helpful in this case (see Transformations). If the residuals follow the pattern of (c) or (d), then this is an indication that the linear regression model is not adequate. Addition of higher order terms to the regression model or a transformation on $x$ or $y$ may be required in such cases. A plot of residuals may also show a pattern as seen in (e), indicating that the residuals increase (or decrease) as the run order sequence or time progresses. This may be due to factors such as operator learning or instrument creep and should be investigated further.
Residual plots for the data of the preceding table are shown in the following figures. The first is the normal probability plot; it can be observed that the residuals follow the normal distribution and the assumption of normality is valid here. In the second figure the residuals are plotted against the fitted values, $\hat{y}_i$, and in the third the residuals are plotted against the run order. Both of these plots show that the 21st observation seems to be an outlier. Further investigations are needed to study the cause of this outlier.
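Diagnostic plots like the ones described above can be produced with matplotlib and scipy; a minimal sketch, reusing residuals, y_hat and n from the earlier snippets:

```python
import matplotlib.pyplot as plt
from scipy import stats

# Three standard residual diagnostics: normal probability plot,
# residuals vs. fitted values, residuals vs. run order
fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

stats.probplot(residuals, dist="norm", plot=axes[0])
axes[0].set_title("Normal probability plot")

axes[1].scatter(y_hat, residuals)
axes[1].axhline(0, linestyle="--")
axes[1].set(xlabel="Fitted value", ylabel="Residual", title="Residuals vs. fitted")

axes[2].plot(range(1, n + 1), residuals, marker="o")
axes[2].axhline(0, linestyle="--")
axes[2].set(xlabel="Run order", ylabel="Residual", title="Residuals vs. run order")

plt.tight_layout()
plt.show()
```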
Lack-of-Fit Test

As mentioned in Analysis of Variance Approach, a perfect regression model results in a fitted line that passes exactly through all observed data points. This perfect model will give us a zero error sum of squares ($SS_E = 0$). Thus, no error exists for the perfect model. However, if you record the response values for the same values of $x$ a second time, in conditions maintained as strictly identical as possible to the first time, the observations from the second time will not all fall along the perfect model. The deviations in the observations recorded the second time constitute the "purely" random variation or noise. The sum of squares due to pure error (abbreviated $SS_{PE}$) quantifies these variations. $SS_{PE}$ is calculated by taking repeated observations at some or all values of $x$ and adding up the squares of the deviations at each level of $x$ using the respective repeated observations at that $x$ value.
Assume that there are $m$ levels of $x$ and repeated observations are taken at each of these levels. The data is collected as shown next:

The sum of squares of the deviations from the mean of the observations at the $i$th level of $x$, $SS_i$, can be calculated as:

$$SS_i = \sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2$$

where $\bar{y}_i$ is the mean of the $n_i$ repeated observations corresponding to $x_i$ ($i = 1, 2, \ldots, m$). The number of degrees of freedom associated with $SS_i$ is $(n_i - 1)$.

The total sum of square deviations (or $SS_{PE}$) for all levels of $x$ can be obtained by summing the deviations for all $x_i$ as shown next:

$$SS_{PE} = \sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2$$

If all $n_i$ are equal to $n_o$ (i.e., $n_o$ repeated observations are taken at all levels of $x$), then $n = m \cdot n_o$ and the degrees of freedom associated with $SS_{PE}$ are:

$$dof(SS_{PE}) = \sum_{i=1}^{m}(n_i - 1) = m(n_o - 1)$$
When repeated observations are used for a perfect regression model, the sum of squares due to pure error, $SS_{PE}$, is also considered as the error sum of squares, $SS_E$. For the case when repeated observations are used with imperfect regression models, there are two components of the error sum of squares, $SS_E$. One portion is the pure error due to the repeated observations. The other portion is the error that represents variation not captured because of the imperfect model. The second portion is termed the sum of squares due to lack-of-fit (abbreviated $SS_{LOF}$) to point to the deficiency in fit due to the departure from the perfect-fit model. Thus, for an imperfect regression model:

$$SS_E = SS_{PE} + SS_{LOF}$$
Knowing $SS_E$ and $SS_{PE}$, the sum of squares due to lack-of-fit, $SS_{LOF}$, can be obtained by subtraction. The degrees of freedom associated with $SS_{LOF}$ can be obtained in a similar manner using subtraction. For the case when repeated observations are taken at all levels of $x$, the number of degrees of freedom associated with $SS_{LOF}$ is:

$$dof(SS_{LOF}) = dof(SS_E) - dof(SS_{PE}) = (n - 2) - (n - m) = m - 2$$
The magnitude of $SS_{LOF}$ or $MS_{LOF}$ will provide an indication of how far the regression model is from the perfect model. An $F$ test exists to examine the lack-of-fit at a particular significance level. The quantity $MS_{LOF}/MS_{PE}$ follows an $F$ distribution with $(m - 2)$ degrees of freedom in the numerator and $(n - m)$ degrees of freedom in the denominator when all $n_i$ equal $n_o$. The test statistic for the lack-of-fit test is:

$$F_0 = \frac{MS_{LOF}}{MS_{PE}} = \frac{SS_{LOF}/(m - 2)}{SS_{PE}/(n - m)}$$

If the calculated value of the statistic exceeds the critical value $f_{\alpha, m-2, n-m}$, it will lead to the rejection of the hypothesis that the model adequately fits the data.
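The following self-contained sketch shows one way to carry out this test in Python; the replicated arrays are hypothetical and stand in for a response measured twice at each level of $x$:

```python
import numpy as np
from scipy import stats

# Hypothetical replicated data: two observations of the response at each level of x
x_rep = np.array([50, 50, 60, 60, 70, 70, 80, 80, 90, 90], dtype=float)
y_rep = np.array([122, 125, 139, 136, 155, 159, 173, 170, 188, 192], dtype=float)

n_total = len(y_rep)
levels = np.unique(x_rep)
m = len(levels)

# Least squares fit to all observations
b1 = (np.sum((x_rep - x_rep.mean()) * (y_rep - y_rep.mean()))
      / np.sum((x_rep - x_rep.mean()) ** 2))
b0 = y_rep.mean() - b1 * x_rep.mean()

# Error sum of squares from the fitted line
SS_E = np.sum((y_rep - (b0 + b1 * x_rep)) ** 2)

# Pure-error sum of squares: squared deviations from the mean at each level
SS_PE = sum(np.sum((y_rep[x_rep == lv] - y_rep[x_rep == lv].mean()) ** 2)
            for lv in levels)

# Lack-of-fit sum of squares by subtraction, then the F statistic
SS_LOF = SS_E - SS_PE
F0 = (SS_LOF / (m - 2)) / (SS_PE / (n_total - m))
p_value = stats.f.sf(F0, dfn=m - 2, dfd=n_total - m)
print(f"F0 = {F0:.3f}, p value = {p_value:.4f}")
```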
Example
Assume that a second set of observations is taken for the yield data of the preceding table. The resulting observations are recorded in the following table. To conduct a lack-of-fit test on this data, the statistic $F_0 = MS_{LOF}/MS_{PE}$ can be calculated as shown next.
Using the fitted values, the error sum of squares, $SS_E$, can be obtained as follows:

The error sum of squares, $SS_E$, can now be split into the sum of squares due to pure error, $SS_{PE}$, and the sum of squares due to lack-of-fit, $SS_{LOF}$. $SS_{PE}$ can be calculated as follows, using the number of levels, $m$, and the number of repeated observations, $n_i$, for this example:

The test statistic for the lack-of-fit test can now be calculated as:

Since $F_0 < f_{\alpha, m-2, n-m}$, we fail to reject the hypothesis that the model adequately fits the data. The $p$ value for this case is:
Transformations
The linear regression model may not be directly applicable to certain data. Non-linearity may be detected from scatter plots, may be known through the underlying theory of the product or process, or may be known from past experience. Transformations on either the predictor variable, $x$, or the response variable, $y$, may often be sufficient to make the linear regression model appropriate for the transformed data. If it is known that the data follows the logarithmic distribution, then a logarithmic transformation on $y$ (i.e., $y^* = \ln(y)$) might be useful. For data following the Poisson distribution, a square root transformation ($y^* = \sqrt{y}$) is generally applicable. Transformations on $y$ may also be applied based on the type of scatter plot obtained from the data. The following figure shows a few such examples.
The Box-Cox method may also be used to automatically identify a suitable power transformation for the data based on the relation:

$$y^{(\lambda)} = y^{\lambda}$$

Here the parameter $\lambda$ is determined using the given data such that $SS_E$ is minimized (details on this method are presented in One Factor Designs).
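For a quick experiment outside the DOE folio, scipy provides a Box-Cox helper. Note that it chooses $\lambda$ by maximizing the log-likelihood rather than by minimizing $SS_E$ directly, so this sketch (with hypothetical data) is an approximation of the idea rather than the method described above:

```python
import numpy as np
from scipy import stats

# Hypothetical positive-valued response data
y = np.array([122.0, 139.0, 155.0, 173.0, 188.0, 205.0])

# boxcox returns the transformed data and the lambda that maximizes
# the Box-Cox log-likelihood
y_transformed, lam = stats.boxcox(y)
print(f"Suggested lambda = {lam:.3f}")
```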