
ECONOMETRICS QMT533

ASSESSMENT 2

GROUP MEMBERS:

NAME MATRIC NO.

ADLINA NADHIRAH BINTI SOFIAN 2020960995

IZZAH NAZATUL NAZIHAH BINTI EJAP 2020958097

MUHAMMAD AL-AMIN BIN FAUZI 2020964437

NUR HAMIZAH BINTI MOHD ARIFF 2020980935

SITI NURSOFEA BINTI MOHD RAFLIS 2020988721

LECTURER’S NAME: MADAM ZAITUL ANNA MELISA

GROUP: N4CS2416T2

SUBMISSION DATE: WEEK 12


(a) The nature of the violation of basic regression assumptions
● AUTOCORRELATION
Autocorrelation is a mathematical representation of the degree of similarity
between a sequence of data points observed in successive order over some period
of time and a lagged version of the same sequence over successive time intervals. It is
similar in concept to the correlation between two different sequences of data points,
except that autocorrelation uses the same sequence twice: once in its original form and
once lagged by one or more periods. In other words, autocorrelation occurs when the
residuals are not independent of each other.

1. Inertia
Inertia, or sluggishness, in economic time series is a common reason for autocorrelation:
successive values of a series change only gradually over time. For example, data such as
GNP, production, price indices, employment, and unemployment exhibit business cycles.
Starting at the bottom of a recession, when the economic recovery begins, most of these
series move upward. In this upswing, the value of a series at one point in time is greater
than its previous values, and these successive observations are likely to be interdependent.

2. Specification Bias (Excluded variable)


The residuals may suggest that some variables that were originally in the data but
were not included in the model should be included. This is one case of specification
bias. Often the inclusion of such variables removes the correlation pattern observed
among the residuals. For example, suppose the correct model is

Y_t = β1 + β2 X_2t + β3 X_3t + u_t,

but the model actually run is

Y_t = β1 + β2 X_2t + v_t, where v_t = β3 X_3t + u_t.

The error or disturbance term v_t will then exhibit a systematic pattern, creating false
autocorrelation due to the exclusion of the X_3t variable from the model; the effect of
X_3t is captured by the disturbance v_t.

3. Specification Bias (Incorrect Functional Form)


Autocorrelation can also occur due to mis-specification of the functional form of the
model. Suppose that Y_t is connected to X_2t by a quadratic relation,

Y_t = β1 + β2 X_2t + β3 X_2t² + u_t,

but the equation is wrongly estimated as a straight-line relationship,

Y_t = β1 + β2 X_2t + v_t.

In this case, the error term obtained from the straight-line specification depends on X_2t².
If X_2t is increasing or decreasing over time, the error term will also be increasing or
decreasing over time.

4. Cobweb Phenomenon
The cobweb phenomenon arises because the quantity supplied in period t of many
agricultural commodities depends on their price in period t-1: the decision about how much
to plant in period t is based on the price prevailing in period t-1, since the crop planted in
response to a given price only reaches the market in the following period. Production is
therefore influenced by price with a lag; if the price in period t+1 turns out to be low, the
farmers will produce less in period t+2 than they did in period t+1. Thus the disturbances
in the case of the cobweb phenomenon are not expected to be random; they exhibit a
systematic pattern and cause an autocorrelation problem.

5. Lags
Many times in business and economic research, lagged values of the dependent variable
are used as independent variables. For example, to study the effect of tastes and habits on
consumption in period t, consumption in period t-1 is used as an explanatory variable,
since consumers do not change their consumption habits easily. If the lagged term is not
included in such a consumption function, the resulting error term will reflect a systematic
pattern due to the impact of habits and tastes on current consumption, and there will
therefore be an autocorrelation problem.

6. Manipulation of data
Raw data are often manipulated in empirical analysis. For example, in time-series
regressions that use quarterly data, the data are usually obtained from monthly data by
simply adding the three monthly observations and dividing the sum by 3. This averaging
smooths the data by dampening the fluctuations in the monthly figures, and this
smoothness may itself lead to a systematic pattern in the disturbances, thereby introducing
autocorrelation. Interpolation or extrapolation of data is another example of data
manipulation.
7. Non-Stationarity
It is quite possible that both Y and X are non-stationary, and therefore the error term is
also non-stationary. In this case, the error term will exhibit an autocorrelation problem.

● HETEROSCEDASTICITY

The homoscedasticity assumption states that the variance of the unobserved error u,
conditional on the explanatory variables, is constant (Wooldridge, 2018, p. 285).
Homoscedasticity fails whenever the variance of the unobserved factors changes across
different segments of the population, where the segments are determined by the different
values of the explanatory variables.

1. Heteroscedasticity occurs when the variance of the error term is not the same for all
observations.

● Ordinary Least Squares (OLS) regression assumes that all residuals are
drawn from a population that has a constant variance (homoscedasticity).
● For example, as an individual's income increases, the variability of their
savings tends to increase as well, so the error variance is not constant across income levels.

2. Heteroscedasticity occurs more often in datasets that have a large range
between the largest and smallest observed values.

● For example, a cross-sectional study of income can have a range that
extends from poverty to billionaires. A cross-sectional study of U.S. states can
have low values for Delaware and very high values for California (Frost, 2019).

3. Skewness in the distribution of one or more regressors included in the model.

● This cause relates specifically to the error terms, not to the relationship
between two individual variables such as income and age.

4. Incorrect functional form (for example, a linear rather than a log-linear model) or an
incorrect data transformation.
● If a regression model is consistently accurate when it predicts low values of
the dependent variable but highly inconsistent in its accuracy when it predicts
high values, then the results of the regression should not be trusted.

● MULTICOLLINEARITY

Multicollinearity occurs when the independent variables in a regression model are correlated.
This correlation is concerning because the independent variables should be independent. A
high degree of correlation between variables can cause problems when fitting the model and
analyzing the results. When a researcher or analyst tries to discover how well each
independent variable can be used to predict or understand the dependent variable in a
statistical model, multicollinearity can lead to biased or incorrect conclusions.

In general, when two or more independent variables are related to each other, then at least
one or more variables are redundant because of their tendency to contribute overlapping
information. Although it is possible to obtain least-square estimates when multicollinearity
occurs, the use of statistical testing techniques and interpretation of the regression
coefficients become troublesome. This is because it's difficult to say which variable has the
most impact on the dependent variable. As a result, multicollinearity is mostly a data problem
rather than a model problem.

The causes of multicollinearity include:

● Data-based multicollinearity

- Poorly planned experiments, purely observational data, or data-gathering
methods that cannot be modified are all contributing factors. Variables may be
significantly correlated in some circumstances (typically owing to data collected from
purely observational studies), yet there is no error on the part of the researcher. As a
result, experiments should be conducted whenever possible, setting the levels of the
predictor variables in advance. This kind of near-collinearity is mostly a data problem
rather than a model problem.

● Structural multicollinearity:

- It is caused by the researcher when creating new predictor variables.
This type occurs when the researcher creates a model term using other terms; in other
words, it is a byproduct of the specified model rather than being present in the data itself.
For example, if the researcher squares a term X to model curvature, there is clearly a
correlation between X and X².
● Insufficient data.
- In some cases, collecting more data can resolve the issue.
● Dummy variables could be used wrongly.
- For example, the researcher may fail to exclude one category, or add a
dummy variable for every category (e.g. spring, summer, autumn, winter).
● Incorporating a regression variable that is essentially a combination of two
additional factors.
- For example, including “total investment income” when total investment
income = income from stocks and bonds + income from savings interest.
● Including two identical variables.
- For example, weight in pounds and weight in kilos, or investment income
and savings/bond income.

The presence or absence of a multicollinearity problem can be detected by examining the
'collinearity statistics', the Variance Inflation Factor (VIF) and Tolerance, in SPSS output or
in any other analysis tool. The VIF indicates whether an independent variable is strongly
correlated with the other independent variables. There is no hard and fast rule concerning
the cutoff value of the VIF.
(b) Practical consequences of the violation of basic regression assumptions
● AUTOCORRELATION
1. When the disturbance terms are serially correlated, the Ordinary Least Squares
(OLS) estimators of the βs are still unbiased and consistent, but the minimum
variance property is not satisfied.

2. The OLS estimators will be inefficient and therefore no longer the best linear
unbiased estimators (BLUE).

3. The estimated variances of the regression coefficients will be biased and inconsistent,
and will be greater than the variances estimated by other methods; therefore hypothesis
testing is no longer valid. In most cases, R² will be overestimated, indicating a better fit
than the one that truly exists, and the t-statistics and F-statistics tend to be inflated.

4. The variance of the random term u may be underestimated if the u's are autocorrelated.
That is, the estimated error variance σ̂² = Σ û_i² / (n − 2) is likely to underestimate the
true σ².

5. Another consequence of autocorrelation is that, if the disturbance terms are
autocorrelated, the OLS estimators of the βs are not asymptotically efficient.

● HETEROSCEDASTICITY
1. The regression parameter estimates are still unbiased, but they are no longer the
most precise (minimum-variance) estimates.

2. The usual formulas for the variances of the coefficients are biased, so significance
tests and confidence intervals based on them are no longer reliable.

3. Prediction of the dependent variable (Y) for a particular value of the independent
variable (X), based on the estimate β̂ from the original data, will be inefficient because
β̂ has a large variance.

4. The Generalized Least Squares (GLS) estimator of β is more efficient than the
Ordinary Least Squares (OLS) estimator in the presence of heteroscedasticity.

5. The unbiasedness of the OLS estimator of β is not eliminated by heteroscedasticity,
but the estimator is no longer the Best Linear Unbiased Estimator (BLUE).

● MULTICOLLINEARITY

Here are the problems that will occur because of multicollinearity:

1. Multicollinearity does not violate Ordinary Least Square (OLS) assumptions.


- OLS estimates are still unbiased and BLUE (Best Linear Unbiased
Estimators). In general, multicollinearity does not prevent us from obtaining a
good fit, nor does it tend to affect inferences about the mean response or
prediction of new observations.
- However, the OLS estimators and their standard errors can be
sensitive to small changes in the data.
2. The t-statistics will generally be very small and the coefficient confidence intervals
very wide, leading to acceptance of the null hypothesis that a coefficient is zero.
- This means that it is harder to reject the null hypothesis.
- Coefficients will appear statistically insignificant.
3. The partial regression coefficient may be an imprecise estimate.
- This may cause the standard errors to be very large.
4. Partial regression coefficients may have sign and/or magnitude changes as they
pass from sample to sample.
5. Multicollinearity makes it difficult to gauge the effect of independent variables on
dependent variables.
- The common interpretation of a regression coefficient, as measuring the
change in the mean of the response variable when the given independent
variable is increased by one unit while all other predictor variables are held
constant, is no longer fully applicable.
(c) Detection of the violation of basic regression assumptions
● AUTOCORRELATION

i) Formal

There are two methods that can be used to detect autocorrelation formally. There is
the Run test and Durbin Watson test.

1. Run test

The Run test, also known as the Geary test, is a nonparametric test. In this test, a run
is defined as an uninterrupted sequence of one symbol (sign), and the length of a run is the
number of elements in it. Too many runs indicate that the residuals change sign frequently,
which points to negative serial correlation. Too few runs indicate that the residuals change
sign infrequently, which points to positive serial correlation.

1) Hypothesis

H0: The residuals are random

H1: The residuals are not random

2) Test statistic:

R = Number of runs

3) Confidence interval:

E(R) ± 1.96 √Var(R)

where

E(R) = 2N1N2/N + 1

Var(R) = 2N1N2(2N1N2 − N) / [N²(N − 1)]

N1 = number of positive (+) residuals
N2 = number of negative (−) residuals
N = number of observations = N1 + N2

4) Decision rule:
Reject H0 if R does not fall in the interval.
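As a rough illustration, the run test can be applied to the signs of the OLS residuals in R.
The sketch below is a minimal example under assumed names (a data frame dat with
response y and regressor x, none of which come from the original assignment); the
runs.test() function from the tseries package expects a two-level factor of signs.

# Hedged sketch of the run (Geary) test on OLS residuals.
# 'dat', 'y' and 'x' are hypothetical names.
library(tseries)                       # provides runs.test()

model <- lm(y ~ x, data = dat)         # fit the regression by OLS
res   <- residuals(model)

signs <- factor(sign(res[res != 0]))   # drop exact zeros, keep the +/- signs
runs.test(signs)                       # H0: the residuals are random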

2. Durbin Watson Statistic

The Durbin-Watson statistic (d statistic) provides a test for first-order autocorrelation only.
It is important to note the assumptions underlying the d statistic:

1. The regression model includes an intercept term. If it is not present, as in a
regression through the origin, it is essential to rerun the regression including
the intercept term to obtain the residual sum of squares (RSS).

2. The explanatory variables are nonstochastic or fixed in the repeated


sampling.

3. The disturbances are generated by the first-order autoregressive


scheme. Therefore, it cannot be used to detect higher-order
autoregressive schemes.

4. The error term is assumed to be normally distributed.

5. The regression model does not include the lagged value(s) of the
dependent variable as one of the explanatory variables.

6. There are no missing observations in the data.

According to the rule of thumb, if the value of Durbin Watson lies between 1.8 and
2.2, there is no autocorrelation. The formula of Durbin Watson statistic is given as
follow:

d = Σ(û_t − û_(t−1))² / Σ û_t²

where û_t denotes the OLS residual in period t. (A minimal R sketch of this test is given
below.)
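The Durbin-Watson test is available in the lmtest package in R; the sketch below is a
minimal example, with dat, y and x as assumed, hypothetical names.

# Hedged sketch of the Durbin-Watson test for first-order autocorrelation.
library(lmtest)                        # provides dwtest()

model <- lm(y ~ x, data = dat)         # 'dat', 'y', 'x' are hypothetical names
dwtest(model)                          # reports the d statistic and a p-value;
                                       # d near 2 suggests no first-order autocorrelation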

ii) Informal

Autocorrelation can also be detected informally through graphs, hence the name graphical
method. A visual examination of the errors usually gives some clues about the likely
presence of autocorrelation. The graph can be drawn simply by plotting the errors against
time in a time sequence plot.
Figure 1

Left : A model with positive autocorrelation

Right : A model with negative autocorrelation

When there is positive autocorrelation, the plot of the errors over time shows a snake-like
shape, as on the left. When there is negative autocorrelation, the connected plot shows a
zigzag pattern, as on the right.

Besides, there is another way to detect autocorrelation, which is by using a


correlogram. A correlogram shows the correlation of a series of data with itself. It is
also known as an autocorrelation plot and an Autocorrelation Function (ACF). The
lag refers to the order of the correlation.

Figure 2 : Example of time series analysis data


Figure 3 : Correlogram for the example of data

The correlogram in Figure 3 is for the data shown in Figure 2. At lag 0 the correlation is 1,
meaning that the data are perfectly correlated with themselves. At lag 1 the correlation is
around 0.5. There are also negative correlations at lags 3, 4 and 5. (A minimal R sketch for
producing such a correlogram from the residuals is given below.)
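A correlogram of the residuals can be produced directly in base R with the acf() function;
the fitted model object and the names dat, y and x below are assumptions for illustration.

# Hedged sketch: autocorrelation function (correlogram) of the OLS residuals.
model <- lm(y ~ x, data = dat)         # hypothetical names
acf(residuals(model),                  # the lag-0 correlation is always 1
    main = "Correlogram of residuals") # spikes outside the dashed bands suggest autocorrelation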

● HETEROSCEDASTICITY

Heteroscedasticity occurs when the variance of the disturbance term differs across
observations. There are a few ways to detect heteroscedasticity problems. The hypotheses
for all of the formal tests for heteroscedasticity are as below:

H0: There is no heteroscedasticity in the error variance

H1: There is heteroscedasticity present in the error variance

(Formal way)

1. Park Test

The Park test can be used when there is some variable Z (often one of the regressors X)
that might explain the differing variances of the residuals. Park suggests the functional form

σ_i² = σ² X_i^β e^(v_i), or, in logs, ln σ_i² = ln σ² + β ln X_i + v_i,

where v_i is the stochastic disturbance term. Since σ_i² is generally not known, Park
suggests using û_i² as a proxy and running the regression

ln û_i² = α + β ln X_i + v_i.

Test procedure (see the R sketch below):

Step 1: Run the ordinary least squares (OLS) regression Y_i = β0 + β1 X_i + u_i to obtain
the residuals û_i.

Step 2: Run the regression of ln û_i² on ln X_i to obtain β̂.

Step 3: Conduct a t-test on β̂. If t = β̂ / s.e.(β̂) exceeds the critical value of the t
distribution with (n − k) degrees of freedom (or the p-value is small), reject H0 and
conclude that heteroscedasticity is present.
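The Park regression of Step 2 can be run by hand in R as in the sketch below; x is the
regressor suspected of driving the error variance, and dat, y and x are assumed names.

# Hedged sketch of the Park test (assumes X > 0 so that log(x) is defined).
model   <- lm(y ~ x, data = dat)             # Step 1: OLS regression (hypothetical names)
dat$u2  <- residuals(model)^2                # squared residuals as a proxy for sigma_i^2

park    <- lm(log(u2) ~ log(x), data = dat)  # Step 2: ln(u_i^2) on ln(X_i)
summary(park)                                # Step 3: a significant slope indicates heteroscedasticity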

2. Glejser Test

The Glejser test checks whether the size of the random error increases proportionately
with changes in one or more exogenous variables. The test is carried out by regressing the
absolute values of the ordinary least squares residuals from the main regression equation
on the variable(s) in question, for example

|û_i| = β0 + β1 X_i + v_i.

Test procedure (see the R sketch below):

Step 1: Run the ordinary least squares (OLS) regression Y_i = β0 + β1 X_i + u_i to obtain
the residuals û_i.

Step 2: Regress the absolute residuals |û_i| on X_i to obtain β̂1.

Step 3: Conduct a t-test on β̂1. If t = β̂1 / s.e.(β̂1) is statistically significant (small
p-value), reject H0 and conclude that heteroscedasticity is present.
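The Glejser regression can likewise be run by hand in R; the sketch below uses the same
hypothetical names dat, y and x.

# Hedged sketch of the Glejser test (absolute residuals regressed on X).
model   <- lm(y ~ x, data = dat)                       # Step 1: OLS regression
glejser <- lm(abs(residuals(model)) ~ x, data = dat)   # Step 2: |u_i| on X_i
summary(glejser)                                       # Step 3: a significant slope on x
                                                       # indicates heteroscedasticity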

3. Spearman's Rank Correlation Test

The Spearman rank-order correlation coefficient (also known as Spearman's correlation) is
a nonparametric measure of the strength and direction of the relationship between two
variables measured on at least an ordinal scale. It is computed as

r_s = 1 − 6 Σ d_i² / [n(n² − 1)],

where

d_i = difference between the ranks assigned to the two different characteristics of the
i-th individual

n = number of individuals ranked.

Test procedure (see the R sketch below):

Step 1: Run the ordinary least squares (OLS) regression Y_i = β0 + β1 X_i + u_i to obtain
the residuals û_i.

Step 2: Rank both |û_i| and X_i in ascending or descending order and compute the
Spearman rank correlation coefficient r_s between them.

Step 3: Assuming that the population rank correlation coefficient is zero and that n > 8, the
significance of the sample r_s can be assessed with the t-test

t = r_s √(n − 2) / √(1 − r_s²),

which follows a t distribution with (n − 2) degrees of freedom. A significant r_s indicates
heteroscedasticity.
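In R, base cor.test() computes the Spearman correlation and its p-value directly; the sketch
below again uses the hypothetical names dat, y and x.

# Hedged sketch of Spearman's rank correlation test between |residuals| and X.
model <- lm(y ~ x, data = dat)                 # hypothetical names
cor.test(abs(residuals(model)), dat$x,
         method = "spearman")                  # reports r_s and a p-value; a significant
                                               # r_s suggests heteroscedasticity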

4. Goldfeld-Quandt Test

The Goldfeld-Quandt test checks for homoscedasticity in regression analysis. It
accomplishes this by separating the dataset into two portions or groups, which is why the
test is frequently referred to as a two-group test. Suppose σ_i² is positively related to X_i as

σ_i² = σ² X_i²,

where σ² is a constant.

Test procedure (see the R sketch below):

Step 1: Sort the observations by the value of X, starting with the smallest.

Step 2: Omit c central observations and separate the remaining (n − c) observations into
two groups of (n − c)/2 observations each.

Step 3: Run separate ordinary least squares (OLS) regressions for group 1 and group 2
and obtain the residual sums of squares RSS1 (for the small-X group) and RSS2 (for the
large-X group).

Step 4: Compute the test statistic λ = (RSS2 / df) / (RSS1 / df), which follows an F
distribution with df = (n − c)/2 − k degrees of freedom in both the numerator and the
denominator; reject H0 if λ exceeds the critical F value.
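The lmtest package automates these steps with gqtest(); the sketch below orders the
observations by x and drops a central fraction of them, with dat, y and x as assumed names.

# Hedged sketch of the Goldfeld-Quandt test using lmtest::gqtest().
library(lmtest)

model <- lm(y ~ x, data = dat)         # hypothetical names
gqtest(model,
       order.by = ~ x,                 # Step 1: order the observations by x
       fraction = 0.2,                 # Step 2: drop the central 20% of observations
       data = dat)                     # Steps 3-4: F test on RSS2/RSS1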

5. Breusch-Pagan-Godfrey Test

The Breusch-Pagan-Godfrey (BPG) test is used to test for heteroskedasticity in a linear
regression model; it was proposed by Breusch and Pagan and, independently and with
some extensions, by Godfrey. Assume that the error variance σ_i² can be described as a
function of non-stochastic variables Z (some or all of which may be the X's):

σ_i² = f(α1 + α2 Z_2i + ... + αm Z_mi).

Test procedure (see the R sketch below):

Step 1: Run the ordinary least squares (OLS) regression Y_i = β0 + β1 X_i + u_i to obtain
the residuals û_i.

Step 2: Calculate σ̃² = Σ û_i² / n.

Step 3: Construct the variables p_i defined as p_i = û_i² / σ̃², which is simply each squared
residual divided by σ̃².

Step 4: Regress p_i on the Z's,

p_i = α1 + α2 Z_2i + ... + αm Z_mi + v_i,

where v_i is the residual term of this regression, and obtain the explained sum of squares
(ESS).

Step 5: Compute Θ = ESS / 2. Under H0, Θ asymptotically follows a chi-square distribution
with (m − 1) degrees of freedom; reject H0 if Θ exceeds the critical chi-square value.
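In R, the bptest() function in the lmtest package carries out this test; the sketch below uses
the hypothetical names dat, y and x.

# Hedged sketch of the Breusch-Pagan-Godfrey test using lmtest::bptest().
library(lmtest)

model <- lm(y ~ x, data = dat)         # hypothetical names
bptest(model, studentize = FALSE)      # studentize = FALSE gives the original BPG statistic;
                                       # the default (TRUE) is Koenker's studentized version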

6. White's General Heteroscedasticity Test

In regression analysis, White's test is used to look for heteroscedastic ("differently
distributed") errors. It is a generalization of the simpler Breusch-Pagan test that does not
depend on the normality assumption. Let the model be Y_i = β1 + β2 X_2i + β3 X_3i + u_i.

Test procedure (see the R sketch below):

Step 1: Run the ordinary least squares (OLS) regression to obtain the residuals û_i.

Step 2: Run the auxiliary regression of the squared residuals on the original regressors,
their squares, and their cross products:

û_i² = α1 + α2 X_2i + α3 X_3i + α4 X_2i² + α5 X_3i² + α6 X_2i X_3i + v_i,

and obtain R² from this auxiliary regression.

Step 3: Calculate the test statistic n × R². Asymptotically, it follows a chi-square distribution
with degrees of freedom equal to the number of regressors (excluding the constant) in the
auxiliary regression. If n × R² exceeds the critical chi-square value, reject H0.
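One way to run White's test in R is to supply the auxiliary regressors (levels, squares and
cross products) explicitly to lmtest::bptest(); the sketch below assumes a model with two
regressors x1 and x2 in a hypothetical data frame dat.

# Hedged sketch of White's general heteroscedasticity test via bptest()
# with an explicit auxiliary (variance) formula.
library(lmtest)

model <- lm(y ~ x1 + x2, data = dat)   # hypothetical names
bptest(model,
       ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2,
       data = dat)                     # the statistic is n * R^2 of the auxiliary regression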

(Informal way)

1. Graphical Method: Residual vs Fitted value

If the points are distributed randomly about zero without any systematic pattern, this
indicates that there is no heteroscedasticity in the data, as shown in Figure 4 below.

Figure 4

If the residuals show a systematic pattern, heteroscedasticity is present. The panels of the
figure illustrate typical patterns:

a) The pattern is not linear and its nature is unknown.

b) The spread of the residuals increases linearly, indicating the presence of
heteroscedasticity.

c) Heteroscedasticity with a quadratic relationship.

d) A quadratic relationship.

(A minimal R sketch of the residuals-versus-fitted-values plot is given below.)
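The residuals-versus-fitted-values plot can be drawn in base R as in the sketch below,
again under the hypothetical names dat, y and x.

# Hedged sketch of the informal residuals-versus-fitted-values check.
model <- lm(y ~ x, data = dat)         # hypothetical names
plot(fitted(model), residuals(model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                 # a random scatter around zero suggests homoscedasticity;
                                       # a funnel or curved pattern suggests heteroscedasticity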

● MULTICOLLINEARITY
Formal

Multicollinearity arises from various causes and can be addressed once it has been
detected. There are a few ways of detecting multicollinearity.

1. High R² but few significant t-statistics

If the value of R² is high, say more than 0.8, the F test will in most cases reject the
hypothesis that the partial slope coefficients are simultaneously equal to zero, yet the
individual t-tests will show that none or few of the partial slope coefficients are statistically
different from zero.

2. T-ratios and F statistics

Check the significance of each individual coefficient and the significance of the overall
model. If none of the individual coefficients is significant but the overall model is
significant, it can be concluded that multicollinearity exists.

3. Tolerance and Variance Inflation Factor (VIF)

A formal method of detecting the presence of multicollinearity is to check the variance
inflation factor (VIF). This measures how much the variance of an estimated regression
coefficient is inflated compared to when the predictor variables are not linearly related.
Let R_j² be the R-squared value obtained when the j-th predictor variable is regressed on
the remaining predictor variables in the model. As R_j² increases toward unity, i.e., as the
collinearity of X_j with the other predictors increases, the VIF also increases and, in the
limit, it can be infinite. However, high multicollinearity as measured by a high VIF may not
necessarily cause high standard errors.

The VIF measures how much of the variation in one variable is explained by the other
predictor variables. It is calculated for each predictor variable as

VIF_j = 1 / (1 − R_j²).

A VIF value greater than 10 indicates the presence of multicollinearity. Alternatively, the
tolerance statistic (TOL), which is the reciprocal of the VIF,

TOL_j = 1 − R_j² = 1 / VIF_j,

can be used as a measure of multicollinearity; a tolerance value of less than 0.2 indicates
multicollinearity. (See the R sketch below.)
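In R, the vif() function in the car package returns the VIF for each predictor, and the
tolerance is its reciprocal; the model and variable names below are assumptions.

# Hedged sketch: VIF and tolerance for each predictor using car::vif().
library(car)

model <- lm(y ~ x1 + x2 + x3, data = dat)   # hypothetical names
vif_values <- vif(model)               # VIF_j = 1 / (1 - R_j^2)
tol_values <- 1 / vif_values           # tolerance is the reciprocal of the VIF
vif_values
tol_values                             # VIF > 10 or TOL < 0.2 signals multicollinearity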

4. High pairwise correlations among regressors.

If the pair-wise or zero-order correlation coefficient between two regressors is high, say in
excess of 0.8, then multicollinearity is a serious problem. The problem with this criterion is
that, although high zero-order correlations may suggest collinearity, it is not necessary that
they be high to have collinearity in any specific case.

Informal (graph/table/diagram)

Informal ways of detecting multicollinearity include scatter plots and correlation matrices.
A scatter plot can show the type of relationship between the independent variables; if it
shows a linear relationship between two independent variables, this indicates the presence
of multicollinearity. It is good practice to use scatter plots to see how the various variables
in a regression model are related.
Figure 5: Scatter Plot

Figure above shows the scatter plot of blood pressure, age, weight and body surface area.

This figure below shows a correlation of blood pressure, age, weight, body surface area,
duration of hypertension, basal pulse and stress index.

Figure 6: Correlation Matrix

As we can see, there is a strong correlation between body surface area (BSA) and weight,
where r = 0.875. Moreover, weight and pulse are fairly correlated, since r = 0.659. Lastly,
none of the pairwise correlations among age, weight, duration and stress are particularly
strong, since r is less than 0.40 in each case. (A minimal R sketch for producing such
scatter-plot and correlation matrices is given below.)
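A scatter-plot matrix and a correlation matrix of the predictors can be produced with base R,
as in the sketch below; dat and the column names are assumptions for illustration.

# Hedged sketch of the informal checks: scatter-plot matrix and correlation matrix.
predictors <- dat[, c("x1", "x2", "x3")]   # hypothetical data frame and columns
pairs(predictors)                      # scatter-plot matrix, as in Figure 5
round(cor(predictors), 3)              # pairwise correlations, as in Figure 6;
                                       # |r| > 0.8 suggests a serious collinearity problem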
(d) Remedial measures
● AUTOCORRELATION
When autocorrelation error terms are found to be present:

1. Investigate the omission of a key predictor variable.

● If adding such a predictor does not reduce or eliminate the
autocorrelation of the error terms, then certain transformations of the
variables can be performed.

2. Try to find out if the autocorrelation is pure autocorrelation and not the result
of misspecification of the model.

3. In large samples, we can use the Newey-West method to obtain standard
errors of the Ordinary Least Squares (OLS) estimators that are corrected for
autocorrelation.

4. In some situations, we can continue to use the OLS method.

● HETEROSCEDASTICITY
The unbiasedness and consistency qualities of the Ordinary Least Square (OLS) estimator
are not destroyed by heteroscedasticity, since it remains unbiased and consistent in the
existence of heteroscedasticity. However, they are no longer efficient, even asymptotically.
The standard hypothesis testing approach is suspect due to its inefficiency. As a result,
some heteroscedasticity remedial actions should be implemented. There are two
approaches to remediation: one when the variance is known and the other when the
variance is unknown.

Assume the simple linear regression model is 𝑌𝑖 = α + β𝑥𝑖 + µ𝑖.

i) Variance is known.

Heteroscedasticity is present if V(µ_i) = σ_i², i.e., the error variance differs across
observations. Weighted least squares (WLS), which is a special case of Generalized Least
Squares (GLS), can be used to correct heteroscedasticity when the variance values are
given. WLS is simply the OLS technique of estimation applied to the transformed model.
When heteroscedasticity is identified using a statistical test, the best option is to transform
the original model so that the modified disturbance term has constant variance; the
transformed model is nothing more than a modification of the data. Dividing the model
through by σ_i gives the transformed error term µ_i* = µ_i / σ_i, which, like a
homoscedastic error, has a constant variance:

V(µ_i*) = V(µ_i / σ_i) = (1/σ_i²) Var(µ_i) = (1/σ_i²) σ_i² = 1.

Since the individual error variances are not always known a priori, this method has
limitations. When substantial sample information is available, reasonable estimates of the
genuine error variances can be made and used in place of the true variances. (A minimal
R sketch of WLS is given below.)
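A minimal R sketch of weighted least squares when the error variances σ_i² are assumed
known is given below; lm() with weights 1/σ_i² is equivalent to OLS on the transformed
model above. The names dat, y, x and sigma2 are assumptions for illustration.

# Hedged sketch of weighted least squares with known error variances.
# 'dat', 'y', 'x' and the column 'sigma2' (the known variances) are hypothetical names.
wls <- lm(y ~ x, data = dat, weights = 1 / sigma2)   # weight each case by 1/sigma_i^2
summary(wls)                           # the coefficients are the WLS (GLS) estimates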

ii) Variance is unknown.

● MULTICOLLINEARITY

For multicollinearity, there are several remedial measures for this violation of the basic
regression assumptions.

1. Dropping a variable(s)

When faced with severe multicollinearity, the simplest remedy is to drop one or more of the
correlated independent variables. However, dropping variable(s) may cause a loss of
information, and the omitted variables can result in biased coefficient estimates for the
remaining independent variables.

2. Do nothing

Secondly, leave the model as it is, despite the multicollinearity. The presence of
multicollinearity does not affect the fitted values, provided the predictor variables follow the
same pattern of collinearity as in the data used to fit the model.

3. Increase sample size

Other than that, obtain more data to increase the sample size. It is possible that in another
sample involving the same variables, collinearity may not be as serious as in the first
sample. Increasing the sample size will decrease the standard errors and may attenuate
the collinearity problem.
4. Transform variable(s)

Transformations can be used to correct model inadequacies: the highly correlated
variables can be transformed, for example into first differences of logs.

5. Multivariate statistical techniques


(e) Examples for each violation of the basic regression assumptions
● AUTOCORRELATION

This data set is used to show an example of an autocorrelation problem using R software.
The Crime Rate data set was collected in the United States of America. The variables are
mostly discrete, as they measure populations per 1000; there are some continuous
variables, such as those measuring expenditure. It consists of 14 variables, where the
dependent variable is CrimeRate, and it has 47 observations in total. Based on Table 1, a
regression model is fitted to detect autocorrelation problems.

Table 1 : Crime Rate data set.


Figure 7 : The scripts and command in R studio to check autocorrelation

Based on Figure 7, an autocorrelation problem is detected using the Durbin-Watson test.
According to the rule of thumb, if the value of the Durbin-Watson statistic lies outside the
range 1.8 to 2.2, there is an autocorrelation problem. Since the Durbin-Watson value is
0.9830947, autocorrelation exists in the model. (A hedged sketch of the commands is given
below.)
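Figure 7 is not reproduced here; the sketch below shows commands consistent with the
description. The data import step and file name are assumptions, while CrimeRate is the
dependent variable named in the text.

# Hedged reconstruction of the workflow described for Figure 7.
library(car)                           # provides durbinWatsonTest()

crime <- read.csv("crime_rate.csv")    # hypothetical file name
model <- lm(CrimeRate ~ ., data = crime)   # regress CrimeRate on the other 13 variables
durbinWatsonTest(model)                # the D-W statistic reported in Figure 7 is 0.9830947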

● HETEROSCEDASTICITY

This data set is used to show an example for heteroscedasticity problems using R software.
This Clothing Sales data set was taken in the Netherlands. The data is on the annual sales
of men’s fashion stores. It consists of 8 variables where the dependent variable is Tsales
which refers to annual sales in Dutch guilders and the independent variables are sales per
square meter as sales, gross-profit-margin as margin, number of full-timers as nfull, total
number of hours worked as hours, investment in shop-premises as inv1, investment in
automation as inv2, and sales floor space of the store as ssize. This data set has 392
observations in total. Based on Table 2, a regression model is fitted to detect
heteroscedasticity problems.
Table 2 : Clothing Sales data set.
Figure 8: The scripts and command in R studio to check heteroscedasticity

The Breusch-Pagan-Godfrey test is used to detect heteroscedasticity problems in the data,
at the 5% level of significance. Based on Figure 8, the data are fitted to a regression model
with tsales as the response variable and the other 7 variables as the predictors. Next, the
coefficient values of the model are examined, and the p-value is obtained by running the
Breusch-Pagan test. The hypothesis test to detect heteroscedasticity is as shown below:
H0: There is no heteroscedasticity in the error variance

H1: There is heteroscedasticity present in the error variance


α = 0.05
P-value < 2.2e-16

Since the p-value (< 2.2e-16) is less than α = 0.05, the null hypothesis is rejected.
Therefore, there is heteroscedasticity present in the error variance. (A hedged sketch of the
commands is given below.)
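Figure 8 is not reproduced here; the sketch below shows commands consistent with the
description. The data import step and file name are assumptions, while the variable names
are those listed in the text.

# Hedged reconstruction of the workflow described for Figure 8.
library(lmtest)                        # provides bptest()

clothing <- read.csv("clothing_sales.csv")   # hypothetical file name
model <- lm(tsales ~ sales + margin + nfull + hours + inv1 + inv2 + ssize,
            data = clothing)
summary(model)                         # coefficient estimates
bptest(model)                          # Breusch-Pagan test; reported p-value < 2.2e-16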

● MULTICOLLINEARITY

This data set is used to show an example for multicollinearity problems using R software.
This GPA and Medical School Admission data set was taken at a midwestern liberal arts
college. It consists of 7 variables where the dependent variable is GPA which refers to the
Grade Point Average of the college students and the independent variables are Verbal
reasoning (subscore) as VR, Physical sciences (subscore) as PS, Writing sample (subscore)
as WS, Biological sciences (subscore) as BS, the score on the MCAT exam, which is the
sum of (VR+PS+WS+BS), as MCAT, and the number of medical schools applied to as Apps.
This data set has 55 observations in total. Based on Table 3, a regression model is fitted to
detect multicollinearity problems.
Table 3 : GPA and Medical School Admission data set.
Figure 9 : The scripts and command in R studio to check multicollinearity problems.

According to Zach (2021), the error in vif.default(model) ("there are aliased coefficients in
the model") occurs when perfect multicollinearity exists in a regression model, i.e., when
two or more predictor variables in the model are perfectly correlated. Figure 9 shows that a
multicollinearity problem exists in this model, because this error occurs when running the
test. (A hedged sketch of the commands is given below.)
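Figure 9 is not reproduced here; the sketch below shows commands consistent with the
description. The data import step and file name are assumptions, while the variable names
are those listed in the text.

# Hedged reconstruction of the workflow described for Figure 9.
library(car)                           # provides vif()

medgpa <- read.csv("gpa_medschool.csv")    # hypothetical file name
model  <- lm(GPA ~ VR + PS + WS + BS + MCAT + Apps, data = medgpa)
vif(model)                             # fails with "there are aliased coefficients in the model"
                                       # because MCAT = VR + PS + WS + BS (perfect collinearity)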
References

Astivia, O. L. O., & Zumbo, B. D. (n.d.). Heteroskedasticity in multiple regression analysis:
What it is, how to detect it and how to solve it with applications in R and SPSS.
ScholarWorks@UMass Amherst. Retrieved January 4, 2022, from
https://scholarworks.umass.edu/pare/vol24/iss1/1/

EduPristine. (2020, February 7). Tutorial on detecting multicollinearity with an example.
Retrieved January 4, 2022, from https://www.edupristine.com/blog/detecting-multicollinearity

Frost, J. (2019, March 15). Heteroscedasticity in regression analysis. Statistics By Jim.
Retrieved January 4, 2022, from
https://statisticsbyjim.com/regression/heteroscedasticity-regression/

Fuqua School of Business, Duke University. (n.d.). Regression diagnostics: Testing the
assumptions of linear regression. Retrieved January 4, 2022, from
https://people.duke.edu/~rnau/testing.htm#independence

IIT Kanpur. (n.d.). Chapter 11: Autocorrelation. Regression Analysis. Retrieved
January 4, 2022, from
http://home.iitk.ac.in/~shalab/regression/Chapter11-Regression-Autocorrelation.pdf

PennState Eberly College of Science. (n.d.). 10.7 - Detecting multicollinearity using variance
inflation factors. STAT 462. Retrieved January 4, 2022, from
https://online.stat.psu.edu/stat462/node/180/

Statistics Solutions. (2021, June 22). Autocorrelation. Retrieved January 4, 2022, from
https://www.statisticssolutions.com/dissertation-resources/autocorrelation/

Statistics Solutions. (2021, August 11). Assumptions of linear regression. Retrieved
January 4, 2022, from
https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/assumptions-of-linear-regression/

Zach. (2021, October 21). How to fix in R: There are aliased coefficients in the model.
Statology. Retrieved January 4, 2022, from
https://www.statology.org/r-aliased-coefficients-in-the-model/