Research Method
• OLS Regression
• Normality Test
Many analysts erroneously use the framework of linear regression (OLS) models to
predict change over time or to extrapolate from present conditions to future conditions.
Extreme caution is needed when interpreting the results of regression models
estimated from time series data. Statisticians and analysts working with time series
uncovered a serious problem with standard analysis techniques applied to such data:
estimating the parameters of an Ordinary Least Squares (OLS) regression can produce
statistically significant results between time series that contain a trend but are
otherwise unrelated, the so-called spurious regression problem.
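As a minimal illustration of this point (a sketch only, not part of the study's analysis; the dataset and variable names are hypothetical), the following SAS code simulates two independent random walks and regresses one on the other. The slope is frequently reported as highly significant even though the two series are unrelated.

/* Hypothetical sketch: spurious regression between two independent random walks */
data walks;
   call streaminit(12345);          /* fix the seed so the run is reproducible */
   do t = 1 to 200;
      x + rand('normal');           /* random walk x: cumulative sum of N(0,1) shocks */
      y + rand('normal');           /* random walk y: independent of x */
      output;
   end;
run;

proc reg data=walks;
   model y = x;                     /* the t-test on x is often "significant" despite no true relationship */
run;
quit;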
Time series datasets differ from ordinary datasets in that their observations are
recorded sequentially over equal time increments (daily, weekly, monthly, quarterly,
annually, and so on).
There are two sets of conditions under which much of the theory is built:
• Stationary process
• Ergodicity
However, the idea of stationarity must be expanded to distinguish two important notions:
strict stationarity and second-order stationarity. Both models and applications can be
developed under each of these conditions, although the models in the latter case
might be considered as only partly specified. In addition, time-series analysis can be
applied where the series are seasonally stationary or non-stationary.
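For reference, a series {y_t} is second-order (weakly) stationary when E[y_t] = μ and Var(y_t) = σ² are constant over time and Cov(y_t, y_{t+k}) depends only on the lag k; strict stationarity requires the entire joint distribution of the series to be unchanged by shifts in time.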
When there is more than one independent variable, multiple linear regression is the
appropriate technique. The linear regression model is a very useful prediction tool,
but it is also strict, requiring several conditions to be met. The first important one
is the sample size. How big? That depends on a number of factors, including the
desired power, the alpha level, the number of predictors, and the expected effect
size. As a rule of thumb, the bigger the sample size, the better the model will be,
if processing time is ignored.
Normality Test
In statistics, normality tests are used to determine whether a random variable is
normally distributed.
Tests are usually evaluated and compared in terms of their power. The power of a
test is the probability that the test rejects the null hypothesis when it is not true.
A test is said to be more powerful than another when it has a higher probability of
detecting that the null hypothesis is false. In the specific case of tests of
normality, one test is said to be more powerful than another when it has a higher
probability of rejecting the hypothesis of normality when the distribution is not in
fact normal. Of course, to make a fair comparison, all tests must have the same
probability of rejecting the null hypothesis when the distribution truly is normal
(i.e. they must have the same α, or significance level). Such comparisons have been
carried out for a wide variety of distributions.
b) The normal distribution has (Pearson's) skewness and kurtosis equal to 0 and 3
respectively.
People who have defined tests for normality have focused on one of these
characteristics. Tests for normality differ in terms of:
d) the distribution of the test statistic (some use a common distribution such
as the chi-square or the normal, while others have ad-hoc distributions)
The two most popular EDF (empirical distribution function) tests are the one defined
by Kolmogorov (1933), which summarizes the comparison through the maximum difference,
and the one by Anderson and Darling (1954), which combines all the differences. The
latter is generally the more powerful of the two.
[Figure: empirical distribution function F(x) plotted against x.]
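As a sketch of how such EDF tests can be obtained in practice (not part of this study's analysis; the dataset and variable names are hypothetical), PROC UNIVARIATE with the NORMAL option prints the Kolmogorov-Smirnov and Anderson-Darling statistics, among others:

/* Hypothetical sketch: request EDF-based tests for normality of a variable x.   */
/* The NORMAL option prints Kolmogorov-Smirnov, Cramer-von Mises, Anderson-      */
/* Darling and (for moderate sample sizes) Shapiro-Wilk statistics.              */
proc univariate data=mydata normal;
   var x;
run;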
b) Skewness and Kurtosis tests
There are several ways of measuring skewness and kurtosis, but the best known are
Pearson's (1905) skewness and kurtosis, and many tests have been defined using
Pearson's skewness and kurtosis statistics. For the normal distribution, Pearson's
skewness equals 0 and Pearson's kurtosis equals 3. These tests therefore calculate
the skewness and kurtosis of the sample and compare them to the values 0 and 3. A
statistic is calculated based on that comparison and a p-value is found using a given
distribution. The most recent test based on Pearson's skewness and kurtosis can be
found in D'Agostino et al. (1990); it involves complicated transformations and uses
the chi-square distribution to find the p-values.
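As a small sketch (hypothetical dataset and variable names), the sample skewness and kurtosis can be obtained in SAS with PROC MEANS; note that SAS reports excess kurtosis, so a value near 0 rather than 3 is expected under normality:

/* Hypothetical sketch: sample skewness and kurtosis.                      */
/* SAS prints excess kurtosis, so normal data give values near 0           */
/* (Pearson's kurtosis of 3 corresponds to an excess kurtosis of 0).       */
proc means data=mydata n mean std skewness kurtosis;
   var x;
run;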
Regression and correlation tests are naturally associated with probability plots,
which can help the trained eye understand why the null hypothesis is rejected.
Regression tests focus on the slope of the line obtained when the order statistics of
the sample are plotted against their expected values under normality. The best known
of the regression tests is the one defined by Shapiro and Wilk (1965).
Correlation tests focus on the strength of the linear relationship between the order
statistics and their expected values under normality, but they are generally not as
powerful as the regression tests.
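A normal probability (Q-Q) plot, to which these regression and correlation tests are closely related, can be produced as sketched below (hypothetical names); points falling close to the reference line support normality:

/* Hypothetical sketch: normal quantile-quantile plot with a fitted reference line */
proc univariate data=mydata;
   qqplot x / normal(mu=est sigma=est);
run;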
e) other special tests
There are some other tests, some of them proposed quite recently, that cannot be
classified into the previous three types; in general they are not more powerful, and
they are somewhat restrictive in their application.
If the purpose of the test is to identify symmetric distributions with high kurtosis,
the skewness-kurtosis tests have higher power. The way in which kurtosis is measured
makes a difference not only to the power of the test for different types of
distributions, but also to the way in which omnibus tests can be defined. Using
kurtosis statistics other than Pearson's kurtosis, it is possible to define simpler
omnibus tests that do not involve complicated transformations of the skewness and
kurtosis statistics and whose distribution is quite close to the chi-square
distribution. These tests do not perform well at detecting distributions with
kurtosis lower than the normal, but they have higher power when the distribution is
more peaked than the normal.
Finding
An example of a time series dataset (the raw data) is shown below. The series runs
from January 1999 to January 2009 (see Appendix).
Time            KLCI (monthly)    International Reserve
1999 Jan.             591.43               106,214.8
     Feb.             542.23               109,023.9
     Mar.             502.82               105,266.3
     Apr.             674.96               108,672.2
     May              743.04               113,144.7
     Jun.             811.10               118,293.5
     Jul.             768.69               120,378.4
     Aug.             767.06               122,874.5
     Sep.             675.45               119,254.5
     Oct.             742.87               114,789.3
     Nov.             734.66               114,572.5
     Dec.             812.33               117,243.5
Each of KLCI and IR is called a series, while the combination of the two variables
YEAR and MONTH represents the sequential, equal time increments. In this study, the
dependent variable is the Kuala Lumpur Composite Index (KLCI) and the independent
variable is the International Reserve (IR).
• Regression Analysis
This section shows an example regression analysis with footnotes explaining the output.
The data consist of two variables, KLCI and IR, covering the period 1999-2009. The
dependent variable is KLCI and the independent variable is IR. The PROC REG procedure
is used to perform the regression analysis. On the MODEL statement, we specify the
regression model that we want to run, with the dependent variable (in this case, KLCI)
on the left of the equals sign and the independent variable on the right-hand side.
/* print the raw data */
proc print;
run;

/* fit the simple linear regression of KLCI on IR */
proc reg;
   model KLCI = IR;
run;

/* scatter plot of KLCI against IR (red diamonds, no connecting line) */
proc gplot;
   plot KLCI*IR;
   symbol i=none v=diamond c=red;
run;
Analysis of Variance

Source              DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                1           4173325        4173325     283.16    <.0001
Residual (Error)   120           1768622          14739
Total              121           5941947
a. Source - This is the source of variance: Model, Residual, and Total. The Total
variance is partitioned into the variance that can be explained by the independent
variable (Model) and the variance that is not explained by it (Residual, sometimes
called Error). Note that the Sums of Squares for the Model and Residual add up to the
Total Sum of Squares, reflecting the fact that the Total variance is partitioned into
Model and Residual variance.
b. DF - These are the degrees of freedom associated with the sources of variance.
The total variance has N - 1 degrees of freedom; in this case there were N = 122
observations, so the DF for Total is 121. The model degrees of freedom equals the
number of estimated coefficients (including the intercept) minus one, K - 1. Here the
model contains the intercept and one independent variable (International Reserve, IR),
so K = 2 and the Model DF is 1. The Residual degrees of freedom is the Total DF minus
the Model DF: 121 - 1 = 120.
c. Sum of Squares - These are the Sums of Squares associated with the three sources
of variance: Total, Model and Residual. These can be computed in many ways.
Conceptually, the formulas can be expressed as

SSTotal    = Σ(y − ȳ)²   (total variability of KLCI around its mean),
SSModel    = Σ(ŷ − ȳ)²   (variability explained by the regression), and
SSResidual = Σ(y − ŷ)²   (variability left unexplained),

where y is the observed KLCI, ȳ is its mean and ŷ is the value predicted by the
regression.
**Note that SS Model / SS Total equals 0.7023 (4173325 / 5941947), the value of
R-Square. This is because R-Square is the proportion of the variance explained by the
independent variable, and hence can be computed as SS Model / SS Total.
d. Mean Square - These are the Mean Squares: the Sums of Squares divided by their
respective DF. For the Model, 4173325 / 1 = 4173325. For the Residual,
1768622 / 120 = 14738.51667, or approximately 14739. These are calculated so that the
F ratio can be formed by dividing the Mean Square Model by the Mean Square Residual,
which tests the significance of the predictor in the model.
e. F Value and Pr > F - The F value is the Mean Square Model (4173325) divided by
the Mean Square Residual (14739), yielding F = 283.16. The p-value associated with
this F value is very small (<0.0001). These values are used to answer the question
"Does the independent variable reliably predict the dependent variable?" The p-value
is compared to the chosen alpha level (typically 0.05) and, if it is smaller, we can
conclude that the independent variable, IR, reliably predicts the dependent variable,
KLCI. In other words, International Reserve (IR) can be used to reliably predict KLCI.
** If the p-value were greater than 0.05, we would say that the independent variable
does not show a statistically significant relationship with the dependent variable,
or that it does not reliably predict the dependent variable.
Note that this is an overall significance test assessing whether the set of
independent variables reliably predicts the dependent variable; it does not address
the ability of any particular independent variable to do so.
f. Root MSE - Root MSE is the standard deviation of the error term, and is the square
root of the Mean Square Residual (or Error). √14739 = 121.40 or √14738.51667 =
121.4023
j. Adj R-Sq - Adjusted R-Square. As predictors are added to the model, each predictor
will explain some of the variance in the dependent variable simply by chance. The
adjusted R-Square attempts to give a more honest estimate of the population R-Square.
Here the value of R-Square is 0.7023, while the Adjusted R-Square is 0.6999. Adjusted
R-Square is computed as 1 − (1 − R²)(N − 1)/(N − k − 1). For this study,
1 − (1 − 0.7023)(122 − 1)/(122 − 1 − 1) = 0.6999.
From this formula we can see that when the number of observations is small and the
number of predictors is large, there will be a much greater difference between
R-Square and Adjusted R-Square, because the ratio (N − 1)/(N − k − 1) will be much
greater than 1. By contrast, when the number of observations is very large compared
to the number of predictors, R-Square and Adjusted R-Square will be much closer,
because the ratio (N − 1)/(N − k − 1) approaches 1.
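As a minimal check of this arithmetic (a sketch only, using the values reported above), the computation can be reproduced in a DATA _NULL_ step:

/* Sketch: reproduce R-Square and Adjusted R-Square from the ANOVA table values */
data _null_;
   ss_model = 4173325;                                /* Model sum of squares     */
   ss_resid = 1768622;                                /* Residual sum of squares  */
   n = 122;                                           /* number of observations   */
   k = 1;                                             /* number of predictors     */
   rsq     = ss_model / (ss_model + ss_resid);        /* 0.7023                   */
   adj_rsq = 1 - (1 - rsq) * (n - 1) / (n - k - 1);   /* 0.6999                   */
   put rsq= 8.4 adj_rsq= 8.4;                         /* write both to the log    */
run;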
Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
k. Variable - This column lists the variables in the model: the intercept and the
predictor (International Reserve, IR). The Intercept represents the constant term,
i.e. the height of the regression line where it crosses the Y axis; in other words,
it is the predicted value of KLCI when IR is 0.
n. Parameter Estimates - These are the values of the regression equation for
predicting the dependent variable from the independent variable. They describe the
relationship between the independent variable and the dependent variable, giving the
predicted change in KLCI for a one-unit increase in the predictor.
IR - The parameter estimate (coefficient) for IR is 0.00189. Hence, for every one-unit
increase in IR we expect an increase of 0.00189 in the KLCI value. This coefficient is
statistically significant.
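For illustration, the fitted slope implies a change of about 0.00189 × ΔIR in the
predicted KLCI: for example, an increase of 10,000 in IR (in the units recorded in the
data) corresponds to a predicted increase of roughly 0.00189 × 10,000 ≈ 18.9 points in
the KLCI.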
Note: For the independent variables which are not significant, the coefficients are
not significantly different from 0, which should be taken into account when
interpreting the coefficients. (See the columns with the t-value and p-value about
testing whether the coefficients are significant).
o. Standard Error - These are the standard errors associated with the coefficients.
The standard error is used to test whether the parameter is significantly different
from 0: dividing the parameter estimate by its standard error gives the t value (see
the columns with t values and p-values). The standard errors can also be used to form
a confidence interval for each parameter (in SAS, confidence limits are printed when
the CLB option is added to the MODEL statement).
p. t Value and Pr > |t| - These columns provide the t value and the two-tailed
p-value used in testing the null hypothesis that the coefficient/parameter equals 0.
With a two-tailed test, each p-value is compared to the preselected value of alpha;
coefficients with p-values less than alpha are statistically significant.
The coefficient for IR is significantly different from 0 at alpha = 0.05 because its
p-value (0.001) is smaller than 0.05.
The constant (Intercept) is also significantly different from 0 at the 0.05 alpha
level; however, a significant intercept is seldom of interest in itself.
Before fitting the regression in SAS, however, we would first like to have a close
look at the relationship between the dependent variable and the independent variable;
the following sample code produces the plot.
proc gplot;
   plot KLCI*IR;
   symbol i=none v=diamond c=red;
run;
This draws a diagram with the independent variable measured along the horizontal axis,
the dependent variable along the vertical axis, and a dot marking each observation.
This is what is called a scatter diagram or scatter plot. The figure below clearly
shows a relationship between KLCI and IR: an increase in IR is associated with an
increase in KLCI. However, this could be misleading, because the data we are dealing
with are time series data, and one of the common violations of the linear regression
assumptions is autocorrelation.
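One common way to check for this in SAS (a sketch only; this diagnostic is not reported in the study) is to request the Durbin-Watson statistic through the DW option on the MODEL statement:

/* Sketch: Durbin-Watson test for first-order autocorrelation in the residuals */
proc reg;
   model KLCI = IR / dw;   /* DW prints the Durbin-Watson d statistic */
run;
quit;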
You should always keep in mind when building a linear regression model that the
assumptions of a linear regression analysis must be met. These assumptions include:
i) the mean of the response variable is linearly related to the value of the predictor
variable,
ii) the error terms are independent of one another,
iii) the error terms for each value of the predictor variable are normally distributed,
and
iv) the error variances for each value of the predictor variable are equal.
Accordingly, we may encounter the following three common problems with regression,
each of which would violate one of these assumptions:
i) correlated errors,
ii) non-constant error variance, and
iii) non-normally distributed errors.
• Normality Test
In this study, we do not carry out a formal normality test using SAS. However, from
the graph we can conclude that there is a positive correlation and that the variable
appears to be approximately normally distributed.
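For completeness, a sketch of how the normality of the regression residuals could be checked in SAS is given below (the output dataset and residual variable names are hypothetical):

/* Sketch: save the OLS residuals and test them for normality */
proc reg;
   model KLCI = IR;
   output out=ols_resid r=ehat;       /* ehat holds the residuals (hypothetical names) */
run;
quit;

proc univariate data=ols_resid normal;
   var ehat;                          /* prints Shapiro-Wilk, Kolmogorov-Smirnov, etc. */
   qqplot ehat / normal(mu=est sigma=est);
run;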
Discussion
In order to make an appropriate judgment about the distributional assumptions, we also
need to look at diagnostic plots, which often provide a picture of the overall
distribution alongside the statistical tests. Combining graphical methods with test
statistics will definitely improve our judgment on the normality of the data. We can
fairly easily automate the whole analysis process based on the results of individual
normality tests; however, it is a challenge, from a programming point of view, to
incorporate such complicated visual assessment into our programs.