Anova Regression Correlation Analysis A Portfolio of Work in Statistical Techniques With SPSS
Ilias Kalemis
University of Derby
All content following this page was uploaded by Ilias Kalemis on 25 January 2022.
Contents
1 Introduction
2 Regression and correlation analysis techniques
2.1 Regression analysis techniques
2.3 Correlation analysis techniques
2.4 Regression model first execution
2.5 Anova
2.6 Coefficient
2.7 Regression model second execution
2.8 Parametric and non-parametric models
2.9 Normality test
2.10 First T-Test
2.11 Second T-Test
3 Conclusion and purpose of the research
4 References
5 Appendices
6 Datasets
Abstract. This report focuses on statistical techniques that may be applied to any
multivariate dataset. Because it is now easier than ever to collect and analyze data,
data can give clear answers to queries and hypotheses. A standard statistical procedure
involves collecting a dataset and applying techniques such as the mean, standard
deviation, bivariate regression analysis, ANOVA, the chi-square test and correlation,
which can reveal a pattern or a range within which a decision maker can obtain the best
results. These techniques provide helpful insights into the data; however, the final
decision should remain with humans, who must form and communicate a structured opinion
based on the statistical analysis. The report analyzes two different datasets and applies
some of these techniques in order to give readers results, charts and tables that may
lead them to valuable decisions. The results are extracted from SPSS, a statistical
software package that is one of the best tools for these techniques [2].
1 Introduction
This report identifies the main weaknesses that statistical techniques may show when
they are applied to two datasets. The following sections present one multivariate
analysis (regression analysis) and five bivariate analyses: ANOVA, a chi-square test,
correlation, and two t-tests. The first dataset used in this report is Gas Turbine CO
and NOx Emission. The dataset contains 36733 instances of 11 sensor measures aggregated
over one hour, from a gas turbine located in Turkey, for the purpose of studying flue
gas emissions, namely CO and NOx [3]. The second dataset is the AI4I 2020 Predictive
Maintenance Dataset, a synthetic dataset that reflects real predictive maintenance data
encountered in industry [4].
Another technique that is often used in these circumstances is regression, which
involves estimating the best straight line to summarize the association. A regression
analysis model is presented with the CO emissions from gas turbines as the dependent
variable. The independent variables are the ambient temperature (AT), the ambient
pressure (AP, mbar), the ambient humidity (AH, %), the air filter difference pressure
(AFDP, mbar), the gas turbine exhaust pressure (GTEP, mbar), the turbine inlet
temperature (TIT), the turbine after temperature (TAT, C), the compressor discharge
pressure (CDP, mbar) and the turbine energy yield (TEY, MWH) (Table 1. Descriptive
statistics). The data come from the Gas Turbine CO and NOx Emission Data Set in the UCI
Machine Learning Repository. They accompany the article by Heysem Kaya, Pınar Tüfekci
and Erdinç Uzun (2019), which aimed to predict CO and NOx emissions from gas turbines
[3]. The data were collected in an operating range of the turbine between partial load
(75%) and full load (100%). The initial dataset contained 36733 instances and covered
five years, 2011 – 2015. In this study we analyzed the data from the year 2015, which
contained 7384 observations.
The following scatter plots show how each independent variable is related to the
dependent variable. We plotted each independent variable against the dependent
variable. The plots show that, where a relationship between the dependent variable and
an independent variable exists, it does not appear to have a linear form.
In this section we analyze the correlation among the variables of the first dataset.
Correlation is commonly defined as the degree to which two variables are linearly
related, and measuring it is an important step in bivariate data analysis. In the
broadest sense, correlation is any statistical relationship, whether causal or not,
between two random variables in bivariate data. The correlation coefficient is a
statistical measure of the strength of the relationship between the relative movements
of two variables. Its values range between -1.0 and 1.0. A correlation of -1.0 shows a
perfect negative correlation, while a correlation of 1.0 shows a perfect positive
correlation. A correlation of 0.0 shows no linear relationship between the movements of
the two variables. As covered in class, the Pearson correlation [5] can evaluate only a
linear relationship between two continuous variables. A relationship is linear only
when a change in one variable is associated with a proportional change in the other
variable. The Spearman correlation [6], however, can evaluate a monotonic relationship
between two variables, which can be continuous or ordinal, and it is based on the
ranked values of each variable rather than the raw data. Spearman's rho was used to
investigate the relationships among the variables. This nonparametric measure is more
suitable when it is not known whether the normality assumption is met, and because the
linearity assumption is not met according to figures 1 – 8 above [7]. The dependent
variable, carbon monoxide, is correlated negatively or positively with each of the
independent variables. The strongest correlation is with the turbine inlet temperature
(rho = -.553, p < .01) and the weakest is with the ambient temperature (rho = -.063,
p < .01). There is also a statistically significant, strong positive correlation
between the turbine inlet temperature and the turbine energy yield (rho = .958,
p < .01). Because this correlation is very strong, we excluded the turbine energy yield
variable from the regression analysis in order to avoid a multicollinearity problem.
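To make the difference between the two coefficients concrete, the following is a minimal Python sketch (not the SPSS procedure used in this report) that computes Spearman's rho as the Pearson correlation of the ranked values. The data are hypothetical and only illustrate a monotonic but non-linear relationship; the function names are our own.

```python
def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation applied to the ranks."""
    return pearson(midranks(xs), midranks(ys))


# Hypothetical data: y falls monotonically, but not linearly, as x rises.
x = [1, 2, 3, 4, 5, 6]
y = [100.0, 50.0, 33.3, 25.0, 20.0, 16.7]
rho = spearman_rho(x, y)
r = pearson(x, y)
print("Spearman rho:", rho, "Pearson r:", r)
```

On such data the rank-based coefficient reaches its extreme value while the Pearson coefficient does not, which is the reason a rank-based measure is preferred when the scatter plots do not look linear.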
Variable                               AT       AP       AH       AFDP     GTEP     TIT      TAT      CDP      CO       TEY
Ambient temperature (AT)               1,000    -,023*   -,085**  ,159**   ,079**   ,152**   -,139**  ,101**   -,063**  ,127**
Ambient pressure (AP)                  -,023*   1,000    ,063**   -,108**  -,059**  -,099**  -,107**  -,033**  ,148**   -,016
Ambient humidity (AH)                  -,085**  ,063**   1,000    -,150**  -,212**  -,228**  ,057**   -,162**  ,130**   -,170**
Air filter difference pressure (AFDP)  ,159**   -,108**  -,150**  1,000    ,568**   ,764**   -,420**  ,593**   -,467**  ,723**
Gas turbine exhaust pressure (GTEP)    ,079**   -,059**  -,212**  ,568**   1,000    ,758**   -,439**  ,624**   -,388**  ,771**
Turbine inlet temperature (TIT)        ,152**   -,099**  -,228**  ,764**   ,758**   1,000    -,511**  ,773**   -,553**  ,958**
Turbine after temperature (TAT)        -,139**  -,107**  ,057**   -,420**  -,439**  -,511**  1,000    -,457**  ,257**   -,562**
Compressor discharge pressure (CDP)    ,101**   -,033**  -,162**  ,593**   ,624**   ,773**   -,457**  1,000    -,422**  ,778**
Carbon monoxide (CO)                   -,063**  ,148**   ,130**   -,467**  -,388**  -,553**  ,257**   -,422**  1,000    -,494**
Turbine energy yield (TEY)             ,127**   -,016    -,170**  ,723**   ,771**   ,958**   -,562**  ,778**   -,494**  1,000
Table 3 shows the adjusted R square of each model, where the stepwise method was used
to arrive at the most suitable model, and the Durbin-Watson index [8] for the last
model. The adjusted coefficient of determination is equal to 0.373, which means that
the independent variables explain 37.3% of the variability of the dependent variable.
The Durbin-Watson index is equal to 1.495, which is acceptable, since according to
Andy Field values between 1 and 3 are acceptable [9].
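As a reminder of what the Durbin-Watson index measures, the sketch below computes it in Python as the sum of squared differences between successive residuals divided by the sum of squared residuals. The residual series are hypothetical, and the 1-to-3 range check simply encodes the rule of thumb cited above; neither is taken from the SPSS output.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


def within_rule_of_thumb(dw, low=1.0, high=3.0):
    """Rule of thumb cited in the text: values between 1 and 3 are acceptable."""
    return low <= dw <= high


# Hypothetical residuals: alternating signs (negative autocorrelation) ...
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
# ... and long runs of the same sign (positive autocorrelation).
dw_runs = durbin_watson([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

print(dw_alternating, dw_runs, within_rule_of_thumb(1.495))
```

Values near 2 indicate little serial correlation, values well below 2 indicate positive autocorrelation, and values well above 2 indicate negative autocorrelation.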
2.5 Anova
Our next statistical indicator for the dataset is ANOVA. ANOVA (analysis of variance)
is a statistical method that separates the observed variance in the data into
different components to use for additional tests. A one-way ANOVA is used for three or
more groups of data, to gain information about the relationship between the dependent
and independent variables. Table 4 shows that the regression model is statistically
significant, F(5, 7378) = 880.469, p < .001.
Model   Sum of Squares   df   Mean Square   F   Sig.
Table 3. Anova
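The reported F statistic and the adjusted R square of 0.373 can be cross-checked against each other. The Python sketch below recovers R square from F and its degrees of freedom and then applies the usual adjustment; it assumes an ordinary least squares model with an intercept, so that n equals the two degrees of freedom plus one. The function names are our own.

```python
def r2_from_f(f_stat, df_model, df_resid):
    """Invert F = (R2 / df_model) / ((1 - R2) / df_resid) to recover R2."""
    return f_stat * df_model / (f_stat * df_model + df_resid)


def adjusted_r2(r2, n, k):
    """Adjusted R2 for n observations and k predictors (intercept included)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Values reported in the text for the first regression model.
f_stat, df_model, df_resid = 880.469, 5, 7378
n = df_model + df_resid + 1          # assumes a model with an intercept

r2 = r2_from_f(f_stat, df_model, df_resid)
adj = adjusted_r2(r2, n, df_model)
print(n, round(r2, 4), round(adj, 4))
```

Under these assumptions n comes out as the 7384 observations of the 2015 data, and the adjusted value agrees with the reported 0.373 to three decimals.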
2.6 Coefficient
Table 5 shows that the predictor variables are all statistically significant: turbine
inlet temperature (b = -.054, p < .01), ambient temperature (b = .009, p < .01),
turbine after temperature (b = -.013, p < .01), ambient pressure (b = .010, p < .01)
and compressor discharge pressure (b = .008, p < .01). Furthermore, there is no
multicollinearity problem, since all VIFs are below 10. In addition, figures 9 and 10
show that the normality assumption is not met, while the homoscedasticity assumption is
partially met (Figure 11).
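The VIF criterion can be illustrated with a short Python sketch. The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors, and the tolerance is its reciprocal. The example makes a strong simplifying assumption that is not part of the analysis: it treats a model with only two predictors and uses the Spearman rho of .958 between turbine inlet temperature and turbine energy yield as if it were their Pearson correlation.

```python
def vif_from_r2(r2_j):
    """Variance inflation factor of predictor j, given the R2 obtained by
    regressing predictor j on the remaining predictors."""
    return 1.0 / (1.0 - r2_j)


def tolerance_from_r2(r2_j):
    """Tolerance is the reciprocal of the VIF."""
    return 1.0 - r2_j


# Illustrative assumption: with only two predictors, R2_j is the squared
# correlation between them. Here .958 (a Spearman rho in the text) stands in
# for the Pearson correlation between TIT and TEY.
r_tit_tey = 0.958
vif = vif_from_r2(r_tit_tey ** 2)
tol = tolerance_from_r2(r_tit_tey ** 2)
print(round(vif, 2), round(tol, 3))
```

Under that assumption the VIF exceeds the threshold of 10, which is consistent with the decision to leave the turbine energy yield out of the regression.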
Figure 9. Histogram
Coefficients: Model; Unstandardized Coefficients (B, Std. Error); Standardized Coefficients (Beta); t; Sig.; Collinearity Statistics (Tolerance, VIF)
Residuals Statistics
             Minimum    Maximum   Mean     Std. Deviation  N
Residual     -5,79314   7,77733   ,00000   1,38063         7384
5,81162
Table 7 shows that Cook's distance and the leverage are below 1. Based on these
criteria, there is no need to exclude observations from the analysis. However, there
are standardized residuals greater than 3.29 in absolute value. In addition, from the
Mahalanobis distance we identified the multivariate outliers as the cases whose value
of a new probability variable (Probability variable = 1 - CDF.CHISQ(MAH_1, 5)) is less
than .001 (Identifying Multivariate Outliers in SPSS - Statistics Solutions, 2021). We
therefore implemented the following filter (Probability.MAH_1 <= .001 & -3.29 < ZRE_1
<3.29) and 99 outliers were excluded from the analysis. The remaining 7285 observations
were used to repeat the analysis.
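The outlier screening described above can be sketched in Python as follows. The sketch is not the SPSS filter itself: it assumes that cases are kept when the chi-square tail probability of their Mahalanobis distance (5 degrees of freedom, one per predictor) is at least .001 and their standardized residual lies strictly between -3.29 and 3.29, since cases with a probability below .001 are the ones described as outliers. The example cases are hypothetical.

```python
import math


def chi2_sf_df5(x):
    """Upper-tail probability P(X > x) of a chi-square variable with 5 degrees
    of freedom, i.e. 1 - CDF.CHISQ(x, 5), using the closed form for odd df."""
    if x <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(x / 2.0))
    poly = math.sqrt(x) + x ** 1.5 / 3.0
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * poly


def keep_case(mahal_d2, z_resid, p_cut=0.001, z_cut=3.29):
    """Keep a case unless it is a multivariate outlier (p < p_cut) or has a
    standardized residual of at least z_cut in absolute value."""
    return chi2_sf_df5(mahal_d2) >= p_cut and abs(z_resid) < z_cut


# Hypothetical cases: (Mahalanobis distance, standardized residual).
cases = [(3.2, 0.5), (25.0, 0.1), (4.0, -3.6)]
kept = [case for case in cases if keep_case(*case)]
print(kept)
```

For 5 degrees of freedom the probability drops below .001 once the Mahalanobis distance exceeds roughly 20.5, so the second hypothetical case is removed as a multivariate outlier and the third because of its residual.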
Model   Sum of Squares   df   Mean Square   F   Sig.
Table 7. Anova
Table 5 shows that the regression model is statistically significant, F(4, 7280) =
1193.389, p < .001.
Table 9 shows that the predictor variables are all statistically significant (one fewer
than in the previous model): turbine inlet temperature (b = -.057, p < .01), ambient
temperature (b = .009, p < .01), turbine after temperature (b = -.027, p < .01) and
ambient pressure (b = .006, p < .05). Furthermore, there is no multicollinearity
problem, since all VIFs are below 10. In addition, figure 12 shows that the normality
assumption is not met, while the homoscedasticity assumption is partially met (Figure
13). A potential solution is a bootstrap regression analysis. However, this was not
possible even when we reduced the number of bootstrap samples to 100. SPSS 22.0 printed
the following message: “Available memory was exhausted while compiling output. All the
output for this command has been deleted”.
Table 10. Residuals Statistics
Residuals Statistics
             Minimum    Maximum   Mean     Std. Deviation  N
Residual     -5,71335   4,53976   ,00000   1,31030         7259
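The bootstrap regression that could not be completed in SPSS can be illustrated outside SPSS. The following Python sketch shows a case-resampling (pairs) bootstrap with a percentile confidence interval for the slope of a simple regression. It is only an illustration of the idea: the data are synthetic with skewed errors, the model has a single predictor rather than the four used above, and all names and settings (number of resamples, seed) are our own choices.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x for a simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_slope_ci(xs, ys, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    slopes.sort()
    lo = slopes[int((1.0 - level) / 2.0 * n_boot)]
    hi = slopes[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lo, hi


# Synthetic data (hypothetical): true slope 2 with skewed, non-normal errors.
data_rng = random.Random(42)
x = [data_rng.uniform(0.0, 10.0) for _ in range(200)]
y = [2.0 * xi + data_rng.expovariate(1.0) for xi in x]

slope = ols_slope(x, y)
lo, hi = bootstrap_slope_ci(x, y)
print("slope:", slope, "95% percentile CI:", (lo, hi))
```

Because the interval is read off the resampled slopes, it does not rely on normally distributed residuals, which is why the bootstrap was considered a remedy for the violated normality assumption.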
In this section we present the implementation of the parametric ANOVA and t-test and of
the non-parametric chi-square test on a data file retrieved from the UCI Machine
Learning Repository: the AI4I 2020 Predictive Maintenance Dataset [4]. The dataset
consists of 10000 observations. We used the following variables: the product quality
variants (low, medium, high) and machine failure (yes / no) for a chi-square test; the
product quality variants (low, medium, high) and rotational speed [rpm] for an ANOVA
test; machine failure (yes / no) and rotational speed [rpm] for a t-test; and tool wear
failure (yes / no) and rotational speed [rpm] for a second t-test. The variables of the
dataset relevant to this section are the following. The product ID consists of a letter
L, M, or H, for low (50% of all products), medium (30%) and high (20%) product quality
variants, and a variant-specific serial number. The air temperature [K] is generated
using a random walk process, later normalized to a standard deviation of 2 K around
300 K. The process temperature [K] is generated using a random walk process normalized
to a standard deviation of 1 K, added to the air temperature plus 10 K. The rotational
speed [rpm] is calculated from a power of 2860 W, overlaid with normally distributed
noise. The torque [Nm] values are normally distributed around 40 Nm with σ = 10 Nm and
no negative values. For the tool wear [min], the quality variants H/M/L add 5/3/2
minutes of tool wear to the used tool in the process. Finally, a machine failure label
indicates whether the machine has failed in this particular data point because any of
the failure modes is true.
Machine failure
No Yes Total
Table 11 shows that 2.1% of the high variant quality products presented machine
failure, compared to 3.9% of the low variant quality products and 2.8% of the medium
variant quality products. This difference is statistically significant according to
the chi-square test, χ²(2) = 13.752, p = .001 (Table 12). (The Fisher test leads to the
same conclusion, even though the assumptions of the chi-square test are met: no empty
cells and fewer than 20% of the cells with a frequency below 5.) However, the intensity
of the relationship is very low according to Cramer's V index (Table 13).
a. 0 cells (0,0%) have expected count less than 5. The minimum expected count is 34,00.
b. Based on 10000 sampled tables with starting seed 299883525.
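To show how the chi-square statistic and Cramer's V relate, the Python sketch below computes the Pearson chi-square statistic from a contingency table and derives Cramer's V as sqrt(χ² / (N · min(rows - 1, columns - 1))). The 2 x 2 counts are hypothetical, since the cell counts of the crosstabulation are not reproduced here; the last lines apply the formula to the reported χ² = 13.752 with N = 10000 and a 3 x 2 table.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V effect size for an n_rows x n_cols table with n cases."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Hypothetical 2 x 2 counts, only to exercise the chi-square function.
toy_chi2 = chi_square_statistic([[10, 20], [20, 10]])

# Effect size implied by the statistic reported in the text.
v_reported = cramers_v(13.752, 10000, n_rows=3, n_cols=2)
print(round(toy_chi2, 4), round(v_reported, 3))
```

With a sample of 10000 cases even a small association produces a significant chi-square, which is why the effect size, roughly .04 under this calculation, is the more informative figure.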
Table 14 shows that the rotational speed does not follow the normal distribution in any
level of the variant quality product: high, Statistic(1003) = .110, p < .001; low,
Statistic(6000) = .107, p < .001; and medium, Statistic(2997) = .101, p < .001. For
this reason the Kruskal-Wallis test was used.
Columns: Type1; Kolmogorov-Smirnov(a) (Statistic, df, Sig.); Shapiro-Wilk (Statistic, df, Sig.)
Table 15 shows that there is no statistically significant difference among the three
variant quality products in relation to the rotational speed, χ²(2) = .248, p = .883.
Rotational
speed [rpm]
Chi-Square ,248
df 2
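The "Chi-Square" value in the table above is the Kruskal-Wallis H statistic, which is compared with a chi-square distribution with (number of groups - 1) degrees of freedom. A minimal Python sketch of its calculation, including the usual correction for ties, is given below. The three groups are hypothetical and the function names are our own.

```python
def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction for a list of groups."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = midranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * total - 3.0 * (n_total + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    correction = 1.0 - ties / (n_total ** 3 - n_total)
    return h / correction if correction > 0 else 0.0


# Hypothetical rotational speeds for three product variants.
h_separated = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
h_identical = kruskal_wallis_h([[1, 2], [1, 2]])
print(h_separated, h_identical)
```

Groups whose ranks are fully separated give a large H, while groups with the same rank distribution give an H near zero, as with the small value of .248 reported for the three product variants.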
Table 16 shows that the rotational speed does not follow the normal distribution in
either level of machine failure: yes, Statistic(339) = .339, p < .001, and no,
Statistic(9661) = .096, p < .001. In this case the bootstrap method should be used.
However, because the bootstrap method ran into calculation problems and is very
difficult to present, we present the t-test.
Kolmogorov-Smirnov(a)   Shapiro-Wilk
Rotational speed [rpm], equal variances assumed
  Levene's Test for Equality of Variances: F = 295,620; Sig. = ,000
  t-test for Equality of Means: t = 4,423; df = 9998; Sig. (2-tailed) = ,000
  Mean Difference = 43,773; Std. Error Difference = 9,898
  95% Confidence Interval of the Difference: Lower = 24,372; Upper = 63,175
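The entries of the t-test table are linked by simple arithmetic: t is the mean difference divided by its standard error, and the confidence interval is the mean difference plus or minus a critical value times the standard error. The Python sketch below reproduces this from the reported values. It assumes that, with 9998 degrees of freedom, the t critical value can be approximated by the standard normal quantile.

```python
from statistics import NormalDist

# Values reported in the table for rotational speed [rpm].
mean_diff = 43.773
std_error = 9.898

# t statistic: mean difference divided by its standard error.
t_stat = mean_diff / std_error

# Assumption: with df = 9998 the t critical value is practically the
# standard normal quantile for a two-sided 95% interval.
z_crit = NormalDist().inv_cdf(0.975)
ci_lower = mean_diff - z_crit * std_error
ci_upper = mean_diff + z_crit * std_error

print(round(t_stat, 3), round(ci_lower, 2), round(ci_upper, 2))
```

The recomputed values agree with the table up to rounding, which confirms that the reconstructed row is internally consistent.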
Table 18 shows that the rotational speed does not follow the normal distribution in
either level of the tool wear failure: yes, Statistic(46) = .156, p = .007, and no,
Statistic(9954) = .104, p < .001. In this case the bootstrap method could be used.
However, because it is very difficult to present the bootstrap method, we present the
t-test.
Columns: TWF; Kolmogorov-Smirnov(a) (Statistic, df, Sig.); Shapiro-Wilk (Statistic, df, Sig.)
Table 19 shows that the homogeneity of variances assumption is met, F = 1.507, p = .200.
The average rotational speed is higher in the case of tool wear failure than in the
case of no tool wear failure, but the difference is not statistically significant,
t(9998) = .220, p = .299.
Levene's Test for Equality of Variances; t-test for Equality of Means (95% Confidence Interval of the Difference)
Equal variances not assumed
The statistical analysis of these two datasets shows that we can gain a better
understanding of what the models demonstrate. The first dataset can also be used for
predicting the turbine energy yield (TEY) using ambient variables as features. For the
original target, the prediction of CO and NOx emissions from gas turbines, we saw a
statistically significant relation between carbon monoxide and the ambient temperature.
Moreover, because the data were collected in an operating range of the turbine between
partial load (75%) and full load (100%), they are a strong indicator, since the turbine
was operating near the peak of its usage, at 75% of its load or more. The second
dataset of our analysis concerned machine failure, which consists of five independent
failure modes. For the tool wear failure (TWF), the tool will be replaced or fail at a
randomly selected tool wear time between 200 and 240 mins; at this point in time, the
tool is replaced 69 times and fails 51 times. Heat dissipation causes a process failure
if the difference between air and process temperature is below 8.6 K and the tool's
rotational speed is below 1380 rpm; this is the case for 115 data points. For the power
failure (PWF), the product of torque and rotational speed (in rad/s) equals the power
required for the process. If this power is below 3500 W or above 9000 W, the process
fails, which is the case 95 times in our dataset. Finally, for the random failures
(RNF), each process has a chance of 0,1 % to fail regardless of its process parameters.
This is the case for only 5 data points, less than could be expected for 10,000 data
points in the dataset.
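The failure rules summarized above translate directly into code. The Python sketch below encodes the heat dissipation and power failure rules and the expected number of random failures. It is an illustration under stated assumptions: the function and argument names are hypothetical, the temperature difference is taken as process temperature minus air temperature, and the rotational speed is converted from rpm to rad/s before computing the power.

```python
import math


def power_watts(torque_nm, speed_rpm):
    """Power as torque times rotational speed, with the speed converted
    from rpm to rad/s."""
    return torque_nm * speed_rpm * 2.0 * math.pi / 60.0


def power_failure(torque_nm, speed_rpm):
    """PWF rule: the process fails below 3500 W or above 9000 W."""
    power = power_watts(torque_nm, speed_rpm)
    return power < 3500.0 or power > 9000.0


def heat_dissipation_failure(air_temp_k, process_temp_k, speed_rpm):
    """Heat dissipation rule: temperature difference below 8.6 K and
    rotational speed below 1380 rpm (difference assumed process minus air)."""
    return (process_temp_k - air_temp_k) < 8.6 and speed_rpm < 1380.0


# Expected number of random failures at a 0.1% chance over 10,000 data points.
expected_random_failures = 0.001 * 10000

print(round(power_watts(40.0, 1500.0)), power_failure(60.0, 1500.0),
      heat_dissipation_failure(300.0, 308.0, 1300.0), expected_random_failures)
```

The expected count of about 10 random failures is what makes the 5 observed random failures "less than could be expected".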
4 References
1. Ilias Kalemis. A portfolio of work in Statistical Techniques on SPSS software.
2. SPSS software. https://www.ibm.com/analytics/spss-statistics-software, last accessed on
2021/5/20.
3. 1st dataset, from the paper: Heysem Kaya, Pınar Tüfekci and Erdinç Uzun. 'Predicting
CO and NOx emissions from gas turbines: novel data and a benchmark PEMS', Turkish
Journal of Electrical Engineering & Computer Sciences, vol. 27, 2019, pp. 4783-4796.
Heysem Kaya, Department of Information and Computing Sciences, Utrecht University,
3584 CC, Utrecht, The Netherlands, email: h.kaya '@' uu.nl; Pınar Tüfekci, Çorlu
Faculty of Engineering, Namık Kemal University, TR-59860 Çorlu, Tekirdağ, Turkey,
email: ptufekci '@' nku.edu.tr. Link:
https://archive.ics.uci.edu/ml/datasets/Gas+Turbine+CO+and+NOx+Emission+Data+Set,
last accessed on 2021/5/20.
4. 2nd dataset, from the paper: Stephan Matzka, 'Explainable Artificial Intelligence for
Predictive Maintenance Applications', Third International Conference on Artificial
Intelligence for Industries (AI4I 2020), 2020 (in press). Stephan Matzka, School of
Engineering - Technology and Life, Hochschule für Technik und Wirtschaft Berlin,
12459 Berlin, Germany, stephan.matzka '@' htw-berlin.de. Link:
https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset,
last accessed on 2021/5/20.
5. Pearson correlation. https://libguides.library.kent.edu/SPSS/PearsonCorr, last accessed
on 2021/5/20.
6. Spearman correlation. Spearman, C. (1904). "The proof and measurement of association
between two things". American Journal of Psychology. 15 (1): 72–101.
doi:10.2307/1412159. JSTOR 1412159, last accessed on 2021/5/20.
7. Mavuto Mukaka (2012). Statistics corner: A guide to appropriate use of correlation
coefficient in medical research. Malawi medical journal: the journal of Medical
Association.
8. Durbin, J.; Watson, G. S. (1950). "Testing for Serial Correlation in Least Squares
Regression, I". Biometrika. 37 (3–4): 409–428. doi:10.1093/biomet/37.3-4.409.
JSTOR 2332391.
9. Field, A. (2005). Discovering statistics using SPSS (2nd ed.). Sage Publications, Inc.
5 Appendices
Regression & Correlation
DATASET ACTIVATE DataSet1.
GRAPH
/SCATTERPLOT(BIVAR)=AT WITH CO
/MISSING=LISTWISE.
GRAPH
/SCATTERPLOT(BIVAR)=AP WITH CO
/MISSING=LISTWISE.
GRAPH
/SCATTERPLOT(BIVAR)=AH WITH CO
/MISSING=LISTWISE.
GRAPH
/SCATTERPLOT(BIVAR)=AFDP WITH CO
/MISSING=LISTWISE.
GRAPH
/SCATTERPLOT(BIVAR)=GTEP WITH CO
/MISSING=LISTWISE.
GRAPH
/SCATTERPLOT(BIVAR)=TIT WITH CO
/MISSING=LISTWISE.
GRAPH
/SCATTERPLOT(BIVAR)=TAT WITH CO
/MISSING=LISTWISE.
GRAPH
/SCATTERPLOT(BIVAR)=CDP WITH CO
/MISSING=LISTWISE.
NONPAR CORR
/VARIABLES=AT AP AH AFDP GTEP TIT TAT CDP CO TEY
/PRINT=SPEARMAN TWOTAIL NOSIG
/MISSING=PAIRWISE.
DESCRIPTIVES VARIABLES=AT AP AH AFDP GTEP TIT TAT CDP CO TEY
/STATISTICS=MEAN STDDEV MIN MAX.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA COLLIN TOL
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT CO
/METHOD=STEPWISE AT AP AH AFDP GTEP TIT TAT CDP
/SCATTERPLOT=(*ZPRED ,*ZRESID)
/RESIDUALS DURBIN HISTOGRAM(ZRESID) NORMPROB(ZRESID)
/CASEWISE PLOT(ZRESID) OUTLIERS(3)
/SAVE MAHAL COOK LEVER ZRESID SDBETA SDFIT.
FILTER BY filter_$.
EXECUTE.
Chi square
CROSSTABS
/TABLES=Type BY Machinefailure
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ PHI
/CELLS=COUNT ROW SRESID BPROP
/COUNT ROUND CELL
/METHOD=MC CIN(99) SAMPLES(10000).
Anova
EXAMINE VARIABLES=Rotationalspeedrpm BY Type1
/PLOT BOXPLOT STEMLEAF NPPLOT
/COMPARE GROUPS
/STATISTICS DESCRIPTIVES
/CINTERVAL 95
/MISSING LISTWISE
/NOTOTAL.
NPAR TESTS
/K-W=Rotationalspeedrpm BY Type1(1 3)
/MISSING ANALYSIS
/METHOD=MC CIN(99) SAMPLES(10000).
T – test (1st)
EXAMINE VARIABLES=Rotationalspeedrpm BY Machinefailure
/PLOT BOXPLOT STEMLEAF NPPLOT
/COMPARE GROUPS
/STATISTICS DESCRIPTIVES
/CINTERVAL 95
/MISSING LISTWISE
/NOTOTAL.
BOOTSTRAP
/SAMPLING METHOD=SIMPLE
/VARIABLES TARGET=Rotationalspeedrpm
INPUT=Machinefailure
/CRITERIA CILEVEL=95 CITYPE=BCA NSAMPLES=5000
/MISSING USERMISSING=EXCLUDE.
T-TEST GROUPS=Machinefailure(0 1)
/MISSING=ANALYSIS
/VARIABLES=Rotationalspeedrpm
/CRITERIA=CI(.95).
T – test (2nd)
T-TEST GROUPS=TWF(0 1)
/MISSING=ANALYSIS
/VARIABLES=Rotationalspeedrpm
/CRITERIA=CI(.95).
6 Datasets