
Journal of the KIECS, vol. 14, no. 3, pp. 547-552, Jun. 30, 2019
pISSN 1975-8170 | eISSN 2288-2189

Regular paper
http://dx.doi.org/10.13067/JKIECS.2019.14.3.547

Effectiveness of Normalization Pre-Processing of Big Data to the Machine Learning Performance

Jun-Mo Jo*


ABSTRACT

Recently, the massive growth in the scale of data has become a major issue in Big Data. Furthermore, since Big Data also serves as the input of machine learning, it should be preprocessed with normalization to obtain high machine learning performance. The performance varies with many factors, such as the scope of the columns in the Big Data or the normalization preprocessing method. In this paper, various normalization preprocessing methods and scopes of Big Data columns are applied to the SVM (Support Vector Machine) as a machine learning method in order to find an efficient environment for normalization preprocessing. The machine learning experiments were programmed in Python and the Jupyter Notebook.

Keywords
AI, Machine Learning, Big Data, Preprocessing, Performance Evaluation

* Corresponding author: Jun-Mo Jo, Dept. of Electronic Engineering, Tongmyong University, Email: [email protected]
(Received: May 17, 2019, Revised: May 31, 2019, Accepted: Jun. 15, 2019)

Ⅰ. Introduction

The accuracy of convolutional neural networks (CNNs) has been continuously improving, but the computational cost of these networks also increases significantly. For example, the very deep VGG models, which have achieved great success in a wide range of recognition tasks, are substantially slower than earlier models. Real-world systems may suffer from the low speed of these networks: a cloud service needs to process thousands of new requests per second, portable devices such as phones and tablets may not afford slow models, and semantic segmentation needs to apply these models to many high-resolution images. It is therefore very important to accelerate the test-time performance of CNNs [1].

Machine learning has made possible self-driving cars, automatic intelligent web search, user-based speech recognition software, personalized marketing, and so on, so worldwide research is underway at many universities, companies, and research facilities. This research covers supervised and unsupervised learning methods as well as the field of deep learning. Some methods classify given patterns, which is a form of learning from observation: such observation can define a new class or assign a new pattern to an existing class, and this classification yields new theories and knowledge embedded in the input patterns. The learning behavior of a neural network model enhances its classification properties. One related study applied supervised and unsupervised learning and investigated their properties in classifying postgraduate students according to their performance during the admission period [2].

Introducing cognitive reasoning into a conventional computer can solve problems by example mapping, such as pattern recognition, classification, and forecasting. Artificial neural networks (ANNs) provide these types of models; they are essentially mathematical models describing a function. An ANN is characterized by three types of parameters: the interconnection property, such as a feed-forward or recurrent network; the application function, such as a classification model; and the learning rule, such as supervised, unsupervised, or reinforcement methods [3-4].

In this paper, the various known types of normalization preprocessing methods and the importance of the scope or selection of the columns in Big Data are described in Section II. The machine learning algorithm, SVM, and the impact of various normalization methods on the performance of this specific machine learning model are explained in Section III. In Section IV, the training results are compared and analyzed to find the more efficient normalization methods. Finally, the conclusion is given in Section V.

Ⅱ. Normalization Methods

Normalization is a preprocessing method. First of all, preprocessing is not only a technique used to convert raw data into a clean data set but also a way to enhance the performance of machine learning. In other words, whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis and machine learning.

There are several well-known normalization methods, such as Simple Feature Scaling, Min-Max, Z-score, and so on. Firstly, the Simple Feature Scaling method is typically done via the following equation (1):

$x_{new} = \frac{x_{old}}{x_{max}}$    (1)

In the Min-Max method, the data is scaled to a fixed range, usually 0 to 1. The cost of having this bounded range is that the result ends up with smaller standard deviations. The Min-Max scaling equation is as in equation (2):

$x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}}$    (2)

The result of Z-score normalization is that the features are rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean and σ is the standard deviation from the mean. The standard scores of the samples are calculated as in the following equation (3):

$z = \frac{x - \mu}{\sigma}$    (3)

The three normalization methods were applied to the SVM for a performance comparison, and the results did not vary much on this dataset, so the Min-Max normalization method is used in this paper.
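The three formulas can be written compactly in Python, the language used for the experiments in this paper. The following is a minimal sketch, not code from the paper; the column values are illustrative, and the scikit-learn MinMaxScaler and StandardScaler are shown only as equivalents of equations (2) and (3).

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([12.0, 7.5, 3.2, 9.8])   # illustrative column values

# (1) Simple Feature Scaling: x_new = x_old / x_max
simple = x / x.max()

# (2) Min-Max: x_new = (x_old - x_min) / (x_max - x_min), bounded to [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# (3) Z-score: z = (x - mu) / sigma, giving mean 0 and standard deviation 1
z_score = (x - x.mean()) / x.std()

# The same Min-Max and Z-score transforms via the scikit-learn scalers
col = x.reshape(-1, 1)                 # the scalers expect a 2-D array
min_max_sk = MinMaxScaler().fit_transform(col).ravel()
z_score_sk = StandardScaler().fit_transform(col).ravel()

print(simple, min_max, z_score, sep="\n")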
Ⅲ. Selecting Columns for Normalization

3.1 Support Vector Machine (SVM)
The Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data, as in supervised learning, the algorithm outputs an optimal hyperplane which categorizes new examples.
In a related study on deep learning, a decomposition method can be used as a learning method, as shown in Fig. 1: the filter response at a pixel of a layer approximately lies on a low-rank subspace, and the resulting low-rank decomposition reduces the time complexity. To find the approximate low-rank subspace, the reconstruction error of the responses is minimized.

Fig. 1 Illustration of the decomposition. (a) An original layer with complexity O(dk²c). (b) An approximated layer with complexity reduced to O(d′k²c) + O(dd′) [1]

Machine learning methods require fine tuning of the parameters and also a feasible number of data samples. Therefore, choosing the learning algorithm with the best performance is important in the real world. Moreover, good results on a particular data set do not guarantee the same precision and accuracy on another data set whose attributes are logically different [4].
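As a small illustration of the hyperplane idea (a hypothetical toy example, not an experiment from this paper), a linear SVM can be fitted on a few labeled 2-D points and then used to categorize a new example:

import numpy as np
from sklearn.svm import SVC

# Toy labeled training data: two clusters in 2-D (illustrative values)
X = np.array([[1.0, 1.2], [1.5, 0.8], [1.2, 1.0],
              [4.0, 4.2], [4.5, 3.8], [4.2, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")             # separating hyperplane w·x + b = 0
clf.fit(X, y)

print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)
print("new example [2.0, 2.0] ->", clf.predict([[2.0, 2.0]]))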
3.2 Selection of Columns in Big Data
The distribution of certain data can affect the performance of machine learning. Most Big Data used as the input of machine learning is a table consisting of rows and columns. Usually, the data in a column lies within a certain range, which commonly differs from the ranges of the other columns. In order to normalize the column data, we have to figure out the patterns or scopes of the data. Therefore, the selection of the column(s) is very important.
Machine learning distinguishes classes of the input data in order to predict an accurate result. In a related study, for instance, in the presence of full knowledge of the underlying probabilities, Bayes decision theory gives optimal error rates [5-7].
The best performance is decided not only by the recognition ratio but also by the simulation time. However, the key question when dealing with ML classification is not whether one learning algorithm is superior to others, but under which conditions a particular method can significantly outperform others on a given application problem. Meta-learning is moving in this direction, trying to find functions that map data sets to algorithm performance. After a better understanding of the strengths and limitations of
each method, the possibility of integrating two or more algorithms together to solve a problem should be investigated. The objective is to utilize the strengths of one method to complement the weaknesses of another. If we are only interested in the best possible classification accuracy, it might be difficult or impossible to find a single classifier that performs as well as a good ensemble of classifiers [4-5].
Unlike unsupervised learning, supervised learning is a method by which labeled training data is used to train a function that can be generalized to new examples. The training involves a critic that can indicate whether the function is correct or not. Various kinds of such methods exist, such as the decision tree classifier, the KNeighbors classifier, the support vector machine (SVM), and so on [8-10].

Ⅳ. Result and Analysis

To train and predict with the normalized data, the fit(), predict(), and closest() methods supported in Scikit-Learn are used. Table 1 shows the utilities used in the program.

Table 1. Scikit-learn utilities used in training

from sklearn import datasets
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
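A minimal sketch of how these utilities could fit together for the experiment described in this section is given below. It is not the paper's exact program: the selected column indices, the hand-written Min-Max step, the train/test split ratio, the random seed, and the default SVC kernel are assumptions made for illustration.

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target

# Standard deviation of each column, used to pick the columns to normalize
stds = X.std(axis=0)
print("per-column std. dev.:", np.round(stds, 2))

cols = [0, 3, 9, 12]                   # columns compared in this paper
X_sel = X[:, cols]

# Min-Max normalization of the selected columns to the range [0, 1]
X_norm = (X_sel - X_sel.min(axis=0)) / (X_sel.max(axis=0) - X_sel.min(axis=0))

X_train, X_test, y_train, y_test = train_test_split(
    X_norm, y, test_size=0.3, random_state=0)

clf = SVC()                            # SVM classifier, default kernel assumed
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))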
Four columns (0, 3, 9, 12) in the Wine dataset are selected to be normalized after calculating the standard deviation of each column, as listed in Table 2. The columns are selected by the level of their standard deviation values for the training in the SVM. The original raw data of these columns in the Wine data are shown in Fig. 2 below.

Fig. 2 Result of the normalization (0, 3, 9, 12)

The data in column 12 (Col. 12) are exceedingly high compared to the others. However, after the normalization process, Col. 12 is the lowest of all. In this case, the Min-Max normalization method is used, and the result is shown in Fig. 3. The graphs are plotted in Python.

Fig. 3 Result of the normalization (0, 3, 9, 12)

We can see that Col. 12 has shrunk the most. Finally, the standard deviations and the training accuracies of the selected columns are shown in Table 2.

Table 2. Std. dev. and accuracy result of columns

nth Column   Standard Deviation   Prediction Accuracy (%)
0            0.8                  70
3            3.3                  78
9            2.3                  74
12           314                  94

According to Table 2 and Fig. 4, there is a relation between the standard deviation of a column and the prediction accuracy.

Fig. 4 Enhancement by the normalization

The higher the standard deviation, the higher the prediction accuracy. If all the columns were preprocessed, the result should be the best of all; however, this takes a great amount of expense with Big Data, so we need to select the most efficient column(s) to deal with.

Ⅴ. Conclusion

The distribution of certain data can affect the performance of machine learning, and the standard deviation can be a clue to efficient performance. Therefore, the standard deviations of all the columns of the dataset were calculated to select significant columns to normalize, and four columns were selected and normalized to compare the prediction accuracy. The result showed that the higher the standard deviation, the higher the prediction accuracy. The normalization of column 12 in the Wine dataset affects the result the most since it has the highest standard deviation. Normalization for deep learning using TensorFlow will be the next study.

References

[1] X. Zhang, J. Zou, K. He, and J. Sun, "Accelerating very deep convolutional networks for classification and detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 38, 2015, pp. 1943-1955.
[2] R. Sathya and A. Annamma, "Comparison of Supervised and Unsupervised Learning Algorithms for Pattern Classification," IJARAI, vol. 2, no. 2, 2013, pp. 34-38.
[3] R. Sathya and A. Abraham, "Unsupervised Control Paradigm for Performance Evaluation," International Journal of Computer Application, vol. 44, no. 20, 2012, pp. 27-31.
[4] X. C. Yin, X. Yin, K. Huang, and H. W. Hao, "Robust text detection in natural scene images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 36, no. 5, 2014, pp. 970-983.
[5] N. Kim and Y. Bae, "Status Diagnosis of Pump and Motor Applying K-Nearest Neighbors," J. of the Korea Institute of Electronic Communication Science, vol. 13, no. 6, 2018, pp. 1249-1255.
[6] J. M. Keller, M. R. Gray, and J. A. Givens, "A Fuzzy K-Nearest Neighbor Algorithm," IEEE Trans. Systems, Man, and Cybernetics, vol. 15, no. 4, 1985, pp. 581-585.
[7] S. Bang, "Implementation of Image based Fire Detection System Using Convolution Neural Network," J. of the Korea Institute of Electronic Communication Science, vol. 12, no. 2, 2017, pp. 331-336.
[8] Y. Kim, S. Park, and D. Kim, "Research on Robust Face Recognition against Lighting Variation using CNN," J. of the Korea Institute of Electronic Communication Science, vol. 12, no. 2, 2017, pp. 325-330.
[9] C. Jung, R. Jang, D. Nyang, and K. Lee, "A Study of User Behavior Recognition-Based PIN Entry Using Machine Learning Technique," Korea Information Processing Society Review, Computer and Communication Systems, vol. 7, no. 5, 2018, pp. 127-136.
[10] G. Lee, H. Ha, H. Hong, and H. Kim, "Exploratory Research on Automating the Analysis of Scientific Argumentation Using Machine Learning," J. of the Korean Association for Science Education, vol. 38, no. 2, 2018, pp. 219-234.

Author Biography

Jun-Mo Jo
1991: B.S. in Computer Science, Iowa State University
1995: M.S. in Computer Engineering, Graduate School, Kyungpook National University
2004: Ph.D. in Computer Engineering, Graduate School, Kyungpook National University
1998-present: Professor, Dept. of Electronic Engineering, Tongmyong University
Research interests: mobile communication, brain-wave (EEG) communication, artificial intelligence
