Effectiveness of Normalization Pre-Processing of Big Data To The Machine Learning Performance
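Summary
Recently, the quantitative expansion of big data has emerged as a major issue in the Big Data field. Moreover, such big data is used as input to machine learning, and normalization preprocessing is needed to improve its performance. This performance depends heavily on the range of the big data columns and on the normalization preprocessing method. In this paper, various normalization preprocessing methods and ranges of big data columns are applied to the Support Vector Machine (SVM) machine learning method in order to identify more effective normalization preprocessing methods. To this end, machine learning was performed and analyzed in Python within the Jupyter Notebook environment.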
ABSTRACT
Recently, the massive growth in the scale of data has been observed as a major issue in Big Data. Furthermore, since Big Data is also an input to Machine Learning, it should be preprocessed by normalization to obtain high Machine Learning performance. The performance varies with many factors, such as the scope of the columns in the Big Data or the method of normalization preprocessing. In this paper, various types of normalization preprocessing methods and scopes of Big Data columns are applied to the Support Vector Machine (SVM) as the Machine Learning method to find an efficient environment for normalization preprocessing. The Machine Learning experiment was programmed in Python in the Jupyter Notebook.
Keywords
AI, Machine Learning, Big Data, Preprocessing, Performance Evaluation
* Corresponding Author : Jun-Mo Jo, Dept. Electronic Engineering, TongMyong University, Email : [email protected]
ㆍReceived : May 17, 2019, Revised : May 31, 2019, Accepted : Jun. 15, 2019
JKIECS, vol. 14, no. 03, 547-552, 2019
slower than earlier models. Real-world systems may suffer from the low speed of these networks. A cloud service needs to process thousands of new requests per second. Portable devices such as phones and tablets may not be able to afford slow models, and semantic segmentation needs to apply these models to many higher-resolution images. It is therefore very important to accelerate the test-time performance of CNNs[1].
Machine learning has made possible the concept of self-driving cars, automatic intelligent web search, user-based speech recognition software, personalized marketing and so on. Worldwide research is therefore underway in many universities, companies and research facilities. This research relates to supervised and unsupervised learning methods as well as the field of deep learning. In some methods, classifying a network of given patterns is a form of learning from observation. Such observation can define a new class or assign a new class to an existing class. This classification facilitates new theories and knowledge that are embedded in the input patterns. The learning behavior of the neural network model enhances the classification properties. Supervised and unsupervised learning and its properties have been investigated in the classification of postgraduate students according to their performance during the admission period[2].
The introduction of cognitive reasoning into a conventional computer can solve problems by example mapping, such as pattern recognition, classification and forecasting. Artificial Neural Networks (ANNs) provide these types of models. These are essentially mathematical models describing a function. An ANN is characterized by three types of parameters: the interconnection property, such as a feed-forward network or a recurrent network; the application function, such as a classification model; and the learning rule, such as supervised, unsupervised, and reinforcement methods[3-4].
In this paper, the various known types of normalization preprocessing methods and the importance of the scope or selection of the columns in the Big Data are elaborated in Section II. The Machine Learning algorithm, SVM, and the impact of various normalization methods on enhancing the performance of the specific Machine Learning algorithm are explained in Section III. In Section IV, the training results are compared and analyzed to find the more efficient normalization methods. Finally, the conclusion is given in Section V.

Ⅱ. Normalization Methods

Normalization is a preprocessing method. Preprocessing is not only a technique used to convert raw data into a clean data set, but also a way of enhancing the performance of Machine Learning. In other words, whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis or Machine Learning.
There are several well-known normalization methods, such as Simple Feature Scaling, Min-Max, Z-score and so on. First, the Simple Feature Scaling method is typically done via the following equation (1):

$$x_{new} = \frac{x_{old}}{x_{max}} \tag{1}$$

In the Min-Max method, the data is scaled to a fixed range, usually 0 to 1. The cost of having this bounded range is that it will end up with smaller standard deviations. The Min-Max scaling is given by the following equation (2):

$$x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}} \tag{2}$$

The result of the Z-score normalization is that the features will be rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1.
each method, the possibility of integrating two or more algorithms together to solve a problem should be investigated. The objective is to utilize the strengths of one method to complement the weaknesses of another. If we are only interested in the best possible classification accuracy, it might be difficult or impossible to find a single classifier that performs as well as a good ensemble of classifiers[4-5].
Unlike unsupervised learning, supervised learning is a method in which labeled training data is used to train a function that can be generalized to new examples. The training involves a critic that can indicate whether the function is correct or not. Various methods exist, such as the decision tree classifier, the KNeighbors classifier, and the support vector machine (SVM)[8-10].

IV. Result and Analysis

To train and to predict with the normalized data, the fit(), predict(), and closest() methods supported in Scikit-Learn are used. Table 1 shows the utilities used in the program.
Four columns (0, 3, 9, 12) in the Wine dataset are selected to be normalized after calculating the standard deviation of each column in Table 2. The columns are selected by the level of their standard deviation for the training in SVM. The original raw data of the columns in the Wine data are shown in Fig. 2 below.

Fig. 2 Result of the normalization(0, 3, 9, 12)

The data in column 12 (Col. 12) are exceedingly high compared to the others. However, after the normalization process, Col. 12 is the lowest among all. In this case, the Min-Max normalization method is used for the result, as shown in Fig. 3. The graph is programmed in Python.
We can see that Col. 12 has shrunk the most. Finally, the standard deviations and the training accuracies of the selected columns are shown in Table 2.
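The experiment described in this section could be sketched roughly as follows. This is not the author's program. It assumes that the Wine data corresponds to `load_wine` in Scikit-Learn (13 feature columns, indexed 0 to 12), and the 70/30 split, the default `SVC` kernel, the random seed and the helper name `column_accuracy` are illustrative choices, so the accuracies it prints need not match Table 2.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Assumption: the paper's Wine data matches Scikit-Learn's built-in copy
X, y = load_wine(return_X_y=True)
selected = [0, 3, 9, 12]

# Standard deviation of each selected raw (unnormalized) column
raw_std = {col: float(np.std(X[:, col])) for col in selected}


def column_accuracy(col, seed=0):
    """Train an SVM on one Min-Max normalized column; return test accuracy."""
    X_col = X[:, [col]]  # keep a 2-D shape: (n_samples, 1)
    X_train, X_test, y_train, y_test = train_test_split(
        X_col, y, test_size=0.3, random_state=seed, stratify=y
    )
    scaler = MinMaxScaler().fit(X_train)  # fit the scaler on training data only
    model = SVC()
    model.fit(scaler.transform(X_train), y_train)
    predictions = model.predict(scaler.transform(X_test))
    return float(np.mean(predictions == y_test))


accuracy = {col: column_accuracy(col) for col in selected}
for col in selected:
    print(f"column {col}: std = {raw_std[col]:.2f}, accuracy = {accuracy[col]:.2f}")
```

Fitting the scaler on the training split only keeps information from the test samples out of the normalization step, which is a common precaution rather than something the paper specifies.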
Table 2. Std. Dev. and accuracy result of columns

nth Column    Standard Deviation    Prediction Accuracy(%)
0             0.8                   70
3             3.3                   78
9             2.3                   74
12            314                   94

deviation, the higher the prediction accuracy. The normalization of column 12 in the Wine dataset has the greatest effect of all since it has the highest standard deviation. Normalization for deep learning using TensorFlow will be the subject of the next study.
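About the Author
Jun-Mo Jo
1991: B.S., Dept. of Computer Science, Iowa State University
1995: M.S., Dept. of Computer Engineering, Graduate School, Kyungpook National University
2004: Ph.D., Dept. of Computer Engineering, Graduate School, Kyungpook National University
1998 to present: Professor, Dept. of Electronic Engineering, TongMyong University
※ Research interests: mobile communications, brainwave communication, artificial intelligence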