Using Bayes Network For Prediction of Type-2 Diabetes: Yan Hu
Abstract-Diabetes mellitus is a chronic disease and a major public health challenge worldwide. Using data mining methods to help people predict diabetes has become a popular topic. In this paper, a Bayes network is proposed to predict whether patients develop Type-2 diabetes. The dataset used is the Pima Indians Diabetes Data Set, which contains information on patients with and without Type-2 diabetes. Weka software was used throughout this study. The accuracy obtained suggests that using the proposed Bayes network to predict Type-2 diabetes is effective.
Keywords-Bayes Network; Prediction; Type-2 Diabetes
I. INTRODUCTION
Healthcare information systems tend to capture data in databases for research and analysis in order to assist in making medical decisions. As a result, medical information systems in hospitals and medical institutions become larger and larger, and the process of extracting useful information becomes more difficult. Traditional manual data analysis has become inefficient, and methods for efficient computer-based analysis are much needed. To this aim, many approaches to computerized data analysis have been considered and examined. Data mining represents a significant advance in the type of analytical tools. It has been reported that the benefits of introducing data mining into medical analysis are increased diagnostic accuracy, reduced costs and savings in human resources [1].
Diabetes mellitus has become a major global public health problem in recent times. According to the International Diabetes Federation, there are currently 246 million diabetic people worldwide, and this number is expected to rise to 380 million by 2025 [2]. Diabetes is a chronic disease in which the body does not produce insulin or does not use it properly. This increases the risk of developing kidney disease, blindness, nerve damage and blood vessel damage, and contributes to heart disease [3]. There are two types of diabetes: one is type-1 diabetes, also called insulin-dependent diabetes, which is usually diagnosed in children and juveniles; the other is type-2 diabetes, which is often diagnosed in middle-aged to elderly people. Patients with type-2 diabetes do not require insulin treatment to remain alive, although up to 20% are treated with insulin to control blood glucose levels. It has been shown that 80% of type-2 diabetes complications can be prevented or delayed by early identification of people at risk [4]. Thus, it is important to develop medical diagnostic decision support systems that can aid middle-aged to elderly people in the self-diagnostic process at home.
The data set used in this paper is taken from the UCI Machine Learning Repository [5]. The original owner of this dataset is the National Institute of Diabetes and Digestive and Kidney Diseases. The instances were selected as follows: all patients are females at least 21 years old of Pima Indian heritage.
II. RELATED WORK
Much research has been conducted in the field of prediction of Type-2 diabetes. In [6], the authors constructed an artificial neural network model for the diagnosis of diabetes. They used certain combinations of preprocessing techniques to handle the missing values and compared the accuracy of the model for each technique; however, the method of handling missing values presented in this paper was not employed in that study. The authors of [7] constructed association rules for the classification of type-2 diabetic patients. They generated 10 association rules to identify whether the patient goes on to develop diabetes or not.
Several machine learning algorithms have been proposed in this context and have been successfully used in some parts. Bayesian networks are powerful tools for knowledge representation and inference under uncertainty, and using Bayes networks as classifiers has been shown to be effective in some domains [8]. In this study, we use a Naive Bayes network to build a decision-making system with which middle-aged to elderly people can perform self-prediction of type-2 diabetes at home.
III. DATA PREPROCESSING
Most of the data sets used in data mining were not necessarily gathered with a specific goal in mind. Some of them may contain errors, outliers or missing values. In order to use those data sets in the data mining process, the data needs to undergo preprocessing, using data cleaning, discretization and data transformation [9]. It has been estimated that data preparation alone accounts for 60% of all the time and effort expended in the entire data mining process [10].
The dataset used in this study is "The Pima Indians Diabetes Dataset". There are 768 instances in this dataset, and all instances have 8 input attributes (from X1 to X8) and 1 output attribute (Y). TABLE I shows the attributes of this dataset.
A. Data Normalization
In this paper, we use the Min-Max normalization model to transform the attribute's values to a new range, 0 to 1. The formula used to normalize attribute X is as follows:
$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \qquad (1)$$
where $X_{\min}$ and $X_{\max}$ are the minimum and maximum observed values of X.
Based on formula (1), the normalization process is performed on the data to overcome this problem and to get a better result, shown in TABLE III.
TABLE III. RESULTS OF THE NORMALIZATION PROCESS PERFORMED ON THE DATA
Attribute Number   Mean   Standard Deviation
All of these attributes are numeric, but discrete attributes are needed in order to present the standard conditional probability tables of Bayes belief networks. First, each attribute is made binary according to high values and low values, and then a numerical probability distribution is fitted for each node.
Weka software is a collection of machine learning algorithms for data mining tasks. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization [11]. First, we use Weka's 'weka.filters.Discretize' method to transform the attributes to binary variables. But the result is strange: for most attributes, one side of the split contained only a small percentage (less than 10%) of the samples; this is not useful because over 33% of the samples are positive. The reason is that the Weka filter uses information gain, which often favors highly pure small splits.
As an alternative, we find the median value of each attribute and divide each attribute up 50/50 (or as close as we could). The division of the values into several bins is a very common method for discretization, but usually more than 2 bins are used. After transformation, the counts of OVERWEIGHT, BMI and SKIN agree closely; for this reason, SKIN and BMI can be discarded in this model.
IV. METHODOLOGY
diagnoses, it is dropped and a diagnostic score is computed for each diagnosis as,
The conditional probability of symptom $S_j$ for disease $d_i$ is
$$P(S_j \mid d_i) = \frac{f(S_j \cap d_i)}{f(d_i)}$$
where $f(d_i)$ is the number of patients in the dataset with disease $d_i$ and $f(S_j \cap d_i)$ is the frequency count of patients with both $S_j$ and $d_i$.
When a Bayesian belief network is applied to the classification problem, one of the most effective classifiers is the so-called Naive Bayesian classifier. When represented as a Bayes network, it has the simple structure shown in Fig. 1. This classifier learns from observed data the conditional probability of each variable $S_j$ given the class label S. Classification is then done by applying Bayes rule to compute the probability $p(S \mid S_1, \ldots, S_n)$ and ...
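As a rough illustration of the preprocessing and classification steps described in this paper, the following Python sketch applies Min-Max normalization, a median split into LOW/HIGH, and a naive Bayes classifier built from plain frequency counts. The function names, the toy data and the tie rule (values equal to the median count as LOW) are assumptions made for illustration. This is not the Weka implementation used in the study, and no smoothing of zero counts is attempted.

```python
from collections import Counter
from statistics import median


def min_max(values):
    """Rescale a list of numbers to the range [0, 1] (Min-Max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def binarize_at_median(values):
    """Split an attribute as close to 50/50 as possible: HIGH if above the median."""
    m = median(values)
    return ["HIGH" if v > m else "LOW" for v in values]


def train_counts(rows, labels):
    """Collect f(d) and f(S_j and d) as plain frequency counts."""
    class_counts = Counter(labels)
    joint_counts = Counter()
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            joint_counts[(j, value, label)] += 1
    return class_counts, joint_counts


def posterior(row, class_counts, joint_counts):
    """Naive Bayes: P(d | row) is proportional to P(d) * prod_j f(S_j and d) / f(d)."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total
        for j, value in enumerate(row):
            score *= joint_counts[(j, value, label)] / n_label
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()} if norm else scores


# Toy data (hypothetical, not taken from the Pima dataset): two attributes, six patients.
glucose = [85, 90, 100, 150, 160, 170]
age = [25, 50, 30, 45, 60, 28]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]

columns = [binarize_at_median(min_max(col)) for col in (glucose, age)]
rows = list(zip(*columns))
class_counts, joint_counts = train_counts(rows, labels)
probs = posterior(("HIGH", "HIGH"), class_counts, joint_counts)
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

Because Min-Max normalization preserves the ordering of values, the median split gives the same LOW/HIGH labels whether it is applied before or after normalization; the sketch normalizes first only to follow the order of the steps in the text.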
TABLE X.
               Test positive   Test negative
INSULIN=HIGH   128/268         256/500
INSULIN=LOW    140/268         244/500
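The entries of TABLE X are frequency counts, so the conditional probabilities and a single-attribute posterior follow by simple arithmetic. The sketch below is a minimal illustration that uses only the INSULIN counts from TABLE X. The helper names are hypothetical, and the leave-one-out step (subtract one sample from the counts, score it, add it back) is shown for a single example patient rather than for the whole dataset.

```python
# Counts from TABLE X: patients per (INSULIN value, test result).
counts = {
    ("HIGH", "positive"): 128, ("HIGH", "negative"): 256,
    ("LOW", "positive"): 140, ("LOW", "negative"): 244,
}
class_totals = {"positive": 268, "negative": 500}


def cond_prob(value, label):
    """P(INSULIN=value | label) as a ratio of frequency counts."""
    return counts[(value, label)] / class_totals[label]


def posterior_positive(value):
    """P(positive | INSULIN=value) by Bayes rule, using INSULIN alone."""
    n = sum(class_totals.values())
    joint = {c: (class_totals[c] / n) * cond_prob(value, c) for c in class_totals}
    return joint["positive"] / sum(joint.values())


def leave_one_out_posterior(value, label):
    """Subtract one sample from the counts, score it, then add it back."""
    counts[(value, label)] -= 1
    class_totals[label] -= 1
    try:
        return posterior_positive(value)
    finally:
        counts[(value, label)] += 1
        class_totals[label] += 1


p_high_given_pos = cond_prob("HIGH", "positive")      # 128/268
p_pos_given_high = posterior_positive("HIGH")         # 128/(128+256)
p_loo = leave_one_out_posterior("HIGH", "positive")   # 127/(127+256)
print(round(p_high_given_pos, 3), round(p_pos_given_high, 3), round(p_loo, 3))
```

Updating the counts in place avoids rebuilding the tables for each of the 768 held-out samples, which is what makes leave-one-out evaluation cheap for a count-based network.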
sample is subtracted from the counts, then the network is used to classify the sample, and finally the sample is added back into the counts.
By using leave-one-out evaluation, the accuracy of the proposed Bayes network and that of Weka's naive Bayes network are compared. TABLE XI shows the results comparing the proposed Bayes network to naive Bayes:
TABLE XI. RESULTS COMPARING PROPOSED BAYES NETWORK TO NAIVE BAYES
Method                   Accuracy
Proposed Bayes Network   555/768=72.3%
Naive Bayes Network      549/768=71.5%
From TABLE XI, the result of the proposed Bayes network is more accurate than that of the naive Bayes network. The proposed Bayes belief network model is promising for this domain.
V. DISCUSSION AND CONCLUSION
The discovery of knowledge from medical databases is important in order to make effective medical diagnosis. The aim of data mining is to extract knowledge from information stored in databases and generate clear and understandable descriptions of patterns.
This study aimed at the discovery of a Bayes network model for the prediction of type-2 diabetes. The dataset used was the Pima Indian diabetes dataset. Pre-processing was used to improve the quality of data. The pre-processing techniques applied were attribute identification and selection, data normalization, and numerical discretization. Next, a classifier was applied to the modified dataset to construct the Naive Bayes model. Finally, Weka was used to do the simulation, and the accuracy of the resulting model was 72.3%.
There are some limitations of this study. Firstly, considering the Pima Indian diabetes dataset, there might be other risk factors that the data collection did not consider. According to [12], other important factors include gestational diabetes, family history, metabolic syndrome, smoking, inactive lifestyles, certain dietary patterns, etc. A proper prediction model would need more data gathering to make it more accurate. This can be achieved by collecting diabetes datasets from multiple sources and generating a model from each dataset. Secondly, in this study we only use a Bayes network to predict diabetes. Considering the uncertainty of some diabetes attributes, in future work a fuzzy set method will be introduced to improve the Bayes network's predictions. Also, in order to find the best prediction model, other machine learning methods such as neural networks will be tested to compare the prediction results.
REFERENCES
[1] Marjan Khajehei, Faried Etemady, "Data Mining and Medical Research Studies," cimsim, pp. 119-122, 2010 Second International Conference on Computational Intelligence, Modelling and Simulation, 2010
[2] International Diabetes Federation, Diabetes Atlas, 3rd ed. Brussels, Belgium: International Diabetes Federation, 2007
[3] R. Bellazzi, "Telemedicine and diabetes management: Current challenges and future research directions," J. Diabetes Sci. Technol., vol. 2, no. 1, pp. 98-104, 2008
[4] J. C. Pickup, G. Williams (Eds.), Textbook of Diabetes, Blackwell Science, Oxford
[5] http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes, Irvine, CA: University of California, School of Information and Computer Science
[6] Al Jarullah, A. A., "Decision Tree Discovery for the Diagnosis of Type II Diabetes," International Conference on Innovations in Information Technology (IIT), 2011.
[7] Patil, B. M.; Joshi, R. C.; Toshniwal, D., "Association Rule for Classification of Type-2 Diabetic Patients," Machine Learning and Computing (ICMLC), 2010 Second International Conference on, pp. 330-334, 9-11 Feb. 2010
[8] Friedman N, Linial M, Nachman I, Pe'er D (2000) Using Bayesian networks to analyze expression data. Journal of Computational Biology: a journal of computational molecular cell biology 7: 601-620.
[9] Larose, D. T. (2006) Data Mining Methods and Models, Hoboken: John Wiley & Sons, Inc.
[10] Pyle, D. (1999) Data Preparation for Data Mining, San Francisco: Morgan Kaufmann
[11] http://www.cs.waikato.ac.nz/ml/weka/
[12] P. Langley, W. Iba, and K. Thompson. An Analysis of Bayesian Classifiers. Proc. 10th Nat. Conf. on Artificial Intelligence (AAAI'92, San Jose, CA, USA), 223-228. AAAI Press and MIT Press, Menlo Park and Cambridge, CA, USA 1992
[13] P. Langley and S. Sage. Induction of Selective Bayesian Classifiers. Proc. 10th Conf. on Uncertainty in Artificial Intelligence (UAI'94, Seattle, WA, USA), 399-406. Morgan Kaufmann, San Mateo, CA, USA 1994
[14] Seibel, J. A. (2007) Diabetes Guide, WebMD, http://diabetes.webmd.com/guide/oral-glucose-tolerance-test.