Mukesh Kumari et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4), 2014, 5174-5178

Prediction of Diabetes Using Bayesian Network


Mukesh Kumari¹, Dr. Rajan Vohra², Anshul Arora³
¹,³ Student of M.Tech (C.E), ² Head of Department
Department of Computer Science & Engineering
P.D.M College of Engineering
Sector 3A, Sarai Aurangabad, Bahadurgarh

Abstract: This paper helps in predicting diabetes by applying a data mining technique. The discovery of knowledge from medical datasets is important in order to make effective medical diagnoses. The aim of data mining is to extract knowledge from information stored in a dataset and generate clear and understandable descriptions of patterns. Diabetes mellitus is a chronic disease and a major public health challenge worldwide. Using data mining methods to help predict diabetes has gained major popularity. In this paper, a Bayesian network classifier is proposed to predict whether a person is diabetic or not. The dataset used was collected from a hospital, which records information on persons with and without diabetes. We used the Weka tool for the experiment and analysis. A classification algorithm is applied to the dataset of persons collected from the hospital. Results have been obtained.

Keywords: data mining, diabetes, Bayesian network, Weka.

INTRODUCTION
Data mining is often described as the process of discovering correlations, patterns, trends or relationships by searching through a large amount of data stored in repositories, corporate databases, and data warehouses. Humans, in that sense, are limited by information overload; thus, new tools and techniques are being developed to solve this problem through automation. Data mining uses a series of pattern recognition technologies and statistical and mathematical techniques to discover the possible rules or relationships that govern the data in the databases. Data mining must also be considered as an iterative process that requires goals and objectives to be specified [1].

Diabetes mellitus (DM), or simply diabetes, is a group of metabolic diseases in which a person has high blood sugar. This high blood sugar produces the symptoms of frequent urination, increased thirst, and increased hunger. Untreated, diabetes can cause many complications. Acute complications include diabetic ketoacidosis and nonketotic hyperosmolar coma. Serious long-term complications include heart disease, kidney failure, and damage to the eyes [2]. There are three main types of diabetes mellitus:
• Type 1 DM results from the body's failure to produce insulin. This form was previously referred to as "insulin-dependent diabetes mellitus" (IDDM) or "juvenile diabetes".
• Type 2 DM results from insulin resistance, a condition in which cells fail to use insulin properly, sometimes also with an absolute insulin deficiency. This form was previously referred to as "non insulin-dependent diabetes mellitus" (NIDDM) or "adult-onset diabetes".
• Gestational diabetes is the third main form and occurs when pregnant women without a previous diagnosis of diabetes develop a high blood glucose level.

LITERATURE SURVEY
A research paper by Sudajai Lowanichchai, Saisunee Jabjone, and Tidanut Puthasimma (Assistant Professor, Informatic Program, Faculty of Science and Technology, Nakhon Ratchasima Rajabhat University) proposed applying information technology in a knowledge-based DSS for the analysis of diabetes in the elderly using decision trees. The results showed that the RandomTree model had the highest classification accuracy, 99.60 percent when compared with the medical diagnosis, with an MAE of 0.004 and an RMSE of 0.0447. The NBTree model had the lowest classification accuracy, 70.60 percent when compared with the medical diagnosis, with an MAE of 0.3327 and an RMSE of 0.454 [3].

In another research paper, Yang Guo, Guohua Bai, and Yan Hu (School of Computing, Blekinge Institute of Technology, Karlskrona, Sweden) note that the discovery of knowledge from medical databases is important in order to make effective medical diagnoses. The dataset used was the Pima Indian diabetes dataset. Pre-processing was used to improve the quality of the data. A classifier was applied to the modified dataset to construct the Naïve Bayes model. Finally, Weka was used for simulation, and the accuracy of the resulting model was 72.3% [4].

In a research paper, Ashwinkumar U.M and Dr. Anandakumar K.R (Reva Institute of Technology and Management, Bangalore; S J B Institute of Technology, Bangalore) proposed a novel learning algorithm, i+Learning, as well as i+LRA, which apparently achieves higher classification accuracy than the ID3 algorithm.
The major limitation of their method is the adoption of a binary tree rather than a multi-branch tree. Such a structure increases the tree size, because an attribute can be selected as a decision node more than once in a tree. For that reason, binary trees tend to be less efficient in terms of tree storage requirements and test time requirements, although they are easy to build and interpret [5].

Literature Review on Diabetes, by National Public Health: Women tend to be hardest hit by diabetes, with 9.6 million women having diabetes. This represents 8.8% of the adult population of women 18 years of age and older in

 

2003 and a twofold increase from 1995 (4.7%). By 2050, the projected number of all persons with diabetes will have increased from 17 million to 29 million [5].

CONCEPTUAL FRAMEWORK
DATA MINING:
Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases [7].

BAYESIAN NETWORK:
A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data.
The source also covers methods for constructing Bayesian networks from prior knowledge and summarizes Bayesian statistical methods for using data to improve these models. With regard to the latter task, it describes methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, it relates Bayesian-network methods for learning to techniques for supervised and unsupervised learning, and illustrates the graphical-modeling approach using a real-world case study [8].

WEKA TOOL
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is free software available under the GNU General Public License. The Weka workbench contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality.
Weka is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. The algorithms can either be applied directly to a dataset or called from your own Java code [9].

PROBLEM STATEMENT: Prediction of diabetes using Bayesian network
To identify whether a given person in the dataset is diabetic, non-diabetic, or pre-diabetic, classification is done on the basis of attribute values. The dataset contains the details of each person, such as fasting GTT value, casual GTT value, number of times pregnant, diastolic blood pressure (mmHg), triceps skin fold thickness (mm), serum insulin (µU/ml), body mass index (kg/m²), diabetes pedigree function, and age. Attributes such as fasting GTT, casual GTT, and diastolic blood pressure values exceeding a specific value may help to identify whether a person is diabetic, non-diabetic, or pre-diabetic. A brief explanation is given below through a flow chart.

[Flow chart: PERSON DATASET → CLASSIFICATION (Bayesian network) → CLASS 0 (no), CLASS 1 (pre), CLASS 2 (yes)]

Flow chart of problem
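To make the classification step in the flow chart more concrete, the following is a minimal sketch of inference in a very small Bayesian network with three classes. It is not the model learned by Weka in this paper: the structure (the class node as the only parent of two evidence nodes), the variable names `fast` and `casual`, the band labels, and every probability value are hypothetical illustration values. The sketch only shows how a posterior over the classes follows from the network's factorization, and that a missing entry can simply be left out of the evidence.

```python
# Toy Bayesian network: DIABETES -> fast, DIABETES -> casual.
# All numbers below are made-up illustration values, NOT taken from the paper.

# Prior P(class)
PRIOR = {"no": 0.3, "pre": 0.3, "yes": 0.4}

# Conditional probability tables P(band | class) for each evidence variable
CPT = {
    "fast": {
        "no":  {"low": 0.90, "mid": 0.08, "high": 0.02},
        "pre": {"low": 0.10, "mid": 0.80, "high": 0.10},
        "yes": {"low": 0.02, "mid": 0.08, "high": 0.90},
    },
    "casual": {
        "no":  {"low": 0.90, "mid": 0.08, "high": 0.02},
        "pre": {"low": 0.10, "mid": 0.80, "high": 0.10},
        "yes": {"low": 0.02, "mid": 0.08, "high": 0.90},
    },
}


def posterior(evidence):
    """Return P(class | evidence) from P(class) * product of P(e_i | class).

    `evidence` maps a variable name to its observed band. Variables that
    are missing from `evidence` are marginalized out, which in this
    structure just means their factor is skipped.
    """
    joint = {}
    for cls, p in PRIOR.items():
        for var, band in evidence.items():
            p *= CPT[var][cls][band]
        joint[cls] = p
    total = sum(joint.values())
    return {cls: p / total for cls, p in joint.items()}


def classify(evidence):
    """Return the class with the highest posterior probability."""
    post = posterior(evidence)
    return max(post, key=post.get)


if __name__ == "__main__":
    print(classify({"fast": "high", "casual": "high"}))
    print(posterior({"fast": "low"}))  # casual value missing
```

A network learned from data would generally have a different structure and different tables; the point here is only the mechanics of turning attribute values into one of the three classes.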

 

In this work, a dataset of persons is collected from the hospital and fed into the software, i.e., Weka, which outputs the total number of diabetic, non-diabetic, and pre-diabetic persons. The classification is based upon the values of the primary attributes. This technique classifies the dataset into three different classes.

RESULTS AND DISCUSSIONS
The dataset taken for this research work contains 206 records and 9 attributes for the purpose of predicting whether a person is diabetic or non-diabetic based on the symptoms. This dataset is prepared in MS Excel format.

PREPARING DATASET:
This is a sample of the dataset used for prediction. The dataset used contains 206 instances. All instances have 9 input attributes (X1 to X9) and one output attribute (Y). The table shows the attributes of this dataset.

Attribute no   Attribute    Description                                 Type
X1             PREGNANT     Number of times pregnant                    Numeric
X2             FAST_GTT     Glucose tolerance test                      Numeric
X3             CASUAL_GTT   Glucose tolerance test                      Numeric
X4             BP           Diastolic blood pressure (mmHg)             Numeric
X5             INSULIN      Serum insulin (µU/ml)                       Numeric
X6             SKIN         Triceps skin fold thickness (mm)            Numeric
X7             BMI          Body mass index (kg/m²)                     Numeric
X8             DPF          Diabetes pedigree function                  Numeric
X9             AGE          Age of person (years)                       Numeric
Y              DIABETES     Diabetes diagnosis results (no, pre, yes)   Nominal

Table: Attributes of dataset

Sample Dataset[]

Figure: Representation of data used for solving problems

Problem: The first problem of this work is predicting whether a person is diabetic or non-diabetic in a dataset by applying a Bayesian network. This problem is solved using the primary attributes. The dataset variables used as decision variables for the prediction of diabetes are fasting plasma glucose concentration in an oral glucose tolerance test, casual plasma glucose tolerance test, and diastolic blood pressure (mmHg).
If the value of fasting plasma glucose is less than 100 mg/dl and the value of the casual glucose tolerance test is less than 140 mg/dl, then a score of 0 is given, meaning the person is non-diabetic. If the value of fasting plasma glucose lies in the range of 100-125 mg/dl and the value of the casual glucose tolerance test lies in the range of 140-190 mg/dl, then a score of 1 is given, meaning the person is pre-diabetic. If the value of fasting plasma glucose is more than 125 mg/dl and the value of the casual glucose tolerance test is more than 190 mg/dl, then a score of 2 is given, meaning the person is diabetic.

Figure: Representing data loaded into Weka

Figure: Results obtained using Bayesian network

Result for the Problem: Classification of persons into three classes:
To obtain the result for the first problem, the data of 206 persons is fed to the Weka tool, and as a result the tool generates three classes, i.e., class no, class pre, and class yes.
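The scoring rule stated above for the fasting and casual glucose values can be sketched as a small function. This is a sketch under stated assumptions, not the authors' implementation: the function names `band` and `glucose_score` are hypothetical, and the text does not say what score is given when the two measurements fall in different ranges, so the sketch returns `None` in that case rather than guess.

```python
from typing import Optional


def band(value: float, low: float, high: float) -> int:
    """Return 0 if value < low, 1 if low <= value <= high, 2 if value > high."""
    if value < low:
        return 0
    if value <= high:
        return 1
    return 2


def glucose_score(fasting_mg_dl: float, casual_mg_dl: float) -> Optional[int]:
    """Score 0 (non-diabetic), 1 (pre-diabetic) or 2 (diabetic).

    Thresholds follow the text: fasting <100 / 100-125 / >125 mg/dl and
    casual <140 / 140-190 / >190 mg/dl. When the two values fall in
    different bands the text gives no rule, so None is returned
    (an assumption of this sketch).
    """
    fasting_band = band(fasting_mg_dl, 100, 125)
    casual_band = band(casual_mg_dl, 140, 190)
    return fasting_band if fasting_band == casual_band else None


if __name__ == "__main__":
    for fasting, casual in [(90, 120), (110, 150), (130, 200), (90, 200)]:
        print(fasting, casual, glucose_score(fasting, casual))
```

Returning `None` for mixed cases keeps the sketch faithful to the rule as written; a classifier such as the Bayesian network is what resolves those cases from the data.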

 

Class no: Class depicting persons with no diabetes.
Class pre: Class depicting persons with pre-diabetes.
Class yes: Class depicting persons with diabetes.

CLASS   ENTRIES   % OF PERSONS
NO      65        31.55
PRE     51        24.75
YES     90        43.68

Table: Results obtained after applying Bayesian network

In the figure, the graphical representation of these classes is shown. The class shown in blue is the no class, i.e., persons with no diabetes. The class shown using red crosses is the pre class, i.e., persons who are not diabetic but have symptoms that may lead to diabetes in the future. The class shown using green crosses is the yes class, i.e., persons who are diabetic.

Figure: Classes shown after applying technique

Confusion matrix
A confusion matrix contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier [9].

Confusion Matrix

• True Positive (TP): These are the positive tuples that were correctly labeled by the classifier [10]. If the outcome from a prediction is p and the actual value is also p, then it is called a true positive (TP) [9].
• True Negative (TN): These are the negative tuples that were correctly labeled by the classifier [10].
• False Positive (FP): These are the negative tuples that were incorrectly labeled as positive. If the outcome from a prediction is p but the actual value is n, then it is said to be a false positive (FP) [9].
• False Negative (FN): These are the positive tuples that were mislabeled as negative [10].

Accuracy is calculated as
(TP+TN)/(P+N)
where P = TP+FN and N = FP+TN, or equivalently (TP+TN)/TOTAL.
According to the experimental results, the number of correctly classified instances for the Bayesian network is 205 out of 206. The accuracy of the Bayesian network is 99.51%, which is high. The Bayesian network is a promising technique for this type of dataset.

CONCLUSION AND FUTURE SCOPE
The discovery of knowledge from medical datasets is important in order to make effective medical diagnoses. The aim of data mining is to extract knowledge from information stored in a dataset and generate clear and understandable descriptions of patterns.
This study aims at the discovery of a Bayesian network model for the diagnosis of diabetes. The dataset used was collected from a hospital. Pre-processing is used to improve the quality of the data. The pre-processing techniques applied are attribute identification and selection, data normalization, and numerical discretization.
Next, a classifier is applied to the modified dataset to construct the Bayesian model. Finally, Weka is used for simulation, and the accuracy of the model is calculated and compared with the efficiency of other algorithms.
Classification with the Bayesian network shows the best accuracy, 99.51 percent, and the classification error is 0.48 percent when the results are compared to clinical diagnosis. The mean absolute error (MAE) is 0.0053 and the root mean squared error (RMSE) is 0.0596. The total time required to build the model is also a crucial parameter in comparing classification algorithms.

FUTURE SCOPE
There are some limitations of this study. Firstly, considering the diabetes dataset, there might be other risk factors that the data collection did not consider. Other important factors reported in the literature include gestational diabetes, family history, metabolic syndrome, smoking, inactive lifestyles, certain dietary patterns, etc. A proper prediction model would need more data gathering to make it more accurate. This can be achieved by collecting diabetes datasets from multiple sources and generating a model from each dataset.
Secondly, in this study we only use a Bayesian network to predict diabetes. Considering the uncertainty of some diabetes attributes, in future work the fuzzy set method will be introduced to improve the Bayesian network for prediction. Also, in order to find the best prediction model, other machine learning methods such as neural networks will be tested to compare the prediction results.
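As a check on the accuracy figure reported in the results, the formula (TP+TN)/(P+N) and the ratio of correctly classified instances can be sketched as follows. The function names are hypothetical, and the two-class counts used in the usage lines are made-up illustration values; only the 205 and 206 figures come from the paper.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Two-class accuracy (TP + TN) / (P + N), with P = TP + FN and N = FP + TN."""
    positives = tp + fn
    negatives = fp + tn
    return (tp + tn) / (positives + negatives)


def percent_correct(correct: int, total: int) -> float:
    """Percentage of correctly classified instances, for any number of classes."""
    return 100.0 * correct / total


if __name__ == "__main__":
    # Illustrative two-class counts (not from the paper).
    print(accuracy(tp=50, tn=40, fp=5, fn=5))
    # Figures reported in the paper: 205 of 206 instances correct.
    print(round(percent_correct(205, 206), 2))
```

With 205 of 206 instances correct, the second call reproduces the reported 99.51 percent when rounded to two decimals.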

 

ACKNOWLEDGEMENTS
The authors would like to thank their head, Dr. Rajan Vohra, HOD of CSE & I.T department, PDMCE, Bahadurgarh, for his valuable support and help.

REFERENCES:
[1] Rohanizadeh S., "A proposed data mining methodology application to industrial procedures".
[2] en.wikipedia.org/wiki/Diabetes_mellitus
[3] Sudajai Lowanichchai, Saisunee Jabjone, Tidanut Puthasimma, "Knowledge-based DSS for an Analysis Diabetes of Elder using Decision Tree".
[4] Yang Guo, Guohua Bai, Yan Hu, School of Computing, Blekinge Institute of Technology, Karlskrona, Sweden, "Using Bayes Network for Prediction of Type-2 Diabetes".
[5] Beckles GLA, Thompson-Reid PE, editors. Diabetes and Women's Health Across the Life Stages: A Public Health Perspective. Atlanta: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Chronic Disease Prevention and Health Promotion.
[6] Gloria L.A. Beckles and Patricia E. Thompson-Reid, "Diabetes and Women's Health Across the Life Stages".
[7] Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining Concepts and Techniques", Third edition.
[8] en.wikipedia.org/wiki/Bayesian_network
[9] Sapna Jain, M. Afshar Aalam, M. N. Doja, "K-MEANS CLUSTERING USING WEKA INTERFACE", Proceedings of the 4th National Conference; INDIACom-2010 Computing For Nation Development, February 25-26, 2010.
[10] Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining Concepts and Techniques", Third edition.
[11] Database - "patient data base".

 