Project Documents Group12
By
Group No: - 12
CERTIFICATE
ACKNOWLEDGEMENT
INTRODUCTION
MOTIVATION OF THE PROJECT
HARDWARE AND SOFTWARE TOOLS TO BE USED
FLOW-CHART OF THE PROJECT
ABOUT DATASET
PREPROCESSING DATASET
ABOUT CLASSIFICATION SUPERVISED MODEL
CONFUSION MATRIX
ROC CURVE
OUTPUT COMPARISON
FUTURE SCOPE
CONCLUSION
REFERENCES
CERTIFICATE
We hereby declare that the work presented in the Project Report entitled Diabetes
Prediction using Machine Learning, in partial fulfilment of the requirements for the award of the Bachelor of
Technology in Information Technology and submitted to the Department of Information Technology of Future
Institute of Engineering and Management, Kolkata, is an authentic record of our own work carried out during the
period from September 2021 to June 2022, under the supervision of Prof. Mousumi Biswas.
The matter presented in this thesis has not been submitted by us for the award of any other degree elsewhere.
This is to certify that the above statement made by the students is correct to the best of my knowledge.
Date: 11.01.2022
Signature of the Supervisor
ACKNOWLEDGEMENT
We have put considerable effort into this project. However, it would not have
been possible without the kind support and help of many individuals. We would
like to extend our sincere thanks to all of them.
We are highly indebted to our guide Prof. Mousumi Biswas for guidance
and constant supervision, for providing the necessary information
regarding the project, and for support in completing the project.
Special thanks also go to Prof. Debjyoti Basu and Prof. Subhasis Mitra for
helping us in this project.
We express our thanks to our Principal Dr. Aloke Ghosh and our Head of the
Department Prof. Prasenjit Basu for extending their support. We would also
like to thank our Institution and the faculty members, without whom this
project would have been a distant reality.
Our thanks and appreciation also go to everyone who willingly helped us
with their abilities.
Abhishek Sinha
Arka Dutta
Ritayan Midya
Pritam Pal
Ehsan Hassan
INTRODUCTION
In recent times, many people suffer from diabetes. There are an estimated
72.96 million cases of diabetes in the adult population of India. The prevalence
in urban areas ranges between 10.9% and 14.2%, and the prevalence in rural India
is 3.0-7.8% among the population aged 20 years and above, with a much higher
prevalence among individuals aged over 50 years. To study this problem we use
the Pima Indian Diabetes Dataset and apply machine learning classification
methods to predict diabetes. Machine learning is a method by which computers
learn patterns from data rather than being explicitly programmed. Machine
learning techniques can extract knowledge efficiently by building classification
and ensemble models from a collected dataset, and such data can be useful for
predicting diabetes. Many machine learning techniques are capable of making this
prediction, but it is hard to choose the best one. For this reason we apply two
popular classification methods, K-NN and Logistic Regression, to the dataset.
The main objective of this project is to compare these two methods and choose
the better prediction method.
HARDWARE:
Any laptop or desktop (Windows 10) with internet connectivity.
GPU
SOFTWARE:
Google Colab
MS Excel
Python
Sklearn
This is the most important phase, which includes model building for the
prediction of diabetes. In this phase we have implemented the machine learning
algorithms discussed above for diabetes prediction.
[Flow chart of the project. Stages shown: data processing, split dataset,
classifier, test dataset (20%), predicting test result, confusion matrix,
graph visualization and analysis, best model.]
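The split stage of the flow chart, with 20% of the data held out for testing, can be sketched as follows. This is a minimal illustration and not the project's actual code: the helper name `split_dataset`, the fixed seed and the use of rounding for the test-set size are all assumptions, and plain integers stand in for the 768 dataset rows.

```python
import random

def split_dataset(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and return (train_rows, test_rows).

    Hypothetical helper: the seed and the rounding rule are assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 768 samples of the dataset.
rows = list(range(768))
train_rows, test_rows = split_dataset(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split reproducible, so the two classifiers can be compared on the same held-out samples.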
ABOUT DATASET
This dataset is originally from the National Institute of Diabetes and
Digestive and Kidney Diseases. The objective is to predict based on
diagnostic measurements whether a patient has diabetes or not.
This dataset has 768 samples of diabetic and healthy individuals.
In particular, all patients here are females of at least 21 years of age.
The diabetes dataset is credited to the UCI Machine Learning
Repository.
The dataset has a total of 9 attributes, of which 8 are independent
variables and one is the dependent (target) variable, which
indicates whether the patient has diabetes or not.
Attribute Details:
Pregnancies (Number of times pregnant)
Glucose level
Blood Pressure
Skin Thickness
Insulin
BMI(Body Mass Index)
Diabetes Pedigree Function (It provides information about
diabetes history in relatives and genetic relationship of those
relatives with patients.)
Age
Outcome (0 means Non-diabetic and 1 means Diabetic)
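As a rough sketch of how the 8 independent variables and the target can be separated, the snippet below parses CSV text into a feature list and a label list. The column spellings follow the header commonly used when this dataset is distributed as a CSV file, which is an assumption here, and the two rows are illustrative values only. The helper name `load_features_and_target` is hypothetical.

```python
import csv
import io

# Assumed column names; the last one is the target variable.
COLUMNS = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]

# Two illustrative rows in the assumed layout.
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
"""

def load_features_and_target(text):
    """Return (features, target) parsed from CSV text."""
    features, target = [], []
    for record in csv.DictReader(io.StringIO(text)):
        features.append([float(record[name]) for name in COLUMNS[:-1]])
        target.append(int(record["Outcome"]))
    return features, target

X, y = load_features_and_target(SAMPLE_CSV)
print(len(X[0]), y)
```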
PREPROCESSING DATASET
Algorithm:
The model classifies the data in binary form, i.e. only as 0 or 1, which
indicates whether a patient is negative or positive for diabetes.
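The text above does not name the algorithm. Assuming it refers to Logistic Regression, one of the two methods used in this project, the binary decision can be sketched as a sigmoid followed by a threshold. The function names and the 0.5 threshold are illustrative assumptions, and the scores are made-up inputs rather than outputs of a trained model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(score, threshold=0.5):
    """Return 1 (diabetic) if the probability reaches the threshold, else 0."""
    return 1 if sigmoid(score) >= threshold else 0

# Illustrative scores only.
labels = [predict_label(s) for s in (-2.0, 0.0, 3.0)]
print(labels)
```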
CONFUSION MATRIX
The confusion matrix is a technique used for summarizing the
performance of a classification algorithm, here one with binary outputs.
For this Diabetes Prediction, "positive" means the patient has the disease:
Cases in which the doctor (in our case, the model) predicted they have
the disease, and they have the disease, will be termed as TRUE
POSITIVES (TP). The doctor has correctly predicted that the patient
has the disease.
Cases in which the doctor predicted they don’t have the disease, and
they don’t have the disease, will be termed as TRUE NEGATIVES (TN).
The doctor has correctly predicted that the patient doesn’t have the
disease.
Cases in which the doctor predicted they have the disease, but they
don’t have the disease, will be termed as FALSE POSITIVES (FP).
Also known as “Type I error”.
Cases in which the doctor predicted they don’t have the disease, but
they have the disease, will be termed as FALSE NEGATIVES
(FN). Also known as “Type II error”.
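The four counts can be computed directly from the actual and predicted labels. The sketch below uses the usual convention that "positive" means diabetic (Outcome = 1); the helper name `confusion_counts` and the two small label lists are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary classifier."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted disease, has disease
        elif p != positive and a != positive:
            tn += 1   # predicted no disease, has no disease
        elif p == positive:
            fp += 1   # predicted disease, has no disease (Type I error)
        else:
            fn += 1   # predicted no disease, has disease (Type II error)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Illustrative labels only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```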
ROC CURVE
A Receiver Operating Characteristic Curve (ROC curve) is a graphical
plot that illustrates the diagnostic ability of a binary classifier system as
its discrimination threshold is varied. The ROC curve is created by
plotting the true positive rate against the false positive rate at various
threshold settings.
TPR = TP / (TP + FN)
SPECIFICITY = TN / (TN + FP)
FPR = 1 - SPECIFICITY
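To make the formulas concrete, the sketch below computes one (FPR, TPR) point per threshold, which is how the points of a ROC curve are obtained. The function name `roc_points`, the scores and the thresholds are illustrative assumptions, and the code assumes both classes are present so that neither denominator is zero.

```python
def roc_points(actual, scores, thresholds):
    """Return a list of (FPR, TPR) pairs, one per threshold."""
    points = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for a, s in zip(actual, scores):
            predicted = 1 if s >= t else 0
            if predicted == 1 and a == 1:
                tp += 1
            elif predicted == 1 and a == 0:
                fp += 1
            elif predicted == 0 and a == 0:
                tn += 1
            else:
                fn += 1
        tpr = tp / (tp + fn)
        specificity = tn / (tn + fp)
        points.append((1 - specificity, tpr))
    return points

# Illustrative labels and predicted probabilities only.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
points = roc_points(actual, scores, thresholds=[1.0, 0.5, 0.0])
print(points)
```

Lowering the threshold moves the point from (0, 0) towards (1, 1); a better classifier stays closer to the top-left corner along the way.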
OUTPUT COMPARISON
Method Name            Accuracy Rate (%)    Misclassification Rate (%)
K-NN                   77.27                22.73
Logistic Regression    75.32                24.68
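The two rates in each row sum to 100, and the reported accuracies are consistent with 119 and 116 correct predictions out of 154 test samples. The test-set size of 154 (20% of 768, rounded up) and the counts are inferred from the percentages, not stated in the report, so the sketch below is only an arithmetic check; the helper name `rates` is hypothetical.

```python
def rates(correct, total):
    """Return (accuracy %, error %) for a number of correct predictions."""
    accuracy = 100.0 * correct / total
    return accuracy, 100.0 - accuracy

# Assumption: 154 test samples, i.e. 20% of 768 rounded up.
TEST_SIZE = 154

# Assumed counts of correct predictions, inferred from the percentages.
knn_accuracy, knn_error = rates(119, TEST_SIZE)
logreg_accuracy, logreg_error = rates(116, TEST_SIZE)

print(f"K-NN: {knn_accuracy:.2f}% correct, {knn_error:.2f}% wrong")
print(f"Logistic Regression: {logreg_accuracy:.2f}% correct, {logreg_error:.2f}% wrong")
```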
FUTURE SCOPE
Implement SVM and Random Forest classification to try to improve the
accuracy rate.
Implement a GUI as a front end.
CONCLUSION
The main aim of this project was to design and implement diabetes
prediction using machine learning methods and to analyse the performance
of those methods, and it has been achieved successfully. The proposed
approach uses two classification methods, K-NN and Logistic Regression.
The experimental results can assist health care providers in making early
predictions and early decisions to treat diabetes and save human lives.
REFERENCES
https://www.javatpoint.com/supervised-machine-learning
www.youtube.com
www.kaggle.com
www.ijert.org