Disease Prediction Using Machine Learning
https://doi.org/10.22214/ijraset.2022.43966
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue VI June 2022- Available at www.ijraset.com
Abstract: Information technology has substantially changed the health sector. The goal of this research is to create a
diagnosis model for a variety of diseases based on their symptoms. To create such a model, the system uses data mining
techniques such as classification. The intelligent agent is trained on datasets containing extensive data about patient
diseases that have been gathered, refined, categorised, and utilised. K-Fold cross-validation is used to evaluate the machine
learning models after splitting the data. The models cross-validated are the Support Vector Classifier, Gaussian Naive Bayes
Classifier, and Random Forest Classifier. The patient might then contact the doctor for further therapy based on the results.
It is an example of how technology and medical expertise can be woven together with the goal of achieving "prediction is
better than cure."
Keywords: Gaussian Naive Bayes classifier, K-fold cross-validation, Random forest classifier, Support vector classifier, medical data,
classification, data mining.
I. INTRODUCTION
The internet has become the first place many people turn to when looking for solutions to their problems, whatever their kind.
People generally have much easier access to the internet than to hospitals and doctors. Before going to the doctor, people often
google their symptoms and try to work out the diagnosis themselves. When people do not have time to visit a doctor, they tend to
self-diagnose, which can be dangerous. The proposed system provides a better and more effective alternative to randomly googling
symptoms and risking harm: the user registers on a network, picks out their symptoms, and receives a prognosis along with the
details of doctors specialised in that field whom they can contact. This Disease Prediction system is a web-based application that
predicts the most probable disease of the user from the given symptoms, using data sets collected from different health-related
sites. It often happens that someone near and dear to you needs a doctor's help immediately, but the doctor is not available for
consultation because of prior commitments or other reasons. That is when this automated program comes into play. The Disease
Prediction system can be used for urgent guidance on an illness according to the details and symptoms that the user feeds to the
web-based application. Intelligent data processing techniques are used to find the disease that most closely matches the patient's
details. Based on the results, the patient can then contact the relevant specialist for further treatment. The system can be used for
a free consultation regarding any illness, and it cuts the cost of first visiting a general physician. A patient registered on the
network can get their prognosis and a direct consultation with a doctor specialised in that particular field.
II. LITERATURE SURVEY
1) Ba-Alwi and Hintaya [1] presented a comparative analysis of data mining algorithms for hepatitis disease diagnosis: Naive
Bayes, Naive Bayes Updateable, FT Tree, K Star, J48, LMT, and a neural network (NN). The hepatitis data set was taken from the
UCI Machine Learning Repository, and classification results were measured in terms of accuracy and time. The comparison was
carried out using neural connections and the WEKA data mining tool. Results obtained with neural connections were lower than
those of the algorithms run in WEKA. A second technique used in this analysis is rough set theory, applied through WEKA; the
rough set procedure performed better than NN, especially for medical data analysis. Naive Bayes reached an accuracy of 96.52%
in 0 sec, and Naive Bayes Updateable reached 84% in 0 sec. FT Tree reached 87.10% in 0.2 sec, K Star 83.47% in 0 sec, J48 83%
in 0.03 sec, and LMT 83.6% in 0.6 sec. The neural network reached 70.41%. Naive Bayes was therefore the best classification
algorithm used with the rough set technique, as it offered high accuracy in the least time.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 1096
2) Fatima Ibrahim, Mohd Nasir Taib, Wan Abu Bakar Wan Abas, Chan Chong Guan, and Sadiah Sulaiman [2] proposed a system in
which an artificial neural network is used to forecast the day of defervescence of fever in dengue patients. For detection, the
approach relies solely on clinical signs and symptoms. The data were gathered from 252 hospitalized patients, of whom 4 had
DF (dengue fever) and 248 had DHF (dengue haemorrhagic fever). MATLAB's neural network toolbox was used, and a multi-layer
feed-forward neural network (MFNN) was applied. The MFNN predicted the day of defervescence of fever in DF and DHF with
90% accuracy.
3) Dr. Kanak Saxena et al. [11] developed a data mining model to predict heart disease efficiently. It primarily assists medical
practitioners in making effective decisions based on the parameters provided. The authors used the Cleveland dataset from UCI,
with attributes such as age, sex, resting blood pressure, chest pain, serum cholesterol, and fasting blood sugar. They separated
the dataset into two parts, one for training and the other for testing, and used a 10-fold method to measure accuracy.
4) Deepika et al. [12] proposed predictive analytics to prevent and control chronic disease with the help of machine learning
techniques such as naive Bayes, support vector machine, decision tree, and artificial neural network, using UCI Machine
Learning Repository datasets to calculate accuracy. Among these, the support vector machine gave the best accuracy, 95.55%.
5) Ashir Javeed et al. [4] developed a model to improve the prediction of heart disease by overcoming the problem of over-fitting.
As they describe the problem, earlier models were overly accurate on testing data but predicted heart disease inaccurately on
training data. Their model consists of two algorithms: a random search algorithm (RSA) and a random forest algorithm that is
used for prediction. This model gave them better results on both training and test data.
III. METHODOLOGY
The proposed system predicts diseases automatically using a model trained on a medical dataset. It also displays the confidence
score of each prediction. After predicting the likely ailment, the system recommends doctors who specialise in that disease,
allowing the patient to consult them online. The proposed system functions as a decision support system and will assist health
practitioners in making diagnoses.
Fig 1. Flow diagram
When the user visits the application, they are given two choices:
A. To register as a patient
B. To register as a doctor on the network
c) When they select the disease option to get a prognosis, they are redirected to a page where they can add the symptoms they
are experiencing from a drop-down list of 132 symptoms. The symptoms they choose are added to the system's list, after which
they can click on predict to get their prognosis.
d) Based on the symptoms provided, the system returns a prognosis along with a confidence score. The confidence score
indicates, as a percentage, how sure the model is about the prognosis.
e) The application also provides a link that helps the user get a better understanding of the predicted disease.
f) Most importantly, along with the prognosis, the system gives the user the opportunity to connect with doctors registered on
the network who specialise in that particular field, and shows their contact details.
g) Patients can access a list of doctors who specialise in their condition, see their ratings, and chat with them online.
2) As a doctor
a) When the doctor logs in to the application, they are directed to their profile page, where they can view their consultation
history and give feedback.
b) The consultation history consists of all the consultations the doctor has given on the network, whether active or closed.
c) When there is a consultation request from a patient, its status is shown as active. The doctor can consult the patient on the
network, can view the patient's profile, and can mark the consultation as closed once it is complete.
A. Data collection
The symptoms of the diseases were gathered from the internet so that identification is more accurate, i.e. no dummy values are
entered. The dataset was collected from Kaggle. The CSV file contains 5000 rows and 133 columns: 132 columns for the unique
symptoms and a last column for the disease class (40 unique disease classes).
Fig 3. Some rows of diseases with their corresponding symptoms in the dataset.
B. Cleaning the Data
The most crucial phase in a machine learning project is cleaning. The quality of a machine learning model depends on the quality
of its data, so the data must be cleaned before being fed to the model for training. All of the columns in the dataset are
numerical except for the target column, prognosis, which is textual and is encoded to numerical form using a label encoder.
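The label-encoding step can be illustrated with a minimal sketch. The paper does not name the library it used, so this version is written in plain Python; the function name and the sample disease names are illustrative assumptions, not taken from the dataset. It maps each distinct class name to an integer in sorted order, which is also how scikit-learn's `LabelEncoder` orders its classes.

```python
def fit_label_encoder(labels):
    """Map each distinct class name to an integer, in sorted order."""
    classes = sorted(set(labels))
    to_index = {name: i for i, name in enumerate(classes)}
    return classes, to_index

# Hypothetical values for the textual 'prognosis' column.
prognosis = ["Malaria", "Dengue", "Malaria", "Common Cold"]

classes, to_index = fit_label_encoder(prognosis)
encoded = [to_index[name] for name in prognosis]   # numeric target for training
decoded = [classes[i] for i in encoded]            # back to disease names
```

Keeping the `classes` list is what allows a numeric model output to be turned back into a disease name at prediction time.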
C. Model Building
After the data has been gathered and cleaned, it is used to train the models. The Support Vector Classifier, Gaussian Naive Bayes
Classifier, and Random Forest Classifier were all trained on the cleaned data. A confusion matrix was also plotted at the end to
assess the quality of the models. After training, the predictions of all three models are combined to predict the disease for the
input symptoms. This makes the overall prediction more robust and more accurate.
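As a rough illustration of how a confusion matrix summarises model quality, the sketch below builds one in plain Python. The function names, labels, and predictions are made up for the example and are not results from the paper. Rows correspond to true classes and columns to predicted classes, so correct predictions fall on the diagonal.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Illustrative labels and predictions only.
labels = ["Dengue", "Malaria", "Typhoid"]
y_true = ["Dengue", "Dengue", "Malaria", "Typhoid", "Typhoid", "Malaria"]
y_pred = ["Dengue", "Malaria", "Malaria", "Typhoid", "Typhoid", "Malaria"]

matrix = confusion_matrix(y_true, y_pred, labels)
```

In this toy case one Dengue sample is misclassified as Malaria, which shows up as the single off-diagonal count.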
D. Dataset splitting
When training a machine learning model, the dataset is separated into two parts:
a) The training dataset
b) The testing dataset.
The data is divided in an 80:20 ratio, which means that 80% of the data is used to train the model and 20% is used to
evaluate the model's performance.
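The 80:20 split can be sketched as follows. This is a plain-Python illustration under stated assumptions: the helper name, the fixed seed, and the choice to shuffle before splitting are not specified in the paper.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out test_fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded for reproducibility
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]  # (train, test)

# 5000 rows, as in the dataset described above.
train_idx, test_idx = train_test_split_indices(5000)
```

With 5000 rows this gives 4000 training rows and 1000 test rows, and no row appears in both sets.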
K-Fold cross-validation is used to evaluate the machine learning models after splitting the data. The models cross-validated are
the Support Vector Classifier, Gaussian Naive Bayes Classifier, and Random Forest Classifier.
1) K-Fold Cross-Validation: K-Fold cross-validation is a cross-validation technique in which the entire dataset is divided into k
subsets, also known as folds. The model is trained on k-1 subsets while the remaining subset is used to evaluate model
performance, and this is repeated k times so that each fold serves once as the evaluation set.
2) Support Vector Classifier: When given labelled training data, the Support Vector Classifier algorithm seeks to discover an ideal
hyperplane that accurately splits the samples into different categories in hyperspace.
3) Gaussian Naive Bayes Classifier: It is a probabilistic machine learning algorithm that internally uses Bayes Theorem to classify
the data points.
4) Random Forest Classifier: Random forest, like its name implies, consists of a large number of individual decision trees that
operate as an ensemble. Each individual tree in the random forest produces a prediction, and the class with the most votes is the
model's prediction.
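The evaluation procedure described in the list above can be sketched in plain Python. This is a minimal illustration, not the authors' code: it implements K-Fold splitting and a small Gaussian Naive Bayes from scratch, and the toy symptom vectors, the class names "A" and "B", and the `var_smoothing` constant are assumptions made for the example. In practice a library such as scikit-learn would normally supply all three classifiers and the cross-validation loop.

```python
import math

def k_fold_splits(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, val
        start += size

def fit_gaussian_nb(X, y, var_smoothing=1e-3):
    """Estimate log prior, feature means, and feature variances per class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Smoothing avoids zero variance for features constant within a class.
        variances = [sum((v - m) ** 2 for v in col) / n + var_smoothing
                     for col, m in zip(zip(*rows), means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    return max(model, key=lambda label: log_posterior(model[label]))

def cross_val_accuracy(X, y, k=5):
    """Train on k-1 folds, score on the held-out fold, repeat k times."""
    scores = []
    for train, val in k_fold_splits(len(X), k):
        model = fit_gaussian_nb([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict_gaussian_nb(model, X[i]) == y[i] for i in val)
        scores.append(hits / len(val))
    return scores

# Toy data: two hypothetical diseases with non-overlapping symptom patterns.
X = [[1, 1, 0, 0], [0, 0, 1, 1]] * 10
y = ["A", "B"] * 10
scores = cross_val_accuracy(X, y, k=5)
```

The folds here are taken in order without shuffling, which works only because the toy data alternates classes; real data would usually be shuffled or stratified first.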
To build a more robust model, the predictions of all three models are combined, and the final prediction takes all three into
account. This approach helps keep the predictions accurate on completely unseen data.
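The paper does not state exactly how the three predictions are merged. One common choice, assumed here purely for illustration, is a per-sample majority vote. The function name and the sample predictions below are hypothetical.

```python
from collections import Counter

def combine_predictions(svm_pred, nb_pred, rf_pred):
    """Majority vote per sample across three models.

    If all three models disagree, the first model's vote is kept, because
    Counter.most_common orders tied elements by first occurrence.
    """
    combined = []
    for votes in zip(svm_pred, nb_pred, rf_pred):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Illustrative per-model predictions for three patients.
svm_pred = ["Flu", "Malaria", "Dengue"]
nb_pred = ["Flu", "Dengue", "Malaria"]
rf_pred = ["Cold", "Dengue", "Flu"]

final = combine_predictions(svm_pred, nb_pred, rf_pred)
```

A vote of this kind can only correct a model's mistake when the other two models agree with each other, which is why combining models that make different kinds of errors tends to help.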
After training all three models on the training data, the quality of each model was checked using a confusion matrix, and the
predictions of the three models were then combined.
The combined model was then evaluated on the test data, where it classified all the data points correctly.
Finally, a function was created that takes symptoms separated by commas as input and outputs the disease predicted by the
combined model for those symptoms.
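A function of this kind might look like the sketch below. Everything named here is an assumption for illustration: the five-symptom list stands in for the 132 dataset columns, the stub models stand in for the trained classifiers, and majority voting is assumed as the combination rule.

```python
from collections import Counter

# Hypothetical subset of the symptom columns; the real dataset has 132.
SYMPTOMS = ["itching", "skin_rash", "high_fever", "headache", "nausea"]
SYMPTOM_INDEX = {name: i for i, name in enumerate(SYMPTOMS)}

def encode_symptoms(symptom_text):
    """Turn 'a, b, c' into a 0/1 vector; unknown names raise ValueError."""
    vector = [0] * len(SYMPTOMS)
    for raw in symptom_text.split(","):
        name = raw.strip().lower().replace(" ", "_")
        if not name:
            continue  # tolerate stray commas
        if name not in SYMPTOM_INDEX:
            raise ValueError(f"unknown symptom: {raw.strip()!r}")
        vector[SYMPTOM_INDEX[name]] = 1
    return vector

def predict_disease(symptom_text, models):
    """models: callables mapping a symptom vector to a disease name."""
    vector = encode_symptoms(symptom_text)
    votes = [model(vector) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Stub models standing in for the trained SVC, Naive Bayes, and random forest.
stub_models = [
    lambda v: "Dengue" if v[2] else "Allergy",
    lambda v: "Dengue" if v[2] and v[3] else "Allergy",
    lambda v: "Allergy" if v[0] else "Dengue",
]

result = predict_disease("High fever, Headache", stub_models)
```

Rejecting unknown symptom names, rather than silently ignoring them, keeps a typo from quietly changing the prediction.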
V. RESULTS
Once logged in, the patient can access disease prediction, which offers a one-click route to a prediction.
The patient is given a list of symptoms to select from as needed, and the selected symptoms are passed to the trained model,
which returns a prediction.
The result is shown to the patient together with a confidence score. Because symptoms can be common to several diseases, a
single symptom does not map to a single disease; the confidence score therefore indicates how likely the predicted disease is
given the selected symptoms.
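The paper does not specify how the confidence score is computed. One plausible approach, shown here only as an assumption, is to average the class probabilities reported by each model and present the top value as a percentage. The function name and the probability values are illustrative.

```python
def confidence_scores(per_model_probs):
    """Average class probabilities across models, highest first.

    per_model_probs: one dict per model, mapping disease name to probability.
    """
    diseases = per_model_probs[0].keys()
    n_models = len(per_model_probs)
    averaged = {d: sum(p[d] for p in per_model_probs) / n_models
                for d in diseases}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)

# Hypothetical probability outputs of three models for one patient.
per_model_probs = [
    {"Dengue": 0.5, "Malaria": 0.5},
    {"Dengue": 0.75, "Malaria": 0.25},
    {"Dengue": 1.0, "Malaria": 0.0},
]

ranking = confidence_scores(per_model_probs)
top_disease, top_score = ranking[0]
confidence_percent = round(top_score * 100)
```

Showing the full ranking rather than only the top entry would let the user see when two diseases with shared symptoms are nearly tied.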
VI. FUTURE SCOPE
A. A prime account option available for the patients.
B. Video calling feature.
C. The website's account linking feature allows users to connect their account with other online services like Gmail and social
media.
D. Addition of a map feature to the website, for example by integrating a mapping API.
E. Partner with a pharmacy and provide discounts on the medicine for the patients.
VII. CONCLUSION
We proposed a system that predicts disease based on previous cases in medical history and connects patients registered on the
network with doctors in the relevant specialised field, sparing patients the trouble of first visiting a general physician.
A disease prediction web application based on machine learning algorithms was successfully built. Support Vector Classifier,
Naive Bayes Classifier, and Random Forest Classifier models were trained and then combined to create a more accurate and
effective system for classifying patient data. Such a system is needed because medical data is growing at an exponential rate,
and existing data must be processed in order to predict disease from symptoms. Given patient symptoms as input, the system
produced an accurate general disease prediction, along with an indication of the level of disease risk.
REFERENCES
[1] Ba-Alwi, F.M. and Hintaya, H.M. (2013) Comparative Study for Analysis the Prognostic in Hepatitis Data: Data Mining Approach. International Journal of
Scientific & Engineering Research
[2] Fatima Ibrahim, Mohd Nasir Taib, Wan Abu Bakar Wan Abas, Chan Chong Guan, Sadiah Sulaiman (2005) A Novel Dengue Fever (DF) and Dengue
Haemorrhagic Fever (DHF) Analysis Using Artificial Neural Network (ANN). Computer Methods and Programs in Biomedicine.
[3] A. Ansari and N. K. Gupta, “Automated diagnosis of coronary heart disease using neuron-fuzzy integrated system,” in 2011 World Congress on Information
and Communication Technologies. IEEE, 2011.
[4] A. Javeed, S. Zhou, L. Yongjian, I. Qasim, A. Noor, and R. Nour, “An Intelligent Learning System Based on Random Search Algorithm and Optimized Random
Forest Model for Improved Heart Disease Detection,” IEEE Access, vol. 7, pp. 20313–20324, 2020.
[5] M. Gjoreski, A. Gradisek, B. Budna, M. Gams, and G. Poglajen, “Machine Learning and End-to-End Deep Learning for the Detection of Chronic Heart Failure
from Heart Sounds,” IEEE Access, vol. 8, pp. 20313–20324, 2020.
[6] L. Ali, A. Rahman, A. Khan, M. Zhou, A. Javeed, and J. A. Khan, “An Automated Diagnostic System for Heart Disease Prediction Based on χ2 Statistical
Model and Optimally Configured Deep Neural Network,” IEEE Access, vol. 7, pp. 34938–34945, 2019.
[7] M. R. Ahmed, S. M. Hasan Mahmud, M. A. Hossin, H. Jahan and S. R. Haider Noori, “A cloud based architecture for early detection of heart disease with
machine learning algorithms,” 2018 IEEE 4th International Conference on Computational Creativity. ICCC 2018, pp. 1951–1955, 2018
[8] A. K. M Sazzadur Rahman, M. Mehedi Hasan, S. Asaduzzaman, M. Asaduzzaman, and S. Akhter Hossain, “An analysis of computational intelligence
techniques for diabetes prediction,” Int. J. Eng. & Technology, vol. 7, no. 4, pp. 6229–6232, 2018.
[9] G. H. Tang, A. B. M. Rabie, and U. Hägg, “Indian hedgehog: A Mechanotransduction Mediator in Condylar Cartilage,” J. Dent. Res., vol. 83, no. 5, pp. 434–
438, 2004
[10] Y. Karaca and C. Cattani, “7. Naive Bayesian classifier,” Computer Methods Data Analysis
[11] Purushottam, K. Saxena, and R. Sharma, “Efficient Heart Disease Prediction System,” Procedia Computer Science, vol. 85, pp. 962–969, 2016
[12] K. Deepika and S. Seema, “Predictive analytics to prevent and control chronic diseases,” Proc. 2016 2nd Int. Conf. Appl. Theor. Comput. Commun. Technology
iCATccT 2016, no. January 2016, pp. 381–386, 2017.
[13] “Analysis and Prediction of Various Heart Diseases Using DNFS Techniques,” vol. 2, no. 1, pp. 1–7, 2015.