Tarp Final
Machine Learning
Team Details:
Prof. RAMANI S
INDEX
1 Project Title
2 Abstract
3 Introduction
5 Survey Table
6 Proposed Work
8 HTML Snippets
11 References
PROJECT TITLE - Web Portal for Multiple Disease Prediction Using Machine Learning
ABSTRACT
In medical imaging, Computer Aided Diagnosis (CAD) is a rapidly growing, dynamic area of
research. In recent years, significant attempts have been made to improve computer-aided
diagnosis applications, because errors in medical diagnostic systems can result in
seriously misleading medical treatments. Machine learning is important in Computer Aided
Diagnosis. Objects such as organs cannot be described accurately by a simple equation, so
pattern recognition fundamentally involves learning from examples. In the biomedical
field, pattern recognition and machine learning promise improved accuracy in the
perception and diagnosis of disease. They also promote the objectivity of
decision-making. For the analysis of high-dimensional and multimodal biomedical data,
machine learning offers a worthy approach for building sophisticated, automatic algorithms.
In this project, we have built a web portal that assists physicians in predicting diseases
just by uploading some information about the patient.
INTRODUCTION
Diseases are extremely dangerous to the human body. They can be transmitted by a variety of
viruses or caused by chemical reactions in our body. Among the variety of life-threatening
illnesses, diseases with similar symptoms have received much attention in medical research.
Diagnosing diseases with similar symptoms is a challenging task, and automatic predictions
about a patient's illness can help make further treatment effective. The
diagnosis of such diseases is usually based on the symptoms, signs and images of the patient.
A major challenge facing healthcare organisations, such as hospitals and medical centres, is
the lack of affordable resources and limited time. Healthcare is a major industry that
provides value-based care to millions of people while also producing significant revenue for
many countries. Excellence, Value, and Outcome are three buzzwords that still surround
healthcare and promise a lot, and today, healthcare experts and stakeholders all over the
world are searching for new ways to deliver on that promise. Technology is now helping
healthcare specialists to build alternative staffing models, capitalise on IP, provide smart
healthcare, and reduce administrative and supply costs, in addition to playing a vital role in
patient care, billing, and medical records.
During the COVID pandemic, medical staff have concentrated more on COVID patients,
reducing the workforce available in other medical departments. The objective is to assist the
physician/doctor by using classification. Heart disease, breast cancer, diabetes, migraine,
jaundice, chickenpox, and other diseases and health issues have a major impact on one's
health and can also lead to death if overlooked. The healthcare industry can make better
decisions by ‘mining’ its massive databases, that is, extracting hidden patterns and
relations in the data and then using various machine learning algorithms for the prediction of
a disease.
Machine learning in healthcare is one such field that is progressively gaining momentum in
the industry. The importance of machine learning in the exploration and development of
these modalities, as well as their practical application in a medical environment, cannot be
overstated. Image segmentation, image registration, image fusion, image-guided therapy,
image annotation, and image database retrieval are all examples of machine learning in
medical imaging. Today, machine learning is in growing demand in the healthcare and medical
sectors. When such machine learning techniques are applied properly, valuable insight
can be gathered from vast databases, allowing medical physicians to make more informed
decisions and improve health care.
LITERATURE SURVEY
This research paper compares six supervised machine learning algorithms, including KNN,
Logistic Regression, Decision Tree, ensemble methods, and Support Vector Machine, along with
deep learning through an artificial neural network using Adam Gradient Descent as the
optimizer. The data used was the Wisconsin Breast Cancer dataset, consisting of 30 features
computed from fine needle aspiration of the breast mass. The highest performance, 98.9%, was
obtained with Adam Gradient Descent learning.
Advantages:
• The rectified linear unit (ReLU) was used; it did not cause any vanishing gradient issue
and allowed the model to learn faster and perform better.
• Considering the advantages of the Support Vector Machine, the developed model was found to
be superior.
Disadvantages:
• The Random Forest algorithm was found to be the worst for high-dimensional sparse data.
2] Analysis and Prediction of Diabetes Complication Disease using Data Mining Algorithm.
The most common diabetes complications in Indonesia are retinopathy, nephropathy and
neuropathy. This research paper built a prediction model for these complications. The
diabetes risk factors were divided into seven features: Age, Gender, BMI,
family history of diabetes, blood pressure, duration of diabetes and blood glucose
level. The data used in this research were analysed using a number of machine learning
algorithms, such as Naïve Bayes tree and C4.5 decision tree-based classification techniques
and k-means clustering techniques. There are three main phases in this research: data
attribute selection and data mining preprocessing; analysis of the data mining algorithms and
their evaluation criteria; and finally, generation of rules for the three major microvascular
diabetes complication diseases. The proposed model achieved its highest accuracy, 68%, on the
retinopathy prediction model.
Advantages:
• The proposed model could show the most powerful risk factors for diabetes complication
illnesses.
• It showed that, compared to the clustering technique, the classification technique gives
better information.
Disadvantages:
• The developed model does not yet provide an automatic pre-diagnosis system that helps
patients understand each risk factor of the complication disease.
• To improve the prediction model, more diabetes medical reports are required, especially
sample datasets from all regions in Indonesia.
• Models that could give better performance, such as logistic model tree and random forest,
were mentioned but not implemented in the research.
The purpose of this research work is to develop a compact guide on modelling options and
machine learning strategies, as well as hybrid systems, focused on predicting the dynamics of
blood glucose (BG) in type 1 diabetes. The applications considered are the artificial
pancreas, custom contour modelling, custom decision support systems and BG alarm events. The
literature was collected through various online databases, including Google Scholar,
ScienceDirect and many more. The research covers different types of machine learning
algorithms, such as artificial neural networks, support vector machines, Bayesian neural
networks, and decision trees. [6]
Advantages:
• The proposed model can obtain the best prediction of the test value. [6]
• The network topology was successfully used to model and predict the blood glucose level
of patients with type 1 diabetes. [6]
Disadvantages:
• Due to some complexity, the prediction of BG dynamics was not accurate. [6]
• The time lag between continuous glucose monitoring (CGM) readings and actual blood
glucose levels is not covered. [6]
This research paper makes use of several supervised ML algorithms for predicting clinical
events, with accuracy as the metric. The results obtained were compared using two statistical
platforms, R-Studio and RapidMiner. Machine learning algorithms such as decision tree,
random forest, support vector machine, neural networks and logistic regression were
implemented. The data used in this research were extracted from an open database of the
Framingham Heart Study, in which 57.1% of the observations corresponded to women
and 42.9% to men. The accuracy values for both sexes were calculated using the formula:
accuracy = (sensitivity * prevalence) + (specificity * (1 − prevalence))
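The formula above can be checked with a short sketch. The function names and the example counts below are illustrative assumptions, not values from the paper; the second function recomputes accuracy directly from confusion-matrix counts so the two results can be compared.

```python
def accuracy_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Accuracy as the prevalence-weighted mix of sensitivity and specificity."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


def accuracy_from_counts(tp: int, fn: int, tn: int, fp: int) -> float:
    """Accuracy computed directly from confusion-matrix counts, for cross-checking."""
    return (tp + tn) / (tp + fn + tn + fp)


# Made-up example: 200 patients, 40 of whom have the disease.
tp, fn, tn, fp = 30, 10, 140, 20
sensitivity = tp / (tp + fn)             # share of sick patients detected
specificity = tn / (tn + fp)             # share of healthy patients cleared
prevalence = (tp + fn) / (tp + fn + tn + fp)

print(accuracy_from_rates(sensitivity, specificity, prevalence))
print(accuracy_from_counts(tp, fn, tn, fp))
```

Both calls give the same value, which is why accuracy alone can look good on a dataset with low prevalence even when sensitivity is modest.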
Advantages:
• The procedures developed can enhance the diagnostic and prognostic capacity of more
traditional regression techniques.
• All the changes could be tracked in a reproducible file (script), which made the work more
replicable and easier to correct, and provided a sense of assurance.
Disadvantages:
• The data was presented as part of a machine learning competition, so neither its clinical
reliability nor the quality of the results can be guaranteed.
• The dataset has a limited number of observations, limiting the prediction capability of the
trained models.
• A much more precise model could be obtained by using more information about the risk
factors of patients.
5] Developing a dengue prediction model based on climate in Tawau, Malaysia
Malaysia had a 10-fold increase in dengue cases over the last decade (i.e., 2006 – 2017).
This paper studies the relationship between weather variables, including their lagged terms,
and dengue incidence in the district of Tawau. A time series model was proposed to predict
future outbreaks in Tawau. The correlation between the weather variables and dengue
incidence was studied using Spearman rank correlation.
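As a rough illustration of correlating a lagged weather variable with dengue incidence, the sketch below computes the Spearman rank correlation as the Pearson correlation of average ranks. The rainfall and case numbers, the function names, and the one-period lag are all made-up assumptions for the example; they are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the tie
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def lagged_spearman(weather, cases, lag):
    """Correlate weather in period t with dengue cases in period t + lag."""
    paired = len(weather) - lag
    if len(weather) != len(cases) or lag < 0 or paired < 2:
        raise ValueError("series must have equal length and leave at least two pairs")
    return spearman(weather[:paired], cases[lag:])


# Hypothetical series: cases follow rainfall one period later.
rainfall = [10, 50, 20, 80, 30, 60]
dengue_cases = [5, 12, 52, 22, 82, 32]
for lag in (0, 1):
    print(lag, round(lagged_spearman(rainfall, dengue_cases, lag), 3))
```

Scanning several lags this way shows which delay between weather and incidence gives the strongest monotonic relationship, which is the kind of lagged term a time series model can then include.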
Advantages:
• The developed model (i.e., the SARIMA model) with regression was more precise when
forecasting the timing of dengue incidence.
Disadvantages:
• The established model remains a work in progress, needing more varied and larger data.
• The data used suffered from under-reporting (1 case reported to 23 unreported) and did not
contain measures like sunshine hours and wind velocity.
6] An improved ensemble learning approach for the prediction of heart disease risk:
In this research, enhanced machine learning techniques (KNN, Logistic
Regression, support vector machine, linear discriminant analysis) were proposed to predict
heart disease risk effectively. The Cleveland and Framingham datasets were used and
partitioned into smaller subsets using a mean-based splitting approach. The partitioned
datasets were then modelled using classification and regression trees (CART). The
classification accuracies attained on the Cleveland and Framingham datasets were 93% and 91%
respectively.
Advantages:
• The accuracies obtained outperformed other machine learning methods and similar scholarly
works.
• The proposed model showed that heart disease risk can be predicted effectively.
• It can be of assistance in medical counselling.
7] Evaluation of Dengue Model Performances Developed Using Artificial Neural Network
and Random Forest Classifiers
The aim of this paper is to evaluate the performance of models used to predict the correct
category in a given data set. These models are developed using artificial neural network
(ANN) classifiers and random forest (RF) classifiers, respectively.
The accuracy obtained was low because of the small amount of data in this research model. The
results could not be compared with similar studies, since none had been done
before. The data used in the project come from the microbiology department. [7]
Advantages:
• The results obtained can be used for the development of a machine learning model which
can predict the clinical degree of dengue at a critical stage. [7]
Disadvantages:
• The ANN architecture can be further developed to achieve a higher accuracy rate. [7]
This paper presents a well-planned framework for finding the risk level of
patients based on certain parameters. A data mining algorithm was used for heart
disease prediction, and a good level of accuracy was obtained. The dataset was taken from
Cleveland. [8]
Advantages:
• The framework makes it easier for doctors to make their decisions. [8]
• The system can predict the coronary illness risk level with high accuracy. [8]
• Patients can get early results, as prediction works well even without retraining. [8]
9] F-test feature selection in Stacking ensemble model for breast cancer prediction
In this research paper, existing models combined with supervised machine learning algorithms
were used to develop a new model for the prediction of breast cancer. ML algorithms such as
SVM, KNN, Logistic Regression and Naïve Bayes were used. The dataset was taken from
Wisconsin. To achieve higher accuracy, the F-test and a variance threshold were considered.
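A minimal sketch of how a variance threshold and an F-test can be combined for feature selection is given below. It is an illustration under assumptions rather than the paper's pipeline: the function names and the tiny dataset are hypothetical, and the F-test is implemented as the one-way ANOVA F statistic of each feature across the class labels.

```python
from collections import defaultdict
from statistics import mean, pvariance


def variance_threshold(rows, threshold=0.0):
    """Return indices of columns whose population variance exceeds the threshold."""
    columns = list(zip(*rows))
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


def anova_f(feature, labels):
    """One-way ANOVA F statistic of a single feature across the class labels."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[label].append(value)
    n, k = len(feature), len(groups)
    if k < 2:
        raise ValueError("need at least two classes")
    grand = mean(feature)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf")  # classes are perfectly separated by this feature
    return (between / (k - 1)) / (within / (n - k))


def select_features(rows, labels, top_k, threshold=0.0):
    """Drop low-variance columns, then keep the top_k columns by F statistic."""
    kept = variance_threshold(rows, threshold)
    columns = list(zip(*rows))
    ranked = sorted(kept, key=lambda j: anova_f(columns[j], labels), reverse=True)
    return sorted(ranked[:top_k])


# Hypothetical data: column 0 is constant, column 1 separates the classes,
# column 2 is close to noise.
rows = [
    [1.0, 5.0, 0.30],
    [1.0, 6.0, 0.10],
    [1.0, 5.5, 0.40],
    [1.0, 9.0, 0.20],
    [1.0, 9.5, 0.35],
    [1.0, 10.0, 0.15],
]
labels = [0, 0, 0, 1, 1, 1]
print(variance_threshold(rows))          # columns that survive the variance filter
print(select_features(rows, labels, 1))  # best remaining column by F statistic
```

The selected columns would then be passed to the base classifiers of the stacking ensemble in place of the full feature set.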
Advantages:
• F-test feature selection provided better accuracy when stacking ensemble models. [9]
• It shows that learning occurs even at the stacked meta level, which is different from the
majority voting rule used in bagging and boosting. [9]
Disadvantages:
• Including a feature selection algorithm did not improve the performance of the
model. [9]
10] Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis
This research paper compares different machine learning algorithms. The algorithms used
were support vector machine, decision tree, Naïve Bayes and KNN. The dataset used was the
Wisconsin breast cancer dataset. The highest precision, with a very low error rate, was
obtained with the support vector machine. [10]
Advantages:
• The SVM model showed greater accuracy in breast cancer prediction than the other
models. [10]
• The developed algorithm showed the best performance in terms of precision
and low error rate. [10]
Disadvantages:
• The precision and efficiency of the other classifier models are lower than those of the
SVM classifier. [10]
SURVEY TABLE:
In practice, when the demand for tests is high, it is difficult for doctors and lab
assistants to check the results quickly enough, however fast they try. Communicating reports
to the patient also takes significant time, which can put the patient’s life in danger. Our
solution assists doctors and lab professionals in the prediction of multiple diseases by
speeding up the prediction to a matter of seconds using high-accuracy machine learning
algorithms. The tools and libraries used are:
• TensorFlow
• Keras
• Flask
• MongoDB
• SMTP
• HTML
• CSS
• PyMongo
• Pandas
• NumPy
Our website deals with diseases like dengue, breast cancer, diabetes & heart disease risk. We
will be implementing multiple machine learning algorithms to minimise false negatives
(predicting absence when the disease is actually present), which can be fatal for the patient.
There will be a sign-up and login feature for the lab assistant, who can then enter
the patient’s details, symptoms and medical logs. All the patient details will be
stored in the MongoDB database.
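The storage step could look roughly like the sketch below. The field names, the validation rules and the helper names are assumptions made for illustration, not the portal's actual schema. The only database call used is `insert_one`, which a PyMongo collection provides and which returns a result carrying `inserted_id`.

```python
from datetime import datetime, timezone


def build_patient_record(name, age, symptoms, entered_by):
    """Build the document stored for one patient (field names are hypothetical)."""
    if not name.strip():
        raise ValueError("patient name is required")
    if age < 0:
        raise ValueError("age cannot be negative")
    return {
        "name": name.strip(),
        "age": age,
        "symptoms": sorted(set(symptoms)),  # de-duplicated symptom list
        "entered_by": entered_by,           # login of the lab assistant
        "created_at": datetime.now(timezone.utc),
    }


def save_patient(collection, record):
    """Insert the record and return the id assigned by the database.

    `collection` is any object with an insert_one() method, for example a
    PyMongo collection such as MongoClient(uri)["portal"]["patients"]
    (the database and collection names here are assumptions).
    """
    return collection.insert_one(record).inserted_id
```

Keeping record construction separate from the insert makes the validation testable without a running MongoDB instance.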
Fig 1: Overall working of the web Portal
CNN FEATURES: A CNN has two main parts. The first is a convolution tool that separates and
identifies the various features of the image for analysis, in a process called Feature
Extraction. The feature extraction network consists of many pairs of convolutional and
pooling layers. The second is a fully connected layer that uses the output of the convolution
process and predicts the class of the image based on the features extracted in the previous
stages. Feature extraction aims to reduce the number of features present in a dataset: it
creates new features which summarise the existing features contained in the original set of
features. There are many CNN layers, as shown in the CNN architecture diagram.
POOLING LAYER: A Convolutional Layer is followed by a Pooling Layer. The primary aim
of this layer is to decrease the size of the convolved feature map to reduce the computational
costs. This is done by decreasing the connections between layers; the layer operates
independently on each feature map. Depending upon the method used, there are several types of
pooling operations, each of which summarises the features generated by a convolution layer.
In Max Pooling, the largest element is taken from each section of the feature map. Average
Pooling calculates the average of the elements in a predefined-size section of the image. Sum
Pooling computes the total sum of the elements in the predefined section. The Pooling Layer
usually serves as a bridge between the Convolutional Layer and the fully connected (FC)
Layer. Pooling generalises the features extracted by the convolution layer and helps the
network recognise the features independently. It also reduces the computations in the
network.
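The three pooling operations described above can be sketched in a few lines. This is a plain-Python illustration with non-overlapping windows; the function name `pool2d` and the example feature map are assumptions, and a real model would use the pooling layers of its deep learning framework instead.

```python
def pool2d(feature_map, size, mode="max"):
    """Apply non-overlapping size x size pooling to a 2-D feature map (list of rows)."""
    reducers = {
        "max": max,
        "sum": sum,
        "average": lambda window: sum(window) / len(window),
    }
    if mode not in reducers:
        raise ValueError(f"unknown pooling mode: {mode}")
    reduce_window = reducers[mode]
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, rows - size + 1, size):
        pooled_row = []
        for left in range(0, cols - size + 1, size):
            # Collect the values inside the current window, then summarise them.
            window = [feature_map[top + i][left + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(reduce_window(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 5, 7],
    [1, 1, 8, 4],
]
print(pool2d(feature_map, 2, "max"))      # [[6, 2], [2, 8]]
print(pool2d(feature_map, 2, "average"))  # [[3.5, 1.0], [1.0, 6.0]]
print(pool2d(feature_map, 2, "sum"))      # [[14, 4], [4, 24]]
```

In each case a 4 x 4 map shrinks to 2 x 2, so the layers that follow process a quarter of the values.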
PROPOSED MODEL IMPLEMENTATION
We have used the Deep Learning algorithm CNN to increase the accuracy of the dengue
diagnosis part of our system. The convolutional neural network (CNN, or ConvNet) is one of
the most important algorithms in deep learning and is widely used to analyse
image data. CNNs use a variation of multilayer perceptrons designed to require minimal
preprocessing. CNNs use relatively little preprocessing compared to other image
classification algorithms. This means that the network learns the filters that in traditional
algorithms were engineered by hand. Another reason why
CNNs do so much better than classic neural networks on images is that the convolutional
layers take advantage of inherent properties of images.
1. Convolutions
● Simple feedforward neural networks don’t see any order in their inputs. If you
shuffled the pixels of all your images in the same way, the neural network would have the
very same performance it had when trained on unshuffled images.
● CNNs, by contrast, take advantage of the local spatial coherence of images. This
means that they are able to dramatically reduce the number of operations needed to
process an image by using convolution on patches of adjacent pixels, because
adjacent pixels together are meaningful. We also call that local connectivity. Each
feature map is then filled with the result of the convolution of a small patch of pixels,
slid with a window over the whole image.
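The sliding-window idea can be sketched as follows. This is a plain-Python illustration with stride 1 and no padding; the function name, the example image and the edge-detecting kernel are assumptions chosen for the example. As is usual in CNN libraries, the kernel is applied without flipping.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for top in range(out_rows):
        row = []
        for left in range(out_cols):
            # Weighted sum of the patch of adjacent pixels under the kernel.
            row.append(sum(image[top + i][left + j] * kernel[i][j]
                           for i in range(k_rows) for j in range(k_cols)))
        feature_map.append(row)
    return feature_map


# Hypothetical 4 x 4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A 2 x 2 kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
print(convolve2d(image, vertical_edge))  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The same four kernel weights are reused at every position, which is where the reduction in operations and parameters compared with a fully connected layer comes from.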
2. Pooling layers
● There are also the pooling layers, which downscale the image. This is possible because we
retain, throughout the network, features that are organised spatially like an image, so
downscaling them makes sense in the same way as reducing the size of an image. With classic
inputs you cannot downscale a vector, as there is no coherence between an input and the one
next to it.
This independence from prior knowledge and human effort in feature design is of great
benefit. CNNs have applications in image and video recognition, recommender systems, image
classification, medical image analysis, and natural language processing. A CNN contains an
input layer and an output layer, as well as many hidden layers. The hidden layers usually
consist of convolutional layers, pooling layers, fully connected layers and normalisation
layers. The comparative results of the existing and proposed system are as follows:
HTML Snippets:
● Breast Cancer
● Heart Disease
F-Score 0.7457627117644068
● Diabetes
accuracy 0.74 (support 77)
We have successfully created a web portal to assist doctors and lab workers in faster and
more accurate interpretation of test results, particularly for heart disease, diabetes, dengue
and breast cancer. To be useful, a prediction model must provide accurate and validated
estimates of the risks to the individual and ultimately improve an individual’s outcome or the
cost-effectiveness of care. We obtained performance comparable to traditional machine
learning models with limited features. We also integrated a communication module that
immediately sends the probable test results to the patient or doctor concerned.
In future, we’ll try to extend our portal to more diseases with higher accuracy, particularly
with fewer false positive results. We’ll also add more features for doctors and organise
doctors by section for better tracking of patients.
REFERENCES
[1] Fiarni, C., Sipayung, E. M., & Maemunah, S. (2019). Analysis and prediction of diabetes
complication disease using data mining algorithm. Procedia Computer Science, 161,
449-457.
[2] Beunza, J. J., Puertas, E., García-Ovejero, E., Villalba, G., Condes, E., Koleva, G., ... &
Landecho, M. F. (2019). Comparison of machine learning algorithms for clinical event
prediction (risk of coronary heart disease). Journal of Biomedical Informatics, 97, 103257.
[3] Jayaraj, V. J., Avoi, R., Gopalakrishnan, N., Raja, D. B., & Umasa, Y. (2019). Developing
a dengue prediction model based on climate in Tawau, Malaysia. Acta Tropica, 197, 105055.
[4] Mienye, I. D., Sun, Y., & Wang, Z. (2020). An improved ensemble learning approach for
the prediction of heart disease risk. Informatics in Medicine Unlocked, 20, 100402.
[5] Gupta, P., & Garg, S. (2020). Breast cancer prediction using varying parameters of
machine learning models. Procedia Computer Science, 171, 593-601.
[6] Woldaregay, A. Z., Årsand, E., Walderhaug, S., Albers, D., Mamykina, L., Botsis, T., &
Hartvigsen, G. (2019). Data-driven modeling and prediction of blood glucose dynamics:
Machine learning applications in type 1 diabetes. Artificial Intelligence in Medicine, 98,
109-134.
[7] Silitonga, P., Dewi, B. E., Bustamam, A., & Al-Ash, H. S. (2021). Evaluation of Dengue
Model Performances Developed Using Artificial Neural Network and Random Forest
Classifiers. Procedia Computer Science, 179, 135-143.
[8] Saxena, K., & Sharma, R. (2016). Efficient heart disease prediction system. Procedia
Computer Science, 85, 962-969.
[9] Dhanya, R., Paul, I. R., Akula, S. S., Sivakumar, M., & Nair, J. J. (2020). F-test feature
selection in Stacking ensemble model for breast cancer prediction. Procedia Computer
Science, 171, 1561-1570.
[10] Asri, H., Mousannif, H., Al Moatassime, H., & Noel, T. (2016). Using machine learning
algorithms for breast cancer risk prediction and diagnosis. Procedia Computer Science, 83,
1064-1069.
[11] Khanam, J. J., & Foo, S. Y. (2021). A comparison of machine learning algorithms for
diabetes prediction. ICT Express.