BELAGAVI, KARNATAKA.
A PROJECT REPORT
ON
Submitted by
Name USN
Mythri S P 4JN21CS089
Nandita G Bhat 4JN21CS094
Priya B J 4JN21CS119
Rakshitha N V 4JN21CS126
December 2024
National Education Society ®
CERTIFICATE
This is to certify that the project entitled
Submitted by
Name USN
Mythri S P 4JN21CS089
Nandita G Bhat 4JN21CS094
Priya B J 4JN21CS119
Rakshitha N V 4JN21CS126
Students of 7th semester B.E. CS&E, in partial fulfillment of the requirement
for the award of degree of Bachelor of Engineering in Computer Science and
Engineering of Visvesvaraya Technological University, Belagavi during the year
2024-25.
Mr. Hiriyanna G S B.E., M. Tech., Dr. Jalesh Kumar B.E., M. Tech., Ph.D.
Asst. Professor, Dept. of CS&E Professor and Head, Dept. of CS&E
Signature of Principal
Principal, JNNCE
Examiners: 1. 2.
ABSTRACT
The prediction of multiple diseases using Machine Learning (ML) techniques has gained
significant attention because of its potential to support early diagnosis and reduce healthcare
costs. The rapid evolution of healthcare technology has created a demand for efficient and accurate
predictive models capable of identifying multiple diseases simultaneously. The project
presents an approach to predict the likelihood of multiple diseases based on extensive
patient data comprising demographic information, symptoms, clinical history and
lifestyle factors. Machine learning algorithms, namely Logistic Regression and Support
Vector Machines (SVM), are used to classify and
predict the risk of diseases such as diabetes, Parkinson’s and heart disease. The model's
performance is evaluated using standard metrics, such as accuracy and precision, across
multiple datasets to assess its robustness and generalizability. The results demonstrate the
feasibility of developing a unified framework for disease prediction, offering a scalable
solution that can aid healthcare providers in identifying high-risk patients and enabling
timely interventions.
ACKNOWLEDGEMENT
We would like to acknowledge our profound gratitude to all those who have helped
in implementing the project.
We would like to thank our beloved guide Mr. Hiriyanna G S, Assistant Professor,
Dept. of CS&E, who has helped us greatly in carrying out this project, for his continuous
encouragement and guidance throughout the project work.
We would like to thank our respected project co-ordinators Dr. Ravindra S, Assoc.
Professor, Mrs. Thaseen Bhashith, Asst. Professor, Mrs. Sreedevi S, Asst. Professor and
Mrs. Ayesha Siddiqa, Asst. Professor, who have extended their warm support with respect
to all aspects of the project.
We would like to thank Dr. Jalesh Kumar, HOD of CS&E Dept. and Dr. Y Vijaya
Kumar, Principal, JNNCE, Shimoga for all their support and encouragement.
Finally, we would also like to thank the entire teaching and non-teaching staff of
Computer Science and Engineering.
Project Associates,
Mythri S P 4JN21CS089
Nandita G Bhat 4JN21CS094
Priya B J 4JN21CS119
Rakshitha N V 4JN21CS126
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
CHAPTER 1 Introduction
1.1 Literature Survey
1.3 Objectives
2.7 Summary
3.2 Implementation
3.3 Summary
4.2 Summary
5.1 Conclusion
Publication
References
LIST OF FIGURES
SymptoSense: Multiple Disease Prediction using ML
CHAPTER 1
INTRODUCTION
In recent years, the integration of machine learning (ML) in healthcare has revolutionized the
way diseases are diagnosed and treated. Traditional healthcare models, which largely rely on
the expertise of clinicians to interpret symptoms, lab results, and medical histories, can be time-
consuming and susceptible to human error. As the volume of patient data grows exponentially
with the adoption of electronic health records (EHRs), there is an increasing need for more
efficient, accurate, and scalable methods to predict, diagnose, and manage diseases. Machine
learning, with its ability to analyse large datasets and detect complex patterns, offers a powerful
tool to address this need.
Machine learning in healthcare can be defined as the use of algorithms and statistical
models that allow computers to "learn" from patient data without explicit programming. The
rise of electronic health records (EHRs) has made large datasets of patient information widely
available, providing a rich source of data for training machine learning models. Models trained
on these datasets can identify patterns and correlations in medical data, such as patient
demographics, lab test results, and clinical histories, and use them to predict the likelihood of
multiple diseases developing in the future. Importantly, these models can be continually refined
and improved as more data is
collected, leading to better predictive accuracy over time. Predicting diseases at an early stage
can greatly enhance clinical decision-making by providing healthcare providers with actionable
insights, thus improving patient outcomes and reducing the overall healthcare burden. Early
intervention is crucial in managing diseases such as diabetes, Parkinson’s and heart disease,
where timely diagnosis can significantly reduce morbidity and mortality rates.
Multiple Disease Prediction using Machine Learning and Deep Learning with the
Implementation of Web Technology, Mostafizur Rahman et al. [1]. The study provides an
innovative approach to predictive
healthcare by integrating machine learning (ML), deep learning (DL), and web-based
platforms. This research aims to address the growing need for accurate, efficient, and scalable
diagnostic tools for multiple diseases such as diabetes, heart disease, and Parkinson’s.
The authors explore the limitations of traditional diagnostic methods, including their
reliance on extensive medical expertise and their often prohibitive costs. In contrast, ML and
DL techniques offer data-driven solutions capable of handling vast datasets to uncover patterns
and insights that might not be apparent to human clinicians. The study compares several ML
algorithms, including Support Vector Machines (SVM), Random Forests, and Gradient
Boosting Machines, alongside DL models such as Convolutional Neural Networks (CNNs) and
Recurrent Neural Networks (RNNs). These models are evaluated based on metrics like
accuracy, precision, recall, and F1-score to determine their suitability for disease prediction.
A key contribution of this research is the integration of these models into a web-based
platform, designed to make predictive tools accessible to end-users, including patients and
healthcare providers. The web implementation demonstrates real-time data input, processing,
and result generation, making it a practical solution for early disease detection and
management. The authors highlight how the user-friendly interface bridges the gap between
complex predictive algorithms and non-technical users, thereby promoting the democratization
of healthcare technologies.
Multiple Disease Predictions using Machine Learning and Deep Learning Algorithms, Anish
Fathima B et al. [2]. The authors highlight the growing importance of machine learning (ML)
and deep learning (DL) in transforming traditional healthcare diagnostics. By leveraging these
technologies, the study aims to improve accuracy and efficiency in diagnosing conditions such
as diabetes, heart disease, and Parkinson’s. The research compares the performance of various
ML algorithms like Decision Trees, Random Forests, and Gradient Boosting Machines, as well
as DL architectures such as Artificial Neural Networks (ANNs) and Convolutional Neural
Networks (CNNs). Their evaluation focuses on model performance metrics such as accuracy,
precision, recall, and F1-score, illustrating the relative strengths and weaknesses of each
approach.
The research findings demonstrate that deep learning methods, particularly CNNs, outperform
traditional ML models in capturing complex patterns within medical data, making them more
effective for diseases with intricate or subtle presentations.
The study also emphasizes practical implementation and future directions for multi-
disease prediction systems. The authors propose integrating these predictive models into
healthcare frameworks to facilitate early detection, personalized treatment plans, and improved
patient outcomes. They highlight the potential of using wearable devices and IoT-enabled
technologies for continuous data collection, which could feed into these predictive systems for
real-time analysis. By bridging the gap between advanced computational models and real-
world healthcare applications, this research lays a strong foundation for developing scalable,
accurate, and accessible diagnostic tools. The insights provided in this paper underline the
transformative potential of ML and DL in healthcare, paving the way for further innovation in
multi-disease predictive modelling.
Multiple Disease Prediction by Applying Machine Learning and Deep Learning Algorithms,
M. Kalpana Chowdary et al. [3]. The study categorizes and examines various ML algorithms,
such as Support Vector Machines (SVM), Decision Trees, Neural Networks, and Ensemble
Methods, focusing on their roles in handling healthcare datasets. The survey emphasizes the
ability of these algorithms to process large and complex data, offering insights into disease
diagnosis, prediction, and management.
A key highlight of the paper is its discussion on the challenges in healthcare data,
including data heterogeneity, imbalances, and noise, and how ML methods address these
issues. The study particularly notes the importance of feature selection and dimensionality
reduction in improving model accuracy and interpretability. Additionally, it explores how
advanced techniques like deep learning are enhancing predictive capabilities in areas such as
image analysis and personalized medicine.
The paper concludes by identifying gaps in the current use of ML in healthcare, such
as the need for more interpretable models and better handling of ethical concerns like data
privacy. It serves as a foundational resource for researchers aiming to explore or optimize ML
techniques in healthcare data analysis.
A survey on machine learning algorithms for healthcare data analysis, Kotsiantis, S. B. et al.
[4]. Machine learning (ML) has made significant strides in healthcare, revolutionizing the way
diseases are predicted and managed. The study emphasizes how these technologies can analyze
complex medical datasets to provide accurate and efficient predictions for conditions such as
diabetes, heart disease, and Parkinson’s disease. By comparing algorithms like Support Vector
Machines (SVM), Random Forests, and Gradient Boosting Machines with deep learning
architectures such as Artificial Neural Networks (ANNs), the authors demonstrate the
effectiveness of DL models in capturing intricate patterns in healthcare data.
The paper also explores the practical implications of multi-disease predictive systems,
suggesting their integration into healthcare systems for early diagnosis and better treatment
planning. The authors emphasize the importance of creating user-friendly tools that can be
utilized by healthcare providers to improve patient outcomes. By combining advanced
predictive models with real-world healthcare needs, this study contributes significantly to the
growing field of AI-driven diagnostics and sets a foundation for future research in multi-disease
prediction.
Multi-Disease Prediction Using Machine Learning, Sathya V et al. [5]. The authors aim to
address the increasing need for efficient and accurate diagnostic tools capable of handling the
complexities of multi-disease prediction. The study examines several ML algorithms, such as
Logistic Regression, Decision Trees, Random Forests, and Support Vector Machines (SVM),
assessing their ability to process large healthcare datasets. By evaluating these techniques based
on metrics like accuracy, precision, recall, and F1-score, the paper highlights the strengths and
limitations of each algorithm.
One of the central challenges discussed is the overlapping symptoms between different
diseases, which often complicates the prediction process. The study addresses these issues by
employing feature selection techniques to identify the most relevant attributes from datasets
and pre-processing methods to handle imbalanced data. Moreover, the authors emphasize the
importance of ensemble learning models, which combine multiple algorithms to improve
prediction accuracy and robustness. These models were found to outperform individual
algorithms in terms of both reliability and efficiency.
The paper "Diabetes Prediction Using Support Vector Machines" by N. Srividhya et al.
(October 2023) [6] employs Support Vector Machines (SVMs) to predict diabetes,
utilizing medical data features such as age, blood pressure, glucose levels, and BMI. SVMs are
known for their ability to handle both linear and non-linear classification tasks, making them
ideal for complex healthcare datasets. The authors highlight the key advantages of using SVM,
such as its robustness to high-dimensional data and its ability to construct an optimal
hyperplane that maximizes the margin between different classes. Diabetes prediction has been
a focus of many studies as early detection can significantly reduce complications. For instance,
studies like "A Hybrid Approach for Diabetes Prediction Using Machine Learning Techniques"
(Yao et al., 2021) combine multiple models to improve accuracy, while "Machine Learning
Algorithms for Diabetes Prediction" (Sharma et al., 2022) also demonstrated the effectiveness
of SVM, Random Forest, and Logistic Regression. Both emphasize the importance of
preprocessing techniques like data normalization and feature selection, which are also
discussed in Srividhya et al.'s paper.
Feature selection plays a crucial role, as irrelevant or redundant features can reduce the
model's accuracy. SVM’s ability to create complex decision boundaries is particularly valuable
for medical datasets, where the relationships between features and outcomes (like diabetes) can
be non-linear. "Diabetes Prediction with Support Vector Machine Classifier" (Gupta & Meena,
2021) also demonstrates SVM's application, comparing it with other classifiers and showing
SVM's superior performance in prediction tasks. Overall, Srividhya et al.'s paper contributes
to the growing body of work in healthcare AI by illustrating how SVM can be leveraged for
diabetes prediction, aligning with other studies that advocate for machine learning techniques
as powerful tools for early disease detection. Their research confirms that SVM, when
combined with proper data preprocessing and feature selection, can significantly enhance
predictive accuracy in healthcare applications.
The paper "Early Recognition of Parkinson’s Disease Through Acoustic Analysis and Machine
Learning" by Niloofar Fadavi and Nazanin Fadavi (2024) [7] focuses on the innovative use of
acoustic analysis and machine learning (ML) techniques to facilitate the early diagnosis of
Parkinson’s disease (PD). Parkinson’s disease is a neurodegenerative disorder that primarily
affects motor skills, and its early detection is crucial for timely intervention and management.
The study explores the potential of using voice-based biomarkers to identify early signs of PD,
which is challenging due to the gradual onset of symptoms and the lack of definitive early-
stage diagnostic tests.
The authors emphasize that speech and voice patterns are significantly altered in
individuals with Parkinson’s disease, often well before motor symptoms become apparent.
Features such as vocal tremor, pitch variations, and speech fluency are commonly impacted in
PD patients, and these alterations can be captured through acoustic analysis. The paper reviews
the use of various machine learning algorithms to classify these speech features, including
Support Vector Machines (SVM), Random Forests, and Neural Networks. By training these
models on voice recordings from individuals with and without PD, the study demonstrates the
potential of ML to differentiate between healthy individuals and those affected by the disease,
based on subtle changes in their speech patterns.
A key contribution of this research is its focus on feature extraction from speech signals,
which includes prosodic features (such as pitch, tone, and rhythm), temporal features (such as
pause duration and speech rate), and spectral features (such as formant frequencies). The
authors employ advanced signal processing techniques to extract these features and then apply
ML algorithms for classification. The study highlights the importance of using a large and
diverse dataset to ensure the robustness and generalization of the model across different patient
populations. The authors also explore the challenges of handling noisy and incomplete data,
which is common in real-world speech datasets.
The findings of the paper suggest that combining acoustic analysis with machine
learning models can lead to highly accurate early detection systems for Parkinson’s disease.
Such systems, once fully developed, could serve as screening tools for healthcare providers,
enabling them to identify patients at risk before the onset of severe symptoms. The paper also
discusses the potential for integrating these diagnostic tools into mobile applications or
telemedicine platforms, which would make early Parkinson’s detection accessible to a wider
population. Overall, this research underscores the promise of voice-based biomarkers and
machine learning in revolutionizing the early diagnosis and management of Parkinson’s
disease, offering a non-invasive, cost-effective alternative to traditional diagnostic methods.
The paper "Heart Disease Prediction Using Support Vector Machine" by Balakrishnan
Duraisamya et al. (2023) [8] explores the application of Support Vector Machines (SVM) for
predicting heart disease, focusing on its ability to handle complex healthcare datasets and
provide accurate predictions. Cardiovascular diseases, including heart disease, are among the
leading causes of mortality worldwide, making early detection and prediction crucial for
effective intervention and prevention. The study examines the effectiveness of SVM, a
supervised learning algorithm, in predicting the likelihood of heart disease based on various
medical features such as age, gender, cholesterol levels, blood pressure, and family history of
heart disease.
The paper outlines how SVM is used to classify patients into two categories: those at
risk of heart disease and those not at risk. The study emphasizes the role of data preprocessing
techniques, such as normalization and feature selection, to improve the performance of the
SVM model. Normalization ensures that input features with different scales do not bias the
model, while feature selection helps reduce the dimensionality of the data, focusing on the most
relevant attributes that influence heart disease prediction. The paper also discusses the
importance of using an optimal kernel function in SVM to improve classification accuracy,
with experiments showing the superior performance of the radial basis function (RBF) kernel
in comparison to linear kernels.
The results of the study indicate that the SVM model achieved high accuracy and
reliability in predicting heart disease, outperforming other machine learning algorithms like
Decision Trees and Logistic Regression. The paper further explores the challenges associated
with heart disease prediction, such as dealing with imbalanced datasets, where healthy
individuals may significantly outnumber those with the disease. To address this, the authors
employ techniques like oversampling and synthetic data generation to balance the dataset and
prevent bias toward the majority class. Additionally, the study highlights the potential of using
SVM in real-world clinical settings, where the model could assist healthcare professionals in
early screening and personalized treatment planning.
1.3 Objectives
The idea for this project arose from the increase in health risks. The objectives are:
To provide data privacy, so that the data of one user is not available to other users.
To estimate the performance of the learning algorithms, logistic regression and support
vector machine, by calculating their accuracy.
To use the support vector machine (SVM) and logistic regression models for predicting
multiple diseases.
To predict diseases from the given input; if the prediction is positive, a message
indicating the presence of the disease is displayed.
To predict the disease based on symptoms that are assigned weightage.
To deploy the project for users, supporting early disease detection and preventive
healthcare.
Chapter 1 introduces multiple disease prediction and covers the literature survey,
problem statement, objectives, and scope of the project. Chapter 2 presents the domain-specific
background of the project. Chapter 3 discusses the system design and implementation. Chapter 4
presents the results and snapshots of the project. Chapter 5 concludes the report with possible
future enhancements, followed by the references.
CHAPTER 2
MACHINE LEARNING ALGORITHMS
Machine Learning (ML) is a subset of artificial intelligence (AI) that enables computers to
learn from data and make decisions or predictions without being explicitly programmed. It
focuses on building algorithms that can analyze data, identify patterns, and improve their
performance over time through experience.
Machine learning is used in a wide variety of applications, including image and speech
recognition, natural language processing, and recommender systems.
An Error Function: An error function evaluates the prediction of the model. If there are known
examples, an error function can make a comparison to assess the accuracy of the model.
A Model Optimization Process: If the model can fit better to the data points in the training set,
then weights are adjusted to reduce the discrepancy between the known example and the model
estimate. The algorithm will repeat this iterative “evaluate and optimize” process, updating
weights autonomously until a threshold of accuracy has been met.
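The "evaluate and optimize" loop described above can be illustrated with a minimal sketch. It assumes a one-feature linear model fitted by gradient descent on the mean squared error; the model, the learning rate, the tolerance, and the toy data are illustrative assumptions and are not taken from the project.

```python
def mean_squared_error(w, b, xs, ys):
    """Error function: average squared gap between predictions and known labels."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys, lr=0.05, tol=1e-6, max_iter=10_000):
    """Repeat 'evaluate and optimize' until the error falls below tol."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(max_iter):
        # Evaluate: stop once the accuracy threshold has been met.
        if mean_squared_error(w, b, xs, ys) < tol:
            break
        # Optimize: move the weights against the gradient of the error.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)
```

On this toy data the fitted weights approach the slope 2 and intercept 1 that generated the labels.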
Industry analysts agree on the importance of machine learning and its underlying algorithms.
According to Forrester, “Advancements in machine-learning algorithms bring precision and depth to
marketing data analysis that helps marketers understand how marketing details, such as
platform, creative, call to action, or messaging, impact marketing performance.” Gartner
states that “Machine learning is at the core of many successful AI applications,
fueling its enormous traction in the market.”
Most often, training ML algorithms on more data will provide more accurate
answers than training on less data. Using statistical methods, algorithms are trained to
determine classifications or make predictions, and to uncover key insights in data mining
projects. These insights can subsequently improve your decision-making to boost key
growth metrics.
Use cases for machine learning algorithms include the ability to analyze data to
identify trends and predict issues before they occur. More advanced AI can enable more
personalized support, reduce response times, provide speech recognition and improve
customer satisfaction. Industries that particularly benefit from machine learning
algorithms that create new content from vast amounts of data include supply chain
management, transportation and logistics, retail and manufacturing, all of which are embracing
generative AI for its ability to automate tasks, enhance efficiency and provide valuable
insights, even to beginners.
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
4. Ensemble Learning
Various algorithms exist within each of these types of machine learning. In this project,
the models are built using logistic regression and support vector machine (SVM) algorithms
to predict the diseases: heart disease is predicted by logistic regression, and diabetes and
Parkinson’s disease by the SVM algorithm.
Based on the categories of the dependent variable, logistic regression can be classified
into three types:
1. Binomial: In binomial logistic regression, the dependent variable can take only two
possible values, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial logistic regression, the dependent variable can take 3 or
more possible unordered values, such as “cat”, “dog”, or “sheep”.
3. Ordinal: In ordinal logistic regression, the dependent variable can take 3 or more
possible ordered values, such as “Low”, “Medium”, or “High”.
The logistic regression model transforms the continuous-valued output of the linear
regression function into a categorical output using a sigmoid function, which maps any
real-valued input into a value between 0 and 1. This function is known as
the logistic function.
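\[
X = \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nm} \end{bmatrix} \tag{1}
\]
\[
Y = \begin{cases} 0 & \text{if class 1} \\ 1 & \text{if class 2} \end{cases} \tag{2}
\]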
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
It maps any real value into another value within the range 0 to 1. The output of logistic
regression must lie between 0 and 1 and cannot go beyond these limits, so it forms an
“S”-shaped curve.
The S-shaped curve is called the sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value, which decides whether the
predicted probability is mapped to 0 or 1: values above the threshold are assigned to 1, and
values below the threshold are assigned to 0.
Now we apply the sigmoid function to the input z to obtain a probability
between 0 and 1, i.e. the predicted y.
\[
\sigma(z) = \frac{1}{1 + e^{-z}}
\]
As shown in Fig. 2.1, the sigmoid function converts a continuous variable into a
probability, i.e. a value between 0 and 1.
P(y=1) =σ(z)
P(y=0) =1−σ(z).
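The sigmoid mapping and the threshold rule above can be sketched in a few lines of Python. The sketch assumes that z is the usual linear combination of the features, z = w·x + b, which this section does not state explicitly; the weights, bias, and default threshold of 0.5 are hypothetical values chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, bias):
    """Return (P(y=0), P(y=1)) for one sample, assuming z = w.x + b."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p1 = sigmoid(z)
    return 1.0 - p1, p1


def predict_label(features, weights, bias, threshold=0.5):
    """Assign class 1 when P(y=1) reaches the threshold, otherwise class 0."""
    return 1 if predict_proba(features, weights, bias)[1] >= threshold else 0


# Hypothetical weights and bias, for illustration only.
weights, bias = [1.0, -1.0], -0.5
label_a = predict_label([2.0, 1.0], weights, bias)   # z = 0.5
label_b = predict_label([0.0, 1.0], weights, bias)   # z = -1.5
```

Raising the threshold above 0.5 makes the model more conservative about reporting class 1, which trades missed cases for fewer false alarms.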
1. Advantages :
2. Disadvantages:
Assumes Linearity: It assumes a linear relationship between the features and the
log-odds of the target variable, which may not hold in complex, non-linear datasets.
Sensitive to Outliers: Outliers can skew the model’s predictions and significantly
affect the estimated coefficients, leading to poor performance.
Limited to Binary Classification (Without Extensions): Standard logistic regression
is designed for binary classification, and while it can be extended to multiclass
problems, this adds complexity.
Feature Independence Assumption: Logistic regression assumes that the features are
not highly correlated with one another; when they are (multicollinearity), the
coefficient estimates become unstable and the model can perform poorly.
The dimension of the hyperplane depends on the number of features. For instance,
if there are two input features, the hyperplane is simply a line, and if there are three input
features, the hyperplane becomes a 2-D plane. As the number of features increases beyond
three, the complexity of visualizing the hyperplane also increases.
Consider two independent variables, x1 and x2, and one dependent variable
represented as either a blue circle or a red circle.
In this scenario, the hyperplane is a line because we are working with two features
(x1 and x2).
There are multiple lines (or hyperplanes) that can separate the data points. The
challenge is to determine the best hyperplane that maximizes the separation margin
between the red and blue circles.
From the Fig. 2.2 it’s very clear that there are multiple lines (our hyperplane here is
a line because we are considering only two input features x1, x2) that segregate our data
points or do a classification between red and blue circles.
One reasonable choice for the best hyperplane in a Support Vector Machine (SVM)
is the one that maximizes the separation margin between the two classes. The maximum-
margin hyperplane, also referred to as the hard margin, is selected based on maximizing
the distance between the hyperplane and the nearest data point on each side.
Fig. 2.3: Multiple hyperplanes separating the data from two classes
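The maximum-margin choice can be sketched as follows. The sketch only compares a few hand-picked candidate lines and keeps the one with the largest distance to its nearest point; an actual SVM solves an optimization problem over all possible hyperplanes. The data points and the three candidate lines are made up for illustration and do not reproduce the figure.

```python
import math


def signed_score(w, b, point):
    """Value of w[0]*x1 + w[1]*x2 + b for one point."""
    return w[0] * point[0] + w[1] * point[1] + b


def margin(w, b, points):
    """Smallest distance from any point to the line w.x + b = 0."""
    norm = math.hypot(w[0], w[1])
    return min(abs(signed_score(w, b, p)) / norm for p in points)


def separates(w, b, red, blue):
    """True if every red point is on the negative side and every blue on the positive side."""
    return (all(signed_score(w, b, p) < 0 for p in red)
            and all(signed_score(w, b, p) > 0 for p in blue))


# Made-up, linearly separable data.
red = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
blue = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]

# Hypothetical candidate lines, each given as (w, b).
candidates = {
    "L1": ((1.0, 0.0), -2.5),   # vertical line x1 = 2.5
    "L2": ((1.0, 1.0), -5.5),   # diagonal line x1 + x2 = 5.5
    "L3": ((0.0, 1.0), -3.5),   # horizontal line x2 = 3.5
}

# Keep only the lines that separate the classes, then pick the widest margin.
valid = {name: margin(w, b, red + blue)
         for name, (w, b) in candidates.items()
         if separates(w, b, red, blue)}
best = max(valid, key=valid.get)
```

All three candidates separate the two classes, but the diagonal line leaves the most room on both sides, so it is the one selected.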
So, we choose the hyperplane whose distance from it to the nearest data point on
each side is maximized. If such a hyperplane exists it is known as the maximum-margin
hyperplane/hard margin. So, from the Fig. 2.3, we choose L2. Let’s consider a scenario like
shown below.
Here we have one blue ball within the boundary of the red balls. So how does SVM
classify the data? The blue ball within the boundary of the red ones is an outlier among the
blue balls. The SVM algorithm is able to ignore such an outlier and find the best
hyperplane that maximizes the margin. Fig. 2.4 shows the hyperplane for data with
outliers. In this sense, SVM is robust to outliers.
For this type of data, SVM finds the maximum margin as with the previous data sets
and, in addition, adds a penalty each time a point crosses the
margin. The margins in such cases are called soft margins. When the data set has a soft
margin, the SVM tries to minimize (1/margin + λ(∑penalty)). Hinge loss is a
commonly used penalty: if there are no violations, there is no hinge loss; if there are
violations, the hinge loss is proportional to the distance of the violation. Fig. 2.5
represents the optimized hyperplane.
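The soft-margin objective above can be written out as a small sketch. It assumes class labels of +1 and -1, takes the margin as 1/||w|| so that 1/margin equals ||w||, and uses the standard hinge loss max(0, 1 - y(w·x + b)); standard SVM formulations usually minimize ½||w||² instead, so this follows the expression in the text rather than a library implementation. The data and weights are made up.

```python
import math


def hinge_loss(w, b, x, y):
    """Zero for points outside the margin on the correct side; grows linearly with the violation."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(w, b, data, lam=1.0):
    """1/margin + lam * sum of hinge penalties, with margin taken as 1/||w||."""
    inverse_margin = math.sqrt(sum(wi * wi for wi in w))
    penalty = sum(hinge_loss(w, b, x, y) for x, y in data)
    return inverse_margin + lam * penalty


# Made-up samples: (features, label) with labels in {+1, -1}.
data = [
    ((2.0, 0.0), +1),    # correct side, outside the margin: no loss
    ((-2.0, 0.0), -1),   # correct side, outside the margin: no loss
    ((0.5, 0.0), +1),    # correct side but inside the margin: loss 0.5
    ((-1.0, 0.0), +1),   # wrong side: loss 2.0
]
w, b = (1.0, 0.0), 0.0   # hypothetical hyperplane x1 = 0
objective = soft_margin_objective(w, b, data, lam=1.0)
```

A larger value of `lam` penalizes violations more heavily and pushes the solution toward a hard margin, while a smaller value tolerates more violations in exchange for a wider margin.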
Till now, we were talking about linearly separable data (the groups of blue balls and
red balls are separable by a straight line). What should be done if the data are not linearly
separable?
Say our data is as shown in Fig. 2.6. SVM solves this by creating a new variable
using a kernel. We take a point xi on the line and create a new variable yi as a function
of its distance from the origin o. Plotting yi against xi gives the result shown in Fig. 2.7.
In this case, the new variable y is created as a function of distance from the origin. A
non-linear function that creates a new variable is referred to as a kernel. Fig. 2.7 shows the
mapping of 1D data to 2D data that makes the two classes separable.
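The 1D-to-2D mapping can be sketched with a small example. It assumes the new variable is the squared distance from the origin, y = x², which is one possible choice; the points, labels, and the separating threshold are made up for illustration. Strictly, a kernel computes inner products in the mapped space without building it explicitly, whereas this sketch builds the mapped points directly to make the idea visible.

```python
def feature_map(x):
    """Map a 1-D point to 2-D by adding its squared distance from the origin."""
    return (x, x * x)


def separable_by_one_threshold(points):
    """A 1-D data set is linearly separable only if the label changes at most once along the line."""
    ordered = [label for _, label in sorted(points)]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


# Made-up 1-D data: blue points near the origin, red points far from it.
points = [(-3.0, "red"), (-2.0, "red"), (-0.5, "blue"), (0.0, "blue"),
          (0.5, "blue"), (2.0, "red"), (3.0, "red")]

mapped = [(feature_map(x), label) for x, label in points]

# In the mapped space a horizontal line y = 2 (hypothetical) splits the classes.
threshold = 2.0
predicted = ["red" if y > threshold else "blue" for (_, y), _ in mapped]
actual = [label for _, label in points]
```

In one dimension the blue points sit between two groups of red points, so no single cut separates them; after the mapping, every red point lies above the line and every blue point below it.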
Based on the nature of the decision boundary, Support Vector Machines (SVM) can be
divided into two main parts:
1. Linear SVM: Linear SVMs use a linear decision boundary to separate the data
points of different classes. When the data can be precisely linearly separated, linear
SVMs are very suitable. This means that a single straight line (in 2D) or a
hyperplane (in higher dimensions) can entirely divide the data points into their
respective classes. A hyperplane that maximizes the margin between the classes is
the decision boundary.
2. Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel
functions, nonlinear SVMs can handle nonlinearly separable data. The original
input data is transformed by these kernel functions into a higher-dimensional feature
space, where the data points can be linearly separated. A linear SVM is used to
locate a nonlinear decision boundary in this modified space.
2.6.3 Advantages and Disadvantages of Support Vector Machine (SVM)
1. Advantages
Nonlinear Capability: Utilizing kernel functions like RBF and polynomial, SVM
effectively handles nonlinear relationships.
Outlier Resilience: The soft margin feature allows SVM to tolerate outliers,
enhancing robustness in applications such as spam detection and anomaly detection.
Binary and Multiclass Support: SVM is effective for both binary and multiclass
classification, making it suitable for applications such as text classification.
Memory Efficiency: SVM focuses on support vectors, making it memory efficient
compared to other algorithms.
2. Disadvantages
Slow Training: SVM can be slow to train on large datasets, which affects its
performance in data mining tasks.
Parameter Tuning Difficulty: Selecting the right kernel and adjusting parameters
like C requires careful tuning, which affects the performance of the SVM model.
Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes,
limiting effectiveness in real-world scenarios.
Limited Interpretability: The complexity of the hyperplane in higher dimensions
makes SVM less interpretable than other models.
Feature Scaling Sensitivity: Proper feature scaling is essential; otherwise, SVM
models may perform poorly.
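The scaling sensitivity and the role of support vectors can be seen in a small experiment. This is a sketch under stated assumptions: the data is synthetic, and the large-scale noise column is added deliberately to exaggerate the effect.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumed): four informative features plus one pure-noise
# feature measured on a much larger scale than the others.
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=4, n_redundant=0, random_state=0
)
rng = np.random.default_rng(0)
noise = rng.normal(scale=1000.0, size=(X.shape[0], 1))
X = np.hstack([X, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Without scaling, the large-scale noise feature dominates the RBF distances.
unscaled = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)

# With scaling, every feature contributes on a comparable scale.
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X_train, y_train)

unscaled_acc = unscaled.score(X_test, y_test)
scaled_acc = scaled.score(X_test, y_test)

# Only the support vectors are kept by the fitted model for prediction.
n_support = int(scaled.named_steps["svc"].n_support_.sum())

print(f"accuracy without scaling: {unscaled_acc:.2f}")
print(f"accuracy with scaling:    {scaled_acc:.2f}")
print(f"support vectors kept: {n_support} of {len(X_train)} training points")
```

Placing the scaler and the classifier in one pipeline also guarantees that new inputs are scaled the same way as the training data.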
2.7 Summary
Chapter 2 covers the domain-specific background: machine learning in general and the
two machine learning algorithms used in the project, logistic regression and the
support vector machine.
CHAPTER 3
Fig. 3.1 represents the flow of the system. The raw data for all the diseases is collected,
then preprocessed and cleaned, which gives a reduced and accurate dataset for use in the
model. The models are trained on this dataset using the logistic regression and support
vector machine algorithms. When the user supplies the details requested by the system,
the input is processed and the presence or absence of the disease is indicated.
1. Data collection:
The data collection process for the multiple disease prediction system begins with
identifying relevant features for each disease. For diabetes, essential features include blood
glucose levels (fasting, postprandial, and HbA1c), BMI, blood pressure, insulin levels, and
lifestyle factors such as diet, physical activity, and smoking habits. For Parkinson's disease,
features such as voice characteristics (e.g., jitter, shimmer, pitch), motor symptoms (e.g.,
tremors, rigidity), clinical scores like the Unified Parkinson’s Disease Rating Scale
(UPDRS), and non-motor symptoms (e.g., sleep disturbances, depression) are critical. For
heart disease, features like chest pain type, cholesterol levels, resting blood pressure, fasting
blood sugar, ECG results, heart rate variability, and smoking history are essential.
parameters (L1 or L2) to prevent overfitting, while also fine-tuning learning rates and
convergence thresholds to achieve faster and more accurate training.
Feature selection is an integral part of the fine-tuning process, where only the most
significant features are retained to reduce noise and computational overhead. After
fine-tuning, the models are re-trained on the complete training dataset using the optimized
parameters and re-evaluated using metrics such as accuracy. This process ensures that the
finalized models are well-optimized for deployment, delivering reliable and accurate
predictions for diabetes, Parkinson’s, and heart disease within the system pipeline.
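One common way to carry out this kind of tuning is a cross-validated grid search. The sketch below is an assumption about how it could be done with scikit-learn, not the project's actual procedure: the data is synthetic and the grid values, step names and choice of SelectKBest are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a disease dataset (assumed).
X, y = make_classification(n_samples=300, n_features=12, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling, feature selection and the classifier are tuned together.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(solver="liblinear", max_iter=1000)),
])

# Illustrative grid: number of retained features, L1 or L2 regularization,
# regularization strength and convergence threshold.
param_grid = {
    "select__k": [4, 8, 12],
    "model__penalty": ["l1", "l2"],
    "model__C": [0.1, 1.0, 10.0],
    "model__tol": [1e-4, 1e-3],
}

# With the default refit=True, the best configuration is re-trained on the
# complete training set once the search finishes.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params)
print(f"test accuracy: {test_accuracy:.2f}")
```

Keeping feature selection inside the pipeline means it is re-fitted within each cross-validation fold, so the held-out fold does not influence which features are retained.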
6. Model deployment:
For the multiple disease prediction system, model deployment involves integrating
the trained SVM models for diabetes and Parkinson’s disease, along with the Logistic
Regression model for heart disease, into a unified application as outlined in the flowchart.
The frontend is developed using Streamlit, which provides a user interface where users
can input processed patient data. The system processes this input, routes it to the
appropriate model, and returns predictions for each disease in real time. To ensure the
deployment is secure and efficient, the system incorporates encrypted data transfer
protocols and adheres to privacy regulations. Continuous performance monitoring is
implemented to detect model drift, ensuring accuracy and reliability in real-world usage.
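The routing step can be sketched as a lookup from the selected disease to its model. Everything named here is hypothetical: the `MODELS` dictionary, the `predict_disease` and `train_placeholder` helpers, the feature counts, and the placeholder models trained on synthetic data in place of the project's saved models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


def train_placeholder(model, n_features, seed):
    """Fit a stand-in model on synthetic data (assumption: the real system
    would load its trained models from disk instead)."""
    X, y = make_classification(n_samples=100, n_features=n_features, random_state=seed)
    return model.fit(X, y)


# One model per disease, following the SVM / Logistic Regression split in the text.
# The feature counts are placeholders, not the project's actual input sizes.
MODELS = {
    "diabetes": train_placeholder(SVC(kernel="linear"), 8, 0),
    "parkinsons": train_placeholder(SVC(kernel="linear"), 22, 1),
    "heart": train_placeholder(LogisticRegression(max_iter=1000), 13, 2),
}


def predict_disease(disease, features):
    """Route a feature vector to the model for the selected disease and
    return 1 (disease indicated) or 0 (disease not indicated)."""
    if disease not in MODELS:
        raise ValueError(f"unknown disease: {disease!r}")
    model = MODELS[disease]
    if len(features) != model.n_features_in_:
        raise ValueError(
            f"{disease} expects {model.n_features_in_} values, got {len(features)}"
        )
    return int(model.predict([features])[0])


print(predict_disease("heart", [0.0] * 13))
```

Validating the disease name and the input length before prediction keeps a malformed request from reaching the wrong model.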
3.2 Implementation
The implementation includes steps such as feature extraction, training, testing, comparing
the models to select the best one, and predicting results.
Training the Diabetes and Parkinson’s models using SVM:
- Group dataset by the target column to calculate mean values for each class.
3. Split Features and Labels
- Separate features (X) and labels (Y) from the dataset.
4. Standardize the Features
- Initialize a StandardScaler object.
- Fit the scaler to X and transform it to get standardized data.
- Update X with the standardized data.
5. Split Data into Training and Testing Sets
- Split the dataset into training and test sets (80% training, 20% testing).
- Use stratified sampling to maintain class balance.
- Set a random state for reproducibility.
6. Train a Support Vector Machine (SVM) Classifier
- Initialize an SVM classifier with a linear kernel.
- Train the classifier using the training data.
7. Evaluate the Model on Training Data
- Predict the labels for the training set.
- Compute and print the accuracy score.
8. Evaluate the Model on Test Data
- Predict the labels for the test set.
- Compute and print the accuracy score.
9. Make Predictions for a New Input Instance
- Define an input instance.
- Convert the input data to a NumPy array and reshape it for prediction.
- Standardize the input data using the previously fitted scaler.
- Use the trained classifier to predict the label.
- Display a message based on the prediction result.
10. Save the Trained Model
- Save the trained model to a file using pickle.
11. Load and Use the Saved Model
- Load the saved model from the file.
- Repeat the steps for standardizing and predicting for a new input instance using the
loaded model.
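Steps 9 to 11 can be sketched as follows. Since the loaded model needs the previously fitted scaler to standardize new input, this sketch stores the scaler and the classifier together in one pickle file; that bundling, the file name and the synthetic data are assumptions rather than the project's confirmed approach.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumed).
X, y = make_classification(n_samples=200, n_features=8, random_state=2)
scaler = StandardScaler().fit(X)
classifier = SVC(kernel="linear").fit(scaler.transform(X), y)

# Save the scaler and the classifier together (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "svm_model.sav")
with open(model_path, "wb") as f:
    pickle.dump({"scaler": scaler, "classifier": classifier}, f)

# Load the bundle back from the file.
with open(model_path, "rb") as f:
    bundle = pickle.load(f)

# Reshape a single instance to (1, n_features), standardize it, then predict.
new_instance = np.asarray(X[0]).reshape(1, -1)
original = classifier.predict(scaler.transform(new_instance))[0]
reloaded = bundle["classifier"].predict(bundle["scaler"].transform(new_instance))[0]

print(original == reloaded)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.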
Key Steps in Pseudo-Code:
- Load Dataset
data = load_data(file_path)
X, Y = split_features_labels(data)
- Preprocessing
scaler = initialize_scaler()
X_standardized = fit_transform(scaler, X)
- Split Data
X_train, X_test, Y_train, Y_test = split_data(X_standardized, Y)
- Model Training
classifier = initialize_svm()
fit(classifier, X_train, Y_train)
- Evaluate Model
train_accuracy = evaluate(classifier, X_train, Y_train)
test_accuracy = evaluate(classifier, X_test, Y_test)
- Predict for New Data
input_data = preprocess_input(new_instance)
prediction = predict(classifier, input_data)
display_result(prediction)
- Save and Load Model
save_model(classifier, filename)
loaded_model = load_model(filename)
repeat_prediction_with_loaded_model(new_instance)
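The pseudo-code above could be realized in Python roughly as follows. This is a sketch under assumptions: the data is synthetic, and the column names, the target column name "Outcome" and the random state are placeholders rather than the project's actual values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load dataset: synthetic data stands in for the real CSV file (assumed).
features, target = make_classification(n_samples=500, n_features=8, random_state=2)
data = pd.DataFrame(features, columns=[f"feature_{i}" for i in range(8)])
data["Outcome"] = target

# Mean value of every feature for each class.
class_means = data.groupby("Outcome").mean()

# Split features and labels.
X = data.drop(columns="Outcome").to_numpy()
Y = data["Outcome"].to_numpy()

# Preprocessing: standardize the features.
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Split data: 80% training, 20% testing, stratified on the label.
X_train, X_test, Y_train, Y_test = train_test_split(
    X_standardized, Y, test_size=0.2, stratify=Y, random_state=2
)

# Model training: SVM with a linear kernel.
classifier = SVC(kernel="linear")
classifier.fit(X_train, Y_train)

# Evaluate model on the training and test sets.
train_accuracy = accuracy_score(Y_train, classifier.predict(X_train))
test_accuracy = accuracy_score(Y_test, classifier.predict(X_test))
print(f"training accuracy: {train_accuracy:.2f}")
print(f"test accuracy:     {test_accuracy:.2f}")

# Predict for new data: reshape to one row and reuse the fitted scaler.
new_instance = X[0].reshape(1, -1)
prediction = int(classifier.predict(scaler.transform(new_instance))[0])
print("Disease indicated" if prediction == 1 else "Disease not indicated")
```

A large gap between the training and test accuracy would be a sign of overfitting, which is why both are reported.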
Training the Heart disease model using logistic regression:
1. Import Required Libraries
Import libraries for data manipulation, pre-processing, machine learning, and
saving/loading models [9].
2. Load and Inspect Dataset
- Load the heart dataset into a DataFrame.
- Display dataset information:
- First few rows
- Shape
- Summary statistics
- Class distribution
- Group dataset by the target column (Heart) to calculate mean values for each class.
3. Split Features and Labels
scaler = initialize_scaler()
X_standardized = fit_transform(scaler, X)
- Split Data
X_train, X_test, Y_train, Y_test = split_data(X_standardized, Y)
- Model Training
regression = initialize_logistic_regression()
fit(regression, X_train, Y_train)
- Evaluate Model
train_accuracy = evaluate(regression, X_train, Y_train)
test_accuracy = evaluate(regression, X_test, Y_test)
- Predict for New Data
input_data = preprocess_input(new_instance)
prediction = predict(regression, input_data)
display_result(prediction)
- Save and Load Model
save_model(regression, filename)
loaded_model = load_model(filename)
repeat_prediction_with_loaded_model(new_instance)
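A Python counterpart of the logistic regression pseudo-code might look like the sketch below. The synthetic data, the number of features, the random state and the result messages are assumptions. The call to `predict_proba` is an addition that shows the probability behind each label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart dataset (assumed).
X, Y = make_classification(n_samples=300, n_features=13, random_state=2)

# Preprocessing: standardize the features.
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Split data: 80% training, 20% testing, stratified on the label.
X_train, X_test, Y_train, Y_test = train_test_split(
    X_standardized, Y, test_size=0.2, stratify=Y, random_state=2
)

# Model training.
regression = LogisticRegression(max_iter=1000)
regression.fit(X_train, Y_train)

# Evaluate model.
train_accuracy = accuracy_score(Y_train, regression.predict(X_train))
test_accuracy = accuracy_score(Y_test, regression.predict(X_test))

# Predict for new data, reusing the fitted scaler.
new_instance = np.asarray(X[0]).reshape(1, -1)
input_data = scaler.transform(new_instance)
prediction = int(regression.predict(input_data)[0])
probability = float(regression.predict_proba(input_data)[0, 1])

message = (
    "The person may have heart disease"
    if prediction == 1
    else "The person may not have heart disease"
)
print(f"{message} (estimated probability of disease: {probability:.2f})")
```

Unlike the SVM with default settings, logistic regression gives a probability estimate directly, which can be shown to the user alongside the label.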
Create input fields for symptoms such as Frequent Urination, Increased Thirst, Fatigue,
Blurred Vision, etc.
When the user presses the Diabetes Test Result button:
Collect the input values and convert them to float.
Use the diabetes model to predict if the person has diabetes.
Show the prediction result.
If the prediction is positive (indicating the person may have diabetes), show a link to
diabetes health tips.
If the prediction is negative (indicating the person may not have diabetes), show a message
saying so.
10. Heart Disease Prediction (Arrhythmia)
If the user selects Heart Disease Prediction (Arrhythmia):
Display the page for Heart Disease Prediction.
Create input fields for symptoms such as Chest Pain, Shortness of Breath, Irregular
Heartbeat, etc.
When the user presses the Heart Test Result button:
Collect the input values and convert them to float.
Use the heart disease model to predict if the person has arrhythmia.
Show the prediction result.
If the prediction is positive (indicating the person may have arrhythmia), show a link to
heart health tips.
If the prediction is negative (indicating the person may not have arrhythmia), show a
message saying so.
11. Parkinson’s Disease Prediction
If the user selects Parkinson’s Prediction:
Display the page for Parkinson’s Prediction.
Create input fields for symptoms such as Tremors, Muscle Stiffness, Slowness of
Movement, Impaired Balance, etc.
When the user presses the Parkinson’s Test Result button:
Collect the input values and convert them to float.
Use the Parkinson's disease model to predict if the person has Parkinson’s disease.
Show the prediction result.
If the prediction is positive (indicating the person may have Parkinson's disease), show a
link to Parkinson’s health tips.
If the prediction is negative (indicating the person may not have Parkinson’s disease), show
a message saying so.
12. End of Program.
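The per-disease page logic above follows one pattern, which can be sketched as a single reusable function. The helper names, the `tips_url` parameter and the message wording are hypothetical, and the model is assumed to accept the raw numeric inputs directly (for example, a pipeline that includes its own scaler).

```python
def parse_inputs(raw_values):
    """Convert the text typed into the input fields to floats.
    Raises ValueError if any field is empty or not numeric."""
    return [float(value) for value in raw_values]


def result_message(prediction, disease):
    """Build the message shown to the user for a 0/1 prediction."""
    if prediction == 1:
        return f"The person may have {disease}."
    return f"The person may not have {disease}."


def render_page(model, disease, field_labels, tips_url):
    """Render one prediction page (hypothetical helper, shared by all diseases)."""
    # Imported here so the helpers above can be used without Streamlit installed.
    import streamlit as st

    st.title(f"{disease} Prediction")

    # One text field per input feature.
    raw_values = [st.text_input(label) for label in field_labels]

    if st.button(f"{disease} Test Result"):
        try:
            features = parse_inputs(raw_values)
        except ValueError:
            st.error("Please enter a numeric value in every field.")
            return

        prediction = int(model.predict([features])[0])
        st.success(result_message(prediction, disease))

        # A positive result also shows a link to the health tips page.
        if prediction == 1:
            st.markdown(f"[{disease} health tips]({tips_url})")
```

Each of the three pages would then call `render_page` with its own model, field labels and tips link, so the input handling is written only once.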
3.3 Summary
Chapter 3 discusses the system design in detail, including the system
architecture and the stepwise flow of the model. It also covers the implementation of the
model. The next chapter presents snapshots of the results.
CHAPTER 4
RESULTS AND DISCUSSION
4.1 Result Analysis
This chapter contains the snapshots of all results obtained in the project. The accuracy of
the model is calculated. The user interface created using Streamlit is shown in all the
figures below. Users can enter values in the input fields. The model predicts the presence
or absence of the selected disease according to the input data.
4.2 Summary
Considering all the results obtained, the efficiency of the project is estimated at 98%.
The model can effectively predict diabetes, Parkinson's and heart disease using
SVM and Logistic Regression, achieving high accuracy for each disease. The system
outperformed existing solutions by offering reliable predictions with improved efficiency.
The chapter contains all the snapshots of the results and a detailed explanation of how
they work.
CHAPTER 5
5.1 Conclusion
The multiple disease prediction project successfully developed a machine learning-
based system to predict diabetes, heart disease, and Parkinson's disease with promising
accuracy. By leveraging data preprocessing, feature selection, and classification
algorithms, the model provides a valuable tool for early detection and preventive
healthcare. The system emphasizes the importance of timely diagnosis, enabling healthcare
professionals to take proactive measures to manage or mitigate risks. While the results are
encouraging, the model's performance depends on high-quality, diverse datasets,
highlighting the need for continuous refinement and real-world validation. This project
demonstrates the potential of AI to revolutionize healthcare by enhancing diagnostic
efficiency and promoting personalized treatment strategies.
- Broaden the system to predict additional diseases like cancer and respiratory
conditions.
- Implement explainable AI techniques to make predictions transparent and
trustworthy.
- Develop Android applications for wider accessibility and ease of use.
REFERENCES
[1] Mostafizur Rahman, Saiful Islam, Sadia Binta Sarowar, Meem Tasfia Zaman, "Multiple
Disease Prediction using Machine Learning and Deep Learning with the Implementation of
Web Technology."
[2] Anish Fathima B, Vikram R, Siddarth S, Sri Vishnu M, "Multiple Disease Predictions
using Machine Learning and Deep Learning Algorithms."
[3] S. B. Kotsiantis, "A Survey on Machine Learning Algorithms for Healthcare Data Analysis."
[4] M. Kalpana Chowdary, K. Anil Kumar, C. Ganesh, Rajsekhar Turaka, B. Devananda Rao,
Sk. Lokesh, "Multiple Disease Prediction by Applying Machine Learning and Deep Learning
Algorithms."
[7] Niloofar Fadavi, Nazanin Fadavi, "Early Recognition of Parkinson’s Disease Through
Acoustic Analysis and Machine Learning," 2024.
[8] Balakrishnan Duraisamya, "Heart Disease Prediction Using Support Vector Machine," 2023.
[9] Ibomoiye Domor Mienye, Nobert Jere, "Optimized Ensemble Learning Approach with
Explainable AI for Improved Heart Disease Prediction," published 8 July 2024.
[11] Jingyu Xue et al., "Research on Diabetes Prediction Method Based on Machine Learning,"
J. Phys.: Conf. Ser., vol. 1684, 012062, 2020.
[12] George DeMaagd, Ashok Philip, "Parkinson’s Disease and Its Management."
I. INTRODUCTION
In today's fast-paced world, early diagnosis of diseases is crucial for effective treatment and improved health outcomes. The project
"SymptoSense: Multiple Disease Prediction Using ML" takes advantage of advances in ML to develop a predictive system
that efficiently identifies multiple diseases based on input symptoms. It was designed by a group of undergraduates from
JNN College of Engineering, Shivamogga, with the aim of making healthcare solutions more accessible. Using effective
algorithms, it attempts to simplify the medical process for both health workers and patients on their way to good health.
In [1], diabetes is described as one of the most challenging health problems around the world, with the number of cases projected
to increase. Accurate prediction in the early stages helps manage the disease and prevent complications. Several machine learning
(ML) algorithms have been used for predicting diabetes, including Support Vector Machines (SVM).
In [2], the disease is shown to have a significant impact on most organs and to lead to complications such as kidney disease, nerve
damage, and cardiovascular disease. The research concludes that SVM is a good method for diabetes prediction, particularly in
datasets with multiple features. It recommends regularly updating the data and models to improve the accuracy of the predictions.
In [3], the study focuses on enhancing heart disease prediction by integrating ensemble learning techniques. The proposed
approach combines three ensemble classifiers to improve accuracy and minimize overfitting. To ensure transparency, Shapley
Additive Explanations (SHAP) were employed to interpret model decisions and identify critical features.
In [4], the study provides a comprehensive review of artificial intelligence (AI) and machine learning (ML) approaches to
diagnosing Parkinson's Disease (PD), emphasizing the importance of early detection due to PD's progressive nature and the lack of
a cure. It explores various methodologies.