ITRByAYUSH
B.TECH/ CS42
Submitted in partial fulfillment of the
Requirements for the award of
Degree of Bachelor of Technology in Computer Science
SUBMITTED BY:
Semester/Branch:
7th Sem/CSE
SUBMITTED TO:
Department of Computer
Science and Engineering
GITM
Lucknow (Uttar Pradesh)
B.TECH
ACKNOWLEDGEMENTS
Special thanks to our colleagues and team members for their hard
work, dedication, and teamwork. Their relentless efforts and
innovative ideas have significantly contributed to the development of
this system.
ABSTRACT
Machine learning offers a variety of techniques that can perform predictive analytics on large amounts of data across many industries. Predictive analytics in healthcare is a difficult endeavour, but it can ultimately help practitioners make timely decisions about patients' health and treatment on the basis of massive data. Diseases such as breast cancer, diabetes, and heart disease cause many deaths globally, and most of these deaths occur because the diseases are not checked in time. This problem arises from a lack of medical infrastructure and a low ratio of doctors to the population. The statistics make this clear: the WHO recommends a doctor-to-patient ratio of 1:1000, whereas India's doctor-to-population ratio is 1:1456, which indicates a shortage of doctors.
Diseases related to the heart, cancer, and diabetes pose a serious threat to mankind if they are not found early. Early recognition and diagnosis of these diseases can therefore save many lives. This work is about predicting such harmful diseases using machine learning classification algorithms; breast cancer, heart disease, and diabetes are covered. To make this work seamless and usable by the general public, our team built a medical test web application that makes predictions about various diseases using machine learning. Our aim in this work is to develop a disease-predicting web app that uses machine learning to make predictions about diseases such as breast cancer, diabetes, and heart disease.
CONTENTS
ABSTRACT v
LIST OF ABBREVIATIONS ix
Chapter 1 INTRODUCTION 1
Chapter 2 LITERATURE SURVEY 2
Chapter 3 PROBLEM IDENTIFICATION 5
3.1 EXISTING SYSTEM 5
3.1.1 DISADVANTAGES OF EXISTING SYSTEM 5
3.2 PROPOSED SYSTEM 6
3.3 FEASIBILITY STUDY 6
3.3.1 ECONOMIC FEASIBILITY 7
3.3.2 TECHNICAL FEASIBILITY 7
3.3.3 SOCIAL FEASIBILITY 7
3.4 REQUIREMENTS 8
3.4.1 HARDWARE AND SOFTWARE REQUIREMENTS 8
CHAPTER 4 SYSTEM DESIGN 9
4.1 DESCRIPTION 9
4.2 SYSTEM ARCHITECTURE DIAGRAM 11
4.3 UML DIAGRAMS 12
4.3.1 CLASS DIAGRAM 13
4.3.2 USE CASE DIAGRAM 14
4.3.3 SEQUENCE DIAGRAM 15
4.3.4 COMPONENT DIAGRAM 16
4.3.5 DEPLOYMENT DIAGRAM 17
CHAPTER 5 IMPLEMENTATION 18
5.1 MODULES 19
5.2 TECHNOLOGIES USED 22
5.2.1 PYTHON 22
5.2.2 STREAMLIT 24
5.3 ALGORITHMS 28
CHAPTER 6 TESTING 35
6.1 TYPES OF TESTING 35
6.1.1 UNIT TESTING 35
6.1.2 INTEGRATION TESTING 35
6.1.3 FUNCTIONAL TESTING 35
6.1.4 SYSTEM TESTING 35
6.1.5 WHITE BOX TESTING 36
6.1.6 BLACK BOX TESTING 36
6.2 INTEGRATION TESTING 37
6.3 ACCEPTANCE TESTING 37
6.4 MANUAL TESTING 42
CHAPTER 7 RESULTS 45
CHAPTER 8 CONCLUSION 48
CHAPTER 9 FUTURE WORK 49
CHAPTER 10 REFERENCES 50
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
CHAPTER 2
LITERATURE SURVEY
Anila M and Dr G Pradeepini proposed the paper titled “Diagnosis of Parkinson’s disease using Artificial Neural Network” [2]. The main objective of this paper is to detect the disease by analysing the voice of people affected with Parkinson's disease. For this purpose, various machine learning techniques such as ANN, Random Forest, KNN, SVM, and XGBoost are used, the best model is identified, error rates are calculated, and the performance metrics are evaluated for all the models used. The main drawback of this paper is that it is limited to an ANN with only two hidden layers, and a neural network with two hidden layers is sufficient and efficient only for simple datasets. The authors also used only one feature-selection technique to reduce the number of features.
Arvind Kumar Tiwari proposed the paper titled “Machine Learning-based Approaches for Prediction of Parkinson’s Disease” [3]. In this paper, the minimum redundancy maximum relevance (mRMR) feature selection algorithm was used to select the most important features for predicting Parkinson's disease. It was observed that a random forest using the 20 features selected by mRMR provides an overall accuracy of 90.3%, precision of 90.2%, a Matthews correlation coefficient of 0.73, and an ROC value of 0.96, which is better than all the other machine learning based approaches compared, such as bagging, boosting, random forest, rotation forest, random subspace, support vector machine, multilayer perceptron, and decision tree based methods.
Afzal Hussain Shahid and Maheshwari Prasad Singh proposed the paper titled “A deep learning approach for prediction of Parkinson’s disease progression” [19]. This paper proposed a deep neural network (DNN) model that uses a reduced input feature space of the Parkinson's telemonitoring dataset to predict Parkinson's disease (PD) progression, and also proposed a PCA-based DNN model for predicting Motor-UPDRS and Total-UPDRS scores. The DNN model was evaluated on a real-world PD dataset taken from the UCI repository. Being a DNN model, its performance may improve further as more data points are added to the dataset.
medical is obtained. The performance and accuracy of the applied algorithms are discussed and
compared.
In the paper [7], the authors propose a diabetes prediction model that classifies diabetes using external factors responsible for the disease along with the regular factors such as glucose, BMI, age, and insulin. Classification accuracy is improved with the novel dataset compared with the existing dataset.
On a dataset of 521 instances (80% for training and 20% for testing), the authors of [8] applied eight ML algorithms: logistic regression, support vector machines with linear and nonlinear kernels, random forest, decision tree, adaptive boosting classifier, K-nearest neighbor, and naïve Bayes. According to the results, the Random Forest classifier achieved 98% accuracy, the best among the algorithms compared.
Aditi Gavhane, Gouthami Kokkula, Isha Panday, and Prof. Kailash Devadkar, in “Prediction of Heart Disease using Machine Learning”, worked on a multi-layer perceptron model for predicting heart disease in human beings and evaluated the accuracy of the algorithm using CAD technology [2]. If the number of people using such a prediction system increases, awareness about the diseases also increases, which can help reduce the death rate of heart patients.
Pahulpreet Singh Kohli and Shriya Arora, in “Application of Machine Learning in Disease Prediction”, note that machine learning algorithms are used for predicting various types of diseases, and many researchers have worked on this. Kohli et al. [7] worked on heart disease prediction using logistic regression, diabetes prediction using a support vector machine, and breast cancer prediction using an AdaBoost classifier, and concluded that logistic regression gives an accuracy of 87.1%, the support vector machine gives an accuracy of 85.71%, and the AdaBoost classifier gives an accuracy of up to 98.57%, which is good from a prediction point of view.
Data mining techniques are popular in many fields such as medicine, business, railways, and education. They are most commonly used for medical diagnosis and disease prediction at an early stage, and data mining is widely utilized in the healthcare sector of industrial societies. This paper provides a survey of data mining techniques used for Parkinson's disease.
Parkinson's disease is a global public health issue. Machine learning techniques offer a good solution for classifying healthy individuals and individuals with Parkinson's disease (PD). This paper gives a complete review of the forecasting of Parkinson's disease using machine learning based methodologies. A concise presentation of the different computational approaches used for predicting Parkinson's disease is given, and the paper also summarizes the results obtained by various researchers from the available data to predict Parkinson's disease.
In this experimental analysis [12], four machine learning algorithms, Random Forest, K-nearest neighbor, Support Vector Machine, and Linear Discriminant Analysis, are used in the predictive analysis of early-stage diabetes. The Random Forest classifier achieved the highest accuracy of 87.66%.
In another way, the authors of the paper [13] have built models to predict and classify diabetes complications. In this work, several supervised classification algorithms were applied to predict and classify eight diabetes complications, including metabolic syndrome, dyslipidemia, nephropathy, diabetic foot, obesity, and retinopathy.
In [14], the authors present two machine learning approaches to predict diabetes in patients: the random forest algorithm for a classification approach, and the XGBoost algorithm for a hybrid approach. The results show that XGBoost performs best, with an accuracy of 74.10%.
The authors of article [15] tested machine learning algorithms such as support vector machine, logistic regression, decision tree, random forest, gradient boosting, K-nearest neighbor, and the naïve Bayes algorithm. According to the results, the naïve Bayes and random forest classifiers achieved 80% accuracy, better than the other algorithms.
CHAPTER 3
PROBLEM IDENTIFICATION
Many of the existing machine learning models for healthcare analysis concentrate on one disease per analysis: for example, one model for liver analysis, one for cancer analysis, and one for lung diseases. If a user wants predictions for more than one disease, he or she has to go through different sites. There is no common system where one analysis can perform more than one disease prediction. Some of the models have low accuracy, which can seriously affect patients' health. When an organization wants to analyse its patients' health reports, it has to deploy many models, which in turn increases cost as well as time. Some of the existing systems also consider very few parameters, which can yield false results.
3.1 EXISTING SYSTEM
The study has identified multiple risk factors for cardiovascular disease, including high
blood pressure, high cholesterol, smoking, and diabetes.
Based on these risk factors, a risk score can be calculated to predict an individual's
likelihood of developing cardiovascular disease.
Traditional statistical methods are used to identify risk factors and calculate a risk score,
which can be used for disease prevention and management.
3.1.1 DISADVANTAGES OF EXISTING SYSTEM
Overfitting: Overfitting occurs when a machine learning model is trained too closely on a particular dataset and becomes overly specialized in predicting it. This can result in poor generalization to new data and lower accuracy.
Lack of interpretability: Many machine learning algorithms are "black boxes", meaning that it is difficult to understand how they arrive at their predictions. This can be a barrier to trust and to clinical adoption of such systems.
Limited data availability: Some diseases are rare, which means that there may not be
enough data available to train a machine learning model accurately. This can limit the
effectiveness of the system for predicting such diseases.
Cost and implementation: Implementing machine learning systems for healthcare can be expensive and time-consuming. Hospitals and clinics may need to invest in new hardware, software, and staff training to implement these systems effectively.
3.2 PROPOSED SYSTEM
This project involved analyzing a multiple disease patient dataset with proper data
processing.
Different algorithms were used to train and predict, including Decision Trees, Random Forest, SVM, Logistic Regression, and AdaBoost.
3.3 FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase and a business proposal is
put forth with a very general plan for the project and some cost estimates. During
system analysis the feasibility study of the proposed system is to be carried out. This is
to ensure that the proposed system is not a burden to the company. For feasibility
analysis, some understanding of the major requirements for the system is essential.
3.3.1 ECONOMIC FEASIBILITY
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, so the expenditures must be justified. The developed system was well within the budget, which was achieved because most of the technologies used are freely available; only the customized products had to be purchased.
3.3.2 TECHNICAL FEASIBILITY
During this study, the analyst identifies the existing computer systems of the concerned
department and determines whether these technical resources are sufficient for the proposed
system or not. If they are not sufficient, the analyst suggests the configuration of the computer
systems that are required. The analyst generally pursues two or three different configurations
which satisfy the key technical requirements but which represent different costs. During the
technical feasibility study, financial resources and budget are also considered. The main objective
of technical feasibility is to determine whether the project is technically feasible or not, provided
it is economically feasible.
3.3.3 SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the
users solely depends on the methods that are employed to educate the user about the system and
to make him familiar with it. His level of confidence must be raised so that he is also able to
make some constructive criticism, which is welcomed, as he is the final user of the system.
3.4 REQUIREMENTS
A software requirements specification (SRS) is a description of a software system to be developed. It is defined after the business requirements specification (CONOPS), also called the stakeholder requirements specification (StRS); another related document is the system requirements specification (SyRS).
HARDWARE REQUIREMENTS
System processor : Intel Core i7.
Hard Disk : 512 GB SSD.
Monitor : 15" LED.
Mouse : Optical Mouse.
RAM : 8.0 GB.
Key Board : Standard Windows Keyboard.
SOFTWARE REQUIREMENTS
Operating system : Windows 10.
Coding Language : Python 3.9.
Front-End : Streamlit 3.7, Python
Back-End : Python 3.9
Python Modules : Pickle 1.2.3
CHAPTER 4
SYSTEM DESIGN
This chapter provides information on the software development life cycle and the design model, i.e. the various UML diagrams and the process specification.
4.1 DESCRIPTION
Systems design is the process or art of defining the architecture, components,
modules, interfaces, and data for a system to satisfy specified requirements. One could see it
as the application of systems theory to product development. There is some overlap and
synergy with the disciplines of systems analysis, systems architecture and systems
engineering.
This design activity describes the system in narrative form using non-technical terms.
It should provide a high-level system architecture diagram showing a subsystem breakout of
the system, if applicable. The high-level system architecture or subsystem diagrams should,
if applicable, show interfaces to external systems. Supply a high-level context diagram for
the system and subsystems, if applicable. Refer to the requirements traceability matrix
(RTM) in the Functional Requirements Document (FRD) to identify the allocation of the
functional requirements into this design document.
This section describes any constraints in the system design (referencing any trade-off analyses conducted, such as resource use versus productivity, or conflicts with other systems) and includes any assumptions made by the project team in developing the system design.
This section describes any contingencies that might arise in the design of the system
that may change the development direction. Possibilities include lack of interface agreements
with outside agencies or unstable architectures at the time this document is produced. Address
any possible workarounds or alternative plans.
To design a system for Multiple Disease prediction based on lab reports using
machine learning, we can follow the following steps:
1. Data Collection: The first component of the system involves collecting a large
dataset of medical records containing patient information and various medical
features related to multiple diseases. This dataset will be used to train the machine
learning models.
5. Model Evaluation: The selected model will be evaluated on a separate test dataset to measure its accuracy and reliability in predicting multiple diseases. This component of the system involves testing the model and measuring its performance; a minimal sketch of this step is given below.
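In the sketch below, scikit-learn's built-in breast cancer data is used only as a stand-in for the collected medical records, and the choice of Random Forest is illustrative rather than prescriptive.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Built-in breast cancer data stands in for the collected medical records.
X, y = load_breast_cancer(return_X_y=True)

# Hold back 20% of the records as the separate test dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))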
4.2 SYSTEM ARCHITECTURE DIAGRAM
Machine learning has given computer systems the ability to learn automatically without being explicitly programmed. In this work, three machine learning algorithms (Logistic Regression, KNN, and Naïve Bayes) are used. The architecture diagram describes a high-level overview of the major system components and the important working relationships between them.
4.3.2 USE CASE DIAGRAM
Use case diagrams model behavior within a system and help developers understand what the users require.
A use case diagram is useful for getting an overall view of the system and clarifying what each actor can do and, more importantly, what they cannot do.
A use case diagram consists of use cases and actors and shows the interactions between them.
4.3.3 SEQUENCE DIAGRAM
One of the primary uses of sequence diagrams is in the transition from requirements expressed as use cases to the next and more formal level of refinement. Use cases are often refined into one or more sequence diagrams.
As shown in the sequence diagram (Fig. 4.2.3), the prediction system collects data from the actor and stores it in the dataset. The prediction system processes the training data, accesses the data from the dataset, applies the ML algorithms to the training and test data, checks the user status and grand status values, and then produces the output.
4.3.4 COMPONENT DIAGRAM
A component diagram is used to break down a large object-oriented system into smaller components, so as to make them more manageable. It models the physical view of a system, such as the executables, files, libraries, etc. that reside within a node.
It visualizes the relationships as well as the organization between the components
present in the system. It helps in forming an executable system. A component is a single unit
of the system, which is replaceable and executable. The implementation details of a component
are hidden, and it necessitates an interface to execute a function. It is like a black box whose
behavior is explained by the provided and required interfaces.
This diagram is also used as a communication tool between the developer and
stakeholders of the system. Programmers and developers use the diagrams to formalize a
roadmap for the implementation, allowing for better decision-making about task assignment
or needed skill improvements. System administrators can use component diagrams to plan
ahead, using the view of the logical software components and their relationships in the system.
As shown in the component diagram (Fig. 4.2.4), the system has components such as user, system, dataset, pre-processing, results, security, persistence, and database; these are the components of the Multiple Disease Prediction system.
CHAPTER 5
IMPLEMENTATION
Data Collection
The first step for the prediction system is data collection and deciding on the training and testing datasets. In this project we have used separate training and testing datasets.
Attribute Selection
Attributes of a dataset are the properties of the dataset that are used by the system. For heart disease prediction, the attributes include the person's heart rate, gender, age, and many more.
Data Pre-processing
Pre-processing is needed to obtain reliable results from the machine learning algorithms. For example, the Random Forest algorithm does not support datasets with null values, so we have to handle the null values in the original raw data.
For our project we also have to convert some categorical values into dummy values, i.e. into the form of "0" and "1", using code such as the following.
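The original code does not survive in this copy of the report, so the snippet below is a minimal reconstruction of the kind of conversion meant; the column names are purely illustrative.

import pandas as pd

# Illustrative raw data containing a categorical column.
raw = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [45, 52, 36, 61],
})

# Map a binary category directly to 0/1 values ...
encoded = raw.copy()
encoded["Gender"] = encoded["Gender"].map({"Female": 0, "Male": 1})

# ... or create 0/1 dummy columns for any categorical field.
dummies = pd.get_dummies(raw, columns=["Gender"], drop_first=True)
print(encoded)
print(dummies)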
Balancing of Data
Imbalanced datasets can be balanced in two ways. They are Under Sampling and Over
Sampling.
Under Sampling
In under-sampling, the dataset is balanced by reducing the size of the larger class. This approach is considered when the amount of data is adequate.
Over Sampling
In over-sampling, the dataset is balanced by increasing the size of the smaller class. This approach is considered when the amount of data is inadequate.
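A minimal sketch of both strategies using scikit-learn's resample utility on a small illustrative dataset (a dedicated library such as imbalanced-learn could be used instead):

import pandas as pd
from sklearn.utils import resample

# Illustrative imbalanced data: six negative cases and two positive cases.
df = pd.DataFrame({"feature": range(8),
                   "label": [0, 0, 0, 0, 0, 0, 1, 1]})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Under-sampling: shrink the majority class to the size of the minority class.
under_sampled = pd.concat([
    resample(majority, replace=False, n_samples=len(minority), random_state=42),
    minority,
])

# Over-sampling: repeat minority rows until both classes have the same size.
over_sampled = pd.concat([
    majority,
    resample(minority, replace=True, n_samples=len(majority), random_state=42),
])

print(under_sampled["label"].value_counts())
print(over_sampled["label"].value_counts())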
5.1 MODULES
Attribute Information:
1. name - ASCII subject name and recording number.
2. MDVP:Fo(Hz) - Average vocal fundamental frequency.
3. MDVP:Fhi(Hz) - Maximum vocal fundamental frequency.
4. MDVP:Flo(Hz) - Minimum vocal fundamental frequency.
5. MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several
measures of variation in fundamental frequency.
6. MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude.
7. NHR, HNR- Two measures of the ratio of noise to tonal components in the voice.
8. status - The health status of the subject (one) - Parkinson's, (zero) – healthy.
9. RPDE, D2- Two nonlinear dynamical complexity measures.
10. DFA - Signal fractal scaling exponent.
11. spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation.
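A minimal sketch of loading this module's data and separating the features from the label; the file name parkinsons.csv is an assumption, standing in for wherever the UCI Parkinson's voice dataset is stored.

import pandas as pd

# Load the Parkinson's voice dataset (the file name is an assumption).
df = pd.read_csv("parkinsons.csv")

# "name" only identifies the recording, so it is dropped; "status" is the label
# (1 = Parkinson's, 0 = healthy); the remaining columns are the voice measures above.
X = df.drop(columns=["name", "status"])
y = df["status"]

print(X.shape)
print(y.value_counts())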
Comparison of Models
We can say that the kNN model performs well on our dataset, but the SVM gives a higher AUC. The higher the AUC, the better the model is at distinguishing between the positive and negative classes.
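A minimal sketch of such a comparison is given below; scikit-learn's built-in breast cancer data is used as a stand-in for the module's dataset, and the hyperparameters are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)          # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]       # probability of the positive class
    print(name, "AUC:", round(roc_auc_score(y_test, scores), 3))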
Classification Report
Attribute Information
1. Pregnancies
2. Glucose
3. Blood pressure
4. SkinThickness
5. Insulin
6. BMI
7. DiabetesPedigreeFunction
8. Age
Comparison of Models
Classification Report
Accuracy Results
5.2 TECHNOLOGIES USED
5.2.1 PYTHON
Python is a high-level, general-purpose, and very popular programming language. Python (currently Python 3) is used in web development, machine learning applications, and other cutting-edge areas of the software industry. Python is well suited for beginners as well as for programmers experienced with other languages such as C++ and Java.
Python is an interpreted, high-level, general-purpose programming language. Created
by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code
readability with its notable use of significant whitespace. Its language constructs and object-
oriented approach aim to help programmers write clear, logical code for small and large-scale
projects.
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented, and functional programming.
ADVANTAGES OF PYTHON
1. Easy to read, learn and code
Python is a high-level language and its syntax is very simple. It does not need any
semicolons or braces and looks like English. Thus, it is beginner-friendly. Due to its simplicity,
its maintenance cost is less.
2. Dynamic Typing
In Python, there is no need for the declaration of variables. The data type of the variable
gets assigned automatically during runtime, facilitating dynamic coding.
3. Portable:
Python is also platform-independent. That is, if you write the code on one of the
Windows, Mac, or Linux operating systems, then you can run the same code on the other OS
with no need for any changes.
This is called Write Once Run Anywhere (WORA). However, you should be careful while you
add system dependent features.
These libraries have different modules and packages, which contain various inbuilt functions and algorithms. Using them makes the coding process easier and the code simpler.
5.2.2 STREAMLIT
Streamlit is an open-source Python framework for building web apps for machine learning and data science. We can instantly develop web apps and deploy them easily using Streamlit. Streamlit allows you to write an app the same way you write Python code, and it makes it seamless to work in the interactive loop of coding and viewing results in the web app.
The best thing about Streamlit is that you don't even need to know the basics of web
development to get started or to create your first web application. So if you're somebody who's
into data science and you want to deploy your models easily, quickly, and with only a few lines
of code, Streamlit is a good fit.
You don't need to spend days or months to create a web app, you can create a really
beautiful machine learning or data science app in only a few hours or even minutes.
It is compatible with the majority of Python libraries (e.g. pandas, matplotlib, seaborn,
plotly, Keras, PyTorch, SymPy(latex)).
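A minimal sketch of the kind of Streamlit app used in this project is shown below; the pickled model file name and the three input fields are illustrative only (the actual diabetes form would use all eight attributes listed in Section 5.1).

# app.py — run with:  streamlit run app.py
import pickle
import streamlit as st

# A previously trained and pickled classifier is assumed; the file name is illustrative.
with open("diabetes_model.pkl", "rb") as f:
    model = pickle.load(f)

st.title("Multiple Disease Prediction")

glucose = st.number_input("Glucose", min_value=0.0)
bmi = st.number_input("BMI", min_value=0.0)
age = st.number_input("Age", min_value=0, step=1)

if st.button("Predict"):
    prediction = model.predict([[glucose, bmi, age]])[0]
    st.write("Diabetic" if prediction == 1 else "Not diabetic")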
Streamlit is a popular open-source Python library that allows developers to build interactive
web applications for data science and machine learning projects with ease. Here are some of
the key features of Streamlit:
1. Ease of Use: Streamlit is easy to use for both beginners and advanced developers. Its
simple syntax allows developers to build interactive web applications quickly without
having to worry about the details of web development.
2. Data Visualization: Streamlit allows developers to create data visualizations such as
charts, plots, and graphs with just a few lines of code. It supports popular data
visualization libraries like Matplotlib, Plotly, and Altair.
3. Customizable UI Components: Streamlit provides various UI components that can be
customized to fit the needs of the application. These components include sliders,
dropdowns, buttons, and text inputs.
4. Real-time Updating: Streamlit automatically updates the web application in real-time as
the user interacts with it. This makes it easy to create dynamic applications that respond
to user input in real-time.
5. Integration with Machine Learning Libraries: Streamlit integrates seamlessly with
popular machine learning libraries like TensorFlow, PyTorch, and Scikit-learn. This
allows developers to build and deploy machine learning models with ease.
6. Sharing and Deployment: Streamlit makes it easy to share and deploy applications.
Developers can share their applications with others by simply sharing a URL. Streamlit
also provides tools for deploying applications to cloud services like Heroku and AWS.
ADVANTAGES OF STREAMLIT
Fast and Easy Development: Streamlit provides a simple and intuitive syntax that makes
it easy to build interactive web applications for data science and machine learning projects.
With Streamlit, developers can build applications faster and with less code.
Sharing and Deployment: Streamlit makes it easy to share and deploy applications.
Developers can share their applications with others by simply sharing a URL. Streamlit also
provides tools for deploying applications to cloud services like Heroku and AWS, making it
easy to scale applications as needed.
Active Community Support: Streamlit has an active community of developers and users
who contribute to the development of the library, provide support to other developers, and share
their own projects and experiences with the library.
JUPYTER NOTEBOOK
The Jupyter Notebook is an open-source web application that you can use to create and share documents that contain live code, equations, visualizations, and text. Jupyter Notebook is maintained by the people at Project Jupyter.
Jupyter Notebooks are a spin-off project from the IPython project, which used to have an IPython Notebook project itself. The name Jupyter comes from the core programming languages that it supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which allows you to write your programs in Python, but there are currently over 100 other kernels that you can also use.
•The notebook web application: An interactive web application for writing and running code
interactively and authoring notebook documents.
•Kernels: Separate processes started by the notebook web application that run users’ code in a given language and return the output back to the notebook web application. The kernel also handles things like computations for interactive widgets, tab completion and introspection.
•Notebook documents: Self-contained documents that contain a representation of all content
visible in the note-book web application, including inputs and outputs of the computations,
narrative text, equations, images, and rich media representations of objects. Each notebook
document has its own kernel.
In-browser editing for code, with automatic syntax highlighting, indentation, and tab
completion/introspection.
The ability to execute code from the browser, with the results of computations attached
to the code which generated them.
Displaying the result of computation using rich media representations, such as HTML,
LaTeX, PNG, SVG, etc.
For example, publication-quality figures rendered by the matplotlib library can be included inline.
In-browser editing for rich text using the Markdown markup language, which can
provide commentary for the code, is not limited to plain text.
The ability to easily include mathematical notation within markdown cells using LaTeX,
and rendered natively by MathJax.
Easy to convert: Jupyter Notebook allows users to convert notebooks into other formats such as HTML and PDF. Online tools such as nbviewer also allow a publicly available notebook to be rendered directly in the browser.
5.3 ALGORITHMS
Decision Tree
Given a training set S of objects, each belonging to one of the classes C1, C2, …, Ck, the tree is built recursively:
Step 1. If all the objects in S belong to the same class, for example Ci, the decision tree for S consists of a leaf labeled with this class.
Step 2. Otherwise, let T be some test with possible outcomes O1, O2, …, On. Each object in S has one outcome for T, so the test partitions S into subsets S1, S2, …, Sn where each object in Si has outcome Oi for T. T becomes the root of the decision tree and for each outcome Oi we build a subsidiary decision tree by invoking the same procedure recursively on the set Si.
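A brief sketch of this procedure in practice via scikit-learn's DecisionTreeClassifier; the built-in breast cancer data stands in for the project's datasets and the depth limit is illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)          # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each internal node of the fitted tree is a test T; each leaf is a class label.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print("Decision tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))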
Gradient boosting
Gradient boosting is a machine learning technique used
in regression and classification tasks, among others. It gives a prediction model in the form of
an ensemble of weak prediction models, which are typically decision trees [1][2]. When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random forest. A gradient-boosted trees model is built in a stage-wise fashion as in other boosting methods, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function.
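A brief sketch with scikit-learn's GradientBoostingClassifier (the hyperparameters are illustrative), evaluated here with 5-fold cross-validation on stand-in data:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)          # stand-in data

# 200 shallow trees fitted stage-wise to the gradient of the log-loss.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=42)
scores = cross_val_score(gb, X, y, cv=5, scoring="accuracy")
print("Gradient boosting mean accuracy:", round(scores.mean(), 3))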
KNN
In N dimensions, the Euclidean distance between two points p and q is
d(p, q) = √( Σ_{i=1}^{N} (p_i − q_i)² ),
where p_i (or q_i) is the coordinate of p (or q) in dimension i.
The algorithm for KNN is defined in the steps given below:
1. D represents the samples used in training and k denotes the number of nearest neighbours.
2. Create a super class for each sample class.
3. Compute the Euclidean distance for every training sample.
4. Classify the sample based on the majority class among its neighbours.
Algorithm Implementation:
Step 1 − For implementing any algorithm, we need a dataset. So during the first step of KNN, we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points. K can be any odd integer.
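A minimal sketch of these steps: the distance function mirrors the formula above, and KNeighborsClassifier is used with an odd K and scaled features (scaling matters because Euclidean distance is sensitive to attribute ranges). The built-in breast cancer data is a stand-in.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

def euclidean(p, q):
    # Euclidean distance between two points, as in the formula above.
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

X, y = load_breast_cancer(return_X_y=True)          # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: choose an odd K (here K = 5); features are standardized first.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print("kNN accuracy:", round(knn.score(X_test, y_test), 3))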
Logistic Regression
This program computes binary logistic regression and multinomial logistic regression
on both numeric and categorical independent variables. It reports on the regression equation as
well as the goodness of fit, odds ratios, confidence limits, likelihood, and deviance. It performs
a comprehensive residual analysis including diagnostic residual reports and plots. It can
perform an independent variable subset selection search, looking for the best regression model
with the fewest independent variables. It provides confidence intervals on predicted values and
provides ROC curves to help determine the best cutoff point for classification. It allows you to
validate your results by automatically classifying rows that are not used during the analysis.
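A minimal sketch of binary logistic regression with scikit-learn (a stand-in for the statistical package described above): the predicted probabilities can feed an ROC curve for choosing a cutoff, and the exponentiated coefficients give odds ratios.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)          # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)

# Class probabilities, usable for an ROC curve and cutoff selection.
probs = logreg.predict_proba(X_test)[:, 1]
print("Logistic regression AUC:", round(roc_auc_score(y_test, probs), 3))

# Odds ratios per standard deviation of each (scaled) feature.
odds_ratios = np.exp(logreg.named_steps["logisticregression"].coef_[0])
print(odds_ratios[:5])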
Naïve Bayes
The naive bayes approach is a supervised learning method which is based on a simplistic
hypothesis: it assumes that the presence (or absence) of a particular feature of a class is
unrelated to the presence (or absence) of any other feature.
Yet, despite this, it appears robust and efficient. Its performance is comparable to other
supervised learning techniques. Various reasons have been advanced in the literature. In this
tutorial, we highlight an explanation based on the representation bias. The naive bayes classifier
is a linear classifier, as are linear discriminant analysis, logistic regression and linear SVM (support vector machine). The difference lies in the method of estimating the parameters of the classifier (the learning bias).
While the Naive Bayes classifier is widely used in the research world, it is not widespread among practitioners who want to obtain usable results. On the one hand, researchers find that it is very easy to program and implement, its parameters are easy to estimate, learning is very fast even on very large databases, and its accuracy is reasonably good in comparison to other approaches. On the other hand, the final users do not obtain a model that is easy to interpret and deploy, and they do not understand the interest of such a technique.
Thus, we introduce a new presentation of the results of the learning process. The
classifier is easier to understand, and its deployment is also made easier. In the first part of this
tutorial, we present some theoretical aspects of the naive bayes classifier. Then, we implement
the approach on a dataset with Tanagra. We compare the obtained results (the parameters of the
model) to those obtained with other linear approaches such as the logistic regression, the linear
discriminant analysis and the linear SVM. We note that the results are highly consistent. This
largely explains the good performance of the method in comparison to others. In the second
part, we use various tools on the same dataset (Weka 3.6.0, R 2.9.2, Knime 2.1.1, Orange 2.0b
and RapidMiner 4.6.0). We try above all to understand the obtained results.
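A minimal sketch of a Gaussian naive Bayes classifier with scikit-learn rather than the tools listed above; the built-in breast cancer data is a stand-in for this project's datasets.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)          # stand-in data

# Features are assumed independent given the class, so the learned parameters
# are just per-class means and variances for every feature.
nb = GaussianNB()
scores = cross_val_score(nb, X, y, cv=5, scoring="accuracy")
print("Naive Bayes mean accuracy:", round(scores.mean(), 3))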
Random Forest
Random forests or random decision forests are an ensemble learning method for
classification, regression and other tasks that operates by constructing a multitude of decision
trees at training time. For classification tasks, the output of the random forest is the class
selected by most trees. For regression tasks, the mean or average prediction of the individual
trees is returned. Random decision forests correct for decision trees' habit of overfitting to their
training set. Random forests generally outperform decision trees, but their accuracy is lower
than gradient boosted trees. However, data characteristics can affect their performance.
The first algorithm for random decision forests was created in 1995 by Tin Kam Ho[1]
using the random subspace method, which, in Ho's formulation, is a way to implement the
"stochastic discrimination" approach to classification proposed by Eugene Kleinberg.
An extension of the algorithm was developed by Leo Breiman and Adele Cutler, who
registered "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.).The
extension combines Breiman's "bagging" idea and random selection of features, introduced
first by Ho [1] and later independently by Amit and Geman [13] in order to construct a collection of decision trees with controlled variance.
Random forests are frequently used as "blackbox" models in businesses, as they generate
reasonable predictions across a wide range of data while requiring little configuration.
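A minimal sketch with scikit-learn's RandomForestClassifier (the number of trees is illustrative); feature importances give at least a rough look inside the "black box".

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()                          # stand-in data
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Bagging plus random feature selection: each tree sees a bootstrap sample
# and a random subset of features at every split.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print("Random forest accuracy:", round(rf.score(X_test, y_test), 3))

# The five most important features according to the ensemble.
top = sorted(zip(rf.feature_importances_, data.feature_names), reverse=True)[:5]
for importance, name in top:
    print(f"{name}: {importance:.3f}")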
SVM
In classification tasks a discriminant machine learning technique aims at finding, based
on an independent and identically distributed (iid) training dataset, a discriminant function that
can correctly predict labels for newly acquired instances. Unlike generative machine learning
approaches, which require computations of conditional probability distributions, a discriminant
classification function takes a data point x and assigns it to one of the different classes that are
a part of the classification task. Less powerful than generative approaches, which are mostly
used when prediction involves outlier detection, discriminant approaches require fewer
computational resources and less training data, especially for a multidimensional feature space
and when only posterior probabilities are needed. From a geometric perspective, learning a
classifier is equivalent to finding the equation for a multidimensional surface that best separates
the different classes in the feature space.
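A minimal sketch of a support vector classifier with an RBF kernel; feature scaling is included because the kernel is distance-based, and the built-in breast cancer data is a stand-in.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)          # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The SVM learns the separating surface in feature space described above.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("SVM accuracy:", round(svm.score(X_test, y_test), 3))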
ADA BOOST
AdaBoost, also called Adaptive Boosting, is a technique in Machine Learning used as
an Ensemble Method. The most common estimator used with AdaBoost is decision trees with
one level which means Decision trees with only 1 split. These trees are also called Decision
Stumps.
The simplest model we could construct, in terms of complexity, would just guess the same label for every new example, no matter what it looked like. The
accuracy of such a model would be best if we guess whichever answer, 1 or 0, is most common
in the data. If, say, 60 percent of the examples are 1s, then we’ll get 60 percent accuracy just by
guessing 1 every time.
Decision stumps improve upon this by splitting the examples into two subsets based on the
value of one feature. Each stump chooses a feature, say X2, and a threshold, T, and then splits
the examples into the two groups on either side of the threshold.
To find the decision stump that best fits the examples, we can try every feature of the input
along with every possible threshold and see which one gives the best accuracy. While it naively
seems like there are an infinite number of choices for the threshold, two different thresholds are
only meaningfully different if they put some examples on different sides of the split. To try
every possibility, then, we can sort the examples by the feature in question and try one threshold
falling between each adjacent pair of examples.
The algorithm just described can be improved further, but even this simple version is extremely
fast in comparison to other ML algorithms (e.g. training neural networks).
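A minimal sketch with scikit-learn's AdaBoostClassifier, whose default weak learner is exactly the depth-1 decision stump described above (the number of boosting rounds is illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)          # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each boosting round fits a new stump, up-weighting the examples that the
# previous stumps misclassified.
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", round(ada.score(X_test, y_test), 3))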
CHAPTER 6
TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests. Each test type addresses a specific testing requirement.
6.1 TYPES OF TESTS
6.1.1 UNIT TESTING
Unit testing involves the design of test cases that validate that the internal program logic
is functioning properly, and that program inputs produce valid outputs. All decision branches
and internal code flow should be validated. It is the testing of individual software units of the
application; it is done after the completion of an individual unit and before integration. This is
structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform
basic tests at component level and test a specific business process, application, and/or system
configuration. Unit tests ensure that each unique path of a business process performs accurately
to the documented specifications and contains clearly defined inputs and expected results.
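As an illustration only, a pytest-style unit test for this project might look like the sketch below; predict_diabetes and the app module are hypothetical names, not code taken from this report.

# test_prediction.py — pytest-style unit-test sketch.
import pytest
from app import predict_diabetes   # hypothetical wrapper around the trained model

def test_valid_input_returns_binary_label():
    # A well-formed record must yield either 0 (healthy) or 1 (diabetic).
    record = {"Glucose": 120, "BMI": 28.5, "Age": 45}
    assert predict_diabetes(record) in (0, 1)

def test_invalid_input_is_rejected():
    # Physically impossible values (negative glucose) must be rejected.
    record = {"Glucose": -10, "BMI": 28.5, "Age": 45}
    with pytest.raises(ValueError):
        predict_diabetes(record)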
6.1.2 INTEGRATION TESTING
Integration tests are designed to test integrated software components to determine if they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that although the components were
individually satisfactory, as shown by successful unit testing, the combination of
components is correct and consistent. Integration testing is specifically aimed at exposing
the problems that arise from the combination of components.
6.1.4 SYSTEM TESTING
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results. An example of
system testing is the configuration oriented system integration test. System testing is based on
process descriptions and flows, emphasizing pre-driven process links and integration points.
Unit Testing
Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as
two distinct phases.
Features to be tested
Verify that the entries are of the correct format
No duplicate entries should be allowed
All links should take the user to the correct page.
6.2 Integration Testing
Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to produce failures caused by interface
defects. The task of the integration test is to check that components or software applications,
e.g. components in a software system or – one step up – software applications at the company
level – interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
SYSTEM TESTING
TESTING METHODOLOGIES
Unit Testing
Unit testing focuses verification effort on the smallest unit of Software design that is the
module. Unit testing exercises specific paths in a module’s control structure to ensure complete
coverage and maximum error detection. This test focuses on each module individually, ensuring
that it functions properly as a unit. Hence, the naming is Unit Testing.
During this testing, each module is tested individually and the module interfaces are verified for consistency with the design specification. All important processing paths are tested for the expected results. All error handling paths are also tested.
Integration Testing
Integration testing addresses the issues associated with the dual problems of verification
and program construction. After the software has been integrated a set of high order tests are
conducted. The main objective in this testing process is to take unit-tested modules and build
a program structure that has been dictated by the design.
2. Bottom-up Integration
This method begins the construction and testing with the modules at the lowest level in
the program structure. Since the modules are integrated from the bottom up, processing required
for modules subordinate to a given level is always available and the need for stubs is eliminated.
The bottom up integration strategy may be implemented with the following steps:
The low-level modules are combined into clusters that perform a specific software sub-function.
A driver (i.e.) the control program for testing is written to coordinate test case
input and output.
The cluster is tested.
Drivers are removed and clusters are combined moving upward in the program structure
The bottom-up approach tests each module individually; each module is then integrated with a main module and tested for functionality.
Output Testing
After performing the validation testing, the next step is output testing of the proposed
system, since no system could be useful if it does not produce the required output in the
specified format. The outputs generated or displayed by the system under consideration are tested by asking the users about the format they require. Hence the output format is considered in two ways: one on screen and the other in printed format.
Validation Checking
Validation checks are performed on the following fields.
Text Field
The text field can contain only a number of characters less than or equal to its size. The text fields are alphanumeric in some tables and alphabetic in other tables. An incorrect entry always flashes an error message.
Numeric Field:
The numeric field can contain only numbers from 0 to 9. An entry of any other character flashes an error message. The individual modules are checked for accuracy and for what they have to perform. Each module is subjected to a test run along with sample data. The individually tested modules are integrated into a single system. Testing involves executing the program with real data; the existence of any program defect is inferred from the output. The testing should be planned so that all the requirements are individually tested.
A successful test is one that brings out the defects for inappropriate data and produces an output revealing the errors in the system.
It is difficult to obtain live data in sufficient amounts to conduct extensive testing. And,
although it is realistic data that will show how the system will perform for the typical processing
requirement, assuming that the live data entered are in fact typical, such data generally will not
test all combinations or formats that can enter the system. This bias toward typical values then
does not provide a true systems test and in fact ignores the cases most likely to cause system
failure.
The most effective test programs use artificial test data generated by persons other than
those who wrote the programs. Often, an independent team of testers formulates a testing plan,
using the systems specifications.
The Multiple Disease Prediction package has satisfied all the requirements specified in the software requirement specification and was accepted.
USER TRAINING
Whenever a new system is developed, user training is required to educate them about the
working of the system so that it can be put to efficient use by those for whom the system has
been primarily designed. For this purpose the normal working of the project was demonstrated
to the prospective users. Its working is easily understandable and since the expected users are
people who have good knowledge of computers, the use of this system is very easy.
MAINTENANCE
This covers a wide range of activities including correcting code and design errors. To
reduce the need for maintenance in the long run, we have more accurately defined the user’s
requirements during the process of system development. Depending on the requirements, this
system has been developed to satisfy the needs to the largest possible extent. With development
in technology, it may be possible to add many more features based on the requirements in future.
The coding and designing is simple and easy to understand which will make maintenance easier.
TESTING STRATEGY
A strategy for system testing integrates system test cases and design techniques into a
well-planned series of steps that results in the successful construction of software. The testing
strategy must incorporate test planning, test case design, test execution, and the resultant data
collection and evaluation. A strategy for software testing must accommodate low-level tests
that are necessary to verify that a small source code segment has been correctly implemented
as well as high level tests that validate major system functions against user requirements.
Software testing is a critical element of software quality assurance and represents the
ultimate review of specification, design and coding. Testing represents an interesting anomaly for software development. Thus, a series of tests is performed on the proposed system before the system is ready for user acceptance testing.
SYSTEM TESTING
Software once validated must be combined with other system elements (e.g. Hardware,
people, databases). System testing verifies that all the elements are properly combined and that the overall system function and performance is achieved. It also tests to find discrepancies between the system and its original objectives.
6.4 Manual Testing
Test Case for Brain Disease Prediction
Test Description: The user enters the symptoms. The user answers the sub-questions.
Actions: The user checks the symptoms and attributes and presses the submit button for diagnosis.
Table 6.1
Table 6.1 shows the test case for the values entered. The main check is that the entered values are compared with the dataset values; if the values match, the test is passed.
Test Description: The user enters the symptoms. The user answers the sub-questions.
Actions: The user checks the symptoms and attributes and presses the submit button for diagnosis.
Table 6.2
Table 6.2 shows the test case for the values entered. The main check is that the entered values are compared with the dataset values; if the values match, the test is passed.
Test Description: The user enters the symptoms. The user answers the sub-questions.
Actions: The user checks the symptoms and attributes and presses the submit button for diagnosis.
Table 6.3
Table 6.3 shows the test case for the values entered. The main check is that the entered values are compared with the dataset values; if the values match, the test is passed.
CHAPTER 7
RESULTS
7.1 DIABETES PREDICTION
CHAPTER 8
CONCLUSION
While there are challenges and limitations to the use of machine learning in healthcare,
such as the risk of bias and the need for diverse and representative data, ongoing research and
development in this field is helping to address these challenges and unlock the full potential of
multiple disease prediction using machine learning.
As technology continues to evolve and more data becomes available, it is likely that
machine learning algorithms will become increasingly sophisticated and accurate, leading to
even better patient outcomes and more personalized medicine. Multiple disease prediction using
machine learning has the potential to transform healthcare, and it is an exciting area of research
that holds great promise for the future.
CHAPTER 9
FUTURE WORK
Addressing data bias: As with all machine learning algorithms, bias in the training data
can lead to inaccurate predictions and perpetuate health disparities. Future work should focus
on developing methods to address and mitigate data bias, such as using more diverse and
representative datasets, and incorporating fairness and equity considerations into the algorithm
development process.
CHAPTER 10
REFERENCES