Diabetes Prediction Using Machine Learning (Batch 34)


Contents

S. No Topic

1. Introduction
2. System Analysis
2.1 Existing System
2.2 Proposed System
2.3 Algorithms
3. Requirements Analysis
3.1 Preliminary Investigation
3.2 Requirements clarification
3.3 Feasibility Study
3.4 Functional Requirements
3.5 Non-Functional Requirements
3.6 System Requirements
3.6.1 Software Requirements
3.6.2 Hardware Requirements
3.7 Architecture Diagram
4. System Design
4.1 Input Design
4.2 Output design
4.3 UML Diagrams
List of Figures:
4.3.1 Use case diagram
4.3.2 Class Diagram
4.3.3 Sequence Diagram
4.3.4 Collaboration Diagram
4.3.5 Activity Diagram
4.3.6 State chart diagram
4.3.7 Component Diagram


4.3.8 Deployment Diagram

5. Module Description
6. Technologies
6.1 Software Environment
6.2 Python features
6.3 Python Applications
7. Coding
8. Screen shots
9. System Testing
9.1 Introduction
9.2 Types of tests
10. Conclusion
11. References
ABSTRACT

A major challenge facing healthcare organizations is the provision of quality services at affordable
costs. Quality service implies diagnosing patients correctly and administering treatments that are
effective. Poor clinical decisions can lead to disastrous consequences and are therefore
unacceptable. Hospitals must also minimize the cost of clinical tests. They can achieve these results
by employing appropriate computer-based information and/or decision support systems. How can
we turn data into useful information that enables healthcare practitioners to make intelligent
clinical decisions? Data mining combines statistical analysis, machine learning and database
technology to extract hidden patterns and relationships from large databases. The two most common
modeling objectives are classification and prediction. Classification models predict categorical
labels (discrete, unordered) while prediction models predict continuous-valued functions. Decision
Trees and Support Vector Machines are commonly used for classification, while Regression,
Association Rules and Clustering are used for prediction.


Chapter-1

1. INTRODUCTION

Data Mining is the process of extracting hidden knowledge from large volumes of raw data.
The knowledge must be new, not obvious, and usable. Data mining has been defined as "the
nontrivial extraction of previously unknown, implicit and potentially useful information from data"
and as "the science of extracting useful information from large databases". It is one of the tasks in
the process of knowledge discovery from databases. [1] Data mining is used to discover knowledge
from data and to present it in a form that is easily understood by humans. It is a process for examining
large amounts of routinely collected data. Data mining is most useful in exploratory analysis because
of the nontrivial information hidden in large volumes of data. It is a cooperative effort of humans and
computers: the best results are achieved by balancing the knowledge of human experts in describing
problems and goals with the search capabilities of computers. The two primary goals of data mining
are prediction and description. Prediction uses some variables or fields in the data set to predict
unknown or future values of other variables of interest. Description, on the other hand, focuses on
finding patterns describing the data that can be interpreted by humans. Disease prediction plays an
important role in data mining.

Insulin is one of the most important hormones in the body. It aids the body in converting
sugar, starches and other food items into the energy needed for daily life. However, if the body does
not produce or properly use insulin, the excess sugar is driven out by urination. This condition is
referred to as diabetes. The cause of diabetes is a mystery, although obesity and lack of exercise
appear to play significant roles. According to the American Diabetes Association [4], in November
2007, 20.8 million children and adults in the United States (approximately 7% of the population)
had been diagnosed with diabetes. The ability to diagnose diabetes early plays an important role in
the patient's treatment process. In [18] the author predicts whether a new patient would test positive
for diabetes. That work studied a new approach, called the Homogeneity-Based Algorithm (HBA),
to optimally control the overfitting and overgeneralization behaviors of classification on the Pima
Indian diabetes data set. The HBA is used in conjunction with classification approaches (such as
Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), or Decision Trees (DTs)) to
enhance their classification accuracy. The experimental results seem to indicate that the proposed
approach significantly outperforms current approaches. From the experiments the author concluded
that the approach is important both for accurately predicting diabetes and for the data mining
community in general. In [19] data mining algorithms are used for testing the accuracy of predicting
diabetic status. Fuzzy systems have been used for solving a wide range of problems in different
application domains, with a Genetic Algorithm used for their design; fuzzy systems allow the
introduction of learning and adaptation capabilities, and neural networks are efficiently used for
learning membership functions. Diabetes occurs throughout the world, but Type 2 is more common
in the more developed countries. The author concluded that chromosome optimization using the GA
is achieved and that, based on the diabetes rate in the old population, diabetes can be restricted in the
new population to obtain accuracy.


CHAPTER 2
2. SYSTEM ANALYSIS
2.1 Existing System

Diabetes used to be a common problem mainly among adults, specifically middle-aged people, but
due to changing lifestyles it now affects children too. Type 1 diabetes is unpreventable because of
the various external environmental stimulants which result in the destruction of the body's insulin-
producing cells. However, changing one's lifestyle to achieve the required body weight and sufficient
physical activity can help prevent type 2 diabetes from developing. Diabetes is a chronic health
problem with devastating, yet preventable, consequences. The current process of carrying out this
activity is manual, which makes data analysis inflexible for doctors, and the transmission of
information is not transparent.

2.2 Proposed System

Diabetes is divided into two distinct types: type 1 diabetes enforces the need for artificially infusing
insulin through medicines or injections, while in type 2 diabetes the pancreas produces insulin but
the body does not use it effectively. This research work was conducted on the design and
implementation of a diabetes prediction system, with Fudawa Health Centre as a case study. This
research will help automate the prediction of diabetes even before clinicians arrive. Data mining
is a relatively new concept used for retrieving information from a large set of data. Mining means
using available data and processing it in such a way that it is useful for decision-making. Data
mining is the process of discovering patterns in large data sets involving methods at the intersection
of machine learning, statistics, and database systems. Machine Learning is an interdisciplinary
subfield of computer science and statistics with an overall goal to extract information (with
intelligent methods) from a data set and transform the information into a comprehensible structure
for further use.

2.3 Algorithms:

Def: Machine learning is the study of algorithms that improve their performance at some task with
experience.
Types:

 Supervised learning (training data + desired outputs(labels)).


 Unsupervised learning (training data without desired outputs).
 Semi-supervised learning (training data + a few desired outputs).
 Reinforcement learning (rewards from sequence of actions).

Supervised Algorithm:

Supervised learning is when the model is trained on a labelled dataset. A labelled dataset is one
which has both input and output parameters. In this type of learning, both the training and
validation datasets are labelled.

Types of Supervised Learning:

Classification: It is a supervised learning task where the output has defined labels (discrete
values). For example, an output variable Purchased may have the defined labels 0 and 1, where
1 means the customer will purchase and 0 means the customer won't purchase. The goal here is to
predict discrete values belonging to a particular class and to evaluate the model on the basis of accuracy.
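
For illustration, here is a minimal sketch of such a labelled dataset and a classifier trained on it; the feature names, values and the choice of a Decision Tree are assumptions for illustration, not taken from the document's dataset.

```python
# A minimal, hypothetical labelled dataset: inputs (Age, Salary) and output label Purchased (0/1).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[25, 30000], [40, 60000], [35, 45000], [50, 80000]])  # input parameters
y = np.array([0, 1, 0, 1])                                          # output labels: Purchased

clf = DecisionTreeClassifier().fit(X, y)   # train on the labelled data
print(clf.predict([[30, 52000]]))          # predict the label for a new, unseen example
```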


Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used
for both classification and regression challenges. However, it is mostly used in classification
problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where
n is the number of features) with the value of each feature being the value of a particular
coordinate. Then, we perform classification by finding the hyperplane that best differentiates the
two classes. SVM is an efficient classification method when the feature vector is high dimensional.
In scikit-learn, we can specify the kernel function.

Working:

Suppose we have a dataset that has two tags (green and blue), and the dataset has two features,
x1 and x2, so the data lies in a 2-D space. The hyperplane with the maximum margin is called
the optimal hyperplane. If the data is linearly arranged, we can separate it using a straight line,
but for non-linear data we cannot draw a single straight line.
So, to separate these data points, we need to add one more dimension. For linear data we have
used the two dimensions x and y, so for non-linear data we add a third dimension z, calculated as:

z = x² + y²

With this transformation, SVM divides the dataset into classes. Since we are now in a 3-D space,
the separating surface looks like a plane parallel to the x-axis. If we convert it back to 2-D space
with z = 1, we get a circle of radius 1 that separates the non-linear data.
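
A minimal sketch of SVM classification in scikit-learn follows; the synthetic two-feature data is assumed for illustration (it is not the project's dataset), and the kernel parameter mentioned above is set explicitly.

```python
# A minimal SVM classification sketch with scikit-learn on synthetic two-feature data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The kernel can be specified: 'linear' looks for a separating straight line/hyperplane,
# while 'rbf' handles non-linearly separable data.
model = SVC(kernel='rbf')
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```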

KNN (k-nearest neighbours) classifier – KNN or k-nearest neighbours is the simplest classification
algorithm. This classification algorithm does not depend on the structure of the data. Whenever a
new example is encountered, its k nearest neighbours from the training data are examined. Distance
between two examples can be the Euclidean distance between their feature vectors.


Step-1: Select the number K of the neighbours

Step-2: Calculate the Euclidean distance of K number of neighbours

Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.

Step-4: Among these k neighbours, count the number of the data points in each category.

Step-5: Assign the new data points to that category for which the number of the neighbour is
maximum.

Step-6: Our model is ready.

Working:

Firstly, we choose the number of neighbours; here we choose k = 5. Next, we calculate the
Euclidean distance between the data points. The Euclidean distance is the distance between two
points, which we have already studied in geometry. Suppose that, after calculating the Euclidean
distances, we find three nearest neighbours in category A and two nearest neighbours in category B.
Since the majority of the nearest neighbours are from category A, the new data point must belong
to category A.
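
A minimal sketch of the KNN classifier in scikit-learn with k = 5 and Euclidean distance; the built-in breast cancer dataset is used only as a stand-in, not the project's diabetes data.

```python
# A minimal KNN classification sketch with scikit-learn (k = 5, Euclidean distance).
from sklearn.datasets import load_breast_cancer   # stand-in dataset for illustration
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')  # choose K and the distance
knn.fit(X_train, y_train)                                      # "training" just stores the data
print("Test accuracy:", knn.score(X_test, y_test))
```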

Naive Bayes Classifier Algorithm: Naive Bayes algorithm is a supervised learning algorithm, which
is based on Bayes theorem and used for solving classification problems.

It is mainly used in text classification that includes a high-dimensional training dataset.

Naive Bayes Classifier is one of the simplest and most effective classification algorithms; it
helps in building fast machine learning models that can make quick predictions.

It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.

Some popular applications of the Naive Bayes algorithm are spam filtering, sentiment analysis, and
classifying articles.

The Naive Bayes algorithm comprises the two words Naive and Bayes, which can be described as follows:

Naive: It is called Naive because it assumes that the occurrence of a certain feature is independent
of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape
and taste, then a red, spherical and sweet fruit is recognized as an apple; each feature
individually contributes to identifying it as an apple without depending on the others.

Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.

The formula for Bayes' theorem is given as:

P(A|B) = ( P(B|A) * P(A) ) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day according to
the weather conditions. To solve this problem, we need to follow the steps below:

Convert the given dataset into frequency tables.

Generate Likelihood table by finding the probabilities of given features.

Now, use Bayes theorem to calculate the posterior probability.
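
A minimal sketch of these steps using scikit-learn's CategoricalNB; the weather outlooks and Play labels below are illustrative values, not taken from the document.

```python
# A minimal Naive Bayes sketch for the weather/"Play" example (illustrative data).
from sklearn.naive_bayes import CategoricalNB

outlooks = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast", "Sunny", "Rainy"]
play     = ["No",    "No",    "Yes",      "Yes",   "No",    "Yes",      "Yes",   "Yes"]

codes = {"Sunny": 0, "Overcast": 1, "Rainy": 2}   # frequency/likelihood tables are built over these codes
X = [[codes[o]] for o in outlooks]                # one categorical feature, integer-encoded
y = play

model = CategoricalNB()
model.fit(X, y)                                   # learns the prior P(Play) and likelihoods P(Outlook | Play)

print(model.predict([[codes["Sunny"]]]))          # posterior-based decision for a sunny day
```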

Random Forest Algorithm:

Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each
tree and, based on the majority vote of those predictions, produces the final output.

A greater number of trees in the forest leads to higher accuracy and helps prevent overfitting.

There should be some actual values in the feature variables of the dataset so that the classifier can
predict accurate results rather than guessed results.
The predictions from each tree must have very low correlations.

It takes less training time compared to other algorithms.

It predicts output with high accuracy, and even for large datasets it runs efficiently.

It can also maintain accuracy when a large proportion of data is missing.

Working:

Random Forest works in two phases: the first is to create the random forest by combining N decision
trees, and the second is to make predictions using each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority votes.
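
A minimal sketch of this two-phase workflow with scikit-learn's RandomForestClassifier, assuming synthetic data; n_estimators plays the role of N.

```python
# A minimal Random Forest sketch: build N trees on random subsets, then take a majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Phase 1: build N decision trees on bootstrap subsets of the training data.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Phase 2: each tree votes for a class; the majority vote is the final prediction.
print("Test accuracy:", forest.score(X_test, y_test))
```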


CHAPTER 3

3. REQUIREMENTS ANALYSIS

3.1 PRELIMINARY INVESTIGATION

The first and foremost strategy for development of a project starts from the idea of
designing a mail-enabled platform for a small firm in which it is easy and convenient to send and
receive messages; there is a search engine, an address book, and also some entertaining
games. Once it is approved by the organization and our project guide, the first activity, i.e.
preliminary investigation, begins. The activity has three parts:

1. Request Clarification

2. Feasibility Study

3. Request Approval

3.2 REQUEST CLARIFICATION

After the approval of the request to the organization and project guide, with an investigation
being considered, the project request must be examined to determine precisely what the system
requires.

Here our project is basically meant for users within the company whose systems can be
interconnected by a Local Area Network (LAN). In today's busy schedule, people need everything
to be provided in a ready-made manner. So, considering the vast use of the net in day-to-day life,
the corresponding portal was developed.

3.3 FEASIBILITY STUDY:

An important outcome of the preliminary investigation is the determination that the system request is
feasible. This is possible only if it is feasible within limited resources and time. The different
feasibilities that have to be analyzed are

 Operational Feasibility
 Economic Feasibility
 Technical Feasibility
Operational Feasibility
Operational feasibility deals with the study of the prospects of the system to be developed. This
system operationally eliminates all the tensions of the Admin and helps him in effectively tracking
the project progress. This kind of automation will surely reduce the time and energy which were
previously consumed by manual work. Based on the study, the system is proved to be operationally
feasible.

Economic Feasibility

Economic feasibility, or cost-benefit analysis, is an assessment of the economic justification for a
computer-based project. As the hardware was installed from the beginning and serves many purposes,
the hardware cost of the project is low. Since the system is network based, any number of
employees connected to the LAN within the organization can use this tool at any time. The
Virtual Private Network is to be developed using the existing resources of the organization. So the
project is economically feasible.

Technical Feasibility
According to Roger S. Pressman, technical feasibility is the assessment of the technical resources
of the organization. The organization needs IBM-compatible machines with a graphical web
browser connected to the Internet and intranet. The system is developed for a platform-independent
environment. Java Server Pages, JavaScript, HTML, SQL Server and WebLogic Server are used to
develop the system. The technical feasibility study has been carried out. The system is technically
feasible for development and can be developed with the existing facilities.

REQUEST APPROVAL

Not all requested projects are desirable or feasible. Some organizations receive so many
project requests from client users that only a few of them are pursued. However, those projects that
are both feasible and desirable should be put into the schedule. After a project request is approved,
its cost, priority, completion time and personnel requirements are estimated and used to determine
where to add it to the project list. Only after approval of the above factors can development work
be launched.


3.4 FUNCTIONAL REQUIREMENTS

In software engineering, a functional requirement defines a function of a software system or
its component. A function is described as a set of inputs, the behavior, and outputs. Functional
requirements may be calculations, technical details, data manipulation and processing, and other
specific functionality that define what a system is supposed to accomplish. Behavioral requirements,
describing all the cases where the system uses the functional requirements, are captured in use cases.
Generally, functional requirements are expressed in the form "the system shall do <requirement>".
The plan for implementing functional requirements is detailed in the system design. In requirements
engineering, functional requirements specify particular results of a system. Functional requirements
drive the application architecture of a system. A requirements analyst generates use cases after
gathering and validating a set of functional requirements. The hierarchy of functional
requirements is: user/stakeholder request -> feature -> use case -> business rule.

In this project, the functional requirements include the technical details, data manipulation and
other specific functionality needed to provide the information to the user.

3.5 NON-FUNCTIONAL REQUIREMENTS

In systems engineering and requirements engineering, a non-functional requirement
is a requirement that specifies criteria that can be used to judge the operation of a system,
rather than specific behaviors.

The project's non-functional requirements include the following:

Updating work status.

Problem resolution.

Error occurrence in the system.

Customer requests.

Availability: A system's "availability" or "uptime" is the amount of time that it is operational
and available for use. It relates to whether the server is providing the service to the users in
displaying images. As our system may be used by thousands of users at any time, it must always
be available. If there are any updates, they must be performed in a short interval of time without
interrupting the normal services made available to the users.

Efficiency: Specifies how well the software utilizes scarce resources: CPU cycles, disk
space, memory, bandwidth etc. All of the above mentioned resources can be effectively used
by performing most of the validations at client side and reducing the workload on server by
using JSP instead of CGI which is being implemented now.

Flexibility: If the organization intends to increase or extend the functionality of the software
after it is deployed, that should be planned from the beginning; it influences choices made during the
design, development, testing and deployment of the system. New modules can be easily
integrated into our system without disturbing the existing modules or modifying the logical
database schema of the existing applications.

Portability: Portability specifies the ease with which the software can be installed on all
necessary platforms, and the platforms on which it is expected to run. By using appropriate
server versions released for different platforms, our project can be easily operated on any
operating system and hence can be said to be highly portable.

Scalability: Software that is scalable has the ability to handle a wide variety of system
configuration sizes. The nonfunctional requirements should specify the ways in which the
system may be expected to scale up (by increasing hardware capacity, adding machines etc.).
Our system is easily expandable. Any additional requirements, such as hardware or
software that increase the performance of the system, can be easily added. An additional
server would be useful to speed up the application.

Integrity: Integrity requirements define the security attributes of the system, restricting
access to features or data to certain users and protecting the privacy of data entered into the
software. Access to certain features, such as adding file details and searching, must be disabled
for normal users, as these are the sole responsibility of the server. Access can be restricted by
providing appropriate logins to the users for access only.

Usability: Ease-of-use requirements address the factors that constitute the capacity of the
software to be understood, learned, and used by its intended users. Hyperlinks will be
provided for each and every service the system provides through which navigation will be
easier. A system that has a high usability coefficient makes the work of the user easier.


Performance: The performance constraints specify the timing characteristics of the software.

3.6 SYSTEM REQUIREMENTS

3.6.1 H/W System Configuration:

 Processor - Intel i3
 RAM - 4 GB (minimum)
 Hard Disk - 320 GB

3.6.2 Software Requirements:

 Operating System - Windows 10
 Tools - PyCharm / Jupyter
 Programming Language - Python
Chapter-4

4. SYSTEM DESIGN

4.1 INPUT DESIGN

Input design plays a vital role in the life cycle of software development and requires very
careful attention from developers. The aim of input design is to feed data to the application as
accurately as possible, so inputs are supposed to be designed effectively so that errors occurring
while feeding data are minimized. According to software engineering concepts, the input forms or
screens are designed to provide validation control over the input limit, range and other related
validations.

This system has input screens in almost all the modules. Error messages are developed to
alert the user whenever he commits a mistake and to guide him in the right way so that invalid
entries are not made. Let us see this in more depth under module design.

Input design is the process of converting user-created input into a computer-based
format. The goal of input design is to make data entry logical and free from errors. Errors
in the input are controlled by the input design. The application has been developed in a user-
friendly manner. The forms have been designed in such a way that, during processing, the cursor is
placed in the position where data must be entered. The user is also provided with an option to select
an appropriate input from various alternatives related to the field in certain cases.

Validations are required for each data item entered. Whenever a user enters erroneous data,
an error message is displayed, and the user can move on to the subsequent pages after completing
all the entries on the current page.

4.2 OUTPUT DESIGN
The Output from the computer is required to mainly create an efficient method of
communication within the company primarily among the project leader and his team members, in
other words, the administrator and the clients. The output of VPN is the system which allows the
project leader to manage his clients in terms of creating new clients and assigning new projects to
them, maintaining a record of the project validity and providing folder level access to each client on
the user side depending on the projects allotted to him. After completion of a project, a new project
may be assigned to the client. User authentication procedures are maintained at the initial stages
itself. A new user may be created by the administrator himself or a user can himself register as a
new user but the task of assigning projects and validating a new user rests with the administrator
only.
19
20

The application starts running when it is executed for the first time. The server has to be started,
and then Internet Explorer is used as the browser. The project will run on the local area network, so
the server machine will serve as the administrator while the other connected systems act as the
clients. The developed system is highly user friendly and can be easily understood by anyone using
it, even for the first time.

4.3 Introduction to UML

UML is a method for describing the system architecture in detail using blueprints. UML
represents a collection of best engineering practices that have proven successful in the modeling of
large and complex systems. The UML is a very important part of developing object-oriented
software and the software development process. The UML uses mostly graphical notations to
express the design of software projects. Using the UML helps project teams communicate, explore
potential designs, and validate the architectural design of the software.
4.3.1 Use Case Diagram

The use case diagram represents the functionality of the system. Use cases focus on the behavior
of the system from an external point of view. Actors are external entities that interact with the system.

Fig: Use case diagram (actor: admin/user; use cases: collect dataset, preprocessing, visualization of data, build model, train the model, test the model, apply machine learning techniques, compute accuracy)

Use cases:

A use case describes a sequence of actions that provide something of measurable value to an
actor and is drawn as a horizontal ellipse.
Actors:
An actor is a person, organization, or external system that plays a role in one or more
interactions with the system.

System boundary boxes (optional):

A rectangle is drawn around the use cases, called the system boundary box, to indicate the
scope of the system. Anything within the box represents functionality that is in scope, and anything
outside the box is not.
Four relationships among use cases are often used in practice.
Include:


In one form of interaction, a given use case may include another. Include is a directed
relationship between two use cases, implying that the behavior of the included use case is inserted
into the behavior of the including use case.
The first use case often depends on the outcome of the included use case. This is useful for
extracting truly common behaviors from multiple use cases into a single description. The notation is
a dashed arrow from the including to the included use case, with the label "«include»". There are no
parameters or return values. To specify the location in a flow of events in which the base use case
includes the behavior of another, you simply write include followed by the name of the use case you
want to include, as in the following flow for track order.
Extend:
In another form of interaction, a given use case (the extension) may extend another. This
relationship indicates that the behavior of the extension use case may be inserted in the extended
use case under some conditions. The notation is a dashed arrow from the extension to the extended
use case, with the label "«extend»". Modelers use the «extend» relationship to indicate use cases
that are "optional" to the base use case.

Generalization:
In the third form of relationship among use cases, a generalization/specialization
relationship exists. A given use case may have common behaviors, requirements, constraints, and
assumptions with a more general use case. In this case, describe them once, and deal with it in the
same way, describing any differences in the specialized cases. The notation is a solid line ending in
a hollow triangle drawn from the specialized to the more general use case (following the standard
generalization notation).
Associations:
Associations between actors and use cases are indicated in use case diagrams by solid lines.
An association exists whenever an actor is involved with an interaction described by a use case.
Associations are modeled as lines connecting use cases and actors to one another, with an optional
arrowhead on one end of the line. The arrowhead is often used to indicate the direction of the
initial invocation of the relationship or to indicate the primary actor within the use case.
Identified Use Cases

The “user model view” encompasses a problem and solution from the perspective of
those individuals whose problem the solution addresses. The view presents the goals and
objectives of the problem owners and their requirements of the solution. This view is composed of
“use case diagrams”. These diagrams describe the functionality provided by a system to external
interactors. These diagrams contain actors, use cases, and their relationships.

4.3.2 Class Diagram:

Fig: Class diagram (classes: admin, with operations train the model(), test the model(), apply machine learning techniques(), compute accuracy(); dataset, with operations preprocessing(), processing(); database, with operations retrieve(), update(), append())

Class-based modeling, or more commonly class-orientation, refers to the style of object-
oriented programming in which inheritance is achieved by defining classes of objects, as opposed to
the objects themselves (compare prototype-based programming).

The most popular and developed model of OOP is a class-based model, as opposed to an
object-based model. In this model, objects are entities that combine state (i.e., data), behavior (i.e.,
procedures, or methods) and identity (unique existence among all other objects). The structure and
behavior of an object are defined by a class, which is a definition, or blueprint, of all objects of a
specific type. An object must be explicitly created based on a class and an object thus created is
considered to be an instance of that class. An object is similar to a structure, with the addition of
method pointers, member access control, and an implicit data member which locates instances of
the class (i.e. actual objects of that class) in the class hierarchy (essential for runtime inheritance
features).

4.3.3 Sequence Diagram:


Fig: Sequence diagram (lifelines: admin/user, interface, dataset, database; messages: collect dataset, store, check, build model, store, check, train the model, store, check, test the model, store, check, apply machine learning algorithm, store, check, make a prediction, store the results, check, display computed accuracy)

A sequence diagram in Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is a construct of a
Message Sequence Chart.

Sequence diagrams are sometimes called event diagrams, event scenarios, and timing
diagrams. A sequence diagram shows, as parallel vertical lines (lifelines), different processes or
objects that live simultaneously, and, as horizontal arrows, the messages exchanged between them,
in the order in which they occur. This allows the specification of simple runtime scenarios in a
graphical manner. If the lifeline is that of an object, it demonstrates a role. Note that leaving the
instance name blank can represent anonymous and unnamed instances. In order to display
interaction, messages are used. These are horizontal arrows with the message name written above
them. Solid arrows with full heads are synchronous calls, solid arrows with stick heads are
asynchronous calls and dashed arrows with stick heads are return messages. This definition is true
as of UML 2, considerably different from UML 1.x.
Activation boxes, or method-call boxes, are opaque rectangles drawn on top of lifelines to
represent that processes are being performed in response to the message (Execution Specifications
in UML).

Objects calling methods on themselves use messages and add new activation boxes on top of
any others to indicate a further level of processing. When an object is destroyed (removed from
memory), an X is drawn on top of the lifeline, and the dashed line ceases to be drawn below it (this
is not the case in the first example though). It should be the result of a message, either from the
object itself, or another.

A message sent from outside the diagram can be represented by a message originating from
a filled-in circle (found message in UML) or from a border of sequence diagram (gate in UML)

4.3.4 Collaboration Diagram:

A Sequence diagram is dynamic and, more importantly, is time ordered. A Collaboration
diagram is very similar to a Sequence diagram in the purpose it achieves; in other words, it shows
the dynamic interaction of the objects in a system. A distinguishing feature of a Collaboration
diagram is that it shows the objects and their association with other objects in the system apart from
how they interact with each other. The association between objects is not represented in a Sequence
diagram.

A Collaboration diagram is easily represented by modeling the objects in a system and
representing the associations between the objects as links. The interaction between the objects is
denoted by arrows. To identify the sequence of invocation of these objects, a number is placed next
to each of these arrows.

Defining a Collaboration Diagram:

A sophisticated modeling tool can easily convert a collaboration diagram into a sequence
diagram and vice versa. Hence, the elements of a Collaboration diagram are essentially the same
as those of a Sequence diagram.

4.3.5 Activity Diagram:

Fig: Activity diagram (flow: collect dataset, train the model, test the model, apply machine learning algorithms, compute accuracy)

Activity diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration and concurrency. In the Unified Modeling Language,
activity diagrams can be used to describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of control.

Activity diagrams are constructed from a limited repertoire of shapes, connected with arrows. The
most important shape types:

 rounded rectangles represent activities;


 diamonds represent decisions;
 bars represent the start (split) or end (join) of concurrent activities;
 a black circle represents the start (initial state) of the workflow;
 An encircled black circle represents the end (final state).
Arrows run from the start towards the end and represent the order in which activities
happen. However, the join and split symbols in activity diagrams only resolve this for simple cases;
the meaning of the model is not clear when they are arbitrarily combined with decisions or loops.

4.3.6 STATE CHART DIAGRAM:

Objects have behaviors and states. The state of an object depends on its current activity or
condition. A state chart diagram shows the possible states of the object and the transitions that cause
a change in state. A state diagram, also called a state machine diagram or state chart diagram, is an
illustration of the states an object can attain as well as the transitions between those states in the
Unified Modeling Language. A state diagram resembles a flowchart in which the initial state is
represented by a large black dot and subsequent states are portrayed as boxes with rounded corners.
There may be one or two horizontal lines through a box, dividing it into stacked sections. In that
case, the upper section contains the name of the state, the middle section (if any) contains the
state variables and the lower section contains the actions performed in that state. If there are no
horizontal lines through a box, only the name of the state is written inside it. External straight lines,
each with an arrow at one end, connect various pairs of boxes. These lines define the transitions
between states. The final state is portrayed as a large black dot with a circle around it. Historical
states are denoted as circles with the letter H inside.


Fig: State chart diagram (states: collect dataset, build model, train the model, test the model, apply machine learning techniques, compute accuracy)

4.3.7 Component Diagram:

COMPONENT LEVEL CLASS DESIGN


This chapter discusses the portion of the software development process where the design is
elaborated and the individual data elements and operations are designed in detail. First, different
views of a “component” are introduced. Guidelines for the design of object-oriented and traditional
(conventional) program components are presented.

What is a Component?

This section defines the term component and discusses the differences between the object-
oriented, traditional, and process-related views of component-level design. The Object Management
Group (OMG) UML specification defines a component as “… a modular, deployable, and replaceable
part of a system that encapsulates implementation and exposes a set of interfaces.”

An Object Oriented View

A component contains a set of collaborating classes. Each class within a component has
been fully elaborated to include all attributes and operations that are relevant to its implementation.
As part of the design elaboration, all interfaces (messages) that enable the classes to communicate
and collaborate with other design classes must also be defined. To accomplish this, the designer
begins with the analysis model and elaborates analysis classes (for components that relate to the
problem domain) and infrastructure classes (or
components that provide support services for the problem domain).

Fig: Component diagram (components: Admin/User, Dataset, Training and Testing the Dataset, Machine Learning Algorithms, Prediction)

4.3.8 Deployment Diagram:


Deployment diagrams are used to visualize the topology of the physical components of a
system where the software components are deployed.

So deployment diagrams are used to describe the static deployment view of a system. Deployment
diagrams consist of nodes and their relationships.

Purpose:

The name Deployment itself describes the purpose of the diagram. Deployment diagrams are used
for describing the hardware components where software components are deployed. Component
diagrams and deployment diagrams are closely related.

Component diagrams are used to describe the components and deployment diagrams shows how
they are deployed in hardware.

UML is mainly designed to focus on software artifacts of a system. But these two diagrams are
special diagrams used to focus on software components and hardware components.

So most of the UML diagrams are used to handle logical components but deployment diagrams are
made to focus on hardware topology of a system. Deployment diagrams are used by the system
engineers.

The purpose of deployment diagrams can be described as:

 Visualize hardware topology of a system.

 Describe the hardware components used to deploy software components.

 Describe runtime processing nodes.

How to draw Deployment Diagram?

A deployment diagram represents the deployment view of a system. It is related to the component
diagram because the components are deployed using the deployment diagrams. A deployment
diagram consists of nodes, which are nothing but the physical hardware used to deploy the application.

Deployment diagrams are useful for system engineers. An efficient deployment diagram is very
important because it controls the following parameters

 Performance
 Scalability

 Maintainability

 Portability

So before drawing a deployment diagram the following artifacts should be identified:

 Nodes

 Relationships among nodes

The following deployment diagram is a sample to give an idea of the deployment view of order
management system. Here we have shown nodes as:

 Monitor

 Modem

 Caching server

 Server

The application is assumed to be a web based application which is deployed in a clustered


environment using server 1, server 2 and server 3. The user is connecting to the application
using internet. The control is flowing from the caching server to the clustered environment.

So the following deployment diagram has been drawn considering all the points mentioned above:

Fig: Deployment diagram (nodes: Admin/User, Dataset, Analysis, Prediction)


Chapter-5

5. MODULES DESCRIPTION

The whole approach is depicted by the following flowchart.

Modules:

Data gathering:

The sample data has been collected, consisting of all the records from previous years. The
collected dataset consists of a number of instances describing the data.

Pre processing:

Data pre-processing is a technique used to convert raw data into a clean dataset. The data
gathered from different sources is in a raw format which is not feasible for analysis. Pre-
processing for this approach takes four simple yet effective steps; a short code sketch of these steps
follows the list below.
 Attribute selection: Some of the attributes in the initial dataset that were not pertinent
(relevant) to the experiment's goal were ignored.
 Cleaning missing values: In some cases the dataset contains missing values, and we need to be
equipped to handle them when we come across them. We could remove the entire line of data,
but what if we inadvertently remove crucial information? One of the most common ways to
handle the problem is to take the mean of all the values in the same column and use it to replace
the missing data. The library used for this task is scikit-learn's preprocessing module, which
contains an Imputer class that helps us take care of the missing data.
 Training and test data: The next step is to split our dataset into two parts, a training set and a
test set. We will train our machine learning models on the training set, i.e. the models will try
to learn any correlations in the training set, and then we will test the models on the test set to
examine how accurately they predict. A general rule of thumb is to assign 80% of the
dataset to the training set and the remaining 20% to the test set.
 Feature scaling: The final step of data pre-processing is feature scaling. It is a method used to
standardize the range of independent variables or features of the data. Why is it necessary? A
lot of machine learning models are based on Euclidean distance. If, for example, the values in
one column (x) are much higher than the values in another column (y), then (x2 − x1)² will give
a far greater value than (y2 − y1)², so one squared difference dominates the other. In the machine
learning equations, the squared difference with the lower value will almost be treated as if it
does not exist in comparison to the far greater value. We do not want that to happen. That is why
it is necessary to transform all our variables onto the same scale.
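
A minimal sketch of these four pre-processing steps follows. The file name diabetes.csv and the Outcome column are assumptions, and SimpleImputer is used because it replaces the older Imputer class in newer scikit-learn versions.

```python
# A minimal pre-processing sketch: attribute selection, missing values, split, scaling.
import pandas as pd
from sklearn.impute import SimpleImputer            # replaces the older Imputer class
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("diabetes.csv")                  # hypothetical dataset file
X = data.drop(columns=["Outcome"])                  # attribute selection: keep relevant columns
y = data["Outcome"]

# Cleaning missing values: replace them with the mean of the same column.
X = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X), columns=X.columns)

# Training and test data: 80% for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling: bring all features onto the same scale.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)                   # reuse the training-set statistics
```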

Processing:

Classification of data is a two-phase process. In phase one, called the training phase, a classifier
is built using a training set of tuples. The second phase is the classification phase, where the testing
set of tuples is used to validate the model and the performance of the model is analyzed.

Interpretation:

The data set used is further split into two sets, with two thirds as the training set and one
third as the testing set. Of the algorithms applied, random forest showed the best results. The
efficiency of the approaches is compared in terms of accuracy. The accuracy of the prediction
model/classifier is defined as the proportion of correctly predicted/classified instances.
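
A minimal sketch of comparing the classifiers by accuracy on the held-out test set, reusing the variables from the pre-processing sketch above; which algorithm performs best depends on the actual data.

```python
# A minimal comparison sketch: train each classifier, then measure accuracy on the test set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)                  # training phase
    predictions = model.predict(X_test)          # classification phase
    print(name, "accuracy:", accuracy_score(y_test, predictions))
```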


K-Nearest Neighbor(KNN) Algorithm :

K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised
learning technique.

o K-NN algorithm assumes similarity between the new case/data and the available cases and
puts the new case into the category most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a well-
suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
o The KNN algorithm just stores the dataset during the training phase and, when it gets new
data, classifies that data into the category most similar to it.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the features of the
new image that are similar to the cat and dog images and, based on the most similar features,
will put it in either the cat or the dog category.
How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors

o Step-2: Calculate the Euclidean distance of K number of neighbors

o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

o Step-4: Among these k neighbors, count the number of the data points in each category.

o Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
o Step-6: Our model is ready.
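
A from-scratch sketch of Steps 1 to 5 for a single new point, with small illustrative arrays (not the project's data).

```python
# A minimal from-scratch KNN sketch implementing the steps above.
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, new_point, k=5):
    # Step 2: Euclidean distance from the new point to every training point.
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # Step 3: take the k nearest neighbours.
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count the categories among them and return the majority category.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 7.0], [7.0, 8.0]])
y_train = np.array(["A", "A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2.5, 3.0]), k=3))   # prints 'A'
```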


Chapter- 6

6. TECHNOLOGIES

6.1 SOFTWARE ENVIRONMENT:

ANACONDA

Anaconda is a complete, open source data science package with a community of over 6
million users. It is easy to download and install, and it is supported on Linux, macOS, and Windows.

The distribution comes with more than 1,000 data packages as well as the Conda package
and virtual environment manager, so it eliminates the need to learn to install each library
independently. As Anaconda's website says, “The Python and R conda packages in the Anaconda
Repository are curated and compiled in our secure environment so you get optimized binaries that
‘just work’ on your system”.
Fig: 3.1 Anaconda Distribution

What is Anaconda Navigator?

Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda®
distribution that allows you to launch applications and easily manage conda packages, environments,
and channels without using command-line commands. Navigator can search for packages on
Anaconda Cloud or in a local Anaconda Repository. It is available for Windows, macOS, and Linux.

Why use Navigator?

In order to run, many scientific packages depend on specific versions of other packages. Data
scientists often use multiple versions of many packages and use multiple environments to separate
these different versions.


The command-line program conda is both a package manager and an environment manager.
This helps data scientists ensure that each version of each package has all the dependencies it
requires and works correctly.

Navigator is an easy, point-and-click way to work with packages and environments without
needing to type conda commands in a terminal window. You can use it to find the packages you
want, install them in an environment, run the packages, and update them – all inside Navigator.

What applications can we access using Navigator?

The following applications are available by default in Navigator:

 Jupyter Notebook

 Spyder

 PyCharm

 VSCode

 Glueviz

 Orange 3 App

 RStudio

 Anaconda Prompt (Windows only)

 Anaconda PowerShell (Windows only)

 Jupyter Lab

 JupyterLab: This is an extensible working environment for interactive and reproducible
computing, based on the Jupyter Notebook and architecture.
 Qt Console: It is the PyQt GUI that supports inline figures, proper multiline editing with
syntax highlighting, graphical calltips and more.
 Spyder: Spyder is a scientific Python Development Environment. It is a powerful Python
IDE with advanced editing, interactive testing, debugging and introspection features.
 VS Code: It is a streamlined code editor with support for development operations like
debugging, task running and version control.
 Glueviz: This is used for multidimensional data visualization across files. It explores
relationships within and among related datasets.
 Orange 3: It is a component-based data mining framework. This can be used for data
visualization and data analysis. The workflows in Orange 3 are very interactive and
provide a large toolbox.
 Rstudio: It is a set of integrated tools designed to help you be more productive with R. It
includes R essentials and notebooks.

 Jupyter Notebook: This is a web-based, interactive computing notebook environment.
We can edit and run human-readable docs while describing the data analysis.

The Jupyter Notebook is an open source web application that you can use to create and
share documents that contain live code, equations, visualizations, and text. Jupyter Notebook is
maintained by the people at Project Jupyter.

Jupyter Notebooks are a spin-off project from the IPython project, which used to have an
IPython Notebook project itself. The name Jupyter comes from the core programming
languages that it supports: Julia, Python, and R. Jupyter ships with the IPython
kernel, which allows you to write your programs in Python, but there are currently over 100
other kernels that you can also use.

The Jupyter Notebook is not included with Python, so if you want to try it out, you will need
to install Jupyter.

There are many distributions of the Python language. This article will focus on just two of
them for the purposes of installing Jupyter Notebook. The most popular is CPython, which is the
reference version of Python that you can get from their website. It is also assumed that you are
using Python.

 PyCharm: It is the most popular IDE for Python, and includes great features such as
excellent code completion and inspection with an advanced debugger and support for web
programming and various frameworks. PyCharm is created by the Czech company JetBrains,
which focuses on creating integrated development environments for various web
development languages like JavaScript and PHP. PyCharm offers some of the best
features to its users and developers in the following aspects:
 Code completion and inspection.

 Advanced debugging.

 Support for web programming and frameworks such as Django and Flask.

Features of PyCharm

Besides, a developer will find PyCharm comfortable to work with because of the features mentioned
below −

 Code Completion: PyCharm enables smoother code completion whether it is for


built in or for an external package.

 SQL Alchemy as Debugger: You can set a breakpoint, pause in the debugger and
can see the SQL representation of the user expression for SQL Language code.
 Git Visualization in Editor: When coding in Python, queries are normal for a
developer. You can check the last commit easily in PyCharm as it has the blue sections
that can define the difference between the last commit and the current one.
 Code Coverage in Editor: You can run .py files outside the PyCharm editor as well, marking
them for code coverage, with details shown elsewhere in the project tree, in the summary
section, etc.
 Package Management: All the installed packages are displayed with proper visual
representation. This includes list of installed packages and the ability to search and add
new packages.
 Local History: PyCharm always keeps track of changes in a way that complements Git.
Local history in PyCharm gives complete details of what is needed to roll back and
what is to be added.
 Refactoring is the process of renaming one or more files at a time and PyCharm
includes various shortcuts for a smooth refactoring process.
 Wamp Server: WAMPs are packages of independently-created programs installed on
computers that use a Microsoft Windows operating system. Apache is a web server.
MySQL is an open-source database. PHP is a scripting language that can manipulate
information held in a database and generate web pages dynamically each time content is
requested by a browser. Other programs may also be included in a package, such as php
My Admin which provides a graphical user interface for the MySQL database manager,
or the alternative scripting languages Python or Perl.

Fig: 3.2 Anaconda Navigator

How can I run code with Navigator?

The simplest way is with Spyder. From the Navigator Home tab, click Spyder, and write and execute
your code.


You can also use Jupyter Notebooks the same way. Jupyter Notebooks are an increasingly popular
system that combine your code, descriptive text, output, images, and interactive interfaces into a
single notebook file that is edited, viewed, and used in a web browser.
LIBRARIES
Matplotlib:

 Matplotlib is a Python 2D plotting library which produces publication-quality figures in a
variety of hardcopy formats and interactive environments across platforms.
 Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter
notebook, web application servers, and four graphical user interface toolkits.

Fig: 3.3 Matplotlib images

 Matplotlib tries to make easy things easy and hard things possible.

 You can generate plots, histograms, power spectra, bar charts, error charts, scatterplots,
etc., with just a few lines of code.
 For simple plotting, the pyplot module provides a MATLAB-like interface, particularly
when combined with IPython.


 For the power user, you have full control of line styles, font properties, axes properties,
etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.
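
As an illustrative sketch (not part of the project code), a simple pyplot figure can be produced in a few lines; the data here are made up for demonstration:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)      # 100 evenly spaced sample points
plt.plot(x, np.sin(x), label='sin(x)')  # a basic line plot
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.title('A simple Matplotlib plot')
plt.legend()
plt.show()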

Numpy:

NumPy is the fundamental package for scientific computing with Python. It contains among other
things:

 a powerful N-dimensional array object

 sophisticated (broadcasting) functions

 tools for integrating C/C++ and Fortran code

 useful linear algebra, Fourier transform, and random number capabilities

 Besides its obvious scientific uses, NumPy can also be used as an efficient multi-
dimensional container of generic data. Arbitrary data-types can be defined. This allows
NumPy to seamlessly and speedily integrate with a wide variety of databases.
 NumPy is licensed under the BSD license, enabling reuse with few restrictions.
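
The following short sketch (illustrative values only) shows the N-dimensional array object, broadcasting, and a few of the linear algebra and random number capabilities mentioned above:

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # a 2-D array object
b = np.array([10.0, 20.0])

print(a + b)              # broadcasting: b is added to every row of a
print(a @ a)              # matrix multiplication
print(np.linalg.inv(a))   # linear algebra: matrix inverse
print(np.random.rand(3))  # three uniform random numbers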

Pandas:

History of development

In 2008, pandas development began at AQR Capital Management. By the end of 2009 it had been
open sourced, and is actively supported today by a community of like-minded individuals around the
world who contribute their valuable time and energy to help make open source pandas possible.

Since 2015, pandas is a NumFOCUS sponsored project. This will help ensure the success of
development of pandas as a world-class open-source project.

Timeline

 2008: Development of pandas started


 2009: pandas becomes open source

 2012: First edition of Python for Data Analysis is published

 2015: pandas becomes a NumFOCUS sponsored project

 2018: First in-person core developer sprint

Library Highlights

 A fast and efficient DataFrame object for data manipulation with integrated indexing.

 Tools for reading and writing data between in-memory data structures and different
formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5
format;

 Intelligent data alignment and integrated handling of missing data: gain automatic
label-based alignment in computations and easily manipulate messy data into an
orderly form.
 Flexible reshaping and pivoting of data sets.

 Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.

 Columns can be inserted and deleted from data structures for size mutability.

 Aggregating or transforming data with a powerful groupby engine, allowing split-apply-combine
operations on data sets.
 High performance merging and joining of data sets.

 Hierarchical axis indexing provides an intuitive way of working with high-dimensional
data in a lower-dimensional data structure.
 Time series-functionality: date range generation and frequency conversion, moving
window statistics, date shifting and lagging. Even create domain-specific time offsets
and join time series without losing data.
 Highly optimized for performance, with critical code paths written in Cython or C.

 Python with pandas is in use in a wide variety of academic and commercial domains,
including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
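
A minimal sketch of a few of these highlights, using a small hypothetical DataFrame rather than the project dataset:

import pandas as pd

df = pd.DataFrame({
    'Glucose': [148, 85, 183, 89],   # made-up values for illustration
    'BMI': [33.6, 26.6, 23.3, 28.1],
    'class': [1, 0, 1, 0],
})

print(df.describe())                          # quick summary statistics
print(df.isnull().sum())                      # integrated missing-data handling
print(df.groupby('class')['Glucose'].mean())  # split-apply-combine with groupby
df.to_csv('summary.csv')                      # writing data back out to CSV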


Mission

Pandas aims to be the fundamental high-level building block for doing practical, real world data
analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible
open source data analysis / manipulation tool available in any language.

Vision

 Accessible to everyone

 Free for users to use and modify

 Flexible

 Powerful

 Easy to use

 Fast

Values

It is core to pandas to be respectful of and welcoming to everybody: users, contributors,
and the broader community, regardless of level of experience, gender, gender identity and
expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age,
religion, or nationality.

Regex:

 A regular expression, regex or regexp (sometimes called a rational expression) is a
sequence of characters that define a search pattern.
 Usually such patterns are used by string searching algorithms for "find" or "find and
replace" operations on strings, or for input validation.
 It is a technique developed in theoretical computer science and formal language theory.

 Regular expressions are used in search engines, search and replace dialogs of word
processors and text editors, in text processing utilities such as sed and AWK and
in lexical analysis.
 Many programming languages provide regex capabilities either built-in or via libraries.
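
A short Python sketch of the "find" and "find and replace" operations described above, using the standard re module and a made-up string:

import re

text = "Patient id: 1023, glucose: 148 mg/dL"

print(re.findall(r'\d+', text))                    # find all runs of digits
print(re.sub(r'\d+', '#', text))                   # find and replace: mask the digits
print(bool(re.fullmatch(r'\d+(\.\d+)?', '33.6')))  # input validation: numeric check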
Requests:

 Requests is a Python HTTP library, released under the Apache2 License.

 The goal of the project is to make HTTP requests simpler and more human-friendly.

 The current version is 2.22.0

 The requests library is the de facto standard for making HTTP requests in Python.

 It abstracts the complexities of making requests behind a beautiful, simple API so that
you can focus on interacting with services and consuming data in your application.
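
A minimal sketch of a GET request with the library; the URL below is a placeholder, not a real service used by this project:

import requests

response = requests.get('https://api.example.com/data', timeout=10)  # hypothetical endpoint

print(response.status_code)                  # e.g. 200 on success
print(response.headers.get('Content-Type'))  # response metadata
# data = response.json()                     # parse a JSON body, if the service returns one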

Scikit-learn:

 Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine
learning library for the Python programming language.
 It features various classification, regression and clustering algorithms including support
vector machines, random forests, gradient boosting, k-means and DBSCAN, and is
designed to interoperate with the Python numerical and scientific
libraries NumPy and SciPy.
 Scikit-learn is largely written in Python, and uses numpy extensively for high-
performance linear algebra and array operations.

 Furthermore, some core algorithms are written in Cython to improve performance.

 Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic
regression and linear support vector machines by a similar wrapper around LIBLINEAR.

 In such cases, extending these methods with Python may not be possible.

 Scikit-learn integrates well with many other Python libraries, such as matplotlib and
plotly for plotting, numpy for array vectorization, pandas dataframes, scipy, and many
more.
 Scikit-learn is one of the most popular machine learning libraries on GitHub.
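
As an illustrative sketch of the fit/predict workflow (using scikit-learn's built-in iris dataset rather than the project data):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # a small built-in dataset
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = SVC(kernel='linear')                      # a support vector classifier
model.fit(x_train, y_train)                       # train on the training split
print(accuracy_score(y_test, model.predict(x_test)))  # evaluate on the held-out split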


SciPy:

 SciPy is a free and open-source Python library used for scientific computing and
technical computing.
 SciPy contains modules for optimization, linear algebra, integration, interpolation, special
functions, FFT, signal and image processing, ODE solvers and other tasks common in
science and engineering.
 SciPy builds on the NumPy array object and is part of the NumPy stack which includes
tools like Matplotlib, pandas and SymPy, and an expanding set of scientific computing
libraries.
 This NumPy stack has similar users to other applications such as MATLAB, GNU
Octave, and Scilab.
 The NumPy stack is also sometimes referred to as the SciPy stack.

 SciPy is also a family of conferences for users and developers of these tools: SciPy (in the
United States), EuroSciPy (in Europe) and SciPy.in (in India).
 Enthought originated the SciPy conference in the United States and continues to sponsor
many of the international conferences as well as host the SciPy website.
 The SciPy library is currently distributed under the BSD license, and its development is
sponsored and supported by an open community of developers.
 It is also supported by NumFOCUS, a community foundation for supporting reproducible
and accessible science.
 The basic data structure used by SciPy is a multidimensional array provided by
the NumPy module
 NumPy provides some functions for linear algebra, Fourier transforms, and random
number generation, but not with the generality of the equivalent functions in SciPy.
 NumPy can also be used as an efficient multidimensional container of data with
arbitrary datatypes.
 This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

 Older versions of SciPy used Numeric as an array type, which is now deprecated in
favor of the newer NumPy array code.
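
A brief sketch of two of the modules mentioned above, optimization and integration (illustrative values only):

import numpy as np
from scipy import optimize, integrate

# Minimise a simple quadratic; the optimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)

# Numerically integrate sin(x) from 0 to pi; the exact answer is 2.
value, error = integrate.quad(np.sin, 0, np.pi)
print(value)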

Python Introduction

Python is a general-purpose, dynamic, high-level, interpreted programming language. It
supports an object-oriented programming approach to developing applications. It is simple and
easy to learn and provides lots of high-level data structures.

Python is an easy-to-learn yet powerful and versatile scripting language, which makes it
attractive for application development.

Python's syntax and dynamic typing, together with its interpreted nature, make it an ideal
language for scripting and rapid application development.

Python supports multiple programming paradigms, including object-oriented, imperative,
functional, and procedural styles.

Python is not restricted to one special area such as web programming. That is why it is
known as a multipurpose language: it can be used for web, enterprise, 3D CAD, and more.

We don't need to declare a variable's data type because Python is dynamically typed, so we
can simply write a = 10 to assign an integer value to a variable.
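
For example, the same name can be rebound to values of different types at run time:

a = 10           # a refers to an int
print(type(a))   # <class 'int'>
a = "ten"        # the same name now refers to a str
print(type(a))   # <class 'str'>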

Python makes development and debugging fast because there is no compilation step in
Python development, and the edit-test-debug cycle is very quick.

Python History
 Python laid its foundation in the late 1980s.
 The implementation of Python was started in December 1989 by Guido van Rossum
at CWI in the Netherlands.
 In February 1991, van Rossum published the code (labeled version 0.9.0) to alt.sources.
 In 1994, Python 1.0 was released with new features like: lambda, map, filter, and reduce.
 Python 2.0 added new features like: list comprehensions, garbage collection system.
 On December 3, 2008, Python 3.0 (also called "Py3K") was released. It was designed to
rectify fundamental flaws of the language.
 The ABC programming language is said to be the predecessor of Python; it was capable of
exception handling and of interfacing with the Amoeba operating system.
 Python is influenced by following programming languages:

 ABC language.
 Modula-3

6.2 Python Features

Python provides lots of features that are listed below.

1) Easy to Learn and Use

Python is easy to learn and use. It is a developer-friendly, high-level programming language.

2) Expressive Language

Python is an expressive language, meaning that it is more understandable and readable.

3) Interpreted Language

Python is an interpreted language, i.e., the interpreter executes the code line by line. This
makes debugging easy and thus suitable for beginners.
4) Cross-platform Language

Python can run equally on different platforms such as Windows, Linux, Unix and Macintosh etc.
So, we can say that Python is a portable language.

5) Free and Open Source

Python is freely available at its official web address, and the source code is also available.
Therefore, it is open source.

6) Object-Oriented Language

Python supports object-oriented programming, so concepts such as classes and objects come
into play.

7) Extensible

It implies that code written in other languages such as C/C++ can be compiled and then used
in our Python code.

8) Large Standard Library

Python has a large and broad standard library and provides a rich set of modules and
functions for rapid application development.

9) GUI Programming Support

Graphical user interfaces can be developed using Python.

10) Integrated

It can be easily integrated with languages like C, C++, JAVA etc.


6.3 Python Applications

Python is known for its general-purpose nature, which makes it applicable in almost every domain of
software development. Python as a whole can be used in any sphere of development.

Here, we specify the application areas where Python can be applied.

1) Web Applications

We can use Python to develop web applications. It provides libraries to handle internet formats
and protocols such as HTML, XML, JSON, and email processing, as well as packages such as
requests, BeautifulSoup, and Feedparser. It also provides frameworks such as Django, Pyramid,
and Flask to design and develop web-based applications. Some important developments are:
PythonWikiEngines, Pocoo, PythonBlogSoftware, etc.

2) Desktop GUI Applications

Python provides the Tk GUI library to develop user interfaces in Python-based applications.
Some other useful toolkits are wxWidgets, Kivy, and PyQt, which are usable on several platforms.
Kivy is popular for writing multitouch applications.

3) Software Development

Python is helpful in the software development process. It works as a support language and can be
used for build control and management, testing, etc.

4) Scientific and Numeric

Python is popular and widely used in scientific and numeric computing. Some useful libraries and
packages are SciPy, Pandas, IPython, etc. SciPy is a group of packages for engineering, science, and
mathematics.
5) Business Applications

Python is used to build business applications like ERP and e-commerce systems. Tryton is a
high-level application platform.

6) Console Based Application

We can use Python to develop console based applications. For example: IPython.

7) Audio or Video based Applications

Python can be used to develop multimedia applications. Some real applications are TimPlayer,
cplay, etc.

8) 3D CAD Applications

For creating CAD applications, Fandango is a real application which provides full features of CAD.

9) Enterprise Applications

Python can be used to create applications which can be used within an Enterprise or an
Organization. Some real time applications are: OpenErp, Tryton, Picalo etc.

10) Applications for Images

Several applications can be developed for images using Python. Applications developed include
VPython, Gogh, imgSeek, etc. There are several such applications which can be developed using
Python.

How to Install Python (Environment Set-up)

In this section, we discuss the installation of Python on the Windows operating
system.


Installation on Windows

Visit the link https://www.python.org/downloads/ to download the latest release of Python. In
this process, we will install Python 3.6.7 on our Windows operating system.

Double-click the executable file which is downloaded; the following window will open. Select
Customize installation and proceed.

The following window shows all the optional features. All the features need to be installed and
are checked by default; we need to click next to continue.
The following window shows a list of advanced options. Check all the options which you want
to install and click next. Here, we must notice that the first check-box (install for all users) must
be checked.


Now, we are ready to install python-3.6.7. Let's install it.


Now, try to run Python on the command prompt. Type the command python (or python3 for
Python 3). It will show an error as given in the image below, because we haven't set the
path yet.


To set the path of Python, we need to right-click on "My Computer" and go to Properties →
Advanced → Environment Variables.

Add the new path variable in the user variable section.


Type PATH as the variable name and set the value to the installation directory of Python, as
shown in the image below.


Now that the path is set, we are ready to run Python on our local system. Restart CMD and
type python again. It will open the Python interpreter shell, where we can execute Python
statements.
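
Once the interpreter shell opens, a quick check such as the following confirms the installation (a minimal sketch):

import sys
print(sys.version)        # shows the installed Python version
print("Hello, Python!")   # a first statement executed in the shell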
Chapter 7

CODING

#!/usr/bin/env python

# coding: utf-8

# # Diabetes prediction

# In[1]:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn import svm

from sklearn.ensemble import RandomForestClassifier

import warnings


warnings.filterwarnings("ignore")

# In[2]:

df = pd.read_csv('diabetes.csv')

# In[3]:

df.head()

# In[4]:

df.shape
# In[5]:

df.isnull().sum()

# In[6]:

df.info()

# In[7]:

df.describe()

# In[8]:

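# Separate the input features (X) from the target column, named 'class' in this dataset.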
X = df.drop(columns=['class'])

Y = df['class']

# # Support vector machine classifier

# In[9]:
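
# Hold out 20% of the records for testing; a fixed random_state keeps the split reproducible.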

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

clf = svm.SVC(kernel='linear')

clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

from sklearn import metrics

print("supprot vector classcification model accuracy(in %):", metrics.accuracy_score(y_test,


y_pred) * 100)

svm_acc=metrics.accuracy_score(y_test, y_pred) * 100

m=svm_acc
# In[10]:

from sklearn.metrics import confusion_matrix

print("confusion matrix ", confusion_matrix(y_test, y_pred))

# In[11]:

dft=pd.DataFrame()

dft=x_test

dft['actual']=y_test

dft["predictions"]=y_pred

print(dft["predictions"].value_counts())

dft['predictions'].value_counts().sort_index().plot(kind='bar', figsize=(10, 5),
                                                    title='SVM prediction distribution in test data',
                                                    color='blue')

plt.show()

dft.to_csv(r'svm_valid_predections.csv')

# **Here 77% of people are safe from diabetes**

# # KNeighborsClassifier

# In[12]:

from sklearn.neighbors import KNeighborsClassifier

x_train, x_test1, y_train, y_test1 = train_test_split(X, Y, test_size=0.2, random_state=1)
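
# KNeighborsClassifier is used here with its default settings (k = 5 neighbours).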

knn = KNeighborsClassifier()

knn_model = knn.fit(x_train, y_train)

y_true, y_pred1 = y_test, knn_model.predict(x_test1)

print('k-NN accuracy for test set: %f' % (knn_model.score(x_test1, y_test1)*100))

knn_acc=knn_model.score(x_test1, y_test1)*100

o=knn_acc

# In[13]:
dft=pd.DataFrame()

dft=x_test1

dft['actual']=y_test1

dft["predictions"]=y_pred1

print(dft["predictions"].value_counts())

dft['predictions'].value_counts().sort_index().plot(kind='bar', figsize=(10, 5),
                                                    title='KNN prediction distribution in test data',
                                                    color='orange')

plt.show()

dft.to_csv(r'Knn_valid_predections.csv')

# **Here 73% of people are safe from diabetes**

# # comparison of accuracy_score of SVM,KNN classifiers

# In[14]:


y=[o,m]

x=['KNN','SVM']

plt.bar(x,y)

plt.title('comparison of accuracy_score of SVM,KNN classifiers')

plt.savefig('comparison_graph.jpg')

plt.show()

# In[ ]:
CHAPTER 8

8. SCREENSHOTS

Diabetes prediction


In this screenshot, the Python libraries are imported and then the "Diabetes" dataset is loaded.

After loading the dataset, we perform Exploratory Data Analysis (EDA) to describe every variable in the
data.
In this screenshot, the next step is splitting the data into two parts, i.e., train data and test data. Then we
import our machine learning model, the Support Vector Classifier. From this model we got 77% accuracy.

From this visualization we observe the Support Vector Machine classifier's prediction distribution in the
test data: non-diabetic (0) 113 and diabetic (1) 41.


In this screenshot we import our machine learning model, the KNeighbors Classifier. From this model we
got 73% accuracy.
From this visualization we observe the KNeighbors Classifier's prediction distribution in the test data:
non-diabetic (0) 106 and diabetic (1) 48.

Comparison of accuracy score of SVM, KNN classifiers.


CHAPTER 9

9. SYSTEM TESTING
9.1 INTRODUCTION

System testing ensures that the entire integrated software system meets requirements.
It tests a configuration to ensure known and predictable results. An example of system testing is
the configuration-oriented system integration test. System testing is based on process
descriptions and flows, emphasizing pre-driven process links and integration points.

The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests; each test type addresses a specific testing requirement.

Organization and preparation of functional tests is focused on requirements, key
functions, or special test cases. In addition, systematic coverage pertaining to identifying
business process flows, data fields, predefined processes, and successive processes must be
considered for testing. Before functional testing is complete, additional tests are identified and
the effective value of current tests is determined.

9.2 TYPES OF TESTS

Unit Testing:

Unit testing involves the design of test cases that validate that the internal program
logic is functioning properly and that program inputs produce valid outputs. All decision
branches and internal code flow should be validated. It is the testing of individual software units
of the application; it is done after the completion of an individual unit, before integration. This is
structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests
perform basic tests at the component level and test a specific business process, application, and/or
system configuration. Unit tests ensure that each unique path of a business process performs
accurately to the documented specifications and contains clearly defined inputs and expected
results. Unit testing is usually conducted as part of a combined code and unit test phase of
the software lifecycle, although it is not uncommon for coding and unit testing to be conducted as
two distinct phases.
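
Purely as an illustrative sketch of a unit test (not part of the project code), the snippet below checks a hypothetical predict_one() helper against a stand-in model, so the test exercises one unit in isolation:

import unittest

def predict_one(model, record):
    """Hypothetical helper: return the class predicted for a single record."""
    return int(model.predict([record])[0])

class DummyModel:
    """Stand-in for the trained classifier; always predicts non-diabetic (0)."""
    def predict(self, records):
        return [0 for _ in records]

class TestPredictOne(unittest.TestCase):
    def test_prediction_is_binary(self):
        result = predict_one(DummyModel(), [6, 148, 72, 35, 0, 33.6, 0.627, 50])  # made-up record
        self.assertIn(result, (0, 1))  # output must be one of the two class labels

if __name__ == '__main__':
    unittest.main()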

Test strategy and approach

Field testing will be performed manually and functional tests will be written in detail.

Test objectives

 All field entries must work properly.


 Pages must be activated from the identified link.
 The entry screen, messages and responses must not be delayed.
Features to be tested

 Verify that the entries are of the correct format


 No duplicate entries should be allowed
 All links should take the user to the correct page.
Integration Testing

Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to produce failures caused by interface
defects. The task of the integration test is to check that components or software applications, e.g.
components in a software system or, one step up, software applications at the company level,
interact without error.

Integration tests are designed to test integrated software components to determine if they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that although the components were
individually satisfactory, as shown by successful unit testing, the combination of components is
correct and consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.

Test Results: All the test cases mentioned above passed successfully. No defects encountered.

Functional test:

Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.

Functional testing is centered on the following items:

Valid Input : identified classes of valid input must be accepted.

Invalid Input : identified classes of invalid input must be rejected.

Functions : identified functions must be exercised.

Output : identified classes of application outputs must be exercised.

White Box Testing

White Box Testing is a testing in which the software tester has knowledge of the
inner workings, structure, and language of the software, or at least its purpose. It is
used to test areas that cannot be reached from a black-box level.

Black Box Testing

Black Box Testing is testing the software without any knowledge of the inner workings,
structure or language of the module being tested. Black box tests, as most other kinds of tests,
must be written from a definitive source document, such as a specification or requirements
document. It is testing in which the software under test is treated as a black box: you cannot
"see" into it. The test provides inputs and responds to outputs without considering how the
software works.

Acceptance Testing

User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional
requirements.
CHAPTER 10

CONCLUSION

An application using a data mining algorithm based on class comparison has been developed to
predict the occurrence or recurrence of diabetes risks. The results show that the prediction system
is capable of predicting diabetes effectively, efficiently and, most importantly, in a timely manner.
That means the application is capable of helping a physician make decisions about a patient's
health risks. It generates results that are close to real-life situations, which makes data mining more
helpful in the health sector and shows that it is necessary for knowledge discovery in healthcare.
Beyond the huge savings in costs in terms of medical expenses, lost duty time and usage of critical
medical facilities, the Naïve Bayes classifier-based system is very useful for the diagnosis of
diabetes. The system can perform good prediction with less error, and this technique could be an
important tool for supplementing medical doctors in performing expert diagnosis. In this method
the efficiency of forecasting was found to be around 95%. This application would be a tremendous
asset for doctors, who can have structured, specific and invaluable information about their patients
and others, so that they can ensure that their diagnoses or inferences are correct and professional.

Finally, the appreciation received from doctors on having such software proves that in a place
where diseases are on the rise, such applications should be developed to cover the entire state. The
common person stands to benefit from doctors having such a tool, so that he or she can be more
knowledgeable about personal health and wellbeing.


CHAPTER 11

REFERENCES

[1] Han, J., Kamber, M., "Data Mining Concepts and Techniques", Morgan Kaufmann Publishers, 2006.

[2] Dunham, Margaret H., "Data Mining: Introductory and Advanced Topics".

[3] Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni, "Predictive Data Mining for Medical
Diagnosis: An Overview of Heart Disease Prediction", IJCSE, Vol. 3, No. 6, June 2011.

[4] Carloz Ordonez, "Association Rule Discovery with Train and Test approach for heart
disease prediction", IEEE Transactions on Information Technology in Biomedicine, Volume 10,
No. 2, April 2006, pp. 334-343.

[5] M. Anbarasi, E. Anupriya, N.Ch.S.N. Iyengar, "Enhanced Prediction of Heart Disease with
Feature Subset Selection using Genetic Algorithm", International Journal of Engineering Science
and Technology, Vol. 2(10), 2010, pp. 5370-5376.

[6] G. Parthiban, A. Rajesh, S.K. Srivatsa, "Diagnosis of Heart Disease for Diabetic Patients
using Naive Bayes Method".

[7] Choi J.P., Han T.H. and Park R.W., "A Hybrid Bayesian Network Model for Predicting
Breast Cancer Prognosis", J Korean Soc Med Inform, 2009, pp. 49-57.

[8] Bellaachia, Abdelghani and Erhan Guven, "Predicting Breast Cancer Survivability using Data
Mining Techniques", Ninth Workshop on Mining Scientific and Engineering Datasets, in
conjunction with the Sixth SIAM International Conference on Data Mining, 2006.

[9] Lundin M., Lundin J., Burke B.H., Toikkanen S., Pylkkänen L. and Joensuu H., "Artificial
Neural Networks Applied to Survival Prediction in Breast Cancer", Oncology: International
Journal for Cancer Research and Treatment, vol. 57, 1999.

[10] Delen, Dursun, Walker, Glenn and Kadam, Amit, "Predicting breast cancer survivability: a
comparison of three data mining methods", Artificial Intelligence in Medicine, vol. 34,
pp. 113-127, June 2005.

[11] Ruben D. Canlas Jr., "Data Mining in Healthcare: Current Applications and Issues",
August 2009.

[12] Michael Feld, Dr. Michael Kipp, Dr. Alassane Ndiaye and Dr. Dominik Heckmann, "Weka:
Practical machine learning tools and techniques with Java implementations".

[13] K.P. Soman, Shyam Diwakar, V. Vijay, "Insight into Data Mining Theory and Practice".

[14] Asha Rajkumar, G. Sophia Reena, "Diagnosis of Heart Disease Using Data Mining
Algorithm", Global Journal of Computer Science and Technology, Vol. 10, Issue 10, Ver. 1.0,
September 2010, p. 38.

[15] Shantakumar B. Patil, Y.S. Kumaraswamy, "Intelligent and Effective Heart Attack Prediction
System Using Data Mining and Artificial Neural Network", European Journal of Scientific
Research, ISSN 1450-216X, Vol. 31, No. 4 (2009), pp. 642-656, EuroJournals Publishing, Inc., 2009.

[16] Wingo P.A., Tong T., Bolden S., "Cancer statistics, 1995", CA Cancer J Clin, 45 (1995),
no. 1, pp. 8-30.

[17] Fentiman I.S., "Detection and treatment of breast cancer", London: Martin Dunitz (1998).
