Data Mining 5 Semester Bca
Presidency College (Autonomous)
Reaccredited by NAAC with A+
Presidency Group
5 SEMESTER BCA
By
Dr. J. Vijay Fidelis
Associate Professor, Dept of Computer Applications
Presidency College, Bangalore-24
UNIT V
CLASSIFICATION AND PREDICTION
There are two forms of data analysis that can be used for extracting
models describing important classes or to predict future data trends.
These two forms are as follows −
Classification
Prediction
Classification models predict categorical class labels;
Prediction models predict continuous-valued functions.
For example, we can build a classification model to categorize bank
loan applications as either safe or risky, or a prediction model to
predict the expenditures in dollars of potential customers on computer
equipment given their income and occupation.
What is classification?
Classification is to identify the category or the class label
of a new observation.
Following are the examples of cases where the data
analysis task is Classification −
A bank loan officer wants to analyze the data in order to
know which customers (loan applicants) are risky and which
are safe.
A marketing manager at a company needs to analyze a
customer with a given profile to predict whether that
customer will buy a new computer.
In both of the above examples, a model or classifier is
constructed to predict the categorical labels. These labels
are risky or safe for loan application data and yes or no
for marketing data.
What is Prediction?
Classification
• Classification is the process of identifying which category a new
observation belongs to, based on a training data set containing
observations whose category membership is known.
• In classification, the model is known as the classifier.
• A model or classifier is constructed to find the categorical labels.
• For example, the grouping of patients based on their medical records
can be considered a classification.
Prediction
• Prediction is the process of identifying the missing or unavailable
numerical data for a new observation.
• In prediction, the model is known as the predictor.
• A model or predictor is constructed that predicts a continuous-valued
function or ordered value.
• For example, we can think of prediction as predicting the correct
treatment for a particular disease for a person.
HOW DOES CLASSIFICATION WORK?
❖ With the help of the bank loan application that we have discussed
above, let us understand the working of classification.
The Data Classification process includes two steps −
➢ Building the Classifier or Model
➢ Using Classifier for Classification
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data
is used to estimate the accuracy of classification rules. The
classification rules can be applied to the new data tuples if the
accuracy is considered acceptable.
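The accuracy check described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical rule-based classifier `classify_loan` and a small made-up test set; the thresholds, names, and data are assumptions and do not come from the slides.

```python
# Hypothetical rule produced in the model-building step.
# The income and employment thresholds are illustrative assumptions.
def classify_loan(income, years_employed):
    """Return 'safe' or 'risky' for a loan applicant."""
    if income >= 40000 and years_employed >= 2:
        return "safe"
    return "risky"


def accuracy(classifier, test_tuples):
    """Fraction of test tuples whose predicted label matches the known label."""
    correct = sum(
        1
        for (income, years, label) in test_tuples
        if classifier(income, years) == label
    )
    return correct / len(test_tuples)


# Made-up test data: (income, years_employed, known_label).
test_data = [
    (55000, 5, "safe"),
    (30000, 1, "risky"),
    (42000, 3, "safe"),
    (70000, 1, "safe"),  # the rule misclassifies this tuple as risky
]

acc = accuracy(classify_loan, test_data)
print(f"Accuracy on test data: {acc:.2f}")
```

If the estimated accuracy is judged acceptable, the same `classify_loan` rule would then be applied to new, unlabeled tuples.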
Classification and Prediction Issues
In simple words, when there are multiple attributes whose values lie
on different scales, this may lead to poor data models while
performing data mining operations. So they are normalized to bring
all the attributes onto the same scale.
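One common way to bring attributes onto the same scale is min-max normalization. The sketch below is illustrative only: the slides do not name a specific normalization method, and the function name and sample values are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    return [
        (v - lo) / (hi - lo) * (new_max - new_min) + new_min
        for v in values
    ]


# Made-up attributes on very different scales.
incomes = [20000, 50000, 80000]
ages = [20, 35, 50]

norm_incomes = min_max_normalize(incomes)
norm_ages = min_max_normalize(ages)
print(norm_incomes, norm_ages)
```

After normalization both attributes lie in the range 0 to 1, so neither dominates a distance calculation simply because of its units.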
Comparison of Classification and Prediction Methods
Decision Tree Induction Algorithm
Tree Pruning
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean
variables.
CLASSIFICATION BY BACKPROPAGATION
• Backpropagation, or backward propagation of errors, is
an algorithm that propagates the error backward from the
output nodes to the input nodes.
• It is an important mathematical tool for improving the
accuracy of predictions in data mining and machine
learning.
• Backpropagation is an iterative, recursive and effective
approach for computing the updated weights that improve
the network.
• Backpropagation is generally used in neural network
training and computes the gradient of the loss function
with respect to the weights of the network.
• It works with a multi-layer neural network and learns
the internal representations of the input-output
mapping.
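The weight-update idea above can be sketched for a very small network. This is a simplified illustration, not the full algorithm from any particular textbook: it assumes two inputs, two sigmoid hidden units, one sigmoid output, no bias terms, a squared-error loss, and made-up starting weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Forward pass: inputs -> hidden activations -> single output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def loss(x, target, w_hidden, w_out):
    """Squared error 0.5 * (output - target)^2 for one training example."""
    _, output = forward(x, w_hidden, w_out)
    return 0.5 * (output - target) ** 2


def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One gradient-descent update of all weights for one training example."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output node.
    delta_out = (output - target) * output * (1.0 - output)
    # Error signals propagated backward to the hidden nodes
    # (computed with the old output weights).
    delta_hidden = [
        delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
        for j in range(len(hidden))
    ]
    new_w_out = [w_out[j] - lr * delta_out * hidden[j] for j in range(len(hidden))]
    new_w_hidden = [
        [w_hidden[j][i] - lr * delta_hidden[j] * x[i] for i in range(len(x))]
        for j in range(len(hidden))
    ]
    return new_w_hidden, new_w_out


# Made-up training example and starting weights.
x, target = [1.0, 0.0], 1.0
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_out = [0.2, -0.1]

loss_before = loss(x, target, w_hidden, w_out)
for _ in range(100):
    w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
loss_after = loss(x, target, w_hidden, w_out)
print(loss_before, loss_after)
```

Each call to `backprop_step` computes the error at the output, passes it back to the hidden layer, and nudges every weight against its gradient, so the loss on this example shrinks over the iterations.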
➢ A neural network is a set of connected input/output units
where each connection has a weight associated with it.
➢ Neural networks can help computers make intelligent
decisions with limited human assistance. This is because
they can learn and model the relationships between input
and output data that are nonlinear and complex.
Neural Network as a Classifier
Weakness
o Long training time
o Requires a number of parameters that are typically best determined
empirically, e.g., the network topology or "structure"
o Poor interpretability: it is difficult to interpret the symbolic
meaning behind the learned weights and the "hidden units" in
the network
Strength
o High tolerance to noisy data
o Ability to classify untrained patterns
o Well-suited for continuous-valued inputs and outputs
o Successful on a wide array of real-world data
o Algorithms are inherently parallel
o Techniques have recently been developed for the
extraction of rules from trained neural networks
PROCESS
k-Nearest-Neighbor Classifier:
Working of KNN Algorithm
Example
The following is an example to understand the concept of
K and working of KNN algorithm −
Suppose we have a dataset which can be plotted as
follows −
KNN Algorithm
Now, we need to classify a new data point, shown as a black dot (at
point 60,60), into the blue or red class. We assume K = 3, i.e. the
algorithm finds the three nearest data points. This is shown in the
next diagram −
We can see in the above diagram the three nearest neighbors of
the data point with the black dot. Among those three, two lie in
the red class, hence the black dot will also be assigned to the red
class.
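The majority vote among the K = 3 nearest neighbors can be sketched as below. The coordinates are made up for illustration, since the slide's actual data points are not given in the text, and Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter


def knn_classify(query, training_points, k=3):
    """Assign the label held by the majority of the k nearest training points."""
    by_distance = sorted(
        training_points,
        key=lambda item: math.dist(query, item[0]),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up labeled points: ((x, y), class).
training = [
    ((55, 62), "red"),
    ((63, 58), "red"),
    ((58, 66), "blue"),
    ((20, 30), "blue"),
    ((30, 25), "blue"),
    ((85, 90), "red"),
]

predicted = knn_classify((60, 60), training, k=3)
print(predicted)
```

With these sample points the three nearest neighbors of (60, 60) are two red points and one blue point, so the query is labeled red, mirroring the example in the text.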
Pros and Cons of KNN
Pros
• It is a very simple algorithm to understand and
interpret.
• It is very useful for nonlinear data because the algorithm
makes no assumptions about the data.
• It is a versatile algorithm, as we can use it for
classification as well as regression.
• It has relatively high accuracy, but there are much
better supervised learning models than KNN.
Cons
The following are some of the areas in which KNN can be applied
successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit
for loan approval, i.e. whether that individual has characteristics similar
to those of defaulters.
Calculating Credit Ratings
KNN algorithms can be used to find an individual’s credit rating by
comparing them with persons having similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into
various classes like “Will Vote”, “Will not Vote”, “Will Vote to Party
‘Congress’”, “Will Vote to Party ‘BJP’”.
Other areas in which the KNN algorithm can be used are Speech Recognition,
Handwriting Detection, Image Recognition and Video Recognition.
What Is the Genetic Algorithm?
➢ The genetic algorithm is a method for solving both
constrained and unconstrained optimization problems.
➢ The genetic algorithm repeatedly modifies a population of
individual solutions.
➢ At each step, the genetic algorithm selects individuals at random
from the current population to be parents and uses them to
produce the children for the next generation.
➢ Over successive generations, the population "evolves" toward an
optimal solution.
➢ The genetic algorithm is used to solve a variety of optimization
problems that are not well suited for standard optimization
algorithms, including problems in which the objective function is
discontinuous, non-differentiable, or highly nonlinear.
➢ The genetic algorithm can address problems of mixed integer
programming, where some components are restricted to be
integer-valued.
The genetic algorithm uses three main types of rules at each step to
create the next generation from the current population:
• Selection rules select the individuals, called parents, that contribute
to the population at the next generation.
• Crossover rules combine two parents to form children for the next
generation.
• Mutation rules apply random changes to individual parents to form
children.
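The three rules can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the bit-string encoding, the "count the 1-bits" objective, tournament-style selection, single-point crossover, the mutation rate, and carrying the best individual over unchanged.

```python
import random


def fitness(bits):
    """Toy objective (illustrative assumption): number of 1-bits in the string."""
    return sum(bits)


def select_parent(population, rng):
    """Selection rule: draw two individuals at random and keep the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def crossover(p1, p2, rng):
    """Crossover rule: join the head of one parent to the tail of the other."""
    point = rng.randrange(1, len(p1))
    return p1[:point] + p2[point:]


def mutate(bits, rng, rate=0.05):
    """Mutation rule: flip each bit independently with a small probability."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(pop_size=20, length=16, generations=30, seed=0):
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)
    ]
    initial_best = max(fitness(ind) for ind in population)
    for _ in range(generations):
        # Keep the current best individual so the best fitness never drops.
        elite = max(population, key=fitness)
        children = []
        for _ in range(pop_size - 1):
            parent1 = select_parent(population, rng)
            parent2 = select_parent(population, rng)
            children.append(mutate(crossover(parent1, parent2, rng), rng))
        population = [elite] + children
    return initial_best, max(population, key=fitness)


initial_best, best = evolve()
print(initial_best, fitness(best), best)
```

Each pass through the loop builds one new generation from the current population, which is the sense in which the population "evolves" toward better solutions.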
A cluster is a group of objects that belong to the same class. In other
words, similar objects are grouped in one cluster and dissimilar objects are
grouped in another cluster.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of
similar objects.
Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups
based on data similarity and then assign the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable
to changes and helps single out useful features that distinguish different
groups.
Applications of Cluster Analysis
• Clustering analysis is broadly used in many applications such as market
research, pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer
base, and they can characterize their customer groups based on purchasing
patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures
inherent to populations.
• Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a
city according to house type, value, and geographic location.
• Clustering also helps in classifying documents on the web for information
discovery.
• Clustering is also used in outlier detection applications such as detection of
credit card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into
the distribution of data to observe characteristics of each cluster.
Requirements of Clustering in Data Mining
The following points describe what is required of clustering algorithms in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large
databases.
• Ability to deal with different kinds of attributes − Algorithms should be
capable of being applied to any kind of data, such as interval-based (numerical)
data, categorical data, and binary data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should
be capable of detecting clusters of arbitrary shape. It should not be restricted
to distance measures that tend to find only small spherical clusters.
• High dimensionality − The clustering algorithm should be able to handle not
only low-dimensional data but also high-dimensional spaces.
• Ability to deal with noisy data − Databases contain noisy, missing or
erroneous data. Some algorithms are sensitive to such data and may produce
poor-quality clusters.
• Interpretability − The clustering results should be interpretable,
comprehensible, and usable.
Clustering Methods
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method
constructs ‘k’ partitions of the data. Each partition represents a cluster, and
k ≤ n. This means that the method classifies the data into k groups, which
satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
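A well-known partitioning method is k-means, which the slides do not name; it is used below only as an illustrative sketch. The naive choice of the first k points as starting centroids, the fixed iteration count, and the sample data are all assumptions.

```python
import math


def k_means(points, k, iterations=10):
    """Partition points into k groups by repeated nearest-centroid assignment."""
    # Naive initialisation (assumption): use the first k points as centroids.
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: every object goes to exactly one group.
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # leave the centroid in place if a group went empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return clusters


# Made-up 2-D objects forming two well-separated groups.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters = k_means(data, k=2)
print(clusters)
```

For this sample data each of the two resulting groups is non-empty and every object appears in exactly one group. Note that plain k-means does not guarantee non-empty groups on arbitrary data, which is why the sketch keeps the old centroid when a group is empty.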
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data
objects. We can classify hierarchical methods on the basis of how the
hierarchical decomposition is formed.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with
all of the objects in the same cluster. In each successive iteration, a cluster
is split up into smaller clusters. This is done until each object is in its own
cluster or the termination condition holds. This method is rigid, i.e., once a
merging or splitting is done, it can never be undone.