Unit-4 DM
Classification and prediction - Issues Regarding Classification and Prediction - Classification by Decision
Tree Induction - Bayesian classification - Bayes' Theorem - Naïve Bayesian Classification - Bayesian
Belief Network - Rule based classification - Classification by Backpropagation - Support vector machines
- Prediction - Linear Regression.
Typical applications of classification:
• Credit approval
• Target marketing
• Medical diagnosis
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
• The known label of each test sample is compared with the classification result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set must be independent of the training set, otherwise over-fitting will occur
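For illustration only (the labels below are hypothetical, not from the notes), the accuracy rate is simply the fraction of test samples whose predicted label matches the known label:

# accuracy = correctly classified test samples / total test samples
y_true = ['yes', 'no', 'yes', 'yes', 'no']    # known labels of the test set
y_pred = ['yes', 'no', 'no', 'yes', 'no']     # labels predicted by the model
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                               # 0.8, i.e., 80% of test samples classified correctly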
There are two main issues regarding classification and prediction:
Issues (1): Data Preparation: data preparation involves the following
1) Data cleaning: preprocess the data in order to reduce noise and handle missing values (refer preprocessing
techniques, i.e., data cleaning notes)
2) Relevance analysis: remove irrelevant or redundant attributes (refer unit-iv AOI relevance analysis)
3) Data transformation: generalize and/or normalize the data (refer preprocessing techniques, i.e., data cleaning notes)
Issues (2): Evaluating Classification Methods: a classification method should satisfy the following
properties
1. Predictive accuracy
2. Speed
3. Robustness
4. Scalability
5. Interpretability
6. Goodness of rules
Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
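A minimal sketch of both phases (assuming scikit-learn and its bundled Iris data, which are not part of these notes): the tree is constructed from the training examples, and an unknown sample is then classified by testing its attribute values against the tree.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Tree construction: partition the training examples recursively on selected attributes
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree))                  # the flow-chart-like structure: tests, branches, leaf classes

# Using the tree: test the attribute values of an unknown sample against the tree
unknown_sample = [[5.0, 3.4, 1.5, 0.2]]   # hypothetical sample
print(tree.predict(unknown_sample))       # predicted class label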
Training Dataset (example)
Pre pruning:
Halt tree construction early, i.e., do not split a node if this would result in the goodness measure falling below
a threshold
It is difficult to choose an appropriate threshold
Post pruning:
Remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the “best pruned tree”
Attribute selection measures:
• Information Gain
• Gain ratio
• Gini Index
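A minimal sketch (with hypothetical class counts, not taken from the notes) of how these three measures can be computed for one candidate split:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = [9, 5]                      # hypothetical class distribution before the split
branches = [[2, 3], [4, 0], [3, 2]]  # class distribution in each branch after the split
n = sum(parent)

info_after = sum(sum(b) / n * entropy(b) for b in branches)
gain = entropy(parent) - info_after                        # information gain
split_info = entropy([sum(b) for b in branches])           # penalty for many-valued attributes
gain_ratio = gain / split_info                             # gain ratio
gini_after = sum(sum(b) / n * gini(b) for b in branches)   # Gini index of the split
print(gain, gain_ratio, gini_after)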
1. Deciding not to divide a set of samples any further under some conditions. The stopping criterion is
usually based on some statistical tests, such as the χ2 test: If there are no significant differences in
classification accuracy before and after division, then represent a current node as a leaf. The decision
is made in advance, before splitting, and therefore this approach is called pre pruning.
2. Removing retrospectively some of the tree structure using selected accuracy criteria. The decision in
this process of post pruning is made after the tree has been built.
C4.5 follows the post pruning approach, but it uses a specific technique to estimate the predicted error rate.
This method is called pessimistic pruning. For every node in a tree, the estimation of the upper confidence
limit Ucf is computed using the statistical tables for the binomial distribution (given in most textbooks on
statistics). The parameter Ucf is a function of |Ti| (the number of cases in the node) and E (the number of
errors) for a given node. C4.5 uses the default confidence level of 25%, and compares U25%(|Ti|, E) for a
given node Ti with the weighted confidence of its leaves. The weights are the total number of cases for every
leaf. If the predicted error for the root node of a subtree is less than the weighted sum of U25% for its leaves
(the predicted error for the subtree), then the subtree is replaced with its root node, which becomes a new
leaf in the pruned tree.
Let us illustrate this procedure with one simple example. A subtree of a decision tree is given in
the figure, where the root node is the test x1 on three possible values {1, 2, 3} of the attribute A. The children
of the root node are leaves denoted with their corresponding classes and (|Ti|, E) parameters. The question is
to estimate the possibility of pruning the subtree and replacing it with its root node as a new, generalized
leaf node.
To analyze the possibility of replacing the sub tree with a leaf node it is necessary to compute a
predicted error PE for the initial tree and for a replaced node. Using default confidence of 25%, the upper
confidence limits for all nodes are collected from statistical tables: U25% (6, 0) = 0.206, U25%(9, 0) = 0.143,
U25%(1, 0) = 0.750, and U25%(16, 1) = 0.157. Using these values, the predicted errors for the initial tree and
the replaced node are:
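PE(initial subtree) = 6 × 0.206 + 9 × 0.143 + 1 × 0.750 ≈ 3.27 (each leaf's confidence limit weighted by its number of cases)
PE(replaced node) = 16 × 0.157 ≈ 2.51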
Since the existing subtree has a higher value of predicted error than the replaced node, it is
recommended that the decision tree be pruned and the subtree replaced with the new leaf node.
BAYESIAN THEOREM
• Given training data D, posteriori probability of a hypothesis h, P(h|D) follows the Bayes theorem
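In symbols: P(h|D) = P(D|h) · P(h) / P(D), where P(h) is the prior probability of the hypothesis h, P(D|h) is the likelihood of the data given h, and P(D) is the prior probability of the data.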
• Practical difficulty: require initial knowledge of many probabilities, significant computational cost
Naïve Bayesian classification assumes class-conditional independence among attributes; this greatly reduces the computation cost, since only the class distribution (and per-attribute counts within each class) needs to be counted.
Conditional probability tables (counts for classes P and N) for the attributes Outlook, Temperature, Humidity, and Windy
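As a minimal sketch (using illustrative, play-tennis-style probabilities that are not taken from the notes), naïve Bayes scores each class by multiplying its prior with the per-attribute conditional probabilities of the unknown sample:

from math import prod

# illustrative (hypothetical) priors and conditional probabilities P(attribute value | class)
priors = {'P': 9/14, 'N': 5/14}
cond = {
    'P': {'outlook=sunny': 2/9, 'humidity=high': 3/9, 'windy=true': 3/9},
    'N': {'outlook=sunny': 3/5, 'humidity=high': 4/5, 'windy=true': 3/5},
}

x = ['outlook=sunny', 'humidity=high', 'windy=true']            # unknown sample
scores = {c: priors[c] * prod(cond[c][a] for a in x) for c in cond}
print(max(scores, key=scores.get), scores)                      # class with the highest posterior score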
Bayesian Belief Networks
Each node has a conditional probability table (CPT) that gives the conditional probability for each possible combination of values of its parents
Derivation of the probability of a particular combination of values of X, from CPT:
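P(x1, ..., xn) = Π i=1..n P(xi | Parents(Yi)),
where Parents(Yi) denotes the values of the parent nodes of Yi and each factor is read directly from the CPT of node Yi.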
Association-Based Classification
Rule-based classification (sequential covering): rules are learned one at a time, and each time a rule is
learned the tuples it covers are removed from the training set
The process repeats on the remaining tuples until a termination condition is met, e.g., when there are no
more training examples or when the quality of a rule returned is below a user-specified threshold
Compared with decision-tree induction, which in effect learns a set of rules simultaneously, sequential
covering learns the rules one at a time (see the sketch below)
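A minimal Python sketch of this idea; learn_one_rule and rule_quality are hypothetical helpers (not defined in the notes), and each learned rule is assumed to expose a covers(tuple) test:

# Sequential-covering sketch with hypothetical helper callables supplied by the caller.
def sequential_covering(training_tuples, target_class, learn_one_rule, rule_quality, min_quality):
    rules = []
    remaining = list(training_tuples)
    while remaining:
        rule = learn_one_rule(remaining, target_class)        # greedily grow one IF-THEN rule
        if rule is None or rule_quality(rule, remaining) < min_quality:
            break                                             # rule quality below the user threshold
        rules.append(rule)
        # remove the tuples covered by the new rule and repeat on the rest
        remaining = [t for t in remaining if not rule.covers(t)]
    return rules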
CLASSIFICATION BY BACKPROPAGATION
The n-dimensional input vector x is mapped into variable y by means of the scalar product and a
nonlinear function mapping
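In symbols, a single unit computes y = f(Σi wi·xi + b), where the wi are the connection weights, b is a bias term, and f is a nonlinear activation function such as the sigmoid.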
A Multi-Layer Feed-Forward Neural Network
The inputs to the network correspond to the attributes measured for each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up the output layer, which
emits the network's prediction
The network is feed-forward in that none of the weights cycles back to an input unit or to an output
unit of a previous layer
From a statistical point of view, networks perform nonlinear regression: Given enough hidden units
and enough training samples, they can closely approximate any function
BACKPROPAGATION
Iteratively process a set of training tuples & compare the network's prediction with the actual known
target value
For each training tuple, the weights are modified to minimize the mean squared error between the
network's prediction and the actual target value
Modifications are made in the “backwards” direction: from the output layer, through each hidden
layer down to the first hidden layer, hence “backpropagation”
Steps
Initialize weights (to small random #s) and biases in the network
Propagate the inputs forward (by applying activation function)
Back propagate the error (by updating weights and biases)
Terminating condition (when the error is very small, etc.); a minimal sketch of these steps follows below
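The sketch below walks through the four steps for a single hidden layer, assuming NumPy and small synthetic data (none of which appears in the notes); it is an illustration, not the exact procedure used in class.

import numpy as np

def sigmoid(z):                       # nonlinear activation function
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((8, 3))                           # 8 hypothetical training tuples, 3 input attributes
y = rng.integers(0, 2, (8, 1)).astype(float)     # hypothetical binary target values

# Step 1: initialize weights and biases to small random numbers
W1, b1 = rng.normal(0.0, 0.1, (3, 4)), np.zeros(4)   # input layer -> hidden layer (4 units)
W2, b2 = rng.normal(0.0, 0.1, (4, 1)), np.zeros(1)   # hidden layer -> output layer

lr = 0.5                              # learning rate
for epoch in range(1000):
    # Step 2: propagate the inputs forward through the activation function
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Step 3: backpropagate the error, updating weights and biases layer by layer
    err_out = (y - output) * output * (1 - output)
    err_hid = (err_out @ W2.T) * hidden * (1 - hidden)
    W2 += lr * hidden.T @ err_out
    b2 += lr * err_out.sum(axis=0)
    W1 += lr * X.T @ err_hid
    b1 += lr * err_hid.sum(axis=0)

    # Step 4: terminating condition, e.g. stop when the mean squared error is very small
    if np.mean((y - output) ** 2) < 1e-3:
        break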
Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| * w),
with |D| tuples and w weights, but the number of epochs can be exponential in n, the number of inputs, in the
worst case
Rule extraction from networks: network pruning
Simplify the network structure by removing weighted links that have the least effect on the
trained network
Then perform link, unit, or activation value clustering
The set of input and activation values are studied to derive rules describing the relationship
between the input and hidden unit layers
Sensitivity analysis: assess the impact that a given input variable has on a network output. The
knowledge gained from this analysis can be represented in rules
SVM—SUPPORT VECTOR MACHINES
A new classification method for both linear and nonlinear data
It uses a nonlinear mapping to transform the original training data into a higher dimension
With the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision
boundary”)
With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can
always be separated by a hyperplane
SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined
by the support vectors)
Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear
decision boundaries (margin maximization)
Typical SVM applications include object recognition and speaker identification
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin)
are support vectors
This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function
with linear constraints, which can be solved with Quadratic Programming (QP) using Lagrangian multipliers
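A minimal sketch (assuming scikit-learn and synthetic two-class data, neither of which is part of the notes) of a linear SVM; the fitted model exposes the support vectors that define the maximum-margin hyperplane:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two roughly linearly separable classes (synthetic data)
X, y = make_blobs(n_samples=60, centers=2, random_state=0, cluster_std=0.8)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # solves the constrained QP internally
print(clf.support_vectors_)                   # the "essential" tuples lying on the margin
print(clf.predict([[1.0, 3.0]]))              # classify a new, hypothetical point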