UNIT 4

CLASSIFICATION AND PREDICTION

Classification and prediction - Issues Regarding Classification and Prediction - Classification by Decision
Tree Induction - Bayesian classification - Bayes’ Theorem - Naïve Bayesian Classification - Bayesian
Belief Network - Rule based classification - Classification by Back propagation - Support vector machines
- Prediction - Linear Regression.

What is classification? What is prediction?

Classification:

* Classification predicts the unknown (categorical) class label of a tuple from the values of its other attributes, using a classifier algorithm such as a decision tree.
* It first constructs a model (for example, a decision tree) from data whose class labels are known, and then uses that model to classify new data.
* Attributes are of two types: 1. categorical attributes and 2. numerical attributes.
* Classification can work with both of these attribute types.

Prediction:

1. Prediction is also used to estimate unknown or missing values, typically continuous (numerical) ones.
2. It also builds a model in order to predict those values.
3. Typical models include neural networks, if-else rules and other mechanisms.

Classification and prediction are used in applications such as:

* Credit approval
* Target marketing
* Medical diagnosis

Classification—A Two-Step Process

• Model construction: describing a set of predetermined classes

– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

– The set of tuples used for model construction: training set

– The model is represented as classification rules, decision trees, or mathematical formulae

• Model usage: for classifying future or unknown objects

– Estimate accuracy of the model

• The known label of test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
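• The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic, because a model that overfits the training data would not be penalized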
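
The accuracy estimate described in this step can be sketched as follows. This is a minimal illustration only: the six labelled tuples, the majority-class "model" and the 4/2 split are assumptions made for the example, not part of the notes.

```python
from collections import Counter

def train_majority_classifier(training_set):
    """Toy 'model': always predict the most frequent class label seen in training."""
    labels = [label for _, label in training_set]
    majority = Counter(labels).most_common(1)[0][0]
    return lambda sample: majority

def accuracy(model, test_set):
    """Percentage of test samples whose known label matches the model's prediction."""
    correct = sum(1 for sample, label in test_set if model(sample) == label)
    return 100.0 * correct / len(test_set)

# Hypothetical labelled tuples: (attributes, class label)
data = [
    ({"income": "high"}, "yes"), ({"income": "low"}, "no"),
    ({"income": "high"}, "yes"), ({"income": "medium"}, "yes"),
    ({"income": "low"}, "no"), ({"income": "medium"}, "yes"),
]

# The test set is kept disjoint from the training set.
training_set, test_set = data[:4], data[4:]
model = train_majority_classifier(training_set)
holdout_accuracy = accuracy(model, test_set)
print(f"accuracy on held-out test set: {holdout_accuracy:.1f}%")
```

Any real classifier could replace the majority-class stand-in; the point is only that accuracy is measured on tuples the model never saw during construction.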

K.BABU/ASSISTANT PROFESSOR/CSE Page 1


Process (1): Model Construction

Process (2): Using the Model in Prediction

Supervised vs. Unsupervised Learning

• Supervised learning (classification)
– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
ISSUES REGARDING CLASSIFICATION AND PREDICTION

There are two issues regarding classification and prediction:

Issues (1): Data Preparation

Issues (2): Evaluating Classification Methods

Issues (1): Data Preparation. Data preparation includes the following steps:

1) Data cleaning

Preprocess the data in order to reduce noise and handle missing values (refer to the data cleaning notes under preprocessing techniques)

2) Relevance analysis (feature selection)

Remove irrelevant or redundant attributes (refer to unit-iv AOI Relevance analysis)

3) Data transformation

Generalize and/or normalize the data (refer to the data cleaning notes under preprocessing techniques)

Issues (2): Evaluating Classification Methods. Classification methods are compared according to the following criteria:

1. Predictive accuracy

2. Speed

• Time to construct the model
• Time to use the model

3. Robustness

• Handling noise and missing values

4. Scalability

• Efficiency in disk-resident databases

5. Interpretability

• Understanding and insight provided by the model

6. Goodness of rules

• Decision tree size
• Compactness of classification rules


CLASSIFICATION BY DECISION TREE INDUCTION

Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample

– Test the attribute values of the sample against the decision tree

Training Dataset

This follows an example from Quinlan’s ID3

Age     Income  Student  Credit rating  Buys computer
<=30    High    No       Fair           No
<=30    High    No       Excellent      No
31…40   High    No       Fair           Yes
>40     Medium  No       Fair           Yes
>40     Low     Yes      Fair           Yes
>40     Low     Yes      Excellent      No
31…40   Low     Yes      Excellent      Yes
<=30    Medium  No       Fair           No
<=30    Low     Yes      Fair           Yes
>40     Medium  Yes      Fair           Yes
<=30    Medium  Yes      Excellent      Yes
31…40   Medium  No       Excellent      Yes
31…40   High    Yes      Fair           Yes
>40     Medium  No       Excellent      No



Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is employed for classifying
the leaf
– There are no samples left
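
The basic greedy algorithm can be sketched as below, using information gain as the selection measure. This is an illustrative sketch rather than Quinlan's implementation. The buys_computer class labels follow the standard textbook version of this 14-tuple dataset and should be treated as an assumption here, and the names `ROWS`, `information_gain` and `build_tree` are hypothetical.

```python
import math
from collections import Counter

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# (age, income, student, credit_rating, buys_computer); class labels are assumed.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(rows):
    """Expected information (in bits) needed to classify a tuple in rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr_index):
    """Reduction in entropy obtained by partitioning rows on one attribute."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row)
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(rows) - remainder

def build_tree(rows, attr_indices):
    labels = [row[-1] for row in rows]
    # Stop: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes remain, so label the leaf by majority voting.
    if not attr_indices:
        return Counter(labels).most_common(1)[0][0]
    best = max(attr_indices, key=lambda i: information_gain(rows, i))
    remaining = [i for i in attr_indices if i != best]
    branches = {}
    # Branches are created only for values that occur, so no partition is empty.
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining)
    return (ATTRIBUTES[best], branches)

tree = build_tree(ROWS, list(range(len(ATTRIBUTES))))
print(tree)
```

Each internal node is returned as a pair (attribute name, branches), and each leaf as a class label, so every root-to-leaf path reads directly as one IF-THEN rule.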
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand

Example

IF age = “<=30” AND student = “no” THEN buys_computer = “no”

IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”

IF age = “31…40” THEN buys_computer = “yes”

IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”

IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”

Avoid Overfitting in Classification


• The generated tree may overfit the training data
– Too many branches, some of which may reflect anomalies due to noise or outliers
– The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting

Prepruning:

• Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold
• It is difficult to choose an appropriate threshold

Postpruning:

• Remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
• Use a set of data different from the training data to decide which is the “best pruned tree”
Tree Mining in Weka and Tree Mining in Clementine.


Tree Mining in Weka
• Example:
– Weather problem: build a decision tree to guide the decision about whether or not to play tennis.
– Dataset (weather.nominal.arff)
• Validation:
– Using the training set as the test set gives an optimistic (inflated) estimate of classification accuracy.
– Expected accuracy on a different test set will typically be lower.
– 10-fold cross validation is more robust than using the training set as a test set.
• Divide the data into 10 sets with about the same proportion of class label values as in the original set.
• Run classification 10 times, each time holding out a different one of the 10 sets as the test set and using the remaining 9/10 of the data as the training set.
• Average the accuracy over the 10 runs.
– Ratio validation: 67% training set / 33% test set.
– Best: having a separate training set and test set.
• Results:
– Classification accuracy (correctly classified instances).
– Errors (mean absolute error, root mean squared error …)
– Kappa statistic (measures agreement between predicted and observed classification after chance agreement has been excluded; it ranges from -100% to 100%, where 100% means perfect agreement and 0% means agreement no better than chance)
• Results:
– TP (True Positive) rate per class label
– FP (False Positive) rate
– Precision = (TP / (TP + FP)) * 100%
– Recall = TP rate = (TP / (TP + FN)) * 100%
– F-measure = 2 * recall * precision / (recall + precision)
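
As a small worked illustration, the three measures can be computed from confusion counts using their standard definitions (precision divides by everything predicted positive, recall by everything actually positive). The counts below are hypothetical.

```python
def classification_metrics(tp, fp, fn):
    """Return (precision, recall, f_measure) as fractions between 0 and 1."""
    precision = tp / (tp + fp)   # of the tuples predicted positive, how many were right
    recall = tp / (tp + fn)      # of the truly positive tuples, how many were found (TP rate)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

# Hypothetical counts for one class label: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f_measure = classification_metrics(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

Note the parentheses around (recall + precision): the F-measure is the harmonic mean of the two values.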



• ID3 characteristics:
– Requires nominal values
– Improved into C4.5
• Dealing with numeric attributes
• Dealing with missing values
• Dealing with noisy data
• Generating rules from trees
Tree Mining in Clementine
• Methods:
– C5.0: target field must be categorical, predictor fields may be numeric or categorical, provides multiple
splits on the field that provides the maximum information gain at each level
– QUEST: target field must be categorical, predictor fields may be numeric ranges or categorical, statistical
binary split
– C&RT: target and predictor fields may be numeric ranges or categorical, statistical binary split based on
regression
– CHAID: target and predictor fields may be numeric ranges or categorical, statistical binary split based on
chi-square
Attribute Selection Measures

• Information Gain

• Gain ratio

• Gini Index
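
The three measures can be compared on a single candidate split with a short sketch. The class counts per branch below are a hypothetical example (14 tuples split three ways), and the function names are illustrative.

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count list such as [yes, no]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini impurity of a class-count list."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_measures(partitions):
    """partitions: one class-count list per branch of the candidate split."""
    n = sum(sum(p) for p in partitions)
    class_totals = [sum(col) for col in zip(*partitions)]
    weights = [sum(p) / n for p in partitions]
    info_gain = entropy(class_totals) - sum(
        w * entropy(p) for w, p in zip(weights, partitions))
    # Split information penalizes attributes with many small branches.
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)
    gain_ratio = info_gain / split_info
    # Weighted Gini impurity after the split (lower is better).
    gini_split = sum(w * gini(p) for w, p in zip(weights, partitions))
    return info_gain, gain_ratio, gini_split

# Hypothetical [yes, no] counts for the three branches of one attribute.
info_gain, gain_ratio, gini_split = split_measures([[2, 3], [4, 0], [3, 2]])
print(info_gain, gain_ratio, gini_split)
```

Information gain and gain ratio are maximized when choosing a split, whereas the weighted Gini impurity is minimized.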

Pruning of decision trees


The main task in decision-tree pruning is to simplify a decision tree by discarding one or more subtrees and replacing them with leaves. In replacing a subtree with a leaf, the algorithm expects to lower the predicted error rate and increase the quality of the classification model. But computing the error rate is not simple. An error rate based only on the training data set does not provide a suitable estimate. One possibility for estimating the predicted error rate is to use a new, additional set of test samples if they are available, or to use cross-validation techniques. Cross-validation divides the initially available samples into equal-sized blocks and, for each block, the tree is constructed from all samples except this block and tested on the samples in this block. With the available training and testing samples, the basic idea of decision-tree pruning is to remove parts of the tree (subtrees) that do not contribute to the classification accuracy on unseen testing samples, producing a less complex and thus more comprehensible tree. There are two ways in which the recursive-partitioning method can be modified:

1. Deciding not to divide a set of samples any further under some conditions. The stopping criterion is usually based on a statistical test, such as the χ2 test: if there are no significant differences in classification accuracy before and after division, then the current node is represented as a leaf. The decision is made in advance, before splitting, and therefore this approach is called prepruning.
2. Removing some of the tree structure retrospectively, using selected accuracy criteria. The decision in this process of postpruning is made after the tree has been built.

C4.5 follows the postpruning approach, but it uses a specific technique to estimate the predicted error rate. This method is called pessimistic pruning. For every node in a tree, an estimate of the upper confidence limit U_CF is computed using the statistical tables for the binomial distribution (given in most textbooks on statistics). The parameter U_CF is a function of |Ti| and E for a given node, where |Ti| is the number of cases at the node and E is the number of those cases that are misclassified. C4.5 uses the default confidence level of 25%, and compares U25%(|Ti|, E) for a given node Ti with the weighted confidence of its leaves. The weights are the total number of cases for every leaf. If the predicted error for the root node of a subtree is less than the weighted sum of U25% for the leaves (the predicted error for the subtree), then the subtree is replaced with its root node, which becomes a new leaf in the pruned tree.

Let us illustrate this procedure with one simple example. A subtree of a decision tree is given in the figure, where the root node is the test x1 on three possible values {1, 2, 3} of the attribute A. The children of the root node are leaves denoted with their corresponding classes and (|Ti|, E) parameters. The question is to estimate the possibility of pruning the subtree and replacing it with its root node as a new, generalized leaf node.

To analyze the possibility of replacing the subtree with a leaf node it is necessary to compute a predicted error PE for the initial tree and for the replaced node. Using the default confidence of 25%, the upper confidence limits for all nodes are collected from statistical tables: U25%(6, 0) = 0.206, U25%(9, 0) = 0.143, U25%(1, 0) = 0.750, and U25%(16, 1) = 0.157. Using these values, the predicted errors for the initial tree and the replaced node are

PE_tree = 6 · 0.206 + 9 · 0.143 + 1 · 0.750 = 3.273

PE_node = 16 · 0.157 = 2.512

Since the existing subtree has a higher value of predicted error than the replaced node, it is
recommended that the decision tree be pruned and the subtree replaced with the new leaf node.
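
The comparison in this example can be reproduced with a few lines. The upper confidence limits are the table values quoted in the text. The closed form used in `upper_limit_no_errors` is an added assumption that applies only to leaves with no errors (E = 0), where the binomial bound reduces to solving (1 - p)^n = CF; it is included just to show where the table values for such leaves come from.

```python
def upper_limit_no_errors(n, cf=0.25):
    """Upper confidence limit on the error rate for n cases with 0 errors.

    Assumes the binomial bound (1 - p)**n = cf, which holds only when E = 0.
    """
    return 1.0 - cf ** (1.0 / n)

# (number of cases, U25%) for the three leaves of the subtree, from the text.
leaves = [(6, 0.206), (9, 0.143), (1, 0.750)]
# The subtree root treated as a single leaf: 16 cases, 1 error, U25% from the text.
root_as_leaf = (16, 0.157)

pe_tree = sum(n * u for n, u in leaves)        # predicted error of the subtree
pe_node = root_as_leaf[0] * root_as_leaf[1]    # predicted error of the replacing leaf
prune = pe_node < pe_tree

print(f"PE_tree={pe_tree:.3f} PE_node={pe_node:.3f} prune={prune}")
```

For nodes with E > 0 the limit has to be read from binomial tables or solved numerically, which this sketch does not attempt.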



BAYESIAN CLASSIFICATION
• Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical
approaches to certain types of learning problems
• Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis
is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
• Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of
optimal decision making against which other methods can be measured

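BAYES' THEOREM
• Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

• MAP (maximum a posteriori) hypothesis:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

• Practical difficulty: requires initial knowledge of many probabilities and has significant computational cost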

Naïve Bayes Classifier (I)

• A simplifying assumption: attributes are conditionally independent given the class:

$$P(C_j \mid V) \propto P(C_j) \prod_{i=1}^{n} P(v_i \mid C_j)$$

• This greatly reduces the computation cost: only the class distribution needs to be counted.



Naive Bayesian Classifier (II)
Given a training set, we can compute the probabilities

Outlook       P     N       Humidity   P     N
sunny         2/9   3/5     high       3/9   4/5
overcast      4/9   0       normal     6/9   1/5
rain          3/9   2/5

Temperature   P     N       Windy      P     N
hot           2/9   2/5     true       3/9   3/5
mild          4/9   2/5     false      6/9   2/5
cool          3/9   1/5

Bayesian classification

• The classification problem may be formalized using a-posteriori probabilities:


• P(C|X) = probability that the sample tuple X = <x1,…,xk> is of class C
• E.g. P(class=N | outlook=sunny, windy=true,…)
• Idea: assign to sample X the class label C such that P(C|X) is maximal

Estimating a-posteriori probabilities


• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum
• Problem: estimating P(X|C) directly is infeasible, since it would require the joint distribution over every combination of attribute values

Naïve Bayesian Classification


• Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• If the i-th attribute is categorical:
P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
• If the i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density function
• Computationally easy in both cases
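
A worked example using the conditional probabilities tabulated under Naive Bayesian Classifier (II) is sketched below. The class priors 9/14 and 5/14 are inferred from the denominators in that table (9 tuples of class P, 5 of class N), and the unseen sample X is a hypothetical choice made for illustration.

```python
from fractions import Fraction as F

# Class priors inferred from the table denominators: 9 P tuples, 5 N tuples.
prior = {"P": F(9, 14), "N": F(5, 14)}

# P(attribute = value | class), copied from the probability table.
cond = {
    "outlook": {
        "sunny": {"P": F(2, 9), "N": F(3, 5)},
        "overcast": {"P": F(4, 9), "N": F(0)},
        "rain": {"P": F(3, 9), "N": F(2, 5)},
    },
    "temperature": {
        "hot": {"P": F(2, 9), "N": F(2, 5)},
        "mild": {"P": F(4, 9), "N": F(2, 5)},
        "cool": {"P": F(3, 9), "N": F(1, 5)},
    },
    "humidity": {
        "high": {"P": F(3, 9), "N": F(4, 5)},
        "normal": {"P": F(6, 9), "N": F(1, 5)},
    },
    "windy": {
        "true": {"P": F(3, 9), "N": F(3, 5)},
        "false": {"P": F(6, 9), "N": F(2, 5)},
    },
}

def score(sample, cls):
    """P(X|C)·P(C) under the naive independence assumption."""
    result = prior[cls]
    for attribute, value in sample.items():
        result *= cond[attribute][value][cls]
    return result

# Hypothetical unseen sample X.
X = {"outlook": "rain", "temperature": "hot", "humidity": "high", "windy": "false"}
scores = {cls: score(X, cls) for cls in prior}
predicted = max(scores, key=scores.get)
print({cls: float(s) for cls, s in scores.items()}, "->", predicted)
```

Because P(X) is the same for every class, comparing the products P(X|C)·P(C) is enough to pick the class with the largest posterior.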

Bayesian Belief Networks

• A Bayesian belief network allows conditional independence to be defined among subsets of the variables
• A graphical model of causal relationships
– Represents dependencies among the variables
– Gives a specification of the joint probability distribution

Bayesian Belief Network: An Example

The conditional probability table (CPT) for variable LungCancer:

The CPT shows the conditional probability for each possible combination of values of its parents.
Derivation of the probability of a particular combination of values of X, from the CPT:

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(x_i))$$
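
A minimal sketch of how a joint probability is assembled from CPT entries follows. Only the variable LungCancer is named in these notes; the parent variables FamilyHistory and Smoker, and every probability value below, are hypothetical assumptions chosen purely for illustration.

```python
# Hypothetical network: FamilyHistory -> LungCancer <- Smoker
p_family_history = 0.1   # assumed P(FamilyHistory = True)
p_smoker = 0.3           # assumed P(Smoker = True)

# Assumed CPT: P(LungCancer = True | FamilyHistory, Smoker)
cpt_lung_cancer = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

def prob(value, p_true):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(family_history, smoker, lung_cancer):
    """Product over all nodes of P(node value | values of its parents)."""
    return (prob(family_history, p_family_history)
            * prob(smoker, p_smoker)
            * prob(lung_cancer, cpt_lung_cancer[(family_history, smoker)]))

p_all_true = joint(True, True, True)
total = sum(joint(fh, s, lc)
            for fh in (True, False)
            for s in (True, False)
            for lc in (True, False))
print(p_all_true, total)
```

Summing the joint over all eight combinations gives 1, which is a quick sanity check that the factorization defines a valid distribution.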

Association-Based Classification

• Several methods for association-based classification


– ARCS: Quantitative association mining and clustering of association rules (Lent et al’97)
• It beats C4.5 in (mainly) scalability and also accuracy
– Associative classification: (Liu et al’98)
• It mines high support and high confidence rules in the form of “cond_set => y”, where y is a class label
– CAEP (Classification by aggregating emerging patterns) (Dong et al’99)
• Emerging patterns (EPs): the item sets whose support increases significantly from one class to another
• Mine EPs based on minimum support and growth rate
Rule Based Classification

Using IF-THEN Rules for Classification

• Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
– Rule antecedent/precondition vs. rule consequent
• Assessment of a rule: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers / |D|   (D: training data set)
accuracy(R) = ncorrect / ncovers
• If more than one rule is triggered, conflict resolution is needed
– Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
– Class-based ordering: decreasing order of prevalence or misclassification cost per class
– Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
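
Coverage and accuracy of the rule R can be computed as in the sketch below. The five training tuples in `D` are hypothetical and exist only to make the two ratios concrete.

```python
# Hypothetical training data set D.
D = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "yes", "buys_computer": "no"},
    {"age": "youth", "student": "no", "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
]

def antecedent(t):
    """Rule R: IF age = youth AND student = yes ..."""
    return t["age"] == "youth" and t["student"] == "yes"

CONSEQUENT = "yes"  # ... THEN buys_computer = yes

covered = [t for t in D if antecedent(t)]
n_covers = len(covered)
n_correct = sum(1 for t in covered if t["buys_computer"] == CONSEQUENT)

coverage = n_covers / len(D)      # fraction of D whose attributes satisfy the antecedent
accuracy = n_correct / n_covers   # fraction of covered tuples the rule classifies correctly

print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

A rule can have high accuracy but low coverage (or the reverse), which is why both are reported when assessing rule quality.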
Rule Extraction from a Decision Tree

• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive
• Example: rule extraction from our buys_computer decision tree

IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
Rule Extraction from the Training Data
• Sequential covering algorithm: extracts rules directly from the training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially; each rule for a given class Ci should cover many tuples of Ci but none (or few) of the tuples of other classes
• Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rule are removed
– The process repeats on the remaining tuples until a termination condition is met, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
• Comparison with decision-tree induction: a decision tree learns a set of rules simultaneously
CLASSIFICATION BY BACKPROPAGATION

• Backpropagation: a neural network learning algorithm
• The field was started by psychologists and neurobiologists to develop and test computational analogues of neurons
• A neural network: a set of connected input/output units where each connection has a weight associated with it
• During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
• Also referred to as connectionist learning due to the connections between units
NEURAL NETWORK AS A CLASSIFIER
• Weakness
– Long training time
– Requires a number of parameters that are typically best determined empirically, e.g., the network topology or “structure”
– Poor interpretability: it is difficult to interpret the symbolic meaning behind the learned weights and of the “hidden units” in the network
• Strength
– High tolerance to noisy data
– Ability to classify untrained patterns
– Well-suited for continuous-valued inputs and outputs
– Successful on a wide array of real-world data
– Algorithms are inherently parallel
– Techniques have recently been developed for the extraction of rules from trained neural networks



A Neuron (= a perceptron)

• The n-dimensional input vector x is mapped into the variable y by means of a scalar product with a weight vector, followed by a nonlinear function mapping
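
The mapping performed by a single unit can be sketched as follows. The sigmoid is used here as one common choice of nonlinear function, which is an assumption since the notes do not name the function; the input, weights and bias are hypothetical values.

```python
import math

def neuron(x, w, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Hypothetical 3-dimensional input, weight vector and bias.
y = neuron(x=[1.0, 0.0, 1.0], w=[0.2, 0.4, -0.5], bias=-0.4)
print(round(y, 4))
```

Swapping the sigmoid for a sign or step function gives the classic perceptron output of +1 or -1 instead of a value between 0 and 1.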
A Multi-Layer Feed-Forward Neural Network

• The inputs to the network correspond to the attributes measured for each training tuple
• Inputs are fed simultaneously into the units making up the input layer
• They are then weighted and fed simultaneously to a hidden layer
• The number of hidden layers is arbitrary, although usually only one is used
• The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction
• The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
• From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function
BACKPROPAGATION
• Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
• For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
• Modifications are made in the “backwards” direction: from the output layer, through each hidden layer down to the first hidden layer, hence “backpropagation”
• Steps
– Initialize weights (to small random numbers) and biases in the network
– Propagate the inputs forward (by applying the activation function)
– Backpropagate the error (by updating weights and biases)
– Check the terminating condition (when the error is very small, etc.)
• Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| * w), with |D| tuples and w weights, but the number of epochs can be exponential in n, the number of inputs, in the worst case
• Rule extraction from networks: network pruning
– Simplify the network structure by removing weighted links that have the least effect on the trained network
– Then perform link, unit, or activation value clustering
– The sets of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers
• Sensitivity analysis: assess the impact that a given input variable has on a network output. The knowledge gained from this analysis can be represented in rules
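
One forward pass followed by one backward weight update can be sketched for a tiny network with three inputs, two hidden units and one output unit. This is a minimal sketch assuming sigmoid activations and per-tuple updates; the initial weights, biases, learning rate and training tuple are hypothetical values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, net):
    """Propagate the inputs forward; returns (hidden outputs, network output)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    output = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    return hidden, output

def backprop_step(x, target, net, lr):
    """Update all weights and biases in place for one training tuple."""
    hidden, output = forward(x, net)
    # Error at the output unit: sigmoid derivative times (target - output).
    err_out = output * (1 - output) * (target - output)
    # Propagate the error backwards to each hidden unit (uses the old output weights).
    err_hidden = [h * (1 - h) * err_out * w for h, w in zip(hidden, net["w_out"])]
    # Update the output layer, then the hidden layer.
    net["w_out"] = [w + lr * err_out * h for w, h in zip(net["w_out"], hidden)]
    net["b_out"] += lr * err_out
    for j, err in enumerate(err_hidden):
        net["w_hidden"][j] = [w + lr * err * xi
                              for w, xi in zip(net["w_hidden"][j], x)]
        net["b_hidden"][j] += lr * err

# Hypothetical initial weights and biases.
net = {
    "w_hidden": [[0.2, 0.4, -0.5], [-0.3, 0.1, 0.2]],
    "b_hidden": [-0.4, 0.2],
    "w_out": [-0.3, -0.2],
    "b_out": 0.1,
}
x, target = [1.0, 0.0, 1.0], 1.0

_, before = forward(x, net)
backprop_step(x, target, net, lr=0.9)
_, after = forward(x, net)
print(f"prediction before={before:.3f} after={after:.3f} target={target}")
```

In a full training run this step would be repeated over every tuple for many epochs until the terminating condition is met; a single step is shown only to make the direction of the update visible.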
SVM—SUPPORT VECTOR MACHINES
• A new classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training data into a higher dimension
• Within the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision boundary”)
• With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)
• Features: training can be slow but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization)
• Used both for classification and prediction
• Applications:
– Handwritten digit recognition
– Object recognition
– Speaker identification
– Benchmarking time-series prediction tests

SVM—General Philosophy

SVM—Margins and Support Vectors



SVM—Linearly Separable
• A separating hyperplane can be written as
W · X + b = 0
where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
• For 2-D it can be written as (with the bias b written as w0)
w0 + w1 x1 + w2 x2 = 0
• The hyperplanes defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ -1 for yi = -1
• Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem: quadratic objective function
and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
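
The margin conditions can be checked numerically for a given hyperplane. The sketch below does not solve the quadratic program; it takes a hypothetical weight vector `w` and bias `b` as given, verifies the H1/H2 constraints on a hypothetical 2-D training set, and picks out the tuples that lie exactly on the margin.

```python
import math

# Hypothetical separating hyperplane w · x + b = 0 in 2-D.
w = [1.0, 1.0]
b = -3.0

# Hypothetical training tuples: ((x1, x2), class label yi in {+1, -1}).
points = [
    ((1.0, 1.0), -1), ((2.0, 0.0), -1), ((0.0, 0.0), -1),
    ((2.0, 2.0), +1), ((3.0, 1.0), +1), ((4.0, 3.0), +1),
]

def decision_value(x):
    """Signed value of w · x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x):
    return +1 if decision_value(x) >= 0 else -1

# H1/H2 constraints combined: yi * (w · xi + b) >= 1 for every training tuple.
all_constraints_hold = all(y * decision_value(x) >= 1.0 for x, y in points)

# Support vectors lie exactly on H1 or H2, where yi * (w · xi + b) = 1.
support_vectors = [x for x, y in points if math.isclose(y * decision_value(x), 1.0)]

# Distance between H1 and H2.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))

print(all_constraints_hold, support_vectors, round(margin_width, 4))
```

An SVM trainer searches for the `w` and `b` that make `margin_width` as large as possible while keeping every constraint satisfied.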

Why Is SVM Effective on High Dimensional Data?

• The complexity of the trained classifier is characterized by the number of support vectors rather than the dimensionality of the data
• The support vectors are the essential or critical training examples: they lie closest to the decision boundary (the maximum marginal hyperplane, MMH)
• If all other training examples are removed and the training is repeated, the same separating hyperplane would be found
• The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
• Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high



PREDICTION
• (Numerical) prediction is similar to classification
– construct a model
– use the model to predict a continuous or ordered value for a given input
• Prediction is different from classification
– Classification refers to predicting a categorical class label
– Prediction models continuous-valued functions
• Major method for prediction: regression
– model the relationship between one or more independent or predictor variables and a dependent or response variable
• Regression analysis
– Linear and multiple regression
– Non-linear regression
– Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Linear Regression
• Linear regression: involves a response variable y and a single predictor variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
• Method of least squares: estimates the best-fitting straight line
• Multiple linear regression: involves more than one predictor variable
– Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
– Ex. For 2-D data, we may have: y = w0 + w1 x1 + w2 x2
– Solvable by an extension of the least squares method or by using SAS, S-Plus
– Many nonlinear functions can be transformed into the above
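
For a single predictor, the method of least squares has a closed-form solution, sketched below. The five (x, y) pairs are hypothetical and the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Least-squares estimates of w0 (intercept) and w1 (slope) for y = w0 + w1 x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w1 = s_xy / s_xx
    # The fitted line always passes through the point of means.
    w0 = y_mean - w1 * x_mean
    return w0, w1

# Hypothetical training data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
w0, w1 = fit_line(xs, ys)
predicted = w0 + w1 * 6   # predict the response for a new input x = 6
print(w0, w1, predicted)
```

The same idea extends to multiple linear regression, where the coefficients are obtained by solving a system of linear equations instead of a single ratio.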
Nonlinear Regression
• Some nonlinear models can be modeled by a polynomial function
• A polynomial regression model can be transformed into a linear regression model. For example,
y = w0 + w1 x + w2 x^2 + w3 x^3
is convertible to linear form with the new variables x2 = x^2, x3 = x^3:
y = w0 + w1 x + w2 x2 + w3 x3
• Other functions, such as the power function, can also be transformed to a linear model
• Some models are intractably nonlinear (e.g., a sum of exponential terms)
– It is still possible to obtain least squares estimates through extensive calculation on more complex formulae
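
The transformation of a cubic model into a linear one can be sketched as follows. The new variables x^2 and x^3 become extra columns, and the coefficients are then found with ordinary least squares via the normal equations. The data points are hypothetical (generated from a known cubic so the result can be checked), and the helper names `solve` and `fit_cubic` are illustrative.

```python
def solve(a, b):
    """Solve the linear system a·w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w

def fit_cubic(xs, ys):
    """Fit y = w0 + w1 x + w2 x^2 + w3 x^3 as a linear model in (1, x, x2, x3)."""
    # New variables x2 = x^2 and x3 = x^3 make the model linear in the weights.
    rows = [[1.0, x, x ** 2, x ** 3] for x in xs]
    k = 4
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data generated from y = 1 + 2x - x^2 + 0.5x^3.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x - x ** 2 + 0.5 * x ** 3 for x in xs]
w = fit_cubic(xs, ys)
print([round(wi, 6) for wi in w])
```

The model stays nonlinear in x but is linear in the weights, which is what allows the least squares machinery to be reused unchanged.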
