8.predictive Analytics - Classification 2
8.predictive Analytics - Classification 2
Introductory
Big Data
College of Engineering
Chapter -9-
Predictive Analytics -
Classification
Learning Objectives
• Classification
• Binary classification
• Multivariate classification
• KNN algorithm
• Predictive Performance Measures for Classification
Classification
• Classification task: a predictive task where the label to be assigned to a
new, unlabeled, object, given the value of its predictive attributes, is a
qualitative value representing a class or category.
• Usually, one of these classes is referred to as the positive class and the other as the
negative class. The positive class is usually the class of particular interest.
• Example:
• A medical diagnosis classification task can have only two classes: healthy and sick.
• The “sick” class is the main class of interest, so it is the positive class.
Binary Classification - Example
• Suppose you want to go out for dinner and you want to predict who, among your new
contacts, would be a
good company.
• To make this decision, you can use data from previous dinners with people in
your social network
What is the prediction for a new friend (Name: “Noah”, Age: 26)?
Example 2
• The data set in the previous example is easy to classify, classification tasks are usually
not that easy!
• What about this data set, can you find a decision boundary that separates the data?
• In binary classification tasks with two predictive attributes, when the decision boundary can
separate the objects of the two classes, we say that the classification data set is linearly
separable.
Example 2 – Cont’d
• If we add the attribute “Education level”:
• What is the prediction for the data point represented by the red dot?
Classification Algorithms
• The previous examples demonstrated very simple classification examples.
• Real life scenarios are more complex and require algorithms to find the
classification decision boundary
• Thousands of classification algorithms can be found in the literature. These algorithms
follow different principles and can be roughly categorized as:
1. distance-based algorithms:
• The simplest approach to predicting the class of a new object
• How similar this object is to other, previously labeled, objects
• Data points close to each other are thought to be similar and typically belong to the same class
2. probability-based algorithms:
• Estimates the probability of an object belonging to each class, given the information (training data) we
have.
• Probabilistic classification algorithms can model the probabilistic relationship between the predictive
attributes and the target attribute of a data set
Distance-based Learning Algorithms -
K-nearest Neighbor Algorithms (KNN)
• In a confusion matrix:
• The two rows represent the predicted classes and the two columns the true, correct, classes
• The correct predictions are the values on the main diagonal ( TPs
and TNs)
• The incorrect predictions (FPs and FNs) are shown in the secondary diagonal
Predictive Performance Measures for
Classification -Example
• The Output classes: Sick, not sick
• Sick is the positive class
• TP and TN are the correct predictions
• FN, FPs are the wrong predictions