
COE 102

Introductory
Big Data
College of Engineering

Chapter 9
Predictive Analytics -
Classification
Learning Objectives
• Classification
• Binary classification
• Multiclass classification
• KNN algorithm
• Predictive Performance Measures for Classification
Classification
• A classification task is a predictive task in which the label assigned to a new,
unlabeled object, given the values of its predictive attributes, is a qualitative
value representing a class or category.

• Examples of classification tasks where ML algorithms can be used include:


• classifying an email as spam or useful

• diagnosing a patient as sick or healthy


• classifying a financial transaction as fraudulent or normal
• identifying a face as belonging to a criminal or to an innocent person
• predicting if someone in our social network would be good dinner
company or not
• Can you think of more classification examples?
Classification Types
• Classification can be considered:
• Binary classification
• Multiclass classification
Binary Classification
• The target attribute can have one of only two possible values, for example, “Yes” or
“No”

• Usually, one of these classes is referred to as the positive class and the other as the
negative class. The positive class is usually the class of particular interest.
• Example:
• A medical diagnosis classification task can have only two classes: healthy and sick.
• The “sick” class is the main class of interest, so it is the positive class.
Binary Classification - Example
• Suppose you want to go out for dinner and you want to predict who, among your new
contacts, would be a
good company.
• To make this decision, you can use data from previous dinners with people in
your social network

- Identify the predictive attributes?


- Identify the target attribute and its possible values?
Example – Cont’d
• Step 1: illustrate the available data

• Step 2: a simple classification model can be induced by applying a classification
algorithm to identify a vertical line able to separate the objects into the two classes.
• A simple model can be induced, which can be a rule saying:

What is the prediction for a new friend (Name: “Noah”, Age: 26)?
Example 2
• The data set in the previous example is easy to classify; real classification tasks are
usually not that easy!
• What about this data set, can you find a decision boundary that separates the data?

• An alternative way to deal with this difficulty is to extract an additional predictive
attribute from our social network data that could allow the induction of a classification
model able to discriminate between the two classes
Example 2 – Cont’d
• If we add the attribute “Education level”:

• In binary classification tasks with two predictive attributes, when a straight line (a
linear decision boundary) can separate the objects of the two classes, we say that the
classification data set is linearly separable.
Example 2 – Cont’d
• If we add the attribute “Education level”:

• What is the prediction for the data point represented by the red dot?
Classification Algorithms
• The previous examples demonstrated very simple classification examples.
• Real life scenarios are more complex and require algorithms to find the
classification decision boundary
• Thousands of classification algorithms can be found in the literature. These algorithms
follow different principles and can be roughly categorized as:
1. Distance-based algorithms:
• The simplest approach to predicting the class of a new object
• Predict based on how similar this object is to other, previously labeled, objects
• Data points close to each other are assumed to be similar and typically belong to the same class

2. probability-based algorithms:
• Estimates the probability of an object belonging to each class, given the information (training data) we
have.
• Probabilistic classification algorithms can model the probabilistic relationship between the predictive
attributes and the target attribute of a data set
Distance-based Learning Algorithms -
K-nearest Neighbor Algorithms (KNN)

• The k-NN algorithm is one of the simplest classification algorithms


• Whenever k-NN has to predict the class of a new object, it just
identifies the class of the k objects most similar to this object.
• The value of k defines how many neighboring labeled objects are
consulted.
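The idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the training data (age, weight pairs with "good"/"bad" labels) is hypothetical, echoing the dinner-company example:

```python
import math
from collections import Counter

def knn_predict(train, new_point, k=3):
    """Predict the class of new_point by majority vote among its
    k nearest labeled neighbors (Euclidean distance)."""
    # Sort the labeled objects by distance to the new object
    neighbors = sorted(train, key=lambda item: math.dist(item[0], new_point))
    # Majority vote among the k closest neighbors
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: (age, weight) -> "good"/"bad" dinner company
train = [((25, 70), "good"), ((27, 72), "good"),
         ((50, 90), "bad"), ((52, 95), "bad")]

print(knn_predict(train, (26, 71), k=3))  # -> good
```

Note that k-NN induces no explicit model: all the work happens at prediction time, which is why it is often called a lazy learning algorithm.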
K-NN
K-NN Examples
KNN - Selecting K
• To use the k-NN algorithm it is
necessary to define the value of k, the
number of nearest neighbors that
will be considered.

• A very large value of k:
• may include neighbors that are very
different from the object to be classified.
• Neighboring objects very far from the
new object are not good predictors of its
class.
• It also biases the prediction toward
the majority class.
KNN - Selecting K
• To use the k-NN algorithm it is necessary to define the value of k, the number of
nearest neighbors that will be considered.

• If the value of k is too small


• only objects very similar to the object to be classified will be considered.
• Thus, the amount of information used to classify the new object may not be
enough for a good classification.
• This can result in unstable behavior, since noisy objects can decide the
classification of new objects.
Probability-Based Algorithms- Logistic
Regression
• Although it has the term regression in its name, logistic regression is used
for classification tasks
• It estimates the probability of an object belonging to a class. For such, it
adjusts a logistic function to a training data set.
• Logistic regression can be applied to data sets with any number of predictive
attributes, which can be qualitative or quantitative
• The induced model defines a linear decision boundary (a straight line, with two
predictive attributes) separating the objects of the two classes:
• the logistic function itself produces values in the [0, 1] interval.
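The logistic (sigmoid) function squashes any real-valued score into the (0, 1) interval, which is what lets its output be read as a probability. A minimal sketch, where the weights w0 and w1 are hypothetical fitted values for a single predictive attribute x:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real value into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted model with one predictive attribute x: z = w0 + w1 * x
w0, w1 = -5.0, 0.2
for x in (10, 25, 40):
    p = sigmoid(w0 + w1 * x)
    print(f"x={x}: P(positive class) = {p:.2f}")
```

Note that sigmoid(0) = 0.5, so objects whose linear score z is exactly zero sit on the decision boundary.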
Probability-Based Algorithms- Logistic
Regression
• Initially, logistic regression estimates the probability of an object belonging to each of the two
classes; these probabilities are values in the [0, 1] interval.
• The ratio of the two probabilities (the odds) is then calculated, and a logarithmic function is applied
to the result (the log-odds, or logit).
• Since logistic regression returns a probability, a quantitative value, this value has to be
transformed into a qualitative value, a class label.
• In binary problems, when that probability is less than 0.5 the negative class is
predicted; otherwise, the positive class is predicted.
Examples:
1. What is the classification for data point X, if the logistic function returned a
probability value of 0.7?
2. What is the classification for data point X, if the logistic function returned a
probability value of 0.1?
3. What is the classification for data point X, if the logistic function returned a
probability value of 0.9?
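The 0.5-threshold rule from the slide can be written as a one-line helper; applying it to the three example probabilities answers them directly:

```python
def classify(prob, threshold=0.5):
    """Map a predicted probability to a binary class label:
    below the threshold -> negative class, otherwise -> positive class."""
    return "positive" if prob >= threshold else "negative"

# The three examples above:
for p in (0.7, 0.1, 0.9):
    print(p, "->", classify(p))  # 0.7 -> positive, 0.1 -> negative, 0.9 -> positive
```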
Predictive Performance Measures for
Classification
• When dealing with a classification task, it is necessary to evaluate how well the
induced model solves the task
• Several measures can be used to assess the predictive performance of a classification
model
• The main classification performance measures can be easily derived from a confusion
matrix, a 2 × 2 matrix that shows:
• how many objects were correctly classified and how many were misclassified

• In a confusion matrix:
• The two rows represent the predicted classes and the two columns the true, correct, classes
• The correct predictions are the values on the main diagonal (TPs and TNs)
• The incorrect predictions (FPs and FNs) are shown on the secondary diagonal
Predictive Performance Measures for
Classification -Example
• The Output classes: Sick, not sick
• Sick is the positive class
• TP and TN are the correct predictions
• FNs and FPs are the wrong predictions

• How many correct predictions did the Classifier make?


• How many wrong predictions did the classifier make?
Predictive Performance Measures for
Classification -Example
• A confusion matrix is typically used to calculate performance measures:
• Recall: returns the percentage of objects from the positive class that
are classified as positive by the induced classifier
• Specificity: this measure returns the percentage of objects from the negative class
that are classified by the induced classifier as negative
• Precision: returns the percentage of objects classified as positive that
were really positive
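The three measures follow directly from the four confusion-matrix counts. A short sketch (the TP/FN/FP/TN counts below are hypothetical, for illustration only):

```python
def classification_measures(tp, fn, fp, tn):
    """Recall, specificity, and precision from confusion-matrix counts."""
    recall = tp / (tp + fn)        # positives correctly identified, among all actual positives
    specificity = tn / (tn + fp)   # negatives correctly identified, among all actual negatives
    precision = tp / (tp + fp)     # actual positives, among all objects predicted positive
    return recall, specificity, precision

# Hypothetical counts for a sick / not-sick classifier
r, s, p = classification_measures(tp=40, fn=10, fp=5, tn=45)
print(f"recall={r:.2f} specificity={s:.2f} precision={p:.2f}")
```

Note that recall and precision trade off against each other: a classifier that predicts "sick" for everyone gets perfect recall but poor precision.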
Example
• The recall and the precision for this confusion matrix:
Class Activity
Using this confusion matrix, calculate the recall and precision, given that p is the positive class
Timeline
• Lab 5
• Quiz 2
• Project Phase 2 + 3:
• Visually appealing
• Interactive
• Slicers, filters
• Different visuals, different levels of analytics
Reading
• Chapter 9 from the textbook “A General Introduction to Data
Analytics”
