
Chapter 07

Classification

Dr. Steffen Herbold


[email protected]

Introduction to Data Science


https://sherbold.github.io/intro-to-data-science
Outline

• Overview

• Classification Models

• Comparison of Classification Models

• Summary

Example of Classification

This is a whale

This is a bear

The General Problem

(Diagram: a concept maps each object to its class, i.e., Object 1 to the class of object 1 and Object 2 to the class of object 2.)

The Formal Problem

• Object space O
  • Often infinite
• Representations of the objects in a feature space F
• Set of classes C
• A target concept that maps objects to classes, h*: O → C
• Classification
  • Finding an approximation h of the target concept

(Speech bubble on the slide: "How do you get the representations?")

The „Whale“ Hypothesis

• Why do we know this is a whale?

• Has a fin
• Oval body
• Black top, white bottom
• Blue background

Hypothesis: Objects with fins, an oval general shape, that are black on top and white on the bottom, in front of a blue background, are whales.
The Hypothesis

• A hypothesis h maps features to classes

(Speech bubble on the slide: "What if I am not sure about the class?")

• Approximation of the target concept

• Hypothesis = Classifier = Classification Model

Classification using Scores

• A numeric score for each class


• Often a probability distribution

• Example
  • Three classes: „whale“, „bear“, „other“
  • One score per class: score_whale, score_bear, score_other

• Standard approach:
  • The classification is the class with the highest score

Thresholds for Scores

• Different thresholds are also possible

(Figure annotations: "A threshold of 0.2 would miss 'Spam', but better identify 'No Spam'" and "Many 'No Spam' are incorrectly detected as spam if the 'highest score' rule is used.")
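
How a score threshold changes predictions, as a minimal sketch in Python with scikit-learn (the dataset, model, and threshold values are illustrative assumptions, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

scores = clf.predict_proba(X)[:, 1]  # score of the positive class

# Standard approach: the class with the highest score,
# which in the binary case is a threshold of 0.5
pred_default = (scores >= 0.5).astype(int)

# Custom threshold: 0.2 flags more instances as positive
pred_low = (scores >= 0.2).astype(int)
print(pred_default.sum(), pred_low.sum())
```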
Quality of a Hypothesis

• Goal: Approximation of the target concept


⇒ Use Test Data


• Structure is the same as training data
• Apply hypothesis

hasFin | shape     | colorTop | colorBottom | background | class | prediction
true   | oval      | black    | black       | blue       | whale | whale
false  | rectangle | brown    | brown       | green      | bear  | whale
…      | …         | …        | …           | …          | …     | …

The Confusion Matrix

• Table of actual values versus prediction

                        Actual class
                     whale   bear   other
Predicted   whale      29      1      3
class       bear        2     22     13
            other       4     11     51

Two whales were incorrectly predicted as bears.
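
Such a matrix can be computed with scikit-learn; a minimal sketch (the label vectors are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

actual    = ["whale", "whale", "bear", "other", "bear", "other"]
predicted = ["whale", "bear",  "bear", "other", "whale", "other"]

# Note: scikit-learn puts the actual class on the rows and the
# prediction on the columns, i.e., transposed relative to this slide
print(confusion_matrix(actual, predicted, labels=["whale", "bear", "other"]))
```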

Binary Classification

• Many problems are binary


• Will I get my money back?
• Is this credit card fraud?
• Will my paper be accepted?
• …

• Can all be formulated as either being in a class or not


⇒ Labels true and false

The Binary Confusion Matrix
                           Actual class
                        true                    false
Predicted   true   True Positives (TP)     False Positives (FP)
class       false  False Negatives (FN)    True Negatives (TN)

• False positives are also called Type I error


• False negatives are also called Type II error

Binary Performance Metrics (1)

• Rates per actual class

• True positive rate, recall, sensitivity: TPR = TP / (TP + FN)
  • Percentage of actually „True“ that is predicted correctly

• True negative rate, specificity: TNR = TN / (TN + FP)
  • Percentage of actually „False“ that is predicted correctly

• False negative rate: FNR = FN / (TP + FN)
  • Percentage of actually „True“ that is predicted wrongly

• False positive rate: FPR = FP / (FP + TN)
  • Percentage of actually „False“ that is predicted wrongly

Binary Performance Metrics (2)

• Rates per predicted class

• Positive predictive value, precision: PPV = TP / (TP + FP)
  • Percentage of predicted „True“ that is predicted correctly

• Negative predictive value: NPV = TN / (TN + FN)
  • Percentage of predicted „False“ that is predicted correctly

• False discovery rate: FDR = FP / (TP + FP)
  • Percentage of predicted „True“ that is predicted wrongly

• False omission rate: FOR = FN / (TN + FN)
  • Percentage of predicted „False“ that is predicted wrongly

Binary Performance Metrics (3)

• Metrics that take „everything“ into account

• Accuracy
  • Percentage of data that is predicted correctly
  • accuracy = (TP + TN) / (TP + TN + FP + FN)

• F1 measure
  • Harmonic mean of precision and recall
  • F1 = 2 · precision · recall / (precision + recall)

• Matthews correlation coefficient (MCC)
  • Chi-squared correlation between prediction and actual values
  • MCC = (TP·TN - FP·FN) / sqrt((TP+FP)·(TP+FN)·(TN+FP)·(TN+FN))
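
All of these metrics are available in scikit-learn; a minimal sketch with made-up label vectors:

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

print("recall (TPR):   ", recall_score(y_true, y_pred))
print("precision (PPV):", precision_score(y_true, y_pred))
print("accuracy:       ", accuracy_score(y_true, y_pred))
print("F1:             ", f1_score(y_true, y_pred))
print("MCC:            ", matthews_corrcoef(y_true, y_pred))
```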

Receiver Operator Characteristics (ROC)

• Plot of true positive rate (TPR) versus false positive rate (FPR)

• Different TPR/FPR values possible due to thresholds for scores

Area Under the Curve (AUC)

• Large Area = Good Performance

• Accounts for tradeoffs between TPR and FPR
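
A sketch of how the ROC points and the AUC are typically computed with scikit-learn (the dataset and model are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# One TPR/FPR pair per score threshold; plotting tpr over fpr gives the ROC curve
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```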

Micro and Macro Averaging

• Metrics not directly applicable for more than two classes


• Accuracy is the exception
• Micro Averaging
  • Expand the formulas to use the individual positive and negative examples of each class
• Macro Averaging
  • Assume one class as true, combine all others as false
  • Compute the metric for all such combinations
  • Take the average
• Example for the true positive rate:
  • TPR_micro = Σ_c TP_c / Σ_c (TP_c + FN_c)
  • TPR_macro = (1 / |C|) · Σ_c TP_c / (TP_c + FN_c)
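
In scikit-learn this corresponds to the average parameter of the metric functions; a minimal sketch with made-up labels:

```python
from sklearn.metrics import recall_score

y_true = ["whale", "bear", "other", "whale", "bear", "other", "whale"]
y_pred = ["whale", "bear", "bear",  "whale", "other", "other", "bear"]

# Macro: one-vs-rest recall per class, then the unweighted average
print("macro recall:", recall_score(y_true, y_pred, average="macro"))

# Micro: pool the TP/FN counts of all classes, then compute recall once
print("micro recall:", recall_score(y_true, y_pred, average="micro"))
```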

Outline

• Overview

• Classification Models

• Comparison of Classification Models

• Summary

Overview of Classifiers

• The following classifiers are introduced


• k-nearest Neighbor
• Decision Trees
• Random Forests
• Logistic Regression
• Naive Bayes
• Support Vector Machines
• Neural Networks

k-nearest Neighbor

• Basic Idea
  • Instances with similar feature values should have the same class
  • The class can be determined by looking at instances that are similar

⇒ Assign each instance the mode of its k nearest instances
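
A minimal sketch with scikit-learn's KNeighborsClassifier (the dataset and the choice k = 5 are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# n_neighbors is k: the prediction is the mode of the k nearest labels
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:5]))
```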

Impact of k

(Figure: results of k-nearest neighbor for different values of k.)
Decision Trees

• Basic Idea
• Make decisions based on logical rules about features
• Organize rules as a tree

Basic Decision Tree Algorithm

• Recursive algorithm
• Stop if
  • Data is “pure”, i.e., mostly from one class
  • Amount of data is too small, i.e., only few instances in the partition
• Otherwise
  • Determine the „most informative feature“ X
  • Partition the training data using X
  • Recursively create a subtree for each partition

• Details may vary depending on the specific algorithm


• For example, CART, ID3, C4.5
• General concept always the same
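
A minimal CART-style sketch with scikit-learn (the dataset and the stopping parameter are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# min_samples_leaf implements the "amount of data too small" stopping rule
tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X, y)
print(export_text(tree))  # prints the learned rules as a tree of thresholds
```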

The „Most Informative Feature“

• Information theory based approach
  • Interpret each dimension as a random variable

• Entropy of the class label C: H(C) = -Σ_c p(c) · log p(c)
  • Can be used as a measure for purity

• Conditional entropy of the class label given a feature X:
  H(C|X) = -Σ_{x,c} p(x, c) · log p(c|x)

• Mutual Information: I(X; C) = H(C) - H(C|X)

⇒ The feature with the highest mutual information is the most informative
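
scikit-learn can estimate the mutual information between each feature and the class label directly; a small sketch (the dataset is an illustrative stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

iris = load_iris()

# Estimated mutual information between each feature and the class label;
# the highest value marks the "most informative" feature
mi = mutual_info_classif(iris.data, iris.target, random_state=0)
print(dict(zip(iris.feature_names, mi.round(2))))
```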

Decision Surface of Decision Trees

• All decisions are axis-aligned

(Figure annotation: overfitting)

Random Forest

• Basic Idea
  • Ensemble of randomized decision trees
  • Each tree is built on a randomized subset of the data with randomized attributes

⇒ Classification as majority vote of the random trees

Bagging as Ensemble Learner

• Bagging is short for bootstrap aggregating

• Randomly draw subsamples of the training data
• Build a model for each subsample ⇒ ensemble of models
• Voting to determine the class
  • Can be weighted, e.g., using the quality of the ensemble models

• Random Forests combine bagging with
  • Short decision trees, i.e., low depth
  • Allowing only a random subset of the features for each decision
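
In scikit-learn, both ideas are built into RandomForestClassifier; a minimal sketch (the parameter values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# bootstrap=True draws random subsamples, max_features limits each split
# to a random subset of the features, max_depth keeps the trees short
forest = RandomForestClassifier(n_estimators=100, max_depth=5,
                                max_features="sqrt", bootstrap=True,
                                random_state=0).fit(X, y)
print(forest.predict(X[:5]))
```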

Decision Surface of Random Forests

Logistic Regression

• Basic Idea:
  • Regression model of the probability that an object belongs to a class
  • Combines the logit function with linear regression

• Linear Regression
  • y as a linear combination of the features: y = b_0 + b_1·x_1 + … + b_m·x_m

• The logit function
  • logit(p) = ln(p / (1 - p))

• Logistic Regression
  • logit(P(y = 1 | x)) = b_0 + b_1·x_1 + … + b_m·x_m

Odds Ratios

• Probabilities vs. Odds
  • Probability of passing the exam: P = 0.75
  • Odds of passing the exam: P / (1 - P) = 0.75 / 0.25 = 3
  • The odds of passing the exam are 3 to 1

• If we invert the natural logarithm, we get
  odds(y = 1) = P(y = 1 | x) / (1 - P(y = 1 | x)) = exp(b_0) · exp(b_1·x_1) · … · exp(b_m·x_m)
  (definition of odds: odds = p / (1 - p))

• It follows that exp(b_i) is the odds ratio of feature x_i
  • The odds ratio is the change in odds if we increase x_i by one.
  • Odds ratio greater than one means increased odds
  • Odds ratio less than one means decreased odds
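
A sketch of reading odds ratios off a fitted scikit-learn model (the dataset is an illustrative stand-in; features are standardized so that "increase by one" means one standardized unit):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # so the coefficients are comparable

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(b_i) is the odds ratio of feature i: the factor by which the
# odds change when x_i increases by one (standardized) unit
print(np.exp(model.coef_[0])[:5])
```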

Decision Surface of Logistic Regression

• Decision boundaries are linear

Naive Bayes

• Basic idea:
  • Assume all features are independent
  • Score classes using the conditional probability

• Bayes Law: P(c | X) = P(X | c) · P(c) / P(X)

• Conditional probability of a class: P(c | x_1, …, x_m)

From Bayes Law to Naive Bayes

• Probability following Bayes law:
  P(c | x_1, …, x_m) = P(x_1, …, x_m | c) · P(c) / P(x_1, …, x_m)

• „Naive“ assumption: x_1, …, x_m are conditionally independent given c:
  P(x_1, …, x_m | c) = Π_i P(x_i | c)

• P(x_1, …, x_m) is independent of c and always the same, so it can be ignored for ranking

• Assign the class with the highest score: score(c) = P(c) · Π_i P(x_i | c)

Multinomial and Gaussian Naive Bayes

• Different variants exist in how P(x_i | c) is estimated

• Multinomial
  • P(x_i | c) is the empirical probability of observing a feature value
  • “Counts” observations of x_i in the data

• Gaussian
  • Assumes features follow a Gaussian/normal distribution
  • Estimates the conditional probability using the Gaussian density function
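
A minimal sketch of the Gaussian variant in scikit-learn (the dataset is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Gaussian variant: P(x_i | c) is a normal density whose mean and
# variance are estimated per class from the training data
nb = GaussianNB().fit(X, y)
print(nb.predict_proba(X[:3]).round(3))  # scores form a probability distribution
```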

Decision Surface of Naive Bayes

• Multinomial naive Bayes has linear decision boundaries

• Gaussian naive Bayes has piecewise quadratic decision boundaries

Support Vector Machines (SVM)

• Basic Idea:
• Calculate the decision boundary such that it is “far away” from the data

(Figure: a linear decision boundary; the support vectors are the instances with minimal distance to the decision boundary; the margin between them and the boundary is maximized)

Non-linear SVMs through Kernels

• Expand features using kernels to separate non-linear data
  • Transformation into a high-dimensional kernel space
  • Can be infinite-dimensional (e.g., Gaussian kernel, RBF kernel)!
  • Calculate a linear separation in the kernel space
  • Use the kernel trick to avoid the actual expansion

(Figure annotation: quadratic kernel)
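
A minimal sketch of a kernelized SVM in scikit-learn (the dataset and kernel choice are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

# The RBF kernel corresponds to an infinite-dimensional feature space;
# the kernel trick avoids ever computing the expansion explicitly
svm = SVC(kernel="rbf", C=1.0).fit(X, y)
print("support vectors per class:", svm.n_support_)
```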

Decision Surface of SVMs

• Shape of decision surface depends on kernel

Neural Networks

• Basic Idea:
  • Network of neurons with different layers and communication between neurons
  • Input layer feeds data into the network
  • Hidden layers “correlate” data
  • Output layer gives the computation results

(Figure: a network with an input layer, two hidden layers, and an output layer)
Multilayer Perceptron (MLP)

• First a weighted sum of the inputs: z = Σ_i w_i · x_i + b

• Then an activation function, e.g., the sigmoid σ(z) = 1 / (1 + exp(-z))

(Figure: each feature gets an input neuron; multiple fully connected hidden layers; a single output neuron with the classification)
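
A minimal MLP sketch in scikit-learn (layer sizes and activation are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# Two fully connected hidden layers; each neuron computes a weighted
# sum of its inputs followed by the activation function
mlp = MLPClassifier(hidden_layer_sizes=(10, 10), activation="relu",
                    max_iter=2000, random_state=0).fit(X, y)
print(mlp.predict(X[:5]))
```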

Decision Surface of MLP

• Shape of decision boundary depends on


• Activation function
• Number of hidden layers
• Number of neurons in the hidden layers

Outline

• Overview

• Classification Models

• Comparison of Classification Models

• Summary

General Approach

• Different approaches behind all covered classifiers
  • k-nearest Neighbor ⇒ instance based
  • Decision Trees ⇒ rule based + information theory
  • Random Forests ⇒ randomized ensemble
  • Logistic Regression ⇒ regression
  • Naive Bayes ⇒ conditional probability
  • Support Vector Machines ⇒ margin maximization + kernels
  • Neural Networks ⇒ (very complex) regression

Comparison of Decision Surfaces
IRIS Data

Results may vary with hyperparameter tuning


Comparison of Decision Surfaces
Non-linear separable

Results may vary with hyperparameter tuning


Comparison of Decision Surfaces
Circles within circles

Results may vary with hyperparameter tuning


Comparison of Execution Times

Times taken using GWDG Jupyter Hub and scikit-learn implementations of the algorithms.
Data randomly generated using sklearn.datasets.make_moons (July 2018).

Strengths and Weaknesses

                     | Explanatory | Concise        | Scoring | Categorical | Missing  | Correlated
                     | value       | representation |         | features    | features | features
k-nearest Neighbor   | o           | -              | -       | -           | +        | -
Decision Tree        | +           | +              | +       | +           | o        | +
Random Forest        | -           | o              | +       | +           | o        | +
Logistic Regression  | +           | +              | +       | o           | -        | o
Naive Bayes          | o           | o              | +       | +           | -        | -
SVM                  | -           | o              | -       | o           | -        | -
Neural Network       | -           | o              | +       | o           | -        | +

Outline

• Overview

• Classification Models

• Comparison of Classification Models

• Summary

Summary

• Classification is the task of assigning labels to objects

• Many evaluation criteria


• Confusion matrix commonly used

• Lots of classification algorithms


• Rule based, instance based, ensembles, regressions, …

• Different algorithms may be best in different situations
