Evaluating Model Performance
• Evaluating student performance
• Evaluating employee performance
• Evaluating machine learning algorithm performance
• Evaluating the performance of any medical test

Measuring performance for classification
• The goal of evaluating a classification model is to have a better understanding of how its performance will extrapolate to future cases.
• Though we've evaluated classifiers in the prior chapters, it's worth reflecting on the types of data at our disposal:
• Actual class values
• Predicted class values
• Estimated probability of the prediction
• The actual and predicted class values may be self-evident, but they are the key to evaluation. Just as a teacher uses an answer key to assess a student's answers, we need to know the correct answer for a machine learner's predictions. The goal is to maintain two vectors of data: one holding the correct or actual class values, and the other holding the predicted class values. Both vectors must have the same number of values stored in the same order. The predicted and actual values may be stored as separate R vectors or as columns in a single R data frame.
• Obtaining this data is easy. The actual class values come directly from the target feature in the test dataset. Predicted class values are obtained from the classifier built upon the training data and applied to the test data. For most machine learning packages, this involves applying the predict() function to a model object and a data frame of test data, such as: predicted_outcome <- predict(model, test_data).
• Studying the estimated prediction probabilities provides useful data for evaluating a model's performance. If two models make the same number of mistakes, but one is more capable of accurately assessing its uncertainty, then it is a smarter model. It is ideal to find a learner that is extremely confident when making a correct prediction, but timid in the face of doubt. The balance between confidence and caution is a key part of model evaluation.
• Unfortunately, obtaining these prediction probabilities can be tricky because the method to do so varies across classifiers. In general, for most classifiers, the predict() function is used with a parameter specifying the desired type of prediction. To obtain a single predicted class, such as spam or ham, you typically set the type = "class" parameter. To obtain the prediction probability, the type parameter should be set to one of "prob", "posterior", "raw", or "probability", depending on the classifier used.
• For example, to output the predicted probabilities for the C5.0 classifier (see Classification Using Decision Trees and Rules), use the predict() function with type = "prob" as follows:
• > predicted_prob <- predict(credit_model, credit_test, type = "prob")
• In most cases, the predict() function returns a probability for each category of the outcome. For example, in the case of a two-outcome model like the SMS classifier, the predicted probabilities might be a matrix or data frame inspected with:
• > head(sms_test_prob)
• For convenience during the evaluation process, it can be helpful to construct a data frame containing the predicted class values, actual class values, and the estimated probabilities of interest, as in the sketch below.
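• The following is a minimal sketch of this workflow, assuming a trained classifier sms_classifier, a test data frame sms_test, and a vector of actual labels sms_test_labels with levels "ham" and "spam" (these object names and the prob_spam column are hypothetical; the exact type argument depends on the classifier used):

# predicted class values and predicted probabilities for the test data
sms_test_pred <- predict(sms_classifier, sms_test, type = "class")
sms_test_prob <- predict(sms_classifier, sms_test, type = "prob")

# combine actual values, predicted values, and the estimated probability of spam
sms_results <- data.frame(
  actual_type  = sms_test_labels,
  predict_type = sms_test_pred,
  prob_spam    = sms_test_prob[, "spam"]
)

head(sms_results)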
Confusion Matrix
• A confusion matrix is a table that categorizes predictions according to whether they match the actual value. One of the table's dimensions indicates the possible categories of predicted values, while the other dimension indicates the same for actual values.
• Although we have only seen 2 x 2 confusion matrices so far, a matrix can be created for models that predict any number of class values. The following figure depicts the familiar confusion matrix for a two-class binary model as well as the 3 x 3 confusion matrix for a three-class model.
• When the predicted value is the same as the actual value, it is a correct classification. Correct predictions fall on the diagonal of the confusion matrix (denoted by O). The off-diagonal cells (denoted by X) indicate the cases where the predicted value differs from the actual value. These are incorrect predictions.
• The most common performance measures consider the model's ability to discern one class versus all others. The class of interest is known as the positive class, while all others are known as negative.
• The relationship between positive class and negative class predictions can be depicted as a 2 x 2 confusion matrix that tabulates whether predictions fall into one of four categories:
• True Positive (TP): Correctly classified as the class of interest
• True Negative (TN): Correctly classified as not the class of interest
• False Positive (FP): Incorrectly classified as the class of interest
• False Negative (FN): Incorrectly classified as not the class of interest

Using the confusion matrix to measure accuracy
• An easy way to tabulate a classifier's predictions into a confusion matrix is to use R's table() function. The command to create a confusion matrix for the SMS data is shown as follows. The counts in this table could then be used to calculate accuracy and other statistics:
• > table(sms_results$actual_type, sms_results$predict_type)
• If you would like to create a confusion matrix with more informative output, the CrossTable() function in the gmodels package offers a customizable solution. If the package is not already installed, you will need to do so using the install.packages("gmodels") command.
• By default, the CrossTable() output includes proportions in each cell that indicate the cell count as a percentage of the table's row, column, or overall total counts. The output also includes row and column totals. As shown in the following code, the syntax is similar to the table() function:
• > library(gmodels)
• > CrossTable(sms_results$actual_type, sms_results$predict_type)
• We can use the confusion matrix to obtain the accuracy and error rate. Since accuracy is (TP + TN) / (TP + TN + FP + FN), we can calculate it using the following command:
• > (152 + 1203) / (152 + 1203 + 4 + 31)
• [1] 0.9748201
• We can also calculate the error rate, (FP + FN) / (TP + TN + FP + FN), as:
• > (4 + 31) / (152 + 1203 + 4 + 31)
• [1] 0.02517986
• This is the same as one minus accuracy:
• > 1 - 0.9748201
• [1] 0.0251799
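• Rather than typing the four cell counts by hand, the accuracy and error rate can also be computed directly from the table object. This is a small sketch assuming the sms_results data frame built earlier; the diagonal of the table holds the correct predictions:

# cross-tabulate actual versus predicted class values
sms_confusion <- table(sms_results$actual_type, sms_results$predict_type)

# accuracy: correct predictions (the diagonal) divided by all predictions
accuracy <- sum(diag(sms_confusion)) / sum(sms_confusion)

# error rate: the remaining proportion, equal to one minus accuracy
error_rate <- 1 - accuracy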
Other Performance Measures
• The Classification and Regression Training package caret by Max Kuhn includes functions to compute many such performance measures. This package provides a large number of tools to prepare, train, evaluate, and visualize machine learning models and data.
• Before proceeding, you will need to install the package using the install.packages("caret") command.
• Because caret provides measures of model performance that consider the ability to classify the positive class, a positive parameter should be specified. In this case, since the SMS classifier is intended to detect spam, we will set positive = "spam" as follows:
• > library(caret)
• > confusionMatrix(sms_results$predict_type, sms_results$actual_type, positive = "spam")

The kappa statistic
• The kappa statistic adjusts accuracy by accounting for the possibility of a correct prediction by chance alone.
• Kappa values range from 0 to a maximum of 1, which indicates perfect agreement between the model's predictions and the true values. Values less than one indicate imperfect agreement. A common interpretation of kappa values is:
• Poor agreement = less than 0.20
• Fair agreement = 0.20 to 0.40
• Moderate agreement = 0.40 to 0.60
• Good agreement = 0.60 to 0.80
• Very good agreement = 0.80 to 1.00

Sensitivity and specificity
• The sensitivity of a model (also called the true positive rate) measures the proportion of positive examples that were correctly classified. Therefore, it is calculated as the number of true positives divided by the total number of positives, both those correctly classified (the true positives) and those incorrectly classified (the false negatives):
• sensitivity = TP / (TP + FN)
• The specificity of a model (also called the true negative rate) measures the proportion of negative examples that were correctly classified. As with sensitivity, this is computed as the number of true negatives divided by the total number of negatives, the true negatives plus the false positives:
• specificity = TN / (TN + FP)

Precision and recall
• The precision (also known as the positive predictive value) is defined as the proportion of examples predicted as positive that are truly positive; in other words, when a model predicts the positive class, how often is it correct? A precise model will only predict the positive class in cases that are very likely to be positive. It will be very trustworthy.
• precision = TP / (TP + FP)
• On the other hand, recall is a measure of how complete the results are. It is computed in the same way as sensitivity: the number of true positives divided by the total number of positives.
• recall = TP / (TP + FN)
• A model with high recall captures a large portion of the positive examples, meaning that it has wide breadth. For example, a search engine with high recall returns a large number of documents pertinent to the search query. Similarly, the SMS spam filter has high recall if the majority of spam messages are correctly identified.

The F-measure
• A measure of model performance that combines precision and recall into a single number is known as the F-measure (also sometimes called the F1 score or F-score). The F-measure combines precision and recall using the harmonic mean, a type of average that is used for rates of change. The harmonic mean is used rather than the common arithmetic mean since both precision and recall are expressed as proportions between zero and one, which can be interpreted as rates. The following is the formula for the F-measure:
• F-measure = (2 × precision × recall) / (recall + precision) = (2 × TP) / (2 × TP + FP + FN)
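• As a quick illustration, the following sketch computes these measures in base R from the four cell counts of the SMS confusion matrix used earlier, taking 152 as the true positives (spam correctly caught), 1203 as the true negatives, 4 as the false positives, and 31 as the false negatives. The chance-agreement formula is the standard Cohen's kappa definition, which the slides do not show; the confusionMatrix() output from caret reports many of the same statistics.

# counts from the SMS confusion matrix, with spam as the positive class
TP <- 152; TN <- 1203; FP <- 4; FN <- 31
n  <- TP + TN + FP + FN

# kappa: observed agreement adjusted for the agreement expected by chance
pr_a  <- (TP + TN) / n
pr_e  <- ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / n^2
kappa <- (pr_a - pr_e) / (1 - pr_e)

sensitivity <- TP / (TP + FN)   # true positive rate (identical to recall)
specificity <- TN / (TN + FP)   # true negative rate
precision   <- TP / (TP + FP)   # positive predictive value
recall      <- sensitivity
f_measure   <- (2 * precision * recall) / (precision + recall)

c(kappa = kappa, sensitivity = sensitivity, specificity = specificity,
  precision = precision, f_measure = f_measure)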