Model evaluation is the process of using metrics to analyze how well a model performs. Think of training a model like teaching a student: model evaluation is like giving them a test to see whether they truly learned the subject or just memorized the answers. It helps us answer:
- Did the model learn patterns?
- Will it fail on new questions?
Model development is a multi-step process, and we need to keep checking how well the model predicts on future data and analyze the model's weaknesses. There are many metrics for that. Cross-validation is one model evaluation technique that is applied during the training phase.
Cross-Validation: The Ultimate Practice Test
Cross-validation is a method in which we do not use the whole dataset for training at once. Instead, some part of the dataset is reserved for testing the model. There are many types of cross-validation, of which K-fold cross-validation is the most widely used. In K-fold cross-validation the original dataset is divided into k subsets, known as folds. Training and testing are repeated k times: each time one fold is used for testing and the remaining k-1 folds are used for training, so every fold serves as the test set exactly once. Averaging the k scores gives a more reliable estimate of how well the model generalizes than a single train/test split.
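The fold rotation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the helper name `k_fold_indices` is hypothetical, and the folds are contiguous with no shuffling, which real data usually needs.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = (list(range(0, start))
                     + list(range(start + size, n_samples)))
        yield train_idx, test_idx
        start += size


# Each sample lands in the test fold exactly once across the k rounds.
splits = list(k_fold_indices(n_samples=10, k=3))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: test={test_idx} train={train_idx}")
```

In practice, `KFold` and `cross_val_score` from `sklearn.model_selection` take care of the splitting (with optional shuffling) and the per-fold scoring.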
Holdout is the simplest approach. It is used with neural networks as well as many other classifiers. In this technique the dataset is divided into train and test datasets, usually in a ratio such as 70:30 or 80:20. Normally a large percentage of the data is used for training the model and a small portion is used for testing it.
Evaluation Metrics for Classification Task
Classification is used to categorize data into predefined labels or classes. To evaluate the performance of a classification model we commonly use metrics such as accuracy, precision, recall, F1 score and the confusion matrix. These metrics are useful in assessing how well the model distinguishes between classes, especially in cases of imbalanced datasets. By understanding the strengths and weaknesses of each metric, we can select the most appropriate one for a given classification problem.
In this Python code, we have imported the iris dataset which has features like the length and width of sepals and petals. The target values are Iris setosa, Iris virginica, and Iris versicolor. After importing the dataset we divided the dataset into train and test datasets in the ratio 80:20. Then we called Decision Trees and trained our model. After that, we performed the prediction and calculated the accuracy score, precision, recall, and f1 score. We also plotted the confusion matrix.
Importing Libraries and Dataset
Python
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, accuracy_score)
Now let's load the iris flowers toy dataset from sklearn.datasets and then split it into training and testing parts (for model evaluation) in an 80:20 ratio.
Python
iris = load_iris()
X = iris.data
y = iris.target
# Holdout method: divide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=20, test_size=0.20)
Now, let's train a Decision Tree Classifier model on the training data, and then we will move on to the evaluation part of the model using different metrics.
Python
# Named clf so that it does not clash with the sklearn.tree module name
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
1. Accuracy
Accuracy is defined as the ratio of number of correct predictions to the total number of predictions. This is the most fundamental metric used to evaluate the model. The formula is given by:
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
However, accuracy has a drawback: it can be misleading on an imbalanced dataset. Suppose a model predicts the majority class for almost every sample. It achieves high accuracy, yet it rarely identifies the minority class correctly, so its real performance is poor.
Python
print("Accuracy:", accuracy_score(y_test, y_pred))
Output:
Accuracy: 0.9333333333333333
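The weakness of accuracy on imbalanced data can be seen in a small sketch. The labels below are hypothetical (95 negatives and 5 positives, not the iris data), and the "model" simply predicts the majority class every time.

```python
# Hypothetical imbalanced test set: 95 negatives (0) and 5 positives (1).
y_true_imb = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred_imb = [0] * 100

correct = sum(t == p for t, p in zip(y_true_imb, y_pred_imb))
accuracy_imb = correct / len(y_true_imb)

# Share of the actual positives that the model managed to find.
positives_found = sum(t == 1 and p == 1
                      for t, p in zip(y_true_imb, y_pred_imb))
minority_recall = positives_found / sum(y_true_imb)

print("Accuracy:", accuracy_imb)
print("Recall on minority class:", minority_recall)
```

The accuracy is 0.95 even though not a single positive sample is detected, which is why the metrics in the following sections are usually reported alongside it.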
2. Precision and Recall
Precision is the ratio of true positives to the sum of true positives and false positives. It measures how many of the predicted positives are actually positive.
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
The drawback of precision is that it ignores false negatives (and true negatives), so a model can score high precision while missing many actual positives.
Recall is the ratio of true positives to the sum of true positives and false negatives. It measures how many of the actual positives the model correctly identifies.
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
The drawback of recall is that it ignores false positives, so maximizing recall alone often leads to a higher false positive rate.
Python
print("Precision:", precision_score(y_test, y_pred,
                                    average="weighted"))
print("Recall:", recall_score(y_test, y_pred,
                              average="weighted"))
Output:
Precision: 0.9435897435897436
Recall: 0.9333333333333333
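To make the two formulas concrete, here is a minimal sketch that counts TP, FP and FN by hand for a binary problem. The labels are hypothetical and are not taken from the iris run above.

```python
# Hypothetical binary labels (1 = positive class).
actual_bin    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_bin = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(actual_bin, predicted_bin))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # correctly flagged positives
fp = sum(a == 0 and p == 1 for a, p in pairs)  # negatives flagged as positive
fn = sum(a == 1 and p == 0 for a, p in pairs)  # positives that were missed

precision_bin = tp / (tp + fp)
recall_bin = tp / (tp + fn)

print("TP, FP, FN:", tp, fp, fn)
print("Precision:", precision_bin)
print("Recall:", recall_bin)
```

Here 3 of the 5 predicted positives are correct (precision 0.6), while 3 of the 4 actual positives are found (recall 0.75).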
3. F1 score
F1 score is the harmonic mean of precision and recall. In the precision-recall trade-off, increasing precision typically decreases recall and vice versa. The F1 score combines the two into a single number that is high only when both are high.
\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
Python
# calculating f1 score
print("F1 score:", f1_score(y_test, y_pred,
                            average="weighted"))
Output:
F1 score: 0.9327777777777778
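The effect of using a harmonic mean rather than an arithmetic mean can be sketched with a few hypothetical precision and recall values. The helper name `f1_from` is an assumption made for this illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced_f1 = f1_from(0.8, 0.8)   # both metrics equally good
lopsided_f1 = f1_from(1.0, 0.1)   # perfect precision, very poor recall
arithmetic_mean = (1.0 + 0.1) / 2

print("Balanced F1:", round(balanced_f1, 4))
print("Lopsided F1:", round(lopsided_f1, 4))
print("Arithmetic mean of the lopsided pair:", arithmetic_mean)
```

The lopsided pair averages to 0.55 arithmetically but its F1 is only about 0.18, so one weak metric pulls the score down sharply. Note also that with average="weighted", scikit-learn computes F1 per class and then averages the per-class values, so the reported F1 need not equal the harmonic mean of the weighted precision and weighted recall.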
4. Confusion Matrix
A confusion matrix is an N x N matrix, where N is the number of target classes. It tabulates the actual classes against the predicted classes. For the binary case, the terminology is as follows:
- True Positives (TP): cases in which both the actual and the predicted values are YES.
- True Negatives (TN): cases in which both the actual and the predicted values are NO.
- False Positives (FP): cases in which the actual value is NO but the predicted value is YES.
- False Negatives (FN): cases in which the actual value is YES but the predicted value is NO.
Python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
cm_display = ConfusionMatrixDisplay(confusion_matrix=cm,
                                    display_labels=[0, 1, 2])
cm_display.plot()
plt.show()
Output:
Confusion matrix for the output of the model
In the output, the accuracy of the model is 93.33%. Precision is approximately 0.944 and recall is 0.933. F1 score is approximately 0.933. Finally, the confusion matrix is plotted. Here the class labels denote the target classes:
0 = Setosa
1 = Versicolor
2 = Virginica
From the confusion matrix, we see that 8 setosa and 11 versicolor test cases were correctly predicted. Of the virginica test cases, 2 were misclassified and the remaining 9 were correctly predicted.
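The per-class counts can be turned back into the weighted scores with a short sketch. The matrix below is rebuilt from the counts quoted in the text, under the assumption that the two misclassified virginica samples were predicted as versicolor (the plot itself is not reproduced here); that assumption is consistent with the precision, recall and accuracy printed earlier.

```python
# Assumed confusion matrix: rows = actual class, columns = predicted class.
cm_assumed = [
    [8, 0, 0],    # actual setosa
    [0, 11, 0],   # actual versicolor
    [0, 2, 9],    # actual virginica (2 assumed predicted as versicolor)
]
n_classes = len(cm_assumed)
total = sum(sum(row) for row in cm_assumed)

weighted_precision = 0.0
weighted_recall = 0.0
for c in range(n_classes):
    tp_c = cm_assumed[c][c]
    fp_c = sum(cm_assumed[r][c] for r in range(n_classes)) - tp_c  # column minus diagonal
    fn_c = sum(cm_assumed[c]) - tp_c                               # row minus diagonal
    support = sum(cm_assumed[c])                                   # actual samples of class c
    weighted_precision += (tp_c / (tp_c + fp_c)) * support / total
    weighted_recall += (tp_c / (tp_c + fn_c)) * support / total

accuracy_cm = sum(cm_assumed[c][c] for c in range(n_classes)) / total

print("Weighted precision:", round(weighted_precision, 4))
print("Weighted recall:", round(weighted_recall, 4))
print("Accuracy:", round(accuracy_cm, 4))
```

For a multi-class problem, each class is treated in turn as the "positive" class: its column gives the false positives and its row gives the false negatives.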
5. AUC-ROC Curve
AUC (Area Under the Curve) is an evaluation metric that summarizes a classification model's performance across all threshold values. The Receiver Operating Characteristic (ROC) curve plots the model's performance as the decision threshold varies. The curve is defined by two quantities:
- TPR: It stands for True Positive Rate. It is the same as recall, TP / (TP + FN).
- FPR: It stands for False Positive Rate. It is defined as the ratio of false positives to the sum of false positives and true negatives.
This curve is useful as it helps us to determine the model's capacity to distinguish between different classes. Let us illustrate this with the help of a simple Python example.
Python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 0, 1]
# Predicted scores (probabilities) for the positive class
y_score = [1, 0, 0.9, 0.2]
auc = np.round(roc_auc_score(y_true, y_score), 3)
print("Auc", auc)
Output:
Auc 0.75
The AUC score is a useful metric for evaluating the model because it highlights the model's capacity to separate the classes. An AUC of 0.5 corresponds to random guessing and 1 to perfect separation, so the closer the score is to 1, the better. In the above code, 0.75 is a reasonably good AUC score.
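One way to see where the 0.75 comes from: AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below recomputes it that way for the same four samples; the variable names are chosen for this illustration only.

```python
labels = [1, 0, 0, 1]
scores = [1, 0, 0.9, 0.2]

pos_scores = [s for t, s in zip(labels, scores) if t == 1]
neg_scores = [s for t, s in zip(labels, scores) if t == 0]

# A pair counts 1 if the positive is ranked higher, 0.5 on a tie, else 0.
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p in pos_scores for n in neg_scores)
auc_pairwise = wins / (len(pos_scores) * len(neg_scores))

print("Pairwise AUC:", auc_pairwise)
```

Three of the four pairs are ordered correctly; only the positive scored 0.2 against the negative scored 0.9 is ranked the wrong way round.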
Evaluation Metrics for Regression Task
Regression is used to predict continuous values. It is mostly used to find a relation between a dependent variable and one or more independent variables. For classification we use a confusion matrix, accuracy, F1 score, etc. In regression, however, we are predicting a numerical value that may differ from the actual output, so we measure the error, which summarizes how close the predictions are to the actual values. There are many metrics available for evaluating a regression model.
In this Python Code we have implemented a simple regression model using the Mumbai weather CSV file. This file comprises Day, Hour, Temperature, Relative Humidity, Wind Speed and Wind Direction. The link for the dataset is here.
We are interested in finding the relationship between Temperature and Relative Humidity. Here Relative Humidity is the dependent variable and Temperature is the independent variable. We perform linear regression and use different metrics to evaluate the performance of our model. To calculate the metrics we make extensive use of the sklearn library.
Python
# importing the libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error)
Now let's load the data into a pandas DataFrame and then split it into training and testing parts (for model evaluation) in an 80:20 ratio.
Python
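df = pd.read_csv('weather.csv')
# Assuming the column order listed above: column 2 is Temperature
# (feature) and column 3 is Relative Humidity (target)
X = df.iloc[:, 2].values
Y = df.iloc[:, 3].values
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=0)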
Now, let's train a simple linear regression model on the training data, and then we will move on to the evaluation part of the model using different metrics.
Python
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
regression = LinearRegression()
regression.fit(X_train, Y_train)
Y_pred = regression.predict(X_test)
1. Mean Absolute Error (MAE)
This is the simplest metric used to analyze the loss over the whole dataset. The error is the difference between the predicted and actual values, and MAE is the average of the absolute errors: we take the modulus of each error, sum them, and divide the result by the total number of data points. It is always non-negative. The formula of MAE is given by
\text{MAE} = \frac{\sum_{i=1}^{N} |\text{y}_{\text{pred}} - \text{y}_{\text{actual}}|}{N}
Python
mae = mean_absolute_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Absolute Error", mae)
Output:
Mean Absolute Error 1.7236295632503873
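A small worked example shows why the modulus matters. The humidity values below are hypothetical and are not taken from the weather file.

```python
# Hypothetical actual and predicted values.
y_actual = [60.0, 62.0, 65.0, 70.0]
y_hat = [61.0, 60.0, 65.5, 73.0]

signed_errors = [p - a for p, a in zip(y_hat, y_actual)]   # 1, -2, 0.5, 3
mean_signed_error = sum(signed_errors) / len(signed_errors)

# Taking absolute values stops positive and negative errors cancelling.
mae_by_hand = sum(abs(e) for e in signed_errors) / len(signed_errors)

print("Mean signed error:", mean_signed_error)
print("MAE:", mae_by_hand)
```

The signed errors partly cancel and average to 0.625, while the MAE of 1.625 reflects the typical size of a miss regardless of its direction.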
2. Mean Squared Error (MSE)
Mean Squared Error (MSE) is one of the most commonly used metrics, and it is also widely used as a loss function. We find the difference between the predicted and actual values, square it, and then average over all data points in the dataset. MSE is never negative because the errors are squared. The smaller the value of MSE, the better the performance of our model. The formula of MSE is given by:
\text{MSE} = \frac{\sum_{i=1}^{N} (\text{y}_{\text{pred}} - \text{y}_{\text{actual}})^2}{N}
Python
mse = mean_squared_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Square Error", mse)
Output:
Mean Square Error 3.9808057060106954
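Because the errors are squared, MSE penalizes a few large errors more heavily than many small ones. The sketch below compares two hypothetical sets of predictions that have the same MAE; the helper names `mean_abs_err` and `mean_sq_err` are assumptions made for this illustration.

```python
def mean_abs_err(actual, predicted):
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def mean_sq_err(actual, predicted):
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


targets = [10.0, 10.0, 10.0, 10.0]
even_errors = [12.0, 8.0, 12.0, 8.0]     # four errors of size 2
one_outlier = [10.0, 10.0, 10.0, 18.0]   # a single error of size 8

for name, preds in [("even errors", even_errors),
                    ("one outlier", one_outlier)]:
    print(name,
          "MAE:", mean_abs_err(targets, preds),
          "MSE:", mean_sq_err(targets, preds))
```

Both prediction sets have an MAE of 2.0, but the MSE rises from 4.0 to 16.0 when the same total error is concentrated in one point.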
3. Root Mean Squared Error (RMSE)
RMSE is a popular metric and is an extension of MSE: it is the square root of the MSE, which brings the error back to the same units as the target variable. It indicates how much the data points are spread around the best-fit line, much like a standard deviation of the prediction errors. A lower value means that the data points lie closer to the best-fit line.
\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (\text{y}_{\text{pred}} - \text{y}_{\text{actual}})^2}{N}}
Python
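# RMSE is the square root of MSE. The squared=False argument of
# mean_squared_error is deprecated in recent scikit-learn releases,
# so the root is taken explicitly here.
rmse = np.sqrt(mean_squared_error(y_true=Y_test,
                                  y_pred=Y_pred))
print("Root Mean Square Error", rmse)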
Output:
Root Mean Square Error 1.9951956560725306
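RMSE is often read as a standard deviation of the residuals, but the two coincide only when the residuals average to zero. The following sketch, using hypothetical values with a systematic over-prediction, shows the difference.

```python
import math
import statistics

# Hypothetical values; every prediction is too high, so the model is biased.
obs = [50.0, 55.0, 60.0, 65.0]
preds = [52.0, 58.0, 61.0, 67.0]

residuals = [p - a for p, a in zip(preds, obs)]             # 2, 3, 1, 2
mse_value = sum(r * r for r in residuals) / len(residuals)  # 4.5
rmse_value = math.sqrt(mse_value)

bias = sum(residuals) / len(residuals)        # mean residual
residual_std = statistics.pstdev(residuals)   # spread around that mean

print("RMSE:", round(rmse_value, 4))
print("Residual std:", round(residual_std, 4))
print("Bias:", bias)
```

Here the MSE of 4.5 splits into the squared bias (4.0) plus the residual variance (0.5), so the RMSE of about 2.12 is much larger than the residual standard deviation of about 0.71.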
4. Mean Absolute Percentage Error (MAPE)
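MAPE expresses the error as a percentage. For each data point we take the absolute difference between the actual and predicted values and divide it by the absolute actual value. These ratios are then summed and averaged over all data points. The smaller the percentage, the better the performance of the model. Note that scikit-learn's mean_absolute_percentage_error returns this value as a fraction rather than multiplied by 100, so the output below corresponds to roughly 2.33%. The formula is given by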
\text{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{|\text{y}_{\text{pred}} - \text{y}_{\text{actual}}|}{|\text{y}_{\text{actual}}|} \right) \times 100 \%
Python
# scikit-learn returns MAPE as a fraction, not as a percentage
mape = mean_absolute_percentage_error(Y_test, Y_pred)
print("Mean Absolute Percentage Error", mape)
Output:
Mean Absolute Percentage Error 0.02334408993333347
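The relationship between the fraction and the percentage form can be sketched by hand. The values below are hypothetical, and the calculation assumes no actual value is zero, since the ratio is undefined in that case.

```python
# Hypothetical actual and predicted values (no actual value is zero).
true_vals = [80.0, 100.0, 50.0]
pred_vals = [76.0, 105.0, 51.0]

# Per-point relative errors: 4/80, 5/100, 1/50.
ratios = [abs(p - a) / abs(a) for a, p in zip(true_vals, pred_vals)]

mape_fraction = sum(ratios) / len(ratios)
mape_percent = 100 * mape_fraction

print("MAPE as a fraction:", round(mape_fraction, 4))
print("MAPE as a percentage:", round(mape_percent, 2))
```

The fraction 0.04 and the percentage 4.0 describe the same error; which one is printed depends only on whether the result is multiplied by 100.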
Evaluating machine learning models is an important step in ensuring their effectiveness and reliability in real-world applications. Appropriate metrics, such as accuracy, precision, recall and F1 score for classification and regression-specific measures like MAE, MSE, RMSE and MAPE, let us assess model performance for different tasks. Moreover, evaluation techniques like cross-validation and holdout help check that models generalize well to unseen data.