XII AI Unit I Capstone Project Part II

4. How to validate model quality


There are mainly two validation methods for model quality:
(i) Train-Test Split Evaluation
(ii) Cross Validation

(i) Train-Test Split Evaluation


• The train-test split is a technique for evaluating the performance of a machine learning model.
• The objective is to estimate the performance of the machine learning model on new data: data not used to train the model.
• It can be used for classification or regression problems, with any supervised learning algorithm.
• The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit/train the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
i.e. Train Dataset: Used to fit/train the machine learning model.
Test Dataset: Used to evaluate the fitted machine learning model.
• The train-test procedure is appropriate when there is a sufficiently large dataset available.

How to Configure the Train-Test Split: -


• The procedure has one main configuration parameter, which is the size of the train and test sets. This is most commonly expressed as a fraction between 0 and 1 for either the train or test dataset. For example, a training set with a size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set.
• There is no optimal split percentage.

• Common split percentages include:


(i) Train: 80%, Test: 20%
(ii) Train: 67%, Test: 33%
(iii) Train: 50%, Test: 50%
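In scikit-learn, these percentages are passed to the train_test_split() function as a fraction, via the test_size (or train_size) parameter. A minimal sketch, using made-up data of 10 samples (the data here is an illustrative assumption):

from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with one feature each
X = [[i] for i in range(10)]
y = list(range(10))

# (i) Train: 80%, Test: 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
# (ii) Train: 67%, Test: 33%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# (iii) Train: 50%, Test: 50%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50)

print(len(X_train), len(X_test))   # 5 5 for the 50/50 split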
Training and Testing Data in Python Machine Learning
• As we work with datasets, a machine learning model works in two stages: training and testing. In Python ML we usually split the data around 80%-20% between training data and test data.
• Scikit-learn is a Python library which is used to implement machine learning models.
• Along with scikit-learn, we will use a few more libraries like numpy, pandas and matplotlib.
• train_test_split() – a common function used to split the dataset into training data and testing data, defined in sklearn.model_selection.
• test_size=0.3 – suggests that the test data should be 30% of the dataset and the rest should be train data.

• X_test.shape – It tells how many rows and columns are in the test data. (shape is an attribute, not a function, so it is written without parentheses.)

• We can install the above libraries as follows:


1. pip install pandas
2. pip install scikit-learn (for sklearn library in which train_test_split() function is defined)

• We use pandas to import the dataset into the program and sklearn.model_selection for the train_test_split() function to perform the splitting. We can import these libraries as follows:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split

• Now load the dataset from the csv file "car.csv" (which we have already created) using the read_csv() function.
(read_csv('car.csv') is used to read the data from the csv file)
>>> df = pd.read_csv("car.csv")
>>> df
   Distance  Year  Price
0      1500     5  50000
1      1600     3  45000
2      1000     1  70000
3      2000     2  60000
4      4000     7  35000
5      8000     9  20000
• Now extract the data of the dependent (like Price) and independent (like Distance and Year) variables.
>>>X=df[['Distance','Year']]
>>>y=df['Price']

• Now display the values of X and y:


>>>X
   Distance  Year
0      1500     5
1      1600     3
2      1000     1
3      2000     2
4      4000     7
5      8000     9
>>>y
0    50000
1    45000
2    70000
3    60000
4    35000
5    20000

• Now display the shape (i.e. the number of rows and columns) of variables X and y.
>>>X.shape
(6, 2)
>>>y.shape
(6,)

• Now split the data into training data and testing data:
>>>X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3)
>>>X_train
   Distance  Year
3      2000     2
5      8000     9
0      1500     5
2      1000     1
>>>X_test
   Distance  Year
1      1600     3
4      4000     7
>>>y_train
3    60000
5    20000
0    50000
2    70000
>>>y_test
1    45000
4    35000
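• Note that train_test_split() shuffles the rows randomly, so the exact rows that land in the train and test sets may differ on each run. As an aside (random_state is an optional parameter of train_test_split(); the seed value 42 below is arbitrary), passing it makes the split reproducible:
>>>X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)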

(ii) Cross Validation Procedure


• Cross validation is a resampling technique for evaluating machine learning models on a small dataset.
• In cross validation we run our modeling process on different subsets of the data to get multiple measures of model quality.
• This process has only one parameter, k, which specifies the number of groups (folds) into which a given dataset should be divided. This process is therefore frequently known as k-fold cross validation.
• For example, we could have 5 folds or experiments: we divide the data into 5 pieces, each being 20% of the full dataset. Since the data is divided into five groups here, k=5, so it is called 5-fold cross validation.
• The following is the general procedure of cross validation (a code sketch follows these points):
We run an experiment called experiment 1, which uses the first fold as a holdout set (test data) and everything else as training data. This gives us a measure of model quality based on a 20% holdout set. We then run a second experiment, where we hold out data from the second fold (using everything except the 2nd fold for training the model). This gives us a second estimate of model quality. We repeat this process, using every fold once as the holdout (test data). Putting this together, 100% of the data is used as a holdout (test data) at some point.

• It is a popular strategy because it is straightforward to grasp and produces a less biased or less optimistic estimate of model quality than other approaches such as the train-test split.
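In scikit-learn, this loop is automated by the cross_val_score() function. The following is a minimal sketch, assuming the same car.csv data as above and a LinearRegression model (the choice of model is an assumption for illustration, not specified in the text):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("car.csv")
X = df[['Distance','Year']]
y = df['Price']

model = LinearRegression()
# cv=5 gives 5-fold cross validation: each fold serves as the holdout set exactly once
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
print("MSE per fold:", -scores)         # one model-quality measure per experiment
print("Average MSE:", (-scores).mean())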

Train-Test Split v/s Cross Validation

• The train-test split procedure is appropriate when there is a sufficiently large dataset available. It will run faster, and you may have enough data for both training and testing. Cross validation, on the other hand, should be used if your dataset is small.

• Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it can take more time to run, because it trains and evaluates a model once for each fold, so it does more total work.

• On small datasets, the extra computational burden of running cross-validation isn't a big deal. These are also the problems where train-test split model quality scores would be least reliable. So, if your dataset is small, you should run cross validation.

• If your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation. If your model takes much longer to run, cross-validation may slow down your workflow more than it's worth.
6. Metrics of model quality by simple Math
• There are standard measures that we can use to summarize how good a set of predictions actually is.
• You must estimate the quality of a set of predictions when training a machine learning model.
• Performance metrics like classification accuracy and Root Mean Squared Error (RMSE) can give you a clear, objective idea of how good a set of predictions is, and in turn how good the model is that generated them.
• All the algorithms in machine learning rely on minimizing or maximizing a function, which we call the "objective function". The group of functions that are minimized are called "loss functions". A loss function is a measure of how well a prediction model does in terms of being able to predict the expected outcome.

• Loss functions can be broadly categorized into 2 types: classification loss and regression loss.
• Regression functions predict a quantity, and classification functions predict a label.

Important loss functions: -


➢ MSE- Mean Squared Error
➢ RMSE – Root Mean Squared Error
➢ MAPE – Mean Absolute Percent Error

• MSE – Mean Squared Error


MSE, i.e. Mean Squared Error, is the most commonly used regression loss function.
Calculate the difference between the model's predictions and the actual values, square it, and find the average across the entire dataset to get the value of MSE.
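In symbols (a standard formulation, with $y_i$ the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of samples):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$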

Note: - MSE will never be negative because the errors are always squared.
Let's take an example in terms of a graph to understand MSE/RMSE:

[Figure: two plots of Rain against Day]

For calculating MSE in Python, we have the mean_squared_error() function defined in sklearn.metrics.

from sklearn.metrics import mean_squared_error

y_true = [1, 2, 2, 2, 4]                # list of actual values
y_pred = [0.6, 1.29, 1.99, 2.69, 3.4]   # list of predicted values

print("MSE =", mean_squared_error(y_true, y_pred))  # calculate and print MSE

Output: - MSE = 0.30006
• RMSE – Root Mean Squared Error
RMSE is one of the methods to determine the accuracy of our model in predicting the target values. In machine learning, when we want to look at the accuracy of our model, we take the root mean square of the error that has occurred between the test values and the predicted values. Mathematically:
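Using the same notation as for MSE above:

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$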

Let's see some graphical examples:

• In this scatter graph, the red dots are the actual values and the blue line is the set of predicted values drawn by our model.
• Here X represents the distance between an actual value and the predicted line; this distance represents the error. Similarly, we can draw straight lines from each red dot to the blue line. Squaring those distances, taking the mean of all of them, and finally taking the root gives us the RMSE of our model.

Note: - RMSE is measured in the same units as the target variable, so what counts as a "good" value depends on the scale of the data; for data on the scale of this example, a good model should have an RMSE value of less than 180. In case you have a higher RMSE value, it would mean that you probably need to change your features or tweak your hyperparameters.
(Hyperparameters are the parameters whose values govern the learning process)
Steps to Calculate MSE/RMSE: -
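The steps can be written out in plain Python. A minimal sketch, reusing the y_true and y_pred lists from the MSE example above:

import math

y_true = [1, 2, 2, 2, 4]                # actual values
y_pred = [0.6, 1.29, 1.99, 2.69, 3.4]   # predicted values

# Step 1: find the error for each prediction (actual - predicted)
errors = [t - p for t, p in zip(y_true, y_pred)]
# Step 2: square each error so positive and negative errors do not cancel
squared_errors = [e ** 2 for e in errors]
# Step 3: average the squared errors to get MSE
mse = sum(squared_errors) / len(squared_errors)
# Step 4: take the square root of MSE to get RMSE
rmse = math.sqrt(mse)

print("MSE =", mse)     # 0.30006
print("RMSE =", rmse)   # about 0.548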
