XII AI Unit I Capstone Project Part II

4. How to validate model quality


There are mainly two validation methods for model quality:
(i) Train-Test Split Evaluation
(ii) Cross Validation

(i) Train-Test Split Evaluation


• The train-test split is a technique for evaluating the performance of a machine learning model.
• The objective is to estimate the performance of the machine learning model on new data: data not used to train the model.
• It can be used for classification or regression problems, with any supervised learning algorithm.
• The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit/train the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
i.e. Train Dataset: Used to fit/train the machine learning model.
Test Dataset: Used to evaluate the fitted machine learning model.
• The train-test procedure is appropriate when there is a sufficiently large dataset available.

How to Configure the Train-Test Split: -


• The procedure has one main configuration parameter, which is the size of the train and test sets. This is most commonly expressed as a fraction between 0 and 1 for either the train or test dataset. For example, a training set with a size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set.
• There is no optimal split percentage.

• Common split percentages include:


(i) Train: 80%, Test: 20%
(ii) Train: 67%, Test: 33%
(iii) Train: 50%, Test: 50%
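In scikit-learn, these percentages are passed to the train_test_split() function as a fraction, via the test_size (or train_size) parameter. A minimal sketch, using made-up data of 10 samples (the data here is an illustrative assumption):

from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with one feature each
X = [[i] for i in range(10)]
y = list(range(10))

# (i) Train: 80%, Test: 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
# (ii) Train: 67%, Test: 33%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# (iii) Train: 50%, Test: 50%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50)

print(len(X_train), len(X_test))   # 5 5 for the 50/50 split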
Training and Testing Data in Python Machine Learning
• As we work with datasets, a machine learning model works in two stages: training and testing. In Python ML we usually split the data around 80%-20% between training data and test data.
• Scikit-learn is a Python library which is used to implement machine learning models.
• Along with scikit-learn, we will use a few more libraries like numpy, pandas and matplotlib.
• train_test_split() – a common function used to split the dataset into training data and testing data, defined in sklearn.model_selection.
• test_size=0.3 – suggests that the test data should be 30% of the dataset and the rest should be train data.

• X_test.shape – It tells how many rows and columns are in the test data. (shape is an attribute, not a function, so it is written without parentheses.)

• We can install the above libraries as follows:


1. pip install pandas
2. pip install scikit-learn (for sklearn library in which train_test_split() function is defined)

• We use pandas to import the dataset into the program and sklearn.model_selection for the train_test_split() function to perform the splitting. We can import these libraries as follows:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split

• Now load the dataset from the csv file "car.csv" (which we have already created) using the read_csv() function.
(read_csv('car.csv') is used to read the data from the csv file)
>>> df = pd.read_csv("car.csv")
>>> df
   Distance  Year  Price
0      1500     5  50000
1      1600     3  45000
2      1000     1  70000
3      2000     2  60000
4      4000     7  35000
5      8000     9  20000
• Now extract the data of the dependent (like Price) and independent (like Distance and Year) variables.
>>>X=df[['Distance','Year']]
>>>y=df['Price']

• Now display the values of X and y:


>>>X
   Distance  Year
0      1500     5
1      1600     3
2      1000     1
3      2000     2
4      4000     7
5      8000     9
>>>y
0    50000
1    45000
2    70000
3    60000
4    35000
5    20000

• Now display the shape (i.e. the number of rows and columns) of variables X and y.
>>>X.shape
(6, 2)
>>>y.shape
(6,)

• Now split the data into training data and testing data:
>>>X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3)
>>>X_train
   Distance  Year
3      2000     2
5      8000     9
0      1500     5
2      1000     1
>>>X_test
   Distance  Year
1      1600     3
4      4000     7
>>>y_train
3    60000
5    20000
0    50000
2    70000
>>>y_test
1    45000
4    35000
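• Note that train_test_split() shuffles the rows randomly, so the exact rows that land in the train and test sets may differ on each run. As an aside (random_state is an optional parameter of train_test_split(); the seed value 42 below is arbitrary), passing it makes the split reproducible:
>>>X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)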

(ii) Cross Validation Procedure


• Cross validation is a resampling technique for evaluating machine learning models on a small dataset.
• In cross validation we run our modeling process on different subsets of the data to get multiple measures of model quality.
• This process has only one parameter, k, which specifies the number of groups (folds) into which a given dataset should be divided. This process is therefore frequently known as k-fold cross validation.
• For example, we could have 5 folds or experiments: we divide the data into 5 pieces, each being 20% of the full dataset. Since the data is divided into five groups here, k=5, so it is called 5-fold cross validation.
• The following is the general procedure of cross validation (a code sketch follows these points):
We run an experiment called experiment 1, which uses the first fold as a holdout set (test data) and everything else as training data. This gives us a measure of model quality based on a 20% holdout set. We then run a second experiment, where we hold out data from the second fold (using everything except the 2nd fold for training the model). This gives us a second estimate of model quality. We repeat this process, using every fold once as the holdout (test data). Putting this together, 100% of the data is used as a holdout (test data) at some point.

• It is a popular strategy because it is straightforward to grasp and produces a less biased or less optimistic estimate of model quality than other approaches such as the train-test split.
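In scikit-learn, this loop is automated by the cross_val_score() function. The following is a minimal sketch, assuming the same car.csv data as above and a LinearRegression model (the choice of model is an assumption for illustration, not specified in the text):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("car.csv")
X = df[['Distance','Year']]
y = df['Price']

model = LinearRegression()
# cv=5 gives 5-fold cross validation: each fold serves as the holdout set exactly once
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
print("MSE per fold:", -scores)         # one model-quality measure per experiment
print("Average MSE:", (-scores).mean())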

Train-Test Split v/s Cross Validation

• The train-test split procedure is appropriate when there is a sufficiently large dataset available. It will run faster, and you may have enough data for both training and testing. Cross validation, on the other hand, should be used if your dataset is small.

• Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it can take more time to run, because it trains and evaluates a model once for each fold, so it does more total work.

• On small datasets, the extra computational burden of running cross-validation isn't a big deal. These are also the problems where train-test split model quality scores would be least reliable. So, if your dataset is small, you should run cross validation.

• If your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation. If your model takes much longer to run, cross-validation may slow down your workflow more than it's worth.
6. Metrics of model quality by simple Math
• There are standard measures that we can use to summarize how good a set of predictions actually is.
• You must estimate the quality of a set of predictions when training a machine learning model.
• Performance metrics like classification accuracy and Root Mean Squared Error (RMSE) can give you a clear, objective idea of how good a set of predictions is, and in turn how good the model is that generated them.
• All the algorithms in machine learning rely on minimizing or maximizing a function, which we call the "objective function". The group of functions that are minimized are called "loss functions". A loss function is a measure of how well a prediction model does in terms of being able to predict the expected outcome.

• Loss functions can be broadly categorized into 2 types: classification loss and regression loss.
• Regression functions predict a quantity, and classification functions predict a label.

Important loss functions: -


➢ MSE- Mean Squared Error
➢ RMSE – Root Mean Squared Error
➢ MAPE – Mean Absolute Percent Error

• MSE – Mean Squared Error


MSE, i.e. Mean Squared Error, is the most commonly used regression loss function.
Calculate the difference between the model's predictions and the actual values, square it, and find the average across the entire dataset to get the value of MSE.
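In symbols (a standard formulation, with $y_i$ the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of samples):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$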

Note: - MSE will never be negative because the errors are always squared.
Let's take an example in terms of a graph to understand MSE/RMSE:

[Figure: two plots of Rain against Day]

For calculating MSE in Python, we have the mean_squared_error() function defined in sklearn.metrics.

from sklearn.metrics import mean_squared_error

y_true = [1, 2, 2, 2, 4]                # list of actual values
y_pred = [0.6, 1.29, 1.99, 2.69, 3.4]   # list of predicted values

print("MSE =", mean_squared_error(y_true, y_pred))  # calculate and print MSE

Output: - MSE = 0.30006
• RMSE – Root Mean Squared Error
RMSE is one of the methods to determine the accuracy of our model in predicting the target values. In machine learning, when we want to look at the accuracy of our model, we take the root mean square of the error that has occurred between the test values and the predicted values. Mathematically:
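Using the same notation as for MSE above:

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$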

Let's see some graphical examples:

• In this scatter graph, the red dots are the actual values and the blue line is the set of predicted values drawn by our model.
• Here X represents the distance between an actual value and the predicted line; this distance represents the error. Similarly, we can draw straight lines from each red dot to the blue line. Squaring those distances, taking the mean of all of them, and finally taking the root gives us the RMSE of our model.

Note: - RMSE is measured in the same units as the target variable, so what counts as a "good" value depends on the scale of the data; for data on the scale of this example, a good model should have an RMSE value of less than 180. In case you have a higher RMSE value, it would mean that you probably need to change your features or tweak your hyperparameters.
(Hyperparameters are the parameters whose values govern the learning process)
Steps to Calculate MSE/RMSE: -
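The steps can be written out in plain Python. A minimal sketch, reusing the y_true and y_pred lists from the MSE example above:

import math

y_true = [1, 2, 2, 2, 4]                # actual values
y_pred = [0.6, 1.29, 1.99, 2.69, 3.4]   # predicted values

# Step 1: find the error for each prediction (actual - predicted)
errors = [t - p for t, p in zip(y_true, y_pred)]
# Step 2: square each error so positive and negative errors do not cancel
squared_errors = [e ** 2 for e in errors]
# Step 3: average the squared errors to get MSE
mse = sum(squared_errors) / len(squared_errors)
# Step 4: take the square root of MSE to get RMSE
rmse = math.sqrt(mse)

print("MSE =", mse)     # 0.30006
print("RMSE =", rmse)   # about 0.548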
