
Cross Validation

Cross-validation is a technique for evaluating machine learning (ML) models: several models are trained on subsets of the
available input data and each is evaluated on the complementary subset. Use cross-validation to
detect overfitting, i.e., a model that fails to generalize a pattern.
The three steps involved in cross-validation are as follows (a minimal sketch appears after the list):

1. Reserve a portion of the dataset.
2. Train the model on the rest of the dataset.
3. Test the model on the reserved portion of the dataset.
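
To make these steps concrete, here is a minimal sketch that repeats them over five folds using scikit-learn's plain KFold; the dataset, solver settings, and fold count are illustrative assumptions, not part of the demonstration further below.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Illustrative data; any feature matrix X and label vector y would do.
X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # 1. Reserve one fold of the dataset.
    # 2. Train the model on the remaining folds.
    model = LogisticRegression(max_iter=10000)
    model.fit(X[train_idx], y[train_idx])
    # 3. Test the model on the reserved fold.
    scores.append(model.score(X[test_idx], y[test_idx]))

print('Accuracy per fold:', scores, '\nMean accuracy:', np.mean(scores))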

Here I only discuss stratified k-fold cross-validation.

Stratified k-fold cross-validation


Stratified k-fold cross-validation works like ordinary k-fold cross-validation, except that it uses
stratified sampling instead of random sampling.
One obvious problem with plain KFold is that the class distribution in each fold's validation set
will generally not match the class distribution of the full dataset. This is a big problem with imbalanced datasets.
To overcome this problem we use StratifiedKFold. StratifiedKFold ensures that each of the
splits has the same proportion of examples of each class as the original dataset.
StratifiedKFold is a variation of KFold: if shuffle=True, it shuffles the data once, splits it into
n_splits parts, and then uses each part in turn as the validation set. Note that the data is shuffled
only once, before splitting, not before every fold.
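
To see the difference, here is a small sketch that prints the fraction of the minority class in each validation fold for plain KFold versus StratifiedKFold; the imbalanced toy labels (90% class 0, 10% class 1) are an assumption made purely for illustration.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(100).reshape(-1, 1)   # dummy features
y = np.array([0] * 90 + [1] * 10)   # 90% class 0, 10% class 1

for name, splitter in [('KFold', KFold(n_splits=5)),
                       ('StratifiedKFold', StratifiedKFold(n_splits=5))]:
    print(name)
    for _, val_idx in splitter.split(X, y):
        print('  fraction of class 1 in validation fold:', y[val_idx].mean())

With plain KFold the minority class ends up concentrated in a single validation fold, while StratifiedKFold keeps roughly 10% positives in every fold.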

Demonstration
In [1]:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, auc
from sklearn.model_selection import StratifiedKFold
import warnings
warnings.simplefilter('ignore')

In [2]:

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = pd.Series(cancer.target)
df.head()

Out[2]:

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  ...
0        17.99         10.38          122.80     1001.0          0.11840           0.27760          0.3001              0.14710         0.2419  ...
1        20.57         17.77          132.90     1326.0          0.08474           0.07864          0.0869              0.07017         0.1812  ...
2        19.69         21.25          130.00     1203.0          0.10960           0.15990          0.1974              0.12790         0.2069  ...
3        11.42         20.38           77.58      386.1          0.14250           0.28390          0.2414              0.10520         0.2597  ...
4        20.29         14.34          135.10     1297.0          0.10030           0.13280          0.1980              0.10430         0.1809  ...

5 rows × 31 columns

In [3]:

X = df.drop('target',axis=1)
y = df['target'].astype('category')

Use manual train_test_split


In [4]:

from sklearn.model_selection import train_test_split


X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

In [5]:

lr_manual = LogisticRegression()
lr_manual.fit(X_train,y_train)

Out[5]:

LogisticRegression()
In [6]:

confusion_matrix(y_test,lr_manual.predict(X_test))

Out[6]:

array([[39,  5],
       [ 0, 70]], dtype=int64)

Use StratifiedKFold
In [7]:

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)
pred_test_full = 0   # would accumulate test-set probabilities (unused here; see commented lines below)
cv_score = []
i = 1
for train_index, test_index in kf.split(X, y):
    print('{} of KFold {}'.format(i, kf.n_splits))

    # split features and targets into training and validation folds
    xtr, xvl = X.iloc[train_index], X.iloc[test_index]
    ytr, yvl = y.iloc[train_index], y.iloc[test_index]

    # model
    lr = LogisticRegression(C=2)
    lr.fit(xtr, ytr)
    score = roc_auc_score(yvl, lr.predict(xvl))
    print('ROC AUC score:', score)
    cv_score.append(score)

    # pred_test = lr.predict_proba(x_test)[:,1]
    # pred_test_full += pred_test
    i += 1

1 of KFold 5
ROC AUC score: 0.9415329184408779
2 of KFold 5
ROC AUC score: 0.932361611529643
3 of KFold 5
ROC AUC score: 0.9523809523809523
4 of KFold 5
ROC AUC score: 0.9692460317460316
5 of KFold 5
ROC AUC score: 0.9312541918175722
In [8]:

print('Confusion matrix\n', confusion_matrix(yvl, lr.predict(xvl)))
print('Cv', cv_score, '\nMean cv Score', np.mean(cv_score))

Confusion matrix
[[38 4]
[ 3 68]]
Cv [0.9415329184408779, 0.932361611529643, 0.9523809523809523, 0.9692460317460316, 0.9312541918175722]
Mean cv Score 0.9453551411830154

Here I used logistic regression to demonstrate stratified k-fold cross-validation; you can use any algorithm in its place.
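
For example, the same stratified 5-fold scheme can be reused with a different estimator. The sketch below swaps in a RandomForestClassifier via cross_val_score; both choices are mine for illustration, and it reuses the X and y defined earlier in the notebook.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Any classifier can be dropped into the same stratified scheme.
rf = RandomForestClassifier(n_estimators=100, random_state=45)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)
scores = cross_val_score(rf, X, y, cv=kf, scoring='roc_auc')
print('ROC AUC per fold:', scores, '\nMean ROC AUC:', scores.mean())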
