Solution 3.1
SOLUTIONS:
(a)
There are different ways to do this; three of them are shown in solution 3.1-a.R. Having just one
method is fine for your homework solutions, but all three are shown below for learning purposes.
Another optional component shown below is using cross-validation with ksvm; this did not need to be
included in your solutions either.
METHOD 1
The simplest approach, using kknn’s built-in cross-validation, is fine as a solution. train.kknn uses
leave-one-out cross-validation, which sounds like a different type of cross-validation that I didn’t
mention in the videos – but if you watched the videos, you know it implicitly already! For each data
point, it fits a model to all the other data points, and uses the remaining data point as a test – in other
words, if n is the number of data points, then leave-one-out cross-validation is the same as n-fold cross-
validation.
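For reference, here is a minimal sketch of what Method 1 could look like. It is not necessarily identical
to solution 3.1-a.R; the data frame name (data), the response column name (R1), the file name, and the
choice kmax = 30 are assumptions you would adjust to match your own code.

library(kknn)

# data <- read.table("credit_card_data-headers.txt", header = TRUE)  # assumed file/column names

set.seed(1)
model_loocv <- train.kknn(R1 ~ ., data = data,
                          kmax = 30,      # try k = 1 through 30
                          scale = TRUE)   # scale the predictors

# Leave-one-out predictions for each k are stored in fitted(model_loocv);
# since R1 is 0/1, round the predictions and compute the fraction classified correctly.
accuracy <- sapply(1:30, function(k) {
  pred <- as.integer(fitted(model_loocv)[[k]] + 0.5)
  sum(pred == data$R1) / nrow(data)
})
accuracy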
Using this approach, here are the results (using scaled data): as before, k < 5 is clearly worse than the
rest, and values of k between 10 and 18 seem to do best. For
unscaled data, the results are significantly worse (not shown here, but generally between 66% and 71%).
Note that technically, these runs just let us choose a model from among k=1 through k=30, but because
there might be random effects in validation, to find an estimate of the model quality we’d have to run it
on some test data that we didn’t use for training/cross-validation.
METHOD 2
Some of you used the cv.kknn function in the kknn library. This approach is also shown in
solution 3.1-a.R.
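For reference, a minimal sketch of Method 2 is below (again, not necessarily identical to
solution 3.1-a.R). The response column name R1 and the choice of k = 15 neighbors are assumptions;
cv.kknn passes extra arguments such as k and scale through to kknn.

library(kknn)

set.seed(1)
cv <- cv.kknn(R1 ~ ., data,
              kcv = 10,               # 10-fold cross-validation
              k = 15, scale = TRUE)   # passed through to kknn()

# cv[[1]] holds the observed values and the cross-validated predictions;
# round the predictions to 0/1 and compute the fraction classified correctly.
pred <- as.integer(cv[[1]][, 2] + 0.5)
sum(pred == data$R1) / nrow(data)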
METHOD 3
Others of you found the caret package in R, which can run k-fold cross-validation (among other
things). The built-in functionality of the caret package gives ease of use, plus the flexibility to tune
different parameters and run different models. It's worth trying. This approach is also shown in
solution 3.1-a.R.
The trainControl function lets us choose the resampling method, the number of folds ("number"), and,
for repeated cross-validation, how many times to repeat the whole procedure ("repeats"). The train
function then trains the model, allowing us to preprocess the data (center and scale) as well as specify
the set of k values to choose from.
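For reference, here is a rough sketch of how Method 3 could be set up with caret; the specific numbers
of folds and repeats and the grid of k values are assumptions, not necessarily what solution 3.1-a.R uses.

library(caret)

set.seed(1)
ctrl <- trainControl(method = "repeatedcv",  # repeated k-fold cross-validation
                     number = 10,            # 10 folds
                     repeats = 5)            # repeat the whole 10-fold CV 5 times

knn_fit <- train(as.factor(R1) ~ ., data = data,
                 method = "knn",
                 trControl = ctrl,
                 preProcess = c("center", "scale"),  # center and scale the predictors
                 tuneGrid = data.frame(k = 1:30))    # values of k to choose from

knn_fit           # cross-validated accuracy for each k
knn_fit$bestTune  # the chosen k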
If you also tried cross-validation with ksvm (you didn’t need to), you could do that by including
“cross=k” for k-fold cross-validation – for example, “cross=10” gives 10-fold cross-validation.
In the R code for Question 2.2 Part 1, you would replace the ksvm() call with the same call plus this
cross argument.
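For example (a sketch only, since your ksvm call may differ in its kernel, C value, and column indices),
the modified call could look like this:

library(kernlab)

model <- ksvm(as.matrix(data[, 1:10]), as.factor(data[, 11]),
              type = "C-svc", kernel = "vanilladot",
              C = 100, scaled = TRUE,
              cross = 10)    # adds 10-fold cross-validation

cross(model)   # cross-validation error (fraction misclassified)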
To compare models with different values of C, we can use that modification in the code in solution
2.2-1.R.
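A sketch of that comparison might look like the loop below; the exact set of C values tried is an
assumption.

C_values <- 10^(-5:3)   # e.g., 0.00001 up to 1000
cv_error <- sapply(C_values, function(C) {
  m <- ksvm(as.matrix(data[, 1:10]), as.factor(data[, 11]),
            type = "C-svc", kernel = "vanilladot",
            C = C, scaled = TRUE, cross = 10)
  cross(m)               # 10-fold cross-validation error for this C
})
data.frame(C = C_values, accuracy = 1 - cv_error)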
The results with scaled data show that for very small values of C (such as C=0.00001), only about 55% of
points are classified correctly. At C=0.001, about 83% are classified correctly. At 0.01 and higher, the model
achieves the 86.2% classification correctness we got above – a wide range of values of C gives a good
model. With unscaled data, just as before, finding a value of C to give a good model is harder.
(b)
As usual, there are lots of possible answers. File solution 3.1-b.R contains one approach.
In this approach, we first split the data into training, validation, and test sets. We used 60%, 20%, and
20%, but other splits are fine too as long as training has at least half.
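A minimal sketch of that split is below (not necessarily how solution 3.1-b.R does it; the data frame
name is an assumption).

set.seed(1)
n <- nrow(data)
train_idx  <- sample(1:n, size = round(0.6 * n))          # 60% for training
train_data <- data[train_idx, ]
rest       <- data[-train_idx, ]
valid_idx  <- sample(1:nrow(rest), size = round(0.5 * nrow(rest)))
valid_data <- rest[valid_idx, ]                           # 20% for validation
test_data  <- rest[-valid_idx, ]                          # 20% for testing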
Then, we fit 9 SVM models and 20 k-nearest-neighbor models to the training data, and evaluated them
on the validation data.
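Continuing the sketch above, the candidate models could be fit and scored like this; the grids of C and k
values, the column layout (10 predictors plus R1 in column 11), and the rounding rule are all assumptions.

library(kernlab)
library(kknn)

C_values <- 10^(-5:3)   # 9 SVM models
svm_acc <- sapply(C_values, function(C) {
  m <- ksvm(as.matrix(train_data[, 1:10]), as.factor(train_data[, 11]),
            type = "C-svc", kernel = "vanilladot", C = C, scaled = TRUE)
  pred <- predict(m, as.matrix(valid_data[, 1:10]))
  sum(pred == valid_data[, 11]) / nrow(valid_data)   # validation accuracy
})

k_values <- 1:20        # 20 k-nearest-neighbor models
knn_acc <- sapply(k_values, function(k) {
  m <- kknn(R1 ~ ., train_data, valid_data, k = k, scale = TRUE)
  pred <- as.integer(fitted(m) + 0.5)                # round to 0/1
  sum(pred == valid_data$R1) / nrow(valid_data)      # validation accuracy
})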
We report the SVM model that does best in validation, and the KNN model that does best in validation.
The code chose C=0.01 as the best SVM model, though it was equal with C=0.1, 1, 10, 100, and 1000, so
any of them could’ve been chosen. The best KNN model was with k=16.
Then, we have an if statement that checks which model – the best SVM model or the best KNN
model – performed best on the validation data. Whichever one wins is the model we suggest using, and
we report its performance on the test set.
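Continuing the same sketch, the final comparison and the test-set report could look like this:

best_svm_acc <- max(svm_acc)
best_knn_acc <- max(knn_acc)

if (best_svm_acc > best_knn_acc) {
  best_C <- C_values[which.max(svm_acc)]
  m <- ksvm(as.matrix(train_data[, 1:10]), as.factor(train_data[, 11]),
            type = "C-svc", kernel = "vanilladot", C = best_C, scaled = TRUE)
  pred <- predict(m, as.matrix(test_data[, 1:10]))
  cat("Best model: SVM with C =", best_C,
      "; test accuracy =", sum(pred == test_data[, 11]) / nrow(test_data), "\n")
} else {
  best_k <- k_values[which.max(knn_acc)]
  m <- kknn(R1 ~ ., train_data, test_data, k = best_k, scale = TRUE)
  pred <- as.integer(fitted(m) + 0.5)
  cat("Best model: KNN with k =", best_k,
      "; test accuracy =", sum(pred == test_data$R1) / nrow(test_data), "\n")
}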
Important note: In our code, the best SVM model (C=0.01) performs best on the validation data… but
the best KNN model performs best on the test data. It might be tempting to therefore say, “Oh, let’s use
the best KNN model.” Don’t give in to this temptation! If you do, you’re losing the value of separating
validation and test sets. You’d essentially be using the test set to pick the best model, and then you’d
(incorrectly) be using that same test set to estimate its quality – the selection bias from the validation
step will be incorrectly included in the quality estimate.
You could’ve used a different approach – for example, only testing SVM models or only testing KNN
models.
Some people also went beyond what we’ve covered, and tested models with different kernels – that’s
also a good idea, and it’s possible to get better models that way.