CS6301 Homework2 KR
CS6301 Homework2 KR
Kaitlin Rabe
Homework 2
The report component of this assignment is the hard copy of this homework, along with your answers to
questions, and is due at the start of class on Monday, February 25, 2019.
The electronic version of this homework must be uploaded on eLearning by 9:59am Central Standard
Time, Monday, February 25, 2019. All deadlines are hard and without exceptions unless permission was
obtained from the instructor in advance.
You may work in groups to discuss the problems and work through solutions together. However, you must
write up your solutions on your own, without copying another student's work or letting another student
copy your work. In your solution for each problem, you must write down the names of your partner (if any);
this will not affect your grade.
For this problem, we will generate synthetic data for a nonlinear binary classification problem and partition it
into training, validation and test sets. Our goal is to understand the behavior of SVMs with Radial-Basis
Function (RBF) kernels with different values of C and γ .
In [1]: #
# DO NOT EDIT THIS FUNCTION; IF YOU WANT TO PLAY AROUND WITH DATA GENE
RATION,
# MAKE A COPY OF THIS FUNCTION AND THEN EDIT
#
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# Take a small subset of the data and make it VERY noisy; that is, g
enerate outliers
http://localhost:8888/nbconvert/html/Box%20Sync/Coursework/Spring%202019/CS%206301.010%20-%20Machine%20Learning/CS6301-Homework2_KR.ipynb?download=false Page 1 of 13
CS6301-Homework2_KR 2/24/19, 7)38 PM
enerate outliers
m = 30
np.random.seed(42)
ind = np.random.permutation(n_samples)[:m]
X[ind, :] += np.random.multivariate_normal([0, 0], np.eye(2), (m, ))
y[ind] = 1 - y[ind]
In [2]: #
# DO NOT EDIT THIS FUNCTION; IF YOU WANT TO PLAY AROUND WITH VISUALIZ
ATION,
# MAKE A COPY OF THIS FUNCTION AND THEN EDIT
#
# Create a mesh
xMin, xMax = X[:, 0].min() - 1, X[:, 0].max() + 1
yMin, yMax = X[:, 1].min() - 1, X[:, 1].max() + 1
xMesh, yMesh = np.meshgrid(np.arange(xMin, xMax, 0.01),
np.arange(yMin, yMax, 0.01))
# Plot contours
zMesh = clf.decision_function(np.c_[xMesh.ravel(), yMesh.ravel()])
zMesh = zMesh.reshape(xMesh.shape)
ax.contourf(xMesh, yMesh, zMesh, cmap=plt.cm.PiYG, alpha=0.6)
http://localhost:8888/nbconvert/html/Box%20Sync/Coursework/Spring%202019/CS%206301.010%20-%20Machine%20Learning/CS6301-Homework2_KR.ipynb?download=false Page 2 of 13
CS6301-Homework2_KR 2/24/19, 7)38 PM
# Plot data
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap, edgecolors='k')
ax.set_title('{0} = {1}'.format(param, p))
Plot: For each classifier, compute both the training error and the validation error. Plot them together,
making sure to label the axes and each curve clearly.
Discussion: How do the training error and the validation error change with C ? Based on the visualization of
the models and their resulting classifiers, how does changing C change the models? Explain in terms of
minimizing the SVM's objective function 12 w′ w + C Σni= 1 ℓ(w ∣ xi , yi ), where ℓ is the hinge loss for each
training example (xi , y i ) .
Final Model Selection: Use the validation set to select the best the classifier corresponding to the best
value, Cbest . Report the accuracy on the test set for this selected best SVM model. Note: You should report
a single number, your final test set accuracy on the model corresponding to $C{best}$_.
http://localhost:8888/nbconvert/html/Box%20Sync/Coursework/Spring%202019/CS%206301.010%20-%20Machine%20Learning/CS6301-Homework2_KR.ipynb?download=false Page 3 of 13
CS6301-Homework2_KR 2/24/19, 7)38 PM
models = dict()
trnErr = dict()
valErr = dict()
tstAcc = dict()
for C in C_values:
plt.figure()
plt.plot(trnErr.keys(), trnErr.values(), marker='o', linewidth=3, mark
ersize=12)
plt.plot(valErr.keys(), valErr.values(), marker='s', linewidth=3, mark
ersize=12)
plt.xlabel('Regularization Parameter, C', fontsize=16)
plt.ylabel('Training/Validation Error', fontsize=16)
plt.legend(['Training Error', 'Validation Error'], fontsize=16)
plt.xscale('log')
min_Error = min(valErr.values())
C_best = 0
for C in valErr.keys():
if valErr.get(C) == min_Error:
C_best = C
tst_Accuracy = tstAcc.get(C)
print("C_best is", C_best)
print("Test accuracy with C_best is", "%.5f" % tst_Accuracy)
print("-------------------------------------------------------
---")
C_best is 1.0
Test accuracy with C_best is 0.65000
----------------------------------------------------------
http://localhost:8888/nbconvert/html/Box%20Sync/Coursework/Spring%202019/CS%206301.010%20-%20Machine%20Learning/CS6301-Homework2_KR.ipynb?download=false Page 4 of 13
CS6301-Homework2_KR 2/24/19, 7)38 PM
Discussion
As the regularization parameter, C, is increased, the training error continues to decrease approaching
zero. The C parameter acts as a trade off between correct classification and a large margin for the
support vector. As C continues to increase, the margin will decrease and the classifier will attempt to
correctly classify all of the training examples. This corresponds to overfitting the training dataset. This is
confirmed when you look at the validation error, as it begins to increase with increasing C. Based on the
data for this example, a C value of 1 would be the most generalizable without overfitting the data.
http://localhost:8888/nbconvert/html/Box%20Sync/Coursework/Spring%202019/CS%206301.010%20-%20Machine%20Learning/CS6301-Homework2_KR.ipynb?download=false Page 5 of 13
CS6301-Homework2_KR 2/24/19, 7)38 PM
Plot: For each classifier, compute both the training error and the validation error. Plot them together,
making sure to label the axes and each curve clearly.
Discussion: How do the training error and the validation error change with γ ? Based on the visualization of
the models and their resulting classifiers, how does changing γ change the models? Explain in terms of the
functional form of the RBF kernel, κ(x, z) = exp(−γ ⋅ ‖x −z‖2 )
Final Model Selection: Use the validation set to select the best the classifier corresponding to the best
value, γbest . Report the accuracy on the test set for this selected best SVM model. Note: You should report a
single number, your final test set accuracy on the model corresponding to $\gamma{best}$_.
http://localhost:8888/nbconvert/html/Box%20Sync/Coursework/Spring%202019/CS%206301.010%20-%20Machine%20Learning/CS6301-Homework2_KR.ipynb?download=false Page 6 of 13
CS6301-Homework2_KR 2/24/19, 7)38 PM
models = dict()
trnErr = dict()
valErr = dict()
tstAcc = dict()
for G in gamma_values:
# Plot training error and validation error vs. gamma for each classifi
er
plt.figure()
plt.plot(trnErr.keys(), trnErr.values(), marker='o', linewidth=3, mark
ersize=12)
plt.plot(valErr.keys(), valErr.values(), marker='s', linewidth=3, mark
ersize=12)
plt.xlabel('Gamma', fontsize=16)
plt.ylabel('Training/Validation Error', fontsize=16)
plt.legend(['Training Error', 'Validation Error'], fontsize=16)
plt.xscale('log')
min_Error = min(valErr.values())
G_best = 0
for G in valErr.keys():
if valErr.get(G) == min_Error:
G_best = G
tst_Accuracy = tstAcc.get(G)
print("G_best is", G_best)
print("Test accuracy with G_best is", "%.5f" % tst_Accuracy)
print("-------------------------------------------------------
---")
G_best is 1.0
Test accuracy with G_best is 0.66667
----------------------------------------------------------
G_best is 10.0
Test accuracy with G_best is 0.61667
----------------------------------------------------------
http://localhost:8888/nbconvert/html/Box%20Sync/Coursework/Spring%202019/CS%206301.010%20-%20Machine%20Learning/CS6301-Homework2_KR.ipynb?download=false Page 7 of 13
CS6301-Homework2_KR 2/24/19, 7)38 PM
Discussion
As gamma is increased, the training error continues to decrease to zero. This is because gamma defines
the magnitude of the influence of a a single training example, or the inverse of the radius of each
training examples support vector. Therefore when gamma increases, the classifier begins to dramatically
overfit the training dataset. This becomes very obvious when you look at the the validation error, as it
begins to increase rapidly with increasing gamma. Based on the data for this example, a gamma value
of 1 would be the most generalizable without overfitting the data.
In [6]: # Load the Breast Cancer Diagnosis data set; download the files from e
Learning
# CSV files can be read easily using np.loadtxt()
)
wdbc_val = np.loadtxt('wdbc_val.csv', dtype = 'float', delimiter = ','
)
wdbc_tst = np.loadtxt('wdbc_tst.csv', dtype = 'float', delimiter = ','
)
y_trn = wdbc_trn[:, 0]
X_trn = wdbc_trn[:, 1:]
y_val = wdbc_val[:, 0]
X_val = wdbc_val[:, 1:]
y_tst = wdbc_val[:, 0]
X_tst = wdbc_val[:, 1:]
Final Model Selection: Use the validation set to select the best the classifier corresponding to the best
parameter values, Cbest and γbest . Report the accuracy on the test set for this selected best SVM model.
Note: You should report a single number, your final test set accuracy on the model corresponding to $C{best}
and \gamma{best}$.
models = dict()
trnErr = dict()
valErr = dict()
tstAcc = dict()
print("C", '\t', '\t', "G", '\t', '\t', "Training Error", '\t', "Valid
ation Error")
print("---", '\t', '\t', "---", '\t', '\t', "---------------", '\t', "
----------------")
for C in C_values:
for G in gamma_values:
min_Error = min(valErr.values())
C_best = 0
G_best = 0
for k in valErr.keys():
if valErr.get(k) == min_Error:
C_best, G_best = k
tst_Accuracy = tstAcc.get(k)
print("C_best and G_best are", C_best, "and", G_best, ", r
espectively.")
print("Test accuracy with (C_best, G_best) is", "%.5f" % t
st_Accuracy)
print("---------------------------------------------------
-------")
7391
C_best and G_best are 100.0 and 0.01 , respectively.
Test accuracy with (C_best, G_best) is 0.97391
----------------------------------------------------------
C_best and G_best are 1000.0 and 0.01 , respectively.
Test accuracy with (C_best, G_best) is 0.97391
----------------------------------------------------------
C_best and G_best are 10000.0 and 0.001 , respectively.
Test accuracy with (C_best, G_best) is 0.97391
----------------------------------------------------------
C_best and G_best are 10000.0 and 0.01 , respectively.
Test accuracy with (C_best, G_best) is 0.97391
----------------------------------------------------------
Plot: For each classifier, compute both the training error and the validation error. Plot them together,
making sure to label the axes and each curve clearly.
Final Model Selection: Use the validation set to select the best the classifier corresponding to the best
parameter value, k best . Report the accuracy on the test set for this selected best kNN model. Note: You
should report a single number, your final test set accuracy on the model corresponding to $k{best}$_.
http://localhost:8888/nbconvert/html/Box%20Sync/Coursework/Spring%202019/CS%206301.010%20-%20Machine%20Learning/CS6301-Homework2_KR.ipynb?download=false Page 11 of 13
CS6301-Homework2_KR 2/24/19, 7)38 PM
models = dict()
trnErr = dict()
valErr = dict()
tstAcc = dict()
for k in k_values:
# Plot training error and validation error vs. k for each classifier
plt.figure()
plt.plot(trnErr.keys(), trnErr.values(), marker='o', linewidth=3, mark
ersize=12)
plt.plot(valErr.keys(), valErr.values(), marker='s', linewidth=3, mark
ersize=12)
plt.xlabel('# of Nearest Neighbors, k', fontsize=16)
plt.ylabel('Training/Validation Error', fontsize=16)
plt.legend(['Training Error', 'Validation Error'], fontsize=16)
plt.axis('tight')
min_Error = min(valErr.values())
k_best = 0
for k in valErr.keys():
if valErr.get(k) == min_Error:
k_best = k
tst_Accuracy = tstAcc.get(k)
print("k_best is", k_best)
print("Test accuracy with k_best is", "%.5f" % tst_Accuracy)
print("-------------------------------------------------------
---")
http://localhost:8888/nbconvert/html/Box%20Sync/Coursework/Spring%202019/CS%206301.010%20-%20Machine%20Learning/CS6301-Homework2_KR.ipynb?download=false Page 12 of 13
CS6301-Homework2_KR 2/24/19, 7)38 PM
k_best is 5
Test accuracy with k_best is 0.95652
----------------------------------------------------------
k_best is 11
Test accuracy with k_best is 0.95652
----------------------------------------------------------
Discussion: Which of these two approaches, SVMs or kNN, would you prefer for this classification task?
Explain.
I would choose the support vector machine classifier for this task because it gives the best classification
accuracy when applied with to test dataset.
In [ ]:
http://localhost:8888/nbconvert/html/Box%20Sync/Coursework/Spring%202019/CS%206301.010%20-%20Machine%20Learning/CS6301-Homework2_KR.ipynb?download=false Page 13 of 13