Classification Review
Model Overfitting
For simplicity, both classes have an equal number of labeled instances. The code for
generating and plotting the data is shown below. All instances from class 1 are shown
in red, while those from class 0 are shown in black.
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import random
%matplotlib inline

N = 1500
# Means and shared covariance of the class-1 Gaussian mixture; the exact
# values are illustrative (any three well-separated means work).
mean1 = [6, 14]
mean2 = [10, 6]
mean3 = [14, 14]
cov = [[3.5, 0], [0, 3.5]]

np.random.seed(50)
# Class 1: mixture of three Gaussians; class 0: uniform over [0, 20] x [0, 20].
X = np.random.multivariate_normal(mean1, cov, int(N/6))
X = np.concatenate((X, np.random.multivariate_normal(mean2, cov, int(N/6))))
X = np.concatenate((X, np.random.multivariate_normal(mean3, cov, int(N/6))))
X = np.concatenate((X, 20*np.random.rand(int(N/2), 2)))
# First N/2 instances are class 1, the remaining N/2 are class 0.
Y = np.concatenate((np.ones(int(N/2)), np.zeros(int(N/2))))
plt.plot(X[:int(N/2),0], X[:int(N/2),1], 'r+',
         X[int(N/2):,0], X[int(N/2):,1], 'k.', ms=4)
#########################################
# Training and Test set creation
#########################################
from sklearn.model_selection import train_test_split

# Hold out part of the data for testing. The split fraction and random seed
# below are one reasonable choice, not prescribed values.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.8,
                                                    random_state=1)

#########################################
# Model fitting and evaluation
#########################################
from sklearn import tree
from sklearn.metrics import accuracy_score

maxdepths = [2,3,4,5,6,7,8,9,10,15,20,25,30,35,40,45,50]
trainAcc = np.zeros(len(maxdepths))
testAcc = np.zeros(len(maxdepths))
index = 0

for depth in maxdepths:
    # Fit a decision tree of the given maximum depth and record its accuracy
    # on both the training and the test set.
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    clf = clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    trainAcc[index] = accuracy_score(Y_train, Y_predTrain)
    testAcc[index] = accuracy_score(Y_test, Y_predTest)
    index += 1
#########################################
# Plot of training and test accuracies
#########################################
plt.plot(maxdepths,trainAcc,'ro-',maxdepths,testAcc,'bv--')
plt.legend(['Training Accuracy','Test Accuracy'])
plt.xlabel('Max depth')
plt.ylabel('Accuracy')
In the k-nearest neighbor (KNN) approach, the class label of a test instance is
predicted based on the majority class among its *k* closest training instances. The
number of nearest neighbors, *k*, is a hyperparameter that must be provided by the
user, along with the distance metric. By default, we use the Euclidean distance, which
is equivalent to the Minkowski distance with exponent p = 2:
\begin{equation*}
\textrm{Minkowski distance}(x,y) = \bigg[\sum_{i=1}^N |x_i-y_i|^p \bigg]^{\frac{1}{p}}
\end{equation*}
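As a quick check on the formula, here is a minimal sketch (the two points are arbitrary illustrative values) that computes the Minkowski distance directly and confirms that p = 2 coincides with the Euclidean distance returned by np.linalg.norm:

import numpy as np

def minkowski_distance(x, y, p):
    # Direct implementation of the formula above.
    return np.sum(np.abs(x - y)**p)**(1.0/p)

# Two example points (arbitrary values chosen for illustration).
x = np.array([6.0, 14.0])
y = np.array([10.0, 6.0])

print(minkowski_distance(x, y, p=1))   # Manhattan distance
print(minkowski_distance(x, y, p=2))   # Euclidean distance
print(np.linalg.norm(x - y))           # matches the p=2 result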
from sklearn.neighbors import KNeighborsClassifier

# Candidate values of k; the list below is illustrative.
numNeighbors = [1, 5, 10, 15, 20, 25, 30]
trainAcc = []
testAcc = []
for k in numNeighbors:
    clf = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    trainAcc.append(accuracy_score(Y_train, Y_predTrain))
    testAcc.append(accuracy_score(Y_test, Y_predTest))
Linear classifiers such as logistic regression and the support vector machine (SVM)
construct a linear separating hyperplane to distinguish instances from different
classes.
For logistic regression, the model can be described by the following equation:
\begin{equation*}
P(y=1|x) = \frac{1}{1 + \exp(-w^Tx - b)} = \sigma(w^Tx + b)
\end{equation*}
The model parameters (w,b) are estimated by optimizing the following regularized
negative log-likelihood function:
\begin{equation*}
(w^*,b^*) = \arg\min_{w,b} - \sum_{i=1}^N \bigg\{ y_i \log\big[\sigma(w^Tx_i + b)\big] +
(1-y_i) \log\big[\sigma(-w^Tx_i - b)\big] \bigg\} + \frac{1}{C}\, \Omega([w,b])
\end{equation*}
where $C$ is a hyperparameter that is inversely proportional to the regularization
strength (smaller values of $C$ imply stronger regularization), while $\Omega(\cdot)$
is the regularization term, which sklearn assumes by default to be the $l_2$-norm.
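To see the pieces of this objective concretely, here is a minimal sketch (assuming the training split created earlier) that fits one logistic regression model, recovers $\sigma(w^Tx + b)$ from its coef_ and intercept_ attributes, and evaluates the penalized negative log-likelihood; the choice C=1.0 is arbitrary.

from sklearn import linear_model
import numpy as np

# Fit a single logistic regression model; C = 1.0 is an arbitrary choice.
lr = linear_model.LogisticRegression(C=1.0)
lr.fit(X_train, Y_train)

w, b = lr.coef_.ravel(), lr.intercept_[0]

# sigma(w^T x + b) computed directly from the fitted parameters ...
p1 = 1.0 / (1.0 + np.exp(-(X_train @ w + b)))
# ... agrees with the class-1 column of predict_proba.
print(np.allclose(p1, lr.predict_proba(X_train)[:, 1]))

# Negative log-likelihood plus the (1/C)-scaled l2 penalty. sklearn's default
# l2 penalty corresponds to Omega = ||w||^2 / 2, with the intercept unpenalized.
p1 = np.clip(p1, 1e-12, 1 - 1e-12)
nll = -np.sum(Y_train*np.log(p1) + (1 - Y_train)*np.log(1 - p1))
print(nll + (1.0/lr.C) * 0.5*np.dot(w, w))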
For the support vector machine, the model parameters $(w^*,b^*)$ are estimated by
solving the following constrained optimization problem:
\begin{eqnarray*}
&&\min_{w,b,\{\xi_i\}} \ \frac{\|w\|^2}{2} + C \sum_i \xi_i \\
\textrm{s.t.} && \forall i: \ y_i\bigg[w^T \phi(x_i) + b\bigg] \ge 1 - \xi_i, \ \ \xi_i \ge 0
\end{eqnarray*}
Here, larger values of $C$ penalize the slack variables (training errors) more
heavily, so smaller values of $C$ again correspond to stronger regularization.
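sklearn does not expose the slack variables directly, but for a fitted linear SVC they can be recovered from the decision function as $\xi_i = \max(0,\, 1 - y_i[w^Tx_i + b])$. A minimal sketch, assuming the training split from above and mapping the 0/1 labels to $\pm 1$ first:

from sklearn.svm import SVC
import numpy as np

# Fit one linear SVM; the value of C here is arbitrary, just for illustration.
svm = SVC(C=1.0, kernel='linear')
svm.fit(X_train, Y_train)

# decision_function returns w^T x + b for the linear kernel.
f = svm.decision_function(X_train)

# Map the 0/1 labels to -1/+1 to match the formulation above.
y_pm = np.where(Y_train == 1, 1.0, -1.0)

# Slack: how far each instance falls on the wrong side of its margin.
xi = np.maximum(0.0, 1.0 - y_pm * f)
print('instances with nonzero slack:', np.sum(xi > 1e-8))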
from sklearn import linear_model
from sklearn.svm import SVC
# Candidate values of the hyperparameter C; the list below is illustrative.
C = [0.01, 0.1, 0.2, 0.5, 0.8, 1, 5, 10, 20, 50]
LRtrainAcc = []
LRtestAcc = []
SVMtrainAcc = []
SVMtestAcc = []

for param in C:
    # Logistic regression with inverse regularization strength C = param.
    clf = linear_model.LogisticRegression(C=param)
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    LRtrainAcc.append(accuracy_score(Y_train, Y_predTrain))
    LRtestAcc.append(accuracy_score(Y_test, Y_predTest))

    # Linear SVM with the same value of C.
    clf = SVC(C=param, kernel='linear')
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    SVMtrainAcc.append(accuracy_score(Y_train, Y_predTrain))
    SVMtestAcc.append(accuracy_score(Y_test, Y_predTest))
The code below shows an example of using a nonlinear support vector machine with a
Gaussian radial basis function (RBF) kernel to fit the 2-dimensional dataset; a short
sketch of the kernel itself precedes the fitting loop.
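The RBF kernel replaces the inner product in the SVM formulation with $K(x,z) = \exp(-\gamma\|x-z\|^2)$; with gamma='auto', sklearn sets $\gamma = 1/d$, where $d$ is the number of features. A minimal sketch (the two points are arbitrary) comparing a direct evaluation against sklearn's pairwise rbf_kernel helper:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two example points (arbitrary values chosen for illustration).
x = np.array([[6.0, 14.0]])
z = np.array([[10.0, 6.0]])

gamma = 1.0 / X_train.shape[1]   # what gamma='auto' amounts to (1/d)

# Direct evaluation of K(x, z) = exp(-gamma * ||x - z||^2) ...
k_manual = np.exp(-gamma * np.sum((x - z)**2))
# ... matches sklearn's pairwise RBF kernel.
print(k_manual, rbf_kernel(x, z, gamma=gamma)[0, 0])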
SVMtrainAcc = []
SVMtestAcc = []

for param in C:
    # Nonlinear SVM with a Gaussian RBF kernel.
    clf = SVC(C=param, kernel='rbf', gamma='auto')
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    SVMtrainAcc.append(accuracy_score(Y_train, Y_predTrain))
    SVMtestAcc.append(accuracy_score(Y_test, Y_predTest))
Ensemble Methods
An ensemble classifier constructs a set of base classifiers from the training data and
performs classification by taking a vote on the predictions made by the base
classifiers. We consider three types of ensemble classifiers in this example: bagging,
boosting, and random forest. The sketch below illustrates the voting step.
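As an illustration of the voting step only (not of any particular ensemble method's training procedure), the following minimal sketch fits a handful of depth-10 decision trees on bootstrap samples of the training data, collects their predictions, and combines them by majority vote; all names here are local to the sketch.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(1)
n_trees = 11                      # small, odd number to avoid ties

# Collect each base classifier's test-set predictions (one row per classifier).
all_preds = np.zeros((n_trees, len(Y_test)))
for t in range(n_trees):
    # Bootstrap sample: draw training instances with replacement.
    idx = rng.randint(0, len(Y_train), size=len(Y_train))
    base = DecisionTreeClassifier(max_depth=10)
    base.fit(X_train[idx], Y_train[idx])
    all_preds[t] = base.predict(X_test)

# Majority vote over the 0/1 labels: class 1 wins if more than half vote 1.
vote = (all_preds.mean(axis=0) > 0.5).astype(int)
print('majority-vote test accuracy:', accuracy_score(Y_test, vote))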
In the example below, we fit 500 base classifiers to the 2-dimensional dataset using
each ensemble method. Each base classifier is a decision tree with a maximum depth
of 10.
from sklearn import ensemble
from sklearn.tree import DecisionTreeClassifier

numBaseClassifiers = 500
maxdepth = 10
trainAcc = []
testAcc = []
# Random forest with 500 trees.
clf = ensemble.RandomForestClassifier(n_estimators=numBaseClassifiers)
clf.fit(X_train, Y_train)
Y_predTrain = clf.predict(X_train)
Y_predTest = clf.predict(X_test)
trainAcc.append(accuracy_score(Y_train, Y_predTrain))
testAcc.append(accuracy_score(Y_test, Y_predTest))
# Bagging: 500 depth-10 decision trees, each fit on a bootstrap sample.
clf = ensemble.BaggingClassifier(DecisionTreeClassifier(max_depth=maxdepth),
                                 n_estimators=numBaseClassifiers)
clf.fit(X_train, Y_train)
Y_predTrain = clf.predict(X_train)
Y_predTest = clf.predict(X_test)
trainAcc.append(accuracy_score(Y_train, Y_predTrain))
testAcc.append(accuracy_score(Y_test, Y_predTest))
# AdaBoost: 500 depth-10 decision trees fit sequentially on reweighted data.
clf = ensemble.AdaBoostClassifier(DecisionTreeClassifier(max_depth=maxdepth),
                                  n_estimators=numBaseClassifiers)
clf.fit(X_train, Y_train)
Y_predTrain = clf.predict(X_train)
Y_predTest = clf.predict(X_test)
trainAcc.append(accuracy_score(Y_train, Y_predTrain))
testAcc.append(accuracy_score(Y_test, Y_predTest))