SVM PDF
Jean-Philippe Vert
• Test a SVM classifier for cancer diagnosis from gene expression data
1 Linear SVM
Here we generate a toy dataset in 2D, and learn how to train and test a SVM.
1.1 Generate toy data
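The data-generation code did not survive the PDF extraction; a minimal sketch of what it could look like, assuming each class is drawn from an isotropic 2D Gaussian (the names n, meanpos and meanneg are hypothetical, chosen to match the variables mentioned later in the text):

```r
set.seed(123)
n <- 150            # total number of points (assumption)
meanpos <- 0        # centre of the positive class (hypothetical name)
meanneg <- 3        # centre of the negative class (hypothetical name)
npos <- round(n / 2)
nneg <- n - npos
# Draw positive and negative examples from two 2D Gaussians
xpos <- matrix(rnorm(npos * 2, mean = meanpos, sd = 1), npos, 2)
xneg <- matrix(rnorm(nneg * 2, mean = meanneg, sd = 1), nneg, 2)
x <- rbind(xpos, xneg)
y <- c(rep(1, npos), rep(-1, nneg))   # labels in {-1, +1}
```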
[Figure: 2D scatter plot of the toy data (x[,1] vs. x[,2]), positive and negative examples in different colors]
Now we split the data into a training set (80%) and a test set (20%)
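The splitting code itself is missing from the extraction; one possible version, assuming x and y as generated above (the istrain indicator is what the plotting code below uses; the first two lines are a stand-in for the toy data):

```r
set.seed(123)
x <- matrix(rnorm(300), 150, 2)     # stand-in for the toy data above
y <- c(rep(1, 75), rep(-1, 75))
n <- nrow(x)
ntrain <- round(n * 0.8)            # 80% of the points for training
tindex <- sample(n, ntrain)         # random indices of the training set
xtrain <- x[tindex, ]; ytrain <- y[tindex]
xtest  <- x[-tindex, ]; ytest  <- y[-tindex]
istrain <- rep(0, n); istrain[tindex] <- 1   # 1 = training point, 0 = test point
```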
# Visualize
plot(x, col = ifelse(y > 0, 1, 2), pch = ifelse(istrain == 1, 1, 2))
legend("topleft", c("Positive Train", "Positive Test", "Negative Train", "Negative Test"),
       col = c(1, 1, 2, 2), pch = c(1, 2, 1, 2), text.col = c(1, 1, 2, 2))
1.2 Train a SVM
[Figure: toy data with training points (circles) and test points (triangles), positive and negative classes in different colors]
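The training code on this page was lost in extraction; a plausible reconstruction, based on the ksvm call that appears later in the text (C-svc with the linear "vanilladot" kernel, no scaling). The first lines regenerate stand-in training data so the sketch is self-contained:

```r
library(kernlab)
set.seed(123)
# Stand-in for xtrain/ytrain from the 80/20 split above
xtrain <- rbind(matrix(rnorm(120, mean = 0), 60, 2),
                matrix(rnorm(120, mean = 3), 60, 2))
ytrain <- c(rep(1, 60), rep(-1, 60))
# Train a linear SVM (C-svc = C-classification, vanilladot = linear kernel)
svp <- ksvm(xtrain, ytrain, type = "C-svc", kernel = "vanilladot",
            C = 100, scaled = c())
print(svp)
```

The fitted model can then be visualized with plot(svp, data = xtrain), which draws the decision boundary and margins over the training points.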
1.3 Predict with a SVM
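The prediction code did not survive extraction; the accuracy computation below assumes objects like ypred and ypredscore. A sketch using kernlab's predict, with stand-in data so it runs on its own:

```r
library(kernlab)
set.seed(123)
x <- rbind(matrix(rnorm(200, mean = 0), 100, 2),
           matrix(rnorm(200, mean = 3), 100, 2))
y <- c(rep(1, 100), rep(-1, 100))
tindex <- sample(200, 160)
xtest <- x[-tindex, ]; ytest <- y[-tindex]
svp <- ksvm(x[tindex, ], y[tindex], type = "C-svc", kernel = "vanilladot",
            C = 100, scaled = c())
# Predicted labels on the test set
ypred <- predict(svp, xtest)
# Real-valued decision scores (sign gives the predicted label)
ypredscore <- predict(svp, xtest, type = "decision")
```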
# Compute accuracy
sum(ypred == ytest) / length(ytest)
# Check that the predicted labels are the signs of the scores
table(ypredscore > 0, ypred)
1.4 Cross-validation
Instead of fixing a training set and a test set, we can improve the quality of these estimates by running k-fold cross-validation. We split the training set into k groups of approximately the same size, then iteratively train a SVM using k - 1 groups and make predictions on the group that was left aside. When k equals the number of training points, this is called leave-one-out (LOO) cross-validation. To generate a random split of n points in k folds, we can for example create the following function:
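The function itself is missing from the extracted text; one possible version (the name cv.folds is hypothetical):

```r
cv.folds <- function(n, folds = 3) {
  # Shuffle the indices 1..n, then deal them into 'folds' groups
  # of approximately equal size
  split(sample(seq_len(n)), rep(1:folds, length.out = n))
}
```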
QUESTION2 - Write a function cv.ksvm = function(x, y, folds = 3, ...) which returns a vector ypred of predicted decision scores for all points by k-fold cross-validation.
QUESTION3 - Compute the various performance measures of the SVM by 5-fold cross-validation. Alternatively, the ksvm function can automatically compute the k-fold cross-validation accuracy:
svp <- ksvm(x, y, type = "C-svc", kernel = "vanilladot", C = 100, scaled=c(), cross = 5)
print(cross(svp))
print(error(svp))
1.5 Effect of C
The C parameter balances the trade-off between having a large margin and separating the positive and negative examples on the training set. It is important to choose it well to achieve good generalization.
QUESTION5 - Plot the decision functions of SVM trained on the toy examples for different values of C in the range 2^seq(-10, 14). To look at the different plots you can use the function par(ask = T), which will ask you to press a key between successive plots. Alternatively, you can use par(mfrow = c(5, 5)) to see all the plots in the same window.
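A sketch of the kind of loop this calls for, with kernlab (the first lines regenerate stand-in toy data; here only every fourth power of 2 is tried to keep it short):

```r
library(kernlab)
set.seed(123)
x <- rbind(matrix(rnorm(200, mean = 0), 100, 2),
           matrix(rnorm(200, mean = 3), 100, 2))
y <- c(rep(1, 100), rep(-1, 100))
par(ask = TRUE)
for (C in 2^seq(-10, 14, by = 4)) {
  # Retrain the linear SVM for each value of C and plot its decision function
  svp <- ksvm(x, y, type = "C-svc", kernel = "vanilladot", C = C, scaled = c())
  plot(svp, data = x)
}
```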
QUESTION7 - Do the same on data with more overlap between the two classes, e.g., regenerate toy data with meanneg being 1.
2 Nonlinear SVM
Sometimes linear SVMs are not enough. For example, generate a toy dataset where positive and negative examples are mixtures of two Gaussians which are not linearly separable.
QUESTION8 - Make a toy example that looks like Figure 2, and test a linear SVM with
different values of C.
To solve this problem, we should instead use a nonlinear SVM. This is obtained by simply changing the
kernel parameter. For example, to use a Gaussian RBF kernel with σ = 1 and C = 1:
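The training call itself is missing from the extraction; with kernlab it would look like the sketch below (the first lines are a stand-in for the nonlinearly separable toy data):

```r
library(kernlab)
set.seed(123)
# Stand-in for the mixture-of-Gaussians toy data
x <- rbind(matrix(rnorm(200, mean = 0), 100, 2),
           matrix(rnorm(200, mean = 2), 100, 2))
y <- c(rep(1, 100), rep(-1, 100))
# Gaussian RBF kernel with sigma = 1 and C = 1
svp <- ksvm(x, y, type = "C-svc", kernel = "rbfdot",
            kpar = list(sigma = 1), C = 1)
```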
# Visualize it
plot(svp, data = x)
You should obtain something that looks like Figure 3. Much better than the linear SVM, no? The nonlinear SVM now has two parameters: σ and C. Both play a role in the generalization capacity of the SVM.
QUESTION9 - Visualize and compute the 5-fold cross-validation error for different values of
C and σ. Observe their influence.
A useful heuristic to choose σ is implemented in kernlab. It is based on the quantiles of the distances between the training points.
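This heuristic is exposed as kernlab's sigest function, and ksvm applies it when kpar = "automatic" is used with the RBF kernel; a sketch on stand-in data:

```r
library(kernlab)
set.seed(123)
x <- rbind(matrix(rnorm(200, mean = 0), 100, 2),
           matrix(rnorm(200, mean = 2), 100, 2))
y <- c(rep(1, 100), rep(-1, 100))
# sigest() estimates a good range of sigma from quantiles of
# pairwise distances between training points
print(sigest(x))
# ksvm uses this heuristic when kpar = "automatic"
svp <- ksvm(x, y, type = "C-svc", kernel = "rbfdot",
            kpar = "automatic", C = 1)
```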
# Visualize it
plot(svp, data = x)
QUESTION11 - Test the polynomial, hyperbolic tangent, Laplacian, Bessel and ANOVA kernels on the toy examples.
3 Application: cancer diagnosis from gene expression data
library(ALL)
data(ALL)
# Inspect them
?ALL
show(ALL)
print(summary(pData(ALL)))
Here we focus on predicting the type of the disease (B-cell or T-cell). We get the expression data and disease
type as follows
x <- t(exprs(ALL))
y <- substr(ALL$BT,1,1)
QUESTION12 - Test the ability of a SVM to predict the class of the disease from gene expression. Check the influence of the parameters.
Finally, we may want to predict the type and stage of the diseases. We are then confronted with a multi-class
classification problem, since the variable to predict can take more than two values:
y <- ALL$BT
print(y)
## [1] B2 B2 B4 B1 B2 B1 B1 B1 B2 B2 B3 B3 B3 B2 B3 B B2 B3 B2 B3 B2 B2 B2
## [24] B1 B1 B2 B1 B2 B1 B2 B B B2 B2 B2 B1 B2 B2 B2 B2 B2 B4 B4 B2 B2 B2
## [47] B4 B2 B1 B2 B2 B3 B4 B3 B3 B3 B4 B3 B3 B1 B1 B1 B1 B3 B3 B3 B3 B3 B3
## [70] B3 B3 B1 B3 B1 B4 B2 B2 B1 B3 B4 B4 B2 B2 B3 B4 B4 B4 B1 B2 B2 B2 B1
## [93] B2 B B T T3 T2 T2 T3 T2 T T4 T2 T3 T3 T T2 T3 T2 T2 T2 T1 T4 T
## [116] T2 T3 T2 T2 T2 T2 T3 T3 T3 T2 T3 T2 T
## Levels: B B1 B2 B3 B4 T T1 T2 T3 T4
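kernlab's ksvm handles multi-class problems natively (for C-svc it trains pairwise one-against-one classifiers by default), so the same call works when y is a factor with more than two levels. A sketch on small stand-in data (in the practical, x and y come from the ALL dataset above):

```r
library(kernlab)
set.seed(123)
# Stand-in for the expression matrix and multi-class labels
x <- matrix(rnorm(60 * 5), 60, 5)
y <- factor(rep(c("B1", "B2", "T2"), each = 20))
svp <- ksvm(x, y, type = "C-svc", kernel = "vanilladot", C = 1, cross = 5)
print(cross(svp))   # 5-fold cross-validation error
```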
QUESTION13 - Test the ability of a SVM to predict the class and the stage of the disease
from gene expression.