
Probabilistic Feature Selection and Classification Vector Machine

BINGBING JIANG, University of Science and Technology of China


CHANG LI and MAARTEN DE RIJKE, University of Amsterdam
XIN YAO, Southern University of Science and Technology
HUANHUAN CHEN, University of Science and Technology of China

Sparse Bayesian learning is a state-of-the-art supervised learning algorithm that can choose a subset of relevant samples from the input data and make reliable probabilistic predictions. However, in the presence of high-dimensional data with irrelevant features, traditional sparse Bayesian classifiers suffer from performance degradation and low efficiency because they cannot eliminate irrelevant features. To tackle this problem, we propose a novel sparse Bayesian embedded feature selection algorithm that adopts truncated Gaussian distributions as both sample and feature priors. The proposed algorithm, called the probabilistic feature selection and classification vector machine (PFCVMLP), is able to simultaneously select relevant features and samples for classification tasks. In order to derive the analytical solutions, Laplace approximation is applied to compute approximate posteriors and marginal likelihoods. Finally, parameters and hyperparameters are optimized by the type-II maximum likelihood method. Experiments on three datasets validate the performance of PFCVMLP along two dimensions: classification performance and effectiveness for feature selection. Finally, we analyze the generalization performance and derive a generalization error bound for PFCVMLP. By tightening the bound, the importance of feature selection is demonstrated.
CCS Concepts: • Computing methodologies → Feature selection; Supervised learning by classification;
Additional Key Words and Phrases: Feature selection, probabilistic classification model, sparse Bayesian
learning, supervised learning, EEG emotion recognition

This work was supported in part by the National Natural Science Foundation of China (Grant nos. 91846111 and 91746209),
the Science and Technology Innovation Committee Foundation of Shenzhen (Grant no. ZDSYS201703031748284), Ahold
Delhaize, Amsterdam Data Science, the Bloomberg Research Grant program, the China Scholarship Council, the Criteo
Faculty Research Award program, Elsevier, the European Community’s Seventh Framework Programme (FP7/2007-2013)
under grant agreement nr 312827 (VOX-Pol), the Google Faculty Research Awards program, the Microsoft Research Ph.D.
program, the Netherlands Institute for Sound and Vision, the Netherlands Organisation for Scientific Research (NWO)
under project nrs CI-14-25, 652.002.001, 612.001.551, 652.001.003, and Yandex. All content represents the opinion of the
authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.
Authors’ addresses: B. Jiang and H. Chen (corresponding author), School of Computer Science and Technology, University
of Science and Technology of China, Hefei, Anhui 230027, China; emails: [email protected], [email protected];
C. Li and M. de Rijke, Informatics Institute, University of Amsterdam, Amsterdam, the Netherlands; email: {c.li, derijke}@
uva.nl; X. Yao, Department of Computer Science and Engineering, Shenzhen Key Laboratory of Computational Intelligence,
Southern University of Science and Technology, Shenzhen, Guangdong 518055, China; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2019 Association for Computing Machinery.
1556-4681/2019/04-ART21 $15.00
https://doi.org/10.1145/3309541


ACM Reference format:


Bingbing Jiang, Chang Li, Maarten de Rijke, Xin Yao, and Huanhuan Chen. 2019. Probabilistic Feature Se-
lection and Classification Vector Machine. ACM Trans. Knowl. Discov. Data 13, 2, Article 21 (April 2019), 27
pages.
https://doi.org/10.1145/3309541

1 INTRODUCTION
In supervised learning, we are given input feature vectors x = {x_i ∈ R^M}_{i=1}^N and corresponding labels y = {y_i}_{i=1}^N.¹ The goal is to predict the label of a new datum x̂ based on the training dataset S = {x, y} together with other prior knowledge. For regression, we are given continuous labels y ∈ R, while for classification we are given discrete labels. In this article, we focus on the binary classification case, in which y ∈ {−1, +1}.
Recently, learning sparseness from large-scale datasets has generated significant research inter-
est [9, 11, 19, 29, 34]. Among the methods proposed, the support vector machine (SVM) [11], which
is based on the kernel trick [46] to create a non-linear decision boundary with a small number of
support vectors, is the state-of-the-art algorithm. The prediction function of SVM is a combination
of basis functions:²

$$f(\hat{x}; \mathbf{w}) = \sum_{i=1}^{N} \phi(\hat{x}, x_i)\, w_i + b, \qquad (1)$$

where φ(·, ·) is the basis function, w = {w_i}_{i=1}^N are the sample weights, and b is the bias.

Similar to SVM, many sparse Bayesian classifiers also use Equation (1) as their decision func-
tion; examples include the relevance vector machine (RVM) [45] and the probabilistic classification
vector machine (PCVM) [8]. Unlike SVM, whose weights are determined by maximizing the decision margin and which is limited to hard binary decisions, sparse Bayesian algorithms optimize the
parameters within a maximum likelihood framework and make predictions based on the aver-
age of the prediction function over the posterior of parameters. For example, PCVM computes
the maximum a posteriori (MAP) estimation using the expectation–maximization (EM) algorithm;
RVM and efficient probabilistic classification vector machine (EPCVM) [9] compute the type-II
maximum likelihood [3] to estimate the distribution of the parameters. However, these algorithms
have to deal with different scales of features due to the failure to eliminate irrelevant features.
In addition to sparse Bayesian learning, parameter-free Bayesian methods that are based on the
class-conditional distributions have been proposed to solve the classification task [18, 21, 27, 28].
Lanckriet et al. [28] proposed the minimax probability machine (MPM) to estimate the bound of
classification accuracy by minimizing the worst error rate. To efficiently exploit structural infor-
mation of data, Gu et al. [18] proposed a structural MPM (SMPM) that can produce the non-linear
decision hyperplane by using the kernel trick. To exploit structural information, SMPM adopts a
clustering algorithm to detect the clusters of each class and then calculates the mean and covari-
ance matrix for each cluster. However, selecting a proper number of clusters per class is difficult
for the clustering algorithm, and calculating the mean and covariance matrix for each cluster has
a high computational complexity for high-dimensional data. Therefore, SMPM cannot fit different
scales of features and might suffer from instability and low efficiency, especially for high-dimensional data.

¹ In this article, the subscript of a sample x, i.e., x_i, denotes the ith sample, and the superscript of a sample x, i.e., x^k, denotes the kth dimension.
² In the rest of this article, we prefer to use the term basis function instead of kernel function because, except for SVM, the basis functions used in this article are free of Mercer's condition.


In order to fit different scales of features, basis functions are always controlled by basis parameters (or kernel parameters). For example, in LIBSVM [7] with the Gaussian radial basis function (RBF) φ(x, z) = exp(−ϑ‖x − z‖²), the default ϑ is set relatively small for high-dimensional datasets and large for low-dimensional datasets. Although the use of basis parameters may help to address the curse of dimensionality [2], the performance might be degraded when there are lots of irrelevant and/or redundant features [26, 29, 37]. Parameterized basis functions are designed to deal with this problem. There are two popular basis functions that can incorporate feature parameters easily:

Gaussian RBF:

$$\phi_\theta(\mathbf{x}, \mathbf{z}) = \exp\left(-\sum_{k=1}^{M} \theta_k \big(x^k - z^k\big)^2\right), \qquad (2)$$

Pth-order polynomial:

$$\phi_\theta(\mathbf{x}, \mathbf{z}) = \left(1 + \sum_{k=1}^{M} \theta_k\, x^k z^k\right)^P, \qquad (3)$$

where the subscript k denotes the corresponding feature index, and θ ∈ R^M are feature parameters (also called feature weights). Once a feature weight θ_k → 0,³ the corresponding feature will not contribute to the classification.
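To make the two parameterized basis functions concrete, here is a small NumPy sketch; the function names rbf_basis and poly_basis are our own, and the vectorized form is one possible implementation of Equations (2) and (3), not the authors' reference code.

import numpy as np

def rbf_basis(X, Z, theta):
    # Feature-weighted Gaussian RBF of Equation (2):
    # phi_theta(x, z) = exp(-sum_k theta_k * (x^k - z^k)^2).
    # X: (N, M), Z: (L, M), theta: (M,) non-negative feature weights.
    diff = X[:, np.newaxis, :] - Z[np.newaxis, :, :]           # (N, L, M) pairwise differences
    return np.exp(-np.einsum('nlm,m->nl', diff ** 2, theta))   # weighted squared distances

def poly_basis(X, Z, theta, P=2):
    # Feature-weighted Pth-order polynomial of Equation (3):
    # phi_theta(x, z) = (1 + sum_k theta_k * x^k * z^k)^P.
    return (1.0 + (X * theta) @ Z.T) ** P

# A feature with theta_k -> 0 contributes nothing to either basis function,
# so it effectively drops out of the classification.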
Feature selection, as a dimensionality reduction technique, has been extensively studied in ma-
chine learning and data mining, and various feature selection algorithms have been proposed [5,
26, 29, 35, 37–39, 41, 48, 49, 51, 53–58]. Feature selection methods can be divided into three groups:
filter methods [20, 38, 39, 41], wrapper methods [51], and embedded methods [5, 26, 29, 34, 35, 37].
Filter methods independently select the subset of features from the classifier learning. Wrapper
methods consider all possible feature subsets and then select a specific subset based on its pre-
dictive power. Therefore, the feature selection stage and classification model are separated and
independent in the filter and wrapper methods, and the wrapper methods might suffer from high
computational complexity especially for high-dimensional data [34]. Embedded methods embed
feature selection in the training process, which aims to combine the advantages of the filter and
wrapper methods. Regarding filter methods, Peng et al. [41] proposed a minimum redundancy and
maximum relevance (mRMR) method, which selects relevant features and simultaneously removes
redundant features according to the mutual information. To avoid evaluating the score for each
feature individually like Fisher score [14], a filter method, trace ratio criterion (TRC) [39] was de-
signed to find the globally optimal feature subset by maximizing the subset level score. Recently,
sparsity regularization in feature space has been widely applied to feature selection tasks. In [5],
Bradley and Mangasarian proposed an embedded method, L1 SVM, that uses the L1 norm to yield
a sparse solution. However, the number of features selected by L1 SVM is upper bounded by the
number of training samples, which limits its application on high-dimensional data. Nie et al. [38]
employed joint L21 norm minimization on both loss function and regularization to propose a fil-
ter method, FSNM. Based on the basis functions mentioned above, Nguyen and De la Torre [37]
designed an embedded feature selection model, weight SVM (WSVM), that can jointly perform fea-
ture selection and classifier construction for non-linear SVMs. However, filter methods are not able
to adaptively select relevant features, i.e., they require a predefined number of selected features.
For Bayesian feature selection approaches, a joint classifier and feature optimization algorithm
(JCFO) is proposed in [26]; the authors adopt a sparse Bayesian model to simultaneously perform
classifier learning and feature selection. To select relevant features, JCFO introduces hierarchical
sparseness promoting priors on feature weights and then employs EM and gradient-based methods

³ In practice, many feature weights θ_k → 0. When a certain θ_k is smaller than a threshold, we set it to 0.


to optimize the feature weights. In order to simultaneously select relevant samples and features, Mohsenzadeh et al. [35] extend the standard RVM to design the relevance sample feature machine (RSFM) and an incremental learning version (IRSFM) [34], which scale the basis parameters in RVM to a vector and apply zero-mean Gaussian priors on feature weights to generate sparsity in the feature space. Li and Chen [29] propose an EM-algorithm-based joint feature selec-
tion strategy for probabilistic feature selection and classification vector machine (PCVM; denoted
as PFCVMEM ), in which they add truncated Gaussian priors to features to enable PCVM to jointly
select relevant samples and features. However, JCFO, PFCVMEM , and RSFM use an EM algorithm
to calculate a MAP point estimate of the sample and feature parameters. As pointed out by Chen
et al. [9], the EM algorithm has the following limitations: first, it is sensitive to the starting points and cannot guarantee convergence to a global maximum or minimum; second, it yields a MAP point estimate, which corresponds to the Bayes estimator only under the 0–1 loss function and cannot exploit all the advantages of the Bayesian framework.
JCFO, RSFM, and IRSFM adopt a zero-mean Gaussian prior distribution over sample weights,
and RSFM and IRSFM also use this prior distribution over feature weights. As a result of adopting
a zero-mean Gaussian prior over samples, some training samples that belong to the positive class
(yi = +1) will receive negative weights and vice versa; this may result in instability and degen-
eration in solutions [8]. Also, for RSFM and IRSFM, zero-mean Gaussian feature priors will lead
to negative feature weights, which reduces the value of kernel functions for two samples when
the similarity in the corresponding features is increased [26]. Finally, RSFM and IRSFM have to
construct an N × M kernel matrix for each sample, which yields a space complexity of at least
O (N 2 M ) to store the designed kernel matrices.
We propose a sparse Bayesian embedded feature selection method, i.e., a Laplace approxima-
tion based feature selection PCVM method (PFCVMLP ) that uses the type-II maximum likelihood
method to approximate a fully Bayesian estimation. In contrast to the filter methods such as
mRMR [41], FSNM [38], and TRC [39], and the embedded methods such as JCFO [26], L1 SVM [5],
and WSVM [37], the proposed PFCVMLP method can adaptively select informative and relevant
samples and features with probabilistic predictions. Moreover, PFCVMLP adopts truncated Gaussian priors as both sample and feature priors, which yields a more stable solution and avoids negative values for sample and feature weights. We summarize the main contributions as follows:

— Unlike traditional sparse Bayesian classifiers, like PCVM and RVM, the proposed algorithm
simultaneously selects the relevant features and samples, which leads to a robust classifier
for high-dimensional datasets.
— Compared with PFCVMEM [29], JCFO [26], and RSFM [35], PFCVMLP adopts the type-II
maximum likelihood [45] approach to optimize the parameters and hyperparameters, which
achieves a more stable solution and might avoid the limitations caused by the EM algorithm.
— PFCVMLP is extensively evaluated and compared with the state-of-the-art feature selec-
tion methods on different real-world datasets. The results validate the performance of PFCVMLP.
— We derive a generalization bound for PFCVMLP . By analyzing the bound, we demonstrate
the significance of feature selection and introduce a way of choosing the initial values.

The rest of the article is structured as follows. Background knowledge of sparse Bayesian learn-
ing is introduced in Section 2. Section 3 details the implementation of simultaneously optimizing
sample and feature weights of PFCVMLP . In Section 4, experiments are designed to evaluate both
the accuracy of classification and the effectiveness of feature selection. Analyses of sparsity and
generalization for PFCVMLP are presented in Section 5. We conclude in Section 6.


2 SPARSE BAYESIAN LEARNING FRAMEWORK


In the sparse Bayesian learning framework, we usually use the Laplace distribution and/or the
student’s-t distribution as the sparseness-promoting prior. In binary classification problems, we
choose a Bernoulli distribution as the likelihood function. Together with the proper marginal like-
lihood, we can compute the parameters’ distribution (posterior distribution) either by MAP point
estimation or by a complete Bayesian estimation approximated by type-II maximum likelihood.
Below, we detail the implementation of this framework.

2.1 Model Specification


We concentrate on a linear combination of basis functions. To simplify our notation, the decision
function is defined as
$$f(\mathbf{x}; \mathbf{w}, \boldsymbol{\theta}) = \Phi_\theta(\mathbf{x})\,\mathbf{w}, \qquad (4)$$

where w denotes the (N + 1)-dimensional vector of sample weights, w_0 denotes the bias, Φ_θ(x) is an N × (N + 1) basis function matrix whose first column is φ_{θ,0}(x) = [1, . . . , 1]^T and whose remaining entries are φ_{θ,ij} = φ_θ(x_i, x_j) × y_j,⁴ and θ ∈ R^M is the vector of feature weights.

As probabilistic outputs are continuous values in [0, 1], we need a link function to obtain a smooth transformation from [−∞, +∞] to [0, 1]. Here, we use the sigmoid function σ(z) = 1/(1 + e^{−z}) to map Equation (4) to [0, 1]. Then, we combine this mapping with a Bernoulli distribution to compute the following likelihood function:

$$p(\mathbf{t} \mid \mathbf{w}, \boldsymbol{\theta}, S) = \prod_{i=1}^{N} \sigma_i^{t_i} (1 - \sigma_i)^{(1 - t_i)},$$

where t_i = (y_i + 1)/2 denotes the probabilistic target of the ith sample and σ_i denotes the sigmoid mapping for the ith sample: σ_i = σ(f(x_i; w, θ)). The vector t = (t_1, . . . , t_N)^T consists of the probabilistic targets of all training samples and S = {x, y} is the training set.
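As an illustration of this model specification, the following sketch builds the N × (N + 1) matrix Φ_θ(x) of Equation (4) and evaluates the Bernoulli log-likelihood. The helper names design_matrix and bernoulli_log_likelihood are ours, and rbf_basis is repeated from the earlier sketch for self-containment.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbf_basis(X, Z, theta):
    # Feature-weighted Gaussian RBF of Equation (2).
    diff = X[:, np.newaxis, :] - Z[np.newaxis, :, :]
    return np.exp(-np.einsum('nlm,m->nl', diff ** 2, theta))

def design_matrix(X, y, theta):
    # N x (N+1) matrix Phi_theta(x) of Equation (4): a first column of ones for the
    # bias and entries phi_theta(x_i, x_j) * y_j elsewhere (see footnote 4).
    Phi = rbf_basis(X, X, theta) * y[np.newaxis, :]
    return np.hstack([np.ones((X.shape[0], 1)), Phi])

def bernoulli_log_likelihood(Phi, w, t):
    # log p(t | w, theta, S), with t_i = (y_i + 1)/2 and sigma_i = sigmoid(f(x_i; w, theta)).
    s = sigmoid(Phi @ w)
    return np.sum(t * np.log(s) + (1.0 - t) * np.log(1.0 - s))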

2.2 Priors Over Samples and Features


According to Chen et al. [8], a truncated Gaussian prior may result in the proper sparseness to
sample weights. Following this idea, we introduce a non-negative left-truncated Gaussian prior
N_t(w_i | 0, α_i^{-1}) to each sample weight w_i:

$$p(w_i \mid \alpha_i) = \begin{cases} 2N\!\left(w_i \mid 0, \alpha_i^{-1}\right) & \text{if } w_i \ge 0\\ 0 & \text{otherwise} \end{cases} \;=\; 2N\!\left(w_i \mid 0, \alpha_i^{-1}\right) \cdot \mathbb{1}_{w_i \ge 0}(w_i), \qquad (5)$$

where α_i (the precision) is a hyperparameter equal to the inverse of the variance, and 1_{x ≥ 0}(x) is an indicator function that returns 1 for x ≥ 0 and 0 otherwise. For the bias w_0, we introduce a zero-mean Gaussian prior N(w_0 | 0, α_0^{-1}):

$$p(w_0 \mid \alpha_0) = N\!\left(w_0 \mid 0, \alpha_0^{-1}\right). \qquad (6)$$

Assuming that the sample weights are independent and identically distributed (i.i.d.), we can compute the priors over sample weights as follows:

$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=0}^{N} p(w_i \mid \alpha_i) = N\!\left(w_0 \mid 0, \alpha_0^{-1}\right) \prod_{i=1}^{N} N_t\!\left(w_i \mid 0, \alpha_i^{-1}\right), \qquad (7)$$

where α = (α_0, . . . , α_N)^T and N_t(w_i | 0, α_i^{-1}) denotes the left-truncated Gaussian distribution.

⁴ We assume that each sample weight has the same sign as the corresponding label. So by multiplying the basis vector with the corresponding label, we can assume that all sample weights are non-negative.


Feature weights indicate the importance of features. For important features, the corresponding
weights are set to relatively large values and vice versa. For irrelevant and/or redundant features,
the weights are set to 0. Following [26], we should not allow negative values for feature weights.
Based on these discussions, we introduce left truncated Gaussian priors for feature weights. Under
the i.i.d. assumption, the prior over features is computed as follows:

$$p(\boldsymbol{\theta} \mid \boldsymbol{\beta}) = \prod_{k=1}^{M} p(\theta_k \mid \beta_k) = \prod_{k=1}^{M} N_t\!\left(\theta_k \mid 0, \beta_k^{-1}\right),$$

where β = (β_1, . . . , β_M)^T are hyperparameters of the feature weights. Each prior is formalized as follows:

$$p(\theta_k \mid \beta_k) = \begin{cases} 2N\!\left(\theta_k \mid 0, \beta_k^{-1}\right) & \text{if } \theta_k \ge 0,\\ 0 & \text{otherwise,} \end{cases} \;=\; 2N\!\left(\theta_k \mid 0, \beta_k^{-1}\right) \cdot \mathbb{1}_{\theta_k \ge 0}(\theta_k). \qquad (8)$$
For both kinds of priors, we introduce Gamma distributions for α i and βk as hyperpriors. The
truncated Gaussian priors will work together with the flat Gamma hyperpriors and result in trun-
cated hierarchical Student’s-t priors over weights. These hierarchical priors, which are similar to
Laplace priors, work as L1 regularization and lead to sparse solutions [8, 26].
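For concreteness, the log-density of the left-truncated Gaussian priors of Equations (5) and (8) can be evaluated as in the sketch below (the helper name is ours; note that in Section 3.1 the hard indicator is replaced by a smoothed sigmoid for differentiability, so this exact form serves only as a reference).

import numpy as np

def log_truncated_gaussian_prior(v, precision):
    # log of p(v_i | precision_i) = 2 * N(v_i | 0, precision_i^-1) for v_i >= 0, 0 otherwise;
    # used for sample weights (Equation (5)) and feature weights (Equation (8)).
    if np.any(v < 0):
        return -np.inf                       # outside the non-negative support
    return np.sum(np.log(2.0) + 0.5 * np.log(precision)
                  - 0.5 * np.log(2.0 * np.pi) - 0.5 * precision * v ** 2)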

2.3 Computing Posteriors


The posterior in a Bayesian framework contains the distribution of all parameters. Computing
parameters boils down to updating posteriors. Having priors and likelihood, posteriors can be
computed with the following formula:
$$p(\mathbf{w}, \boldsymbol{\theta} \mid \mathbf{t}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{p(\mathbf{t} \mid \mathbf{w}, \boldsymbol{\theta}, S)\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, p(\boldsymbol{\theta} \mid \boldsymbol{\beta})}{p(\mathbf{t} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, S)}. \qquad (9)$$
Some methods, such as PCVM, PFCVMEM, and JCFO, use the EM algorithm to obtain a MAP point estimate of the parameters. Although the EM algorithm can provide an efficient estimate, it overlooks the information in the marginal likelihood and is therefore not regarded as a complete Bayesian estimation. Other methods, such as RVM and EPCVM, retain the marginal likelihood. They compute the type-II maximum likelihood and obtain a complete Bayesian solution.
The predicted distribution for the new datum x̂ is computed as follows:

$$p(\hat{y} \mid \hat{x}, \mathbf{t}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \iint p(\hat{y} \mid \hat{x}, \mathbf{w}, \boldsymbol{\theta})\, p(\mathbf{w}, \boldsymbol{\theta} \mid \mathbf{t}, \boldsymbol{\alpha}, \boldsymbol{\beta})\, d\mathbf{w}\, d\boldsymbol{\theta}.$$

If both terms in the integral are Gaussian distributions, it is easy to compute this integral analyti-
cally. We will detail the implementation of PFCVMLP in the next section.

3 PROBABILISTIC FEATURE SELECTION CLASSIFICATION VECTOR MACHINE


Details of computing sample weights and sample hyperparameters were reported by Chen et al.
[9]. In this section, we mainly focus on computing parameters and hyperparameters for features.

3.1 Approximations for Posterior Distributions


Since the indicator function in Equation (8) is not differentiable, an approximate function is re-
quired to smoothly approximate the indicator function. Here, we use a parameterized sigmoid approximation: Figure 1 shows the approximation of an indicator function made by a sigmoid function σ(λx). As depicted in Figure 1, the larger λ is, the more accurate the approximation made by the sigmoid function. In PFCVMLP, we choose σ(5x) as the approximation function.

Fig. 1. Illustration of the indicator function and the sigmoid function.
We calculate Equation (9) by the Laplace approximation, in which the Gaussian distributions5
N (uθ , Σθ ) and N (uw , Σw ) are used to approximate the unknown posteriors of feature and sample
weights, respectively. We start with the logarithm of Equation (9) by the following formula:
$$
\begin{aligned}
Q(\mathbf{w}, \boldsymbol{\theta}) &= \log\{p(\mathbf{t} \mid \mathbf{w}, \boldsymbol{\theta}, S)\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, p(\boldsymbol{\theta} \mid \boldsymbol{\beta})\} - \log p(\mathbf{t} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, S) \\
&= \sum_{n=1}^{N} \big[t_n \log \sigma_n + (1 - t_n)\log(1 - \sigma_n)\big] - \frac{1}{2}\mathbf{w}^T A \mathbf{w} - \frac{1}{2}\boldsymbol{\theta}^T B \boldsymbol{\theta} \\
&\quad + \sum_{i=1}^{N} \log \mathbb{1}_{w_i \ge 0}(w_i) + \sum_{k=1}^{M} \log \mathbb{1}_{\theta_k \ge 0}(\theta_k) + \text{const},
\end{aligned}
$$

where A = diag(α_0, . . . , α_N), B = diag(β_1, . . . , β_M), and const is independent of w and θ.

Using the sigmoid approximation, we substitute 1_{x ≥ 0}(x) by σ(λx) with λ = 5. We can compute the derivative of the feature posterior function as follows:

$$\frac{\partial Q(\mathbf{w}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = -B\boldsymbol{\theta} + D^T(\mathbf{t} - \boldsymbol{\sigma}) + \mathbf{k}_\theta,$$

where k_θ = [λ(1 − σ(λθ_1)), . . . , λ(1 − σ(λθ_M))]^T is an M-dimensional vector, σ = [σ_1, . . . , σ_N]^T, and D = ∂(Φ_θ w)/∂θ.
For Gaussian RBF,

$$D_{i,k} = -\sum_{j=1}^{N} w_j\, \phi_{\theta,ij}\, \big(x_i^k - x_j^k\big)^2.$$

For Pth-order polynomial,

$$D_{i,k} = P\, x_i^k \sum_{j=1}^{N} w_j\, \phi_{\theta,ij}^{(P-1)/P}\, x_j^k.$$

5 Because of the truncated prior assumption, we should take the positive quadrant part of the two Gaussian distributions,
which only have an extra normalization term. Fortunately, the normalization term is independent of w and θ . So in the
derivation, we still use the Gaussian distributions.


The mean u_θ of the feature posterior distribution is calculated by setting ∂Q(w, θ)/∂θ = 0:

$$\mathbf{u}_\theta = B^{-1}\big(D^T(\mathbf{t} - \boldsymbol{\sigma}) + \mathbf{k}_\theta\big). \qquad (10)$$

Then, we compute the second-order derivative of Q(w, θ), the Hessian matrix:

$$\frac{\partial^2 Q(\mathbf{w}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}^2} = -O_\theta - B - D^T C D + E,$$

where O_θ = diag(λ²σ(λθ_1)(1 − σ(λθ_1)), . . . , λ²σ(λθ_M)(1 − σ(λθ_M))) is an M × M diagonal matrix, and C is an N × N diagonal matrix, C = diag((1 − σ_1)σ_1, . . . , (1 − σ_N)σ_N). E denotes ∂D^T/∂θ · (t − σ) and is computed as follows:
For Gaussian RBF,

$$E_{i,k} = \sum_{p=1}^{N} \Bigg[(t_p - \sigma_p) \sum_{j=1}^{N} \phi_{\theta,pj}\, w_j\, \big(x_p^i - x_j^i\big)^2 \times \big(x_p^k - x_j^k\big)^2\Bigg].$$

For Pth-order polynomial,

$$E_{i,k} = \sum_{p=1}^{N} \Bigg[(t_p - \sigma_p)\, x_p^i x_p^k \sum_{j=1}^{N} \phi_{\theta,pj}^{(P-2)/P}\, w_j\, x_j^i x_j^k\Bigg] \times P(P-1).$$

The covariance of this approximate posterior distribution equals the negative inverse of the Hessian matrix:

$$\Sigma_\theta = \big(D^T C D + B + O_\theta - E\big)^{-1}. \qquad (11)$$

Practically, we use Cholesky decomposition to compute the robust inversion.


In the same way, we can obtain uw and Σw by computing the derivative of Q (w, θ ) with respect
to w:
 
$$\mathbf{u}_w = A^{-1}\big(\Phi_\theta^T(\mathbf{t} - \boldsymbol{\sigma}) + \mathbf{k}_w\big), \qquad (12)$$

$$\Sigma_w = \big(\Phi_\theta^T C\, \Phi_\theta + A + O_w\big)^{-1}, \qquad (13)$$

where k_w = [0, λ(1 − σ(λw_1)), . . . , λ(1 − σ(λw_N))]^T is an (N + 1)-dimensional vector, and O_w = diag(0, λ²σ(λw_1)(1 − σ(λw_1)), . . . , λ²σ(λw_N)(1 − σ(λw_N))) is an (N + 1) × (N + 1) diagonal matrix.

After the derivation, the indicator functions degenerate into vectors and matrices: k_θ in Equation (10) and O_θ in Equation (11) for the feature posterior, and k_w in Equation (12) and O_w in Equation (13) for the sample posterior. These terms preserve the non-negativity of the sample and feature weights, which is consistent with the prior assumption.
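The following NumPy/SciPy sketch evaluates Equations (12) and (13) at a given weight vector (the function name and single-evaluation structure are ours; in the full algorithm these quantities are re-estimated iteratively, together with the feature posterior of Equations (10) and (11)). The Cholesky factorization mirrors the robust inversion mentioned for Equation (11).

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sample_posterior(Phi, t, alpha, w, lam=5.0):
    # Laplace-approximated posterior of the sample weights at the current w:
    # u_w from Equation (12) and Sigma_w from Equation (13).
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    s = sigmoid(Phi @ w)                                     # sigma_i
    C = np.diag(s * (1.0 - s))                               # N x N diagonal matrix C

    k_w = lam * (1.0 - sigmoid(lam * w))                     # smoothed-indicator gradient
    k_w[0] = 0.0                                             # the bias w_0 is not truncated
    O_w = np.diag(lam ** 2 * sigmoid(lam * w) * (1.0 - sigmoid(lam * w)))
    O_w[0, 0] = 0.0

    u_w = (Phi.T @ (t - s) + k_w) / alpha                    # Equation (12), A = diag(alpha)
    H = Phi.T @ C @ Phi + np.diag(alpha) + O_w               # negative Hessian
    Sigma_w = cho_solve(cho_factor(H), np.eye(H.shape[0]))   # Equation (13) via Cholesky
    return u_w, Sigma_w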
With the approximated posterior distributions, N (uθ , Σθ ) and N (uw , Σw ), optimizing PFCVMLP
boils down to maximizing the posterior mode of the hyperparameters, which means maximizing
p(α , β | t) ∝ p(t | α , β, S)p(α )p(β ) with respect to α and β. As we use flat Gamma distributions
over α and β, the maximization depends on the marginal likelihood p(t | α , β, S) [22, 45]. In the
next section, the optimal marginal likelihood is obtained through the type-II maximum likelihood
method.


3.2 Maximum Marginal Likelihood


In Bayesian models, the marginal likelihood function is computed as follows:

$$p(\mathbf{t} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, S) = \iint p(\mathbf{t} \mid \mathbf{w}, \boldsymbol{\theta}, S)\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, p(\boldsymbol{\theta} \mid \boldsymbol{\beta})\, d\mathbf{w}\, d\boldsymbol{\theta}. \qquad (14)$$

However, when the likelihood function is a Bernoulli distribution and the priors are approximated
by Gaussian distributions, the maximization of Equation (14) cannot be derived in closed form.
Thus, we introduce an iterative estimation solution. The details of the hyperparameter optimiza-
tion and the derivation of maximizing marginal likelihood are specified in Appendix. Here, we use
the methodology of Bayesian Occam’s razor [32]. The update formula of the feature hyperparam-
eters is rearranged and simplified as
$$\beta_k^{\text{new}} = \frac{\gamma_k}{u_{\theta,k}^2}, \qquad (15)$$
where uθ,k is the kth mean of feature weights in Equation (10), and we denote γk ≡ 1 − βk Σkk ,
where Σkk is the kth diagonal covariance element in Equation (11), and βk Σkk works as Occam’s
factor, which can automatically find a balanced solution between complexity and accuracy of
PFCVMLP . The details of updating the sample hyperparameters α i are the same as for βk , and
we omit them.
In the training step, we will eliminate a feature when the corresponding βk is larger than a
specified threshold. In this case, the feature weight θ k is dominated by the prior distribution and
restricted to a small neighborhood around 0. Hence, this feature contributes little to the classifi-
cation performance. At the start of the iterative process, all samples and features are included in
the model. As iterations proceed, N and M are quickly reduced, which accelerates the speed of the
iterations. Further analysis of the complexity will be reported in Section 4.4. In the next subsection,
we demonstrate how to make predictions on new data.
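In code, the update of Equation (15) together with the pruning rule above takes only a few lines (our own sketch; the threshold value is illustrative, and the same update form applies to the sample hyperparameters α).

import numpy as np

def update_beta(u_theta, Sigma_theta, beta, prune_threshold=1e5):
    # gamma_k = 1 - beta_k * Sigma_kk acts as the Occam factor.
    gamma = 1.0 - beta * np.diag(Sigma_theta)
    beta_new = gamma / (u_theta ** 2)          # Equation (15)
    keep = beta_new < prune_threshold          # features whose beta_k explodes are pruned
    return beta_new, keep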

3.3 Making Predictions


When predicting the label of a new sample x̂, instead of making a hard binary decision, we prefer to
estimate the uncertainty in the decision, the posterior probability of the prediction p(ŷ = 1 | x̂, S).
Incorporating the Bernoulli likelihood, the Bayesian model enables the sigmoid function σ ( f (x̂ ))
to be regarded as a consistent estimate of p(ŷ = 1 | x̂, S) [45]. We can compute the probability of
prediction in the following way:

$$p(\hat{y} = 1 \mid \hat{x}, S) = \int p(\hat{y} = 1 \mid \mathbf{w}, \hat{x}, S)\, q(\mathbf{w})\, d\mathbf{w},$$

where p(ŷ = 1 | w, x̂, S) = σ(w^T φ_θ(x̂)) and q(w) denotes the posterior of the sample weights. Employing the posterior approximation in Section 3.1, we have q(w) ≈ N(w | u_w, Σ_w). According to [4], we have

$$p(\hat{y} = 1 \mid \hat{x}, S) = \int \sigma\!\left(\mathbf{w}^T \boldsymbol{\phi}_\theta(\hat{x})\right) N(\mathbf{w} \mid \mathbf{u}_w, \Sigma_w)\, d\mathbf{w} \approx \sigma\!\left(\kappa\!\left(\sigma_{\hat{x}}^2\right)\, \mathbf{u}_w^T \boldsymbol{\phi}_\theta(\hat{x})\right),$$

where $\kappa(\sigma_{\hat{x}}^2) = \left(1 + \frac{\pi}{8}\,\sigma_{\hat{x}}^2\right)^{-1/2}$ and $\sigma_{\hat{x}}^2 = \boldsymbol{\phi}_\theta^T(\hat{x})\, \Sigma_w\, \boldsymbol{\phi}_\theta(\hat{x})$ is the variance of x̂ under the sample posterior covariance Σ_w.
To arrive at a binary classification, we choose u_w^T φ_θ(x̂) = 0 as the decision boundary, where the probability is p(ŷ = 1 | x̂, S) = 0.5. Thus, computing the sign of u_w^T φ_θ(x̂) recovers the hard 0–1 classification. Moreover, the predictive probability provides the confidence of the prediction, which is particularly important in imbalanced classification tasks.
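The resulting prediction rule can be sketched as follows (helper name ours): the linear score u_w^T φ_θ(x̂) is moderated by κ(σ_x̂²) before the sigmoid, and its sign gives the hard label.

import numpy as np

def predict_proba(phi_new, u_w, Sigma_w):
    # phi_new: basis vector phi_theta(x_hat) built from the retained samples (first entry 1).
    mean = phi_new @ u_w                            # u_w^T phi_theta(x_hat)
    var = phi_new @ Sigma_w @ phi_new               # sigma_xhat^2
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)  # moderation factor kappa(sigma^2)
    prob = 1.0 / (1.0 + np.exp(-kappa * mean))      # p(y_hat = 1 | x_hat, S)
    label = 1 if mean >= 0 else -1                  # decision boundary u_w^T phi = 0
    return prob, label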


3.4 Implementation
We detail the implementation of PFCVMLP step by step and provide pseudocode in Algorithm 1.

ALGORITHM 1: PFCVMLP algorithm


1: Input: Training data set: S; initial values: INITVALUES; threshold: THRESHOLD;
the maximum number of iterations: maxIts.
2: Output: Weights of model: WEIGHT; Hyperparameters: HYPERPARAMETER.
3: Initialization: [w, θ, α , β]= INITVALUES; Index = generateIndex(α , β)
4: while i < maxIts do
5: Φ = updateBasisFunction(x, θ, Index)
6: [w, θ ] = updatePosterior(Φ, w, θ, α , β, Y)
7: [α , β] = maximumMarginal(Φ, w, θ, α , β, Y)
8: if α i or β k > THRESHOLD.maximum then
9: delete the ith sample or the kth feature
10: end if
11: Index = updateIndex(α , β)
12: marginal = calculateMarginal(Φ, w, θ, α , β, Y)
13: if Δmarginal < THRESHOLD.minimal then
14: break
15: end if
16: WEIGHT = [w, θ, Index]
17: HYPERPARAMETER = [α , β]
18: end while

Algorithm 1 consists of the following main steps:

(1) First, the values of w, θ, α , β are initialized by INITVALUES and a parameter Index gen-
erated to indicate the useful samples and features (line 3).
(2) At the beginning of each iteration, compute the matrix Φ according to Equation (3)
(line 5).
(3) Based on Equation (9), use the new hyperparameters to re-estimate the posterior (line 6).
(4) Use the re-estimated parameters to maximize the logarithm of marginal likelihood and
update the hyperparameters according to Equation (14) (line 7).
(5) Prune irrelevant samples and useless features if the corresponding hyperparameters are
larger than a specified threshold (lines 8–10).
(6) Update the Index vector (line 11).
(7) Calculate the logarithm of the marginal likelihood (line 12).
(8) Convergence detection, if the change of marginal likelihood is relatively small, halt the
iteration (lines 13–15).
(9) Generate the output values. The vector WEIGHT consists of sample and feature weights
and the vector Index indicates the relevant samples and features (lines 16 and 17).

We have now presented all details of PFCVMLP , including derivations of equations and pseudocode.
Next, we evaluate the performance of PFCVMLP by comparing with other state-of-the-art algo-
rithms on a Waveform (UCI) dataset, EEG emotion recognition datasets, and high-dimensional
gene expression datasets.


4 EXPERIMENTAL RESULTS
In a series of experiments, we assess the performance of PFCVMLP . The first experiment aims to
evaluate the robustness and stability of PFCVMLP against noise features. Second, a set of experi-
ments are carried out on the emotional EEG datasets to assess the performance of classification and
feature selection. Then, experiments are designed on gene expression datasets, which contain lots
of irrelevant features. Finally, the computational and space complexity of PFCVMLP is analyzed.

4.1 Waveform Dataset: Stability and Robustness Against Noise


The Waveform dataset [36] contains a number of noise features and has been used to estimate
the robustness of feature selection algorithms. This dataset contains 5,000 samples with 3 classes
of waves (about 33% for each wave). Each sample has 40 continuous features, in which the first
21 features are relevant for classification, whereas the latter 19 features are irrelevant noise with
mean 0 and variance 1. The presence of 19 noise features in the Waveform dataset increases the
difficulty of the classification problem. Ideal feature selection algorithms should select the relevant
features (features 1–21) and simultaneously remove the irrelevant noise features (features 22–40).
To evaluate the stability and robustness of feature selection of PFCVMLP with noise features, we
choose wave 1 vs. wave 2 from the Waveform as the experimental data, which includes 3,345
samples. In the experiment, we randomly sample data examples to generate 100 distinct training
and testing sets, in which each training set includes 200 training samples for each class. Then, we
run PFCVMLP and three embedded feature selection algorithms on each data partition.
First, to compare the stability of PFCVMLP against that of other algorithms, two indicators are
employed to measure the stability, i.e., the popular Jaccard index stability [23] and the recently
proposed Pearson’s correlation coefficient stability [40]. The stability in the output feature subsets
is a key evaluation metric for feature selection algorithms, which quantifies the sensitivity of a feature selection procedure to different training sets. Assume F denotes the set of selected feature subsets and s_i, s_j ∈ F are two selected feature subsets. The Jaccard index between s_i and s_j is defined as

$$\psi_{\text{Jaccard}}(s_i, s_j) = \frac{|s_i \cap s_j|}{|s_i \cup s_j|} = \frac{r_{ij}}{r_i + r_j - r_{ij}}, \qquad (16)$$
where r i j denotes the number of common features in si and s j , and r i is the size of selected features
in si . Based on the Jaccard index in Equation (16), the Jaccard stability of F is computed as follows:

$$\Psi_{\text{Jaccard}}(\mathcal{F}) = \frac{2}{R(R-1)} \sum_{i=1}^{R-1} \sum_{j>i}^{R} \psi_{\text{Jaccard}}(s_i, s_j), \qquad (17)$$

in which R denotes the number of selected feature subsets in F. Ψ_Jaccard(F) ∈ [0, 1], where 0 means there is no overlap between any two feature subsets and 1 means that all feature subsets in F are identical.
Following [40], the Pearson’s coefficient between si and s j can be redefined as follows:
$$\psi_{\text{Pearson}}(s_i, s_j) = \frac{M \cdot r_{ij} - r_i \cdot r_j}{\sqrt{r_i \cdot r_j\, (M - r_i) \cdot (M - r_j)}}, \qquad (18)$$
where M is the number of sample features. Using Equation (18), the Pearson’s correlation coeffi-
cient stability value of F is computed as follows:

$$\Psi_{\text{Pearson}}(\mathcal{F}) = \frac{2}{R(R-1)} \sum_{i=1}^{R-1} \sum_{j>i}^{R} \psi_{\text{Pearson}}(s_i, s_j). \qquad (19)$$


Table 1. The Jaccard and Pearson Stability Performances of PFCVMLP and


Other Embedded Feature Selection Algorithms on Waveform Dataset

Algorithms PFCVMLP PFCVMEM WSVM JCFO


Jaccard 0.556±0.071 0.525±0.082 0.543±0.076 0.518±0.072
Pearson 0.662±0.016 0.610±0.026 0.646±0.019 0.603±0.017

Fig. 2. Classification accuracy of PFCVMLP and compared algorithms.

ΨPearson (F ) ∈ [−1, 1], in which −1 means that any two feature subsets are complementary, 0 means
there is no correlation between any two feature subsets, and 1 means that all feature subsets in F
are fully correlated.
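Both stability measures are averages of pairwise scores over the R selected subsets. The sketch below computes Equations (17) and (19); the helper names are our own, subsets are represented as Python sets of feature indices, and the example subsets are hypothetical.

import numpy as np
from itertools import combinations

def jaccard_stability(subsets):
    # Equation (17): the mean over all pairs equals the 2/(R(R-1)) normalization.
    return np.mean([len(a & b) / len(a | b) for a, b in combinations(subsets, 2)])

def pearson_stability(subsets, n_features):
    # Equations (18)-(19): mean pairwise Pearson coefficient of the subset indicator vectors.
    M = n_features
    scores = []
    for a, b in combinations(subsets, 2):
        r_i, r_j, r_ij = len(a), len(b), len(a & b)
        scores.append((M * r_ij - r_i * r_j)
                      / np.sqrt(r_i * r_j * (M - r_i) * (M - r_j)))
    return np.mean(scores)

# Hypothetical feature subsets selected on three different training splits:
F = [{1, 2, 5, 9}, {1, 2, 9, 12}, {1, 5, 9, 12}]
print(jaccard_stability(F), pearson_stability(F, n_features=40))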
In order to provide comprehensive results, three embedded feature selection algorithms, WSVM, JCFO, and PFCVMEM, and three supervised learning algorithms using all features, SVM, PCVM, and SMPM [18], are chosen for comparison. The experimental settings are the same as those in [8]. The experiments are repeated 100 times with different training and test sets, so 100 feature subsets are obtained. The stability of each algorithm, measured by the Jaccard index and Pearson's correlation coefficient, is listed in Table 1, and the classification accuracy is depicted in Figure 2.
According to the stability definition in Equations (17) and (19), a high value of stability means
that the selected feature subsets do not significantly change with different training sets. From
Table 1 and Figure 2, we observe that PFCVMLP achieves the best stability performance in terms
of both Jaccard and Pearson index, and highly competitive accuracy in comparison with other
algorithms. The stability of WSVM is better than that of PFCVMEM and JCFO, which is attributed
to the use of the LIBSVM [7] and CVX [17] optimization toolbox. However, PFCVMEM and JCFO
show inferior stability scores, the reason being that they use the EM algorithm to obtain a point estimate of the feature parameters, which is sensitive to initialization and may converge to a local optimum [9].
Finally, in Figure 2, we also note that due to the lack of feature selection, SVM and SMPM perform
poorly.
In order to demonstrate the robustness of PFCVMLP against the irrelevant noise features, the se-
lected frequency of each feature, P̂ f , is shown in Figure 3. From Figure 3, we observe that PFCVMLP
shows comparable effectiveness to WSVM, JCFO, and PFCVMEM on the first 21 actual features,
and the P̂ f of features 5, 9, 10, 11, 12, 15, 16, 17, 18 are greater than 0.5. As shown in Figure 2,
using these features, SVM achieves 92.12% accuracy, an improvement over the result obtained by
using all features. To quantitatively evaluate the capability of eliminating noise features for these
embedded feature selection algorithms, the frequency of selecting the latter 19 noise features is
used. From Figure 3, we note that the frequencies of selecting the noise features are all less than
0.2 in PFCVMLP. However, there are 3, 1, and 2 noise features with selection frequencies above 0.2 for WSVM, JCFO, and PFCVMEM, respectively. This result demonstrates that, in the presence of noise features, PFCVMLP performs much better than the other algorithms at eliminating those noise features.

Fig. 3. The selected frequency of each feature. The first 21 features are actual features and the latter 19 features are noise.

Fig. 4. The layout of 62-channel symmetrical electrodes on the EEG.

4.2 Emotional EEG Datasets: Emotion Recognition and Effectiveness for Feature Selection
In this section, a newly developed emotion EEG dataset, SEED [59], will be used to evaluate the
performance of PFCVMLP . The SEED dataset contains the EEG signals of 15 subjects, which were
recorded while the subjects were watching 15 emotional film clips in the emotion experiment. The
subjects’ emotional reactions to the film clips are used as the emotional labels (−1 for negative, 0 for
neutral, and +1 for positive) of the corresponding film clips. The EEG signals were recorded by 62-
channel symmetrical electrodes which are shown in Figure 4. In our experiments, the differential
entropy (DE) features are chosen for emotion recognition due to their better discriminative power [13]. The DE features are extracted from five common frequency bands, namely Delta (1–3 Hz), Theta (4–7 Hz), Alpha (8–13 Hz), Beta (14–30 Hz), and Gamma (31–50 Hz). Therefore, each frequency band covers the 62 symmetrical electrode channels, and there are 310 features in total for one sample. In order
to investigate neural signatures and stable patterns across sessions and individuals, each subject
performed the emotion experiment in three separate sessions with an interval of about one week
or longer.


In this article, we choose positive vs. negative samples from SEED as our experimental data,
which includes the signals of five positive and five negative film clips. In this experiment, the EEG
signals recorded from one subject are regarded as one dataset, and thus there are 15 datasets for 15 subjects [60]. Each dataset contains data from three sessions, and each session contains 2,290 samples (1,120 negative samples and 1,170 positive samples) with 310 features. For each dataset, we choose 1,376 samples (recorded from three positive and three negative film clips) as the training set and the remaining samples in the same session as the test set. We compare the emotion recognition and feature
selection effectiveness of PFCVMLP with other algorithms on the testing data.
To evaluate the emotion recognition performance, we take four supervised learning algorithms
(i.e., SVM, SMPM, RVM, and PCVM) using all features as baselines and also compare PFCVMLP
with seven state-of-the-art feature selection algorithms: mRMR [41], TRC [39], FSNM [38], L1 SVM
[5], WSVM [37], JCFO [26], and PFCVMEM [29]. Among the algorithms considered, mRMR, TRC, and FSNM are filter feature selection algorithms, while L1 SVM, WSVM, JCFO, and PFCVMEM are embedded feature selection algorithms. For mRMR, TRC, and FSNM, the PCVM classifier is used to evaluate emotion recognition performance on the features they select. In these experiments, the Gaussian RBF is used as the basis function. Two popular evaluation criteria, i.e., error rate⁶ and the area under the receiver operating characteristic curve (AUC), are adopted for evaluation; they represent a threshold criterion and a rank criterion, respectively [6].
We follow the procedure in [8] to choose the parameters. More precisely, the dataset for the
15th subject is chosen for cross-validation, in which we train each algorithm with all parameter
candidates and then choose the parameters with the lowest median error rate on this dataset. We
follow this procedure to choose the optimal numbers of clusters for SMPM, the kernel parameter
ϑ for RVM, SMPM, PCVM, and WSVM, and the regularization parameters for JCFO and L1 SVM.
For SVM, the regularization C and the kernel parameter ϑ are tuned by grid search, in which we
train SVM with all combinations of each candidate C and ϑ , then choose the combination with
the lowest median error rate. For mRMR, TRC, and FSNM, the proper sizes of feature subsets and
the kernel parameter ϑ for PCVM are chosen by a similar grid search. We also choose the proper
starting points for PFCVMEM and the initial hyperparameters for PFCVMLP .
For each subject, all algorithms are separately run on three session datasets, and the average re-
sults as well as the standard deviations of each algorithm are reported in Table 2. Furthermore, we
conduct a statistical significance test by running the pair-wise t-test between PFCVMLP and other
algorithms, with 95% confidence level. Specifically, once PFCVMLP achieves a significantly bet-
ter/worse performance than others on a dataset, a win/loss is counted and we mark it with •/◦ in
Table 2. Ties are also counted but are not marked. The number of wins/ties/losses of PFCVMLP com-
pared to other algorithms are shown in the last row of Table 2. In terms of the error rate, PFCVMLP significantly outperforms the other algorithms on most of the datasets, and is outperformed by L1 SVM, WSVM, and JCFO on only 1, 1, and 2 datasets, respectively. Specifically, on
the subject #3 dataset, PFCVMLP achieves 0% classification error, while the best competitor has a
3.24% error rate. In terms of the AUC, PFCVMLP is only inferior to JCFO on the subject #8 dataset.
These results indicate that PFCVMLP outperforms the state-of-the-art feature selection algorithms
on the EEG datasets.
In order to give a comprehensive performance comparison between PFCVMLP and other al-
gorithms with statistical significance, the Friedman test [12] combining with the post-hoc tests is
used to make comparisons of multiple methods over multiple datasets. The performance of two al-
gorithms is significantly different if their average ranks on all datasets differ by at least the critical

⁶ Error rate = 1 − classification accuracy = $\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big(y_i \neq f(x_i; \mathbf{w}, \boldsymbol{\theta})\big)$.

Table 2. The Error Rate and AUC of PFCVMLP and Other Algorithms on the Emotional EEG Datasets

The error rate (in %) of PFCVMLP and other algorithms


Subject  SVM  SMPM  RVM  PCVM  mRMR  TRC  FSNM  L1SVM  WSVM  PFCVMEM  JCFO  PFCVMLP
#1 9.84±7.33• 8.35±7.46• 6.02±3.84• 8.72±3.02• 4.70±7.13 4.52±5.48 4.74±4.55 8.53±9.39• 5.80±5.04 6.85±4.57• 12.18±8.27• 4.63±4.02
#2 13.60±5.92 23.30±10.55• 12.78±8.72 13.42±5.18• 20.05±3.67• 22.43±13.15• 19.66±5.05• 18.09±9.26• 13.38±9.20 17.72±10.70• 19.11±8.33• 11.85±4.74
#3 5.03±6.13• 7.33±9.22• 8.28±10.34• 9.12±10.60• 9.70±6.80• 6.35±5.99• 3.14±5.43 8.77±8.93• 8.17±7.15• 4.96±4.41• 4.60±4.96• 0.00±0.00
#4 19.74±6.16• 20.45±6.19• 18.07±2.22• 19.77±1.86• 17.02±5.69 17.68±5.26• 16.65±6.23 17.28±6.43• 16.45±7.94 17.43±7.09• 19.31±9.65• 16.12±5.15
#5 18.23±7.57• 20.93±12.21• 19.82±9.20• 17.98±10.60• 19.93±7.72• 15.61±8.39 20.81±10.40• 18.05±10.08• 13.57±9.86 19.32±10.74• 20.81±11.11• 14.81±6.39
#6 12.31±6.04• 14.30±7.34• 9.09±9.36• 9.22±6.18• 7.95±6.65 10.44±9.45• 10.83±9.79• 16.45±8.25• 15.75±10.62• 8.01±4.95 11.77±4.99• 6.16±5.24
#7 10.69±9.15 10.20±10.78 11.34±9.83• 7.91±7.07 10.81±9.46• 9.70±8.72 14.26±11.99• 20.64±8.54• 13.13±11.51• 7.48±6.60 15.97±10.45• 8.04±7.10
#8 8.90±1.77• 5.93±1.26• 8.93±7.15• 7.35±4.74• 8.75±0.68• 4.47±2.52• 4.52±4.13• 3.00±2.65 5.29±5.15• 5.85±3.54• 0.00±0.00◦ 2.15±3.72
#9 14.85±9.42• 11.95±4.47 13.72±11.72• 12.61±11.72 13.15±8.54• 14.70±8.17• 10.07±10.74 8.28±8.61◦ 7.70±6.69◦ 16.30±12.68• 11.27±9.18 11.52±8.08
#10 14.33±8.11• 7.04±8.48• 12.50±10.32• 11.59±9.09• 8.49±8.41• 10.89±8.07• 9.23±8.27• 9.34±10.14• 16.74±10.59• 6.55±7.27• 5.27±1.04• 3.06±2.98
#11 12.21±5.63• 13.72±8.24• 12.65±10.16• 8.45±8.44• 7.09±7.47• 13.13±10.59• 13.57±9.96• 1.46±2.53 8.72±9.55• 1.93±1.97 2.04±2.66 1.98±2.29
#12 14.81±8.66• 12.74±7.35• 10.95±9.51• 12.69±3.34• 12.06±8.03• 10.69±2.94• 11.71±4.89• 6.08±3.77 9.32±5.81• 12.15±9.63• 5.58±4.20◦ 6.99±5.37
#13 10.83±4.86• 4.89±6.02 10.37±6.21• 7.17±6.65• 3.28±3.80 2.74±2.79 2.12±2.25 2.42±2.94 2.44±3.23 8.53±10.21• 4.08±2.43• 2.48±3.47
#14 15.46±1.12• 14.64±1.14• 15.40±2.89• 18.78±2.91• 17.79±3.78• 15.43±0.77• 19.69±2.58• 13.82±5.43 19.77±3.72• 12.73±2.40 13.62±8.04 13.27±3.51
Win/tie/loss 12/2/0 11/3/0 13/1/0 12/2/0 11/3/0 10/4/0 9/5/0 9/5/1 8/5/1 9/5/0 8/4/2 /
The AUC (in %) of PFCVMLP and other algorithms
Subject  SVM  SMPM  RVM  PCVM  mRMR  TRC  FSNM  L1SVM  WSVM  PFCVMEM  JCFO  PFCVMLP
#1 98.37±1.62 97.47±1.98• 98.56±2.24 97.54±2.61• 97.64±2.09• 99.70±0.46 99.18±0.34 96.42±5.79• 98.04±3.39 96.29±3.22• 94.88±5.18• 99.54±0.50
#2 95.54±1.42• 86.79±6.82• 98.14±1.41 99.07±1.19 93.22±4.12• 92.46±8.08• 92.47±7.04• 94.98±5.24• 97.09±3.31• 94.01±5.88• 92.97±7.93• 99.32±0.59
#3 98.35±0.95• 97.75±1.32• 96.92±1.34• 99.84±0.28 94.15±4.12• 99.90±0.18 99.95±0.00 95.80±3.81• 94.43±2.72• 99.76±0.42 99.67±0.57 100.00±0.00
#4 90.27±6.77• 93.54±5.32 83.01±6.98• 82.18±4.50• 83.47±2.48• 93.55±3.58• 95.20±5.74 93.64±6.39• 91.53±7.88• 91.85±6.77• 94.88±4.87 95.05±4.67
#5 89.36±8.48• 82.42±10.21• 89.01±9.03• 93.62±9.25 92.87±9.39 87.02±9.85• 84.60±10.37• 84.41±10.88• 93.91±6.25 88.57±10.28• 84.76±10.38• 93.74±6.90

#6 95.33±1.85• 92.23±0.99• 94.15±2.97• 98.98±1.77 94.35±4.06• 95.11±1.64• 95.65±4.66• 93.87±7.54• 93.14±5.93• 97.39±2.55• 95.71±1.92• 99.72±0.49
#7 96.68±1.17• 95.23±3.37• 97.27±2.83• 99.23±0.86 99.12±0.92 99.17±1.29 95.72±4.95• 83.06±11.22• 96.19±3.68• 99.57±0.42 91.96±6.08• 99.17±0.76
#8 96.67±1.16• 97.05±0.75• 93.77±3.51• 95.94±4.74• 97.93±2.71 98.70±1.40 98.40±1.05 99.60±0.69 96.32±1.53• 97.33±3.89• 100.00±0.00◦ 98.72±1.96
#9 94.61±5.28• 93.10±5.26• 92.00±3.74• 94.90±4.26• 94.51±3.38• 95.30±4.67 95.20±5.31 95.46±7.87 97.06±3.08 94.40±5.47• 96.68±2.92 96.54±4.34
#10 93.70±4.42• 95.00±2.05• 95.04±4.94• 94.75±4.66• 95.59±4.49• 94.07±9.79• 97.91±2.14 96.67±5.76• 93.56±8.82• 94.05±9.21• 97.73±2.82 98.08±1.67
#11 92.48±7.22• 89.82±8.05• 88.68±10.40• 88.08±10.50• 91.23±8.35• 91.91±7.65• 92.61±8.86• 100.00±0.00 97.73±3.98• 99.28±1.15 98.74±0.65• 99.88±0.24
#12 87.45±9.90• 90.92±7.51• 96.21±3.23• 94.33±5.34• 95.69±7.30• 93.60±5.55• 92.38±8.08• 99.42±0.92 96.76±0.42• 95.36±4.26• 99.87±0.23 99.12±0.51
#13 98.88±1.98• 99.78±0.50 93.18±5.82• 97.84±3.73• 99.53±0.81 99.65±0.60 100.00±0.00 99.80±0.35 99.85±0.26 96.31±3.01• 99.33±1.16 99.91±0.16
#14 95.55±1.51 95.29±2.21 95.05±3.04 93.44±1.51• 94.15±3.97• 95.62±1.00 93.69±2.51• 95.94±3.71 91.28±1.23• 95.45±2.48 95.19±3.64 96.15±3.45
Win/tie/loss 12/2/0 11/3/0 11/3/0 9/5/0 10/4/0 7/7/0 7/7/0 8/6/0 10/4/0 10/4/0 6/7/1 /
The best result on each dataset is bolded. The symbols •/◦ indicate that PFCVMLP is significantly superior/inferior to other algorithms according to the pair-wise t-test with
95% confidence level.

Fig. 5. Results of the Friedman test for the performance of PFCVMLP and other algorithms on the EEG
datasets. The dots denote the average ranks, the bars indicate the critical differences CD, and the algorithms
having non-overlapped bars are significantly inferior to PFCVMLP .

difference (CD):

$$\text{CD} = q_\alpha \sqrt{\frac{p(p+1)}{6N}}, \qquad (20)$$
where p is the number of algorithms, N is the number of datasets, α is the significance level, and
q α denotes the critical value. According to the experimental results in Table 2, the better embed-
ded feature selection algorithms, JCFO and PFCVMEM , the better filter feature selection algorithms,
TRC and FSNM, and the important baseline, PCVM, are chosen to compare with PFCVMLP . Choos-
ing α = 0.05 and q α = 2.576 (p = 6), the critical difference becomes CD = 1.82.
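Equation (20) and the quoted value can be checked directly (the helper name below is ours):

import math

def critical_difference(q_alpha, p, n_datasets):
    # Equation (20): CD = q_alpha * sqrt(p * (p + 1) / (6 * N)).
    return q_alpha * math.sqrt(p * (p + 1) / (6.0 * n_datasets))

# q_alpha = 2.576, p = 6 compared algorithms, N = 14 EEG datasets -> CD = 1.82
print(round(critical_difference(2.576, 6, 14), 2))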
Figure 5 shows the Friedman test results on the EEG datasets. We observe that the differences
between PFCVMLP and PCVM are significant (greater than CD), demonstrating the effectiveness of
simultaneously learning feature weights in terms of improving comprehensive performance. We
note that the differences between PFCVMLP and PFCVMEM are greater than CD, which indicates
that the Bayesian estimation approximated by the type-II maximum likelihood approach works
better than the MAP point estimation based on the EM algorithm. Moreover, PFCVMLP achieves
a significant difference compared with other feature selection algorithms (i.e., TRC, FSNM, and
JCFO), which demonstrates the effectiveness and superiority of PFCVMLP for selecting relevant
features.
To quantitatively assess the reliability of our classification results, the kappa statistic [47] is
adopted to evaluate the consistency between the prediction of algorithms and the truth. The kappa
statistic can be used to measure the performance of classifiers and is more robust than classification accuracy. In this experiment, the kappa statistic of all classifiers ranges from 0 to 1, where
0 indicates the chance agreement between the prediction and the truth, and 1 represents a perfect
agreement between them. Therefore, a larger kappa statistic value means that the correspond-
ing classifier performs better. Table 3 reports the kappa statistic for PFCVMLP and other algo-
rithms on the emotional EEG datasets. The standard error interval with two-sided 95% confidence
level of PFCVMLP is also reported in Table 3, which constitutes the confidence interval of kappa
statistic. The kappa statistics of other algorithms lying in the confidence interval of PFCVMLP are
marked with *, which indicates that they are not significantly worse or better than PFCVMLP . From
Table 3, we observe that PFCVMLP achieves the best kappa statistics on 5 out of 14 datasets. Although other algorithms achieve the best kappa statistics on the remaining datasets, they are not significantly better than PFCVMLP, since their kappa statistics lie within the confidence interval of PFCVMLP except on the subject #8 dataset. Overall, PFCVMLP is the most frequent winner in terms of the kappa statistic.
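The text does not restate the kappa formula; the sketch below assumes the standard Cohen's kappa definition (observed agreement corrected by the chance agreement of the marginal label frequencies), which is consistent with the 0-to-1 interpretation given above, although the exact variant used in [47] may differ.

import numpy as np

def cohens_kappa(y_true, y_pred):
    # kappa = (p_observed - p_expected) / (1 - p_expected); 0 = chance agreement, 1 = perfect.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    p_observed = np.mean(y_true == y_pred)
    p_expected = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in labels)
    return (p_observed - p_expected) / (1.0 - p_expected)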


Table 3. The Kappa Statistic (in %) of PFCVMLP and Other Competing Algorithms
on the Emotional EEG Datasets

Subject  SMPM  PCVM  mRMR  TRC  FSNM  L1SVM  WSVM  PFCVMEM  JCFO  PFCVMLP
#1 83.38 82.51 90.20* 90.93* 90.45* 82.92 89.75* 89.28* 74.82 90.35 ± 2.84
#2 48.75 72.80 47.18 54.08 59.98 63.15 74.47 71.97 68.25 78.55 ± 3.79
#3 85.06 81.34 81.26 87.09 93.67 85.96 87.08 92.56 93.28 100.00 ± 0.00
#4 62.49 62.90 46.96 45.07 78.79* 73.65 74.47 74.37 55.80 78.88 ± 3.12
#5 52.91 63.62 68.51 68.40 51.84 63.42 79.76* 74.76 66.73 78.78 ± 3.30
#6 66.55 84.58 90.63* 74.70 78.36 66.71 78.43 84.09 66.47 89.60 ± 2.63
#7 60.20 84.25* 78.65 80.75 71.86 58.94 80.79* 83.97* 80.25 83.99 ± 3.51
#8 90.10 82.13 78.41 92.85 92.77 94.25 92.02 86.39 100.00 96.90 ± 1.36
#9 75.18 75.22 74.23 70.34 79.82 86.43* 87.60* 75.97 83.91* 85.01 ± 2.69
#10 85.98 76.23 82.56 76.16 81.53 81.27 74.97 82.63 87.04* 90.32 ± 3.53
#11 92.53 82.76 67.81 73.89 72.90 95.09* 86.95 93.86* 93.29* 94.11 ± 2.69
#12 90.40* 74.47 77.59 78.57 76.41 91.78 86.42 80.16 92.50* 91.97 ± 1.77
#13 90.14 73.72 96.05* 94.51 95.75* 95.75* 96.39* 93.06 94.50 96.73 ± 1.33
#14 78.78* 60.88 63.01 69.32 60.38 72.52* 70.38 77.12* 74.40* 75.02 ± 4.14
The best result for each dataset is illustrated in boldface.

Fig. 6. Profiles of top 20 features selected by PFCVMLP on the Beta and Gamma frequency bands.

According to Zheng and Lu [59], there are several irrelevant EEG channels in the SEED dataset,
which introduce noise into emotion recognition and degrade the performance of classifiers. To illustrate the ability of PFCVMLP to select discriminative features and remove irrelevant features, Figure 6 shows the positions of the top 20 features selected by PFCVMLP. As depicted
in the figure, the top 20 features are all from the Beta and Gamma frequency bands and located at
the lateral temporal area, which is consistent with previous findings [59, 60]. This result indicates
that PFCVMLP can effectively select the relevant channels containing discriminative information
and simultaneously eliminate irrelevant channels for emotion recognition task.

4.3 High-Dimensional Gene Expression Data: Performance in the Presence of Many Irrelevant Features
Gene expression datasets contain lots of irrelevant features, which may degrade the performance
of classifiers [1, 16, 25]. In this experiment, three gene expression datasets: colon cancer [1], Duke
cancer [50], and ALLAML [16] are chosen to examine whether PFCVMLP is able to eliminate the
irrelevant features and make informative predictions for high-dimensional data. The colon can-
cer dataset includes expression levels of 2,000 gene features from 62 different samples, in which

ACM Transactions on Knowledge Discovery from Data, Vol. 13, No. 2, Article 21. Publication date: April 2019.
21:18 B. Jiang et al.

Fig. 7. Accuracy curves of different algorithms with different numbers of selected features. The symbols •/∗ denote win/tie of PFCVMLP compared to the other algorithms, and the three numbers in brackets represent the win/tie/loss counts of PFCVMLP compared to the corresponding algorithms by the binomial test at the 5% significance level.

The Duke cancer dataset contains expression levels of 7,129 genes from 42 tumor samples, of which 21 are estrogen receptor-positive tumors and the rest are estrogen receptor-negative tumors. The ALLAML dataset consists of 72 samples in two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), with 47 and 25 instances, respectively. Each sample is represented by 7,129 gene expression values.
Considering the relatively small number of samples versus the large number of features, in this set of experiments we use the leave-one-out cross-validation method to compute the accuracy: we first train on n − 1 samples and then test the classification model on the remaining sample. We therefore generate 62, 42, and 72 runs for the colon cancer, Duke cancer, and ALLAML datasets, respectively. Following [43], in preprocessing each dataset is normalized in two steps: sample-wise to follow a standard normal distribution and then dimension-wise to follow a standard normal distribution.
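To make this evaluation protocol concrete, the following minimal sketch (our illustration, not the paper's code) performs the two-step normalization and the leave-one-out loop; `train_and_predict` stands for any classifier routine, e.g., a hypothetical `pfcvm_lp_fit_predict`.

```python
import numpy as np

def two_step_normalize(X):
    """Sample-wise, then dimension-wise normalization to zero mean and unit
    variance (the two-step preprocessing described above)."""
    X = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-12)
    X = (X - X.mean(axis=0, keepdims=True)) / (X.std(axis=0, keepdims=True) + 1e-12)
    return X

def leave_one_out_accuracy(X, y, train_and_predict):
    """Leave-one-out CV: train on n - 1 samples, test on the held-out sample."""
    y = np.asarray(y)
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i
        y_hat = train_and_predict(X[mask], y[mask], X[i:i + 1])
        correct += int(y_hat[0] == y[i])
    return correct / n

# e.g., acc = leave_one_out_accuracy(two_step_normalize(X), y, pfcvm_lp_fit_predict)
```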
We compare PFCVMLP with PCVM and three filter feature selection algorithms, including mRMR, TRC, and FSNM, with PCVM as the classifier. As before, we also compare PFCVMLP with other feature and classifier co-learning algorithms, i.e., L1-SVM, WSVM, JCFO, and PFCVMEM. As the dimensionality of the gene data is relatively high, we choose the inner product, i.e., the linear kernel, as the metric. Finally, we report the average accuracies over the 62, 42, and 72 runs on the colon cancer, Duke cancer, and ALLAML datasets, respectively. To assess statistical significance, we run the binomial test between PFCVMLP and the other algorithms at the 5% significance level.
Figure 7 shows the accuracy curves of the feature selection algorithms. For the filter-based feature selection algorithms, i.e., mRMR, TRC, and FSNM, we first select subsets of features of sizes in the range [10, 20, . . . , 80], and then train a PCVM on each subset to obtain the accuracy. The symbols •/∗ denote the wins/ties of PFCVMLP against the embedded methods, and the numbers in brackets represent the win/tie/loss counts of PFCVMLP against PCVM and the filter methods with different numbers of selected features. From Figure 7, we see that PFCVMLP selects the smallest subsets of features on all datasets and achieves the highest accuracies on two out of three datasets. However, we note that the gaps between PFCVMLP and the competing algorithms are small.


Fig. 8. Illustrations of selected features on gene expression datasets. The horizontal axis shows the index of
features and the vertical axis shows the cumulative number of occurrences for the corresponding feature.
The dashed line in each figure indicates the maximum cumulative number. The figures at the top show the
features selected by PFCVMLP ; those at the bottom show the features selected by PFCVMEM .

The binomial test indicates that all filter-based feature selection algorithms are outperformed by PFCVMLP on the colon cancer dataset, and that they only reach the same level of performance as PFCVMLP by including many more features on the other two datasets. Except for PFCVMEM, PFCVMLP also outperforms the other embedded feature selection algorithms, i.e., JCFO, WSVM, and L1-SVM, on some of the datasets.
Figure 7 also shows that the performance of PFCVMEM is close to that of PFCVMLP. However, in our experiments we find that the features selected by PFCVMEM are noisier. Figure 8 shows the cumulative number of occurrences7 of each feature on the gene expression datasets for PFCVMLP and PFCVMEM. The numbers of features selected by the two algorithms are similar on average, but from Figure 8 we see that PFCVMLP concentrates on smaller sets of relevant features than PFCVMEM. This is because the Bayesian solution approximated by the type-II maximum likelihood is more stable than a solution obtained by the EM algorithm. This result, together with the results shown in Figure 7, demonstrates that PFCVMLP outperforms the state of the art on the gene expression datasets in terms of accuracy.
In terms of feature selection effectiveness, PFCVMLP selects 4.94 features on average on the colon cancer dataset. Among all 2,000 genes, five are particularly important (occurring in more than half of the tests). The biological explanations of these five genes are reported in Table 4, and two of them (No. 377 and No. 1772) are the same genes selected by Li et al. [31]. On the Duke cancer dataset, PFCVMLP selects 2.14 features on average, and two of them are selected in almost every run. On the ALLAML dataset, PFCVMLP selects 2.94 features on average, and five of the seven most frequently selected genes are among the 50 genes most correlated with the diagnosis [16]. The biological significance of these genes is reported in Table 5.
To analyze the usefulness of the feature subsets selected by PFCVMLP for other algorithms, we run RVMPFCVMLP , SVMPFCVMLP , and SMPMPFCVMLP with the features selected by PFCVMLP and compare them with the original RVM, SVM, and SMPM. The results are reported in Table 6.

7 We use the phrase “cumulative number of occurrences” to refer to the number of times a feature is selected in the experiments; e.g., if a feature is never selected in the experiments, its “cumulative number” is 0.


Table 4. Biological Significance of the Most Frequently Occurring Genes in the Colon Cancer Dataset

Feature ID # GenBank ID Description [1]


1772 61 H08393 Collagen alpha 2(XI) chain
1668 60 M82919 mRNA for GABAA receptor
1210 58 R55310 Mitochondrial processing peptidase
377 51 R39681 Eukaryotic initiation factor
1679 37 X53586 mRNA for integrin alpha 6
# denotes the number of occurrences for a feature in all runs.

Table 5. Biological Significance of the Most Frequently Occurring Genes in the ALLAML Cancer Dataset

Feature ID # GenBank ID Relation Description [16]


4847∗ 70 X95735 AML Zyxin
4951 68 Y07604 N/A Nucleoside diphosphate
6169 11 M13690 AML Hereditary angioedema
3847∗ 6 U82759 AML HoxA9 mRNA
2354∗ 6 M92287 AML CCND3, Cyclin D3
4973∗ 5 Y08612 ALL Protein RABAPTIN-5
1834∗ 5 M23197 AML CD33 antigen
# denotes the number of occurrences for a feature in all runs. The superscript ∗ denotes
that these genes are among the top 50 most important genes for diagnosing AML/ALL
[16].

Table 6. Accuracy on the Gene Expression Datasets

Accuracy (in %) RVM SVM SMPM RVMPFCVMLP SVMPFCVMLP SMPMPFCVMLP PFCVMLP


Colon cancer 85.48• 83.87• 75.81• 87.10• 85.48• 86.26• 96.77
Duke cancer 80.95• 85.71• 80.95• 92.86∗ 97.62∗ 90.48• 95.24
ALLAML 93.06• 87.50• 76.39• 95.83∗ 95.83∗ 94.44∗ 98.61
Win/tie/loss 3/0/0 3/0/0 3/0/0 1/2/0 1/2/0 2/1/0 /
The symbols •/∗ represent the win/tie of PFCVMLP against others algorithms according to the binomial test at the 5%
significance level.

According to Table 6, RVMPFCVMLP , SVMPFCVMLP , and SMPMPFCVMLP , which use the features selected by PFCVMLP , achieve better performance than the original methods. Somewhat to our surprise, SVMPFCVMLP even outperforms PFCVMLP and obtains the best prediction on the Duke cancer dataset. This improvement demonstrates that the feature subsets selected by PFCVMLP also work well for other algorithms.

4.4 Complexity Analysis


While computing the posterior covariance Σθ in Equation (11), we have to derive the negative inverse of the Hessian matrix. This derivation does not guarantee a numerically accurate result, because this Hessian matrix can be ill-conditioned. In practice, we drop the term E in Equation (11), so that the calculation becomes
$$\Sigma_\theta = \left(D^{T} C D + B + O_\theta\right)^{-1}. \qquad (21)$$


Fig. 9. Illustration of the rapid decrease in the number of features and samples and in the CPU time per iteration. In these experiments, we use the subject 1 dataset as an example.

In Equation (21), B and Oθ are positive definite diagonal matrices and D^T C D is a quadratic form, so the Hessian is theoretically a positive definite matrix. Nevertheless, because of finite machine precision, ill-conditioning may still occur occasionally, especially when βk is very large.
In the case of a large βk , especially when βk → ∞, the corresponding feature weight θk is restricted to a small neighborhood around 0, so during the iterations we filter this feature out of the model. Initially, all features are contained in the model. The main computational cost is the Cholesky decomposition used to compute the posterior covariances Σθ and Σw , which is O(N^3 + M^3). Thus, the computational complexity of PFCVMLP is the same as that of PFCVMEM , RSFM [35], and JCFO [26].
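The two steps above (forming the covariance in Equation (21) via a Cholesky factorization, and pruning features whose βk has grown very large) can be sketched as follows. This is a simplified NumPy illustration of ours, assuming dense arrays D, C, B, and O_theta are available; the jitter value and the pruning threshold are illustrative choices, not values from the paper.

```python
import numpy as np

def feature_posterior_cov(D, C, B, O_theta, jitter=1e-8):
    """Sigma_theta = (D^T C D + B + O_theta)^{-1}, Equation (21), computed via a
    Cholesky factorization; the small jitter guards against occasional
    ill-conditioning caused by finite machine precision."""
    H = D.T @ C @ D + B + O_theta
    H = H + jitter * np.eye(H.shape[0])
    L = np.linalg.cholesky(H)          # the O(M^3) step that dominates the cost
    eye = np.eye(H.shape[0])
    return np.linalg.solve(L.T, np.linalg.solve(L, eye))

def kept_feature_indices(beta, threshold=1e6):
    """A very large beta_k pins theta_k near zero, so that feature is filtered
    out of the model; only the remaining indices are kept."""
    return np.where(beta < threshold)[0]
```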
As for the storage requirements of PFCVMLP , the basis function matrix Φθ needs O(N^2) storage, and in the initial training stage the covariance matrices Σθ and Σw require O(M^2) and O(N^2) storage, respectively. Therefore, the overall space complexity of PFCVMLP is O(N^2 + M^2), which is better than that of RSFM, O(N^2 M + M^2). As the iterations proceed, N and M decrease rapidly, resulting in O(N̄^3 + M̄^3) computational complexity and O(N̄^2 + M̄^2) space complexity, where N̄ ≪ N and M̄ ≪ M. In our experiments, N and M rapidly decrease to relatively small numbers in the first few iterations and the training speed quickly accelerates.
As illustrated in Figure 9, during the first 40 iterations the number of features decreases from 310 to 40 and the number of samples from 1,376 to 127, and the CPU time per iteration drops to 2.7% of that of the first iteration.

5 GENERALIZATION AND SPARSITY


Both the emotional EEG and gene expression experiments indicate that the proposed classifier and
feature selection co-learning algorithm is capable of generating a sparse solution. In this section,
we first analyze the KL-divergence between the prior and posterior. Following this, we investi-
gate the entropic constraint Rademacher complexity [33] and derive a generalization bound for
PFCVMLP . By tightening the bound, we theoretically demonstrate the significance of the sparsity
assumption and introduce a method to choose the initial values for PFCVMLP .

5.1 KL-Divergence Between Prior and Posterior


In Bayesian learning, we use the KL-divergence to measure the information gain from the prior to the posterior. As discussed in Section 3.1, the approximated posterior over the feature parameters, denoted as q̃(θ ) = N (θ | uθ , Σθ ), is a multivariate Gaussian distribution. However, as the feature prior is a left-truncated Gaussian, the true posterior over the feature parameters should be restricted to the positive quadrant. To achieve this, we first compute the probability mass of the posterior in this half, Z0 = ∫_0^∞ q̃(θ )dθ; after that, we obtain a re-normalized version of the posterior, q(θ ) = q̃(θ )/Z0 , where θk ≥ 0.
We denote β0 = (β0,1 , β0,2 , . . . , β0,M ) as the initial prior and β = (β1 , β2 , . . . , βM ) as the optimized prior. Following [9], we adopt the independent posterior assumption.


Fig. 10. Illustration of the numerical contribution of θ and β to KLθ (q‖p). To evaluate this contribution alone, we assume each sample weight wi = 1 and each sample hyperparameter αi = 1.

We compute KLθ (q‖p)8 using the following formula (the details are specified by [10]):
$$
\mathrm{KL}_\theta(q\,\|\,p)=\int_0^{\infty} q(\theta)\ln\frac{q(\theta)}{p(\theta\mid\beta_0)}\,d\theta
=\sum_{k,\,\theta_k\neq 0}\left\{
\frac{1}{2}\left(-1+\frac{\beta_{0,k}}{\beta_k}+\ln\frac{\beta_k}{\beta_{0,k}}+\beta_{0,k}\theta_k^2\right)
+\frac{(2\pi\beta_k)^{-1/2}(\beta_{0,k}+\beta_k)\,\theta_k}{\operatorname{erfcx}\!\left(-\theta_k\sqrt{\beta_k/2}\right)}
-\ln\operatorname{erfc}\!\left(-\theta_k\sqrt{\beta_k/2}\right)\right\},
$$
where $\operatorname{erfc}(z)=\frac{2}{\sqrt{\pi}}\int_z^{\infty}e^{-t^2}\,dt$ and $\operatorname{erfcx}(z)=e^{z^2}\operatorname{erfc}(z)$.
Note that $Z_{0,k}=\int_0^{\infty}\tilde{q}(\theta_k)\,d\theta_k=\frac{1}{2}\operatorname{erfc}\!\left(-\theta_k\sqrt{\beta_k/2}\right)$, so we can calculate KLθ (q‖p) as
$$
\mathrm{KL}_\theta(q\,\|\,p)=\sum_{k,\,\theta_k\neq 0}\left\{
\frac{1}{2}\left(-1+\frac{\beta_{0,k}}{\beta_k}+\ln\frac{\beta_k}{\beta_{0,k}}+\beta_{0,k}\theta_k^2\right)
+\frac{(2\pi\beta_k)^{-1/2}(\beta_{0,k}+\beta_k)\,\theta_k}{2\exp\!\left(\beta_k\theta_k^2/2\right)}\,Z_{0,k}^{-1}
-\ln Z_{0,k}-\ln 2\right\}.
$$
The KL-divergence is dominated by two parameters: θ and β. However, the sensitivity of KLθ (q‖p) to these two parameters differs. As shown in Figure 10, with the initial hyperparameter set to β0,k = 0.5, the curve of KLθ (q‖p) changes significantly when the value of θ changes, whereas it changes little when the optimized βk changes. Moreover, the minimum of KLθ (q‖p) is near θk = 0, where the corresponding feature is pruned.
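To make the expression above concrete, the following short sketch (our illustration based on the reconstructed formula, not code from the paper) evaluates the per-feature KL contribution with SciPy's erfc and erfcx; the numerical values in the usage comment are arbitrary.

```python
import numpy as np
from scipy.special import erfc, erfcx

def kl_theta_single(theta_k, beta_k, beta_0k):
    """Per-feature contribution to KL_theta(q || p) for a retained weight
    theta_k, with optimized precision beta_k and initial precision beta_0k."""
    gauss_term = 0.5 * (-1.0 + beta_0k / beta_k
                        + np.log(beta_k / beta_0k) + beta_0k * theta_k ** 2)
    z0 = 0.5 * erfc(-theta_k * np.sqrt(beta_k / 2.0))   # posterior mass on theta_k >= 0
    trunc_term = ((2.0 * np.pi * beta_k) ** -0.5 * (beta_0k + beta_k) * theta_k
                  / erfcx(-theta_k * np.sqrt(beta_k / 2.0)))
    return gauss_term + trunc_term - np.log(2.0 * z0)   # -ln erfc(.) = -ln(2 Z_0k)

# e.g., the contribution is smallest near theta_k = 0 (cf. Figure 10):
# [kl_theta_single(t, beta_k=1.0, beta_0k=0.5) for t in (0.0, 0.5, 2.0)]
```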

5.2 Rademacher Complexity Bound


For a binary classification learning problem, the goal is to learn a function f : R^M → {−1, +1} from a hypothesis class F, given a dataset S = {x_i , y_i}_{i=1}^N drawn i.i.d. from a distribution D. We assess f by the expected loss L( f ) = E_{(x,y)∼D} l (y, f (x )), where l (y, f (x )) is a loss function. In practice, D is inaccessible and we can only assess the empirical loss on the given dataset S: Λ( f , S) = (1/N ) Σ_{i=1}^N l (y_i , f (x_i )). We adopt the 0–1 loss function l_{0-1}(y, f (x )) = I (y f (x ) ≤ 0), where I (·) is the indicator function. The 0–1 loss is dominated by the 1/c-Lipschitz function l_c (a) = min(1, max(0, 1 − a/c)), namely l_{0-1}(y, f (x )) ≤ l_c (y f (x )). We then state the entropic constraint Rademacher complexity bound in the following theorem:

8 KLθ (q‖p) denotes the KL-divergence between the posterior and the prior over the feature weights; p and q are short for p(θ | β0 ) and q(θ ), respectively.


Theorem 1 ([9, 33]). Based on the posterior q(w, θ ) given in Section 3.1, we have the Bayesian voting classifier
$$\hat{y} = f(x, q) = E_{q(w,\theta)}\left[\operatorname{sign}(\Phi_\theta(x)w)\right]. \qquad (22)$$
Define r > 0 and g > 0 as arbitrary parameters. For all f ∈ F, defined at the start of Section 5.2, with probability at least 1 − δ, the following bound on the generalization error of PFCVMLP on a given dataset S holds:
$$
P(yf(x)<0)\le\Lambda(f,S)+\frac{2}{c}\sqrt{\frac{2\tilde{g}(q(w,\theta))}{N}}+\sqrt{\frac{\ln\log_r\frac{r\,\tilde{g}(q(w,\theta))}{g}+\frac{1}{2}\ln\frac{1}{\delta}}{N}}, \qquad (23)
$$
where c is the 1/c-Lipschitz parameter, the empirical loss is $\Lambda(f,S)=\frac{1}{N}\sum_{i=1}^{N}l_c(y_i f(x_i))$, and the Rademacher entropic constraint is g̃(q(w, θ )) = r · max{KL(Q‖P ), g}. KL(Q‖P ) is the KL-divergence between the posteriors and priors over the sample and feature weights.
According to Equation (23), we observe that, for a fixed training set, the generalization error of PFCVMLP is mainly bounded by the empirical loss and g̃(q(w, θ )), where the latter is determined by KL(Q‖P ). Therefore, when the empirical loss is acceptable, a smaller KL(Q‖P ) leads to a tighter bound. The contribution of w (the sample weights) has been analyzed in [9]. In order to analyze the effect of θ (the feature weights) alone, we can assume that w is a given constant value ŵ. As a result, we have q(ŵ, θ ) = q(θ | ŵ) and thus KL(Q‖P ) = KLθ|ŵ (q‖p). As shown in Figure 10, the minimum of KLθ|ŵ (q‖p) is near θk = 0. This is consistent with our prior assumption and demonstrates that a truncated Gaussian (sparse) prior over features can benefit the generalization performance by acting as a regularization term while simultaneously encouraging sparsity in feature space. Furthermore, in our model, the posterior and the marginal likelihood are maximized iteratively in the training step. To accelerate convergence, we may choose proper starting points by minimizing KLθ|w (q‖p), i.e., as indicated at the end of Section 5.1, we can use an optimal β instead of β0 as the initial hyperparameter.
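As a rough numerical illustration of how the KL term enters the bound, the sketch below evaluates the right-hand side of our reconstruction of Equation (23) for given values of the empirical loss, KL(Q‖P), N, c, r, g, and δ. All of these numbers and the function itself are our own placeholders, not results or code from the paper.

```python
import numpy as np

def rademacher_bound(emp_loss, kl, N, c=0.1, r=2.0, g=1.0, delta=0.05):
    """Right-hand side of the (reconstructed) bound in Equation (23); a smaller
    KL(Q || P) shrinks g_tilde and therefore tightens the bound."""
    g_tilde = r * max(kl, g)
    complexity = (2.0 / c) * np.sqrt(2.0 * g_tilde / N)
    log_r = np.log(r * g_tilde / g) / np.log(r)          # logarithm base r
    confidence = np.sqrt((np.log(log_r) + 0.5 * np.log(1.0 / delta)) / N)
    return emp_loss + complexity + confidence

# e.g., rademacher_bound(0.05, kl=2.0, N=1376) < rademacher_bound(0.05, kl=20.0, N=1376)
```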

6 CONCLUSION
We have proposed a joint classification and feature learning algorithm PFCVMLP . The proposed
algorithm adopts sparseness-promoting priors for both sample and feature weights to jointly
learn to select the informative samples and features. By using the Laplace approximation, we
compute a complete Bayesian estimation of PFCVMLP , which is more stable than previously con-
sidered EM-based solutions. The performance of PFCVMLP has been examined according to two
criteria: the accuracy of its classification results and its ability to select features. Our experiments
demonstrate that the recognition performance of PFCVMLP on EEG emotion recognition datasets
is either the best or close to the best. On high-dimensional gene expression datasets, PFCVMLP
performs more accurately when compared to other approaches. A Rademacher complexity bound
is derived for PFCVMLP . By tightening this bound, we demonstrate the significance of feature
selection and introduce a way of finding proper initial values.
PFCVMLP jointly encourages sparsity over features and samples. However, in order to select features for non-linear basis functions, we have to differentiate the basis functions, which leads to high computational costs. As future work, we plan to use incremental learning [9, 15, 30] to reduce the computational costs. We also plan to design an online strategy [44] for joint feature and classifier learning. In addition, PFCVMLP focuses on the supervised binary classification task; it would be interesting to extend PFCVMLP to other tasks, e.g., multi-class classification [24, 42], ordinal regression [30], or semi-supervised learning [22]. Finally, we aim to use PFCVMLP in other areas of research, such as bioinformatics problems and clinical diagnoses [52].


APPENDIX
Hyperparameter Optimization
In order to compute a complete Bayesian classifier, feature and classifier co-learning requires computing
$$(\alpha,\beta)=\arg\max_{(\alpha,\beta)}\,p(t\mid\alpha,\beta,S)\,p(\alpha)\,p(\beta), \qquad (24)$$
where we assume that α and β are mutually independent. Equation (24) can be maximized iteratively, alternating between α and β. The re-estimation rules for α have been derived in [9]. In this appendix, we focus on deriving the re-estimation rules for β, which means that we need to compute
$$\beta=\arg\max_{\beta}\,p(t\mid\alpha_{\mathrm{old}},\beta,S)\,p(\beta). \qquad (25)$$

The hyperprior p(β ) follows a Gamma distribution, $p(\beta)=\prod_{k=1}^{M}\mathrm{Gam}(\beta_k\mid c,d)$, where c and d are the parameters of the Gamma distribution.
As discussed in Section 3.2, calculating a closed form of the marginal likelihood is non-trivial. Using Bayes' rule, the marginal likelihood is expanded as follows:
$$p(t\mid\alpha_{\mathrm{old}},\beta,S)=\frac{p(t\mid w,\theta,S)\,p(w\mid\alpha_{\mathrm{old}})\,p(\theta\mid\beta)}{p(w,\theta\mid t,\alpha_{\mathrm{old}},\beta)}. \qquad (26)$$
Applying the approximate Gaussian distributions for the sample and feature posteriors from Section 3.1, we obtain p(w, θ | t, α , β ) ≈ N (uθ , Σθ ) ∗ N (uw , Σw ). As a result, we maximize the logarithm of Equation (25):
$$
\begin{aligned}
L &= \log\left[p(t\mid\alpha_{\mathrm{old}},\beta)\,p(\beta)\right]\\
  &= \log p(\theta\mid\beta)-\log N(u_\theta,\Sigma_\theta)+\log p(\beta)+\mathrm{const}\\
  &= \frac{1}{2}\left(\epsilon^{T}B^{-1}\epsilon+\log|B|-\log|H+B|\right)+\sum_{k=1}^{M}\left(c\log\beta_k-d\beta_k\right)+\mathrm{const},
\end{aligned}
$$
where const is independent of β, ε = (D^T (t − σ ) + kθ ) is an M-dimensional vector, and H = D^T C D + Oθ − E is an M × M matrix. In practice, the latter two terms vanish if we set c = d = 0.
To compute the optimal β, we first differentiate L with respect to βk :
$$
\frac{\partial L}{\partial \beta_k}=-\frac{1}{2}\left(\frac{\epsilon_k^2}{\beta_k^2}-\frac{1}{\beta_k}+\frac{1}{\beta_k+h_k}-\frac{2c}{\beta_k}+2d\right), \qquad (27)
$$
where hk denotes the kth diagonal element of H. Note that $u_{\theta,k}^{2}=\epsilon_k^{2}/\beta_k^{2}$ and $\Sigma_{\theta,kk}=1/(\beta_k+h_k)$, as shown in Equations (10) and (11). Setting Equation (27) equal to zero, we obtain the update formula for β:
$$
\beta_k^{\mathrm{new}}=\frac{2c+1}{u_{\theta,k}^{2}+\Sigma_{\theta,kk}+2d}, \qquad (28)
$$
which is the same formula as the EM-based solution established in [29] and guarantees a local optimum. However, using the methodology of Bayesian Occam's razor reported by MacKay [32], we derive a more efficient update rule:
$$
\beta_k^{\mathrm{new}}=\frac{\gamma_k+2c}{u_{\theta,k}^{2}+2d}, \qquad (29)
$$
where γk ≡ 1 − βk Σθ,kk .
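A minimal sketch of the β re-estimation implied by Equations (28) and (29); this is our own illustration, with vectorized NumPy operations and the practical setting c = d = 0 as the default, not the paper's actual implementation.

```python
import numpy as np

def update_beta_em(u_theta, sigma_diag, c=0.0, d=0.0):
    """EM-style update, Equation (28); guarantees a local optimum."""
    return (2.0 * c + 1.0) / (u_theta ** 2 + sigma_diag + 2.0 * d)

def update_beta_mackay(u_theta, sigma_diag, beta, c=0.0, d=0.0):
    """MacKay-style update, Equation (29), with gamma_k = 1 - beta_k * Sigma_theta,kk.
    Assumes features whose u_theta,k is pinned at zero have already been pruned."""
    gamma = 1.0 - beta * sigma_diag
    return (gamma + 2.0 * c) / (u_theta ** 2 + 2.0 * d)

# In the training loop, posterior estimation alternates with, e.g.,
#   beta = update_beta_mackay(u_theta, np.diag(Sigma_theta), beta)
# and features whose beta_k becomes very large are filtered out.
```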


ACKNOWLEDGMENTS
We are grateful to the associate editor and the anonymous reviewers for the constructive feedback
and suggestions.
REFERENCES
[1] Uri Alon, Naama Barkai, Daniel A. Notterman, Kurt Gish, Suzanne Ybarra, Daniel Mack, and Arnold J. Levine. 1999.
Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by
oligonucleotide arrays. Proceedings of the National Academy of Sciences 96, 12 (1999), 6745–6750.
[2] Richard Ernest Bellman. 1961. Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton.
[3] James O. Berger. 1985. Statistical Decision Theory and Bayesian Analysis (2nd. ed.). Springer.
[4] Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York.
[5] Paul S. Bradley and Olvi L. Mangasarian. 1998. Feature selection via concave minimization and support vector ma-
chines. In Proceedings of the 15th International Conference on Machine Learning. Vol. 98, 82–90.
[6] Rich Caruana and Alexandru Niculescu-Mizil. 2004. Data mining in metric space: An empirical analysis of supervised
learning performance criteria. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and
Data Dining. ACM, 69–78.
[7] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions
on Intelligent Systems and Technology 2, 3 (2011), 27. Retrieved from http://www.csie.ntu.edu.tw/ cjlin/libsvmtools/
datasets/binary.html.
[8] Huanhuan Chen, Peter Tino, and Xin Yao. 2009. Probabilistic classification vector machines. IEEE Transactions on
Neural Networks 20, 6 (2009), 901–914.
[9] Huanhuan Chen, Peter Tino, and Xin Yao. 2014. Efficient probabilistic classification vector machine with incremental
basis function selection. IEEE Transactions on Neural Networks and Learning Systems 25, 2 (2014), 356–369.
[10] Rizwan A. Choudrey. 2002. Variational Methods for Bayesian Independent Component Analysis. Ph.D. Dissertation.
University of Oxford.
[11] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20, 3 (1995), 273–297.
[12] Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning
Research 7, 1 (2006), 1–30.
[13] Ruonan Duan, Jiayi Zhu, and Baoliang Lu. 2013. Differential entropy feature for EEG-based emotion classification. In
Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering. 81–84.
[14] Richard O. Duda, Peter E. Hart, and David G. Stork. 2012. Pattern Classification. John Wiley & Sons.
[15] Anita C. Faul and Michael E. Tipping. 2002. Analysis of sparse Bayesian learning. In Proceedings of the 16th Conference
on Neural Information Processing Systems. 383–390.
[16] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing,
M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. 1999. Molecular classification of cancer: Class discovery and class
prediction by gene expression monitoring. Science 286, 5439 (1999), 531–537.
[17] Michael Grant, Stephen Boyd, and Yinyu Ye. 2008. CVX: Matlab software for disciplined convex programming. Re-
trieved from http://cvxr.com/cvx/doc/install.html.
[18] Bin Gu, Xingming Sun, and Victor S. Sheng. 2017. Structural minimax probability machine. IEEE Transactions on
Neural Networks and Learning Systems 28, 7 (2017), 1646–1656.
[19] Shan He, Huanhuan Chen, Zexuan Zhu, Douglas G. Ward, Helen J. Cooper, Mark R. Viant, John K. Heath, and Xin
Yao. 2015. Robust twin boosting for feature selection from high-dimensional omics data with label noise. Information
Sciences 291 (2015), 1–18.
[20] Xiaofei He, Deng Cai, and Partha Niyogi. 2005. Laplacian score for feature selection. In Proceedings of the 18th Ad-
vances in Neural Information Processing Systems. 507–514.
[21] Kaizhu Huang, Haiqin Yang, Irwin King, Michael R Lyu, and Laiwan Chan. 2004. The minimum error minimax prob-
ability machine. Journal of Machine Learning Research 5, (Oct. 2004), 1253–1286.
[22] Bingbing Jiang, Huanhuan Chen, Bo Yuan, and Xin Yao. 2017. Scalable graph-based semi-supervised learning through
sparse Bayesian model. IEEE Transactions on Knowledge and Data Engineering 29, 12 (2017), 2758–2771.
[23] Alexandros Kalousis, Julien Prados, and Melanie Hilario. 2007. Stability of feature selection algorithms: A study on
high-dimensional spaces. Knowledge and Information Systems 12, 1 (2007), 95–116.
[24] Balaji Krishnapuram, Lawrence Carin, Mario AT Figueiredo, and Alexander J Hartemink. 2005. Sparse multinomial
logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine
Intelligence 27, 6 (2005), 957–968.
[25] Balaji Krishnapuram, Lawrence Carin, and Alexander J Hartemink. 2003. Joint classifier and feature optimization for
cancer diagnosis using gene expression data. In Proceedings of the 7th Annual International Conference on Research in
Computational Molecular Biology. ACM, 167–175.


[26] Balaji Krishnapuram, AJ Harternink, Lawrence Carin, and Mario AT Figueiredo. 2004. A Bayesian approach to joint
feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 9 (2004),
1105–1111.
[27] James T Kwok, Ivor Wai-Hung Tsang, and Jacek M Zurada. 2007. A class of single-class minimax probability machines
for novelty detection. IEEE Transactions on Neural Networks 18, 3 (2007), 778–785.
[28] Gert RG Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya, and Michael I Jordan. 2002. A robust minimax ap-
proach to classification. Journal of Machine Learning Research 3, (Dec. 2002), 555–582.
[29] Chang Li and Huanhuan Chen. 2014. Sparse Bayesian approach for feature selection. In Proceedings of the IEEE Sym-
posium on Computational Intelligence in Big Data. IEEE, 1–7.
[30] Chang Li and Maarten de Rijke. 2018. Incremental sparse Bayesian ordinal regression. Neural Networks 106 (2018),
294–302.
[31] Yi Li, Colin Campbell, and Michael Tipping. 2002. Bayesian automatic relevance determination algorithms for clas-
sifying gene expression data. Bioinformatics 18, 10 (2002), 1332–1339.
[32] David JC MacKay. 1992. Bayesian interpolation. Neural Computation 4, 3 (1992), 415–447.
[33] Ron Meir and Tong Zhang. 2003. Generalization error bounds for Bayesian mixture algorithms. The Journal of Machine
Learning Research 4 (2003), 839–860.
[34] Yalda Mohsenzadeh, Hamid Sheikhzadeh, and Sobhan Nazari. 2016. Incremental relevance sample-feature machine:
A fast marginal likelihood maximization approach for joint feature selection and classification. Pattern Recognition
60 (2016), 835–848.
[35] Yalda Mohsenzadeh, Hamid Sheikhzadeh, Ali M Reza, Najmehsadat Bathaee, and Mahdi M Kalayeh. 2013. The rele-
vance sample-feature machine: A sparse Bayesian learning approach to joint feature-sample selection. IEEE Transac-
tions on Cybernetics 43, 6 (2013), 2241–2254.
[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. 1998. UCI repository of machine learning databases. Retrieved
from http://www.ics.uci.edu/ mlearn/MLRepository.html.
[37] Minh Hoai Nguyen and Fernando De la Torre. 2010. Optimal feature selection for support vector machines. Pattern
Recognition 43, 3 (2010), 584–591.
[38] Feiping Nie, Heng Huang, Xiao Cai, and Chris H. Ding. 2010. Efficient and robust feature selection via joint L2,
1-norms minimization. In Proceedings of the Advances in Neural Information Processing Systems. 1813–1821.
[39] Feiping Nie, Shiming Xiang, Yangqing Jia, Changshui Zhang, and Shuicheng Yan. 2008. Trace ratio criterion for
feature selection. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence. Vol. 2, 671–676.
[40] Sarah Nogueira and Gavin Brown. 2016. Measuring the stability of feature selection. In Proceedings of the Joint Euro-
pean Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 442–457.
[41] Hanchuan Peng, Fulmi Long, and Chris Ding. 2005. Feature selection based on mutual information criteria of max-
dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27,
8 (2005), 1226–1238.
[42] Ioannis Psorakis, Theodoros Damoulas, and Mark Girolami. 2010. Multiclass relevance vector machines: Sparsity and
accuracy. IEEE Transactions on Neural Networks 21, 10 (2010), 1588–1598.
[43] Shirish Krishnaj Shevade and S. Sathiya Keerthi. 2003. A simple and efficient algorithm for gene selection using sparse
logistic regression. Bioinformatics 19, 17 (2003), 2246–2253.
[44] Pannaga Shivaswamy and Thorsten Joachims. 2015. Coactive learning. Journal of Artificial Intelligence Research 53
(2015), 1–40. DOI: https://doi.org/10.1613/jair.4539
[45] Michael E. Tipping. 2001. Sparse Bayesian learning and the relevance vector machine. The Journal of Machine Learning
Research 1 (2001), 211–244.
[46] Vladimir Naumovich Vapnik. 1998. Statistical Learning Theory, Vol. 2. Wiley, New York.
[47] Anthony J. Viera and Joanne M. Garrett. 2005. Understanding interobserver agreement: The kappa statistic. Family
Medicine 37, 5 (2005), 360–363.
[48] Haishuai Wang, Peng Zhang, Xingquan Zhu, Ivor Wai-Hung Tsang, Ling Chen, Chengqi Zhang, and Xindong Wu.
2017. Incremental subgraph feature selection for graph classification. IEEE Transactions on Knowledge and Data En-
gineering 29, 1 (2017), 128–142.
[49] Jing Wang, Meng Wang, Peipei Li, Luoqi Liu, Zhongqiu Zhao, Xuegang Hu, and Xindong Wu. 2015. Online feature
selection with group structure analysis. IEEE Transactions on Knowledge and Data Engineering 27, 11 (2015), 3029–
3041.
[50] Mike West, Carrie Blanchette, Holly Dressman, Erich Huang, Seiichi Ishida, Rainer Spang, Harry Zuzan, John A.
Olson, Jeffrey R. Marks, and Joseph R. Nevins. 2001. Predicting the clinical status of human breast cancer by using
gene expression profiles. Proceedings of the National Academy of Sciences 98, 20 (2001), 11462–11467.
[51] Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso A. Poggio, and Vladimir Vapnik.
2000. Feature selection for SVMs. In Proceedings of the Advances in Neural Information Processing Systems. 668–674.


[52] Darren J. Wilkinson. 2007. Bayesian methods in bioinformatics and computational systems biology. Briefings in Bioin-
formatics 8, 2 (2007), 109–116.
[53] Xindong Wu, Kui Yu, Wei Ding, Hao Wang, and Xingquan Zhu. 2013. Online feature selection with streaming features.
IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 5 (2013), 1178–1192.
[54] Xindong Wu, Kui Yu, Hao Wang, and Wei Ding. 2010. Online streaming feature selection. In Proceedings of the 27th
International Conference on Machine Learning. 1159–1166.
[55] Yue Wu, Steven C. H. Hoi, Tao Mei, and Nenghai Yu. 2017. Large-scale online feature selection for ultra-high dimen-
sional sparse data. ACM Transactions on Knowledge Discovery from Data 11, 4 (2017), 48.
[56] Haiqin Yang, Michael R. Lyu, and Irwin King. 2013. Efficient online learning for multitask feature selection. ACM
Transactions on Knowledge Discovery from Data 7, 2 (2013), 6.
[57] Kui Yu, Wei Ding, Dan A. Simovici, Hao Wang, Jian Pei, and Xindong Wu. 2015. Classification with streaming features:
An emerging-pattern mining approach. ACM Transactions on Knowledge Discovery From Data 9, 4 (2015), 30.
[58] Kui Yu, Xindong Wu, Wei Ding, and Jian Pei. 2016. Scalable and accurate online feature selection for big data. ACM
Transactions on Knowledge Discovery from Data 11, 2 (2016), 16.
[59] Weilong Zheng and Baoliang Lu. 2015. Investigating critical frequency bands and channels for EEG-based emotion
recognition with deep neural networks. IEEE Transactions on Autonomous Mental Development 7, 3 (2015), 162–175.
Retrieved from http://bcmi.sjtu.edu.cn/ seed/download.html.
[60] Weilong Zheng, Jiyi Zhu, and Baoliang Lu. 2017. Identifying stable patterns over time for emotion recognition from
EEG. IEEE Transactions on Affective Computing. DOI: 10.1109/TAFFC.2017.2712143

Received November 2017; revised September 2018; accepted January 2019

