
Chapter 7

Feature Selection

Feature selection is not used in the system classification experiments that will be discussed
in Chapters 8 and 9. However, as an autonomous system, OMEGA includes feature selection as
an important module.

7.1 Introduction

A fundamental problem of machine learning is to approximate the functional relationship f( )
between an input X = {x_1, x_2, ..., x_M} and an output Y, based on a memory of data points
{X_i, Y_i}, i = 1, ..., N; usually the X_i's are vectors of reals and the Y_i's are real numbers.
Sometimes the output Y is not determined by the complete set of input features {x_1, x_2, ..., x_M};
instead, it is decided only by a subset of them, {x_(1), x_(2), ..., x_(m)}, where m < M. With
sufficient data and time, it is fine to use all the input features, including the irrelevant ones,
to approximate the underlying function between the input and the output. In practice, however,
two problems may be caused by the irrelevant features involved in the learning process.

1. The irrelevant input features induce greater computational cost. For example, using
cached kd-trees as discussed in the last chapter, locally weighted linear regression's
computational expense is O(m^3 + m^2 log N) for a single prediction, where N is the number
of data points in memory and m is the number of features used. With more features, the
computational cost of each prediction grows polynomially; when a large number of such
predictions are needed, the total cost increases immensely.

2. The irrelevant input features may lead to overfitting. For example, in the domain of
medical diagnosis, our purpose is to infer the relationship between the symptoms and the
corresponding diagnosis. If by mistake we include the patient ID number as an input feature,
an over-tuned machine learning process may come to the conclusion that the illness is
determined by the ID number.

Another motivation for feature selection is that, since our goal is to approximate the underlying
function between the input and the output, it is reasonable and important to ignore those input
features with little effect on the output, so as to keep the approximator model small. For
example, [Akaike, 73] proposed several versions of model selection criteria, which are basically
trade-offs between high accuracy and small model size.

The feature selection problem has been studied by the statistics and machine learning communities
for many years. It has received more attention recently because of enthusiastic research in data
mining. According to the definition of [John et al., 94], the work of [Kira et al., 92],
[Almuallim et al., 91], [Moore et al., 94], [Skalak, 94] and [Koller et al., 96] can be labelled
as "filter" models, while the research of [Caruana et al., 94] and [Langley et al., 94] is
classified as "wrapped around" methods. In the statistics community, feature selection is also
known as "subset selection", which is surveyed thoroughly in [Miller, 90].

The brute-force feature selection method is to exhaustively evaluate all possible combinations
of the input features and then pick the best subset. Obviously, the computational cost of
exhaustive search is prohibitively high, with considerable danger of overfitting. Hence, people
resort to greedy methods, such as forward selection. In this chapter, we propose three greedier
selection algorithms in order to further enhance the efficiency. We use real-world data sets
from over ten different domains to compare the accuracy and efficiency of the various algorithms.

1. Shuffle the data set and split it into a training set of 70% of the data
   and a test set of the remaining 30%.
2. Let j vary among feature-set sizes: j = (0, 1, 2, ..., m)
   a. Let fs_j = the best feature set of size j, where "best" is measured as
      the minimizer of the leave-one-out cross-validation error over the
      training set.
   b. Let TestScore_j = the RMS prediction error of feature set fs_j on the
      test set.
   End of loop of (j).
3. Select the feature set fs_j for which the test-set score is minimized.

Figure 7-1: Cascaded cross-validation procedure for finding the best set of up to m features.

7.2 Cross Validation vs. Overfitting

The goal of feature selection is to choose a subset X_s of the complete set of input features
X = {x_1, x_2, ..., x_M} so that the subset X_s can predict the output Y with accuracy comparable
to the performance of the complete input set X, and with a great reduction in computational
cost.

First, let us clarify how to evaluate the performance of a set of input features. In this chapter
we use a very conservative form of feature set evaluation in order to avoid overfitting. This is
important: even if feature sets are evaluated by test-set cross-validation or leave-one-out
cross-validation, an exhaustive search over possible feature sets is likely to find a misleadingly
well-scoring feature set by chance. To prevent this, we use the cascaded cross-validation
procedure in Figure 7-1, which selects from increasingly large sets of features (and thus from
increasingly large model classes). The score used to select the best feature set of a given size
(the leave-one-out error on the training set) is independent of the score used to select the best
feature-set size (the RMS error on the held-out test set).

Two notes about the procedure in Figure 7-1. First, the choice of a 70/30 split for training and
testing is somewhat arbitrary, but it is empirically a good practical ratio according to more
detailed experiments. Second, note that Figure 7-1 does not describe how we search for the best
feature set of size j in Step 2a; this is the subject of Section 7.3.
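For concreteness, the following Python sketch mirrors the procedure of Figure 7-1. It is only an
illustration, not the original OMEGA code: the names best_feature_set_of_size (the size-j search
of Step 2a, discussed in Section 7.3) and fit_model (whatever function approximator is in use,
assumed to return an object with a predict method) are hypothetical stand-ins supplied by the
caller.

import numpy as np

def rms(y_true, y_pred):
    """Root-mean-square prediction error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def cascaded_cv(X, y, m, best_feature_set_of_size, fit_model, seed=0):
    """Figure 7-1 (sketch): pick the best feature set of each size 1..m by
    LOOCV on the training split, then choose among sizes by RMS error on the
    held-out test split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))                  # Step 1: shuffle, 70/30 split
    cut = int(0.7 * len(y))
    tr, te = order[:cut], order[cut:]
    candidates = []
    for j in range(1, m + 1):                        # Step 2: loop over sizes
        fs_j = best_feature_set_of_size(j, X[tr], y[tr])       # 2a: LOOCV winner of size j
        model = fit_model(X[tr][:, fs_j], y[tr])
        score_j = rms(y[te], model.predict(X[te][:, fs_j]))    # 2b: independent test score
        candidates.append((score_j, fs_j))
    return min(candidates, key=lambda c: c[0])[1]    # Step 3: best test-set score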

Evaluating the performance of a feature selection algorithm is more complicated than evaluating
a feature set. This is because, in order to evaluate an algorithm, we must first ask the algorithm
to find the best feature subset. Second, to give a fair estimate of how well the feature selection
algorithm performs, we should repeat the first step on different datasets. Therefore, the full
procedure for evaluating the performance of a feature selection algorithm, described in Figure 7-2,
has two layers of loops. The inner loop uses the algorithm to find the best subset of features.
The outer loop evaluates the performance of the algorithm on different datasets.

7.3 Feature selection algorithms

In this section, we first introduce the conventional forward feature selection algorithm; we then
explore three greedy variants of the forward algorithm that aim to improve the computational
efficiency without sacrificing too much accuracy.

7.3.1 Forward feature selection

The forward feature selection procedure begins by evaluating all feature subsets consisting of
only one input attribute. In other words, we start by measuring the Leave-One-Out Cross-Validation
(LOOCV) error of the one-component subsets, {X_1}, {X_2}, ..., {X_M}, where M is the input
dimensionality, so that we can find the best individual feature, X_(1).

1. Collect a training data set from the specific domain.
2. Shuffle the data set.
3. Break it into P partitions (say P = 20).
4. For each partition (i = 0, 1, ..., P-1):
   a. Let OuterTrainset(i) = all partitions except i.
   b. Let OuterTestset(i) = the i'th partition.
   c. Let InnerTrain(i) = a randomly chosen 70% of OuterTrainset(i).
   d. Let InnerTest(i) = the remaining 30% of OuterTrainset(i).
   e. For j = 0, 1, ..., m:
      Search for the best feature set with j components, fs_ij, using
      leave-one-out on InnerTrain(i).
      Let InnerTestScore_ij = RMS score of fs_ij on InnerTest(i).
      End loop of (j).
   f. Select the fs_ij with the best inner test score.
   g. Let OuterScore_i = RMS score of the selected feature set on OuterTestset(i).
   End of loop of (i).
5. Return the mean OuterScore.

Figure 7-2: Full procedure for evaluating feature selection of up to m attributes.

Next, forward selection finds the best subset consisting of two components: X_(1) and one other
feature from the remaining M - 1 input attributes, so there are M - 1 pairs to evaluate. Let us
assume X_(2) is the other attribute in the best pair besides X_(1).

Afterwards, the input subsets with three, four, and more features are evaluated. According to
forward selection, the best subset with m features is the m-tuple consisting of X_(1), X_(2), ...,
X_(m), while the overall best feature set is the winner out of all m steps. Assuming the cost of a
LOOCV evaluation with i features is C(i), the computational cost of forward selection searching
for a feature subset of size m out of M total input attributes is

    M C(1) + (M - 1) C(2) + ... + (M - m + 1) C(m).

For example, the cost of one prediction with one-nearest-neighbor as the function approximator,
using a kd-tree with j inputs, is O(j log N), where N is the number of datapoints. Thus, the cost
of computing the mean leave-one-out error, which involves N predictions, is O(j N log N), and the
full cost of forward selection, using the formula above, is O(m^2 M N log N).
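As a concrete illustration, the following sketch implements plain forward selection up to m
features. The scorer loocv_error(subset), which returns the mean leave-one-out error of a
candidate subset under whatever function approximator is in use, is a hypothetical caller-supplied
callable; the toy scorer at the bottom is made up purely to show the call.

def forward_selection(loocv_error, M, m):
    """Forward selection (sketch): grow the subset one feature at a time,
    always adding the attribute that minimizes the LOOCV error of the
    enlarged set; return the best subset seen over all sizes 1..m."""
    selected = []
    best_overall = (float("inf"), ())
    for _ in range(min(m, M)):
        remaining = [f for f in range(M) if f not in selected]
        # Try every remaining feature as the next addition.
        err, best_f = min((loocv_error(tuple(selected) + (f,)), f) for f in remaining)
        selected.append(best_f)
        best_overall = min(best_overall, (err, tuple(selected)))
    return best_overall[1]

# Toy usage with a made-up scorer in which features 0 and 2 are the informative ones:
toy = lambda fs: 1.0 - 0.4 * (0 in fs) - 0.3 * (2 in fs) + 0.01 * len(fs)
print(forward_selection(toy, M=6, m=3))   # -> (0, 2)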

To find the overall best input feature set, we can also employ exhaustive search. Exhaustive
search begins with the best one-component subset of the input features, which is the same as in
the forward selection algorithm; then it finds the best two-component feature subset, which may
consist of any pair of the input features. Afterwards, it moves on to the best triple out of all
combinations of any three input features, and so on. It is straightforward to see that the cost
of exhaustive search is the following:

    M C(1) + (M choose 2) C(2) + ... + (M choose m) C(m).

Compared with the exhaustive search, forward selection is much cheaper.
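As a rough illustration of the gap, counting only the number of subsets evaluated (and ignoring
the fact that a LOOCV with more features also costs more per subset), for an arbitrarily chosen
M = 13 and m = 10:

from math import comb

M, m = 13, 10
forward_evals    = sum(M - j for j in range(m))              # M + (M-1) + ... + (M-m+1)
exhaustive_evals = sum(comb(M, k) for k in range(1, m + 1))  # M + C(M,2) + ... + C(M,m)
print(forward_evals, exhaustive_evals)                       # 85 vs 8099 subset evaluations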

However, forward selection may suffer because of its greediness. For example, even if X_(1) is the
best individual feature, there is no guarantee that either {X_(1), X_(2)} or {X_(1), X_(3)} is
better than {X_(2), X_(3)}. Therefore, a forward selection algorithm may select a feature set
different from that selected by exhaustive search. With a bad selection of the input features, the
prediction Ŷ_q for a query X_q = {x_1, x_2, ..., x_M} may be significantly different from the
true Y_q.

7.3.2 Three Variants of Forward Selection

In this subsection, we investigate the following two questions based on empirical analysis using
real-world datasets mixed with artificially designed features.

1. How severely does the greediness of forward selection lead to a bad selection of the input
features?

2. If the greediness of forward selection does not have a significantly negative effect on
accuracy, how can we modify the forward selection algorithm to be even greedier, in order to
improve the efficiency further?

We postpone the first question until the next section. Here, we propose three greedier feature
selection algorithms whose goal is to select no more than m features from a total of M input
attributes, with a tolerable loss of prediction accuracy.

Super Greedy Algorithm

Do all the 1-attribute LOOCV calculations, sort the individual features according to their LOOCV
mean error, and take the m best features as the selected subset. We thus do M computations
involving one feature and one computation involving m features. If nearest neighbor is the
function approximator, the cost of the super greedy algorithm is O((M + m) N log N).
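A minimal sketch, with the same hypothetical caller-supplied loocv_error scorer as in the forward
selection sketch above:

def super_greedy(loocv_error, M, m):
    """Super greedy (sketch): rank features by their individual LOOCV error
    and keep the m best.  Cost: M one-feature LOOCVs plus one m-feature LOOCV."""
    ranked = sorted(range(M), key=lambda f: loocv_error((f,)))
    subset = tuple(ranked[:m])
    return subset, loocv_error(subset)   # chosen subset and its joint LOOCV error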

Greedy Algorithm

Do all the 1-attribute LOOCVs and sort them; then take the best two individual features and
evaluate their LOOCV error, then the best three individual features, and so on, until m features
have been evaluated. Compared with the super greedy algorithm, this algorithm may settle on a
subset whose size is smaller than m but whose inner test-set error is smaller than that of the
m-component feature set. Hence, the greedy algorithm may end up with a better feature set than
the super greedy one does. The cost of the greedy algorithm for nearest neighbor is
O((M + m^2) N log N).
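The greedy variant differs from the super greedy sketch only in that it also scores the top-2,
top-3, ..., top-m prefixes of the individual ranking and keeps whichever prefix scores best; a
sketch under the same loocv_error assumption:

def greedy(loocv_error, M, m):
    """Greedy (sketch): rank features individually, then evaluate the nested
    prefixes of sizes 1..m and return whichever prefix scores best."""
    ranked = sorted(range(M), key=lambda f: loocv_error((f,)))
    scored = [(loocv_error(tuple(ranked[:k])), tuple(ranked[:k]))
              for k in range(1, min(m, M) + 1)]
    return min(scored, key=lambda s: s[0])[1]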

Restricted Forward Selection (RFS)

1. Calculate all the 1-feature-set LOOCV errors, and sort the features according to the
corresponding LOOCV errors. Suppose the features, ranked from the most important to the least
important, are X_(1), X_(2), ..., X_(M).

2. Do the LOOCVs of the 2-feature subsets which consist of the winner of the first round, X_(1),
along with another feature, either X_(2), or X_(3), or any other one up to X_(M/2). There are
M/2 of these pairs. The winner of this round is the best 2-component feature subset chosen by
RFS.

3. Calculate the LOOCV errors of the M/3 subsets which consist of the winner of the second round
along with one of the M/3 features at the top of the remaining ranking. In this way, RFS selects
its best feature triple.

4. Continue this procedure until RFS has found the best m-component feature set.

5. From Step 1 to Step 4, RFS has found m feature sets whose sizes range from 1 to m. By comparing
their LOOCV errors, RFS finds the best overall feature set.

The difference between RFS and conventional Forward Selection (FS) is that, at each step of
inserting an additional feature into the subset, FS considers all the remaining features, while
RFS only tries the part of them that seems more promising. The cost of RFS for nearest neighbor
is O(M m N log N).
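A sketch of RFS in the same style, under the same hypothetical loocv_error assumption; the pool of
roughly M/k candidates at the step that builds a set of size k follows the description above:

def restricted_forward_selection(loocv_error, M, m):
    """RFS (sketch): like forward selection, but when growing to size k only
    the roughly M/k best-ranked remaining features are tried as the new
    member, instead of all remaining features."""
    ranked = sorted(range(M), key=lambda f: loocv_error((f,)))   # Step 1: individual ranking
    selected = [ranked[0]]
    best_overall = (loocv_error((ranked[0],)), tuple(selected))
    for k in range(2, min(m, M) + 1):                            # Steps 2-4: sizes 2..m
        remaining = [f for f in ranked if f not in selected]
        candidates = remaining[:max(1, M // k)]                  # only the top ~M/k remaining
        err, best_f = min((loocv_error(tuple(selected) + (f,)), f) for f in candidates)
        selected.append(best_f)
        best_overall = min(best_overall, (err, tuple(selected)))
    return best_overall[1]                                       # Step 5: best set over all sizes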

For all these variants of forward selection, we want to know how cheap and how accurate they are
compared with the conventional forward selection method. To answer these questions, we resort to
experiments using real-world datasets.

7.4 Experiments

In this section, we compare the greedy algorithms with the conventional methods empirically. We
run ten experiments; for each experiment, we try two datasets with different input
dimensionalities, and for each dataset we use three different function approximators.

To evaluate the influence of greediness on the accuracy and efficiency of the feature selection
process, we use twelve real-world datasets from StatLib/CMU and the UCI machine learning
repository. These datasets come from different domains, such as biology, sociology, robotics,
etc. Each dataset contains 62 to 1601 points, and each point consists of an input vector and a
scalar output. The dimensionality of the input varies from 3 to 13. In all of these examples we
set m (the maximum feature-set size) to 10.

Table 7-1: Preliminary comparison of ES vs. FS

                 20-Fold Mean Errors            Time Cost                  Selected Features
Domain (dim)     ES       FS       ES / FS      ES       FS     ES / FS    ES       FS
Crab (7)         0.415    0.469    0.885        35644    522    68.28      A,F,G    A,E
Halibut (7)      57.972   52.267   1.109        61759    713    86.62      B,C,G    A,D,E,G
Irish (5)        0.863    0.905    0.954        138088   1142   120.91     A,C,E    A,D
Litter (3)       0.780    0.868    0.899        4982     117    42.58      A,B,C    A,B,C

Our first experiment demonstrates that Exhaustive Search (ES) is prohibitively time-consuming. We
chose four domains with not-too-large datasets and limited input dimensionality for this test.
Referring to Table 7-1, even in these easy cases ES is far more expensive than the Forward
Selection algorithm (FS), while it is not significantly more accurate than FS. However, the
features selected by FS may differ from those selected by ES, because some of the input features
are not mutually independent.

Our second experiment investigates the influence of greediness. We compare the three greedier
algorithms, Super Greedy, Greedy and Restricted Forward Selection (RFS), with the conventional FS
in three respects: (1) the probability that these algorithms select any useless features, (2) the
prediction errors using the feature sets selected by these algorithms, and (3) the time cost for
these algorithms to find their feature sets.

For example, if a raw data file consists of three input attributes, U, V, W and an output Y, we
generate a new dataset consisting of more input features, U, V, W, cU, cV, cW, R1, R2, ..., R10,
and the output Y, in which cU, cV and cW are copies of U, V and W corrupted with 20% noise, while
R1 to R10 are independent random numbers. The chance that any of these useless features is
selected can be treated as an estimate of the probability that a given feature selection
algorithm makes a mistake.
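A sketch of how such a dataset might be generated. The chapter does not spell out the exact
corruption model, so "20% noise" is interpreted here, purely as an assumption, as zero-mean
Gaussian noise whose standard deviation is 20% of each feature's own standard deviation.

import numpy as np

def add_useless_features(X, n_random=10, noise_frac=0.20, seed=0):
    """Append a corrupted copy of every input column (relevant but noisy)
    plus n_random independent random columns (pure distractors)."""
    rng = np.random.default_rng(seed)
    corrupted = X + noise_frac * X.std(axis=0) * rng.standard_normal(X.shape)
    distractors = rng.standard_normal((X.shape[0], n_random))
    # Columns: [U, V, W, cU, cV, cW, R1, ..., R10] for a 3-input raw file.
    return np.hstack([X, corrupted, distractors])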

Table 7-2: Greediness comparison

                            # Corrupt / Total Corrupts       # Noise / Total Noise
Domain (dim)    Apprx.      Super   Greedy  RFS     FS       Super   Greedy  RFS     FS
Bodyfat (13)    Nearest     0.23    0.12    0.10    0.12     0.10    0.05    0.05    0.06
                LocLin      0.31    0.08    0.17    0.18     0.00    0.00    0.05    0.20
                GlbLin      0.31    0.23    0.15    0.00     0.00    0.00    0.00    0.40
Boston (13)     Nearest     0.23    0.19    0.21    0.17     0.20    0.20    0.23    0.35
                LocLin      0.15    0.15    0.12    0.15     0.30    0.30    0.30    0.33
                GlbLin      0.15    0.12    0.15    0.23     0.40    0.30    0.30    0.40
Crab (7)        Nearest     0.29    0.29    0.29    0.29     0.30    0.13    0.17    0.20
                LocLin      0.29    0.14    0.21    0.21     0.40    0.40    0.20    0.15
                GlbLin      0.29    0.14    0.29    0.24     0.40    0.30    0.15    0.17
Halibut (7)     Nearest     0.57    0.57    0.14    0.43     0.10    0.10    0.10    0.10
                LocLin      0.43    0.21    0.04    0.24     0.20    0.10    0.10    0.20
                GlbLin      0.36    0.29    0.00    0.14     0.25    0.10    0.20    0.10
Irish (5)       Nearest     0.60    0.60    0.00    0.00     0.20    0.20    0.10    0.10
                LocLin      0.40    0.40    0.38    0.38     0.30    0.30    0.15    0.25
                GlbLin      0.60    0.60    0.30    0.40     0.30    0.30    0.40    0.25
Litter (3)      Nearest     0.67    0.33    0.33    0.33     0.30    0.00    0.05    0.07
                LocLin      0.67    0.33    0.33    0.33     0.30    0.00    0.05    0.07
                GlbLin      0.33    0.33    0.00    0.43     0.50    0.20    0.35    0.50
Mpg (9)         Nearest     0.44    0.44    0.41    0.44     0.00    0.00    0.07    0.05
                LocLin      0.44    0.33    0.22    0.30     0.00    0.00    0.10    0.23
                GlbLin      0.33    0.28    0.22    0.17     0.00    0.00    0.20    0.20
Nursing (6)     Nearest     0.33    0.00    0.25    0.25     0.30    0.10    0.15    0.15
                LocLin      0.33    0.08    0.33    0.22     0.40    0.25    0.20    0.20
                GlbLin      0.33    0.25    0.33    0.25     0.40    0.35    0.20    0.30
Places (8)      Nearest     0.31    0.00    0.00    0.00     0.15    0.00    0.00    0.00
                LocLin      0.38    0.24    0.16    0.40     0.20    0.10    0.00    0.10
                GlbLin      0.25    0.25    0.23    0.31     0.35    0.15    0.15    0.25
Sleep (7)       Nearest     0.29    0.00    0.04    0.04     0.25    0.10    0.13    0.17
                LocLin      0.43    0.11    0.03    0.00     0.20    0.03    0.08    0.10
                GlbLin      0.26    0.21    0.26    0.29     0.40    0.15    0.18    0.40
Strike (6)      Nearest     0.33    0.17    0.17    0.17     0.30    0.00    0.03    0.03
                LocLin      0.58    0.00    0.00    0.00     0.15    0.00    0.00    0.05
                GlbLin      0.50    0.33    0.22    0.33     0.15    0.00    0.08    0.18
Whitecell (13)  Nearest     0.15    0.15    0.08    0.23     0.40    0.20    0.15    0.25
                LocLin      0.15    0.04    0.02    0.02     0.04    0.10    0.27    0.27
                GlbLin      0.12    0.14    0.08    0.04     0.40    0.35    0.25    0.25
Mean over all   Nearest     0.37    0.27    0.17    0.21     0.23    0.10    0.11    0.13
twelve datasets LocLin      0.38    0.18    0.17    0.20     0.24    0.13    0.13    0.18
                GlbLin      0.30    0.26    0.19    0.23     0.29    0.18    0.21    0.28
TOTAL           -           0.35    0.24    0.18    0.21     0.25    0.14    0.15    0.20

As we observe in Table 7-2, FS does not eliminate more useless features than its greedier
competitors, except for the Super Greedy one. However, the greedier an algorithm is, the more
easily it is confused by the relevant but corrupted features.

Since the input features may be mutually dependent, the different algorithms may find different
feature sets. To measure the goodness of these selected feature sets, we calculate the mean
20-fold score. As described in Section 7.2, our scoring is carefully designed to avoid
overfitting, so the smaller the score, the better the corresponding feature set. To confirm the
consistency, we test the four algorithms in all twelve domains from StatLib and UCI. For each
domain, we apply the algorithms to two datasets; both are generated from the same raw data file,
but with different numbers of corrupted features and independent noise. For each dataset, we try
three function approximators: nearest neighbor (Nearest), locally weighted linear regression
(LocLin) and global linear regression (GlbLin). For the sake of conciseness, we only list ratios:
if a ratio is close to 1.0, the corresponding algorithm's performance is not significantly
different from that of FS. The experimental results are shown in Table 7-3. In addition, we also
list the ratios of the number of seconds consumed by the greedier algorithms to that consumed by
FS.

First, we observe in Table 7-3 that the three greedier feature selection algorithms do not suffer
a great loss in accuracy, since the average ratios of their 20-fold scores to those of FS are
very close to 1.0. In fact, RFS performs almost as well as FS. Second, as we expected, the
greedier algorithms improve the efficiency: the super greedy algorithm (Super) is ten times
faster than forward selection (FS), the greedy algorithm (Greedy) seven times faster, and
restricted forward selection (RFS) three times faster. Finally, RFS performs better than the
conventional FS in all respects.

Table 7-3: Greediness comparison

                            20Fold( ) / 20Fold(FS)        Cost( ) / Cost(FS)
Domain (dim)    Apprx.      Super    Greedy   RFS         Super    Greedy   RFS
Bodyfat (13)    Nearest     0.975    0.969    0.915       0.095    0.126    0.330
                LocLin      1.080    1.015    0.973       0.062    0.092    0.287
                GlbLin      0.984    0.981    0.966       0.084    0.109    0.247
Boston (13)     Nearest     0.876    0.872    0.881       0.105    0.145    0.389
                LocLin      1.091    1.091    0.969       0.058    0.080    0.270
                GlbLin      1.059    1.052    1.068       0.084    0.127    0.287
Crab (7)        Nearest     1.107    1.039    0.973       0.123    0.149    0.358
                LocLin      1.121    1.093    1.024       0.095    0.128    0.349
                GlbLin      1.123    1.101    0.957       0.079    0.116    0.319
Halibut (7)     Nearest     1.089    1.108    1.051       0.133    0.163    0.376
                LocLin      1.395    1.322    1.198       0.079    0.130    0.312
                GlbLin      1.073    1.018    1.022       0.079    0.137    0.273
Irish (5)       Nearest     1.132    1.072    0.954       0.127    0.171    0.343
                LocLin      1.039    0.979    0.984       0.086    0.137    0.316
                GlbLin      0.981    0.981    0.992       0.096    0.180    0.373
Litter (3)      Nearest     1.370    1.014    1.000       0.145    0.222    0.419
                LocLin      1.301    0.960    0.989       0.099    0.179    0.361
                GlbLin      0.886    0.902    0.930       0.111    0.179    0.410
Mpg (9)         Nearest     1.384    1.250    1.084       0.112    0.165    0.398
                LocLin      1.550    1.524    1.081       0.074    0.093    0.271
                GlbLin      1.295    1.317    1.014       0.086    0.142    0.298
Nursing (6)     Nearest     1.315    1.128    0.998       0.102    0.172    0.327
                LocLin      1.171    1.106    1.063       0.072    0.121    0.260
                GlbLin      1.044    1.043    1.002       0.092    0.137    0.267
Places (8)      Nearest     1.367    1.000    1.000       0.118    0.154    0.364
                LocLin      0.998    1.017    0.993       0.071    0.112    0.316
                GlbLin      1.041    1.044    1.064       0.091    0.130    0.265
Sleep (7)       Nearest     1.098    0.883    0.981       0.143    0.165    0.361
                LocLin      1.170    0.852    0.922       0.090    0.113    0.273
                GlbLin      0.918    0.925    1.026       0.096    0.122    0.276
Strike (6)      Nearest     1.142    0.952    1.000       0.161    0.178    0.424
                LocLin      1.172    0.987    1.003       0.068    0.108    0.293
                GlbLin      1.004    0.992    0.993       0.093    0.166    0.310
Whitecell (13)  Nearest     0.854    0.718    0.906       0.100    0.138    0.288
                LocLin      1.259    0.821    0.931       0.077    0.088    0.254
                GlbLin      0.940    0.942    0.910       0.098    0.109    0.291
Mean over all   Nearest     1.142    1.001    0.978       0.122    0.163    0.365
twelve datasets LocLin      1.196    1.064    1.011       0.077    0.115    0.296
                GlbLin      1.029    1.025    0.995       0.091    0.138    0.301
TOTAL           -           1.122    1.030    0.995       0.097    0.138    0.321

To further confirm our conclusion, we do a third experiment. This time, we insert more independent
random noise and more corrupted features into the datasets. For example, if the original data set
consists of three input features, {U, V, W}, the new artificial data file contains {U, cU, V, cV,
cU*cV, W, cW, cV*cW, R1, ..., R40}. The results are listed in Table 7-4 and Table 7-5.

Table 7-4: Greediness comparison with more inputs

                          # Corrupt / Total Corrupts       # Noise / Total Noise
                Apprx.    Super   Greedy  RFS     FS       Super   Greedy  RFS     FS
Mean Values     Nearest   0.29    0.33    0.30    0.38     0.04    0.04    0.03    0.04
                LocLin    0.38    0.38    0.25    0.41     0.05    0.03    0.02    0.03
                GlbLin    0.38    0.25    0.29    0.16     0.05    0.05    0.08    0.07
TOTAL           -         0.35    0.32    0.28    0.32     0.05    0.04    0.04    0.05

Table 7-5: Greediness comparison with more inputs

                          20Fold( ) / 20Fold(FS)        Cost( ) / Cost(FS)
                Apprx.    Super    Greedy   RFS        Super    Greedy   RFS
Mean Values     Nearest   1.197    1.056    1.001      0.080    0.080    0.282
                LocLin    1.202    1.059    1.040      0.071    0.084    0.281
                GlbLin    1.032    1.026    0.998      0.079    0.104    0.294
TOTAL           -         1.144    1.047    1.013      0.077    0.088    0.286

Comparing Table 7-2 with Table 7-4, we notice that with more input features the probability that
any corrupted feature is selected remains almost the same, while that of the independent noise
features drops greatly. Comparing Table 7-3 with Table 7-5, with more input features, (1) the
prediction accuracies of the feature sets selected by the various algorithms are roughly
consistent, because the 20-fold scores in the two tables are almost the same; and (2) the
efficiency ratio of the greedier alternatives to FS is a little higher.

In summary, in theory the greediness of a feature selection algorithm may lead to a great
reduction in the accuracy of function approximation, but in practice this does not happen very
often. The three greedier algorithms we propose in this chapter improve the efficiency of the
forward selection algorithm, especially for larger datasets with high input dimensionality,
without significant loss in accuracy. Even when accuracy is more crucial than efficiency,
restricted forward selection is more competitive than the conventional forward selection.

7.5 Summary

In this chapter, we explore three greedier variants of the forward selection method. Our
investigation shows that the greediness of these feature selection algorithms greatly improves
efficiency without corrupting the selected feature set, so that the prediction accuracy using the
selected features remains satisfactory. As an application, we apply feature selection to a
prototype system for Chinese and Japanese handwriting recognition.
