An Empirical Study On The Stability of Feature Selection For Imbalanced Software Engineering Data
Algorithm 1: Threshold-based Feature Selection Algorithm

input:
  1. Dataset D with features F_j, j = 1, ..., m;
  2. Each instance x ∈ D is assigned to one of two classes c(x) ∈ {fp, nfp};
  3. The value of attribute F_j for instance x is denoted F_j(x);
  4. Metric ω ∈ {FM, OR, PO, PR, GI, MI, KS, DV, GM, AUC, PRC};
  5. A predefined threshold: the number (or percentage) of features to be selected.
output:
  Selected feature subsets.

for F_j, j = 1, ..., m do
  Normalize F_j → F̂_j = (F_j − min(F_j)) / (max(F_j) − min(F_j));
  Calculate metric ω using attribute F̂_j and the class attribute at various decision thresholds in the distribution of F̂_j; the optimal value is used as ω(F̂_j).
Create feature ranking R using ω(F̂_j) ∀ j.
Select features according to feature ranking R and the predefined threshold.

search group within WEKA [17]. The procedure is shown in Algorithm 1. First, each attribute's values are normalized between 0 and 1 by mapping F_j to F̂_j. The normalized values are treated as posterior probabilities. Each independent attribute is then paired individually with the class attribute, and the reduced two-attribute dataset is evaluated using 11 different performance metrics based on this set of "posterior probabilities." In standard binary classification, the predicted class is assigned using the default decision threshold of 0.5. The default decision threshold is often not optimal, especially when the class distribution is imbalanced. Therefore, we propose the use of performance metrics that can be calculated at various points in the distribution of F̂_j. At each threshold position, we classify values above the threshold as positive and values below it as negative. We then go in the opposite direction and consider values above the threshold as negative and values below it as positive. Whichever direction produces the more optimal performance metric value is used. The true positive (TPR), true negative (TNR), false positive (FPR), and false negative (FNR) rates can be calculated at each threshold t ∈ [0, 1] relative to the normalized attribute F̂_j. The threshold-based attribute ranking techniques we propose utilize these rates as described below (a brief illustrative code sketch follows the metric definitions).

• F-measure (FM): a single-value metric derived from the F-measure, which originated in the field of information retrieval [17]. The maximum F-measure is obtained when varying the decision threshold value between 0 and 1.

• Odds Ratio (OR): the ratio of the product of correct predictions (TPR times TNR) to incorrect predictions (FPR times FNR). The maximum value is taken when varying the decision threshold value between 0 and 1.

• Power (PO): a measure that avoids common false positive cases while giving stronger preference for positive cases [5]. Power is defined as

    PO = max_{t ∈ [0, 1]} [ (TNR(t))^k − (FNR(t))^k ], where k = 5.

• Probability Ratio (PR): the sample estimate of the probability of the feature given the positive class divided by the sample estimate of the probability of the feature given the negative class [5]. PR is the maximum value of this ratio when varying the decision threshold value between 0 and 1.

• Gini Index (GI): measures the impurity of a dataset [2]. GI for the attribute is the minimum Gini index over all decision thresholds t ∈ [0, 1].

• Mutual Information (MI): measures the mutual dependence of two random variables. High mutual information indicates a large reduction in uncertainty, while zero mutual information between two random variables means the variables are independent.

• Kolmogorov-Smirnov (KS): utilizes the Kolmogorov-Smirnov statistic to measure the maximum difference between the empirical distribution functions of the attribute values of instances in each class [9]. It is effectively the maximum difference between the curves generated by the true positive and false positive rates as the decision threshold changes between 0 and 1.

• Deviance (DV): the residual sum of squares based on a threshold t. That is, it measures the sum of the squared errors from the mean class given a partitioning of the space based on the threshold t, and the minimum value over all thresholds is chosen.

• Geometric Mean (GM): a single-value performance measure calculated as the maximum geometric mean of TPR and TNR as the decision threshold is varied between 0 and 1.

• Area Under the ROC (Receiver Operating Characteristic) Curve (AUC): has been widely used to measure classification model performance [4]. The ROC curve characterizes the trade-off between the true positive rate and the false positive rate. In this study, ROC curves are generated by varying the decision threshold t used to transform the normalized attribute values into a predicted class.

• Area Under the Precision-Recall Curve (PRC): a single-value measure that originated in the area of information retrieval. The area under the PRC ranges from 0 to 1. The PRC diagram depicts the trade-off between recall and precision.
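To make the threshold sweep concrete, the sketch below computes three of these rankers (KS, GM, and PO) for a single attribute and uses one of them to rank a few synthetic attributes. It is only an illustration of our reading of Algorithm 1, not the authors' WEKA implementation; the 101-point threshold grid, the synthetic data, and all function, variable, and metric names are our own choices.

    import numpy as np

    def rates(pred_pos, y):
        """TPR, TNR, FPR, FNR for boolean predictions pred_pos against labels y (1 = fp)."""
        pos, neg = (y == 1), (y == 0)
        tpr = pred_pos[pos].mean() if pos.any() else 0.0
        fpr = pred_pos[neg].mean() if neg.any() else 0.0
        return tpr, 1.0 - fpr, fpr, 1.0 - tpr

    def threshold_scores(f, y, thresholds=np.linspace(0.0, 1.0, 101)):
        """Normalize attribute f to [0, 1], sweep decision thresholds in both
        directions (above = positive, then above = negative), and keep the best
        KS, GM, and PO values found, in the spirit of Algorithm 1."""
        f = np.asarray(f, dtype=float)
        f_hat = (f - f.min()) / (f.max() - f.min() + 1e-12)      # F_j -> normalized F^_j
        k = 5                                                    # exponent of the Power metric
        best = {"KS": 0.0, "GM": 0.0, "PO": -1.0}
        for t in thresholds:
            for pred_pos in (f_hat > t, f_hat <= t):             # both classification directions
                tpr, tnr, fpr, fnr = rates(pred_pos, y)
                best["KS"] = max(best["KS"], abs(tpr - fpr))          # Kolmogorov-Smirnov
                best["GM"] = max(best["GM"], np.sqrt(tpr * tnr))      # geometric mean
                best["PO"] = max(best["PO"], tnr ** k - fnr ** k)     # Power, k = 5
        return best

    # Rank three synthetic attributes of an imbalanced dataset by their KS score.
    rng = np.random.default_rng(0)
    y = (rng.random(300) < 0.07).astype(int)        # roughly 7% fault-prone modules
    X = {"m1": rng.normal(2.0 * y, 1.0),            # strongly related to the class
         "m2": rng.normal(0.5 * y, 1.0),            # weakly related to the class
         "m3": rng.normal(0.0, 1.0, 300)}           # pure noise
    ranking = sorted(X, key=lambda a: threshold_scores(X[a], y)["KS"], reverse=True)
    print(ranking)                                  # most to least relevant under the KS ranker

The remaining metrics can be computed from the same four rates: FM, OR, PR, GI, and DV as further maxima or minima over the sweep, and AUC and PRC by integrating the curves that the sweep traces out.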
Table 1. Software Datasets Characteristics

Project      Data      #Metrics   #Modules   %fp      %nfp
Eclipse I    E2.0-10   209        377        6.1%     93.9%
             E2.1-5    209        434        7.83%    92.17%
             E3.0-10   209        661        6.2%     93.8%
Eclipse II   E2.0-3    209        377        26.79%   73.21%
             E2.1-2    209        434        28.8%    71.2%
             E3.0-3    209        661        23.75%   76.25%

metric datasets with different levels of class balance, using eighteen filter-based feature selection techniques, four different levels of dataset perturbation, and four different numbers of features chosen. Discussion and results from this case study are presented below.
Table 2. Average Stability for Class Balance

           Highly Imbalanced                  Imbalanced
           (feature subset size)              (feature subset size)
Ranker     3       4       5       6          3       4       5       6
CS         0.4615  0.4676  0.4686  0.4814     0.6201  0.6029  0.6101  0.6238
GR         0.2992  0.3081  0.3166  0.3315     0.5252  0.5175  0.5100  0.5018
IG         0.4881  0.4801  0.4782  0.4743     0.5884  0.5796  0.5858  0.6111
RF         0.8286  0.7685  0.7371  0.7092     0.7984  0.8332  0.8089  0.8173
RFW        0.8157  0.7680  0.7563  0.7591     0.6522  0.6605  0.6919  0.7006
SU         0.3593  0.3655  0.3832  0.3977     0.6034  0.5786  0.5658  0.5715
FM         0.5675  0.5554  0.5623  0.5828     0.6532  0.6884  0.6990  0.7123
OR         0.3362  0.4127  0.4540  0.4596     0.5932  0.6090  0.6292  0.6323
Pow        0.5754  0.5737  0.5922  0.6242     0.6976  0.6731  0.6654  0.6636
PR         0.2829  0.3062  0.3330  0.3589     0.4862  0.4964  0.5291  0.5760
GI         0.2926  0.3223  0.3566  0.3843     0.5008  0.4923  0.5369  0.5824
MI         0.5429  0.5303  0.5412  0.5570     0.6514  0.6488  0.6396  0.6429
KS         0.4838  0.5078  0.5257  0.5522     0.6753  0.6760  0.6701  0.6742
Dev        0.5579  0.5358  0.5320  0.5531     0.6571  0.6956  0.6913  0.6896
GM         0.4610  0.4853  0.5128  0.5352     0.6985  0.6765  0.6754  0.6788
AUC        0.6255  0.6649  0.6742  0.6885     0.7627  0.8421  0.7996  0.8106
PRC        0.6765  0.6984  0.6992  0.7123     0.7535  0.8412  0.8205  0.8132
S2N        0.5998  0.5878  0.6001  0.5727     0.8218  0.7285  0.6854  0.6794
Average    0.5141  0.5188  0.5291  0.5408     0.6522  0.6578  0.6563  0.6656

Table 3. Average Stability for Perturbation

           Degree of Perturbation
Ranker     0.67    0.8     0.9     0.95
CS         0.4066  0.5019  0.5799  0.6796
GR         0.2894  0.3696  0.4406  0.5554
IG         0.4287  0.4821  0.5713  0.6606
RF         0.6505  0.7452  0.8461  0.9088
RFW        0.5635  0.6760  0.7966  0.8660
SU         0.3678  0.4355  0.5201  0.5891
FM         0.4676  0.5824  0.6935  0.7670
OR         0.3665  0.4322  0.5670  0.6974
Pow        0.4863  0.5805  0.6946  0.7712
PR         0.2598  0.3553  0.4811  0.5882
GI         0.2667  0.3566  0.4868  0.6240
MI         0.4311  0.5327  0.6624  0.7508
KS         0.4221  0.5274  0.6692  0.7639
Dev        0.4516  0.5638  0.6807  0.7600
GM         0.4111  0.5164  0.6621  0.7721
AUC        0.5968  0.7049  0.7861  0.8462
PRC        0.6028  0.7221  0.8154  0.8671
S2N        0.4966  0.6067  0.7352  0.7994
Average    0.4425  0.5384  0.6494  0.7370
Figure 1. Stability: Class Balance

are. When we apply the consistency index to the two derivative datasets, we can use the resulting consistency measurement as a measure of stability for the feature ranker.
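To make this concrete, the following sketch computes such a pairwise stability value, assuming Kuncheva's consistency index [11], (r·n − k²) / (k·(n − k)) for two size-k subsets that share r features out of n candidates, and averages it over all pairs of subsets. The function names and example metric names are ours and purely illustrative.

    from itertools import combinations

    def consistency_index(a, b, n):
        """Kuncheva's consistency index for two equally sized feature subsets a and b
        drawn from n candidate features: (r*n - k*k) / (k*(n - k)), with r = |a & b|."""
        a, b = set(a), set(b)
        k, r = len(a), len(a & b)
        return (r * n - k * k) / (k * (n - k))

    def average_stability(subsets, n):
        """Average the pairwise consistency index over every pair of selected subsets,
        e.g. the 30 * 29 / 2 = 435 pairs produced by 30 repetitions."""
        pairs = list(combinations(subsets, 2))
        return sum(consistency_index(a, b, n) for a, b in pairs) / len(pairs)

    # Three hypothetical top-4 subsets selected from n = 209 software metrics.
    subsets = [{"loc", "wmc", "cbo", "rfc"},
               {"loc", "wmc", "cbo", "dit"},
               {"loc", "wmc", "rfc", "noc"}]
    print(average_stability(subsets, n=209))    # 1.0 would mean identical subsets every time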
5.4 Results and Analysis

The experiments were conducted to discover the robustness (stability) of the 18 rankers and the impact of feature subset size, class balance of the dataset, and degree of perturbation. We first generated 720 subsamples (6 original datasets × 4 levels of perturbation × 30 repetitions). For each subsample, we applied each feature selection method and then generated a ranking list. The top three, four, five, and six features were selected according to their respective scores. For each combination of dataset, degree of perturbation, ranker, and feature subset size, 435 stability values were computed (there are 435 pairs of subsamples from the 30 repetitions, since 30 × 29 / 2 = 435). In total, 751,680 consistency index values were computed (6 original datasets × 4 degrees of perturbation × 18 rankers × 4 feature subset sizes × 435 pairs of subsamples).
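These counts can be reproduced with a few lines of bookkeeping (a trivial sketch; the variable names are ours):

    from math import comb

    repetitions = 30
    pairs_per_setting = comb(repetitions, 2)     # 30 * 29 / 2 = 435 pairs of subsamples
    datasets, perturbation_levels, rankers, subset_sizes = 6, 4, 18, 4
    total_values = datasets * perturbation_levels * rankers * subset_sizes * pairs_per_setting
    print(pairs_per_setting, total_values)       # 435 751680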
Tables 2 through 4 and Figures 1 through 3 contain the results from the experiments. Each table focuses on one of the following three factors: class balance of the dataset, degree of perturbation, and feature subset size.

We used two different levels of class balance in the experiments: highly imbalanced (> 92% not-fault-prone) and imbalanced (≤ 92% and ≥ 71% not-fault-prone). Table 2 and Figure 1 show that, with few exceptions, as the imbalance between the classes decreases, stability increases. This holds regardless of feature subset size.

Table 3 and Figure 2 show the effect of the degree of dataset perturbation on the stability of the feature ranking techniques across all six datasets and the four sizes of feature subsets. The figure demonstrates that the more instances retained in a dataset (i.e., the fewer instances deleted from the original dataset), the more stable the feature ranking on that dataset will be.

Table 4 shows that the size of the feature subset can influence the stability of a feature ranking technique. For most rankers, stability improves as the number of features in the selected subset increases. Figure 3 shows the average stability (average stability values on the y-axis) of each ranker, averaged across all six datasets, at the 95% perturbation level and with feature subset size four. The figure shows that, among the 18 rankers, GR exhibits the least stability on average, while RF, PRC, and AUC exhibit the most.

6 Conclusion

Feature selection is an important preprocessing step when performing data mining tasks such as classification and prediction. Because some feature selection techniques may provide different outputs when applied to a dataset which has been changed in some way, it is important to understand the stability of feature selection techniques in order to choose those techniques and parameters which give the most reliable results. In this study we conducted a stability analysis of eighteen different feature selection techniques. Six datasets with different levels of class balance were used.
Experimental results show that four factors affect the stability performance of rankers: class balance, degree of perturbation, feature subset size, and the choice of ranker. Regarding subset size, there is a general trend of more features in the selected subset producing higher stability values. With few exceptions, as the imbalance between the classes decreases, stability increases. We also found that as the amount of perturbation increases, stability decreases. This is intuitive: the fewer instances removed from (or, equivalently, added to) a given dataset, the less the selected features will change when compared to each other. Lastly, in terms of the choice of ranker, we believe the most stable rankers are RF, PRC, and AUC, while GR is the least stable ranker. Thus, we would recommend these three rankers for use by software quality practitioners who wish to find reliable feature subsets which are not sensitive to minor changes in the data.
Future work will involve conducting additional empirical studies with data from other software projects and application domains, and experiments with other ranking techniques and classifiers for building classification models.

Figure 2. Stability: Degree of Perturbation

Table 4. Average Stability for Feature Subset Size, 95% Perturbation

           Feature Subset Size
Ranker     3       4       5       6
CS         0.6755  0.6676  0.6827  0.6924
GR         0.5523  0.5533  0.5600  0.5561
IG         0.6600  0.6506  0.6622  0.6697
RF         0.9361  0.9344  0.8826  0.8820
RFW        0.8773  0.8455  0.8730  0.8683
SU         0.6037  0.5763  0.5850  0.5912
FM         0.7306  0.7500  0.7787  0.8087
OR         0.6101  0.7016  0.7498  0.7280
Pow        0.7701  0.7608  0.7668  0.7871
PR         0.5566  0.5582  0.5913  0.6467
GI         0.5878  0.5872  0.6344  0.6867
MI         0.7579  0.7478  0.7437  0.7539
KS         0.7445  0.7547  0.7673  0.7890
Dev        0.7439  0.7662  0.7643  0.7655
GM         0.7636  0.7584  0.7798  0.7866
AUC        0.8039  0.8742  0.8511  0.8557
PRC        0.8293  0.8895  0.8800  0.8694
S2N        0.8324  0.8012  0.7910  0.7730
Average    0.7242  0.7321  0.7413  0.7505

Figure 3. Stability: Rankers

References

[1] T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, and Y. Saeys. Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics, 26(3):392–398, February 2010.

[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Chapman and Hall/CRC Press, Boca Raton, FL, 1984.

[3] Z. Chen, T. Menzies, D. Port, and B. Boehm. Finding the right data for software cost modeling. IEEE Software, 22(6):38–46, 2005.

[4] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006.

[5] G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–1305, 2003.

[6] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, March 2003.

[7] M. A. Hall and L. A. Smith. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper. In Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference, pages 235–239, May 1999.

[8] A. Kalousis, J. Prados, and M. Hilario. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowledge and Information Systems, 12(1):95–116, December 2006.

[9] T. M. Khoshgoftaar and N. Seliya. Fault-prediction modeling for software quality estimation: Comparing commonly used techniques. Empirical Software Engineering Journal, 8(3):255–283, 2003.
[10] I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In European Conference on Machine Learning, pages 171–182. Springer Verlag, 1994.

[11] L. I. Kuncheva. A stability index for feature selection. In Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications, pages 390–395, Anaheim, CA, USA, 2007. ACTA Press.

[12] H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

[13] D. Rodriguez, R. Ruiz, J. Cuadrado-Gallego, and J. Aguilar-Ruiz. Detecting fault modules applying feature selection to classifiers. In Proceedings of the 8th IEEE International Conference on Information Reuse and Integration, pages 667–672, Las Vegas, Nevada, August 13-15, 2007.

[14] J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning, pages 935–942, Corvallis, OR, USA, June 2007.

[15] H. Wang, T. M. Khoshgoftaar, and N. Seliya. How many software metrics should be selected for defect prediction? In Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference, pages 69–74, May 2011.

[16] M. Wasikowski and X.-W. Chen. Combating the small sample class imbalance problem using feature selection. IEEE Transactions on Knowledge and Data Engineering, 22:1388–1400, 2010.

[17] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2nd edition, 2005.

[18] T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for eclipse. In ICSEW '07: Proceedings of the 29th International Conference on Software Engineering Workshops, page 76, Washington, DC, USA, 2007. IEEE Computer Society.