An Empirical Study On The Stability of Feature Selection For Imbalanced Software Engineering Data
Algorithm 1: Threshold-based Feature Selection Algorithm

input:
  1. Dataset D with features F_j, j = 1, ..., m;
  2. Each instance x ∈ D is assigned to one of two classes c(x) ∈ {fp, nfp};
  3. The value of attribute F_j for instance x is denoted F_j(x);
  4. Metric ω ∈ {FM, OR, PO, PR, GI, MI, KS, DV, GM, AUC, PRC};
  5. A predefined threshold: the number (or percentage) of features to be selected.
output:
  Selected feature subsets.

for F_j, j = 1, ..., m do
  Normalize F_j → F̂_j = (F_j − min(F_j)) / (max(F_j) − min(F_j));
  Calculate metric ω using attribute F̂_j and the class attribute at various decision thresholds in the distribution of F̂_j; the optimal value is used as ω(F̂_j).
Create feature ranking R using ω(F̂_j) ∀ j.
Select features according to feature ranking R and the predefined threshold.

search group within WEKA [17]. The procedure is shown in Algorithm 1. First, each attribute's values are normalized between 0 and 1 by mapping F_j to F̂_j. The normalized values are treated as posterior probabilities. Each independent attribute is then paired individually with the class attribute, and the reduced two-attribute dataset is evaluated using 11 different performance metrics based on this set of "posterior probabilities." In standard binary classification, the predicted class is assigned using the default decision threshold of 0.5. The default decision threshold is often not optimal, especially when the class distribution is imbalanced. Therefore, we propose the use of performance metrics that can be calculated at various points in the distribution of F̂_j. At each threshold position, we classify values above the threshold as positive and values below it as negative. We then go in the opposite direction and consider values above the threshold as negative and values below it as positive. Whichever direction produces the more optimal performance metric value is used. The true positive (TPR), true negative (TNR), false positive (FPR), and false negative (FNR) rates can be calculated at each threshold t ∈ [0, 1] relative to the normalized attribute F̂_j. The threshold-based attribute ranking techniques we propose utilize these rates as described below (a brief illustrative code sketch follows the metric definitions).

• F-measure (FM): a single-value metric derived from the F-measure, which originated in the field of information retrieval [17]. The maximum F-measure is obtained when varying the decision threshold value between 0 and 1.

• Odds Ratio (OR): the ratio of the product of correct predictions (TPR times TNR) to incorrect predictions (FPR times FNR). The maximum value is taken when varying the decision threshold value between 0 and 1.

• Power (PO): a measure that avoids common false positive cases while giving stronger preference for positive cases [5]. Power is defined as

    PO = max_{t ∈ [0, 1]} [ (TNR(t))^k − (FNR(t))^k ], where k = 5.

• Probability Ratio (PR): the sample estimate of the probability of the feature given the positive class divided by the sample estimate of the probability of the feature given the negative class [5]. PR is the maximum value of this ratio when varying the decision threshold value between 0 and 1.

• Gini Index (GI): measures the impurity of a dataset [2]. GI for the attribute is the minimum Gini index over all decision thresholds t ∈ [0, 1].

• Mutual Information (MI): measures the mutual dependence of two random variables. High mutual information indicates a large reduction in uncertainty, while zero mutual information between two random variables means the variables are independent.

• Kolmogorov-Smirnov (KS): utilizes the Kolmogorov-Smirnov statistic to measure the maximum difference between the empirical distribution functions of the attribute values of instances in each class [9]. It is effectively the maximum difference between the curves generated by the true positive and false positive rates as the decision threshold changes between 0 and 1.

• Deviance (DV): the residual sum of squares based on a threshold t. That is, it measures the sum of the squared errors from the mean class given a partitioning of the space based on the threshold t, and the minimum value over all thresholds is chosen.

• Geometric Mean (GM): a single-value performance measure calculated as the maximum geometric mean of TPR and TNR as the decision threshold is varied between 0 and 1.

• Area Under the ROC (Receiver Operating Characteristic) Curve (AUC): has been widely used to measure classification model performance [4]. The ROC curve characterizes the trade-off between the true positive rate and the false positive rate. In this study, ROC curves are generated by varying the decision threshold t used to transform the normalized attribute values into a predicted class.

• Area Under the Precision-Recall Curve (PRC): a single-value measure that originated in the area of information retrieval. The area under the PRC ranges from 0 to 1. The PRC diagram depicts the trade-off between recall and precision.
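To make the threshold sweep concrete, the sketch below computes three of these rankers (KS, GM, and PO) for a single attribute and uses one of them to rank a few synthetic attributes. It is only an illustration of our reading of Algorithm 1, not the authors' WEKA implementation; the 101-point threshold grid, the synthetic data, and all function, variable, and metric names are our own choices.

    import numpy as np

    def rates(pred_pos, y):
        """TPR, TNR, FPR, FNR for boolean predictions pred_pos against labels y (1 = fp)."""
        pos, neg = (y == 1), (y == 0)
        tpr = pred_pos[pos].mean() if pos.any() else 0.0
        fpr = pred_pos[neg].mean() if neg.any() else 0.0
        return tpr, 1.0 - fpr, fpr, 1.0 - tpr

    def threshold_scores(f, y, thresholds=np.linspace(0.0, 1.0, 101)):
        """Normalize attribute f to [0, 1], sweep decision thresholds in both
        directions (above = positive, then above = negative), and keep the best
        KS, GM, and PO values found, in the spirit of Algorithm 1."""
        f = np.asarray(f, dtype=float)
        f_hat = (f - f.min()) / (f.max() - f.min() + 1e-12)      # F_j -> normalized F^_j
        k = 5                                                    # exponent of the Power metric
        best = {"KS": 0.0, "GM": 0.0, "PO": -1.0}
        for t in thresholds:
            for pred_pos in (f_hat > t, f_hat <= t):             # both classification directions
                tpr, tnr, fpr, fnr = rates(pred_pos, y)
                best["KS"] = max(best["KS"], abs(tpr - fpr))          # Kolmogorov-Smirnov
                best["GM"] = max(best["GM"], np.sqrt(tpr * tnr))      # geometric mean
                best["PO"] = max(best["PO"], tnr ** k - fnr ** k)     # Power, k = 5
        return best

    # Rank three synthetic attributes of an imbalanced dataset by their KS score.
    rng = np.random.default_rng(0)
    y = (rng.random(300) < 0.07).astype(int)        # roughly 7% fault-prone modules
    X = {"m1": rng.normal(2.0 * y, 1.0),            # strongly related to the class
         "m2": rng.normal(0.5 * y, 1.0),            # weakly related to the class
         "m3": rng.normal(0.0, 1.0, 300)}           # pure noise
    ranking = sorted(X, key=lambda a: threshold_scores(X[a], y)["KS"], reverse=True)
    print(ranking)                                  # most to least relevant under the KS ranker

The remaining metrics can be computed from the same four rates: FM, OR, PR, GI, and DV as further maxima or minima over the sweep, and AUC and PRC by integrating the curves that the sweep traces out.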
Table 1. Software Datasets Characteristics

Project      Data      #Metrics   #Modules   %fp      %nfp
Eclipse I    E2.0-10   209        377        6.1%     93.9%
             E2.1-5    209        434        7.83%    92.17%
             E3.0-10   209        661        6.2%     93.8%
Eclipse II   E2.0-3    209        377        26.79%   73.21%
             E2.1-2    209        434        28.8%    71.2%
             E3.0-3    209        661        23.75%   76.25%

metric datasets with different levels of class balance, using eighteen filter-based feature selection techniques, four different levels of dataset perturbation, and four different numbers of features chosen. Discussion and results from this case study are presented below.
Table 2. Average Stability for Class Balance

           Highly Imbalanced                  Imbalanced
           (feature subset size)              (feature subset size)
Ranker     3       4       5       6          3       4       5       6
CS         0.4615  0.4676  0.4686  0.4814     0.6201  0.6029  0.6101  0.6238
GR         0.2992  0.3081  0.3166  0.3315     0.5252  0.5175  0.5100  0.5018
IG         0.4881  0.4801  0.4782  0.4743     0.5884  0.5796  0.5858  0.6111
RF         0.8286  0.7685  0.7371  0.7092     0.7984  0.8332  0.8089  0.8173
RFW        0.8157  0.7680  0.7563  0.7591     0.6522  0.6605  0.6919  0.7006
SU         0.3593  0.3655  0.3832  0.3977     0.6034  0.5786  0.5658  0.5715
FM         0.5675  0.5554  0.5623  0.5828     0.6532  0.6884  0.6990  0.7123
OR         0.3362  0.4127  0.4540  0.4596     0.5932  0.6090  0.6292  0.6323
Pow        0.5754  0.5737  0.5922  0.6242     0.6976  0.6731  0.6654  0.6636
PR         0.2829  0.3062  0.3330  0.3589     0.4862  0.4964  0.5291  0.5760
GI         0.2926  0.3223  0.3566  0.3843     0.5008  0.4923  0.5369  0.5824
MI         0.5429  0.5303  0.5412  0.5570     0.6514  0.6488  0.6396  0.6429
KS         0.4838  0.5078  0.5257  0.5522     0.6753  0.6760  0.6701  0.6742
Dev        0.5579  0.5358  0.5320  0.5531     0.6571  0.6956  0.6913  0.6896
GM         0.4610  0.4853  0.5128  0.5352     0.6985  0.6765  0.6754  0.6788
AUC        0.6255  0.6649  0.6742  0.6885     0.7627  0.8421  0.7996  0.8106
PRC        0.6765  0.6984  0.6992  0.7123     0.7535  0.8412  0.8205  0.8132
S2N        0.5998  0.5878  0.6001  0.5727     0.8218  0.7285  0.6854  0.6794
Average    0.5141  0.5188  0.5291  0.5408     0.6522  0.6578  0.6563  0.6656

Table 3. Average Stability for Perturbation

           Degree of Perturbation
Ranker     0.67    0.8     0.9     0.95
CS         0.4066  0.5019  0.5799  0.6796
GR         0.2894  0.3696  0.4406  0.5554
IG         0.4287  0.4821  0.5713  0.6606
RF         0.6505  0.7452  0.8461  0.9088
RFW        0.5635  0.6760  0.7966  0.8660
SU         0.3678  0.4355  0.5201  0.5891
FM         0.4676  0.5824  0.6935  0.7670
OR         0.3665  0.4322  0.5670  0.6974
Pow        0.4863  0.5805  0.6946  0.7712
PR         0.2598  0.3553  0.4811  0.5882
GI         0.2667  0.3566  0.4868  0.6240
MI         0.4311  0.5327  0.6624  0.7508
KS         0.4221  0.5274  0.6692  0.7639
Dev        0.4516  0.5638  0.6807  0.7600
GM         0.4111  0.5164  0.6621  0.7721
AUC        0.5968  0.7049  0.7861  0.8462
PRC        0.6028  0.7221  0.8154  0.8671
S2N        0.4966  0.6067  0.7352  0.7994
Average    0.4425  0.5384  0.6494  0.7370
Figure 1. Stability: Class Balance

are. When we apply the consistency index to the two derivative datasets, we can use the resulting consistency measurement as a measure of stability for the feature ranker.
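To make this concrete, the following sketch computes such a pairwise stability value, assuming Kuncheva's consistency index [11], (r·n − k²) / (k·(n − k)) for two size-k subsets that share r features out of n candidates, and averages it over all pairs of subsets. The function names and example metric names are ours and purely illustrative.

    from itertools import combinations

    def consistency_index(a, b, n):
        """Kuncheva's consistency index for two equally sized feature subsets a and b
        drawn from n candidate features: (r*n - k*k) / (k*(n - k)), with r = |a & b|."""
        a, b = set(a), set(b)
        k, r = len(a), len(a & b)
        return (r * n - k * k) / (k * (n - k))

    def average_stability(subsets, n):
        """Average the pairwise consistency index over every pair of selected subsets,
        e.g. the 30 * 29 / 2 = 435 pairs produced by 30 repetitions."""
        pairs = list(combinations(subsets, 2))
        return sum(consistency_index(a, b, n) for a, b in pairs) / len(pairs)

    # Three hypothetical top-4 subsets selected from n = 209 software metrics.
    subsets = [{"loc", "wmc", "cbo", "rfc"},
               {"loc", "wmc", "cbo", "dit"},
               {"loc", "wmc", "rfc", "noc"}]
    print(average_stability(subsets, n=209))    # 1.0 would mean identical subsets every time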
5.4 Results and Analysis

The experiments were conducted to discover the robustness (stability) of the 18 rankers and the impact of feature subset size, class balance of the dataset, and degree of perturbation. We first generated 720 subsamples (6 original datasets × 4 levels of perturbation × 30 repetitions). For each subsample, we applied each feature selection method and then generated a ranking list. The top three, four, five, and six features were selected according to their respective scores. For each combination of dataset, degree of perturbation, ranker, and feature subset size, 435 stability values were computed (there are 435 pairs of subsamples from the 30 repetitions, since 30 × 29 / 2 = 435). In total, 751,680 consistency index values were computed (6 original datasets × 4 degrees of perturbation × 18 rankers × 4 feature subset sizes × 435 pairs of subsamples).
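These counts can be reproduced with a few lines of bookkeeping (a trivial sketch; the variable names are ours):

    from math import comb

    repetitions = 30
    pairs_per_setting = comb(repetitions, 2)     # 30 * 29 / 2 = 435 pairs of subsamples
    datasets, perturbation_levels, rankers, subset_sizes = 6, 4, 18, 4
    total_values = datasets * perturbation_levels * rankers * subset_sizes * pairs_per_setting
    print(pairs_per_setting, total_values)       # 435 751680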
Tables 2 through 4 and Figures 1 through 3 contain the results from the experiments. Each table focuses on one of the following three factors: class balance of the dataset, degree of perturbation, and feature subset size.

We used two different levels of class balance in the experiments: highly imbalanced (> 92% not-fault-prone) and imbalanced (≤ 92% and ≥ 71% not-fault-prone). Table 2 and Figure 1 show that, with few exceptions, as the imbalance between the classes decreases, stability increases. This holds regardless of feature subset size.

Table 3 and Figure 2 show the effect of the degree of dataset perturbation on the stability of the feature ranking techniques across all six datasets and the four sizes of feature subsets. The figure demonstrates that the more instances retained in a dataset (i.e., the fewer instances deleted from the original dataset), the more stable the feature ranking on that dataset will be.

Table 4 shows that the size of the feature subset can influence the stability of a feature ranking technique. For most rankers, stability improves as the number of features in the selected subset increases. Figure 3 shows the average stability (average stability values on the y-axis) of each ranker, averaged across all six datasets, at the 95% perturbation level and with feature subset size four. The figure shows that, among the 18 rankers, GR exhibits the least stability on average, while RF, PRC, and AUC exhibit the most.

6 Conclusion

Feature selection is an important preprocessing step when performing data mining tasks such as classification and prediction. Because some feature selection techniques may provide different outputs when applied to a dataset which has been changed in some way, it is important to understand the stability of feature selection techniques in order to choose those techniques and parameters which give the most reliable results. In this study we conducted a stability analysis of eighteen different feature selection techniques. Six datasets with different levels of class balance were used.
Experimental results show that four factors affect the stability performance of rankers: class balance, degree of perturbation, feature subset size, and the choice of ranker. Regarding subset size, there is a general trend of more features in the selected subset producing higher stability values. With few exceptions, as the imbalance between the classes decreases, stability increases. We also found that as the amount of perturbation increases, stability decreases. This is intuitive: the fewer instances removed from (or, equivalently, added to) a given dataset, the less the selected features will change when compared to each other. Lastly, in terms of the choice of ranker, we believe the most stable rankers are RF, PRC, and AUC, while GR is the least stable ranker. Thus, we would recommend these three rankers for use by software quality practitioners who wish to find reliable feature subsets which are not sensitive to minor changes in the data.
Future work will involve conducting additional empirical studies with data from other software projects and application domains, and experiments with other ranking techniques and classifiers for building classification models.

Figure 2. Stability: Degree of Perturbation

Table 4. Average Stability for Feature Subset Size, 95% Perturbation

           Feature Subset Size
Ranker     3       4       5       6
CS         0.6755  0.6676  0.6827  0.6924
GR         0.5523  0.5533  0.5600  0.5561
IG         0.6600  0.6506  0.6622  0.6697
RF         0.9361  0.9344  0.8826  0.8820
RFW        0.8773  0.8455  0.8730  0.8683
SU         0.6037  0.5763  0.5850  0.5912
FM         0.7306  0.7500  0.7787  0.8087
OR         0.6101  0.7016  0.7498  0.7280
Pow        0.7701  0.7608  0.7668  0.7871
PR         0.5566  0.5582  0.5913  0.6467
GI         0.5878  0.5872  0.6344  0.6867
MI         0.7579  0.7478  0.7437  0.7539
KS         0.7445  0.7547  0.7673  0.7890
Dev        0.7439  0.7662  0.7643  0.7655
GM         0.7636  0.7584  0.7798  0.7866
AUC        0.8039  0.8742  0.8511  0.8557
PRC        0.8293  0.8895  0.8800  0.8694
S2N        0.8324  0.8012  0.7910  0.7730
Average    0.7242  0.7321  0.7413  0.7505

Figure 3. Stability: Rankers

References

[1] T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, and Y. Saeys. Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics, 26(3):392–398, February 2010.

[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Chapman and Hall/CRC Press, Boca Raton, FL, 1984.

[3] Z. Chen, T. Menzies, D. Port, and B. Boehm. Finding the right data for software cost modeling. IEEE Software, 22(6):38–46, 2005.

[4] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006.

[5] G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–1305, 2003.

[6] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, March 2003.

[7] M. A. Hall and L. A. Smith. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper. In Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference, pages 235–239, May 1999.

[8] A. Kalousis, J. Prados, and M. Hilario. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowledge and Information Systems, 12(1):95–116, December 2006.

[9] T. M. Khoshgoftaar and N. Seliya. Fault-prediction modeling for software quality estimation: Comparing commonly used techniques. Empirical Software Engineering Journal, 8(3):255–283, 2003.
[10] I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In European Conference on Machine Learning, pages 171–182. Springer Verlag, 1994.

[11] L. I. Kuncheva. A stability index for feature selection. In Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications, pages 390–395, Anaheim, CA, USA, 2007. ACTA Press.

[12] H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

[13] D. Rodriguez, R. Ruiz, J. Cuadrado-Gallego, and J. Aguilar-Ruiz. Detecting fault modules applying feature selection to classifiers. In Proceedings of the 8th IEEE International Conference on Information Reuse and Integration, pages 667–672, Las Vegas, Nevada, August 13-15, 2007.

[14] J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning, pages 935–942, Corvallis, OR, USA, June 2007.

[15] H. Wang, T. M. Khoshgoftaar, and N. Seliya. How many software metrics should be selected for defect prediction? In Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference, pages 69–74, May 2011.

[16] M. Wasikowski and X.-W. Chen. Combating the small sample class imbalance problem using feature selection. IEEE Transactions on Knowledge and Data Engineering, 22:1388–1400, 2010.

[17] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2nd edition, 2005.

[18] T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for eclipse. In ICSEW '07: Proceedings of the 29th International Conference on Software Engineering Workshops, page 76, Washington, DC, USA, 2007. IEEE Computer Society.