A Study On Software Effort Prediction Using Machine Learning Techniques
Abstract. This paper presents a study of software effort prediction using machine
learning techniques. Both supervised and unsupervised learning techniques are
employed to predict software effort from historical datasets. For unsupervised
learning, k-medoids clustering equipped with different similarity measures is used
to cluster the projects in a historical dataset. For supervised learning, the J48
decision tree, back-propagation neural network (BPNN) and naïve Bayes are used
to classify software projects into different effort classes. We also impute missing
values in the historical datasets before applying the machine learning techniques to
predict software effort. Experiments on the ISBSG and CSBSG datasets demonstrate
that unsupervised learning with k-medoids clustering produces poor performance,
and that the Kulzinsky coefficient performs best in measuring the similarity of
projects. Supervised learning techniques produce better performance than
unsupervised learning techniques in software effort prediction, with BPNN performing
best among the three supervised learning techniques. Missing data imputation
improves the performance of both unsupervised and supervised learning techniques
in software effort prediction.
1 Introduction
The task of software effort prediction is to estimate the effort needed to develop a soft-
ware artifact [17]. Overestimating software effort may lead to a tight development
schedule and to faults remaining in the system after delivery, whereas underestimating
the effort may lead to delayed delivery and complaints from customers. The importance
of software development effort prediction has motivated the construction of prediction
models that estimate the needed effort as accurately as possible.
Current software effort prediction techniques can be categorized into four types: em-
pirical, regression, theory-based, and machine learning techniques [2]. Machine learning
(ML) techniques, such as artificial neural networks (ANN), decision trees, and naïve
Bayes, learn patterns (knowledge) from historical project data and use these patterns
for effort prediction. Recent studies [2] [3] provide detailed reviews of work on
predicting software development effort.
The primary concern of this paper is the use of machine learning techniques to predict
software effort. Although COCOMO provides a viable solution to effort estimation
by building an analytic model, machine learning techniques such as naïve Bayes
and artificial neural networks offer alternative approaches that make use of knowledge
learned from historical projects. Machine learning techniques may not be the best
solution for effort estimation, but we believe that project managers can at least use
them to complement other models. Especially in an intensely competitive software
market, accurate estimation of software development effort has a decisive effect on
the success of a software project. Consequently, effort estimation using different
techniques, together with risk assessment of budget overrun, is necessary for a trust-
worthy software project [5].
The basic idea of using machine learning techniques for effort prediction is that a
historical dataset contains many projects, each described by features whose values
characterize the project, and that similar feature values tend to induce similar project
efforts. The task of a machine learning method is to learn the inherent patterns of
feature values and their relations with project effort, which can then be used to predict
the effort of new projects.
The rest of this paper is organized as follows. Section 2 introduces the machine learn-
ing techniques applied to software effort prediction; both unsupervised and supervised
learning techniques are described. Section 3 reports experiments that examine the
effectiveness of machine learning techniques on software effort prediction; the datasets
used in the experiments and the performance measures for unsupervised and supervised
learning techniques are also introduced, and the experimental results are illustrated with
explanations. Section 4 presents the threats to the validity of this research. Section 5
reviews related work. Section 6 concludes this paper.
data set. Whenever it encounters a set of boolean vectors (the training set), it identifies
the variable with the largest information gain [14]. Among the possible values of this
variable, if there is any value for which there is no ambiguity, that is, for which all
projects falling within this value have the same effort label, then we terminate that
branch and assign that effort label to the terminal node.
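To make the splitting rule concrete, the sketch below (our assumption, not the authors' Weka configuration) uses scikit-learn's decision tree with the entropy criterion, which chooses the split with the largest information gain, on placeholder boolean project vectors.

```python
# A minimal sketch, not the authors' setup: scikit-learn's DecisionTreeClassifier
# with the "entropy" criterion splits on the attribute with the largest
# information gain, approximating the behaviour of J48/C4.5 on boolean
# project vectors. The data here is a random placeholder.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(249, 99))                  # 0/1 project descriptors
y = rng.choice(["low", "medium", "high"], size=249)     # effort classes

j48_like = DecisionTreeClassifier(criterion="entropy")  # split by information gain
j48_like.fit(X, y)
print(j48_like.predict(X[:5]))
```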
The back-propagation neural network (BPNN) [15] is also used to classify the software
projects in both the ISBSG and CSBSG data sets. BPNN performs two sweeps through
the network: first a forward sweep from the input layer to the output layer, and then
a backward sweep from the output layer to the input layer. The backward sweep is
similar to the forward sweep, except that error values are propagated back through the
network to determine how the weights of the neurons should be changed during training.
The objective of training is to find the set of network weights that yields a prediction
model with minimum error.
A three-layer, fully connected feed-forward network consisting of an input layer,
a hidden layer and an output layer is adopted in the experiments. The tan-sigmoid
transfer function is used in the hidden layer with 5 nodes, and the pure linear function
in the output layer with 3 nodes [17]. The network of the BPNN is designed as shown
in Figure 1.
Fig. 1. BPNN with 5 nodes in hidden layer and 3 nodes in output layer
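As a rough, non-authoritative sketch of such a network (our assumption: scikit-learn's MLPClassifier, with tanh hidden units standing in for the tan-sigmoid transfer function and a softmax output replacing the pure linear output layer), the configuration could look as follows; the data here is a random placeholder.

```python
# A rough stand-in for the three-layer BPNN of Figure 1, not the authors' code.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(249, 99)).astype(float)   # placeholder boolean vectors
y = rng.choice(["low", "medium", "high"], size=249)    # 3 effort classes

bpnn = MLPClassifier(hidden_layer_sizes=(5,),  # single hidden layer with 5 nodes
                     activation="tanh",        # tan-sigmoid-like transfer function
                     solver="sgd",             # weights adjusted by back-propagated error
                     max_iter=2000, random_state=1)
bpnn.fit(X, y)
print(bpnn.score(X, y))                        # training accuracy of the fitted network
```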
3 Experiments
3.1 The Data Sets
We employed two data sets to investigate the predictability of software effort using
machine learning techniques. One is the ISBSG (International Software Benchmarking
Standards Group) data set (http://www.isbsg.org) and the other is the CSBSG
(Chinese Software Benchmarking Standard Group) data set [7].
ISBSG Data Set. The ISBSG data set contains 1238 projects from insurance, government,
and other sectors in 20 different countries, and each project is described by 70 attributes.
To make the data set suitable for the experiments, we conducted three kinds of
preprocessing: data pruning, data discretization and adding dummy variables.
We pruned the ISBSG data set to 249 projects with 22 attributes using the criterion
that each project must have at least 2/3 of its attribute values observed and, for
each attribute, its values must be observed on at least 2/3 of the projects. We adopted
this criterion for data selection because too many missing values would deteriorate the
performance of most machine learning techniques, making a convincing evaluation of
software effort prediction impossible. Among the 22 attributes, 18 are nominal and
4 are continuous. Table 2 lists the attributes used in the ISBSG data set.
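A minimal pandas sketch of this pruning rule is given below; the DataFrame and file name are hypothetical, and the real ISBSG field names differ.

```python
# Keep an attribute only if it is observed on at least 2/3 of the projects,
# then keep a project only if at least 2/3 of the remaining attributes are
# observed for it (a sketch of the selection criterion described above).
import pandas as pd

def prune(projects: pd.DataFrame) -> pd.DataFrame:
    kept_attrs = projects.columns[projects.notna().mean(axis=0) >= 2 / 3]
    pruned = projects[kept_attrs]
    kept_rows = pruned.notna().mean(axis=1) >= 2 / 3
    return pruned[kept_rows]

# pruned = prune(pd.read_csv("isbsg_projects.csv"))   # hypothetical file name
```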
Data discretization is used to transform the continuous attributes into discrete vari-
ables. The values of each continuous attribute are partitioned into 3 unique partitions:
too many partitions of an attribute's values would cause data redundancy, whereas
too few partitions may not capture the distinctions among the values of a continuous
attribute.
For each nominal attribute, dummy variables are added according to its unique values
so that all variables have binary values [24]. As a result, all projects are described
by 99 boolean variables with 0-1 and missing values. Only some machine learning
techniques can handle mixed data with nominal and continuous values, but most machine
learning techniques can handle boolean values. In preprocessing, missing values are
denoted as "-1" and kept for all projects on the corresponding variables. Table 3
shows the value distribution of the variables of the ISBSG projects after preprocessing.
Most values of the variables are zeros because of the transformation from discrete
attributes to binary variables.
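The following sketch illustrates this preprocessing under assumptions the paper does not spell out (equal-frequency binning for the 3 partitions and one dummy column per observed value); missing source values are propagated as -1.

```python
# A sketch of the discretization and dummy-variable step, not the authors' scripts.
import pandas as pd

def to_boolean_vectors(df, continuous_cols, nominal_cols):
    parts = []
    for col in continuous_cols:
        binned = pd.qcut(df[col], q=3, labels=False, duplicates="drop")  # 3 partitions
        parts.append(pd.get_dummies(binned, prefix=col, dtype=int))
    for col in nominal_cols:
        parts.append(pd.get_dummies(df[col], prefix=col, dtype=int))     # dummy variables
    out = pd.concat(parts, axis=1)
    for col in continuous_cols + nominal_cols:                           # keep missing values as -1
        out.loc[df[col].isna(), out.columns.str.startswith(col + "_")] = -1
    return out
```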
Finally, the software effort of the 249 selected projects in the ISBSG data set was
categorized into 3 classes. Projects with a "normalized work effort" of more than 6,000
person-hours were assigned to the class with effort label "high", projects with a
"normalized work effort" between 2,000 and 6,000 person-hours to "medium", and
projects with a "normalized work effort" of less than 2,000 person-hours to "low".
Table 4 lists the effort distribution of the selected projects in the ISBSG data set.
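The ISBSG labelling rule can be written directly as a small helper (the thresholds are the paper's; the function name and input are ours):

```python
# Effort labelling for ISBSG: input is the project's "normalized work effort"
# in person-hours, output is one of the three effort classes.
def isbsg_effort_class(person_hours: float) -> str:
    if person_hours > 6000:
        return "high"
    if person_hours >= 2000:
        return "medium"
    return "low"
```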
CSBSG Data Set. The CSBSG data set contains 1103 projects from the Chinese software
industry. It was created in 2006 with the mission of promoting Chinese standards of
software productivity. The CSBSG projects were collected from 140 organizations in 15
regions across China by the Chinese association of software industry. Each CSBSG
project is described by 179 attributes. The same data preprocessing as used on the
ISBSG data set is applied to the CSBSG data set. In data pruning, 104 projects and 32
attributes (15 nominal attributes and 17 continuous attributes) are extracted from the
CSBSG data set. Table 5 lists the attributes used in the CSBSG data set.
In data discretization, the values of each continuous attribute are partitioned into 3
unique classes. Dummy variables are added to transform the nominal attributes into
boolean variables. As a result, 261 boolean variables are produced to describe the 104
projects, with missing values denoted as "-1". The value distribution of the variables of
the CSBSG projects is shown in Table 6, and we can see that the CSBSG data set has
more missing values than the ISBSG data set.
Table 3. The value distribution of variables for describing projects in ISBSG data set

Value   Proportion
1       20% ∼ 50%
0       20% ∼ 60%
-1      5% ∼ 33%
Finally, the projects in the CSBSG data set were categorized into 3 classes according to
their real efforts. Projects with a "normalized work effort" of more than 5,000 person-
hours were assigned to the class with effort label "high", projects with a "normalized
work effort" between 2,000 and 5,000 person-hours to "medium", and projects with a
"normalized work effort" of less than 2,000 person-hours to "low". Table 7 lists the
effort distribution of the selected projects in the CSBSG data set.
For supervised learning, our experiments are carried out using the 10-fold cross-valida-
tion technique. For each experiment, we divide the whole data set (ISBSG or CSBSG)
into 10 subsets; 9 of the 10 subsets are used for training and the remaining subset is
used for testing. We repeat the experiment 10 times, and the performance of the
prediction model is measured by the average accuracy over the 10 repetitions.

Table 6. The value distribution of variables for describing projects in CSBSG data set

Value   Proportion
1       15% ∼ 40%
0       20% ∼ 60%
-1      10% ∼ 33%
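A sketch of this 10-fold protocol, assuming a scikit-learn realisation with a Bernoulli naïve Bayes classifier as a stand-in and placeholder data, is shown below.

```python
# 10-fold cross-validation: the reported performance is the average accuracy
# over the 10 folds (X, y would be the preprocessed boolean vectors and labels).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(249, 99))                 # placeholder data
y = rng.choice(["low", "medium", "high"], size=249)

folds = KFold(n_splits=10, shuffle=True, random_state=2)
scores = cross_val_score(BernoulliNB(), X, y, cv=folds)
print(scores.mean())                                   # average of the 10 fold accuracies
```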
F(i, j) = \frac{2 \times P(i, j) \times R(i, j)}{P(i, j) + R(i, j)}    (6)

F\text{-}measure = \sum_{i} \frac{n_i}{n} \max_{j} F(i, j)    (7)
Here, n_i is the number of software projects with effort label h_i, n_j is the cardinality
of cluster c_j, and n_{i,j} is the number of software projects with effort label h_i in cluster
c_j; n is the total number of software projects in Y. P(i, j) is the proportion of projects
in cluster c_j that have effort label h_i; R(i, j) is the proportion of projects with effort
label h_i that fall in cluster c_j; F(i, j) is the F-measure of cluster c_j with respect to the
projects with effort label h_i. In general, the larger the F-measure, the better the
clustering result.
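A direct implementation of equations (6) and (7) could look as follows; for each effort label h_i it takes the best F(i, j) over the clusters and weights it by n_i/n.

```python
# Clustering F-measure as defined in equations (6) and (7).
import numpy as np

def clustering_f_measure(labels, clusters):
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    n, total = len(labels), 0.0
    for h in np.unique(labels):                      # effort label h_i
        n_i, best = np.sum(labels == h), 0.0
        for c in np.unique(clusters):                # cluster c_j
            n_j = np.sum(clusters == c)
            n_ij = np.sum((labels == h) & (clusters == c))
            if n_ij == 0:
                continue
            p, r = n_ij / n_j, n_ij / n_i            # P(i, j) and R(i, j)
            best = max(best, 2 * p * r / (p + r))    # F(i, j)
        total += (n_i / n) * best
    return total

print(clustering_f_measure(["low", "low", "high"], [0, 0, 1]))  # 1.0 for a perfect clustering
```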
Table 8. Clustering results of k-medoids with different similarity measures on ISBSG data set

Similarity Measure      F-measure (without imputation)   F-measure (with imputation)
Dice Coefficient        0.3520                           0.3937
Jaccard Coefficient     0.3824                           0.4371
Kulzinsky Coefficient   0.4091                           0.4624
Table 9. Clustering result using Kulzinsky coefficient with imputation on ISBSG data set
clustering has not produced a favorable performance on the CSBSG data set. Moreover,
the performance of k-medoids clustering on the CSBSG data set is worse than that on
the ISBSG data set: without (with) imputation, the average F-measure over the three
coefficients on the CSBSG data set decreases by 7.5% (3.7%), using the average on the
ISBSG data set as the baseline. We attribute this outcome to the fact that the CSBSG
data set has fewer ones and more missing values (denoted as "-1") in its boolean vectors
than the ISBSG data set, as can be seen in Tables 3 and 6. Based on this analysis, the
predictability of software effort using unsupervised learning is not acceptable to the
software industry.
Table 10. Clustering results of k-medoids with different similarity measures on CSBSG data set

Similarity Measure      F-measure (without imputation)   F-measure (with imputation)
Dice Coefficient        0.3403                           0.3772
Jaccard Coefficient     0.3881                           0.4114
Kulzinsky Coefficient   0.4065                           0.4560
Table 11. Clustering result using Kulzinsky coefficient with imputation on CSBSG data set
Results from Supervised Learning. Table 12 shows the performance of the three
classifiers in classifying the projects in the ISBSG data set. On average, BPNN
outperforms the other classifiers in classifying the software projects by effort, and the
J48 decision tree performs better than naïve Bayes. Using the performance of naïve
Bayes as the baseline, BPNN increases the average accuracy by 16.25% (11.71%) and
the J48 decision tree by 5.6% (2.5%) without (with) imputation.
We explain this outcome by noting that BPNN has the best capacity to eliminate noise
and peculiarities, because it uses the backward sweep to adjust the weights of the neurons
to reduce the error of the predictive model. However, the performance of BPNN is not
as robust as that of the other classifiers (we observe this from its standard deviation).
The adoption of cross-validation may reduce the overfitting of BPNN to some extent, but
it cannot eliminate this drawback entirely. The J48 decision tree classifies the projects
using learned decision rules. Owing to the use of information gain [14], the variables
with more discriminative power are selected by J48 in the earlier branches when
constructing decision rules, and thus the noise and peculiarities contained in the
variables with less discriminative power are ignored automatically (especially in tree
pruning).
Naïve Bayes has the worst performance among the three classifiers in classifying
software efforts. We explain this by the fact that the variables used to describe the
projects may not be independent of each other. Moreover, naïve Bayes regards all
variables as having equal weight in the prediction model: the conditional probabilities
of all variables contribute equally when predicting the label of an incoming project.
In fact, however, some variables of a project have more discriminative power than other
variables in deciding the project effort. The noise and peculiarities are often contained
in the variables with little discriminative power, and those variables should be given
less importance in the prediction model. We conjecture that this is also a cause of the
poor performance of k-medoids in clustering the projects. As in k-medoids clustering,
the MINI technique significantly improves the performance of project classification by
imputing the missing values in the boolean vectors.
Table 13 shows the performance of the three classifiers on the CSBSG data set. Similar
conclusions to those on the ISBSG data set can be drawn for the CSBSG data set.
However, the performance of the three classifiers on the CSBSG data set is worse than
that on the ISBSG data set: the average of the overall accuracies of the three techniques
without (with) imputation on the CSBSG data set decreases by 6.95% (6.66%), using
that on the ISBSG data set as the baseline. Again, we attribute this outcome to the
lower quality of the CSBSG data set compared with the ISBSG data set.
We can see from Tables 12 and 13 that, on both the ISBSG and CSBSG data sets, none
of the three supervised learning techniques produced a favorable classification of
software effort using project attributes. The best performance, produced by BPNN, is an
accuracy of around 60%. An accuracy of 60% is of little use for software effort
prediction in most cases because it means that, with probability 0.4, the prediction
falls outside the range of the correct effort class. Combined with the effort prediction
results from unsupervised learning, we conclude that the predictability of software
effort using supervised learning techniques is not acceptable to the software industry,
either.
4 Threats to Validity
The threats to external validity primarily concern the degree to which the attributes of
the projects in the ISBSG and CSBSG data sets capture the characteristics of software
projects in real practice. For data quality, we extracted only a small portion of the data
samples from the ISBSG and CSBSG data sets, and we hope these projects are
representative of the population of software projects in the two data sets. These threats
could be reduced by further experiments on more software effort data sets in future
work. The threats to internal validity are instrumentation effects that can bias our
results. The uncertainty of attribute values, the ambiguity of the software efforts of
projects, and the unbalanced distribution of projects with respect to attributes in the
data sets might cause such effects. To reduce these threats, we manually inspected the
software projects and their attribute values and evaluated the reliability of the data for
each project. One threat to construct validity is that our experiments involve a large
amount of data preprocessing; we hope that the preprocessed data still precisely capture
the characteristics of the original software projects.
5 Related Work
Srinivasan and Fisher [19] used decision tree and BPNN to estimate software develop-
ment effort. COCOMO data with 63 historical projects was used as the training data
and Kremer data with 15 projects was used as testing data. They reported that decision
tree and BPNN are competitive with traditional COCOMO estimator. However, they
pointed out that the performances of machine learning techniques are very sensitive
to the data on which they were trained. [17] compared three estimation techniques as
BPNN, case-based reasoning and regression models using Function Points as the mea-
sure of system size. They reported that neither of case-based reasoning and regression
model was favorable in estimating software efforts due to the considerable noise in the
data set. BPNN appears capable of providing adequate estimation performance (with
MRE as 35%) nevertheless its performance is largely dependent on the quality of train-
ing data as well as the suitability of testing data to the trained model. Of all the three
methods, a large amount of uncertainty is inherent in their performances. In both [17]
and [19], a serious problem confronted with effort estimation using machine learning
techniques is that huge uncertainty involved in the robustness of these techniques. That
That is, the model sensitivity and the data-dependent nature of machine learning
techniques hinder their acceptance in industrial practice for effort prediction. These
works, as well as [22], motivate this study to investigate the effectiveness of a variety
of machine learning techniques on two different data sets.
Park and Baek [18] conducted an empirical validation of a neural network model for
software effort estimation. The data set used in their experiments was collected from a
Korean IT company and includes 148 IT projects. They compared expert judgment,
regression models and BPNN with different input variables for software effort
estimation. They reported that a neural network using Function Points and 6 other
variables (length of project, usage level of the system development methodology, number
of high/middle/low level manpower, and percentage of outsourcing) as input variables
outperforms the other estimation methods. However, even for the best performance, the
average MRE is nearly 60% with a standard deviation of more than 30%. This result
makes it very hard for the method proposed in their work to be satisfactorily adopted in
practice. For this reason, a validation of machine learning methods is necessary in order
to shed light on the advancement of software effort estimation. This point also motivates
us to investigate the effectiveness of machine learning techniques for software effort
estimation and the predictability of software effort using machine learning techniques.
Shukla [20] proposed a neuro-genetic approach to predict software development effort,
in which a neural network is employed to construct the prediction model and a genetic
algorithm is used to optimize the weights between the nodes in the input layer and the
nodes in the output layer. Using the same data sets as Srinivasan and Fisher [19], the
neuro-genetic approach was reported to outperform both the decision tree and BPNN.
However, it was also reported that local minima and overfitting deteriorate the
performance of the proposed method in some cases, even making it a poorer predictor
than a traditional estimator such as COCOMO [21]. The focus of our study is not to
propose a novel approach to software effort estimation but to extensively review the
usefulness of machine learning techniques in software effort estimation, that is, to
examine to what extent typical machine learning techniques can accurately estimate the
effort of a given project using historical data.
6 Concluding Remarks
measure for unsupervised learning, and BPNN has produced the best performance
among the examined supervised learning techniques. Moreover, MINI imputation can
improve data quality and significantly improve effort prediction.
Acknowledgements. This work is supported by the National Natural Science Foun-
dation of China under Grant Nos. 71101138, 60873072, 61073044, and 60903050; the
National Basic Research Program under Grant No. 2007CB310802; the Beijing Natural
Science Foundation under Grant No. 4122087; the Scientific Research Foundation for
the Returned Overseas Chinese Scholars, State Education Ministry.
References
1. Boehm, B., Abts, C., Brown, A., Chulani, S., Clark, B., Horowitz, E.: Software Cost Estima-
tion with COCOMO II. Prentice Hall, New Jersey (2001)
2. Pendharkar, P., Subramanian, G., Roger, J.: A Probabilistic Model for Predicting Software
Development Effort. IEEE Transactions on Software Engineering 31(7), 615–624 (2005)
3. Jorgensen, M.: A Review of Studies on Expert Estimation of Software Development Effort.
Journal of Systems and Software 70, 37–60 (2004)
4. Fairley, R.: Recent Advances in Software Estimation Techniques. In: Proceedings of Inter-
national Conference on Software Engineering, pp. 382–391 (1992)
5. Yang, Y., Wang, Q., Li, M.: Process Trustworthiness as a Capability Indicator for Measuring
and Improving Software Trustworthiness. In: Wang, Q., Garousi, V., Madachy, R., Pfahl, D.
(eds.) ICSP 2009. LNCS, vol. 5543, pp. 389–401. Springer, Heidelberg (2009)
6. Korte, M., Port, D.: Confidence in Software Cost Estimation Results based on MMRE and
PRED. In: Proceedings of PROMISE 2008, pp. 63–70 (2008)
7. He, M., Li, M., Wang, Q., Yang, Y., Ye, K.: An Investigation of Software Development Pro-
ductivity in China. In: Wang, Q., Pfahl, D., Raffo, D.M. (eds.) ICSP 2008. LNCS, vol. 5007,
pp. 381–394. Springer, Heidelberg (2008)
8. Krupka, E., Tishby, N.: Generalization from Observed to Unobserved Features by Cluster-
ing. Journal of Machine Learning Research 83, 339–370 (2008)
9. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 3rd edn. Elsevier (2006)
10. Gan, G., Ma, C., Wu, J.: Data Clustering: Theory, Algorithms, and Applications. In: ASA-
SIAM Series on Statistics and Applied Probability (2008)
11. Song, Q., Shepperd, M.: A new imputation method for small software project data sets.
Journal of Systems and Software 80, 51–62 (2007)
12. Zhou, Z., Tang, W.: Clusterer ensemble. Knowledge-Based Systems 19, 77–83 (2006)
13. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In:
Proceedings of KDD-2000 Workshop on Text Mining, pp. 109–119 (2000)
14. Quinlan, J.: Programs for Machine Learning, 2nd edn. Morgan Kaufmann Publishers (1993)
15. Rumelhart, D., Hinton, G., Williams, R.: Learning internal representations by error propaga-
tion. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition,
pp. 318–362 (1986)
16. Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. John Wiley & Sons (2003)
17. Finnie, G., Wittig, G.: A Comparison of Software Effort Estimation Techniques: Using Func-
tion Points with Neural Networks, Case-Based Reasoning and Regression Models. Journal
of Systems and Software 39, 281–289 (1997)
18. Park, H., Baek, S.: An empirical validation of a neural network model for software effort
estimation. Expert Systems with Applications 35, 929–937 (2008)
19. Srinivasan, K., Fisher, D.: Machine Learning Approaches to Estimating Software Develop-
ment Effort. IEEE Transactions on Software Engineering 21(2), 126–137 (1995)
20. Shukla, K.: Neuro-genetic prediction of software development effort. Information and Soft-
ware Technology 42, 701–713 (2000)
21. Boehm, B.: Software Engineering Economics. Prentice Hall, New Jersey (1981)
22. Prietula, M., Vicinanza, S., Mukhopadhyay, T.: Software-effort estimation with a case-based
reasoner. Journal of Experimental & Theoretical Artificial Intelligence 8, 341–363 (1996)
23. Jorgensen, M., Shepperd, M.: A Systematic Review of Software Development Cost Estima-
tion Studies. IEEE Transactions on Software Engineering 33(1), 33–53 (2007)
24. Zhang, W., Yang, Y., Wang, Q.: Handling missing data in software effort prediction with
naive Bayes and EM algorithm. In: Proceedings of International Conference on Predictive
Models in Software Engineering, vol. 4 (2011)