A Study On Software Effort Prediction Using Machine Learning Techniques
Abstract. This paper presents a study of software effort prediction using machine
learning techniques. Both supervised and unsupervised learning techniques are
employed to predict software effort from historical datasets. For unsupervised
learning, k-medoids clustering equipped with different similarity measures is used
to cluster the projects in a historical dataset. For supervised learning, the J48
decision tree, back-propagation neural network (BPNN) and naïve Bayes are used
to classify software projects into different effort classes. We also impute missing
values in the historical datasets before applying the machine learning techniques to
predict software effort. Experiments on the ISBSG and CSBSG datasets demonstrate
that unsupervised learning with k-medoids clustering produces poor performance,
and that the Kulzinsky coefficient performs best in measuring the similarity of
projects. Supervised learning techniques produce better performance than
unsupervised learning techniques in software effort prediction, with BPNN performing
best among the three supervised learning techniques. Missing data imputation
improves the performance of both unsupervised and supervised learning techniques
in software effort prediction.
1 Introduction
The task of software effort prediction is to estimate the effort needed to develop a soft-
ware artifact [17]. Overestimating software effort may lead to a tight development
schedule and to faults remaining in the system after delivery, whereas underestimating
the effort may lead to delayed delivery and complaints from customers. The importance
of software development effort prediction has motivated the construction of prediction
models that estimate the needed effort as accurately as possible.
Current software effort prediction techniques can be categorized into four types: em-
pirical, regression, theory-based, and machine learning techniques [2]. Machine learning
(ML) techniques, such as artificial neural networks (ANN), decision trees, and naïve
Bayes, learn patterns (knowledge) from historical project data and use these patterns
for effort prediction. Recent studies [2] [3] provide detailed reviews of work on
predicting software development effort.
The primary concern of this paper is the use of machine learning techniques to predict
software effort. Although COCOMO provides a viable solution to effort estimation
by building an analytic model, machine learning techniques such as naïve Bayes
and artificial neural networks offer alternative approaches that make use of knowledge
learned from historical projects. Machine learning techniques may not be the best
solution for effort estimation, but we believe that project managers can at least use
them to complement other models. Especially in an intensely competitive software
market, accurate estimation of software development effort has a decisive effect on
the success of a software project. Consequently, effort estimation using different
techniques, together with risk assessment of budget overrun, is necessary for a trust-
worthy software project [5].
The basic idea of using machine learning techniques for effort prediction is that a
historical dataset contains many projects, each described by features whose values
characterize the project, and that similar feature values tend to induce similar project
efforts. The task of a machine learning method is to learn the inherent patterns of
feature values and their relations with project effort, which can then be used to predict
the effort of new projects.
The rest of this paper is organized as follows. Section 2 introduces the machine learn-
ing techniques applied to software effort prediction; both unsupervised and supervised
learning techniques are described. Section 3 reports experiments that examine the
effectiveness of machine learning techniques on software effort prediction; the datasets
used in the experiments and the performance measures for unsupervised and supervised
learning techniques are also introduced, and the experimental results are illustrated with
explanations. Section 4 presents the threats to the validity of this research. Section 5
reviews related work. Section 6 concludes this paper.
data set. Whenever it encounters a set of boolean vectors (the training set), it identifies
the variable with the largest information gain [14]. Among the possible values of this
variable, if there is any value for which there is no ambiguity, that is, for which all
projects falling within this value have the same effort label, then we terminate that
branch and assign that effort label to the terminal node.
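To make the splitting rule concrete, the sketch below (our assumption, not the authors' Weka configuration) uses scikit-learn's decision tree with the entropy criterion, which chooses the split with the largest information gain, on placeholder boolean project vectors.

```python
# A minimal sketch, not the authors' setup: scikit-learn's DecisionTreeClassifier
# with the "entropy" criterion splits on the attribute with the largest
# information gain, approximating the behaviour of J48/C4.5 on boolean
# project vectors. The data here is a random placeholder.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(249, 99))                  # 0/1 project descriptors
y = rng.choice(["low", "medium", "high"], size=249)     # effort classes

j48_like = DecisionTreeClassifier(criterion="entropy")  # split by information gain
j48_like.fit(X, y)
print(j48_like.predict(X[:5]))
```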
The back-propagation neural network (BPNN) [15] is also used to classify the software
projects in both the ISBSG and CSBSG data sets. BPNN performs two sweeps through
the network: first a forward sweep from the input layer to the output layer, and then
a backward sweep from the output layer to the input layer. The backward sweep is
similar to the forward sweep, except that error values are propagated back through the
network to determine how the weights of the neurons should be changed during training.
The objective of training is to find the set of network weights that yields a prediction
model with minimum error.
A three-layer, fully connected feed-forward network consisting of an input layer,
a hidden layer and an output layer is adopted in the experiments. The tan-sigmoid
transfer function is used in the hidden layer with 5 nodes, and the pure linear function
in the output layer with 3 nodes [17]. The network of the BPNN is designed as shown
in Figure 1.
Fig. 1. BPNN with 5 nodes in hidden layer and 3 nodes in output layer
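As a rough, non-authoritative sketch of such a network (our assumption: scikit-learn's MLPClassifier, with tanh hidden units standing in for the tan-sigmoid transfer function and a softmax output replacing the pure linear output layer), the configuration could look as follows; the data here is a random placeholder.

```python
# A rough stand-in for the three-layer BPNN of Figure 1, not the authors' code.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(249, 99)).astype(float)   # placeholder boolean vectors
y = rng.choice(["low", "medium", "high"], size=249)    # 3 effort classes

bpnn = MLPClassifier(hidden_layer_sizes=(5,),  # single hidden layer with 5 nodes
                     activation="tanh",        # tan-sigmoid-like transfer function
                     solver="sgd",             # weights adjusted by back-propagated error
                     max_iter=2000, random_state=1)
bpnn.fit(X, y)
print(bpnn.score(X, y))                        # training accuracy of the fitted network
```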
3 Experiments
3.1 The Data Sets
We employed two data sets to investigate the predictability of software effort using
machine learning techniques. One is the ISBSG (International Software Benchmarking
Standards Group) data set (http://www.isbsg.org) and the other is the CSBSG
(Chinese Software Benchmarking Standard Group) data set [7].
ISBSG Data Set. The ISBSG data set contains 1238 projects from insurance, government,
and other sectors in 20 different countries, and each project is described by 70 attributes.
To make the data set suitable for the experiments, we conducted three kinds of
preprocessing: data pruning, data discretization and adding dummy variables.
We pruned the ISBSG data set to 249 projects with 22 attributes using the criterion
that each project must have at least 2/3 of its attribute values observed and, for
each attribute, its values must be observed on at least 2/3 of the projects. We adopted
this criterion for data selection because too many missing values would deteriorate the
performance of most machine learning techniques, making a convincing evaluation of
software effort prediction impossible. Among the 22 attributes, 18 are nominal and
4 are continuous. Table 2 lists the attributes used in the ISBSG data set.
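A minimal pandas sketch of this pruning rule is given below; the DataFrame and file name are hypothetical, and the real ISBSG field names differ.

```python
# Keep an attribute only if it is observed on at least 2/3 of the projects,
# then keep a project only if at least 2/3 of the remaining attributes are
# observed for it (a sketch of the selection criterion described above).
import pandas as pd

def prune(projects: pd.DataFrame) -> pd.DataFrame:
    kept_attrs = projects.columns[projects.notna().mean(axis=0) >= 2 / 3]
    pruned = projects[kept_attrs]
    kept_rows = pruned.notna().mean(axis=1) >= 2 / 3
    return pruned[kept_rows]

# pruned = prune(pd.read_csv("isbsg_projects.csv"))   # hypothetical file name
```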
Data discretization is used to transform the continuous attributes into discrete vari-
ables. The values of each continuous attribute are partitioned into 3 unique partitions:
too many partitions of an attribute's values would cause data redundancy, whereas
too few partitions may not capture the distinctions among the values of a continuous
attribute.
For each nominal attribute, dummy variables are added according to its unique values
so that all variables have binary values [24]. As a result, all projects are described
by 99 boolean variables with 0-1 and missing values. Only some machine learning
techniques can handle mixed data with nominal and continuous values, but most machine
learning techniques can handle boolean values. In preprocessing, missing values are
denoted as "-1" and kept for all projects on the corresponding variables. Table 3
shows the value distribution of the variables of the ISBSG projects after preprocessing.
Most values of the variables are zeros because of the transformation from discrete
attributes to binary variables.
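The following sketch illustrates this preprocessing under assumptions the paper does not spell out (equal-frequency binning for the 3 partitions and one dummy column per observed value); missing source values are propagated as -1.

```python
# A sketch of the discretization and dummy-variable step, not the authors' scripts.
import pandas as pd

def to_boolean_vectors(df, continuous_cols, nominal_cols):
    parts = []
    for col in continuous_cols:
        binned = pd.qcut(df[col], q=3, labels=False, duplicates="drop")  # 3 partitions
        parts.append(pd.get_dummies(binned, prefix=col, dtype=int))
    for col in nominal_cols:
        parts.append(pd.get_dummies(df[col], prefix=col, dtype=int))     # dummy variables
    out = pd.concat(parts, axis=1)
    for col in continuous_cols + nominal_cols:                           # keep missing values as -1
        out.loc[df[col].isna(), out.columns.str.startswith(col + "_")] = -1
    return out
```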
Finally, the software effort of the 249 selected projects in the ISBSG data set was
categorized into 3 classes. Projects with a "normalized work effort" of more than 6,000
person-hours were assigned to the class with effort label "high", projects with a
"normalized work effort" between 2,000 and 6,000 person-hours to "medium", and
projects with a "normalized work effort" of less than 2,000 person-hours to "low".
Table 4 lists the effort distribution of the selected projects in the ISBSG data set.
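The ISBSG labelling rule can be written directly as a small helper (the thresholds are the paper's; the function name and input are ours):

```python
# Effort labelling for ISBSG: input is the project's "normalized work effort"
# in person-hours, output is one of the three effort classes.
def isbsg_effort_class(person_hours: float) -> str:
    if person_hours > 6000:
        return "high"
    if person_hours >= 2000:
        return "medium"
    return "low"
```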
CSBSG Data Set. The CSBSG data set contains 1103 projects from the Chinese software
industry. It was created in 2006 with the mission of promoting Chinese standards of
software productivity. The CSBSG projects were collected from 140 organizations in 15
regions across China by the Chinese association of software industry. Each CSBSG
project is described by 179 attributes. The same data preprocessing as used on the
ISBSG data set is applied to the CSBSG data set. In data pruning, 104 projects and 32
attributes (15 nominal attributes and 17 continuous attributes) are extracted from the
CSBSG data set. Table 5 lists the attributes used in the CSBSG data set.
In data discretization, the values of each continuous attribute are partitioned into 3
unique classes. Dummy variables are added to transform the nominal attributes into
boolean variables. As a result, 261 boolean variables are produced to describe the 104
projects, with missing values denoted as "-1". The value distribution of the variables of
the CSBSG projects is shown in Table 6, and we can see that the CSBSG data set has
more missing values than the ISBSG data set.
Table 3. The value distribution of variables for describing projects in ISBSG data set

Value   Proportion
1       20% ∼ 50%
0       20% ∼ 60%
-1      5% ∼ 33%
Finally, the projects in the CSBSG data set were categorized into 3 classes according to
their real efforts. Projects with a "normalized work effort" of more than 5,000 person-
hours were assigned to the class with effort label "high", projects with a "normalized
work effort" between 2,000 and 5,000 person-hours to "medium", and projects with a
"normalized work effort" of less than 2,000 person-hours to "low". Table 7 lists the
effort distribution of the selected projects in the CSBSG data set.
For supervised learning, our experiments are carried out using the 10-fold cross-valida-
tion technique. For each experiment, we divide the whole data set (ISBSG or CSBSG)
into 10 subsets; 9 of the 10 subsets are used for training and the remaining subset is
used for testing. We repeat the experiment 10 times, and the performance of the
prediction model is measured by the average accuracy over the 10 repetitions.

Table 6. The value distribution of variables for describing projects in CSBSG data set

Value   Proportion
1       15% ∼ 40%
0       20% ∼ 60%
-1      10% ∼ 33%
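A sketch of this 10-fold protocol, assuming a scikit-learn realisation with a Bernoulli naïve Bayes classifier as a stand-in and placeholder data, is shown below.

```python
# 10-fold cross-validation: the reported performance is the average accuracy
# over the 10 folds (X, y would be the preprocessed boolean vectors and labels).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(249, 99))                 # placeholder data
y = rng.choice(["low", "medium", "high"], size=249)

folds = KFold(n_splits=10, shuffle=True, random_state=2)
scores = cross_val_score(BernoulliNB(), X, y, cv=folds)
print(scores.mean())                                   # average of the 10 fold accuracies
```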
F(i, j) = \frac{2 \times P(i, j) \times R(i, j)}{P(i, j) + R(i, j)}    (6)

F\text{-}measure = \sum_{i} \frac{n_i}{n} \max_{j} F(i, j)    (7)
Here, n_i is the number of software projects with effort label h_i, n_j is the cardinality
of cluster c_j, and n_{i,j} is the number of software projects with effort label h_i in cluster
c_j; n is the total number of software projects in Y. P(i, j) is the proportion of projects
in cluster c_j that have effort label h_i; R(i, j) is the proportion of projects with effort
label h_i that fall in cluster c_j; F(i, j) is the F-measure of cluster c_j with respect to the
projects with effort label h_i. In general, the larger the F-measure, the better the
clustering result.
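A direct implementation of equations (6) and (7) could look as follows; for each effort label h_i it takes the best F(i, j) over the clusters and weights it by n_i/n.

```python
# Clustering F-measure as defined in equations (6) and (7).
import numpy as np

def clustering_f_measure(labels, clusters):
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    n, total = len(labels), 0.0
    for h in np.unique(labels):                      # effort label h_i
        n_i, best = np.sum(labels == h), 0.0
        for c in np.unique(clusters):                # cluster c_j
            n_j = np.sum(clusters == c)
            n_ij = np.sum((labels == h) & (clusters == c))
            if n_ij == 0:
                continue
            p, r = n_ij / n_j, n_ij / n_i            # P(i, j) and R(i, j)
            best = max(best, 2 * p * r / (p + r))    # F(i, j)
        total += (n_i / n) * best
    return total

print(clustering_f_measure(["low", "low", "high"], [0, 0, 1]))  # 1.0 for a perfect clustering
```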
Table 8. Clustering results of k-medoids with different similarity measures on ISBSG data set

Similarity Measure      F-measure (without imputation)   F-measure (with imputation)
Dice Coefficient        0.3520                           0.3937
Jaccard Coefficient     0.3824                           0.4371
Kulzinsky Coefficient   0.4091                           0.4624
Table 9. Clustering result using Kulzinsky coefficient with imputation on ISBSG data set
clustering has not produced a favorable performance on the CSBSG data set. Moreover,
the performance of k-medoids clustering on the CSBSG data set is worse than that on
the ISBSG data set: without (with) imputation, the average F-measure over the three
coefficients on the CSBSG data set decreases by 7.5% (3.7%), using the average on the
ISBSG data set as the baseline. We attribute this outcome to the fact that the CSBSG
data set has fewer ones and more missing values (denoted as "-1") in its boolean vectors
than the ISBSG data set, as can be seen in Tables 3 and 6. Based on this analysis, the
predictability of software effort using unsupervised learning is not acceptable to the
software industry.
Table 10. Clustering results of k-medoids with different similarity measures on CSBSG data set

Similarity Measure      F-measure (without imputation)   F-measure (with imputation)
Dice Coefficient        0.3403                           0.3772
Jaccard Coefficient     0.3881                           0.4114
Kulzinsky Coefficient   0.4065                           0.4560
Table 11. Clustering result using Kulzinsky coefficient with imputation on CSBSG data set
Results from Supervised Learning. Table 12 shows the performance of the three
classifiers in classifying the projects in the ISBSG data set. On average, BPNN
outperforms the other classifiers in classifying the software projects by effort, and the
J48 decision tree performs better than naïve Bayes. Using the performance of naïve
Bayes as the baseline, BPNN increases the average accuracy by 16.25% (11.71%) and
the J48 decision tree by 5.6% (2.5%) without (with) imputation.
We explain this outcome by noting that BPNN has the best capacity to eliminate noise
and peculiarities, because it uses the backward sweep to adjust the weights of the neurons
to reduce the error of the predictive model. However, the performance of BPNN is not
as robust as that of the other classifiers (we observe this from its standard deviation).
The adoption of cross-validation may reduce the overfitting of BPNN to some extent, but
it cannot eliminate this drawback entirely. The J48 decision tree classifies the projects
using learned decision rules. Owing to the use of information gain [14], the variables
with more discriminative power are selected by J48 in the earlier branches when
constructing decision rules, and thus the noise and peculiarities contained in the
variables with less discriminative power are ignored automatically (especially in tree
pruning).
Naïve Bayes has the worst performance among the three classifiers in classifying
software efforts. We explain this by the fact that the variables used to describe the
projects may not be independent of each other. Moreover, naïve Bayes regards all
variables as having equal weight in the prediction model: the conditional probabilities
of all variables contribute equally when predicting the label of an incoming project.
In fact, however, some variables of a project have more discriminative power than other
variables in deciding the project effort. The noise and peculiarities are often contained
in the variables with little discriminative power, and those variables should be given
less importance in the prediction model. We conjecture that this is also a cause of the
poor performance of k-medoids in clustering the projects. As in k-medoids clustering,
the MINI technique significantly improves the performance of project classification by
imputing the missing values in the boolean vectors.
Table 13 shows the performance of the three classifiers on the CSBSG data set. Similar
conclusions to those on the ISBSG data set can be drawn for the CSBSG data set.
However, the performance of the three classifiers on the CSBSG data set is worse than
that on the ISBSG data set: the average of the overall accuracies of the three techniques
without (with) imputation on the CSBSG data set decreases by 6.95% (6.66%), using
that on the ISBSG data set as the baseline. Again, we attribute this outcome to the
lower quality of the CSBSG data set compared with the ISBSG data set.
We can see from Tables 12 and 13 that, on both the ISBSG and CSBSG data sets, none
of the three supervised learning techniques produced a favorable classification of
software effort using project attributes. The best performance, produced by BPNN, is an
accuracy of around 60%. An accuracy of 60% is of little use for software effort
prediction in most cases because it means that, with probability 0.4, the prediction
falls outside the range of the correct effort class. Combined with the effort prediction
results from unsupervised learning, we conclude that the predictability of software
effort using supervised learning techniques is not acceptable to the software industry,
either.
4 Threats to Validity
The threats to external validity primarily concern the degree to which the attributes of
the projects in the ISBSG and CSBSG data sets capture the characteristics of software
projects in real practice. For data quality, we extracted only a small portion of the data
samples from the ISBSG and CSBSG data sets, and we hope these projects are
representative of the population of software projects in the two data sets. These threats
could be reduced by further experiments on more software effort data sets in future
work. The threats to internal validity are instrumentation effects that can bias our
results. The uncertainty of attribute values, the ambiguity of the software efforts of
projects, and the unbalanced distribution of projects with respect to attributes in the
data sets might cause such effects. To reduce these threats, we manually inspected the
software projects and their attribute values and evaluated the reliability of the data for
each project. One threat to construct validity is that our experiments involve a large
amount of data preprocessing; we hope that the preprocessed data still precisely capture
the characteristics of the original software projects.
5 Related Work
Srinivasan and Fisher [19] used decision tree and BPNN to estimate software develop-
ment effort. COCOMO data with 63 historical projects was used as the training data
and Kremer data with 15 projects was used as testing data. They reported that decision
tree and BPNN are competitive with traditional COCOMO estimator. However, they
pointed out that the performances of machine learning techniques are very sensitive
to the data on which they were trained. [17] compared three estimation techniques as
BPNN, case-based reasoning and regression models using Function Points as the mea-
sure of system size. They reported that neither of case-based reasoning and regression
model was favorable in estimating software efforts due to the considerable noise in the
data set. BPNN appears capable of providing adequate estimation performance (with
MRE as 35%) nevertheless its performance is largely dependent on the quality of train-
ing data as well as the suitability of testing data to the trained model. Of all the three
methods, a large amount of uncertainty is inherent in their performances. In both [17]
and [19], a serious problem confronted with effort estimation using machine learning
techniques is that huge uncertainty involved in the robustness of these techniques. That
That is, the model sensitivity and the data-dependent nature of machine learning
techniques hinder their acceptance in industrial practice for effort prediction. These
works, as well as [22], motivate this study to investigate the effectiveness of a variety
of machine learning techniques on two different data sets.
Park and Baek [18] conducted an empirical validation of a neural network model for
software effort estimation. The data set used in their experiments was collected from a
Korean IT company and includes 148 IT projects. They compared expert judgment,
regression models and BPNN with different input variables for software effort
estimation. They reported that a neural network using Function Points and 6 other
variables (length of project, usage level of the system development methodology, number
of high/middle/low level manpower, and percentage of outsourcing) as input variables
outperforms the other estimation methods. However, even for the best performance, the
average MRE is nearly 60% with a standard deviation of more than 30%. This result
makes it very hard for the method proposed in their work to be satisfactorily adopted in
practice. For this reason, a validation of machine learning methods is necessary in order
to shed light on the advancement of software effort estimation. This point also motivates
us to investigate the effectiveness of machine learning techniques for software effort
estimation and the predictability of software effort using machine learning techniques.
Shukla [20] proposed a neuro-genetic approach to predict software development effort,
in which a neural network is employed to construct the prediction model and a genetic
algorithm is used to optimize the weights between the nodes in the input layer and the
nodes in the output layer. Using the same data sets as Srinivasan and Fisher [19], the
neuro-genetic approach was reported to outperform both the decision tree and BPNN.
However, it was also reported that local minima and overfitting deteriorate the
performance of the proposed method in some cases, even making it a poorer predictor
than a traditional estimator such as COCOMO [21]. The focus of our study is not to
propose a novel approach to software effort estimation but to extensively review the
usefulness of machine learning techniques in software effort estimation, that is, to
examine to what extent typical machine learning techniques can accurately estimate the
effort of a given project using historical data.
6 Concluding Remarks
measure for unsupervised learning, and BPNN has produced the best performance
among the examined supervised learning techniques. Moreover, MINI imputation can
improve data quality and significantly improve effort prediction.
Acknowledgements. This work is supported by the National Natural Science Foun-
dation of China under Grant Nos. 71101138, 60873072, 61073044, and 60903050; the
National Basic Research Program under Grant No. 2007CB310802; the Beijing Natural
Science Foundation under Grant No. 4122087; the Scientific Research Foundation for
the Returned Overseas Chinese Scholars, State Education Ministry.
References
1. Boehm, B., Abts, C., Brown, A., Chulani, S., Clark, B., Horowitz, E.: Software Cost Estima-
tion with COCOMO II. Prentice Hall, New Jersey (2001)
2. Pendharkar, P., Subramanian, G., Roger, J.: A Probabilistic Model for Predicting Software
Development Effort. IEEE Transactions on Software Engineering 31(7), 615–624 (2005)
3. Jorgensen, M.: A Review of Studies on Expert Estimation of Software Development Effort.
Journal of Systems and Software 70, 37–60 (2004)
4. Fairley, R.: Recent Advances in Software Estimation Techniques. In: Proceedings of Inter-
national Conference on Software Engineering, pp. 382–391 (1992)
5. Yang, Y., Wang, Q., Li, M.: Process Trustworthiness as a Capability Indicator for Measuring
and Improving Software Trustworthiness. In: Wang, Q., Garousi, V., Madachy, R., Pfahl, D.
(eds.) ICSP 2009. LNCS, vol. 5543, pp. 389–401. Springer, Heidelberg (2009)
6. Korte, M., Port, D.: Confidence in Software Cost Estimation Results based on MMRE and
PRED. In: Proceedings of PROMISE 2008, pp. 63–70 (2008)
7. He, M., Li, M., Wang, Q., Yang, Y., Ye, K.: An Investigation of Software Development Pro-
ductivity in China. In: Wang, Q., Pfahl, D., Raffo, D.M. (eds.) ICSP 2008. LNCS, vol. 5007,
pp. 381–394. Springer, Heidelberg (2008)
8. Krupka, E., Tishby, N.: Generalization from Observed to Unobserved Features by Cluster-
ing. Journal of Machine Learning Research 83, 339–370 (2008)
9. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 3rd edn. Elsevier (2006)
10. Gan, G., Ma, C., Wu, J.: Data Clustering: Theory, Algorithms, and Applications. In: ASA-
SIAM Series on Statistics and Applied Probability (2008)
11. Song, Q., Shepperd, M.: A new imputation method for small software project data sets.
Journal of Systems and Software 80, 51–62 (2007)
12. Zhou, Z., Tang, W.: Clusterer ensemble. Knowledge-Based Systems 19, 77–83 (2006)
13. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In:
Proceedings of KDD-2000 Workshop on Text Mining, pp. 109–119 (2000)
14. Quinlan, J.: Programs for Machine Learning, 2nd edn. Morgan Kaufmann Publishers (1993)
15. Rumelhart, D., Hinton, G., Williams, R.: Learning internal representations by error propaga-
tion. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition,
pp. 318–362 (1986)
16. Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. John Wiley & Sons (2003)
17. Finnie, G., Wittig, G.: A Comparison of Software Effort Estimation Techniques: Using Func-
tion Points with Neural Networks, Case-Based Reasoning and Regression Models. Journal
of Systems and Software 39, 281–289 (1997)
18. Park, H., Baek, S.: An empirical validation of a neural network model for software effort
estimation. Expert Systems with Applications 35, 929–937 (2008)
19. Srinivasan, K., Fisher, D.: Machine Learning Approaches to Estimating Software Develop-
ment Effort. IEEE Transactions on Software Engineering 21(2), 126–137 (1995)
20. Shukla, K.: Neuro-genetic prediction of software development effort. Information and Soft-
ware Technology 42, 701–713 (2000)
21. Boehm, B.: Software Engineering Economics. Prentice Hall, New Jersey (1981)
22. Prietula, M., Vicinanza, S., Mukhopadhyay, T.: Software-effort estimation with a case-based
reasoner. Journal of Experimental & Theoretical Artificial Intelligence 8, 341–363 (1996)
23. Jorgensen, M., Shepperd, M.: A Systematic Review of Software Development Cost Estima-
tion Studies. IEEE Transactions on Software Engineering 33(1), 33–53 (2007)
24. Zhang, W., Yang, Y., Wang, Q.: Handling missing data in software effort prediction with
naive Bayes and EM algorithm. In: Proceedings of International Conference on Predictive
Models in Software Engineering, vol. 4 (2011)