
International Conference on Recent Trends in Artificial Intelligence, IoT, Smart Cities & Applications (ICAISC-2020), May 27, 2020

A Methodology for Software Cost Estimation Using Machine Learning Techniques

Md Aaquib Nawaz, Mrityunjay Kumar, Md Kashif Zaki, Sunil Kumar Gouda, Prithvi Raj, Neeraj Kumar, Raja Babu Das

Department of Computer Science and Engineering, Chaibasa Engineering College, Chaibasa, India, [email protected]
Department of Computer Science and Engineering, Chaibasa Engineering College, Chaibasa, India, [email protected]
Department of Computer Science and Engineering, Chaibasa Engineering College, Chaibasa, India, [email protected]
Department of Computer Science and Engineering, Chaibasa Engineering College, Chaibasa, India, [email protected]
Department of Computer Science and Engineering, Chaibasa Engineering College, Chaibasa, India, [email protected]
Department of Computer Science and Engineering, Chaibasa Engineering College, Chaibasa, India, [email protected]
Department of Computer Science and Engineering, Chaibasa Engineering College, Chaibasa, India, [email protected]

Abstract— Organizations today spend billions of dollars annually on software development and maintenance. Many software projects fail to be completed, or run over budget, because current software cost-estimation techniques cannot estimate, at an early project stage, the amount of effort and cost required for an undertaking to be finished. One explanation is that present software cost-estimation models tend to perform poorly when applied outside of narrowly defined domains. Machine learning offers an alternative approach to the present models: in machine learning, domain-specific information and the computer can be coupled to create an engine for knowledge discovery. Using genetic programming, neural networks and genetic algorithms, together with a distributed software project data set, several cost estimation models were developed. Testing was conducted using the COCOMO 81 data set. All three techniques demonstrated levels of performance showing that each of them can provide software project managers with capabilities that can be used to obtain better software cost estimates.

Keywords— Software cost estimation, machine learning, COCOMO, neural network.

I. INTRODUCTION

The major issues in software project management are software cost and effort estimation. The growth of software quality depends on software cost and software effort calculation. Ideally, improvements in the estimation techniques currently available to project managers would facilitate increased control of time and overall cost benefit across the software development life cycle. Software development effort estimates are the basis for project bidding and planning, and the consequences of poor budgeting and planning can be disastrous: if estimates are too pessimistic, business opportunities can be lost, while optimism may be followed by significant losses. In software project management, software developers have to decide what resources are to be used and how the software utilizes these resources efficiently for software cost and effort estimation. The parameters required for developing software accurately and efficiently include risk analysis, development time estimation, estimation of software development cost, estimation of team size, and effort estimation.

At the beginning stage of software development, parameters such as effort and software development cost are estimated. Hence, there is a need to develop a refined model to calculate the above-mentioned parameters precisely. In the domain of software effort estimation, the research community has been working for the past few decades, and researchers have developed different conventional models to estimate the size, effort and cost of software. These conventional models demand inputs, but extracting the different inputs is a challenging issue in the beginning stages. Software effort estimation has even been identified as one of the three most demanding challenges in software application areas. During the development process, the cost and time estimates are useful for the initial rough validation and monitoring of the project's completion and, in addition, these estimates may be useful for project productivity assessment phases. Software effort estimation models are divided into two main categories, namely algorithmic and non-algorithmic. The most popular algorithmic estimation models include Boehm's COCOMO, Putnam's SLIM and Albrecht's Function Point, while non-algorithmic techniques include Price-to-Win, Parkinson, expert judgment and machine learning approaches.



Machine learning is used to group together a set of techniques that embody some of the facets of the human mind: for example, fuzzy systems, analogy, regression trees, rule induction and neural networks are among the machine learning approaches, and fuzzy systems and neural networks are considered to belong to the soft computing paradigm. Estimation by analogy is simple and flexible compared to algorithmic models. The analogy technique can be applied effectively even to local data that are not supported by algorithmic models, and it can be used for both qualitative and quantitative data, reflecting the kinds of datasets found in real life more closely. Analogy-based estimation also has the potential to mitigate the effect of outliers in a historical data set, since estimation by analogy does not rely on calibrating a single model to suit all projects. Unfortunately, it is difficult to assess the preliminary estimate, as the available information about historic project data during the early stages is not sufficient. The proposed method effectively estimates the software effort using an integrated approach.

II. THEORY OF SOFTWARE COST ESTIMATION

A. COCOMO

COCOMO is a well-known cost estimation model. It is a mathematical representation for software cost estimation, founded mainly on past experience of software projects and using LOC as the unit of measure for software size. It consists of three variants, namely the basic, intermediate and detailed models. The basic COCOMO model estimates the effort and cost of software development as a function of program size expressed in estimated LOC. The effort is determined using the following equation:

Effort = a * (KLOC)^b ..........(1)

where Effort is expressed in person-months and KLOC is the estimated number of thousands of lines of code for the project. The values of the parameters a and b depend on the project type: software projects are grouped into three classes according to their complexity, namely organic, semi-detached and embedded (for organic projects a = 2.4, b = 1.05; for semi-detached a = 3.0, b = 1.12; and for embedded a = 3.6, b = 1.20).

The intermediate COCOMO model computes the software development effort as a function of program size and a set of cost drivers that incorporate subjective assessments of product, hardware, personnel and project attributes. Here, effort is determined using the following equation:

Effort = a * (KLOC)^b * EAF ..........(2)

The values of the parameters a and b again depend on the project type (for organic projects a = 3.2, b = 1.05; for semi-detached a = 3.0, b = 1.12; and for embedded a = 2.8, b = 1.20), and the effort adjustment factor EAF is determined using 15 cost drivers, each of which is rated on an ordinal scale extending from low to high.

The detailed COCOMO model is a more elaborate application of the 15 cost drivers to each step of the software engineering process. The phases used in the detailed COCOMO model include requirements gathering and planning, system architecture and design, detailed design, component and sub-component code, unit testing and integration testing, and the phase weights Wi are defined accordingly. Here, effort is calculated using the following equation:

Effort = a * (KLOC)^b * EAF * Σ Wi ..........(3)
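For illustration, equations (1) and (2) can be evaluated as in the following minimal Python sketch. The coefficients follow the values quoted above; the cost-driver multipliers in the example are placeholders, not values taken from the paper.

```python
# Minimal sketch of the basic and intermediate COCOMO effort equations (1) and (2).
# The cost-driver multipliers passed to intermediate_effort() are illustrative placeholders.

BASIC_COEFFS = {            # (a, b) for equation (1)
    "organic":       (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded":      (3.6, 1.20),
}

INTERMEDIATE_COEFFS = {     # (a, b) for equation (2)
    "organic":       (3.2, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded":      (2.8, 1.20),
}


def basic_effort(kloc: float, mode: str = "organic") -> float:
    """Equation (1): Effort = a * KLOC**b, in person-months."""
    a, b = BASIC_COEFFS[mode]
    return a * kloc ** b


def intermediate_effort(kloc: float, eaf_multipliers, mode: str = "organic") -> float:
    """Equation (2): Effort = a * KLOC**b * EAF, where EAF is the product
    of the 15 cost-driver multipliers."""
    a, b = INTERMEDIATE_COEFFS[mode]
    eaf = 1.0
    for m in eaf_multipliers:
        eaf *= m
    return a * kloc ** b * eaf


if __name__ == "__main__":
    # Hypothetical 32 KLOC organic project with two cost drivers rated off-nominal.
    print(round(basic_effort(32, "organic"), 1))                       # about 91 person-months
    print(round(intermediate_effort(32, [1.15, 0.91], "organic"), 1))
```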
B. Pearson product-moment correlation coefficient

The Pearson product-moment correlation coefficient is widely used to assess the linear relationship between two sets of data. For two variables X and Y, it is a measure of the linear dependence between the variables. It is denoted by r, and its value lies between -1 and 1, where 1 is total positive correlation, 0 is no correlation and -1 is total negative correlation:

r = Σ(Xi − X̄)(Yi − Ȳ) / √( Σ(Xi − X̄)² · Σ(Yi − Ȳ)² ) ..........(4)

where X̄ and Ȳ are the means of X and Y respectively. In this expression the numerator corresponds to the covariance of X and Y and the denominator to the product of their standard deviations (the 1/n factors cancel), so the Pearson correlation coefficient can also be defined as the ratio of the covariance of X and Y to the product of their standard deviations.
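A small sketch of equation (4), computing r between a cost factor X and an effort series Y, is given below for illustration.

```python
# Sketch of equation (4): Pearson's r between two equal-length sequences.
import math

def pearson_r(x, y):
    """Return the Pearson product-moment correlation coefficient of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

# Toy example: a perfectly increasing relationship gives r = 1.0.
print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))
```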
C. One-Way ANOVA

One-way ANOVA is a general technique for studying relationships in sampled data. It has been broadly used in different fields: for example, Chen and his colleagues have used one-way ANOVA in genetic engineering, Tang has used it for lodging-staff job satisfaction, and Ronen has applied it to software development risks.

D. Fuzzy C-Means clustering algorithm

Fuzzy C-Means clustering is a clustering method that allows one piece of data to belong to two or more clusters. The main objective of this algorithm is to minimize:

Jm = Σ_{i=1..N} Σ_{j=1..C} u_ij^m ||Xi − Cj||²,  1 ≤ m < ∞ ..........(5)

where m is the fuzziness index, m ∈ [1, ∞); N is the number of data points; C denotes the number of cluster centers; Xi is the i-th d-dimensional measured data point; u_ij is the membership of Xi in cluster j; Cj is the d-dimensional center of cluster j; and ||Xi − Cj|| is the Euclidean distance between the i-th data point and the j-th cluster center.
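A compact NumPy sketch of the standard fuzzy C-means iteration that minimizes Jm in equation (5) is shown below. The random initialization and stopping rule are simplified for illustration; this is not the exact implementation used by the authors.

```python
# Sketch of the fuzzy C-means iteration minimizing Jm in equation (5).
import numpy as np

def fuzzy_c_means(X, n_clusters=3, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """X: (N, d) data matrix. Returns (centers, U), with U of shape (N, n_clusters)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((N, n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # memberships of each point sum to 1

    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]           # weighted cluster centers
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)                               # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)              # standard membership update
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U
```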
E. Multi-Objective Genetic Algorithm

The Multi-Objective Genetic Algorithm (MOGA) is a method for solving optimization problems that involve multiple objectives, such as minimizing cost while maximizing reliability and other objectives. It differs from single-objective optimization in that, in a MOGA problem, there does not exist a single solution that simultaneously optimizes every objective.



Here, the main task is to find the trade-off surface, a set of non-dominated solution points known as Pareto-optimal or non-inferior solutions. None of the solutions in the non-dominated set is strictly better than any other; any one of them is an acceptable solution. The choice of one solution over another requires problem knowledge and a number of problem-related factors [2].
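As a small illustration of the dominance idea, the following sketch filters a set of two-objective points (for example the pair F1 = MMRE and F2 = 1/PRED used later in Section II-F, both to be minimized) down to the non-dominated front. It is a generic helper, not the authors' MOGA implementation.

```python
# Pareto-dominance filter for two minimization objectives.

def non_dominated(points):
    """Return the points not dominated by any other point (both objectives minimized)."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Toy example: (0.3, 0.5) and (0.5, 0.2) trade off against each other; (0.6, 0.6) is dominated.
print(non_dominated([(0.3, 0.5), (0.5, 0.2), (0.6, 0.6)]))   # -> [(0.3, 0.5), (0.5, 0.2)]
```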
F. Evaluation Criteria

Evaluation criteria are essential for assessing the estimation accuracy of the proposed software cost estimation model [13]. Here, MMRE and PRED(0.25) are used. MMRE is the average of the absolute values of the relative errors over the whole dataset and is calculated using the following equation:

MMRE = (1/n) Σ_{i=1..n} |Predicted_i − Actual_i| / Actual_i ..........(6)

PRED(0.25) is defined as the percentage of predictions falling within 25% of the actual known value and is calculated using the following equation:

PRED(0.25) = (1/n) Σ_{i=1..n} [ |Predicted_i − Actual_i| / Actual_i ≤ 0.25 ] ..........(7)

where the bracketed term is 1 when the condition holds and 0 otherwise, and n is the number of projects. These two evaluation criteria are taken as the objective functions for MOGA when searching for the optimal COCOMO parameters: the main objectives are to minimize the MMRE and to maximize PRED. Since the optimization algorithms used here are implemented to minimize their objectives, PRED is maximized by minimizing its reciprocal (equivalently, a negative sign in front of the objective function converts minimization into maximization), so that F1 = MMRE and F2 = 1/PRED.
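Equations (6) and (7) translate directly into code; the following sketch computes both measures over paired lists of predicted and actual effort values.

```python
# Direct transcription of equations (6) and (7).

def mmre(predicted, actual):
    """Equation (6): mean magnitude of relative error."""
    n = len(actual)
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / n

def pred(predicted, actual, level=0.25):
    """Equation (7): fraction of projects whose relative error is within `level`."""
    n = len(actual)
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) / a <= level)
    return hits / n

# Toy example with three projects (effort in person-months).
pred_effort = [100.0, 55.0, 240.0]
true_effort = [ 90.0, 60.0, 400.0]
print(mmre(pred_effort, true_effort), pred(pred_effort, true_effort))
```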
III. RELATED WORKS

Although most researchers have used FFNN models for the effort estimation task, other neural network models can also be tried; neural network models with suitable changes in architecture and functions can support research on predicting software development effort [4]. Soft computing techniques have been used to build a suitable model structure to obtain improved estimates of software effort for NASA software projects: Particle Swarm Optimization (PSO) was used to tune the parameters of the COCOMO model, and the performance of the developed model was evaluated using the NASA software project data set [5]. Soft computing methods were also explored to build efficient effort estimation model structures; Kelly used the concepts of neural networks, genetic algorithms and genetic programming to introduce a methodology for software cost estimation [6]. Three models combining fuzzy logic and a PSO algorithm with inertia weight were presented in research on NASA data, where the NASA datasets were split into training and testing sets [7]. Dizaji and F. S. Gharehchopogh used a chaos factor to improve the performance of the PSO algorithm; in their article, the tent map, Lorenz attractor and logistic map are used as chaos optimization algorithms, and the hybrid of this algorithm with the chaos factor was used to predict road accidents according to accident type (damage, injury, death) [8]. Alternative prediction models, ranging from regression approaches to analogy-based and machine learning techniques, have been studied in order to cover a wide range of estimation methods proposed so far in the literature [9]. Public-domain datasets with different characteristics are used in order to address the inherent problem of prediction systems, i.e. their high dependency on the type of data [10]. Through the cost estimation model, alternative error functions measuring different important aspects of error have been studied [11]. An Intercultural Challenges Mitigation Model (ICCMM) has been proposed to assist outsourcing vendor organizations in addressing intercultural challenges in outsourcing relationships [12]. Genuine estimates of software development costs are required in the software development cycle to determine the feasibility of software projects and to provide the required resources accordingly. Many methods have been proposed to predict the cost of software, one of the most popular being estimation by expert knowledge; in this method, the reliability of estimates based on expert opinion depends on how closely a new project matches the skills and experience of the expert [13]. Scaling generally refers to measurements or assessments conducted under exactly specified and repeatable conditions. In ML, scaling transforms feature values according to a defined rule so that all scaled features have the same degree of influence, making the method immune to the choice of units, which is an important step for ML methods [14]. Due to the cost of gathering and reporting data from projects, development teams are less focused on data collection, and missing values have a significant impact on ML estimation performance.

IV. PROPOSED WORK

In this experiment, the COCOMO dataset has been used to evaluate the proposed model. Here, we use the Pearson product-moment correlation coefficient to analyze the linear association between each of the fifteen cost adjustment factors and the actual software effort. The correlation coefficient r is calculated using equation (4), where X is a cost adjustment factor and Y is the actual effort for a project. Therefore, X̄ is the arithmetic mean of X, i.e. the average of a cost factor over all 63 projects in the dataset; similarly, Ȳ is the arithmetic mean of Y, i.e. the average of the effort values over the 63 projects. Through the Pearson correlation analysis, a comparatively strong positive association (+0.4493) is found between the cost factor Database Size (data) and Effort: as the value of the cost factor Database Size increases, the value of software Effort also increases. Similarly, a linear association is also found between the Effort and each of the cost factors Modern Programming Practices (modp), Required Software Reliability (rely) and Computer Turnaround Time (turn). Thus these four cost factors are selected for clustering; the other cost factors show weaker associations and are not selected. The results are shown in the chart below, where the x-axis represents the fifteen cost factors and the y-axis gives the r value of the analysis of effort with the corresponding cost factor.
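The factor-selection step just described could be sketched as below. The column names follow the usual PROMISE-repository naming of the COCOMO 81 file and may differ in other distributions; the CSV path and the selection cut-off are illustrative placeholders, not values from the paper.

```python
# Sketch: Pearson r between each of the 15 cost drivers and the actual effort,
# keeping only the most strongly associated factors. Column names and the 0.3
# cut-off are assumptions for illustration.
import pandas as pd

COST_DRIVERS = ["rely", "data", "cplx", "time", "stor", "virt", "turn",
                "acap", "aexp", "pcap", "vexp", "lexp", "modp", "tool", "sced"]

df = pd.read_csv("cocomo81.csv")                                   # 63 projects (placeholder path)
r_values = {c: df[c].corr(df["actual"]) for c in COST_DRIVERS}     # Pearson r vs. effort

# Keep the factors with the strongest association, e.g. |r| above a chosen cut-off.
selected = [c for c, r in sorted(r_values.items(), key=lambda kv: -abs(kv[1])) if abs(r) >= 0.3]
print(r_values)
print(selected)
```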



V. IMPLEMENTATION AND EXPERIMENTAL SETUP

We use one-way ANOVA analysis to test whether or not the means of several variables are equal. It is done to find out whether there is a noteworthy difference between the means of two unrelated groups, the group of a cost adjustment factor and the effort. Each cost factor out of the fifteen cost adjustment factors is analyzed against the actual effort in the COCOMO 81 dataset. The null hypothesis is that the means of the two variables are equal:

H0 : µk = µe ..........(8)

where µk is the mean of the k-th cost factor, k = 1, 2, 3, ..., 15, and µe is the mean of the effort. The null hypothesis is rejected if the p-value is less than the significance level α (α = 0.005); a greater F-value of the test signifies that the sample means µk and µe are more significantly different. The selected factors are Analyst Capability (acap), Applications Experience (aexp), Programmer Capability (pcap), and Programming Language Experience (lexp). Computer Turnaround Time and Modern Programming Practices, even though their F-values are high, were already selected in the correlation analysis and hence are not selected again here. The F-values for the cost factors Virtual Machine Volatility (virt), Virtual Machine Experience (vexp) and Programming Language Experience (lexp) are approximately the same, but Programming Language Experience is selected because it has a higher r value (in the correlation analysis) than virt and vexp. The result of the one-way ANOVA is given in the chart below; the x-axis represents the fifteen cost factors and the y-axis gives the value of the ANOVA analysis of effort with the corresponding cost factor.

Fig. 3. COCOMO factors analyzed with one-way ANOVA.
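The screening test in equation (8) can be sketched as follows, mirroring the paper's setup of comparing each cost driver's mean against the mean of the actual effort. The column names are the same assumed PROMISE-style names and placeholder path as in the earlier correlation sketch.

```python
# Sketch of the one-way ANOVA screening per equation (8): for each cost driver,
# compare its values against the actual effort values and record F and p.
from scipy.stats import f_oneway
import pandas as pd

df = pd.read_csv("cocomo81.csv")                     # placeholder path, 63 projects
ALPHA = 0.005

results = {}
for factor in ["rely", "data", "cplx", "time", "stor", "virt", "turn",
               "acap", "aexp", "pcap", "vexp", "lexp", "modp", "tool", "sced"]:
    f_stat, p_value = f_oneway(df[factor], df["actual"])
    results[factor] = (f_stat, p_value, p_value < ALPHA)          # True -> reject H0

# Rank factors by F-value, as done in the text, before merging with the correlation-based picks.
for factor, (f_stat, p_value, reject) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{factor}: F = {f_stat:.2f}, p = {p_value:.4f}, reject H0 = {reject}")
```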
The eight effort adjustment factors selected above, together with the effort, are used for clustering. Here, we use the Fuzzy C-Means clustering algorithm for project clustering. It has been observed that in the COCOMO dataset the value of the effort is much higher than the values of the effort factors; hence the results of clustering are easily dominated by the effort values, which makes the clustering poor. To make the clustering depend on all variables, we take the logarithm of the effort, which makes the clustering better. The resultant fuzzy membership matrices produced by clustering into three and four groups are given in the appendix, Tables II and III. In fuzzy clustering a project has some membership in every cluster, but in order to do the optimization, crisp sets are needed. This fuzzy-to-crisp conversion is done by the maximum-membership defuzzification process [14]: a project goes into the cluster in which its membership is highest. In the figures below, the y-axis gives the cluster number and the x-axis gives the membership of a project.

Fig. 4. Result of defuzzification of fuzzy sets into crisp sets for three clusters.

Fig. 5. Result of defuzzification of fuzzy sets into crisp sets for four clusters.
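The log transform and maximum-membership step can be sketched as follows, reusing the fuzzy_c_means helper from the sketch in Section II-D; the function and argument names are illustrative, not the authors' implementation.

```python
# Sketch: log-transform the effort so it does not dominate the clustering, then apply
# maximum-membership defuzzification to the fuzzy membership matrix U.
import numpy as np

def cluster_projects(factors, effort, n_clusters=3):
    """factors: (N, k) selected effort-adjustment factors; effort: (N,) actual effort."""
    X = np.column_stack([factors, np.log(effort)])         # logarithm conversion of effort
    centers, U = fuzzy_c_means(X, n_clusters=n_clusters)   # FCM sketch from Section II-D
    crisp = U.argmax(axis=1)                               # maximum-membership defuzzification
    return centers, U, crisp
```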



A. Parameter Optimization using MOGA

A multi-objective optimization problem has several objective functions which are to be minimized or maximized; mathematically, it can be written as the simultaneous optimization of a set of objective functions f1(x), ..., fk(x) over a common decision space. After obtaining the crisp sets, each group is given to MOGA and MOPSO to optimize the COCOMO parameters. One of the most important advantages of multi-objective optimization is that it yields several non-dominated solutions, from which a user can choose the solution that best matches his or her preference. Here, two objectives are considered: one is to minimize the MMRE and the other is to maximize the prediction PRED. Figs. 6 to 11 present the average values obtained by the above-mentioned optimization algorithms for three and four clusters with different weights, and a comparative result is presented in Table I. Comparing the values in Figs. 6 to 11 and Table I, it can be seen that the performance of MOGA is better than that of the MOPSO algorithm in terms of maximum prediction. Here, we focus on finding the maximum prediction with the lowest MMRE. The best compromise solution of the proposed model shows that this model is effective for estimating software cost. However, research on software cost estimation is generally based on the three project types of COCOMO; there is little related research that uses a project database of the effort adjustment factors and software effort to cluster projects and estimate software effort.

Fig. 6. Average number of Pareto-optimal solutions obtained with weights 0.5 MMRE / 0.5 PRED.

Fig. 7. Average number of Pareto-optimal solutions obtained with weights 0.75 MMRE / 0.25 PRED.

Fig. 8. Average number of Pareto-optimal solutions obtained with weights 0.25 MMRE / 0.75 PRED.
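As a rough illustration of how candidate (a, b) pairs for one cluster could be scored against the two objectives F1 = MMRE and F2 = 1/PRED, the sketch below reuses the mmre, pred and non_dominated helpers from the earlier sketches. A plain random search stands in for the genetic operators of MOGA (and for MOPSO), so this is only a stand-in for the authors' optimizer, and the search ranges for a and b are illustrative.

```python
# Sketch: evaluate candidate COCOMO parameters (a, b) for one cluster on the two
# objectives (MMRE, 1/PRED), both minimized, and keep the non-dominated candidates.
import random

def evaluate(a, b, kloc, eaf, actual):
    """Objectives (F1, F2) = (MMRE, 1/PRED) for intermediate-COCOMO estimates."""
    predicted = [a * k ** b * e for k, e in zip(kloc, eaf)]
    p = pred(predicted, actual)
    return mmre(predicted, actual), (1.0 / p if p > 0 else float("inf"))

def random_search(kloc, eaf, actual, n_candidates=500, seed=1):
    rng = random.Random(seed)
    scored = []
    for _ in range(n_candidates):
        a = rng.uniform(1.0, 5.0)                 # illustrative search ranges, not the paper's
        b = rng.uniform(0.8, 1.5)
        scored.append(((a, b), evaluate(a, b, kloc, eaf, actual)))
    front = non_dominated([obj for _, obj in scored])
    return [(params, obj) for params, obj in scored if obj in front]
```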
In summary, the proposed model uses the Pearson product-moment correlation coefficient and one-way ANOVA analysis to select several effort adjustment factors. It then applies the fuzzy C-means clustering algorithm for project clustering, and the parameters of the COCOMO model are optimized using the Multi-Objective Genetic Algorithm. The results demonstrate the superiority of MOGA in parameter optimization for recovering the accuracy of software cost estimation.

VI. CONCLUSION

In this proposed work, we use MFCM for clustering the dataset. Once the clustering is done, various rules are obtained and these rules are given as the input to the neural network. Here, we modify the neural network by incorporating optimization algorithms; the optimization algorithms employed are the ABC, MCS, and hybrid ABC-MCS algorithms. Hence, we obtain three optimized sets of rules that are used for the effort estimation process. The performance of our proposed method is investigated using parameters such as the MARE and MMRE. The experimental outcomes have demonstrated that our proposed method outperforms the existing method in estimating the software effort more precisely. In the future, we can examine the software effort based on the neuroticism characteristics of employees. Neuroticism characteristics strongly affect the effort and cost of the software. They are important to model because they have been shown to impact a range of cognitive, perceptual, and behavioral processes, such as memory recall (mood-congruent recall), learning, psychological disorders (depression), and decision making.



Characteristics such as anger, anxiety, skill, and joy can then be used to estimate the effort.
REFERENCES
[1] R. Agarwal, M. Kumar, Yogesh, S. Mallick, R.M. Bharadwaj, D.
Anantwar, Estimating software projects, SIGSOFT Software
Engineering Notes 26 (4) (2001) 60–67.
[2] The Standish Group Interaction, “EXTREME CHAOS Report 2001”
[3] Boehm, B.W., “Software Engineering Economics”, Englewood Cliffs,
NJ: Prentice-Hall, 1981.
[4] Patil, Lalit V., Rina M. Waghmode, S. D. Joshi, and V. Khanna, "Generic model of software cost estimation: A hybrid approach", 2014 IEEE International Advance Computing Conference (IACC), 2014.
[5] Fischman, L., K. McRitchie, and D. D. Galorath, "Inside SEER-SEM", CrossTalk, The Journal of Defense Software Engineering, 2005.
[6] Capers Jones, "Estimating Software Costs", Tata McGraw-Hill Edition, 2007.
[7] University of South California Software Engineering Research,
http://sunset.usc.edu/research/, 2006.
[8] Rubin, H. , “ESTIMACS”, IEEE, 1983.
[9] Mitchell, T. M., Machine Learning, McGraw-Hill and MIT Press,
1997.
[10] Conte, S. D., Dunsmore, H. E., Shen, V. Y., "Software Engineering Metrics and Models", Benjamin-Cummings, Menlo Park, CA, 1986.
[11] Boehm B., Abts C., "Software Development Cost Estimation Approaches – A Survey", University of Southern California, 1998.
[12] University of Southern California, USC COCOMO Reference
Manual, 1994
[13] Shepperd, M., Schofield, M., “Estimating Software Project Effort
Using Analogies”, IEEE Transactions on Software Engineering, Nov
1997.
[14] Software Cost Estimation : Metrics and Models,
http://sern.ucalgary.ca/courses/seng/621/W98/johnsonk/
cost.htm#Original%20COCOMO, 2006
[15] K. Moløkken-Østvold et al., “Project Estimation in the Norwegian
Software Industry - A Summary”, 2004, Simulation Research
Laboratory
[16] Sun-Jen Huang , Nan-Hsing Chiu “Applying fuzzy neural network to
estimate software development effort” Applied Intelligence, Springer
Science+Business Media, Vol. 30(2), LLC,pp.73-83, 2007
[17] The Standish Group Interaction, “EXTREME CHAOS Report 2001”.
[18] Yucheng Kao, Jin-Cherng Lin , Jian-Kuan Wu “A Differential
Evolution Approach for Machine Cell Formation” IEEE International
Conference on Industrial Engineering and Engineering Management,
pp.772-775, 2008.
[19] A. Osyczka, “Multicriteria optimization for engineering design”, in
Design Optimization, J. S. Gero, Ed. New York: Academia, pp. 193-
227, 1985
[20] S.Lalwani, S. Singhal, R. Kumar and N. Gupta, “A Comprehensive
Survey: Applications of Multi-objective Particle Swarm
Optimizations (MOPSO) Algorithm”, Transactions of Combinatorics,
Vol. 2, No. 1, pp 39-101, 2013.

