Role of Data Mining and Machine Learning
Software Reusability
Abstract: Integration of machine learning and data mining approaches has opened new streams in software development. Automation of software reusability plays a dynamic role in the software development life cycle (SDLC). It curtails the cost and effort required for development of a software product. This paper provides a taxonomical mapping of reuse metrics to the corresponding machine learning and data mining techniques. Two artificial intelligence techniques, i.e. neural networks and clustering, are inspected in the paper. A model is created to identify the best reusability technique using the aforementioned techniques.

Keywords: Component based software engineering (CBSE), Artificial Intelligence (AI), Artificial Intelligence in software engineering application levels (AI-SEAL), clustering, classification, neural network

I. INTRODUCTION

The software development life cycle is a time-consuming process that involves effort and cost. However, software reusability is cost effective and enhances the yield of the product. AI integrates knowledge discovery and data mining techniques into software engineering processes to generate the field of Software Intelligence. AI generates intelligent software, which can later be reused via data mining and machine learning techniques. Reusability can be categorized into two major components: code components and non-code components. Reusability can be further measured on various metrics. The literature provides a variety of algorithms used for reusability; however, the metrics of reuse are not mapped explicitly.

Therefore, Section II discusses the studied literature, which explains the need for data mining and machine learning in the field of software engineering, especially software reusability. Section III discusses the research conducted regarding the techniques of reusability. Subsection III-A provides a taxonomical mapping of data mining and machine learning techniques to the corresponding reuse metrics. The neural network technique is used to identify and retrieve components that best fit the functionality of the component for reuse. In subsection III-B, a model is suggested using classification, clustering and neural networks, where the identified taxonomy helps in building the basis of the model. Later in the section, an evaluation of the model is provided based on the survey results.

II. LITERATURE REVIEW

We have reviewed a variety of papers for our research. In [4], a comprehensive taxonomy of various AI techniques in software engineering is provided. The taxonomy is named AI-SEAL. It guides investigators and experts to connect and understand the pros and cons of applying AI for development of software. The taxonomy gives guidance about which specific algorithm can be utilized for reusability. The three dimensions proposed as important are Type of AI (TAI) applied, Point of Application (PA), and Level of Automation (LA) offered. The paper further explains the application of PA in SWEBOK's knowledge areas as to where and when AI can be applied. In [5], CBSE is described as an emerging trend. Techniques such as K-means and cosine similarity have been applied to various reusable non-code components, i.e. project plan, architecture, design, detailed architecture and pattern, source code and test cases. The metrics for reuse are identified as weighted methods per class (WMC), depth of inheritance tree (DIT), number of children (NOC), coupling between classes (CBO) and response for class (RFC). In [7], L. Kaur and A. Mishra used classification models, also known as meta-classifiers, to classify data based on seven reusability metrics on four versions of the same software. The metrics used for classification are coupling between objects (CBO), efferent coupling (EC), depth of inheritance (DIT), lack of cohesion between methods (LCOM), number of calls, number of methods and cyclomatic complexity (CC). The seven meta-classifiers used in the paper are: AdaBoost, filtered, M-class, bagging, random subspace, stacking and voting. In [5], a component clustering method is used for grouping components with similar characteristics. K-means, K-modes, K-medoids and hierarchical clustering are the data mining approaches used in that paper. Once the clusters are designed, the anticipated component can be found by searching the cluster. Hence, this technique proves beneficial for reuse of components. In [3], N. Krishna Chythanya and L. Rajamani explain that CBSE divides development into two aspects: "development for reuse" and "development by reuse". Therefore, massive repositories of components are created to be reused later. Neural network algorithms effectively scan the repositories to identify and retrieve a specific component. The paper explains that reusability can be measured on two schemes, i.e. empirical and qualitative.

In [13], D. P. Wangoo identifies various AI techniques that lead to software intelligence. The paper explains that intelligent knowledge discovery can be performed using AI techniques on software engineering data to create software intelligence. Software intelligence leads to intelligent software reuse, which finally generates intelligent automation. Various AI techniques are mapped to the software process. Techniques included in the paper are artificial neural networks, neuro-fuzzy neural networks, evolutionary algorithms, fuzzy systems, etc. All the aforementioned techniques are fruitful for software reusability.

In [1], a case study of Microsoft is discussed: Microsoft is currently using AI techniques in its pre-existing software development processes. Developers have faced three challenges when applying AI techniques to software development processes. First, discovering, organizing, managing and versioning the data needed for machine learning applications is much more complicated and difficult than in other domains of software development. Second, personalizing and customizing a model, and reusing a model, require a different skillset from the software development team. Third, AI components are more complex to tackle as separate entities and modules than typical software modules; models may be "entangled" in a complicated manner and show non-continuous error behavior. The authors conducted surveys in the form of interviews and by observing software flows. The findings are presented in statistical form and identify each challenge and the corresponding measures used for addressing it.

Paper [2] maps software processes to state-of-the-art AI techniques. For example, it maps knowledge-based systems to requirements engineering, as it identifies that reuse of experts' design knowledge can play a substantial part in refining the quality and productivity of the software development process. Similarly, other processes are linked to their effective AI methods.

Fig. 1. Steps followed for applying reusability in software engineering
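Several of the reviewed approaches, for example the K-means and cosine similarity techniques reported in [5], compare reusable artifacts as numeric feature vectors. The following is a minimal sketch of cosine similarity in plain Python. The artifact names, the three-word vocabulary and the term counts are hypothetical and chosen only for illustration; they are not taken from [5].

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical term counts over the vocabulary ("login", "report", "payment").
artifacts = {
    "design_doc_A": [3, 0, 1],
    "test_plan_B": [0, 4, 0],
    "design_doc_C": [2, 0, 2],
}

# Hypothetical query: an artifact that is only about "login".
query = [1, 0, 0]

# Rank stored artifacts from most to least similar to the query.
ranked = sorted(
    artifacts,
    key=lambda name: cosine_similarity(query, artifacts[name]),
    reverse=True,
)
print(ranked)
```

Because cosine similarity depends only on the direction of the vectors, a long and a short document with the same term proportions score as identical, which is one reason it is a common choice for comparing non-code artifacts of different sizes.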
III. OUR PROPOSED FRAMEWORK

This section provides a taxonomy for two techniques of data mining, i.e. clustering and classification, and one technique of machine learning, i.e. neural networks, which are also used in combination for predicting and measuring reusability later in the section. The taxonomy discusses the usage of the techniques and their purpose. The section also presents a model suggesting classification, clustering and neural networks. Later in the section, an evaluation of the model is provided based on survey results.

A. Taxonomical Mapping

The key purpose of software reusability is to curtail recurrence of effort, time and cost. According to [10], the US Department of Defense alone could save 300 million dollars annually by reusing software components to increase its reusability level by as little as 1%.

At a high level, two aspects of reusability are found, i.e. usability and usefulness [6].

Reusability = Usability + Usefulness

Usability is the "ease of use" of the component being used and is not dependent on the features of the component, whereas usefulness is suitability for the required purpose, i.e. it depends on the functionality, quality and generality of the components. Keeping this definition of reusability, the steps [11] required for reusability are identified in Fig. 1.

Multiple metrics are required for applying data mining and machine learning techniques for reusability. Table I identifies metrics for the object oriented paradigm [7], [11], [12], [14], whereas Table II presents metrics for the procedural paradigm [11].

Table III and Table IV provide a taxonomy for two techniques of data mining, i.e. clustering and classification, and one technique of machine learning, i.e. neural networks, which are also used in amalgamation for forecasting and calculating reusability [11], [9]. The techniques are mapped to the metrics that are used for calculating the reusability of an artifact with each technique.

TABLE I
SOFTWARE REUSABILITY METRICS FOR OBJECT ORIENTED PROGRAMMING PARADIGM

Metrics
Coupling between classes (CCBC)
Number of children of a class (NOC)
Depth of Inheritance tree (DIT)
Weighted method per class (WMC)
Response for Class (RFC)
Method hiding factor (MHF)
Method Inheritance factor (MIF)
Polymorphism Factor (PF)
Efferent Coupling (EC)
Coupling between object classes (CBO)
Lack of cohesion in methods (LCOM)
Number of Interfaces
Class size
Number of Classes

TABLE II
SOFTWARE REUSABILITY METRICS FOR PROCEDURE ORIENTED PROGRAMMING PARADIGM

Metrics
Cyclomatic complexity (CC)
Cyclomatic density (CD)
Total lines of code (TLOC)
Comments lines of code (CLOC)
Executable lines of code (ELOC)
Blank lines of code (BLOC)
Node code (NC)
Halstead Metrics
Reuse Frequency Metrics
Regularity Metrics
Coupling Metrics

B. Model to improve Quality

Clustering refers to grouping together components with the same or similar characteristics in one cluster. It makes searching for and retrieving components easier and faster, as the search space is reduced. In other words, clustering makes the "divide and conquer" rule possible [7]. In [11] it is proposed that the reusability of object oriented software can be predicted using a clustering approach with software reusability metrics. [13] suggests that the clustering technique is useful not only for object oriented software but for the procedure oriented software paradigm as well. In both cases, the main function for which clustering is used is predicting the reusability of components based on the concluded metrics. Similar software components are then combined in clusters. The metrics are important for generating meta-information about a software component, which acts as the attribute when different clustering techniques and algorithms are applied. Therefore, the metrics mentioned in Table III and Table IV will be used to generate clusters per metric.

Artificial neural networks (ANNs) replicate the working pattern of the human nervous system (the working of the brain). ANN is a varying deep learning technology originating from diverse domains of AI. Deep learning is a subpart of machine learning which includes neural networks. The motivation behind neural network approaches is the way the human brain functions, and scientists therefore regard them as a step towards real AI. There are several types of neural network algorithms, e.g. feed-forward artificial neural networks, radial basis function neural networks, convolutional neural networks, recurrent neural networks, modular neural networks, etc. The performance of neural networks depends on the size of the dataset used: the bigger the dataset, the better the performance. Therefore, they can be effectively used in software reusability.

The model proposed in the paper will use K-means clustering [8] on the component repository, grouping the data so that similarity is highest within a cluster and lowest between different clusters. The data is partitioned into different levels of reusability value based on the features and attributes. K-means produces K clusters. Each cluster has a centroid, and each data value is assigned to the closest centroid, i.e. the one with the highest similarity [7]. K-means clustering is a fast and efficient hard clustering technique that produces tight clusters when the number of attributes or values increases [8].

In the first step of constructing the model, clusters will be created using the metrics identified in the taxonomy for the OO paradigm, namely Number of Children, Lack of Cohesion in Methods, Weighted Methods per Class, Coupling Between Object Classes and Depth of Inheritance Tree, since all five metrics are commonly used by both K-means clustering and ANN. Once the clusters are formed, a perceptron with a single hidden layer will be used to identify the desired component from the gathered clusters. One by one, the ANN will work on all clusters and will give a binary (0 or 1) output indicating whether the component was found or not found.

Fig. 2 identifies the model suggested in the paper. The accuracy of the model can be tested on a data set once it is implemented in practice. Clusters will gather similar components together, and the ANN will take less time to identify the desired component.

Fig. 2. Model to improve quality with respect to reusability achieved using Machine Learning and Data mining

IV. SURVEY ANALYSIS

Scholars and intellectuals of the domain helped in model evaluation. The questionnaire covered three important aspects, i.e. automation of reusability, effectiveness of machine learning in reusability and the design of the model. Google Forms was the medium used to conduct this survey. A two-week period was allocated to collect the survey results. The gathered data and analysis are presented below.

Fig. 3 displays the responses to the survey question on how helpful automating reusability would be in the software industry. 89.7% of respondents were in favour of the idea of automating reusability. Fig. 4 shows the responses regarding software repositories that can be used for finding reusable software components; this question was intended to collect data on the availability of software component repositories. 51.7% of respondents indicated that there are sufficient software repositories, while 17.2% disagreed with their availability.

Fig. 5 depicts how effective clustering can be for finding similar components in a large repository of components. 55.2% of respondents answered that clustering is highly effective. Fig. 6 displays the response percentages for K-means clustering suggested as a highly effective algorithm for large datasets. While 37.9% of respondents believe K-means clustering is highly effective, 51.7% still responded with "Maybe". This shows that K-means
TABLE III
MAPPING OF SOFTWARE REUSABILITY METRICS WITH TECHNIQUES OF DATA MINING AND MACHINE LEARNING FOR OBJECT ORIENTED PROGRAMMING PARADIGM

Software Reusability Metrics: Techniques for Reusability

Coupling between classes (CCBC)
• K-means Clustering
• Hierarchical Clustering

Number of children of a class (NOC)
• K-means Clustering
• Hierarchical Clustering
• Neuro-Fuzzy Inference
• Feed forward Neural Network
• K-modes Clustering
• Support Vector Machine Classifier
• K-NN Classifier

TABLE IV
MAPPING OF SOFTWARE REUSABILITY METRICS WITH TECHNIQUES OF DATA MINING AND MACHINE LEARNING FOR PROCEDURE ORIENTED PROGRAMMING PARADIGM

Software Reusability Metrics: Techniques for Reusability

Cyclomatic complexity (CC)
• K-means Clustering
• Feed forward Neural Network
• K-modes Clustering
• Support Vector Machine Classifier
• K-NN Classifier

Cyclomatic density (CD)

Total lines of code (TLOC)
• K-NN Classifier
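The first stage of the model in Section III-B, clustering a component repository on reuse metrics and then searching only the cluster nearest to a query, can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the component names and metric values are hypothetical, the centroids are seeded with a deterministic farthest-first rule (an assumption, chosen so the run is reproducible), and the second stage (the neural network that reports found or not found) is not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length metric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point farthest from the centroids chosen so far (assumed init rule)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm; returns (centroids, labels)."""
    centroids = farthest_first_init(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


def find_candidates(query, names, points, centroids, labels):
    """Search only the cluster whose centroid is nearest to the query and
    return its members ordered from closest to farthest."""
    nearest = min(range(len(centroids)), key=lambda c: euclidean(query, centroids[c]))
    members = [
        (euclidean(query, p), n)
        for n, p, lab in zip(names, points, labels)
        if lab == nearest
    ]
    return [n for _, n in sorted(members)]


# Hypothetical repository; vector order is (NOC, LCOM, WMC, CBO, DIT).
names = ["Logger", "Config", "Cache", "ReportEngine", "Workflow", "Scheduler"]
points = [
    (0, 1, 5, 2, 1),
    (1, 2, 6, 3, 1),
    (0, 2, 4, 2, 2),
    (6, 20, 40, 15, 5),
    (5, 18, 38, 14, 4),
    (7, 22, 42, 16, 5),
]

centroids, labels = kmeans(points, k=2)

# Hypothetical metric profile of the component being looked for.
query = (1, 1, 5, 2, 1)
candidates = find_candidates(query, names, points, centroids, labels)
print(labels, candidates)
```

Only the three components in the nearest cluster are compared with the query, which is the search-space reduction the model relies on. On real data the metrics would normally be scaled to comparable ranges before clustering, since raw values such as WMC would otherwise dominate the distance.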
Fig. 9. ANN and cost effectiveness

Fig. 10. Correctness of model

REFERENCES

[1] Saleema Amershi et al. “Software engineering for machine learning: a case study”. In: Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice. IEEE Press. 2019, pp. 291–300.
[2] Hany H Ammar, Walid Abdelmoez, and Mohamed Salah Hamdi. “Software engineering using artificial intelligence techniques: Current state and open problems”. In: Proceedings of the First Taibah University International Conference on Computing and Information Technology (ICCIT 2012), Al-Madinah Al-Munawwarah, Saudi Arabia. 2012, p. 52.
[3] N Krishna Chythanya and Lakshmi Rajamani. “Neural Network Approach for Reusable Component Handling”. In: 2017 IEEE 7th International Advance Computing Conference (IACC). IEEE. 2017, pp. 75–79.
[4] Robert Feldt, Francisco G de Oliveira Neto, and Richard Torkar. “Ways of applying artificial intelligence in software engineering”. In: Proceedings of the 6th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering. ACM. 2018, pp. 35–41.
[5] “[Front matter]”. In: 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom). 2016, pp. i–64.
[6] Capers Jones. Estimating software costs: Bringing realism to estimating. McGraw-Hill Companies New York, 2007.
[7] Loveleen Kaur and Ashutosh Mishra. “An Empirical Analysis for Predicting Source Code File Reusability Using Meta-Classification Algorithms”. In: Advanced Computational and Communication Paradigms. Springer, 2018, pp. 493–504.
[8] Ajay Kumar. “Measuring Software reusability using SVM based classifier approach”. In: International Journal of Information Technology and Knowledge Management 5.1 (2012), pp. 205–209.
[9] Marko Mijač and Zlatko Stapić. “Reusability metrics of software components: survey”. In: 26th Central European Conference on Information and Intelligent Systems (CECIIS 2015). 2015.
[10] Jeffrey S Poulin. Measuring software reuse: principles, practices, and economic models. Addison-Wesley Reading, MA, 1997.
[11] BV Ajay Prakash, DV Ashoka, and VN Manjunath Aradhya. “Application of data mining techniques for software reuse process”. In: Procedia Technology 4 (2012), pp. 384–389.
[12] Anju Shri et al. “Prediction of reusability of object oriented software systems using clustering approach”. In: World Academy of Science, Engineering and Technology 43 (2010), pp. 853–856.
[13] Divanshi Priyadarshni Wangoo. “Artificial Intelligence Techniques in Software Engineering for Automated Software Reuse and Design”. In: 2018 4th International Conference on Computing Communication and Automation (ICCCA). IEEE. 2018, pp. 1–4.
[14] Syeda Iffat Zahara, Muhammad Ilyas, and Tehseen Zia. “A study of comparative analysis of regression algorithms for reusability evaluation of object oriented based software components”. In: 2013 International Conference on Open Source Systems and Technologies. IEEE. 2013, pp. 75–80.