Project Report
for
Dynamic Selection of Classifier for Software Fault Prediction
by
July-2021
DECLARATION
We hereby declare that this submission is our own work and that, to the best of our
knowledge and belief, it contains no material previously published or written by another
person nor material which to a substantial extent has been accepted for the award of any
other degree or diploma of the university or other institute of higher learning, except
where due acknowledgment has been made in the text.
Signature :
Name: Devansh Rastogi
Roll no : 1709113033
Date:
Signature :
Name: Navendu Raj
Roll no : 1709113065
Date:
Signature :
Name: Satyam Agarwal
Roll no : 1709113093
Date:
Signature :
Name: Vaibhav Chaudhary
Roll no : 1709113116
Date:
CERTIFICATE
This is to certify that Project Report entitled “Dynamic selection of classifier for
software fault prediction” which is submitted by Devansh Rastogi, Navendu Raj,
Satyam Agarwal, Vaibhav Chaudhary in partial fulfillment of the requirement for the
award of degree B.Tech. in Department of Information Technology of Dr APJ Abdul
Kalam Technical University, is a record of the candidates' own work carried out by them
under my supervision. The matter embodied in this thesis is original and has not been
submitted for the award of any other degree.
ACKNOWLEDGEMENT
It gives us a great sense of pleasure to present the report of the B. Tech Project undertaken
during B. Tech. Final Year. We owe a special debt of gratitude to Mrs. Yogita Khatri,
Department of Information Technology, JSS Academy of Technical Education, Noida for
her constant support and guidance throughout the course of our work. Her sincerity,
thoroughness and perseverance have been a constant source of inspiration for us. It is only
her cognizant efforts that our endeavors have seen the light of the day.
We would also not like to miss the opportunity to acknowledge the contribution of all faculty
members of the department for their kind assistance and cooperation during the
development of our project. Last but not the least, we acknowledge our friends for their
contribution in the completion of the project.
Signature :
Name: Devansh Rastogi
Roll no : 1709113033
Date:
Signature :
Name: Navendu Raj
Roll no : 1709113065
Date:
Signature :
Name: Satyam Agarwal
Roll no : 1709113093
Date:
Signature :
Name: Vaibhav Chaudhary
Roll no : 1709113116
Date:
ABSTRACT
Defective modules in software projects pose a considerable risk: they reduce software
quality, decrease customer satisfaction, and increase development and maintenance cost.
In the software development life cycle, it is therefore very important to predict defective
modules at an early stage so as to improve the software developers' ability to focus on the
quality of the software.
Determining the most appropriate learning technique(s) is vital for the accurate and
effective software fault prediction (SFP). Earlier techniques used for SFP have reported
varying performance for different software projects and none of them has always reported
the best performance across different projects.
This project presents a Software fault prediction model based on a dynamic selection
approach to predict whether a module is faulty or not. The basic idea of dynamic selection
approach is that different learning techniques have varying prediction capabilities for
different subsets of the input modules. Each of them has a sub domain, where it performs
better compared to the other learning techniques. The different classifiers used are SVM,
Decision Tree, MLP, GaussianNB, and MultinomialNB. The performance parameter F1-score is
calculated for measuring the performance and validation of this work.
Based on this concept, we present an approach to dynamically select a classifier for the
prediction of faults. The presented approach selects the best learning technique for each
unseen testing module in the given testing dataset. For a given unseen testing module,
we determine the subset that has modules similar to the given testing module. The learning
technique that has the best prediction performance for the determined subset is then
selected for predicting faults in the testing module.
TABLE OF CONTENTS
Page
DECLARATION ..............................................................................................................................ii
CERTIFICATE ...............................................................................................................................iii
ACKNOWLEDGEMENTS .............................................................................................................iv
ABSTRACT ......................................................................................................................................v
LIST OF FIGURES..........................................................................................................................ix
LIST OF TABLES.......................................................................................................................... x
CHAPTER 1: INTRODUCTION .............................................................................................. 1
CHAPTER 2: LITERATURE SURVEY ................................................................................... 7
    2.1.1 SVM ............................................................................................................................ 8
CHAPTER 3 .............................................................................................................................. 20
CHAPTER 4 .............................................................................................................................. 31
CHAPTER 5: CONCLUSION .................................................................................................. 40
    5.1 Conclusion ..................................................................................................................... 40
LIST OF FIGURES
Fig 4.3 : Number of faulty and non-faulty modules before preprocessing…………………….34
Fig 4.6 : Accuracy metrics for clusters and selection of expert model…… ………….36
LIST OF TABLES
TABLE 2.1 Literature Survey…………………………...……………………… 15
CHAPTER 1
INTRODUCTION
This report presents an approach that dynamically selects learning techniques to predict
software faults. For a given testing module, the presented approach first locates its
neighbor module subset, which contains modules similar to the testing module, using a
distance function, and then chooses the best learning technique in the region of that module
subset to make the prediction for the testing module. The learning technique is selected
based on its past performance in the region of the module subset.
The purpose of SFP is to reveal software modules which are likely to contain a large
number of faults. It helps in allocating software testing efforts optimally. It helps the
software tester to prioritize the testing efforts based on the fault information and also
allows her/him to locate the maximum number of faults early and quickly. However, only
limited works are available in the literature focusing on fault prediction. Therefore, in this
report, we focus on the prediction of faults in software modules.
This report presents a dynamic selection approach, which dynamically selects the best
learning technique for the given testing module. The dynamic behavior lies in the sense
that the learning technique that is used to predict the fault in the unseen testing module
depends on the characteristics of the given module. The presented dynamic selection
approach is two-fold. In the first fold, we partition the validation dataset into different
disjoint module subsets using a partitioning technique and determine the best learning
technique for each subset. In the second fold, for an unseen testing module, we determine
the subset from the validation dataset that has modules similar to the given testing module.
The learning technique that has the best prediction performance for the determined subset
is now selected for predicting fault in the testing module.
1.1.1 Motivation
Software fault prediction is one of the major activities of quality assurance. Fault
prediction plays a significant role in the reduction of software cost and time. Even though
many prediction techniques are available in software engineering, there is a need for a
stable software fault prediction methodology that can perform consistently better across
different modules.
The main motivation of this report was to find an answer to how to overcome the setbacks
of SFP, i.e., that techniques show varying performance for different software modules and
very few have consistently reported the best performance across different modules when
finding faults before the software is released to the market. By fully using the benefits of
testing, the other processes of software development become more qualitative. The usage
of a fair fault prediction algorithm in a continuous delivery environment is the final goal of
this report. From reading several articles, it can be seen that interest in the fault prediction
topic has been increasing during the last few years. Even though the number of
contributions is large, the same holds true for the unsolved issues and future improvements
that can be made in the field. It is interesting how the field has evolved from simple
metrics to more complex algorithms in recent years. Most of the work is based on data
gathered over a long period after release. The type of data that certain algorithms require
differs from algorithm to algorithm: some rely on testing metrics, others on process quality
data.
If we were somehow able to add the prediction capability of a model to identify which
modules are most likely to be faulty before testing, it could help the management to
effectively and reasonably allocate the limited testing resources and could improve the
efficiency of the testing process to a great extent.
Earlier classifiers used for Software Fault Prediction have reported varying performance
for different software modules and none of them has always reported the best performance
across different modules. The empirical evaluation of all these approaches indicated that
there is no machine learning classifier providing the best accuracy in any context,
highlighting interesting complementarity among them. For these reasons, dynamic
selection of learning techniques has been proposed to estimate the bug-proneness of a
class by combining the predictions of different classifiers.
Our aim in this project is to dynamically select the most appropriate learning technique(s)
or classifier for accurate and effective software fault prediction (SFP), which will in turn
help us:
● To resolve the problem of the absence of any general framework for software defect
prediction, which helps us to identify the modules that are likely to have faults. This aids
the software project management team in dealing with those areas of the project on a
timely basis and with sufficient effort.
● To study various techniques and tools available for finding which software modules are
frequently faulty.
● To find out the error rate using the predicted and actual values obtained from the fault
prediction techniques.
1.1.4 Scope of the Project
In the proposed approach, only the classification algorithms and their dynamic selection
have been implemented. However, in the future, new and different machine learning
methods and models can be tried in this environment. More new and important attributes
could be added to the currently used datasets to expand the effectiveness of the models that
we have used. We will try to use more diverse datasets and some other techniques and
methods to simplify and generalize our findings. Moreover, in the forthcoming time we
will emphasise using datasets from different domains and then evaluating and validating
some new models for the prediction of the number of faulty and non-faulty modules.
Several experiments have been conducted on the prediction of software faults. A lot of
them have used regression, single classifiers, or ensemble techniques for SFP. The results
show that the dynamic selection approach improves the overall performance of the model.
Afzal et al. [21] used genetic programming (GP) to establish a failure prediction model for
predicting the software failure count. They established and evaluated failure prediction
models for three industrial software systems, using weekly failure counts as independent
variables. Experimental results show that GP has significant performance in predicting
software failure counts.
Rathore and Kumar [13] evaluated different count models to predict the number of
failures. The research includes five different techniques based on count models. The
evaluation of the count models was carried out using two different software systems,
which contain various software metrics related to complexity metrics and object-oriented
metrics. The results show that the zero-inflated NBR model and the hurdle NBR model
are superior to the other count models in predicting the number of failures.
Gao and Khoshgoftaar [9] evaluated Poisson regression and zero-inflated Poisson (ZIP)
regression for failure prediction, and compared their performance with logistic regression.
It turns out that the performance of the model based on logistic regression is comparable to
the other predictive models used.
Chen and Ma [26] studied the use of different regression techniques to predict the number
of defects. The predictive model is built for multiple accumulated open source software
project data sets from the PROMISE data repository, and the results are evaluated for
intra-project and cross-project scenarios. Chen and Ma found that, in general, regression-
based techniques work well for predicting the number of defects. They report that decision
tree regression (DTR) is superior to other techniques used.
Menzies et al. [5] studied defect prediction and effort estimation using rules generated
from local project data (the same software project) and from data of different software
projects. Several datasets from the PROMISE data repository were studied experimentally.
The research results suggest using local data to select the best learning technique for
predicting defects. They also suggested that, by selecting the closest source dataset for a
given test module, predictive performance can be improved.
Several regression-based approaches have been used in the literature to predict the
number of faults in software systems. However, the results of previous studies found no
clear winner among these approaches. Moreover, the performance of the different
approaches varied with respect to the dataset used. Ghotra et al. [21] found that the
performance of a fault prediction model can vary by up to 30%, depending upon the type
of learning technique used.
The rest of the report is organized as follows:
- Chapter 1 presents the research problem and research objectives, justifies the need for
carrying out the research work, and outlines the main contributions arising from the work
undertaken.
- Chapter 2 provides the essential background and context for this project.
- Chapter 3 provides the details of the system architectural design and methodology.
- Chapter 4 provides the results of the experiments.
- Chapter 5 concludes the report.
This chapter has laid the foundations for this project. It briefly introduced the research
problem, research objectives, scope of the project, previous related work, and the proposed
solution framework. The next chapter examines the pertinent literature most relevant to our
research.
CHAPTER 2
LITERATURE SURVEY
This chapter focuses on a review of the software fault prediction methods that have
already been implemented. The previous research work in SFP basically revolves around
two broad approaches, which are as follows:
● Single Classifiers for SFP
● Ensemble Techniques for SFP
i) Single classifiers for fault prediction: several machine learning classifiers have been
used in the literature [2]. Examples include logistic regression (LOG), support vector
machines (SVM), radial basis function networks (RBF), multilayer perceptrons (MLP),
Bayesian networks (BN), decision trees (DTree), and decision tables (DTable). However,
previous studies have shown that there is no definite winner among these classifiers
[20] [24]. Depending in particular on the datasets used, researchers have found some
classifiers achieving higher performance than others.
ii) Ensemble techniques for fault prediction: ensemble techniques aim to provide better
classification performance by combining various classifiers. Recently, several methods
have been proposed that infer reliable and unreliable classifiers using meta-data, following
the stacking ensemble technique [6] [5] [13] [2]. These techniques take the classifiers'
predicted classifications as input. However, unlike stacking, a dynamic selection model
does not use the predictions of the base classifiers to more accurately predict the
fault-proneness of a particular class; rather, it uses the class's characteristics to select the
classifier that best predicts its fault tendency.
Few software defect prediction methods use dynamic selection, especially dynamic
selection of classifiers for software defect prediction. The proposed method aims to take
advantage of various learning techniques to improve the overall performance of the SFP
model. The proposed method uses the characteristics of different learning methods to
better select the best classifier for defect detection in a given software module.

2.1.1 SVM
SVM is considered one of the newer trends in machine learning algorithms; it can deal with
nonlinear problem data by using kernel functions [35]. SVM achieves high classification
accuracy because it can map input data into a high-dimensional space in which nonlinearly
separable data becomes linearly separable. The main concept of SVM is the maximization
of the margin distance between different classes while minimizing the training error. The
hyperplane is determined by selecting the samples closest to the margin. SVM solves
classification problems by building a global function after completing the training phase
over all samples. One of the disadvantages of global methods is the high computational
cost required. Furthermore, a global method sometimes cannot achieve a sufficient
approximation, because no parameter values can be provided in global solution methods.
The parameters of the maximum-margin hyperplane are derived by solving the
optimization problem. There exist several specialized algorithms for quickly solving the quadratic
programming (QP) problem that arises from SVMs, mostly relying on heuristics for
breaking the problem down into smaller, more manageable chunks.
The special case of linear support-vector machines can be solved more efficiently by the
same kind of algorithms used to optimize its close cousin, logistic regression; this class of
algorithms includes sub-gradient descent. LIBLINEAR has some attractive training-time
properties. Each convergence iteration takes time linear in the time taken to read the
training data, and the iterations also have a Q-linear convergence property, making the algorithm
extremely fast.
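As an illustration of how such a classifier can be trained in practice, the following is a minimal sketch using scikit-learn; the data here is randomly generated toy data standing in for module metrics, not one of the project datasets.

# Minimal sketch (assumed, not the exact project code): training an SVM
# fault-prediction classifier with scikit-learn on a module-metrics matrix.
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import numpy as np

# X: software metrics per module, y: 1 = faulty, 0 = non-faulty (toy data here)
rng = np.random.default_rng(0)
X = rng.random((200, 21))
y = rng.integers(0, 2, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The RBF kernel maps the metrics into a higher-dimensional space where the
# classes become (approximately) linearly separable.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))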
2.1.2 Naive Bayes
Naive Bayes belongs to the supervised learning family of algorithms; it is considered easy
to train due to the fact that events are assumed to be unrelated and independent of one another.
Naive Bayes can be extended to real-valued attributes, most commonly by assuming a
Gaussian distribution.
This extension of naive Bayes is called Gaussian Naive Bayes. Other functions can be used
to estimate the distribution of the data, but the Gaussian (or Normal distribution) is the
easiest to work with because you only need to estimate the mean and the standard
deviation from your training data.
The Gaussian Naive Bayes classifier is a quick and simple classification technique that
works very well without too much effort and with a good level of accuracy.
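A brief, hedged illustration of this idea: scikit-learn's GaussianNB only needs the per-class mean and variance of each metric, estimated from the training data. The values below are made-up toy metrics, not project data.

# Gaussian Naive Bayes estimates a mean and variance per class and per feature.
from sklearn.naive_bayes import GaussianNB
import numpy as np

X = np.array([[10.0, 2.1], [12.5, 3.0], [80.0, 9.5], [95.0, 11.2]])  # toy metric values
y = np.array([0, 0, 1, 1])                                           # 0 = clean, 1 = faulty

gnb = GaussianNB().fit(X, y)
print(gnb.theta_)   # per-class feature means
print(gnb.var_)     # per-class feature variances (attribute name in recent scikit-learn)
print(gnb.predict([[85.0, 10.0]]))   # classified as faulty (class 1)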
2.1.3 Decision Tree
Decision Tree is a supervised learning technique that can be used for both classification
and regression problems, but mostly it is preferred for solving classification problems. It
is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules, and each leaf node represents the outcome. In a
decision tree, for predicting the class of a given instance, the algorithm starts from the root
node of the tree. The algorithm compares the value of the root attribute with the
corresponding attribute of the record (real dataset) and, based on the comparison, follows
the branch and jumps to the next node.
In a decision tree, there are two kinds of nodes: decision nodes and leaf nodes. Decision
nodes are used to make a decision and have multiple branches, whereas leaf nodes are the
outputs of those decisions and do not contain any further branches. The decisions or tests
are performed on the basis of features of the given dataset.
It is called a decision tree because, similar to a tree, it starts with the root node, which
expands into further branches and constructs a tree-like structure. In order to build the
tree, we use the CART algorithm, which stands for Classification and Regression Tree.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits
the tree into subtrees.
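The following sketch (illustrative only, not the project's exact configuration) fits a CART tree with scikit-learn and prints its decision rules; the feature names are example metric names, not the project's column names.

# Fit a small CART decision tree and print its learned rules.
from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((100, 4))
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)   # toy rule standing in for fault labels

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["loc", "v(g)", "ev(g)", "iv(g)"]))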
2.1.4 Multi-layer Perceptron
Multi-layer perceptron (MLP) is a class of feed-forward neural network. It consists of
three types of layers: the input layer, the output layer and the hidden layer, as shown in Fig. 2.2.
The input layer receives the input signal to be processed. The required task such as
prediction and classification is performed by the output layer. An arbitrary number of
hidden layers that are placed in between the input and output layer are the true
computational engine of the MLP. Similar to a feed-forward network, in an MLP the data
flows in the forward direction from the input to the output layer. The neurons in the MLP are
trained with the back propagation learning algorithm. MLPs are designed to approximate
any continuous function and can solve problems which are not linearly separable. The
major use cases of MLP are pattern classification, recognition, prediction and
approximation.
Fig. 2.2. Schematic representation of a MLP with a single hidden layer.
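A minimal sketch, assuming scikit-learn's MLPClassifier: a network with one hidden layer trained by backpropagation on scaled inputs. The layer size and iteration count are illustrative choices, not the project's exact settings.

# One-hidden-layer MLP on standardized toy metric data.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((300, 21))
y = rng.integers(0, 2, 300)

mlp = make_pipeline(
    StandardScaler(),                                # MLPs train better on scaled inputs
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)
mlp.fit(X, y)
print(mlp.predict(X[:5]))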
2.1.5 Logistic Regression
In logistic regression, independent variables are analyzed to determine a binary outcome,
with the results falling into one of two categories. The independent variables can be
categorical or numeric, but the dependent variable is always categorical. Written like this:
P(Y=1|X) or P(Y=0|X)
Logistic regression is used to calculate the probability of a binary event occurring, and to
deal with issues of classification. For example, predicting if an incoming email is spam or
not spam, or predicting if a credit card transaction is fraudulent or not fraudulent. In a
medical context, logistic regression may be used to predict whether a tumor is benign or
malignant. In marketing, it may be used to predict if a given user (or group of users) will
buy a certain product or not. An online education company might use logistic regression to
predict whether a student will complete their course on time or not.
Logistic regression uses an equation as the representation which is very much like the
equation for linear regression. In the equation, input values are combined linearly using
weights or coefficient values to predict an output value. A key difference from linear
regression is that the output value being modelled is a binary value (0 or 1) rather than a
numeric value. Here is an example of a logistic regression equation:
P(Y=1|X) = 1 / (1 + e^-(b0 + b1*x1 + b2*x2 + ... + bn*xn))
In the equation, each column xi in your input data has an associated coefficient bi (a
constant real value) that must be learned from your training data, and b0 is the intercept
(bias) term.
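A small worked example of the equation above; the coefficient values and the two metrics are invented purely for illustration.

# Worked numeric example of the logistic (sigmoid) equation:
# P(Y=1 | x) = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2)))
import math

b0, b1, b2 = -2.0, 0.03, 0.5      # hypothetical learned coefficients
x1, x2 = 120, 4                   # e.g. lines of code and cyclomatic complexity

z = b0 + b1 * x1 + b2 * x2        # z = 3.6
p_faulty = 1 / (1 + math.exp(-z))
print(round(p_faulty, 3))          # ~0.973 -> classified as faulty at a 0.5 threshold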
Several regression-based approaches have been used in the literature to predict the number
of faults in software systems. However, the results of previous research did not find a
clear winner among these approaches. In addition, the performance of the approaches
differs across the datasets used.
Saiqa Aleem et al. [2] used 15 datasets (such as AR1, AR6, CM1, KC1, KC3, etc.) with
several machine learning methods. The performance of each method was measured, and it
was concluded that SVM, MLP and bagging-based ensembles give high precision and
performance.
Ostrand et al. [13] carried out different studies using the negative binomial regression
(NBR) technique to predict the number of failures of a given software system [20], [26],
[27]. The experiments were performed using several software metrics based on file
characteristics and LOC for two industrial software projects. The results were evaluated
using a performance measure that counts the number of faults detected by the prediction
model in the top 20% of files. It turned out that NBR demonstrated accurate performance
in predicting the number of software failures, and the simplicity of the model reduces the
effort needed to build it.
Yu and Janes [16] rebuilt the prediction model and found that it performed accurately for
SFP. Jiang et al. [17] and Li and Luita [18] reached the same conclusion, as it was found
that the model works when it finds the appropriate combination of different metrics. The
proper combination of metrics depends on how data collection is carried out.
Catal and Diri [9] suggested that if the project does not have labeled data, a
semi-supervised approach should be followed; however, this approach is not very popular.
In a subsequent investigation, Catal [19] gives as a reason that data collection can only be
done for a large part of the modules. Even so, they warn researchers to pay attention to
the results achieved in different releases.
Arvinder Kaur et al. [4] evaluated the application of random forests to predict fault-prone
classes using open source software. The researchers used the open source software JEdit
with object-oriented metrics to carry out the study. Based on the experimental results, the
precision of RF is 74.24%, the accuracy is 72%, the recall is 79%, the F-measure is 75%,
and the AUC is 0.81.
Malkit Singh et al. [10] predicted fault-prone modules at an early stage of software testing
using a neural network trained with the Levenberg-Marquardt (LM) algorithm on data
collected from an empirical software engineering data repository. The experiments showed
that the Levenberg-Marquardt approach achieves higher accuracy (88.1%); therefore,
neural-network-based machine learning gives good accuracy.
Martin Shepperd et al. [6] used a new benchmark framework to build and evaluate software
defect predictors. In the evaluation stage, different learning schemes are evaluated
according to the selected method. Next, in the prediction stage, the best learning scheme is
used to build a predictor using all historical data, and the predictor is ultimately used to
predict defects in new data.
Xi Tan et al. [7] (2011) proposed a software defect prediction model based on package-level
clustering in order to improve prediction performance. They used Eclipse 3.0 data with a
90%/10% data split. After applying this method, recall improved from 31.6% to 99.2% and
precision from 73.8% to 91.6%, so the cluster-based prediction model performed better than
class-based prediction in both recall and precision.
Recently, some approaches [5], [13], [32] combine classifiers using meta-data; these
techniques use the predicted classifications of the base classifiers as inputs to the
combining classifier.
Table 2.1: Literature Survey

Mikyoung Park (2014), International Journal of SE [9] — Predicting software faults using
three PROMISE repository datasets (AR3, AR4 and AR5) with EM and X-Means clustering.
X-Means achieved the higher accuracy; the maximum accuracy was 90.48 for AR3, without
attribute reduction, using unsupervised X-Means.

Saiqa Aleem (2015) [3] — Comparative study of machine learning methods for software
defect prediction on publicly available datasets.

Santosh Singh Rathore and Sandeep Kumar (2018) [1], "An Approach for the Prediction of
Number of Software Faults Based on the Dynamic Selection of Learning Techniques" —
The performance of the SFP process is highly dependent on the learning techniques used
and the characteristics of the fault datasets. The work exploits the fact that each learning
technique has a domain where it is more reliable compared to other techniques. Techniques
used: decision tree regression (DTR), multilayer perceptron (MLP) and linear regression
(LR). The presented approach produced average values of 0.50, 0.25, and 67% for AAE,
ARE, and pred analysis, respectively.

Pradeep Singh and Shrish Verma (2018) [5], "Multi-Classifier Model for Software Fault
Prediction" — A combination of Naive Bayes, SVM and Random Forest was used with a
10-fold cross-validation strategy; accuracy and AUC were used for evaluation. The
accuracy of the proposed method was excellent for all the fault datasets (accuracy 86.71,
AUC 0.85). The NASA MDP datasets were used; the maximum accuracy was 99.55 and
the maximum AUC was 0.96.

Le Hoang Son, Nakul Pritam, Manju Khari (2019) [2], "Empirical Study of Software Defect
Prediction: A Systematic Mapping" — Explored each aspect of the defect prediction (DeP)
process, ranging from data collection, data preprocessing and techniques used to build DeP
models, to measures used to evaluate model performance and statistical evaluation schemes
used to mathematically validate the results of a DeP model. Addressed nine research
questions corresponding to different stages of development of a DeP model. A total of 156
studies were selected and the mapping was conducted based on these studies.

Devika S, Lekshmy P L (2020) [4], "Best Suited Machine Learning Techniques for
Software Fault Prediction" — The machine learning (ML) techniques used for code defect
prediction were decision trees, support vector machines (SVMs) and artificial neural
networks (ANNs). The decision tree gave considerable accuracy in predicting software
faults, with an average accuracy of 93.15 for tree-based techniques.
This chapter reviewed the papers that helped us to understand and reach a position where
we could implement different techniques and carry out their comparative analysis.
CHAPTER 3
SYSTEM DESIGN AND METHODOLOGY
We worked on dynamic selection of classifiers for software fault prediction using the
following classifiers:
● SVM
● MLP
● Decision tree
● GaussianNB
● MultinomialNB
The project is developed in Python using Jupyter Notebook. The Jupyter Notebook is an
open-source web application which helps you to create and share documents which contain
live code, equations, visualizations or narrative text. The uses of Jupyter
Notebook include: cleaning of data, transforming the data, numerical simulation, modeling
the statistical data, data visualizations, machine learning, and many more. In our project,
we used 5 datasets mentioned below and applied a dynamic selection approach to predict
software fault.
3.2 Datasets Used
Table 3.1 represents the datasets and the characteristics of the software fault classification
datasets used. The table exhibits the number of modules or instances present in each
dataset used in our project and also the percentage of faulty data present. Through the table
it is quite apparent that the datasets we are dealing with, are quite imbalanced. We have
used an oversampling technique to counteract the class imbalance problem and moreover,
we chose F-measure as one of the performance metrics. In this project, we have selected
five small- to large-scale datasets from the PROMISE repository. The largest dataset is
JM1, which belongs to a real-time predictive ground system project of NASA, with
10,885 instances and 22 features. KC3, KC1, JM1, KC4 and CM1 consist of 22 features
as independent variables including 1 classifier as a dependent variable. These datasets
contain several software metrics such as Line of Code, number of operands and operators,
Design complexity, Program length, effort and time estimator and various other metrics
which are useful to identify whether a software module is faulty or not.
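A quick sketch of how such a dataset can be inspected with pandas; the file name and the label column name ("defects") are assumptions commonly used for PROMISE exports, not the project's actual paths.

# Load one PROMISE dataset and check the class imbalance.
import pandas as pd

df = pd.read_csv("jm1.csv")                          # hypothetical local copy of JM1
print(df.shape)                                      # e.g. (10885, 23): 22 metrics + 1 label
print(df["defects"].value_counts(normalize=True))   # proportion of faulty vs non-faulty modules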
Below is a chart representing the Total number of Modules/Instances along with the
number of faulty instances present in the five datasets.
Fig 3.2 Representation of faulty and non-faulty features
Table 3.2 is representing all the attributes which are commonly present in all the above
mentioned datasets along with their definitions. Below the table, a detailed description
about each of the metrics present.
1 Loc McCabe's line count of code
6 V Halstead "volume"
9 d Halstead "difficulty"
10 i Halstead "intelligence"
11 e Halstead "effort"
1) loc: This metric describes the total number of lines for a given module. It is the sum of
the executable lines, the commented lines of code and the blank lines: a pure, simple count
from open bracket to close bracket that includes every line in between, regardless of
character content.
3) ev(g): Essential Complexity (ev(g)) is the extent to which a flow graph can be
reduced by decomposing all the sub-flow graphs of 'g' that are "D-structured primes".
Such "D-structured primes" are also sometimes referred to as "proper one-entry one-exit
sub-flow graphs" [1]. ev(G) is calculated using,
ev(G) = v(G) – m
where "m" is the number of sub-flow graphs of "g" that are D-structured primes.
6) V: This metric describes the Halstead volume (V) metric of a module; it is the
minimum number of bits required for coding the program.
7) L: This metric describes the Halstead level (L) metric of a module, i.e. the level at
which the program can be understood.
9) d: The difficulty level or error proneness (d) of the program is proportional to the
number of unique operators in the program.
11) e: This metric describes the Halstead effort (e) metric of a module. Effort is the
number of mental discriminations required to implement the program and also the effort
required to read and understand the program.
12) B: This metric describes the Halstead error estimate metric of a module. It is an
estimate of the number of errors in the implementation.
13) lOCode: The number of lines of executable code for a module. This includes all
lines of code that are not fully commented.
16) locCodeAndComment: This metric describes the number of lines which contain
both code & comment in a module.
17) uniq_Op: This metric describes the number of unique operators contained in a
module i.e. the number of distinct operators in a module.
18) uniq_Opnd: This metric describes the number of unique operands contained in a
module. It is a count of unique variables and constants in a module.
19) total_Op: This metric describes the total usage of all the operators.
20) total_Opnd: This metric describes the total usage of all the operands.
Pre-processing refers to the transformations applied to the data before feeding it to the
algorithm. Data preprocessing is a technique used to convert raw data into a clean dataset.
In other words, whenever data is gathered from different sources it is collected in a raw
format which is not feasible for analysis. To achieve better results from the applied model
in machine learning projects, the data has to be in a proper format, and some machine
learning models need information in a specific format.
To ensure high quality data, it’s crucial to preprocess it. To make the process easier, data
preprocessing is divided into four stages: data cleaning, data integration, data reduction,
and data transformation.
Fig 3.3: Data Preprocessing
Our datasets did not contain any missing values. The datasets obtained after applying the
preprocessing techniques were scaled, with all the values lying between 0 and 1 (Min-Max
normalization) or with zero mean and unit variance (Standard Scaler). Any null values in
the data were handled at the same time.
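A minimal sketch of the preprocessing described above. The report does not name the exact oversampling technique, so SMOTE from the imbalanced-learn package is used here only as one plausible choice, and the data is randomly generated.

# Scaling plus oversampling on an imbalanced toy dataset.
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from imblearn.over_sampling import SMOTE
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((500, 21)) * 100
y = (rng.random(500) < 0.1).astype(int)      # ~10% faulty modules, i.e. imbalanced

X_scaled = MinMaxScaler().fit_transform(X)   # squeeze every metric into [0, 1]
X_std = StandardScaler().fit_transform(X)    # zero mean, unit variance

X_res, y_res = SMOTE(random_state=0).fit_resample(X_scaled, y)
print(np.bincount(y), "->", np.bincount(y_res))   # classes balanced after oversampling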
There are different metrics that can be used for performance comparison among different
approaches. The main metric used in our project is the F1-score. The F1-score combines
the precision and recall of a classifier into a single metric by taking their harmonic mean.
It is primarily used to compare the performance of two classifiers.
Suppose that classifier A has a higher recall, and classifier B has higher precision. In this
case, the F1-scores for both the classifiers can be used to determine which one produces
better results.
The F1-score of a classification model is calculated as follows:
F1 = 2 * (P * R) / (P + R)
where P is the precision and R is the recall of the classification model.
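A quick check of this formula against scikit-learn's implementation, on a small made-up set of labels.

# The hand-computed F1 matches sklearn.metrics.f1_score.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(2 * p * r / (p + r), f1_score(y_true, y_pred))   # both 0.75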
3.5 Data Clustering using k-means clustering algorithm
The clustering phase is the first phase of the dynamic selection algorithm. Preprocessing is
done to prepare the data for developing the classification algorithms. The data is split into
two parts: the training part, which is the only part that is used in this stage for clustering,
and a testing part, which is used to evaluate the performance of the trained models. During
this preprocessing step, clustering of training data into a set of predefined number of
clusters is done. In our work one of the most popular clustering techniques, the k-means
algorithm, is used.
After segmenting the training data into a set of clusters, the next step is to develop a
classification model for each cluster. To do that, five classification algorithms are trained
and evaluated on each cluster. The goal is to find the most suitable and expert model for
each cluster. For example, we have three classification algorithms called X, Y, and Z. All
algorithms will be trained and evaluated based on each cluster, if algorithm Y produced the
highest average accuracy over the cross-validation process based on a given cluster, then Y
is assigned to this cluster for future predictions because it showed higher prediction power
than algorithms X and Z. It is important to note that when there is a cluster of only one
class, the classifier works as a one-class classification algorithm: it trains on one class in
the training phase and detects the other class in the testing phase as an outlier. After
finishing this phase, each cluster has its own expert model. Note that the best classifier can
be different from one cluster to another.
In this project, five different classifiers are used to train and evaluate the model. They are
described as follows -
●SVM : A support vector machine (SVM) is a supervised machine learning model that
uses classification algorithms for two-group classification problems
●MLP : MLPClassifier stands for Multi-layer Perceptron classifier which in the name
itself connects to a Neural Network.
●Decision tree : The decision tree classifier creates the classification model by building a
decision tree. Each node in the tree specifies a test on an attribute, each branch descending
from that node corresponds to one of the possible values for that attribute.
27
●GaussianNB : A Gaussian Naive Bayes algorithm is a special type of NB algorithm. It's
specifically used when the features have continuous values. It's also assumed that all the
features are following a gaussian distribution i.e, normal distribution.
●MultinomialNB : The Multinomial Naive Bayes algorithm is a probabilistic learning
method that is mostly used in natural language processing (NLP). The algorithm is based
on Bayes' theorem and predicts the tag of a text, such as an email or a newspaper article.
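The training phase described above can be sketched as follows. This is a hedged illustration, not the project's exact code: the function and variable names are ours, the single-class-cluster case is simplified to a majority-label fallback instead of a one-class classifier, and each cluster is assumed to contain enough samples of both classes for cross-validation.

# Cluster the training data with k-means, then pick the best ("expert") classifier per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.base import clone

CLASSIFIERS = {
    "SVM": SVC(),
    "MLP": MLPClassifier(max_iter=500),
    "DecisionTree": DecisionTreeClassifier(),
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),     # expects non-negative features (e.g. Min-Max scaled)
}

def train_dynamic_selector(X_train, y_train, n_clusters=4):
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit(X_train)
    experts = {}
    for c in range(n_clusters):
        mask = kmeans.labels_ == c
        Xc, yc = X_train[mask], y_train[mask]
        if len(np.unique(yc)) < 2:
            # Single-class cluster: the report uses a one-class classifier here;
            # we simply fall back to the cluster's only label for brevity.
            experts[c] = ("majority", int(yc[0]))
            continue
        best_name, best_score = None, -1.0
        for name, model in CLASSIFIERS.items():
            # Average F1 over cross-validation decides the expert for this cluster.
            score = cross_val_score(clone(model), Xc, yc, cv=3, scoring="f1").mean()
            if score > best_score:
                best_name, best_score = name, score
        expert = clone(CLASSIFIERS[best_name]).fit(Xc, yc)
        experts[c] = ("model", expert)
    return kmeans, experts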
In this phase we are concerned with the testing data generated in the first phase. For each
instance in the testing data, we must specify to which cluster it belongs by calculating the
distance between the instance and each centroid of the clusters. As a result, the instance
belongs to the closest (most similar) cluster, and it is given to the model that was assigned
to that cluster in the training phase for the final prediction.
To determine the similarity, we use the Euclidean distance between the testing instance I
and the centroid C, which can be defined as follows:
dist(I, C) = sqrt( (I1 - C1)^2 + (I2 - C2)^2 + ... + (Id - Cd)^2 )
where d is the number of input features in the dataset. After classifying all instances in the
testing data, we can use the predictions against the actual values of classes to evaluate the
performance of the given hybrid algorithm.
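Continuing the sketch above, the prediction phase can be expressed as assigning each test instance to its nearest centroid and delegating to that cluster's expert model; this relies on the hypothetical train_dynamic_selector output shown earlier.

# Assign each test instance to its closest cluster and use that cluster's expert.
import numpy as np

def predict_dynamic(kmeans, experts, X_test):
    preds = []
    for x in X_test:
        # Squared Euclidean distance to every centroid; argmin = closest cluster.
        dists = np.sum((kmeans.cluster_centers_ - x) ** 2, axis=1)
        c = int(np.argmin(dists))
        kind, expert = experts[c]
        if kind == "majority":
            preds.append(expert)                       # the stored majority label
        else:
            preds.append(int(expert.predict(x.reshape(1, -1))[0]))
    return np.array(preds)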
This chapter described the system design and architecture required for the implementation
of the models. It also provided detailed knowledge of the proposed procedure used in the
project.
CHAPTER – 4
IMPLEMENTATION AND RESULTS
The project is built in python language using jupyter notebook. In this project, we tried to
analyze the results obtained after running different techniques on five different datasets.
For achieving the results, the assistance of numerous Python libraries has been used.
● Python 3.6
● Jupyter notebook
● Scikit learn
● NLTK(natural language toolkit)
● Google colab
● keras
This section provides the result obtained on the implementation of different techniques and
helps us justify our decision of choosing the proposed technique.
● The experiments are carried out on 5 datasets from the NASA repository called
PROMISE. The datasets that we worked on include CM1, JM1, KC1, PC1 and PC2. CM1
has 37 features and all other datasets contain 22 features each, plus one label that is the
predictor. These datasets have been used so that we can easily compare the performance
of our dynamic classification model with other existing models on the same datasets. The
datasets are highly imbalanced, as they contain a very low proportion of faulty modules.
● The most common clustering method, k-means, is used for dividing the data into
subsets. After clustering, the clusters are trained using the five classifiers.
● The test data is then assigned to a particular cluster by measuring the Euclidean
distance (similarity) to the cluster centroids, and the expert model for that cluster is
applied to predict the faultiness of the test module.
Fig 4.1 Representation of features of the dataset used
Checking for missing or null values is an important step in data pre-processing. By looking
at Fig 4.2 we can conclude that our dataset had no missing values and hence no adjustment
for missing values is needed.
Fig 4.3 represents the ratio of the number of faulty and non-faulty modules before
preprocessing.
Fig 4.4 shows the steps needed to handle the data imbalance, i.e. an oversampling approach
was used here to handle the data imbalance problem.
Fig 4.4 Oversampling of data to balance data
We used the elbow method to find the number of clusters (the value of k) and then applied
k-means clustering.
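The elbow method mentioned above can be sketched as follows: the within-cluster inertia is computed for a range of k values and the "elbow" where the curve flattens is chosen. The data below is toy data, not the project datasets.

# Plot inertia versus k to locate the elbow.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = rng.random((500, 21))

inertias = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.show()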
Fig 4.5 K-means clustering using 4 clusters
Fig 4.6 Accuracy matrix for clusters and selection of expert model
Fig 4.7 represents the selection of a learning technique for the prediction of faults. The
presented approach selects the best learning technique for each unseen testing module in
the given testing dataset.
4.4 Results
Decision Tree
Datasets F1 score
PC1 0.870967742
0.81
KC1 0.915470494
PC2 0.628571429
JM1 0.770234213
CM1 0.796310969
MultinomialNB
Datasets F1 score
PC1 0.810466754
KC1 0.715480492
PC2 0.827575426
JM1 0.790234957
CM1 0.856318454
SVM
Datasets F1 score
PC1 0.830967968
KC1 0.815470097
PC2 0.728571265
JM1 0.760234749
CM1 0.696310756
GaussianNB
Datasets F1 score
PC1 0.775679738
KC1 0.715470937
PC2 0.698571638
JM1 0.780234638
CM1 0.816310969
MLP
Datasets F1 score
PC1 0.830967489
KC1 0.755470036
PC2 0.728571568
JM1 0.790234738
CM1 0.836310905
Final Result
Datasets F1 score
PC1 0.779751325
KC1 0.814532158
PC2 0.725683201
JM1 0.747886542
CM1 0.865255353
Table 4.6 represents the final results after dynamic selection of classifiers.
This chapter presented the complete implementation of our project using five classifiers,
and the results obtained are displayed in the form of tables.
CHAPTER 5
CONCLUSION
5.1 Conclusion
In this report we proposed an approach able to dynamically recommend the classifier to use
to predict the bug proneness of a class based on its structural characteristics. We have used
the fact that each learning technique has a domain, where it is more reliable compared to
other techniques. Based on this concept, we presented an approach to dynamically select
learning techniques for the prediction of faults. The presented approach selects the best
learning technique for each unseen testing module in the given testing dataset. For a given
unseen testing module, we determine the subset that has modules similar to the given
testing module. The learning technique that has the best prediction performance for the
determined subset is then selected for predicting faults in the testing module. To build our
approach, we first performed an empirical study aimed at verifying whether five different
classifiers correctly classify different sets of buggy components: as a result, we found that
even though different classifiers achieve similar performance, they often correctly predict
the bug-proneness of different sets of classes. Once the complementarity among the
classifiers was assessed, we compared the performance of our model with that obtained by
(i) the bug prediction models based on each of the five classifiers independently, and
(ii) other ensemble techniques. Key results of our experiment indicate that:
● Our model achieves higher performance than that achieved by the best stand-alone model
over all the software systems in our dataset. On average, the performance increases by up
to 7% in terms of F-measure.
● A technique that analyzes the structural characteristics of classes to decide which
classifier should be used might be more effective than ensemble techniques that combine
the output of different classifiers. Indeed, our model exhibits performance which is on
average 5% better than the Validation and Voting technique in terms of F-measure.
5.2 Future Scope
In the proposed approach, only the classification algorithms have been implemented.
However, in the future, new and different machine learning methods and models can be
tried in this environment. More new and important attributes could be added to the
currently used datasets to expand the effectiveness of the models that we have used. We
will try to use more diverse datasets and some other techniques and methods to simplify
and generalize our findings. Moreover, in the forthcoming time we will emphasise using
datasets from different domains and then evaluating and validating some new models for
the prediction of the number of faulty and non-faulty modules.
REFERENCES
[1] Santosh Singh Rathore and Sandeep Kumar “An Approach for the Prediction of
Number of Software Faults Based on the Dynamic Selection of Learning
Techniques” ,2018
[2] Le Hoang Son , Nakul Pritam , Manju Khari “Empirical Study of Software Defect
Prediction:A Systematic Mapping.” , 2019
[3] Saiqa Aleem, “Comparative machine learning methods for publicly available data
using a software prediction model”, 2015
[4] Devika S, Lekshmy P L “Best Suited Machine Learning Techniques for Software
Fault Prediction” ,March 2020
[5] Pradeep Singh and Shrish Verma “Multi-Classifier Model for Software Fault
Prediction” , 2018
[7] Xi Tan “A Package Based Clustering for enhancing software defect prediction
accuracy” , 2011
[8] Yi Peng, Gang Kou, Guoxun Wang, Wenshuai Wu, Yong Shi, “Ensemble of software
defect predictors: an AHP-based evaluation method”, 2011
[9] M.Park , E.Hong “Software fault prediction model using clustering algorithms
determining the number of clusters automatically”, 2014
[10] Malkit Singh “ Software Defect Prediction Tool based on Neural Network ”,
2013
[12] Qinbao Song, Zihan Jia, Martin Shepperd, Shi Ying and Jin Liu “A General
Software Defect-Proneness Prediction Framework ” , 2010
[13] S. S. Rathore and S. Kumar, “Predicting number of faults in software system
using genetic programming” Procedia Comput. Sci., vol. 62, pp. 303–311, 2015.
[14] E.Erturk and E.A.Sezer , “A comparison of some soft computing methods for
software fault prediction "Expert systems with applications , 2014.
[15] A.Kaur and R.Malhotra , “Application of random forest in predicting fault prone
classes” pp.37-43 , 2008.
[16] Sarwesh S and S. A., “A review of ensemble technique for improving majority voting
for classifier”, International Journal of Advanced Research in Computer Science and
Software Engineering, vol. 1, pp. 177–180, Jan 2013.
[17] Martin S., “Researcher bias: the use of machine learning in software defect
prediction”, IEEE Transactions on Software Engineering, vol. 40, pp. 603–616, June
2014.
[21] C. Catal, “Performance evaluation metrics for software fault prediction studies,”
Acta Polytechnica Hungarica, vol. 9, no. 4, pp. 193–206, 2012.
[22] M. Singh and D. S. Salaria, “Approaches for software fault prediction,” International
Journal of Computer Science and Technology (IJCST), vol. 3, no. 4, pp. 419–421, 2012.
[23] D. Kaur, A. Kaur, S. Gulati, and M. Aggarwal, “A clustering algorithm for software
fault prediction,” in Computer and Communication Technology (ICCCT), 2010
International Conference on. IEEE, 2010, pp. 603–607.
[24] A. Kaur and I. Kaur, “An empirical evaluation of classification algorithms for fault
prediction in open source projects,” Journal of King Saud University - Computer and
Information Sciences, Apr. 2016.
[25] W. Liu, S. Liu, Q. Gu, J. Chen, X. Chen, and D. Chen, “Empirical studies of a two-
stage data preprocessing approach for software fault prediction,” IEEE Transactions on
Reliability, vol. 65, no. 1, pp. 38–53, 2016.
[26] R. Malhotra, “A systematic review of machine learning techniques for software fault
prediction,” Applied Soft Computing, vol. 27, pp. 504 – 518, 2015.
[27] G. Abaei and A. Selamat, “A survey on software fault detection based on different
prediction approaches,” Vietnam Journal of Computer Science, vol. 1, no. 2, pp. 79–95,
May 2014.
[29] E. Erturk and E. A. Sezer, “A comparison of some soft computing methods for
software fault prediction,” Expert Systems with Applications, vol. 42, no. 4, pp. 1872 –
1879, 2015.
[30] S. Wang and X. Yao, “Using class imbalance learning for software defect
prediction”, IEEE Transactions on Reliability, vol. 62, no. 2, pp. 434–443, 2013.