0% found this document useful (0 votes)

12 views8 pages

An Empirical Study On Application of Word Embedding Techniques For Prediction of Software Defect Severity Level

....

Uploaded by

himanshu Chaudhary

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views8 pages

An Empirical Study On Application of Word Embedding Techniques For Prediction of Software Defect Severity Level

....

Uploaded by

himanshu Chaudhary

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

Proceedings of the 16th Conference on Computer DOI: 10.

15439/2021F100
Science and Intelligence Systems pp. 477±484 ISSN 2300-5963 ACSIS, Vol. 25

An Empirical Study on Application of Word

Embedding Techniques for Prediction of Software
Defect Severity Level

Lov Kumar Mukesh Kumar Lalita Bhanu Murthy

Dept. CSIS Dept. CS&E Dept. CSIS
BITS Pilani Hyderabad Campus NIT Patna BITS Pilani Hyderabad Campus
[email protected] [email protected] [email protected]

Prof. Sanjay Misra Vipul kocher Srinivas Padmanabhuni

Østfold University College, Halden, Norway Testaing.Com Testaing.Com
[email protected] [email protected] [email protected]

Abstract—Software defect severity level helps to indicate the that the median time to repair bugs for ArgoUML software
impact of bugs on the execution of the software and how rapidly is 190 days, and PostgreSQL is 200 days. They have also
these bugs need to be addressed by the team. The working team observed that more than 50% of all fixed bugs in Mozilla took
is regularly analyzing the bugs report and prioritizing the defects. more than 29 days [5][6]. Therefore, it becomes essential to
The manual prioritization of these defects based on the experience reduce the time and cost of the bug fixing process and also
may be an inaccurate prediction of the severity that will delay in
fixing of critical bugs. It is compulsory to automate the process
improve the quality of the software system. Defect severity
of assigning an appropriate level of severity based on bug report level prediction has been emerged as a novel research field for
results with an objective to fix critical bugs without any delay. the effective allocation of resources and plans to fix the defects
This work aims to develop defect severity level prediction models based on their severity level [3]. These models help to find the
that have the ability to assign severity level of defects based severity level of defects that can be used to find the effect
on bugs report. In this work, seven different word embedding of defects on the software. Defect severity level prediction
techniques are applied to defect description to represent the models are designed based on the features extracted from the
word, not just as a number but as a vector in n-dimensional defect description. Recent research has used different data
space in order to reduce the number of features. Since the mining techniques to extract numerical features from defect
predictive ability of the developed models depends on the vectors descriptions for the severity level of defect prediction using
extracted from text as they are used as an input to the defect
severity level prediction models. Further, three feature selection
machine learning techniques. However, there are three main
techniques have been applied to find the right set of relevant technical challenges in building defect severity level prediction
vectors. The effectiveness of these word embedding techniques models for predicting the proper severity level of the defects
and different sets of vectors are evaluated using eleven different using defect description.
classification techniques with Synthetic Minority Oversampling
Technique (SMOTE) to overcome the class imbalance problem. • Word Embedding: The defect severity level pre-
The experimental results show that the word embedding, feature diction models are often developed based on the
selection techniques and SMOTE have the ability to predict the unstructured form of the description of defects. The
severity level of the defect in a software. unstructured nature of data poses intrinsic challenges.
Keywords—Defect Severity Level Prediction, Data Imbalance If some sort of numerical features can be assigned
Methods, Feature Selection, Classification Techniques, Word Em- using text mining techniques that can use as an input
bedding. for model development, then it can be utilized for pre-
diction of future severity level of defects. In this work,
seven different word embedding techniques such as
I. I NTRODUCTION Continuous Bag of Words Model (CBOW)1 , Skip-
gram (SKG)1, Global Vectors for Word Representa-
PPLYING data mining techniques on software reposi-
A tories such as software fault prediction, maintainability
prediction, version control systems, source code analysis, bug
tion (GLOVE)2 , Google news word to vector(w2v)3 ,
fasttext (FST)4 , Bidirectional Encoder Representations
from Transformers (BERT) 5 , and generative pre-
archives, etc. is an emerging field that has received significant
research interest in recent times. Researchers have proposed 1 https://towardsdatascience.com/nlp-101-word2vec-skip-gram-and-cbow-
many tools and methods using machine learning techniques 93512ee24314
to assist a practitioner in decision making and automating 2 https://nlp.stanford.edu/projects/glove/
software engineering tasks [1][2][3][4]. However, Forrest et 3 https://code.google.com/archive/p/word2vec/
al. observed that the finding and fixing defects in software 4 https://fasttext.cc/

is a time-consuming and expensive process. They have found 5 https://en.wikipedia.org/wiki/BERT_(language_model)

IEEE Catalog Number: CFP2185N-ART ©2021, PTI 477

Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on November 21,2024 at 07:22:19 UTC from IEEE Xplore. Restrictions apply.
478 PROCEEDINGS OF THE FEDCSIS. ONLINE, 2021

training model (GPT) 6 have been applied on bugs defects using defect description. Rajni Jindal et al. also done
reports to represent the word not just as a number similar work to extract features from defect descriptions using
but as a vector in n-dimensional space. The above Term Frequency and Inverse Document Frequency (TFIDF) to
techniques provide similar representation for similar extract features [10]. They have used the Radial Basis function
words and also provide a small number of features as network for developing defect severities prediction models.
compared to the size of the vocabulary. We have also Finally, they found that the proposed methods have a high
removed stop-words, spaces, and bad symbols before predictive ability to predict the severity levels of the defects.
applying these techniques. The predictive ability of Sari and Siahaan have also followed a similar method for
these techniques is compared with frequently used developing models to predict the severity level of the defects
term frequency, inverse document frequency (TFIDF). based on defect description [11]. They have applied InfoGain
gain on extracted features from text to find relevant features
• High-Dimensional Features Data: The predictive
for model development. Finally, they have used a support
ability of defect severity level prediction models also
vector machine with an objective to develop defect severities
depends on the features that are considered as the
prediction models.
input of the models. Researchers have concluded that
the data having high dimension features consisting In 2011, David Lo and the team analyzed the performance
of redundant and irrelevant features negatively affect of models at three different levels of severity: low, medium,
the performance of the defect severity level prediction and high. It was found that an artificial neural network (ANN)
models [2][3][1]. The presence of a huge number of was among the best methods. However, the predictions were
the feature in the case of text analysis poses intrinsic less accurate for high severity faults. In 2012, Sharma et
challenges to develop models for predicting the proper al. [12] proposed a priority prediction method using SVM,
severity level of the defects using defect description. Naive Bayes, KNN, and Neural Network. This predicted the
In this study, we have used three different feature priority of the newly arrived bug reports, and the accuracy
selection techniques to remove irrelevant features and of almost all techniques (except NB) was less than 70%
select the right sets of the relevant features. for Eclipse and Open Office projects. In 2014, Gayathri and
• Imbalanced Data: The last challenge in building Sudha developed an enhanced Multilayer Perceptron Neural
defect severity level prediction model is that the Network [13]. Comparative analysis of modeling of defect
data used for building the models are imbalanced. proneness predictions using a dataset of different metrics from
A dataset is defined as a balanced dataset when the NASA MDP (Metrics Data Program) was performed. In 2017,
samples of the dependent variable or output variable Gupta and Saxena developed a model for the prediction of
are approximately evenly distributed across different the existence of bugs in class [14]. The model developed was
values of dependent variable [7][8]. In this study, the the object-oriented Software Bug Prediction System (SBPS),
considered datasets are observed to be not possessing and it was trained using the Promise Software Engineering
have an equal number of the severity level of the de- Repository. The Logistic Regression Classifier provided the
fects. Hence, it has been proposed to apply Synthetic best accuracy. The average accuracy of the model was found
Minority Oversampling Technique (SMOTE) on each out to be 76.27%.
dataset in order to get balance data. In the context of software severity level prediction, most
of the researchers have used count vectorization and TFIDF to
Prioritization of defect based on the severity level com-
extract numerical features from bugs report. The concepts of
puted using bugs reports? is a problem encountered by
these techniques are based on bag-of-words, therefore it has not
software practitioners and the study presented in this work
capability to capture the position of vocabulary in sentences.
is motivated by the need to develop defect severity level
These methods do not play well with many machine learning
prediction models using extracted features with the help of
models because of high-dimensional features. While in this
word embedding techniques from bugs reports. This study aims
work, we are attempting to use seven different word embedding
to find the best word embedding technique by comparing the
techniques that represent the word not just as a number but
predictive ability of the models developed using seven different
as a vector in n-dimensional space. The above techniques
types of word embedding techniques. It further investigates
provide similar representation for similar words. The effec-
the application of feature selection techniques, data sampling
tiveness of these word embedding techniques are evaluated
techniques, and eleven different classification techniques for
using eleven different classification techniques with Synthetic
prediction of severity level of defects.
Minority Oversampling Technique (SMOTE) to overcome the
class imbalance problem.
II. R ELATED W ORK
Software researchers have used different methods in the III. STUDY DESIGN
past to extract features from bugs report and used these features
as an input for developing models. Menzies and Marcus have This section presents the details regarding various design
used various text mining concepts to extract features from the setting used for this research.
bugs report [9]. They proposed an automated method called
SEVERIS and validated these models using the defects report A. Experimental Dataset
of NASA’s Project and Issue Tracking System (PITS). These
proposed models help to predict proper severity level of the In this study, six different software datasets have been
used, which are referred to as CDT, JDT, PDE, Platform,
6 https://openai.com/blog/language-unsupervised/ Bugzilla, and Thunderbird to validate our proposed models.

Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on November 21,2024 at 07:22:19 UTC from IEEE Xplore. Restrictions apply.
LOV KUMAR ET AL.: AN EMPIRICAL STUDY ON APPLICATION OF WORD EMBEDDING TECHNIQUES FOR PREDICTION OF SOFTWARE DEFECT 479

These datasets have been collected from msr2013-bug_dataset- C. Word Embedding:

master 7 . Mining Software Repositories (MSR) conducted
Challenge every year by providing software-related data and The software bugs report consist of the defect ID and
motivate participate to apply data mining techniques for ﬁnding their corresponding defect description. In this work, seven
important patterns. The datasets are the collection of bugs different word embedding techniques including Continuous
reports wherein each bugs report contains the defect ID, defect Bag of Words Model (CBOW), Skip-gram(SKG), Global Vec-
description, and severity level of the defects. Table I shows the tors for Word Representation (GLOVE), Google news word
details of the dataset used for this study. As shown in Table I, to vector(w2v), fasttext (FST), Bidirectional Encoder Rep-
the CDT software bugs report consists of 2220 normal defects, resentations from Transformers (BERT), and generative pre-
146 minor defects, 288 major defects, 42 trivial defects, 58 training model (GPT) have been applied on defect description
blocker defects, 106 critical defects. extracted from bugs reports. We have applied these techniques
to represent the word not just as a number but as a vector in n-
dimensional space. These vectors are used as input to develop
TABLE I: Experimental Data Set Description models for assigning appropriate severity levels to the defects
present in the bugs reports. We have also removed stopwords,
ker

bad symbols, and spaces before applying word embedding. We

l
ia l
r

ica
jo r
no
rm

it
Ma
Mi

have also compared the predictive ability of these techniques

Blo
Tr
No

CDT 2220 146 288 42 58 106 with term frequency, inverse document frequency(TFIDF).
JDT 1906 261 430 104 50 106
PDE 2380 91 295 52 27 70
Platform 1485 215 715 145 77 215
Bugzilla 1342 598 352 302 167 114 D. Feature Selection Techniques
Thunderbird 1100 387 655 91 25 658
After successfully finding a vector of defect description,
we have used these vectors as an input of the models. Since
we are using these vector of n-dimension as an input of the
models, so, the performance of the models also depends upon
B. Training of Models from Imbalanced Data Set:
the selection of important features vectors. In this study, we
After analyzing experimental data as shown in Table I, it is have used three different features selection techniques, i.e.,
quite evident that the considered datasets suffering from class significant sets of features using rank-sum test, uncorrelated
imbalance problem, i.e., the number of samples in each class, sets of features using cross-correlation analysis, and principal
are not same. Therefore, balancing of data is required before component analysis to remove irrelevant features and select
applying any classification techniques [15]. This approach help right sets of the relevant feature. We have also compared the
to improve the predictive ability of the developed software predictive ability of the models developed using selected sets
defect severity level prediction models [16][17]. In this study, of features with original features.
we have performed Synthetic Minority Oversampling Tech-
nique (SMOTE) one each dataset in order to get balance data. E. Classification Technique:
SMOTE technique is identified as a very popular technique by
different researches that helps to improve the predictive ability The predictive ability of different word-emending tech-
of the models. niques, feature selection techniques and SMOTE are evalu-
ated using eleven most frequently used classifiers such as
7 http://2013.msrconf.org/ multinomial naive bayes (MNB), bernoulli naive bayes (BNB),

TFIDF CBOW SKG GLOVE

W2V FST BERT GPT Data Set

SMOTE
BNB MNB Signiﬁcant Features

LOGR GNB

SVML DST Correlation Analysis

Performance
Analysis Data Set
SVMR SVMP

NNSGD NNLBFG Principal component

analysis
NNADAM

Fig. 1: Framework of proposed work

Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on November 21,2024 at 07:22:19 UTC from IEEE Xplore. Restrictions apply.
480 PROCEEDINGS OF THE FEDCSIS. ONLINE, 2021

gaussian naive bayes (GNB), Logistic Regression (LOGR), de- data on different sets of features. The results for other cases
cision tree (DST), SVM with linear kernel (SVML), SVM with are of similar type. Looking at information present in Table II,
polynomial kernel (SVMP), SVM with RBF kernel (SVMR), we can be inferred that:
Neural network with LBFG (NNLBFG), Neural network with
SGD (NNSGD), and Neural network with ADAM (NNADAM) • The high value of AUC confirm that the developed
in software engineering domain [1][18][19]. models have the ability to predict proper severity level
of the defects using defect description.
IV. R ESEARCH M ETHODOLOGY • The models developed using a support vector machine
with polynomial kernel have better predictive ability
In this work, we have applied seven different word embed-
as compared to other classifiers.
ding methods to extract features from bugs reports and consid-
ered these features as an input to develop models for predicting • The models trained using neural network with ADAM
proper severity level of the defects using defect description. (NNADAM) training algorithm have better predictive
These models are trained using eleven different classifiers ability as compared LBFG, and SGD traing algo-
and validated using 5-fold cross-validation. In this study, we rithms.
have also considered SMOTE for handling imbalanced data
• The models trained by considering balanced data using
and three feature selection techniques for finding the best
smote as an input have better predictive ability as
combination of relevant features. The detailed overview of
compared to original data.
our proposed work is giving in Figure 1. The information
presented in Figure 1 suggested that the proposed framework
is a multi-step process consisting of features extraction from VI. C OMPARATIVE A NALYSIS
text data using word embedding, handling class imbalance In this section, we analyze and compare the performance of
problem using SMOTE, removal of irrelevant features, and models developed using different word-embedding, classifiers,
finally development of prediction models using eleven different sampling techniques, and sets of features. In this paper, we
classification techniques. have considered Descriptive statistics, box-plot, and Significant
First, bugs report for a software project is collected from tests to compare the developed models for severity level
the Bugzilla bug tracking system containing the unique id prediction.
of defects, description of the defect, and associated severity
level of the defects. Next, we have used seven different word A. Word Embedding
embedding to find the numerical representation of defect de- The predictive ability of developed defect severity level
scription. Next, we have used SMOTE techniques to handle the prediction models using different word embedding are com-
class imbalance problem because the considered dataset is not puted with the help of AUC, F-Measure, and accuracy. They
evenly distributed. The performance of models trained using are compared using Descriptive statistics, box-plot, and Sig-
balanced data is also compared with models developed using nificant tests. In this study, seven different word embedding
original data. After balancing the data, three different features techniques such as Continuous Bag of Words Model (CBOW),
selection techniques such as significant features using rank- Skip-gram (SKG), Global Vectors for Word Representation
sum test, cross-correlation analysis, and principal component (GLOVE), Google news word to vector(w2v), fasttext (FST),
analysis are used to remove irrelevant features and select the BERT, and generative pre-training model (GPT) have been
right sets of reverent features. Finally, eleven different clas- used to compute the numerical vector of defects reports.
sifiers are used to develop models predicting proper severity Comparison of Word Embedding: box-plots: Figure 2
level of the defects using defect description. The performance provides the performance value, i.e., AUC, F-Measure, and
of these developed models is computed and compared using accuracy of different word embedding in terms of Box-Plot
AUC, F-Measure, and accuracy performance values. diagrams and descriptive statistics. It is clear from Figure 2 that
the models developed by considered word vector computed
V. E MPIRICAL R ESULTS AND A NALYSIS using GLOVE and w2v have better predictive ability to predict
the appropriate severity level to the defects present in the bugs
In this work, we have applied eight different word em-
reports as compared to other models. The models developed
bedding, one sampling technique, three feature selection tech-
using w2v achieve 0.70 average AUC value, 0.99 max auc, and
niques, and eleven classification techniques for developing
0.87 Q3 AUC i.e., 25% models developed using w2v have 0.87
models to predict proper severity level of the defects using
AUC value. However, the models developed using SKG have
defect description. Each word-embedding is applied on the
low predictive ability as compared to other techniques.
considered datasets as mentioned in Table I. The effectiveness
of these word-emending techniques is evaluated using 11 Comparison of Word Embedding: Significant Test: In
different most frequently used classifiers. Therefore, a total this study, the Wilcoxon signed-rank test is also applied on the
of 4224 (6 datasets * 8 word-embedding*(1 Original Data+ AUC, F-Measure, and accuracy for statistically comparing the
Smote data)*(3 Feature Selection+ 1 All Features)* 11 dif- ability to predict the appropriate severity level of developed
ferent classification technique) distinct prediction models are models using different word embedding. The objective of
built in the study. The predictive ability of these trained models this testing is to find whether the models developed using
are evaluated in terms of AUC, F-Measure, and accuracy different word embedding have a significant improvement or
performance values. These models are validated with the help not. This test uses p-value to accept or reject the considered
of 5-fold cross-validation methods. Table II reports the results null hypothesis. The considered null hypothesis for this paper
achieved by different classifiers on original data and sampled is "the defect severity level prediction models developed by

TABLE II: Performance Value: Classiﬁcation Techniques

Accuracy
OD
MNB BNB GNB LOGR DST SVML SVMP SVMR NNLBFG NNSGD NNADAM
TFIDF 78.11 75.56 22.78 78.22 62.44 78.22 77.78 78.00 72.67 78.11 73.89
CBOW 78.11 78.11 73.44 78.11 60.22 78.11 70.89 78.11 72.78 78.11 78.00
SKG 78.11 78.00 60.11 78.00 60.22 78.11 74.56 78.11 74.56 78.11 77.67
GLOVE 78.11 78.00 51.00 77.67 59.56 78.00 71.11 78.00 70.22 78.11 73.89
W2V 78.11 77.67 55.11 78.11 60.11 77.89 70.44 78.22 68.56 78.11 75.33
FST 78.11 77.56 24.22 78.00 58.56 78.11 75.11 78.11 72.89 78.11 78.11
BERT 74.22 76.89 17.22 78.11 59.67 78.22 72.11 78.22 78.11 78.11 78.11
GPT 54.67 73.89 7.33 78.11 72.33 78.11 78.00 78.11 78.11 78.11 78.11
SMOTE
TFIDF 60.18 62.60 59.50 64.35 79.53 67.47 86.95 88.85 85.78 83.32 91.9
CBOW 42.94 16.18 58.21 54.38 76.17 48.53 91.92 95.92 90.46 92.06 95.71
SKG 24.55 16.21 25.92 44.62 74.39 42.32 75.95 68.59 72.14 58.24 74.71
GLOVE 45.89 16.43 56.77 76.52 78.60 78.29 95.79 98.28 92.09 95.60 95.65
W2V 46.52 16.22 59.43 79.95 77.56 80.43 95.07 98.18 91.88 95.93 94.08
FST 27.45 15.97 34.14 37.29 77.35 33.71 86.66 82.33 67.96 84.94 87.25
BERT 24.13 15.97 31.22 77.99 78.90 81.07 94.72 84.63 15.97 15.97 15.97
GPT 22.49 19.05 27.93 44.52 68.06 46.57 61.16 50.58 15.97 15.97 15.97
AUC
OD
TFIDF 0.51 0.51 0.59 0.51 0.52 0.51 0.51 0.50 0.53 0.50 0.54
CBOW 0.50 0.50 0.53 0.50 0.55 0.50 0.56 0.50 0.52 0.50 0.50
SKG 0.50 0.50 0.58 0.51 0.57 0.50 0.56 0.50 0.54 0.50 0.51
GLOVE 0.50 0.51 0.64 0.51 0.56 0.50 0.54 0.50 0.53 0.50 0.55
W2V 0.50 0.50 0.63 0.53 0.54 0.51 0.57 0.50 0.57 0.50 0.54
FST 0.50 0.51 0.54 0.51 0.55 0.50 0.52 0.50 0.51 0.50 0.50
BERT 0.53 0.52 0.57 0.51 0.54 0.50 0.53 0.50 0.50 0.50 0.50
GPT 0.57 0.52 0.54 0.50 0.52 0.50 0.50 0.50 0.50 0.50 0.50
SMOTE
TFIDF 0.80 0.82 0.80 0.80 0.87 0.82 0.90 0.93 0.92 0.91 0.95
CBOW 0.67 0.51 0.76 0.76 0.85 0.72 0.95 0.98 0.94 0.96 0.98
SKG 0.56 0.50 0.55 0.71 0.84 0.70 0.87 0.83 0.87 0.78 0.88
GLOVE 0.72 0.50 0.78 0.88 0.87 0.89 0.97 0.99 0.96 0.97 0.97
W2V 0.73 0.51 0.79 0.90 0.86 0.91 0.97 0.99 0.96 0.97 0.96
FST 0.57 0.50 0.60 0.64 0.87 0.65 0.93 0.90 0.82 0.92 0.93
BERT 0.56 0.50 0.58 0.88 0.87 0.90 0.97 0.91 0.50 0.50 0.50
GPT 0.55 0.54 0.59 0.70 0.81 0.73 0.78 0.72 0.50 0.50 0.50

100 1 1

90 0.9 0.95

80 0.8 0.9

0.7 0.85
70
0.8
F-Measure

0.6
Accuracy

60
AUC

0.5 0.75
50
0.4 0.7
40
0.65
0.3
30
0.6
0.2
20
0.55
0.1
10
0.5
0
G

2V
F

T
RT
W

2V
F

T
T
OW

T
G

2V
F

T
RT
OW

GP
FS
OV
SK

GP
R
FS
OV
SK
ID

O
FS
OV
SK

BE
W

TF
BE

CB
TF

GL
GL

Fig. 2: Performance Box-Plot Diagram: Performance of Different Word Embedding

considering word vector using a different word embedding as signed-rank test on different pairs of word embedding are
an input are signiﬁcantly same". The considered null hypoth- depicted in Table III. For the purpose of simplicity, we have
esis is only accepted if the obtained p-values using Wilcoxon used only two number for representing results, i.e., 0 means
signed-rank test is greater than 0.05. The results of Wilcoxon hypothesis accepted (models are signiﬁcantly same) and 1

Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on November 21,2024 at 07:22:19 UTC from IEEE Xplore. Restrictions apply.
482 PROCEEDINGS OF THE FEDCSIS. ONLINE, 2021

means hypothesis rejected (models are significantly different). of the models trained using sampled data and original data
According to the information present in Table III, the models is less than 0.05, i.e., our considered hypothesis is rejected.
developed by considering word vector using different word Hence, the models trained using sampled data have significant
embedding as an input are significantly different for most of improvement in predicting defect severity levels.
the cases.
TABLE III: Significant tests: Different Word Embedding C. Feature Selection
TFIDF CBOW SKG GLOVE W2V FST BERT GPT In this study, we have used three different features selection
TFIDF 0 1 1 0 0 1 1 1 techniques, i.e., significant sets of features using rank-sum test,
CBOW 1 0 1 1 1 1 1 1
SKG 1 1 0 1 1 0 0 0 uncorrelated sets of features using cross-correlation analysis,
GLOVE 0 1 1 0 0 1 1 1 and principal component analysis to remove irrelevant features
W2V 0 1 1 0 0 1 1 1
FST 1 1 0 1 1 0 0 1
and select right sets of the relevant feature. We have also
BERT 1 1 0 1 1 0 0 1 validated the performance of the models developed using
GPT 1 1 0 1 1 1 1 0 selected sets of features with all features using AUC, F-
Measure, and accuracy performance values and compared with
the help of Descriptive statistics, boxplot, and Significant tests.
B. SMOTE Comparison of Different Sets of Features: box-plots: Figure
3 provides the performance value i.e., AUC, F-Measure, and
The predictive ability of developed defect severity level accuracy of the models trained using selected sets of features
prediction models using original data and smote sampled data and all features. We can see that the models developed using
are computed using AUC, F-Measure, and accuracy perfor- CCRA and AF have slightly better performance as compared to
mance values and compared using Descriptive statistics, box- other techniques. The models developed using CCRA achieve
plot, and Significant tests. 0.65 average AUC value, 0.98 max auc, and 0.78 Q3 AUC i.e.,
Comparison of Original Data and SMOTE: box-plots: Fig- 25% models developed using CCRA have 0.78 AUC value. We
ure 4 provides the performance value, i.e., AUC, F-Measure, can also observed that the models developed using AF have
and accuracy of the models developed using original data similar performance, but the number of features is more as
and smote sampled data in terms of Box-Plot diagrams and compared to CCRA features sets.
descriptive statistics. The information in Figure 4 demonstrate
that the SMOTE data sampling technique plays an important Comparison of Different Sets of Features: Significant
role in improving the predictive ability of the defect severity Test: In this study, the Wilcoxon signed-rank test is also
level prediction models. The models developed using SMOTE applied to the AUC, F-Measure, and accuracy for statistically
sampled data achieve 0.75 average AUC value, 0.99 max auc, comparing the ability to predict the appropriate severity level
and 0.86 Q3 AUC, i.e., 25% models developed using SMOTE of developed models by considering different sets of features
sampled data have 0.86 AUC value. an input. The objective of this testing is to find whether the
performance of the models depends on input sets of features or
100 1 1
not. The considered null hypothesis for this paper is "the defect
severity level prediction models developed by considering
80 0.8 0.9 different sets of features as an input are significantly same".
The considered null hypothesis is only accepted if the obtained
F-Measure
Accuracy

60 0.6 0.8
AUC

p-values using Wilcoxon signed-rank test is greater than 0.05.

0.7
40 0.4
The results of Wilcoxon signed-rank test are depicted in Table
20 0.2 0.6 IV. We can see that the models developed using all features,
0.5
signiﬁcant sets of features, and uncorrelated sets of features
0
are signiﬁcantly same.
OD

E
OD

E
OT
OT

OT
SM
SM

Fig. 4: Performance Box-Plot Diagram: Performance of Orig- TABLE IV: Significant tests: Different Sets of Features
inal Data and SMOTE AF SIGF CCRA PCA
AF 0 0 0 1
SIGF 0 0 0 1
CCRA 0 0 0 1
Comparison of Original data SMOTE: Significant Test: PCA 1 1 1 0
In this study, the Wilcoxon signed-rank test is also applied on
the AUC, F-Measure, and accuracy for statistically comparing
the ability to predict the appropriate severity level of developed D. Classification Techniques
models using original data and SMOTE sampled data. The
objective of this testing is to find whether the models developed The predictive ability of developed defect severity level
using sampled data have a significant improvement or not. prediction models using different classification techniques are
The considered null hypothesis for this paper is "the defect computed using AUC, F-Measure, and accuracy performance
severity level prediction models trained using sampled have values and compared with the help of Descriptive statistics,
not a significant improvement." The considered null hypothesis box-plot, and Significant tests. In this work, we have used
is only accepted if the obtained p-values using the Wilcoxon eleven different classification techniques such as multinomial
signed-rank test is greater than 0.05. In this work, the p-value naive bayes (MNB), bernoulli naive bayes (BNB), gaussian

TABLE V: Signiﬁcant tests: Classiﬁcation Techniques

MNB BNB GNB LOGR DST SVML SVMP SVMR NNLBFG NNSGD NNADAM
MNB 0 1 1 1 1 1 1 1 1 0 1
BNB 1 0 1 1 1 1 1 1 1 0 1
GNB 1 1 0 0 1 0 1 1 0 1 1
LOGR 1 1 0 0 1 0 1 1 0 1 0
DST 1 1 1 1 0 1 1 0 1 1 1
SVML 1 1 0 0 1 0 1 1 0 1 0
SVMP 1 1 1 1 1 1 0 1 1 1 1
SVMR 1 1 1 1 0 1 1 0 1 1 1
NNLBFG 1 1 0 0 1 0 1 1 0 1 0
NNSGD 0 0 1 1 1 1 1 1 1 0 1
NNADAM 1 1 1 0 1 0 1 1 0 1 0

naive bayes (GNB), Logistic Regression (LOGR), decision tree is "the defect severity level prediction models trained using
(DST), SVM with linear kernel (SVML), SVM with polyno- different classifiers are significantly same". The considered
mial kernel (SVMP), SVM with RBF kernel (SVMR), Neural null hypothesis is only accepted if the obtained p-values using
network with LBFG (NNLBFG), Neural network with SGD Wilcoxon signed-rank test is greater than 0.05. The results
(NNSGD), and Neural network with ADAM (NNADAM) with of Wilcoxon signed-rank test on different pairs of classifiers
5-fold cross-validation to train defect severity level prediction are depicted in Table V. For the purpose of simplicity, we
models. have used only two number for representing results, i.e., 0
Comparison of Classification Techniques: Descriptive means hypothesis accepted (models are significantly same)
Statistics and box-plots: Figure 5 provides the performance and 1 means hypothesis rejected (models are significantly
value, i.e., AUC, F-Measure, and accuracy of different classi- different). While comparing the values present in Table V, we
fiers in terms of Box-Plot diagrams and descriptive statistics. can observed that the models trained using different classifiers
It is clear from Figure 2 that the models trained using SVM are significantly different for most of the cases.
with polynomial kernel have better predictive ability to predict
the appropriate severity level to the defects present in the bugs
reports as compared to other models. The models developed VII. C ONCLUSION
using SVM with polynomial kernel achieve 0.73 average AUC In this paper, we build a model to predict proper severity
value, 0.98 max auc, and 0.89 Q3 AUC i.e., 25% models level of the defects using defect description. Different from
developed using SVM with polynomial kernel have 0.89 AUC existed researches, this work focus on seven different word
value. However, the models developed using bernoulli naive embedding methods to represent the word not just as a number
bayes (BNB) have low predictive ability as compared to other but as a vector in n-dimensional space. The predictive ability
techniques. of these methods are evaluated using three sets of features
selected using feature selection techniques, and eleven different
Comparison of Classification Techniques: Significant classifiers with 5-fold cross-validation. We have also used
Test: In this study, the Wilcoxon signed-rank test is also SMOTE techniques in order to handle the class imbalance
applied to the AUC, F-Measure, and accuracy for statistically problem. Finally, the predictive ability of these models are
comparing the ability to predict the appropriate severity level computed and compared using AUC, F-Measure, and accuracy
of developed models using different classifiers. The objective performance values. Our main conclusions are the following:
of this testing is to find whether the models trained using
different classification techniques have a significant improve- • The high value of AUC confirms that the developed
ment or not. The considered null hypothesis for this paper models using word embedding on balanced data have

100 1 1

90 0.9 0.95

80 0.8 0.9

0.7 0.85
70
0.8
F-Measure

0.6
Accuracy

60
AUC

0.75
50 0.5
0.7
40 0.4
0.65
30 0.3
0.6
20 0.2
0.55
10 0.1
0.5
0
AF

A
F

A
AF

RA
F

RA
GF

PC
G

CC
SI

Fig. 3: Performance Box-Plot Diagram: Performance of Different Sets of Features

Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on November 21,2024 at 07:22:19 UTC from IEEE Xplore. Restrictions apply.
484 PROCEEDINGS OF THE FEDCSIS. ONLINE, 2021

100 1 1

90 0.9 0.95

80 0.8 0.9

70 0.7 0.85
Accuracy

0.8

F-Measure
60 0.6

AUC
0.75
50 0.5
0.7
40 0.4
0.65
30 0.3
0.6
20 0.2
0.55
10 0.1
0.5
0
ML
B

MP
T
B

FG
B

AM
D

ML
B

MP
T
DS

FG
B

AM
D
BN

GN
MN

ML
SG

MP
T
B

FG
B

AM
D

DS
BN

GN
MN

SG
DS
LB
SV

BN
SV

GN
LO

SG
SV

LB
SV

SV
LO

SV
LB
SV

AD
NN

SV
LO

AD
NN

NN
NN
NN

NN
NN

NN
NN
Fig. 5: Performance Box-Plot Diagram: Performance of Different Classiﬁcation Techniques

the ability to predict severity levels of the defects [6] P. Bhattacharya and I. Neamtiu, “Bug-fix time prediction models: can
present based on defect descriptions. we do better?” in Proceedings of the 8th Working Conference on Mining
Software Repositories, 2011, pp. 207–210.
• The models developed by considered word vector [7] A. More and D. P. Rana, “Review of random forest classification tech-
computed using GLOVE and w2v have a better pre- niques to resolve data imbalance,” in 2017 1st International Conference
dictive ability as compared to other models. on Intelligent Systems and Information Management (ICISIM). IEEE,
2017, pp. 72–78.
• The defected severity levels prediction models devel- [8] N. Junsomboon and T. Phienthrakul, “Combining over-sampling and
oped using different word embedding methods are under-sampling techniques for imbalance dataset,” in Proceedings of
significantly different. the 9th International Conference on Machine Learning and Computing,
2017, pp. 243–247.
• The models trained on sampled data have significant [9] T. Menzies and A. Marcus, “Automated severity assessment of software
improvement in predicting defect severity levels. defect reports,” in 2008 IEEE International Conference on Software
Maintenance. IEEE, 2008, pp. 346–355.
• The predictive ability of the models developed using [10] R. Jindal, R. Malhotra, and A. Jain, “Software defect prediction using
significant uncorrelated features has a better ability to neural networks,” in Proceedings of 3rd International Conference on
predict severity level as compared to all features. Reliability, Infocom Technologies and Optimization. IEEE, 2014, pp.
1–6.
• The models developed using SVM with polynomial [11] S. Ghaluh Indah Permata, “An attribute selection for severity level deter-
kernel achieve significantly better performance as mination according to the support vector machine classification result,”
compared to other techniques. in proceedings intl conf information system business competitiveness,
2012.
In this study, developed are trained using most frequently [12] M. Sharma, P. Bedi, K. Chaturvedi, and V. Singh, “Predicting the
used classifiers. Future work can be extended to deep-learning priority of a reported bug using machine learning techniques and cross
approach to achieve higher accuracy of software severity level project validation,” in 2012 12th International Conference on Intelligent
Systems Design and Applications (ISDA). IEEE, 2012, pp. 539–545.
prediction.
[13] M. Gayathri and A. Sudha, “Software defect prediction system using
multilayer perceptron neural network with data mining,” International
VIII. ACKNOWLEDGEMENTS Journal of Recent Technology and Engineering, vol. 3, no. 2, pp. 54–59,
This research is funded by TestAIng Solutions Pvt. Ltd. 2014.
[14] D. L. Gupta and K. Saxena, “Software bug prediction using object-
oriented metrics,” Sādhanā, vol. 42, no. 5, pp. 655–669, 2017.
R EFERENCES
[15] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE
[1] R. Malhotra and A. Jain, “Fault prediction using statistical and machine Transactions on knowledge and data engineering, vol. 21, no. 9, pp.
learning methods for improving software quality,” Journal of Informa- 1263–1284, 2009.
tion Processing Systems.
[16] N. V. Chawla, “Data mining for imbalanced datasets: An overview,” in
[2] L. Kumar, S. Misra, and S. K. Rath, “An empirical analysis of Data mining and knowledge discovery handbook. Springer, 2005, pp.
the effectiveness of software metrics and fault prediction model for 853–867.
identifying faulty classes,” Computer Standards & Interfaces, vol. 53,
[17] T. R. Hoens and N. V. Chawla, “Imbalanced datasets: from sampling
pp. 1–32, 2017.
to classifiers,” Imbalanced Learning: Foundations, Algorithms, and
[3] R. Malhotra, N. Kapoor, R. Jain, and S. Biyani, “Severity assessment of Applications, pp. 43–59, 2013.
software defect reports using text classification,” International Journal
[18] S. S. Rathore and S. Kumar, “An empirical study of some software
of Computer Applications, vol. 83, no. 11, 2013.
fault prediction techniques for the number of faults prediction,” Soft
[4] G. Abaei, A. Selamat, and H. Fujita, “An empirical study based on semi- Computing, vol. 21, no. 24, pp. 7417–7434, 2017.
supervised hybrid self-organizing map for software fault prediction,”
[19] R. Malhotra, Empirical research in software engineering: concepts,
Knowledge-Based Systems.
analysis, and applications. CRC Press, 2016.
[5] S. Kim and E. J. Whitehead Jr, “How long did it take to fix bugs?”
in Proceedings of the 2006 international workshop on Mining software
repositories, 2006, pp. 173–174.