Human Centric Software Engineering
IJIACS, ISSN 2347-8616, Volume 4, Issue 7, July 2015
1. INTRODUCTION
We conducted the study reported herein because we
wanted to investigate whether, given the increased
Kappa value     Interpretation
< 0             Poor
0.00-0.20       Slight
0.21-0.40       Fair
0.41-0.60       Moderate
0.61-0.80       Substantial
0.81-1.00       Almost perfect
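For illustration only, this scale can be written as a small lookup function; the Python code and the name interpret_kappa are ours, not part of the paper.

def interpret_kappa(kappa):
    # Verbal interpretation of a Kappa value, following the scale in the table above.
    if kappa < 0:
        return "Poor"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                         (0.80, "Substantial"), (1.00, "Almost perfect")]:
        if kappa <= upper:
            return label
    raise ValueError("Kappa cannot exceed 1.0")

print(interpret_kappa(0.45))  # -> Moderate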
4.2 Diff1
As an alternative to Kappa, and to offset Kappa's inherent problems with
ordinal-scale assessments, we constructed a new metric, which we refer to as
the Diff1 statistic: the number of times that two judges differ by more than
one point on the nine criteria. A Diff1 value of 1 thus means that there was
only one occasion out of a possible nine on which a pair of assessments
differed by more than one point. Diff1 is of particular relevance to
situations where there are several criteria per target and the criteria are
numerically equivalent (in our case, 4-point ordinal-scale measures).
However, it does not consider multiple judges.
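As an illustration, the following is a minimal sketch of the Diff1 computation for a single pair of judges; the function name, the example scores, and the use of Python are ours rather than the paper's.

import numpy as np

def diff1(judge_a, judge_b):
    # Count the criteria on which two judges differ by more than one point.
    a = np.asarray(judge_a, dtype=float)
    b = np.asarray(judge_b, dtype=float)
    return int(np.sum(np.abs(a - b) > 1))

# Nine 4-point criterion scores from two hypothetical judges;
# averaged assessments such as 2.5 are handled in exactly the same way.
scores_a = [2, 3, 1, 4, 2.5, 3, 2, 1, 4]
scores_b = [2, 1, 1, 3, 2.5, 4, 4, 1, 2]
print(diff1(scores_a, scores_b))  # -> 3 (criteria 2, 7 and 9 differ by more than one point)

Because the statistic compares only absolute differences, half-point averaged scores need no special handling, a property returned to below.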
Diff1 does not have a statistical test, but it is possible
to obtain the null distribution of the statistic
empirically. Diff1 can be used for assessments that
include values such as 1.5, 2.5, and 3.5, i.e., for
assessments that are aggregated by averaging. Its
main disadvantage is that it is a coarse statistic when
the number of points on the ordinal scale is small (as
it is in our case), so it may not be possible to construct
confidence limits on the values obtained at a
specifically required alpha level.
4.3 Establishing Baseline Distributions of the Test
Statistics
This section identifies the empirical distribution of
Diff1 and Kappa for the case of nine criteria assuming
that assessments of each criterion were random. We
use the empirical distribution to identify whether it is
likely that the agreement between judges was better
than we would expect if all judgements were made at
random.
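The following is a minimal sketch of how such baseline distributions can be simulated, assuming uniformly random scores on the 4-point scale, two judges per comparison, and an unweighted two-judge Cohen's Kappa; the paper does not give its simulation code, so these choices, and the helper names, are ours. The 1000 replications match the number of observations reported in Tables 3 and 4.

import numpy as np

rng = np.random.default_rng(0)          # fixed seed for reproducibility
N_CRITERIA, N_SIM = 9, 1000             # nine criteria; 1000 simulated observations
CATEGORIES = np.array([1, 2, 3, 4])     # 4-point ordinal scale

def diff1(a, b):
    # Number of criteria on which the two judges differ by more than one point.
    return int(np.sum(np.abs(a - b) > 1))

def cohen_kappa(a, b):
    # Unweighted two-judge Cohen's Kappa (our assumed variant of the paper's Kappa).
    po = np.mean(a == b)
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in CATEGORIES)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

diff1_null, kappa_null = [], []
for _ in range(N_SIM):
    a = rng.choice(CATEGORIES, size=N_CRITERIA)   # random assessment by judge 1
    b = rng.choice(CATEGORIES, size=N_CRITERIA)   # random assessment by judge 2
    diff1_null.append(diff1(a, b))
    kappa_null.append(cohen_kappa(a, b))

for name, sample in (("Diff1", diff1_null), ("Kappa", kappa_null)):
    sample = np.asarray(sample, dtype=float)
    print(name, round(sample.mean(), 4), round(sample.std(ddof=1), 4),
          np.percentile(sample, [1, 5, 10, 25, 50, 75, 90, 95, 99]))

The percentile summary of the two null samples is the kind of information reported in Tables 3 and 4; the lower tail of the Diff1 sample and the upper tail of the Kappa sample supply the cut-offs used in the next two subsections.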
4.3.1 Diff1
The overall distribution of the Diff1 metric for random evaluations is
summarized in Table 3. For Diff1, we are interested in low values of the
statistic for our evaluation data and take a value of 0 or 1 to indicate
acceptable agreement.
4.3.2 Kappa
The distribution of the Kappa statistic for pairs of
random evaluations is summarized in Table 4. In this
case, we take a Kappa value >0.26 to indicate
acceptable agreement.
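Taken together, the two cut-offs amount to simple acceptance checks on an observed pair of assessments; the sketch below is ours, with the thresholds taken from Sections 4.3.1 and 4.3.2 and the example values drawn from the results tables later in the paper.

def diff1_acceptable(diff1_value):
    # A Diff1 of 0 or 1 indicates acceptable agreement (Section 4.3.1).
    return diff1_value <= 1

def kappa_acceptable(kappa_value):
    # A Kappa above 0.26 indicates acceptable agreement (Section 4.3.2).
    return kappa_value > 0.26

print(diff1_acceptable(1), kappa_acceptable(0.339))   # True True
print(diff1_acceptable(3), kappa_acceptable(0.228))   # False False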
Table 3. Distribution of the Diff1 statistic for random evaluations
Statistic          Value
Observations       1000
Mean               3.349
Std. Dev.          1.4620
1% Percentile
5% Percentile
10% Percentile
25% Percentile
50% Percentile

Table 4. Distribution of the Kappa statistic for pairs of random evaluations
Statistic          Value
Observations       1000
Mean               -0.00163
Std. Dev.          0.1929
75% Percentile     0.1111
90% Percentile     0.2593
95% Percentile     0.2593
99% Percentile     0.5555
outscored paper D.
The main limitation of the study is the number of targets. With only four
papers we cannot be sure how well our results will generalise to our target
of all human-based software engineering experiments. For example, the papers
do not constitute a homogeneous sample. In particular, paper A is rather
different from the other papers because the human-based experiment presented
in the paper was only a small part of a wider evaluation exercise. Overall,
paper A was good, but the human-based experiment was weak. Further, even if
the sample were homogeneous, it might not be representative. Another
limitation is that we, as a group of researchers, have extensive experience
of empirical software engineering, so our results may be better than those
that would have been obtained by a random selection of researchers. In
addition, in part II of the study, we not only had a period of discussion
among pairs of judges, but also reviewed each paper a second time without a
time restriction. Thus, the more favourable results with respect to
reliability may be due not only to the discussion, but also to the
additional time spent on reviewing each paper.
Finally, an ultimate goal of the research community is
to conduct experiments of high quality. To reach this
goal, we must be able to evaluate the quality of
experiments. To achieve (as much as possible)
            Paper A           Paper B           Paper C           Paper D
            Diff1    Kappa    Diff1    Kappa    Diff1    Kappa    Diff1    Kappa
            19       10       26       23       21       22
Average     1.179    0.339    0.571    0.238    0.75     0.444    1.00     0.228
Std Dev.    1.249    0.2035   0.742    0.244    0.7515   0.169    1.018    0.203
            Paper A           Paper B           Paper C           Paper D
            Diff1    Kappa    Diff1    Kappa    Diff1    Kappa    Diff1    Kappa
Average     0.33  0.65  0.667  0.235  0.5  0.481  0.630
Std Dev.    0.52  0.153  0.516  0.218  0.837  0.182  0.156
Table 7. Median assessments for eight individual judges and four pairs of judges
Paper   Judges          Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Sum of Questions  Subjective Overall Assessment
        8 individuals   2.5
        4 pairs         2.5
        Diff            0.5  -0.5
        8 individuals
        4 pairs
        Diff
        16.5  1.5  16  0.5  0.5  3.5  3.5  2.5  29.5  3.5  3.5  2.5  3.5  2.5  29.5  0.5  -0.5
        8 individuals   3.5  3.5  2.5  31.5  4.5
        4 pairs         3.5  3.5  2.5  31.5  4.5
        Diff
        8 individuals   27
        4 pairs         26
        Diff
Paper A
random combinations of four (median)
joint

Paper B / Paper C
Min  Max  Rng / Min  Max  Rng
12  21  26  34
12.5  21  8.5  27  32
14  20  27.5  31
15  19  26  32
15  17.5  2.5  27.5  31

Paper D
Min  Max  Rng / Min  Max  Rng
29  36  24  32
29  33.5  4.5  25  29
29.5  33  3.5  26  30
30  32  24  28
30  32  24.5  27.  2.5
3.5  6  3.5