Human Centric Software Engineering
IJIACS, ISSN 2347-8616, Volume 4, Issue 7, July 2015
1. INTRODUCTION
We conducted the study reported herein because we
wanted to investigate whether, given the increased
Kappa value     Interpretation
< 0             Poor
0.00-0.20       Slight
0.21-0.40       Fair
0.41-0.60       Moderate
0.61-0.80       Substantial
0.81-1.00       Almost perfect
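For illustration only, this scale can be written as a small lookup function; the Python code and the name interpret_kappa are ours, not part of the paper.

def interpret_kappa(kappa):
    # Verbal interpretation of a Kappa value, following the scale in the table above.
    if kappa < 0:
        return "Poor"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                         (0.80, "Substantial"), (1.00, "Almost perfect")]:
        if kappa <= upper:
            return label
    raise ValueError("Kappa cannot exceed 1.0")

print(interpret_kappa(0.45))  # -> Moderate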
4.2 Diff1
As an alternative to Kappa, and to offset Kappa's inherent problems with
ordinal-scale assessments, we constructed a new metric, which we refer to as
the Diff1 statistic: the number of times that two judges differ by more than
one point on the nine criteria. A Diff1 value of 1 thus means that there was
only one occasion out of a possible nine on which a pair of assessments
differed by more than one point. Diff1 is of particular relevance to
situations where there are several criteria per target and the criteria are
numerically equivalent (in our case, 4-point ordinal-scale measures).
However, it does not consider multiple judges.
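As an illustration, the following is a minimal sketch of the Diff1 computation for a single pair of judges; the function name, the example scores, and the use of Python are ours rather than the paper's.

import numpy as np

def diff1(judge_a, judge_b):
    # Count the criteria on which two judges differ by more than one point.
    a = np.asarray(judge_a, dtype=float)
    b = np.asarray(judge_b, dtype=float)
    return int(np.sum(np.abs(a - b) > 1))

# Nine 4-point criterion scores from two hypothetical judges;
# averaged assessments such as 2.5 are handled in exactly the same way.
scores_a = [2, 3, 1, 4, 2.5, 3, 2, 1, 4]
scores_b = [2, 1, 1, 3, 2.5, 4, 4, 1, 2]
print(diff1(scores_a, scores_b))  # -> 3 (criteria 2, 7 and 9 differ by more than one point)

Because the statistic compares only absolute differences, half-point averaged scores need no special handling, a property returned to below.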
Diff1 does not have a statistical test, but it is possible
to obtain the null distribution of the statistic
empirically. Diff1 can be used for assessments that
include values such as 1.5, 2.5, and 3.5, i.e., for
assessments that are aggregated by averaging. Its
main disadvantage is that it is a coarse statistic when
the number of points on the ordinal scale is small (as
it is in our case), so it may not be possible to construct
confidence limits on the values obtained at a
specifically required alpha level.
4.3 Establishing Baseline Distributions of the Test
Statistics
This section identifies the empirical distribution of
Diff1 and Kappa for the case of nine criteria assuming
that assessments of each criterion were random. We
use the empirical distribution to identify whether it is
likely that the agreement between judges was better
than we would expect if all judgements were made at
random.
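The following is a minimal sketch of how such baseline distributions can be simulated, assuming uniformly random scores on the 4-point scale, two judges per comparison, and an unweighted two-judge Cohen's Kappa; the paper does not give its simulation code, so these choices, and the helper names, are ours. The 1000 replications match the number of observations reported in Tables 3 and 4.

import numpy as np

rng = np.random.default_rng(0)          # fixed seed for reproducibility
N_CRITERIA, N_SIM = 9, 1000             # nine criteria; 1000 simulated observations
CATEGORIES = np.array([1, 2, 3, 4])     # 4-point ordinal scale

def diff1(a, b):
    # Number of criteria on which the two judges differ by more than one point.
    return int(np.sum(np.abs(a - b) > 1))

def cohen_kappa(a, b):
    # Unweighted two-judge Cohen's Kappa (our assumed variant of the paper's Kappa).
    po = np.mean(a == b)
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in CATEGORIES)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

diff1_null, kappa_null = [], []
for _ in range(N_SIM):
    a = rng.choice(CATEGORIES, size=N_CRITERIA)   # random assessment by judge 1
    b = rng.choice(CATEGORIES, size=N_CRITERIA)   # random assessment by judge 2
    diff1_null.append(diff1(a, b))
    kappa_null.append(cohen_kappa(a, b))

for name, sample in (("Diff1", diff1_null), ("Kappa", kappa_null)):
    sample = np.asarray(sample, dtype=float)
    print(name, round(sample.mean(), 4), round(sample.std(ddof=1), 4),
          np.percentile(sample, [1, 5, 10, 25, 50, 75, 90, 95, 99]))

The percentile summary of the two null samples is the kind of information reported in Tables 3 and 4; the lower tail of the Diff1 sample and the upper tail of the Kappa sample supply the cut-offs used in the next two subsections.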
4.3.1 Diff1
The overall distribution of the Diff1 metric for random evaluations is
summarized in Table 3. For Diff1, we are interested in low values of the
statistic for our evaluation data and take a value of 0 or 1 to indicate
acceptable agreement.
4.3.2 Kappa
The distribution of the Kappa statistic for pairs of
random evaluations is summarized in Table 4. In this
case, we take a Kappa value >0.26 to indicate
acceptable agreement.
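Taken together, the two cut-offs amount to simple acceptance checks on an observed pair of assessments; the sketch below is ours, with the thresholds taken from Sections 4.3.1 and 4.3.2 and the example values drawn from the results tables later in the paper.

def diff1_acceptable(diff1_value):
    # A Diff1 of 0 or 1 indicates acceptable agreement (Section 4.3.1).
    return diff1_value <= 1

def kappa_acceptable(kappa_value):
    # A Kappa above 0.26 indicates acceptable agreement (Section 4.3.2).
    return kappa_value > 0.26

print(diff1_acceptable(1), kappa_acceptable(0.339))   # True True
print(diff1_acceptable(3), kappa_acceptable(0.228))   # False False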
Table 3. Distribution of the Diff1 statistic for random evaluations
Statistic          Value
Observations       1000
Mean               3.349
Std. Dev.          1.4620
1% Percentile
5% Percentile
10% Percentile
25% Percentile
50% Percentile

Table 4. Distribution of the Kappa statistic for pairs of random evaluations
Statistic          Value
Observations       1000
Mean               -0.00163
Std. Dev.          0.1929
75% Percentile     0.1111
90% Percentile     0.2593
95% Percentile     0.2593
99% Percentile     0.5555
outscored paper D.
The main limitation of the study is the number of targets. With only four
papers we cannot be sure how well our results will generalise to our target
of all human-based software engineering experiments. For example, the papers
do not constitute a homogeneous sample. In particular, paper A is rather
different from the other papers because the human-based experiment presented
in the paper was only a small part of a wider evaluation exercise. Overall,
paper A was good, but the human-based experiment was weak. Further, even if
the sample were homogeneous, it might not be representative. Another
limitation is that we, as a group of researchers, have extensive experience
of empirical software engineering, so our results may be better than those
that would have been obtained by a random selection of researchers. In
addition, in part II of the study, we not only had a period of discussion
among pairs of judges, but also reviewed each paper a second time without a
time restriction. Thus, the more favourable results with respect to
reliability may be due not only to the discussion, but also to the
additional time spent on reviewing each paper.
Finally, an ultimate goal of the research community is
to conduct experiments of high quality. To reach this
goal, we must be able to evaluate the quality of
experiments. To achieve (as much as possible)
            Paper A           Paper B           Paper C           Paper D
            Diff1    Kappa    Diff1    Kappa    Diff1    Kappa    Diff1    Kappa
            19       10       26       23       21       22
Average     1.179    0.339    0.571    0.238    0.75     0.444    1.00     0.228
Std Dev.    1.249    0.2035   0.742    0.244    0.7515   0.169    1.018    0.203
            Paper A           Paper B           Paper C           Paper D
            Diff1    Kappa    Diff1    Kappa    Diff1    Kappa    Diff1    Kappa
Average     0.33  0.65  0.667  0.235  0.5  0.481  0.630
Std Dev.    0.52  0.153  0.516  0.218  0.837  0.182  0.156
Table 7. Median assessments for eight individual judges and four pairs of judges
Paper   Judges          Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Sum of Questions  Subjective Overall Assessment
        8 individuals   2.5
        4 pairs         2.5
        Diff            0.5  -0.5
        8 individuals
        4 pairs
        Diff
        16.5  1.5  16  0.5  0.5  3.5  3.5  2.5  29.5  3.5  3.5  2.5  3.5  2.5  29.5  0.5  -0.5
        8 individuals   3.5  3.5  2.5  31.5  4.5
        4 pairs         3.5  3.5  2.5  31.5  4.5
        Diff
        8 individuals   27
        4 pairs         26
        Diff
Paper A
random combinations of four (median)
joint

Paper B / Paper C
Min  Max  Rng / Min  Max  Rng
12  21  26  34
12.5  21  8.5  27  32
14  20  27.5  31
15  19  26  32
15  17.5  2.5  27.5  31

Paper D
Min  Max  Rng / Min  Max  Rng
29  36  24  32
29  33.5  4.5  25  29
29.5  33  3.5  26  30
30  32  24  28
30  32  24.5  27.  2.5
3.5  6  3.5