The Usability Metric for User Experience
Kraig Finstad
Article history: Received 21 September 2009; accepted 6 April 2010; available online 6 May 2010.

Keywords: Usability; User experience; Scale; Metric

Abstract

The Usability Metric for User Experience (UMUX) is a four-item Likert scale used for the subjective assessment of an application's perceived usability. It is designed to provide results similar to those obtained with the 10-item System Usability Scale, and is organized around the ISO 9241-11 definition of usability. A pilot version was assembled from candidate items, which was then tested alongside the System Usability Scale during usability testing. It was shown that the two scales correlate well, are reliable, and both align on one underlying usability factor. In addition, the Usability Metric for User Experience is compact enough to serve as a usability module in a broader user experience metric.

© 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.intcom.2010.04.004
1. Introduction

Measuring and tracking usability is an ongoing challenge for organizations that are concerned with improving user experience. A popular and cost-effective approach to usability measurement is the use of standardized surveys. When the Information Technology (IT) department at Intel® decided to standardize on a usability inventory, it selected the System Usability Scale (SUS). The SUS is a 10-item, five-point Likert scale with a weighted scoring range of 0–100, and it has been shown to be a reliable measure of usability. It is anchored with one as Strongly Disagree and five as Strongly Agree. According to Holyer (1993), it correlates at 0.86 with the 50-item Software Usability Measurement Inventory (Kirakowski et al., 1992). Tullis and Stetson (2004) found the SUS to outperform the Questionnaire for User Interface Satisfaction (Chin et al., 1988) and the Computer System Usability Questionnaire (Lewis, 1995) at assessing website usability. The SUS was adopted as a standard usability measure because of these performance characteristics, in addition to being free and relatively compact.

It proved to be easy for project teams to understand, but several issues emerged. As IT at Intel® began to pursue a more comprehensive approach to user experience, the SUS was originally considered as the usability module for a more comprehensive index of user experience. The definition underlying that index describes user experience as a lifecycle consisting of Marketing and Brand Awareness, Acquisition and Installation, Product or Service Use, Product Support, and Removal/End of Life (Sward and Macarthur, 2007). However, it became apparent that simply adapting the SUS to work as a Product Use component was not feasible. Early trials with internal project teams showed that a 10-item Product Use module would be too large when other elements such as Product Support were factored in and required their own additional scales. The concept of user experience covers a lot of ground: any Product Use or usability component of a larger user experience index would have to be much more compact than 10 items. Also, in its original form, the SUS did not lend itself well to electronic distribution in a global environment, because non-native English speakers often failed to understand the word "cumbersome" in SUS Item 8 (Finstad, 2006). The SUS also uses a five-point Likert scale, which has been shown to be inadequate in many cases. Diefenbach et al. (1993) found that seven-point scales outperformed five-point scales in reliability, accuracy, and ease of use, while Cox's (1980) review of Likert scales found the optimal number of alternatives to be seven. Finstad (in press) found that respondents were more likely to provide non-integer interpolations (e.g., saying "three and a half" instead of "three" or "four") on the five-point SUS than on a seven-point alternate version of the same instrument. These interpolations indicate a mismatch between the scale and a user's actual evaluation. From a more theoretical standpoint, the SUS items did not map well onto the concepts that comprise usability according to ISO 9241-11 (1998), namely effectiveness, efficiency, and satisfaction. These mappings are important because the SUS is not a diagnostic tool; it can indicate whether there is a problem with a system's usability, but not what those problems actually are. It is often used as a starting point in usability efforts, but an alignment with known usability factors can provide a stronger foundation for user experience efforts.

These issues with the SUS motivated a research program aimed at developing a replacement. The goal was to provide an inventory that was substantially shorter than the SUS and therefore appropriate as the usability component of a larger user experience index. An early attempt at item set reduction aimed to leverage a single ease of use item from the SUS.
A SUS survey with 43 responses was conducted on an enterprise portal product. It was found that Item 3 in the SUS, "I thought [the system] was easy to use," correlated with the final SUS score at r = 0.89, p < 0.01, the strongest correlation in the set of 10 items. This result was not surprising in light of recent findings. Sauro and Dumas (2009) have demonstrated the utility of a general ease of use Likert item, and have also shown a promising alternative in the Subjective Mental Effort Questionnaire (SMEQ). Tedesco and Tullis (2006) found that a single "Overall this task was: Very Difficult...Very Easy" Likert item correlated significantly with usability test performance. This direction motivated further analysis of SUS surveys and SUS Item 3 with other systems, but no consistent pattern emerged. In some cases SUS Item 3 correlated most strongly with the final SUS score, and in others it did not. The idea of reducing an instrument to one general ease of use item was abandoned. Instead, a new direction was taken: the development of a concise scale that would more closely conform to the ISO 9241-11 (1998) definition of usability, would minimize bias and language issues, and would still perform as well as the baseline it was intended to replace. In this case the baseline was the updated, internationally appropriate SUS (with "cumbersome" clarified as "awkward"), and the performance goal was set at a correlation of 0.80 or better between the total SUS score and the total score of the new scale. The resulting instrument is the Usability Metric for User Experience (UMUX), and this paper outlines the research and development of this usability component of a more general user experience measurement model.

2. Pilot study

A pilot study was developed to explore these possibilities. Its end goal was to determine how the candidate Likert items would fare in an analysis of actual responses.

2.1. Method

2.1.1. Participants
A total of 42 Intel® employees were recruited as part of a larger worldwide usability test. As a control for cultural and language factors in both the usability task and the candidate Likert items, participants were recruited worldwide. Users from the United States, Germany, Ireland, the Netherlands, China, the Philippines, Malaysia, and Israel participated in this study.

2.1.2. Materials
A pool of candidate Likert items related to the ISO 9241-11 (1998) definition of usability was developed. Twelve such items were written, four each for effectiveness, efficiency, and satisfaction. Some were intentionally generic, while others were behavior-based (e.g., "I don't make many errors with this system") or emotion-based (e.g., "I would prefer to use something other than this system"). These candidate items used a five-point scale so they could be administered alongside the SUS in an actual post-deployment usability survey. Also like the SUS, they used alternating positive/negative keying to control for acquiescence bias. The 12 candidate items and their usability factors are listed in Table 1.

Table 1
Candidate items used (pilot study).

  Efficiency:
    - [This system] saves me time.
    - I tend to make a lot of mistakes with [this system].
    - I don't make many errors with [this system].
    - I have to spend a lot of time correcting things with [this system].
  Effectiveness:
    - [This system] allows me to accomplish my tasks.
    - I think I would need a system with more features for my tasks.
    - I would not need to supplement [this system] with an additional one.
    - [This system's] capabilities would not meet my requirements.
  Satisfaction:
    - I am satisfied with [this system].
    - I would prefer to use something other than [this system].
    - Given a choice, I would choose [this system] over others.
    - Using [this system] was a frustrating experience.

Note: Bracketed text is custom-replaced by the relevant system.

2.1.3. Design and procedure
Participants first engaged in a usability test of an enterprise software prototype involving the selection of contract workers and their addition to a database. After completing the usability test, participants received a modified version of the SUS. The first three items were candidate items, followed by the SUS, which was then followed by three more candidate items. This format presented the SUS as an intact instrument in order to achieve a valid final score for analysis. Each participant therefore responded to six candidate items, two per usability component (effectiveness, efficiency, and satisfaction), in addition to the SUS. This allowed a direct per-participant comparison of candidate item responses with a final SUS score. Presentation of candidate items was counterbalanced across participants. Responses to the Likert items were verbal, with the entire items read aloud to help ensure comprehension of the scale. The facilitator recorded responses manually. After completing the composite survey, participants were thanked for their time, debriefed, and excused.

2.2. Results

2.2.1. Item correlations
The odd items in the SUS were scored as [score - 1], and the even items were scored as [5 - score]. This aligned all scores in one direction, removing the positive/negative keying of the language in the instrument, and allowed zeroes at the bottom of the range. The ten rescored items were summed and multiplied by 2.5, providing a range of 0–100 (Brooke, 1996). The critical measure of this study was the correlation of UMUX candidate items (scored similarly to the SUS) with the final SUS score. A high correlation coefficient indicated that the candidate item was in line with the total SUS score, regardless of direction. That is, a good candidate item would correlate highly with the SUS regardless of whether the SUS itself was indicating good or poor usability. This is a different approach from that used in developing the original SUS, which selected candidates based on their tendencies toward extreme (non-neutral) responses (Brooke, 1996). The UMUX is intended to match the performance of the SUS, so alignment with existing measures was more important. (A computational sketch of this rescoring and correlation procedure appears at the end of this section.)

As the UMUX was designed to reflect the ISO 9241-11 (1998) definition of usability with as few items as possible, the highest-correlating candidate items for each usability component were chosen for further study. Table 2 summarizes these results.

Table 2
Items having highest correlation with overall SUS score (pilot study).

All the correlations in this table were negative due to the negative keying of the chosen candidates; for instance, if the application was usable, then participants disagreed with the item. The more general items with language like "I am satisfied..." tended to correlate poorly. As a point of comparison, the correlations of the SUS items with the SUS score itself varied from r = 0.36 to r = 0.78.

No participants required assistance with the terminology or phrasing of the UMUX candidate items, which was taken as evidence that the items' language was clear to an international audience.
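To make the rescoring and correlation procedure concrete, here is a minimal sketch in Python with NumPy. The data and variable names are illustrative only; they are not the study's data.

```python
import numpy as np

def sus_score(responses):
    """Rescore one 10-item SUS response (each item 1-5) to the 0-100 range.

    Odd-numbered items are scored as (score - 1), even-numbered items as
    (5 - score); the rescored items are summed and multiplied by 2.5.
    """
    assert len(responses) == 10
    rescored = [(r - 1) if i % 2 == 0 else (5 - r)  # i = 0 is Item 1
                for i, r in enumerate(responses)]
    return sum(rescored) * 2.5

# Illustrative (not real) data: per-participant SUS responses and raw 1-5
# responses to one negatively keyed candidate item.
sus_responses = [[4, 2, 5, 1, 4, 2, 4, 2, 5, 2],
                 [2, 4, 2, 4, 2, 5, 2, 4, 1, 4],
                 [5, 1, 4, 2, 5, 1, 5, 2, 4, 1]]
candidate_item = np.array([2, 4, 1])

sus_totals = np.array([sus_score(r) for r in sus_responses])
# Pearson r between the candidate item and the final SUS scores; a strong
# negative r is expected here because the candidate item is negatively keyed.
r = np.corrcoef(candidate_item, sus_totals)[0, 1]
print(round(r, 2))
```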
Fig. 1. Scree plot of principal components (survey study).

Only one component emerged from the analysis, so no attempts at further extractions or rotations were performed. The SUS provided a similar one-component extraction, with no additional elements emerging. For a more thorough treatment of factoring in the SUS, see Lewis and Sauro (2009), who found evidence that the SUS may be comprised of two factors (usability and learnability). The conclusion from this analysis is that both instruments were unidimensional and aligned on just one component (usability) rather than several.
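The dimensionality analysis behind a scree plot like Fig. 1 can be sketched as a principal components extraction over the item correlation matrix. The following is a minimal illustration with simulated data, not the paper's analysis:

```python
import numpy as np

# Illustrative (simulated) 4-item response matrix; real input would be the
# recoded UMUX responses, one row per participant.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))                # one underlying factor
items = latent + 0.5 * rng.normal(size=(200, 4))  # four noisy indicators

# Principal components via the eigenvalues of the correlation matrix.
corr = np.corrcoef(items, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print(eigenvalues)  # a scree plot of these values should show one dominant
                    # component if the scale is unidimensional
```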
3.2.2. Reliability
Instruments need to measure an underlying construct consistently. At the early stages of a metric's development, one way to establish this is through reliability estimation. Cronbach's alpha is a reliability coefficient that indicates how consistently a set of items measures a single factor; the rule of thumb is that an absolute value above 0.70 indicates a high degree of internal reliability. Instruments farther along in their development are subjected to more longitudinal reliability measures. Cronbach's alpha indicated high reliability for both instruments: 0.94 for the UMUX and 0.97 for the SUS. Therefore, both instruments were reliable.
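Cronbach's alpha is simple to compute from a participants-by-items score matrix. A minimal sketch, with illustrative data:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (participants x items) matrix of scores.

    alpha = k/(k-1) * (1 - sum of per-item variances / variance of totals)
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Illustrative four-item data (rows = participants, columns = items).
scores = np.array([[6, 5, 6, 5],
                   [2, 1, 2, 1],
                   [5, 6, 5, 6],
                   [1, 2, 1, 2]])
print(round(cronbach_alpha(scores), 2))  # values above 0.70 suggest
                                         # good internal reliability
```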
3.2.3. Validity and sensitivity
The overall correlation of the UMUX with the SUS, across both system conditions, was r = 0.96, p < 0.001. This exceeds the goal criterion of r > 0.80, providing evidence of validity. T-tests demonstrated that System 2 was more usable than System 1, t(533) = 39.04, r = 0.86, p < 0.01 for the UMUX and t(556) = 44.47, r = 0.89, p < 0.01 for the SUS, thereby providing evidence of sensitivity. The breakdown of usability inventory scores and correlations is shown in Table 5.

Table 5
Means, standard deviations, and correlation (survey study).

  System    UMUX (0-100), M (SD)   SUS (0-100), M (SD)   r
  System 1  27.66 (20.54)          28.77 (18.19)         0.84*
  System 2  87.91 (15.98)          88.39 (13.18)         0.81*

  * p < 0.001.
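The validity and sensitivity checks above amount to a Pearson correlation and independent-samples t-tests. A minimal sketch with SciPy; the simulated scores stand in for the study's per-participant data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative placeholder scores (0-100) for two systems; in practice these
# would be per-participant UMUX and SUS scores from the survey study.
umux_s1 = np.clip(rng.normal(28, 20, 100), 0, 100)
umux_s2 = np.clip(rng.normal(88, 16, 100), 0, 100)
sus_s1 = np.clip(umux_s1 + rng.normal(0, 8, 100), 0, 100)
sus_s2 = np.clip(umux_s2 + rng.normal(0, 8, 100), 0, 100)

# Validity: UMUX should track SUS across both systems (goal: r > 0.80).
r, p = stats.pearsonr(np.concatenate([umux_s1, umux_s2]),
                      np.concatenate([sus_s1, sus_s2]))

# Sensitivity: each instrument should separate the two systems.
t_umux = stats.ttest_ind(umux_s2, umux_s1)
print(round(r, 2), round(t_umux.statistic, 1))
```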
3.2.4. Item correlations
After the UMUX had been developed and finalized, the performance of its individual items was examined in two applied situations. All final UMUX items were analyzed for their contribution to the overall UMUX score, both in a post-usability test questionnaire (n = 45) and in the first seven internal usability projects completed with the new scale as a standard instrument (n = 272). The results shown in Table 6 demonstrate significant item-total correlations. It was therefore concluded that all UMUX items were valid contributors to the overall score.

Table 6
Correlations of UMUX items with overall score (survey study).

4. Discussion

4.1. Implementation
The UMUX can be administered electronically as a survey, or as a follow-up in usability testing. It is simple to administer, as it requires no branching or reordering of items. The UMUX is implemented as shown below, where bracketed text is custom-replaced by the relevant system. Each item is answered on a seven-point scale anchored at 1 (Strongly Disagree) and 7 (Strongly Agree).

1. [This system's] capabilities meet my requirements.
2. Using [this system] is a frustrating experience.
3. [This system] is easy to use.
4. I have to spend too much time correcting things with [this system].
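For electronic administration, the instrument can be represented directly as data. The sketch below is one possible representation; the field names and the render helper are hypothetical, not part of the published instrument:

```python
# The four UMUX items with their keying (odd items positively keyed, even
# items negatively keyed); keying matters at scoring time, not display time.
# Field names are illustrative only.
UMUX_ITEMS = [
    {"number": 1, "text": "[This system's] capabilities meet my requirements.", "positive": True},
    {"number": 2, "text": "Using [this system] is a frustrating experience.", "positive": False},
    {"number": 3, "text": "[This system] is easy to use.", "positive": True},
    {"number": 4, "text": "I have to spend too much time correcting things with [this system].", "positive": False},
]

def render(system_name):
    """Substitute the system under evaluation into the bracketed placeholders."""
    for item in UMUX_ITEMS:
        text = item["text"].replace("[This system's]", f"{system_name}'s")
        text = text.replace("[This system]", system_name)
        text = text.replace("[this system]", system_name)
        yield item["number"], text

# Hypothetical system name, for illustration.
for n, text in render("AcmePortal"):
    print(f"{n}. {text}  (1 = Strongly Disagree ... 7 = Strongly Agree)")
```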
4.2. Analysis
Once data are collected, they need to be properly recoded, with a method that borrows from the SUS. Odd items are scored as [score - 1], and even items are scored as [7 - score]. As with the SUS, this removes the positive/negative keying of the items and allows a minimum score of zero. Each individual UMUX item has a range of 0–6 after recoding, giving the entire four-item scale a preliminary maximum of 24. To achieve parity with the 0–100 range provided by the SUS, a participant's UMUX score is the sum of the four items divided by 24 and then multiplied by 100. This calculation replaces the earlier methodology of weighting items by a 2.5 multiplier. These per-participant scores are then averaged to find a mean UMUX score. It is this mean score and its confidence interval that become the application's UMUX metrics for usability tracking and goal-setting, as in the sketch below.
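A minimal sketch of this scoring procedure, with illustrative responses; the confidence interval shown uses a normal approximation, which is one reasonable reading of the interval the paper calls for:

```python
import numpy as np

def umux_score(responses):
    """Score one four-item UMUX response (each item 1-7) on the 0-100 scale.

    Odd items are recoded as (score - 1), even items as (7 - score), so each
    item contributes 0-6; the sum out of 24 is then scaled to 0-100.
    """
    assert len(responses) == 4
    recoded = [(r - 1) if i % 2 == 0 else (7 - r)  # i = 0 is Item 1
               for i, r in enumerate(responses)]
    return sum(recoded) / 24 * 100

# Illustrative responses; each row is one participant's four item scores.
data = [[6, 2, 7, 1], [5, 3, 6, 2], [7, 1, 6, 2]]
scores = np.array([umux_score(r) for r in data])
mean = scores.mean()
# 95% confidence interval for the mean (normal approximation; a t-based
# interval would be the stricter choice for small samples).
half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
print(f"UMUX = {mean:.1f} +/- {half_width:.1f}")
```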
4.3. Limitations

The UMUX, like the SUS, provides a subjective evaluation of a system's usability. Its scoring has yet to be compared to objective metrics, such as error rates and task timings, in a full experiment.
Additionally, as it is currently the first module in a planned series of user experience measures, it only measures usability.

As the UMUX consists of only four Likert items, it offers respondents fewer total response points than the SUS, although the move to a seven-point scale provides some mitigation. The UMUX has four seven-point items for a total of 28 response points, while the SUS has ten five-point items for a total of 50. By comparison, a single ease of use item like that used in Tedesco and Tullis (2006) may have only five. Reducing the total information capacity of a survey instrument can result in a less sensitive measure. Even once validity and reliability are established, there is still a potential risk of application beyond the metric's scope. For example, a simple ease of use item may do an exemplary job of measuring ease of use, but a user experience professional needs to determine whether that information alone is sufficient as a usability metric.

4.4. Conclusion

It can be concluded that the Usability Metric for User Experience is a reliable, valid, and sensitive alternative to the System Usability Scale. It correlates with the SUS at a rate higher than 0.80, its items align on one usability factor, and it is fully capable as a standalone subjective usability metric. It also reflects a fundamental lesson for the user experience community: in order to measure user experience effectively, its components need to be measured efficiently. The compact size of the UMUX suits a more fully realized measurement model of user experience. Such a model would go beyond Product Use, and would include other product lifecycle stages such as Brand Awareness and Installation (Sward and Macarthur, 2007). Sward (personal communication, August 5, 2009) indicates significant progress in this area. The UMUX is well-positioned as a foundation for developing future instruments, with the ultimate goal of metrics that can target any of a product's user experience aspects in a way that is concise and cross-validated.

Acknowledgements

The author thanks ... for language suggestions, Charles Lambdin of Intel® for statistical assistance, and Linda Wooding of Intel® for management support in implementing this research.

References

Brooke, J., 1996. SUS: a "quick and dirty" usability scale. In: Jordan, P.W., Thomas, B., Weerdmeester, B.A., McClelland, A.L. (Eds.), Usability Evaluation in Industry. Taylor and Francis, London.
Chin, J.P., Diehl, V.A., Norman, K., 1988. Development of an instrument measuring user satisfaction of the human-computer interface. In: Proceedings of ACM CHI '88, Washington, DC, pp. 213-218.
Cox III, E.P., 1980. The optimal number of response alternatives for a scale: a review. Journal of Marketing Research 17, 407-422.
Diefenbach, M.A., Weinstein, N.D., O'Reilly, J., 1993. Scales for assessing perceptions of health hazard susceptibility. Health Education Research 8, 181-192.
Finstad, K., 2006. The System Usability Scale and non-native English speakers. Journal of Usability Studies 1 (4), 185-188.
Finstad, K., in press. Response interpolation and scale sensitivity: evidence against five-point scales. Journal of Usability Studies.
Holyer, A., 1993. Methods for Evaluating User Interfaces. Cognitive Science Research Paper No. 301. School of Cognitive and Computing Sciences, University of Sussex.
ISO 9241-11, 1998. Ergonomic Requirements for Office Work with Visual Display Terminals (VDTs). Part 11: Guidance on Usability.
Kirakowski, J., Porteous, M., Corbett, M., 1992. How to use the Software Usability Measurement Inventory: the users' view of software quality. In: Proceedings of the European Conference on Software Quality, Madrid.
Lewis, J., 1995. IBM computer usability satisfaction questionnaires: psychometric evaluation and instructions for use. International Journal of Human-Computer Interaction 7 (1), 57-78.
Lewis, J.R., Sauro, J., 2009. The factor structure of the System Usability Scale. In: Proceedings of the Human-Computer Interaction International Conference (HCII 2009), San Diego, CA, USA.
Sauro, J., Dumas, J.S., 2009. Comparison of three one-question, post-task usability questionnaires. In: Proceedings of the 27th International Conference on Human Factors in Computing Systems, Boston.
Sward, D., Macarthur, G., 2007. Making user experience design a business strategy. In: Towards a UX Manifesto, SIGCHI Workshop, Lancaster, UK, September 3-4.
Tabachnik, B.G., Fidell, L.S., 1989. Using Multivariate Statistics, 2nd ed. Harper Collins, New York.
Tedesco, D., Tullis, T., 2006. A comparison of methods for eliciting post-task subjective ratings in usability testing. Usability Professionals Association (UPA) 2006, 1-9.
Tullis, T.S., Stetson, J.N., 2004. A comparison of questionnaires for assessing website usability. In: Proceedings of UPA 2004, June 7-11.