
2012 IEEE Fifth International Conference on Software Testing, Verification and Validation

Automated System Testing using Visual GUI Testing Tools: A Comparative Study in Industry

Emil Börjesson and Robert Feldt
Software Engineering and Technology
Chalmers University of Technology
Gothenburg, Sweden
[email protected]

Abstract—Software companies are under continuous pressure to shorten time to market, raise quality and lower costs. More automated system testing could be instrumental in achieving these goals, and in recent years testing tools have been developed to automate the interaction with software systems at the GUI level. However, there is a lack of knowledge on the usability and applicability of these tools in an industrial setting. This study evaluates two tools for automated visual GUI testing on a real-world, safety-critical software system developed by the company Saab AB. The tools are compared based on their properties as well as how they support automation of system test cases that have previously been conducted manually. The time to develop and the size of the automated test cases, as well as their execution times, have been evaluated. Results show that there are only minor differences between the two tools, one commercial and one open-source, but, more importantly, that visual GUI testing is an applicable technology for automated system testing with effort gains over manual system test practices. The study results also indicate that the technology has benefits over alternative GUI testing techniques and that it can be used for automated acceptance testing. However, visual GUI testing still has challenges that must be addressed, in particular the script maintenance costs and how to support robust test execution.

Keywords—Visual GUI testing; Empirical; Industrial Study; Tool Comparison

I. INTRODUCTION

Market trends with demands for faster time-to-market and higher quality software continue to pose challenges for software companies that often work with manual test practices that cannot keep up with increasing market demands. Companies are also challenged by their own systems, which are often Graphical User Interface (GUI) intensive and therefore complex and expensive to test [1], especially since software is prone to changing requirements, maintenance, refactoring, etc., which requires extensive regression testing. Regression testing should be conducted with configurable frequency [2], e.g. after system modification or before software release, on all levels of a system, from unit tests on small components to system and acceptance tests with complex end user scenario input data [3], [4]. However, due to market-imposed time constraints, many companies are compelled to focus or limit their manual regression testing with ad hoc test case selection techniques [5] that do not guarantee testing of all modified parts of a system and cause faults to slip through.

Automated testing has been proposed as one solution to the problems with manual regression testing, since automated tests can run faster and more often, decreasing the need for test case selection and thereby raising quality while reducing manual effort. However, most automated test techniques, e.g. unit testing [6], [7], Behavior Driven Development [8], etc., approach testing on a lower system level, which has spurred an ongoing discussion regarding whether these techniques can, with certainty, be applied to high-level tests, e.g. system tests [9], [10]. This uncertainty has resulted in the development of automated test techniques explicitly for system and acceptance tests, e.g. Record and Replay (R&R) [11]–[13]. R&R is a tool-supported technique where user interactions with a System Under Test's (SUT) GUI components are captured in a script that can later be replayed automatically. User interaction is captured either on a GUI component level, e.g. via direct references to the GUI components, or on a GUI bitmap level, with coordinates to the location of the component on the SUT's GUI. The limitation of this technique is that the scripts are fragile to GUI component change [14], e.g. API, code, or GUI layout change, which in the worst case can render entire automated test suites inept [15]. Hence, the state-of-practice automated test techniques suffer from limitations and there is a need for a more robust technique for automation of system and acceptance tests.

In this paper, we investigate a novel automated testing technique, which we in the following call visual GUI testing, with characteristics that could lead to more robust system test automation [16]. Visual GUI testing is a script-based testing technique that is similar to R&R but uses image recognition, instead of GUI component code or coordinates, to find and interact with GUI bitmap components, e.g. images and buttons, in the SUT's GUI. GUI bitmap interaction based on image recognition allows visual GUI testing to mimic user behavior and treat the SUT as a black box, whilst being more robust to GUI layout change. It is therefore a prime candidate for better system and acceptance test automation. However, the body of knowledge regarding visual GUI testing is small and contains no industrial experience reports or other studies to support the technique's industrial applicability. Realistic evaluation on industrial-scale testing problems is key to understanding and refining this technique. Neither does the body of knowledge contain studies that compare different visual GUI testing tools, or the strengths and weaknesses of the technique in the industrial context.

This paper aims to fill these gaps of knowledge by presenting a comparison of two visual GUI testing tools, one commercial, referred to as CommercialTool (for reasons of confidentiality we cannot disclose the name of the tool), and one open source, called Sikuli [16], in an industrial context, to answer the following research questions:
1) Is visual GUI testing applicable in an industrial context to automate manual high-level system regression tests?
2) What are the advantages and disadvantages of visual GUI testing for system regression testing?

To answer these questions we have conducted an empirical, multi-step case study at a Swedish company developing safety-critical software systems, Saab AB. A preparation step evaluated key characteristics of the two tools and what could be the key obstacles to applying them at the company. Dynamic evaluation of the tools was then done in an experimental setup to ensure the tools could handle key aspects of the type of system testing done at the company. Finally, a representative selection of system test cases for one of the company's safety-critical subsystems was automated in parallel with both of the tools. Our results and lessons learned give important insight into the applicability of visual GUI testing.

The paper is structured as follows: Section II presents related work, followed by Section III, which describes the case study design. Section IV presents results, which are then discussed in Section V. Section VI concludes the paper.

II. RELATED WORK

The body of knowledge on using GUI interaction and image recognition for automation is quite large and has existed since the early 90s, e.g. Potter [17] and his tool Triggers, used for GUI interactive computer macro development. Other early work includes Zettlemoyer and St. Amant, who explored GUI automation with image recognition in their tool VisMap. VisMap's capabilities were demonstrated through automation of a visual scripting program and the game Solitaire [18]. These early works did however not focus on automated testing but rather on automation in general with the help of image recognition algorithms.

There is also a large body of knowledge on using GUI interaction for software testing, as shown by Adamoli et al. [11], who surveyed 50 papers related to automated GUI testing for their work on GUI performance testing. Note that we differentiate between GUI interaction for automation and GUI interaction for testing, since not all techniques for GUI automation are intended for testing and vice versa.

One of the most common GUI testing approaches is Record and Replay (R&R) [11]–[13]. R&R is based on a two-step process where user mouse and keyboard inputs are first recorded and automatically stored in a script that the tool can then replay in the second step. Different R&R tools record user input on different GUI abstraction levels, e.g. the GUI object level or the GUI bitmap level, with different advantages and disadvantages for each level. On the top GUI bitmap level, a common approach is to save the coordinates of the GUI interaction in a script, with the drawback that the script becomes sensitive to reconfiguration of the GUI layout but with the advantage of making the scripts robust to API and code changes. The other R&R approach is to record SUT interaction on the lower GUI object level by saving references to the GUI code components, e.g. Java Swing components, which instead makes the scripts sensitive to API and code structure change [15] but more robust to GUI layout reconfiguration.

GUI testing can also be conducted on the top GUI bitmap level with techniques that use image recognition to execute test scenarios [16], in this paper referred to as visual GUI testing. Visual GUI testing is very similar to the R&R approach but with the important distinction that R&R tools do not use image recognition and are thus more hardcoded to the exact positioning of GUI elements. In current visual GUI testing tools, the common approach is that scenarios are written manually in scripts that include images for SUT interaction, in contrast to the R&R approach where test scripts are commonly generated automatically with coordinates or GUI component references. In a typical visual GUI testing script, input is given to the SUT through automated mouse and keyboard commands to GUI bitmap components identified through image recognition; output is then observed, once again with image recognition, and compared to expected results, after which the next sequence of input is given to the SUT, etc. The advantage of visual GUI testing is that it is impervious to GUI layout reconfiguration, API and code changes, etc., but with the disadvantage that it is instead sensitive to changes to the GUI bitmap objects themselves, e.g. changes of image size, shape or color.
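To make this input-observe-compare loop concrete, the following is a minimal sketch of what such a script could look like in Sikuli's Python-based script language. This is our illustration of the general pattern, not a script from the study, and the .png file names are hypothetical placeholders for screenshots of the SUT's GUI components:

```python
# Runs inside the Sikuli runtime, which provides click(), wait(), exists()
# and type() as built-in functions that act on the screen through image
# recognition. All .png files are hypothetical screenshots of GUI components.

wait("login_window.png", 30)    # wait up to 30 s for the SUT's login view
click("username_field.png")     # locate the text field by image and click it
type("testuser")                # send keyboard input to the focused field
click("ok_button.png")          # locate and click the OK button

# Observe the output with image recognition and compare to expected results.
if exists("main_view.png", 10):
    print("step passed: main view reached")
else:
    print("step failed: main view not found")
```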
A different approach to GUI testing is to base it on models, e.g. generating test cases from finite state machines (FSMs) [19], [20]. However, the models often need to be created manually at considerable cost, and the approach often faces scalability problems. Automated model creation approaches have been proposed, such as GUI ripping by Memon [21]. Hence, the area of GUI interaction, automation and testing is quite broad but limited regarding empirical studies evaluating the techniques on real-world, industrial-scale software systems. Comparative research has been done on tools that use the R&R technique [11], but, to the authors' knowledge, there are no studies that compare visual GUI testing tools or evaluate whether they can substitute manual regression testing in the industrial context.

Another important test aspect is acceptance testing, where user and customer requirement conformity is verified with test scenarios that emulate end user interaction with the SUT. The tests are similar to system test cases but contain more end-user-specific interaction information, i.e. how the system will be used in its intended domain. Acceptance test scenarios should preferably be automated and run regularly to verify system conformity to the system requirements [2] and have therefore been subject to academic research. This research has resulted in both tools and frameworks for acceptance test automation, including tools for GUI interaction [22], but to the authors' knowledge there is no research using visual GUI testing for acceptance testing.

III. CASE STUDY DESCRIPTION

The empirical study presented in this paper was conducted in a real-world, industrial context, in one business area of the company Saab AB, in the continuation of this paper referred to as Saab. Saab develops safety-critical air traffic control systems that consist of several individual subsystems, of which a key one was chosen as the subject of this study. The subsystem has in the order of 100K Lines of Code (LOC), constituting roughly one third of the functionality of the system it is part of, and is tested with different system-level tests, including 50 manual scenario-based system test cases. At the time of the study the subsystem was in the final phase for a new customer release, which was one reason why it was chosen. Other reasons for the choice included the subsystem's size in LOC, the number of manual test cases, and the fact that it had a non-animated GUI. By non-animated we mean that there are no moving graphical components, only components that, when interacted with, change face, e.g. color. Decision support information for which subsystem to include in the study was gathered through document analysis, interviews and discussions with different software development roles at Saab.

CommercialTool was selected for this study because Saab had been contacted by the tool's vendor and provided with a trial license for the tool, which made it accessible. It is a mature product for visual GUI testing, having been on the market for more than five years. The second tool, Sikuli, was chosen since it seemed to have similar functionality to CommercialTool and, if applicable, would be easier to refine and adapt further to the company context. The company was also interested in the relative cost benefits of the tools, i.e. whether the functionality or support of CommercialTool would justify its increased up-front cost.

The methodology used in the study was divided into two main phases, shown in Figure 1, with three steps in each phase. Phase one of the study was a pre-study with three different steps. An initial tool analysis compared the tools based on their static properties, as evaluated through ad hoc script development and review of the tools' documentation. This was followed by a series of experiments with the goal of collecting quantitative metrics on the strengths and weaknesses of the tools. The experiments also served to provide information about visual GUI testing's applicability for different types of GUIs, e.g. animated with moving objects and non-animated with static buttons and images, which would provide decision support for, and possibly rule out, what type of system to study at Saab in the second phase of the study. In parallel with these experiments, an analysis of the industrial context at Saab was also conducted. Phase two of the study was conducted at Saab and started with a complete manual system test of all the 50 test cases of the studied subsystem. This took 40 hours, spread over five days, during which the manual test cases were categorized based on their level of possible automation with the visual GUI testing tools. Both of the visual GUI testing tools were then used to automate five carefully selected, representative test case scenarios (ten percent) of the manual test suite, during which metrics on script development time, script LOC and script execution time were collected.

Fig. 1. Overview of research methodology (square nodes show activities/steps and rounded ones outcomes).

In the following sections the two phases of the methodology are described in more detail.

A. Pre-study

Knowledge about the industrial context at Saab was acquired through document analysis, interviews and discussions with different roles at the company. The company's support made it possible to identify a suitable subsystem for the study, based on subsystem size, number of manual test cases, GUI properties, criticality, etc., and to identify the manual test practices conducted at the company.

In parallel with the industrial context analysis, static properties of the studied tools were collected through an explorative literature review of the tools' documentation and ad hoc script development. The collected properties were then analyzed according to the quality criteria proposed by Illes et al. [23], derived from the ISO/IEC 9126 standard and supplemented with criteria to define tool vendor qualifications. The criteria refer to tool quality and are defined as Functionality, Reliability, Usability, Efficiency, Maintainability, Portability, General vendor qualifications, Vendor support, and Licensing and pricing.

The tools were also analyzed in four structured experiments where scripts were written in both tools, with equivalent instructions to make the scripts comparable, and then executed against controlled GUI input. The GUI input was classified into two groups, animated GUIs and non-animated GUIs, chosen to cover and evaluate how the tools perceivably performed for different types of industrial systems. The ability to handle animated GUIs is critical for visual GUI testing tools since they apply compute-intensive image recognition algorithms that might not be able to cope with highly dynamic GUIs.

Eight scripts were written in total, four in each tool, and each one was executed in 30 runs for each experiment. The experiments are summarized in the following list:
• Experiment 1: Aimed to determine how well the tools could differentiate between alpha-numerical symbols, by adding the numbers six and nine in a non-animated desktop calculator by locating and clicking on the calculator's buttons.
• Experiment 2: Aimed to determine how the tools could handle small graphical changes on a large surface, tested by repeated searches of the computer desktop for a specific icon to appear, which was controlled by the researcher.
• Experiment 3: Aimed to test the tools' image recognition algorithms in an animated context by locating the back fender of a car driving down a street in a video clip in which the sought target image was only visible for a few video frames.
• Experiment 4: Also in an animated context, aimed to identify how well the tools could track a moving object over a multi-colored surface in a video clip of an aircraft, represented by its textual call-sign, moving across a radar screen.

The four experiments cover typical functionality and behavior of most software system GUIs, e.g. interaction with static objects such as buttons or images, timed events and objects in motion, to provide a broad view of the applicability of the tools for different systems. Experiment 4 was selected since it is similar to one of the systems developed by the company.
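As an illustration of Experiment 1, a script of this kind could be written in Sikuli's script language roughly as follows. This is a hypothetical reconstruction for the reader's benefit, not the script used in the experiments, and the image names are invented:

```python
# Sketch of an Experiment 1-style Sikuli script: add six and nine in a
# desktop calculator by locating and clicking its buttons via image
# recognition. Illustrative only; the .png names are invented.

switchApp("Calculator")         # bring the calculator application forward

click("button_6.png")           # the '6' button (the symbol Sikuli sometimes
                                # confused with the calculator's 'C' button)
click("button_plus.png")
click("button_9.png")
click("button_equals.png")

# A screenshot of the display showing "15" serves as the expected result.
if exists("display_15.png", 5):
    print("run passed")
else:
    print("run failed")
```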
The experiments were run on a MacBook Pro computer with a 2.8GHz Intel Core 2 Duo processor, using virtual network computing (VNC) [24], which was a requirement for CommercialTool. CommercialTool is designed to be non-intrusive, meaning that it should not affect the performance of the SUT, and to support testing of distributed software systems. This is achieved by performing all testing over VNC, and support for it is built into the tool. Sikuli does not have VNC support, so to equalize the experiment conditions Sikuli was paired with a third-party VNC viewer application. The VNC viewer application was run on one user account, connected to a VNC server on a second user account on the experiment computer, as visualized in Figure 2.

Fig. 2. Visualization of the experimental setup.

Finally, the visual GUI testing tools were also analyzed in terms of learnability, since this aspect affects the technique's acceptance, e.g. if a tool has a steep learning curve it is less likely to be accepted by users [25]. The learnability was evaluated in two ad hoc experiments using Sikuli, where two individuals with novice programming knowledge, on two different occasions, had to automate a simple computer desktop task with the tool.

B. Industrial Study

The studied subsystem at Saab consisted of two computers with the Windows XP operating system, connected through a local area network (LAN). The LAN also included a third computer running simulators, used during manual testing to emulate domain hardware controlled by the subsystem's GUI. The GUI consisted primarily of custom-developed GUI components, such as buttons and other bitmap graphics, and was non-animated. During the study a fourth computer was also added to the LAN to run the visual GUI testing tools and VNC, as visualized in Figure 3. VNC is scalable for distributed systems, so the level of complexity of the industrial test system setup (Figure 3) was directly comparable to the complexity of the experimental setup used during the pre-study (Figure 2).

Fig. 3. Visualization of the test system setup.

In the first step of the industrial study the researchers conducted a complete manual system test of the chosen subsystem with two goals. The first goal was to categorize the manual test cases as fully scriptable, partially scriptable or not scriptable, based on the tool properties collected during the pre-study. The categorization provided input for the selection of representative manual test cases to automate and showed whether enough of the manual test suite could be automated for the automation to be valuable for Saab.

All the subsystem's manual test cases were scenario-based, written in natural language, including pre- and post-conditions for each test case, and were organized in tables with three columns. Column one described what input to manually give to the subsystem, e.g. click on button x, set property y, etc. Column two described the expected result of the input, e.g. button x changes face, property y is observed on object z, etc. The last column was a check box where the tester should report whether the expected result was observed or not. The test case table rows described the test scenario steps: after giving input x, observing output y and documenting the result in the checkbox on row k, the scenario proceeded on row k+1, etc., until reaching the final result checkbox on row n. Hence, the test scenarios were well defined and documented in a way suitable as input for the automation.
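For illustration, a single row of such a test case table could look as follows. The row contents are hypothetical, since the actual test cases are confidential; only the three-column structure is taken from the description above.

Step | Input to subsystem                     | Expected result                            | Observed?
k    | Click on button x in subsystem part A  | Button x changes face, e.g. is highlighted | [ ]
k+1  | Set property y via simulator A         | Property y is observed on object z         | [ ]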
The second research purpose of conducting the manual system test was to acquire information about how the different parts of the subsystem worked together and which test cases provided test coverage for which part(s) of the subsystem. Test coverage information was vital in the manual test case selection process to ensure that the selected test cases were representative of the entire test suite, so that the results could be generalized. Generalization of the results was required since it was not feasible to automate all 50 of the subsystem's manual test cases during the study.

Five test cases were selected for automation with the goal of capturing as many mutually exclusive GUI interaction types as possible, e.g. clicks, sequences of clicks, etc., to ensure that these GUI interaction types, and in turn the test cases including them, could be automated. GUI interaction types with properties that added complexity to the automation were especially important to cover in the five automated test cases; the most complex properties are listed below:
1) The number of physical computers in the subsystem the test case required access to.
2) Which of the available simulators for the subsystem the test case required access to.
3) The number of run-time reconfigurations of the subsystem the test case included.

The number of physical computers would impose complexity by requiring additional VNC control code and interaction with a broader variety of GUI components, e.g. interaction with custom GUI components in subsystem parts A and B and the simulators. Simulator interaction was also important to cover in the automated test cases, since if some simulator interaction could not be automated, neither could the manual test cases using that simulator. Run-time reconfiguration in turn added complexity by requiring the scripts to read and write XML files. In Table I the five chosen test cases are summarized together with which of the three properties they automate. The minimum number of physical computers required in any test case was two and the maximum three, whilst the maximum number of run-time reconfigurations in any test case was also three. There were four simulators, referred to as A, B, C and D, but only simulators A and B were automated in any script because they were the most commonly used in the manual test cases and also had the most complex GUIs. In addition, simulators C and D had very similar functionality to A and B and had no unique GUI components not present in A or B, and were therefore identified as less important to automate.

TABLE I
Properties of the manual test cases selected for automation. (The number of physical computers does not include the computer used to run the visual GUI testing tools.)

Test case   | Physical computers | Run-time config. | Simulator
Test case 1 | 2                  | 3                | A
Test case 2 | 2                  | 0                | B
Test case 3 | 2                  | 2                | A
Test case 4 | 2                  | 0                | A
Test case 5 | 3                  | 0                | A

Once the representative test cases had been selected from the manual test suite, they were automated in both of the studied tools, during which metrics were collected for comparison of the tools and the resulting scripts. The collected metrics included script development time, script LOC and script execution time.

IV. RESULTS

Below, the results gathered during the study are presented, divided into the results gathered during the pre-study and the results gathered during the industrial phase of the study.

A. Results of the Pre-study

The pre-study started with a review of the studied visual GUI testing tools' documentation, from which 12 comparable static tool properties relevant for Saab were collected. The 12 properties are summarized in Table II, which shows which property had impact on what tool quality criteria defined by Illes et al. [23], described in Section III. The table also shows which tool was the most favorable to Saab in terms of a given property, e.g. CommercialTool was more favorable in terms of real-time feedback than Sikuli. The favored tool is represented in the table with an S for Sikuli, CT for CommercialTool and (-) if the tools were equally favorable.

TABLE II
Results of the property comparison between CommercialTool and Sikuli. Column Impacts: F - Functionality, R - Reliability, U - Usability, E - Efficiency, M - Maintainability, P - Portability, GVQ - General Vendor Qualifications, VS - Vendor Support, LP - Licensing and Pricing. Column Favored tool: S - Sikuli, CT - CommercialTool, (-) - equal between the tools.

Property                             | CommercialTool                        | Sikuli                         | Impacts      | Favored tool
Developed in                         | C#                                    | Jython                         | F/P/VS       | S
Script language syntax               | Custom                                | Python                         | F/U/M        | S
Supports imports                     | No                                    | Java and Python                | F/U/E/VS     | S
Image representation in tool IDE     | Text strings                          | Images                         | F/U/M/P      | S
Real-time script execution feedback  | Yes                                   | No                             | U/M          | CT
Image recognition sweeps per second  | 7                                     | 5                              | F/R/U        | CT
Image recognition failure mitigation | Multiple algorithms to choose from    | Image similarity configuration | F/R/U/E/M/P  | CT
Test suite support                   | Yes                                   | Unit tests only                | F/U/M/P      | -
Remote SUT connection support        | Yes                                   | No                             | F/U/P        | -
Remote SUT connection requirement    | Yes                                   | No                             | F/U/P        | S
Cost                                 | 10,000 Euros per license per computer | Free                           | U/LP         | S
Backwards compatibility              | Guaranteed                            | Uncertain                      | F/M/GVQ      | CT

In the following, each of the 12 tool properties is discussed in more detail, compared between the tools and related to what tool quality criteria it impacts.

Developed in. CommercialTool is developed in C#, whilst Sikuli is developed in Jython (a Python implementation in Java), which is relevant for the portability of the tools since CommercialTool only works on certain software platforms whilst Sikuli is platform independent. Sikuli, being open source, also allows the user to expand the tool with new functionality, written in Jython, whilst users of CommercialTool must rely on vendor support to add tool functionality.

Script language syntax. The script language in Sikuli is based on Python, extended with functions specific to GUI interaction, e.g. clicking on GUI objects, writing text in a GUI, waiting for GUI objects, etc. Sikuli scripts are written in the tool's Integrated Development Environment (IDE), and because of the commonality between Python and other imperative/object-oriented languages the tool has both high usability and learnability, with a perceived positive impact on script maintainability. The learnability of Sikuli is also supported by the learnability experiments conducted during the pre-study, described in Section III, where novice programmers were able to develop simple Sikuli scripts after only 10 minutes of Sikuli experience and advanced scripts after an hour.

CommercialTool has a custom scripting language, modelled to resemble natural language, that the user writes in the tool's IDE, which has a lot of functionality, but the tool's custom language has a higher learning curve than Sikuli script. The usability of CommercialTool is however strengthened by its script language instruction set, which is more extensive than the instruction set in Sikuli, e.g. including functionality to analyze audio output. Both Sikuli and CommercialTool do however support all the most common GUI interaction functions and programming constructs, e.g. loops, switch statements, exception handling, etc.
written in the tool’s Integrated Development Environment

Supports imports. Additional functionality can be added to Sikuli by user-defined imports written in either Java or Python code, extending the tool's usability and efficiency. CommercialTool does not support user-defined imports, and again users must rely on vendor support to add tool functionality.

Image representation in tool IDE. Scripts in CommercialTool refer to GUI interaction objects (such as images) through textual names, whilst Sikuli's IDE shows the GUI interaction objects as images in the script itself. The image presentation in Sikuli's IDE makes Sikuli scripts very intuitive to understand, also for non-developers, which positively affects the usability, maintainability and portability of the scripts between versions of a system. In particular this makes a difference for large scripts with many images.

Real-time script execution feedback. CommercialTool provides the user with real-time feedback, e.g. what function of the script is currently being executed and the success or failure of the script. Sikuli on the other hand executes the script and then presents the user with feedback, i.e. post-script-execution feedback. This lowers the usability and maintainability of test suites in Sikuli since it becomes harder to identify faults.

Image recognition sweeps per second. Sikuli has one image recognition algorithm that can be run five times every second, whilst the image recognition algorithm in CommercialTool runs seven times every second. CommercialTool is therefore potentially more robust, e.g. to GUI timing constraints, and has higher reliability and usability, at least in theory, than Sikuli for this property.

Image recognition failure mitigation. CommercialTool has several image recognition algorithms with different search criteria, which give the tool higher reliability, usability, efficiency, maintainability and portability by providing automatic script failure mitigation. Script failure mitigation in Sikuli requires manual effort, e.g. additional failure mitigation code or setting the similarity, 1 to 100 percent, that a bitmap interaction object must reach for the image recognition algorithm to find a match in the GUI. Hence, Sikuli has less failure mitigation functionality, which can have negative effects on usability, reliability, etc.
Real-time script execution feedback. CommercialTool pro-
bility, etc.
vides the user with real-time feedback, e.g. what function of
Test suite support. Sikuli does not have built in support to
the script is currently being executed and success or failure
create, execute or maintain test suites with several test scripts,
of the script. Sikuli on the other hand executes the script and
only single unit tests. CommercialTool has such support built
then presents the user with feedback, i.e. post script execution
in. A custom test suite solution was therefore developed during
feedback. This lowers the usability and maintainability of test
the study that uses Sikuli’s import ability to run several test
suites in Sikuli since it becomes harder to identify faults.
scripts in sequence, providing Sikuli with the same function-
Image recognition sweeps per second. Sikuli has one image
ality, usability, perceived maintainability and portability.
recognition algorithm that can be run five times every second
Remote SUT connection support / requirement. Sikuli does
whilst the image recognition algorithm in CommercialTool
not have built in VNC support, a property that is not only
runs seven times every second. CommercialTool is therefore
supported by CommercialTool but also required by the tool to
potentially more robust, e.g. to GUI timing constraints, and
operate. Sikuli was therefore paired with a third party VNC
have higher reliability and usability, at least in theory, than
application as described in Section III, to provide Sikuli with
Sikuli for this property.
the same functionality, usability and portability as Commer-
Image recognition failure mitigation. CommercialTool has
cialTool.
several image recognition algorithms with different search
Cost. The studied tools differ in terms of cost since Sikuli
criteria that give the tool higher reliability, usability, efficiency,
is open source with no up-front cost whilst CommercialTool
maintainability and portability by providing automatic script
costs around 10.000 Euros per ‘floating license’ per year. A
failure mitigation. Script failure mitigation in Sikuli requires
floating license means that it is not connected to any one user
manual effort, e.g. by additional failure mitigation code or by
or computer but only one user can use the tool at a time, hence
setting the similarity, 1 to 100 percent, of a bitmap interaction
the Licensing and pricing quality criterion in this case affects
object required for the image recognition algorithm to find a
the usability of CommericalTool since some companies may

Backwards compatibility and support. The last property concerns the backwards compatibility of the tools. Whilst CommercialTool's vendor guarantees that the tool, which has been available on the market for several years, will always be backwards compatible, Sikuli is still in beta testing and therefore subject to change. Changes to Sikuli's instruction set could affect the functionality and maintainability of the tool and scripts. This property also provides general vendor qualification information, e.g. the maturity of the vendor and the tool, which plays an important part in tool selection and tool adoption in a company, e.g. CommercialTool may be favored because it is more mature and the tool vendor can supply support, etc.

The second part of the pre-study consisted of four structured experiments, described in Section III; their results are summarized in Table III. In the first experiment a script was developed in each tool for a non-animated desktop calculator application to evaluate CommercialTool's and Sikuli's image recognition algorithms' ability to identify alpha-numeric symbols. Sikuli only had a success rate of 50 percent in this experiment, over 30 runs, because the tool was not always able to distinguish between the number 6 and the letter C, used to clear the calculator, whilst CommercialTool had a success rate of 100 percent. In the second experiment the goal was to find a specific icon as it appeared on the desktop, hence to identify a small bitmap change on a large surface, for which both tools had a 100 percent success rate. In the third experiment the goal was to identify the back fender of a car driving down a road in a video clip where the sought fender image was only visible for a few video frames, imposing a time constraint on the image recognition algorithms. The car experiment resulted in Sikuli having a success rate of 25 percent and CommercialTool 3 percent. The final experiment required the tools to trace the call sign, a text string, of an aircraft moving over a multi-colored radar screen in a video clip, where Sikuli had a 100 percent success rate whilst CommercialTool's success rate was 0 percent.

TABLE III
Academic experiment results. CT stands for CommercialTool. Type indicates whether the experiment was non-animated or not, and Desc. describes the experiment.

Experiment | Type         | Desc.       | CT success rate (%) | Sikuli success rate (%)
1          | non-animated | Calculator  | 100                 | 50
2          | non-animated | Icon finder | 100                 | 100
3          | animated     | Car finder  | 3                   | 25
4          | animated     | Radar trace | 0                   | 100

A summary of the pre-study results shows that CommercialTool had a higher success rate in the experiments with non-animated GUIs and had more built-in functionality required for automated testing in the industrial context, shown by the 12 analyzed properties. Sikuli on the other hand had a higher success rate in the experiments with animated GUIs and showed to be easier to adapt, requiring only small efforts to be extended with additional functionality. In addition, Sikuli was considered marginally favored according to the tool quality criteria defined by Illes et al. and is therefore perceived as a better candidate for future research.

B. Results of the Industrial Study

The industrial part of the study started with the researchers conducting a complete manual system test of the studied subsystem. During the manual system test all the test cases were analyzed, as described in Section III, and classified into categories. The category analysis showed that Sikuli could fully script 81%, partially script 17% and not script 2% of the manual test cases. CommercialTool on the other hand could fully script 95%, partially script 3% and not script 2% of the manual test cases. The higher percentage of test cases that could be fully automated in CommercialTool was due to the tool's ability to analyze audio output, required in seven of the manual test cases. The 2% of the manual test cases that could not be scripted, in either tool, were hardware related and required physical interaction with the SUT.

Based on the categorization and the selection criteria discussed in Section III, five manual test cases were chosen for automation. The automation was done pair-wise in each tool, e.g. test case x was automated in one tool and then in the other, with the order of the first tool chosen at random for each test case. Random tool selection was used to ensure that the script development time for the script developed in the secondly used tool would not be continuously skewed (lowered) because challenges with the script, e.g. required failure mitigation, had already been resolved when the script was developed in the first tool.

The main contributor to script development time was observed in the study to be the amount of code required to mitigate failures due to unexpected system behavior, e.g. GUI components not rendering properly, GUI components appearing on top of each other, etc. Failure mitigation was achieved through ad hoc addition of wait functions, conditional branches and other exception handling, e.g. try-catch blocks, where each added function required extra image recognition sweeps of the GUI, which also increased the script execution time. Scripts that required failure mitigation also took longer to develop since they had to be rerun more times during development to ensure script robustness. The development time required to make a script robust also proved to be very difficult to estimate, because unexpected system behavior was almost never related to the test scenarios but rather a product of the subsystem's implementation.
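As an illustration of such ad hoc mitigation, a single scenario step in a Sikuli script could be hardened roughly as follows; this is our sketch with hypothetical images, not a script from the study:

```python
# Sikuli sketch: ad hoc failure mitigation around one scenario step. Each
# added check costs extra image recognition sweeps of the GUI, which is why
# mitigation code increases both development and execution time.

def dismiss_unexpected_dialogs():
    # dialogs triggered by the environment may appear anywhere on screen
    while exists("ok_dialog_button.png", 0):   # 0 = a single sweep, no waiting
        click("ok_dialog_button.png")

try:
    dismiss_unexpected_dialogs()
    wait("mode_button.png", 15)    # wait for the component to render properly
    click("mode_button.png")
    wait("mode_active.png", 15)    # observe the expected state change
except FindFailed:
    # the component or its expected state never appeared; report and let the
    # teardown put the subsystem back in a known state
    print("step failed: mode button or its active state not found")
```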
Each script was developed individually and consisted of three parts: first, a setup part to cover the preconditions of the test case; second, the test scenario itself; and third, a test teardown to put the subsystem back in a known state to prepare it for the following test case.
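A skeleton of this three-part structure, in Sikuli-style Python and with invented step contents, could look as follows:

```python
# Skeleton of the setup/scenario/teardown script structure described above;
# the images and steps are invented placeholders.

def setup():
    # establish the test case preconditions
    wait("subsystem_main_view.png", 60)

def scenario():
    # the scripted test scenario steps
    click("mode_button.png")
    wait("mode_active.png", 15)

def teardown():
    # put the subsystem back in a known state for the next test case
    click("reset_button.png")
    wait("subsystem_main_view.png", 60)

setup()
try:
    scenario()
finally:
    teardown()
```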
After the five test scripts had been developed in each tool, the LOC and execution time for each script were recorded; they are shown in Table IV together with the script development time and the number of steps in the corresponding manual test case scenario.

TABLE IV
Metrics collected during test case automation. CT stands for CommercialTool, ATC for automated test case and TC steps for the number of test steps in the scenario of the manual test case.

Test case | CT Dev-time (min) | CT Exe-time (sec) | CT LOC | Sikuli Dev-time (min) | Sikuli Exe-time (sec) | Sikuli LOC | TC Steps
ATC-1     | 255               | 111               | 103    | 105                   | 90                    | 212        | 5
ATC-2     | 195               | 405               | 233    | 200                   | 390                   | 228        | 4
ATC-3     | 285               | 390               | 368    | 260                   | 338                   | 345        | 16
ATC-4     | 205               | 80                | 80     | 180                   | 110                   | 92         | 9
ATC-5     | 120               | 90                | 115    | 150                   | 154                   | 169        | 8
Total     | 17 hours 40 min   | 17.93 min         | 899    | 14 hours 55 min       | 18.00 min             | 1046       | -

Fig. 4. Boxplot showing the development time of the five scripts in each tool (y-axis in minutes; the two boxes are CommercialTool Dev. Time and Sikuli Dev. Time).

Table IV shows that the total development time, LOC and execution time were similar for the scripts in both tools.

The five chosen test cases were carefully selected to be representative of the entire manual test suite for the subsystem, as described in Section III, to allow the collected data to be used for estimation. Estimation based on the average execution times from Table IV shows that the fully automated test suite for the subsystem, all 50 test cases, would run in approximately three and a half hours in each tool. A three-and-a-half-hour execution time constitutes a gain of 78 percent compared to the execution time of the current manual test suite, 16 hours, if conducted by an experienced tester. Hence, automation would constitute not only an advantage in that it can be run automatically without human input but also a considerable gain in total execution time, which allows for more frequent testing. Potentially, tests can run every night and over weekends and shorten feedback cycles in development.

In Figure 4 the script development times, taken from Table IV, are visualized in a boxplot that shows the time dispersion, mean development time, etc. Using the mean development time, the development time for the entire automated test suite, all 50 test cases, can be estimated to approximately 21 business days for CommercialTool and 18 business days for Sikuli. The estimated development time for the automated test suite is in the same order of time that Saab spends on testing during one development cycle of the subsystem. Hence, the investment of automating the test suite is perceived to be cost beneficial after one development cycle of the subsystem.
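As a back-of-the-envelope check of these estimates (our arithmetic, under the assumption of 8-hour business days, which the paper does not state explicitly), the figures follow from the Table IV data:

```python
# Rough reconstruction of the estimates from the Table IV totals
# (assuming 8-hour business days, which the paper does not state).

mean_exe_min = (17.93 + 18.00) / 2 / 5      # mean execution time per script
print(mean_exe_min * 50 / 60)               # ~3 hours of pure execution for all
                                            # 50 cases; with overhead this is the
                                            # paper's "three and a half hours"

print(1 - 3.5 / 16)                         # ~0.78: the 78 percent gain over the
                                            # 16-hour manual execution

ct_dev     = sum([255, 195, 285, 205, 120]) / 5.0   # 212 min per script
sikuli_dev = sum([105, 200, 260, 180, 150]) / 5.0   # 179 min per script
print(ct_dev * 50 / 60 / 8)       # ~22 business days (paper: approx. 21)
print(sikuli_dev * 50 / 60 / 8)   # ~18.6 business days (paper: approx. 18)
```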
The data in Table IV was also subjected to statistical tests to see whether there was any statistically significant difference between the two tools. The data was first analyzed with a Shapiro-Wilk test of the differences between the paired variables in Table IV, which showed that the data was normally distributed. Normal distribution allowed the data to be analyzed further with the Student t-test, which gave p-values of 0.3472 for development time, 0.956 for execution time and 0.2815 for LOC. The Student t-test results were then verified with a non-parametric paired Wilcoxon test, which had results with the same statistical implications. Hence, both the Student t- and Wilcoxon tests showed that we cannot reject the null hypothesis, H0, on a 0.05 confidence level. Therefore, it can be concluded that there is no statistically significant difference between the scripts of the studied tools in terms of development time, execution time or LOC. The statistical results are however limited by the few data points the tests were conducted on.
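For readers who want to retrace the analysis, the same three tests can be sketched with a standard statistics library; the following uses scipy (our tooling choice, not necessarily the authors') on the paired development times from Table IV. Run on this column, the paired t-test indeed yields p of approximately 0.347.

```python
# Sketch of the statistical analysis with scipy: Shapiro-Wilk on the paired
# differences, a paired Student t-test, and a Wilcoxon signed-rank test as
# the non-parametric verification.
from scipy import stats

# Development time in minutes per script, taken from Table IV.
ct     = [255, 195, 285, 205, 120]   # CommercialTool
sikuli = [105, 200, 260, 180, 150]   # Sikuli

diff = [c - s for c, s in zip(ct, sikuli)]

print(stats.shapiro(diff))           # normality of the paired differences
print(stats.ttest_rel(ct, sikuli))   # paired t-test; p is approximately 0.347
print(stats.wilcoxon(ct, sikuli))    # non-parametric verification
```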

V. DISCUSSION

Our study shows several differences between the two studied tools, but also that both tools were able to successfully automate 10 percent of an industrial manual system test suite, of which 98 percent of the test cases can be fully or partially automated with visual GUI testing. The open-source tool, Sikuli, had a higher percentage of test cases that could only be partially scripted since it has no current support for detecting audio output. However, this is not a major obstacle, since either the audio output can be visualized, and thus tested visually, or Sikuli can be extended with Operating System (OS) system calls.

CommercialTool and Sikuli differ in terms of cost, vendor support, test functionality, script languages, etc., with impacts on different tool quality criteria, shown in Table II, and these are all important properties to consider for the industrial applicability of visual GUI testing. However, to show that visual GUI testing has any applicability at all in industry, the most important aspect concerns the functionality of the image recognition algorithms.

The image recognition algorithms are what sets visual GUI testing apart from other GUI testing techniques, e.g. R&R, and they also determine for what types of systems it is possible to apply the technique. R&R that interacts through GUI components was determined to be unsuitable for the automation of the subsystem test cases, since they had to interact with components not developed by Saab, e.g. custom and OS GUI components. These interactions required access to GUI component references that could not be acquired. Furthermore, the GUI components in the SUT, e.g. the simulators, windows in the OS, etc., did not always appear in the same place on the screen when launched. This behavior also ruled out R&R with coordinate interaction as an alternative for the study. Evaluation of visual GUI testing showed that it does not suffer from R&R's limitations and therefore works in contexts where R&R cannot be applied. Visual GUI testing is applicable to different types of GUIs, evaluated in the pre-study experiments and in industry, which showed that both studied tools had high success rates with non-animated GUIs and that Sikuli had good success rates on animated GUIs as well. Hence, this study shows that visual GUI testing works for tests on non-animated GUIs and perceivably also for animated GUIs. Animated GUI applicability is however a subject for future, deeper research.

The purpose of automating manual tests is to make the regression testing more cost-efficient by increasing the execution speed and frequency and lowering the manual effort required to execute the test cases. Estimations based on the collected data show that a complete automatic test suite for the studied subsystem would execute in three and a half hours, which constitutes a 78 percent reduction compared to manual test execution with an experienced tester. Hence, the automated test suite could be run daily, eliminating the need for partial manual system tests, reducing cost, increasing test frequency and lowering the risk of slip-through of faults. Mitigation of fault slip-through is however limited with this technique by the test scenarios, since faulty functionality not covered by the test scripts would be overlooked, whilst a human tester could still detect it through visual inspection. Hence, the automated scripts cannot replace human testers and should rather be a complement to other test practices, such as manual free-testing. The benefit of visual GUI testing scripts compared to a human tester in terms of test execution is that the scripts are guaranteed to run according to the same sequence every time, whilst human testers are prone to take detours and make mistakes during testing, e.g. click on the wrong GUI object, etc., which can cause faults to slip through.

Scenario-based system tests are very similar to acceptance tests, and based on the results of this study it should therefore be concluded as plausible to automate acceptance tests with visual GUI testing. This conclusion is supported by research on similar GUI testing techniques, e.g. R&R, which has been shown to work for acceptance test automation [12], [22]. Further support is provided by the fact that some of the manual test cases for the studied subsystem, categorized as fully scriptable, had been developed with customer-specific data. The results of this study therefore provide initial support that visual GUI testing can be used for automated acceptance testing in industry.

During the study it was established that the primary cost of writing visual GUI testing scripts was related to the effort required to make the scripts robust to unexpected system behavior. Unexpected system behavior can be caused by faults in the system, related or unrelated to the script, and must be handled to avoid that these faults are overlooked or break the test execution. Other unexpected behavior can be caused by events triggered by the system's environment, e.g. warning messages displayed by the OS, i.e. events that may appear anywhere on the screen. These events can be handled with visual GUI testing but are a challenge for R&R, since the events' location, the coordinates, is usually nondeterministic. Script robustness in visual GUI testing can be achieved through ad hoc failure mitigation, but this is a time-consuming practice. A new approach, e.g. a framework or guidelines, is therefore required to make robust visual GUI test script development more efficient; this is hence another subject for future research.

The cost of automating the manual test suite for the studied subsystem was estimated to 20 business days, which is a considerable investment, and to ensure that it is cost-beneficial the maintenance costs of the suite therefore have to be small. Small is in this context measured against the cost of manual regression testing; hence the initial investment and the maintenance costs have to break even with the cost of the manual testing within a reasonable amount of time. The maintenance costs of visual GUI testing scripts when the system changes are however unknown, and future research is needed.

Our results show that visual GUI testing is applicable for system regression testing of the type of industrial safety-critical GUI-based systems in use at Saab. The technique is however limited to finding faults defined in the scripted scenarios. Hence, visual GUI testing cannot replace manual testing, but it can minimize it for customer delivery. Visual GUI testing also allows tests to be run more often and is more flexible than other GUI testing techniques, e.g. coordinate-based R&R, because of image recognition that can find a GUI component regardless of its position in the GUI. Furthermore, R&R tools that require access to the GUI components are, in contrast to visual GUI testing, not easily applicable at this company since their systems have custom-developed GUIs, as required in their domain. We have also seen that visual GUI testing can be applied for automated acceptance testing. Being able to continuously test the system with user-supplied test data could have very positive results on quality.

Evaluating a technique's applicability in a real-world context is a complex task. We have opted for a multi-step case study that covers multiple different criteria, which gives the company better decision support on which to proceed. Even though the test automation comparison is based on a limited number of test cases, the research was designed so that these test cases are representative of the rest of the manual test suite. Still, this is a threat to the validity of our results. Our industrial partner is more concerned with the amount of maintenance that will be needed as the system evolves. If these costs are high, they will seriously limit the long-term applicability of visual GUI testing.

VI. CONCLUSION

In this paper we have shown that visual GUI testing tools are applicable for automating system and acceptance tests for industrial systems with non-animated GUIs, with both cost and potentially quality gains over state-of-practice manual testing. Experiments also showed that the evaluated open source tool can successfully interact with dynamically changing, animated GUIs, which would broaden the number and type of systems the technique can be successfully applied to.

We present a comparative study of two visual GUI testing script tools, one commercial and one open source, at the company Saab AB. The study was conducted in multiple steps involving both static and dynamic evaluation of the tools. One of the company's safety-critical subsystems, distributed over two physical computers, with a non-animated GUI, was chosen, and 10 percent of its manual, scenario-based test cases (5 out of 50), selected to be representative, were automated in both tools. A pre-study helped select the relevant test cases to automate as well as evaluate the strengths and weaknesses of the two tools on key criteria relevant for the company.

Analysis of the tools' properties shows differences in their functionality, but the overall results show that both studied tools work equally well in the industrial context, with no statistically significant differences in either development time, run time or LOC of the test scripts. Analysis of the subsystem test suite shows that up to 98 percent of the test cases can be fully or partially automated using visual GUI testing, with gains to both the cost and quality of the testing. Execution times of the automated test cases are 78% lower than running the same test cases manually, and the execution requires no manual input.

Our analysis shows that visual GUI testing can overcome the obstacles of other GUI testing techniques, e.g. Record and Replay (R&R). R&R either requires access to the code in order to interact with the System Under Test (SUT) or is tied to the specific physical placement of GUI components on the display. Visual GUI testing is more flexible, interacting with GUI bitmap components through image recognition, and robust to changes and unexpected behavior during testing of the SUT. Both of these advantages were important in the investigated subsystem, since it had custom GUI components and GUI components that changed position between test executions. However, more work is needed to extend the tools with ways to specify and handle unexpected system events in a robust manner; the potential for this in the technique is not currently well supported in the available tools. For testing of safety-critical software systems there is also a concern that the automated tools are not able to find defects that are outside the scope of the test scenarios, such as safety defects. Thus, any automated system testing will still have to be combined with manual system testing before delivery, but the main concern for future research is the maintenance cost of the scripts as a system evolves.

REFERENCES

[1] P. Li, T. Huynh, M. Reformat, and J. Miller, "A practical approach to testing GUI systems," Empirical Software Engineering, vol. 12, no. 4, pp. 331–357, 2007.
[2] R. Miller and C. Collins, "Acceptance testing," Proc. XPUniverse, 2001.
[3] P. Hsia, D. Kung, and C. Sell, "Software requirements and acceptance testing," Annals of Software Engineering, vol. 3, no. 1, pp. 291–317, 1997.
[4] P. Hsia, J. Gao, J. Samuel, D. Kung, Y. Toyoshima, and C. Chen, "Behavior-based acceptance testing of software systems: a formal scenario approach," in Computer Software and Applications Conference, 1994. COMPSAC 94. Proceedings., Eighteenth Annual International. IEEE, 1994, pp. 293–298.
[5] T. Graves, M. Harrold, J. Kim, A. Porter, and G. Rothermel, "An empirical study of regression test selection techniques," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 10, no. 2, pp. 184–208, 2001.
[6] M. Olan, "Unit testing: test early, test often," Journal of Computing Sciences in Colleges, vol. 19, no. 2, pp. 319–328, 2003.
[7] E. Gamma and K. Beck, "JUnit: A cook's tour," Java Report, vol. 4, no. 5, pp. 27–38, 1999.
[8] D. Chelimsky, D. Astels, Z. Dennis, A. Hellesoy, B. Helmkamp, and D. North, "The RSpec book: Behaviour driven development with RSpec, Cucumber, and friends," Pragmatic Bookshelf, 2010.
[9] E. Weyuker, "Testing component-based software: A cautionary tale," Software, IEEE, vol. 15, no. 5, pp. 54–59, 1998.
[10] S. Berner, R. Weber, and R. Keller, "Observations and lessons learned from automated testing," in Proceedings of the 27th International Conference on Software Engineering. ACM, 2005, pp. 571–579.
[11] A. Adamoli, D. Zaparanuks, M. Jovic, and M. Hauswirth, "Automated GUI performance testing," Software Quality Journal, pp. 1–39, 2011.
[12] J. Andersson and G. Bache, "The video store revisited yet again: Adventures in GUI acceptance testing," Extreme Programming and Agile Processes in Software Engineering, pp. 1–10, 2004.
[13] A. Memon, "GUI testing: Pitfalls and process," IEEE Computer, vol. 35, no. 8, pp. 87–88, 2002.
[14] M. Jovic, A. Adamoli, D. Zaparanuks, and M. Hauswirth, "Automating performance testing of interactive Java applications," in Proceedings of the 5th Workshop on Automation of Software Test. ACM, 2010, pp. 8–15.
[15] E. Sjösten-Andersson and L. Pareto, "Costs and benefits of structure-aware capture/replay tools," SERPS'06, p. 3, 2006.
[16] T. Chang, T. Yeh, and R. Miller, "GUI testing using computer vision," in Proceedings of the 28th International Conference on Human Factors in Computing Systems. ACM, 2010, pp. 1535–1544.
[17] R. Potter, Triggers: Guiding automation with pixels to achieve data access. University of Maryland, Center for Automation Research, Human/Computer Interaction Laboratory, 1992, pp. 361–382.
[18] L. Zettlemoyer and R. St. Amant, "A visual medium for programmatic control of interactive applications," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: the CHI is the Limit. ACM, 1999, pp. 199–206.
[19] A. Memon, M. Pollack, and M. Soffa, "Hierarchical GUI test case generation using automated planning," Software Engineering, IEEE Transactions on, vol. 27, no. 2, pp. 144–155, 2001.
[20] P. Brooks and A. Memon, "Automated GUI testing guided by usage profiles," in Proceedings of the Twenty-second IEEE/ACM International Conference on Automated Software Engineering. ACM, 2007, pp. 333–342.
[21] A. Memon, "An event-flow model of GUI-based applications for testing," Software Testing, Verification and Reliability, vol. 17, no. 3, pp. 137–157, 2007.
[22] C. Lowell and J. Stell-Smith, "Successful automation of GUI driven acceptance testing," Extreme Programming and Agile Processes in Software Engineering, pp. 1011–1012, 2003.
[23] T. Illes, A. Herrmann, B. Paech, and J. Rückert, "Criteria for software testing tool evaluation. A task oriented view," in Proceedings of the 3rd World Congress for Software Quality, vol. 2, 2005, pp. 213–222.
[24] T. Richardson, Q. Stafford-Fraser, K. Wood, and A. Hopper, "Virtual network computing," Internet Computing, IEEE, vol. 2, no. 1, pp. 33–38, 1998.
[25] L. Fowler, J. Armarego, and M. Allen, "CASE tools: Constructivism and its application to learning and usability of software engineering tools," Computer Science Education, vol. 11, no. 3, pp. 261–272, 2001.