
2017 IEEE International Conference on Robotics and Automation (ICRA)

Singapore, May 29 - June 3, 2017

Automated Generation of Diverse and Challenging Scenarios for Test and Evaluation of Autonomous Vehicles

Galen E. Mullins1, Paul G. Stankiewicz2, and Satyandra K. Gupta3, Senior Member, IEEE

Abstract— We propose a novel method for generating test scenarios for a black box autonomous system that demonstrate critical transitions in its performance modes. In complex environments it is possible for an autonomous system to fail at its assigned mission even if it complies with requirements for all subsystems and throws no faults. This is particularly true when the autonomous system may have to choose between multiple exclusive objectives. The standard approach of testing robustness through fault detection is to directly stimulate the system and detect violations of the system requirements. Our approach differs by instead running the autonomous system through full missions in a simulated environment and measuring performance based on high-level mission criteria. The result is a method of searching for challenging scenarios for an autonomous system under test that exercise a variety of performance modes. We utilize adaptive sampling to intelligently search the state space for test scenarios which exist on the boundary between distinct performance modes. Additionally, using unsupervised clustering techniques we can group scenarios by their performance modes and sort them by those which are most effective at diagnosing changes in the autonomous system's behavior.

1 Galen E. Mullins is with the Department of Mechanical Engineering and Institute for Systems Research, University of Maryland, College Park, MD 20742, USA [email protected]
2 Paul G. Stankiewicz is with the Johns Hopkins University Applied Physics Laboratory, 11100 Johns Hopkins Road, Laurel, Maryland 20723, USA [email protected]
3 S.K. Gupta is with the Department of Aerospace and Mechanical Engineering and Center for Advanced Manufacturing, University of Southern California, Los Angeles, California 90089, USA [email protected]

978-1-5090-4633-1/17/$31.00 ©2017 IEEE

I. INTRODUCTION

Effective test and evaluation (T&E) of autonomous vehicles remains an open problem in the testing community. Traditional testing has focused on the physical components of a system to ensure safe and reliable operation. The very nature of autonomous systems, however, dictates that the reasoning component, i.e. the "brains", of the system must be effectively tested as well. A great deal of recent work has focused on applying software testing techniques to autonomous systems, such as fault detection with an enormous number of inputs [1] or model checking [2] of the underlying decision engine. This can provide some runtime assurances for the robustness of the software, but it does not provide information about how the autonomous system will perform when executing a mission. Additionally, these testing techniques do not provide insight into the environmental factors that contribute to the autonomous system's decision process.

For example, consider an unmanned underwater vehicle (UUV) tasked with a survey mission. The multiple subsystems and behavioral modes of the UUV must work in concert in the presence of competing priorities. For example, it must offset the risk of detection when surfacing against the need to localize itself via GPS. It must also determine whether it has enough fuel to complete the survey, or whether uncertainties in its environment make the effort too risky so that it must return to base early. This is of particular concern for long duration missions where the vehicle must transition among multiple mission objectives [3]. As such, it can be difficult to provide guarantees of the system's decision-making capabilities without running extensive tests in a variety of challenging scenarios. This requires both a simulation framework capable of exercising the autonomous system realistically [4] and a suite of tests that provide coverage of the mission space [5].

To do this, we focus our attention on the regions in the configuration space where small changes in the scenario result in transitions between performance modes. The canonical example is how a small change to the position of an obstacle can cause the vehicle to take a different path and fail to reach its goal. Understanding where these transitions occur is key to predicting the performance of autonomous vehicles and is useful for both the design and validation of the system. The transitions are particularly useful for identifying the triggers for specific behaviors, such as the strength of a current which overcomes an obstacle avoidance strategy. These can then be utilized for repairing bugs or simply understanding the likelihood of a certain behavior being triggered in that region of the configuration space.

One issue immediately encountered is that the number of configuration parameters for the scenario quickly increases when attempting to capture realistic missions. Moving and static obstacles, tidal and constant currents, time windows for objectives, and other environmental factors are just a few of the different parameters on which an engineer may wish to test a UUV. As the number of parameters increases, the number of samples required to maintain the same resolution grows exponentially. In addition, if the software being tested is not capable of running faster than real time, a single scenario may take hours to run to completion.

Given the limitations of computational resources and the enormity of the testing space, we require a sampling method that maximizes the information returned given a limited number of runs. This involves preferentially returning samples from performance boundary regions rather than spending resources exploring regions of stable performance, while still exploring along the entire performance boundary. This strategy is similar to using active learning to train a classifier,
Fig. 1: An overview of the test generation cycle for a system with continuous unlabeled outputs.

where the highest information cases lie along the decision boundary and thus the algorithm will preferentially sample in these regions [6]. Furthermore, given that running tests is restricted by time rather than by the number of queries, this search technique cannot incur a significant amount of overhead above and beyond what the simulations already demand. This means we require a method that scales well with both the number of samples and the number of dimensions in order to properly capture the complexity of realistic missions. In this paper, we introduce a novel adaptive search technique designed for the purpose of discovering performance boundaries of a system that scales better than previous approaches with the number of samples and the number of dimensions. We also introduce a technique for identifying performance boundaries through unsupervised clustering and adjacency tests.

In this paper we refer to the regions in the parameter space where transitions in behaviors occur as the performance boundaries of the autonomous system. Our objective is to utilize active learning methodology to effectively sample the scenario parameter space and automatically identify the performance boundary cases. In this paper we discuss the two parts of this learning process: first, using adaptive sampling to search the parameter space, followed by unsupervised learning to determine the performance modes and the cases that make up the performance boundaries. An overview of our approach is shown in Figure 1.

II. RELATED WORK

The validation and verification of autonomous systems has been an incredibly active area of research in the past few years, particularly the use of active learning for test case generation. Surrogate optimization of complex systems has similarly been an increasingly popular tool for design of experiments. Here we discuss recent work in the fields of verification of autonomous systems and surrogate model generation and how it relates to the work in this paper.

Test scenario generation has been an active area of research in the software testing domain for some time [7]. The current focus entails generating tests for requirements verification, ensuring that the software does not throw faults, and ensuring that the hardware meets its reliability specifications. One testing method is to simply stimulate the system with an enormous number of inputs for fault detection [8] [1], often utilizing optimization-based combination testing [9] to minimize the size of the test suite while maximizing coverage. Additionally, sampling-based methods have been used to discover different performance modes for a car control system [10]. All of these testing methods provide runtime assurances for the robustness of the software, but they do not provide information about how the autonomous system will perform when executing a mission. Put another way, current software testing validates that the system can
accommodate receiving bad inputs or sending bad outputs; it does not necessarily validate the actual decisions made by the underlying algorithms.

Generating and evaluating test scenarios which stress the autonomous system under test in simulation has been explored with success in the past [11]. For example, the work in [12] used this strategy to find types of multi-UAV encounters that stressed the autonomous system's conflict resolution algorithms. Evolutionary generation techniques are frequently used for this purpose [13], with objective functions specifically designed for the domain being tested. The concept of discovering simulation cases based on their difficulty has also been explored for the navigation of ground vehicles in a 3D environment [14]. These results are encouraging and demonstrate that test generation techniques can be applied to a variety of domains. The open challenge that remains is determining which search strategy is the most efficient at delivering the relevant test cases with the fewest required simulations.

Validation of autonomous systems for critical missions has been increasing in importance as robots have become more prevalent in ordnance disposal, search-and-rescue, and other applications that have very low tolerance for failure [15]. Of particular interest are methods of providing performance guarantees based on model checking and formal methods [2] [16]. These methods require that a model which fully describes the autonomous system's performance can be generated and exhaustively tested for exceptions which break the specifications. Models that have been used in the past include finite state machines [1] and process algebras [16]. The drawback of these techniques is that the resulting model must fully describe the autonomous system and test engineers must have full access to the model. Given the increasing complexity of autonomous systems and the black box nature of proprietary software, these limitations prevent these methods from being applied to many systems.

III. PROBLEM FORMULATION

What differentiates our process from previous works is the concept of performance boundaries. As described earlier, performance boundaries are regions of the testing space where the performance of the system under test (SUT) is uncertain, i.e. small alterations to the scenario configuration can cause transitions in the SUT behaviors which result in large performance changes. In this section we more formally define the search and boundary identification problems as well as the terms used throughout the paper.

A. System under test

Our target SUT is the decision making software for an autonomous vehicle executing a mission in a simulated environment. It takes a scenario configuration as an input and returns a set of score metrics of its mission performance as output. This score is based on externally observable attributes that would be associated with mission requirements, such as time elapsed, path taken, and completion of objectives. While the output of the system is a set of continuous values, these can be mapped to discrete behaviors or performance modes of the autonomous system. The test cases we consider to be the most informative are those which occur in the transition regions between performance modes, previously referred to as the performance boundaries. The reasoning behind this claim is that it is ineffective to test the system in regions where performance is constant and known, i.e. cases where the system will almost surely succeed or conversely cases where the system will almost surely fail. Much more information about the system is gained by testing in regions where critical decisions must be made by the autonomous system that result in variable performance. Additionally, the traditional strategy of testing under worst-case conditions does not fully characterize the performance envelope of the system: there may be failure modes or performance boundaries that occur in regions other than worst-case conditions and that are not immediately apparent.

B. Definition of the SUT

(i.) The scenario configuration state space X^n = [X1, ..., Xn] of n elements. Each element in the state space vector represents a variable in the environment, mission, or vehicle parameters with a range of possible values (obstacle positions, time windows, mission priorities, etc.). The state space in this context is synonymous with the testing space, i.e. the space of all possible tests that could be performed based on the parameters specified by the test engineer.

(ii.) A scenario input state is defined as the vector X = [x1, x2, ..., xn] where ∀i ∈ n : xi ∈ Xi. The scenario is a specific instantiation of each parameter from its corresponding state space range. Thus, the state space consists of all the possible scenario configurations that could be tested. A sample set of N scenario states is defined as X^N = [X1, ..., XN]. The normalized state vector where each x̄i ∈ [0, 1] is defined as X̄.

(iii.) The performance score space Y^m of m parameters, where each output score is defined as the vector Y = [y1, y2, ..., ym]. Each element in the score vector represents a performance metric by which the autonomous system is evaluated, such as percentage of fuel consumed or number of waypoints reached. A sample set of N score vectors is defined as Y^N = [Y1, ..., YN]. The normalized score vector where each ȳi ∈ [0, 1] is defined as Ȳ.

(iv.) A black box system under test (SUT) function F(X^N) = Y^N. It accepts a set of N input states X^N = [X1, ..., XN] and returns a sample set of N score vectors Y^N = [Y1, ..., YN]. For our purposes this consists of providing a scenario configuration as input, running the simulation until completion, and receiving the scoring metrics computed against the history of the simulation as output.

(v.) A performance mode is defined as P ⊂ Y^m where ∪i Pi = Y^m and ∀i ≠ j, Pi ∩ Pj = Ø. In other words, a performance mode is a category of scores which represents a distinct type of performance for the system under test.
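The formalization in (i.)–(v.) maps directly onto a thin software harness around a simulator. The sketch below is illustrative only: the SUT, its parameter ranges, and its score metrics are invented stand-ins, not the system used in this paper.

```python
import numpy as np

def normalize(V, lo, hi):
    """Map each component of a sample set into [0, 1], giving the normalized
    vectors X-bar / Y-bar of definitions (ii.) and (iii.)."""
    return (V - lo) / (hi - lo)

def toy_sut(X_N):
    """Stand-in for the black-box SUT function F(X^N) = Y^N of definition (iv.).
    Input: N scenario states (here, a single obstacle's east/north offset).
    Output: N score vectors (here, a fuel metric and a binary 'goal reached')."""
    fuel = np.hypot(X_N[:, 0], X_N[:, 1])   # detour length drives fuel use
    reached = (fuel < 1.0).astype(float)    # mission succeeds if fuel suffices
    return np.column_stack([fuel, reached])

def performance_mode(Y_N):
    """Definition (v.): assign each score vector to one of a set of disjoint
    score categories; trivial here because one score column is binary."""
    return Y_N[:, 1].astype(int)

X_N = np.array([[0.2, 0.3], [0.9, 0.8]])    # two scenario input states
Y_N = toy_sut(X_N)
modes = performance_mode(Y_N)               # first scenario succeeds, second fails
X_bar = normalize(X_N, 0.0, 1.0)            # these ranges are already unit-scaled
```

In a real harness `toy_sut` would launch the full mission simulation; everything downstream (sampling, clustering, boundary identification) only sees the `(X^N, Y^N)` pairs.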
(vi.) The boundary region Ba,b ⊂ X between performance modes Pa and Pb is defined as the region where ∀Xi,a ∈ Ba,b, ∃Xj,b ∈ Ba,b s.t. |Xi,a − Xj,b| < ε_D, and vice versa, where ε_D is the width of the boundary region. The set of all boundaries that exist for the SUT in question is referred to as B.

(vii.) A boundary pair bi,j ∈ Ba,b is a pair of samples which are each other's closest neighbor in a different performance mode. It is defined as bi,j = [Xi, Xj, Yi, Yj] where |Xi − Xj| = Dij < ε_D, Xi, Xj ∈ X^N, and Yi ∈ Pa, Yj ∈ Pb | a ≠ b.

(viii.) The sampled boundary region is defined as Sa,b(X^N, ε_D) ⊂ Ba,b where ∀Xi ∈ Sa,b(X^N, ε_D), ∃Xj ∈ X^N such that |Xi − Xj| < ε_D and Xj ∈ Ba,b.

C. Problem Statement

1) Search Problem: Given a SUT function along with the state space and score space which define its inputs, the search function is defined as follows:

Γ(F, X^n, Y^m, N) = L^N. (1)

where N is the number of samples allocated to the search. The output, L^N, is a set of labeled samples L^N = [X^N, Y^N] consisting of the queried states X^N and their respective scores Y^N.

Our objective is to generate the set of samples X^N which maximizes the volume of the sampled boundary regions Sa,b(X^N, ε_D) for all boundaries in B for the smallest possible value of ε_D. The number of performance modes of the SUT and the mapping from score to performance mode are not known a priori.

2) Boundary Identification Problem: We formally define the boundary identification algorithm as a function

C(L) = B (2)

which accepts a set of labeled samples, L^N, and returns the set of identified performance boundaries:

B = [B1,2, B1,3, ..., BL−2,L, BL−1,L] (3)

where L is the number of identified performance modes and N is the number of samples in L^N. Each boundary Ba,b is the set of samples that borders the performance modes a and b. Our objective is to successfully identify all samples in L^N which exist on the boundaries between performance modes and to provide an estimate of their distance from the boundary.

D. Overview of Approach

The approach presented in this paper is broken into two primary phases: search and identification. During the search phase we utilize an adaptive sampling, or active learning, approach to select new test cases that are run by the autonomous system simulation. This process utilizes Gaussian Process Regression [17] to model the autonomous system's performance and preferentially select regions that might indicate performance boundaries. The high dimensionality of the state space for an autonomous system under test makes it intractable to simply perform an exhaustive spread of simulations. Thus we have focused our problem of searching the state space primarily on adequate coverage of the boundary regions while minimizing the number of simulations. In the identification phase, the samples generated during the search phase are used to identify the performance modes in the resulting data using unsupervised clustering algorithms. Once test cases have been classified by their performance mode, the boundaries between performance modes are identified and the tested scenarios adjacent to boundaries can be used to aid in live test design.

IV. SEARCH STRATEGY

A. Adaptive Sampling

Adaptive sampling is an iterative process consisting of submitting queries to the SUT, using the returned scores to generate a meta-model, and then applying an information metric to the meta-model to generate a new set of queries. This is an alternative to space-filling designs, such as Latin hypercube (LH) or Sobol sequences, which attempt to optimize uniform coverage and density and are precomputed based upon the size of the state space. In this paper we utilize a generalized method for adaptive sampling which allows for changing the underlying meta-model and information metric. This is more formally defined in Algorithm 1. The adaptive algorithm uses the normalized unit states X̄ and scores Ȳ for the information metrics.

B. Boundary Information Metrics

There are multiple query strategies that can be used for adaptive sampling, including entropy, model improvement, uncertainty, and density. Our objective is to drive the search towards performance boundaries; thus, we have designed our metrics to look for areas with high gradients that have not been sampled yet. This is similar to the exploration-exploitation approach of the LOLA-Voronoi algorithm [18], and as such we have included it as one of our baseline comparisons. The Voronoi tessellation present in LOLA-Voronoi, however, scales poorly with both the number of samples and the input dimensionality. For n points in R^d it takes O(n log n + n^⌈d/2⌉) time, making it infeasible for higher dimensional problems. We therefore require techniques which provide better scaling with the number of samples and input parameters.

We introduce two new meta-model metrics for the purpose of discovering performance boundaries: one which uses a Gaussian Process Regression (GPR) meta-model and one which uses a k-nearest neighbor technique for density and variance estimation. As the Gaussian process scales with O(n^3) and the k-nearest neighbors algorithm scales with O(kn log n), we believe these can offer better scaling as the number of dimensions and the required number of samples increases. These meta-model evaluators are defined as M(X); they take existing samples as inputs and return the quality of a proposed query as an output.

For each query the GPR meta-model returns the mean value μ, the first-order gradient of the mean ∇μ, and the
estimated covariance σ. This covariance is proportional to the distance to the nearest sample; thus, the variance in this case is an appropriate reflection of how far away the query is from one of the training samples. The GPR meta-model evaluator uses the magnitude of the gradient and the uncertainty as follows: M_GPR(X) = (|∇μ(X)|)^g · (σ(X))^v, where g and v are tuning parameters to balance exploration of high uncertainty regions with high gradient regions.

The Nearest Neighbor Density and Variance (NNDV) meta-model estimates the size and gradient of a query using its nearest neighbors. For a given query state it returns the scores and distances of its k nearest neighbors. It then computes the variance σK of the scores and the mean distance dK of its neighbors. The NNDV evaluator is then computed as follows: M_NNDV(X) = (σK(X))^g · (dK(X))^v, where g and v are the same tuning parameters used in the GPR meta-model evaluator.

Algorithm 1 ADAPTIVE SEARCH(SUT, X^n, M, N)
Input: A function representing the system under test F, a scenario state space X^n, a meta-model evaluator M, and a desired number of samples N
Output: A set of labeled samples L
  Select a query batch size of L and an initial batch of randomly selected query states X0^L. In addition, choose a number of proposed queries, p, to perform per iteration.
  for all i ∈ [0, N/L] do
    F(Xi^L) = Yi^L
    concatenate(L, [Xi^L, Yi^L])
    Train M on the normalized sample set [X̄, Ȳ]
    Randomly select a new set of proposed queries X^p : p > L
    Xi+1^L = argmax_{X^L ⊂ X^p} M(X̄^L)
  end for
  return L

V. BOUNDARY IDENTIFICATION

A. Identifying Performance Modes

One of the issues of black box testing is that we cannot look inside the decision engine to determine which behavior the autonomous system is executing. Instead, we must use externally observable states and infer changes in behavior from changes in the performance of the system. Our current approach is to apply unsupervised clustering techniques to identify the performance modes of the system.

In cases where the autonomous system is scored using discrete values, e.g. binary criteria for mission success and safety success, it is trivial to identify distinct performance modes from the resulting scores. In order to apply our techniques to systems which provide continuous unlabeled outputs, we utilize Mean-Shift [19] clustering on the score space to identify the performance modes and classify the samples. Once the samples have been classified with respect to their performance mode, they are then subjected to DBSCAN clustering [20] in the input space to identify distinct regions of interest. We selected these two clustering algorithms as our classification method because they do not require a priori knowledge of the shapes of the boundaries between classes or of the number of classes.

B. Cluster Stitching

To obtain the boundaries from these clusters, we perform a pair-wise comparison between every pair of clusters with differing performance modes. We utilize a k-nearest neighbor detection algorithm to determine the closest neighbor in the adjacent cluster for each sample. Any samples that are within ε_D distance of their nearest neighbor in the opposite cluster are added to the final boundary set. The pair distances in the boundary set are then used to determine how close the samples are to the performance boundary. This approach is defined further in Algorithm 2.

Algorithm 2 BOUNDARY IDENTIFICATION(L)
Input: A set of N labeled samples L containing the input states X^N and output scores Y^N
Output: A set of identified performance modes, a collection of boundaries B, and a distance estimate vector D
  Let λP be the threshold distance for the flat kernel mean shift function, ε_C and nmin be the radius and minimum member parameters for the DBSCAN function, and ε_D be the maximum distance between two samples to be considered part of a boundary.
  P = MeanShift(Y^N, λP), identifying the performance modes
  for all Pl ∈ P do
    Create the set of all states belonging to that performance mode: XPl = {Xi | Yi ∈ Pl}
    Append the new cluster of states CY = [XPl, Y] to the list of existing clusters: C ← [CY]
  end for
  for all CY ∈ C do
    Create a set of subclusters for the regions of interest using the DBSCAN algorithm: ĈY = DBSCAN(X̄Pl, ε_C, nmin)
    Append the subclusters to the complete set of clusters: Ĉ ← [ĈY]
  end for
  for all ĈYi and ĈYj ∈ Ĉ | Yi ≠ Yj do
    Dij = knnsearch(X̄Pi, X̄Pj)
    Bij = [XPi, XPj, Yi, Yj] ∀ XPi, XPj | Dij < ε_D
  end for
  return B

VI. RESULTS

A. Test Systems

Several candidate systems were developed to evaluate the adaptive search and boundary identification algorithms. The first category of candidate systems consisted of mathematical test functions with performance boundaries that were known a priori. The second category consisted of a simple unmanned undersea vehicle (UUV) scenario.

1) Synthetic Test Functions: Three synthetic test functions were developed in order to evaluate the algorithms against a known mathematical surface. The intention in designing custom test functions was to mimic the wide variety of features and boundaries that may be present in an autonomous system's performance landscape. The three functions are as follows,
success. The scenario was labeled as a mission success if
the UUV achieved its goal of making it to the waypoint.
The scenario was labeled as a safety success if the UUV
returned to its recovery point with sufficient battery. Thus,
four performance modes exist in this framework, consisting
of the different combinations of the mission success and
safety success criteria.

B. Search Performance on Synthetic Functions


(a) Custom2D (b) Plates2D
We evaluated the performance of the search algorithms
Fig. 2: Top-down view of synthetic 2D functions with presented in Section IV based on their ability to identify
outlines of true boundaries: (a) Custom2d (b) Plates2d features in test functions and sample near the performance
boundaries. For comparison, we chose a Sobol sequence
design as our baseline space-filling approach. To compare
against the current state-of-the-art in adaptive sampling, the
LOLA-Voronoi sequential design method with a Blind Krig-
ing model was also included for comparison. The LOLA-
Voronoi code is accessible using the SUMO software toolbox
[21] in MATLAB. We compared these methods against
our adaptive search algorithms using both the GPR-based
and NNDV information functions. We use the following
metrics for each of the mathematical test functions: precision,
Fig. 3: Illustration of the autonomous UUV scenario de- coverage, convergence, and runtime. Precision is defined as
scribed in Section VI-A.2 . the percentage of samples from the entire set that within a
distance of 0.01 units of a performance boundary. Coverage
is the percentage of a performance boundary which has
• Custom 2D - Two input dimensions with one continuous unlabeled output. It contains peaks, valleys, plateaus, and cliffs as features of interest and is the function illustrated in Figure 1.
• Plates 2D - Two input dimensions with one discrete output. There are 5 score categories.
• Plates 3D - Three input dimensions with one discrete output. There are 5 score categories.

These low-dimensional test functions have the advantage that they are easy to visualize and have performance boundaries that are known a priori. The performance boundaries were defined as the local maxima of the first derivative of the test function.

2) UUV Scenario: In addition to the test functions, the search and identification algorithms were evaluated on an autonomous vehicle simulation. For the autonomous vehicle we chose a simulated UUV operating under a multi-objective navigation scenario. The objective of the mission was to travel from a starting point to a goal waypoint and then return to a separate recovery point (all of which are fixed) before running out of battery. The goal waypoint was placed on the opposite side of a 2 km operational area from the start and recovery points. Additionally, inside the operational area were three square obstacles, 400 m on a side, whose positions could vary in the East and North directions. The centers of the three obstacles thus form our six input state dimensions, X = [E1, N1, E2, N2, E3, N3]. This scenario is illustrated in Figure 3.

The score space for the UUV scenario consisted of two binary performance labels: mission success and safety.

a sample within a distance of 0.01 units. Convergence is the number of samples required to reach 90% coverage of all known boundaries. Runtime is simply the number of seconds required to collect the prescribed number of samples. The results of these tests are summarized in Table I.

The search methods introduced in this paper outperformed both the space-filling approaches and the popular LOLA-Voronoi adaptive search in all of our chosen metrics. This is particularly true in cases where the boundaries are sharply defined, as in our Plates2d test function. As shown in Figure 4, the GPR-based search concentrated nearly all of its samples in the regions near the boundaries, with minimal cases selected in the uninteresting regions of small gradient. More importantly, it also managed to obtain near-full coverage of the boundaries in under half the cases of the Latin

TABLE I: Comparison of Search Methods

                 LH       Sobol    LOLA-     GPR      NNDV
Test System      Design   Design   Voronoi   Search   Search
Custom2d (based on 1000 samples)
  Precision      6.52%    6.4%     9.53%     11.6%    19.2%
  Coverage       29.6%    31.76%   49.0%     48.43%   59.2%
  Convergence    1401     801      1101      701      701
  Runtime (sec)  0.177    0.791    27.2      2.96     0.645
Plates2d (based on 1000 samples)
  Precision      5.10%    6.4%     6.58%     11.6%    19.2%
  Coverage       30.9%    31.7%    39.4%     48.4%    59.2%
  Convergence    1151     1051     951       501      601
  Runtime (sec)  0.177    0.791    31.9      2.96     0.64
Plates3d (based on 3000 samples)
  Precision      3.33%    3.46%    4.22%     7.43%    12.17%
  Coverage       1.28%    1.31%    1.526%    2.64%    4.65%
  Convergence    36501    26201    N/A       12001    7501
  Runtime (sec)  0.081    0.233    246.0     32.7     2.12
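The evaluation metrics tabulated above lend themselves to a compact sketch. The following is a minimal illustration, not the authors' code: it assumes the true boundaries are available as a discrete set of points, and uses the 0.01-unit distance threshold and 90% coverage target described in the text.

```python
import math

def _min_dist(p, pts):
    """Distance from point p to the nearest point in pts."""
    return min(math.dist(p, q) for q in pts)

def precision(samples, boundary_pts, tol=0.01):
    """Fraction of collected samples lying within tol of a known boundary."""
    return sum(_min_dist(s, boundary_pts) <= tol for s in samples) / len(samples)

def coverage(samples, boundary_pts, tol=0.01):
    """Fraction of known boundary points with at least one sample within tol."""
    return sum(_min_dist(b, samples) <= tol for b in boundary_pts) / len(boundary_pts)

def convergence(sample_seq, boundary_pts, tol=0.01, target=0.90):
    """Number of samples, taken in collection order, needed to reach the
    target coverage of the known boundaries (None if never reached)."""
    for n in range(1, len(sample_seq) + 1):
        if coverage(sample_seq[:n], boundary_pts, tol) >= target:
            return n
    return None
```

Runtime, the remaining metric, is simply wall-clock time for the prescribed number of samples and needs no sketch.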
1448
Authorized licensed use limited to: Indian Institute of Technology Hyderabad. Downloaded on November 01,2024 at 10:57:09 UTC from IEEE Xplore. Restrictions apply.
[Fig. 4: Scatter plots of a GPR search vs. Latin hypercube sampling of the Custom2d (top) and Plates2d (bottom) test functions. Panels: (a) Latin-Hypercube, (b) GPR Search, (c) Latin-Hypercube, (d) GPR Search. Samples taken are in blue and the true locations of the boundaries are in red.]

[Fig. 5: Examples of two performance boundaries identified for the UUV simulation. Panels: (a) Boundary 1: Success, (b) Boundary 1: Safety Fail, (c) Boundary 2: Success, (d) Boundary 2: Mission Fail. Boundary 1 (a)&(b) illustrates a boundary between safety success and safety failure, while Boundary 2 illustrates a boundary between mission success and mission failure.]

hypercube method. The results are even more pronounced for the NNDV search algorithm, with the added benefit of shorter runtime as well. One thing that becomes immediately apparent in the performance comparison between the Plates2d and Plates3d functions is that the added dimension greatly increases the number of cases necessary to obtain coverage of the boundaries.

For the given number of samples, the LOLA-Voronoi search did not distinguish itself significantly from a space-filling design, likely because it was originally designed to minimize global model-fitting error. The techniques proposed in this paper have the different objective of finding boundary regions, resulting in sample sets that do not waste samples in low-gradient regions. Despite the superficial similarities in approach, the problem of identifying boundary regions in an unknown landscape is one for which traditional adaptive sampling techniques are not suited.

[Fig. 6: Histogram of the distance to the boundary for all samples collected from the UUV simulation.]

TABLE II: Comparison of Search on UUV Scenario

Test System         Latin-Hypercube   GPR Search
Precision           1.76%             4.13%
Convergence         12500             4800
Mean distance       0.2647            0.2146
Min distance        0.053             0.043
Runtime (minutes)   82.75             95.66

C. Evaluation of UUV Simulation

While the test functions are useful for evaluating the performance of the search and identification algorithms against a priori truth boundaries, they cannot serve as a substitute for the performance space of an autonomous system under test. This section applies the GPR-based search algorithm and the cluster-stitching boundary identification algorithm to the UUV simulation, where the actual performance boundaries are unknown. A second data set was generated using the Latin hypercube sampling method, with boundaries identified using the same boundary identification algorithm.

After 7500 runs were performed using both the GPR search and the Latin hypercube search, there were sufficient cases of all 4 potential performance modes to execute the boundary identification step. The Mean-Shift identification method was set with a boundary distance threshold of 0.1 units. The Precision metric for this case was also set to 0.05 units. As the true locations of the boundaries were not known, the coverage metric was not applied. The system was determined to have converged when each boundary had at least 10 cases within the boundary distance threshold.
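The Mean-Shift identification step mentioned above clusters the collected cases around modes in the score space. A minimal flat-kernel, one-dimensional variant is sketched below purely for illustration; the bandwidth, the merge tolerance, and the toy data are assumptions, and the actual implementation operates on multi-dimensional samples.

```python
def mean_shift_1d(points, bandwidth=0.2, iters=30):
    """Flat-kernel mean shift in one dimension: every point climbs toward the
    mean of the original points lying within `bandwidth` of its estimate."""
    modes = list(points)
    for _ in range(iters):
        updated = []
        for m in modes:
            neighbors = [p for p in points if abs(p - m) <= bandwidth] or [m]
            updated.append(sum(neighbors) / len(neighbors))
        modes = updated
    # Merge modes that coincide (within half a bandwidth) into cluster labels.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if abs(m - c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels, centers
```

With points [0.0, 0.05, 0.1, 1.0, 1.05] and the default bandwidth, the first three points collapse onto one mode and the last two onto another.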

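The Latin hypercube baseline used for the second data set can be sketched as follows. The six (low, high) bounds mirror the obstacle-center dimensions of the 2 km operational area, though the exact bounds are an assumption here; the sampler itself is a generic textbook construction, not the authors' tooling.

```python
import random

def latin_hypercube(n, bounds, seed=0):
    """Draw n points from a Latin hypercube over (low, high) bounds: each
    dimension is split into n equal strata, and each stratum is used once."""
    rng = random.Random(seed)
    dims = len(bounds)
    strata = [rng.sample(range(n), n) for _ in range(dims)]  # one permutation per dim
    pts = []
    for i in range(n):
        pt = []
        for d, (lo, hi) in enumerate(bounds):
            u = (strata[d][i] + rng.random()) / n  # jitter within the stratum
            pt.append(lo + u * (hi - lo))
        pts.append(tuple(pt))
    return pts

# e.g. 100 obstacle configurations over an assumed 0-2000 m range per axis:
configs = latin_hypercube(100, [(0.0, 2000.0)] * 6)
```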
In nearly every regard the GPR search outperformed the space-filling approach, as summarized in Table II. The histogram presented in Figure 6 also shows how the distribution of cases trends closer to the boundary. The GPR search sample set contained examples of all 6 possible boundaries within the boundary threshold of 0.1 units, while the Latin hypercube set contained examples from only 3 of the possible boundaries within that threshold. In total, the GPR search identified an average of 250 cases per boundary, while the Latin hypercube identified only 20 cases per boundary. Convergence tests demonstrated that the GPR search was capable of finding cases from all 6 boundaries within 3000 samples, while the Latin hypercube method required approximately 12000 samples. Figure 5 shows examples of two performance boundaries with a pair of test cases representing each side of the boundary. In the first case, a slight change in an obstacle causes the UUV to attempt to navigate to the left instead of the right, due to its heuristic for guessing the shorter path, and to fail to reach the recovery point.
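The per-boundary case counts discussed above can be approximated by a simple membership test: a case belongs to the boundary between two performance modes if it lies within the distance threshold of a case from the other mode. The sketch below is an illustrative simplification of the cluster-stitching identification, not the paper's algorithm; the mode labels and points in the usage example are made up.

```python
import math
from itertools import combinations

def boundary_cases(samples, labels, threshold=0.1):
    """For each pair of performance modes, collect the cases lying within
    `threshold` of a case from the other mode (a boundary-membership proxy)."""
    modes = sorted(set(labels))
    out = {}
    for a, b in combinations(modes, 2):  # 4 modes -> 6 candidate boundaries
        pts_a = [s for s, lab in zip(samples, labels) if lab == a]
        pts_b = [s for s, lab in zip(samples, labels) if lab == b]
        near = [p for p in pts_a if any(math.dist(p, q) <= threshold for q in pts_b)]
        near += [q for q in pts_b if any(math.dist(q, p) <= threshold for p in pts_a)]
        out[(a, b)] = near
    return out
```

A boundary would then be reported as found once it accumulates at least 10 such cases, matching the convergence criterion above.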
VII. CONCLUSIONS AND FUTURE WORK

In this work we introduced a process for intelligently discovering and identifying test cases for an autonomous system near its performance boundaries. These boundaries were defined as locations where a small change in the scenario configuration would cause a large change in the system's performance. By utilizing an adaptive sampling approach, we were able to reduce the number of samples necessary to find the features of interest. We also introduced a method for unsupervised clustering of the resulting samples in order to identify the queries which describe the performance boundary. By combining these techniques we were able to generate sets of test cases for a multi-objective UUV mission that exercised different underlying decision processes of the autonomous system.

This research is still ongoing and there are many avenues that have yet to be explored. In particular, we are interested in methods for scaling our system to handle more test cases and higher dimensionality. One approach is utilizing localized GPR techniques to speed up model generation, as well as investigating non-stationary covariance functions to handle varying feature resolution. Another avenue of particular interest is sensitivity analysis to determine the impact of certain states on the performance of the system. In addition, we are investigating more sophisticated anomaly detection methods as a possible augmentation to our current approach to boundary detection. Finally, we are developing more realistic test scenarios for an autonomous UUV that include parameters such as ocean current, additional types of objectives, and a larger variety of obstacles.
