Completely Derandomized Self-Adaptation in Evolution Strategies
Abstract
This paper puts forward two useful methods for self-adaptation of the mutation distribution – the concepts of derandomization and cumulation. Principal shortcomings of the concept of mutative strategy parameter control and two levels of derandomization are reviewed. Basic demands on the self-adaptation of arbitrary (normal) mutation distributions are developed. Applying arbitrary, normal mutation distributions is equivalent to applying a general, linear problem encoding.
The underlying objective of mutative strategy parameter control is roughly to favor previously selected mutation steps in the future. If this objective is pursued rigorously, a completely derandomized self-adaptation scheme results, which adapts arbitrary normal mutation distributions. This scheme, called covariance matrix adaptation (CMA), meets the previously stated demands. It can still be considerably improved by cumulation – utilizing an evolution path rather than single search steps.
Simulations on various test functions reveal local and global search properties of the evolution strategy with and without covariance matrix adaptation. Their performances are comparable only on perfectly scaled functions. On badly scaled, non-separable functions, a speed-up factor of several orders of magnitude is usually observed. On moderately mis-scaled functions, a speed-up factor of three to ten can be expected.
Keywords
Evolution strategy, self-adaptation, strategy parameter control, step size control, de-
randomization, derandomized self-adaptation, covariance matrix adaptation, evolu-
tion path, cumulation, cumulative path length control.
1 Introduction
The evolution strategy (ES) is a stochastic search algorithm that addresses the following search problem: Minimize a non-linear objective function that is a mapping from search space S ⊆ R^n to R. Search steps are taken by stochastic variation, so-called mutation, of (recombinations of) points found so far. The best out of a number of new search points are selected to continue. The mutation is usually carried out by adding a realization
of a normally distributed random vector. It is easy to imagine that the parameters of the normal distribution play an essential role for the performance¹ of the search

¹ Performance, as used in this paper, always refers to the (expected) number of required objective function evaluations to reach a certain function value.
© 2001 by the Massachusetts Institute of Technology. Evolutionary Computation 9(2): 159-195
N. Hansen and A. Ostermeier
Figure 1: One-σ lines of equal probability density of two normal distributions respectively. Left: one free parameter (circles). Middle: n free parameters (axis-parallel ellipsoids). Right: (n² + n)/2 free parameters (arbitrarily oriented ellipsoids).
algorithm. This paper is specifically concerned with the adjustment of the parameters of the normal mutation distribution.

Among others, the parameters that parameterize the mutation distribution are called strategy parameters, in contrast to object parameters that define points in search space. Usually, no particularly detailed knowledge about the suitable choice of strategy parameters is available. With respect to the mutation distribution, there is typically only a small range of strategy parameter settings where substantial search progress can be observed (Rechenberg, 1973). Good parameter settings differ remarkably from problem to problem. Even worse, they usually change during the search process (possibly by several orders of magnitude). For this reason, self-adaptation of the mutation distribution that dynamically adapts strategy parameters during the search process is an essential feature in ESs.
We briefly review three consecutive steps of adapting normal mutation distribu-
tions in ESs.
2. The concept of global step size can be generalized.² Each coordinate axis is assigned a different variance (Figure 1, middle) – often referred to as individual step sizes. There are n free strategy parameters. The disadvantage of this concept is the dependency on the coordinate system. Invariance against rotation of the search space is lost. Why invariance is an important feature of an ES is discussed in Section 6.
The adaptation of strategy parameters in ESs typically takes place within the concept of mutative strategy parameter control (MSC). Strategy parameters are mutated, and a new search point is generated by means of this mutated strategy parameter setting.

We exemplify the concept by formulating a (1, λ)-ES with (purely) mutative control of one global step size. x^(g) ∈ R^n and σ^(g) ∈ R⁺ are the object parameter vector and step size³ of the parent at generation g. The mutation step from generation g to g + 1 reads for each offspring k = 1, . . . , λ
σ_k^(g+1) = σ^(g) · exp(ξ_k)   (1)

x_k^(g+1) = x^(g) + σ_k^(g+1) · z_k ,   (2)

where ξ_k realizes the mutation on the step size and z_k is a realization of a (0, I)-normally distributed random vector.
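As an illustration, one generation of this (1, λ)-ES can be sketched in a few lines. This is a sketch, not the paper's code; in particular, the distribution of ξ_k (here normal with the common mutation strength τ = 1/√(2n)) is an assumption.

```python
import numpy as np

def mutative_generation(x, sigma, f, lam, rng):
    """One generation of a (1, lambda)-ES with purely mutative control of
    one global step size, following Equations (1) and (2).
    The choice xi_k ~ N(0, tau^2) with tau = 1/sqrt(2n) is an assumption."""
    n = len(x)
    tau = 1.0 / np.sqrt(2 * n)
    sigmas = sigma * np.exp(tau * rng.standard_normal(lam))    # Equation (1)
    zs = rng.standard_normal((lam, n))                         # z_k ~ N(0, I)
    xs = x + sigmas[:, None] * zs                              # Equation (2)
    best = min(range(lam), key=lambda k: f(xs[k]))             # comma selection
    return xs[best], sigmas[best]

rng = np.random.default_rng(0)
x, sigma = np.ones(5), 1.0
for _ in range(200):                   # minimize the sphere function sum x_i^2
    x, sigma = mutative_generation(x, sigma, lambda v: v @ v, 10, rng)
```

Note that the step size is selected only indirectly, via the fitness of the object parameter realization it produced – exactly the coupling criticized in the following.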
General demands on such an adaptation mechanism are developed, and the CMA-ES
is introduced. In Section 4, the objective of strategy parameter control is expanded
using search paths (evolution paths) rather than single search steps as adaptation cri-
terion. This concept is implemented by means of so-called cumulation. In Section 5,
we formulate the CMA-ES algorithm, an evolution strategy that adapts arbitrary, nor-
mal mutation distributions within a completely derandomized adaptation scheme with
cumulation. Section 6 discusses the test functions used in Section 7, where various sim-
ulation results are presented. In Section 8, a conclusion is given. Appendix A provides
a MATLAB implementation of the CMA-ES.
• Selection of the strategy parameter setting is indirect. The selection process operates on the object parameter adjustment. Comparing two different strategy parameter settings, the better one has (only) a higher probability to be selected – due to the object parameter realizations. Differences between these selection probabilities can be quite small, that is, the selection process on the strategy parameter level is highly disturbed. One idea of derandomization is to reduce or even eliminate this disturbance.
• The mutation on the strategy parameter level, as with any mutation, produces different individuals that undergo selection. Mutation strength on the strategy parameter level must ensure a significant selection difference between the individuals. In our view, this is the primary task of the mutation operator.
• The possible (and the realized) change rate of the strategy parameters between
two generations is an important factor. It gives an upper limit for the adaptation
speed. When adapting n or even n2 strategy parameters, where n is the problem
dimension, adaptation speed becomes an important factor for the performance of
the ES. If only one global step size is adapted, the performance is limited by the
possible adaptation speed only on a linear objective function (see discussion of
parameter d from (3) below).
On the one hand, it seems desirable to realize change rates that are as large as possible – achieving a fast change and consequently a short adaptation time. On the other hand, there is an upper bound to the realized change rates due to the finite amount of selection information. Greater changes cannot rely on valid information and lead to stochastic behavior. (This holds for any adaptation mechanism.) As a simple consequence, the change rate of a single strategy parameter must decrease with an increasing number of strategy parameters to be adapted, assuming a certain constant selection scheme.
In MSC, the change rate of strategy parameters obviously depends (more or less directly) on the mutation strength on the strategy parameter level. Based on this observation, considerable theoretical efforts were made to calculate the optimal mutation strength for the global step size (Schwefel, 1995; Beyer, 1996b). But, in general, the conflict between an optimal change rate versus a significant selection difference (see above) cannot be resolved by choosing an ambiguous compromise for the mutation strength on the strategy parameter level (Ostermeier et al., 1994a). The mutation strength that achieves an optimal change rate is usually smaller than the mutation strength that achieves a suitable selection difference. The discrepancy increases with increasing problem dimension and with increasing number of strategy parameters to be adapted. One idea of derandomization is to explicitly unlink the change rate from the mutation strength, resolving this discrepancy.
Parent number µ and the recombination procedure have a great influence on the possible change rate of the strategy parameters between (two) generations. Assuming a certain mutation strength on the strategy parameter level, the possible change rate can be tuned downwards by increasing µ. This is most obvious for intermediate multi-recombination: The mean change of µ recombined individuals is approximately √µ times smaller than the mean change of a single individual. Within the concept of MSC, choosing µ and an appropriate recombination mechanism is the only way to tune the change rate independently of the mutation strength (downwards). Therefore, it is not a surprising observation that a successful strategy parameter adaptation in the concept of MSC strongly depends on a suitable choice of µ: In our experience, µ has to scale linearly with the number of strategy parameters to be adapted. One objective of derandomization is to facilitate a reliable and fast adaptation of strategy parameters independent of the population size, even in small populations.
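The √µ reduction of the mean change under intermediate multi-recombination is easy to check numerically (a quick sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, trials = 10, 25, 20000

single = rng.standard_normal((trials, n))                    # single mutation steps
recomb = rng.standard_normal((trials, mu, n)).mean(axis=1)   # intermediate recombination of mu steps

# the recombined mean change is shorter by about a factor sqrt(mu) = 5
ratio = np.linalg.norm(single, axis=1).mean() / np.linalg.norm(recomb, axis=1).mean()
```

The mean of µ independent N(0, I) vectors is N(0, I/µ)-distributed, so its expected length is exactly √µ times smaller.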
be used in practice, because the selection relevance of ‖z_k‖ vanishes for increasing n.⁷ One can interpret Equation (5) as a special mutation on the strategy parameter level or as an adaptation procedure for σ (which must only be done for selected individuals). Now, all stochastic variation of the object parameters – namely that originated by the random vector realization z_k in Equation (6) – is used for strategy parameter adaptation in Equation (5). In other words, the actually realized mutation step on the object parameter level is used for strategy parameter adjustment. If ‖z_k‖ is smaller than expected, σ^(g) is decreased. If ‖z_k‖ is larger than expected, σ^(g) is increased. Selecting a small/large mutation step directly leads to reduction/enlargement of the step size.

The disadvantage of Equation (5) compared to Equation (1) is that it cannot be expanded easily for the adaptation of other strategy parameters. The expansion to the adaptation of all distribution parameters is the subject of this paper.
In general, the completely derandomized adaptation obeys the following principles:

1. The mutation distribution is changed so that the probability to produce the selected mutation step (again) is increased.

2. There is an explicit control of the change rate of strategy parameters (in our example, by means of d).

3. Under random selection, strategy parameters are stationary. In our example, the expectation of log σ^(g+1) equals log σ^(g).
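These three principles can be illustrated with a minimal derandomized step size scheme in the spirit of Equation (5). This is a sketch: the exact exponential form and the damping d = √n used below are assumptions, not the paper's exact setting.

```python
import math
import numpy as np

def chi_n(n):
    # approximation of E[||N(0, I)||]; see the parameter list in Section 5
    return math.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n**2))

def derandomized_sigma_generation(x, sigma, f, lam, d, rng):
    """One generation of derandomized global step size adaptation: the
    realized, selected mutation vector z_sel itself drives the sigma update
    (the exact exponential form here is an assumption)."""
    n = len(x)
    zs = rng.standard_normal((lam, n))
    xs = x + sigma * zs
    k = min(range(lam), key=lambda i: f(xs[i]))
    z_sel = zs[k]
    # principle 1: the selected step is favored; principle 2: d controls the
    # change rate; principle 3: E[log sigma] is (nearly) stationary under
    # random selection because E[||z||] = chi_n(n).
    sigma *= math.exp((np.linalg.norm(z_sel) / chi_n(n) - 1.0) / d)
    return xs[k], sigma

rng = np.random.default_rng(4)
x, sigma = np.ones(10), 1.0
for _ in range(500):
    x, sigma = derandomized_sigma_generation(
        x, sigma, lambda v: v @ v, 10, math.sqrt(10), rng)
```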
In the next section, we discuss the adaptation of all variances and covariances of the
mutation distribution.
• The suitable encoding of a given problem is crucial for an efficient search with an evolutionary algorithm.⁸ An adaptive encoding mechanism is desirable. Restricting such an adaptive encoding/decoding to linear transformations seems reasonable: Even in the linear case, O(n²) parameters have to be adapted. It is easy to show that in the case of additive mutation, linear transformation of the search space (and search points accordingly) is equivalent to linear transformation of the mutation distribution. Linear transformation of the (0, I)-normal mutation distribution always yields a normal distribution with zero mean, while any normal distribution with zero mean can be produced by a suitable linear transformation. This yields equivalence between an adaptive general linear encoding/decoding and the adaptation of all distribution parameters in the covariance matrix.
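The equivalence can be checked numerically: applying a linear encoding A to (0, I)-normal mutations produces exactly the zero-mean normal distribution with covariance matrix AAᵀ (a quick sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))        # a full rank linear encoding (with prob. 1)
z = rng.standard_normal((200000, n))   # (0, I)-normal mutation realizations

# mutating in the encoded space: each sample z is mapped to A z ...
samples = z @ A.T
emp_cov = np.cov(samples.T)

# ... which is equivalent to mutating with covariance matrix C = A A^T
C = A @ A.T
```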
demands a more general mutation distribution and suggests the adaptation of arbitrary normal mutation distributions.
Adaptation: The adaptation must be successful in the following sense: After an adaptation phase, progress rates comparable to those on the (hyper-)sphere objective function ∑_i x_i² must be realized on any convex-quadratic objective function.⁹ This must hold especially for large problem conditions (greater than 10⁴) and non-systematically oriented principal axes of the objective function.
Invariance: The invariance properties of the (1, λ)-ES with isotropic mutation distribution with respect to transformations of object parameter space and function value should be preserved. In particular, translation, rotation, and reflection of object parameter space (and initial point accordingly) should have no influence on the strategy behavior, nor should any strictly monotonically increasing, i.e., order-preserving, transformation of the objective function value.
From a conceptual point of view, one primary aim of strategy parameter adaptation is to become invariant (modulo initialization) against certain transformations of
the object parameter space. For any fitness function f : x ↦ f(c · A · x), the global step size adaptation can yield invariance against c ≠ 0. An adaptive general linear encoding/decoding can additionally yield invariance against the full rank n × n matrix A. This invariance is a good starting point for achieving the adaptation demand.

⁹ That is, any function f : x ↦ xᵀHx + bᵀx + c, where the Hessian matrix H ∈ R^{n×n} is symmetric and positive definite, b ∈ R^n and c ∈ R. Due to numerical requirements, the condition of H may be restricted to, e.g., 10¹⁴. We use the expectation of (n/k) · ln((f_best^(g) − f*)/(f_best^(g+k) − f*)) with k ∈ N as the progress measure (where f* is the fitness value at the optimum), because it yields comparable values independent of H, b, c, n, and k. For H = I and large n, this measure yields values close to the common normalized progress measure φ* := (n/r^(g)) · (r^(g) − E[r^(g+1)]) (Beyer, 1995), where r is the distance to the optimum.

¹⁰ Not taking into account weighted recombination, the theoretically optimal (1 + 1)-ES yields approximately 0.20.
Also, from a practical point of view, invariance is an important feature for assessing a search algorithm (compare Section 6). It can be interpreted as a feature of robustness. Therefore, achieving even an additional invariance seems very attractive.
Keeping in mind the stated demands, we discuss two approaches to the adaptation
of arbitrary normal mutation distributions in the next two sections.
3.1 A (Purely) Mutative Approach: The Rotation Angle Adaptation
Schwefel (1981) proposed a mutative approach to the adaptation of arbitrary normal
mutation distributions with zero mean, often referred to as correlated mutations. The
distribution is parameterized by n standard deviations σ_i and (n² − n)/2 rotation angles α_j that undergo a selection-recombination-mutation scheme.
The algorithm of a (µ/ρ_I, λ)-CORR-ES as used in Section 7.2 is briefly described. At generation g = 1, 2, . . ., for each offspring, ρ parents are selected. The (component-wise) arithmetic means of step sizes, angles, and object parameter vectors of the ρ parents, denoted by σ_i^rec (i = 1, . . . , n), α_j^rec (j = 1, . . . , (n² − n)/2), and x^rec, are the starting points for the mutation. Step sizes and angles read component-wise

σ_i^(g+1) = σ_i^rec · exp( N(0, 1/(2n)) + N_i(0, 1/(2√n)) )   (7)

α_j^(g+1) = (( α_j^rec + N_j(0, (5π/180)²) + π ) mod 2π) − π   (8)
The random number N(0, 1/(2n)) in Equation (7), which denotes a normal distribution with zero mean and standard deviation √(1/(2n)), is only drawn once for all i = 1, . . . , n. The modulo operation ensures the angles to be in the interval −π ≤ α_j^(g+1) < π, which is, in our experience, of minor relevance. The chosen distribution variances reflect the typical parameter setting (see Bäck and Schwefel (1993)). The object parameter mutation reads
tion reads
(g+1)
³ ´ σ1
(g+1) (g+1) ..
x(g+1) = xrec + R α1 , . . . , α(n2 −n)/2 · . · N (0, I) (9)
(g+1)
σn
The spherical N(0, I) distribution is multiplied by a diagonal step size matrix determined by σ_1^(g+1), . . . , σ_n^(g+1). The resulting axis-parallel mutation ellipsoid is newly oriented by successive rotations in all (i.e., (n² − n)/2) two-dimensional subspaces spanned by canonical unit vectors. This complete rotation matrix is denoted by R(.). This algorithm allows one to generate any n-dimensional normal distribution with zero mean (Rudolph, 1992).
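The rotation matrix R(·) can be built as a product of rotations in the canonical two-dimensional subspaces. The sketch below assumes the subspace ordering (1,2), (1,3), . . . , (n−1,n); the exact ordering and sign convention of Schwefel's scheme may differ.

```python
import numpy as np

def rotation_matrix(n, angles):
    """Product of plane rotations in all (n^2 - n)/2 canonical
    two-dimensional subspaces; the ordering used here is an assumption."""
    R = np.eye(n)
    k = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            a = angles[k]; k += 1
            G = np.eye(n)                    # rotation in the (i, j) plane
            G[i, i] = G[j, j] = np.cos(a)
            G[i, j] = -np.sin(a)
            G[j, i] = np.sin(a)
            R = R @ G
    return R

rng = np.random.default_rng(3)
n = 3
sigmas = np.array([1.0, 0.1, 0.01])
angles = rng.uniform(-np.pi, np.pi, (n * n - n) // 2)
# Equation (9): rotate the axis-parallel mutation ellipsoid
x_mut = rotation_matrix(n, angles) @ np.diag(sigmas) @ rng.standard_normal(n)
```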
It is generally recognized that the typical values µ = 15 and λ = 100 are not sufficient for this adaptation mechanism (Bäck et al., 1997). Due to the mutative approach, the parent number presumably has to scale with n². Choosing µ ≈ n²/2, which is roughly the number of free strategy parameters, performance on the sphere problem declines with increasing problem dimension and becomes unacceptably poor for n ⪆ 10. The performance demand cannot be met. This problem is intrinsic to the (purely) mutative approach.
The parameterization by rotation angles causes the following effects:

• The effective mutation strength strongly depends on the position in strategy parameter space. Mutation ellipsoids with high axis ratios are mutated much more strongly than those with low axis ratios: Figure 2 shows two distributions with different axis ratios before and after rotation by the typical variance. The resulting cross section area, a simple measure of their similarity, differs in a wide range. For axis ratio 1 : 1 (spheres, not shown) it is 100%. For axis ratio 1 : 3 (Figure 2 left) it is about 90%, while it decreases to roughly 5% for axis ratio 1 : 300 (Figure 2 right). The section area tends to become zero for increasing axis ratios. The mutation strength (on the rotation angles) cannot be adjusted independently of the actual position in strategy parameter space.
• In principle, invariance against a new orientation of the search space is lost, because rotation planes are chosen with respect to the given coordinate system. In simulations, we found the algorithm to be dependent even on permutations of the coordinate axes (Hansen et al., 1995; Hansen, 2000)! Furthermore, its performance depends highly on the orientation of the given coordinate system (compare Section 7.2). The invariance demand is not met.
Taking into account these difficulties, it is not surprising to observe that the adaptation
demand is not met. Progress rates on convex-quadratic functions with high axis ratios
can be several orders of magnitude lower than progress rates achieved on the sphere
problem (Holzheuer, 1996; Hansen, 2000, and Section 7.2).
When the typical intermediate recombination is applied to the step sizes, they increase unboundedly under random selection. The systematic drift is slow, usually causes no problems, and can even be an advantage in some situations. Therefore, we assume the stationarity demand to be met.
Using another parameterization together with a suitable mutation operator can
solve the demands for adaptation and invariance without giving up the concept of
MSC (Ostermeier and Hansen, 1999). To satisfy the performance demand, the concept of MSC has to be modified.
aim of MSC to raise the probability of producing successful mutation steps: The covariance matrix of the mutation distribution is changed in order to increase the probability of producing the selected mutation step again. Second, the rate of change is adjusted according to the number of strategy parameters to be adapted (by means of c_cov in Equation (15)). Third, under random selection, the expectation of the covariance matrix C is stationary. Furthermore, the adaptation mechanism is inherently independent of the given coordinate system. In short, the CMA implements a principal component analysis of the previously selected mutation steps to determine the new mutation distribution. We give an illustrative description of the algorithm.
Consider a special method to produce realizations of an n-dimensional normal distribution with zero mean. If the vectors z_1, . . . , z_m ∈ R^n, m ≥ n, span R^n, and N_i(0, 1) denote independent (0, 1)-normally distributed random numbers, then

N_1(0, 1) · z_1 + . . . + N_m(0, 1) · z_m   (10)

is a normally distributed random vector with zero mean.¹¹
This mutation distribution tends to reproduce the mutation steps selected in the past generations. In the end, it leads to an alignment of the distributions before and after selection, i.e., an alignment of the recent mutation distribution with the distribution of the selected mutation steps. If both distributions become alike, as under random selection, in expectation, no further change of the distributions takes place (Hansen, 1998).
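Numerically, a random vector built from spanning vectors z_1, . . . , z_m as in this construction has covariance matrix z_1z_1ᵀ + . . . + z_mz_mᵀ (footnote 11). A quick check, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, N = 3, 5, 200000
zs = rng.standard_normal((m, n))       # z_1, ..., z_m spanning R^n (with prob. 1)

# realizations of N_1(0,1) z_1 + ... + N_m(0,1) z_m
y = rng.standard_normal((N, m)) @ zs

emp_cov = np.cov(y.T)
C = zs.T @ zs                          # z_1 z_1^T + ... + z_m z_m^T
```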
This illustrative but also formally precise description of the CMA differs in three points from the CMA-ES formalized in Section 5. These extensions are as follows:

• Apart from the adaptation of the distribution shape, the overall variance of the mutation distribution is adapted separately. We found the additional adaptation of the overall variance useful for at least two reasons. First, changes of the overall variance and of the distribution shape should operate on different time scales. Due to the number of parameters to be adapted, the adaptation of the covariance matrix, i.e., the adaptation of the distribution shape, must operate on a time scale of n². Adaptation of the overall variance should operate on a time scale of n, because the variance should be able to change as fast as required on simple objective functions
¹¹ The covariance matrix of this normally distributed random vector reads z_1 z_1ᵀ + . . . + z_m z_mᵀ.

¹² q adjusts the change rate, and q² corresponds to 1 − c_cov in Equation (15).
[Figure 3 (graphic): panels for g = 0, 1, 2, 3 showing the vectors e_1, e_2 and their scaled versions q e_1, q e_2, q² e_1, q² e_2, q³ e_1, q³ e_2, the selected steps z_sel^(1), z_sel^(2), z_sel^(3) (scaled by powers of q), and the resulting principal axes d_11 b_1 and d_22 b_2.]
like the sphere ∑_i x_i². Second, if overall variance is not adapted faster than the distribution shape, an (initially) small overall variance can jeopardize the search process. The strategy “sees” a linear environment, and adaptation (erroneously) enlarges the variance in one direction.
Figure 4: Two idealized evolution paths (solid) in a ridge topography (dotted). The
distributions constructed by the single steps (dashed, reduced in size) are identical.
where 0 < c ≤ 1 and p^(0) = 0. This procedure, introduced in Ostermeier et al. (1994b), is called cumulation. √(c(2 − c)) is a normalization factor: Assuming z_sel^(g+1) and p^(g) in Equation (12) to be independent and identically distributed yields p^(g+1) ∼ p^(g) independently of c ∈ ]0, 1]. Variances of p^(g) and p^(g+1) are identical because 1² = (1 − c)² + √(c(2 − c))². If c = 1, no cumulation takes place, and p^(g+1) = z_sel^(g+1). The life span of information accumulated in p^(g) is roughly 1/c (Hansen, 1998): After about 1/c generations, the original information in p^(g) is reduced by the factor 1/e ≈ 0.37. That means c⁻¹ can roughly be interpreted as the number of summed steps.
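The normalization can be checked directly: iterating Equation (12) with i.i.d. (0, 1)-normal inputs (modeling random selection) leaves the variance of p at one (a sketch):

```python
import numpy as np

c = 0.25                          # life span of information: about 1/c = 4 generations
norm = np.sqrt(c * (2 - c))       # normalization factor from Equation (12)

rng = np.random.default_rng(5)
paths = np.zeros(50000)           # many independent one-dimensional evolution paths
for g in range(100):
    z = rng.standard_normal(50000)        # i.i.d. inputs model random selection
    paths = (1 - c) * paths + norm * z    # Equation (12)
# the variance stays at one because (1 - c)^2 + c (2 - c) = 1
```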
The benefit from cumulation is shown with an idealized example. Consider the
two different sequences of four consecutive generation steps in Figure 4. Compare any
evolution path, that is, any sum of consecutive steps in the left and in the right of the
figure. They differ significantly with respect to direction and length. Notice that the
difference is only due to the sign of vectors two and four.
Construction of a distribution from the single mutation steps (single vectors in the figure) according to Equation (10) leads to exactly the same result in both cases (dashed). Signs of the constructing vectors do not matter, because the vectors are multiplied by a (0, 1)-normally distributed random number that is symmetrical about zero.
The situation changes when cumulation is applied. We focus on the left of Figure 4. Consider a continuous sequence of the shown vectors – i.e., an alternating sequence of the two vectors v and w. Then z_sel^(2i−1) = v and z_sel^(2i) = w for i = 1, 2, . . ., and according to Equation (12), the evolution path p^(g) alternately reads

p_v^(g) := √(c(2 − c)) · ( v + (1 − c) w + (1 − c)² v + (1 − c)³ w + . . . + (1 − c)^{g−1} z_sel^(1) )

p_v^(g) → p_v = (1/√(c(2 − c))) · (v + (1 − c) w)   and

p_w^(g) → p_w = (1/√(c(2 − c))) · (w + (1 − c) v)

After 3/c generations, the deviation from these limits is under 5%.
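The limits p_v and p_w can be reproduced numerically (a sketch; c = 1/3 is an arbitrary choice):

```python
import numpy as np

c = 1.0 / 3.0
norm = np.sqrt(c * (2 - c))
v = np.array([1.0, 1.0])          # the two alternating selected steps
w = np.array([1.0, -1.0])

p = np.zeros(2)
for g in range(60):               # z_sel alternates: v, w, v, w, ...
    p = (1 - c) * p + norm * (v if g % 2 == 0 else w)

# the last added step was w, so p is near p_w = (w + (1 - c) v) / sqrt(c (2 - c))
p_w = (w + (1 - c) * v) / norm
```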
To visualize the effect of cumulation, these vectors and the related distributions are
shown in Figure 5 for different values of c. With increasing life span c−1 , the vectors pv
and pw become more and more similar. The distribution scales down in the horizontal
direction, increasing in the vertical direction. Already for a life span of approximately
three generations, the effect appears to be very substantial. Notice again that this effect
x_k^(g+1) ∈ R^n, object parameter vector of the kth individual in generation g + 1.

⟨x⟩_w^(g) := (1/∑_{i=1}^µ w_i) · ∑_{i=1}^µ w_i x_{i:λ}^(g), w_i ∈ R⁺, weighted mean of the µ best individuals of generation g. The index i : λ denotes the ith best individual.

σ^(g) ∈ R⁺, step size in generation g. σ^(0) is the initial component-wise standard deviation of the mutation step.

z_k^(g+1) ∈ R^n, for k = 1, . . . , λ and g = 0, 1, 2, . . . independent realizations of a (0, I)-normally distributed random vector. Components of z_k^(g+1) are independent (0, 1)-normally distributed.

C^(g): symmetrical positive definite n × n matrix. C^(g) is the covariance matrix of the normally distributed random vector B^(g) D^(g) N(0, I). C^(g) determines B^(g) and D^(g): C^(g) = B^(g) D^(g) (B^(g) D^(g))ᵀ = B^(g) (D^(g))² (B^(g))ᵀ, which is a singular value decomposition of C^(g). Initialization C^(0) = I.
¹³ The algorithm described here is identical to the algorithms in Hansen and Ostermeier (1996) setting µ = 1, c_c = c_σ, and d_σ = 1/(β χ̂_n); in Hansen and Ostermeier (1997) setting w_1 = . . . = w_µ, c_c = c_σ, and d_σ = 1/D; and in Hansen (1998) setting w_1 = . . . = w_µ.

¹⁴ This notation follows Beyer (1995), simplifying the (µ/µ_I, λ)-notation for intermediate multi-recombination to (µ_I, λ), avoiding the misinterpretation of µ and µ_I being different numbers.
D^(g): n × n diagonal matrix (step size matrix). d_ij = 0 for i ≠ j, and the diagonal elements d_ii^(g) of D^(g) are the square roots of the eigenvalues of the covariance matrix C^(g).

B^(g): orthogonal n × n matrix (rotation matrix) that determines the coordinate system where the scaling with D^(g) takes place. Columns of B^(g) are (defined as) the normalized eigenvectors of the covariance matrix C^(g). The ith diagonal element of D^(g) squared, (d_ii^(g))², is the eigenvalue corresponding to the ith column b_i^(g) of B^(g). That is, C^(g) b_i^(g) = (d_ii^(g))² b_i^(g) for i = 1, . . . , n. B^(g) is orthogonal, i.e., B⁻¹ = Bᵀ.
The surfaces of equal probability density of D^(g) z_k^(g+1) are axis-parallel (hyper-)ellipsoids. B^(g) reorients these distribution ellipsoids to become coordinate system independent. The covariance matrix C^(g) determines B^(g) and D^(g) apart from signs of the columns in B^(g) and permutations of columns in both matrices accordingly. Referring to Equation (13), notice that σ^(g) B^(g) D^(g) N(0, I) is (0, (σ^(g))² C^(g))-normally distributed.
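The decomposition of C into B and D is exactly an eigendecomposition; a sketch using numpy:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
M = rng.standard_normal((n, n))
C = M @ M.T + np.eye(n)            # some symmetric positive definite covariance matrix

eigvals, B = np.linalg.eigh(C)     # columns of B: normalized eigenvectors of C
D = np.diag(np.sqrt(eigvals))      # d_ii: square roots of the eigenvalues

# C = B D (B D)^T = B D^2 B^T, and B is orthogonal (B^-1 = B^T)
sigma = 0.5
x_mut = sigma * B @ D @ rng.standard_normal(n)   # realizes N(0, sigma^2 C)
```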
C^(g) is adapted by means of the evolution path p_c^(g+1). For the construction of p_c^(g+1), the “weighted mean selected mutation step” B^(g) D^(g) ⟨z⟩_w^(g+1) is used. Notice that the step size σ^(g) is disregarded. The transition of p_c^(g) and C^(g) reads

p_c^(g+1) = (1 − c_c) · p_c^(g) + c_c^u · c_w · B^(g) D^(g) ⟨z⟩_w^(g+1)   (14)

C^(g+1) = (1 − c_cov) · C^(g) + c_cov · p_c^(g+1) (p_c^(g+1))ᵀ ,   (15)

where c_w · B^(g) D^(g) ⟨z⟩_w^(g+1) = (c_w/σ^(g)) · (⟨x⟩_w^(g+1) − ⟨x⟩_w^(g)).

p_c^(g+1) ∈ R^n, sum of weighted differences of points ⟨x⟩_w. Initialization p_c^(0) = 0. Note that p_c^(g+1) (p_c^(g+1))ᵀ is a symmetrical n × n matrix with rank one.
c_c ∈ ]0, 1] determines the cumulation time for p_c, which is roughly 1/c_c.

c_c^u := √(c_c (2 − c_c)) normalizes the variance of p_c, because 1² = (1 − c_c)² + (c_c^u)² (see Section 4).

c_w := (∑_{i=1}^µ w_i) / √(∑_{i=1}^µ w_i²) is chosen such that under random selection, c_w ⟨z⟩_w^(g+1) and z_k^(g+1) have the same variance (and are identically distributed).

⟨z⟩_w^(g+1) := (1/∑_{i=1}^µ w_i) · ∑_{i=1}^µ w_i z_{i:λ}^(g+1), with z_k^(g+1) from Equation (13). The index i : λ denotes the index of the ith best individual from x_1^(g+1), . . . , x_λ^(g+1). The weights w_i are identical with those for ⟨x⟩_w^(g+1).

c_cov ∈ [0, 1[, change rate of the covariance matrix C. For c_cov = 0, no change takes place.
Equation (15) realizes the distribution change as exemplified in Section 3.2 and Figure
3.
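Equations (14) and (15) amount to cumulating the selected steps and mixing the resulting rank-one matrix into C. Feeding steps that always point along one axis makes C elongate along that axis (a sketch; the input `step` stands for the term c_w B^(g) D^(g) ⟨z⟩_w^(g+1)):

```python
import numpy as np

def rank_one_update(C, p_c, step, c_c, c_cov):
    """Equations (14) and (15): cumulate the selected step in the
    evolution path p_c, then mix the rank-one matrix p_c p_c^T into C."""
    p_c = (1 - c_c) * p_c + np.sqrt(c_c * (2 - c_c)) * step   # Equation (14)
    C = (1 - c_cov) * C + c_cov * np.outer(p_c, p_c)          # Equation (15)
    return C, p_c

rng = np.random.default_rng(7)
n = 5
C, p_c = np.eye(n), np.zeros(n)
e1 = np.zeros(n); e1[0] = 1.0
for _ in range(500):
    # selected steps always point along e1, with variance n
    step = np.sqrt(n) * rng.standard_normal() * e1
    C, p_c = rank_one_update(C, p_c, step,
                             c_c=4 / (n + 4), c_cov=2 / (n + np.sqrt(2))**2)
# C elongates along e1 while the other axes shrink toward zero
```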
An additional adaptation of the global step size σ^(g) is necessary, taking place on a considerably shorter time scale. For the cumulative path length control in the CMA-ES, a “conjugate” evolution path p_σ^(g+1) is calculated, where the scaling with D^(g) is omitted.

p_σ^(g+1) = (1 − c_σ) · p_σ^(g) + c_σ^u · c_w · B^(g) ⟨z⟩_w^(g+1)   (16)

σ^(g+1) = σ^(g) · exp( (1/d_σ) · (‖p_σ^(g+1)‖ − χ̂_n) / χ̂_n ) ,   (17)

where c_w · B^(g) ⟨z⟩_w^(g+1) = (c_w/σ^(g)) · B^(g) (D^(g))⁻¹ (B^(g))ᵀ (⟨x⟩_w^(g+1) − ⟨x⟩_w^(g)).
p_σ^(g+1) ∈ R^n, evolution path not scaled by D^(g). Initialization p_σ^(0) = 0.

c_σ ∈ ]0, 1[ determines the cumulation time for p_σ, which is roughly 1/c_σ.

c_σ^u := √(c_σ (2 − c_σ)) fulfills (1 − c_σ)² + (c_σ^u)² = 1 (see Section 4).

d_σ ≥ 1, damping parameter, determines the possible change rate of σ^(g) in the generation sequence (compare Section 2, in particular Equations (3) and (5)).

χ̂_n = E[‖N(0, I)‖] = √2 · Γ((n + 1)/2) / Γ(n/2), expectation of the length of a (0, I)-normally distributed random vector. A good approximation is χ̂_n ≈ √n · (1 − 1/(4n) + 1/(21n²)) (Ostermeier, 1997).
Apart from omitting the transformation with D^(g), Equations (16) and (14) are identical: In (16) we use B^(g) ⟨z⟩_w^(g+1) for the cumulation instead of B^(g) D^(g) ⟨z⟩_w^(g+1). Under random selection, B^(g) ⟨z⟩_w^(g+1) is (0, I)-normally distributed, independently of C^(g). Thereby step lengths in different directions are comparable, and the expected length of p_σ, denoted χ̂_n, is well known.¹⁵ The cumulation in Equation (14) often speeds up the adaptation of the covariance matrix C but is not an essential feature (c_c can be set to 1). For the path length control, the cumulation is essential (c_σ⁻¹ must not be considerably smaller than √(n/2)). With increasing n, the lengths of single steps become more and more similar and therefore more and more selection irrelevant.

While in Equation (13) the product matrix B^(g) D^(g) only has to satisfy the equation C^(g) = B^(g) D^(g) (B^(g) D^(g))ᵀ, cumulative path length control in Equation (16) requires its factorization into an orthogonal and a diagonal matrix.
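In the special case C^(g) = I (so B = D = I and, for µ = 1, c_w = 1), Equations (16) and (17) reduce to a plain cumulative path length control. A compact (1, λ)-ES sketch of this special case:

```python
import math
import numpy as np

def chi_n(n):
    # approximation of E[||N(0, I)||] from the parameter list above
    return math.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n**2))

def csa_es(f, x, sigma, lam, generations, rng):
    """(1, lambda)-ES with cumulative path length control, Equations (16)
    and (17), in the special case C = I (B = D = I, c_w = 1)."""
    n = len(x)
    c_sigma = 4 / (n + 4)              # default setting from Table 1
    d_sigma = 1 / c_sigma + 1
    p_sigma = np.zeros(n)
    for _ in range(generations):
        zs = rng.standard_normal((lam, n))
        xs = x + sigma * zs
        k = min(range(lam), key=lambda i: f(xs[i]))
        x = xs[k]
        p_sigma = ((1 - c_sigma) * p_sigma
                   + math.sqrt(c_sigma * (2 - c_sigma)) * zs[k])   # Equation (16)
        sigma *= math.exp((np.linalg.norm(p_sigma) - chi_n(n))
                          / (d_sigma * chi_n(n)))                  # Equation (17)
    return x, sigma

rng = np.random.default_rng(8)
x, sigma = csa_es(lambda v: v @ v, np.ones(10), 1.0, 10, 400, rng)
```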
Table 1: Default parameter setting:

λ = 4 + ⌊3 ln(n)⌋    µ = ⌊λ/2⌋    w_{i=1,...,µ} = ln((λ + 1)/2) − ln(i)
c_c = 4/(n + 4)    c_cov = 2/(n + √2)²    c_σ = 4/(n + 4)    d_σ = c_σ⁻¹ + 1

n              2      3      4      6      8      11     15     21     40     208
4 + 3 ln(n)    6.08   7.30   8.16   9.38   10.24  11.19  12.12  13.13  15.07  20.01
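For reference, the default setting of Table 1 can be computed by a small helper (a sketch; the formulas are read off the table):

```python
import math

def cma_default_parameters(n):
    """Default parameter setting of Table 1."""
    lam = 4 + int(3 * math.log(n))                 # lambda = 4 + floor(3 ln n)
    mu = lam // 2                                  # mu = floor(lambda / 2)
    w = [math.log((lam + 1) / 2) - math.log(i) for i in range(1, mu + 1)]
    c_c = 4 / (n + 4)
    c_cov = 2 / (n + math.sqrt(2)) ** 2
    c_sigma = 4 / (n + 4)
    d_sigma = 1 / c_sigma + 1
    return lam, mu, w, c_c, c_cov, c_sigma, d_sigma
```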
where λ(n − 1) < λ(n). In the (µ_I, λ)-CMA-ES, where w_1 = . . . = w_µ,¹⁷ we choose µ = ⌊λ/4⌋ as the default. To make the strategy more robust or more explorative in case of multimodality, λ can be enlarged, choosing µ accordingly. In particular, for unimodal and non-noisy problems, λ = 9 is often most appropriate even for large n.
Partly for historical reasons, in this paper, we use the (µI , λ)-CMA-ES. Based on
our experience, for a given λ, an optimal choice of µ and w1 , . . . , wµ only achieves
speed-up factors less than two compared to the (µI , λ)-CMA-ES, where µ ≈ 0.27λ. The
optimal recombination weights depend on the function to be optimized, and it remains
an open question whether the (µI , λ) or the (µW , λ) scheme performs better overall
(using the default parameters accordingly).
If overall simulation time does not substantially exceed, say, 2n generations, d_σ should be chosen smaller than in Table 1, e.g., d_σ = 0.5 · c_σ⁻¹ + 1. Increasing c_cov⁻¹, d_σ, or µ and λ by a factor α > 1 makes the strategy more robust. If this increases the number of needed function evaluations, as generally to be expected, it will typically be by a factor less than α. The explorative behavior can be improved by increasing µ and λ, or µ up to λ/2 in the (µ_I, λ)-CMA-ES, or d_σ in accordance with a sufficiently large initial step size.
The parameter settings are discussed in more detail in the following.
λ: Table 2 and the considerations above give a reasonable choice for λ. Large values
linearly worsen progress rates, e.g., on simple problems like the sphere function.
Also on more sophisticated problems, performance can decrease linearly with
increasing λ, because the adaptation time (in generations) is more or less
independent of λ. Values considerably smaller than ten may reduce the
robustness of the strategy.
cσ: (Cumulation for step size) Investigations based on Hansen (1998) give strong
evidence that cσ < 1 can and must be chosen with respect to the problem
dimension,

        √(n/2) ≲ cσ⁻¹ ≲ n,

while for large n, most sensible values are between n/10 and n/2. Large values,
i.e., cσ⁻¹ ≈ n, slow down the possible change rate of the step size because
dσ = cσ⁻¹ + 1, but can still be a good choice for certain problem instances.
dσ: (Damping for step size) According to basic considerations, the damping
parameter must be chosen as dσ ≈ α cσ⁻¹, where α is near one, and dσ ≥ 1.
Therefore we define dσ = α cσ⁻¹ + 1. Choosing α smaller than one, e.g., 0.3, can
yield (slightly) larger progress rates on the sphere problem. Depending on n, cσ,
µ, and λ, a factor of up to three can be gained. This is recommended if overall
simulation time is considerably shorter than 3nλ function evaluations.
Consequently one may choose α = 1 − min(0.7, n/gmax). If α is chosen too small,
oscillating behavior of the step size can occur, and strategy performance may
decline drastically. Larger values for α linearly slow down the possible change
rate of the step size and (consequently) make the strategy more robust.
cc : (Cumulation for distribution) Based on empirical observations, we suspect cσ ≤
cc ≤ 1 to be a sensible choice for cc . Especially when long axes have to be learned,
cc ≈ cσ should be most efficient without compromising the learning of short axes.
ccov: (Change rate of the covariance matrix) With decreasing change rate ccov, the
reliability of the adaptation increases (e.g., with respect to noise), as does the
adaptation time. For ccov → 0, the CMA-ES becomes an ES with only global step
size adaptation. For strongly disturbed problems, it can be appropriate to choose
smaller change rates like ccov = α/(n + √2)² with α ≲ 1 instead of α = 2.
step size σ should be enlarged (compare Appendix A). Problem-specific knowledge
must be used for setting the bound. Numerically, a lower limit is given by the demand
⟨x⟩w(g) ≠ ⟨x⟩w(g) + 0.2 σ B(g)D(g)ei for each unit vector ei. With respect to the selection
procedure, even f(⟨x⟩w(g)) ≠ f(⟨x⟩w(g) + 0.2 σ B(g)D(g)ei) seems desirable.
To avoid numerical errors, a maximal condition number for C(g), e.g., 10^14, should
be ensured. If the ratio between the largest and smallest eigenvalue of C(g) exceeds
10^14, the operation C(g) := C(g) + (maxi(dii(g))²/10^14 − mini(dii(g))²) I limits the
condition number in a reasonable way to 10^14 + 1.
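This limiting operation can be sketched as follows in Python/NumPy (our naming; the eigenvalues of C play the role of the (dii)² above):

```python
import numpy as np

def limit_condition(C, max_cond=1e14):
    """If cond(C) exceeds max_cond, add a multiple of the identity:
    C := C + (lambda_max / max_cond - lambda_min) * I.
    This bounds the condition number of the result by max_cond + 1,
    since every eigenvalue is shifted by the same amount."""
    eigenvalues = np.linalg.eigvalsh(C)       # sorted ascending
    lo, hi = eigenvalues[0], eigenvalues[-1]
    if hi / lo > max_cond:
        C = C + (hi / max_cond - lo) * np.eye(C.shape[0])
    return C

# Example: a covariance matrix with condition number 1e16
C = limit_condition(np.diag([1e-16, 1.0]))
e = np.linalg.eigvalsh(C)
assert e[-1] / e[0] <= 1e14 + 1
```

A well-conditioned matrix passes through unchanged, so the operation can be applied in every generation.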
The numerical stability of computing eigenvectors and eigenvalues of the symmetric
matrix C usually poses no problems. In MATLAB, the built-in function eig can be
used. To avoid complex solutions, the symmetry of C must be explicitly enforced
(compare Appendix A). In C/C++, we used the functions tred2 and tqli from Press
et al. (1992), substituting float with double and setting the maximal iteration
number in tqli to 30n. Another implementation was done in Turbo-PASCAL. For
condition numbers up to 10^14, we never observed numerical problems in any of these
programming languages.
If the user decides to manually change object parameters during the search pro-
cess – which is not recommended – the adaptation procedure (Equations (14) to (17))
must be omitted for these steps. Otherwise, the strategy parameter adaptation can be
severely disturbed.
Because the change of C(g) is comparatively slow (time scale n²), it is possible to
update B(g) and D(g) not after every generation but only every √n or even, e.g., every
n/10 generations. This reduces the computational effort of the strategy from O(n³) to
O(n^2.5) or O(n²), respectively. The latter corresponds to the computational effort for
a fixed linear encoding/decoding of the problem as well as for producing a realization
of an (arbitrarily) normally distributed random vector on the computer. In practical
applications, it is often most appropriate to update B(g) and D(g) every generation.
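The lazy-update schedule can be sketched as follows (an illustration with our naming, not the authors' code; the actual recomputation of B and D is the eigendecomposition of C):

```python
import math

def should_update(generation, n, every_sqrt_n=True):
    """Decide whether to recompute B and D from C in this generation.
    Updating only every sqrt(n)-th generation reduces the amortized
    cost of the O(n^3) eigendecomposition to O(n^2.5)."""
    period = max(1, round(math.sqrt(n))) if every_sqrt_n else 1
    return generation % period == 0

# With n = 100 the decomposition runs every 10th generation:
updates = [g for g in range(25) if should_update(g, 100)]
# updates == [0, 10, 20]
```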
A simple method to handle constraints repeats the generation step, defined in
Equation (13), until λ or at least µ feasible points are generated, before Equations (14)
to (17) are applied. If the initial mutation distribution generates sufficiently many
feasible solutions, this method can remain sufficient during the search process due
to the symmetry of the mutation distribution. If the minimum searched for is located
at the edge of the feasible region, the performance will usually be poor and a more
sophisticated method should be used.
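The resampling method can be sketched as follows, where sample_offspring stands for the generation step of Equation (13) and is_feasible for the problem's constraints (both hypothetical names; in this sketch each offspring is redrawn individually):

```python
import random

def generate_feasible(sample_offspring, is_feasible, lam, max_tries=10000):
    """Redraw offspring until lam feasible points are collected."""
    offspring = []
    for _ in range(max_tries):
        x = sample_offspring()
        if is_feasible(x):
            offspring.append(x)
            if len(offspring) == lam:
                return offspring
    raise RuntimeError("too few feasible offspring; enlarge max_tries "
                       "or reconsider the initial distribution")

# Toy example: standard normal sampling, feasible region x >= 0
rng = random.Random(1)
population = generate_feasible(lambda: rng.gauss(0.0, 1.0),
                               lambda x: x >= 0.0, lam=10)
assert len(population) == 10 and all(x >= 0.0 for x in population)
```

As the text notes, this only works well while the mutation distribution produces feasible points with substantial probability.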
6 Test Functions
According to Whitley et al. (1996), we prefer test functions that are non-linear, non-
separable, scalable with dimension n, and resistant to simple hill-climbing. In addition,
we find it reasonable to use test functions with a comprehensible topology even if n ≥ 3.
This makes the interpretation of observed results possible and can lead to remarkable
scientific conclusions. Table 3 gives the test functions used in this paper. The test suite
mainly strives to test (local) convergence speed. Only functions 10–12 are multimodal.18
Apart from functions 6–8, the global optimum is located at 0.
To yield non-separability, we set y := [o1, …, on]ᵀ x, where x is the object parameter
vector according to Equation (13). [o1, …, on]ᵀ implements an angle-preserving linear
transformation of x, i.e., rotation and reflection of the search space. o1, …, on ∈ Rⁿ
18 For higher dimension, even function 8, fRosen, has a local minimum near y = (−1, 1, …, 1)ᵀ. With the
given initialization of ⟨x⟩w(0) and σ(0), the ES usually converges into the global minimum at y = 1.
Table 3: Test functions, initial step size σ(0), initial object parameter vector ⟨x⟩w(0), and
stop criterion f_stop. For functions 10–12, each ui is realized uniformly in [−5.12, 5.12].

                                                               σ(0)   ⟨x⟩w(0)    f_stop
 1. fsphere(x)   = Σi=1..n xi²                                  1      Σi oi      10⁻¹⁰
 2. fSchwefel(y) = Σi=1..n (Σj=1..i yj)²                        1      Σi oi      10⁻¹⁰
 3. fcigar(y)    = y1² + Σi=2..n (1000 yi)²                     1      Σi oi      10⁻¹⁰
 4. ftablet(y)   = (1000 y1)² + Σi=2..n yi²                     1      Σi oi      10⁻¹⁰
 5. felli(y)     = Σi=1..n (1000^((i−1)/(n−1)) yi)²             1      Σi oi      10⁻¹⁰
 6. fparabR(y)   = −y1 + 100 Σi=2..n yi²                        1      0          −1000
 7. fsharpR(y)   = −y1 + 100 √(Σi=2..n yi²)                     1      0          −1000
 8. fRosen(y)    = Σi=1..n−1 (100 (yi² − yi+1)² + (yi − 1)²)    0.1    0          10⁻¹⁰
 9. fdiffpow(y)  = Σi=1..n |yi|^(2+10(i−1)/(n−1))               0.1    Σi oi      10⁻¹⁵
10. fRast(y)     = 10n + Σi=1..n ((ai yi)² − 10 cos(2π ai yi)),
                   with ai = 1                                  100    Σi ui oi
11. fRast10(y)   = fRast(y) with ai = 10^((i−1)/(n−1))          100    Σi ui oi
12. fRast1000(y) = fRast(y) with ai = 1000^((i−1)/(n−1))        100    Σi ui oi
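For illustration, two table entries transcribed to Python/NumPy (our transcription; y is the transformed parameter vector):

```python
import numpy as np

def f_elli(y):
    """Function 5: ellipsoid with axis scaling 1000**((i-1)/(n-1))."""
    n = len(y)
    scales = 1000.0 ** (np.arange(n) / (n - 1))
    return float(np.sum((scales * y) ** 2))

def f_rosen(y):
    """Function 8: generalized Rosenbrock function."""
    return float(np.sum(100.0 * (y[:-1] ** 2 - y[1:]) ** 2
                        + (y[:-1] - 1.0) ** 2))

assert f_rosen(np.ones(20)) == 0.0   # global minimum at y = (1, ..., 1)
assert f_elli(np.zeros(20)) == 0.0   # minimum at y = 0
```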
is a randomly oriented, orthonormal basis, fixed for each simulation run. Each oi is
realized uniformly distributed on the unit (hyper-)sphere surface, drawn dependently
so that ⟨oi, oj⟩ = 0 if i ≠ j. An algorithm to generate the basis is given in Figure 6. For
the initial ⟨x⟩w(0) = Σi=1..n oi, we obtain y(0) = [o1, …, on]ᵀ ⟨x⟩w(0) = (1, …, 1)ᵀ. For the
canonical basis, where oi = ei, it holds that y = x and, obviously, Σi oi = (1, …, 1)ᵀ. For
a non-canonical basis, only fsphere remains separable.
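One standard way to generate such a basis is Gram–Schmidt orthonormalization of independently drawn (0, I)-normally distributed vectors, which yields a uniformly random orientation. A sketch under this assumption (the paper's own algorithm is the one in Figure 6):

```python
import numpy as np

def random_orthonormal_basis(n, seed=None):
    """Orthonormalize standard-normal vectors to o_1, ..., o_n with
    <o_i, o_j> = 0 for i != j and |o_i| = 1."""
    rng = np.random.default_rng(seed)
    basis = []
    while len(basis) < n:
        v = rng.standard_normal(n)
        for o in basis:               # subtract components along earlier o_j
            v -= (v @ o) * o
        norm = np.linalg.norm(v)
        if norm > 1e-12:              # redraw in the unlikely degenerate case
            basis.append(v / norm)
    return np.array(basis)            # rows are o_1, ..., o_n

O = random_orthonormal_basis(5, seed=0)
assert np.allclose(O @ O.T, np.eye(5))   # rows are orthonormal
```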
Functions 1–5 are convex-quadratic and can be linearly transformed into the sphere
problem.19 They serve to test the adaptation demand stated in Section 3. The axis
scaling between longest and shortest principal axis of functions 3–5 is 1000, i.e., the
problem condition is 10⁶. Taking into account real-world problems, a search strategy
should be able to handle axis ratios up to this number. ftablet can be interpreted as a
sphere model with a smooth equality constraint in the o1 direction.
fparabR and fsharpR exhibit straight ridge topologies. The ridge points in the o1
direction, in which y1 has to be increased without bound. The sharp ridge function
fsharpR is topologically invariant with respect to the distance from the ridge peak. In
particular, the local gradient is constant, independent of the distance, which is a hard
feature for a local search strategy. Results can be influenced by numerical precision
problems (even dependent on the implementation of the algorithm). Therefore, we
establish for this problem a minimal step size of 10⁻¹⁰ and a maximal condition
number for C(g) of 10^14 (compare Section 5).
19 They can be written in the form f(x) = fsphere(Ax) = (Ax)ᵀ(Ax) = xᵀAᵀAx = xᵀHx, where A is
a full-rank n × n matrix, and the Hessian matrix H = AᵀA is symmetric and positive definite.
[Figure 6: pseudocode of the algorithm generating the random orthonormal basis o1, …, on.]
7 Simulation Results
Four different evolution strategies are experimentally investigated.
• (µI, λ)-CMA-ES, where the default parameter setting from Table 1 is used apart
from λ and µ as given below, and wi = 1 for all i = 1, …, µ. To reduce the
computational effort, the update of B(g) and D(g) from C(g) for Equations (13),
(14), and (16) is done every √n generations in all simulations.
20 For example, invariance against rotation is lost when discrete recombination is applied.
Figure 7: Strategy-internal CPU-times per function evaluation (i.e., per generated
offspring) of different (2I, 10)-ESs (CORR, CMA, MUT, and PATH) in seconds vs.
dimension n = 5, 10, 20, 40, 80, 160, 320. The strategies were coded in MATLAB
(compare text) and run on a Pentium 400 MHz processor.
Figure 8: log10(function value) vs. generation number with the (15/2I, 100)-CORR-ES,
where n = 5, on fcigar (top row), ftablet (middle row), and felli (bottom row), each
axis-parallel (left column) and non-axis-parallel (right column). The best, middle, and
worst out of 17 runs are shown. Bold symbols at the lower edge indicate the mean
generation number to reach f_stop = 10⁻¹⁰. In the left column, oi = ei, that is, the
functions are axis-parallel oriented and completely separable. The range of the
abscissa between left and right columns is enlarged by a factor of 20 for felli (lower
row) and 10 otherwise. For comparison are shown a single run on fsphere (∗) and
respective simulations with the (15I, 100)-CMA-ES (◦) that do not reflect the real
performance of the CMA-ES. The default (4W, 8)-CMA-ES performs roughly ten
times faster. The CORR-ES does not meet the adaptation demand. Progress rates on
the non-separable functions are worse by a factor between 100 and 6000 compared to
those on fsphere.
Figure 9: Simulation runs with the (2I, 10)-CMA-ES, where n = 10, on fcigar (upper
left), ftablet (upper right), and felli (lower left and lower right). ◦: log10(function
value); □: log10(smallest and largest variance of the mutation distribution); ∗:
log10(function value) from a run on fsphere. The upper ten graphs (without symbol)
are the variances of the principal axes of the mutation distribution ellipsoid, sorted,
and multiplied by 10⁵/(σ(g))² on a logarithmic scale. Lower right: simulation run on
felli until function value 10⁻⁵ is reached and afterwards on const · fsphere. The
CMA-ES meets the adaptation demand. After an adaptation phase, in all cases, the
progress rates are identical with those on fsphere.
Figure 9 shows runs on fcigar (upper left) and ftablet (upper right), where n = 10,
compared to a run on fsphere. The adaptation process can be clearly observed through
the variances of the principal axes of the mutation distribution, where the global step
size is disregarded (upper ten graphs). When the variances correspond to the scaling
in the objective function, progress rates identical to those on fsphere are realized
(notice again that the basis o1, …, on is chosen randomly here). The shorter
adaptation time on fcigar is due to the cumulation, which detects the parallel
correlations of single steps very efficiently.
In the lower left of Figure 9, a simulation run on felli is shown. Similar to ftablet ,
but more pronounced, local adaptation occurs repeatedly on this function. Progress
increases and decreases a few times together with the step size. When the adaptation
is completed, the variances are evenly spread in a range of 106 . In the lower right of
Figure 9, the objective function is switched from felli to (some multiple of) fsphere
after reaching function value 10⁻⁵. The mutation distribution adapts from an ellipsoid-like
shape back to the isotropic distribution. Adaptation time and graphs of function value
and step size are very similar to those on felli . In fact, from a theoretical point of view,
the algorithm must show exactly the same behavior in both adaptation cases (apart
from stochastic influences). Again, after the adaptation phase, progress rates identical
to those on fsphere are achieved in both cases.
Concluding these results, the adaptation demand is satisfied by the CMA-ES: Any
convex-quadratic function is rescaled into the sphere function.
Figure 10: Function evaluations to reach f_stop with the (2I, 10)-CMA-ES (solid
graphs) and the (2I, 10)-PATH-ES (dashed graphs) vs. dimension n on the test
functions 1–9 (1 sphere, 2 schwefel, 3 cigar, 4 tablet, 5 ellipsoid, 6 parabolic ridge,
7 sharp ridge, 8 rosenbrock, 9 different powers), where n = 5; 20; 80; 320. Shown are
mean values and standard deviations. Sloping dotted lines indicate
const·n = [10; 100; 1000; …]·n and const·n² = [1/100; 1; 100; …]·n² function
evaluations. For the PATH-ES, the missing result on fcigar (n = 320) could not be
obtained in reasonable time (upper right), and results on fsharpR and fdiffpow are
far above the shown area (lower right, compare text). On fsphere (upper left), both
strategies perform almost identically.
First, the CMA-ES shows a scaling slightly better than the PATH-ES on fSchwefel.
Second, and more surprisingly, the performance of the PATH-ES on ftablet is almost
independent of n. This effect is due to the cumulative path length control, which here
adapts better (i.e., larger) step sizes with higher dimension.
In general, we expected the CMA-ES to scale quadratically with n, since the number
of free strategy parameters to be adapted is (n² + n)/2. fRosen and fdiffpow meet this
expectation quite well. On fsphere, no adaptation is necessary, and the CMA-ES scales
linearly with n, as is to be expected of any ES.
In contrast, the perfectly linear scaling on fcigar and fparabR, even though desirable,
came as a surprise to us. On both functions one long axis has to be adapted. We found
the cumulation to be responsible for this excellent scaling. Without cumulation, the
scaling on fcigar is similar to that on ftablet. As mentioned above, the cumulation
especially supports the adaptation of long axes. On ftablet, fsharpR, felli, and
fSchwefel, in increasing order, the scaling is between n^1.6 and n^1.8. Where the
CMA-ES scales worse than the PATH-ES, the performances align as n → ∞, because
the progress surpasses the adaptation procedure as n → ∞.
In summary, the CMA-ES substantially outperforms the PATH-ES in dimensions
up to 320 on all tested functions, apart from fsphere, where both strategies perform
almost identically. The CMA-ES always scales between n and n²: exactly linearly for
the adaptation of long axes and if no adaptation takes place, and nearly quadratically
if a continuously changing topology demands persistent adaptation.
Figure 11: 30 simulation runs on the generalized Rastrigin function fRast with the
(2I, 10)-CMA-ES (left) and the (2I, 10)-MUT-ES (right), where n = 20. Both strategies
perform very similarly.
Figure 12: 30 simulation runs on the scaled Rastrigin function fRast10, maximal axis
ratio 1:10, with the (2I, 10)-CMA-ES (left) and the (2I, 10)-MUT-ES (right), where
n = 20. Function values reached with the MUT-ES are worse by a factor of 20.
Figure 13: 30 simulation runs on the scaled Rastrigin function fRast1000, maximal axis
ratio 1:1000, with the (2I, 10)-CMA-ES (left) and the (2I, 10)-MUT-ES (right), where
n = 20. Function values reached with the MUT-ES are worse by a factor of 10 000.
an increasing step size can improve the global search performance of a local search
procedure. The effect of the distribution adaptation on the step size can be clearly
observed in Figure 9. Variances in all directions continually increase with the ongoing
adaptation to the topology on fcigar between generations 200 and 330 (shown are the
smallest and the largest variance) and several times on felli.
Concluding this section, we observe that the local adaptation in the CMA-ES comes
along with an increasing step size, which is more likely to improve than to spoil the
global search performance of a local search algorithm.
8 Conclusion
For the application to real-world search problems, the evolution strategy (ES) is a
relevant search strategy if neither derivatives of the objective function are at hand,
nor differentiability and numerical accuracy can be assumed. If the search problem is
expected to be significantly non-separable, an ES with covariance matrix adaptation
(CMA-ES), as put forward in this paper, is preferable to any ES with only global or
individual step size adaptation.
We believe that there are fundamental limitations to the possibilities of
self-adaptation in evolution strategies: To reliably adapt a significant change of the
mutation distribution shape, at least roughly 10n function evaluations have to be
done (where n is the problem dimension). For a real-world search problem, it seems
unrealistic to expect the adaptation to improve strategy behavior before, say, 30n
function evaluations (apart from global step size adaptation). A complete adaptation
can take even 100n² function evaluations (compare Section 7.3). Therefore, to always
get the most out of adaptation, CPU resources should allow roughly 100 to 200 times
(n + 3)² function evaluations.
One main reason for the general robustness of ESs is that the selection scheme is
solely based on a ranking of the population. The CMA-ES preserves this
selection-related robustness because no additional selection information is used.
Furthermore, the CMA-ES preserves all invariance properties against transformations
of the search space and of the objective function value that any simple (1, λ)-ES with
isotropic mutation distribution exhibits. Apart from the initialization of object and
strategy parameters, the CMA-ES yields an additional invariance against any linear
transformation of the search space, in contrast to any evolution strategy known to us
(besides Hansen et al. (1995) and Hansen and Ostermeier (1999)). Invariance
properties are of major importance for the evaluation of any search strategy (compare
Section 6).
The step from an ES with isotropic mutation distribution to an ES facilitating the
adaptation of arbitrary normal mutation distributions, if successfully taken, can be
compared with the step from a simple deterministic gradient strategy to a
quasi-Newton method. The former follows the local gradient, which, in a certain
sense, an ES with isotropic mutation distribution does as well. The latter
approximates the inverse Hessian matrix in an iterative sequence without acquiring
additional information on the search space. This is exactly what the CMA does for
evolution strategies.
In simulations, the CMA-ES reliably approximates the inverse Hessian matrix of
different objective functions. In addition, there are reported successful applications of
the CMA-ES to real-world search problems (Alvers, 1998; Holste, 1998; Meyer, 1998;
Lutz and Wagner, 1998a; Lutz and Wagner, 1998b; Olhofer et al., 2000; Bergener et al.,
2001; Cerveri et al., 2001; Igel and von Seelen, 2001; Igel et al., 2001). Consequently,
comparable to quasi-Newton methods, we expect this algorithm, or at least some
quite similar method, to become, based on its superior performance, state-of-the-art
for the application of ESs to real-world search problems.
Acknowledgments
This work was supported by the Deutsche Forschungsgemeinschaft under grant
Re 215/12-1 and the Bundesministerium für Bildung und Forschung under grant
01 IB 404 A. We gratefully thank Iván Santibáñez-Koref for many helpful discussions
and persistent support of our work. In addition, we thank Christian Igel and all re-
sponding users of the CMA-ES who gave us helpful suggestions from many different
points of view.
A CMA-ES in MATLAB
% CMA-ES for non-linear function minimization
% See also [Link]
function xmin=cmaes
  % Set dimension, fitness fct, stop criteria, start values...
  N = 10; strfitnessfct = 'cigar';
  maxeval = 300*(N+2)^2; stopfitness = 1e-10;  % stop criteria
  xmeanw = ones(N, 1);  % object parameter start point (weighted mean)
  sigma = 1.0; minsigma = 1e-15;  % step size, minimal step size
  % Generation loop
  counteval = 0; arfitness(1) = 2*abs(stopfitness)+1;
  while arfitness(1) > stopfitness & counteval < maxeval
    % ... (loop body with sampling, selection, and the adaptation steps of
    % Equations (14) to (17) omitted in this excerpt) ...
    disp([num2str(counteval) ': ' num2str(arfitness(1))]);
  end
  xmin = arx(:, arindex(1));  % return best point of last generation

function f=cigar(x)
  f = x(1)^2 + 1e6*sum(x(2:end).^2);
References
Alvers, M. (1998). Zur Anwendung von Optimierungsstrategien auf Potentialfeldmodelle. Berliner
geowissenschaftliche Abhandlungen, Reihe B: Geophysik. Selbstverlag Fachbereich
Geowissenschaften, Freie Universität Berlin, Germany.
Bäck, T. and Schwefel, H.-P. (1993). An overview of evolutionary algorithms for parameter
optimization. Evolutionary Computation, 1(1):1–23.
Bäck, T., Hammel, U., and Schwefel, H.-P. (1997). Evolutionary computation: Comments on the
history and current state. IEEE Transactions on Evolutionary Computation, 1(1):3–17.
Bergener, T., Bruckhoff, C., and Igel, C. (2001). Parameter optimization for visual obstacle
detection using a derandomized evolution strategy. In Blanc-Talon, J. and Popesc, D., editors,
Imaging and Vision Systems: Theory, Assessment and Applications, Advances in Computation:
Theory and Practice. NOVA Science Books, Huntington, New York.
Beyer, H.-G. (1995). Toward a theory of evolution strategies: On the benefit of sex - the
(µ/µ, λ)-theory. Evolutionary Computation, 3(1):81–110.
Beyer, H.-G. (1996b). Toward a theory of evolution strategies: Self-adaptation. Evolutionary
Computation, 3(3):311–347.
Beyer, H.-G. (1998). Mutate large, but inherit small! In Eiben, A. et al., editors, Proceedings of
PPSN V, Parallel Problem Solving from Nature, pages 109–118, Springer, Berlin, Germany.
Beyer, H.-G. and Deb, K. (2000). On the desired behaviors of self-adaptive evolutionary
algorithms. In Schoenauer, M. et al., editors, Proceedings of PPSN VI, Parallel Problem Solving
from Nature, pages 59–68, Springer, Berlin, Germany.
Cerveri, P., Pedotti, A., and Borghese, N. (2001). Enhanced evolution strategies: A novel
approach to stereo-camera calibration. IEEE Transactions on Evolutionary Computation, in press.
Ghozeil, A. and Fogel, D. B. (1996). A preliminary investigation into directed mutations in
evolutionary algorithms. In Voigt, H.-M. et al., editors, Proceedings of PPSN IV, Parallel Problem
Solving from Nature, pages 329–335, Springer, Berlin, Germany.
Hansen, N. and Ostermeier, A. (1996). Adapting arbitrary normal mutation distributions in
evolution strategies: The covariance matrix adaptation. In Proceedings of the 1996 IEEE
International Conference on Evolutionary Computation, pages 312–317, IEEE Press, Piscataway,
New Jersey.
Hansen, N. and Ostermeier, A. (1997). Convergence properties of evolution strategies with the
derandomized covariance matrix adaptation: The (µ/µI, λ)-CMA-ES. In Zimmermann, H.-J.,
editor, Proceedings of EUFIT'97, Fifth European Congress on Intelligent Techniques and Soft
Computing, pages 650–654, Verlag Mainz, Aachen, Germany.
Hansen, N., Ostermeier, A., and Gawelczyk, A. (1995). On the adaptation of arbitrary normal
mutation distributions in evolution strategies: The generating set adaptation. In Eshelman, L.,
editor, Proceedings of the Sixth International Conference on Genetic Algorithms, pages 57–64,
Morgan Kaufmann, San Francisco, California.
Hildebrand, L., Reusch, B., and Fathi, M. (1999). Directed mutation - a new self-adaptation for
evolution strategies. In Angeline, P., editor, Proceedings of the 1999 Congress on Evolutionary
Computation CEC99, pages 1550–1557, IEEE Press, Piscataway, New Jersey.
Igel, C. and von Seelen, W. (2001). Design of a field model for early vision: A case study of
evolutionary algorithms in neuroscience. In 28th Goettingen Neurobiology Conference. In press.
Igel, C., Erlhagen, W., and Jancke, D. (2001). Optimization of neural fields models.
Neurocomputing, 36(1–4):225–233.
Lutz, T. and Wagner, S. (1998a). Drag reduction and shape optimization of airship bodies.
Journal of Aircraft, 35(3):345–351.
Lutz, T. and Wagner, S. (1998b). Numerical shape optimization of natural laminar flow bodies.
In Proceedings of 21st ICAS Congress, International Council of the Aeronautical Sciences and
the American Institute of Aeronautics, Paper No. ICAS-98-2,9,4.
Olhofer, M., Arima, T., Sonoda, T., and Sendhoff, B. (2000). Optimisation of a stator blade used
in a transonic compressor cascade with evolution strategies. In Parmee, I., editor, Adaptive
Computing in Design and Manufacture (ACDM), pages 45–54. Springer Verlag, Berlin,
Germany.
Ostermeier, A. (1992). An evolution strategy with momentum adaptation of the random number
distribution. In Männer, R. and Manderick, B., editors, Parallel Problem Solving from Nature, 2,
pages 197–206, North Holland, Amsterdam, The Netherlands.
Ostermeier, A. and Hansen, N. (1999). An evolution strategy with coordinate system invariant
adaptation of arbitrary normal mutation distributions within the concept of mutative strategy
parameter control. In Banzhaf, W. et al., editors, Proceedings of the Genetic and Evolutionary
Computation Conference, GECCO-99, pages 902–909, Morgan Kaufmann, San Francisco,
California.
Ostermeier, A., Gawelczyk, A., and Hansen, N. (1994a). A derandomized approach to
self-adaptation of evolution strategies. Evolutionary Computation, 2(4):369–380.
Ostermeier, A., Gawelczyk, A., and Hansen, N. (1994b). Step-size adaptation based on non-local
use of selection information. In Davidor, Y. et al., editors, Proceedings of PPSN III, Parallel
Problem Solving from Nature, pages 189–198, Springer, Berlin, Germany.
Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. (1992). Numerical Recipes in C: The Art
of Scientific Computing. Second Edition. Cambridge University Press, Cambridge, England.
Rechenberg, I. (1973). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der
biologischen Evolution. Frommann-Holzboog, Stuttgart, Germany.
Schwefel, H.-P. (1995). Evolution and Optimum Seeking. Sixth-Generation Computer Technology
Series. John Wiley and Sons, New York, New York.
Whitley, D., Mathias, K., Rana, S., and Dzubera, J. (1996). Evaluating evolutionary algorithms.
Artificial Intelligence, 85:245–276.