Metaheuristic Optimization: Algorithm Analysis and Open Problems
Xin-She Yang
Mathematics and Scientific Computing, National Physical Laboratory,
Teddington, Middlesex TW11 0LW, UK.
1 Introduction
Optimization is an important subject with many significant applications, and algorithms for optimization are diverse, with a wide range of successful applications [10, 11]. Among
these optimization algorithms, modern metaheuristics are becoming increasingly popular,
leading to a new branch of optimization, called metaheuristic optimization. Most meta-
heuristic algorithms are nature-inspired [8, 29, 32], from simulated annealing [20] to ant
colony optimization [8], and from particle swarm optimization [17] to cuckoo search [35].
Since the appearance of swarm intelligence algorithms such as PSO in the 1990s, more than
a dozen new metaheuristic algorithms have been developed and these algorithms have been
applied to almost all areas of optimization, design, scheduling and planning, data mining,
machine intelligence, and many others. Thousands of research papers and dozens of books
have been published [8, 9, 11, 19, 29, 32, 33].
Despite the rapid development of metaheuristics, their mathematical analysis remains partly unsolved, and many open problems need urgent attention. This difficulty is largely due to the fact that the interactions of the various components in metaheuristic algorithms are highly nonlinear, complex, and stochastic. Studies have attempted to carry out convergence analysis [1, 22], and some important results concerning PSO have been obtained [7]. For other metaheuristics such as firefly algorithms and ant colony optimization, however, convergence analysis remains an active, challenging topic. On the other hand, even if we have not proved or cannot prove convergence, we can still compare the performance of various algorithms. Such comparisons have indeed formed the majority of current research in algorithm development in the optimization and machine intelligence communities [9, 29, 33].
In combinatorial optimization, many important developments exist on complexity analysis, run-time and convergence analysis [25, 22]. For continuous optimization, no-free-lunch theorems do not hold [1, 2]. As a relatively young field, randomized search heuristics still present many open problems [1]. In practice, most researchers assume that metaheuristic algorithms tend to be less complex to implement, and in many cases problem sizes are not directly linked with algorithm complexity. However, while metaheuristics can often solve very tough NP-hard optimization problems, our understanding of their efficiency and convergence lags far behind.
Apart from the complex interactions among multiple search agents (making the math-
ematical analysis intractable), another important issue is the various randomization tech-
niques used for modern metaheuristics, from simple randomization such as uniform distri-
bution to random walks, and to more elaborate Lévy flights [5, 24, 33]. There is no unified
approach to analyze these techniques mathematically. In this paper, we intend to review the convergence of two metaheuristic algorithms, namely simulated annealing and PSO, followed by a new convergence analysis of the firefly algorithm. Then, we try to formulate a framework for algorithm analysis in terms of Markov chain Monte Carlo. We also try to analyze the mathematical and statistical foundations of randomization techniques, from simple random walks to Lévy flights. Finally, we will discuss some of the important open questions as further research topics.
2 Convergence Analysis of Metaheuristics

Simulated annealing (SA) is one of the most widely used metaheuristics, and is also one of the most studied in terms of convergence analysis [4, 20]. The essence of simulated annealing
is a trajectory-based random walk of a single agent, starting from an initial guess x0 . The
next move only depends on the current state or location and the acceptance probability p.
This is essentially a Markov chain whose transition probability from the current state to
the next state is given by
$$p = \exp\Big[-\frac{\Delta E}{k_B T}\Big], \qquad (1)$$
where kB is Boltzmann’s constant, and T is the temperature. Here the energy change
∆E can be linked with the change of objective values. A few studies on the convergence
of simulated annealing have paved the way for analysis for all simulated annealing-based
algorithms [4, 15, 27]. Bertsimas and Tsitsiklis provided an excellent review of the conver-
gence of SA under various assumptions [4, 15]. Under the assumption that SA forms an inhomogeneous Markov chain with finite states, they proved probabilistic convergence, rather than almost sure convergence, of the form
$$\max_i \; P\big[x_i(t) \in S_* \,\big|\, x_0\big] \ge \frac{A}{t^{\alpha}}, \qquad (2)$$
where S∗ is the optimal set, and A and α are positive constants [4]. This is for the cooling
schedule T (t) = d/ ln(t), where t is the iteration counter or pseudo time. These studies
largely used Markov chains as the main tool. We will come back later to a more general
framework of Markov chain Monte Carlo (MCMC) in this paper [12, 14].
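As a concrete illustration, below is a minimal simulated annealing sketch in Python; the test objective, step size, and the cooling constant d are illustrative placeholders, and the Boltzmann constant kB is absorbed into T.

import math, random

def simulated_annealing(f, x0, d=1.0, steps=10000, step_size=0.1):
    # Single-agent, trajectory-based random walk with the Metropolis
    # acceptance probability p = exp(-dE/T) of Eq. (1) and the
    # logarithmic cooling schedule T(t) = d/ln(t).
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for t in range(2, steps + 2):              # start at t = 2 so ln(t) > 0
        T = d / math.log(t)                    # cooling schedule T(t) = d/ln t
        y = x + random.gauss(0.0, step_size)   # trial move (random walk)
        fy = f(y)
        dE = fy - fx                           # 'energy' change = objective change
        if dE <= 0 or random.random() < math.exp(-dE / T):
            x, fx = y, fy                      # accept the move
        if fx < fbest:
            best, fbest = x, fx
    return best, fbest

# Usage: minimize a simple 1-D test function with minimum at x = 3.
print(simulated_annealing(lambda x: (x - 3.0) ** 2, x0=0.0))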
Particle swarm optimization (PSO), developed by Kennedy and Eberhart [17], updates the velocity and position of particle i by

$$v_i^{t+1} = v_i^t + \alpha \epsilon_1 [g^* - x_i^t] + \beta \epsilon_2 [x_i^* - x_i^t], \qquad (3)$$

$$x_i^{t+1} = x_i^t + v_i^{t+1}, \qquad (4)$$

where ε1 and ε2 are two random vectors with each entry taking values between 0 and 1, g∗ is the current global best, and x∗i is the best location found so far by particle i. The parameters α and β are the learning parameters or acceleration constants, which can typically be taken as, say, α ≈ β ≈ 2.
There are at least two dozen PSO variants which extend the standard PSO algorithm, and the most noticeable improvement is probably the use of an inertia function θ(t), so that $v_i^t$ is replaced by $\theta(t)\, v_i^t$, where θ ∈ [0, 1] [6]. This is equivalent to introducing a virtual mass to stabilize the motion of the particles, and thus the algorithm is expected to converge more quickly.
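For illustration, a compact Python sketch of updates (3) and (4) with a linearly decreasing inertia weight θ(t) is given below; the decay from 0.9 to 0.4, the population size, and the test function are assumptions chosen for demonstration only.

import numpy as np

def pso(f, dim, n=20, iters=200, alpha=2.0, beta=2.0, lo=-5.0, hi=5.0):
    # Minimal PSO sketch implementing Eqs. (3)-(4); the velocity is
    # damped by an inertia weight theta(t) decreasing from 0.9 to 0.4.
    rng = np.random.default_rng()
    x = rng.uniform(lo, hi, (n, dim))          # particle positions
    v = np.zeros((n, dim))                     # particle velocities
    pbest = x.copy()                           # per-particle best x_i^*
    pval = np.apply_along_axis(f, 1, x)
    g = pbest[pval.argmin()].copy()            # global best g^*
    for t in range(iters):
        theta = 0.9 - 0.5 * t / iters          # inertia weight theta(t)
        e1 = rng.random((n, dim))              # random vector epsilon_1
        e2 = rng.random((n, dim))              # random vector epsilon_2
        v = theta * v + alpha * e1 * (g - x) + beta * e2 * (pbest - x)  # Eq. (3)
        x = x + v                                                       # Eq. (4)
        val = np.apply_along_axis(f, 1, x)
        better = val < pval
        pbest[better], pval[better] = x[better], val[better]
        g = pbest[pval.argmin()].copy()
    return g, pval.min()

# Usage: minimize the sphere function in 5 dimensions.
print(pso(lambda z: float(np.sum(z ** 2)), dim=5))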
The first convergence analysis of PSO was carried out by Clerc and Kennedy in 2002 [7]
using the theory of dynamical systems. Mathematically, if we ignore the random factors,
we can view the system formed by (3) and (4) as a dynamical system. If we focus on a
single particle i and imagine that there is only one particle in this system, then the global
best g ∗ is the same as its current best x∗i . In this case, we have
$$v_i^{t+1} = v_i^t + \gamma (g^* - x_i^t), \qquad \gamma = \alpha + \beta, \qquad (5)$$

and

$$x_i^{t+1} = x_i^t + v_i^{t+1}. \qquad (6)$$
Considering the 1-D dynamical system for particle swarm optimization, we can replace g∗ by a parameter constant p so that we can see whether or not the particle of interest will converge towards p. By setting ut = p − x(t + 1) and using the notation of dynamical systems, we have the simple dynamical system

$$v_{t+1} = v_t + \gamma u_t, \qquad u_{t+1} = -v_t + (1 - \gamma) u_t, \qquad (7)$$

or

$$Y_{t+1} = A Y_t, \qquad A = \begin{pmatrix} 1 & \gamma \\ -1 & 1 - \gamma \end{pmatrix}, \qquad Y_t = \begin{pmatrix} v_t \\ u_t \end{pmatrix}. \qquad (8)$$
The general solution of this linear discrete system can be written as $Y_t = A^t Y_0$, so the system behaviour is characterized by the eigenvalues λ of A:

$$\lambda_{1,2} = 1 - \frac{\gamma}{2} \pm \frac{\sqrt{\gamma^2 - 4\gamma}}{2}.$$

It can be seen clearly that γ = 4 leads to a bifurcation. Following a straightforward analysis of this dynamical system, we have three cases. For 0 < γ < 4, cyclic and/or quasi-cyclic trajectories exist; in this case, when randomness is gradually reduced, some convergence can be observed. For γ > 4, non-cyclic behaviour can be expected, and the distance from $Y_t$ to the centre (0, 0) increases monotonically with t. In the special case γ = 4, some convergence behaviour can be observed. For a detailed analysis, please refer to Clerc and Kennedy [7]. Since p is linked with the global best, as the iterations continue it can be expected that all particles will aggregate towards the global best.
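This spectral picture is easy to check numerically; the short sketch below (values of γ purely illustrative) computes the eigenvalues of the matrix A in (8).

import numpy as np

# For 0 < gamma < 4 the eigenvalues of A form a complex pair with
# |lambda| = 1 (since det A = 1), giving cyclic/quasi-cyclic orbits;
# for gamma > 4 they are real and one modulus exceeds 1 (divergence).
for gamma in (0.5, 2.0, 4.0, 4.5):
    A = np.array([[1.0, gamma], [-1.0, 1.0 - gamma]])
    lam = np.linalg.eigvals(A)
    print(f"gamma = {gamma}: eigenvalues = {lam}, moduli = {np.abs(lam)}")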
The Firefly Algorithm (FA), developed by Yang [32, 34], is based on the flashing patterns and behaviour of fireflies. In essence, each firefly will be attracted to brighter ones, while at the same time exploring and searching for prey randomly. In addition, the brightness of a firefly is determined by the landscape of the objective function.
The movement of a firefly i that is attracted to another, more attractive (brighter) firefly j is determined by

$$x_i^{t+1} = x_i^t + \beta_0 e^{-\gamma r_{ij}^2} (x_j^t - x_i^t) + \alpha\, \epsilon_i^t, \qquad (9)$$
where the second term is due to the attraction, and the third term is randomization, with α being the randomization parameter and $\epsilon_i^t$ a vector of random numbers drawn from a Gaussian distribution or other distributions. Obviously, for a given firefly there are often many more attractive fireflies; we can then either loop through all of them or use only the most attractive one. For multimodal problems, using a loop while moving towards each brighter firefly is usually more effective, though this leads to a slight increase in algorithm complexity.
Here β0 ∈ [0, 1] is the attractiveness at r = 0, and $r_{ij} = \|x_i - x_j\|_2$ is the ℓ2-norm or Cartesian distance. For other problems such as scheduling, any measure that can effectively characterize the quantities of interest in the optimization problem can be used as the 'distance' r.
For most implementations, we can take β0 = 1, α = O(1) and γ = O(1). It is worth
pointing out that (9) is essentially a random walk biased towards the brighter fireflies. If
β0 = 0, it becomes a simple random walk. Furthermore, the randomization term can easily
be extended to other distributions such as Lévy flights [16, 24].
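For concreteness, a minimal Python sketch of the update (9), using the full pairwise loop and Gaussian randomization, is given below; the population size, iteration budget, parameter values, and the gradual reduction of α are illustrative assumptions.

import numpy as np

def firefly(f, dim, n=25, iters=100, beta0=1.0, gamma=1.0, alpha=0.2,
            lo=-5.0, hi=5.0):
    # Minimal firefly algorithm sketch based on Eq. (9): every firefly
    # moves towards each brighter (better) one via the double loop.
    rng = np.random.default_rng()
    x = rng.uniform(lo, hi, (n, dim))
    light = np.apply_along_axis(f, 1, x)       # lower objective = brighter
    for t in range(iters):
        for i in range(n):
            for j in range(n):
                if light[j] < light[i]:        # firefly j is brighter
                    r2 = np.sum((x[i] - x[j]) ** 2)      # r_ij^2
                    beta = beta0 * np.exp(-gamma * r2)   # attractiveness
                    x[i] += beta * (x[j] - x[i]) + alpha * rng.standard_normal(dim)
                    light[i] = f(x[i])
        alpha *= 0.97                          # slowly reduce the randomness
    k = light.argmin()
    return x[k], light[k]

# Usage: minimize the sphere function in 3 dimensions.
print(firefly(lambda z: float(np.sum(z ** 2)), dim=3))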
We can now carry out the convergence analysis for the firefly algorithm in a framework similar to Clerc and Kennedy's dynamical analysis. For simplicity, we start from the equation for firefly motion without the randomness term:

$$x_i^{t+1} = x_i^t + \beta_0 e^{-\gamma r_{ij}^2} (x_j^t - x_i^t). \qquad (10)$$
Figure 1: The chaotic map of the iteration formula (13) in the firefly algorithm, and the transition from periodic/multiple states to chaos.
If we focus on a single agent, we can replace xtj by the global best g found so far, and we
have
$$x_i^{t+1} = x_i^t + \beta_0 e^{-\gamma r_i^2} (g - x_i^t), \qquad (11)$$

where the distance $r_i$ is given by the ℓ2-norm $r_i^2 = \|g - x_i^t\|_2^2$. In an even simpler 1-D case, we can set $y_t = g - x_i^t$, and we have

$$y_{t+1} = y_t - \beta_0 e^{-\gamma y_t^2} y_t. \qquad (12)$$
We can see that γ is a scaling parameter which only affects the scales/size of the firefly
movement. In fact, we can let $u_t = \sqrt{\gamma}\, y_t$, and we have

$$u_{t+1} = u_t \big[1 - \beta_0 e^{-u_t^2}\big]. \qquad (13)$$
These equations can be analyzed easily using the same methodology as for the well-known logistic map

$$u_{t+1} = \lambda u_t (1 - u_t). \qquad (14)$$
The chaotic map is shown in Fig. 1, which also highlights the transition from periodic/multiple states to chaotic behaviour.
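For readers without the figure at hand, the long-run behaviour of (13) can be reproduced in a few lines; the chosen values of β0 and the transient length are arbitrary.

import math

def step(u, beta0):
    # One iteration of Eq. (13): u_{t+1} = u_t [1 - beta0 exp(-u_t^2)].
    return u * (1.0 - beta0 * math.exp(-u * u))

for beta0 in (0.5, 1.0, 1.5, 2.5, 4.2):
    u = 0.5
    for _ in range(1000):                      # discard the transient
        u = step(u, beta0)
    tail = []
    for _ in range(6):                         # sample the long-run behaviour
        u = step(u, beta0)
        tail.append(round(u, 4))
    print(f"beta0 = {beta0}: {tail}")          # fixed point, cycle, or chaos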
As we can see from Fig. 1, convergence can be achieved for β0 < 2, and there is a transition from periodic behaviour to chaos at β0 ≈ 4. This may be surprising, as the aim of designing a metaheuristic algorithm is to find the optimal solution efficiently and accurately. However, chaotic behaviour is not necessarily a nuisance; in fact, we can use it to the advantage of the firefly algorithm. Simple chaotic characteristics from (14) can often be used as an efficient mixing technique for generating diverse solutions. Statistically, the logistic map (14) with λ = 4, for initial states in (0, 1), corresponds to a beta distribution

$$B(u; p, q) = \frac{\Gamma(p+q)}{\Gamma(p)\Gamma(q)}\, u^{p-1} (1-u)^{q-1}, \qquad (15)$$
when p = q = 1/2. Here Γ(z) is the Gamma function
$$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt. \qquad (16)$$
In the case when z = n is an integer, we have Γ(n) = (n − 1)!. In addition, Γ(1/2) = √π.
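This correspondence can be checked empirically: the CDF of B(u; 1/2, 1/2) is F(u) = (2/π) arcsin(√u), so sample quantiles of the logistic-map iterates should match F⁻¹(q) = sin²(πq/2). A small sketch:

import math, random

# Compare sample quantiles of logistic-map iterates (lambda = 4) with
# the quantiles of the beta distribution B(u; 1/2, 1/2).
u = random.random()
samples = []
for t in range(100000):
    u = 4.0 * u * (1.0 - u)                    # Eq. (14) with lambda = 4
    if t > 100:                                # discard the transient
        samples.append(u)
samples.sort()
for q in (0.1, 0.25, 0.5, 0.75, 0.9):
    empirical = samples[int(q * len(samples))]
    analytic = math.sin(math.pi * q / 2.0) ** 2   # F^{-1}(q) for B(1/2, 1/2)
    print(f"q = {q}: map quantile = {empirical:.4f}, beta quantile = {analytic:.4f}")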
From the algorithm implementation point of view, we can use higher attractiveness β0
during the early stage of iterations so that the fireflies can explore, even chaotically, the
search space more effectively. As the search continues and convergence approaches, we
can reduce the attractiveness β0 gradually, which may increase the overall efficiency of the
algorithm. Obviously, more studies are needed to confirm this.
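As a sketch of one possible realization of this idea (an untested assumption, as just noted), β0 could simply be annealed linearly over the iteration budget:

def beta0_schedule(t, t_max, beta_hi=2.5, beta_lo=0.5):
    # Anneal the attractiveness from an exploratory (even chaotic)
    # value beta_hi down to a convergent value beta_lo < 2; compare
    # the behaviour of the map (13) for beta0 below and above 2.
    return beta_hi - (beta_hi - beta_lo) * t / t_max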
3 Markov Chains and Randomization Techniques

From the above convergence analysis, we know that there is no general mathematical framework to provide insights into the working mechanisms, the stability, and the convergence of a given algorithm. Despite the increasing popularity of metaheuristics, mathematical analysis remains fragmentary, and many open problems need urgent attention.
Monte Carlo methods have been applied in many applications [28], including almost
all areas of sciences and engineering. For example, Monte Carlo methods are widely used
in uncertainty and sensitivity analysis [21]. From the statistical point of view, most meta-
heuristic algorithms can be viewed in the framework of Markov chains [14, 28]. For example,
simulated annealing [20] is a Markov chain, as the next state or new solution in SA only
depends on the current state/solution and the transition probability. For a given Markov
chain with certain ergodicity, a stationary probability distribution and convergence can be achieved.
Now if we look at PSO closely using the framework of Markov chain Monte Carlo [12, 13, 14], each particle in PSO essentially forms a Markov chain, though this Markov chain is biased towards the current best, as the transition probability often favours accepting moves towards the current global best. Other population-based algorithms can also be viewed in this framework. In essence, all metaheuristic algorithms with piecewise, interacting paths can be analyzed in the general framework of Markov chain Monte Carlo. The main challenge is to realize this and to use the appropriate Markov chain theory to study metaheuristic algorithms. More fruitful studies will surely emerge in the future.
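To make this viewpoint concrete, the sketch below runs a random-walk Metropolis chain whose stationary density is the Boltzmann-type π(x) ∝ exp(−f(x)/T); the fixed temperature, proposal width, and test function are illustrative assumptions.

import math, random

def metropolis_chain(f, x0, T=1.0, width=0.5, steps=20000):
    # Random-walk Metropolis sampler for pi(x) ~ exp(-f(x)/T): low-f
    # regions carry most of the probability mass, so the best visited
    # state doubles as an optimizer -- the MCMC view of a
    # trajectory-based metaheuristic.
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for _ in range(steps):
        y = x + random.gauss(0.0, width)       # symmetric proposal
        fy = f(y)
        # Accept downhill moves always, uphill with prob. exp(-(fy-fx)/T).
        if fy <= fx or random.random() < math.exp(-(fy - fx) / T):
            x, fx = y, fy
        if fx < fbest:
            best, fbest = x, fx
    return best, fbest

# Usage: a bimodal test function with minima at x = -2 and x = 2.
print(metropolis_chain(lambda x: (x * x - 4.0) ** 2, x0=10.0))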
A random walk of N consecutive steps si, starting at x0, can be written as

$$x_N = x_0 + \sum_{i=1}^{N} \alpha_i s_i, \qquad (18)$$

where αi > 0 is a parameter controlling the step sizes or scalings. If si is drawn from a normal distribution $N(\mu_i, \sigma_i^2)$, then the conditions of stable distributions lead to a combined Gaussian distribution

$$x_N \sim N(\mu_*, \sigma_*^2), \qquad \mu_* = \sum_{i=1}^{N} \alpha_i \mu_i, \qquad \sigma_*^2 = \sum_{i=1}^{N} \alpha_i \big[\sigma_i^2 + (\mu_* - \mu_i)^2\big]. \qquad (19)$$
We can see that the mean location changes with N and that the variance increases as N increases; this makes it possible to reach any area in the search space if N is large enough.
A diffusion process can be viewed as a series of Brownian motions which obey the Gaussian distribution. For this reason, standard diffusion is often referred to as Gaussian diffusion. If µi = 0, the mean of the particle locations is obviously zero, and their variance increases linearly with t. In general, in d-dimensional space, the variance of Brownian random walks can be written as

$$\sigma^2(t) = |v_0|^2 t^2 + (2 d D)\, t, \qquad (20)$$

where v0 is the drift velocity of the system. Here $D = s^2/(2\tau)$ is the effective diffusion coefficient, which is related to the step length s over a short time interval τ during each jump. If the motion at each step is not Gaussian, then the diffusion is called non-Gaussian diffusion. If the step length obeys some other distribution, we have to deal with more generalized random walks. A very special case is when the step length obeys the Lévy distribution; such a random walk is called a Lévy flight or Lévy walk.
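The different growth laws can be observed directly; the sketch below compares the sample variance of Gaussian random walks with that of walks using heavy-tailed (Pareto-like, truncated) steps, a crude Lévy-flight stand-in chosen so that the sample variance stays finite.

import numpy as np

rng = np.random.default_rng(42)
walkers, T = 2000, 1000

# Brownian walk: i.i.d. Gaussian steps, so variance grows linearly in t.
brown = np.cumsum(rng.standard_normal((walkers, T)), axis=1)

# Heavy-tailed walk: symmetric Pareto-like steps with tail index 1.5,
# truncated at 1e3 so that the sample variance exists.
u = 1.0 - rng.random((walkers, T))             # u in (0, 1]
steps = np.sign(rng.random((walkers, T)) - 0.5) * np.minimum(u ** (-1.0 / 1.5), 1e3)
heavy = np.cumsum(steps, axis=1)

for t in (10, 100, 1000):
    print(f"t = {t}: var(Brownian) = {brown[:, t - 1].var():.1f}, "
          f"var(heavy-tailed) = {heavy[:, t - 1].var():.1f}")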
In nature, animals search for food in a random or quasi-random manner. In general, the foraging path of an animal is effectively a random walk, because the next move is based on the current location/state and the transition probability to the next location. Which direction it chooses depends implicitly on a probability, which can be modelled mathematically [3, 24]. For example, various studies have shown that the flight behaviour of many animals and insects demonstrates the typical characteristics of Lévy flights [24, 26]. Subsequently, such behaviour has been applied to optimization and optimal search, and preliminary results show its promising capability [30, 24].
In general, the Lévy distribution is stable, and can be defined in terms of a characteristic function or the following Fourier transform:

$$F(k) = \exp[-\alpha |k|^{\beta}], \qquad 0 < \beta \le 2, \qquad (21)$$

where α is a scale parameter. Lévy flights are super-diffusive, with a variance that grows as

$$\sigma^2(t) \sim t^{3-\beta}, \qquad 1 \le \beta \le 2, \qquad (22)$$

which increases much faster than the linear relationship (i.e., σ2(t) ∼ t) of Brownian random walks.
Studies show that Lévy flights can maximize the efficiency of resource searches in un-
certain environments. In fact, Lévy flights have been observed among foraging patterns of
albatrosses and fruit flies [24, 26, 30]. In addition, Lévy flights have many applications.
Many physical phenomena, such as the diffusion of fluorescent molecules, cooling behaviour, and noise, can show Lévy-flight characteristics under the right conditions [26].
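In implementations, Lévy-stable step lengths are commonly generated with Mantegna's algorithm; the sketch below follows that recipe for index β = 1.5 (one common choice, not the only one).

import math
import numpy as np

def levy_steps(size, beta=1.5, rng=None):
    # Mantegna's algorithm: s = u / |v|^(1/beta) with u ~ N(0, sigma_u^2)
    # and v ~ N(0, 1) approximates a symmetric Levy-stable step of
    # index beta (1 < beta <= 2).
    rng = rng or np.random.default_rng()
    sigma_u = (math.gamma(1.0 + beta) * math.sin(math.pi * beta / 2.0)
               / (math.gamma((1.0 + beta) / 2.0) * beta
                  * 2.0 ** ((beta - 1.0) / 2.0))) ** (1.0 / beta)
    u = rng.normal(0.0, sigma_u, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1.0 / beta)

s = levy_steps(100000)
print("median |step|:", float(np.median(np.abs(s))))
print("max |step|:", float(np.abs(s).max()))    # occasional very long jumps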
4 Open Problems
It is no exaggeration to say that metaheuristic algorithms have been a great success in solving various tough optimization problems. Despite this huge success, many important questions remain unanswered. We know how these heuristic algorithms work, and we also partly understand why they work. However, it is difficult to analyze mathematically why these algorithms are so successful; although significant progress has been made in the last few years [1, 22], many open problems still remain.
For all population-based metaheuristics, multiple search agents form multiple interacting Markov chains. At the moment, theoretical development in this area is still at an early stage, so the mathematical analysis of the rate of convergence is very difficult, if not impossible. Apart from the mathematical analysis of a limited number of metaheuristics, the convergence of most other algorithms has not been proved mathematically, at least up to now. Any mathematical analysis will thus provide important insight into these algorithms. It will also be valuable for providing new directions for further important modifications of these algorithms, or even for pointing out innovative ways of developing new ones.
For almost all metaheuristics, including future new algorithms, an important issue still to be addressed is how to provide a balanced trade-off between local intensification and global diversification [5]. At present, different algorithms use different techniques and mechanisms with various parameters to control this balance, and they are far from optimal. Important questions are: Is there any optimal way to achieve this balance? If yes, how? If not, what is the best we can achieve?
Furthermore, it is still only partly understood why different components of heuristics
and metaheuristics interact in a coherent and balanced way so that they produce efficient
algorithms which converge under the given conditions. For example, why does a balanced
combination of randomization and a deterministic component lead to a much more efficient
algorithm (than a purely deterministic and/or a purely random algorithm)? How to measure
or test if a balance is reached? How to prove that the use of memory can significantly
increase the search efficiency of an algorithm? Under what conditions?
In addition, the well-known No-Free-Lunch theorems [31] have been proved for single-objective optimization over finite search domains, but they do not hold for continuous infinite domains [1, 2], and they remain unproved for multiobjective optimization. If they are proved to be true (or not) for multiobjective optimization, what
are the implications for algorithm development? Another important question is about
the performance comparison. At the moment, there is no agreed measure for comparing
performance of different algorithms, though the absolute objective value and the number of
function evaluations are two widely used measures. However, a formal theoretical analysis
is yet to be developed.
Nature provides almost unlimited ways for problem-solving. If we observe carefully, we will surely be inspired to develop more powerful and efficient new-generation algorithms. Intelligence is a product of biological evolution in nature. Ultimately, some intelligent algorithms (or systems) may appear in the future, able to evolve and adapt optimally so as to solve NP-hard optimization problems efficiently and intelligently.
Finally, a current trend is to use simplified metaheuristic algorithms to deal with complex
optimization problems. Possibly, there is a need to develop more complex metaheuristic
algorithms which can truly mimic the exact working mechanism of some natural or biological
systems, leading to more powerful next-generation, self-regulating, self-evolving, and truly
intelligent metaheuristics.
References
[1] A. Auger and B. Doerr, Theory of Randomized Search Heuristics: Foundations and
Recent Developments, World Scientific, (2010).
[2] A. Auger and O. Teytaud, Continuous lunches are free plus the design of optimal
optimization algorithms, Algorithmica, 57(1), 121-146 (2010).
[4] D. Bertsimas and J. Tsitsiklis, Simulated annealing, Stat. Science, 8, 10-15 (1993).
[6] A. Chatterjee and P. Siarry, Nonlinear inertia variation for dynamic adaptation in particle swarm optimization, Comp. Oper. Research, 33, 859-871 (2006).
[7] M. Clerc, J. Kennedy, The particle swarm - explosion, stability, and convergence in
a multidimensional complex space, IEEE Trans. Evolutionary Computation, 6, 58-73
(2002).
[8] M. Dorigo, Optimization, Learning and Natural Algorithms, PhD thesis, Politecnico di Milano, Italy, (1992).
[10] C. A. Floudas and P. M. Pardalos, A Collection of Test Problems for Constrained Global Optimization Algorithms, Lecture Notes in Computer Science 455, Springer-Verlag, (1990).
[12] D. Gamerman, Markov Chain Monte Carlo, Chapman & Hall/CRC, (1997).
[13] C. J. Geyer, Practical Markov Chain Monte Carlo, Statistical Science, 7, 473-511
(1992).
[14] A. Ghate and R. Smith, Adaptive search with stochastic acceptance probabilities for
global optimization, Operations Research Lett., 36, 285-290 (2008).
[17] J. Kennedy and R. C. Eberhart, Particle swarm optimization, in: Proc. of IEEE
International Conference on Neural Networks, Piscataway, NJ. pp. 1942-1948 (1995).
[19] J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, (1975).
[21] C. Matthews, L. Wright, and X. S. Yang, Sensitivity Analysis, Optimization, and Sampling Methods Applied to Continuous Models, National Physical Laboratory Report, UK, (2009).
[23] J. P. Nolan, Stable distributions: models for heavy-tailed data, American University,
(2009).
[24] I. Pavlyukevich, Lévy flights, non-local search and simulated annealing, J. Computa-
tional Physics, 226, 1830-1844 (2007).
[26] A. M. Reynolds and C. J. Rhodes, The Lévy flight paradigm: random search patterns and mechanisms, Ecology, 90, 877-887 (2009).
[28] I. M. Sobol, A Primer for the Monte Carlo Method, CRC Press, (1994).
[34] X. S. Yang, Firefly algorithm, stochastic test functions and design optimisation, Int.
J. Bio-Inspired Computation, 2, 78–84 (2010).
[35] X. S. Yang and S. Deb, Engineering optimization by cuckoo search, Int. J. Math.
Modelling & Num. Optimization, 1, 330-343 (2010).