Nonlinear Optimization
edited by
Almerico Murli
Gerardo Toraldo
Università di Napoli “Federico II”
A Special Issue of
COMPUTATIONAL OPTIMIZATION
AND APPLICATIONS
An International Journal
Volume 7, No. 1 (1997)
Introduction
A. MURLI AND G. TORALDO {murli,toraldo}@matna2.dma.unina.it
Center for Research on Parallel Computing and Supercomputers (CPS), Italian National Research Council &
University of Naples “Federico II”, Italy
Abstract. This paper provides a means for comparing various computer codes for solving large scale mixed
complementarity problems. We discuss inadequacies in how solvers are currently compared, and present a testing
environment that addresses these inadequacies. This testing environment consists of a library of test problems,
along with GAMS and MATLAB interfaces that allow these problems to be easily accessed. The environment is
intended for use as a tool by other researchers to better understand both their algorithms and their implementations,
and to direct research toward problem classes that are currently the most challenging. As an initial benchmark,
eight different algorithm implementations for large scale mixed complementarity problems are briefly described
and tested with default parameter settings using the new testing environment.
1. Introduction
In recent years, a considerable number of new algorithms have been developed for solv-
ing large scale mixed complementarity problems. Many of these algorithms appear very
promising theoretically, but it is difficult to understand how well they will work in practice.
Indeed, many of the papers describing these algorithms are primarily theoretical papers and
include only very minimal computational results. Even with extensive testing, there are
inadequacies in the way the results are reported, which makes it difficult to compare one
approach against another.
The purpose of this paper is to describe a testing environment for evaluating the strengths
and weaknesses of various codes for solving large scale mixed complementarity problems.
We believe that the environment is ideally suited for the computational study, development,
and comparison of algorithm implementations. The careful description and documentation
of the environment given here should help algorithm designers focus their developmental
efforts toward practical and useful codes. To exhibit its intended usage, we benchmark eight
different algorithm implementations for large scale mixed complementarity problems with
the new testing environment. At the same time, we intend to provide a convenient mecha-
nism for modelers to provide new and challenging problems for use in solver comparison.
* This material is based on research supported by National Science Foundation Grant CCR-9157632 and the
Air Force Office of Scientific Research Grant F49620-94-1-0036.
4 BILLUPS, DIRKSE AND FERRIS
As an added benefit, we believe the environment will help modelers determine which code
best fits their needs.
The mixed complementarity problem (MCP) is a generalization of a system of nonlinear
equations and is completely determined by a nonlinear function $F : \mathbb{R}^n \to \mathbb{R}^n$ and upper
and lower bounds on the variables. The variables $z$ must lie between the given bounds $\ell$ and
$u$. The constraints on the nonlinear function are determined by the bounds on the variables
in the following manner:
\[ \ell_i < z_i < u_i \implies F_i(z) = 0 \]
\[ z_i = \ell_i \implies F_i(z) \ge 0 \]
\[ z_i = u_i \implies F_i(z) \le 0. \]
We will use the notation $B$ to represent the set $[\ell, u]$.
Several special cases of this formulation are immediately obvious. For example, if $\ell \equiv
-\infty$ and $u \equiv +\infty$ then the last two implications are vacuous and MCP is the problem of
determining $z \in \mathbb{R}^n$ such that $F(z) = 0$.
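The three implications above can be checked mechanically for a candidate point. The following is a minimal sketch, not code from the testing environment: the function name `satisfies_mcp`, the tolerance `tol`, and the two-variable test function `F` are all assumptions introduced for illustration.

```python
import math

def satisfies_mcp(z, Fz, lower, upper, tol=1e-8):
    """Return True if (z, F(z)) meets the three MCP implications within tol."""
    for zi, fi, li, ui in zip(z, Fz, lower, upper):
        if zi < li - tol or zi > ui + tol:
            return False                      # z must stay inside B = [l, u]
        at_lower = abs(zi - li) <= tol
        at_upper = abs(zi - ui) <= tol
        if at_lower and fi < -tol:
            return False                      # z_i = l_i requires F_i(z) >= 0
        if at_upper and fi > tol:
            return False                      # z_i = u_i requires F_i(z) <= 0
        if not (at_lower or at_upper) and abs(fi) > tol:
            return False                      # strictly between bounds: F_i(z) = 0
    return True

def F(z):
    # Hypothetical test function, not one of the library problems.
    return [z[0] - 2.0, z[1] + 1.0]

lower, upper = [0.0, 0.0], [1.0, 1.0]
z_star = [1.0, 0.0]      # first variable at its upper bound, second at its lower bound
z_bad = [0.5, 0.5]       # interior point where F is nonzero

# With infinite bounds the check reduces to F(z) = 0.
free_ok = satisfies_mcp([3.0], [0.0], [-math.inf], [math.inf])
```

Here `z_star` passes the check and `z_bad` does not, because both of its components lie strictly between the bounds while `F` is nonzero there.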
As another example, the Karush-Kuhn-Tucker conditions for nonlinear programs of the
form
\[ \min f(z) \quad \text{s.t.} \quad g(z) \le 0 \]
are given by
\[ \nabla f(z) + \lambda \nabla g(z) = 0, \]
\[ g(z) \le 0, \quad \lambda \ge 0, \quad \lambda^T g(z) = 0. \]
Here $\mathbb{R}^n_+$ represents the nonnegative orthant of $\mathbb{R}^n$. Many problems in economic equilibrium
theory can be cast as MCPs and an overview of how this is accomplished is given in
[31]. Other application areas are detailed in [7, 12]. There has been much recent interest in
less traditional applications of the complementarity framework. Some of these are based on
the generalized equation literature [28] that reformulates the MCP as $0 \in F(z) + N_B(z)$.
Here $N_B(z)$ is the classical normal cone to the set $B$ at the point $z$ defined by
\[ N_B(z) := \{ y \mid y^T (x - z) \le 0, \ \forall x \in B \}, \]
if $z \in B$ and is empty otherwise.
Nonlinear complementarity problems appeared in the literature in [5]. The first algorithms
for these problems were based on simplicial labeling techniques originally due to Scarf [32].
Extensions of these algorithms led to fixed point schemes [18, 33]. Newton techniques [8,
22, 30] that are based on successive linearization of the nonlinear problem have proven very
useful for solving these problems, although their convergence analysis is less satisfactory
LARGE SCALE MIXED COMPLEMENTARITY PROBLEM SOLVERS 5
than the fixed point theory. Recent extensions have looked at reformulating the nonlinear
complementarity problem as a system of nonsmooth nonlinear equations and solving these
using a damped Newton or Gauss-Newton approach [6, 8, 10, 11, 13, 16, 19, 20, 21, 23,
24, 25, 26, 27, 29, 34, 35].
We are concerned in this paper with computational testing and comparison of such al-
gorithms. We see several problems with the current state of affairs in the way solvers are
developed and compared.
Codes are tweaked to solve particular problems, with different choices of control pa-
rameters being used to solve different problems. This is contrary to how solvers are used
in practice. In general, modelers are not interested in parameter adjustment; instead,
they usually run codes only with default options. A good code will have a set of default
parameters that performs well on most problems.
Even when a consistent set of control parameters is used, codes are developed and
tuned using the same limited set of test problems for which computational results are
reported. Consequently, the results do not give a fair picture of how the codes might
behave on other problems. Enlarging the test suite and adding real world problems
alleviates some of these difficulties.
There is no clear understanding of what makes problems difficult. Thus, test cases
reported do not necessarily reflect the various difficulties that can cause algorithms to
fail. As a result, it is extremely difficult for a modeler to determine which algorithm
will work best for his particular class of problems.
The majority of papers written are theoretical in nature and provide computational
results only for naive implementations of the algorithms. While this can exhibit the
potential of a particular approach, it is inadequate for evaluating how an algorithm will
work in practice. Instead, computational results need to be reported for sophisticated
implementations of the algorithms. In particular, algorithm specific scaling, prepro-
cessing or heuristics are crucial for improved robustness and developer supplied default
settings should be used in all solver comparisons.
Test problems do not reflect the interests of users with real-world applications. Thus,
algorithms are developed which are good at solving “toy” problems, but are not neces-
sarily good at solving problems of practical importance.
These problems in the way solvers are currently tested result in two major deficiencies
in the usefulness of test results. First, the reported results are inadequate for modelers
to determine which codes will be most successful for solving their problems. Second, it
is difficult for algorithm developers to determine where additional research needs to be
directed.
In order to overcome these difficulties, this paper proposes that a testing environment for
large scale mixed complementarity problems be developed. The goals of this environment
are again twofold: first, it should provide a means of more accurately evaluating the strengths
and weaknesses of various codes, and second, it should help direct algorithm developers
toward addressing the issues of greatest importance. A preliminary version of such an
environment is described in Section 2 and was used to generate the computational results
reported in Section 4. A brief description of each of the codes tested is provided in Section 3.
2. Testing Environment
This section describes a testing environment that aims to correct many of the problems
discussed in the introduction concerning how codes are developed and tested. This en-
vironment has four main components: a library of test problems, GAMS and MATLAB
interfaces that allow these problems to be easily accessed, a tool for verifying the correctness
of solutions, and some awk scripts for evaluating results.
The centerpiece of the testing environment is a large publicly available library of test prob-
lems that reflects the interests of users with real-world applications, and that also includes
problems having known types of computational difficulties. Many of these problems are
contained in the standard GAMS distribution [3], while others are part of the expanding
collection of problems called MCPLIB [7]. All of the problems that are used in this work
are publicly available and can be accessed both from within the GAMS modeling system
[3] and from within MATLAB [14].
Because most of the problems in the test library come from real-world applications, the
library reflects, as much as possible, the needs of the user community. As this library
has become more popular among code developers, we have observed an increased interest
among modelers to contribute more and more challenging problems to the library. The
motivation is simple: modelers want to encourage the development of codes capable of
solving their most difficult problems.
We note that many of the problems contained in the test library are difficult for varying
reasons. We believe that it is important to identify the characteristics that make problems
hard. This is a daunting task; toward this end, we give an incomplete classification of the
types of problem difficulties that may prove challenging for different algorithms.
3. Problem Size. Some algorithms may be better at exploiting problem structure than
others, making them less sensitive to the size of the problem. One weakness of our
current test suite is that it does not address the issue of size very well. We have attempted
to include problems of reasonable size, but it is clear that the test library needs to be
expanded in this area.
4. Sensitivity to Scaling. Our experience is that modelers, of necessity, tend to become very
good at scaling their models so that relevant matrices are reasonably well-conditioned.
Indeed, most of the problems in our model library are well scaled. However, models
under development are often poorly scaled. Frequently, solutions are used to scale
models properly and to aid in the model construction. Thus, sensitivity to scaling
is quite important. In general it is very difficult to scale highly nonlinear functions
effectively, so that an algorithm that is less sensitive to scaling may prove to be more
practical for highly nonlinear problems.
5. Others. Several other problem characteristics have been proposed, but have not been
well studied in the context of real models. These include monotonicity, multiple solu-
tions, and singularity at the solution.
Tables 1 and 2 describe the problems that are included in the test library. Further docu-
mentation on these problems can be found in [31] and [7] respectively. Since the starting
point can greatly influence the performance of an algorithm, the library includes multiple
starting points for most problems. We note that many of the economic problems have the
first starting point very close to a solution. This is the “calibration” point and is used by a
modeler to test whether the model reproduces benchmark data. The following abbreviations
are used when referring to the type of the problem:
The tables also include a column labeled “other”. In this column we have added some
known characteristics of the problems. Thus “M” is entered in this column if the problem
is known to be monotone. Similarly a digit “4” for example indicates the number of known
solutions. If an “S” occurs in this column then the submatrix of the Jacobian corresponding
to the “active constraints” is known to have condition number greater than 10’ at a solution.
The fact that one of these entries does not appear in the table only signifies that the authors
do not know whether the problem has this particular characteristic.
2.2. Interfaces
To make the test library useful, two interfaces are provided that make the problems easily
accessible both for testing of mature codes and for evaluating prototype algorithms.
The first interface is a means for programs to communicate directly with the GAMS
modeling language [3]. For realistic application problems, we believe that the use of a
modeling system such as AMPL [17] or GAMS is crucial. In earlier work with Rutherford
[9], we developed the GAMS/CPLIB interface that provides simple routines to obtain
function and Jacobian evaluations and recover problem data. This makes it easy to hook
up any solver that is written in Fortran or C as a subsystem of GAMS. The advantages
of using a modeling system are many; some of the most important advantages include
automatic differentiation, easy data handling, architecture-independent interfaces between
models and solvers, and the ability to extend models easily to answer new questions arising
from solutions of current models. In addition, modeling languages provide a ready library
of examples on which to test solvers. GAMS was chosen for our work instead of AMPL
because it is a mature product with many users, resulting in the availability of many real-
world problems.
While we believe that any mature code should be connected with a modeling language,
we also feel that there should be an easier means for making the library of test problems
available to prototype algorithms. The MATLAB interface described in [14] provides such
a means. Using MATLAB, it is possible to quickly implement a prototype version of a
new algorithm, which can be tested on the entire suite of test problems with the MATLAB
interface. Thus, the test library can play an active role in influencing the development of
new algorithms. It must be noted, however, that there are subtle differences between the
MATLAB models and the GAMS models. In particular, many GAMS models vary not only
the starting point for different runs, but also some of the underlying nonlinearities, whereas
the MATLAB models vary only the starting point. Thus, a completely accurate comparison
must be carried out exclusively in GAMS or exclusively in MATLAB.
Since stopping criteria vary from algorithm to algorithm, a standardized measure is needed
to ensure that different algorithms produce solutions that have some uniformity in solution
quality. To achieve this goal, we developed an additional solver, accessible through GAMS,
that evaluates the starting point and returns the value of the following merit function:
\[ \| z - \pi_B(z - F(z)) \| \qquad (1) \]
where $\pi_B$ represents the projection operator onto the set $B$. To use this verification test,
we first solve the problem with the algorithm we are testing, and pass the solution to our
“special” solver to verify that the standardized residual is not too large. Since the special
solver is callable from GAMS, this can be achieved by adding a few lines to the GAMS
problem files.
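The verification step can be illustrated with a small stand-alone computation. The sketch below assumes that the standardized residual is the natural residual, the norm of z minus the projection of z - F(z) onto B, and that the norm is Euclidean; both are assumptions, and the names `project`, `natural_residual`, and the test function `F` are hypothetical rather than part of the GAMS verification solver.

```python
import math

def project(x, lower, upper):
    """Componentwise projection onto the box B = [lower, upper]."""
    return [min(max(xi, li), ui) for xi, li, ui in zip(x, lower, upper)]

def natural_residual(F, z, lower, upper):
    """Euclidean norm of z - proj_B(z - F(z)); it is zero exactly at MCP solutions."""
    shifted = [zi - fi for zi, fi in zip(z, F(z))]
    p = project(shifted, lower, upper)
    return math.sqrt(sum((zi - pi) ** 2 for zi, pi in zip(z, p)))

def F(z):
    # Hypothetical test function, not one of the library problems.
    return [z[0] - 2.0, z[1] + 1.0]

lower, upper = [0.0, 0.0], [1.0, 1.0]
res_solution = natural_residual(F, [1.0, 0.0], lower, upper)   # candidate returned by a solver
res_interior = natural_residual(F, [0.5, 0.5], lower, upper)   # a point that is not a solution
```

A driver script would accept a solver's answer only if this value falls below a fixed threshold, regardless of which merit function the solver itself minimizes.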
The output of MCP codes is typically quite extensive and varies from solver to solver. To
extract pertinent information from this output, we have written several awk scripts that read
through the files, and then generate data tables. These scripts require slight modifications for
each solver, but are a tremendous help in extracting data to produce meaningful information.
3. Description of Algorithms
Ideally, the computational study of algorithms should be performed using only mature,
sophisticated codes, so that the strengths and limitations of each algorithm would be accu-
rately reflected in the numerical results. Unfortunately, many of the algorithms proposed
for complementarity problems are not accompanied by such mature codes. Of the algo-
rithms described below, the implementations of MILES, PATH, and SMOOTH are the most
mature. For the remaining algorithms, we have developed our own implementations which
incorporate the GAMS interface.
All of the algorithms outlined have been coded to take explicit advantage of the MCP
structure; however, several of them were originally devised for the special case of the
nonlinear complementarity problem (NCP)
and will be described below in this context. We now give a brief description of the codes
that were tested and indicate pertinent references for further details.
3.1. MILES
MILES [30] is an extension of the classical Josephy-Newton method for NCP in which the
solution to each linearized subproblem
3.2. PATH
The PATH solver [8] applies techniques similar to those used in Newton methods for smooth
systems to the following reformulation of the MCP
\[ 0 = F(\pi_B(x)) + x - \pi_B(x). \]
Here $\pi_B$ represents the projection operator onto the set $B$, which is in general not
differentiable. The algorithm consists of a sequence of major iterations, each consisting of an
approximation or linearization step similar to that of MILES, the construction of a path to the
Newton point (the solution to the approximation), and a possible search of this path. When
the Newton point does not exist or the path cannot be entirely constructed, a step along
the partially computed path is taken before the problem is relinearized. A nonmonotone
watchdog strategy is employed in applying the path search; this helps avoid convergence
to local minima of the merit function (1), and keeps the number of function evaluations
required as small as possible.
Other computational enhancements employed by PATH are a projected Newton prepro-
cessing phase (used to find an initial point that better corresponds to the optimal active set)
and the addition of a diagonal perturbation term to the Jacobian matrix when rank deficiency
is detected. The Jacobian elements are also automatically scaled by the algorithm at each
major iteration.
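The reformulation used by PATH can be evaluated directly, which makes the link between a zero of the reformulated system and an MCP solution concrete. This is only a sketch under the assumption that the reformulation is the normal map x -> F(proj_B(x)) + x - proj_B(x); the helper names and the test function `F` are hypothetical, and none of PATH's path construction or watchdog logic is represented.

```python
def project(x, lower, upper):
    """Componentwise projection onto the box B = [lower, upper]."""
    return [min(max(xi, li), ui) for xi, li, ui in zip(x, lower, upper)]

def normal_map(F, x, lower, upper):
    """Evaluate F(proj_B(x)) + x - proj_B(x) componentwise."""
    p = project(x, lower, upper)
    return [fi + xi - pi for fi, xi, pi in zip(F(p), x, p)]

def F(z):
    # Hypothetical test function, not one of the library problems.
    return [z[0] - 2.0, z[1] + 1.0]

lower, upper = [0.0, 0.0], [1.0, 1.0]
x_zero = [2.0, -1.0]                       # a zero of the normal map, lying outside B
value = normal_map(F, x_zero, lower, upper)
z_from_x = project(x_zero, lower, upper)   # the corresponding point of B
```

In this example `x_zero` lies outside the box, and its projection `(1, 0)` satisfies the three MCP implications for `F`.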
3.3. NE/SQP
The NE/SQP algorithm [26] is based upon reformulating the NCP as the system of nonsmooth
equations
\[ 0 = H(z) := \min\{z, F(z)\}. \]
In [2] the NE/SQP algorithm is extended to the MCP by using the reformulation
to find a zero of H . The nonsmoothness of the equations is handled using directional deriva-
tives of H . Specifically, at each iteration, a search direction is calculated by minimizing a
convex quadratic program whose objective function is formed by squaring a linear approx-
imation of H . At points where the derivative is not well defined, the linear approximation
is created by choosing a particular element of the subdifferential. Once this direction is
determined, an Armijo-type linesearch is used to calculate the step size to be taken along
that direction. The advantage of this approach is that the direction finding subproblems are
always solvable. This is in contrast to Newton-based approaches, which may fail due to a
singular Jacobian matrix, and to PATH and MILES, which determine the search direction
by attempting to solve a linear complementarity problem, which may, in fact, be unsolvable.
One weakness of the algorithm is that it is vulnerable to converging to local minima of
the merit function $\theta$ that are not solutions to the problem. The code uses scaling of the
subproblems and enforces a small cushion between the iterates and the boundary of B as
suggested in [26].
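For the NCP case, the min-map H and a squared-norm merit function are easy to write down. The sketch below assumes the merit function is one half of the squared Euclidean norm of H, since the defining equation does not appear in this text; the names `min_map`, `merit`, and the test function `F` are hypothetical, and the quadratic-programming direction search is not shown.

```python
def min_map(F, z):
    """H(z) = min{z, F(z)} taken componentwise (NCP case, bounds 0 and +infinity)."""
    return [min(zi, fi) for zi, fi in zip(z, F(z))]

def merit(F, z):
    """Assumed merit function: one half of the squared Euclidean norm of H(z)."""
    return 0.5 * sum(h * h for h in min_map(F, z))

def F(z):
    # Hypothetical NCP test function, not one of the library problems.
    return [z[0] - 1.0, z[1] + 1.0]

theta_solution = merit(F, [1.0, 0.0])   # z >= 0, F(z) >= 0, z_i * F_i(z) = 0
theta_other = merit(F, [2.0, 2.0])      # H = (1, 2), so the merit is 0.5 * (1 + 4)
```

A local minimum of this merit function with a positive value is the kind of non-solution the text warns about.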
3.4. SMOOTH
The SMOOTH algorithm [4] is based upon reformulating the NCP as a system of nonsmooth
equations
and then approximately solving a sequence of smooth approximations, which lead to a zero
of the nonsmooth system. More precisely, at each iteration, a smooth approximation to the
original system is formed where the accuracy of the approximation is determined by the
residual of the current point, that is $\|x - \pi_B(x - F(x))\|$. The smooth approximation
along this direction. Assuming this new point produces an improved residual, the next
iteration is based upon a tighter approximation of the nonsmooth equations.
An initial scaling of the data is used in the code, and the PATH preprocessor is used.
However, in SMOOTH, the preprocessor is used to try to solve the MCP instead of merely
to identify the active set. If this technique fails, the code is restarted and the smoothing
technique is then used to find a solution.
3.5. QPCOMP
QPCOMP [2] is an enhancement of the NE/SQP algorithm, which adds a proximal perturbation
strategy that allows the iterates to escape local minima of the merit function $\theta$
defined in (3). In essence, the algorithm detects when the iterates appear to be converging
to a local minimum, and then approximately solves a sequence of perturbed problems to
escape the domain of convergence of that local minimum. The perturbed problems are
formed by replacing F with the perturbed function
where the centering point $\bar{z}$ is generally chosen to be the current iterate, and the perturbation
parameter $\lambda$ is chosen adaptively in a manner that guarantees global convergence to a
solution when F is both continuously differentiable and pseudomonotone at a solution. In
general, the perturbed function is updated after each iteration. Thus, the perturbed problems
are not solved exactly; they are just used to determine the next step.
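A proximal perturbation of this kind can be sketched in a few lines. The form used here, F(z) + lambda * (z - zbar), is the usual proximal-point perturbation and is an assumption, because the defining equation does not appear in this text; `make_perturbed`, `lam`, `center`, and the test function `F` are hypothetical names, and the adaptive choice of the perturbation parameter is not modeled.

```python
def make_perturbed(F, lam, center):
    """Return the perturbed function z -> F(z) + lam * (z - center)."""
    def F_pert(z):
        return [fi + lam * (zi - ci) for fi, zi, ci in zip(F(z), z, center)]
    return F_pert

def F(z):
    # Hypothetical test function, not one of the library problems.
    return [z[0] - 1.0, z[1] + 1.0]

center = [2.0, 2.0]                         # centering point, e.g. the current iterate
F_lam = make_perturbed(F, 0.5, center)

at_center = F_lam(center)                   # the perturbation vanishes at the center
elsewhere = F_lam([4.0, 0.0])               # F = (3, 1) plus 0.5 * (2, -2)
```

Because the added term vanishes at the centering point, re-centering at each iterate changes the subproblem without changing the function value at the current point.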
An important aspect of the algorithm is that F is perturbed only when the iterates are
not making good progress toward a zero of the merit function. In particular, during the
perturbation strategy, whenever an iterate is encountered where the merit function (of the
unperturbed problem) has been sufficiently reduced, the algorithm reverts to solving the un-
perturbed problem. Thus, near a solution, the algorithm maintains the fast local convergence
rates of the underlying NE/SQP algorithm.
We note that NE/SQP is equivalent to QPCOMP without the proximal perturbation strat-
egy. Thus, to test NE/SQP, we simply ran the QPCOMP algorithm with the proximal
perturbation strategy turned off.
3.6. PROXI
PROXI [1], like NE/SQP and QPCOMP, is based upon reformulating the MCP as the system
of nonsmooth equations (2). However, instead of solving this system using a Gauss-Newton
approach, PROXI uses a nonsmooth version of Newton’s method. Specifically, at each
iteration, the search direction is calculated by solving a linear system that approximates
H at the current iterate. Again, if H is not differentiable at the current iterate, the linear
approximation is created by choosing a particular element of the subdifferential.
Like QPCOMP, PROXI uses a proximal perturbation strategy to allow the iterates to
escape local minima of the merit function $\theta$ defined in (3). This strategy also allows the
algorithm to overcome difficulties resulting from singular Jacobian matrices. In particular,
if the Newton equation is unsolvable at a particular iteration, the algorithm simply creates
a slightly perturbed problem using (4) with a very small $\lambda$. The resulting Newton equation
for the perturbed function will then be solvable. This strategy for dealing with unsolvable
Newton subproblems is considerably more efficient than the Gauss-Newton approach used
by NE/SQP and QPCOMP.
3.7. SEMISMOOTH
which was introduced by [15]. This function has the property that
\[ \phi(a, b) = 0 \iff a \ge 0, \ b \ge 0, \ ab = 0. \]
Using this function, the NCP is reformulated as the semismooth system of equations
\[ 0 = \Phi(z), \]
where $\Phi_i(z) := \phi(z_i, F_i(z))$. This reformulation has the nice feature that the natural merit
function $\Psi(z) := \frac{1}{2}\|\Phi(z)\|^2$ is continuously differentiable. The SEMISMOOTH algorithm
described in [1] extends the approach to the MCP by using the reformulation of MCP given
by
\[ \Phi_i(z) := \phi(z_i - \ell_i, \phi(u_i - z_i, -F_i(z))). \]
\[ H^k d^k = -\Phi(z^k), \]
where $H^k$ is an element of the B-subdifferential of $\Phi$. The next point $z^{k+1}$ is then chosen
by performing a nonmonotone Armijo linesearch along the direction $d^k$.
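The complementarity function and the nested MCP reformulation can be evaluated as follows. This sketch assumes the function is the Fischer function, phi(a, b) = sqrt(a^2 + b^2) - (a + b), whose definition does not appear in this text, and it handles finite bounds only; `fischer`, `phi_mcp`, and the test function `F` are hypothetical names, and the generalized Newton step is not shown.

```python
import math

def fischer(a, b):
    """Assumed form: sqrt(a^2 + b^2) - (a + b); zero iff a >= 0, b >= 0, a*b = 0."""
    return math.hypot(a, b) - (a + b)

def phi_mcp(F, z, lower, upper):
    """Nested reformulation phi(z_i - l_i, phi(u_i - z_i, -F_i(z))) for finite bounds."""
    return [fischer(zi - li, fischer(ui - zi, -fi))
            for zi, fi, li, ui in zip(z, F(z), lower, upper)]

def F(z):
    # Hypothetical test function, not one of the library problems.
    return [z[0] - 2.0, z[1] + 1.0]

lower, upper = [0.0, 0.0], [1.0, 1.0]
at_solution = phi_mcp(F, [1.0, 0.0], lower, upper)   # both components vanish
at_interior = phi_mcp(F, [0.5, 0.5], lower, upper)   # nonzero away from a solution
```

With this sign convention the inner call is nonnegative exactly when the upper-bound condition holds, which is why the nested form needs no extra sign change.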
3.8. SEMICOMP
4. Computational Comparison
With the exception of NE/SQP and QPCOMP, each of the eight algorithms described in the
previous section was run on all of the problems in the test library from all of the starting
points. Since NE/SQP and QPCOMP were implemented using a dense QP code, we only
ran the problems with fewer than 110 variables for these solvers. Table A.1 in Appendix A
shows the execution time needed by each algorithm on a SPARC 10/51, while Table A.2,
also in Appendix A, shows the number of function and Jacobian evaluations required by
each algorithm. To abbreviate the results, we excluded any problems that were solved in
less than 2 seconds by all of the algorithms we tested.
Each algorithm minimizes its own merit function as described in Section 3 and all were
terminated when this measure was reduced below $10^{-6}$. Since the merit functions are
different for each code, we tested the solutions to ensure that the standardized residual
given by (1) was always less than $10^{-5}$. It is possible that one more or one less “Newton”
step would be carried out if the same merit function were used for every algorithm. Since
this is impractical, the method we now outline for reporting our results makes these small
changes entirely irrelevant.
How one chooses to summarize data of this nature depends on what one’s goals are.
From a modeling standpoint, one could determine which models were the most difficult
to solve by aggregating results for each model. From a computational standpoint, one can
compare the solvers using many different criteria, including number of successes/failures,
cumulative solution time required, number of cases where solution time is “acceptable”,
number of function/gradient evaluations required, etc. As examples of useful metrics, we
have chosen the following:
                   MILES  NE/SQP  PATH  PROXI  QPCOMP  SEMICOMP  SEMISMOOTH  SMOOTH
very competitive     32%      0%   43%    34%      0%       26%         25%       5
competitive          45%      2%   67%    54%      1%       44%         40%       1
success              84%     67%   94%    95%     90%       88%         65%     92%
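Summary percentages of this kind can be computed from raw timing rows with a short script. The sketch below uses the six pgvon106 rows of the Appendix A timing table for three of the solvers, with `None` standing for "fail". The rule for "competitive" (a time within `factor` of the best time on that run) and the value `factor=2.0` are hypothetical stand-ins, because the paper's own definitions do not appear in this text; `summarize` is likewise an illustrative name.

```python
def summarize(times, factor=2.0):
    """Fraction of runs solved, and fraction solved within `factor` of the best time."""
    solvers = list(times)
    n_runs = len(times[solvers[0]])
    counts = {s: {"success": 0, "competitive": 0} for s in solvers}
    for run in range(n_runs):
        solved = [times[s][run] for s in solvers if times[s][run] is not None]
        if not solved:
            continue                          # no solver succeeded on this run
        best = min(solved)
        for s in solvers:
            t = times[s][run]
            if t is None:
                continue
            counts[s]["success"] += 1
            if t <= factor * best:            # hypothetical competitiveness rule
                counts[s]["competitive"] += 1
    return {s: {k: v / n_runs for k, v in c.items()} for s, c in counts.items()}

# pgvon106, starting points 1 to 6 (seconds; None means the solver failed).
times = {
    "MILES":  [None, None, None, None, 5.33, None],
    "PATH":   [19.77, 1.80, 1.29, None, None, None],
    "SMOOTH": [125.46, 5.37, 8.48, None, None, 3.76],
}
stats = summarize(times)
```

On this small subset, and comparing only these three solvers, PATH is fast whenever it succeeds, while SMOOTH succeeds more often but is within a factor of two of the best time on only one run.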
5. Conclusions
The testing environment we have described addresses many of the problems we have ob-
served about how codes are developed and tested. In particular, with a large collection
of test problems available, it is more difficult to tune a code to the test set. Moreover,
even if such tuning is successful, the resulting code will be good at solving the types of
problems that are represented in the library, namely, the problems that are of interest to
the user community. The inclusion of problems with known difficulties allows codes to
be compared by how well they solve different classes of problems, thus allowing users to
more accurately choose codes that meet their needs. Finally, by categorizing problems with
different computational difficulties, the library can be used to highlight the areas where
research energies most need to be directed.
Our testing indicates superior performance by the PATH, SMOOTH, and PROXI algo-
rithms. However, as the codes continue to mature, it is possible that their relative perfor-
mance will change. It is not our intention to declare a winner, but rather to “clarify the
rules” so that code developers will focus on the right issues when developing algorithms.
To a large extent, we have accomplished this with our testing environment.
It is unfortunate that the scope of our testing could not have been more broad. Some of
the algorithms mentioned above were coded by the authors of this paper (not the originators
of the algorithm), while there are numerous other algorithms that we were not able to test at
all. This is due primarily to the fact that these algorithms do not have GAMS interfaces. It
is our hope that as the CPLIB interface becomes more widely known, other code developers
will hook up their solvers to GAMS. This will allow their algorithms to be easily compared
with other codes using our testing environment.
Lastly, we wish to emphasize that the test library is continually being expanded. In
particular, we are always eager to add challenging new real world models to the library. To
this end, we have begun to augment the MCPLIB by adding new models that have recently
come to our attention. The 10 models listed in Table 5 have been used in various disciplines
to answer questions that give rise to complementarity problems. Some of these models
are solved from many different starting points, indicated by the “solves” column. The
first six are economic models; the next two arise from applications in traffic equilibrium and
multi-rigid-body contact problems; the final two correspond to complementarity problems
for which all solutions are required. The numbers of solutions for the last two problems
are known to be odd; the number listed below is a lower bound. These problems appear to
be more difficult than most of the problems solved in this paper. Certainly, some are much
larger, while others have singularities either at solutions or starting points. Most of these
problems do not have underlying monotonicity.
The results that we present in Tables 6 and 7 for these models are somewhat different from
the results in Appendix A and are motivated more by the models themselves. For the games
and tinloi models, it is important to find all solutions of the model, and so after a fixed
number of runs from a variety of starting points, we report the number of distinct solutions
found for these models in Table 6.
Table 6. Distinct solutions found for the games and tinloi models (only fragments survive: column labels MILES, PROXI, S/COMP, S/SMTH, SMOOTH; row label tinloi; entries 3, 2, 33).
For the remaining problems, we just report one statistic in Table 7 for each model. If
every problem was solved, we report the total resources used to solve the complete model,
otherwise we report an error using a letter to signify some sort of failure “F”, memory error
“M”, time limit exceeded “T” or iteration limit exceeded “I”. Only the first error is listed
per problem, while the numbers in parentheses are the number of problems that failed to
solve.
Table 7. Summary for New Models
Model       PATH
shubik      F(9)
            110.81     214.32
asean9a     62.08      91.62
            203.79     239.73
Uruguay     2760.17    4519.53
hanson      39.36      4.94
trafelas    150.55     346.23
lincont     10.76      718.27
It is our intention to add these models and newer models that are brought to our attention
to MCPLIB. In this way we hope that the problem library will continue to serve as a guide
for code developers so that they will direct their energies into areas that will best serve the
users.
Appendix A
Problem    St.     MILES    NE/SQP      PATH     PROXI    QPCOMP  SEMICOMP  SEMISMTH      SMTH
pgvon105     3      fail     33.47      1.58     52.13     58.80      fail      fail      fail
pgvon105     4      fail      fail      fail      fail      fail     28.09      fail      fail
pgvon106     1      fail      fail     19.77     13.21      fail      fail      fail    125.46
pgvon106     2      fail      fail      1.80      fail      fail      fail      fail      5.37
pgvon106     3      fail      fail      1.29      fail      fail      fail      fail      8.48
pgvon106     4      fail      fail      fail      2.46      fail     38.30      fail      fail
pgvon106     5      5.33      fail      fail      fail      fail      fail      fail      fail
pgvon106     6      fail      fail      fail      fail      fail      fail      fail      3.76
pies         1      0.07      fail      0.13      0.29      7.26      0.11      0.13      0.27
sammge       1      0.07      fail      0.01      0.01      fail      0.00      0.00      0.00
sammge       3      0.10      0.27      0.05      0.16      0.26      0.17      fail      0.18
sammge       5      0.12      0.42      0.07      0.12      0.48      0.36      fail      0.13
sammge       6      0.13      0.45      0.05      0.27      0.58      0.40      fail      0.13
sammge       7      0.18      0.69      0.06      0.13      0.58      0.23      fail      0.20
sammge       8      0.10      0.78      0.05      0.63      0.74      0.39      fail      0.19
sammge       9      0.10      0.71      0.07      0.45      0.69      0.65      fail      0.20
sammge      10      0.17      fail      0.01      0.01      fail      0.01      0.00      0.01
sammge      13      0.05      0.30      0.12      0.20      0.28      0.23      fail      0.23
sammge      14      0.12      0.29      0.11      0.17      0.35      0.23      fail      0.18
sammge      15      0.05      0.27      0.06      0.48      0.31      0.38      fail      0.25
sammge      16      0.05      0.47      0.11      0.26      0.46      0.31      fail      0.10
sammge      17      0.10      0.62      0.09      0.57      1.05      0.20      fail      0.17
sammge      18      0.08      0.37      0.11      0.46      0.45      0.50      fail      0.16
scarfasum    2      0.15      fail      0.04      0.15      1.51      0.15      0.12      0.10
scarfasum    3      0.13      0.29      0.07      0.15      0.37      fail      fail      0.05
scarfbnum    1      0.08      6.27      0.39      0.57      6.42      1.01      fail      0.32
scarfbnum    2      0.10      6.01      0.44      0.43      6.09      7.36      fail      0.32
scarfbsum    1      0.17      fail      fail      0.49      8.77      0.39      0.31      0.24
scarfbsum    2      0.18      fail      3.43      5.16     31.11      1.22      fail      0.66
threemge     7      0.08         -      0.06      fail         -      0.14      0.13      0.05
threemge     8      0.07         -      0.06      fail         -      0.12      0.14      0.05
threemge    11      0.12         -      0.05      fail         -      0.82      fail      0.05
transmcp     1      0.03      fail      0.04      0.09      1.22      0.23      fail      0.05
transmcp     2      0.03      fail      0.01      0.00      fail      0.00      fail      0.00
transmcp     3      0.03      0.02      0.02      0.01      0.02      0.03      fail      0.02
transmcp     4      0.03      0.11      0.02      0.01      0.10      0.04      fail      0.02
vonthmcp     1      fail         -      fail      fail         -      fail      fail      fail
vonthmge     1      0.08      fail      1.06      fail      fail      fail      fail     17.14
LARGE SCALE MIXED COMPLEMENTARITY PROBLEM SOVLERS 21
References
1. S. C. Billups. Algorithmsfor Complementarity Problems and Generalized Equations. PhD thesis, University
of Wisconsin-Madison, Madison, Wisconsin, August 1995.
2. S. C. Billups and M. C. Ferris. QPCOMP: A quadratic program based solver for mixed complementarity
problems. Mathematical Programming forthcoming, 1996.
3. A. Brooke, D. Kendrick, and A. Meeraus. GAMS: A User’s Guide. The Scientific Press, South San Francisco,
CA, 1988.
4. Chunhui Chen and 0. L. Mangasarian. A class of smoothing functions for nonlinear and mixed comple-
mentarity problems. Computational Optimization and Applications, 5:97-138, 1996.
5. R. W. Cottle. Nonlinear programs with positively bounded Jacobians. PhD thesis, Department of Mathe-
matics, University of California, Berkeley, California, 1964.
6. T. De Luca, F. Facchinei, and C. Kanzow. A semismooth equation approach to the solution of nonlinear com-
plementarity problems. Preprint 93, Institute of Applied Mathematics, University of Hamburg, Hamburg,
January 1995.
7. S. P. Dirkse and M. C. Fems. MCPLIB: A collection of nonlinear mixed complementarity problems.
Optimization Methods and Software, 5:3 19-345, 1995.
8. S. P. Dirkse and M. C. Fenis. The PATH solver: A non-monotone stabilization scheme for mixed comple-
mentarity problems. Optimization Methods and Sofmare, 5 :123-156, 1995.
9. S. P. Dirkse, M. C. Fenis, P. V. Preckel, and T. Rutherford. The GAMS callable program library for variational
and complementarity solvers. Mathematical Programming Technical Report 94-07, Computer Sciences
Department, University of Wisconsin, Madison, Wisconsin, 1994. Available from ftp://ftp.cs.wisc.edu/math-
proghech-report d.
10. F. Facchinei and J. Soares. A new merit function for nonlinear complementarity problems and a related
algorithm. SIAM Journal on Optimization, forthcoming 1996.
11. E Facchinei and J. Soares. Testing a new class of algorithms for nonlinear complementarity problems. In
F. Giannessi and A. Maugeri, editors, Variational Inequalities and Network Equilibrium Problems, pages
69-83. Plenum Press, New York, 1995.
12. M. C. Fems and J. S. Pang. Engineering and economic applications of complementarity problems. Discussion
Papers in Economics 9 5 4 , Department of Economics, University of Colorado, Boulder, Colorado, 1995.
Available from ftp://ftp.cs.wisc.edu/math-prog/tech-reportd.
13. M. C. Ferris and D. Ralph. Projected gradient methods for nonlinear complementarity problems via normal
maps. In D. Du, L. Qi, and R. Womersley, editors, Recent Advances in Nonsmooth Optimization, pages
57-87. World Scientific Publishers, 1995.
14. M. C. Fems and T. E Rutherford. Accessing realistic complementarity problems within Matlab. In G. Di
Pill0 and F. Giannessi, editors, Proceedings of Nonlinear Optimization and Applications Workshop, Erice
June 1995, Plenum Press, New York, 1996.
15. A. Fischer. A special Newton-type optimization method. optimization, 24:269-284, 1992.
16. A. Fischer and C. Kanzow. On finite termination of an iterative method for linear complementarity problems.
Preprint MATH-NM-10-1994, Institute for Numerical Mathematics, Technical University of Dresden,
Dresden, Germany, 1994.
17. R. Fourer, D. Gay, and B. Kernighan. AMPL. The Scientific Press, South San Francisco, California, 1993.
18. C. B. Garcia and W. I. Zangwill. Pathways to Solutions, Fixed Points, and Equilibria. Prentice-Hall, Inc,
Englewood Cliffs, New Jersey, 198 1.
19. C. Geiger and C. Kanzow. On the resolution of monotone complementarity problems. Computational
Optimization and Applications, 5 : 155-173, 1996.
20. S.-P. Han, J. S. Pang, and N. Rangaraj. Globally convergent Newton methods for nonsmooth equations.
Mathematics of Operations Research, 17586-607, 1992.
21. P. T. Harker and B. Xiao. Newton’s method for the nonlinear complementarity problem: A B-differentiable
equation approach. Mathematical Programming, 48:339-358, 1990.
22. Lars Mathiesen. Computation of economic equilibria by a sequence of linear complementarity problems.
Mathematical Programming Study, 23: 144-162, 1985.
23. J. J. Mor6. Global methods for nonlinear complementarity problems. Technical Report MCS-P429-0494,
Argonne National Laboratory, Argonne, Illinois, April 1994.
LARGE SCALE MIXED COMPLEMENTARITY PROBLEM SOVLERS 25
24. J. S . Pang. Newton’s method for B-differentiable equations. Mathematics of Operations Research, 15:31 1-
341, 1990.
25. J. S . Pang. A B-differentiable equation based, globally and locally quadratically convergent algorithm for
nonlinear programs, complementarity and variational inequality problems. Mathematical Programming,
5 1 :101-132, 1991.
26. J. S . Pang and S . A. Gabriel. NE/SQP: A robust algorithm for the nonlinear complementarity problem.
Mathematical Programming, 601295-338, 1993.
27. D. Ralph. Global convergence of damped Newton’s method for nonsmooth equations, via the path search.
Mathematics of Operations Research, 19:352-389, 1994.
28. S . M. Robinson. Generalized equations. In A. Bachem, M. Grotchel, and B. Korte, editors, Mathematical
Programming: The State of the Art, Bonn 1982, pages 346-367. Springer Verlag, Berlin, 1983.
29. S. M. Robinson. Normal maps induced by linear transformations. Mathematics of Operations Research,
I7:691-7 14, 1992.
30. T. E Rutherford. MILES: A mixed inequality and nonlinear equation solver. Working Paper, Department
of Economics, University of Colorado, Boulder, 1993.
3 1. T. E Rutherford. Extensions of GAMS for complementarity problems arising in applied economic analysis.
Journal of Economic Dynamics and Control,forthcoming, 1996.
32. H. E. Scarf. The approximation of fixed points of a continuous mapping. SIAM Journal on Applied
Mathemutics, 15:1328-1343, 1967.
33. M. J. Todd. Computation of Fixed Points and Applications, volume 124 of Lecture Notes in Economics and
Mathematical Systems. Springer-Verlag, Heidelberg, 1976.
34. B. Xiao and P. T. Harker. A nonsmooth Newton method for variational inequalities: I: Theory. Mathematical
Programming, 65:151-194, 1994.
35. B. Xiao and P. T. Harker. A nonsmooth Newton method for variational inequalities: 11: Numerical results.
Mathemutical Programming, 65 :195-2 16, 1994.
Computational Optimization and Applications, 7, 27-40 (1997)
© 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Impact of Partial Separability on Large-Scale Optimization*
ALI BOUARICHA AND JORGE J. MORÉ
Abstract. ELSO is an environment for the solution of large-scale optimization problems. With ELSO the user
is required to provide only code for the evaluation of a partially separable function. ELSO exploits the partial
separability structure of the function to compute the gradient efficiently using automatic differentiation. We
demonstrate ELSO's efficiency by comparing the various options available in ELSO. Our conclusion is that the
hybrid option in ELSO provides performance comparable to the hand-coded option, while having the significant
advantage of not requiring a hand-coded gradient or the sparsity pattern of the partially separable function. In
our test problems, which have carefully coded gradients, the computing time for the hybrid AD option is within a
factor of two of the hand-coded option.
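1. Introduction
ELSO is an environment for the solution of large-scale minimization problems
$$ \min \{ f_0(x) : x \in \mathbb{R}^n \}, \tag{1} $$
where $f_0 : \mathbb{R}^n \to \mathbb{R}$ is partially separable, that is, $f_0$ can be written as
$$ f_0(x) = \sum_{i=1}^{m} f_i(x), \tag{2} $$
where each element function $f_i$ depends only on a few components of $x$, and $m$ is the number
of element functions. Algorithms and software that take advantage of partial separability
have been developed for various problems (for example, [11, 19, 20, 17, 21, 22, 10]), but
this software requires that the user provide the gradient of $f_0$. An important design goal of
ELSO is to avoid this requirement.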
* Work supported by the Mathematical, Information, and Computational Sciences Division subprogram of the
Office of Computational and Technology Research, U.S. Department of Energy, under Contract W-31-109-Eng-38,
and by the National Science Foundation, through the Center for Research on Parallel Computation, under
Cooperative Agreement No. CCR-9120008.
For small-scale problems we can approximate the gradient by differences of function
values, for example,
$$ [\nabla f_0(x)]_i \approx \frac{f_0(x + h_i e_i) - f_0(x)}{h_i}, \qquad 1 \le i \le n, $$
where $h_i$ is the difference parameter and $e_i$ is the $i$-th unit vector, but this approximation
suffers from truncation errors, which can cause premature termination of an optimization
algorithm far away from a solution. We also note that, even for moderately sized problems
with $n \ge 100$ variables, use of this approximation is prohibitive because it requires $n$
function evaluations for each gradient. For these reasons, the accurate and efficient evaluation
of the gradient is essential for the solution of optimization problems.
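The cost argument can be illustrated with a minimal Python sketch. It is not part of ELSO; the names `fd_gradient` and `chain`, the test function, and the step `h` are assumptions chosen for illustration. Besides the evaluation at the base point, the approximation needs one function evaluation per variable.

```python
def fd_gradient(f0, x, h=1e-6):
    """Forward-difference approximation to the gradient of f0 at x."""
    fx = f0(x)                      # evaluation at the base point
    grad = []
    for i in range(len(x)):         # one extra evaluation per variable
        xp = list(x)
        xp[i] += h                  # x + h * e_i
        grad.append((f0(xp) - fx) / h)
    return grad


evaluations = 0


def chain(x):
    """Hypothetical test function: sum of (x[i+1] - x[i])**2."""
    global evaluations
    evaluations += 1
    return sum((x[i + 1] - x[i]) ** 2 for i in range(len(x) - 1))


approx = fd_gradient(chain, [0.0, 1.0, 3.0])
exact = [-2.0, -2.0, 4.0]           # hand-computed gradient at this point
print(approx, evaluations)
```

For this three-variable example the sketch performs four evaluations of `chain`, and the result agrees with the hand-computed gradient only up to a truncation error of the order of `h`.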
ELSO is able to solve large-scale unconstrained optimization problems, while requiring
only that the user provide the function in partially separable form. This is an important
advantage over standard software that requires the specification of the gradient and the
sparsity pattern of the partially separable function, that is,
$$ \mathcal{S} = \{ (i,j) : f_i \text{ depends on } x_j \}. \tag{3} $$
ELSO exploits the partial separability structure of the function to compute the gradient
efficiently by using automatic differentiation (AD). The current version of ELSO incorporates
four different approaches for computing the gradient of a partially separable function
in the context of large-scale optimization software. These approaches are hand-coded,
compressed AD, sparse AD, and hybrid AD. In our work we have been using the ADIFOR
(Automatic Differentiation of Fortran) tool [4, 6] and the SparsLinC (Sparse Linear
Combination) library [5, 6], but other differentiation tools can be used.
We demonstrate ELSO's efficiency by comparing the compressed AD, sparse AD, and
hybrid AD options with the hand-coded approach. Our conclusion is that the performance
of the hybrid AD option is comparable with the compressed AD option and that the per-
formance penalty over the hand-coded option is acceptable for carefully coded gradients.
In our test problems, which have carefully coded gradients, the computing time for the
hybrid AD option is within a factor of two of the hand-coded option. Thus, the hybrid AD
option provides near-optimal performance, while providing the significant advantage of not
requiring a hand-coded gradient or the sparsity pattern of the partially separable function.
We describe in Section 2 the different approaches used by ELSO to compute the gradient of
a partially separable function. In Section 3 we provide a brief description of the MINPACK-
2 large-scale problems and show how to convert these problems into partially separable
problems. In Section 4 we compare and analyze the performance of large-scale optimization
software using the different options available in ELSO. We present results for both a
superscalar architecture (IBM RS6000) and a vector architecture (Cray C90). Our results
on the Cray C90 are of special interest because they show that if the hand-coded gradient
does not run at vector speeds, the hybrid AD option can outperform the hand-coded option.
Finally, we present our conclusions in Section 5.
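ELSO relies on the representation (2) to compute the gradient of a partially separable
function. Given this representation of $f_0 : \mathbb{R}^n \to \mathbb{R}$, we can compute the gradient of $f_0$
by noting that if the mapping $f : \mathbb{R}^n \to \mathbb{R}^m$ is defined by
$$ f(x) = \begin{pmatrix} f_1(x) \\ \vdots \\ f_m(x) \end{pmatrix}, \tag{4} $$
then $f_0(x) = e^T f(x)$, where $e \in \mathbb{R}^m$ is the vector of all ones, and thus
$$ \nabla f_0(x) = f'(x)^T e. \tag{5} $$
The gradient can therefore be obtained from the Jacobian matrix $f'(x)$, which is sparse
because each $f_i$ depends on only a few components of $x$. For efficiency we require that
$$ T\{f'(x)\} \le \Omega_T \, T\{f(x)\}, \tag{6} $$
$$ M\{f'(x)\} \le \Omega_M \, M\{f(x)\}, \tag{7} $$
where $T\{\cdot\}$ and $M\{\cdot\}$ denote computing time and memory, respectively, and $\Omega_T$ and
$\Omega_M$ are small constants; if the function $f_0$ is defined by a discretization of a continuous
problem, we also wish the constants to be independent of the mesh size. Any automatic
differentiation tool can be used to compute $f'(x)$ and thus the gradient of $f_0$, but efficiency
requires that we insist on (6) and (7).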
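Because $f_0$ is the sum of the element functions, its gradient is the sum of the rows of the Jacobian of the element functions. The following minimal Python sketch shows this for a hypothetical chain of elements $f_i(x) = (x_{i+1} - x_i)^2$; the function names and the sparse row layout (one dictionary per row) are illustrative assumptions, not ELSO data structures.

```python
def element_jacobian_rows(x):
    """Sparse rows of f'(x) for the hypothetical elements f_i(x) = (x[i+1] - x[i])**2."""
    rows = []
    for i in range(len(x) - 1):
        d = x[i + 1] - x[i]
        # Row i has only two nonzeros because f_i depends only on x[i] and x[i+1].
        rows.append({i: -2.0 * d, i + 1: 2.0 * d})
    return rows


def gradient_from_jacobian(rows, n):
    """Gradient of f_0 = sum of the elements: add up the rows of the Jacobian."""
    grad = [0.0] * n
    for row in rows:
        for j, value in row.items():
            grad[j] += value
    return grad


x = [0.0, 1.0, 3.0]
grad = gradient_from_jacobian(element_jacobian_rows(x), len(x))
print(grad)
```

The work is proportional to the number of nonzeros in the Jacobian, which is why the sparsity induced by partial separability matters.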
Automatic differentiation tools can be classified roughly according to their use of the
forward or the reverse mode of automatic differentiation. See, for example, the survey of
Juedes [16]. Automatic differentiation tools that use the forward mode generate code for
the computation of $f'(x)V$ for any $V \in \mathbb{R}^{n \times p}$. If $L\{f\}$ and $M\{f\}$ are, respectively, the
number of floating-point operations and the amount of memory required by the computation
of $f(x)$, then an AD-generated code employing the forward mode requires
$$ L\{f'(x)V\} \le (2 + 3p)\, L\{f\}, \qquad M\{f'(x)V\} \le (1 + p)\, M\{f\} $$
floating-point operations and memory, respectively, to compute $f'(x)V$. For many large-scale
problems we can obtain the Jacobian matrix $f'(x)$ by computing $f'(x)V$ for a matrix
$V \in \mathbb{R}^{n \times p}$ with $p$ small. Thus, in this case, an automatic differentiation tool based on
the forward mode satisfies (6) and (7). We elaborate on this point when we discuss the
compressed AD approach.
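The forward mode can be sketched in a few lines of Python by propagating $p$ directional derivatives alongside each value. This is a toy illustration with operator overloading, not the source transformation that ADIFOR performs; the class `Dual`, the helper `jacobian_times_V`, and the example mapping are assumptions.

```python
class Dual:
    """A value together with p directional derivatives (one per column of V)."""

    def __init__(self, val, dot):
        self.val = val
        self.dot = list(dot)

    def _lift(self, other):
        # Constants carry zero derivatives.
        return other if isinstance(other, Dual) else Dual(other, [0.0] * len(self.dot))

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, [a + b for a, b in zip(self.dot, o.dot)])

    __radd__ = __add__

    def __sub__(self, other):
        o = self._lift(other)
        return Dual(self.val - o.val, [a - b for a, b in zip(self.dot, o.dot)])

    def __rsub__(self, other):
        return self._lift(other) - self

    def __mul__(self, other):
        o = self._lift(other)
        # Product rule applied to each of the p directions.
        return Dual(self.val * o.val,
                    [self.val * b + o.val * a for a, b in zip(self.dot, o.dot)])

    __rmul__ = __mul__


def jacobian_times_V(f, x, V):
    """Return f(x) and f'(x) V, where V is given as n rows of length p."""
    xs = [Dual(xi, V[i]) for i, xi in enumerate(x)]
    ys = f(xs)
    return [y.val for y in ys], [y.dot for y in ys]


def example(v):
    """Hypothetical mapping with three component functions."""
    return [v[0] * v[1], v[1] - v[2], v[2] * v[2]]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values, jac = jacobian_times_V(example, [1.0, 2.0, 3.0], identity)
_, jv = jacobian_times_V(example, [1.0, 2.0, 3.0], [[1.0], [1.0], [1.0]])
print(values, jac, jv)
```

Every arithmetic operation updates `p` derivative components, so both work and storage grow with `p`; this is why a seed matrix `V` with few columns is attractive.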
Automatic differentiation tools that use the reverse mode generate code for the computation
of $W^T f'(x)$ for any $W \in \mathbb{R}^{m \times q}$. We can also use the reverse mode to compute
$f'(x)$, but since the reverse mode reverses the partial order of program execution and
remembers (or recomputes) any intermediate result that affects the final result, the complexity
of the reverse mode is harder to predict. In general, the reverse mode requires $O(L\{f\})$
floating-point operations and up to $O(L\{f\} + M\{f\})$ memory, depending on the code.
In particular, there is no guarantee that (7) is satisfied. Griewank [12, 13] has discussed
how to improve the performance of the reverse mode, but at present the potential memory
demands of the reverse mode are a disadvantage. For additional information on automatic
differentiation, see the proceedings edited by Griewank and Corliss [14]; the paper of Iri
[15] is of special interest because he discusses the complexity of both the forward and the
reverse modes of automatic differentiation.
In ELSO we have used the ADIFOR [4, 6] tool and the SparsLinC library [5, 6] because,
from a computational viewpoint, they provide all the flexibility and efficiency desired on
practical problems. Indeed, Bischof, Bouaricha, Khademi, and Moré [3] have shown that the
ADIFOR tool can satisfy (6) and (7) on large-scale variational problems.
We now outline the three approaches used by ELSO to compute the gradient of $f_0$. As
we shall see, all these approaches have advantages and disadvantages in terms of ease of
use, applicability, and computing time.
In the compressed AD approach we assume that the sparsity pattern of the Jacobian matrix
$f'(x)$ is known for all vectors $x \in \mathcal{D}$, where $\mathcal{D}$ is a region where all the iterates are known
to lie. For example, $\mathcal{D}$ could be the set
$$ \mathcal{D} = \{ x \in \mathbb{R}^n : f_0(x) \le f_0(x_0) \}, $$
where $x_0$ is the initial starting point. Thus, in the compressed AD approach we assume that
the closure of the sparsity pattern is known. The sparsity pattern $S(x)$ for $f'(x)$ at a given
$x \in \mathcal{D}$ is just the set of indices
$$ S(x) = \{ (i,j) : \partial_j f_i(x) \ne 0 \}, $$
and the closure of the sparsity pattern of $f'(x)$ in the region $\mathcal{D}$ is
$$ \bigcup \{ S(x) : x \in \mathcal{D} \}. $$
To determine the closure of the sparsity pattern, we are required to know how the function
$f_0$ depends on the variables. When $f$ is given by (4), a pair $(i, j)$ is in the closure of the
sparsity pattern if and only if $f_i$ depends on $x_j$. Hence, the closure of the sparsity pattern
is the sparsity pattern (3) of the partially separable function when $x$ is restricted to lie in $\mathcal{D}$.
Determining the closure of the sparsity pattern is straightforward for problems with a
fixed structure, for example, for finite element problems where the triangulation is fixed
during the iteration. This is the case for the problems considered in Section 3. If, on the
other hand, the structure evolves over time, then the sparsity pattern is likely to change
as the iteration progresses. In these cases we must be able to detect these changes and
recompute the sparsity pattern. This is the topic of current research.
Given the sparsity pattern of $f'(x)$, we can determine the Jacobian matrix $f'(x)$ if we
partition the columns of the Jacobian matrix into groups of structurally orthogonal columns,
that is, columns that do not have a nonzero in the same row position. In our work we employ
the partitioning software described by Coleman, Garbow, and Moré [8, 7].
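The idea of structurally orthogonal groups can be sketched with a simple greedy pass over the columns. This is only an illustration under stated assumptions: it is not the algorithm of the partitioning software cited above, and the names `column_patterns` and `partition_columns` and the chain-structured example are hypothetical.

```python
def column_patterns(elements, n):
    """For each column j, collect the rows (element functions) that depend on x_j."""
    cols = [set() for _ in range(n)]
    for i, variables in enumerate(elements):
        for j in variables:
            cols[j].add(i)
    return cols


def partition_columns(cols):
    """Greedily assign each column to the first group with no row in common."""
    group_rows = []                 # rows already occupied in each group
    ngrp = []                       # group index of each column (0-based)
    for rows in cols:
        for g, used in enumerate(group_rows):
            if used.isdisjoint(rows):
                used |= rows        # the column joins group g
                ngrp.append(g)
                break
        else:
            group_rows.append(set(rows))
            ngrp.append(len(group_rows) - 1)
    return ngrp, len(group_rows)


# Hypothetical chain structure: element i depends on x_i and x_{i+1}.
elements = [(i, i + 1) for i in range(5)]
ngrp, p = partition_columns(column_patterns(elements, 6))
print(ngrp, p)
```

For the chain example two groups suffice regardless of the number of variables, which is the situation in which a seed matrix with few columns recovers the whole Jacobian.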
      ! Column j of the Jacobian is stored in column ngrp(j) of c_fjac;
      ! indrow(jpntr(j)), ..., indrow(jpntr(j+1)-1) are its nonzero rows.
      do j = 1, n
         grad(j) = 0.0
         do k = jpntr(j), jpntr(j+1)-1
            i = indrow(k)
            grad(j) = grad(j) + c_fjac(i,ngrp(j))
         enddo
      enddo
Figure 1. Computing $\nabla f_0(x)$ from the compressed Jacobian array c_fjac
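The loop of Figure 1 can be mirrored in Python with 0-based indices. The array names follow the figure; the small data set below (a compressed Jacobian for a hypothetical three-variable chain problem with two groups) is an assumption made only to exercise the loop.

```python
def gradient_from_compressed(c_fjac, indrow, jpntr, ngrp):
    """Sum the entries of each Jacobian column, read from the compressed array."""
    n = len(ngrp)
    grad = [0.0] * n
    for j in range(n):
        # Rows of the nonzeros in column j are indrow[jpntr[j]:jpntr[j+1]].
        for k in range(jpntr[j], jpntr[j + 1]):
            i = indrow[k]
            grad[j] += c_fjac[i][ngrp[j]]
    return grad


# Hypothetical data: f_1 = (x_2 - x_1)**2, f_2 = (x_3 - x_2)**2 at x = (0, 1, 3).
ngrp = [0, 1, 0]                    # columns 1 and 3 share a group
indrow = [0, 0, 1, 1]               # row indices, stored column by column
jpntr = [0, 1, 3, 4]                # start of each column in indrow
c_fjac = [[-2.0, 2.0],              # row 1: entries for groups 1 and 2
          [4.0, -4.0]]              # row 2: entries for groups 1 and 2
grad = gradient_from_compressed(c_fjac, indrow, jpntr, ngrp)
print(grad)
```

The compressed array has only as many columns as there are groups, so the storage is m by p rather than m by n.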
For the sparse AD approach we need an automatic differentiation tool that takes advantage
of sparsity when $V$ and most of the vectors involved in the computation of $f'(x)V$ are
sparse. We also require the sparsity pattern of $f'(x)V$ as a by-product of this computation.
At present, the SparsLinC library [5, 6] is the only tool that addresses this situation, but we
expect that others will emerge.
The main advantage of the sparse AD approach over the compressed AD approach is that
no knowledge of the sparsity pattern is required. A disadvantage, however, is that because
of the need to maintain dynamic data structures for sparse vectors, the sparse AD approach
usually runs slower than the compressed AD approach.
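The trade-off can be seen in a toy Python sketch in which every derivative vector is stored as a dictionary of nonzeros. This only illustrates the idea of sparse derivative vectors; it does not reproduce the SparsLinC interface, and the class `SparseDual` and the function `sparse_jacobian` are hypothetical names.

```python
class SparseDual:
    """Value plus a sparse derivative vector {variable index: partial derivative}."""

    def __init__(self, val, dot=None):
        self.val = val
        self.dot = dict(dot) if dot else {}

    @staticmethod
    def _lift(other):
        return other if isinstance(other, SparseDual) else SparseDual(other)

    def _linear(self, other, a, b, val):
        # Derivative of the result is a * self.dot + b * other.dot; only
        # indices that occur in either operand are touched.
        dot = {j: a * d for j, d in self.dot.items()}
        for j, d in other.dot.items():
            dot[j] = dot.get(j, 0.0) + b * d
        return SparseDual(val, dot)

    def __add__(self, other):
        o = self._lift(other)
        return self._linear(o, 1.0, 1.0, self.val + o.val)

    __radd__ = __add__

    def __sub__(self, other):
        o = self._lift(other)
        return self._linear(o, 1.0, -1.0, self.val - o.val)

    def __rsub__(self, other):
        return self._lift(other) - self

    def __mul__(self, other):
        o = self._lift(other)
        return self._linear(o, o.val, self.val, self.val * o.val)

    __rmul__ = __mul__


def sparse_jacobian(f, x):
    """Return the rows of f'(x) as dicts, and the index pairs that were stored."""
    xs = [SparseDual(xi, {j: 1.0}) for j, xi in enumerate(x)]
    rows = [y.dot for y in f(xs)]
    pattern = {(i, j) for i, row in enumerate(rows) for j in row}
    return rows, pattern


rows, pattern = sparse_jacobian(lambda v: [v[0] * v[1], v[2] * v[2] - v[1]],
                                [1.0, 2.0, 3.0])
print(rows, sorted(pattern))
```

No sparsity information is supplied by the caller, and the pattern falls out of the computation, but each operation manipulates a dynamic data structure instead of a fixed-length array.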
Numerical results [3] with ADIFOR and SparsLinC show that the compressed AD approach
outperforms the sparse AD approach on various architectures.
As stated in the introduction, an important design goal of ELSO is to avoid asking the user
to provide code for the evaluation of the gradient or the sparsity pattern of the partially
separable function. We can achieve this goal by using the sparse AD option. However, as
noted above, this imposes a heavy performance penalty on the user.
In an optimization algorithm we can avoid this performance penalty by first using the
sparse AD option, to obtain the sparsity pattern of the function, and then using the com-
pressed AD option. This strategy must be used with care. We should not use the sparse
AD option to obtain the sparsity pattern at the starting point because the starting point is
invariably special, and not representative of a general point in the region $\mathcal{D}$ of interest. In
particular, there are usually many symmetries in the starting point that are not necessarily
present in intermediate iterates.
We can also use the sparse AD option for a number of iterates until we feel that any
symmetries present in the starting point have been removed by the optimization algorithm.
This strategy is not satisfactory, however, because optimization algorithms tend to retain
symmetries for many iterations, possibly for all the iterates.
The current strategy in ELSO is to randomly perturb every component of the user’s initial
point, and compute the sparsity pattern at the perturbed point. This destroys any symmetries
in the original iterates, and the resulting sparsity pattern is likely to be the closure of the
sparsity pattern in $\mathcal{D}$.
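The effect of a symmetric starting point, and of perturbing it, can be sketched in Python. The element functions, the perturbation rule in `perturb` (relative size, seed), and all names are assumptions for illustration; the text above does not specify how ELSO generates the perturbation.

```python
import random


def jacobian_rows(x):
    """Rows of f'(x) for the hypothetical elements f_i(x) = (x[i+1] - x[i])**2."""
    rows = []
    for i in range(len(x) - 1):
        d = x[i + 1] - x[i]
        rows.append({i: -2.0 * d, i + 1: 2.0 * d})
    return rows


def numeric_pattern(rows):
    """Index pairs (i, j) whose Jacobian entry is nonzero at this point."""
    return {(i, j) for i, row in enumerate(rows) for j, v in row.items() if v != 0.0}


def perturb(x0, rel=1e-2, seed=0):
    """Randomly perturb every component (the perturbation rule is an assumption)."""
    rng = random.Random(seed)
    return [xi + rel * max(1.0, abs(xi)) * rng.uniform(-1.0, 1.0) for xi in x0]


x0 = [1.0, 1.0, 1.0, 1.0]                  # symmetric starting point
at_start = numeric_pattern(jacobian_rows(x0))
at_perturbed = numeric_pattern(jacobian_rows(perturb(x0)))
closure = {(i, j) for i in range(3) for j in (i, i + 1)}
print(len(at_start), len(at_perturbed), len(closure))
```

At the symmetric point every Jacobian entry vanishes, so the detected pattern is empty, while the pattern at the perturbed point coincides with the closure for this example.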
This strategy may fail if the closure of the sparsity pattern in a neighborhood of the initial
iterate is different from the sparsity pattern in a neighborhood of the solution. For most
optimization problems, this does not occur. Even when it does occur, the computation fails
only if some entries in the current sparsity pattern are not present in the previously computed
sparsity pattern. The justification of this remark comes about by noting that the compressed AD
approach works provided the sparsity pattern of the Jacobian matrix $f'(x)$ is a subset of the
sparsity pattern provided by the user. Of course, if the sparsity pattern provided by the user
is too large, then the number of groups $p$ is likely to increase, leading to increased memory
requirements and some loss in efficiency in the computation of the gradient.
We used the test problems in the MINPACK-2 collection to compare the performance of
large-scale optimization software employing the four approaches for computing the gradient
of a partially separable function described in Section 2. This collection is representative of
large-scale optimization problems arising from applications. Table 1 lists each test problem
with a short description; see [1] for additional information on these problems.
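The optimization problems in the MINPACK-2 collection arise from the need to minimize
a function $f$ of the form
$$ f(v) = \int_{\mathcal{D}} \Phi(x, v, \nabla v) \, dx, \tag{8} $$
where $\mathcal{D}$ is some domain in either $\mathbb{R}$ or $\mathbb{R}^2$, and $\Phi$ is defined by the application. In all
cases $f$ is well defined if $v : \mathcal{D} \to \mathbb{R}^p$ belongs to $H^1(\mathcal{D})$, the Hilbert space of functions
such that $v$ and $\|\nabla v\|$ belong to $L^2(\mathcal{D})$.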
Finite element approximations to these problems are obtained by minimizing $f$ over the
space of piecewise linear functions $v$ with values $v_{i,j}$ at $z_{i,j}$, $0 \le i \le n_y + 1$, $0 \le j \le n_x + 1$,
where $z_{i,j} \in \mathbb{R}^2$ are the vertices of a triangulation of $\mathcal{D}$ with grid spacings $h_x$ and $h_y$. The
vertices $z_{i,j}$ are chosen to be a regular lattice so that there are $n_x$ and $n_y$ interior grid
points in the coordinate directions, respectively. Lower triangular elements $T_L$ are defined
by vertices $z_{i,j}$, $z_{i+1,j}$, $z_{i,j+1}$, while upper triangular elements $T_U$ are defined by vertices
$z_{i,j}$, $z_{i-1,j}$, $z_{i,j-1}$. A typical triangulation is shown in Figure 2.
In a finite element formulation of the variational problem defined by (8), the unknowns
are the values $v_{i,j}$ of the piecewise linear function $v$ at the vertices $z_{i,j}$. The values $v_{i,j}$ are
obtained by solving the minimization problem
$$ \min \Big\{ \sum_{i,j} \big( f^L_{i,j}(v) + f^U_{i,j}(v) \big) : v \in \mathbb{R}^n \Big\}, $$
where $f^L_{i,j}$ and $f^U_{i,j}$ are the finite element approximations to the integrals
$$ \int_{T_L} \Phi(x, v, \nabla v) \, dx, \qquad \int_{T_U} \Phi(x, v, \nabla v) \, dx, $$
respectively. Clearly, this is a partially separable problem because the element functions
$f^L_{i,j}(v)$ and $f^U_{i,j}(v)$ depend only on the values $v_{i,j}, v_{i+1,j}, v_{i,j+1}$ and $v_{i,j}, v_{i-1,j}, v_{i,j-1}$,
respectively. We can formulate this problem by setting each element function to the
contribution of a single triangle,
$$ f_k(v) = f^L_{i,j}(v) \quad \text{or} \quad f_k(v) = f^U_{i,j}(v). \tag{9} $$
In this case the number of element functions $m \approx 2n$. On the other hand, if we define
$$ f_k(v) = f^L_{i,j}(v) + f^U_{i,j}(v), \tag{10} $$
the number of element functions $m \approx n$. Since the number of element functions differs
for (9) and (10), the number of groups $p$ determined by the partitioning software [8, 7] is
likely to be different, and thus the computing times for the compressed Jacobian matrix
may depend on $p$. In our experience the computing time of formulation (9) is slightly better
than that of (10). Therefore, we used formulation (9) in the numerical results of Section 4.
The problems in Table 1 are representative of a large class of optimization problems.
These problems share some common characteristics. The main characteristics are that
the computation of f requires order n flops and that the Jacobian matrix of f is sparse.
Moreover, the number of groups p determined by the partitioning software leads to an
almost dense compressed Jacobian matrix; the only exception is the GL2 problem, where
the compressed Jacobian matrix is 50% dense. We expect that our numerical results are
representative for any problem with these characteristics.
4. Numerical Results
Our aim in these experiments is to show that the performance of the hybrid AD option of
ELSO is comparable to the compressed AD option and that the performance penalty over
the hand-coded option is quite reasonable.
We chose a limited-memory variable metric method for these comparisons because codes
of this type are commonly used to solve large-scale optimization problems. These methods
are of the form
$$ x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k), $$
where $\alpha_k > 0$ is the search parameter, and the approximation $H_k$ to the inverse Hessian
matrix is stored in a compact representation that requires only the storage of $2 n_v$ vectors,
where $n_v$ is chosen by the user. The compact representation of $H_k$ permits the efficient
computation of $H_k \nabla f(x_k)$ in $(8 n_v + 1) n$ flops; all other operations in an iteration of the
algorithm require $11n$ flops.
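How a product of the inverse Hessian approximation with the gradient can be formed from a few stored vector pairs is sketched below with the standard two-loop recursion from the limited-memory BFGS literature. This is not the vmlm code; the function names, the list-based storage, and the choice of initial scaling are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lbfgs_apply(grad, s_list, y_list):
    """Return H_k * grad from stored pairs (s_i, y_i), oldest pair first."""
    q = list(grad)
    rhos = [1.0 / dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * dot(s, q)
        alphas.append(alpha)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    # Initial scaling H_0 = gamma * I (a common choice, assumed here).
    gamma = 1.0
    if s_list:
        gamma = dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1])
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest.
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r


# One stored pair with y = 2 * s in the first coordinate.
print(lbfgs_apply([2.0, 3.0], [[1.0, 0.0]], [[2.0, 0.0]]))
```

Each stored pair contributes two inner products and two vector updates, so the work grows linearly with both the number of pairs and the number of variables. A production code would store the scalars `rhos` rather than recompute them.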
We used the vmlm implementation of the limited-memory variable metric algorithm (see
Averick and Moré [2]) with $n_v = 5$. This implementation is based on the work of Liu and
Nocedal [18]. The web page
http://www.mcs.anl.gov/home/more/minpack-2/minpack-2.html
contains additional information on the vmlm implementation.
In our numerical experiments we are interested in measuring performance in terms of
time per iteration. Thus, instead of using a convergence-based termination test,
we terminate after 100 iterations. This strategy is needed because optimization algorithms
that require many iterations for convergence are affected by small perturbations in the
function or the gradient, and, as a result, there may be large differences in the number of
iterations required for convergence when the different vmlm options of ELSO are used.
All computations were performed on two platforms: an IBM RS6000 (model 370) using
double-precision arithmetic, and a Cray C90 using single-precision arithmetic. The IBM
RS6000 architecture has a superscalar chip and a cache-based memory architecture. Hence,
this machine performs better when executing short vector operations, since these operations
fill the short pipes and take advantage of memory locality. The Cray C90 is a vector processor
without a cache that achieves full potential when the code has long vector operations.
Without optimization of the source Fortran code, short vector loops and indirect addressing
schemes perform poorly.
Table 2 has the computing time ratios of the compressed AD and sparse AD function-
gradient evaluation to the hand-coded function-gradient evaluation on the IBM RS6000.
These results show that the use of the sparse AD gradient can lead to a significant degradation
in performance.
Tables 3 and 4 compare the computing time for the compressed AD, sparse AD, and
hybrid AD options of vmlm to the computing time of the hand-coded option. The most
important observation that can be made from these tables is that the computing times for
the hybrid AD option are approximately the same as those for the compressed AD option.
The performance similarity between the hybrid AD option and the compressed AD option
is expected because the difference in cost between the two options is only one sparse AD
gradient evaluation and the partitioning of the columns of the Jacobian matrix into groups of
structurally orthogonal columns. Tables 3 and 4 show that the hybrid AD option is clearly
the method of choice because of its significant advantage of not requiring a hand-coded
gradient or the sparsity pattern of the partially separable function.
The ratios in Tables 3 and 4 are below the corresponding ratios in Table 2. This result
can be explained by noting that the ratios in Tables 3 and 4 can be expressed as
$$ \frac{T_{ad} + T_{alg}}{T_{hc} + T_{alg}}, \tag{11} $$
where $T_{ad}$, $T_{alg}$, and $T_{hc}$ are the computing times for the function and AD-generated
gradient evaluation, the vmlm algorithm, and the function and hand-coded gradient evaluation,
respectively. Since $T_{ad} > T_{hc}$, we have
$$ \frac{T_{ad} + T_{alg}}{T_{hc} + T_{alg}} < \frac{T_{ad}}{T_{hc}}, $$
which is the desired result. If $T_{ad}$ and $T_{hc}$ are the dominant costs, the ratio (11) should
be close to $T_{ad}/T_{hc}$. This can be seen in the results for the MSA and SSC problems, since
these are the two most expensive functions in the set.
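The inequality between the two kinds of ratios follows from a short calculation, assuming only that all three times are positive. The numerical values at the end are hypothetical and serve only as an illustration.

```latex
\begin{align*}
\frac{T_{ad} + T_{alg}}{T_{hc} + T_{alg}} < \frac{T_{ad}}{T_{hc}}
  &\iff T_{hc}\,(T_{ad} + T_{alg}) < T_{ad}\,(T_{hc} + T_{alg}) \\
  &\iff T_{hc}\,T_{alg} < T_{ad}\,T_{alg} \\
  &\iff T_{hc} < T_{ad}.
\end{align*}
% Hypothetical illustration: T_{ad} = 6, T_{hc} = 2, T_{alg} = 2 gives
% (6 + 2)/(2 + 2) = 2 < 3 = 6/2.
```

The larger the algorithm time is relative to the gradient times, the closer the whole-run ratio is pulled toward one.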
Tables 3 and 4 also show that when we increase the problem dimension from n = 10,000
to n = 40,000, the corresponding compressed AD, sparse AD, and hybrid AD ratios remain
about the same. This observation can be explained by noting that the terms in the ratio (11)
all grow roughly linearly with n, so that the ratio itself is nearly independent of n.
The poor performance of the compressed AD and hybrid AD options is due to the short
innermost loops of length p , where p is the number of groups in the compressed AD
approach. These loops are vectorizable, but when the compiler vectorizes only innermost
loops, as is the case of the Cray C90, the performance degrades. We can vectorize the
compressed AD gradient by strip-mining the computation of the gradient; that is, the gradient
computation is divided into strips and each strip computes the gradient with respect to a few
components of the independent variables. In the case of the compressed AD gradient, strip-
mining can be done conveniently via the seed matrix mechanism. A disadvantage of the
strip-mining approach is that the function is evaluated in every strip, resulting in a runtime
overhead of nstrips - 1 extra function evaluations, where nstrips is the number of
strips. Using strips of size 5 is appropriate for the Cray C90 because the compiler unrolls
innermost loops of length five or less, and, as a result, the loops that run over the grid points
in the second coordinate direction are vectorized.
There is one additional complication. Since the value of p is not known at compile time,
the Cray compiler cannot unroll a loop of length p even if the computed value of p at runtime
is less than or equal to five. We fix this problem by setting the upper bound of the innermost
loops to a fixed number at least equal to p but at most equal to 5. The generation of the
compressed AD gradients with a fixed upper bound of the innermost loops can be done
automatically by setting the appropriate ADIFOR flags [6].
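Strip-mining via the seed matrix can be sketched in Python as follows. The routine `f_and_jv` is a stand-in for AD-generated code on a hypothetical chain problem, and all names and the data are assumptions; the point is only that each strip passes a few seed columns and triggers one more function evaluation.

```python
def f_and_jv(x, S, counter):
    """Stand-in for AD-generated code: f(x) and f'(x) S for f_i = (x[i+1] - x[i])**2."""
    counter[0] += 1                         # every call re-evaluates the function
    vals, jv = [], []
    for i in range(len(x) - 1):
        d = x[i + 1] - x[i]
        vals.append(d * d)
        jv.append([2.0 * d * (S[i + 1][c] - S[i][c]) for c in range(len(S[0]))])
    return vals, jv


def strip_mined_compressed_jacobian(x, ngrp, p, strip=5):
    """Build the m-by-p compressed Jacobian in strips of at most `strip` seed columns."""
    counter = [0]
    m = len(x) - 1
    c_fjac = [[0.0] * p for _ in range(m)]
    for start in range(0, p, strip):
        width = min(strip, p - start)
        # Seed columns start, ..., start + width - 1: column c selects group start + c.
        S = [[1.0 if ngrp[j] == start + c else 0.0 for c in range(width)]
             for j in range(len(x))]
        _, jv = f_and_jv(x, S, counter)
        for i in range(m):
            c_fjac[i][start:start + width] = jv[i]
    return c_fjac, counter[0]


x, ngrp = [0.0, 1.0, 3.0], [0, 1, 0]
one_strip = strip_mined_compressed_jacobian(x, ngrp, p=2, strip=5)
two_strips = strip_mined_compressed_jacobian(x, ngrp, p=2, strip=1)
print(one_strip, two_strips)
```

Both calls produce the same compressed Jacobian; the second uses two strips and therefore one extra function evaluation, in line with the overhead described above.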
The computing time ratios for the strip-mining approach (with loop unrolling) are shown
in Tables 7 and 8. The improvement is dramatic for both the compressed AD and hybrid
AD options. If we compare the results in Table 6 with those in Table 8, we find that the
computing time ratios are reduced by a factor of 1.6 for the GL2 problem and a factor of 2
for the SSC problem.
Also note that the results in Tables 7 and 8 show that the compressed AD approach
performs better on the SSC problem than the hand-coded approach. The reason for this
is that the strip-mining in the compressed AD approach improves the performance of this
approach, while the hand-coded approach is still running at scalar speeds. These results
illustrate the important point that the compressed and hybrid AD approaches can run faster
than the hand-coded approach if the user does not provide a carefully coded gradient.
5. Conclusions
References
B. M. Averick, R. G. Carter, J. J. Mor6, and G-L. Xue. The MINPACK-2 test problem collection. Preprint
MCS-PI 53-0692, Mathematics and Computer Science Division, Argonne National Laboratory, 1992.
B. M. Averick and J. J. More. Evaluation of large-scale optimization problems on vector and parallel
architectures. SIAM J. Optimization, 4:708-72 1, 1994.
Christian Bischof, Ali Bouaricha, Peyvand Kahdemi, and Jorge J. Mork. Computing gradients in large-scale
optimization using automatic differentiation. Preprint MCS-P488-0 195, Argonne National Laboratory,
Argonne, Illinois, 1995.
40 BOUARICHA AND MORE
4. Christian Bischof, Alan Carle, George Corliss, Andreas Griewank, and Paul Hovland. ADIFOR: Generating
derivative codes from Fortran programs. Scientific Programming, 1(1):1-29, 1992.
5. Christian Bischof, Alan Carle, and Peyvand Khademi. Fortran 77 interface specification to the SparsLinC
library. Technical Report ANL/MCS-TM-196, Argonne National Laboratory, Argonne, Illinois, 1994.
6. Christian Bischof, Alan Carle, Peyvand Khademi, and Andrew Mauer. The ADIFOR 2.0 system for the
automatic differentiation of Fortran 77 programs. Preprint MCS-P381-1194, Argonne National Laboratory,
Argonne, Illinois, 1994. Also available as CRPC-TR94491, Center for Research on Parallel Computation,
Rice University.
7. T. F. Coleman, B. S. Garbow, and J. J. Moré. Fortran subroutines for estimating sparse Jacobian matrices.
ACM Trans. Math. Software, 10:346-347, 1984.
8. T. F. Coleman, B. S. Garbow, and J. J. Moré. Software for estimating sparse Jacobian matrices. ACM Trans.
Math. Software, 10:329-345, 1984.
9. T. F. Coleman and J. J. Moré. Estimation of sparse Jacobian matrices and graph coloring problems. SIAM
J. Numer. Anal., 20:187-209, 1983.
10. A. R. Conn, N. I. M. Gould, and Ph. L. Toint. LANCELOT. Springer Series in Computational Mathematics.
Springer-Verlag, 1992.
11. A. Griewank and Ph. L. Toint. Numerical experiments with partially separable optimization problems. In
D. F. Griffiths, editor, Numerical Analysis: Proceedings Dundee 1983, Lecture Notes in Mathematics 1066.
Springer-Verlag, 1984.
12. Andreas Griewank. Achieving logarithmic growth of temporal and spatial complexity in reverse
communication. Optim. Methods Software, 1:35-54, 1992.
13. Andreas Griewank. Some bounds on the complexity of gradients, Jacobians, and Hessians. In P. M. Pardalos,
editor, Complexity in Nonlinear Optimization, pages 128-161. World Scientific Publishers, 1993.
14. Andreas Griewank and George F. Corliss, editors. Automatic Differentiation of Algorithms: Theory,
Implementation, and Application. Society for Industrial and Applied Mathematics, 1991.
15. M. Iri. History of automatic differentiation and rounding error estimation. In A. Griewank and G. F. Corliss,
editors, Automatic Differentiation of Algorithms, pages 3-16. SIAM, 1992.
16. David Juedes. A taxonomy of automatic differentiation tools. In Andreas Griewank and George Corliss,
editors, Automatic Differentiation of Algorithms: Theory, Implementation, and Application, pages 315-329.
SIAM, 1991.
17. M. Lescrenier. Partially separable optimization and parallel computing. Ann. Oper. Res., 14:213-224, 1988.
18. D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math.
Programming, 45:503-528, 1989.
19. Ph. L. Toint. Numerical solution of large sets of algebraic nonlinear equations. Math. Comp., 46:175-189,
1986.
20. Ph. L. Toint. On large scale nonlinear least squares calculations. SIAM J. Sci. Statist. Comput., 8:416-435,
1987.
21. Ph. L. Toint and D. Tuyttens. On large-scale nonlinear network optimization. Math. Programming,
48:125-159, 1990.
22. Ph. L. Toint and D. Tuyttens. LSNNO: A Fortran subroutine for solving large-scale nonlinear network
optimization problems. ACM Trans. Math. Software, 18:308-328, 1992.
Computational Optimization and Applications, 7, 41-69 (1997)
© 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Abstract. This paper considers the number of inner iterations required per outer iteration for the algorithm
proposed by Conn et al. [9]. We show that asymptotically, under suitable reasonable assumptions, a single inner
iteration suffices.
1. Introduction
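We consider the problem
\[ \mathop{\rm minimize}_{x \in \Re^n} \; f(x) \tag{1} \]
subject to the general inequality constraints
\[ c_i(x) \ge 0, \quad i = 1, \ldots, m, \tag{2} \]
and the simple bounds
\[ l \le x \le u. \tag{3} \]
We assume that the region $\mathcal{B} = \{x \in \Re^n \mid l \le x \le u\}$ is non-empty and may be infinite.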
We do not rule out the possibility that further simple bounds on the variables are included
amongst the general constraints (2) if that is deemed appropriate. Indeed, it is conceivable
that all simple bounds should be handled this way. Furthermore, we assume that
AS1. f(x) and the ci(x) are twice continuously differentiable for all x in B.
42 CONN. GOULD AND TOINT
Our exposition will be conveniently simplified by taking the lower bounds as identically
equal to zero and the upper bounds as infinity for a subset of $\mathcal{N} \stackrel{\rm def}{=} \{1, 2, \ldots, n\}$ in (3)
and by assuming that the remaining variables are either not subjected to simple bounds
or their simple bounds are treated as general constraints. Thus, in most of what follows,
$\mathcal{B} = \{x \in \Re^n \mid x_j \ge 0 \mbox{ for all } j \in \mathcal{N}_b\}$, where $\mathcal{N}_b \subseteq \mathcal{N}$ is the index set of bounded
variables. The modification required to handle more general bounds is indicated at the end
of the paper.
The approach we intend to take is that of Conn et al. [9] and is based upon incorporating
the inequality constraints via a Lagrangian barrier function whilst handling upper and
lower bounds directly. The sequential, approximate minimization of the Lagrangian barrier
function is performed in a trust region framework such as that proposed by Conn et al. [5].
Our aim in this paper is to consider how these two different algorithms mesh together.
In particular, we aim to show that ultimately very little work is performed in the iterative
sequential minimization algorithm for every iteration of the outer Lagrangian barrier
algorithm. This is contrary to most analyses of sequential penalty and barrier function
methods in which the effort required to solve the inner iteration subproblems is effectively
disregarded, the analysis concentrating on the convergence of the outer iteration (see for
instance the books by Fiacco and McCormick [12] and Bertsekas [1]). Exceptions to this
are the sequential penalty function method analyzed by Gould [14], and the sequential
augmented Lagrangian algorithm considered by Conn et al. [8].
This work was primarily motivated by observations that the authors made when testing
a prototype of their large-scale nonlinear programming package LANCELOT, release B
(see [7] for a description of release A), which includes an implementation of the algorithms
discussed in this paper. It was often apparent that only a single iteration of the inner itera-
tion subroutine SBMIN was ultimately required for every outer iteration of our sequential
Lagrangian barrier program. While the conditions required in this paper to turn this ob-
servation to a proven result are relatively strong (and we feel probably about as weak as is
possible), the package frequently exhibits the same behaviour on problems which violate
our assumptions.
We define the concepts and notation that we shall need in section 2. Our algorithm is
fully described in section 3 and analyzed in sections 4 and 5.
2. Notation
Let $g(x)$ denote the gradient $\nabla_x f(x)$ of $f(x)$. Similarly, let $A(x)$ denote the Jacobian of
$c(x)$, where
Thus
and
respectively, where the components $\lambda_i$ of the vector $\lambda$ are positive and are known as Lagrange
multiplier estimates and where the elements $s_i$ of the vector $s$ are positive and are known as
shifts. We note that $\ell(x, \lambda)$ is the Lagrangian with respect to the general constraints only.
Let $g_\ell(x, \lambda)$ and $H_\ell(x, \lambda)$ respectively denote the gradient, $\nabla_x \ell(x, \lambda)$, and Hessian,
$\nabla_{xx} \ell(x, \lambda)$, of the Lagrangian. We define the vector $\bar{\lambda}$ by
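where $\mathcal{N}_b \subseteq \mathcal{N}$. We will make much use of the projection operator defined componentwise
by
\[ (P[x, l, u])_j = \begin{cases} l_j & \mbox{if } x_j \le l_j, \\ u_j & \mbox{if } x_j \ge u_j, \\ x_j & \mbox{otherwise.} \end{cases} \tag{10} \]
This operator projects the point $x$ onto the region defined by the simple bounds (3). Let
\[ P(x, v, l, u) = x - P[x - v, l, u]. \tag{11} \]
Furthermore, define $P[x] = P[x, r, \infty]$ and $P(x, v) = P(x, v, r, \infty)$, where $r_j = 0$ for
$j \in \mathcal{N}_b$ and $-\infty$ otherwise.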
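As a small illustration of the projection operator and of (11), the following Python sketch applies both to a three-variable example. The function names `project` and `projected_step` and the sample data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def project(x, l, u):
    """Componentwise projection P[x, l, u] of x onto the box l <= x <= u."""
    return np.minimum(np.maximum(x, l), u)

def projected_step(x, v, l, u):
    """P(x, v, l, u) = x - P[x - v, l, u], as in (11)."""
    return x - project(x - v, l, u)

# Illustrative data: v plays the role of a gradient.
x = np.array([0.0, 2.0, 0.5])
v = np.array([1.0, -3.0, 0.2])
l = np.zeros(3)
u = np.array([np.inf, 4.0, np.inf])

p = projected_step(x, v, l, u)
inside = project(x, l, u)   # a feasible point is left unchanged
```

In the first component the bound at zero truncates the move, in the second the upper bound does, and in the third the full value of `v` is returned.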
Let $x^{(k)} \in \mathcal{B}$ and $\lambda^{(k)}$ be given values of $x$ and $\lambda$. If $h(x, \lambda, \ldots)$ is any function of $x$,
$\lambda$, ..., we shall write $h^{(k)}$ as a shorthand for $h(x^{(k)}, \lambda^{(k)}, \ldots)$.
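The displayed definitions of the Lagrangian barrier function and of the multiplier estimates are not legible in this copy. The sketch below therefore assumes the forms commonly associated with the method of Conn et al. [9], namely $\Psi(x, \lambda, s) = f(x) - \sum_i \lambda_i s_i \log(c_i(x) + s_i)$ and $\bar{\lambda}_i = \lambda_i s_i / (c_i(x) + s_i)$. Under that assumption the gradient of $\Psi$ is $g(x) - A(x)^T \bar{\lambda}$, which the code checks against central finite differences on a toy problem. All function names and data are illustrative.

```python
import numpy as np

def barrier(f, c, x, lam, s):
    """Assumed form: Psi = f(x) - sum_i lam_i * s_i * log(c_i(x) + s_i)."""
    return f(x) - np.sum(lam * s * np.log(c(x) + s))

def multiplier_estimates(c, x, lam, s):
    """Assumed form: lambda_bar_i = lam_i * s_i / (c_i(x) + s_i)."""
    return lam * s / (c(x) + s)

def barrier_gradient(g, A, c, x, lam, s):
    """Gradient of the assumed Psi: g(x) - A(x)^T lambda_bar."""
    return g(x) - A(x).T @ multiplier_estimates(c, x, lam, s)

# Toy problem: f = 0.5*||x||^2, c_1 = x_0 + x_1 - 1, c_2 = x_0.
f = lambda x: 0.5 * x @ x
g = lambda x: x
c = lambda x: np.array([x[0] + x[1] - 1.0, x[0]])
A = lambda x: np.array([[1.0, 1.0], [1.0, 0.0]])

x = np.array([0.8, 0.6])
lam = np.array([1.0, 0.5])
s = np.array([0.1, 0.2])

grad = barrier_gradient(g, A, c, x, lam, s)

# Central finite-difference check of the gradient formula.
h = 1e-6
fd = np.array([
    (barrier(f, c, x + h * e, lam, s) - barrier(f, c, x - h * e, lam, s)) / (2 * h)
    for e in np.eye(2)
])
```

The shifts keep the logarithm defined as long as $c_i(x) + s_i > 0$, which is the condition the algorithm maintains at every iterate.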
For any $x^{(k)}$ we have two possibilities for each component $x_j^{(k)}$, $j = 1, \ldots, n$, namely
where $\mathcal{N}_f \stackrel{\rm def}{=} \mathcal{N} \setminus \mathcal{N}_b$ is the index set of free variables. We shall call all $x_j^{(k)}$ that satisfy
(i) dominated variables while the remaining $x_j^{(k)}$ are floating variables. It is important to
notice that, as $x^{(k)} \in \mathcal{B}$,
\[ (P(x^{(k)}, \nabla_x \Psi^{(k)}))_j = x_j^{(k)} \quad \mbox{whenever } x_j^{(k)} \mbox{ is dominated,} \tag{12} \]
while
the sets of inactive (strictly satisfied) and active (violated or just satisfied) constraints at the
point $x$. We develop our algorithm so that the set $\mathcal{A}^* \equiv \mathcal{A}(x^*)$ at any limit point of our
generated sequence is precisely the set of constraints for which $c_i(x^*) = 0$. We also write
$\mathcal{I}^* \equiv \mathcal{I}(x^*)$.
We will use the notation that if $\mathcal{J}_1$ and $\mathcal{J}_2$ are any subsets of $\mathcal{N}$ and $H$ is an $n$ by $n$ matrix,
$H_{[\mathcal{J}_1, \mathcal{J}_2]}$ is the matrix formed by taking the rows and columns of $H$ indexed by $\mathcal{J}_1$ and $\mathcal{J}_2$
respectively. Likewise, if $A$ is an $m$ by $n$ matrix, $A_{[\mathcal{J}_1]}$ is the matrix formed by taking the
columns of $A$ indexed by $\mathcal{J}_1$.
We denote the (appropriately dimensioned) identity matrix by $I$; its $j$-th column is $e_j$. A
vector of ones is denoted by $e$.
We will use a variety of vector and subordinate matrix norms. We shall only consider
norms $\|\cdot\|_z$ which are consistent with the two-norm, that is, norms which satisfy the
inequalities
for all vectors $v$ and some constant $a_0 \ge 1$, independent of $z$. It then follows that, for any
pair of two-norm-consistent norms $\|\cdot\|_y$ and $\|\cdot\|_z$,
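The inequalities defining two-norm consistency are not legible in this copy. As a rough illustration, the sketch below assumes the usual form $a_0^{-1/2} \|v\|_z \le \|v\|_2 \le a_0^{1/2} \|v\|_z$ and tests it on random vectors for the infinity norm and the one-norm with $a_0 = n$. The helper name and the sampling approach are illustrative assumptions; passing a sampled check is evidence, not a proof.

```python
import numpy as np

def is_two_norm_consistent(norm, n, a0, trials=200, seed=0):
    """Sampled check of a0^(-1/2)*||v||_z <= ||v||_2 <= a0^(1/2)*||v||_z (assumed form)."""
    rng = np.random.default_rng(seed)
    tol = 1e-12  # slack for rounding error
    for _ in range(trials):
        v = rng.standard_normal(n)
        two = np.linalg.norm(v, 2)
        z = norm(v)
        if not (z / np.sqrt(a0) <= two + tol and two <= np.sqrt(a0) * z + tol):
            return False
    return True

n = 6
inf_norm = lambda v: np.linalg.norm(v, np.inf)
one_norm = lambda v: np.linalg.norm(v, 1)

ok_inf = is_two_norm_consistent(inf_norm, n, a0=n)
ok_one = is_two_norm_consistent(one_norm, n, a0=n)
too_small = is_two_norm_consistent(inf_norm, n, a0=1.0)  # a0 = 1 is too tight
```

With $a_0 = n$ both norms pass, which matches the standard bounds $\|v\|_\infty \le \|v\|_2 \le \sqrt{n}\,\|v\|_\infty$ and $\|v\|_2 \le \|v\|_1 \le \sqrt{n}\,\|v\|_2$.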
In order to solve the problem (1), (2) and (9), we consider the algorithmic model given in
Figure 1.
We shall call the vector $P(x^{(k)}, \nabla_x \Psi^{(k)})$ the projected gradient of the Lagrangian barrier
function or the projected gradient for short. The norms $\|\cdot\|_g$ and $\|\cdot\|_c$ are normally chosen
to be either two or infinity norms.
[Outer-iteration Algorithm]
Step 0 : [Initialization] The strictly positive constants $\eta_0$, $\omega_0$, $\alpha_\omega$, $\beta_\omega$, $\alpha_\eta$, $\beta_\eta$, $\alpha_\lambda \le 1$, $\tau < 1$,
$\rho < 1$, $\gamma_2 < 1$, $\omega_* \ll 1$ and $\eta_* \ll 1$ for which
An initial estimate of the solution, $x_{\rm est} \in \mathcal{B}$, and vector of positive Lagrange multiplier estimates,
$\lambda^{(0)}$, for which $c_i(x_{\rm est}) + \mu^{(0)} (\lambda_i^{(0)})^{\alpha_\lambda} > 0$ are specified. Set $k = 0$.
\[ \|P(x^{(k)}, \nabla_x \Psi^{(k)})\|_g \le \omega^{(k)} \mbox{ and } c_i(x^{(k)}) + s_i^{(k)} > 0, \; (i = 1, \ldots, m). \tag{21} \]
[Inner-iteration Algorithm]
Step 0 : [Initialization] The positive constants $\mu < \eta < 1$ and $\gamma_0 \le \gamma_2 < 1 \le \gamma_3$ are given. The
starting point, $x^{(k,0)}$, a nonnegative convergence tolerance, $\omega^{(k)}$, an initial trust region radius,
$\Delta^{(k,0)}$, a symmetric approximation, $B^{(k,0)}$, to the Hessian of the Lagrangian, $H_\ell(x^{(k,0)}, \bar{\lambda}^{(k,0)})$,
and a two-norm-consistent norm $\|\cdot\|_g$ are specified. Compute $\Psi(x^{(k,0)}, \lambda^{(k)}, s^{(k)})$ and its
gradient. Set the inner iteration counter $j = 0$.
Step 1 : [Test for convergence] If
set $x^{(k)} = x^{(k,j)}$ and stop.
Step 2 : [Significantly reduce a model of the Lagrangian barrier function] Construct a quadratic
model,
and
End of Algorithm
Our decreasing sequence of $\mu^{(k)}$'s is given by $\mu^{(k)} = \mu_0 (\tau)^{k_j}$, but any monotonic
and choose
Thus variables which are significantly dominated at the end of the $(k-1)$-st iteration are
set to their bounds while the remainder are left unaltered. This choice is made since, under
a suitable non-degeneracy assumption (AS7 in section 4), the set of dominated variables is
asymptotically the same as the set of variables which lie on their bounds (see [9], Theorem
5.4). Furthermore, under a second non-degeneracy assumption (AS5 in section 4), the
assignment $x^{(k,0)} = \hat{x}^{(k-1)}$ is guaranteed for $k$ sufficiently large. Our choice of $x^{(k,0)}$ then
encourages subsequent iterates to encounter their asymptotic state as soon as possible.
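The displayed rules (32) and (33) are not legible in this copy, but their use later in the proof suggests the following reading: a bounded variable is moved to its bound when $0 \le x_j \le \theta (\nabla_x \Psi)_j$, and the shifted point is accepted as the new starting point only if $c_i(\hat{x}) + s_i > 0$ for every $i$. The Python sketch below encodes that reading as an assumption. The function name, its arguments and the data are all illustrative.

```python
import numpy as np

def shifted_start(x, grad_psi, bounded, theta, c, s_next):
    """Assumed reading of (32)-(33): snap significantly dominated variables to zero,
    then keep the shifted point only if the shifted constraints stay positive."""
    x_hat = x.copy()
    dominated = bounded & (x >= 0) & (x <= theta * grad_psi)
    x_hat[dominated] = 0.0
    if np.all(c(x_hat) + s_next > 0):
        return x_hat
    return x

x = np.array([1e-3, 0.5, 0.2])
grad_psi = np.array([1.0, 0.1, -0.3])
bounded = np.array([True, True, False])  # membership in the set of bounded variables
theta = 0.5

# Case 1: the shifted point keeps c + s positive, so it is accepted.
c_ok = lambda z: np.array([z[0] + z[1] + z[2] - 0.5])
start_ok = shifted_start(x, grad_psi, bounded, theta, c_ok, np.array([0.01]))

# Case 2: snapping x_0 to zero would violate c + s > 0, so x is kept.
c_bad = lambda z: np.array([z[0] - 5e-4])
start_bad = shifted_start(x, grad_psi, bounded, theta, c_bad, np.array([1e-4]))
```

Only the first variable is snapped in this example: the second is bounded but not significantly dominated, and the third is free.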
We also pick $\Delta^{(k,0)}$ so that
for some positive constants $\kappa$ and $\zeta < 1$ (typical values might be $\kappa = 1$ and $\zeta = 0.9$). This
value is chosen so that the trust region does not interfere with the asymptotic convergence
of the algorithm, while providing a reasonable starting value in the earlier stages of the
method.
Finally $B^{(k,0)}$ is taken to be any sufficiently good symmetric approximation to the Hessian
of the Lagrangian function at $x^{(k)}$. We qualify what we mean by "sufficiently good" in the
next section but suffice it to say that exact second derivatives satisfy this property and are
often to be recommended.
The calculation in Step 2 is performed in two stages.
as the parameter t increases from 0, which finishes when the path first intersects the
boundary of the trust region,
for some two-norm-consistent norm $\|\cdot\|_t$. Thus the Cauchy arc is simply the path
which starts in the steepest descent direction for the model but which is subsequently
"bent" to follow the boundary of the "box" region defined by the feasible region (9) (or,
in general, (3)) and which stops on the boundary of the trust region (36). The two or
infinity norm is normally chosen, the latter having some advantages as the trust region is
then aligned with the feasible region (9). (Indeed, it is possible to extend the Cauchy arc
along the boundary of the trust region when the infinity norm is used. Further reduction
of the quadratic model along this extended Cauchy arc may prove beneficial.)
The method proposed by Conn et al. [5] calculates the exact generalized Cauchy point
by marching along the Cauchy arc until either the trust region boundary is encountered or
the model starts to increase. An alternative method by Moré [15] finds an approximation
$p^{C(k,j)} = p^{(k,j)}(t^{C(k,j)})$ which is required to lie within the trust-region and to satisfy
the Goldstein-type conditions
ASYMPTOTIC COMPLEXITY IN INEQUALITY CONSTRAINED OPTIMIZATION 49
and
or
and the positive constants $\mu_1$, $\mu_2$, $\nu_1$, $\nu_2$ and $\nu_3$ satisfy the restrictions $\mu_1 < \mu_2 < 1$,
$\nu_2 < 1$ and $\nu_3 < 1$. Condition (37) ensures that a sufficient reduction in the model
takes place at each iteration while condition (38) is needed to guarantee that every step
taken is non-negligible. Moré shows that it is always possible to pick such a value of
$t^{C(k,j)}$ using a backtracking linesearch, starting on or near to the trust region boundary.
Similar methods have been proposed by Calamai and Moré [4], Burke and Moré [2],
Toint [16] and Burke et al. [3].
2. Secondly, we pick $p^{(k,j)}$ so that $x^{(k,j)} + p^{(k,j)}$ lies within (9), $\|p^{(k,j)}\|_t \le \beta_2 \Delta^{(k,j)}$
and
denote the composite approximation to the Hessian of the Lagrangian barrier function.
The specific Model-reduction Algorithm we shall consider is summarized in Figure 3.
In Step 2 of this method, the value of $p_{[\mathcal{F}]}$ would normally be computed as the aggregate
step after a number of Conjugate Gradient (CG) iterations, where CG is applied to minimize
the model in the subspace defined by the free variables. The CG process will end when
either a new bound is encountered or the convergence test (45) is satisfied. The
Model-reduction Algorithm is itself finite as the number of free variables at each pass of Step 2 is
strictly monotonically decreasing. See the paper by Conn et al. [6] for further details.
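The first stage of the step calculation can be illustrated with a simplified backtracking search along the projected steepest-descent path. The sketch below is not the exact test (37)-(40), which is not reproduced here. It assumes a quadratic model with gradient `g` and Hessian approximation `B`, lower bounds only, an infinity-norm trust region, and a single sufficient-decrease condition with a hypothetical constant `mu1`.

```python
import numpy as np

def cauchy_arc(x, g, t, lower):
    """Step to the point reached at parameter t on the projected steepest-descent path."""
    return np.maximum(x - t * g, lower) - x

def approx_cauchy_point(x, g, B, radius, lower, mu1=0.1, shrink=0.5, t0=1.0, max_iter=60):
    """Backtrack on t until the model decreases enough and the step fits the trust region."""
    t = t0
    for _ in range(max_iter):
        p = cauchy_arc(x, g, t, lower)
        model_change = g @ p + 0.5 * p @ B @ p      # m(x + p) - m(x)
        decrease_ok = model_change <= mu1 * (g @ p)
        inside = np.linalg.norm(p, np.inf) <= radius
        if decrease_ok and inside:
            return t, p
        t *= shrink
    return 0.0, np.zeros_like(x)

x = np.array([0.0, 1.0])
g = np.array([1.0, -2.0])
B = np.diag([2.0, 4.0])
lower = np.array([0.0, -np.inf])   # first variable bounded below by zero, second free

t_c, p_c = approx_cauchy_point(x, g, B, radius=1.0, lower=lower)
free = (x + p_c) > lower           # variables off their bounds at the Cauchy point
```

In this example the first variable stays on its bound along the whole arc, so the second stage would work in the subspace of the second variable only.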
4. Convergence analysis
[Model-reduction Algorithm]
Step 0 : [Initialization] Select positive constants $\nu < 1$, $\zeta < 1$, $\beta_2 \ge 1$ and $\beta_3 \le 1$.
Step 1 : [Calculate the generalized Cauchy point] Calculate an approximation to the
generalized Cauchy point $x^{C(k,j)} = x^{(k,j)} + p^{C(k,j)}$ using one of the previously
mentioned techniques. Compute the set of variables, $\mathcal{F}^{C(k,j)}$, which are free from
their bounds at $x^{C(k,j)}$. Set $x = x^{C(k,j)}$, $p = p^{C(k,j)}$ and $\mathcal{F} = \mathcal{F}^{C(k,j)}$.
and
THEOREM 1 ([9], Theorem 4.4) Assume that AS1 and AS2 hold, that $x^*$ is a limit point
of the sequence $\{x^{(k)}\}$ generated by the Outer-iteration Algorithm and that
AS3. The second derivatives of the functions $f(x)$ and the $c_i(x)$ are Lipschitz continuous
at all points within an open set containing $\mathcal{B}$.
AS4. Suppose that $(x^*, \lambda^*)$ is a Kuhn-Tucker point for the problem (1), (2) and (9), and
and
is non-singular for all sets $\mathcal{A}$ and $\mathcal{J}$, where $\mathcal{A}$ is any set made up from the union of $\mathcal{A}_1^*$
and any subset of $\mathcal{A}_2^*$ and $\mathcal{J}$ is any set made up from the union of $\mathcal{J}_1$ and any subset of
$\mathcal{J}_2$.
Under these additional assumptions, we are able to derive the following result.
THEOREM 2 ([9], Theorems 5.3 and 5.5) Assume that AS1-AS6 hold. Then there is a
constant $\mu_{\min} > 0$ such that the penalty parameter $\mu^{(k)}$ generated by the Outer-iteration
Algorithm satisfies $\mu^{(k)} = \mu_{\min}$ for all $k$ sufficiently large. Furthermore, $x^{(k)}$ and $\bar{\lambda}^{(k)}_{[\mathcal{A}^*]}$
satisfy the bounds
for the two-norm-consistent norm $\|\cdot\|_g$ and some positive constants $a_x$ and $a_\lambda$, while each
$|\bar{\lambda}_i^{(k)}|$, $i \in \mathcal{I}^*$, converges to zero at a Q-superlinear rate.
We shall now investigate the behaviour of the Outer-iteration Algorithm once the penalty
parameter has converged to its asymptotic value, $\mu_{\min}$. There is no loss of generality in
assuming that we restart the algorithm from the point which is reached when the penalty
parameter is reduced for the last time. We shall call this iteration $k = 0$ and will start with
$\mu^{(0)} = \mu_{\min}$. By construction, (23) is satisfied for all $k$ and the updates (24) are always
performed. Moreover,
Assumptions AS5 and AS7 are often known as strict complementary slackness conditions.
We observe that AS8 is closely related to the necessary and sufficient conditions for
superlinear convergence of the inner iterates given by Dennis and Moré [10]. We also observe
that AS9 is entirely equivalent to requiring that the matrix
is positive definite (see, for instance, Gould [13]). The uniqueness of the limit point in AS9
can also be relaxed by requiring that (57) has its smallest eigenvalue uniformly bounded from
below by some positive quantity for all limit points $B^*$ of the sequence $B^{(k,0)}$. Moreover it
is easy to show that AS4, AS5 and AS7 guarantee AS9 provided that $\mu_{\min}$ is sufficiently
small and sufficient second-order optimality conditions (see Fiacco and McCormick [12],
Theorem 4) hold at $x^*$ (see Wright [17], Theorem 8, for the essence of a proof of this in
our case). Although we shall merely assume that AS9 holds in this paper, it is of course
possible to try to encourage this eventuality. We might, for instance, insist that Step 4 of
the Outer-iteration Algorithm is executed rather than Step 3 so long as the matrix $H^{(k,0)}$ is
not positive definite. This is particularly relevant if exact second derivatives are used.
We now show that if we perform the step calculation for the Inner-iteration Algorithm
using the Model-reduction Algorithm, a single iteration of the Inner-iteration Algorithm
suffices to complete an iteration of the Outer-iteration Algorithm when $k$ is sufficiently
large. Moreover, the solution of one inner-iteration subproblem, $x^{(k-1)}$, and the shifted
starting point for the next inner iteration (33) are asymptotically identical. We do this by
showing that, after a finite number of iterations,
(i) moving to the new starting point does not significantly alter the norms of the projected
gradient or constraints. Furthermore, the status of each variable (floating or dominated)
is unchanged by the move;
(ii) the generalized Cauchy point $x^{C(k,0)}$ occurs before the first "breakpoint" along the
Cauchy arc - the breakpoints are the values of $t > 0$ at which the Cauchy arc changes
direction as problem or trust region bounds are encountered. Thus the set of variables
which are free at the start of the Cauchy arc $x^{(k,0)}$ and those which are free at the
generalized Cauchy point are identical;
(iii) any step which satisfies (45) also satisfies pp1]lies strictly interior to C ( P 2 ) . Thus a
single pass of Step 2 of the Model-reduction Algorithm is required;
(v) the new point $x^{(k,1)}$ satisfies the convergence test (26); and
THEOREM 3 Assume that assumptions AS1-AS9 hold and that the convergence tolerances
$\beta_\omega$ and $\beta_\eta$ satisfy the extra condition
Then for all $k$ sufficiently large, a single inner iteration of the Inner-iteration Algorithm, with
the step computed from the Model-reduction Algorithm, suffices to complete an iteration
of the Outer-iteration Algorithm. Moreover, the solution to one inner iteration subproblem
provides the starting point for the next without further adjustment, for all $k$ sufficiently
large.
Proof. In order to make the proof as readable as possible, we will make frequent use of
the following shorthand: the iterates will be abbreviated as
the shifts as
and
Other quantities which occur at inner iterations $(k+1, 0)$ and $(k+1, 1)$ will be given
suffixes $\oplus$ and $+$ respectively. Thus $H^\oplus \equiv H^{(k+1,0)}$ and $H^+ \equiv H^{(k+1,1)}$.
Recall, we have used Theorem 2 to relabel the sequence of iterates so that
and
for all $k \ge 0$. Let be any closed, bounded set containing the iterates $x^{(k)}$ and $x^{(k+1,0)}$.
We shall follow the outline given above.
(i) Status of the starting point. The strict complementary slackness assumption AS7
ensures that for all $k$ sufficiently large, each variable belongs exclusively to one of the sets
$\mathcal{F}_1$ and $\mathcal{D}_1$ (see [9], Theorem 5.4); moreover,
and
\[ \epsilon_x \stackrel{\rm def}{=} \frac{\theta}{1 + \theta} \min_{j \in \mathcal{N}_b} \max[x_j^*, g_\ell(x^*, \lambda^*)_j] > 0, \]
where $\theta$ is as in (32). Then there is an iteration $k_0$ such that for variables in $\mathcal{F}_1$,
for all $k \ge k_0$. Hence, for those variables in $\mathcal{D}_1$, (67) and (69) give that
Thus, by definition (32), $\hat{x}_j^{(k)} = 0$ for each $j \in \mathcal{D}_1$ when $k \ge k_0$. Similarly, when
$j \in \mathcal{F}_1 \cap \mathcal{N}_b$ and $k \ge k_0$, $x_j^{(k)} > \theta (\nabla_x \Psi^{(k)})_j$ and hence, using (32), $\hat{x}_j^{(k)} = x_j^{(k)}$ for all
$j \in \mathcal{F}_1$. Thus $\hat{x}^{(k)}$ converges to $x^*$.
The other strict complementary slackness assumption, AS5, ensures that each constraint
belongs exclusively to one of the sets $\mathcal{I}^*$ and $\mathcal{A}^*$, for all $k$ sufficiently large. Moreover,
and thus one of $c_i(x^{(k)})$ and $\bar{\lambda}_i^{(k)}$ converges to zero while its partner converges to a strictly
positive limit for each $i$.
Using the shorthand introduced in (59)-(60), we have that $c_i(x) + s_i^+ > c_i(x) > 0$ for
each $i \in \mathcal{I}^*$ and all $k$ sufficiently large. Thus, as $\hat{x}$ converges to $x^*$ and $s_i^+$ converges to
zero, $2 c_i(x^*) > c_i(\hat{x}) + s_i^+ > \frac{1}{2} c_i(x^*) > 0$ for all $i \in \mathcal{I}^*$ and $k$ sufficiently large. On
the other hand, if $i \in \mathcal{A}^*$, $c_i(x) + s_i^+ > 0$ for all $k$ (see [9], Lemma 3.1). In this case, as
$s_i^+$ converges to $s_i^* \equiv \mu_{\min} (\lambda_i^*)^{\alpha_\lambda} > 0$ and $c_i(x)$ converges to zero, the convergence of $\hat{x}$
to $x^*$ and $\lambda^+$ to $\lambda^*$ implies that $2 s_i^* > c_i(\hat{x}) + s_i^+ > \frac{1}{2} s_i^* > 0$ for all $k$ sufficiently large.
Hence, from (33), $x^\oplus = \hat{x}$ and thus there is an integer $k_1 \ge k_0$ for which
\[ x_j^\oplus = \begin{cases} x_j & \mbox{for all } j \in \mathcal{F}_1, \\ 0 & \mbox{for all } j \in \mathcal{D}_1, \end{cases} \tag{73} \]
We next let $\tau$ be any real number and consider points on the line
\[ x(\tau) \stackrel{\rm def}{=} x + \tau (x^\oplus - x). \tag{74} \]
We firstly show that the diagonal matrix $D(x(\tau))$ is bounded for all $0 \le \tau \le 1$, where $D$ is
given by (28). As $x$ and $x^\oplus$ both converge to $x^*$, the definition (28) implies that $D(x(\tau))$
converges to the matrix Dl,i, satisfying (56), as k increases. Thus, we have the bound
where a1 def211[(A:)1-ax]zll12, for all k sufficiently large. It also follows from the
convergence of z and z@to x* and that of si to s: that there is an integer k2 2 k l for which
and
where a3 is an upper bound on the two-norm of the Hessian of the Lagrangian function
(with bounded multiplier estimates) within a.
We now use the identity
But, considering $i \in \mathcal{A}^*$, picking $k$ sufficiently large so that $|\lambda_i^+| \le 2 |\lambda_i^*|$ and using the
integral mean value theorem, the relationship $c(x^*)_{[\mathcal{A}^*]} = 0$, the bounds (77), (78), (79)
and the inequalities (18) and (51), we obtain the bounds
and
and hence
+
where U A = 4 r n a o u ~ ( 2 a o u ~a,) maxiEd3(At)'-*^, for any two-norm-consistent norm
(1 .(Iz. Furthermore, the superlinear convergence of Xi to zero, i E Z*,(76) and the bound-
edness of the remaining terms implies a bound
for some constant a l (In fact, this term can be made arbitrarily smaller than (85) by picking
k sufficiently large). Thus, combining (82), (85) and (86), we obtain the componentwise
bound
if j E J&. Secondly, combining (80) and (87), and using (13), (17), (18) and (63), we
derive the inequality
def
+ + + +
where a4 = U A a 1 aowo(1 U O ( U ~ ala;)). As k increases, the right-hand-side of
the inequality (89) converges to zero. Thus, from (68) and for k sufficiently large, :x is
floating for each j E F1, and (13) and (89) imply that
Conversely, consider the variables which lie in D1 for k 2 k2. Then, combining (80) and
(87), and using (17) and (18) we obtain the inequality
def
+ + +
where a5 = ad a~ U&JO(U, alai). Thus, for sufficiently large k the right-hand-side
of (91) can be made arbitrarily small. Combining this result with (69) and the identity
:x = 0, we see that :x is dominated for each j E D1, and (12) and (9 1) imply that
def
for all k sufficiently large, where a6 = aoaeIle[Fl]112.
We also need to be able to bound the Lagrange multiplier estimates x+ z x ( x @A+,
, s+).
We have, from (8), that
But then, recalling (84),when i E A*, and the superlinear convergence of A t to zero, when
i E 2*,together with (18), we obtain a bound
say, for all sufficiently large k. Hence the model (27) is a strictly convex function in the
subspace of free variables during the first inner iteration.
We now show that the set
lies strictly interior to the set C(1) (defined in the Model-reduction Algorithm) for all k
sufficiently large. The diameter d of C, the maximum distance between two members of
the set (measured in the two norm), can be no larger than twice the distance from the center
of the ellipsoid defined by $\mathcal{C}$ to the point on $\partial\mathcal{C}$ (the boundary of $\mathcal{C}$) furthest from the center.
The center of C is the Newton point,
def
Let $p_{[\mathcal{F}_1]} \in \mathcal{C}$ and $p_{[\mathcal{D}_1]} = 0$ and define $w = p - p^*$. Then, combining (27), (98), (100)
and (101), we have that
Hence, using the extremal properties of the Rayleigh quotient and (102), we have
for all steps within, or on the boundary of, C. Inequality (93) then combines with (105) to
show that any such step is shorter than the distance to the trust region boundary for all k
sufficiently large.
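The extremal property of the Rayleigh quotient used in this part of the proof states that, for a symmetric matrix $H$ with smallest and largest eigenvalues $\pi_{\min}$ and $\pi_{\max}$, every vector $w$ satisfies $\pi_{\min} \|w\|_2^2 \le w^T H w \le \pi_{\max} \|w\|_2^2$. The Python sketch below checks this numerically on a randomly generated positive definite matrix, and also checks the resulting bounds on the exact minimizer $t^* = g^T g / g^T H g$ of a convex quadratic along its steepest-descent direction. The matrix and vectors are illustrative, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
H = M @ M.T + np.eye(5)            # symmetric positive definite by construction

eigs = np.linalg.eigvalsh(H)       # eigenvalues in ascending order
pi_min, pi_max = eigs[0], eigs[-1]

# Rayleigh quotient bounds for an arbitrary vector w.
w = rng.standard_normal(5)
quad = w @ H @ w

# Exact minimizer of g^T p + 0.5 p^T H p along p = -t g.
g = rng.standard_normal(5)
t_star = (g @ g) / (g @ H @ g)
```

Since $t^*$ is the reciprocal of a Rayleigh quotient, it lies between $1/\pi_{\max}$ and $1/\pi_{\min}$, which is the kind of bound the argument relies on to keep steps away from the trust region boundary.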
Thus $\mathcal{C}$ lies strictly interior to $\mathcal{C}(1) \subseteq \mathcal{C}(\beta_2)$ for all $k$ sufficiently large. But, as all iterates
generated by the Model-reduction Algorithm satisfy (41) and thus lie in $\mathcal{C}$, it follows that
both the generalized Cauchy point and any subsequent improvements are not restricted by
the boundaries of $\mathcal{C}$ or $\mathcal{C}(\beta_2)$.
It remains to consider the Cauchy step in more detail. The Cauchy arc starts in the steepest
descent direction for the variables in $\mathcal{F}_1$. The minimizer of the model in this direction occurs
when
and thus, from the above discussion, gives the generalized Cauchy point proposed by Conn
et al. [5]. We use the definition of $t^*$, (16), (99) and the extremal property of the Rayleigh
quotient to obtain
for this variant of the generalized Cauchy point. Alternatively, if Moré's (1988) variant is
used, the requirement (37) and the definition of the Cauchy arc imply that
\[ m^\oplus(x^\oplus) - m^\oplus(x^\oplus + p^{C\oplus}) \ge \mu_1 t^{C\oplus} \|\nabla_x \Psi^\oplus_{[\mathcal{F}_1]}\|_2^2. \tag{108} \]
If the first alternative of (38) holds, (108) implies that
Otherwise, we may use the same arguments as above to show that it is impossible for $t^{L\oplus}$
to satisfy (40) when $k$ is sufficiently large. Therefore, $t^{L\oplus}$ must satisfy (39). Combining
(27), (39), (98) and the definition of the Cauchy arc, we have that
Hence, combining (99) and (110) with the extremal properties of the Rayleigh quotient,
we have that $t^{L\oplus} \ge (1 - \mu_2)/\pi_{\max}$. Thus, when the second alternative of (38) holds, this
result and (108) give that
\[ m^\oplus(x^\oplus) - m^\oplus(x^\oplus + p^{C\oplus}) \ge [\mu_1 \nu_2 (1 - \mu_2)/\pi_{\max}] \|\nabla_x \Psi^\oplus_{[\mathcal{F}_1]}\|_2^2. \tag{111} \]
Therefore, (17), (109) and (111) give the inequality
and
for some $\beta_3 \le 1$. The set of values which satisfy (113) and (114) is non-empty as the
Newton step (101) satisfies both inequalities.
It remains to consider such a step in slightly more detail. Suppose that $p_{[\mathcal{F}_1]}$ satisfies
(113). Let
Thus, combining (93) and (116), and picking $k$ sufficiently large so that
and $p_{[\mathcal{F}_1]}$ satisfies (113). As $p^\oplus$ can be made arbitrarily small, it follows (as in (76) and
(77)) from the convergence of $x^\oplus$ to $x^*$ and that of $s^+$ to $s^*$ that there is an integer $k_3$ for
which
and
m@(z@ + P @ ) 2 a711VzQ(z@,
-)m@(z@ A+, S+)[F1][I;, ( 124)
where a7 = 03 r n i n ( 1 / ( 4 ~ 0 7 r ~
p1~min(v1,
~), v2(1 - pZ)/7rm,,)/ao). Turning to the
numerator on the right-hand-side of (123), we use the integral mean value theorem to
obtain
+
Q(z@ p @ ,A+, s+)
A+,
= Q(z@, s+) + Pg:]wpl]
1
+3 Jo Pp:]VzzQ(z@(% A+, S + ) [ F l , F 1 ] P p l p
using (16), (42), the definition of the Hessian of the Lagrangian barrier function and AS8,
and
using (16), the convergence (and hence boundedness) of the Lagrange multiplier estimates
and the Lipschitz continuity of the second derivatives of the problem functions (assumption
AS3) with some composite Lipschitz constant $a_8$. Thus, combining (116), (123), (124),
(125), (126) and (127), we obtain
for all $0 \le t \le 1$. This latter follows from the definition (28) and the convergence of $x^\oplus$
and, because of (119) and (118), the convergence of $x^\oplus + p^\oplus$ to $x^*$. Thus, as the
right-hand-side of (129) can be made arbitrarily small, by taking $k$ sufficiently large, (69) and
the identity $x_j^+ = x_j^\oplus = 0$ for each $j \in \mathcal{D}_1$, imply that $x_j^+$ is dominated for each $j \in \mathcal{D}_1$
while (12) and (92) imply that
to the approximation to the Newton direction used, the second to the approximation of a
nonlinear function by a quadratic and the third to the particular approximation to the second
derivatives used. We now bound each of these terms in turn.
The first term satisfies the bound (113). Hence, combining (93) and (113), we obtain
The same arguments as those used to establish (126) imply that the second term on the
right-hand-side of (132) satisfies the bound
for some composite Lipschitz constants $a_9$ and $a_{10}$. We may then combine (17), (51), (63),
(78), (96), (118) and (134) to obtain the bound
for all sufficiently large $k$. Lastly, the third term on the right-hand-side of (132) satisfies
the bound
by the same arguments we used to establish inequality (127). We may then combine (118)
and (136) so that
where
and
Firstly, observe that the right-hand-side of (138) may be made arbitrarily small. Therefore,
(13), (131) and (138) imply that
aw+pw-cu-6
kl 2
P - Pw
for all sufficiently large $k \ge k_1$. Thus, the iterate $x^+$ satisfies the inner iteration first
convergence test of (21) for all $k$ sufficiently large and we have $x^{(k+1)} = x^{(k+1,1)} \equiv x^+$.
(vi) Redundancy of the shifted starting point. Finally, we observe that all the variables
$x_j^{(k)}$, $j \in \mathcal{D}$, lie on their bounds for sufficiently large $k$. Therefore, $x^{(k+1,0)} = x^{(k)}$ and the
perturbed starting point is redundant.
We now turn briefly to the more general problem (1)-(3). The presence of the more
general bounds (3) does not significantly alter the conclusions that we are able to draw. The
algorithms of section 3 are basically unchanged. We now use the region
$\mathcal{B} = \{x \in \Re^n \mid l \le x \le u\}$ - and hence $\mathcal{N}_b = \mathcal{N}$ - and replace $P(x, v)$ by $P(x, v, l, u)$ where appropriate.
The concept of floating and dominated variables stays essentially the same. For each iterate
in $\mathcal{B}$ we have three mutually exclusive possibilities, namely, (i) $0 \le x_j^{(k)} - l_j \le (\nabla_x \Psi^{(k)})_j$,
(ii) $(\nabla_x \Psi^{(k)})_j \le x_j^{(k)} - u_j \le 0$ or (iii) $x_j^{(k)} - u_j < (\nabla_x \Psi^{(k)})_j < x_j^{(k)} - l_j$, for each
component $x_j^{(k)}$. In case (i) we then have that $(P(x^{(k)}, \nabla_x \Psi^{(k)}, l, u))_j = x_j^{(k)} - l_j$ while
in case (ii) $(P(x^{(k)}, \nabla_x \Psi^{(k)}, l, u))_j = x_j^{(k)} - u_j$ and in case (iii)
$(P(x^{(k)}, \nabla_x \Psi^{(k)}, l, u))_j = (\nabla_x \Psi^{(k)})_j$. The variables that satisfy (i) and (ii) are said to be the dominated variables, the
ones satisfying (i) are dominated above while those satisfying (ii) are dominated below.
Consequently, the sets corresponding to (14) are straightforward to define. $\mathcal{D}_1$ is now made
up as the union of two sets $\mathcal{D}_{1l}$, whose variables are dominated above for all $k$ sufficiently
large, and $\mathcal{D}_{1u}$, whose variables are dominated below for all $k$ sufficiently large. $\mathcal{F}_1$ contains
variables which float for all $k$ sufficiently large and which converge to values interior to $\mathcal{B}$.
Similarly $\mathcal{F}_2$ is the union of two sets, $\mathcal{F}_{2l}$ and $\mathcal{F}_{2u}$, whose variables are floating for all $k$
sufficiently large but which converge to their lower and upper bounds respectively. We also
replace (32) by
With such definitions, we may reprove the results of section 4, extending AS4, AS7-AS9
in the obvious way. The only important new ingredient is that Conn et al. [9] indicate that
the non-degeneracy assumption AS7 ensures that the iterates are asymptotically isolated in
the three sets $\mathcal{F}_1$, $\mathcal{D}_{1l}$ and $\mathcal{D}_{1u}$.
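The three cases for general bounds, and the value of the projected gradient in each, can be checked with a short Python sketch. The function name `classify_and_project` and the sample data are illustrative assumptions. The case labels follow the numbering (i), (ii), (iii) used in the text, and `grad` stands in for the gradient of the Lagrangian barrier function.

```python
import numpy as np

def classify_and_project(x, grad, l, u):
    """Label each variable with its case and evaluate P(x, grad, l, u) = x - P[x - grad, l, u]."""
    case_i = (0 <= x - l) & (x - l <= grad)
    case_ii = (grad <= x - u) & (x - u <= 0)
    labels = np.where(case_i, "i", np.where(case_ii, "ii", "iii"))
    proj_grad = x - np.clip(x - grad, l, u)
    return labels, proj_grad

l = np.zeros(3)
u = np.ones(3)
x = np.array([0.1, 0.9, 0.5])
grad = np.array([0.5, -0.4, 0.2])

labels, proj_grad = classify_and_project(x, grad, l, u)
# Expected pattern: x - l in case (i), x - u in case (ii), grad in case (iii).
```

Variables in cases (i) and (ii) are the dominated ones, while those in case (iii) float and see the full gradient component.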
6. Conclusions
We have shown that, under suitable assumptions, a single inner iteration is needed for each
outer iteration of the Lagrangian barrier algorithm. We anticipate that such an algorithm
may prove to be an important ingredient of release B of the LANCELOT package.
7. Acknowledgement
The work reported here has been partly supported by the NATO travel grant CRG 890867.
References
1. D. P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, London, 1982.
2. J. V. Burke and J. J. Moré. On the identification of active constraints. SIAM Journal on Numerical Analysis, 25(5):1197-1211, 1988.
3. J. V. Burke, J. J. Moré, and G. Toraldo. Convergence properties of trust region methods for linear and convex constraints. Mathematical Programming, Series A, 47(3):305-336, 1990.
4. P. H. Calamai and J. J. Moré. Projected gradient methods for linearly constrained problems. Mathematical Programming, 39:93-116, 1987.
5. A. R. Conn, N. I. M. Gould, and Ph. L. Toint. Global convergence of a class of trust region algorithms for optimization with simple bounds. SIAM Journal on Numerical Analysis, 25:433-460, 1988. See also same journal 26:764-767, 1989.
6. A. R. Conn, N. I. M. Gould, and Ph. L. Toint. Testing a class of methods for solving minimization problems with simple bounds on the variables. Mathematics of Computation, 50:399-430, 1988.
7. A. R. Conn, N. I. M. Gould, and Ph. L. Toint. LANCELOT: a Fortran package for large-scale nonlinear optimization (Release A). Number 17 in Springer Series in Computational Mathematics. Springer Verlag, Heidelberg, Berlin, New York, 1992.
8. A. R. Conn, N. I. M. Gould, and Ph. L. Toint. On the number of inner iterations per outer iteration of a globally convergent algorithm for optimization with general nonlinear equality constraints and simple bounds. In D. F. Griffiths and G. A. Watson, editors, Proceedings of the 14th Biennial Numerical Analysis Conference Dundee
1991, pages 49-68. Longmans, 1992.
9. A. R. Conn, N. I. M. Gould, and Ph. L. Toint. A globally convergent Lagrangian barrier algorithm for
optimization with general inequality constraints and simple bounds. Mathematics of Computation, volume 66,
pages 261-288, 1997.
10. J. E. Dennis and J. J. Moré. A characterization of superlinear convergence and its application to quasi-Newton methods. Mathematics of Computation, 28(126):549-560, 1974.
11. J. E. Dennis and R. B. Schnabel. Numerical methods for unconstrained optimization and nonlinear equations. Prentice-Hall, Englewood Cliffs, USA, 1983.
12. A. V. Fiacco and G. P. McCormick. Nonlinear Programming: Sequential Unconstrained Minimization Techniques. J. Wiley and Sons, New York, 1968. Reprinted as Classics in Applied Mathematics 4, SIAM, 1990.
13. N. I. M. Gould. On the accurate determination of search directions for simple differentiable penalty functions. IMA Journal of Numerical Analysis, 6:357-372, 1986.
14. N. I. M. Gould. On the convergence of a sequential penalty function method for constrained minimization. SIAM Journal on Numerical Analysis, 26:107-128, 1989.
15. J. J. Moré. Trust regions and projected gradients. In M. Iri and K. Yajima, editors, System Modelling and Optimization, volume 113, pages 1-13, Berlin, 1988. Springer Verlag. Lecture Notes in Control and Information Sciences.
16. Ph. L. Toint. Global convergence of a class of trust region methods for nonconvex minimization in Hilbert space. IMA Journal of Numerical Analysis, 8:231-252, 1988.
17. M. H. Wright. Interior methods for constrained optimization. volume 1 of Acta Numerica, pages 341-407.
Cambridge University Press, New York, 1992.
Computational Optimization and Applications, 7, 71-87 (1997)
© 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Abstract. Recently, in [12], a very general class of truncated Newton methods has been proposed for solving large scale unconstrained optimization problems. In this work we present the results of extensive numerical experience with different algorithms belonging to that class. This numerical study, besides investigating which are the best algorithmic choices within the proposed approach, clarifies some significant points which underlie every truncated Newton based algorithm.
Keywords: Large scale unconstrained optimization, truncated Newton methods, negative curvature direction, curvilinear linesearch, Lanczos method.
1. Introduction
This work deals with a new class of algorithms for solving large scale unconstrained prob-
lems. We consider the minimization problem
$$\min_{x \in \mathbb{R}^n} f(x) \qquad (1)$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is a real valued function and we assume that the gradient $g(x) = \nabla f(x)$ and the Hessian matrix $H(x) = \nabla^2 f(x)$ exist and are continuous. We are interested
in solving Problem (1) when the dimension n is large. This interest derives from the fact that
problems with larger and larger number of variables are arising very frequently from real
world applications. Moreover, besides its own interest, the definition of efficient algorithms
for solving Problem (1) may be also considered an essential starting point to tackle large
scale constrained minimization problems.
As is well known, knowledge of the Hessian matrix makes it possible to exploit significantly more information about the problem than is available from the gradient alone. This is clearly evidenced by the efficiency of Newton-type methods. For wide classes of unconstrained optimization problems the Hessian matrix is available but, unfortunately, when the dimension is large, it cannot be stored and the exact computation of the Newton direction can be too expensive. For these classes of problems it is appropriate to use a truncated Newton approach. In fact, the algorithms which follow this approach use an approximate solution of the Newton equation
$$H(x_k)\, s = -g(x_k) \qquad (2)$$
* This work was partially supported by Agenzia Spaziale Italiana, Roma, Italy.
72 LUCIDI AND ROMA
and require only the storage of the matrix-vector product of $H(x_k)$ with a suitable vector. Moreover, they have good convergence properties.
The most popular iterative scheme for computing an approximate Newton-type direction is the conjugate gradient algorithm [4, 5, 19]. In most of these truncated Newton methods, the inner conjugate gradient iterations are terminated when the desired accuracy is obtained or whenever a negative curvature direction (i.e. a vector $v$ such that $v^T H v < 0$) is detected.
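The classical termination rule just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of any package cited here: the name `truncated_cg`, the relative residual test with tolerance `rtol`, and the steepest-descent fallback when negative curvature appears at the first iteration are all choices made for the example.

```python
import numpy as np

def truncated_cg(hess_vec, g, rtol=1e-1, max_iter=None):
    """Approximately solve H s = -g by conjugate gradients.

    Stops when the residual is small enough or when a direction p with
    p^T H p <= 0 is met. Returns (s, d), where d is that direction or None.
    """
    n = g.size
    max_iter = n if max_iter is None else max_iter
    s = np.zeros(n)
    r = -g.copy()                 # residual of H s = -g at s = 0
    p = r.copy()
    g_norm = np.linalg.norm(g)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= rtol * g_norm:
            break                 # desired accuracy reached
        Hp = hess_vec(p)
        curv = p @ Hp
        if curv <= 0.0:           # nonpositive curvature detected
            if not s.any():
                s = -g.copy()     # assumed fallback: steepest descent
            return s, p
        alpha = (r @ r) / curv
        s = s + alpha * p
        r_new = r - alpha * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return s, None
```

Only products of the Hessian with a vector are needed, supplied here through the callable `hess_vec`. When the loop exits on the curvature test, `s` is whatever estimate had been built up to that point, which is exactly the source of the accuracy concern raised in the text.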
Recently, a different strategy has been used by the truncated Newton method implemented
in the LANCELOT software package [2]. In fact, as usual, the inner iterates are terminated
whenever a negative curvature direction is detected, but, unlike the other truncated algo-
rithms, this direction is exploited by performing a significant movement along it. This
strategy enables this algorithm to take into account the local nonconvexity of the objective
function and its beneficial effects are evidenced by the numerical behaviour of the algorithm.
However, the choice of terminating the inner conjugate gradient iterations whenever a negative curvature direction is found can present the following drawbacks:
- the accuracy of the solution of (2) could be very poor when the conjugate gradient iterations are terminated because a negative curvature direction has been detected;
- the first negative curvature direction detected by the iterative scheme may not carry sufficient information on the local nonconvexity contained in the Hessian matrix.
By contrast, the Lanczos algorithm does not break down when the Hessian matrix is indefinite and hence, whenever system (2) is solvable, an accurate Newton-type direction can be computed.
In [12], drawing inspiration from [10, 13, 14], a new class of truncated Newton methods has been proposed. A common feature of the methods belonging to this class is the use of the Lanczos algorithm for determining both a Newton-type direction $s_k$ and a negative curvature direction $d_k$. This pair of directions is used for defining a curvilinear search path
Then, the new point is computed along this path by means of a very general nonmonotone
stabilization strategy.
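As an illustration of how two directions can define a curved path, the sketch below assumes the form $x(\alpha) = x_k + \alpha^2 s_k + \alpha d_k$, which is common in the literature on negative curvature (see e.g. [14]). Both this form and the function name `curvilinear_point` are assumptions made for the example.

```python
import numpy as np

def curvilinear_point(x, s, d, alpha):
    """Point on the assumed path x(alpha) = x + alpha**2 * s + alpha * d."""
    return x + alpha**2 * s + alpha * d
```

Under this assumption, $\alpha = 1$ gives the full step $x_k + s_k + d_k$, while for small $\alpha$ the negative curvature direction $d_k$ dominates the displacement.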
NUMERICAL EXPERIENCES WITH NEW TRUNCATED NEWTON METHODS 73
Many different algorithms can be derived from the class proposed in [12] and all these algorithms have, of course, the same theoretical properties. However, from the computational point of view, they can behave very differently and therefore only extensive numerical experience can clarify which are the best algorithms within the proposed class.
The numerical results reported in [12] have already shown which general strategy should be adopted in order to define efficient algorithms. In fact, the conclusions drawn from those results indicate that the best strategy is to define algorithms based on the joint use of negative curvature directions and nonmonotone stabilization techniques. Furthermore, the comparison between the results obtained in [12] and those obtained by different versions of the truncated Newton method implemented in the LANCELOT package has shown that the approach proposed in [12] is very promising from the computational point of view.
In this work we continue the numerical investigation of the algorithmic choices concerning the class of algorithms proposed in [12]. First of all, we focus our attention on the most effective way of computing a good Newton-type direction by adopting different termination criteria for the inner Lanczos iterations. Then we analyse different strategies for calculating the negative curvature direction and study different ways of evaluating the resemblance of the negative curvature direction to the eigenvector corresponding to the smallest eigenvalue of the Hessian matrix.
We point out that these numerical investigations, besides allowing us to understand better which are the best algorithmic choices within the class proposed in [12], give interesting answers to some open questions which lie behind the truncated Newton approach and hence give helpful hints for defining any new truncated Newton method.
The paper is organized as follows: in Section 2 the new method proposed in [12] is briefly reviewed. In Section 3 we report the results of our numerical experiences.
In this section we review some of the proposals of [12]. First of all, after briefly recalling the truncated Newton methods proposed in that paper, we describe the particular algorithm which showed the best numerical behaviour in the computational testing reported in [12]. We refer to [12] for a detailed description and a rigorous analysis of this algorithm.
Now we briefly recall the various parts that constitute the algorithm model proposed in [12].
Step 1: Initialization
Choose $-g(x_k)$ as the Lanczos starting vector and set $i = 1$.
- If the system (2) is solvable and the current estimate $s_i$ of the solution of the system (2) produced by the SYMMLQ routine satisfies the convergence conditions, then set $s_k = s_i$
- otherwise set $d_k = 0$.
A key point of this truncated scheme LTS is the convergence criterion at Step 2. In fact, this test should ensure that the Lanczos algorithm is terminated at the i-th iteration provided that both the estimate $s_i$ of the solution of the Newton system is sufficiently accurate and the smallest eigenvalue $\mu_i$ of the tridiagonal matrix $T_i$ is a good approximation of an eigenvalue of $H(x_k)$. The accuracy of the solution of the Newton system (2) can be evaluated by the magnitude of the residual at each step. As regards $\mu_i$, its difference from an eigenvalue of $H(x_k)$ is bounded by the scalar $\beta_{i+1}$ which, hence, can be used as a measure of the accuracy of $\mu_i$ in approximating an eigenvalue of the Hessian matrix [18].
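The role of $\beta_{i+1}$ can be seen in a small self-contained Lanczos sketch. It is an illustration only and does not reproduce SYMMLQ or the LTS scheme: the names `lanczos` and `smallest_ritz_pair` are hypothetical, and full reorthogonalization is used purely to keep the toy example numerically clean.

```python
import numpy as np

def lanczos(hess_vec, v0, steps):
    """Run up to `steps` Lanczos iterations; return (V, T, beta_next)."""
    n = v0.size
    V = np.zeros((n, steps))
    alphas, betas = [], []
    v = v0 / np.linalg.norm(v0)
    beta_next = 0.0
    for i in range(steps):
        V[:, i] = v
        w = hess_vec(v)
        alphas.append(v @ w)
        # Orthogonalize against all stored basis vectors (full
        # reorthogonalization; a simplification for this small sketch).
        w = w - V[:, :i + 1] @ (V[:, :i + 1].T @ w)
        beta_next = np.linalg.norm(w)
        if i == steps - 1 or beta_next < 1e-12:
            V = V[:, :i + 1]
            break
        betas.append(beta_next)
        v = w / beta_next
    T = np.diag(alphas)
    for j, b in enumerate(betas):
        T[j, j + 1] = T[j + 1, j] = b
    return V, T, beta_next

def smallest_ritz_pair(V, T):
    """Smallest eigenvalue mu of T and the vector d = V w for its eigenvector w."""
    evals, evecs = np.linalg.eigh(T)
    return evals[0], V @ evecs[:, 0]
```

If `mu` is negative, the vector `d` satisfies $d^T H d = \mu < 0$ and is therefore a negative curvature direction. The distance from `mu` to the nearest eigenvalue of $H$ is at most `beta_next`, which is why a small $\beta_{i+1}$ signals a reliable eigenvalue estimate.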
Stabilization algorithm
As regards the computation of the step length $\alpha_k$, the nonmonotone globalization strategy proposed in [7, 10] is adopted. The motivation for this choice is to accept the unit stepsize as often as possible whenever the iterates of the algorithm are in a region where Newton's method has strong convergence properties. This stabilization algorithm is now recalled.
$$F_j = \max_{0 \le i \le m(j)} Z_{j-i}, \quad \text{where } m(j) \le \min[m(j-1) + 1, M] \qquad (5)$$
and go to Step 5.
If $k = \ell + N$ compute $f(x_k)$; then:
The convergence properties of this algorithm have been studied in [12]. In particular, assuming that for a given $x_0$, the level set $\Omega_0 = \{x \in \mathbb{R}^n \mid f(x) \le f(x_0)\}$ is compact, the algorithm is globally convergent towards a stationary point if the directions $s_k$ and $d_k$ are bounded and satisfy the following convergence conditions:
$$g(x_k)^T s_k \le 0, \quad g(x_k)^T d_k \le 0, \quad d_k^T H(x_k) d_k \le 0,$$
$$g(x_k)^T s_k \to 0 \ \text{implies} \ g(x_k) \to 0 \ \text{and} \ s_k \to 0,$$
$$\|s_k\| + \|d_k\| \to 0 \ \text{implies} \ g(x_k) \to 0.$$
If, in addition, the direction $d_k$ also satisfies
(where $\lambda_{\min}(H(x_k))$ is the minimum eigenvalue of the Hessian matrix) then it is possible to show that the preceding algorithm is globally convergent towards stationary points where the Hessian matrix is positive semidefinite [12, Theorem 2.1]. We point out that the directions $s_k$ and $d_k$ used in the algorithm (NSA) are computed (as described in Section 2.1) in such a way as to satisfy the conditions for convergence.
where $r = 10^{-1}$. Moreover, as regards the termination criterion at Step 2 of LTS, the SYMMLQ routine has been slightly modified in order to continue to perform Lanczos iterations until the i-th iteration where both one of the original stopping criteria of the SYMMLQ routine is fulfilled and one of these additional criteria is satisfied:
$$\beta_{i+1} < \epsilon_k \quad \text{or} \quad i > L \qquad (8)$$
{ k}
where Ek = 103max 1 1 g ( Z k ) l l , and L = 50 is an upper bound on the number of
the vectors stored in the matrix Vi. Furthermore the following value of the scaling factor
P k in (4) is used
where $\mu_i$ is the smallest eigenvalue of the matrix $T_i$ produced by the Lanczos algorithm.
In the nonmonotone stabilization algorithm, the following values for its parameters have been adopted: $\Delta_0 = 10^3$, $N = 20$, $M = 20$, $\delta = 0.9$, $\sigma = 0.5$; moreover whenever a backtracking is performed (see Step 3 (b) and Step 4 (a)), then the current values of $\Delta$ and $M$ are modified as follows: $\Delta = 10^{-1}\Delta$ and $M = M/5 + 1$.
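The reference value (5) used by the nonmonotone stabilization algorithm can be sketched as a sliding maximum over the most recent reference function values. The sketch assumes the memory grows as fast as allowed, $m(j) = \min[m(j-1) + 1, M]$ with $m(0) = 0$; the class name `ReferenceValue` is hypothetical.

```python
from collections import deque

class ReferenceValue:
    """Track F_j = max_{0 <= i <= m(j)} Z_{j-i}, assuming m(j) = min(m(j-1) + 1, M)."""

    def __init__(self, M):
        self.M = M
        # Holds at most the last M + 1 reference values Z_j.
        self.values = deque(maxlen=M + 1)

    def update(self, z):
        """Record a new value Z_j and return the current F_j."""
        self.values.append(z)
        return max(self.values)
```

Comparing a trial point against $F_j$ rather than against the latest function value is what makes the strategy nonmonotone: the objective may increase from one iterate to the next as long as it stays below the maximum over the stored window.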
3. Numerical experiences
In this section we continue the numerical investigation started in [12] with the aim of further studying the effectiveness of different algorithmic choices concerning the class of algorithms proposed. As we said in the introduction, the results of this numerical study can be of interest also in a wider context. In fact, some of the indications drawn from this investigation can be useful for defining, in general, any truncated Newton method.
As concerns the reported numerical experiences, we have considered as default choices those of algorithm NMonNC described in Section 2.3. The following numerical investigation has been based on a large set of test problems. In particular we have used all the large scale unconstrained test problems available from the CUTE collection [1]; all the problems with a number of variables ranging between 930 and 10000 have been selected, providing us with a test set of 98 problems.
All tests have been performed on an IBM RISC System/6000 375 under AIX using Fortran in double precision with the default optimization compiling option. All the runs have been terminated when the convergence criterion $\|g\| \le 10^{-5}$ has been fulfilled.
In the comparisons between different algorithms reported in the sequel, we consider all the test problems coherently solved by all these algorithms, namely all the problems where these algorithms converge to the same point. Moreover, we consider the results of two runs equal if they differ by at most 5%. Finally, we consider as failures all the runs which need more than 5000 iterations or 5000 seconds of CPU time.
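The comparison protocol above can be made concrete with a small helper. The text does not say relative to which value the 5% is measured; the sketch assumes the larger of the two costs, and the function names `compare_runs` and `is_failure` are hypothetical.

```python
def compare_runs(a, b, tol=0.05):
    """Return 'a', 'b' or 'tie' for two positive costs (lower is better).

    Two costs are tied when they differ by at most `tol` relative to the
    larger one (which value the 5% refers to is an assumption).
    """
    if abs(a - b) <= tol * max(a, b):
        return "tie"
    return "a" if a < b else "b"

def is_failure(iterations, cpu_seconds, max_iter=5000, max_cpu=5000.0):
    """A run fails if it exceeds the iteration or CPU time limit."""
    return iterations > max_iter or cpu_seconds > max_cpu
```

Applying `compare_runs` per problem and per measure (iterations, function evaluations, gradient evaluations, CPU time) yields win and tie counts of the kind reported in the tables of this section.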
Since the results of all the runs consist in many tables, in the sequel we report only the
summaries of this extensive numerical testing together with some statistical comparisons.
First of all, we perform a numerical investigation of the influence of the accuracy of the Newton-type direction within a general truncated Newton algorithm. More specifically, we investigate the following questions:
2. which is the best value for the tolerance rtol used by the SYMMLQ routine?
Question 1. In most of the classical truncated Newton methods the conjugate gradient algorithm is used for computing an approximate Newton-type direction, and the conjugate gradient iterations are terminated when a desired accuracy is achieved or whenever a negative curvature direction is detected. In this second case the estimate generated at the previous iteration is often accepted as the approximate solution of (2) even if sufficient accuracy has not been reached. Hence, a possible drawback of this use of the conjugate gradient method is that the accuracy of the solution of (2) could be very poor when the conjugate gradient iterations are terminated because a negative curvature direction has been found. In particular, in [9] it was pointed out that it could be beneficial to try to continue the inner conjugate gradient iterations even if the Hessian matrix is not positive definite, and hence to try to compute, also in this case, a sufficiently good solution of (2).
Since the Lanczos based iterative scheme LTS does not break down when the Hessian matrix is not positive definite, our algorithm model represents a useful and flexible tool for investigating this aspect which is, of course, one of the most important for the definition of an effective truncated Newton algorithm. In particular, we consider some modifications of the default implementation of NSA (algorithm NMonNC) which do not use the negative curvature direction (i.e. we set $d_k = 0$) in order to focus our attention only on how the efficiency of the algorithm is influenced by the different strategies adopted for computing the truncated Newton direction. In particular, we consider three different versions of NSA where the only difference consists in the termination criterion at Step 2 of the LTS scheme. In fact, the following different criteria are used:
Algorithm A: the inner iterations are terminated whenever the original stopping criteria
of SYMMLQ routine are satisfied or a negative curvature direction is detected.
Algorithm B: the inner iterates continue until the termination criteria of SYMMLQ routine
are satisfied.
Algorithm C: the inner iterates continue until the original tests of SYMMLQ routine and
one of the criteria (8) are satisfied.
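The three termination rules can be written as a single predicate. This is a schematic sketch: the function `stop_inner` and its boolean inputs are hypothetical, and criteria (8) are read here as "$\beta_{i+1} < \epsilon_k$ or $i > L$".

```python
def stop_inner(variant, symmlq_converged, negative_curvature_seen,
               beta_next, eps_k, i, L):
    """Decide whether the inner iterations stop, for algorithms A, B and C."""
    # Additional criteria (8), as read from the text.
    extra = beta_next < eps_k or i > L
    if variant == "A":
        # Classical rule: stop on convergence or on negative curvature.
        return symmlq_converged or negative_curvature_seen
    if variant == "B":
        # Ignore negative curvature; stop only on SYMMLQ's own tests.
        return symmlq_converged
    if variant == "C":
        # SYMMLQ's tests and one of the criteria (8) must both hold.
        return symmlq_converged and extra
    raise ValueError("variant must be 'A', 'B' or 'C'")
```

Written this way, B relaxes A by dropping the negative curvature exit, and C tightens B by also demanding one of the criteria (8).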
In algorithm A the criterion commonly used in truncated Newton methods is adopted. The termination criterion of algorithm B differs from that adopted by algorithm A since the inner iterations are not terminated whenever a direction of negative curvature is detected, and this should make it possible to compute a more accurate approximation of the Newton direction. The choice of algorithm C is the same as that used in the default implementation described in Section 2.3. In that implementation this test is needed to ensure the goodness of the negative curvature direction; here, since negative curvature directions are not used, the numerical behaviour of algorithm C should indicate the influence on the computation of the truncated Newton direction $s_k$ of continuing the inner iterations even if a sufficiently small residual in the solution of (2) has been obtained. The results obtained by these three algorithms on the whole test set are summarized in the following tables. In particular, in Table 1 we report the cumulative results of this comparison, that is, the total number of iterations, function and gradient evaluations, and CPU time needed to solve all the problems considered. Table 2 reports how many times each of the three algorithms is the best, second and worst in terms of number of iterations, function and gradient evaluations and CPU time.
By comparing the results obtained by algorithm A with those obtained by algorithms B and C, it appears clear that the computation of an accurate Newton-type direction also when the Hessian is not positive definite can significantly improve the efficiency of the algorithm. This is confirmed also by observing the results obtained by algorithms B and C. In fact, the best results have been obtained by algorithm C where the estimate $s_i$ is considered a good Newton-type direction only when, besides being a good approximation of the solution of the system (2), it conveys sufficient information on the curvature of the objective function. This is obtained by using the additional criterion (8) which ensures that the iterations of the SYMMLQ routine continue until a small scalar $\beta_{i+1}$ is produced by the Lanczos algorithm or until the number of inner iterations is greater than the prefixed upper limit $L$. Therefore, this last consideration shows that the additional test (8), needed in the default algorithm NMonNC to compute a sufficiently good negative curvature direction $d_k$, has a beneficial influence also on the computation of the Newton-type direction.
Question 2. As is well known, the value of the tolerance in approximately solving the Newton equation (2) is a key point for the efficiency of every truncated Newton method and, hence, an empirical tuning of the parameter rtol is surely needed. Of course, since there are so many different choices for this parameter, it is beyond the scope of this work to give any conclusive answer. Here we have only performed some numerical experiments testing different choices of the parameter rtol against the value (7) used in algorithm NMonNC. In particular, we have investigated the use of some values which draw their inspiration from proposals widely used in the literature. More specifically, in algorithm D, the inner iterations are interrupted when the original stopping criteria of SYMMLQ are satisfied (same strategy as algorithm B) and, following [5], the tolerance parameter is set to the value
Similarly, in algorithm E, following [16], we use for the parameter rtol the value
together with a bound on the maximum number of iterations of the LTS scheme set to $\min\{n, 500\}$. In both algorithms D and E the negative curvature direction is not considered. A comparison between the results obtained by algorithm B, which uses the value given by (7), and algorithm D is reported in Table 3, and Table 4 reports the number of times each algorithm performs the best. Note that algorithm D showed three failures which are not considered in these cumulative results. Now we report a summary of the comparison between the results obtained by algorithms B and E. Table 5 reports the cumulative results and Table 6 the number of times each algorithm performs the best. Note that algorithm E showed four failures. By observing Tables 3, 4, 5 and 6, it appears clear that the best choice is the one adopted by algorithm B both in terms of efficiency and in terms of robustness. In more detail, Table 3 and Table 4 clearly show that algorithm B outperforms algorithm D, while Table 5 and Table 6 indicate that algorithm E is efficient in terms of number of iterations and also in terms of gradient evaluations.

Table 4. Number of times each of algorithms D and B performs the best.
        Iterations   Function eval.   Gradient eval.   CPU time
D           15              9               15             15
B           36             32               36             49
tie         42             52               42             29
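Table 6. Number of times each of algorithms B and E performs the best.
        Iterations   Function eval.   Gradient eval.   CPU time
B           27             30               27             54
tie         29             35               29             22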
In this section, we perform a numerical study of the effect that different negative curvature directions have on the behaviour of a truncated Newton algorithm. Since the use of negative curvature directions in large scale optimization is relatively new, up to now few investigations have been carried out on the sensitivity of an algorithm to the way the negative curvature direction is computed. The aim of the numerical experiences reported in this section is to shed some light on the following questions:
1. is it convenient to compute a “good” negative curvature direction?
2. how can the “goodness” of a negative curvature direction be tested?
3. what is the influence of the bound $L$ on the number of Lanczos basis vectors stored?
Question 1. In the field of large scale minimization methods, roughly speaking, two alternative strategies for computing negative curvature directions have been proposed. The first one derives from the LANCELOT algorithm [2] which, as we said in the introduction, uses a negative curvature direction which is very cheap to compute but which may not carry significant information on the local nonconvexity of the objective function. The other one is based on the use of negative curvature directions which are more expensive to compute but which can be considered “better” than the previous ones in the sense that they have a “good” resemblance to the eigenvectors of the Hessian matrices corresponding to the most negative eigenvalues. This last strategy has been followed by algorithm NMonNC proposed in [12], where the iterations of the SYMMLQ algorithm are continued until sufficient information on the smallest eigenvalue of the Hessian matrix (which can be tested by controlling the magnitude of $\beta_{i+1}$) is obtained or until the upper bound on the number of Lanczos basis vectors stored is reached (see criterion (8)). The numerical results reported in [12] seem to indicate that the second strategy is more efficient. However, this conclusion is influenced by the fact that the LANCELOT algorithm and algorithm NMonNC are very different: in fact the method implemented in the LANCELOT package is a trust region Newton method where the search directions are computed by means of the conjugate gradient algorithm, while algorithm NMonNC follows a curvilinear linesearch approach and uses the Lanczos algorithm for computing the search directions. Therefore, in order to investigate better the effect of computing “accurate” search directions, we have implemented an algorithm (denoted by algorithm F) which, within the class of methods proposed in [12], draws its inspiration from the strategy adopted by the LANCELOT algorithm. More specifically, algorithm F is characterized as follows:
Algorithm F: the inner Lanczos iterations are terminated whenever the original stopping
criteria of SYMMLQ are satisfied or a negative curvature direction is detected and the
direction of negative curvature is used.
Therefore, algorithm NMonNC described in Section 2.3 differs from algorithm F only in the termination criterion of the inner iterations. In Table 7 we report the cumulative results obtained by these two algorithms and in Table 8 we report the number of times each of algorithms F and NMonNC performs the best. Table 7 and Table 8 clearly show that algo-

Table 8. Number of times each of algorithms F and NMonNC performs the best.
          Iterations   Function eval.   Gradient eval.   CPU time
F              9              7                9             15
NMonNC        19             23               19             18
tie           69             67               69             64
Algorithm G: the inner iterates continue until the original criteria of SYMMLQ routine
are satisfied and one of the criteria (8) is fulfilled but the negative curvature used is the
first negative curvature direction detected.
Therefore, algorithm G and algorithm NMonNC use the same Newton-type direction and they differ only in the negative curvature direction. We have compared the numerical behaviour of these algorithms on 14 test problems, which are the only ones where negative curvature directions are detected and the two algorithms perform differently. In Table 9 we report the cumulative results obtained by both algorithms and in Table 10 we report the number of times each of algorithms G and NMonNC performs the best on these 14 problems. Table 9 and Table 10 seem to indicate that the use of a “better” negative curvature direction has a clear beneficial effect as regards number of iterations, gradient evaluations and CPU time. More questionable is the comparison between these two algorithms in terms of function evaluations. In fact, Table 9 and Table 10 show that algorithm NMonNC performs better in most of the test problems but, on the other hand, in a few test problems algorithm G allows a considerable saving in terms of function evaluations. Moreover, by observing in more detail the results obtained on the whole test set we note that in the only test problem (BROYDN7D) where the two algorithms converge towards different critical points, algorithm NMonNC is able to locate a point where the objective function value is lower, as reported in Table 11.

Table 9. Cumulative results for the Algorithms G and NMonNC

Table 10. Number of times each of algorithms G and NMonNC performs the best.
          Iterations   Function eval.   Gradient eval.   CPU time
NMonNC         9              8                9              9
tie            5              3                5              5

Table 11. Detailed results of the problem BROYDN7D.
the magnitude of the scalar $\beta_{i+1}$ generated by the Lanczos algorithm until enough room is available. In the literature a different criterion for evaluating the effectiveness of a negative curvature direction has been proposed in [20]. In order to compare these two criteria, we have implemented another algorithm (denoted by algorithm H) which, instead of using the criterion adopted in algorithm NMonNC, uses the test proposed in [20] which in our notation can be written
where & is given by (4) and wi is the eigenvector of the tridiagonal matrix Ti corresponding
to the smallest eigenvalue. In particular, algorithm H has the following features:
Algorithm H: the termination criterion of the inner iterations is the same as that of algorithm NMonNC but the test on $\beta_{i+1}$ in (8) is replaced by the test (10).
In Table 12 we report the cumulative results obtained by algorithm H together with those obtained by algorithm NMonNC. Note that algorithm H shows one failure. In Table 13 we report the number of times each of algorithms H and NMonNC performs the best. On the basis of Table 12 algorithm NMonNC appears the most effective. However, Table 13 shows that algorithm H is superior as regards the number of wins in terms of iterations and gradient evaluations. Therefore, the obtained results indicate that, probably, the best way to test the “goodness” of direction $d_k$ is to use a criterion which is a compromise between the criterion of algorithm NMonNC and the one proposed in [20].

Table 12. Cumulative results for the Algorithms H and NMonNC

Table 13. Number of times each of algorithms H and NMonNC performs the best.
          Iterations   Function eval.   Gradient eval.   CPU time
NMonNC        12             21               12             48
tie           59             60               60             37
Question 3. The computation of the negative curvature direction $d_k$ (see (4)) requires the use of the matrix $V$, whose columns are the Lanczos basis vectors. As the number of
By observing Table 14, it is clear that as $L$ increases there is a substantial improvement in the behaviour of the algorithm in terms of number of iterations, function and gradient evaluations. On the other hand, when $L$ is greater than 100, an excessive increase of CPU time is needed without producing a substantial improvement in the behaviour of the algorithm. In conclusion, the best choices seem to be $L \in [30, 75]$.
As a concluding remark, we believe that a suitable scaling of the negative curvature direction $d_k$ could play an important role in improving the efficiency of the algorithm. Therefore, this topic is worthy of an extensive study and will be the subject of future work.
Acknowledgments
We would like to thank Ph.L. Toint for helpful discussions and for many useful comments
and suggestions.
References
1. I. Bongartz, A. Conn, N. Gould, and P. Toint. CUTE: Constrained and unconstrained testing environment. ACM Transactions on Mathematical Software, 21:123-160, 1995.
2. A. Conn, N. Gould, and P. Toint. LANCELOT: A Fortran package for Large-scale Nonlinear Optimization (Release A). Springer Verlag, Heidelberg, Berlin, 1992.
3. J. Cullum and R. Willoughby. Lanczos algorithms for large symmetric eigenvalue computations. Birkhäuser, Boston, 1985.
4. R. Dembo, S. Eisenstat, and T. Steihaug. Inexact Newton methods. SIAM Journal on Numerical Analysis, 19:400-408, 1982.
5. R. Dembo and T. Steihaug. Truncated-Newton algorithms for large-scale unconstrained optimization. Mathematical Programming, 26:190-212, 1983.
6. N. Deng, Y. Xiao, and F. Zhou. Nonmonotonic trust region algorithm. Journal of Optimization Theory and Applications, 76:259-285, 1993.
7. M. Ferris, S. Lucidi, and M. Roma. Nonmonotone curvilinear linesearch methods for unconstrained optimization. Computational Optimization and Applications, 6:117-136, 1996.
8. G. Golub and C. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, 1989.
9. L. Grippo, F. Lampariello, and S. Lucidi. A truncated Newton method with nonmonotone linesearch for unconstrained optimization. Journal of Optimization Theory and Applications, 60:401-419, 1989.
10. L. Grippo, F. Lampariello, and S. Lucidi. A class of nonmonotone stabilization methods in unconstrained optimization. Numerische Mathematik, 59:779-805, 1991.
11. G. Liu and J. Han. Convergence of the BFGS algorithm with nonmonotone linesearch. Technical report, Institute of Applied Mathematics, Academia Sinica, Beijing, 1993.
12. S. Lucidi, F. Rochetich, and M. Roma. Curvilinear stabilization techniques for truncated Newton methods in large scale unconstrained optimization: the complete results. Technical Report 02.95, Dipartimento di Informatica e Sistemistica, Università di Roma “La Sapienza”, 1995.
13. G. McCormick. A modification of Armijo’s step-size rule for negative curvature. Mathematical Programming, 13:111-115, 1977.
14. J. Moré and D. Sorensen. On the use of directions of negative curvature in a modified Newton method. Mathematical Programming, 16:1-20, 1979.
15. S. Nash. Newton-type minimization via Lanczos method. SIAM Journal on Numerical Analysis, 21:770-788, 1984.
16. S. Nash and J. Nocedal. A numerical study of the limited memory BFGS method and the truncated-Newton method for large scale optimization. SIAM Journal on Optimization, 1:358-372, 1991.
17. C. Paige and M. Saunders. Solution of sparse indefinite systems of linear equations. SIAM Journal on Numerical Analysis, 12:617-629, 1975.
18. B. Parlett. The symmetric eigenvalue problem. Prentice-Hall series in Computational Mathematics, Englewood Cliffs, 1980.
19. T. Schlick and A. Fogelson. TNPACK - A truncated Newton package for large-scale problems: I. algorithm
and usage. ACM Transaction on Muthematical Sojmare, 18:46-70, 1992.
20. G. Shultz, R. Schnabel, and R. Byrd. A family of trust-region-based algorithms for unconstrained mini-
mization. SIAM Journal on Numericul Anulysis, 22:47-67, 1985.
21. P. Toint. An assesment of non-monotone linesearch techniques for unconstrained optimization. SIAM
Journal of Scientific Computing, 17: 725-739, 1996.
22. P. Toint. A non-monotone trust-region algorithm for nonlinear optimization subject to convex constraints.
Technical Report 94/24, Department of Mathematics, FUNDP, Namur, Belgium, 1994.
23. Y. Xiao and F. Zhou. Nonmonotone trust region methods with curvilinear path in unconstrained minimization.
Computing, 48:303-3 17, 1992.
Computational Optimization and Applications, 7, 89-110 (1997)
© 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Abstract. Numerical and computational aspects of direct methods for large and sparse least squares problems
are considered. After a brief survey of the most often used methods, we summarize the important conclusions
drawn from a numerical comparison in MATLAB. Significantly improved algorithms have during the last 10-15
years made sparse QR factorization attractive, and competitive with previously recommended alternatives. Of
particular importance is the multifrontal approach, characterized by low fill-in, dense subproblems and naturally
implemented parallelism. We describe a Householder multifrontal scheme and its implementation on sequential
and parallel computers. Available software has in practice a great influence on the choice of numerical algorithms.
Less appropriate algorithms are thus often used solely because of existing software packages. We briefly survey
software packages for the solution of sparse linear least squares problems. Finally, we focus on various applications
from optimization, leading to the solution of large and sparse linear least squares problems. In particular, we
concentrate on the important case where the coefficient matrix is a fixed general sparse matrix with a variable
diagonal matrix below. Interior point methods for constrained linear least squares problems give rise, for example,
to such subproblems. Important gains can be made by taking advantage of structure. Closely related is also the
choice of numerical method for these subproblems. We discuss why the less accurate normal equations tend to
be sufficient in many applications.
1. Introduction
Many scientific applications lead to the solution of large and sparse unconstrained linear
least squares problems,
$$\min_x \|Ax - b\|_2^2.$$
Typical areas of application include chemistry, structural analysis and image processing.
Sparse least squares problems are also common subproblems in large scale optimization.
Above, the coefficient matrix $A \in \mathbb{R}^{m \times n}$ is large and sparse with at least as many rows as
columns ($m \ge n$), $b \in \mathbb{R}^m$ is a right-hand side vector and $x \in \mathbb{R}^n$ is the solution. Moreover,
we assume that $A$ has full column rank and that the nonzero entries are nonstructured. The
full rank assumption on $A$ makes $A^T A$ positive definite and the least squares solution
uniquely determined. Much attention has in recent years been devoted to sparse
rank deficient problems. By rank revealing QR factorization (RRQR), the columns of $A$
are permuted in such a way that the orthogonal factorization
In real applications sparse matrices usually have more than $10^5$ rows and columns, but
often less than 0.1% of nonzero entries. An extreme application, described by Kolata [34],
is the adjustment of coordinates of North American geodetic stations. It results in mildly
non-linear least squares problems of about 6.5 million equations and 540,000 unknowns. It
should, however, be emphasized that the complexity of sparse matrix algorithms is determined
not only by the dimension and degree of sparsity; the sparsity pattern also plays an
important role.
The method of least squares dates back almost exactly 200 years. It was proposed as an
algebraic procedure by Legendre [36] in 1805, and was at that time used for applications in
astronomy. Gauss [18] later justified the method as a statistical procedure. He even claimed
to have used the method of least squares since 1795, and should therefore be the “legitimate
inventor”. Later examinations of Gauss’ processing of astronomical data suggest that he
was right in his claim.
Let $Ax = b$ denote a linear system with more equations than unknowns. The method of
least squares finds a “solution” $x$ that minimizes the distance between the range of $A$ and the
right-hand side $b$. For consistent problems, where $b$ belongs to the range of $A$, the residual
$r = b - Ax$ becomes zero. A general residual $r$, corresponding to an optimal solution $\hat{x}$,
satisfies the orthogonality relation $A^T r = 0$.
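The orthogonality relation can be checked directly from the objective. The short derivation below is a sketch that uses only the symbols already defined ($A$, $b$, $x$, $r$); the name $\phi$ for the objective is introduced here for convenience.

```latex
\begin{aligned}
\phi(x) &= \|Ax - b\|_2^2 = x^T A^T A x - 2\, b^T A x + b^T b, \\
\nabla \phi(x) &= 2 A^T A x - 2 A^T b = -2 A^T (b - Ax) = -2 A^T r, \\
\nabla \phi(x) = 0 \;&\Longleftrightarrow\; A^T r = 0 \;\Longleftrightarrow\; A^T A x = A^T b .
\end{aligned}
```

Since $A^T A$ is positive definite under the full column rank assumption, the stationary point is the unique minimizer.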
The standard direct methods for least squares problems can, with respect to their
underlying theoretical foundations, be classified into two groups: (a) those based on the
orthogonality relation $A^T r = 0$ and (b) those based on the orthogonal invariance of the
2-norm. The first group includes the method of normal equations and the augmented system
method, while methods based on QR factorization and SVD belong to the second group.
Sparse linear least squares problems are, as already mentioned, frequent in large scale
optimization. Typical sources include constrained linear least squares problems, interior
point methods for linear programming and nonlinear least squares problems. Often the
underlying optimization problem is solved by the repeated solution of linear least squares
problems, where the coefficient matrices are defined by a fixed sparse matrix and a variable
diagonal matrix below. Let $A$ be a given sparse matrix. For different values of the real
parameter $\lambda$ we then consider the solution of
The repeated problems can, by taking advantage of their special structure, be solved using
previously computed information. If QR factorization is used, then we notice that the
structure of the upper triangular factor $R(\lambda)$ is invariant under the choice of $\lambda$. Much of
the overall computation (the symbolic analysis phase) therefore does not need to be repeated
for each new subproblem.
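The invariance of the structure can be made explicit with a small worked equation. The sketch below assumes, for illustration only, that the variable diagonal block is $\lambda I$ and that the right-hand side is extended by zeros; the same argument applies to any diagonal block.

```latex
\begin{aligned}
&\min_x \left\| \begin{pmatrix} A \\ \lambda I \end{pmatrix} x
      - \begin{pmatrix} b \\ 0 \end{pmatrix} \right\|_2^2 , \\
&R(\lambda)^T R(\lambda)
   = \begin{pmatrix} A \\ \lambda I \end{pmatrix}^T
     \begin{pmatrix} A \\ \lambda I \end{pmatrix}
   = A^T A + \lambda^2 I .
\end{aligned}
```

A full column rank $A$ has no zero column, so the diagonal of $A^T A$ is already structurally nonzero and adding $\lambda^2 I$ creates no new nonzero positions. Barring numerical cancellation, the column ordering and the predicted structure of $R(\lambda)$ can therefore be computed once and reused for every $\lambda$.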
Another interesting subproblem from optimization arises in the interior point solution of
linear programs. Repeated large and sparse linear least squares problems of the type
$$\min_x \|D(Ax - b)\|_2^2$$
are then solved. Here, the diagonal matrix $D$ may contain large elements making $DA$
ill-conditioned. Despite this ill-conditioning, the method of normal equations is common
SPARSE LINEAR LEAST SQUARES PROBLEMS IN OPTIMIZATION 91
and in the literature often recommended as a standard method for solving these problems.
The coefficient matrix $A^T D^2 A$ is then explicitly formed and its Cholesky factor computed.
It is, however, interesting that the disparaged method of normal equations in practice turns
out to work well in this application. One explanation could be that the computed solutions
are search directions in Newton’s method, so that high accuracy should not be that
important. This is, however, not a sufficient explanation. We suggest that the observed
attractive properties can also be explained as an effect of implicit iterative refinement.
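For reference, the matrix $A^T D^2 A$ is simply the normal equations matrix of the scaled problem. The lines below spell this out; the symbols $\tilde{A}$ and $\tilde{b}$ are shorthand introduced here.

```latex
\begin{aligned}
\tilde{A} = DA, \quad \tilde{b} = Db
  \;&\Longrightarrow\; \min_x \|D(Ax - b)\|_2^2 = \min_x \|\tilde{A} x - \tilde{b}\|_2^2 , \\
\tilde{A}^T \tilde{A}\, x = \tilde{A}^T \tilde{b}
  \;&\Longleftrightarrow\; A^T D^2 A\, x = A^T D^2 b ,
\end{aligned}
```

where $D^T D = D^2$ because $D$ is diagonal. The sparsity structure of $A^T D^2 A$ does not depend on the values in $D$, only on the structure of $A$.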
The outline of the paper is as follows. Section 2 surveys the most often used direct methods
for sparse linear least squares problems. The numerical properties of methods for dense
problems obviously carry over to the sparse case. Stable methods for general dense matrices
are thus equally accurate when used for solving sparse problems. Dense methods may,
however, be less appropriate from a sparsity point of view. Section 3 deals with sparse QR
factorization in general. We sketch the different alternatives that have been used during the
last 25 years. The most recent advance in efficient sparse QR factorization, multifrontal
methods, is considered in Section 4. We describe the algorithm and, briefly, some details
related to the implementation on sequential and parallel computers. In Section 5, we
discuss and summarize some often used software packages for solving sparse linear least
squares problems. The last section is devoted to applications in optimization where sparse
least squares problems arise as repeated subproblems. In particular we focus on the case
mentioned above, where the coefficient matrix is a fixed general sparse matrix with a variable
diagonal matrix below.
2. Direct methods
Execution times and memory usage are central issues in sparse matrix computation. Many
large and sparse problems can thus only be solved if sparsity is well utilized. The
introduction of fill-in must in particular be avoided, making it a determining criterion in the
evaluation of sparse methods. However, the question of numerical stability must also be
taken into account. There are often inherent incompatibilities between sparsity and stability.
One such example is the solution of weighted least squares problems by QR factorization.
High accuracy may require a column ordering that is less suitable with respect to fill-in.
Another example is sparse LU factorization, where a column ordering chosen for sparsity
reasons often must be modified to ensure numerical stability. In many implementations of
sparse LU factorization pivots are chosen by a threshold criterion that balances sparsity
and numerical stability.
The method of normal equations, based on the orthogonality relation $A^T r = 0$, is the
classical and, because of its simplicity, probably the most common method for solving linear
least squares problems. It was derived and used already by Gauss. The solution is computed
simply by forming and solving the normal equations,
$$A^T A x = A^T b.$$
The full rank assumption on $A$ makes $A^T A$ positive definite, so that the symmetric
linear system can be solved without pivoting using Cholesky factorization, $A^T A = R^T R$.
92 MATSTOMS
Important savings in memory usage and execution times are made by appropriately ordering
the columns of $A$. Savings are achieved both in the factorization step and in the subsequent
solution of triangular systems in $R$. Often used methods for finding low fill-in column
orderings of $A$ (symmetric orderings of $A^T A$) include the minimum degree ordering, nested
dissection and reverse Cuthill-McKee. In many scientific applications it is, however, also
possible to formulate the practical problem in such a way that the natural ordering gives
low fill-in. With this remark in mind, it is clear that general purpose software in general
cannot compete with software designed for a particular application. The explicit forming of
$A^T A$ gives rise to two potential problems: (a) loss of accuracy due to the squared condition
number, $\kappa(A^T A) = \kappa^2(A)$, and (b) fill-in. Dense rows in an otherwise sparse matrix $A$
make $A^T A$ filled.
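The method of normal equations described above can be sketched in a few lines. The code below is a minimal dense illustration in plain Python, not a sparse implementation: it ignores column ordering and fill-in, and the function name `normal_equations_solve` is hypothetical.

```python
import math

def normal_equations_solve(A, b):
    """Solve min ||Ax - b||_2 via A^T A x = A^T b with Cholesky A^T A = R^T R.

    A is a list of m rows of length n (full column rank assumed), b has length m.
    """
    m, n = len(A), len(A[0])
    # Form C = A^T A and d = A^T b explicitly (this squares the condition number).
    C = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)] for i in range(n)]
    d = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]
    # Cholesky factorization without pivoting: upper triangular R with R^T R = C.
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        s = C[i][i] - sum(R[k][i] ** 2 for k in range(i))
        R[i][i] = math.sqrt(s)  # s > 0 in exact arithmetic when A has full column rank
        for j in range(i + 1, n):
            R[i][j] = (C[i][j] - sum(R[k][i] * R[k][j] for k in range(i))) / R[i][i]
    # Forward substitution R^T y = d, then back substitution R x = y.
    y = [0.0] * n
    for i in range(n):
        y[i] = (d[i] - sum(R[k][i] * y[k] for k in range(i))) / R[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    return x

# A small consistent problem: b = A * (1, 2)^T.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 3.0]
x = normal_equations_solve(A, b)
```

For an ill-conditioned $A$ the square root argument `s` can become non-positive in floating point, which is one concrete way the squared condition number shows up.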
Two of the most reliable and accurate direct numerical methods for solving sparse least
squares problems are based on the QR factorization of $A$,
$$A = Q \begin{pmatrix} R \\ 0 \end{pmatrix}.$$
The matrix $A$, with columns permuted for sparsity in $R$, is decomposed into an orthogonal
matrix $Q \in \mathbb{R}^{m \times m}$ and an upper triangular matrix $R \in \mathbb{R}^{n \times n}$. Using the corrected
seminormal equations (CSNE), a solution $\bar{x} \in \mathbb{R}^n$ is first computed from
$$R^T R \bar{x} = A^T b,$$
and then corrected by one step of iterative refinement, $x \leftarrow \bar{x} + \delta x$. Here, the correction
vector $\delta x$ is computed in fixed precision from $R^T R\, \delta x = A^T r$, where $r = b - A\bar{x}$. By the
correction step it can be shown (Björck [2]) that computed solutions normally are of high accuracy.
It is easily shown that the Cholesky factor of $A^T A$ and the upper triangular factor in the QR
factorization of $A$ mathematically are the same. They have the same sparsity patterns and,
except for possible sign differences of the rows, the same numerical values. It follows that
the seminormal equations can be seen as an alternative way of solving the normal equations.
The factor $R$ is computed by QR factorization of $A$ instead of Cholesky factorization of
$A^T A$. One could believe that, due to a “better” matrix $R$, a more accurate solution can then
be expected. This is, however, not a sufficient explanation. It should be noticed that
the solution $\bar{x}$, obtained without the refinement step, generally is not more accurate than a
solution computed by the normal equations. Only a careful error analysis can explain why
the CSNE works better than the normal equations. One important difference is, however,
that the rate of convergence in iterative refinement is much better when the seminormal
equations are used.
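The two CSNE steps can be sketched as follows. This is a minimal dense illustration under the assumption that an upper triangular factor `R` of $A$ is already available (in practice it would come from a sparse QR factorization); the function names `solve_rtr` and `csne` are hypothetical.

```python
def solve_rtr(R, d):
    """Solve R^T R z = d by one forward and one backward substitution."""
    n = len(R)
    y = [0.0] * n
    for i in range(n):  # R^T y = d (lower triangular system)
        y[i] = (d[i] - sum(R[k][i] * y[k] for k in range(i))) / R[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):  # R z = y (upper triangular system)
        z[i] = (y[i] - sum(R[i][j] * z[j] for j in range(i + 1, n))) / R[i][i]
    return z

def csne(A, b, R):
    """Corrected seminormal equations: seminormal solve plus one refinement step."""
    m, n = len(A), len(A[0])

    def at_times(v):  # A^T v
        return [sum(A[k][i] * v[k] for k in range(m)) for i in range(n)]

    x_bar = solve_rtr(R, at_times(b))  # R^T R x_bar = A^T b
    r = [b[k] - sum(A[k][j] * x_bar[j] for j in range(n)) for k in range(m)]
    dx = solve_rtr(R, at_times(r))     # R^T R dx = A^T r
    return [xi + di for xi, di in zip(x_bar, dx)]

# Example: for this A, R^T R = A^T A = [[2, 1], [1, 2]].
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 4.0]
R = [[2.0 ** 0.5, 0.5 ** 0.5], [0.0, 1.5 ** 0.5]]
x = csne(A, b, R)
```

Only $A$ and $R$ are touched; the orthogonal factor $Q$ is never needed, which is the storage advantage of the seminormal approach.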
Golub's method [31] is another approach based on QR factorization. It computes the least
squares solution by factorizing $A$ and then solving the triangular system
This method and the CSNE have essentially the same attractive numerical properties. An
advantage of the CSNE method is that it only uses the factor R. Subsequent right-hand sides
can be handled without the extra cost for storing Q. This is important since Q normally is
much more costly to store than the matrix $R$. For details related to the storage and use
of a large and sparse matrix $Q$, we refer to Lu and Barlow [40] and Puglisi [53].
A potential drawback of the normal equations and the methods based on QR factorization
of $A$ is that $R$ is assumed to be sparse. This is, however, not the case when using the
augmented system method. The least squares solution $x \in \mathbb{R}^n$ and the corresponding
residual vector $r = b - Ax \in \mathbb{R}^m$ are computed from a symmetric but indefinite linear
system of order $m + n$,
The scaling factor $\alpha \in \mathbb{R}$ is introduced in order to reduce the effect of roundoff errors in
the computed solution. With a sufficiently large value of $\alpha$, the first $m$ pivots in Gaussian
elimination are chosen from the (1,1)-block. The resulting system then reduces to the system
of normal equations which, as indicated above, is unsatisfactory for less well-conditioned
problems. Björck [4] shows how $\alpha$ can be chosen to minimize the condition number of the
augmented matrix or, alternatively, to minimize an upper bound for the introduced roundoff
error. Since the expressions for these optimal values both include the smallest singular value
of $A$, they are expensive to compute. Cheaper approximate values of $\alpha$ have therefore
been proposed. Arioli et al. [1] and Gilbert et al. [27] use $\alpha = \max_{i,j} |a_{ij}|/1000$, while
Matstoms [46] approximates the smallest singular value of $A$ by one step of inverse iteration
on $A^T A$.
A small $\alpha$ improves stability but may also introduce more fill-in. The required number of
floating point operations is then also increased. It should be noticed that Björck's optimal
choices of $\alpha$ are derived only with respect to numerical stability. Notice, however, that
iterative refinement often compensates for a large value of $\alpha$.
Efficient solution of the augmented system requires that structure and symmetry are utilized
and preserved. However, ordinary symmetric Gaussian elimination, with $1 \times 1$ pivots chosen
from the diagonal, may be unstable. A combination of $1 \times 1$ and $2 \times 2$ pivots should instead
be used. The use of $2 \times 2$ pivots may also reduce the fill-in (Duff and Reid [13]). The
MA27 software (Duff and Reid [14]) factorizes sparse matrices by a pivot strategy similar
to the one proposed by Bunch and Kaufman [6] for dense matrices. The balance between
stability and sparsity is controlled by a threshold criterion in the choice of pivots.
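To make the augmented system method concrete, the sketch below assumes the commonly used scaled form with blocks $\alpha I$, $A$, $A^T$ and $0$, unknowns $(r/\alpha, x)$ and right-hand side $(b, 0)$. It is a dense illustration that solves the system by ordinary Gaussian elimination with partial pivoting, not by the symmetric $1 \times 1$ and $2 \times 2$ pivoting discussed above; the default $\alpha$ is the heuristic $\max_{i,j}|a_{ij}|/1000$ quoted in the text, and all function names are hypothetical.

```python
def solve_dense(K, f):
    """Gaussian elimination with partial pivoting on a small dense system K z = f."""
    n = len(K)
    M = [row[:] + [fi] for row, fi in zip(K, f)]  # augmented working copy
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            t = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= t * M[k][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def augmented_solve(A, b, alpha=None):
    """Return (x, r) from [[alpha*I, A], [A^T, 0]] [r/alpha; x] = [b; 0] (assumed form)."""
    m, n = len(A), len(A[0])
    if alpha is None:
        alpha = max(abs(v) for row in A for v in row) / 1000.0
    K = [[0.0] * (m + n) for _ in range(m + n)]
    for i in range(m):
        K[i][i] = alpha
        for j in range(n):
            K[i][m + j] = K[m + j][i] = float(A[i][j])
    z = solve_dense(K, [float(v) for v in b] + [0.0] * n)
    r = [alpha * zi for zi in z[:m]]  # undo the scaling of the residual block
    return z[m:], r

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 4.0]
x, r = augmented_solve(A, b)
```

The first block row reads $r + Ax = b$ and the second $A^T r = 0$, so the solution and the residual are obtained together.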
To evaluate the four methods described, Matstoms [45] compares accuracy and execution
times in the sparse extension of MATLAB. The experiments are carried out on nine of the
matrices from the widely used Harwell-Boeing test collection (Duff et al. [12]), together
with five matrices formed by the merging of two Harwell-Boeing matrices. Following
Arioli et al. [1], a second set of more ill-conditioned matrices is formed by a row scaling
of the Harwell-Boeing matrices. Rows from index $n - 1$ to $m$ are multiplied by a factor
$16^{-5}$. A set of consistent sparse linear least squares problems is defined by choosing the
exact solutions $x = (1, \ldots, 1)^T$ and then setting the right-hand sides to $b = Ax$.
Matstoms' conclusions from the numerical experiments can be summarized as follows:
The current MATLAB implementation of the augmented system method (built-in) works
well for well-conditioned problems of moderate sizes. For general sparse problems a better
choice of the scaling parameter $\alpha$ or iterative refinement must be used to get accurate
3. Sparse QR factorization
This section briefly summarizes the methods that during the last 20-30 years have been
used for sparse QR factorization. Algorithms based on Householder transformations have
traditionally, until the recent introduction of multifrontal methods, suffered from costly
(intermediate) fill-in, and have therefore been rejected. See, for example, Duff and Reid [13], Gill
and Murray [28] or Heath [33]. They all conclude that “Givens rotations are a much more
appropriate tool in this context because of their ability to introduce zeros more selectively
and in a more flexible order” (Heath [33]).
Givens rotations, based on either row-wise or column-wise elimination, have instead been
preferred and used. The column-wise strategy uses the same elimination order as used in
the Householder method. Thus, the $k$th major step eliminates the subdiagonal elements
in the $k$th column, and computes the $k$th row of $R$. This variant of sparse Givens QR
factorization has, for example, been considered by Duff [10] and Duff and Reid [13]. The
introduction of intermediate fill-in can be controlled by an appropriate row ordering. Duff
[10] considers different strategies for finding a row ordering that minimizes the introduction
of intermediate fill-in.
An alternative strategy, variable pivot rows, was suggested by Gentleman [21], [22] (see
also Duff [10]). Instead of using a fixed pivot row within each major step, any two rows with
nonzero entries in the pivot column can be rotated. A proper choice of row combinations
may, compared with the fixed pivot strategies, lead to savings in operation count and memory
requirements.
Row-oriented Givens schemes have, for example, been considered by Gentleman [19],
[20], Gill and Murray [29], and George and Heath [23]. The input matrix $A$ is processed
by rows, in such a way that the $k$th major step eliminates the subdiagonal elements of the
$k$th row. This is done by rotations of rows into the partially computed factor. In contrast to
column-oriented algorithms, no row or column of $R$ is final after a major step; the factor
is complete only when all rows have been processed. The “partially computed factor” therefore
just refers to an intermediate result.
Row-oriented schemes have, compared with column-oriented schemes, two important
advantages. First, since the algorithm at any time only operates on a single working row
of $A$ and on the partially computed factor, out-of-core implementations follow naturally.
Only the partially computed factor $R$ needs to be held in memory. Rows of $A$ can be read
sequentially from secondary storage and merged into $R$. Second, the method is well suited
for updates. New observations, in terms of new rows of $A$, can easily be handled when the
original $R$ has been computed. In this case, new rows are processed in exactly the same
way as the original rows of $A$. Finally, it should be noticed that local heuristics for finding
an appropriate row and column ordering often are based on column-wise annihilation of
nonzeros. Such algorithms can therefore not be used together with row-oriented schemes.
A priori heuristics must instead be used; but then the out-of-core argument vanishes.
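The row-oriented scheme and its suitability for updates can be illustrated with a small dense sketch. The code below rotates one row at a time into the partially computed factor using Givens rotations; it ignores sparsity and ordering entirely, and the names `rotate_row_into` and `givens_qr_by_rows` are hypothetical.

```python
import math

def rotate_row_into(R, row):
    """Annihilate `row` against the partially computed factor R (modified in place)."""
    n = len(R)
    w = [float(v) for v in row]  # working row
    for k in range(n):
        if w[k] == 0.0:
            continue  # nothing to eliminate in column k
        h = math.hypot(R[k][k], w[k])
        c, s = R[k][k] / h, w[k] / h
        # Rotate row k of R with the working row so that w[k] becomes zero.
        for j in range(k, n):
            R[k][j], w[j] = c * R[k][j] + s * w[j], c * w[j] - s * R[k][j]
    return R

def givens_qr_by_rows(rows, n):
    """Build the n-by-n factor R by processing the rows of A one after another."""
    R = [[0.0] * n for _ in range(n)]
    for row in rows:
        rotate_row_into(R, row)
    return R

# Factor the first two observations, then add a third one as an update.
R = givens_qr_by_rows([[1.0, 0.0], [0.0, 1.0]], 2)
rotate_row_into(R, [1.0, 1.0])
```

After the update, `R` is the same factor that would have been obtained by processing all three rows from the start, so $R^T R = A^T A$ for the full matrix.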
The algorithm proposed by George and Heath [23] was an important advance. It made QR
factorization a useful alternative for the solution of sparse linear least squares problems. QR
factorization was previously much slower and more memory consuming than the alternative
direct methods. The main contribution by George and Heath was the proposed a priori
strategy for finding an appropriate column permutation $P_c$ of $A$, and a scheme for efficiently
handling intermediate fill-in. In the symbolic phase they utilized the previously mentioned
relation between QR factorization of $A$ and Cholesky factorization of $A^T A$. A sparse
Cholesky factor of $P_c^T (A^T A) P_c = (A P_c)^T (A P_c)$ guarantees a sparse factor $R$ in the QR
factorization of $A P_c$. Standard symmetric strategies, for example minimum degree and nested
dissection orderings, can therefore be used for computing a column ordering of $A$.
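A symbolic phase of this kind starts from the nonzero structure of $A^T A$, which can be obtained from the structure of $A$ alone. The sketch below assumes that no numerical cancellation occurs and represents each row of $A$ by the set of its nonzero column indices; the function name `ata_pattern` is hypothetical.

```python
def ata_pattern(rows, n):
    """Nonzero pattern of A^T A, given for each row of A the set of its nonzero columns.

    Returns a list where entry i is the set of columns j with (A^T A)[i][j] structurally nonzero.
    """
    pattern = [set() for _ in range(n)]
    for cols in rows:
        for i in cols:
            # A row with nonzeros in `cols` couples every pair of those columns.
            pattern[i].update(cols)
    return pattern

# A bidiagonal-like A gives a tridiagonal A^T A ...
sparse_rows = [{0, 1}, {1, 2}, {2, 3}]
p1 = ata_pattern(sparse_rows, 4)
# ... while one additional dense row fills A^T A completely.
p2 = ata_pattern(sparse_rows + [{0, 1, 2, 3}], 4)
```

The second call shows the effect of a single dense row noted in Section 2: every pair of columns becomes coupled, so $A^T A$ is full.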
Variable row pivoting may, as already pointed out, decrease the required number of
rotations in sparse Givens QR factorization. It reduces the propagation of intermediate
fill-in from previously rotated rows, and decreases the number of new nonzero elements to
be annihilated. From this idea, Liu [38] generalizes the row-oriented algorithm by George
and Heath [23] to handle more than one simultaneously active triangular matrix. In the
George and Heath algorithm, the rows of $A$ are sequentially rotated into a triangular matrix
that finally, when all rows are processed, defines the factor $R$. The operation of rotating a
sparse row into an upper triangular matrix is called a row rotation (George and Ng [25]). In
Liu’s algorithm [38], rows are instead annihilated by rotations with one of many triangular
structures. Such triangular matrices are then pairwise rotated together into new upper
triangular structures (generalized row merging, Liu [38]). The resulting upper triangular
matrix equals the factor $R$.
We finally comment on some important implementation details mentioned by Liu [38].
First, the submatrix rotations can be performed as dense matrix operations. Rows and
columns identically zero remain zero during the triangularization, and can therefore be
removed in advance. A simple mapping between local and global column indices is used to
match the resulting dense matrix with the overall sparse problem. Second, by visiting the
nodes in depth-first order, the simultaneously active triangular matrices can be stored and
retrieved in a first-in/last-out manner. An efficient data representation can therefore be obtained
with a stack data structure. The use of depth-first ordered trees was in this context first
proposed by Duff and Reid [14]. George and Liu [24] generalize Liu’s algorithm by,
instead of using Givens rotations, performing the dense triangularizations by Householder
transformations. All these modifications of George and Heath’s algorithm are used in the
multifrontal Householder algorithm discussed in the next section.
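The stack argument can be illustrated without any numerical work. The sketch below assumes a tree given as a dictionary from a node to its list of children (a hypothetical representation) and uses the node labels themselves as stand-ins for the triangular matrices; it only tracks the order in which they are pushed and popped.

```python
def postorder_with_stack(children, root):
    """Visit a tree in depth-first postorder; return the visit order and the peak stack size."""
    order, stack = [], []
    peak = 0

    def visit(v):
        nonlocal peak
        kids = children.get(v, [])
        for c in kids:
            visit(c)
        # In depth-first order the results of v's children are exactly the top entries.
        popped = [stack.pop() for _ in kids]
        assert sorted(popped) == sorted(kids)
        stack.append(v)  # stand-in for the triangular matrix produced at node v
        peak = max(peak, len(stack))
        order.append(v)

    visit(root)
    return order, peak

# Nodes 0 and 1 are children of 2; nodes 2 and 3 are children of the root 4.
order, peak = postorder_with_stack({4: [2, 3], 2: [0, 1]}, 4)
```

Each node consumes the results of its children from the top of the stack and pushes its own, which is why a plain stack suffices in the sequential setting.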
4. Multifrontal QR factorization
then defines a contribution to the matrix factor $R$ and the dense update matrix of the node
itself. The recursive way of computing update matrices motivates the tree traversal rule
given above. QR factorization has no effect on columns identically zero, so that such
columns can be removed in advance. The condensation makes it possible to use dense methods
in the frontal factorization. A number of further improvements have been proposed and
studied by Matstoms [43] and Puglisi [53].
Multifrontal algorithms are basically parallelized along two different lines. First, the
elimination tree can be traversed in parallel. Columns associated with nodes in different branches
are independent and can be simultaneously processed. We refer to this approach as tree
parallelism. In the alternative approach, called node level parallelism, the tree is
sequentially traversed but the dense factorization problems of each node are solved in parallel. The
efficiency of the two approaches is determined both by the structure of the sparse matrix to
be factorized, and by the computer used. In particular, factors such as the size of the frontal
matrices and the structure of the underlying elimination tree are of great importance. On
shared memory architectures, where there is no cost for communication between processors,
the machine dependent parameter $n_{1/2}$ is also of importance. It is defined to be the smallest
matrix dimension (square matrices) required to achieve half the asymptotic performance
of a certain matrix operation. We use it as a measure of how fast the performance increases
with increasing matrix dimension. A large value of $n_{1/2}$ means that large matrices are
required to achieve high performance on the computer used. The performance on small
matrices may then be unsatisfactory. Mostly large frontal matrices and a small value of
$n_{1/2}$ indicate that node level parallelism should be used. Small frontal matrices and a large
value of $n_{1/2}$ make, on the other hand, tree parallelism more attractive. In the latter case,
it is also important that the elimination tree has a suitable structure. Ideally, the tree should
be short and bushy.
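The role of $n_{1/2}$ can be made concrete with a simple performance model. The formula below is a commonly used two-parameter model of the Hockney type and is an assumption made here for illustration; the symbols $r(n)$ and $r_\infty$ (performance at dimension $n$ and asymptotic performance) are not taken from the text.

```latex
r(n) = \frac{r_\infty}{1 + n_{1/2}/n},
\qquad
r(n_{1/2}) = \frac{r_\infty}{2},
\qquad
r(n) \ge 0.9\, r_\infty \;\Longleftrightarrow\; n \ge 9\, n_{1/2} .
```

Under this model, frontal matrices need a dimension of about nine times $n_{1/2}$ to reach 90% of the asymptotic rate, which is why many small frontal matrices combined with a large $n_{1/2}$ favour tree parallelism.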
In the shared memory implementation (see Matstoms [46], [47] and Puglisi [53]) a
pool-of-tasks is initially set to all leaf nodes of the elimination tree. During the computation,
processes ask the pool manager for new tasks. The contribution block from $A$ and the update
matrices of the child nodes are then merged into a frontal matrix. Dense factorization of
the frontal matrix gives a contribution to $R$ and an update matrix of the node itself. A
parent node is ready to be processed, and consequently moved to the pool, when all its
children are processed. Since nodes in this parallel setting are not visited in a strict depth-first
order, the stack storage of update matrices can no longer be used. A more general
form of dynamic memory allocation (a buddy system) is instead used. Semaphores are also
required to prevent processes from simultaneously writing to shared memory blocks.
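The scheduling rule of the pool-of-tasks can be sketched sequentially. The code below is a single-process simulation under stated assumptions: the elimination tree is given as a hypothetical `parent` array with `-1` marking the root, the numerical work at each node is omitted, and no locking is shown.

```python
from collections import deque

def pool_of_tasks_order(parent):
    """Return one valid processing order for a tree given as parent[v] (root has parent -1)."""
    n = len(parent)
    pending = [0] * n  # number of unprocessed children per node
    for p in parent:
        if p >= 0:
            pending[p] += 1
    pool = deque(v for v in range(n) if pending[v] == 0)  # initially: all leaf nodes
    order = []
    while pool:
        v = pool.popleft()  # a process asks the pool manager for a task
        order.append(v)     # here the frontal matrix of v would be assembled and factorized
        p = parent[v]
        if p >= 0:
            pending[p] -= 1
            if pending[p] == 0:  # all children processed: the parent becomes ready
                pool.append(p)
    return order

# Nodes 0 and 1 are children of 2; nodes 2 and 3 are children of the root 4.
order = pool_of_tasks_order([2, 2, 4, 4, -1])
```

In the example the resulting order is not a depth-first postorder (node 3 is processed before node 2), which illustrates why the stack storage of update matrices no longer applies in the parallel setting.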
In a message passing implementation the main problem is to obtain good load balancing
and, at the same time, minimize the communication overhead. This can only be achieved
by dividing the elimination tree into a number of independent subtrees of essentially the same
computational complexity. Each processor is then assigned a subtree. As in the shared
memory implementation, the rather sequential upper half of the tree must be treated in a
special way to make full use of parallelism.
In this section, we survey some often used software packages for general sparse least squares
problems. Many problem related issues, such as problem size, sparsity and structure,
determine whether direct or iterative methods should be used. Very large problems with
structured nonzero patterns and easily computed nonzero entries are, for example, often
solved by iterative methods. Direct methods are, on the other hand, preferable in statistical
modeling when the covariance matrix is required. Other problem related details, such as
possible rank deficiency, occurrence of weighted rows and the number of right-hand sides,
also influence the choice of method and software. Special software packages, designed
for the particular problems of interest, may sometimes give better performance than the
general software packages described here.
MATLAB has been extended to include sparse matrix storage and operations. The
included operations and algorithms are described in Gilbert et al. [27]. A minimum degree
preordering algorithm and a sparse Cholesky decomposition have, for example, been
included. By using these and other built-in routines, new sparse algorithms are relatively
easily implemented. There is also a built-in sparse least squares solver in MATLAB. This
currently uses the augmented system formulation with the scaling parameter chosen to be
$\alpha = 10^{-3} \max_{i,j} |a_{ij}|$. The solution is computed using the minimum degree ordering and
the built-in sparse LU decomposition.
Matstoms [45] has developed a multifrontal sparse QR decomposition to be used with
MATLAB. This is implemented as four m-files, which are available from netlib. The main
routine is called sqr and the statement [R,p,c] = sqr(A,b) will compute the factor $R$ in
a sparse QR decomposition of $A$, and $c = Q^T b$. For further details we refer to [45].
More recently C. Sun, Advanced Computing Research Facility, Cornell University, has
developed another software package for computing a sparse QR decomposition. This
package is implemented in C and also designed to be used within the MATLAB environment. C.
Sun [55] has also developed a parallel multifrontal algorithm for sparse QR factorization
on distributed-memory multiprocessors.
Pierce and Lewis [51] at Boeing have implemented a multifrontal sparse rank revealing QR
decomposition/least squares solution module. This code has some optimization for vector
computers in general, but it also works very well on a wide variety of scientific workstations.
It is included in the commercial software package BCSLIB-EXT from Boeing Information
and Support Services, Seattle. This library of FORTRAN callable routines is also given
to researchers in laboratories and academia for testing and comparison, as a professional
courtesy.
The Harwell Subroutine Library (HSL) has a subroutine MA45 to solve the normal
equations. If the least squares problem is written in the augmented matrix form, then the
multifrontal subroutine MA27 for solving symmetric indefinite linear systems can be used.
However, the MA27 code does not exploit the special structure of the augmented system.
There is also a new routine MA47 which is designed to solve this kind of system efficiently.
Closely related to the Harwell MA27 code is the QR27 code, that has been developed by
Matstoms [44]. It is a Fortran-77 implementation of the Householder multifrontal algorithm
for sparse QR factorization that was described in a previous section. To solve sparse least
squares problems it uses the corrected seminormal equations (CSNE). The code is available
for academic research and can be ordered by e-mail to [email protected]. A parallel version
of QR27 has been developed for shared memory MIMD computers, see Matstoms [47].
SPARSPAK is a collection of routines for solving sparse linear systems, developed
at the University of Waterloo. It is divided into two parts: SPARSPAK-A deals with
sparse symmetric positive definite systems and SPARSPAK-B handles sparse linear least
squares problems, including linear equality constraints. For solving least squares problems
both the A and B parts are needed. SPARSPAK-B has the feature that dense rows of A, which
would cause R to fill, can be withheld from the decomposition and the final solution updated to
incorporate them at the end. Only the upper triangular factor is maintained, and the Givens
rotations are not saved.
Zlatev and Nielsen [57] have developed a Fortran subroutine called LLSS01, which uses
fast Givens rotations to perform the QR decomposition. The orthogonal matrix Q is not
stored, and elements in R smaller than a user specified tolerance are dropped. The solution
is computed using fixed precision iterative refinement or, alternatively, preconditioned
conjugate gradients with the computed matrix R as preconditioner; see [58]. The table below
summarizes the considered software packages.
6. Applications in optimization
Sparse linear least squares problems are frequent in large scale optimization. Typical
sources are constrained linear least squares problems and interior point methods for linear
programming. The solution of nonlinear least squares problems also gives rise to a sequence
of sparse linear least squares problems. We discuss these applications and show how
linear least squares problems arise as subproblems. Of particular interest is the solution
of problems of regularization type. In this case the coefficient matrix is a fixed general
sparse matrix with a variable diagonal matrix below. Another interesting topic is the
repeated solution of certain linear least squares problems arising in interior point methods
for linear programming. The coefficient matrices may here be rather ill-conditioned, and it
has therefore been somewhat surprising that the normal equation method works well and
usually turns out to give sufficiently accurate solutions.
Our aim is not to give a complete survey of the considered optimization methods.
Rather, we aim to define some important problems in optimization and then, with technical
details suppressed, show how their solution leads to repeated unconstrained linear least
squares problems. For more exhaustive treatments we refer to Gill et al. [30] and Dennis
and Schnabel [9] (nonlinear least squares), Lawson and Hanson [35] and Björck [5]
(constrained least squares), and Gonzaga [32] and Wright [56] (interior point methods for linear
programming).
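Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be a vector-valued real function. The unconstrained nonlinear least
squares problem is to find a vector $x \in \mathbb{R}^n$ such that the sum of squares of the $m \ge n$
functions $f_i(x)$ is minimized,
$$\min_x F(x), \qquad F(x) = \tfrac{1}{2} \|f(x)\|_2^2 = \tfrac{1}{2} \sum_{i=1}^{m} f_i(x)^2. \qquad (2)$$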
Let us first consider the standard Gauss-Newton method for the iterative solution of
(2). If $J(x) \in \mathbb{R}^{m \times n}$ is the Jacobian of $f(x)$ and $G_i(x) \in \mathbb{R}^{n \times n}$ the Hessian of the $i$th
component $f_i(x)$ of $f$, then the gradient and Hessian of the objective function $F(x)$ equal
$$\nabla F(x) = J(x)^T f(x)$$
and
$$H(x) = J(x)^T J(x) + Q(x), \qquad Q(x) = \sum_{i=1}^{m} f_i(x) G_i(x),$$
respectively. Neglecting the term $Q(x)$ in Newton's method gives the Gauss-Newton equations
$$J(x_k)^T J(x_k) p_k = -J(x_k)^T f(x_k). \qquad (3)$$
Here, $x_k$ is the $k$th approximation to the solution and $p_k$ the increment defining $x_{k+1} = x_k + p_k$.
The Jacobian $J(x_k)$ is in practice often ill-conditioned and sometimes even rank deficient.
Explicit formation of the coefficient matrix in (3) is therefore unsuitable. A better alternative
is to consider (3) as the normal equations for the linear least squares problem
$$\min_{p_k} \|J(x_k) p_k + f(x_k)\|_2, \qquad (4)$$
which instead can be solved by QR factorization. In the dense case, rank detection is easily
implemented by Golub's column pivoting, and for sparse matrices a sparse RRQR algorithm
can be used. In the rank deficient case, search directions are determined as minimum norm
solutions of (4).
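The Gauss-Newton iteration described above can be sketched in a few lines. This is a minimal dense illustration, not the sparse QR machinery of the paper: it assumes NumPy, uses `numpy.linalg.lstsq` (an SVD-based solver that returns a minimum norm solution for rank deficient Jacobians) in place of a rank revealing QR factorization, and the exponential fitting problem and all function names are hypothetical.

```python
import numpy as np

def gauss_newton(f, jac, x0, iters=20, tol=1e-12):
    """Gauss-Newton: each step solves min_p ||J(x_k) p + f(x_k)||_2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = f(x)
        J = jac(x)
        # Orthogonal (SVD-based) least squares solve; J^T J is never formed.
        p, *_ = np.linalg.lstsq(J, -r, rcond=None)
        x = x + p
        if np.linalg.norm(p) <= tol * (1.0 + np.linalg.norm(x)):
            break
    return x

# Hypothetical zero-residual test problem: fit y = a * exp(b * t).
t = np.linspace(0.0, 1.0, 8)
y = 2.0 * np.exp(-1.5 * t)

def residual(x):
    return x[0] * np.exp(x[1] * t) - y

def jacobian(x):
    e = np.exp(x[1] * t)
    return np.column_stack([e, x[0] * t * e])

x_fit = gauss_newton(residual, jacobian, x0=[1.0, -1.0])
```

For a large sparse Jacobian the `lstsq` call would be replaced by a sparse QR or RRQR solver; the structure of the loop stays the same.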
A frequently used alternative to the Gauss-Newton approach, which avoids critical ill-conditioning
of the Jacobian $J(x)$, is the Levenberg-Marquardt strategy (Levenberg [37] and
Marquardt [42]). To guard against inaccurate and unfavorable search directions, a restriction
$\|p_k\|_2 \le \Delta$ is imposed upon (4). For a parameter $\lambda \ge 0$, related to the bound $\Delta$,
new search directions $p_k$ are defined by
$$\left( J(x_k)^T J(x_k) + \lambda I \right) p_k = -J(x_k)^T f(x_k). \qquad (5)$$
With this interpretation of the $\lambda I$ term, the approach can be considered as a trust region
strategy. Otherwise, (5) can be seen as a pure regularization of (4). The key point is,
however, that the added diagonal matrix makes $J^T J + \lambda I$ nonsingular, so that a unique solution
can be guaranteed. Many strategies for choosing $\lambda$ have been proposed. Algorithm and
implementation details are discussed by Moré [48]. The system (5) can obviously be
interpreted as the normal equations of the linear least squares problem
$$\min_{p_k} \left\| \begin{pmatrix} J(x_k) \\ \lambda^{1/2} I \end{pmatrix} p_k + \begin{pmatrix} f(x_k) \\ 0 \end{pmatrix} \right\|_2. \qquad (6)$$
Alternative methods, like Golub's method and the CSNE, can then be used. We remark that
each iteration normally requires the solution of (6) for different values of $\lambda$. The most recent
approximation $x_k$ and the associated Jacobian $J(x_k)$ are then fixed. Moré [48] introduces
a scaling matrix $D$ and thus replaces $\lambda I$ with $\lambda D$.
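The equivalence between the regularized normal equations and a stacked least squares problem can be checked numerically. The sketch below assumes NumPy and assumes the stacked form with $J$ on top of $\lambda^{1/2} I$ (one natural scaling; the exact notation of the original equation is not recoverable here). The function names and the random test data are hypothetical.

```python
import numpy as np

def lm_step_augmented(J, f, lam):
    """Solve min_p || [J; sqrt(lam) I] p + [f; 0] ||_2 without forming J^T J."""
    n = J.shape[1]
    J_aug = np.vstack([J, np.sqrt(lam) * np.eye(n)])
    f_aug = np.concatenate([f, np.zeros(n)])
    p, *_ = np.linalg.lstsq(J_aug, -f_aug, rcond=None)
    return p

def lm_step_normal(J, f, lam):
    """Same step from the normal equations (J^T J + lam I) p = -J^T f."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + lam * np.eye(n), -J.T @ f)

rng = np.random.default_rng(0)
J = rng.standard_normal((6, 3))   # fixed Jacobian at the current iterate
f = rng.standard_normal(6)

# Within one outer iteration only lam changes; J and f stay fixed.
steps = {lam: lm_step_augmented(J, f, lam) for lam in (1e-2, 1.0, 1e2)}
```

Larger values of `lam` give shorter steps, which is the trust region interpretation mentioned in the text.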
$$\min_x \|Ax - b\|_2 \quad \text{subject to } x \ge 0. \qquad (7)$$
candidate set. Block principal methods instead exchange more than one variable at a time.
A given set of active variables can be eliminated and the other variables computed. The
computational effort is in both cases restricted to the solution of unconstrained linear least
squares problems in the non-active variables. Thus, the coefficient matrices are defined by
a subset of columns from the original matrix A in (7).
Let us now focus on the interior point solution and how it leads to unconstrained least
squares problems of the considered regularization form. Technical details are skipped and
instead we refer to Portugal et al. [52]. The basic idea is to apply Newton’s method to
the complementarity problem (8). It can directly be formulated as a system of nonlinear
equations,
$$\begin{aligned} XYe &= 0, \\ A^T A X e - Y e - A^T b &= 0. \end{aligned} \qquad (9)$$
The component-wise complementarity conditions $x_i y_i = 0$, $i = 1, \ldots, n$, are expressed in
terms of the diagonal matrices $X = \mathrm{diag}(x)$ and $Y = \mathrm{diag}(y)$. In (9), $e$ is the vector
with ones in all elements. The Jacobian of (9) is easily computed and the search directions
in Newton's method are defined by a square $2n \times 2n$ sparse linear system,
A centralization parameter $\mu_k$ (Lustig et al. [41]) is added to the first block of the right-hand
side. With given search directions $(u_k^T, v_k^T)^T$, new iterates are computed by a damped
update of $x_k$ and $y_k$. By block elimination of (10) we obtain
As in the discussion of nonlinear least squares problems, we identify the above equation as
the normal equations for the unconstrained least squares problem
Recall that $X_k$ and $Y_k$ are diagonal matrices. The coefficient matrix in the above
problem therefore consists of a fixed sparse matrix $A$ and, in each iteration, different diagonal
blocks. The $v_k$ component of the search direction is, for a given $u_k$, defined by
$$v_k = A^T (A u_k - r_k) + Y_k e, \qquad r_k = b - A X_k e.$$
$$\min_x c^T x \quad \text{subject to } Ax = b, \; x \ge 0, \qquad (12)$$
The non-negativity constraints are then implicitly handled by the objective function. The
original problem (12) is solved by the repeated solution of (13) for decreasing values of
$\mu$. Let $x^{(i)} = x(\mu_i)$ be the solution for $\mu = \mu_i$. Then it can be shown that $x^{(i)} \to x^*$ for
$\mu_i \to 0^+$, where $x^*$ is an exact solution of (12). If the starting vector $x^{(0)}$ is feasible, i.e.
$A x^{(0)} = b$ and $x^{(0)} > 0$, then the subsequent vectors will also be feasible. To derive an
iterative method for solving the subproblem (13), we first briefly consider the solution of
more general convex constrained minimization problems,
Feasibility and the first-order optimality conditions require that a solution $x^*$, for some
vector $y^*$, satisfies
$$A x^* = b, \qquad \nabla f(x^*) = A^T y^*.$$
Written as a single system of nonlinear equations, a solution x* and the Lagrange multiplier
y* must satisfy
$$\begin{pmatrix} \nabla f(x) - A^T y \\ Ax - b \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$
In the Newton solution of this nonlinear system of equations, search directions $(p_k, q_k)$,
Returning to the linear program (12), we now specialize (16) for the gradient and Hessian
of $B(x, \mu)$,
$$\nabla B(x, \mu) = c - \mu X^{-1} e$$
and
$$\nabla^2 B(x, \mu) = \mu X^{-2}.$$
Here, $X = \mathrm{diag}(x)$ and $e = (1, \ldots, 1)^T$. Given a solution $x_0 = x^{(i)}$, the next iterate is
computed by the inner iteration
Thus, major iterates $x^{(i)}$ are computed as $x^{(i+1)} = \lim_{k \to \infty} x_k$. The damping factor $\alpha_k$
is introduced to guarantee feasibility. Pure Newton steps may violate the non-negativity
constraints. The Lagrange multiplier is simultaneously computed by the iteration
If the last iterate $x_k$ is feasible and satisfies $A x_k = b$, then (18) becomes
The second equation prescribes $A p_k = 0$, so the subsequent iterate $x_{k+1}$
also remains feasible with respect to the linear constraints $Ax = b$. By observing that a general
augmented system,
is equivalent to the weighted least squares problem $\min_y \|D^{-1/2}(Ay - b)\|_2$, the square
linear system (19) can equivalently be formulated as a weighted unconstrained linear least
squares problem:
For a given vector $q_k$, the search directions $p_k$ are explicitly given by
6.4. Discussion
The solution of both nonlinear and constrained linear least squares problems gives rise to
subproblems of regularization type. The Levenberg-Marquardt algorithm for nonlinear
problems leads to the solution of (6), while the interior point solution of linear constrained
problems leads to (11). In both cases we have a fixed general sparse matrix A augmented
by a varying diagonal block. In the dense case, such problems are normally solved by
bidiagonalization. Orthogonal transformations are then applied from both the right and the
left to transform A into bidiagonal form (Eldén [16]). If the matrix A is instead large and
sparse, then Lanczos bidiagonalization can be used; see Paige and Saunders [50].
Let the QR factorization of the coefficient matrix in (1) be given by
and since the second term to the right only affects the diagonal, it is clear that the nonzero
structure of $R$ only depends on the structure of $A$. The analysis phase, which is based solely
on the predicted nonzero structure of $R$, therefore need not be repeated for new values
of $\lambda$. In particular the minimum fill-in ordering and the elimination tree, both computed in
the analysis phase, are constant and do not need to be recomputed.
Moreover, it is also clear that $A$ in (21) can be replaced by the triangular factor $R = R(0)$
of $A$. Thus, the factor $R(\lambda)$ can equivalently be computed by the QR factorization
[Nonzero structure diagram of the stacked matrices; recoverable label: $M_2$.]
Let us now consider the multifrontal factorization of the matrix in (22). The reduction
in row dimension compared with (21) also reduces the row dimension of the resulting
frontal matrices. Multifrontal factorization of (22) should therefore be faster. However,
different effects of vectorization and the structure of the frontal matrices also influence
the performance. It turns out that the number of nonzero entries (nnz) in $R$ is important.
Numerical experiments lead us to the conclusion that (22) should be used only if
nnz(R) < nnz(A). Some examples are given in Table 2. Execution times are based on
the MATLAB implementation SQR on a Sun SparcStation.
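The claim that $A$ may be replaced by its triangular factor $R(0)$ when computing $R(\lambda)$ can be verified on a small dense example. The sketch below assumes NumPy and that the regularized matrix is $A$ stacked on $\lambda I$; it says nothing about sparsity, fill-in or timing, which are the practical concerns discussed above. The helper `r_factor` and the random test matrix are hypothetical.

```python
import numpy as np

def r_factor(M):
    """Upper triangular QR factor, normalized to a nonnegative diagonal."""
    R = np.linalg.qr(M, mode="r")
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return signs[:, None] * R

rng = np.random.default_rng(1)
m, n, lam = 8, 4, 0.5
A = rng.standard_normal((m, n))

R0 = r_factor(A)                                       # R(0), computed once
R_full = r_factor(np.vstack([A, lam * np.eye(n)]))     # (m + n) x n factorization
R_small = r_factor(np.vstack([R0, lam * np.eye(n)]))   # 2n x n factorization
```

Both routes give the same $R(\lambda)$ because the two stacked matrices have the same cross-product matrix $A^T A + \lambda^2 I$, and the triangular factor with positive diagonal is unique for full column rank.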
The method of normal equations, applied to the above problem, gives the square system
$$(A^T A + \lambda^2 I_n) x = A^T b.$$
The fixed structure of $R$ makes it possible also for the Cholesky factorization to reuse
information from the analysis phase. However, since the diagonal is modified for different
values of $\lambda$, it is necessary to recompute the Cholesky factorization. Previously computed
factors $R(\lambda)$ give no computational advantage. However, efficient methods for sparse
Cholesky factorization make this approach in general faster by a factor of 2-3 than the
above methods based on QR factorization. A drawback of the Cholesky method is, on
the other hand, that the diagonal term in (23) may vanish due to the squared $\lambda$. In the
regularization of ill-conditioned least squares problems, this effect may require a larger
value of $\lambda$ in order to stabilize a solution.
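The loss of a small regularization term in the normal equations can be seen already for a single column. The following sketch uses only the Python standard library; the values `a = 1.0` and `lam = 1e-9` are hypothetical and chosen so that `lam**2` falls below double precision rounding relative to `a**2`.

```python
import math

lam = 1e-9   # hypothetical small regularization parameter
a = 1.0      # hypothetical single column of A, so that A^T A = a**2 = 1

# Normal equations: the diagonal entry is a**2 + lam**2, and lam**2 = 1e-18
# is rounded away when added to 1.0 in double precision.
normal_eq_diag = a * a + lam * lam
lam_lost = (normal_eq_diag == a * a)

# QR of the stacked column (a, lam)^T works with lam itself: the Givens
# rotation that annihilates lam has sine lam / hypot(a, lam), about 1e-9,
# so the regularization still influences the computed factorization.
r = math.hypot(a, lam)
sine = lam / r
```

In this toy case the Cholesky route sees an unregularized matrix, while the orthogonal route still carries `lam` at full relative accuracy.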
In the previous section, it was shown how the interior point solution of linear programs
leads to the weighted least squares problem (20). The coefficient matrix $X A^T$ is in practice
often ill-conditioned and almost rank deficient, so the method of normal equations
would be expected to be less useful. However, (20) is in practice solved by the normal equations. "The
small number of observed numerical difficulties with the normal-equation approach has
therefore been a continuing surprise. A careful error analysis is likely to explain this
phenomenon, but it remains slightly mysterious at this time" (Wright [56]).
For the solution vector $x^*$, the second equation of (15) can be considered as a consistent
overdetermined system in the unknown vector $y$. In terms of the original linear program
(14), one obtains
$$A^T y = c - \mu X^{-1} e.$$
The consistency makes it possible to introduce any non-singular scaling matrix $X_k \in
\mathbb{R}^{n \times n}$ without affecting the solution,
This system is considered and solved as the weighted least squares problem,
$\min_q \; \| X_k ( A^T q - \cdots$
References
2. A. BJÖRCK, Stability analysis of the method of semi-normal equations for least squares problems, Linear
Algebra Appl., 88/89 (1987), pp. 31-48.
3. A. BJÖRCK, A direct method for sparse least squares problems with lower and upper bounds, Numer.
Math., 54 (1988), pp. 19-32.
4. A. BJÖRCK, Pivoting and stability in the augmented system method, Technical Report LiTH-MAT-R-1991-30,
Department of Mathematics, Linköping University, June 1991.
5. A. BJÖRCK, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.
6. J. R. BUNCH AND L. KAUFMAN, Some stable methods for calculating inertia and solving symmetric
linear systems, Mathematics of Computation, 31 (1977), pp. 162-179.
7. E. C. H. CHU, J. A. GEORGE, J. LIU, AND E. NG, SPARSPAK: Waterloo sparse matrix package user’s
guide for SPARSPAK-A, Research Report CS-84-36, Dept. of Computer Science, University of Waterloo,
1984.
8. T. F. COLEMAN, A. EDENBRANDT, AND J. R. GILBERT, Predicting fill for sparse orthogonal
factorization, J. ACM, 33 (1986), pp. 517-532.
9. J. DENNIS AND R. SCHNABEL, Numerical Methods for Unconstrained Optimization and Nonlinear
Equations, Prentice Hall, Englewood Cliffs, N.J., 1983.
10. I. S. DUFF, Pivot selection and row orderings in Givens reduction on sparse matrices, Computing, 13
(1974), pp. 239-248.
11. I. S. DUFF, N. I. M. GOULD, J. K. REID, J. A. SCOTT, AND K. TURNER, The factorization of
sparse symmetric indefinite matrices, IMA J. Numer. Anal., 11 (1991), pp. 181-204.
12. I. S. DUFF, R. G. GRIMES, AND J. G. LEWIS, Sparse matrix test problems, ACM Trans. Math. Softw.,
15 (1989), pp. 1-14.
13. I. S. DUFF AND J. K. REID, A comparison of some methods for the solution of sparse overdetermined
systems of linear equations, J. Inst. Maths. Applics., 17 (1976), pp. 267-280.
14. I. S. DUFF AND J. K. REID, The multifrontal solution of indefinite sparse symmetric linear systems,
ACM Trans. Math. Software, 9 (1983), pp. 302-325.
15. S. C. EISENSTAT, M. H. SCHULTZ, AND A. H. SHERMAN, Algorithms and data structures for sparse
symmetric Gaussian elimination, SIAM J. Sci. Statist. Comput., 2 (1981), pp. 225-237.
16. L. ELDÉN, Algorithms for the regularization of ill-conditioned least squares problems, BIT, 17 (1977),
pp. 134-145.
17. L. V. FOSTER, Modifications of the normal equations method that are numerically stable, in Numerical
Linear Algebra, Digital Signal Processing and Parallel Algorithms, G. H. Golub and P. V. Dooren, eds.,
NATO ASI Series, Berlin, 1991, Springer-Verlag, pp. 501-512.
18. C. F. GAUSS, Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections,
Dover, New York (1963), 1809. C. H. Davis, Trans.
19. W. M. GENTLEMAN, Basic procedures for large, sparse, or weighted linear least squares problems,
Research report CSRR 2068, University of Waterloo, Waterloo, Ontario, Canada, July 1972.
20. W. M. GENTLEMAN, Least squares computations by Givens transformations without square roots, J. Inst.
Maths. Applics., 12 (1973), pp. 329-336.
21. W. M. GENTLEMAN, Error analysis of QR decompositions by Givens transformations, Linear Algebra
Appl., 10 (1975), pp. 189-197.
22. W. M. GENTLEMAN, Row elimination for solving sparse linear systems and least squares problems, in
Proceedings of the 6th Dundee Conference on Numerical Analysis, G. A. Watson, ed., Springer Verlag, 1976,
pp. 122-133.
23. J. A. GEORGE AND M. T. HEATH, Solution of sparse linear least squares problems using Givens
rotations, Linear Algebra Appl., 34 (1980), pp. 69-83.
24. J. A. GEORGE AND J. W.-H. LIU, Householder reflections versus Givens rotations in sparse orthogonal
decomposition, Linear Algebra Appl., 88/89 (1987), pp. 223-238.
25. J. A. GEORGE AND E. G. NG, On row and column orderings for sparse least squares problems, SIAM
J. Numer. Anal., 20 (1981), pp. 326-344.
26. J. A. GEORGE AND E. G. NG, SPARSPAK: Waterloo sparse matrix package user’s guide for SPARSPAK-B,
Research Report CS-84-37, Dept. of Computer Science, University of Waterloo, 1984.
27. J. R. GILBERT, C. MOLER, AND R. SCHREIBER, Sparse matrices in MATLAB: Design and implementation,
SIAM J. Matrix Anal. Appl., 13 (1992), pp. 333-356.
28. P. E. GILL AND W. MURRAY, Nonlinear least squares and nonlinearly constrained optimization, in
Proceedings Dundee Conference on Numerical Analysis 1975, Lecture Notes in Mathematics No. 506,
Springer Verlag, 1976.
29. P. E. GILL AND W. MURRAY, The orthogonal factorization of a large sparse matrix, in Sparse Matrix
Computations, J. R. Bunch and D. J. Rose, eds., New York, 1976, Academic Press, pp. 201-212.
30. P. E. GILL, W. MURRAY, AND M. H. WRIGHT, Practical Optimization, Academic Press, London
and New York, 1981.
31. G. H. GOLUB, Numerical methods for solving least squares problems, Numer. Math., 7 (1965), pp. 206-216.
32. C. C. GONZAGA, Path-following methods for linear programming, SIAM Review, 34 (1992), pp. 167-224.
33. M. T. HEATH, Numerical methods for large sparse linear least squares problems, SIAM J. Sci. Statist.
Comput., 5 (1984), pp. 497-513.
34. G. B. KOLATA, Geodesy: Dealing with an enormous computer task, Science, 200 (1978), pp. 421-422.
35. C. L. LAWSON AND R. J. HANSON, Solving Least Squares Problems, Prentice Hall, Englewood Cliffs,
New Jersey, 1974.
36. A. M. LEGENDRE, Nouvelles méthodes pour la détermination des orbites des comètes, Courcier, Paris,
1805.
37. K. LEVENBERG, A method for the solution of certain non-linear problems in least squares, Quart. Appl.
Math., 2 (1944), pp. 164-168.
38. J. W.-H. LIU, On general row merging schemes for sparse Givens transformations, SIAM J. Sci. Statist.
Comput., 7 (1986), pp. 1190-1211.
39. J. W.-H. LIU, The role of elimination trees in sparse factorization, SIAM J. Matrix Anal. Appl., 11 (1990),
pp. 134-172.
40. S. LU AND J. L. BARLOW, Multifrontal computation with the orthogonal factors of sparse matrices,
SIAM J. Matr. Anal., 3 (1996), pp. 658-679.
41. I. LUSTIG, R. MARSTEN, AND D. SHANNO, Computational experience with a primal-dual interior
point method for linear programming, Linear Algebra Appl., (1991), pp. 191-222.
42. D. W. MARQUARDT, An algorithm for least-squares estimation of nonlinear parameters, Journal of the
Society for Industrial and Applied Mathematics, 11 (1963), pp. 431-441.
43. P. MATSTOMS, The multifrontal solution of sparse linear least squares problems, Licentiat thesis,
Linköping University, 1991.
44. P. MATSTOMS, QR27-Specification sheet, Tech. Report March 1992, Department of Mathematics, 1992.
45. P. MATSTOMS, Sparse QR factorization in MATLAB, ACM Trans. Math. Software, 20 (1994), pp. 136-159.
46. P. MATSTOMS, Sparse QR Factorization with Applications to Linear Least Squares Problems, PhD thesis,
Linköping University, 1994.
47. P. MATSTOMS, Parallel sparse QR factorization on shared memory architectures, Parallel Computing, 21
(1995), pp. 473-486.
48. J. J. MORÉ, The Levenberg-Marquardt algorithm: Implementation and theory, in G. A. Watson, Lecture
Notes in Math. 630, Berlin, 1978, Springer Verlag, pp. 105-116.
49. U. OREBORN, A Direct Method for Sparse Nonnegative Least Squares Problems, Licentiat thesis, Linköping
University, 1986.
50. C. C. PAIGE AND M. A. SAUNDERS, LSQR: An algorithm for sparse linear equations and sparse least
squares, ACM Trans. Math. Software, 8 (1982), pp. 43-71.
51. D. J. PIERCE AND J. G. LEWIS, Sparse multifrontal rank revealing QR factorization, Technical Report
MEA-TR-193-Revised, Boeing Information and Support Services, 1995.
52. L. F. PORTUGAL, J. J. JÚDICE, AND L. N. VICENTE, Solution of large scale linear least-squares
problems with nonnegative variables, technical report, Departamento de Ciências da Terra, Universidade de
Coimbra, 3000 Coimbra, Portugal, 1993.
53. C. PUGLISI, QR factorization of large sparse overdetermined and square matrices with the multifrontal
method in a multiprocessor environment, PhD thesis, CERFACS, 42 av. G. Coriolis, 31057 Toulouse Cedex,
France.
54. J. K. REID, A note on the least squares solution of a band system of linear equations by Householder
reductions, Comput. J., 10 (1967), pp. 188-189.
55. C. SUN, Parallel sparse orthogonal factorization on distributed-memory multiprocessors, SIAM J. Sci.
Comput., 17 (1996), to appear.
56. M. H. WRIGHT, Interior methods for constrained optimization, Acta Numerica 1992, Cambridge University
Press, 1992, pp. 341-407.
57. Z. ZLATEV AND H. NIELSEN, LLSS01 - a Fortran subroutine for solving least squares problems (User’s
guide), Technical Report 79-07, Institute of Numerical Analysis, Technical University of Denmark, Lyngby,
Denmark, 1979.
58. Z. ZLATEV AND H. NIELSEN, Solving large and sparse linear least-squares problems by conjugate
gradient algorithms, Comput. Math. Applic., 15 (1988), pp. 185-202.
Computational Optimization and Applications, 7, 111-126 (1997)
© 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Abstract. The facility layout problem (FLP) has many practical applications and is known to be NP-hard.
During recent decades exact and heuristic approaches have been proposed in the literature to solve FLPs. In this
paper we review the most recent developments regarding simulated annealing and genetic algorithms for solving
facility layout problems approximately.
1. Introduction
The facility layout problem deals with the physical arrangement of a given number of
departments or machines within a given configuration. In the context of manufacturing the
objective is to minimize the total material handling cost of moving the required material
between the departments. The importance of material handling is stated by Tompkins and
White [67] who claimed that 20-50 % of the total operating expenses within manufacturing
are attributed to it.
The facility layout problem is one of the best-studied problems in the field of combinatorial
optimization. A number of formulations have been developed for the problem. More
specifically, the FLP has been modeled as [36]:
The quadratic assignment formulation has been traditionally used to model the facility
layout problem. QAP was first introduced by Koopmans and Beckmann [33] in 1957 as
a mathematical model for locating a set of indivisible economic activities. Consider the
problem of allocating a set of facilities to a set of locations, with the objective of minimizing
the cost associated not only with the distance between locations but also with the flow.
More specifically, given two $n \times n$ matrices $F = (f_{ij})$ and $D = (d_{kl})$, where $f_{ij}$ is the flow
112 MAVRIDOU AND PARDALOS
between facility $i$ and facility $j$, and $d_{kl}$ is the distance between location $k$ and
location $l$, and a set of integers $N = \{1, 2, \ldots, n\}$, the QAP can be written as follows:
$$\min_{p \in \Pi_N} \sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij} \, d_{p(i)p(j)},$$
where $\Pi_N$ is the set of all permutations of $N$, and $n$ is the number of facilities and locations
[56].
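To make the QAP objective concrete, the following sketch evaluates the cost of a permutation and finds the best one by full enumeration. It assumes the usual Koopmans-Beckmann form, in which facility $i$ is placed at location $p(i)$ and the cost is the sum of $f_{ij} d_{p(i)p(j)}$ over all pairs. Indices are 0-based, and the 3-facility instance is hypothetical.

```python
from itertools import permutations

def qap_cost(F, D, p):
    """Cost of assigning facility i to location p[i] (0-based indices)."""
    n = len(p)
    return sum(F[i][j] * D[p[i]][p[j]] for i in range(n) for j in range(n))

def qap_brute_force(F, D):
    """Enumerate all n! permutations; only sensible for very small n."""
    n = len(F)
    return min(permutations(range(n)), key=lambda p: qap_cost(F, D, p))

# Hypothetical instance: three locations on a line, symmetric flows.
F = [[0, 5, 2],
     [5, 0, 3],
     [2, 3, 0]]
D = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]

best = qap_brute_force(F, D)
best_cost = qap_cost(F, D, best)
```

Enumeration grows as $n!$, which is why the exact methods mentioned next are limited to moderately sized instances and heuristics are used beyond that.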
Exact algorithms for solving the FLP include branch and bound ([39], [30]) and cutting
plane algorithms ([7], [9]). These approaches require rather high computational time as the
problem size increases, resulting in practice in the solution of only moderately sized problem
instances. Therefore, a number of heuristic algorithms, such as construction, improvement
and hybrid algorithms, have been developed for sub-optimally solving large-size instances
in a reasonable amount of CPU time and computer memory. Recent survey papers on the
facility layout problem and its solution approaches can be found in [22], [36], and in [43].
In this paper we focus on the work that has been done to date for solving the facility lay-
out problem using simulated annealing (SA) and genetic algorithms (GA). Both heuristic
approaches are stochastic search techniques modeled on processes found in nature (thermo-
dynamic process and natural evolution). These heuristic methods have been used to solve
a wide variety of combinatorial optimization problems ([12], [19], [37], [47]).
Simulated annealing was first proposed by Kirkpatrick et al. [31] as a method for solving
combinatorial optimization problems. The name of the algorithm derives from an analogy
between the simulation of the annealing of solids first proposed by Metropolis et al. [44], and
the strategy of solving combinatorial optimization problems. Annealing refers to a process
of cooling material slowly until it reaches a stable state. Starting from an initial state, the
system is perturbed at random to a new state in the neighborhood of the original one, for
which a change of $\Delta E$ in the objective function value (OFV) takes place. In a minimization
process, if the change $\Delta E$ is negative then the transformation to the new state is accepted.
If $\Delta E \ge 0$ the transformation is accepted with probability $p(\Delta E) = e^{-\Delta E / (k_b T)}$,
where $T$ is a control parameter corresponding to the temperature in the analogy and $k_b$
is Boltzmann's constant. The change $\Delta E$ in the OFV corresponds to the change in the
energy level (in the analogy) that occurs as the temperature $T$ decreases. SA thus gives us
a mechanism for accepting small increases in the objective function value, while controlling
the probability of acceptance $p(\Delta E)$ through the temperature. Kirkpatrick et al.
[31] argue that by allowing "hill climbing" moves, one can avoid configurations that lead to
locally optimal solutions, so that eventually higher quality solutions can be obtained. So the
main advantage of the simulated annealing method is its ability to escape from local optima.
The main features of the SA method are [14]:
• the temperature $T$, which is the parameter that controls the probability $p(\Delta E)$ of accepting
a cost-increasing interchange. During the course of the algorithm $T$ is decreased
SA AND GA FOR THE FACILITY LAYOUT PROBLEM: A SURVEY 113
• the equilibrium, i.e. the condition in which a further improvement in the solution using
additional interchanges is highly unlikely to occur,
• the annealing schedule, which determines when and by how much the temperature is to
be reduced.
Output: A (sub-optimal) solution
2. while (T > 0) do
   (a) while (thermal equilibrium not reached) do
       (i) Generate a neighbor state at random and evaluate the change in
           energy level $\Delta E$.
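The loop structure outlined above can be sketched as a generic routine. This is an illustrative version under stated assumptions, not the algorithm of any particular paper: a fixed number of moves per temperature stands in for the thermal equilibrium test, the loop stops at a small positive temperature `t_min` rather than at zero, geometric cooling is used, and Boltzmann's constant is absorbed into the temperature. The toy line-ordering problem and all names are hypothetical.

```python
import math
import random

def simulated_annealing(cost, neighbor, state, t0=10.0, t_min=1e-3,
                        cooling=0.9, moves_per_temp=50, seed=0):
    """Generic SA loop; a fixed move count stands in for thermal equilibrium."""
    rng = random.Random(seed)
    current, current_cost = state, cost(state)
    best, best_cost = current, current_cost
    t = t0
    while t > t_min:
        for _ in range(moves_per_temp):
            candidate = neighbor(current, rng)
            delta = cost(candidate) - current_cost
            # Accept improvements always, deteriorations with
            # probability exp(-delta / t).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= cooling   # annealing schedule: geometric cooling (assumed)
    return best, best_cost

# Hypothetical toy problem: machines i and i + 1 interact, so a good
# ordering places consecutive labels next to each other on the line.
def line_cost(order):
    pos = {m: k for k, m in enumerate(order)}
    return sum(abs(pos[i] - pos[i + 1]) for i in range(len(order) - 1))

def swap_two(order, rng):
    i, j = rng.sample(range(len(order)), 2)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return tuple(new)

start = (3, 0, 5, 1, 4, 2)
best, best_cost = simulated_annealing(line_cost, swap_two, start)
```

The routine keeps track of the best state seen, so the returned cost is never worse than that of the starting state, even though the current state may temporarily get worse.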
Several implementations of the simulated annealing algorithm have been proposed for the
facility layout problem. We will present the main concepts of the most recent approaches
and comment on the computational results.
Heragu and Alfa [21] present an extensive experimental analysis of two simulated
annealing based algorithms, implementing them on two patterns of layout, the single-row
and multi-row facility layouts. The first algorithm uses the standard techniques of the SA
heuristic. In the main step the algorithm examines the random exchange of the positions
of two facilities. The new solution is accepted if the exchange results in a lower OFV.
Otherwise, the difference $\Delta E$ between the OFV of the best solution obtained so far and
that of the current solution is computed, and the solution is accepted with probability
$e^{-\Delta E / T}$. This step is repeated $100n$ times or until the number of new solutions accepted
is equal to $10n$, where $n$ is the number of facilities in the layout problem. Next, the algorithm
decreases the value of the temperature $T$ by multiplying it by the cooling ratio and repeats the
main step. The stopping criterion is a fixed maximum number of temperature change steps. The initial
temperature $T$ is set to a number sufficiently larger than the largest $\Delta E$ encountered for
problems tested with other heuristics. Guidelines for setting the parameters can be found
in [57].
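The schedule described in this paragraph (at most $100n$ trial exchanges or $10n$ acceptances per temperature, multiplicative cooling, a fixed number of temperature steps) can be sketched as follows. This follows the description as reported here rather than the original code of [21]. The function name, the default cooling ratio of 0.9, the number of temperature steps, the initial temperature and the single-row test instance are all hypothetical.

```python
import math
import random

def sa_pairwise_exchange(cost, layout, t0, ratio=0.9, max_temp_steps=25, seed=0):
    """SA with pairwise exchanges and the 100n / 10n inner loop limits."""
    rng = random.Random(seed)
    n = len(layout)
    current = list(layout)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    t = t0
    for _ in range(max_temp_steps):           # fixed number of temperature steps
        accepted = 0
        for _ in range(100 * n):              # at most 100n trial exchanges
            i, j = rng.sample(range(n), 2)
            current[i], current[j] = current[j], current[i]
            new_cost = cost(current)
            if new_cost < current_cost:
                accept = True
            else:
                # As described in the text: compare with the best OFV so far.
                delta = new_cost - best_cost
                accept = rng.random() < math.exp(-delta / t)
            if accept:
                current_cost = new_cost
                accepted += 1
                if current_cost < best_cost:
                    best, best_cost = list(current), current_cost
                if accepted >= 10 * n:        # or until 10n solutions accepted
                    break
            else:
                current[i], current[j] = current[j], current[i]  # undo exchange
        t *= ratio                            # multiply T by the cooling ratio
    return best, best_cost

# Hypothetical single-row instance: unit spacing, symmetric flows.
FLOW = [[0, 4, 0, 1, 0],
        [4, 0, 3, 0, 0],
        [0, 3, 0, 5, 1],
        [1, 0, 5, 0, 2],
        [0, 0, 1, 2, 0]]

def row_cost(layout):
    pos = {fac: k for k, fac in enumerate(layout)}
    n = len(layout)
    return sum(FLOW[i][j] * abs(pos[i] - pos[j])
               for i in range(n) for j in range(i + 1, n))

start = [2, 0, 4, 1, 3]
best, best_cost = sa_pairwise_exchange(row_cost, start, t0=50.0)
```

Exchanging in place and undoing rejected moves avoids copying the layout for every trial, which matters when the inner loop runs $100n$ times per temperature.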
The second algorithm presented in the same paper is a hybrid SA algorithm (HSA), which
uses a “core” algorithm to generate a “good” initial solution, and then improves it using the
SA algorithm described before. The core algorithm is a modified penalty algorithm (MP)
presented in [22]. Eight test problems of size up to 30 (available in the literature) are used
for the single-row case. Each test problem is solved 10 times using the same initial solution.
For six of the problems the HSA algorithm produces optimal or best-known solutions. For
the remaining two problems, the solutions are better than those previously reported in the
literature. A comparison between the HSA and the SA algorithms is presented, as well as
with three other heuristic algorithms (a 2-way exchange, a 3-way exchange and the
Wilhelm-Ward version of simulated annealing [70]) using 15 equal-area multi-row FLPs. In terms
of solution quality, the HSA performed better than all the other algorithms, though it required
more computational time than the SA algorithm. Also, as the number of annealing runs
increases, SA seems to produce solutions of similar quality to HSA with less computational
effort.
Another implementation of the SA algorithm applied to the cellular layout problem can
be found in [27]. This problem involves the determination of the relative positions of n
equidimensional manufacturing entities which may represent either the set of machines
belonging to a cell (intra-cell problem) or the manufacturing cells within a shop (inter-cell
problem). The objective of both layout problems is to minimize the total material flow
(cost) between the manufacturing entities. The method presented in the paper is called
CLASS, which stands for Computerized LAyout Solutions using Simulated annealing. The
proposed algorithm is a regular simulated annealing algorithm with the following most
important elements:
• Solution space: The solution space consists of an $n \times n$ grid, i.e. $n^2$ positions are
available to be occupied by the $n$ entities. The distance between all pairs of positions
is determined using geometric or Manhattan distances.
• Interchanges: Given a solution, an interchange can be either a move of an entity from
its current position to an unoccupied position or an “exchange” of the positions of two
entities. The two positions from the solution space that are exchanged are selected
randomly.
Slicing tree construction: Using numerical clustering techniques a slicing tree is con-
structed in such a way that cells with large inter-cell traffic volume are placed in close
proximity with each other.
The attractive element of the algorithm is that it exploits the hierarchical representation
of the layout, so that the probability of selecting a neighborhood state is not uniformly
distributed (as in a regular SA algorithm) but depends on the temperature T. More
particularly, when T is high at the first steps of the annealing procedure, a cut near the root
of the slicing tree will be selected, causing large swings in the cost function value since a
large number of cells will have to be relocated. As T decreases during the course of the
algorithm, cuts located at lower levels in the tree are selected to generate a neighborhood
state. In this way a guided search in the set of neighboring solutions is adopted. The
algorithm was compared to two other local search methods, denoted as HC (a straightforward
hillclimbing method) and BC (a modified version of HC). Two test problems of size n = 20
and n = 30 were constructed for the comparison. Each method was run 10 times with different
initial solutions. The computation time was kept the same among the three methods. In terms
of solution quality the proposed SA algorithm outperformed the other two methods, both in
average and minimum cost.
In [34] Kouvelis and Chiang address the single row layout problem (SRLP) in flexible
manufacturing systems (FMSs). The problem deals with the optimal arrangement of n
machines along a straight track with a material handling device moving jobs from one
machine to another. The difficulty of the problem is due to the variety of parts to be
processed, which follow different operation sequences. When the sequence of operations of a
job is not the same as the sequence of the locations of the machines, the job sometimes has
to travel in reverse (backtrack) in order to receive the required operations. The objective of
the SRLP is to find the ordering of the machines that minimizes the total backtracking
distance of the material handling device. If we consider n machines and n candidate locations
for the machines to be placed, the solution to the SRLP is one of the possible permutations of
the set S = {1, 2, ..., n}; these permutations are the workstation assignment vectors, each
one representing a configuration of the machines in a single row. The neighborhood of a
configuration is the set N of configurations resulting from the interchange of the locations
of two machines. The initial configuration is obtained by randomly assigning machines to
locations. The parameters of the SA algorithm are the initial acceptance probability (from
which the initial temperature is calculated), the number of interchanges attempted before
the temperature is reduced, the value of the cooling ratio, and the number of steps to reach
equilibrium. To set them, a sensitivity analysis was performed with respect to each
individual parameter: for each parameter a range of values is tested while all other
parameters are held fixed. The best values of the parameters are kept as the final ones to be
used in the algorithm. The experimental analysis showed that fine-tuning the SA parameters
for each specific application, and the selection of the initial solution, are very important
for the performance of the algorithm in terms of solution quality.
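The ingredients described above (permutation encoding, pairwise interchange neighborhood, random initial configuration, cooling ratio) can be put together in a short sketch. It is an illustration and not the procedure of [34]: unit spacing between adjacent locations, unweighted jobs, a geometric cooling schedule, and an initial temperature t0 supplied directly (rather than derived from an initial acceptance probability) are all assumptions, as are the function and parameter names.

```python
import math
import random

def backtrack_distance(order, jobs):
    """order[k] is the machine placed at location k (unit spacing assumed).
    jobs is a list of operation sequences, each a list of machine ids.
    Returns the total distance travelled in the reverse direction."""
    loc = {m: k for k, m in enumerate(order)}
    total = 0
    for seq in jobs:
        for a, b in zip(seq, seq[1:]):
            total += max(0, loc[a] - loc[b])  # only backward moves count
    return total

def anneal(jobs, n, t0=10.0, ratio=0.9, moves_per_temp=50, n_temps=60, seed=0):
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)                        # random initial configuration
    cost = backtrack_distance(order, jobs)
    best, best_cost = order[:], cost
    t = t0
    for _ in range(n_temps):
        for _ in range(moves_per_temp):
            i, j = rng.sample(range(n), 2)    # interchange two machines
            order[i], order[j] = order[j], order[i]
            new_cost = backtrack_distance(order, jobs)
            delta = new_cost - cost
            # accept improvements always, deteriorations with prob. exp(-delta/t)
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = order[:], cost
            else:
                order[i], order[j] = order[j], order[i]  # undo the interchange
        t *= ratio                            # reduce the temperature
    return best, best_cost
```

The four keyword arguments t0, ratio, moves_per_temp and n_temps play the role of the parameters that the sensitivity analysis varies one at a time.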
The same authors and J. Fitzsimmons describe in [35] two distinct implementations of
the simulated annealing algorithm for machine layout problems in the presence of zoning
constraints. These constraints are restrictions on the arrangement of machines. Positive
zoning constraints require that certain machines be placed near each other, while
negative zoning constraints do not allow certain machines to be in close proximity. The
problem is formulated as a restricted quadratic assignment problem. Assuming that the
number of candidate locations is equal to the number of machines, the objective is to assign
the machines to the locations in a way that the cost function is minimized with respect to
the zoning constraints. The first of the SA algorithms, called the Compulsion Method, takes
the zoning constraints into consideration mostly during the search for a new layout in the
neighborhood of the original one. The second algorithm, the Penalty Method, takes into
account the presence of the zoning constraints in the objective function through the use
of appropriate penalty terms. For each layout that violates any of the zoning constraints,
corresponding penalty terms are added to the objective function value (OFV). The fine-tuning
of the parameters for both SA procedures and the definitions of a configuration, the
neighborhood of a configuration and the initial configuration are the same as described for
[34]. The two versions are compared on an extensive set of computational experiments using
test problems ranging in size from 5 to 30 machines. The results showed that the Compulsion
Method outperforms the Penalty Method in terms of CPU time and solution quality. The basic
advantage of the Penalty Method is that it can easily be changed to handle additional
zoning constraints.
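The idea behind the Penalty Method can be sketched as follows. The form of the penalty is not given here, so the sketch assumes a fixed charge per violated constraint and a hypothetical distance threshold `near` that decides whether two machines count as close; neither is taken from [35].

```python
def penalized_cost(assign, flow, dist, positive, negative, near=1, penalty=100.0):
    """assign[m] is the location of machine m.
    positive / negative are lists of machine pairs that must / must not be close.
    Returns the quadratic assignment cost plus a charge per violated constraint."""
    n = len(assign)
    base = sum(flow[i][j] * dist[assign[i]][assign[j]]
               for i in range(n) for j in range(n))
    violations = 0
    for a, b in positive:                     # pair should be close but is not
        if dist[assign[a]][assign[b]] > near:
            violations += 1
    for a, b in negative:                     # pair should be apart but is close
        if dist[assign[a]][assign[b]] <= near:
            violations += 1
    return base + penalty * violations
```

Adding a further zoning constraint only means appending a pair to one of the two lists, which reflects the flexibility noted above.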
Meller and Bozer in [42] describe a Simulated Annealing Based Layout Evaluation algorithm
(SABLE), which introduces a new generator routine for candidate layout solutions,
combined with the use of spacefilling curves. The algorithm is applied to a set of
single- and multiple-floor facility layout problems. For the single-floor case, test problems
of sizes 11 to 25 are used, and the performance of SABLE is compared to the performance
of the algorithms presented in [2], [49], [70], and [8]. An average and a worst-case analysis
show that the proposed algorithm performs the best in terms of solution quality.
Additionally, SABLE performed better than Tam’s SA algorithm [64] on a data set of
single-floor FLPs with 20 and 30 departments. Let us note that, regarding the department
shapes, Tam’s algorithm generally assumes rectangular shapes, while the proposed algorithm
tends to generate departments with non-rectangular shapes. For the multi-floor case, test
problems with up to 4 floors and 40 departments were used to evaluate the performance of
SABLE. The results indicate the robustness of the algorithm to changes in the vertical to
horizontal ratio.
Other recent Simulated Annealing algorithms for layout problems can be found in [62]
and [61].
For the special case of the QAP several SA approaches have been proposed. Burkard and
Rendl [10] were the first to apply simulated annealing to the QAP. They reported
rather favorable computational results, indicating that the obtained solutions deviate only
1-2% from the best known solutions. Wilhelm and Ward [70] also applied the SA
algorithm to quadratic assignment problems, experimenting further with the procedure.
They report on the sensitivity of SA to the control parameters, and evaluate the algorithm
using problems ranging in size from n = 5 to n = 100. In particular, computational
results were provided for the test problems in Nugent et al. [49] and for two test problems
they introduced in the paper. In [11], Connolly discusses the implementation of SA on
7 problems. The computational results indicate that examining sequentially generated
neighboring solutions, rather than randomly generated ones, makes the SA algorithm more
efficient. In [59] and [60] simulated annealing is used as a tool for interactive facility layout
decisions. More recently, Laursen [38] investigated the performance of the SA algorithm
by varying two parameters, (1) the number of simulations and (2) the simulation length,
while keeping the total computational time for a specific problem instance fixed. Laursen
concluded that the length of each simulation can be optimized and that a large range of its
values yields near-optimal solution quality.
Genetic Algorithms (GA) were first introduced by John Holland et al. [23] at the University
of Michigan in 1975. Genetic algorithms are search algorithms based on the mechanics of
natural selection and natural genetics. GA try to imitate the development of new and better
populations among different species during evolution. Unlike most heuristic search
algorithms, GA conduct the search through the information of a population consisting of a
subset of individuals, i.e. solutions. Each solution is associated with a fitness value, which
is the objective function value of the solution. Solutions to optimization problems can often
be coded as strings of finite length. The genetic algorithms work on these strings. The
encoding is done through structures named chromosomes, where each chromosome is
made up of units called genes.
There are some determining factors that strongly affect the efficiency of genetic algorithms:
• Local search. It has proven very efficient to search for locally optimal solutions in the
neighborhood of the children [40]. If one is able to find a better solution then it will
replace the original child as a member of the new population.
• Control of new individuals. It is not unlikely that a child will have worse fitness than
its parents. In that case the child might not be accepted in the new generation.
Let us note also that a GA implementation requires the specification of certain parameters
such as population size, and number of generations.
Let $P_t$ denote the population at time t. Then the genetic algorithm procedure can be
described as in Figure 2 [54].
Output: A (sub-optimal) solution
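As a rough illustration of the general procedure (not a reproduction of Figure 2), a generic GA loop might look as follows. Fitness-proportional parent selection, full generational replacement, and all function and parameter names are assumptions of this sketch; fitness is taken to be positive and larger for better solutions.

```python
import random

def genetic_algorithm(fitness, random_individual, crossover, mutate,
                      pop_size=30, generations=100, mutation_rate=0.1, seed=0):
    """Returns the best individual seen, i.e. a (sub-optimal) solution."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]   # P_0
    best = max(population, key=fitness)
    for _ in range(generations):
        weights = [fitness(s) for s in population]   # must be positive
        children = []
        while len(children) < pop_size:
            # parents are drawn with probability proportional to fitness
            mother, father = rng.choices(population, weights=weights, k=2)
            child = crossover(mother, father, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            children.append(child)
        population = children                         # P_{t+1}
        best = max(population + [best], key=fitness)
    return best
```

The two factors listed above would enter inside the inner loop: a local search could be applied to each child, and a child worse than its parents could be rejected before it is appended.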
We continue with the description of various implementations of the genetic algorithm for
the facility layout problem.
As we have seen in the section on SA for the facility layout problem, Tam [64] uses
a simulated annealing approach to solve the inter-cell problem. Using the same problem
formulation and the same representation of the floorplan layout as a slicing tree, the same
author attempts a solution of the problem using Genetic Algorithms [63]. In applying a
GA, an important part of the implementation is the coding of solutions as strings of finite
length. For the problem formulation under consideration, a slicing tree can be generated
by a string whose elements are the nodes of the tree in a sequence which starts from the
bottom level nodes and ends at the root of the tree. The nodes of the tree represent either
facility identifications (operands) or “cut” symbols (operators). The proposed GA uses for
the recombination of the population the crossover and mutation operators, as described for
the general genetic algorithm. For the selection of the new population the reproduction
operator is used. Under this operator the chance of being selected to remain in the new
population $P_{t+1}$ is proportional to the fitness value of the individual. This operator
assigns to each individual a sampling rate $T(s,t) = p(s)/\bar{p}(t)$, where the function $p$
measures the fitness level of individual $s$ and $\bar{p}(t)$ is the average fitness of $P_t$.
So individuals with above-average fitness will have a higher survival probability than those
with below-average fitness. The selection of the parameters population size, crossover rate
$p_c$, and mutation rate $p_m$ is based on previous studies that can be found in the
literature ([18], [58]). Four layouts with 12, 15, 20, and 30 facilities were retrieved from
Nugent et al. [49]. Initial solutions were obtained by randomly generating cut operators for
30 slicing trees. For each case the GA was run for 150 generations with 10 different sets of
initial solutions. The best and average solution in each generation were gathered. The
performance of GA was compared with that of a hillclimbing method (HC), which searches
through a neighborhood N, where N is the set of operator sequences generated by changing
one operator. GA outperformed HC both in terms of minimum and average costs. For the
30-facility layout GA improved the minimum cost by 10.5% and the average cost by 13%.
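The sampling rate used by the reproduction operator (the fitness of an individual divided by the average fitness of the current population) is easy to compute. The sketch below assumes fitness values that are positive and larger for better layouts; how a layout cost is turned into such a fitness value is not specified here and is left outside the function.

```python
def sampling_rates(fitnesses):
    """Fitness of each individual relative to the population average.
    A rate above 1 means the individual is expected to be selected more
    than once; a rate below 1 means less than once."""
    mean = sum(fitnesses) / len(fitnesses)
    return [f / mean for f in fitnesses]
```

The rates always sum to the population size, so they can be read as expected numbers of copies in the next population.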
Koakutsu and Hirata [32] propose an interesting combined approach called genetic sim-
ulated annealing (GSA) for the solution of the floorplan design of VLSI (Very Large Scale
Integrated) circuits. The problem involves the arrangement of a given set of rectangular
modules (with no fixed shapes or dimensions) in the plane, with the objective to minimize:
(1) the area of the enclosing rectangle which should contain all the modules, and (2) the
total wire length between modules that should be connected in the circuit. The main features
of the algorithm are the following:
• Stochastic Optimization: GSA uses the stochastic optimization used in simulated
annealing so that a neighbor state for which there is an increase of the cost function is
accepted with a certain probability.
The formulation of the problem represents the floorplan layout as a slicing tree. The
representation of a solution as a string is similar to the one described previously in [63],
using in this case, vertical and horizontal cuts with corresponding branching operators.
GSA is tested on three floorplan problem instances. The first has 16 modules, each one a
fixed square of unit area, having wires connecting to its horizontal and vertical neighbors.
The second problem has 16 modules and 25 wires, and the third one has 20 modules and
31 wires. For the last two problems the total module area is 100. The proposed algorithm
was compared to a regular SA algorithm. Both algorithms were run 100 times with different
initial solutions for each of the above problem instances. The average costs are used for
the comparison. The results show that GSA improves the average cost by 1.7%-9.8%
compared to SA within the same computational time.
More recently Banerjee and Zhou [3] developed a genetic algorithm to solve a variation of
Montreuil’s mixed integer programming formulation for the FLP [46], in particular
for the special case of a single loop material flow path configuration. They introduce a
“knowledge-augmented mutation operator” to determine the flow path direction, which
appears to perform well for cases where the layout has very low flow path dominance.
Previous applications of GA to facilities layout design by the same authors and Montreuil
can be found in [4].
Tate and Smith [65] applied GA using an adaptive penalty function to the unequal-area
facility layout problem with shape constraints. The shape restrictions are expressed through
a flexible bay structure proposed in the literature [68]. The rectangular area in which the
facilities are to be located is divided into vertical bays of different width and each bay is
divided into rectangular departments of different length. The encoding of the solutions as
strings is done with two distinct chromosomes. The first one is the sequential chromosome,
which is represented by a permutation of the set N = {1, 2, ..., n}, where n is the number
of departments. The sequence of the permutation is obtained by reading departments bay by
bay, from top to bottom and from left to right across the rectangular area. The second
chromosome is the bay chromosome, in which the gene for each bay gives the number of
departments contained in that bay and all previous bays, thereby marking the breaks that
occur in the sequence between bays. For example, consider 4 bays having 3, 4, 6 and 2
departments respectively, starting from the left bay. Then using the bay chromosome the
solution encoding is (3, 7, 13). Note that the last breakpoint at 15 is obvious. The proposed
GA uses variants of crossover and mutation operators.
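The bay chromosome of the example is simply the sequence of cumulative department counts with the final total dropped. A small sketch (the function names are chosen here for illustration) makes the encoding and its inverse explicit:

```python
from itertools import accumulate

def encode_bays(bay_sizes):
    """Cumulative department counts per bay, left to right.
    The last value (the total number of departments) is omitted."""
    return tuple(accumulate(bay_sizes))[:-1]

def decode_layout(sequence, breaks):
    """Split the sequential chromosome into bays at the given breakpoints."""
    bounds = [0, *breaks, len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]

print(encode_bays([3, 4, 6, 2]))   # (3, 7, 13)
```

Applying decode_layout to a 15-department sequence with breaks (3, 7, 13) returns four bays of 3, 4, 6 and 2 departments.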
• The variant of the crossover operator works as follows: using two individuals as
the parents, one offspring (child) is generated by the following rules. For the case of
GA encoding using the sequential chromosome, each location in the child’s sequence
is the department number in the corresponding location from one of the parents, both
having the same probability to be selected. This will force the common locations in
the sequences of the parents to be carried over to the child. Also each department must
occur only once in the child. For the bay chromosomes, the location and number of
bay breaks in the child’s sequence is taken from one of the parents, both having equal
probabilities to be selected.
• The mutation uses three different operators. Two of the operators alter the number of
bays, affecting only the bay chromosome, and one operator reverses a subsequence of
the departments, affecting the sequence chromosome.
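The operators acting on the sequential chromosome can be sketched as follows. How duplicates are resolved so that each department occurs only once is not stated above, so the repair step (duplicates replaced by the missing departments in random order) is an assumption of this sketch, as are the function names.

```python
import random

def uniform_sequence_crossover(p1, p2, rng):
    """Each location copies the gene of a randomly chosen parent, so genes on
    which the parents agree are carried over. Repeated departments are then
    replaced by the departments that are missing (assumed repair rule)."""
    child = [rng.choice(pair) for pair in zip(p1, p2)]
    missing = [d for d in p1 if d not in child]
    rng.shuffle(missing)
    seen = set()
    for k, d in enumerate(child):
        if d in seen:
            child[k] = missing.pop()   # second occurrence: use a missing one
        else:
            seen.add(d)
    return child

def reverse_subsequence(seq, rng):
    """Mutation on the sequential chromosome: reverse a random subsequence."""
    i, j = sorted(rng.sample(range(len(seq) + 1), 2))
    return seq[:i] + seq[i:j][::-1] + seq[j:]
```

A gene shared by both parents at the same location cannot appear anywhere else in the child before repair, so the repair step never disturbs the common locations.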
The evolution parameters, i.e. the population size and the crossover and mutation rates, are
determined after several trial runs. An adaptive penalty function is used to find good
feasible solutions. The penalty function is adaptive because, during the course of the
algorithm, it uses observed population data to adjust the level of the penalty that is applied
to infeasible solutions. Test problems ranging in size from 10 to 20 departments, already
published in
the literature ([6], [69], [2]), were used to evaluate the efficiency of the proposed genetic
algorithm. The proposed approach proved to be the best in terms of solution quality when
compared with previously published results for the problems under consideration.
Genetic algorithms are inherently parallel in nature. Several implementations of GA in
parallel environments have recently appeared, introducing in this way a new group of GA,
the Parallel Genetic Algorithms (PGA). The population of a parallel genetic algorithm is
divided into subpopulations. Then an independent GA is performed locally on each of
these subpopulations, and the best solutions in each case are transferred to the other
subpopulations. Two types of communication can be established among the subpopulations
[48]: either among all nodes, where the best solution of each subpopulation is broadcast to
all the other subpopulations, or among neighboring nodes, where only the neighboring
subpopulations receive the best solutions.
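The two communication schemes can be contrasted in a short sequential sketch. It only shows which subpopulations receive which best solutions; the ring-shaped neighborhood, the choice to append migrants rather than replace existing individuals, and the names used are assumptions made for the example.

```python
def migrate(islands, fitness, topology="broadcast"):
    """islands is a list of subpopulations (lists of individuals).
    Copies the best individual of each subpopulation to all others
    ("broadcast") or only to its two ring neighbors ("ring")."""
    k = len(islands)
    bests = [max(pop, key=fitness) for pop in islands]
    result = [list(pop) for pop in islands]
    for src, best in enumerate(bests):
        if topology == "broadcast":
            targets = [d for d in range(k) if d != src]
        else:
            targets = sorted({(src - 1) % k, (src + 1) % k} - {src})
        for dst in targets:
            result[dst].append(best)   # migrant joins the receiving island
    return result
```

With four subpopulations, broadcasting delivers three migrants to each island, while the ring scheme delivers two.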
The most important features of PGA, which result in a considerable speedup relative to
sequential GAs, are the following [28]:
• Local selection: In sequential GAs the selection operation takes place by considering
the whole population. In a PGA this operation is performed locally by the selection of
an individual in a neighborhood.
• Asynchronous behavior: It allows the evolution of different population structures
at different speeds, resulting in an overall improvement of the algorithm in terms of
computational time.
• Reliability in computation performance: The computation performance of one
processor does not affect the performance of the other processors.
Several implementations of PGA have been proposed for the solution of the quadratic
assignment problem. An application of an asynchronous parallel GA called ASPARAGOS
has been presented by Mühlenbein [47] for the QAP, introducing a polysexual voting
recombination operator. The PGA was tested on QAPs of size 30 and 36 with known
solutions. The algorithm found a new optimum for Steinberg’s problem (QAP of size
36). The numbers of processors that were used to run this problem were 16, 32 and 64. The
64-processor implementation (on a system with distributed memory) gave by far the best
results in terms of computational time. Furthermore, Huntley and Brown [26] developed a
parallel hybrid of SA and GA to solve the QAP approximately. A parallel genetic algorithm
is used to produce a good initial solution for each population and the SA algorithm is used
for improving these solutions. More recently, Battiti and Tecchiolli in [5] developed
parallelization schemes of genetic algorithms for quadratic assignment problems, presenting
indicative experimental results.
4. Concluding Remarks
In this paper we summarized the work that has been done in recent years in implementing
simulated annealing and genetic algorithms for solving the facility layout problem. Both
heuristic approaches have been successfully used to approximately solve difficult combi-
natorial optimization problems. For the FLP also, the procedures seem to find sub-optimal
5. Acknowledgments
We wish to thank Professor R.L. Francis for his careful review and valuable comments.
References
1. C.R. Aragon, D.S. Johnson, L.A. McGeoch, and C. Shevon, “Optimization by Simulated Annealing: An
Experimental Evaluation; Part II, Graph Coloring and Number Partitioning”, Operations Research, 39, 3,
1991, pp. 378-406.
2. G.C. Armour, and E.S. Buffa, “A Heuristic Algorithm and Simulation Approach to Relative Location of
Facilities”, Management Science, 9, 1963, pp. 294-309.
3. P. Banerjee, and Y. Zhou, “Facilities Layout Design Optimization with Single Loop Material Flow Path
Configuration”, International Journal of Production Research, 33, 1995, pp. 183-203.
4. P. Banerjee, Y. Zhou, and B. Montreuil, “Genetically Induced Optimization of Facilities Layout Design”,
Technical Report TR-92-18, Dept. of Mechanical Engineering, University of Illinois, Chicago, 1992.
5. R. Battiti and G. Tecchiolli, “Parallel Biased Search for Combinatorial Optimization: Genetic Algorithms
and TABU”, Microprocessors and Microsystems, 16, 1992, pp. 351-367.
6. M.S. Bazaraa, “Computerized Layout Design: A Branch and Bound Approach”, AIIE Transactions, 7,
1985, pp. 432-438.
7. M.S. Bazaraa, and M.D. Sherali, “Benders’ Partitioning Scheme Applied to a New Formulation of the
Quadratic Assignment Problem”, Naval Research Logistics Quarterly, 27, 1, 1980, pp. 29-41.
8. Y.A. Bozer, R.D. Meller, and S.J. Erlebacher, “An Improvement-type Layout Algorithm for Multiple Floor
Facilities”, Management Science, 40, 1994, pp. 918-932.
9. R.E. Burkard, and T. Bonniger, “A Heuristic for Quadratic Boolean Program with Applications to Quadratic
Assignment Problems”, European Journal of Operational Research, 13, 1983, pp. 374-386.
10. R.E. Burkard, and F. Rendl, “A Thermodynamically Motivated Simulation Procedure for Combinatorial
Optimization Problems”, European Journal of Operational Research, 17, 1984, pp. 169-174.
11. D.T. Connolly, “An Improved Annealing Scheme for the QAP”, European Journal of Operational Research,
46, 1990, pp. 93-100.
12. L. Davis, “Job Shop Scheduling with Genetic Algorithms”, Proceedings of an International Conference on
Genetic Algorithms and their Applications, Pittsburgh, PA, 1985, pp. 136-140.
13. C.A. Floudas, and P.M. Pardalos (eds.), State of the Art in Global Optimization, Kluwer Academic Publishers,
1996.
14. R.L. Francis, L.F. McGinnis, and J.A. White, Facility Layout and Location: An Analytical Approach,
Prentice-Hall, Englewood Cliffs, NJ, 1992.
15. A. Ferreira, and P.M. Pardalos (eds.), Solving Combinatorial and Optimization Problems in Parallel,
Springer-Verlag, 1996.
16. D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, 1989.
17. B.L. Golden, and C.C. Skiscim, “Using Simulated Annealing to Solve Routing and Location Problems”,
Naval Research Logistics Quarterly, 33, 1986, pp. 261-279.
18. J.J. Grefenstette, “Optimization of Control Parameters for Genetic Algorithms”, IEEE Transactions on
Systems, Man, and Cybernetics, 16, 1986, pp. 122-128.
19. J.J. Grefenstette, R. Gopal, B.J. Rosmaita, and D. Van Gucht, “Genetic Algorithms for the Traveling Sales-
person Problem”, Proceedings of an International Conference on Genetic Algorithms and their Applications,
Pittsburgh, PA, 1985, pp. 160-168.
20. M.M.D. Hassan, G.L. Hogg, and D.R. Smith, “SHAPE: A Construction Algorithm for Area Placement
Evaluation”, International Journal of Production Research, 24, 1986, pp. 1283-1295.
21. S.S. Heragu, and A.S. Alfa, “Experimental Analysis of Simulated Annealing based Algorithms for the
Layout Problem”, European Journal of Operational Research, 57, 1992, pp. 190-202.
22. S.S. Heragu, and A. Kusiak, “Efficient Models for the Facility Layout Problem”, European Journal of
Operational Research, 53, 1991, pp. 1-13.
23. J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI,
1975.
24. R. Horst, and P.M. Pardalos, Handbook of Global Optimization, Kluwer Academic Publishers, 1995.
25. R. Horst, P.M. Pardalos, and N.V. Thoai, Introduction to Global Optimization, Kluwer Academic Publishers,
1995.
26. C.L. Huntley, and D.E. Brown, “A Parallel Heuristic for Quadratic Assignment Problem”, Computers and
OR, 18, 1991, pp. 275-289.
27. S. Jajodia, I. Minis, G. Harhalakis, and J.M. Proth, “CLASS: Computerized Layout Solutions using Simu-
lated Annealing”, International Journal of Production Research, 30, 1, 1992, pp. 95-108.
28. P. Jog, J.Y. Suh and D. Van Gucht, “Parallel Genetic Algorithms Applied to the Traveling Salesman Problem”,
SIAM Journal on Optimization, 1, 1991, pp. 515-529.
29. D.S. Johnson, C.R. Aragon, L.A. McGeoch, and C. Shevon, “Optimization by Simulated Annealing: An
Experimental Evaluation; Part I, Graph Partitioning”, Operations Research, 37, 6, 1989, pp. 865-892.
30. B.K. Kaku, and G.L. Thompson, “An Exact Algorithm for the General Quadratic Assignment Problem”,
European Journal of Operational Research, 23, 1986, pp. 382-390.
31. S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, “Optimization by Simulated Annealing”, Science, 220, 1983,
pp. 671-680.
32. S. Koakutsu, and H. Hirata, “Genetic Simulated Annealing for Floorplan Design”, Chiba University, Japan,
1992.
33. T.C. Koopmans, and M. Beckmann, “Assignment Problems and the Location of Economic Activities”,
Econometrica 25, 1, 1957, pp. 53-76.
34. P. Kouvelis, and W.-C. Chiang, “A Simulated Annealing Procedure for Single Row Layout Problems in
Flexible Manufacturing Systems”, International Journal of Production Research, 30, 4, 1992, pp. 717-732.
35. P. Kouvelis, W.-C. Chiang, and J. Fitzsimmons, “Simulated Annealing for Machine Layout Problems in
the Presence of Zoning Constraints”, European Journal of Operational Research, 57, 1992, pp. 203-223.
36. A. Kusiak, and S.S. Heragu, “The Facility Layout Problem”, European Journal of Operational Research,
29, 1987, pp. 229-251.
37. P.J.M. Van Laarhoven, and E.H.L. Aarts, Simulated Annealing, Theory and Applications, D. Reidel Pub-
lishing Co., 1987.
38. P.S. Laursen, “Simulated Annealing for the QAP-Optimal Tradeoff Between Simulation Time and Solution
Quality”, European Journal of Operational Research, 69, 1993, pp. 238-243.
39. E.L. Lawler, “The Quadratic Assignment Problem”, Management Science, 9, 4, 1963, pp. 586-599.
40. Y. Li, “Heuristic and Exact Algorithms for the Quadratic Assignment Problem”, Ph.D Thesis, The Pennsyl-
vania State University, Dept. of Computer Science, 1992.
41. W.G. Macready, A.G. Siapas, and S.A. Kauffman, “Criticality and Parallelism in Combinatorial
Optimization”, Science, 271, 1996, pp. 56-59.
42. R.D. Meller, and Y.A. Bozer, “A New Simulated Annealing Algorithm for the Facility Layout Problem”,
Technical Report 91-29, Department of Industrial and Operations Engineering, The University of Michigan,
1991.
43. R.D. Meller, and K.-Y. Gau, “The Facility Layout Problem: A Review of Recent and Emerging Research”,
Auburn University, Auburn, AL, 1995.
44. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, “Equation of State Calculations by
Fast Computing Machines”, Journal of Chemical Physics, 21, 1953, pp. 1087-1092.
45. P.B. Mirchandani, and R.L. Francis (eds.), Discrete Location Theory, Wiley-Interscience Series in Discrete
Mathematics and Optimization, 1990.
46. B. Montreuil, “A Modeling Framework for Integrating Layout Design and Flow Network Design”,
Proceedings of the MHI/CICMHE Material Handling Research Colloquium, Hebron KY, 1990, (to appear in IIE
Transactions).
47. H. Mühlenbein, “Parallel Genetic Algorithms, Population Genetics and Combinatorial Optimization”,
Lecture Notes in Computer Science, 565, 1989, pp. 398-406.
48. H. Mühlenbein, M. Schomisch and J. Born, “The Parallel Genetic Algorithm as Function Optimizer”,
Proceedings on an International Conference on Genetic Algorithms, 1991.
49. C.E. Nugent, T.E. Vollmann, and J. Ruml, “An Experimental Comparison of Techniques for the Assignment
of Facilities to Locations”, Operations Research, 16, 1968, pp. 150-173.
50. P.M. Pardalos, Y. Li, and K.A. Murthy, “Computational Experience with Parallel Algorithms for Solving
the Quadratic Assignment Problem”, Computer Science and Operations Research: New Developments in
their Interface, O. Balci et al. (eds.), Pergamon Press, 1992, pp. 267-278.
51. P.M. Pardalos, and G. Guisewite, “Parallel Computing in Nonconvex Programming”, Annals of Operations
Research, 43, 1993, pp. 87-107.
52. P.M. Pardalos, A.T. Phillips, and J.B. Rosen, Topics in Parallel Computing in Mathematical Programming,
Science Press, 1992.
53. P.M. Pardalos, L.S. Pitsoulis, and M.G.C. Resende, “A Parallel GRASP Implementation for the Quadratic
Assignment Problem”, Solving Irregular Problems in Parallel: State of the Art, A. Ferreira and J. Rolim
(eds.), Kluwer Academic Publishers, 1995, pp. 111-128.
54. P.M. Pardalos, L.S. Pitsoulis, T.D. Mavridou, and M.G.C. Resende, “Parallel Search for Combinatorial
Optimization: Genetic Algorithms, Simulated Annealing, Tabu Search and GRASP”, Lecture Notes in
Computer Science, 980, Springer-Verlag, 1995, pp. 317-331.
55. P.M. Pardalos, M.G.C. Resende, and K.G. Ramakrishnan (eds.), Parallel Processing of Discrete Optimization
Problems, DIMACS Series, 22, American Mathematical Society, 1995.
56. P.M. Pardalos, F. Rendl, and H. Wolkowicz, “The Quadratic Assignment Problem: A Survey and Recent
Developments”, Quadratic Assignment and Related Problems, P.M. Pardalos, and H. Wolkowicz (eds.),
DIMACS Series on Discrete Mathematics and Theoretical Computer Science, 16, American Mathematical
Society, 1994, pp. 1-42.
57. W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Numerical Recipes: The Art of Scientific
Computing, Cambridge University Press, New York, 1986.
58. J.D. Schaffer, R.A. Caruana, L.J. Eshelman, and R. Das, “A Study of Control Parameters Affecting Online
Performance of Genetic Algorithms for Function Optimization”, Proceedings on the Third International
Conference on Genetic Algorithms, Morgan Kaufmann, Los Altos, CA, 1989, pp. 51-60.
59. R. Sharpe, and B.S. Maksjo, “Facility Layout Optimization Using the Metropolis Algorithm”, Environment
and Planning B: Planning and Design, 12, 1985, pp. 443-453.
60. R. Sharpe, B.S. Maksjo, J.R. Mitchell, and J.R. Crawford, “An Interactive Model for the Layout of Buildings”,
Applied Mathematical Modeling, 9, 3, 1985, pp. 207-214.
61. A. Souilah, “Simulated Annealing for Manufacturing Systems Layout Design”, European Journal of Oper-
ational Research, 82, 1995, pp. 592-614.
62. G. Suresh, and S. Sahu, “Multiobjective Facility Layout Using Simulated Annealing”, International Journal
of Production Economics, 32, 1993, pp. 239-254.
63. K.Y. Tam, “Genetic Algorithms, Function Optimization and Facility Layout Design”, European Journal of
Operational Research, 63, 1992, pp. 322-346.
64. K.Y. Tam, “A Simulated Annealing Algorithm for Allocating Space to Manufacturing Cells”, International
Journal of Production Research, 30, 1, 1992, pp. 63-87.
65. D.M. Tate, and A.E. Smith, “Unequal-Area Facility Layout by Genetic Search”, IIE Transactions, 27, 1995,
pp. 465-472.
66. M.B. Teitz, and P. Bart, “Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted
Graph”, Operations Research, 16, 1968, pp. 955-961.
67. J.A. Tompkins, and J.A. White, Facilities Planning, Wiley, New York, 1984.
68. X. Tong, “SECOT: A Sequential Construction Technique for Facility Design”, Doctoral Dissertation, Uni-
versity of Pittsburgh, 1991.
69. D.J. Van Camp, M.W. Carter, and A. Vannelli, “A Nonlinear Optimization Approach for Solving Facility
Layout Problems”, European Journal of Operational Research, 57, 1991, pp. 174-189.
70. M.R. Wilhelm, and T.L. Ward, “Solving Quadratic Assignment Problems by Simulated Annealing”, IIE
Transactions, 19, 1987, pp. 107-119.
Computational Optimization and Applications 7, 127-142 (1997)
© 1997 Kluwer Academic Publishers. Manufactured in The Netherlands.
Abstract. Sequential quadratic programming (SQP) methods are the method of choice when solving small or
medium-sized problems. Since they are complex methods, they are difficult (but not impossible) to adapt to solve
large-scale problems. We start by discussing the difficulties that need to be addressed and then describe some
general ideas that may be used to resolve these difficulties. A number of SQP codes have been written to solve
specific applications, and there is a general-purpose SQP code called SNOPT, which is intended for general
applications of a particular type. These are described briefly together with the ideas on which they are based.
Finally, we discuss new work on developing SQP methods using explicit second derivatives.
1. Introduction
$$
\text{(NP)} \qquad \min_{x \in \mathbb{R}^n} \; F(x) \quad \text{s.t.} \quad c(x) \ge 0,
$$
where $F : \mathbb{R}^n \to \mathbb{R}$ and $c : \mathbb{R}^n \to \mathbb{R}^m$. If second derivatives are not known, computing $x^*$, a
point satisfying the first-order KKT conditions for NP, is the best that can be assured. Otherwise,
it is possible to assure finding a point satisfying the second-order KKT conditions.
Our particular interest here is when $n$ and possibly $m$ are large.
When solving large problems the precise form of what are mathematically equivalent
forms of NP is important. In practice the constraints may be a mixture of linear and
nonlinear inequality constraints, simple bounds on the variables and equality constraints.
Also there may be upper as well as lower bounds on some constraints and some variables may
appear only linearly in the problem. Such trivial mathematical considerations may assume
quite large proportions in some practical algorithms. Success in solving large problems
often depends on attention to a myriad of small details in the definition of the model and
algorithm. The interface for software for large problems is more complex since the user
needs to specify more details of their problem in order for the software to be efficient.
The level of description in this paper is such that these details will not be given much
consideration, but they are vital in practice.
*Research supported by the National Science Foundation Grant DMI-9500668; the Office of Naval Research
Grant N00014-96-1-0274.
$$
\text{(QP)} \qquad \min_{p \in \mathbb{R}^n} \; \nabla F(x_k)^T p + \tfrac{1}{2} p^T H_k p \quad \text{s.t.} \quad c(x_k) + \nabla c(x_k) p \ge 0
$$
for some positive definite matrix $H_k$. Let $p_k$ (referred to as the search direction) denote the
unique solution to QP. We define $x_{k+1} = x_k + \alpha_k p_k$, where the steplength $\alpha_k$ is chosen to
achieve a reduction in a merit function. The matrix $H_k$ is usually an approximation to the
Hessian of the Lagrangian function
$$
L(x, \lambda) = F(x) - \lambda^T c(x),
$$
where $\lambda$ are estimates of the Lagrange multipliers. Since the Hessian of the Lagrangian
function is not in general positive definite, this clearly poses some difficulties, although such difficulties
arise even when $n$ is small. In most cases the approximation is obtained using quasi-Newton
updates to an initial approximation that is usually diagonal.
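The iteration described above can be sketched in a few lines. This is a simplified illustration, not the method of any code discussed here: it assumes equality constraints $c(x) = 0$ only, takes the unit step $\alpha_k = 1$ with no merit function, and uses the exact Hessian of the Lagrangian for $H_k$. The function name `sqp_equality` and the two-variable test problem are hypothetical.

```python
import numpy as np

def sqp_equality(grad_f, c, jac_c, hess_lagr, x0, lam0, iters=20):
    """Basic SQP iteration for  min F(x)  s.t.  c(x) = 0  (equality constraints only).

    Each pass solves the QP  min g^T p + 0.5 p^T H p  s.t.  c + J p = 0
    through its KKT system and takes the unit step alpha_k = 1.
    """
    x = np.asarray(x0, dtype=float)
    lam = np.asarray(lam0, dtype=float)
    for _ in range(iters):
        g, ck, J = grad_f(x), c(x), jac_c(x)
        H = hess_lagr(x, lam)             # Hessian of L(x, lam) = F(x) - lam^T c(x)
        n, m = x.size, ck.size
        # KKT system of the QP:  H p - J^T lam_new = -g,   J p = -c
        K = np.block([[H, -J.T], [J, np.zeros((m, m))]])
        sol = np.linalg.solve(K, np.concatenate([-g, -ck]))
        p, lam = sol[:n], sol[n:]
        x = x + p                         # alpha_k = 1, no merit function
    return x, lam

# Hypothetical test problem: min x0 + x1  s.t.  x0^2 + x1^2 - 2 = 0
grad_f = lambda x: np.array([1.0, 1.0])
c = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 2.0])
jac_c = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]]])
hess_lagr = lambda x, lam: -2.0 * lam[0] * np.eye(2)

x_star, lam_star = sqp_equality(grad_f, c, jac_c, hess_lagr, [-1.5, -0.5], [-1.0])
```

A practical code would add the steplength selection and the treatment of inequalities discussed in the text; the sketch only shows how the QP solution supplies both the step and the new multiplier estimate.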
An often unappreciated feature of SQP methods is that they automatically take advantage
of linearity in a linear constraint. For example, if the initial estimate satisfies a linear
equality or inequality constraint then so do all subsequent iterates. Such constraints are
also automatically excluded from the merit functions. Once a linear constraint is satisfied it
is never again violated. Nonetheless in both large and small problems there is an advantage
to taking specific note of whether a constraint is linear. Although such a distinction may
not impact the sequence of iterates it does impact the effort to compute the iterates and is
therefore important in the large-scale case.
Merit functions
Merit functions are a means of obtaining the required convergence. Since the iterates may
not be feasible some means of defining whether one point is better than another is required.
The first merit function proposed (see [4] and [31]) was the quadratic or $\ell_2$ penalty function.
Later the $\ell_1$ penalty function was used (see [27] and [38]). The $\ell_1$ penalty function has the
advantage (as would any merit function based on a norm) of requiring only a finite penalty
parameter. However, such merit functions used alone may inhibit the rate of convergence.
It is also our experience that they are not as efficient as smooth merit functions regardless of
this deficiency. Smooth merit functions have been defined based on exact penalty functions
(see [14]). However, such an approach requires restrictive assumptions about the rank of
the Jacobian matrix and uses problem derivatives in the definition of the merit function. If
either of these two restrictions is to be avoided, it requires searching in a higher-dimensional
space. In [22] and [26] a merit function was proposed in which
the search is in the triple space of $x$, $\lambda$ and $s$. Interestingly the idea of searching
in the space of the slack and dual variables has become common practice in the application
of barrier (interior-point) methods to linear programs. The above merit function is used in
NPSOL [22] and SNOPT, a new SQP code for large problems (see [20] and [21]).
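For orientation, the augmented Lagrangian merit function usually associated with NPSOL can be written as below. This is a sketch taken from the general SQP literature rather than a reproduction of the formula in [22] and [26]; the penalty parameter $\rho$ and the slack convention $c(x) - s = 0$, $s \ge 0$ are assumptions made here.

```latex
M(x, \lambda, s) = F(x) - \lambda^T \bigl( c(x) - s \bigr)
  + \tfrac{1}{2} \rho \, \bigl( c(x) - s \bigr)^T \bigl( c(x) - s \bigr),
  \qquad s \ge 0 .
```

The function is smooth in all three arguments, which is what allows a linesearch in the combined space of $x$, $\lambda$ and $s$.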
3. Large problems
SQP algorithms are viewed by many as the best approach (see [24]) to the solution of NP
when n is small or moderate (say less than 1000). In this sense “best” means it requires
the smallest number of evaluations of the user's functions $F(x)$ and $c(x)$. For small problems
the effort to solve the QP subproblems is rarely relevant provided the method of solution is
done with reasonable care. To adapt SQP methods to solve large problems requires being
able to solve efficiently the resulting large QP subproblems. One approach taken in Murray
and Prieto [32] is to show that a suitable search direction may be obtained without the need
to solve completely the QP subproblem. They also prove convergence of SQP methods
that incorporate a large degree of flexibility in the definition of the algorithm in order to
accommodate the type of adaptation that is necessary when designing algorithms for large
problems. Nonetheless even to solve the subproblem incompletely is likely to require the
use of sparse matrix technology. It is also necessary to store the matrices defining the QP
problem in compact form. A basic assumption made is that $\nabla c(x_k)$ is a sparse matrix and
hence there are a number of ways of storing this matrix in compact form. In general, when
$H_k$ is obtained by a quasi-Newton approximation it will not be sparse, so some modification
to the standard quasi-Newton approach is necessary.
One key to a successful SQP algorithm for large problems is a fast algorithm to solve
(or partially solve) the QP subproblems. Solving a sequence of related QP problems is not
quite the same as solving a single QP problem. Usually after a few nonlinear iterations the
active set of the QP subproblem changes only slightly. The active set of one QP subproblem
is then close to that of the following subproblem. It is important that full advantage is taken
of this information.
It should be understood that it is unlikely that a single SQP code will prove all powerful.
Indeed it is unlikely a single QP code that is efficient under all circumstances will emerge.
Efficiency for a QP algorithm is often dependent on the nature of the solution. For example,
in many problems the degree of freedom in the problem is small. For such problems methods
that assume this property will be more efficient than methods that do not. Moreover the
difference is significant. Such methods, however, are likely to perform poorly when the
property is not true. It can often be assessed prior to the solution whether or not the property
holds. For example, if the number of equality constraints is almost equal to the number
of variables then the degree of freedom in the problem must be small. Similarly it will
be known when the number of constraints is small since then $m$ is small. Again one can
design an SQP method specifically for that class of problem. Other issues impacting the
design are the relative cost of the linear algebra operations compared to that of evaluating
the user’s functions and derivatives (if derivatives are available). Again such information
may be obtained prior to solving a problem.
As we have already mentioned for large problems the standard quasi-Newton approach is
no longer adequate. Much of the work done on extending quasi-Newton methods to large
problems has been directed at the unconstrained case. Although the need to obtain some
compact expression for an approximation to the curvature of a function is the same for both
unconstrained and constrained problems the use of this information in constrained problems
is more complex. Moreover the nature of curvature for constrained problems is different.
Research in this area may be classified under the following general approaches:
Limited memory approximations are easily understood and have been the subject of considerable
research and experimentation (see [6, 8, 9, 16, 18, 28, 36]). Instead of $H_k$ being
a result of $k$ updates, only a limited number of updates are used. There are various ways
this may be done. We could for instance keep the last $t$ updates, where $t$ is small, say less
than 25 and typically less than ten. The key point is that $H_k$ is not stored explicitly; rather,
the initial approximation is stored (usually a diagonal matrix) together with the information
required to perform the updates. Depending on the form of the update, the storage requirements
are at worst $2nt$ and at best $nt$. An alternative to retaining information on the last $t$
updates is to restart every $t$ iterations. This has some advantages for problems with linear
constraints. Yet a third alternative is to try to retain what is thought to be the $t$ most useful
updates.
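A minimal sketch of the "keep the last $t$ updates" variant follows. It assumes the BFGS update and stores the pairs $(s, y)$ directly, which gives the $2nt$ storage figure; the class name `LimitedMemoryHessian` and its methods are hypothetical, and the product is computed by a simple $O(t^2 n)$ recursion chosen for clarity rather than speed.

```python
from collections import deque
import numpy as np

class LimitedMemoryHessian:
    """Hessian approximation kept as a diagonal plus the last t BFGS pairs."""

    def __init__(self, diag, t=5):
        self.diag = np.asarray(diag, dtype=float)   # initial diagonal approximation
        self.pairs = deque(maxlen=t)                # (s, y) pairs: 2*n*t numbers

    def update(self, s, y):
        s, y = np.asarray(s, dtype=float), np.asarray(y, dtype=float)
        if y @ s > 1e-12:                           # skip pairs that would lose positive definiteness
            self.pairs.append((s, y))

    def _apply(self, v, pairs, bs_list):
        # B_0 v plus one BFGS correction per stored pair.
        out = self.diag * v
        for (s, y), bs in zip(pairs, bs_list):
            out = out - bs * (bs @ v) / (bs @ s) + y * (y @ v) / (y @ s)
        return out

    def matvec(self, v):
        """Return H_k v without ever forming H_k."""
        pairs = list(self.pairs)
        bs_list = []                                # bs_list[i] = B_i s_i
        for i, (s, _) in enumerate(pairs):
            bs_list.append(self._apply(s, pairs[:i], bs_list))
        return self._apply(np.asarray(v, dtype=float), pairs, bs_list)

# Usage on a test quadratic with Hessian A, keeping only t = 2 pairs.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
H = LimitedMemoryHessian(diag=np.ones(3), t=2)
for s in np.eye(3):
    H.update(s, A @ s)
last_s = np.array([0.0, 0.0, 1.0])
Hs = H.matvec(last_s)      # satisfies the secant condition for the newest pair
```

Because only products with $H_k$ are available, a QP solver built on this representation has to work with matrix-vector products, which is the restriction on solution methods noted in the text.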
Much of the work on limited memory methods has been directed at the unconstrained case
and it is likely the experience there does not fully carry over to the nonlinearly constrained
case (it may be better). Although simple in concept there are many ways to implement a
limited memory method. How the updates are stored and in what form impacts the cost of
operations with $H_k$. Since $H_k$ is not stored explicitly this limits the methods of solution of
the QP subproblems. This approach has the property that it does not depend on the Hessian
of the Lagrangian being sparse. Although one could argue that this is likely it does save
a user establishing that fact or possibly the need to cast their problem in a form where it
is true.
where each function $y_i(x)$ has the form $y_i(x) = a_i^T x + \sum_j f_j(x)$, with each function
$f_j(x)$ involving its own (small) subset of the variables $x$. Obviously such a representation
is not unique for a given function $F(x)$. The basic idea is that instead of approximating
the Hessian of $F(x)$ directly, the Hessians of $g_i$ with respect to $y_i$ may be approximated
and that for $\nabla^2 F(x)$ constructed from these matrices. By assumption the function
$g_i(y_i)$ is a function of only a few variables, which implies the approach gives a compact
representation of $H_k$. In general $H_k$ may not be sparse since $\nabla^2 F(x)$ need not be
sparse. If that is the case it limits the use of $H_k$ to be indirect when solving the QP
subproblem.
We have not favored direct use of group partial separability, since it seems likely that
many users would not know how to define $g_i$, $y_i(x)$, etc., and the user interface utilizing
this information, if available, is inevitably complex [11]. It seems to us to be an
idea primarily suited to unconstrained or linearly constrained problems. When a nonlin-
early constrained problem is formulated, such convoluted functions are unlikely to occur
(they are often a consequence of eliminating nonlinear constraints) because the mod-
eler always has the option of defining extra variables and constraints. For example, the
problem
$$
\min_{x} \; F(x) \tag{4.2}
$$
could be treated as
Of course the problem is now constrained, but this is of no consequence if there are other
nonlinear constraints. It is not that the transformation from (4.2) to (4.3) need be made;
rather, the second formulation is the natural one. Indeed, forming convoluted functions
such as (4.1) is inherently dangerous. For example, consider the impact on the derivatives
of eliminating the variable $y_j$ and the constraint $x_j^2 + y_j^2 = 1$ from a problem. It may well
be better to solve a constrained problem rather than a difficult unconstrained problem. A
good rule of thumb when solving large problems is that it is better to have a formulation of
the problem that has more variables and more constraints.
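One plausible reading of the reformulation is sketched below. It assumes that the convoluted objective has the group form $F(x) = \sum_i g_i(y_i(x))$ with $y_i(x) = a_i^T x + \sum_j f_j(x)$; the auxiliary variables $y_i$ and the exact constraint form are assumptions made for illustration, not the paper's displayed equation.

```latex
\min_{x} \; \sum_i g_i\bigl( y_i(x) \bigr)
\qquad \longrightarrow \qquad
\min_{x,\, y} \; \sum_i g_i(y_i)
\quad \text{s.t.} \quad y_i - a_i^T x - \sum_j f_j(x) = 0 \quad \text{for all } i .
```

In the second form each $g_i$ depends only on its own variable $y_i$, and the nonlinearity of $y_i(x)$ is carried by a sparse constraint, which is the kind of structure an SQP code for large problems is designed to exploit.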
The case for partial separability is much stronger since it requires only identifying that
the function can be written in the form
$$
F(x) = \sum_i g_i(x),
$$
where it is assumed each function $g_i(x)$ is a function of only a few variables. It is possible
that the user need not provide the above form for the problem since an attempt to identify
such a form can be made using information about the sparsity of the Hessian matrix (this is
a non-trivial task and finding the best form is even harder).
Consider now problems of the form NP. In the constrained case we wish to approximate
the Hessian of the Lagrangian function
$$
L(x, \lambda) = F(x) - \sum_i \lambda_i c_i(x),
$$
which is naturally in partially separable form provided $F(x)$ and $c_i(x)$ are functions of
only a few variables. Let $J(x) = \nabla c(x)$. If $J(x)$ is sparse (our basic assumption), then
in general each constraint function must involve only a few variables (some of which may
appear linearly). The user already supplies the sparsity structure of $J(x)$, and from it we
may deduce the set of variables involved in each function $c_i(x)$. It is also reasonable for
a user to be asked to specify the variables that appear nonlinearly in the objective. For
many problems such as those arising from optimal control, this is a trivial requirement.
(Indeed many problems have linear objectives.) Such a requirement is already a feature of
the codes MINOS and SNOPT. These codes require the variables that occur nonlinearly
in the objective and constraints to be specified. If these variables are listed first then it is
known that $H_k$ has the form
$$
H_k = \begin{pmatrix} \bar{H}_k & 0 \\ 0 & 0 \end{pmatrix}. \tag{4.4}
$$
For many problems the number of variables occurring nonlinearly in the problem is much
smaller than the total number of variables. If it is small enough then $H_k$ may be stored as
a dense matrix. It also implies there are lots of constraints active at the solution and this
knowledge impacts the choice of method for solving the QP subproblems.
Using a quasi-Newton update procedure we can approximate $H_0 \approx \nabla^2 F(x)$ and
$H_i \approx \nabla^2 c_i(x)$ and then form the appropriate linear combination $H_0 - \sum_i \lambda_i H_i$ to obtain an
approximation to $\nabla^2 L$.
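The linear combination can be formed by scattering the small dense blocks into the full matrix. The sketch below assumes each function touches a known index set, as deduced from the sparsity of $J(x)$; the function name, its arguments, and the use of a dense $n \times n$ array (a real code would use sparse storage) are all illustrative assumptions.

```python
import numpy as np

def assemble_lagrangian_hessian(n, H0, obj_vars, elem_hessians, elem_vars, lam):
    """Form H0 - sum_i lam_i * H_i from small dense blocks.

    H0 is the dense block for the objective over the variables in obj_vars;
    elem_hessians[i] is the dense block for c_i over the variables in elem_vars[i].
    """
    H = np.zeros((n, n))
    H[np.ix_(obj_vars, obj_vars)] += H0
    for lam_i, Hi, idx in zip(lam, elem_hessians, elem_vars):
        H[np.ix_(idx, idx)] -= lam_i * Hi
    return H

# Four variables; the objective involves x0, x1; c_1 involves x1, x2; c_2 involves x3.
H0 = 2.0 * np.eye(2)
blocks = [2.0 * np.eye(2), np.array([[6.0]])]
index_sets = [[1, 2], [3]]
H = assemble_lagrangian_hessian(4, H0, [0, 1], blocks, index_sets, lam=[0.5, -1.0])
```

Since the blocks do not depend on the multipliers, a new multiplier estimate only requires calling the assembly again with a different `lam`.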
To illustrate, consider obtaining the approximation $H_i \approx \nabla^2 c_i(x)$. For some permutation
$P_i$, we have
$$
P_i \nabla^2 c_i(x) P_i^T = \begin{pmatrix} \bar{H}_i(x) & 0 \\ 0 & 0 \end{pmatrix},
$$
where $\bar{H}_i(x)$ is a dense matrix of order $n_i$, the number of variables appearing in $c_i(x)$.
Assuming $n_i$ is small, we may implement quasi-Newton updates by maintaining a dense
approximation $\tilde{H}_i \approx \bar{H}_i(x)$. Since the search directions have no special properties it is likely
that the rank-one update will be most suitable. It may be that a reasonable approximation
to each individual Hessian will be obtained in only a few iterations. For example, if $c_i$
is a quadratic function of $n_i$ variables, its Hessian approximation will be correct after $n_i$
iterations. Typically we expect $n_i$ to be less than 10.
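The claim about quadratic functions can be checked numerically. The sketch below takes "the rank-one update" to mean the symmetric rank-one (SR1) formula, which is an assumption, and uses linearly independent steps; the skip test for a small denominator is the usual safeguard and the name `sr1_update` is hypothetical.

```python
import numpy as np

def sr1_update(B, s, y, tol=1e-8):
    """Symmetric rank-one update of a small dense Hessian approximation B."""
    r = y - B @ s
    denom = r @ s
    # Standard safeguard: skip the update when the denominator is tiny.
    if abs(denom) <= tol * np.linalg.norm(r) * np.linalg.norm(s):
        return B
    return B + np.outer(r, r) / denom

# Hessian of a quadratic function of n_i = 3 variables.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
B = np.eye(3)
for s in np.eye(3):            # three linearly independent steps
    B = sr1_update(B, s, A @ s)
```

After the three updates `B` agrees with `A`, provided none of the updates had to be skipped.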
An important feature of this approach is that the individual Hessian approximations are
independent of the Lagrange multiplier estimates. If the multiplier estimates are poor at
some stage, then any quasi-Newton method approximating $\nabla^2 L$ directly will take many
iterations to recover even if the multiplier estimates improve immediately. By contrast, the
approach based on partial separability has the potential to recover immediately. In general
the resulting approximation $H_k$ will not be positive semidefinite and this has consequences
as to the definition and method of solution of the QP subproblem. Many of the issues that
arise due to the lack of convexity are similar to those that arise when using exact second
derivatives and are discussed in Section 6.
$$
\min_{p} \; g^T p + \tfrac{1}{2} p^T H p \quad \text{s.t.} \quad Jp = 0
$$
and $Z$ is a basis for the null space of the rows of $J$. The key point of this result is that only
the matrix $Z^T H Z$ is required to define the solution and not $H$. If the dimension of $Z$ is
small then $Z^T H Z$ may be stored as a dense matrix. If we define $Q = (Z \;\; Y)$, where $Y$ is
chosen to make $Q$ nonsingular then we could define $Q^T H Q$ as
We shall assume that the QP is convex. Note that unlike the case of dense problems it may
be that $H_k$ is structurally singular (see (4.4)). This limits (but does not eliminate) the use of the
dual when solving the QP subproblem.
There are a variety of methods to solve large QP problems and which is best will depend
on the characteristics of the QP and its solution. We shall consider four basic approaches
(there are others):
• A Schur-complement method.
• A null-space method.
• A range-space method.
• A barrier (interior-point) method.
The first three are active-set methods and differ only in how the relevant linear equations
are solved. Which of the three active-set methods to use depends largely on the number of
active constraints. Null-space methods are efficient when the number of active constraints
is almost the same as the number of variables. Range-space methods are efficient when
there are only a few active constraints. If neither of these conditions holds or is known to hold
then the Schur-complement approach is recommended. The use of a barrier algorithm to
solve the subproblem as opposed to applying a barrier approach to the nonlinear problem is
likely to be preferred when the user's functions are expensive to compute. Usually applying
a barrier approach to the original nonlinear problem results in a substantial increase in the
number of nonlinear iterations. In theory the number of nonlinear iterations is independent
of how the QP subproblem is solved. In practice there are likely to be small but essentially
negligible differences on most problems. Whether a barrier algorithm is preferable to an
active-set method to solve the QP is more difficult to decide. Barrier methods may prove
very efficient when the number of constraints is small. They are likely to be inefficient
when there are many variables on their bounds.
It is sometimes beneficial if the QP is in standard form although this is not always strictly
necessary. This helps in the barrier approach since the ill-conditioning in the resulting
subproblems is then benign (see [37]) and in the null-space method it simplifies the updating
procedures.
In all four approaches we are required at each iteration to solve the KKT equations, which
are of the form
where H is the Hessian approximation and A is some set of rows from the constraint
Jacobian. The three active-set methods differ in how they partition these equations in order
to solve them.
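As a concrete illustration, the sketch below solves one equality-constrained QP in two ways: through the full KKT matrix and through the reduced Hessian $Z^T H Z$. The sign convention $Hp + g = A^T \pi$, the right-hand side $Ap = b$, and the dense QR factorization used to obtain $Z$ are assumptions made for the example; a large-scale code would use sparse factorizations instead.

```python
import numpy as np

def solve_kkt(H, A, g, b):
    """Solve  min g^T p + 0.5 p^T H p  s.t.  A p = b  via the full KKT matrix."""
    n, m = H.shape[0], A.shape[0]
    K = np.block([[H, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-g, b]))
    return sol[:n], -sol[n:]          # step p and multipliers pi (H p + g = A^T pi)

def solve_null_space(H, A, g, b):
    """Same problem, using only the reduced Hessian Z^T H Z."""
    m = A.shape[0]
    Q, R = np.linalg.qr(A.T, mode="complete")
    Y, Z = Q[:, :m], Q[:, m:]         # range-space and null-space bases
    p_y = np.linalg.solve(R[:m, :].T, b)          # A Y p_y = b
    rhs = -Z.T @ (g + H @ (Y @ p_y))
    p_z = np.linalg.solve(Z.T @ H @ Z, rhs)
    return Y @ p_y + Z @ p_z

H = np.diag([2.0, 1.0, 3.0])
A = np.array([[1.0, 1.0, 1.0]])
g = np.array([1.0, 0.0, -1.0])
b = np.array([1.0])
p_kkt, pi = solve_kkt(H, A, g, b)
p_ns = solve_null_space(H, A, g, b)
```

Both routes give the same step; they differ only in which matrices have to be factorized, which is the distinction drawn between the active-set methods.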
A Schur-complement method
If $K_0$ denotes the initial KKT matrix, the Schur-complement algorithm (see [25]) first forms
a sparse factorization of this matrix to determine the first iteration. Subsequent iterations
are performed using this factorization and a dense factorization of the Schur complement
of the initial matrix. In order to do that it is necessary to construct in the $t$th QP iteration a
KKT matrix, $K_t$, that is an augmented form of the initial matrix. The Schur complement
grows monotonically and since this is in general a dense matrix it is necessary to replace
periodically the initial factors of $K_0$ with a factorization of the current KKT matrix. The
method fits well with the incomplete solution approach since then the need to perform a
refactorization is reduced.
The application of the standard Schur-complement algorithm to the special case of solving
the sequence of QP problems that arises in SQP methods is straightforward (see [23] and
[25]). It is not strictly necessary to be able to factor $K_0$. What is required is to be able to
solve linear systems of the form $K_0 u = v$ relatively quickly. In general two such systems
require solving at each QP iteration. Such a requirement usually eliminates the use of
iterative procedures, but not more elaborate factorization approaches such as also using a
Schur-complement approach when solving these systems.
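The block elimination behind this method can be sketched as follows. The bordered form $\begin{pmatrix} K_0 & U \\ U^T & D \end{pmatrix}$ for the augmented matrix, the function name, and the callable `solve_K0` standing in for a stored sparse factorization are assumptions for illustration; the example borders a small KKT matrix with one extra column.

```python
import numpy as np

def schur_complement_solve(solve_K0, U, D, b, d):
    """Solve [[K0, U], [U^T, D]] [x; y] = [b; d] using only solves with K0.

    solve_K0(rhs) stands in for a reusable sparse factorization of K0.
    """
    W = solve_K0(U)                       # K0^{-1} U, one solve per added column
    C = D - U.T @ W                       # dense Schur complement, order = number of added columns
    y = np.linalg.solve(C, d - U.T @ solve_K0(b))
    x = solve_K0(b - U @ y)
    return x, y

K0 = np.array([[2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [1.0, 1.0, 0.0]])  # initial KKT matrix
U = np.array([[1.0], [0.0], [0.0]])      # border added by one change to the active set
D = np.zeros((1, 1))
b = np.array([1.0, 2.0, 3.0])
d = np.array([0.5])
x, y = schur_complement_solve(lambda rhs: np.linalg.solve(K0, rhs), U, D, b, d)
```

Apart from building the columns of `W`, which can be accumulated as the border grows, each call needs two solves with $K_0$.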
A range-space method
This is similar in some respects to the Schur-complement algorithm except instead of taking
the Schur complement of the initial KKT matrix we use in the $t$th QP iteration the Schur
complement of $H_t$, where $H_t$ is the Hessian portion of $K_t$. Note that this approach requires
$H_t$ to be nonsingular. Since the Schur complement is in general a dense matrix the approach
is only feasible when the Schur complement is limited in size as it would be if we have
few constraints. The fact the Schur complement is limited in size implies $\nabla^2 L(x)$ is nearly
full rank hence assuming that $H_t$ is full rank is not unreasonable. It is necessary with
this approach to solve many systems with $H_t$ and while iterative methods are possible the
approach is best suited to the case where solves with $H_t$ are cheap and this usually implies
$H_t$ is sparse and possibly well structured. The adaptation to the case of solving a sequence
of QPs is again straightforward. Unlike the Schur-complement approach where the Schur
complement is built from scratch as iterations proceed we need to form an initial Schur
complement and factor it. In general there will never be a need to discard these factors due
to them becoming too large since a given KKT matrix does not have to be an augmentation
of the preceding KKT matrix. Consequently, the Schur complement may grow or shrink
in size as the iterations proceed and there is an a priori limit on its growth. Obviously if a
problem has only a few constraints it is known a priori that the Schur complement will not
be large. Only general constraints are of significance since bounds on the variables may be
used to solve a KKT system of reduced size.
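A minimal range-space solve is sketched below for the simplest case, a diagonal nonsingular Hessian, where solves with $H_t$ are trivially cheap. The sign convention $Hp + g = A^T \pi$, the right-hand side $Ap = b$, and the function name are assumptions for the example.

```python
import numpy as np

def solve_range_space(h_diag, A, g, b):
    """Solve  min g^T p + 0.5 p^T H p  s.t.  A p = b  with H = diag(h_diag), nonsingular."""
    Hinv_At = A.T / h_diag[:, None]               # H^{-1} A^T, cheap because H is diagonal
    S = A @ Hinv_At                               # m-by-m matrix A H^{-1} A^T
    pi = np.linalg.solve(S, b + A @ (g / h_diag)) # multipliers
    p = Hinv_At @ pi - g / h_diag                 # p = H^{-1} (A^T pi - g)
    return p, pi

h_diag = np.array([2.0, 1.0, 3.0])
A = np.array([[1.0, 1.0, 1.0]])
g = np.array([1.0, 0.0, -1.0])
b = np.array([1.0])
p, pi = solve_range_space(h_diag, A, g, b)
```

The only dense matrix formed is $A H^{-1} A^T$ (the Schur complement of $H$ up to sign), whose order equals the number of active general constraints.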
The range-space approach is occasionally suitable for problems for which the Schur
complement is large provided it is sparse. Such problems do arise when $\nabla^2 L(x)$ has such a
simple structure that $(\nabla^2 L(x))^{-1}$ is sparse. Usually in such cases $\nabla^2 L(x)$ may be computed
or approximated directly by a small number of finite differences. The simplest case is when
$\nabla^2 L(x)$ is a diagonal matrix. Sometimes a problem may be reformulated by adding some
additional variables and constraints to obtain such a simple structure. This is the case
when $\nabla^2 L(x)$ is a low-rank change to a diagonal matrix.
A null-space method
The null-space method is based on Eqs. (4.5). It may be used either in conjunction with
a projected Hessian approach or with a direct approximation to the Hessian. Approximating
the reduced Hessian has two benefits: it requires storing only a small matrix and it facilitates
the solution of the QP subproblem, which is solved very efficiently if the predicted active
set is correct. In the approach taken by Gill et al. [20] and Eldersveld [15] the matrix $Q$
is never required explicitly and the null-space algorithm may be efficiently implemented if
solves with $Q$ and $Q^T H Q$ are efficient. As before we define $Q = (Z \;\; Y)$, but now the first
columns of $Z$ span the null space of the Jacobian of the current active set. By allowing the
columns of $Z$ to span a bigger space than that defined by the null space of the active set we
are able to cope with some changes to the active set. If $Z$ is not too large then $Q^T H Q$ is
still cheap to operate with and requires little storage. Moreover, and this is a key point, the
solution of the QP is relatively trivial to compute since it is not necessary to form $Z^T H Z$
(in practice the Cholesky factor of this matrix is recurred).
We still require a sparse representation for $Q$. The usual sparse representation for $Z$ is
to make use of the matrix
where $B$ is a basis from the Jacobian of the active set and $S$ is the remaining columns of
the Jacobian plus some columns of the full Jacobian corresponding to variables currently
on their bounds. The matrix $Z$ is not stored explicitly; instead the sparse LU factors of $B$
are stored. Since the algorithm requires only operations with $Z$ and $Z^T$, this suffices.
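The sketch below shows how products with $Z$ and $Z^T$ can be formed from solves with $B$ alone. It assumes the reduced-gradient form $Z = \begin{pmatrix} -B^{-1} S \\ I \end{pmatrix}$ with the basic variables ordered first, which is the standard choice but is an assumption here; the class name is hypothetical and the dense `np.linalg.solve` calls stand in for the stored sparse LU factors.

```python
import numpy as np

class NullSpaceOperator:
    """Implicit Z = [-B^{-1} S; I] for an active-set Jacobian partitioned as (B  S)."""

    def __init__(self, B, S):
        self.B, self.S = B, S             # a sparse LU of B would be stored in practice

    def apply(self, v):
        """Return Z v."""
        return np.concatenate([-np.linalg.solve(self.B, self.S @ v), v])

    def apply_transpose(self, w):
        """Return Z^T w."""
        m = self.B.shape[0]
        return w[m:] - self.S.T @ np.linalg.solve(self.B.T, w[:m])

B = np.array([[2.0, 1.0], [0.0, 1.0]])
S = np.array([[1.0, 0.0, 2.0], [3.0, 1.0, 0.0]])
Zop = NullSpaceOperator(B, S)
v = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, -1.0, 2.0, 0.0, 1.0])
Zv = Zop.apply(v)
Ztw = Zop.apply_transpose(w)
```

Every vector `Zop.apply(v)` lies in the null space of $(B \;\; S)$, so no storage beyond the factors of $B$ is needed to work in that space.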
Y =(BJ.
Q-'=(' B '),
S
A barrier method
In theory barrier methods for convex QP are a simple extension to barrier methods for linear
programming (LP). In practice the computational implications of having quadratic terms
significantly alter the algorithm. It is helpful to assume the original nonlinear problem is
in standard form, that is, the constraints are of the form
$$
c(x) = 0, \qquad x \ge 0.
$$
$$
\text{(QPSUB)} \qquad \min_{p \in \mathbb{R}^n} \; g^T p + \tfrac{1}{2} p^T H p \quad \text{subject to} \quad Ap = -c, \;\; p \ge -x.
$$
It was shown by Murray [31] that if the number of variables on their bounds at the solution is
between zero and $n - m$, then $\lim_{\mu \to 0} \kappa(\mu) = \infty$, where $\kappa(\mu)$ is the condition number of the KKT
matrix at the minimizer of the subproblem. We can now show that this ill-conditioning,
though still present, is benign under certain circumstances [37].
It is easy to see that the ill-conditioning is harmful if the problem is not in the proposed
format. Suppose that the problem includes some general inequality constraints $Ax \ge b$ as
well as bounds $x \ge 0$. The Hessian of the barrier function is then
If the solution of a system is insensitive to small perturbations in the matrix, it does not
follow that all solution methods will be satisfactory. However, a direct method using an LU
or $LBL^T$ factorization preserves the benign nature of the ill-conditioning.
Warm starts
We are not solving a single QP but a sequence of related QPs. For active-set methods
“warm starts” mainly means using the “old” active set as an initial trial active set. For
barrier methods warm starts are somewhat more complex. It may be that once the active
set settles down it is worth switching to an active-set method.
It has been our observation over many years that for large problems for which the first
derivatives may be computed it is often possible to compute second derivatives (see [7] and
[39]). In many practical problems (see Section 7) the Hessian matrix, like the Jacobian
matrix, is sparse. Quasi-Newton approximations to the Hessian of the Lagrangian function
are usually positive definite with a bounded condition number. In this way, a strictly convex
subproblem is obtained, and if a feasible point exists, a solution exists and is unique. In
contrast, the use of exact Hessians or quasi-Newton approximations that are not positive
semi-definite presents a number of technical difficulties, most of which stem from the
loss of control over the properties of $H_k$. Defining the Hessian of the QP subproblem as
the Hessian of the Lagrangian function leads in general to nonconvex subproblems. On
the other hand, there are numerous theoretical and practical benefits to be derived from the
explicit use of second derivatives. For example, it is possible to define an algorithm that
generates a sequence that converges to a second-order KKT point. Also, in practice it has
been observed when solving other classes of optimization problems that second-derivative
methods usually converge in far fewer iterations than alternative methods. In order to
reap all the benefits from the availability of second derivatives, it is necessary to define the
search direction other than as the minimizer of QP. The modifications necessary are similar
to those required to Newton’s method when solving unconstrained optimization.
In the approach adopted in [32] for quasi-Newton methods a search direction was defined
based not on a solution of the QP subproblem but on information at a constrained station-
ary point. Murray and Prieto show in [33] how this approach may be adapted to ensure
convergence of the iterates to a second-order KKT point when exact derivatives are used.
Clearly information at a stationary point alone will no longer suffice since an active-set
method when applied to a nonconvex QP cannot be assured of finding a stationary point.
Moreover, the procedure to construct the search direction from information at the station-
ary point (should one exist) is no longer assured of being a descent direction for the merit
function.
In order to prove convergence to a second-order KKT point it is also necessary to be able
to generate directions of negative curvature for the Hessian of the Lagrangian function. In
this case, a conventional linesearch may no longer be adequate, as the termination criteria for
this type of search depend on the value of the initial projected gradient, $\nabla F(x_k)^T p_k$, which
may be zero or arbitrarily small. In such circumstances the algorithm in [33] makes use of
a curvilinear search, based on the model introduced in McCormick [29] and developed by
Moré and Sorensen [30].
It is shown in [33] that the method of constructing a search direction given in [32] is
only unsatisfactory if a direction of negative curvature is encountered in the QP active-set
method or at the stationary point the reduced Hessian is not positive definite. In either case a
direction of negative curvature may be computed. It is shown that it is sufficient to compute
the direction of negative curvature only at the initial point for the QP. From the direction of
negative curvature and the step to the stationary point (if it exists) it can be shown how a
suitable search direction may be constructed.
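A direction of negative curvature that respects the active constraints can be illustrated with a small dense computation. The sketch below uses an eigenvalue decomposition of the reduced Hessian $Z^T H Z$, which is a stand-in chosen for clarity and not the procedure of [33], where such directions come out of the QP active-set method itself; the function name and tolerance are hypothetical.

```python
import numpy as np

def negative_curvature_direction(H, Z, tol=1e-10):
    """Return d = Z u with d^T H d < 0, or None if Z^T H Z is positive semidefinite."""
    eigvals, eigvecs = np.linalg.eigh(Z.T @ H @ Z)   # eigenvalues in ascending order
    if eigvals[0] >= -tol:
        return None
    return Z @ eigvecs[:, 0]

H = np.diag([1.0, -2.0, 3.0])                 # indefinite Hessian of the Lagrangian
A = np.array([[1.0, 0.0, 0.0]])               # one active constraint gradient
Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # basis for the null space of A
d = negative_curvature_direction(H, Z)
```

Here `d` satisfies $Ad = 0$ and $d^T H d < 0$; when the reduced Hessian is positive semidefinite the function returns `None` and the step to the stationary point alone is used.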
7. Industrial implementations
Several special SQP codes have been written and applied to specific large-scale applications.
Starting in the early 1980s several codes were developed at GE and used to solve optimal
power flow problems. Such problems may have as many as 60,000 variables and 45,000
nonlinear constraints. The original approach taken was to use MINOS ([34] and [35]) as
the QP solver. Later a special QP routine was written based on the Schur-complement
approach [25]. A special class of very large structured problems were solved by apply-
ing Benders decomposition to the QP subproblem. The master problem was solved by the
Schur-complement algorithm. The slave problems were LPs. Explicit second derivatives
were available. The user’s function and derivatives were relatively cheap to compute. The
Harwell code MA27 was used to factor the initial KKT matrix, but a special technique was
necessary to enable the ANALYSE phase to obtain a good ordering. The success of SQP
methods on OPF problems is both encouraging and surprising. In this class of problems the
problem functions and their derivatives are cheap to compute compared to solving the linear
systems in the QP subproblems. The reason for the success of the SQP approach was the
very small number of nonlinear iterations required.
A sparse SQP method has been developed at Boeing (see [1]) and used on trajectory
problems. These problems are not super large, but are sometimes hard to solve. The
algorithm is very similar to that developed at GE except the Hessian here is approximated by
finite differences. In both cases the Hessian is sparse. Boeing has developed their own sparse
linear equation solver. The user’s functions are very expensive to evaluate for trajectory
problems, which is just the type of problem one expects SQP methods to perform well on
relative to alternative approaches.
Several codes have been developed to solve problems in process control in chemical
engineering. These problems may be very large (100,000 variables), but are almost always
highly constrained (the reduced Hessian is rarely larger than 50). There are almost as many
equality constraints as there are variables. The QP subproblems are therefore highly suited
to being solved by a null-space algorithm. This has been done in a conventional manner
by DMC Corp. Since the number of equality constraints is nearly equal to the number of
variables, the solution of the QP subproblem may be determined by solving a dense QP in a
small number of variables (the dimension of the null space of the equality constraints) and a
large number of inequality constraints. An algorithm along these lines has been developed
by Shell. Such QP subproblems are usually most efficiently solved by solving the dual.
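The null-space idea described above can be sketched as follows. This is a minimal illustration, not any of the codes named in the text: it treats only the equality constraints of a QP subproblem (inequalities are ignored), uses dense NumPy linear algebra, and the function name `nullspace_qp` is hypothetical. The point is that the only system involving the Hessian has the dimension of the null space of the constraints, which is small when there are almost as many equality constraints as variables.

```python
import numpy as np

def nullspace_qp(H, g, A, b):
    """Minimize 0.5*x'Hx + g'x subject to Ax = b via a null-space method."""
    m, n = A.shape
    # Full QR of A^T: the first m columns of Qfull span range(A^T),
    # the remaining n - m columns form an orthonormal basis Z of null(A).
    Qfull, R = np.linalg.qr(A.T, mode="complete")
    Y, Z = Qfull[:, :m], Qfull[:, m:]
    # Particular solution x_p = Y w with A x_p = b, i.e. R1^T w = b.
    w = np.linalg.solve(R[:m, :].T, b)
    x_p = Y @ w
    # Reduced dense problem in the n - m null-space coordinates v.
    H_red = Z.T @ H @ Z                 # reduced Hessian
    g_red = Z.T @ (g + H @ x_p)         # reduced gradient at x_p
    v = np.linalg.solve(H_red, -g_red)
    return x_p + Z @ v
```

With 100,000 variables and a reduced Hessian of order 50, `H_red` would be a 50 by 50 dense matrix, although a practical code would of course build the null-space basis from a sparse factorization rather than a dense QR.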
An SQP method based on a barrier function approach has been developed by Power
Associates to solve OPF and related problems. Recall that in such problems the user’s
function and derivative are cheap to evaluate. Consequently, it is better to apply the barrier
to the original nonlinear problem, which leads to equality QP subproblems. The work to
perform the additional nonlinear iterations is more than offset by the savings in solving the
simpler QP subproblems.
Acknowledgments
It is a pleasure to acknowledge the contribution to this paper from the many years of
cooperative research with Philip Gill, Michael Saunders and Francisco Prieto. Any new
ideas presented here are taken from our current joint research program.
References
1. J.T. Betts and W.P. Huffman, “Path constrained trajectory optimization using sparse sequential quadratic
programming,” J. of Guidance, Control, and Dynamics, vol. 16, no. 1, pp. 59-68, 1993.
2. J.T. Betts and P.D. Frank, “A sparse nonlinear optimization algorithm,” J. Optim. Theory and Applics., vol. 82,
pp. 519-541, 1994.
3. L.T. Biegler, J. Nocedal, and C. Schmid, “A reduced Hessian method for large-scale constrained optimization,”
SIAM J. on Optimization, vol. 5, pp. 314-347, 1995.
4. M.C. Biggs, “On the convergence of some constrained minimization algorithms based on recursive quadratic
programming,” JIMA, vol. 21, pp. 67-81, 1978.
5. A. Buckley and A. LeNir, “QN-like variable storage conjugate gradients,” Mathematical Programming, vol.
27, pp. 155-175, 1983.
6. A. Buckley and A. LeNir, “BBVSCG-A variable storage algorithm for function minimization,” ACM
Transactions on Mathematical Software, vol. 11, pp. 103-119, 1985.
7. R.C. Burchett, H.H. Happ, and D.R. Vierath, “Quadratically convergent optimal power flow,” IEEE Transac-
tions on Power Apparatus and Systems, vol. PAS-103, pp. 3267-3275, 1984.
8. R.H. Byrd, P. Lu, and J. Nocedal, “A limited memory algorithm for bound constrained optimization,” Report
NAM 07, EECS Department, Northwestern University, 1993.
9. R.H. Byrd, J. Nocedal, and R.B. Schnabel, “Representations of quasi-Newton matrices and their use in limited
memory methods,” Mathematical Programming, vol. 63, pp. 129-156, 1994.
10. A.R. Conn, N.I.M. Gould, and Ph.L. Toint, “An introduction to the structure of large-scale nonlinear optimiza-
tion problems and the LANCELOT project,” in Computing Methods in Applied Sciences and Engineering,
R. Glowinski and A. Lichnewsky (Eds.), SIAM, Philadelphia, pp. 42-54, 1990.
11. A.R. Conn, N.I.M. Gould, and Ph.L. Toint, “An introduction to the standard data input format (SDIF) for
nonlinear mathematical programming problems,” Département de Mathématique, Facultés Universitaires de
Namur, Technical Report 91/8, 1991.
12. A.R. Conn, N.I.M. Gould, and Ph.L. Toint, “LANCELOT: A Fortran package for large-scale nonlinear
optimization (Release A),” Lecture Notes in Computational Mathematics 17, Springer Verlag: Berlin, Heidelberg,
New York, London, Paris and Tokyo, 1992.
13. A.R. Conn, N.I.M. Gould, and Ph.L. Toint, “Large-scale nonlinear constrained optimization: A current
survey,” Technical Report 94/1, Département de Mathématique, Facultés Universitaires de Namur, 1994.
14. G. Di Pillo, F. Facchinei, and L. Grippo, “An (RQP) algorithm using a differentiable exact penalty function
for inequality constrained problems,” Mathematical Programming, pp. 49-68, 1992.
15. S.K. Eldersveld, “Large-scale sequential quadratic programming algorithms,” Report SOL 92-4,
Department of Operations Research, Stanford University, Stanford, 1992.
16. M.C. Fenelon, “Preconditioned conjugate-gradient-type methods for large-scale unconstrained optimization,”
Ph.D. Thesis, Department of Operations Research, Stanford University, 1981.
17. R. Fletcher, “An optimal positive definite update for sparse Hessian matrices,” SIAM J. on Optimization, vol.
5, pp. 192-218, 1995.
18. J.Ch. Gilbert and C. Lemaréchal, “Some numerical experiments with variable-storage quasi-Newton
algorithms,” Mathematical Programming, pp. 407-435, 1989.
19. P.E. Gill and W. Murray, “Conjugate-gradient methods for large-scale nonlinear optimization,” Report
SOL 79-15, Department of Operations Research, Stanford University, Palo Alto, CA, 1979.
20. P.E. Gill, W. Murray, and M.A. Saunders, “Large-scale SQP methods and their application in trajectory
optimization,” Control Applications of Optimization, International Series of Numerical Mathematics, R.
Bulirsch and D. Kraft (Eds.), Birkhauser Basel, vol. 115, pp. 29-42, 1994.
21. P.E. Gill, W. Murray, and M.A. Saunders, “An SQP algorithm of large-scale optimization,” Report SOL 95-x,
Department of Operations Research, Stanford University (to appear).
22. P.E. Gill, W. Murray, M.A. Saunders, and M.H. Wright, “User’s guide for NPSOL (version 4.0): A for-
tran package for nonlinear programming,” Report SOL 86-2, Department of Operations Research, Stanford
University, 1986.
23. P.E. Gill, W. Murray, M.A. Saunders, and M.H. Wright, “Inertia-controlling methods for quadratic program-
ming,” SIAM Review, vol. 33, pp. 1-33, 1988.
24. P.E. Gill, W. Murray, M.A. Saunders, and M.H. Wright, “Constrained nonlinear programming,” in Optimiza-
tion, Handbooks in Operations Research and Management Science, G.L. Nemhauser and A.H.G. Rinnooy
Kan (Eds.), Elsevier, vol. 1, Ch. III, pp. 171-210, 1989.
25. P.E. Gill, W. Murray, M.A. Saunders, and M.H. Wright, “A Schur-complement method for sparse quadratic
programming,” in Reliable Numerical Computation, M.G. Cox and S. Hammarling (Eds.), Oxford University
Press, pp. 113-138, 1990.
26. P.E. Gill, W. Murray, M.A. Saunders, and M.H. Wright, “Some theoretical properties of an augmented
Lagrangian merit function,” in Advances in Optimization and Parallel Computing, P.M. Pardalos (Ed.), North-
Holland, pp. 101-128, 1992.
27. S.P. Han, “Superlinearly convergent variable metric algorithms for general nonlinear programming problems,”
Math. Prog., vol. 11, pp. 263-282, 1976.
28. M.W. Leonard, “Improved quasi-Newton methods for optimization,” Ph.D. Thesis, Department of Mathemat-
ics, University of California, San Diego, 1995.
29. G. McCormick, “A modification of Armijo’s step-size rule for negative curvature,” Mathematical
Programming, vol. 13, pp. 111-115, 1977.
30. J.J. Moré and D.C. Sorensen, “Newton’s method,” in Studies in Numerical Analysis, G.H. Golub (Ed.)
(Mathematical Association of America), pp. 29-82, 1984.
31. W. Murray, “An algorithm for constrained minimization,” in Optimization, R. Fletcher (Ed.), Academic Press:
London and New York, pp. 247-258, 1969.
32. W. Murray and F.J. Prieto, “A sequential quadratic programming algorithm using an incomplete solution of
the subproblem,” in SIAM J. on Optimization, vol. 5, pp. 589-639, 1995.
33. W. Murray and F.J. Prieto, “A second-derivative method for nonlinearly constrained optimization,” Technical
Report SOL 95-3, Department of Operations Research, Stanford University, Stanford, 1995.
34. B.A. Murtagh and M.A. Saunders, “A projected Lagrangian algorithm and its implementation for sparse
nonlinear constraints,” Mathematical Programming Study, vol. 16, pp. 84-117, 1982.
35. B.A. Murtagh and M.A. Saunders, MINOS 5.4 user’s guide, Report SOL 83-20R, Department of Operations
Research, Stanford University, 1993.
36. J. Nocedal, “Updating quasi-Newton matrices with limited storage,” Mathematics of Computation, vol. 35,
pp. 773-782, 1980.
37. D.B. Ponceleón, “Barrier methods for large-scale quadratic programming,” Report SOL 91-2, Stanford
University, Stanford, 1991.
38. M.J.D. Powell, “A fast algorithm for nonlinearly constrained optimization calculations,” in Numerical Anal-
ysis, Dundee 1977, Lecture Notes in Mathematics 630, G.A. Watson (Ed.), Springer-Verlag, pp. 144-157,
1978.
39. U.T. Ringertz, “A mathematical programming approach to structural optimization,” Report No. 88-24, Dept.
of Aeronautical Structures and Materials, The Royal Institute of Technology, Stockholm, 1988.
Computational Optimization and Applications 7, 143-158 (1997)
© 1997 Kluwer Academic Publishers. Manufactured in The Netherlands.
STAVROS A. ZENIOS
Department of Public and Business Administration, University of Cyprus, Kallipoleos 75, Nicosia, Cyprus
Abstract. We present a computationally efficient implementation of an interior point algorithm for solving large-
scale problems arising in stochastic linear programming and robust optimization. A matrix factorization procedure
is employed that exploits the structure of the constraint matrix, and it is implemented on parallel computers. The
implementation is perfectly scalable. Extensive computational results are reported for a library of standard test
problems from stochastic linear programming, and also for robust optimization formulations. The results show
that the codes are efficient and stable for problems with thousands of scenarios. Test problems with 130 thousand
scenarios, and a deterministic equivalent linear programming formulation with 2.6 million constraints and 18.2
million variables, are solved successfully.
1. Introduction
Stochastic linear programming (SLP), see e.g., Wets [20], and robust optimization (RO),
see Mulvey et al. [12], model problems with uncertain input data. References to the
wide range of applications are made by Wets and Mulvey et al., and in the textbook
of Kall and Wallace [8]. The mathematical programming formulations arising in SLP
and RO are usually of extremely large size, as they model constraints for a large number
of realizations (called scenarios) of the uncertain data. However, these programs
are extremely sparse and structured. Considerable research efforts have gone into the
development of efficient algorithms for solving these problems. With the advances in
parallel computer architectures, research has focused on the design of decomposition
algorithms and their implementation on parallel machines; see Dantzig [5]. A non-exhaustive
list of recent works in this direction includes (1) the parallelization of Benders
decomposition due to Dantzig et al. [6] and Nielsen and Zenios [16], (2) the development of
parallel decomposition algorithms based on diagonal quadratic approximations of the
augmented Lagrangian due to Mulvey and Ruszczyński [11] and Berger et al. [1], and (3) the
2. Problem formulation
$A_0$ is an $m_0 \times n_0$ constraints matrix for the first-stage variables, and $b_0 \in \mathbb{R}^{m_0}$ is the vector
of right-hand side coefficients for these constraints. This problem has $n = n_0 + \sum_{l=1}^{N} n_l$
variables and $m = m_0 + \sum_{l=1}^{N} m_l$ equality constraints. $A_0$ and $W_l$ are assumed to have
full row rank, with $m_l \le n_l$ for all $l = 0, 1, 2, \ldots, N$, and $n_0 \le \sum_{l=1}^{N} n_l$. (Full row rank
of $A_0$ is a reasonable assumption, but for many real-world problems the assumption of full
row rank of $W_l$ may not hold true, at least for some $l$. It is reasonable to assume that the
concatenated matrices $(T_l \mid W_l)$ have full row rank, but once $T_l$ is removed the remaining
matrix $(W_l)$ may not be of full row rank.)
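The dimensions quoted above can be checked on a small example. The sketch below assumes the usual two-stage (dual block-angular) layout, in which $A_0$ occupies the first block row and each scenario contributes a row block $(T_l \mid W_l)$. The full statement of the formulation is not reproduced in this excerpt, so that layout, and the helper name `assemble_two_stage`, are assumptions made for illustration.

```python
import numpy as np

def assemble_two_stage(A0, T, W):
    """Stack A0 and the scenario blocks (T_l | W_l) into one constraint matrix."""
    m0, n0 = A0.shape
    n_l = [Wl.shape[1] for Wl in W]
    m_l = [Wl.shape[0] for Wl in W]
    # m = m0 + sum(m_l) rows and n = n0 + sum(n_l) columns, as in the text.
    A = np.zeros((m0 + sum(m_l), n0 + sum(n_l)))
    A[:m0, :n0] = A0
    r, c = m0, n0
    for Tl, Wl in zip(T, W):
        ml, nl = Wl.shape
        A[r:r + ml, :n0] = Tl           # coupling to the first-stage variables
        A[r:r + ml, c:c + nl] = Wl      # recourse block of this scenario
        r += ml
        c += nl
    return A
```

Only the first $n_0$ columns couple the scenarios; everything else is block diagonal, which is the sparsity and structure the factorization procedure later exploits.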
The objective function of SLP minimizes the cost of the first-stage decision, $c_0^T x_0$, plus
the expected cost of the second-stage decisions, $\sum_{l=1}^{N} c_l^T y_l$. RO minimizes higher-order
moments of the objective function value, and this is one of its key distinguishing features
over SLP. For example, a variance term may be minimized. Alternatively, for maximization
problems, RO may use an expected utility maximization formulation [12].
Denoting by $c$ the concatenated vector $c^T = (c_0^T \mid c_1^T \mid \cdots \mid c_N^T)$, and, similarly, $x^T = (x_0^T
\mid y_1^T \mid \cdots \mid y_N^T)$, and letting $Q$ be a positive-definite matrix (e.g., a variance matrix) we can
formulate the RO problem as:
$$\min_{x} \quad c^T x + x^T Q x \qquad (1)$$
$$\text{s.t.} \quad Ax = b, \qquad (2)$$
$$x \ge 0. \qquad (3)$$
Stochastic linear programming formulations are cast in the formulation of (1)-(3) simply
by ignoring the quadratic term. (More general formulations of RO are given in [12].)
We develop a code for solving problem (1)-(3) based on the primal-dual path-following
algorithm of Monteiro and Adler [10] and Vanderbei and Carpenter [19]. It can be described
as follows:
146 YANG AND ZENIOS
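Algorithm 3.1 (The primal-dual path following algorithm for quadratic programs).
Initialization: Start with a triplet $(x^0, y^0, z^0)$ satisfying $x^0 > 0$, $z^0 > 0$, and any $\mu^0 > 0$.
$x$ are the primal variables in (1)-(3), $y \in \mathbb{R}^m$ are the dual variables for constraints (2),
and $z \in \mathbb{R}^n$ are the reduced cost variables for the bound constraints (3). Initialize the
iteration index $\nu \leftarrow 0$.
Iterative Step: Calculate the dual step $\Delta y$ by solving:
$$(A \Theta A^T)\, \Delta y = \psi, \qquad (4)$$
where
$$\rho = b - Ax,$$
$$\sigma = c + Qx - A^T y - z,$$
$$\phi = \mu \mathbf{1} - XZ\mathbf{1},$$
and $\mathbf{1} \in \mathbb{R}^n$ is the vector of all ones. Compute the primal step $\Delta x$ from eqn. (5), and the
step on the reduced cost variables from
$$\Delta z = X^{-1}(\phi - Z \Delta x). \qquad (6)$$
Update:
$$x^{\nu+1} = x^{\nu} + \alpha \Delta x,$$
$$y^{\nu+1} = y^{\nu} + \alpha \Delta y,$$
$$z^{\nu+1} = z^{\nu} + \alpha \Delta z,$$
where $0 < \alpha \le 1$ is the step length, chosen so that the primal and reduced cost variables
remain positive. It is computed as $\alpha = \delta \min\{\alpha_P, \alpha_D\}$, where $0 < \delta < 1$, and
$$\alpha_P = \max_{\alpha_j > 0} \{\alpha_j \mid x_j + \alpha_j \Delta x_j \ge 0\},$$
$$\alpha_D = \max_{\alpha_j > 0} \{\alpha_j \mid z_j + \alpha_j \Delta z_j \ge 0\}.$$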
The computationally expensive part of this algorithm is the solution of the system of
eqn. (4) to calculate the dual step $\Delta y$. In the next section we describe the numerical
procedures used for solving this system.
The procedure for solving eqn. (4) for the dual step $\Delta y$ of the quadratic programming
RO model is a straightforward extension of the procedure developed by Birge and Qi for
the solution of SLP. (We assume that $Q$ is a diagonal matrix. This assumption simplifies
the presentation of the procedure and allows us to exploit the problem's structure. RO
formulations with non-diagonal matrices can be reformulated by computing the factorization
of $Q = L^T L$, defining $\bar{x} = Lx$ and rewriting $x^T Q x = x^T L^T L x = \bar{x}^T \bar{x}$.) It is based on the
following result:
If $A_0$ and $W_l$, $l = 1, \ldots, N$, have full row rank then $M$ and $G_2 = -A_0 G_1^{-1} A_0^T$ are invertible,
and
$$M^{-1} = S^{-1} - S^{-1} U G^{-1} V^T S^{-1}. \qquad (11)$$
Proof. Follows along the same lines as the proof of Birge and Qi for linear programs,
under the assumption that $Q$ is invertible. □
It is easy to verify, using eqn. (11), that the solution of the linear system $(A \Theta A^T) \Delta y = \psi$
is given by $\Delta y = p - r$, where $p$ solves $Sp = \psi$, and $r$ is obtained from the system
$$Gq = V^T p, \quad \text{and} \quad Sr = Uq. \qquad (12)$$
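The three solves in (12) have the shape of a low-rank-update (Sherman-Morrison-Woodbury) solve, and can be sketched generically as below. The specific $S$, $U$, $V$ and $G$ of the Birge-Qi factorization are not reproduced in this excerpt, so the sketch assumes the simplest instance: $M = S + U V^T$ with $S$ block diagonal and $G = I + V^T S^{-1} U$. The helper names are hypothetical.

```python
import numpy as np

def make_block_solver(blocks):
    """Return a function applying S^{-1} for block-diagonal S = diag(blocks)."""
    sizes = np.cumsum([0] + [B.shape[0] for B in blocks])
    def S_solve(rhs):
        out = np.empty_like(rhs, dtype=float)
        for B, lo, hi in zip(blocks, sizes[:-1], sizes[1:]):
            out[lo:hi] = np.linalg.solve(B, rhs[lo:hi])   # independent per block
        return out
    return S_solve

def low_rank_solve(S_solve, U, V, psi):
    """Solve (S + U V^T) dy = psi using solves with S and a small dense G."""
    p = S_solve(psi)                               # S p = psi
    G = np.eye(U.shape[1]) + V.T @ S_solve(U)      # assumed G = I + V^T S^{-1} U
    q = np.linalg.solve(G, V.T @ p)                # G q = V^T p
    r = S_solve(U @ q)                             # S r = U q
    return p - r                                   # dy = p - r
```

Each block solve inside `S_solve` is independent of the others, which is what makes a one-scenario-per-processor distribution natural; only the small system with $G$ couples the blocks.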
Hence, we get
The implementation of the interior point code in ROBOPT uses the parallel implementation of
Procedure 3.1, developed by Jessup et al. [7], to compute the dual step. Computations of
the various constants in Algorithm 3.1, and the calculation of $\Delta x$, are also done in parallel
as explained below.
The parallel solution begins with the following data distribution. Processor $l$ holds the
data corresponding to the $l$th scenario for second stage decisions $c_l$, $W_l$, $\Theta_l$, $\rho_l$ and $\sigma_l$. Each
processor also holds a copy of the data for the first stage parameters $A_0$, $\Theta_0$, $\rho_0$, $\sigma_0$. With this
data distribution calculations that involve the scenario matrices and variables are computed
by multiple processors in parallel, with the $l$th processor performing the calculations for
the $l$th scenario. (We assume for simplicity that there are as many processors as there are
scenarios.) Calculations that involve the first stage variables and matrices are computed,
redundantly, on multiple processors. By doing so each processor has available, locally, the
information on the first-stage decisions and the need to communicate this information from
some “master” processor is avoided. We describe now the parallel implementation of all
steps of the algorithm. First, the right-hand side of eqn. (4) is computed by the following
procedure, called Formrhs.
A SCALABLE PARALLEL INTERIOR POINT ALGORITHM 149
Once the right-hand side of (4) has been computed we can solve the system using the
parallel matrix factorization procedure developed by Jessup et al. [7]. This procedure, called
Finddy, uses a global reduction function that takes as input vectors (or matrices) distributed
in every processor, sums them, and leaves a single vector (or matrix) sum at every node.
(For optimal implementations of this function on hypercube networks and fat trees, and
additional references, see [7].) It can be described as follows:
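The global reduction just described (every node contributes a vector, every node ends up with the sum) can be simulated serially as follows. This is an illustrative recursive-doubling scheme of the kind used on hypercubes, not the CMSSL routine used by the authors; the function name is hypothetical and the sketch assumes a power-of-two number of nodes.

```python
import numpy as np

def hypercube_allreduce(local):
    """Simulate a sum all-reduce over p = 2^d nodes by recursive doubling."""
    p = len(local)
    assert p > 0 and (p & (p - 1)) == 0, "sketch assumes a power-of-two node count"
    data = [np.array(v, dtype=float) for v in local]
    bit = 1
    while bit < p:
        # In each round every node exchanges its partial sum with the partner
        # whose rank differs in exactly one bit, and both add what they receive.
        data = [data[rank] + data[rank ^ bit] for rank in range(p)]
        bit <<= 1
    return data   # data[rank] is the full sum, held locally by every node
```

After $\log_2 p$ rounds every node holds the complete sum, so no separate broadcast from a master node is needed.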
Finally the parallel procedure Finddx below constructs the primal step direction vector
$\Delta x$ defined by Eq. (5).
The step on the reduced cost variables, $\Delta z$, is finally trivially computed using parallel
vector multiplications, since all matrices involved are diagonal, and the block corresponding
to the $l$th subvector of $z$ resides at the $l$th processor. The $l$th processor can now take a step
in the $x_l$, $y_l$, $z_l$ variables, as well as the $x_0$, $y_0$, $z_0$ variables, since all required quantities
are available locally.
4. Computational results
We now report results from the computational experiments carried out with ROBOPT on
a suite of large-scale test problems. The objective of our experimental design is to address
the questions raised in the introduction of this paper, namely, to establish that the developed
matrix factorization procedures are stable when used to solve large scale SLP and RO, that
the scalability of these procedures, when implemented on parallel machines, translates to
scalability of an interior point algorithm, and that very large scale problems can be solved
efficiently with ROBOPT. Comparisons with a state-of-the-art code, LOQO of Vanderbei
[18], illustrate that ROBOPT is competitive even for small to medium size problems on
serial computers.
The results with the parallel code were obtained on a Connection Machine CM-5e [9].
Serial computing experiments were carried out on an IBM RS6000/550 workstation. Both
codes, LOQO and ROBOPT, are written in C. ROBOPT uses the sparse, supernodal Cholesky
factorization and solver routines SUPFCT and SUPSLV of Ng and Peyton [14] to solve the
scenario systems, and LAPACK for the dense matrix systems of the first-stage problem.
The communications in the parallel code are implemented using standard routines
from the CMSSL library of the Connection Machine. The code is compiled with the gcc
compiler with the -O3 flag for maximum optimization.
We solved the five sets of problems (sc205, scagr7, scfxm1, scrs8 and scsd8) from the
library of SLP test problems described by Birge and Holmes [3]. We also experimented with
the SLP formulation of a telecommunications problem, sen, described in Sen et al. [17].
For each set we generated several problems, with increasing number of scenarios. Their
characteristics are described in Table 1. For two of the test sets, scsd8 and sen, we generated
problems with thousands of scenarios, as described in Table 2. To the best of our knowledge
these are the largest SLP problems solved to date.
RO models were generated by adding a randomly generated diagonal quadratic matrix
to the objective function of the SLP test problems. The condition number of this matrix is
user specified, and is reported with the computational results below. The algorithm
terminates when the primal and dual objective values agree to at least 8 decimal
places.
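One simple way to produce a diagonal matrix with a prescribed condition number, together with a stopping test of the kind described, is sketched below. The paper does not say how its random matrices were drawn or exactly how agreement of the objective values was measured, so the log-uniform sampling, the relative-gap test and both function names are assumptions made for illustration.

```python
import numpy as np

def random_diagonal_Q(n, cond, seed=0):
    """Diagonal of a positive-definite Q with condition number `cond` (n >= 2)."""
    rng = np.random.default_rng(seed)
    # Log-uniform entries in [1, cond); the two extremes are pinned so that
    # max(d) / min(d) equals the requested condition number exactly.
    d = np.exp(rng.uniform(0.0, np.log(cond), size=n))
    d[0], d[-1] = 1.0, float(cond)
    return d

def objectives_agree(primal, dual, digits=8):
    """True when the relative primal-dual gap is below 10^-digits."""
    return abs(primal - dual) <= 10.0 ** (-digits) * (1.0 + abs(primal))
```

Because $Q$ is diagonal, its condition number is simply the ratio of its largest to its smallest entry, which is why pinning the two extremes is enough.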
The Birge-Qi matrix factorization procedure is not the most efficient way for computing
the interior point steps for small-scale problems. In order to establish the penalty paid by
ROBOPT vis-à-vis a state-of-the-art serial code, LOQO, we solved several instances of
scsd8 and sen with increasing number of scenarios. Results are summarized in Figure 1.
For problems such as sen, which has a single first-stage constraint, ROBOPT is faster than
LOQO even for problems with very few scenarios. For problems where the first-stage
constraint matrix is large, compared to the second-stage matrices, ROBOPT does not gain
an advantage unless we solve problems with a large number of scenarios.
Figure 1. Comparing ROBOPT with LOQO for the scsd8 (figure on left) and sen (figure on right) test problems
with increasing number of scenarios.
To establish the suitability of the code for parallel computations we solve the scsd8 and sen
test problems on a Connection Machine CM-5, using up to 64 processors. Figure 2 shows
the relative speedup (i.e., the ratio of solution time with the serial implementation of ROBOPT
to the solution time with its parallel implementation) as the number of scenarios increases
and for different numbers of processors.
Figure 2. Relative speedup of parallel ROBOPT, implemented on the Connection Machine CM-5 with p = 4,
32 and 64 processors, for the solution of the scsd8 (figure on left) and sen (figure on right) test problems.
Superlinear speedup is achieved with the scsd8 problem. That is, the parallel code on p
processors solves the problem more than p times faster than the same code implemented
serially. This is due to the effect of cache memory, which can hold more than one block
(i.e., $W_l$ matrices) of the scsd8 problem. When only a single processor is available not
all blocks will fit in cache memory, and the solution time is affected by the transfer of
data in and out of the cache. In the parallel implementation each processor holds in cache
memory all the blocks operated upon by that processor, and the overhead of caching is
avoided. No caching effect is observed for the sen test problem, since each block of this
problem is large and even a single block cannot fit in cache memory. Hence, data need to be
transferred into and out of cache memory in both the serial and the parallel implementation.
The speedup achieved for the sen problem is solely due to the efficient exploitation of the
multiple processors by the parallel procedures Formrhs, Finddy and Finddx. The efficiency
of parallel ROBOPT for the solution of sen is 98-99%.
Similar efficiency has been observed for all test problems. The results obtained with the
parallel solution of RO problems are identical to those reported here for the solution of SLPs.
We also conducted experiments to establish the scalability of the parallel code. Scalability
is the ability of a parallel code to maintain a constant level of efficiency as the number of
processors increases, by solving problems whose sizes increase in proportion to the number
of processors. See, e.g., Censor and Zenios [4, Chapter 1]. This measure is important in
establishing whether a massively parallel machine can be used to solve extremely large
problems, or if the benefits from parallelism are restricted to machines with few processors.
We solved problems scsd8 and sen, with 4, 32 and 64 scenarios, using an equal number
of processors. Results are summarized in Table 3. We observe that the solution time, per
interior point iteration, remains virtually constant as we increase the number of processors
to match the number of scenarios. Parallel ROBOPT is perfectly scalable.
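The three measures used in this section (relative speedup, efficiency, and scalability judged by time per iteration) reduce to simple ratios, sketched below. The function names are hypothetical, and the sketch takes efficiency to be speedup divided by the number of processors, which is the usual definition but is not spelled out in the text. The figures in the usage note are invented for illustration and are not measurements from the paper.

```python
def relative_speedup(t_serial, t_parallel):
    """Serial solution time divided by parallel solution time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Speedup per processor; 1.0 means perfect use of p processors."""
    return relative_speedup(t_serial, t_parallel) / p

def scaled_efficiency(time_per_iter):
    """Scalability check for runs whose problem size grows with p.

    `time_per_iter` maps processor count -> time per iteration; the result
    compares each run with the smallest configuration (1.0 = perfectly scalable).
    """
    base = time_per_iter[min(time_per_iter)]
    return {p: base / t for p, t in time_per_iter.items()}
```

For instance, `scaled_efficiency({4: 2.0, 32: 2.0, 64: 2.5})` reports a drop to 0.8 at 64 processors for these made-up timings, whereas a perfectly scalable code would return 1.0 throughout.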
Table 3. Testing the scalability of the parallel implementation of ROBOPT. The solution time, per iteration,
remains virtually constant when the number of processors increases in proportion to the number of scenarios. All
times in CM seconds.
Problem   Processors   Iterations   Solution time   Time per itn.
To benchmark ROBOPT, and establish the joint effects of the matrix factorization procedures
and their parallel implementation, we solved the suite of SLP test problems on the
Connection Machine CM-5. For the smaller test problems we use as many processors as
the number of scenarios. Results are summarized in Table 4, where the parallel implementation
of ROBOPT is compared with LOQO executing on a single processor of the CM-5.
ROBOPT outperforms LOQO as the number of scenarios becomes large. The exact number
of scenarios for which it becomes preferable to use ROBOPT over LOQO depends on the
structure of the blocks of the test problem. Problems with small values of $n_0$, $m_0$ compared
to $n_l$, $m_l$ favor ROBOPT. This is the case with the sen test problem. Problems with large
values of $n_0$, $m_0$ compared to $n_l$, $m_l$ favor LOQO unless the number of blocks is large.
The relative performance of parallel ROBOPT over LOQO improves for larger problems.
Figure 3 shows the speedup of ROBOPT over LOQO on machines with up to 64 processors,
and for an increasing number of scenarios. The fact that the speedups exceed the number
of the available processors provides additional support to the claim of Section 4.2 that, even
when implemented serially, ROBOPT is more efficient than LOQO for large scale problems.
Figure 3. Speedup of parallel ROBOPT compared to LOQO for the solution of the scsd8 and sen test problems.
Table 4. Benchmark results with the parallel implementation of ROBOPT on the Connection Machine
CM-5, and comparisons with LOQO. LOQO executes on a single processor of the CM-5; ROBOPT uses as
many processors as there are scenarios. Solution times in seconds. NA: not available at the required level of
accuracy due to numerical errors.
In the testing of matrix factorization procedures conducted by Birge and Holmes [3] it
was demonstrated that the Birge-Qi procedure is more stable and accurate than alternative
methods based on problem reformulation or Schur complements. The experimental results
summarized in Table 5 illustrate that the procedure remains stable and accurate when
implemented in parallel and for RO problems as well. Stability and accuracy are maintained
even for problems with thousands of scenarios, and for ill-conditioned $Q$ matrices.
Table 5. Stability and accuracy of parallel ROBOPT when solving ill-conditioned robust optimization problems.
For each test problem we report the number of interior point iterations, and (in parentheses) the number of decimal
places of accuracy of the primal and dual objective values.
As a last experiment we use the parallel implementation of ROBOPT to solve the very large
scale problems described in Table 2. Results are summarized in Table 6. We observe that
the interior point algorithm is capable of solving problems with millions of variables and
constraints, to an accuracy of more than 8 decimal places. Solution times are, for most
problems, less than one hour of computer time. The folklore that interior point algorithms
take a small number of iterations holds true even with the multi-million variable problems
solved here.
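The slow growth of the iteration count can be read directly off the figures reported in Table 6. The short sketch below tabulates them, assuming that the two numeric columns of that table are the number of interior point iterations and the solution time in seconds, and that the suffix of each problem name is its number of scenarios; the helper name `summarize` is hypothetical.

```python
# (scenarios, iterations, solution time in seconds), transcribed from Table 6
table6 = {
    "scsd8": [(128, 10, 1.88), (256, 10, 2.39), (512, 11, 3.79),
              (1024, 12, 6.73), (2048, 14, 13.82), (131072, 19, 1066.1)],
    "sen": [(64, 21, 16.1), (128, 23, 30.8), (256, 31, 78.3),
            (512, 31, 153.5), (16384, 49, 7638.3)],
}

def summarize(rows):
    """Growth in scenarios and iterations between the smallest and largest run."""
    (s0, it0, _), (s1, it1, _) = rows[0], rows[-1]
    return {
        "scenario_growth": s1 / s0,
        "iteration_growth": it1 / it0,
        "sec_per_iter": [t / it for _, it, t in rows],
    }
```

For scsd8 the number of scenarios grows by a factor of 1024 between the smallest and largest instance while the iteration count less than doubles (10 to 19); for sen a 256-fold increase in scenarios takes the count from 21 to 49.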
Table 6. Solving very large-scale stochastic linear programs using ROBOPT. (Solution times in seconds on a
Connection Machine CM-5e with 64 processors.)
Problem         Iterations   Solution time
scsd8.128           10            1.88
scsd8.256           10            2.39
scsd8.512           11            3.79
scsd8.1024          12            6.73
scsd8.2048          14           13.82
scsd8.131072        19         1066.1
sen.64              21           16.1
sen.128             23           30.8
sen.256             31           78.3
sen.512             31          153.5
sen.16384           49         7638.3
5. Conclusions
We have discussed the efficient and stable parallel implementation of a primal-dual path-
following algorithm for structured linear and nonlinear programs arising when planning
under uncertainty. The developed code, ROBOPT, is competitive with state-of-the-art
optimization software when applied to small scale problems. It has superior performance
for large-scale problems, and it also parallelizes extremely well and is perfectly scalable.
Work is under way for the extension of the ideas of this paper to the solution of
multi-stage planning problems. Preliminary results are equally encouraging and will be
reported elsewhere.
References
1. A.J. Berger, J.M. Mulvey, and A. Ruszczyński, “An extension of the DQA algorithm to convex stochastic
programs,” SIAM Journal on Optimization, vol. 4, no. 4, pp. 735-753, 1994.
2. J.R. Birge and L. Qi, “Computing block-angular Karmarkar projections with applications to stochastic pro-
gramming,” Management Science, vol. 34, pp. 1472-1479, Dec. 1988.
3. J.R. Birge and D.F. Holmes, “Efficient solution of two-stage stochastic linear programs using interior point
methods,” Computational Optimization and Applications, vol. 1, pp. 245-276, 1992.
4. Y. Censor and S.A. Zenios, Parallel Optimization: Theory, Algorithms and Applications, Oxford University
Press: Oxford, England, 1997 (in print).
5. G.B. Dantzig, “Planning under uncertainty using parallel computing,” Annals of Operations Research, vol. 14,
pp. 1-16, 1988.
6. G.B. Dantzig, J.K. Ho, and G. Infanger, “Solving stochastic linear programs on a hypercube multicomputer,”
Technical report SOL 91-10, Operations Research Department, Stanford University, Stanford, CA, 1991.
7. E.R. Jessup, D. Yang, and S.A. Zenios, “Parallel factorization of structured matrices arising in stochastic
programming,” SIAM Journal on Optimization, vol. 4, no. 4, pp. 833-846, 1994.
8. P. Kall and S.W. Wallace, Stochastic Programming, John Wiley & Sons: New York, 1994.
9. C.E. Leiserson, Z.S. Abuhamdeh, D.C. Douglas, C.R. Feynman, M.N. Ganmukhi, J.V. Hill, W.D. Hillis, B.C.
Kuszmaul, M.A. St. Pierre, D.S. Wells, M.C. Wong, S.-W. Yang, and R. Zak, “The network architecture of
the Connection Machine CM-5,” Manuscript, Thinking Machines Corporation, Cambridge, Massachusetts
02142, 1992.
10. R.D.C. Monteiro and I. Adler, “Interior path-following primal-dual algorithms. Part II: Convex quadratic
programming,” Mathematical Programming, vol. 44, pp. 43-66, 1989.
11. J.M. Mulvey and A. Ruszczyński, “A new scenario decomposition method for large-scale stochastic
optimization,” Operations Research, vol. 43, pp. 477-490, 1994.
12. J.M. Mulvey, R.J. Vanderbei, and S.A. Zenios, “Robust optimization of large scale systems,” Operations
Research, vol. 43, pp. 264-281, 1995.
13. J.M. Mulvey and H. Vladimirou, “Evaluation of a parallel hedging algorithm for stochastic network program-
ming,” in Impacts of Recent Computer Advances on Operations Research, R. Sharda, B.L. Golden, E. Wasil,
O. Balci, and W. Stewart (Eds.), North-Holland, New York, USA, 1989.
14. E. Ng and B. Peyton, “A supernodal Cholesky factorization algorithm for shared-memory multiprocessors,”
SIAM Journal on Scientific and Statistical Computing, vol. 14, pp. 761-769, 1993.
15. S.S. Nielsen and S.A. Zenios, “A massively parallel algorithm for nonlinear stochastic network problems,”
Operations Research, vol. 41, no. 2, pp. 319-337, 1993.
16. S.S. Nielsen and S.A. Zenios, “Scalable parallel Benders decomposition for stochastic linear programming,”
Technical report, Management Science and Information Systems Department, University of Texas at Austin,
Austin, TX, 1994.
17. S. Sen, R.D. Doverspike, and S. Cosares, “Network planning with random demand,” Working paper, Systems
and Industrial Engineering Department, University of Arizona, Tucson, AZ, 1992.
18. R.J. Vanderbei, “LOQO user’s manual,” Technical report SOR 92-5, Department of Civil Engineering and
Operations Research, Princeton University, Princeton, NJ, 1992.
19. R.J. Vanderbei and T.J. Carpenter, “Symmetric indefinite systems for interior point methods,” Mathematical
Programming, vol. 58, pp. 1-32, 1993.
20. R.J.-B. Wets, “Stochastic programming,” in Handbooks in Operations Research and Management Science,
G.L. Nemhauser, A.H.G. Rinnooy Kan, and M.J. Todd (Eds.), vol. 1, pp. 573-629, North-Holland, Amsterdam,
1989.