A Fast LU Update For Linear Programming
Abstract
This paper discusses sparse matrix kernels of simplex-based linear programming
software. State-of-the-art implementations of the simplex method maintain an LU
factorization of the basis matrix which is updated at each iteration. The LU factorization
is used to solve two sparse sets of linear equations at each iteration. We present new
implementation techniques for a modified Forrest-Tomlin LU update which reduce the
time complexity of the update and the solution of the associated sparse linear systems.
We present numerical results on NETLIB and other real-life LP models.
1. Introduction
The total solution time is essentially determined by two factors:
- the number of iterations k, where k is determined by the pivot column and row selection rules, and the nature of the LP model;
- the time to factorize and update the LU factorization and to solve two sparse linear systems of equations at each pivot step.
This paper discusses only the design and implementation of the sparse matrix kernels, i.e. aspects of algorithms, data structures and computer programs to perform the iterations of the simplex method as fast as possible. Pivot selection rules, which are also very important for difficult problems, are an independent topic not considered here.
The software discussed here is part of MOPS (Mathematical Optimization
System). MOPS contains a simplex-based high-speed LP optimizer developed at the
Freie Universität Berlin [9]. The system is fully written in FORTRAN and has been
ported to a variety of platforms including PCs, workstations and mainframes. MOPS
is used in industrial applications and as a research prototype to test new ideas in
linear and mixed-integer programming.
2.

We consider LP problems in the bounded-variable standard form

minimize  c'x
subject to  Ax = b,
            l ≤ x ≤ u.
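For concreteness, this bounded-variable form can be stated and solved with an off-the-shelf routine. The following small Python sketch uses made-up data, with SciPy's solver standing in for a simplex code such as MOPS; it only illustrates the correspondence between c, A, b, l and u and a solver call.

```python
import numpy as np
from scipy.optimize import linprog

# minimize c'x  subject to  Ax = b,  l <= x <= u  (toy data)
c = np.array([1.0, 2.0, -1.0])
A = np.array([[1.0, 1.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([4.0, 3.0])
bounds = [(0.0, 3.0), (0.0, 2.0), (0.5, 2.5)]   # (l_j, u_j) per variable

res = linprog(c, A_eq=A, b_eq=b, bounds=bounds, method="highs")
print(res.x, res.fun)
```

The simplex method solves such problems by the following iteration: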
Step 0.
Step 1. (BTRAN) Compute the dual cost vector π: solve B'π = f, where f is a suitably chosen pricing form.
Step 2.
Step 3. (Price) If the test in step 2 failed, select a nonbasic variable q that is not dual feasible to enter the basis according to a pivot column selection rule.
Step 4.
Step 5.
Step 6.
The sparse matrix kernels in this algorithm are: initial and periodic LU factorizations, LU update, FTRAN and BTRAN. The time spent in these kernels is typically between 60-90% of the overall execution time (see also table 2).
Details of the LU factorization in MOPS are described elsewhere [11]. Since
the LU factorization is closely related to our LU update, we briefly summarize our
approach:
Each pivot chosen in the active submatrix satisfies a threshold pivoting criterion [4]. As a consequence, we compute LU = PBQ' + E, where E is a perturbation matrix controlled by the threshold pivoting tolerance.
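As an illustration, here is a minimal Python sketch of the threshold pivoting test applied to one column of the active submatrix; the tolerance tau = 0.1 is a typical value in the spirit of [4], not necessarily the MOPS default.

```python
import numpy as np

def threshold_candidates(col, tau=0.1):
    """Indices i with |col[i]| >= tau * max_j |col[j]|: the numerically
    acceptable pivot candidates, among which a sparsity-oriented
    choice (e.g. by Markowitz-style counts) can then be made."""
    amax = np.max(np.abs(col))
    return np.flatnonzero(np.abs(col) >= tau * amax)

print(threshold_candidates(np.array([0.9, -1.0, 1e-6])))  # -> [0 1]
```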
3.

3.1. NOTATION

Suppose the entering column a_q replaces the basic column in position p. The new basis matrix is

B̄ = B + (a_q - Be_p)e_p',    (3.1)

where p is the pivot row index and e_p the pth unit column.
If (3.1) is multiplied from the left with L^{-1}, we obtain

L^{-1}B̄ = U + (L^{-1}a_q - Ue_p)e_p',    (3.2)

where the permutations are left out for simplicity; it is sufficient to keep track of the row and column order. In other words, column p in U is replaced by the transformed incoming column g = L^{-1}a_q, which we call the spike:
L^{-1}B̄ = [matrix diagram: U with column p replaced by the spike g]    (3.3)
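Computing the spike amounts to one sparse triangular solve with L. A minimal SciPy sketch with toy data (MOPS implements this in FORTRAN on its own sparse files):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve_triangular

# unit lower triangular factor L and incoming column a_q (toy data)
L = csr_matrix(np.array([[1.0, 0.0, 0.0],
                         [0.5, 1.0, 0.0],
                         [0.0, 0.2, 1.0]]))
a_q = np.array([1.0, 0.0, 3.0])

g = spsolve_triangular(L, a_q, lower=True)   # the spike g = L^{-1} a_q
print(g)
```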
3.2. BARTELS-GOLUB LU UPDATE
Bartels and Golub proposed the first LU update [1]. Permuting the columns
so that g is last, and moving the other columns up one place, we obtain
L^{-1}B̄ = H,    (3.4)
i.e. we have an upper Hessenberg matrix H. The permutation matrix is again left
out for simplicity.
Now the Bartels-Golub method consists of a series of elementary transformations
which reduce the subdiagonal elements of H to zero. For each column of H, we
pivot either on the diagonal or on the subdiagonal element, depending on which of
them has a larger absolute value. This is to improve the stability of the representation.
If we pivot on a subdiagonal element, a row interchange occurs, and the permutation
vectors have to be updated.
A main advantage of the Bartels-Golub update is that a stability bound can
be computed a priori on each step. However, because row interchanges are possible
on each elimination step, the Bartels-Golub update cannot be implemented as
efficiently as other methods.
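For a dense upper Hessenberg matrix, the elimination with the stability rule above can be sketched as follows; this is a didactic Python version, not the sparse implementation.

```python
import numpy as np

def bartels_golub_reduce(H):
    """Reduce an upper Hessenberg matrix H to triangular form. In each
    column we pivot on the larger of the diagonal and subdiagonal
    element; a subdiagonal pivot means a row interchange, recorded in
    `swaps` (the multipliers would go into L)."""
    H = H.copy()
    swaps, etas = [], []
    for k in range(H.shape[0] - 1):
        if H[k + 1, k] == 0.0:
            continue                            # nothing to eliminate
        if abs(H[k + 1, k]) > abs(H[k, k]):     # subdiagonal is larger:
            H[[k, k + 1]] = H[[k + 1, k]]       # row interchange
            swaps.append(k)
        m = H[k + 1, k] / H[k, k]
        H[k + 1, k:] -= m * H[k, k:]            # zero the subdiagonal
        etas.append((k, m))
    return H, swaps, etas
```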
3.3. FORREST-TOMLIN LU UPDATE

In the Forrest-Tomlin update, the spiked column is moved to the last position and row p to the last row, leaving a row spike:

[matrix diagram (3.5): upper triangular except for a row spike in the last row]
Next, multiples of rows are added to the last row to eliminate the row spike. Mathematically, the algorithm is equivalent to reducing the upper Hessenberg matrix H to triangular form by eliminations between adjacent rows which always involve an interchange. This means that there is no fill-in in U, and all eliminations are stored in L. This was an essential advantage in 1972 when the algorithm was published, since U could be kept out-of-core.
In contrast to the Bartels-Golub update, the pivot selection ignores stability
aspects. Nevertheless, it works well in practice. A stability check can be performed
a posteriori at each iteration [10]. If it is not satisfied, the basis is refactorized.
The Forrest-Tomlin update can be implemented without permutation vectors,
since the incoming column is always packed to the end and a row interchange is made
for each row. Therefore, FTRAN and BTRAN can be performed in one work array.
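In the same dense Hessenberg picture, the Forrest-Tomlin reduction always interchanges the two adjacent rows before eliminating, so the pivot is always the old diagonal of U and the spike row bubbles down to the last position; a Python sketch under the same simplifications:

```python
import numpy as np

def forrest_tomlin_reduce(H):
    """Reduce an upper Hessenberg matrix H by adjacent-row eliminations
    that always involve an interchange. The rows of U are never
    modified (no fill in U); only the travelling spike row changes,
    and the multipliers are returned for storage in L."""
    H = H.copy()
    etas = []
    for k in range(H.shape[0] - 1):
        H[[k, k + 1]] = H[[k + 1, k]]       # unconditional interchange
        if H[k + 1, k] != 0.0:
            m = H[k + 1, k] / H[k, k]       # pivot: old diagonal of U
            H[k + 1, k:] -= m * H[k, k:]    # update the spike row only
            etas.append((k, m))
    return H, etas
```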
3.4. REID UPDATE

Reid's method starts from the same upper Hessenberg matrix:

[matrix diagram (3.6): H with a bump in rows and columns p through r,]
where the submatrix consisting of rows and columns from p to r forms a bump. The
Reid update first performs permutations within the bump to improve sparsity so that
fewer eliminations are needed. The update proceeds in three major steps:
(1) Search column singletons within the bump; move each of them to the upper left corner of the bump, thus reducing the bump size.

(2) Search row singletons within the bump; move each of them to the lower right corner of the bump, thus reducing the bump size.

(3) Eliminate the remaining bump, selecting pivots as described below.
It should be noted that the final bump (if it exists) has no column singletons.
It follows that if it is possible to permute H into an upper triangular form, this will
be done in step 1. Furthermore, it is not possible for subsequent eliminations to
create any column singletons. This means that every pivot will be in a column with a minimal number of nonzeros and, within this column, in a row with the least nonzeros, provided the stability criterion is satisfied. The Reid update therefore uses a "minimal row count within minimal column count" pivoting strategy.
The Reid update pays special attention to sparsity and stability considerations.
The main disadvantage is its complexity. In the worst case, it requires four passes through all bump rows, which may contain many thousands of elements. Elaborate
bookkeeping of the permutations is also required. Furthermore, eliminations may
be slow, since the pivot row is chosen only for one elimination step at a time. This
LU-update looked very promising. Therefore, we designed and implemented several
versions of it. Compared to the update described below, we found only slight
advantages at much higher update costs. Due to an additional permutation, FTRAN
and BTRAN are also slower.
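Steps (1) and (2) can be conveyed on the boolean nonzero pattern of the bump; the following Python fragment sketches the singleton peeling only and uses none of Reid's actual data structures.

```python
import numpy as np

def reduce_bump(P):
    """P: square boolean nonzero pattern of the bump. Repeatedly peel
    column singletons (pivots moved to the upper left) and row
    singletons (pivots moved to the lower right); returns the row and
    column sets of the final bump, which has no singletons left."""
    rows, cols = list(range(P.shape[0])), list(range(P.shape[1]))
    changed = True
    while changed:
        changed = False
        for j in list(cols):
            hits = [i for i in rows if P[i, j]]
            if len(hits) == 1:                   # column singleton
                rows.remove(hits[0]); cols.remove(j)
                changed = True
        for i in list(rows):
            hits = [j for j in cols if P[i, j]]
            if len(hits) == 1:                   # row singleton
                rows.remove(i); cols.remove(hits[0])
                changed = True
    return rows, cols
```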
4.

[Fig. 4.1: (a) the incoming column in position p]

After the row and column permutations of the LU factorization, the pivots lie on the diagonal. The pivot order is reflected by a permutation matrix R. During the
LU updates, we perform symmetric row and column permutations in such a way
that the pivots remain on the diagonal; it is sufficient to maintain only one permutation
R. Note that the Reid updating method needs two permutation vectors throughout,
since row and column permutations are performed independently.
We prefer using the original row indices in the L part, and having the permutations in the U part - thus the incoming spike L^{-1}a_q can be computed and inserted without permutations. On each basis change, a further permutation matrix is applied to the "spiked" U to restore its triangularity. At iteration k, we have
L(k)^{-1}B(k) = R(k)U(k)R(k)',    (4.2)

where B(k) is the basis matrix, R(k) the permutation matrix, and L(k) and U(k) lower and upper triangular matrices.
Assume that U is stored row- and columnwise in a sparse form, and L^{-1} columnwise in a sparse form. The algorithm proceeds in the following way:
ALGORITHM LU-UPDATE
Step 0. Suppose the spike L^{-1}a_q is computed and pivot position p selected.
Step 1. Column permutation: Determine position r (see fig. 4.1); place the spike in position r; move columns p+1, ..., r one position to the left (an upper Hessenberg matrix).
Step 2. Expand row p from its sparse representation into a full work array.
Step 3. Use rows p+1, ..., r to eliminate the elements on row p in columns p, ..., r-1; store the eliminations in the L part.
Step 4. Row permutation: Place row p in position r; move rows p+1, ..., r one position upwards.
Step 5. Store the modified row (initially row p, now row r) back to the sparse row- and columnwise representations.
[diagram (4.3): the updated U after the symmetric permutation, with positions p and r marked]
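A dense Python model of steps 1-4 may help to fix the indices. The sparse row/column files and the work array of steps 2 and 5 are omitted; r is taken here to be the position of the last nonzero of the spike (an assumption that keeps the spiked column triangular, consistent with fig. 4.1), and r >= p is assumed.

```python
import numpy as np

def lu_update_dense(U, g, p):
    """Dense model of ALGORITHM LU-UPDATE, steps 1-4: place the spike g
    in column p, permute it to position r, eliminate row p with rows
    p+1..r, and apply the symmetric row permutation. Returns the new
    upper triangular matrix and the multipliers for the L part."""
    n = U.shape[0]
    U = U.copy()
    U[:, p] = g                                  # spike enters column p
    r = int(np.flatnonzero(g).max())             # last nonzero of the spike
    perm = list(range(p)) + list(range(p + 1, r + 1)) + [p] + list(range(r + 1, n))
    H = U[:, perm]                               # step 1: column permutation
    etas = []
    for k in range(p, r):                        # step 3: eliminate row p
        if H[p, k] != 0.0:
            m = H[p, k] / H[k + 1, k]            # pivot: old diagonal of U
            H[p, k:] -= m * H[k + 1, k:]
            etas.append((k + 1, m))
    return H[perm, :], etas                      # step 4: row permutation
```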
The basis factorization L^{-1}B = RUR' is used at each iteration in the FTRAN and BTRAN operations. Mathematically, the following operations are needed:

FTRAN: compute g = L^{-1}a_q, (FTRANL)
       solve RUR'y = g for y; (FTRANU)

BTRAN: solve RU'R'w = f for w, (BTRANU)
       compute z = (L^{-1})'w. (BTRANL)
In phase 2, we solve B'z = e_p and update π' := π' - d_q z', where the reduced cost d_q and the pivot index p are determined at the previous iteration. The advantage is that e_p has just one nonzero. The row-wise representation of U is used for BTRAN.
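With R held as a permutation vector, FTRANU is a gather, a triangular solve and a scatter. A dense Python sketch; the convention R e_i = e_perm[i] is an assumption chosen for the illustration:

```python
import numpy as np
from scipy.linalg import solve_triangular

def ftranu(g, U, perm):
    """Solve R U R' y = g, with the permutation matrix R represented by
    the vector `perm` (assumed convention: R e_i = e_perm[i])."""
    gt = g[perm]                                # R'g: gather into pivot order
    w = solve_triangular(U, gt, lower=False)    # solve U w = R'g
    y = np.empty_like(w)
    y[perm] = w                                 # y = R w: scatter back
    return y
```

BTRANU follows the same gather-solve-scatter pattern with U' in place of U.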
5.
XMODLU: removes the leaving column and adds the entering column in U;
acts as a driver routine for the other tasks.
Since the factored matrix is stored in a sparse form, access and update operations are much more complex than if it were stored densely. Our central problem was how to
implement a fast scheme which simultaneously updates both the row- and columnwise
representation of U. A fast update could be realized by introducing two pointer
arrays: for each element in one representation (row- or columnwise), the pointer
gives its position in the other representation. We are well aware that this solution
needs more main memory than previously published schemes. However, this is
justified, since main memory has become cheap and there is enough memory even
on PCs to solve large-scale problems with MOPS. For the same reason, we always store indices and pointers as 32-bit integers.
Three arrays with double precision real numbers (64 bits) are reserved for the
numerical values of L and U: one for U stored row-wise, one for U stored columnwise,
and one for L. Row and column indices are stored in two integer arrays together
with both representations. There are integer arrays for row and column counts and
the starting positions of rows and columns of U.
Two pointer arrays are used to store for each row file element its index in
the column file, and vice versa. It would be possible to store the numerical elements
of U only once, say columnwise, and keep an index pattern of the row-wise U with
the pointer arrays. Using the pointers, one could have a row-wise access to U.
However, this is not optimal on machines with cache memory, since there is poor
data locality which results in slow access to the rows of U.
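The scheme might be pictured as follows; the field names are invented for this Python sketch, the per-row and per-column start pointers of the real files are omitted, and nothing here corresponds literally to the MOPS FORTRAN arrays.

```python
class SparseU:
    """U stored twice (row file and column file) with two pointer
    arrays cross-linking each element to its twin in the other file."""
    def __init__(self):
        self.row_val, self.row_colind, self.row_to_col = [], [], []
        self.col_val, self.col_rowind, self.col_to_row = [], [], []

    def add(self, i, j, v):
        # append the element to both files and cross-link the copies
        self.row_to_col.append(len(self.col_val))
        self.col_to_row.append(len(self.row_val))
        self.row_val.append(v); self.row_colind.append(j)
        self.col_val.append(v); self.col_rowind.append(i)

    def set_via_rowfile(self, k, v):
        # an update seen in the row file reaches the column file in O(1)
        self.row_val[k] = v
        self.col_val[self.row_to_col[k]] = v    # no searching needed
```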
By using the double pointer technique, we could eliminate much of the searching which initially slowed down the LU update. Virtually all of the remaining searching in the inner loop was eliminated by the following techniques:
Using double pointers to find an element in the column file when its position
in the row file is known, and vice versa.
The stack of row indices of the incoming vector is built in FTRAN. Initially, the row indices of the nonzeros in the incoming column a_q are placed on the stack. When a nonzero element of the transformed vector is computed in FTRAN, we check whether its position is already on the stack; if not, it is added to the stack. In XMODLU the stack is scanned, and the corresponding elements of the incoming vector are placed in U. A similar scheme is used in XELIMN and XRBACK.
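A Python sketch of the stack mechanism during FTRANL; the eta-file layout (a list of (pivot, column entries) pairs for L^{-1}) is assumed for the illustration and is not the actual MOPS format.

```python
def ftranl_with_stack(etas, aq, n):
    """Apply the column etas of L^{-1} to the incoming column on a dense
    work array x, while `stack` collects the indices that became
    nonzero; XMODLU later scans only the stack. `aq` is a dict
    {row index: value}, `etas` a list of (pivot k, [(row i, e_ik), ...])."""
    x = [0.0] * n
    stack, on_stack = [], [False] * n
    for i, v in aq.items():                   # scatter a_q, seed the stack
        x[i] = v; stack.append(i); on_stack[i] = True
    for k, col in etas:
        if x[k] == 0.0:
            continue                          # this eta has no effect
        for i, e in col:
            x[i] += e * x[k]
            if x[i] != 0.0 and not on_stack[i]:
                stack.append(i); on_stack[i] = True   # new nonzero
    return x, stack
```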
6. Numerical results
The numerical results are based on some of the largest and most difficult LP
problems from the NETLIB test set. In addition, we used some LP models from
other real-life applications.
The computing environment was an IBM PS/2 PC with an i80486 processor
(25 MHz) and 16 MB main memory. MOPS was compiled with the NDP FORTRAN
compiler V3.0 and linked with the PharLap linker V4.0. The operating system was
MS-DOS 5.0. In addition, in table 3 we present some numerical results with 25FV47 on some other machines. Among them is a Siemens Nixdorf mainframe H-120-F (BS2000), which is a fast scalar machine. The PS/2-486 is about 5% slower than some PCs with the same processor.
Table 1 shows the names of the (NETLIB) LP problems and their dimensions.
In addition, numerical results with MOPS V1.4 are shown using the standard strategies
(scaling, LP preprocessing, default tolerances, crash start). Under the heading
"phase I" are the number of LP iterations to become feasible. Under the heading
"total" are the total number of LP iterations (phase 1 and phase 2). An LP iteration
is either a pivot or a bound switch. The CP times in seconds on the PC are reported
in the last column in table 1. This includes optimizing the unscaled problem with
the presumably optimal basis from the scaled optimization, and various safety
checks once a problem has been solved. They do not include the time for converting
the mps data.
Table 2 shows the CP time in seconds spent in the various computational kernels for solving the LP problems. The times for the LU update are in all cases lower than the time spent in selecting the pivot row! Also, the time spent in the LU factorization is typically very small, except on problems with excessive fill-in such as pilots and problem P4.
Table 1
LP test models and MOPS run times.
(All times are in seconds on a PS/2 80486 (25 MHz) with MS-DOS.)

NETLIB LP models:

Problem      cons.    vars.  nonzeros  phase 1    total  Time (sec)
scsd8          397     2750      8584      333      973        21
25fv47         821     1571     10400      720     3033       124
degen3        1503     1818     24646     2010     4070       362
stocfor3     16675    15695     64875     3805     9772      2764
80bau3b       2262     9799     21002     1984     7879       379
greenbeb      2392     5405     30877     1217     4113       323
fit2d           25    10500    129018     1533    12584      1081
fit2p         3000    13525     50284     6531    14029      2073
truss         1000     8806     27836     1993     9528       511
d2q06c        2171     5167     32417      988     9506      1829
pilots        1441     3652     43167     3161     5709      1579
pilot87       2030     4883     73152     3502     7556      5102

Other real-life LP models:

Problem      cons.    vars.  nonzeros  phase 1    total  Time (sec)
P1            4481    10958     29840        0     1955       144
P2            3525     9625     69323     2864    15771      2366
P3            5563     6181     39597     1279     3229       423
P4            2356    11004    119360    13592    33733      7413
Table 2
CP time (sec) for computational kernels.
(All times are in seconds on a PS/2 80486 (25 MHz) with MS-DOS.)

Problem     FTRAN  BTRAN  Update  FACTOR  PRICE  CHUZR
scsd8           5      4       1       1      7      2
25fv47         43     31       9       7     17     16
degen3         89    149      22      24     42     35
80bau3b        76     72      19      11    167     32
greenbeb       86    101      22      21     51     41
fit2p         839    136      85      50    577    384
pilots        469    581      59     287     93     87
P1             37     33      12       5     42     14
P2            689    701      94     167    481    232
P3            133    122      31      26     41     68
P4           1725   2051     233     533   2511    356
Table 3
CP times (sec) to solve 25FV47 on various machines.

Computing environment                      Time (sec)
Siemens H-120-F (BS2000, FORTRAN-77)           13
HP-730 (HP-UX, f77)                            16
[machine entry illegible in source]            21
i80486, 25 MHz (MS-DOS, NDP FORTRAN)          124
Table 4
Comparison of MOPS V1.4 to OSL R2 on some NETLIB models.
(All times are in seconds on a PS/2 80486 (25 MHz) with MS-DOS.)

                                 OSL              MOPS
Model           m        n    iters    time    iters    time
25fv47         821     1571    2436     184     3033     124
truss         1000     8806   13028    1377     9528     511
degen3        1503     1818    3864     479     4070     362
80bau3b       2262     9799    6435     848     7879     379
greenbeb      2392     5405    3964     562     4113     323
fit2d           25    10500    9832    1171    12584    1081
fit2p         3000    13525   13459    4335    14029    2073
pilots        1441     3652    5104    1711     5709    1579
d2q06c        2171     5167    8558    1627     9506    1829
stocfor3     16675    15695   14889    8024     9772    2764
pilot87       2030     4883    7858    4885     7556    5102

Grand total                           25203            16127
The reason is that the other kernels have to be executed at each iteration, whereas on nearly all problems we perform 100 iterations on average before refactorizing.
Different LP solvers produce different pivot sequences and reinvert at different points. It is, therefore, difficult to make a direct comparison between updating methods. A comparison with another high-speed LP optimizer in the same computing environment and with the same development tools is probably the best one can do.
Conclusions
Judging speed and numerical stability, it seems that the sparse matrix kernels in MOPS for maintaining a sparse LU factorization are very competitive with the best other known simplex codes on scalar machines. Compared to the LU update and FTRAN of Reid, we are much faster without any apparent disadvantage in numerical stability. Since these kernels are in the innermost loop of the simplex method, we do not believe that a significantly more expensive update can be competitive. One can always compute a fresh factorization if there are too many nonzeros or if there are numerical instabilities. Even with the most unstable LP problems, only a few factorizations in MOPS are triggered on numerical grounds. It is our belief that further performance improvements can only be achieved by better pivot selection to reduce the number of iterations.
References
[1] R. Bartels and G. Golub, The simplex method of linear programming using LU decomposition,
Commun. ACM 12(1969)266-268.
[2] M. Benichou, J.N. Gauthier, G. Hentges and G. Ribière, The efficient solution of large scale linear programming problems. Some algorithmic techniques and computational results, Math. Progr. 13(1977)280-322.
[3] R.K. Brayton, F.G. Gustavson and R.A. Willoughby, Some results on sparse matrices, Math. Comp. 24(1970)937-954.
[4] I.S. Duff, A.M. Erisman and J.K. Reid, Direct Methods for Sparse Matrices (Oxford University
Press, Oxford, 1986).
[5] J. Forrest and J. Tomlin, Updating the triangular factors of the basis to maintain sparsity in the product form simplex method, Math. Progr. 2(1972)263-278.
[6] IBM, Introducing the Optimization Subroutine Library Release 2, Publication No. GC23-0517-03.
[7] R. Fourer, Solving staircase linear programming problems by the simplex method, 1: Inversion,
Math. Progr. 23(1982)274-313.
[8] J.K. Reid, A sparsity exploiting variant of the Bartels-Golub decomposition for linear programming
bases, Math. Progr. 24(1982)55-69.
[9] U. Suhl, MOPS - Mathematical OPtimization System, Institut für Wirtschaftsinformatik, FU-Berlin (1992), to appear in Eur. J. Oper. Res., Software Tools for Mathematical Programming.
[10] J.A. Tomlin, An accuracy test for updating triangular factors, Math. Progr. Study 4(1975)142-145.
[11] U. Suhl and L. Suhl, Computing sparse LU-factorizations for large-scale linear programming bases, ORSA J. Comput. 2(1990)325-335.