A Sparse Matrix Approach To Reverse Mode Automatic Differentiation in Matlab

Shaun A. Forth
Applied Mathematics and Scientific Computing, Department of Engineering Systems and Management, Cranfield University, Shrivenham, Swindon, SN6 8LA, UK

Naveen Kr. Sharma
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur 721302, West Bengal, India

Naveen Kr. Sharma gratefully acknowledges the support of a Summer Internship from the Department of Engineering Systems and Management, Cranfield University.
Bischof et al. [5] investigated forward mode Matlab AD by source transformation combined with storing and combining directional derivatives via an overloaded library. This hybrid approach gave a significant performance improvement over ADMAT. Source transformation permits the use of compile-time performance optimisations [6, 7]: forward substitution was found particularly effective. Kharche and Forth [8] investigated specialising their source transformation inlining of functions of MAD's fmad and derivvec classes in the cases of scalar and inactive variables. This was particularly beneficial for small problem sizes, for which overloading's run time requirements are dominated by the large relative cost of function call overheads and branching (required for code relevant to scalar and inactive variables) compared to arithmetic operations.
Our thesis is that automated program generation techniques, and specifically AD, should take advantage of the most efficient features of a target language; for example, MAD's derivvec class's exploitation [3] of the optimised sparse matrix features of Matlab [9]. As we recap in Sec. 2, it is well known that forward and reverse mode AD may be interpreted in terms of forward and back substitution on the sparse, so-called, extended Jacobian system [1, Chap. 9]. In this article we investigate whether we might use Matlab's sparse matrix features to effect reverse mode AD without recourse to the usual tape-based mechanisms [1, Chap. 6.1]. Our implementation, including optimised variants, is described in Sec. 3, and the performance testing of Sec. 4 demonstrates our approach's potential benefits. Conclusions and further work are presented in Sec. 5.
2 Matrix Interpretation of Automatic Differentiation
Following Griewank and Walther [1], we consider a function $f : \mathbb{R}^n \to \mathbb{R}^m$ of the form,
$$y = f(x), \qquad (1)$$
in which $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$. The evaluation of $f(x)$ is assumed to comprise a sequence of assignments to internal variables $v_i$,
$$v_i = x_i, \qquad i = 1, \ldots, n, \qquad (2)$$
$$v_i = \varphi_i\big((v_j)_{j \prec i}\big), \qquad i = n+1, \ldots, n+p, \qquad (3)$$
$$y_{i-n-p} = v_i = \varphi_i\big((v_j)_{j \prec i}\big), \qquad i = n+p+1, \ldots, n+p+m. \qquad (4)$$
In (2) we copy the independent variables $x$ to internal variables $v_1, \ldots, v_n$. In (3) each $v_i$, $i = n+1, \ldots, n+p$ is obtained as a result of $p$ successive elementary operations or elementary functions $\varphi_i$ (e.g., additions, multiplications, square roots, cosines, etc.), acting on a small number of already calculated $v_j$, $j \prec i$. In (4) the dependent variables $y_i = v_{n+p+i}$, $i = 1, \ldots, m$, are assigned to complete the function evaluation.
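As a brief illustration (ours, not taken from [1]), for $f(x) = x_1 \sin(x_2)$, with $n = 2$, $p = 2$ and $m = 1$, one possible trace is,
$$
\begin{aligned}
v_1 &= x_1, & v_2 &= x_2,\\
v_3 &= \sin(v_2), & v_4 &= v_1 v_3,\\
y_1 &= v_5 = v_4.
\end{aligned}
$$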
We define the gradient operator $\nabla = \left(\partial/\partial x_1, \ldots, \partial/\partial x_n\right)^T$, differentiate (2)-(4) and arrange the resulting equations for the $\nabla v_i$ as a linear system,
$$
\begin{pmatrix} I & 0 & 0 \\ -B & I-L & 0 \\ -R & -T & I \end{pmatrix}
\begin{pmatrix} \nabla V_{1,\ldots,n} \\ \nabla V_{n+1,\ldots,n+p} \\ \nabla V_{n+p+1,\ldots,n+p+m} \end{pmatrix}
=
\begin{pmatrix} I \\ 0 \\ 0 \end{pmatrix}. \qquad (5)
$$
In (5): $I$ denotes an identity matrix of appropriate dimension and
$$\nabla V_{1,\ldots,n} = \begin{pmatrix} \nabla^T v_1 \\ \vdots \\ \nabla^T v_n \end{pmatrix},$$
with $\nabla V_{n+1,\ldots,n+p}$ and $\nabla V_{n+p+1,\ldots,n+p+m}$ defined similarly; from (3) the $p \times n$ matrix $B$ and $p \times p$ strictly lower triangular matrix $L$ are both sparse and contain partial derivatives of elementary operations and assignments; from (4) the $m \times n$ matrix $R$ and $m \times p$ matrix $T$ are such that $[R\ T]$ contains exactly one unit entry per row. The $(n+p+m) \times (n+p+m)$ coefficient matrix in (5) is known as the extended Jacobian, denoted $C \in \mathbb{R}^{(n+p+m)\times(n+p+m)}$, and has sub-diagonal entries $c_{ij}$.
By forward substitution on (5) we see that the system Jacobian $J = \nabla y = \nabla V_{n+p+1,\ldots,n+p+m}$ is given by,
$$J = R + T(I-L)^{-1}B, \qquad (6)$$
the Schur complement of $I-L$ in the coefficient matrix of (5). Two variants for calculating $J$ follow:

1. Forward variant: the intermediate gradients $\nabla V_{n+1,\ldots,n+p}$ and Jacobian $J$ are determined by,
$$(I-L)\,\nabla V_{n+1,\ldots,n+p} = B, \qquad (7)$$
$$J = R + T\,\nabla V_{n+1,\ldots,n+p}, \qquad (8)$$
i.e., forward substitution on the lower triangular system (7) followed by a matrix multiplication and addition (8).

2. Reverse variant: the $p \times m$ adjoint matrix $\bar{V}$ and Jacobian $J$ are determined by,
$$(I-L)^T\,\bar{V} = T^T, \qquad (9)$$
$$J = R + \bar{V}^T B, \qquad (10)$$
i.e., back-substitution on the upper triangular system (9) followed by matrix multiplication and addition (10).
The arithmetic cost of both variants is dominated by the solution of the linear systems (7) and (9), which share a common (though transposed in (9)) sparse coefficient matrix. These systems have $n$ and $m$ right-hand sides respectively, giving the rule of thumb that the forward variant is likely to be preferred for $n < m$ and the reverse for $m < n$ (see [1, p. 189] for a counter example). Since the matrices $B$, $R$ and $T$ in (7) to (10) are also sparse, further reductions in arithmetic cost might be obtained by storing and manipulating the $\nabla V_{n+1,\ldots,n+p}$ or $\bar{V}$ as sparse matrices [1, Chap. 7].
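To make the two variants concrete, the following minimal Matlab sketch (our own illustration; the function name and the triplet arrays ii, jj, cc are hypothetical) recovers J from stored extended Jacobian entries:

  % Minimal sketch: recover J from extended Jacobian entries, where ii, jj
  % index each entry and cc holds the elementary partial derivatives
  % dphi_i/dv_j, for a trace with n independents, p intermediates, m outputs.
  function J = extjac_jacobian(ii, jj, cc, n, p, m)
    N = n + p + m;
    P = sparse(ii, jj, cc, N, N);    % strictly lower triangular partials
    B = P(n+1:n+p, 1:n);   L = P(n+1:n+p, n+1:n+p);
    R = P(n+p+1:N, 1:n);   T = P(n+p+1:N, n+1:n+p);
    if n <= m                        % forward variant, eqs. (7)-(8)
      J = R + T * ((speye(p) - L) \ B);
    else                             % reverse variant, eqs. (9)-(10)
      J = R + ((speye(p) - L)' \ T')' * B;
    end
  end

Matlab's backslash operator detects the triangular structure of the sparse systems and applies forward or back substitution accordingly.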
Other approaches to reducing arithmetic cost are based on performing row operations to reduce the number of entries in (5) or, indeed, to entirely eliminate some rows [1, Chaps. 9-10]. Such approaches have been generalised by Naumann [10] and may be very efficient if performed at compilation time by a source transformation AD tool [11]. In Sec. 3.3 we will adapt Bischof's hoisting technique [12] to reduce the size and number of entries of the extended Jacobian, reducing both memory and runtime costs. Unlike Bischof's, ours is a runtime approach more akin to Christianson et al.'s [13] dirty vectors.
3 Three Implementations
In Secs. 3.2-3.4 we describe our three closely related overloaded classes designed to generate the extended Jacobian's entries as the user's function is executed. First, however, we introduce the MADExtJacStore class, objects of which are used by all three overloaded classes to store the extended Jacobian entries.
3.1 Extended Jacobian Storage
A MADExtJacStore object has components to store: the number of independent variables, the number of rows of the extended Jacobian for which entries have been determined, the number of entries determined, and a three-column matrix to store the row index $i$, column index $j$ and coefficient $c_{ij}$ of each entry.
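As an illustration, a minimal sketch of such a store follows; only the class name MADExtJacStore comes from the text, while the property and method names are our own assumptions and the actual implementation's interface may differ:

  % Illustrative sketch of an extended Jacobian entry store.
  classdef MADExtJacStore < handle
    properties
      n = 0                  % number of independent variables
      nrows = 0              % extended Jacobian rows determined so far
      nentries = 0           % number of entries determined so far
      entries = zeros(0, 3)  % one [i j c_ij] triple per entry
    end
    methods
      function obj = MADExtJacStore(n)
        obj.n = n;
        obj.nrows = n;       % rows 1..n form the identity block of (5)
      end
      function rows = newRows(obj, k)
        rows = obj.nrows + (1:k)';   % reserve k new rows
        obj.nrows = obj.nrows + k;
      end
      function addEntries(obj, i, j, c)
        obj.entries = [obj.entries; i(:) j(:) c(:)]; % no preallocation here
        obj.nentries = obj.nentries + numel(c);
      end
    end
  end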
$$
C \;=\;
\begin{pmatrix}
1 & & & & & & & & \\
& 1 & & & & & & & \\
& & 1 & & & & & & \\
& -2 & & 1 & & & & & \\
& & -2 & & 1 & & & & \\
-2 & & & -.5 & & 1 & & & \\
5 & & & & -.5 & & 1 & & \\
& & & & & -1 & & 1 & \\
& & & & & & -1 & & 1
\end{pmatrix}
\;\to\;
\begin{pmatrix}
1 & & & & & & & & \\
& 1 & & & & & & & \\
& & 1 & & & & & & \\
& -2 & & 1 & & & & & \\
& & -2 & & 1 & & & & \\
-2 & -1 & & & & 1 & & & \\
5 & & -1 & & & & 1 & & \\
& & & & & -1 & & 1 & \\
& & & & & & -1 & & 1
\end{pmatrix}
\qquad (15)
$$
In the extended Jacobian, rows 4 and 5 may now be removed from blocks $B$ and $L$, and columns 4 and 5 may be removed from blocks $L$ and $T$. This also eliminates two rows from the intermediate matrices $\nabla V_{n+1,\ldots,n+p}$ or $\bar{V}$, leaving the extended Jacobian as,
$$
C_H =
\begin{pmatrix}
1 & & & & & & \\
& 1 & & & & & \\
& & 1 & & & & \\
-2 & -1 & & 1 & & & \\
5 & & -1 & & 1 & & \\
& & & -1 & & 1 & \\
& & & & -1 & & 1
\end{pmatrix}. \qquad (16)
$$
We may now extract $B$, $L$, $R$ and $T$ from (16) and calculate the Jacobian $J$. Hoisting is an example of a safe pre-elimination which never increases the number of arithmetic operations [1, p. 212] but can drastically reduce both these and memory costs. In Matlab, hoisting may be applied to element-wise operations or functions with a single array argument (e.g., -x, sin(x), sqrt(x)) and element-wise binary operations or functions with one inactive argument (e.g., 2 + x, A .* x with A inactive). Hoisting is not applicable to matrix operations or functions (e.g., linear solve X \ Y or determinant det(X)).
We effect hoisting by a run-time mechanism, distinguishing our work from that of Bischof et al. [12]. We use a technique similar to that of Christianson et al. [13], who used it to reduce forward or back substitution costs on the full extended Jacobian: we instead reduce the size of the extended Jacobian.
Our hoisted class ExtJacMAD_H has one additional property compared to class ExtJacMAD: an accumulated extended Jacobian entry array Cij. When we initialise our ExtJacMAD_H object,
x = ExtJacMAD_H([0.5; 1; -2.5])
we assign x.Cij = 1, indicating that the derivatives of the elements of x are a multiple of one times those associated with x.row_index, i.e., rows 1 to 3. Step 1 of the overloaded operations of Sec. 3.2 is as before but with the additional copy tmp1.Cij = x.Cij. Step 2 differs more substantially: no additions are made to the extended Jacobian; we merely copy tmp2.row_index = tmp1.row_index and set tmp2.Cij = 2*tmp1.Cij. Step 3 is similar to the revised step 1. Step 4 is modified to account for the objects' accumulated extended Jacobian entries, so when dealing with the entries associated with tmp2 we have c = tmp2.Cij .* tmp3.value = 1 (expanded to [1 1]), and when dealing with tmp3 we have c = tmp3.Cij .* tmp2.value = 1 .* [2 -5] = [2 -5]. The assembled extended Jacobian then directly takes the form (16), and we see that the effects of hoisting have been mimicked at runtime. Note that the Cij component is maintained as a scalar whenever possible. Array values of Cij would be created if our example function's coding were y = sin(x(2:3)) .* x(1), as then tmp2.Cij = cos(tmp1) = [0.5403 -0.8011], though this would not prevent hoisting.
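For illustration, a minimal sketch (our own, not the package's actual source) of a hoisted overloaded elementary function for the ExtJacMAD_H class follows; as described above, no extended Jacobian rows are created:

  % In a file @ExtJacMAD_H/sin.m: hoisted element-wise unary function.
  function z = sin(x)
    z = x;                           % keep x.row_index unchanged
    z.value = sin(x.value);          % dispatches to the built-in sin
    z.Cij = cos(x.value) .* x.Cij;   % chain rule folded into Cij
  end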
3.4 Using Matlab's New Object Oriented Programming Style
Matlab release R2008a introduced substantially new object oriented programming styles and capabilities compared to those used by both our implementations of Secs. 3.2 and 3.3 and previous Matlab AD packages [2, 3]. Instead of the old style's definition of all an object's methods within separate source files in a common folder, in the new style all properties and methods are defined in a single source file.
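A minimal sketch of the new single-file style follows; the class name, properties and method body are illustrative assumptions based on the description above, not the package's actual source:

  % New (R2008a+) style: properties and all methods in one classdef file.
  classdef ExtJacMAD_NewStyle
    properties
      value      % function value
      row_index  % extended Jacobian rows of this object's elements
    end
    methods
      function obj = ExtJacMAD_NewStyle(val, rows)
        obj.value = val;
        obj.row_index = rows;
      end
      function z = uminus(x)
        % In the old style this overloading would live in its own file
        % @ExtJacMAD/uminus.m; here it sits beside the properties.
        z = ExtJacMAD_NewStyle(-x.value, x.row_index);
        % (recording of the extended Jacobian entries c_ij = -1 omitted)
      end
    end
  end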
4 Performance Testing
All tests involved calculating the gradient of a function from the MINPACK-2 test problem collection [14], with the original Fortran functions recoded into Matlab by replacing loops with array operations and array functions. We performed all tests for a range of problem sizes n; for each n, five different sets of independent variables x were used. For each derivative technique and each problem size the set of five derivative calculations was repeated sufficiently often that the cumulative cpu time exceeded 5 seconds. If a single set of five calculations exceeded 5 seconds cpu time then that technique was not used for larger n.
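This timing protocol may be sketched as follows, where grad_technique and the matrix X holding the five sets of independent variables are hypothetical stand-ins for each tested technique and data set:

  nsets = 5; total = 0; reps = 0;
  while total < 5                      % accumulate at least 5s of cpu time
    t0 = cputime;
    for k = 1:nsets
      g = grad_technique(X(:,k));      % one gradient evaluation per set
    end
    total = total + (cputime - t0);
    reps = reps + 1;
  end
  time_per_grad = total / (reps*nsets);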
All tests were performed using Matlab R2009b on a Windows XP SP3 PC with 2GB RAM. We first consider detailed results from one problem.
4.1 MINPACK-2 Optimal Design of Composites (ODC) test problem
Table 1 presents the ratio of gradient cpu time to function cpu time for the Optimal Design of Composites (ODC) problem for varying numbers of independent variables n, using the extended Jacobian techniques of Sec. 3. We also give run time ratios for a hand-coded adjoint, one-sided finite differences (FD) and sparse forward mode AD using MAD's fmad class. For these, and all other results presented, AD generated derivatives agreed with the hand-coded technique to within round-off; errors for FD were in line with the expected truncation error. Within our tables a dash indicates that memory or cpu time limits were exceeded.
Table 1: Gradient evaluation cpu time ratio cpu(∇f)/cpu(f) for the MINPACK-2 Optimal Design of Composites (ODC) test problem.
cpu(∇f)/cpu(f) for problem size n
Grad. Tech 25 100 2500 10000 40000
hand-coded 1.8 1.9 2.0 2.0 1.7
FD 26.2 102.8 2684.0 11147.2 -
sparse forward AD 66.0 55.8 134.7 - -
Extended Jacobian: Sec. 3.2
forward full 58.6 79.9 - - -
forward sparse 61.3 56.9 62.0 64.7 53.7
reverse full 57.9 51.4 42.1 42.0 35.4
reverse sparse 57.2 57.4 51.2 55.1 48.0
Extended Jacobian + Hoisting: Sec. 3.3
forward full 46.3 51.4 - - -
forward sparse 48.3 42.7 34.0 33.9 28.6
reverse full 47.0 41.0 24.3 23.1 19.9
reverse sparse 44.9 39.9 26.3 26.4 23.3
Extended Jacobian + Hoisting + New Object Orientation: Sec. 3.4
forward full 99.4 95.2 - - -
forward sparse 99.4 83.1 38.4 35.3 28.6
reverse full 100.8 88.6 31.0 26.4 21.0
reverse sparse 98.0 82.1 33.7 29.0 23.8
The hand-coded results show what might be achievable for a source transformation AD tool in Matlab. The FD cpu time ratio is in line with theory (approximately n + 1), but FD exceeded our maximum permitted run time for large n. Sparse forward mode AD outperformed FD with increasing n but exceeded our PC's 2 GB RAM for larger n.
The extended Jacobian approaches of Sec. 3.2, particularly the reverse variant with full storage, are seen to be competitive with, or to outperform, sparse forward AD, with the exception of the forward variant with full storage. Since m < n we expect the reverse variants to outperform the forward ones, and as m = 1 there is no point employing sparse storage. Employing the hoisting technique of Sec. 3.3 was always beneficial and for larger problem sizes halved the run time. This is because hoisting reduces the number of entries in the extended Jacobian by approximately 55% for all problem sizes tested.
Employing Matlab's new object oriented features, described in Sec. 3.4, had a strongly detrimental effect, doubling required cpu times compared to using the old features for small to moderate n. The two sets of run times converge for large n because the underlying Matlab built-in functions and operations are identical for both. Matlab's run time overheads must be significantly higher for the new approach and so dominate for small n.
4.2 MINPACK-2 mesh-based minimisation test problems
Table 2 presents selected results for the remaining mesh-based MINPACK-2 minimisation problems. We present only the most efficient, reverse full, variants of the extended Jacobian approach. Performance of the FD, hand-coded and new object oriented extended Jacobian approaches was in line with that for the ODC problem of Sec. 4.1.
From Table 2, only for the GL1 problem did sparse forward mode AD outperform the extended Jacobian approach. For all other problems the extended Jacobian approach with hoisting gave
Table 2: Gradient evaluation cpu time ratio cpu(∇f)/cpu(f) for the MINPACK-2 mesh-based minimisation test problems. Ext. Jac. indicates use of the reverse full variant of the extended Jacobian approach.
cpu(∇f)/cpu(f) for problem size n
Problem Grad. Tech 25 100 2500 10000 40000 160000
EPT sparse fwd. AD 102.1 99.6 341.3 - - -
Ext. Jac. 106.4 108.6 120.2 120.8 111.8 60.6
Ext. Jac. + Hoisting 93.5 98.3 91.2 91.9 86.0 46.1
GL1 sparse fwd. AD 138.0 93.1 13.4 7.5 6.6 7.3
Ext. Jac. 160.6 104.7 34.3 26.9 27.5 25.6
Ext. Jac. + Hoisting 115.5 85.6 23.2 18.0 17.9 16.6
MSA sparse fwd. AD 85.1 69.2 158.7 - - -
Ext. Jac. 82.4 71.3 54.0 60.3 66.0 -
Ext. Jac. + Hoisting 68.8 57.7 31.5 33.9 39.6 27.9
PJB sparse fwd. AD 74.7 79.4 - - - -
Ext. Jac. 61.5 69.7 131.8 71.6 68.5 -
Ext. Jac. + Hoisting 57.4 60.8 99.8 51.8 48.3 -
cpu(∇f)/cpu(f) for problem size n
Problem Grad. Tech 100 400 10000 40000 160000 640000
GL2 sparse fwd. AD 86.9 86.7 142.8 292.8 - -
Ext. Jac. 76.8 78.6 88.1 66.2 82.8 -
Ext. Jac. + Hoisting 58.5 59.1 53.3 39.4 50.1 -
equivalent, or substantially faster, performance and used less memory, allowing larger problem sizes to be addressed. Hoisting reduced the number of extended Jacobian entries by between 34% (EPT problem) and 49% (MSA problem), leading to significant performance benefits. In all cases, except perhaps the GL2 problem, the extended Jacobian with hoisting approach gave run time ratios decreasing with n for large n, in line with the cheap gradient principle [1, p. 88].
5 Conclusions and further work
Our extended Jacobian approach of Sec. 3 allowed us to use operator overloading to build a function's extended Jacobian before employing Matlab's sparse matrix operations to calculate the Jacobian itself. Bischof's hoisting technique [12] was adapted for run time use to reduce the size of the extended Jacobian. The performance testing of Sec. 4 shows that the reverse variant of our approach, with full storage of adjoints and hoisting, was substantially more efficient and able to cope with larger problem sizes than MAD's forward mode [3] for five of the six gradient test problems from the MINPACK-2 collection [14]. Performance is an order of magnitude worse than for a hand-coded adjoint due to the additional costs of overloaded function calls and of branching between multiple control flow paths in the overloaded functions. Additionally, the tailoring of the hand-coded adjoint's back-substitution to the sparsity of a particular source function's extended Jacobian will likely outperform our use of a general sparse solve.
Future work will address Jacobian problems. Equations (11)-(14) may be employed directly for this, together with compression [1, Chap. 8]. Compression requires Jacobian-matrix products, $JS = RS + T(I-L)^{-1}(BS)$, or matrix-Jacobian products, $WJ = WR + (WT)(I-L)^{-1}B$, where $S$ and $W$ are so-called seed matrices. Hessians can be computed by using the fmad class to differentiate the extended Jacobian computations: a forward-over-reverse strategy [1, Chap. 5].
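As a minimal sketch (our own code, not the paper's) of these compressed products, given the sparse blocks B, L, R and T of Sec. 2 and seed matrices S and W:

  JS = R*S + T*((speye(p) - L) \ (B*S));   % J*S, cf. (7)-(8)
  WJ = W*R + ((W*T) / (speye(p) - L)) * B; % W*J, cf. (9)-(10)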
The extended Jacobian approach is not suited to all problems. For example, the product of two $N \times N$ matrices would create some $2N^3$ extended Jacobian entries. Verma [2] noted that by working at the matrix, and not the element, level just the $2N^2$ elements of the two matrices are needed to enable reverse mode. An under-development tape-based AD implementation will shortly be compared with this article's extended Jacobian approach.
References
[1] A. Griewank, A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd Edition, SIAM, Philadelphia, PA, 2008.
[2] A. Verma, ADMAT: Automatic differentiation in MATLAB using object oriented methods, in: M. E. Henderson, C. R. Anderson, S. L. Lyons (Eds.), Object Oriented Methods for Interoperable Scientific and Engineering Computing: Proceedings of the 1998 SIAM Workshop, SIAM, Philadelphia, 1999, pp. 174-183.
[3] S. A. Forth, An efficient overloaded implementation of forward mode automatic differentiation in MATLAB, ACM T. Math. Software 32 (2) (2006) 195-222. doi:10.1145/1141885.1141888.
[4] Cayuga Research Associates, LLC, ADMAT: Automatic Differentiation Toolbox for use with MATLAB. Version 2.0 (2008).
URL http://www.math.uwaterloo.ca/CandO Dept/securedDownloadsWhitelist/Manual.pdf
[5] C. H. Bischof, H. M. Bücker, B. Lang, A. Rasch, A. Vehreschild, Combining source transformation and operator overloading techniques to compute derivatives for MATLAB programs, in: Proceedings of the Second IEEE International Workshop on Source Code Analysis and Manipulation (SCAM 2002), IEEE Computer Society, Los Alamitos, CA, USA, 2002, pp. 65-72. doi:10.1109/SCAM.2002.1134106.
[6] C. H. Bischof, H. M. Bücker, A. Vehreschild, A macro language for derivative definition in ADiMat, in: H. M. Bücker, G. Corliss, P. Hovland, U. Naumann, B. Norris (Eds.), Automatic Differentiation: Applications, Theory, and Implementations, Lecture Notes in Computational Science and Engineering, Springer, 2005, pp. 181-188. doi:10.1007/3-540-28438-9_16.
[7] H. M. Bücker, M. Petera, A. Vehreschild, Code optimization techniques in source transformations for interpreted languages, in: C. H. Bischof, H. M. Bücker, P. D. Hovland, U. Naumann, J. Utke (Eds.), Advances in Automatic Differentiation, Springer, 2008, pp. 223-233. doi:10.1007/978-3-540-68942-3_20.
[8] R. V. Kharche, S. A. Forth, Source transformation for MATLAB automatic differentiation, in: V. N. Alexandrov, G. D. van Albada, P. M. A. Sloot, J. Dongarra (Eds.), Computational Science - ICCS 2006, Vol. 3994 of Lect. Notes Comput. Sc., Springer, Heidelberg, 2006, pp. 558-565. doi:10.1007/11758549_77.
[9] J. Gilbert, C. Moler, R. Schreiber, Sparse matrices in Matlab - design and implementation, SIAM J. Matrix Anal. Appl. 13 (1) (1992) 333-356.
[10] U. Naumann, Optimal accumulation of Jacobian matrices by elimination methods on the dual computational graph, Math. Program., Ser. A 99 (3) (2004) 399-421. doi:10.1007/s10107-003-0456-9.
[11] S. A. Forth, M. Tadjouddine, J. D. Pryce, J. K. Reid, Jacobian code generated by source transformation and vertex elimination can be as efficient as hand-coding, ACM T. Math. Software 30 (3) (2004) 266-299. doi:10.1145/1024074.1024076.
[12] C. H. Bischof, Issues in parallel automatic differentiation, in: A. Griewank, G. F. Corliss (Eds.), Automatic Differentiation of Algorithms: Theory, Implementation, and Application, SIAM, Philadelphia, PA, 1991, pp. 100-113.
[13] B. Christianson, L. C. W. Dixon, S. Brown, Sharing storage using dirty vectors, in: M. Berz, C. Bischof, G. Corliss, A. Griewank (Eds.), Computational Differentiation: Techniques, Applications, and Tools, SIAM, Philadelphia, PA, 1996, pp. 107-115.
[14] B. M. Averick, R. G. Carter, J. J. Moré, G.-L. Xue, The MINPACK-2 test problem collection, Preprint MCS-P153-0692, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL (1992).