Architecture of Advanced Numerical Analysis Systems: Designing a Scientific Computing System using OCaml

Liang Wang
Jianxin Zhao
Table of Contents

Acknowledgments
Chapter 1: Introduction
  1.1 Numerical Computing in OCaml
  1.2 Architecture
    Basic Computing and Analytics with Owl
    Advanced Design in Owl
    Hardware and Deployment
    Research on Owl
  1.3 Summary
Bibliography
Index
About the Authors
Liang Wang is the Chief AI Architect at Nokia, the Chief
Scientific Officer at iKVA, a Senior Researcher at the
University of Cambridge, and an Intel Software Innovator.
He has a broad research interest in artificial intelligence,
machine learning, operating systems, computer networks,
optimization theory, and graph theory.
Acknowledgments
Developing a full-featured numerical-analysis system is very complicated. Writing a
book to dive deep into its architecture is an even more challenging task. It requires not only skills, enthusiasm, and persistence, but also strong support from families, colleagues, and communities. For years, we have received so much help from so many
individuals and organizations that it is almost impossible to make an exhaustive list.
Nonetheless, we would particularly like to emphasize that Owl is developed on top of
an enormous amount of previous work. Without the continuous efforts of these projects
and the intellectual contributions of these people over the years, it would be impossible
for us to create this system and deliver this book.
We give our most hearty thanks to those who contribute to the Owl project. Marcello
Seri and Ta-Chu Kao developed owl-ode, an Ordinary Differential Equation solver
library based on Owl. Pierre Vandenhove worked on the memory optimization of the
computation graph module during his internship at the OCaml Labs in Cambridge.
Tudor Tiplea participated in developing the base library in Owl so that it could run on
various backends such as browsers. Ben Catterall’s thesis work on the PSP provided a
theoretical foundation for the Actor system.
We would like to express our sincerest gratitude and appreciation to the OCaml
Software Foundation1 and Ahrefs2 for fully sponsoring this open access book, as well as for their long-term support of the Owl project.

1. http://ocaml-sf.org/
2. https://ahrefs.com/
CHAPTER 1
Introduction
This book introduces Owl, a numerical library we have been developing and
maintaining for years. We develop Owl for scientific and engineering computing in the
OCaml language. It focuses on providing a comprehensive set of high-level numerical
functions so that developers can quickly build up data analytical applications. Over
years of intensive development and continuous optimization, Owl has evolved into
a powerful software system with competitive performance compared to mainstream
numerical libraries. Meanwhile, Owl’s overall architecture remains simple and elegant.
Its small codebase can be easily managed by a small group of developers.
In this book, we are going to introduce the design and architecture of Owl, from
its designers’ perspective. The target audience is anyone who is interested in not only
how to use mathematical functions in numerical applications but also how they are
designed, organized, and implemented from scratch. Some prerequisites are needed
though. We assume the readers are familiar with basic syntax of the OCaml language.
We recommend [38] as a good reference book on this matter. Also note that this book
focuses on introducing core parts of the Owl codebase, such as the implementation
and design of various key modules. If you are more interested in how to use the
functionalities provided in Owl to solve numerical problems, such as basic mathematical
calculation, linear algebra, statistics, signal processing, etc., please refer to our book
OCaml Scientific Computing: Functional Programming in Data Science and Artificial
Intelligence [26].
1.1 Numerical Computing in OCaml

The most recent hot topic in scientific computing is machine learning. Thanks to the recent advances in machine learning and deep neural networks, there is a huge demand for numerical tools and libraries that allow both academic researchers and industrial developers to quickly prototype and test new ideas, and then to develop and deploy analytical applications at a large scale.
Take deep neural networks as an example: Google invests heavily in TensorFlow, while Facebook promotes its PyTorch. Beyond these libraries focusing on one specific numerical task, interest in general-purpose tools like Python and Julia also grows fast. Python has been a popular choice among developers for fast prototyping of analytical applications. One important reason is that the SciPy and NumPy libraries, tightly integrated with other advanced functionality such as plotting, offer a powerful environment which lets developers write very concise code to finish complicated numerical tasks. As a result, most frameworks provide Python bindings to take advantage of the existing numerical infrastructure in NumPy and SciPy.
On the other hand, back before Owl was developed, the support of basic scientific
computing in OCaml was rather fragmented. There had been some initial efforts (e.g.,
Lacaml, Oml, Pareto, etc.), but their APIs were either too low level to offer satisfying
productivity or the designs overly focused on a specific problem domain. Moreover,
inconsistent data representation and excessive use of abstract types made it difficult to
exchange data across different libraries. Consequently, developers often had to write
a significant amount of boilerplate code just to finish rather trivial numerical tasks.
There was a severe lack of a general-purpose numerical library in the OCaml ecosystem.
However, we believe OCaml is a good candidate for developing such a general-purpose
numerical library for two important reasons:
1.2 Architecture
Designing and developing a full-fledged numerical library is a nontrivial task, even though OCaml has been widely used in systems programming, for example in MirageOS. The key
difference between the two is fundamental and interesting: system libraries provide
a lean set of APIs to abstract complex and heterogeneous physical hardware, while
numerical libraries offer a fat set of functions over a small set of abstract number types.
When the Owl project started in 2016, we were immediately confronted by a series
of fundamental questions like: “what should be the basic data types”, “what should
be the core data structures”, “what modules should be designed”, etc. In the following
development and performance optimization, we also tackled many research and
engineering challenges on a wide range of different topics such as software engineering,
language design, system and network programming, etc. As a result, Owl is a rather
complex library, arguably one of the most complicated numerical software systems developed in OCaml. It contains about 269K lines of OCaml code, 142K lines of C code,
and more than 6500 functions. We have strived for a modular design to make sure that
the system is flexible and extendable.
We briefly present the architecture of Owl in Figure 1-1. It contains two subsystems. The subsystem on the left is Owl's numerical subsystem. The modules
contained in this subsystem fall into four categories:
Figure 1-1. The whole system can be divided into two subsystems. The subsystem
on the left deals with numerical computation, while the one on the right handles
the related tasks in a distributed and parallel computing context including
synchronization, scheduling, etc
In the rest of this chapter, we give a brief introduction about these various
components in Owl and set a road map for this book.
Owl supports a wide variety of classic numerical analytics methods, including basic
mathematical functions, statistics, linear algebra, ordinary differential equation, and
signal processing. The functions in each field are included in a corresponding module.
Their design is similar to that of the Ndarray module, mainly interfacing to existing tools written in C, such as OpenBLAS. The Ndarray module partly relies on these functions, especially the ones that operate on scalars. For example, Ndarray provides a sin function. What it actually does is call the scalar version of the sin function in the Maths module and map it over all the elements. Since this book mainly focuses on the architectural
design of Owl, we will not introduce in detail how to apply Owl in these fields; instead,
we will briefly give some examples in Appendix A.
Finally, neural networks, as complex as their architectures can be, are in essence
also an extension of regression and therefore are also trained by iterative optimization.
We cover this topic in Chapter 5. With the popularity of machine learning and neural
networks, this series of advanced analytics, especially the core technique AD, has
become increasingly essential in modern numerical analysis library stacks. In this book,
we will briefly introduce these topics and the design of their architecture in Owl.
Research on Owl
In the last part of this book, we introduce two components in Owl: Zoo, for service
composition and deployment, and Actor, for distributed computing. The focus of these
two chapters is to present two pieces of research based on Owl.
In Chapter 9, we introduce the Zoo subsystem. It was originally developed to share
OCaml scripts. It is known that we can use OCaml as a scripting language, just as we use Python (at a certain performance cost, because the code is compiled into bytecode). Even though compiling into native code for production use is recommended, scripting is
still useful and convenient, especially for light deployment and fast prototyping. In
fact, the performance penalty in most Owl scripts is almost unnoticeable because the
heaviest numerical computation part is still offloaded to Owl which runs native code.
While designing Owl, our goal is always to make the whole ecosystem open, flexible,
and extensible. Programmers can make their own “small” scripts and share them with
others conveniently, so they do not have to wait for such functions to be implemented
in Owl’s master branch or submit something “heavy” to OPAM. Based on these
basic functionalities, we extend the Zoo system to address the computation service
composition and deployment issues.
Next, we discuss the topic of distributed computing. The design of distributed
and parallel computing module essentially differentiates Owl from other mainstream
numerical libraries. For most libraries, the capability of distributed and parallel
computing is often implemented as a third-party library, and the users have to deal with
low-level message passing interfaces. However, Owl achieves such capability through
its Actor subsystem. Distributed computing includes techniques that combine multiple
machines through a network, sharing data and coordinating progress. With the fast-growing volume of data and the processing power it requires, distributed computing has been playing a significant role in modern smart applications. It is extremely prevalent in areas such as jointly providing large computing power, fault-tolerant databases, file systems, web services, and the management of large-scale networks.
In Chapter 10, we give a brief bird’s-eye view of this topic. Specifically, we introduce
an OCaml-based distributed computing engine, Actor. It has implemented three
mainstream programming paradigms: parameter server, map-reduce, and peer-to-peer.
Orthogonal to these paradigms, Actor also implements all four types of synchronization
1.3 Summary
In this chapter, we introduced the theme of this book, which centers on the design of
Owl, a numerical library we have been developing. We briefly discussed why we build
Owl based on the OCaml language. And then we set a road map for the whole book,
which can be categorized into four parts: basic building block, advanced analytics,
execution in various environments, and research topics based on Owl. We hope you will
enjoy the journey ahead!
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 2
Core Optimizations
Perhaps one of the most important questions in a numerical library or software is, “how
to make it run faster?" You can never have a piece of scientific computing code that runs too fast or takes too little memory. That is surely also the primary concern when we are designing Owl. In this chapter, we discuss the optimization of the Ndarray module, the core module that underlies almost all computation in Owl. We first introduce this module and how it interfaces to C code. Next, we briefly introduce the basic principles of optimizing numerical code, including some common techniques. Then we use code examples from Owl to demonstrate how these techniques are applied. We finish this chapter with a brief look at the topic of automatic performance tuning in numerical libraries.
Here, the GADT type 'b is an element kind, such as single-precision (32 bits)
floating-point numbers, 8-bit integers, etc. Type 'a represents the type of OCaml values
that can be written into Bigarray or read back from it. The Bigarrays can contain various
types, and in Ndarray we mainly support four:
Namely, Owl mainly supports single-precision float (S), double precision (D), single-
precision complex (C), and double-precision complex (Z). Supporting complex data
types is essential to applications such as signal processing using Fourier transform.
Besides the ('a, 'b) part, the definition also includes a c_layout parameter.
Bigarray supports two different memory layouts. In Owl, we stick with the C-style data
layout, which means that indices start at 0 and data is stored in row-major order. Initially, Owl aimed to support both layouts, but it soon turned out that doing so would just open a can of worms without much benefit.
As an example, if we need to create an ndarray of shape 2x3, elements of which are all 0s of type single-precision float, we can use

Dense.Ndarray.S.zeros [|2;3|]

In this naming of modules, Dense indicates the data is densely stored instead of using a sparse structure. Owl also supports Sparse ndarray types, and their API is quite similar to that of the dense type. It also contains the four different kinds of data types as stated earlier. Indeed, if you call Sparse.Ndarray.S.zeros [|2;3|], you get a sparsely stored zero ndarray. There are two popular formats for storing sparse matrices, the
Compressed Sparse Column (CSC) format and the Compressed Sparse Row (CSR)
format. Owl uses the Compressed Sparse Row format. Compared to the dense ndarray,
the definition of sparse ndarray contains some extra information:
In this chapter, we focus on the optimization of dense ndarray structures and will not
further discuss details in the sparse ndarray.
The second part of the name is Ndarray, which is of course the topic of this chapter,
but it should be noted that we also support matrix data types. Implemented based on
Ndarray, the Matrix supports operations that work solely on matrices, such as row_num,
and can interoperate with ndarrays if the dimension is two. For example, we can perform
Besides the core implementation of data structure, the importance of the Ndarray
module lies in the various operations it supports. They can be categorized into
multiple types:
1. OCaml documentation: Profiling. https://ocaml.org/docs/profiling
We cannot claim that solely using OCaml can satisfy all the performance requirements of numerical
computing. More often than not, we still need to rely on the power of C or FORTRAN, as
in many other numerical computing libraries such as NumPy.
Therefore, in Owl we interface most ndarray operations to C implementation. In
the next several sections, we first explain how to interface OCaml code to C and then
introduce in detail the principles and techniques that are commonly applied to optimize
the performance of computations.
2.2 Interface OCaml to C
To ensure its performance, we implement the core computation in the Ndarray module
mostly in the C language and then interface them to OCaml. In this section, we briefly
introduce how it is done. The corresponding code is mostly included in the src/owl/core/ directory in the source code. Let's use the sine function in the Ndarray module as an example: the OCaml sin function is a thin wrapper around another OCaml function, _owl_sin, which dispatches to a C stub such as float32_sin according to the number type.
We do not implement the C function float32_sin directly, since we observe that the
implementations of four different types of sine functions are mostly the same. Therefore,
we utilize C macros. Here is the template (FUN4) that can be used to implement a series
of mathematical functions:
#ifdef FUN4
/* FUN4, NUMBER, and MAPFN are supplied by the including file (see below);
   the loop applies MAPFN to every element of x and stores the result in y */
CAMLprim value FUN4(value vN, value vX, value vY) {
  CAMLparam3(vN, vX, vY);
  int N = Long_val(vN);
  NUMBER *X_data = (NUMBER *) Caml_ba_array_val(vX)->data;
  NUMBER *Y_data = (NUMBER *) Caml_ba_array_val(vY)->data;
  NUMBER *start_x, *stop_x, *start_y;
  start_x = X_data;
  stop_x = start_x + N;
  start_y = Y_data;
  while (start_x != stop_x) {
    *start_y = (MAPFN(*start_x));
    start_x += 1;
    start_y += 1;
  };
  CAMLreturn(Val_unit);
}
#endif /* FUN4 */
This C function should satisfy certain specifications. The returned value of float32_sin must be of CAMLprim value type, and its parameters are of type value. Several
macros can be used to convert these value types into the native C types. These macros
are included in the <caml/mlvalues.h> header file. For example, we use Long_val to
cast a parameter into an integer value and Caml_ba_array_val from an OCaml Bigarray
type to an array of numbers. Finally, the computation itself is straightforward: apply the
function MAPFN on every element in the array x in a loop, and the output is saved in
array y. Since the returned value of the OCaml function _owl_sin is unit, this C function
also needs to return the macro Val_unit.
Notice that we haven’t specified several macros in this template yet: the number type
and the function to be applied on the array. For that, we use
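a stub that defines these macros and then includes the template. A minimal sketch of such an instantiation for the single-precision sine looks as follows (the template file name here is an assumption for illustration):

#define FUN4 float32_sin
#define NUMBER float
#define MAPFN(X) (sinf(X))
#include "owl_ndarray_maths_map.h"
#undef MAPFN
#undef NUMBER
#undef FUN4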
2. Microprocessor chronology, Wikipedia: https://en.wikipedia.org/wiki/Microprocessor_chronology
3. This figure is inspired by the CMU 15-418/15-618 course Parallel Computer Architecture and Programming, which is recommended to anyone who is serious about learning parallel computing.
perform required computation operations. This process repeats until the program stops.
During this process, the Execution Context contains necessary state information, such
as computing results, program counter, registers, etc. Besides these parts, the CPU also
contains several levels of cache, which finally connect to memory. This model provides
a quite good approximation to real CPUs. In the rest of this section, we will explain these
different parts and how they benefit the performance of computation in detail.
[Figure: a simplified model of a processor core, showing its execution contexts, caches, memory controller, and memory bus]
Vectorization
The ALU also has the potential to process more than one piece of data in one clock cycle. Single instruction multiple data (SIMD) refers to a computing method that uses one single instruction to process multiple pieces of data. It is in contrast with the conventional computing method that processes one piece of data, such as load, add, etc., with one instruction. The comparison of these two methods is shown in Figure 2-3. In this
example, one instruction processes four float numbers at the same time.
#include <immintrin.h>
int main() {
  __m256i a = _mm256_set_epi32(1, 2, 3, 4, 5, 6, 7, 8);
  __m256i b = _mm256_set_epi32(8, 7, 6, 5, 4, 3, 2, 1);
  __m256i c = _mm256_add_epi32(a, b);  /* add eight packed 32-bit integers */
  return 0;
}
To use AVX instructions, programmers need to use the Intel intrinsics, included in
header files such as immintrin.h. Here, _mm256_set_epi32 packs eight 32-bit integers together into a group, and _mm256_add_epi32 adds packed 32-bit integers. If you look at
the assembly code in this simple program, part of it looks like
...
vpunpcklqdq %xmm3, %xmm0, %xmm0
vpunpcklqdq %xmm2, %xmm1, %xmm1
vinserti128 $0x1, %xmm1, %ymm0, %ymm0
vmovdqa %ymm0, 8(%rsp)
vmovdqa -24(%rsp), %ymm0
vmovdqa %ymm0, 72(%rsp)
vmovdqa 8(%rsp), %ymm0
vmovdqa %ymm0, 104(%rsp)
vmovdqa 72(%rsp), %ymm1
vmovdqa 104(%rsp), %ymm0
vpaddd %ymm0, %ymm1, %ymm0
...
[Figure 2-4: execution timeline of multiple hardware threads on one core, with fetch/decode activity and stall periods interleaved over time]
It can be seen that instead of normal instructions such as add, AVX2 uses vpaddd to
add two packed doubleword integers and also utilizes special registers such as ymm0, etc.
In general, using SIMD can significantly improve the performance of computations.
Context Switching
As we have explained, the execution context in a processor contains necessary state
information when executing instructions. But that does not mean one processor can
only have one execution context. For example, each core of the Intel Core i9-9900K contains two execution contexts, or "hardware threads." That provides the possibility of concurrent
processing. Specifically, a core can apply context switching to store execution state in one
context while running instructions on the other if possible.
Note that unlike previous methods we have introduced, context switching does
not enable parallel processing at exactly the same cycle; a core still has one ALU unit to
process instructions. Instead, at each clock a core can choose to run an instruction on an
available context. This is especially useful to deal with instruction streams that contain
high-latency operations such as memory read or write. As shown in Figure 2-4, while
one instruction is executing, perhaps waiting for a long while to read some data from
memory, the core can switch to the other contexts and run another set of instructions. In
this way, the execution latency is hidden and increases overall throughput.
Multicore Processor
What we have introduced so far focuses on one single core. But another idea about
improving computation performance is more straightforward to a wider audience:
multicore processor. It means integrating multiple processing units, or cores, on one
single processor and enables the possibility of parallel processing at the same time. The
processor development trend is to add more and more cores in a processor. For example,
Apple M1 contains eight cores, with four high-performance cores and four high-
efficiency cores. Intel Core i9-12900HX processor contains 16 cores.
Similar to previous mechanisms, just because a processor provides the possibility
of parallel processing does not mean a piece of code can magically perform better by
itself. The challenge is how to utilize the power of multiple cores properly from the OS
and applications’ perspective. There are various approaches for programmers to take
advantage of the capabilities provided by multicore processors. For example, on Unix
systems the IEEE POSIX 1003.1c standard specifies a standard programming interface,
and its implementations on various hardware are called POSIX threads, or Pthreads.
It manages threads, such as their creation and joining, and provides synchronization
primitives such as mutex, condition variable, lock, barrier, etc. The OCaml language is also
adding native support for multicore, including parallelism via the shared memory parallel
approach and concurrency. It will be officially supported in the OCaml 5.0 release.
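As a minimal illustration of the Pthreads interface mentioned above (the worker function and its argument are arbitrary placeholders):

#include <pthread.h>
#include <stdio.h>

void *worker(void *arg) {
  printf("hello from thread %ld\n", (long) arg);  /* do some work */
  return NULL;
}

int main() {
  pthread_t t;
  pthread_create(&t, NULL, worker, (void *) 1L);  /* create a thread */
  pthread_join(t, NULL);                          /* wait for it to finish */
  return 0;
}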
[Figure 2-5: the memory hierarchy - registers, caches, main memory, flash disk, hard disk, and remote storage (e.g., the Internet); access gets faster but costlier toward the top, while storage capacity grows toward the bottom]
Memory
Besides the processor architecture, another core component, memory, also plays a
pivotal role in the performance of programs. In the rest of this section, we introduce
several general principles to better utilize memory properties. For more detailed
knowledge about this topic, we highly recommend the article by U. Drepper [18].
Ideally, a processing unit only needs to access one whole memory, as the Von
Neumann architecture indicates. However, such a universal memory would struggle
to meet various real-world requirements: permanent storage, fast access speed, cheap,
etc. Modern memory design thus has adopted the “memory hierarchy” approach
which divides memory into several layers, each layer being a different type of memory,
as shown in Figure 2-5. Broadly speaking, it consists of two categories: (1) internal
memory that is directly accessible by the processor, including CPU registers, cache, and
main memory; (2) external memory that is accessible to the processor through the I/O
module, including flash disk, traditional disk, etc. As the layer goes down, the storage
capacity increases, but the access speed and cost also decrease significantly.
The register is the closest to a computer’s processor. For example, an Intel Xeon Phi
processor contains 16 general-purpose registers and 32 floating-point registers, each
of 64-bit size. It also contains 32 registers for AVX instructions; the size of each is 256 or
512 bits. The access speed to registers is the fastest in the memory hierarchy, only taking
one processor cycle. The next level is caches of various levels. In Figure 2-1, we have
seen how a processor core connects directly to the various levels of caches. On an Intel
Core i9-9900K, the cache size is 64KB (L1, each core), 256KB (L2, each core), and 16MB
(shared), respectively. Its access speed is about ten cycles, with L1 being the fastest.
Cache
Due to the fast access time of cache compared to memory, utilizing cache is key to
improving performance of computing. Imagine that if only all a program’s data accesses
are directly from cache, its performance will reach orders of magnitude faster. Short
of reaching that ideal scenario, one principle is to utilize cache as much as possible.
Specifically, we need to exploit the locality of data access in programs, which means that
a program tends to reuse data that is “close” to what it has already used. The meaning
of “close” is twofold: first, recently used data is quite likely to be used again in the near
future, called temporal locality; second, data with nearby addresses are also likely to be
used recently. Later, we will discuss techniques based on these principles.
As a sidenote, due to the importance of caches, it is necessary to know the cache size
on your computer. You can surely go through its manual or specification documentation
or use commands such as lscpu. In the Owl codebase, we have employed the same
approach as used in Eigen,4 which is to use the cpuid instruction provided on x86
architectures. It can be used to retrieve CPU information such as the processor type
and if features such as AVX are included. Overall, the routine query_cache_size first
checks whether the current CPU vendor is x86 or x64 architecture. If not, it means cpuid
might not be supported, and it only returns a conservative guess. Otherwise, it retrieves
information depending on if the vendor is AMD or Intel. Here, the macros OWL_ARCH_
x86_64 and OWL_ARCH_i386 are implemented by checking if predefined system macros
such as __x86_64__, _M_X64, __amd64, __i386, etc. are defined in the compiler. The
CPUID macro is implemented using assembly code utilizing the cpuid instruction.
  if (cpu_is_amd(cpuinfo)) {
    query_cache_sizes_amd(l1p, l2p, l3p);
    return;
  }

  int highest_func = cpuinfo[1];
  if (highest_func >= 4)
    query_cache_sizes_intel(l1p, l2p, l3p);
  else {
    *l1p = 32 * 1024;
    *l2p = 256 * 1024;
    *l3p = 2048 * 1024;
  }
} else {
  *l1p = 16 * 1024;
  *l2p = 512 * 1024;
  *l3p = 512 * 1024;
4. Eigen: A C++ template library for linear algebra. The Eigen project. https://eigen.tuxfamily.org/
}
}
if(cache_type == 1 || cache_type == 3) {
int cache_level = (cpuinfo[0] & 0xE0) >> 5;
int ways = (cpuinfo[1] & 0xFFC00000) >> 22;
int partitions = (cpuinfo[1] & 0x003FF000) >> 12;
int line_size = (cpuinfo[1] & 0x00000FFF) >> 0;
int sets = (cpuinfo[2]);
Prefetching
To mitigate the long loading time of memory, prefetching is another popular approach.
As the name suggests, the processor fetches data into cache before it is demanded,
so that when it is actually used, the data can be accessed directly from the cache.
Prefetching can be triggered in two ways: via certain hardware events or explicit request
from the software.
Naturally, prefetching faces challenges from two aspects. The first is to know what
content should be prefetched from memory. A cache is so precious that we don’t want
to preload useless content, which leads to a waste of time and resources. Secondly, it is
equally important to know when to fetch. For example, fetching content too early risks
getting it removed from the cache before even being used.
For hardware prefetching, the processor monitors memory accesses and makes
predictions about what to fetch based on certain patterns, such as a series of cache
misses. The predicted memory addresses are placed in a queue, and the prefetch would
look just like a normal READ request to the memory. Modern processors often have
different prefetching strategies. One common strategy is to fetch the next N lines of data.
Similarly, it can follow a stride pattern: if currently the program uses data at address x,
then prefetch that at x+k, x+2k, x+3k, etc.
Compared with the hardware approach, software prefetching allows control from
programmers. For example, a GCC intrinsic serves this purpose:
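void __builtin_prefetch (const void *addr, int rw, int locality);

(In the GCC documentation the last two arguments are optional compile-time constants; they are spelled out here for clarity.)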
It contains three arguments. The first is the data address to be prefetched; the second
is a compile-time integer that indicates if the prefetch is preparing for a read from or
write to memory; and the final one indicates the temporal locality of fetched data to
decide if it should be evicted from cache once accessed.
The programmer can insert __builtin_prefetch into code if the corresponding
data is anticipated to be accessed soon. This intrinsic will be compiled into data prefetch
instructions via the compiler. If the prefetch is executed at a proper moment before the
access, ideally the required data will be already in the cache by the time it is used. The
following code is a simple example to demonstrate how it works in a C code:
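An illustrative sketch (the arrays, the loop body, and the prefetch distance of eight elements are arbitrary choices for demonstration):

double sum = 0.0;
for (int i = 0; i < n; i++) {
  /* request the element needed a few iterations from now:
     0 = prefetch for reading, 1 = low temporal locality */
  __builtin_prefetch(&a[i + 8], 0, 1);
  sum += a[i] * b[i];
}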
There are also other approaches for software control of prefetching, such as the _mm_prefetch(char const* p, int i) intrinsic from the SSE instruction set on Intel. It
prefetches a line of data from memory that contains address p to a location in the cache
hierarchy; argument i indicates the level of locality of cached data.
However, it is still tricky to do it right; sometimes, improper prefetching can even
make the execution slower. For one thing, it is normally difficult for us to know exactly
how far ahead the data should be fetched, especially in applications that access memory
irregularly. Frequent early prefetch actually reduces cache hit accuracy. Besides, the
locality pattern of different chunks of data is also complex to manage for programmers.
That’s why this approach should be used with caution.
[Figure 2-6: (a) UMA and (b) NUMA memory architectures]
NUMA
Finally, we briefly talk about non-uniform memory access (NUMA), since it
demonstrates the hardware aspect of improving memory access efficiency. We
have mentioned the multicore design of processors. While improving parallelism in
processors, it has also brought challenges to memory performance. When an application
is processed on multiple cores, only one of them can access the computer’s memory at
a time, since they access a single entity of memory uniformly. In a memory-intensive
application, this leads to a contention for the shared memory and a bottleneck. One
approach to mitigate this problem is the non-uniform memory access (NUMA) design.
Compared with the uniform access model we have introduced, NUMA separates
memory for each processor, and thus each processor can access its own share of
memory at a fairly low cost, as shown in Figure 2-6. However, the performance of NUMA
depends on the task being executed. If one processor needs to frequently access memory
of the other processors, that would lead to undesirable performance.
2.4 Optimization Techniques
The topic of computation optimization is a classic topic in computer science, and there
is still a lot of work on it in both academia and industry. Explaining even a part of it in detail would require a whole book. Instead, in this section we give some optimization
technique examples to demonstrate how the principles in the previous sections are
implemented.
Hardware Parallelization
Utilizing multicore is a straightforward way to improve computation performance,
especially complex computation on arrays with a huge number of elements. Besides the Pthreads we have introduced, Open Multi-Processing (OpenMP) is another tool
that is widely used. OpenMP is a library that provides APIs to support shared memory
multiprocessing programming in C/FORTRAN languages on many platforms. An
OpenMP program uses multiple threads in its parallel section, and it also sets up the
environment in the sequential execution section at the beginning. The parallel section is
marked by the OpenMP directive omp pragma. Each thread executes the parallel section
and then joins together after finishing. Here is an example:
#include <omp.h>
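#include <math.h>

/* an illustrative sketch of the example described below: fill a big array
   with sin(i); the single omp directive spreads the loop iterations over
   all available cores */
#define N 200000

int main() {
  static double data[N];
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    data[i] = sin(i);
  }
  return 0;
}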
Here, in a big array of 200000 elements, for each element we compute the sin
function on its index number. To apply multicore computing, we simply add one line of
directive on the for-loop. Without specifying the number of threads, it divides the whole
workload, the array, onto all available cores.
In Owl, we have also applied OpenMP to improve computation performance. For
example, we have introduced the template to map a single function on all elements in an
array. We can now change part of the template as follows. Here, caml_release_runtime_
system releases the master lock in a calling thread, enabling other threads to run code in
parallel with the execution of the current thread:
...
caml_release_runtime_system();
start_x = X_data;
stop_x = start_x + N;
start_y = Y_data;
We can also benefit from SIMD. For example, instead of interfacing to standard C
math library functions, we can implement our own SIMD version of math functions. It
is unfortunately not as simple as adding one line of directive, since the SIMD intrinsics
do not include complex computations. Even for one sine function, for example, we need
to carefully implement the Taylor expansion–based algorithm using various existing
intrinsics. Not to mention that we need to always think about different versions of SIMD:
SSE, AVX2, AVX512, etc., or different hardware vendors. In summary, the performance
boost using SIMD requires a significant amount of engineering work.
Cache Optimization
There are numerous cache optimization techniques, but most of them share the same
theme: improve data locality (both spatial and temporal) and align the code and data.
If you put something into cache, you’d better make it count: reusing cached data
as much as possible. Next, we will use matrix multiplication as an example. Matrix
multiplication is one of the centerpieces of scientific computing. Its basic algorithm
is simple:
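A sketch of that algorithm, using the same variable names (mul1, mul2, and the result r) that appear in the optimized versions later in this section; all matrices are N x N:

for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      r[i][j] += mul1[i][k] * mul2[k][j];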
The way the data is read is from left to right: (0, 0), (0,1), … (0, n), (1, 0), (1, 1), …
(1, n), …. While the element (0, 0) is loaded, the next several elements are also saved in
the cache so that (0, 1), (0, 2), etc. are all loaded from cache instead of memory. However,
the elements in mul2 are not accessed this way. After (0, 1), the elements (1, 0), (2, 0), …
are required. That means the cached elements are all wasted. One approach to deal with
this problem is to transpose mul2 before multiplication:
double tmp[N][N];
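/* a sketch of the rest of this approach: transpose mul2 into tmp, then
   multiply so that both operands are traversed row by row */
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    tmp[i][j] = mul2[j][i];

for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      r[i][j] += mul1[i][k] * tmp[j][k];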
Another approach even utilizes L1 cache better. First, it “cuts” a large matrix into
multiple smaller square ones. Each line of such a square matrix can be fit into an L1
cache. These multiple smaller matrix multiplications are iterated in an outer loop.
This technique is sometimes called tiling. The algorithm can be demonstrated using
Figure 2-7. In a smaller matrix multiplication ab = c, a row of a is stored in L1 cache (Step
➀). The column number moves in matrix b to compute the corresponding output in c
(Step ➁). Only after this step the cached line will be evicted and a new row in a will be
retrieved into the cache (Step ➂). The previous rows in a will not be used again. The
algorithm is implemented as follows. Here, E is the number of elements in a row of the
small matrix:
for (i = 0; i < N; i += E)
for (j = 0; j < N; j += E)
for (k = 0; k < N; k += E)
for (ii = 0, rr = &r[i][j],
am = &mul1[i][k]; ii < E;
++ii, rr += N, am += N)
for (kk = 0, bm = &mul2[k][j];
kk < E; ++kk, bm += N)
for (jj = 0; jj < E; ++jj)
rr[jj] += am[kk] * bm[jj];
[Figure 2-7: tiling for matrix multiplication - small blocks a, b, c cut from matrices A, B, C; a row of a stays in the L1 cache (step 1) while columns of b produce the corresponding outputs in c (step 2), before the next row of a is loaded (step 3)]
Another technique that utilizes cache is loop merging. We can merge consecutive
loops that sweep through data into one loop to reuse data in the cache, reducing
memory access. The following code shows a simple example:
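A minimal illustration (the array names and constants here are placeholders): two consecutive loops sweep through the same array x.

for (i = 0; i < n; i++)
  x[i] = x[i] * a;
for (i = 0; i < n; i++)
  y[i] = x[i] + b;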
Obviously, these two loops can be fused into one single loop:
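Continuing the sketch above, the fused version touches each x[i] only once while it is still in the cache:

for (i = 0; i < n; i++) {
  x[i] = x[i] * a;
  y[i] = x[i] + b;
}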
By fusing two loops into one, the access order of the x elements changed, increasing
temporal locality. It can further be accelerated using techniques such as parallel
computing techniques we have mentioned. For some of the cases where loop merging cannot be applied directly, the loop alignment technique may help: the loop iterations are shifted against each other so that the two bodies can still be fused, with boundary statements such as x[0] = y[0] + a and z[n-1] = x[n] + b peeled out before and after the combined loop, as sketched below.
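A minimal sketch of this transformation (the shapes of the two original loops are assumptions consistent with the peeled statements above):

/* original pair of loops:
     for (i = 0; i < n; ++i) x[i] = y[i] + a;
     for (i = 0; i < n; ++i) z[i] = x[i+1] + b;  */
x[0] = y[0] + a;
for (i = 1; i < n; ++i) {
  x[i] = y[i] + a;
  z[i-1] = x[i] + b;
}
z[n-1] = x[n] + b;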
Other Techniques
Besides processor parallelism and cache utilization, there are still many techniques to
improve code performance. We will only briefly introduce some of them in this part.
Compilers surely have a great impact on the code performance. For example,
compilers such as LLVM or GCC can be configured with plenty of options and flags.
Choosing the most suitable options can actually be a challenging task. Besides,
programmers can add inline assembly code to C to further increase the execution speed.
Another optimization technique, unrolling, is also partly about understanding how
compilers work. For example, we can unroll the for-loop into eight parts:
for (i = 0; i < n; i++) {
  a[i] = b[i] + 1;
}

for (i = 0; i < n; i += 8) {
  a[i]   = b[i]   + 1;
  a[i+1] = b[i+1] + 1;
  a[i+2] = b[i+2] + 1;
  a[i+3] = b[i+3] + 1;
  a[i+4] = b[i+4] + 1;
  a[i+5] = b[i+5] + 1;
  a[i+6] = b[i+6] + 1;
  a[i+7] = b[i+7] + 1;
}
It allows the compiler to decrease the number of conditional branches, thus reducing
potential branch mispredictions and condition evaluations.
Despite what we have explained, note that the cache does not always help. Sometimes, the data is put into the cache but won't be used again for a while, which means the cache is wasting time on writes that are never read. In that case, it is necessary to bypass the caching phase. Processors support nontemporal writes directly to memory, and SIMD also provides intrinsics for this purpose, such as the following.
#include <ammintrin.h>
void _mm_stream_sd(double *p, __m128d a);
void _mm_stream_ss(float *p, __m128 a);
2.5 Example: Convolution
Convolution is a family of mathematical operations that is arguably the most important
operation in deep neural networks. It makes up the backbone of a majority of deep
neural network architectures and takes up a large part of computation resources
involved in their training and inference. According to the shape of input, convolution
operations can be categorized into one dimensional, two dimensional, and three
dimensional. It can also be categorized according to usage in the forward or backward
propagation phase as normal convolution, backward convolution on kernel, and
backward convolution on input. There are special operations such as transpose
convolution, dilated convolution, etc. But their implementation principles are quite
similar. There is a lot of work on optimizing convolution operations due to their
importance [55]. It takes significant engineering effort to implement only part of
them. In this section, we use the two-dimensional convolution operation Conv2D as an
example to demonstrate how we apply various optimization techniques on convolution
operations in Owl.
A convolution operation takes two ndarrays as input: image (I) and kernel (F). In a
two-dimensional convolution, both ndarrays are of four dimensions. The image ndarray
has B batches; each image has size H × W and has IC channels. The kernel ndarray has R
rows, C columns, the same input channel IC, and output channel K. The convolution can
then be expressed as in Eq. 2.1.
$$\mathrm{CONV}_{b,h,w,k} = \sum_{ic=1}^{IC}\sum_{r=1}^{R}\sum_{c=1}^{C} I_{b,\,h+r,\,w+c,\,ic}\; F_{r,\,c,\,ic,\,k}. \tag{2.1}$$
int pr = 0, pc = 0;
if (padding != 1) {
pr = (row_stride * ( output_rows - 1) + kernel_rows - input_rows) / 2;
pc = (col_stride * ( output_cols - 1) + kernel_cols - input_cols) / 2;
if (pr < 0) pr = 0;
if (pc < 0) pc = 0;
}
...
The code starts by locating the starting pointers of inputs (input and kernel) and
the various metadata about inputs: input channel, row/column numbers, output
channel, stride, padding size, etc. Besides, it also assigns memory space for outputs
and intermediate buffers. The code next implements what we have introduced. Using
three for-loops, we fill in the intermediate input buffer inpt2d, which is one matrix, and
multiply it with the kernel matrix using the GEMM routine provided by OpenBLAS.
...
int cnt = 0;
for (int a = cstart; a < cend; ++a) {
for (int b = rstart; b < rend; ++b) {
for (int h = 0; h < in_channel; ++h) {
if (a < input_cols && a >= 0 &&
b < input_rows && b >= 0) {
int input_idx =
input_idx_base + a * input_ri + b * in_channel + h;
inpt2d[i * kernel_cri + cnt] = input_ptr[input_idx];
}
++cnt;
}
}
}
}
free(inpt2d);
return Val_unit;
}
Instead of generating the whole intermediate matrix, it cuts the input and kernel matrices into
small blocks one at a time so that the memory usage is limited no matter how large the
input and kernel are. Next, we show the code:
int mc = output_crb;
int kc = kernel_cri;
int nc = out_channel;
compute_block_sizes(&kc, &nc, &mc, sizeof(TYPE));
Suitable implementations can be chosen depending on the input size. Here, we use
the intermediate matrix size to decide if we need the memory-efficient implementation
or not. If it is sufficiently small, we use the previous im2col implementation. It is still
straightforward and fast with small input sizes. Otherwise, we compute the suitable
small block sizes as in [25].
To further improve the performance, we use the SIMD intrinsics in filling the
temporary matrix from input ndarray. For one thing, depending on whether the input
channel is divisible by the supported data length AVX_PSIZE of SIMD (e.g., 8 float
numbers for AVX), we provide two sets of implementations for filling the temporary
blocks. We then assign space for the small blocks that can be fit into cache accordingly.
...
for (int m = 0; m < output_crb; m += mc) {
int actual_mc = fminf(m + mc, output_crb) - m;
for (int k = 0; k < kernel_cri; k += kc) {
memset(temp_mk, 0, mc * kc * sizeof(TYPE));
int actual_kc = fminf(k + kc, kernel_cri) - k;
#ifdef AVX_PSIZE
int kc_strip = (actual_kc / AVX_PSIZE) * AVX_PSIZE;
#endif
int cmn = 0;
for (int ix = 0; ix < actual_mc; ix++) {
for (int iy = 0; iy < actual_nc; iy++) {
int index_mn = (ix + m) * out_channel + (iy + n);
output_ptr[index_mn] += temp_mn[cmn++];
}
}
}
}
}
free(temp_mk);
free(temp_kn);
free(temp_mn);
return Val_unit;
}
The code next follows a similar pattern as the previous method, filling in the input
and kernel matrices and multiplying them to get the output, only that both need more
detailed control to get smaller matrices to fit into cache. Specifically, here is the code to
get the input matrix:
int cmk = 0;
for (int im = 0; im < actual_mc; im += 1) {
int b = (m + im) / output_cr;
int cr = (m + im) - b * output_cr;
int c = cr / output_rows;
int r = cr - c * output_rows;
Figure 2-9. Parallel execution of the sin operation on ndarray using OpenMP
Figure 2-10. Compare the behavior of abs and sine when using OpenMP
However, performance improvement does not come for free. The overhead of
using OpenMP comes from time spent on scheduling chunks of work to each thread,
managing locks on critical sections, startup time of creating threads, etc. Therefore,
when the input ndarray is small enough, these overheads might overtake the benefit of
threading.
What is a suitable input size to use OpenMP then? This question would be easy to
solve if there is one single suitable input size threshold for every operation, but that is
not the case. In a small experiment, we compare the performance of two operations, abs
(absolute value) and sin, in three cases: running them without using OpenMP, with
two-thread OpenMP, and with four-thread OpenMP.
The result in Figure 2-10 shows that, with growing input size, for the sine operation,
the OpenMP version outperforms the non-OpenMP version at a size of less than 1000,
but when using abs operation, that cross point is at about 1,000,000. The complexity of
math operations varies greatly, and the difference is even starker when we compare their
performance on different machines. Note that both axes use a log scale, and that is why a
small deviation when the input array size is small looks large in the figure.
This issue becomes more complex when considered in real applications such as
DNN, where users need to deal with operations of vastly different complexity and
input sizes. Thus, one fixed threshold for several operations is not an ideal solution.
Considering these factors, we need a fine-grained method to decide a suitable OpenMP
threshold for each operation.
Toward this end, we implement the AEOS module in Owl. The idea is to add a tuning
phase before compiling and installing Owl, so that each operation learns a suitable
threshold parameter to decide if OpenMP should be used or not, depending on the input
size. The key idea of parameter tuning is simple. We implement two versions of each
operation, one using OpenMP and the other not. We then measure their execution time
for various sizes of input. Each measurement is repeated multiple times, and, to reduce
the effect of outliers, only the values that are within the first and the third quartiles
are used. After removing outliers, regression is performed to find a suitable input size
threshold. According to our initial experiment, linear regression is fit to estimate the
OpenMP parameters here. Since this tuning phase is executed before compiling Owl,
the AEOS module is independent of Owl, and all necessary implementation is coded
separately to ensure that future changes of Owl do not affect the AEOS module itself.
The tuned parameters then need to be passed to Owl. When the OpenMP switch is
turned on, the AEOS module generates a C header file which contains the definition of
macros, each of which defines a threshold for one operation. When this header file is not
generated, predefined default macro values are used instead. After that, Owl is compiled
with this header file and uses these tuned parameters in its math operations. The tuning
phase only needs to be performed once on each machine during installation.
The design of the AEOS module focuses on keeping tuning simple, effective, and
flexible. Each operation is implemented as a single OCaml module, so that support for
new operations can be easily added. The interface of such a module is shown as follows.
We expect that tuning does not have to be only about OpenMP parameters and that
different regression methods could be used in the future. For example, the Theil-Sen
estimator can be plugged in for parameter estimation if necessary. In each module,
arbitrary tuning procedures can be plugged in as long as the interface is satisfied.
The AEOS module is implemented in such a way that it brings little interference to the main Owl library. The code can be viewed in the corresponding pull request, which has been merged into the main branch of Owl. You only need to switch the ENABLE_OPENMP flag from 0 to 1 in the dune file to try this feature.
Table 2-1 presents the tuned threshold values of five operations on a MacBook
with a 1.1GHz Intel Core m3 CPU and a Raspberry Pi 3B. We can see that they vary
across different operations and different machines, depending on their computational
complexity. For example, on MacBook, the tuning result is “max_int”, which means that
for the relatively simple square root calculation, OpenMP should not be used, but that is
not the case on Raspberry Pi. Also, note that the less powerful Raspberry Pi tends to get
lower thresholds.
2.7 Summary
In this chapter, we focused on the optimization of core ndarray operations in Owl. We
started by introducing the Ndarray module in Owl and its pivotal role in a numerical
library and then introduced how we interface the OCaml code to the C language. The
rest of this chapter mostly focused on optimizations at the C level. As an important
background, we explained the principles in optimizing scientific computing code, such
as utilizing parallelism of processors and locality of caches. Next, we briefly introduced
some techniques based on these principles. As an example, we demonstrated how we
apply some of them to optimize one of the most important operations in deep neural
networks: the convolution. Finally, we briefly introduced the automatic tuning approach
to optimize library performance across various platforms, using multicore parallel
computing on Owl as an example.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 3
Algorithmic Differentiation
Differentiation is key to numerous scientific applications including maximizing or
minimizing functions, solving systems of ODEs, physical simulation, etc. Of existing
methods, algorithmic differentiation, or AD, is a computer-friendly technique for
performing differentiation that is both efficient and accurate. AD is a central component
of the architecture design of Owl. In this chapter, we will show, with hands-on examples,
how the AD engine is designed and implemented in Owl. AD will be used in some of the
other chapters to show its application in optimization and machine learning.
3.1 Introduction
Assume an object moves a distance of Δs in a time Δt; the average velocity of this object during this period can be defined as the ratio between Δs and Δt. As both values get smaller and smaller, we can get the instantaneous velocity:

$$v = \lim_{\Delta t \to 0} \frac{\Delta s}{\Delta t} = \frac{ds}{dt} \tag{3.1}$$

The term $\frac{ds}{dt}$ is referred to as "the derivative of s with respect to t."
Differentiation is the process of finding a derivative in mathematics. It studies the
functional relationship between variables, that is, how much one variable changes when
the value of another variable changes. Differentiation has many important applications,
for example, finding minimum and maximum values of a function, finding the rate of
change of quantity, computing linear approximations to functions, and solving systems
of differential equations. Its critical roles in these key mathematical fields mean it is
widely used in various fields. One example is calculating marginal cost and revenue in
economics.
$$f'(x) = \lim_{\delta \to 0} \frac{f(x+\delta) - f(x)}{\delta}. \tag{3.2}$$

$$\nabla f = \left(\frac{\partial f}{\partial x_0}, \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right) = (x_1 x_2,\ x_0 x_2,\ x_0 x_1)$$
[Figure 3-1: components of the AD module - types, operators, the reverse differentiation engine, and high-level APIs]
This process completely eliminates the impact of numerical errors, but the
complexity of symbolic manipulation quickly grows as expressions become more
complex. Just imagine computing the derivative of a simple calculation $f(x) = \prod_{i=0}^{n-1} x_i$: the
result would be terribly long, if not that complex. As a result, symbolic differentiation can
easily consume a huge amount of computing resource and becomes impractically slow
in the end. Besides, unlike in numerical differentiation, we must know how a function is
constructed to use symbolic differentiation.
Finally, there is the algorithmic differentiation (AD). It is a chain rule–based
technique for calculating derivatives with respect to input variables of functions
defined in a computer program. Algorithmic differentiation is also known as automatic
differentiation, though strictly speaking it does not fully automate differentiation and
can sometimes lead to inefficient code. In general, AD combines the best of both worlds:
on one hand, it efficiently generates exact results and so is highly applicable in many
real-world applications; on the other hand, it does not need to expand the whole expression symbolically, and its computing process is efficient. Therefore, it is the mainstream
implementation of many numerical computing tools and libraries, such as JuliaDiff
in Julia, ad in Python, ADMAT, etc. The rest of this chapter focuses mainly on algorithmic
differentiation.
In this chapter, we assume you are familiar with how differentiation works
mathematically, so we can focus on the design and implementation details of the AD
module in Owl. But first, let’s take a look at a simple example to see how the AD module
is used in Owl. In this example, we simply calculate the first-order and second-order
derivatives of the function tanh.
module AD = Algodiff.D
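A minimal sketch of such a session follows; the binding names f1, f2, v1, and v2 are illustrative, not from the original listing.

open AD

let f x = Maths.tanh x                     (* the original function *)
let f1 = diff f                            (* first-order derivative *)
let f2 = diff f1                           (* second-order derivative *)
let v1 = f1 (pack_flt 1.) |> unpack_flt    (* value of f' at x = 1 *)
let v2 = f2 (pack_flt 1.) |> unpack_flt    (* value of f'' at x = 1 *)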
That’s all it takes. We define the function and apply diff on it to acquire its first-
order derivative, on which the diff function can be directly applied to get the second-
order derivative. We then evaluate and get the function value at point x = 1 on these two
derivative functions.
Figure 3-1 shows the various components in the AD module. Let us inspect how
they fit into the example code. First, we cannot directly use the basic data types, such
as ndarray and float number. Instead, they need to be first “packed” into a type that
the AD module understands. In this example, pack_flt is used to wrap a normal float
number into an AD type float. After calculation finishes, assuming we still get an AD type
float as output, it should be unpacked into a normal float number using the function
unpack_flt. The type system is the most fundamental building block in AD. Second,
to construct a computation in the AD system, we need to use operators, such as tanh
used in this example. AD provides a rich set of operators that are generated from the
op_builder module. After constructing a graph by stacking the operators, the AD engine
starts to let the input data “flow,” or “propagate,” twice in this graph, once forward
and once backward. The key function that is in charge of this process is the reverse
function. Based on the aforementioned process, we can calculate the differentiation of
various sorts. To simplify coding, a series of high-level APIs are constructed. The diff
function used in this example is one such API. It applies differentiation on a function
that accepts a float number as input and outputs a float number. These high-level APIs
lead to extremely elegant code. As shown in this example, we can simply apply the
differentiation function on the original tanh function iteratively to get its first-order,
second-order, and any other higher-order derivatives. In the next several sections, we
will explain these building blocks in detail and how these different pieces are assembled
into a powerful AD module.
3.2 Types
We start with type definition. The data type in AD is defined in the owl_algodiff_types.
ml file, as shown in the following. Even if you are familiar with the type system in OCaml,
it may still seem a bit confusing. The essence of AD type is to express the forward and
reverse differentiation modes. So first, we use an example to demonstrate how these two
AD modes work.
53
Chapter 3 Algorithmic Differentiation
Consider, for example, the function y(x₀, x₁) = sin(x₀x₁). This function takes two inputs, and our aim is to compute ∇y = (∂y/∂x₀, ∂y/∂x₁).
Computations can be represented as a graph shown in Figure 3-2. Each node
represents either an input/output or intermediate variables generated by the
corresponding mathematical function. Each node is named vᵢ. Herein, the inputs are v₀ = x₀ and v₁ = x₁, and the output is y = v₃.
Both the forward and reverse modes rely on basic rules to calculate differentiation. On one hand, there are the basic forms of derivative equations, such as d/dx sin(x) = cos(x) and d/dx [u(x)v(x)] = u′(x)v(x) + u(x)v′(x). On the other hand, there is the chain rule. It states that, if two functions f and g are composed to create a function F(x) = f(g(x)), then the derivative of F can be calculated as

F′(x) = f′(g(x)) g′(x).    (3.4)
Table 3-1. Forward differentiation at x₀ = 2, x₁ = 2: primal and tangent computations

Step   Primal computation             Tangent computation (w.r.t. x₀)
0      v₀ = x₀ = 2                    v̇₀ = 1
1      v₁ = x₁ = 2                    v̇₁ = 0
2      v₂ = v₀v₁ = 4                  v̇₂ = v₀v̇₁ + v₁v̇₀ = 2·0 + 2·1 = 2
3      v₃ = sin(v₂) = −0.757          ẏ = v̇₃ = cos(v₂)·v̇₂ = −0.654·2 = −1.308
Forward Mode
Let's look at the first way, namely, the "forward" mode, to calculate derivatives. We ultimately wish to calculate ∂y/∂x₀ (and ∂y/∂x₁, which can be calculated in a similar way).
We begin by calculating some intermediate results that will prove to be useful. Using the labels vᵢ to refer to the intermediate computations, we immediately have ∂v₀/∂x₀ = 1 and ∂v₁/∂x₀ = 0, since v₀ = x₀ and v₁ = x₁.
Next, consider ∂v₂/∂x₀, which requires us to use the derivative rule for multiplication. It is a bit trickier and requires the use of the chain rule:

∂v₂/∂x₀ = ∂(x₀x₁)/∂x₀ = x₁·∂x₀/∂x₀ + x₀·∂x₁/∂x₀ = x₁
After calculating ∂v₂/∂x₀, we proceed to compute the partial derivative of v₃, which is the final result ∂y/∂x₀ we are looking for. This process starts with the input variables and ends with the output variables, and that's where the name "forward differentiation" comes from. We can simplify the notation by letting v̇ᵢ = ∂vᵢ/∂x₀. The v̇ᵢ is called the tangent of the function vᵢ(x₀, x₁, …, xₙ) with regard to the input variable x₀, and the results of evaluating the function at each intermediate point are called the primal values.
Let's calculate ẏ when setting x₀ = 2 and x₁ = 2. The full forward differentiation calculation process is shown in Table 3-1, where two simultaneous computation processes take place in the two computation columns: the primal column simply performs the computation following the computation graph; the tangent column gives the derivative of each intermediate variable with regard to x₀.
Two things need to be noted in this calculation process. The first is that in
algorithmic differentiation, unlike symbolic differentiation, the computation is
performed step by step, instead of after the whole computation is unwrapped into one
big formula following the chain rule. Second, in each step, we only need to keep two values: the primal and the tangent. Besides, each step only needs access to its "parents," to use a graph-theory term. For example, to compute v₂ and v̇₂, we need to know the primal and tangent of v₀ and v₁; to compute those of v₃, we need to know the primal and tangent of v₂; etc. These observations are key to our implementation.
Reverse Mode
Now let's rethink this problem from the other direction: from outputs to inputs. The problem remains the same, that is, to calculate ∂y/∂x₀. We still follow the same step-by-step procedure as in the previous forward mode. The only difference is that this time we calculate it backward. For example, in our example y = v₃ = sin(v₂), so if only we knew ∂y/∂v₂, we would move a step closer to our target solution.
We first observe that ∂y/∂v₃ = 1, since y and v₃ are the same. We then compute ∂y/∂v₂ by applying the chain rule:

∂y/∂v₂ = ∂y/∂v₃ · ∂v₃/∂v₂ = 1 · cos(v₂).    (3.5)
Table 3-2. Forward pass in the reverse differentiation process (x₀ = 2, x₁ = 2)

Step   Primal computation
0      v₀ = x₀ = 2
1      v₁ = x₁ = 2
2      v₂ = v₀v₁ = 4
3      v₃ = sin(v₂) = −0.757

Table 3-3. Backward pass in the reverse differentiation process

Step   Adjoint computation
4      v̄₃ = 1
5      v̄₂ = v̄₃ ∂v₃/∂v₂ = v̄₃ ∂(sin v₂)/∂v₂ = 1 · cos(v₂) = −0.654
6      v̄₁ = v̄₂ ∂v₂/∂v₁ = v̄₂ ∂(v₀v₁)/∂v₁ = −0.654 · v₀ = −1.308
7      v̄₀ = v̄₂ ∂v₂/∂v₀ = v̄₂ ∂(v₀v₁)/∂v₀ = −0.654 · v₁ = −1.308
To simplify the notation, we write

v̄ᵢ = ∂y/∂vᵢ

for the derivative of the output variable y with regard to the intermediate node vᵢ. v̄ᵢ is called the "adjoint of variable vᵢ with respect to the output variable y." Using this notation, Eq. 3.5 can be rewritten as
v̄₂ = v̄₃ · ∂v₃/∂v₂ = 1 · cos(v₂)
Note the difference between tangent and adjoint. In the forward mode, we know v̇₀ and v̇₁ and then calculate v̇₂, v̇₃, … until we get the target. In the reverse mode, we start with v̄ₙ = 1 and calculate v̄ₙ₋₁, v̄ₙ₋₂, … until we have our target v̄₀ = ∂y/∂v₀ = ∂y/∂x₀. Note that v̇₃ = v̄₀ in this example, given that we take derivatives with respect to x₀ when computing v̇₃. As a result, the reverse mode is also called the adjoint mode.
Following this procedure, we can now perform the complete reverse mode
differentiation. Note one major difference compared to the forward mode. In Table 3-1,
we can compute the primal and tangent in one pass, since computing one of them does
not require the other. However, as shown in the previous analysis, it is possible to require
the value of v2 and possibly other previous primal values to compute v2 . Therefore,
a forward pass1 is first required, as shown in Table 3-2, to compute the required
intermediate values. They are actually identical to those in the Primal Computation
column of Table 3-1. We put it here again to stress our point about this stand-alone
forward computing pass.
Table 3-3 shows the backward pass in the reverse differentiation process, starting
from the very end, and calculates all the way up to the beginning. A short summary: To
compute differentiation using reverse mode, we need a forward pass to compute primal
and next a backward pass to compute adjoint.
Both the forward and reverse modes are equivalent in computing differentiation. So you might wonder, since the forward mode looks more straightforward, why don't we just stick with it all along? Note that in the reverse mode we obtained ∂y/∂x₁ "for free" while calculating ∂y/∂x₀.
But in the forward mode, to calculate the derivative regarding another input, we have
to calculate all the intermediate results again. So here lies one of the most significant
strengths of the reverse mode: no matter how many inputs there are, a single reverse
pass gives us all the derivatives of the inputs.
This property is extremely useful in neural networks. The computation graphs constructed in neural networks tend to be quite complex, often with more than one input. The target of using AD is to find the derivative of the output – typically a scalar value of a loss function – with respect to the inputs. Thus, using the reverse mode AD is more efficient.
Data Types
Now that we understand the basic elements in computing a derivative, let’s turn to the
data type used in the AD system. It is built upon two basic types: scalar number F and
ndarray Arr. They are of type A.elt and A.arr. Here, A presents an interface that mostly
resembles that of an ndarray module. It means that their specific types, such as single or
double precision, C implementation or base implementation, etc., all depend on this A
ndarray module. Therefore, the AD module does not need to deal with all the lower-level
details. We will talk about how the AD module interacts with the other modules later
in this chapter. For now, it suffices to simply understand them as, for example, single-
precision float number and ndarray with single-precision float as elements, so as to
better grasp the core ideas in AD.
1
Not to be confused with the “forward differentiation mode” introduced before.
The other two types are compounded types, each representing one differentiation
mode. The DF type contains three parts, and the most important ones are the first two:
primal and tangent. The DR type contains six parts, and the most important ones are the
first, primal, and the third, op. op itself consists of three parts: adjoint, register, and label,
of which adjoint is the most important component. The DR type also contains an adjoint
accumulator (the second parameter), a fanout flag, and a tracker flag. The accumulator
is of reference type since it needs to be updated during the propagation process. Both
DF and DR types contain a tag of integer type. Later, we will discuss how these extra parts
work in an AD engine. To focus on the core idea in AD, for now we introduce the most
important elements: primal, tangent, and adjoint.
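A sketch of the definition, reconstructed from this description (the exact layout in owl_algodiff_types.ml may differ slightly):

type t =
  | F   of A.elt                                     (* scalar *)
  | Arr of A.arr                                     (* ndarray *)
  | DF  of t * t * int                               (* primal, tangent, tag *)
  | DR  of t * t ref * op * int ref * int * int ref  (* primal, adjoint, op, fanout, tag, tracker *)

and adjoint = t -> t ref -> (t * t) list -> (t * t) list
and register = t list -> t list
and label = string * t list
and op = adjoint * register * label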
In essence, the computation graph in AD is constructed by building a list. Each
element of this list contains two elements: the partial derivative computation and the
original type t data. In the data type, the adjoint is a function. For each t type data, it
specifies how to construct this list. Though the derivative computation rule of different
operators varies, the adjoint generally falls into several patterns. For example, here is
what the adjoint function looks like for an operation/function that takes one input and
produces one output, such as sin, exp, etc.
let r a =
let adjoint cp ca t = (dr (primal a) cp ca, a) :: t in
let register t = a :: t in
let label = S.label, [ a ] in
adjoint, register, label
Here, the r function returns an op type, which consists of the adjoint function,
the register function, and the label tuple. First, let’s look at the adjoint function. The
first two variables cp and ca will be used in the derivative function dr. We will talk
about it later in Section 3.3. For now, we only need to know that the reverse derivative
computation dr calculates something; we put it together with the original input operator
a into a tuple and add them to the existing list t, which is the third argument. The other
two components are supplementary. The register function actually is an adjoint
function without really calculating adjoints; it only stacks a list of original operators. The
third one, label, pairs a string name such as "sin" or "exp" with the input operators.
Next, let’s see another example in an operator that takes multiple inputs, such as add,
mul (multiplication), etc. It’s a bit more complex:
let r_d_d a b =
let adjoint cp ca_ref t =
let abar, bbar = dr_ab (primal a) (primal b) cp ca_ref in
(abar, a) :: (bbar, b) :: t
in
let register t = a :: b :: t in
let label = S.label ^ "_d_d", [ a; b ] in
adjoint, register, label
The difference is that one such operator needs to push two items into the list. So here
dr_ab is still a function that calculates derivatives reversely, and it returns the derivatives
on its two parents, noted by abar and bbar, which are both pushed to the adjoint list.
The register and label follow a similar pattern. In fact, for an operator that takes multiple inputs, we also need to consider the case where one of the inputs is just a constant. In that case, only one element should be put into the list:
let r_d_c a b =
let adjoint cp ca_ref t = (S.dr_a (primal a) b cp ca_ref, a) :: t in
let register t = a :: t in
let label = S.label ^ "_d_c", [ a; b ] in
adjoint, register, label
Operations on AD Type
After understanding the data type defined in AD, let’s take a look at what sorts of
operations can be applied to them. They are defined in the owl_algodiff_core.ml
file. The most notable ones are the “get” functions that retrieve certain information
from an AD type data, such as its primal, tangent, and adjoint values. In the following
code, the primal' is a “deep” function that recursively finds the primal value as float or
ndarray format.
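A sketch of these functions, assuming the type definition shown earlier:

let primal = function
  | DF (ap, _, _)          -> ap
  | DR (ap, _, _, _, _, _) -> ap
  | ap                     -> ap

(* recursively strip DF/DR wrappers until a plain F or Arr value remains *)
let rec primal' = function
  | DF (ap, _, _)          -> primal' ap
  | DR (ap, _, _, _, _, _) -> primal' ap
  | ap                     -> ap

let tangent = function
  | DF (_, at, _) -> at
  | _             -> failwith "error: AD.tangent"

let adjval = function
  | DR (_, aa, _, _, _, _) -> !aa
  | _                      -> failwith "error: AD.adjval"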
And the zero function resets all elements to the zero status:
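A sketch, assuming the underlying module A provides zeros and float_to_elt as used elsewhere in this chapter:

let rec zero = function
  | F _                    -> F A.(float_to_elt 0.)
  | Arr ap                 -> Arr A.(zeros (shape ap))
  | DF (ap, _, _)          -> zero ap
  | DR (ap, _, _, _, _, _) -> zero ap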
Another group of important operations are those that convert the AD type to and
from ordinary types such as float and ndarray:
let pack_elt x = F x
let unpack_elt x =
match primal x with
| F x -> x
| _ -> failwith "error: AD.unpack_elt"
let _f x = F A.(float_to_elt x)
let unpack_arr x =
match primal x with
| Arr x -> x
| _ -> failwith "error: AD.unpack_arr"
There are also operations that provide helpful utilities. One of them is the zero function we have just seen; there are also functions that expose type information:
let shape x =
match primal' x with
| F _ -> [||]
| Arr ap -> A.shape ap
| _ -> failwith "error: AD.shape"
3.3 Operators
The graph is constructed with a series of operators that can be used to process AD type
data as well as building up a computation graph that is differentiable. They are divided
into submodules: Maths is the most important component, and it contains a full set of
mathematical functions to enable constructing various computation graphs; Linalg
contains a subset of linear algebra functions; NN contains functions used in neural
networks, such as two-dimensional convolution, dropout, etc.; Mat is specifically for
matrix operations, such as eye that generates an identity matrix; and Arr provides
functions such as shape and numel for ndarrays.
As shown in Figure 3-1, the implementation of an operation can be abstracted into two parts: (a) what its derivative and calculation rules are and (b) how these rules are applied in the AD system. The first part is defined in owl_algodiff_ops.ml, and the second in owl_algodiff_ops_builder.ml.
Calculation Rules
Let's look at some examples from the first part to see what these calculation rules are and how
they are expressed in OCaml. We can use the sine function as an example. It takes an
input and computes its sine value as output. This module specifies four computing rules,
each corresponding to one type of AD data. Here, module A is the underlying “normal”
ndarray module that implements functions for ndarray and scalar values. It can be single
precision or double precision, implemented using OCaml or C. For the F scalar type,
ff_f specifies using the sin function from the Scalar submodule of A. If the data is an AD
ndarray, ff_arr states that the sine functions should be applied on all of its elements by
using the A.sin function. Next, if the data is of type DF, the df function is used. As shown
in the example in Table 3-1, it computes tangent (at) * derivative of primal (ap). In the
case of the sine function, it computes at * cos ap. Finally, the dr computes what we
have shown in Table 3-3. It computes adjoint (ca) * derivative of primal (a). Therefore,
here it computes !ca * cos a. The dereference operator !ca is used because the adjoint value in the DR type is a reference that can be updated.
module struct
let label = "sin"
let ff_f a = F A.Scalar.(sin a)
let ff_arr a = Arr A.(sin a)
let df _cp ap at = at * cos ap
let dr a _cp ca = !ca * cos a
end
A similar template can be applied to other operators that take one input and produce one output, such as the square root (sqrt), as shown in the next module. The derivative rule for the square root is (√x)′ = 1/(2√x).
module struct
let label = "sqrt"
let ff_f a = F A.Scalar.(sqrt a)
let ff_arr a = Arr A.(sqrt a)
let df cp _ap at = at / (pack_flt 2. * cp)
let dr _a cp ca = !ca / (pack_flt 2. * cp)
end
However, things get more complicated once an operator needs to deal with more
than one input. The problem is that for each of these four computation rules, we need to
consider multiple possible cases. Take the divide operation as an example. For a simple
primal value computation, we need to consider four cases: both inputs are scalar, both
are ndarray, and one of them is ndarray and the other is scalar. It corresponds to four
rules: ff_aa, ff_bb, ff_ab, and ff_ba. For the forward computation of the tangent of a/b, we also need to consider three cases:
• df_da corresponds to the derivative when the second input is constant:

(a(x)/b)′ = a′(x)/b    (3.6)
• df_db corresponds to the derivative when the first input is constant:

(a/b(x))′ = −a·b′(x)/b(x)²    (3.7)

• df_dab corresponds to the derivative when both inputs are functions of x:

(a(x)/b(x))′ = (a′(x) − (a(x)/b(x))·b′(x)) / b(x)    (3.8)
Expressing the rules for computing the reverse mode is more straightforward. If both inputs a and b are nonconstant, then the function dr_ab computes the adjoints ȳ·∂y/∂a and ȳ·∂y/∂b, where y = a/b and ȳ is the adjoint of the output. Thus, dr_ab returns two values: the first is ȳ/b (!ca / b), and the second is −ȳ·a/b² (!ca * (neg a / (b * b))). In the code, _squeeze_broadcast x s is an internal helper function that squeezes array x so that it has shape s. If one of the inputs is constant, then we can just omit the corresponding result, as shown in dr_a and dr_b.
module struct
let label = "div"
let ff_aa a b = F A.Scalar.(div a b)
let ff_ab a b = Arr A.(scalar_div a b)
let ff_ba a b = Arr A.(div_scalar a b)
let ff_bb a b = Arr A.(div a b)
let df_da _cp _ap at bp = at / bp
let df_db cp _ap bp bt = neg bt * cp / bp
let df_dab cp _ap at bp bt = (at - (bt * cp)) / bp
let dr_ab a b _cp ca =
( _squeeze_broadcast (!ca / b) (shape a)
, _squeeze_broadcast (!ca * (neg a / (b * b))) (shape b) )
let dr_a a b _cp ca = _squeeze_broadcast (!ca / b) (shape a)
let dr_b a b _cp ca = _squeeze_broadcast
(!ca * (neg a / (b * b))) (shape b)
end
Another operator with pair inputs is the power function pow, whose forward and reverse rules follow d(aᵇ)/da = b·aᵇ⁻¹ and d(aᵇ)/db = aᵇ·ln a:
module struct
let label = "pow"
let ff_aa a b = F A.Scalar.(pow a b)
let ff_ab a b = Arr A.(scalar_pow a b)
let ff_ba a b = Arr A.(pow_scalar a b)
let ff_bb a b = Arr A.(pow a b)
let df_da _cp ap at bp = at *
(ap ** (bp - pack_flt 1.)) * bp
let df_db cp ap _bp bt = bt * cp * log ap
let df_dab cp ap at bp bt =
((ap ** (bp - pack_flt 1.)) * (at * bp)) +
(cp * bt * log ap)
let dr_ab a b cp ca =
( _squeeze_broadcast (!ca *
(a ** (b - pack_flt 1.)) * b) (shape a)
, _squeeze_broadcast (!ca * cp * log a) (shape b) )
let dr_a a b _cp ca =
_squeeze_broadcast (!ca *
(a ** (b - pack_flt 1.)) * b) (shape a)
let dr_b a b cp ca = _squeeze_broadcast
(!ca * cp * log a) (shape b)
end
In the end, we need to build a sin : t -> t operator, which accepts data of the AD type t and returns output of type t. The function we need is op_siso, sketched below.
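A sketch of op_siso, following the three-case logic described below (the real builder also threads tags carefully for nested differentiation):

let op_siso ~ff ~fd ~df ~r a =
  match a with
  | DF (ap, at, ai) ->
    let cp = fd ap in
    DF (cp, df cp ap at, ai)                        (* forward mode: new primal and tangent *)
  | DR (ap, _, _, _, ai, _) ->
    let cp = fd ap in
    DR (cp, ref (zero cp), r a, ref 0, ai, ref 0)   (* reverse mode: remember how to extend the list *)
  | ap -> ff ap                                     (* plain scalar or ndarray *)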
These names may seem enigmatic. Here, the fd x function calculates the primal
value of x. ff x performs forward computation on the two basic types: scalar and
ndarray. The df cp ap at function computes the tangents in forward mode. Finally,
the function r computes the op part in the type, which “remembers” how to build up
the graph in the form of a list. To put them together, the basic logic of this function goes
like this:
• If the input is a DF type, produce a new DF type after calculating the
primal and tangent in forward mode.
• If the input is a DR type, produce a new DR type, with its knowledge
about how to compute adjoints and how to build up the list.
• Otherwise, it’s the basic type, scalar or ndarray; perform simple
forward computation on it.
Note that in the newly constructed DR value, aside from the primal value and the op field, the remaining fields, including the adjoint accumulator, fanout, and tracker, are all set to zero. That is because only the computation graph is constructed in the forward pass; the calculation of adjoints does not happen in this step.
So the next question is: for the sine function, how can we get fd, ff, etc.? Luckily, the Siso module shown previously, which specifies the various calculation rules, already provides all the ingredients required. Assume we have named this Siso sine module S; then we have the forward computation on the two basic types:
let ff = function
| F a -> S.ff_f a
| Arr a -> S.ff_arr a
| _ -> error_uniop label a
And the r function looks like what we have introduced in Section 3.2, using the dr
function from module S to specify how to construct the list.
let r a =
let adjoint cp ca t = (S.dr (primal a) cp ca, a) :: t in
let register t = a :: t in
let label = S.label, [ a ] in
adjoint, register, label
let rec f a =
let open S in
(* define ff and r as stated above *)
let fd a = f a in
op_siso ~ff ~fd ~df:S.df ~r a
Put them together, and here is the final function that accepts a module and builds an
operator:
let build_siso =
(* define op_siso *)
fun (module S : Siso) ->
(* define f *)
f
The code is concise, easy to read, and less prone to various possible errors in coding.
To build another “siso” operator, such as a square root, we only need to change the rules:
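For instance, assuming the square-root rule module from earlier is bound to the name Sqrt_rules (an illustrative assumption), the operator is obtained with a single call:

let sqrt = Builder.build_siso (module Sqrt_rules : Builder.Siso)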
Here, we have only used the simplest builder template, SISO (single input, single output), as an example. The AD module also includes other templates:
• SIPO: Single input and pair outputs, such as the linear algebra
operation qr for QR factorization
• SITO: Single input and three outputs, such as the SVD factorization
• SIAO: Single input and array outputs, such as the split function that
splits input ndarray into multiple ones
• PISO: Pair inputs and single output, such as add and mul
These templates can become quite complex. For example, in building the add function, the builder must dispatch over every combination of possible input types (scalar, ndarray, DF, or DR for each argument), so the corresponding builder function runs to several dozen match cases.
3.4 API
The previous section introduces AD operators, the building blocks to construct an AD
computation graph. The next thing we need is an “engine” that begins the differentiation
process. For that purpose, we first introduce several low-level APIs provided by the AD
module and explain how they are used to build up user-friendly advanced APIs such as
diff and grad.
Low-Level APIs
We differentiate between the two differentiation modes: forward mode and backward
mode. As explained in the previous section, if an input x is of type DF, then by applying
operations such as sin x, a computation graph is constructed, and the primal and
tangent values are also computed during this process. All we need to do is to retrieve the
required value once this process is finished. To start a forward mode differentiation, we need to create a DF value as the initial input, using the primal value, the initial tangent (equal to 1), and an integer tag as arguments:
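A sketch of make_forward: it simply wraps these three arguments into a DF value.

let make_forward p t i = DF (p, t, i)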
For example, if we are to calculate the derivative of f(x) = sin(x²) at x = 2, we can first create an initial point as follows.
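A sketch of the whole forward-mode workflow; the binding names x, y, and d are illustrative.

let f x = Maths.(sin (sqr x))                              (* f(x) = sin(x^2) *)
let x = make_forward (pack_flt 2.) (pack_flt 1.) (tag ())  (* initial point at x = 2 *)
let y = f x                                                (* primal and tangent flow forward together *)
let d = tangent y |> unpack_flt                            (* df/dx evaluated at x = 2 *)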
That’s it. Once the computation y is constructed, we can directly retrieve the tangent
value using the tangent function.
The backward mode is a bit more complex. Remember that it consists of two passes:
one forward and one backward. From the previous section, we know that once the graph
is constructed, the primal data are calculated, but the adjoint values are all set to zero.
Therefore, we need some extra mechanism to pump the computation flow backward to
calculate adjoint values. Here is an example to use low-level APIs to compute derivatives
in the reverse mode:
open AD
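Continuing the listing above, a sketch of the reverse-mode workflow; the binding names x', y, and d are illustrative.

let f x = Maths.(sin (sqr x))                  (* f(x) = sin(x^2) *)
let x' = make_reverse (pack_flt 2.) (tag ())   (* wrap the input as a DR value *)
let y = f x'                                   (* forward pass: build the graph, compute primals *)
let _ = reverse_prop (pack_flt 1.) y           (* backward pass: propagate adjoints *)
let d = adjval x' |> unpack_flt                (* df/dx evaluated at x = 2 *)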
The problem to solve is still the same: calculate the derivative of f = sin (x2) at x = 2;
the only difference is that we use the reverse mode this time. Let’s explain this example
line by line. First, we still need to build an initial operator with make_reverse.
let make_reverse p i =
let adjoint _cp _ca t = t in
let register t = t in
let label = "Noop", [] in
DR (p, ref (zero p), (adjoint, register, label), ref 0, i, ref 0)
The make_reverse function constructs a DR type data with a given primal value. The
rest of its fields are all set to zero. It does two things: first, it wraps input x into a value
of type t for Algodiff to process; second, it generates a unique tag for the input so that
input numbers can be nested. Next, calling f x' constructs the computation graph of f,
capturing the primal values and knowledge about how to calculate adjoints all in the DR
type data y.
Next, reverse_prop propagates the error back to the inputs:
let reverse_prop v x =
reverse_reset x;
reverse_push v x
It consists of two steps: first, reset all values in this graph to the initial status (reverse_reset); second, perform backward propagation (reverse_push). Both follow a recursive process.
let reverse_reset x =
let rec reset xs =
match xs with
| [] -> ()
| x :: t ->
(match x with
| DR (_cp, aa, (_, register, _), af, _ai, tracker) ->
aa := reset_zero !aa;
af := !af + 1;
tracker := succ !tracker;
if !af = 1 && !tracker = 1 then reset (register t) else reset t
| _ -> reset t)
in
reset [ x ]
The next function is reverse_push that is the core engine that drives the backward
propagation process. Its core idea is simple. It maintains a stack t of (adjoint value, AD
value) pairs. At each iteration, the push function takes one pair out of the head of stack.
The adjoint value v is added to the adjoint accumulator aa in the DR type node x. The
node also specifies an adjoint function that knows how to calculate adjoint values of its
parents, in the form of one or more (adjoint value, AD value) pairs. This process starts
with only one pair, which is the output DR type value of a whole computation. It finishes
when stack t is empty.
let reverse_push =
  let rec push xs =
    match xs with
    | [] -> ()
    | (v, x) :: t ->
      (match x with
      | DR (cp, aa, (adjoint, _, _), af, _ai, tracker) ->
        aa := reverse_add !aa v;
        (af := Stdlib.(!af - 1));
        (* once all fan-out contributions have arrived, push the parents onto the stack *)
        if !af = 0 && !tracker = 1
        then push (adjoint cp aa t)
        else push t
      | _ -> push t)
  in
  fun v x -> push [ v, x ]
After this step, the gradient of f is stored in the adjoint value of x', and we can retrieve the value using the adjval function.
High-Level APIs
Based on these low-level APIs, we can build higher-level, easier-to-use differentiation functions. The most commonly used one is diff in the AD module. Given a function f that maps one scalar value to another, we can calculate its derivative at point x by diff f x. For example, given the hyperbolic tangent function tanh, we can easily calculate its derivative at position x = 0.1, as follows:
open Algodiff.D
let f x = Maths.(tanh x);;
let d = diff f (F 0.1);;
Its implementation using the forward mode low-level API is quite simple:
let diff' f x =
if not (is_float x) then
failwith "input must be a scalar";
let x = make_forward x (pack_flt 1.) (tag ()) in
let y = f x in
primal y, tangent y
Another frequently used high-level API is grad, which computes the gradient of a function that maps a vector to a scalar. For example, the gradient at each point on a surface consists of three elements representing the partial derivatives along the x, y, and z axes. This vector indicates the direction in which the function has the largest magnitude of change. Its implementation uses the standard reverse mode:
let grad' f x =
let x = make_reverse x (tag ()) in
let y = f x in
assert (is_float y);
reverse_reset y;
reverse_push (pack_flt 1.) y;
primal y, x |> adjval
Closely related is the jacobian API. For a vector-valued function y = f(x) that maps n inputs x₀, …, xₙ₋₁ to m outputs y₀, …, yₘ₋₁, the Jacobian collects all first-order partial derivatives into an m × n matrix:

        | ∂y₀/∂x₀     ∂y₀/∂x₁     …   ∂y₀/∂xₙ₋₁   |
J(y) =  | ∂y₁/∂x₀     ∂y₁/∂x₁     …   ∂y₁/∂xₙ₋₁   |
        | …                                        |
        | ∂yₘ₋₁/∂x₀   ∂yₘ₋₁/∂x₁   …   ∂yₘ₋₁/∂xₙ₋₁ |
The intuition behind the Jacobian is similar to that of the gradient. At a particular point in the domain of the target function, the Jacobian shows how the output vector changes given a small change in the input vector. Its implementation builds on jacobianv', which computes the product of the Jacobian with a given vector v using a single forward pass:
let jacobianv' f x v =
if shape x <> shape v
then failwith "jacobianv': vector not the same dimension as input";
let x = make_forward x v (tag ()) in
let y = f x in
primal y, tangent y
The advanced APIs compose conveniently and can be used to build more complex ones. For example, the second-order derivative of a function f can be implemented as g = f |> diff |> diff. Another example is the hessian API. Given a multivariate function that maps n input variables to a scalar, this function calculates its second-order derivatives as a matrix. Its implementation is based on the Jacobian, as sketched below.
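A minimal sketch of that idea, assuming the grad and jacobian APIs shown above: the Hessian of f is the Jacobian of its gradient.

let hessian f x = (f |> grad |> jacobian) x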
These derivatives combine naturally. For example, Newton's method for finding a local minimum of a function f iterates

xₙ₊₁ = xₙ − α H⁻¹ ∇f(xₙ),    (3.9)

where H is the Hessian of f at xₙ and α is the step size.
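A sketch of such a solver built on these APIs; the optional arguments eta (step size) and eps (stopping threshold) and the use of gradhessian are assumptions, not the original listing.

let rec newton ?(eta = F 0.01) ?(eps = 1e-6) f x =
  let g, h = gradhessian f x in
  if unpack_flt Maths.(l2norm' g) < eps then x
  else newton ~eta ~eps f Maths.(x - (eta * g *@ inv h))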
open Algodiff.D
let _ =
let f x = Maths.(cos x |> sum') in
newton f (Mat.uniform 1 2)
We end this section with a subtle issue that arises in nested differentiation, often called perturbation confusion. Consider the function

f(x) = x · d(x + y)/dy,

that is, a function that contains another derivative function. It initially seems straightforward, and we don't even need a computer's help: since d(x + y)/dy = 1, f(x) = x, and so f′(x) = x′ = 1. Unfortunately, applying a simple implementation without tags leads to the wrong answer.
# let diff f x =
    match x with
    | DF (_, _) ->
      f x |> tangent
    | DR (_, _, _) ->
      let r = f x in
      (* propagate the adjoint back and read it off the input *)
      reverse_prop (pack_flt 1.) r;
      adjval x;;
# let f x =
let g = diff (fun y -> add_ad x y) in
mul_ad x (make_forward (g (make_forward 2. 1.)) 1.);;
val f : t -> t = <fun>
The result is 4 at point (2, 2), but as we calculated previously, the result should be 1 at any point. What has gone wrong? The answer is a bit tricky. Note that x = DF(2, 1). Its tangent value equals 1, which means dx/dx = 1. Now if we continue to use this same x value inside function g, whose variable is y, the same x = DF(2, 1) is incorrectly interpreted by the AD engine as dx/dy = 1. When used within function g, x should actually be treated as DF(2, 0). That's where tagging comes to help: it solves the nested derivative problem by distinguishing each derivative calculation and its associated perturbations with a unique tag for each usage of the derivative operator.
Lazy Evaluation
We have seen how separating building template and operation definitions makes it
convenient to add new operations, simplifying code and improving productivity. But
it comes with a price: efficiency. Imagine a large calculation that contains thousands
of operations, with one operation occurring many times. Such situations are actually
quite common when using AD with neural networks where large computation graphs
are created that use functions such as add and mul many hundreds of times. With the
Builder approach described earlier, the operation will be recreated every time it is
used, which is rather inefficient. Fortunately, we can simply use OCaml’s lazy evaluation
mechanism to perform caching.
OCaml provides a built-in function lazy that accepts an input of type 'a and returns
a value of type 'a lazy_t where the computation of the value of type 'a has been
delayed. This lazy expression won’t be evaluated until it is called by Lazy.force, and the
first time it is called, the expression is evaluated and the result is cached. Subsequent
applications of Lazy.force will simply return the cached result without further
reevaluation. Here is an example of lazy evaluation in OCaml:
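A small toplevel illustration; lazy_x is the name used in the explanation below.

# let lazy_x = lazy (Printf.printf "evaluated\n"; 42);;
val lazy_x : int lazy_t = <lazy>
# Lazy.force lazy_x;;
evaluated
- : int = 42
# Lazy.force lazy_x;;
- : int = 42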
In this example, we can see that building lazy_x does not evaluate its content; evaluation is delayed until the first Lazy.force. After that, every time force is called, only the cached value is returned; the expression itself, including the printf call, is not re-evaluated. Now let's come back to the AD module in Owl. Imagine that we need to add support for the sin operation. A direct, eager definition of sin simply calls the builder:
open Algodiff.D
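A sketch of that eager definition, assuming the sine rule module from Section 3.3 is bound to the name Sin_rules:

let sin = Builder.build_siso (module Sin_rules : Builder.Siso)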
However, we can instead use lazy evaluation to actually build the implementation
and benefit from the efficiency gain of the caching it provides.
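A sketch of the lazy variant, under the same assumption: the builder call is wrapped in lazy, so it is evaluated at most once and its result is cached.

let _sin = lazy (Builder.build_siso (module Sin_rules : Builder.Siso))
let sin x = Lazy.force _sin x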
In this way, regardless of how many times this sin function is called in a massive
computation graph, the Builder.build_siso process is only evaluated once.
Extending AD
A significant benefit of the module design described earlier is that it can be easily extended by providing modules representing new functions. For example, suppose that the AD system did not support the sine function sin x, whose derivative is sin′(x) = cos(x). Including this function is a simple matter of defining the necessary functions for calculating primal, tangent, and adjoint values in a module and applying the relevant function from the Builder module – in this case, build_siso for building "single input, single output" functions.
open Algodiff.D
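A sketch of such an extension, using the operator name new_sin_ad from the usage example below; A stands for the underlying ndarray module, as in the rule modules shown earlier in this chapter.

let new_sin_ad =
  let module S = struct
    let label = "new_sin"
    let ff_f a = F A.Scalar.(sin a)
    let ff_arr a = Arr A.(sin a)
    let df _cp ap at = Maths.(at * cos ap)
    let dr a _cp ca = Maths.(!ca * cos a)
  end
  in
  Builder.build_siso (module S : Builder.Siso)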
We can directly use this new operator as if it is a native operation in the AD module.
For example:
# let f x1 x2 =
    let x1 = F x1 in
    let x2 = F x2 in
    Maths.(div (cos x1) (new_sin_ad x2));;
val f : float -> float -> t = <fun>
Graph Utility
Though not core functions, various utility functions provide convenience to users, for
example, tools to visualize the computation graph built up by AD. They come in handy
when we are trying to debug or understand how AD works. The core of the visualization
function is a recursive traverse routine:
let _traverse_trace x =
let nodes = Hashtbl.create 512 in
let index = ref 0 in
(* local function to traverse the nodes *)
let rec push tlist =
match tlist with
| [] -> ()
| hd :: tl ->
if Hashtbl.mem nodes hd = false
then (
let op, prev =
match hd with
| DR (_ap, _aa, (_, _, label), _af, _ai, _) -> label
| F _a -> Printf.sprintf "Const", []
| Arr _a -> Printf.sprintf "Const", []
| DF (_, _, _) -> Printf.sprintf "DF", []
in
(* check if the node has been visited before *)
Hashtbl.add nodes hd (!index, op, prev);
index := !index + 1;
push (prev @ tl))
else push tl
in
The _traverse_trace and its related functions are used to convert the computation
graph generated in backward mode into human-readable format. It initializes variables
for tracking nodes and indices, then iterates the graph and puts required information
into a hash table. With some extra code, the parsed information can be displayed on a
terminal or be converted into other formats that are suitable for visualization, such as
the dot format by Graphviz.
For example, the qr function from the underlying module returns three outputs, while the AD template for QR factorization expects only the pair (q, r), so a thin wrapper adapts the interface:
module S = struct
include Owl_dense_ndarray.S
let qr a =
let q, r, _ = qr a in
q, r
...
end
end
These components all rely on the fundamental computation module A. The Core
module itself is built using a functor, with the ndarray module as the parameter. Its
interface is specified in owl_algodiff_core_sig.ml, as follows. It includes the definition of the basic types and the operations that can be applied to them.
include Owl_algodiff_types_sig.Sig
with type elt := A.elt and type arr := A.arr
Next, the operators such as sin are built using the Core module as a parameter. As
we have explained, first the Builder module works as a factory that assembles various
operators by providing different templates such as siso, including a type definition of
the template and the function to build operators.
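A sketch of the Siso template signature, inferred from the rule modules shown earlier in this chapter:

module type Siso = sig
  val label : string
  val ff_f : A.elt -> t
  val ff_arr : A.arr -> t
  val df : t -> t -> t -> t
  val dr : t -> t -> t ref -> t
end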
let build_siso =
  ...
Then, based on Core and Builder, the operator module contains all the operators built from the builder functions. They are categorized into different submodules such as Maths and Linalg.
module NN = struct
...
end
...
end
3.7 Summary
In this chapter, we discussed the design of one of the core modules in Owl: the algorithmic differentiation module. We started from its basic theory and the differences among the three types of differentiation. Then we presented the overall architecture of the AD module in Owl and explained several parts in detail: the definition of types in this system, the operators, and the high-level APIs built on top of them. We also discussed the more subtle issues that arise when building an industrial-strength AD engine, such as avoiding the perturbation confusion problem, using lazy evaluation to improve performance, and visualizing the computation graph. Finally, we explained how the AD system is built upon the Ndarray module in Owl.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 4
Mathematical Optimization
Mathematical optimization is the process of searching for optimal values from a selection of parameters, based on a certain metric. It can be formalized as follows:

minimise  f(x)
subject to  gᵢ(x) < bᵢ,  i = 1, 2, …, n.    (4.1)

Here, the vector x = (x₁, x₂, …, xₙ) is the optimization variable, and the function f : ℝⁿ → ℝ is the target function. The functions gᵢ : ℝⁿ → ℝ, i = 1, 2, …, n are the constraints, with constants bᵢ being the constraint boundaries. The target of solving optimization problems is to find x∗ to minimize f.
An optimization problem aims to find a solution that minimizes some quantity;
therefore, it arises in a wide range of disciplines, such as finance, engineering, computer
science, etc. For example, in portfolio management in the finance industry, an optimal
solution is required to divide the given total capital into n types of investments, where xi
is the amount of capital invested in financial asset i. The target might be to maximize the expected return or to minimize the risk. The constraints might require that the smallest return be larger than a predefined value, etc.
An optimization problem can be categorized into multiple types. The general form
in Eq. 4.1 contains several constraints. If there are no constraints, the problem is called
an unconstrained optimization; otherwise, it's a constrained optimization problem. From another perspective, some optimization problems aim to find the global minimum (e.g., minimize f(x) = x²), while others only need to find an optimum within a certain range (e.g., minimize f(x) = sin(x) in the range [0, 2π]). In this chapter, and in the corresponding module in Owl, we focus on unconstrained and local optimization problems. Specifically, we have implemented one of the most important optimization methods: gradient descent.
4.1 Gradient Descent
The gradient descent method is one of the most commonly used families of iterative optimization processes. Its basic idea is to start from an initial value and then move along a certain search direction by a certain step size to decrease the function value, repeating until it converges to a local minimum. We can thus describe the nth iteration of the descent method as follows:

xₙ₊₁ = xₙ + μₙ dₙ.

Repeat this process until a stopping condition is met, such as the update being smaller than a threshold. Among the descent methods, gradient descent is one of the most widely used algorithms to perform optimization and the most common way to optimize neural networks. Based on the preceding descent process, a gradient descent method uses the function gradient to decide its direction d and can be described as follows:

xₙ₊₁ = xₙ − μ ∇f(xₙ).
Here, ∇ denotes the gradient. The distance μ moved along the chosen direction is also called the learning rate of this iteration. In a gradient descent process, when searching for the minimum, the search always moves along the negative gradient direction, that is, against the gradient. The gradient can be calculated using the algorithmic differentiation module introduced in Chapter 3. That's why the whole Optimisation module is built on Algodiff.
The implementation of gradient descent according to this definition is plain enough.
For example, for a certain differentiable function f that does have one global minimal
point, the following simple Owl code would do:
module N = Dense.Ndarray.D
open Algodiff.D
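For the loop below to run, the target function, learning rate, iteration count, and starting point must be in scope. A minimal sketch with hypothetical values (f, alpha, n, and x are illustrative assumptions, not from the original listing):

let f x = Maths.(sum' (sqr x))       (* a simple convex target: the sum of squares *)
let alpha = 0.1                      (* learning rate *)
let n = 100                          (* number of iterations *)
let x = ref (N.uniform [| 1; 3 |])   (* starting point, a plain ndarray held in a ref *)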
let _ =
  for i = 1 to n - 1 do
    let u = grad f (pack_arr !x) |> unpack_arr in
    x := N.(sub !x (scalar_mul alpha u))
  done;;
It's basically a line-by-line translation of the process described before. You should be
familiar with the functions from the AD module, such as grad for calculating gradients
and unpack_arr for converting an AD type ndarray into a normal one. However, there
are a lot of details that should be attended to if we need to implement a robust gradient
descent method, such as how the learning rate should change, how other variant
methods should be incorporated, etc. Next, we will introduce several building blocks for
this method and the structure of the Optimise module in Owl.
4.2 Components
The core of the Optimise module in Owl abstracts several aspects of the gradient descent
method in applications: learning rate, gradient method, momentum, etc. Each of them is
represented as a submodule. All computation in these modules relies on the AD module.
The following code shows an outline of this optimization module. It is designed as a
functor parameterized by the AD module. In this section, we introduce a part of these
submodules and how they implement different methods.
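A sketch of that outline (the functor parameter signature name is an assumption):

module Make (Algodiff : Owl_algodiff_generic_sig.Sig) = struct
  module Learning_Rate = struct (* learning rate methods *) end
  module Gradient      = struct (* gradient descent variants *) end
  module Momentum      = struct (* momentum methods *) end
  module Batch         = struct (* batching strategies *) end
  module Checkpoint    = struct (* optimisation state and checkpointing *) end
  module Params        = struct (* user-facing parameter record *) end

  (* optimisation entry functions built on the submodules above *)
end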
Learning Rate
When training a machine learning model, the learning rate is arguably the most important
hyperparameter that affects the training speed and quality. It specifies how much the
model weight should be changed given the estimated error in each training round. A large
learning rate may lead to suboptimal solutions and unstable training processes, whereas a
small rate may result in a long training time. That’s why choosing a proper learning rate is
both crucial and challenging in model training. There exist various methods to decide the
learning rate, and we have incorporated them in the Learning_rate module, as follows:
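A sketch of the type, restricted to the methods discussed in this section (the original listing may differ):

module Learning_Rate = struct
  type typ =
    | Adagrad of float                   (* base rate *)
    | Const   of float                   (* constant rate *)
    | RMSprop of float * float           (* base rate, decay factor k *)
    | Adam    of float * float * float   (* base rate, beta1, beta2 *)

  (* run, default, update_ch, and to_string are discussed in the text below *)
end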
This module consists of the type definition of the learning rate methods and the functions that can be applied to them. The Learning_Rate.typ type covers four different algorithms.¹ For each method, it specifies the parameters required.
Let's look at how these methods are implemented to better understand the code. The Const method is the most straightforward: it uses a constant learning rate throughout the whole training process. In typ, its only parameter is this learning rate as a float number. Next, the run function takes a learning rate type as input and returns a function that accepts three inputs: the iteration number i, the gradient g, and the parameters used in this method c (an array of AD values). This returned function specifies how the learning rate should be changed. In the case of the Const method, the rate does not change, so it simply returns the learning rate itself. Recall from Chapter 3 that _f wraps a float number into an AD scalar type. The default and to_string functions are helpers: the first generates a learning rate method with default parameters, and the second prints the parameter information of a given method.
The Adagrad method is a bit more complex. As the name suggests, it changes the learning rate adaptively: a larger update step size for parameters associated with infrequent features and a smaller learning rate otherwise. Its parameter update at the t-th iteration follows the rule

θₜ₊₁ = θₜ − μ · gₜ / √(Gₜ + ε).    (4.2)
Here, G is the sum of the squares of the corresponding gradients gt’s up to time step t.
This equation consists of two parts. The first is how the learning rate should be updated.
It is specified in the run function. The following code:
fun _ _ c ->
Maths.(_f a / sqrt (c.(0) + _f 1e-32))
¹ Actually, there are more learning rate methods implemented, e.g., Adam optimization; they are omitted here to keep the code demonstration clear.
corresponds to μ / √(G + ε). The c array contains the parameters used in updating μ, which in this case is G. The second part is how to update this parameter. It is specified in the update_ch function. In this case, the rule is Gₜ = Σᵢ₌₁ᵗ gᵢ², or equivalently

Gₜ = Gₜ₋₁ + gₜ²

at each iteration. The second element in this array is not used, so it remains the same.
The RMSprop method is an adaptive learning rate method proposed by Geoff Hinton. It is an extension of Adagrad and follows the update rule in Eq. 4.2, except that here

Gₜ = k·Gₜ₋₁ + (1 − k)·gₜ².    (4.3)

Here, k is a decay factor that is normally set to 0.9. Therefore, the run function stays the same, while the update_ch function changes, as sketched below.
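A sketch of update_ch covering the Adagrad and RMSprop rules just described; c.(0) holds the accumulated G, and c.(1) is left untouched by these two methods:

let update_ch typ g c =
  match typ with
  | Adagrad _      -> [| Maths.(c.(0) + (g * g)); c.(1) |]
  | RMSprop (_, k) -> [| Maths.((_f k * c.(0)) + ((_f 1. - _f k) * (g * g))); c.(1) |]
  | _              -> c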
The Adam method maintains exponentially decaying moving averages of past gradients (the first moment m) and of past squared gradients (the second moment v), and its update_ch stores these two accumulators in c. Note therefore that the meaning of c is not the same as in the Adagrad and RMSprop methods. The next thing is to specify how to update the learning rate. Adam's update rule is
θₜ = θₜ₋₁ − μ · m̂ₜ / (√v̂ₜ + ε),    (4.4)

where

m̂ₜ = mₜ / (1 − β₁ᵗ),   v̂ₜ = vₜ / (1 − β₂ᵗ).
Therefore, the run function of Adam returns a function that utilizes all three
parameters:
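A sketch of the Adam case, written as a stand-alone function run_adam (a hypothetical name); i is the iteration count, g the gradient, and c.(0), c.(1) the first and second moment accumulators maintained by update_ch:

let run_adam a b1 b2 i g c =
  let t = _f (float_of_int i) in
  Maths.(
    let m_hat = c.(0) / (_f 1. - (_f b1 ** t)) in
    let v_hat = c.(1) / (_f 1. - (_f b2 ** t)) in
    (_f a * m_hat / (sqrt v_hat + _f 1e-32)) / (g + _f 1e-32))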
Note the final item / (g + _f 1e-32). You might notice that this term does not appear in Eq. 4.4. The reason is that our framework always applies the update pattern

θₜ = θₜ₋₁ − run(μ, …) · g,

that is, the value returned by run is multiplied by the gradient g at the end. Since Eq. 4.4 does not end with this multiplication by g, we divide it back out (with a small epsilon) inside the run function.
So far, we have introduced multiple aspects of a learning rate method, most notably run and update_ch, but we have not yet explained how they will be used in an
optimization process. We will show that in the next section. For now, let’s move on to
another aspect of optimization: the gradient descent algorithm.
Gradient
We have provided the framework of gradient methods in Section 4.1. However, there
exist many variants of gradient descent algorithms. They are included in the Gradient
module. The code is shown as follows. Its structure is similar to that of Learning_rate.
The typ contains all the supported gradient methods; these methods do not carry type
parameters. The to_string function prints helper information for each method.
| DaiYuanCG ->
fun _ _ g p g' ->
let y = Maths.(g' - g) in
let b = Maths.(l2norm_sqr' g' / sum' (p * y)) in
Maths.(neg g' + (b * p))
| NewtonCG ->
fun f w _ p g' ->
let hv = hessianv f w p |> Maths.transpose in
let b = Maths.(hv *@ g' / (hv *@ p)) in
Maths.(neg g' + (p *@ b))
| Newton ->
fun f w _ _ _ ->
let g', h' = gradhessian f w in
Maths.(neg (g' *@ inv h'))
The key component is the run function. Remember that a descent optimization method is built around the iteration

xₙ₊₁ = xₙ + μ dₙ,

where dₙ is the exploration direction at the next step. The run function specifies the form of dₙ. Take the classic gradient descent method as an example; the direction is just
the opposite of the gradient. Therefore, dₙ = −gₙ, and thus the run function returns another function:
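Written as a stand-alone sketch (the name run_gd is illustrative), the GD case simply negates the current gradient:

let run_gd _f _w _g _p g' = Maths.neg g'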
This function takes five parameters as inputs, and the last one is the current gradient,
which is the only parameter used in this case.
Conjugate gradient descent: A problem with gradient descent is that it may perform
badly on certain types of functions. For example, if a function is steep and narrow, then
gradient descent will take many very small steps to reach the minimum, bouncing back
and forth, even if the function is in quadratic form. This can be fixed by the conjugate
gradient (CG) method, which was first proposed by Hestenes and Stiefel [29].
The CG method is similar to gradient descent, but the new direction at each step does not completely follow the new gradient; instead, it is conjugated to the old gradients and to all previous directions traversed. If both methods start from the same position, gradient descent would follow the locally steepest direction, which can be a poor choice on such a steep and narrow function, whereas the conjugate method prefers to follow the previous momentum a little bit. As a result, the conjugate method follows a direction in between, and this new direction finds the minimum more efficiently than the gradient descent method.
The conjugate gradient descent is also a family of optimization methods. Instead of using the opposite of the gradient −∇f(xₙ) as the direction, they follow the procedure in the nth iteration

dₙ = −gₙ + βₙ dₙ₋₁,

where gₙ = ∇f(xₙ) and βₙ is a scalar that each CG variant computes differently. Based on this framework, we can take a look at the five parameters of the returned function, f w g p g'.
Here, g and p are gradient and direction vectors from the previous round; g' is the
gradient of the current round. f is the function to be optimized itself, with input data w.
The classic Hestenes–Stiefel rule computes this scalar as

βₙ = gₙᵀ(gₙ − gₙ₋₁) / (dₙ₋₁ᵀ(gₙ − gₙ₋₁)),    (4.5)

where gₙ, dₙ, etc. are assumed to be vectors. Note how this formula and the CG framework utilize information such as the gradient and direction from the previous iteration (gₙ₋₁ and dₙ₋₁). We can implement Eq. 4.5 as
let b =
let y = Maths.(g' - g) in
Maths.(sum' (g' * y) / (sum' (p * y) + _f 1e-32))
It uses the sum' function to compute the inner products, and the extra epsilon value 1e-32 ensures that the denominator is not zero.
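Putting the pieces together, the full CG case of run looks roughly like the following sketch (run_cg is an illustrative name):

let run_cg _f _w g p g' =
  let y = Maths.(g' - g) in
  let b = Maths.(sum' (g' * y) / (sum' (p * y) + _f 1e-32)) in
  Maths.(neg g' + (b * p))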
In the nonlinear conjugate gradient method (NonlinearCG) [21],

βₙ = gₙᵀgₙ / (gₙ₋₁ᵀgₙ₋₁).

Here, l2norm_sqr' g calculates the square of the L2 norm (or Euclidean norm) of all elements in g, which is gᵀg.
Similarly, in the conjugate gradient method proposed by Dai and Yuan (DaiYuanCG) in [16],

βₙ = gₙᵀgₙ / (dₙ₋₁ᵀ(gₙ − gₙ₋₁)).
let b =
let y = Maths.(g' - g) in
Maths.(l2norm_sqr' g' / sum' (p * y))
Finally, the Newton method uses both the gradient and the Hessian to determine the direction: dₙ = −gₙ H⁻¹, where H is the Hessian of f, that is, the second-order derivative of f, and gₙ is treated as a row vector, matching the code. The code is a direct translation of this equation:
fun f w _ _ _ ->
let g', h' = gradhessian f w in
Maths.(neg (g' *@ inv h'))
Momentum
The basic gradient descent process can be further enhanced by the momentum
mechanism. It allows some “inertia” in choosing the optimization search direction,
which utilizes previous direction information. It helps to reduce noisy gradient descent
that bounces in search direction. The code of the Momentum module is listed as follows. Its
key component is the run function.
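A sketch of the type, based on the three cases discussed below:

type typ =
  | Standard of float   (* momentum coefficient m *)
  | Nesterov of float   (* momentum coefficient m *)
  | None                (* no momentum *)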
Recall that in the basic structure of gradient descent, the change of value x at the nth iteration is

dₙ = −μ∇f(xₙ).

With the standard momentum method, the previous direction is mixed in:

dₙ = −μ∇f(xₙ) + m·dₙ₋₁.    (4.7)

The float number m is the momentum parameter that indicates the impact of the direction information from the previous iteration.
The run function in this module returns a function that takes two inputs: the
previous direction u and the current direction u' (calculated using any combination
of learning rate and gradient methods). Therefore, the momentum method described
earlier can be simply implemented as Maths.((_f m * u) + u'). This is the standard
momentum method. If we decide not to use any momentum (None), it simply returns the
current direction u'.
This module also supports the Nesterov Accelerated Gradient (Nesterov) method [40]. It makes a simple change to the standard momentum in Eq. 4.7: the momentum term is first applied to the parameter itself before computing the gradient ∇f, and then added again:

dₙ = −μ∇f(xₙ + m·dₙ₋₁) + m·dₙ₋₁.
Batch
There is one more submodule we need to mention: the Batch module. It is about
how the input data are divided into chunks and then fed into a training process. From
the previous introduction about gradient descent, you might assume the function
accepts scalar as input. However, in many applications, we should consider applying
optimization on a vector x. That means in calculating the gradients, we need to consider
using a group of data points instead of only one.
From the perspective of calculation, there is not much difference, and we can still
use all the data in calculating the gradients. However, one big application field of such
an optimization method is regression or, more broadly, machine learning, where there
could be millions of data points just to find the optima. We will talk about regression in
Section 4.4. In practice, computing the optimization over large quantities of input data at once can be infeasible due to hardware limits such as the memory size of the computer. Therefore, optimization for such problems is often repeated over several rounds of execution, each round called an epoch. In each epoch, the given input data are split into batches.
Each epoch can use one of several batch strategies, chosen in the run function, as sketched below:
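A sketch of the type and run function; get_chunk and draw_samples stand for assumed helper functions that slice or sample the inputs x and y:

type typ =
  | Full
  | Mini of int
  | Sample of int
  | Stochastic

let run typ x y i =
  match typ with
  | Full       -> x, y                  (* use all the provided data *)
  | Mini c     -> get_chunk x y i c     (* the i-th sequential chunk of size c *)
  | Sample c   -> draw_samples x y c    (* a random sample of size c *)
  | Stochastic -> draw_samples x y 1    (* a single random data point *)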
The Full strategy uses all the provided data. The Mini c and Sample c strategies both take c data points each time; the former chooses the data sequentially, and the latter does so randomly. Finally, the Stochastic method selects only one random data point from the existing ones.
Checkpoint
So far, we have introduced how the learning rate, gradients, and momentum modules
return functions which utilize information such as the gradient or direction of the
previous iteration. But where are they stored? The answer lies in the Checkpoint module.
It stores all the information during optimization for later use and saves them as files on
the hard disk if necessary. Its code is shown as follows:
type typ =
  | Batch of int              (* default checkpoint at every specified batch interval *)
  | Epoch of float            (* default checkpoint at every specified epoch interval *)
  | Custom of (state -> unit) (* customised checkpoint called at every batch *)
  | None
....
end
The state type includes fields that we have introduced so far. The ch is used in the
learning rate module and contains parameters to be updated from the previous iteration.
The gs is the gradient of the previous iteration, and ps is the direction of the previous
iteration. Both are used in the gradient methods. The us represents the direction
update of the previous iteration and is the parameter used in momentum methods.
Besides storing this information, there is also the stop boolean value, which indicates
optimization stops if set to true. It also contains other information, including the current
iteration progress in batch, the number of batches in each epoch, and the total number
of epochs to run.
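A sketch of the state record, reconstructed from init_state below and the description above; the real record holds additional bookkeeping fields:

type state =
  { mutable current_batch : int
  ; mutable batches_per_epoch : int
  ; mutable epochs : float
  ; mutable batches : int
  ; mutable stop : bool
  ; mutable gs : t array array          (* gradients from the previous iteration *)
  ; mutable ps : t array array          (* directions from the previous iteration *)
  ; mutable us : t array array          (* direction updates used by momentum *)
  ; mutable ch : t array array array    (* learning-rate parameters, e.g., Adagrad's G *)
  }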
The typ decides at what point the checkpoint should be executed. Batch means to
checkpoint at every specified batch interval. Epoch then checkpoints at every specified
epoch interval. Besides these two, the user can also supply a customized function that takes a state as input to decide the most appropriate time to checkpoint for a specific application.
...
let init_state batches_per_epoch epochs =
let batches = float_of_int batches_per_epoch *. epochs
|> int_of_float in {
current_batch = 1
; batches_per_epoch
; epochs
; batches
; stop = false
; gs = [| [| _f 0. |] |]
; ps = [| [| _f 0. |] |]
; us = [| [| _f 0. |] |]
; ch = [| [| [| _f 0.; _f 0. |] |] |]
}
The init_state returns initial values for the different fields in a state. The users
need to specify the number of epochs in optimization and the input data batches in one
epoch. The default_checkpoint_fun executes a function to save certain content in a
file. This save function should be defined by users. And similar to previous modules,
the to_string method provides a convenient print function to show the configuration
information about this module. Finally, the run function decides a suitable checkpoint
interval and executes the checkpoint function, either the default one or the customized
one provided by the user.
end
Params
The Params submodule is what brings all the other submodules together. It provides
an entry point for users to access various aspects of optimization. The code is shown as
follows:
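The type definition is not reproduced here; a sketch of the record, inferred from the default function and the config calls used later in this chapter, might look as follows (field types are assumptions).

type typ =
  { mutable epochs : float
  ; mutable batch : Batch.typ
  ; mutable gradient : Gradient.typ
  ; mutable learning_rate : Learning_Rate.typ
  ; mutable momentum : Momentum.typ
  ; mutable checkpoint : Checkpoint.typ
  ; mutable verbosity : bool
  (* the full record also carries loss, regularisation, and stopping
     fields, which appear in the Params.config calls later on *)
  }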
The Params type consists of the types of other submodules, such as Gradient.typ.
It also includes some other fields such as the number of epochs and a flag verbosity to
indicate if the full information of parameters should be printed out during optimization.
let default () =
{ epochs = 1.
; batch = Batch.Sample 100
; gradient = Gradient.GD
; learning_rate = Learning_Rate.(default (Const 0.))
; momentum = Momentum.None
; checkpoint = Checkpoint.None
; verbosity = true
}
let to_string p =
Printf.sprintf "--- Training config\n"
^ Printf.sprintf "epochs: %g\n" p.epochs
^ Printf.sprintf "batch: %s\n" (Batch.to_string p.batch)
^ Printf.sprintf "method: %s\n" (Gradient.to_string p.gradient)
^ Printf.sprintf
"learning rate: %s\n"
(Learning_Rate.to_string p.learning_rate)
^ Printf.sprintf "momentum: %s\n" (Momentum.to_string p.momentum)
^ Printf.sprintf "checkpoint: %s\n"
(Checkpoint.to_string p.checkpoint)
^ Printf.sprintf
"verbosity: %s\n"
(if p.verbosity then "true" else "false")
^ "---"
end
The other three functions are straightforward. default assigns default values to each
parameter, config sets parameter values using the given input, and to_string prints
existing values.
The minimise_fun function accepts three inputs: the function f to be minimized, the initial input x, and the optimization parameters params. It starts by instantiating the run functions of the various submodules introduced earlier.
let iterate xi =
let loss, g = grad' optz_fun xi in
loss |> primal', g, optz_fun
in
...
iterate defines operations in the ith iteration. It utilizes the Algodiff module to
compute the primal value loss of evaluating optz_fun at point xi and the corresponding
gradient g at that point.
The preceding code shows the outline of the optimization procedure. First, it
initializes a new state of the optimization process. Here, we set it to one batch per epoch.
Next, the code keeps updating the state during the body of the while loop until the stop
status is set to true. The optimization result x and state are finally returned. The state
contains various historical information as we have explained. Each iteration of the while
loop contains the following steps:
First, we execute iterate to get gradients. We can define the checkpoint of the
current progress; here, we provide an empty save function, which means the current state does not need to be saved to a file.
Next, we calculate the gradient descent direction p' using grad_fun, based on
gradient g'. Also, the learning rate parameter ch should be updated.
let u' =
Checkpoint.(Maths.(p' * rate_fun
state.current_batch g' state.ch.(0).(0)))
in
let u' = momt_fun Checkpoint.(state.us.(0).(0)) u' in
Then, the optimization direction is adjusted, first based on the learning rate and then
on momentum.
Finally, the values calculated in this iteration, such as the gradients, direction,
etc., are saved in the state for future use. That’s all for one iteration. Let’s look at one
example of optimization using gradient descent. Here, we use Himmelblau’s function; it
is often used as a performance test for optimization problems. The function contains two
inputs and is defined as in Eq. 4.8.
$$f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2. \qquad (4.8)$$
open Algodiff.D
module N = Dense.Ndarray.D
let himmelblau a =
  let x = Mat.get a 0 0 in
  let y = Mat.get a 0 1 in
  Maths.(((x ** (F 2.)) + y - (F 11.)) ** (F 2.) +
         ((x + (y ** (F 2.)) - (F 7.)) ** (F 2.)) |> sum')
First, let’s look at what the code would look like without using the Optimise module.
Let’s apply the gradient descent method according to its definition in Section 4.1.
Here, we use an initial starting point [-2., 0.]. The step size eta is set to 0.0001,
and the iteration number is 2000. Then we can perform the iterative descent process.
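For completeness, the setup assumed by the loop below can be sketched as follows: a holds the current point, traj accumulates the trajectory, and eta and n take the values stated above.

let a = ref (N.of_array [|-2.; 0.|] [|1; 2|])
let traj = ref (N.copy !a)
let eta = 0.0001
let n = 2000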
let _ =
for i = 1 to n - 1 do
let u = grad himmelblau (Arr !a) |> unpack_arr in
a := N.(sub !a (scalar_mul eta u));
traj := N.concatenate [|!traj; (N.copy !a)|]
done;;
We apply the grad method in the Algodiff module to the Himmelblau function
iteratively, and the updated data a is stored in the traj array. Utilizing the Plot
module in Owl, we can visualize this function and the optimization trajectory using the
following code:
let plot () =
let a, b = Dense.Matrix.D.meshgrid (-4.) 4. (-4.) 4. 50 50 in
let c = N.(add
  (pow_scalar (sub_scalar (add (pow_scalar a 2.) b) 11.) 2.)
  (pow_scalar (sub_scalar (add a (pow_scalar b 2.)) 7.) 2.)
) in
let h = Plot.create ~m:1 ~n:2 "plot_himm.pdf" in
Plot.subplot h 0 0;
Plot.(mesh ~h ~spec:[ NoMagColor ] a b c);
Plot.subplot h 0 1;
Plot.contour ~h a b c;
let vx = N.get_slice [[]; [0]] !traj in
let vy = N.get_slice [[]; [1]] !traj in
Plot.plot ~h vx vy;
Plot.output h
To solve the same problem, we can also use the minimise_fun function introduced in
the previous section. First, we set up the parameters:
let p = Owl_optimise.D.Params.default ()
let _ = p.epochs <- 10.
let _ = p.gradient <- Owl_optimise.D.Gradient.GD;;
It suffices to set the iteration limit epochs to something small, such as 10 or 20. We then set the gradient method to the classic gradient descent and execute the code, starting from the same initial values:
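A sketch of what the call might look like, reusing the himmelblau function and the starting point from the previous example; the exact argument order of minimise_fun is an assumption here.

let init = Arr (N.of_array [|-2.; 0.|] [|1; 2|])
let _ = Owl_optimise.D.minimise_fun p himmelblau init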
This function outputs execution logs to track the intermediate results; a portion is shown below. The function value, starting from 2926 at the initial point, is quickly reduced to about 2.5 within only ten steps of gradient descent, demonstrating the method's efficiency in finding optima.
...
10:46:49.805 INFO : T: 00s | E: 1.0/10 | B: 1/10 | L: 2026.000
10:46:49.806 INFO : T: 00s | E: 2.0/10 | B: 2/10 | L: 476.1010
10:46:49.807 INFO : T: 00s | E: 3.0/10 | B: 3/10 | L: 63.83614
10:46:49.807 INFO : T: 00s | E: 4.0/10 | B: 4/10 | L: 37.77679
10:46:49.808 INFO : T: 00s | E: 5.0/10 | B: 5/10 | L: 21.39686
10:46:49.809 INFO : T: 00s | E: 6.0/10 | B: 6/10 | L: 11.74234
10:46:49.809 INFO : T: 00s | E: 7.0/10 | B: 7/10 | L: 6.567733
10:46:49.809 INFO : T: 00s | E: 8.0/10 | B: 8/10 | L: 4.085909
10:46:49.810 INFO : T: 00s | E: 9.0/10 | B: 9/10 | L: 3.016714
10:46:49.810 INFO : T: 00s | E: 10.0/10 | B: 10/10 | L: 2.5943
...
4.4 Regression
In this section, we introduce a broad area that heavily relies on optimization: regression.
Regression is an important topic in statistical modeling and machine learning. It concerns modeling problems that involve one or more variables (also called "features" or "predictors") and require predicting another variable (the "output variable") from observed values of the predictors. Regression analysis includes a wide range
of models, from linear regression to isotonic regression, each with different theoretical
backgrounds and applications. In this section, we use the most widely used linear
regression as an example to demonstrate how optimization plays a key part in solving
regression problems.
Linear Regression
Linear regression models the relationship between input features and the output
variable with a linear model. It is the most widely used regression model. Without loss
of generality, let’s look at an example with a single variable in the model. Such a linear
regression problem can be informally stated as follows. Suppose we have a series of (x, y)
data points:
----------------------------
|x| 5.16 | 7.51 | 6.53 | ...
----------------------------
|y| 0.36 | 5.84 | 16.9 | ...
----------------------------
Given that the relationship between these two quantities is y ≈ hθ(x), where
hθ(x) = θ0 + θ1 x, can we find out the θ0 and θ1 values that can fit the observed data points
as closely as possible? This problem can be formalized as follows. Denote the list of x's
and y’s as two vectors x and y. Suppose we have a function C that measures the distances
between x and y: Cθ(x, y). The target is to find suitable parameters θ that minimize the
distance. That’s where optimization comes to help.
Loss
So the next question is: How to represent this distance mathematically? One good choice
is to use the Euclidean distance. That means the target is to minimize the function:

$$C_\theta(x, y) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \qquad (4.9)$$
Here, $x^{(i)}$ indicates the ith element in the vector x. The factor $\frac{1}{2n}$ is used to normalize
the distance. Other forms of distance can also be applied here. Due to its importance,
this distance is called the loss and abstracted as the Loss submodule in the optimization
module. Its code is shown as follows:
| L2norm
| Quadratic
| Cross_entropy
| Custom of (t -> t -> t)
...
end
It contains several methods to calculate the distance, or loss, between two values y
and y'. What we have described is the Quadratic method. It also supports the l1 or l2
norm: $\sum_i |x^{(i)} - y^{(i)}|$ and $\sum_i (x^{(i)} - y^{(i)})^2$, respectively. The cross-entropy measures the performance
let r = 1. /. float_of_int o in
let p = Arr A.(uniform ~a:(float_to_elt
(-.r)) ~b:(float_to_elt r) [| o; n |]) in
...
end
let f w x =
let w = Mat.reshape o n w in
Maths.(x *@ w) in
let w =
minimise_weight params f (Maths.flatten p) (Arr x) (Arr y)
|> snd
|> Mat.reshape o n
|> unpack_arr
in
match bias with
...
end
The core step of this regression function is to apply optimization on the function
f using the given parameters, with proper shape manipulation. If the bias is included
in the optimization target, the returned result is split into two parts, the first being w and the second being b.
Note that we introduced minimise_fun for optimization earlier, but here we use minimise_weight. These two functions are very similar in implementation, but with one key difference. In minimise_fun f x, the engine keeps calculating gradients with regard
to input x and changes the x accordingly until it reaches a point that minimizes f (x). In
minimise_weight f w x though, it keeps calculating gradients regarding the function’s
own parameter w and changes it accordingly until it reaches a point that minimizes fw(x).
The input data x stays the same in each round of optimization.
Based on this function, the linear regression can be implemented by choosing
suitable optimization parameters:
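The parameter listing is not shown above; based on the description that follows, the configuration might look like this sketch (the Adagrad rate value is an assumption).

let params =
  Params.config
    ~batch:Batch.Full
    ~learning_rate:(Learning_Rate.Adagrad 1.)
    ~gradient:Gradient.GD
    ~loss:Loss.Quadratic
    ~verbosity:false
    ~stopping:(Stopping.Const 1e-16)
    100.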
In linear regression, we utilize all the input data in one iteration or epoch (Full
batch mode). We use the Adagrad learning rate method, classic gradient descent, and the Euclidean distance as the loss function. The optimization runs for at most 100 iterations, or until the loss value is smaller than 1e-16. Stopping is a helper module in Optimise that accepts a threshold so that the optimization process can exit early.
let poly x y n =
let z =
Array.init (n + 1) (fun i -> A.(pow_scalar x
(float_of_int i |> float_to_elt)))
in
let x = A.concatenate ~axis:1 z in
let params =
Params.config
~batch:Batch.Full
~learning_rate:(Learning_Rate.Const 1.)
~gradient:Gradient.Newton
~loss:Loss.Quadratic
~verbosity:false
~stopping:(Stopping.Const 1e-16)
100.
in
(_linear_reg false params x y).(0)
The key is to first process the data so that each data point x is projected to a series of new features z, with $z_i = x^i$. Eq. 4.10 then becomes a multivariable linear regression:

$h_\theta(z) = \theta_0 + \theta_1 z_1 + \theta_2 z_2 + \theta_3 z_3 + \dots$
Another important type of regression is logistic regression, where the data y contain integers that indicate different classes of data, instead of real numbers. Therefore, it is most suitable for classification tasks, such as predicting an "age group" or "nationality." Logistic regression replaces the target optimization function with

$$C_\theta(x, y) = \frac{1}{m} \sum_{i=1}^{m} g\left( h_\theta(x^{(i)}), y^{(i)} \right), \qquad (4.11)$$

where m is the total number of data points in the input data x and y; the function g is defined as

$$g(h_\theta(x), y) = \begin{cases} -\log\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases} \qquad (4.12)$$
The logistic gradient can be implemented by using the cross-entropy loss function:
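The corresponding listing is not reproduced; a sketch of the change, following the text, is to reuse the same configuration style but switch the loss to cross-entropy (the other values here are assumptions).

let params =
  Params.config
    ~batch:Batch.Full
    ~learning_rate:(Learning_Rate.Adagrad 1.)
    ~gradient:Gradient.GD
    ~loss:Loss.Cross_entropy
    ~verbosity:false
    ~stopping:(Stopping.Const 1e-16)
    1000.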
Regularization
There is one thing we need to understand: regression is more than just optimization after all. Its purpose is to create a model that fits the given data, and more often than not, this model will be used to predict the output for future input. Therefore, if a model fits the given data too well, it may lose generality for future data. That's where the idea of regularization comes in. This technique prevents a model from being tuned too closely to a particular dataset, which would cause it to fail to predict future observations well.
Think about the polynomial regression. The regularization technique favors simple
and low-order models. It modifies the optimization target function to penalize high-
order parameters, so that large parameter values lead to a higher cost. Therefore, by
minimizing the target function, we keep the unwanted parameters relatively small. This
can be implemented by adding an extra term at the end of the original target function.
Owl supports multiple types of such regularization terms in the Regularisation
submodule, which also belongs to the Optimiser module. Its core function run is shown
as follows:
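The listing is not reproduced here; a minimal sketch of such a run function, matching the penalties described next, is given below. The Elastic_net payload and the exact Maths helpers (l1norm', l2norm_sqr') are assumptions.

let run typ x =
  match typ with
  | None               -> _f 0.
  | L1norm a           -> Maths.(_f a * l1norm' x)
  | L2norm a           -> Maths.(_f a * l2norm_sqr' x)
  | Elastic_net (a, b) -> Maths.((_f a * l1norm' x) + (_f b * l2norm_sqr' x))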
The L2norm regularization function adds the squared L2 norm of θ as the penalty term: $\lambda \sum \theta^2$. The L1norm cost function is similar, adding the L1 norm, or absolute value, of the parameters as the penalty: $\lambda \sum |\theta|$. This difference means that L1norm permits coefficients to be exactly zero, which is very useful for feature selection. Regressions using these two regularization techniques are often called Ridge and Lasso regressions, respectively. The Elastic_net method combines the penalties of the previous two:

$$\lambda \left( \frac{1-a}{2} \sum \theta^2 + a \sum |\theta| \right),$$
where a is a hyperparameter balancing between the former two. This method aims to
make feature selection less dependent on the input data.
We can create a new polynomial regression with regularization by simply changing
the optimization parameter to the following values:
Params.config
~batch:Batch.Full
~learning_rate:(Learning_Rate.Const 1.)
~gradient:Gradient.Newton
~loss:Loss.Quadratic
~regularisation:(Regularisation.L2norm 0.5)
~verbosity:false
~stopping:(Stopping.Const 1e-16)
100.
4.5 Summary
In this chapter, we introduced optimization and its implementation in Owl. Focusing
on gradient descent, one of the most widely used optimization methods, we introduced
various aspects, such as the gradient method, learning rate, momentum, etc. Together
they provide a powerful and robust implementation. As an important example, we
further introduced regression, a machine learning technique that heavily relies on
optimization. We showed how various regression methods can be built efficiently using
the optimization module.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 5
Deep Neural Networks
5.1 Module Architecture
To explain in layman’s terms, you can imagine a neural network as a communication
network where data flow from one node to another node without loops. Nodes are
referred to as neurons. Every time data pass through a neuron, they are processed in different ways depending on the type of the neuron. The link between neurons represents nonlinear
transformation of the data. Neurons can be wired in various ways to exhibit different
architectures which specialize in different tasks. During the training phase, data can be
fed into a neural network to let it form the knowledge of certain patterns. During the
inference phase, the neural network can apply previously learned knowledge to the
input data.
A DNN framework is built to let us define the network structure and orchestrate its
learning and inference tasks. The framework is a complicated artifact containing lots of
technologies. However, from a high-level system perspective, there is only a limited number of core functions that a framework must implement. Let us take a look at the
key functionalities required by Owl’s neural network module:
In the rest of this chapter, we will examine internal mechanisms of these modules.
The optimization and algorithmic differentiation modules are not DNN specific, so we will
skip them for now; their implementations have been covered in detail in the previous
chapters.
5.2 Neurons
Neurons are implemented as modules. Each type of neuron corresponds to a specific
module. These modules share many common functions such as mktag, mkpar, update,
etc., but their implementation might slightly differ. Every neuron has its own neuron_typ
which specifies the shape of the neuron’s input and output.
The neuron_typ also includes all the parameters associated with this neuron.
For example, the preceding code presents the signature of the Linear neuron which
performs wx + b for input x. As you can see, fields w and b are used for storing the weight
and bias of this linear function.
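The signature itself is not reproduced above; a sketch of what the Linear neuron's record might contain, following this description, is shown below (field types are assumptions).

type neuron_typ =
  { mutable w : t                      (* weight matrix *)
  ; mutable b : t                      (* bias vector *)
  ; mutable init_typ : Init.typ        (* how to initialize the weight *)
  ; mutable in_shape : int array
  ; mutable out_shape : int array
  }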
Core Functions
The record type neuron_typ is created by calling the create function when constructing
the network.
The parameters are created, but their values need to be initialized in the init
function. The bias parameter is set to zero, while the initialization of weight depends on
init_typ.
let init l =
let m = l.in_shape.(0) in
let n = l.out_shape.(0) in
l.w <- Init.run l.init_typ [| m; n |] l.w;
l.b <- Mat.zeros 1 n
How the weights are initialized matters a lot in the training. If weights are not
initialized properly, it takes much longer to train the network. In the worst case, the
training may fail to converge. The Init module provides various ways to initialize
weights, specified by different type constructors. Some initialization methods require
extra parameters. For example, if we want to randomize the weight with Gaussian
distribution, we need to specify the mean and variance. As discussed by X. Glorot in [24],
the initialization method has a nontrivial impact on model training performance. Besides
those supported here, users can also use Custom to implement their own initialization
methods.
When a neuron is added to the network, the connect function is called to validate
that the input shape is consistent with the output shape.
The following functions are used to retrieve the parameters and their corresponding
primal and adjoint values. These functions are mainly used in the training phase. When
the parameters need to be updated during the training, the optimization engine can call
the update function to do the job.
let update l u =
l.w <- u.(0) |> primal';
l.b <- u.(1) |> primal'
The run function is the most important one in the module. This function defines how
the input data should be processed and is called by the network during the evaluation.
Let us look at the run function of the linear neuron. The function is so simple and
contains only one line of code which calculates exactly wx + b.
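That one-liner would look like the following sketch, computing exactly wx + b with the weight and bias stored in the neuron record.

let run x l = Maths.((x *@ l.w) + l.b)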
Most neurons' run functions are one-liners; this simplicity is possible because Owl has implemented a very comprehensive set of numerical functions.
[Figure: input data pass through a linear layer (weights w, bias b) and an activation layer to produce the output.]
Activation Module
One reason we can usefully pile up even just linear neurons to construct a deep neural network is nonlinearity, which is introduced by activation neurons. Nonlinearity is a useful property in our models, as most real-world data demonstrate nonlinear features. Without nonlinearity, all the stacked linear neurons could be reduced to a single matrix. Activation
functions are aggregated in one module called Activation. Similar to other neuron
modules, the Activation also has neuron_typ and many similar functions.
type neuron_typ =
{ mutable activation : typ
; mutable in_shape : int array
; mutable out_shape : int array
}
...
end
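A sketch of how the activation neuron's run function might dispatch on the activation type; only a few common constructors are shown, and their names are assumptions.

let run x l =
  match l.activation with
  | Relu    -> Maths.relu x
  | Sigmoid -> Maths.sigmoid x
  | Tanh    -> Maths.tanh x
  | None    -> x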
5.3 Networks
Essentially, a neural network is a computation graph. Nodes represent neurons which
aggregate more complicated data processing logic than nodes in a vanilla computation
graph. The following code presents the type definition of the node. A node can have
a name for reference. The prev and next fields are used for linking to ancestors
and descendants, respectively. The output field is used to store the output of the
computation.
The node per se does not contain any data processing logic. Rather, the node
refers to the neuron which implements actual numerical operations on the data. The
motivation of this design is to separate the mechanism of a network and the logic of
neurons. The network field refers to the network that the current node belongs to. Note
that the network structure is not necessarily the same in training and inference phases.
Some nodes may be dropped during the inference phase, such as dropout. The train
field is used for specifying whether a node is only for training purposes.
type node =
{ mutable name : string
; mutable prev : node array
; mutable next : node array
; mutable neuron : neuron
; mutable output : t option      (* output of the computation, as described above *)
; mutable network : network      (* the network this node belongs to *)
; mutable train : bool           (* whether the node is used only during training *)
}
and network =
{ mutable nnid : string
; mutable size : int
; mutable roots : node array
; mutable outputs : node array
; mutable topo : node array
}
As we can see, these type definitions are similar to computation graphs. Even though
they contain some specific neural network–related fields, the type definitions are not
more complicated than a general-purpose computation graph.
To build up networks, most of the time we use functions that build a node and
connect it to an existing node stack. For example:
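The listing is not reproduced here; a sketch of such a node-building function, following the description in the next paragraph, might look like this. The default values (SAME padding, Init.Tanh) and the exact arguments of make_node and add_node are assumptions.

let conv2d ?name ?(padding = SAME) ?(init_typ = Init.Tanh) ?act_typ
    kernel stride input_node =
  (* create the Conv2D neuron and wrap it into a node *)
  let neuron = Conv2D (Conv2D.create padding kernel stride init_typ) in
  let nn = get_network input_node in
  let n = make_node ?name [||] [||] neuron None nn in
  (* connect the new node to its parent and infer shapes *)
  add_node ?act_typ nn [| input_node |] n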
This function first creates a Conv2D neuron with various parameters and wraps it into
a node using the make_node function. Then we connect n to its parent nodes using the
add_node function. This step uses the connect function of the neuron and also updates
the child’s input and output shape during connection. With the network graph APIs, we
can write concise code to build up a network, such as
open Owl
open Neural.S
open Neural.S.Graph
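For illustration, a small LeNet-like definition in this style might look as follows; the exact layer sizes here are only an example, not the network used in the book.

let make_network input_shape =
  input input_shape
  |> conv2d [|5; 5; 1; 32|] [|1; 1|] ~act_typ:Activation.Relu
  |> max_pool2d [|2; 2|] [|2; 2|]
  |> fully_connected 1024 ~act_typ:Activation.Relu
  |> linear 10 ~act_typ:Activation.(Softmax 1)
  |> get_network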
The network definition always starts with an input layer and ends with the get_network function, which finalizes and returns the constructed network. The shapes of the data and of the parameters will be inferred later, as long as the input_shape is determined. We only need to provide the data shape in the input node, and the network can automatically infer the shape in the other nodes.
We have already covered most elements to build up a neural network. For example,
Figure 5-2 shows the structure of a basic LeNet-like neural network, combining the
convolution layer, pooling layer, linear layer, activation layer, etc. This network is simple
yet powerful, perfectly capable of performing the handwritten digit recognition task
accurately. But to do that, the network should first be trained.
5.4 Training
Training is a complicated, time-consuming, and computation-intensive process. There
are many parameters to configure different components in a neural network framework
to control the process. The following functor definition can give us a good understanding
about what needs to be configured. Fortunately, the Optimise module does all the heavy lifting; it implements several engines for different optimization tasks.
let p =
match params with
| Some p -> p
| None -> Optimise.Params.default ()
in
Optimise.minimise_network ?state p f b u s x y
We need to specify four important functions: the function for forward evaluation,
the function for backward propagation, the function for updating the weights, and
the function for saving the network. These four functions are passed as parameters to
the minimise_network function which is the engine specifically for optimizing neural
networks as a function. We have introduced minimise_fun in Chapter 4 and used it to
find an optimal x∗ to minimize f (x). The minimise_network function works and is implemented similarly, with one subtle difference. Instead of
input x, this function aims to find optimal θ∗ to minimize fθ(x) for a given input x. In the
case of optimizing a neural network, θ indicates the weight parameters.
let forward nn x =
mktag (tag ()) nn;
run x nn, mkpar nn
The core logic of the run function is iterating all the neurons in a topological order.
For each neuron, the inputs are collected from its ancestors’ outputs first, then the
neuron’s activation function is triggered to process the inputs. The neuron’s output
is saved in its hosting node. Finally, the output of the whole network is collected and
returned.
let run x nn =
Array.iter
(fun n ->
(* collect the inputs from parents' output *)
let input =
match n.neuron with
| Input _ -> [| x |]
| _ -> collect_output n.prev
in
(* process the current neuron, save output *)
let output = run input n.neuron in
n.output <- Some output)
nn.topo;
(* collect the final output from the tail *)
let sink = [| nn.topo.(Array.length nn.topo - 1) |] in
(collect_output sink).(0)
A backward pass is much more complicated than a forward pass, even though the
code in the backward function looks as simple as the forward. The actual complexity is
hidden in the reverse_prop which is the core function in the AD module. The purpose
of the backward pass is to propagate the errors backward from the output to inputs. By
doing so, the neurons along the path can utilize this error information to adjust their
parameters and hopefully minimize the future errors as well.
Derivatives can also be calculated in the forward pass, for example, using dual numbers, so why do we use backward propagation in the implementation? The reason is that a typical neural network has many more input parameters than output parameters.
Backward propagation requires much less computation in this scenario.
let backward nn y =
reverse_prop (_f 1.) y;
mkpri nn, mkadj nn
Here, mkpri and mkadj return the primal and adjoint values of all the parameters.
• We dry run the neural network to derive the computation graphs for
both forward and backward pass. We reuse these computation graphs
in the following iterative process rather than regenerating them.
The following function first creates the network, then configures the training process,
and finally trains the network by calling Graph.train. In fact, the Graph.train function
calls the train_generic function we just introduced in the previous section. The
train_generic directly passes the neural network along with the configurations to the
optimization engine to kick off the optimizing process.
let train () =
let x, _, y = Dataset.load_cifar_train_data 1 in
let network = make_network [|32;32;3|] in
Graph.print network;
let params = Params.config
~batch:(Batch.Mini 100) ~learning_rate:(Learning_Rate.Adagrad 0.005)
~checkpoint:(Checkpoint.Epoch 1.) ~stopping:(Stopping.Const 1e-6) 10.
in
Graph.train ~params network x y
However, from a programmer’s perspective, if we use the neural compiler, the only
thing that needs to be changed is the train function. The network definition remains
exactly the same.
Except for the mundane packing and unpacking parameters, the most noticeable
change is that we are now using CGCompiler.train to train a network. CGCompiler.train is implemented in the neural compiler function. So what is contained in this
function? Let us have a look at its implementation.
The first part is simply creating some higher-order functions from the network
configuration. The purpose is to simplify the following code:
let batch =
match params.batch with
| Full -> full_size
| Mini n -> n
| Sample n -> n
| Stochastic -> 1
in
let network_shape = Graph.input_shape network in
let input_shape = Array.append [| batch |] network_shape in
...
Because compile_simple needs to dry run the network, it needs to know the shape
of the input. The input shape depends on how the training is configured. For a small
dataset, we can input the whole dataset in each iteration, so the shape will be full size.
For a larger dataset, we might want to use different logic to select a batch of data as input,
even just one sample per iteration. We can calculate the size from the params.batch
parameter.
Then the neural network is initialized, and the weights are updated. After this step,
all the preparation work for a dry run is done.
The most critical step is to derive the computation graph of the backward pass.
Before we can do that, we need to first run the forward pass. The outcome of the forward
pass y and the ground truth y' are fed into the loss function loss_fun, which contains the
computation graph of the forward pass.
Then we further adjust the loss value by adding the regularization term if necessary
and assign it with a proper name.
The Graph.backward function creates the computation graph of the backward pass, contained in z. This computation graph is also the derivative of the loss function of the network. We also separate out both the weights ws and their adjoint values gs' from z. After this step, lengthy code further calculates and adjusts the gradient with clipping, momentum, etc.
The final computation graph is returned along with the loss function, input,
and output.
image that contains multiple objects, it seeks to classify individual objects and localizes
each one using a bounding box. Similarly, the semantic segmentation task requires
classifying the pixels in an image into different categories. Each segment is recognized
by a “mask” that covers the whole object. All possible objects are shown using different
masks, but it does not categorize what those objects are. The Mask R-CNN (Mask Region-
based Convolutional Neural Network) architecture was proposed in 2017 to address
all the previous problems. With sufficient training, it can solve these problems at once:
detecting objects in an image, labeling each of them, and providing a binary mask for
the image to determine which pixels belong to which objects. This task is called instance
segmentation.
As a preliminary example and for visual motivation, Figure 5-3 shows what this
network generates. In this example, a normal street view picture is processed by the
pretrained Mask R-CNN (MRCNN) network, and the objects (people, sheep, bag,
car, bus, etc.) are segmented from the input figure and recognized with probability
represented by a number between zero and one. Image segmentation has many
important applications, including medical imaging (locating tumors, detecting cancer
cells, etc.), traffic control systems, locating objects in satellite images, etc. In the rest of
this section, we will explain how this complex network can be built in OCaml using the
Neural module. The full code is provided in the GitHub repository.
R-CNN Architecture
The idea of using a CNN to enhance the object detection task was first proposed in [23].
This paper proposes a “Regions with CNN features” (R-CNN) object detection system.
It is divided into several phases. The first phase is to localize possible objects of interest
in an input image. Instead of using a sliding window, R-CNN uses a different approach
called “regions”: for each input image, it first generates a number of (e.g., 2000) region
proposals that are independent of the object categories used. They are rectangular regions of the image, of different aspect ratios and sizes. The content in each region is then checked to
see if it contains any object of interest. Each region proposal is then processed by a CNN
to get a 4096-dimension feature vector. This CNN takes an input of fixed size 227 × 227,
and thus each region, regardless of its shape, is morphed into this fixed size before being
processed by CNN. As to the output feature vector, it is processed by a trained SVM
model to be classified into the accepted results.
feature map is pooled by a “RoI pooling” layer into a smaller feature map of fixed size,
which is then turned into a feature vector by several fully connected layers.
Next, the feature vectors are fed into a branch. One output of this branch contains
the classification, and the confidence of that classification, of the object in that region.
The other specifies the rectangle location of the object, encoded by four real-valued
numbers. The output on this branch contains such a four-number tuple for each of the
object categories in this task. Compared to R-CNN, this method does not require a lot
of space for feature caching, and it proves to be about 9 times faster in training and 213
times faster in inference.
information about the mask of the detected object in the RoI. Therefore, the Mask
R-CNN can retrieve the rectangle bound, classification results, classification possibility,
and the mask of that object, for any RoI, in a single pass. In the next section, we will
introduce the Mask R-CNN architecture in detail.
open Owl
module N = Dense.Ndarray.S
open CGraph
open Graph
open AD
1
https://github.com/pvdhove/owl-mask-rcnn. Work in this chapter was conducted by Pierre
Vandenhove during his internship in the OCaml Labs group at the University of Cambridge
Computer Laboratory. The code was ported from the Keras/TensorFlow implementation.
The network accepts three inputs, each representing images, metadata, and the
number of anchors (the rectangular regions). The Configuration module contains a list
of constants that will be used in building the network.
Feature Extractor
The picture is first fed to a convolutional neural network to extract features of the image.
The first few layers detect low-level features of an image, such as edges and basic shapes.
As you go deeper into the network, these simple features are assembled into higher-
level features such as “people” and “cars.” Five of these layers (called “feature maps”)
of various sizes, both high and low levels, are then passed on to the next parts. This
implementation uses Microsoft’s ResNet101 network as a feature extractor.
let p4 =
add ~name:"fpn_p4add"
[|upsampling2d [|2; 2|] ~name:"fpn_p5upsampled" p5;
conv2d [|1; 1; 1024; tdps|]
str ~padding:VALID ~name:"fpn_c4p4" c4|] in
let p3 =
add ~name:"fpn_p3add"
[|upsampling2d [|2; 2|] ~name:"fpn_p4upsampled" p4;
conv2d [|1; 1; 512; tdps|]
str ~padding:VALID ~name:"fpn_c3p3" c3|] in
let p2 =
add ~name:"fpn_p2add"
[|upsampling2d [|2; 2|] ~name:"fpn_p3upsampled" p3;
conv2d [|1; 1; 256; tdps|]
str ~padding:VALID ~name:"fpn_c2p2" c2|] in
The features are extracted by combining both ResNet101 and the Feature Pyramid
Network. ResNet extracts features of the image (early layers extract low-level features;
later layers extract high-level features). The Feature Pyramid Network creates a second
pyramid of feature maps from top to bottom so that every map has access to high-
and low-level features. This combination achieves excellent gains in both accuracy
and speed.
Proposal Generation
To try to locate the objects, about 250,000 overlapping rectangular regions or anchors are
generated.
Single RPN graphs are applied on different features in rpn_features_maps, and the
results from these networks are concatenated. For each bounding box on the image,
the RPN returns the likelihood that it contains an object, called its objectness, and a
refinement for the anchor; both are represented by rank 3 ndarrays.
Next, in the proposal layer, the 1000 best anchors are selected according to their
objectness. Anchors that overlap too much with each other are eliminated, to avoid
detecting the same object multiple times. Each selected anchor is also refined in case it
was not perfectly centered around the object.
let rpn_rois =
let prop_f = PL.proposal_layer
C.post_nms_rois C.rpn_nms_threshold in
MrcnnUtil.delay_lambda_array [|C.post_nms_rois; 4|]
prop_f ~name:"ROI"
[|rpn_class; rpn_bbox; input_anchors|] in
In rpn_rois, the proposal layer picks the top anchors from the RPN output, based on
nonmaximum suppression and anchor scores.
Classification
All anchor proposals from the previous layer are resized to a given fixed size and fed
into a ten-layer neural network. The network assigns each of them the probability that
it belongs to each class. The network is pretrained on fixed classes; changing the set of
classes requires retraining the whole network. Note that this step does not take as much
time for each anchor as a full-fledged image classifier such as Inception, since it reuses the
precomputed feature maps from the Feature Pyramid Network. Therefore, there is no need
to go back to the original picture. The class with the highest probability is chosen for each
proposal, and thanks to the class predictions, the anchor proposals are even more refined.
Proposals classified in the background class are deleted. Eventually, only the proposals
with an objectness over some threshold are kept, and we have our final detections, each
with a bounding box and a label. This process can be described by the following code:
A Feature Pyramid Network classifier associates a class to each proposal and further
refines the bounding box for that class. The only thing left to do then is to generate a
binary mask for each object. This is handled by a small convolutional neural network
which produces a small square of values between 0 and 1 for each detected bounding
box. This square is resized to the original size of the bounding box with bilinear
interpolation, and pixels with a value over 0.5 are tagged as being part of the object.
Finally, the output contains detection results and masks from the previous steps.
After getting to know the internals of the MRCNN architecture, we can now run the
code to see it work. The core code is listed as follows:
open Mrcnn
A key step is to apply the Model.detect function on the input images, returning
the regions of interest, the classification result ID of the object in each region, the
classification certainty scores, and a mask that shows the outline of that object in the
region. With this information, the Visualise module makes three passes over the original image: the first adds bounding boxes and object masks, the second adds the numbers next to the bounding boxes, and the final one prints out the resulting images from the previous two steps. In this example, the pretrained weights on 80 classes
of common objects are provided, which have been converted from the TensorFlow
implementation mentioned earlier. As to the execution speed, processing one image
with a size of 1024 × 1024 pixels takes between 10 and 15 seconds on a moderate laptop.
5.7 Summary
In this chapter, we provided an insight into the neural network module in Owl.
Benefiting from solid implementation of algorithmic differentiation and optimization,
the Neural module is concise and expressive. We explained the neurons and network
components in this module and then showed how network training is done in Owl. This
chapter also covered how we implement a neural network compiler to automatically
optimize the network structure and memory usage. Finally, we introduced in detail
a DNN application, instance segmentation, that drives the development of the
computation graph module in Owl.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 6
Computation Graph
A computation graph is a basic theoretical tool that underlies modern deep learning
libraries. It is also an important component in Owl. This chapter first gives a bird’s-
eye view of the computation graph in Owl and its importance in computing. We then demonstrate how to use it in Owl with some examples, and then cover the design and implementation details of the computation graph module and how it fits into Owl's functor stack.
Figure 6-1 shows an example graph for calculating function sin (x * y).1 The
computation graph contains several pieces of information which are essential for
debugging the applications. This information includes the node index, operation type,
reference counter, shapes of data, etc. For example, in Figure 6-1 the row vector y of
shape [1; 4] is broadcast on the matrix x of shape [8; 4] in the Mul operation.
1
This figure is generated with the tool provided by the CGraph module in Owl, which we will
discuss in detail in this chapter.
2
In Sep. 2019, TensorFlow rolled out version 2.0. Starting from this version, TensorFlow uses eager
execution by default, which aims to be easier for users to get started with.
3
Code example: Constructing an LSTM network using Owl. URL: https://github.com/owlbarn/
owl/blob/master/examples/lazy_lstm.ml
Significance in Computing
Now that you know the basic ideas of a computation graph, you may ask why it matters.
Actually, a computation graph plays a core role in any machine learning framework.
Both TensorFlow [1] and PyTorch [42], the most popular deep learning libraries, use
a computation graph as the central data structure. A computation graph makes many
things a lot easier. Here is an incomplete list of its potential benefits:
Some of the benefits are very obvious. Memory usage can certainly be optimized
if the graph structure is fixed and the input shapes are known beforehand. One
optimization is reusing previously allocated memory, which is especially useful for those
applications involving large ndarray calculations. In fact, this optimization can also
be performed by a compiler by tracking the reference count of allocated memory, a
technique referred to as linear types [50]. Some may appear less obvious at first glance.
For example, we can decompose a computation graph into multiple independent
subgraphs, and each can be evaluated in parallel on different cores or even computers.
Maintaining the graph structure also improves fault tolerance, by providing natural
support for rollback mechanisms.
The computation graph provides a way to abstract the flow of computations;
therefore, it is able to bridge the high-level applications and low-level machinery
of various hardware devices. This is why it has natural support for heterogeneous
computing.
The computation graph has more profound implications on the scalability and
security of scientific computing systems. Because the memory allocated for each node
is mutable, the algorithmic differentiation becomes more scalable when evaluating
large and complex graphs. At the same time, mutable transformation is handled by Owl
internally, so programmers can still write safe functional code.
module N = Dense.Ndarray.D
Now, let's make this function into a computation graph which can be lazily evaluated
by CGraph.
module N = Owl_computation_cpu_engine.Make
(Owl_algodiff_primal_ops.D)
The Make function here is actually a functor. For those who are not familiar with
the idea of functor, it is a powerful tool in OCaml to build generic code and structure
large-scale systems. To put it in plain words, a functor is a function that creates modules
from modules. As we will explain in Section 6.3, the computation graph is designed as a
functor stack. Different aspects of the computation graph, such as memory management
and graph optimization, are added into the CGraph by creating a new module based
on an existing one, layer by layer. So far, it suffices to know that the functor creates
a module N, which provides exactly the same ndarray operations, except that all the
operations are conducted on symbols which represent ndarray instead of real objects
allocated in memory.
Next, we define two variables. The first x is an ndarray (arr), and y is a scalar (elt).
At this point, we only define these two as placeholders with no real data. That is to say,
we do not care about what specific ndarray or scalar these two variables are. Then we use
the add_scalar function to get another lazily evaluated ndarray g. That finishes the lazy
calculation. So far, we only know that g is calculated by adding x and y, but have no idea
what their values are. To get the value of the lazy expression g, we need to first assign
values to x and y:
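The corresponding code is not reproduced above; a sketch of the two placeholders, the lazy expression g, and the assignments described next might look as follows. The shape [|2; 2|] is an assumption consistent with the output shown below.

let x = N.var_arr "x" ~shape:[|2; 2|]
let y = N.var_elt "y"
let g = N.add_scalar x y

(* assign concrete values to the placeholders *)
let _ = N.assign_arr x (Owl_algodiff_primal_ops.D.ones [|2; 2|])
let _ = N.assign_elt y 2.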
Here, x is assigned a double-precision ndarray of 1s, and y the float number 2. Note the
two different assignment methods for ndarray and scalar. Finally, we can evaluate the
ndarray g:
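Evaluation itself is triggered by eval_arr; a sketch of the call:

let _ = N.eval_arr [| g |]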
# N.unpack_arr g
- : Owl_algodiff_primal_ops.D.arr =
C0 C1
R0 3 3
R1 3 3
The eval_arr evaluates the whole graph but does not return the result. To extract the
calculation result, we need to use the unpack_arr or unpack_elt function. The result is a
2x2 ndarray, the values of which are all 3s, just as expected. So where does the calculation happen? Remember that the CGraph module N is built on the double-precision ndarray module, so the actual computation is ultimately carried out by that underlying module when the graph is evaluated.
include Owl_algodiff_generic.Make
(Owl_algodiff_primal_ops.D)
Based on the chain rule, the Algodiff module automatically constructs a graph that
computes the gradient of the input function f. The result is contained in the scalar z. The graph, however, is constructed internally, and sometimes we need access to this graph to apply optimizations. Obviously, it is extremely difficult for the users to manually
construct the computation graph that computes the gradient of the function f. Note that
the Algodiff module is also built using functors. Its base module follows the Ndarray
interface. By changing it from the Ndarray to CGraph module, we can make z to be a
computation graph instead of a scalar value, as the following code snippet shows:
module G = Owl_computation_cpu_engine.Make
(Owl_algodiff_primal_ops.D)
include Owl_algodiff_generic.Make (G)
let f x y =
Maths.((x * sin (x + x) + ((pack_flt 1.) *
sqrt x) / (pack_flt 7.)) * (relu y) |> sum')
Most of the code stays unchanged. Notice how the CGraph module is treated as an alternative to the Ndarray module in building the AD module, since they follow the same set of interfaces required by the Algodiff module for its base module. The choice of base module decides whether the AD module uses normal or lazy evaluation. By executing this piece of code, the
result z contains a computation graph constructed by the backward propagation pass in
performing algorithmic differentiation.
The next thing we need to do is to assign values to inputs and evaluate z.
That requires building a graph based on the input and output, as shown by the
following code:
let inputs = [|
unpack_arr x |> G.arr_to_node;
unpack_elt y |> G.elt_to_node
|]
let outputs = [| unpack_elt z |> G.elt_to_node |]
let g = G.make_graph inputs outputs "graph"
To build a graph, we need to specify the input and output nodes. It might be a bit
confusing, since there are two layers of packing and unpacking: the first from the AD
node to the CGraph element and the second from the CGraph element to ndarray or
scalar. We need AD.unpack_arr and AD.unpack_elt to unwrap AD type data (ndarray
and scalar) into CGraph ndarray and scalar values. And then, to build the explicit
computation graph, we need to use the G.arr_to_node and G.elt_to_node functions
to make them into graph nodes first. Finally, an explicit computation graph can be built
with the make_graph function.
After constructing the graph g, we can then assign real data values to the
computation graph. Note that we need to first unpack the Algodiff values to CGraph
values before assignment:
G.eval_graph g;;
Since the whole graph is evaluated, the output ndarray z is also evaluated. We can
first unpack it from the Algodiff value into normal CGraph ndarray and then get its
value by another layer of unpacking:
You might be wondering why bother to build the graph through all these layers of
packing and unpacking when we can directly evaluate the value z. One main reason is
to enable various optimizations on the graph before executing it, as we will explain in
the following sections. Another reason is that evaluation is not always the target. For
example, we often need to visualize the generated computation graph. The computation
graph is very helpful in both debugging and understanding the characteristics of your
numerical computations. Owl provides the graph_to_dot function to help you in
generating computation graphs. It converts the computation graph into a dot format
string. The dot file can be visualized with tools such as Graphviz. For example, the
following code generates a dot file for the graph we have constructed in this example,
and this graph is shown in Figure 6-2.
let s = G.graph_to_dot g
let _ = Owl_io.write_file "cgraph.dot" s
open CGCompiler.Neural
open CGCompiler.Neural.Graph
open CGCompiler.Neural.Algodiff
The CGraph-built neural network module does not require any change of code
in building the CNN except for the headers. To build a normal neural network, we
use the Neural module, and now we only need to change that to the CGCompiler.
Neural module. Here, the owl_neural_compiler functor compiles a DNN definition
and training configuration into a device-dependent static graph. As its output, the
CGCompiler is a computation graph–powered neural network compiler module.
CGCompiler also provides training functions. Note that the data require proper packing around the original ndarrays.
Similarly, the inference can be done with the CGCompiler.model function. To make
the existing DNN program into a lazy evaluation version, all we need to do is to update
the header and use packing/unpacking properly for the data.
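A sketch of how training might be invoked in this lazy-evaluation style, assuming a pack helper built from CGCompiler.Engine.pack_arr and Algodiff.pack_arr; the exact module paths are assumptions.

let pack x = CGCompiler.Engine.pack_arr x |> Algodiff.pack_arr

(* params is the training configuration built with Params.config as before *)
let train params network x y =
  let x = pack x in
  let y = pack y in
  CGCompiler.train ~params network x y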
One of the key performance improvements CGraph brings to the neural network lies in its graph and memory optimization. To motivate you to understand more about the design and optimization of the CGraph module, here is an example. Let's train a LeNet-like DNN on the MNIST dataset, using the normal version mnist_cnn.ml and the CGraph-powered version lazy_mnist.ml.4 Similar to the preceding example
code, both scripts train the same convolution neural network in 60 iterations. In one of
our evaluations on a normal laptop, mnist_cnn.ml takes 30s to finish and approximately
consumes 4GB memory, while lazy_mnist.ml only takes 5s and consumes about
0.75GB. This performance improvement is astounding. If these numbers make you
interested in knowing how the magic happens, please keep reading the next section. We
will unveil the underlying mechanism of Owl’s computation graph.
4
Both code snippets are available from the source code of Owl.
The left figure shows part of Owl's original functor stack, and the right one shows what the current one looks like after injection. In the initial design, Ndarray implements a set of fundamental n-dimensional array operations, then Algodiff defines
abstract mathematical operations for differentiation, finally the Optimise engine glues
low-level maths with high-level deep neural network applications. The whole stack is
parameterized by the number type abstraction in Ndarray:
Based on this architecture, the whole functor stack of the computation graph can be
inserted between the Ndarray and Algodiff modules. The design principle is that the
functor stack of a numerical system should be parameterized by both number type and
device type. The number type provides data representation (real or complex, single or
double, row-based or column-based layout, etc.) which decides how a math construct
should be built and operated. The device type provides hardware representation (CPU,
GPU, FPGA, etc.) which decides how the computation should be performed on a
specific device.
The following list summarizes the functionality of each functor in the CGraph stack.
The order and naming of these functors give you a rough understanding of how it is designed:
module M =
Owl_neural_generic.Flatten (
Owl_neural_graph.Make (
Owl_neural_neuron.Make (
Owl_optimise_generic.Make (
Owl_algodiff_generic.Make (
Dense.Ndarray.S)))));;
As to the new stack that contains computation graph functors, we can see it is indeed
much deeper.
module M =
Owl_neural_generic.Flatten (
Owl_neural_graph.Make (
Owl_neural_neuron.Make (
Owl_optimise_generic.Make (
Owl_algodiff_generic.Make (
Owl_computation_engine.Flatten (
Owl_computation_cpu_engine.Make_Nested (
Owl_computation_graph.Make (
Owl_computation_optimiser.Make (
Owl_computation_operator.Make (
Owl_computation_symbol.Make (
Owl_computation_shape.Make (
Owl_computation_type.Make (
Owl_computation_cpu_device.Make (
Dense.Ndarray.S))))))))))))));;
Computing Device
A computation graph is an abstract construct that expresses the logic of a function. To calculate the outcome of a function, the computation graph needs to be evaluated on a physical device. The device can be anything as long as it is capable of performing numerical operations, such as a CPU, GPU, etc. To extend Owl to a new device, we only need to create a new device module and define how the basic operations are performed on that device. Because the majority of the CGraph module is device independent, the device layer is very lightweight, which makes Owl easy to extend.
The following functor defines a CPU device. The functor's input is the type of data that will be manipulated on the device; in our case, these are either ndarray or scalar values. This makes sense from a computer architecture perspective: data are often stored and processed differently on devices of different architectures. Making a new device is simply a matter of creating an abstract record type in OCaml. The other two functions pack and unpack data into the types that a device can process.
type device =
{ device_type : device_type
; initialised : bool
}
type value =
| ArrVal of A.arr
| EltVal of A.elt
...
end
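The two packing and unpacking functions mentioned above are straightforward for the CPU device. A minimal sketch consistent with the value type shown (Owl's actual code may differ in details):
let arr_to_value x = ArrVal x

let value_to_arr = function
  | ArrVal a -> a
  | _        -> failwith "value_to_arr: not an ndarray value"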
Similarly, the following code defines an OpenCL device. On the GPU, computation is expressed as kernels written in a C-like DSL, and different computing units communicate through events.
type device =
{ device_type : device_type
; initialised : bool
}
type value =
{ mutable cpu_mem : cpu_mem array
; mutable gpu_mem : cl_mem array
; mutable kernel : cl_kernel array
; mutable events : cl_event array
}
let arr_to_value x =
let cpu_mem = [| x |] in
let gpu_mem = [||] in
let kernel = [||] in
let events = [||] in
{ cpu_mem; gpu_mem; kernel; events }
let value_to_arr x =
if Array.length x.cpu_mem > 0
then x.cpu_mem.(0)
else failwith "value_to_arr: not evaluated yet"
...
end
There are four attributes associated with a value regarding its storage, computation, and communication on an OpenCL device: CPU memory and GPU memory for storage, a kernel for computation, and events for communication between computing units.
Types of Operation
The Owl_computation_type functor takes a device module as its input and then specifies all the possible operations on the given device. Whenever we want to extend the set of operations, we need to add the corresponding constructor for the new operation to the sum type op. The current set of operations covers a wide range of unary and binary numerical functions, such as Abs, Neg, and Add, as well as functions for neural networks such as MaxPool3d.
type state =
| Valid
| Invalid
and block =
{ size : int
; block_id : int
; mutable active : t option
; mutable memory : value
; mutable nodes : t list
}
and attr =
  { mutable op : op
  ; mutable freeze : bool
  ; mutable reuse : bool
  ; mutable state : state
  ; mutable shape : int array option array
  ; mutable value : value array
  ; mutable block : block array option
  }
and op =
| Noop
| Var
| Const
| Abs
| Neg
...
end
Shape Inference
The shape of data might change while traveling through different nodes in a
computation graph. The shape information is very valuable for debugging and
optimization purposes. When all the inputs of a given function are known, the shape
of the outcome can be decided; hence, the shape information of a computation graph
becomes available. The Owl_computation_shape functor is created for automating
shape inference. The core function of this functor is infer_shape which calls the
corresponding shape inference function of an operator using pattern matching.
There are over 30 shape inference functions defined. We can take a closer look at the most frequently used ones. For example, scalar operators such as Scalar_Add do not require shape information, so their inference function simply returns an empty array. The reason for using an array of arrays as the return type is that an operator might produce multiple ndarrays as outputs.
The _infer_shape_01 pattern is defined for unary operators with a single input and a single output, such as Abs. These operators do not change the shape of the data; the input and output have exactly the same shape. Thus, the inference function simply returns the shape of the input as the shape of the output.
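A simplified sketch of these two patterns follows; the name used for the scalar case is assumed to follow the same _infer_shape naming convention, and the nested array type allows an operator to produce multiple outputs.
let _infer_shape_00 _input_shapes = [| Some [||] |]

let _infer_shape_01 input_shapes =
  match input_shapes.(0).(0) with
  | Some s -> [| Some (Array.copy s) |]
  | None   -> [| None |]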
If the inputs have the same shape, binary operators like Add produce an output of the same shape. However, if the shapes of the inputs are different, broadcasting must be taken into account to correctly calculate the output shape. Luckily, the broadcasting rules used in the Ndarray module can be codified easily for shape inference purposes.
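A sketch of the corresponding binary pattern, where broadcast_shape is an assumed helper that applies the Ndarray broadcasting rules to two shapes:
let _infer_shape_broadcast input_shapes =
  match input_shapes.(0).(0), input_shapes.(1).(0) with
  | Some s0, Some s1 -> [| Some (broadcast_shape s0 s1) |]
  | _                -> [| None |]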
We do not cover all the shape inference patterns here; interested readers are encouraged to read the source code of Owl to learn more about them. When the shape of a graph is known, we can exploit this information to calculate the total memory consumption, discover optimization opportunities, validate the consistency of inputs and outputs, identify potential bugs, etc.
The general function for creating a node is make_node. This function utilizes the node type defined in the Owl_graph module, which provides a comprehensive set of functions for manipulating a graph.
let make_node ?name ?value ?shape ?freeze ?reuse ?state op =
  (* the shape, state, and reuse arguments are resolved to defaults in the
     same way as freeze and value below *)
  let freeze =
match freeze with
| Some s -> s
| None -> false
in
let value =
match value with
| Some v -> v
| None -> [||]
in
let attr = { op; freeze; reuse; state; shape; value; block = None } in
let node = Owl_graph.node ?name attr in
if value <> [||] then make_value_block value.(0) node;
node
For simple ndarray creation functions, such as empty, zeros, etc., make_node is sufficient because these functions do not require any parents to provide inputs, only their own shape information.
For unary operators which do require the output of a parent node, make_then_connect is called to connect the parent's output to the operator's input. The outputs of parent nodes are unpacked from the arr type, while the outputs of a child are packed back into the arr type.
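A sketch of make_then_connect along these lines, assuming the connect helpers from Owl_graph; Owl's implementation contains more bookkeeping:
let make_then_connect ?shape op parents =
  let shape =
    match shape with
    | Some s -> s
    | None   -> infer_shape op parents
  in
  let child = make_node ~shape op in
  (* link the child below its parents in both directions *)
  connect_ancestors parents [| child |];
  Array.iter (fun p -> connect_descendants [| p |] [| child |]) parents;
  child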
Binary operators work in a similar way; the only difference is that the inputs come from two parent nodes rather than one. With these basic functions, we can construct very complicated computation graphs. Quite often, the underlying computation graph appears more complicated than the actual function defined in code; neural network applications are good examples.
The graph optimization itself lives in the Owl_computation_optimiser functor. Its core function _optimise_term traverses the graph recursively and, for each node that has not yet been optimized, dispatches the operator to a pattern-specific optimization function:
let rec _optimise_term x =
  (* skip nodes whose state is already Valid *)
  if is_valid x = false
  then (
(match get_operator x with
| Noop -> pattern_003 x
| Empty _shape -> pattern_000 x
| Zeros _shape -> pattern_000 x
...
| Add -> pattern_001 x
| Sub -> pattern_000 x
| Mul -> pattern_019 x
| Div -> pattern_007 x
...
| Scalar_Add -> pattern_010 x
| Scalar_Sub -> pattern_010 x
...
| Dot (_transa, _transb, _alpha, _beta) -> pattern_005 x
| Fused_Adagrad (_rate, _eps) -> pattern_000 x
| _ -> failwith "Owl_computation_optimiser:_optimise_term");
validate x)
end
In this part, we explain the three most commonly used graph optimization patterns: constant folding, operation fusing, and removing zeros. Constant folding is a very basic pattern for reducing graph size. In a computation graph, it is common for many constants to be involved, and as a result some subgraphs can be precalculated. Figure 6-4 shows such an example. In this subgraph, the nodes that #241 depends on are either constants or operations on constants. Therefore, the value of node #241 is already determined, and we can fold this subgraph into one single node before evaluating the whole graph.
From the definition of the _optimise_term function, we can see that the Scalar_Add operator triggers the pattern_010 function. This function first tries to optimize the parent nodes, and then it checks whether both parents are constants. If so, the function evaluates the expression based on the current operator, creates a new constant node for the result, and removes the current node and its parents. By doing so, all the expressions that can be evaluated during this phase are folded into constants, which saves a lot of time during the graph evaluation phase.
and pattern_010 x =
let parents = parents x in
let a = parents.(0) in
let b = parents.(1) in
_optimise_term a;
_optimise_term b;
match get_operator a, get_operator b with
| Const, Const ->
let a_val = node_to_elt a |> elt_to_float in
let b_val = node_to_elt b |> elt_to_float in
let c_val = pattern_011 (get_operator x) a_val b_val in
set_parents x [||];
set_reuse x false;
set_operator x Const;
freeze x;
set_value x [| float_to_elt c_val |> unpack_elt |> elt_to_value |]
| _ -> ()
The next pattern, fusing operations, combines multiple operations into one where applicable. For example, in Figure 6-5, nodes #421, #463, and #464 are fused into one fma node (i.e., a fused multiply-add operation). Owl also recognizes more complicated patterns; for example, a pattern formed by nodes #511–#515 appears frequently in DNN training that uses the Adagrad (adaptive subgradient) method. Fusing all these operations into one single operation improves computing efficiency as well as numerical accuracy. Besides, this optimization also effectively reduces round trips to memory, which saves a lot of time when operating on large ndarrays.
In the source code, fusing the FMA operation depends on the pattern_004 function. The function first checks if the current operator is Add and then checks if one of the inputs comes from a multiplication operator. If both conditions are satisfied, the pattern is identified. The refnum is a counter tracking how many times the output of an operator has been referred to by other expressions. If refnum is greater than one, we cannot fuse the operator because its output is used by another operator as input.
and pattern_004 x =
if get_operator x = Add
then (
let x_parents = parents x in
let a = x_parents.(0) in
let b = x_parents.(1) in
if get_operator a = Mul && refnum a = 1
then (
let new_parents = Owl_utils_array.(parents a @ [| b |]) in
set_parents x new_parents;
replace_child a x;
set_operator x FMA;
remove_node a)
else if get_operator b = Mul && refnum b = 1
then (
let new_parents = Owl_utils_array.(parents b @ [| a |]) in
set_parents x new_parents;
replace_child b x;
set_operator x FMA;
remove_node b))
Next, the adding-zero pattern is easy to spot in a graph. If one node adds a zeros node, then the zeros node can be safely removed. In the example shown in Figure 6-6, nodes #164 and #166 are removed, and the others are folded. Moreover, node #255 for the repeat operation is also removed because the add operation already supports broadcasting; removing #255 saves some runtime memory during evaluation.
The pattern_002 function detects both the x + 0 and 0 + x patterns. The implementation is intuitive: after an Add operator is identified, the function checks whether one of the inputs is zero. If so, the Zeros node is removed, and the current Add operator is replaced with the Noop operator.
and pattern_002 x =
let x_parents = parents x in
let a = x_parents.(0) in
let b = x_parents.(1) in
if get_operator x = Add
then (
match get_operator a, get_operator b with
| Zeros _, _ ->
set_operator x Noop;
remove_edge a x;
_optimise_term x
| _, Zeros _ ->
set_operator x Noop;
remove_edge b x;
_optimise_term x
| _, _ -> ())
There are also other patterns that focus on specific calculations, such as multiplication, division, repeat, sum-reduce, etc. Please refer to the source code if you are interested in them. To show how effective the Optimiser is, we again use the aforementioned LeNet-like CNN trained on the MNIST dataset. The original network has 201 nodes and 239 edges; after applying the graph optimization in the Optimiser, the whole computation graph consists of only 103 nodes and 140 edges.
Optimizing a graph structure to improve evaluation performance is an advanced topic. But as you can see in the previous step-by-step illustration, advanced functionality can be decomposed into a set of simple functions that identify specific patterns and optimize locally, using a typical divide-and-conquer approach. The graph optimization in TensorFlow follows a somewhat similar path. The computation graph in TensorFlow is first constructed using the Python frontend, and via a layer of the C API, this graph is converted to a format that the C++ backend can recognize. After that, the graph is optimized using various techniques, including common subexpression elimination, constant folding, removing identity nodes, removing dead nodes, etc. If you look at the source code of TensorFlow, this functionality is handled by the common runtime module of its core engine.
Computation Engine
Finally, we have reached the top of the CGraph functor stack: the computation engine. Because a computation graph has to be evaluated on hardware, each type of device must implement its own computing engine. For CPU devices, the core function eval_gen consists of two steps: the first step initializes the graph by calling _init_terms, and the second step evaluates the graph by calling _eval_terms.
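A minimal sketch of eval_gen along these lines; the surrounding functor structure and the exact helper names in Owl are omitted here:
let eval_gen nodes =
  (* allocate values and memory blocks for each node in the graph *)
  _init_terms nodes;
  (* recursively evaluate the graph starting from the output nodes *)
  _eval_terms nodes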
For comparison, let us also take a look at the computing engine for OpenCL devices. The functor structure of the OpenCL computing engine is almost the same except for the eval_gen function. That function contains a bit more code because setting up a computing environment is much more complicated on an OpenCL-compatible device than on a CPU device: the procedure consists of many steps, including specifying the context, accelerator, command queue, kernel programs, etc. The evaluation outputs also need to be explicitly copied from GPU memory to CPU memory for further processing. As a concrete example of how individual operations are evaluated, the following CPU function evaluates a map-like operation: it first evaluates the parent nodes, unpacks their values into ndarrays, and then applies the operation, writing the result into the output array of the current node.
and _eval_map_01 x f =
_eval_terms (parents x);
let inputs = Array.map (fun parent ->
value_to_arr (get_value parent).(0)) (parents x) in
let out = value_to_arr (get_value x).(0) in
f ~out inputs
The corresponding function for OpenCL devices is more complicated. Because the computation takes place on an accelerator, we need to set up the command queue for communication and the event queue for synchronizing computing units. We also need to specify suitable kernels for the computing logic. These kernels are compiled dynamically at runtime and then copied to the computing units of the accelerator. When the output is finally ready, we must explicitly dispatch an event to notify the dependent nodes.
Programming a GPU is very much like programming a computer cluster: the gain of parallel computing comes with inevitable synchronization and communication overhead. Therefore, GPU computing only makes sense when the computation is complex enough to dwarf these overheads.
When offloading computation to a GPU, we should avoid transmitting data back and forth between the host and the device memory, so eager evaluation is not ideal in this context because performance will be throttled by copying. This is the gap between GPU computing and a language with eager evaluation. The computation graph essentially fills this gap between Owl and GPU computing because laziness can now be simulated.
From an implementation perspective, we only need to write a new engine functor for GPU devices to evaluate a graph; all the other functors remain the same. Compared to the CPU engine, the OpenCL engine maintains the memory allocated on both the host and the device for each node, copying happens only when necessary, and the allocated memory on the device is reused as much as possible.
Owl's memory allocation strategy can be described in terms of the pebble game, a classic construct in computational complexity. Given a directed acyclic graph, the player can make the following moves:
1. The player can place a pebble on an input vertex at any time.
2. If all predecessors of a vertex are pebbled, the player can place a pebble on that vertex or slide a pebble from one of its predecessors onto it.
3. The player can remove any pebble from a vertex (and reuse that pebble later).
The goal of the game is to place a pebble at least once on some fixed output vertices of the graph. Figure 6-7 shows an example of an optimal pebbling strategy on a small computation graph (gray nodes are pebbled), using the moves 1 -> 2 -> 3 -> 1 -> 2 -> 2. We assume that the goal is to pebble node 5.
This game relates to the memory allocation of the computation graph if we see
pebbles as memory blocks used to store the output value of a node. We assume that the
values of the inputs are known (move 1). We can only compute the value of a vertex if all
its predecessors are simultaneously stored in memory (move 2). The sliding move means
that the memory of a node can be overwritten by its successor during its computation
(inplace reuse). We can always reuse a memory block from any other node (move 3).
Given a graph, the idea is thus to find a strategy to pebble it using the minimum number
of pebbles, in other words, using as little memory as possible.
We also want to avoid pebbling any node twice, in order to keep the execution time as low as possible, because that would mean computing the same node twice. Given these constraints, finding a strategy using the least number of pebbles is unfortunately NP-complete [45]. Since computation graphs can have a few thousand nodes, we implement a fast heuristic instead of an exact algorithm.
Now we can apply the pebble game to our memory allocation process. We propose to share memory between nodes that (1) are not necessarily a parent/child pair and (2) do not have the same output size (by allocating a large block of memory once, without necessarily using all of it all the time). To do this efficiently, we first have to fix an evaluation order (in practice, any topological order). Given this order, we can pinpoint the moment when the memory of a node becomes useless by keeping a counter of how many times it has been used. When it has been used by all its children, we can recycle its memory. Then, to allocate memory to a node, we simply check which blocks are available and select the one with the closest size (in order not to waste too much memory). If no block is available, we allocate a new one. This can be executed in O(n log n) time, which is negligible compared to the actual cost of evaluating the graph.
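A minimal sketch of the block selection step described above, using the block record shown earlier in this chapter; make_new_block is an assumed helper that allocates fresh memory:
let allocate_block free_blocks numel =
  (* candidate blocks must be large enough for the requested output *)
  let fits = List.filter (fun b -> b.size >= numel) !free_blocks in
  match List.sort (fun a b -> compare a.size b.size) fits with
  | best :: _ ->
    free_blocks := List.filter (fun b -> b.block_id <> best.block_id) !free_blocks;
    best
  | [] -> make_new_block numel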
Note that some operations cannot overwrite their inputs while they are being computed (the sliding move from the pebble game is forbidden) and that some nodes cannot be overwritten for practical purposes, typically constant nodes or neural network weights. When evaluated in the right order, the computation graph needs much smaller blocks of memory than the non-optimized version. As an example, part of an optimized computation graph is shown in Figure 6-8. Each color corresponds to a memory block, and white nodes always need to be kept in memory.
The add_node_to_block function illustrates the steps of introducing a new node. If the memory block of the parent is reusable, the function checks whether the memory is large enough to accommodate the output of the current operator. If so, the current operator is added to the list of nodes sharing that memory block, and the memory is reshaped according to the shape of the output.
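A sketch of what add_node_to_block might do, again using the block record from earlier; node_numel, node_shape, and reshape_memory are assumed helpers, and the reusability check on the parent is assumed to have been done by the caller:
let add_node_to_block x block =
  (* the block can be shared only if it can hold the output of x *)
  if block.size >= node_numel x then begin
    block.nodes  <- x :: block.nodes;
    block.active <- Some x;
    (* reuse the same memory, viewed with the shape of x's output *)
    block.memory <- reshape_memory block.memory (node_shape x)
  end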
6.5 Summary
In this chapter, we introduced the core computation graph module in Owl. We started with a general introduction to computation graphs in numerical computing and why we built one in Owl. Then we used several examples to demonstrate how the computation graph module is used in Owl. This was followed by the internal design of this module, most importantly the CGraph functor stack and its position in the Owl architecture. The computation graph creates a large optimization space, and in this chapter we presented two optimizations in detail: the first is graph structure optimization, and the second is optimizing memory allocation in the computation graph. The computation graph remains an important research topic, and we believe there is still much potential in this module for performance improvement.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 7
Performance Accelerators
7.1 Hardware Accelerators
The Graphics Processing Unit (GPU) has become one of the most important types of hardware accelerator. It was designed to render 3D graphics and videos and is still core to the gaming industry. Besides creating stunning visual effects, programmers also exploit the GPU's strength in parallel processing in many fields to perform computing-heavy tasks, such as health data analytics, physical simulation, artificial intelligence, etc.
Recall from Chapter 2 the architecture of a typical CPU. The architecture of a GPU core is somewhat similar. Figure 7-1 shows one core of an Nvidia GTX 1080, a GPU that contains 20 such cores. Compared with a CPU core, it contains many more multithreading units, including more single-instruction-multiple-data (SIMD) function units, and a GPU contains more cores overall. Another characteristic of the GPU is its small cache. This GPU contains only two levels of cache (the figure shows only the level 1 cache; the level 2 cache is shared by all 20 cores), each smaller than a typical CPU's. Besides, the GPU focuses on throughput, and thus the bandwidth between its cores and memory is much larger than that of a CPU.
Another type of accelerator that has gained much attention in recent years is the Tensor Processing Unit (TPU). In 2016, Google announced the TPU, its application-specific integrated circuit, and revealed that TPUs had been deployed in Google's data centers to accelerate neural network computation.
The design of the TPU is based on the fact that matrix multiplication plays a dominant role in neural network–related computing, and it uses this operation as a primitive. In the CPU, the basic calculation unit is the scalar, so we can, for example, add two integers using one single instruction within a cycle. The GPU, on the other hand, widely utilizes multithreading, and thus the user can add two vectors in a cycle. The TPU goes further and finishes a matrix operation in one cycle.
A TPU v2 core consists of a Matrix Multiply Unit (MXU) and a Vector Processing Unit (VPU). The former specializes in matrix multiplication, and the latter takes care of all other types of task, such as activations. The MXU utilizes a systolic array architecture to enable single-clock matrix operations; as a result, it can execute up to 128K operations in one cycle. Google reports that the TPU delivered 15–30x higher performance and 30–80x higher performance per watt than contemporary CPUs and GPUs. Besides, due to neural networks' tolerance to errors, the TPU also performs quantization, compressing computation by converting continuous floating-point numbers to discrete ones.
Utilizing Accelerators
There is no doubt that numerical computing software heavily relies on hardware accelerators: TensorFlow, PyTorch, Julia, MATLAB, etc. all support multiple types of devices, including at least the CPU and GPU. In general, there are two approaches to providing this support.
The first, and most widely used, is direct support of the hardware. Take the GPU as
an example. When programming a GPU, Nvidia CUDA is a widely used choice. CUDA is
a parallel computing platform and programming model for computing on Nvidia GPUs.
In TensorFlow, a computation graph is first expressed in Python on the frontend and is
then accordingly built up using the C++ backend. This graph is further optimized and
partitioned onto multiple devices, which can be CPU, GPU, or TPU devices. Each device
invokes the corresponding executor to run the assigned computation on its subgraph.
For TPU device execution, TensorFlow incorporates a compiler and software stack that
translates API calls from TensorFlow computation graphs into TPU instructions. In Julia,
support for Nvidia GPUs is provided by its CUDA.jl package. Built on the CUDA toolkit, it
enables both interfacing with the CUDA API directly and writing CUDA kernels. NumPy
does not support GPUs, but in the vast Python world, there are a lot of GPU-friendly
alternatives, such as Numba, CuPy, etc.
Compared with CUDA, the Open Computing Language (OpenCL) serves as an open
source standard for cross-platform parallel programming and is not limited to Nvidia
GPUs. Therefore, some numerical libraries and software also support it to work on non-
Nvidia GPUs. The StreamExecutor that TensorFlow utilizes to process computation tasks
on a GPU device is actually a unified wrapper around the CUDA and OpenCL runtimes.
Recently, given a growing number of deep learning frameworks and an equally growing number of hardware accelerator platforms, a new approach is to utilize intermediate representations. For example, deep learning compilers have grown rapidly. A DL compiler takes the model definition described in a deep learning framework and generates an efficient implementation specific to certain target hardware. TVM [10] is one popular DL compiler that works with a wide range of frameworks and hardware devices. A closely related idea is an open neural network standard that can be converted to and from various frameworks and can also be compiled and executed on various hardware; one such example is the Open Neural Network Exchange (ONNX) format. In summary, there is a growing trend to separate out the definition of computation and leave optimization, code generation, etc. to low-level compilers in pursuit of the best computation performance. We can think of DL compilers and open standards as the neck of an hourglass bridging the two types of ecosystems. In the rest of this chapter, following the latter approach, we propose owl_symbolic, which converts Owl computations into the ONNX format so that they can further be executed on various hardware accelerators.
7.2 Design
Besides the requirement to execute on accelerators, the development of the owl_symbolic library is motivated by several other factors. For one thing, scientific computation can be considered as consisting of two broad categories: numerical computation and symbolic computation. Owl has achieved a solid foundation in the former but has yet to support the latter, which is heavily used in many fields.
Besides, tasks such as visualizing a computation also require some form of intermediate representation (IR). Owl already provides a computation graph layer that separates the definition and execution of computation to improve performance, as introduced in Chapter 6, but it is not an IR layer designed for these different tasks. Toward this end, we began to develop an intermediate symbolic representation of computations and to facilitate various tasks based on this symbolic representation.
One thing to note is that our symbolic representation should not be mistaken for classic symbolic computation (as in a computer algebra system), which manipulates mathematical expressions symbolically, much like traditional manual computation. Pursuing that kind of symbolic computation with Owl is indeed one of our core motivations; currently, we provide a symbolic representation layer as a first step toward that target. More discussion will be added in future versions as support for symbolic math in Owl develops.
The owl_symbolic library is divided into two parts: the core symbolic representation
that constructs a symbolic graph and various engines that perform different tasks based
on the graph. The architecture design of this system is shown in Figure 7-2.
Core Abstraction
The core part is designed to be minimal and contains only the necessary information. It already covers many common computation types, such as math operations, tensor manipulations, and neural network–specific operations such as convolution, pooling, etc. Each symbol in the symbolic graph performs a certain operation. Inputs to a symbolic graph can be constants such as integers, float numbers, complex numbers, and tensors. The inputs can also be variables with certain shapes; an empty shape indicates a scalar value. Users can then provide values to the variables after the symbolic graph is constructed.
Symbol
The symbolic representation is defined mainly as an array of symbols. Each symbol is a graph node that has an attribute of type Owl_symbolic_symbol.t, which means we can traverse the whole graph starting from one symbol. Besides the symbols, the name field is the graph name, and node_names contains the names of all the nodes in this graph.
type t =
{ mutable sym_nodes : symbol array
; mutable name : string
; mutable node_names : string array
}
type t =
| NOOP
| Int of Int.t
| Complex of Complex.t
| Float of Float.t
| Tensor of Tensor.t
| Variable of Variable.t
| RandomUniform of RandomUniform.t
| Sin of Sin.t
| Cos of Cos.t
| Exp of Exp.t
| ReduceSum of ReduceSum.t
| Reshape of Reshape.t
| Conv of Conv.t
....
There are about 150 operations in total included in our symbolic representation. Each operation is implemented as a module. These modules share common attributes such as names, input operation names, and output shapes, and each module may contain zero or more attributes of its own. Take the Sin operation module as an example.
The module provides properties such as op_type and functions such as create, which returns objects of type Sin.t. The name, input, and out_shape fields are common attributes across the operation modules.
In implementing the supported operations, we follow the categorization used in
ONNX. These operations can be generally divided into different groups as follows:
• Tensor: Normal tensor operations, like the ones that are included in
the Ndarray module, such as concat, reshape, etc.
There are also some functions that only apply to certain types of operation. Operations of the generator type all need to specify the type of data they support, so we use the dtype function to check their data types. Another example is the output property: most operations have only one output, and therefore the operation's name is its output name. However, for operations such as MaxPool that contain multiple outputs, we need another function, output.
Type Checking
The types supported by owl_symbolic are listed as follows:
type number_type =
| SNT_Noop
| SNT_Float
| SNT_Double
| SNT_Complex32
| SNT_Complex64
| SNT_Bool
| SNT_String
| SNT_Int8
| SNT_Int16
| SNT_Int32
| SNT_Int64
| SNT_Uint8
| SNT_Uint16
| SNT_Uint32
| SNT_Uint64
| SNT_Float16
| SNT_SEQ of number_type
This list of types covers most number and non-number types. Besides, the SNT_SEQ type composes with these basic types to indicate a list of float numbers, boolean values, strings, etc.
Operators
All these operations are invisible to users; what users actually use are the operators. To build a graph, we first need to build the required attributes into an operation and then put it into a graph node. This is what an operator does. Take the sin operator as an example.
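A sketch of what the sin operator might look like; the naming helper and the helper returning the parent's name are assumptions, but it follows the make_node flow described below:
let sin x =
  let name = generate_name "sin" in          (* assumed naming helper *)
  let input = [| node_name x |] in           (* assumed helper: the parent's name *)
  let op = Owl_symbolic_symbol.Sin (Sin.create name input) in
  make_node op [| x |]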
Here, the sin operator takes its parent node x as input, gets its name as an input property, and creates a symbol node with the function make_node. This function takes an operation and an array of parent symbols and creates one symbol as a return value. What it does is mainly creating a child node using the given operation as the node attribute, updating the child's input and output shapes, and then connecting the child with its parents before returning the child node. The connection is made in both directions.
Therefore, the users can use the operators to build a graph representation. Here is an
example:
open Owl_symbolic
open Op
open Infix
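(* A minimal sketch of building a graph with these operators; the exact
   expression is illustrative. Type.make_tensor, variable, and
   SymGraph.make_graph are used as in the later examples of this chapter. *)
let x = variable ~shape:[| 2; 3 |] "X"
let t = Type.make_tensor ~flt_val:[| 1.; 2.; 3.; 4.; 5.; 6. |] [| 2; 3 |]
let y = variable ~init:t "Y"
let z = sin x + y
let g = SymGraph.make_graph [| z |] "sym_graph"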
Here, we start with the variable operator, which creates a placeholder for data that will arrive later. You can specify the shape of the variable with the ~shape parameter; if it is not specified, the variable defaults to a scalar. You can also choose to initialize a variable with a tensor, so that even if you don't feed any data to it, the default tensor value will be used. A tensor in owl_symbolic is defined as
type tensor =
{ mutable dtype : number_type
; mutable shape : int array
; mutable str_val : string array option
; mutable flt_val : float array option
; mutable int_val : int array option
; mutable raw_val : bytes option
}
A tensor has a specific data type and contains its value as a string array, float array, integer array, or bytes; only one of these fields can be used. If initialized with a tensor, a variable takes the same data type and shape as the tensor.
Naming
Currently, we adopt a global naming scheme: an incremental index number is appended to each node's type. For example, if a graph has an Add symbol, a Div symbol, and then another Add symbol, the nodes will be named add_0, div_1, and add_1. One exception is the variable, which a user has to name explicitly when creating it. Of course, users can also optionally name any node in the graph, and the system will check that the name of each node is unique. The symbolic graph contains the node_names field, which includes the names of all the nodes in the graph.
Shape Inferencing
One task the symbolic core needs to perform is shape checking and shape inference. Shape inference is performed in the make_node function and therefore happens every time a user uses an operation to construct a symbolic node and connect it with previous nodes. At that point, the parents of the current node are already known.
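A sketch of that step as it might appear inside make_node; the helper names and the argument order of infer_shape are assumptions:
let infer_and_set_shape sym parents =
  (* one entry per parent; a parent may expose several (optional) output shapes *)
  let in_shape = Array.map (fun p -> get_out_shape p) parents in
  let out = Owl_symbolic_shape.infer_shape in_shape sym in
  set_out_shape sym out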
As the code shows, for each node, we first find the output shapes of its parents. The in_shape is of type int array option array array. You can understand it this way: an int array is a shape; an int array option means this shape could be None; an int array option array is the whole output of one parent, since one parent may produce multiple outputs; finally, an int array option array array collects the outputs of all parents. The main function Owl_symbolic_shape.infer_shape then infers the output shape of the current node and saves it to the out_shape property of that symbol.
The infer_shape function itself checks the symbol type and then matches it with a specific implementation. For example, a large number of operations take one parent and keep its output shape; the infer_shape_01 pattern covers these operations, simply taking the input shape and returning it unchanged as the output shape.
There are two possible reasons for an input shape to be None. First, each node is initialized with a None output shape. Second, during shape inference, in certain cases the output shape depends on the runtime content of the input nodes, not just their shapes and the attributes of the current node; in that case, the output shape is set to None. Once the input shapes contain None, all subsequent shape inference results will be None, which means the output shapes cannot be decided at compile time.
Multiple Outputs
Most of the operators are straightforward to implement, but some of them return multiple symbols. In that case, an operator returns not a node but a tuple or, when the number of outputs is uncertain, an array of nodes. For example, the MaxPool operation returns two outputs: one is the normal max-pooling result, and the other is the corresponding tensor containing the indices of the selected values during pooling. Another example is the Split operation, which splits a tensor into a list of tensors along the specified axis; it returns an array of symbols.
Engines
Based on this simple core abstraction, we use different engines to provide functionality: converting to and from other computation expression formats, printing to a human-readable format, graph optimization, etc. As we have said, the core part is kept minimal; if an engine requires information beyond what the core provides, each symbol has an attr property as an extension point. All engines must follow the following signature:
type t
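(* a sketch of the remaining four functions implied by the text below;
   the exact types in owl_symbolic may differ *)
val of_symbolic : Owl_symbolic_graph.t -> t
val to_symbolic : t -> Owl_symbolic_graph.t
val save : t -> string -> unit
val load : string -> t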
It means that each engine has its own core type t, be it a string or another graph format, and it needs to convert t to and from the core symbolic graph type, as well as save and load a value of type t to and from a file. An engine can also contain extra functions besides these four. Now that we have explained the design of owl_symbolic, let's look at the details of some engines in the next few sections.
7.3 ONNX Engine
The ONNX engine is the current focus of development in owl_symbolic. ONNX, the Open Neural Network Exchange, is a widely adopted format: a neural network model defined in ONNX can, via suitable converters, run on different frameworks and thus on different hardware accelerators. The main target of ONNX is to promote the interchangeability of neural network and machine learning models, but it is worth noting that the standard also covers many basic operations in scientific computation, such as power, logarithm, trigonometric functions, etc. Therefore, the ONNX engine is a good starting point because of its coverage of operations.
Taking a symbolic graph as input, how does the ONNX engine produce an ONNX model? We use ocaml-protoc, a protobuf compiler for OCaml. The ONNX specification is defined in an onnx.proto file, and ocaml-protoc compiles this protobuf file into OCaml types along with the corresponding serialization functions.
message ModelProto {
optional int64 ir_version = 1;
repeated OperatorSetIdProto opset_import = 8;
optional string producer_name = 2;
optional string producer_version = 3;
optional string domain = 4;
optional int64 model_version = 5;
optional string doc_string = 6;
optional GraphProto graph = 7;
repeated StringStringEntryProto metadata_props = 14;
};
open Owl_symbolic_specs.PT
type model_proto =
{ ir_version : int64 option
; opset_import : operator_set_id_proto list
; producer_name : string option
; producer_version : string option
; domain : string option
; model_version : int64 option
; doc_string : string option
; graph : graph_proto option
; metadata_props : string_string_entry_proto list
}
open Owl_symbolic
open Op
open Infix
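(* The expression below is illustrative; x and y are scalar variables that
   will be fed by the user at run time. *)
let x = variable "X"
let y = variable "Y"
let z = sin x + (y ** float 3.)
let g = SymGraph.make_graph [| z |] "sym_graph"
let m = ONNX_Engine.of_symbolic g
let _ = ONNX_Engine.save m "test.onnx"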
After including the necessary library components, the first three lines of code create a symbolic representation z using symbolic operators such as sin, pow, and float. The x and y are variables that accept user inputs. The expression is then used to create a symbolic graph; this step mainly checks whether there is any duplication of node names. Then the of_symbolic function in the ONNX engine takes the symbolic graph as input and generates a model_proto data structure, which can be further saved as a model named test.onnx.
To use this ONNX model, we could use any framework that supports ONNX. Here,
we use the Python-based ONNX Runtime as an example. We prepare a simple Python
script as follows:
import numpy as np
import math
import onnxruntime as rt
sess = rt.InferenceSession("test.onnx")
input_name_x = sess.get_inputs()[0].name
input_name_y = sess.get_inputs()[1].name
x = np.asarray(math.pi, dtype="float32")
y = np.asarray(3., dtype="float32")
pred_onx = sess.run(None, {input_name_x: x, input_name_y: y})
This script is very simple: it loads the ONNX model we have just created, gets the two input variables, and assigns two values to them in the sess.run command. All the user needs to know in advance is that there are two input variables in this ONNX model. Note that we can define not only scalar inputs but also tensor variables in owl_symbolic and then assign a NumPy array to them when evaluating.
open Owl_symbolic
open Op
let _ =
let flt_val = [| 1.; 2.; 3.; 4.; 5.; 6. |] in
let t = Type.make_tensor ~flt_val [| 2; 3 |] in
let x = variable ~init:t "X" in
let y = sin x in
let g = SymGraph.make_graph [| y |] "sym_graph" in
let z = ONNX_Engine.of_symbolic g in
ONNX_Engine.save z "test.onnx"
This computation simply takes an input variable x and then applies the sin
operation. Let’s look at the Python side:
import numpy as np
import onnxruntime as rt
sess = rt.InferenceSession("test.onnx")
pred_onx = sess.run(None, input_feed={})
print(pred_onx[0])
[[ 0.84147096 0.9092974 0.14112 ]
[-0.7568025 -0.9589243 -0.2794155 ]]
Note how the initializer works without the user providing any input in the input feed dictionary. Of course, users can still provide their own data to this computation, but the mechanism is a bit different. For example, in onnxruntime, sess.get_inputs() now gives an empty set. Instead, you should use get_overridable_initializers():
input_x = sess.get_overridable_initializers()[0]
input_name_x = input_x.name
input_shape_x = input_x.shape
x = np.ones(input_shape_x, dtype="float32")
pred_onx = sess.run(None, {input_name_x: x})
If we were to define a full neural network directly with these low-level symbolic operators, the code would quickly become verbose, for example:
let dnn =
  let x = variable ~shape:[| 100; 3; 32; 32 |] "X" in
  let t_conv0 = conv ~padding:Type.SAME_UPPER x
  ...
Apparently, that’s too much information for the users to handle. To make things
easier for the users, we create a neural network layer based on existing symbolic
operations. This lightweight layer takes only 180 LoC, and yet it provides an Owl-like
clean syntax for the users to construct neural networks. For example, we can construct
an MNIST-DNN model:
open Owl_symbolic_neural_graph
let nn =
input [| 100; 3; 32; 32 |]
|> normalisation
|> conv2d [| 32; 3; 3; 3 |] [| 1; 1 |]
|> activation Relu
|> max_pool2d [| 2; 2 |] [| 2; 2 |] ~padding:VALID
|> fully_connected 512
|> activation Relu
|> fully_connected 10
|> activation (Softmax 1)
|> get_network
let _ =
let onnx_graph = Owl_symbolic_engine_onnx.of_symbolic nn in
Owl_symbolic_engine_onnx.save onnx_graph "test.onnx"
Besides this simple DNN, we have also created complex architectures such as ResNet, InceptionV3, SqueezeNet, etc. They are all adapted from existing Owl DNN models with only minor changes. The execution of the generated ONNX model is similar:
import numpy as np
import onnxruntime as rt
sess = rt.InferenceSession("test.onnx")
input_name_x = sess.get_inputs()[0].name
input_name_shape = sess.get_inputs()[0].shape
input_x = np.ones(input_name_shape, dtype="float32")
pred_onx = sess.run(None, {input_name_x: input_x})[0]
For simplicity, we generate a dummy input for the execution/inference phase of this model. Of course, the weights in our model are not trained at this point; training should be completed in a framework such as TensorFlow. Combining trained weight data into the ONNX model remains future work.
Furthermore, by using tools such as js_of_ocaml, we can convert both examples into JavaScript; executing them creates the ONNX models, which in turn can be executed in the browser using ONNX.js, which utilizes WebGL. In summary, using ONNX as the intermediate format for exchanging computation across platforms opens up numerous promising directions.
7.4 LaTeX Engine
The LaTeX engine takes a symbolic representation as input and produces LaTeX strings, which can then be visualized using different tools. Its design is simple, mainly matching the symbol type and projecting it to the correct implementation. Again, let's look at an example that builds up a symbolic representation of the calculation exp(sin(x_0)^2 + cos(x_0)^2) + 10 × x_0^2 + exp(πi):
open Owl_symbolic
open Op
open Infix
let make_expr0 () =
let x = variable "x_0" in
let y =
exp ((sin x ** float 2.) + (cos x ** float 2.))
+ (float 10. * (x ** float 2.))
+ exp (pi () * complex 0. 1.)
in
SymGraph.make_graph [| y |] "sym_graph"
# let () = make_expr0 ()
|> LaTeX_Engine.of_symbolic
|> print_endline
\exp(\sin(x_0) ^ 2 + \cos(x_0) ^ 2) + 10 \times x_0 ^ 2 + \exp(\pi \
times 1.00i)
Simply putting it in raw string form is not very helpful for visualization. We have therefore built a web UI in this engine that utilizes KaTeX, which renders LaTeX strings directly in a browser. In the following, we use the html function provided by the engine to show this string in the web UI:
# let () =
let exprs = [ make_expr0 () ] in
LaTeX_Engine.html ~dot:true ~exprs "example.html"
The generated “example.html” web page is a stand-alone page that contains all the
required scripts. Once opened in a browser, it looks like Figure 7-3.
For each expression, the web UI contains its rendered LaTeX form and
corresponding computation graph.
7.5 Owl Engine
The Owl engine enables converting an Owl computation graph to or from a symbolic representation. A symbolic graph can thus benefit from Owl's concise syntax and powerful features such as algorithmic differentiation.
The conversion between an Owl CGraph and the symbolic representation is straightforward, since both are graph structures. We only need to focus on making the operation projection between these two systems correct.
The basic idea is simple: find the type of a symbol and its input nodes in the CGraph, and then project them onto the symbolic representation. For most of the math operators, such as sin, the projection is one to one, but that is not always the case. For some operations such as subtraction, we have Sub, SubScalar, ScalarSub, etc., depending on the types of the inputs, but they can all be projected onto the sub operator in the symbolic representation. For the convolution operation, we need to first convert the parameters in a suitable way before the projection.
Let’s look at an example of using the Owl engine:
open Owl_symbolic
module G = Owl_computation_cpu_engine.Make (Owl_algodiff_primal_ops.S)
module AD = Owl_algodiff_generic.Make (G)
module OWL_Engine = Owl_symbolic_engine_owl.Make (G)
let make_graph () =
let x = G.ones [| 2; 3 |] |> AD.pack_arr in
let y = G.var_elt "y" |> AD.pack_elt in
let z = AD.Maths.(sin x + y) in
let input = [| AD.unpack_elt y |> G.elt_to_node |] in
let output = [| AD.unpack_arr z |> G.arr_to_node |] in
G.make_graph ~input ~output "graph"
let _ =
let k = make_graph ()
|> OWL_Engine.to_symbolic
|> ONNX_Engine.of_symbolic
in
ONNX_Engine.save k "test.onnx"
This test.onnx file can be further processed with the Python code introduced in the previous section.
7.6 Summary
In this chapter, we briefly discussed the topic of supporting hardware accelerators in Owl. To improve computation performance, it is necessary to utilize the power of hardware accelerators, such as GPUs, TPUs, etc. It is a growing trend that the definition and execution of computation are separated. To this end, we built a symbolic representation based on Owl to facilitate exporting computations to other frameworks that support multiple hardware accelerators. This representation can be executed by multiple backend engines; currently, it supports ONNX, LaTeX, and Owl itself as engines. This chapter introduced the design of this symbolic representation and used several examples to demonstrate how computations in Owl can be executed on other frameworks or visualized.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 8
Compiler Backends
For a numerical library, it is always beneficial, and challenging, to extend to multiple execution backends. We have seen how we support accelerators such as the GPU by utilizing a symbolic representation and a computation graph standard such as ONNX. In this chapter, we introduce how Owl can be used on more edge-oriented backends, including JavaScript and unikernels.
8.1 Base Library
Before we start, we need to understand how Owl enables compiling to multiple backends by providing different implementations. Owl, as well as many of its external libraries, is actually divided into two parts: a base library and a core library. The base library is implemented in pure OCaml; for some backends such as JavaScript, we can only use the functions implemented in OCaml.
You may wonder how much we will be limited by the base library. Fortunately, the most advanced modules in Owl are often implemented in pure OCaml, and they live in the base library, which includes the modules we have introduced in the previous chapters: algorithmic differentiation, optimization, even neural networks, and many others. Figure 8-1 shows the structure of the core functor stack in Owl.
As we introduced in Chapter 2, the Ndarray module is the core building block in Owl. The base library aims to implement all the functions provided by the core library's Ndarray module. The stack is implemented in such a way that the user can switch between these two implementations without changing the modules in the higher layers. In the Owl functor stack, Ndarray is used to support the computation graph module to provide lazy evaluation functionality. Here, we use the Owl_base_algodiff_primal_ops module, which is simply a wrapper around the base Ndarray module; it also includes a small number of matrix and linear algebra functions. By providing this wrapper instead of
using the Ndarray module directly, we avoid mixing all the functions into the Ndarray module and turning it into a large Goliath.
Next, algorithmic differentiation can build its computation module based on either the normal ndarray or its lazy version. For example, you can have an AD module that relies on the normal single-precision base Ndarray module:
module AD = Owl_algodiff_generic.Make
(Owl_base_algodiff_primal_ops.S)
Going up even further on the stack, we have the more advanced optimization and
neural network modules. They are both based on the AD module. For example, the
following code shows how we can build a neural graph module by layers of functors from
the base Ndarray:
module G = Owl_neural_graph.Make
(Owl_neural_neuron.Make
(Owl_optimise_generic.Make
(Owl_algodiff_generic.Make
(Owl_base_algodiff_primal_ops.S))))
Normally, users do not have to care about how these modules are constructed layer by layer, but understanding the functor stack and typing is nevertheless beneficial, especially when you are creating a new module that relies on the base Ndarray module.
These examples show that once we have built an application with the core Ndarray module, we can then seamlessly switch to the base Ndarray without changing anything else. That means all the code and examples we have seen so far can be used directly on different backends that require a pure OCaml implementation.
The base library is still an ongoing work, and there is a lot left to do. Though the Ndarray module is a large part of the base library, there are other modules that also need to be reimplemented in OCaml, such as the linear algebra module. We need to add more functions such as SVD factorization. Even for Ndarray itself, we have not yet covered all the functions.
Our strategy is to add the base Ndarray functions gradually. We put most of the signature files in the base library, and the core library signature file includes its corresponding signature file from the base library, plus functions that are currently unique to the core library. The target is total coverage, so that the core and base libraries provide exactly the same functions.
As can be expected, the pure OCaml implementation normally performs worse than the C-based version. For example, for the convolution operation, without the help of optimized routines from OpenBLAS, etc., we can only provide a naive implementation consisting of multiple for-loops, whose performance is orders of magnitude slower than the core library version. Currently, our priority is to implement the functions themselves rather than to optimize them, nor do we intend to outperform C code with a pure OCaml implementation.
8.2 Backend: JavaScript
At first glance, JavaScript has very little to do with high-performance scientific computing. One important reason we aim to support it in Owl is that the web browser is arguably the most widely deployed technology on various edge devices, for example, mobile phones, tablets, laptops, etc. More and more functionality is being pushed from data centers to the edge for reduced latency, better privacy, and security, and JavaScript applications running in a browser are getting more complicated and powerful. Moreover, JavaScript interpreters are being increasingly optimized, so even relatively complicated computational tasks can run with reasonable performance.
Here, we use two simple examples to demonstrate how to compile Owl applications into JavaScript code so that you can deploy analytical code into browsers, using both native OCaml code and Facebook Reason. We additionally require dune, a build system designed for OCaml/Reason projects. As you will see, it makes compilation to JavaScript effortless.
Native OCaml
We rely on the tool js_of_ocaml to convert native OCaml code into JavaScript. Js_of_ocaml is a compiler from OCaml bytecode programs to JavaScript. The process can thus be divided into two phases: first, compile the OCaml source code into a bytecode executable, and then apply the js_of_ocaml command to it. It supports the core Bigarray module along with most of the OCaml standard libraries. However, since the Sys module is not fully supported, we are careful not to use functions from this module in the base library.
We have described how algorithmic differentiation plays a core role in the ecosystem of Owl, so we now use an AD example to demonstrate how we convert a numerical program into JavaScript code and execute it. The example optimizes the mathematical function sin. The first step is to write down our application in OCaml as follows and save it into a file demo.ml.
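The application first builds an AD module on top of the base Ndarray and defines a simple gradient descent routine desc. A minimal sketch of that part (the step size eta and stopping threshold eps are illustrative values):
module AD = Owl_algodiff_generic.Make (Owl_base_algodiff_primal_ops.S)
open AD

(* gradient descent: follow the gradient until it becomes small enough *)
let rec desc ?(eta = F 0.01) ?(eps = 1e-6) f x =
  let g = (diff f) x in
  if unpack_flt g < eps then x else desc ~eta ~eps f Maths.(x - (eta * g))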
let _ =
let f = Maths.sin in
let y = desc f (F 0.1) in
Owl_log.info "argmin f(x) = %g" (unpack_flt y)
The code is very simple: desc defines a gradient descent algorithm, and we then use desc to calculate the minimum value of the Maths.sin function. In the end, we print out the result using the Owl_log module's info function. Note that we pass the base Ndarray module to the AD functor to create the corresponding AD module.
In the second step, we need to create a dune file as follows. This file instructs the build system to first compile the OCaml code into bytecode and then convert it into JavaScript by calling js_of_ocaml.
(executable
(name demo)
(modes byte js)
(libraries owl-base))
With these two files in the same folder, we can then run the following command in
the terminal:
dune build
The command builds the application and generates a demo.bc.js file in the _build/default/ folder. Finally, we can run the JavaScript using Node.js (or load it into a browser using an appropriate HTML page).
node _build/default/demo.bc.js
As a result, we should see an output line showing the value that minimizes the sin function.
Even though we present a simple example here, the base library can be used to
produce more complex and interactive browser applications.
Facebook Reason
Facebook Reason leverages OCaml as a backend to provide type-safe JavaScript. It is gaining momentum and becoming a popular choice for developing web applications. It actually uses another tool, BuckleScript, to convert the Reason/OCaml code to JavaScript. Since Reason is basically a syntax layer built on top of OCaml, it is very straightforward to use Owl in Reason to develop advanced numerical applications.
In this example, we use Reason code to manipulate multidimensional arrays, the core data structure in Owl. First, we save the following code into a Reason file called demo.re. Note the suffix is .re now. It includes several basic math and Ndarray operations in Owl.
open! Owl_base;
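Only the opening line of demo.re survived extraction. A minimal sketch of the operations described below, written in Reason, follows; the module path Owl_base_dense_ndarray.D and the availability of uniform, get_slice, and print in owl-base are assumptions on our part rather than the book's verbatim code:

/* create a random 3 x 4 x 5 ndarray */
let x = Owl_base_dense_ndarray.D.uniform([|3, 4, 5|]);

/* take a slice along the first dimension */
let s = Owl_base_dense_ndarray.D.get_slice([[0, 1], [], []], x);

/* print both the original ndarray and the slice */
let () = Owl_base_dense_ndarray.D.print(x);
let () = Owl_base_dense_ndarray.D.print(s);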
The preceding code is simple. It creates a random ndarray, takes a slice, and then prints them out. The Owl library can be seamlessly used in Reason. Next, instead of using Reason's own translation of this frontend syntax with BuckleScript, we still turn to js_of_ocaml for help. Let's look at the dune file, which is essentially the same as in the previous example:
(executable
(name demo)
(modes js)
(libraries owl-base))
As in the previous example, you can then compile and run the code with the following commands:
dune build
node _build/default/demo.bc.js
As you can see, except that the code is written in different languages, the rest of the steps are identical in both examples thanks to js_of_ocaml and dune.
8.3 Backend: MirageOS
Besides JavaScript, another choice of backend we aim to support is the MirageOS. It
is an approach to build unikernels. A unikernel is a specialized, single address space
machine image constructed with library operating systems. Unlike a normal virtual
machine, it only contains a minimal set of libraries required for one application. It can
run directly on a hypervisor or hardware without relying on operating systems such as
Linux and Windows. The unikernel is thus concise and secure, and extremely efficient
for distribution and execution on either cloud or edge devices.
MirageOS is one solution to building unikernels. It utilizes the high-level language OCaml and a runtime to provide an API for operating system functionalities. In using MirageOS, the users can think of the Xen hypervisor as a stable hardware platform, without worrying about hardware details such as devices. Furthermore, since the Xen hypervisor is widely used in platforms such as Amazon EC2 and Rackspace Cloud, MirageOS-built unikernels can be readily deployed on these platforms. Besides, benefiting from its efficiency and security, MirageOS also aims to form a core piece of the Nymote/MISO tool stack to power the Internet of Things.
The first example again applies gradient descent to find the minimum of a simple function, this time packaged as a unikernel. The code starts by instantiating the AD module:

module A = Owl_algodiff_generic.Make (Owl_algodiff_primal_ops.S)
open A

let main () =
  let f x = Maths.(pow x (F 3.) - (F 2.) * pow x (F 2.) + (F 2.)) in
  let init = Stats.uniform_rvs ~a:0. ~b:10. in
  let y = desc f (F init) in
  Owl_log.info "argmin f(x) = %g" (unpack_flt y)
This part of the code is mostly the same as before. By applying the diff function of the algorithmic differentiation module, we use the gradient descent method to find the value that minimizes the function x³ − 2x² + 2. Then we need to add something different:
module GD = struct
  let start = main (); Lwt.return_unit
end
Here, the start is an entry point to the unikernel. It performs the normal OCaml
function main and then returns an Lwt thread that will be evaluated to unit. Lwt is a
concurrent programming library in OCaml. It provides the “promise” data type that can
be determined in the future.
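As a brief aside (not part of the unikernel itself), the following minimal sketch shows how an Lwt promise is created, chained, and forced, assuming the lwt and lwt.unix packages are available:

open Lwt.Infix

let () =
  (* build a promise, transform its result, then run it to completion *)
  let p = Lwt.return 21 >>= fun x -> Lwt.return (x * 2) in
  Printf.printf "answer = %d\n" (Lwt_main.run p)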
All the preceding unikernel code is written to a file called gd_owl.ml. To build a unikernel, next we need to define its configuration. In the same directory, we create a file called config.ml:
open Mirage

let main =
  foreign
    ~packages:[package "owl"]
    "Gd_owl.GD" job
let () =
  register "gd_owl" [main]
It's not complex. First, we need to open the Mirage module. Then we declare a value main (you can give it any other name). It calls the foreign function to specify the configuration. In the packages parameter, we declare that this unikernel requires the Owl library. The next string parameter, "Gd_owl.GD", specifies the name
of the implementation file and in that file the module GD that contains the start entry
point. The third parameter job declares the type of devices required by a unikernel,
such as network interfaces, network stacks, file systems, etc. Since here we only do the
calculation, there is no extra device required, so the third parameter is a job. Finally, we
register the unikernel entry file gd_owl with the main configuration value.
That’s all it takes for coding. Now we can take a look at the compiling part. MirageOS
itself supports multiple backends. The crucial choice therefore is to decide which one
to use at the beginning by using mirage configure. In the directory that holds the
previous two files, you run mirage configure -t unix, and it configures to build the
unikernel into a Unix ELF binary that can be directly executed. Or you can use mirage
configure -t xen, and then the resulting unikernel will use the hypervisor backend like
Xen or KVM. Either way, the unikernel runs as a virtual machine after starting up. In this example, we choose Unix as the backend. So we run
make depend
make
and it calls the mirage build command. As a result, the current directory now contains the _build/gd_owl.native executable, which is the unikernel we want. Executing it yields a similar result as before.
The second example deploys DNN-based handwritten digit recognition as a unikernel. This neural network has two hidden layers, a small weight size (146KB), and works well in testing (92% accuracy). We can write the weights into a text file. The application is saved in a file named simple_mnist.ml, and similar to the previous example, we add a unikernel entry point function in the file:
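The listing itself did not survive extraction. Purely as a hedged sketch, the following shows the general shape of such a file; the module paths, the layer sizes, the weight file name, and the placeholder input are all our assumptions rather than the book's exact code, and the actual unikernel build would use the owl-base counterparts of these modules:

open Owl
open Neural.S
open Neural.S.Graph

(* a small network with two hidden layers for 28 x 28 grayscale digits;
   layer sizes here are illustrative only *)
let make_network () =
  input [|28;28;1|]
  |> fully_connected 64 ~act_typ:Activation.Relu
  |> fully_connected 25 ~act_typ:Activation.Relu
  |> linear 10 ~act_typ:Activation.(Softmax 1)
  |> get_network

let infer () =
  let nn = make_network () in
  (* load the weights previously written out to a text file *)
  Graph.load_weights nn "simple_mnist.weight";
  let img = Dense.Ndarray.S.zeros [|1;28;28;1|] in
  Graph.model nn img

module Main = struct
  let start = ignore (infer ()); Lwt.return_unit
end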
Here, the infer function creates a neural network, loads the weights, and then performs inference on an input image. We also need a configuration file. Again, it's mostly the same:
open Mirage

let main =
  foreign
    ~packages:[package "owl-base"]
    "Simple_mnist.Main" job

let () =
  register "Simple_mnist" [main]
Figure 8-2. Performance of map and fold operations on ndarray on a laptop and Raspberry Pi
Once compiled to a MirageOS unikernel with the Unix backend, the generated binary is 10MB. You can also try compiling this application to JavaScript.
With these examples, we show that the Owl library can be readily deployed into unikernels via MirageOS. The numerical functionality can then greatly enhance the expressiveness of OCaml-MirageOS applications. Of course, we cannot cover all the important topics about MirageOS here; please refer to the documentation of MirageOS and the Xen hypervisor for more information.
8.4 Evaluation
In this section, we mainly compare the performance of different backends. Specifically,
we observe three representative groups of operations: (1) map and fold operations on
ndarray; (2) using gradient descent, a common numerical computing subroutine, to
get argmin of a certain function; (3) conducting inference on complex DNNs, including
SqueezeNet and a VGG-like convolution network. The evaluations are conducted on
a ThinkPad T460S laptop with an Ubuntu 16.04 operating system. It has an Intel Core
i5-6200U CPU and 12GB RAM.
The OCaml compiler can produce two kinds of executables: bytecode and native.
Native executables are compiled for specific architectures and are generally faster, while
bytecode executables have the advantage of being portable.
For JavaScript, we use the js_of_ocaml approach as described in the previous sections. Note that for convenience we refer to the pure OCaml implementation as base-lib and the mixed OCaml/C implementation as owl-lib, but they are in fact both included in the Owl library. For the Mirage compilation, we use both libraries.
Figure 8-2 shows the performance of map and fold operations on ndarray. We use simple functions such as plus and multiplication on 1-d (size < 1,000) and 2-d arrays. The log-log relationship between the total size of the ndarray and the time each operation takes remains linear. For both operations, owl-lib is faster than base-lib, and native executables outperform bytecode ones. The performance of Mirage executables is close to that of native code. Generally, JavaScript runs the slowest, but note how the performance gap between JavaScript and the others narrows as the ndarray size grows. For the fold operation, JavaScript even runs faster than bytecode when the size is sufficiently large.
Note that for the fold operation, there is an obvious increase in time used at around an input size of 10³, while there is no such change for the map operation. That is because we change the input from a one-dimensional ndarray to a two-dimensional one starting at that size. This change does not affect the map operation, since it treats an input of any dimension as a one-dimensional vector. On the other hand, the fold operation takes the dimensions into account, and thus its performance is affected by this change.
In Figure 8-3, we investigate whether the preceding observations still hold in more complex numerical computation. We use a gradient descent algorithm to find the value that locally minimizes a function, choosing the initial value randomly within [0, 10]. For both sin(x) and x³ − 2x² + 2, we can see that JavaScript runs the slowest, but this time base-lib slightly outperforms owl-lib.
We further compare the performance of DNN inference, which requires a large amount of computation. We compare SqueezeNet and a VGG-like convolution network. They differ in weight size and network structure complexity.
Table 8-1 shows that the performance difference between owl-lib and base-lib is now obvious: the former is much better. So is the difference between native and bytecode for base-lib. JavaScript is still the slowest. The core computation required for DNN inference is the convolution operation, and its implementation efficiency is the key to these differences. Currently, we are working on improving its implementation in base-lib.
8.5 Summary
The base library in Owl was separated from the core module mainly to accommodate multiple possible execution backends. This chapter introduced how the base module works. We then showed two possible backends: JavaScript and the unikernel virtual machine. Both backends help extend the application of Owl to more devices. Finally, we used several examples to demonstrate how these backends are used and evaluated their performance.
CHAPTER 9
Composition and Deployment
In this chapter, we first present Zoo, a script subsystem that we originally developed for sharing OCaml scripts. We will introduce how it is used and how it is designed. Based on this system, we then discuss the problem of computation composition and deployment in a numerical library.
Example
To illustrate how to use Zoo, let’s start with a simple synthetic scenario. Alice is a data
analyst and uses Owl in her daily job. One day, she realized that the functions she
needed had not been implemented yet in Owl. Therefore, she spent an hour at her
computer and implemented these functions by herself. She thought these functions
might be useful to others, for example, her colleague Bob; she decided to share these
functions using the Zoo system. Now let’s see how Alice manages to do so in the
following, step by step.
First, Alice needs to create a folder (e.g., myscript folder) for her shared script. What
to put in the folder then? She needs at least two files in this folder. The first one is of
course the file (i.e., coolmodule.ml) implementing the function as follows. The function
sqr_magic returns the square of a magic matrix; it is quite useless in reality but serves as
an example here.
#!/usr/bin/env owl
open Owl
let sqr_magic n = Mat.(magic n |> sqr)
The second file she needs is a #readme.md which provides a brief description of the
shared script. Note that the first line of the #readme.md will be used as a short description
for the shared scripts. This short description will be displayed when you use the owl
-list command to list all the available Zoo code snippets on your computer.
Second, Alice needs to distribute the files in the myscript folder. The distribution is
done via Gist, so you must have gist installed on your computer. For example, if you use
Mac, you can install gist with brew install gist. Owl provides a simple command-
line tool to upload the Zoo code snippets. Note that you need to log in to your GitHub
account for gist and git.
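The command invocation itself was lost in extraction; assuming the folder name created earlier, the upload is a single command:

owl -upload myscript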
The owl -upload command simply uploads all the files in myscript as a
bundle to your Gist page. The command also prints out the URL after a successful
upload. The bundle Alice uploaded before is assigned a unique id, that is,
9f0892ab2b96f81baacd7322d73a4b08. In order to use the sqr_magic function, Bob only
needs to use the #zoo directive in his script, for example, bob.ml, in order to import the
function.
#!/usr/bin/env owl
#zoo "9f0892ab2b96f81baacd7322d73a4b08"
(* assuming Zoo exposes the gist's coolmodule.ml as module Coolmodule *)
let _ = Coolmodule.sqr_magic 4 |> Mat.print
Bob's script is very simple, but one practical point is worth noting:
• You may want to use chmod +x bob.ml to make the script executable. This is obvious if you are a heavy terminal user.
Note that to use the #zoo directive in a REPL such as utop, you need to manually load the owl-zoo library with #require "owl-zoo";;. Alternatively, you can load owl-top using #require "owl-top";;, which is an OCaml top-level wrapper of Owl. If you want utop to load the library automatically, add the corresponding #require line to ~/.ocamlinit.
Version Control
Alice has modified and uploaded her scripts several times. Each version of her code is
assigned a unique version id. Different versions of code may work differently, so how
could Bob specify which version to use? The good news is that he barely needs to change
his code.
#!/usr/bin/env owl
#zoo "9f0892ab2b96f81baacd7322d73a4b08?
vid=71261b317cd730a4dbfb0ffeded02b10fcaa5948"
The only thing he needs to add is a version id using the parameter vid. The naming scheme of Zoo is designed to be similar to the field-value pairs in a RESTful query. The version id can be obtained from a gist's revisions page.
Besides specifying a version, it is also quite possible that Bob prefers to use the newest version Alice provides, whatever its id may be. The problem here is: how often does Bob need to contact the Gist server to retrieve the version information? Every
time he runs his code? Well, that may not be a good idea in many cases considering the
communication overhead and response time. Zoo caches gists locally and tends to use
the cached code and data rather than downloading them all the time.
To solve this problem, Zoo provides another parameter in the naming scheme: tol. It specifies how long a gist may live in the local cache before it is considered stale. Any gist that has existed in a user's local cache for longer than tol seconds is deemed outdated and thus requires fetching the latest vid information from the Gist server before being used. For example:
#!/usr/bin/env owl
#zoo "9f0892ab2b96f81baacd7322d73a4b08?tol=300"
By setting the tol parameter to 300, Bob indicates that if Zoo has already fetched
the version information of this gist from the remote server within the past 300 seconds,
then keep using its local cache; otherwise, contact the Gist server to check if a newer
version is pushed. If so, the newest version is downloaded to local cache before being
used. In the case where Bob doesn't want to miss any update of Alice's gist code, he can simply set tol to 0, which means fetching the version information every time he executes his code. The vid and tol parameters enable users to have fine-grained version control of Zoo gists. Of course, these two parameters should not be used together; when vid is set in a name, the tol parameter will be ignored. If neither is set, Zoo will use the latest locally cached version if one exists.
A user can either choose a specific version id or use the latest version, which
means the newest version on local cache. Obviously, using latest introduces cache
inconsistency. The latest version on one machine might not be the same on the other. To
get the up-to-date version from a Gist server, the download time of the latest version on a
local machine will be saved as metadata. The newest version on the server will be pulled
to the local cache after a certain period of time, if the latest flag is set in the Gist name.
Ideally, every published service should contain a specific version id, and latest should
only be used during development.
application API plays a key role.[1] Another field that advocates the composition approach is serverless computing, where stateless functions can be composed into more complex ones. Based on the observation that existing serverless systems spend a large portion of time on booting function containers and on interaction between functions, the SAND system investigates the combination of different functions. By proposing application-level sandboxing and a hierarchical message bus, this system reduces latency and improves resource utilization.
In this chapter, as a contribution, the Zoo system provides a small domain-specific
language (DSL) to enable the composition of advanced data analytics services.
Benefiting from OCaml’s powerful type system, the Zoo provides type checking for the
composition. Besides, the Zoo DSL supports fine-grained version control in composing
different services provided by different developers, since the code of these services may
be in constant change.
Another challenge in conducting ML-based data analytics on edge devices is the
deployment of data analytics services. Most existing machine learning frameworks, such
as TensorFlow and Caffe, focus mainly on the training of analytics models. On the other
hand, end users, many of whom are not ML professionals, mainly use trained models to
perform inference. This gap between the current ML systems and users’ requirements is
growing.
The deployment of services is close to the idea of model serving. The Clipper [13] serving system is used for ML model–based prediction, and it features choosing the model with the lowest latency from models on multiple ML frameworks. It enables users to access models based on multiple machine learning frameworks, which are implemented in the form of containers. Compared with Clipper, TensorFlow Serving focuses on using TensorFlow itself as the model execution framework. The models are in the form of SavedModel, and they can be deployed as a container that contains TensorFlow to serve prediction requests. Another field that employs the idea of service deployment is serverless computing. In serverless platforms such as Amazon Lambda and OpenLambda, utilizing the powerful ecosystem of existing cloud providers, the stateless functions provided by users can be deployed on different types of devices to get access to resources such as databases and cloud files. For this aspect, as a contribution, the Zoo DSL also supports deploying composed services to multiple backends: not only containers but also unikernels and JavaScript. We have discussed them in Chapter 8.
[1] Engineering Trade-Offs and The Netflix API Re-Architecture. The Netflix Tech Blog. https://bit.ly/3evFz9g
9.3 System Design
Based on these basic functionalities, we extend the Zoo system to address the
composition and deployment challenges. Specifically, we design a small DSL to enable
script sharing, type-checked composition of different data analytics services with version
control, and deployment of services to multiple backends. First, we would like to briefly
introduce the workflow of Zoo as shown in Figure 9-1. The workflow consists of two
parts: development on the left side and deployment on the right.
of the constructed service. Deployment is not limited to edge devices, but can also be on
cloud servers, or a hybrid of both cases, to minimize the data revealed to the cloud and
the associated communication costs. Thus, by this design, a data analytics service can
easily be distributed to multiple devices. In the rest of this section, we will elaborate on
the design and give details of different parts of this workflow.
Service
Gist is a core abstraction in Zoo. It is the center of code sharing. However, to compose
multiple analytics snippets, Gist alone is insufficient. For example, it cannot express the
structure of how different pieces of code are composed together. Therefore, we introduce
another abstraction: service.
A service consists of three parts: Gists, types, and the dependency graph. The Gists field is the list of Gist ids this service requires. Types are the parameter types of this service; any service has zero or more input parameters and one output, a design that follows that of an OCaml function. The dependency graph is a graph structure that contains information about how the service is composed: each node in it represents a function from a Gist and contains the Gist's name, id, and the number of parameters of this function.
Zoo provides three core operations about a service: create, compose, and publish.
The create_service creates a dictionary of services given a Gist id. This operation reads
the service configuration file from that Gist and creates a service for each function
specified in the configuration file. The compose_service provides a series of operations to
combine multiple services into a new service. A compose operation does type checking
by comparing the “types” field of two services. An error will be raised if incompatible
services are composed. A composed service can be saved to a new Gist or be used for
further composition. The publish_service turns a service's code into forms that can be readily used by end users. Zoo is designed to support multiple backends for these
publication forms. Currently, it targets the Docker container, JavaScript, and MirageOS
[37] as backends.
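The text does not give concrete signatures for these operations. Purely as an illustration, written as an OCaml interface (.mli-style) sketch whose names and types are our assumptions rather than Zoo's actual code, the abstraction could be summarized as:

(* illustrative interface sketch of the service abstraction *)
type dependency_graph   (* how component functions are wired together *)
type backend            (* e.g., Docker container, JavaScript, MirageOS *)

type service = {
  gists : string list;        (* gist ids this service depends on *)
  types : string list;        (* input parameter types plus one output type *)
  graph : dependency_graph;   (* composition structure *)
}

val create_service  : string -> (string, service) Hashtbl.t
val compose_service : service list -> service -> service
val publish_service : service -> backend -> unit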
Type Checking
As mentioned in Section 9.3, one of the most important tasks of service composition is to make sure the types match. For example, suppose there is an image analytics service that takes a PNG format image; if we connect to it another one that produces a JPEG image, the resulting service will only generate meaningless output because of the data type
mismatch. OCaml provides primitive types such as integer, float, string, and Boolean. The core data structure of Owl is ndarray. However, all these types are insufficient for high-level service type checking as mentioned. That motivates us to derive richer high-level types.
To support this, we use generalized algebraic data types (GADTs) in OCaml. There already exist several model collections on different platforms, for example, Caffe and MXNet. We observe that most currently popular deep learning models can generally be categorized into three fundamental types: image, text, and voice. Based on them, we define subtypes for each: PNG and JPEG images, French and English text, and French and English voice, that is, the png_img, jpeg_img, fr_text, en_text, fr_voice, and en_voice types. More can easily be added in Zoo. Type checking in OCaml therefore ensures type-safe and meaningful composition of high-level deep learning services.
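The following sketch (not Zoo's actual definitions; the type and constructor names are illustrative) shows how such high-level media types can be encoded with a GADT so that composing mismatched services is rejected at compile time:

(* phantom index types for the media categories *)
type png = Png
type jpeg = Jpeg
type fr = Fr
type en = En

(* a GADT whose index records the high-level media type of the payload *)
type _ data =
  | Png_img  : bytes  -> png data
  | Jpeg_img : bytes  -> jpeg data
  | Fr_text  : string -> fr data
  | En_text  : string -> en data

(* a service maps one typed payload to another; composition only type-checks
   when the output type of [f] matches the input type of [g] *)
type ('a, 'b) service = 'a data -> 'b data

let compose (f : ('a, 'b) service) (g : ('b, 'c) service) : ('a, 'c) service =
  fun x -> g (f x)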
DSL
Zoo provides a minimal DSL for service composition and deployment.
Composition: To acquire services from a Gist of id gid, we use $gid to create a dictionary, which maps from service name strings to services. We implement the dictionary data structure using Hashtbl in OCaml. The # operator is overloaded to represent the "get item" operation. Therefore
$gid#sname
can be used to get a service that is named "sname." Now suppose we have $n$ services $f_1, f_2, \ldots, f_n$ whose outputs have types $t_{f_1}, t_{f_2}, \ldots, t_{f_n}$. Each service $f_s$ accepts $m_s$ input parameters of types $t_s^1, t_s^2, \ldots, t_s^{m_s}$. Also, there is a service $g$ that takes $n$ inputs of types $t_g^1, t_g^2, \ldots, t_g^n$ and has output type $t_o$. Here, Zoo provides the $> operator to compose the $n$ services with $g$:

[f1; f2; ...; fn] $> g

This operation returns a new service that has $\sum_{s=1}^{n} m_s$ inputs and is of output type $t_o$. It does type checking to make sure that $t_{f_i} = t_g^i$ for all $i \in \{1, 2, \ldots, n\}$.
Deployment: Taking a service s, be it a basic or composed one, it can be deployed
using the following syntax:
s $@ backend
Service Discovery
The services require a service discovery mechanism. For simplicity's sake, each newly published service is added to a public record hosted on a server. The record is a list of items, and each item contains the following: the Gist id that the service is based on; a one-line description of this service; a string representing the input and output types of this service, such as "image → int → string → text"; and a service URI. For the container deployment, the URI is a Docker Hub link, and for the JavaScript backend, the URI is a URL link to the JavaScript file itself. The service discovery mechanism is implemented using an off-the-shelf database.
9.4 Use Case
To illustrate the preceding workflow, let us consider a synthetic scenario. Alice is a
French data analyst. She knows how to use ML and DL models on existing platforms,
but is not an expert. Her recent work is about testing the performance of different image
classification neural networks. To do that, she needs to first modify the images using the DNN-based Neural Style Transfer (NST) algorithm. NST takes two images and outputs a new image that is similar to the first image in content and to the second in style. This new image should then be passed to an image classification DNN for inference. Finally, the classification result should be translated into French. She does not want to put academic-related information on Google's servers, but she cannot find any single pretrained model that performs this series of tasks.
Here comes the Zoo system to help. Alice finds Gists that can do image recognition,
NST, and translation separately. Even better, she can perform image segmentation to
greatly improve the performance of NST using another Gist. All she has to provide is
some simple code to generate the style images she needs to use. She can then assemble
these parts together easily using Zoo.
open Zoo
(* Image classification *)
let s_img = $ "aa36e" # "infer";;
(* Image segmentation *)
let s_seg = $ "d79e9" # "seg";;
(* Neural style transfer *)
let s_nst = $ "6f28d" # "run";;
(* Translation from English to French *)
let s_trans = $ "7f32a" # "trans";;
(* Alice's own style image generation service *)
let s_style = $ alice_Gist_id # "image_gen";;
(* Compose services *)
let s = [s_seg; s_style] $> s_nst $> s_img $> s_trans;;
(* Publish to a new Docker image *)
let pub = (List.hd s) $@ (CONTAINER "alice/image_service:latest");;
Note that the Gist ids used in the code are shortened from 32 digits to 5 for brevity. Once Alice creates the new service and publishes it as a container, she can
then run it locally, send a request with image data to the deployed machine, and get
image classification results back in French.
9.5 Discussion
One thing to note is that, in service composition, type checking is a nice property
to have, but not the only one. From web services to microservices, the industry and
researchers have studied the composition issue for years. Besides checking the static
information such as message types, interfaces, etc., sometimes the dynamic behavior
between services should also be checked. It is the same in our data analytics services
composition scenario.
For example, the Generative Adversarial Network (GAN) is a huge family of
networks. A GAN consists of two parts: generator and discriminator. The generator tries
its best to synthesize images based on existing parameters. The discriminator takes the
images produced by the generator and tries its best to separate the generated data from
true data, using a Boolean or percentage value. This mutual deception process is iterated
until the discriminator can no longer tell the difference between the generated data
and the true data. Using Zoo, the users may want to compose a generator with different
discriminators to see which combination produces the most trustworthy fake images.
To do this, only matching the types of these two services is not enough. The users also need to specify dynamic information such as the order and number of messages exchanged between them.
To solve this problem, some kind of formalism may need to be introduced as a theoretical foundation to structure interaction and reason about communicating processes between services. One such option is session types [31]. Session types are a type discipline for communication-centric programming. They are based on the π-calculus, and their basic idea is that the communication protocol can be described as a type, which can be checked at runtime or statically. Session types have gained much attention recently and are already implemented in multiple languages, including OCaml. This approach can effectively enhance the type checking in Zoo and is a promising future direction for this work.
9.6 Summary
In this chapter, we first introduced Zoo, a script-sharing tool in Owl, including its usage and design. Based on it, we explored two topics: service composition and deployment. Zoo provides a small DSL to enable type-checked composition of different data analytics services with version control and deployment of services to multiple backends. It benefits from OCaml's powerful type system. A use case was presented to demonstrate the expressiveness of this DSL in composing advanced ML services such as image recognition, text translation, etc. The Zoo DSL also enables deploying composed services to multiple backends: containers, unikernels, and JavaScript; service deployment often requires choosing a suitable one.
CHAPTER 10
Distributed Computing
Distributed computing has been playing a significant role in current smart applications in various fields. In this chapter, we first give a bird's-eye view of this topic, introducing various programming paradigms. Next, we introduce Actor, an OCaml-based distributed computing engine, and how it works together with Owl. We then focus on one key element in distributed computing: synchronization. We introduce four different types of synchronization methods, or "barriers," that are commonly used in current systems. Next, we elaborate on how these barriers are designed and provide illustrations from a theoretical perspective. Finally, we use evaluations to show the performance trade-offs in using different barriers.
Due to these reasons, Federated Learning has been gaining increasing popularity in various research and application fields. Federated Learning emphasizes that the training data are not always IID; that is, a device's local data cannot simply be regarded as samples drawn from the overall distribution. The data distribution has an enormous impact on model training. Some research works provide theoretical analyses of distributed training with non-IID data. Other works are proposed to address the imbalanced-data problem; besides data enhancement, their strategies include combining sequential updates with BSP, depending on how biased the data is.
Map-Reduce Engine
Following the MapReduce programming model, nodes can be divided by tasks:
either map or reduce. A map function processes a key/value pair to generate a set of
intermediate key/value pairs, and a reduce function aggregates all the intermediate key/
value pairs with the same key. Execution of this model can automatically be paralleled.
Mappers compute in parallel while reducers receive the output from all mappers and
combine to produce the accumulated result. This parameter update is then broadcast
to all nodes. Details such as distributed scheduling, data division, and communication
in the cluster are mostly transparent to the programmers so that they can focus on the
logic of mappers and reducers in solving a problem within a large distributed system.
This simple functional style can be applied to a surprisingly wide range of applications.
For example, the following code shows an example of using the map-reduce engine to
implement the classic wordcount task:
(* stop_words (a list of words to drop) and print_result are assumed to be
   defined elsewhere in the example *)
let wordcount () =
  Ctx.init Sys.argv.(1) "tcp://localhost:5555";
  Ctx.load "unix://data/wordcount.data"
  |> Ctx.flatmap Str.(split (regexp "[ \t\n]"))
  |> Ctx.map String.lowercase_ascii
  |> Ctx.filter (fun x -> (String.length x) > 0)
  |> Ctx.filter (fun x -> not (List.mem x stop_words))
  |> Ctx.map (fun k -> (k, 1))
  |> Ctx.reduce_by_key (+)
  |> Ctx.collect
  |> List.flatten |> print_result;
  Ctx.terminate ()

let _ = wordcount ()
• push: Send the updates to the model plane. The updates can be sent
to either a central server or to individual nodes depending on which
engine is used (e.g., map-reduce, parameter server, or peer-to-peer).
The following code shows the interfaces of the parameter server engine:
open Actor_types
type barrier =
| ASP (* Asynchronous Parallel *)
| BSP (* Bulk Synchronous Parallel *)
| SSP (* Stale Synchronous Parallel *)
| PSP (* Probabilistic Synchronous Parallel *)
module PS = Actor_param

(* schedule and push are user-defined callback functions, assumed to be
   defined earlier in the example *)
let test_context () =
  PS.register_schedule schedule;
  PS.register_push push;
  PS.start Sys.argv.(1) Actor_config.manager_addr;
  Owl_log.info "do some work at master node"

let _ = test_context ()
let test_owl_distributed () =
Actor.Mapre.init Sys.argv.(1) "tcp://localhost:5555";
Similarly, this composition also applies to more advanced and complicated data
structures such as neural networks. Remember that Ndarray is the core data structure in Owl, on which the neural network module relies. Therefore, we can create a distributed version of the neural network module using the same functor:
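The functor application itself did not survive extraction. Assuming the parallel functor is named Owl_neural_parallel.Make (the module name is an assumption on our part), it would look roughly like:

module M2 = Owl_neural_parallel.Make
    (Owl.Neural.S.Graph)
    (Actor_param)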
Here, we use the single-precision neural network graph module and the parameter
server distributed engine to parameterize the new module. It enables parallel training
on a computer cluster. The following code shows an example. Most of the code stays
unchanged. All it requires is to use the M2.train function instead of the original one to
train a network.
let test_neural_parallel () =
let open Owl.Neural.S in
let open Graph in
let nn =
input [|32;32;3|]
|> normalisation ~decay:0.9
|> conv2d [|3;3;3;32|] [|1;1|] ~act_typ:Activation.Relu
|> conv2d [|3;3;32;32|] [|1;1|] ~act_typ:Activation.Relu ~padding:VALID
|> max_pool2d [|2;2|] [|2;2|] ~padding:VALID
|> dropout 0.1
|> conv2d [|3;3;32;64|] [|1;1|] ~act_typ:Activation.Relu
|> conv2d [|3;3;64;64|] [|1;1|] ~act_typ:Activation.Relu ~padding:VALID
|> max_pool2d [|2;2|] [|2;2|] ~padding:VALID
|> dropout 0.1
|> fully_connected 512 ~act_typ:Activation.Relu
|> linear 10 ~act_typ:Activation.(Softmax 1)
|> get_network
in
let x, _, y = Owl.Dataset.load_cifar_train_data 1 in
let chkpt state =
if Checkpoint.(state.current_batch mod 1 = 0) then (
Checkpoint.(state.stop <- true);
)
in
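  (* The remainder of this example did not survive extraction. A hedged
     completion, assuming M2.train accepts the same training parameters as
     Owl's Graph.train, might read as follows. *)
  let params = Params.config
      ~batch:(Batch.Mini 100)
      ~learning_rate:(Learning_Rate.Adagrad 0.005)
      ~checkpoint:(Checkpoint.Custom chkpt)
      0.1
  in
  M2.train ~params nn x y |> ignore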
The Bulk Synchronous Parallel (BSP) barrier is the most strict: it requires all workers to proceed in lockstep, moving to the next iteration only when all the workers are ready. BSP is a deterministic scheme where workers perform a computation phase followed by a synchronization/communication phase to exchange updates, under the control of a central server [54]. BSP programs are often serializable, that is, they are equivalent to sequential computations if the data and model of a distributed algorithm have been suitably scheduled, making BSP the strongest barrier control method [30]. Numerous variations of BSP exist, for example, allowing a worker to execute more than one iteration in a cycle [14]. Federated Learning also uses BSP for its distributed computation [6]. Moreover, BSP requires centralized coordination.
The Asynchronous Parallel (ASP) barrier [41] is the least strict, since it allows each worker to proceed at its own pace without waiting for the others. ASP takes the opposite approach to BSP, allowing computations to execute as fast as possible by running all workers completely asynchronously [41]. ASP can result in fast convergence because it permits the highest possible rate of iteration [54]. However, the lack of any coordination means that updates are calculated based on old model state, resulting in reduced accuracy, and there are no theoretical guarantees as to whether algorithms converge. The Hogwild scheme proposed in [41] has many limits; for example, it requires a convex function and sparse updates. Many works have tried to extend these limits in applications and theoretical analysis [35]. These studies often lead to carefully tuned step sizes in training. [59] proposes a delay-compensated SGD that mitigates delayed updates in ASP by compensating the gradients received at the parameter server. [32] introduces another variant of ASP specifically for wide area networks: as communication is a dominant factor, it advocates allowing insignificant updates to be delayed indefinitely in a WAN.
The third one is the Stale Synchronous Parallel (SSP) [30], which relaxes BSP by allowing workers to proceed to the next iteration as long as all workers' iterations are within a certain limit of each other. SSP is a bounded asynchronous model that balances between BSP and ASP. Rather than requiring all workers to proceed to the next iteration together, it requires only that the iterations of any two workers in the system differ by at most s, a predefined staleness bound. The staleness parameter limits error and allows SSP to provide deterministic convergence guarantees [30, 15, 54]. Built on SSP, [58] investigates n-softsync, a synchronization method that makes the parameter server update its weights after collecting a certain number of updates from any workers. [9] proposes removing a small number of "longtail" workers or adding a small number of backup nodes to mitigate this effect while avoiding asynchronous noise.
The final one is called Probabilistic Synchronous Parallel (PSP). Its basic idea
is to introduce a sampling primitive in the system and to use a sampled subset of
participating workers to estimate progress of the entire system. PSP introduces a second
dimension to this trade-off: from how many nodes must we receive updates before
proceeding to the next iteration. By composing our sampling primitive with traditional
barrier controls, we obtain a family of barrier controls better suited to supporting iterative learning in heterogeneous networks.
The core idea behind PSP is simple yet powerful: we require that only some
proportion, not all, of the working nodes be synchronized in progress. By “progress”
we mean the number of updates pushed to the server at the client’s side and the total
number of updates collected at the server’s side. In a centralized training framework, the
server builds this subset of the training nodes based on system information, such as their
current local progress. This subset can be sampled by various approaches. One common
and intuitive way is to sample randomly among all the workers.
The parameter in PSP, the sampling size, therefore controls how precise this
estimation is. Assuming this random subset is not biased with respect to nodes’
performance, the server can use the resulting distribution to derive an estimate of the
percentage of nodes which have passed a given progress. This estimation depends on the
specific method used within the subset, as will be discussed in Section 10.4. Accordingly,
the server can decide whether to allow the trainers to pass the barrier and advance their
local progress.
Figure 10-2 illustrates the difference among these four types of barriers. Here, the
computing progress is measured by super steps, or iterations. Communication may
happen at the barrier to ensure consistency through the global state. A central server
may also be required in order to maintain the global state, denoted by the clock symbol.
Table 10-1 summarizes the barrier synchronization methods used by different
machine learning systems. You can see that, regardless if it is a classic system or a new
one, the barrier synchronization has been an important component in the system.
BSP and ASP are good examples to illustrate these two factors. In BSP, workers
must wait for others to finish in a training round, and all the workers are of the same
progress. Therefore, of all barrier methods the BSP can offer the best consistency and
highest accuracy in each update. BSP is a deterministic algorithm. As a price, if there are
stragglers in the training nodes, the system progress will be bottlenecked by the slowest
node. On the other hand, ASP allows nodes to execute as fast as possible, with no need to
consider the progress of other nodes. As a result, ASP leads to the highest possible rate of
iteration. However, the lack of any coordination means that updates are calculated based
on out-of-date model state, resulting in reduced consistency.
The design of SSP clearly shows a good trade-off between these two extremes. As
shown in Figure 10-3, SSP attempts to exploit this trade-off by bounding the difference
in iterations between participating nodes. On one hand, it does not require all the
nodes to have exactly the same progress as in BSP and thus improves its iteration rate.
On the other hand, its stale bound provides a more strict consistency bound on nodes
than ASP. As a result, it achieves a balance between these two ends, hence leading to a
higher rate of convergence. The parameter staleness covers the spectrum on this one-
dimensional tuning space.
But is that all about the design space of a barrier control method? Let’s look deep
into the current model again. We start by visualizing the iterative update process, as
shown in Figure 10-4.
The model is simple. A sequence of updates is applied to an initial global state x0.
Here, u(p,t) denotes update(node id, timestamp), that is, the updates are generated
for all the nodes on all its clock ticks. In this example, there are three nodes. Ideally, in
clock tick ti we expect to have three updates: u(0, ti), u(1, ti), and u(2, ti). However, due to
the noisy environment, these updates are divided into two sets. The deterministic ones
are those we expect if everything goes well as stated earlier. The probabilistic ones are
those out-of-order updates due to packet loss, network delay, node failure, etc. Although
it is simple, this model can represent most iterative learning algorithms.
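Restating the model in equation form (this is only a compact rewriting of the description above, not a new result): after $T$ clock ticks, the global state is

\[
x_T \;=\; x_0 \;+\; \sum_{t=1}^{T} \sum_{p=1}^{P} u(p, t),
\]

where $P$ is the number of workers. A barrier control method determines which of the $u(p, t)$ have actually been incorporated at any given moment, splitting them into the deterministic and probabilistic sets described above.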
Then we use the analytical model to express each barrier method as in Figure 10-5. The left part deals with consistency: the += operator is the server logic for how updates submitted to the central server are incorporated into the global state. The right part deals with synchronization: computers either communicate with each other or contact the central server to coordinate their progress. As discussed earlier, the right side can be divided into two types of updates: deterministic and probabilistic.
The formulation reveals some very interesting structure from a system design perspective. For BSP and SSP, the central server couples the control logic of both consistency and synchronization. That is to say, if you choose tight consistency, you must also choose global synchronization control by a logical central server. For both BSP and SSP, one logical server is assigned to update model parameters and coordinate the
In Eq. 10.2, the consistency is thus further decomposed into the consistency degree within a sample and the completeness of this sample.
Thus, PSP opens a tuning space that incorporates all the other barriers. As shown in Figures 10-6 and 10-7, in the refined design space, ASP is placed at the bottom left, since it shows the weakest consistency (no control on the progress of other nodes) and completeness (each node only considers itself). On the other hand, BSP and SSP show full completeness, since they require a central server to synchronize the progress of all nodes. Similar to Figure 10-3, they show different levels of consistency.
Compatibility
As a more general framework, one noteworthy advantage of PSP lies in that it is
straightforwardly compatible with existing synchronization methods, which provides
the tuning dimension of consistency. In classic BSP and SSP, their barrier control
mechanisms are invoked by a central server to test the synchronization condition with
the given inputs. For BSP and SSP to use the sampling primitive, they simply need to
use the sampled states rather than the global states when evaluating the barrier control
condition. Within each sampled subset, these traditional mechanisms can then be
applied. Users can thus easily derive probabilistic versions of BSP and SSP, namely, pBSP
and pSSP. For example, Figure 10-8 shows that PSP can be applied to other synchronous
machines as a higher-order function to derive probabilistic versions.
Formally, at the barrier control point, a worker samples β out of P workers without
replacement. If one lags more than s updates behind the current worker, then the worker
waits. This process is pBSP (based on BSP) if the staleness parameter s = 0 and pSSP
(based on SSP) if s > 0. If s = ∞, PSP reduces to ASP.
Figure 10-7. The new dimension allows us to explore a larger design space, which further makes it possible to find a better trade-off to achieve a better convergence rate
As an illustration, Figure 10-9 depicts how to compose BSP with PSP, namely, a
subset of the population of nodes is chosen, and then the BSP is applied within the
subset (pBSP). The composition of PSP and SSP (pSSP) follows the same idea.
Besides existing barrier control methods, PSP is also compatible with both centralized and decentralized training approaches. As described earlier, the extra completeness dimension decouples consistency and synchronization. The other, fully complete synchronization control methods require a centralized node to hold the global state; by using the sampling primitive, they can be transformed into fully distributed solutions. In a decentralized setting, based on the information it gathers from its
neighboring nodes, a trainer node may either decide to pass the barrier control by
advancing its local progress or wait until the threshold is met.
The benefits of exploring this two-dimensional design space are thus manifold. First, it enables constructing fully distributed barrier control mechanisms that are more scalable: as illustrated in Figure 10-2d, each node depends only on several other nodes to decide its own barrier, not on all other nodes. Second, it allows exploring barriers that can achieve better convergence. PSP can ignore the status of some workers with impunity because, in practice, many iterative learning algorithms tolerate a certain degree of error as they converge to their final answers [12]. By controlling the sampling method and size, PSP reduces the impact of lagging nodes while also limiting the error introduced by nodes returning updates based on stale information. Third, in an unreliable environment, using the sampling primitive can minimize the impact of outliers and stragglers by probabilistically choosing a subset of the total workers as an estimation. In summary, by tuning the sampling size and staleness parameters carefully, the generated barrier control methods can be robust against the effect of stragglers while also ensuring a degree of consistency between iterations as the algorithm progresses. In Section 10.6, we will investigate the performance in more detail.
10.5 Convergence Analysis
In this section, we present a theoretical analysis of PSP and show how it affects the convergence of ML algorithms (SGD is used in the analysis). The analysis mainly shows that (1) under PSP, the algorithm has only a small probability of not converging, and the upper limit of this probability decreases with the training iterations; and (2) instead of choosing a large sampling size, a small number is proved to be sufficient to provide good performance. The notations used in the following analysis are presented in Table 10-2.
The analysis is based on the model shown in Figure 10-4. In a distributed machine
learning process, these N workers keep generating updates, and a shared model is
updated with them continuously. We count these updates by first looping over all
workers at one iteration and then across all the iterations. In this process, each one is
incrementally indexed by integer t. The total length of this update sequence is T. We
apply an analysis framework similar to that of [15]. At each barrier control point, every
worker A samples β out of N workers without replacement. If one of these sampled
workers lags more than r steps behind worker A, it waits. The probabilities of a node
lagging r steps are drawn from a distribution with a probability mass function f (r) and
cumulative distribution function (CDF) F(r). Without loss of generality, we set both
staleness r and sample size β parameters to be constants.
Ideally, in a fully deterministic barrier control system such as BSP, the ordering of
updates in this sequence should be deterministic. We call it a true sequence. However,
in reality, what we get is often a noisy sequence, where updates are reordered irregularly
due to sporadic and random network and system delays. These two sequences share
the same length. We define sequence inconsistency as the index difference
between these two sequences and denote it by γt. It shows how much a series of updates
deviate from an ideal case. If the sequence inconsistency is bounded, it means that
what a true sequence achieves, in time, a noisy sequence can also achieve, regardless
of the order of updates. This metric is thus a key instrument in theoretically proving the
convergence property of an asynchronous barrier method.
Let $R[X] = \sum_{t=1}^{T} \big( f_t(\tilde{x}_t) - f_t(x^\star) \big)$. This is the sum of the differences between the optimal value of the function and the current value given a noisy state. To put it plainly, it shows the difference between "the computation result we get if all the parameter updates we receive are in perfect ideal order" and "the computation result we get in the real world when using, for example, the PSP barrier." Now we show that the noisy system state $\tilde{x}_t$ converges in expectation toward the optimal $x^\star$, in probability. Specifically, since $R[X]$ is accumulated over time, to get a time-independent metric, we need to show that the value $\frac{R[X]}{T}$ is bounded.

Theorem: SGD under PSP, convergence in probability. Let $f(x) = \sum_{t=1}^{T} f_t(x)$ be a convex function where each component function $f_t$ is also convex. Let $x^\star \in \mathbb{R}^d$ be the minimizer of this function. Assume that the $f_t$ are $L$-Lipschitz and that the distance between two points $x$ and $x'$ is bounded: $D(x \,\|\, x') = \frac{1}{2} \| x - x' \|_2^2 \leq F^2$, where $F$ is constant. Let an update be given by $u_t = -\eta_t \nabla f_t(\tilde{x}_t)$ and the learning rate by $\eta_t = \frac{\sigma}{\sqrt{t}}$. We have the bound

\[
P\left[ \frac{R[X]}{T} - \frac{1}{\sqrt{T}} \left( \sigma L^2 + \frac{2F^2}{\sigma} \right) - q \;\geq\; \delta \right]
\;\leq\;
\exp\left\{ \frac{-T\delta^2}{2\left( c + \frac{b\delta}{3} \right)} \right\},
\tag{10.3}
\]

where $\delta$ is a constant and $b \leq 4NTL\sigma$. The $b$ term here is the upper bound on the random variables which are drawn from the lag distribution $f(r)$. The $q$ and $c$ are two values that are related to the mean and variance of $\gamma_t$. If we assume that $0 < a < 1$, then it can be proved that both $q$ and $c$ are bounded. Furthermore, if we assume with probability $\Phi$ that $\forall t.\; 4NL\sigma\gamma_t < O(T)$, then $b < O(T)$. That means $\frac{R[X]}{T}$ converges to $O(T^{-1/2})$, in probability $\Phi$, with an exponential tail bound that decreases as time increases.
In other words, this theorem claims that as long as the difference between the noisy
update sequence and the ideal sequence is bounded, and that the nodes in the system
do not lag behind too far, PSP guarantees that (with certain probability) the difference
between the result we get and the optimal result diminishes as more updates are
generated and appended in the sequence. A formal proof of this theorem can be seen
in [52].
In addition, the mean and the variance of the sequence inconsistency $\gamma_t$ under PSP can be bounded as follows:

\[
\frac{1}{T} \sum_{t=0}^{T} E(\gamma_t) \;\leq\; S \left( \frac{r(r+1)}{2} + \frac{a(r+2)}{(1-a)^2} \right),
\tag{10.4}
\]

\[
\frac{1}{T} \sum_{t=0}^{T} E(\gamma_t^2) \;<\; S \left( \frac{r(r+1)(2r+1)}{6} + \frac{a(r^2+4)}{(1-a)^3} \right),
\tag{10.5}
\]

where

\[
S = \frac{1-a}{F(r)^{\beta}(1-a) + a - a^{\,T-r+1}}.
\tag{10.6}
\]
As intimidating as these bounds may seem, they can both be treated as constants
for fixed a, T, r, and β values. They provide a means to quantify the impact of the PSP
sampling primitive and provide stronger convergence guarantees than ASP, shown in
Figure 10-11. They do not depend upon the entire lag distribution.
The intuition provided by Eq. 10.4 and Eq. 10.5 is that, when applying PSP, the update
sequence we get is not too different from the true sequence, regarding both the mean and
the variance of the difference. To demonstrate the impact of the sampling primitive on
these bounds quantitatively, Figures 10-10a and 10-10b show how increasing the sampling
count β (from 1 to 128, marked with different line colors on the right) yields tighter
bounds. Notably, only a small number of nodes need to be sampled to yield bounds close
to the optimal. This result justifies using the sampling primitive in large distributed
learning systems, and it will be further verified in the evaluation section.
The discontinuities at a = 0 and a = 1 reflect edge cases of the barrier control
method's behavior. Specifically, with a = 0, no probability mass lies in the initial r steps, so no
progress can be achieved if the system requires β > 0 workers to be within r steps of the
fastest worker. If a = 1 and β = 0, then the system operates in ASP mode, so the bounds
are expected to be large. However, these bounds are overly generous; better bounds are O(T) for
the mean and O(T²) for the variance, which we give in our proof. When a = 1 and β ≠ 0,
the system never waits, and workers can slip as far as they like as long as they
return to within r steps before the next sampling point.
Implementation Technique
As shown in Table 10-1, barrier control methods are widely used in existing systems,
such as parameter servers, Hadoop, etc. However, PSP is not yet widely available in these
systems, which means the completeness dimension in synchronization method design
cannot be readily utilized. The good news is that bringing in this extra design dimension
requires minimal effort. To implement PSP atop current data analytics frameworks,
developers only need to add one new primitive: sampling. As shown in Section 10.4, it is
then straightforward to compose existing barrier methods in a distributed system.
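To make the idea concrete, the following sketch shows how a sampling primitive can be layered on top of existing barrier conditions. The helper names (sample, pbsp, pssp, progress) are hypothetical and do not reflect Actor's actual interface; a worker passes the barrier if the condition holds among a randomly sampled subset of β peers rather than among all workers.

(* Hypothetical sketch: a sampling primitive composed with barrier conditions.
   [progress] maps a worker id to its current step count. *)
let sample beta workers =
  (* draw beta workers uniformly at random (with replacement, for brevity) *)
  let a = Array.of_list workers in
  Array.to_list (Array.init beta (fun _ -> a.(Random.int (Array.length a))))

(* pBSP: pass iff none of the sampled peers lags behind this worker *)
let pbsp ~beta ~progress ~workers ~my_step =
  List.for_all (fun w -> my_step - progress w <= 0) (sample beta workers)

(* pSSP: pass iff every sampled peer is within [staleness] steps *)
let pssp ~beta ~staleness ~progress ~workers ~my_step =
  List.for_all (fun w -> my_step - progress w <= staleness) (sample beta workers)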
By default, we choose the trainers randomly. There are various ways to guarantee
random sampling, for example, organizing the nodes into a structural overlay such as
a Distributed Hash Table (DHT). Random sampling then relies on the fact that node
identifiers are uniformly distributed in the namespace, and nodes can estimate the
population size from the density of allocated IDs in the namespace.
The choice of samples has a great impact on the performance of PSP. The sampling
of PSP provides an estimate of the overall distribution of the progress of all the workers.
In a worst-case scenario where the sampled subset happens to consist entirely of stragglers,
this subset cannot provide an efficient estimate of all the workers. Different sampling
strategies can therefore be used in different scenarios.
For example, we can change how frequently the sample changes during distributed
computing. Or, we can choose the workers according to their previous computation
time. Specifically, at each round, all the workers are categorized into two groups
according to their historical computing time per iteration, one slow and one fast, and
equal numbers of workers are then chosen from both groups to form the target subset.
Clustering algorithms such as K-Means can be used for this categorization.
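The grouping strategy can be sketched as follows. The helper below is hypothetical (not Actor's actual code): it sorts workers by their historical time per iteration, splits them into a fast half and a slow half, and draws half of the sample from each group.

(* Hypothetical sketch of the grouping sampling strategy.
   [time_per_iter] maps a worker id to its average time per iteration. *)
let group_sample ~time_per_iter ~workers ~beta =
  let sorted =
    List.sort (fun a b -> compare time_per_iter.(a) time_per_iter.(b)) workers
  in
  let n = List.length sorted in
  let fast = List.filteri (fun i _ -> i < n / 2) sorted in
  let slow = List.filteri (fun i _ -> i >= n / 2) sorted in
  let take k l = List.filteri (fun i _ -> i < k) l in
  take (beta / 2) fast @ take (beta - (beta / 2)) slow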
10.6 Evaluation
In this section, we investigate the performance of various barrier control methods
in experiments and the trade-offs they make. We focus on two common metrics for
evaluating barrier strategies: accuracy and system progress. Using these metrics,
we explore various barrier controls with regard to the impact of sample settings and
stragglers in the Federated Learning system. Besides, we also use a new metric, progress
inconsistency, to assess training consistency without the impact of application-specific
hyperparameters.
Experiment Setup
We perform extensive experiments on the real-world dataset FEMNIST, which is part
of LEAF, a modular benchmarking framework for learning in federated settings that
includes a suite of open source federated datasets [8]. Similar to MNIST, the FEMNIST
dataset targets image classification, but it contains 62 different classes (10 digits,
26 lowercase letters, and 26 uppercase letters). Each image is 28 by 28 pixels. The dataset
contains 805,263 samples in total, distributed evenly across the different classes.
To better study the performance of the proposed method with a non-IID data
distribution in Federated Learning, we follow the data partition setting in [7]. We first
sort the data by class labels, divide them into 2n shards, and assign each of the n workers 2
shards. This pathological non-IID partition makes the training data on different workers
overlap as little as possible. The validation set is 10% of the total data, preprocessed so
that it is roughly balanced across classes. As for training hyperparameters, we use a batch
size of 128 and the Adam optimizer, with a learning rate of 0.001 and coefficients of (0.9, 0.999).
We conduct our experiments on a server with 56 Intel(R) Xeon(R) E5-2680 v4 CPU
cores and 256 GB of memory. In the rest of this section, unless otherwise mentioned, we
use 16 workers by default, plus one extra worker for model validation to compute
accuracy. We aim to show the wide tuning space enabled by the sampling parameter and
how existing barrier methods can be incorporated into PSP.
Accuracy
We execute the training process using each method on the non-IID FEMNIST dataset
for about eight epochs. The results are shown in Figure 10-12. The upper subfigure uses
time as the x axis and shows how the trained model accuracy changes over about 10,000
seconds. It compares ASP, BSP, and pBSP (composing PSP with BSP), where the sampling
size equals 4.
The first thing to note is that, although the performance of ASP looks optimal at the
beginning due to its quick accumulation of updates from different workers, it quickly
deteriorates and fails to converge. Compared to the unstable performance of ASP, BSP
converges steadily. pBSP then clearly outperforms both regarding model accuracy,
especially in the later part of training. Due to its probabilistic nature, the pBSP line shows
larger jitters than BSP, but it follows the general trend of BSP steadily toward convergence.
The strength of PSP lies in that it combines the advantages of existing methods. In
the lower subfigure of Figure 10-12, we use the accumulated total number of updates the
parameter server has received as the x axis to compare the “efficiency” of the updates in
ASP, SSP, and pSSP. The staleness parameter of SSP and pSSP is set to 4 here. We can see
that as updates are accumulating, despite using sampling, the accuracy increase of pSSP
is similar to that of SSP.
Meanwhile, pSSP is much faster than SSP with regard to update progress, that is, the
rate at which updates accumulate at the parameter server. Figure 10-13 shows the
number of updates at the server over time (here, we show only results from the
beginning of the evaluation). As can be seen, at any given time, both pBSP and pSSP
progress faster than BSP and SSP, respectively. Of course, ASP progresses the fastest
since it does not require any synchronization among workers, but its nonconverged
updates render this advantage moot.
The difference in the number of updates can be directly interpreted as
communication cost, since each update means a transmission of weights and gradients
between the server and clients. For example, at about 600s, pSSP incurs 35% more
traffic than SSP, and pBSP even doubles the traffic of BSP. In our experiments,
PSP can reduce communication overhead without sacrificing the final model accuracy.
PSP combines the best of two worlds. On one hand, it has similar update efficiency
as SSP and BSP; on the other hand, it achieves faster update progress that is similar to
ASP. As a result, it outperforms the existing barrier control methods.
System Progress
In this section, we use 32 workers and run the evaluation for 400 seconds. Figure 10-14a
shows the distribution of all nodes' progress when the evaluation finishes.
As expected, the strictest method, BSP, leads to a tightly bounded progress distribution,
but at the same time, BSP makes all the nodes progress slowly: at the end of the
experiment, the nodes have only reached about the 80th update. In comparison, ASP
leads to much faster progress of around 200 updates, but at the cost of a much more loosely
spread distribution, showing no synchronization at all among nodes. SSP allows
a certain staleness (4 in our experiment) and sits between BSP and ASP.
PSP shows another dimension of performance tuning. We set the sample size β to 4, that
is, a sampling ratio of only 12.5%. The result shows that pBSP is almost as tightly bounded
as BSP while progressing much faster than BSP itself. The same is also true when comparing pSSP
and SSP. In both cases, PSP improves the iteration efficiency while limiting dispersion.
To further investigate the impact of the sample size, we focus on BSP and choose
different sample sizes. In Figure 10-14b, we vary the sample size from 0 to 24. As we
increase the sample size, the curves shift from right to left with a tighter and
tighter spread, indicating less variance in the nodes' progress. With a sample size of 0, pBSP
exhibits exactly the same behavior as ASP; with increased sample size, pBSP becomes
more and more similar to SSP and BSP, with tighter requirements on synchronization.
Another important observation worth mentioning is that, with a very small sample size
of one or two (i.e., a very small communication cost on each individual node), pBSP can
already effectively synchronize most of the nodes compared to ASP. The tail caused by
stragglers can be further trimmed by using a larger sample size. This observation confirms
our theoretical analysis in Section 10.5, which shows that a small sample size can
effectively push the probabilistic convergence guarantee toward its optimum even for a large
system size, indicating the superior scalability of the proposed solution.
Figure 10-15. Stragglers impact both system performance and accuracy of model
updates. Probabilistic synchronization control by a sampling primitive is able to
mitigate such impacts
When composed with BSP, PSP can increase the system progress of BSP by about
85% while retaining almost the same tight bound on the progress distribution. Besides,
by tuning the sample size, the evaluation shows that a small size such as 2 or 4 in a
system of 32 workers can effectively provide a tight convergence guarantee.
Robustness to Stragglers
Stragglers are not uncommon in traditional distributed training and are pervasive
among the workers of Federated Learning. In this section, we show the impact of stragglers
on system performance and the accuracy of model updates, and how probabilistic
synchronization control via a sampling primitive can be used to mitigate such impacts.
As explained before, we model stragglers by increasing the training time
of each slow trainer n-fold; namely, on average they spend n times as much time
as normal nodes to finish one iteration. The parameter n here is the "slowness" of the
system. In the experiment shown in Figure 10-15, we keep the portion of slow nodes
fixed and increase the slowness from 2 to 8. We then measure the accuracy of each
barrier control method at the end of training. To be more precise, we choose
a period of results before the end and use their mean value and standard deviation for each
observation point.
Figure 10-15 plots the decreasing model accuracy due to stragglers as a function
of the straggler slowness. As we can see, both ASP and BSP are sensitive to stragglers,
each dropping about 20% accuracy as the slowness increases from 2x to 8x, while
pBSP only drops by less than 10%. For BSP, this is mainly because the stragglers severely
reduce the training update progress; for ASP, this can be explained by its
asynchronous nature, where updates from slow workers are delayed. The problem is
exacerbated by the non-IID data, where the data overlap between different workers
is limited, if any at all. Once again, PSP takes the best of both worlds. As we have
shown before, its probabilistic sampling mitigates the effect of the data distribution and is
also less prone to the progress reduction caused by stragglers.
PSP is less prone to the stragglers in the system. When the slowness increases from
2x to 8x, both ASP and BSP are sensitive to stragglers, both dropping about 20% accuracy,
while that of pBSP only decreases by less than 10%.
Sampling Settings
Earlier in Section 10.6, we investigated how the choice of sampling size affects the progress in
PSP. One question then is: how do we choose a suitable sample size? As pointed out in
Section 10.5, one important observation derived from our theoretical proof is that
a small sample size can achieve performance similar to that of a large one.
The result shows that, compared to the basic strategy, the dynamic one can
effectively increase the efficiency of PSP. The increase ranges from about 25% to twofold
for different sampling sizes. The low accuracy of the basic strategy shows that it tends to
result in more asynchronous training, which is more similar to ASP than to BSP.
The grouping strategy achieves similar results to the dynamic one, but it shows a smaller
box deviation, which means a smoother training curve (the result figure is omitted due
to space limits). Besides, with the dynamic strategy, the sampling size does not visibly affect
the model accuracy, which means that a smaller sample size can be used to increase
system progress without sacrificing model accuracy. Also note that in both cases, a
larger sampling size leads to a smaller deviation. This also agrees with the design and
analysis of PSP shown in previous sections.
We learned two things in this section. First, by varying the sampling
size from 2 to 8 in pSSP with 16 workers, it can be seen that a small sampling
size can still achieve the model accuracy of a large one. Second, the dynamic
and grouping sampling strategies can both effectively improve performance.
Compared to the basic strategy, both can effectively increase the efficiency of PSP, with
the increase ranging from about 25% to twofold for different sampling sizes.
Progress Inconsistency
In the previous sections, we evaluated the impact of barrier control methods on the
accuracy of three different models. However, training accuracy is affected not only
by the barrier method, which controls training inconsistency, but also by hyperparameters
such as the learning rate. The tolerance for error in training also varies greatly between
applications. To better understand the impact of barriers on model consistency
during training without the influence of these factors, we use progress
inconsistency as a metric to compare barriers.
In distributed training, for a worker, between the time it pulls a model from a
server and updates its own local model, the server likely has already received several
updates from other workers. These updates are the source of training inconsistency.
We define progress inconsistency as the number of these updates between a worker’s
corresponding read and update operations. In this experiment, we collect the progress
inconsistency value of each node at its every step during training.
We investigate the relationship between the number of nodes and the inconsistency of
pBSP. All executions run for 100 seconds, and we increase the number of workers from 50 to 500. We
measure the average and variance of progress inconsistency, both normalized by the
number of workers, as shown in Figure 10-18. The average inconsistency of ASP is mostly
unaffected by size. With a smaller sample size, that of pBSP comes close to ASP, but
note that only the initial increase of network size has a considerable impact. With the sample
size fixed and the network size growing, the average inconsistency grows sublinearly, which
is an ideal property. As for the standard deviation values of pBSP, they remain mostly stable
regardless of network size.
According to these observations, we can see that for PSP, both the average training
inconsistency (the mean) and the noise (the variance) grow sublinearly
toward a certain limit for different sample sizes, bounded by those of ASP on one side and BSP/SSP on the other.
10.7 Summary
In this chapter, we explored distributed computing in Owl, with a focus on
synchronization barriers. We presented Actor, an OCaml-based distributed
computing engine that implements three different computing paradigms,
namely map-reduce, parameter server, and peer-to-peer. Orthogonal to that, it also
implements four different types of barrier control methods.
We proposed Probabilistic Synchronous Parallel, which is suitable for data
analytics applications deployed in large and unreliable distributed systems. It strikes a
good trade-off between the efficiency and accuracy of iterative learning algorithms by
probabilistically controlling how data is sampled from distributed workers. In Actor,
we implemented PSP with a core system primitive of "sampling." We showed that the
sampling primitive can be combined with existing barrier control methods to derive
fully distributed solutions. We then evaluated the performance of various barrier
control methods. The effectiveness of PSP in different application scenarios depends
on a suitable parameter, that is, the sample size. Similar to performance tuning
in numerical computation, we suggest resorting to prior knowledge and empirical
measurement for parameter tuning, and we regard this as a challenge for future
exploration.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 11
Testing Framework
Every piece of proper software requires testing, and Owl is no exception. All too often, we have found that
testing helps us discover potential errors we failed to notice during development. In
this chapter, we briefly introduce the philosophy of testing in Owl, the tool we use for
unit testing, and examples that demonstrate how to write unit tests. Issues
such as using functors in tests and other things to watch for when writing test code for Owl
are also discussed in this chapter.
11.1 Unit Test
There are multiple ways to test your code. One common way is to use
assertions or to catch/raise errors in the code. These kinds of tests are useful, but the
testing code is mixed with the function code itself, whereas we need separate test modules
that check the implementation of functions against expected behaviors.
In Owl, we apply unit tests to ensure the correctness of numerical routines as much
as possible. A unit test is a software testing method that checks the behavior of individual
units in the code. In our case, a "unit" often means a single numerical function.
There is a software development approach called test-driven development,
where you write the test code even before you implement the function to be tested.
Though we don't enforce such an approach, there are certain testing principles we follow
during the development of Owl. For example, we generally don't trust code that is
not tested, so in a GitHub pull request, it is always good practice to accompany your
implementation with a unit test in the test/ directory of the source code. Besides, try to
keep functions short and simple, so that a test case can focus on a certain aspect.
We use the alcotest framework for testing in Owl. It is a lightweight test framework
with simple interfaces. It exposes a simple TESTABLE module type, a check function to
assert test predicates, and a run function to perform a list of unit -> unit test callbacks.
11.2 Example
Let's look at an example of using alcotest in Owl. Suppose you have implemented
some functions in the linear algebra module, such as rank, determinant, inversion, etc.,
and want to test them before making a pull request. The testing code can be included in
one test unit, and each unit consists of four major sections.
In the first section, we define some utility functions and common constants which
will be used in the unit. For this example, we specify the required precision and some
predefined input data. Here, we use 1e-6 as the precision threshold: two ndarrays are
deemed the same if the sum of their differences is less than 1e-6, as shown in mpow. The
predefined input data can also be defined per test case, as in is_triu_1.
open Owl
open Alcotest
module M = Owl.Linalg.D

(* Section #1 *)
let approx_equal a b =
  let eps = 1e-6 in
  Stdlib.(abs_float (a -. b) < eps)

(* predefined input data used by the vecnorm tests below *)
let x0 = Mat.of_array [| 1.; 2.; 3.; 4.; 5.; 6. |] 1 6
The second section is the core. It contains the actual testing logic, for example,
whether the det function can correctly calculate the determinant of a given matrix. Every
testing function defined in the To_test module has self-contained logic to validate the
implementation of the target function.
(* Section #2 *)
module To_test = struct
  let rank () =
    (* any rank-2 matrix works here; see the failing-test example later *)
    let x = Mat.sequential 3 4 in
    M.rank x = 2

  let det () =
    let x = Mat.hadamard 4 in
    M.det x = 16.

  let vecnorm_01 () =
    let a = M.vecnorm ~p:1. x0 in
    approx_equal a 21.

  let vecnorm_02 () =
    let a = M.vecnorm ~p:2. x0 in
    approx_equal a 9.539392014169456

  let is_triu_1 () =
    let x = Mat.of_array [| 1.; 2.; 3.; 0.; 5.; 6.; 0.; 0.; 9. |] 3 3 in
    M.is_triu x = true

  let mpow () =
    let x = Mat.uniform 4 4 in
    let y = M.mpow x 3. in
    let z = Mat.(dot x (dot x x)) in
    approx_equal Mat.(y - z |> sum') 0.
end
The most common test function used in Owl has the type unit -> bool. The idea
is that each test function compares a certain aspect of a function with expected results.
If there are multiple test cases for the same function, such as the case in vecnorm, we
tend to build different test cases instead of using one large test function to include all the
cases. The common pattern of these functions can be summarized as
let test_func () =
let expected = expected_value in
let result = func args in
assert (expected = result)
It is important to understand that the equal sign does not necessarily mean the two
values have to be identical; in fact, when floating-point numbers are involved, which is
quite often the case, we only need the two values to be approximately equal within an
error range. In that case, you need to pay attention to which precision you are using:
double or float. A threshold that is adequate for a single-precision float may still
represent a large error in a double-precision computation.
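As a hypothetical illustration of this pattern with an explicit tolerance:

(* hypothetical test: compare a floating-point result within a tolerance *)
let log_e () =
  let expected = 1. in
  let result = Owl.Maths.log Owl.Const.e in
  Stdlib.(abs_float (expected -. result) < 1e-10)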
The third section contains mostly boilerplate code that provides the testing
framework with two important pieces of information. The first is the name of the testing
function, so the framework can store it in its log for post-analysis. The second is the
anticipated result, which the framework checks against the outcome of the testing
function. If the outcome does not match the expected result, the test fails and the failure
is logged.
Here, we expect all the test functions to return true, though alcotest also supports
tests returning many other types such as string, int, etc. Please refer to its source file
for more details.
(* Section #3 *)
let rank () =
Alcotest.(check bool) "rank" true (To_test.rank ())
let det () =
Alcotest.(check bool) "det" true (To_test.det ())
let vecnorm_01 () =
Alcotest.(check bool) "vecnorm_01" true (To_test.vecnorm_01 ())
let vecnorm_02 () =
Alcotest.(check bool) "vecnorm_02" true (To_test.vecnorm_02 ())
let is_triu_1 () =
Alcotest.(check bool) "is_triu_1" true (To_test.is_triu_1 ())
let mpow () =
Alcotest.(check bool) "mpow" true (To_test.mpow ())
(* Section #4 *)
let test_set =
  [ "rank", `Slow, rank
  ; "det", `Slow, det
  ; "vecnorm_01", `Slow, vecnorm_01
  ; "vecnorm_02", `Slow, vecnorm_02
  ; "is_triu_1", `Slow, is_triu_1
  ; "mpow", `Slow, mpow ]
In the final section, we take the functions from Section #3 and put them into a test
set list. The test set specifies the name and mode of each test. The test mode is either Quick
or Slow. Quick tests run on every invocation of the test suite. Slow tests are for stress tests
that run only occasionally, typically before a release or after a major change. We can
further specify the execution order of these testing functions.
After this step, the whole file is named unit_linalg.ml and put under the test/
directory, together with all the other unit test files. Now the only thing left is to add it to
test_runner.ml:
let () =
Alcotest.run
"Owl"
[ "stats_rvs", Unit_stats_rvs.test_set
; "maths", Unit_maths.test_set
; "linear algebra", Unit_linalg.test_set
...
; "conv3d_mec", Unit_conv_mec_naive.Conv3D_MEC.test_set
; "conv2d_naive", Unit_conv_mec_naive.Conv2D_NAIVE.test_set
; "conv3d_naive", Unit_conv_mec_naive.Conv3D_NAIVE.test_set
; "dilated_conv2d", Unit_dilated_conv2d.test_set
; "dilated_conv3d", Unit_dilated_conv3d.test_set
; "base: algodiff diff", Unit_base_algodiff_diff.test_set
; "base: algodiff grad", Unit_base_algodiff_grad.test_set
; "base: slicing basic", Unit_base_slicing_basic.test_set
; "base: pooling2d", Unit_base_pool2d.test_set
; "base: pooling3d", Unit_base_pool3d.test_set
... ]
That's all. Now you can run make test and check whether the functions are implemented
correctly. The result is shown in Figure 11-1, which shows that all tests are successful.
What if one of the test functions does not pass? Let's intentionally make a failing test,
for example, by asserting that the matrix rank in the rank test equals 1 instead of the
correct answer 2, and run the test again.
As we can see in Figure 11-2, the failure is detected and logged directly to the
standard output.
Corner Cases
Corner cases involve situations that occur outside of normal operating parameters. This
is especially relevant when testing convolution operations. As the core operation in deep neural
networks, convolution is complex: it takes the input, kernel, strides, padding, etc. as
parameters. Therefore, special cases such as a 1x1 kernel, strides of different height and
width, etc. are tested in various combinations, sometimes with different input data.
...
Test Coverage
Another issue is test coverage, that is, the percentage of code for which an associated
test exists. Though we don't seek strict 100% coverage for now, wider test
coverage is always a good idea. For example, in our implementation of the repeat
operation, the implementation differs depending on whether the given axes contain one
or multiple integers. Therefore, it is crucial that the test functions cover both cases.
11.4 Use Functor
Note that we can still benefit from all the powerful features of OCaml, such as
functors. For example, in testing the convolution operations, we would like to test the
implementation of both that in the core library (which is implemented in C) and that in
the base library (in pure OCaml). Clearly, there is no need to write the same unit test
code twice for these two sets of implementations. To solve that problem, we have a test
file unit_conv2d_genericl.ml that contains a large module covering all the previous four
sections:
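The listing itself is omitted here; its overall shape is roughly as follows (a sketch only; the exact functor argument signature used in the Owl repository may differ):

(* sketch of the shared test functor in unit_conv2d_genericl.ml *)
module Make (N : Owl_types_ndarray_algodiff.Sig with type elt = float) = struct
  (* Section #1: tolerance constants and shared input data      *)
  (* Section #2: module To_test with the convolution test logic *)
  (* Section #3: alcotest wrappers around the To_test functions *)
  (* Section #4: let test_set = [ ... ]                         *)
end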
And in the specific testing file for core implementation unit_conv2d.ml, it simply
contains one line of code:
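That line instantiates the functor with the C-backed core ndarray, along the lines of the following (again a sketch; the exact module names may differ):

include Unit_conv2d_genericl.Make (Owl.Dense.Ndarray.S)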
11.5 Performance Tests
For a numerical library, being able to calculate correct results is not enough. How fast
a function runs also matters; in fact, it matters a lot in modern real-time
data analysis, which has wide applications in fields such as finance, robotics, flight
control, etc. In addition to correctness, performance tests are also included in the Owl
testing framework. The following simple generic function runs a target function a
certain number of times and then calculates the average speed:
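The function itself is omitted from the text; a minimal sketch consistent with how it is used below (names assumed) is:

(* run [op] c times and print the average time per run *)
let test_op s c op =
  let t0 = Unix.gettimeofday () in
  for _ = 1 to c do
    op ()
  done;
  let t1 = Unix.gettimeofday () in
  Printf.printf "| %s: %.4fs per run on average\n" s ((t1 -. t0) /. float_of_int c)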
The following function for testing each operation is similar, but it prints out the trace
eagerly in every iteration:
(* test one operation c time, output the used time in each evaluation *)
let test_op_each c op =
Printf.printf "| test some fun %i times\n" c;
With these two generic functions, we can write up a list of tests very quickly. An
example that tests the execution time of various matrix operations is shown as follows:
let _ =
Random.self_init ();
let m, n = 5000, 20000 and c = 1 in
print_endline (String.make 60 '+');
Printf.printf "| test matrix size: %i x %i exps: %i\n" m n c;
print_endline (String.make 60 '-');
let x, y = (M.uniform Float64 m n), (M.uniform Float64 m n) in
test_op "empty " c (fun () -> M.empty Float64 m n);
test_op "zeros " c (fun () -> M.zeros Float64 m n);
test_op "col " c (fun () -> M.col x (n-1));
test_op "row " c (fun () -> M.row x (m-1));
test_op "cols " c (fun () -> M.cols x [|1;2|]);
test_op "rows " c (fun () -> M.rows x [|1;2|]);
test_op "map " c (fun () -> M.map (fun y -> 0.) x);
...
11.6 Summary
In this chapter, we briefly introduced how unit tests are performed with the alcotest
framework in the existing Owl codebase. We used an example piece of test code for the
linear algebra module in Owl to demonstrate the general structure of Owl test code.
We then discussed some tips we find helpful in writing tests, such as considering corner
cases, test coverage, and using functors to simplify the test code. In practice, we find that
unit tests come in really handy during development, and we simply cannot have too many of them.
APPENDIX A
Basic Analytics Examples
Owl implements the full linear algebra interface to CBLAS and LAPACKE. You
might notice the extra C in CBLAS and E in LAPACKE: they are the corresponding
C interfaces of the FORTRAN implementations. To this end, Owl implements several
internal modules. The Owl_cblas module provides the raw interface to CBLAS functions,
from level 1 to level 3. The interfaced functions have the same names as those in CBLAS. The
Owl_lapacke_generated module provides the raw interface to LAPACKE functions (over
1000), which also have the same names as defined in lapacke.h. The Owl_lapacke module
is a very thin layer of interface between the Owl_lapacke_generated module and the Linalg
module. Its purpose is to provide a unified function to make generic functions over
different number types.
The functions in Owl_cblas and Owl_lapacke_generated are very low level;
for example, you need to deal with calculating parameters, allocating workspace,
postprocessing results, and many other tedious details. End users do not really need
to use them directly unless they have enough background in numerical analysis and
are chasing performance. For example, the LU factorization is performed using
the sgetrf or dgetrf function in the Owl_lapacke_generated module, whose signature
looks like this:
val sgetrf : layout:int -> m:int -> n:int -> a:float ptr ->
lda:int -> ipiv:int32 ptr -> int
Instead of exposing all these parameters, the getrf function in the Owl_lapacke
module provides interfaces that are more straightforward:
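The signature itself is not shown here; it is close to the following sketch (the exact interface in owl_lapacke.mli may differ slightly):

val getrf : a:('a, 'b) t -> ('a, 'b) t * (int32, int32_elt) t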
These functions still provide fairly general, low-level access for users. If this still looks a
bit unfamiliar to you, in the Linalg module we have
val lu : ('a, 'b) t -> ('a, 'b) t * ('a, 'b) t * (int32, int32_elt) t
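For example, assuming open Owl, a small usage sketch of this function is:

let x = Mat.uniform 4 4
let l, u, ipiv = Linalg.D.lu x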
The core functions in the FFT module are fft and its variants, such as the real-input rfft. This
module provides the basic FFT functions listed in Table A-1. The inputs to these
functions are ndarrays. As with Ndarray, the FFT module provides four number
types: single precision (S), double precision (D), single-precision complex (C), and double-precision
complex (Z). The axis parameter specifies along which axis of the input ndarray a function is
performed; it defaults to the highest dimension if not specified. The parameter n specifies
the size of the output.
Here, we use a one-dimensional Fourier transform example to demonstrate the
usage of these functions, especially the most basic fft and its inverse transform
function ifft. We plot the FFT of the sum of two sine waves, showing the power of FFT to
separate signals of different frequencies. This example uses 600 sampling points, and the
sample spacing is 1/800.
# module G = Dense.Ndarray.Generic
# let n = 600.
# let t = 1. /. 800.
# let x = Arr.linspace 0. (n *. t) (int_of_float n)
# let y1 = Arr.((50. *. 2. *. Owl_const.pi) $* x |> sin)
# let y2 = Arr.(0.5 $* ((80. *. 2. *. Owl_const.pi) $* x |> sin))
# let y = Arr.(y1 + y2)
The combined signal in Figure A-1b shows an irregular shape. Next, we apply FFT to
this mixed signal:
# let yf = Owl_fft.D.fft y
In the result yf, each element can be seen as a frequency vector in the complex space.
We can plot the length of these vectors, represented by z in the following code. Again, we
use only the first half of the elements in yf, which correspond to positive frequencies. The plot is shown
in Figure A-1c.
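The plotting code itself is omitted; a sketch of computing the magnitudes z (assuming the complex result is split into real and imaginary parts via the Z ndarray module) is:

# let re = Dense.Ndarray.Z.re yf
# let im = Dense.Ndarray.Z.im yf
# let z = Arr.(sqrt ((re * re) + (im * im)))
# let z = Arr.get_slice [ [ 0; 299 ] ] z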
Figure A-1. (a) Two sine signals of different frequencies, (b) combined signal, (c)
using FFT to separate two sine signals from their mixed signal
$$F\left(x, y, y', \ldots, y^{(n)}\right) = 0. \qquad (A.1)$$
That is, an ODE is expressed as a relation between a function and its derivatives, together
with boundary conditions. ODEs can be used to model dynamical systems. The initial state of the
system is called its initial values; these values are often known and can be represented as

$$y\big|_{x=x_0} = y_0, \quad y'\big|_{x=x_1} = y_1, \quad \ldots \qquad (A.2)$$
where $y_0$, $y_1$, etc. are known. The highest order of derivative occurring in Eq. A.1 is
the order of the differential equation. A first-order differential equation can generally be
expressed as $\frac{dy}{dx} = f(x, y)$, where $f$ is any function that contains $x$ and $y$. Solving Eq.
A.1 subject to the given initial values in Eq. A.2 is called the initial value problem.
Solving such problems is the main target of many numerical ODE solvers.
A real-world system often contains multiple interdependent components, each
described by a function that evolves over time. For example, the Lorenz attractor system
has three components that change with time: the rate of convection in the atmospheric
flow and the horizontal and vertical temperature variations. Such a system is an example
of a first-order linear system of ODEs, or simply a linear system of ODEs. Generally, if we let

$$y(t) = \begin{bmatrix} y_1(t) \\ \vdots \\ y_n(t) \end{bmatrix}, \quad A(t) = \begin{bmatrix} a_{11}(t) & \cdots & a_{1n}(t) \\ \vdots & \ddots & \vdots \\ a_{n1}(t) & \cdots & a_{nn}(t) \end{bmatrix}, \quad \text{and} \quad g(t) = \begin{bmatrix} g_1(t) \\ \vdots \\ g_n(t) \end{bmatrix},$$

then the linear system can be written as

$$y'(t) = A(t)\,y(t) + g(t). \qquad (A.3)$$
As a concrete example, consider

$$\frac{dy}{dt} = Ay, \quad \text{where } A = \begin{bmatrix} 1 & -1 \\ 2 & -3 \end{bmatrix}.$$
This equation represents an oscillator system where y is the state of the system, t is
time, and the initial state at t = 0 is y0 = [−1, 1]T. We wish to know the system state at t = 2.
The function can be expressed in Owl using the matrix module.
let f y t =
let a = [|[|1.; -1.|];[|2.; -3.|]|]|> Mat.of_arrays in
Mat.(a *@ y)
Next, we want to specify the timespan of this problem: from 0 to 2 using a step
of 0.001.
The last requirement for solving the problem is to have the initial values:
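Putting the pieces together, a solve call along the following lines can be used (a sketch based on the owl-ode interface; the exact module paths are assumed here):

let tspec = Owl_ode.Types.(T1 { t0 = 0.; duration = 2.; dt = 1E-3 })
let y0 = Mat.of_array [| -1.; 1. |] 2 1
let ts, ys = Owl_ode.Ode.odeint (module Owl_ode.Native.D.RK4) f y0 tspec ()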
This is the rk4 solver, short for "fourth-order Runge-Kutta method," that we have
introduced before. The result shows both the steps ts and the system values at each
step ys.
The owl-ode library abstracts an initial value problem into four different parts:
• A function f that describes how the system evolves, as in y′(t) = f (y, t)
• A specification of the timespan over which to evolve the system
• The initial values y0 of the system state
• A solver, such as rk4, that performs the step-by-step integration
Indeed, the signature of a solver clearly indicates these four parts. Building on this
uniform abstraction, you can choose a suitable solver and use it to solve many complex
and practical ODE problems. Note that the requirements vary between solvers; for
example, some require the state to be two matrices, while others process data in a more
general ndarray format.
The owl-ode library provides a wide range of solvers. It implements native solvers
based on the basic step-by-step update idea discussed earlier. There are also many
mature off-the-shelf tools for solving ODEs, and we interface to two of them: sundials [1]
and ODEPACK [2]. Both are well implemented and widely used in practice. For
example, SciPy provides a Python wrapper for sundials, and NASA also uses its
CVODE/CVODES solvers for spacecraft trajectory simulations. sundials is a SUite of
Nonlinear and DIfferential/ALgebraic equation Solvers. It contains six solvers, and we
interface to its CVODE solver for solving initial value problems. ODEPACK is a collection
of FORTRAN solvers for the initial value problem for ordinary differential equation
systems. We interface to its LSODA solver, which solves ODEs in explicit form.
[1] https://computing.llnl.gov/projects/sundials
[2] https://computing.llnl.gov/casc/odepack/
APPENDIX B
System Conventions
All software systems have their own rules and conventions with which developers
must comply, and Owl is no exception. In this appendix, we cover function naming
and various other conventions in the Owl library.
As we can see from the type signature, the output is specified in the optional out
parameter. If out is not provided, the first operand x is used to store the final
result. Since binary operators support broadcasting by default, when using
impure functions every dimension of the first argument x must not be smaller than that
of the second argument y. In other words, impure functions only allow broadcasting a
smaller y onto an x that is big enough to accommodate the result.
Most binary math functions have an associated shorthand operator, such as +, -, *,
and /. The impure versions also have associated operators; for example, rather than
Arr.(x + y), which returns the result in a new ndarray, you can write Arr.(x += y), which
adds x and y and saves the result into x. These operators are listed in Table B-1.
Function       Operator   Impure operator
add            +          +=
sub            -          -=
mul            *          *=
div            /          /=
add_scalar     +$         +$=
sub_scalar     -$         -$=
mul_scalar     *$         *$=
div_scalar     /$         /$=
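As a quick illustration of the pure versus impure forms:

let x = Arr.ones [| 2; 3 |]
let y = Arr.ones [| 2; 3 |]
let z = Arr.(x + y)   (* pure: returns a new ndarray, x is unchanged *)
let () = Arr.(x += y) (* impure: the sum is written back into x      *)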
A reduce operation such as sum returns a one-element ndarray, so to treat it as a scalar, you
must retrieve the single value by calling the get function.
This becomes inconvenient in OCaml if we must always extract the scalar value from
the return value of reduce operations. In languages like Python and Julia, the return type is
determined dynamically, but OCaml's strong typing requires that we either use a unified
type or implement another set of functions. In the end, Owl picked the latter in its
design, so every reduce operation has two versions:
• One that allows you to reduce along a specified axis or to reduce all
the elements, but always returns an ndarray
• One that reduces all the elements and always returns a scalar value
The difference between the two is indicated by an extra "'" character in the names of
those returning a scalar. For example, the functions returning an ndarray
are named Arr.sum, Arr.prod, Arr.mean, etc., while those returning a scalar are named
Arr.sum', Arr.prod', Arr.mean', etc.
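For example:

let x = Arr.sequential [| 2; 3 |] (* elements 0, 1, 2, 3, 4, 5              *)
let s0 = Arr.sum x                (* a one-element ndarray holding 15.       *)
let s1 = Arr.sum ~axis:0 x        (* reduce along axis 0, still an ndarray   *)
let s2 = Arr.sum' x               (* a plain float: 15.                      *)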
Technically, S, D, C, and Z are wrappers of the Generic module with explicit type
information provided. Therefore, you can skip passing the type constructor into the
Generic module if you use these submodules directly. In short, the Generic
module can do everything that the submodules can, but for some functions (e.g., creation functions),
you must explicitly pass in the type information.
In practice, we often work with double-precision numbers, so Owl provides shortcuts
to the data structures of double-precision floating-point numbers: Arr is equivalent to
double-precision real Dense.Ndarray.D, and, similarly, Mat is equivalent to double-precision
real Dense.Matrix.D. These two modules are frequently used in this book. You
can cast a value from one type to another by using the cast_* functions in the
Generic module. For example, Generic.cast_s2d casts from float32 to float64, and
Generic.cast_c2z casts from complex32 to complex64.
Many functions in the Generic module can handle the aforementioned four number
types. This polymorphism is achieved by pattern matching and generalized algebraic
data types (GADTs) in OCaml. In the following code, we use the sum function in the
Dense.Matrix.Generic module as an example:
open Owl;;
let x = Dense.Matrix.S.eye 5 in
Dense.Matrix.Generic.sum x;;
let x = Dense.Matrix.D.eye 5 in
Dense.Matrix.Generic.sum x;;
let x = Dense.Matrix.C.eye 5 in
Dense.Matrix.Generic.sum x;;
let x = Dense.Matrix.Z.eye 5 in
Dense.Matrix.Generic.sum x;;
As we can see, no matter what kind of numbers are held in an identity matrix, we
can always pass it to the Dense.Matrix.Generic.sum function. Similarly, we can do
the same thing for other modules (Dense.Ndarray, Sparse.Matrix, etc.) and other
functions (add, mul, neg, etc.).
APPENDIX C
Metric Systems and Constants
In many scientific computing problems, numbers are not abstract but carry realistic
meanings. In other words, these numbers only make sense on top of a well-defined
metric system. For example, when we talk about the distance between two objects, we
may write down the number 30, but what does 30 mean in reality? Is it meters, kilometers,
miles, or lightyears? As another example, what is the speed of light? Well, that really
depends on which metric you are using, for example, km/s, m/s, mile/h, etc. Things can
get really messy in computation if we do not unify the metric system in a numerical
library. The translation between different metrics is often important in real-world
applications; therefore, a full-featured numerical library is obliged to provide sufficient
support for managing different metric systems. In this appendix, we briefly introduce the
metric system and the constants provided in the Owl library to support scientific computing.
All the metrics defined in these four systems can be found in the interface file
owl_const.mli. In general, SI is much newer and is the recommended system to use. The
International System of Units (French: Système international d'unités, SI) is historically
also called the MKSA system of units, for meter-kilogram-second-ampere. The SI system
extends the MKS system and has seven base units, expressing any measurement
of physical quantities using fundamental units of length, mass, time, electric current,
thermodynamic temperature, amount of substance, and luminous intensity, which are
the meter, kilogram, second, ampere, kelvin, mole, and candela, respectively.
With a well-defined metric system, we can safely talk about the distance between
two objects, the speed of light, and a lot of other real-world quantities in Owl. See the
following examples:
Const.SI.plancks_constant_h;; (* in SI system *)
Const.MKS.plancks_constant_h;; (* in MKS system *)
Const.CGS.plancks_constant_h;; (* in CGS system *)
Const.CGSM.plancks_constant_h;; (* in CGSM system *)
Table C-1 shows some physical constants that the SI module includes.
As a computer scientist, you must be familiar with prefixes such as kilo, mega,
and giga. The SI system includes definitions of these prefixes as well. But be careful,
especially if you come from a computer science background: the base is 10 instead of 2.
These prefixes are defined in the Const.Prefix module.
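For example, assuming the prefix values defined in owl_const.mli, 2.5 kilometers and 3 nanoseconds can be written as:

let d = 2.5 *. Const.Prefix.kilo (* 2500 m *)
let t = 3.0 *. Const.Prefix.nano (* 3e-9 s *)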
Some basic mathematical constants are also provided in Owl, though some
constants from advanced mathematics, such as the golden ratio or the Euler-Mascheroni
constant, are not yet included. The provided constants are shown in Table C-2.
pi       Pi
e        Natural constant
euler    Euler constant
Besides these constants, we also provide some frequently used computations based
on them, including
• log2e, defined as log₂ e
• sqrt3, defined as √3
• sqrtpi, defined as √π
module SI = struct
let speed_of_light = 2.99792458e8
let gravitational_constant = 6.673e-11
let plancks_constant_h = 6.62606896e-34
let plancks_constant_hbar = 1.05457162825e-34
let astronomical_unit = 1.49597870691e11
let light_year = 9.46053620707e15
let parsec = 3.08567758135e16
let grav_accel = 9.80665e0
let electron_volt = 1.602176487e-19
let mass_electron = 9.10938188e-31
let mass_muon = 1.88353109e-28
let mass_proton = 1.67262158e-27
let mass_neutron = 1.67492716e-27
let rydberg = 2.17987196968e-18
let boltzmann = 1.3806504e-23
let molar_gas = 8.314472e0
let standard_gas_volume = 2.2710981e-2
let minute = 6e1
let hour = 3.6e3
let day = 8.64e4
let week = 6.048e5
let inch = 2.54e-2
let foot = 3.048e-1
let yard = 9.144e-1
let mile = 1.609344e3
let nautical_mile = 1.852e3
let fathom = 1.8288e0
let mil = 2.54e-5
let point = 3.52777777778e-4
let texpoint = 3.51459803515e-4
These units are all derived from the seven basic units we have mentioned and can be
categorized according to different application fields.
Time: The time units are shown in Table C-3. The base SI unit for time measurement
is second.
Length: The length units are shown in Table C-4. The base SI unit for length
measurement is meter.
Area: The area units are shown in Table C-5. Measuring area and volume still relies
on the SI base unit meter.
Volume: The volume units are shown in Table C-6. The base SI unit for volume
measurement is cubic meter.
Speed: The speed units are shown in Table C-7. The base units for speed are that of
time and length.
Mass: The mass units are shown in Table C-8. The base unit for presenting mass is
kilogram (kg).
Force: The force units are shown in Table C-9. Measuring force relies on
the SI derived unit, the newton; one newton is equal to one kilogram meter per
second squared.
Energy: The energy units are shown in Table C-10. The unit for measuring energy
is the joule, which is equal to one kilogram square meter per second squared.
Power: The power units are shown in Table C-11. The unit of power is the watt, an SI
derived unit. One watt is equal to one kilogram square meter per cubic second, or one
joule per second.
Pressure: The pressure units are shown in Table C-12. To measure pressure, we often
use the pascal as the standard unit. One pascal is equal to one kilogram per meter per
second squared, or one newton per square meter.
Viscosity: The viscosity units are shown in Table C-13. The poise is a unit of dynamic
viscosity, and the stokes is a unit of kinematic viscosity; both are CGS-based units.
Luminance: The luminance units are shown in Table C-14. Candela is the base unit
for luminance, and both lumen and lux are derived units.
Radioactivity: The radioactivity units are shown in Table C-15. The SI unit of
radioactivity is becquerel, named in honor of the scientist Henri Becquerel, defined as
one transformation (or decay or disintegration) per second. The other base units such as
ampere, second, and kilogram are also used.
APPENDIX D
Algodiff Module
In the remaining appendixes, we provide some important pieces of the source code of several
Owl modules. They complement the materials we have discussed in this book, so that
readers can gain a deeper understanding of how these modules work.
In this appendix, we provide the full source code of several components of the
algorithmic differentiation module. It consists of three parts. First are the templates
used to generate operators (owl_algodiff_ops_builder.ml) and examples that generate
operators using these templates (owl_algodiff_ops.ml). Studying this code is
instrumental in understanding how AD works. Second are the core functionalities
provided in owl_algodiff_core.ml and owl_algodiff_generic.ml. These
functions look simple enough, but they form the backbone of the whole module.
Third is the graph traversal module that can convert the AD graph into multiple formats.
It comes in handy when debugging and for better understanding the details of an AD graph.
(** owl_algodiff_ops_builder.ml *)
let build_siso =
(* single input single output operation *)
let op_siso ~ff ~fd ~df ~r a =
match a with
| DF (ap, at, ai) ->
let cp = fd ap in
DF (cp, df cp ap at, ai)
| DR (ap, _, _, _, ai, _) ->
let cp = fd ap in
DR (cp, ref (zero cp), r a, ref 0, ai, ref 0)
| ap -> ff ap
in
fun (module S : Siso) ->
let rec f a =
let open S in
let ff = function
| F a -> S.ff_f a
| Arr a -> S.ff_arr a
| _ -> error_uniop label a
in
let fd a = f a in
let r a =
let adjoint cp ca t = (S.dr (primal a) cp ca, a) :: t in
let register t = a :: t in
let label = S.label, [ a ] in
adjoint, register, label
in
op_siso ~ff ~fd ~df:S.df ~r a
in
f
let build_sipo =
(* single input pair outputs operation *)
let op_sipo ~ff ~fd ~df ~r a =
match a with
| DF (ap, at, ai) ->
let cp1, cp2 = fd ap in
DF (cp1, df cp1 ap at, ai), DF (cp2, df cp2 ap at, ai)
| DR (ap, _, _, _, ai, _) ->
let cp1, cp2 = fd ap in
let ca1_ref = ref (zero cp1) in
let ca2_ref = ref (zero cp2) in
let cp1_ref = ref cp1 in
let cp2_ref = ref cp2 in
let tracker = ref 0 in
    (* tracker: int reference. In reverse_reset, it keeps track of the number
       of times cp1 and cp2 have been called, so that in reverse_push we do
       not update the adjoint of ap before we've fully updated both ca1
       and ca2. *)
( DR
( cp1
, ca1_ref
, r (a, (cp1_ref, cp2_ref), (ca1_ref, ca2_ref))
, ref 0
, ai
, tracker )
, DR
( cp2
, ca2_ref
, r (a, (cp1_ref, cp2_ref), (ca1_ref, ca2_ref))
, ref 0
, ai
, tracker ) )
| ap -> ff ap
in
fun (module S : Sipo) ->
let rec f a =
let open S in
let ff = function
| F a -> S.ff_f a
| Arr a -> S.ff_arr a
| _ -> error_uniop label a
in
let fd = f in
let r (a, cp_ref, ca_ref) =
let adjoint cp _ca t = (S.dr (primal a) cp cp_ref ca_ref, a) :: t in
let register t = a :: t in
let label = S.label, [ a ] in
adjoint, register, label
in
op_sipo ~ff ~fd ~df ~r a
in
f
val dr : t -> t -> t ref * t ref * t ref -> t ref * t ref * t ref -> t
end
let build_sito =
(* single input three outputs operation *)
let op_sito ~ff ~fd ~df ~r a =
match a with
| DF (ap, at, ai) ->
let cp1, cp2, cp3 = fd ap in
DF (cp1, df cp1 ap at, ai), DF (cp2, df cp2 ap at, ai), DF (cp3, df cp3 ap at, ai)
| DR (ap, _, _, _, ai, _) ->
let cp1, cp2, cp3 = fd ap in
let ca1_ref = ref (zero cp1) in
let ca2_ref = ref (zero cp2) in
let ca3_ref = ref (zero cp3) in
let cp1_ref = ref cp1 in
let cp2_ref = ref cp2 in
let cp3_ref = ref cp3 in
let tracker = ref 0 in
( DR
( cp1
, ca1_ref
, r (a, (cp1_ref, cp2_ref, cp3_ref), (ca1_ref, ca2_ref, ca3_ref))
, ref 0
, ai
, tracker )
, DR
( cp2
, ca2_ref
, r (a, (cp1_ref, cp2_ref, cp3_ref), (ca1_ref, ca2_ref, ca3_ref))
, ref 0
, ai
, tracker )
, DR
( cp3
, ca3_ref
, r (a, (cp1_ref, cp2_ref, cp3_ref), (ca1_ref, ca2_ref, ca3_ref))
, ref 0
, ai
, tracker ) )
| ap -> ff ap
in
fun (module S : Sito) ->
let rec f a =
let open S in
let ff = function
| F a -> S.ff_f a
| Arr a -> S.ff_arr a
| _ -> error_uniop label a
in
let fd = f in
let r (a, cp_ref, ca_ref) =
let adjoint cp _ca t = (S.dr (primal a) cp cp_ref ca_ref, a) :: t in
let register t = a :: t in
let label = S.label, [ a ] in
adjoint, register, label
in
op_sito ~ff ~fd ~df ~r a
in
f
let build_siao =
(* single input array outputs operation *)
let op_siao ~ff ~fd ~df ~r a =
match a with
| DF (ap, at, ai) ->
let cp_arr = fd ap in
let ct_arr = df cp_arr ap at in
Array.map2 (fun cp ct -> DF (cp, ct, ai)) cp_arr ct_arr
| DR (ap, _, _, _, ai, _) ->
let cp_arr = fd ap in
let cp_arr_ref = Array.map (fun cp -> ref cp) cp_arr in
let tracker = ref 0 in
let ca_ref_arr = Array.map (fun cp -> ref (zero cp)) cp_arr in
Array.map2
(fun cp ca_ref ->
DR (cp, ca_ref, r (a, cp_arr_ref, ca_ref_arr), ref 0, ai, tracker))
cp_arr
ca_ref_arr
| ap -> ff ap
in
fun (module S : Siao) ->
let rec f a =
let open S in
let ff = function
| F a -> S.ff_f a
| Arr a -> S.ff_arr a
| _ -> error_uniop label a
in
let fd = f in
let r (a, cp_arr_ref, ca_arr_ref) =
let adjoint cp _ca_ref t = (S.dr (primal a) cp cp_arr_ref ca_arr_ref, a) :: t in
let register t = a :: t in
let label = S.label, [ a ] in
adjoint, register, label
in
op_siao ~ff ~fd ~df ~r a
in
f
let build_piso =
(* pair input single output operation *)
let op_piso ~ff ~fd ~df_da ~df_db ~df_dab ~r_d_d ~r_d_c ~r_c_d a b =
match a, b with
| F _ap, DF (bp, bt, bi) ->
let cp = fd a bp in
DF (cp, df_db cp a bp bt, bi)
| DF (ap, at, ai), F _bp ->
let cp = fd ap b in
| 1 ->
let cp = fd ap b in
DR (cp, ref (zero cp), r_d_c a b, ref 0, ai, ref 0)
| _ -> failwith "error: forward and reverse clash at the
same level")
| DF (ap, at, ai), DF (bp, bt, bi) ->
(match cmp_tag ai bi with
| 0 ->
let cp = fd ap bp in
DF (cp, df_dab cp ap at bp bt, ai)
| 1 ->
let cp = fd ap b in
DF (cp, df_da cp ap at b, ai)
| _ ->
let cp = fd a bp in
DF (cp, df_db cp a bp bt, bi))
| DR (ap, _, _, _, ai, _), DR (bp, _, _, _, bi, _) ->
(match cmp_tag ai bi with
| 0 ->
let cp = fd ap bp in
DR (cp, ref (zero cp), r_d_d a b, ref 0, ai, ref 0)
| 1 ->
let cp = fd ap b in
DR (cp, ref (zero cp), r_d_c a b, ref 0, ai, ref 0)
| _ ->
let cp = fd a bp in
DR (cp, ref (zero cp), r_c_d a b, ref 0, bi, ref 0))
| a, b -> ff a b
in
fun (module S : Piso) ->
let rec f a b =
let ff a b =
match a, b with
| F a, F b -> S.ff_aa a b
| F a, Arr b -> S.ff_ab a b
~r_d_c
~r_c_d
a
b
in
f
val dr : int list -> t array -> t -> t ref -> t list
end
let build_aiso =
let build_info =
Array.fold_left
(fun (i, t, m, idxs) x ->
match m, x with
| _, F _ | _, Arr _ -> succ i, t, m, idxs
| `normal, DR (_, _, _, _, t', _) -> succ i, t', `reverse, [ i ]
| `forward, DR (_, _, _, _, t', _) ->
if t' > t
then succ i, t', `reverse, [ i ]
else if t' = t
then failwith "error: forward and reverse clash on the same level"
else succ i, t, `forward, idxs
| `reverse, DR (_, _, _, _, t', _) ->
if t' > t
then succ i, t', `reverse, [ i ]
else if t' = t
then succ i, t', `reverse, i :: idxs
else succ i, t, m, idxs
| `normal, DF (_, _, t') -> succ i, t', `forward, [ i ]
let cp = f ap in
let at =
let at = a |> Array.map zero in
List.iter (fun k -> at.(k) <- tangent a.(k)) idxs;
S.df idxs cp ap at
in
DF (cp, at, max_t)
| `reverse ->
let ap =
Array.map
(fun x ->
match x with
| DR (p, _, _, _, t', _) ->
if max_t = t'
then p
else if t' > max_t
then failwith "no tags should be higher than max_t"
else x
| x -> x)
a
in
let cp = f ap in
let adjoint cp ca t =
(* use primal of inputs to calculate adjoint *)
let ar = S.dr idxs ap cp ca |> Array.of_list in
List.append List.(mapi (fun i k -> ar.(i), a.(k)) idxs) t
in
let register t = List.fold_left (fun t i -> a.(i) :: t) t idxs in
let label = S.label, List.(map (fun i -> a.(i)) idxs) in
DR (cp, ref (zero cp), (adjoint, register, label), ref 0, max_t, ref 0)
in
f
end
(** owl_algodiff_ops.ml *)
Printf.(
sprintf
"_squeeze_broadcast: there ought to have been a broadcasting error in the forward pass"))
in
let _, axis = fold (0, []) shp_x in
let idxs = Array.of_list axis in
sum_reduce ~axis:idxs x)
and _tan =
lazy
(build_siso
(module struct
let label = "tan"
and ( / ) a b = div a b
and _div =
lazy
(build_piso
(module struct
let label = "div"
and _set_slice =
lazy
(fun i ->
build_piso
(module struct
let label = "set_slice"
let ff_bb a b =
let a = A.copy a in
A.(set_slice i a b);
Arr a
(build_sipo
(module struct
let label = "qr"
let ff_arr a =
let q, r = A.(Linalg.qr a) in
Arr q, Arr r
if i = j
then float_to_elt 0.
else (
let s2_i = get_item s2 0 i |> unpack_flt in
let s2_j = get_item s2 0 j |> unpack_flt in
1. /. (s2_j -. s2_i) |> float_to_elt)))
in
let inv_s = pack_flt 1. / s in
if thin
then
(u * sbar *@ vt)
+ (((u *@ (f * ((ut *@ ubar) - (ubart *@ u))) * s)
+ ((e_m - (u *@ ut)) *@ ubar * inv_s))
*@ vt)
+ (u
*@ ((transpose s * (f * ((vt *@ vbar) - (vbart *@ v))) *@ vt)
+ (transpose inv_s * vbart *@ (e_n - (v *@ vt)))))
else raise (Owl_exception.NOT_IMPLEMENTED "owl_algodiff_ops.svd")
in
lazy
(fun ~thin ->
build_sito
(module struct
let label = "svd"
let ff_arr a =
let u, s, vt = A.(Linalg.svd ~thin a) in
Arr u, Arr s, Arr vt
let dr_ab a _b cp ca =
let abar, qbar = _lyapunov_backward_aq a !ca cp in
abar, qbar
and _care =
lazy
(let unpack a = a.(0), a.(1), a.(2), a.(3) in
let care_forward ~diag_r p a b r at bt qt rt =
let tr_b = transpose b in
let r = if diag_r then diag r else r in
let inv_r = if diag_r then pack_flt 1. / r else inv r in
let k = if diag_r then transpose inv_r * tr_b *@ p else inv_r *@ tr_b *@ p in
let acl = a - (b *@ k) in
let tr_acl = transpose acl in
let da () =
let pat = p *@ at in
neg (transpose pat) - pat
in
let dq () = neg qt in
let dr () = neg (transpose k *@ rt *@ k) in
let db () =
let x = p *@ bt *@ k in
x + transpose x
in
tr_acl, [| da; db; dq; dr |]
in
let care_backward ~diag_r a b _q r p pbar =
let tr_b = transpose b in
let inv_r = if diag_r then pack_flt 1. / diag r else inv r in
let k = if diag_r then transpose inv_r * tr_b *@ p else inv_r *@ tr_b *@ p in
let tr_k = transpose k in
let acl = a - (b *@ k) in
let s =
(* we can symmetrise without loss of generality as p is symmetric *)
let pbar = pack_flt 0.5 * (pbar + transpose pbar) in
let ff a =
match unpack a with
| Arr a, Arr b, Arr q, Arr r -> A.Linalg.care ~diag_r a b q r |> pack_arr
| _ -> error_uniop "care" a.(0)
(* NOTE: these functions are for neural networks. There are many
restrictions at the moment. E.g., they do not support higher-order
derivatives, and some do not support forward mode, so use them only when
you know what you are doing. *)
in
lazy
(fun ~padding a b s ->
build_piso
(module struct
let label = "conv2d"
let os = A.shape o in
let q = Owl_utils.llss2aarr p in
Array.iteri (fun i x -> x.(1) <- Stdlib.(os.(i) - 1 - x.(1))) q;
let q = Owl_utils.aarr2llss q in
A.(get_slice q o) |> pack_arr
in
lazy
(fun ~v p a ->
build_siso
(module struct
let label = "pad"
let shape x =
let s = A.shape (unpack_arr x) in
s.(0), s.(1)
let tag () =
_global_tag := !_global_tag + 1;
!_global_tag
let shape x =
match primal' x with
| F _ -> [||]
| Arr ap -> A.shape ap
| _ -> failwith "error: AD.shape"
let numel x =
match primal' x with
| Arr x -> A.numel x
| _ -> failwith "error: AD.numel"
let clip_by_l2norm a x =
match primal' x with
| Arr x -> Arr A.(clip_by_l2norm a x)
| _ -> failwith "error: AD.clip_by_l2norm"
let copy_primal' x =
match primal' x with
| Arr ap -> Arr A.(copy ap)
| _ -> failwith "error: AD.copy"
let pack_elt x = F x
let unpack_elt x =
match primal x with
| F x -> x
| _ -> failwith "error: AD.unpack_elt"
let _f x = F A.(float_to_elt x)
let unpack_flt x =
match primal x with
| F x -> A.elt_to_float x
| _ -> failwith "error: AD.unpack_flt"
let unpack_arr x =
match primal x with
| Arr x -> x
| _ -> failwith "error: AD.unpack_arr"
let deep_info x =
match primal' x with
| F a -> Printf.sprintf "F(%g)" A.(elt_to_float a)
| Arr a ->
Printf.sprintf "Arr(%s)" (A.shape a |> Owl_utils_array.to_string
string_of_int)
| _ -> "you should not have reached here!"
let type_info x =
match x with
| F _a -> Printf.sprintf "[%s]" (deep_info x)
| DF (ap, _at, ai) -> Printf.sprintf "[DF tag:%i ap:%s]" ai (deep_info ap)
| DR (ap, _at, _ao, _af, ai, _) ->
Printf.sprintf "[DR tag:%i ap:%s]" ai (deep_info ap)
| _ -> Printf.sprintf "[%s]" (deep_info x)
let error_binop op a b =
let s0 = "#0:" ^ type_info a in
let s1 = "#1:" ^ type_info b in
failwith (op ^ " : " ^ s0 ^ ", " ^ s1)
let error_uniop op a =
let s = type_info a in
failwith (op ^ " : " ^ s)
end
(** Owl_algodiff_generic.ml *)
let make_reverse p i =
let adjoint _cp _ca t = t in
let register t = t in
let label = "Noop", [] in
DR (p, ref (zero p), (adjoint, register, label), ref 0, i, ref 0)
(* _traverse_trace and its related functions are used to convert the
computation graph generated in backward mode into a human-readable format.
You can write your own convert function to generate whatever format you need. *)
let _traverse_trace x =
(* init variables for tracking nodes and indices *)
let nodes = Hashtbl.create 512 in
let index = ref 0 in
(* local function to traverse the nodes *)
let rec push tlist =
match tlist with
| [] -> ()
| hd :: tl ->
if Hashtbl.mem nodes hd = false
then (
""
v_prev)
nodes
""
APPENDIX E
Neural Network Module
open Owl_types
type node =
{ mutable name : string
; (* name of a node *)
mutable prev : node array
; (* parents of a node *)
mutable next : node array
; (* children of a node *)
mutable neuron : neuron
; (* neuron contained in a node *)
mutable output : t option
and network =
{ mutable nnid : string
; (* name of the graph network *)
mutable size : int
; (* size of the graph network *)
mutable roots : node array
; (* roots of the graph network, i.e. inputs *)
mutable outputs : node array
; (* outputs of the graph network *)
mutable topo : node array (* nodes sorted in topological order *)
}
let make_node ?name ?(train = false) prev next neuron output network =
let name =
match name with
| Some s -> s
| None -> Printf.sprintf "%s_%i" (to_name neuron) network.size
in
{ name; prev; next; neuron; output; network; train }
let get_roots nn =
match nn.roots with
| [||] -> failwith "Owl_neural_graph:get_roots"
| x -> x
let run x nn =
Array.iter
(fun n ->
(* collect the inputs from parents' output *)
let input =
match n.neuron with
| Input _ -> [| x |]
| _ -> collect_output n.prev
in
(* process the current neuron, save output *)
let output = run input n.neuron in
n.output <- Some output)
nn.topo;
(* collect the final output from the tail *)
let sink = [| nn.topo.(Array.length nn.topo - 1) |] in
(collect_output sink).(0)
let forward nn x =
mktag (tag ()) nn;
run x nn, mkpar nn
let forward_inputs nn x =
mktag (tag ()) nn;
run_inputs x nn, mkpar nn
let backward nn y =
reverse_prop (_f 1.) y;
mkpri nn, mkadj nn
let copy nn =
let nn' = make_network ~nnid:nn.nnid nn.size [||] [||] in
let _remove_training_nodes nn =
let topo' =
Owl_utils.Array.filter
(fun n ->
if n.train = true
then (
(* remove myself from my parents *)
Array.iter
(fun m ->
let next' = Owl_utils.Array.filter (fun x -> x.name <> n.name) m.next in
m.next <- next')
n.prev;
let model nn =
if Array.length nn.roots > 1
then failwith "Owl_neural_graph:model Did you mean to use model_
inputs?";
let nn = copy nn in
_remove_training_nodes nn;
let inference x =
match run (Arr x) nn with
| Arr y -> y
| _ -> failwith "Owl_neural_graph:model"
in
inference
let model_inputs nn =
let nn = copy nn in
_remove_training_nodes nn;
let inference inputs =
let outputs = run_inputs (Array.map (fun x -> Arr x) inputs) nn in
Array.map unpack_arr outputs
in
inference
let conv1d
?name
?(padding = SAME)
?(init_typ = Init.Tanh)
?act_typ
kernel
stride
input_node
=
let neuron = Conv1D (Conv1D.create padding kernel stride init_typ) in
let nn = get_network input_node in
let n = make_node ?name [||] [||] neuron None nn in
add_node ?act_typ nn [| input_node |] n
let conv2d
?name
?(padding = SAME)
?(init_typ = Init.Tanh)
?act_typ
kernel
stride
input_node
=
let neuron = Conv2D (Conv2D.create padding kernel stride init_typ) in
let nn = get_network input_node in
let n = make_node ?name [||] [||] neuron None nn in
add_node ?act_typ nn [| input_node |] n
let conv3d
?name
?(padding = SAME)
?(init_typ = Init.Tanh)
?act_typ
kernel
stride
input_node
=
let neuron = Conv3D (Conv3D.create padding kernel stride init_typ) in
let nn = get_network input_node in
let n = make_node ?name [||] [||] neuron None nn in
add_node ?act_typ nn [| input_node |] n
let dilated_conv1d
?name
?(padding = SAME)
?(init_typ = Init.Tanh)
?act_typ
kernel
stride
rate
input_node
=
let neuron =
DilatedConv1D (DilatedConv1D.create padding kernel stride rate init_typ)
in
let nn = get_network input_node in
let n = make_node ?name [||] [||] neuron None nn in
add_node ?act_typ nn [| input_node |] n
let dilated_conv2d
?name
?(padding = SAME)
?(init_typ = Init.Tanh)
?act_typ
kernel
stride
rate
input_node
=
let neuron =
DilatedConv2D (DilatedConv2D.create padding kernel stride rate init_typ)
in
let nn = get_network input_node in
let n = make_node ?name [||] [||] neuron None nn in
add_node ?act_typ nn [| input_node |] n
let dilated_conv3d
?name
?(padding = SAME)
?(init_typ = Init.Tanh)
?act_typ
kernel
stride
rate
input_node
=
let neuron =
DilatedConv3D (DilatedConv3D.create padding kernel stride rate init_typ)
in
let nn = get_network input_node in
let n = make_node ?name [||] [||] neuron None nn in
add_node ?act_typ nn [| input_node |] n
let transpose_conv1d
?name
?(padding = SAME)
?(init_typ = Init.Tanh)
?act_typ
kernel
stride
input_node
=
let neuron =
TransposeConv1D (TransposeConv1D.create padding kernel stride init_typ)
in
let nn = get_network input_node in
let n = make_node ?name [||] [||] neuron None nn in
add_node ?act_typ nn [| input_node |] n
let transpose_conv2d
?name
?(padding = SAME)
?(init_typ = Init.Tanh)
?act_typ
kernel
stride
input_node
=
let neuron =
TransposeConv2D (TransposeConv2D.create padding kernel stride init_typ)
in
let nn = get_network input_node in
let n = make_node ?name [||] [||] neuron None nn in
add_node ?act_typ nn [| input_node |] n
let transpose_conv3d
?name
?(padding = SAME)
?(init_typ = Init.Tanh)
?act_typ
kernel
stride
input_node
=
let neuron =
TransposeConv3D (TransposeConv3D.create padding kernel stride init_typ)
in
let nn = get_network input_node in
let n = make_node ?name [||] [||] neuron None nn in
add_node ?act_typ nn [| input_node |] n
let to_string nn =
let s = ref (nn.nnid ^ "\n\n") in
Array.iter
(fun n ->
let prev =
Array.map (fun n -> n.name) n.prev |> Owl_utils_array.to_string (fun s -> s)
in
let next =
let save_weights nn f =
let h = Hashtbl.create nn.size in
Array.iter
(fun n ->
let ws = Neuron.save_weights n.neuron in
Hashtbl.add h n.name ws)
nn.topo;
Owl_io.marshal_to_file h f
let load_weights nn f =
let h = Owl_io.marshal_from_file f in
Array.iter
(fun n ->
let ws = Hashtbl.find h n.name in
Neuron.load_weights n.neuron ws)
nn.topo
let new_nodes =
Array.fold_left
(fun acc name -> collect_subnn_nodes (get_node nn name) acc)
[]
output_names
in
(* sorts the new topology *)
let new_topo =
Array.fold_left
(fun acc n ->
match List.find_opt (fun n' -> n'.name = n.name) new_nodes with
| Some n' -> n' :: acc
| None -> acc)
[]
nn.topo
|> List.rev
|> Array.of_list
in
subnn.topo <- new_topo;
(* re-construct network structure *)
Array.iter
(fun node' ->
let node = get_node nn node'.name in
if not (List.memq node' !in_nodes)
then node'.prev <- Array.map (fun n -> get_node subnn n.name) node.prev;
if not (Array.mem node.name output_names)
then (
(* only process nodes that are part of the subnetwork *)
let next =
Owl_utils_array.filter
(fun n -> Array.exists (fun n' -> n'.name = n.name) subnn.topo)
node.next
in
(* With custom input nodes, next could contain an input node. *)
node'.next <- Array.map (fun n -> get_node subnn n.name) next);
connect_to_parents node'.prev node')
subnn.topo;
(* TODO: Warn if not all names in in_names were used? *)
subnn.roots <- Array.of_list !in_nodes;
subnn.outputs <- Array.map (fun name -> get_node subnn name) output_names;
subnn
type neuron_typ =
{ mutable activation : typ
; mutable in_shape : int array
; mutable out_shape : int array
}
let to_string l =
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
let act_str = activation_to_string l.activation in
Printf.sprintf " Activation : %s in/out:[*,%s]\n" act_str in_str ˆ ""
Linear Module
; act
; init_typ
; in_shape = [| t; i |]
; out_shape = [| o |]
}
let init l =
let i = l.in_shape.(1) in
let o = l.out_shape.(0) in
let h = l.hiddens in
l.whh <- Init.run l.init_typ [| h; h |] l.whh;
l.wxh <- Init.run l.init_typ [| i; h |] l.wxh;
l.why <- Init.run l.init_typ [| h; o |] l.why;
l.bh <- Mat.zeros 1 h;
l.by <- Mat.zeros 1 o
let reset l =
Mat.reset l.whh;
Mat.reset l.wxh;
Mat.reset l.why;
Mat.reset l.bh;
Mat.reset l.by
let mktag t l =
l.whh <- make_reverse l.whh t;
l.wxh <- make_reverse l.wxh t;
l.why <- make_reverse l.why t;
l.bh <- make_reverse l.bh t;
l.by <- make_reverse l.by t
let mkpri l = [| primal l.whh; primal l.wxh; primal l.why; primal l.bh; primal l.by |]
let mkadj l = [| adjval l.whh; adjval l.wxh; adjval l.why; adjval l.bh; adjval l.by |]
let update l u =
l.whh <- u.(0) |> primal';
l.wxh <- u.(1) |> primal';
l.why <- u.(2) |> primal';
l.bh <- u.(3) |> primal';
l.by <- u.(4) |> primal'
let copy l =
let l' = create l.hiddens l.out_shape.(0) l.act l.init_typ in
mkpri l |> Array.map copy_primal' |> update l';
l'
let run x l =
let s = shape x in
l.h <- Mat.zeros s.(0) l.hiddens;
let act x = Activation.run_activation x l.act in
for i = 0 to l.in_shape.(0) - 1 do
let t = Maths.get_slice [ []; [ i ]; [] ] x in
let t = Maths.reshape t [| s.(0); s.(2) |] in
(* recurrent logic, calculate the hidden state *)
l.h <- act Maths.((l.h *@ l.whh) + (t *@ l.wxh) + l.bh)
done;
Maths.((l.h *@ l.why) + l.by)
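For reference, the loop in run implements the plain recurrent cell. Writing the activation as \phi and taking h_0 = 0, each time step and the final output are

h_t = \phi(h_{t-1} W_{hh} + x_t W_{xh} + b_h), \quad t = 1, \dots, T
y = h_T W_{hy} + b_y

where W_{hh}, W_{xh}, W_{hy}, b_h, and b_y correspond to the fields whh, wxh, why, bh, and by above.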
let to_string l =
let t = l.in_shape.(0) in
let i = l.in_shape.(1) in
let o = l.out_shape.(0) in
let h = l.hiddens in
Printf.sprintf " Recurrent : matrix in:(*,%i,%i) out:(*,%i) \n" t i o
^ Printf.sprintf " init : %s\n" (Init.to_string l.init_typ)
^ Printf.sprintf " params : %i\n" ((h * h) + (i * h) + (h * o) + h + o)
^ Printf.sprintf " whh : %i x %i\n" h h
^ Printf.sprintf " wxh : %i x %i\n" i h
LSTM Module
let t =
match time_steps with
| Some i -> i
| None -> 0
in
{ wxi = Mat.empty 0 o
; whi = Mat.empty o o
; wxc = Mat.empty 0 o
; whc = Mat.empty o o
; wxf = Mat.empty 0 o
; whf = Mat.empty o o
; wxo = Mat.empty 0 o
; who = Mat.empty o o
; bi = Mat.empty 1 o
; bc = Mat.empty 1 o
; bf = Mat.empty 1 o
; bo = Mat.empty 1 o
; c = Mat.empty 0 o
; h = Mat.empty 0 o
; init_typ
; in_shape = [| t; i |]
; out_shape = [| o |]
}
let init l =
let i = l.in_shape.(1) in
let o = l.out_shape.(0) in
l.wxi <- Init.run l.init_typ [| i; o |] l.wxi;
l.whi <- Init.run l.init_typ [| o; o |] l.whi;
l.wxc <- Init.run l.init_typ [| i; o |] l.wxc;
l.whc <- Init.run l.init_typ [| o; o |] l.whc;
let reset l =
Mat.reset l.wxi;
Mat.reset l.whi;
Mat.reset l.wxc;
Mat.reset l.whc;
Mat.reset l.wxf;
Mat.reset l.whf;
Mat.reset l.wxo;
Mat.reset l.who;
Mat.reset l.bi;
Mat.reset l.bc;
Mat.reset l.bf;
Mat.reset l.bo
let mktag t l =
l.wxi <- make_reverse l.wxi t;
l.whi <- make_reverse l.whi t;
l.wxc <- make_reverse l.wxc t;
l.whc <- make_reverse l.whc t;
l.wxf <- make_reverse l.wxf t;
l.whf <- make_reverse l.whf t;
l.wxo <- make_reverse l.wxo t;
l.who <- make_reverse l.who t;
l.bi <- make_reverse l.bi t;
l.bc <- make_reverse l.bc t;
l.bf <- make_reverse l.bf t;
l.bo <- make_reverse l.bo t
let mkpar l =
[| l.wxi; l.whi; l.wxc; l.whc; l.wxf; l.whf; l.wxo; l.who; l.bi; l.bc; l.bf; l.bo |]
let mkpri l =
[| primal l.wxi
; primal l.whi
; primal l.wxc
; primal l.whc
; primal l.wxf
; primal l.whf
; primal l.wxo
; primal l.who
; primal l.bi
; primal l.bc
; primal l.bf
; primal l.bo
|]
let mkadj l =
[| adjval l.wxi
; adjval l.whi
; adjval l.wxc
; adjval l.whc
; adjval l.wxf
; adjval l.whf
; adjval l.wxo
; adjval l.who
; adjval l.bi
; adjval l.bc
; adjval l.bf
; adjval l.bo
|]
let update l u =
l.wxi <- u.(0) |> primal';
l.whi <- u.(1) |> primal';
let copy l =
let l' = create l.out_shape.(0) l.init_typ in
mkpri l |> Array.map copy_primal' |> update l';
l'
let run x l =
let s = shape x in
l.h <- Mat.zeros s.(0) l.out_shape.(0);
l.c <- Mat.zeros s.(0) l.out_shape.(0);
for i = 0 to l.in_shape.(0) - 1 do
let t = Maths.get_slice [ []; [ i ]; [] ] x in
let t = Maths.reshape t [| s.(0); s.(2) |] in
(* lstm logic, calculate the output *)
let i = Maths.((t *@ l.wxi) + (l.h *@ l.whi) + l.bi |> sigmoid) in
let c' = Maths.((t *@ l.wxc) + (l.h *@ l.whc) + l.bc |> tanh) in
let f = Maths.((t *@ l.wxf) + (l.h *@ l.whf) + l.bf |> sigmoid) in
l.c <- Maths.((i * c') + (f * l.c));
let o = Maths.((t *@ l.wxo) + (l.h *@ l.who) + l.bo |> sigmoid) in
l.h <- Maths.(o * tanh l.c)
done;
l.h
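For reference, the loop body in run is the standard LSTM cell, with \sigma the logistic sigmoid and \odot the elementwise product:

i_t = \sigma(x_t W_{xi} + h_{t-1} W_{hi} + b_i)
\tilde{c}_t = \tanh(x_t W_{xc} + h_{t-1} W_{hc} + b_c)
f_t = \sigma(x_t W_{xf} + h_{t-1} W_{hf} + b_f)
c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1}
o_t = \sigma(x_t W_{xo} + h_{t-1} W_{ho} + b_o)
h_t = o_t \odot \tanh(c_t)

with h_0 = c_0 = 0; the neuron returns the final hidden state h_T.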
let to_string l =
let t = l.in_shape.(0) in
let i = l.in_shape.(1) in
let o = l.out_shape.(0) in
Printf.sprintf " LSTM : in:(*,%i,%i) out:(*,%i) \n" i t o
ˆ Printf.sprintf " init : %s\n" (Init.to_string l.init_typ)
ˆ Printf.sprintf
" params : %i\n"
((i * o)
+ (o * o)
+ (i * o)
+ (o * o)
+ (i * o)
+ (o * o)
+ (i * o)
+ (o * o)
+ o
+ o
+ o
+ o)
ˆ Printf.sprintf " wxi : %i x %i\n" i o
ˆ Printf.sprintf " whi : %i x %i\n" o o
ˆ Printf.sprintf " wxc : %i x %i\n" i o
ˆ Printf.sprintf " whc : %i x %i\n" o o
ˆ Printf.sprintf " wxf : %i x %i\n" i o
ˆ Printf.sprintf " whf : %i x %i\n" o o
ˆ Printf.sprintf " wxo : %i x %i\n" i o
ˆ Printf.sprintf " who : %i x %i\n" o o
ˆ Printf.sprintf " bi : %i x %i\n" 1 o
ˆ Printf.sprintf " bc : %i x %i\n" 1 o
ˆ Printf.sprintf " bf : %i x %i\n" 1 o
ˆ Printf.sprintf " bo : %i x %i\n" 1 o
ˆ ""
Conv2D Module
let init l =
l.w <- Init.run l.init_typ l.kernel l.w;
l.b <- Arr.(zeros (shape l.b))
let reset l =
Arr.reset l.w;
Arr.reset l.b
let mktag t l =
l.w <- make_reverse l.w t;
l.b <- make_reverse l.b t
let update l u =
l.w <- u.(0) |> primal';
l.b <- u.(1) |> primal'
let copy l =
let l' = create l.padding l.kernel l.stride l.init_typ in
mkpri l |> Array.map copy_primal' |> update l';
l'
let to_string l =
let ws = Arr.shape l.w in
let bn = Arr.shape l.b in
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
let out_str = Owl_utils_array.to_string string_of_int l.out_shape in
Printf.sprintf " Conv2D : tensor in:[*;%s] out:[*,%s]\n" in_str out_str
ˆ Printf.sprintf " init : %s\n" (Init.to_string l.init_typ)
ˆ Printf.sprintf " params : %i\n" ((ws.(0) * ws.(1) * ws.(2) * ws.(3))
+ bn.(0))
ˆ Printf.sprintf " kernel : %i x %i x %i x %i\n" ws.(0) ws.(1)
ws.(2) ws.(3)
ˆ Printf.sprintf " b : %i\n" bn.(0)
ˆ Printf.sprintf " stride : [%i; %i]\n" l.stride.(0) l.stride.(1)
ˆ ""
DilatedConv2D Module
let init l =
l.w <- Init.run l.init_typ l.kernel l.w;
l.b <- Arr.(zeros (shape l.b))
let reset l =
Arr.reset l.w;
Arr.reset l.b
let mktag t l =
l.w <- make_reverse l.w t;
l.b <- make_reverse l.b t
let update l u =
l.w <- u.(0) |> primal';
l.b <- u.(1) |> primal'
let copy l =
let l' = create l.padding l.kernel l.stride l.rate l.init_typ in
mkpri l |> Array.map copy_primal' |> update l';
l'
let to_string l =
let ws = Arr.shape l.w in
let bn = Arr.shape l.b in
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
let out_str = Owl_utils_array.to_string string_of_int l.out_shape in
Printf.sprintf " DilateConv2D : tensor in:[*;%s] out:[*,%s]\n" in_
str out_str
TransposeConv2D Module
; b = Arr.empty [| o |]
; kernel
; stride
; padding
; init_typ
; in_shape
; out_shape = [| 0; 0; o |]
}
let init l =
l.w <- Init.run l.init_typ l.kernel l.w;
l.b <- Arr.(zeros (shape l.b))
let reset l =
Arr.reset l.w;
Arr.reset l.b
let mktag t l =
l.w <- make_reverse l.w t;
l.b <- make_reverse l.b t
let update l u =
l.w <- u.(0) |> primal';
l.b <- u.(1) |> primal'
let copy l =
let l' = create l.padding l.kernel l.stride l.init_typ in
mkpri l |> Array.map copy_primal' |> update l';
l'
let to_string l =
let ws = Arr.shape l.w in
let bn = Arr.shape l.b in
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
let out_str = Owl_utils_array.to_string string_of_int l.out_shape in
Printf.sprintf " TransposeConv2D : tensor in:[*;%s] out:[*,%s]\n" in_
str out_str
ˆ Printf.sprintf " init : %s\n" (Init.to_string l.init_typ)
ˆ Printf.sprintf " params : %i\n" ((ws.(0) * ws.(1) * ws.(2) * ws.(3))
+ bn.(0))
ˆ Printf.sprintf " kernel : %i x %i x %i x %i\n" ws.(0) ws.(1)
ws.(2) ws.(3)
ˆ Printf.sprintf " b : %i\n" bn.(0)
ˆ Printf.sprintf " stride : [%i; %i]\n" l.stride.(0) l.stride.(1)
ˆ ""
FullyConnected Module
let init l =
let m = Array.fold_left (fun a b -> a * b) 1 l.in_shape in
let n = l.out_shape.(0) in
l.w <- Init.run l.init_typ [| m; n |] l.w;
l.b <- Mat.zeros 1 n
let reset l =
Mat.reset l.w;
Mat.reset l.b
let mktag t l =
l.w <- make_reverse l.w t;
l.b <- make_reverse l.b t
let update l u =
l.w <- u.(0) |> primal';
l.b <- u.(1) |> primal'
let copy l =
let l' = create l.out_shape.(0) l.init_typ in
mkpri l |> Array.map copy_primal' |> update l';
l'
let run x l =
let m = Mat.row_num l.w in
let n = Arr.numel x / m in
let x = Maths.reshape x [| n; m |] in
let y = Maths.((x *@ l.w) + l.b) in
y
let to_string l =
let wm = Array.fold_left (fun a b -> a * b) 1 l.in_shape in
let wn = l.out_shape.(0) in
let bn = l.out_shape.(0) in
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
Printf.sprintf
" FullyConnected : tensor in:[*,%s] matrix out:(*,%i)\n"
in_str
l.out_shape.(0)
ˆ Printf.sprintf " init : %s\n" (Init.to_string l.init_typ)
ˆ Printf.sprintf " params : %i\n" ((wm * wn) + bn)
ˆ Printf.sprintf " w : %i x %i\n" wm wn
ˆ Printf.sprintf " b : %i x %i\n" 1 bn
ˆ ""
MaxPool2D Module
let to_string l =
let padding_s =
match l.padding with
| SAME -> "SAME"
| VALID -> "VALID"
in
Printf.sprintf
" MaxPool2D : tensor in:[*,%i,%i,%i] out:[*,%i,%i,%i]\n"
l.in_shape.(0)
l.in_shape.(1)
l.in_shape.(2)
l.out_shape.(0)
l.out_shape.(1)
l.out_shape.(2)
ˆ Printf.sprintf " padding : %s\n" padding_s
ˆ Printf.sprintf " kernel : [%i; %i]\n" l.kernel.(0) l.kernel.(1)
ˆ Printf.sprintf " stride : [%i; %i]\n" l.stride.(0) l.stride.(1)
ˆ ""
AvgPool2D Module
let to_string l =
let padding_s =
match l.padding with
| SAME -> "SAME"
| VALID -> "VALID"
in
Printf.sprintf
" AvgPool2D : tensor in:[*,%i,%i,%i] out:[*,%i,%i,%i]\n"
l.in_shape.(0)
l.in_shape.(1)
l.in_shape.(2)
l.out_shape.(0)
l.out_shape.(1)
l.out_shape.(2)
UpSampling2D Module
let to_string l =
Printf.sprintf
" UpSampling2D : tensor in:[*,%i,%i,%i] out:[*,%i,%i,%i]\n"
l.in_shape.(0)
l.in_shape.(1)
l.in_shape.(2)
l.out_shape.(0)
l.out_shape.(1)
l.out_shape.(2)
ˆ Printf.sprintf " size : [%i; %i]\n" l.size.(0) l.size.(1)
ˆ ""
Dropout Module
let run x l =
let a = _f (1. /. (1. -. l.rate)) in
let b = NN.(dropout ~rate:l.rate x) in
Maths.(a * b)
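The factor 1/(1 - rate) makes this the inverted form of dropout. Assuming NN.dropout zeroes each element independently with probability rate and leaves the remaining elements unscaled, the expected output equals the input:

\mathbb{E}[ (1/(1-p)) (m \odot x) ] = x, \qquad m_i \sim \mathrm{Bernoulli}(1-p), \quad p = \text{rate},

so the neuron can simply be skipped at inference time.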
let to_string l =
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
let out_str = Owl_utils_array.to_string string_of_int l.out_shape in
Printf.sprintf " Dropout : in:[*,%s] out:[*,%s]\n" in_str out_str
ˆ Printf.sprintf " rate : %g\n" l.rate
GaussianDropout Module
let run x l =
let s = shape x in
let sigma = Stdlib.sqrt (l.rate /. (1. -. l.rate)) in
let a =
match primal' x with
| Arr _ -> Arr.gaussian ~sigma:(A.float_to_elt sigma) s
| _ -> failwith "owl_neural_neuron:gaussiandropout:run"
in
Maths.(x * (a + _f 1.))
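Equivalently, run multiplies the input by Gaussian noise centered at one,

y = x \odot (1 + \varepsilon), \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2), \quad \sigma = \sqrt{\text{rate}/(1-\text{rate})},

which is the multiplicative Gaussian variant of dropout; the variance rate/(1-rate) matches that of standard dropout at the same rate, and since the noise has mean one no rescaling is needed at inference time.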
let to_string l =
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
let out_str = Owl_utils_array.to_string string_of_int l.out_shape in
Printf.sprintf " GaussianDropout : in:[*,%s] out:[*,%s]\n" in_
str out_str
ˆ Printf.sprintf " rate : %g\n" l.rate
AlphaDropout Module
let run x l =
(* parameters of affine transformation *)
let alpha = 1.6732632423543772848170429916717 in
let scale = 1.0507009873554804934193349852946 in
let p = -.alpha *. scale in
let a = ((1. -. l.rate) *. (1. +. (l.rate *. (p ** 2.)))) ** -0.5 in
let b = -.a *. p *. l.rate in
let s = shape x in
let mask =
match primal' x with
| Arr _ -> Arr A.(bernoulli ~p:(A.float_to_elt (1. -. l.rate)) s)
| _ -> failwith "owl_neural_neuron:alphadropout:run"
in
let p = _f p in
let a = _f a in
let b = _f b in
let x = Maths.((x * mask) + (p * (_f 1. - mask))) in
Maths.((a * x) + b)
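The two constants are those of the SELU activation (\alpha \approx 1.6733, \lambda \approx 1.0507). Writing q for rate and p = -\alpha\lambda for the value that dropped units are set to, the affine correction applied at the end is

a = \big((1 - q)(1 + q\,p^2)\big)^{-1/2}, \qquad b = -a\,p\,q,

chosen so that the mean and variance of zero-mean, unit-variance activations are preserved after dropping, following the alpha-dropout formulation used with self-normalizing networks.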
let to_string l =
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
let out_str = Owl_utils_array.to_string string_of_int l.out_shape in
Flatten Module
let to_string l =
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
Printf.sprintf " Flatten : in:[*,%s] out:[*,%i]\n" in_str l.out_
shape.(0)
Slice Module
let to_string l =
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
let out_str = Owl_utils_array.to_string string_of_int l.out_shape in
let slice_str =
List.mapi
(fun i l ->
let s = List.map string_of_int l |> String.concat "; " in
Printf.sprintf "%i:[%s]" i s)
l.slice
|> String.concat " "
in
Printf.sprintf " Slice : in:[*,%s] out:[*,%s]\n" in_str out_str
ˆ Printf.sprintf " Axes : %s\n" slice_str
Add Module
let run x _l =
let n = Array.length x in
(* at least two inputs *)
assert (n > 1);
let acc = ref x.(0) in
for i = 1 to n - 1 do
acc := Maths.(!acc + x.(i))
done;
!acc
let to_string l =
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
let out_str = Owl_utils_array.to_string string_of_int l.out_shape in
Printf.sprintf " Add : in:[*,%s] out:[*,%s]\n" in_str out_str
Mul Module
let run x _l =
let n = Array.length x in
(* at least two inputs *)
assert (n > 1);
let acc = ref x.(0) in
for i = 1 to n - 1 do
acc := Maths.(!acc * x.(i))
done;
!acc
let to_string l =
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
let out_str = Owl_utils_array.to_string string_of_int l.out_shape in
Printf.sprintf " Multiply : in:[*,%s] out:[*,%s]\n" in_str out_str
Dot Module
let run x _l =
assert (Array.length x = 2);
Maths.(x.(0) *@ x.(1))
let to_string l =
let m = l.in_shape.(0) in
let n = l.in_shape.(1) in
Printf.sprintf " Dot : in:[*,%i] [%i,%i] out:[*,%i]\n" m m n n
Max Module
let run x _l =
let n = Array.length x in
(* at least two inputs *)
assert (n > 1);
let acc = ref x.(0) in
for i = 1 to n - 1 do
acc := Maths.(max2 !acc x.(i))
done;
!acc
let to_string l =
let in_str = Owl_utils_array.to_string string_of_int l.in_shape in
let out_str = Owl_utils_array.to_string string_of_int l.out_shape in
Printf.sprintf " Max : in:[*,%s] out:[*,%s]\n" in_str out_str
Concatenate Module
let run x l =
let n = Array.length x in
(* at least two inputs *)
assert (n > 1);
let acc = ref x.(0) in
for i = 1 to n - 1 do
acc := Maths.(concat ~axis:l.axis !acc x.(i))
done;
!acc
let to_string l =
let in_str =
Owl_utils_array.to_string
(fun i -> if i = -1 then "*" else string_of_int i)
l.in_shape
in
let out_str = Owl_utils_array.to_string string_of_int l.out_shape in
Printf.sprintf " Concatenate : in:[*,%s] out:[*,%s]\n" in_str out_str
ˆ Printf.sprintf " axis : %i\n" l.axis
ˆ ""
Embedding Module
let init l =
let m = l.in_dim in
let n = l.out_shape.(1) in
l.w <- Init.run l.init_typ [| m; n |] l.w
let copy l =
let l' = create l.in_dim l.out_shape.(1) l.init_typ in
mkpri l |> Array.map copy_primal' |> update l';
l'
let run x l =
let x = primal' x |> unpack_arr in
let s = A.shape x in
let to_string l =
let wm, wn = l.in_dim, l.out_shape.(1) in
Printf.sprintf
" Embedding : matrix in:(*,%i) out:(*,%i,%i) \n"
l.in_shape.(0)
l.out_shape.(0)
l.out_shape.(1)
ˆ Printf.sprintf " init : %s\n" (Init.to_string l.init_typ)
ˆ Printf.sprintf " in_dim : %i\n" l.in_dim
ˆ Printf.sprintf " params : %i\n" (wm * wn)
ˆ Printf.sprintf " w : %i x %i\n" wm wn
ˆ ""
APPENDIX F
Actor System for Distributed Computing
(*
* Actor - Parallel & Distributed Engine of Owl System
* Copyright (c) 2016-2018 Liang Wang <[email protected]>
*)
val map_partition : ('a list -> 'b list) -> string -> string
val flatmap : ('a -> 'b list) -> string -> string
val reduce : ('a -> 'a -> 'a) -> string -> 'a option
val reduce_by_key : ('a -> 'a -> 'a) -> string -> string
val fold : ('a -> 'b -> 'a) -> 'a -> string -> 'a
val apply : ('a list -> 'b list) -> string list -> string list -> string list
Server
(*
* Actor - Parallel & Distributed Engine of Owl System
* Copyright (c) 2016-2018 Liang Wang <[email protected]>
*)
open Actor_types
let _broadcast_all t s =
let bar = Random.int 536870912 in
StrMap.iter (fun _k v -> Actor_utils.send ~bar v t s) !_context.workers;
bar
let run_job_eager () =
List.iter (fun s ->
let s' = List.map (fun x -> Actor_dag.get_vlabel_f x) s in
let bar = _broadcast_all Pipeline (Array.of_list s') in
let _ = barrier bar in
Actor_dag.mark_stage_done s;
) (Actor_dag.stages_eager ())
let run_job_lazy x =
List.iter (fun s ->
let s' = List.map (fun x -> Actor_dag.get_vlabel_f x) s in
let bar = _broadcast_all Pipeline (Array.of_list s') in
let _ = barrier bar in
Actor_dag.mark_stage_done s;
) (Actor_dag.stages_lazy x)
let collect x =
Owl_log.info "%s" ("collect " ˆ x ˆ "\n");
run_job_lazy x;
let bar = _broadcast_all Collect [|x|] in
barrier bar
|> List.map (fun m -> Marshal.from_string m.par.(0) 0)
let count x =
Owl_log.info "%s" ("count " ˆ x ˆ "\n");
run_job_lazy x;
let bar = _broadcast_all Count [|x|] in
barrier bar
|> List.map (fun m -> Marshal.from_string m.par.(0) 0)
|> List.fold_left (+) 0
let fold f a x =
Owl_log.info "%s" ("fold " ˆ x ˆ "\n");
run_job_lazy x;
let g = Marshal.to_string f [ Marshal.Closures ] in
let bar = _broadcast_all Fold [|g; x|] in
barrier bar
let reduce f x =
Owl_log.info "%s" ("reduce " ˆ x ˆ "\n");
run_job_lazy x;
let g = Marshal.to_string f [ Marshal.Closures ] in
let bar = _broadcast_all Reduce [|g; x|] in
let y = barrier bar
|> List.map (fun m -> Marshal.from_string m.par.(0) 0)
|> List.filter (function Some _x -> true | None -> false)
|> List.map (function Some x -> x | None -> failwith "") in
match y with
| hd :: tl -> Some (List.fold_left f hd tl)
| [] -> None
let terminate () =
Owl_log.info "%s" ("terminate #" ˆ !_context.job_id ˆ "\n");
let bar = _broadcast_all Terminate [||] in
let _ = barrier bar in ()
let broadcast x =
Owl_log.info "%s" ("broadcast -> " ˆ string_of_int (StrMap.cardinal !_
context.workers) ˆ
let y = Actor_memory.rand_id () in
let bar = _broadcast_all Broadcast [|Marshal.to_string x []; y|] in
let _ = barrier bar in y
let map f x =
let y = Actor_memory.rand_id () in
Owl_log.info "%s" ("map " ˆ x ˆ " -> " ˆ y ˆ "\n");
let g = Marshal.to_string f [ Marshal.Closures ] in
Actor_dag.add_edge (to_msg 0 MapTask [|g; x; y|]) x y Red; y
let map_partition f x =
let y = Actor_memory.rand_id () in
Owl_log.info "%s" ("map_partition " ˆ x ˆ " -> " ˆ y ˆ "\n");
let g = Marshal.to_string f [ Marshal.Closures ] in
Actor_dag.add_edge (to_msg 0 MapPartTask [|g; x; y|]) x y Red; y
let filter f x =
let y = Actor_memory.rand_id () in
Owl_log.info "%s" ("filter " ˆ x ˆ " -> " ˆ y ˆ "\n");
let g = Marshal.to_string f [ Marshal.Closures ] in
Actor_dag.add_edge (to_msg 0 FilterTask [|g; x; y|]) x y Red; y
let flatten x =
let y = Actor_memory.rand_id () in
Owl_log.info "%s" ("flatten " ˆ x ˆ " -> " ˆ y ˆ "\n");
Actor_dag.add_edge (to_msg 0 FlattenTask [|x; y|]) x y Red; y
let shuffle x =
let y = Actor_memory.rand_id () in
Owl_log.info "%s" ("shuffle " ˆ x ˆ " -> " ˆ y ˆ "\n");
let z = Marshal.to_string (StrMap.keys !_context.workers) [] in
let b = Marshal.to_string (Random.int 536870912) [] in
Actor_dag.add_edge (to_msg 0 ShuffleTask [|x; y; z; b|]) x y Blue; y
let reduce_by_key f x =
(* TODO: without local combiner ... keep or not? *)
let x = shuffle x in
let y = Actor_memory.rand_id () in
Owl_log.info "%s" ("reduce_by_key " ˆ x ˆ " -> " ˆ y ˆ "\n");
let g = Marshal.to_string f [ Marshal.Closures ] in
Actor_dag.add_edge (to_msg 0 ReduceByKeyTask [|g; x; y|]) x y Red; y
let join x y =
let z = Actor_memory.rand_id () in
Owl_log.info "%s" ("join " ˆ x ˆ " & " ˆ y ˆ " -> " ˆ z ˆ "\n");
let x, y = shuffle x, shuffle y in
Actor_dag.add_edge (to_msg 0 JoinTask [|x; y; z|]) x z Red;
Actor_dag.add_edge (to_msg 0 JoinTask [|x; y; z|]) y z Red; z
let apply f i o =
Owl_log.info "%s" ("apply f ... " ˆ "\n");
let g = Marshal.to_string f [ Marshal.Closures ] in
let o = List.map (fun _ -> Actor_memory.rand_id ()) o in
let x = Marshal.to_string i [ ] in
let y = Marshal.to_string o [ ] in
let z = Actor_memory.rand_id () in
List.iter (fun m -> Actor_dag.add_edge (to_msg 0 ApplyTask [|g; x; z; y|]) m z Red) i;
List.iter (fun n -> Actor_dag.add_edge (to_msg 0 NopTask [|z; y|]) z n Red) o; o
let load x =
Owl_log.info "%s" ("load " ˆ x ˆ "\n");
let y = Actor_memory.rand_id () in
let bar = _broadcast_all Load [|x; y|] in
let _ = barrier bar in y
let save x y =
Owl_log.info "%s" ("save " ˆ x ˆ "\n");
let bar = _broadcast_all Save [|x; y|] in
barrier bar
|> List.map (fun m -> Marshal.from_string m.par.(0) 0)
|> List.fold_left (+) 0
Client
(*
* Actor - Parallel & Distributed Engine of Owl System
* Copyright (c) 2016-2018 Liang Wang <[email protected]>
*)
open Actor_types
connect s k) in
let _ = !_context.workers <- StrMap.add k s !_context.workers in
let _ = ZMQ.Socket.set_send_high_water_mark s Actor_config.high_warter_mark in
s ) in
Actor_utils.send ~bar s OK [|Marshal.to_string v []|]
) z
let process_pipeline s =
Array.iter (fun s ->
let m = of_msg s in
match m.typ with
| MapTask -> (
Owl_log.info "%s" ("map @ " ˆ !_context.myself_addr);
let f : 'a -> 'b = Marshal.from_string m.par.(0) 0 in
List.map f (Actor_memory.find m.par.(1)) |> Actor_memory.add
m.par.(2)
)
| MapPartTask -> (
Owl_log.info "%s" ("map_partition @ " ˆ !_context.myself_addr);
let f : 'a list -> 'b list = Marshal.from_string m.par.(0) 0 in
f (Actor_memory.find m.par.(1)) |> Actor_memory.add m.par.(2)
)
| FilterTask -> (
Owl_log.info "%s" ("filter @ " ˆ !_context.myself_addr);
let f : 'a -> bool = Marshal.from_string m.par.(0) 0 in
List.filter f (Actor_memory.find m.par.(1)) |> Actor_memory.add
m.par.(2)
)
| FlattenTask -> (
Owl_log.info "%s" ("flatten @ " ˆ !_context.myself_addr);
List.flatten (Actor_memory.find m.par.(0)) |> Actor_memory.add
m.par.(1)
)
let service_loop () =
Owl_log.debug "mapre worker @ %s" !_context.myself_addr;
(* set up local loop of a job worker *)
try while true do
let i, m = Actor_utils.recv !_context.myself_sock in
let bar = m.bar in
match m.typ with
| Count -> (
Owl_log.info "%s" ("count @ " ˆ !_context.myself_addr);
let y = List.length (Actor_memory.find m.par.(0)) in
Actor_utils.send ˜bar !_context.master_sock OK [|Marshal.to_
string y []|]
)
| Collect -> (
Owl_log.info "%s" ("collect @ " ˆ !_context.myself_addr);
let y = Actor_memory.find m.par.(0) in
Actor_utils.send ˜bar !_context.master_sock OK [|Marshal.to_
string y []|]
)
| Broadcast -> (
Owl_log.info "%s" ("broadcast @ " ˆ !_context.myself_addr);
Actor_memory.add m.par.(1) (Marshal.from_string m.par.(0) 0);
Actor_utils.send ˜bar !_context.master_sock OK [||]
)
| Reduce -> (
Owl_log.info "%s" ("reduce @ " ˆ !_context.myself_addr);
let f : 'a -> 'a -> 'a = Marshal.from_string m.par.(0) 0 in
let y =
match Actor_memory.find m.par.(1) with
| hd :: tl -> Some (List.fold_left f hd tl)
| [] -> None
in
Actor_utils.send ~bar !_context.master_sock OK [|Marshal.to_string y []|];
)
(*
* Actor - Parallel & Distributed Engine of Owl System
* Copyright (c) 2016-2018 Liang Wang <[email protected]>
*)
open Actor_types
type barrier =
| ASP (* Asynchronous Parallel *)
| BSP (* Bulk Synchronous Parallel *)
| SSP (* Stale Synchronous Parallel *)
| PSP (* Probabilistic Synchronous Parallel *)
Server
(*
* Actor - Parallel & Distributed Engine of Owl System
* Copyright (c) 2016-2018 Liang Wang <[email protected]>
*)
open Actor_types
let update_steps t w =
let t' = Hashtbl.find !_context.worker_step w in
match t > t' with
| true -> (
Hashtbl.replace !_context.worker_busy w 0;
Hashtbl.replace !_context.worker_step w t;
Hashtbl.add !_context.step_worker t w )
| false -> ()
let _get k =
let k' = Obj.repr k in
let v, t = Hashtbl.find _param k' in
Obj.obj v, t
let _set k v t =
let k' = Obj.repr k in
let v' = Obj.repr v in
match Hashtbl.mem _param k' with
| true -> Hashtbl.replace _param k' (v',t)
| false -> Hashtbl.add _param k' (v',t)
let _broadcast_all t s =
StrMap.iter (fun _k v -> Actor_utils.send ~bar:!_context.step v t s) !_context.workers;
!_context.step
let terminate () =
let _ = _broadcast_all Terminate [||] in
Unix.sleep 1 (** FIXME: change to BSP *)
let service_loop () =
Owl_log.debug "parameter server @ %s" !_context.myself_addr;
(* unmarshal the schedule and pull functions *)
Client
(*
* Actor - Parallel & Distributed Engine of Owl System
* Copyright (c) 2016-2018 Liang Wang <[email protected]>
*)
open Actor_types
let _get k =
let k' = Marshal.to_string k [] in
Actor_utils.send ~bar:!_context.step !_context.master_sock PS_Get [|k'|];
let m = of_msg (ZMQ.Socket.recv ~block:true !_context.master_sock) in
Marshal.from_string m.par.(0) 0, m.bar
let _set k v t =
let k' = Marshal.to_string k [] in
let v' = Marshal.to_string v [] in
Actor_utils.send ~bar:t !_context.master_sock PS_Set [|k'; v'|]
let update_param x t =
(* update multiple kvs, more efficient than set *)
let x' = Marshal.to_string x [] in
Actor_utils.send ~bar:t !_context.master_sock PS_Push [|x'|]
let service_loop () =
Owl_log.debug "parameter worker @ %s" !_context.myself_addr;
(* unmarshal the push function *)
let push : 'a -> ('b * 'c) list -> ('b * 'c) list = Marshal.from_string !_push 0 in
(* loop to process messages *)
try while true do
let _i, m = Actor_utils.recv !_context.myself_sock in
let t = m.bar in
match m.typ with
| PS_Schedule -> (
Owl_log.debug "%s: ps_schedule" !_context.myself_addr;
!_context.step <- (if t > !_context.step then t else !_context.step);
let vars = Marshal.from_string m.par.(0) 0 in
let updates = push !_context.myself_addr vars in
update_param updates t
)
| Terminate -> (
Owl_log.debug "%s: terminate"!_context.myself_addr;
Actor_utils.send ˜bar:t !_context.master_sock OK [||];
Unix.sleep 1; (* FIXME: sleep ... *)
failwith ("#" ˆ !_context.job_id ˆ " terminated")
)
| _ -> ( Owl_log.debug "unknown message to PS" )
done with Failure e -> (
Owl_log.warn "%s" e;
ZMQ.Socket.close !_context.myself_sock;
Pervasives.exit 0 )
(*
* Actor - Parallel & Distributed Engine of Owl System
* Copyright (c) 2016-2018 Liang Wang <[email protected]>
*)
(* Peer-to-Peer Parallel *)
open Actor_types
Server
(*
* Actor - Parallel & Distributed Engine of Owl System
* Copyright (c) 2016-2018 Liang Wang <[email protected]>
*)
open Actor_types
let furthest x =
let d = ref min_int in
let n = ref "" in
List.iteri (fun _i y ->
let d' = distance (hash y) x in
if d' > !d then ( d := d'; n := y )
) (StrMap.keys !_context.workers @ [!_context.myself_addr]);
!n
let furthest_exclude x l =
let addrs = StrMap.keys !_context.workers @ [!_context.myself_addr]
|> List.filter (fun x -> not (List.mem x l))
in
let d = ref min_int in
let n = ref "" in
let nearest x =
let d = ref max_int in
let n = ref "" in
List.iteri (fun _i y ->
let d' = distance (hash y) x in
if d' < !d then ( d := d'; n := y )
) (StrMap.keys !_context.workers @ [!_context.myself_addr]);
!n
let nearest_exclude x l =
let addrs = StrMap.keys !_context.workers @ [!_context.myself_addr]
|> List.filter (fun x -> not (List.mem x l))
in
let d = ref max_int in
let n = ref "" in
List.iteri (fun _i y ->
let d' = distance (hash y) x in
if d' < !d then ( d := d'; n := y )
) addrs;
!n
end
let _get k =
let k' = Obj.repr k in
let v, t = Hashtbl.find _param k' in
Obj.obj v, t
let _set k v t =
let k' = Obj.repr k in
let v' = Obj.repr v in
match Hashtbl.mem _param k' with
| true -> Hashtbl.replace _param k' (v',t)
| false -> Hashtbl.add _param k' (v',t)
let _allocate_params x y =
let x = Route.hash x in
let y = Route.hash y in
let l = ref [] in
Hashtbl.iter (fun k v ->
let h = Obj.obj k |> Route.hash in
if (Route.distance y h) < (Route.distance x h) then l := !l @ [(k,v)]
) _param; !l
let _shall_deliver_pull () =
let ready = ref true in
Hashtbl.iter (fun _k v ->
let _notify_peers_step () =
List.iter (fun k ->
Route.forward k P2P_Ping [|!_context.myself_addr|]
) (StrMap.keys !_context.workers)
let _process_timeout () =
_notify_peers_step ();
Owl_log.debug "%s: timeout" !_context.myself_addr
let service_loop () =
Owl_log.debug "%s: p2p server" !_context.myself_addr;
let barrier : p2p_barrier_typ = Marshal.from_string !_barrier 0 in
let pull : ('a, 'b) p2p_pull_typ = Marshal.from_string !_pull 0 in
(* loop to process messages *)
ZMQ.Socket.set_receive_timeout !_context.myself_sock (1 * 1000);
try while true do
(* first, wait and process arriving message *)
try let i, m = Actor_utils.recv !_context.myself_sock in (
match m.typ with
| P2P_Connect -> (
Owl_log.debug "%s: p2p_connect %s" !_context.myself_addr m.par.(0);
let addr = m.par.(0) in
!_context.master_addr <- addr;
!_context.master_sock <- Route.connect addr
)
| P2P_Ping -> (
Owl_log.debug "%s: p2p_ping %s" !_context.myself_addr m.par.(0);
let addr = m.par.(0) in
if Route.exists addr = false then Route.(connect addr |> add addr)
)
| P2P_Join -> (
Owl_log.debug "%s: p2p_join %s" !_context.myself_addr m.par.(0);
let src = m.par.(0) in
let dst = Marshal.from_string m.par.(1) 0 in
let next = Route.nearest_exclude dst [src] in
if next = !_context.myself_addr then (
if Route.exists src = false then (
let s = Route.connect src in
let _ = Route.add src s in
Actor_utils.send s P2P_Ping [|!_context.myself_addr|]
);
Client
(*
* Actor - Parallel & Distributed Engine of Owl System
* Copyright (c) 2016-2018 Liang Wang <[email protected]>
*)
open Actor_types
let _get k =
let k = Marshal.to_string k [] in
let s = [|k; !_context.master_addr|] in
Actor_utils.send !_context.master_sock P2P_Get s;
let _, m = Actor_utils.recv !_context.myself_sock in
let _k, v, t = Marshal.from_string m.par.(0) 0 in
v, t
let _set k v =
let s = Marshal.to_string (k, v, -1) [] in
Actor_utils.send !_context.master_sock P2P_Set [|s|]
let _barrier () =
Actor_utils.send !_context.master_sock P2P_Bar [||];
let _, m = Actor_utils.recv !_context.myself_sock in
!_context.step <- m.bar
let service_loop () =
Owl_log.debug "p2p_client @ %s" !_context.master_addr;
(* unmarshal the schedule and push function *)
let schedule : 'a p2p_schedule_typ = Marshal.from_string !_schedule 0 in
Bibliography
[1]. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen,
Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat,
Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for
large-scale machine learning. In 12th USENIX Symposium on
Operating Systems Design and Implementation (OSDI 16), pages
265–283, 2016.
[2]. Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan
Narayananmuthy, and Alexander Smola. Scalable inference in
latent variable models. WSDM, pages 123–132, 2012.
[3]. Istemi Ekin Akkus, Ruichuan Chen, Ivica Rimac, Manuel Stein,
Klaus Satzke, Andre Beck, Paarijaat Aditya, and Volker Hilt.
Sand: Towards high-performance serverless computing. In 2018
USENIX Annual Technical Conference (USENIX ATC’18), pages
923–935, 2018.
[4]. Tal Ben-Nun and Torsten Hoefler. Demystifying parallel and
distributed deep learning: An in-depth concurrency analysis.
ACM Computing Surveys (CSUR), 52(4):1–43, 2019.
[5]. Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan
Pascanu. Advances in optimizing recurrent networks. In 2013
IEEE international conference on acoustics, speech and signal
processing, pages 8624–8628. IEEE, 2013.
[6]. Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry
Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub
Konecny, Stefano Mazzocchi, H Brendan McMahan, and Others.
Towards federated learning at scale: System design. arXiv preprint
arXiv:1902.01046, 2019.
[8]. Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li,
Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet
Talwalkar. LEAF: A Benchmark for Federated Settings. NeurIPS,
pages 1–9, 2018.
[9]. Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, Google
Brain, Mountain View, Rafal Jozefowicz, and San Francisco.
Revising distributed synchronous SGD. ICLR’17, pages 1–10, 2017.
[12]. James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R
Ganger, Garth Gibson, Kimberly Keeton, and Eric Xing. Solving
the straggler problem with bounded staleness. In Presented as part
of the 14th Workshop on Hot Topics in Operating Systems, 2013.
[14]. Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak
Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory R Ganger,
Phillip B Gibbons, et al. Exploiting bounded staleness to speed up
big data analytics. In 2014 USENIX Annual Technical Conference
(USENIX ATC’14), pages 37–48, 2014.
[15]. Wei Dai, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth
Gibson, and Eric P Xing. High-performance distributed ml at scale
through parameter server consistency models. In Twenty-Ninth
AAAI Conference on Artificial Intelligence, 2015.
[19]. Moming Duan, Duo Liu, Xianzhang Chen, Renping Liu, Yujuan
Tan, and Liang Liang. Self-Balancing Federated Learning with
Global Imbalanced Data in Mobile Systems. IEEE Transactions on
Parallel and Distributed Systems, 32(1):59–71, 2021.
[20]. John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient
methods for online learning and stochastic optimization. Journal
of machine learning research, 12(7), 2011.
[23]. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik.
Rich feature hierarchies for accurate object detection and
semantic segmentation. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 580–587, 2014.
[27]. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick.
Mask R-CNN. In Proceedings of the IEEE international conference
on computer vision, pages 2961–2969, 2017.
[28]. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages
770–778, 2016.
[30]. Qirong Ho, James Cipar, Henggang Cui, Jin Kyu Kim, Seunghak
Lee, Phillip B. Gibbons, Garth A. Gibson, Gregory R. Ganger, and
Eric P. Xing. More effective distributed ML via a stale synchronous
parallel parameter server. In Advances in Neural Information
Processing Systems, 2013.
[33]. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.
[34]. Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr
Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor
Yiing Su. Scaling distributed machine learning with the parameter
server. In Proceedings of the 11th USENIX Symposium on
Operating Systems Design and Implementation, OSDI 2014, 2014.
[35]. Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous
parallel stochastic gradient for nonconvex optimization. In
Advances in Neural Information Processing Systems, 2015.
[36]. Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang,
and Ji Liu. Can decentralized algorithms outperform centralized
algorithms? A case study for decentralized parallel stochastic
gradient descent. arXiv preprint arXiv:1705.09056, 2017.
[38]. Yaron Minsky, Anil Madhavapeddy, and Jason Hickey. Real World
OCaml: Functional programming for the masses. O’Reilly Media,
Inc., 2013.
[41]. Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright.
Hogwild!: A lock-free approach to parallelizing stochastic gradient
descent. NIPS’11 Proceedings of the 24th International Conference
on Neural Information Processing Systems, pages 693–701, 2011.
[42]. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James
Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia
Gimelshein, Luca Antiga, et al. PyTorch: An imperative style,
high-performance deep learning library. Advances in neural
information processing systems, 32:8026–8037, 2019.
[43]. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster
R-CNN: Towards real-time object detection with region proposal
networks. Advances in neural information processing systems,
28:91–99, 2015.
[44]. Alexander Sergeev and Mike Del Balso. Horovod: fast and easy
distributed deep learning in TensorFlow. arXiv, 2017.
[48]. Benoit Steiner, Chris Cummins, Horace He, and Hugh Leather.
Value learning for throughput optimization of deep learning
workloads. In Proceedings of Machine Learning and Systems,
volume 3, pages 323–334, 2021.
[50]. Philip Wadler. Linear types can change the world! In Programming
concepts and methods, volume 3, page 5. Citeseer, 1990.
[53]. Liang Wang, Sotiris Tasoulis, Teemu Roos, and Jussi Kangasharju.
Kvasir: Scalable provision of semantically relevant web content on big
data framework. IEEE Transactions on Big Data, 2(3):219–233, 2016.
[54]. Eric P Xing, Qirong Ho, Pengtao Xie, and Dai Wei. Strategies
and principles of distributed machine learning on big data.
Engineering, 2(2):179–195, 2016.
[58]. Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. Staleness-
aware async-SGD for distributed deep learning. arXiv preprint
arXiv:1511.05950, 2015.
[59]. Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu,
Zhi-Ming Ma, and Tie-Yan Liu. Asynchronous stochastic gradient
descent with delay compensation. In Proceedings of the 34th
International Conference on Machine Learning-Volume 70, pages
4120–4129. JMLR. org, 2017.