
“An optimist sees a glass half full. A pessimist sees a glass half empty.

An engineer sees a glass that is twice as big as it needs to be.”


— Unknown

“Death and taxes are unsolved engineering problems.”


— Unknown



7 Model Deployment
Once the model is built and thoroughly tested, it can be deployed. Deploying a model means making it available to accept queries generated by the users of the production system.
Model deployment is the fourth stage of machine learning engineering in the machine learning
project life cycle:

Figure 1: Machine learning project life cycle: business problem and goal definition, data collection and preparation, feature engineering, model building, model evaluation, model deployment, model serving, model monitoring, and model maintenance.

A model can be deployed in various ways. It can be deployed on a server or on the user’s
device; it can be deployed for all users at once or to a small fraction of users. Below we
consider all the options.

7.1 Deployment Patterns

A model can be deployed following several patterns:
• statically, as a part of an installable software package,
• dynamically, on the user’s device,
• dynamically, on a server.

7.2 Static Deployment

The static deployment of a machine learning model is very similar to traditional software deployment: you prepare an installable binary of the entire software, in which the model is packaged as a resource available at runtime. Depending on the operating system and the runtime environment, the objects of both the model and the feature extractor can be packaged as part of a dynamic-link library (a DLL on Windows) or a shared object (an *.so file on Linux), or be serialized and saved in the standard resource location for virtual machine-based runtimes such as Java and .NET.
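For illustration, here is a minimal sketch, not taken from the book’s code, of how a Python application could load a model packaged as a resource inside its installable package; the package name myapp and the resource path are hypothetical assumptions:

import pickle
import pkgutil

def load_packaged_model():
    # pkgutil.get_data reads a resource bundled with the installed package
    data = pkgutil.get_data("myapp", "resources/model.pickle")  # hypothetical package/resource names
    return pickle.loads(data)

model = load_packaged_model()
# prediction = model.predict(features)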
Static deployment has many advantages: the software system has direct access to the model, so the execution time is usually fast for the user; the model can be called when the user is offline; and the software vendor doesn’t have to care about keeping the model operational: it becomes the user’s responsibility.
One obvious drawback of this deployment pattern is the difficulty of upgrading the model without upgrading the entire software application. Another drawback is the difficulty (or even impossibility) of monitoring the performance of the model.

7.3 Dynamic Deployment on User’s Device

Dynamic deployment on the user’s device is similar to static deployment in the sense that the user runs a part of your system as a software application on their device. The difference from static deployment is that the model is not a part of the binary code of the application. This allows pushing model updates without updating the application running on the user’s device. However, once deployed, the model still runs on the user’s device.
The latter can be achieved in two ways:
• The model file contains only the learned parameters, while the user’s device has a runtime environment for the model installed. Some machine learning packages, TensorFlow for example, have a lightweight version that can run on mobile devices. Alternatively, frameworks such as Core ML from Apple allow running, on Apple devices, models created in several popular packages, including scikit-learn, Keras, and XGBoost. (A conversion sketch is given after this list.)
• The model file is a serialized object that the application deserializes. The advantage of this approach is that you don’t need a runtime environment for your model on the user’s device: all needed dependencies are serialized with the model object. An evident drawback is that an update might be quite “heavy,” which is a problem if your software system has millions of users.
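As a concrete illustration of the first approach, here is a hedged sketch of converting a model to a compact on-device format with TensorFlow Lite (the lightweight TensorFlow runtime mentioned above); the Keras model architecture and the file name are illustrative assumptions:

import tensorflow as tf

# an illustrative Keras model; in practice this would be your trained model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# convert the model to a .tflite file containing the learned parameters;
# the application on the device then only needs the TF Lite runtime
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)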
The main advantage of dynamic deployment on the user’s device is that the calls to the model are fast for the user. The main drawback is related to updating the model: as I already mentioned, a serialized object can be quite voluminous; furthermore, some users can be offline during the update or even turn off all future updates. In the latter case, you may end up with different users running very different versions of the model. The differences between software or model versions may, in turn, result in outdated data being received from some users. If that data is then used in data analysis or to train other models, the results of the analysis or the new models can be negatively affected by that outdated data.
Another drawback of deploying models on the user’s device is that the model becomes more easily available for analysis by third parties. A third party can try to reverse-engineer the model to reproduce its behavior, or discover its weaknesses (by providing various inputs and observing the outputs) and adapt their data so that the model predicts what they want. For example, suppose a mobile application lets the user read news related to their interests. A content provider might try to reverse-engineer the model running on the user’s device so that the model recommends that provider’s content more often.
As with the static deployment, deploying the model on a user’s device makes it difficult to
monitor the performance of the model.

7.4 Dynamic Deployment on a Server

Because of the possible complications with delivering updates using the two patterns above, and because of the problems with performance monitoring, the most frequent deployment pattern is to deploy the model on a server (or servers) and make it available as a REST API in the form of a web service.

7.4.1 Deployment on a Virtual Machine

In a typical web service architecture deployed in a cloud environment, predictions are served in response to canonically-formatted HTTP requests. A web service running on a virtual machine receives a user request that contains the input data, calls the machine learning system with that input data, and then transforms the output of the machine learning system into a JSON or XML string, which the web service returns. To cope with a high load, several identical virtual machines, each running both the web service and the machine learning system, run in parallel. A load balancer dispatches each HTTP request to a specific virtual machine depending on its availability. The virtual machines can be added and shut down manually, or be a part of an auto-scaling group that launches or terminates virtual machines based on their usage. Figure 2 illustrates that deployment pattern: each instance contains all the code needed to run the feature extractor and the model, as well as a web service that has access to that code.



Figure 2: Deploying a machine learning model as a web service on a virtual machine. User input is dispatched by a load balancer to one of the instances (1 through N) in an auto-scaling group.

In Python, a REST API web service is usually implemented using a web application framework
such as Flask or CherryPy. An R equivalent is Shiny.
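To make this concrete, here is a minimal Flask sketch of such a web service; the model file name, the /predict route, and the JSON field names are illustrative assumptions, and the feature extraction step is omitted:

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# deserialize the trained model once, at startup
with open("model.pickle", "rb") as infile:
    model = pickle.load(infile)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()               # input data arrives as JSON
    features = payload["features"]              # assume features are already extracted
    prediction = model.predict([features])[0]
    return jsonify({"prediction": int(prediction)})  # output returned as a JSON string

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

In production, such a service would typically run behind a production-grade HTTP server rather than Flask's built-in development server.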
The advantage of deploying on a virtual machine is that the architecture of the software
system is conceptually simple: it’s a typical web service.
Among the downsides is the need to maintain servers (physical or virtual). If virtualization is used, there is additional computational overhead due to virtualization and to running multiple operating systems. Finally, deploying on a virtual machine has a relatively higher cost compared to deployment in a container or serverless deployment, which we consider below.

7.4.2 Deployment in a Container

A more modern alternative to virtual machine-based deployment is container-based deployment. Working with containers is typically considered more resource-efficient and flexible than working with virtual machines. A container is similar to a virtual machine in the sense that it is also an isolated runtime environment with its own filesystem, CPU, memory, and process space. The main difference, however, is that all containers running on the same virtual or physical machine share the operating system, while each virtual machine runs its own instance of the operating system.
The deployment process looks as follows. The machine learning system and the web service are installed inside a container (usually a Docker container, but there are alternatives); then a container-orchestration system is used to run the containers on a cluster of physical or virtual servers. A typical choice of a container-orchestration system, for running on-premises or on a cloud platform other than Amazon Web Services (AWS), is Kubernetes; if your software runs on AWS, you can use Amazon’s proprietary orchestration service ECS, optionally with its serverless container engine Fargate. Figure 3 illustrates that deployment pattern.

Figure 3: Deploying a machine learning model as a web service in a container running on a cluster. User input is dispatched by a load balancer, while a container orchestrator manages the containers (1 through N) on a cluster with an auto-scaler.

In the above deployment pattern, the virtual or physical machines are organized into a cluster whose resources are managed by the container orchestrator. New virtual or physical machines can be added to the cluster and removed manually or, if your software is deployed in a cloud environment, by a cluster auto-scaler that launches (and adds to the cluster) or terminates virtual machines based on the usage of the cluster.
Deployment in a container has the advantage of being more resource-efficient compared to deployment on a virtual machine, though it has the same drawback: the need to manage (and pay for) one’s own servers.

7.4.3 Serverless Deployment

Several cloud services providers, including Amazon, Google, and Microsoft, offer so-called serverless computing. It is known under the name of Lambda functions on Amazon Web Services, and simply functions on Microsoft Azure and Google Cloud Platform.
Serverless deployment consists of preparing a zip archive with all the code needed to run the machine learning system (the model and the feature extractor). The zip archive must contain a file with a specific name that defines a function or class method with a specific signature (the so-called entry point function). The zip archive is uploaded to the cloud platform and registered under a unique name.
The cloud platform provides an API that can be used to submit inputs to the serverless function (by specifying its name and providing the payload) and to get the outputs from it. The cloud platform takes care of deploying the code and the model on an adequate computational resource, executing the code, and routing the output back to the client.
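For illustration, here is a hedged sketch of an entry point function following the AWS Lambda handler convention; the handler name, the model file, and the shape of the event payload are assumptions, not prescribed by the book:

import json
import pickle

# loaded once per runtime environment and reused across invocations
with open("model.pickle", "rb") as infile:
    model = pickle.load(infile)

def predict(event, context):
    # entry point: receives the payload submitted through the cloud platform's API
    features = json.loads(event["body"])["features"]   # assumed payload format
    prediction = model.predict([features])[0]
    return {"statusCode": 200, "body": json.dumps({"prediction": int(prediction)})}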
Usually, the execution time of a function is limited by the cloud service provider. The size of the zip file and the amount of RAM available at runtime are also limited. The limit on the size of the zip file can be a challenge for deploying machine learning systems: a typical machine learning model, in order to be properly executed, requires multiple heavyweight dependencies, such as (for Python) numpy, scipy, and scikit-learn.
Besides Python, supported programming languages, depending on the cloud platform, can
include Java, Go, PowerShell, Node.js, C#, and Ruby.
There are many advantages to relying on serverless deployment. The obvious advantage is that you don’t have to provision resources such as servers or virtual machines; you don’t have to install dependencies or care about upgrading the system. Serverless systems are highly scalable and can easily support 10,000 and more requests per second. Serverless functions support both synchronous and asynchronous modes of operation.
Serverless deployment is also cost-efficient: you only pay for compute time. That effect can also be achieved with the previous two deployment patterns by using autoscaling, but autoscaling has a significant latency: the demand can drop, but excess virtual machines may keep running for some time before they are terminated.
Serverless deployment also greatly simplifies canarying. In software engineering, canarying is a software-update strategy in which the updated version of the code is pushed to a small group of end users (who are usually unaware that they are receiving new code). Because the new version of the code is only distributed to a small number of users, its impact is relatively small, and changes can be reversed quickly should the new code prove to contain bugs. It is easy to set up two versions of serverless functions in production and start sending low-volume traffic to one of them, to test it in the production setting without affecting many users. We will talk more about canarying in the next section.
Rollbacks are also very simple in serverless deployment because it is easy to switch back to the previous version of the function by replacing one zip archive with another.
Besides the limits on the size of the zip archive and on the amount of RAM available at runtime, an important drawback of serverless deployment of machine learning-powered software is the unavailability of GPUs (as of February 2020), which can be a significant limitation for deploying very deep models.
Of course, complex software systems may combine deployment patterns. A deployment pattern appropriate for one model may be suboptimal for another. A combination of several deployment patterns is called a hybrid deployment pattern. For example, personal assistants like Google Home or Amazon Echo might have a model that recognizes the activation phrase (such as “OK, Google”) deployed on the client’s device, and more complex models that handle requests like “put song X on device Y” running on a server. Alternatively, deployment on the user’s mobile device can be used for a model that augments video by adding intelligent effects in real time, while deployment on a server can be used for models that apply more complex effects, such as stabilization and super-resolution.

7.5 Deployment Strategies

Typical deployment strategies are:


• single deployment,
• silent deployment,
• canary deployment,
• multi-armed bandit.
Let’s consider each of them.

7.5.1 Single Deployment

Single deployment is the simplest strategy. Once you have a new model, you serialize it into a file and then replace the old file with the new one (you also replace the feature extractor, if needed).
If you deploy on a server in a cloud environment, you prepare a new virtual machine or a container running the new version of the model (and, if needed, of the feature extraction object), then you replace the virtual machine image or the container, gradually shut down the old machines or containers, and let the autoscaler start new ones.


If you deploy on a physical server, you have to upload the new model file (and the feature extraction object, if needed) to all physical servers, replace the old files and code with the new ones, and restart the web service.
If you deploy on the user’s device, you push the new model file (and the feature extraction object) to the user’s device and restart the software.
If you use an interpreted language, the feature extractor can be deployed by replacing one source code file with another. Otherwise, to avoid redeploying the entire software application (on either the server or the user’s device), the feature extractor object can be serialized into a file; the software running the model then deserializes the feature extractor object on each startup.
Single deployment has the advantage of being simple; however, it is also the riskiest strategy: if the new model or the feature extractor contains a bug, all users will be affected.

7.5.2 Silent Deployment

A counterpart of single deployment is silent deployment. It consists of deploying the new version of the model and the feature extractor while keeping the old ones; both versions run in parallel, but the user is not exposed to the new version until the switch is made. The predictions made by the new version of the model are only logged and then, after some time, analyzed to detect possible bugs.
Silent deployment has the benefit of giving enough time to make sure that the new model works as expected, without affecting any user. The drawback of this approach is the need to run twice as many models, which may be very resource-consuming. Furthermore, for many applications, it is impossible to evaluate the effect the new model has on the user without exposing its predictions to the user.

7.5.3 Canary Deployment

As we already discussed, canary deployment consists of pushing the new version of the model and the code to a small fraction of the users, while keeping the old version running for most users. Contrary to silent deployment, canary deployment allows validating the performance of the new model and the effect its predictions have on the users. Contrary to single deployment, canary deployment doesn’t affect many users in case of bugs.
By opting for canary deployment, you accept the additional complexity of having and maintaining several versions of the model deployed simultaneously. An obvious drawback of canary deployment is that it does not let the engineers spot rare errors: if you deploy the new version of the model and the code to 5% of users and a bug affects 2% of them, then only 0.1% of all users will encounter the bug, so it may easily remain undiscovered.



7.5.4 Multi-Armed Bandits

As we have seen in the previous chapter, multi-armed bandits (MAB) are a way to compare one or more versions of the model and select the best-performing one in the production environment. MABs have an interesting property: after an initial exploration period, during which the MAB algorithm gathers enough evidence to evaluate the performance of each model (arm), the best-performing arm is eventually played all the time. This means that, after the convergence of the MAB algorithm, all users are routed to the version of the software running the best model.
This property of the MAB algorithm allows us to deploy the new model while keeping the old one and to wait for the MAB algorithm to converge. This both tells us whether the new model performs better than the old one and, at the same time, lets the MAB algorithm replace the old model with the new one once it is certain that the new model performs better.
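To make the idea concrete, here is a minimal sketch, not taken from the book, of an epsilon-greedy bandit that routes requests between candidate model versions; the reward signal (for example, a click or a conversion) and the list of models are hypothetical placeholders, and production systems often use more sophisticated algorithms such as UCB or Thompson sampling:

import random

class EpsilonGreedyRouter:
    def __init__(self, models, epsilon=0.1):
        self.models = models                  # candidate model versions (the arms)
        self.epsilon = epsilon
        self.counts = [0] * len(models)       # how many times each arm was played
        self.values = [0.0] * len(models)     # running average reward of each arm

    def choose(self):
        if random.random() < self.epsilon:    # explore: pick a random arm
            return random.randrange(len(self.models))
        return max(range(len(self.models)), key=lambda i: self.values[i])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n   # incremental mean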

7.6 Technical Details

In this section, we discuss practical aspects and technical details of deploying machine learning
systems in production.

7.6.1 Algorithmic Efficiency

Most data analysts work in Python or R. While, as we have seen, there are web frameworks that allow building web services in those two languages, these languages are not considered the most efficient.
Indeed, when you use scientific packages in Python, much of the code in those packages was written in efficient C or C++ and then compiled for your specific operating system. However, your own code, such as the feature extraction code and the code that transforms the model’s prediction into a business decision or action, is usually not as efficient.
Furthermore, not all algorithms capable of solving a problem are practical; some can be too slow. Some problems can be solved by a fast algorithm; for others, no fast algorithm can exist.
The subfield of computer science called analysis of algorithms is concerned with determining
and comparing the complexity of algorithms. Big O notation is used to classify algorithms
according to how their running time or space requirements grow as the input size grows.
For example, let’s say we have the problem of finding the two most distant one-dimensional examples in a set of examples S of size N. One algorithm we could craft to solve this problem would look like this in Python:



def find_max_distance(S):
    result = None
    max_distance = 0
    for x1 in S:
        for x2 in S:
            if abs(x1 - x2) >= max_distance:
                max_distance = abs(x1 - x2)
                result = (x1, x2)
    return result
or like this in R:
find_max_distance <- function(S) {
    result <- NULL
    max_distance <- 0
    for (x1 in S) {
        for (x2 in S) {
            if (abs(x1 - x2) >= max_distance) {
                max_distance <- abs(x1 - x2)
                result <- c(x1, x2)
            }
        }
    }
    result
}
In the above algorithm, we loop over all values in S and, at every iteration of the outer loop, we loop over all values in S once again. Therefore, the above algorithm makes N² comparisons of numbers. If we take as a unit of time the time that the comparison, abs, and assignment operations take, then the time complexity (or, simply, complexity) of this algorithm is at most 5N². (At each iteration, we have one comparison, two abs, and two assignment operations.) When the complexity of an algorithm is measured in the worst case, big O notation is used. For the above algorithm, using big O notation, we write that the algorithm’s complexity is O(N²); constants, like 5, are ignored.
For the same problem, we can craft another algorithm like this (in Python):
def find_max_distance(S):
    result = None
    min_x = float("inf")
    max_x = float("-inf")
    for x in S:
        if x < min_x:
            min_x = x
        if x > max_x:
            max_x = x
    result = (max_x, min_x)
    return result
or like this (in R):
find_max_distance <- function(S) {
    result <- NULL
    min_x <- Inf
    max_x <- -Inf
    for (x in S) {
        if (x < min_x) {
            min_x <- x
        }
        if (x > max_x) {
            max_x <- x
        }
    }
    result <- c(max_x, min_x)
    result
}
In the above algorithm, we loop over all values in S only once, so the algorithm’s complexity is O(N). In this case, we say that the latter algorithm is more efficient than the former.
An algorithm is called efficient when its complexity is polynomial in the size of the input. Therefore, both O(N) and O(N²) are efficient, because N is a polynomial of degree 1 and N² is a polynomial of degree 2. However, for very large inputs, an O(N²) algorithm can still be slow. In the big data era, scientists often look for O(log N) algorithms.
From a practical standpoint, when you implement your algorithm, you should avoid using explicit loops whenever possible. For example, you should use operations on matrices and vectors instead of loops. In Python, to compute the dot product wx, you should type:
import numpy
wx = numpy.dot(w, x)
and not,
wx = 0
for i in range(N):
    wx += w[i]*x[i]
Similarly, in R you should type,
wx <- w %*% x
and not,
wx <- 0
for (i in seq(N)) {
    wx <- wx + w[i]*x[i]
}



Use appropriate data structures. If the order of elements in a collection doesn’t matter, use a set instead of a list: in Python, verifying whether a specific example x belongs to S is efficient when S is declared as a set and inefficient when S is declared as a list. Another important data structure that can make your Python code more efficient is dict, called a dictionary or a hashmap in other languages; it lets you define a collection of key-value pairs with very fast lookups for keys.
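A small illustration of this advice (not from the book); the collection sizes and keys are arbitrary:

S_list = list(range(1_000_000))
S_set = set(S_list)                  # same elements, order no longer matters

found = 999_999 in S_list            # O(N): scans the list element by element
found = 999_999 in S_set             # O(1) on average: a hash lookup

counts = {"cat": 3, "dog": 5}        # dict: key-value pairs with fast key lookups
n_cats = counts.get("cat", 0)        # returns 3 in O(1) on average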
Using libraries is generally the most reliable option; you should only write your own code if you are a researcher or if it is truly required. Scientific Python packages like numpy, scipy, and scikit-learn were built by experienced scientists and engineers with efficiency in mind. They have many methods implemented in the C programming language for maximum efficiency.
If you need to iterate over a vast collection of elements, use Python generators (or their R alternative in the iterators package), which produce one element at a time rather than all the elements at once.
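For example, a tiny sketch, with a hypothetical file name and format, of a generator that yields one parsed example at a time instead of loading the whole file into memory:

def read_examples(path):
    with open(path) as f:
        for line in f:                                   # lines are read lazily, one by one
            yield [float(value) for value in line.split(",")]

# for features in read_examples("training_data.csv"):
#     process(features)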
Use the cProfile package in Python (or an R counterpart such as lineprof) to find inefficiencies in your code.
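For instance, profiling the find_max_distance function defined earlier (the input size here is an arbitrary choice):

import cProfile

# prints per-function call counts and timings, sorted by cumulative time
cProfile.run("find_max_distance(list(range(2000)))", sort="cumulative")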
Finally, when nothing can be improved in your code from the algorithmic perspective, you can further boost its speed by using:
• the multiprocessing package in Python, or its counterpart parallel in R, to run computations in parallel, and
• PyPy, Numba, or similar tools (or the compiler package for R) to compile your code into fast, optimized machine code.

7.6.2 Delivery Format for Model and Code

As I already mentioned above, the most straightforward way to deliver the model and the feature extractor code to the production environment is serialization. Every modern programming language has serialization tools. In Python, it’s Pickle:
import pickle
from sklearn import svm, datasets

classifier = svm.SVC()
X, y = datasets.load_iris(return_X_y=True)
classifier.fit(X, y)

# saving model to file
with open("model.pickle", "wb") as outfile:
    pickle.dump(classifier, outfile)

# reading model from file
classifier2 = None
with open("model.pickle", "rb") as infile:
    classifier2 = pickle.load(infile)
if classifier2:
    prediction = classifier2.predict(X[0:1])
while in R it’s RDS:
library("e1071")

classifier <- svm(Species ~ ., data = iris, kernel = 'linear')

# saving model to file
saveRDS(classifier, "./model.rds")

# reading model from file
classifier2 <- readRDS("./model.rds")

prediction <- predict(classifier2, iris[1,])


The same approach can be applied to save to a file, copy to the production environment, and then read from a file the serialized object of the feature extractor.
For some applications, the speed of prediction is critical. In such cases, the production code is written in a compiled language such as Java or C/C++. If a data analyst has built a model in Python or R, there are two options for making it deployable in production:
• rewrite the code in a compiled programming language of the production environment, or
• use PMML.
The Predictive Model Markup Language (PMML) is an XML-based predictive model interchange format. For example, suppose you use Python to build an SVM model and then save the model as a PMML file, and suppose the production runtime environment is a Java Virtual Machine (JVM). As long as PMML is supported by a machine learning library for the JVM, and that library has an implementation of SVM, your model can be used in production directly; you don’t need to rewrite your code in a JVM language.
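As an illustration, here is a hedged sketch of exporting a scikit-learn SVM model to PMML using the sklearn2pmml package, one possible tool for this purpose (the pipeline wrapper and the output file name are assumptions):

from sklearn import svm, datasets
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = datasets.load_iris(return_X_y=True)
pipeline = PMMLPipeline([("classifier", svm.SVC())])
pipeline.fit(X, y)

# writes an XML file that a JVM-side PMML scoring library can load
sklearn2pmml(pipeline, "model.pmml")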
Unfortunately, PMML is not widely supported (as of February 2020), so for some models, if high time efficiency is needed, you will have to reimplement your Python or R model in a programming language supported in production.



7.7 Automated Deployment, Versioning and Metadata

Only deploy a model in production when it is accompanied by the following data and information:
1. An end-to-end test set that defines inputs and correct outputs for the model, on which the model must always work.
2. A confidence test set that defines inputs and correct outputs for the model, used to compute the value of the metric.
3. A metric that will be calculated on the confidence test set by applying the model to it.
4. The acceptable range of values of the metric.
Once the system using the model is invoked for the first time on an instance of a server or on the client’s device, an external process must call the system on the end-to-end test data and validate that all predictions are correct. Only if all predictions on the end-to-end test set are correct can the system be applied to the production data.
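A minimal sketch, not the book’s code, of such an external validation step; the call_system function, the test-set file name, and its format are assumptions:

import json

def validate_end_to_end(call_system, test_path="end_to_end_tests.json"):
    # return True only if the deployed system reproduces every expected output
    with open(test_path) as f:
        cases = json.load(f)                  # a list of {"input": ..., "expected": ...} records
    for case in cases:
        if call_system(case["input"]) != case["expected"]:
            return False                      # any mismatch must block the deployment
    return True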
A deployment of a model has to be done using a script from a versioned repository of models.
The versions of the following three elements must always be in sync:
1. Training data
2. Feature extractor
3. Model
Each update to the data must produce a new version of the data in the data repository. The model built from a specific version of the data must be put into the model repository with the same version as the version of the data used to build it. If the feature extractor was not updated, its version still has to be updated to stay in sync with the data and the model. Similarly, if the feature extractor is updated, then a new model must be built using the updated feature extractor, and the versions are incremented for the feature extractor, the model, and the training data (even if the data remains unchanged).
The deployment of a new version of the model has to be done automatically, by a script, in a transactional way. Given a version of the model to deploy, the deployment script has to pick the model and the feature extraction object from the respective repositories and copy them to the production environment. The model has to be applied to the end-to-end test data by simulating a regular call from the outside. If, at any stage in the process, there is an error in a prediction on the end-to-end test data, the entire deployment has to be rolled back.
Every version of the model in the model repository has to be accompanied by the following metadata:
1. The name and the version of the library or package used to build the model.
2. If Python was used to build the model, the requirements.txt of the virtual environment used to build the model.
3. The name of the machine learning algorithm, and the names and values of its hyperparameters.

7.8 Contributors

I’m grateful to the following people for their valuable contributions to the quality of this
chapter: vacant.

