Chapter 7
[Chapter-opening figure showing the stages of a machine learning project: business goal and problem definition, data collection and preparation, feature engineering, model building, and model evaluation.]
A model can be deployed in various ways. It can be deployed on a server or on the user's
device; it can be deployed to all users at once or to a small fraction of them. Below, we
consider all the options.
The static deployment of a machine learning model is very similar to traditional software
deployment: you prepare an installable binary of the entire software, where the model is
packaged as a resource available at runtime. Depending on the operating system and
the runtime environment, the objects of both the model and the feature extractor can be
packaged as part of a dynamic-link library (DLL on Windows) or a shared object (*.so files on
Linux).
Dynamic deployment on users' devices is similar to static deployment in the sense that the
user runs a part of your system as a software application on their device. The difference is
that in dynamic deployment the model is not a part of the
binary code of the application. This allows pushing updates of the model without updating
the application running on the user's device. However, once deployed, the model still runs on
the user's device.
The latter can be achieved in two ways:
• The model file would only contain the learned parameters, while the user's device
would have a runtime environment for the model installed. Some machine learning
packages, TensorFlow for example, have a lightweight version that can run on mobile
devices. Alternatively, frameworks such as Core ML from Apple allow running, on Apple
devices, models created in several popular packages, including scikit-learn, Keras,
and XGBoost. (A minimal conversion sketch follows this list.)
• The model file would be a serialized object that the application would deserialize. The
advantage of such an approach is that you don't need a runtime environment
for your model on the user's device: all needed dependencies will be serialized with the
object of the model. An evident drawback is that an update might be quite "heavy,"
which is a problem if your software system has millions of users.
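As an illustration of the first option, here is a minimal sketch of converting a trained TensorFlow model into the lightweight TensorFlow Lite format that a mobile runtime can execute; the directory and file names are assumptions:

import tensorflow as tf

# Convert a trained model, previously saved with tf.saved_model.save, to the
# lightweight TensorFlow Lite format; "saved_model_dir" is an illustrative path.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
tflite_model = converter.convert()

# The resulting flat buffer is shipped to the device and executed by the
# TensorFlow Lite interpreter installed there.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)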
The main advantage of the dynamic deployment on the users’ devices is that the calls to the
model will be fast for the user. The main drawback is related to updating the model: as
I already mentioned, a serialized object can be quite voluminous; furthermore, some users
can be offline during the update or even turn off all future updates. In the latter case, you
may end up with different users using very different versions of the model. Differences in
software or model versions may, in turn, result in outdated data being received from some
users. If that data is then used in data analysis or to train other models, the results of the
analysis or the new models can be negatively affected by that outdated data.
Another drawback of deploying models on the user's device is that the model becomes harder
to protect and to monitor: a motivated user can extract and analyze it, and the engineers have
little visibility into how it performs on each device.
Because of the possible complications with delivering updates using the above two patterns
and problems with performance monitoring, the most frequent deployment pattern is to
deploy the model on a server (or servers) and make it available as a REST API in the form
of a web service.
In a typical web service architecture deployed in a cloud environment, the predictions are
served in response to canonically-formatted HTTP requests. A web service running on a
virtual machine receives a user request that contains the input data, calls the machine learning
system using that input data, and then transforms the output of the machine learning system
into a JSON or XML string, which the web service then returns. To cope with a high load,
several identical virtual machines, each running both the web service and the machine learning
system, operate in parallel. A load balancer dispatches each HTTP request to a specific
virtual machine depending on its availability. The virtual machines can be added and removed
manually or be part of an auto-scaling group that launches or terminates virtual machines
based on their usage. Figure 2 illustrates that deployment pattern. In Figure 2, each instance,
denoted as an orange square, contains all the code needed to run the feature extractor and
the model; the instance also contains a web service that has access to that code.
Figure 2: User input is dispatched by a load balancer to one of N identical instances (Instance 1, Instance 2, ..., Instance N).
In Python, a REST API web service is usually implemented using a web application framework
such as Flask or CherryPy. An R equivalent is Shiny.
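For example, a minimal prediction web service in Flask might look like the following sketch; the model file name, the endpoint, and the JSON schema are illustrative assumptions, not a prescribed interface:

import pickle

from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the serialized model once, at startup; "model.pickle" is an illustrative name.
with open("model.pickle", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # The request body is assumed to be of the form {"features": [..numbers..]}.
    features = request.get_json()["features"]
    prediction = model.predict([features]).tolist()
    return jsonify({"prediction": prediction[0]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

In production, such an application would typically run behind a WSGI server rather than Flask's built-in development server.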
The advantage of deploying on a virtual machine is that the architecture of the software
system is conceptually simple: it’s a typical web service.
Among the downsides is the need to maintain servers (physical or virtual). If virtualization is
used, then there is an additional computational overhead due to virtualization and running
multiple operating systems. Finally, deploying on a virtual machine has a relatively higher
cost compared to deployment in a container or serverless deployment, which we consider
below.
An alternative is to package the machine learning system and the web service inside a container and let a container orchestrator run copies of that container on a cluster of machines, as shown in Figure 3.

Figure 3: User input is dispatched by a load balancer to one of N containers (Container 1, Container 2, ..., Container N) managed by a container orchestrator.
In the above deployment pattern, the virtual or physical machines are organized into a cluster
whose resources are managed by the container orchestrator. New virtual or physical machines
can be added to the cluster and removed manually or, if your software is deployed in a cloud
environment, by a cluster auto-scaler that launches (and adds to the cluster) or terminates
virtual machines based on the usage of the cluster.
Deployment in a container has the advantage of being more resource-efficient than deployment
on a virtual machine, though it shares the same drawback: you still need to provision and
maintain the machines in the cluster.
Several cloud services providers, including Amazon, Google, and Microsoft, offer so-called
serverless computing. It is known under the name of Lambda functions on Amazon Web
Services and functions on Microsoft Azure and Google Cloud Platform.
The serverless deployment consists of preparing a zip archive with all the code needed to run
the machine learning system (the model and the feature extractor). The zip archive must
contain a file with a specific name that contains a specific function or class method definition
with a specific signature (so-called entry point function). The zip archive must be uploaded
to the cloud platform and registered under a unique name.
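As an illustration, an entry point function in the style of AWS Lambda could look like the following sketch; lambda_handler is AWS's conventional handler name, while the event format and the file name are assumptions:

import json
import pickle

# Loaded once per runtime instance and reused across invocations.
# "model.pickle" is assumed to be packaged in the same zip archive.
with open("model.pickle", "rb") as f:
    model = pickle.load(f)

def lambda_handler(event, context):
    # The input is assumed to arrive as a JSON body: {"features": [..numbers..]}.
    features = json.loads(event["body"])["features"]
    prediction = model.predict([features]).tolist()
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction[0]})}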
The cloud platform provides an API that can be used to submit inputs to the serverless
function (by specifying its name and providing the payload) and get the outputs from it. The
cloud platform takes care of deploying the code and the model on an adequate computational
resource, executing the code and routing the output back to the client.
Usually, the execution time for a function is limited by the cloud service provider. The size
of the zip file and the amount of RAM available on the runtime are also limited.
The limit on the size of the zip file can be a challenge for deploying machine learning systems.
A typical machine learning model, in order to be properly executed, requires multiple
heavyweight dependencies, such as (for Python) numpy, scipy, and scikit-learn.
Besides Python, supported programming languages, depending on the cloud platform, can
include Java, Go, PowerShell, Node.js, C#, and Ruby (as of February 2020).
There are many advantages to relying on serverless deployment. The obvious advantage is
that you don't have to provision resources such as servers or virtual machines; you don't have
to install dependencies or worry about upgrading the system. Serverless systems are highly
scalable and can easily support 10,000 or more requests per second. Serverless
functions support both synchronous and asynchronous modes of operation.
Serverless deployment is also cost-efficient: you only pay for compute time. That effect
can also be achieved with the previous two deployment patterns by using autoscaling, but
autoscaling has a significant latency: the demand can drop, but excess virtual machines
may keep running for some time before they are terminated.
The serverless deployment also greatly simplifies canarying. In software engineering, canarying
is a strategy of software update in which the updated version of the code is pushed to a small
group of end users (who are usually unaware that they are receiving new code). Because the
new version of the code is only distributed to a small number of users, its impact is relatively
small, and changes can be reversed quickly should the new code prove to contain bugs. It is
easy to set up two versions of serverless functions in production and start sending a low volume
of traffic to the new version.
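Conceptually, the traffic split behind canarying can be as simple as the following sketch; the two version functions and the five-percent fraction are illustrative stand-ins:

import random

def old_version(request):
    # Stand-in for the currently deployed prediction function.
    return "prediction from the old model"

def new_version(request):
    # Stand-in for the canary prediction function.
    return "prediction from the new model"

def route(request, canary_fraction=0.05):
    # Send a small, fixed fraction of the traffic to the canary version.
    if random.random() < canary_fraction:
        return new_version(request)
    return old_version(request)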
Single deployment is the simplest strategy. Once you have a new model, you serialize it into a
file and then replace the old file with the new one (and you also replace the feature extractor if
needed).
If you deploy on a server in a cloud environment, you prepare a new virtual machine or
container running the new version of the model (and, if needed, of the feature extraction
object), then replace the virtual machine image or the container image, and gradually shut
down the old machines or containers while letting the autoscaler start the new ones.
As we already discussed, canary deployment consists of pushing the new version of the model
and the code to a small fraction of the users while keeping the old version running for
most users. Contrary to the silent deployment, canary deployment allows validating the
performance of the new model and the effect its predictions have on the users. Contrary to
the single deployment, canary deployment doesn't affect many users in case of possible
bugs.
By opting for the canary deployment, you accept the additional complexity of having and
maintaining several versions of the model deployed simultaneously. An obvious drawback of
the canary deployment is that it doesn't let the engineers spot rare errors: if you deploy
the new version of the model and the code to 5% of users and a bug affects 2% of users, then
only 0.1% of all users are exposed to the bug, so it can easily go unnoticed.
As we have seen in Section ?? of the previous chapter, multi-armed bandits (MAB) are
a way to compare one or more versions of the model and select the best-performing one in
the production environment. MABs have an interesting property: after an initial exploration
period, during which the MAB algorithm gathers enough evidence to evaluate the performance
of each model (arm), the best-performing arm ends up being played all the time. This means
that after the convergence of the MAB algorithm, all users are routed to the version of the
software running the best model.
This property of the MAB algorithm allows us to deploy the new model while keeping the
old one and wait for the MAB algorithm to converge. This gives us the information about
whether the new model performs better than the old one and, at the same time, lets the
MAB algorithm replace the old model with the new one once it is certain that the new model
performs better.
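Here is a minimal sketch of such adaptive routing, using the epsilon-greedy strategy as one possible MAB algorithm; the arm names and the reward value are illustrative, and in practice the reward would be an observed business metric such as a click or a conversion:

import random

class EpsilonGreedyRouter:
    """Routes each request to one of several model versions (arms)."""

    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}  # running mean reward per arm

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.arms)  # explore
        return max(self.arms, key=lambda arm: self.values[arm])  # exploit

    def update(self, arm, reward):
        # Incrementally update the running mean reward of the played arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

router = EpsilonGreedyRouter(["old_model", "new_model"])
arm = router.choose()           # the version that serves this request
# ... serve the prediction with that version and observe the reward ...
router.update(arm, reward=1.0)  # e.g., 1.0 if the user clicked, 0.0 otherwise

With an epsilon that decays over time, exploration gradually vanishes and the best arm ends up being played almost all the time, which matches the convergence behavior described above.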
In this section, we discuss practical aspects and technical details of deploying machine learning
systems in production.
Most data analysts work in Python or R. While, as we have seen, there are web frameworks
that allow building web services in those two languages, they are not considered the most
efficient languages.
Indeed, when you use scientific packages in Python, much of the code in those packages was
written in efficient C or C++ and then compiled for your specific operating system. However,
your own code, such as the feature extraction code and the code that transforms the prediction
of the model into a business decision or action, is not as efficient.
Furthermore, not all algorithms capable of solving a problem are practical. Some can be too
slow. Some problems can be solved by a fast algorithm; for others, no fast algorithms can
exist.
The subfield of computer science called analysis of algorithms is concerned with determining
and comparing the complexity of algorithms. Big O notation is used to classify algorithms
according to how their running time or space requirements grow as the input size grows.
For example, let’s say we have the problem of finding the two most distant one-dimensional
examples in the set of examples S of size N . One algorithm we could craft to solve this
problem would look in Python like this:
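def find_max_distance(S):
    # One naive way (a sketch; the function name is illustrative):
    # compare every pair of examples and keep the most distant pair.
    result = None
    max_distance = 0
    for x1 in S:
        for x2 in S:
            if abs(x1 - x2) >= max_distance:
                max_distance = abs(x1 - x2)
                result = (x1, x2)
    return result

This algorithm loops over all pairs of examples, so its running time grows as O(N²). Because the examples are one-dimensional, the same problem can be solved in O(N) time by finding the minimum and the maximum of S in a single pass.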
As I already mentioned above, the most straightforward way to deliver the model and the
feature extractor code to the production environment is serialization.
Every modern programming language has serialization tools. In Python, it's pickle:
import pickle
from sklearn import svm, datasets

# Train a simple classifier so that there is something to serialize.
classifier = svm.SVC()
X, y = datasets.load_iris(return_X_y=True)
classifier.fit(X, y)

# Serialize the trained model to a file ("classifier.pickle" is an illustrative name).
with open("classifier.pickle", "wb") as f:
    pickle.dump(classifier, f)
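On the other end, typically in the production environment, the application restores the object from the file; the file name mirrors the one used above:

import pickle

# Deserialize the model; all its dependencies must be importable in this environment.
with open("classifier.pickle", "rb") as f:
    classifier = pickle.load(f)

print(classifier.predict([[5.1, 3.5, 1.4, 0.2]]))  # predict on one iris-like input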
Only deploy a model in production when it's accompanied by the following data and
information:
1. An end-to-end test set that defines inputs and correct outputs for the model that must
always work.
2. A confidence test set that defines inputs and correct outputs for the model that
are used to compute the value of the metric.
3. A metric that will be calculated on the confidence test set by applying the model to it.
4. The acceptable range of values of the metric.
Once the system using the model is invoked for the first time on an instance of a server or on
the client's device, an external process must call the system on the end-to-end test data and
validate that all predictions are correct. Only if all predictions on the end-to-end test set are
correct can the system be applied to the production data.
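A minimal sketch of such a validation gate follows; the JSON file format and the predict stand-in are assumptions, not a prescribed interface:

import json
import sys

def predict(features):
    # Stand-in for the deployed system's prediction entry point.
    return features

def passes_end_to_end_tests(path="end_to_end_tests.json"):
    # Each test case pairs an input with the output the model must always produce.
    with open(path) as f:
        cases = json.load(f)
    return all(predict(case["input"]) == case["expected"] for case in cases)

if not passes_end_to_end_tests():
    sys.exit("End-to-end validation failed; refusing to serve production traffic.")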
A deployment of a model has to be done using a script from a versioned repository of models.
The versions of the following three elements must always be in sync:
1. Training data
2. Feature extractor
3. Model
Each update to the data must produce a new version of the data in the data repository.
The model built based on a specific version of the data must be put into the model repository
with the same version as the version of the data used to build the model. If the feature
extractor was not updated, its version still has to be updated to be in sync with the data
and the model.
Similarly, if the feature extractor is updated, then a new model must be built using an
updated feature extractor, and the versions are incremented for the feature extractor, the
model, and the training data (even if the data remains unchanged).
The deployment of a new version of the model has to be done automatically by a script in a
transactional way. Given a version of the model to deploy, the deployment script has to pick
the model and the feature extraction object from the respective repositories and copy them
to the production environment. The model has to be applied to the end-to-end test data by
simulating a regular call from the outside. If, at any stage of the process, the model makes an
error on the end-to-end test data, the entire deployment has to be rolled
back.
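The following sketch shows one possible shape of such a transactional deployment script; the repository layout, the file names, and the test hook are assumptions:

import shutil

def run_end_to_end_tests(prod_dir):
    # Stand-in: apply the model to the end-to-end test data by simulating
    # a regular call from the outside, and check that every prediction is correct.
    return True

def deploy(version, model_repo, feature_repo, prod_dir):
    backup = prod_dir + ".backup"
    shutil.copytree(prod_dir, backup)  # keep the current deployment for rollback
    try:
        shutil.copy(f"{model_repo}/{version}/model.pickle", prod_dir)
        shutil.copy(f"{feature_repo}/{version}/feature_extractor.pickle", prod_dir)
        if not run_end_to_end_tests(prod_dir):
            raise RuntimeError("end-to-end tests failed")
    except Exception:
        shutil.rmtree(prod_dir)        # roll the entire deployment back
        shutil.move(backup, prod_dir)
        raise
    shutil.rmtree(backup)              # success: discard the backup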
Every version of the model in the model repository has to be accompanied with the following
metadata:
1. The name and the version of the library or package used to build the model.
7.8 Contributors
I’m grateful to the following people for their valuable contributions to the quality of this
chapter: vacant.