
Model Optimization for Edge ML

Machine Learning Model Optimization for Intelligent Edge

• Use cases built around IoT are endless.


• Devices are being used in homes to manage security, energy,
watering and appliances.
• Factories are optimizing operations and costs through
predictive maintenance.
• Cities are controlling traffic and applying IoT for public safety.
• Logistics companies are tracking shipments, doing fleet
management and optimizing routes.
• Restaurants are ensuring food safety in fridges and deep fryers,
retailers are deploying smart digital signage and implementing
advanced payment systems, and the list goes on.
Machine Learning Model Optimization for Intelligent
Edge

• To cope with this challenge, the "core" or "central" cloud
alone cannot deliver services at the scale and speed
expected in this era.
• Rather, support at the edge is going to be needed (in the
form of a separate cloud instance) to satisfy response
time demands and to deliver a superior user experience.
• In this work, we elaborate on the need for an edge,
focusing mainly on the AI/ML design strategy with the edge
playing a central role.
Smart devices, Intelligent Edge,
Continuous Learning
Edge ML Process
• The edge machine learning process spans three distinct tiers: the IoT devices, the Core cloud and the Edge cloud.
• First, the labelled (training) data is obtained separately, from one or more sources, outside of the main process.
• This data is then fed into ML algorithms for model training, which (for practical reasons) takes place in the back-end core.
• The core is positioned to do the heavy lifting of model training thanks to its (near) endless capacity and processing power.
• Once the model has reached acceptable accuracy, it is then deployed on the Edge to provide
data insights and real-time inferences based on the data collected locally from the devices.
• Model training, followed by pruning (trimming excess parameters) and validation, goes through an
iterative cycle to generate an optimal model before rolling it into production.
• This creates a closed-loop system in which user actions are captured and
fed back into the data set for re-training. This entails new features, modified
ground truth, or both.
• Re-training on the newer distribution helps improve accuracy and increases the capacity to handle a
wider range of data inputs.
Edge ML Subsystem Needs
• In a typical multi-tiered, multi-segment system, you need to
pay attention to each sub-system's characteristics.
• For example, the Edge compute layer has far less storage and
compute power compared to the back-end cloud data center.
• High performance and high throughput are expected of the
Edge to cater to real-time traffic with low-latency and quick
response time requirements.
• When it comes to re-training, tasks need to be carefully split
between the Edge and the Core cloud to maintain business
objectives while at the same time improving accuracy.
Continual Feedback loop

• The important ideas here are the feedback loop and the online model training/update stages.
• Pre-training and optimization take place in a full-scale central cloud.
• This way they can utilize its almost infinite compute power and storage.
• Improving model capacity with more factual field data results in a better-functioning
model at run time.
• The model can be shared across a number of (edge) devices and employed for a number of
tasks, with each device working in parallel.
Edge ML Approach Guidelines
1. Select the right chip set, with estimates for energy consumption and compute
performance
2. Understand storage requirements over time
3. Understand latency and throughput requirements
4. Understand your workload
5. Understand your traffic patterns
6. Understand your data growth
7. Understand memory/storage requirements
8. Hyper-parameter knowledge and tuning
9. Re-use space and reduce storage overuse/overkill
10. Train and save (at core), transfer (model) and load (at edge)
11. Containerization of AI services: well orchestrated and manageable
12. Reproducible and secure pipelines (such as using Kubeflow)
Benefits of Model Optimization
• Fewer resources required: models can be deployed to edge devices with
restrictions on processing, memory, or power consumption, for
example mobile and Internet of Things (IoT) devices.
• Efficiency: reduced model size can help improve productivity,
especially when deployed on the Edge.
• Latency: there's no round trip to a server, which also aids compliance.
• Privacy: no data needs to leave the device or edge gateway, hence
better security.
• Connectivity: an Internet connection isn't required for business
operation.
• Power consumption: matrix multiplication operations require
compute power. Fewer neurons mean less power consumption.
Model Optimization Techniques
Pruning
• Pruning describes a set of techniques to trim network size (by removing nodes, not layers) to
improve computational performance and sometimes resolution performance.
• The gist of these techniques is removing nodes from the network during training by
identifying those nodes which, if removed from the network, would not noticeably
affect network performance (i.e., resolution of the data).
• Even without using a formal pruning technique, you can get a rough idea of which
nodes are not important by looking at your weight matrix after training; look for weights
very close to zero, as it's the nodes on either end of those weights that are often
removed during pruning.
• Pruning neural networks is an old idea going back to 1990 (with Yann LeCun's
Optimal Brain Damage work) and before.
• The idea is that among the many parameters in the network, some are redundant
and don’t contribute a lot to the output.
• If you could rank the neurons in the network according to how much they contribute,
you could then remove the low ranking neurons from the network, resulting in a
smaller and faster network.
Pruning Algorithms
• By applying a pruning algorithm to your network during
training, you can approach an optimal network
configuration.
• The ranking of neurons can be done according to
the L1/L2 mean of their weights, their mean activations, the
number of times a neuron wasn't zero on some
validation set, and other creative methods.
• After pruning, the accuracy will drop (hopefully not
too much if the ranking is clever), and the network is
usually trained further to recover.
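As a rough illustration of magnitude-based ranking, the sketch below scores the neurons of a single dense layer by the L1 norm of their incoming weights and zeroes out the lowest-ranked fraction. The layer shape, the 30% pruning fraction and the prune_neurons helper are illustrative assumptions, not part of the lecture; in a real workflow the network would be fine-tuned afterwards to recover accuracy.

```python
# A minimal sketch of magnitude-based (structured) neuron pruning in NumPy.
import numpy as np

def prune_neurons(W, b, prune_fraction=0.3):
    """Zero out the lowest-ranked neurons of one dense layer.

    W: weight matrix of shape (inputs, neurons); b: bias vector (neurons,).
    Neurons are ranked by the L1 norm of their incoming weights; the
    lowest-ranked fraction is zeroed out, creating structured sparsity.
    """
    scores = np.abs(W).sum(axis=0)                 # L1 score per neuron
    n_prune = int(prune_fraction * W.shape[1])
    prune_idx = np.argsort(scores)[:n_prune]       # lowest-scoring neurons
    W_pruned, b_pruned = W.copy(), b.copy()
    W_pruned[:, prune_idx] = 0.0
    b_pruned[prune_idx] = 0.0
    return W_pruned, b_pruned

# Example: prune 30% of the neurons of a randomly initialised layer.
W = np.random.randn(128, 64).astype(np.float32)
b = np.zeros(64, dtype=np.float32)
W_p, b_p = prune_neurons(W, b, prune_fraction=0.3)
print("zeroed neurons:", int((np.abs(W_p).sum(axis=0) == 0).sum()))
```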
Pruning Algorithms (for Neural
Network)

Pruning for SVM: https://www.atlantis-press.com/article/10.pdf


Dimensionality Reduction
• When dealing with real-world problems, we often deal
with high-dimensional data that can run into millions of
data points.
• Depending upon your algorithm selection, techniques
like PCA (Principal Component Analysis) and RFE (Recursive
Feature Elimination) can prove quite useful in
reducing data dimensions and hence the model's space
requirements.
• This method is used primarily for non-neural-network-based
ML algorithms.
• Other techniques, such as pruning and the ones in forthcoming
sections, are more suitable for optimizing deep
learning models.
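A minimal sketch of this idea with scikit-learn: PCA reduces the feature dimensionality before a classical (non-neural-network) model is fit. The random data, the 0.95 variance target and the logistic-regression stage are illustrative assumptions, not values from the lecture.

```python
# Reduce input dimensionality with PCA before fitting a classical model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.random.rand(1000, 300)            # 1000 samples, 300 raw features
y = np.random.randint(0, 2, size=1000)   # dummy binary labels

# Keep enough components to explain ~95% of the variance, shrinking the
# feature space (and the downstream model) considerably.
model = make_pipeline(PCA(n_components=0.95), LogisticRegression(max_iter=1000))
model.fit(X, y)
print("components kept:", model.named_steps["pca"].n_components_)
```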
Quantization
• Quantization techniques are particularly effective when applied during training and can
improve inference speed by reducing the number of bits used for model weights and
activations.
• For example, using an 8-bit fixed-point representation instead of 32-bit floats can speed up
model inference, reduce power and further reduce size by 4x (see the sketch after this list).
• We will look at several techniques and useful tips as we move on with our discussion.
1. Reduce parameter count with pruning and structured pruning. Practically, this means setting some of the
neural network parameters' values to zero, thus creating a sparse neural net (matrix).
Sparse matrices tend to compress better, resulting in overall model size
reduction.
2. Reduce representational precision with quantization. Quantizing deep neural networks
uses techniques that allow for reduced-precision representations of weights and,
optionally, activations, for both storage and computation. Weight pruning,
when combined with quantization, results in compound benefits.
3. Update the original model topology to a more efficient one with reduced parameters or
faster execution, for example tensor decomposition methods and distillation.
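To make the 4x size claim concrete, here is a minimal NumPy sketch of 8-bit affine quantization of a float32 weight tensor. The helper names and tensor shape are illustrative assumptions; production toolchains (such as TensorFlow Lite, covered later) handle this automatically.

```python
# 8-bit affine (asymmetric) quantization of a float32 weight tensor.
import numpy as np

def quantize_uint8(w):
    """Map float32 values to uint8 using a scale and a zero point."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 or 1.0
    zero_point = np.round(-w_min / scale).astype(np.uint8)
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale, zp = quantize_uint8(w)
w_hat = dequantize(q, scale, zp)
print("size reduction:", w.nbytes / q.nbytes)            # ~4x (32-bit -> 8-bit)
print("max abs error:", float(np.abs(w - w_hat).max()))  # small quantization error
```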
Palettization
• Map the weights of a model to a discrete set of precomputed (or
learned) values.
• Inspired by an artist’s palette, the idea is to map many similar values
to one average or approximate value, then use those new values for
computing inference.
• In this way, palettization is similar in spirit to algorithmic
memoization, a dictionary, or a look-up table.
• Palettization can make a model smaller but does not make a model
faster since it incurs look-up time.
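A minimal sketch of the idea, assuming k-means clustering (via scikit-learn) as the way the palette is learned: each weight is replaced by an index into a 16-entry palette, so the tensor can be stored as 4-bit indices plus a small look-up table. The tensor shape and palette size are illustrative assumptions.

```python
# Palettization: cluster a layer's weights into a small palette of centroids.
import numpy as np
from sklearn.cluster import KMeans

w = np.random.randn(512, 128).astype(np.float32)

kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(w.reshape(-1, 1))
palette = kmeans.cluster_centers_.flatten()               # 16 representative values
indices = kmeans.labels_.astype(np.uint8).reshape(w.shape)

# At inference time, weights are looked up (or reconstructed) from the palette.
w_palettized = palette[indices]
print("unique weight values:", np.unique(w_palettized).size)   # 16
print("max abs error:", float(np.abs(w - w_palettized).max()))
```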
Regularization
• One of the most common problems data science professionals face is overfitting.
• Avoiding overfitting can single-handedly improve your model's performance. L1 and L2
(weight decay) are the most common types of regularization.
• These update the general cost function by adding another term known as the
regularization term. Concretely: Cost function = Loss (say, binary cross-entropy) +
Regularization term.
• The regularization term pushes the weight matrices W to be reasonably close to zero. One piece of intuition is
that driving the weights of many hidden units close to zero essentially
zeroes out much of the impact of those hidden units.
• We can think of it as zeroing out, or at least reducing, the impact of a lot of the hidden
units, so you end up with what feels like a simpler network.
• It turns out that in practice the network still uses all the hidden units, but each
of them just has a much smaller effect.
• You do end up with a simpler network, as if you had a smaller network that is
therefore less prone to overfitting.
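A minimal Keras sketch of L2 (weight decay) regularization; the layer sizes and the 0.01 regularization factor are illustrative choices, not values from the lecture.

```python
# Add an L2 regularization term to the cost function via kernel_regularizer.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,),
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(1, activation="sigmoid",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),
])

# Cost function = binary cross-entropy loss + 0.01 * sum of squared weights,
# which pushes the weight matrices W towards zero during training.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```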
Hyper Parameter Tuning
• By tuning hyper-parameters, efficient networks can be engineered.
• This in turn results in superior run-time performance under
resource-constrained situations, such as on an IoT device or
MEC node.
• Some of the popular approaches are:
• Neural network depth — Concretely, the depth of a NN and the number of
neurons per hidden layer enhance the model's capability to work with more
complex decision boundaries. The selection of the number of neurons per
layer and the number of layers constitutes what's called the network
architecture. There is no hard and fast rule when it comes to deciding on the
hidden layer(s) dimensions; rather, your architectural choices will be based
on the empirical results obtained from different combinations. Typically,
you will treat the number of hidden layers as a tunable hyper-parameter.
Hyper Parameter Tuning
• Dropout ratio — Dropout is a technique to fight overfitting and improve neural
network generalization. It is one of the most interesting types of regularization
techniques and is frequently used in the field of deep learning.
• Dropout acts as a defensive mechanism against model over-fitting. At every iteration, it
randomly selects some nodes and removes them, along with all of their incoming and
outgoing connections. The dropout ratio is a hyper-parameter that controls the zeroing
out of neurons/weights and is supported by all major ML libraries such as Keras. The value
is normally set between 0.25 and 0.50.
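A minimal Keras sketch treating the number of hidden layers and the dropout ratio as tunable hyper-parameters; the input size, layer widths and candidate depths are illustrative assumptions, and in practice each configuration would be compared on a validation set.

```python
# Network depth and dropout ratio as tunable hyper-parameters.
import tensorflow as tf

def build_model(num_hidden_layers=2, units_per_layer=64, dropout_ratio=0.3):
    layers = [tf.keras.Input(shape=(32,))]
    for _ in range(num_hidden_layers):
        layers.append(tf.keras.layers.Dense(units_per_layer, activation="relu"))
        # Randomly zero out `dropout_ratio` of the activations during training.
        layers.append(tf.keras.layers.Dropout(dropout_ratio))
    layers.append(tf.keras.layers.Dense(10, activation="softmax"))
    return tf.keras.Sequential(layers)

# Compare candidate depths (smaller networks are cheaper to run at the edge).
for depth in (1, 2, 3):
    model = build_model(num_hidden_layers=depth, dropout_ratio=0.25)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    print("hidden layers:", depth, "parameters:", model.count_params())
```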
Hardware Acceleration
• At their core, neural networks are multi-dimensional arrays (matrices or
tensors) combined through mathematical operations like addition and
multiplication.
• Specialized hardware such as FPGAs, TPUs or GPUs rapidly manipulate
and alter memory to accelerate the overall process, i.e. model
training and execution.
• Edge TPU is Google’s purpose-built ASIC designed to run AI at the
edge. It delivers high performance in a small physical and power
footprint, enabling the deployment of high-accuracy AI at the edge.
• Other solutions, like AI2GO from Xnor.ai, provide pre-built, purpose-built
models that can run autonomously on small, inexpensive devices,
including the Raspberry Pi, with no connection to the Internet or a central
cloud needed.
Lightweight Frameworks
• In May 2017 Google introduced TensorFlow Lite for
mobile and edge device development.
• It is designed to make it easy to perform machine
learning at the edge, instead of sending data back and
forth to a server.
• TensorFlow Lite works with a huge range of devices,
from tiny microcontrollers to powerful mobile phones.
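A minimal sketch of running on-device inference with the TensorFlow Lite interpreter; "model.tflite" is a placeholder path and the dummy input is purely illustrative.

```python
# Load a converted TFLite model and run one inference locally (no server round trip).
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder path
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input of the right shape and dtype, then read the prediction.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])
print(prediction.shape)
```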
Putting it all together using Pipelines
Purpose Built Frameworks
• Frameworks like Learn2Compress from Google generalize this
learning by incorporating several state-of-the-art techniques
for compressing neural network models.
• It takes as input a large pre-trained TensorFlow model
provided by the user, performs training and
optimization and automatically generates ready-to-use
on-device models that are smaller in size, more
memory-efficient, more power-efficient and faster at
inference with minimal loss in accuracy.
TFLite-based Model Optimization
Types of optimization supported by
TFLite
• TensorFlow Lite currently supports optimization via
quantization, pruning and clustering.
• These are part of the TensorFlow Model Optimization Toolkit,
which provides resources for model optimization
techniques that are compatible with TensorFlow Lite.
Quantization
• Quantization works by reducing the precision of the numbers used to represent a model's
parameters, which by default are 32-bit floating point numbers, resulting in smaller model size
and faster computation.
• The following types of quantization are available in TensorFlow Lite:
Technique | Data requirements | Size reduction | Accuracy | Supported hardware
Post-training float16 quantization | No data | Up to 50% | Insignificant accuracy loss | CPU, GPU
Post-training dynamic range quantization | No data | Up to 75% | Smallest accuracy loss | CPU, GPU (Android)
Post-training integer quantization | Unlabelled representative sample | Up to 75% | Small accuracy loss | CPU, GPU (Android), EdgeTPU
Quantization-aware training | Labelled training data | Up to 75% | Smallest accuracy loss | CPU, GPU (Android), EdgeTPU
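A minimal sketch of applying the post-training options from the table during TFLite conversion; "saved_model_dir", the output file name and the representative-dataset stub are placeholders for your own model and calibration data.

```python
# Post-training quantization during TensorFlow Lite conversion.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder

# Dynamic range quantization: needs no data and shrinks weights roughly 4x.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# For full integer quantization, additionally supply a small unlabelled
# representative sample so activation ranges can be calibrated, e.g.:
# def representative_dataset():
#     for img in rep_images[:100]:               # rep_images: your own samples
#         yield [img[None, ...].astype("float32")]
# converter.representative_dataset = representative_dataset

tflite_model = converter.convert()
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```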
Quantization Decision Tree
Quantization Numbers
Model | Top-1 Accuracy (Original / Post-Training Quantized / Quantization-Aware Training) | Latency in ms (Original / Post-Training Quantized / Quantization-Aware Training) | Size in MB (Original / Optimized)
Mobilenet-v1-1-224 | 0.709 / 0.657 / 0.70 | 124 / 112 / 64 | 16.9 / 4.3
Mobilenet-v2-1-224 | 0.719 / 0.637 / 0.709 | 89 / 98 / 54 | 14 / 3.6
Inception_v3 | 0.78 / 0.772 / 0.775 | 1130 / 845 / 543 | 95.7 / 23.9
Resnet_v2_101 | 0.770 / 0.768 / N/A | 3973 / 2868 / N/A | 178.3 / 44.9
Full integer quantization: int16 activations,
int8 weights

• Quantization with int16 activations is a full integer quantization scheme with


activations in int16 and weights in int8.
• This mode can improve the accuracy of the quantized model in comparison to the full
integer quantization scheme with both activations and weights in int8, while keeping a
similar model size.
• It is recommended when activations are sensitive to quantization.
• NOTE: Currently only non-optimized reference kernel implementations are available in
TFLite for this quantization scheme, so by default the performance will be slow
compared to int8 kernels.
• Full advantages of this mode can currently be accessed via specialised hardware, or
custom software.
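A minimal sketch, assuming a SavedModel as input, of requesting the int16-activation/int8-weight scheme during conversion; the random calibration data stands in for a real representative sample, and "saved_model_dir" is a placeholder path.

```python
# 16x8 scheme: int16 activations, int8 weights, selected via target_spec.
import numpy as np
import tensorflow as tf

# Dummy calibration data; in practice use a small unlabelled representative sample.
calibration_data = np.random.rand(100, 224, 224, 3).astype("float32")

def representative_dataset():
    for sample in calibration_data:
        yield [sample[None, ...]]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]
tflite_16x8_model = converter.convert()
```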
Development
workflow
• As a starting point, hosted models are studied to see if they
could work for the application.
• If not, the post-training quantization tool based approach is
typically adopted as it is broadly applicable and does not require
training data.
• For cases where the accuracy and latency targets are not met, or
hardware accelerator support is important,
quantization-aware training is considered the better option.
• Note: See additional optimization techniques under the
TensorFlow Model Optimization Toolkit.
• For further reduction in model size, pruning and/or clustering
prior to quantizing models is the typical approach.
Model Optimization Sequence
1. Choose the best model for the task: Depending on the task, you will need to make a
tradeoff between model complexity and size. If your task requires high accuracy, then you
may need a large and complex model. For tasks that require less precision, it is better to use
a smaller model because it not only uses less disk space and memory, but it is also
generally faster and more energy efficient.
2. Pre-optimized models: See if any existing TensorFlow Lite pre-optimized models provide
the efficiency required by your application.

3. Post-training tooling
If you cannot use a pre-trained model for your application, try using
TensorFlow Lite post-training quantization tools during TensorFlow Lite conversion, which can
optimize your already-trained TensorFlow model. See the post-training quantization tutorial to
learn more.
4. Training-time tooling
If the above simple solutions don't satisfy your needs, you may need to involve training-time
optimization techniques. Optimize further with our training-time tools and dig deeper.
Optimize further
• When pre-optimized models and post-training tools do not satisfy
your use case, the next step is to try the different training-time
tools.
• Training time tools piggyback on the model's loss function over the
training data such that the model can "adapt" to the changes
brought by the optimization technique.
• The starting point to use our training APIs is a Keras training script,
which can be optionally initialized from a pre-trained Keras model to
further fine tune.
• Training time tools available for you to try:
• Weight pruning
• Quantization
• Weight clustering
• Collaborative optimization
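A minimal sketch of one of these training-time tools, quantization-aware training with the TensorFlow Model Optimization Toolkit (tensorflow_model_optimization); the toy model and the commented-out fit call are illustrative, and in practice you would start from your own (optionally pre-trained) Keras model and data.

```python
# Quantization-aware training: fake-quantization ops are inserted so the model
# can "adapt" to quantization during fine-tuning.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

q_aware_model = tfmot.quantization.keras.quantize_model(base_model)
q_aware_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
# q_aware_model.fit(x_train, y_train, epochs=1)  # fine-tune on your own data
q_aware_model.summary()
```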
TFLite Model Optimizations
Capabilities in the Pipeline
• Quantization
• Selective post-training quantization to exclude certain layers from
quantization.
• Quantization debugger to inspect quantization error losses for each layer.
• Applying quantization-aware training on more model coverage e.g.
TensorFlow Model Garden.
• Quality and performance improvements for post-training dynamic-range
quantization.
• Tensor Compression API to allow compression algorithms such as SVD.
• Pruning / sparsity
• Combine configurable training-time (pruning + quantization-aware training)
APIs.
• Increase sparsity application on TF Model Garden models.
• Sparse model execution support in TensorFlow Lite.
