ML System Optimization - Lecture 10 - Model Optimization Techniques
Edge ML
Machine Learning Model Optimization for the Intelligent Edge
• Important here is the idea of a feedback loop with online model training/update stages.
• Pre-training and optimization take place in a full-scale central cloud, which can exploit the cloud's effectively unlimited compute and storage.
• Improving the model with more real field data results in a better-functioning model at run time.
• The model can be shared across a number of edge devices, employing them for a number of tasks, with each device working in parallel.
Edge ML Approach Guidelines
1. Select the right chipset, with estimates for energy consumption and compute performance
2. Understand storage requirements over time
3. Understand latency and throughput requirements
4. Understand your workload
5. Understand your traffic patterns
6. Understand your data growth
7. Understand memory/storage requirements
8. Understand and tune hyper-parameters
9. Re-use space and reduce storage overuse
10. Train and save (at core), transfer (model) and load (at edge); see the sketch after this list
11. Containerization of AI services, so they are well orchestrated and manageable
12. Reproducible and secure pipelines (for example, using Kubeflow)
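As a minimal sketch of guideline 10, assuming a TensorFlow/Keras workflow (the model, data, and file name below are placeholders, not from the lecture): train and save at the core, convert for transfer, and load at the edge.

import numpy as np
import tensorflow as tf

# --- At the core (cloud): train and save ---
x_train = np.random.rand(100, 10).astype("float32")  # placeholder training data
y_train = np.random.rand(100, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=2, verbose=0)

# Convert to a compact TFLite flatbuffer for transfer to the edge.
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# --- At the edge: load the transferred model for inference ---
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()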
Benefits of Model Optimization
• Fewer resources required: optimized models can be deployed to edge devices with restrictions on processing, memory, or power consumption, for example mobile and Internet of Things (IoT) devices.
• Efficiency: a reduced model size improves execution efficiency, especially when deployed on the edge.
• Latency: there is no round trip to a server, which also helps with compliance requirements.
• Privacy: no data needs to leave the device or edge gateway, hence better security.
• Connectivity: an Internet connection isn't required for business operation.
• Power consumption: matrix multiplication operations require compute power, so fewer neurons mean lower power consumption.
Model Optimization Techniques
Pruning
• Pruning describes a set of techniques to trim network size (by nodes, not layers) to improve computational performance and sometimes also resolution performance.
• The gist of these techniques is removing nodes from the network during training by identifying those nodes which, if removed, would not noticeably affect network performance (i.e., resolution of the data).
• Even without using a formal pruning technique, you can get a rough idea of which nodes are not important by looking at your weight matrix after training; look for weights very close to zero, as it is the nodes on either end of those weights that are often removed during pruning (a rough check is sketched after this list).
• Pruning neural networks is an old idea going back to 1990 (with Yann LeCun's Optimal Brain Damage work) and before.
• The idea is that among the many parameters in the network, some are redundant and don't contribute much to the output.
• If you could rank the neurons in the network according to how much they contribute, you could then remove the low-ranking neurons from the network, resulting in a smaller and faster network.
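To illustrate the rough check described above, a small NumPy sketch; the weight matrix and the 1e-2 threshold are illustrative placeholders, not values from the lecture.

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=(256, 128))  # stand-in for a trained weight matrix

threshold = 1e-2  # illustrative cut-off for "very close to zero"
near_zero = np.abs(weights) < threshold
print(f"{near_zero.mean():.1%} of weights are below {threshold} in magnitude")
# Nodes whose incident weights are mostly near zero are typical pruning candidates.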
Pruning Algorithms
• By applying a pruning algorithm to your network during training, you can approach an optimal network configuration.
• The ranking of neurons can be done according to the L1/L2 mean of their weights, their mean activations, the number of times a neuron wasn't zero on some validation set, and other creative methods (a minimal ranking sketch follows this list).
• After the pruning, the accuracy will drop (hopefully not too much if the ranking is clever), and the network is usually trained further to recover.
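A minimal sketch of magnitude-based neuron ranking and removal, assuming a dense layer with weight matrix W (shape: inputs x neurons) and an illustrative 50% pruning ratio; it uses the L1 mean of each neuron's incoming weights, one of the criteria listed above.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64))      # stand-in for a trained dense-layer weight matrix

scores = np.abs(W).mean(axis=0)     # L1 mean of each neuron's incoming weights
k = W.shape[1] // 2                 # prune the lowest-ranked 50% (illustrative)
prune_idx = np.argsort(scores)[:k]

W_pruned = W.copy()
W_pruned[:, prune_idx] = 0.0        # remove the low-ranking neurons

# In practice the network is then trained further to recover the accuracy lost here.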
Quantization Techniques (for Neural Networks)

Technique | Data requirements | Size reduction | Accuracy impact | Supported hardware
Post-training integer quantization | Unlabelled representative sample | Up to 75% | Small accuracy loss | CPU, GPU (Android), EdgeTPU
Quantization-aware training | Labelled training data | Up to 75% | Smallest accuracy loss | CPU, GPU (Android), EdgeTPU
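A minimal quantization-aware training sketch, assuming the TensorFlow Model Optimization toolkit (tfmot) and a placeholder model; the labelled training data the table calls for would be passed to fit().

import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])

# Wrap the model with fake-quantization ops so training "sees" quantization error.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(optimizer="adam", loss="mse")
# qat_model.fit(x_train, y_train, epochs=3)  # labelled training data required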
Quantization Decision Tree
Quantization Numbers
Model | Top-1 Accuracy (Original) | Top-1 Accuracy (Post-Training Quantized) | Top-1 Accuracy (Quantization-Aware Training) | Latency (Original, ms) | Latency (Post-Training Quantized, ms) | Latency (Quantization-Aware Training, ms) | Size (Original, MB) | Size (Optimized, MB)
3. Post-training tooling
If you cannot use a pre-trained model for your application, try using the TensorFlow Lite post-training quantization tools during TensorFlow Lite conversion, which can optimize your already-trained TensorFlow model. See the post-training quantization tutorial to learn more.
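A minimal sketch of the post-training quantization flow during TensorFlow Lite conversion (integer quantization calibrated with a representative dataset); the model and the unlabelled sample below are placeholders.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([          # placeholder for an already-trained model
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])
representative_data = np.random.rand(100, 10).astype("float32")  # unlabelled sample

def representative_dataset():
    # Yields batches the converter uses to calibrate quantization ranges.
    for sample in representative_data:
        yield [sample[None, :]]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_quant_model = converter.convert()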
4. Training-time tooling
If the above simple solutions don't satisfy your needs, you may need to turn to training-time optimization techniques. Optimize further with our training-time tools and dig deeper.
Optimize further
• When pre-optimized models and post-training tools do not satisfy your use case, the next step is to try the different training-time tools.
• Training-time tools piggyback on the model's loss function over the training data so that the model can "adapt" to the changes brought by the optimization technique.
• The starting point for using the training APIs is a Keras training script, which can optionally be initialized from a pre-trained Keras model for further fine-tuning.
• Training-time tools available for you to try (a weight-pruning sketch follows this list):
  • Weight pruning
  • Quantization
  • Weight clustering
  • Collaborative optimization
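A minimal weight-pruning sketch with the tfmot training-time API; the model and the schedule values are illustrative placeholders.

import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([          # placeholder Keras model to fine-tune
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])

# Ramp sparsity from 0% to 50% over the first 1000 steps (illustrative values).
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=schedule)
pruned_model.compile(optimizer="adam", loss="mse")
# Fine-tune with the callback that updates pruning masks each step:
# pruned_model.fit(x_train, y_train,
#                  callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])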
TFLite Model Optimizations: Capabilities in the Pipeline
• Quantization
  • Selective post-training quantization to exclude certain layers from quantization.
  • Quantization debugger to inspect quantization error losses per layer.
  • Applying quantization-aware training to broader model coverage, e.g. the TensorFlow Model Garden.
  • Quality and performance improvements for post-training dynamic-range quantization.
  • Tensor Compression API to allow compression algorithms such as SVD.
• Pruning / sparsity
  • Combined configurable training-time (pruning + quantization-aware training) APIs.
  • Increased sparsity application on TF Model Garden models.
  • Sparse model execution support in TensorFlow Lite.