
Architecting to

Support Machine
Learning

Humberto Cervantes, UAM


Iurii Milovanov, SoftServe
Rick Kazman, University of Hawaii
PARTICULARITIES OF ML SYSTEMS

● In ML systems, the behaviour is not specified directly in code but is learned from data

[Figure: in traditional programming, data and a program are given to the computer, which produces output; in machine learning, data and the expected output are given to the computer, which produces a model.]

● At the core of the system, there is a model that uses data transformed into features to
perform predictions for particular tasks
● This model can be seen as a compiled software library that is part of a bigger system
TWO MAIN WORKFLOWS
[Figure: the two main workflows.
Development environment: raw historical data → transformation into features → model selection and training → trained ML model.
Serving environment: new raw data → transformation into features → trained ML model → results derived from prediction.
The data transformation rules and the trained model are transferred from the development environment to the serving environment; data and results flow back from serving to development to refine the model and to support automatic retraining.]
MODEL DEVELOPMENT LIFECYCLE

Shorter cycles of model development/refinement are needed

[Figure: model development → model serving → model refinement → serving of the refined model → further refinement → serving, repeated in short cycles.]
ARCHITECTING THE SYSTEM

● Supporting initial model development and the model refinement lifecycle introduces many architectural concerns:
“Architectural concerns encompass additional aspects that need to be considered as part of architectural design but which are not expressed as traditional requirements.”
● Identifying them is useful when designing a system that supports ML using a method such as ADD (Attribute-Driven Design)
ARCHITECTING THE SYSTEM
We will now look at the steps of each workflow in more detail and discuss the concerns and the decisions that can be made to satisfy them.

[Figure: the two workflows and their steps (arrows denote activity and data flow).
Model development: training data ingestion → data cleansing and normalization → feature engineering → model selection and training → model persistence.
Model serving: new data ingestion → data validation and feature extraction → model transfer and prediction → serving results.]
TRAINING DATA INGESTION

Responsibility
● Collect and store raw data for training
Architectural concerns
● Collect and store large volumes of training data, support fast bulk reading
○ Ingestion: Manual, Message broker, ETL Jobs
○ Storage: Object Storage, SQL or NoSQL, HDFS
● Labeling of raw training data
○ Data labeling toolkit: Intel’s CVAT, Amazon SageMaker Ground Truth
● Protect access to sensitive data
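As an illustration of the ingestion and storage concerns above, here is a minimal sketch that uploads raw files to S3-compatible object storage for later bulk reads; the bucket name, folder layout, and use of boto3 are assumptions for illustration, not part of the original material.

```python
import pathlib
from datetime import date

import boto3

s3 = boto3.client("s3")
BUCKET = "raw-training-data"  # hypothetical bucket name

for path in pathlib.Path("/data/incoming").glob("*.csv"):
    # Group objects by ingestion date so later bulk reads can scan date ranges.
    key = f"raw/{date.today().isoformat()}/{path.name}"
    s3.upload_file(str(path), BUCKET, key)
    print(f"uploaded {path} -> s3://{BUCKET}/{key}")
```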
DATA CLEANSING AND NORMALIZATION

Responsibility
● Identify and remove errors and duplicates from
selected data and perform data conversions
(such as normalization) to create a reliable data set.
Architectural concerns
● Provide mechanisms such as APIs to support query and visualization of the data
○ Data warehouse to support data analysis, such as Hive
● Transform large volumes of raw training data
○ Data processing framework, such as Spark
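A minimal cleansing and normalization sketch using Spark, the data processing framework mentioned above; the column names, storage paths, and the min-max normalization choice are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleansing").getOrCreate()

# Hypothetical raw-data location and column names.
df = spark.read.parquet("s3a://raw-training-data/raw/")

clean = (
    df.dropDuplicates(["sensor_id", "timestamp"])   # remove duplicate readings
      .na.drop(subset=["value"])                    # drop rows with missing values
)

# Min-max normalization of the "value" column to [0, 1].
stats = clean.agg(F.min("value").alias("lo"), F.max("value").alias("hi")).first()
clean = clean.withColumn(
    "value_norm", (F.col("value") - stats["lo"]) / (stats["hi"] - stats["lo"])
)

clean.write.mode("overwrite").parquet("s3a://curated-training-data/clean/")
```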
FEATURE ENGINEERING

Responsibility
● Perform data transformations and augmentation to
incorporate additional knowledge into the training data
● Identify the list of features to use for training
Architectural concerns
● Transform large volumes of raw training data into features
● Provide a mechanism for data segregation (training / testing)
● Feature logging and versioning
○ Data versioning mechanism, such as Data Science Version Control System (DVC)
○ Use of a feature store, a data management platform that stores feature data, feature
engineering logic, metadata, and a registry for publishing and discovering features
and training datasets
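A small feature engineering sketch illustrating data segregation into training and test sets; pandas/scikit-learn, the rolling-window features, and all column names are assumptions used only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical curated data set with "timestamp", "value_norm" and "label" columns.
df = pd.read_parquet("clean.parquet")

# Rolling-window features that incorporate recent history into each observation.
df = df.sort_values("timestamp")
df["value_mean_5"] = df["value_norm"].rolling(window=5, min_periods=1).mean()
df["value_std_5"] = df["value_norm"].rolling(window=5, min_periods=1).std().fillna(0.0)

# Data segregation: keep a held-out test set for later model evaluation.
features = ["value_norm", "value_mean_5", "value_std_5"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["label"], test_size=0.2, random_state=42
)
```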
MODEL SELECTION AND TRAINING
Responsibility
● Based on a selected algorithm, train, tune and
evaluate a model.
Architectural concerns
● Selection of a framework
○ TensorFlow, PyTorch, Spark MLlib, scikit-learn, etc.
● Select training location and provide environment and manage resources to train,
tune and evaluate a model
○ Single vs distributed training, Hardware acceleration (GPU/TPU)
○ Resource Management (e.g. Yarn, Kubernetes)
● Log and monitor training performance metrics
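A minimal training and evaluation sketch with scikit-learn, one of the frameworks listed above, using synthetic data as a stand-in for the engineered features; the algorithm and metrics are illustrative, not prescribed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature set.
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Log training-performance metrics so they can be monitored across runs.
preds = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print("macro F1:", f1_score(y_test, preds, average="macro"))
```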
MODEL PERSISTENCE

Responsibility
● Persist the trained and tuned model (or entire
pipeline) to support transfer to the serving
environment
Architectural concerns
● Persistence of the model
○ Examples: Spark MLlib Pipelines, PMML, MLeap, ONNX
● Storage of the model
○ Examples: Database, document storage, object storage, NFS, DVC
● Optimize the model after training (e.g. reduce its size for use on a constrained device)
○ Example: TensorFlow Model Optimization Toolkit
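A sketch of persisting a trained model in ONNX, one of the formats listed above; the use of scikit-learn and the skl2onnx converter is an assumption for illustration.

```python
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small stand-in model (10 features) to have something to persist.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Convert to ONNX; the input signature is a float tensor with 10 features.
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, 10]))]
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
# The file can then be stored in a database, object storage, NFS, or versioned with DVC.
```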
NEW DATA INGESTION

Responsibility
● Obtain and import unseen data for predictions
Architectural concerns
● Batch prediction: asynchronously generate predictions for multiple input data
observations.
● Online (or real-time) prediction: synchronously generate predictions for individual
data observations (request/response or streaming).
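A minimal online (request/response) prediction sketch; Flask, the endpoint path, and the joblib-persisted model are assumptions, since no specific serving framework is prescribed here.

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical persisted model

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [0.1, 0.3, 0.7]}.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(port=8080)
```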
DATA VALIDATION AND FEATURE EXTRACTION

Responsibility
● Process raw data into features according to
the transformation rules defined during model
development
Architectural concerns
● Ensure data conforms to the characteristics defined during training
○ Usage of a data schema defined during model development
○ Monitoring of changes in input distributions
● Design batch and/or streaming pipelines
○ Real-time data storage (e.g. Cassandra)
○ Data processing framework (e.g. Spark)
● Select and query additional real-time data sources (for feature extraction)
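A sketch of validating incoming records against a schema defined during model development, as in the concern above; the schema fields and value range are hypothetical.

```python
# Expected schema and value range observed during training (hypothetical).
SCHEMA = {
    "sensor_id": str,
    "timestamp": float,
    "value": float,
}
VALUE_RANGE = (0.0, 1000.0)

def validate(record: dict) -> bool:
    """Return True if the record matches the expected schema and value range."""
    for field, expected_type in SCHEMA.items():
        if field not in record or not isinstance(record[field], expected_type):
            return False
    lo, hi = VALUE_RANGE
    return lo <= record["value"] <= hi

assert validate({"sensor_id": "s-17", "timestamp": 1.7e9, "value": 42.0})
assert not validate({"sensor_id": "s-17", "value": -5.0})  # missing field, out of range
```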
MODEL TRANSFER AND PREDICTION

Responsibility
● Transfer (deploy) the model code and perform predictions
Architectural concerns
● Model transfer and validation
○ Transfer: re-writing, Docker, PMML…
● Rolling out / rolling back a new model version
○ Blue-green deployment, canary testing
○ Support for multiple model versions, update and rollback mechanisms, for example
using TensorFlow Serving
● Define prediction location
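A toy sketch of the canary-rollout concern above: a small fraction of prediction traffic is routed to the candidate model version while the rest stays on the stable one; the 5% share and the model interface are assumptions.

```python
import random

def route_prediction(features, stable_model, canary_model, canary_share=0.05):
    """Send roughly canary_share of prediction requests to the candidate model."""
    model = canary_model if random.random() < canary_share else stable_model
    return model.predict([features])[0]

# Usage, assuming two objects with a scikit-learn-style predict():
# result = route_prediction([0.1, 0.3, 0.7], stable_model, canary_model)
```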
PREDICTION LOCATION
Local model: the model predicts/re-trains on the client side
[Figure: the ML model runs on the client machine.]

Remote model: the model predicts/re-trains on the server side
[Figure: the client machine sends the data for prediction to the ML model on the server machine and receives the results.]

Hybrid model: the model predicts on the client and re-trains on both (federated learning)
[Figure: the client machine holds a local ML model and sends model deltas to the server machine; the server machine holds the global ML model and sends model updates back to the clients.]
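A toy illustration of the hybrid (federated learning) option described above: clients send weight deltas and the server averages them into the global model; the weight shapes and update rule are purely illustrative, not a production protocol.

```python
import numpy as np

global_weights = np.zeros(4)  # toy global ML model (a weight vector)

def client_update(weights):
    """Simulate local re-training on a client and return only the weight delta."""
    local_weights = weights + np.random.normal(scale=0.1, size=weights.shape)
    return local_weights - weights

# The server aggregates deltas from several clients (federated averaging)
# and pushes the updated global model back to the clients.
deltas = [client_update(global_weights) for _ in range(5)]
global_weights = global_weights + np.mean(deltas, axis=0)
print("updated global model:", global_weights)
```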
SERVING RESULTS

Responsibility
● Delivery and post-processing of prediction results
to a destination
Architectural Concerns
● Monitor model staleness (age) and performance
● Monitor predictions vs. actual values when possible
● Monitor deviations between the distributions of predicted and observed labels
● Store prediction results
● Aggregate results from multiple models
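A sketch of the label-distribution monitoring concern above: compare predicted vs. observed label distributions with a simple total-variation distance; the alert threshold is an assumption.

```python
from collections import Counter

def label_distribution(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}

def distribution_deviation(predicted, observed):
    """Total variation distance between two label distributions."""
    p, q = label_distribution(predicted), label_distribution(observed)
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)

deviation = distribution_deviation(["ok"] * 95 + ["anomaly"] * 5,
                                   ["ok"] * 80 + ["anomaly"] * 20)
if deviation > 0.1:  # hypothetical alert threshold
    print(f"label-distribution drift detected: {deviation:.2f}")
```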
CASE STUDIES
CASE STUDY: DISTRIBUTED IOT NETWORK ACROSS OIL & GAS PRODUCTION

NEW DOMAIN UNDERSTANDING
• SoftServe worked with two Fortune 100 companies – a hardware and networking provider, and an energy exploration and production company – to research the oil extraction process
• SoftServe suggested a solution and architecture design to match the client need for a distributed fiber-optic sensing (IoT) program

DOMAIN-SPECIFIC TECHNOLOGY CHALLENGES / LIMITATIONS


• SoftServe suggested 3rd-party sensing hardware and a data protocol to
address industry-specific challenges
• SoftServe designed and deployed a hybrid edge and cloud data
processing model
• We built a real-time BI layer and analytics engine on large-scale data
streams

SOLUTION DESIGN
• SoftServe’s end solution focused on unsupervised anomaly detection to
help the end client identify observations that do not conform to the
expected behavioral patterns
ARCHITECTURAL DRIVERS
• Ingest and process multi-dimensional time series data from sensing equipment
(100-200GB per day)
• Calculate the key metrics and perform short- and long-term predictions in near
real-time (up to 5 mins)
• Continuously re-train the model when the new data comes in
• Initial training dataset consisted of ~300GB
• Support queries against historical data for analytics
ARCHITECTURAL DECISION [MODEL DEV]
1. Training Data Ingestion
• HDFS used as a storage layer
• Directory structure for data versioning
• Custom data conversion from the proprietary data protocol
ARCHITECTURAL DECISION [MODEL DEV]
2. Data Cleansing and Normalization
• Spark SQL and DataFrames for analytics
• Batch Spark jobs for data pre-processing
ARCHITECTURAL DECISION [MODEL DEV]
3. Feature Engineering
• Batch Spark job to calculate the features
• Selected features were stored in CrateDB and exposed via SQL


ARCHITECTURAL DECISION [MODEL DEV]
4. Model Training and Selection
• Spark ML for model training and tuning
• Yarn resource management
• No hardware acceleration was used


ARCHITECTURAL DECISION [MODEL DEV]
5. Model Persistence
• The resulting models were stored on HDFS
ARCHITECTURAL DECISION [MODEL SERVING]
1. New Data Ingestion
• Kafka used as a message broker to ingest the data from the sensors
ARCHITECTURAL DECISION [MODEL SERVING]
2. Data validation and Feature extraction
• Same batch transformations re-used in Spark Streaming
ARCHITECTURAL DECISION [MODEL SERVING]
3. Model Prediction
• Batch Spark ML jobs scheduled every 3 mins
ARCHITECTURAL DECISION [MODEL SERVING]
4. Serving Results
• The results were saved back to CrateDB and exposed via Impala
• Zoomdata was used to communicate the data and predictions


OUTCOMES
• Highly scalable distributed IoT platform
leveraging state-of-the-art Big Data and Cloud
technologies
• Real-time monitoring and user-centric BI
analytics
• Custom domain-specific self-learning anomaly
detection solution
SMART PARKING
SOLUTION
An innovative SoftServe solution provides automatic parking space detection based on a computer vision ML model.
A CCTV camera installed on a rooftop captures images, and the current parking state is visualized in real time via a web application and an LCD at the parking entrance.
The solution can be used for both open and
authorized parking areas.
ARCHITECTURAL DRIVERS

• Deploy to the private on-premise infrastructure


• Perform real-time predictions over a video stream from the 4K IP camera
• Process 5 images per second for 121 parking spots
• Support on-demand re-training and re-deployment
• Initial training dataset consisted of 200,000+ images (SoftServe’s proprietary)
ARCHITECTURAL DECISION [MODEL DEV]
1. Training Data Ingestion
• NFS used as a storage layer for
training data
• Custom image labeling tool for
training data augmentation
ARCHITECTURAL DECISION [MODEL DEV]
2. Data Cleansing and
Normalization
• Custom image processing pipeline
written in Python (split image, lens
correction, color correction, contrast
and brightness correction etc.)
ARCHITECTURAL DECISION [MODEL DEV]
3. Feature Engineering
• Raw image data used for predictions
ARCHITECTURAL DECISION [MODEL DEV]
4. Model Training and
Selection
• TensorFlow/Python for model
training
• Containerized training jobs ran on a
VM and were orchestrated by Ansible
ARCHITECTURAL DECISION [MODEL DEV]
5. Model Persistence
• The resulting models were stored in a
private Git repository (MS TFS)
• Ansible was used to deploy the model as a
dockerized microservice
ARCHITECTURAL DECISION [MODEL SERVING]
1. New Data Ingestion
• A polling job transfers new images
from the edge device
ARCHITECTURAL DECISION [MODEL SERVING]
2. Data Validation and Feature
Extraction
• Same Python transformations
re-used in a Docker-based worker
service
ARCHITECTURAL DECISION [MODEL SERVING]
3. Model Prediction
• Dockerized RESTful microservice
deployed to a VM
ARCHITECTURAL DECISION [MODEL SERVING]
4. Serving Results
• The results were sent to RabbitMQ to
serve multiple components
PRACTICAL RECOMMENDATIONS

● Cloud-native architectures based on containers, microservices, and Kubernetes help
address ML-specific requirements for composability, portability, and scalability
● The open-source ML community has been working on multiple frameworks and libraries
that provide out-of-the-box ML lifecycle management capabilities (e.g. TFX, Kubeflow,
MLflow, and Seldon)
● Tight cooperation between software engineering and ML teams helps significantly
speed up project development and maximize project success
CONCLUSIONS

● Architecting ML systems poses new challenges compared with the development of
traditional systems because the behaviour is not specified directly in code but is
learned from data.

● We need to architect to support initial model development and the continuous model
refinement lifecycle.

● The model development and model serving workflow steps can be used as a
framework to guide design and to document design decisions.
QUESTIONS?

Humberto Cervantes [email protected]


Iurii Milovanov [email protected]
Rick Kazman [email protected]

Thank you!!