Architecting To Support Machine Learning
● In ML systems, the behaviour is not specified directly in code but is learned from data
● At the core of the system, there is a model that uses data transformed into features to
perform predictions for particular tasks
● This model can be seen as a compiled software library that is part of a bigger system
TWO MAIN WORKFLOWS
[Diagram] In the development environment, raw data is transformed into features to train a (refined) ML model. In the serving environment, new raw data is transformed into features and the trained ML model produces prediction results. The two workflows form a loop: model development feeds model serving, and feedback from serving drives model refinement.
ARCHITECTING THE SYSTEM
[Diagram] Each workflow is a sequence of steps.
Model development workflow: training data ingestion → data cleansing and normalization → feature engineering → model selection and training → model persistence.
Model serving workflow: new data ingestion → data validation and feature extraction → model transfer and prediction → serving results.
TRAINING DATA INGESTION
Responsibility
● Collect and store raw data for training
Architectural concerns
● Collect and store large volumes of training data, support fast bulk reading
○ Ingestion: Manual, Message broker, ETL Jobs
○ Storage: Object Storage, SQL or NoSQL, HDFS
● Labeling of raw training data
○ Data labeling toolkits: Intel's CVAT, Amazon SageMaker Ground Truth
● Protect access to sensitive data
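A minimal ingestion sketch in Python, assuming a Kafka topic feeding an S3-compatible object store; the topic name, bucket, and batch size below are hypothetical:

import json
import time

import boto3
from kafka import KafkaConsumer

# Consume raw training records from Kafka and land them in object storage
# in batches, so the training side can do fast bulk reads later.
s3 = boto3.client("s3")  # credentials come from the usual AWS config chain
consumer = KafkaConsumer(
    "raw-sensor-data",                       # hypothetical topic
    bootstrap_servers=["broker:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 1000:                   # flush in coarse-grained objects
        key = f"raw/training/{int(time.time())}.json"
        s3.put_object(Bucket="ml-raw-data",  # hypothetical bucket
                      Key=key,
                      Body=json.dumps(batch).encode("utf-8"))
        batch = []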
DATA CLEANSING AND NORMALIZATION
Responsibility
● Identify and remove errors and duplicates from selected data and perform data conversions (such as normalization) to create a reliable data set.
Architectural concerns
● Provide mechanisms such as APIs to support query and visualization of the data
○ Data warehouse to support data analysis, such as Hive
● Transform large volumes of raw training data
○ Data processing framework, such as Spark
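A sketch of a batch cleansing job in Spark (the framework named above); the input path, column names, and outlier bounds are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleansing").getOrCreate()
raw = spark.read.parquet("s3a://ml-raw-data/raw/training/")  # hypothetical path

clean = (raw.dropDuplicates()                            # remove exact duplicates
            .na.drop(subset=["sensor_id", "value"])      # drop rows missing key fields
            .filter(F.col("value").between(-1e6, 1e6)))  # discard implausible readings

# Min-max normalization of the measurement column
stats = clean.agg(F.min("value").alias("lo"), F.max("value").alias("hi")).first()
clean = clean.withColumn(
    "value_norm", (F.col("value") - stats["lo"]) / (stats["hi"] - stats["lo"]))

clean.write.mode("overwrite").parquet("s3a://ml-clean-data/training/")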
FEATURE ENGINEERING
Responsibility
● Perform data transformations and augmentation to incorporate additional knowledge into the training data
● Identify the list of features to use for training
Architectural concerns
● Transform large volumes of raw training data into features
● Provide mechanisms for data segregation (training / testing)
● Feature logging and versioning
○ Data versioning mechanism, such as Data Science Version Control System (DVC)
○ Use of a feature store, a data management platform that stores feature data, feature engineering logic, metadata, and a registry for publishing and discovering features and training datasets
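A sketch of the feature step in Spark, continuing the hypothetical columns from the cleansing example; the versioned output path is a naive stand-in for a feature store or DVC:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("features").getOrCreate()
clean = spark.read.parquet("s3a://ml-clean-data/training/")

# Derive per-sensor aggregate features (placeholders for real domain features)
features = (clean.groupBy("sensor_id")
                 .agg(F.avg("value_norm").alias("mean_value"),
                      F.stddev("value_norm").alias("std_value"),
                      F.count("*").alias("n_obs")))

# Segregate training and test data up front so evaluation data never leaks
train, test = features.randomSplit([0.8, 0.2], seed=42)

# Version the feature snapshot by path convention ("v1")
train.write.mode("overwrite").parquet("s3a://ml-features/v1/train/")
test.write.mode("overwrite").parquet("s3a://ml-features/v1/test/")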
MODEL SELECTION AND TRAINING
Responsibility
● Based on a selected algorithm, train, tune, and evaluate a model.
Architectural concerns
● Selection of a framework
○ TensorFlow, PyTorch, Spark MLlib, scikit-learn, etc.
● Select the training location, provision the environment, and manage resources to train, tune, and evaluate a model
○ Single vs distributed training, Hardware acceleration (GPU/TPU)
○ Resource management (e.g. YARN, Kubernetes)
● Log and monitor training performance metrics
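A minimal training-and-tuning sketch with scikit-learn (one of the frameworks named above); the synthetic dataset and parameter grid are placeholders, and the printed dictionary stands in for a real metrics tracker:

import json
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Tune over a small grid and evaluate with cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid={"n_estimators": [100, 300],
                                  "max_depth": [None, 10]},
                      cv=3, n_jobs=-1)
start = time.time()
search.fit(X_train, y_train)

# Log training performance metrics (an experiment tracker would replace this)
metrics = {"best_params": search.best_params_,
           "cv_score": search.best_score_,
           "test_accuracy": accuracy_score(y_test, search.predict(X_test)),
           "train_seconds": time.time() - start}
print(json.dumps(metrics, indent=2))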
MODEL PERSISTENCE
Responsibility
● Persist the trained and tuned model (or the entire pipeline) to support transfer to the serving environment
Architectural concerns
● Persistence of the model
○ Examples: Spark MLlib Pipelines, PMML, MLeap, ONNX
● Storage of the model
○ Examples: Database, document storage, object storage, NFS, DVC
● Optimize the model after training (e.g. reduce its size for use on constrained devices)
○ Example: TensorFlow Model Optimization Toolkit
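A sketch of ONNX persistence (one of the formats listed above) for a scikit-learn model via the skl2onnx converter; the model itself and the local file path are illustrative:

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Convert to ONNX so the serving environment is not tied to scikit-learn
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, X.shape[1]]))])

with open("model.onnx", "wb") as f:  # in practice, write to object storage or DVC
    f.write(onnx_model.SerializeToString())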
NEW DATA INGESTION
Responsibility
● Obtain and import unseen data for predictions
Architectural concerns
● Batch prediction: asynchronously generate predictions for multiple input data observations.
● Online (or real-time) prediction: synchronously generate predictions for individual data observations (request/response or streaming).
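A sketch of the online mode as a request/response endpoint in Flask; the model artifact and payload shape are hypothetical, and batch prediction is noted in comments for contrast:

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical artifact from training

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()     # expects {"features": [[...], ...]}
    preds = model.predict(payload["features"]).tolist()
    return jsonify({"predictions": preds})

# Batch prediction would instead read many observations at once (e.g. a
# nightly job over a Parquet file) and write the results back to storage.
if __name__ == "__main__":
    app.run(port=8080)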
DATA VALIDATION AND FEATURE EXTRACTION
Responsibility
● Process raw data into features according to the transformation rules defined during model development
Architectural concerns
● Ensure data conforms to the characteristics defined during training
○ Usage of a data schema defined during model development
○ Monitoring of changes in input distributions
● Design batch and/or streaming pipelines
○ Real-time data storage (e.g. Cassandra)
○ Data processing framework (e.g. Spark)
● Select and query additional real-time data sources (for feature extraction)
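A sketch of schema enforcement with pandas; the schema contents are hypothetical and would normally be exported from the model development environment:

import pandas as pd

# Schema captured during model development: expected columns, dtypes, ranges
SCHEMA = {
    "sensor_id": {"dtype": "int64"},
    "value": {"dtype": "float64", "min": -1e6, "max": 1e6},
}

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Reject serving-time data that does not conform to the training-time schema."""
    for col, rules in SCHEMA.items():
        if col not in df.columns:
            raise ValueError(f"missing column: {col}")
        if str(df[col].dtype) != rules["dtype"]:
            raise ValueError(f"{col}: expected {rules['dtype']}, got {df[col].dtype}")
        if "min" in rules and (df[col] < rules["min"]).any():
            raise ValueError(f"{col}: values below training-time minimum")
        if "max" in rules and (df[col] > rules["max"]).any():
            raise ValueError(f"{col}: values above training-time maximum")
    return df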
MODEL TRANSFER AND PREDICTION
Responsibility
● Transfer (deploy) the model code and perform predictions
Architectural concerns
● Model transfer and validation
○ Transfer: re-writing, Docker, PMML…
● Rolling out / rolling back new model versions
○ Blue-green deployment, canary testing
○ Support for multiple model versions, with update and rollback mechanisms, for example using TensorFlow Serving
● Define prediction location
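A sketch of canary routing between two model versions; the artifact paths are hypothetical, and TensorFlow Serving offers this kind of versioned rollout natively:

import random

import joblib

# Two model versions loaded side by side; v2 is the canary under evaluation
models = {"v1": joblib.load("models/v1.joblib"),
          "v2": joblib.load("models/v2.joblib")}
CANARY_FRACTION = 0.05  # send 5% of traffic to the new version

def route(features):
    version = "v2" if random.random() < CANARY_FRACTION else "v1"
    prediction = models[version].predict([features])[0]
    return {"model_version": version, "prediction": prediction}

# Rolling back is then a one-line change: set CANARY_FRACTION to 0.0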
PREDICTION LOCATION
● Local model: the model predicts/re-trains on the client side, with the ML model deployed on the client machine
SERVING RESULTS
Responsibility
● Delivery and post-processing of prediction results to a destination
Architectural Concerns
● Monitor model staleness (age) and performance
● Monitor predictions vs. actual values when possible
● Monitor deviations between the distributions of predicted and observed labels
● Store prediction results
● Aggregate results from multiple models
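A sketch of distribution-deviation monitoring using a two-sample Kolmogorov-Smirnov test from SciPy; the significance threshold is a placeholder:

from scipy.stats import ks_2samp

def check_drift(reference_preds, live_preds, alpha=0.05):
    """Compare the distribution of predictions observed in production against
    the distribution seen at training/validation time."""
    stat, p_value = ks_2samp(reference_preds, live_preds)
    return {"ks_statistic": stat, "p_value": p_value, "drift": p_value < alpha}

# e.g. check_drift(validation_predictions, last_24h_predictions)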
CASE STUDIES
NEW DOMAIN UNDERSTANDING
CASE STUDY: DISTRIBUTED IOT NETWORK ACROSS OIL & GAS PRODUCTION SOLUTION DESIGN
• SoftServe worked with two Fortune 100 companies – a hardware and networking provider, and an energy exploration and production company – to research the oil extraction process
• SoftServe suggested a solution and architecture design to match the client need for a distributed fiber-optic sensing (IoT) program
• SoftServe's end solution focused on unsupervised anomaly detection to help the end client identify observations that do not conform to the expected behavioral patterns
ARCHITECTURAL DRIVERS
• Ingest and process multi-dimensional time series data from sensing equipment
(100-200GB per day)
• Calculate the key metrics and perform short- and long-term predictions in near
real-time (up to 5 mins)
• Continuously re-train the model when new data comes in
• Initial training dataset consisted of ~300GB
• Support queries against historical data for analytics
ARCHITECTURAL DECISION [MODEL DEV]
1. Training Data Ingestion
• HDFS used as a storage layer
• Directory structure for data versioning
• Custom data conversion from the proprietary data protocol
ARCHITECTURAL DECISION [MODEL DEV]
2. Data Cleansing and Normalization
• Spark SQL and Dataframes for analytics
• Batch Spark jobs for data pre-processing
ARCHITECTURAL DECISION [MODEL DEV]
3. Feature Engineering
• Batch Spark job to calculate the features
• Selected features were stored in CrateDB and exposed via SQL
ARCHITECTURAL DECISION [MODEL DEV]
4. Model Training and Selection
• Spark ML for model training and tuning
• YARN resource management
• No hardware acceleration was used
ARCHITECTURAL DECISION [MODEL DEV]
5. Model Persistence
• The resulting models were stored on HDFS
ARCHITECTURAL DECISION [MODEL SERVING]
1. New Data Ingestion
• Kafka used as a message broker to ingest the data from the sensors
ARCHITECTURAL DECISION [MODEL SERVING]
2. Data validation and Feature extraction
• Same batch transformations re-used in Spark Streaming
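A sketch of how one transformation function can serve both paths in Spark, assuming the Kafka topic and message shape from the ingestion examples above:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("serving-features").getOrCreate()

def extract_features(df):
    """The same transformation logic used by the batch feature-engineering job."""
    # placeholder min-max constants carried over from training
    return df.withColumn("value_norm", (F.col("value") + 1e6) / 2e6)

# Batch path (model development)
batch_features = extract_features(spark.read.parquet("s3a://ml-clean-data/training/"))

# Streaming path (model serving): identical logic applied to a Kafka stream
stream = (spark.readStream.format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "raw-sensor-data")
               .load()
               .select(F.from_json(F.col("value").cast("string"),
                                   "sensor_id INT, value DOUBLE").alias("r"))
               .select("r.*"))
stream_features = extract_features(stream)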
ARCHITECTURAL DECISION [MODEL SERVING]
3. Model Prediction
• Batch Spark ML jobs scheduled every 3 mins
ARCHITECTURAL DECISION [MODEL SERVING]
4. Serving Results
• The results were saved back to CrateDB and exposed via Impala
• Zoomdata was used to visualize the data and predictions
OUTCOMES
• Highly scalable distributed IoT platform leveraging state-of-the-art Big Data and Cloud technologies
• Real-time monitoring and user-centric BI analytics
• Custom domain-specific self-learning anomaly detection solution
SMART PARKING
SOLUTION
An innovative SoftServe solution provides automatic parking space detection based on a computer vision ML model.
A CCTV camera installed on a rooftop captures images, and the current parking state is visualized in real time via a web application and an LCD at the parking entrance.
The solution can be used for both open and authorized parking areas.
SUMMARY
● We need to architect to support initial model development and the continuous model
refinement lifecycle.
● The model development and model serving workflow steps can be used as a
framework to guide design and to document design decisions.
QUESTIONS?
Thank you!!