AWS ML Notes - Domain 4 - Monitor Model
Importance of Monitoring in ML
a) Machine Learning Lens (AWS Well-Architected Framework): best practices and design principles
Automate Retraining
Detecting Drift in Monitoring
a) Drift Types
Note: Bias is the inverse of variance, where variance is the level of small fluctuations or noise common in complex data sets. High bias tends to cause model predictions to overgeneralize (underfitting), and high variance tends to cause models to undergeneralize (overfitting to noise). Because of this trade-off, increasing variance, for example by adding model complexity, is one method for reducing the impact of bias.
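A small sketch of this trade-off, under stated assumptions (synthetic quadratic data, numpy available): a degree-1 polynomial (high bias) fits the data worse than a degree-2 polynomial (higher variance, lower bias).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy quadratic data (an assumption for illustration)
x = np.linspace(-1, 1, 40)
y = x**2 + rng.normal(0, 0.05, size=x.shape)

def fit_mse(degree):
    """Fit a polynomial of the given degree and return training MSE."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((y - pred) ** 2))

mse_underfit = fit_mse(1)   # high bias: a straight line overgeneralizes
mse_balanced = fit_mse(2)   # matches the true quadratic relationship
print(mse_underfit > mse_balanced)  # True: the high-bias model fits worse
```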
b) Monitoring Drift
STEPS
To monitor model quality, SageMaker Model Monitor requires the following inputs:
1. Baseline data
2. Inference input and predictions made by the deployed model
3. Ground truth labels (for example, labels collected via Amazon SageMaker Ground Truth) associated with the inputs to the model
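The three inputs above map onto the SageMaker CreateModelQualityJobDefinition API. A minimal sketch of the request shape (abridged, no API call is made; the bucket, endpoint, and role names are hypothetical, and a real request also needs an app specification, output config, and job resources):

```python
# Abridged request for CreateModelQualityJobDefinition, showing where the
# three Model Monitor inputs plug in. All names/ARNs below are placeholders.
model_quality_job = {
    "JobDefinitionName": "churn-model-quality",            # hypothetical name
    "ModelQualityBaselineConfig": {                        # 1. baseline data
        "ConstraintsResource": {
            "S3Uri": "s3://example-bucket/baseline/constraints.json"
        }
    },
    "ModelQualityJobInput": {
        "EndpointInput": {                                 # 2. captured inference input/predictions
            "EndpointName": "churn-endpoint",
            "LocalPath": "/opt/ml/processing/input",
            "InferenceAttribute": "prediction",
        },
        "GroundTruthS3Input": {                            # 3. ground truth labels
            "S3Uri": "s3://example-bucket/ground-truth/"
        },
    },
    "RoleArn": "arn:aws:iam::123456789012:role/ExampleMonitorRole",
}
# e.g. boto3.client("sagemaker").create_model_quality_job_definition(**model_quality_job)
```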
Post-training bias metrics in SageMaker Clarify help us answer two key questions:
• Are all facet values represented at a similar rate in positive (favorable) model predictions?
• Does the model have similar predictive performance for all facet values?
How it works: It quantifies the contribution of each input feature (for example, audio characteristics)
to the model's predictions, helping to explain how the model arrives at its decisions.
Options for using SageMaker Clarify
Uses SHAP
SageMaker Clarify provides feature attributions based on the concept of Shapley value. This is a game-
theoretic approach that assigns an importance value (SHAP value) to each feature for a particular
prediction.
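As an illustration of the Shapley idea (not SageMaker Clarify's own implementation, which approximates SHAP at scale), the exact Shapley value can be computed by enumerating feature coalitions for a tiny model; for a linear model it reduces to w_i * (x_i - baseline_i):

```python
from itertools import combinations
from math import factorial

def shap_values(predict, x, baseline):
    """Exact Shapley values for a small feature set by enumerating coalitions.
    Features outside a coalition are replaced by their baseline value."""
    n = len(x)
    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            for s in combinations(others, size):
                # Shapley kernel weight for a coalition of this size
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi += weight * (value(set(s) | {i}) - value(set(s)))
        phis.append(phi)
    return phis

# Linear model f(z) = 2*z0 + 3*z1: SHAP value of feature i is w_i*(x_i - baseline_i)
predict = lambda z: 2 * z[0] + 3 * z[1]
print(shap_values(predict, [1.0, 2.0], [0.0, 0.0]))  # [2.0, 6.0]
```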
1. SageMaker Clarify: This is the core component that performs the actual bias detection and generates quality metrics and violation reports.
2. SageMaker Model Monitor: This is the framework that can use Clarify's capabilities to
perform continuous monitoring of deployed models.
SageMaker Model Dashboard
Features
1. Alerts:
How it helps: The dashboard provides a record of all activated alerts, allowing the data
scientist to review and analyze past issues.
Alert criteria depend upon two parameters:
• Datapoints to alert: Within the evaluation period, the number of runtime failures (violations) it takes to raise an alert.
• Evaluation period: The number of most recent monitoring executions to consider when evaluating alert status.
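These two parameters can be read as a simple rule; an illustrative sketch (not the dashboard's internal code):

```python
def should_alert(execution_statuses, datapoints_to_alert, evaluation_period):
    """Raise an alert when at least `datapoints_to_alert` of the most recent
    `evaluation_period` monitoring executions ended in a failure/violation."""
    recent = execution_statuses[-evaluation_period:]
    failures = sum(1 for s in recent if s == "failed")
    return failures >= datapoints_to_alert

# Alert if 2 or more of the last 3 runs failed
history = ["ok", "failed", "ok", "failed", "failed"]
print(should_alert(history, datapoints_to_alert=2, evaluation_period=3))  # True
```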
2. Risk rating
A user-specified parameter from the model card with a low, medium, or high value.
3. Endpoint performance
You can select the endpoint column to view performance metrics, such as:
• CPUUtilization: The sum of each individual CPU core's utilization; each core ranges 0%-100%, so the total can exceed 100% on multi-core instances.
• MemoryUtilization: The percentage of memory used by the containers on an instance (0%-100%).
• DiskUtilization: The percentage of disk space used by the containers on an instance (0%-100%).
This information helps you determine whether a model is actively being used for inference.
When training a model, SageMaker creates a model lineage graph, a visualization of the entire ML
workflow from data preparation to deployment.
• Stakeholder notifications: When monitoring metrics indicate changes that impact business
KPIs or the underlying problem
• Data Scientist notification: You can use automated notifications to data scientists when
your monitoring detects data drift or when expected data is missing.
• Model retraining: Configure your model training pipeline to automatically retrain models
when monitoring detects drift, bias, or performance degradation.
• Autoscaling: You use resource utilization metrics gathered by infrastructure monitoring to
initiate autoscaling actions.
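Model Monitor executions run as SageMaker processing jobs, so their state changes can drive the automations above through EventBridge. A hedged sketch of an event pattern that routes failed monitoring runs to a target such as SNS or a retraining pipeline (the exact detail fields you match on should be checked against your account's events):

```python
import json

# EventBridge event pattern matching SageMaker processing-job state changes
# that ended in failure. Attach this pattern to a rule whose target notifies
# stakeholders or starts retraining.
event_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Processing Job State Change"],
    "detail": {"ProcessingJobStatus": ["Failed"]},
}
print(json.dumps(event_pattern))
# e.g. boto3.client("events").put_rule(Name="monitor-failures",
#                                      EventPattern=json.dumps(event_pattern))
```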
Scheduled retraining
• When to use: there are known seasonal patterns; for maintaining model accuracy over time
• Advantages: predictable maintenance schedule; can anticipate and prepare for retraining periods
• Disadvantages: may retrain unnecessarily if no significant changes occur; might miss sudden, unexpected changes
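A scheduled strategy is typically expressed as an EventBridge schedule rule; a minimal sketch (rule name and cron expression are assumptions):

```python
# Illustrative weekly retraining trigger: an EventBridge rule firing every
# Monday at 03:00 UTC. The rule's target would start the retraining pipeline
# (for example, SageMaker Pipelines StartPipelineExecution).
scheduled_rule = {
    "Name": "weekly-retrain",                      # hypothetical rule name
    "ScheduleExpression": "cron(0 3 ? * MON *)",   # EventBridge 6-field cron
    "State": "ENABLED",
}
# e.g. boto3.client("events").put_rule(**scheduled_rule)
```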
4.2 Monitor and Optimize Infrastructure and Costs
4.2.1 Monitor Infrastructure
Monitor Performance Metrics - CloudWatch vs Model Monitor
SageMaker Model Monitor
• Definition: continuous collection and analysis of metrics
• Alert system: set alerts for deviations in model quality
• Customization: pre-built monitoring capabilities (no coding), plus custom analysis options
• Scope: primarily focused on predefined metrics and thresholds
• Outcome: detects issues and invokes alerts or automated actions

CloudWatch Logs
• Definition: deep insights into the internal state and behavior of ML systems
• Alert system: notifications based on preset thresholds
• Customization: customizable log patterns and anomaly detection
• Scope: enables asking and answering questions about system behavior
• Outcome: provides deeper insights for troubleshooting and optimization
Monitoring Tools (for Performance and Latency)
AWS X-Ray
• Key features: works across AWS and third-party services; generates detailed service graphs; identifies performance bottlenecks
• Compatible services: EC2, ECS, Lambda, Elastic Beanstalk

CloudWatch Lambda Insights
• Key features: monitors metrics (memory, duration, invocation count); provides detailed logs and traces; helps identify bottlenecks in Lambda functions
• Compatible services: Lambda

CloudWatch Logs Insights
• Key features: interactive querying and analysis of log data; correlates log data from different sources; visualizes time series data; supports aggregations, filters, and regex
• Compatible services: any service that generates logs in CloudWatch

QuickSight
• Key features: interactive dashboards; ML-powered insights; supports various data sources
• Compatible services: various AWS services and external data sources
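As an example of the interactive querying that Logs Insights supports, a query counting error log lines from an endpoint's log group in 5-minute bins (the /ModelError/ pattern is an assumption about your endpoint's log format):

```
fields @timestamp, @message
| filter @message like /ModelError/
| stats count(*) as errors by bin(5m)
| sort errors desc
```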
SageMaker w/ EventBridge
c) How to start
Pricing options

On-Demand Instances
• Description: pay-per-use with no long-term commitment
• Savings: none (baseline)
• Best for: short-term, unpredictable workloads; real-time inference services

Reserved Instances
• Description: discounted rates for 1- or 3-year commitments
• Savings: up to 72% vs On-Demand
• Best for: steady-state, predictable workloads; long-running ML training jobs

Savings Plans for SageMaker
• Description: commit to a specific compute usage for 1 or 3 years
• Savings: up to 64% vs On-Demand
• Best for: flexible, recurring SageMaker usage; regular model training and deployment
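The discount percentages above translate directly into effective rates; a quick sketch with a hypothetical $10/hour instance:

```python
def effective_hourly(on_demand_rate, discount_pct):
    """Effective hourly rate after a pricing-plan discount."""
    return on_demand_rate * (1 - discount_pct / 100)

# Hypothetical $10/hr training instance, at the maximum discounts above
on_demand = 10.0
reserved = effective_hourly(on_demand, 72)       # ~ $2.80/hr
savings_plan = effective_hourly(on_demand, 64)   # ~ $3.60/hr
```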
4.3 Secure AWS ML Resources
Service Roles
• SageMaker Execution role: allows SageMaker to perform tasks on behalf of users; used for general SageMaker operations
• Processing Job role: specific to SageMaker processing jobs; used for data processing tasks
• Training Job role: specific to SageMaker training jobs; used for model training tasks
• Model role: specific to SageMaker model deployment; used for model deployment and hosting
IAM policy examples

1. Least-privilege access for an ML workflow
• Key permissions: SageMaker: CreateTrainingJob, CreateModel; S3: GetObject, PutObject; ECR: BatchGetImage; CloudWatch: PutMetricData
• Resource scope: specific ARNs for each service
• Notes: adheres to the principle of least privilege

2. Read metadata of ML resources
• Key permissions: machinelearning:Get*, machinelearning:Describe*
• Resource scope: specific MLModel ARNs for Get*; * (all) for Describe*
• Notes: allows reading metadata but not modifying resources

3. Create ML resources
• Key permissions: machinelearning:CreateDataSourceFrom*, machinelearning:CreateMLModel, machinelearning:CreateBatchPrediction, machinelearning:CreateEvaluation
• Resource scope: * (all)
• Notes: create actions cannot be restricted to specific resources

4. Manage real-time endpoints and predictions
• Key permissions: machinelearning:CreateRealtimeEndpoint, machinelearning:DeleteRealtimeEndpoint, machinelearning:Predict
• Resource scope: specific MLModel ARN
• Notes: allows management of endpoints and predictions for a specific model
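Example 4 above could be written as an IAM policy along these lines (the region, account ID, and model ID in the ARN are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "machinelearning:CreateRealtimeEndpoint",
        "machinelearning:DeleteRealtimeEndpoint",
        "machinelearning:Predict"
      ],
      "Resource": "arn:aws:machinelearning:us-east-1:123456789012:mlmodel/EXAMPLE-ML-MODEL"
    }
  ]
}
```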
Detailed examples
4. Allow users to create /delete real-time endpoints and perform real-time predictions on an ML model
• S3
• CloudWatch Logs
• SageMaker runtime
• SageMaker API
4.3.3 SageMaker Compliance & Governance
AWS Services for Compliance and Governance
Amazon Inspector
• Purpose: automated vulnerability management
• Key features: continuous scanning for vulnerabilities
• ML-related use case: scan container images in ECR for ML model deployments
Governance/Framework: description, and AWS services to use
• ISO 27001: Information Security Management System standard. Services: AWS Config, AWS Security Hub
• SOC 2: Service Organization Control for service organizations. Services: AWS Artifact, AWS Config, SageMaker Model Cards
• PCI-DSS: Payment Card Industry Data Security Standard. Services: AWS Config, AWS WAF, Amazon Inspector
• HIPAA: Health Insurance Portability and Accountability Act. Services: AWS Artifact, AWS Security Hub, AWS Config
• FedRAMP: Federal Risk and Authorization Management Program. Services: AWS CloudTrail, AWS Config
CloudTrail Logs
• Purpose: monitor API calls
• Captures: caller identity, timestamps, API details
• Use case: identify unauthorized API calls to SageMaker resources

Data Event Logs
• Purpose: monitor data plane operations
• Captures: input/output data for training and inference
• Use case: verify whether unauthorized entities accessed model data

AWS PrivateLink
• Purpose: enhance network security
• Provides: private connections between your VPC and SageMaker
• Use case: ensure traffic remains within the AWS network
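The "identify unauthorized API calls" use case can be sketched as a filter over CloudTrail records, using the caller identity, timestamp, and API detail fields each record carries (the sample records and role ARNs below are hypothetical):

```python
def unauthorized_sagemaker_calls(events, allowed_arns):
    """Flag SageMaker API calls made by identities outside an allow-list.
    `events` are simplified CloudTrail records (dicts)."""
    flagged = []
    for e in events:
        if (e["eventSource"] == "sagemaker.amazonaws.com"
                and e["userIdentity"]["arn"] not in allowed_arns):
            flagged.append((e["eventTime"], e["eventName"],
                            e["userIdentity"]["arn"]))
    return flagged

# Hypothetical records mirroring CloudTrail's structure
events = [
    {"eventSource": "sagemaker.amazonaws.com", "eventName": "CreateEndpoint",
     "eventTime": "2024-05-01T10:00:00Z",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:role/MLOpsRole"}},
    {"eventSource": "sagemaker.amazonaws.com", "eventName": "DeleteEndpoint",
     "eventTime": "2024-05-01T11:00:00Z",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/intern"}},
]
print(unauthorized_sagemaker_calls(
    events, {"arn:aws:iam::123456789012:role/MLOpsRole"}))
```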