
Domain 4: Monitor Model

4.1 Monitor Model Performance and Data Quality


4.1.1 Monitoring Machine Learning Solutions

Importance of Monitoring in ML
a) Machine Learning Lens: AWS Well-Architected Framework: Best practices and design principles

| Best practice | How / AWS tools and techniques |
|---|---|
| Optimize resources | Resource pooling (sharing compute, storage, and networking resources); caching; data management (data compression, partitioning, and lifecycle management) |
| Scale ML workloads based on demand | SageMaker built-in scaling, AWS Auto Scaling, Lambda |
| Reduce cost | Monitor usage and costs (resource tagging); monitor ROI |
| Enable continuous improvement | Establish feedback loops; monitor performance (SageMaker Model Monitor for drift, CloudWatch alerts for deviations); automate retraining |
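A minimal sketch of the "scale ML workloads based on demand" practice from the table above, using Application Auto Scaling with a SageMaker endpoint variant. The endpoint name, variant name, and capacity values are placeholders, not values from these notes.

```python
# Register a SageMaker endpoint variant as a scalable target and attach a
# target-tracking policy on invocations per instance.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # placeholder

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Scale so each instance handles roughly 100 invocations per minute
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```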
Detecting Drift in Monitoring
a) Drift Types

| Drift Type | Description | Causes | Implications |
|---|---|---|---|
| Data Quality Drift | Production data distribution differs from training data distribution | Real-world data not as curated as training data; changes in data collection processes; shifts in real-world conditions | Model accuracy decreases; predictions become less reliable |
| Model Quality Drift | Model predictions differ from actual ground truth labels | Changes in the underlying relationship between features and target; model decay over time; concept drift | Decreased model performance; inaccurate predictions |
| Bias Drift | Increase in bias affecting model predictions over time | Training data too small or not representative; incorporation of societal assumptions in training data; exclusion of important data points; changes in real-world data distribution | Model overgeneralization; unfair or discriminatory predictions; ethical concerns; new groups appearing in production |
| Feature Attribution Drift | Changes in the contribution of individual features to model predictions | Shifts in feature importance over time; changes in the underlying problem domain; introduction of new, more predictive features | Model may rely on less relevant features; decreased interpretability; potentially reduced performance |

Note: Bias is the inverse of variance, where variance is the level of small fluctuations or noise common in complex data sets. Bias tends to cause model predictions to overgeneralize, while variance tends to cause models to undergeneralize. Increasing variance is one way to reduce the impact of bias (the bias-variance trade-off).
b) Monitoring Drift

| Monitoring Type | What It Monitors | How It Monitors |
|---|---|---|
| Data Quality Monitoring | Missing values; outliers; data types; statistical metrics (mean, std dev, etc.); data distribution | Implement data validation checks; calculate statistical metrics; compare metrics with baseline values; use data drift detection techniques (e.g., Kolmogorov-Smirnov tests, Maximum Mean Discrepancy) |
| Model Quality Monitoring | Evaluation metrics (accuracy, precision, recall, F1, AUC, etc.); prediction confidence; performance across different subpopulations | Calculate evaluation metrics on a held-out test set or a sample of production data; implement confidence thresholding or uncertainty estimation; flag low-confidence predictions; monitor performance on different data subsets |
| Model Bias Drift Monitoring | Bias metrics (disparate impact, fairness, etc.); performance across sensitive groups | Calculate bias metrics for different sensitive groups; compare bias metrics with baseline values or thresholds; implement bias mitigation techniques (e.g., adversarial debiasing, calibrated equalized odds) |
| Feature Attribution Drift Monitoring | Feature importance scores; statistical metrics of feature attributions | Use interpretability techniques (e.g., SHAP) to calculate feature attributions; calculate statistical metrics on feature attributions; compare metrics with baseline values; identify features with significantly changed attributions |
SageMaker Model Monitor Integration

SageMaker - Monitoring for Data Quality Drift

STEPS

1. Initiate data capture on the endpoint.
2. Create a baseline: start a baseline processing job with the suggest_baseline method of the monitor object (DefaultModelMonitor for data quality) using the SageMaker Python SDK.
3. Schedule data quality monitoring jobs.
4. Integrate data quality monitoring with CloudWatch.
5. Interpret results and analyze findings.

The report is generated as the constraint_violations.json file. The SageMaker Model Monitor prebuilt container provides the following violation checks:
• data_type_check
• completeness_check
• baseline_drift_check
• missing_column_check
• extra_column_check
• categorical_values_check
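A minimal sketch of steps 2 and 3 above, assuming data capture is already enabled on an endpoint named my-endpoint; the role ARN, bucket, and schedule name are placeholders.

```python
# Baseline the training data and schedule hourly data quality monitoring.
from sagemaker import Session
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

session = Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Step 2: suggest a baseline (statistics.json + constraints.json) from training data
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",       # placeholder
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/model-monitor/baseline",    # placeholder
)

# Step 3: schedule hourly checks of the endpoint's captured request/response data
monitor.create_monitoring_schedule(
    monitor_schedule_name="my-endpoint-data-quality",
    endpoint_input="my-endpoint",
    output_s3_uri="s3://my-bucket/model-monitor/reports",     # placeholder
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)
```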
SageMaker - Monitoring for Model Quality Drift using Model Monitor

To monitor model quality, SageMaker Model Monitor requires the following inputs:

1. Baseline data
2. Inference input and predictions made by the deployed model
3. Amazon SageMaker Ground Truth associated with the inputs to the model
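A minimal sketch of a model quality monitoring schedule built from these three inputs, assuming a binary classification endpoint and a ground truth prefix in S3; the attribute index, names, and paths are placeholders.

```python
# Schedule model quality monitoring that merges captured predictions with
# ground truth labels uploaded to S3.
from sagemaker.model_monitor import (
    ModelQualityMonitor,
    EndpointInput,
    CronExpressionGenerator,
)

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

mq_monitor = ModelQualityMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

mq_monitor.create_monitoring_schedule(
    monitor_schedule_name="my-endpoint-model-quality",
    endpoint_input=EndpointInput(
        endpoint_name="my-endpoint",                 # placeholder
        destination="/opt/ml/processing/input",
        inference_attribute="0",                     # position of the prediction in captured output (placeholder)
    ),
    ground_truth_input="s3://my-bucket/ground-truth",               # placeholder
    problem_type="BinaryClassification",
    output_s3_uri="s3://my-bucket/model-monitor/model-quality",     # placeholder
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)
```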

SageMaker - Monitoring for Bias using Clarify


Statistical bias drift occurs when the data encountered during prediction differs from the data used for training, leading to potentially biased outcomes. This becomes prominent as the live data changes over time. AWS provides services that help you monitor for statistical bias drift.

Post-training bias metrics in SageMaker Clarify help us answer two key questions:

• Are all facet values represented at a similar rate in positive (favorable) model predictions?

• Does the model have similar predictive performance for all facet values?

SageMaker Model Monitor automatically does the following:

• Merges the prediction data with SageMaker Ground Truth labels


• Computes baseline statistics and constraints
• Inspects the merged data and generates bias metrics and violations
• Emits CloudWatch metrics to set up alerts and triggers
• Reports and alerts on bias drift detection
• Provides reports for visual analysis

How it works: It quantifies the contribution of each input feature (for example, audio characteristics)
to the model's predictions, helping to explain how the model arrives at its decisions.
Options for using SageMaker Clarify

When to use which

| Method | Description | When to Use |
|---|---|---|
| SageMaker Clarify directly | Configure and run a Clarify processing job using the SageMaker Python SDK | For one-time or ad hoc bias analysis; when you need full control over the analysis configuration; for integrating bias analysis into custom workflows |
| SageMaker Model Monitor + Clarify | Integrate Clarify with Model Monitor for continuous bias monitoring | When you want to automate bias detection in production; if you need to set up alerts for bias drift |
| SageMaker Data Wrangler | Use Clarify within Data Wrangler during data preparation | During the data preparation phase; when you want to identify potential bias early in the ML pipeline; if you're already using Data Wrangler for data preprocessing |
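A minimal sketch of the first option (running Clarify directly) as an ad hoc post-training bias analysis; the bucket, column names, facet, and model name are illustrative placeholders, not values from these notes.

```python
# Run a one-off Clarify processing job that computes post-training bias metrics.
from sagemaker import Session, clarify

session = Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/validation/validation.csv",  # placeholder
    s3_output_path="s3://my-bucket/clarify/bias-report",            # placeholder
    label="target",
    headers=["target", "age", "income", "gender"],                  # placeholder columns
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],   # favorable label value
    facet_name="gender",             # sensitive attribute (placeholder)
)

model_config = clarify.ModelConfig(
    model_name="my-model",           # placeholder SageMaker model
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

# Post-training metrics address the two questions above: representation of each
# facet in favorable predictions, and comparable predictive performance per facet.
processor.run_post_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=clarify.ModelPredictedLabelConfig(probability_threshold=0.5),
    methods="all",
)
```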
SageMaker - Monitoring for Feature Attribution Drift (Model Monitor + Clarify)
Feature attribution refers to understanding and quantifying the contribution or influence of each
feature on the model's predictions or outputs. It helps to identify the most relevant features and their
relative importance in the decision-making process of the model.

Uses SHAP

SageMaker Clarify provides feature attributions based on the concept of Shapley value. This is a game-
theoretic approach that assigns an importance value (SHAP value) to each feature for a particular
prediction.

Here's how it works:

1. SageMaker Clarify: the core component that performs the actual bias and attribution computations and generates quality metrics and violations.
2. SageMaker Model Monitor: the framework that uses Clarify's capabilities to perform continuous monitoring of deployed models.
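A minimal sketch of feature attribution drift monitoring (Model Monitor + Clarify with SHAP); the SHAP baseline record, headers, model, and endpoint names are illustrative placeholders.

```python
# Baseline SHAP attributions on training data, then schedule attribution drift checks.
from sagemaker import Session, clarify
from sagemaker.model_monitor import ModelExplainabilityMonitor, CronExpressionGenerator

session = Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

explainability_monitor = ModelExplainabilityMonitor(
    role=role,
    sagemaker_session=session,
    max_runtime_in_seconds=3600,
)

shap_config = clarify.SHAPConfig(
    baseline=[[35, 50000, 0]],   # placeholder baseline record for SHAP
    num_samples=100,
    agg_method="mean_abs",
)

# Baseline the feature attributions
explainability_monitor.suggest_baseline(
    data_config=clarify.DataConfig(
        s3_data_input_path="s3://my-bucket/train/train.csv",       # placeholder
        s3_output_path="s3://my-bucket/clarify/explainability",    # placeholder
        label="target",
        headers=["target", "age", "income", "gender"],             # placeholder columns
        dataset_type="text/csv",
    ),
    explainability_config=shap_config,
    model_config=clarify.ModelConfig(
        model_name="my-model",                                     # placeholder
        instance_type="ml.m5.xlarge",
        instance_count=1,
        accept_type="text/csv",
    ),
)

# Schedule continuous attribution drift checks against the live endpoint
explainability_monitor.create_monitoring_schedule(
    monitor_schedule_name="my-endpoint-feature-attribution",
    endpoint_input="my-endpoint",                                  # placeholder
    output_s3_uri="s3://my-bucket/model-monitor/explainability",   # placeholder
    schedule_cron_expression=CronExpressionGenerator.daily(),
    enable_cloudwatch_metrics=True,
)
```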
SageMaker Model Dashboard

Features

1. Alerts:
How it helps: The dashboard provides a record of all activated alerts, allowing the data
scientist to review and analyze past issues.
Alert criteria depend upon two parameters:
• Datapoints to alert: Within the evaluation period, how many runtime failures raise an alert?
• Evaluation period: The number of most recent monitoring executions to consider when evaluating alert status.
2. Risk rating

A user-specified parameter from the model card with a low, medium, or high value.

3. Endpoint performance

You can select the endpoint column to view performance metrics, such as:

• CpuUtilization: The sum of each individual CPU core's utilization from 0%-100%.
• MemoryUtilization: The % of memory used by the containers on an instance, 0%-100%.
• DiskUtilization: The % of disk space used by the containers on an instance, 0%-100%.

4. Most recent batch transform job

This information helps you determine if a model is actively used for batch inference.

5. Model lineage graphs

When training a model, SageMaker creates a model lineage graph, a visualization of the entire ML
workflow from data preparation to deployment.

6. Links to model details


The dashboard links to a model details page where you can explore an individual model.

Model Monitor vs SageMaker Dashboard vs Clarify: When to use which one


| Tool | Description | Why to Use | When to Use |
|---|---|---|---|
| Model Monitor | Continuous monitoring of ML models in production | Data and model quality issues; model drift | To set up automated alerts for performance degradation; when you need to monitor resource utilization; to monitor real-time endpoints and batch transform jobs, or run on-demand monitoring jobs |
| SageMaker Dashboard | Centralized view of SageMaker resources and jobs | For a high-level overview of all SageMaker activities | To track training jobs, endpoints, and notebook instances |
| SageMaker Clarify | Bias detection and model explainability tool | Detecting bias; triggers statistics and violations reports | To detect bias in training data and model predictions; when you need to explain model decisions; for regulatory compliance requiring model transparency; to improve model fairness and accountability |
4.1.2 Remediating Problems Identified by Monitoring
Automated remediations and notifications

• Stakeholder notifications: When monitoring metrics indicate changes that impact business
KPIs or the underlying problem
• Data Scientist notification: You can use automated notifications to data scientists when
your monitoring detects data drift or when expected data is missing.
• Model retraining: Configure your model training pipeline to automatically retrain models
when monitoring detects drift, bias, or performance degradation.
• Autoscaling: You use resource utilization metrics gathered by infrastructure monitoring to
initiate autoscaling actions.
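One possible wiring for the notification and retraining triggers above is a CloudWatch alarm on a metric emitted by a Model Monitor schedule. The namespace, metric name, dimensions, and ARNs below are illustrative assumptions, not values from these notes.

```python
# Alarm on a data-drift metric and notify an SNS topic (which could page a data
# scientist or start a retraining pipeline).
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="my-endpoint-data-drift",
    Namespace="aws/sagemaker/Endpoints/data-metrics",       # assumed namespace
    MetricName="feature_baseline_drift_age",                # assumed metric name
    Dimensions=[
        {"Name": "Endpoint", "Value": "my-endpoint"},
        {"Name": "MonitoringSchedule", "Value": "my-endpoint-data-quality"},
    ],
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.2,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ml-drift-alerts"],  # placeholder
)
```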

Model retraining strategies

| Strategy | When to Use | Advantages | Considerations |
|---|---|---|---|
| Event-driven | When drift is detected above a certain threshold; in response to significant changes in data or performance | Timely response to changes; efficient use of resources | May be frequent if thresholds are too sensitive; retraining can be expensive and time-consuming |
| On-demand | When market conditions change significantly; in response to new competitors or strategies | Allows for human judgment in decision-making; can incorporate business context | Requires constant monitoring by data scientists or stakeholders; may lead to delayed responses |
| Scheduled | When there are known seasonal patterns; for maintaining model accuracy over time | Predictable maintenance schedule; can anticipate and prepare for retraining periods | May retrain unnecessarily if no significant changes occur; might miss sudden, unexpected changes |
4.2 Monitor and Optimize Infrastructure and Costs
4.2.1 Monitor Infrastructure
Monitor Performance Metrics - CloudWatch vs Model Monitor
| Feature | SageMaker Model Monitor | CloudWatch Logs |
|---|---|---|
| Purpose | Continuous monitoring of ML models in production (all four ML monitoring types) | Monitoring, storing, and accessing log files |
| Key Capabilities | Data quality monitoring; model quality monitoring; bias drift monitoring; feature attribution drift monitoring | Log collection from various sources; log storage in S3; pattern recognition; log anomaly detection |
| Monitoring Types | Real-time endpoint monitoring; batch transform job monitoring; on-schedule monitoring for async batch jobs | EC2 instances; CloudTrail; Amazon Route 53; other sources |
| Alert System | Set alerts for deviations in model quality | Notifications based on preset thresholds |
| Customization | Pre-built monitoring capabilities (no coding); custom analysis options | Customizable log patterns and anomaly detection |

Monitoring vs. Observability


| | Monitoring | Observability |
|---|---|---|
| Definition | Continuous collection and analysis of metrics | Deep insights into the internal state and behavior of ML systems |
| Focus | Detecting anomalies and deviations | Understanding complex interactions and dependencies |
| Key Activities | Collecting metrics; logging; alerting | Analyzing system behavior; identifying root causes; reasoning about system health |
| Techniques | Metric collection; threshold-based alerting; basic log analysis | Distributed tracing; structured logging; advanced data visualization |
| Outcome | Detect issues and invoke alerts or automated actions | Provide deeper insights for troubleshooting and optimization |
| Scope | Primarily focused on predefined metrics and thresholds | Enables asking and answering questions about system behavior |
Monitoring Tools (for Performance and Latency)
| Feature | AWS X-Ray | CloudWatch Lambda Insights | CloudWatch Logs Insights | QuickSight |
|---|---|---|---|---|
| Purpose | Trace information about responses and calls in applications | In-depth performance monitoring for Lambda functions only | Interactive log analytics service | BI and data visualization service |
| Key Features | Works across AWS and third-party services; generates detailed service graphs; identifies performance bottlenecks | Monitors metrics (memory, duration, invocation count); provides detailed logs and traces; helps identify bottlenecks in Lambda functions | Interactive querying and analysis of log data; correlates log data from different sources; visualizes time series data; supports aggregations, filters, and regex | Interactive dashboards; ML-powered insights; supports various data sources |
| Compatible Services | EC2, ECS, Lambda, Elastic Beanstalk | Lambda | Any service that generates logs in CloudWatch | Various AWS services and external data sources |
| ML Use Cases | Trace requests in ML applications (e.g., chatbot inference); analyze bottlenecks in ML systems | Monitor and optimize ML models deployed as Lambda functions; identify root causes of Lambda function issues | Analyze logs from ML workloads; identify patterns and anomalies in ML system behavior | Create dashboards for ML experiment results; analyze and present insights from ML predictions |
| Visualization | Service maps, trace views | Performance dashboards, trace details | Time series graphs, log event views | Interactive dashboards, charts, graphs |
| Primary Benefit | End-to-end request tracing and bottleneck identification | Detailed Lambda function performance insights | Flexible, interactive log analysis and visualization | Comprehensive data visualization and business intelligence |
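A minimal sketch of a CloudWatch Logs Insights query run with boto3 against a SageMaker endpoint's log group; the log group name and query string are illustrative placeholders.

```python
# Search the last hour of endpoint logs for error messages.
import time
import boto3

logs = boto3.client("logs")

query_id = logs.start_query(
    logGroupName="/aws/sagemaker/Endpoints/my-endpoint",   # placeholder
    startTime=int(time.time()) - 3600,                     # last hour
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, @message "
        "| filter @message like /ERROR/ "
        "| sort @timestamp desc "
        "| limit 20"
    ),
)["queryId"]

# Poll until the query finishes, then print matching log lines
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```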

SageMaker w/ EventBridge

Actions that can be automatically invoked using EventBridge:

a) Invoking an AWS Lambda function


b) Invoking Amazon EC2 run command (not create or deploy)
c) Relaying event to Kinesis Data Streams
d) Activating an AWS Step Functions state machine.
e) Notifying an Amazon SNS topic or an Amazon Simple Queue Service (Amazon SQS) queue.
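A minimal sketch of item (a): an EventBridge rule that invokes a Lambda function whenever a SageMaker training job fails. The rule name and Lambda ARN are placeholders, and the lambda:AddPermission grant for events.amazonaws.com is omitted.

```python
# Create a rule matching failed SageMaker training jobs and target a Lambda function.
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="sagemaker-training-failures",
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Training Job State Change"],
        "detail": {"TrainingJobStatus": ["Failed"]},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="sagemaker-training-failures",
    Targets=[{
        "Id": "notify-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:handle-failure",  # placeholder
    }],
)
```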
4.2.2 Optimize Infrastructure
Inference Recommender types
a) Inference Recommendation Types
| Recommendation Type | What Runs | Typical Duration |
|---|---|---|
| Default | Endpoint Recommender | 45 mins |
| Advanced | Endpoint Recommender + Inference Recommender | 2 hours |

b) Endpoint Recommender vs Inference Recommender

| | Endpoint Recommender | Inference Recommender |
|---|---|---|
| Output | List (or ranking) of prospective instances | Same |
| How it runs | Runs a set of load tests | Based on a custom load test |
| What you need to do | N/A | Provide your desired ML instances or a serverless endpoint, a custom traffic pattern, and requirements for latency and throughput |

c) How to start

SageMaker Inference Recommender
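A minimal sketch of starting a Default inference recommendation job with boto3, assuming a registered model package; the job name, role ARN, and model package ARN are placeholders.

```python
# Kick off a Default recommendation job and check its status.
import boto3

sm = boto3.client("sagemaker")

sm.create_inference_recommendations_job(
    JobName="my-recommendation-job",
    JobType="Default",   # "Advanced" runs a custom load test instead
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    InputConfig={
        "ModelPackageVersionArn": (
            "arn:aws:sagemaker:us-east-1:123456789012:"
            "model-package/my-model-group/1"                          # placeholder
        ),
    },
)

desc = sm.describe_inference_recommendations_job(JobName="my-recommendation-job")
print(desc["Status"])
```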

d) Sample Recommender output


4.2.3 Optimize Costs
Compute purchasing options

| Option | Description | Best For | Cost Savings | Example Use Case |
|---|---|---|---|---|
| Spot Instances | Spare EC2 capacity at lower prices; can be interrupted | Interruptible workloads | Up to 90% vs On-Demand | Data preprocessing or batch processing |
| On-Demand Instances | Pay-per-use with no long-term commitment | Short-term, unpredictable workloads | None (baseline) | Real-time inference services |
| Reserved Instances | Discounted rates for 1- or 3-year commitments | Steady-state, predictable workloads | Up to 72% vs On-Demand | Long-running ML training jobs |
| Capacity Blocks | Reserved capacity for AWS Outposts or Wavelength Zones | Ensuring capacity during peak demand | Varies | ML workloads in on-premises environments |
| Savings Plans for SageMaker | Commit to a specific compute usage for 1 or 3 years | Flexible, recurring SageMaker usage | Up to 64% vs On-Demand | Regular model training and deployment |
4.3 Secure AWS ML Resources

4.3.1 Securing ML Resources


Access Control using IAM
a) Roles vs Policies

| Category | Type | Description | Key Responsibilities/Features |
|---|---|---|---|
| User Roles | Data Scientist / ML Engineer | Provides access for experimentation | Access to S3, Athena, SageMaker Studio |
| User Roles | Data Engineer | Provides access for data management | Access to S3, Athena, AWS Glue, EMR |
| User Roles | MLOps Engineer | Provides access for ML operations | Access to SageMaker, CodePipeline, CodeBuild, CloudFormation, ECR, Lambda, Step Functions |
| Service Roles | SageMaker Execution | Allows SageMaker to perform tasks on behalf of users | General SageMaker operations |
| Service Roles | Processing Job | Specific to SageMaker processing jobs | Data processing tasks |
| Service Roles | Training Job | Specific to SageMaker training jobs | Model training tasks |
| Service Roles | Model | Specific to SageMaker model deployment | Model deployment and hosting |
| IAM Policies | Identity-based | Attached to IAM users, groups, or roles | Define actions allowed on specific resources |
| IAM Policies | Resource-based | Attached to resources (e.g., S3 buckets) | Control who can access specific resources |
IAM Policy – Examples for ML workflows

| ID | Purpose | Key Permissions | Resource Scope | Notes |
|---|---|---|---|---|
| 1 | Least-privilege access for an ML workflow | SageMaker: CreateTrainingJob, CreateModel; S3: GetObject, PutObject; ECR: BatchGetImage; CloudWatch: PutMetricData | Specific ARNs for each service | Adheres to the principle of least privilege |
| 2 | Read metadata of ML resources | machinelearning:Get*, machinelearning:Describe* | Specific MLModel ARNs for Get*; * (all) for Describe* | Allows reading metadata but not modifying resources |
| 3 | Create ML resources | machinelearning:CreateDataSourceFrom*, machinelearning:CreateMLModel, machinelearning:CreateBatchPrediction, machinelearning:CreateEvaluation | * (all) | Cannot be restricted to specific resources |
| 4 | Manage real-time endpoints and predictions | machinelearning:CreateRealtimeEndpoint, machinelearning:DeleteRealtimeEndpoint, machinelearning:Predict | Specific MLModel ARN | Allows management of endpoints for a specific model |
Detailed examples

1. identity-based policy used in a machine learning use case

2. Allow users to read machine learning resources metadata

3. Allow users to create machine learning resources

4. Allow users to create/delete real-time endpoints and perform real-time predictions on an ML model
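A minimal sketch of example 4 expressed as an identity-based policy created with boto3; the policy name and MLModel ARN are placeholders.

```python
# Create a policy that allows endpoint management and predictions for one model only.
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "machinelearning:CreateRealtimeEndpoint",
                "machinelearning:DeleteRealtimeEndpoint",
                "machinelearning:Predict",
            ],
            # Restrict to a specific MLModel ARN (placeholder)
            "Resource": "arn:aws:machinelearning:us-east-1:123456789012:mlmodel/EXAMPLE-ID",
        }
    ],
}

iam.create_policy(
    PolicyName="ManageRealtimeEndpointsForModel",
    PolicyDocument=json.dumps(policy_document),
)
```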

To ensure access only from VPC, use VPC Endpoints for:

• S3
• CloudWatch Logs
• SageMaker runtime
• SageMaker API
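A minimal sketch of creating one of these interface endpoints (SageMaker runtime) so inference traffic stays inside the VPC; the Region, VPC, subnet, and security group IDs are placeholders.

```python
# Create an interface VPC endpoint for the SageMaker runtime service.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",                          # placeholder
    VpcEndpointType="Interface",
    ServiceName="com.amazonaws.us-east-1.sagemaker.runtime",
    SubnetIds=["subnet-0123456789abcdef0"],                 # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],              # placeholder
    PrivateDnsEnabled=True,
)
```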
4.3.2 SageMaker Compliance & Governance
AWS Services for Compliance and Governance
| Service | Purpose | Key Features | ML-Related Use Case |
|---|---|---|---|
| AWS Artifact | Provide on-demand access to AWS compliance reports and agreements | Self-service portal; access to compliance documentation | Access HIPAA compliance reports for healthcare ML projects |
| AWS Config | Monitor and evaluate AWS resource configurations | Continuous monitoring; automated configuration evaluation | Monitor SageMaker resource configurations for compliance with security policies |
| Audit Manager | Continuously audit AWS usage for risk and compliance assessment | Streamlined auditing process against regulations and standards | Assess compliance of ML workflows with industry standards |
| Security Hub | View of security alerts and security posture | Centralized security alerts | Monitor security posture across ML workflows and resources |
| Amazon Inspector | Automated vulnerability management | Continuous scanning for vulnerabilities | Scan container images in ECR for ML model deployments |
| AWS Service Catalog | Create and manage catalogs of pre-approved resources | Governance-compliant resource catalogs | Create catalogs of compliant SageMaker resources and ML models |

Amazon SageMaker Governance Tools Summary


| Tool | Purpose | Key Features |
|---|---|---|
| SageMaker Role Manager | Simplify access control | Define minimum permissions for ML activities; quick setup and streamlined access management |
| SageMaker Model Cards | Document and share model information | Record intended uses; document risk ratings |
| SageMaker Model Dashboard | Provide an overview of models | Unified view of all models in the account; monitor model behavior in production |
| SageMaker Assets | Streamline ML asset management | Publish ML and data assets; share assets across teams |
| Model Governance and Explainability | Ensure compliance and transparency | Protect data and workloads; ensure compliance with standards; enhance model interpretability |
Compliance certifications and regulatory frameworks

| Governance / Framework | Description | AWS Services to Use |
|---|---|---|
| ISO 27001 | Information Security Management System standard | AWS Config; AWS Security Hub |
| SOC 2 | Service Organization Control for service organizations | AWS Artifact; AWS Config; SageMaker Model Cards |
| PCI-DSS | Payment Card Industry Data Security Standard | AWS Config; AWS WAF; Amazon Inspector |
| HIPAA | Health Insurance Portability and Accountability Act | AWS Artifact; AWS Security Hub; AWS Config |
| FedRAMP | Federal Risk and Authorization Management Program | AWS CloudTrail; AWS Config |

Note: AWS Config is common to all of these frameworks.


4.3.3 Security Best Practices for CI/CD Pipelines
CI/CD pipeline stages
| CI/CD Stage | Security Tools / Practices |
|---|---|
| Pre-Commit | Pre-commit hooks (scripts); IDE plugins that analyze code, detect issues, provide recommendations for improvements, and handle linting, formatting, beautifying, and securing code |
| Commit | Static Application Security Testing (SAST) |
| Build | Software Composition Analysis (SCA): identifies the open-source packages used in code, defines vulnerabilities and potential compliance issues, and scans infrastructure as code (IaC) manifest files |
| Test | Dynamic Application Security Testing (DAST); Interactive Application Security Testing (IAST), which combines the advantages of SAST and DAST tools |
| Deploy | Penetration testing |
| Monitor | Red/Blue/Purple teaming |
4.3.4 Implement Security & Compliance w/ Monitoring, Logging and Auditing
CloudTrail for ML Resource Monitoring and Logging
| Use Case | Description | Benefits |
|---|---|---|
| Compliance Auditing | Generate audit trails using CloudWatch Logs and CloudTrail | Demonstrate compliance with regulations; meet internal policy requirements |
| Resource Optimization | Monitor resource utilization metrics | Optimize ML workloads; prevent resource abuse and DoS attacks |
| Incident Response | Investigate and respond to security incidents | Identify unauthorized access attempts; detect and respond to data breaches |
| Anomaly Detection | Implement ML models to detect unusual patterns | Identify potential security threats; detect deviations in monitoring data |
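A minimal sketch of querying recent CloudTrail management events for SageMaker API calls, for example to spot unexpected endpoint creation; the attribute values are placeholders.

```python
# List recent CloudTrail events whose source is the SageMaker API.
import boto3

cloudtrail = boto3.client("cloudtrail")

events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "sagemaker.amazonaws.com"},
    ],
    MaxResults=50,
)

for event in events["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username"))
```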

SageMaker Security Troubleshooting and Debugging Summary


| Tool/Feature | Purpose | Key Information Provided | Use Case |
|---|---|---|---|
| CloudTrail Logs | Monitor API calls | Caller identity; timestamps; API details | Identify unauthorized API calls to SageMaker resources |
| Data Event Logs | Monitor data plane operations | Input/output data for training and inference | Verify whether unauthorized entities accessed model data |
| IAM Policies | Manage access control | Permissions granted for SageMaker resources and operations | Identify overly permissive policies; ensure least privilege |
| VPC Flow Logs | Monitor network traffic | Network traffic to/from SageMaker resources | Identify suspicious IP addresses or communication patterns |
| Encryption Settings | Ensure data protection | Encryption status (at rest and in transit); AWS KMS key configurations | Verify proper data encryption and key management |
| AWS PrivateLink | Enhance network security | Private connections between the VPC and SageMaker | Ensure traffic remains within the AWS network |
