Module 3 AWS
Sections
1. Scenario introduction
2. Collecting and securing data
3. Evaluating your data
4. Preprocessing your data
5. Training
6. Evaluating the accuracy of the model
7. Hosting and using the model
8. Hyperparameter and model tuning
Demonstration
• Training a model using Amazon SageMaker
• Optimizing Amazon SageMaker Hyperparameters
• Running Amazon SageMaker Autopilot
Knowledge check
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 2
Module overview continued
Lab
• Guided lab: Exploring Amazon SageMaker
• Guided lab: Visualizing data
• Guided lab: Encoding categorical data
• Guided lab: Splitting data and training a model with XGBoost
• Guided lab: Hosting and consuming a model on AWS
• Guided lab: Evaluating model accuracy
• Guided lab: Tuning with Amazon SageMaker
• Challenge Lab:
Module objectives
Machine learning pipeline
[Figure: machine learning pipeline diagram. Surviving labels: Business problem, Problem formulation, Tune model, Deploy model, New data and retraining, Feature augmentation, Data augmentation; decision branches: Yes, No; callouts: Section 1, Section 6, Section 8]
Machine learning pipeline
[Figure: machine learning pipeline diagram. Stages: Problem formulation, Collect and label data, Evaluate data, Feature engineering, Select and train model, Evaluate model, Tune model. Decision: Meets business goal? (Yes or No). Feedback loops: Feature augmentation, Data augmentation]
Define business objective
Questions to ask:
• How is this task done today?
• How will the business measure success?
• How will the solution be used?
• Do similar solutions exist that you might learn from?
• What assumptions have been made?
• Who are the domain experts?
How should you frame this problem?
Example: Problem formulation
Why? Reduce the number of customers who end their membership because of fraud.
Binary classification problem: Fraud or Not fraud
Wine quality dataset
Question: Based on the composition of the wine, can you predict the quality and therefore the price?
Why:
• View statistics
• Deal with outliers
• Scale numeric data
Citation
Source: UCI Wine quality dataset
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Car evaluation dataset
Question: Can you use a car’s attributes to predict whether the car will be
purchased?
Why:
• View statistics
• Encode categorical data
Citation
Source: UCI Car evaluation dataset
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Vertebral column dataset
Why:
• View statistics
• Encode categorical data
• Train and tune a model
Citation
Source: UCI Vertebral column dataset
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Section 1 key takeaways
• Business problems must be converted to an ML problem
• Why?
• Can it be measured?
• What kind of ML problem is it?
• Classification or regression?
Module 3: Implementing a Machine Learning Pipeline with Amazon SageMaker
What data do you need?
Data sources
Observations
Get a domain expert
• Do you have the data that you need to try to address this problem?
• Is your data representative?
Storing data in AWS
Extract, transform, load (ETL)
[Figure: ETL diagram. Labels: Data Stores, Crawler, Data Catalog, Schedule or Event, Data Source, Extract, Transform Script, Load, Data Target]
ETL with AWS Glue
AWS Glue can glue together different datasets and emit a single endpoint that can be queried.
AWS Glue overview
[Figure: AWS Glue overview. Labels: Amazon Redshift, Amazon RDS, AWS Glue, Endpoint]
ETL with Python
import io
import os
import zipfile

import boto3
import requests

# Download and extract (url and folder are assumed to be defined earlier)
r = requests.get(url, stream=True)
thezip = zipfile.ZipFile(io.BytesIO(r.content))
thezip.extractall(folder)

# Upload to Amazon S3
s3 = boto3.client('s3')
bucket = 'bucketname'
with os.scandir(folder) as entries:
    for f in entries:
        if f.is_file():
            s3.upload_file(
                Filename=os.path.join(folder, f.name),
                Key=f.name, Bucket=bucket)
Securing data: AWS Identity and Access Management (IAM) policy
IAM policies to control access:
{
  "Id": "Policy1583974368597",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1583974365198",
      "Action": ["s3:GetObject"],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::awsmachinelearningrepo/*",
      "Principal": {"AWS": ["DataReaderRole"]}
    }
  ]
}
Callouts: GetObject limits access to read only, allowing the action for only this bucket, by only this IAM role.
Securing data: Data encryption
The contents of many data repositories in AWS can be quickly and easily
encrypted.
Securing data: AWS CloudTrail for audit
[Figure: AWS CloudTrail tracks user activity and detects unusual API usage. Flow: Capture, Store, Act, Review. Uses: Compliance Auditing, Operational Troubleshooting, Security Analysis, Automatic Compliance Remediation]
Module 3 – Guided Lab 1: Exploring Amazon SageMaker
Section 2 key takeaways
• The first step in the machine learning pipeline is to obtain data
• Extract, transform, and load (ETL) is a common term for obtaining data for machine learning
• AWS Glue makes it easy to run ETL jobs from various data stores
• Securing your data includes both controlling access and encrypting data
Machine learning pipeline
[Figure: machine learning pipeline diagram. Surviving labels: Business problem, Data augmentation, Deploy model, New data and retraining]
You must understand your data
Before you can run statistics on your data, you must ensure that it’s in the
right format for analysis.
Example columns: Customer, Date of Transaction, Vendor, Charge Amount, Was This Fraud?
Loading data into pandas
import pandas as pd
url = "https://somewhere.com/winequality-red.csv"
# The file is semicolon-delimited, so pass the separator by keyword.
df_wine = pd.read_csv(url, sep=';')
pandas DataFrame
df_wine.shape    # (number of instances, number of attributes)
df_wine.head(5)  # first five rows; columns are attributes, rows are instances
Index and column names
df_wine.columns
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
'pH', 'sulphates', 'alcohol', 'quality'],
dtype='object')
df_wine.index
RangeIndex(start=0, stop=1599, step=1)
DataFrame schema
df_wine.dtypes
quality               int64
fixed acidity         float64
volatile acidity      float64
citric acid           float64
residual sugar        float64
chlorides             float64
free sulfur dioxide   float64
total sulfur dioxide  float64
density               float64
pH                    float64
sulphates             float64
alcohol               float64
dtype: object

df_wine.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1597 entries, 0 to 1598
Data columns (total 12 columns):
quality               1597 non-null int64
fixed acidity         1597 non-null float64
volatile acidity      1597 non-null float64
citric acid           1597 non-null float64
residual sugar        1597 non-null float64
chlorides             1597 non-null float64
free sulfur dioxide   1597 non-null float64
total sulfur dioxide  1597 non-null float64
density               1597 non-null float64
pH                    1597 non-null float64
sulphates             1597 non-null float64
alcohol               1597 non-null float64
dtypes: float64(11), int64(1)
memory usage: 162.2 KB
Descriptive statistics
Statistical characteristics
df_wine.describe()
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide       pH  sulphates  alcohol  quality
count        1599.00           1599.00      1599.00         1599.00    1599.00              1599.00               1599.00  1599.00    1599.00  1599.00  1599.00
mean            8.32              0.53         0.27            2.54       0.09                15.87                 46.47     3.31       0.66    10.42     5.64
std             1.74              0.18         0.19            1.41       0.05                10.46                 32.90     0.15       0.17     1.07     0.81
min             4.60              0.12         0.00            0.90       0.01                 1.00                  6.00     2.74       0.33     8.40     3.00
25%             7.10              0.39         0.09            1.90       0.07                 7.00                 22.00     3.21       0.55     9.50     5.00
50%             7.90              0.52         0.26            2.20       0.08                14.00                 38.00     3.31       0.62    10.20     6.00
75%             9.20              0.64         0.42            2.60       0.09                21.00                 62.00     3.40       0.73    11.10     6.00
max            15.90              1.58         1.00           15.50       0.61                72.00                289.00     4.01       2.00    14.90     8.00
Categorical statistics identify frequency of values and class imbalances
df_car.head(5)
buying maint doors persons lug_boot safety class
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc
df_car.describe()
buying maint doors persons lug_boot safety class
count 1728 1728 1728 1728 1728 1728 1728
unique 4 4 4 3 3 3 4
top low low 2 2 big low unacc
freq 432 432 432 576 576 576 1210
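The frequency and class-imbalance check can be sketched without pandas. This is a minimal illustration only: the `labels` list is a hypothetical ten-row sample, while the 1210 and 1728 figures are the `freq` and `count` values from the `describe()` output above.

```python
from collections import Counter

# Hypothetical sample of the 'class' column (not the real dataset).
labels = ["unacc"] * 7 + ["acc"] * 2 + ["good"]

counts = Counter(labels)
total = sum(counts.values())

# Share of each class, most frequent first.
shares = {name: n / total for name, n in counts.most_common()}
for name, share in shares.items():
    print(f"{name}: {share:.0%}")

# Majority-class share from describe(): freq / count for the 'class' column.
majority_share = 1210 / 1728
print(f"unacc share in the full dataset: {majority_share:.1%}")
```

A majority share this far above an even split is the kind of imbalance worth noting before training.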
Plotting attribute statistics
df_wine['sulphates'].hist(bins=10)
df_wine['sulphates'].plot.box()
df_wine['sulphates'].plot.kde()
Plotting multivariate statistics
df_wine.plot.scatter(
    x='alcohol',
    y='sulphates')
pd.plotting.scatter_matrix(
    df_wine[['citric acid',
             'alcohol',
             'sulphates']])
Scatter plot with identification
import matplotlib.pyplot as plt
# Split the two plotted columns by quality: above 5 versus 5 and below.
high = df_wine[['sulphates','alcohol']][df_wine['quality']>5]
low = df_wine[['sulphates','alcohol']][df_wine['quality']<=5]
plt.scatter(high['sulphates'],high['alcohol'],s=50,c='blue',marker='o',label='great')
plt.scatter(x=low['sulphates'],y=low['alcohol'],s=50,c='red',marker='v',label='poor')
plt.legend()  # show the 'great' and 'poor' labels
Correlation matrix
corr_matrix = df_wine.corr()
corr_matrix["quality"].sort_values(ascending=False)
Correlation matrix heat map
import matplotlib.pyplot as plt
import seaborn as sns
correlations = df_wine.corr()
fig, ax = plt.subplots(figsize=(10, 10))
colormap = sns.color_palette("BrBG", 10)
# Label the rows with the DataFrame's column names.
sns.heatmap(correlations,
            cmap=colormap,
            annot=True,
            fmt=".2f",
            yticklabels=correlations.columns,
            ax=ax)
plt.show()
Module 3 – Guided Lab 2: Visualizing Data
Section 3 key takeaways
• The first step in evaluating data is to make sure that it’s in the right format
• pandas is a popular Python library for working with data
• Use descriptive statistics to learn about the dataset
• Create visualizations with pandas to examine the dataset in more detail
Feature selection and extraction
Feature extraction
• Invalid values
• Wrong formats
• Misspelling
• Duplicates
• Consistency
• Rescale
• Encode categories
• Remove outliers
• Reassign outliers
• Bucketing
• Decomposition
• Aggregation
• Combination
• Transformation
• Normalization
• Dimensionality reduction
Encoding ordinal data
Encoding non-ordinal data
Cleaning data
Finding missing data
Plan for missing values
Dropping missing values
Imputing missing data
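Dropping and imputing can be sketched in plain Python. The rows and column names below are hypothetical, and mean imputation is only one possible strategy, chosen here for brevity rather than taken from the slides.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"alcohol": 9.4, "pH": 3.51},
    {"alcohol": None, "pH": 3.20},
    {"alcohol": 9.8, "pH": None},
    {"alcohol": 11.2, "pH": 3.16},
]

def drop_missing(rows):
    """Keep only rows in which every value is present."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows, column):
    """Replace missing values in one column with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

complete = drop_missing(rows)                               # two rows survive
imputed = impute_mean(impute_mean(rows, "alcohol"), "pH")   # all four rows kept
```

Dropping discards half of this tiny sample, while imputing keeps every row at the cost of inventing values, which is the trade-off a plan for missing values has to weigh.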
Outliers
• Outliers can –
• Provide a broader picture of the data
• Make accurate predictions difficult
• Indicate the need for more columns
• Types of outliers –
• Univariate: Abnormal values for a
single variable
• Multivariate: Abnormal values for a
combination of two or more variables
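A univariate outlier check can be sketched with the interquartile range (IQR) rule. The IQR rule and the factor 1.5 are a common rule of thumb assumed here, not a method the slides name, and the readings are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sulphates-like readings with one extreme value.
readings = [0.56, 0.68, 0.65, 0.58, 0.47, 0.57, 0.80, 0.54, 0.52, 2.00]
print(iqr_outliers(readings))
```

Multivariate outliers need a different approach, because each value can look normal on its own while the combination is unusual.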
Finding outliers
Dealing with outliers
Feature selection
• Filter method
• Use statistical methods
• Wrapper methods
• Actually train the model
• Embedded methods
• Integrated with model
Feature selection: Filter methods
• Measures –
• Pearson’s correlation
• Linear discriminant analysis (LDA)
• Analysis of variance (ANOVA)
• Chi-square
[Figure: All features, then Statistics and correlation, then Best features]
Feature selection: Wrapper methods
Best features
Feature selection: Embedded methods
[Figure: embedded methods. Labels: Feature subset, Evaluate results]
Module 3 – Guided Lab 3: Encoding Categorical Data
Section 4 key takeaways
• Feature engineering involves –
• Selection
• Extraction
• Preprocessing gives you better data
• Two categories for preprocessing –
• Converting categorical data
• Cleaning up dirty data
• Use categorical encoding to convert categorical data
• Various types of dirty data –
• Missing data
• Outliers
• Develop a strategy for dirty data –
• Replace or delete rows with missing data
• Delete, transform, or impute new values for outliers
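Categorical encoding can be sketched as follows. The `safety` values come from the car evaluation dataset shown earlier, the `colors` column is hypothetical, and the helper names are illustrative rather than part of any library.

```python
# Ordinal feature: the order carries meaning, so map to increasing integers.
SAFETY_ORDER = {"low": 0, "med": 1, "high": 2}

def encode_ordinal(values, order):
    """Map each category to its rank in the given order."""
    return [order[v] for v in values]

def encode_one_hot(values):
    """Non-ordinal feature: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

safety = ["low", "med", "high", "low"]
colors = ["red", "blue", "red", "green"]  # hypothetical non-ordinal column

safety_codes = encode_ordinal(safety, SAFETY_ORDER)
color_names, color_matrix = encode_one_hot(colors)
print(safety_codes)
print(color_names, color_matrix)
```

Integer codes suit ordinal data because the model can use the ordering; applying them to non-ordinal data would suggest an order that does not exist, which is why one-hot columns are used there.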
Section 5: Training
File formats for machine learning
[Figure: file format icons. Labels: *.csv, JSON lines, *.rec, text, Protobuf recordIO]
Formatting the data for an algorithm
When you use CSV format, use the first column as the target variable.
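A minimal sketch of that layout: the helper below (a hypothetical name) writes the target value first in every row. Omitting the header row is an assumption based on what the built-in XGBoost algorithm expects for CSV input; check the requirement for the algorithm you use.

```python
import csv
import io

def to_training_csv(rows, target):
    """Serialize dict rows as header-less CSV with the target in column 0."""
    features = [name for name in rows[0] if name != target]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[target]] + [row[name] for name in features])
    return buffer.getvalue()

# Hypothetical rows using wine dataset column names.
rows = [
    {"alcohol": 9.4, "sulphates": 0.56, "quality": 5},
    {"alcohol": 9.8, "sulphates": 0.68, "quality": 6},
]
csv_text = to_training_csv(rows, target="quality")
print(csv_text)
```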
Why split the data?
[Figure: plot of Y against X illustrating overfitting]
Holdout method
K-fold cross validation
Shuffle your data
Training models with Amazon SageMaker
Amazon SageMaker built-in algorithms
XGBoost
Linear learner
K-means
Creating a training job in Amazon SageMaker
[Figure: creating a training job, with numbered callouts 1 to 3. Labels: S3 bucket training data, Helper code, Training code, Model training on ML compute instances, Amazon SageMaker, Amazon Elastic Container Registry (Amazon ECR)]
Demonstration: Training a Model Using Amazon SageMaker
Module 3 – Guided Lab 4: Splitting Data and Training a Model with XGBoost
Section 5 key takeaways
• Split data into training and testing sets –
• Optionally, split into three sets, including validation set
• Can use k-fold cross validation to use all the non-test data for validation
• Can use two key algorithms for supervised learning –
• XGBoost
• Linear learner
• Use k-means for unsupervised learning
• Use Amazon SageMaker training jobs
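The splitting ideas above can be sketched in plain Python. The function names, the 80/20 ratio, and the fixed seed are illustrative assumptions; in practice a library routine would usually do this work.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then hold out the last test_fraction of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, validation
        start += size

data = list(range(10))            # stand-in for ten data rows
train, test = holdout_split(data)
folds = list(k_fold_indices(len(train), k=4))
```

Each non-test row lands in a validation fold exactly once, which is how k-fold cross validation uses all the non-test data for validation.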
Is your model ready to deploy?
Deployment options
Batch transform
The goal of deployment
[Figure: the goal of deployment. Labels: Managed environment hosting models, In production, Predictions]
Deploying with Amazon SageMaker
[Figure: deploying with Amazon SageMaker. A client sends an inference request to, and receives an inference response from, a secure Amazon SageMaker endpoint. Other labels: Models, Helper code, Inference code, Training code, Training data, Inference code images]
Creating an endpoint
Obtaining predictions
[Figure: obtaining predictions. Labels: Data processing steps, Model training, Predictions]
Module 3 – Guided Lab 5: Hosting and Consuming a Model on AWS
Section 6 key takeaways
• Can use two options for deployment
• Amazon SageMaker hosting
• Batch transform
• Deploy only after you have tested your model
• Goal is to generate predictions for client applications
• Create an endpoint
• Single-model endpoint for simple use cases
• Multi-model endpoint to support multiple use cases
Evaluation determines how well your model predicts the target on future data
Success metric
Business problem: Fraudulent credit card transactions
Success metric: 10% reduction in fraud claims in 6-month period
Model metric
The model metric must align to both the business problem and the success metric.
Confusion matrix
[Figure: a trained model's predictions compared with actual labels; partial confusion matrix (rows: predicted; columns: actual)]
         Cat  Not cat
Cat      107       23
Confusion matrix terminology
Which model is better?
Confusion matrices from two models that use the same data (rows: predicted; columns: actual):
         Cat  Not cat
Cat      107       23
Not cat   69       42

         Cat  Not cat
Cat      148       53
Not cat   28       12
Sensitivity
Sensitivity:
What percentage of cats were correctly identified?
$$\text{sensitivity} = \frac{TP}{TP + FN}$$
Specificity
Specificity:
What percentage of not cats were correctly identified?
$$\text{specificity} = \frac{TN}{TN + FP}$$
Which model is better now?
Model A (rows: predicted; columns: actual)
         Cat  Not cat
Cat      107       23
Not cat   69       42

Model B (rows: predicted; columns: actual)
         Cat  Not cat
Cat      148       53
Not cat   28       12
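Applying the sensitivity and specificity formulas to the Model A and Model B counts gives a concrete comparison. The sketch assumes rows are predictions and columns are actual labels, so the predicted-cat row holds TP and FP and the predicted-not-cat row holds FN and TN.

```python
def sensitivity(tp, fn):
    """Share of actual positives that the model identified: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives that the model identified: TN / (TN + FP)."""
    return tn / (tn + fp)

# Counts read from the two matrices (rows: predicted, columns: actual).
models = {
    "Model A": {"tp": 107, "fp": 23, "fn": 69, "tn": 42},
    "Model B": {"tp": 148, "fp": 53, "fn": 28, "tn": 12},
}

results = {
    name: (round(sensitivity(m["tp"], m["fn"]), 3),
           round(specificity(m["tn"], m["fp"]), 3))
    for name, m in models.items()
}
for name, (sens, spec) in results.items():
    print(f"{name}: sensitivity={sens}, specificity={spec}")
```

Model B finds more of the cats, while Model A is better at recognizing what is not a cat, so which model is better depends on which error costs the business more.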
Thresholds
Classification: ROC graph
[Figure: receiver operating characteristic (ROC) curve. y-axis: true-positive rate (sensitivity); x-axis: false-positive rate (1 – specificity)]
Classification: AUC-ROC
[Figure: ROC curve with the area under the curve shaded. y-axis: true-positive rate (sensitivity), 0.0 to 1.0; x-axis: false-positive rate (1 – specificity), 0.0 to 1.0]
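The ROC curve and its area can be sketched from first principles: sweep a threshold over the model scores, record the false-positive and true-positive rates at each step, and integrate with the trapezoidal rule. The labels and scores are hypothetical, and the function names are illustrative.

```python
def roc_points(labels, scores):
    """Return (false-positive rate, true-positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]              # hypothetical ground truth (1 = cat)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical model scores

points = roc_points(labels, scores)
area = auc(points)
print(round(area, 3))
```

An area of 1.0 would mean every positive is scored above every negative, and 0.5 corresponds to random ranking.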
Classification: Other metrics
$$F_1\ \text{score} = 2 \cdot \frac{\text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}}$$
Combines precision and sensitivity
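As a worked example, the F1 score can be computed for the two cat classifiers compared earlier. Precision is taken here in its standard form, TP / (TP + FP), which this slide does not spell out.

```python
def precision(tp, fp):
    """Share of positive predictions that were correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """Share of actual positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and sensitivity."""
    p, s = precision(tp, fp), sensitivity(tp, fn)
    return 2 * p * s / (p + s)

f1_model_a = f1_score(tp=107, fp=23, fn=69)
f1_model_b = f1_score(tp=148, fp=53, fn=28)
print(round(f1_model_a, 3), round(f1_model_b, 3))
```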
Regression: Mean squared error
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \tilde{y}_i \right)^2$$
[Figure: scatter plot of the points (x1,y1) through (x7,y7); axes x and y]
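The mean squared error is short enough to compute directly. The values below are hypothetical quality scores and model outputs, used only to show the arithmetic.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 6.0, 5.0]      # hypothetical target values
predicted = [2.5, 5.0, 7.0, 4.5]   # hypothetical model output

mse = mean_squared_error(actual, predicted)
print(mse)
```

Because the differences are squared, the single error of 1.0 contributes four times as much as each error of 0.5.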
ML tuning process
[Figure: tuning portion of the pipeline. Stages: Feature engineering, Select and train model, Evaluate model, Tune model. Decision: Meets business goal? Other labels: No, Feature augmentation]
Module 3 – Guided Lab 6: Evaluating Model Accuracy
Section 7 key takeaways
• Several ways to validate the model
• Hold-out
• K-fold cross validation
• Two types of model evaluation
• Classification
• Confusion matrix
• AUC-ROC
• Regression
• Mean squared error
Recap: Goal for models
Hyperparameter categories
Tuning hyperparameters
[Figure: clients (console, notebook, IDEs, AWS CLI) work with a hyperparameter tuning job, which has a tuning strategy and multiple training jobs. Objective metric: Area under the curve]
Automated hyperparameter tuning
[Figure: clients (console, notebook, IDEs, AWS CLI) work with a hyperparameter tuning job, which has a tuning strategy and multiple training jobs. Objective metric: Area under the curve]
Model name  Objective metric  Eta   Max_depth
Model3      0.8               0.7   6
Model1      0.75              0.09  5
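The idea behind automated tuning can be sketched locally: run several jobs with different hyperparameter values, score each against the objective metric, and rank the results. This is a toy illustration, not the Amazon SageMaker API. The `objective` function is a made-up stand-in for training a model and measuring its AUC, the parameter ranges are assumptions, and random search is used only because it is the simplest strategy to show.

```python
import random

def objective(eta, max_depth):
    """Stand-in for 'train a model and return its validation AUC' (hypothetical)."""
    return 1.0 - (eta - 0.3) ** 2 - 0.01 * (max_depth - 6) ** 2

def random_search(n_jobs, seed=0):
    """Run n_jobs trials with random hyperparameters, best objective first."""
    rng = random.Random(seed)
    results = []
    for job in range(n_jobs):
        params = {"eta": rng.uniform(0.01, 1.0), "max_depth": rng.randint(3, 10)}
        results.append({"name": f"Model{job + 1}",
                        "metric": objective(**params), **params})
    # Highest objective metric first, like the ranked table in the figure.
    return sorted(results, key=lambda r: r["metric"], reverse=True)

leaderboard = random_search(n_jobs=10)
best = leaderboard[0]
print(best["name"], round(best["metric"], 3))
```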
Hyperparameter tuning
Tuning best practices
Demonstration: Optimizing Amazon SageMaker Hyperparameters
Amazon SageMaker Autopilot
[Figure: Amazon SageMaker Autopilot. Steps: Analyze data, Feature engineering, Model tuning. Input and output labels: Training data, Target data (y), Test data, Trained models, Model metrics]
Demonstration: Running Amazon SageMaker Autopilot
Module 3 – Guided Lab 7: Tuning with Amazon SageMaker
Section 8 key takeaways
• Model tuning helps find the best solution
• Hyperparameters
• Model
• Optimizer
• Data
• Tuning
• Use Amazon SageMaker to help tune hyperparameters
• Use Autopilot for faster development
Module wrap-up
Module summary
Thank you
© 2021 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon
Web Services, Inc. Commercial copying, lending, or selling is prohibited. Corrections, feedback, or other questions? Contact us at https://support.aws.amazon.com/#/contacts/aws-training.
All trademarks are the property of their owners.