
AMAZON WEB SERVICES

AWS

Machine Learning Specialty


Certificate Preparation
2021
Version 2.0

Prepared By:
Ahmed Mohamed Elhamy
Table of Contents
Introduction
References
1. Data Engineering
1.1 Create Data repositories for ML
1.1.1 Lake Formation
1.1.2 S3
1.1.3 Amazon FSx for Lustre
1.1.4 Amazon EFS
1.2 Identify and implement a data-ingestion solution
1.2.1 Apache Kafka
1.2.2 Kinesis
1.2.2.1 Kinesis Streams
1.2.2.2 Kinesis Firehose
1.2.2.3 Kinesis Analytics
1.2.2.4 Kinesis Video Streams
1.2.3 Glue
1.2.3.1 Glue Data Catalog
1.2.3.2 Crawlers
1.2.3.3 Glue ETL
1.2.3.4 Job Authoring
1.2.3.5 Job Execution
1.2.3.6 Job Workflow
1.2.4 Data Stores in Machine Learning
1.2.4.1 Redshift
1.2.4.2 RDS, Aurora
1.2.4.3 DynamoDB
1.2.4.4 ElasticSearch
1.2.4.5 ElastiCache
1.2.4.6 Data Pipeline
1.2.4.7 AWS Batch
1.2.4.8 Database Migration Service (DMS)
1.2.4.9 Step Function
1.2.5 Full Data Engineering Pipeline
1.2.5.1 Real-time Layer
1.2.5.2 Video Layer
1.2.5.3 Batch Layer
1.3 Identify and implement a data-transformation solution
1.3.1 Hadoop
1.3.2 Amazon EMR
1.3.3 Apache Spark
2. Exploratory Data Analysis
2.1 Perform feature engineering
2.1.1 Data Distribution
2.1.2 Trends & Seasonality
2.1.3 Types of Visualization
2.1.4 Dimension Reduction
2.1.5 Missing Data
2.1.6 Unbalanced Data
2.1.7 Handling Outliers
2.1.8 Binning
2.1.9 Transforming
2.1.10 One-hot encoding
2.1.11 Scaling
2.1.12 Data Skewing
2.1.13 Residuals
2.1.14 Shuffling
2.2 Analyze and visualize data for ML
2.2.1 Amazon Athena
2.2.2 Amazon QuickSight
3. Modeling
3.1 Frame business problems as ML problems
3.1.1 Supervised Machine Learning
3.1.1.1 Regression
3.1.1.2 Classification
3.1.1.3 Evaluate Model
3.1.1.4 Overfitting and Underfitting
3.1.1.5 Bias/Variance Tradeoff
3.1.1.6 Regularization
3.1.1.7 Bagging and Boosting
3.1.1.8 Cross Validation
3.1.1.9 Train Model
3.1.2 Unsupervised Machine Learning
3.1.2.1 Clustering
3.1.2.2 Anomaly Detection
3.1.3 Deep Learning
3.1.3.1 Perceptron
3.1.3.2 Multi-Layer Perceptron and Backpropagation
3.1.3.3 Activation Functions
3.1.3.4 Classification Hyperparameters
3.1.3.5 Vanishing/Exploding Gradients
3.1.3.6 Reusing pre-trained layers
3.1.3.7 Fast Optimizers
3.1.3.8 Early Stop
3.1.3.9 Learning Schedule
3.1.3.10 Regularization
3.1.3.11 Famous Frameworks
3.1.3.12 Convolutional Neural Network
3.1.3.13 Recurrent Neural Network
3.1.3.14 Reinforcement
3.1.4 Natural Language Processing (NLP)
3.1.4.1 Text preprocessing
3.1.4.2 Vectorization
3.1.4.3 Train Model
3.1.4.4 Sentiment Analysis
3.2 Select the appropriate model
3.2.1 Linear Learner
3.2.2 K Nearest Neighbors
3.2.3 K-Means
3.2.4 Principal Component Analysis (PCA)
3.2.5 XGBoost
3.2.6 IP Insights
3.2.7 Factorization Machines
3.2.8 Object Detection
3.2.9 Image Classification
3.2.10 Semantic Segmentation
3.2.11 BlazingText
3.2.12 Seq2Seq
3.2.13 Object2Vec
3.2.14 Neural Topic Model
3.2.15 Latent Dirichlet Allocation (LDA)
3.2.16 DeepAR
3.2.17 Random Cut Forest
3.2.18 Neural Collaborative Filtering
4. ML Implementation and Operations
4.1 SageMaker
4.1.1 Amazon ECR
4.1.2 Introduction to SageMaker
4.1.3 Automatic Model Tuning
4.1.4 SageMaker Docker Container
4.1.4.1 Container
4.1.4.2 Docker
4.1.4.3 SageMaker Modes
4.1.4.4 SageMaker Toolkit Structure
4.1.4.5 Docker Image Folder Structure
4.1.4.6 Extend Docker Image
4.1.4.7 Adapt Docker Container for SageMaker
4.1.4.8 Adapting Your Own Inference Container
4.1.4.9 Use Your Own Training Algorithms
4.1.4.10 Distributed Training Configuration
4.1.4.11 Environment Variables
4.1.4.12 TensorFlow Training
4.1.4.13 Deep Learning AMI (DLAMI)
4.1.5 Production Variant
4.1.6 SageMaker Neo
4.1.7 SageMaker Security
4.1.8 SageMaker Resources
4.1.9 SageMaker Automatic Scaling
4.1.10 Availability Zones in SageMaker
4.1.11 SageMaker Inference Pipeline
4.1.12 SageMaker with Spark
4.1.13 Notebook Lifecycle
4.1.14 SageMaker Studio
4.1.15 SageMaker Experiments
4.1.16 SageMaker Monitoring
4.1.17 SageMaker Debugger
4.1.18 SageMaker Ground Truth
4.1.19 SageMaker Autopilot
4.1.20 SageMaker Model Monitor
4.1.21 SageMaker JumpStart
4.1.22 SageMaker Data Wrangler
4.1.23 SageMaker Feature Store
4.1.24 SageMaker Edge Manager
4.1.25 Put it all together
4.2 AI Services
4.2.1 Amazon Comprehend
4.2.2 Amazon Translate
4.2.3 Amazon Transcribe
4.2.4 Amazon Polly
4.2.5 Amazon Forecast
4.2.6 Amazon Lex
4.2.7 Amazon Rekognition
4.2.8 Amazon Personalize
4.2.9 Amazon Textract
4.2.10 Amazon DeepRacer
4.2.11 DeepLens
4.2.12 AWS DeepComposer
4.2.13 Amazon Fraud Detector
4.2.14 Amazon CodeGuru
4.2.15 Contact Lens for Amazon Connect
4.2.16 Amazon Kendra
4.2.17 Amazon Augmented AI (A2I)
4.2.18 Put it all together
4.3 AWS IoT for Predictive Maintenance
4.3.1 IoT Greengrass
4.3.2 Use case
4.4 Security
4.4.1 PrivateLink
4.4.2 VPC Endpoints
4.4.3 VPC endpoint services (AWS PrivateLink)
4.4.4 Bucket policy and VPC endpoint
4.4.5 AWS Site-to-Site VPN
4.4.6 Control access to services with VPC endpoints
4.4.6.1 Use VPC endpoint policies
4.4.6.2 Security groups
4.4.7 SageMaker notebook instance networking
4.4.8 Network Isolation
4.4.9 Private packages
4.4.10 Secure Deployment
4.4.11 Protect communication in distributed training jobs
4.4.12 AI Services opt-out policies (AWS Organizations)
4.5 Deploy and operationalize ML solutions
4.5.1 Deployment Management
4.5.2 Deployment Options
4.5.3 Inference Types
4.5.4 Instance Types
5. Appendices
5.1 Algorithms Input Formats
5.2 Algorithm Instance Types
5.3 Algorithm Type & Usage

Introduction
This document is for any candidate who wants to pass the AWS Machine Learning Specialty certification exam.
It follows the exam preparation path recommended by AWS.

The document is structured according to the exam domains defined by Amazon.
It covers all of the topics with full and clear explanations.
You should already have a background in Machine Learning; this document is intended for exam preparation, not as a complete Machine Learning course.
It explains all of the Amazon products and tools used in Machine Learning as of the end of 2021.
It does not walk through Python code for the Machine Learning algorithms themselves.

NOTE: This document is built from many books, websites, YouTube channels, etc., as stated in the References section. All rights remain with their owners, with many thanks to them for their clear explanations.
I hope this document is helpful to all of you, and I wish you every success.
Thanks
Ahmed Mohamed Elhamy


References
Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow
https://g.co/kgs/HmXTUi

Amazon SageMaker Developer Guide
https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html

AWS Digital Courses for Machine Learning
https://www.aws.training/LearningLibrary

AWS Certified Machine Learning Specialty 2021 - Hands On! (Udemy course)
https://www.udemy.com/share/1029De2@PW1KVGFbTFIPd0dDBXpOfhRuSlQ=/

StatQuest
https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw

DeepLizard
https://www.youtube.com/c/deeplizard

Stanford University
https://www.youtube.com/watch?v=6niqTuYFZLQ


1. Data Engineering
1.1 Create Data repositories for ML
1.1.1 Lake Formation
A data lake is a centralized repository that allows you to store all your structured and
unstructured data at any scale. You can store your data as-is, without having to first structure the
data, and run different types of analytics—from dashboards and visualizations to big data
processing, real-time analytics, and machine learning to guide better decisions.
A data warehouse is a database optimized to analyze relational data coming from transactional
systems and line of business applications. The data structure, and schema are defined in advance
to optimize for fast SQL queries, where the results are typically used for operational reporting
and analysis. Data is cleaned, enriched, and transformed so it can act as the “single source of
truth” that users can trust.
A data lake is different, because it stores relational data from line of business applications, and
non-relational data from mobile apps, IoT devices, and social media. The structure of the data or
schema is not defined when data is captured. This means you can store all of your data without
careful design or the need to know what questions you might need answers for in the future.
Different types of analytics on your data like SQL queries, big data analytics, full text search, real-
time analytics, and machine learning can be used to uncover insights.

1.1.2 S3
 Amazon S3 allows people to store objects (files) in “buckets” (directories)
 Buckets must have a globally unique name
 Objects (files) have a key; the key is the FULL path:
 <my_bucket>/my_file.txt
 <my_bucket>/my_folder1/another_folder/my_file.txt
 This will be interesting when we look at partitioning
 Max object size is 5 TB
 Object tags (key/value pairs, up to 10 per object) are useful for security / lifecycle

NOTE: An Amazon S3 bucket name is globally unique, and the namespace is shared by all AWS accounts. This means that after a bucket is created, the name of that bucket cannot be used by another AWS account in any AWS Region until the bucket is deleted. You should not depend on specific bucket naming conventions for availability or security verification purposes.


Bucket names must be unique within a partition. A partition is a grouping of Regions. AWS currently has three partitions: aws (Standard Regions), aws-cn (China Regions), and aws-us-gov (AWS GovCloud [US] Regions).

Buckets used with Amazon S3 Transfer Acceleration can't have dots (.) in their names.

Amazon S3 Transfer Acceleration is a bucket-level feature that enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket.

S3 for Machine Learning
 Backbone for many AWS ML services (example: SageMaker)
 Create a “Data Lake”
 Infinite size, no provisioning
 99.999999999% durability
 Decoupling of storage (S3) to compute (EC2, Amazon Athena, Amazon Redshift Spectrum,
Amazon Rekognition, and AWS Glue)
 Centralized Architecture
 Object storage => supports any file format
 Common formats for ML: CSV, JSON, Parquet, ORC, Avro, Protobuf

S3 Data Partitions
 Pattern for speeding up range queries (ex: AWS Athena)
 By Date: s3://bucket/my-data-set/year/month/day/hour/data_00.csv
 By Product: s3://bucket/my-data-set/product-id/data_32.csv
 You can define whatever partitioning strategy you like!
 Data partitioning will be handled by some tools we use (e.g. AWS Glue and Athena)
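
Sample Partitioned Upload
For illustration only, a minimal boto3 sketch of writing an object under a date-based partition prefix (the bucket name and CSV payload are hypothetical):

import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")

# Build a date-partitioned key, e.g. my-data-set/2021/06/15/09/data_00.csv
now = datetime.now(timezone.utc)
key = f"my-data-set/{now:%Y}/{now:%m}/{now:%d}/{now:%H}/data_00.csv"

# Tools such as Glue and Athena can prune these prefixes when you
# query by time range.
s3.put_object(Bucket="my-bucket", Key=key, Body=b"col1,col2\n1,2\n")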

S3 Storage Tiers
 Amazon S3 Standard (General Purpose)
 Amazon S3 Standard-Infrequent Access (IA)
 Amazon S3 One Zone-Infrequent Access
 Amazon S3 Intelligent-Tiering
 Amazon Glacier

11
Data Engineering Create Data repositories for ML

Amazon S3 Glacier provides three options for access to archives, from a few minutes to
several hours, and S3 Glacier Deep Archive provides two access options ranging from 12 to
48 hours.

S3 Life Cycle
 Set of rules to move data between different tiers, to save storage cost
 Example: General Purpose => Infrequent Access => Glacier
 Transition actions: objects are transitioned to another storage class.
 Move objects to Standard IA class 60 days after creation
 And move to Glacier for archiving after 6 months
 Expiration actions: S3 deletes expired objects on our behalf
 Access log files can be set to delete after a specified period of time
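
Sample Lifecycle Configuration
A minimal boto3 sketch matching the example above (move to Standard-IA after 60 days, to Glacier after 180 days, and expire access logs after one year); bucket and prefix names are hypothetical:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {   # tiering rule: Standard => Standard-IA => Glacier
                "ID": "tiering-rule",
                "Status": "Enabled",
                "Filter": {"Prefix": "my-data-set/"},
                "Transitions": [
                    {"Days": 60, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            },
            {   # expiration rule: delete old access logs
                "ID": "expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "access-logs/"},
                "Expiration": {"Days": 365},
            },
        ]
    },
)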

S3 Encryption
There are 4 methods of encrypting objects in S3
 SSE-S3: encrypts S3 objects using keys handled & managed by AWS
 SSE-KMS: use AWS Key Management Service to manage encryption keys
 Additional security (user must have access to KMS key)
 Audit trail for KMS key usage
 SSE-C: when you want to manage your own encryption keys
 Client Side Encryption

NOTE: From an ML perspective, SSE-S3 and SSE-KMS are the ones most likely to be used
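
Sample Encrypted Upload
A minimal boto3 sketch showing both options (the bucket name and the KMS key alias are hypothetical):

import boto3

s3 = boto3.client("s3")

# SSE-S3: S3 manages the encryption keys
s3.put_object(Bucket="my-bucket", Key="data/train.csv",
              Body=b"...", ServerSideEncryption="AES256")

# SSE-KMS: encrypt with a KMS key (the caller needs access to that key)
s3.put_object(Bucket="my-bucket", Key="data/train.csv",
              Body=b"...",
              ServerSideEncryption="aws:kms",
              SSEKMSKeyId="alias/my-key")  # hypothetical key alias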

S3 Accessibility
User based
 IAM policies: define which API calls should be allowed for a specific user
Sample IAM Policy
This IAM policy grants the IAM entity (user, group, or role) it is attached to permission to
perform any S3 operation on the bucket named “my_bucket”, as well as that bucket’s
contents.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::my_bucket",
                "arn:aws:s3:::my_bucket/*"
            ]
        }
    ]
}
Resource Based
 Bucket Policies: bucket-wide rules from the S3 console; allows cross-account access
It is used for:
 Grant public access to the bucket
 Force objects to be encrypted at upload
 Grant access to another account (cross-account)
Sample S3 Bucket Policy
This S3 bucket policy enables the root account 111122223333 and the IAM user Alice under that account to perform any S3 operation on the bucket named “my_bucket”, as well as that bucket’s contents.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::111122223333:user/Alice",
                    "arn:aws:iam::111122223333:root"
                ]
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::my_bucket",
                "arn:aws:s3:::my_bucket/*"
            ]
        }
    ]
}
 Object Access Control List (ACL): finer-grained control
 Bucket Access Control List (ACL): less common

S3 Default Encryption

The old way to enable default encryption was to use a bucket policy and refuse any HTTP
command without the proper headers.
The new way is to use the “default encryption” option in S3
Note: Bucket Policies are evaluated before “default encryption”
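
Sample Default Encryption Setting
A minimal boto3 sketch enabling default SSE-KMS encryption on a bucket (the bucket name and key alias are hypothetical):

import boto3

s3 = boto3.client("s3")

# All new objects in the bucket are encrypted by default.
s3.put_bucket_encryption(
    Bucket="my-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/my-key",  # hypothetical
            }
        }]
    },
)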

S3 Security
 Networking: VPC Endpoint Gateway
 Allows traffic to stay within your VPC (instead of going through the public web)
 Makes sure your private services (e.g. AWS SageMaker) can access S3
 Logging and audit:
 S3 access logs can be stored in another S3 bucket
 API calls can be logged in AWS CloudTrail
 Tag-based (combined with IAM policies and bucket policies)
 Example: add the tag Classification=PHI to your objects

S3 Pipe Input Mode
With Pipe input mode, your dataset is streamed directly to your training instances instead of
being downloaded first. This means that your training jobs start sooner, finish quicker, and need
less disk space. Amazon SageMaker algorithms have been engineered to be fast and highly
scalable.
With Pipe input mode, your data is fed on-the-fly into the algorithm container without involving
any disk I/O. This approach shortens the lengthy download process and dramatically reduces
startup time. It also offers generally better read throughput than File input mode. This is because
your data is fetched from Amazon S3 by a highly optimized multi-threaded background process. It
also allows you to train on datasets that are much larger than the 16 TB Amazon Elastic Block
Store (EBS) volume size limit.
Pipe mode enables the following:
- Shorter startup times because the data is being streamed instead of being downloaded
to your training instances.
- Higher I/O throughputs due to our high-performance streaming agent.
- Virtually limitless data processing capacity.
Built-in Amazon SageMaker algorithms can now be leveraged with either File or Pipe input
modes. Even though Pipe mode is recommended for large datasets, File mode is still useful for
small files that fit in memory and where the algorithm has a large number of epochs. Together,

both input modes now cover the spectrum of use cases, from small experimental training jobs to
petabyte-scale distributed training jobs.
Amazon SageMaker algorithms
Most first-party Amazon SageMaker algorithms work best with the optimized protobuf recordIO format; for this reason, Pipe mode support is offered only for the protobuf recordIO format. The algorithms in the following list support Pipe input mode when used with protobuf recordIO-encoded datasets (a training-job sketch follows the list):
- Principal Component Analysis (PCA)
- K-Means Clustering
- Factorization Machines
- Latent Dirichlet Allocation (LDA)
- Linear Learner (Classification and Regression)
- Neural Topic Modelling
- Random Cut Forest
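
Sample Pipe Mode Training Job
A minimal SageMaker Python SDK sketch of a training job using Pipe input mode (the S3 prefix and IAM role are placeholders, and the image lookup assumes the SDK can resolve the built-in Linear Learner image for your Region):

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

# Resolve the built-in Linear Learner container for the current Region.
image_uri = sagemaker.image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",  # stream training data instead of downloading it first
    sagemaker_session=session,
)

# Pipe mode for built-in algorithms expects protobuf recordIO data.
train = TrainingInput(
    "s3://my-bucket/train/",  # hypothetical S3 prefix
    content_type="application/x-recordio-protobuf",
)
estimator.fit({"train": train})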

1.1.3 Amazon FSx for Lustre
 Speeds up training jobs by serving data to SageMaker at high speed
 When your training data is in S3 and you plan to run training jobs several times using different algorithms and hyperparameters, consider using FSx for Lustre, a file system service.
 FSx for Lustre speeds up training jobs by serving S3 data to SageMaker at high speed. The
first time you run a training job FSx for Lustre automatically copies data from S3 and makes
it available for SageMaker.
 You use that same FSx for subsequent iterations of training jobs, preventing repeated
downloads of common S3 objects.
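
A minimal sketch of pointing a SageMaker training channel at an existing FSx for Lustre file system using the SageMaker Python SDK (the file system ID and directory path are placeholders; the training job must run inside the same VPC as the file system):

from sagemaker.inputs import FileSystemInput

train_fs = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",  # placeholder FSx ID
    file_system_type="FSxLustre",
    directory_path="/fsx/train",
    file_system_access_mode="ro",  # read-only is enough for training data
)
# estimator.fit({"train": train_fs})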

1.1.4 Amazon EFS
 If your training data is already on Amazon EFS, we recommend using it as your training data source. Amazon EFS has the benefit of launching your training jobs directly from the service, without the need for data movement.
 Relative cluster load performance: S3 less than 1, EFS equal to 1, EBS equal to 1.29, and FSx greater than 1.6


1.2 Identify and implement a data-ingestion solution


1.2.1 Apache Kafka
Technically speaking, event streaming is the practice of capturing data in real-time from event
sources like databases, sensors, mobile devices, cloud services, and software applications in the
form of streams of events; storing these event streams durably for later retrieval; manipulating,
processing, and reacting to the event streams in real-time as well as retrospectively; and routing
the event streams to different destination technologies as needed. Event streaming thus ensures
a continuous flow and interpretation of data so that the right information is at the right place, at
the right time.

Challenges Operating Apache Kafka
 Difficult to set up
 Tricky to scale
 Hard to achieve high availability
 AWS integration needs development
 No control and no visibility metrics

Amazon Managed Streaming for Kafka (MSK)
 Fully compatible with Apache Kafka
 AWS Management Console and AWS API for provisioning
 Clusters are set up automatically
 Provision Apache Kafka brokers and storage
 Create and tear down clusters on demand
 Deeply integrated with AWS services
 Kafka-Kinesis connector library for using Kafka with Kinesis

Compatibility
 MSK clusters are compatible with:
 Kafka partition reassignment tools
 Kafka APIs
 Kafka admin client
 3rd party tools
 MSK is not compatible with:
 Tools that upload .jar files, such as LinkedIn's "Cruise Control", "Uber Replicator", "Confluent Control Center" and "Auto Data Balancer"


Comparison between MSK and Kinesis
MSK:
 Built on partitions
 Open-source compatibility
 Strong integration with 3rd-party tools
 Cluster provisioning model
 Scaling isn't seamless
 Raw performance
Kinesis:
 Built on shards
 AWS API experience
 Throughput provisioning model
 Seamless scaling
 Lower cost
 AWS integration

1.2.2 Kinesis
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you
can get timely insights and react quickly to new information. Amazon Kinesis offers key
capabilities to cost-effectively process streaming data at any scale, along with the flexibility to
choose the tools that best suit the requirements of your application. With Amazon Kinesis, you
can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT
telemetry data for machine learning, analytics, and other applications. Amazon Kinesis enables
you to process and analyze data as it arrives and respond instantly instead of having to wait until
all your data is collected before the processing can begin.

 Kinesis is a managed alternative to Apache Kafka
 Great for application logs, metrics, IoT, clickstreams
 Great for "real-time" big data
 Great for stream processing frameworks (Spark, NiFi, etc.)
 Data is automatically replicated synchronously to 3 AZs


Figure 1: Kinesis

1.2.2.1 Kinesis Streams
 Shards to be provisioned in advance
 Data retention from 24 hours up to 7 days
 Ability to reprocess/replay data
 Must manage scaling (shard splitting and merging)
 Multiple applications can consume the same stream
 Once data is inserted in Kinesis, it can't be deleted (immutability)
 Records can be up to 1 MB in size

Figure 2: Kinesis Streams (producers: 1 MB/s or 1,000 messages/s per shard; consumers: 2 MB/s or 5 API calls/s per shard)
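
As an illustration, a minimal boto3 sketch of a producer writing one record to a stream (the stream name and payload are hypothetical); records with the same partition key land on the same shard, which preserves their relative ordering:

import json
import boto3

kinesis = boto3.client("kinesis")

kinesis.put_record(
    StreamName="clickstream",  # hypothetical stream name
    Data=json.dumps({"user": "u42", "page": "/home"}).encode(),
    PartitionKey="u42",  # routes all of this user's events to one shard
)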

1.2.2.2 Kinesis Firehose
 Fully managed
 Near real time (60 seconds minimum latency for non-full batches)
 Auto scaling
 Supports many data formats
 Conversion from CSV/JSON to Parquet/ORC (only for S3)
 Supports compression when the target is Amazon S3 (GZIP, ZIP, and SNAPPY)
 Pay for the amount of data going through Firehose
 No data storage
 Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3. Parquet and ORC are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON. If you want to convert an input format other than JSON, such as comma-separated values (CSV) or structured text, you can use AWS Lambda to transform it to JSON first.
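
A minimal boto3 sketch of a producer sending one record to a Firehose delivery stream (the delivery stream name and payload are hypothetical); Firehose buffers records and delivers them in batches to the configured destination:

import json
import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="logs-to-s3",  # hypothetical delivery stream
    Record={"Data": (json.dumps({"level": "INFO", "msg": "hello"}) + "\n").encode()},
)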

Figure 3: Kinesis Data Firehose Diagram


Figure 4: Kinesis Data Firehose Delivery Diagram

Kinesis Data Streams vs. Firehose
Streams
 Going to write custom code (producer / consumer)
 Real time (~200 ms latency for classic, ~70 ms latency for enhanced fan-out)
 Must manage scaling (shard splitting / merging)
 Data Storage for 1 to 7 days, replay capability, multi consumers

Firehose
 Fully managed, send to S3, Splunk, Redshift, ElasticSearch.
 Serverless data transformations with Lambda
 Near real time (lowest buffer time is 1 minute)
 Automated Scaling
 No data storage

1.2.2.3 Kinesis Analytics

Figure 5: Kinesis Analytics Conceptually


Figure 6: Kinesis Analytics in more depth

 Amazon Kinesis Data Analytics reduces the complexity of building, managing, and
integrating Apache Flink applications with other AWS services.
 Pay only for resources consumed (but it’s not cheap)
 Serverless; scales automatically
 Use IAM permissions to access streaming source and destination(s)
 SQL or Flink to write the computation
 Schema discovery
 Lambda can be used for pre-processing
 Kinesis Data Analytics can reference tables stored in S3 buckets.
 Amazon Kinesis Analytics applications can transform data before it is processed by your
SQL code. This feature allows you to use AWS Lambda to convert formats, enrich data,
filter data, and more. Once the data is transformed by your function, Kinesis Analytics
sends the data to your application’s SQL code for real-time analytics.

NOTE: Apache Flink is an open source framework and engine for processing data
streams

SQL for simple and fast use cases
 Sub-second end to end processing latencies
 SQL steps can be chained together in serial or parallel steps
 Build applications with one or hundreds of queries


 Pre-built functions include everything from sum and count distinct to machine learning
algorithms
 Aggregations run continuously using window operators

Java for sophisticated applications
Utilizes Apache Flink, a framework and distributed engine for stateful processing of data streams.
Simple programming: easy-to-use and flexible APIs make building apps fast.
High performance: in-memory computing provides low latency and high throughput.
Stateful processing: durable application state saves.
Strong data integrity: exactly-once processing and consistent state.

Use cases
 Streaming ETL: select columns, make simple transformations, on streaming data
 Continuous metric generation: live leaderboard for a mobile game
 Responsive analytics: look for certain criteria and build alerting (filtering)

Machine Learning on Kinesis Data Analytics
RANDOM_CUT_FOREST
 SQL function used for anomaly detection on numeric columns in a stream
 Example: detect anomalous subway ridership during the NYC marathon
 Uses recent history to compute model

Figure 7: Anomaly Detection with Random Cut Forest


HOTSPOTS
 Locate and return information about relatively dense regions in your data
 Example: a collection of overheated servers in a data center

Figure 8: Dense Regions with Hotspots

1.2.2.4 Kinesis Video Streams

Figure 9: Kinesis Video Streams


Producers:
 Security camera, body-worn camera, AWS DeepLens, smartphone camera, audio feeds,
images, RADAR data, RTSP camera.
 One producer per video stream
 Video playback capability

Consumers
 Build your own (MXNet, Tensorflow)
 AWS SageMaker
 Amazon Rekognition Video
 Keep data for 1 hour to 10 years

Kinesis Video Streams use cases
Amazon Kinesis Video Streams Inference Template (KIT) for Amazon SageMaker enables customers to attach Kinesis Video streams to Amazon SageMaker endpoints in minutes. This drives real-time inferences without having to use any other libraries or write custom software to integrate the services. The KIT comprises the Kinesis Video Client Library software packaged as a Docker container and an AWS CloudFormation template that automates the deployment of all required AWS resources.

Figure 10: Analyze Live Video Stream

The software pulls media fragments from the streams using the real-time Kinesis Video Streams
GetMedia API operation, parses the media fragments to extract the H264 chunk, samples the
frames that need decoding, then decodes the I-frames and converts them into image formats such as JPEG/PNG, before invoking the Amazon
SageMaker-hosted model returns inferences, KIT captures and publishes those results into a
Kinesis data stream. Customers can then consume those results using their favorite service, such
as AWS Lambda. Finally, the library publishes a variety of metrics into Amazon CloudWatch so
that customers can build dashboards, monitor, and alarm on thresholds as they deploy into
production.

Kinesis Summary: Machine Learning
 Kinesis Data Stream: create real-time machine learning applications
 Kinesis Data Firehose: ingest massive data near-real time
 Kinesis Data Analytics: real-time ETL / ML algorithms on streams
 Kinesis Video Stream: real-time video stream to create ML applications

Amazon Kinesis: Firehose vs. Streams
Amazon Kinesis Data Streams is for use cases that require custom processing, per incoming
record, with sub-1 second processing latency, and a choice of stream processing frameworks.
Amazon Kinesis Data Firehose is for use cases that require zero administration, ability to use
existing analytics tools based on Amazon S3, Amazon Redshift, and Amazon ES, and a data
latency of 60 seconds or higher.

Kinesis Architecture Example


Figure 11: Kinesis Architecture Example


1.2.3 Glue
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and
combine data for analytics, machine learning, and application development. AWS Glue provides
all of the capabilities needed for data integration.
AWS Glue provides both visual and code-based interfaces to make data integration easier. Users
can easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL (extract,
transform, and load) developers can visually create, run, and monitor ETL workflows with a few
clicks in AWS Glue Studio. Data analysts and data scientists can use AWS Glue DataBrew to
visually enrich, clean, and normalize data without writing code. With AWS Glue Elastic Views,
application developers can use familiar Structured Query Language (SQL) to combine and
replicate data across different data stores.
AWS Glue enables you to perform ETL operations on streaming data using continuously-running
jobs. AWS Glue streaming ETL is built on the Apache Spark Structured Streaming engine, and can
ingest streams from Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed
Streaming for Apache Kafka (Amazon MSK). Streaming ETL can clean and transform streaming
data and load it into Amazon S3 or JDBC data stores. Use Streaming ETL in AWS Glue to process
event data like IoT streams, clickstreams, and network logs.

Features
 Fully managed, cost effective, pay only for the resources consumed
 Jobs are run on a serverless Spark platform
 Glue Scheduler to schedule jobs; the minimum run interval is 5 minutes
 Glue Triggers to automate job runs based on “events”

1.2.3.1 Glue Data Catalog
Metadata repository for all your tables
 Hive metastore compatible with enhanced functionality
 Automated Schema Inference
 Schemas are versioned
 Integrates with Athena or Redshift Spectrum (schema & data discovery)
 Glue Crawlers can help build the Glue Data Catalog


Figure 12: Glue Data Catalog

1.2.3.2 Crawlers
 Crawlers go through your data to infer schemas and partitions
 Works on JSON, Parquet, CSV, and relational stores
 Crawlers work for: S3, Amazon Redshift, Amazon RDS
 Run the crawler on a schedule or on demand
 Needs an IAM role / credentials to access the data stores


Figure 13: AWS Glue crawlers interact with data stores

The following is the general workflow for how a crawler populates the AWS Glue Data
Catalog:

1. A crawler runs any custom classifiers that you choose to infer the format and schema
of your data. You provide the code for custom classifiers, and they run in the order that
you specify.
2. The first custom classifier to successfully recognize the structure of your data is used to
create a schema. Custom classifiers lower in the list are skipped.
3. If no custom classifier matches your data's schema, built-in classifiers try to recognize
your data's schema. An example of a built-in classifier is one that recognizes JSON.
4. The crawler connects to the data store. Some data stores require connection
properties for crawler access.
5. The inferred schema is created for your data.
6. The crawler writes metadata to the Data Catalog. A table definition contains metadata about the data in your data store. The table is written to a database, which is a container of tables in the Data Catalog. Attributes of a table include classification, a label created by the classifier that inferred the table schema.
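
A minimal boto3 sketch of creating and starting a crawler over an S3 prefix (the crawler name, role, database, and path are placeholders):

import boto3

glue = boto3.client("glue")

# The inferred tables are written into the "my_db" Data Catalog database.
glue.create_crawler(
    Name="sensor-data-crawler",                      # hypothetical
    Role="arn:aws:iam::111122223333:role/GlueRole",  # placeholder role
    DatabaseName="my_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/dataset/"}]},
    Schedule="cron(0 * * * ? *)",  # hourly; omit to run on demand only
)
glue.start_crawler(Name="sensor-data-crawler")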

Data store connections could be:
 Amazon S3
 Amazon RDS
 Amazon RedShift
 Amazon DynamoDB
 JDBC

Glue and S3 Partitions
 Glue crawler will extract partitions based on how your S3 data is organized
 Think up front about how you will be querying your data lake in S3
 Example: devices send sensor data every hour
 Do you query primarily by time ranges?
If so, organize your buckets as s3://my-bucket/dataset/yyyy/mm/dd/device
 Do you query primarily by device?
If so, organize your buckets as s3://my-bucket/dataset/device/yyyy/mm/dd

1.2.3.3 Glue ETL
Transform data, clean data, enrich data (before doing analysis)
 Generate ETL code in Python or Scala; you can modify the code
 Can provide your own Spark or PySpark scripts

Bundled Transformations:
 DropFields, DropNullFields: remove (null) fields
 Filter: specify a function to filter records
 Join: to enrich data
 Map: add fields, delete fields, perform external lookups

Machine Learning Transformations:
 FindMatches ML: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.

 Apache Spark transformations (example: K-Means)

In typical analytic workloads, column-based file formats like Parquet or ORC are preferred over
text formats like CSV or JSON. It is common to convert data from CSV/JSON/etc. into Parquet for
files on Amazon S3, which can be done in the transformation phase.
Target can be S3, JDBC (RDS, Redshift), or in Glue Data Catalog
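
A minimal Glue ETL script sketch tying these pieces together (it runs inside a Glue job, where the awsglue libraries are provided; the database, table, field, and path names are placeholders):

from awsglue.context import GlueContext
from awsglue.transforms import DropNullFields, Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read from a Data Catalog table (e.g. one populated by a crawler).
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="sensor_data"
)

# Bundled transformations: drop all-null fields, keep valid readings.
frame = DropNullFields.apply(frame=frame)
frame = Filter.apply(frame=frame, f=lambda row: row["temperature"] is not None)

# Write back to S3 as Parquet, the preferred columnar format.
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean/"},
    format="parquet",
)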

1.2.3.4 Job Authoring
 Auto-generates ETL code
 Built on open frameworks: Python/Scala and Apache Spark
 Developer-centric: editing, debugging, sharing
 Steps:
- Pick a Source
- Pick a Target
- Apply Transformation
- Edit Code Generated

1.2.3.5 Job Execution
 Run jobs on a serverless Spark platform
 Provides flexible scheduling, job monitoring, and alerting
 Compose jobs globally with event-based dependencies
- Easy to reuse and leverage work across organization boundaries
 Multiple triggering mechanisms
- Schedule-based: e.g., time of day
- Event-based: e.g., job completion
- On-demand: e.g., AWS Lambda
 Logs and alerts are available in Amazon CloudWatch
 Glue keeps track of data that has already been processed by a previous run of an ETL job.
This persisted state information is called a bookmark.
- For example, you get new files every day in your S3 bucket. By default, AWS Glue
keeps track of which files have been successfully processed by the job to prevent
data duplication.
 There is no need to provision, configure, or manage servers:
- Auto-configure VPC and role-based access
- Customers can specify the capacity that gets allocated to each job
- Automatically scale resources (on post-GA roadmap)
- You pay only for the resources you consume while consuming them

Figure 14: Glue Execution

1.2.3.6 Job Workflow
 Create and visualize complex ETL activities involving multiple crawlers, jobs, and triggers
 Records execution progress and status
 Provides both static and dynamic views

Figure 15: Glue Workflow


1.2.4 Data Stores in Machine Learning
1.2.4.1 Redshift
 Data warehousing, SQL analytics (OLAP: Online Analytical Processing)
 Load data from S3 to Redshift
 Use Redshift Spectrum to query data directly in S3 (no loading)

1.2.4.2 RDS, Aurora
 Relational store, SQL (OLTP: Online Transaction Processing)
 Must provision servers in advance

1.2.4.3 DynamoDB
 NoSQL data store, serverless, provision read/write capacity
 Useful to store a machine learning model served by your application

1.2.4.4 ElasticSearch
 Indexing of data
 Search amongst data points
 Clickstream Analytics

1.2.4.5 ElastiCache
 Caching mechanism
 Not really used for Machine Learning

NOTE: Amazon ML allows you to create a datasource object from data stored in a
MySQL database in Amazon Relational Database Service (Amazon RDS). When you
perform this action, Amazon ML creates an AWS Data Pipeline object that
executes the SQL query that you specify, and places the output into an S3 bucket
of your choice. Amazon ML uses that data to create the datasource.

1.2.4.6 Data Pipeline
AWS Data Pipeline is a web service that provides a simple management system for data-driven
workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that
contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the
“schedule” on which your business logic executes. For example, you could define a job that, every
hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon
Simple Storage Service (Amazon S3) log data, loads the results into a relational database for
future lookup, and then automatically sends you a daily summary email.

Example:
You can use AWS Data Pipeline to archive your web server's logs to Amazon Simple Storage
Service (Amazon S3) each day and then run a weekly Amazon EMR (Amazon EMR) cluster over
those logs to generate traffic reports. AWS Data Pipeline schedules the daily tasks to copy data
and the weekly task to launch the Amazon EMR cluster. AWS Data Pipeline also ensures that
Amazon EMR waits for the final day's data to be uploaded to Amazon S3 before it begins its
analysis, even if there is an unforeseen delay in uploading the logs.


Features
 Manages task dependencies
 Retries and notifies on failures
 Data sources may be on-premises
 Highly available
 Destinations include S3, RDS, DynamoDB, Redshift and EMR
 Control over environment resources
 Access to EC2 and EMR
 Can create resources in your account

Figure 16: Data Pipeline Example

Data Pipeline Vs Glue


Glue:
 Glue ETL -Run Apache Spark code, Scala or Python based, focus on the ETL
 Glue ETL -Do not worry about configuring or managing the resources
 Data Catalog to make the data available to Athena or Redshift Spectrum

Data Pipeline:
 Orchestration service
 More control over the environment, compute resources that run code, & code
 Allows access to EC2 or EMR instances (creates resources in your own account)

1.2.4.7 AWS Batch


It enables developers, scientists, and engineers to easily and efficiently run hundreds of
thousands of batch computing jobs on AWS. It dynamically provisions the optimal quantity and
type of compute resources (e.g., CPU or memory optimized instances) based on the volume and
specific resource requirements of the batch jobs submitted.

Features
 Run batch jobs as Docker images
 Dynamic provisioning of the instances (EC2 & Spot Instances)
 Optimal quantity and type based on volume and requirements
 No need to manage clusters, fully serverless
 You just pay for the underlying EC2 instances

 Schedule Batch Jobs using CloudWatch Events
 Orchestrate Batch Jobs using AWS Step Functions

AWS Batch vs Glue


Glue:
 Glue ETL -Run Apache Spark code, Scala or Python based, focus on the ETL
 Glue ETL -Do not worry about configuring or managing the resources
 Data Catalog to make the data available to Athena or Redshift Spectrum

Batch:
 For any computing job regardless of the job (must provide Docker image)
 Resources are created in your account, managed by Batch
 For any non-ETL related work, Batch is probably better

1.2.4.8 Data Migration Service


 Quickly and securely migrate databases to AWS, resilient, self-healing
 The source database remains available during the migration
 Supports:
 Homogeneous migrations: e.g., Oracle to Oracle
 Heterogeneous migrations: e.g., Microsoft SQL Server to Aurora
 Continuous Data Replication using CDC
 You must create an EC2 instance to perform the replication tasks
 No data transformation, once the data is in AWS, you can use Glue to transform it

1.2.4.9 Step Function


Step Functions is a serverless orchestration service that lets you combine AWS Lambda functions
and other AWS services to build business-critical applications.
Step Functions is based on state machines and tasks. A state machine is a workflow. A task is a
state in a workflow that represents a single unit of work that another AWS service performs. Each
step in a workflow is a state.

Features
 Use to design workflows
 Easy visualizations
 Advanced Error Handling and Retry mechanism outside the code
 Audit of the history of workflows
 Ability to “Wait” for an arbitrary amount of time
 Max execution time of a State Machine is 1 year
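As a rough illustration of the state machine concept (the workflow name, role, and Lambda ARN below are hypothetical), a one-task workflow with a retry policy handled outside the code could be created with boto3:

import json
import boto3

sfn = boto3.client("stepfunctions")

# A one-state workflow: a Lambda task with retries defined declaratively
definition = {
    "StartAt": "ProcessData",
    "States": {
        "ProcessData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-data",
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "End": True,
        }
    },
}

response = sfn.create_state_machine(
    name="demo-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole")
print(response["stateMachineArn"])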
Examples

Figure 17: Train a Machine Learning Model
Figure 18: Tune a Machine Learning Model
Figure 19: Manage a Batch Job

1.2.5 Full Data Engineer Pipeline


1.2.5.1 Real time Layer

Figure 20: Real Time Data Engineer

1.2.5.2 Video Layer

Figure 21: Video Layer Data Engineer

1.2.5.3 Batch Layer

Figure 22: Batch Layer

1.3 Identify and implement a data-transformation


1.3.1 Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering local computation and
storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to
detect and handle failures at the application layer, so delivering a highly-available service on top
of a cluster of computers, each of which may be prone to failures.

Components
 Hadoop Core (Common):
Libraries and utilities that allow the other Hadoop modules to run, built on top of Java and scripts.
 HDFS:
Hadoop Distributed File System
 YARN (Yet Another Resource Negotiator)
 Manage the resources across the cluster.
 It performs scheduling and resource allocation for the Hadoop System.
 It is composed of three components: the ResourceManager, the NodeManager, and the ApplicationMaster.
 MapReduce
Software framework for easily writing applications that process vast amount of data in
parallel on a large cluster in a reliable fault tolerant manner.
It consists of:
 Map functions: transform, reformat, or extract data; their output is intermediate data.
 Reduce functions: take the intermediate data and aggregate it into the final answer.

1.3.2 Amazon EMR


 Amazon Elastic MapReduce is an Amazon Web Service tool for big data processing and
analysis. Amazon EMR offers the expandable low-configuration service as an easier
alternative to running in-house cluster computing.
 Amazon EMR is based on Apache Hadoop
 Include Spark, Hbase, Presto, Flink and Hive.

 Composed of clusters; a cluster is a collection of EC2 instances, where every instance is called a node.

EMR Cluster
 Master Node
 Manages the cluster by running software components to co-ordinate the
distribution of data and tasks among other nodes for processing.
 It tracks the status of tasks and monitors the health of the cluster.
 Also known as Leader Node
 Core Nodes
 These are the nodes with software components that run tasks and store the data on
the HDFS
 Task Nodes
 These nodes only run tasks and don’t store data on HDFS; they are used purely for computation (e.g., to handle temporary surges in workload).

EMR Usage
 Transient Cluster: configured to automatically terminate once all steps have been completed.
Load input data → Process data → Store data → Terminate
 Long-Running Cluster: manually terminated after you finish interacting with it.
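A transient cluster can be sketched with boto3 as below (instance types, counts, and the script location are hypothetical); setting KeepJobFlowAliveWhenNoSteps to False makes the cluster terminate automatically once all steps finish:

import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="transient-etl-cluster",
    ReleaseLabel="emr-6.2.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after steps
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])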

EMR Services
 Nodes are EC2 instances
 VPC to configure network
 S3 to load and save your data
 CloudWatch to monitor cluster performance and configure alarms
 IAM for permissions
 CloudTrail to audit requests to the services
 Data pipeline to schedule and start cluster

EMR Storage
 HDFS
Very good for performance, but the data goes away when the cluster shuts down. HDFS stores data as blocks distributed across the cluster; the default block size is 128 MB.

 EMRFS
Allows you to use S3 as though it were an HDFS file system; an optional consistent view uses DynamoDB to track the consistency of objects in EMRFS.
 Local File System
 EBS

EMR Promises
 EMR charges by the hour, plus the cost of the underlying EC2 instances
 Provisions new nodes on failure
 Add/remove task nodes on the fly
 Resize a running cluster’s nodes

1.3.3 Apache Spark


Apache Spark is an open-source, distributed processing system used for big data workloads. It
utilizes in-memory caching, and optimized query execution for fast analytic queries against data
of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse
across multiple workloads—batch processing, interactive queries, real-time analytics, machine
learning, and graph processing.
 Spark replaces MapReduce while still running on top of YARN and HDFS. Spark also has its own resource manager and may not use HDFS, depending on the use case.
 It utilizes in-memory caching and optimized query execution for fast analytic queries.
 Spark has APIs for Java, R, Python, and Scala, and supports code reuse.
 Spark is used more for transforming data at scale than for simple batch processing.

How Spark Works?

Figure 23: How Spark works

1. Spark context connect to different cluster managers which allocate the resources across
the applications.
2. Upon connecting, Spark will acquire executors on nodes in the cluster.
3. The executors are processes that run computations and store data.
4. The application code is sent to the executors.
5. Spark context will send tasks to the executors to run.
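A minimal PySpark sketch of this flow (the file path and column names are hypothetical): the SparkSession creates the Spark context, transformations lazily build a plan, and an action triggers the executors to run it.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point; connects to the cluster manager (YARN, standalone, etc.)
spark = SparkSession.builder.appName("demo").getOrCreate()

# Load a CSV into a DataFrame (a distributed collection of rows)
df = spark.read.csv("s3://my-bucket/input/orders.csv",
                    header=True, inferSchema=True)

# Transformations are lazy; they only describe work for the executors
totals = (df.filter(F.col("amount") > 0)
            .groupBy("customer_id")
            .agg(F.sum("amount").alias("total_spent")))

# Spark SQL over the same data
df.createOrReplaceTempView("orders")
top = spark.sql("SELECT customer_id, SUM(amount) AS total "
                "FROM orders GROUP BY customer_id "
                "ORDER BY total DESC LIMIT 10")

# Actions (show, collect, write) trigger execution on the executors
top.show()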

Spark Components
 Resilient Distributed Dataset (RDD)
Represents a logical collection of data partitioned across different compute nodes.

 Spark SQL
Engine that provides low-latency interactive queries up to 100x faster than MapReduce.
Supports various data sources: JDBC, ODBC, JSON, ORC, Parquet, and HDFS.
Spark SQL exposes DataFrames (in Python) and Datasets (in Scala).
Spark SQL uses distributed queries that execute across the entire cluster.

 Spark Streaming

Real-time solution that leverages Spark Core’s fast scheduling capabilities to do streaming analytics.
It supports ingestion from Twitter, Kafka, Flume, HDFS, and ZeroMQ.
Spark Streaming can integrate with AWS Kinesis.
 MLlib (Machine Learning Library)
 Graphx
Data structure graph.

MLlib
Spark’s Machine Learning Library contains:
 Classification: logistic regression and Naive Bayes
 Regression
 Decision Trees
 Recommendation engine using ALS (Alternating Least Squares)
 Clustering (K-Means)
 LDA (topic modeling)
 SVD, PCA
 ML workflows (pipelines, transformation and persistence)
 Statistics functions
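As an example of the list above, a hedged sketch of K-Means clustering with MLlib’s DataFrame API (the input path and feature columns are hypothetical):

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-kmeans").getOrCreate()
df = spark.read.parquet("s3://my-bucket/features/")  # hypothetical path

# MLlib expects the features assembled into a single vector column
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
features = assembler.transform(df)

# Fit K-Means with 5 clusters; transform() adds a "prediction" column
model = KMeans(k=5, seed=42, featuresCol="features").fit(features)
clustered = model.transform(features)
clustered.select("age", "income", "prediction").show(5)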

Zeppelin
 A notebook for Spark.
 Can use Spark SQL.
 Can visualize data in charts and graphs.

EMR Notebook
 Amazon’s notebook for EMR, with deeper integration into AWS.
 Notebooks are backed up to S3.
 Provision clusters from the notebook.
 Feed tasks to the cluster from the notebook.
 Hosted inside a VPC.
 Accessible only via the AWS console.

 Build Spark apps and run queries on the cluster.
 PySpark, Spark SQL, R, and Scala
 Graphical libraries
 Hosted outside the cluster
 Teams can collaborate on the same notebook
 No additional charge

EMR Security
 IAM Policies
To grant or deny permissions and determine what actions a user can perform with Amazon EMR and other AWS resources. IAM policies can be combined with tags to control access on a cluster-by-cluster basis.

 IAM Role
IAM roles for EMRFS requests to Amazon S3 let you control whether cluster users can access files based on user, group, or the S3 location of the data.

 Kerberos
Strong authentication through secret key cryptography that ensures that passwords aren’t
sent over the network in unencrypted format.

 SSH (Secure Socket Shell)


Provide secure way for users to connect to the command line on cluster instance.
SSH also used for tunneling to view the various web interfaces.

 Kerberos and Amazon EC2


Key pairs to use client authentication.

 IAM Role and EMR


Control access to other AWS services.
For example, a role for Auto Scaling and a role for the cluster’s EC2 instances.

EMR Instance Types


 Master Node
m4.large if the cluster has fewer than 50 nodes
m4.xlarge if the cluster has 50 or more nodes

 Core & Task Node


 m4.large is usually good
 If the cluster waits on external dependencies (e.g., a web crawler), use t2.medium
 For improved performance, use m4.xlarge
 For intensive computation, use CPU-optimized instances (C family)
 For databases or in-memory caching, use high-memory instances (2xlarge, 4xlarge, etc.)
 For network/CPU-intensive workloads (NLP, ML), use cluster compute instances
 Task nodes can be Spot Instances

NOTE: Using Spot Instances for the master node or core nodes risks partial data loss, so use them there only for testing.
For using Spark with SageMaker refer to section 4.1.12 SageMaker with Spark

Amazon EMR is the best place to run Apache Spark:


 Quickly and easily create managed Spark clusters from the AWS Management Console,
AWS CLI, or the Amazon EMR API.
 Fast Amazon S3 connectivity using the Amazon EMR File System (EMRFS)
 Integration with the Amazon EC2 Spot market and AWS Glue Data Catalog
 EMR Managed Scaling to add or remove instances from your cluster
AWS Lake Formation brings fine-grained access control, while integration with AWS Step
Functions helps with orchestrating your data pipelines.
EMR Studio (preview) is an integrated development environment (IDE)
 Makes it easy for data scientists and data engineers to develop, visualize, and debug data
engineering and data science applications written in R, Python, Scala, and PySpark.

 Provides fully managed Jupyter Notebooks, and tools like Spark UI and YARN Timeline
Service to simplify debugging.

Features and benefits


Fast performance
EMR features Amazon EMR runtime for Apache Spark:
 Performance-optimized runtime environment for Apache Spark that is active by default on
Amazon EMR clusters.
 Amazon EMR runtime for Apache Spark can be over 3x faster than clusters without the
EMR runtime
 100% API compatibility with standard Apache Spark.
 Runs faster and saves you compute costs, without requiring any changes to your applications.

By using a directed acyclic graph (DAG) execution engine, Spark can create efficient query plans for data transformations. Spark also stores input, output, and intermediate data in memory as resilient distributed datasets (RDDs), which allows for fast processing without I/O cost, boosting performance of iterative or interactive workloads.

Develop applications quickly and collaboratively


 Apache Spark natively supports Java, Scala, SQL, and Python, which gives you a variety of
languages for building your applications.
 Submit SQL or HiveQL queries using the Spark SQL module
 Use the Spark API interactively with Python or Scala directly in the Spark shell, via EMR Studio, or in Jupyter notebooks on your cluster.
 Apache Hadoop 3.0 in EMR 6.0 brings Docker container support to simplify managing
dependencies.
 Leverage cluster-independent EMR Notebooks (based on Jupyter) or use Zeppelin to create
interactive and collaborative notebooks for data exploration and visualization.
 Tune and debug your workloads in the EMR console which has an off-cluster, persistent
Spark History Server.

Create diverse workflows


Apache Spark includes several libraries to help build applications for:
 Machine learning (MLlib),
 Stream processing (Spark Streaming)
 Graph processing (GraphX)
You can use deep learning frameworks like Apache MXNet with your Spark applications
Integration with AWS Step Functions enables you to add serverless workflow automation and
orchestration to your applications.

Integration with Amazon EMR feature set


Submit Apache Spark jobs with the EMR Step API:
 Use Spark with EMRFS to directly access data in S3,
 Save costs using EC2 Spot capacity,
 Use EMR Managed Scaling to dynamically add and remove capacity, and launch long-
running or transient clusters to match your workload.
 Configure Spark encryption and authentication with Kerberos using an EMR security
configuration.
 AWS Glue Data Catalog to store Spark SQL table metadata
 Amazon SageMaker with your Spark machine learning pipelines.
 EMR installs and manages Spark on Hadoop YARN, and you can also add other big data
applications on your cluster.
 EMR with Apache Hudi lets you more efficiently manage change data capture (CDC) and
helps with privacy regulations like GDPR and CCPA by simplifying record deletion.

2. Exploratory Data Analysis


2.1 Perform feature engineering
2.1.1 Data Distribution
 Normal Distribution:
- For continuous numbers
- Within 1 SD of the mean → 34.1% on each side (≈68.2% total)
- Between 1 and 2 SD → 13.6% on each side
- Between 2 and 3 SD → 2.1% on each side
- Beyond 3 SD → ≈0.1% on each side
 Probability Mass function:
- For discrete data
- Probability of discrete data occurrence
 Poisson Distribution
- Number of events over a period of time, area, or distance
- e.g., how many emails or calls are received per hour
- Values are discrete counts
 Binomial Distribution
- Head or tail
- Positive or negative
 Bernoulli Distribution
- Special case of Binomial
- Has a single trial (n = 1)
- A Binomial distribution is a series of independent Bernoulli trials for larger n

2.1.2 Trends & Seasonality


 Trend: the overall direction in which the data moves
 Seasonality: repeating cycles (frequency) in the curve

Figure 24: Trends & Seasonality

 Additive model
When seasonality is constant.
Time series = Seasonality + Trend + Noise

 Multiplicative model
Models seasonality whose magnitude increases as the trend increases.
Time Series = Seasonality × Trend × Noise
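With the statsmodels library the model type is chosen explicitly; a short sketch on a hypothetical monthly series (the file and column names are placeholders):

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series indexed by date
sales = pd.read_csv("sales.csv", parse_dates=["month"], index_col="month")

# model="additive" when seasonal swings are constant,
# model="multiplicative" when they grow with the trend
result = seasonal_decompose(sales["revenue"], model="additive", period=12)
trend, seasonal, noise = result.trend, result.seasonal, result.resid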

2.1.3 Types of Visualization


 Bar charts: for comparison and distribution (Histograms)
 Line Graph: For changes overtime
 Scatter plot and heatmap: For correlation
 Pie graph: for aggregation and percentages
 Pivot tables: For tabular data

NOTE: Correlation matrix is used to show linear relationship, while scatter matrix
shows any relationship.

NOTE: Scatterplot matrices visualize attribute-target and attribute-attribute pairwise relationships. Correlation matrices measure the linear dependence between features and can be visualized with heat maps.
Highly correlated
 Highly correlated (positive or negative) features usually degrade performance of linear ML
models such as linear and logistic regression models - we should select one of the
correlated features and discard the other(s).
 Decision Trees are immune to this problem.

2.1.4 Dimension Reduction


 Too many features can be problematic.
 Every feature is a dimension.
 Unsupervised dimension reduction technique can also be employed to distill many
features into fewer features:
- PCA: principal component analysis
- K-means

2.1.5 Missing Data


 Missing data is problematic for most ML algorithms.
 Missing data should be processed (imputed) by one of the following methods:

- Mean Replacing
Replace missing data with mean value.
If you have outliers, you can use “Median value”
Advantage: fast and easy and will not affect sample size nor mean value.
Disadvantage:
 Can’t be used for categorical columns
 Not very accurate
 Misses correlations between features.

- Dropping
Drop rows or columns contains the missing data.
Use this method if:
 Not many rows contain missing data
 Dropping rows doesn’t bias the data
 You don’t have enough time
 It is reasonable for the problem at hand
Dropping is rarely the best approach; consider instead whether the missing data can be filled from another column, such as a summary or text field.
- Common point
Use the most common value for that column to replace missing values. Useful for
categorical variables.

- Machine Learning
Use Machine Learning algorithms to fill the missing data
 KNN
Find “K nearest neighbor” rows and average their values.
Assume numeric data, there is methods for categorical data handling but not
good solutions.
 Deep Learning
Build machine learning model to impute missing data.

Works well with categorical columns, but is more complicated.
 Regression
Find linear and non-linear relation between features and missing feature.
 MICE (Multiple Imputation by Chained Equation)
It is a type of regression, also called “Fully Conditional Specification” or “Sequential Regression Multiple Imputation”.
Gives very good results.
 Get more data if possible
 AWS Datawig tool uses neural networks to predict missing values in tabular
data
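A short scikit-learn sketch of two of these methods, mean replacement and KNN imputation (the toy array is illustrative):

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[25.0, 50000.0],
              [32.0, np.nan],     # missing income
              [np.nan, 62000.0],  # missing age
              [41.0, 58000.0]])

# Mean replacement (strategy="median" if there are outliers,
# strategy="most_frequent" for categorical-style columns)
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: average the values of the k nearest rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)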

2.1.6 Unbalanced Data


When the training data does not contain sufficient examples of one class (the minority), e.g., when you train a fraud-detection model and the training data contains very few fraudulent transactions, the data is said to be unbalanced. This is problematic for most ML algorithms.
 Over Sampling
Duplicate samples from the minority class
Can be done at random
 Under Sampling
Remove samples from the majority class.
Throwing away data is usually not a good idea and is not best practice.
 Synthetic Minority Oversampling TEchnique (SMOTE)
Artificially generates new samples of the minority class using nearest neighbors (KNN):
- Run KNN on each sample in the minority class
- Create new samples from the K nearest neighbors (e.g., the mean of the neighbors)
SMOTE both generates new minority samples and under-samples the majority class.
Generally better than just oversampling.
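A hedged sketch of SMOTE using the imbalanced-learn library (the synthetic dataset is for illustration):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy dataset where only ~5% of samples are in the minority class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=42)
print("before:", Counter(y))

# SMOTE synthesizes new minority samples from their nearest neighbors
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))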

2.1.7 Handling Outliers


 Use the Standard Deviation (SD) to decide whether a value is an outlier or not.
 Box and whisker diagrams shows outliers.
 AWS Random Cut Forest algorithm is made for outlier detection.

2.1.8 Binning
 Bucket observations together based on range of values.
 Useful when there is uncertainty of measurements.
 Transform numeric data to ordinal data.
 For example: age (20s), (30s)….etc.
 Using algorithms for categorical data rather than numerical data.
 Quantile binning
It categorizes your data by its place in the data distribution, so that every one of your bins has an equal number of samples within it.
It ensures the same number of samples ends up in each resulting bin.
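A small pandas sketch contrasting fixed-range and quantile binning (toy ages):

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 44, 47, 53, 61])

# Fixed-width binning: explicit ranges such as 20s, 30s, ...
fixed = pd.cut(ages, bins=[20, 30, 40, 50, 60, 70],
               labels=["20s", "30s", "40s", "50s", "60s"])

# Quantile binning: each bin receives the same number of samples
quantile = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])
print(quantile.value_counts())  # 2 samples per bin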

2.1.9 Transforming
 If a feature has an exponential trend within it, we can apply a logarithmic transform to make the data look more linear.
 You can also apply square (x²) or square-root (√x) transforms.

2.1.10 One hot encoding


 Create bucket for every category.
 The bucket of your category has 1 and all others are 0.
 Very common in deep learning, where categories are represented by individual output
“neurons”.
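A short sketch with pandas and scikit-learn (the color column is illustrative):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# pandas: one 0/1 column ("bucket") per category
dummies = pd.get_dummies(df, columns=["color"])

# scikit-learn: fit on training data, reuse on new data;
# handle_unknown="ignore" avoids errors on unseen categories
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(df[["color"]]).toarray()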

2.1.11 Scaling
 Some models prefer data to be normally distributed around 0.
 Most models require feature data to at least be scaled to comparable values. Otherwise, features with larger magnitudes will carry more weight than they should. For example, with age and income, income would carry much more weight than age.
 There are 4 methods for scaling:

- MinMax Scaler
Also called normalization.
Values are shifted and rescaled so they end up ranging from 0 → 1.
Formula: subtract the min value from the value, then divide by (max – min).
Very sensitive to outliers.
Normalizer, by contrast, builds totally new features that are not correlated to the initial features.
- Standardization
This is done by subtract the value from mean value then divide by standard
deviation. So, the result distribution has unit variance.
It doesn’t bound to 0  1 but not affected by outliers.

NOTE: StandardScaler and other scalers that work featurewise are preferred when meaningful information is located in the relation between feature values from one sample to another sample, whereas Normalizer and other scalers that work samplewise are preferred when meaningful information is located in the relation between feature values from one feature to another feature.

- Max Abs Scaler
Divide the feature data by the maximum absolute value in the data.
The value range will be from -1 → 1.
It doesn’t shift or center the data, and thus doesn’t destroy sparsity.

- Robust Scaler
It is better than standardization in dealing with outliers.
Formula as follows:
 Calculate median (50th percentile)
 Calculate 25th and 75th percentiles
 Value = (Value – median) / (P75 – P25)
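A compact scikit-learn sketch of the four scalers on toy age/income data:

import numpy as np
from sklearn.preprocessing import (MinMaxScaler, StandardScaler,
                                   MaxAbsScaler, RobustScaler)

# Two features on very different scales: age and income
X = np.array([[25, 40000], [32, 52000], [47, 150000], [51, 61000]])

X_minmax = MinMaxScaler().fit_transform(X)  # rescales to [0, 1]
X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance
X_maxabs = MaxAbsScaler().fit_transform(X)  # rescales to [-1, 1], keeps sparsity
X_robust = RobustScaler().fit_transform(X)  # median/IQR, resists outliers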

2.1.12 Data Skewing

 Positive = Right Skew
 Negative = Left Skew
 If positive (right skew), we can bring the data closer to a normal distribution by:
- Logarithmic transformation
- Square root transformation
- Reciprocal transformation
 If negative (left skew), we can bring the data closer to a normal distribution by:
- Exponential transformation
- Power transformation
- Arcsine transformation
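A numpy sketch of these transforms (the generated skewed distributions are purely illustrative):

import numpy as np

right_skewed = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)

# Right (positive) skew: compress the long right tail
log_t = np.log1p(right_skewed)         # logarithmic
sqrt_t = np.sqrt(right_skewed)         # square root
recip_t = 1.0 / (right_skewed + 1e-9)  # reciprocal (guard against zeros)

left_skewed = np.random.beta(a=5, b=1, size=1000)  # mass near 1, left tail

# Left (negative) skew: stretch the left tail instead
exp_t = np.exp(left_skewed)                  # exponential
power_t = np.power(left_skewed, 3)           # power
arcsine_t = np.arcsin(np.sqrt(left_skewed))  # arcsine (values in [0, 1])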

2.1.13 Residuals

A residuals plot with an increasing spread suggests that the error variance increases with the independent variable, while a distribution that reveals a decreasing spread indicates that the error variance decreases with the independent variable. Neither of these distributions is a constant-variance pattern, so they indicate that the assumption of constant variance is not likely to be true and the regression is not a good one. On the other hand, a horizontal-band pattern suggests that the variance of the residuals is constant.

Can be used for outliers.

The Residual vs. Order of the Data plot can be used to check for drift of the variance during the experimental process, when data are time-ordered. If the residuals are randomly distributed around zero, it means that there is no drift in the process.

If the data being analyzed is time series data (data recorded sequentially), the Residual vs. Order
of the Data plot will reflect the correlation between the error term and time. Fluctuating patterns
around zero will indicate that the error term is dependent.

SageMaker Performance Visualization


It is common practice to review the residuals for regression problems. A residual for an
observation in the evaluation data is the difference between the true target and the predicted
target. Residuals represent the portion of the target that the model is unable to predict. A
positive residual indicates that the model is underestimating the target (the actual target is larger
than the predicted target). A negative residual indicates an overestimation (the actual target is
smaller than the predicted target). The histogram of the residuals on the evaluation data when
distributed in a bell shape and centered at zero indicates that the model makes mistakes in a
random manner and does not systematically over or under predict any particular range of target
values. If the residuals do not form a zero-centered bell shape, there is some structure in the
model's prediction error. Adding more variables to the model might help the model capture the
pattern that is not captured by the current model. The following illustration shows residuals that
are not centered on zero.

2.1.14 Shuffling
 Many algorithms benefit from shuffling your data.
 Otherwise, they may learn from residual signals in the training data resulting from the
order in which they were collected.

2.2 Analyze and visualize data for ML


2.2.1 Amazon Athena
 Serverless way of running interactive queries against your S3 data lake (e.g., log files).
- No Clusters
- No Data warehouse
 No need to load data i.e. No ETL.
 Point to data in S3, define the schema and start query. (Direct query from S3)
 Supported formats CSV, JSON, ORC, Parquet and Avro.
 Can be used for structured, unstructured and semi structured data.
 Integration with Jupyter, Zeppelin and R Studio notebook.
 Integration with Quick Sight.
 Integration with JDBC and ODBC with visual tools.
 Pay as you go: $5 per TB of data scanned by queries, so compressing, partitioning, and columnar formats are a good idea to decrease cost.
 Saves a lot of money when using a columnar format such as ORC or Parquet, because only the needed columns are scanned.

Standard SQL
 Uses Presto with ANSI SQL support.
 Fast, ad-hoc queries
 Executes queries in parallel
 No provisioning extra resources for complex queries
 Scales automatically
 Works with standard data formats i.e. CSV, Apache Weblogs, JSON, Parquet and ORC.
 Handles complex queries i.e. Large Joins, Window functions and Arrays.
 The results of all Athena queries are stored in an S3 bucket named “aws-athena-query-results-<account_id>”, organized by the year, month, and day the query ran, the hexadecimal Athena query ID, and the region in which the query ran.
 Can use DDL for creating and deleting external tables.
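A hedged boto3 sketch of running an Athena query (the database, table, and results bucket are hypothetical placeholders):

import time
import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM web_access GROUP BY status",
    QueryExecutionContext={"Database": "logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = query["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]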

2.2.2 Amazon Quick Sight


 Business analytics service
 Serverless
 Data sources: Redshift, RDS, Athena, EC2-hosted databases, and files (S3 or on-premises) in Excel, CSV, TSV, and common log formats.
 Limited data preparation: only changing field names and data types and adding some calculated columns.
 Send report snapshots directly to your users’ inboxes.

SPICE
 Each QuickSight dataset is imported into SPICE (up to 10 GB per dataset).
 SPICE is a super-fast, parallel, in-memory calculation engine.
 SPICE uses columnar storage, in-memory processing, and machine code generation.
 SPICE accelerating interactive queries on large data sets.
 High available, durable and scalable.

Quick Sight Machine Learning


 Anomaly detection: uses the Random Cut Forest algorithm to analyze data points and detect outliers.
 Forecast: also uses the Random Cut Forest algorithm to detect seasonality and trends, enabling forecasting.
 Auto Narratives: Writing the story of data in plain language.

Pricing
 Annual or monthly
 Standard or enterprise
 SPICE capacity can be extended beyond 10 GB

Embedding Dashboards in your Application


QuickSight allows you to seamlessly integrate interactive dashboards and analytics into your own
applications:
 Enhance your applications with rich analytics and dashboards
 Easy maintenance, no servers to manage
 Fast! No Custom development or domain expertise needed
 Leverage new features as we add them
 Utilizes Pay-per-Session Pricing

3. Modeling
3.1 Frame business problems as ML problems
3.1.1 Supervised Machine Learning
3.1.1.1 Regression
Logistic Regression
Some regression algorithms can be used for classification as well, Logistic Regression (also called
Logit Regression) is commonly used to estimate the probability that an instance belongs to a
particular class.
If the estimated probability is greater than 50%, then the model predicts that the instance
belongs to that class (called the positive class, labeled “1”), or else it predicts that it does not (i.e.,
it belongs to the negative class, labeled “0”). This makes it a binary classifier.
Estimating Probabilities
Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the
input features (plus a bias term), but instead of outputting the result directly like the Linear
Regression model does, it outputs the logistic of this result.

The logistic—noted σ (·)—is a sigmoid function (i.e., S-shaped) that outputs a number between 0
and 1.

Figure 25: Logistic Function

Once the Logistic Regression model has estimated the probability p = hθ (x) that an instance x
belongs to the positive class, it can make its prediction ŷ easily.
Notice that σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a Logistic Regression model predicts 1 if xᵀθ is positive, and 0 if it is negative.
Training and Cost Function
The objective of training is to set the parameter vector θ so that the model estimates high
probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0).

Logistic Regression models can be regularized using ℓ1 or ℓ2 penalties.
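A hedged scikit-learn sketch of these ideas, using the Iris data as a binary problem (the 0.5 threshold behavior matches the description above):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
y_virginica = (y == 2).astype(int)  # binary target: Iris-Virginica or not

# penalty="l2" is ridge-style regularization; smaller C = stronger penalty
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X, y_virginica)

proba = clf.predict_proba(X[:3])  # logistic output per class
pred = clf.predict(X[:3])         # 1 where estimated probability >= 0.5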

Decision Trees
Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks, and even multioutput tasks.

Figure 26: A Classification Decision Tree

One of the many qualities of Decision Trees is that they require very little data preparation. In
particular, they don’t require feature scaling or centering at all.
A node’s samples attribute counts how many training instances it applies to:
For example, 100 training instances have a petal length greater than 2.45 cm (depth 1,
right), among which 54 have a petal width smaller than 1.75 cm (depth 2, left).

A node’s value attribute tells you how many training instances of each class this node applies to:
For example, the bottom-right node applies to 0 Iris-Setosa, 1 Iris- Versicolor, and 45 Iris-
Virginica.
A node’s gini attribute measures its impurity: a node is “pure” (gini=0) if all training instances it
applies to belong to the same class.
For example, since the depth-1 left node applies only to Iris-Setosa training instances, it is
pure and its gini score is 0.
The depth-2 left node has a gini score equal to 1 – (0/54)² – (49/54)² – (5/54)² ≈ 0.168.

Figure 27: Decision Tree Boundary

The thick vertical line represents the decision boundary of the root node (depth 0): petal length
= 2.45 cm. Since the left area is pure (only Iris-Setosa), it cannot be split any further. However,
the right area is impure, so the depth-1 right node splits it at petal width = 1.75 cm (represented
by the dashed line). Since max-depth was set to 2, the Decision Tree stops right there. However,
if you set max-depth to 3, then the two depth-2 nodes would each add another decision
boundary (represented by the dotted lines).
A Decision Tree can also estimate the probability that an instance belongs to a particular class k:
first it traverses the tree to find the leaf node for this instance, and then it returns the ratio of
training instances of class k in this node. For example, suppose you have found a flower whose
petals are 5 cm long and 1.5 cm wide. The corresponding leaf node is the depth-2 left node, so
the Decision Tree should output the following probabilities: 0% for Iris-Setosa (0/54), 90.7% for
Iris-Versicolor (49/54), and 9.3% for Iris-Virginica (5/54). And of course if you ask it to predict the
class, it should output Iris-Versicolor (class 1) since it has the highest probability.
The CART Training Algorithm
Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train Decision Trees
(also called “growing” trees). The idea is really quite simple: the algorithm first splits the training

set in two subsets using a single feature k and a threshold tk (e.g., “petal length ≤ 2.45 cm”). How
does it choose k and tk? It searches for the pair (k, tk) that produces the purest subsets (weighted
by their size).
The cost function that the algorithm tries to minimize is given by:
J(k, t_k) = (m_left / m) · G_left + (m_right / m) · G_right
where G_left/right measures the impurity of the left/right subset and m_left/right is the number of instances in the left/right subset.
Once it has successfully split the training set in two, it splits the subsets using the same logic, then
the sub-subsets and so on, recursively. It stops recursing once it reaches the maximum depth (defined by the max_depth hyperparameter), or if it cannot find a split that will reduce impurity.
Making predictions requires traversing the Decision Tree from the root to a leaf. Decision Trees
are generally approximately balanced. Since each node only requires checking the value of one
feature, the overall prediction complexity is independent of the number of features. So
predictions are very fast, even when dealing with large training sets.
Gini Impurity or Entropy
By default, the Gini impurity measure is used, but you can select the entropy impurity measure
instead by setting the criterion hyperparameter to "entropy".

So should you use Gini impurity or entropy? The truth is, most of the time it does not make a big
difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good
default. However, when they differ, Gini impurity tends to isolate the most frequent class in its
own branch of the tree, while entropy tends to produce slightly more balanced trees.

Regularization Hyperparameters
This is controlled by the max_depth hyperparameter (the default value is None, which means unlimited). Reducing max_depth will regularize the model and thus reduce the risk of overfitting.
min_samples_split: The minimum number of samples a node must have before it can be split.
min_samples_leaf: The minimum number of samples a leaf node must have.
min_weight_fraction_leaf: Same as min_samples_leaf but expressed as a fraction of the total
number of weighted instances.
max_leaf_nodes: Maximum number of leaf nodes.
max_features: Maximum number of features that are evaluated for splitting at each node.
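A brief sketch of a regularized tree on the same iris petal features used in the figures (the hyperparameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_petals = X[:, 2:]  # petal length and width, as in the figures

# max_depth and min_samples_leaf regularize the tree against overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=2,
                              min_samples_leaf=4, random_state=42)
tree.fit(X_petals, y)

# Class probabilities = ratio of training classes in the matching leaf
print(tree.predict_proba([[5.0, 1.5]]))
print(tree.predict([[5.0, 1.5]]))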

Regression

Figure 28: A Decision Tree for regression

This tree looks very similar to the classification tree you built earlier. The main difference is that
instead of predicting a class in each node, it predicts a value. For example, suppose you want to
make a prediction for a new instance with x1 = 0.6. You traverse the tree starting at the root, and
you eventually reach the leaf node that predicts value=0.1106. This prediction is simply the
average target value of the 110 training instances associated to this leaf node. This prediction
results in a Mean Squared Error (MSE) equal to 0.0151 over these 110 instances.
The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE.

Instability

Hopefully by now you are convinced that Decision Trees have a lot going for them: they are
simple to understand and interpret, easy to use, versatile, and powerful. However they do have a
few limitations. First, as you may have noticed, Decision Trees love orthogonal decision
boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set
rotation.
3.1.1.2 Classification
Performance measure for the classification: evaluating a classifier is often significantly trickier
than evaluating a regressor.

Accuracy
Accuracy: The percent (ratio) of cases classified correctly:
Accuracy = (TP + TN) / (TP + FP + FN + TN)

Accuracy ranges from 0 (bad) to 1 (good)


High Accuracy Paradox: Accuracy is misleading when dealing with imbalanced datasets: with few true-state positives (the ‘rare’ class) and many true-state negatives (the ‘dominant’ class), accuracy stays high even when there are few True Positives.

Confusion Matrix
It is used for evaluating a classifier.
To compute the confusion matrix, you first need to have a set of predictions, so they can be
compared to the actual targets.
Each row in a confusion matrix represents an actual class, while each column represents a
predicted class.

NOTE: It is not always the case that the actual class is represented as rows and the predicted class as columns; some tools swap them. TAKE CARE.
A perfect classifier would have only true positives and true negatives, so its confusion matrix
would have nonzero values only on its main diagonal i.e. no false positives and no false negatives.

Figure 29: An illustrated confusion matrix

Precision
Precision measures how many of the instances flagged as positive are actually positive; it penalizes retrieving wrong items. (“I retrieved wrong things.”)
Precision = TP / (TP + FP)
Such that:
 TP is true positive (predicted as positive and they are actually positive).
 FP is false positive (predicted as positive but they are actually negative).

NOTE: The more negative instances you classify as positive, the more FP increases and the lower the overall precision becomes. (The more wrong items I retrieve, the lower the precision.)

Good choice of metric when you care a lot about false positives i.e. medical screening, drug
testing.

Recall
Recall is also called Sensitivity, True Positive Rate (TPR), or Completeness.
Recall measures how much of the positive class was actually found; it penalizes missing items. (“I didn’t retrieve everything.”)
Recall = TP / (TP + FN)

Such that:
 TP is true positive (predicted as positive and they are actually positive).
 FN is false negative (predicted as negative but they are actually positive).

NOTE: The more positives you fail to recognize (i.e., the more FN increases), the lower the overall recall becomes. (The more items I miss, the lower the recall.)

Good choice of metric when you care a lot about false negatives i.e. fraud detection.

F1
It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall:
F1 = 2 × (precision × recall) / (precision + recall)
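A quick scikit-learn sketch computing these metrics on a toy prediction vector:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of the two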

Precision/Recall Tradeoff
The F1 score favors classifiers that have similar precision and recall. This is not always what you
want: in some contexts you mostly care about precision, and in other con- texts you really care
about recall. For example, if you trained a classifier to detect videos that are safe for kids, you
would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe
ones (high precision), rather than a classifier that has a much higher recall but lets a few really
bad videos show up in your product (in such cases, you may even want to add a human pipeline
to check the classifier’s video selection). On the other hand, suppose you train a classifier to
detect shoplifters on surveillance images: it is probably fine if your classifier has only 30%
precision as long as it has 99% recall (sure, the security guards will get a few false alerts, but
almost all shoplifters will get caught).

Unfortunately, you can’t have it both ways: increasing precision reduces recall, and vice versa.
This is called the precision/recall tradeoff.

Figure 30: Decision threshold and precision/recall tradeoff

Figure 31: Precision and recall versus the decision threshold

So let’s suppose you decide to aim for 90% precision. You look up the first plot and find that you
need to use a threshold of about 8,000. To be more precise you can search for the lowest
threshold that gives you at least 90% precision.

Another way to select a good precision/recall tradeoff is to plot precision directly against recall.

Figure 32: Precision versus recall

You can see that precision really starts to fall sharply around 80% recall. You will probably want to
select a precision/recall tradeoff just before that drop—for example, at around 60% recall. But of
course the choice depends on your project.

ROC Curve
The receiver operating characteristic (ROC) curve is another common tool used with binary
classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus
recall, the ROC curve plots the true positive rate (another name for recall) against the false
positive rate (FPR).
FPR (False Positive Rate) = FP / (FP + TN) = 1 – TNR (True Negative Rate, also called specificity)
The FPR is the ratio of negative instances that are incorrectly classified as positive.
The TNR is the ratio of negative instances that are correctly classified as negative:
TNR = TN / (TN + FP)

To plot the ROC curve, you first need to compute the TPR and FPR for various threshold values.

Figure 33: ROC curve

Once again there is a tradeoff: the higher the recall (TPR), the more false positives (FPR) the
classifier produces. The dotted line represents the ROC curve of a purely random classifier; a
good classifier stays as far away from that line as possible (toward the top-left corner).
One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5.
Commonly used metric for comparing classifiers.
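A hedged sketch of computing ROC points and AUC with scikit-learn (synthetic data for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # one point per threshold
print("ROC AUC:", roc_auc_score(y_te, scores))  # 1.0 perfect, 0.5 random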

PR or ROC
As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you
care more about the false positives than the false negatives, and the ROC curve otherwise.

Multiclass confusion Matrix

Figure 34: Multiclass confusion Matrix with heatmap

 This is used for multiple classification
 The color-density legend on the right maps each color to a count (number of times); the color corresponds to the actual values.
 Predicted labels are on the x-axis and actual labels on the y-axis. If 5 was predicted and the actual label was 5, the cell is very dark blue; looking it up in the legend shows this occurred 89 times.
 If 1 was predicted but the actual label was 8, the cell is a light color that the legend maps to 20 times.
 You would expect to see dark colors along the diagonal, representing good accuracy.

AWS Confusion Matrix

Figure 35: AWS Confusion Matrix

 Number of correct and incorrect predictions per class (infer from colors of each cell)
 F1 scores per class
 True class frequencies: the “total” column
 Predicted class frequencies: the “total” row

Bayes
 A Bayesian network is a graphical model that represents a set of variables and their
conditional dependencies.
For example, disease and symptoms are connected using a network diagram. All symptoms
connected to a disease are used to calculate the probability of the existence of the disease.
 Naive Bayes classifier is a technique to assign class labels to the samples from the available
set of labels. This method assumes each feature’s value as independent and will not
consider any correlation or relationship between the features.

3.1.1.3 Evaluate Model

RMSE gives an idea of how much error the system typically makes in its predictions, with a higher
weight for large errors.
Even though the RMSE is generally the preferred performance measure for regression tasks, in
some contexts you may prefer to use another function. For example, suppose that there are
many outlier districts. In that case, you may consider using the Mean Absolute Error.
RMSE tells you the magnitude of the error but not its sign. To find out whether the model overestimates or underestimates, use residual plots.
A useful baseline is the RMSE of a hypothetical regression model that would always predict the mean of the target as the answer. For example, if you were predicting the age of a house buyer and the mean age for the observations in your training data was 35, the baseline model would always predict 35. You would compare your ML model against this baseline to validate that it is better than a model that predicts a constant answer.

R squared is another commonly used metric with linear regression problems. R squared explains
the fraction of variance accounted for by the model. It’s like a percentage, reporting a number
from 0 to 1. When R squared is close to 1 it usually indicates that a lot of the variabilities in the
data can be explained by the model itself.
R squared will always increase when more explanatory variables are added to the model, so the model with the highest R squared may not be the best one. To counter this potential issue, there is another metric called the Adjusted R squared. The Adjusted R squared accounts for the effect of the added variables and only increases when the added variables have significant effects on the prediction. It adjusts the final value based on the number of features and the number of data points in your dataset.
A recommendation, therefore, is to look at both R squared and Adjusted R squared. This will
ensure that your model is performing well but that there’s also not too much overfitting.
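A small sketch of both metrics (the toy values and k = 2 explanatory variables are illustrative):

from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.5, 9.0, 11.0]
y_pred = [2.8, 5.3, 7.0, 9.4, 10.6]

r2 = r2_score(y_true, y_pred)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
n, k = len(y_true), 2  # n observations, k explanatory variables
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adj_r2)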

NOTE: When two or more features are highly correlated (multicollinearity) under least squares, the data matrix X has less than full rank, so the moment matrix XᵀX cannot be inverted and the ordinary least squares estimator does not exist.

3.1.1.4 Overfitting and Underfitting


Poor performance on the training data could be because the model is too simple (the input
features are not expressive enough) to describe the target well. Performance can be improved by
increasing model flexibility. To increase model flexibility, try the following:
- Add new domain-specific features and more feature Cartesian products, and change
the types of feature processing used (e.g., increasing n-grams size)
- Decrease the amount of regularization used
If your model is overfitting the training data, it makes sense to take actions that reduce model
flexibility. To reduce model flexibility, try the following:
- Feature selection: consider using fewer feature combinations, decrease n-grams size,
and decrease the number of numeric attribute bins.
- Increase the amount of regularization used.
Accuracy on training and test data could be poor because the learning algorithm did not have
enough data to learn from. You could improve performance by doing the following:
- Increase the amount of training data examples.
- Increase the number of passes on the existing training data.

3.1.1.5 Bias/Variance Tradeoff


An important theoretical result of statistics and Machine Learning is the fact that a model’s
generalization error can be expressed as the sum of three very different errors:
Bias
This part of the generalization error is due to wrong assumptions, such as assuming that
the data is linear when it is actually quadratic. A high-bias model is most likely to underfit
the training data.

Variance
This part is due to the model’s excessive sensitivity to small variations in the training data.
A model with many degrees of freedom (such as a high-degree polynomial model) is likely
to have high variance, and thus to overfit the training data.

Irreducible error
This part is due to the noisiness of the data itself. The only way to reduce this part of the
error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect
and remove outliers).
Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely,
reducing a model’s complexity increases its bias and reduces its variance. This is why it is called a
tradeoff.

Figure 36: Bias/Variance

3.1.1.6 Regularization
A good way to reduce overfitting is to regularize the model (i.e., to constrain it): the fewer
degrees of freedom it has, the harder it will be for it to overfit the data. For example, a simple
way to regularize a polynomial model is to reduce the number of polynomial degrees.
For a linear model, regularization is typically achieved by constraining the weights of the model.
Ridge Regression
Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear Regression: a regularization term equal to α Σᵢ₌₁ⁿ θᵢ² is added to the cost function.

This forces the learning algorithm to not only fit the data but also keep the model weights as
small as possible. Note that the regularization term should only be added to the cost function
during training. Once the model is trained, you want to evaluate the model’s performance using
the un-regularized performance measure.
The hyperparameter α controls how much you want to regularize the model. If α = 0 then Ridge
Regression is just Linear Regression. If α is very large, then all weights end up very close to zero
and the result is a flat line going through the data’s mean.

Lasso Regression
Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso Regression) is
another regularized version of Linear Regression.

It adds a regularization term to the cost function, but it uses the ℓ1 norm of the weight vector
instead of half the square of the ℓ2 norm.
An important characteristic of Lasso Regression is that it tends to completely eliminate the
weights of the least important features (i.e., set them to zero).
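A hedged scikit-learn sketch contrasting the two (synthetic data; the alpha values are illustrative):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))  # only the first 2 features matter
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# alpha plays the role of the regularization strength α above
ridge = Ridge(alpha=1.0).fit(X, y)  # shrinks all weights toward zero
lasso = Lasso(alpha=0.1).fit(X, y)  # zeros out unimportant weights entirely

print(ridge.coef_)
print(lasso.coef_)  # coefficients of irrelevant features become 0.0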

3.1.1.7 Bagging and Boosting


Suppose you ask a complex question to thousands of random people, then aggregate their
answers. In many cases you will find that this aggregated answer is better than an expert’s
answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of a
group of predictors (such as classifiers or regressors), you will often get better predictions than
with the best individual predictor. A group of predictors is called an ensemble; thus, this
technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble
method.
For example, you can train a group of Decision Tree classifiers, each on a different random subset
of the training set. To make predictions, you just obtain the predictions of all individual trees,
then predict the class that gets the most votes.

Figure 37: Training diverse classifiers

Hard Classifier
A very simple way to create an even better classifier is to aggregate the predictions of each
classifier and predict the class that gets the most votes. This majority-vote classifier is called a
hard voting classifier.

Soft Classifier
If all classifiers are able to estimate class probabilities, then you can tell Scikit-Learn to predict the
class with the highest class probability, averaged over all the individual classifiers. This is called
soft voting.

NOTE: SVC does not compute class probabilities by default, so you need to set its probability hyperparameter to True.
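A minimal scikit-learn sketch of hard/soft voting (synthetic data; note the SVC probability flag from the note above):

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True)),  # required for soft voting
    ],
    voting="soft",  # average class probabilities; "hard" = majority vote
)
voting.fit(X, y)
print(voting.predict(X[:5]))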

Bagging and Pasting


We use the same training algorithm for every predictor, but to train them on different random
subsets of the training set. When sampling is performed with replacement, this method is called
bagging (short for bootstrap aggregating). When sampling is performed without replacement, it
is called pasting.
In other words, both bagging and pasting allow training instances to be sampled several times
across multiple predictors, but only bagging allows training instances to be sampled several times
for the same predictor.
Once all predictors are trained, the ensemble can make a prediction for a new instance by simply
aggregating the predictions of all predictors. The aggregation function is typically the statistical

mode (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or the
average for regression.
Each individual predictor has a higher bias than if it were trained on the original training set, but
aggregation reduces both bias and variance.
Bagging can be parallelized across different CPUs or cores, since each predictor is trained independently on a different sample of the data.

Out-of-Bag Evaluation
With bagging, some instances may be sampled several times for any given predictor, while others
may not be sampled at all. By default a Bagging Classifier samples m training instances with
replacement (bootstrap=True), where m is the size of the training set. This means that only about
63% of the training instances are sampled on average for each predictor. The remaining 37% of
the training instances that are not sampled are called out-of-bag (oob) instances.

Note: They are not the same 37% for all predictors.

Boosting
Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine
several weak learners into a strong learner. The general idea of most boosting methods is to train
predictors sequentially, each trying to correct its predecessor.
The most popular boosting methods are AdaBoost (short for Adaptive Boosting) and Gradient Boosting.
Boosting can’t be parallelized, as it trains predictors sequentially.
AdaBoosting
One way for a new predictor to correct its predecessor is to pay a bit more attention to the
training instances that the predecessor underfitted. This results in new predictors focusing more
and more on the hard cases.
For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is
trained and used to make predictions on the training set. The relative weight of misclassified
training instances is then increased. A second classifier is trained using the updated weights and
again it makes predictions on the training set, weights are updated, and so on.
Gradient Boosting
Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting
its predecessor. However, instead of tweaking the instance weights at every iteration like
AdaBoost does, this method tries to fit the new predictor to the residual errors made by the
previous predictor.
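A brief Scikit-Learn sketch of both methods (illustrative hyperparameters):

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingRegressor
from sklearn.tree import DecisionTreeClassifier

# AdaBoost: each new tree focuses on instances its predecessor got wrong
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # a weak learner (a "stump")
    n_estimators=200,
    learning_rate=0.5,
)

# Gradient Boosting: each new tree fits the previous trees' residual errors
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=100,
                                 learning_rate=0.1)
```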

Summary
 XGBoost is the latest hotness
 Boosting generally yields better accuracy
 Bagging avoids overfitting
 Bagging is easier to parallelize
 Bagging reduces both bias and variance

Random Forest
Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or
sometimes pasting), typically with max_samples set to the size of the training set.
It is a combination of multiple trees: train a decision tree on each data sample (the data is divided
into samples using random techniques), then combine the end results of the trees by voting.
With a few exceptions, a Random Forest Classifier has all the hyperparameters of a Decision Tree
(to control how trees are grown), plus all the hyperparameters of a Bagging to control the
ensemble itself.
The Random Forest algorithm introduces extra randomness when growing trees; instead of
searching for the very best feature when splitting a node, it searches for the best feature among
a random subset of features. This results in a greater tree diversity, which (once again) trades a
higher bias for a lower variance, generally yielding an overall better model.
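For example, a minimal Scikit-Learn sketch (settings are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(
    n_estimators=500,   # number of trees
    max_leaf_nodes=16,  # a Decision Tree hyperparameter controlling growth
    n_jobs=-1,          # train the trees in parallel
)
# rnd_clf.fit(X_train, y_train)
# rnd_clf.feature_importances_ reports each feature's relative importance
```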

3.1.1.8 Cross Validation


K-Folds
If we decide to go with a larger K, we train the model more times, effectively using all the training
data every time. So a larger K means more training time, and more variation in the test error for
every subset of the dataset, because the test dataset keeps getting smaller. On the other hand,
the data used for training will be larger for a large K, so the bias will be reduced. For a smaller K,
we use smaller chunks of the data and train the model fewer times, so smaller Ks are more
biased, because we train on smaller chunks of the original training dataset. Biased in this sense
means there is a systematic difference between the true model and the estimated model. Like
variance, we'll talk more about bias in a following module.
Typically, when training machine learning algorithms, we start with a number between five and
ten folds to see the performance. Those numbers can then be changed based on your specific
business problem and needs.

There may be some variations of K-Fold cross-validation, for example, the Leave-One-Out cross-
validation. In the Leave-One-Out cross-validation, the K is equal to N. Every time we leave one
data point out for testing, we are using the rest in the training data. This is usually used for very
small datasets where every data point is very valuable.
There is also stratified K-Fold cross-validation, which is often used when there are seasonalities or
small-proportion subgroups in the dataset. Stratified K-Fold cross-validation ensures that each
fold contains roughly the same proportions of the data's subgroups. For instance, while splitting
the data you might want to ensure that there is an equal representation of a certain target
variable among the different folds.

3.1.1.9 Train Model


Choosing best hyperparameter by Grid Search, Random Search and Bayesian Search.
Bayesian search treats hyperparameter tuning like a regression problem. Given a set of input
features (the hyperparameters), hyperparameter tuning optimizes a model for the metric that
you choose. To solve a regression problem, hyperparameter tuning makes guesses about which
hyperparameter combinations are likely to get the best results, and runs training jobs to test
these values. After testing the first set of hyperparameter values, hyperparameter tuning uses
regression to choose the next set of hyperparameter values to test. In this example, the Bayesian
search runs initial training jobs to create the predicted curve using the objective function.

NOTE: SageMaker has automated hyperparameter tuning, which uses methods like gradient
descent, Bayesian optimization, and evolutionary algorithms to conduct a guided search for the
best hyperparameter settings.
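A hedged sketch of SageMaker automatic model tuning with the Python SDK; the estimator, metric name, and ranges below are illustrative assumptions (here, an XGBoost-style training job):

```python
from sagemaker.tuner import (ContinuousParameter, HyperparameterTuner,
                             IntegerParameter)

# `estimator` is assumed to be an already-configured SageMaker Estimator
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",  # metric the training job emits
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",   # guided search, as described above
    max_jobs=20,
    max_parallel_jobs=1,   # one job at a time, per the best practices below
)
# tuner.fit({"train": train_input, "validation": validation_input})
```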

Tuning best practice


 Don’t adjust every hyperparameter
 Limit your range of values to what’s most effective
 Run one training job at a time rather than in parallel
 In distributed training jobs, make sure the objective metric you want is the one reported
back.
 With SageMaker, convert linear-scaled hyperparameters to log-scaled whenever possible.
 Initially, it assumes that hyperparameters are linear-scaled. If they should be log-scaled, it
might take some time for Amazon SageMaker to discover that. If you know that a
hyperparameter should be log-scaled and can convert it yourself, doing so could improve
hyperparameter optimization.


3.1.2 Unsupervised Machine Learning


Unsupervised learning tasks and algorithms:
 Clustering: the goal is to group similar instances together into clusters. This is a great tool
for data analysis, customer segmentation, recommender systems, search engines, image
segmentation, semi-supervised learning, dimensionality reduction, and more.
 Anomaly detection: the objective is to learn what “normal” data looks like, and use this to
detect abnormal instances, such as defective items on a production line or a new trend in a
time series.
 Density estimation: this is the task of estimating the probability density function (PDF) of
the random process that generated the dataset. This is commonly used for anomaly
detection: instances located in very low-density regions are likely to be anomalies. It is also
useful for data analysis and visualization.
3.1.2.1 Clustering
Clustering is used in a wide variety of applications, including:
 Customer segmentation:
You can cluster your customers based on their purchases, their activity on your website,
and so on. This is useful to understand who your customers are and what they need, so
you can adapt your products and marketing campaigns to each segment. For example, this
can be useful in recommender systems to suggest content that other users in the same
cluster enjoyed.

 Data analysis:
When analyzing a new dataset, it is often useful to first discover clusters of similar
instances, as it is often easier to analyze clusters separately.

 Dimensionality reduction:
Once a dataset has been clustered, it is usually possible to measure each instance’s affinity
with each cluster (affinity is any measure of how well an instance fits into a cluster). Each
instance’s feature vector x can then be replaced with the vector of its cluster affinities. If
there are k clusters, then this vector is k dimensional. This is typically much lower
dimensional than the original feature vector, but it can preserve enough information for
further processing.

 Anomaly detection (also called outlier detection):


Any instance that has a low affinity to all the clusters is likely to be an anomaly. For
example, if you have clustered the users of your website based on their behavior, you can
detect users with unusual behavior, such as an unusual number of requests per second,
and so on. Anomaly detection is particularly useful in detecting defects in manufacturing or for fraud detection.

 Semi-supervised learning:
If you only have a few labels, you could perform clustering and propagate the labels to all
the instances in the same cluster. This can greatly increase the amount of labels available
for a subsequent supervised learning algorithm, and thus improve its performance.

 Search engines:
For example, some search engines let you search for images that are similar to a reference
image. To build such a system, you would first apply a clustering algorithm to all the images
in your database: similar images would end up in the same cluster. Then when a user
provides a reference image, all you need to do is to find this image’s cluster using the
trained clustering model, and you can then simply return all the images from this cluster.

 Segment an image:
By clustering pixels according to their color, then replacing each pixel’s color with the
mean color of its cluster, it is possible to reduce the number of different colors in the
image considerably. This technique is used in many object detection and tracking systems,
as it makes it easier to detect the contour of each object.

3.1.2.2 Anomaly Detection


Data Stream Anomaly Detection

Figure 38: Anomaly Detection Architecture with Kinesis Analytics


 This is how the Random Cut Forest algorithm is used in Kinesis Analytics to detect anomalies.
 You can use the RANDOM_CUT_FOREST function in the SQL of Kinesis Analytics to detect
anomalies in the data, for example with 100 trees, subsamples of 100 elements, 1,000
elements of history, and a shingle size of 10.
 It is learning as it goes.
 Shingle size is simply a parameter controlling how many consecutive data points are
considered together during inference.

Anomaly Detection using SageMaker

Figure 39: Anomaly Detection using SageMaker

 Differ From Data Stream:


- Random Cut Forest works in batch mode.
- It is not learning as it goes; it only learns when it is trained or re-trained.
- Shingles must be done manually and prepared by code.
- The model output can be compared with labeled data (anomaly or not anomaly)
to measure the model's performance.
- If SageMaker detects that the anomaly score is more than 3 standard deviations
from the mean score, the data point is considered an anomaly.
 For Training:
- Historical data is collected either from redshift or S3.
- The data is optionally labeled if you would like to test the model performance before
production.
- Data is passed to SageMaker to train RCF model for anomaly detection.

- After training the model is deployed to an endpoint for inference.


 For Inference:
- AWS lambda is used to collect the raw data and pass it to Model Endpoint for
inference.
- If the model returns an anomaly, an alert should fire.
- After inference the data could be stored in redshift.
- The data and anomaly detection could be visualized using QuickSight.
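A hedged sketch of the training/inference flow above using the SageMaker Python SDK's built-in Random Cut Forest estimator; role, bucket, prefix, and train_data are assumed to already exist:

```python
from sagemaker import RandomCutForest

rcf = RandomCutForest(
    role=role,                     # assumed IAM role
    instance_count=1,
    instance_type="ml.m4.xlarge",
    num_samples_per_tree=512,
    num_trees=50,
    data_location=f"s3://{bucket}/{prefix}/train",   # assumed S3 locations
    output_path=f"s3://{bucket}/{prefix}/output",
)
# rcf.fit(rcf.record_set(train_data))   # train_data: 2D numpy array
# predictor = rcf.deploy(initial_instance_count=1,
#                        instance_type="ml.m4.xlarge")
# predictor.predict(new_data) returns an anomaly score per record, which the
# Lambda function above would compare against the 3-standard-deviation rule.
```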


3.1.3 Deep Learning


3.1.3.1 Perceptron
The Perceptron is one of the simplest ANN architectures, it is based on a slightly different
artificial neuron called a threshold logic unit (TLU), or sometimes a linear threshold unit (LTU):
the inputs and output are now numbers and each input connection is associated with a weight.
The TLU computes a weighted sum of its inputs
(z = w1x1 + w2x2 + ⋯ + wnxn = xᵀw), then applies a step function to that sum and outputs
the result:
hw(x) = step(z), where z = xᵀw.

Figure 40: Threshold logic unit

The most common step function used in Perceptrons is the Heaviside step function, and
sometimes the sign function.

Figure 41: Heaviside & Sign Step Functions

A single TLU can be used for simple linear binary classification. It computes a linear combination
of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the
negative class.
Training a TLU in this case means finding the right values for w0, w1, and w2.
When all the neurons in a layer are connected to every neuron in the previous layer (i.e., its input
neurons), it is called a fully connected layer or a dense layer.
Input neurons just output whatever input they are fed. Moreover, an extra bias feature is
generally added (x0 = 1).

Figure 42: Perceptron Diagram

It is possible to efficiently compute the outputs of a layer of artificial neurons for several
instances at once:
hw,b (X) = ϕ (XW + b)
 X represents the matrix of input features. It has one row per instance, one column per
feature.
 The weight matrix W contains all the connection weights except for the ones from the bias
neuron. It has one row per input neuron and one column per artificial neuron in the layer.
 The bias vector b contains all the connection weights between the bias neuron and the
artificial neurons. It has one bias term per artificial neuron.
 The function ϕ is called the activation function: when the artificial neurons are TLUs, it is a
step function
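A tiny NumPy sketch of the layer equation hw,b(X) = ϕ(XW + b) above, using the Heaviside step function as ϕ (all numbers are made up):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])            # 2 instances, 2 input features
W = np.array([[0.5, -1.0, 0.3],
              [0.2,  0.4, -0.6]])     # 2 input neurons x 3 TLUs
b = np.array([0.1, 0.0, -0.2])        # one bias term per TLU

def step(z):
    return (z >= 0).astype(int)       # Heaviside step function

outputs = step(X @ W + b)             # h(X) = phi(XW + b), shape (2, 3)
```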
3.1.3.2 Multi-Layer Perceptron and Backpropagation
An MLP is composed of one (pass-through) input layer, one or more layers of TLUs, called hidden
layers, and one final layer of TLUs called the output layer.


Figure 43: Multi-Layer Perceptron

The signal flows only in one direction (from the inputs to the outputs), so this architecture is an
example of a feedforward neural network (FNN).
Backpropagation training algorithm:
 It handles one mini-batch at a time (for example containing 32 instances each), and it goes
through the full training set multiple times.
 Each mini-batch is passed to the network’s input layer, which just sends it to the first
hidden layer. The algorithm then computes the output of all the neurons in this layer (for
every instance in the mini-batch). The result is passed on to the next layer, its output is
computed and passed to the next layer, and so on until we get the output of the last layer,
the output layer. This is the forward pass: it is exactly like making predictions, except all
intermediate results are preserved since they are needed for the backward pass.
 Next, the algorithm measures the network’s output error (i.e., it uses a loss function that
compares the desired output and the actual output of the network, and returns some
measure of the error).
 Then it computes how much each output connection contributed to the error. This is done
analytically by simply applying the chain rule (perhaps the most fundamental rule in
calculus), which makes this step fast and precise.
 The algorithm then measures how much of these error contributions came from each
connection in the layer below, again using the chain rule—and so on until the algorithm
reaches the input layer. As we explained earlier, this reverse pass efficiently measures the
error gradient across all the connection weights in the network by propagating the error
gradient backward through the network (hence the name of the algorithm).
 Finally, the algorithm performs a Gradient Descent step to tweak all the connection
weights in the network, using the error gradients it just computed.
This algorithm is so important, it’s worth summarizing it again: for each training instance the
backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes
through each layer in reverse to measure the error contribution from each connection (reverse
pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent
step).

NOTE: It is important to initialize all the hidden layers' connection weights
randomly, or else training will fail. For example, if you initialize all weights and
biases to zero, then all neurons in a given layer will be perfectly identical, and thus
backpropagation will affect them in exactly the same way.
3.1.3.3 Activation Functions
It is not efficient to use the step function: it contains only flat segments, so Gradient Descent
cannot work with it (the gradient is zero almost everywhere).
 Sigmoid/Logistic Function: σ(z) = 1 / (1 + exp(–z)) has a well-defined nonzero derivative
everywhere, allowing Gradient Descent to make some progress at every step.

 The hyperbolic tangent function tanh(z) = 2σ(2z) – 1: just like the logistic function it is S-
shaped, continuous, and differentiable, but its output value ranges from –1 to 1 (instead of
0 to 1 in the case of the logistic function), which tends to make each layer's output more or
less centered around 0 at the beginning of training. This often helps speed up convergence
and makes it preferable over the sigmoid/logistic function.

NOTE: Both the sigmoid/logistic and hyperbolic tangent functions suffer from
floating-point precision problems, especially when they go into saturation. Also,
both of them are computationally expensive: computing exponentials and
trigonometric functions is costly.
 The Rectified Linear Unit function: ReLU(z) = max(0, z) It is continuous but unfortunately
not differentiable at z = 0 (the slope changes abruptly, which can make Gradient Descent
bounce around), and its derivative is 0 for z < 0. However, in practice it works very well and
has the advantage of being fast to compute. Most importantly, the fact that it does not
have a maximum output value also helps reduce some issues during Gradient Descent.

 Swish
- From Google; performs very well
- Best suited to deep networks of more than 40 layers

Figure 44: Activation Functions and their derivatives

Other Activation Functions:


These functions are not used due to their drawbacks:
 Linear Functions
- Multiplying by and adding constants doesn't do anything useful.
- Use linear algorithms instead.
- Backpropagation doesn't work (the derivative is a constant).
- Rarely used; if used at all, it is just for tweaking data from the input layer.
 Binary Step Functions
- It is on/off
- Can’t handle multiple classification
- Vertical slopes don’t work with Calculus

Figure 45: Binary Step Function

 Maxout
- Outputs the max of the inputs
- ReLU is a special case of Maxout

- Doubles the parameters to be trained, which is often not practical

How to choose activation function?


For multiple classification, use Softmax.
For RNN, use Tanh.
For other:
- Start with ReLU
- If not that good try leaky ReLU
- If not that good use PReLU
- If not that good try Swish
Refer to section 3.1.3.5 for more information about Leaky ReLU and PReLU.

Output
You need one output neuron per output dimension.
 If you want to guarantee that the output will always be positive, then you can use the ReLU
activation function, or the Softplus activation function in the output layer.
 If you want to guarantee that the predictions will fall within a given range of values, then
you can use the logistic function or the hyperbolic tangent, and scale the labels to the
appropriate range: 0 to 1 for the logistic function, or –1 to 1 for the hyperbolic tangent.
Error
 The loss function to use during training is typically the mean squared error
 Mean absolute error if you have a lot of outliers in the training set
 Huber loss, which is a combination of both.
Hyperparameters values
Hyperparameter | Typical value
# input neurons | One per input feature (e.g., 28 x 28 = 784 for MNIST)
# hidden layers | Depends on the problem; typically 1 to 5
# neurons per hidden layer | Depends on the problem; typically 10 to 100
# output neurons | 1 per prediction dimension
Hidden activation | ReLU
Output activation | None, or ReLU/Softplus (if positive outputs), or Logistic/Tanh (if bounded outputs)
Loss function | MSE, or MAE/Huber (if outliers)

Classification:
 MLPs can also be used for classification tasks. For a binary classification problem, you just
need a single output neuron using the logistic activation function: the output will be a
number between 0 and 1, which you can interpret as the estimated probability of the
positive class. Obviously, the estimated probability of the negative class is equal to one
minus that number.

 MLPs can also easily handle multilabel binary classification tasks, For example, you could
have an email classification system that predicts whether each incoming email is ham or
spam and if the email is urgent or not. In this case, you would need two output neurons,
both using the logistic activation function: the first would output the probability that the
email is spam and the second would output the probability that it is urgent.

 If each instance can belong only to a single class, out of 3 or more possible classes (e.g.,
classes 0 through 9 for digit image classification), then you need to have one output
neuron per class, and you should use the Softmax activation function for the whole output
layer. The Softmax function will ensure that all the estimated probabilities are between 0
and 1 and that they add up to one (which is required if the classes are exclusive). This is
called multiclass classification.

Softmax
 Used in the final output layer of a multiple classification problem
 Basically converts outputs to probabilities of each classification.
 Used for multiclass classification not multilabel classification.


Figure 46: Softmax calculation

In a classification problem for classifying Iris plants, suppose the DNN produces the outputs
shown above in the figure. The problem is that some outputs are not restricted to the range 0 to
1, so they are hard to interpret. So we use Softmax, which calculates the probability of each class
using the following equation: e^z(Setosa) / (e^z(Setosa) + e^z(Versicolor) + e^z(Virginica)),
where z(class) is the raw output for that class.
For every class, we will have a probability ranging from 0 to 1, and the probabilities will sum to 1.

NOTE: These are not true probabilities; the numbers depend on the weights of the
DNN and will change if those weights change.

Cross Entropy
Now that you know how the model estimates probabilities and makes predictions, let’s take a
look at training. The objective is to have a model that estimates a high probability for the target
class (and consequently a low probability for the other classes).
Cross Entropy, penalizes the model when it estimates a low probability for a target class. Cross
entropy is frequently used to measure how well a set of estimated class probabilities match the
target classes.
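A tiny NumPy sketch of the softmax calculation and the cross-entropy loss described above (the raw outputs are made-up numbers):

```python
import numpy as np

z = np.array([1.2, 0.3, -0.8])        # raw DNN outputs for 3 classes
probs = np.exp(z) / np.exp(z).sum()   # softmax: each in (0, 1), sums to 1

y_true = np.array([1, 0, 0])          # one-hot target: the first class
loss = -np.sum(y_true * np.log(probs))  # cross entropy penalizes a low
                                        # probability for the target class
```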

3.1.3.4 Classification Hyperparameters


Hyperparameter | Binary classification | Multilabel classification | Multiclass classification
Input & hidden layers | Same as regression | Same as regression | Same as regression
# output neurons | 1 | 1 per label | 1 per class
Output layer activation | Logistic | Logistic | Softmax
Loss function | Cross-Entropy | Cross-Entropy | Cross-Entropy
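A hedged Keras sketch matching the table above; the layer sizes are illustrative:

```python
from tensorflow import keras

# Multiclass (e.g., 10 exclusive classes): Softmax output + cross entropy
model = keras.Sequential([
    keras.layers.Dense(100, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd", metrics=["accuracy"])

# Binary: Dense(1, activation="sigmoid") with loss="binary_crossentropy"
# Multilabel: Dense(n_labels, activation="sigmoid"), same binary loss
```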


Hyperparameters
 Number of hidden layers
 Number of neurons per hidden layer
 Other parameters
- Learning rate
A simple approach for tuning the learning rate is to start with a large value that
makes the training algorithm diverge, then divide this value by 3 and try again, and
repeat until the training algorithm stops diverging.

- Batch size
Batch size is the number of training samples used in each training iteration (not
per epoch).
It has a significant impact on your model's performance and the training time. In
general the optimal batch size will be lower than 32.
A small batch size ensures that each training iteration is very fast, and although a
large batch size will give a more precise estimate of the gradients, in practice this
does not matter much, since the optimization landscape is quite complex and the
direction of the true gradients does not point precisely in the direction of the
optimum.

Figure 47: Batch Size Local Minima

A smaller batch size can escape local minima easily, while a large batch size can
get stuck in a local minimum.
If we are working with shuffled data, a large batch size can behave inconsistently:
depending on the order of the data, the search will sometimes escape a local
minimum and sometimes not (inconsistent results).

- Activation function
ReLU activation function will be a good default for all hidden layers. For the output
layer, it really depends on your task.

- Number of training iterations


The number of training iterations does not actually need to be tweaked: just use
early stopping instead.
3.1.3.5 Vanishing/Exploding Gradients
Gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As
a result, the Gradient Descent update leaves the lower layer connection weights virtually
unchanged, and training never converges to a good solution.
This is very common with the sigmoid function, as the derivative of the sigmoid function is
between 0 and 0.25, when we apply the chain rule to the lower layer, we multiply with very small
numbers and the Gradients becomes very small. When we update the weigh value with the
gradient, the weight is slightly changed, so the algorithm will not trained.
Exploding is the same concept but the gradients becomes very large, so updates the weight with
large values thus didn’t take the weight to the correct value when apply gradient descent
(Oscillate over the smallest value).

Solutions

Weight Initialization
To solve vanishing/exploding gradient problem we need the variance of the outputs of each layer
to be equal to the variance of its inputs.
Initially we used to initialize weights by random numbers with mean 0 and standard deviation 1.
Let’s take an example to discuss the problem of using random numbers:
Suppose we have a 250 input neuron and the value for each neuron is 1, and the weight is
generated randomly with mean 0 and SD 1.


Figure 48: Random Numbers weights Initialization

So the var(z) = 250 and its SD = 15.811.


The SD is significantly larger than 1, so when the value of z is passed to an activation function
such as the sigmoid, the return value will almost always be 1, and similarly for negative values it
will return 0, so we reach the function's saturated regions.
Returning 1 may lead to exploding gradients, and similarly returning 0 will lead to vanishing
gradients.
So only small changes will be applied to the weights during training, and the training time will
significantly increase.
To solve this problem we apply Xavier initialization (also called Glorot initialization):
Var(w) = 1/n, where n is the number of weights connected to a given node from the previous
layer (it may vary from one layer to another).
Randomly generate weights with mean 0 and SD = 1 as usual, then multiply each weight by √(1/n).
When using ReLU as the activation function, use Var(w) = 2/n instead.
Alternatively, the variance can be calculated as:
Var(w) = 2/(nin + nout), where nin is the number of weights coming into the node and nout is the
number of weights going out of this node.
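In Keras, for example, Glorot initialization is the default and He initialization (the ReLU variant above) can be requested per layer (a minimal sketch):

```python
from tensorflow import keras

# He initialization (Var = 2/n) pairs well with ReLU; Glorot (Xavier)
# initialization is the Keras default for Dense layers.
layer = keras.layers.Dense(100, activation="relu",
                           kernel_initializer="he_normal")
```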

Non Saturation Activation Function


Choice of activation function affects the vanishing and exploding gradient.
 The sigmoid activation function is not a good choice, as it saturates at 0 and 1.
 The ReLU activation function is not perfect. It suffers from a problem known as dying
ReLU: during training, some neurons effectively die, meaning they stop outputting
anything other than 0 because their inputs are always negative.
 Leaky ReLU. This function is defined as LeakyReLUα(z) = max(αz, z)
The hyperparameter α defines how much the function “leaks”: it is the slope of the
function for z < 0, and is typically set to 0.01.
Setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01
(small leak).

Figure 49: Leaky ReLU

 Randomized leaky ReLU (RReLU), where α is picked randomly in a given range during
training, and it is fixed to an average value during testing.
 Parametric leaky ReLU (PReLU), where α is allowed to be learned during training.
It is more complex.
 Exponential linear unit (ELU) that outperformed all the ReLU variants in their
experiments: training time was reduced and the neural network performed better
on the test set.

The drawback of the ELU activation function is that it is slower to compute than
ReLU and its variants (due to the use of the exponential function), but during
training this is compensated by the faster convergence rate.
 Self-Normalizing Neural Networks (SELU) is just a scaled version of the ELU activation
function
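A minimal Keras sketch of these variants (layer sizes are illustrative; parameter names follow TF 2.x Keras):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),          # the "huge leak" noted above
    keras.layers.Dense(100, activation="elu"),  # ELU as a built-in activation
    keras.layers.Dense(100, activation="selu",
                       kernel_initializer="lecun_normal"),  # SELU pairing
])
```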


Batch Normalization
The problem comes when one of the weights becomes drastically larger than the other weights:
the output from its corresponding neuron will be extremely large, and this imbalance will be
cascaded through the neural network, causing instability.
To solve this problem we apply batch normalization. Batch normalization is applied per layer. It
normalizes the output from the activation functions before it is passed to the next layer.
1. Compute z = (x – m) / s, where x is the value to normalize, m is the mean, and s is the
standard deviation (both computed over the current mini-batch).
2. Multiply the result by an arbitrary parameter g (scale).
3. Then add an arbitrary parameter b (shift).
4. The final result is: z = ((x – m) / s) * g + b
5. The parameters g and b are learned during training, while m and s are estimated from
each mini-batch (a moving average is kept for use at inference time).
By this means the weights don't become imbalanced, and the training speed is greatly increased.
Benefits:
 The vanishing gradients problem was strongly reduced, to the point that they could
use saturating activation functions such as the tanh and even the logistic activation
function.
 The networks were also much less sensitive to the weight initialization.
 Ability to use much larger learning rates, significantly speeding up the learning
process.
 Batch Normalization also acts like a regularizer, reducing the need for other
regularization techniques.
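A minimal Keras sketch adding Batch Normalization after each hidden layer (sizes are illustrative):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.BatchNormalization(),  # normalize, then scale (g), shift (b)
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax"),
])
```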

Gradient Clipping (Explode Gradient)


Another popular technique to lessen the exploding gradients problem is to simply clip the
gradients during backpropagation so that they never exceed some threshold.
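In Keras this is a one-line optimizer option (a minimal sketch):

```python
from tensorflow import keras

# Clip every gradient component to [-1.0, 1.0]; clipnorm clips by norm instead
optimizer = keras.optimizers.SGD(clipvalue=1.0)
```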

Other Solutions
 Multilevel hierarchy
Break up the network into levels, each trained individually as its own sub-network.
 Long Short-Term Memory (LSTM)
For RNNs.
 Residual Networks
ResNet; acts like an ensemble of shorter networks.

Gradient Checking
 A debugging technique
 Numerically check the derivatives computed during training
 Useful for validating code of neural network training

3.1.3.6 Reusing pre-trained layers


Transfer Learning
It is generally not a good idea to train a very large DNN from scratch: instead, you should always
try to find an existing neural network that accomplishes a similar task to the one you are trying to
tackle then just reuse the lower layers of this network: this is called transfer learning.
The output layer of the original model should usually be replaced since it is most likely not useful
at all for the new task, and it may not even have the right number of outputs for the new task.
Similarly, the upper hidden layers of the original model are less likely to be as useful as the lower
layers, since the high-level features that are most useful for the new task may differ significantly
from the ones that were most useful for the original task. You want to find the right number of
layers to reuse.
Try freezing all the reused layers first (i.e., make their weights non-trainable, so gradient descent
won’t modify them), then train your model and see how it performs. Then try unfreezing one or
two of the top hidden layers to let backpropagation tweak them and see if performance
improves. The more training data you have, the more layers you can unfreeze. It is also useful to
reduce the learning rate when you unfreeze reused layers: this will avoid wrecking their
fine-tuned weights.
If you still cannot get good performance, and you have little training data, try dropping the top
hidden layer(s) and freeze all remaining hidden layers again. You can iterate until you find the
right number of layers to reuse.
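A hedged Keras sketch of this freeze-then-unfreeze workflow, reusing an ImageNet-trained ResNet-50 base; n_classes and the input shape are illustrative assumptions:

```python
from tensorflow import keras

n_classes = 5  # assumed number of classes in the new task

base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze all reused layers first

model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(n_classes, activation="softmax"),  # new output layer
])
# After initial training, set base.trainable = True (optionally only the top
# layers) and re-compile with a lower learning rate before fine-tuning.
```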


Unsupervised Pre-training
When you want to train a model for which you don’t have much labeled training data, and you
cannot find a model trained on a similar task.
You should gather plenty of unlabeled training data, you can try to train the layers one by one,
starting with the lowest layer and then going up, using an unsupervised feature detector
algorithm such as Restricted Boltzmann Machines or autoencoders.
Once all layers have been trained this way, you can add the output layer for your task, and fine-
tune the final network using supervised learning (i.e., with the labeled training examples). At this
point, you can unfreeze all the pre-trained layers, or just some of the upper ones.

Pre-training on an Auxiliary Task


If you do not have much labeled training data, one last option is to train a first neural network on
an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the
lower layers of that network for your actual task. The first neural network’s lower layers will learn
feature detectors that will likely be reusable by the second neural network.
For example, if you want to build a system to recognize faces, you may only have a few pictures
of each individual clearly not enough to train a good classifier. Gathering hundreds of pictures of
each person would not be practical. However, you could gather a lot of pictures of random
people on the web and train a first neural network to detect whether or not two different
pictures feature the same person. Such a network would learn good feature detectors for faces,
so reusing its lower layers would allow you to train a good face classifier using little training data.

3.1.3.7 Fast Optimizers


Training a very large deep neural network can be painfully slow. So far we have four ways to
speed up training:
 Applying a good initialization strategy for the connection weights
 Using a good activation function
 Using Batch Normalization
 Reusing parts of a pre-trained network.

Another huge speed boost comes from using a faster optimizer than the regular Gradient
Descent optimizer.


Momentum Optimization
Recall that Gradient Descent simply updates the weights θ by directly subtracting the gradient of
the cost function J(θ) with regards to the weights (∇θ J(θ)) multiplied by the learning rate η. The
equation is: θ ← θ – η∇θ J(θ). It does not care about what the earlier gradients were. If the local
gradient is tiny, it goes very slowly.
Momentum optimization cares a great deal about what previous gradients were: at each
iteration, it subtracts the local gradient from the momentum vector m (multiplied by the learning
rate η), and it updates the weights by simply adding this momentum vector. In other words, the
gradient is used for acceleration, not for speed. To simulate some sort of friction mechanism and
prevent the momentum from growing too large, the algorithm introduces a new hyperparameter
β, simply called the momentum, which must be set between 0 (high friction) and 1 (no friction). A
typical momentum value is 0.9.

Figure 50: Momentum Algorithm

You can easily verify that if the gradient remains constant, the terminal velocity (i.e., the
maximum size of the weight updates) is equal to that gradient multiplied by the learning rate η
multiplied by 1/(1 – β) (ignoring the sign). For example, if β = 0.9, then the terminal velocity is equal
to 10 times the gradient times the learning rate, so Momentum optimization ends up going 10
times faster than Gradient Descent! This allows Momentum optimization to escape from plateaus
much faster than Gradient Descent.

NOTE: In deep neural networks that don’t use Batch Normalization, the upper
layers will often end up having inputs with very different scales, so using
Momentum optimization helps a lot. It can also help roll past local optima.

Nesterov Accelerated Gradient


The idea of Nesterov Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to
measure the gradient of the cost function not at the local position but slightly ahead in the
direction of the momentum. The only difference from vanilla Momentum optimization is that the
gradient is measured at θ + βm rather than at θ.

Figure 51: Nesterov Accelerated Gradient Algorithm


AdaGrad
Consider the elongated bowl problem again: Gradient Descent starts by quickly going down the
steepest slope, then slowly goes down the bottom of the valley. It would be nice if the algorithm
could detect this early on and correct its direction to point a bit more toward the global
optimum. The AdaGrad algorithm achieves this by scaling down the gradient vector along the
steepest dimensions

Figure 52: AdaGrad Algorithm

RMSProp
Although AdaGrad slows down a bit too fast and ends up never converging to the global
optimum, the RMSProp algorithm fixes this by accumulating only the gradients from the most
recent iterations (as opposed to all the gradients since the beginning of training). It does so by
using exponential decay in the first step

Figure 53: RMSProp Algorithm

Adam and Nadam optimization


Adam which stands for adaptive moment estimation, combines the ideas of Momentum
optimization and RMSProp: just like Momentum optimization it keeps track of an exponentially
decaying average of past gradients, and just like RMSProp it keeps track of an exponentially
decaying average of past squared gradients.


Figure 54: Adam & Nadam Algorithm
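All of these optimizers are available in Keras (a minimal sketch; the hyperparameter values shown are the commonly cited defaults):

```python
from tensorflow import keras

momentum = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
nesterov = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9,
                                nesterov=True)
rmsprop  = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
adam     = keras.optimizers.Adam(learning_rate=0.001,
                                 beta_1=0.9, beta_2=0.999)
```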

3.1.3.8 Early Stop


A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop
training as soon as the validation error reaches a minimum. This is called early stopping.
As the epochs go by, the algorithm learns and its prediction error (RMSE) on the training set
naturally goes down, and so does its prediction error on the validation set. However, after a while
the validation error stops decreasing and actually starts to go back up. This indicates that the
model has started to overfit the training data.
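In Keras, early stopping is a callback (a minimal sketch):

```python
from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,                # stop after 10 epochs with no improvement
    restore_best_weights=True,  # roll back to the best weights seen
)
# model.fit(X_train, y_train, epochs=1000,
#           validation_data=(X_valid, y_valid), callbacks=[early_stopping])
```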

3.1.3.9 Learning Schedule


Finding a good learning rate can be tricky. If you set it way too high, training may actually diverge.
If you set it too low, training will eventually converge to the optimum, but it will take a very long
time. If you set it slightly too high, it will make progress very quickly at first, but it will end up
dancing around the optimum, never really settling down. If you have a limited computing budget,
you may have to interrupt training before it has converged properly, yielding a suboptimal
solution.
One approach is to start with a large learning rate, and divide it by 3 until the training algorithm
stops diverging. You will not be too far from the optimal learning rate, which will learn quickly
and converge to a good solution.

Power scheduling
Set the learning rate to a function of the iteration number t: η(t) = η0 / (1 + t/s)^c. The initial
learning rate η0, the power c (typically set to 1), and the steps s are hyperparameters. The
learning rate drops at each step; after s steps it is down to η0/2, after s more steps it is down to
η0/3, then η0/4, then η0/5, and so on. As you can see, this schedule first drops quickly, then more
and more slowly. Of course, this requires tuning η0, s (and possibly c).

Exponential scheduling
Set the learning rate to η(t) = η0 · 0.1^(t/s). The learning rate will gradually drop by a factor of 10
every s steps. While power scheduling reduces the learning rate more and more slowly,
exponential scheduling keeps slashing it by a factor of 10 every s steps.

Piecewise constant scheduling


Use a constant learning rate for a number of epochs (e.g., η0 = 0.1 for 5 epochs), then a smaller
learning rate for another number of epochs (e.g., η1 = 0.001 for 50 epochs), and so on. Although
this solution can work very well, it requires fiddling around to figure out the right sequence of
learning rates, and how long to use each of them.

Performance scheduling
Measure the validation error every N steps (just like for early stopping) and reduce the learning
rate by a factor of λ when the error stops dropping.
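A minimal Keras sketch of exponential and performance scheduling (the η0 and s values are illustrative):

```python
from tensorflow import keras

def exponential_decay(eta0, s):
    """eta(t) = eta0 * 0.1**(t/s), with t = epoch number."""
    def schedule(epoch):
        return eta0 * 0.1 ** (epoch / s)
    return schedule

exp_scheduler = keras.callbacks.LearningRateScheduler(
    exponential_decay(eta0=0.01, s=20))

# Performance scheduling: shrink the learning rate by `factor` when the
# validation loss stops improving for `patience` epochs
perf_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
# model.fit(..., callbacks=[exp_scheduler])  # or [perf_scheduler]
```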

3.1.3.10 Regularization
Constraining a model to make it simpler and reduce the risk of overfitting is called regularization.
The amount of regularization to apply during learning can be controlled by a hyperparameter. A
hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not
affected by the learning algorithm itself; it must be set prior to training and remains constant
during training. If you set the regularization hyperparameter to a very large value, you will get
an almost flat model (a slope close to zero); the learning algorithm will almost certainly not
overfit the training data, but it will be less likely to find a good solution. Tuning hyperparameters
is an important part of building a Machine Learning system.
We already implemented some of the best regularization techniques:
 Early stopping
 Batch Normalization was designed to solve the vanishing/exploding gradients problems,
but it also acts like a pretty good regularizer.
In this section we will present other popular regularization techniques for neural networks: ℓ1
and ℓ2 regularization, dropout and max-norm regularization.


ℓ1 and ℓ2 regularization
You can use ℓ1 and ℓ2 regularization to constrain a neural network’s connection weights (but
typically not its biases).
A regularization term is added to the loss as the weights are learned:
ℓ1 term: λ ∑ᵢ₌₁ᵏ |wᵢ| (the sum of the absolute values of the weights)
ℓ2 term: λ ∑ᵢ₌₁ᵏ wᵢ² (the sum of the squares of the weights)
Difference between ℓ1 and ℓ2:
ℓ1: sum of absolute weights
 Performs feature selection – entire features go to 0
 Computationally inefficient
 Sparse output, because it removes information from the data
ℓ2: sum of squared weights
 All features remain considered, just weighted
 Computationally efficient
 Dense output
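In Keras, the regularization term is attached per layer (a minimal sketch; the factor 0.01 is illustrative):

```python
from tensorflow import keras

layer = keras.layers.Dense(
    100, activation="relu",
    kernel_regularizer=keras.regularizers.l2(0.01),  # weights, not biases
)
# keras.regularizers.l1(0.01) and keras.regularizers.l1_l2(0.01, 0.01)
# apply the l1 penalty or both penalties instead.
```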

Simplify Model
Try to drop some neurons and/or layers.

Dropout
At every training step, every neuron (including the input neurons, but always excluding the
output neurons) has a probability p of being temporarily “dropped out,” meaning it will be
entirely ignored during this training step, but it may be active during the next step. The
hyperparameter p is called the dropout rate, and it is typically set to 50%. After training, neurons
don’t get dropped anymore.
Dropout is applied to the input and hidden layers, but never to the output layer.
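A minimal Keras sketch placing Dropout after the input and hidden layers but never after the output layer (the rates are illustrative):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dropout(rate=0.2),               # after the input layer
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dropout(rate=0.2),               # after a hidden layer
    keras.layers.Dense(10, activation="softmax"), # no dropout after output
])
```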

Monte-Carlo (MC) Dropout


 There is a profound connection between dropout networks (i.e., neural networks containing a
dropout layer before every weight layer) and approximate Bayesian inference, giving
dropout a solid mathematical justification.
 MC Dropout can boost the performance of any trained dropout model, without having to
retrain it or even modify it at all!
 MC Dropout provides a much better measure of the model's uncertainty.
 MC Dropout is simple to implement.
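A minimal sketch of MC Dropout at inference time; model and X_test are assumed to be a trained Keras model containing Dropout layers and a test set:

```python
import numpy as np

# 100 stochastic forward passes with dropout kept active (training=True)
y_probas = np.stack([model(X_test, training=True) for _ in range(100)])
y_proba = y_probas.mean(axis=0)  # averaged, better-calibrated predictions
y_std = y_probas.std(axis=0)     # a per-class measure of model uncertainty
```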

Max-Norm Regularization
It constrains the weights w of the incoming connections such that ∥w∥₂ ≤ r, where r is the
max-norm hyperparameter and ∥·∥₂ is the ℓ2 norm. Max-norm regularization does not add a
regularization loss term to the overall loss function. Instead, it is typically implemented by
computing ∥w∥₂ after each training step and rescaling w if needed (w ← w · r/∥w∥₂).
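In Keras this is expressed as a per-layer kernel constraint (a minimal sketch with r = 1.0):

```python
from tensorflow import keras

layer = keras.layers.Dense(
    100, activation="relu",
    kernel_constraint=keras.constraints.MaxNorm(1.0),  # enforce ||w||2 <= r
)
```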

3.1.3.11 Famous Frameworks


 Tensorflow/Keras (By Google)
 MXNet (By Apache)
Both are supported by AWS

3.1.3.12 Convolution Neural Network


A CNN can find features (data patterns) anywhere in your data; this is called "feature location invariance".
It is used for:
 Image Classification
 Machine Translation
 Sentence Classification
 Sentiment Analysis
The most important building block of a CNN is the convolutional layer: neurons in the first
convolutional layer are not connected to every single pixel in the input image, but only to pixels
in their receptive fields. In turn, each neuron in the second convolutional layer is connected only
to neurons located within a small rectangle in the first layer. This architecture allows the network
to concentrate on small low-level features in the first hidden layer, then assemble them into
larger higher-level features in the next hidden layer, and so on.


Figure 55: CNN layers with rectangular local receptive fields

Padding
A neuron located in row i, column j of a given layer is connected to the outputs of the neurons in
the previous layer located in rows i to i + fh – 1, columns j to j + fw – 1, where fh and fw are the
height and width of the receptive field. In order for a layer to have the same height and width as
the previous layer, it is common to add zeros around the inputs, as shown in the diagram. This is
called zero padding.

Figure 56: Connection between layers and Zero padding

Padding must be either "VALID" or "SAME":


 If set to "VALID", the convolutional layer does not use zero padding,
 If set to "SAME", the convolutional layer uses zero padding if necessary. In this case, the
number of output neurons is equal to the number of input neurons divided by the stride.


Stride
It is also possible to connect a large input layer to a much smaller layer by spacing out the
receptive fields, as shown in the next figure. The shift from one receptive field to the next is
called the stride.

Figure 57: Reducing dimensionality using a stride of 2

Filters
A neuron’s weights can be represented as a small image the size of the receptive field. For
example, the next figure two possible sets of weights, called filters (or convolution kernels). The
first one is represented as a black square with a vertical white line in the middle (it is a 7 × 7
matrix full of 0s except for the central column, which is full of 1s); neurons using these weights
will ignore everything in their receptive field except for the central vertical line (since all inputs
will get multiplied by 0, except for the ones located in the central vertical line). The second filter
is a black square with a horizontal white line in the middle. Once again, neurons using these
weights will ignore everything in their receptive field except for the central horizontal line.
Now if all neurons in a layer use the same vertical line filter (and the same bias term), and you
feed the network the input image shown in the next figure, the layer will output the top-left image.
Notice that the vertical white lines get enhanced while the rest gets blurred. Similarly, the upper-
right image is what you get if all neurons use the same horizontal line filter; notice that the
horizontal white lines get enhanced while the rest is blurred out. Thus, a layer full of neurons
using the same filter outputs a feature map, which highlights the areas in an image that activate
the filter the most. Of course you do not have to define the filters manually: instead, during
training the convolutional layer will automatically learn the most useful filters for its task, and the
layers above will learn to combine them into more complex patterns.

Figure 58: Applying two different filters to get two feature maps

Stacking Multiple Feature Maps


Up to now, for simplicity, I have represented the output of each convolutional layer as a thin 2D
layer, but in reality a convolutional layer has multiple filters (you decide how many), and it
outputs one feature map per filter, so it is more accurately represented in 3D. To do so, it has
one neuron per pixel in each feature map, and all neurons within a given feature map share the
same parameters (i.e., the same weights and bias term). However, neurons in different feature
maps use different parameters. A neuron’s receptive field is the same as described earlier, but it
extends across all the previous layers’ feature maps. In short, a convolutional layer
simultaneously applies multiple trainable filters to its inputs, making it capable of detecting
multiple features anywhere in its inputs.


Figure 59: Convolution layers with multiple feature maps, and images with three color channels

Moreover, input images are also composed of multiple sublayers: one per color channel. There
are typically three: red, green, and blue (RGB). Grayscale images have just one channel, but some
images may have much more—for example, satellite images that capture extra light frequencies
(such as infrared).
Specifically, a neuron located in row i, column j of the feature map k in a given convolutional layer
l is connected to the outputs of the neurons in the previous layer l – 1, located in rows i × sh to i ×
sh + fh – 1 and columns j × sw to j × sw + fw – 1, across all feature maps (in layer l – 1). Note that
all neurons located in the same row i and column j but in different feature maps are connected to
the outputs of the exact same neurons in the previous layer.

Pooling Layer
The pooling layer's goal is to subsample (i.e., shrink) the input image in order to reduce the
computational load, the memory usage, and the number of parameters (thereby limiting the risk
of overfitting).


Just like in convolutional layers, each neuron in a pooling layer is connected to the outputs of a
limited number of neurons in the previous layer, located within a small rectangular receptive
field. You must define its size, the stride, and the padding type, just like before. However, a
pooling neuron has no weights; all it does is aggregate the inputs using an aggregation function
such as the max, min or mean.

Figure 60: Max pool layer

The pooling layer also introduces some level of invariance to small translations, rotations, and
scaling, as shown in the figure below.

Figure 61: Invariance to small translations

Max pooling and average pooling can be performed along the depth dimension rather than the
spatial dimensions, although this is not as common. This can allow the CNN to learn to be
invariant to various features. For example, it could learn multiple filters, each detecting a
different rotation of the same pattern, such as handwritten digits.


Figure 62: Depth-wise max pooling can help the CNN learn any invariance

Flatten Layer
This layer converts the 2D feature maps into a 1D vector for passing into a flat hidden layer of neurons.

CNN Architectures
Typical CNN architectures stack a few convolutional layers (each one generally followed by a
ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another
pooling layer, and so on. The image gets smaller and smaller as it progresses through the
network, but it also typically gets deeper and deeper (i.e., with more feature maps) thanks to the
convolutional layers. At the top of the stack, a regular feedforward neural
network is added, composed of a few fully connected layers (+ReLU), and the final layer outputs
the prediction (e.g., a Softmax layer that outputs estimated class probabilities).

Figure 63: Typical CNN architecture


NOTE: A common mistake is to use convolution kernels that are too large. For
example, instead of using a convolutional layer with a 5 × 5 kernel, it is generally
preferable to stack two layers with 3 × 3 kernels: it will use fewer parameters and
require fewer computations, and it will usually perform better. One exception to this
recommendation is the first convolutional layer: it can typically have a large
kernel (e.g., 5 × 5), usually with a stride of 2 or more: this will reduce the spatial
dimension of the image without losing too much information, and since the input
image generally has only 3 channels, it will not be too costly.
Simple Usage:
Conv2D → MaxPooling2D → Dropout → Flatten → Dense → Dropout → Softmax
Conv2D: Convolution for the image data
Max Pooling: Subsample the image down to shrink the amount of data
Dropout: Prevent overfitting
Flatten: Convert data to 1D to be fed into the perceptron (Dense Layer)
Dense: Just a perceptron for normal DNN (Hidden layer of neurons)
Softmax: Multiclass classification.
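A minimal Keras sketch of exactly this stack (filter counts and sizes are illustrative):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu",
                        input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D(pool_size=2),  # subsample to shrink the data
    keras.layers.Dropout(0.25),              # prevent overfitting
    keras.layers.Flatten(),                  # 2D feature maps -> 1D vector
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation="softmax"),  # multiclass output
])
```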

Specialized CNN Architectures


Defines specific arrangements of layers, padding and hyperparameters for solving specific
problems.
 LeNet-5
Good for handwriting recognition
 AlexNet
Image Classification better than LeNet
 GoogLeNet
Even deeper with better performance
Introduces inception modules (groups of convolution modules)
GoogLeNet: This great performance came in large part from the fact that the network was
much deeper than previous CNNs. This was made possible by sub-networks called
inception modules, which allow GoogLeNet to use parameters much more efficiently than
previous architectures: GoogLeNet actually has 10 times fewer parameters than AlexNet
(roughly 6 million instead of 60 million).

 ResNet (Residual Network)

Even deeper; maintains performance by using skip connections (also called shortcut
connections).
AWS uses ResNet-50, a 50-layer variant of ResNet.
When you initialize a regular neural network, its weights are close to zero, so the network
just outputs values close to zero. If you add a skip connection, the resulting network just
outputs a copy of its inputs; in other words, it initially models the identity function. If the
target function is fairly close to the identity function (which is often the case), this will
speed up training considerably.

CNNs are Hard


 CNNs are very computationally expensive; they are very heavy on CPU, GPU, and RAM.
 Lots of hyperparameters
 Getting training data is hard, as is storing and accessing it.

3.1.3.13 Recurrent Neural Network


RNNs deal with sequences of some sort, such as sequences in time (predicting stock prices) or
sequences of words (translation, understanding words in sentences).

RNN Process Sequence

 One to One:
One input and one output.
Classification from a set of categories.
 One to Many:
One input and many outputs. In every iteration there is an output.
Example: Image captioning, such that input is an image and output is a sequence of words
of different length.

 Many to One:
Many inputs and one output.
Example:
- Sentiment analysis, such that input is a sequence of words and the output is
sentiment of this text weather it is positive or negative.
- Video as input with variable number of frames (many) and the output is a
classification or action for the entire video.

 Many to Many:
Many inputs and many outputs.
Example:
Machine translation, such that input sequence of words and output is sequence of words.
Both the input and the output are variable in length, i.e., the English input can have any
length and the French output can have any length. Also, the English sentence need not be
the same length as the French sentence.

 Many to Many (Last One):


Many inputs and many outputs, with an output at every iteration.
Example:

Video classification at the frame level, such that the input is a video with a variable number
of frames and each frame should be classified.

RNNs can also handle non-sequence data, such as classifying images by taking a series of "glimpses".

Recurrent Neuron

Figure 64: Recurrent Neuron (Cell) or Memory Cell

 The core recurrent neuron will take some input x, feed that input into the RNN.
 RNN core neuron has some internal hidden state and that internal hidden state will be
updated every time that RNN reads a new input.
 The internal hidden state will then feedback to the model the next time it reads an input.
 Frequently we will want our RNN’s to produce some output at every time step.
 This pattern will read an input, update its hidden state and then produce an output.

Recurrent Formula
We can process a sequence of vector X by applying a recurrence formula at every step time.


Figure 65: Recurrent Formula

 Inside the RNN cell, we are computing some recurrence relation with a function f
 The function f depends on some weights (w); it accepts the previous hidden state h(t-1)
and the input at the current step xt, and outputs the updated hidden state h(t).
 The updated hidden state will be used in the next step as previous hidden state with the
new input.

NOTE: the same function and the same set of parameters (same updatable
weights) are used at every time step.

Figure 66: Simple RNN Function form (Vanilla Recurrent Neural Network)

 The current hidden state ht which is function that takes the previous hidden state h(t-1)
and some input x.
 We have some weight matrix Wxh that we multiply against the input Xt.
 Another weight matrix Whh that we multiply against the previous hidden state h(t-1)
 Then we add the results and pass them to tanh function.
 If we have an output from this cell, we might have another weight matrix Why that will be
multiplied by ht, and this will be the cell output.

NOTE: We can limit backpropagation to a limited number of time steps (truncated
backpropagation through time).
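To make the recurrence concrete, here is a minimal NumPy sketch of a single vanilla RNN time step; the weight names (Wxh, Whh, Why) follow the figures above, and the toy sizes and random data are purely illustrative assumptions.

import numpy as np

# A minimal sketch of one vanilla RNN time step, assuming
# hidden size H and input size D (hypothetical shapes).
H, D = 4, 3
Wxh = np.random.randn(H, D) * 0.01   # input-to-hidden weights
Whh = np.random.randn(H, H) * 0.01   # hidden-to-hidden weights
Why = np.random.randn(2, H) * 0.01   # hidden-to-output weights

def rnn_step(x_t, h_prev):
    # h_t = tanh(Whh @ h_{t-1} + Wxh @ x_t)
    h_t = np.tanh(Whh @ h_prev + Wxh @ x_t)
    y_t = Why @ h_t                  # optional per-step output
    return h_t, y_t

h = np.zeros(H)                      # initial hidden state h0
for x in np.random.randn(5, D):      # a toy sequence of 5 inputs
    h, y = rnn_step(x, h)            # same weights reused at every step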

RNN Computational Graph



Figure 67: RNN Computational Graph (Many to Many)

 The process begins by passing the initial hidden state h0 (commonly 0) with the first input
x1 to the function fw.
 Then apply the function fw, apply the weights and calculate the next hidden state h1.
 The new hidden state h1 with the new input x2 will be recurred to the same cell.
 In each iteration of the process the same weights are reused. Hidden
states will be calculated and passed to the cell in the next step.
 This process will be repeated over and over again till we consume all the inputs xt.
 In back propagation, we will have a separate gradient for w flowing from each of those
time steps and then the final gradient for w will be the sum of all of those individual per
time gradient.
 We can also have yt explicitly, every ht at each step might feed into some other neural
network that can produce yt.
 Also, we can calculate the loss at every individual step, the total loss will be the sum for all
the individual losses. Then we calculate the gradient for the total loss with respect to w.


Figure 68: RNN Computational Graph (Many to One)

 The final output will depend upon the final hidden state as this hidden state holds all the
information from all the previous hidden states.

Figure 69: RNN Computational Graph (One to Many)

Sequence to Sequence

Figure 70: Sequence to Sequence Many to One + One to Many


 It is used for something like machine translation, where you take a variably sized input and
produce a variably sized output.
 You can think of this as a combination of the many to one (Encoder) plus a one to many
(Decoder)
 Encoder will receive the variably sized input which is your sentence in English and then
summarize that entire sentence using the final hidden state of the encoder network.
 Decoder will receive the input as a vector from the encoder and produce this variable sized
output which is your sentence in another language.

Long Short Term Memory (LSTM)

Figure 71: LSTM Equation

 It is used to solve the vanishing and exploding gradient problem.


 LSTM maintains two states: the normal hidden state and the cell state.
 The cell state is kept inside the LSTM and is not really exposed to the outside world.

Figure 72: LSTM Equations


 IFOG gates:
- Forget gate (f): How much do we want to forget from the cell memory
- Input gate (i): How much do we want to input into our cell
- Gate gate (g): How much do we want to write to our cell
- Output gate (o): How much to reveal from cell to the output world
 Stack the previous hidden state vector and the current input vector, then multiply them by a
very big weight matrix W to compute the four different gates, which all have the same size as
the hidden state.

NOTE: Sometimes a different weight matrix is used for each gate.
 Input, forget and output gates use the sigmoid function (from 0 to 1) while the gate gate uses the
tanh function (from -1 to 1).
 We then calculate the cell state and the hidden state (⊙ denotes element-wise multiplication):
ct = (f ⊙ ct-1) + (i ⊙ g)
ht = o ⊙ tanh(ct)

NOTE: the forget gate (f) is a vector of values between zero and one telling us, for each
element in the cell state, whether to forget that element of the cell or remember that
element of the cell. The same concept applies to the input (i) and output (o) vectors, as
all of them come from the sigmoid function.
 Cell states will be incremented or decremented on each step.
 The hidden state will be used in the next step.
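As an illustration, here is a minimal NumPy sketch of one LSTM step following the equations above; the single big weight matrix W producing all four gates at once, and the toy sizes, are assumptions for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A sketch of one LSTM step with hypothetical sizes: hidden size H,
# input size D. W produces all four gates (i, f, o, g) at once.
H, D = 4, 3
W = np.random.randn(4 * H, H + D) * 0.01

def lstm_step(x_t, h_prev, c_prev):
    stacked = np.concatenate([h_prev, x_t])   # stack h(t-1) and xt
    gates = W @ stacked
    i = sigmoid(gates[0*H:1*H])               # input gate
    f = sigmoid(gates[1*H:2*H])               # forget gate
    o = sigmoid(gates[2*H:3*H])               # output gate
    g = np.tanh(gates[3*H:4*H])               # "gate gate"
    c_t = f * c_prev + i * g                  # element-wise (⊙)
    h_t = o * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(H), np.zeros(H)
for x in np.random.randn(5, D):               # a toy sequence
    h, c = lstm_step(x, h, c)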
Back propagation with LSTM
 During backpropagation, the gradient of the cell state flows through an element-wise
multiplication with the forget gate, and this helps solve the vanishing and exploding gradient
problem for 2 reasons:
- The forget gate is an element-wise multiplication rather than a full matrix multiplication
- We multiply by a different forget gate at every step

NOTE: In a vanilla RNN, we continually multiply by the same weight matrix.

RNN Variants
Gated Recurrent Unit (GRU)


GRUs were introduced in “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.”

3.1.3.14 Reinforcement Learning
It is used in:
 Supply chain management
 HVAC systems stands for heating, ventilation, and air conditioning.
 Industrial Robots
 Dialog Systems
 Autonomous Vehicles
Yields very fast performance once the space has been explored.

Markov Decision Processes (MDPs)


Components
Markov decision processes give us a way to formalize sequential decision making. This
formalization is the basis for structuring problems that are solved with reinforcement learning.
In an MDP, we have a decision maker, called an agent that interacts with the environment it's
placed in. These interactions occur sequentially over time. At each time step, the agent will get
some representation of the environment's state. Given this representation, the agent selects
an action to take. The environment is then transitioned into a new state, and the agent is given
a reward as a consequence of the previous action.

Components of an MDP:
- Agent
- Environment
- State
- Action
- Reward

This process of selecting an action from a given state, transitioning to a new state, and receiving a
reward happens sequentially over and over again, which creates something called
a trajectory that shows the sequence of states, actions, and rewards.


Throughout this process, it is the agent's goal to maximize the total amount of rewards that it
receives from taking actions in given states. This means that the agent wants to maximize not just
the immediate reward, but the cumulative rewards it receives over time.

MDP Notation
In an MDP, we have a set of states (S), a set of actions (A), and a set of rewards (R). We'll assume
that each of these sets has a finite number of elements.
At each time step t=0, 1, 2, ⋯, the agent receives some representation of the environment's
state St ∈ S. Based on this state, the agent selects an action At ∈ A. This gives us the state-action
pair (St, At).
Time is then incremented to the next time step t+1, and the environment is transitioned to a new
state St+1 ∈ S. At this time, the agent receives a numerical reward Rt+1 ∈ R for the action At taken
from state St.
We can think of the process of receiving a reward as an arbitrary function f that maps state-
action pairs to rewards. At each time t, we have:

f(St, At) = Rt+1

The trajectory representing the sequential process of selecting an action from a state,
transitioning to a new state, and receiving a reward can be represented as:

S0, A0, R1, S1, A1, R2, S2, A2, R3…

Figure 73: MDP Notation Illustration

Let's break down this diagram into steps.


- At time t, the environment is in state St.
- The agent observes the current state and selects action At.
- The environment transitions to state St+1 and grants the agent reward Rt+1.
- This process then starts over for the next time step, t+1.
- Note, t+1 is no longer in the future, but is now the present. When we cross the dotted line
on the bottom left, the diagram shows t+1 transforming into the current time step t so
that St+1 and Rt+1 are now St and Rt.

Transition Probabilities
- Since the sets (S) and (R) are finite, the random variables (Rt) and (St) have well defined
probability distributions. In other words, all the possible values that can be assigned
to Rt and St have some associated probability. These distributions depend on
the preceding state and action that occurred in the previous time step t−1.
- For example, suppose s′ ∈ S and r ∈ R. Then there is some probability
that St=s′ and Rt=r. This probability is determined by the particular values of
the preceding state (s) ∈ S and action a ∈ A(s). Note that A(s) is the set of actions that can
be taken from state (s).
- Let’s define this probability.
For all s′ ∈ S, s ∈ S, r ∈ R, and a ∈ A(s), we define the probability of the transition to
state s′ with reward r from taking action (a) in state (s) as:

p(s′, r | s, a) = Pr{St = s′, Rt = r | St-1 = s, At-1 = a}

Expected Return
We stated that the goal of an agent in an MDP is to maximize its cumulative rewards. We need a
way to aggregate and formalize these cumulative rewards. For this, we introduce the concept of
the expected return of the rewards at a given time step.
For now, we can think of the return simply as the sum of future rewards. Mathematically, we
define the return G at time t as
Gt = Rt+1 + Rt+2 + Rt+3 + … + RT
Where T is the final time step.
This concept of the expected return is super important because it's the agent's objective to
maximize the expected return. The expected return is what's driving the agent to make the
decisions it makes.

Episodic Vs. Continuing Tasks



In our definition of the expected return, we introduced T, the final time step. When the notion of
having a final time step makes sense, the agent-environment interaction naturally breaks up into
subsequences, called episodes. For example, think about playing a game of pong. Each new round
of the game can be thought of as an episode, and the final time step of an episode occurs when a
player scores a point.
Each episode ends in a terminal state at time T, which is followed by resetting the environment
to some standard starting state or to a random sample from a distribution of possible starting
states. The next episode then begins independently from how the previous episode ended.
Formally, tasks with episodes are called episodic tasks.
There exist other types of tasks, though, where the agent-environment interactions don't break
up naturally into episodes, but instead continue without limit. These types of tasks are
called continuing tasks.
For example, Painting robots work continuously.
Continuing tasks make our definition of the return at each time t problematic because our final
time step would be T= ∞.

Discounted Return
Our revision of the way we think about return will make use of discounting. Rather than the
agent's goal being to maximize the expected return of rewards, it will instead be to maximize the
expected discounted return of rewards. Specifically, the agent will be choosing action (At) at each
time t to maximize the expected discounted return.
Agent's goal to maximize the expected discounted return of rewards.
To define the discounted return, we first define the discount rate (γ) to be a number
between 0 and 1. The discount rate is the rate at which we discount future rewards and will
determine the present value of future rewards. With this, we define the discounted return as:

Gt = Rt+1 + γRt+2 + γ²Rt+3 + … = Σk=0..∞ γᵏ Rt+k+1

This definition of the discounted return means our agent will care more about the
immediate reward than future rewards, since future rewards are more heavily discounted. So,
while the agent does consider the rewards it expects to receive in the future, the more
immediate rewards have more influence when it comes to the agent making a decision about
taking a particular action.

The following relationship shows how returns at successive time steps are related to each
other; we'll make use of this relationship later:

Gt = Rt+1 + γGt+1


Also, check this out. Even though the return at time t is a sum of an infinite number of terms, the return
is actually finite as long as the reward is nonzero and constant, and γ<1.

For example, if the reward at each time step is a constant 1 and γ<1, then the return is

Gt = Σk=0..∞ γᵏ = 1/(1−γ)

This infinite sum yields a finite result. If you want to understand this concept more deeply, then
research infinite series convergence. For our purposes though, you're free to just trust the fact that this is
true, and understand the infinite sum of discounted returns is finite if the conditions we outlined are
met.
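
As a quick numeric sanity check, here is a toy Python sketch, assuming a constant reward of 1 and γ = 0.9; the long finite sum approaches 1/(1−γ) = 10:

# A quick numeric check of the discounted return (toy example).
gamma = 0.9
rewards = [1.0] * 1000                      # long horizon approximates infinity

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)                                    # ~10.0, i.e. 1 / (1 - gamma)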

Policies and Value Functions


First, we'd probably like to know how likely it is for an agent to take any given action from any given
state. In other words, what is the probability that an agent will select a specific action from a specific
state? This is where the notion of policies comes into play, and we'll expand on this in just a moment.

Secondly, in addition to understanding the probability of selecting an action, we'd probably also like to
know how good a given action or a given state is for the agent. In terms of rewards, selecting one
action over another in a given state may increase or decrease the agent's rewards, so knowing this in
advance will probably help our agent out with deciding which actions to take in which states. This is
where value functions become useful, and we'll also expand on this idea in just a bit.

Question Addressed by
How probable is it for an agent to select any action from a given state? Policies
How good is any given action or any given state for an agent? Value functions

Policies


A policy is a function that maps a given state to probabilities of selecting each possible action from that
state. We will use the symbol π to denote a policy.

When speaking about policies, formally we say that an agent “follows a policy.” For example, if an agent
follows policy π at time t, then π (a | s) is the probability that At = a, if St=s. This means that, at time t,
under policy π, the probability of taking action (a) in state (s) is π (a | s).

Note that, for each state s ∈ S, π is a probability distribution over a ∈ A(s).

Value Functions
Value functions are functions of states, or of state-action pairs, that estimate how good it is for
an agent to be in a given state, or how good it is for the agent to perform a given action in a given
state.
State-Value Function
The state-value function for policy π, denoted as vπ, tells us how good any given state is for an
agent following policy π. In other words, it gives us the value of a state under π.
Action-Value Function
Similarly, the action-value function for policy π, denoted as qπ, tells us how good it is for the agent
to take any given action from a given state while following policy π. In other words, it gives us the
value of an action under π.
Conventionally, the action-value function qπ is referred to as the Q-function, and the output from
the function for any given state-action pair is called a Q-value. The letter “Q” is used to represent
the quality of taking a given action in a given state.
Optimality
It is the goal of reinforcement learning algorithms to find a policy that will yield a lot of rewards
for the agent if the agent indeed follows that policy. Specifically, reinforcement learning
algorithms seek to find a policy that will yield more return to the agent than all other policies.
In terms of return, a policy π is considered to be better than or the same as policy π′ if the
expected return of π is greater than or equal to the expected return of π′ for all states.
Optimal State-Value Function
The optimal policy has an associated optimal state-value function. We denote the optimal
state-value function as v∗ and define it as:

v∗(s) = maxπ vπ(s)

For all s ∈ S. In other words, v∗ gives the largest expected return achievable by any policy π for
each state.

Optimal Action-Value Function


Similarly, the optimal policy has an optimal action-value function, or optimal Q-function, which
we denote as q∗ and define it as:

q∗(s, a) = maxπ qπ(s, a)

For all s ∈ S and a ∈ A(s). In other words, q∗ gives the largest expected return achievable by any
policy π for each possible state-action pair.

Bellman Optimality Equation for Q*

q∗(s, a) = E[ Rt+1 + γ maxa′ q∗(s′, a′) ]

It states that, for any state-action pair (s, a) at time t, the expected return from starting in
state (s), selecting action a and following the optimal policy thereafter (AKA the Q-value of this
pair) is going to be the expected reward we get from taking action (a) in state (s), which is Rt+1,
plus the maximum expected discounted return that can be achieved from any possible next
state-action pair (s′, a′).
Since the agent is following an optimal policy, the following state (s′) will be the state from which
the best possible next action a′ can be taken at time t+1.

Q-Learning
Q-learning is a reinforcement learning technique used for learning the optimal policy in a Markov
Decision Process. We'll illustrate how this technique works by introducing a game where a
reinforcement learning agent tries to maximize points.
We left off talking about the fact that once we have our optimal Q-function q∗ we can determine
the optimal policy by applying a reinforcement learning algorithm to find the action that
maximizes q∗ for each state.
The objective of Q-learning is to find a policy that is optimal in the sense that the expected value
of the total reward over all successive steps is the maximum achievable. So, in other words, the
goal of Q-learning is to find the optimal policy by learning the optimal Q-values for each state-
action pair.


Q-Learning with Value Iteration


First, as a quick reminder, remember that the Q-function for a given policy accepts a state and an
action and returns the expected return from taking the given action in the given state and
following the given policy thereafter.

Value Iteration
The Q-learning algorithm iteratively updates the Q-values for each state-action pair using the
Bellman equation until the Q-function converges to the optimal Q-function, q∗. This approach is
called value iteration.

Q Table
We'll be making use of a table, called a Q-table, to store the Q-values for each state-action pair.
The horizontal axis of the table represents the actions, and the vertical axis represents the states.
So, the dimensions of the table are the number of states by the number of actions.
All the Q-values in the table are first initialized to zero. Over time, though, as the agent plays
several episodes of the game, the Q-values produced for the state-action pairs that the agent
experiences will be used to update the Q-values stored in the Q-table.
As the Q-table becomes updated, in later moves and later episodes, the agent can look in the Q-
table and base its next action on the highest Q-value for the current state. This will make more
sense once we actually start playing the game and updating the table.
Episodes
Now, we'll set some standard number of episodes that we want the agent to play. Let's say we
want the agent to play five episodes. It is during these episodes that the learning process will take
place.
In each episode, the agent starts out by choosing an action from the starting state based on the
current Q-values in the table. The agent chooses the action based on which action has the
highest Q-value in the Q-table for the current state.
But, wait... That's kind of weird for the first actions in the first episode, right? Because all the Q-
values are set to zero at the start, there's no way for the agent to differentiate between them to
discover which one is considered better. So, what action does it start with?
To answer this question, we'll introduce the trade-off between exploration and exploitation.
Exploration vs Exploitation

Exploration is the act of exploring the environment to find out information about
it. Exploitation is the act of exploiting the information that is already known about the
environment in order to maximize the return.
We need a balance of both exploitation and exploration. So how do we implement this?

Epsilon Greedy Strategy


To get this balance between exploitation and exploration, we use what is called an epsilon greedy
strategy. With this strategy, we define an exploration rate ϵ that we initially set to 1. This
exploration rate is the probability that our agent will explore the environment rather than exploit
it. With ϵ=1, it is 100% certain that the agent will start out by exploring the environment.
As the agent learns more about the environment, at the start of each new episode, ϵ will decay
by some rate that we set so that the likelihood of exploration becomes less and less probable as
the agent learns more and more about the environment. The agent will become “greedy” in
terms of exploiting the environment once it has had the opportunity to explore and learn more
about it.
To determine whether the agent will choose exploration or exploitation at each time step, we
generate a random number between 0 and 1. If this number is greater than epsilon, then the
agent will choose its next action via exploitation, i.e. it will choose the action with the highest Q-
value for its current state from the Q-table. Otherwise, its next action will be chosen via
exploration, i.e. randomly choosing its action and exploring what happens in the environment.
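
A minimal sketch of this selection rule, assuming a Q-table stored as q_table[state][action] (a hypothetical structure):

import random

def choose_action(q_table, state, actions, epsilon):
    if random.random() > epsilon:
        # Exploit: pick the action with the highest Q-value for this state
        return max(actions, key=lambda a: q_table[state][a])
    # Explore: pick a random action instead
    return random.choice(actions)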
Updating Q-Value
To update the Q-value for the action of moving right taken from the previous state, we use the
Bellman equation.

We want to make the Q-value for the given state-action pair as close as we can to the right hand
side of the Bellman equation so that the Q-value will eventually converge to the optimal Q-
value q∗.
This will happen over time by iteratively comparing the loss between the Q-value and the optimal
Q-value for the given state-action pair and then updating the Q-value over and over again each
time we encounter this same state-action pair to reduce the loss.

Learning Rate
The learning rate is a number between 0 and 1, which can be thought of as how quickly the agent
abandons the previous Q-value in the Q-table for a given state-action pair for the new Q-value.
We don't want to just overwrite the old Q-value, but rather, we use the learning rate as a tool to
determine how much information we keep about the previously computed Q-value for the given
state-action pair versus the new Q-value calculated for the same state-action pair at a later time
step. We'll denote the learning rate with the symbol α, and we'll arbitrarily set α=0.7 for example.
The higher the learning rate, the more quickly the agent will adopt the new Q-value. For example,
if the learning rate is 1, the estimate for the Q-value for a given state-action pair would be the
straight up newly calculated Q-value and would not consider previous Q-values that had been
calculated for the given state-action pair at previous time steps.
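
The update can be written as a blend of the old Q-value and the Bellman target, weighted by α; a small sketch with hypothetical variable names:

# Blend the old Q-value with the Bellman target, weighted by alpha.
alpha, gamma = 0.7, 0.99

def update_q(q_table, state, action, reward, next_state):
    bellman_target = reward + gamma * max(q_table[next_state].values())
    q_table[state][action] = (1 - alpha) * q_table[state][action] \
                             + alpha * bellman_target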

Summary
This is a summary for the reinforcement algorithm:
 MDP provide a mathematical framework for modeling decision making in situations where
outcomes are partly random and partly under the control of a decision maker.
 Our “Q” values are described as a reward function Ra(s, s′)
 Start off with Q values of 0
 Explore the space
 As bad things happen after a given state/action, reduce its Q
 As rewards happen after a given state/action, increase its Q
 You can “look ahead” more than one step by using a discount factor when computing Q
(here (s) is previous state, s’ is current state)
- Q(s, a) += discount * (reward(s, a) + max(Q(s’)) – Q(s, a))
 Exploration problem:
- We can’t always choose the highest Q value as at the beginning all the Q values are
initialized with 0 and you will miss a lot of paths.
- If a random number is less than epsilon, don’t follow the highest Q, but choose at
random
- That way, exploration never totally stops
- Choosing epsilon can be tricky
 A Markov Decision Process (MDP) is also known as a discrete-time stochastic control process
and is closely related to dynamic programming.
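
Putting the pieces together, here is a compact, illustrative Q-learning loop; the env.reset()/env.step() interface and all names are assumptions (a Gym-style sketch), not a specific library's API:

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500,
               alpha=0.7, gamma=0.99, epsilon=1.0, decay=0.99):
    q = defaultdict(lambda: {a: 0.0 for a in actions})  # Q-table, init 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() > epsilon:
                action = max(actions, key=lambda a: q[state][a])  # exploit
            else:
                action = random.choice(actions)                   # explore
            next_state, reward, done = env.step(action)
            target = reward + gamma * max(q[next_state].values())
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
        epsilon *= decay            # explore less as we learn more
    return q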

SageMaker and Reinforcement


So SageMaker offers an implementation of reinforcement learning that's built on deep learning:
 Frameworks: Tensorflow and MXNet
 Toolkits: Supports Intel Coach and Ray RLlib toolkits.
 Environments: Custom, open-source, or commercial environments supported.
- MATLAB, Simulink
- EnergyPlus, RoboSchool, PyBullet
- Amazon Sumerian, AWS RoboMaker


 Reinforcement learning can be distributed training and/or environment rollout.
 Multi-core on the same PC or multi-instance on multiple PCs.
 There are no built-in hyperparameters, but you can create your own parameters and let
SageMaker optimize these parameters for you.
 There are no specific instance types for reinforcement learning, but GPUs may be helpful as it is
related to deep learning.

3.1.4 Natural Language Processing (NLP)

Figure 74: NLP Pipeline

3.1.4.1 Text preprocessing


Corpus
 Corpus: Large collection of words or phrases. Like vocabulary or dictionary of a language.
- Corpus can come from different sources: Documents, web sources, database
 Token: Words or phrases extracted from documents.
 Feature vector: A numeric array that ML models use for training and
classification/regression tasks.

Stop Words
 Manually excluded from the text, because they occur too frequently in all documents in
the corpus.
 There are 179 stop words in the NLTK library, e.g., “she”, “he”, “is”, etc.

Tokenizing
 Separating text data into tokens by white space and punctuation as token separators.

 Sentence: “I don’t like eggs.”


- Tokens: “I”, “don’t”, “like”, “eggs”, “.”

 Tokens: Individual pieces of information from the raw sentence.


 Tokens can be further processed depending on their importance.

Stemming
 Set of rules to slice a string to a substring. The goal is to remove word affixes (particularly
suffixes).
 Such as:
- Removing “s”, “es” which generally indicates plurality.
- Removing past tense suffixes: “ed”
 For example “The children are playing and running. The weather was better yesterday.”
- Stemming: “The children are play and run. The weather was better yesterday.”

Lemmatization
 It looks up words in dictionary and returns the “head” word called a “lemma.”
- It is more complex than stemming.
- For best results, word position tags should be provided: Adjective, noun...etc.
 For example “The children are playing and running. The weather was better yesterday.”
- Lemmatizing: “The child be play and run. The weather be good yesterday”
When preparing the data:
 Apply word stemming and lemmatization
 Remove stop words
 Remove punctuation
 Convert text to lowercase (actually depends on your use-case)
 Replace digits
 After preprocessing, we then move on to tokenizing the corpus

3.1.4.2 Vectorization
 ML algorithms expect numeric vectors as inputs instead of texts
 This transformation is called vectorization or feature extraction.

Bag of Words
Each document is represented by a vector with size equal to the size of the corpus (vocabulary).
Each entry is the number of times the corresponding word occurred in the sentence (raw counts
method).

Figure 75: Bag of Words

Issues:
 We lost information inherent in the word order.
 Large documents can have big word counts compared to small docs.
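
A minimal sketch of the raw-counts method with scikit-learn's CountVectorizer (toy corpus for illustration):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["this movie is good", "this movie is bad"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)      # raw counts per document
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(bow.toarray())                        # one count vector per document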

Keeping the order: N-grams


Let’s keep some order information with N-grams: “N” consecutive words in a text. N-grams: “This
movie is good”
 1-gram: “this”, “movie”, “is”, “good”
 2-grams: “this movie”, “movie is”, “is good”
Let’s apply this to our data (N=2):

Figure 76: 2 gram
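
The same vectorizer can keep some order information; a sketch restricted to 2-grams via ngram_range:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(2, 2))   # consecutive word pairs only
bigrams = vectorizer.fit_transform(["this movie is good"])
print(vectorizer.get_feature_names_out())
# ['is good', 'movie is', 'this movie']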

Term Frequencies (TF)


Token counts divided by total number of tokens in the document.


Figure 77: Term Frequency

Inverse Document Frequency (IDF)


So far, we only looked at local scale (document level). We also need to consider other documents
in the corpus.
The main insight: Meaning is mostly encoded in more rare items in documents. For example: In
sports documents about basketball and soccer,
 We will mostly see words like “play”, “run” and “score.”
 These won’t be useful to distinguish between a basketball and soccer document.

How to calculate “Term Frequency Inverse Document Frequency”?


N: total documents
Nt: Number of documents with token/phrase “t” in it.
Term Frequency (tf): tft = token count divided by the total number of tokens in the document
Document frequency (df): dft = Nt / N
Inverse document frequency (idf):
 Compute idft = 1/dft, then apply a logarithm: log(N/Nt)
 Scikit-learn also applies smoothing: idft = log[(N+1)/(Nt+1)] + 1
Term Frequency Inverse Document Frequency (TFIDF):
 tf_idft,d = tft,d x idft
A high TFIDF is reached by a high term frequency and a high inverse document frequency (low
document frequency).
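
A minimal TF-IDF sketch with scikit-learn, which applies the smoothed IDF shown above by default (smooth_idf=True):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["play score run", "learn stock market", "play game"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray())   # higher values = frequent here, rare elsewhere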

3.1.4.3 Train Model


Naïve Bayes Classifier
Sample Problems:

 Spam / not spam email


 Text topic classification: sports, politics, finance
 Opinion (sentiment): like, neutral, dislike

Naïve Bayes Classifier – Example


Category classification: We have training texts that belong to categories: Finance and Not
Finance.
We want to classify new texts using a Naïve Bayes classifier.

Which tag does the following sentence belong to? “Learn stock markets playing this game”
Pre-processing:
 Remove stop words and
 Remove words shorter than 2 characters
 Apply stemming


Calculation:
We want to calculate the probabilities:
P (Finance | “learn stock market play game”): Probability of the Finance tag, given the sentence:
Learn stock markets playing this game
P (Not Finance | “learn stock market play game”): Probability of the Not Finance tag, given the
sentence: Learn stock markets playing this game
We will assign a category to “learn stock market play game” based on whichever probability is
larger.
By using Bayes' Theorem:

We wish to know which of these two probabilities is larger.


The denominators are the same! So we don’t need to compute the denominators to know which
probability is large.

P(Finance | "learn stock market play game") ∝ P("learn stock market play game" | Finance) × P(Finance)

P(Finance) = 2/5, as we have 2 Finance sentences out of 5 sentences.

P(Not Finance | "learn stock market play game") ∝ P("learn stock market play game" | Not Finance) × P(Not Finance)


Problem: We don’t have any data with the exact sequence: “learn stock market play game”
Solution: Be naïve! Assume every word is conditionally independent.

 P("learn stock market play game" | Finance) = P("learn" | Finance) × P("stock" | Finance)
× P("market" | Finance) × P("play" | Finance) × P("game" | Finance)

 P("learn stock market play game" | Not Finance) = P("learn" | Not Finance) × P("stock" |
Not Finance) × P("market" | Not Finance) × P("play" | Not Finance) × P("game" | Not Finance)

Smoothing:
If a word never appears in a category's training data, its conditional probability is zero and the
whole product becomes zero. Laplace (add-one) smoothing avoids this by adding 1 to every word
count and adding the vocabulary size to the denominator.
Result:
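
Since the worked numbers here were shown as an image, below is an illustrative scikit-learn sketch of the same idea with hypothetical toy data; MultinomialNB applies Laplace smoothing via alpha=1.0 by default:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data for the Finance / Not Finance example.
texts = ["stock markets fell today", "team wins the game",
         "invest in stock funds", "play the final game"]
labels = ["Finance", "Not Finance", "Finance", "Not Finance"]

model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)
print(model.predict(["learn stock markets playing this game"]))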


3.1.4.4 Sentiment Analysis


 Classifying the polarity of a given text. Positive or Negative Movie Reviews.

Sentiment Analysis could be:


 Simplest task:
- Is the attitude of this text positive or negative?
 More complex:
- Rank the attitude of this text from 1 to 5
 Advanced:
- Detect the target, source, or complex type

Sentiment Lexicons
NLTK has two main sentiment data sources:
 Bing Liu Opinion Lexicon
 SentiWordNet
We also have VADER:
Sentiment metrics with score in [-1, 1]: Positive, Neutral, Negative, Compound
It considers the following cases:
 Punctuation: Namely the exclamation point (!), increases the magnitude of the intensity
 Capitalization, specifically using ALL-CAPS to emphasize meaning.
 Degree modifiers (also called intensifiers, booster words, or degree adverbs): “The service
is extremely good” > “The service is very good” > “The service is marginally good”
 The contrastive conjunction “but”: “The food here is great, but the service is horrible”
 Uses tri-grams to handle negation, e.g.: “The food here isn’t really all that great”
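
A brief sketch using NLTK's VADER analyzer (the vader_lexicon resource must be downloaded first):

import nltk
nltk.download("vader_lexicon")
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The food here is GREAT, but the service is horrible!"))
# -> dict with 'neg', 'neu', 'pos' and a 'compound' score in [-1, 1]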

Word Representation
This captures relations between words.
The relation could also be between words from different languages.


Word2Vec
Word2Vec is a technique for natural language processing published in 2013. The word2vec
algorithm uses a neural network model to learn word associations from a large corpus of text.
Once trained, such a model can detect synonymous words or suggest additional words for a
partial sentence. As the name implies, word2vec represents each distinct word with a particular
list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical
function (the cosine similarity between the vectors) indicates the level of semantic
similarity between the words represented by those vectors.
Word2vec is a group of related models that are used to produce word embeddings. These
models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts
of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically
of several hundred dimensions, with each unique word in the corpus being assigned a
corresponding vector in the space. Word vectors are positioned in the vector space such that
words that share common contexts in the corpus are located close to one another in the space.
Word2vec can utilize either of two model architectures to produce a distributed
representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the
continuous bag-of-words architecture, the model predicts the current word from a window of
surrounding context words. The order of context words does not influence prediction (bag-of-
words assumption). In the continuous skip-gram architecture, the model uses the current word
to predict the surrounding window of context words. The skip-gram architecture weighs nearby
context words more heavily than more distant context words.
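
A minimal training sketch with gensim (4.x API); sg=0 selects CBOW and sg=1 would select skip-gram. The toy corpus is illustrative only:

from gensim.models import Word2Vec

sentences = [["the", "movie", "was", "good"],
             ["the", "film", "was", "great"],
             ["the", "weather", "was", "bad"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

vector = model.wv["movie"]                 # the word's embedding
print(model.wv.most_similar("movie"))      # cosine-similar words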

Sentence Vectors
The same concept as Word2Vec, but for sentences.
Main goal: Create a numeric representation for a sentence (document) regardless of its length.


There are 3 methods to calculate it:


 Average/Sum Word Vectors
For each sentence:
- Find the corresponding word vector for each word/token.
- Average (or apply min/max) the word vectors for each sentence.

 Weighed Sum of Vectors


The first method assumes that each word/token has the same effect on the sentence
vector.
- We can apply some weights to each word token. Weights (w1, w2….) can be simply
Term Frequency, TF-IDFs or document position (title, main text, and conclusion).

 Pre-trained System
Universal Sentence Encoder, which uses Google’s sentence encoder.
Provides pre-trained models that produce fixed-size (512-dimensional) sentence vectors.
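
A sketch of the first method (averaging word vectors), assuming `model` is the gensim Word2Vec model from the earlier sketch:

import numpy as np

def sentence_vector(tokens, model):
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:                       # no known tokens: zero vector
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)       # average the word vectors

print(sentence_vector(["the", "movie", "was", "good"], model))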


3.2 Select the appropriate model


3.2.1 Linear Learner
 Linear regression
- Fit a line to your training data
- Predictions are based on that line
 Can handle both regression (numeric) predictions and classification predictions
- For classification, a linear threshold function is used.
- Can do binary or multi-class

How it works?
 Preprocessing
- Training data must be normalized (so all features are weighted the same)
- Linear Learner can do this for you automatically
- Input data should be shuffled
 Training
- Uses stochastic gradient descent
- Choose an optimization algorithm (Adam, AdaGrad, SGD…..etc.)
- Multiple models are optimized in parallel
- Tune L1, L2 regularization
 Validation
- The most optimal model is selected

Input Formats
 RecordIO-wrapped protobuf
- Float32 data only!
 CSV
- First column assumed to be the label
 File or Pipe mode both supported

Hyperparameters
Parameter Description
num_classes The number of classes for the response variable. The algorithm assumes that classes are
labeled 0... num_classes - 1.

Required when predictor_type is multiclass_classifier. Otherwise, the algorithm ignores it.


Valid values: Integers from 3 to 1,000,000
predictor_type Specifies the type of target variable as a binary classification, multiclass classification, or
regression.

Required
Valid values: binary_classifier, multiclass_classifier, or regressor
accuracy_top_k When computing the top-k accuracy metric for multiclass classification, the value of k. If
the model assigns one of the top-k scores to the true label, an example is scored as
correct.

Optional
Valid values: Positive integers
Default value: 3
balance_multiclass_weights Specifies whether to use class weights, which give each class equal importance in the loss
function. Used only when the predictor_type is multiclass_classifier.

Optional
Valid values: true, false
Default value: false
binary_classifier_model_selection_criteria When predictor_type is set to binary_classifier, the model
evaluation criteria for the validation dataset (or for the training dataset if you don't provide a
validation dataset).
Criteria include:
 accuracy—The model with the highest accuracy.
 f_beta—The model with the highest F1 score. The default is F1.
 precision_at_target_recall—The model with the highest precision at a given recall target.
 recall_at_target_precision—The model with the highest recall at a given precision target.
 loss_function—The model with the lowest value of the loss function used in training.

Optional
Valid values: accuracy, f_beta, precision_at_target_recall, recall_at_target_precision,
or loss_function
Default value: accuracy
epochs The maximum number of passes over the training data.

Optional
Valid values: Positive integer
Default value: 15
init_method Sets the initial distribution function used for model weights. Functions include:
 uniform—Uniformly distributed between (-scale, +scale)
 normal—Normal distribution, with mean 0 and sigma

Optional


Valid values: uniform or normal


Default value: uniform
l1 The L1 regularization parameter. If you don't want to use L1 regularization, set the value to
0.

Optional
Valid values: auto or non-negative float
Default value: auto
learning_rate The step size used by the optimizer for parameter updates.

Optional
Valid values: auto or positive floating-point integer
Default value: auto, whose value depends on the optimizer chosen.
loss Specifies the loss function.
The available loss functions and their default values depend on the value
of predictor_type:
 If the predictor_type is set to regressor, the available options
are auto, squared_loss, absolute_loss, eps_insensitive_squared_loss, eps_insensitive_abso
lute_loss, quantile_loss, and huber_loss. The default value for auto is squared_loss.
 If the predictor_type is set to binary_classifier, the available options are auto,logistic,
and hinge_loss. The default value for auto is logistic.
 If the predictor_type is set to multiclass_classifier, the available options
are auto and softmax_loss. The default value for auto is softmax_loss.
Valid
values: auto, logistic, squared_loss, absolute_loss, hinge_loss, eps_insensitive_squared_los
s, eps_insensitive_absolute_loss, quantile_loss, or huber_loss

Optional
Default value: auto
mini_batch_size The number of observations per mini-batch for the data iterator.

Optional
Valid values: Positive integer
Default value: 1000
momentum The momentum of the sgd optimizer.

Optional
Valid values: auto or a floating-point integer between 0 and 1.0
Default value: auto


num_models The number of models to train in parallel. For the default, auto, the algorithm decides the
number of parallel models to train. One model is trained according to the given training
parameter (regularization, optimizer, loss), and the rest by close parameters.

Optional
Valid values: auto or positive integer
Default values: auto
optimizer The optimization algorithm to use.

Optional
Valid values:
 Auto — The default value.
 Sgd — Stochastic gradient descent.
 Adam — Adaptive momentum estimation.
 Rmsprop — A gradient-based optimization technique that uses a moving average of
squared gradients to normalize the gradient.
Default value: auto. The default setting for auto is adam.
target_recall The target recall.
If binary_classifier_model_selection_criteria is precision_at_target_recall, then recall is
held at this value while precision is maximized.

Optional
Valid values: Floating-point integer between 0 and 1.0
Default value: 0.8
wd The weight decay parameter, also known as the L2 regularization parameter. If you don't
want to use L2 regularization, set the value to 0.

Optional
Valid values: auto or non-negative floating-point integer
Default value: auto

Instance Types
 Training
- Single or multi-machine CPU or GPU
 Multi-GPU does not help
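
A hedged sketch of launching a Linear Learner training job with the SageMaker Python SDK; the bucket paths and IAM role below are placeholders, and exact SDK details may vary by version:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
container = image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/linear-learner/output",     # placeholder path
    sagemaker_session=session,
)
estimator.set_hyperparameters(predictor_type="binary_classifier",
                              mini_batch_size=1000)

# CSV input: the first column is assumed to be the label
train = TrainingInput("s3://my-bucket/train.csv", content_type="text/csv")
estimator.fit({"train": train})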


3.2.2 K Nearest Neighbors


It is a supervised machine learning algorithm used for classification and regression.
- K-NN is one of the simplest Machine Learning algorithms based on Supervised Learning
technique.
- K-NN algorithm assumes the similarity between the new case/data and available cases and
put the new case into the category that is most similar to the available categories.
- K-NN algorithm stores all the available data and classifies a new data point based on the
similarity. This means that when new data appears, it can be easily classified into a well-
suited category by using the K-NN algorithm.
- K-NN algorithm can be used for Regression as well as for Classification but mostly it is used
for the Classification problems.
- K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
- It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset, and at the time of classification it performs an
action on the dataset.
- KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.
The output depends on whether k-NN is used for classification or regression:
- In k-NN classification, the output is a class membership. An object is classified by a plurality
vote of its neighbors, with the object being assigned to the class most common among
its k nearest neighbors.
- In k-NN regression, the output is the property value for the object. This value is the average
of the values of k nearest neighbors.
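
A minimal scikit-learn sketch of k-NN classification on toy 2D data; prediction is a plurality vote among the k=3 nearest stored points:

from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)                       # "training" just stores the data
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))   # -> [0 1]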

AWS KNN
How it works?
Step 1: Sample
To specify the total number of data points to be sampled from the training dataset, use
the sample_size parameter. For example, if the initial dataset has 1,000 data points and
the sample_size is set to 100, where the total number of instances is 2, each worker would
sample 50 points. A total set of 100 data points would be collected. Sampling runs in linear time
with respect to the number of data points.

Step 2: Perform Dimension Reduction



The current implementation of the k-NN algorithm has two methods of dimension reduction. You
specify the method in the dimension_reduction_type hyperparameter.
- The sign method specifies a random projection, which uses a linear projection using a
matrix of random signs.
- The fjlt method specifies a fast Johnson-Lindenstrauss transform, a method based on the
Fourier transform. The fjlt method should be used when the target dimension is large
and has better performance with CPU inference.

NOTE: Using dimension reduction introduces noise into the data and this noise can
reduce prediction accuracy.

Step 3: Build an Index


The algorithm builds an index to enable fast retrieval of neighbors, using the index instead of
searching all the data.
During inference, the algorithm queries the index for the k-nearest-neighbors of a sample point.
Based on the references to the points, the algorithm makes the classification or regression
prediction. It makes the prediction based on the class labels or values provided. K-NN provides
three different types of indexes: a flat index, an inverted index, and an inverted index with
product quantization. You specify the type with the index_type parameter.

Step 4: Serialize the Model


When the k-NN algorithm finishes training, it serializes three files to prepare for inference.
- model_algo-1: Contains the serialized index for computing the nearest neighbors.
- model_algo-1.labels: Contains serialized labels (np.float32 binary format) for computing
the predicted label based on the query result from the index.
- model_algo-1.json: Contains the JSON-formatted model metadata which stores
the k and predictor_type hyper-parameters from training for inference along with other
relevant state.


Hyperparameters
Parameter Description
feature_dim The number of features in the input data.

Required
Valid values: positive integer.
k The number of nearest neighbors.

Required
Valid values: positive integer
predictor_type The type of inference to use on the data labels.

Required
Valid values: classifier for classification or regressor for regression.
sample_size The number of data points to be sampled from the training data
set.

Required
Valid values: positive integer
dimension_reduction_target The target dimension to reduce to.

Required when you specify


the dimension_reduction_type parameter.

Valid values: positive integer greater than 0 and less


than feature_dim.
dimension_reduction_type The type of dimension reduction method.

Optional
Valid values: sign for random projection or fjlt for the fast Johnson-
Lindenstrauss transform.

Default value: No dimension reduction


index_metric The metric to measure the distance between points when finding
nearest neighbors. When training with index_type set
to faiss.IVFPQ, the INNER_PRODUCT distance and COSINE similarity
are not supported.

Optional
Valid values: L2 for Euclidean-distance, INNER_PRODUCT for inner-
product distance, COSINE for cosine similarity.


Default value: L2
index_type The type of index.

Optional
Valid values: faiss.Flat, faiss.IVFFlat, faiss.IVFPQ.
Default values: faiss.Flat
mini_batch_size The number of observations per mini-batch for the data iterator.

Optional
Valid values: positive integer
Default value: 5000

Input Formats
 Train channel contains your data
- Test channel emits accuracy or MSE
 recordIO-protobuf or CSV training
- First column is label
 File or pipe mode on either

Instance Types
 Training on CPU or GPU
- Ml.m5.2xlarge
- Ml.p2.xlarge
 Inference
- CPU for lower latency
- GPU for higher throughput on large batches

3.2.3 K-Means
It is unsupervised machine learning algorithm used for clustering.
Consider the unlabeled dataset represented in the next figure: you can clearly see 5 blobs of
instances. The K-Means algorithm is a simple algorithm capable of clustering this kind of dataset
very quickly and efficiently, often in just a few iterations.


Figure 78: An unlabeled dataset composed of five blobs of instances

You have to specify the number of clusters k that the algorithm must find. In this example, it is
pretty obvious from looking at the data that k should be set to 5, but in general it is not that easy.
Each instance was assigned to one of the 5 clusters. In the context of clustering, an instance’s
label is the index of the cluster that this instance gets assigned to by the algorithm. The algorithm
decides which cluster to assign an instance to by computing the distance between the instance
and the center of each cluster, called a centroid.

Figure 79: K-Means decision boundaries

The vast majority of the instances were clearly assigned to the appropriate cluster, but a few
instances were probably mislabeled (especially near the boundary between the top left cluster
and the central cluster). Indeed, the K-Means algorithm does not behave very well when the
blobs have very different diameters since all it cares about when assigning an instance to a
cluster is the distance to the centroid.
Instead of assigning each instance to a single cluster, which is called hard clustering, it can be
useful to just give each instance a score per cluster: this is called soft clustering. For example, the
score can be the distance between the instance and the centroid, or conversely it can be a
similarity score (or affinity).


The K-Means Algorithm


So how does the algorithm work? Well it is really quite simple. Suppose you were given the
centroids: you could easily label all the instances in the dataset by assigning each of them to the
cluster whose centroid is closest. Conversely, if you were given all the instance labels, you could
easily locate all the centroids by computing the mean of the instances for each cluster. But you
are given neither the labels nor the centroids, so how can you proceed? Well, just start by placing
the centroids randomly (e.g., by picking k instances at random and using their locations as
centroids). Then label the instances, update the centroids, label the instances, update the
centroids, and so on until the centroids stop moving. The algorithm is guaranteed to converge in
a finite number of steps (usually quite small), it will not oscillate forever. You can see the
algorithm in action in the below figure: the centroids are initialized randomly (top left), then the
instances are labeled (top right), then the centroids are updated (center left), the instances are
relabeled (center right), and so on. As you can see, in just 3 iterations the algorithm has reached
a clustering that seems close to optimal.

Figure 80: The K-Means algorithm
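
A minimal scikit-learn sketch of the algorithm in action on a toy five-blob dataset:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # cluster index per instance
print(kmeans.cluster_centers_)        # the 5 learned centroids
print(kmeans.inertia_)                # sum of squared distances to centroids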

K-Means Initialization Problems


Unfortunately, although the algorithm is guaranteed to converge, it may not converge to the
right solution (i.e., it may converge to a local optimum): this depends on the centroid
initialization. For example, the figure below shows two sub-optimal solutions that the algorithm can
converge to if you are not lucky with the random initialization step:

Figure 81: Sub-optimal solutions due to unlucky centroid initializations

Solutions:
 Centroid Initialization Methods
If you happen to know approximately where the centroids should be (e.g., if you ran
another clustering algorithm earlier), then you can set the init hyperparameter to a
NumPy array containing the list of centroids, and set n_init to 1.

 Random Initialization with multiple times


Run the algorithm multiple times with different random initializations and keep the best
solution. This is controlled by the n_init hyperparameter: by default, it is equal to 10, which
means that the whole algorithm described earlier actually runs 10 times when you call fit()
and Scikit-Learn keeps the best solution.
The best solution is identified by the model’s inertia, which is the mean squared distance
between each instance and its closest centroid. The KMeans class runs the algorithm
n_init times and keeps the model with the lowest inertia.

 K-Means++
Smarter initialization step that tends to select centroids that are distant from one another,
and this makes the K-Means algorithm much less likely to converge to a suboptimal
solution. K-Means++ initialization algorithm:
- Take one centroid c(1), chosen uniformly at random from the dataset.
- Take a new centroid c(i), choosing an instance x(i) with probability
D(x(i))² / Σj=1..m D(x(j))², where D(x(i)) is the distance between the instance x(i) and
the closest centroid that was already chosen.
- This probability distribution ensures that instances further away from already
chosen centroids are much more likely to be selected as centroids.
- Repeat the previous step until all k centroids have been chosen.


NOTE: The K-Means class actually uses this initialization method by default. If you
want to force it to use the original method (i.e., picking k instances randomly to
define the initial centroids), then you can set the init hyperparameter to "random".

K-Means accelerated algorithm


It considerably accelerates the algorithm by avoiding many unnecessary distance calculations:
this is achieved by exploiting the triangle inequality (i.e., the straight line is always the shortest)
and by keeping track of lower and upper bounds for distances between instances and centroids.
This is the algorithm used by default by the K-Means class (but you can force it to use the original
algorithm by setting the algorithm hyperparameter to "full", although you probably will never
need to).

Mini Batches
Instead of using the full dataset at each iteration, the algorithm is capable of using mini-batches,
moving the centroids just slightly at each iteration. This speeds up the algorithm typically by a
factor of 3 or 4 and makes it possible to cluster huge datasets that do not fit in memory.
Although the Mini-batch K-Means algorithm is much faster than the regular K-Means algorithm,
its inertia is generally slightly worse, especially as the number of clusters increases.
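
A brief sketch of the mini-batch variant in scikit-learn (reusing X from the earlier K-Means sketch):

from sklearn.cluster import MiniBatchKMeans

# Mini-batch variant: faster on big data, slightly worse inertia.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=100, random_state=42)
mbk.fit(X)
print(mbk.inertia_)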

Optimal Number of Clusters


We have set the number of clusters k to 5 because it was obvious by looking at the data that this
is the correct number of clusters. But in general, it will not be so easy to know how to set k, and
the result might be quite bad if you set it to the wrong value.

Figure 82: Bad choices for the number of clusters

The inertia is not a good performance metric when trying to choose k since it keeps getting
lower as we increase k. Indeed, the more clusters there are, the closer each instance will be to its
closest centroid, and therefore the lower the inertia will be.


Figure 83: Selecting the number of clusters k using the “elbow rule”

As you can see, the inertia drops very quickly as we increase k up to 4, but then it decreases
much more slowly as we keep increasing k. This curve has roughly the shape of an arm, and there
is an “elbow” at k=4 so if we did not know better, it would be a good choice: any lower value
would be dramatic, while any higher value would not help much, and we might just be splitting
perfectly good clusters in half for no good reason.
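
A sketch of the elbow rule: compute the inertia for a range of k values and look for the bend in the curve (X from the K-Means sketch above):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = [KMeans(n_clusters=k, n_init=10, random_state=42)
            .fit(X).inertia_ for k in range(1, 10)]
plt.plot(range(1, 10), inertias, "o-")
plt.xlabel("k"); plt.ylabel("inertia")
plt.show()                 # look for the "elbow" in the curve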

Feature Reduction
Suppose you have 20 features and you set k=5, so the dataset is divided into 5 clusters.
Now measure the distance from each data point to these 5 centroids, drop the original
20-feature vector, and use the new 5-dimensional vector of distances to the clusters instead.
Silhouette score
A more precise approach (but also more computationally expensive) is to use the silhouette
score, which is the mean silhouette coefficient over all the instances. An instance’s silhouette
coefficient is equal to (b – a) / max (a, b) where a is the mean distance to the other instances in
the same cluster (it is the mean intra-cluster distance), and b is the mean nearest-cluster
distance, that is the mean distance to the instances of the next closest cluster (defined as the one
that minimizes b, excluding the instance’s own cluster). The silhouette coefficient can vary
between -1 and +1: a coefficient close to +1 means that the instance is well inside its own cluster
and far from other clusters, while a coefficient close to 0 means that it is close to a cluster
boundary, and finally a coefficient close to -1 means that the instance may have been assigned to
the wrong cluster.
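
A sketch computing the mean silhouette coefficient for several candidate values of k (closer to +1 is better; X from the K-Means sketch above):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))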


Figure 84: Selecting the number of clusters k using the silhouette score

As you can see, this visualization is much richer than the previous one: in particular, although it
confirms that k=4 is a very good choice, it also underlines the fact that k=5 is quite good as well,
and much better than k=6 or 7. This was not visible when comparing inertias.

Limits of K-Means
Despite its many merits, most notably being fast and scalable, K-Means is not perfect.
 It is necessary to run the algorithm several times to avoid sub-optimal solutions,
 You need to specify the number of clusters, which can be quite a hassle.
 K-Means does not behave very well when the clusters have varying sizes, different
densities, or non-spherical shapes.

NOTE: It is important to scale the input features before you run K-Means, or else
the clusters may be much stretched, and K-Means will perform poorly. Scaling the
features does not guarantee that all the clusters will be nice and spherical, but it
generally improves things.

Hyperparameters
Parameter Description
feature_dim The number of features in the input data.

Required
Valid values: positive integer.
k The number of required clusters.

Required
Valid values: Positive integer

epochs The number of passes done over the training data.


Optional
Valid values: Positive integer
Default value: 1
eval_metrics A JSON list of metric types used to report a score for the model.
Allowed values are msd for Mean Squared Distance and ssd for Sum of
Squared Distances. If test data is provided, the score is reported for
each of the metrics requested.

Optional
Valid values: Either [\"msd\"] or [\"ssd\"] or [\"msd\",\"ssd\"] .
Default value: [\"msd\"]
extra_center_factor The algorithm creates K centers
= num_clusters * extra_center_factor as it runs and reduces the
number of centers from K to k when finalizing the model.

Optional
Valid values: Either a positive integer or auto.
Default value: auto

init_method Method by which the algorithm chooses the initial cluster centers.
The standard k-means approach chooses them at random. An
alternative k-means++ method chooses the first cluster center at
random. Then it spreads out the position of the remaining initial
clusters by weighting the selection of centers with a probability
distribution that is proportional to the square of the distance of the
remaining data points from existing centers.

Optional
Valid values: Either random or kmeans++.
Default value: random
mini_batch_size The number of observations per mini-batch for the data iterator.

Optional
Valid values: Positive integer
Default value: 5000

Input Formats
 Train channel, optional test


- Train ShardedByS3Key, test FullyReplicated


 recordIO-protobuf or CSV
 File or Pipe on either

Instance Types
CPU or GPU, but CPU recommended
 Only one GPU per instance used on GPU
 So use p*.xlarge if you’re going to use GPU

3.2.4 Principal Component Analysis (PCA)


This is unsupervised machine learning algorithm used for dimensionality reduction.
This algorithm uses projection to reduce the number of features.

Projection

Figure 85: A 3D dataset lying close to a 2D subspace

Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of
the high-dimensional (3D) space. Now if we project every training instance perpendicularly onto
this subspace (as represented by the short lines connecting the instances to the plane), we get
the new 2D dataset shown in the below figure. We have just reduced the dataset’s
dimensionality from 3D to 2D. Note that the axes correspond to new features z1 and z2 (the
coordinates of the projections on the plane).


Figure 86: The new 2D dataset after projection

However, projection is not always the best approach to dimensionality reduction. In many cases
the subspace may twist and turn, such as in the famous Swiss roll toy dataset represented in the
below figure.

Figure 87: Swiss roll dataset

Simply projecting onto a plane (e.g., by dropping x3 ) would squash different layers of the Swiss
roll together, as shown on the left of Figure 88. However, what you really want is to unroll the
Swiss roll to obtain the 2D dataset on the right of Figure 88.


Figure 88: Squashing by projecting onto a plane (left) versus unrolling the Swiss roll

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction
algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the
data onto it, just like in Figure 85.

Preserving Variance
Before you can project the training set onto a lower-dimensional hyperplane, you first need to
choose the right hyperplane. For example, a simple 2D dataset is represented on the left of
below figure, along with three different axes (i.e., one-dimensional hyperplanes). On the right is
the result of the projection of the dataset onto each of these axes. As you can see, the projection
onto the solid line preserves the maximum variance, while the projection onto the dotted line
preserves very little variance, and the projection onto the dashed line preserves an intermediate
amount of variance.

Figure 89: Selecting the subspace onto which to project

It seems reasonable to select the axis that preserves the maximum amount of variance, as it will
most likely lose less information than the other projections. Another way to justify this choice is
that it is the axis that minimizes the mean squared distance between the original dataset and its
projection onto that axis. This is the rather simple idea behind PCA.

Principal Components

PCA identifies the axis that accounts for the largest amount of variance in the training set. In the
above figure, it is the solid line. It also finds a second axis, orthogonal to the first one that
accounts for the largest amount of remaining variance. In this 2D example there is no choice: it is
the dotted line. If it were a higher-dimensional dataset, PCA would also find a third axis,
orthogonal to both previous axes, and a fourth, a fifth, and so on—as many axes as the number
of dimensions in the dataset.
The unit vector that defines the ith axis is called the ith principal component (PC). In Figure 89, the
1st PC is c1 and the 2nd PC is c2. In Figure 85 the first two PCs are represented by the orthogonal
arrows in the plane, and the third PC would be orthogonal to the plane (pointing up or down).
So how can you find the principal components of a training set? Luckily, there is a standard
matrix factorization technique called Singular Value Decomposition (SVD).

Projecting to d Dimensions
Once you have identified all the principal components, you can reduce the dimensionality of the
dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal
components. Selecting this hyperplane ensures that the projection will preserve as much
variance as possible. For example, in Figure 85 the 3D dataset is projected down to the 2D plane
defined by the first two principal components, preserving a large part of the dataset’s variance.
As a result, the 2D projection looks very much like the original 3D dataset. To project the training
set onto the hyperplane, you can simply compute the matrix multiplication of the training set
matrix X by the matrix Wd, defined as the matrix containing the first d principal components (i.e.,
the matrix composed of the first d columns of V): X(d-proj) = X · Wd.
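
A minimal NumPy sketch of this procedure (X is an assumed training-set matrix; note that PCA expects the data to be centered first):

import numpy as np

X_centered = X - X.mean(axis=0)       # center the data
U, s, Vt = np.linalg.svd(X_centered)  # SVD: principal components are the rows of Vt
W2 = Vt.T[:, :2]                      # Wd for d=2: the first two columns of V
X2D = X_centered @ W2                 # project the training set onto the first two PCs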

Choosing the Right Number of Dimensions


Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally
preferable to choose the number of dimensions that add up to a sufficiently large portion of the
variance (e.g., 95%). Unless, of course, you are reducing dimensionality for data visualization—in
that case you will generally want to reduce the dimensionality down to 2 or 3.
You could then set n_components=d and run PCA again. However, there is a much better option:
instead of specifying the number of principal components you want to preserve, you can set
n_components to be a float between 0.0 and 1.0, indicating the ratio of variance you wish to
preserve.
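
A hedged scikit-learn sketch of preserving a variance ratio instead of a fixed d (X is assumed training data):

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)      # keep as many PCs as needed to preserve 95% of the variance
X_reduced = pca.fit_transform(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())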


Figure 90: Explained variance as a function of the number of dimensions

Randomized PCA
If you set the svd_solver hyperparameter to "randomized", Scikit-Learn uses a stochastic
algorithm called Randomized PCA that quickly finds an approximation of the first d principal
components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³)
for the full SVD approach, so it is dramatically faster than full SVD when d is much smaller than n.
By default, svd_solver is actually set to "auto": Scikit-Learn automatically uses the randomized
PCA algorithm if m or n is greater than 500 and d is less than 80% of m or n, or else it uses the full
SVD approach. If you want to force Scikit-Learn to use full SVD, you can set the svd_solver
hyperparameter to "full".
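
An illustrative sketch (the target dimension 154 is an arbitrary example, not from the text):

from sklearn.decomposition import PCA

rnd_pca = PCA(n_components=154, svd_solver="randomized")  # force Randomized PCA
X_reduced = rnd_pca.fit_transform(X)                      # use svd_solver="full" to force full SVD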

Incremental PCA
One problem with the preceding implementations of PCA is that they require the whole training
set to fit in memory in order for the algorithm to run. Fortunately, Incremental PCA (IPCA)
algorithms have been developed: you can split the training set into mini-batches and feed an
IPCA algorithm one mini-batch at a time. This is useful for large training sets, and also to apply
PCA online (i.e., on the fly, as new instances arrive).
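
A minimal sketch, assuming a large training set X_train and scikit-learn's IncrementalPCA:

import numpy as np
from sklearn.decomposition import IncrementalPCA

inc_pca = IncrementalPCA(n_components=154)  # 154 is an illustrative target dimension
for batch in np.array_split(X_train, 100):  # feed 100 mini-batches one at a time
    inc_pca.partial_fit(batch)
X_reduced = inc_pca.transform(X_train)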

NOTE: T-distributed stochastic neighbor embedding (t-SNE) is Amazon's favorite
dimensionality reduction technique and frequently shows up in exam questions. However,
like PCA, it is not very interpretable: you won't be able to see the direct impact of the
relevant features on the model outcome.


NOTE: Recursive feature elimination is a feature selection/dimensionality reduction
technique. It helps solve overfitting, and its interpretation is easy: the direct
relationships between features and the output are preserved.

Hyperparameters
Parameter Description
feature_dim Input dimension.
Required
Valid values: positive integer

mini_batch_size Number of rows in a mini-batch.


Required
Valid values: positive integer
num_components The number of principal components to compute.
Required
Valid values: positive integer
algorithm_mode Mode for computing the principal components.
Optional
Valid values: regular or randomized
Default value: regular
 Regular: for datasets with sparse data and a moderate number
of observations and features.
 Randomized: for datasets with both a large number of
observations and features. This mode uses an approximation
algorithm.

NOTE: Different from original algorithm.


extra_components As the value increases, the solution becomes more accurate but
the runtime and memory consumption increase linearly. The
default, -1, means the maximum of 10 and num_components. Valid
for randomized mode only.
Optional
Valid values: Non-negative integer or -1
Default value: -1
subtract_mean Indicates whether the data should be unbiased both during training
and at inference.
Optional
Valid values: One of true or false


Default value: true

Input Formats
 recordIO-protobuf or CSV
 File or Pipe on either

Instance Types
GPU or CPU
 It depends on the specifics of the input data

NOTE: The PCA and K-Means algorithms are useful for processing data collected with
census forms.

3.2.5 XGBoost
This is a supervised machine learning algorithm used for regression and classification. It stands
for Extreme Gradient Boosting.
Boosted group of trees.
New trees made to correct the errors of previous trees.
Uses gradient descent to minimize loss as new trees are added.
The model is serialized and de-serialized with Pickle.
Can be used directly within your notebook (the open-source library), not only as a SageMaker algorithm.

Algorithm Steps
1. Make initial prediction, this prediction could be any value by default it is 0.5.
2. Calculate the residuals which is the difference between observed values and predicted
value.


Figure 91: Residuals difference between Observed and Predicted values

3. Build the XGBoost tree (common way): start the tree with a single leaf and assign all the
residuals to this leaf.
4. Calculate the Similarity Score for this leaf:
Similarity Score = (Sum of residuals)² / (Number of residuals + λ)
Such that λ is the L2 regularization parameter.
5. Now we need to decide whether to split this leaf into further branches. To answer this
question, take the two observations with the lowest values and use their average as a
threshold, for example Dosage < 15. The residuals of observations below the threshold go
to one leaf and the observations above 15 go to the other leaf; then calculate the
similarity score for both leaves.

Figure 92: Building XGBoost Tree

6. We need to quantify how much better the leaves cluster similar residuals than the root.
We do this by calculating the gain of splitting the residuals into two groups.

Gain = Left similarity + Right similarity – Root similarity


7. The gain for dosage < 15 is 120.33.
8. Now we will shift the threshold for the next two observations

Figure 93: Building XGBoost Tree: Shift threshold

9. Build a simple tree that divides the observations using the new threshold Dosage<22.5.
10. Calculate the gain for the new tree.

Figure 94: Building XGBoost: Build the second tree

11. The gain for dosage < 22.5 is 4.


12. Compare the two gains from the two different trees; the split with the greater gain is
better at separating the residuals into clusters of similar values. So Dosage < 15 is the
better split.
13. Repeat steps 8 to 12 for the remaining thresholds and calculate the gain.
14. The gain for Dosage < 30 will be 56.33.
15. So Dosage < 15 is still the best split, as its gain is greater than that of Dosage < 30.
16. Now that we have shifted the threshold across all the observations, we use the tree built
with Dosage < 15.
17. Repeat the previous steps for the second branch of the Dosage < 15 tree and calculate
the gain.


Figure 95: Build XGBoost tree: Build the second Branch

18. Shift the threshold for the second branch and calculate the gain.

Figure 96: Build XGBoost tree: Second Branch after shifting

19. The gain for Dosage < 30 in the second branch is greater than the gain for Dosage < 22.5,
so we use Dosage < 30 for the second branch.
20. We are done building this tree for the observations we have. In practice, the default is to
allow 6 levels.
21. We use a hyperparameter γ (Gamma) to decide whether to prune a branch: if (Gain − γ) is
negative, remove the branch; if it is positive, keep the branch.
22. Calculate the output value for each leaf:
Output = (Sum of residuals) / (Number of residuals + λ)


Figure 97: Build XGBoost Tree: Calculate Output

23. Calculate the new prediction value for each observation:

New predicted value = old predicted value + η × output
Such that η is the learning rate.
24. Calculate the new residual for each observation

Figure 98: Build XGBoost Tree: Calculate new predicted value

25. The new residuals are much smaller than the ones calculated in step 2.
26. Now we can build another tree based on the new residuals, which gives even smaller residuals.
27. We keep building trees with smaller and smaller residuals until the residuals are very small or we
have reached the maximum number of trees.
Lambda (λ) is a regularization parameter used to decrease the similarity scores; its effect is inversely
proportional to the number of residuals in the node. It also decreases the Gain values. When λ > 0, it
is easier to prune the trees, as the Gain values are much smaller.
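
The similarity and gain arithmetic above is easy to reproduce; here is an illustrative Python sketch only (λ = 0 and the residual values are hypothetical, chosen to mirror the walkthrough):

def similarity(residuals, lam=0.0):
    # Similarity Score = (sum of residuals)^2 / (number of residuals + lambda)
    return sum(residuals) ** 2 / (len(residuals) + lam)

root = [-10.5, 6.5, 7.5, -7.5]           # residuals at the root (hypothetical values)
left, right = [-10.5], [6.5, 7.5, -7.5]  # split at Dosage < 15

gain = similarity(left) + similarity(right) - similarity(root)
print(round(gain, 2))  # 120.33, matching step 7 above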

NOTE: When the Gain values are large, it is hard to prune the tree, because only a large
Gamma value would remove the branches (or even the root).

Figure 99: XGBoost Summary

Hyperparameters
Parameter Description
num_class The number of classes.
Required if objective is set to multi:softmax or multi:softprob.
Valid values: integer
num_round The number of rounds to run the training.
Required
Valid values: integer
alpha L1 regularization term on weights. Increasing this value makes models
more conservative.
Optional
Valid values: float
Default value: 0
base_score The initial prediction score of all instances, global bias.
Optional
Valid values: float
Default value: 0.5
booster Which booster to use. The gbtree and dart values use a tree-based
model, while gblinear uses a linear function.
Optional
Valid values: String. One of gbtree, gblinear, or dart.
Default value: gbtree
colsample_bylevel Subsample ratio of columns for each split, in each level.
Optional
Valid values: Float. Range: [0,1].
Default value: 1
colsample_bynode Subsample ratio of columns from each node.

Optional
Valid values: Float. Range: (0,1].
Default value: 1
colsample_bytree Subsample ratio of columns when constructing each tree.
Optional
Valid values: Float. Range: [0,1].
Default value: 1
deterministic_histogram When this flag is enabled, XGBoost builds histogram on GPU
deterministically. Used only if tree_method is set to gpu_hist.
Optional
Valid values: String. Range: true or false
Default value: true
early_stopping_rounds The model trains until the validation score stops improving. Validation
error needs to decrease at least every early_stopping_rounds to
continue training. SageMaker hosting uses the best model for
inference.
Optional
Valid values: integer
Default value: -
Eta (learning rate) Step size shrinkage used in updates to prevent overfitting. After each
boosting step, you can directly get the weights of new features.
The eta parameter actually shrinks the feature weights to make the
boosting process more conservative.
Optional
Valid values: Float. Range: [0,1].
Default value: 0.3
eval_metric Evaluation metrics for validation data. A default metric is assigned
according to the objective:
 rmse: for regression
 error: for classification
 map: for ranking
gamma Minimum loss reduction required to make a further partition on a leaf
node of the tree. The larger, the more conservative the algorithm is.
Optional
Valid values: Float. Range: [0,∞).
Default value: 0
grow_policy Controls the way that new nodes are added to the tree. Currently
supported only if tree_method is set to hist.
Optional


Valid values: String. Either depthwise or lossguide.


Default value: depthwise
lambda L2 regularization term on weights. Increasing this value makes models
more conservative.
Optional
Valid values: float
Default value: 1
lambda_bias L2 regularization term on bias.
Optional
Valid values: Float. Range: [0.0, 1.0].
Default value: 0
max_depth Maximum depth of a tree. Increasing this value makes the model more
complex and more likely to overfit. 0 indicates no limit. A limit is required
when grow_policy=depthwise.
Optional
Valid values: Integer. Range: [0,∞)
Default value: 6
max_leaves Maximum number of nodes to be added. Relevant only
if grow_policy is set to lossguide.
Optional
Valid values: integer
Default value: 0
nthread Number of parallel threads used to run xgboost.
Optional
Valid values: integer
Default value: Maximum number of threads.
objective Specifies the learning task and the corresponding learning objective.
Examples: reg:logistic, multi:softmax, reg:squarederror
tree_method The tree construction algorithm used in XGBoost.
Optional
Valid values: One of auto, exact, approx, hist, or gpu_hist.
Default value: auto
subsample Subsample ratio of the training instance. Setting it to 0.5 means that
XGBoost randomly collects half of the data instances to grow trees.
This prevents overfitting.
Optional
Valid values: Float. Range: [0,1].
Default value: 1


csv_weights When this flag is enabled, XGBoost differentiates the importance of


instances for csv input by taking the second column (the column after
labels) in training data as the instance weights.
Optional
Valid values: 0 or 1
Default value: 0
interaction_constraints Specify groups of variables that are allowed to interact.
Optional
Valid values: Nested list of integers. Each integer represents a feature,
and each nested list contains features that are allowed to interact e.g.,
[[1,2], [3,4,5]].
Default value: None
max_bin Maximum number of discrete bins to bucket continuous features.
Used only if tree_method is set to hist.
Optional
Valid values: integer
Default value: 256
max_delta_step Maximum delta step allowed for each tree's weight estimation. When
a positive integer is used, it helps make the update more conservative.
The preferred option is to use it in logistic regression. Set it to 1-10 to
help control the update.
Optional
Valid values: Integer. Range: [0,∞).
Default value: 0
min_child_weight Minimum sum of instance weight (hessian) needed in a child. If the
tree partition step results in a leaf node with the sum of instance
weight less than min_child_weight, the building process gives up
further partitioning. In linear regression models, this simply
corresponds to a minimum number of instances needed in each node.
The larger the value, the more conservative the algorithm is.
Optional
Valid values: Float. Range: [0,∞).
Default value: 1
monotone_constraints Specifies monotonicity constraints on any feature.
Optional
Valid values: Tuple of Integers. Valid integers: -1 (decreasing
constraint), 0 (no constraint), 1 (increasing constraint).


E.g., (0, 1): No constraint on first predictor, and an increasing


constraint on the second. (-1, 1): Decreasing constraint on first
predictor, and an increasing constraint on the second.
Default value: (0, 0)
normalize_type Type of normalization algorithm.
Optional
Valid values: Either tree or forest.
Default value: tree
one_drop When this flag is enabled, at least one tree is always dropped during
the dropout.
Optional
Valid values: 0 or 1
Default value: 0
process_type The type of boosting process to run.
Optional
Valid values: String. Either default or update.
Default value: default
rate_drop The dropout rate that specifies the fraction of previous trees to drop
during the dropout.
Optional
Valid values: Float. Range: [0.0, 1.0].
Default value: 0.0
refresh_leaf This is a parameter of the 'refresh' updater plug-in. When set
to true (1), tree leaves and tree node stats are updated. When set
to false(0), only tree node stats are updated.
Optional
Valid values: 0/1
Default value: 1
sample_type Type of sampling algorithm.
Optional
Valid values: Either uniform or weighted.
Default value: uniform
scale_pos_weight Controls the balance of positive and negative weights. It's useful for
unbalanced classes. A typical value to consider: sum(negative
cases) / sum(positive cases).
Optional
Valid values: float
Default value: 1
seed Random number seed.


Optional
Valid values: integer
Default value: 0
single_precision_histogram When this flag is enabled, XGBoost uses single precision to build
histograms instead of double precision. Used only if tree_method is set
to hist or gpu_hist.
Optional
Valid values: String. Range: true or false
Default value: false
sketch_eps Used only for the approximate greedy algorithm. This translates into O(1
/ sketch_eps) bins. Compared to directly selecting the number of
bins, this comes with a theoretical guarantee of sketch accuracy.
Optional
Valid values: Float, Range: [0, 1].
Default value: 0.03
skip_drop Probability of skipping the dropout procedure during a boosting
iteration.
Optional
Valid values: Float. Range: [0.0, 1.0].
Default value: 0.0
tree_method The tree construction algorithm used in XGBoost.
Optional
Valid values: One of auto, exact, approx, hist, or gpu_hist.
Default value: auto
tweedie_variance_power Parameter that controls the variance of the Tweedie distribution.
Optional
Valid values: Float. Range: (1, 2).
Default value: 1.5
updater A comma-separated string that defines the sequence of tree updaters
to run. This provides a modular way to construct and to modify the
trees.
Optional
Valid values: comma-separated string.
Default value: grow_colmaker, prune
verbosity Verbosity of printing messages.
Valid values: 0 (silent), 1 (warning), 2 (info), 3 (debug).
Optional
Default value: 1


Important Hyperparameters
 Subsample
- Prevents overfitting
 Eta
- Step size shrinkage, prevents overfitting
 Gamma
- Minimum loss reduction to create a partition; larger = more conservative
 Alpha
- L1 regularization term; larger = more conservative
 Lambda
- L2 regularization term; larger = more conservative
 eval_metric
- Optimize on AUC, error, rmse…etc.
 For example, if you care about false positives more than accuracy, you might
use AUC here
 scale_pos_weight
- Adjusts balance of positive and negative weights
- Helpful for unbalanced classes
- Might set to sum(negative cases) / sum(positive cases)
 max_depth
- Max depth of the tree
- Too high and you may overfit

Input Formats
 So, it takes CSV or libsvm input.
 It also accepts recordIO-protobuf and Parquet as well

Instance Types
 Uses CPU’s only for multiple instance training
 Is memory-bound, not compute bound
- So, M5 is a good choice
 As of XGBoost 1.2, single-instance GPU training is available
- For example P3
 Must set tree_method hyperparameter to gpu_hist
 Trains more quickly and can be more cost effective
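
To tie these pieces together, here is a hedged sketch of launching the built-in XGBoost algorithm with a few of the hyperparameters above (the bucket name, role ARN, and S3 paths are placeholders, not from the text):

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
container = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1")

xgb = Estimator(container,
                role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
                instance_count=1,
                instance_type="ml.m5.xlarge",             # memory-bound, so M5 is a good choice
                output_path="s3://my-bucket/xgb-output",  # placeholder
                sagemaker_session=session)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100, eta=0.2,
                        max_depth=5, subsample=0.8, eval_metric="auc")
# CSV channels: label must be the first column, with no header row
xgb.fit({"train": TrainingInput("s3://my-bucket/train/", content_type="text/csv"),
         "validation": TrainingInput("s3://my-bucket/validation/", content_type="text/csv")})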


3.2.6 IP Insights
Amazon SageMaker IP Insights is an unsupervised learning algorithm that learns the usage
patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and
various entities, such as user IDs or account numbers. You can use it to identify a user attempting
to log into a web service from an anomalous IP address, for example. Or you can use it to identify
an account that is attempting to create computing resources from an unusual IP address. Trained
IP Insight models can be hosted at an endpoint for making real-time predictions or used for
processing batch transforms.
SageMaker IP insights ingests historical data as (entity, IPv4 Address) pairs and learns the IP usage
patterns of each entity. When queried with an (entity, IPv4 Address) event, a SageMaker IP
Insights model returns a score that infers how anomalous the pattern of the event is. For
example, when a user attempts to log in from an IP address, if the IP Insights score is high
enough, a web login server might decide to trigger a multi-factor authentication system. In more
advanced solutions, you can feed the IP Insights score into another machine learning model. For
example, you can combine the IP Insight score with other features to rank the findings of another
security system, such as those from Amazon GuardDuty.

Training and Validation


The SageMaker IP Insights algorithm supports training and validation data channels. It uses the
optional validation channel to compute an area-under-curve (AUC) score on a predefined
negative sampling strategy. The AUC metric validates how well the model discriminates between
positive and negative samples. Training and validation data content types need to be
in text/csv format. The first column of the CSV data is an opaque string that provides a unique
identifier for the entity. The second column is an IPv4 address in decimal-dot notation. IP Insights
currently supports only File mode.

How is it used?
 Uses a neural network to learn latent vector representations of entities and IP addresses.
 Entities are hashed and embedded, so you need a sufficiently large hash size. Entity names
are strings, so they must be hashed before being fed into the algorithm; IP Insights does this
for us. The hash size should be large enough to ensure that the number of collisions, which
occur when distinct entities are mapped to the same latent vector, remains insignificant.
 Automatically generates negative samples during training by randomly pairing entities and
IPs. This overcomes the class-imbalance problem: anomalous events are naturally much rarer
than legitimate ones (just as fraudulent transactions are far rarer than good ones), so the
algorithm generates its own negative examples and assigns them a negative label (i.e., should
not have access).
 During training, IP Insights automatically generates negative samples by randomly pairing
entities and IP addresses. These negative samples represent data that is less likely to occur
in reality. The model is trained to discriminate between positive samples that are observed
in the training data and these generated negative samples. More specifically, the model is
trained to minimize the cross entropy.

Hyperparameters
Parameter Name Description
num_entity_vectors The number of entity vector representations (entity
embedding vectors) to train. Each entity in the training set
is randomly assigned to one of these vectors using a hash
function. Because of hash collisions, it might be possible to
have multiple entities assigned to the same vector. This
would cause the same vector to represent multiple
entities. This generally has a negligible effect on model
performance, as long as the collision rate is not too severe.
To keep the collision rate low, set this value as high as
possible. However, the model size, and, therefore, the
memory requirement, for both training and inference,
scales linearly with this hyperparameter. We recommend
that you set this value to twice the number of unique
entity identifiers.
Required
Valid values: 1 ≤ positive integer ≤ 250,000,000
vector_dim The size of embedding vectors to represent entities and IP
addresses. The larger the value, the more information that
can be encoded using these representations. In practice,
model size scales linearly with this parameter and limits
how large the dimension can be. In addition, using vector
representations that are too large can cause the model to
overfit, especially for small training datasets. Overfitting
occurs when a model doesn't learn any pattern in the data
but effectively memorizes the training data and, therefore,

cannot generalize well and performs poorly during


inference. The recommended value is 128.
Required
Valid values: 4 ≤ positive integer ≤ 4096
epochs The number of passes over the training data. The optimal
value depends on your data size and learning rate. Typical
values range from 5 to 100.
Optional
Valid values: positive integer ≥ 1
Default value: 10
learning_rate The learning rate for the optimizer. IP Insights use a
gradient-descent-based Adam optimizer. The learning rate
effectively controls the step size to update model
parameters at each iteration. Too large a learning rate can
cause the model to diverge because the training is likely to
overshoot a minima. On the other hand, too small a
learning rate slows down convergence. Typical values
range from 1e-4 to 1e-1.
Optional
Valid values: 1e-6 ≤ float ≤ 10.0
Default value: 0.001
mini_batch_size The number of examples in each mini batch. The training
procedure processes data in mini batches. The optimal
value depends on the number of unique account
identifiers in the dataset. In general, the larger
the mini_batch_size, the faster the training and the greater
the number of possible shuffled-negative-sample
combinations. However, with a large mini_batch_size, the
training is more likely to converge to a poor local minimum
and perform relatively worse for inference.
Optional
Valid values: 1 ≤ positive integer ≤ 500000
Default value: 10,000
weight_decay The weight decay coefficient. This parameter adds an L2
regularization factor that is required to prevent the model
from overfitting the training data.
Optional
Valid values: 0.0 ≤ float ≤ 10.0
Default value: 0.00001


Input Formats
 User names and account IDs can be fed in directly; no need to pre-process
 Training channel, optional validation (computes AUC score)
 CSV only
- Entity
- IP
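
For example, a hypothetical training file would look like this (entity IDs and addresses are made up):

user_alice,192.0.2.10
user_alice,192.0.2.11
user_bob,198.51.100.7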

Instance Types
 CPU or GPU
- GPU recommended
- ml.p3.2xlarge or higher
- Can use multiple GPUs
- Size of CPU instance depends on vector_dim and num_entity_vectors

3.2.7 Factorization Machines


The Factorization Machines algorithm is a general-purpose supervised learning algorithm that
you can use for both classification and regression tasks. It is an extension of a linear model that is
designed to capture interactions between features within high dimensional sparse datasets
economically. For example, in a click prediction system, the Factorization Machines model can
capture click rate patterns observed when ads from a certain ad-category are placed on pages
from a certain page-category. Factorization machines are a good choice for tasks dealing with
high dimensional sparse datasets, such as click prediction and item recommendation.
The prediction task for a Factorization Machines model is to estimate a function ŷ from a feature
set xi to a target domain. This domain is real-valued for regression and binary for classification.
The Factorization Machines model is supervised and so has a training dataset (xi, yi) available. The
advantage this model presents lies in the way it uses a factorized parameterization to capture the
pairwise feature interactions. It can be represented mathematically as follows:

ŷ(x) = w0 + Σi wi xi + Σi Σj>i ⟨vi, vj⟩ xi xj

The three terms in this equation correspond respectively to the three components of the model:

 The w0 term represents the global bias.


 The wi linear terms model the strength of the ith variable.


 The <vi,vj> factorization terms model the pairwise interaction between the ith and
jth variable.

The global bias and linear terms are the same as in a linear model. The pairwise feature
interactions are modeled in the third term as the inner product of the corresponding factors
learned for each feature. Learned factors can also be considered as embedding vectors for each
feature. For example, in a classification task, if a pair of features tends to co-occur more often in
positive labeled samples, then the inner product of their factors would be large. In other words,
their embedding vectors would be close to each other in cosine similarity.
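
A minimal NumPy sketch of the prediction equation above, assuming the parameters w0, w, and V (shape: n_features × num_factors) have already been learned:

import numpy as np

def fm_predict(x, w0, w, V):
    linear = w0 + w @ x  # global bias + linear terms
    # Pairwise term via the standard identity:
    # sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * (||V^T x||^2 - sum_i ||v_i||^2 x_i^2)
    pairwise = 0.5 * (np.sum((V.T @ x) ** 2) - np.sum((V ** 2).T @ (x ** 2)))
    return linear + pairwise

# e.g., with 3 features and 2 factors (made-up numbers):
# fm_predict(np.array([1.0, 0.0, 2.0]), 0.1, np.array([0.2, -0.3, 0.5]), np.ones((3, 2)))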

Hyperparameters
Parameter Name Description
num_factors The dimensionality of factorization.
Required
Valid values: Positive integer. Suggested value range: [2,1000], 64
typically generates good outcomes and is a good starting point.
predictor_type The type of predictor.
 binary_classifier: For binary classification tasks.
 regressor: For regression tasks.
Required
Valid values: String: binary_classifier or regressor
bias_init_method The initialization method for the bias term:
 normal: Initializes weights with random values sampled from a
normal distribution with a mean of zero and standard deviation
specified by bias_init_sigma.
 uniform: Initializes weights with random values uniformly
sampled from a range specified by [-bias_init_scale,
+bias_init_scale].
 constant: Initializes the weights to a scalar value specified
by bias_init_value.

Optional
Valid values: uniform, normal, or constant
Default value: normal
bias_init_scale Range for initialization of the bias term. Takes effect
if bias_init_method is set to uniform.
Optional

Valid values: Non-negative float. Suggested value range: [1e-8,


512].
Default value: None
bias_init_sigma The standard deviation for initialization of the bias term. Takes
effect if bias_init_method is set to normal.
Optional
Valid values: Non-negative float. Suggested value range: [1e-8,
512].
Default value: 0.01
bias_init_value The initial value of the bias term. Takes effect
if bias_init_method is set to constant.
Optional
Valid values: Float. Suggested value range: [1e-8, 512].
Default value: None
bias_lr The learning rate for the bias term.
Optional
Valid values: Non-negative float. Suggested value range: [1e-8,
512].
Default value: 0.1
bias_wd The weight decay for the bias term.
Optional
Valid values: Non-negative float. Suggested value range: [1e-8,
512].
Default value: 0.01
epochs The number of training epochs to run.
Optional
Valid values: Positive integer
Default value: 1
eps Epsilon parameter to avoid division by 0.
Optional
Valid values: Float. Suggested value: small.
Default value: None

All the bias parameters also exist for the factors and linear terms: replace "bias" with "factors" or
"linear", e.g., factors_init_method and linear_init_method.

NOTE: Bias is the linear bias term, Linear is the linear weight and Factor is the
factors weight.


Input Formats
 recordIO-protobuf with Float32
 Sparse data means CSV isn’t practical

Instance Types
 CPU or GPU
- CPU recommended
- GPU only works with dense data

3.2.8 Object Detection


 Identify all objects in an image with bounding boxes.
 Detects and classifies objects with a single deep neural network.
 Classes are accompanied by confidence scores.
 Can train from scratch, or use pre-trained models based on ImageNet.
 Amazon SageMaker Object Detection uses the Single Shot multibox Detector
(SSD) algorithm that takes a convolutional neural network (CNN) pre-trained for
classification task as the base network. SSD uses the output of intermediate layers as
features for detection.
 Various CNNs such as VGG and ResNet (Residual Network) have achieved great
performance on the image classification task. Object detection in Amazon SageMaker
supports both VGG-16 and ResNet-50 as a base network for SSD. The algorithm can be
trained in full training mode or in transfer learning mode.
 In full training mode, the base network is initialized with random weights and then trained
on user data.
 In transfer learning mode, the base network weights are loaded from pre-trained models.
 Uses flip, rescale, and jitter internally to avoid overfitting.

Hyperparameters
Parameter Name Description
num_classes The number of output classes. This parameter defines the
dimensions of the network output and is typically set to the
number of classes in the dataset.
Required
Valid values: positive integer


num_training_samples The number of training examples in the input dataset.


Note
If there is a mismatch between this value and the number of
samples in the training set, then the behavior of
the lr_scheduler_step parameter will be undefined and
distributed training accuracy may be affected.
Required
Valid values: positive integer
base_network The base network architecture to use.
Optional
Valid values: 'vgg-16' or 'resnet-50'
Default value: 'vgg-16'
early_stopping True to use early stopping logic during training. False not to use
it.
Optional
Valid values: True or False
Default value: False
image_shape The image size for input images. We rescale the input image to
a square image with this size. We recommend using 300 and
512 for better performance.
Optional
Valid values: positive integer ≥300
Default: 300
epochs The number of training epochs.
Optional
Valid values: positive integer
Default: 30
freeze_layer_pattern The regular expression (regex) for freezing layers in the base
network. For example, if we
set freeze_layer_pattern = "^(conv1_|conv2_).*", then any
layers with a name that contains "conv1_" or "conv2_" are
frozen, which means that the weights for these layers are not
updated during training.
learning_rate The initial learning rate.
Optional
Valid values: float in (0, 1]
Default: 0.001

mini_batch_size The batch size for training. In a single-machine multi-GPU
setting, each GPU handles mini_batch_size/num_gpu training
samples. For multi-machine training in dist_sync mode, the
actual batch size is mini_batch_size × the number of machines.
momentum The momentum for sgd. Ignored for other optimizers.
Optional
Valid values: float in (0, 1]
Default: 0.9
optimizer The optimizer types.
Optional
Valid values: ['sgd', 'adam', 'rmsprop', 'adadelta']
Default: 'sgd'
use_pretrained_model Indicates whether to use a pre-trained model for training. If set
to 1, then the pre-trained model with corresponding
architecture is loaded and used for training. Otherwise, the
network is trained from scratch.
Optional
Valid values: 0 or 1
Default: 1
weight_decay The weight decay coefficient for sgd and rmsprop. Ignored for
other optimizers.
Optional
Valid values: float in (0, 1)
Default: 0.0005

Input Formats
 RecordIO or image format (JPG or PNG)
 With image format, supply a JSON file for annotation data for each image

Instance Types
 Use GPU instances for training (multi-GPU and multi-machine)
- ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge,
ml.p3.16xlarge
 Use CPU or GPU for inference
- C5, M5, P2, P3

3.2.9 Image Classification


 Assign one or more labels to an image
 Doesn’t tell you where objects are, just what objects are in the image


 It uses a convolutional neural network (ResNet)


 Full training mode
- Network initialized with random weights
 Transfer learning mode
- Initialized with pre-trained weights
- The top fully-connected layer is initialized with random weights
- Network is fine-tuned with new training data
 Default image size is 3-channel 224x224 (ImageNet’s dataset)

Hyperparameters
Parameter Name Description
num_classes Number of output classes. This parameter defines the
dimensions of the network output and is typically set to the
number of classes in the dataset.
Besides multi-class classification, multi-label classification is
supported too.
Required
Valid values: positive integer
augmentation_type Data augmentation type. The input images can be augmented
in multiple ways as specified below.
 crop: Randomly crop the image and flip the image horizontally
 crop_color: In addition to ‘crop’, three random values in the
range [-36, 36], [-50, 50], and [-50, 50] are added to the
corresponding Hue-Saturation-Lightness channels respectively
 crop_color_transform: In addition to crop_color, random
transformations, including rotation, shear, and aspect ratio
variations are applied to the image. The maximum angle of
rotation is 10 degrees, the maximum shear ratio is 0.1, and the
maximum aspect changing ratio is 0.25.
Optional
Valid values: crop, crop_color, or crop_color_transform.
Default value: no default value
beta_1 The beta1 for adam, that is the exponential decay rate for the
first moment estimates.
Optional
Valid values: float. Range in [0, 1].
Default value: 0.9


beta_2 The beta2 for adam, that is the exponential decay rate for the
second moment estimates.
Optional
Valid values: float. Range in [0, 1].
Default value: 0.999
epochs Number of training epochs.
Optional
Valid values: positive integer
Default value: 30
gamma The gamma for rmsprop, the decay factor for the moving
average of the squared gradient.
Optional
Valid values: float. Range in [0, 1].
Default value: 0.9
image_shape The input image dimensions, which is the same size as the
input layer of the network. The format is defined as
'num_channels, height, width'.
learning_rate Initial learning rate.
Optional
Valid values: float. Range in [0, 1].
Default value: 0.1
momentum The momentum for sgd and nag, ignored for other optimizers.
Optional
Valid values: float. Range in [0, 1].
Default value: 0.9
multi_label Flag to use for multi-label classification where each sample can
be assigned multiple labels. Average accuracy across all classes
is logged.
Optional
Valid values: 0 or 1
Default value: 0
optimizer The optimizer type. For more details of the parameters for the
optimizers, please refer to MXNet's API.
Optional
Valid values: One of sgd, adam, rmsprop, or nag.
use_pretrained_model Flag to use pre-trained model for training. If set to 1, then the
pretrained model with the corresponding number of layers is
loaded and used for training. Only the top FC layer is
reinitialized with random weights. Otherwise, the network is
trained from scratch.
Optional
Valid values: 0 or 1
Default value: 0
weight_decay The coefficient weight decay for sgd and nag, ignored for other
optimizers.
Optional
Valid values: float. Range in [0, 1].
Default value: 0.0001

Input Formats
 Apache MXNet RecordIO
- Not protobuf
 Supports both RecordIO (application/x-recordio) and image (image/png, image/jpeg,
and application/x-image) content types for training in file mode.
 Image format requires .lst files to associate image index, class label, and path to the
image (see the sample after this list)
 Supports the RecordIO (application/x-recordio) content type for training in pipe mode.
 Augmented Manifest Image Format enables Pipe mode
 The algorithm supports image/png, image/jpeg, and application/x-image for inference.
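
A hypothetical .lst snippet (tab-separated columns: image index, class label, relative path; the file names are made up):

5       1       train_images/dog_001.jpg
1000    0       train_images/cat_001.jpg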

Instance Types
 GPU instances for training (P2, P3) either multi-GPU or multi-machine.
 CPU or GPU for inference (C4, P2, P3)

3.2.10 Semantic Segmentation


 Pixel-level object classification.
 Different from image classification – that assigns labels to whole images.
 Different from object detection – that assigns labels to bounding boxes.
 Useful for self-driving vehicles, medical imaging diagnostics, robot sensing.
 Produces a segmentation mask.
 Built on MXNet Gluon and Gluon CV
 Choice of 3 algorithms:
- Fully-Convolutional Network (FCN)

- Pyramid Scene Parsing (PSP)


- DeepLabV3
 Choice of backbones:
- ResNet50
- ResNet101
- Both trained on ImageNet
 Incremental training, or training from scratch, is supported too

Hyperparameters
Parameter Name Description
backbone The backbone to use for the algorithm's encoder
component.
Optional
Valid values: resnet-50, resnet-101
Default value: resnet-50
use_pretrained_model Whether a pretrained model is to be used for the backbone.
Optional
Valid values: True, False
Default value: True
algorithm The algorithm to use for semantic segmentation.
Optional
Valid values:
 fcn: Fully-Convolutional Network (FCN) algorithm
 psp: Pyramid Scene Parsing (PSP) algorithm
 deeplab: DeepLab V3 algorithm

Default value: fcn

Input Formats
 JPG Images and PNG annotations
 For both training and validation
 Label maps to describe annotations
 Augmented manifest image format supported for Pipe mode.
 JPG images accepted for inference

Instance Types
 Only GPU supported for training (P2 or P3), on a single machine only
- Specifically ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge,
or ml.p3.16xlarge
 Inference on CPU (C5 or M5) or GPU (P2 or P3)

3.2.11 Blazing Text


BlazingText provides both supervised and unsupervised modes and is used for:
Text Classification
 Predict labels for a sentence NOT entire documents
 Useful in web searches, information retrieval
 Supervised, you should train the algorithm with data containing the label and the
sentence.

Word2Vec
 Creates a vector representation of words NOT sentences or documents
 Semantically similar words are represented by vectors close to each other
 This is called a word embedding
 It is useful for NLP, but is not an NLP algorithm in itself!
 Used in machine translation, sentiment analysis
 Remember it only works on individual words, not sentences or documents
Word2vec has multiple modes
- Cbow (Continuous Bag of Words): order of words doesn't matter
- Skip-gram: order of words matters
- Batch skip-gram
 Allows distributed computation over many CPU nodes

Hyperparameters
Word2Vec
Parameter Name Description
mode The Word2vec architecture used for training.
Required
Valid values: batch_skipgram, skipgram, or cbow
batch_size The size of each batch when mode is set
to batch_skipgram. Set to a number between 10 and 20.

Optional
Valid values: Positive integer
Default value: 11
buckets The number of hash buckets to use for subwords.
Optional
Valid values: positive integer
Default value: 2000000
evaluation Whether the trained model is evaluated using
the WordSimilarity-353 Test
learning_rate The step size used for parameter updates.
Optional
Valid values: Positive float
Default value: 0.05
negative_samples The number of negative samples for the negative sample
sharing strategy.
Optional
Valid values: Positive integer
Default value: 5
min_count Words that appear less than min_count times are
discarded.
Optional
Valid values: Non-negative integer
Default value: 5
vector_dim The dimension of the word vectors that the algorithm
learns.
Optional
Valid values: Positive integer
Default value: 100
window_size The size of the context window. The context window is
the number of words surrounding the target word used
for training.
Optional
Valid values: Positive integer
Default value: 5

Text Classification
Parameter Name Description
mode The training mode.
Required

Valid values: supervised


buckets The number of hash buckets to use for word n-grams.
Optional
Valid values: Positive integer
Default value: 2000000
early_stopping Whether to stop training if validation accuracy doesn't
improve after a patience number of epochs.
Optional
Valid values: (Boolean) True or False
Default value: False
learning_rate The step size used for parameter updates.
Optional
Valid values: Positive float
Default value: 0.05
patience The number of epochs to wait before applying early
stopping when no progress is made on the validation set.
Used only when early_stopping is True.
Optional
Valid values: Positive integer
Default value: 4
vector_dim The dimension of the embedding layer.
Optional
Valid values: Positive integer
Default value: 100
word_ngrams The number of word n-gram features to use.
Optional
Valid values: Positive integer
Default value: 2

Input Formats
For supervised mode:
 One sentence per line
 First “word” in the sentence is the string __label__ followed by the label
 Also, “augmented manifest text format”
 Text should be pre-processed
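
For illustration, a couple of hypothetical supervised-mode training lines (the labels and sentences are made up):

__label__positive  the instructor explains every concept clearly
__label__negative  the course content is outdated and hard to follow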
For word2vec mode:


 Just wants a text file with one training sentence per line.

Instance Types
 For Cbow and Skipgram, recommend a single ml.p3.2xlarge
- Any single CPU or single GPU instance will work
 For batch_skipgram,
- can use single or multiple CPU instances
 For text classification,
- C5 recommended if less than 2GB training data. For larger data sets, use a single
GPU instance (ml.p2.xlarge or ml.p3.2xlarge)

3.2.12 Seq2Seq
 Input is a sequence of tokens, output is a sequence of tokens
 Machine Translation
 Text summarization
 Speech to text
 Implemented mainly with RNNs and CNNs with attention
 Training for machine translation can take days, even on SageMaker
 Pre-trained models are available
 Public training datasets are available for specific translation tasks (ready-made translations)

Algorithm
Typically, a neural network for sequence-to-sequence modeling consists of a few layers,
including:
- Embedding layer. In this layer, the input matrix, which is input tokens encoded in a sparse
way (for example, one-hot encoded) are mapped to a dense feature layer. This is required
because a high-dimensional feature vector is more capable of encoding information
regarding a particular token (word for text corpora) than a simple one-hot-encoded vector.
It is also a standard practice to initialize this embedding layer with a pre-trained word
vector like FastText or Glove or to initialize it randomly and learn the parameters during
training.
- Encoder layer. After the input tokens are mapped into a high-dimensional feature space,
the sequence is passed through an encoder layer to compress all the information from the
input embedding layer (of the entire sequence) into a fixed-length feature vector.


Typically, an encoder is made of RNN-type networks like long short-term memory (LSTM)
or gated recurrent units (GRU)
- Decoder layer. The decoder layer takes this encoded feature vector and produces the
output sequence of tokens. This layer is also usually built with RNN architectures (LSTM
and GRU).
The whole model is trained jointly to maximize the probability of the target sequence given the
source sequence.
Attention mechanism. The disadvantage of an encoder-decoder framework is that model
performance decreases as the length of the source sequence increases, because of the
limit on how much information the fixed-length encoded feature vector can contain. To tackle this
problem, the algorithm uses an attention mechanism, in which the decoder tries to find the location in
the encoder sequence where the most important information is located and uses that
information, together with the previously decoded words, to predict the next token in the sequence.
To summarize:

Input → Embedding (dense feature layer) → Encoder (LSTM) → Decoder

Remember that sequence-to-sequence is many-to-many, which can be seen as many-to-one (the
encoder) followed by one-to-many (the decoder).

Hyperparameters
Parameter Name Description
batch_size Mini batch size for gradient descent.
Optional
Valid values: positive integer
Default value: 64
beam_size Length of the beam for beam search. Used during
training for computing bleu and used during inference.
Optional
Valid values: positive integer
Default value: 5
bleu_sample_size Number of instances to pick from validation dataset to
decode and compute bleu score during training. Set to -1
to use full validation set (if bleu is chosen
as optimized_metric).
Optional
Valid values: integer
Default value: 0


cnn_activation_type The cnn activation type to be used.


Optional
Valid values: String. One of glu, relu, softrelu, sigmoid,
or tanh.
Default value: glu
cnn_hidden_dropout Dropout probability for dropout between convolutional
layers.
Optional
Valid values: Float. Range in [0,1].
Default value: 0
decoder_type Decoder type.
Optional
Valid values: String. Either rnn or cnn.
Default value: rnn
embed_dropout_source Dropout probability for source side embeddings.
Optional
Valid values: Float. Range in [0,1].
Default value: 0
encoder_type Encoder type. The rnn architecture is based on attention
mechanism by Bahdanau et al. and cnn architecture is
based on Gehring et al.
Optional
Valid values: String. Either rnn or cnn.
Default value: rnn
learning_rate Initial learning rate.
Optional
Valid values: float
Default value: 0.0003
num_layers_decoder Number of layers for Decoder rnn or cnn.
Optional
Valid values: positive integer
Default value: 1

num_layers_encoder Number of layers for Encoder rnn or cnn.


Optional
Valid values: positive integer
Default value: 1
optimized_metric Metrics to optimize with early stopping.
Optional


Valid values: String. One of perplexity, accuracy, or bleu.


Default value: perplexity
optimizer_type Optimizer to choose from.
Optional
Valid values: String. One of adam, sgd, or rmsprop.
Default value: adam
xavier_factor_type Xavier factor type.
Optional
Valid values: String. One of in, out, or avg.
Default value: in

Input Formats
 RecordIO-Protobuf
- Tokens must be integers (this is unusual, since most algorithms want floating point
data.)
- For example indices into vocabulary files
 Start with tokenized text files; you need to build a vocabulary file that maps every
word to a number.
- You should provide the vocabulary file and the tokenized text files
 Convert to protobuf using sample code
- Packs into integer tensors with vocabulary files
- A lot like the TF/IDF
 Must provide training data, validation data, and vocabulary files
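
A tiny hypothetical sketch of the tokenization step described above (the vocabulary and sentence are made up):

vocab = {"<s>": 0, "</s>": 1, "the": 2, "cat": 3, "sat": 4}  # from a vocabulary file
sentence = "the cat sat"
tokens = [vocab[w] for w in sentence.split()]  # [2, 3, 4] -- integers, not floats
# These integer sequences are then packed into recordIO-protobuf tensors.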

Instance Types
 Can only use GPU instance types (P3 for example)
 Can only use a single machine for training
 But can use multi-GPU’s on one machine

3.2.13 Object2Vec
This is an unsupervised algorithm.
It is general-purpose neural embedding algorithm. Object2Vec generalizes the well-known
Word2Vec embedding technique for words. It learns embeddings of more general-purpose
objects such as sentences, customers, and products.
It creates low-dimensional dense embeddings of high-dimensional objects, it represents how
objects are similar to each other.

It is basically word2vec, generalized to handle things other than words.


It is used for:
 Compute nearest neighbors of objects
 Visualize clusters
 Genre prediction
 Recommendations (similar items or users)

Algorithm
 Process data into JSON Lines and shuffle it
 Train with two input channels, two encoders, and a comparator
 Encoder choices:
- Average-pooled embeddings
- CNN’s
- Bidirectional LSTM
 The comparator is followed by a feed-forward neural network, which generates the ultimate label

Figure 100: Object2Vec algorithm

Hyperparameters
Parameter Name Description
enc0_network, enc1_network The network model for the enc0 encoder.
Optional
Valid values: hcnn, bilstm, or pooled_embedding
 hcnn: A hierarchical convolutional neural network.
 bilstm: A bidirectional long short-term memory network
(biLSTM), in which the signal propagates backward and


forward in time. This is an appropriate recurrent neural


network (RNN) architecture for sequential learning tasks.
 pooled_embedding: Averages the embeddings of all of
the tokens in the input.
dropout The dropout probability for network layers. Dropout is a
form of regularization used in neural networks that
reduces overfitting by trimming codependent neurons.
Optional
Valid values: 0.0 ≤ float ≤ 1.0
Default value: 0.0
output_layer The type of output layer where you specify that the task
is regression or classification.
Optional
Valid values: softmax or mean_squared_error
mlp_activation The type of activation function for the multilayer
perceptron (MLP) layer.
Optional
Valid values: tanh, relu, or linear
 tanh: Hyperbolic tangent
 relu: Rectified linear unit (ReLU)
 linear: Linear function
Default value: linear
optimizer The optimizer type.
Optional
Valid values: adadelta, adagrad, adam, sgd, or rmsprop
Default value: adam

Input Formats
 Data must be tokenized into integers
 Training data consists of pairs of tokens and/or sequences of tokens (see the sketch after
this list)
- Sentence – sentence
- Label – sequence (such as genre to description)
- Customer – customer
- Product – product
- User – item
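A minimal sketch of what such training records look like in JSON Lines format: each record pairs two tokenized inputs ("in0", "in1") with a relationship label. The token IDs and labels below are made-up illustrations.

import json

# Minimal sketch of Object2Vec training records in JSON Lines format.
records = [
    {"label": 1, "in0": [6, 17, 606], "in1": [16, 21, 13]},   # a related pair
    {"label": 0, "in0": [22, 1016], "in1": [32, 81, 144]},    # an unrelated pair
]
with open("train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")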

Instance Types
 Can only train on a single machine (CPU or GPU, multi-GPU OK)
- ml.m5.2xlarge
- ml.p2.xlarge
- If needed, go up to ml.m5.4xlarge or ml.m5.12xlarge
 Inference: use ml.p3.2xlarge
- Use the INFERENCE_PREFERRED_MODE environment variable to optimize for encoder
embeddings rather than classification or regression.

3.2.14 Neural Topic Model
 This is an unsupervised machine learning algorithm based on “Neural Variational
Inference”
 Organize documents into topics
 Classify or summarize documents based on topics
 It’s not just TF/IDF
- It actually groups terms together into higher-level concepts of what those terms might
represent.
- “bike”, “car”, “train”, “mileage”, and “speed” might classify a document as
“transportation” for example (although it wouldn’t know to call it that)
 You define how many topics you want
 These topics are a latent representation based on top ranking words

NOTE: Remember word representations in NLP; this time it is a document
representation, not a word representation. The topics are inferred from the
observed word distributions in the corpus. The words define the direction of the document.
 As this is latent representation:
- Used to find similar documents in the topic space
- Input to another supervised algorithm such as a document classifier

- Algorithms based in part on these representations are expected to perform better
than those based on lexical features alone.

 Because the method is unsupervised, only the number of topics, not the topics themselves,
is pre-specified. The generated topics are not given human-readable names.
 Lowering “mini_batch_size” and “learning_rate” can reduce validation loss at the expense
of training time

NOTE: Although you can use both the Amazon SageMaker NTM and LDA
algorithms for topic modeling, they are distinct algorithms and can be expected to
produce different results on the same input data.

Hyperparameters
Parameter Name Description
mini_batch_size The number of examples in each mini batch.

Optional

Valid values: Positive integer (min: 1, max: 10000)

Default value: 256

learning_rate The learning rate for the optimizer. Lowering it can reduce
validation loss at the expense of training time.

Optional

Valid values: Float (min: 1e-6, max: 1.0)

Default value: 0.001

num_topics The number of required topics.

Required

Valid values: Positive integer (min: 2, max: 1000)

batch_norm Whether to use batch normalization during training.

Optional

Valid values: true or false
Default value: false

clip_gradient The maximum magnitude for each gradient component.

Optional

Valid values: Float (min: 1e-3)

Default value: Infinity

Input Formats
 Four data channels
- “train” is required
- “validation”, “test”, and “auxiliary” optional
 recordIO-protobuf or CSV
 Words must be tokenized into integers
- Every document must contain a count for every word in the vocabulary in CSV
- The “auxiliary” channel is for the vocabulary
 File or pipe mode

Instance Types
- GPU or CPU
- GPU recommended for training
- CPU for inference

3.2.15 Latent Dirichlet Allocation (LDA)
 Latent Dirichlet Allocation is an unsupervised algorithm. The topics themselves are
unlabeled; they are just groupings of documents with a shared subset of words.
 Another topic modeling algorithm, but it is not a deep learning algorithm.
 Can be used for things other than words
- Cluster customers based on purchases
- Harmonic analysis in music
 Optional test channel can be used for scoring results
- Per-word log likelihood is the metric used for how good LDA works.
 Functionally similar to NTM, but CPU-based and therefore maybe cheaper/more efficient.
 Note: Linear Discriminant Analysis (also abbreviated LDA) could help reduce
dimensionality, but it also transforms the features, so you could not recognize the
transformed features.

Hyperparameters
Parameter Name Description
num_topics The number of topics for LDA to find within the data.
Required
Valid values: positive integer
alpha0 Initial guess for concentration parameter
Smaller values generate sparse topic mixtures
Larger values (>1.0) produce uniform mixtures
Optional
Valid values: Positive float
Default value: 1.0
max_iterations The maximum number of iterations to perform during the ALS phase of
the algorithm. Can be used to find better quality minima at the expense of
additional computation, but typically should not be adjusted.
Optional
Valid values: Positive integer
Default value: 1000
tol Target error tolerance for the ALS phase of the algorithm. Can be used to
find better quality minima at the expense of additional computation, but
typically should not be adjusted.
Optional
Valid values: Positive float
Default value: 1e-8
max_restarts The number of restarts to perform during the Alternating Least Squares
(ALS) spectral decomposition phase of the algorithm. Can be used to find
better quality local minima at the expense of additional computation, but
typically should not be adjusted.
Optional
Valid values: Positive integer
Default value: 10

Input Formats
 Train channel, optional test channel as this is unsupervised algorithm.
 RecordIO-protobuf or CSV
- We need to tokenize the data first. Every document has a count for every word in the
vocabulary, so we should pass a list of tokens (integers that represent each word) and
how often each word occurs in each individual document, not the documents
themselves.

 Each document has counts for every word in vocabulary (in CSV format)
 Pipe mode only supported with recordIO

Instance Types
 Single-instance CPU training

3.2.16 DeepAR
 Forecasting one-dimensional time series data for example future stock prices
 Uses RNN’s
 Classical forecasting methods, such as autoregressive integrated moving average (ARIMA)
or exponential smoothing (ETS), fit a single model to each individual time series.
 Allows you to train the same model over several related time series
- If you have many time series that are somehow interdependent, it can actually
learn from the relationships between those time series to create a better model
for predicting any individual time series.
- For example, you might have time series groupings for demand for different
products, server loads, and requests for webpages. For this type of application, you
can benefit from training a single model jointly over all of the time series.

 Finds frequencies and seasonality
 Always include the entire time series for training, testing, and inference
- Even though you might only be interested in a certain window of it, you want to give
the model all the data

 Use entire dataset as training set, remove last time points for testing. Evaluate on withheld
values.
 Don’t use very large values for prediction length (> 400 datapoints)
 Train on many time series and not just one when possible
 Each training example consists of a pair of adjacent context and prediction windows with
fixed predefined lengths. To control how far in the past the network can see, use
the context_length hyperparameter. To control how far in the future predictions can be
made, use the prediction_length hyperparameter.

Hyperparameters
Parameter Name Description
context_length The number of time-points that the model gets to see before making the
prediction. The value for this parameter should be about the same as
the prediction_length. The model also receives lagged inputs from the target,
so context_length can be much smaller than typical seasonalities. For
example, a daily time series can have yearly seasonality. The model
automatically includes a lag of one year, so the context length can be shorter
than a year. The lag values that the model picks depend on the frequency of
the time series. For example, lag values for daily frequency are previous week,
2 weeks, 3 weeks, 4 weeks, and year.
Required
Valid values: Positive integer
epochs The maximum number of passes over the training data. The optimal value
depends on your data size and learning rate. See
also early_stopping_patience. Typical values range from 10 to 1000.
Required
Valid values: Positive integer
mini_batch_size The size of mini-batches used during training. Typical values range from 32 to
512.
Optional
Valid values: positive integer
Default value: 128
learning_rate The learning rate used in training. Typical values range from 1e-4 to 1e-1.
Optional
Valid values: float
Default value: 1e-3
num_cells The number of cells to use in each hidden layer of the RNN. Typical values
range from 30 to 100.
Optional
Valid values: positive integer
num_layers The number of hidden layers in the RNN. Typical values range from 1 to 4.
Optional
Valid values: positive integer
Default value: 2
prediction_length The number of time-steps that the model is trained to predict, also called the
forecast horizon. The trained model always generates forecasts with this
length. It can't generate longer forecasts. The prediction_length is fixed when a model is
trained and cannot be changed later.
Required
Valid values: Positive integer
time_freq The granularity of the time series in the dataset. Use time_freq to select
appropriate date features and lags. The model supports the following basic
frequencies. It also supports multiples of these basic frequencies. For
example, 5min specifies a frequency of 5 minutes.
 M: monthly
 W: weekly
 D: daily
 H: hourly
 min: every minute
Required
Valid values: An integer followed by M, W, D, H, or min. For example, 5min.
cardinality When using the categorical features (cat), cardinality is an array specifying the
number of categories (groups) per categorical feature. Set this to auto to infer
the cardinality from the data. The auto mode also works when no categorical
features are used in the dataset. This is the recommended setting for the
parameter.
Set cardinality to ignore to force DeepAR to not use categorical features, even
if they are present in the data.
To perform additional data validation, it is possible to explicitly set this
parameter to the actual value. For example, if two categorical features are
provided where the first has 2 and the other has 3 possible values, set this to
[2, 3].
For more information on how to use categorical features, see the data section
on the main documentation page of DeepAR.
Optional
Valid values: auto, ignore, array of positive integers, empty string, or
Default value: auto
dropout_rate The dropout rate to use during training. The model uses zoneout
regularization. For each iteration, a random subset of hidden neurons are not
updated. Typical values are less than 0.2.
Optional
Valid values: float
Default value: 0.1
embedding_dimension Size of embedding vector learned per categorical feature (same value is used
for all categorical features).


Input Formats
 JSON lines format
- Gzip or Parquet
 Each record must contain:
- start: the starting timestamp
- target: the time series values
 Each record can contain:
- dynamic_feat: dynamic features (such as whether a promotion was applied to a
product in a time series of product purchases)
- cat: categorical features
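A minimal sketch of such records, written as JSON Lines from Python; the timestamps, values, and features are made-up illustrations.

import json

# Minimal sketch of DeepAR training records in JSON Lines format.
series = [
    {
        "start": "2019-01-01 00:00:00",       # the starting timestamp
        "target": [5.0, 5.5, 6.1, 7.2, 6.8],  # the time series values
        "cat": [0],                            # optional categorical features
        "dynamic_feat": [[0, 0, 1, 1, 0]],     # optional, e.g. promotion applied or not
    },
]
with open("train.json", "w") as f:
    for ts in series:
        f.write(json.dumps(ts) + "\n")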

Instance Types
 Can use CPU or GPU
 Single or multi machine while training
 Start with CPU (C4.2xlarge, C4.4xlarge)
 Move up to GPU if necessary
- Only helps with larger models
 May need larger instances for tuning when doing hyperparameter tuning job
 CPU-only for inference

3.2.17 Random Cut Forest


 Random Cut Forest (RCF) is an unsupervised machine learning algorithm that is used for
anomaly detection.
 This algorithm is developed by Amazon.
 Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or
unclassifiable data points.
 Assigns an anomaly score to each data point.
 RCF can work on streaming data and it is used in Kinesis Data Analytics.
 With each data point, RCF associates an anomaly score. Low score values indicate that the
data point is considered "normal." High values indicate the presence of an anomaly in the
data.
 SageMaker uses three standard deviations from the mean score as the threshold for
whether a data point is considered anomalous (high) or not (low); see the sketch after this list.


 RCF scales well with respect to number of features, data set size, and number of instances.
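A small sketch of the three-standard-deviation rule applied to a batch of anomaly scores; the scores here are made-up illustrations.

import numpy as np

# Made-up scores: mostly "normal" values around 1.0, plus one clear outlier.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 0.1, size=200), [6.0]])

cutoff = scores.mean() + 3 * scores.std()   # three standard deviations above the mean
anomalies = scores[scores > cutoff]
print(f"cutoff={cutoff:.2f}, anomalies={anomalies}")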

Algorithm
 The main idea behind the RCF algorithm is to create a forest of trees where each tree is
obtained using a partition of a sample of the training data.
 For example, a random sample of the input data is first determined.
 The random sample is then partitioned according to the number of trees in the forest.
 Each tree is given such a partition and organizes that subset of points into a k-d tree.
 During inference, the data point is added to the tree structure as if it had been used for
training. The anomaly score is calculated from the change in the tree structure caused by
the addition of this data point.
 If the data point is added as a deep leaf, the anomaly score will be low, but if it is added
as a branch near the root (at a small height or depth), the anomaly score will be high.
 That is why the score is described as “the expected change in complexity of the tree as a
result of adding that point to the tree; which, in approximation, is inversely proportional
to the resulting depth of the point in the tree”.
 The random cut forest assigns an anomaly score by computing the average score from
each constituent tree and scaling the result with respect to the sample size.

Step 1: Sample Data


 The first step of the RCF algorithm is to obtain a random sample of the training data. If the
training data is small enough, the entire dataset can be used, or we could randomly draw
elements from it. However, frequently the training data is too large to fit all at once, and
this approach isn't feasible. Instead, we use a technique called reservoir sampling.
 Reservoir sampling is an algorithm for efficiently drawing random samples from a dataset.

Step 2: Train Model


The next step in RCF is to construct a random cut forest using the random sample of data. First,
the sample is partitioned into a number of equal-sized partitions equal to the number of trees in
the forest. Then, each partition is sent to an individual tree. The tree recursively organizes its
partition into a binary tree by partitioning the data domain into bounding boxes.


Figure 101: RCF data partition

The RCF algorithm organizes these data in a tree by first computing a bounding box of the data,
selecting a random dimension (giving more weight to dimensions with higher "variance"), and
then randomly determining the position of a hyperplane "cut" through that dimension. The two
resulting subspaces define their own sub tree. In this example, the cut happens to separate a
lone point from the remainder of the sample. The first level of the resulting binary tree consists
of two nodes, one which will consist of the subtree of points to the left of the initial cut and the
other representing the single point on the right.

Step 3: Inference
When performing inference using a trained RCF model the final anomaly score is reported as the
average across scores reported by each tree.
Note that it is often the case that the new data point does not already reside in the tree. To
determine the score associated with the new point the data point is inserted into the given tree
and the tree is efficiently (and temporarily) reassembled in a manner equivalent to the training
process described above.
That is, the resulting tree is as if the input data point were a member of the sample used to
construct the tree in the first place. The reported score is inversely proportional to the depth of
the input point within the tree.

Hyperparameters
Parameter Name Description
num_trees Number of trees in the forest. Increasing it reduces noise.
Optional
Valid values: Positive integer (min: 50, max: 1000)
Default value: 100
num_samples_per_tree Should be chosen such that 1/num_samples_per_tree
approximates the ratio of anomalous to normal data.
Number of random samples given to each tree from the
training data set.
Optional
Valid values: Positive integer (min: 1, max: 2048)
Default value: 256
eval_metrics A list of metrics used to score a labeled test data set. The
following metrics can be selected for output:
 Accuracy - returns fraction of correct predictions.
 precision_recall_fscore - returns the positive and negative
precision, recall, and F1-scores.
Optional
Valid values: a list with possible values taken
from accuracy or precision_recall_fscore.
Default value: Both accuracy, precision_recall_fscore are
calculated.

Input Formats
 RecordIO-protobuf or CSV
 Can use File or Pipe mode on either
 Optional test channel for computing accuracy, precision, recall, and F1 on labeled data
(anomaly or not)

Instance Types
 Does not take advantage of GPUs
 Use M4, C4, or C5 for training
 ml.c5.xlarge for inference

3.2.18 Neural Collaborative Filtering


A recommender system is a set of tools that helps provide users with a personalized experience
by predicting user preference amongst a large number of options. Matrix factorization (MF) is a
well-known approach to solving such a problem.


Conventional MF solutions exploit explicit feedback in a linear fashion; explicit feedback consists
of direct user preferences, such as ratings for movies on a five-star scale or binary preference on
a product (like or not like). However, explicit feedback isn’t always present in datasets.
NCF solves the absence of explicit feedback by only using implicit feedback, which is derived from
user activity, such as clicks and views. In addition, NCF utilizes a multi-layer perceptron to introduce
non-linearity into the solution.

Architecture
An NCF model contains two intrinsic sets of network layers: embedding and NCF layers. You use
these layers to build a neural matrix factorization solution with two separate network
architectures, generalized matrix factorization (GMF) and multi-layer perceptron (MLP), whose
outputs are then concatenated as input for the final output layer.

NOTE: Training and deploying the model using script mode


4. ML implementation and Operations


4.1 SageMaker
4.1.1 Amazon ECR
Amazon Elastic Container Registry is a fully managed container registry that makes it easy to
store, manage, share, and deploy your container images and artifacts anywhere. ECR hosts
your images in a highly available and high-performance architecture, allowing you to reliably
deploy images for your container applications.
You can share container software privately within your organization or publicly worldwide.
It can be used with Fargate for one-click deployment.

Figure 102: Amazon ECR


4.1.2 Introduction to SageMaker

Figure 103: SageMaker Workflow

SageMaker is intended to manage the entire machine learning workflow.

SageMaker Notebooks
Notebook Instances on EC2 are spun up from the console
 S3 data access
 Scikit learn, Spark, Tensorflow
 Wide variety of built-in models
 Ability to spin up training instances
 Ability to deploy trained models for making predictions at scale

SageMaker Console
Less flexible than notebooks, since you can write code in notebooks.
SageMaker functions:
- Kick off training jobs
- Kick off hyperparameter tuning job
- End point configuration
- Create end points

Data Preparation
 Data must come from S3. The ideal format varies with the algorithm; often it is
RecordIO-Protobuf.
 Apache Spark integrates with SageMaker

 Scikit learn, numpy, pandas all at your disposal within a notebook

Training on SageMaker
 Create a training job
- URL of S3 bucket with training data
- ML compute resources
- URL of S3 bucket for output
- ECR path to training code
 Training options
- Built-in training algorithms
- Spark MLLib
- Custom Python Tensorflow / MXNet code
- Your own Docker image
- Algorithm purchased from AWS marketplace

 When creating a training job, the following are the mandatory fields:
- AlgorithmSpecification
The registry path of the Docker image that contains the training algorithm and
algorithm-specific metadata, including the input mode.

- OutputDataConfig
Specifies the path to the S3 location where you want to store model artifacts.
Amazon SageMaker creates subfolders for the artifacts.

- ResourceConfig
The resources, including the ML compute instances and ML storage volumes, to use
for model training.

- RoleArn
The Amazon Resource Name (ARN) of an IAM role that Amazon SageMaker can
assume to perform tasks on your behalf.

- StoppingCondition


Specifies a limit to how long a model training job can run. It also specifies how long a
managed Spot training job has to complete. When the job reaches the time limit,
Amazon SageMaker ends the training job. Use this API to cap model training costs.

- TrainingJobName
The name of the training job. The name must be unique within an AWS Region in an
AWS account.

NOTE: Input path is not mandatory as the training path could be local on the
training machine.
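A minimal sketch of such a request using boto3, populated with only the mandatory fields listed above; the image URI, bucket, instance choice, and role ARN are placeholders.

import boto3

sm = boto3.client("sagemaker")
sm.create_training_job(
    TrainingJobName="my-training-job",   # must be unique per account and region
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
        "TrainingInputMode": "File",
    },
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)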

Deploying Trained Models


 Save your trained model to S3
 Can deploy two ways:
- Persistent endpoint for making individual predictions on demand
- SageMaker Batch Transform to get predictions for an entire dataset
 Lots of cool options
- Inference Pipelines for more complex processing
- SageMaker Neo for deploying to edge devices
- Elastic Inference for accelerating deep learning models
- Automatic scaling (increase # of endpoints as needed)

4.1.3 Automatic Model Tuning


SageMaker can tune the model hyperparameters, i.e., detect the best values for
hyperparameters such as learning rate, batch size, depth, etc.
 Define the hyper-parameters you care about and the ranges you want to try, and the
metrics you are optimizing for
 SageMaker spins up a “Hyperparameter Tuning Job” that trains as many combinations as
you’ll allow.
- Training instances are spun up as needed, potentially a lot of them
 The set of hyperparameters producing the best results can then be deployed as a model.
 It learns as it goes, so it doesn’t have to try every possible combination; the tuner learns
which directions improve model performance, so it does not pick values that it has
learned will not give good results.


Best Practice
 Don’t optimize too many hyperparameters at once
 Limit your ranges to as small a range as possible
 Use logarithmic scales when appropriate, e.g., when a hyperparameter's values range
from 0.001 to 0.1
 Don’t run too many training jobs concurrently
- This limits how well the process can learn as it goes
 Make sure training jobs running on multiple instances report the correct objective metric
in the end, i.e., after all the instances finish their work.
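A minimal sketch of a tuning job with the SageMaker Python SDK that applies these practices; it assumes `estimator` is an existing SageMaker Estimator, and the objective metric and S3 paths are placeholders.

from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,                          # an existing SageMaker Estimator
    objective_metric_name="validation:accuracy",  # depends on the algorithm
    hyperparameter_ranges={
        # Logarithmic scale, since the range spans orders of magnitude
        "learning_rate": ContinuousParameter(0.001, 0.1, scaling_type="Logarithmic"),
        "mini_batch_size": IntegerParameter(32, 512),
    },
    max_jobs=20,            # total training jobs the tuner may run
    max_parallel_jobs=2,    # kept low so the tuner can learn as it goes
)
tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})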

4.1.4 SageMaker Docker Containers


4.1.4.1 Container
 A container is a standard unit of software that packages up code and all its dependencies
so the application runs quickly and reliably from one computing environment to another.
 Container image becomes container at runtime.
4.1.4.2 Docker
 A Docker container image is a lightweight, standalone, executable package of software that
includes everything needed to run an application: code, runtime, system tools, system
libraries, and settings. It runs on the Docker engine, available for Linux and Windows.

SageMaker Container Scenarios


1. Built in SageMaker Algorithm or framework
For most use cases, you can use the built-in algorithms and frameworks without worrying
about containers.
You can deploy and train these algorithms from SageMaker console, CLI or notebook by
specifying algorithm or framework version when creating your estimator.

2. Use Pre-built SageMaker Container Image


You can use the built-in algorithms and frameworks using Docker container. SageMaker
provides container for built-in algorithms and prebuilt Docker images for some of the most
common ML frameworks such as Apache MXNet, Tensorflow, PyTorch and Chainer.
It also supports ML libraries for Scikit learn and Spark ML.


You can deploy the containers by passing the full container URI to their respective
SageMaker SDK Estimator class.

3. Extend a prebuilt SageMaker Container Image


You can extend a prebuilt SageMaker algorithm or model Docker image, you can modify
SageMaker image to satisfy your needs.

4. Adapt an existing Container Image


You can adapt a pre-existing container image to work with SageMaker.
You need to modify the Docker container to enable either training, inference or both tool
kits.

Prebuilt SageMaker Docker Image Types

Docker Images for Deep Learning
SageMaker provides prebuilt Docker images that include deep learning framework libraries
and other dependencies needed for training and inference.

Docker Images for Scikit-learn & Spark ML
SageMaker provides prebuilt Docker images that install the Scikit-learn and Spark ML libraries.
These libraries include the dependencies needed to build Docker images that are compatible
with SageMaker.

NOTE: TensorFlow doesn’t get distributed across multiple machines automatically.
So, if you need to distribute training across multiple machines that might use
GPUs, you can do one of the following:
Use a framework called Horovod
Use a parameter server


4.1.4.3 SageMaker Modes


You might be quite specialized and have several highly customized algorithms and Docker
containers to support those algorithms, and AWS has a workflow to create and support these
bespoke components as well.

Script Mode
SageMaker offers a solution using script mode. Script mode enables you to write custom training
and inference code while still utilizing common ML framework containers maintained by AWS.
Script mode is easy to use and extremely flexible.

Local Mode
Amazon SageMaker Python SDK supports local mode, which allows you to create estimators and
deploy them to your local environment. This is a great way to test your deep learning scripts
before running them in SageMaker’s managed training or hosting environments. Local Mode is
supported for frameworks images (TensorFlow, MXNet, Chainer, PyTorch, and Scikit-Learn) and
images you supply yourself.
The Amazon SageMaker deep learning containers have recently been open sourced, which
means you can pull the containers into your working environment and use custom code built into
the Amazon SageMaker Python SDK to test your algorithm locally, just by changing a single line of
code. This means that you can iterate and test your work without having to wait for a new
training or hosting cluster to be built each time.


The Amazon SageMaker local mode allows you to switch seamlessly between local and
distributed, managed training by simply changing one line of code. Everything else works the
same.
The local mode in the Amazon SageMaker Python SDK can emulate CPU (single and multi-
instance) and GPU (single instance) SageMaker training jobs by changing a single argument in the
TensorFlow, PyTorch or MXNet estimators. To do this, it uses Docker compose and NVIDIA
Docker. It will also pull the Amazon SageMaker TensorFlow, PyTorch or MXNet containers from
Amazon ECR, so you’ll need to be able to access a public Amazon ECR repository from your local
environment.

4.1.4.4 SageMaker Toolkit Structure


When SageMaker trains a model, it creates the following folder structure.

/opt/ml
├── input
│   ├── config — hyperparameters configuration (hyperparameters.json) and the
│   │   resource configuration used for training distributed on more than one
│   │   server (resourceconfig.json)
│   └── data/<channel_name> — training, validation, or testing data
├── code — the Python scripts that do the training
├── model — train: the model generated; deploy: model artifacts & inference code
└── output — failure & error messages

Figure 104: SageMaker Docker Folder Structure

4.1.4.5 Docker Image Folder Structure


WORKDIR
 nginx.conf
 predictor.py
 serve/
 train/
 wsgi.py

 nginx.conf: configuration for front end server at deployment time


 predictor.py: the program that implements a Flask web server for making predictions
at runtime. Customize this code for your application.
 serve/: the program that starts when the container is started for hosting. It starts the
Gunicorn server, which runs multiple instances of the Flask application defined in
predictor.py.
 train/: the program that starts when you run the Docker container for training.
 wsgi.py: invokes your Flask application.

4.1.4.6 Extend Docker Image


1. Step 1: Create an Amazon SageMaker Instance from the console
2. Step 2: Create a Docker file and training script:
2.a Docker File:

Figure 105: Docker File (SAGEMAKER_PROGRAM is the only mandatory environment variable)

The Docker file script performs the following tasks:


 FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-
py36-ubuntu16.04 – Downloads the SageMaker PyTorch base image. You can replace
this with any SageMaker base image you want to bring to build containers.
 ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code – Sets /opt/ml/code as the
training script directory.
 COPY cifar10.py /opt/ml/code/cifar10.py – Copies the script to the location inside the
container that is expected by SageMaker. The script must be located in this folder.
 ENV SAGEMAKER_PROGRAM cifar10.py – Sets your cifar10.py training script as the
entry point script.


2.b Training Script:


Create cifar10.py (the training script file) and add it to the folder <folder for the Docker file>.

3. Step 3: Build the container


From a notebook:
!cd <folder for the Docker file>
!docker build -t foo .  # looks for the default filename 'Dockerfile' without any extension
!docker build -t foo -f Dockerfile-text.txt .  # passing the Docker file name
4. Step 4: Test Container
From your training code:
estimator = Estimator(image_uri="foo", ...)

5. Step 5: Push the Container to Amazon ECR


After you successfully run the local mode test, you can push the Docker container to
Amazon ECR and use it to run training jobs.


6. Step 6: Call ECR image


After you push the container, you can call the Amazon ECR image from anywhere in the
SageMaker environment.
algorithm_name = "pytorch-extended-container-test"
ecr_image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, algorithm_name)

estimator = Estimator(
    image_uri=ecr_image,
    role=get_execution_role(),
    base_job_name='pytorch-extended-container-test',
    instance_count=1,
    instance_type='ml.p2.xlarge'
)

4.1.4.7 Adapt Docker Container for SageMaker


1. Step 1: Create a SageMaker notebook instance
2. Step 2: Create Docker file and training script


The Docker file script performs the following tasks:


 FROM tensorflow/tensorflow:2.2.0rc2-gpu-py3-jupyter – Downloads the TensorFlow
Docker base image. You can replace this with any Docker base image you want to bring
to build containers, as well as with AWS pre-built container base images.
 RUN pip install sagemaker-training – Installs SageMaker Training Toolkit that contains
the common functionality necessary to create a container compatible with SageMaker.
 COPY train.py /opt/ml/code/train.py – Copies the script to the location inside the
container that is expected by SageMaker. The script must be located in this folder.
 ENV SAGEMAKER_PROGRAM train.py – Takes your training script train.py as the
entrypoint script copied in the /opt/ml/code folder of the container. This is the only
environmental variable that you must specify when you build your own container.

3. Step 3: Build the container


As in 4.1.4.6 Extend Docker Image
4. Step 4: Test the container
As in 4.1.4.6 Extend Docker Image
5. Step 5: Push the Container to Amazon ECR
As in 4.1.4.6 Extend Docker Image
6. Step 6: Call ECR image
As in 4.1.4.6 Extend Docker Image

4.1.4.8 Adapting Your Own Inference Container


If none of the Amazon SageMaker prebuilt inference containers suffice for your situation, and
you want to use your own Docker container, use the SageMaker Inference Toolkit to adapt your
container to work with SageMaker hosting.


To adapt your container to work with SageMaker hosting, create the inference code in one or
more Python script files and a Docker file that imports the inference toolkit.
The inference code includes an inference handler, a handler service, and an entrypoint. In this
example, they are stored as three separate Python files. All three of these Python files must be in
the same directory as your Dockerfile.
Step 1: Create an Inference Handler
The SageMaker inference toolkit is built on the multi-model server (MMS). MMS expects a
Python script that implements functions to load the model, pre-process input data, get
predictions from the model, and process the output data in a model handler.
The model_fn Function
The model_fn function is responsible for loading your model. It takes a model_dir argument that
specifies where the model is stored.
def model_fn(self, model_dir)

The input_fn Function


The input_fn function is responsible for deserializing your input data so that it can be passed to
your model.
It takes input data and content type as parameters, and returns deserialized data. The SageMaker
inference toolkit provides a default implementation that deserializes the following content types:
- JSON
- CSV
- Numpy array
- NPZ
def input_fn(self, input_data, content_type)

The predict_fn Function


The predict_fn function is responsible for getting predictions from the model. It takes the model
and the data returned from input_fn as parameters, and returns the prediction.
def predict_fn(self, data, model)

The output_fn Function


The output_fn function is responsible for serializing the data that the predict_fn function returns
as a prediction. The SageMaker inference toolkit implements a default output_fn function that
serializes Numpy arrays, JSON, and CSV. If your model outputs any other content type, or you
want to perform other post-processing of your data before sending it to the user, you must
implement your own output_fn function.

def output_fn(self, prediction, accept)
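A minimal sketch of these four functions together, written here as module-level functions; it assumes a pickled model file named model.pkl and JSON request/response bodies, so adapt it to your framework and content types.

import json
import os
import pickle

def model_fn(model_dir):
    # Load the model from the directory where SageMaker extracted the artifacts.
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)

def input_fn(input_data, content_type):
    # Deserialize the request body into something the model can consume.
    if content_type == "application/json":
        return json.loads(input_data)
    raise ValueError("Unsupported content type: " + content_type)

def predict_fn(data, model):
    # Run the actual prediction.
    return model.predict(data)

def output_fn(prediction, accept):
    # Serialize the prediction back to the client.
    if accept == "application/json":
        return json.dumps({"prediction": list(prediction)})
    raise ValueError("Unsupported accept type: " + accept)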

Step 2: Implement a Handler Service


The handler service is executed by the model server. The handler service implements initialize
and handle methods. The initialize method is invoked when the model server starts, and the
handle method is invoked for all incoming inference requests to the model server.
class HandlerService(DefaultHandlerService)

Step 3: Implement an Entrypoint


The entrypoint starts the model server by invoking the handler service. You specify the location
of the entrypoint in your Dockerfile.

Step 4: Write a Dockerfile


In your Dockerfile, copy the model handler from step 2 and specify the Python file from the
previous step as the entrypoint in your Dockerfile.

Step 5: Build and Register Your Container


Now you can build your container and register it in Amazon Elastic Container Registry (Amazon
ECR).

4.1.4.9 Use Your Own Training Algorithms


This section explains how Amazon SageMaker interacts with a Docker container that runs your
custom training algorithm. Use this information to write training code and create a Docker image
for your training algorithms.
How Amazon SageMaker Runs Your Training Image?
To configure a Docker container to run as an executable, use an ENTRYPOINT instruction in a
Dockerfile.
How Amazon SageMaker Provides Training Information?

This section explains how SageMaker makes training information, such as training data,
hyperparameters, and other configuration information, available to your Docker container.
When you send a CreateTrainingJob request to SageMaker to start model training, you specify
the Amazon Elastic Container Registry path of the Docker image that contains the training
algorithm. You also specify the Amazon Simple Storage Service (Amazon S3) location where
training data is stored and algorithm-specific parameters. SageMaker makes this information
available to the Docker container so that your training algorithm can use it. This section explains
how we make this information available to your Docker container. For information about creating
a training job, see CreateTrainingJob.
Hyperparameters
SageMaker makes the hyperparameters in a CreateTrainingJob request available in the Docker
container in the /opt/ml/input/config/hyperparameters.json file.
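For example, a training script might read them back like this; the hyperparameter names are illustrations, and note that all values arrive as strings.

import json

with open("/opt/ml/input/config/hyperparameters.json") as f:
    hyperparameters = json.load(f)

# Values are passed as strings, so cast them yourself.
learning_rate = float(hyperparameters.get("learning_rate", "0.01"))
epochs = int(hyperparameters.get("epochs", "10"))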

Environment Variables
The following environment variables are set in the container:
TRAINING_JOB_NAME – Specified in the TrainingJobName parameter of the CreateTrainingJob
request.
TRAINING_JOB_ARN – The Amazon Resource Name (ARN) of the training job returned as the
TrainingJobArn in the CreateTrainingJob response.

All environment variables specified in the Environment parameter in the CreateTrainingJob
request.

Input Data Configuration


You specify data channel information in the InputDataConfig parameter in a CreateTrainingJob
request. SageMaker makes this information available in the
/opt/ml/input/config/inputdataconfig.json file in the Docker container.

For example, suppose that you specify three data channels (train, evaluation, and validation) in
your request. SageMaker provides the following JSON:
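A sketch of that file, reconstructed from the documented format (the exact attributes depend on your request):

{
    "train": {
        "ContentType": "trainingContentType",
        "TrainingInputMode": "File",
        "S3DistributionType": "FullyReplicated",
        "RecordWrapperType": "None"
    },
    "evaluation": {
        "ContentType": "evalContentType",
        "TrainingInputMode": "File",
        "S3DistributionType": "FullyReplicated",
        "RecordWrapperType": "None"
    },
    "validation": {
        "TrainingInputMode": "File",
        "S3DistributionType": "FullyReplicated",
        "RecordWrapperType": "None"
    }
}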


Training Data
The TrainingInputMode parameter in a CreateTrainingJob request specifies how to make data
available for model training: in FILE mode or PIPE mode. Depending on the specified input mode,
SageMaker does the following:

FILE mode—SageMaker makes the data for the channel available in the
/opt/ml/input/data/channel_name directory in the Docker container. For example, if you have
three channels named training, validation, and testing, SageMaker makes three directories in the
Docker container:

/opt/ml/input/data/training
/opt/ml/input/data/validation
/opt/ml/input/data/testing

PIPE mode—SageMaker makes data for the channel available from the named pipe:
/opt/ml/input/data/channel_name_epoch_number.

4.1.4.10 Distributed Training Configuration


If you're performing distributed training with multiple containers, SageMaker makes information
about all containers available in the /opt/ml/input/config/resourceconfig.json file.


To enable inter-container communication, this JSON file contains information for all containers.
SageMaker makes this file available for both FILE and PIPE mode algorithms. The file provides the
following information:
current_host—The name of the current container on the container network. For example, algo-1.
Host values can change at any time. Don't write code with specific values for this variable.

hosts—The list of names of all containers on the container network, sorted lexicographically. For
example, ["algo-1", "algo-2", "algo-3"] for a three-node cluster. Containers can use these names
to address other containers on the container network. Host values can change at any time. Don't
write code with specific values for these variables.

network_interface_name—The name of the network interface that is exposed to your container.


For example, containers running the Message Passing Interface (MPI) can use this information to
set the network interface name.
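A sketch of that file for a three-node cluster, following the fields described above:

{
    "current_host": "algo-1",
    "hosts": ["algo-1", "algo-2", "algo-3"],
    "network_interface_name": "eth1"
}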

NOTE: Do not use the information in /etc/hostname or /etc/hosts because it might
be inaccurate.
Hostname information may not be immediately available to the algorithm container. We
recommend adding a retry policy on hostname resolution operations as nodes become available
in the cluster.

4.1.4.11 Environment Variables


 SAGEMAKER_PROGRAM
 SAGEMAKER_TRAINING_MODULE
 SAGEMAKER_SERVICE_MODULE
 SAGEMAKER_MODEL_DIR
 SAGEMAKER_CHANNEL/SM_CHANNEL_<Channel Name>
Channel Name could be Training, testing or Validation.
 SAGEMAKER_HPS (Hyperparameters)

4.1.4.12 Tensorflow Training


When using TensorFlow with Amazon SageMaker:


1. Train
a. Preparing Training Script
The training script is very similar to a training script you might run outside of SageMaker, but
you can access useful properties about the training environment through various
environment variables.

SM_MODEL_DIR:
A string that represents the local path where the training job writes the model artifacts to.
After training, artifacts in this directory are uploaded to S3 for model hosting. This is
different than the model_dir argument passed in your training script, which can be an S3
location. SM_MODEL_DIR is always set to /opt/ml/model.

SM_NUM_GPUS:
An integer representing the number of GPUs available to the host.

SM_OUTPUT_DATA_DIR:
A string that represents the path to the directory to write output artifacts to. Output
artifacts might include checkpoints, graphs, and other files to save, but do not include
model artifacts. These artifacts are compressed and uploaded to S3 to an S3 bucket with
the same prefix as the model artifacts.

SM_CHANNEL_XXXX:
A string that represents the path to the directory that contains the input data for the
specified channel. For example, if you specify two input channels in the TensorFlow
estimator’s fit call, named ‘train’ and ‘test’, the environment variables
SM_CHANNEL_TRAIN and SM_CHANNEL_TEST are set.

A typical training script loads data from the input channels, configures training with
hyperparameters, trains a model, and saves a model to SM_MODEL_DIR so that it can be
deployed for inference later. Hyperparameters are passed to your script as arguments and
can be retrieved with an argparse.ArgumentParser instance.
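A minimal sketch of how a script-mode training script typically reads hyperparameters and these environment variables; the hyperparameter names are illustrations.

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=10)   # an example hyperparameter
parser.add_argument("--sm-model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
args, _ = parser.parse_known_args()

# ... load data from args.train, train for args.epochs, then save the model to
# args.sm_model_dir so SageMaker uploads it to S3 for hosting.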
b. Adapting your local TensorFlow script
c. Use third-party libraries
d. Create an Estimator
2. Deploy to a SageMaker Endpoint
a. Deploying from an Estimator


After a TensorFlow estimator has been fit, it saves a TensorFlow SavedModel bundle
in the S3 location defined by output_path. You can call deploy on a TensorFlow
estimator object to create a SageMaker Endpoint.

b. Deploying directly from model artifacts


If you already have existing model artifacts in S3, you can skip training and deploy
them directly to an endpoint.

3. Making predictions against a SageMaker Endpoint


Once you have the Predictor instance returned by model.deploy(...) or estimator.deploy(...),
you can send prediction requests to your Endpoint, as in the sketch below.
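A minimal sketch covering the estimator, deployment, and a prediction request; the script name, role, framework versions, and S3 path are placeholders.

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",        # your script-mode training script
    role=role,                      # an existing IAM role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.4.1",
    py_version="py37",
)
estimator.fit({"train": "s3://my-bucket/train/"})

predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
result = predictor.predict({"instances": [[1.0, 2.0, 3.0]]})  # TensorFlow Serving format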
TensorFlow Serving Input and Output
Supported Formats
SageMaker’s TensorFlow Serving endpoints can also accept some additional input formats that
are not part of the TensorFlow REST API, including a simplified JSON format, line-delimited JSON
objects (“JSON” or “JSONlines”), and CSV data.

4.1.4.13 Deep Learning AMI (DLAMI)


 This is a customized machine learning instance that includes the infrastructure and tools
to accelerate deep learning in the cloud at any scale.
 It is pre-installed with popular deep learning frameworks and interfaces such as
TensorFlow, PyTorch, Apache MXNet, Chainer, Gluon, Horovod and Keras to train
sophisticated custom AI models.
 There are two flavors:
- Deep Learning AMI with Conda
Frameworks installed separately using Conda packages and separate Python
environment.
Frameworks are: Apache MXNet, Chainer, Keras, PyTorch, TensorFlow and
TensorFlow 2.

- Deep Learning Base AMI


No frameworks installed, only NVIDIA CUDA and dependencies.


4.1.5 Production Variant


 Variant weights tell SageMaker how to distribute traffic among different models.
 This supports A/B testing, i.e., testing the performance of a new model in production;
the A/B test is implemented via production variants.
 When deploying a new model and you want to test its performance in production before
rolling it out, you can send a defined percentage of requests to the new model, e.g., 10%,
and the other 90% to the old model. You can then increase the percentage from 10% 
20%  30%, etc. Once you are sure that the new model performs well, you can roll out
the new model and remove the old one, as in the sketch below.
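A sketch of a 90/10 split using boto3; the model and config names are placeholders, and both models must already exist.

import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "old-model",
            "ModelName": "model-v1",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,   # 90% of the traffic
        },
        {
            "VariantName": "new-model",
            "ModelName": "model-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,   # 10% of the traffic
        },
    ],
)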

4.1.6 SageMaker Neo


 Train once, run anywhere.
 Neo automatically optimizes Gluon, Keras, XGBoost, MXNet, PyTorch, TensorFlow,
TensorFlow Lite and ONNX model frameworks for inference on operating systems such as
Windows, Linux and Android, on processors from Ambarella, ARM, Intel, NVIDIA, NXP,
Qualcomm, Texas Instruments and Xilinx.
 Neo is not only for edge devices but is also used for cloud instances.
 Edge devices supported
 ARM, Intel, NVIDIA, DeepLens
 Cloud instances supported
 C4, C5, M4, M5, P2, P3 and g4dn

How it works?
 Neo consists of compiler and runtime.
 First, the Neo compilation API reads models exported from various frameworks. It converts
the framework-specific functions and operations into a framework-agnostic intermediate
representation.
 Next, it performs a series of optimizations.
 Then, it generates binary code for the optimized operations, writes it to a shared object
library, and saves the model definition and parameters into separate files.
 Neo also provides a runtime for each target platform that loads and executes the
compiled, optimized model.

4.1.7 SageMaker Security


 Identity & access Manager (IAM)


 Set up user accounts for AWS, and give these user accounts the permissions they
need.
 Restrict the permissions of the different services that are talking to each other. For
example, set a permission to SageMaker note book for S3 access.
 Permissions:
- Create Training Job - Create Model
- Create Endpoint configuration - Create Transform Job
- Create Hyperparameter Tuning - Create Notebook Instance
- Update Notebook instance
 Policies:
- AmazonSageMakerReadOnly
- AmazonSageMakerFullAccess
- AdministratorAccess
- DataScientist

 Multi-Factor Authentication (MFA)


Use MFA with root and admin accounts to enhance security.

 SSL/TLS Connection
 Use SSL/TLS for all connections between servers.
 Connecting to EMR can’t use SSL/TLS.

 Cloud Trail
Use CloudTrail to log any activity on the APIs that you are using. You will be able to see
what is happening, when, and who did it.

 Encryption
Use encryption whenever appropriate especially with Personal Identification Information
(PII)
If you are sending data like names, emails, addresses or credentials, make sure to encrypt
this data at rest and in transit.
 Encryption at rest


Key Management Service (KMS)


Any SageMaker jobs or notebooks will accept KMS key to encrypt all the data stored
by the jobs and notebooks.
- Training, Tuning, Batch Transform and endpoints.
- Everything in notebooks under /opt/ml/* can be encrypted, as well as the temp
folder in the Docker container.

S3 Encryption
- You can use S3 encryption for training data and hosting models.
- S3 can also use KMS to encrypt the data.

 Encryption in transit
- Basically all traffic support TLS/SSL in SageMaker.
- IAM Roles can be used to give permissions to access specific resources.
- Inter-node training (in the case of training on multiple servers) may optionally
encrypt data transferred between nodes.
 Can increase training time and cost
 Enabled via console or API when setting up a training or tuning job
 Deep Learning can be trained on multiple nodes.

 Virtual Private Cloud (VPC)


 Training job runs in VPC.
 You can use a private VPC for even more security; by default, a VPC has no internet
connection, which will lead to some issues, as follows:
- You will need to setup S3-VPC endpoints.
Endpoint policies and S3 bucket policies can make this secure.
NOTE: S3 needs internet connection to be accessed.

- Notebooks are internet enabled by default. If disabled, your VPC needs an


interfacing endpoints (Private Link) or NAT gateway, and allow outbound
connections for training and hosting to work.

NOTE: Notebooks needs internet connection to download libraries.


 Training and inference containers are also internet enabled by default.

Network isolation is an option, but this also prevents S3 access.

4.1.8 SageMaker Resources


 Algorithms that rely on deep learning will benefit from GPU instances (P2 or P3) for training.
BlazingText, DeepAR
 Inference is usually less demanding; use compute instances.
C4, C5, C6 and C6gn
 GPU instances are expensive, but can be used in inference to speed things up, although a
compute instance is usually enough.

Using Spot Instances


 Can use EC2 Spot instances for training; this can save about 90% over on-demand.
 Spot instances can be interrupted, so use checkpoints to S3 so training can resume
(see the sketch after this list).
 Can increase training time, as you need to wait for Spot instances to become available.
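A sketch of a managed Spot training job with checkpointing; the image, role, and S3 URIs are placeholders.

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=training_image,        # an existing training image URI
    role=role,                        # an existing IAM role ARN
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,
    max_run=3600,                     # max training time, in seconds
    max_wait=7200,                    # max wait for Spot capacity (must be >= max_run)
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # resume point if interrupted
)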

Instances Properties
 P2 Instances
- High frequency Intel Xeon E5-2686 v4 (Broadwell) processors
- High-performance NVIDIA K80 GPUs, each with 2,496 parallel processing
cores and 12GiB of GPU memory
- Supports GPUDirect™ for peer-to-peer GPU communications
- Provides Enhanced Networking using Elastic Network Adapter (ENA) with up
to 25 Gbps of aggregate network bandwidth within a Placement Group
- EBS-optimized by default at no additional cost

 P3 Instances
- Up to 8 NVIDIA Tesla V100 GPUs, each pairing 5,120 CUDA Cores and 640
Tensor Cores
- High frequency Intel Xeon E5-2686 v4 (Broadwell) processors for p3.2xlarge,
p3.8xlarge, and p3.16xlarge.
- High frequency 2.5 GHz (base) Intel Xeon 8175M processors for
p3dn.24xlarge.
- Supports NVLink for peer-to-peer GPU communication
- Provides up to 100 Gbps of aggregate network bandwidth.
- EFA support on p3dn.24xlarge instances

 G3 Instances
- High frequency Intel Xeon E5-2686 v4 (Broadwell) processors
- NVIDIA Tesla M60 GPUs, each with 2048 parallel processing cores and 8 GiB of
video memory
- Enables NVIDIA GRID Virtual Workstation features, including support for 4
monitors with resolutions up to 4096x2160. Each GPU included in your
instance is licensed for one “Concurrent Connected User"

Elastic Inference (EI)


 Accelerates deep learning inference at a fraction of the cost of using a
dedicated GPU instance for inference.
 Elastic Inference accelerators may be added to a CPU instance, e.g.,
ml.eia1.medium/large/xlarge
 Elastic Inference accelerators may also be used with notebooks.
 Elastic Inference works with deep learning frameworks.
 Works with TensorFlow and MXNet prebuilt containers. ONNX may be
used by exporting models to MXNet.
 Elastic Inference works with custom containers built with the EI-enabled
TensorFlow and MXNet libraries, which contain the code for EI to work.
 Works with the Image Classification and Object Detection built-in
algorithms.

4.1.9 SageMaker Automatic Scaling


 To use automatic scaling, you define and apply a scaling policy that uses Amazon
CloudWatch metrics and target values that you assign. Automatic scaling uses the policy
to increase or decrease the number of instances in response to actual workloads.
 Setup scaling policy to define target metrics, min/max capacity, cooldown periods.
 There are two types of supported scaling policies: target-tracking scaling and step
scaling. It is recommended to use target-tracking scaling policies for your auto-scaling
configuration. You configure:
- Target-Tracking scaling
You choose a scaling metric and set a target value. Application Auto Scaling creates
and manages the CloudWatch alarms that trigger the scaling policy and calculates
the scaling adjustment based on the metric and the target value. The scaling policy
adds or removes capacity as required to keep the metric at, or close to, the specified


target value. In addition to keeping the metric close to the target value, a target
tracking scaling policy also adjusts to changes in the metric due to a changing load
pattern.

- Step Scaling
You choose scaling metrics and threshold values for the CloudWatch alarms that
trigger the scaling process as well as define how your scalable target should be
scaled when a threshold is in breach for a specified number of evaluation periods.
Step scaling policies increase or decrease the current capacity of a scalable target
based on a set of scaling adjustments, known as step adjustments.
Step adjustments
When you create a step scaling policy, you add one or more step adjustments that
enable you to scale based on the size of the alarm breach. Each step adjustment
specifies the following:
- A lower bound for the metric value
- An upper bound for the metric value
- The amount by which to scale, based on the scaling adjustment type

 CloudWatch will monitor the performance of your inference nodes and scale them as
needed.
 Dynamically adjust number of instances for a production variant. According to the load
on which model.
 Load-test your scaling configuration before using it, so you can validate it before loading
that configuration into production, as in the sketch below.
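A sketch of a target-tracking policy for an endpoint variant using boto3; the endpoint, variant, policy names, and target value are placeholders.

import boto3

client = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)
client.put_scaling_policy(
    PolicyName="my-scaling-policy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # target invocations per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)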

4.1.10 Availability Zones in SageMaker


 Create robust endpoints when hosting your model. SageMaker endpoints can help protect
your application from Availability Zone outages and instance failures. If an outage occurs or
an instance fails, SageMaker automatically attempts to distribute your instances across
Availability Zones. For this reason, we strongly recommended that you deploy multiple
instances for each production endpoint.
 You should configure your VPC with 2 subnets, each in a different Availability Zone, if you
are using a custom VPC.


4.1.11 SageMaker Inference Pipeline


 You can use more than one container in a deployment and string them together using an
inference pipeline.
 You can have any combination of pre-trained built-in algorithms or your own algorithms
hosted in Docker containers, and chain them all together.
 You can have from 2 to 15 containers hooked together.
 You can combine pre-processing, predictors, and post-processing in different containers
and chain them all together in the inference pipeline.
 Spark ML/Scikit-learn containers can be used:
- Spark ML with Glue or EMR
- Serialized into MLeap format
 Used for real-time inference or batch inference.
 Spark ML/Scikit learn containers can be used:
- Spark ML with Glue or EMR
- Serialized into MLeap format
 Used for real time inference or batch inference.
When you deploy machine learning models into production to make predictions on new data,
you need to ensure that the same data processing steps that were used in training are also
applied to each inference request.
Using inference pipelines, can reuse the data processing steps applied in model training during
inference without the need to maintain two separate copies of the same code. This ensures
accuracy of your predictions and reduces development overhead.
Also remember the managed service aspect of Amazon SageMaker. Inference pipelines are
completely managed, which means when you deploy the pipeline model, the service installs and
runs the sequence of containers on each Amazon EC2 instance in the endpoint or batch
transform job.
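A minimal sketch with the SageMaker Python SDK; it assumes a preprocessing model and a predictor model already exist, and the names are placeholders.

from sagemaker.pipeline import PipelineModel

pipeline_model = PipelineModel(
    name="my-inference-pipeline",
    role=role,                                    # an existing IAM role ARN
    models=[preprocessing_model, xgboost_model],  # containers executed in order
)
pipeline_model.deploy(initial_instance_count=1, instance_type="ml.m5.large")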

4.1.12 SageMaker with Spark


 Spark is a very popular framework for preprocessing data, and it also has a very powerful
ML library (MLlib) to perform ML at large scale.
 Spark does a lot of what SageMaker can do, and even more.
 Spark loads data into data frames, and you can distribute the processing of a data frame
(all sorts of manipulation and massaging) across an entire cluster.
 The SageMaker-Spark library lets you integrate both to take advantage of the power of
SageMaker and Spark. This library lets you use SageMaker within a Spark-driven script.


How it works?

Preprocess data in Spark as usual (map and reduce) → the same data frame comes from Spark to a SageMaker Estimator (XGBoost, PCA, K-Means) → the SageMaker model is used for inference.

How to integrate Spark and SageMaker?


 Connect notebook or Zeppelin to a remote EMR cluster running Spark.
 Prepare a training data frame that has a features column (a vector of doubles) and, for supervised algorithms, an optional labels column of doubles.
 Call fit on your SageMaker estimator, passing the Spark data frame, to get a SageMakerModel. You now have a SageMaker model trained on the Spark data frame.
 Call transform on the SageMakerModel to make inferences.
 This also works for Spark pipelines.
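A minimal sketch of this flow with the sagemaker_pyspark library (the role ARN, data frames and hyperparameters are assumptions):

```python
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

estimator = XGBoostSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),  # hypothetical
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m5.xlarge",
    endpointInitialInstanceCount=1,
)
estimator.setObjective("binary:logistic")
estimator.setNumRound(25)

# training_df is a Spark DataFrame with a "features" vector column and a "label" column.
model = estimator.fit(training_df)

# transform() sends rows of test_df to the hosted SageMaker model for inference.
predictions = model.transform(test_df)
```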

4.1.13 Notebook Lifecycle


To install packages or sample notebooks on your notebook instance, configure networking and
security for it, or otherwise use a shell script to customize it, use a lifecycle configuration. A
lifecycle configuration provides shell scripts that run only when you create the notebook instance
or whenever you start one. When you create a notebook instance, you can create a new lifecycle
configuration and the scripts it uses or apply one that you already have.

You can also use a lifecycle configuration script to access AWS services from your notebook. For
example, you can create a script that lets you use your notebook to control other AWS resources,
such as an Amazon EMR instance.

The following are best practices for using lifecycle configurations:


 Lifecycle configurations run as the root user. If your script makes any changes within the
/home/ec2-user/SageMaker directory, (for example, installing a package with pip), use the
command sudo -u ec2-user to run as the ec2-user user. This is the same user that Amazon
SageMaker runs as.


 SageMaker notebook instances use conda environments to implement different kernels for
Jupyter notebooks. If you want to install packages that are available to one or more notebook
kernels, enclose the commands to install the packages with conda environment commands
that activate the conda environment that contains the kernel where you want to install the
packages.
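For example, a hedged sketch of creating a lifecycle configuration with boto3 whose OnStart script installs a package into a specific conda kernel as ec2-user, following the practices above (the config name, kernel environment and package are assumptions):

```python
import base64
import boto3

on_start = """#!/bin/bash
set -e
# Run as ec2-user so the install lands in the user's conda environments.
sudo -u ec2-user -i <<'EOF'
source activate python3      # activate the kernel's conda environment
pip install --upgrade scikit-learn
source deactivate
EOF
"""

sm = boto3.client("sagemaker")
sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="install-sklearn",   # hypothetical name
    # The script content must be base64 encoded.
    OnStart=[{"Content": base64.b64encode(on_start.encode()).decode()}],
)
```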

You can use a notebook instance created with a custom lifecycle configuration script to access
AWS services from your notebook. For example, you can create a script that lets you use your
notebook with Sparkmagic to control other AWS resources, such as an Amazon EMR instance.
You can then use the Amazon EMR instance to process your data instead of running the data
analysis on your notebook. This allows you to create a smaller notebook instance because you
won't use the instance to process data. This is helpful when you have large datasets that would
require a large notebook instance to process the data.

Amazon SageMaker periodically tests and releases software that is installed on notebook
instances. This includes:
 Kernel updates
 Security patches
 AWS SDK updates
 Amazon SageMaker Python SDK updates
 Open source software updates

4.1.14 SageMaker Studio


 SageMaker Studio is a web-based IDE for ML that lets you build, train, debug, deploy and monitor your ML models. Its UI is built on JupyterLab.
 SageMaker Studio has all the tools needed to take your models from experimentation to
production.
 In a single unified visual interface:
- Write and execute code in Jupyter notebook.
- Build and train machine learning models
- Deploy the models and monitor the performance of their prediction.
- Track and debug ML experiments.

SageMaker Studio Entity Status



- Domain
A SageMaker Studio domain consists of an associated Amazon EFS volume, a list of authorized users, and a variety of security, application, policy and VPC configurations.

- User profile
A user profile represents a single user within a domain.

- App
An app represents an application that supports the reading and execution experience of the user's notebooks, terminals and consoles.
An app can be a Jupyter notebook server or a kernel gateway.

SageMaker Studio Notebooks


 Amazon SageMaker Studio notebooks are collaborative notebooks that you can launch
quickly because you don't need to set up compute instances and file storage beforehand.
 You can share your notebooks with others, so that they can easily reproduce your results
and collaborate while building models and exploring your data.

4.1.15 SageMaker Experiments


 SageMaker Experiments is a capability of SageMaker that lets you organize, track, compare and evaluate your ML experiments.
 ML is an iterative process; you need to experiment with multiple combinations of data, algorithms and parameters while observing the impact of incremental changes on model accuracy.
 Over time these iterations can result in thousands of model training runs and model versions.
 SageMaker Experiments automatically tracks the inputs, parameters, configurations and results of your iterations as trials. You can assign, group and organize these trials into experiments.
 SageMaker Experiments is integrated with SageMaker Studio, providing a visual interface to browse your active and past experiments, compare trials on key performance metrics and identify the best performing models.
 SageMaker Experiments comes with its own Experiments Python SDK, which makes these analytics capabilities easily accessible from a SageMaker notebook.

 All experiment artifacts including datasets, algorithms, hyperparameters and model metrics are tracked and recorded.
 Tracking experiments can be manual or automatic.

4.1.16 SageMaker Monitoring


 Monitoring is an important part of maintaining the reliability, availability and performance of SageMaker.
 SageMaker Monitoring tools:
- Amazon CloudWatch
Monitors AWS resources and applications running on AWS in real time.
You can collect and track metrics, create customizable dashboards and set alarms when a metric reaches a threshold.
Example: track the CPU usage of EC2 and launch a new instance when needed.

- Amazon CloudWatch Log


Monitor, store and access your log files from EC2 instances, CloudTrail and other resources. CloudWatch Logs can monitor information in the log files and notify you when a threshold is met.

- AWS CloudTrail
Captures API calls and related events made by or on behalf of your AWS account and delivers log files to S3. It can identify which users or accounts called AWS, from which IP address, and when the call was made.

- Amazon CloudWatch Events


CloudWatch Events delivers a near-real time stream of system events that describe
changes in AWS resources. Event rules can react to status changes in training, hyperparameter tuning and batch transform jobs.

CloudWatch
 Collects raw data and processes it into readable, near real-time metrics.

 Statistics are kept for 15 months.


 The CloudWatch console limits metric searches to the last 2 weeks.
 CloudWatch metrics are (a sketch for querying one of them follows this list):
- Endpoint Invocation Metrics
Number of invocations, number of invocations with 4XX errors, number of invocations with 5XX errors, number of invocations sent to the model normalized by instance count in each production variant, model latency and overhead latency.

- Multi model Endpoint Metrics


Model load time, model unload time and model download time.

NOTE: Multi-model endpoint: an endpoint that can host multiple models. The models share a serving container that is enabled to host multiple models.
- Jobs and Endpoint Metrics
CPU Utilization, Memory Utilization, GPU Utilization, GPU Memory Utilization and
Disk Utilization.

- Ground Truth Metrics


Active workers, Dataset objects annotated by humans or auto annotated, Jobs failed
and Jobs succeeded.

- Features store Metrics


Consumed Read Requests and Consumed Write Requests.

- Pipeline Metrics
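As referenced above, a minimal sketch of pulling one endpoint invocation metric (ModelLatency) from the AWS/SageMaker namespace with boto3 (the endpoint and variant names are assumptions):

```python
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch")
resp = cw.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},   # hypothetical
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,                 # 5-minute buckets
    Statistics=["Average"],
)
for point in resp["Datapoints"]:
    print(point["Timestamp"], point["Average"])
```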

CloudWatch Log
 Helps debug your processing jobs, training jobs, endpoints, transform jobs, notebooks, notebook configurations, model containers and algorithm containers. Anything a component sends to stdout or stderr is also sent to CloudWatch Logs.


CloudTrail
 CloudTrail captures all API calls for SageMaker as events, with the exception of endpoint invocations (InvokeEndpoint).
 The calls captured include calls to SageMaker from the console and from code.
 If you create a trail, CloudTrail events are delivered to S3; if not, you can still use Event History in the console.
 Data collected includes who made the call, from which IP, when, and additional details.
 SageMaker also automatically logs non-API service events to CloudTrail for hyperparameter tuning jobs; this helps you improve governance, compliance, and operational and risk auditing.

SageMaker Event Bridge


 SageMaker Event Bridge monitors status change events.
 Events from SageMaker are delivered to Event Bridge in near real time.
 You can write code to automate actions to take when an event matches (a sketch follows the list below).
 Actions could be:
- Invoke lambda function
- Invoke EC2 instance run command
- Relay the event to a Kinesis data stream
- Activate step function
- Notify SNS
 Events Monitored:
- Training job
- Hyperparameter tuning job
- Transform job
- Endpoint state changed
- Feature group state changed
- Model package
- Pipeline execution
- Pipeline step state change
- Image state change
- Image version state change
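As referenced above, a minimal sketch of an EventBridge rule that matches SageMaker training job state changes and relays them to an SNS topic (the rule name, statuses and topic ARN are assumptions):

```python
import json
import boto3

events = boto3.client("events")

# Match training jobs that complete or fail.
events.put_rule(
    Name="sagemaker-training-state-change",       # hypothetical
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Training Job State Change"],
        "detail": {"TrainingJobStatus": ["Completed", "Failed"]},
    }),
)

# Notify an SNS topic whenever the rule matches.
events.put_targets(
    Rule="sagemaker-training-state-change",
    Targets=[{"Id": "notify", "Arn": "arn:aws:sns:us-east-1:123456789012:ml-alerts"}],
)
```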


4.1.17 SageMaker Debugger


 SageMaker Debugger will debug, monitor and profile training jobs in real time, detect non-
converging conditions, optimize resource utilization by eliminating bottlenecks, improve
training time and reduce costs of ML models.
 ML training jobs can have problems such as system bottlenecks, overfitting, saturated
activation function and vanishing gradients which can compromise model performance.
 SageMaker Debugger profiles and debugs training jobs to help solve the above problems and improve your ML model’s compute resource utilization and performance.
 SageMaker Debugger offers tools to send alerts when training anomalies are found, take actions against the problems and identify the root cause of the problem by visualizing collected metrics and tensors.
 SageMaker debugger supports the following frameworks Apache MXNet, TensorFlow,
PyTorch and XGBoost.

SageMaker Debugger Workflow


1. Configure a SageMaker training job debugger:
- Estimator API (Python SDK)
- SageMaker create training job (CLI or boto3)
- Custom training job with debugger
2. Start training job and monitor training issues in real time
- By using SageMaker studio debugger
3. Get alerts and take prompt actions against the training issues:
- Receive texts and emails through Simple Notification Service (SNS)
- Stop training job
- Setup actions with CloudWatch events and lambda functions
4. Receive training reports with suggested fixes for the issues and insight into training jobs
- Studio debugger


- Deep learning framework profiling report


- SageMaker XGBoost training report
5. Explore deep analysis of the training issues and bottlenecks
- For debugging model hyperparameters by using debugger visual output tensor in
Tensor dashboard
- Profiling training job using SMDebug Client Library.
6. Fix the issues considering the suggestions provided by Debugger

 Saves job training state at periodic intervals:


- Gradients/tensors (model parameters) such as weights, gradients and activation outputs of convolutional neural networks as the model is trained.
- Define rules for detecting unwanted conditions while training, using the Debugger rule API (SMDebug Python SDK).
- A debug job (processing container) runs for each rule you configure.
- Fires a CloudWatch event when a rule is hit.
 Integrates with SageMaker Studio debugger dashboard
 It automatically generates training reports
 SageMaker debugger has built in rules for:
- Monitor system bottlenecks
Monitor system utilization rate for resources such as CPU, GPU, memories, network
and I/O Data. This feature is available for any training job in SageMaker.
- Profiling deep learning frameworks
Profiling operations for TensorFlow and PyTorch frameworks such as step duration,
data loaders, forward and backwards operations and Python profiling metrics. Also,
framework specific metrics.
- Debug model parameters
Track and debug model parameters such as weights, gradients, biases and scalar values of your training job.
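A minimal sketch of attaching built-in Debugger rules to a training job through the SageMaker Python SDK (the image, role and data path are assumptions):

```python
from sagemaker.debugger import Rule, rule_configs
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=training_image,        # assumed to be defined elsewhere
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    rules=[
        # Each rule runs in its own processing container alongside training.
        Rule.sagemaker(rule_configs.vanishing_gradient()),
        Rule.sagemaker(rule_configs.overfit()),
    ],
)
estimator.fit("s3://my-bucket/train/")   # hypothetical training data path
```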

Use Debugger in Custom Containers


 SageMaker Debugger is available for any deep learning model that you bring to SageMaker.
 You need to modify your training script to register the Debugger hook callback so that it emits tensors from the training job.
 You need the following resources to build a customized container with Debugger:


- Amazon SageMaker Python SDK


- The SMDebug open source client library
- A Docker base image of your choice
- Your training script with a Debugger hook registered
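For instance, a hedged sketch of registering the hook in a PyTorch training script with the SMDebug client library (the model and loss objects are assumptions):

```python
import smdebug.pytorch as smd

# Reads the hook configuration that SageMaker injects into the container.
hook = smd.Hook.create_from_json_file()
hook.register_module(model)   # emit weights/gradients from the model
hook.register_loss(loss_fn)   # emit loss values as scalars
```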
Debugger API
- Available on GitHub, provided through the SageMaker Python SDK.
- Construct hooks and rules for creating and describing training jobs.
- The SMDebug client library lets you register hook callbacks for accessing training data.
Insight Dashboard
- Built-in actions to receive notifications through SNS or stop training in response to
debugger rules.
- Profiling system usage and training.
- Framework metrics Max Utilization time, step outlier, overall framework metrics.


Figure 106: SageMaker Debugger Architecture

4.1.18 SageMaker Ground Truth


 SageMaker Ground Truth manages the humans who will label your data for training purposes.
 SageMaker Ground Truth creates its own model as images are labeled by people. As this model learns, only images the model isn't sure about are sent to human labelers. This can reduce the cost of labeling by 70%.
Raw Data → Learning Model → Ambiguous data → Human Labelers

 Human Labelers could be:


- Mechanical Turk: huge workforce of people around the world who will label your
data.
- Internal Team: for sensitive data.
- Professional labeling companies.

4.1.19 SageMaker Autopilot


 SageMaker Autopilot is a wrapper for AutoML.
 SageMaker Autopilot automates the key tasks of an automatic machine learning (AutoML) process.
 It automates:
- Algorithm selection
- Data preprocessing
- Model tuning
- Cross validation
- Resampling
- Infrastructure

Autopilot Workflow
 You choose the data location in S3, and Autopilot loads the data from S3 for training (a sketch of the API call follows this list)
 You select the target column
 Autopilot automatically creates a model
 A notebook is available for visibility and control
 A model leaderboard ranks the recommended models
 Deploy & monitor the new model
 Refine the notebook if needed
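As referenced in the list above, a minimal sketch of starting an Autopilot job with boto3 (the names, paths and target column are assumptions):

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_auto_ml_job(
    AutoMLJobName="churn-autopilot",                      # hypothetical
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train/",             # tabular training data
        }},
        "TargetAttributeName": "churn",                   # the target column
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/autopilot-output/"},
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
)
```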

Autopilot Features
 Autopilot can add human guidance
 Problem types: binary classification, multiclass classification and regression
 Algorithm types: linear-learner, XGBoost, deep learning (Multilayer Perceptron) (MLP)
 Data must be tabular
 Autopilot explainability explains how models make predictions using a feature-attribution approach powered by SageMaker Clarify. It generates a report indicating the importance of each feature in the best candidate model. This explainability functionality can make ML models more understandable to AWS customers:


 The governance report can be used to inform risk and compliance teams and external regulators
 Transparency into how the model arrives at its predictions.
 Feature attribution:
- Uses SHAP baselines/Shapley values
- Research from co-operative game theory
- Assigns each feature an importance value for a given prediction
For example:
In a model that approves home loans, if race turns out to be a strong feature, something is wrong and you should go back and look at the bias that might exist in your source data.

4.1.20 SageMaker ModelMonitor


 ModelMonitor sends automatic alerts that notify you when there are quality deviations in the deployed model.
 Alerts are sent using CloudWatch.
 Visualize data drift. For example, a loan model starts giving people more credit due to drifting or missing input features; perhaps over time incomes are rising because of inflation, or some data is missing because it is no longer being collected.
 When the data the model sees changes relative to the data it was trained on, that is data drift. You can visualize it over time and be alerted if things begin to change too much.
 ModelMonitor can detect anomalies, outliers and new features.
 No code needed
 Detect new features that are coming, or features that need to change.
 ModelMonitor provide the following monitors:
- Monitor drift in data quality
- Monitor drift in model quality metrics as accuracy
- Monitor bias in model predictions
- Monitor drift in feature attribution

Monitor Data Quality


1. Enable data capture of input and output from real-time inference endpoints and store the data in S3.
2. Create a baseline by analyzing an input dataset that you specify.


- The baseline job computes schema constraints and statistics for each feature using Deequ, an open source library built on Spark, which measures data quality in large datasets.
3. Define and schedule data quality monitoring jobs
4. View data quality monitoring with CloudWatch
5. Interpret the results of monitoring jobs
6. Use SageMaker Studio to enable data quality monitoring and visualize the results
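A minimal sketch of step 2, suggesting a baseline with the SageMaker Python SDK (the role, paths and instance type are assumptions):

```python
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Computes schema constraints and per-feature statistics from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/baseline/train.csv",      # hypothetical
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline-results/",
)
```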

Monitor Model Quality


Same as data quality, except:
1. In step 2: create a baseline that runs a job comparing predictions from the model with ground truth labels in a baseline dataset.
2. In step 3: ingest ground truth labels, which ModelMonitor merges with captured prediction data from the real-time inference endpoint.

Monitor Model Bias


 Bias in data can affect the model performance.
 Bias could appear between training data and data at deployment (live data), or between training data and live data after some time.
For example: changes in taxes or mortgage rates, or changes in holidays.
 This bias could be temporary or permanent.
 If the data values exceed a certain threshold, an event is sent to CloudWatch to be logged.
 Alarming on raw values alone could raise a lot of bias alarms, so ModelMonitor uses a statistical confidence interval with a probability score to decide when to send an alarm.

Monitor Feature Attribution


 Attribution is a value that expresses the power or weight of a feature in the current model.
 We can detect drift by comparing how the ranking of individual features changes from training data to live data, together with the raw attribution scores. So, both the rank and the attribution score of each feature are taken into consideration.
 We can then use Normalized Discounted Cumulative Gain (NDCG) to compare the feature attribution rankings of training and live data.
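For reference, the standard NDCG definition (a general formula, not taken from this guide), where rel_i can be read as the attribution score of the feature at rank i in the live data and IDCG is the DCG of the ideal (training-data) ranking:

```latex
\mathrm{DCG}_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad
\mathrm{NDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k}
```

A score close to 1 means the live feature ranking still matches the training ranking; lower values indicate attribution drift.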

Other ModelMonitor Features


 SageMaker Clarify also helps explain model behavior, understand which features
contribute the most to your predictions


 ModelMonitor can integrate with TensorBoard, QuickSight and SageMaker Studio

4.1.21 SageMaker JumpStart


 You can use SageMaker JumpStart to learn about SageMaker features and capabilities
through curated 1-click solutions, example notebooks, and pre-trained models that you
can deploy. You can also fine-tune the models and deploy them.
 Over 150 open source models in NLP, object detection, image classification and more.
 Can only be accessed from SageMaker Studio.

4.1.22 SageMaker Data Wrangler


 Import/transform/Analyze/export data within SageMaker Studio
 Import: Connect to and import data from Amazon S3, Athena or Redshift.
 Dataflow: create dataflow to define a series of ML data preparation steps.
- You can use a flow to combine datasets from different data sources.
 Transform: transform your dataset using standard transforms like string, vector and numeric data formatting tools, and feature transforms like text, date/time and categorical encoding.
 Analyze: Analyze features in your dataset at any point in your flow.
Data Wrangler include built-in data visualization like scatter plot and Histograms.
 Export: It offers export functionality to other SageMaker services including pipeline,
feature store and Python code.

4.1.23 SageMaker Feature Store


 Find, discover, and share features.
 It is like a repository of the training features in large corporates.
 It has two modes, online and offline, for real-time predictions and batch predictions respectively.
 Features are organized into feature groups.
 It can be used by scientists, Engineers and general practitioners.
 Feature Store reduces repetitive data processing and the work required to convert raw data into features.


4.1.24 SageMaker Edge Manager


Challenges
 Operating ML models on edge devices is challenging because they have limited compute, memory and connectivity.
 You also need to monitor model drift after deployment, as drift affects model quality and can cause the model to decay over time.
 You need to write code to get data from different devices and recognize data drift.
 To update the model on edge devices, you would otherwise have to rebuild the entire application.
Solution
 Edge Manager, you can optimize, run, monitor and update ML models across fleet of
devices at the edge.

How it works?
 Edge Manager has five main components:
- Compiling: Compile model with SageMaker Neo
- Packing: Pack Neo models
- Deploy: Deploy models to devices
- Agent: Run model for inference
- Maintain: Maintain model on devices
 SageMaker Edge Manager can sample model input and output data from edge devices and
send it to the cloud for monitoring and analysis.
 View dashboards that tracks and visually report on the operation of the deployed model
with SageMaker console.
 In this way, developers can improve model quality by using SageMaker ModelMonitor for drift detection, then relabel the data using Ground Truth.


4.1.25 Put it all together

Figure 107: SageMaker products


4.2 AI Services
4.2.1 Amazon Comprehend
 Comprehensive natural language processing service (NLP)
 Natural language processing and text analysis
 Input any text may be social media, web pages, documents, transcript and medical records
(Comprehend Medical)
 Can be trained on your data, or used out of the box with its pre-trained models.
 Extract key phrases, entities, sentiment, language, syntax, topics and document
classification.
 Entities
It can detect and extract entities from text, e.g. Amazon Inc.
It can also detect and extract person names, dates and locations with a confidence score.
 Key phrases
It can extract important phrases in sentences with a confidence score.
 Language
It can detect language of the text.
 Sentiment Analysis
Categorizes text as neutral, positive, negative or mixed.
 Syntax
Detects nouns, verbs and punctuation.

4.2.2 Amazon Translate


 Use deep learning for translation
 Support custom terminology
 In CSV or TMX format
 Used for proper names, brand names and the like.
 Detect language by using the same algorithm in Comprehend

NOTE: TMX is a standard format in the world of ML translation


4.2.3 Amazon Transcribe


 Speech to text
 Input in FLAC, MP3, MP4 or WAV in a specified language
 Streaming audio HTTP/2 or websocket. It supports English, French and Spanish only.
 Speaker identification, and identification of the number of speakers.
 Channel identification:
 Two callers can be transcribed separately and merged together based on the timing of each utterance.
 If two speakers are talking, each speaker's speech can be transcribed individually.
 Custom Vocabularies:
 Vocabulary lists: special words, e.g. names or acronyms.
 Vocabulary tables: words plus how to pronounce them.

4.2.4 Amazon Polly


 Natural text to speech with many voices and languages
 Integrated and used by Alexa under the hood.
 Lexicons:
 Customize the pronunciation of specific words, e.g. “W3C” read as “World Wide Web Consortium”.
 It is used with acronyms.
 For any document
 SSML (Speech Synthesis Markup Language)
 It gives more control over emphasis and pronunciation, breathing, whispering, speech rate, pitch and pauses to make speech more natural.
 It applies to the current document.
 Speech Marks
 Speech marks are metadata describing where words and sentences start and end in the generated audio.
 This is used in lip-syncing for character animation.

4.2.5 Amazon Forecast


 Time series analysis
 Fully managed service to deliver accurate forecasts with ML.
 It uses AutoML to choose the best model for your data.
 Supported models are: ARIMA, DeepAR, ETS, NPTS and Prophet.
 Can combine your time series with associated data to find relationships between multiple time series, organized through dataset groups, predictors and forecasts.

 Amazon Forecast can increase your forecasting accuracy by automatically ingesting local weather information on demand.
 Use cases: Inventory planning, financial planning and Resources planning.

How it works?
 Datasets: collections of your input data.
 Dataset groups: collections of datasets that contain complementary information.
 Predictors: Custom models trained on your data.
 Forecast: You can generate forecasts for your time series data, query them using Forecast
API.

4.2.6 Amazon Lex


 Billed as the inner workings of Alexa.
 Natural Chatbot engine.
 The bot is built around intents.
 Can be deployed to AWS, mobile, Facebook, messenger, Slack and Twilio.
 It is using transcribe and Polly for speech to text and text to speech.
 Amazon transcribe to convert customer speech to text.
 Polly to say the response to the customer.

How it works?
 Utterances invoke intents, e.g. “I want a pizza.”
 Lambda functions are invoked to fulfill intents.
 Slots specify extra information needed by the intent, e.g. “What size?”, “What toppings?” and “Do you need crust?”

4.2.7 Amazon Rekognition


 Computer vision
 Object and scene detection
 You can use your own face collection.
 Image moderation: detect if the image contain any offensive content.
 Facial Analysis: make analysis to the face by detecting Gender, Age, emotion, glasses, and
face expression.
 Celebrity recognition: entertainment, Sport, politics in images and videos.
 Text in image: Extract text from images.
 Video Analysis


 Marks the timeline when detecting objects, faces or celebrities.
 People pathing: track a person through the video.
 Images can be uploaded from S3, or image bytes can be provided with the request.
 Face recognition depends on good lighting, angle, visibility of the eyes and resolution.
 Video must come from a Kinesis video stream.
 The video should be H.264 encoded at 5 to 30 FPS. Resolution matters more than FPS.
 You can trigger a Lambda function to run image analysis upon image upload.
 Define custom labels and train Rekognition on custom labels.

4.2.8 Amazon Personalize


It exposes Amazon's recommendation technology as a web service.
Amazon Personalize can make recommendations based on real-time event data only, historical
event data only (see Importing bulk records), or a mixture of both. Record events in real-time so
Amazon Personalize can learn from your user’s most recent activity and update
recommendations as they use your application. This keeps your interactions data fresh and
improves the relevance of Amazon Personalize recommendations.
You can record real-time events using the AWS SDKs, AWS Amplify or AWS Command Line
Interface (AWS CLI). When you record events, Amazon Personalize appends the event data to the
Interactions dataset in your dataset group.

Amazon Personalize workflow


1. Determine your use case
Before you use Amazon Personalize, determine your use case to identify what recipe to use to
train your model, and what data to import into Amazon Personalize. Recipes are Amazon
Personalize algorithms that are prepared for different use cases. To get started providing
personalized experiences for your users, choose your use case from the following and note its
corresponding recipe type.
- Recommending items for users (USER_PERSONALIZATION recipes)
- Ranking items for a given user (PERSONALIZED_RANKING recipes)
- Recommending similar items (RELATED_ITEMS recipes)

2. Import data


You import item, user, and interaction records into Amazon Personalize datasets. You can choose
to import records in bulk, or incrementally, or both. With incremental imports, you can add one
or more historical records or import data from real-time user activity.
The data that you import depends on your use case. For information about the types of data that
you can import, see Datasets and schemas and the sections on each dataset type (Interactions
dataset, Items dataset, Users dataset).

3. Train a model
After you've imported your data, Amazon Personalize uses it to train a model. In Amazon
Personalize, you start training by creating a solution, where you specify your use case by choosing
an Amazon Personalize recipe. Then you create a solution version, which is the trained model that
Amazon Personalize uses to generate recommendations.

4. Deploy a model (for real-time recommendations)


After Amazon Personalize finishes creating your solution version (trained model), you deploy it in
a campaign. A campaign creates and manages a recommendation API that you use in your
application to request real-time recommendations from your custom model. For more
information about deploying a model. For batch recommendations, you don't need to create a
campaign.

5. Get recommendations
Get recommendations in real-time or as part of a batch workflow with purely historical data. Get
real-time recommendations when you want to update recommendations as customers use your
application. Get batch recommendations when you do not require real-time updates

6. Refresh your data and repeat


Keep your item and user data current, record new interaction data in real-time, and re-train your
model on a regular basis. This allows your model to learn from your user’s most recent activity
and sustains and improves the relevance of recommendations.

Requirements for recording events and training a model


To record events, you need the following:
- A dataset group that includes an Interactions dataset, which can be empty. If you went
through the Getting started guide, you can use the same dataset group and dataset that


you created. For information on creating a dataset group and a dataset, see Preparing and
importing data.
- An event tracker.
- A call to the PutEvents operation.
You can start out with an empty Interactions dataset and, when you have recorded enough data,
train the model using only new recorded events. The minimum data requirements to train a
model are:
- 1000 records of combined interaction data (after filtering by eventType and
eventValueThreshold, if provided)
- 25 unique users with at least 2 interactions each
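A minimal sketch of the PutEvents call mentioned above (the tracking ID, user/session IDs and item are assumptions):

```python
from datetime import datetime
import json
import boto3

pe = boto3.client("personalize-events")
pe.put_events(
    trackingId="<event-tracker-tracking-id>",   # from your event tracker
    userId="user-123",                          # hypothetical
    sessionId="session-456",
    eventList=[{
        "eventType": "click",
        "sentAt": datetime.now(),
        # itemId and other properties go in a JSON string.
        "properties": json.dumps({"itemId": "item-789"}),
    }],
)
```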

How real-time events influence recommendations


Once you create a campaign, Amazon Personalize automatically uses new recorded event data
for existing items (items you included in the data you used to train the latest model) when
generating recommendations for the user. This does not require retraining the model (unless you
are using the SIMS or Popularity-Count recipes).

Instead, Amazon Personalize adds the new recorded event data to the user's history. Amazon
Personalize then uses the modified data when generating recommendations for the user (and
this user only).

- For recorded events for new items (items you did not include in the data you used to train
the model), if you trained your model (solution version) with the User-Personalization
recipe, Amazon Personalize automatically updates the model every two hours, and after
each update the new items influence recommendations. See User-Personalization recipe.
- For any other recipe, you must re-train the model for the new records to influence
recommendations. Amazon Personalize stores recorded events for new items and, once
you create a new solution version (train a new model), this new data will influence Amazon
Personalize recommendations for the user.
- For recorded events for new users (users that were not included in the data you used to
train the model), recommendations will initially be for popular items only.
Recommendations will be more relevant as you record more events for the user. Amazon
Personalize stores the new user data, so you can also retrain the model for more relevant
recommendations.


- For new, anonymous users (users without a userId), Amazon Personalize uses the sessionId
you pass in the PutEvents operation to associate events with the user before they log in.
This creates a continuous event history that includes events that occurred when the user
was anonymous.

4.2.9 Amazon Textract


 OCR with forms, fields and table support.

4.2.10 Amazon DeepRacer


 Reinforcement learning powered by 1/18 scale race car.
 Used for education.

4.2.11 DeepLens
 Deep learning enabled video camera.
 Integrated with SageMaker, Rekognition, Tensorflow and MXNet.
 You can use IoT Greengrass to deploy a pre-trained model.
 You can use SageMaker Neo.
 Do deep learning at the edge.

4.2.12 AWS DeepComposer


 AI powered keyboard
 Compose a melody into an entire song using AI.
 Used for education purposes.

4.2.13 Amazon Fraud Detector


 Upload Historical fraud data.
 Build a custom model from a template you choose.
 Expose an API for your online application
 Use cases:
 New Accounts
 Guest Checkout
 “Try before you buy” abuse
 Online payment


4.2.14 Amazon CodeGuru


 Automatic code review
 Finds code that hurt performance
 Resource leaks and race conditions
 Offers specific recommendations
 Powered by ML
 Currently for Java only

4.2.15 Contact Lens for Amazon Connect


 For customer support call centers
 Ingest audio data from recorded calls
 Allow search in calls/chats.
 Sentiment analysis
 Find utterances that correlate with successful calls
 Categorize calls
 Measure talk speed and interruption
 Theme detection

4.2.16 Amazon Kendra


 Enterprise Search with natural language i.e. “How do I connect to my VPN?”
 Combines data from different sources into one searchable repository:
 File systems
 Share point
 Intranet
 JDBC/S3
 ML Powered
 Alexa’s sister

4.2.17 Amazon Augmented AI (A2I)


 Human Review for ML predictions
 Very similar to ground truth
 Integrates with SageMaker
 Build workflow for less confident predictions
 Can access Mechanical Turk workforce or vendors

 Integrate with Textract and Rekognition

4.2.18 Put all together


 Build your own Alexa
 Transcribe → Lex → Polly
 Make a universal translator
 Transcribe → Translate → Polly
 Build a Jeff detector (detect a person)
 DeepLens → Rekognition
 Does your call make you happy?
 Transcribe → Comprehend


4.3 AWS IoT for Predictive Maintenance


4.3.1 IoT Greengrass
 You can deploy Neo-compiled models to edge devices using IoT Greengrass.
 Inference at the edge with local data, using a model trained in the cloud.
 Greengrass uses Lambda functions for inference applications.

4.3.2 Use case


The interest in machine learning for industrial and manufacturing use cases on the edge is
growing. Manufacturers need to know when a machine is about to fail so they can better plan for
maintenance. For example, as a manufacturer, you might have a machine that is sensitive to
various temperature, velocity, or pressure changes. When these changes occur, they might
indicate a failure.
Typically, an ML model is built for each type of machine or sub-process using its unique data and
features. This leads to an expansive set of ML models that represents each of the critical
machines in the manufacturing process and different types of predictions desired. Although the
ML model supports inference of new data sent to the AWS Cloud, you can also perform the
inference on premises, where latency is much lower. This results in a more real-time evaluation
of the data. Performing local inference also saves costs related to the transfer of what could be
massive amounts of data to the cloud.
The AWS services used to build and train ML models for automated deployment to the edge
make the process highly scalable and easy to do. You collect data from the machines or
infrastructure that you want to make predictions on and build ML models using AWS services in
the cloud. Then you transfer the ML models back to the on-premises location where they are
used with a simple AWS Lambda function to evaluate new data sent to a local server running
AWS Greengrass.
AWS Greengrass lets you run local compute, messaging, ML inference, and more. It includes a
lightweight IoT broker that you run on your own hardware close to the connected equipment.
The broker communicates securely with many IoT devices and is a gateway to AWS IoT Core
where selected data can be further processed. AWS Greengrass can also execute AWS Lambda
functions to process or evaluate data locally without an ongoing need to connect to the cloud.


4.4 Security
4.4.1 PrivateLink
AWS PrivateLink is a highly available, scalable technology that enables you to privately connect
your VPC to:
 Supported AWS services
 Services hosted by other AWS accounts (VPC endpoint services)
 Supported AWS Marketplace partner services.
You do not need to use any of the following to use PrivateLink service:
 internet gateway
 NAT device
 public IP address
 DirectConnect
 AWS Site-to-Site VPN connection

4.4.2 VPC Endpoints


A VPC endpoint enables private connections between your VPC and supported AWS services and
VPC endpoint services powered by AWS PrivateLink.
 AWS PrivateLink is a technology that enables you to privately access services by using
private IP addresses.
 Traffic between your VPC and the other service does not leave the Amazon network.
A VPC endpoint does not require:
 Internet gateway
 Virtual private gateway
 NAT device
 VPN connection
 AWS Direct Connect connection.
Instances in your VPC do not require public IP addresses to communicate with resources in the
service.
VPC endpoints are virtual devices. They are horizontally scaled, redundant, and highly available
VPC components. They allow communication between instances in your VPC and services
without imposing availability risks.
There are 3 types of VPC endpoints:


Interface endpoints
An interface endpoint is an elastic network interface with a private IP address from the IP address
range of your subnet. It serves as an entry point for traffic destined to a supported AWS service
or a VPC endpoint service. Interface endpoints are powered by AWS PrivateLink.

Gateway Load Balancer endpoints


A Gateway Load Balancer endpoint is an elastic network interface with a private IP address from
the IP address range of your subnet. Gateway Load Balancer endpoints are powered by AWS
PrivateLink. This type of endpoint serves as an entry point to intercept traffic and route it to a
service that you've configured using Gateway Load Balancers, for example, for security
inspection. You specify a Gateway Load Balancer endpoint as a target for a route in a route table.
Gateway Load Balancer endpoints are supported for endpoint services that are configured for
Gateway Load Balancers only.

Gateway endpoints
A gateway endpoint is for the following supported AWS services:
 Amazon S3
 DynamoDB

4.4.3 VPC endpoint services (AWS PrivateLink)


You can create your own application in your VPC and configure it as an AWS PrivateLink-powered
service (referred to as an endpoint service). Other AWS principals can create a connection from
their VPC to your endpoint service using an interface VPC endpoint.

4.4.4 Bucket policy and VPC endpoint


You can use Amazon S3 bucket policies to control access to buckets from specific virtual private
cloud (VPC) endpoints, or specific VPCs. This section contains example bucket policies that can be
used to control Amazon S3 bucket access from VPC endpoints.
A VPC endpoint for Amazon S3 is a logical entity within a VPC that allows connectivity only to
Amazon S3. The VPC endpoint routes requests to Amazon S3 and routes responses back to the
VPC. VPC endpoints change only how requests are routed. Amazon S3 public endpoints and DNS
names will continue to work with VPC endpoints.


VPC endpoints for Amazon S3 provide two ways to control access to your Amazon S3 data:
 You can control which VPCs or VPC endpoints have access to your buckets by using
Amazon S3 bucket policies.
- Restricting access to a specific VPC endpoint
- Restricting access to a specific VPC
 You can control the requests, users, or groups that are allowed through a specific VPC
endpoint as in the next section.
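For example, a hedged sketch of a bucket policy that denies access unless requests arrive through a specific VPC endpoint, applied with boto3 (the bucket name and endpoint ID are assumptions):

```python
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnlessFromVpce",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            "arn:aws:s3:::my-ml-bucket",        # hypothetical bucket
            "arn:aws:s3:::my-ml-bucket/*",
        ],
        # aws:SourceVpce restricts access to one VPC endpoint.
        "Condition": {"StringNotEquals": {"aws:SourceVpce": "vpce-1a2b3c4d"}},
    }],
}
boto3.client("s3").put_bucket_policy(Bucket="my-ml-bucket",
                                     Policy=json.dumps(policy))
```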

4.4.5 AWS Site to Site


By default, instances that you launch into an Amazon VPC can't communicate with your own
(remote) network. You can enable access to your remote network from your VPC by creating an
AWS Site-to-Site VPN (Site-to-Site VPN) connection, and configuring routing to pass traffic
through the connection.

Although the term VPN connection is a general term, in this documentation, a VPN connection
refers to the connection between your VPC and your own on-premises network. Site-to-Site VPN
supports Internet Protocol security (IPsec) VPN connections.

4.4.6 Control access to services with VPC endpoints


4.4.6.1 Use VPC endpoint policies
A VPC endpoint policy is an IAM resource policy that you attach to an endpoint when you create
or modify the endpoint. If you do not attach a policy when you create an endpoint, we attach a
default policy for you that allows full access to the service. If a service does not support endpoint
policies, the endpoint allows full access to the service. An endpoint policy does not override or
replace IAM user policies or service-specific policies (such as S3 bucket policies). It is a separate
policy for controlling access from the endpoint to the specified service.
You cannot attach more than one policy to an endpoint. However, you can modify the policy at
any time.

4.4.6.2 Security groups


When you create an interface endpoint, you can associate security groups with the endpoint
network interface that is created in your VPC. If you do not specify a security group, the default
security group for your VPC is automatically associated with the endpoint network interface. You


must ensure that the rules for the security group allow communication between the endpoint
network interface and the resources in your VPC that communicate with the service.

4.4.7 SageMaker notebook instance networking


Default communication with the internet
When your notebook allows direct internet access, SageMaker provides a network interface that
allows the notebook to communicate with the internet through a VPC managed by SageMaker.
Traffic within your VPC's CIDR goes through an elastic network interface created in your VPC. All other traffic goes through the network interface created by SageMaker, which is essentially the public internet. Traffic to gateway VPC endpoints like Amazon S3 and DynamoDB goes through the public internet, while traffic to interface VPC endpoints still goes through your VPC. If you want to use gateway VPC endpoints, you might want to disable direct internet access.

VPC communication with the internet


To disable direct internet access, you can specify a VPC for your notebook instance. By doing so,
you prevent SageMaker from providing internet access to your notebook instance. As a result,
the notebook instance can't train or host models unless your VPC has an interface endpoint (AWS
PrivateLink) or a NAT gateway and your security groups allow outbound connections.

Security and Shared Notebook Instances


A SageMaker notebook instance is designed to work best for an individual user. It is designed to
give data scientists and other users the most power for managing their development
environment.
A notebook instance user has root access for installing packages and other pertinent software.
We recommend that you exercise judgement when granting individuals access to notebook
instances that are attached to a VPC that contains sensitive information.

Amazon SageMaker notebook instances can be launched with or without your Virtual Private
Cloud (VPC) attached. When launched with your VPC attached, the notebook can either be
configured with or without direct internet access.


Using the Amazon SageMaker console, these are the three options:
1. No customer VPC is attached.
No VPC configured and internet check box is checked

In this configuration, all the traffic goes through the single network interface. The notebook
instance is running in an Amazon SageMaker managed VPC as shown in the above diagram.

2. Customer VPC is attached with direct internet access.


VPC configured and internet check box is checked

In this configuration, the notebook instance needs to decide which network traffic should go
down either of the two network interfaces.

3. Customer VPC is attached without direct internet access.


VPC configured and internet check box is not checked


IMPORTANT NOTE: In this configuration, the notebook instance can still be


configured to access the internet. The network interface that gets launched only
has a private IP address. What that means is that it needs to either be in a private
subnet with a NAT or to access the internet back through a virtual private
gateway. If launched into a public subnet, it won’t be able to speak to the internet
through an internet gateway (IGW).

NOTE: If SageMaker requests data from S3 and the bucket is encrypted, SageMaker needs a role with permission to use the decryption key, so both the role and the key must be made known to SageMaker.

4.4.8 Network Isolation


Run Training and Inference Containers in Internet-Free Mode
SageMaker training and deployed inference containers are internet-enabled by default. This
allows containers to access external services and resources on the public internet as part of your
training and inference workloads. However, this could provide an avenue for unauthorized access
to your data. For example, a malicious user or code that you accidentally install on the container
(in the form of a publicly available source code library) could access your data and transfer it to a
remote host.
If you use an Amazon VPC by specifying a value for the VpcConfig parameter when you call
CreateTrainingJob, CreateHyperParameterTuningJob, or CreateModel, you can protect your data
and resources by managing security groups and restricting internet access from your VPC.
However, this comes at the cost of additional network configuration, and has the risk of
configuring your network incorrectly. If you do not want SageMaker to provide external network
access to your training or inference containers, you can enable network isolation.

Network Isolation
You can enable network isolation when you create your training job or model by setting the value
of the EnableNetworkIsolation parameter to true when you call CreateTrainingJob,
CreateHyperParameterTuningJob, or CreateModel.
If you enable network isolation, the containers can't make any outbound network calls, even to
other AWS services such as Amazon S3. Additionally, no AWS credentials are made available to
the container runtime environment. In the case of a training job with multiple instances, network
inbound and outbound traffic is limited to the peers of each training container. SageMaker still
performs download and upload operations against Amazon S3 using your SageMaker execution
role in isolation from the training or inference container.
The following managed SageMaker containers do not support network isolation because they
require access to Amazon S3:
- Chainer
- PyTorch
- Scikit-learn
- SageMaker Reinforcement Learning
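A minimal sketch of enabling network isolation through the SageMaker Python SDK (the image, role and instance settings are assumptions):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=training_image,            # assumed to be defined elsewhere
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    enable_network_isolation=True,       # container can make no outbound calls
)
```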
Network isolation with a VPC
Network isolation can be used in conjunction with a VPC. In this scenario, the download and
upload of customer data and model artifacts are routed through your VPC subnet. However, the
training and inference containers themselves continue to be isolated from the network, and do
not have access to any resource within your VPC or on the internet.

4.4.9 Private packages


Although you can disable direct internet access to SageMaker Studio notebooks and notebook instances, you need to ensure that your data scientists can still gain access to popular packages.
Therefore, you may choose to build your own isolated dev environments that contain your choice
of packages and kernels.
You can use one of the following methods:
 Use Conda channel paths to a private repository where your packages are stored.
- To build such a custom channel, create a bucket in Amazon S3.
- Copy the packages into the bucket.
These packages can be either packages approved within the organization or custom packages built using conda build. These packages need to be indexed periodically, or as soon as there is an update; the methods for indexing packages are out of scope here.


 AWS CodeArtifact, is a fully managed artifact repository that makes it easy for organizations of
any size to securely store, publish, and share software packages used in your software
development process.

4.4.10 Secure Deployment

There are two main methods of implementing controls to improve the security of AWS services
during deployment. One of them is preventive and uses controls to stop an event from occurring.
The other is responsive, and uses controls that are applied in response to events.

Preventive controls protect workloads and mitigate threats and vulnerabilities. A couple of
approaches to implement preventive controls are:

 Use IAM condition keys supported by the service to ensure that resources without
necessary security controls cannot be deployed.
 Use the AWS Service Catalog to invoke AWS CloudFormation templates that deploy
resources with all the necessary security controls in place.

Responsive controls drive remediation of potential deviations from security baselines. An


approach to implement responsive controls is:

 Use CloudWatch Events to catch resource creation events, then use a Lambda function to validate that resources were deployed with the necessary security controls, or terminate any resources if the necessary security controls aren't present.

4.4.11 Protect communication in distributed training job


By default, Amazon SageMaker runs training jobs in an Amazon Virtual Private Cloud (Amazon
VPC) to help keep your data secure. You can add another level of security to protect your training
containers and data by configuring a private VPC. Distributed ML frameworks and algorithms
usually transmit information that is directly related to the model such as weights, not the training
dataset. When performing distributed training, you can further protect data that is transmitted
between instances. This can help you to comply with regulatory requirements. To do this, use
inter-container traffic encryption.

Enabling inter-container traffic encryption can increase training time, especially if you are using
distributed deep learning algorithms. Enabling inter-container traffic encryption doesn't affect

training jobs with a single compute instance. However, for training jobs with several compute
instances, the effect on training time depends on the amount of communication between
compute instances. For affected algorithms, adding this additional level of security also increases
cost. The training time for most SageMaker built-in algorithms, such as XGBoost, DeepAR, and
linear learner, typically aren't affected.

You can enable inter-container traffic encryption for training jobs or hyperparameter tuning jobs.
You can use SageMaker APIs or console to enable inter-container traffic encryption.
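A minimal sketch of enabling inter-container traffic encryption for a distributed training job with the SageMaker Python SDK (the image, role and instance settings are assumptions):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=training_image,                  # assumed to be defined elsewhere
    role=role,
    instance_count=2,                          # distributed training
    instance_type="ml.p3.2xlarge",
    encrypt_inter_container_traffic=True,      # encrypt traffic between instances
)
```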

4.4.12 AI Services opt-out policies (AWS Organization)


Certain AWS artificial intelligence (AI) services, may store and use customer content processed by
those services for the development and continuous improvement of Amazon AI services and
technologies.
As an AWS customer, you can choose to opt out of having your content stored or used for service
improvements.
Instead of configuring this setting individually for each AWS account that your organization uses,
you can configure an organization policy that enforces your setting choice on all accounts that
are members of the organization.
You can choose to opt out of content storage and use for an individual AI service, or for all of the
covered services at once.
You can query the effective policy applicable to each account to see the effects of your setting
choices.

Effective AI services
The effective AI services opt-out policy specifies the final rules that apply to an AWS account. It is
the aggregation of any AI services opt-out policies that the account inherits, plus any AI services
opt-out policies that are directly attached to the account. When you attach an AI services opt-out
policy to the organization's root, it applies to all accounts in your organization. When you attach
an AI services opt-out policy to an OU, it applies to all accounts and OUs that belong to the OU.
When you attach a policy directly to an account, it applies only to that one AWS account.
For example, the AI services opt-out policy attached to the organization root might specify that
all accounts in the organization opt out of content use by all AWS machine learning services. A

separate AI services opt-out policy attached directly to one member account specifies that it opts
in to content use for only Amazon Rekognition. The combination of these AI services opt-out
policies comprises the effective AI services opt-out policy. The result is that all accounts in the
organization are opted out of all AWS services, with the exception of one account that opts in to
Amazon Rekognition.

You can view the effective AI services opt-out policy for an account from the AWS Management
Console, AWS API, or AWS Command Line Interface.


4.5 Deploy and operationalize ML solutions

Figure 108: Production Infrastructure

4.5.1 Deployment Management


Unmanaged Deployment
You are responsible for the deployment
 Create an AMI containing your model artifacts.
 Launch one or more EC2 instances from this AMI.
 Configure the automatic scaling options necessary to scale.

Managed Deployment
 Provides Deploy with one click or a single API call.
 Auto scaling
Step 1: Create the model
 Use the CreateModel API.
 Name the model and tell Amazon SageMaker where it is stored.
 Use this if you’re hosting on Amazon SageMaker or running a batch job.
Step 2: Create an HTTPS endpoint configuration
 Use the CreateEndpointConfig API.
 Associate it with one or more created models.
 Set one or more configurations (production variants) for each model.


 For each production variant, specify the instance type and initial count, and set its initial weight (how much traffic it receives).

Step 3: Deploy an HTTPS endpoint based on the endpoint configuration

 Use the CreateEndpoint API.
 Specify the endpoint configuration, model name and any tags you want to add.
Deploy and host via SDK
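A minimal sketch of the three steps with boto3 (the names, image URI and artifact path are assumptions):

```python
import boto3

sm = boto3.client("sagemaker")

# Step 1: create the model.
sm.create_model(
    ModelName="my-model",                                  # hypothetical
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    PrimaryContainer={"Image": image_uri,
                      "ModelDataUrl": "s3://my-bucket/model.tar.gz"},
)

# Step 2: create the endpoint configuration with one production variant.
sm.create_endpoint_config(
    EndpointConfigName="my-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    }],
)

# Step 3: deploy the HTTPS endpoint from the configuration.
sm.create_endpoint(EndpointName="my-endpoint", EndpointConfigName="my-config")
```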


4.5.2 Deployment Options


Blue/Green Deployment
The blue/green deployment technique provides two identical production environments. You can
use this technique when you need to deploy a new version of the model to production. As shown
in the figure, this technique requires two identical environments:
 A live production environment (blue) that runs version n,
 An exact copy of this environment (green) that runs version n+1.
SageMaker Steps
1. Create a new endpoint configuration, using the same production variants for the existing
live model and for the new model.
2. Update the existing live endpoint with the new endpoint configuration. Amazon
SageMaker creates the required infrastructure for the new production variant and updates
the weights without any downtime.
3. Switch traffic to the new model through an API call.
4. Create a new endpoint configuration with only the new production variant and apply it to
the endpoint. Amazon SageMaker terminates the infrastructure for the previous
production variant.
In this approach, all live inference traffic is served by either the old or the new model at any
given point. However, before directing live traffic to the new model, synthetic traffic is used to
test and validate the new model.
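The traffic switch in step 3 can be done with a single API call; a sketch with boto3 (the variant names are hypothetical):

    import boto3

    sm = boto3.client("sagemaker")

    # Shift all live traffic from the old (blue) variant to the new (green) one.
    # Each variant's traffic share is its weight divided by the sum of all weights.
    sm.update_endpoint_weights_and_capacities(
        EndpointName="my-endpoint",
        DesiredWeightsAndCapacities=[
            {"VariantName": "blue-v1", "DesiredWeight": 0.0},
            {"VariantName": "green-v2", "DesiredWeight": 1.0},
        ],
    )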
Canary Deployment
A/B testing is similar to canary testing, but has larger user groups and a longer time scale,
typically days or even weeks. For this type of testing, Amazon SageMaker endpoint configuration
uses two production variants: one for model A, and one for model B. For a fair comparison of two
models, begin by configuring the settings for both models to balance traffic between the models
equally (50/50) and make sure that both models have identical instance configurations. This
initial setting is necessary so that neither version of the model is affected by differences in
traffic patterns or in the underlying compute capacity.
After you have monitored the performance of both models with the initial setting of equal
weights, you can either gradually change the traffic weights to put the models out of balance
(60/40, 80/20, etc.), or you can change the weights in a single step, continuing until a single
model is processing all of the live traffic.
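A sketch of such a 50/50 A/B endpoint configuration with boto3 (model and variant names are hypothetical); each variant's traffic share is its weight divided by the sum of all weights:

    import boto3

    sm = boto3.client("sagemaker")

    # Two production variants with equal weights and identical instance settings.
    sm.create_endpoint_config(
        EndpointConfigName="ab-test-config",
        ProductionVariants=[
            {"VariantName": "model-a", "ModelName": "model-a",
             "InstanceType": "ml.m5.xlarge", "InitialInstanceCount": 1,
             "InitialVariantWeight": 0.5},
            {"VariantName": "model-b", "ModelName": "model-b",
             "InstanceType": "ml.m5.xlarge", "InitialInstanceCount": 1,
             "InitialVariantWeight": 0.5},
        ],
    )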
With canary testing, you can validate a new release with minimal risk:
1. First deploy the new release to a small group of your users. Other users continue to use
the previous version.
2. Monitor the new release until you are satisfied that it performs as expected.
3. Gradually roll the new release out to all users, scaling endpoints up and down
accordingly.
4.5.3 Inference Types
Batch Inference
 Model is used through batch prediction jobs
 Inferences are returned in batches
 Input data has multiple rows
 Results are available on a scheduled basis, when the job is done
Figure 109: Batch Inference
Real Time Inference
 Model is available all the time
 Inferences are returned in real time
 Input is a single observation of data
 Results are available immediately, as users interact in real time
Figure 110: Real time inference
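A sketch of both patterns on Amazon SageMaker with boto3 (job, model, and endpoint names, as well as S3 paths and the sample payload, are hypothetical):

    import boto3

    sm = boto3.client("sagemaker")
    runtime = boto3.client("sagemaker-runtime")

    # Batch inference: score a whole dataset offline with a batch transform job.
    sm.create_transform_job(
        TransformJobName="nightly-scoring",
        ModelName="my-model",
        TransformInput={
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/batch-input/",
            }},
            "ContentType": "text/csv",
        },
        TransformOutput={"S3OutputPath": "s3://my-bucket/batch-output/"},
        TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
    )

    # Real-time inference: send a single observation to a live endpoint.
    response = runtime.invoke_endpoint(
        EndpointName="my-endpoint",
        ContentType="text/csv",
        Body="5.1,3.5,1.4,0.2",
    )
    print(response["Body"].read())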
4.5.4 Instance Types
In addition to the traditional auto scaling of ML compute instances for cost savings, consider the
trade-off between CPU and GPU instances. While deep learning-based models require high-powered
GPU instances for training, inferences against those models do not typically need the full
power of a GPU. Hosting deep learning models on a full-fledged GPU may therefore lead to
underutilization and unnecessary costs. Amazon Elastic Inference enables you to attach low-cost,
GPU-powered acceleration to Amazon EC2 and Amazon SageMaker instances to reduce the cost
of running deep learning inferences. Standalone GPU instances are designed for model training
and are typically oversized for inference. Even though training jobs batch process hundreds of
data samples in parallel, most inference happens on a single input in real time and consumes
only a small amount of GPU compute. Amazon Elastic Inference solves this problem by allowing
you to attach the appropriate amount of GPU-powered inference acceleration to any Amazon
EC2 or Amazon SageMaker instance type, with no code changes.
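A sketch of attaching an Elastic Inference accelerator to a SageMaker hosted endpoint through its endpoint configuration (the names and the accelerator size are hypothetical; the production variant gains an AcceleratorType in addition to its CPU host instance):

    import boto3

    sm = boto3.client("sagemaker")

    # Host the deep learning model on an inexpensive CPU instance and attach
    # just enough GPU-powered acceleration for inference.
    sm.create_endpoint_config(
        EndpointConfigName="ei-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": "my-dl-model",
            "InstanceType": "ml.m5.xlarge",       # CPU host instance
            "InitialInstanceCount": 1,
            "AcceleratorType": "ml.eia2.medium",  # Elastic Inference accelerator
        }],
    )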
5. Appendices
5.1 Algorithms Input Formats
No. Model Input Format
1 Linear Learner  RecordIO-wrapped protobuf
- Float32 data only!
 CSV
- First column assumed to be the label
 File or Pipe mode both supported
2 K Nearest Neighbors  Train channel contains your data
- Test channel emits accuracy or MSE
 recordIO-protobuf or CSV training
- First column is label
 File or pipe mode on either
3 K-Means  recordIO-protobuf or CSV
 File or Pipe on either
 Train channel, optional test
- Train ShardedByS3Key, test Fully Replicated
4 Principal Component  recordIO-protobuf or CSV
Analysis (PCA)  File or Pipe on either
5 XGBoost  CSV or libsvm input.
 recordIO-protobuf and Parquet as well
6 IP Insights  User names and account IDs can be fed in directly; no need to pre-process
 Training channel, optional validation (computes AUC score)
 CSV only (Entity, IP)
7 Factorization Machines  recordIO-protobuf with Float32
 Sparse data means CSV isn’t practical
8 Object Detection  RecordIO or image format (JPG or PNG)
- JSON file for annotation data for each image
9 Image Classification  Apache MXNet RecordIO
- Not protobuf
 Supports both RecordIO (application/x-recordio) and image
(image/png, image/jpeg, and application/x-image) content types for training in
file mode.
- Image format requires .lst files to associate image index, class label, and path
to the image
 Supports the RecordIO (application/x-recordio) content type for training in pipe
mode.
- Augmented Manifest Image Format enables Pipe mode
 The algorithm supports image/png, image/jpeg, and application/x-image for
inference.
10 Semantic Segmentation  JPG Images and PNG annotations
 For both training and validation
 Label maps to describe annotations
 Augmented manifest image format supported for Pipe mode.
 JPG images accepted for inference
11 Blazing Text For supervised mode (Text Classification):
 One sentence per line
 First “word” in the sentence is the string __label__ followed by the label
 Augmented manifest text format
 Text should be pre-processed
For Word2Vec:
 Just wants a text file with one training sentence per line.
12 Seq2Seq  RecordIO-Protobuf
- Tokens must be integers
- For example indices into vocabulary files
 Start with tokenized text files; you need to build a vocabulary file that maps every word
to a number.
- You should provide the vocabulary file and the tokenized text files
 Convert to protobuf using sample code
- Packs into integer tensors with vocabulary files
 Must provide training data, validation data, and vocabulary files
13 Object2Vec  Data must be tokenized into integers
 Training data consists of pairs of tokens and/or sequences of tokens
- Sentence – sentence
- Label-sequence (for example, genre to description)
- Customer-customer
- Product-product
- User-item
14 Neural Topic Model  recordIO-protobuf or CSV
 File or pipe mode
 Four data channels
- “train” is required
- “validation”, “test”, and “auxiliary” optional
 Words must be tokenized into integers
- Every document must contain a count for every word in the vocabulary in CSV
- The “auxiliary” channel is for the vocabulary
15 Latent Dirichlet  RecordIO-protobuf or CSV
Allocation (LDA)  Data must be tokenized first: pass a list of integer tokens representing each
word, along with how often each word occurs in each individual document, not
the raw documents themselves.
 Pipe mode only supported with RecordIO
 In CSV format, each document has counts for every word in the vocabulary
 Train channel, optional test channel (this is an unsupervised algorithm)
16 DeepAR  JSON Lines format
- Gzip-compressed or Parquet also supported
 Each record must contain:
- Start: the starting timestamp
- Target: the time series values
 Each record can contain:
- Dynamic_feat: dynamic features (such as whether a promotion was applied to a
product in a time series of product purchases)
- Cat: categorical features
(See the sample record after this table.)
17 Random Cut Forest  RecordIO-protobuf or CSV
 Can use File or Pipe mode on either
 Optional test channel for computing accuracy, precision, recall, and F1 on labeled
data (anomaly or not)
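As referenced in row 16, a sketch of a single DeepAR training record in JSON Lines format, written with Python (all values are hypothetical):

    import json

    record = {
        "start": "2021-01-01 00:00:00",      # required: starting timestamp
        "target": [5.0, 7.2, 6.1, 8.4],      # required: the time series values
        "cat": [0],                          # optional: categorical features
        "dynamic_feat": [[1, 0, 0, 1]],      # optional: e.g., promotion applied
    }

    # One JSON object per line in the training file (optionally gzipped).
    with open("train.json", "w") as f:
        f.write(json.dumps(record) + "\n")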
5.2 Algorithm Instance Types
No. Model Training Inference
1 Linear Learner  Single or multi-machine CPU or GPU
 Multi-GPU does not help
2 K Nearest Neighbors Training: CPU or GPU
- ml.m5.2xlarge
- ml.p2.xlarge
Inference: CPU for lower latency; GPU for higher throughput on large batches
3 K-Means CPU or GPU, but CPU recommended
 Only one GPU per instance used on GPU
 use p*.xlarge if you’re going to use GPU
4 Principal Component GPU or CPU
Analysis (PCA)  It depends on the specifics of the input data
5 XGBoost  Uses CPUs only for multi-instance training
 Is memory-bound, not compute-bound
- So M5 is a good choice
 As of XGBoost 1.2, single-instance GPU training is available (e.g., P3)
- Must set tree_method hyperparameter
to gpu_hist
- Trains more quickly and can be more
cost effective
6 IP Insights CPU or GPU
 GPU recommended
 ml.p3.2xlarge or higher
 Can use multiple GPUs
 Size of CPU instance depends on
vector_dim and num_entity_vectors
7 Factorization Machines CPU or GPU
 CPU recommended
 GPU only works with dense data
8 Object Detection Training: use GPU instances (multi-GPU and multi-machine supported)
- ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge,
ml.p3.8xlarge, ml.p3.16xlarge
Inference: use CPU or GPU
- C5, M5, P2, P3
9 Image Classification GPU instances (P2, P3) either multi-GPU or CPU or GPU (C4, P2, P3)
multi-machine.
10 Semantic Segmentation Training: only GPU supported (P2 or P3), on a single machine only
- Specifically ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge,
ml.p3.8xlarge, or ml.p3.16xlarge
Inference: CPU (C5 or M5) or GPU (P2 or P3)
11 Blazing Text Training: single CPU or single GPU instance (e.g., ml.p3.2xlarge);
batch_skipgram mode for Word2Vec can use multiple CPU instances
12 Seq2Seq Can only use GPU instance types (P3 for
example)
 Can only use a single machine for training
 But can use multi-GPU’s on one machine
13 Object2Vec Training: only on a single machine (CPU or GPU, multi-GPU OK)
- ml.m5.2xlarge
- ml.p2.xlarge
- If needed, go up to ml.m5.4xlarge or ml.m5.12xlarge
Inference: use ml.p2.2xlarge
- Use the INFERENCE_PREFERRED_MODE environment variable to optimize for
encoder embeddings rather than classification or regression
14 Neural Topic Model GPU or CPU CPU for inference
- GPU recommended
15 Latent Dirichlet Single-instance CPU training
Allocation (LDA)
16 DeepAR Training: can use CPU or GPU
 Single or multiple machines
 Start with CPU (e.g., c4.2xlarge, c4.4xlarge)
 Move up to GPU if necessary; it only helps with large models
 May need larger instances when running a hyperparameter tuning job
Inference: CPU only
17 Random Cut Forest Training: use M4, C4, or C5. Inference: ml.c5.xl
5.3 Algorithm Type & Usage
No. Model Type Usage
1 Linear Learner Supervised Regression and classification
2 K Nearest Neighbors Supervised Regression and classification
Can be used in feature reduction
Filling missing values
SMOTE
3 K-Means Unsupervised Can be used for feature reduction by replacing the feature vector
with the vector holding the distance from every cluster.
4 Principal Component Unsupervised Dimension reduction.
Analysis (PCA)
5 XGBoost Supervised Regression and classification
6 IP Insights Unsupervised - Identify a user attempting to log into a web service from an
anomalous IP address
- Identify an account that is attempting to create computing
resources from an unusual IP address
7 Factorization Machines Supervised Regression and classification
It is an extension of a linear model that is designed to capture
interactions between features within high dimensional sparse
datasets economically
Factorization machines are a good choice for tasks dealing with high
dimensional sparse datasets, such as click prediction and item
recommendation.
8 Object Detection Supervised Identify all objects in an image with bounding boxes with
CNN confidence score.
9 Image Classification Supervised Assign one or more labels to an image
CNN
10 Semantic Segmentation Supervised Pixel-level object classification, built on MXNet Gluon and GluonCV.
CNN Instance segmentation goes further, identifying individual object
instances; used, for example, by autonomous vehicles that need a more
specific view of each object.
11 Blazing Text Text classification (supervised): web searches and information retrieval.
Word2vec (unsupervised): word embeddings, used for translation and
sentiment analysis. Operates on words only, not sentences or documents.
12 Seq2Seq RNN Machine Translation
Text summarization
Speech to text
13 Object2Vec Unsupervised Represents how similar objects are to each other
Compute nearest neighbors of objects
Visualize clusters
Genre prediction
Recommendations
14 Neural Topic Model Unsupervised Classify or summarize documents based on topics
Deep Learning
15 Latent Dirichlet Unsupervised Topic modeling algorithm
Allocation (LDA) Cluster customers based on purchases
Harmonic analysis in music
16 DeepAR Supervised Forecasting one-dimensional time series
17 Random Cut Forest Unsupervised Anomaly detection with anomaly score
THANK YOU