ML Certificate Preparation (Last Version)
AWS
Prepared By:
Ahmed Mohamed Elhamy
Introduction Create Data repositories for ML
Table of Contents
Introduction
References
1. Data Engineering
1.1 Create Data repositories for ML
1.1.1 Lake Formation
1.1.2 S3
1.1.3 Amazon FSx for Lustre
1.1.4 Amazon EFS
1.2 Identify and implement a data-ingestion solution
1.2.1 Apache Kafka
1.2.2 Kinesis
1.2.2.1 Kinesis Streams
1.2.2.2 Kinesis Firehose
1.2.2.3 Kinesis Analytics
1.2.2.4 Kinesis Video Streams
1.2.3 Glue
1.2.3.1 Glue Data Catalog
1.2.3.2 Crawlers
1.2.3.3 Glue ETL
1.2.3.4 Job Authoring
1.2.3.5 Job Execution
1.2.3.6 Job Workflow
1.2.4 Data Stores in Machine Learning
1.2.4.1 Redshift
1.2.4.2 RDS, Aurora
1.2.4.3 DynamoDB
1.2.4.4 ElasticSearch
1.2.4.5 ElastiCache
Introduction
This document is for any candidate who wants to pass the AWS Machine Learning Specialty certification exam.
It follows the exam preparation path recommended by AWS.
The document is structured according to the domains that Amazon states will be covered in the exam.
It covers all of those topics with full and clear explanations.
You should already have a background in machine learning; this document is intended only for exam preparation, not as a complete machine learning reference.
It covers the Amazon products and tools used in machine learning as of the end of 2021.
It does not discuss Python code for implementing machine learning algorithms.
NOTE: This document draws on many books, websites, YouTube channels, etc., as listed in the references section. All rights belong to their respective owners, with many thanks for their clear explanations.
I hope this document is helpful to you, and I wish you every success.
Thanks
Ahmed Mohamed Elhamy
References Create Data repositories for ML
References
Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow.
https://g.co/kgs/HmXTUi
AWS Certified Machine Learning Specialty 2021 - Hands On! – Udemy Course
https://www.udemy.com/share/1029De2@PW1KVGFbTFIPd0dDBXpOfhRuSlQ=/
StatQuest
https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw
DeepLizard
https://www.youtube.com/c/deeplizard
Stanford University
https://www.youtube.com/watch?v=6niqTuYFZLQ
Data Engineering Create Data repositories for ML
1. Data Engineering
1.1 Create Data repositories for ML
1.1.1 Lake Formation
AWS Lake Formation is a managed service that makes it easy to set up a secure data lake in days: it collects and catalogs data from databases and S3, and lets you centrally define security and access policies.
A data lake is a centralized repository that allows you to store all your structured and
unstructured data at any scale. You can store your data as-is, without having to first structure the
data, and run different types of analytics—from dashboards and visualizations to big data
processing, real-time analytics, and machine learning to guide better decisions.
A data warehouse is a database optimized to analyze relational data coming from transactional
systems and line of business applications. The data structure, and schema are defined in advance
to optimize for fast SQL queries, where the results are typically used for operational reporting
and analysis. Data is cleaned, enriched, and transformed so it can act as the “single source of
truth” that users can trust.
A data lake is different, because it stores relational data from line of business applications, and
non-relational data from mobile apps, IoT devices, and social media. The structure of the data or
schema is not defined when data is captured. This means you can store all of your data without
careful design or the need to know what questions you might need answers for in the future.
Different types of analytics on your data like SQL queries, big data analytics, full text search, real-
time analytics, and machine learning can be used to uncover insights.
1.1.2 S3
Amazon S3 allows people to store objects (files) in “buckets” (directories)
Buckets must have a globally unique name
Objects (files) have a Key. The key is the FULL path:
<my_bucket>/my_file.txt
<my_bucket>/my_folder1/another_folder/my_file.txt
This will be interesting when we look at partitioning
Max object size is 5TB
Object Tags (key/value pairs – up to 10) – useful for security / lifecycle
Buckets used with Amazon S3 Transfer Acceleration can't have dots (.) in their
names.
S3 Data Partitions
Pattern for speeding up range queries (ex: AWS Athena)
By Date: s3://bucket/my-data-set/year/month/day/hour/data_00.csv
By Product: s3://bucket/my-data-set/product-id/data_32.csv
You can define whatever partitioning strategy you like!
Data partitioning will be handled by some tools we use (e.g. AWS Glue and Athena)
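As a concrete illustration of the date-based pattern above, here is a minimal Python sketch (the prefix and file name are made up for the example) that builds such partitioned keys:

```python
from datetime import datetime

def partitioned_key(prefix: str, dt: datetime, filename: str) -> str:
    """Build a date-partitioned S3 key like my-data-set/2021/06/15/09/data_00.csv,
    so range queries (e.g. in Athena) can prune whole partitions."""
    return f"{prefix}/{dt.year:04d}/{dt.month:02d}/{dt.day:02d}/{dt.hour:02d}/{filename}"

key = partitioned_key("my-data-set", datetime(2021, 6, 15, 9), "data_00.csv")
# key == "my-data-set/2021/06/15/09/data_00.csv"
```

A query filtered on year/month then only reads the matching prefixes instead of scanning the whole bucket.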
S3 Storage Tiers
Amazon S3 Standard – General Purpose
Amazon S3 Standard-Infrequent Access (IA)
Amazon S3 One Zone-Infrequent Access
Amazon S3 Intelligent-Tiering
Amazon S3 Glacier
Amazon S3 Glacier provides three options for access to archives, from a few minutes to
several hours, and S3 Glacier Deep Archive provides two access options ranging from 12 to
48 hours.
S3 Life Cycle
Set of rules to move data between different tiers, to save storage cost
Example: General Purpose => Infrequent Access => Glacier
Transition actions: objects are transitioned to another storage class.
Move objects to Standard IA class 60 days after creation
And move to Glacier for archiving after 6 months
Expiration actions: S3 deletes expired objects on our behalf
Access log files can be set to delete after a specified period of time
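The example rules above can be expressed as the kind of lifecycle configuration the S3 API accepts. This is a sketch, not a definitive setup: the rule IDs and prefixes below are hypothetical, and the days values mirror the example (60 days to Standard-IA, roughly 6 months to Glacier).

```python
# Hypothetical lifecycle configuration matching the example rules above.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-my-data",                      # hypothetical rule name
            "Filter": {"Prefix": "my-data-set/"},         # hypothetical prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 60, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "GLACIER"},  # ~6 months
            ],
        },
        {
            "ID": "expire-access-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Expiration": {"Days": 365},                   # delete after a year
        },
    ]
}
# With boto3 this dict would be passed to:
# s3.put_bucket_lifecycle_configuration(Bucket="my_bucket",
#                                       LifecycleConfiguration=lifecycle_config)
```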
S3 Encryption
There are 4 methods of encrypting objects in S3
SSE-S3: encrypts S3 objects using keys handled & managed by AWS
SSE-KMS: use AWS Key Management Service to manage encryption keys
Additional security (user must have access to KMS key)
Audit trail for KMS key usage
SSE-C: when you want to manage your own encryption keys
Client Side Encryption
NOTE: From an ML perspective, SSE-S3 and SSE-KMS will most likely be used
S3 Accessibility
User based
IAM policies -which API calls should be allowed for a specific user
Sample IAM Policy
This IAM policy grants the IAM entity (user, group, or role) it is attached to permission to
perform any S3 operation on the bucket named “my_bucket”, as well as that bucket’s
contents.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my_bucket",
        "arn:aws:s3:::my_bucket/*"
      ]
    }
  ]
}
Resource Based
Bucket Policies – bucket-wide rules set from the S3 console – allow cross-account access
It is used for:
Grant public access to the bucket
Force objects to be encrypted at upload
Grant access to another account (Cross Account)
Sample S3 Bucket Policy
This S3 bucket policy enables the root account 111122223333 and the IAM user Alice
under that account to perform any S3 operation on the bucket named “my_bucket”, as well
as that bucket’s contents.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::111122223333:user/Alice",
          "arn:aws:iam::111122223333:root"
        ]
      },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my_bucket",
        "arn:aws:s3:::my_bucket/*"
      ]
    }
  ]
}
Object Access Control List (ACL) –finer grain
Bucket Access Control List (ACL) –less common
S3 Default Encryption
The old way to enable default encryption was to use a bucket policy and refuse any HTTP
command without the proper headers.
The new way is to use the “default encryption” option in S3
Note: Bucket Policies are evaluated before “default encryption”
S3 Security
Networking -VPC Endpoint Gateway:
Allow traffic to stay within your VPC (instead of going through public web)
Make sure your private services (AWS SageMaker) can access S3
Logging and Audit:
S3 access logs can be stored in another S3 bucket
API calls can be logged in AWS CloudTrail
Tagged Based (combined with IAM policies and bucket policies)
Example: Add tag Classification=PHI to your objects
Both input modes (File mode and Pipe mode) now cover the spectrum of use cases, from small
experimental training jobs to petabyte-scale distributed training jobs.
Amazon SageMaker algorithms
Most first-party Amazon SageMaker algorithms work best with the optimized protobuf recordIO
format. For this reason, this release offers Pipe mode support only for the protobuf recordIO
format. The algorithms in the following list support Pipe input mode today when used with
protobuf recordIO-encoded datasets:
- Principal Component Analysis (PCA)
- K-Means Clustering
- Factorization Machines
- Latent Dirichlet Allocation (LDA)
- Linear Learner (Classification and Regression)
- Neural Topic Modelling
- Random Cut Forest
Data Engineering Identify and implement a data-ingestion
1.2.1 Apache Kafka
Compatibility
MSK clusters are compatible with:
Kafka partition reassignment tools
Kafka APIs
Kafka admin client
3rd party tools
MSK is not compatible with:
Tools that upload .jar files, such as LinkedIn's Cruise Control, Uber's uReplicator,
Confluent Control Center, and Confluent Auto Data Balancer
1.2.2 Kinesis
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you
can get timely insights and react quickly to new information. Amazon Kinesis offers key
capabilities to cost-effectively process streaming data at any scale, along with the flexibility to
choose the tools that best suit the requirements of your application. With Amazon Kinesis, you
can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT
telemetry data for machine learning, analytics, and other applications. Amazon Kinesis enables
you to process and analyze data as it arrives and respond instantly instead of having to wait until
all your data is collected before the processing can begin.
Figure 1: Kinesis
Figure 2: Kinesis Streams – each shard supports writes of up to 1 MB/s or 1,000 messages/s,
and reads of up to 2 MB/s (5 read transactions per second per shard).
Firehose
Fully managed; sends data to S3, Redshift, ElasticSearch, and Splunk.
Serverless data transformations with Lambda
Near real time (lowest buffer time is 1 minute)
Automated Scaling
No data storage
Amazon Kinesis Data Analytics reduces the complexity of building, managing, and
integrating Apache Flink applications with other AWS services.
Pay only for resources consumed (but it’s not cheap)
Serverless; scales automatically
Use IAM permissions to access streaming source and destination(s)
SQL or Flink to write the computation
Schema discovery
Lambda can be used for pre-processing
Kinesis Data Analytics can reference tables stored in S3 buckets.
Amazon Kinesis Analytics applications can transform data before it is processed by your
SQL code. This feature allows you to use AWS Lambda to convert formats, enrich data,
filter data, and more. Once the data is transformed by your function, Kinesis Analytics
sends the data to your application’s SQL code for real-time analytics.
NOTE: Apache Flink is an open source framework and engine for processing data
streams
Pre-built functions include everything from sum and count distinct to machine learning
algorithms
Aggregations run continuously using window operators
Use cases
Streaming ETL: select columns, make simple transformations, on streaming data
Continuous metric generation: live leaderboard for a mobile game
Responsive analytics: look for certain criteria and build alerting (filtering)
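The continuous aggregations above run over time windows. As a language-agnostic sketch of the idea (pure Python, no AWS API involved), a tumbling-window count over (timestamp, key) events, like a live leaderboard, looks like this:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) in fixed, non-overlapping
    (tumbling) windows. Each event is a (timestamp_seconds, key) pair."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "player1"), (5, "player2"), (12, "player1"), (19, "player1")]
counts = tumbling_window_counts(events, window_seconds=10)
# window [0,10): player1=1, player2=1; window [10,20): player1=2
```

In Kinesis Analytics the same thing is expressed declaratively, with SQL window operators or Flink, rather than hand-rolled code.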
HOTSPOTS
Locate and return information about relatively dense regions in your data
Example: a collection of overheated servers in a data center
Producers:
Security camera, body-worn camera, AWS DeepLens, smartphone camera, audio feeds,
images, RADAR data, RTSP camera.
One producer per video stream
Video playback capability
Consumers
Build your own (MXNet, Tensorflow)
AWS SageMaker
Amazon Rekognition Video
Keep data for 1 hour to 10 years
The software pulls media fragments from the streams using the real-time Kinesis Video Streams
GetMedia API operation, parses the media fragments to extract the H264 chunk, samples the
frames that need decoding, then decodes the I-frames and converts them into image formats
such as JPEG/PNG format, before invoking the Amazon SageMaker endpoint. As the Amazon
SageMaker-hosted model returns inferences, KIT captures and publishes those results into a
Kinesis data stream. Customers can then consume those results using their favorite service, such
as AWS Lambda. Finally, the library publishes a variety of metrics into Amazon CloudWatch so
that customers can build dashboards, monitor, and alarm on thresholds as they deploy into
production.
1.2.3 Glue
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and
combine data for analytics, machine learning, and application development. AWS Glue provides
all of the capabilities needed for data integration.
AWS Glue provides both visual and code-based interfaces to make data integration easier. Users
can easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL (extract,
transform, and load) developers can visually create, run, and monitor ETL workflows with a few
clicks in AWS Glue Studio. Data analysts and data scientists can use AWS Glue DataBrew to
visually enrich, clean, and normalize data without writing code. With AWS Glue Elastic Views,
application developers can use familiar Structured Query Language (SQL) to combine and
replicate data across different data stores.
AWS Glue enables you to perform ETL operations on streaming data using continuously-running
jobs. AWS Glue streaming ETL is built on the Apache Spark Structured Streaming engine, and can
ingest streams from Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed
Streaming for Apache Kafka (Amazon MSK). Streaming ETL can clean and transform streaming
data and load it into Amazon S3 or JDBC data stores. Use Streaming ETL in AWS Glue to process
event data like IoT streams, clickstreams, and network logs.
Features
Fully managed, cost effective, pay only for the resources consumed
Jobs are run on a serverless Spark platform
Glue Scheduler to schedule jobs; the minimum schedule interval is 5 minutes.
Glue Triggers to automate job runs based on “events”
1.2.3.2 Crawlers
The following is the general workflow for how a crawler populates the AWS Glue Data
Catalog:
1. A crawler runs any custom classifiers that you choose to infer the format and schema
of your data. You provide the code for custom classifiers, and they run in the order that
you specify.
2. The first custom classifier to successfully recognize the structure of your data is used to
create a schema. Custom classifiers lower in the list are skipped.
3. If no custom classifier matches your data's schema, built-in classifiers try to recognize
your data's schema. An example of a built-in classifier is one that recognizes JSON.
4. The crawler connects to the data store. Some data stores require connection
properties for crawler access.
5. The inferred schema is created for your data.
6. The crawler writes metadata to the Data Catalog. A table definition contains metadata
about the data in your data store. The table is written to a database, which is a
container of tables in the Data Catalog. Attributes of a table include classification,
which is a label created by the classifier that inferred the table schema.
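The classifier ordering in steps 1–3 can be sketched as follows. This is an illustration of the first-match-wins logic only; the classifier functions are hypothetical stand-ins, not Glue APIs:

```python
def infer_schema(data, custom_classifiers, builtin_classifiers):
    """Try custom classifiers in the order given, then built-in ones;
    the first classifier to recognize the data supplies the schema.
    Each classifier returns a schema dict, or None if it doesn't match."""
    for classifier in custom_classifiers + builtin_classifiers:
        schema = classifier(data)
        if schema is not None:
            return schema
    return None  # the crawler would classify the data as UNKNOWN

# Hypothetical classifiers for the sketch:
def csv_classifier(data):
    return {"format": "csv"} if "," in data else None

def json_classifier(data):  # stands in for the built-in JSON classifier
    return {"format": "json"} if data.lstrip().startswith("{") else None

schema = infer_schema('{"a": 1}', custom_classifiers=[csv_classifier],
                      builtin_classifiers=[json_classifier])
# the custom CSV classifier doesn't match, so the built-in JSON one wins
```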
Bundled Transformations:
DropFields, DropNullFields–remove (null) fields
Filter –specify a function to filter records
Join –to enrich data
Map -add fields, delete fields, perform external lookups
In typical analytic workloads, column-based file formats like Parquet or ORC are preferred over
text formats like CSV or JSON. It is common to convert data from CSV/JSON/etc. into Parquet for
files on Amazon S3, which can be done in the transformation phase.
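The advantage of Parquet/ORC comes from storing each column's values together, so an analytic query can read only the columns it needs. A toy sketch of the row-to-columnar idea (the real conversion would be done by a Glue ETL job writing Parquet, not by code like this):

```python
def rows_to_columns(rows):
    """Convert row-oriented records (list of dicts) to a column-oriented
    layout (dict of lists) -- the core idea behind Parquet and ORC."""
    if not rows:
        return {}
    return {col: [row[col] for row in rows] for col in rows[0]}

rows = [{"id": 1, "price": 9.5}, {"id": 2, "price": 3.0}]
columns = rows_to_columns(rows)
# columns == {"id": [1, 2], "price": [9.5, 3.0]}
```

Scanning just `columns["price"]` touches a fraction of the data that scanning every row would.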
Targets can be S3, JDBC (RDS, Redshift), or the Glue Data Catalog
1.2.4.3 DynamoDB
NoSQL data store, serverless, provision read/write capacity
Useful to store a machine learning model served by your application
1.2.4.4 ElasticSearch
Indexing of data
Search amongst data points
Clickstream Analytics
1.2.4.5 ElastiCache
Caching mechanism
Not really used for Machine Learning
NOTE: Amazon ML allows you to create a datasource object from data stored in a
MySQL database in Amazon Relational Database Service (Amazon RDS). When you
perform this action, Amazon ML creates an AWS Data Pipeline object that
executes the SQL query that you specify, and places the output into an S3 bucket
of your choice. Amazon ML uses that data to create the datasource.
AWS Data Pipeline
AWS Data Pipeline is a web service that helps you reliably process and move data between
different AWS compute and storage services, as well as on-premises data sources, at
specified intervals.
Example 1:
You can use AWS Data Pipeline to archive your web server's logs to Amazon Simple Storage
Service (Amazon S3) each day and then run a weekly Amazon EMR (Amazon EMR) cluster over
those logs to generate traffic reports. AWS Data Pipeline schedules the daily tasks to copy data
and the weekly task to launch the Amazon EMR cluster. AWS Data Pipeline also ensures that
Amazon EMR waits for the final day's data to be uploaded to Amazon S3 before it begins its
analysis, even if there is an unforeseen delay in uploading the logs.
Example 2:
For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce
(Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log
data, loads the results into a relational database for future lookup, and then automatically sends
you a daily summary email.
Features
Manages task dependencies
Retries and notifies on failures
Data sources may be on-premises
Highly available
Destinations include S3, RDS, DynamoDB, Redshift and EMR
Control over environment resources
Access to EC2 and EMR
Can create resources in your account
Data Pipeline:
Orchestration service
More control over the environment, the compute resources that run your code, and the code itself
Allows access to EC2 or EMR instances (creates resources in your own account)
AWS Batch
Features
Run batch jobs as Docker images
Dynamic provisioning of the instances (EC2 & Spot Instances)
Optimal quantity and type based on volume and requirements
No need to manage clusters, fully serverless
You just pay for the underlying EC2 instances
Batch:
For any computing job regardless of the job (must provide Docker image)
Resources are created in your account, managed by Batch
For any non-ETL related work, Batch is probably better
AWS Step Functions
Features
Use to design workflows
Easy visualizations
Advanced Error Handling and Retry mechanism outside the code
Audit of the history of workflows
Figure 17: Train a Machine Learning Model
Figure 18: Tune a Machine Learning Model
Data Engineering Identify and implement a data-transformation
Components
Hadoop Core (Common):
Libraries and utilities that the other Hadoop modules run on top of (Java libraries and scripts).
HDFS:
Hadoop Distributed File System
YARN (Yet Another Resource Negotiator)
Manage the resources across the cluster.
It performs scheduling and resource allocation for the Hadoop System.
It is composed of three components: the ResourceManager, NodeManagers, and
ApplicationMasters.
MapReduce
A software framework for easily writing applications that process vast amounts of data in
parallel on a large cluster in a reliable, fault-tolerant manner.
It consists of:
Map functions: do things like transforming, reformatting, or extracting data. Their
output is intermediate data.
Reduce functions: take the intermediate data and aggregate it to produce the
final answer.
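The two phases can be sketched with the classic word count, in plain Python rather than Hadoop's Java API (a toy single-process sketch; Hadoop runs the same two phases distributed across the cluster):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs -- the intermediate data."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: aggregate the intermediate pairs into final counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

counts = reduce_phase(map_phase(["the quick fox", "the lazy dog"]))
# counts == {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```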
EMR is composed of clusters; a cluster is a collection of EC2 instances, where every
instance is called a node.
EMR Cluster
Master Node
Manages the cluster by running software components to co-ordinate the
distribution of data and tasks among other nodes for processing.
It tracks the status of tasks and monitors the health of the cluster.
Also known as Leader Node
Core Nodes
These are the nodes with software components that run tasks and store the data on
the HDFS
Task Nodes
These nodes only run tasks and don't store data on HDFS; they are used purely for
computation (for example, to absorb sudden bursts of work, often with Spot Instances).
EMR Usage
Transient Cluster: configured to automatically terminate once all steps have been completed.
Load input data → Process data → Store output data → Terminate
Long-Running Cluster: you interact with it directly and terminate it manually when finished.
EMR Services
Nodes are EC2 instances
VPC to configure network
S3 to load and save your data
CloudWatch to monitor cluster performance and configure alarms
IAM for permissions
CloudTrail to audit requests to the services
Data pipeline to schedule and start cluster
EMR Storage
HDFS
Very good performance, but the data is lost when the cluster shuts down. HDFS stores
data as blocks distributed across the cluster; the default block size is 128 MB.
EMRFS
Allows you to use S3 as though it were an HDFS file system, and can use DynamoDB to
track consistency across EMRFS (EMRFS Consistent View).
Local File System
EBS
EMR Promises
EMR charges by the hour, plus EC2 instance costs
Provisions new nodes on failure
Add/remove task nodes on the fly
Resize a running cluster's core nodes
1. The Spark context connects to different cluster managers, which allocate resources across
the applications.
2. Upon connecting, Spark acquires executors on nodes in the cluster.
3. The executors are processes that run computations and store data.
4. The application code is sent to the executors.
5. The Spark context sends tasks to the executors to run.
Spark Components
Resilient Distributed Dataset (RDD)
Represents a logical collection of data partitioned across different compute nodes.
Spark SQL
Engine that provides low-latency interactive queries, up to 100x faster than MapReduce.
Supports various data sources: JDBC, ODBC, JSON, ORC, Parquet, and HDFS.
Spark SQL exposes DataFrames in Python and Datasets in Scala.
Spark SQL uses distributed queries that execute across the entire cluster.
Spark Streaming
Real-time solution that leverages Spark Core's fast scheduling capability to do streaming
analytics.
It supports ingestion from Twitter, Kafka, Flume, HDFS, and ZeroMQ.
Spark Streaming can integrate with AWS Kinesis.
GraphX
Graph data structure and distributed graph processing.
MLlib (Machine Learning Library)
The machine learning library in Spark contains:
Classification: logistic regression and Naive Bayes
Regression
Decision Trees
Recommendation engine using ALS (Alternating Least Squares)
Clustering (K-Means)
LDA (topic modeling)
SVD, PCA
ML workflows (pipelines, transformation, and persistence)
Statistics functions
Zeppelin
A notebook for Spark.
Can use Spark SQL.
Can visualize data in charts and graphs.
EMR Notebook
An Amazon notebook for EMR with deeper AWS integration.
Notebooks are backed up to S3.
Provision clusters from the notebook.
Feed tasks to the cluster from the notebook.
Hosted inside a VPC.
Accessed only via the AWS console.
EMR Security
IAM Policies
Grant or deny permissions and determine what actions users can perform with Amazon
EMR and other AWS resources. Combine IAM policies with tags to control access on a
cluster-by-cluster basis.
IAM Roles
IAM roles for EMRFS requests to S3 let you control whether cluster users can access
files from EMR based on user, group, or the location of the data in S3.
Kerberos
Strong authentication through secret-key cryptography that ensures passwords aren't
sent over the network in unencrypted form.
There is also a role for automatic scaling and a service role for the cluster's EC2 instances.
NOTE: Using Spot Instances for the master node or core nodes risks partial data
loss, so use them there only for testing workloads.
For using Spark with SageMaker refer to section 4.1.12 SageMaker with Spark
Provides fully managed Jupyter Notebooks, and tools like Spark UI and YARN Timeline
Service to simplify debugging.
By using a directed acyclic graph (DAG) execution engine, Spark can create efficient query plans
for data transformations. Spark also stores input, output, and intermediate data in memory as
resilient distributed datasets, which allows fast processing without I/O cost, boosting the
performance of iterative and interactive workloads.
Exploratory Data Analysis Perform feature engineering
Additive model
When the seasonal variation is roughly constant.
Time Series = Seasonality + Trend + Noise
Multiplicative model
Seasonal variation increases as the trend increases.
Time Series = Seasonality × Trend × Noise
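A toy numeric sketch (made-up values, noise set aside for clarity) contrasts the two models: additive swings stay the same size as the trend grows, while multiplicative swings grow with it.

```python
# Hypothetical seasonal pattern and trend to contrast the two models.
seasonality = [5, -5, 5, -5]               # additive units (constant swing)
seasonal_factor = [1.5, 0.5, 1.5, 0.5]     # multiplicative factors
trend = [10, 20, 30, 40]

additive = [t + s for t, s in zip(trend, seasonality)]          # [15, 15, 35, 35]
multiplicative = [t * f for t, f in zip(trend, seasonal_factor)]  # [15.0, 10.0, 45.0, 20.0]
# additive peaks sit a constant +/-5 from the trend;
# multiplicative peaks deviate by 5 early on but by 20 once the trend reaches 40
```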
NOTE: Correlation matrix is used to show linear relationship, while scatter matrix
shows any relationship.
- Mean Replacement
Replace missing data with the column's mean value.
If you have outliers, use the median value instead.
Advantages: fast and easy; does not change the sample size or the mean.
Disadvantages:
Can't be used for categorical columns
Not very accurate
Misses correlations between features
- Dropping
Drop rows or columns that contain the missing data.
Use this method only if:
Not many rows contain missing data
Dropping those rows doesn't bias the data
You don't have time for anything better
In general it is not a good approach; consider imputing the missing data instead (for
example, filling a missing text field with a summary derived from another column).
- Most Common Value
Replace missing values with the most common value for that column. Useful for
categorical variables.
- Machine Learning
Use machine learning algorithms to fill in the missing data:
KNN
Find the K nearest-neighbor rows and average their values.
Assumes numeric data; there are methods for handling categorical data, but
no good solutions.
Deep Learning
Build a machine learning model to impute the missing data.
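The simple strategies above (mean, median, most common value) fit in a few lines of pure Python. `impute` here is a hypothetical helper for illustration, not a library function:

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or most common value
    computed from the observed (non-missing) entries."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "most_common": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 40, None]
filled_mean = impute(ages, "mean")      # missing entries filled with 32
filled_median = impute(ages, "median")  # missing entries filled with 31
```

In practice a library imputer (e.g. scikit-learn's SimpleImputer) does the same thing column by column.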
2.1.8 Binning
Bucket observations together based on ranges of values.
Useful when there is uncertainty in the measurements.
Transforms numeric data into ordinal data.
For example: age in decades (20s, 30s, etc.).
Lets you use algorithms designed for categorical data on numerical data.
Quantile binning
Categorizes your data by its place in the data distribution, so it ensures that every one
of your bins has an equal number of samples within it.
It guarantees the same number of samples in each resulting bin.
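A minimal quantile-binning sketch using only the standard library; note that exact cut points differ slightly between libraries' quantile definitions (this uses Python's default "exclusive" method):

```python
from statistics import quantiles

def quantile_bin(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 so that bins hold roughly
    equal numbers of samples (cut points at the data's quantiles)."""
    cuts = quantiles(values, n=n_bins)   # n_bins - 1 cut points
    return [sum(v > c for c in cuts) for v in values]

ages = [21, 23, 25, 34, 38, 41, 55, 62]
bins = quantile_bin(ages, n_bins=4)
# bins == [0, 0, 1, 1, 2, 2, 3, 3] -- exactly two samples per bin
```

Contrast with fixed-width binning, where a skewed distribution would leave some bins nearly empty.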
2.1.9 Transforming
If a feature has an exponential trend within it, a logarithmic transform can make the
data look more linear.
You can also apply transforms such as x² or √x.
2.1.11 Scaling
Some models prefer data to be normally distributed around 0.
Most models require feature data to at least be scaled to comparable ranges; otherwise,
features with larger magnitudes get more weight than they should. For example, with
age and income, income would receive much more weight than age.
There are 4 methods for scaling:
52
Exploratory Data Analysis Perform featuring engineering
- MinMax Scaler
Also called normalization.
Values are shifted and rescaled so they end up ranging from 0 to 1.
Formula: subtract the minimum value, then divide by (max – min).
Very sensitive to outliers.
(sklearn's Normalizer, by contrast, works sample-wise and builds totally new features
that are not correlated to the initial features.)
- Standardization
Subtract the mean value, then divide by the standard deviation, so the resulting
distribution has unit variance.
The result is not bounded to [0, 1], but it is less affected by outliers.
NOTE: StandardScaler and other scalers that work featurewise are preferred
in case meaningful information is located in the relation between feature
values from one sample to another sample, wherease Normalizer and other
scalers that work stamplewise are preferred in case meaningful information
is located in the relation between feature values from one feature to another
feature.
- Robust Scaler
It is better than standardization at dealing with outliers.
Formula as follows:
Calculate the median (50th percentile).
Calculate the 25th and 75th percentiles.
Value' = (Value – median) / (P75 – P25)
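The three scalers above can be compared side by side with scikit-learn; the single-feature data with an outlier is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# One feature with an outlier (100) to show each scaler's behavior.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(X)      # rescaled into [0, 1]
standard = StandardScaler().fit_transform(X)  # zero mean, unit variance
robust = RobustScaler().fit_transform(X)      # (x - median) / (P75 - P25)

print(minmax.ravel())    # the outlier squashes everything else near 0
print(standard.ravel())
print(robust.ravel())    # the inliers keep a sensible spread
```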
2.1.13 Residuals
A residuals plot (see the picture above) which has an increasing trend (first figure) suggests that
the error variance increases with the independent variable; while a distribution that reveals a
decreasing trend (second figure) indicates that the error variance decreases with the
independent variable. Neither of these distributions is a constant-variance pattern. Therefore,
they indicate that the assumption of constant variance is not likely to be true and the regression
is not a good one. On the other hand, a horizontal-band pattern (third figure) suggests that the
variance of the residuals is constant.
The Residual vs. Order of the Data plot can be used to check the drift of the variance (see the
picture above) during the experimental process, when data are time-ordered. If the residuals are
randomly distributed around zero, it means that there is no drift in the process.
If the data being analyzed is time series data (data recorded sequentially), the Residual vs. Order
of the Data plot will reflect the correlation between the error term and time. Fluctuating patterns
around zero will indicate that the error term is dependent.
2.1.14 Shuffling
Many algorithms benefit from shuffling your data.
Otherwise, they may learn from residual signals in the training data resulting from the
order in which they were collected.
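Shuffling features and labels together (so pairs stay aligned) can be sketched with scikit-learn's shuffle utility; the arrays are made up for illustration:

```python
import numpy as np
from sklearn.utils import shuffle

X = np.arange(10).reshape(5, 2)   # 5 instances, 2 features
y = np.array([0, 0, 1, 1, 1])     # labels collected in sorted order

# Shuffle X and y together so each (row, label) pair stays aligned
# while the original collection order is destroyed.
X_shuf, y_shuf = shuffle(X, y, random_state=42)
print(X_shuf)
print(y_shuf)
```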
Standard SQL
Amazon Athena uses Presto with ANSI SQL support.
SPICE
QuickSight datasets are imported into SPICE (up to 10 GB per dataset).
SPICE is a super-fast, parallel, in-memory calculation engine.
SPICE uses columnar storage in memory and machine code generation.
SPICE accelerates interactive queries on large data sets.
Highly available, durable, and scalable.
Pricing
Annual or monthly.
Standard or Enterprise edition.
SPICE capacity can go beyond 10 GB.
3. Modeling
3.1 Frame business problems as ML problems
3.1.1 Supervised Machine Learning
3.1.1.1 Regression
Logistic Regression
Some regression algorithms can be used for classification as well. Logistic Regression (also called
Logit Regression) is commonly used to estimate the probability that an instance belongs to a
particular class.
If the estimated probability is greater than 50%, then the model predicts that the instance
belongs to that class (called the positive class, labeled “1”), or else it predicts that it does not (i.e.,
it belongs to the negative class, labeled “0”). This makes it a binary classifier.
Estimating Probabilities
Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the
input features (plus a bias term), but instead of outputting the result directly like the Linear
Regression model does, it outputs the logistic of this result.
The logistic—noted σ (·)—is a sigmoid function (i.e., S-shaped) that outputs a number between 0
and 1.
Once the Logistic Regression model has estimated the probability p = hθ (x) that an instance x
belongs to the positive class, it can make its prediction ŷ easily.
Notice that σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a Logistic Regression model
predicts 1 if xᵀθ is positive, and 0 if it is negative.
Training and Cost Function
The objective of training is to set the parameter vector θ so that the model estimates high
probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0).
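The estimate-then-threshold behavior can be sketched with scikit-learn's LogisticRegression; the 1-D data below is made up so the positive class sits at larger feature values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data: class 1 occurs at larger feature values.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# predict_proba applies the sigmoid to the weighted sum plus bias,
# giving the estimated probability of the positive class.
proba = clf.predict_proba([[0.0], [5.0]])[:, 1]
print(proba)                        # low for x=0, high for x=5
print(clf.predict([[0.0], [5.0]]))  # thresholded at 0.5
```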
Decision Trees
Decision Trees are versatile Machine Learning algorithms that can perform both classification
and regression tasks, and even multioutput tasks.
One of the many qualities of Decision Trees is that they require very little data preparation. In
particular, they don’t require feature scaling or centering at all.
A node’s samples attribute counts how many training instances it applies to:
For example, 100 training instances have a petal length greater than 2.45 cm (depth 1,
right), among which 54 have a petal width smaller than 1.75 cm (depth 2, left).
A node’s value attribute tells you how many training instances of each class this node applies to:
For example, the bottom-right node applies to 0 Iris-Setosa, 1 Iris-Versicolor, and 45
Iris-Virginica.
A node’s gini attribute measures its impurity: a node is “pure” (gini=0) if all training instances it
applies to belong to the same class.
For example, since the depth-1 left node applies only to Iris-Setosa training instances, it is
pure and its gini score is 0.
The depth-2 left node has a gini score equal to 1 – (0/54)² – (49/54)² – (5/54)² ≈ 0.168.
The thick vertical line represents the decision boundary of the root node (depth 0): petal length
= 2.45 cm. Since the left area is pure (only Iris-Setosa), it cannot be split any further. However,
the right area is impure, so the depth-1 right node splits it at petal width = 1.75 cm (represented
by the dashed line). Since max_depth was set to 2, the Decision Tree stops right there. However,
if you set max_depth to 3, then the two depth-2 nodes would each add another decision
boundary (represented by the dotted lines).
A Decision Tree can also estimate the probability that an instance belongs to a particular class k:
first it traverses the tree to find the leaf node for this instance, and then it returns the ratio of
training instances of class k in this node. For example, suppose you have found a flower whose
petals are 5 cm long and 1.5 cm wide. The corresponding leaf node is the depth-2 left node, so
the Decision Tree should output the following probabilities: 0% for Iris-Setosa (0/54), 90.7% for
Iris-Versicolor (49/54), and 9.3% for Iris-Virginica (5/54). And of course if you ask it to predict the
class, it should output Iris-Versicolor (class 1) since it has the highest probability.
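The worked example above can be reproduced with scikit-learn on the iris dataset (petal features only, depth-2 tree, as in the text):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]   # petal length and width, as in the text
y = iris.target

tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Probabilities = class ratios among training instances in the leaf
# reached by this instance.
flower = [[5.0, 1.5]]  # petals 5 cm long and 1.5 cm wide
print(tree.predict_proba(flower))
print(tree.predict(flower))  # class 1 (Iris-Versicolor)
```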
The CART Training Algorithm
Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train Decision Trees
(also called “growing” trees). The idea is really quite simple: the algorithm first splits the training
set into two subsets using a single feature k and a threshold tₖ (e.g., “petal length ≤ 2.45 cm”).
How does it choose k and tₖ? It searches for the pair (k, tₖ) that produces the purest subsets
(weighted by their size).
The cost function that the algorithm tries to minimize is given by:
J(k, tₖ) = (m_left / m) · G_left + (m_right / m) · G_right
where G_left/right measures the impurity of the left/right subset and m_left/right is the
number of instances in the left/right subset.
Once it has successfully split the training set in two, it splits the subsets using the same logic,
then the sub-subsets, and so on, recursively. It stops recursing once it reaches the maximum
depth (defined by the max_depth hyperparameter), or if it cannot find a split that will reduce
impurity.
Making predictions requires traversing the Decision Tree from the root to a leaf. Decision Trees
are generally approximately balanced. Since each node only requires checking the value of one
feature, the overall prediction complexity is independent of the number of features. So
predictions are very fast, even when dealing with large training sets.
Gini Impurity or Entropy
By default, the Gini impurity measure is used, but you can select the entropy impurity measure
instead by setting the criterion hyperparameter to "entropy".
So should you use Gini impurity or entropy? The truth is, most of the time it does not make a big
difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good
default. However, when they differ, Gini impurity tends to isolate the most frequent class in its
own branch of the tree, while entropy tends to produce slightly more balanced trees.
Regularization Hyperparameters
Restricting the depth of the tree regularizes it. This is controlled by the max_depth
hyperparameter (the default value is None, which means unlimited). Reducing max_depth will
regularize the model and thus reduce the risk of overfitting.
min_samples_split: The minimum number of samples a node must have before it can be split.
min_samples_leaf: The minimum number of samples a leaf node must have.
min_weight_fraction_leaf: Same as min_samples_leaf but expressed as a fraction of the total
number of weighted instances.
max_leaf_nodes: Maximum number of leaf nodes.
max_features: Maximum number of features that are evaluated for splitting at each node.
Regression
This tree looks very similar to the classification tree you built earlier. The main difference is that
instead of predicting a class in each node, it predicts a value. For example, suppose you want to
make a prediction for a new instance with x1 = 0.6. You traverse the tree starting at the root, and
you eventually reach the leaf node that predicts value=0.1106. This prediction is simply the
average target value of the 110 training instances associated to this leaf node. This prediction
results in a Mean Squared Error (MSE) equal to 0.0151 over these 110 instances.
The CART algorithm works mostly the same way as earlier, except that instead of trying to split
the training set in a way that minimizes impurity, it now tries to split the training set in a way that
minimizes the MSE.
Instability
Hopefully by now you are convinced that Decision Trees have a lot going for them: they are
simple to understand and interpret, easy to use, versatile, and powerful. However they do have a
few limitations. First, as you may have noticed, Decision Trees love orthogonal decision
boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set
rotation.
3.1.1.2 Classification
Performance measure for the classification: evaluating a classifier is often significantly trickier
than evaluating a regressor.
Accuracy
Accuracy: The percent (ratio) of cases classified correctly:
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Confusion Matrix
It is used for evaluating a classifier.
To compute the confusion matrix, you first need to have a set of predictions, so they can be
compared to the actual targets.
Each row in a confusion matrix represents an actual class, while each column represents a
predicted class.
NOTE: It is not always the case that the actual class is represented as rows and
the predicted class as columns; sometimes they are swapped. TAKE CARE.
A perfect classifier would have only true positives and true negatives, so its confusion matrix
would have nonzero values only on its main diagonal i.e. no false positives and no false negatives.
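Computing a confusion matrix can be sketched with scikit-learn; the label vectors below are made up:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# scikit-learn's convention: rows = actual class, columns = predicted class.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[TN FP]      -> [[3 1]
#  [FN TP]]         [1 3]]
```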
Precision
Precision measures how many of the instances predicted as positive are actually positive:
Precision = TP / (TP + FP)
Such that:
TP is true positive (predicted as positive and they are actually positive).
FP is false positive (predicted as positive but they are actually negative).
NOTE: The more negative instances you classify as positive, the more FP increases
and the more the overall precision decreases. (In other words: the more wrong
items I bring back, the lower the precision.)
Good choice of metric when you care a lot about false positives, e.g. medical screening, drug
testing.
Recall
Recall is also called Sensitivity, True Positive Rate (TPR), or Completeness:
Recall = TP / (TP + FN)
Such that:
TP is true positive (predicted as positive and they are actually positive).
FN is false negative (predicted as negative but they are actually positive).
NOTE: The more positive instances you fail to recognize, i.e. the more FN increases,
the more the overall recall decreases. (In other words: the more items I fail to
bring back, the lower the recall.)
Good choice of metric when you care a lot about false negatives, e.g. fraud detection.
F1
It is often convenient to combine precision and recall into a single metric called the F1 score, in
particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean
of precision and recall.
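The three metrics can be computed together with scikit-learn; the label vectors are made up so TP=3, FP=1, FN=1:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # TP=3, FP=1, FN=1, TN=3

p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3/4
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)         # harmonic mean of p and r
print(p, r, f1)
```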
Precision/Recall Tradeoff
The F1 score favors classifiers that have similar precision and recall. This is not always what you
want: in some contexts you mostly care about precision, and in other contexts you really care
about recall. For example, if you trained a classifier to detect videos that are safe for kids, you
would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe
ones (high precision), rather than a classifier that has a much higher recall but lets a few really
bad videos show up in your product (in such cases, you may even want to add a human pipeline
to check the classifier’s video selection). On the other hand, suppose you train a classifier to
detect shoplifters on surveillance images: it is probably fine if your classifier has only 30%
precision as long as it has 99% recall (sure, the security guards will get a few false alerts, but
almost all shoplifters will get caught).
Unfortunately, you can’t have it both ways: increasing precision reduces recall, and vice versa.
This is called the precision/recall tradeoff.
So let’s suppose you decide to aim for 90% precision. You look up the first plot and find that you
need to use a threshold of about 8,000. To be more precise you can search for the lowest
threshold that gives you at least 90% precision.
Another way to select a good precision/recall tradeoff is to plot precision directly against recall.
You can see that precision really starts to fall sharply around 80% recall. You will probably want to
select a precision/recall tradeoff just before that drop—for example, at around 60% recall. But of
course the choice depends on your project.
ROC Curve
The receiver operating characteristic (ROC) curve is another common tool used with binary
classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus
recall, the ROC curve plots the true positive rate (another name for recall) against the false
positive rate (FPR).
FPR (False Positive Rate) = FP / (FP + TN) = 1 – TNR (True Negative Rate, also called specificity)
The FPR is the ratio of negative instances that are incorrectly classified as positive.
The TNR is the ratio of negative instances that are correctly classified as negative.
TNR = TN / (TN + FP)
To plot the ROC curve, you first need to compute the TPR and FPR for various threshold values.
Once again there is a tradeoff: the higher the recall (TPR), the more false positives (FPR) the
classifier produces. The dotted line represents the ROC curve of a purely random classifier; a
good classifier stays as far away from that line as possible (toward the top-left corner).
One way to compare classifiers is to measure the area under the curve (AUC). A perfect
classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC
equal to 0.5.
Commonly used metric for comparing classifiers.
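The ROC curve and its AUC can be sketched with scikit-learn; the four labels and scores below are made up:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # classifier scores

# TPR vs FPR at every score threshold, plus the area under that curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(fpr, tpr)
print(auc)  # 0.75: better than random (0.5), far from perfect (1.0)
```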
PR or ROC
As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you
care more about the false positives than the false negatives, and the ROC curve otherwise.
A per-class confusion matrix visualization can show:
Number of correct and incorrect predictions per class (inferred from the color of each cell)
F1 scores per class
True class frequencies: the “total” column
Predicted class frequencies: the “total” row
Bayes
A Bayesian network is a graphical model that represents a set of variables and their
conditional dependencies.
For example, disease and symptoms are connected using a network diagram. All symptoms
connected to a disease are used to calculate the probability of the existence of the disease.
The Naive Bayes classifier is a technique to assign class labels to samples from the available
set of labels. This method assumes each feature’s value is independent and does not
consider any correlation or relationship between the features.
RMSE gives an idea of how much error the system typically makes in its predictions, with a higher
weight for large errors.
Even though the RMSE is generally the preferred performance measure for regression tasks, in
some contexts you may prefer to use another function. For example, suppose that there are
many outlier districts. In that case, you may consider using the Mean Absolute Error.
RMSE tells you about the magnitude of the error but not its sign. The question is whether the
model overestimates or underestimates; the solution is residual plots.
A useful baseline is the RMSE of a hypothetical regression model that always predicts the mean
of the target as the answer. For example, if you were predicting the age of a house buyer and the
mean age for the observations in your training data was 35, the baseline model would always
predict the answer as 35. You would compare your ML model against this baseline to validate
that your ML model is better than a model that predicts this constant answer.
R squared is another commonly used metric for linear regression problems. R squared explains
the fraction of variance accounted for by the model. It’s like a percentage, reporting a number
from 0 to 1. When R squared is close to 1, it usually indicates that a lot of the variability in the
data can be explained by the model itself.
R squared will always increase when more explanatory variables are added to the model, so the
highest R squared may not indicate the best model. To counter this potential issue, there is
another metric called the Adjusted R squared. The Adjusted R squared has already taken care of
the added effect of additional variables and it only increases when the added variables have
significant effects in
the prediction. The adjusted R squared adjusts your final value based on the number of features
and number of data points you have in your dataset.
A recommendation, therefore, is to look at both R squared and Adjusted R squared. This will
ensure that your model is performing well but that there’s also not too much overfitting.
NOTE: When more than one feature is highly correlated (multicollinearity), the
data matrix X has less than full rank; therefore the moment matrix XᵀX cannot be
inverted and the ordinary least squares estimator doesn’t exist.
Variance
This part is due to the model’s excessive sensitivity to small variations in the training data.
A model with many degrees of freedom (such as a high-degree polynomial model) is likely
to have high variance, and thus to overfit the training data.
Irreducible error
This part is due to the noisiness of the data itself. The only way to reduce this part of the
error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect
and remove outliers).
Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely,
reducing a model’s complexity increases its bias and reduces its variance. This is why it is called a
tradeoff.
3.1.1.6 Regularization
A good way to reduce overfitting is to regularize the model (i.e., to constrain it): the fewer
degrees of freedom it has, the harder it will be for it to overfit the data. For example, a simple
way to regularize a polynomial model is to reduce the number of polynomial degrees.
For a linear model, regularization is typically achieved by constraining the weights of the model.
Ridge Regression
Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear
Regression: a regularization term equal to α Σᵢ₌₁ⁿ θᵢ² is added to the cost function.
This forces the learning algorithm to not only fit the data but also keep the model weights as
small as possible. Note that the regularization term should only be added to the cost function
during training. Once the model is trained, you want to evaluate the model’s performance using
the un-regularized performance measure.
The hyperparameter α controls how much you want to regularize the model. If α = 0 then Ridge
Regression is just Linear Regression. If α is very large, then all weights end up very close to zero
and the result is a flat line going through the data’s mean.
Lasso Regression
Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso Regression) is
another regularized version of Linear Regression.
It adds a regularization term to the cost function, but it uses the ℓ1 norm of the weight vector
instead of half the square of the ℓ2 norm.
An important characteristic of Lasso Regression is that it tends to completely eliminate the
weights of the least important features (i.e., set them to zero).
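The shrink-vs-eliminate difference between Ridge and Lasso can be sketched with scikit-learn; the synthetic data (two real signals plus three noise features, seed 0) is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
# Only the first two features matter; the last three are pure noise.
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.randn(100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(ridge.coef_)  # all weights shrunk, none exactly zero
print(lasso.coef_)  # the noise features' weights driven exactly to 0
```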
Hard Classifier
A very simple way to create an even better classifier is to aggregate the predictions of each
classifier and predict the class that gets the most votes. This majority-vote classifier is called a
hard voting classifier.
Soft Classifier
If all classifiers are able to estimate class probabilities, then you can tell Scikit-Learn to predict the
class with the highest class probability, averaged over all the individual classifiers. This is called
soft voting.
NOTE: SVC is not calculating probability by default, so you need to define its
probability hyperparameter to true.
Bagging and Pasting
Another approach is to use the same training algorithm for every predictor but train them on
different random subsets of the training set: sampling with replacement is called bagging, and
sampling without replacement is called pasting. Once all predictors are trained, the ensemble
aggregates their predictions, typically using the statistical mode (i.e., the most frequent
prediction, just like a hard voting classifier) for classification, or the average for regression.
Each individual predictor has a higher bias than if it were trained on the original training set, but
aggregation reduces both bias and variance.
It can be parallelized across different CPUs or cores, since it amounts to independent training
runs on different datasets.
Out-of-Bag Evaluation
With bagging, some instances may be sampled several times for any given predictor, while others
may not be sampled at all. By default a Bagging Classifier samples m training instances with
replacement (bootstrap=True), where m is the size of the training set. This means that only about
63% of the training instances are sampled on average for each predictor. The remaining 37% of
the training instances that are not sampled are called out-of-bag (oob) instances.
Note: They are not the same 37% for all predictors.
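Out-of-bag evaluation can be sketched with scikit-learn's BaggingClassifier on a synthetic dataset (make_moons, seeds chosen arbitrarily):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

# oob_score=True evaluates each predictor on the ~37% of training
# instances it never saw during bootstrap sampling: a free validation set.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, oob_score=True, random_state=42)
bag.fit(X, y)
print(bag.oob_score_)  # accuracy estimated without a separate test set
```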
Boosting
Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine
several weak learners into a strong learner. The general idea of most boosting methods is to train
predictors sequentially, each trying to correct its predecessor.
The most popular boosting methods are AdaBoost (short for Adaptive Boosting) and Gradient
Boosting.
Boosting can’t be parallelized, as it trains predictors sequentially.
AdaBoost
One way for a new predictor to correct its predecessor is to pay a bit more attention to the
training instances that the predecessor underfitted. This results in new predictors focusing more
and more on the hard cases.
For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is
trained and used to make predictions on the training set. The relative weight of misclassified
training instances is then increased. A second classifier is trained using the updated weights and
again it makes predictions on the training set, weights are updated, and so on.
Gradient Boosting
Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting
its predecessor. However, instead of tweaking the instance weights at every iteration like
AdaBoost does, this method tries to fit the new predictor to the residual errors made by the
previous predictor.
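The residual-fitting idea can be sketched by hand with plain decision trees (the quadratic toy data and seeds are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.randn(100)

# Gradient boosting by hand: each new tree fits the previous residuals.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
r1 = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, r1)
r2 = r1 - tree2.predict(X)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, r2)

# The ensemble's prediction is the sum of all the trees' predictions.
y_pred = sum(t.predict(X) for t in (tree1, tree2, tree3))
print(np.mean((y - y_pred) ** 2))  # lower than tree1's error alone
```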
Summary
XGBoost is the latest hotness
Boosting generally yields better accuracy
Bagging avoids overfitting
Bagging is easier to parallelize
Bagging reduces both bias and variance
Random Forest
Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or
sometimes pasting), typically with max_samples set to the size of the training set.
It is a combination of multiple trees: train a decision tree on each sample of the data (the data is
divided into samples using random sampling techniques), then combine the end results of the
trees by voting.
With a few exceptions, a Random Forest Classifier has all the hyperparameters of a Decision Tree
(to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control
the ensemble itself.
The Random Forest algorithm introduces extra randomness when growing trees; instead of
searching for the very best feature when splitting a node, it searches for the best feature among
a random subset of features. This results in a greater tree diversity, which (once again) trades a
higher bias for a lower variance, generally yielding an overall better model.
There may be some variations of K-Fold cross-validation, for example, the Leave-One-Out cross-
validation. In the Leave-One-Out cross-validation, the K is equal to N. Every time we leave one
data point out for testing, we are using the rest in the training data. This is usually used for very
small datasets where every data point is very valuable.
There's also stratified K-Fold cross-validation, which is often used when there are seasonalities or
subgroups in small proportion in the data set. Stratified K-Fold cross-validation is going to ensure
that for each fold, there are some equal weight proportions of the data for every different fold.
For instance, while splitting the data you might want to ensure that there is an equal
representation of a certain target variable among the different folds.
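The stratification guarantee can be sketched with scikit-learn's StratifiedKFold; the tiny dataset with a rare positive class is made up:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)   # the positive class is rare (20%)

# Each fold preserves the 80/20 class proportions of the full dataset.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    print(np.bincount(y[test_idx]))  # [4 1] in every fold
```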
NOTE: SageMaker has automated hyperparameter tuning, which uses methods like gradient
descent, Bayesian optimization, and evolutionary algorithms to conduct a guided search for the
best hyperparameter settings; this is known as hyperparameter optimization.
Clustering has several use cases:
Data analysis:
When analyzing a new dataset, it is often useful to first discover clusters of similar
instances, as it is often easier to analyze clusters separately.
Dimensionality reduction:
Once a dataset has been clustered, it is usually possible to measure each instance’s affinity
with each cluster (affinity is any measure of how well an instance fits into a cluster). Each
instance’s feature vector x can then be replaced with the vector of its cluster affinities. If
there are k clusters, then this vector is k dimensional. This is typically much lower
dimensional than the original feature vector, but it can preserve enough information for
further processing.
Semi-supervised learning:
If you only have a few labels, you could perform clustering and propagate the labels to all
the instances in the same cluster. This can greatly increase the amount of labels available
for a subsequent supervised learning algorithm, and thus improve its performance.
Search engines:
For example, some search engines let you search for images that are similar to a reference
image. To build such a system, you would first apply a clustering algorithm to all the images
in your database: similar images would end up in the same cluster. Then when a user
provides a reference image, all you need to do is to find this image’s cluster using the
trained clustering model, and you can then simply return all the images from this cluster.
Segment an image:
By clustering pixels according to their color, then replacing each pixel’s color with the
mean color of its cluster, it is possible to reduce the number of different colors in the
image considerably. This technique is used in many object detection and tracking systems,
as it makes it easier to detect the contour of each object.
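Color-based segmentation can be sketched with scikit-learn's KMeans on a made-up "image" of 100 RGB pixels drawn around two true colors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy "image": 100 RGB pixels scattered around two true colors.
rng = np.random.RandomState(42)
reds = rng.normal([200, 30, 30], 10, size=(50, 3))
blues = rng.normal([30, 30, 200], 10, size=(50, 3))
pixels = np.vstack([reds, blues])

# Cluster pixels by color, then replace each pixel with its cluster's
# mean color: the image is reduced to just 2 distinct colors.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(pixels)
segmented = kmeans.cluster_centers_[kmeans.labels_]
print(np.unique(segmented, axis=0).shape[0])  # 2 distinct colors remain
```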
This is how the Random Cut Forest algorithm is used in Kinesis Analytics to detect anomalies.
You can use the RANDOM_CUT_FOREST function in the SQL of Kinesis Analytics to detect
anomalies in the data, e.g. with 100 trees, sub-samples of 100 elements, 1,000 elements in
history, and a shingle size of 10.
It learns as it goes.
Shingle size is just a parameter for how many data points to look at while inferencing.
The most common step function used in the Perceptron is the Heaviside step function, and
sometimes the sign function.
A single TLU can be used for simple linear binary classification. It computes a linear combination
of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the
negative class.
Training a TLU in this case means finding the right values for w0, w1, and w2.
When all the neurons in a layer are connected to every neuron in the previous layer (i.e., its input
neurons), it is called a fully connected layer or a dense layer.
Input neurons just output whatever input they are fed. Moreover, an extra bias feature is
generally added (x0 = 1).
It is possible to efficiently compute the outputs of a layer of artificial neurons for several
instances at once:
hW,b(X) = ϕ(XW + b)
X represents the matrix of input features. It has one row per instance, one column per
feature.
The weight matrix W contains all the connection weights except for the ones from the bias
neuron. It has one row per input neuron and one column per artificial neuron in the layer.
The bias vector b contains all the connection weights between the bias neuron and the
artificial neurons. It has one bias term per artificial neuron.
The function ϕ is called the activation function: when the artificial neurons are TLUs, it is a
step function
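The batched layer equation above can be sketched in NumPy (shapes and weight values are made up; tanh stands in for a generic activation ϕ):

```python
import numpy as np

def dense_layer(X, W, b, phi=np.tanh):
    """h_{W,b}(X) = phi(XW + b), computed for a whole batch at once."""
    return phi(X @ W + b)

# 3 instances, 2 input features, a layer of 4 neurons.
X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
W = 0.5 * np.ones((2, 4))  # one row per input, one column per neuron
b = np.zeros(4)            # one bias term per neuron

out = dense_layer(X, W, b)
print(out.shape)  # (3, 4): one output row per instance
```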
3.1.3.2 Multi-Layer Perceptron and Backpropagation
An MLP is composed of one (pass-through) input layer, one or more layers of TLUs, called hidden
layers, and one final layer of TLUs called the output layer.
The signal flows only in one direction (from the inputs to the outputs), so this architecture is an
example of a feedforward neural network (FNN).
Backpropagation training algorithm:
It handles one mini-batch at a time (for example containing 32 instances each), and it goes
through the full training set multiple times.
Each mini-batch is passed to the network’s input layer, which just sends it to the first
hidden layer. The algorithm then computes the output of all the neurons in this layer (for
every instance in the mini-batch). The result is passed on to the next layer, its output is
computed and passed to the next layer, and so on until we get the output of the last layer,
the output layer. This is the forward pass: it is exactly like making predictions, except all
intermediate results are preserved since they are needed for the backward pass.
Next, the algorithm measures the network’s output error (i.e., it uses a loss function that
compares the desired output and the actual output of the network, and returns some
measure of the error).
Then it computes how much each output connection contributed to the error. This is done
analytically by simply applying the chain rule (perhaps the most fundamental rule in
calculus), which makes this step fast and precise.
The algorithm then measures how much of these error contributions came from each
connection in the layer below, again using the chain rule—and so on until the algorithm
reaches the input layer. As we explained earlier, this reverse pass efficiently measures the
error gradient across all the connection weights in the network by propagating the error
gradient backward through the network (hence the name of the algorithm).
Finally, the algorithm performs a Gradient Descent step to tweak all the connection
weights in the network, using the error gradients it just computed.
This algorithm is so important, it’s worth summarizing it again: for each training instance the
backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes
through each layer in reverse to measure the error contribution from each connection (reverse
pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent
step).
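The forward pass, backward pass, and Gradient Descent step described above can be sketched for a tiny regression MLP in NumPy. The network size, data, and learning rate are illustrative assumptions; a single step should reduce the loss on this mini-batch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression MLP: 3 inputs -> 4 hidden units (tanh) -> 1 output, MSE loss
X = rng.normal(size=(8, 3))            # one mini-batch of 8 instances
y = rng.normal(size=(8, 1))
W1 = rng.normal(size=(3, 4)) * 0.5
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)) * 0.5
b2 = np.zeros(1)
eta = 0.05                             # learning rate

def forward(X):
    z1 = X @ W1 + b1
    h = np.tanh(z1)                    # intermediates kept for the backward pass
    yhat = h @ W2 + b2
    return z1, h, yhat

def mse(yhat):
    return np.mean((yhat - y) ** 2)

# Forward pass
z1, h, yhat = forward(X)
loss_before = mse(yhat)

# Backward pass: apply the chain rule layer by layer
d_yhat = 2 * (yhat - y) / y.size       # dL/dyhat
dW2 = h.T @ d_yhat
db2 = d_yhat.sum(axis=0)
dh = d_yhat @ W2.T
dz1 = dh * (1 - np.tanh(z1) ** 2)      # tanh'(z) = 1 - tanh(z)^2
dW1 = X.T @ dz1
db1 = dz1.sum(axis=0)

# Gradient Descent step
W1 -= eta * dW1; b1 -= eta * db1
W2 -= eta * dW2; b2 -= eta * db2
loss_after = mse(forward(X)[2])
```

Note how the forward pass keeps intermediate results (z1, h) because the backward pass needs them.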
The hyperbolic tangent function tanh(z) = 2σ(2z) – 1. Just like the logistic function it is S-
shaped, continuous, and differentiable, but its output value ranges from –1 to 1 (instead of
0 to 1 in the case of the logistic function), which tends to make each layer’s output more or
less centered around 0 at the beginning of training. This often helps speed up convergence
and makes it preferable to the sigmoid/logistic function.
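The identity tanh(z) = 2σ(2z) – 1 can be checked numerically; a quick sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-3, 3, 13)
lhs = np.tanh(z)
rhs = 2 * sigmoid(2 * z) - 1   # tanh as a rescaled, recentered logistic
```

Both sides agree everywhere, and the outputs stay inside (–1, 1) as described.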
Swish
- From Google; performs very well
- Preferable for deep networks with more than 40 layers
Maxout
- Outputs the max of the inputs
- ReLU is a special case of Maxout
Output
You need one output neuron per output dimension.
If you want to guarantee that the output will always be positive, then you can use the ReLU
activation function, or the Softplus activation function in the output layer.
If you want to guarantee that the predictions will fall within a given range of values, then
you can use the logistic function or the hyperbolic tangent, and scale the labels to the
appropriate range: 0 to 1 for the logistic function, or –1 to 1 for the hyperbolic tangent.
Error
The loss function to use during training is typically the mean squared error
Mean absolute error if you have a lot of outliers in the training set
Huber loss, which is a combination of both.
Hyperparameters values
Hyperparameter Typical Value
# input neurons One per input feature (e.g., 28 x 28 = 784 for MNIST)
# hidden layers Depends on the problem. Typically 1 to 5.
# neurons per hidden layer Depends on the problem. Typically 10 to 100.
# output neurons 1 per prediction dimension
Hidden activation ReLU
Output activation None or ReLU/Softplus (if positive outputs) or Logistic/Tanh (if bounded
outputs)
Loss function MSE or MAE/Huber (if outliers)
Classification:
MLPs can also be used for classification tasks. For a binary classification problem, you just
need a single output neuron using the logistic activation function: the output will be a
number between 0 and 1, which you can interpret as the estimated probability of the
positive class. Obviously, the estimated probability of the negative class is equal to one
minus that number.
MLPs can also easily handle multilabel binary classification tasks. For example, you could
have an email classification system that predicts whether each incoming email is ham or
spam and if the email is urgent or not. In this case, you would need two output neurons,
both using the logistic activation function: the first would output the probability that the
email is spam and the second would output the probability that it is urgent.
If each instance can belong only to a single class, out of 3 or more possible classes (e.g.,
classes 0 through 9 for digit image classification), then you need to have one output
neuron per class, and you should use the Softmax activation function for the whole output
layer. The Softmax function will ensure that all the estimated probabilities are between 0
and 1 and that they add up to one (which is required if the classes are exclusive). This is
called multiclass classification.
Softmax
Used in the final output layer of a multiple classification problem
Basically converts outputs to probabilities of each classification.
Used for multiclass classification not multilabel classification.
In a classification problem for classifying Iris plants, suppose the DNN produces the outputs shown
in the figure above. The problem is that some outputs are not restricted to the range 0 to 1, so
they are hard to interpret. So we are going to use Softmax, which computes the probability of each
class using the following equation: eSetosa / (eSetosa + eVersicolor + eVirginica)
For every class, we will have a probability for this class ranging from 0 to 1, and their sum will
equal 1.
NOTE: These are not true probabilities; the numbers depend on the weights of the
DNN and will change if those weights change.
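The Softmax calculation above can be sketched in NumPy. The raw scores are hypothetical DNN outputs for (Setosa, Versicolor, Virginica):

```python
import numpy as np

def softmax(scores):
    z = scores - scores.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical raw DNN outputs
probs = softmax(scores)              # e.g. e^Setosa / sum of all e^class
```

Every entry lands in (0, 1) and the entries sum to 1, whatever the raw scores were.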
Cross Entropy
Now that you know how the model estimates probabilities and makes predictions, let’s take a
look at training. The objective is to have a model that estimates a high probability for the target
class (and consequently a low probability for the other classes).
Cross Entropy, penalizes the model when it estimates a low probability for a target class. Cross
entropy is frequently used to measure how well a set of estimated class probabilities match the
target classes.
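A minimal sketch of cross entropy, showing that a low estimated probability on the target class is penalized more heavily (the probability vectors are invented for illustration):

```python
import numpy as np

def cross_entropy(p_true, p_est, eps=1e-12):
    # -sum over classes of true_prob * log(estimated_prob)
    return -np.sum(p_true * np.log(p_est + eps))

target = np.array([1.0, 0.0, 0.0])    # the first class is the true one
good = np.array([0.9, 0.05, 0.05])    # high probability on the target class
bad = np.array([0.1, 0.6, 0.3])       # low probability on the target class
```

The "bad" estimate incurs a much larger loss than the "good" one, which is exactly the penalty the text describes.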
Hyperparameters
Number of hidden layers
Number of neurons per hidden layer
Other parameters
- Learning rate
A simple approach for tuning the learning rate is to start with a large value that
makes the training algorithm diverge, then divide this value by 3 and try again, and
repeat until the training algorithm stops diverging.
- Batch size
Batch size is the number of training samples used in each iteration (i.e., in each
gradient update).
It has a significant impact on your model’s performance and the training time. In
general the optimal batch size will be lower than 32.
A small batch size ensures that each training iteration is very fast, and although a
large batch size will give a more precise estimate of the gradients, in practice this
does not matter much since the optimization landscape is quite complex and the
direction of the true gradients does not point precisely in the direction of the
optimum.
A smaller batch size can escape local minima more easily, while a large batch size can
get stuck in local minima.
When working with shuffled data, a large batch size can give inconsistent results, as
the local minima will sometimes be computed correctly and sometimes not.
- Activation function
ReLU activation function will be a good default for all hidden layers. For the output
layer, it really depends on your task.
Solutions
Weight Initialization
To solve vanishing/exploding gradient problem we need the variance of the outputs of each layer
to be equal to the variance of its inputs.
Initially, weights used to be initialized with random numbers with mean 0 and standard deviation 1.
Let’s take an example to discuss the problem of using random numbers:
Suppose we have 250 input neurons, the value of each neuron is 1, and the weights are
generated randomly with mean 0 and SD 1.
The ReLU activation function is not perfect. It suffers from a problem known as dying
ReLU: during training, some neurons effectively die, meaning they stop outputting
anything other than 0 because their inputs are always negative.
Leaky ReLU. This function is defined as LeakyReLUα(z) = max(αz, z)
The hyperparameter α defines how much the function “leaks”: it is the slope of the
function for z < 0, and is typically set to 0.01.
Setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01
(small leak).
Randomized leaky ReLU (RReLU), where α is picked randomly in a given range during
training, and it is fixed to an average value during testing.
Parametric leaky ReLU (PReLU), where α is learned during training instead of being
fixed as a hyperparameter. This adds some complexity.
Exponential linear unit (ELU), which outperformed all the ReLU variants in its authors’
experiments: training time was reduced and the neural network performed better
on the test set.
The drawback of the ELU activation function is that it is slower to compute than ReLU
and its variants (due to the use of the exponential function), but during training this is
compensated by its faster convergence rate.
SELU, used in Self-Normalizing Neural Networks, is just a scaled version of the ELU
activation function.
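These variants differ only in how they treat negative inputs; a minimal sketch (the α defaults follow the text):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # slope alpha for z < 0, so negative inputs still produce a gradient
    return np.where(z < 0, alpha * z, z)

def elu(z, alpha=1.0):
    # smooth exponential branch for z < 0
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
```

ReLU zeroes all negatives (the dying-ReLU risk), Leaky ReLU keeps a small slope, and ELU bends smoothly toward –α.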
Batch Normalization
The problem arises when one of the weights becomes drastically larger than the other weights:
the output from its corresponding neuron will be extremely large, and this imbalance will
cascade through the neural network, causing instability.
To solve this problem we should apply batch normalization. Batch normalization is applied per
layer. It normalizes the output from the activation functions before being passed to the next
layer.
1. z = (x – m) / s, where x is the output from the activation function, m is the mean and s is
the standard deviation (both computed over the current mini-batch).
2. Multiply the result by a scale parameter g.
3. Then add a shift parameter b.
4. The final result will be: z = ((x – m) / s) * g + b
5. The parameters g and b are trainable, meaning they will be optimized during training; m
and s are computed from each mini-batch (and tracked as running averages for use at
test time).
In this way the weights do not become imbalanced, and the training speed will be greatly
increased.
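The normalize–scale–shift steps above can be sketched in NumPy (the input values are illustrative; g and b would be learned during training):

```python
import numpy as np

def batch_norm(x, g, b, eps=1e-5):
    # x: (batch, features). Normalize each feature over the mini-batch,
    # then scale by g (learned) and shift by b (learned).
    m = x.mean(axis=0)
    s = x.std(axis=0)
    z = (x - m) / (s + eps)
    return z * g + b

x = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])   # second feature on a much larger scale
out = batch_norm(x, g=np.ones(2), b=np.zeros(2))
```

After normalization both features have mean ≈ 0 and standard deviation ≈ 1, so no feature dominates the next layer.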
Benefits:
The vanishing gradients problem was strongly reduced, to the point that they could
use saturating activation functions such as the tanh and even the logistic activation
function.
The networks were also much less sensitive to the weight initialization.
Ability to use much larger learning rates, significantly speeding up the learning
process.
Batch Normalization also acts like a regularizer, reducing the need for other
regularization techniques.
Other Solutions
Multilevel hierarchy
Gradient Checking
A debugging technique
Numerically check the derivatives computed during training
Useful for validating code of neural network training
Unsupervised Pre-training
When you want to train a model for which you don’t have much labeled training data, and you
cannot find a model trained on a similar task.
You should gather plenty of unlabeled training data; then you can try to train the layers one by
one, starting with the lowest layer and going up, using an unsupervised feature detector
algorithm such as Restricted Boltzmann Machines or autoencoders.
Once all layers have been trained this way, you can add the output layer for your task, and fine-
tune the final network using supervised learning (i.e., with the labeled training examples). At this
point, you can unfreeze all the pre-trained layers, or just some of the upper ones.
Another huge speed boost comes from using a faster optimizer than the regular Gradient
Descent optimizer.
Momentum Optimization
Recall that Gradient Descent simply updates the weights θ by directly subtracting the gradient of
the cost function J(θ) with regards to the weights (∇θ J(θ)) multiplied by the learning rate η. The
equation is: θ ← θ – η∇θ J(θ). It does not care about what the earlier gradients were. If the local
gradient is tiny, it goes very slowly.
Momentum optimization cares a great deal about what previous gradients were: at each
iteration, it subtracts the local gradient from the momentum vector m (multiplied by the learning
rate η), and it updates the weights by simply adding this momentum vector. In other words, the
gradient is used for acceleration, not for speed. To simulate some sort of friction mechanism and
prevent the momentum from growing too large, the algorithm introduces a new hyperparameter
β, simply called the momentum, which must be set between 0 (high friction) and 1 (no friction). A
typical momentum value is 0.9.
You can easily verify that if the gradient remains constant, the terminal velocity (i.e., the
maximum size of the weight updates) is equal to that gradient multiplied by the learning rate η
multiplied by 1/(1 – β) (ignoring the sign). For example, if β = 0.9, then the terminal velocity is equal
to 10 times the gradient times the learning rate, so Momentum optimization ends up going 10
times faster than Gradient Descent! This allows Momentum optimization to escape from plateaus
much faster than Gradient Descent.
NOTE: In deep neural networks that don’t use Batch Normalization, the upper
layers will often end up having inputs with very different scales, so using
Momentum optimization helps a lot. It can also help roll past local optima.
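The momentum update can be sketched as below; with a constant gradient, the update size approaches the terminal velocity η · gradient · 1/(1 – β), i.e. 10× the plain Gradient Descent step for β = 0.9 (values are illustrative):

```python
def momentum_step(theta, grad, m, eta=0.1, beta=0.9):
    # m accumulates past gradients; beta = 0 reduces to plain Gradient Descent
    m = beta * m - eta * grad
    return theta + m, m

theta, m = 0.0, 0.0
for _ in range(200):                      # constant gradient of 1.0
    theta, m = momentum_step(theta, 1.0, m)
last_update_size = abs(m)                 # approaches eta * 1/(1 - beta) = 1.0
```

The gradient acts as an acceleration: the update size keeps growing until the friction term β balances it.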
AdaGrad
Consider the elongated bowl problem again: Gradient Descent starts by quickly going down the
steepest slope, then slowly goes down the bottom of the valley. It would be nice if the algorithm
could detect this early on and correct its direction to point a bit more toward the global
optimum. The AdaGrad algorithm achieves this by scaling down the gradient vector along the
steepest dimensions.
RMSProp
Although AdaGrad slows down a bit too fast and ends up never converging to the global
optimum, the RMSProp algorithm fixes this by accumulating only the gradients from the most
recent iterations (as opposed to all the gradients since the beginning of training). It does so by
using exponential decay in the first step.
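The RMSProp update can be sketched on a toy elongated-bowl loss (0.5·(100x² + y²), chosen for illustration): dividing by the decayed average of squared gradients scales down the steps along the steep x dimension.

```python
import numpy as np

def rmsprop_step(theta, grad, s, eta=0.01, rho=0.9, eps=1e-8):
    # s is an exponentially decaying average of the squared gradients;
    # dividing by sqrt(s) scales down steps along the steepest dimensions.
    s = rho * s + (1 - rho) * grad ** 2
    theta = theta - eta * grad / np.sqrt(s + eps)
    return theta, s

# Elongated bowl: loss = 0.5 * (100 * x^2 + y^2), gradient = (100x, y)
theta = np.array([1.0, 1.0])
s = np.zeros(2)
for _ in range(100):
    grad = np.array([100.0 * theta[0], theta[1]])
    theta, s = rmsprop_step(theta, grad, s)
```

Despite the gradient in x being 100× larger, both coordinates approach the optimum at comparable speeds.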
Power scheduling
Set the learning rate to a function of the iteration number t: η(t) = η0 / (1 + t/s)^c. The initial
learning rate η0, the power c (typically set to 1) and the steps s are hyperparameters. The
learning rate drops at each step, and after s steps it is down to η0 / 2. After s more steps, it is
down to η0 / 3. Then down to η0 / 4, then η0 / 5, and so on. As you can see, this schedule first
drops quickly, then more and more slowly. Of course, this requires tuning η0, s (and possibly c).
Exponential scheduling
Set the learning rate to: η(t) = η0 · 0.1^(t/s). The learning rate will gradually drop by a factor of 10
every s steps. While power scheduling reduces the learning rate more and more slowly,
exponential scheduling keeps slashing it by a factor of 10 every s steps.
Performance scheduling
Measure the validation error every N steps (just like for early stopping) and reduce the learning
rate by a factor of λ when the error stops dropping.
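The power and exponential schedules can be sketched directly from their formulas (the η0 and s values are illustrative):

```python
def power_schedule(t, eta0=0.01, s=100, c=1):
    # eta0/2 after s steps, eta0/3 after 2s steps, eta0/4 after 3s, ...
    return eta0 / (1 + t / s) ** c

def exponential_schedule(t, eta0=0.01, s=100):
    # drops by a factor of 10 every s steps
    return eta0 * 0.1 ** (t / s)
```

Power scheduling slows its own decay over time, while exponential scheduling keeps cutting the rate by 10× every s steps.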
3.1.3.10 Regularization
Constraining a model to make it simpler and reduce the risk of overfitting is called regularization.
The amount of regularization to apply during learning can be controlled by a hyperparameter. A
hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not
affected by the learning algorithm itself; it must be set prior to training and remains constant
during training. If you set the regularization hyperparameter to a very large value, you will get
an almost flat model (a slope close to zero); the learning algorithm will almost certainly not
overfit the training data, but it will be less likely to find a good solution. Tuning hyperparameters
is an important part of building a Machine Learning system
We already implemented one of the best regularization techniques in:
Early stopping
Batch Normalization was designed to solve the vanishing/exploding gradients problems, but it
also acts like a pretty good regularizer.
In this section we will present other popular regularization techniques for neural networks: ℓ1
and ℓ2 regularization, dropout and max-norm regularization.
ℓ1 and ℓ2 regularization
You can use ℓ1 and ℓ2 regularization to constrain a neural network’s connection weights (but
typically not its biases).
A regularization term is added as weights are learned
ℓ1 term is λ times the sum of the absolute values of the weights: λ Σ(i=1..k) |wi|
ℓ2 term is λ times the sum of the squares of the weights: λ Σ(i=1..k) wi²
Difference between ℓ1 and ℓ2:
ℓ1: sum of absolute values of the weights
Performs feature selection – entire features go to 0
Computationally inefficient
Sparse output, because it removes information from the data.
ℓ2: sum of squares of the weights
All features remain considered, just weighted
Computationally efficient
Dense output
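The two penalty terms can be sketched as below (λ and the weight vector are illustrative):

```python
import numpy as np

def l1_penalty(w, lam=0.01):
    # lambda times the sum of absolute weights -> encourages sparsity
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam=0.01):
    # lambda times the sum of squared weights -> shrinks all weights
    return lam * np.sum(w ** 2)

w = np.array([0.5, -2.0, 0.0, 1.0])
```

Either penalty is simply added to the training loss, so larger weights cost more.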
Simplify Model
Try to drop some neurons and/or layers.
Dropout
At every training step, every neuron (including the input neurons, but always excluding the
output neurons) has a probability p of being temporarily “dropped out,” meaning it will be
entirely ignored during this training step, but it may be active during the next step. The
hyperparameter p is called the dropout rate, and it is typically set to 50%. After training, neurons
don’t get dropped anymore.
Dropout applies to the input and hidden layers, but not the output layer.
MC Dropout can boost the performance of any trained dropout model, without having to
retrain it or even modify it at all!
MC Dropout provides a much better measure of the model’s uncertainty.
MC Dropout is simple to implement.
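The basic dropout mechanic (drop each neuron with probability p during training only) can be sketched as below; this sketch uses the common "inverted dropout" scaling by 1/(1 – p), which is an assumption on my part:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                       # nothing dropped after training
    keep = rng.random(activations.shape) >= p    # drop each unit with prob p
    # Scale the kept units by 1/(1 - p) so the expected output is unchanged
    return activations * keep / (1 - p)

a = np.ones((4, 10))
out = dropout(a, p=0.5)          # entries are either 0.0 or 2.0
```

A different random mask is drawn at every training step, so a dropped neuron may be active again at the next step.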
Max-Norm Regularization
It constrains the weights w of the incoming connections such that ∥w∥2 ≤ r, where r is the
max-norm hyperparameter and ∥·∥2 is the ℓ2 norm. Max-norm regularization does not add a
regularization loss term to the overall loss function. Instead, it is typically implemented by
computing ∥w∥2 after each training step and clipping w if needed (w ← w (r / ∥w∥2)).
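The clipping step w ← w (r / ∥w∥2) can be sketched as:

```python
import numpy as np

def max_norm_clip(w, r=1.0):
    # Rescale w only if its L2 norm exceeds the max-norm hyperparameter r;
    # typically applied after each training step.
    norm = np.linalg.norm(w)
    if norm > r:
        w = w * (r / norm)
    return w

w = np.array([3.0, 4.0])             # L2 norm is 5.0, which exceeds r
clipped = max_norm_clip(w, r=1.0)    # rescaled so its norm is exactly r
```

Weights already inside the ball of radius r are left untouched.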
Padding
A neuron located in row i, column j of a given layer is connected to the outputs of the neurons in
the previous layer located in rows i to i + fh – 1, columns j to j + fw – 1, where fh and fw are the
height and width of the receptive field. In order for a layer to have the same height and width as
the previous layer, it is common to add zeros around the inputs, as shown in the diagram. This is
called zero padding.
Stride
It is also possible to connect a large input layer to a much smaller layer by spacing out the
receptive fields, as shown in the next figure. The shift from one receptive field to the next is
called the stride.
Filters
A neuron’s weights can be represented as a small image the size of the receptive field. For
example, the next figure shows two possible sets of weights, called filters (or convolution kernels). The
first one is represented as a black square with a vertical white line in the middle (it is a 7 × 7
matrix full of 0s except for the central column, which is full of 1s); neurons using these weights
will ignore everything in their receptive field except for the central vertical line (since all inputs
will get multiplied by 0, except for the ones located in the central vertical line). The second filter
is a black square with a horizontal white line in the middle. Once again, neurons using these
weights will ignore everything in their receptive field except for the central horizontal line.
Now if all neurons in a layer use the same vertical line filter (and the same bias term), and you
feed the network the input image shown in next figure, the layer will output the top-left image.
Notice that the vertical white lines get enhanced while the rest gets blurred. Similarly, the upper-
right image is what you get if all neurons use the same horizontal line filter; notice that the
horizontal white lines get enhanced while the rest is blurred out. Thus, a layer full of neurons
using the same filter outputs a feature map, which highlights the areas in an image that activate
the filter the most. Of course you do not have to define the filters manually: instead, during
training the convolutional layer will automatically learn the most useful filters for its task, and the
layers above will learn to combine them into more complex patterns.
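The vertical-line filter example can be sketched with a naive convolution in NumPy (the image and sizes are illustrative; a real convolutional layer would learn its filters):

```python
import numpy as np

def conv2d(image, kernel):
    # "valid" cross-correlation: stride 1, no padding
    h, w = kernel.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

# 7x7 vertical-line filter: all zeros except the central column of 1s
kernel = np.zeros((7, 7))
kernel[:, 3] = 1.0

# Toy image with a single bright vertical line at column 8
image = np.zeros((16, 16))
image[:, 8] = 1.0
fmap = conv2d(image, kernel)   # the feature map
```

The feature map responds strongly only where the filter's central column lines up with the image's vertical line, exactly as the text describes.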
Figure 58: Applying two different filters to get two feature maps
Figure 59: Convolution layers with multiple feature maps, and images with three color channels
Moreover, input images are also composed of multiple sublayers: one per color channel. There
are typically three: red, green, and blue (RGB). Grayscale images have just one channel, but some
images may have much more—for example, satellite images that capture extra light frequencies
(such as infrared).
Specifically, a neuron located in row i, column j of the feature map k in a given convolutional layer
l is connected to the outputs of the neurons in the previous layer l – 1, located in rows i × sh to i ×
sh + fh – 1 and columns j × sw to j × sw + fw – 1, across all feature maps (in layer l – 1). Note that
all neurons located in the same row i and column j but in different feature maps are connected to
the outputs of the exact same neurons in the previous layer.
Pooling Layer
Pooling layer goal is to subsample (i.e., shrink) the input image in order to reduce the
computational load, the memory usage, and the number of parameters (thereby limiting the risk
of overfitting).
Just like in convolutional layers, each neuron in a pooling layer is connected to the outputs of a
limited number of neurons in the previous layer, located within a small rectangular receptive
field. You must define its size, the stride, and the padding type, just like before. However, a
pooling neuron has no weights; all it does is aggregate the inputs using an aggregation function
such as the max, min or mean.
The pooling layer also introduces some level of invariance to small translations, rotations and
scaling, as shown in the figure below.
Max pooling and average pooling can be performed along the depth dimension rather than the
spatial dimensions, although this is not as common. This can allow the CNN to learn to be
invariant to various features. For example, it could learn multiple filters, each detecting a
different rotation of the same pattern, such as handwritten digits.
Figure 62: Depth-wise max pooling can help the CNN learn any invariance
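A minimal sketch of max pooling with a 2 × 2 receptive field and stride 2 (the input values are illustrative):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    # A pooling neuron has no weights: it just takes the max of its field
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 1., 2., 0.],
              [5., 6., 3., 1.]])
pooled = max_pool(x)   # 4x4 input shrinks to 2x2
```

Each output is the maximum of one 2 × 2 block, so the spatial dimensions are halved with no parameters at all.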
Flatten Layer
This layer converts the 2D feature maps into a 1D vector so they can be fed into a regular (dense) hidden layer of neurons.
CNN Architectures
Typical CNN architectures stack a few convolutional layers (each one generally followed by a
ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another
pooling layer, and so on. The image gets smaller and smaller as it progresses through the
network, but it also typically gets deeper and deeper (i.e., with more feature maps) thanks to the
convolutional layers. At the top of the stack, a regular feedforward neural
network is added, composed of a few fully connected layers (+ReLU), and the final layer outputs
the prediction (e.g., a Softmax layer that outputs estimated class probabilities).
NOTE: A common mistake is to use convolution kernels that are too large. For
example, instead of using a convolutional layer with a 5 × 5 kernel, it is generally
preferable to stack two layers with 3 × 3 kernels: it will use fewer parameters and
require fewer computations, and it will usually perform better. One exception to this
recommendation is the first convolutional layer: it can typically have a large
kernel (e.g., 5 × 5), usually with a stride of 2 or more: this will reduce the spatial
dimension of the image without losing too much information, and since the input
image generally has only 3 channels, it will not be too costly.
Simple Usage:
Conv2D → MaxPooling2D → Dropout → Flatten → Dense → Dropout → Softmax
Conv2D: Convolution for the image data
Max Pooling: Subsample the image down to shrink the amount of data
Dropout: Prevent overfitting
Flatten: Convert data to 1D to be fed into the perceptron (Dense Layer)
Dense: Just a perceptron for normal DNN (Hidden layer of neurons)
Softmax: Multiclass classification.
Compared with previous architectures, GoogLeNet actually has 10 times fewer parameters than
AlexNet (roughly 6 million instead of 60 million).
One to One:
One input and one output.
Classification from a set of categories.
One to Many:
One input and many outputs. In every iteration there is an output.
Example: Image captioning, such that input is an image and output is a sequence of words
of different length.
Many to One:
Many inputs and one output.
Example:
- Sentiment analysis, such that the input is a sequence of words and the output is the
sentiment of this text, whether it is positive or negative.
- Video as input with variable number of frames (many) and the output is a
classification or action for the entire video.
Many to Many:
Many inputs and many outputs.
Example:
Machine translation, such that the input is a sequence of words and the output is a sequence of
words. The input and the output are variable in length, i.e., the input in English could have a
variable length and the output in French could have a variable length. Also, the English sentence
will generally not have the same length as the French sentence.
Frame-level video classification, where the input is a video with a variable number of frames
and each frame is classified.
Recurrent Neuron
The core recurrent neuron takes some input x and feeds it into the RNN.
The RNN core neuron has some internal hidden state, and that internal hidden state is
updated every time the RNN reads a new input.
The internal hidden state is then fed back to the model the next time it reads an input.
Frequently we will want our RNNs to produce some output at every time step.
This pattern reads an input, updates the hidden state, and then produces an output.
Recurrent Formula
We can process a sequence of vectors X by applying a recurrence formula at every time step.
Inside the RNN cell, we are computing some recurrence relation with a function f.
The function f depends on some weights (w); it accepts the previous hidden state h(t-1)
and the input at the current step xt, and outputs the updated hidden state h(t).
The updated hidden state will be used in the next step as the previous hidden state with the
new input.
NOTE: the same function and the same set of parameters (same updatable
weights) are used at every time step.
Figure 66: Simple RNN Function form (Vanilla Recurrent Neural Network)
The current hidden state ht is a function that takes the previous hidden state h(t-1)
and some input x.
We have a weight matrix Wxh that we multiply against the input xt.
Another weight matrix Whh that we multiply against the previous hidden state h(t-1).
Then we add the results and pass them to the tanh function.
If we have an output from this cell, we might have another weight matrix Why that will be
multiplied by ht, and this will be the cell output.
The process begins by passing the initial hidden state h0 (commonly 0) with the first input
x1 to the function fw.
Then apply the function fw, apply the weights and calculate the next hidden state h1.
The new hidden state h1 with the new input x2 will be recurred to the same cell.
In each training iteration the weights will be updated (the same weights are reused across
all time steps). Hidden states will be calculated and passed to the cell in the next step.
This process will be repeated over and over again till we consume all the inputs xt.
In back propagation, we will have a separate gradient for w flowing from each of those
time steps, and the final gradient for w will be the sum of all of those individual
per-time-step gradients.
We can also have yt explicitly, every ht at each step might feed into some other neural
network that can produce yt.
Also, we can calculate the loss at every individual step, the total loss will be the sum for all
the individual losses. Then we calculate the gradient for the total loss with respect to w.
The final output will depend upon the final hidden state as this hidden state holds all the
information from all the previous hidden states.
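The recurrence ht = tanh(Whh·h(t-1) + Wxh·xt + b) can be sketched in NumPy, reusing the same weights at every time step (sizes and random inputs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(h_prev, x, Wxh, Whh, b):
    # h_t = tanh(Whh @ h_{t-1} + Wxh @ x_t + b); the same weight
    # matrices are reused at every time step.
    return np.tanh(Whh @ h_prev + Wxh @ x + b)

hidden, inputs = 4, 3
Wxh = rng.normal(size=(hidden, inputs)) * 0.1
Whh = rng.normal(size=(hidden, hidden)) * 0.1
b = np.zeros(hidden)

h = np.zeros(hidden)                         # h0 is commonly all zeros
sequence = [rng.normal(size=inputs) for _ in range(5)]
for x in sequence:
    h = rnn_step(h, x, Wxh, Whh, b)          # recur over the whole sequence
```

After consuming the sequence, the final hidden state h summarizes everything the cell has read.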
Sequence to Sequence
It is used for something like machine translation, where you take a variably sized input and
produce a variably sized output.
You can think of this as a combination of the many to one (Encoder) plus a one to many
(Decoder)
Encoder will receive the variably sized input which is your sentence in English and then
summarize that entire sentence using the final hidden state of the encoder network.
Decoder will receive the input as a vector from the encoder and produce this variable sized
output which is your sentence in another language.
IFOG gates:
- Forget gate (f): How much do we want to forget from the cell memory
- Input gate (i): How much do we want to input into our cell
- Gate gate (g): How much do we want to write to our cell
- Output gate (o): How much to reveal from cell to the output world
Stack the previous hidden state vector and the current input vector, then multiply them by one
very big weight matrix W to compute the four different gates, which all have the same size as
the hidden state.
NOTE: Sometimes a separate weight matrix is used for each gate.
The input, forget and output gates use the sigmoid function (output from 0 to 1), while the gate
gate uses the tanh function (output from –1 to 1).
We calculate cell state and hidden state.
ct = (f ⊙ ct-1) + (i ⊙ g)
ht = o ⊙ tanh(ct)
(where ⊙ denotes element-wise multiplication)
NOTE: the forget gate (f) is a vector of values between 0 and 1 telling us, for each
element in the cell state, whether we want to forget that element of the cell or
remember it. The same concept applies to the input (i) and output (o) vectors, as
all of them come from the sigmoid function.
Cell states will be incremented or decremented on each step.
The hidden state will be used in the next step.
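The gate computation and state updates above can be sketched in NumPy, using one big stacked weight matrix as described (sizes and random inputs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, W, b):
    # Stack h_{t-1} and x_t, multiply by one big weight matrix, then
    # split the result into the four gates (i, f, o, g).
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0 * n:1 * n])     # input gate
    f = sigmoid(z[1 * n:2 * n])     # forget gate
    o = sigmoid(z[2 * n:3 * n])     # output gate
    g = np.tanh(z[3 * n:4 * n])     # "gate gate" (candidate values)
    c = f * c_prev + i * g          # element-wise cell-state update
    h = o * np.tanh(c)
    return h, c

hidden, inputs = 4, 3
W = rng.normal(size=(4 * hidden, hidden + inputs)) * 0.1
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
for _ in range(5):
    h, c = lstm_step(h, c, rng.normal(size=inputs), W, b)
```

Note how the cell state c is only ever updated element-wise: the forget gate scales it and the input gate adds to it, which is what keeps the gradients healthy.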
Back propagation with LSTM
During back propagation, the gradient flowing along the cell state passes through an
element-wise multiplication with the forget gate, and this mitigates vanishing and
exploding gradients for 2 reasons:
- The forget gate applies element-wise multiplication rather than a full matrix multiplication
- The gradient is multiplied by a different forget gate at every step
RNN Variants
Gated Recurrent Unit (GRU)
Introduced in the paper “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation.”
3.1.3.14 Reinforcement
It is used in:
Supply chain management
HVAC systems (heating, ventilation, and air conditioning)
Industrial Robots
Dialog Systems
Autonomous Vehicles
Yields very fast performance once the space has been explored.
Components of an MDP:
- Agent
- Environment
- State
- Action
- Reward
This process of selecting an action from a given state, transitioning to a new state, and receiving a
reward happens sequentially over and over again, which creates something called
a trajectory that shows the sequence of states, actions, and rewards.
Throughout this process, it is the agent's goal to maximize the total amount of rewards that it
receives from taking actions in given states. This means that the agent wants to maximize not just
the immediate reward, but the cumulative rewards it receives over time.
MDP Notation
In an MDP, we have a set of states (S), a set of actions (A), and a set of rewards (R). We'll assume
that each of these sets has a finite number of elements.
At each time step t=0, 1, 2, ⋯, the agent receives some representation of the environment's
state St ∈ S. Based on this state, the agent selects an action At ∈ A. This gives us the state-action
pair (St, At).
Time is then incremented to the next time step t+1, and the environment is transitioned to a new
state St+1 ∈ S. At this time, the agent receives a numerical reward Rt+1 ∈ R for the action At taken
from state St.
We can think of the process of receiving a reward as an arbitrary function f that maps state-action pairs to rewards. At each time t, we have:
f(St, At) = Rt+1
The trajectory representing the sequential process of selecting an action from a state,
transitioning to a new state, and receiving a reward can be represented as:
S0, A0, R1, S1, A1, R2, S2, A2, R3, ⋯
- This process then starts over for the next time step, t+1.
- Note, t+1 is no longer in the future, but is now the present: at the next step, St+1 and Rt+1 become the current St and Rt.
Transition Probabilities
- Since the sets (S) and (R) are finite, the random variables (Rt) and (St) have well defined
probability distributions. In other words, all the possible values that can be assigned
to Rt and St have some associated probability. These distributions depend on
the preceding state and action that occurred in the previous time step t−1.
- For example, suppose s′ ∈ S and r ∈ R. Then there is some probability
that St=s′ and Rt=r. This probability is determined by the particular values of
the preceding state (s) ∈ S and action a ∈ A(s). Note that A(s) is the set of actions that can
be taken from state (s).
- Let’s define this probability.
For all s′ ∈ S, s ∈ S, r ∈ R, and a ∈ A(s), we define the probability of the transition to
state s′ with reward r from taking action (a) in state (s) as:
p(s′, r | s, a) = Pr(St = s′, Rt = r | St-1 = s, At-1 = a)
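A transition model of this kind can be stored as a simple lookup table from (s, a) pairs to distributions over (s′, r) pairs. The states, actions, and probabilities below are made up purely for illustration:

```python
# A toy transition model p(s', r | s, a) for a two-state MDP.
# States, actions, rewards, and probabilities are illustrative only.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "a1"): {("s1", 1.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 1.0},
}

# Each distribution over (s', r) must sum to 1 to be a valid probability.
for (s, a), dist in p.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

# Probability of landing in s1 with reward 1 after taking a0 in s0:
prob = p[("s0", "a0")][("s1", 1.0)]
```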
Expected Return
We stated that the goal of an agent in an MDP is to maximize its cumulative rewards. We need a
way to aggregate and formalize these cumulative rewards. For this, we introduce the concept of
the expected return of the rewards at a given time step.
For now, we can think of the return simply as the sum of future rewards. Mathematically, we
define the return G at time t as
Gt = Rt+1 + Rt+2 + Rt+3 + ⋯ + RT
Where T is the final time step.
This concept of the expected return is super important because it's the agent's objective to
maximize the expected return. The expected return is what's driving the agent to make the
decisions it makes.
In our definition of the expected return, we introduced T, the final time step. When the notion of
having a final time step makes sense, the agent-environment interaction naturally breaks up into
subsequences, called episodes. For example, think about playing a game of pong. Each new round
of the game can be thought of as an episode, and the final time step of an episode occurs when a
player scores a point.
Each episode ends in a terminal state at time T, which is followed by resetting the environment
to some standard starting state or to a random sample from a distribution of possible starting
states. The next episode then begins independently from how the previous episode ended.
Formally, tasks with episodes are called episodic tasks.
There exists other types of tasks though where the agent-environment interactions don't break
up naturally into episodes, but instead continue without limit. These types of tasks are
called continuing tasks.
For example, painting robots work continuously.
Continuing tasks make our definition of the return at each time t problematic because our final
time step would be T= ∞.
Discounted Return
Our revision of the way we think about return will make use of discounting. Rather than the
agent's goal being to maximize the expected return of rewards, it will instead be to maximize the
expected discounted return of rewards. Specifically, the agent will be choosing action (At) at each
time t to maximize the expected discounted return.
Agent's goal to maximize the expected discounted return of rewards.
To define the discounted return, we first define the discount rate (γ), a number
between 0 and 1. The discount rate is the rate at which we discount future rewards and
determines the present value of future rewards. With this, we define the discounted return as:
Gt = Rt+1 + γRt+2 + γ^2 Rt+3 + ⋯ = Σ (k=0 to ∞) γ^k Rt+k+1
This definition of the discounted return makes it to where our agent will care more about the
immediate reward over future rewards since future rewards will be more heavily discounted. So,
while the agent does consider the rewards it expects to receive in the future, the more
immediate rewards have more influence when it comes to the agent making a decision about
taking a particular action.
Now, check out this relationship showing how returns at successive time steps are related to
each other. We'll make use of this relationship later:
Gt = Rt+1 + γGt+1
Also, check this out. Even though the return at time t is a sum of an infinite number of terms, the return
is actually finite as long as the reward is nonzero and constant, and γ<1.
For example, if the reward at each time step is a constant 1 and γ < 1, then the return is
Gt = Σ (k=0 to ∞) γ^k = 1 / (1 − γ)
This infinite sum yields a finite result. If you want to understand this concept more deeply, then
research infinite series convergence. For our purposes though, you're free to just trust the fact that this is
true, and understand the infinite sum of discounted returns is finite if the conditions we outlined are
met.
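A quick numerical check of this convergence fact: summing a long but truncated stream of constant reward 1, discounted by γ < 1, matches the closed form 1/(1 − γ).

```python
# Numerically check that an infinite stream of constant reward 1,
# discounted by gamma < 1, sums to 1 / (1 - gamma).
gamma = 0.9
G = sum(gamma**k * 1.0 for k in range(1000))  # truncate the infinite sum
closed_form = 1.0 / (1.0 - gamma)             # = 10.0 for gamma = 0.9
```

The truncation error is gamma^1000 / (1 - gamma), which is astronomically small, so the two values agree to machine precision.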
In addition to understanding the probability of selecting an action (addressed by policies), we'd probably also like to
know how good a given action or a given state is for the agent. In terms of rewards, selecting one
action over another in a given state may increase or decrease the agent's rewards, so knowing this in
advance will probably help our agent out with deciding which actions to take in which states. This is
where value functions become useful, and we'll also expand on this idea in just a bit.
Question Addressed by
How probable is it for an agent to select any action from a given state? Policies
How good is any given action or any given state for an agent? Value functions
Policies
A policy is a function that maps a given state to probabilities of selecting each possible action from that
state. We will use the symbol π to denote a policy.
When speaking about policies, formally we say that an agent “follows a policy.” For example, if an agent
follows policy π at time t, then π (a | s) is the probability that At = a, if St=s. This means that, at time t,
under policy π, the probability of taking action (a) in state (s) is π (a | s).
Value Functions
Value functions are functions of states, or of state-action pairs, that estimate how good it is for
an agent to be in a given state, or how good it is for the agent to perform a given action in a given
state.
State-Value Function
The state-value function for policy π, denoted as vπ, tells us how good any given state is for an
agent following policy π. In other words, it gives us the value of a state under π.
Action-Value Function
Similarly, the action-value function for policy π, denoted as qπ, tells us how good it is for the agent
to take any given action from a given state while following policy π. In other words, it gives us the
value of an action under π.
Conventionally, the action-value function qπ is referred to as the Q-function, and the output from
the function for any given state-action pair is called a Q-value. The letter “Q” is used to represent
the quality of taking a given action in a given state.
Optimality
It is the goal of reinforcement learning algorithms to find a policy that will yield a lot of rewards
for the agent if the agent indeed follows that policy. Specifically, reinforcement learning
algorithms seek to find a policy that will yield more return to the agent than all other policies.
In terms of return, a policy π is considered to be better than or the same as policy π′ if the
expected return of π is greater than or equal to the expected return of π′ for all states.
Optimal State-Value Function
The optimal policy has an associated optimal state-value function. We denote the optimal
state-value function as v∗ and define it as:
v∗(s) = max over π of vπ(s)
for all s ∈ S. In other words, v∗ gives the largest expected return achievable by any policy π for
each state.
Optimal Action-Value Function
Similarly, the optimal policy has an associated optimal action-value function, denoted q∗ and defined as:
q∗(s, a) = max over π of qπ(s, a)
for all s ∈ S and a ∈ A(s). In other words, q∗ gives the largest expected return achievable by any
policy π for each possible state-action pair.
Q-Learning
Q-learning is a reinforcement learning technique used for learning the optimal policy in a Markov
Decision Process. We'll illustrate how this technique works by introducing a game where a
reinforcement learning agent tries to maximize points.
We left off talking about the fact that once we have our optimal Q-function q∗ we can determine
the optimal policy by applying a reinforcement learning algorithm to find the action that
maximizes q∗ for each state.
The objective of Q-learning is to find a policy that is optimal in the sense that the expected value
of the total reward over all successive steps is the maximum achievable. So, in other words, the
goal of Q-learning is to find the optimal policy by learning the optimal Q-values for each state-
action pair.
Value Iteration
The Q-learning algorithm iteratively updates the Q-values for each state-action pair using the
Bellman equation until the Q-function converges to the optimal Q-function, q∗. This approach is
called value iteration.
Q Table
We'll be making use of a table, called a Q-table, to store the Q-values for each state-action pair.
The horizontal axis of the table represents the actions, and the vertical axis represents the states.
So, the dimensions of the table are the number of states by the number of actions.
All the Q-values in the table are first initialized to zero. Over time, though, as the agent plays
several episodes of the game, the Q-values produced for the state-action pairs that the agent
experiences will be used to update the Q-values stored in the Q-table.
As the Q-table becomes updated, in later moves and later episodes, the agent can look in the Q-
table and base its next action on the highest Q-value for the current state. This will make more
sense once we actually start playing the game and updating the table.
Episodes
Now, we'll set some standard number of episodes that we want the agent to play. Let's say we
want the agent to play five episodes. It is during these episodes that the learning process will take
place.
In each episode, the agent starts out by choosing an action from the starting state based on the
current Q-values in the table. The agent chooses the action based on which action has the
highest Q-value in the Q-table for the current state.
But wait... that's kind of weird for the first action in the first episode, right? All the Q-values
are set to zero at the start, so there's no way for the agent to differentiate between them to
discover which one is considered better. So, what action does it start with?
To answer this question, we'll introduce the trade-off between exploration and exploitation.
Exploration vs Exploitation
Exploration is the act of exploring the environment to find out information about
it. Exploitation is the act of exploiting the information that is already known about the
environment in order to maximize the return.
We need a balance of both exploitation and exploration. A common way to implement this is an
epsilon-greedy strategy: with probability epsilon the agent explores by choosing an action at
random, and otherwise it exploits by choosing the action with the highest Q-value.
To update the Q-values, we want to make the Q-value for the given state-action pair as close as
we can to the right-hand side of the Bellman optimality equation, so that the Q-value will
eventually converge to the optimal Q-value q∗:
q∗(s, a) = E[Rt+1 + γ max over a′ of q∗(s′, a′)]
This will happen over time by iteratively comparing the loss between the Q-value and the optimal
Q-value for the given state-action pair and then updating the Q-value over and over again each
time we encounter this same state-action pair to reduce the loss.
Learning Rate
The learning rate is a number between 0 and 1, which can be thought of as how quickly the agent
abandons the previous Q-value in the Q-table for a given state-action pair for the new Q-value.
We don't want to just overwrite the old Q-value, but rather, we use the learning rate as a tool to
determine how much information we keep about the previously computed Q-value for the given
state-action pair versus the new Q-value calculated for the same state-action pair at a later time
step. We'll denote the learning rate with the symbol α, and we'll arbitrarily set α=0.7 for example.
The higher the learning rate, the more quickly the agent will adopt the new Q-value. For example,
if the learning rate is 1, the estimate for the Q-value for a given state-action pair would be the
straight up newly calculated Q-value and would not consider previous Q-values that had been
calculated for the given state-action pair at previous time steps.
Summary
This is a summary for the reinforcement algorithm:
MDP provide a mathematical framework for modeling decision making in situations where
outcomes are partly random and partly under the control of a decision maker.
Our “Q” values are described as a reward function Ra(s, s′)
Start off with Q values of 0
Explore the space
As bad things happen after a given state/action, reduce its Q
As rewards happen after a given state/action, increase its Q
You can “look ahead” more than one step by using a discount factor when computing Q
(here (s) is previous state, s’ is current state)
- Q(s, a) += learning_rate * (reward(s, a) + discount * max(Q(s’)) – Q(s, a))
Exploration problem:
- We can’t always choose the highest Q value as at the beginning all the Q values are
initialized with 0 and you will miss a lot of paths.
- If a random number is less than epsilon, don’t follow the highest Q, but choose at
random
- That way, exploration never totally stops
- Choosing epsilon can be tricky
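The update rule and epsilon-greedy exploration above can be sketched on a toy problem. The environment here, a 5-state chain where only the rightmost state pays reward 1, is made up for illustration; alpha, gamma, and epsilon match the kinds of values discussed in this section.

```python
import random

random.seed(0)
N_STATES, ACTIONS = 5, (0, 1)          # chain 0..4; action 0 = left, 1 = right
alpha, gamma, epsilon = 0.7, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(s, a):
    """Toy deterministic environment: move along the chain, reward 1 at the end."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1   # next state, reward, done flag

for episode in range(200):
    s, done = 0, False
    while not done:
        # epsilon-greedy: explore with probability epsilon, else exploit
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2, r, done = step(s, a)
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max Q(s') - Q(s,a))
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

# Greedy policy for the non-terminal states 0..3
policy = [0 if q[0] > q[1] else 1 for q in Q[:-1]]
```

After training, the greedy policy walks right from every state toward the rewarding end of the chain, and the Q-values for "right" decay geometrically with distance from the goal, as the discounted-return discussion predicts.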
A Markov Decision Process (MDP) is a discrete-time stochastic control process; dynamic
programming techniques are commonly used to solve it.
Stop Words
Manually excluded from the text, because they occur too frequently in all documents in
the corpus.
There are 179 stop words in the NLTK library, e.g. “she”, “he”, “is”, etc.
Tokenizing
Separating text data into tokens by white space and punctuation as token separators.
Stemming
A set of rules for slicing a string to a substring. The goal is to remove word affixes
(particularly suffixes).
Such as:
- Removing “s”, “es” which generally indicates plurality.
- Removing past tense suffixes: “ed”
For example “The children are playing and running. The weather was better yesterday.”
- Stemming: “The children are play and run. The weather was better yesterday.”
Lemmatization
It looks up words in a dictionary and returns the “head” word, called a “lemma.”
- It is more complex than stemming.
- For best results, part-of-speech tags should be provided: adjective, noun, etc.
For example “The children are playing and running. The weather was better yesterday.”
- Lemmatizing: “The child be play and run. The weather be good yesterday”
When preparing the data:
- Apply word stemming and lemmatization
- Remove stop words
- Remove punctuation
- Convert text to lowercase (actually depends on your use-case)
- Replace digits
After preprocessing, we then move on to tokenizing the corpus.
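The preprocessing steps above can be sketched in plain Python. This is a deliberately crude, self-contained sketch: the tiny stop-word list and the suffix-stripping rules are illustrative assumptions, not NLTK's actual stop-word list or Porter stemmer, which a real pipeline would use instead.

```python
import re
import string

STOP_WORDS = {"the", "are", "and", "was", "is", "this"}   # tiny illustrative list

def crude_stem(token):
    # Very rough suffix stripping, for illustration only. A real stemmer
    # (e.g. NLTK's PorterStemmer) applies many ordered, context-aware rules.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower()                                    # lowercase
    text = re.sub(r"\d+", "", text)                        # drop digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                                  # whitespace tokenizing
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    return [crude_stem(t) for t in tokens]

tokens = preprocess("The children are playing and running.")
```

Note how "running" becomes the non-word "runn": stems are not required to be dictionary words, which is exactly the difference from lemmatization described above.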
3.1.4.2 Vectorization
ML algorithms expect numeric vectors as inputs instead of texts
This transformation is called vectorization or feature extraction.
Bag of Words
Each document is represented by a vector with size equal to the size of the corpus (vocabulary).
Each entry is the number of times the corresponding word occurred in the sentence (raw-counts
method).
Issues:
- We lose the information inherent in the word order.
- Large documents can have big word counts compared to small documents.
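The raw-counts method can be sketched in a few lines of plain Python; the vocabulary and documents here are made up for illustration:

```python
from collections import Counter

def build_vocab(docs):
    """Map each unique word in the corpus to a fixed vector index."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    return {w: i for i, w in enumerate(vocab)}

def bow_vector(doc, vocab):
    """Vector of raw counts, one entry per vocabulary word."""
    counts = Counter(doc.split())
    vec = [0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:
            vec[vocab[word]] = n
    return vec

docs = ["learn stock market", "play game play"]
vocab = build_vocab(docs)
v = bow_vector("play game play", vocab)
```

The word-order issue is visible directly: "game play play" produces exactly the same vector as "play game play".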
Which tag does the following sentence belong to? “Learn stock markets playing this game”
Pre-processing:
- Remove stop words
- Remove words shorter than 2 characters
- Apply stemming
Calculation:
We want to calculate the probabilities:
P (Finance | “learn stock market play game”): Probability of the Finance tag, given the sentence:
Learn stock markets playing this game
P (Not Finance | “learn stock market play game”): Probability of the Not Finance tag, given the
sentence: Learn stock markets playing this game
We will assign a category to “learn stock market play game” based on whichever probability is
larger.
By using Bayes' Theorem:
P(Finance | "learn stock market play game") ∝ P("learn stock market play game" | Finance) × P(Finance)
P(Finance) = 2/5, as we have 2 finance sentences out of 5 sentences.
P(Not Finance | "learn stock market play game") ∝ P("learn stock market play game" | Not Finance) × P(Not Finance)
Problem: We don’t have any data with the exact sequence “learn stock market play game”.
Solution: Be naïve! Assume every word is conditionally independent given the tag:
P("learn stock market play game" | Finance) = P(learn | Finance) × P(stock | Finance) × P(market | Finance) × P(play | Finance) × P(game | Finance)
Smoothing: some word probabilities may be zero (a word never seen with a tag), which would zero out the whole product, so we apply Laplace (add-one) smoothing to every word count.
Result: we assign the tag with the larger smoothed probability.
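The original training table for this example was in a figure that is not reproduced here, so the tiny corpus below is made up to match the 2-finance / 3-other split; the point is only to show the naive independence assumption and add-one (Laplace) smoothing in working code.

```python
from collections import Counter

# Illustrative training corpus (invented; the original sentences were in a figure).
train = [
    ("finance", "learn stock trade"),
    ("finance", "stock market crash"),
    ("other", "play fun game"),
    ("other", "best game ever"),
    ("other", "play outside today"),
]

def train_nb(data):
    class_counts = Counter(label for label, _ in data)
    word_counts = {c: Counter() for c in class_counts}
    vocab = set()
    for label, text in data:
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def score(text, label, class_counts, word_counts, vocab):
    # P(label) * product of P(word | label), with add-one (Laplace) smoothing
    # so unseen words get a small nonzero probability instead of zeroing everything.
    total = sum(class_counts.values())
    p = class_counts[label] / total
    denom = sum(word_counts[label].values()) + len(vocab)
    for w in text.split():
        p *= (word_counts[label][w] + 1) / denom
    return p

cc, wc, vocab = train_nb(train)
s = "learn stock market play game"
finance = score(s, "finance", cc, wc, vocab)
other = score(s, "other", cc, wc, vocab)
```

With this toy corpus the finance score wins; the word "market", never seen with the "other" tag, still contributes a nonzero smoothed probability rather than collapsing the product to zero.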
Sentiment Lexicons
NLTK has two main sentiment data sources:
Bing Liu Opinion Lexicon
SentiWordNet
We also have VADER:
Sentiment metrics with scores in [-1, 1]:
- Positive
- Neutral
- Negative
- Compound
It considers the following cases:
- Punctuation: namely the exclamation point (!), which increases the magnitude of the intensity.
- Capitalization, specifically using ALL-CAPS to emphasize meaning.
- Degree modifiers (also called intensifiers, booster words, or degree adverbs): “The service is extremely good” > “The service is very good” > “The service is marginally good”.
- The contrastive conjunction “but”: “The food here is great, but the service is horrible”.
- Uses tri-grams to catch negations: “The food here isn’t really all that great”.
Word Representation
This captures relations between words.
The relation could also be between words from different languages.
Word2Vec
Word2Vec is a technique for natural language processing published in 2013. The word2vec
algorithm uses a neural network model to learn word associations from a large corpus of text.
Once trained, such a model can detect synonymous words or suggest additional words for a
partial sentence. As the name implies, word2vec represents each distinct word with a particular
list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical
function (the cosine similarity between the vectors) indicates the level of semantic
similarity between the words represented by those vectors.
Word2vec is a group of related models that are used to produce word embeddings. These
models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts
of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically
of several hundred dimensions, with each unique word in the corpus being assigned a
corresponding vector in the space. Word vectors are positioned in the vector space such that
words that share common contexts in the corpus are located close to one another in the space.
Word2vec can utilize either of two model architectures to produce a distributed
representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the
continuous bag-of-words architecture, the model predicts the current word from a window of
surrounding context words. The order of context words does not influence prediction (bag-of-
words assumption). In the continuous skip-gram architecture, the model uses the current word
to predict the surrounding window of context words. The skip-gram architecture weighs nearby
context words more heavily than more distant context words.
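The "simple mathematical function" mentioned above, cosine similarity, is easy to show directly. The 3-dimensional "embeddings" below are hand-made stand-ins (real word2vec vectors have hundreds of dimensions and are learned from a corpus), chosen only so that the two royalty words point in similar directions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy, hand-made "embeddings" for illustration only.
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.75, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

sim_royal = cosine_similarity(vec["king"], vec["queen"])
sim_fruit = cosine_similarity(vec["king"], vec["apple"])
```

In a trained word2vec space, semantically related words get the higher cosine score, exactly as the hand-made "king"/"queen" pair does here.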
Sentence Vectors
The same concept as Word2Vec, but for sentences.
Main goal: Create a numeric representation for a sentence (document) regardless of its length.
Pre-trained System
Universal Sentence Encoder: it uses Google’s sentence encoder and provides pre-trained
models to get fixed-size (512-dimensional) sentence vectors.
Modeling Select the appropriate model
Linear Learner
How does it work?
Preprocessing
- Training data must be normalized (so all features are weighted the same)
- Linear Learner can do this for you automatically
- Input data should be shuffled
Training
- Uses stochastic gradient descent
- Choose an optimization algorithm (Adam, AdaGrad, SGD…..etc.)
- Multiple models are optimized in parallel
- Tune L1, L2 regularization
Validation
- The most optimal model is selected
Input Formats
RecordIO-wrapped protobuf
- Float32 data only!
CSV
- First column assumed to be the label
File or Pipe mode both supported
Hyperparameters
Parameter Description
num_classes The number of classes for the response variable. The algorithm assumes that classes are
labeled 0... num_classes - 1.
Required when predictor_type is multiclass_classifier
Valid values: positive integer
predictor_type The type of target variable: binary classification, multiclass classification, or regression.
Required
Valid values: binary_classifier, multiclass_classifier, or regressor
accuracy_top_k When computing the top-k accuracy metric for multiclass classification, the value of k. If
the model assigns one of the top-k scores to the true label, an example is scored as
correct.
Optional
Valid values: Positive integers
Default value: 3
balance_multiclass_weights Specifies whether to use class weights, which give each class equal importance in the loss
function. Used only when the predictor_type is multiclass_classifier.
Optional
Valid values: true, false
Default value: false
binary_classifier_model_selection_criteria When predictor_type is set to binary_classifier, the model evaluation criteria for the
validation dataset (or for the training dataset if you don't provide a validation dataset).
Criteria include:
accuracy—The model with the highest accuracy.
f_beta—The model with the highest F-beta score. The default beta value is 1 (i.e., the F1 score).
precision_at_target_recall—The model with the highest precision at a given recall target.
recall_at_target_precision—The model with the highest recall at a given precision target.
loss_function—The model with the lowest value of the loss function used in training.
Optional
Valid values: accuracy, f_beta, precision_at_target_recall, recall_at_target_precision,
or loss_function
Default value: accuracy
epochs The maximum number of passes over the training data.
Optional
Valid values: Positive integer
Default value: 15
init_method Sets the initial distribution function used for model weights. Functions include:
uniform—Uniformly distributed between (-scale, +scale)
normal—Normal distribution, with mean 0 and standard deviation set by init_sigma
Optional
Valid values: uniform or normal
Default value: uniform
learning_rate The step size used by the optimizer for parameter updates.
Optional
Valid values: auto or positive floating-point number
Default value: auto, whose value depends on the optimizer chosen.
loss Specifies the loss function.
The available loss functions and their default values depend on the value
of predictor_type:
If the predictor_type is set to regressor, the available options
are auto, squared_loss, absolute_loss, eps_insensitive_squared_loss, eps_insensitive_absolute_loss, quantile_loss, and huber_loss. The default value for auto is squared_loss.
If the predictor_type is set to binary_classifier, the available options are auto,logistic,
and hinge_loss. The default value for auto is logistic.
If the predictor_type is set to multiclass_classifier, the available options
are auto and softmax_loss. The default value for auto is softmax_loss.
Valid
values: auto, logistic, squared_loss, absolute_loss, hinge_loss, eps_insensitive_squared_loss, eps_insensitive_absolute_loss, quantile_loss, or huber_loss
Optional
Default value: auto
mini_batch_size The number of observations per mini-batch for the data iterator.
Optional
Valid values: Positive integer
Default value: 1000
momentum The momentum of the sgd optimizer.
Optional
Valid values: auto or a floating-point number between 0 and 1.0
Default value: auto
num_models The number of models to train in parallel. For the default, auto, the algorithm decides the
number of parallel models to train. One model is trained according to the given training
parameter (regularization, optimizer, loss), and the rest by close parameters.
Optional
Valid values: auto or positive integer
Default values: auto
optimizer The optimization algorithm to use.
Optional
Valid values:
Auto — The default value.
Sgd — Stochastic gradient descent.
Adam — Adaptive momentum estimation.
Rmsprop — A gradient-based optimization technique that uses a moving average of
squared gradients to normalize the gradient.
Default value: auto. The default setting for auto is adam.
target_recall The target recall.
If binary_classifier_model_selection_criteria is precision_at_target_recall, then recall is
held at this value while precision is maximized.
Optional
Valid values: floating-point number between 0 and 1.0
Default value: 0.8
wd The weight decay parameter, also known as the L2 regularization parameter. If you don't
want to use L2 regularization, set the value to 0.
Optional
Valid values: auto or non-negative floating-point number
Default value: auto
Instance Types
Training
- Single or multi-machine CPU or GPU
Multi-GPU does not help
AWS KNN
How does it work?
Step 1: Sample
To specify the total number of data points to be sampled from the training dataset, use
the sample_size parameter. For example, if the initial dataset has 1,000 data points, sample_size
is set to 100, and two worker instances are used, each worker samples 50 points, and a total set
of 100 data points is collected. Sampling runs in linear time with respect to the number of
data points.
Step 2: Perform Dimension Reduction
The current implementation of the k-NN algorithm has two methods of dimension reduction. You
specify the method in the dimension_reduction_type hyperparameter.
- The sign method specifies a random projection, which uses a linear projection using a
matrix of random signs.
- The fjlt method specifies a fast Johnson-Lindenstrauss transform, a method based on the
Fourier transform. Use fjlt when the target dimension is large; it offers better
performance with CPU inference.
NOTE: Using dimension reduction introduces noise into the data and this noise can
reduce prediction accuracy.
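The sign method amounts to multiplying the data by a random ±1 matrix. A minimal NumPy sketch, with arbitrary toy dimensions, of what such a random sign projection looks like (SageMaker's internal implementation details may differ):

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, k = 100, 50, 10          # 100 points, 50 original dims, reduce to 10

X = rng.normal(size=(n, d))
# Random sign projection: each entry of the projection matrix is +1 or -1,
# scaled by 1/sqrt(k) so that pairwise distances are roughly preserved on average.
R = rng.choice([-1.0, 1.0], size=(d, k)) / np.sqrt(k)
X_reduced = X @ R
```

The projection is only approximately distance-preserving, which is exactly the "introduces noise" caveat in the note above.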
Hyperparameters
Parameter Description
feature_dim The number of features in the input data.
Required
Valid values: positive integer.
k The number of nearest neighbors.
Required
Valid values: positive integer
predictor_type The type of inference to use on the data labels.
Required
Valid values: classifier for classification or regressor for regression.
sample_size The number of data points to be sampled from the training data
set.
Required
Valid values: positive integer
dimension_reduction_target The target dimension to reduce to.
Required when you specify dimension_reduction_type
Valid values: positive integer less than feature_dim
dimension_reduction_type The dimension reduction method.
Optional
Valid values: sign for random projection or fjlt for the fast Johnson-
Lindenstrauss transform.
index_metric The metric used to measure the distance between points when finding
nearest neighbors.
Optional
Valid values: L2 for Euclidean distance, INNER_PRODUCT for inner-
product distance, COSINE for cosine similarity.
Default value: L2
index_type The type of index.
Optional
Valid values: faiss.Flat, faiss.IVFFlat, faiss.IVFPQ.
Default values: faiss.Flat
mini_batch_size The number of observations per mini-batch for the data iterator.
Optional
Valid values: positive integer
Default value: 5000
Input Formats
Train channel contains your data
- Test channel emits accuracy or MSE
recordIO-protobuf or CSV training
- First column is label
File or pipe mode on either
Instance Types
Training on CPU or GPU
- ml.m5.2xlarge
- ml.p2.xlarge
Inference
- CPU for lower latency
- GPU for higher throughput on large batches
3.2.3 K-Means
It is unsupervised machine learning algorithm used for clustering.
Consider the unlabeled dataset represented in the next figure: you can clearly see 5 blobs of
instances. The K-Means algorithm is a simple algorithm capable of clustering this kind of dataset
very quickly and efficiently, often in just a few iterations.
You have to specify the number of clusters k that the algorithm must find. In this example, it is
pretty obvious from looking at the data that k should be set to 5, but in general it is not that easy.
Each instance was assigned to one of the 5 clusters. In the context of clustering, an instance’s
label is the index of the cluster that this instance gets assigned to by the algorithm. The
algorithm decides which cluster to assign an instance to by measuring the distance between the
instance and the center of each cluster, called the centroid.
The vast majority of the instances were clearly assigned to the appropriate cluster, but a few
instances were probably mislabeled (especially near the boundary between the top left cluster
and the central cluster). Indeed, the K-Means algorithm does not behave very well when the
blobs have very different diameters since all it cares about when assigning an instance to a
cluster is the distance to the centroid.
Instead of assigning each instance to a single cluster, which is called hard clustering, it can be
useful to just give each instance a score per cluster: this is called soft clustering. For example, the
score can be the distance between the instance and the centroid, or conversely it can be a
similarity score (or affinity).
Unfortunately, the solution K-Means converges to depends on the random centroid
initialization. For example, the algorithm can converge to a sub-optimal solution if you are not
lucky with the random initialization step:
Solutions:
Centroid Initialization Methods
If you happen to know approximately where the centroids should be (e.g., if you ran
another clustering algorithm earlier), then you can set the init hyperparameter to a
NumPy array containing the list of centroids, and set n_init to 1.
K-Means++
Smarter initialization step that tends to select centroids that are distant from one another,
and this makes the K-Means algorithm much less likely to converge to a suboptimal
solution. K-Means++ initialization algorithm:
- Take one centroid c(1), chosen uniformly at random from the dataset.
- Take a new centroid c(i), choosing an instance x(i) with probability
D(x(i))^2 / Σ (j=1 to m) D(x(j))^2
where D(x(i)) is the distance between the instance x(i) and the closest centroid
that was already chosen.
- This probability distribution ensures that instances further away from already
chosen centroids are much more likely to be selected as centroids.
- Repeat the previous step until all k centroids have been chosen.
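The initialization steps above can be sketched directly in NumPy. The two well-separated toy blobs are made up so the effect is visible: after one centroid is drawn, the D(x)^2 weighting makes the second centroid come from the far blob almost surely.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k initial centroids: the first uniformly at random, then each
    next one with probability proportional to D(x)^2, the squared distance
    to the closest centroid chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance from every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
# two tight, very distant toy blobs (illustrative data)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(100, 0.1, (20, 2))])
C = kmeans_pp_init(X, 2, rng)
```

The two returned centroids land in different blobs, which is exactly the "distant from one another" behavior that makes K-Means++ less likely to converge to a sub-optimal solution.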
NOTE: The K-Means class actually uses this initialization method by default. If you
want to force it to use the original method (i.e., picking k instances randomly to
define the initial centroids), then you can set the init hyperparameter to "random".
Mini Batches
Instead of using the full dataset at each iteration, the algorithm is capable of using mini-batches,
moving the centroids just slightly at each iteration. This speeds up the algorithm typically by a
factor of 3 or 4 and makes it possible to cluster huge datasets that do not fit in memory.
Although the Mini-batch K-Means algorithm is much faster than the regular K-Means algorithm,
its inertia is generally slightly worse, especially as the number of clusters increases.
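The mini-batch variant is available in scikit-learn as MiniBatchKMeans (a minimal sketch on synthetic blob data):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=42)

# Processes one mini-batch per iteration instead of the full dataset,
# moving the centroids only slightly each time
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=10,
                      random_state=42).fit(X)
```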
The inertia is not a good performance metric when trying to choose k since it keeps getting
lower as we increase k. Indeed, the more clusters there are, the closer each instance will be to its
closest centroid, and therefore the lower the inertia will be.
Figure 83: Selecting the number of clusters k using the “elbow rule”
As you can see, the inertia drops very quickly as we increase k up to 4, but then it decreases
much more slowly as we keep increasing k. This curve has roughly the shape of an arm, and there
is an “elbow” at k=4 so if we did not know better, it would be a good choice: any lower value
would be dramatic, while any higher value would not help much, and we might just be splitting
perfectly good clusters in half for no good reason.
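The elbow rule amounts to computing the inertia for a range of k values and looking for the bend (a minimal sketch on synthetic data with 4 true clusters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Inertia for k = 1..8; plotting these against k reveals the "elbow"
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 9)]
```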
Feature Reduction
Suppose you have 20 features and you run K-Means with k = 5, so the dataset is divided
into 5 clusters. Now measure the distance from each data point to these 5 centroids,
drop the original feature vector, and use the new vector (the 5 cluster distances) instead.
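In scikit-learn this is exactly what KMeans.transform does: it maps each sample to its distances from the centroids (a minimal sketch reducing 20 features to 5):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, n_features=20, centers=5, random_state=0)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Each sample is now described by its distance to the 5 centroids
X_dist = km.transform(X)
```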
Silhouette score
A more precise approach (but also more computationally expensive) is to use the silhouette
score, which is the mean silhouette coefficient over all the instances. An instance’s silhouette
coefficient is equal to (b – a) / max (a, b) where a is the mean distance to the other instances in
the same cluster (it is the mean intra-cluster distance), and b is the mean nearest-cluster
distance, that is the mean distance to the instances of the next closest cluster (defined as the one
that minimizes b, excluding the instance’s own cluster). The silhouette coefficient can vary
between -1 and +1: a coefficient close to +1 means that the instance is well inside its own cluster
and far from other clusters, while a coefficient close to 0 means that it is close to a cluster
boundary, and finally a coefficient close to -1 means that the instance may have been assigned to
the wrong cluster.
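Computing the mean silhouette coefficient for several candidate values of k is straightforward with scikit-learn (a minimal sketch on synthetic data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette coefficient
```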
Figure 84: Selecting the number of clusters k using the silhouette score
As you can see, this visualization is much richer than the previous one: in particular, although it
confirms that k=4 is a very good choice, it also underlines the fact that k=5 is quite good as well,
and much better than k=6 or 7. This was not visible when comparing inertias.
Limits of K-Means
Despite its many merits, most notably being fast and scalable, K-Means is not perfect.
It is necessary to run the algorithm several times to avoid sub-optimal solutions.
You need to specify the number of clusters, which can be quite a hassle.
K-Means does not behave very well when the clusters have varying sizes, different
densities, or non-spherical shapes.
NOTE: It is important to scale the input features before you run K-Means, or else
the clusters may be much stretched, and K-Means will perform poorly. Scaling the
features does not guarantee that all the clusters will be nice and spherical, but it
generally improves things.
Hyperparameters
Parameter Description
feature_dim The number of features in the input data.
Required
Valid values: positive integer.
k The number of required clusters.
Required
Valid values: Positive integer
epochs The number of passes done over the training data.
Optional
Valid values: Positive integer
Default value: 1
eval_metrics A JSON list of metric types used to report a score for the model.
Allowed values are msd for Mean Squared Deviation and ssd for Sum of
Squared Distances. If test data is provided, the score is reported for
each of the metrics requested.
Optional
Valid values: Either ["msd"], ["ssd"], or ["msd","ssd"].
Default value: ["msd"]
extra_center_factor The algorithm creates K = k * extra_center_factor centers as it
runs and reduces the number of centers from K to k when
finalizing the model.
Optional
Valid values: Either a positive integer or auto.
Default value: auto
init_method Method by which the algorithm chooses the initial cluster centers.
The standard k-means approach chooses them at random. An
alternative k-means++ method chooses the first cluster center at
random. Then it spreads out the position of the remaining initial
clusters by weighting the selection of centers with a probability
distribution that is proportional to the square of the distance of the
remaining data points from existing centers.
Optional
Valid values: Either random or kmeans++.
Default value: random
mini_batch_size The number of observations per mini-batch for the data iterator.
Optional
Valid values: Positive integer
Default value: 5000
Input Formats
Train channel, optional test
Instance Types
CPU or GPU, but CPU recommended
Only one GPU per instance is used
So use p*.xlarge if you’re going to use a GPU
Projection
Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of
the high-dimensional (3D) space. Now if we project every training instance perpendicularly onto
this subspace (as represented by the short lines connecting the instances to the plane), we get
the new 2D dataset shown in the below figure. We have just reduced the dataset’s
dimensionality from 3D to 2D. Note that the axes correspond to new features z1 and z2 (the
coordinates of the projections on the plane).
However, projection is not always the best approach to dimensionality reduction. In many cases
the subspace may twist and turn, such as in the famous Swiss roll toy dataset represented in the
below figure.
Simply projecting onto a plane (e.g., by dropping x3 ) would squash different layers of the Swiss
roll together, as shown on the left of Figure 88. However, what you really want is to unroll the
Swiss roll to obtain the 2D dataset on the right of Figure 88.
Figure 88: Squashing by projecting onto a plane (left) versus unrolling the Swiss roll
Principal Component Analysis (PCA) is by far the most popular dimensionality reduction
algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the
data onto it, just like in Figure 85.
Preserving Variance
Before you can project the training set onto a lower-dimensional hyperplane, you first need to
choose the right hyperplane. For example, a simple 2D dataset is represented on the left of
below figure, along with three different axes (i.e., one-dimensional hyperplanes). On the right is
the result of the projection of the dataset onto each of these axes. As you can see, the projection
onto the solid line preserves the maximum variance, while the projection onto the dotted line
preserves very little variance, and the projection onto the dashed line preserves an intermediate
amount of variance.
It seems reasonable to select the axis that preserves the maximum amount of variance, as it will
most likely lose less information than the other projections. Another way to justify this choice is
that it is the axis that minimizes the mean squared distance between the original dataset and its
projection onto that axis. This is the rather simple idea behind PCA.
Principal Components
PCA identifies the axis that accounts for the largest amount of variance in the training set. In the
above figure, it is the solid line. It also finds a second axis, orthogonal to the first one that
accounts for the largest amount of remaining variance. In this 2D example there is no choice: it is
the dotted line. If it were a higher-dimensional dataset, PCA would also find a third axis,
orthogonal to both previous axes, and a fourth, a fifth, and so on—as many axes as the number
of dimensions in the dataset.
The unit vector that defines the ith axis is called the ith principal component (PC). In Figure 89, the
1st PC is c1 and the 2nd PC is c2. In Figure 85 the first two PCs are represented by the orthogonal
arrows in the plane, and the third PC would be orthogonal to the plane (pointing up or down).
So how can you find the principal components of a training set? Luckily, there is a standard
matrix factorization technique called Singular Value Decomposition (SVD).
Projecting to d Dimensions
Once you have identified all the principal components, you can reduce the dimensionality of the
dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal
components. Selecting this hyperplane ensures that the projection will preserve as much
variance as possible. For example, in Figure 85 the 3D dataset is projected down to the 2D plane
defined by the first two principal components, preserving a large part of the dataset’s variance.
As a result, the 2D projection looks very much like the original 3D dataset. To project the training
set onto the hyperplane, you can simply compute the matrix multiplication of the training set
matrix X by the matrix Wd, defined as the matrix containing the first d principal components (i.e.,
the matrix composed of the first d columns of V obtained from the SVD).
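The projection X · Wd is what scikit-learn's PCA computes internally (a minimal sketch on synthetic 3D data, verifying the matrix-multiplication view):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(4)
X = rng.rand(100, 3)  # synthetic 3D dataset

pca = PCA(n_components=2)
X2d = pca.fit_transform(X)   # project 3D -> 2D (new features z1, z2)

# W_d: the first d principal components as columns
Wd = pca.components_.T

# The projection is just the centered training set times W_d
X2d_manual = (X - X.mean(axis=0)) @ Wd
```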
Randomized PCA
If you set the svd_solver hyperparameter to "randomized", Scikit-Learn uses a stochastic
algorithm called Randomized PCA that quickly finds an approximation of the first d principal
components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³)
for the full SVD approach, so it is dramatically faster than full SVD when d is much smaller than n.
By default, svd_solver is actually set to "auto": Scikit-Learn automatically uses the randomized
PCA algorithm if m or n is greater than 500 and d is less than 80% of m or n, or else it uses the full
SVD approach. If you want to force Scikit-Learn to use full SVD, you can set the svd_solver
hyperparameter to "full".
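A quick comparison of the two solvers (a sketch on random data; for a flat random spectrum the randomized approximation stays close to the exact solution):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(1000, 60)

# Stochastic approximation of the first 10 principal components
rnd_pca = PCA(n_components=10, svd_solver="randomized", random_state=0).fit(X)

# Exact full SVD for reference
full_pca = PCA(n_components=10, svd_solver="full").fit(X)
```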
Incremental PCA
One problem with the preceding implementations of PCA is that they require the whole training
set to fit in memory in order for the algorithm to run. Fortunately, Incremental PCA (IPCA)
algorithms have been developed: you can split the training set into mini-batches and feed an
IPCA algorithm one mini-batch at a time. This is useful for large training sets, and also to apply
PCA online (i.e., on the fly, as new instances arrive).
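Feeding mini-batches one at a time looks like this with scikit-learn's IncrementalPCA (a minimal sketch; in practice each batch would come from disk or a stream):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

X = np.random.RandomState(0).rand(1000, 20)

ipca = IncrementalPCA(n_components=5)
for batch in np.array_split(X, 10):  # feed one mini-batch at a time
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X)
```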
Hyperparameters
Parameter Description
feature_dim Input dimension.
Required
Valid values: positive integer
Input Formats
recordIO-protobuf or CSV
File or Pipe on either
Instance Types
GPU or CPU
It depends on the specifics of the input data
NOTE: The PCA and K-Means algorithms are useful for analyzing data collected
using census (population count) forms.
3.2.5 XGBoost
This is a supervised machine learning algorithm used for regression and classification. It stands
for Extreme Gradient Boosting.
Boosted group of trees.
New trees made to correct the errors of previous trees.
Uses gradient descent to minimize loss as new trees are added.
The model is serialized and de-serialized with Pickle.
Can be used within your notebook (AWS Only).
Algorithm Steps
1. Make an initial prediction; this could be any value, but by default it is 0.5.
2. Calculate the residuals, which are the differences between the observed and predicted
values.
3. Build the XGBoost tree (common way): start the tree with a single leaf and assign all the
residuals to this leaf.
4. Calculate the Similarity Score for this leaf:
Similarity Score = (Sum of residuals)² / (Number of residuals + λ)
where λ is the L2 regularization parameter.
5. Now we want to decide whether to split this leaf into further branches. Take the first two
observations with the lowest values and calculate their average, e.g., Dosage < 15: if yes,
the residuals for those observations go to one leaf, and the observations above 15 go to
the other leaf; then calculate the similarity score for both leaves.
6. We need to quantify how much better the leaves cluster similar residuals than the root.
We do this by calculating the Gain of splitting the residuals into two groups:
Gain = Left similarity + Right similarity − Root similarity
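The similarity score and gain from steps 4–6 can be computed directly (a toy sketch; the residual values and split are made up for illustration):

```python
def similarity_score(residuals, lam=0.0):
    """(sum of residuals)^2 / (number of residuals + lambda)."""
    return sum(residuals) ** 2 / (len(residuals) + lam)

def gain(left, right, lam=0.0):
    """How much better two leaves cluster residuals than their parent leaf."""
    return (similarity_score(left, lam) + similarity_score(right, lam)
            - similarity_score(left + right, lam))

residuals = [-10.5, 6.5, 7.5, -7.5]          # hypothetical residuals
left, right = residuals[:1], residuals[1:]   # split at e.g. Dosage < 15
g = gain(left, right)                        # positive gain -> a useful split
```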
9. Build a simple tree that divides the observations using the new threshold Dosage<22.5.
10. Calculate the gain for the new tree.
18. Shift the threshold for the second branch and calculate the gain.
19. As the gain for Dosage < 30 in the second branch is greater than the gain for
Dosage < 22.5, we use Dosage < 30 for the second branch.
20. We are done building this tree for the observations we have. In practice, the default is to
allow 6 levels.
21. We use a hyperparameter gamma (γ) to decide whether to prune a branch: if (Gain − γ)
is negative, remove the branch; if positive, keep it.
22. Calculate the output value for all the leaves:
Output = (Sum of residuals) / (Number of residuals + λ)
25. Now the new residuals are much smaller than the ones calculated in step 2.
26. Now we can build another tree based on the new residuals, which gives smaller residuals.
27. We keep building trees with smaller residuals until the residuals are very small or we
have reached the maximum number of trees.
Lambda (λ) is a regularization parameter used to decrease the similarity scores; its effect is
inversely proportional to the number of residuals in the node. It also decreases the Gain values.
When λ > 0, it is easier to prune the trees because the Gain values are smaller.
NOTE: When the Gain values are large it is hard to prune the tree, because the
gamma (γ) value must exceed the Gain before a branch is removed.
Hyperparameters
Parameter Description
num_class The number of classes.
Required if objective is set to multi:softmax or multi:softprob.
Valid values: integer
num_round The number of rounds to run the training.
Required
Valid values: integer
alpha L1 regularization term on weights. Increasing this value makes models
more conservative.
Optional
Valid values: float
Default value: 0
base_score The initial prediction score of all instances, global bias.
Optional
Valid values: float
Default value: 0.5
booster Which booster to use. The gbtree and dart values use a tree-based
model, while gblinear uses a linear function.
Optional
Valid values: String. One of gbtree, gblinear, or dart.
Default value: gbtree
colsample_bylevel Subsample ratio of columns for each split, in each level.
Optional
Valid values: Float. Range: [0,1].
Default value: 1
colsample_bynode Subsample ratio of columns from each node.
Optional
Valid values: Float. Range: (0,1].
Default value: 1
colsample_bytree Subsample ratio of columns when constructing each tree.
Optional
Valid values: Float. Range: [0,1].
Default value: 1
deterministic_histogram When this flag is enabled, XGBoost builds histogram on GPU
deterministically. Used only if tree_method is set to gpu_hist.
Optional
Valid values: String. Range: true or false
Default value: true
early_stopping_rounds The model trains until the validation score stops improving. Validation
error needs to decrease at least every early_stopping_rounds to
continue training. SageMaker hosting uses the best model for
inference.
Optional
Valid values: integer
Default value: -
Eta (learning rate) Step size shrinkage used in updates to prevent overfitting. After each
boosting step, you can directly get the weights of new features.
The eta parameter actually shrinks the feature weights to make the
boosting process more conservative.
Optional
Valid values: Float. Range: [0,1].
Default value: 0.3
eval_metric Evaluation metrics for validation data. A default metric is assigned
according to the objective:
rmse: for regression
error: for classification
map: for ranking
gamma Minimum loss reduction required to make a further partition on a leaf
node of the tree. The larger, the more conservative the algorithm is.
Optional
Valid values: Float. Range: [0,∞).
Default value: 0
grow_policy Controls the way that new nodes are added to the tree. Currently
supported only if tree_method is set to hist.
Optional
seed Random number seed.
Optional
Valid values: integer
Default value: 0
single_precision_histogram When this flag is enabled, XGBoost uses single precision to build
histograms instead of double precision. Used only if tree_method is set
to hist or gpu_hist.
Optional
Valid values: String. Range: true or false
Default value: false
sketch_eps Used only for the approximate greedy algorithm. This translates
into O(1 / sketch_eps) number of bins. Compared to directly
selecting the number of bins, this comes with a theoretical
guarantee of sketch accuracy.
Optional
Valid values: Float, Range: [0, 1].
Default value: 0.03
skip_drop Probability of skipping the dropout procedure during a boosting
iteration.
Optional
Valid values: Float. Range: [0.0, 1.0].
Default value: 0.0
tree_method The tree construction algorithm used in XGBoost.
Optional
Valid values: One of auto, exact, approx, hist, or gpu_hist.
Default value: auto
tweedie_variance_power Parameter that controls the variance of the Tweedie distribution.
Optional
Valid values: Float. Range: (1, 2).
Default value: 1.5
updater A comma-separated string that defines the sequence of tree updaters
to run. This provides a modular way to construct and to modify the
trees.
Optional
Valid values: comma-separated string.
Default value: grow_colmaker, prune
verbosity Verbosity of printing messages.
Valid values: 0 (silent), 1 (warning), 2 (info), 3 (debug).
Optional
Default value: 1
Important Hyperparameters
Subsample
- Prevents overfitting
Eta
- Step size shrinkage, prevents overfitting
Gamma
- Minimum loss reduction to create a partition; larger = more conservative
Alpha
- L1 regularization term; larger = more conservative
Lambda
- L2 regularization term; larger = more conservative
eval_metric
- Optimize on AUC, error, rmse, etc.
For example, if you care about false positives more than accuracy, you might
use AUC here
scale_pos_weight
- Adjusts balance of positive and negative weights
- Helpful for unbalanced classes
- Might set to sum(negative cases) / sum(positive cases)
max_depth
- Max depth of the tree
- Too high and you may overfit
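For instance, an imbalanced binary-classification job might combine these knobs as follows (a sketch with illustrative values only; the class counts are made up):

```python
# Hypothetical class counts for an imbalanced dataset
negatives, positives = 9_000, 1_000

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",                       # care about ranking, not raw accuracy
    "eta": 0.1,                                 # step-size shrinkage vs. overfitting
    "gamma": 1.0,                               # min loss reduction to split
    "alpha": 0.5,                               # L1 regularization
    "lambda": 1.0,                              # L2 regularization
    "subsample": 0.8,                           # row subsampling vs. overfitting
    "max_depth": 6,                             # deeper trees overfit more easily
    "scale_pos_weight": negatives / positives,  # rebalance the classes
}
```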
Input Formats
Takes CSV or libsvm input.
RecordIO-protobuf and Parquet are accepted as well
Instance Types
Uses CPUs only for multiple-instance training
Is memory-bound, not compute-bound
- So, M5 is a good choice
As of XGBoost 1.2, single-instance GPU training is available
- For example P3
Must set tree_method hyperparameter to gpu_hist
Trains more quickly and can be more cost effective
3.2.6 IP Insights
Amazon SageMaker IP Insights is an unsupervised learning algorithm that learns the usage
patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and
various entities, such as user IDs or account numbers. You can use it to identify a user attempting
to log into a web service from an anomalous IP address, for example. Or you can use it to identify
an account that is attempting to create computing resources from an unusual IP address. Trained
IP Insight models can be hosted at an endpoint for making real-time predictions or used for
processing batch transforms.
SageMaker IP insights ingests historical data as (entity, IPv4 Address) pairs and learns the IP usage
patterns of each entity. When queried with an (entity, IPv4 Address) event, a SageMaker IP
Insights model returns a score that infers how anomalous the pattern of the event is. For
example, when a user attempts to log in from an IP address, if the IP Insights score is high
enough, a web login server might decide to trigger a multi-factor authentication system. In more
advanced solutions, you can feed the IP Insights score into another machine learning model. For
example, you can combine the IP Insight score with other features to rank the findings of another
security system, such as those from Amazon GuardDuty.
How is it used?
Uses a neural network to learn latent vector representations of entities and IP addresses.
Entities are hashed and embedded, so a sufficiently large hash size is needed: entity
names are fed in as strings, so they must be hashed before being fed to the network,
and IP Insights does this for us. The hash size should be large enough to ensure that
the number of collisions, which occur when distinct entities are mapped to the same
latent vector, remains insignificant.
Automatically generates negative samples during training by randomly pairing entities and
IPs. This overcomes the class-imbalance problem: the algorithm automatically generates
data by randomly pairing entities and IPs and labels those pairs as negative (i.e., cannot
access), because anomalous events are sure to be rarer than legitimate ones. This is the
same situation as in fraud detection, where fraudulent transactions are far fewer than
good transactions.
During training, IP Insights automatically generates negative samples by randomly pairing
entities and IP addresses. These negative samples represent data that is less likely to occur
in reality. The model is trained to discriminate between positive samples that are observed
in the training data and these generated negative samples. More specifically, the model is
trained to minimize the cross entropy.
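The negative-sampling idea can be sketched in a few lines (a toy illustration; the entity/IP values are made up and this is not the SageMaker implementation):

```python
import random

# Observed (entity, IP) training pairs -- hypothetical values
observed = [("user_a", "10.0.0.1"), ("user_b", "10.0.0.2"), ("user_c", "192.168.1.9")]

def negative_samples(pairs, n, rng=random.Random(0)):
    """Randomly re-pair entities and IPs; random pairings are unlikely
    to be real associations, so they serve as negative examples."""
    entities = [e for e, _ in pairs]
    ips = [ip for _, ip in pairs]
    return [(rng.choice(entities), rng.choice(ips)) for _ in range(n)]

negatives = negative_samples(observed, 5)
```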
Hyperparameters
Parameter Name Description
num_entity_vectors The number of entity vector representations (entity
embedding vectors) to train. Each entity in the training set
is randomly assigned to one of these vectors using a hash
function. Because of hash collisions, it might be possible to
have multiple entities assigned to the same vector. This
would cause the same vector to represent multiple
entities. This generally has a negligible effect on model
performance, as long as the collision rate is not too severe.
To keep the collision rate low, set this value as high as
possible. However, the model size, and, therefore, the
memory requirement, for both training and inference,
scales linearly with this hyperparameter. We recommend
that you set this value to twice the number of unique
entity identifiers.
Required
Valid values: 1 ≤ positive integer ≤ 250,000,000
vector_dim The size of embedding vectors to represent entities and IP
addresses. The larger the value, the more information that
can be encoded using these representations. In practice,
model size scales linearly with this parameter and limits
how large the dimension can be. In addition, using vector
representations that are too large can cause the model to
overfit, especially for small training datasets. Overfitting
occurs when a model doesn't learn any pattern in the data
but effectively memorizes the training data and, therefore, does not generalize well.
Input Formats
User names and account IDs can be fed in directly; no need to pre-process
Training channel, optional validation (computes AUC score)
CSV only
- Entity
- IP
Instance Types
CPU or GPU
- GPU recommended
- ml.p3.2xlarge or higher
- Can use multiple GPUs
- Size of CPU instance depends on vector_dim and num_entity_vectors
The factorization machine model prediction is
ŷ(x) = w0 + Σi wi xi + Σi Σj>i ⟨vi, vj⟩ xi xj
The three terms in this equation correspond respectively to the three components of the model:
The ⟨vi, vj⟩ factorization terms model the pairwise interaction between the ith and
jth variables.
The global bias and linear terms are the same as in a linear model. The pairwise feature
interactions are modeled in the third term as the inner product of the corresponding factors
learned for each feature. Learned factors can also be considered as embedding vectors for each
feature. For example, in a classification task, if a pair of features tends to co-occur more often in
positive labeled samples, then the inner product of their factors would be large. In other words,
their embedding vectors would be close to each other in cosine similarity.
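The three-term model can be sketched in NumPy (toy weights made up for illustration; the efficient pairwise formulation used below is standard for factorization machines):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization machine: global bias + linear term + pairwise interactions.
    V holds one k-dimensional factor (embedding) vector per feature."""
    linear = w @ x
    Vx = V.T @ x
    # 0.5 * (||V^T x||^2 - sum_i ||v_i||^2 x_i^2) equals sum_{i<j} <v_i, v_j> x_i x_j
    pairwise = 0.5 * (Vx @ Vx - (V ** 2).sum(axis=1) @ (x ** 2))
    return w0 + linear + pairwise

x = np.array([1.0, 0.0, 1.0])                       # features 1 and 3 co-occur
w0, w = 0.1, np.array([0.2, 0.0, -0.1])
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # per-feature factor vectors
score = fm_predict(x, w0, w, V)
```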
Hyperparameters
Parameter Name Description
num_factors The dimensionality of factorization.
Required
Valid values: Positive integer. Suggested value range: [2,1000], 64
typically generates good outcomes and is a good starting point.
predictor_type The type of predictor.
binary_classifier: For binary classification tasks.
regressor: For regression tasks.
Required
Valid values: String: binary_classifier or regressor
bias_init_method The initialization method for the bias term:
normal: Initializes weights with random values sampled from a
normal distribution with a mean of zero and standard deviation
specified by bias_init_sigma.
uniform: Initializes weights with random values uniformly
sampled from a range specified by [-bias_init_scale,
+bias_init_scale].
constant: Initializes the weights to a scalar value specified
by bias_init_value.
Optional
Valid values: uniform, normal, or constant
Default value: normal
bias_init_scale Range for initialization of the bias term. Takes effect
if bias_init_method is set to uniform.
Optional
All the bias parameters also exist for the factors and linear terms: replace bias with
factors or linear in the parameter name, i.e., factors_init_method and linear_init_method.
NOTE: Bias is the linear bias term, Linear is the linear weights, and Factor is the
factor weights.
Input Formats
recordIO-protobuf with Float32
Sparse data means CSV isn’t practical
Instance Types
CPU or GPU
- CPU recommended
- GPU only works with dense data
Hyperparameters
Parameter Name Description
num_classes The number of output classes. This parameter defines the
dimensions of the network output and is typically set to the
number of classes in the dataset.
Required
Valid values: positive integer
Input Formats
RecordIO or image format (JPG or PNG)
With image format, supply a JSON file for annotation data for each image
Instance Types
Use GPU instances for training (multi-GPU and multi-machine)
- ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge,
ml.p3.16xlarge
Use CPU or GPU for inference
- C5, M5, P2, P3
Hyperparameters
Parameter Name Description
num_classes Number of output classes. This parameter defines the
dimensions of the network output and is typically set to the
number of classes in the dataset.
Besides multi-class classification, multi-label classification is
supported too.
Required
Valid values: positive integer
augmentation_type Data augmentation type. The input images can be augmented
in multiple ways as specified below.
crop: Randomly crop the image and flip the image horizontally
crop_color: In addition to ‘crop’, three random values in the
range [-36, 36], [-50, 50], and [-50, 50] are added to the
corresponding Hue-Saturation-Lightness channels respectively
crop_color_transform: In addition to crop_color, random
transformations, including rotation, shear, and aspect ratio
variations are applied to the image. The maximum angle of
rotation is 10 degrees, the maximum shear ratio is 0.1, and the
maximum aspect changing ratio is 0.25.
Optional
Valid values: crop, crop_color, or crop_color_transform.
Default value: no default value
beta_1 The beta1 for adam, that is the exponential decay rate for the
first moment estimates.
Optional
Valid values: float. Range in [0, 1].
Default value: 0.9
beta_2 The beta2 for adam, that is the exponential decay rate for the
second moment estimates.
Optional
Valid values: float. Range in [0, 1].
Default value: 0.999
epochs Number of training epochs.
Optional
Valid values: positive integer
Default value: 30
gamma The gamma for rmsprop, the decay factor for the moving
average of the squared gradient.
Optional
Valid values: float. Range in [0, 1].
Default value: 0.9
image_shape The input image dimensions, which is the same size as the
input layer of the network. The format is defined as
'num_channels, height, width'.
learning_rate Initial learning rate.
Optional
Valid values: float. Range in [0, 1].
Default value: 0.1
momentum The momentum for sgd and nag, ignored for other optimizers.
Optional
Valid values: float. Range in [0, 1].
Default value: 0.9
multi_label Flag to use for multi-label classification where each sample can
be assigned multiple labels. Average accuracy across all classes
is logged.
Optional
Valid values: 0 or 1
Default value: 0
optimizer The optimizer type. For more details of the parameters for the
optimizers, please refer to MXNet's API.
Optional
Valid values: One of sgd, adam, rmsprop, or nag.
use_pretrained_model Flag to use pre-trained model for training. If set to 1, then the
pretrained model with the corresponding number of layers is
loaded and used for training. Only the top FC layer is reinitialized with random weights.
Input Formats
Apache MXNet RecordIO
- Not protobuf
Supports both RecordIO (application/x-recordio) and image (image/png, image/jpeg,
and application/x-image) content types for training in file mode.
Image format requires .lst files to associate image index, class label, and path to the
image
Supports the RecordIO (application/x-recordio) content type for training in pipe mode.
Augmented Manifest Image Format enables Pipe mode
The algorithm supports image/png, image/jpeg, and application/x-image for inference.
Instance Types
GPU instances for training (P2, P3) either multi-GPU or multi-machine.
CPU or GPU for inference (C4, P2, P3)
Hyperparameters
Parameter Name Description
backbone The backbone to use for the algorithm's encoder
component.
Optional
Valid values: resnet-50, resnet-101
Default value: resnet-50
use_pretrained_model Whether a pretrained model is to be used for the backbone.
Optional
Valid values: True, False
Default value: True
algorithm The algorithm to use for semantic segmentation.
Optional
Valid values:
fcn: Fully-Convolutional Network (FCN) algorithm
psp: Pyramid Scene Parsing (PSP) algorithm
deeplab: DeepLab V3 algorithm
Input Formats
JPG Images and PNG annotations
For both training and validation
Label maps to describe annotations
Augmented manifest image format supported for Pipe mode.
JPG images accepted for inference
Instance Types
Only GPU supported for training (P2 or P3) on a single machine only
Word2Vec
Creates a vector representation of words NOT sentences or documents
Semantically similar words are represented by vectors close to each other
This is called a word embedding
It is useful for NLP, but is not an NLP algorithm in itself!
Used in machine translation, sentiment analysis
Remember it only works on individual words, not sentences or documents
Word2vec has multiple modes
- Cbow (Continuous Bag of Words): order of words doesn’t matter
- Skip-gram: order of words matters
- Batch skip-gram
Distributed computation over many CPU nodes
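“Close to each other” here means high cosine similarity between the embedding vectors (a toy sketch; the 4-dimensional vectors below are made up purely for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up word embeddings for illustration only
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.85, 0.75, 0.2, 0.05]),
    "car":   np.array([0.0, 0.1, 0.9, 0.8]),
}

sim_royal = cosine_similarity(emb["king"], emb["queen"])
sim_unrelated = cosine_similarity(emb["king"], emb["car"])
```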
Hyperparameters
Word2Vec
Parameter Name Description
mode The Word2vec architecture used for training.
Required
Valid values: batch_skipgram, skipgram, or cbow
batch_size The size of each batch when mode is set
to batch_skipgram. Set to a number between 10 and 20.
Optional
Valid values: Positive integer
Default value: 11
buckets The number of hash buckets to use for subwords.
Optional
Valid values: positive integer
Default value: 2000000
evaluation Whether the trained model is evaluated using
the WordSimilarity-353 Test
learning_rate The step size used for parameter updates.
Optional
Valid values: Positive float
Default value: 0.05
negative_samples The number of negative samples for the negative sample
sharing strategy.
Optional
Valid values: Positive integer
Default value: 5
min_count Words that appear less than min_count times are
discarded.
Optional
Valid values: Non-negative integer
Default value: 5
vector_dim The dimension of the word vectors that the algorithm
learns.
Optional
Valid values: Positive integer
Default value: 100
window_size The size of the context window. The context window is
the number of words surrounding the target word used
for training.
Optional
Valid values: Positive integer
Default value: 5
Text Classification
Parameter Name Description
mode The training mode.
Required
Input Formats
For supervised mode:
One sentence per line
First “word” in the sentence is the string __label__ followed by the label
Also, “augmented manifest text format”
Text should be pre-processed
For word2vec mode:
Just wants a text file with one training sentence per line.
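Formatting the supervised-mode input described above can be sketched as follows (the function name and sample data are hypothetical):

```python
def to_blazingtext_supervised(samples):
    """One sentence per line; the label comes first as __label__<label>.
    Text is assumed to be already pre-processed (lowercased, space-tokenized)."""
    lines = []
    for label, sentence in samples:
        lines.append("__label__{} {}".format(label, sentence))
    return "\n".join(lines)

data = [("positive", "great product works well"),
        ("negative", "broke after one day")]
print(to_blazingtext_supervised(data))
# __label__positive great product works well
# __label__negative broke after one day
```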
Instance Types
For cbow and skipgram, a single ml.p3.2xlarge is recommended
- Any single CPU or single GPU instance will work
For batch_skipgram,
- can use single or multiple CPU instances
For text classification,
- C5 recommended if less than 2 GB of training data. For larger data sets, use a single GPU instance (ml.p2.xlarge or ml.p3.2xlarge)
3.2.12 Seq2Seq
Input is a sequence of tokens, output is a sequence of tokens
Machine Translation
Text summarization
Speech to text
Implemented mainly with RNNs and CNNs with attention
Training for machine translation can take days, even on SageMaker
Pre-trained models are available
Public training datasets are available for specific translation tasks (ready-made translations)
Algorithm
Typically, a neural network for sequence-to-sequence modeling consists of a few layers,
including:
- Embedding layer. In this layer, the input matrix, which consists of input tokens encoded in a sparse
way (for example, one-hot encoded), is mapped to a dense feature layer. This is required
because a high-dimensional feature vector is more capable of encoding information
regarding a particular token (a word, for text corpora) than a simple one-hot-encoded vector.
It is also standard practice to initialize this embedding layer with pre-trained word
vectors like fastText or GloVe, or to initialize it randomly and learn the parameters during
training.
- Encoder layer. After the input tokens are mapped into a high-dimensional feature space,
the sequence is passed through an encoder layer to compress all the information from the
input embedding layer (of the entire sequence) into a fixed-length feature vector.
Typically, an encoder is made of RNN-type networks such as long short-term memory (LSTM)
or gated recurrent units (GRU).
- Decoder layer. The decoder layer takes this encoded feature vector and produces the
output sequence of tokens. This layer is also usually built with RNN architectures (LSTM
and GRU).
The whole model is trained jointly to maximize the probability of the target sequence given the
source sequence.
Attention mechanism. The disadvantage of an encoder-decoder framework is that model
performance decreases as the length of the source sequence increases, because of the
limit on how much information the fixed-length encoded feature vector can contain. To tackle this
problem, the algorithm uses an attention mechanism, in which the decoder tries to find the location in
the encoder sequence where the most important information is located, and uses that
information together with previously decoded words to predict the next token in the sequence.
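The idea can be sketched with minimal dot-product attention over encoder states (illustrative only; the real model uses learned attention layers):

```python
import math

def attention(decoder_state, encoder_states):
    """Score each encoder state against the decoder state (dot product),
    softmax the scores, and return the attention weights plus the
    weighted context vector."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(decoder_state)
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

w, ctx = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(w)  # the encoder state most aligned with the decoder state gets the largest weight
```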
Hyperparameters
Parameter Name Description
batch_size Mini batch size for gradient descent.
Optional
Valid values: positive integer
Default value: 64
beam_size Length of the beam for beam search. Used during
training for computing bleu and used during inference.
Optional
Valid values: positive integer
Default value: 5
bleu_sample_size Number of instances to pick from validation dataset to
decode and compute bleu score during training. Set to -1
to use full validation set (if bleu is chosen
as optimized_metric).
Optional
Valid values: integer
Default value: 0
Input Formats
RecordIO-Protobuf
- Tokens must be integers (this is unusual, since most algorithms want floating point
data.)
- For example, indices into vocabulary files
Starting from tokenized text files, you need to actually build a vocabulary file that maps every
word to a number.
- You should provide the vocabulary file and the tokenized text files
Convert to protobuf using sample code
- Packs into integer tensors with vocabulary files
- Much like TF-IDF
Must provide training data, validation data, and vocabulary files
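Building the vocabulary file and tokenizing sentences into integers can be sketched as follows (the reserved tokens and function names are illustrative, not SageMaker's exact format):

```python
def build_vocab(sentences, reserved=("<pad>", "<unk>", "<s>", "</s>")):
    """Map every word to an integer; reserved tokens take the low ids
    (the reserved tokens here are an assumption for the sketch)."""
    vocab = {tok: i for i, tok in enumerate(reserved)}
    for sent in sentences:
        for word in sent.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Replace each word by its integer id; unknown words map to <unk>."""
    return [vocab.get(w, vocab["<unk>"]) for w in sentence.split()]

corpus = ["i like tea", "i like coffee"]
v = build_vocab(corpus)
print(encode("i like cocoa", v))  # "cocoa" falls back to the <unk> id
```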
Instance Types
Can only use GPU instance types (P3 for example)
Can only use a single machine for training
But can use multi-GPU’s on one machine
3.2.13 Object2Vec
Object2Vec is a general-purpose neural embedding algorithm. It generalizes the well-known
Word2Vec embedding technique for words, learning embeddings of more general-purpose
objects such as sentences, customers, and products.
It creates low-dimensional dense embeddings of high-dimensional objects; the embeddings
represent how similar objects are to each other.
Algorithm
Process data into JSON Lines and shuffle it
Train with two input channels, two encoders, and a comparator
Encoder choices:
- Average-pooled embeddings
- CNNs
- Bidirectional LSTM
The comparator is followed by a feed-forward neural network, which generates the final label
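The encoder/comparator idea can be sketched in plain Python. Both pieces below are illustrative stand-ins: the real encoders and comparator are learned, configurable components.

```python
def avg_pool(token_vectors):
    """Average-pooled embedding encoder: the mean of the token vectors."""
    n = len(token_vectors)
    return [sum(v[k] for v in token_vectors) / n for k in range(len(token_vectors[0]))]

def comparator(enc0, enc1):
    """One common comparator choice: elementwise product plus absolute
    difference, which a feed-forward layer would then map to a label.
    (Illustrative; the actual comparator is configurable.)"""
    prod = [a * b for a, b in zip(enc0, enc1)]
    diff = [abs(a - b) for a, b in zip(enc0, enc1)]
    return prod + diff

e0 = avg_pool([[1.0, 0.0], [0.0, 1.0]])  # encoder 0 output
e1 = avg_pool([[1.0, 1.0]])              # encoder 1 output
print(comparator(e0, e1))                # features fed to the feed-forward net
```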
Hyperparameters
Parameter Name Description
enc0_network, enc1_network The network model for the enc0 and enc1 encoders.
Optional
Valid values: hcnn, bilstm, or pooled_embedding
hcnn: A hierarchical convolutional neural network.
bilstm: A bidirectional long short-term memory network
(biLSTM), in which the signal propagates backward and forward in time.
Input Formats
Data must be tokenized into integers
Training data consists of pairs of tokens and/or sequences of tokens
- Sentence – sentence
- Labels – sequence (genre to description?)
- Customer – customer
- Product – product
- User – item
Instance Types
Can only train on a single machine (CPU or GPU, multi-GPU OK)
- ml.m5.2xlarge
- ml.p2.xlarge
- If needed, go up to ml.m5.4xlarge or ml.m5.12xlarge
3.2.14 Neural Topic Model (NTM)
NOTE: Remember word representations in NLP; this time it is a document
representation, not words. The topics are inferred from the observed word
distributions in the corpus. The words define the direction of the document.
As this is a latent representation:
- Used to find similar documents in the topic space
- Input to another supervised algorithm such as a document classifier
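Finding similar documents in the topic space typically means comparing topic-mixture vectors, for example with cosine similarity (the topic mixtures below are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-mixture vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 3-topic mixtures for three documents
doc_a = [0.8, 0.1, 0.1]
doc_b = [0.7, 0.2, 0.1]
doc_c = [0.1, 0.1, 0.8]
print(cosine(doc_a, doc_b) > cosine(doc_a, doc_c))  # a and b share a dominant topic
```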
Because the method is unsupervised, only the number of topics, not the topics themselves,
is pre-specified. The generated topics are not human-readable topic names.
Lowering “mini_batch_size” and “learning_rate” can reduce validation loss at expense of
training time
NOTE: Although you can use both the Amazon SageMaker NTM and LDA
algorithms for topic modeling, they are distinct algorithms and can be expected to
produce different results on the same input data.
Hyperparameters
Parameter Name Description
mini_batch_size The number of examples in each mini batch.
Optional
Input Formats
Four data channels
- “train” is required
- “validation”, “test”, and “auxiliary” optional
recordIO-protobuf or CSV
Words must be tokenized into integers
- Every document must contain a count for every word in the vocabulary in CSV
- The “auxiliary” channel is for the vocabulary
File or pipe mode
Instance Types
- GPU or CPU
- GPU recommended for training
- CPU for inference
3.2.15 LDA (Latent Dirichlet Allocation)
NOTE: Linear Discriminant Analysis (also abbreviated LDA) is a different technique; it can help
reduce dimensionality, but it also transforms the features, so you cannot recognize the
transformed features.
Hyperparameters
Parameter Name Description
num_topics The number of topics for LDA to find within the data.
Required
Valid values: Positive integer
alpha0 Initial guess for the concentration parameter.
Smaller values generate sparse topic mixtures; larger values (>1.0) produce uniform mixtures.
Optional
Valid values: Positive float
Default value: 1.0
max_iterations The maximum number of iterations to perform during the ALS phase of
the algorithm. Can be used to find better quality minima at the expense of
additional computation, but typically should not be adjusted.
Optional
Valid values: Positive integer
Default value: 1000
tol Target error tolerance for the ALS phase of the algorithm. Can be used to
find better quality minima at the expense of additional computation, but
typically should not be adjusted.
Optional
Valid values: Positive float
Default value: 1e-8
max_restarts The number of restarts to perform during the Alternating Least Squares
(ALS) spectral decomposition phase of the algorithm. Can be used to find
better quality local minima at the expense of additional computation, but
typically should not be adjusted.
Optional
Valid values: Positive integer
Default value: 10
Input Formats
Train channel; optional test channel, as this is an unsupervised algorithm.
RecordIO-protobuf or CSV
- We need to tokenize the data first. Every document has counts for every word in the
vocabulary, so we should pass a list of tokens (integers that represent each word)
and how often each word occurs in each individual document, not the documents themselves.
Each document has counts for every word in the vocabulary (in CSV format)
Pipe mode only supported with RecordIO
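Turning raw documents into the token/count representation described above can be sketched as follows (the vocabulary and helper function are illustrative):

```python
from collections import Counter

def to_token_counts(documents, vocab):
    """Each document becomes (token_id, count) pairs -- the tokens and their
    frequencies, not the raw text."""
    rows = []
    for doc in documents:
        counts = Counter(vocab[w] for w in doc.split() if w in vocab)
        rows.append(sorted(counts.items()))
    return rows

vocab = {"cat": 0, "dog": 1, "fish": 2}
print(to_token_counts(["cat cat dog", "fish"], vocab))
# [[(0, 2), (1, 1)], [(2, 1)]]
```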
Instance Types
Single-instance CPU training
3.2.16 DeepAR
Forecasting one-dimensional time series data, for example future stock prices
Uses RNNs
Classical forecasting methods, such as autoregressive integrated moving average (ARIMA)
or exponential smoothing (ETS), fit a single model to each individual time series.
Allows you to train the same model over several related time series
- If you have many time series that are somehow interdependent, it can learn from
the relationships between those time series to create a better model for
predicting any individual time series.
- For example, you might have time series groupings for demand for different
products, server loads, and requests for webpages. For this type of application, you
can benefit from training a single model jointly over all of the time series.
Use entire dataset as training set, remove last time points for testing. Evaluate on withheld
values.
Don’t use very large values for prediction length (> 400 datapoints)
Train on many time series and not just one when possible
Each training example consists of a pair of adjacent context and prediction windows with
fixed predefined lengths. To control how far in the past the network can see, use
the context_length hyperparameter. To control how far in the future predictions can be
made, use the prediction_length hyperparameter.
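Generating these adjacent context/prediction windows can be sketched as follows (a simplification; in practice DeepAR samples windows from the series rather than enumerating all of them):

```python
def training_windows(series, context_length, prediction_length):
    """Pairs of adjacent context and prediction windows with fixed lengths."""
    examples = []
    total = context_length + prediction_length
    for start in range(len(series) - total + 1):
        context = series[start:start + context_length]
        target = series[start + context_length:start + total]
        examples.append((context, target))
    return examples

series = [10, 11, 12, 13, 14, 15]
for ctx, tgt in training_windows(series, context_length=3, prediction_length=2):
    print(ctx, "->", tgt)
```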
Hyperparameters
Parameter Name Description
context_length The number of time-points that the model gets to see before making the
prediction. The value for this parameter should be about the same as
the prediction_length. The model also receives lagged inputs from the target,
so context_length can be much smaller than typical seasonalities. For
example, a daily time series can have yearly seasonality. The model
automatically includes a lag of one year, so the context length can be shorter
than a year. The lag values that the model picks depend on the frequency of
the time series. For example, lag values for daily frequency are previous week,
2 weeks, 3 weeks, 4 weeks, and year.
Required
Valid values: Positive integer
epochs The maximum number of passes over the training data. The optimal value
depends on your data size and learning rate. See
also early_stopping_patience. Typical values range from 10 to 1000.
Required
Valid values: Positive integer
mini_batch_size The size of mini-batches used during training. Typical values range from 32 to
512.
Optional
Valid values: positive integer
Default value: 128
learning_rate The learning rate used in training. Typical values range from 1e-4 to 1e-1.
Optional
Valid values: float
Default value: 1e-3
num_cells The number of cells to use in each hidden layer of the RNN. Typical values
range from 30 to 100.
Optional
Valid values: positive integer
num_layers The number of hidden layers in the RNN. Typical values range from 1 to 4.
Optional
Valid values: positive integer
Default value: 2
prediction_length The number of time-steps that the model is trained to predict, also called the
forecast horizon. The trained model always generates forecasts with this
length; it can't generate longer forecasts.
Required
Valid values: Positive integer
Input Formats
JSON lines format
- Gzip or Parquet
Each record must contain:
- start: the starting timestamp
- target: the time series values
Each record can contain:
- dynamic_feat: dynamic features (such as whether a promotion was applied to a product in a
time series of product purchases)
- cat: categorical features
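Building one such JSON Lines record can be sketched as follows (the field values are hypothetical):

```python
import json

def deepar_record(start, target, cat=None, dynamic_feat=None):
    """One JSON Lines record with the field names described above."""
    record = {"start": start, "target": target}
    if cat is not None:
        record["cat"] = cat
    if dynamic_feat is not None:
        record["dynamic_feat"] = dynamic_feat
    return json.dumps(record)

line = deepar_record("2024-01-01 00:00:00", [5, 7, 6, 9],
                     cat=[0], dynamic_feat=[[0, 1, 1, 0]])  # promotion flags (hypothetical)
print(line)
```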
Instance Types
Can use CPU or GPU
Single or multi machine while training
Start with CPU (ml.c4.2xlarge, ml.c4.4xlarge)
Move up to GPU if necessary
- Only helps with larger models
May need larger instances for tuning when doing hyperparameter tuning job
CPU-only for inference
3.2.17 Random Cut Forest (RCF)
RCF scales well with respect to number of features, data set size, and number of instances.
Algorithm
The main idea behind the RCF algorithm is to create a forest of trees where each tree is
obtained using a partition of a sample of the training data.
For example, a random sample of the input data is first determined.
The random sample is then partitioned according to the number of trees in the forest.
Each tree is given such a partition and organizes that subset of points into a k-d tree.
During inference, the data point is added to the tree structure as if it were used for
training. The anomaly score is calculated from the change in the tree structure that happens
due to the addition of this data point.
If the data point is added as a leaf deep in the tree, the anomaly score will be low, but if
adding the data point creates a branch close to the root (at a small height or depth), the
anomaly score will be high.
That is why we say the anomaly score is "the expected change in complexity of the tree as a result
of adding that point to the tree; which, in approximation, is inversely proportional to the
resulting depth of the point in the tree".
The random cut forest assigns an anomaly score by computing the average score from
each constituent tree and scaling the result with respect to the sample size.
The RCF algorithm organizes these data in a tree by first computing a bounding box of the data,
selecting a random dimension (giving more weight to dimensions with higher "variance"), and
then randomly determining the position of a hyperplane "cut" through that dimension. The two
resulting subspaces define their own subtree. In this example, the cut happens to separate a
lone point from the remainder of the sample. The first level of the resulting binary tree consists
of two nodes, one which will consist of the subtree of points to the left of the initial cut and the
other representing the single point on the right.
Step 3: Inference
When performing inference using a trained RCF model the final anomaly score is reported as the
average across scores reported by each tree.
Note that it is often the case that the new data point does not already reside in the tree. To
determine the score associated with the new point the data point is inserted into the given tree
and the tree is efficiently (and temporarily) reassembled in a manner equivalent to the training
process described above.
That is, the resulting tree is as if the input data point were a member of the sample used to
construct the tree in the first place. The reported score is inversely proportional to the depth of
the input point within the tree.
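The depth intuition can be illustrated with a toy one-dimensional "random cut" isolation. This is a simplification of the idea, not the actual RCF tree construction:

```python
import random

def isolation_depth(point, sample, rng, max_depth=20):
    """Depth at which random 1-D cuts isolate `point` from the sample.
    Outliers tend to be isolated at shallow depth, and the anomaly score is
    (approximately) inversely proportional to this depth."""
    depth = 0
    current = list(sample)
    while len(current) > 0 and depth < max_depth:
        lo = min(current + [point])
        hi = max(current + [point])
        if lo == hi:
            break
        cut = rng.uniform(lo, hi)
        # keep only the points on the same side of the cut as `point`
        side = [x for x in current if (x <= cut) == (point <= cut)]
        depth += 1
        if not side:
            break  # the cut separated the point from everything else
        current = side
    return depth

rng = random.Random(42)
cluster = [10.0 + i * 0.1 for i in range(50)]
d_outlier = isolation_depth(100.0, cluster, rng)
d_inlier = isolation_depth(12.0, cluster, rng)
print(d_outlier, d_inlier)  # the outlier isolates in fewer cuts
```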
Hyperparameters
Parameter Name Description
num_trees The number of trees in the forest. Increasing this reduces noise.
Optional
Valid values: Positive integer (min: 50, max: 1000)
Default value: 100
Input Formats
RecordIO-protobuf or CSV
Can use File or Pipe mode on either
Optional test channel for computing accuracy, precision, recall, and F1 on labeled data
(anomaly or not)
Instance Types
Does not take advantage of GPUs
Use M4, C4, or C5 for training
ml.c5.xl for inference
Neural Collaborative Filtering (NCF)
Conventional MF solutions exploit explicit feedback in a linear fashion; explicit feedback consists
of direct user preferences, such as ratings for movies on a five-star scale or binary preference on
a product (like or not like). However, explicit feedback isn’t always present in datasets.
NCF solves the absence of explicit feedback by using only implicit feedback, which is derived from
user activity, such as clicks and views. In addition, NCF uses a multi-layer perceptron to introduce
non-linearity into the solution.
Architecture
An NCF model contains two intrinsic sets of network layers: embedding and NCF layers. You use
these layers to build a neural matrix factorization solution with two separate network
architectures, generalized matrix factorization (GMF) and multi-layer perceptron (MLP), whose
outputs are then concatenated as input for the final output layer.
ML implementation and Operations SageMaker
SageMaker Notebooks
Notebook Instances on EC2 are spun up from the console
S3 data access
Scikit-learn, Spark, TensorFlow
Wide variety of built-in models
Ability to spin up training instances
Ability to deploy trained models for making predictions at scale
SageMaker Console
Less flexible than notebooks, because in notebooks you can write arbitrary code.
SageMaker functions:
- Kick off training jobs
- Kick off hyperparameter tuning job
- End point configuration
- Create end points
Data Preparation
Data must come from S3. The ideal format varies with the algorithm; often it is
RecordIO-protobuf.
Apache Spark integrates with SageMaker
Training on SageMaker
Create a training job
- URL of S3 bucket with training data
- ML compute resources
- URL of S3 bucket for output
- ECR path to training code
Training options
- Built-in training algorithms
- Spark MLLib
- Custom Python TensorFlow / MXNet code
- Your own Docker image
- Algorithm purchased from AWS marketplace
- OutputDataConfig
Specifies the path to the S3 location where you want to store model artifacts.
Amazon SageMaker creates subfolders for the artifacts.
- ResourceConfig
The resources, including the ML compute instances and ML storage volumes, to use
for model training.
- RoleArn
The Amazon Resource Name (ARN) of an IAM role that Amazon SageMaker can
assume to perform tasks on your behalf.
- StoppingCondition
Specifies a limit to how long a model training job can run. It also specifies how long a
managed Spot training job has to complete. When the job reaches the time limit,
Amazon SageMaker ends the training job. Use this API to cap model training costs.
- TrainingJobName
The name of the training job. The name must be unique within an AWS Region in an
AWS account.
NOTE: The input path is not mandatory, as the training data path could be local on the
training machine.
Best Practice
Don’t optimize too many hyperparameters at once
Limit your ranges to as small a range as possible
Use logarithmic scales when appropriate, for example if the hyperparameter values range
from 0.001 to 0.1
Don’t run too many training jobs concurrently
- This limits how well the process can learn as it goes
Make sure training jobs running on multiple instances report the correct objective metric
at the end, i.e., after all the instances finish their processing.
You can deploy the containers by passing the full container URI to their respective
SageMaker SDK Estimator class.
Script Mode
SageMaker offers a solution using script mode. Script mode enables you to write custom training
and inference code while still utilizing common ML framework containers maintained by AWS.
Script mode is easy to use and extremely flexible.
Local Mode
Amazon SageMaker Python SDK supports local mode, which allows you to create estimators and
deploy them to your local environment. This is a great way to test your deep learning scripts
before running them in SageMaker’s managed training or hosting environments. Local Mode is
supported for frameworks images (TensorFlow, MXNet, Chainer, PyTorch, and Scikit-Learn) and
images you supply yourself.
The Amazon SageMaker deep learning containers have recently been open sourced, which
means you can pull the containers into your working environment and use custom code built into
the Amazon SageMaker Python SDK to test your algorithm locally, just by changing a single line of
code. This means that you can iterate and test your work without having to wait for a new
training or hosting cluster to be built each time.
The Amazon SageMaker local mode allows you to switch seamlessly between local and
distributed, managed training by simply changing one line of code. Everything else works the
same.
The local mode in the Amazon SageMaker Python SDK can emulate CPU (single and multi-
instance) and GPU (single instance) SageMaker training jobs by changing a single argument in the
TensorFlow, PyTorch or MXNet estimators. To do this, it uses Docker compose and NVIDIA
Docker. It will also pull the Amazon SageMaker TensorFlow, PyTorch or MXNet containers from
Amazon ECR, so you'll need to be able to access a public Amazon ECR repository from your local
environment.
predictor.py: the program that implements a Flask web server for making predictions
at runtime. Customize this code for your application.
serve: the program that starts when the container is started for hosting. It starts the
Gunicorn server, which runs multiple instances of the Flask application defined in predictor.py.
train: the program that starts when you run the container for training.
wsgi.py: a small wrapper that invokes your Flask application.
To adapt your container to work with SageMaker hosting, create the inference code in one or
more Python script files and a Docker file that imports the inference toolkit.
The inference code includes an inference handler, a handler service, and an entrypoint. In this
example, they are stored as three separate Python files. All three of these Python files must be in
the same directory as your Dockerfile.
Step 1: Create an Inference Handler
The SageMaker inference toolkit is built on the multi-model server (MMS). MMS expects a
Python script that implements functions to load the model, pre-process input data, get
predictions from the model, and process the output data in a model handler.
The model_fn Function
The model_fn function is responsible for loading your model. It takes a model_dir argument that
specifies where the model is stored.
def model_fn(self, model_dir)
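A framework-free sketch of such a handler, following the model_fn / input_fn / predict_fn / output_fn convention. The "model" below is a stand-in dict, not a real artifact, and the linear scoring is purely illustrative:

```python
import json

class SketchHandler:
    """Framework-free sketch of an inference handler in the style the
    SageMaker inference toolkit expects. All model details are stand-ins."""

    def model_fn(self, model_dir):
        # Real code would deserialize a model artifact from model_dir.
        return {"weights": [2.0, -1.0], "bias": 0.5}

    def input_fn(self, request_body, content_type="application/json"):
        if content_type != "application/json":
            raise ValueError("unsupported content type: " + content_type)
        return json.loads(request_body)["features"]

    def predict_fn(self, features, model):
        score = sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]
        return {"score": score}

    def output_fn(self, prediction, accept="application/json"):
        return json.dumps(prediction)

handler = SketchHandler()
model = handler.model_fn("/opt/ml/model")  # the path where SageMaker mounts artifacts
features = handler.input_fn('{"features": [1.0, 2.0]}')
print(handler.output_fn(handler.predict_fn(features, model)))
```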
This section explains how SageMaker makes training information, such as training data,
hyperparameters, and other configuration information, available to your Docker container.
When you send a CreateTrainingJob request to SageMaker to start model training, you specify
the Amazon Elastic Container Registry path of the Docker image that contains the training
algorithm. You also specify the Amazon Simple Storage Service (Amazon S3) location where
training data is stored and algorithm-specific parameters. SageMaker makes this information
available to the Docker container so that your training algorithm can use it. For information about
creating a training job, see CreateTrainingJob.
Hyperparameters
SageMaker makes the hyperparameters in a CreateTrainingJob request available in the Docker
container in the /opt/ml/input/config/hyperparameters.json file.
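Because SageMaker serializes hyperparameter values as strings in that file, training code typically casts them after reading. A minimal sketch (the sample values are illustrative):

```python
import json

# SageMaker writes hyperparameters as *strings*, so the training code must
# cast them. This parses a sample of what
# /opt/ml/input/config/hyperparameters.json might contain.
raw = '{"epochs": "10", "learning_rate": "0.001", "mode": "supervised"}'

hp = json.loads(raw)
epochs = int(hp["epochs"])
learning_rate = float(hp["learning_rate"])
mode = hp["mode"]
print(epochs, learning_rate, mode)
```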
Environment Variables
The following environment variables are set in the container:
TRAINING_JOB_NAME – Specified in the TrainingJobName parameter of the CreateTrainingJob
request.
TRAINING_JOB_ARN – The Amazon Resource Name (ARN) of the training job returned as the
TrainingJobArn in the CreateTrainingJob response.
For example, suppose that you specify three data channels (train, evaluation, and validation) in
your request. SageMaker provides the following JSON:
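For three channels, /opt/ml/input/config/inputdataconfig.json looks roughly like the following (keys and values are illustrative; see the CreateTrainingJob documentation for the authoritative schema):

```json
{
  "train":      {"ContentType": "application/x-recordio-protobuf", "TrainingInputMode": "File"},
  "evaluation": {"ContentType": "application/x-recordio-protobuf", "TrainingInputMode": "File"},
  "validation": {"ContentType": "application/x-recordio-protobuf", "TrainingInputMode": "File"}
}
```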
Training Data
The TrainingInputMode parameter in a CreateTrainingJob request specifies how to make data
available for model training: in FILE mode or PIPE mode. Depending on the specified input mode,
SageMaker does the following:
FILE mode—SageMaker makes the data for the channel available in the
/opt/ml/input/data/channel_name directory in the Docker container. For example, if you have
three channels named training, validation, and testing, SageMaker makes three directories in the
Docker container:
/opt/ml/input/data/training
/opt/ml/input/data/validation
/opt/ml/input/data/testing
PIPE mode—SageMaker makes data for the channel available from the named pipe:
/opt/ml/input/data/channel_name_epoch_number.
To enable inter-container communication, this JSON file contains information for all containers.
SageMaker makes this file available for both FILE and PIPE mode algorithms. The file provides the
following information:
current_host—The name of the current container on the container network. For example, algo-1.
Host values can change at any time. Don't write code with specific values for this variable.
hosts—The list of names of all containers on the container network, sorted lexicographically. For
example, ["algo-1", "algo-2", "algo-3"] for a three-node cluster. Containers can use these names
to address other containers on the container network. Host values can change at any time. Don't
write code with specific values for these variables.
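For a three-node cluster, this file (resourceconfig.json) might look roughly like the following (values are illustrative):

```json
{
  "current_host": "algo-1",
  "hosts": ["algo-1", "algo-2", "algo-3"],
  "network_interface_name": "eth0"
}
```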
1. Train
a. Preparing Training Script
The training script is very similar to a training script you might run outside of SageMaker, but
you can access useful properties about the training environment through various
environment variables.
SM_MODEL_DIR:
A string that represents the local path where the training job writes the model artifacts to.
After training, artifacts in this directory are uploaded to S3 for model hosting. This is
different than the model_dir argument passed in your training script, which can be an S3
location. SM_MODEL_DIR is always set to /opt/ml/model.
SM_NUM_GPUS:
An integer representing the number of GPUs available to the host.
SM_OUTPUT_DATA_DIR:
A string that represents the path to the directory to write output artifacts to. Output
artifacts might include checkpoints, graphs, and other files to save, but do not include
model artifacts. These artifacts are compressed and uploaded to S3 to an S3 bucket with
the same prefix as the model artifacts.
SM_CHANNEL_XXXX:
A string that represents the path to the directory that contains the input data for the
specified channel. For example, if you specify two input channels in the TensorFlow
estimator’s fit call, named ‘train’ and ‘test’, the environment variables
SM_CHANNEL_TRAIN and SM_CHANNEL_TEST are set.
A typical training script loads data from the input channels, configures training with
hyperparameters, trains a model, and saves a model to SM_MODEL_DIR so that it can be
deployed for inference later. Hyperparameters are passed to your script as arguments and
can be retrieved with an argparse.ArgumentParser instance.
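A minimal skeleton of such a script might look like this. Argument names and defaults are illustrative; the explicit `env` parameter is only there so the sketch can run outside a real training container:

```python
import argparse
import os

def parse_args(argv=None, env=os.environ):
    """Hyperparameters arrive as command-line arguments; data and model
    locations arrive as SM_* environment variables (the fallback paths
    below are the documented container defaults)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--model-dir",
                        default=env.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train",
                        default=env.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    return parser.parse_args(argv)

args = parse_args(["--epochs", "5"], env={"SM_MODEL_DIR": "/tmp/model"})
print(args.epochs, args.learning_rate, args.model_dir)
```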
b. Adapting your local TensorFlow script
c. Use third-party libraries
d. Create an Estimator
2. Deploy to a SageMaker Endpoint
a. Deploying from an Estimator
After a TensorFlow estimator has been fit, it saves a TensorFlow SavedModel bundle
in the S3 location defined by output_path. You can call deploy on a TensorFlow
estimator object to create a SageMaker Endpoint.
How it works?
Neo consists of a compiler and a runtime.
First, the Neo compilation API reads models exported from various frameworks. It converts the
framework-specific functions and operations into a framework-agnostic intermediate
representation.
Next, it performs a series of optimizations.
Then, it generates binary code for the optimized operations, writes it to a shared object
library, and saves the model definition and parameters into separate files.
Neo also provides a runtime for each target platform that loads and executes the
compiled, optimized model.
Set up user accounts for AWS and make sure these user accounts have only the permissions
they need.
Restrict the permissions of the different services that talk to each other. For
example, set a permission on a SageMaker notebook for S3 access.
Permissions:
- Create Training Job - Create Model
- Create Endpoint configuration - Create Transform Job
- Create Hyperparameter Tuning - Create Notebook Instance
- Update Notebook instance
Policies:
- AmazonSageMakerReadOnly
- AmazonSageMakerFullAccess
- AdministratorAccess
- DataScientist
SSL/TLS Connection
Use SSL/TLS for all connections between servers.
Connections to EMR can't use SSL/TLS.
CloudTrail
Use CloudTrail to log any activity against the APIs that you are using. You will be able to see
what is happening, when, and who did it.
Encryption
Use encryption whenever appropriate, especially with Personally Identifiable Information
(PII).
If you are sending data like names, emails, addresses or credentials, make sure to encrypt
the data at rest and in transit.
Encryption at rest
S3 Encryption
- You can use S3 encryption for training data and hosting models.
- S3 can also use KMS to encrypt the data.
Encryption in transit
- Basically all traffic supports TLS/SSL in SageMaker.
- IAM Roles can be used to give permissions to access specific resources.
- Inter-node training (in the case of training on multiple servers) may optionally
encrypt data transferred between nodes.
Can increase training time and cost
Enabled via console or API when setting up a training or tuning job
Deep Learning can be trained on multiple nodes.
Instances Properties
P2 Instances
- High frequency Intel Xeon E5-2686 v4 (Broadwell) processors
- High-performance NVIDIA K80 GPUs, each with 2,496 parallel processing
cores and 12GiB of GPU memory
- Supports GPUDirect™ for peer-to-peer GPU communications
- Provides Enhanced Networking using Elastic Network Adapter (ENA) with up
to 25 Gbps of aggregate network bandwidth within a Placement Group
- EBS-optimized by default at no additional cost
P3 Instances
- Up to 8 NVIDIA Tesla V100 GPUs, each pairing 5,120 CUDA Cores and 640
Tensor Cores
- High frequency Intel Xeon E5-2686 v4 (Broadwell) processors for p3.2xlarge,
p3.8xlarge, and p3.16xlarge.
- High frequency 2.5 GHz (base) Intel Xeon 8175M processors for
p3dn.24xlarge.
- Supports NVLink for peer-to-peer GPU communication
- Provides up to 100 Gbps of aggregate network bandwidth.
- EFA support on p3dn.24xlarge instances
G3 Instances
- High frequency Intel Xeon E5-2686 v4 (Broadwell) processors
- NVIDIA Tesla M60 GPUs, each with 2048 parallel processing cores and 8 GiB of
video memory
- Enables NVIDIA GRID Virtual Workstation features, including support for 4
monitors with resolutions up to 4096x2160. Each GPU included in your
instance is licensed for one "Concurrent Connected User".
- Target Tracking Scaling
With a target tracking scaling policy, you choose a scaling metric and set a
target value. In addition to keeping the metric close to the target value, a target
tracking scaling policy also adjusts to changes in the metric due to a changing load
pattern.
- Step Scaling
You choose scaling metrics and threshold values for the CloudWatch alarms that
trigger the scaling process as well as define how your scalable target should be
scaled when a threshold is in breach for a specified number of evaluation periods.
Step scaling policies increase or decrease the current capacity of a scalable target
based on a set of scaling adjustments, known as step adjustments.
Step adjustments
When you create a step scaling policy, you add one or more step adjustments that
enable you to scale based on the size of the alarm breach. Each step adjustment
specifies the following:
- A lower bound for the metric value
- An upper bound for the metric value
- The amount by which to scale, based on the scaling adjustment type
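As a sketch, these step adjustments map onto an Application Auto Scaling PutScalingPolicy request for a SageMaker production variant (endpoint and variant names below are hypothetical; the policy is shown as a plain dict):

```python
# Build a PutScalingPolicy request for Application Auto Scaling that scales a
# SageMaker production variant in steps, based on how far the alarm metric
# breaches the threshold. Endpoint and variant names are hypothetical.
def step_scaling_policy(endpoint, variant):
    return {
        "PolicyName": "invocations-step-scaling",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "StepScaling",
        "StepScalingPolicyConfiguration": {
            "AdjustmentType": "ChangeInCapacity",
            "StepAdjustments": [
                # breach of 0-100 above the threshold: add 1 instance
                {"MetricIntervalLowerBound": 0.0,
                 "MetricIntervalUpperBound": 100.0,
                 "ScalingAdjustment": 1},
                # breach of more than 100 above the threshold: add 3 instances
                {"MetricIntervalLowerBound": 100.0,
                 "ScalingAdjustment": 3},
            ],
            "Cooldown": 300,
        },
    }

policy = step_scaling_policy("my-endpoint", "variant-1")
# boto3.client("application-autoscaling").put_scaling_policy(**policy)
```

Each step adjustment pairs a lower/upper bound on the alarm breach with a scaling amount, exactly as described above.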
CloudWatch will monitor the performance of your inference nodes and scale them as
needed.
Dynamically adjusts the number of instances for a production variant according to the load
on that model.
Load test the scaling configuration before using it, so you can validate the configuration
before applying it in production.
SageMaker notebook instances use conda environments to implement different kernels for
Jupyter notebooks. If you want to install packages that are available to one or more notebook
kernels, enclose the commands to install the packages with conda environment commands
that activate the conda environment that contains the kernel where you want to install the
packages.
You can use a notebook instance created with a custom lifecycle configuration script to access
AWS services from your notebook. For example, you can create a script that lets you use your
notebook with Sparkmagic to control other AWS resources, such as an Amazon EMR instance.
You can then use the Amazon EMR instance to process your data instead of running the data
analysis on your notebook. This allows you to create a smaller notebook instance because you
won't use the instance to process data. This is helpful when you have large datasets that would
require a large notebook instance to process the data.
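Registering such a script uses the CreateNotebookInstanceLifecycleConfig API, with the shell script base64-encoded into the request. A minimal sketch (the configuration name and the script content are illustrative, not from the source):

```python
import base64

# Build a CreateNotebookInstanceLifecycleConfig request. The OnStart script
# must be base64-encoded; this example script is illustrative only.
ON_START = """#!/bin/bash
set -e
# install a package into the python3 conda kernel environment
source /home/ec2-user/anaconda3/bin/activate python3
pip install --quiet sparkmagic
source /home/ec2-user/anaconda3/bin/deactivate
"""

request = {
    "NotebookInstanceLifecycleConfigName": "install-sparkmagic",
    "OnStart": [{"Content": base64.b64encode(ON_START.encode()).decode()}],
}
# boto3.client("sagemaker").create_notebook_instance_lifecycle_config(**request)
```

An OnCreate script list can be supplied the same way; it runs only once, when the notebook instance is created.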
Amazon SageMaker periodically tests and releases software that is installed on notebook
instances. This includes:
Kernel updates
Security patches
AWS SDK updates
Amazon SageMaker Python SDK updates
Open source software updates
- Domain
SageMaker Studio domain consists of an associated Amazon EFS volume, list of
authorized users and a variety of security, application policy and VPC configuration.
- User profile
A user profile represents a single user within a domain.
- App
An app represents an application that supports the reading and execution
experience of the user's notebooks, terminals, and consoles.
An app can be a Jupyter server or a kernel gateway.
- AWS CloudTrail
Captures API calls and relevant events made by or on behalf of your AWS account and
delivers log files to S3. It can identify which users or accounts called AWS, from which
IP address, and when the call was made.
- CloudWatch
Collects raw data and processes it into readable, near real-time metrics.
NOTE: Multi-model endpoint: create an endpoint that can host multiple
models. It uses a shared serving container that is enabled to host
multiple models.
- Jobs and Endpoint Metrics
CPU Utilization, Memory Utilization, GPU Utilization, GPU Memory Utilization and
Disk Utilization.
- Pipeline Metrics
CloudWatch Logs
Help debug your processing jobs, training jobs, endpoints, transform jobs, notebooks,
notebook configurations, model containers, and algorithm containers. Anything a
component sends to stdout or stderr is also sent to CloudWatch.
CloudTrail
CloudTrail captures all API calls for SageMaker as events, with the exception of the invoke-
endpoint calls.
The calls captured include calls to SageMaker from the console and from code.
If you create a trail, CloudTrail events are delivered to S3; if not, you can still use Event
History in the console.
The data collected includes the IP address, who made the call, when, and additional details.
SageMaker supports automatically logging non-API service events to CloudTrail files for
model tuning jobs, including hyperparameter tuning jobs. This helps you improve
governance, compliance, and operational and risk auditing.
Autopilot Workflow
You choose the data location in S3, and Autopilot loads the data from S3 for training.
You select the target column.
Autopilot automatically creates a model.
A notebook is available for visibility and control.
A model leaderboard ranks the recommended models by test results.
Deploy and monitor the new model.
Refine the notebook if needed.
Autopilot Features
Autopilot allows adding human guidance.
Problem types: binary classification, multiclass classification, and regression.
Algorithm types: Linear Learner, XGBoost, and deep learning (Multilayer Perceptron, MLP).
Data must be tabular.
Autopilot explainability explains how models make predictions using a feature-attribution
approach via SageMaker Clarify. It generates a report indicating the importance of each
feature for the best candidate. This explainability functionality can make the ML model
more understandable to AWS customers:
The governance report can be used to inform risk and compliance teams and
external regulators.
Transparency about how the model arrives at its predictions.
Feature attribution:
- Uses SHAP baselines/Shapley values
- Research from cooperative game theory
- Assigns each feature an importance value for a given prediction
For example:
In a model that approves home loans, if race turns out to be a strong feature,
something is wrong, and you can go back and take a look at the bias that might
exist in your source data.
- The baseline computes baseline schema constraints and statistics for each
feature using Deequ, an open-source library built on Spark that measures data
quality in large datasets.
3. Define and schedule data quality monitoring jobs.
4. View data quality metrics with CloudWatch.
5. Interpret the results of monitoring jobs.
6. Use SageMaker Studio to enable data quality monitoring and visualize results.
Model Monitor can integrate with TensorBoard, QuickSight, and SageMaker Studio.
How it works
Edge Manager has five main components:
- Compiling: compile the model with SageMaker Neo
- Packaging: package the Neo-compiled models
- Deploying: deploy models to devices
- Agent: run the model for inference
- Maintaining: maintain models on devices
SageMaker Edge Manager can sample model input and output data from edge devices and
send it to the cloud for monitoring and analysis.
View dashboards that track and visually report on the operation of deployed models in the
SageMaker console.
In this way, developers can improve model quality by using SageMaker Model Monitor for
drift detection, then relabeling data using Ground Truth.
ML implementation and Operations AI Services
4.2 AI Services
4.2.1 Amazon Comprehend
Comprehensive natural language processing (NLP) service.
Natural language processing and text analysis.
Input can be any text: social media, web pages, documents, transcripts, and medical records
(Comprehend Medical).
Can be trained on your data, or used out of the box with its pre-trained models.
Extracts key phrases, entities, sentiment, language, syntax, topics, and document
classification.
Entities
It can detect and extract entities from text, e.g. Amazon Inc.
It can also detect and extract person names, dates, and locations with a confidence
score.
Key phrases
It can extract important phrases in sentences with a confidence score.
Language
It can detect the language of the text.
Sentiment Analysis
Categorizes text as neutral, positive, negative, or mixed.
Syntax
Detects nouns, verbs, and punctuation.
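As a sketch of consuming Comprehend output, here is a hypothetical helper that filters a DetectEntities-style response by confidence score. The response shape (Entities with Text/Type/Score) matches the Comprehend API; the sample data below is made up:

```python
# Filter a DetectEntities-style response down to high-confidence entities.
# In practice the response would come from
# boto3.client("comprehend").detect_entities(Text=..., LanguageCode="en");
# the sample response below is fabricated for illustration.
def confident_entities(response, min_score=0.9):
    return [(e["Text"], e["Type"]) for e in response["Entities"]
            if e["Score"] >= min_score]

sample = {"Entities": [
    {"Text": "Amazon Inc.", "Type": "ORGANIZATION", "Score": 0.99},
    {"Text": "Seattle", "Type": "LOCATION", "Score": 0.97},
    {"Text": "Monday", "Type": "DATE", "Score": 0.55},
]}
print(confident_entities(sample))
# [('Amazon Inc.', 'ORGANIZATION'), ('Seattle', 'LOCATION')]
```

Each entity carries a Type (e.g. ORGANIZATION, LOCATION, DATE) alongside its confidence Score, which is why thresholding is a natural first post-processing step.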
Amazon Forecast can increase your forecasting accuracy by automatically ingesting local
weather information into your demand forecasts.
Use cases: inventory planning, financial planning, and resource planning.
How it works
Datasets: collections of your input data.
Dataset groups: collections of datasets that contain complementary information.
Predictors: custom models trained on your data.
Forecasts: you can generate forecasts for your time series data and query them using the
Forecast API.
How it works
Utterances invoke intents, e.g. "I want a pizza."
Lambda functions are invoked to fulfill intents.
Slots specify extra information needed by the intent, e.g. "What size?", "What toppings?",
and "Do you need crust?"
2. Import data
You import item, user, and interaction records into Amazon Personalize datasets. You can choose
to import records in bulk, or incrementally, or both. With incremental imports, you can add one
or more historical records or import data from real-time user activity.
The data that you import depends on your use case. For information about the types of data that
you can import, see Datasets and schemas and the sections on each dataset type (Interactions
dataset, Items dataset, Users dataset).
3. Train a model
After you've imported your data, Amazon Personalize uses it to train a model. In Amazon
Personalize, you start training by creating a solution, where you specify your use case by choosing
an Amazon Personalize recipe. Then you create a solution version, which is the trained model that
Amazon Personalize uses to generate recommendations.
5. Get recommendations
Get recommendations in real-time or as part of a batch workflow with purely historical data. Get
real-time recommendations when you want to update recommendations as customers use your
application. Get batch recommendations when you do not require real-time updates.
you created. For information on creating a dataset group and a dataset, see Preparing and
importing data.
- An event tracker.
- A call to the PutEvents operation.
You can start out with an empty Interactions dataset and, when you have recorded enough data,
train the model using only new recorded events. The minimum data requirements to train a
model are:
- 1000 records of combined interaction data (after filtering by eventType and
eventValueThreshold, if provided)
- 25 unique users with at least 2 interactions each
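Recording such interactions is done through the PutEvents operation. A sketch of building that request as a plain dict (the tracking ID, session ID, item ID, and user ID below are hypothetical placeholders):

```python
import time

# Build a PutEvents request for Amazon Personalize. Tracking ID, session ID,
# item ID, and user ID are hypothetical. The sessionId lets Personalize
# associate an anonymous user's events with their userId once they log in.
def put_events_request(tracking_id, session_id, item_id, user_id=None):
    req = {
        "trackingId": tracking_id,
        "sessionId": session_id,
        "eventList": [{
            "eventType": "click",          # must match your schema's event types
            "itemId": item_id,
            "sentAt": int(time.time()),    # client-side timestamp of the event
        }],
    }
    if user_id is not None:                # omitted for anonymous users
        req["userId"] = user_id
    return req

req = put_events_request("tracking-123", "session-abc", "item-42", "user-7")
# boto3.client("personalize-events").put_events(**req)
```

The tracking ID comes from the event tracker mentioned above; it ties recorded events to a dataset group.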
Instead, Amazon Personalize adds the new recorded event data to the user's history. Amazon
Personalize then uses the modified data when generating recommendations for the user (and
this user only).
- For recorded events for new items (items you did not include in the data you used to train
the model), if you trained your model (solution version) with the User-Personalization
recipe, Amazon Personalize automatically updates the model every two hours, and after
each update the new items influence recommendations. See User-Personalization recipe.
- For any other recipe, you must re-train the model for the new records to influence
recommendations. Amazon Personalize stores recorded events for new items and, once
you create a new solution version (train a new model), this new data will influence Amazon
Personalize recommendations for the user.
- For recorded events for new users (users that were not included in the data you used to
train the model), recommendations will initially be for popular items only.
Recommendations will be more relevant as you record more events for the user. Amazon
Personalize stores the new user data, so you can also retrain the model for more relevant
recommendations.
- For new, anonymous users (users without a userId), Amazon Personalize uses the sessionId
you pass in the PutEvents operation to associate events with the user before they log in.
This creates a continuous event history that includes events that occurred when the user
was anonymous.
4.2.11 DeepLens
A deep-learning-enabled video camera.
Integrated with SageMaker, Rekognition, TensorFlow, and MXNet.
You can use AWS IoT Greengrass to deploy a pre-trained model.
You can use SageMaker Neo.
Runs deep learning at the edge.
ML implementation and Operations AWS IoT for Predictive Maintenance
ML implementation and Operations Security
4.4 Security
4.4.1 PrivateLink
AWS PrivateLink is a highly available, scalable technology that enables you to privately connect
your VPC to:
Supported AWS services
Services hosted by other AWS accounts (VPC endpoint services)
Supported AWS Marketplace partner services.
You do not need any of the following to use PrivateLink:
internet gateway
NAT device
public IP address
AWS Direct Connect
AWS Site-to-Site VPN connection
Interface endpoints
An interface endpoint is an elastic network interface with a private IP address from the IP address
range of your subnet. It serves as an entry point for traffic destined to a supported AWS service
or a VPC endpoint service. Interface endpoints are powered by AWS PrivateLink.
Gateway endpoints
A gateway endpoint is for the following supported AWS services:
Amazon S3
DynamoDB
VPC endpoints for Amazon S3 provide two ways to control access to your Amazon S3 data:
You can control which VPCs or VPC endpoints have access to your buckets by using
Amazon S3 bucket policies.
- Restricting access to a specific VPC endpoint
- Restricting access to a specific VPC
You can control the requests, users, or groups that are allowed through a specific VPC
endpoint as in the next section.
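As a sketch of the first approach, a bucket policy can deny all access unless the request arrives through one specific VPC endpoint, using the aws:SourceVpce condition key (the bucket name and endpoint ID below are hypothetical):

```python
import json

# S3 bucket policy that denies all access to the bucket unless the request
# comes through one specific VPC endpoint. Bucket name and vpce ID are
# hypothetical placeholders.
def vpce_only_policy(bucket, vpce_id):
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyUnlessFromVpce",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
            # deny whenever the request did NOT arrive via this endpoint
            "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
        }],
    }

policy_json = json.dumps(vpce_only_policy("my-ml-bucket", "vpce-1a2b3c4d"))
# boto3.client("s3").put_bucket_policy(Bucket="my-ml-bucket", Policy=policy_json)
```

Restricting to a whole VPC instead would use the aws:SourceVpc condition key with the VPC ID.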
Although the term VPN connection is a general term, in this documentation, a VPN connection
refers to the connection between your VPC and your own on-premises network. Site-to-Site VPN
supports Internet Protocol security (IPsec) VPN connections.
must ensure that the rules for the security group allow communication between the endpoint
network interface and the resources in your VPC that communicate with the service.
Amazon SageMaker notebook instances can be launched with or without your Virtual Private
Cloud (VPC) attached. When launched with your VPC attached, the notebook can either be
configured with or without direct internet access.
Using the Amazon SageMaker console, these are the three options:
1. No customer VPC is attached.
No VPC configured and internet check box is checked
In this configuration, all the traffic goes through the single network interface. The notebook
instance is running in an Amazon SageMaker managed VPC as shown in the above diagram.
In this configuration, the notebook instance needs to decide which network traffic should go
down either of the two network interfaces.
NOTE: If SageMaker requests data from S3 and the bucket is encrypted, then
SageMaker's role must have permission to use the decryption key. Both this role
and the key must be configured for SageMaker.
Network Isolation
You can enable network isolation when you create your training job or model by setting the value
of the EnableNetworkIsolation parameter to true when you call CreateTrainingJob,
CreateHyperParameterTuningJob, or CreateModel.
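A minimal CreateTrainingJob request with that flag set can be sketched as a plain dict (the job name, ARNs, image URI, and S3 paths below are hypothetical placeholders; sending it requires a boto3 SageMaker client):

```python
# Minimal CreateTrainingJob request skeleton with network isolation turned on.
# ARNs, image URI, and S3 paths are hypothetical placeholders.
def isolated_training_job(name, role_arn, image_uri, output_s3):
    return {
        "TrainingJobName": name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {"TrainingImage": image_uri,
                                   "TrainingInputMode": "File"},
        "OutputDataConfig": {"S3OutputPath": output_s3},
        "ResourceConfig": {"InstanceType": "ml.m5.xlarge",
                           "InstanceCount": 1,
                           "VolumeSizeInGB": 50},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        "EnableNetworkIsolation": True,  # container gets no outbound network
    }

req = isolated_training_job(
    "isolated-job", "arn:aws:iam::111122223333:role/SageMakerRole",
    "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
    "s3://my-bucket/output/")
# boto3.client("sagemaker").create_training_job(**req)
```
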
If you enable network isolation, the containers can't make any outbound network calls, even to
other AWS services such as Amazon S3. Additionally, no AWS credentials are made available to
the container runtime environment. In the case of a training job with multiple instances, network
inbound and outbound traffic is limited to the peers of each training container. SageMaker still
performs download and upload operations against Amazon S3 using your SageMaker execution
role in isolation from the training or inference container.
The following managed SageMaker containers do not support network isolation because they
require access to Amazon S3:
- Chainer
- PyTorch
- Scikit-learn
- SageMaker Reinforcement Learning
Network isolation with a VPC
Network isolation can be used in conjunction with a VPC. In this scenario, the download and
upload of customer data and model artifacts are routed through your VPC subnet. However, the
training and inference containers themselves continue to be isolated from the network, and do
not have access to any resource within your VPC or on the internet.
AWS CodeArtifact is a fully managed artifact repository that makes it easy for organizations of
any size to securely store, publish, and share software packages used in your software
development process.
There are two main methods of implementing controls to improve the security of AWS services
during deployment. One of them is preventive and uses controls to stop an event from occurring.
The other is responsive, and uses controls that are applied in response to events.
Preventive controls protect workloads and mitigate threats and vulnerabilities. A couple of
approaches to implement preventive controls are:
Use IAM condition keys supported by the service to ensure that resources without
necessary security controls cannot be deployed.
Use the AWS Service Catalog to invoke AWS CloudFormation templates that deploy
resources with all the necessary security controls in place.
Use CloudWatch Events to catch resource creation events, then use a Lambda function to
validate that resources were deployed with the necessary security controls, or terminate
any resources if the necessary security controls aren't present.
Enabling inter-container traffic encryption can increase training time, especially if you are using
distributed deep learning algorithms. Enabling inter-container traffic encryption doesn't affect
training jobs with a single compute instance. However, for training jobs with several compute
instances, the effect on training time depends on the amount of communication between
compute instances. For affected algorithms, adding this additional level of security also increases
cost. The training time for most SageMaker built-in algorithms, such as XGBoost, DeepAR, and
linear learner, typically isn't affected.
You can enable inter-container traffic encryption for training jobs or hyperparameter tuning jobs.
You can use SageMaker APIs or console to enable inter-container traffic encryption.
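Via the API, this is the EnableInterContainerTrafficEncryption flag on the training job request. A fragment, with all names as placeholders and the other required fields (role, image, data config) omitted:

```python
# Fragment of a CreateTrainingJob request enabling encryption of inter-node
# traffic for distributed training. Names are hypothetical; the other
# required CreateTrainingJob fields are omitted for brevity.
request_fragment = {
    "TrainingJobName": "distributed-training-encrypted",
    "ResourceConfig": {"InstanceType": "ml.p3.8xlarge",
                       "InstanceCount": 4,   # multi-node: encryption matters here
                       "VolumeSizeInGB": 100},
    "EnableInterContainerTrafficEncryption": True,  # may add time and cost
}
# merge into a full request, then:
# boto3.client("sagemaker").create_training_job(**full_request)
```

With InstanceCount of 1 the flag has no effect, matching the note above about single-instance jobs.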
Effective AI services opt-out policy
The effective AI services opt-out policy specifies the final rules that apply to an AWS account. It is
the aggregation of any AI services opt-out policies that the account inherits, plus any AI services
opt-out policies that are directly attached to the account. When you attach an AI services opt-out
policy to the organization's root, it applies to all accounts in your organization. When you attach
an AI services opt-out policy to an OU, it applies to all accounts and OUs that belong to the OU.
When you attach a policy directly to an account, it applies only to that one AWS account.
For example, the AI services opt-out policy attached to the organization root might specify that
all accounts in the organization opt out of content use by all AWS machine learning services. A
separate AI services opt-out policy attached directly to one member account specifies that it opts
in to content use for only Amazon Rekognition. The combination of these AI services opt-out
policies comprises the effective AI services opt-out policy. The result is that all accounts in the
organization are opted out of all AWS AI services, with the exception of the one account that
opts in to Amazon Rekognition.
You can view the effective AI services opt-out policy for an account from the AWS Management
Console, AWS API, or AWS Command Line Interface.
ML implementation and Operations Deploy and operationalize ML solutions
Managed Deployment
Provides deployment with one click or a single API call.
Auto scaling.
Step 1: Create the model
Use the CreateModel API.
Name the model and tell Amazon SageMaker where it is stored.
Use this if you're hosting on Amazon SageMaker or running a batch job.
Step 2: Create an HTTPS endpoint configuration
Use the CreateEndpointConfig API.
Associate it with one or more created models.
Set one or more configurations (production variants) for each model.
For each production variant, specify the instance type and initial count, and set its initial
weight (how much traffic it receives).
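The two steps above can be sketched with plain request dicts (all names, ARNs, and S3 paths are hypothetical), plus the CreateEndpoint call that actually deploys the configuration:

```python
# Sketch of the SageMaker hosting flow: CreateModel, CreateEndpointConfig,
# CreateEndpoint. All names, ARNs, and S3 paths are hypothetical.
model = {
    "ModelName": "my-model",
    "ExecutionRoleArn": "arn:aws:iam::111122223333:role/SageMakerRole",
    "PrimaryContainer": {
        "Image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
        "ModelDataUrl": "s3://my-bucket/model/model.tar.gz",
    },
}
endpoint_config = {
    "EndpointConfigName": "my-config",
    "ProductionVariants": [{
        "VariantName": "variant-1",
        "ModelName": "my-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,  # share of traffic this variant receives
    }],
}
endpoint = {"EndpointName": "my-endpoint", "EndpointConfigName": "my-config"}
# sm = boto3.client("sagemaker")
# sm.create_model(**model)
# sm.create_endpoint_config(**endpoint_config)
# sm.create_endpoint(**endpoint)
```
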
SageMaker Steps
1. Create a new endpoint configuration, using the same production variants for the existing
live model and for the new model.
2. Update the existing live endpoint with the new endpoint configuration. Amazon
SageMaker creates the required infrastructure for the new production variant and updates
the weights without any downtime.
3. Switch traffic to the new model through an API call.
4. Create a new endpoint configuration with only the new production variant and apply it to
the endpoint. Amazon SageMaker terminates the infrastructure for the previous
production variant.
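Steps 1 and 2 can be sketched as follows (all names are hypothetical). The new variant starts with weight 0, so it receives no live traffic until you switch:

```python
# Step 1 of the blue/green flow: an endpoint config carrying BOTH the live
# and the new production variants. Names are hypothetical. Step 2 then calls
# UpdateEndpoint with this config, which SageMaker applies without downtime.
blue_green_config = {
    "EndpointConfigName": "my-config-v2",
    "ProductionVariants": [
        {"VariantName": "live-model", "ModelName": "model-v1",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 2,
         "InitialVariantWeight": 1.0},   # keeps serving all traffic for now
        {"VariantName": "new-model", "ModelName": "model-v2",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 2,
         "InitialVariantWeight": 0.0},   # deployed but no live traffic yet
    ],
}
# sm = boto3.client("sagemaker")
# sm.create_endpoint_config(**blue_green_config)
# sm.update_endpoint(EndpointName="my-endpoint",
#                    EndpointConfigName="my-config-v2")
```
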
In this approach, all live inference traffic is served by either the old or new model at any given
point. However, before directing the live traffic to new model, synthetic traffic is used to test and
validate the new model.
Canary Deployment
A/B testing is similar to canary testing, but has larger user groups and a longer time scale,
typically days or even weeks. For this type of testing, Amazon SageMaker endpoint configuration
uses two production variants: one for model A, and one for model B. For a fair comparison of two
models, begin by configuring the settings for both models to balance traffic between the models
equally (50/50) and make sure that both models have identical instance configurations. This
initial setting is necessary so that neither version of the model is impacted by differences in
traffic patterns or differences in the underlying compute capacity.
After you have monitored the performance of both models with the initial setting of equal
weights, you can either gradually change the traffic weights to put the models out of balance
(60/40, 80/20, etc.), or you can change the weights in a single step, continuing until a single
model is processing all of the live traffic.
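Shifting the weights is a single API call against the live endpoint; a sketch with hypothetical endpoint and variant names:

```python
# Shift A/B traffic weights on a live endpoint without redeploying anything.
# Endpoint and variant names are hypothetical.
def weight_update(endpoint, weights):
    return {
        "EndpointName": endpoint,
        "DesiredWeightsAndCapacities": [
            {"VariantName": name, "DesiredWeight": w}
            for name, w in weights.items()
        ],
    }

# e.g. move from 50/50 toward model B as it proves itself
update = weight_update("ab-endpoint", {"model-a": 40.0, "model-b": 60.0})
# boto3.client("sagemaker").update_endpoint_weights_and_capacities(**update)
```

Traffic is split in proportion to the weights, so repeating this call with updated numbers implements either the gradual or the single-step strategy described above.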
With canary testing, you can validate a new release with minimal risk.
1. You do this by first deploying to a small group of your users. Other users continue to use
the previous version.
2. When you’re satisfied with the new release, you can gradually roll the new release out to
all users.
3. After you have confirmed that the new model performs as expected, you can gradually roll
it out to all users, scaling endpoints up and down accordingly.
In addition to the traditional auto scaling of ML compute instances for cost savings, consider the
difference between CPU vs GPU. While deep learning based models require high power GPU
instance for training, inferences against the deep learning models do not typically need the full
power of a GPU. As such, hosting these deep learning models on a full-fledged GPU may lead to
underutilization and unnecessary costs. Amazon Elastic Inference enables you to attach low-cost,
GPU-powered acceleration to Amazon EC2 and Amazon SageMaker instances to reduce the cost
of running deep learning inferences. Standalone GPU instances are designed for model training
and are typically oversized for inference. Even though training jobs batch process hundreds of
data samples in parallel, most inference happens on a single input in real-time and consumes
only a small amount of GPU compute. Amazon Elastic Inference solves this problem by allowing
you to attach the appropriate amount of GPU-powered inference acceleration to any Amazon
EC2 or Amazon SageMaker instance type, with no code changes.
Appendices Algorithms Input Formats
5. Appendices
5.1 Algorithms Input Formats
1. Linear Learner: recordIO-wrapped protobuf (Float32 data only!) or CSV (first column assumed to be the label). File or Pipe mode both supported.
2. K-Nearest Neighbors: recordIO-protobuf or CSV training (first column is the label). The train channel contains your data; the test channel emits accuracy or MSE. File or Pipe mode on either.
3. K-Means: recordIO-protobuf or CSV. File or Pipe on either. Train channel, optional test (train ShardedByS3Key, test FullyReplicated).
4. Principal Component Analysis (PCA): recordIO-protobuf or CSV. File or Pipe on either.
5. XGBoost: CSV or libsvm input; recordIO-protobuf and Parquet as well.
6. IP Insights: user names and account IDs can be fed in directly; no need to pre-process. Training channel, optional validation (computes AUC score). CSV only (entity, IP).
7. Factorization Machines: recordIO-protobuf with Float32 (sparse data means CSV isn't practical).
8. Object Detection: RecordIO or image format (JPG or PNG), with a JSON file of annotation data for each image.
9. Image Classification: Apache MXNet RecordIO (not protobuf). Supports both RecordIO (application/x-recordio) and image (image/png, image/jpeg, and application/x-image) content types for training in File mode; the image format requires .lst files to associate image index, class label, and path to the image. Supports the RecordIO (application/x-recordio) content type for training in Pipe mode; the Augmented Manifest Image Format also enables Pipe mode. Supports image/png, image/jpeg, and application/x-image for inference.
10. Semantic Segmentation: JPG images and PNG annotations, for both training and validation, with label maps to describe the annotations. Augmented manifest image format supported for Pipe mode. JPG images accepted for inference.
11. BlazingText: for supervised mode (text classification), one sentence per line; the first "word" in each sentence is the string __label__ followed by the label.
Appendices Algorithm Instance Types
Appendices Algorithm Type & Usage
6. IP Insights (Unsupervised): identify a user attempting to log in to a web service from an anomalous IP address; identify an account attempting to create computing resources from an unusual IP address.
7. Factorization Machines (Supervised): regression and classification. An extension of a linear model designed to capture interactions between features within high-dimensional sparse datasets economically. A good choice for tasks dealing with high-dimensional sparse datasets, such as click prediction and item recommendation.
8. Object Detection (Supervised, CNN): identify all objects in an image with bounding boxes and confidence scores.
9. Image Classification (Supervised, CNN): assign one or more labels to an image.
10. Semantic Segmentation (Supervised, CNN): pixel-level object classification built on MXNet Gluon and GluonCV. Instance segmentation, used by self-driving vehicles, tells you a more specific object class.
11. BlazingText: text classification (web searches and information retrieval); Word2vec (word embeddings, used for translation and sentiment analysis; works on words only, not sentences or documents).
12. Seq2Seq (RNN): machine translation, text summarization, speech to text.
13. Object2Vec (Unsupervised): represents how similar objects are to each other; compute nearest neighbors of objects, visualize clusters, genre prediction, recommendations.
14. Neural Topic Model (Unsupervised, deep learning): classify or summarize documents based on topics.
15. Latent Dirichlet Allocation (LDA) (Unsupervised): topic modeling algorithm; cluster customers based on purchases; harmonic analysis in music.
16. DeepAR (Supervised): forecasting one-dimensional time series.
17. Random Cut Forest (Unsupervised): anomaly detection with an anomaly score.
THANK YOU