AWS ML Notes -Domain 2 - Data Transformation
Reinforcement learning (RL) in SageMaker requires the following components:
• A deep learning (DL) framework. Currently, SageMaker supports RL in TensorFlow and Apache MXNet.
• An RL toolkit. An RL toolkit manages the interaction between the agent and the environment and provides a wide selection of state-of-the-art RL algorithms. SageMaker supports the Intel Coach and Ray RLlib toolkits. For information about Intel Coach, see https://nervanasystems.github.io/coach/. For information about Ray RLlib, see https://ray.readthedocs.io/en/latest/rllib.html.
• An RL environment. You can use custom environments, open-source environments, or commercial environments. For information, see RL Environments in Amazon SageMaker.
• Data analysis and processing: SageMaker Autopilot identifies your specific problem type, handles missing
values, normalizes your data, selects features, and prepares the data for model training.
• Model selection: SageMaker Autopilot explores a variety of algorithms. SageMaker Autopilot uses a cross-
validation resampling technique to generate metrics that evaluate the predictive quality of the algorithms
based on predefined objective metrics.
• Hyperparameter optimization: SageMaker Autopilot automates the search for optimal hyperparameter
configurations.
• Model training and evaluation: SageMaker Autopilot automates the process of training and evaluating
various model candidates.
o It splits the data into training and validation sets, and then it trains the selected model candidates
using the training data.
o Then it evaluates their performance on the unseen data of the validation set.
o Lastly, it ranks the optimized model candidates based on their performance and identifies the best-performing model.
• Model deployment: After SageMaker Autopilot has identified the best-performing model, it provides the option to deploy the model. It accomplishes this by automatically generating the model artifacts and an endpoint that exposes an API. External applications can send data to the endpoint and receive the corresponding predictions or inferences.
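As a rough illustration of this workflow, here is a minimal sketch using the AutoML class from the SageMaker Python SDK; the S3 location, IAM role, and target column are placeholder assumptions.

```python
# Minimal sketch (not the official tutorial): launching an Autopilot job with the
# SageMaker Python SDK. Paths, role, and the label column are assumptions.
import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
automl = AutoML(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # assumed IAM role
    target_attribute_name="churn",   # assumed label column in the training CSV
    max_candidates=10,               # limit how many model candidates Autopilot tries
    sagemaker_session=session,
)

# Autopilot handles preprocessing, algorithm selection, HPO, training, and ranking.
automl.fit(inputs="s3://my-bucket/train/train.csv", wait=False)
```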
2.1.3 SageMaker JumpStart
SageMaker JumpStart is an ML hub with foundation models, built-in algorithms, and prebuilt ML solutions that you can deploy with a few clicks.
Features
Foundation Models
With JumpStart foundation models, many pre-trained models are available for you to deploy or fine-tune.
Amazon SageMaker JumpStart provides developers and data science teams with ready-to-start AI/ML models and pipelines. SageMaker JumpStart is ready to be deployed and can be used as-is. Example use cases include the following:
• Demand forecasting
SageMaker JumpStart comes with a pre-trained, deep learning-based forecasting model that uses Long- and Short-Term Temporal Patterns with Deep Neural Networks (LSTNet).
• Credit scoring
The Amazon SageMaker JumpStart Graph-Based Credit Scoring solution constructs a corporate network from SEC filings (long-form text data).
• Fraud detection
Detect fraud in financial transactions by training a graph convolutional network with the Deep Graph Library (DGL) and a SageMaker XGBoost model.
• Computer vision
Amazon SageMaker JumpStart supports over 20 state-of-the-art, fine-tunable object detection models from PyTorch Hub and MXNet GluonCV. The models include YOLO-v3, Faster R-CNN, and SSD, pre-trained on the MS-COCO and PASCAL VOC datasets.
Amazon SageMaker JumpStart also supports image feature vector extraction for over 52 state-of-the-art image classification models, including ResNet, MobileNet, and EfficientNet from TensorFlow Hub. You can use these models to generate image feature vectors for your images. The generated feature vectors are representations of the images in a high-dimensional Euclidean space. They can be used to compare images and identify similarities for image search applications.
• Document understanding
JumpStart provides solutions for you to uncover valuable insights and connections in business-critical documents. Use cases include text classification, document summarization, handwriting recognition, relationship extraction, question answering, and filling in missing values in tabular records.
• Predictive maintenance
The AWS predictive maintenance solution for automotive fleets applies deep learning techniques to common
areas that drive vehicle failures, unplanned downtime, and repair costs.
• Churn prediction
After training this model using customer profile information, you can take the same profile information for any arbitrary customer and pass it to the model. You can then have it predict whether that customer will churn. Amazon SageMaker JumpStart uses a few algorithms to help with this; LightGBM, CatBoost, TabTransformer, and AutoGluon-Tabular used on a churn prediction dataset are a few examples.
• Personalized recommendations
Amazon SageMaker JumpStart can perform cross-device entity linking for online advertising by training a graph convolutional network with the Deep Graph Library.
• Text summarization
You can use a foundation model such as the Falcon LLM to summarize long documents with LangChain and Python. Falcon is a large language model trained by researchers at the Technology Innovation Institute (TII) on over 1 trillion tokens using AWS. Falcon has many different variations; its two main variants are Falcon 40B and Falcon 7B, comprising 40 billion and 7 billion parameters, respectively. Falcon also has fine-tuned versions trained for specific tasks, such as following instructions. Falcon performs well on a variety of tasks, including text summarization, sentiment analysis, question answering, and conversing.
• Financial pricing
Many businesses dynamically adjust pricing on a regular basis to maximize their returns. Amazon SageMaker JumpStart has solutions for price optimization, dynamic pricing, option pricing, and portfolio optimization use cases. For example, you can estimate price elasticity using Double Machine Learning (ML) for causal inference and the Prophet forecasting procedure, and then use these estimates to optimize daily prices.
• Causal inference
Researchers can use machine learning models such as Bayesian networks to represent causal dependencies
and draw causal conclusions based on data.
2.1.5 Bedrock
Use cases
2.2 Train Models
2.2.1 Model Training Concepts
Minimizing loss:
Log-likelihood loss (also called log loss or cross-entropy loss) is a loss function used for classification tasks, where the goal is to predict whether an input belongs to one of two or more classes. For example, you might use logistic regression with log loss to predict whether an email is spam.
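To make the loss concrete, here is a minimal NumPy sketch (not part of the original notes) of binary log loss for the spam example:

```python
# Minimal sketch: binary log-likelihood loss (log loss / cross-entropy).
# y are true labels (1 = spam, 0 = not spam); p are predicted probabilities of spam.
import numpy as np

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.6, 0.8])

eps = 1e-15                      # avoid log(0)
p = np.clip(p, eps, 1 - eps)
log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_loss)                  # lower is better; 0 would mean perfect confident predictions
```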
Optimizing - Reducing Loss function:

| Optimization technique | Gradient descent | Stochastic gradient descent | Mini-batch gradient descent |
| --- | --- | --- | --- |
| Weights updated | Every epoch | Every datapoint | Every batch |
| Speed of each epoch calculation | Slowest | Fast | Slower |
| Gradient steps | Smooth updates toward the minima | Noisy or erratic updates toward the minima | Less noisy or erratic updates toward the minima |
Gradient descent
As mentioned, gradient descent only updates weights after it has gone through all of the data, also known as an epoch. Of the three variations covered here:
• gradient descent has the slowest speed to find the minima as a result, but
• it also takes the fewest number of steps to reach the minima.
Stochastic gradient descent
In stochastic gradient descent (SGD), you update your weights for each record you have in your dataset. For example, if you have 1,000 data points in your dataset, SGD will update the parameters 1,000 times. With gradient descent, the parameters would be updated only once in every epoch.
• SGD leads to more parameter updates and, therefore, the model will get closer to the minima more quickly.
• One drawback of SGD, however, is that it oscillates in different directions, unlike gradient descent, and hence takes many more steps.
Mini-batch gradient descent
A hybrid of gradient descent and SGD, this approach uses a smaller subset of records, or batch (whose size is called the batch size), to update your parameters.
• Mini-batch gradient descent updates more often than gradient descent while having less erratic or noisy updates compared to SGD. The user-defined batch size helps you fit the smaller dataset into memory. Having a smaller dataset helps the algorithms run on almost any average computer that you might be using.
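A minimal NumPy sketch of all three variants on a toy linear-regression problem; only the batch size changes between them:

```python
# Minimal sketch: the three variants differ only in how many records are used per
# weight update. batch_size = n gives gradient descent, 1 gives SGD, and anything
# in between gives mini-batch gradient descent. Toy linear model, illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

def train(batch_size, lr=0.01, epochs=5):
    w = np.zeros(3)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on the batch
            w -= lr * grad
    return w

print(train(batch_size=len(X)))  # gradient descent: one update per epoch (slow progress)
print(train(batch_size=1))       # SGD: one update per datapoint (fast but noisy)
print(train(batch_size=32))      # mini-batch: one update per batch (a compromise)
```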
2.2.2 Compute Environment
AWS offers solutions for a variety of specific ML tasks, and this permits you to optimize for your particular use-case scenarios.
Model created
• Store in S3
• Package and distribute
• Register model (in registry)
Train a model
For built-in algorithms, the only inputs you need to provide are the
• training data
• hyperparameters
• compute resources.
Amazon SageMaker training options
When it comes to training environments, you have several to choose from:
• Create a training job using the SageMaker console (see the Creating a Training Job Using the Amazon SageMaker Console lesson for an example using this method).
• Use the low-level SageMaker APIs through the SDK for Python (Boto3) or the AWS CLI.
|  | Pipe mode | File mode | Fast File mode |
| --- | --- | --- | --- |
| What | SageMaker streams data directly from Amazon S3 to the container, without downloading the data to the ML storage volume. | SageMaker will download the training data from S3 to the provisioned ML storage volume. Then it will mount the directory to the Docker volume for the training container. | SageMaker can stream data directly from S3 to the container with no code changes. Users can author their training script to interact with these files as though they were stored on disk. |
| Pros | Improves training performance by reducing the time spent on data download. | In a distributed training setup, the training data is distributed uniformly across the cluster. | Fast File mode works best when the data is read sequentially. |
| Cons |  | Manually ensure the ML storage volume has sufficient capacity to accommodate the data from Amazon S3. | Augmented manifest files are not supported. The startup time is lower when there are fewer files in the S3 bucket provided. |
1. Use your local laptop or desktop with the SageMaker Python SDK. You can use different instance types, such as CPUs and GPUs, but you are not required to use the managed notebook instances.
3. Create an estimator object, specifying the:
a) training script
b) instance type
c) other configurations.
4. Call the fit method on the estimator to start the training job, passing in the training and
validation data channels.
5. SageMaker takes care of the rest. It pulls the image from Amazon Elastic Container Registry
(Amazon ECR) and loads it on the managed infrastructure.
6. Monitor the training job and retrieve the trained model artifacts once the job is complete.
Example
In this example, the PyTorch estimator is configured with the training script using the entry_point: train.py,
instance type ml.p3.2xlarge, and other settings. The fit method is called to launch the training job, passing in the
location of the training data.
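A sketch of that estimator setup; the IAM role, S3 path, and framework/Python versions are placeholder assumptions:

```python
# Minimal sketch of the example described above: a PyTorch estimator in script mode.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",            # custom training script (script mode)
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # assumed IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.13",          # assumed framework/Python versions
    py_version="py39",
)

# fit() launches the managed training job: SageMaker pulls the framework image
# from Amazon ECR and makes the channel data available to the training container.
estimator.fit({"training": "s3://my-bucket/path/to/training-data"})
```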
Reducing training time
Amazon SageMaker script mode provides the flexibility to develop custom training and inference code
while using industry-leading machine learning frameworks.
a) Early stopping:
Early stopping is a regularization technique that stops the training process for an ML model when the model's performance on a validation set stops improving.
• Evaluating the objective metric after each epoch: During the training process, SageMaker evaluates the specified objective metric (for example, accuracy, loss, or F1-score) after each epoch or iteration of the training job.
b) Distributed training
A. Data parallelism is the process of splitting the training set into mini-batches evenly distributed across nodes. Thus, each node only trains the model on a fraction of the total dataset.
B. Model parallelism is the process of splitting a model up between multiple instances or nodes.
• If the model can fit in a single GPU's memory but your dataset is large, data parallelism is the recommended approach. It splits the training data across multiple GPUs or instances for faster processing and larger effective batch sizes.
• If the model is too large to fit in a single GPU's memory, model parallelism becomes necessary. It splits the model itself across multiple devices, enabling the training of models that would otherwise be intractable on a single GPU.
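A hedged sketch of how data parallelism might be enabled on a SageMaker PyTorch estimator via its distribution parameter; the instance type, count, and script name are assumptions:

```python
# Sketch: enabling SageMaker's distributed data parallel library on a PyTorch estimator.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # assumed IAM role
    instance_count=2,                 # training data is sharded across both instances
    instance_type="ml.p4d.24xlarge",  # smdistributed requires specific GPU instance types
    framework_version="1.13",
    py_version="py39",
    # Data parallelism: each GPU holds a full model copy and trains on a data shard.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    # For model parallelism (model too large for one GPU), you would instead
    # configure distribution={"smdistributed": {"modelparallel": {...}}}.
)
```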
Building a deployable model package
Step 2: Write a script that will run in the container to load the model artifact. In this example, the script
is named inference.py. This script can include custom code for generating predictions, as well as input
and output processing. It can also override the default implementations provided by the pre-built
containers.
To install additional libraries at container startup, add a requirements.txt file that specifies the libraries
to be installed by using pip.
Step 3: Create a model package that bundles the model artifact and the code. This package should
adhere to a specific folder structure and be packaged as a tar archive, named model.tar.gz, with gzip
compression.
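A minimal sketch of Step 3 using Python's tarfile module; the PyTorch-style layout and file names (model.pth, code/inference.py) are illustrative assumptions:

```python
# Sketch: bundle the model artifact and custom code into model.tar.gz.
# Assumed layout: artifact at the archive root, custom code under code/.
import tarfile

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pth")             # trained model artifact
    tar.add("code/inference.py")     # custom inference handlers (input/output/predict)
    tar.add("code/requirements.txt") # extra libraries installed via pip at container startup
```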
2.3 Refine Models
2.3.1 Evaluating Model Performance
Bias and Variance
a) What are they?

| Bias | Variance |
| --- | --- |
| The model is too simple | The model is too complex |
| Incorrect modeling or feature engineering | Too much irrelevant data in the training dataset |
| Inherited bias from the training dataset | Model trained for too long on the training dataset |
2.3.2 Model Fit (Overfitting and Underfitting)
1. Overfit/Underfit
• Overfit
Reasons: training data too small; too much irrelevant data; excessive training time; overly complex architecture.
• Underfit
Reasons:

a) Remediating Overfitting
Early stopping: pauses the training process before the model learns the noise in the data.
Pruning: aims to remove weights that don't contribute much to the training process.
Regularization (see the sketch after this list):
a) Dropout: randomly drops out (sets to 0) a number of neurons in each layer of the neural network during each epoch.
b) L1 regularization: pushes the weights of less important features to zero.
c) L2 regularization: results in smaller overall weight values (and stabilizes the weights) when there is high correlation between the input features.
Data augmentation: increases the diversity of the training data.
Model architecture simplification.
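A minimal PyTorch sketch (framework choice is an assumption) showing dropout inside a network and L2 regularization via the optimizer's weight_decay:

```python
# Sketch: two overfitting remedies from the list above, expressed in PyTorch.
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zeroes 50% of activations on each training forward pass
    nn.Linear(64, 1),
)

# weight_decay adds an L2 penalty, shrinking weights toward smaller overall values.
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```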
b) Remediating Underfitting
The idea behind ensembling is that by combining the strengths of different models, the weaknesses of
individual models can be mitigated. This leads to improved overall performance.
Boosting algorithms:

| Adaptive Boosting (AdaBoost) | Gradient Boosting (GB) | Extreme Gradient Boosting (XGBoost) |
| --- | --- | --- |
| • classification | • classification • regression | • classification • regression • large datasets and big data applications |
Bagging (bootstrap aggregation)
Random forests
Stacking
??
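A hedged scikit-learn sketch (library choice is an assumption) contrasting a bagging ensemble with a boosting ensemble on synthetic data:

```python
# Sketch: random forest (bagging) vs. gradient boosting (boosting) on toy data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: many independent trees trained on bootstrap samples, predictions averaged.
bagging = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Boosting: trees trained sequentially, each correcting the previous ones' errors.
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("random forest (bagging):", bagging.score(X_te, y_te))
print("gradient boosting:", boosting.score(X_te, y_te))
```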
2.3.3 Hyperparameter Tuning
Benefits of Hyperparameter tuning
a) Impact of Hyperparameter tuning on model performance
| Learning rate | Batch size | Epochs |
| --- | --- | --- |
| If the learning rate is too high, the algorithm might overshoot the optimal solution and fail to converge. | A larger batch size can lead to faster convergence but might require more computational resources. | Too many epochs can result in overfitting. |
Neural networks

| # of layers | # of neurons in each layer | Choice of activation functions | Regularization techniques |
| --- | --- | --- | --- |
| More layers -> more complex. Increasing the depth of a network risks overfitting. | More neurons -> more processing power. Increasing the number of neurons risks overfitting. | Introduce non-linearity into the neural network. Common activation functions include: • Sigmoid function • Rectified Linear Unit (ReLU) • Hyperbolic Tangent (Tanh) • Softmax function | Helps prevent overfitting. Common regularization techniques: • L1/L2 regularization • Dropout • Early stopping |
Decision Tree

| Maximum depth of tree | Minimum samples to split a node | Split criterion |
| --- | --- | --- |
| Helps manage the complexity of the model and prevent overfitting. | Sets a threshold that the data must meet before splitting a node. This prevents the tree from creating too many branches, which also helps to prevent overfitting. | Options to select how the algorithm evaluates node splits: • Gini impurity: measures the purity of the data and the likelihood that data could be misclassified. • Entropy: measures the randomness of the data. The child node that reduces entropy the most is the split that should be used. |
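A hedged sketch mapping these hyperparameters onto scikit-learn's DecisionTreeClassifier; the specific values are illustrative assumptions:

```python
# Sketch: the three decision-tree hyperparameters above, as scikit-learn arguments.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,           # maximum depth of the tree: limits model complexity
    min_samples_split=10,  # threshold the data must meet before splitting a node
    criterion="gini",      # how node splits are evaluated ("gini" or "entropy")
)
```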
Hyperparameter tuning techniques

| Technique | Pros | Cons | When to use |
| --- | --- | --- | --- |
| Manual | Works when you have a good understanding of the problem at hand | Time-consuming | When you have domain knowledge and prior experience with similar problems |
| Grid search: a systematic and exhaustive approach to hyperparameter tuning. It involves defining all possible hyperparameter values and training and evaluating the model for every combination of these values. | A reliable technique, especially for smaller-scale problems | Computationally expensive | Small-scale problems where accuracy matters |
| Random search | More efficient than grid search | The optimum hyperparameter combination could be missed |  |
| Bayesian optimization: uses the performance of previous hyperparameter selections to predict which of the subsequent values are likely to yield the best results. | • Can handle composite objectives • Can converge faster than random search | • More complex to implement • Works sequentially, so difficult to scale | Multiple objectives and/or speed |
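A hedged scikit-learn sketch contrasting grid and random search; the model, grid, and values are illustrative assumptions (SageMaker AMT, below, offers the same strategies as a managed service):

```python
# Sketch: grid search tries every combination; random search samples a subset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]}

# Exhaustive: all 9 combinations, each cross-validated.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3).fit(X, y)

# Random: only 4 sampled combinations; cheaper, but the optimum could be missed.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=4, cv=3, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)
```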
2. Specify the hyperparameters to tune and the range of values to use for each of the
following: alpha, eta, max_depth, min_child_weight, and num_round.
3. Identify the objective metric that SageMaker AMT will use to gauge model performance.
4. Configure and launch the SageMaker AMT tuning job, including completion criteria to stop tuning after
the criteria have been met.
5. Identify the best-performing model and the hyperparameters used in its creation.
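A sketch of this AMT workflow for a built-in XGBoost estimator; the region, IAM role, S3 channels, and ranges are placeholder assumptions:

```python
# Sketch: SageMaker Automatic Model Tuning (AMT) over the hyperparameters listed above.
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Built-in XGBoost container (region and version are assumptions).
container = image_uris.retrieve("xgboost", region="us-east-1", version="1.7-1")
xgb_estimator = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # assumed IAM role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",
)

# Step 2: hyperparameters to tune and their ranges (range bounds are illustrative).
hyperparameter_ranges = {
    "alpha": ContinuousParameter(0, 2),
    "eta": ContinuousParameter(0, 1),
    "max_depth": IntegerParameter(1, 10),
    "min_child_weight": ContinuousParameter(1, 10),
    "num_round": IntegerParameter(100, 500),
}

# Steps 3-4: objective metric, completion criteria, and launch.
tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:auc",   # metric AMT uses to rank training jobs
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,                 # completion criterion: stop after 20 training jobs
    max_parallel_jobs=3,
    early_stopping_type="Auto",  # let AMT stop unpromising jobs early
)
tuner.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})

# Step 5: best-performing model's training job (its hyperparameters are in its config).
print(tuner.best_training_job())
```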
Pruning
Pruning is a technique that removes the least
important parameters or weights from a model.
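A minimal sketch using PyTorch's pruning utilities (framework choice is an assumption):

```python
# Sketch: remove the least important weights from a layer by magnitude.
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(64, 32)
prune.l1_unstructured(layer, name="weight", amount=0.3)  # zero the 30% smallest-magnitude weights
prune.remove(layer, "weight")                            # make the pruning permanent
```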
Quantization
Quantization changes the representation of weights
to its most space-efficient representation.
E.g., instead of a 32-bit floating-point representation
of weight, quantization has the model use an 8-bit
integer representation.
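A minimal NumPy sketch of the idea behind symmetric int8 quantization; this illustrates the arithmetic, not any particular library's API:

```python
# Sketch: map float32 weights to int8 with a scale factor, then dequantize.
import numpy as np

w = np.array([0.42, -1.30, 0.07, 2.11], dtype=np.float32)

scale = np.abs(w).max() / 127                 # map the largest magnitude to int8's range
q = np.round(w / scale).astype(np.int8)       # 8-bit storage: 4x smaller than float32
w_restored = q.astype(np.float32) * scale     # dequantize for use; small rounding error

print(q, w_restored)
```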
Knowledge distillation
With distillation, a larger teacher model transfers
knowledge to a smaller student model. The student
model is trained on the same dataset as the teacher.
However, the student model is also trained on the
teacher model's knowledge of the data.
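A hedged PyTorch sketch of a common distillation loss, where the student matches the teacher's softened probabilities; the temperature and loss weighting are illustrative assumptions:

```python
# Sketch: knowledge distillation loss = soft-target loss + hard-label loss.
import torch
import torch.nn.functional as F

# Toy logits standing in for real model outputs (batch of 4, 3 classes).
teacher_logits = torch.randn(4, 3)
student_logits = torch.randn(4, 3, requires_grad=True)
labels = torch.tensor([0, 2, 1, 0])

T = 2.0  # temperature: softens the teacher's distribution
soft_targets = F.softmax(teacher_logits / T, dim=-1)
log_probs = F.log_softmax(student_logits / T, dim=-1)

# The student learns from the teacher's "knowledge" (soft targets) plus the true labels.
kd_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T
loss = 0.5 * kd_loss + 0.5 * F.cross_entropy(student_logits, labels)
loss.backward()
```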
2.3.5 Refining Pre-trained models
Benefits of Fine tuning
a) Where fine-tuning fits in the training process
• To work with domain-specific language, such as industry jargon, technical terms, or other specialized
vocabulary
• To have responses that are more factual, less toxic, and better aligned to specific requirements
b) Fine-tuning approaches
Catastrophic forgetting
a) Detecting
• Plot your model's performance over time. If the model's performance on specific tasks decreases significantly after training on new data, it might be a sign of catastrophic forgetting.
• Make sure your validation sets are representative of historic patterns in the data that are still relevant to the problem.
b) Preventing
b) Benefits
• Catalog models for production
• Manage model versions
• Control the approval status of models within your ML pipeline
| Metric | Description | When to use |
| --- | --- | --- |
| Root Mean Square Error (RMSE) | Square root of MSE, in the same units as the target variable | • When you want the error in the same units as the target variable • For easier interpretation of the error magnitude • When comparing models with different scales |
| R-Squared (R²) | Proportion of variance in the dependent variable explained by the independent variables | • To understand how well the model fits the data • When you want a metric bounded between 0 and 1 • For comparing models across different datasets |
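A minimal scikit-learn sketch computing both metrics on toy values:

```python
# Sketch: RMSE and R-squared for a small set of toy predictions.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # same units as the target variable
r2 = r2_score(y_true, y_pred)                       # proportion of variance explained
print(rmse, r2)
```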
a) Impact of convergence
This is where SageMaker AMT can help. It can automatically tune models by finding the optimal combination of hyperparameters.
Improve CNN
How SageMaker AMT improves issues with local maxima and local minima
Training a deep CNN for image classification can encounter saddle points or local minima. This is because the loss function landscape in high-dimensional spaces can be complex. Having multiple local minima and saddle points can trap the optimization algorithm, leading to suboptimal convergence.
This is where SageMaker Training Compiler can help. It can automatically apply optimization techniques like
• tensor remapping
• operator fusion
• kernel optimization.
Debug Model Convergence with SageMaker Debugger