Learn TensorFlow Enterprise
Copyright © 2020 Packt Publishing
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80020-914-5
www.packt.com
Contributors
Preface
Section 1 – TensorFlow Enterprise Services and Features
Chapter 1: Overview of TensorFlow Enterprise
UNDERSTANDING TENSORFLOW ENTERPRISE
TENSORFLOW ENTERPRISE PACKAGES
CONFIGURING CLOUD ENVIRONMENTS FOR TENSORFLOW ENTERPRISE
SETTING UP A CLOUD ENVIRONMENT
CREATING A GOOGLE CLOUD STORAGE BUCKET
ENABLING APIS
CREATING A DATA WAREHOUSE
USING TENSORFLOW ENTERPRISE IN AI PLATFORM
CLOUD STORAGE READER
BIGQUERY READER
PERSISTING DATA IN BIGQUERY
PERSISTING DATA IN A STORAGE BUCKET
SUMMARY
Chapter 2: Running TensorFlow Enterprise in Google AI Platform
SETTING UP A NOTEBOOK ENVIRONMENT
AI PLATFORM NOTEBOOK
DEEP LEARNING CONTAINER (DLC)
SUGGESTIONS FOR SELECTING WORKSPACES
EASY PARAMETERIZED DATA EXTRACTION FROM BIGQUERY
PUTTING IT TOGETHER
SUMMARY
Section 2 – Data Preprocessing and Modeling
Chapter 3:
CONVERTING TABULAR DATA TO A TENSORFLOW DATASET
CONVERTING A BIGQUERY TABLE TO A TENSORFLOW DATASET
CONVERTING DISTRIBUTED CSV FILES TO A TENSORFLOW DATASET
PREPARING AN EXAMPLE CSV
BUILDING FILENAME PATTERNS WITH TENSORFLOW I/O
CREATING A DATASET FROM CSV FILES
INSPECTING THE DATASET
CONSTRUCTING A PROTOBUF MESSAGE
DECODING TFRECORD AND RECONSTRUCTING THE IMAGE
READING TFRECORD AND DISPLAYING IT AS IMAGES
SUMMARY
Chapter 4:
USING TENSORFLOW HUB
CREATING A GENERATOR TO FEED IMAGE DATA AT SCALE
REUSING PRETRAINED RESNET FEATURE VECTORS
COMPILING THE MODEL
TRAINING THE MODEL
LEVERAGING THE TENSORFLOW KERAS API
DATA ACQUISITION
SOLVING A DATA SCIENCE PROBLEM WITH THE US_BORDER_VOLUMES TABLE
SELECTING FEATURES AND A TARGET FOR MODEL TRAINING
STREAMING TRAINING DATA
INPUT TO A MODEL
MODEL TRAINING
WORKING WITH TENSORFLOW ESTIMATORS
SUMMARY
Section 3 – Scaling and Tuning ML Works
Chapter 5: Training at Scale
WHITELISTING ACCESS FOR READING TRAINING DATA AND WRITING ARTIFACTS (ALTERNATIVE)
EXECUTION COMMAND AND FORMAT
CLOUD COMMAND ARGUMENTS
ORGANIZING THE TRAINING SCRIPT
DATA STREAMING PIPELINE
SUBMITTING THE TRAINING SCRIPT
SUMMARY
Chapter 6: Hyperparameter Tuning
TECHNICAL REQUIREMENTS
DELINEATING HYPERPARAMETER TYPES
UNDERSTANDING THE SYNTAX AND USE OF KERAS TUNER
DELINEATING HYPERPARAMETER SEARCH ALGORITHMS
HYPERBAND
BAYESIAN OPTIMIZATION
RANDOM SEARCH
SUBMITTING TUNING JOBS IN A LOCAL ENVIRONMENT
SUBMITTING TUNING JOBS IN GOOGLE'S AI PLATFORM
SUMMARY
Section 4 – Model Optimization and Deployment
Chapter 7: Model Optimization
TECHNICAL REQUIREMENTS
UNDERSTANDING THE QUANTIZATION CONCEPT
TRAINING A BASELINE MODEL
PREPARING A FULL ORIGINAL MODEL FOR SCORING
PREPARING TEST DATA
CONVERTING A FULL MODEL TO A REDUCED FLOAT16 MODEL
PREPARING THE REDUCED FLOAT16 MODEL FOR SCORING
SCORING A SINGLE IMAGE WITH A QUANTIZED MODEL
SCORING A BATCH IMAGE WITH A QUANTIZED MODEL
CONVERTING A FULL MODEL TO A REDUCED HYBRID QUANTIZATION MODEL
MAPPING A PREDICTION TO A CLASS NAME
SCORING A SINGLE IMAGE
SCORING BATCH IMAGES
CONVERTING A FULL MODEL TO AN INTEGER QUANTIZATION MODEL
TRAINING A FULL MODEL
SCORING WITH AN INTEGER QUANTIZATION MODEL
PREPARING A TEST DATASET FOR SCORING
SCORING BATCH IMAGES
SUMMARY
Chapter 8:
TFRECORD DATASET – INGESTION PIPELINE
TFRECORD DATASET – FEATURE ENGINEERING AND TRAINING
REGULARIZATION
L1 AND L2 REGULARIZATION
ADVERSARIAL REGULARIZATION
SUMMARY
Chapter 9: Serving a TensorFlow Model
TECHNICAL REQUIREMENTS
RUNNING LOCAL SERVING
UNDERSTANDING TENSORFLOW SERVING WITH DOCKER
DOWNLOADING TENSORFLOW SERVING DOCKER IMAGES
SUMMARY
Conventions used
There are a number of text conventions used throughout this
book.
p2 = Person()
p2.name = 'Jane'
p2.age = 20
print(p2.name)
print(p2.age)
Any command-line input or output is written as follows:
TIPS OR IMPORTANT
NOTES
Overview of TensorFlow Enterprise
In this introductory chapter, you will learn how to set up and
run TensorFlow Enterprise in a Google Cloud Platform
(GCP) environment. This will enable you to get some initial
hands-on experience of how TensorFlow Enterprise integrates
with other services in GCP. One of the most important
improvements in TensorFlow Enterprise is the integration with
the data storage options in Google Cloud, such as Google
Cloud Storage and BigQuery.
Understanding TensorFlow Enterprise
TensorFlow has become an ecosystem consisting of many
valuable assets. At the core of its popularity and versatility
are a comprehensive machine learning library and model
templates that evolve quickly with new features and
capabilities. This popularity comes at a cost: complexity,
intricate dependencies, and API updates or deprecation
timelines that can easily break models and workflows that were
laboriously built not long ago. It is one thing to learn and
use the latest improvement in your code as you build a model to
experiment with your ideas and hypotheses, but it is quite
another if your job is to build a model for long-term
production use, maintenance, and support.
TensorFlow Enterprise packages
At the time of writing this book, TensorFlow Enterprise
includes the following packages:
Configuring cloud environments for TensorFlow Enterprise
Assuming you have a Google Cloud account already set up
with a billing method, before you can start using TensorFlow
Enterprise, there are some one-time setup steps that you must
complete in Google Cloud. This setup consists of the
following steps:
Setting up a cloud environment
Now we are going to take a look at what we need to set up in
Google Cloud before we can start using TensorFlow
Enterprise. These settings are needed so that essential Google
Cloud services can integrate seamlessly into the user tenant.
For example, the project ID is used to grant resource creation
credentials and access to the different services you use when
working with data in the TensorFlow workflow. By virtue of the
project ID, you can also read and write data in your Cloud
Storage and data warehouse.
CREATING A PROJECT
This is the first step. It is needed in order to enable billing so
you can use nearly all Google Cloud resources. Most resources
will ask for a project ID. It also helps you organize and track
your spending by knowing which services contribute to each
workload. Let's get started:
Creating a Google Cloud Storage bucket
A Google Cloud Storage bucket is a common way to store
models and model assets from a model training job. Creating a
storage bucket is very easy. Just look for Storage in the left
panel and select Browser:
Enabling APIs
Now we have a project, but before we start consuming Google
Cloud services, we need to enable some APIs. This process
needs to be done only once, usually when the project is
created:
For now, this is good enough. There are more APIs that you'll
need as you go through the examples in this book; GCP will
ask you to enable the API when relevant. You can do so at that
time.
Creating a data warehouse
We will use a simple example of putting data stored in a
Google Cloud bucket into a table that can be queried by
BigQuery. The easiest way to do so is to use the BigQuery UI.
Make sure you are working in the right project. We will use
this example to create a dataset that contains one table.
You can navigate to BigQuery by searching for it in the search
bar of the GCP portal, as in the following screenshot:
Using TensorFlow Enterprise in AI Platform
In this section, we are going to see firsthand how easy it is to
access data stored in one of the Google Cloud Storage options,
such as a storage bucket or BigQuery. To do so, we need to
configure an environment to execute some example
TensorFlow API code and command-line tools in this section.
The easiest way to use TensorFlow Enterprise is through the
AI Platform Notebook in Google Cloud:
After this, you will have a good grasp of reading and writing
data to a Google Cloud Storage option and persisting your data
or objects produced as a result of your TensorFlow runtime.
my_train_dataset = tf.data.TFRecordDataset('gs://<BUCKET_NAME>/<FILE_NAME>*.tfrecord')
my_train_dataset = my_train_dataset.repeat()
my_train_dataset = my_train_dataset.batch()
…
model.fit(my_train_dataset, …)
In the preceding pattern, the file stored in the bucket is
serialized as TFRecord, which is a binary format of your
original data. This is a very common way of storing and
serializing large amounts of data or files in the cloud for
TensorFlow consumption. This format enables a more efficient
read for data being streamed over a network. We will discuss
TFRecord in more detail in a future chapter.
BigQuery Reader
Likewise, BigQuery Reader is also integrated into the
TensorFlow Enterprise environment, so training data or
derived datasets stored in BigQuery can be consumed by
TensorFlow Enterprise.
TENSORFLOW I/O
For TensorFlow consumption of BigQuery data, it is better to
use TensorFlow I/O to invoke the BigQuery API. This is because
TensorFlow I/O provides a dataset object that streams the query
results, rather than returning the entire result set at once,
as in the previous method. A dataset object is the means to
stream training data to a model during training. Therefore, not
all training data has to be in memory at once. This
complements mini-batch training, which is arguably the most
common implementation of gradient descent optimization used
in deep learning. However, this approach is a bit more
complicated than the previous method, because it requires you
to know the schema of the table. This example uses a public
dataset hosted by Google Cloud.
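To make the link between a streamed dataset and mini-batch training concrete, here is a minimal sketch in plain Python. It does not use TensorFlow I/O or BigQuery: the helper names `stream_batches` and `train_slope`, the toy data, and the learning rate are all illustrative assumptions. A generator hands over one small batch at a time, and each batch triggers one gradient descent update.

```python
def stream_batches(xs, ys, batch_size):
    """Yield (x_batch, y_batch) pairs one at a time.

    In a real pipeline the batches would be read from storage on demand;
    here two small lists stand in for the data source.
    """
    for start in range(0, len(xs), batch_size):
        yield xs[start:start + batch_size], ys[start:start + batch_size]


def train_slope(xs, ys, batch_size=4, learning_rate=0.1, epochs=10):
    """Fit y = w * x by mini-batch gradient descent on the mean squared error."""
    w = 0.0
    for _ in range(epochs):
        for x_batch, y_batch in stream_batches(xs, ys, batch_size):
            # Gradient of mean((w * x - y) ** 2) with respect to w,
            # computed from the current batch only.
            grad = sum(2 * (w * x - y) * x
                       for x, y in zip(x_batch, y_batch)) / len(x_batch)
            w -= learning_rate * grad
    return w


# Toy data (hypothetical): y = 3x for x = 0.1, 0.2, ..., 2.0
xs = [0.1 * i for i in range(1, 21)]
ys = [3.0 * x for x in xs]
w = train_slope(xs, ys)
print(round(w, 3))
```

The training loop never needs more than one batch at a time, which is the property a dataset object provides for data that is too large to hold in memory.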
Persisting data in BigQuery
We have looked at how to read data stored in Google Cloud
storage solutions, such as Cloud Storage buckets or a BigQuery
data warehouse, and how to make the data available to AI
Platform's TensorFlow Enterprise instance running in
JupyterLab. Now let's take a look at some ways to write data
back, or persist our working data, into cloud storage.
The following query output shows the data from the BigQuery
table we just created:
Persisting data in a storage bucket
In the previous Persisting data in BigQuery section, we saw
how a structured data source such as a CSV file or a pandas
DataFrame can be persisted in a BigQuery dataset as a table.
In this section, we are going to see how to persist working data
such as a NumPy array. In this case, the suitable target storage
is a Google Cloud Storage bucket.
Summary
This chapter provided a broad overview of the TensorFlow
Enterprise environment hosted by Google Cloud AI Platform.
We also saw how this platform seamlessly integrates specific
tools such as command-line APIs to facilitate the easy transfer
of data or objects between the JupyterLab environment and
our storage solutions. These tools make it easy to access data
stored in BigQuery or in storage buckets, which are the two
most commonly used data sources in TensorFlow.
In the next chapter, we will take a closer look at the three ways
available in AI Platform to use TensorFlow Enterprise: the
Notebook, Deep Learning VM, and Deep Learning Containers.
Chapter 2: Running TensorFlow Enterprise in Google AI Platform
Currently, the TensorFlow Enterprise distribution is only
available through Google Cloud AI Platform. This chapter will
demonstrate how to launch AI Platform for use with
TensorFlow Enterprise. In AI Platform, TensorFlow Enterprise
can interact with Cloud Storage and BigQuery via their
respective command-line tools as well as simple APIs to load
data from the source. In this chapter, we are going to take a
look at how to launch AI Platform and how easy it is to start
using the TensorFlow Enterprise distribution.
AI Platform Notebook
This is the easiest and least complicated way to start using
TensorFlow Enterprise and get it running in Google Cloud:
1. Simply go to the Google Cloud
portal, select AI Platform in the left
panel, then select the Notebooks
option:
https://cloud.google.com/ai-platform/deep-learning-vm/docs/quickstart-cli
https://cloud.google.com/ai-platform/deep-learning-vm/docs/quickstart-marketplace
From there you will see all the instances you have created.
Stop them when you are done:
Figure 2.14 – Listing VM instances and managing their use
Deep Learning Container (DLC)
This is a relatively more complicated way of using
TensorFlow Enterprise. An important reason for using this
approach is for cases where data is not stored in Google Cloud,
and you wish to run TensorFlow Enterprise on-premises or on
your local machine. Another reason is that, for enterprise use,
you may want to use a DLC as a base Docker image to build
your own Docker image for a specific use or for distribution
among your team. This is the way to run TensorFlow
Enterprise outside of Google Cloud. Since it is a Docker
image, it requires Docker Engine to be installed and the
daemon to be running. A basic understanding of Docker is
extremely helpful. You will find a full list of currently
available DLCs at
https://console.cloud.google.com/gcr/images/deeplearning-platform-release.
localhost:<LOCAL_PORT>
Here's how we can accomplish this (for reference, see
https://cloud.google.com/ai-platform/deep-learning-containers/docs/getting-started-local):
3. <CONTAINER_REGISTRY> is where the Docker image can be found
on the internet, and for the Docker container of our interest,
it is in gcr.io/deeplearning-platform-release/tf2-cpu.2-1.
localhost:8080
Suggestions for selecting workspaces
All three methods discussed in the previous section lead you to
a JupyterLab that runs TensorFlow Enterprise. There are some
differences and consequences to consider for each method:
type(myderiveddata)
This confirms that the object is a pandas DataFrame:
Putting it together
The following is the complete code snippet for the quick
example we just worked with:
Summary
In this chapter, you have learned how to launch the JupyterLab
environment to run TensorFlow Enterprise. TensorFlow
Enterprise is available in three different forms: AI Platform
Notebook, DLVM, and a Docker container. The computing
resources used by these methods can be found in the Google
Cloud Compute Engine panel. These compute nodes do not
shut down on their own, therefore it is important to stop or
delete them once you are done using them.
Now that you have seen how to leverage data availability and
accessibility for TensorFlow Enterprise consumption, in the
next chapter, we are going to examine some common data
transformation, serialization, and storage techniques optimized
for TensorFlow Enterprise consumption and model training
pipelines.
Section 2 – Data Preprocessing and Modeling
In this part, you will learn how to preprocess and set up raw
data for efficient TensorFlow consumption, and you will also
learn how to build several different models using the
TensorFlow Enterprise API. We will also discuss how to build
custom models as well as leverage prebuilt models in
TensorFlow Hub.
Along the way, there will be some tips and utility functions
that are reusable in many situations. You will also understand
the rationale of the conversion process.
Converting tabular data to a TensorFlow dataset
Tabular or comma-separated values (CSV) data with fixed
schemas and data types is commonly encountered. We typically
load it into a pandas DataFrame. We saw in the previous chapter
how easily this can be done when the data is hosted in a
BigQuery table (the BigQuery magic command returns a query
result as a pandas DataFrame by default).
Let's take a look at how to handle data that can fit into
memory. In this example, we are going to read a public dataset
using the BigQuery magic command, so we can easily obtain
the data in a pandas DataFrame. Then we are going to convert
it to a TensorFlow dataset. A TensorFlow dataset is the data
structure for streaming training data in batches without using
up the compute node's runtime memory.
Converting a BigQuery table to a TensorFlow dataset
Each of the following steps is executed in a cell. Again, use
any of the AI platforms you prefer (AI Notebook, Deep
Learning VM, Deep Learning Container). An AI notebook is
the simplest and cheapest choice:
NOTE
Converting distributed CSV files to a TensorFlow dataset
If you are not sure about the data size, or are unsure
whether it can all fit in the Python runtime's memory, then
reading the data into a pandas DataFrame is not a viable
option. In this case, we may use a TensorFlow dataset to stream
the data directly from its files without loading it all into
memory.
<FILE_NAME>-<pattern>-001.csv
…
<FILE_NAME>-<pattern>-00n.csv
Alternatively, there is the following pattern:
<FILE_NAME>-<pattern>-aa.csv
…
<FILE_NAME>-<pattern>-zz.csv
There is always a pattern in the filenames. The TensorFlow
function tf.io.gfile.glob is a convenient API that matches such
filename patterns in a distributed filesystem. This is critical
for locating distributed files that are stored in a storage
bucket. In this section, we will use this API to locate our
structured data (multiple CSV files), which is distributed in a
storage bucket. Once the files are located, we will convert
them to a dataset (using tf.data.experimental.make_csv_dataset).
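Calling tf.io.gfile.glob requires TensorFlow and access to a bucket, but the wildcard matching itself can be previewed with Python's standard `fnmatch` module. The sketch below is a local stand-in, not the TensorFlow API, and the object names and the pattern are hypothetical.

```python
import fnmatch

# Hypothetical object names as they might appear in a bucket listing.
object_names = [
    'data-part-001.csv',
    'data-part-002.csv',
    'data-part-003.csv',
    'data-notes.txt',
    'other-part-001.csv',
]

pattern = 'data-part-*.csv'

# Keep only the names that match the wildcard pattern, in sorted order.
matched = sorted(name for name in object_names
                 if fnmatch.fnmatchcase(name, pattern))
print(matched)
```

Only the three numbered CSV parts are selected; the text file and the differently prefixed CSV are left out, which is the behavior we rely on when a bucket holds more than the training files.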
Preparing an example CSV
Since we need multiple CSV files of the same schema for this
demonstration, we may use open source CSV data such as the
Pima Indians Diabetes dataset (CSV) as our data source. This
CSV is hosted at
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv.
wget https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
Again, for demonstration purposes only, we need to split this
data into multiple smaller CSVs, and then upload these CSVs
to a Google Cloud Storage bucket.
['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
'Insulin', 'BMI', 'DiabetesPedigree', 'Age', 'Outcome']
Column names are not included in the CSV, so we can split the
file into multiple parts without handling a header row.
Let's start with the following steps:
1. Split the file into multiple parts.
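One possible way to perform the split is sketched below in plain Python. The function name `split_csv`, the number of parts, and the ten-row stand-in file are assumptions made for illustration; the part file names follow the pattern used later in this section.

```python
import os
import tempfile


def split_csv(source_path, output_dir, num_parts,
              prefix='pima_indian_diabetes_data_part'):
    """Split a headerless CSV into num_parts files of near-equal size."""
    with open(source_path) as src:
        rows = src.readlines()
    # Ceiling division so that every row lands in some part.
    rows_per_part = -(-len(rows) // num_parts)
    part_paths = []
    for part in range(num_parts):
        chunk = rows[part * rows_per_part:(part + 1) * rows_per_part]
        path = os.path.join(output_dir, f'{prefix}{part:02d}.csv')
        with open(path, 'w') as dst:
            dst.writelines(chunk)
        part_paths.append(path)
    return part_paths


# Demonstration with a small stand-in file (ten made-up rows, no header).
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, 'source.csv')
with open(source, 'w') as f:
    for i in range(10):
        f.write(f'{i},{i * 2},{i % 2}\n')

parts = split_csv(source, work_dir, num_parts=4)
print([os.path.basename(p) for p in parts])
```

Because the file has no header row, each part is a valid CSV on its own and no line needs to be copied into every part.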
Building filename patterns with TensorFlow I/O
Once the files are uploaded, let's now go to our AI Platform
notebook environment and execute the following lines of code:
import tensorflow as tf
distributed_files_pattern = 'gs://myworkdataset/pima_indian_diabetes_data_part*'
filenames = tf.io.gfile.glob(distributed_files_pattern)
tf.io.gfile.glob takes a file pattern string as the input and
creates a filenames list:
['gs://myworkdataset/pima_indian_diabetes_data_part00.csv',
'gs://myworkdataset/pima_indian_diabetes_data_part01.csv',
'gs://myworkdataset/pima_indian_diabetes_data_part02.csv',
'gs://myworkdataset/pima_indian_diabetes_data_part03.csv']
Now that we have a list of filenames that match the pattern, we
are ready to convert these files to a dataset.
COLUMN_NAMES = ['Pregnancies', 'Glucose', 'BloodPressure',
                'SkinThickness', 'Insulin', 'BMI',
                'DiabetesPedigree', 'Age', 'Outcome']
Here's the source for column names:
https://data.world/data-society/pima-indians-diabetes-database.
Then we need to specify that the first lines in these files are
not headers as we convert the CSVs to a dataset:
ds = tf.data.experimental.make_csv_dataset(
    filenames,
    header=False,
    column_names=COLUMN_NAMES,
    batch_size=5,  # Intentionally make it small for convenience.
    label_name='Outcome',
    num_epochs=1,
    ignore_errors=True)
In make_csv_dataset, we use a list of
filenames as the input and specify there is no header, and we
then assign COLUMN_NAMES, make small batches for
showing the result, select a column as the target column
('Outcome'), and set the number of epochs to 1 since
we are not going to train a model with it at this point.
'Outcome': [1 0 0 0 0]
'Features:'
'Pregnancies' : [ 7 12 1 0 2]
'Glucose' : [129 88 128 93 96]
'BloodPressure' : [ 68 74 82 100 68]
'SkinThickness' : [49 40 17 39 13]
'Insulin' : [125 54 183 72 49]
'BMI' : [38.5 35.3 27.5 43.4 21.1]
'DiabetesPedigree' : [0.439 0.378 0.115 1.021 0.647]
'Age' : [43 48 22 35 26]
During training, the data is passed to the training process in
batches, rather than as a single file that must be opened in
full and could consume a large amount of runtime memory. In the
preceding example, we see that, as a good practice, distributed
files stored in Cloud Storage follow a certain naming pattern.
The tf.io.gfile.glob API can easily find multiple files that
are distributed in a Cloud Storage bucket. We may then use
tf.data.experimental.make_csv_dataset to create a dataset
instance from the resulting list of filenames. Overall, the
tf.io and tf.data APIs together make it possible to build a
data input pipeline without explicitly reading all the data
into memory.
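To see what this kind of pipeline does conceptually, the following plain-Python stand-in reads headerless CSV parts, attaches column names, separates the label column, and yields column-oriented batches. It is not the TensorFlow implementation: `csv_batches`, `_to_columns`, the shortened column list, and the tiny part files are hypothetical, and unlike tf.data.experimental.make_csv_dataset the values stay as strings.

```python
import csv
import os
import tempfile

COLUMN_NAMES = ['Glucose', 'BMI', 'Outcome']  # shortened, illustrative list


def _to_columns(rows, column_names, label_name):
    """Turn a list of row dicts into column-oriented features plus labels."""
    features = {name: [row[name] for row in rows]
                for name in column_names if name != label_name}
    labels = [row[label_name] for row in rows]
    return features, labels


def csv_batches(filenames, column_names, label_name, batch_size):
    """Yield (features, labels) batches from headerless CSV files."""
    rows = []
    for filename in filenames:
        with open(filename, newline='') as f:
            for row in csv.reader(f):
                rows.append(dict(zip(column_names, row)))
                if len(rows) == batch_size:
                    yield _to_columns(rows, column_names, label_name)
                    rows = []
    if rows:  # final, possibly smaller, batch
        yield _to_columns(rows, column_names, label_name)


# Two tiny made-up part files stand in for the files in the bucket.
work_dir = tempfile.mkdtemp()
paths = []
for part, lines in enumerate([['129,38.5,1', '88,35.3,0'], ['128,27.5,0']]):
    path = os.path.join(work_dir, f'part{part:02d}.csv')
    with open(path, 'w') as f:
        f.write('\n'.join(lines) + '\n')
    paths.append(path)

batches = list(csv_batches(paths, COLUMN_NAMES, 'Outcome', batch_size=2))
print(batches[0])
```

Each batch pairs a dictionary of feature columns with a list of labels, which mirrors the shape of the output shown for the dataset above.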
Constructing a protobuf message
Now we have image_labels, which maps image files to their
labels. The next thing we need to do is to convert an image to
a tf.Example protobuf message. Protobuf is Google's
language-neutral mechanism for efficient serialization of
structured data. When using this standard to format the image
data, you effectively convert the raw image into a collection
of key-value pairs. Most of the keys hold the metadata of the
image, including the filename, width, height, channels, and
label, and one key holds the actual pixel values as a byte
array. The values are represented as tf.Tensor. Let's now
construct this protobuf message.
tf.train.BytesList can handle string and byte.
tf.train.FloatList can handle float (float32) and double
(float64).
tf.train.Int64List can handle bool, enum, int32, uint32, int64
and uint64.
Most other generic data types can be coerced into one of these
three types as per TensorFlow's documentation, which is
available at
https://www.tensorflow.org/tutorials/load_data/tfrecord#tftrainexample.
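The mapping from Python values to the three list types can be sketched without TensorFlow. The helper `feature_list_type` and the sample record below are hypothetical; they only report which tf.train list type a value would go into, following the three rules listed above.

```python
def feature_list_type(value):
    """Return the name of the tf.train list type that would hold this value."""
    if isinstance(value, (bytes, str)):
        # A str has to be encoded to bytes before it is stored.
        return 'BytesList'
    if isinstance(value, int):
        # bool is a subclass of int in Python, so it is covered here too.
        return 'Int64List'
    if isinstance(value, float):
        return 'FloatList'
    raise TypeError(f'No tf.train list type for {type(value).__name__}')


# A made-up image record: metadata keys plus the raw bytes of the image.
record = {
    'filename': 'badlands-1.jpg',
    'height': 600,
    'width': 800,
    'channels': 3,
    'label': 0,
    'image_raw': b'\xff\xd8',
}

for key, value in record.items():
    print(key, '->', feature_list_type(value))
```

Deciding the list type per key in this way is the step that each conversion helper has to perform before a value can be wrapped in a feature.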
Decoding TFRecord and reconstructing the image
In the previous section, we learned how to write a .jpg
image into a TFRecord dataset. Now we are going to
see how to read it back and display it. An important
requirement is that you must know the feature structure of the
TFRecord protobuf as indicated by its keys. The feature
structure is the same as the feature description used to build
the TFRecord in the previous section. In other words, in
the same way as a raw image was structured into a
tf.Example protobuf with a defined feature
description, we can use that feature description to parse or
reconstruct the image using the same knowledge stored in the
feature description:
In this section, you learned how to convert raw data (an image)
to TFRecord format, and verify that the conversion was
done correctly by reading the TFRecord back and
displaying it as an image. From this example, we can also see
that in order to decode and inspect TFRecord data, we
need the feature dictionary as was used during the encoding
process. It is important to bear this in mind when working with
TFRecord.
/home/<user_name>/Documents/<project_name>
Then, below this level, we would have the following:
/home/<user_name>/Documents/<project_name>/train
/home/<user_name>/Documents/<project_name>/train/<class_1_dir>
/home/<user_name>/Documents/<project_name>/train/<class_2_dir>
/home/<user_name>/Documents/<project_name>/train/<class_n_dir>
/home/<user_name>/Documents/<project_name>/validation
/home/<user_name>/Documents/<project_name>/validation/<class_1_dir>
/home/<user_name>/Documents/<project_name>/validation/<class_2_dir>
/home/<user_name>/Documents/<project_name>/validation/<class_n_dir>
/home/<user_name>/Documents/<project_name>/test
/home/<user_name>/Documents/<project_name>/test/<class_1_dir>
/home/<user_name>/Documents/<project_name>/test/<class_2_dir>
/home/<user_name>/Documents/<project_name>/test/<class_n_dir>
Another way of demonstrating the organization of images by
classes is as follows:
-base_dir
  -train_dir
    -class_1_dir
    -class_2_dir
    -class_n_dir
  -validation_dir
    -class_1_dir
    -class_2_dir
    -class_n_dir
  -test
    -class_1_dir
    -class_2_dir
    -class_n_dir
Images are placed in a directory based on their class. In this
section, the example is simplified to the following structure in
Cloud Storage:
-bucket
  -badlands (Badlands national park)
  -kistefos (Kistefos Museum)
  -maldives (Maldives beaches)
You may find example jpg images at
https://github.com/PacktPublishing/learn-tensorflow-enterprise/tree/master/chapter_03/from_gs
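With images grouped by class directory, the label of each image can be read from its path. The sketch below shows one way to do this with the standard library; the bucket name `my-bucket`, the file names, and the helper `label_from_path` are placeholders, and the exact structure of the mapping used in the book's code may differ.

```python
import posixpath


def label_from_path(path):
    """Use the name of the parent directory as the class label."""
    return posixpath.basename(posixpath.dirname(path))


# Hypothetical object paths; 'my-bucket' and the file names are placeholders.
image_paths = [
    'gs://my-bucket/badlands/badlands-1.jpg',
    'gs://my-bucket/kistefos/kistefos-1.jpg',
    'gs://my-bucket/maldives/maldives-1.jpg',
]

# Map every image path to the label implied by its directory.
image_labels = {path: label_from_path(path) for path in image_paths}
class_names = sorted(set(image_labels.values()))
print(class_names)
```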
2. tf.train.FloatList: float (float32, float64)
3. tf.train.Int64List: bool, enum, int32, uint32, int64, uint64
If we need to convert numbers with floating points into a
feature of the tf.train.FloatList type, then the following
function does the job:
def _float_feature(value):
    float_list_msg = tf.train.FloatList(value=[value])
    coerced_list = tf.train.Feature(float_list=float_list_msg)
    return coerced_list
A NOTE OF CAUTION
tf.train.Feature accepts one feature at a time. Each of these
functions converts and coerces one data feature at a time.
tf.train.Feature is different from tf.train.Features, which
accepts a dictionary of multiple features. In the next step, we
are going to use tf.train.Features.
8. Consolidate the workflow of
creating the tf.Example protobuf
message into a wrapper function.
This function takes two inputs: a
byte string that represents the
image, and the corresponding label
of that image.
You should see all the images contained in this protobuf
message. For brevity, we show only two images. Notice that
Figures 3.14 and 3.15 have different dimensions, which are
preserved and retrieved correctly by the protobuf.
You have seen that, whether it's one image or multiple images,
everything can be written to a single TFRecord. Neither
approach is inherently right or wrong, as factors such as
memory and I/O bandwidth all come into play. A rule of thumb is
to distribute your training images across at least 32 to 128
shards (each shard is a TFRecord file) to maintain file-level
parallelism in the I/O process whenever you have sufficient
images to do so.
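A simple way to plan such shards is to assign files round-robin to a fixed number of output files. The sketch below only computes the assignment and writes nothing; the function `assign_shards`, the shard naming scheme, and the image names are assumptions made for illustration.

```python
def assign_shards(filenames, num_shards, prefix='train'):
    """Spread filenames round-robin over num_shards output files."""
    shards = {}
    for index, filename in enumerate(filenames):
        shard_id = index % num_shards
        shard_name = f'{prefix}-{shard_id:05d}-of-{num_shards:05d}.tfrecord'
        shards.setdefault(shard_name, []).append(filename)
    return shards


# 1,000 made-up image names spread over 32 shards.
images = [f'img_{i:04d}.jpg' for i in range(1000)]
shards = assign_shards(images, num_shards=32)
sizes = sorted(len(files) for files in shards.values())
print(len(shards), sizes[0], sizes[-1])
```

Round-robin assignment keeps the shards within one image of each other in size, so no single file becomes a bottleneck when shards are read in parallel.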
Summary
This chapter provided explanations and examples for dealing
with commonly seen structured and unstructured data. We first
looked at how to read and format a pandas DataFrame or CSV
type of data structure and converted it to a dataset for efficient
data ingestion pipelines. Then, as regards unstructured data,
we used image files as examples. While dealing with image
data, we have to organize these image files in a hierarchical
pattern, such that labels can be easily mapped to each image
file. TFRecord is the preferred format for handling
image data, as it wraps the image dimension, label, and image
raw bytes together in a format known as tf.Example.
Creating a generator to feed image data at scale
A convenient way to ingest data into the model is with a
generator. A Python generator is an iterator; here, it goes
through the data directory and passes batches of data to the
model. When a generator is used to cycle through our training
data, we do not have to load the entire image collection at
once and worry about memory constraints in our compute node.
Rather, we send one batch of images at a time. Therefore, a
Python generator uses the compute node's memory more
efficiently than passing all the data as one huge NumPy array.
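The memory benefit comes from lazy evaluation: nothing is loaded until the consumer asks for the next batch. The following plain-Python sketch demonstrates this with a stand-in loader that merely records which files were read; `image_batches`, `fake_load`, and the file names are hypothetical and no real images are decoded.

```python
def image_batches(paths, batch_size, load):
    """Yield batches of loaded images; only one batch is built at a time."""
    for start in range(0, len(paths), batch_size):
        yield [load(path) for path in paths[start:start + batch_size]]


loaded = []  # records which files have been read so far


def fake_load(path):
    """Stand-in for decoding an image file; it only records the call."""
    loaded.append(path)
    return f'pixels-of-{path}'


paths = [f'img_{i}.jpg' for i in range(10)]  # made-up file names
batches = image_batches(paths, batch_size=4, load=fake_load)

first = next(batches)
# At this point only the files of the first batch have been read.
print(len(first), len(loaded))
```

Requesting further batches reads further files, so peak memory depends on the batch size rather than on the size of the whole collection.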
Reusing pretrained ResNet feature vectors
Now we are ready to construct the model. We will use the
tf.keras.Sequential API. The model consists of three layers:
an input layer, the ResNet feature vector layer, and a dense
layer as the classification output. We also have the choice
between fine-tuning and retraining the ResNet (this requires
longer training time). The code for defining the model
architecture is as follows:
my_optimizer = tf.keras.optimizers.SGD(learning_rate=0.005, momentum=0.9)
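We set from_logits = True, which tells the loss function to
treat the model outputs as raw scores (logits) and to apply
softmax itself when computing the loss; this setting is
appropriate only when the final dense layer does not already
apply a softmax activation. We also would like the model not to
become overconfident, so we set label_smoothing = 0.1 as a
regularization to penalize extremely high probability. We may
define a loss function as follows: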
my_loss_function = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True,
    label_smoothing=0.1)
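To illustrate what label smoothing does to the loss, here is a small plain-Python sketch. It assumes the common formulation in which each one-hot target y becomes y * (1 - s) + s / K for smoothing factor s and K classes; the function names and the example logits are made up, and the code is not taken from the Keras implementation.

```python
import math


def smooth_labels(one_hot, label_smoothing):
    """Move a little probability mass from the true class to all classes."""
    num_classes = len(one_hot)
    return [y * (1.0 - label_smoothing) + label_smoothing / num_classes
            for y in one_hot]


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(one_hot, logits, label_smoothing=0.0):
    """Categorical cross-entropy computed on logits with optional smoothing."""
    targets = smooth_labels(one_hot, label_smoothing)
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))


one_hot = [0.0, 1.0, 0.0]
confident_logits = [0.0, 10.0, 0.0]  # made-up scores, very sure of class 1

plain_loss = cross_entropy_from_logits(one_hot, confident_logits)
smoothed_loss = cross_entropy_from_logits(one_hot, confident_logits,
                                          label_smoothing=0.1)
print(smooth_labels(one_hot, 0.1))
print(plain_loss, smoothed_loss)
```

For a very confident prediction the smoothed loss is larger than the plain loss, which is how smoothing discourages extreme probabilities.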
We need to configure the model for training. This is
accomplished by defining the loss function and optimizer
as part of the model's training process, as the training process
needs to know what the loss function is to optimize for,
and what optimizer to use. To compile the model with the
optimizer and loss function specified, execute the
following code:
mdl.compile(
optimizer=my_optimizer,
loss=my_loss_function,
metrics=['accuracy'])
The outcome is a model architecture that is ready to be used
for training.
steps_per_epoch = train_generator.samples // train_generator.batch_size
validation_steps = valid_generator.samples // valid_generator.batch_size
hist = mdl.fit(
    train_generator,
    epochs=5,
    steps_per_epoch=steps_per_epoch,
    validation_data=valid_generator,
    validation_steps=validation_steps).history
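The floor division used for steps_per_epoch counts only full batches. The short sketch below shows the arithmetic with hypothetical numbers; the sample count and batch size are not taken from the dataset used in this chapter.

```python
def steps_for(num_samples, batch_size):
    """Number of full batches in one pass over the data."""
    return num_samples // batch_size


# Hypothetical sizes chosen for illustration only.
num_samples, batch_size = 1000, 32
steps = steps_for(num_samples, batch_size)
leftover = num_samples - steps * batch_size
print(steps, leftover)
```

Any samples beyond the last full batch are not covered by the computed number of steps in that epoch.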
And the training result should be similar to this:
Epoch 1/5
91/91 [==============================] - 404s 4s/step - loss: 1.4899 - accuracy: 0.7348 - val_loss: 1.3749 - val_accuracy: 0.8565
Epoch 2/5
91/91 [==============================] - 404s 4s/step - loss: 1.3083 - accuracy: 0.9309 - val_loss: 1.3359 - val_accuracy: 0.8963
Epoch 3/5
91/91 [==============================] - 405s 4s/step - loss: 1.2723 - accuracy: 0.9704 - val_loss: 1.3282 - val_accuracy: 0.9077
Epoch 4/5
91/91 [==============================] - 1259s 14s/step - loss: 1.2554 - accuracy: 0.9869 - val_loss: 1.3302 - val_accuracy: 0.9020
Epoch 5/5
91/91 [==============================] - 403s 4s/step - loss: 1.2487 - accuracy: 0.9935 - val_loss: 1.3307 - val_accuracy: 0.8963
At each epoch, the loss function value and accuracy on the
training data are reported. Since we provided validation data,
the model is also evaluated on the validation dataset at the
end of each training epoch. The loss and accuracy measurements
are reported at each epoch by the fit API. This is the standard
output for each training run.
GPUs are well suited for deep learning model training because
they can process multiple computations in parallel. A GPU
achieves parallel processing through a large number of cores.
This translates to large memory bandwidth and faster gradient
computation of all trainable parameters in the deep learning
architecture than would otherwise be the case on a CPU.
Leveraging the TensorFlow Keras API
Keras is a deep learning API that wraps around machine learning libraries such as TensorFlow, Theano, and Microsoft Cognitive Toolkit (also known as CNTK). Its popularity as a standalone API stems from the succinct style of the model construction process. In 2018, TensorFlow adopted Keras as its high-level API going forward, where it is known as tf.keras. With the TensorFlow 2.0 distribution released in 2019, tf.keras became the official high-level API.
Input to a model
So far, we have taken care of specifying features and the target
in the training dataset. Now, we need to specify each feature as
either categorical or numeric. This requires us to set up
TensorFlow's feature_columns object. The
feature_columns object is the input to the
model:
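To make the categorical-versus-numeric distinction concrete, here is a minimal, framework-free sketch of what such a specification does to a single record. It is not the TensorFlow feature column API: the helper names (one_hot, encode_record) and the vocabulary lists are assumptions chosen for illustration, with feature names borrowed from the test sample used later in this section.

```python
# Illustrative only: a plain-Python stand-in for a feature specification.
# The vocabularies below are assumed, not read from the BigQuery table.
CATEGORICAL = {
    'trip_direction': ['Mexico to US', 'US to Canada', 'Canada to US', 'US to Mexico'],
    'day_type': ['Weekdays', 'Weekends'],
}
NUMERIC = ['day_of_week', 'avg_crossing_duration']


def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of value and 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def encode_record(record):
    """Concatenate one-hot categorical features and raw numeric features."""
    vector = []
    for name, vocabulary in CATEGORICAL.items():
        vector.extend(one_hot(record[name], vocabulary))
    for name in NUMERIC:
        vector.append(float(record[name]))
    return vector


sample = {
    'trip_direction': 'US to Canada',
    'day_type': 'Weekends',
    'day_of_week': 7,
    'avg_crossing_duration': 10.4,
}
encoded = encode_record(sample)
print(encoded)
```

Each categorical feature expands into as many positions as its vocabulary has entries, while each numeric feature occupies a single position. This is why the model needs to be told the type of every feature before it can build its input layer.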
Model training
Before the model can be trained, we need to compile it. Since this is a regression model, we may specify mean squared error (MSE) as our loss function, and for training metrics, we will track MSE as well as mean absolute error (MAE):
1. Compile the model with the proper loss function and metrics used in the regression task:
model.compile(
    loss='mse',
    metrics=['mae', 'mse'])
2. Train the model:
model.fit(training_dataset3, epochs=5)
3. Once the model is trained, we may create a sample test dataset with two observations. The test data has to be in a dictionary format:
test_samples = {
    'trip_direction' : np.array(['Mexico to US', 'US to Canada']),
    'day_type' : np.array(['Weekdays', 'Weekends']),
    'day_of_week' : np.array([4, 7]),
    'avg_crossing_duration' : np.array([32.8, 10.4]),
    'percent_of_normal_volume' : np.array([102, 89]),
    'percent_of_normal_volume_truck' : np.array([106, 84])
}
4. To score this test sample, execute the following code:
model.predict(test_samples)
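The two regression measures named in the compile step can be worked out by hand. The following sketch uses plain Python rather than Keras, and the target and prediction values are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions, for illustration only.
y_true = [30.0, 12.0, 45.0]
y_pred = [32.0, 10.0, 41.0]

mse = mean_squared_error(y_true, y_pred)   # (4 + 4 + 16) / 3
mae = mean_absolute_error(y_true, y_pred)  # (2 + 2 + 4) / 3
print(mse, mae)
```

Because MSE squares each error, the single error of 4 contributes two-thirds of the MSE total, whereas it contributes only half of the MAE total. Tracking both gives a sense of whether a few large errors dominate the loss.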
Working with TensorFlow Estimators
TensorFlow Estimators are also reusable components. Estimators are higher-level APIs that enable users to build, train, and deploy machine learning models. They include several pre-made models that save users the hassle of creating computational graphs or sessions, which makes it easier to try different model architectures quickly with limited code changes. Unlike tf.keras, Estimators are not specifically dedicated to deep learning models. Therefore, you will not find many pre-made deep learning models available. If you need to work with deep learning frameworks, then the tf.keras API is the right choice to get started.
DATASET_GCP_PROJECT_ID = 'bigquery-public-data'
DATASET_ID = 'covid19_geotab_mobility_impact'
TABLE_ID = 'us_border_volumes'
This is the same BigQuery table that we used for the tf.keras section. See Figure 4.4 for some randomly extracted rows of this table.
linear_est = tf.estimator.LinearRegressor(feature_columns=feature_columns, model_dir=MODEL_DIR)
linear_est.train(input_fn)
From the preceding code, the following is observed:
Summary
In this chapter, you have seen how the three major sources of
reusable model elements can integrate with the scalable data
pipeline. Through TensorFlow datasets and TensorFlow I/O
APIs, training data is streamed into the model training process.
This enables models to be trained without having to load the entire dataset into the compute node's memory.
TensorFlow Hub sits at the highest level of model reusability.
There, you will find many open source models already built
for consumption via a technique known as transfer learning. In
this chapter, we built a regression model using the
tf.keras API. Building a model this way (custom) is
actually not a straightforward task. Often, you will spend a lot
of time experimenting with different model parameters and
architectures. If your need can be addressed by means of pre-built open source models, then TensorFlow Hub is the place to look. For these pre-built models, you still need to investigate the data structure required for the input layer and provide a final output layer for your purpose. Even so, reusing the pre-built models in TensorFlow Hub will save time in building and debugging your own model architecture.
In this part, you will learn how to set up GPUs and TPUs in a GCP environment for submitting a model training job in GCP. You will also learn about the latest hyperparameter tuning API and run it at scale using GCP resources.
Training at Scale
When we build and train more complex models or use large amounts of data in an ingestion pipeline, we naturally want to make more efficient use of all the compute time and memory resources at our disposal. This is the main purpose of this chapter: we are going to integrate what we learned in previous chapters with techniques for distributed training running in a cluster of compute nodes.
This cost does not include the cost of cloud storage, where you
will read and write data or model artifacts. Remember to
delete cloud storage when you are not using it. FYI, the cloud
storage cost for the content and work related to this book is a
very small fraction of the overall cost.
TIP
git clone https://github.com/PacktPublishing/learn-tensorflow-enterprise.git
As we have seen in previous chapters, Google's AI Platform
offers a convenient development environment known as
JupyterLab. It integrates with other Google Cloud services,
such as BigQuery or cloud storage buckets, through SDKs. In
this section, we are going to leverage Google Cloud's TPU for
a distributed training workload.
gcloud --help
The preceding command will return the following output:
Figure 5.1 – gcloud SDK verification
From here, the setup instructions are from Google Cloud's own documentation site at this URL: https://cloud.google.com/ai-platform/training/docs/using-tpus#tpu-runtime-versions:
The project we use must also know about the TPU service
account. In step 3 of the previous section, we passed our
project's Bearer Token to our TPU service account so the TPU
can access our project. Basically, it is similar to adding another
member to this project, and in this case, the new member is the
TPU service account:
Cloud command arguments
The example command (discussed later in this section) assumes a directory structure as shown in Figure 5.11:
Figure 5.11 – Directory structure and file organization in a
local client for an example training run
module-name=python.ScriptProject.traincloudtpu_resnet_cache
We then specify the TensorFlow Enterprise version to be 2.1
and the Python interpreter version to be 3.7, and a scale tier of
BASIC_TPU should suffice for this example. We also
set the region to be us-central1. The
BASIC_TPU scale tier provides us with a master VM
and a TPU VM with eight TPU V2 cores.
--distribution_strategy=tpu \
--model_dir=gs://ai-tpu-experiment/traincloudtpu_tfkd_resnet_cache \
--train_epochs=10 \
--data_dir=gs://ai-tpu-experiment/tfrecord-flowers
We specify distribution_strategy=tpu as a user-defined flag because we may use this value in conditional logic to select the proper distribution strategy. We also specify model_dir, a cloud storage path to which we grant the TPU service write permission so that it can serialize checkpoints and model assets. For the remaining flags, we specify the number of training epochs in train_epochs and the path to the training data in data_dir, which is also a cloud storage path, one to which we grant the TPU service read permission. The TPU's distributed training strategy (https://www.tensorflow.org/guide/distributed_training#tpustrategy) implements all necessary operations across multiple cores.
This decode_and_resize function decodes each serialized record into a JPEG image with the corresponding color value range, parses the label, one-hot encodes the label, and resizes the image using the nearest neighbor method in order to standardize it to 224 by 224 pixels for our model of choice (ResNet). This function also provides different ways to return the label, whether it is as a one-hot vector, plain text, or an integer. If you wish, you can return labels in different notations and styles by simply adding the notations of your interest to the return tuple:
return resized_image, label_one_hot, label_txt, label
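The nearest neighbor method mentioned above can be illustrated on a tiny grid. The following is a plain-Python sketch, not the TensorFlow implementation: the helper name resize_nearest is hypothetical, and the exact pixel-center convention used by tf.image.resize may differ from the simple index scaling shown here.

```python
def resize_nearest(image, new_height, new_width):
    """Resize a 2-D grid of pixel values by nearest neighbor sampling.

    Each output pixel copies the value of the source pixel whose scaled
    index it maps to; no new pixel values are invented.
    """
    old_height, old_width = len(image), len(image[0])
    resized = []
    for row in range(new_height):
        src_row = min(int(row * old_height / new_height), old_height - 1)
        resized.append([
            image[src_row][min(int(col * old_width / new_width), old_width - 1)]
            for col in range(new_width)
        ])
    return resized


tiny = [[1, 2],
        [3, 4]]
enlarged = resize_nearest(tiny, 4, 4)
for line in enlarged:
    print(line)
```

Since every output value is copied from an existing pixel, the resized image keeps the original value range, which is why this method suits a pipeline that expects pixel values to stay within [0, 255].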
Job name: traincloudtpu_tfk_resnet50
Staging bucket is gs://ai-tpu-experiment
Bucket to save the model is gs://ai-tpu-experiment/traincloudtpu_tfk_resnet50
Training data is in gs://tfrecord-dataset/flowers
As soon as we submit the preceding command, it will be in the queue for execution in your Cloud AI Platform instance. To find out where we can monitor the training process, we can run gcloud ai-platform jobs describe traincloudtpu_tfk_resnet50 to retrieve a URL to the running log:
This is a lengthy log that will run until the training job is
complete. Toward the end of the training run, the log will look
like this:
Figure 5.14 – Google Cloud AI Platform TPU training log
example excerpt 2
m = tf.keras.Sequential([
    hub.KerasLayer('https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/4',
                   trainable=False),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])
m.build([None, 224, 224, 3])  # Batch input shape.
As shown in the preceding lines of code, the URL to a pre-
trained model is passed into KerasLayer. However,
currently, the TPU running in Cloud AI Platform has no direct
access to TensorFlow Hub's URL. To download the model,
follow the simple instructions from TensorFlow Hub's site, as
shown in Figure 5.17:
Figure 5.17 – Downloading a pre-trained model from
TensorFlow Hub
os.environ['TFHUB_CACHE_DIR'] = 'gs://ai-tpu-experiment/model-cache-dir/imagenet_resnet_v2_50_feature_vector_4'
This line can be inserted before the model definition in the
run function. In the model definition, we will specify the
model architecture by using hub.KerasLayer, as
usual:
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=IMAGE_SIZE + (3,)),
        hub.KerasLayer('https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/4',
                       trainable=flags_obj.fine_tuning_choice),
        tf.keras.layers.Dense(5, activation='softmax', name = 'custom_class')
    ])
Because we have the TFHUB_CACHE_DIR environment variable already defined with our cloud storage name and path, when the TPU executes the hub.KerasLayer part of the model architecture code, the TPU runtime will look for the model in TFHUB_CACHE_DIR first instead of attempting to go through a RESTful API call to retrieve the model. After these small modifications are made to the training script, we can rename it trainer_hub.py. The training work can be launched with a similar invocation style:
For MirroredStrategy, we will set scale-tier to BASIC_GPU. This will give us a single worker instance with one NVIDIA Tesla K80 GPU. The command to invoke training with trainer_hub_gpu.py is as follows:
vs_code % gcloud ai-platform jobs submit training traincloudgpu_tfhub_resnet_gpu_1 \
--staging-bucket=gs://ai-tpu-experiment \
--package-path=python \
--module-name=python.ScriptProject.trainer_hub \
--runtime-version=2.2 \
--python-version=3.7 \
--scale-tier=BASIC_GPU \
--region=us-central1 \
-- \
--distribution_strategy=gpu \
--model_dir=gs://ai-tpu-experiment/traincloudgpu_tfhub_resnet_gpu_1 \
--train_epochs=10 \
--data_dir=gs://ai-tpu-experiment/tfrecord-flowers
Job [traincloudtpu_tfhub_resnet_gpu_1] submitted successfully.
Your job is still active. You may view the status of your job
with the command
Summary
From all the examples that we have covered in this chapter, we
learned how to leverage a distributed training strategy with the
TPU and GPU through AI Platform, which runs on
TensorFlow Enterprise 2.2 distributions. AI Platform is a
service that wraps around TPU or GPU accelerator hardware
and manages the configuration and setup for your training job.
Hyperparameter Tuning
In this chapter, we are going to start by looking at three different hyperparameter tuning algorithms: Hyperband, Bayesian optimization, and random search. These algorithms are implemented in Keras Tuner, a library that works with the tf.keras API and gives us simplified access to these complex and advanced algorithms. We will learn how to apply these algorithms and use the best hyperparameters we can find to build and train an image classification model. We will also look at the details of the model's learning process in order to know which hyperparameters to search and optimize. We will start by getting and preparing the data, and then we'll apply each algorithm to it. Along the way, we will also try to understand the key principles and the logic needed to expose the choice of algorithm as user input, and we'll look at a template for submitting tuning and training jobs to GCP Cloud TPU.
Technical requirements
The entire code base for this chapter is in the following
GitHub repository. Please clone it to your environment:
https://github.com/PacktPublishing/learn-tensorflow-enterprise/blob/master/chapter_06/
git clone https://github.com/PacktPublishing/learn-tensorflow-enterprise.git
Delineating hyperparameter types
As we develop a model and its training process, we define
variables and set their values to determine the training
workflow and the model's structure. These values (such as the
number of hidden nodes in a layer of a multilayer perceptron,
or the selection of an optimizer and a loss function) are known
as hyperparameters. These parameters are specified by the
model creator. The performance of a machine learning model
often depends on the model architecture and the
hyperparameters selected during its training process. Finding a set of optimal hyperparameters for the model is not a trivial task. The simplest approach is grid search, that is, building all possible combinations of hyperparameter values within a search space and then comparing the evaluation metrics across these combinations. While this is straightforward and thorough, it is tedious, because the number of combinations grows multiplicatively with every hyperparameter added. We will see how the Keras Tuner API implements three different search algorithms.
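To see how quickly a grid grows, the following sketch enumerates every combination for a small search space using only the Python standard library. The hyperparameter names and values are assumptions that mirror the examples used in this chapter; they are not tied to any particular tuner API.

```python
import itertools

# Assumed search space, mirroring the hyperparameters discussed in this chapter.
search_space = {
    'units': list(range(64, 256 + 1, 64)),            # 64, 128, 192, 256
    'dense_activation': ['relu', 'tanh', 'sigmoid'],
    'selected_optimizer': ['sgd', 'adam'],
}

names = list(search_space)
# itertools.product yields one tuple per combination of values.
grid = [dict(zip(names, combo))
        for combo in itertools.product(*search_space.values())]

print(len(grid))   # 4 * 3 * 2 combinations
print(grid[0])
```

With only three hyperparameters, the grid already holds 24 configurations, each of which would require its own training run. Adding one more hyperparameter with five candidate values would raise this to 120 runs.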
Algorithm hyperparameters: These parameters are required to execute the learning algorithm, such as the learning rate used during gradient descent, or the choice of loss function.
NOTE
Hyperband
Bayesian optimization
Random search
Understanding the syntax and use of Keras Tuner
For the most part, as far as Keras Tuner is concerned,
hyperparameters can be described by the following three data
types: integers, floating points, and choices from a list of
discrete values or objects. In the following sub-sections, we
will take a closer look at how to use these data types to define
hyperparameters in different parts of the model architecture
and training workflow.
tf.keras.layers.Dense(units = hp_units, activation = 'relu')
In the preceding line of code, hp_units is the number of nodes in this layer. If you wish to subject hp_units to hyperparameter search, then you simply need to define this hyperparameter's search space. Here's an example:
hp = kt.HyperParameters()
hp_units = hp.Int('units', min_value = 64, max_value = 256, step = 16)
hp is an instance of the kerastuner HyperParameters class, which holds the search space definitions.
hp_units = hp.Choice('units', values = [64, 80, 90])
hp.Choice is a flexible type for hyperparameters. It can also be used to define algorithmic hyperparameters such as activation functions. All it needs is the names of the possible activation functions. A search space for different activation functions may look like this:
hp_activation = hp.Choice('dense_activation', values=['relu', 'tanh', 'sigmoid'])
Then the definition for the layer that uses this hyperparameter would be:
tf.keras.layers.Dense(units = hp_units, activation = hp_activation)
Another place where hp.Choice may be applied is when you want to try different optimizers:
hp_optimizer = hp.Choice('selected_optimizer', ['sgd', 'adam'])
Then, in the model compilation step, where an optimizer is specified in the training workflow, you would simply define optimizer as hp_optimizer:
model.compile(optimizer = hp_optimizer, loss = …, metrics = …)
In the preceding example, we pass hp_optimizer into the model compilation step as our selection for the optimizer to be used in the training process.
def model_builder(hp):
    # Search spaces for the hidden layer width and its activation function.
    hp_units = hp.Int('units', min_value = 64, max_value = 256, step = 64)
    hp_activation = hp.Choice('dense_activation', values=['relu', 'tanh', 'sigmoid'])
    IMAGE_SIZE = (224, 224)
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=IMAGE_SIZE + (3,)),
        hub.KerasLayer('https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/4',
                       trainable=False),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(units = hp_units,
                              activation = hp_activation,
                              kernel_initializer='glorot_uniform'),
        tf.keras.layers.Dense(5, activation='softmax', name = 'custom_class')
    ])
    model.build([None, 224, 224, 3])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.5),
        # The output layer already applies softmax, so the loss receives
        # probabilities rather than logits.
        loss=tf.keras.losses.CategoricalCrossentropy(
            from_logits=False, label_smoothing=0.1),
        metrics=['accuracy'])
    return model
With the Keras Tuner API, the search space format and the
way in which the search space is referenced inside the model
layer or training algorithm are straightforward and provide
great flexibility. All that was done was defining a search space,
then passing the object holding the search space into the model
definition. It would be a daunting task to handle the
conditional logic following the grid search approach.
Hyperband
Bayesian optimization
Random search
Delineating hyperparameter search algorithms
In this section, we will take a closer look at three algorithms that traverse the hyperparameter search space. These algorithms are implemented by the Keras Tuner API.
Hyperband
Hyperparameter search is an inherently tedious process that requires a budget B to test a finite set of n possible hyperparameter configurations. In this context, budget simply means compute time, as indicated by the number of epochs and the training data subsets. The Hyperband algorithm takes advantage of early stopping and successive halving so that it can evaluate more hyperparameter configurations in a given time and with a given set of hardware resources. Successive halving trains every surviving configuration on a small budget, keeps only the best-performing fraction, and repeats with a larger budget. Early stopping helps eliminate underperforming configurations before too much training time is invested in them.
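The successive halving idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Keras Tuner implementation: Hyperband runs several such brackets with different trade-offs between the number of configurations and the budget per configuration, whereas this sketch shows a single bracket. The function names and the toy scoring function are hypothetical.

```python
def successive_halving(configs, evaluate, min_epochs=1, factor=2):
    """Keep the best 1/factor of the configurations at each round.

    evaluate(config, epochs) must return a score where higher is better.
    Returns the surviving configuration and a list of
    (epochs_used, survivors_kept) tuples, one per round.
    """
    survivors = list(configs)
    epochs = min_epochs
    history = []
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, epochs), reverse=True)
        keep = max(1, len(ranked) // factor)
        survivors = ranked[:keep]
        history.append((epochs, len(survivors)))
        epochs *= factor  # survivors earn a larger training budget
    return survivors[0], history


def toy_evaluate(config, epochs):
    """Toy objective (an assumption): best near 192 units, better with more epochs."""
    return -abs(config['units'] - 192) + 0.1 * epochs


candidates = [{'units': u} for u in range(64, 257, 16)]  # 13 configurations
best, rounds = successive_halving(candidates, toy_evaluate)
print(best, rounds)
```

In this toy run, the 13 candidates shrink to 6, then 3, then 1, while the epoch budget per candidate doubles each round. Most of the compute is therefore spent on the few configurations that looked promising early on.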
import kerastuner as kt
import tensorflow_hub as hub
import tensorflow as tf
from absl import flags
flags_obj = flags.FLAGS
strategy = tf.distribute.MirroredStrategy()
tuner = kt.Hyperband(
    hypermodel = model_builder,
    objective = 'val_accuracy',
    max_epochs = 3,
    factor = 2,
    distribution_strategy=strategy,
    directory = flags_obj.model_dir,
    project_name = 'hp_tune_hb',
    overwrite = True)
Here is a description of the parameters shown:
distribution_strategy:
This is used if hardware is available
for distributed training.
hp_units = hp.Int('units', min_value = 64, max_value = 256, step = 64)
hp_activation = hp.Choice('dense_activation', values=['relu', 'tanh', 'sigmoid'])
Inside the model's sequential API definition, you will find these hyperparameters in one of the Dense layers:
tf.keras.layers.Dense(units = hp_units, activation = hp_activation, kernel_initializer='glorot_uniform'),
Before exiting this function, you would compile the model and return the model to the tuner instance. Now let's begin the Hyperband hyperparameter search:
By default, num_trials = 1, which means that only the best set of hyperparameters is returned. Since the return value is a list object, we retrieve it by the first index of the list, which is 0. The print statement shows how an item in best_hps may be referenced.
3. It is recommended that once you have best_hps, you should retrain your model with these parameters. We will start with the tuner object initialized with best_hps:
model = tuner.hypermodel.build(best_hps)
4. Then we may define checkpoints and callbacks for the formal training:
checkpoint_prefix = os.path.join(flags_obj.model_dir, 'best_hp_train_ckpt_{epoch}')
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_prefix,
        save_weights_only=True)
]
5. Now let's call the fit function to start training with the best hyperparameter configuration:
model.fit(
    train_ds,
    epochs=30,
    steps_per_epoch=STEPS_PER_EPOCHS,
    validation_data=val_ds,
    validation_steps=VALIDATION_STEPS,
    callbacks=callbacks)
6. Once training is completed, save the trained model:
model_save_dir = os.path.join(flags_obj.model_dir, 'best_save_model')
model.save(model_save_dir)
Bayesian optimization
This method leverages what is learned from the initial training samples and nudges hyperparameter values toward the favorable direction of the search space. What is learned from the initial training samples is a probabilistic function that models the value of our objective function. This probabilistic function, also known as a surrogate function, models the distribution of our objective (for example, validation loss) as a Gaussian process. With a surrogate function ready, the next hyperparameter configuration candidate is selected such that it is most likely to improve the objective as predicted by the surrogate (that is, to lower it, if the objective is validation loss).
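One common way to turn a surrogate's prediction into a choice of next candidate is the expected improvement rule, sketched below in plain Python. This is an illustration only: the acquisition rule that Keras Tuner uses internally may differ, and the surrogate means and standard deviations here are hypothetical numbers rather than the output of a fitted Gaussian process.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_so_far):
    """Expected improvement for an objective we want to minimize."""
    if std <= 0.0:
        return max(best_so_far - mean, 0.0)
    z = (best_so_far - mean) / std
    return (best_so_far - mean) * normal_cdf(z) + std * normal_pdf(z)


# Hypothetical surrogate predictions (mean loss, uncertainty) per candidate.
candidates = {
    'A': (1.30, 0.01),  # slightly better than the best so far, very certain
    'B': (1.35, 0.10),  # slightly worse on average, but uncertain
    'C': (1.50, 0.02),  # clearly worse and certain
}
best_loss = 1.32
scores = {name: expected_improvement(mean, std, best_loss)
          for name, (mean, std) in candidates.items()}
next_candidate = max(scores, key=scores.get)
print(next_candidate)
```

Candidate B wins even though its predicted mean is worse than A's, because its larger uncertainty leaves room for a substantial improvement. This balance between exploiting good predictions and exploring uncertain regions is what distinguishes Bayesian optimization from a purely random search.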
The tuner instance invokes this algorithm in a
straightforward fashion. Here is an example:
tuner = kt.BayesianOptimization(
    hypermodel = model_builder,
    objective ='val_accuracy',
    max_trials = 50,
    directory = flags_obj.model_dir,
    project_name = 'hp_tune_bo',
    overwrite = True
    )
This code defines a tuner object that we set up to use the Bayesian optimization algorithm as the means of hyperparameter optimization. Similar to Hyperband, it requires a function definition for hypermodel. In this case, model_builder from Hyperband is used again. The criterion for optimization is validation accuracy. The maximum number of trials is set to 50, and we will specify the directory in which to save the model as user input during job submission. The user input for model_dir is carried by flags_obj.model_dir.
tuner.search(train_ds,
    steps_per_epoch=STEPS_PER_EPOCHS,
    validation_data=val_ds,
    validation_steps=VALIDATION_STEPS,
    epochs=30,
    callbacks=[tf.keras.callbacks.EarlyStopping('val_accuracy')])
And the rest of it, such as retrieving the best hyperparameter
configuration and training the model with this configuration, is
all the same as in the Hyperband section.
Random search
Random search simply samples hyperparameter configurations at random from the search space. Here's an example definition:
tuner = kt.RandomSearch(
    hypermodel = model_builder,
    objective='val_accuracy',
    max_trials = 5,
    directory = flags_obj.model_dir,
    project_name = 'hp_tune_rs',
    overwrite = True)
In the RandomSearch API in the preceding code, we
define the model_builder function as
hypermodel. This function contains our
hyperparameter objects that hold definitions for the
hyperparameter name and search space. hypermodel
specifies the name of our function, which will accept the best
hyperparameters found by the search and use these values to
build a model. Our objective is to find the best set of
hyperparameters that maximizes validation accuracy, and we
set max_trials to 5. The directory to save the model
is provided as user input. The user input for
model_dir is captured by the
flags_obj.model_dir object.
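What max_trials means for random search can be sketched without any tuner library: draw that many configurations at random and evaluate each one. The search space below is an assumption that mirrors model_builder, and the helper name sample_configuration is hypothetical.

```python
import random

# Assumed search space, mirroring the hyperparameters in model_builder.
search_space = {
    'units': list(range(64, 257, 64)),                 # 64, 128, 192, 256
    'dense_activation': ['relu', 'tanh', 'sigmoid'],
}


def sample_configuration(space, rng):
    """Pick one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
max_trials = 5
trials = [sample_configuration(search_space, rng) for _ in range(max_trials)]
for trial in trials:
    print(trial)
```

With 12 possible configurations and 5 trials, random search covers less than half of the grid. The sampler does not remember what it has already drawn, so this simple version may also repeat a configuration.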
With the help of the following code, we will set up user inputs, or flags, and assign default values to these flags where necessary. Let's have a quick review of how user inputs may be handled and defined in the Python script using the absl library and the APIs that are commonly used for handling user input:
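As a rough illustration of the idea, the following sketch defines the same three user inputs that appear in the invocation command later in this section. Note that it uses the standard library's argparse as a stand-in for absl, so that it runs without extra dependencies; the default values and help strings are assumptions, not the values used in the book's script.

```python
import argparse


def parse_flags(argv=None):
    """Define user inputs with defaults and parse them from a list of strings."""
    parser = argparse.ArgumentParser(description='Hyperparameter tuning job')
    parser.add_argument('--model_dir', default='default_model_dir',
                        help='Directory or bucket in which to save the model.')
    parser.add_argument('--train_epoch_best', type=int, default=3,
                        help='Epochs to train with the best hyperparameters.')
    parser.add_argument('--tuner_type', default='hyperband',
                        help='hyperband, BayesianOptimization, or RandomSearch.')
    return parser.parse_args(argv)


# Simulate the command-line arguments instead of reading sys.argv.
flags_obj = parse_flags([
    '--model_dir=resnet_local_hb_output',
    '--train_epoch_best=2',
    '--tuner_type=hyperband',
])
print(flags_obj.model_dir, flags_obj.train_epoch_best, flags_obj.tuner_type)
```

The parsed object exposes each input as an attribute, which is the same access pattern (flags_obj.model_dir, flags_obj.tuner_type) that the rest of this chapter's code relies on.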
'''Runs the hyperparameter search.'''
if(flags_obj.tuner_type.lower() == 'BayesianOptimization'.lower()):
    tuner = kt.BayesianOptimization(
        hypermodel = model_builder,
        objective ='val_accuracy',
        tune_new_entries = True,
        allow_new_entries = True,
        max_trials = 5,
        directory = flags_obj.model_dir,
        project_name = 'hp_tune_bo',
        overwrite = True
        )
elif (flags_obj.tuner_type.lower() == 'RandomSearch'.lower()):
    tuner = kt.RandomSearch(
        hypermodel = model_builder,
        objective='val_accuracy',
        tune_new_entries = True,
        allow_new_entries = True,
        max_trials = 5,
        directory = flags_obj.model_dir,
        project_name = 'hp_tune_rs',
        overwrite = True)
Unless it's specified via input to use either Bayesian
optimization or random search, the default choice is
Hyperband. This is indicated in the else block in the
following code:
else:
    # Default choice for tuning algorithm is hyperband.
    tuner = kt.Hyperband(
        hypermodel = model_builder,
        objective = 'val_accuracy',
        max_epochs = 3,
        factor = 2,
        distribution_strategy=strategy,
        directory = flags_obj.model_dir,
        project_name = 'hp_tune_hb',
        overwrite = True)
Now that the search algorithm has been executed based on the logic of the preceding code, we need to retrieve the best hyperparameters. For our own information, we may use the get_best_hyperparameters API to print out the best hyperparameters. We will get the optimal hyperparameters with the help of the following code:
best_hps = tuner.get_best_hyperparameters(num_trials = 1)[0]
print(f'''
The hyperparameter search is done.
The best number of nodes in the dense layer is {best_hps.get('units')}.
The optimal learning rate for the optimizer is {best_hps.get('learning_rate')}.
''')
Now we can pass these best hyperparameters, best_hps, to the model and train the model with these values. The tuner.hypermodel.build API handles the passing of these values to the model.
logging.info('INSIDE MAIN FUNCTION user input model_dir %s', flags_obj.model_dir)
# Save model trained with chosen HP in user specified bucket location
model_save_dir = os.path.join(flags_obj.model_dir, 'best_save_model')
model.save(model_save_dir)
if __name__ == '__main__':
    app.run(main)
To run this as a script
(hp_kt_resnet_local.py), you could
simply invoke it with the following command:
python3 hp_kt_resnet_local_pub.py \
--model_dir=resnet_local_hb_output \
--train_epoch_best=2 \
--tuner_type=hyperband
In the preceding command, we invoke the python3 runtime to execute our training script, hp_kt_resnet_local.py. model_dir is the place we wish to save the model. tuner_type designates the selection of the hyperparameter search algorithm. Other algorithm choices you may try are Bayesian optimization and random search.
NOTE
The code is lengthy, so you can find the entire code and instructions in the following GitHub repository: https://github.com/PacktPublishing/learn-tensorflow-enterprise/tree/master/chapter_06/gcptuningwork.
In the storage bucket, there are the model assets saved from
training using the best hyperparameter configuration in the
best_save_model folder. Further, we can see
that each trial of the hyperparameter tuning workflow is also
saved in the hp_tune_hb folder.
Summary
In this chapter, we learned how to use Keras Tuner in Google
Cloud AI Platform. We learned how to run the hyperparameter
search, and we learned how to train a model with the best
hyperparameter configuration. We have also seen that in a
typical Keras style, integrating Keras Tuner into our existing
model training workflow is very easy, especially with the
simple treatment of hyperparameters as just arrays of a certain
data type. This really opens up the choices for
hyperparameters, and we do not need to implement the search
logic or complicated conditional loops to keep track of the
results.
Model Optimization
In this chapter, we will learn about the concept of model optimization through a technique known as quantization. This is important because even though capacity, such as compute and memory, is less of an issue in a cloud environment, latency and throughput always affect the quality and quantity of the model's output. Therefore, model optimization to reduce latency and maximize throughput can help reduce the compute cost. In the edge environment, many of the constraints are related to resources such as memory, compute, power consumption, and bandwidth.
In this chapter, you will learn how to make your model as lean
and mean as possible, with acceptable or negligible changes in
the model's accuracy. In other words, we will reduce the model
size so that we can have the model running on less power and
fewer compute resources without overly impacting its
performance. In this chapter, we are going to take a look at
recent advances and a method available for TensorFlow:
TFLite Quantization.
Technical requirements
You will find all the source code in https://github.com/PacktPublishing/learn-tensorflow-enterprise.git.
git clone https://github.com/PacktPublishing/learn-tensorflow-enterprise.git
All the resources for this chapter are available in the
chapter_07 folder in the GitHub link for the book.
Understanding the quantization concept
Quantization is a technique whereby the model size is reduced and its efficiency therefore improved, typically by representing weights at lower numerical precision (for example, 8-bit integers instead of 32-bit floating-point numbers). This technique is helpful in building models for mobile or edge deployment, where compute resources or power supply are constrained. Since our aim is to make the model run as efficiently as possible, we are also accepting the fact that the model has to become smaller and therefore less precise than the original model. This means that we are transforming the model into a lighter version of its original self, and that the transformed model is an approximation of the original one.
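The sense in which a quantized model is an approximation can be seen with a handful of numbers. The following plain-Python sketch maps floating-point values to 8-bit integers with a scale and a zero point, then maps them back. It is a simplified illustration, not the TFLite implementation, whose quantization schemes are more involved; the function names and the sample weights are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers using a scale and a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)
    # Clamp so every value fits into the integer range.
    quantized = [min(qmax, max(qmin, round(v / scale) + zero_point))
                 for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # hypothetical model weights
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q)
print(max_error, scale)
```

Each weight now occupies one byte instead of four, and the restored values differ from the originals by no more than one quantization step (the scale). That small, bounded error is the precision we trade for a smaller and faster model.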
Training a baseline model
Let's begin by training an image classification model with five classes of flowers. We will leverage a pre-trained ResNet feature vector hosted in TensorFlow Hub (https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/4), and you can download the flower images in TFRecord format from here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1ECTVN.
python3 default_trainer.py \
--distribution_strategy=default \
--fine_tuning_choice=False \
--train_batch_size=32 \
--validation_batch_size=40 \
--train_epochs=5 \
--data_dir=tf_datasets/flower_photos \
--model_dir=trained_resnet_vector
From the directory where this script is stored, you will find a
subfolder with the prefix name
trained_resnet_vector, followed by a
date and time stamp such as 20200910-213303.
This subfolder contains the saved model. We will use this
model as our baseline model. Once training is complete, you
will find the saved model in the following directory:
trained_resnet_vector-20200910-213303/save_model/assets
This saved model is in the same directory where default_trainer.py is stored. Now that we have a trained TensorFlow model, in the next section, we are going to score our test data with the trained model.
Preparing a full original model for scoring
After training for a full model is complete, we will use a Scoring Jupyter notebook in this repository to demonstrate scoring with a full model. This notebook can be found in https://github.com/PacktPublishing/learn-tensorflow-enterprise/blob/master/chapter_07/train_base_model/Scoring.ipynb.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image, ImageOps
import IPython.display as display
path_saved_model = 'trained_resnet_vector-unquantized/save_model'
trained_model = tf.saved_model.load(path_saved_model)
The full model we just trained is now loaded in our Jupyter
notebook's runtime as trained_model. For
scoring, a few more steps are required. We have to find the
model signature for prediction:
signature_list = list(trained_model.signatures.keys())
signature_list
It shows that there is only one signature in this list:
['serving_default']
We will create an infer wrapper function and pass the signature into it:
infer = trained_model.signatures[signature_list[0]]
Here, signature_list[0] is equivalent to serving_default. Now let's print the output:
print(infer.structured_outputs)
Let's take a look at the output of the preceding function:
{'custom_class': TensorSpec(shape=(None, 5), dtype=tf.float32, name='custom_class')}
The output is a float32 tensor of shape=(None, 5). It will hold the class probabilities predicted by the model, with one row per input image.
Now let's work on the test data. The test data provided in this
case is in TFRecord format. We are going to convert it to a
batch of images expressed as a NumPy array in the dimensions
of [None, 224, 224, 3].
root_dir = 'tf_datasets/flower_photos'
test_pattern = '{}/image_classification_builder-test.tfrecord*'.format(root_dir)
test_all_files = tf.data.Dataset.list_files(tf.io.gfile.glob(test_pattern))
test_all_ds = tf.data.TFRecordDataset(
    test_all_files,
    num_parallel_reads=tf.data.experimental.AUTOTUNE)
We will check the number of samples in the test data with the following code:
sample_size = 0
for raw_record in test_all_ds:
    sample_size += 1
print('Sample size: ', sample_size)
Here is the output:
Sample size: 50
This shows that we have 50 samples in our test data.
def decode_and_resize(serialized_example):
    # resized image should be [224, 224, 3] with values in the range [0, 255]
    # label is integer index of class.
    parsed_features = tf.io.parse_single_example(
        serialized_example,
        features = {
            'image/channels' : tf.io.FixedLenFeature([], tf.int64),
            'image/class/label' : tf.io.FixedLenFeature([], tf.int64),
            'image/class/text' : tf.io.FixedLenFeature([], tf.string),
            'image/colorspace' : tf.io.FixedLenFeature([], tf.string),
            'image/encoded' : tf.io.FixedLenFeature([], tf.string),
            'image/filename' : tf.io.FixedLenFeature([], tf.string),
            'image/format' : tf.io.FixedLenFeature([], tf.string),
            'image/height' : tf.io.FixedLenFeature([], tf.int64),
            'image/width' : tf.io.FixedLenFeature([], tf.int64)
        })
    image = tf.io.decode_jpeg(parsed_features['image/encoded'], channels=3)
    label = tf.cast(parsed_features['image/class/label'], tf.int32)
    label_txt = tf.cast(parsed_features['image/class/text'], tf.string)
    label_one_hot = tf.one_hot(label, depth = 5)
    resized_image = tf.image.resize(image, [224, 224], method='nearest')
    return resized_image, label_one_hot

def normalize(image, label):
    # Convert `image` from [0, 255] -> [0, 1.0] floats
    image = tf.cast(image, tf.float32) / 255.
    return image, label
The decode_and_resize function parses an
image, resizes it to 224 by 224 pixels, and, at the same
time, one-hot encodes the image's label.
decode_and_resize then returns the image
and corresponding label as a tuple, so that the image and label
are always kept together.
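To make the two transformations concrete, here is a framework-free sketch of what one-hot encoding and pixel normalization compute. The helper names one_hot and normalize_pixels are illustrative only; the notebook itself relies on tf.one_hot and tf.cast, and this sketch makes no attempt to mirror their handling of edge cases.

```python
def one_hot(label, depth):
    """Return a list of length `depth` with 1.0 at position `label`."""
    if not 0 <= label < depth:
        raise ValueError("label must be in [0, depth)")
    return [1.0 if i == label else 0.0 for i in range(depth)]

def normalize_pixels(pixels):
    """Scale 8-bit pixel values from [0, 255] to floats in [0, 1]."""
    return [p / 255.0 for p in pixels]

print(one_hot(4, depth=5))
print(normalize_pixels([0, 51, 255]))
```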
decoded = test_all_ds.map(decode_and_resize)
normed = decoded.map(normalize)
In the following code, notice that we introduce an additional first dimension through np.expand_dims. This extra dimension is the batch dimension, whose size varies with the number of images:
np_img_holder = np.empty((0, 224, 224, 3), float)
np_lbl_holder = np.empty((0, 5), int)
for img, lbl in normed:
    r = img.numpy()  # image value extracted
    rx = np.expand_dims(r, axis=0)
    lx = np.expand_dims(lbl, axis=0)
    np_img_holder = np.append(np_img_holder, rx, axis=0)
    np_lbl_holder = np.append(np_lbl_holder, lx, axis=0)
The test data is now in NumPy format with standardized
dimensions, pixel values between 0 and 1, and is batched, as
are the labels.
%matplotlib inline
plt.figure()
for i in range(len(np_img_holder)):
    plt.subplot(10, 5, i+1)
    plt.axis('off')
    plt.imshow(np.asarray(np_img_holder[i]))
In the preceding code snippet, we iterate through our image
array and place each image in one of the subplots. There are
50 images (10 rows, with each row having five subplots), as
can be seen in the following figure:
Figure 7.1 – 50 images within the test dataset of five flower
classes
NOTE
The tf.saved_model.load API helps us to load the saved model we built and trained.
2. Then we will create a converter object to refer to the SavedModel directory with the following line of code:
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
input_data = np.array(np.expand_dims(np_img_holder[0], axis=0), dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
print(output_data)
Here is the output:
[[1.5181543e-04 2.0090181e-05 1.0022727e-06 2.8991076e-06 9.9982423e-01]]
To map output_data back to the original labels, execute the following command:
lookup(output_data, val_label_map)
Here is the output:
'tulips'
NOTE
It's expected that your model accuracy will be slightly different from the nominal value printed here. Every time a base model is trained, the model accuracy will not be identical. However, it should not be too dissimilar to the nominal value. Another factor that impacts reproducibility in terms of model accuracy is the number of epochs used in training; in this case, only five epochs were used, for demonstration and didactic purposes. More training epochs will give you better model accuracy with a tighter variance.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
For hybrid quantization, we will simply remove the middle line about supported_types:
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
Everything else remains pretty much the same. Following is the complete notebook for hybrid quantization and scoring:
Mapping a prediction to a class name
From the TFRecord data, we need to create a reverse lookup dictionary to map a probability back to a label. In other words, we need to find the index at which the maximum probability is positioned in the output array. We will then map this position index to the flower type.
feature_description = {
    'image/channels' : tf.io.FixedLenFeature([], tf.int64),
    'image/class/label' : tf.io.FixedLenFeature([], tf.int64),
    'image/class/text' : tf.io.FixedLenFeature([], tf.string),
    'image/colorspace' : tf.io.FixedLenFeature([], tf.string),
    'image/encoded' : tf.io.FixedLenFeature([], tf.string),
    'image/filename' : tf.io.FixedLenFeature([], tf.string),
    'image/format' : tf.io.FixedLenFeature([], tf.string),
    'image/height' : tf.io.FixedLenFeature([], tf.int64),
    'image/width' : tf.io.FixedLenFeature([], tf.int64)
}

def _parse_function(example_proto):
    return tf.io.parse_single_example(example_proto, feature_description)

parsd_ds = test_all_ds.map(_parse_function)
val_label_map = {}
# getting label mapping
for image_features in parsd_ds.take(50):
    label_idx = image_features['image/class/label'].numpy()
    label_str = image_features['image/class/text'].numpy().decode()
    if label_idx not in val_label_map:
        val_label_map[label_idx] = label_str
In the preceding code, we used feature_description to parse test_all_ds. Once it is parsed using _parse_function, we iterate through the entire test dataset. The information we want can be found in image/class/label and image/class/text.
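The lookup helper that consumes this dictionary is not shown in this section. The following is a minimal sketch of how such a reverse lookup could work, assuming it takes the index of the highest probability and looks it up in the dictionary. The names build_label_map and lookup, and the index-to-name pairs in the sample records, are illustrative assumptions rather than the notebook's actual code.

```python
def build_label_map(records):
    """Collect index -> class-name pairs from (label_idx, label_str) records."""
    label_map = {}
    for label_idx, label_str in records:
        label_map.setdefault(label_idx, label_str)
    return label_map

def lookup(probabilities, label_map):
    """Return the class name for the highest probability in a (1, N) batch."""
    row = list(probabilities[0])
    best_index = max(range(len(row)), key=row.__getitem__)
    return label_map[best_index]

# Hypothetical index assignment, used only for illustration.
records = [(0, "daisy"), (4, "tulips"), (0, "daisy"), (2, "roses")]
label_map = build_label_map(records)
print(lookup([[0.01, 0.02, 0.03, 0.04, 0.90]], label_map))
```

Because the function only indexes the first row and iterates over it, the same sketch accepts a nested list or a NumPy array of shape (1, N).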
So now, decode_and_resize is applied to each image in train_all_ds and val_all_ds. The resulting datasets are dataset and val_dataset, respectively.
5. We also need to normalize the validation dataset and finalize the training dataset for the training run process:
# Create dataset for training run
BATCH_SIZE = 32
VALIDATION_BATCH_SIZE = 40
dataset = dataset.map(normalize,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
val_dataset = val_dataset.map(normalize,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
val_ds = val_dataset.batch(VALIDATION_BATCH_SIZE)
AUTOTUNE = tf.data.experimental.AUTOTUNE
train_ds = prepare_for_training(dataset)
The feature_description in the preceding code is a collection of key-value pairs. Each pair delineates a piece of metadata represented as a tensor:
def _parse_function(example_proto):
    return tf.io.parse_single_example(example_proto, feature_description)

parsd_ds = test_all_ds.map(_parse_function)
1. This is a batch_predict function that treats the input NumPy array as an unsigned 8-bit integer (uint8):
def batch_predict(input_raw, input_tensor, output_tensor, dictionary):
    input_data = np.array(np.expand_dims(input_raw, axis=0), dtype=np.uint8)
    interpreter.set_tensor(input_tensor['index'], input_data)
    interpreter.invoke()
    interpreter_output = interpreter.get_tensor(output_tensor['index'])
    plain_text_label = lookup(interpreter_output, dictionary)
    return plain_text_label
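An integer-quantized model expects its uint8 input to represent real values through a scale and a zero point. The following is a sketch of that affine mapping in plain Python. The scale and zero point used here are assumed values for an input normalized to [0, 1]; in practice they are reported per tensor by the TFLite interpreter and should be read from there rather than hardcoded.

```python
def quantize(value, scale, zero_point):
    """Map a real value to uint8: q = round(value / scale) + zero_point."""
    q = int(round(value / scale)) + zero_point
    return max(0, min(255, q))  # clamp to the uint8 range

def dequantize(q, scale, zero_point):
    """Recover the approximate real value: scale * (q - zero_point)."""
    return scale * (q - zero_point)

# Assumed parameters for an input normalized to [0, 1].
SCALE, ZERO_POINT = 1.0 / 255.0, 0
q = quantize(0.3, SCALE, ZERO_POINT)
print(q, dequantize(q, SCALE, ZERO_POINT))
```

The round trip loses at most half a quantization step, which is one way to see why accuracy can survive the conversion.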
This result may vary if you retrain the full model, but it shouldn't be too dissimilar to this value. Furthermore, based on my experience with this data, integer quantized model performance is on a par with that of the original full model. The preceding code shows that our TFLite model performed just as well as the original model. As we reduce the model size through quantization, we are still able to preserve the model's accuracy. In this example, making the model more compact did not affect its accuracy.
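The size reduction itself is easy to see outside TensorFlow. The sketch below uses the standard struct module to store a handful of made-up weight values as 32-bit and as 16-bit floats, then measures the storage used and the round-off error. It illustrates reduced-precision storage in general and is not the TFLite converter's internal procedure.

```python
import struct

# Made-up weight values, for illustration only.
weights = [0.1234567, -1.5, 3.1415926, 0.0009765]

as_float32 = struct.pack("<%df" % len(weights), *weights)
as_float16 = struct.pack("<%de" % len(weights), *weights)

restored = struct.unpack("<%de" % len(weights), as_float16)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(len(as_float32), len(as_float16))  # bytes used by each encoding
print(max_error)                         # largest round-off error
```

Half-precision storage uses half the bytes, and the error introduced per weight is small relative to the weight itself.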
Summary
In this chapter, we learned to optimize a trained model by making it smaller and more compact. This gives us more flexibility when deploying models on various hardware or under resource constraints. Optimization is important for model deployment in a resource-constrained environment such as edge devices with limited compute, memory, or power resources. We achieved model optimization by means of quantization, where we reduced the model footprint by changing the data type of the weights, biases, and activations.
TFRecord dataset – ingestion pipeline
Another means of streaming training data into the model during the training process is through the TFRecord dataset. TFRecord is a protocol buffer format. Data stored in this format may be used in Python, Java, and C++. In enterprise or production systems, this format may provide versatility and promote reusability of data across different applications. Another consideration is that if you wish to use a TPU as your compute target and ingest training data through a pipeline, then TFRecord is the means to achieve it. Currently, TPUs do not work with generators. Therefore, the only way to stream data through a pipeline approach is by means of TFRecord. Again, a dataset of this size does not really require TFRecord; it is used here for learning purposes only.
git clone https://github.com/PacktPublishing/learn-tensorflow-enterprise.git
Once this command is complete, navigate to the following path in the cloned repository:
learn-tensorflow-enterprise/chapter_07/train_base_model/tf_datasets/flower_photos
You will see the following TFRecord datasets:
image_classification_builder-train.tfrecord-00000-of-00002
image_classification_builder-train.tfrecord-00001-of-00002
image_classification_builder-validation.tfrecord-00000-of-00001
image_classification_builder-test.tfrecord-00000-of-00001
Make a note of the file path where these datasets are stored.
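Each split is stored as one or more shards that share a filename prefix, which is why the notebooks select files with a wildcard pattern. The sketch below groups the four filenames listed above by split using only the standard library; the helper name shards_for_split is hypothetical.

```python
import fnmatch

FILES = [
    "image_classification_builder-train.tfrecord-00000-of-00002",
    "image_classification_builder-train.tfrecord-00001-of-00002",
    "image_classification_builder-validation.tfrecord-00000-of-00001",
    "image_classification_builder-test.tfrecord-00000-of-00001",
]

def shards_for_split(filenames, split):
    """Return the sorted shard files belonging to one split."""
    pattern = "image_classification_builder-{}.tfrecord*".format(split)
    return sorted(fnmatch.filter(filenames, pattern))

for split in ("train", "validation", "test"):
    print(split, len(shards_for_split(FILES, split)))
```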
TFRecord dataset –
feature engineering and
training
When we used a generator as the ingestion pipeline, the
generator took care of batching and matching data and labels
during the training process. However, unlike the generator, in
order to use the TFRecord dataset, we have to parse it and
perform some necessary feature engineering tasks, such as
normalization and standardization, ourselves. The creator of
TFRecord has to provide a feature description dictionary as a
template for parsing the samples. In this case, the following
feature dictionary is provided:
features = {
    'image/channels' : tf.io.FixedLenFeature([], tf.int64),
    'image/class/label' : tf.io.FixedLenFeature([], tf.int64),
    'image/class/text' : tf.io.FixedLenFeature([], tf.string),
    'image/colorspace' : tf.io.FixedLenFeature([], tf.string),
    'image/encoded' : tf.io.FixedLenFeature([], tf.string),
    'image/filename' : tf.io.FixedLenFeature([], tf.string),
    'image/format' : tf.io.FixedLenFeature([], tf.string),
    'image/height' : tf.io.FixedLenFeature([], tf.int64),
    'image/width' : tf.io.FixedLenFeature([], tf.int64)
}
We will go through the following steps to parse the dataset,
perform feature engineering tasks, and submit the dataset for
training. These steps follow the completion of the TFRecord
dataset – ingestion pipeline section:
1. Parse TFRecord and resize the images. We will use the preceding dictionary to parse TFRecord in order to extract a single image as a NumPy array and its corresponding label. We will define a decode_and_resize function for this purpose:
def decode_and_resize(serialized_example):
    # resized image should be [224, 224, 3] with values in the range [0, 255]
    # label is integer index of class.
    parsed_features = tf.io.parse_single_example(
        serialized_example,
        features = {
            'image/channels' : tf.io.FixedLenFeature([], tf.int64),
            'image/class/label' : tf.io.FixedLenFeature([], tf.int64),
            'image/class/text' : tf.io.FixedLenFeature([], tf.string),
            'image/colorspace' : tf.io.FixedLenFeature([], tf.string),
            'image/encoded' : tf.io.FixedLenFeature([], tf.string),
            'image/filename' : tf.io.FixedLenFeature([], tf.string),
            'image/format' : tf.io.FixedLenFeature([], tf.string),
            'image/height' : tf.io.FixedLenFeature([], tf.int64),
            'image/width' : tf.io.FixedLenFeature([], tf.int64)
        })
    image = tf.io.decode_jpeg(parsed_features['image/encoded'], channels=3)
    label = tf.cast(parsed_features['image/class/label'], tf.int32)
    label_txt = tf.cast(parsed_features['image/class/text'], tf.string)
    label_one_hot = tf.one_hot(label, depth = 5)
    resized_image = tf.image.resize(image, [224, 224], method='nearest')
    return resized_image, label_one_hot
The decode_and_resize function takes a serialized example from a TFRecord dataset, parses it, extracts the metadata and the actual image, and then returns the image and its label.
Here, we apply decode_and_resize to all the datasets, and then normalize each dataset at the pixel level.
4. Batch datasets for training processes. The final step to be performed on the TFRecord dataset is batching. We will define a few variables for this purpose, and define a function, prepare_for_model, for batching:
pixels = 224
IMAGE_SIZE = (pixels, pixels)
TRAIN_BATCH_SIZE = 32
# Validation and test data are small. Use all in a batch.
VAL_BATCH_SIZE = sum(1 for _ in tf.data.TFRecordDataset(val_all_files))
TEST_BATCH_SIZE = sum(1 for _ in tf.data.TFRecordDataset(test_all_files))

def prepare_for_model(ds, BATCH_SIZE, cache=True, TRAINING_DATA=True,
                      shuffle_buffer_size=1000):
    if cache:
        if isinstance(cache, str):
            ds = ds.cache(cache)
        else:
            ds = ds.cache()
    ds = ds.shuffle(buffer_size=shuffle_buffer_size)
    if TRAINING_DATA:
        # Repeat forever
        ds = ds.repeat()
    ds = ds.batch(BATCH_SIZE)
    ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds
The prepare_for_model function takes a dataset, optionally caches it (in memory, or in a file if a path is given), shuffles it, batches it, and prefetches it. If this function is applied to the training data, it also repeats the data indefinitely to make sure you don't run out of data during the training process.
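Because a repeated dataset never ends, the training call has to be told how many batches make up one epoch. The following plain-Python sketch mimics repeat-then-batch on a list and computes that step count; the batches and steps_per_epoch helpers are illustrative stand-ins, not tf.data functions.

```python
import itertools
import math

def batches(samples, batch_size, repeat=False):
    """Yield lists of up to `batch_size` items; loop forever if `repeat`."""
    source = itertools.cycle(samples) if repeat else iter(samples)
    while True:
        batch = list(itertools.islice(source, batch_size))
        if not batch:
            return
        yield batch

def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

data = list(range(10))
print(list(batches(data, 4)))
print(steps_per_epoch(len(data), 4))
first_five = list(itertools.islice(batches(data, 4, repeat=True), 5))
print(first_five)
```

Note that with repetition, a batch can straddle the boundary between two passes over the data, which is also what happens when repeat is applied before batch.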
5. Execute batching. Apply prepare_for_model to the test data, and chain repeat, shuffle, batch, and prefetch on the training and validation data:
NUM_EPOCHS = 5
SHUFFLE_BUFFER_SIZE = 1000
prepped_test_ds = prepare_for_model(resized_normalized_test_ds,
                                    TEST_BATCH_SIZE, False, False)
prepped_train_ds = resized_normalized_train_ds.repeat(100).shuffle(
    buffer_size=SHUFFLE_BUFFER_SIZE)
prepped_train_ds = prepped_train_ds.batch(TRAIN_BATCH_SIZE)
prepped_train_ds = prepped_train_ds.prefetch(buffer_size = AUTOTUNE)
prepped_val_ds = resized_normalized_val_ds.repeat(NUM_EPOCHS).shuffle(
    buffer_size=SHUFFLE_BUFFER_SIZE)
prepped_val_ds = prepped_val_ds.batch(80)
prepped_val_ds = prepped_val_ds.prefetch(buffer_size = AUTOTUNE)
The preceding code sets up batches of training, validation, and test data. These are ready to be fed into the training routine. We have now completed the data ingestion pipeline.
6. Build and train the model. This part does not vary from the previous section. We will build and train a model with the same architecture as the one used with the generator:
# Assumes tensorflow is imported as tf and tensorflow_hub as hub.
FINE_TUNING_CHOICE = False
NUM_CLASSES = 5
IMAGE_SIZE = (224, 224)
mdl = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=IMAGE_SIZE + (3,), name='input_layer'),
    hub.KerasLayer("https://tfhub.dev/google/imagenet/resnet_v1_101/feature_vector/4",
                   trainable=FINE_TUNING_CHOICE, name = 'resnet_fv'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax', name = 'custom_class')
])
mdl.build([None, 224, 224, 3])
mdl.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.005, momentum=0.9),
    # The final layer already applies softmax, so the loss receives
    # probabilities rather than logits.
    loss=tf.keras.losses.CategoricalCrossentropy(
        from_logits=False, label_smoothing=0.1),
    metrics=['accuracy'])
mdl.fit(
    prepped_train_ds,
    epochs=5, steps_per_epoch=100,
    validation_data=prepped_val_ds,
    validation_steps=1)
Regularization
During the training process, the model learns the set of weights and biases that minimizes the loss function. As the model architecture becomes more complex, or simply takes on more layers, the model is fitted with more parameters. Although this may help to produce a better fit during training, using more parameters may also lead to overfitting.
L1 and L2 regularization
Traditional methods to address overfitting involve introducing a penalty term in the loss function. This is known as regularization. The penalty term is directly related to model complexity, which is largely determined by the number of non-zero weights. In the tf.keras API, a regularizer can be attached to a layer in three places:
kernel_regularizer: A regularizer applied to the weight matrix
bias_regularizer: A regularizer applied to the bias vector
activity_regularizer: A regularizer applied to the output of the layer
KERNEL_REGULARIZER = tf.keras.regularizers.l2(0.1)
ACTIVITY_REGULARIZER = tf.keras.regularizers.L1L2(l1=0.1, l2=0.1)
mdl = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=IMAGE_SIZE + (3,)),
    hub.KerasLayer("https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/4",
                   trainable=FINE_TUNING_CHOICE),
    tf.keras.layers.Dense(NUM_CLASSES,
                          activation='softmax',
                          kernel_regularizer=KERNEL_REGULARIZER,
                          activity_regularizer=ACTIVITY_REGULARIZER,
                          name = 'custom_class')
])
mdl.build([None, 224, 224, 3])
Notice that we define the regularizers of interest under their own names, outside the layer. This makes it easy to adjust the hyperparameters (l1, l2) that determine how strongly the regularization term penalizes the loss function for potential overfitting:
KERNEL_REGULARIZER = tf.keras.regularizers.l2(0.1)
ACTIVITY_REGULARIZER = tf.keras.regularizers.L1L2(l1=0.1, l2=0.1)
This is followed by the addition of these regularizer definitions in the dense layer definition:
tf.keras.layers.Dense(NUM_CLASSES,
                      activation='softmax',
                      kernel_regularizer=KERNEL_REGULARIZER,
                      activity_regularizer=ACTIVITY_REGULARIZER,
                      name = 'custom_class')
These are the only changes that are required to the code used in the previous section.
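To see what the l1 and l2 hyperparameters do to the loss, the penalty can be written out by hand. The sketch below computes the standard L1 and L2 terms over a flat list of weights and adds them to a placeholder data loss. The function name l1_l2_penalty and the numbers are illustrative; Keras computes the equivalent sums on tensors.

```python
def l1_l2_penalty(weights, l1=0.0, l2=0.0):
    """Return l1 * sum(|w|) + l2 * sum(w^2) over a flat list of weights."""
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return l1_term + l2_term

weights = [1.0, -2.0, 3.0]
data_loss = 0.25  # placeholder value, for illustration only
total_loss = data_loss + l1_l2_penalty(weights, l1=0.1, l2=0.1)
print(total_loss)
```

Larger factors make large weights more expensive, which pushes the optimizer toward smaller weights at the cost of a looser fit to the training data.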
Adversarial regularization
An interesting technique known as adversarial learning emerged in 2014 (if interested, read the seminal paper published by Goodfellow et al., 2014). The idea stems from the fact that a machine learning model's accuracy can be greatly compromised, and the model will produce incorrect predictions, if the inputs are slightly noisier than expected. Such noise is known as adversarial perturbation. Therefore, if the training dataset is augmented with some variation of this kind, we can make our model more robust.
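One well-known way to construct such a perturbation is the fast gradient sign method: nudge each input feature by a small step in the direction that increases the loss. The sketch below applies it to a tiny logistic model in plain Python. The model, weights, and epsilon are made up for illustration, and this is not a description of how any particular library implements adversarial regularization.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, x, y):
    """Binary cross-entropy of a logistic model on one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm_perturb(w, b, x, y, epsilon):
    """Shift x by epsilon in the direction that increases the loss."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]           # d(loss)/dx for this model
    sign = [(g > 0) - (g < 0) for g in grad]    # -1, 0, or +1 per feature
    return [xi + epsilon * s for xi, s in zip(x, sign)]

w, b = [1.0, -1.0], 0.0
x, y = [0.5, 0.5], 1
x_adv = fgsm_perturb(w, b, x, y, epsilon=0.1)
print(x_adv, log_loss(w, b, x, y), log_loss(w, b, x_adv, y))
```

Training on both the original and the perturbed example penalizes a model whose prediction changes sharply under such small input shifts.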
TensorFlow's AdversarialRegularization API (part of the Neural Structured Learning library) is designed to complement the tf.keras API and simplify model building and training processes. We are going to reuse the downloaded TFRecord dataset as the original training data. Then we will apply a data augmentation technique to this dataset, and finally we will train the model. To do so, follow these steps:
The decode_and_resize function takes a dataset in TFRecord format, parses it, extracts the metadata and actual image, and returns the image and its label. At a more detailed level, inside this function, the TFRecord dataset is parsed into parsed_features. This is how we extract different metadata from the dataset. The image is decoded by the decode_jpeg API, and it is resized to 224 by 224 pixels. As for the label, it is extracted and one-hot encoded.
8. Finally, the function returns the resized image and the corresponding one-hot label.
def normalize(image, label):
    # Convert `image` from [0, 255] -> [0, 1.0] floats
    image = tf.cast(image, tf.float32) / 255.
    return image, label
Summary
This chapter presented some common practices for enhancing and improving your model building and training processes. One of the most common challenges in handling training data is streaming or fetching it in an efficient and scalable manner. In this chapter, you have seen two methods to help you build such an ingestion pipeline: generators and datasets. Each has its strengths and purposes. Generators manage data transformation and batching quite well, while the dataset API is designed for cases where a TPU is the target.
Serving a TensorFlow Model
Having worked through the previous chapters, you have seen many facets of the model building process in TensorFlow Enterprise (TFE). Now it is time to wrap up what we have done and look at how we can serve the model we have built. In this chapter, we are going to look at the fundamentals of serving a TensorFlow model through a RESTful API on localhost. The easiest way to get started is by using TensorFlow Serving (TFS). Out of the box, TFS is a system for serving machine learning models built with TensorFlow. Although it is not yet officially supported by TFE, you will see that it works with models built by TFE 2. It can run either as a server or as a Docker container. For convenience, we are going to use a Docker container, as it is the easiest way to start using TFS regardless of your local environment, as long as you have a Docker engine available. In this chapter, we will cover the following topics:
Technical requirements
To follow along with this chapter and try the example code, you will need to clone the GitHub repository for this book, https://github.com/PacktPublishing/learn-tensorflow-enterprise, and navigate to the chapter_09 folder. You may clone the repository with the following command:
git clone https://github.com/PacktPublishing/learn-tensorflow-enterprise.git
We will work from the folder named chapter_09. Inside this folder, there is a Jupyter notebook containing source code. You will also find the flowerclassifier/001 directory, which contains a saved_model.pb file ready for your use. In the raw_images directory, you will find a few raw JPG images for testing.
mdl.fit(
    train_dataset,
    epochs=5, steps_per_epoch=steps_per_epoch,
    validation_data=valid_dataset,
    validation_steps=validation_steps)
After you've executed the preceding code, you have a model object, mdl, that can be saved via the following syntax:
saved_model_path = ''
tf.saved_model.save(mdl, saved_model_path)
If you take a look at the current directory, you will find a saved_model.pb file there.
Understanding TensorFlow Serving with Docker
At the core of TFS is a TensorFlow model server that runs a model Protobuf file. Installing the model server is not straightforward, as there are many dependencies. As a convenience, the TensorFlow team also provides this model server as a Docker container. Docker is a platform that uses virtualization at the operating system level, and a container is self-contained, with all the necessary dependencies (that is, libraries or modules) to run in an isolated environment.
Downloading TensorFlow Serving Docker images
Once the Docker engine is up and running, you are ready to perform the following steps:
flowerclassifier is the directory name two levels up from the saved_model.pb file. In between the two, you will notice that there is a directory, 001. This hierarchy is required by TFS, and so is the naming convention for the middle directory, whose name has to be an integer. It doesn't have to be 001, as long as the name consists only of digits.
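The directory convention can be checked before starting the server. The following is a small sketch that validates a version directory name and builds the expected path to the model file; the helper names are hypothetical and the check covers only the naming rule described above.

```python
import os

def is_valid_version_dir(name):
    """A version directory name must consist of digits only, e.g. '001'."""
    return name.isdigit() and name.isascii()

def model_path(base_dir, model_name, version):
    """Build '<base_dir>/<model_name>/<version>/saved_model.pb'."""
    if not is_valid_version_dir(version):
        raise ValueError("version directory must be an integer: %r" % version)
    return os.path.join(base_dir, model_name, version, "saved_model.pb")

print(model_path(".", "flowerclassifier", "001"))
```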
Now the model is served. In the next section, we will see how
to build our client that calls this model using the Python JSON
library.
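As a preview of what such a client sends, the sketch below builds the URL and JSON body for a TensorFlow Serving REST predict call without actually sending it. The port 8501 and the helper name build_predict_request are assumptions for illustration; use whichever port your container publishes. The tiny nested list stands in for a real 224 by 224 image.

```python
import json

def build_predict_request(model_name, images, host="localhost", port=8501):
    """Return (url, body) for a TensorFlow Serving REST predict call."""
    url = "http://{}:{}/v1/models/{}:predict".format(host, port, model_name)
    body = json.dumps({"signature_name": "serving_default",
                       "instances": images})
    return url, body

# One tiny 2x2 RGB "image" as nested lists; a real request would carry 224x224x3.
fake_image = [[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
              [[0.5, 0.5, 0.5], [0.2, 0.4, 0.6]]]
url, body = build_predict_request("flowerclassifier", [fake_image])
print(url)
```

The body can then be posted to the URL with any HTTP client, and the server's JSON response parsed with json.loads.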
Summary
In this chapter, you learned how to deploy a TensorFlow SavedModel. This is by no means the most common method to use in enterprise deployment. In an enterprise deployment scenario, many factors determine how the deployment pipeline should be built, and depending on the use cases, deployment patterns and choices can quickly diverge from there. For example, some organizations use Airflow as their orchestration tool, and some may prefer Kubeflow, while many others still use Jenkins.
The goal of this book is to show you how to leverage the latest and most reliable implementation of TensorFlow Enterprise from a data scientist/machine learning model builder's perspective.