Deep Learning: Huawei AI Academy Training Materials

HCIA-AI V3.0 Deep Learning chapter

Uploaded by

Mohammad Waleed

Huawei AI Academy Training Materials

Deep Learning

Huawei Technologies Co., Ltd.


Copyright © HiSilicon (Shanghai) Technologies Co., Ltd. 2020. All rights reserved.
No part of this document may be reproduced or transmitted in any form or by any means without
prior written consent of HiSilicon (Shanghai) Technologies Co., Ltd.

Trademarks and Permissions

, , and other HiSilicon icons are trademarks of HiSilicon Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their respective
holders.

Notice
The purchased products, services and features are stipulated by the contract made between HiSilicon
and the customer. All or part of the products, services and features described in this document may
not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all
statements, information, and recommendations in this document are provided "AS IS" without
warranties, guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made in
the preparation of this document to ensure accuracy of the contents, but all statements, information,
and recommendations in this document do not constitute a warranty of any kind, express or implied.

HiSilicon (Shanghai) Technologies Co., Ltd.

Address: New R&D Center, 49 Wuhe Road, Bantian, Longgang District,
Shenzhen 518129 P. R. China

Website: http://www.hisilicon.com/en/

Email: [email protected]

Contents

1 Deep Learning
1.1 Deep Learning
1.1.1 Overview
1.1.2 Deep Neural Network
1.1.3 Development History of Deep Learning
1.1.4 Perceptron Algorithm
1.2 Training Rules
1.2.1 Loss Function
1.2.2 Gradient Descent Method
1.2.3 BP Algorithm
1.3 Activation Function
1.4 Regularization
1.4.1 Parameter Penalty
1.4.2 Dataset Expansion
1.4.3 Dropout
1.4.4 Early Stopping of Training
1.5 Optimizers
1.5.1 Momentum Optimizer
1.5.2 AdaGrad Optimizer
1.5.3 RMSProp Optimizer
1.5.4 Adam Optimizer
1.6 Types of Neural Networks
1.6.1 CNN
1.6.2 RNN
1.6.3 GAN
1.7 Common Issues
1.7.1 Data Imbalance
1.7.2 Gradient Vanishing and Gradient Explosion
1.7.3 Overfitting
1.8 Summary
1.9 Quiz
2 Deep Learning Development Frameworks
2.1 Deep Learning Development Frameworks
2.1.1 Introduction to PyTorch
2.1.2 Introduction to MindSpore
2.1.3 Introduction to TensorFlow
2.2 TensorFlow 2.0 Basics
2.2.1 Introduction
2.2.2 Tensors
2.2.3 Eager Execution Mode
2.2.4 AutoGraph
2.3 TensorFlow 2.0 Modules
2.3.1 Common Modules
2.3.2 Keras API
2.4 Basic Development Steps of TensorFlow 2.0
2.4.1 Environment Setup
2.4.2 Development Process
2.5 Summary
2.6 Quiz

1 Deep Learning

Deep learning is a machine learning model based on neural networks and has great
advantages in fields such as computer vision, speech recognition, and natural language
processing. This chapter introduces the basic knowledge of deep learning, including the
development history of deep learning, components of deep learning neural networks,
types of deep learning neural networks, and common problems in deep learning projects.

1.1 Deep Learning


1.1.1 Overview
In conventional machine learning, features are manually selected. More features convey
more information to a model and give it a stronger expression capability.
However, as the number of features increases, the algorithm complexity grows and the
model search space expands accordingly. The training data then becomes very sparse in
the feature space, which makes similarity judgments unreliable. This phenomenon is called
dimension explosion (also known as the curse of dimensionality). More importantly, a
feature that is not beneficial to the task may interfere with learning. Limited by the
number of features, conventional machine learning algorithms are suited to training on
small volumes of data. Once the data volume grows beyond a certain point, adding more
data yields little further improvement in performance. Consequently, conventional machine
learning places relatively low demands on computer hardware and involves a limited amount
of computation. Generally, no GPU is required for parallel computing.

Figure 1-1 General process of machine learning


Figure 1-1 shows the general process of conventional machine learning. In this process,
features have strong interpretability because they are manually selected. However, more
features do not necessarily lead to better learning results. Proper feature selection is
the key to successful recognition, and the number of features required depends on the
problem. To avoid the inherent biases that manual feature selection may introduce,
deep learning seeks algorithms that extract features automatically. Although this
weakens the interpretability of features, it improves the adaptability of the model to
different problems. In addition, deep learning uses an end-to-end learning model with
high-dimensional weight parameters and, given massive training data, achieves higher
performance than conventional methods. Massive data places higher demands on
hardware: large numbers of matrix operations run too slowly on a CPU, so a GPU is
needed for parallel acceleration.

1.1.2 Deep Neural Network


Generally, deep learning refers to a deep neural network, that is, a multi-layer neural
network. It is a model constructed by simulating the neural networks of the human brain. As
shown in Figure 1-2, a deep neural network is a stack of perceptrons, which simulate human
neurons. In the middle and right parts of Figure 1-2, each circle represents one neuron.
The following description will illustrate the similarities between this design and the
neurons of human brains. In the design and application of artificial neural networks, the
following factors need to be considered: neuron functions, connection modes among
neurons, and network learning (training).

Figure 1-2 Human brain neurons and artificial neural networks


So what exactly is a neural network? Currently, there are different definitions of neural
networks. According to Hecht-Nielsen, an American neural network scientist, a neural
network is a computer system formed by many very simple processing units
connected to each other in a specific manner. The system processes information by
dynamically responding to external input based on its own state.
Based on the source, characteristics, and explanations of the neural network, the neural
network can be simply expressed as an information processing system designed to
imitate the structure and functions of the human brain. Artificial neural networks reflect some
basic features of human brain functions, such as parallel information processing,
learning, association, pattern classification, and memorization. A neural network is a
network formed by interconnected artificial neurons. It is an abstraction and simplification of the
human brain in terms of microstructure and function, and is an important way of
simulating human intelligence.

1.1.3 Development History of Deep Learning


The development history of deep learning is also the development history of neural
networks. Since the 1950s, with the continuous development of computer hardware
technologies, neural networks have developed from a single layer to multiple layers, and
finally became the well-known deep neural networks of today. Generally, the development
of neural networks can be divided into three phases, as shown in Figure 1-3.

Figure 1-3 Development history of machine learning


In 1958, Rosenblatt invented the Perceptron algorithm, marking the beginning of the
germination phase of neural networks. However, machine learning in this period had not
yet been separated from other research directions of artificial intelligence (AI). Therefore, the
Perceptron algorithm was not developed much further. In 1969, Minsky, an American AI
pioneer, pointed out that perceptrons could only handle linear classification problems and
could not handle even the simplest exclusive OR (XOR) problem. These doubts effectively
sentenced the Perceptron algorithm to death, and also brought a "cold winter" to neural
network research that lasted nearly 20 years.
It wasn't until 1986 that Hinton's Multilayer Perceptron (MLP) changed the situation.
Hinton proposed using the sigmoid function to perform nonlinear mapping on the
output of perceptrons, which effectively solved the problem of nonlinear classification and
learning. In addition, Hinton, together with Rumelhart and Williams, popularized the
backpropagation (BP) algorithm for MLP training. This algorithm and its derivatives are
still used for deep neural network training today. In 1989, Robert Hecht-Nielsen proved
the universal approximation theorem. According to the theorem, any continuous function f
on a closed interval can be approximated by a BP network with one hidden layer. In short,
neural networks can fit any continuous function. By 1998, a variety of neural
networks had emerged, including the well-known convolutional neural network (CNN) and
recurrent neural network (RNN). However, because training excessively deep neural networks
can lead to gradient vanishing and gradient explosion, neural networks once again faded
out.
2006 was a significant year for deep learning. In that year, Hinton proposed a solution to
gradient vanishing in deep network training: a combination of unsupervised pre-training
and supervised fine-tuning. In 2012, AlexNet, proposed by Hinton's project team, won the
top-class image recognition competition, the ImageNet Large Scale Visual Recognition
Challenge, setting off a wave of interest in deep learning. In 2016, the deep
learning AI program AlphaGo developed by Google beat the Go world champion Lee
Sedol, a 9-dan professional player, further promoting the popularity of deep learning.

1.1.4 Perceptron Algorithm


The single-layer perceptron is the simplest neural network. As shown in Figure 1-4, the
input vector $X = [x_0, x_1, \dots, x_n]^T$ and the weight vector $W = [w_0, w_1, \dots, w_n]^T$
are first used to calculate an inner product, which is denoted as net. $x_0$ is generally
fixed at 1, and $w_0$ is referred to as the offset (bias). For regression problems, net can
be used directly as the output of the perceptron, while for classification problems, net is
first passed through the activation function Sgn(net). The Sgn function returns 1 when its
argument is greater than 0 and –1 otherwise.

Figure 1-4 Perceptrons


The perceptron shown in Figure 1-4 is equivalent to a classifier. It uses the
high-dimensional X vector as input and performs binary classification on input samples in
high-dimensional space. Specifically, when $W^T X > 0$, Sgn(net) is equal to 1 and the
sample is classified into the positive class. Otherwise, Sgn(net) is equal to –1 and the
sample is classified into the negative class. The boundary between the two classes is
$W^T X = 0$, a hyperplane in high-dimensional space.
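
The classification rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original material: the function names `sgn` and `perceptron` and the weight values chosen for the AND operation are assumptions made for the example.

```python
import numpy as np

def sgn(net):
    """Sgn activation: 1 where net > 0, otherwise -1."""
    return np.where(net > 0, 1, -1)

def perceptron(W, X):
    """Single-layer perceptron: inner product followed by Sgn.

    X[0] is fixed at 1, so W[0] plays the role of the offset.
    """
    net = W @ X
    return int(sgn(net))

# Hypothetical weights that realize the AND operation:
# net = x1 + x2 - 1.5 is positive only when x1 = x2 = 1.
W_and = np.array([-1.5, 1.0, 1.0])

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    X = np.array([1.0, x1, x2])
    print((x1, x2), perceptron(W_and, X))
```

Only the input (1, 1) lies on the positive side of the hyperplane $W^T X = 0$, so it is the only sample assigned to the positive class.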

Figure 1-5 XOR problem


A perceptron is essentially a linear model, which can handle only linear classification but
not nonlinear data. As shown in Figure 1-5, the perceptron can easily find a straight line
to classify AND and OR operations correctly, but it cannot handle XOR operations. In
1969, Minsky used such a simple example to prove the limitations of perceptrons.

Figure 1-6 MLP


To enable perceptrons to process nonlinear data, the MLP (namely, a feedforward neural
network, FNN) was invented, as shown in Figure 1-6. The FNN is one of the simplest neural
networks, in which neurons (perceptrons) are arranged hierarchically. It is one of the most
widely used and rapidly developed artificial neural networks. The three leftmost neurons in
Figure 1-6 form the input layer of the entire network. The neurons at the input layer do
not have a computing function, and are only used to represent the component values of
the input vector. Nodes at other layers than the input layer represent neurons with the
computing function, and are referred to as computing units. Each layer of neurons
accepts only the output of the previous layer of neurons as input and provides output to
the next layer. Neurons at the same layer are not interconnected, and inter-layer
information can only be transmitted in one direction.

Figure 1-7 MLP for solving XOR problems


Only a very simple MLP is needed to solve the XOR problem. The left part in Figure 1-7
shows the structure of an MLP. The solid line indicates that the weight is 1, the dashed
line indicates that the weight is –1, and the number in a circle indicates an offset. For
example, for the point (0, 1):
𝑥1 = 0, 𝑥2 = 1
The output of the purple neuron is as follows:
𝑆𝑔𝑛( 𝑥1 + 𝑥2 − 1.5) = 𝑆𝑔𝑛( − 0.5) = −1
The coefficients of 𝑥1 and 𝑥2 are both 1 because the two lines on the left of the purple
neuron are solid lines. The output of the yellow neuron is as follows:
𝑆𝑔𝑛( − 𝑥1 − 𝑥2 + 0.5) = 𝑆𝑔𝑛( − 0.5) = −1
The coefficients of 𝑥1 and 𝑥2 are both –1 because the two lines on the left of the yellow
neuron are dashed lines. The output of the rightmost neuron is as follows:
𝑆𝑔𝑛( − 1 − 1 + 1) = 𝑆𝑔𝑛( − 1) = −1
In the preceding formula, the two numbers –1 on the left are the outputs of the
purple and yellow neurons, and the number +1 is the offset of the output neuron. You
can verify that the outputs of the MLP for (0, 0), (1, 0), and (1, 1) are 1, –1, and 1,
respectively. In this example, an output of –1 corresponds to an XOR result of 1 and an
output of 1 corresponds to an XOR result of 0, so the outputs are consistent with the XOR
operation. The purple and yellow neurons correspond to the purple and yellow lines in the
right part of Figure 1-7, respectively, so that linear classifiers are combined to classify
nonlinear samples. As the number of hidden layers increases, the nonlinear classification
capability of the neural network is gradually enhanced, as shown in Figure 1-8.
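
The worked example can be checked numerically. The sketch below assumes the weights and offsets read off the description of Figure 1-7 (hidden weights of 1 and –1, hidden offsets of –1.5 and 0.5, output weights of 1, output offset of 1). The variable and function names are illustrative only.

```python
import numpy as np

def sgn(net):
    """Sgn activation: 1 where net > 0, otherwise -1."""
    return np.where(net > 0, 1, -1)

# Hidden layer: row 0 is the "purple" neuron, row 1 is the "yellow" neuron.
W_hidden = np.array([[1.0, 1.0],
                     [-1.0, -1.0]])
b_hidden = np.array([-1.5, 0.5])

# Output neuron: both incoming weights are 1, offset is +1.
w_out = np.array([1.0, 1.0])
b_out = 1.0

def xor_mlp(x1, x2):
    """Forward pass of the two-layer MLP described in the text."""
    hidden = sgn(W_hidden @ np.array([x1, x2], dtype=float) + b_hidden)
    return int(sgn(w_out @ hidden + b_out))

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, xor_mlp(*point))
```

The network returns –1 exactly for the inputs whose two components differ, which is the pattern that a single perceptron cannot separate.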

Figure 1-8 Neural network with multiple hidden layers

1.2 Training Rules


The core of machine learning model training is the loss function, and deep learning is no
exception. This section describes the rules for model training based on the loss function
in deep learning, including the gradient descent method and BP algorithm.

1.2.1 Loss Function


During training of a deep neural network, you first need to build a function to describe
the target classification error, which is the loss function (error function). The loss function
reflects the error between the target output and the actual output of a perceptron. The
most common error function is the mean squared error function.
$$J(w) = \frac{1}{2n} \sum_{x \in X,\, d \in D} (t_d - o_d)^2$$

In the formula, w is the model parameter, X is the training sample set, n is the size of X,
D is the collection of neurons at the output layer, t is the target output, and o is the
actual output. Although w does not appear on the right side of the formula, the actual
output o is calculated by the model and therefore depends on the value of w. The input x
and the target output t, by contrast, are constants once the training sample is given.
The value of the loss function thus varies with w, so the
independent variable of the error function is w. The mean square error loss function is
characterized by using the sum of squared errors as its main body, where an error
refers to the difference between the target output t and the actual output o. The
coefficient 1/2 in the formula may seem puzzling at first. As described below, this
coefficient gives the derivative of the loss function a more concise form: the coefficient
1/2 is multiplied by the exponent 2, which yields 1.
Cross entropy loss is another commonly used loss function.
$$J(w) = -\frac{1}{n} \sum_{x \in X,\, d \in D} \left( t_d \ln o_d + (1 - t_d) \ln(1 - o_d) \right)$$

The meanings of the symbols are the same as those in the mean square error loss
function. The cross entropy loss expresses the distance between two probability
distributions. In general, the mean square error loss function is mainly used for regression
problems, while the cross entropy loss function is more often used for classification problems.
The objective of model training is to search for a weight vector that minimizes the
loss function. However, the neural network model is highly complex, and there is generally no
effective way to obtain an analytical solution. Therefore, the gradient
descent method is needed to find the minimum value of the loss function.
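
As an illustration, the two loss functions can be written directly from their formulas. This is a sketch under stated assumptions: the arrays `t` and `o` have shape (n, |D|), and the small constant `eps` used to avoid evaluating ln(0) is an addition that does not appear in the formulas.

```python
import numpy as np

def mse_loss(t, o):
    """Mean squared error: J = 1/(2n) * sum over samples and outputs of (t - o)^2."""
    n = t.shape[0]
    return np.sum((t - o) ** 2) / (2 * n)

def cross_entropy_loss(t, o, eps=1e-12):
    """Cross entropy: J = -1/n * sum of t*ln(o) + (1 - t)*ln(1 - o).

    eps clips the outputs away from 0 and 1 (an assumption for numerical safety).
    """
    n = t.shape[0]
    o = np.clip(o, eps, 1.0 - eps)
    return -np.sum(t * np.log(o) + (1.0 - t) * np.log(1.0 - o)) / n

# Two samples, one output neuron each.
t = np.array([[1.0], [0.0]])
o = np.array([[0.8], [0.2]])

print(mse_loss(t, o))            # (0.04 + 0.04) / 4
print(cross_entropy_loss(t, o))  # -ln(0.8)
```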

1.2.2 Gradient Descent Method


The gradient of the multivariate function 𝑓(𝑥1 , 𝑥2 , … , 𝑥𝑛 ) at X is as follows:
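$$\nabla f(x_1, x_2, \dots, x_n) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right]^T \Bigg|_X$$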
The gradient vector points in the direction in which the function increases fastest. As a
result, the negative gradient vector $-\nabla f$ points in the direction of fastest descent.
The gradient descent method moves the parameters along
the negative gradient direction and updates them iteratively, gradually minimizing
the loss function.
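
The update rule can be illustrated on a simple function whose gradient is known in closed form. The function $f(x_1, x_2) = x_1^2 + 2x_2^2$, the step size, and the number of steps below are assumptions chosen for the example, not values from the course material.

```python
import numpy as np

def grad_f(x):
    """Gradient of f(x1, x2) = x1^2 + 2 * x2^2."""
    return np.array([2.0 * x[0], 4.0 * x[1]])

def gradient_descent(x0, eta=0.1, steps=100):
    """Repeatedly step along the negative gradient with learning rate eta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

x_min = gradient_descent([3.0, -2.0])
print(x_min)  # close to the minimizer (0, 0)
```

With this learning rate each coordinate shrinks by a constant factor per step, so the iterate approaches the minimum at the origin. A learning rate that is too large would make the iteration diverge instead.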
Each sample in the training sample set X is denoted as <x, t>, where x is the input vector,
t is the target output, o is the actual output, and 𝜂 is the learning rate. Figure 1-9 shows
the pseudocode of the batch gradient descent (BGD) algorithm.

Figure 1-9 BGD method


The BGD algorithm is the result of directly applying gradient descent to deep learning,
and is actually uncommon in practice. Its main problem is that the gradient must be
computed over all training samples each time the weight is updated, and therefore
convergence is very slow. To address this disadvantage, the stochastic gradient descent
(SGD, also known as incremental gradient descent) method is used, which is a common
gradient descent variant. Figure 1-10 shows the pseudocode of the SGD method.

Figure 1-10 SGD method


The SGD algorithm selects one sample at a time to update the weights. One
advantage of this practice is that the dataset can be expanded during model training.
This mode of training the model while data is being collected is called online learning.
Compared with the BGD algorithm, the SGD algorithm increases the frequency of weight
updates, but moves to the other extreme. Most training samples contain noise. The BGD
method can reduce the impact of noise by averaging the gradients of multiple samples.
However, the SGD method considers only a single sample each time the weight is
updated. As a result, when the parameters approach the extremum, the updates oscillate
around it and have difficulty converging to it.

Figure 1-11 Mini-batch gradient descent method


In practice, the most commonly used gradient descent algorithm is the mini-batch
gradient descent (MBGD) algorithm, as shown in Figure 1-11. To address the
disadvantages of the foregoing two gradient descent algorithms, the MBGD algorithm
uses a small batch of samples each time the weight is updated, balancing the efficiency
and the stability of the gradient. The batch size varies with the specific
problem. A typical value is 128.
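
The three variants differ only in how many samples contribute to each weight update, which the following sketch makes explicit. It is not the pseudocode from Figures 1-9 to 1-11: it assumes a linear model o = Xw with the mean squared error loss, and the function name `minibatch_gd`, the synthetic data, and the hyperparameter values are all illustrative.

```python
import numpy as np

def minibatch_gd(X, t, batch_size, eta=0.5, epochs=200, seed=0):
    """Gradient descent on the MSE loss of a linear model o = X @ w.

    batch_size = len(X) gives BGD, batch_size = 1 gives SGD,
    and anything in between gives MBGD.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)          # shuffle samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            o = X[idx] @ w                  # actual output for the batch
            # Gradient of 1/(2m) * sum (t - o)^2 over the m samples in the batch
            grad = -(X[idx].T @ (t[idx] - o)) / len(idx)
            w -= eta * grad
    return w

# Noise-free synthetic data: a column of ones (offset) plus one feature.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(64), rng.uniform(-1.0, 1.0, 64)])
w_true = np.array([2.0, -3.0])
t = X @ w_true

for name, bs in [("BGD", 64), ("SGD", 1), ("MBGD", 16)]:
    print(name, minibatch_gd(X, t, bs))
```

Because the synthetic targets contain no noise, all three settings recover the same weights here. The oscillation of SGD near the extremum described above only becomes visible when the targets are noisy.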

1.2.3 BP Algorithm
The gradient of the loss function needs to be calculated when the gradient descent
algorithm is used. For conventional machine learning algorithms, such as linear
regression and support vector machine (SVM), manual calculation of gradients is
sometimes feasible. However, the neural network model function is complex, and the
gradient of the loss function with respect to all parameters cannot be written as a
single formula. Hinton and his collaborators therefore proposed the BP algorithm, which
effectively accelerates the training of neural networks by updating weight values layer by
layer during the backpropagation process.

Figure 1-12 Backpropagation of errors


As shown in Figure 1-12, the backpropagation direction is opposite to the forward
propagation direction. For each sample <x, t> in the training sample set X, an output
provided by the model is denoted as o. Assume that the loss function is the mean square
error loss function.
$$J(w) = \frac{1}{2n} \sum_{x \in X,\, d \in D} (t_d - o_d)^2$$

Assume that there are L layers in the model (the input layer is excluded), and the
parameter of the lth layer is denoted as $w_l$. J(w) has not reached its
minimum value during iteration because the parameters w of each layer deviate from their
optimal values. That is, the loss function value results from errors
in the parameter values. In the forward propagation process, each layer contributes a certain
error. These errors accumulate layer by layer and appear as the value of the loss
function at the output layer. Without a given model function, we cannot determine the
relationship between the loss function and the parameters, but we can determine the
relationship $\partial J/\partial o$ between the loss function and the model output. This is a key step in
understanding the BP algorithm.
Assuming that an output of the last but one layer is 𝑜′, and an activation function of the
output layer is f, the loss function may be expressed as follows:
$$J(w) = \frac{1}{2n} \sum_{x \in X,\, d \in D} \left( t_d - f(w_L o'_d) \right)^2$$

$o'_d$ depends only on $w_1, w_2, \dots, w_{L-1}$. As illustrated, the loss function is split into two
parts: a part caused by $w_L$ and a part caused by the other parameters. The latter
accumulates the errors of the earlier layers and acts on the loss function through the
output of the last but one layer. From $\partial J/\partial o$ obtained above, $\partial J/\partial o'$ and
$\partial J/\partial w_L$ can be easily calculated. In this way, the gradient of the loss function
with respect to the parameters of the output layer is obtained. Note that the derivative
value $f'(w_L o'_d)$ of the activation function enters the calculation of $\partial J/\partial o'$ and
$\partial J/\partial w_L$ as a multiplicative factor. When the derivative of the activation function
is always less than 1 (as is the case with the sigmoid function), the value of
$\partial J/\partial o$ becomes increasingly small as it is propagated backward through the layers.
This phenomenon is called gradient vanishing, which will be described in more detail below.
The gradients for the parameters of the other layers can be obtained similarly, based on
the relationship between 𝜕𝐽/𝜕𝑜′ and 𝜕𝐽/𝜕𝑜′′. Intuitively, the BP algorithm is the process
of distributing errors layer by layer. It is essentially an algorithm that uses the chain
rule to calculate the gradient of the loss function with respect to the parameters of each layer.
Generally, the BP algorithm is shown in Figure 1-13.

Figure 1-13 BP algorithm


In the formula, ⊙ indicates element-wise multiplication, and f is the activation function.
Notably, the output of the ith layer is also the input of the (i+1)th layer. The output of the
0th layer is defined as the input of the entire network. In addition, when the activation
function is sigmoid, the following can be proved:
$$f'(x) = f(x)(1 - f(x))$$
Therefore, $f'(o[l-1])$ in the algorithm can also be expressed as $o[l](1 - o[l])$.
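
The layer-by-layer procedure can be sketched as follows. This is not a transcription of Figure 1-13: it assumes a fully connected network with sigmoid activations, no offsets, and the mean squared error loss for a single sample, and the function names are illustrative. A finite-difference check is included to confirm that the backpropagated gradients match numerical ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, x):
    """Return the outputs o[0..L] of every layer; o[0] is the network input."""
    outputs = [x]
    for W in weights:
        outputs.append(sigmoid(W @ outputs[-1]))
    return outputs

def loss(weights, x, t):
    """Single-sample MSE loss: J = 1/2 * sum (t - o)^2."""
    return 0.5 * np.sum((t - forward(weights, x)[-1]) ** 2)

def backprop(weights, x, t):
    """Gradients of J with respect to every weight matrix, via the chain rule."""
    outputs = forward(weights, x)
    grads = [None] * len(weights)
    dJ_do = outputs[-1] - t                      # dJ/do at the output layer
    for l in reversed(range(len(weights))):
        o = outputs[l + 1]
        delta = dJ_do * o * (1.0 - o)            # element-wise; f' = o(1 - o)
        grads[l] = np.outer(delta, outputs[l])   # dJ/dW for this layer
        dJ_do = weights[l].T @ delta             # pass the error to the previous layer
    return grads

def numerical_grads(weights, x, t, h=1e-6):
    """Central finite differences, used only to check backprop."""
    grads = []
    for W in weights:
        g = np.zeros_like(W)
        for idx in np.ndindex(*W.shape):
            old = W[idx]
            W[idx] = old + h
            plus = loss(weights, x, t)
            W[idx] = old - h
            minus = loss(weights, x, t)
            W[idx] = old
            g[idx] = (plus - minus) / (2.0 * h)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
x, t = np.array([0.5, -0.2]), np.array([1.0])
for g_bp, g_num in zip(backprop(weights, x, t), numerical_grads(weights, x, t)):
    print(np.max(np.abs(g_bp - g_num)))
```

Each pass through the loop multiplies the propagated error by the factor o(1 − o), which is the mechanism behind gradient vanishing when many sigmoid layers are stacked.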

1.3 Activation Function


Activation functions play an important role in enabling a neural network model to learn
and represent highly complex nonlinear functions, because they introduce nonlinearity into
the neural network. If no activation function is used, the neural network can represent
only a linear function regardless of the number of layers it has. The complexity of a
linear function is limited, and its capability of learning complex function mappings from
data is low. This section describes common activation functions of deep learning and their
advantages and disadvantages. You can use them as required.

Figure 1-14 Activation functions


As shown in the upper left part of Figure 1-14, the sigmoid function was the most
commonly used activation function in the early stage of FNN research. As in the
logistic regression model, the sigmoid function can be used at the output
layer to implement binary classification. The sigmoid function is monotonic, continuous,
and easy to differentiate. Its output is bounded, so the network converges easily.
However, the derivative of the sigmoid function approaches 0 at locations far from
the origin. When the network is very deep, more and more neurons fall into the
saturation region during BP training, which makes the gradient modulus increasingly
small. Generally, if a sigmoid network has more than five layers, the gradient degrades
to nearly 0 and the network becomes difficult to train. This phenomenon is called
gradient vanishing. Another defect of sigmoid is that its output is not zero-centered.
As shown in the upper middle part of Figure 1-14, tanh is a major substitute for the
sigmoid function. The tanh activation function corrects the defect that the output of
sigmoid is not zero-centered. In the gradient descent algorithm, the tanh activation
function is closer to the natural gradient, thereby reducing the required number
of iterations. However, similar to sigmoid, the tanh function saturates easily.
As shown in the upper right part of Figure 1-14, the Softsign function reduces the
tendency of the tanh and sigmoid functions to saturate to some extent. However, the
Softsign, tanh, and sigmoid activation functions all easily cause gradient vanishing. The
derivative of each of these activation functions approaches 0 at locations far away from
the function center. As a result, the weights can hardly be updated.
As shown in the lower left part of Figure 1-14, the Rectified Linear Unit (ReLU) function
is the most widely used activation function at present. Compared with sigmoid and other
activation functions, the ReLU function does not have an upper bound. Therefore, the
neurons do not saturate in the positive domain. This effectively alleviates the gradient
vanishing problem and enables quick convergence in the gradient descent algorithm.
Experiments show that neural networks using the ReLU activation function can perform
well without unsupervised pre-training. In addition, functions such as sigmoid require an
exponential operation, so their computation cost is relatively high. The ReLU activation
function greatly reduces the computation workload. Although the ReLU function has many
advantages, its disadvantages are also obvious. Because the ReLU function does not have
an upper bound, training with ReLU diverges easily. Moreover, the ReLU function is not
differentiable at 0. As a result, the ReLU function is not smooth enough for some
regression problems. Most importantly, the value of the ReLU function is constantly 0 in
the negative domain, so the gradient there is also 0, which may result in neuron death.
As shown in the lower middle part of Figure 1-14, the Softplus function is a smooth
modification of the ReLU function. Although the Softplus function requires more
computation than the ReLU function, it has a continuous derivative and a relatively
smooth surface.
The softmax function is an extension of the sigmoid function in high dimensions. The
softmax function is used to map any K-dimensional real number vector to a K-
dimensional probability distribution. Therefore, the softmax function is often used as the
output layer of a multiclass classification task.
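The functions discussed in this section can be written down in a few lines. The sketch
below uses the standard textbook definitions, which are assumed here because the formulas
in Figure 1-14 are not reproduced in the text. The function names are illustrative, and
the scalar functions are intended for moderate input values only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def softsign(x):
    return x / (1.0 + abs(x))


def relu(x):
    return max(0.0, x)


def softplus(x):
    # log(1 + exp(x)), a smooth approximation of ReLU
    return math.log1p(math.exp(x))


def softmax(z):
    """Map a K-dimensional real vector to a K-dimensional probability distribution."""
    shift = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("softsign", softsign),
                ("relu", relu), ("softplus", softplus)]:
    print(name, [round(f(x), 4) for x in (-2.0, 0.0, 2.0)])

print("softmax", softmax([1.0, 2.0, 3.0]))
```

Note that sigmoid and ReLU outputs are never negative, while tanh and Softsign are
zero-centered, which matches the comparison made above.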

1.4 Regularization
Regularization is a very important and effective technique in machine learning to reduce
the generalization error. Compared with conventional machine learning models, a deep
learning model generally has a larger capacity, and is therefore more prone to
overfitting. To this end, researchers have proposed many effective techniques to prevent
overfitting, including:
• Adding constraints to parameters, such as L1 and L2 norm penalties.
• Expanding the training dataset, for example by adding noise or transforming data.
• Dropout.
• Stopping the training in advance (early stopping).
This section describes these methods one by one.

1.4.1 Parameter Penalty


Many regularization methods restrict the learning capability of a model by adding a
parameter penalty term Z(w) to the objective function J.
𝐽̃ = 𝐽 + 𝑎𝑍(𝑤)
In the formula, a is a non-negative penalty coefficient. The value of a measures the
relative contribution of the penalty term Z and the standard objective function J to the
total objective function. If a is set to 0, regularization is not used. A larger value of a
indicates greater regularization strength. a is a hyperparameter. Note that, in deep
learning, a constraint is generally added only to the affine parameter w but not to
the bias term. This is because the bias term typically requires only a small amount of
data for precise fitting, and constraining it often leads to underfitting.
Different regularization methods are obtained from different choices of Z. This
section describes two types of regularization: L1 regularization and L2 regularization. In
linear regression models, L1 regularization yields Lasso regression, and L2 regularization
yields ridge regression. L1 and L2 refer to norms. The L1 norm of a vector is defined as:

$$\|w\|_1 = \sum_i |w_i|$$

This formula represents the sum of the absolute values of all elements in the vector. It
can be shown that the gradient of the L1 norm is Sgn(w) wherever no element of w is 0.
In this way, the gradient descent method can be used to solve the L1 regularization model.
The L2 norm is the familiar Euclidean norm:

$$\|w\|_2 = \sqrt{\sum_i w_i^2}$$

The L2 norm is widely used, and is often denoted as ||w|| with the subscript omitted.
However, the gradient of the L2 norm itself is complex, so the penalty term in L2
regularization is generally defined as half the squared norm:

$$Z(w) = \frac{1}{2}\|w\|^2$$
The derivative of this penalty term with respect to w is simply w. Therefore, when
gradient descent is performed on the L2 regularization model, the weight update formula
should be changed to the following:
𝑤 = (1 − 𝜂𝑎)𝑤 − 𝜂𝛻𝐽
Compared with the normal gradient update formula, the preceding formula is equivalent
to multiplying the parameter by a reduction factor, thereby limiting the parameter
growth.
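The equivalence between the weight update above and an ordinary gradient step on the
penalized objective can be sketched as follows. This is an illustrative example only: the
function names are hypothetical, eta denotes the learning rate, a the penalty coefficient,
and grad_j a gradient of the unpenalized objective J that is simply made up here.

```python
def l2_weight_decay_step(w, grad_j, eta, a):
    """Update written as in the formula above: w = (1 - eta*a) * w - eta * grad J."""
    return [(1.0 - eta * a) * wi - eta * gi for wi, gi in zip(w, grad_j)]


def l2_explicit_step(w, grad_j, eta, a):
    """Same step computed from the full gradient of J + a * 0.5 * ||w||^2."""
    return [wi - eta * (gi + a * wi) for wi, gi in zip(w, grad_j)]


def l1_penalty_grad(w):
    """Sgn(w): the (sub)gradient of the L1 norm, taking 0 at w_i = 0."""
    return [(wi > 0) - (wi < 0) for wi in w]


w = [0.5, -1.2, 3.0]
grad_j = [0.1, -0.4, 0.25]  # made-up gradient values for illustration
print(l2_weight_decay_step(w, grad_j, eta=0.1, a=0.01))
print(l2_explicit_step(w, grad_j, eta=0.1, a=0.01))
print(l1_penalty_grad(w))
```

Both L2 functions produce the same result up to rounding, which shows that the reduction
factor (1 - eta*a) is just another way of writing the gradient of the penalty term.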

Figure 1-15 Geometric meaning of parameter penalty


Figure 1-15 shows the difference between L1 regularization and L2 regularization. In the
figure, the contour line indicates the standard objective function J, and the black solid
line indicates the regular term. The geometric meaning of the parameter penalty is that,
for any point in the feature space, not only the value of the standard objective function
corresponding to the point but also the size of the geometric graph corresponding to the
regular term of the point need to be considered. It is easy to imagine that, when the
penalty coefficient a becomes larger, the black shape shows a stronger tendency to get
smaller, and the parameter gets closer to the origin.
As shown in Figure 1-15, the parameter at which the L1 regularization model stabilizes
is highly likely to lie at a corner point of the square. This means that the
parameters of the L1 regularization model are likely to be sparse. In the example in the
figure, the value of w1 corresponding to the optimal parameter is 0. Therefore, L1
regularization can be used for feature selection.
From the perspective of probability distribution, many norm constraints are equivalent to
adding prior distributions to parameters. The L2 norm penalty corresponds to a Gaussian
prior distribution on the parameters, and the L1 norm penalty corresponds to a Laplacian
prior distribution.

1.4.2 Dataset Expansion


The most effective way to prevent overfitting is to enlarge the training set. A model
trained on a larger training set is less likely to overfit. However, collecting data
(especially labeled data) is time-consuming and expensive. Dataset expansion saves time,
but the methods vary across fields.
In the field of object recognition, common dataset expansion methods include image
rotation and scaling. The premise for changing an image is that the image class remains
the same after the change. In handwritten digit recognition, digits 6 and 9 are easily
confused after rotation and require extra attention. In speech recognition, random noise
is often added to the input data. In natural language processing, a common idea is to
replace words with synonyms.
Noise injection is a common method for dataset expansion. The noise injection object can
be the input, a hidden layer, or the output layer. For the softmax classification problem,
noise can be added to the output layer by using the label smoothing technology.
Assuming that there are a total of K candidate classes in a classification problem, the
standard output provided by the dataset is generally represented as a K-dimensional
vector through one-hot encoding. The element corresponding to the correct class is 1,
and the other elements are 0. With noise added, the element corresponding to the correct
class becomes 1 – (K – 1)e/K, and each of the other elements becomes e/K, where e is a
sufficiently small constant. Intuitively, label smoothing narrows the difference between
the label values of the correct class and the wrong classes. This is equivalent to
increasing the difficulty of model training. For an overfitted model, increasing the
difficulty can effectively alleviate overfitting and further improve the model performance.
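The label smoothing rule described above can be sketched in a few lines. The function
name smooth_labels and the value e = 0.1 are illustrative choices, not taken from the
original material.

```python
def smooth_labels(correct_class, num_classes, e=0.1):
    """Return a smoothed label vector for a problem with num_classes (K) classes.

    The correct class receives 1 - (K - 1) * e / K and every other class
    receives e / K, so the vector still sums to 1.
    """
    off_value = e / num_classes
    on_value = 1.0 - (num_classes - 1) * e / num_classes
    return [on_value if k == correct_class else off_value
            for k in range(num_classes)]


one_hot = [0.0, 1.0, 0.0, 0.0]
smoothed = smooth_labels(correct_class=1, num_classes=4, e=0.1)
print(one_hot)
print(smoothed)
```

With K = 4 and e = 0.1, the correct class is labeled 0.925 instead of 1, and each wrong
class is labeled 0.025 instead of 0.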

1.4.3 Dropout
Dropout is a common regularization method that is simple to compute. It has been widely
used since 2014. Put simply, Dropout randomly discards the output of some neurons
during training. The parameters of the discarded neurons are not updated in that step.
By randomly discarding neuron outputs, Dropout constructs a series of subnets with
different structures, as shown in Figure 1-16. These subnets are merged in a certain
manner within the same deep neural network, which is equivalent to adopting the ensemble
learning method. When the trained model is used, we want to draw on the collective
wisdom of all the trained subnets, so random discarding is no longer applied.

Figure 1-16 Dropout


Compared with parameter penalty, Dropout has lower computational complexity and is
easier to implement. The randomness of Dropout during training is neither a
sufficient condition nor a necessary condition for its effect: fixed masking parameters
can also be constructed to obtain a sufficiently good model. Generally, Dropout performs
better when the activation function is close to a linear function.
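The training-time and inference-time behavior of Dropout can be sketched as follows. This
is a minimal illustration rather than a reference implementation. The rescaling of the
kept outputs by 1/keep_prob (often called inverted dropout) is an assumption that is not
described in the text above; it keeps the expected output the same in both modes. The
function name and parameters are hypothetical.

```python
import random


def dropout(outputs, drop_prob, training, rng):
    """Apply dropout to a list of neuron outputs.

    During training, each output is zeroed with probability drop_prob and the
    survivors are scaled by 1 / (1 - drop_prob). At inference time the outputs
    pass through unchanged, because random discarding is no longer used.
    """
    if not training or drop_prob == 0.0:
        return list(outputs)
    keep_prob = 1.0 - drop_prob
    return [o / keep_prob if rng.random() < keep_prob else 0.0 for o in outputs]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [0.2, 1.5, -0.7, 0.9, 0.3]
print(dropout(hidden, drop_prob=0.5, training=True, rng=rng))   # one random subnet
print(dropout(hidden, drop_prob=0.5, training=True, rng=rng))   # another subnet
print(dropout(hidden, drop_prob=0.5, training=False, rng=rng))  # full network
```

Each training call samples a different mask, which corresponds to training a different
subnet of the same network.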

1.4.4 Early Stopping of Training


During training, the model can be evaluated periodically on the validation data. As
shown in Figure 1-17, when the loss function of the validation data
starts to rise, the training can be stopped in advance to avoid overfitting. However,
stopping the training in advance also brings the risk of underfitting. This is because the
number of samples in the validation set is often insufficient. As a result, the training is
often not stopped exactly at the moment when the model generalization error is the
smallest. In extreme cases, the error of the model on the validation set may start to
decrease quickly after a small rise, and stopping the training in advance then results in
underfitting of the model.
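The stopping rule can be sketched as a small function over a sequence of validation
losses. The patience parameter (how many evaluations without improvement are tolerated
before stopping) is an assumption added for illustration; it is one common way to reduce
the risk of stopping on a small temporary rise. The loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch seen before training is stopped.

    Training stops once the validation loss has not improved for
    `patience` consecutive evaluations.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


# Made-up validation curve: a small rise at epoch 3, then further improvement.
losses = [1.0, 0.8, 0.7, 0.72, 0.65, 0.66, 0.67, 0.70, 0.60]
print(early_stopping_epoch(losses, patience=1))  # stops at the first rise
print(early_stopping_epoch(losses, patience=3))  # tolerates the small rise
```

With patience 1 the function returns epoch 2 and misses the better model at epoch 4,
which illustrates the underfitting risk described above. A larger patience trades extra
training time for a lower chance of stopping too early.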

Figure 1-17 Early stopping of training

1.5 Optimizers
There are various optimized versions of the gradient descent algorithm. In object-oriented
implementations, each gradient descent variant is often encapsulated
into an object called an optimizer. Common optimizers include the SGD
optimizer, momentum optimizer, Nesterov, Adagrad, Adadelta, RMSprop, Adam, AdaMax,
and Nadam. These optimizers mainly improve the convergence speed of the algorithm
and its stability after converging to a local extremum, and
reduce the difficulty of adjusting hyperparameters. This section describes the design
of several of the most commonly used optimizers.

1.5.1 Momentum Optimizer


The momentum optimizer is a basic improvement to the gradient descent algorithm. A
momentum term is added to the weight update formula, as shown in Figure 1-18. If the
weight variation during the nth iteration is d(n), the weight update rule is changed to the
following:
𝑑(𝑛) = −𝜂𝛻𝑤 𝐽 + 𝑎𝑑(𝑛 − 1)
In the formula, a is a constant between 0 and 1, called the momentum. ad(n–1) is referred
to as the momentum term. Imagine a small ball rolling down from a random point on the
error surface. The common gradient descent algorithm is equivalent to moving the ball
along the curve without any inertia, which does not conform to physical laws. In reality,
the ball accumulates momentum as it rolls down and thus gains velocity in the downhill
direction.

Figure 1-18 Function of the momentum term


In a region where the gradient direction is stable, the ball rolls more and more quickly.
This helps the ball cross flat regions quickly and accelerates model convergence.
Moreover, as shown in Figure 1-19, the momentum term corrects the direction of the
gradient and reduces sudden changes. In addition, the ball with inertia is more likely to
roll over some narrow local extrema, making it less likely for the model to get stuck in
a local extremum.

Figure 1-19 Accelerating model convergence by the momentum term


The momentum optimizer is disadvantageous in that the momentum term may cause the
ball to cross the optimal solution and additional iterations are required for convergence.
Besides, the learning rate and momentum a of the momentum optimizer still need to be
manually set, and more experiments are usually needed to determine a proper value.
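The momentum update rule d(n) = -eta * grad + a * d(n-1) can be sketched on a toy
one-dimensional problem. The objective J(w) = w^2, the starting point, and the
hyperparameter values below are illustrative assumptions, and the function name is
hypothetical.

```python
def momentum_descent(grad, w0, eta=0.1, a=0.9, steps=300):
    """Minimize a 1-D function with the momentum update from the text.

    d(n) = -eta * grad(w) + a * d(n-1), followed by w = w + d(n).
    """
    w, d = w0, 0.0
    for _ in range(steps):
        d = -eta * grad(w) + a * d   # new weight variation with momentum term
        w = w + d
    return w


def grad_j(w):
    """Gradient of the toy objective J(w) = w ** 2."""
    return 2.0 * w


print(momentum_descent(grad_j, w0=5.0))           # with momentum
print(momentum_descent(grad_j, w0=5.0, a=0.0))    # plain gradient descent
```

On this toy problem the momentum version overshoots the minimum at w = 0 and oscillates
before settling, which is the drawback mentioned above. Setting a = 0 recovers the
ordinary gradient descent update.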

1.5.2 AdaGrad Optimizer


A characteristic common to the SGD algorithm, the MBGD algorithm, and the
momentum optimizer is that each parameter is updated at the same learning rate.
Adagrad considers that different learning rates should be set for different parameters.
The gradient update formula of Adagrad is generally written as follows:
$$\Delta w = -\frac{\eta}{e + \sqrt{\sum_{i=1}^{n} g^2(i)}}\, g(n)$$
In the formula, g(n) represents the gradient dJ/dw of the cost function in the nth
iteration, and e is a small constant. As the value of n increases, the denominator in the
formula gradually increases. Therefore, the weight update amplitude gradually decreases,
which is equivalent to dynamically reducing the learning rate. In the initial phase of
model training, the initial value is far from the optimal solution of the loss
function, so a high learning rate is required. As the number of
updates increases, the weight parameter gets closer to the optimal solution, so the
learning rate decreases accordingly. The advantage of Adagrad lies in its automatic
adjustment of the learning rate, but its disadvantage also comes from this. Because the
learning rate depends on the gradients of all previous iterations, it is likely that
the learning rate has already been reduced to nearly 0 while the weight parameter is
still far from the optimal solution. In this case, further optimization makes little
progress.
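The shrinking step size of Adagrad can be seen in a short sketch that applies the update
formula to a sequence of gradients for a single parameter. The function name and the
constant gradient sequence are illustrative assumptions.

```python
import math


def adagrad_updates(grads, eta=0.1, e=1e-8):
    """Return the weight change for each gradient g(1), g(2), ... in turn.

    delta_w(n) = -eta / (e + sqrt(sum of g(i)^2 for i <= n)) * g(n)
    """
    sq_sum = 0.0
    updates = []
    for g in grads:
        sq_sum += g * g                      # accumulate all squared gradients
        updates.append(-eta / (e + math.sqrt(sq_sum)) * g)
    return updates


# A constant gradient of 1.0: the step shrinks like eta / sqrt(n).
for n, dw in enumerate(adagrad_updates([1.0] * 5), start=1):
    print(n, dw)
```

Even though the gradient stays the same, the update keeps shrinking because the
accumulated sum in the denominator can only grow.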

1.5.3 RMSProp Optimizer


The RMSprop optimizer is an improvement to the Adagrad optimizer. An attenuation
coefficient is introduced to the algorithm of the RMSprop optimizer, so that the historical
gradient is attenuated by a certain proportion in each iteration. The gradient update
formula is as follows:
$$r(n) = b\,r(n-1) + (1-b)\,g^2(n)$$

$$\Delta w = -\frac{\eta}{e + \sqrt{r(n)}}\, g(n)$$
In the formula, b is an attenuation factor, and e is a small constant. Due to the effect of
the attenuation factor, r does not necessarily increase monotonically as n increases.
This solves the problem that the Adagrad optimizer stops too early, and makes RMSprop
suitable for handling non-stationary targets, especially in RNNs.
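For comparison with Adagrad, the RMSprop update can be sketched in the same way. The
function name, the value b = 0.9, and the constant gradient sequence are illustrative
assumptions.

```python
import math


def rmsprop_updates(grads, eta=0.1, b=0.9, e=1e-8):
    """Return the weight change for each gradient using the RMSprop formulas.

    r(n) = b * r(n-1) + (1 - b) * g(n)^2
    delta_w(n) = -eta / (e + sqrt(r(n))) * g(n)
    """
    r = 0.0
    updates = []
    for g in grads:
        r = b * r + (1.0 - b) * g * g        # attenuated history of squared gradients
        updates.append(-eta / (e + math.sqrt(r)) * g)
    return updates


ups = rmsprop_updates([1.0] * 200)
print(ups[0], ups[9], ups[-1])
```

With a constant gradient of 1.0, r(n) approaches 1 instead of growing without bound, so
the update settles near -eta rather than decaying toward 0 as it does with Adagrad.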

1.5.4 Adam Optimizer


Adaptive moment estimation (Adam) was developed based on the Adagrad and
Adadelta optimizers and is the most widely used optimizer at present. Adam calculates
an adaptive learning rate for each parameter, which is very useful in a complex
network structure. Different parts of a network differ in their sensitivity to weight
adjustment, and a very sensitive part generally requires a smaller learning rate.
Identifying the sensitive parts manually and specifying a dedicated learning rate for
each of them is difficult and tedious. The parameter update formula of the Adam
optimizer is similar to that of the RMSprop optimizer:
$$\Delta w = -\frac{\eta}{e + \sqrt{v(n)}}\, m(n)$$
In the formula, m and v represent the first-moment (mean) estimate and the second-moment
(uncentered variance) estimate of the historical gradient, respectively. Similar
to the attenuation formula used in RMSprop, m and v are defined as follows:

$$m(n) = a\,m(n-1) + (1-a)\,g(n)$$

$$v(n) = b\,v(n-1) + (1-b)\,g^2(n)$$

In form, m and v are the moving averages of the gradient and the squared gradient,
respectively. However, such definitions cause the algorithm to be unstable
during the first several iterations. Assuming that both m(0) and v(0) are 0, when a and b
are close to 1, m and v are very close to 0 in the initial iterations. To solve this
problem, the following bias-corrected estimates are used in place of m and v in practice:

$$\hat{m}(n) = \frac{m(n)}{1 - a^n}$$

$$\hat{v}(n) = \frac{v(n)}{1 - b^n}$$
The learning rate, a, and b still need to be set manually in Adam, but setting them is
much less difficult. Experiments show that a = 0.9, b = 0.999, and a learning rate of
0.0001 are suitable values. In practice, Adam converges quickly. When the algorithm
converges to saturation, the learning rate can be properly reduced, and the other
parameters do not need to be adjusted. Generally, the algorithm converges to a
satisfactory extremum after the learning rate has been reduced several times.
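The Adam formulas, including the bias correction, can be combined into a short sketch for
a single parameter. The toy objective J(w) = w^2 and the learning rate of 0.1 are
assumptions chosen only so that the small example converges in a few hundred steps; they
are not recommended settings. The function name is hypothetical.

```python
import math


def adam_minimize(grad, w0, eta=0.1, a=0.9, b=0.999, e=1e-8, steps=500):
    """Minimize a 1-D function with Adam, using bias-corrected moments."""
    w, m, v = w0, 0.0, 0.0
    for n in range(1, steps + 1):
        g = grad(w)
        m = a * m + (1.0 - a) * g            # first-moment estimate
        v = b * v + (1.0 - b) * g * g        # second-moment estimate
        m_hat = m / (1.0 - a ** n)           # bias correction for m
        v_hat = v / (1.0 - b ** n)           # bias correction for v
        w -= eta / (e + math.sqrt(v_hat)) * m_hat
    return w


def grad_j(w):
    """Gradient of the toy objective J(w) = w ** 2."""
    return 2.0 * w


print(adam_minimize(grad_j, w0=5.0, steps=1))    # first step has size close to eta
print(adam_minimize(grad_j, w0=5.0, steps=500))
```

Thanks to the bias correction, the very first update has a magnitude of approximately
eta regardless of the gradient scale, instead of being distorted by the zero
initialization of m and v.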

1.6 Types of Neural Networks


Since the advent of BP neural networks, people have proposed a variety of neural
networks for solving different problems. In the field of computer vision, CNNs are
currently the most widely used deep learning models. In the field of natural language
processing, RNNs were once dominant. This section describes both, and also introduces a
game theory-based generative model: the generative adversarial network (GAN).

1.6.1 CNN
1.6.1.1 Overview
A CNN is an FNN. Unlike a fully connected neural network, the CNN enables its
artificial neurons to respond to units within a limited coverage area, and performs
excellently in image processing. The CNN generally includes convolutional layers,
pooling layers, and fully connected layers.
In the 1960s, when studying neurons responsible for local sensitivity and direction
selection in the cat visual cortex, Hubel and Wiesel found a unique network structure
that can effectively reduce the complexity of FNNs. The CNN was later proposed based on
this finding. The CNN has become one of the research hotspots in many scientific fields,
especially in pattern recognition. It has been widely used because it avoids complex
image preprocessing and can take the original image directly as input.
The name of CNN comes from convolution operations. Convolution is an inner product
operation performed on an image (or a feature map) and a filter matrix (also called a
filter or a convolution kernel). The image is an input of the neural network, and the
feature map is an output of each convolutional layer or pooling layer in the neural
network. The difference is that the values in the feature map are outputs of neurons.
Therefore, the values are not limited theoretically. The values in the image correspond to
the luminance of the RGB channels, and the values range from 0 to 255. Each
convolutional layer in the neural network corresponds to one or more filter matrices.
Unlike a fully connected neural network, the CNN enables each neuron at a
convolutional layer to use only the output of neurons in a local window, rather than all
neurons at the previous layer, as its input. This characteristic of convolution
operations is referred to as local perception.
It is generally considered that human perception of the outside world is from local to
global. Spatial correlations among local pixels of an image are closer than those among
pixels that are far away. Therefore, each neuron does not need to collect global
information of an image and needs to collect only local information. Then we can obtain
the global information at a higher layer by synthesizing the local information collected by
each neuron. The idea of sparse connectivity is inspired by the structure of the biological
visual system. The neurons in the visual cortex can respond to the stimuli in only certain
regions, and therefore can receive information locally.
Another characteristic of convolution operations is parameter sharing. One or more
convolution kernels can be used to scan an input image. The parameters in a convolution
kernel are weights of the model. At a convolutional layer, all neurons in the same
feature map share the same convolution kernel, and therefore share the same weights.
Weight sharing means that the parameters of a convolution kernel stay fixed while the
kernel traverses the entire image. For example, suppose a convolutional layer has three
convolution kernels, and each convolution kernel scans the entire image. During
scanning, the parameter values of each convolution kernel are fixed, that is, all
positions of the image share the same weights. This means that the features learned from
one part of the image can also be applied to other parts of the image or to other
images, which is called position invariance.

1.6.1.2 Convolutional Layer


Figure 1-20 shows the typical architecture of a CNN. The leftmost image in the figure is
the model input. The input image first passes through a convolutional layer including
three convolution kernels to obtain three feature maps. Parameters of the three
convolution kernels are independent of each other, and are obtained through optimization
with the BP algorithm. During a convolution operation, a window of the input image is
mapped to a neuron in the feature map. The purpose of the convolution operation is to
extract different input features. The first convolutional layer may extract only some
low-level features such as edges, lines, and corners. A multi-layer network can extract
more complex features based on the low-level features.

Figure 1-20 CNN structure


Consider the convolution operation (Han Bingtao, 2017) shown in Figure 1-21. In a
5 x 5 matrix, a maximum of 3 x 3 different regions with the same shape as
the convolution kernel can be found. Therefore, the dimension of the feature map is
3 x 3.

Figure 1-21 Convolution operation example


As shown in Figure 1-22, each element in the feature map is obtained by multiplying a
region of the original image element-wise by the convolution kernel and summing the
products. In the matrix shown in the left part of Figure 1-22, the yellow region
corresponds to the element in the upper left corner of the feature map. Each element in
this region is multiplied by the corresponding element in the convolution kernel, and
the products are summed to obtain the first element 4 in the feature map. The example
here does not contain a bias term, that is, the bias is equal to 0. In a more general
convolution operation, the bias term is added to the result of the multiply-and-sum
operation before the result is output to the feature map. The bias term here has a
similar meaning to the bias term in linear regression.
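The multiply-and-sum operation can be sketched with plain nested lists. The 5 x 5 image
and 3 x 3 kernel values below are assumptions chosen for illustration, since the figure
is not reproduced in the text; they were picked so that the first output element is 4,
as in the description above. The kernel slides with a step of 1 and no padding, and, as
is common in CNN implementations, it is applied without flipping.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide the kernel over the image (step 1, no padding) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = bias
            for i in range(k_h):
                for j in range(k_w):
                    # element-wise product of the window and the kernel
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

for row in conv2d_valid(image, kernel):
    print(row)
# prints [4.0, 3.0, 4.0], [2.0, 4.0, 3.0], [2.0, 3.0, 4.0]
```

A 5 x 5 input and a 3 x 3 kernel leave 5 - 3 + 1 = 3 window positions along each axis,
which is why the feature map is 3 x 3.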

Figure 1-22 Convolution operation example


The basic structure of a convolutional layer is multi-channel convolution. As shown in
Figure 1-23, one convolutional layer can contain multiple convolution kernels and bias
terms. Each combination of a convolution kernel and a bias term maps an input
tensor to a feature map. Multi-channel convolution stacks all feature maps obtained
from the convolution kernels and bias terms to form a three-dimensional tensor as
output. Generally, the input and output tensors and the convolution kernels are all
three-dimensional tensors, and the three dimensions represent the width, height, and
depth. To extend the foregoing convolution operation to three dimensions, set the depth
of each convolution kernel equal to the depth of the input tensor.
This ensures that the depth of the feature map corresponding to a single convolution
kernel is 1. The convolution operation does not pose specific requirements on the width
and height of the convolution kernel. However, for ease of operation, the width and
height of the convolution kernel are generally the same. In addition, the feature maps
obtained by using different convolution kernels must have the same
width and height so that they can be stacked together. In other words, all convolution
kernels at the same convolutional layer must have the same size.

Figure 1-23 Convolutional layer structure


The feature map output by the convolutional layer needs to be activated. Activation
functions are sometimes considered as a part of the convolutional layer. However,
because an activation function is not closely related to a convolution operation, the
activation function is sometimes implemented as an independent layer. The most
commonly used activation layer is the ReLU layer, that is, the ReLU activation function.

1.6.1.3 Pooling Layer


A pooling layer combines nearby units to reduce the size of a feature map and thereby
its dimensions. Common pooling layers include the max pooling layer and the average
pooling layer. As shown in Figure 1-24, the max pooling layer divides a feature map into
several regions and uses the maximum value of each region to represent the entire
region. Average pooling is similar to max pooling, except that the average value
of each region is used to represent the region. The shape of each region in the feature
map is referred to as the pooling window size.
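Max pooling and average pooling differ only in the statistic taken over each window,
which the following sketch makes explicit. The function names and the 4 x 4 feature map
values are made up for illustration.

```python
def pool2d(feature_map, window, step, reduce_fn):
    """Apply reduce_fn to every window x window region, moving by `step`."""
    rows = range(0, len(feature_map) - window + 1, step)
    cols = range(0, len(feature_map[0]) - window + 1, step)
    return [
        [
            reduce_fn([feature_map[r + i][c + j]
                       for i in range(window) for j in range(window)])
            for c in cols
        ]
        for r in rows
    ]


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

print(pool2d(feature_map, window=2, step=2, reduce_fn=max))   # [[6, 8], [3, 4]]
print(pool2d(feature_map, window=2, step=2, reduce_fn=mean))  # [[3.25, 5.25], [2.0, 2.0]]
```

The function has no trainable parameters: the result depends only on the values inside
each window, not on their arrangement.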

Figure 1-24 Pooling operation example


In an actual CNN, convolutional layers and pooling layers are basically interconnected
alternately. Both pooling and convolution can increase the feature scale, which is
equivalent to extracting the features of the previous layer. However, unlike the
convolution operation, the pooling layer does not include any parameters. In addition,
the pooling layer ignores the arrangement of elements in each small region, and
concerns only statistical features of these elements.
The pooling layer focuses on reducing the size of the input data of the next layer,
which effectively reduces the number of parameters and the amount of computation, and
helps prevent overfitting. Another function of the pooling layer is to map an input of
any size to an output of a fixed length by properly setting the size and step of the
pooling window. Assume that the input size is a × a, the size of the pooling window is
⌈a/4⌉, and the step is ⌊a/4⌋. If a is a multiple of 4, the size of the pooling window
is equal to the step, and it is easy to see that the output size of the pooling layer is
4 × 4. When a is an integer that is not exactly divisible by 4, the size of the pooling
window is greater than the step by exactly 1, and it can be proved that the output size
of the pooling layer is still 4 × 4 provided that a is at least 12. For the small sizes
a = 6, 7, and 11, more than four windows fit along each axis, so the extra windows must
be dropped or the window and step adjusted. This feature of the pooling layer enables
the CNN to be applicable to input images of almost any size.
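The window and step arithmetic above can be checked with a few lines of code. This is a
sketch under one stated assumption: windows are counted with the usual floor rule, that
is, a window is used only if it fits entirely inside the input. The function name
pooled_size is hypothetical.

```python
import math


def pooled_size(a, bins=4):
    """Number of pooling windows along one axis for an input of side length a.

    The window size is ceil(a / bins) and the step is floor(a / bins).
    Only windows that fit entirely inside the input are counted.
    """
    window = math.ceil(a / bins)
    step = a // bins
    return (a - window) // step + 1


sizes = {a: pooled_size(a) for a in range(4, 41)}
exceptions = [a for a, n in sizes.items() if n != 4]
print(sizes[8], sizes[9], sizes[10], sizes[13])  # 4 4 4 4
print(exceptions)                                # [6, 7, 11]
```

Under this counting rule, every input side length from 4 to 40 yields exactly four
windows per axis except 6, 7, and 11, where the step is so small that additional windows
fit. Writing a = 4q + r with 0 <= r < 4, the count is 4 + floor((r - 1) / q) for r >= 1,
which exceeds 4 only when q is 1 or 2.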

1.6.1.4 Fully Connected Layer


A fully connected layer is generally used as the output layer of the CNN. A common task
in the pattern recognition field is classification or regression, for example,
determining the class of an object in an image, or scoring an object in an image. For
these problems, a feature map is clearly unsuitable as the output, so the feature map
needs to be mapped to a vector that meets the requirement. This operation usually
involves vectorization of the feature map, that is, arranging the neurons in the feature
map into a vector in a fixed order.

1.6.2 RNN
The RNN is a neural network that captures dynamic information in sequential data
through recurrent connections of hidden layer nodes. It can classify sequential data.
Unlike FNNs, an RNN can hold the context state of the sequential data. The RNN is
no longer limited to the spatial boundaries of conventional neural networks, and can be
extended along time sequences. Intuitively, the memory unit at the current moment is
connected to the memory unit at the next moment. RNNs are
widely used in sequence-related scenarios, such as videos, audio, and sentences.

Figure 1-25 RNN structure


The left part of Figure 1-25 shows the classic structure of RNNs. In the figure, x(t)
indicates the value of an input sequence at time node t, s(t) indicates the state of a
memory unit at time node t, o(t) indicates the output of a hidden layer at time node t,
and U, V, and W indicate the model weights. It can be seen that the update of
the hidden layer depends not only on the current input x(t), but also on the memory unit
state s(t–1) of the previous time node, that is, s(t) = f(Ux(t) + Ws(t–1)), where f
represents an activation function. The output layer of the RNN is the same as that of the
MLP, and details are omitted herein.
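The state update s(t) = f(Ux(t) + Ws(t–1)) can be sketched as a loop over the input
sequence. The choice of tanh for f, the one-dimensional weights, and the input values
below are assumptions made only for illustration, and the output layer with weight V is
left out. The function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_states(xs, U, W, s0):
    """Return the memory unit states s(1), ..., s(T) for the input sequence xs.

    Each state is computed as s(t) = tanh(U x(t) + W s(t-1)).
    """
    states = []
    s = s0
    for x in xs:
        pre_activation = [u + w for u, w in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre_activation]
        states.append(s)
    return states


U = [[0.5]]          # input-to-state weight (1 x 1 for simplicity)
W = [[1.0]]          # state-to-state weight
xs = [[1.0], [0.0], [0.0]]
for t, s in enumerate(rnn_states(xs, U, W, s0=[0.0]), start=1):
    print(t, s)
```

Only the first input is non-zero, yet the later states are also non-zero, because each
state carries information forward from the previous time node. The state shrinks at every
step as it passes through tanh again.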

Figure 1-26 RNN structure


As shown in Figure 1-26 (Andrej Karpathy, 2015, The Unreasonable Effectiveness of
RNNs), there are many different RNN structures. The leftmost part of Figure 1-26
shows a common BP neural network, which does not involve a time sequence. The
second part from the left is a generative model that can generate
sequences that meet specific requirements based on a single input. The middle part
is the most typical RNN structure and can be used for classification or
regression tasks. The two right parts can both be used for sequence
translation. The structure in the second part from the right is also
referred to as the encoder-decoder structure.
Training an RNN relies on the backpropagation through time (BPTT) algorithm, which is an
extension of the conventional BP algorithm on time sequences. The conventional BP
algorithm considers only the error propagation between different hidden layers, while the
BPTT algorithm further needs to consider the error propagation within the same hidden
layer between different time nodes. Specifically, the error of a memory unit at moment t
consists of two parts: a component propagated by the hidden layer at moment t, and a
component propagated by the memory unit at moment t+1. The method for calculating
the two components when they are separately propagated is the same as that of the
conventional BP algorithm. When propagated to the memory unit, the sum of the two
components is used as the error of the memory unit at moment t. It is easy to calculate
gradients of parameters U, V, and W at moment t based on the errors of the hidden layer
and the memory unit at moment t. After all time nodes are traversed reversely, T
gradients are obtained for each of the parameters U, V, and W, where T indicates a total
time length. The sum of the T gradients is the total gradient of the parameters U, V, and
W. After obtaining the gradient of each parameter, you can update the parameters by
using the gradient descent algorithm.
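The BPTT procedure described above can be sketched for a scalar RNN. The sketch assumes tanh as the activation function, a linear output o(t) = v·s(t), and a squared-error loss summed over all time nodes; these choices, and all function names, are illustrative assumptions rather than part of the original material.

```python
import math

def forward(xs, u, w):
    """Return the memory unit states s(1..T), starting from s(0) = 0."""
    s, states = 0.0, []
    for x in xs:
        s = math.tanh(u * x + w * s)
        states.append(s)
    return states

def loss(xs, ys, u, w, v):
    """Squared error between the outputs v * s(t) and the targets y(t)."""
    return sum(0.5 * (v * s - y) ** 2 for s, y in zip(forward(xs, u, w), ys))

def bptt(xs, ys, u, w, v):
    """Traverse the time nodes in reverse and sum the T per-step gradients."""
    states = forward(xs, u, w)
    grad_u = grad_w = grad_v = 0.0
    from_next = 0.0  # error propagated by the memory unit at moment t+1
    for t in reversed(range(len(xs))):
        s_t = states[t]
        s_prev = states[t - 1] if t > 0 else 0.0
        d_out = v * s_t - ys[t]           # error at the output at moment t
        grad_v += d_out * s_t
        d_s = d_out * v + from_next       # sum of the two error components
        d_a = d_s * (1.0 - s_t ** 2)      # back through tanh
        grad_u += d_a * xs[t]
        grad_w += d_a * s_prev
        from_next = d_a * w               # passed on to moment t-1
    return grad_u, grad_w, grad_v

xs, ys = [1.0, -0.5, 0.3], [0.2, 0.1, -0.4]
grad_u, grad_w, grad_v = bptt(xs, ys, u=0.5, w=0.8, v=1.5)
```

A finite-difference check against the loss function is a simple way to confirm that the summed gradients are correct.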
RNNs still have many problems. Because the memory unit receives its own output from the
previous moment at each step, problems that easily occur in deep fully connected neural
networks, such as gradient vanishing and gradient explosion, also trouble RNNs.
Moreover, the state of the memory unit at moment t cannot be retained for a long time. The
state of the memory unit is mapped by the activation function at each
moment. By the time the loop reaches the end of a long sequence, the input at the beginning of
the sequence may already have been diluted by the repeated mapping of the activation function. In
other words, the RNN attenuates information that is stored for a long time.

Figure 1-27 LSTM neural network


We want the model to hold memory information for a long period of time in many tasks.
However, when the capacity of the memory unit is limited, the RNN inevitably fails to
memorize all information in the whole sequence. Therefore, we hope that the memory

unit can selectively remember key information, and the long short-term memory (LSTM)
network can implement this function. As shown in Figure 1-27 (Colah, 2015,
Understanding LSTM Networks), the core of the LSTM network is the LSTM block, which
replaces the hidden layer in RNNs. The LSTM block includes three computing units: an
input gate, a forget gate, and an output gate, so that the LSTM can selectively memorize,
forget, and output information. In this way, the selective memory function is
implemented. Notably, there are two lines connecting adjacent LSTM blocks, representing
the cell state and the hidden state of the LSTM, respectively.
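One step of an LSTM block can be sketched with scalars as follows. The gate equations are the commonly used ones; the parameter layout (a dictionary mapping each gate name to hypothetical weights w_x, w_h and bias b) is an assumption made only for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM block.

    params maps "input", "forget", "output", and "candidate" to (w_x, w_h, b).
    Returns the new hidden state h and the new cell state c.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))        # input gate: how much new content to memorize
    f = sigmoid(pre("forget"))       # forget gate: how much old cell state to keep
    o = sigmoid(pre("output"))       # output gate: how much of the cell to expose
    g = math.tanh(pre("candidate"))  # candidate content
    c = f * c_prev + i * g           # cell state line between adjacent blocks
    h = o * math.tanh(c)             # hidden state line between adjacent blocks
    return h, c

# With the forget gate open and the input gate closed, the cell state is preserved.
params = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=params)
```

The example shows why the cell state can carry information over long spans: as long as the forget gate stays near 1 and the input gate near 0, the stored value passes through almost unchanged.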

Figure 1-28 Gated recurrent unit


As shown in Figure 1-28, the gated recurrent unit (GRU) is a variant of the LSTM. The GRU
combines the forget gate and the input gate into a single update gate. The GRU also combines
the cell state and hidden state of the LSTM into a single hidden state. The GRU model is
simpler than the standard LSTM model and is very popular.

1.6.3 GAN
A GAN is a framework that can be used in scenarios such as image generation, semantic
segmentation, text generation, data augmentation, chatbots, information retrieval, and
information sorting. Before the emergence of GANs, a deep generative model usually
needed a Markov chain or maximum conditional likelihood estimation, which can easily
lead to many difficult probabilistic problems. Through an adversarial process, a GAN
trains generator G and discriminator D at the same time so that the two play a
game against each other. Discriminator D determines whether a sample is real or generated by
generator G. Generator G tries to generate samples that discriminator D cannot
distinguish from real samples. The GAN is trained with the mature BP
algorithm.

Figure 1-29 GAN structure


As shown in Figure 1-29, the input of the generator is noise z. z conforms to a manually
selected prior probability distribution, such as a uniform distribution or a Gaussian
distribution. The input space can be mapped to the sample space by using a certain
network structure. The input of the discriminator is a real sample x or a forged sample
G(z), and the output is the authenticity of the sample. Any classification model can be
used to design the discriminator. CNNs and fully connected neural networks are
commonly used as discriminators. For example, we might want to generate an image
depicting a cat and make the image as real as possible. The discriminator is used to
determine whether the image is real.
The objective function of the GAN is as follows:
$$\min_G \max_D \; \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))]$$

The objective function consists of two parts. The first part is related only to discriminator
D. If a real sample is input, the value of the first part is larger when the output of D is
closer to 1. The second part is related to both G and D. When the input is random noise,
G can generate a sample. Discriminator D receives this sample as input. The value of the
second part is larger when the output is closer to 0. Since the objective of D is to
maximize the objective function, it is necessary to output 1 in the first term and 0 in the
second term, that is, to correctly classify the samples. Although the objective of the
generator is to minimize the objective function, the first term of the objective function is
irrelevant to the generator. Therefore, the generator can only minimize the second term.
To minimize the second term, the generator needs to output samples that make the
discriminator output 1, that is, samples that the discriminator cannot distinguish
from real ones.
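The two terms of the objective function can be estimated from discriminator outputs alone. The sketch below assumes the outputs D(x) and D(G(z)) are already available as lists of probabilities; the function name gan_value is hypothetical.

```python
import math

def gan_value(d_real, d_fake):
    """Estimate the GAN objective from discriminator outputs.

    d_real: values of D(x) for real samples x
    d_fake: values of D(G(z)) for generated samples G(z)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# D classifies well: outputs near 1 for real samples and near 0 for generated ones.
value_strong_d = gan_value([0.9, 0.95], [0.1, 0.05])
# D is fooled: it outputs 0.5 for every sample.
value_fooled_d = gan_value([0.5, 0.5], [0.5, 0.5])
```

The value is higher when D classifies correctly, which is what D maximizes, and it drops to 2·log(0.5) when D outputs 0.5 for every sample, which is the direction G pushes toward.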
Since the GAN was first proposed in 2014, more than 200 GAN variants have been derived
and widely used in many generation problems. However, the original GAN also has some
problems, for example, an unstable training process. The training processes of the fully
connected neural network, CNN, and RNN described above all minimize the cost function
by optimizing parameters. GAN training is different, mainly because the adversarial
relationship between generator G and discriminator D is difficult to balance. A
general GAN training process is: alternately train D and G until D(G(z)) is basically
stable at about 0.5. In this case, D and G reach Nash equilibrium, and the training ends.
However, in some cases, it is hard for the model to reach Nash equilibrium, and it may even
encounter problems such as mode collapse. Therefore, how to improve the GAN to
increase model stability has always been a hot topic in academic research. In general,
GANs have some disadvantages, but these disadvantages do not affect the importance of
GANs to generative models.

1.7 Common Issues


Deep learning models are complex and may encounter various problems during training.
This section summarizes common issues so that you can quickly locate and solve the
issues.

1.7.1 Data Imbalance


In datasets of classification tasks, the number of samples in each class may be
unbalanced. Data imbalance occurs when the number of samples in one or more classes
for prediction is very small. For example, among 4251 training images, more than 2000
classes may contain only one image, and some classes may contain 2 to 5 images. In
this case, the model cannot adequately learn each class, which degrades model
performance. The methods for alleviating data imbalance mainly include random
undersampling, random oversampling and Synthetic Minority Over-sampling Technique
(SMOTE).
Random undersampling randomly removes samples from a class with sufficient
observations. This method can reduce the running time and ease the storage problem
when the training dataset is very large. However, during sample deletion, some samples
containing important information may also be discarded, and the remaining samples may
be biased and fail to accurately represent the major classes. Therefore, random
undersampling may lead to inaccurate results on actual test datasets.
Random oversampling increases the number of observations of minor classes by copying
existing samples. Unlike undersampling, this method does not cause
information loss, so the performance on the actual test datasets is generally better than
that of undersampling. However, because the new samples are the same as the original
samples, the possibility of overfitting is increased.
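Both random resampling strategies can be sketched in a few lines using only the standard library. The helper names and the choice to balance every class to the smallest (or largest) class size are assumptions made for this illustration.

```python
import random
from collections import defaultdict

def group_by_label(samples, labels):
    """Collect the samples of each class into a list."""
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return groups

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class has as many as the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = min(len(g) for g in groups.values())
    return {label: rng.sample(g, target) for label, g in groups.items()}

def random_oversample(samples, labels, seed=0):
    """Randomly copy samples so every class has as many as the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = max(len(g) for g in groups.values())
    return {
        label: g + [rng.choice(g) for _ in range(target - len(g))]
        for label, g in groups.items()
    }

samples = list(range(10))
labels = [0] * 8 + [1] * 2          # class 1 is the minor class
under = random_undersample(samples, labels)
over = random_oversample(samples, labels)
```

In the undersampled result, six samples of class 0 are discarded; in the oversampled result, class 1 contains only copies of its two original samples, which illustrates the overfitting risk mentioned above.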
SMOTE synthesizes new observations for minor classes.
It is similar to existing methods that use nearest neighbor classification. SMOTE first
selects a data subset from minor classes, and then synthesizes new samples based on the
subset. These synthesized samples are added to the original dataset. This method is
advantageous in that it does not lose valuable information, and can also effectively
alleviate overfitting by generating synthetic samples through random sampling. However,
for high-dimensional data, SMOTE performance is less satisfactory. When generating a
synthetic instance, SMOTE does not take into account adjacent instances from other
classes. This results in an increase in class overlap and causes additional noise.
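The synthesis step of SMOTE can be sketched as linear interpolation between a minor-class sample and one of its nearest minor-class neighbors. This is a simplified sketch under stated assumptions (Euclidean distance, a hypothetical function name, and a small k), not a full implementation.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples from a list of minor-class points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minor-class points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbor = rng.choice(others[:k])   # one of the k nearest neighbors
        gap = rng.random()                  # random position on the segment
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_sketch(minority, n_new=5)
```

Note that only minor-class points are consulted when choosing neighbors, which is why synthetic points can land in regions occupied by other classes.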

1.7.2 Gradient Vanishing and Gradient Explosion


When the number of network layers is large enough, the gradients of model parameters
in the backpropagation process may become very small or large, which is called gradient
vanishing or gradient explosion. In essence, both problems originate from
backpropagation formulas. Assuming that a model has three layers and each layer has
only one neuron, a backpropagation formula can be written as follows:
$$\delta_1 = \delta_3 f_2'(o_1) w_3 f_1'(o_0) w_2$$
f is the activation function. In this example, the sigmoid function is used.
As the number of network layers increases, the number of occurrences of f′(o)w in the
formula increases. According to the inequality of arithmetic and geometric means, the maximum value of f′(x) =
f(x)(1 − f(x)) is 1/4. Therefore, when the absolute value of w is less than 4, the absolute value of f′(o)w is definitely less
than 1. When multiple terms less than 1 are multiplied, δ1 inevitably approaches 0. This
is the cause of gradient vanishing. Similarly, gradient explosion mainly occurs when
w is very large. When multiple terms larger than 1 are multiplied, δ1 becomes very large.
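The effect of multiplying many such terms can be checked numerically. The sketch below assumes one neuron per layer with sigmoid activation and evaluates the derivative at o = 0, where it reaches its maximum of 1/4; the function names are hypothetical.

```python
import math

def sigmoid_prime(o):
    """Derivative of the sigmoid function: f(o) * (1 - f(o))."""
    s = 1.0 / (1.0 + math.exp(-o))
    return s * (1.0 - s)

def backprop_factor(weights, pre_activations):
    """Product of the derivative-times-weight terms accumulated across layers."""
    factor = 1.0
    for w, o in zip(weights, pre_activations):
        factor *= sigmoid_prime(o) * w
    return factor

depth = 20
vanishing = backprop_factor([1.0] * depth, [0.0] * depth)    # each term is 0.25
exploding = backprop_factor([10.0] * depth, [0.0] * depth)   # each term is 2.5
```

With 20 layers, the factor is below 1e-11 for w = 1 and above 1e7 for w = 10, even though each individual term looks harmless.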
Actually, gradient explosion and gradient vanishing are caused by the deep network and
unstable network weight update. In essence, they are caused by the chain rule in gradient
backpropagation. Methods for coping with gradient vanishing mainly include pre-
training, ReLU activation functions, LSTM neural networks, and residual modules. (In
2015, ILSVRC champion ResNet increased the model depth to 152 layers by introducing a
residual module into the model. In comparison, the 2014 champion GoogLeNet has only
27 layers.) The main solution to gradient explosion is gradient clipping. The idea of
gradient clipping is to set a gradient threshold and forcibly limit the gradient within this
range to prevent excessively large gradients.
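Two common forms of gradient clipping are sketched below: clipping each component to a fixed range, and rescaling the whole gradient when its norm exceeds a threshold. Both function names are hypothetical, and the gradient is represented as a plain list of floats for simplicity.

```python
import math

def clip_by_value(grads, threshold):
    """Limit every gradient component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped_value = clip_by_value([3.0, -4.0, 0.5], threshold=1.0)
clipped_norm = clip_by_norm([3.0, 4.0], max_norm=1.0)
```

Clipping by norm keeps the direction of the gradient and only shortens it, whereas clipping by value can change the direction.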

1.7.3 Overfitting
Overfitting refers to the problem that a model performs well on the training set but
poorly on the test set. Overfitting may be caused by many reasons, such as excessively
high feature dimensions, excessively complex model assumptions, excessive parameters,
insufficient training data, and excessive noise. In essence, overfitting occurs because the
model overfits the training dataset without taking into account the generalization
capability. Consequently, the model can better predict the training set, but the prediction
result of the new data is poor.
If overfitting occurs due to insufficient training data, consider obtaining more data. One approach is
to obtain more data from the data source, but this approach is often time-consuming
and laborious. A more common practice is data augmentation.
If overfitting is caused by an excessively complex model, multiple methods can be used to
suppress overfitting. The simplest method is to adjust hyperparameters of the model and
reduce the number of layers and neurons on the network to limit the fitting capability of
the network. Alternatively, regularization techniques may be introduced into the
model. Related content has been described above and is therefore omitted here.

1.8 Summary
This chapter mainly introduces the definition and development of neural networks,
training rules of perceptrons, and common neural networks (CNNs, RNNs, and
GANs). It also describes common issues and solutions of neural networks in AI
engineering.

1.9 Quiz
1. Deep learning is a new research direction derived from machine learning. What are
the differences between deep learning and conventional machine learning?
2. In 1986, the introduction of MLP ended the first "cold winter" in the history of
machine learning. Why can MLP solve the XOR problem? What is the role of
activation functions in the problem solving?
3. The sigmoid activation function is widely used in the early stage of neural network
research. What problems does it have? Does the tanh activation function solve these
problems?
4. The regularization method is widely used in deep learning models. What is its
purpose? How does Dropout implement regularization?
5. An optimizer is the encapsulation of model training algorithms. Common optimizers
include SGD and Adam. Try to compare the performance differences between
optimizers.
6. Supplement the convolution operation result in Figure 1-22 by referring to the
example.
7. RNNs can save the context state in the sequential data. How is this memory function
implemented? What problems might occur when you deal with long sequences?
8. The GAN is a deep generative network framework. Please briefly describe its training
principle.
9. Gradient explosion and gradient vanishing are common problems in deep learning.
What are their causes? How can I avoid these problems?

2 Deep Learning Development Frameworks

This chapter introduces the common frameworks and their features in the AI field, and
describes the typical framework TensorFlow in detail to help you understand the concepts
of AI and put them into practice to meet actual demands. This chapter also introduces
MindSpore, a Huawei-developed framework with distinctive advantages of its own.
After reading this chapter, you can choose to use MindSpore based on your requirements.

2.1 Deep Learning Development Frameworks


2.1.1 Introduction to PyTorch
PyTorch is a Python-based machine learning computing framework released by Facebook.
It is developed based on Torch, a scientific computing framework that supports a large
number of machine learning algorithms. Torch is a tensor operation library similar to
NumPy and features high flexibility, but it is less popular because it uses the programming
language Lua. This is why PyTorch was developed.
In addition to Facebook, organizations such as Twitter, GMU, and Salesforce also use
PyTorch.
The following sections describe the features of PyTorch.

2.1.1.1 Python First


PyTorch does not simply bind Python to a C++ framework. PyTorch directly supports
Python access at a fine grain. Developers can use PyTorch as easily as using NumPy or
SciPy. This not only lowers the barrier to entry for Python users, but also ensures that
the code is basically consistent with the native Python implementation.

2.1.1.2 Dynamic Neural Network


Many mainstream frameworks such as TensorFlow 1.x do not support dynamic neural networks. To run
TensorFlow 1.x, developers must create static computational graphs in advance, and then
feed data to the created graphs and run them repeatedly. In contrast,
PyTorch is free from such complexity, and PyTorch programs can
dynamically build or adjust computational graphs during execution.

2.1.1.3 Easy to Debug


PyTorch can generate dynamic graphs during execution, and developers can stop the
interpreter in the debugger and view the output of a specific node.

In addition, PyTorch provides tensors that support CPUs and GPUs, greatly accelerating
computing.

2.1.2 Introduction to MindSpore


Based on the design ideas of algorithm as code, efficient execution and flexible
deployment, Huawei has developed the core architecture of MindSpore. The architecture
is divided into four layers. The on-demand collaborative distributed architecture,
scheduling, distributed deployment, and communication library reside at the same layer.
The next is the execution efficiency layer (including data model downstream
deployment). The parallelism layer contains pipeline execution, deep graph optimization,
and operator fusion. The upper layer is MindSpore intermediate representation (IR) for
computational graphs. MindSpore enables automatic differentiation, automatic
parallelism, and automatic tuning, and supports all-scenario application programming
interfaces (APIs) that comply with our design ideas: algorithm as code, efficient
execution, and flexible deployment.
The core of the AI framework and one of the decisive factors of a programming
paradigm is the automatic differentiation technology used in the AI framework. A deep
learning model is trained by forward and backward computation. Taking the
mathematical expression here as an example, the forward computation of this formula is
performed by the computation process at the black arrow. After the output f of the
forward computation is obtained, the backward computation is performed by using the
chain rule to obtain the differential values of x and y. During model design, only forward
computation is covered, while backward computation is implemented by the
automatic differentiation technology of the framework.
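The division of labor between forward and backward computation can be illustrated with a tiny reverse-mode automatic differentiation sketch. It is not how MindSpore implements the feature; the Var class, the helper functions, and the example expression f = x·y + sin(x) are all hypothetical and chosen only to show the chain rule being applied automatically.

```python
import math

class Var:
    """A value in the computational graph that records how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))

def backward(output):
    """Propagate derivatives from the output to the inputs with the chain rule."""
    order, seen = [], set()

    def visit(node):  # depth-first traversal: inputs come before their consumers
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # consumers are processed before their inputs
        for parent, local in node.parents:
            parent.grad += node.grad * local

# The user writes only the forward computation.
x, y = Var(2.0), Var(3.0)
f = x * y + sin(x)
backward(f)  # fills in x.grad and y.grad
```

After the call to backward, x.grad holds y + cos(x) and y.grad holds x, without the user having written any derivative code.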
In addition, with the expansion of NLP models, the memory overhead for training
ultra-large models such as BERT (340M) and GPT-2 (1542M) exceeds the capacity of a single
card. Therefore, the models need to be split across multiple cards for execution.
Currently, manual model parallelism is used in the industry. It requires model
segmentation and cluster topology awareness, so it is difficult to develop. In addition, it is
difficult to ensure and optimize performance.
MindSpore can automatically segment the entire graph based on the dimensions of the
operators' input and output data, and integrate data parallelism and model
parallelism. Cluster topology-aware scheduling perceives the cluster topology
and automatically schedules subgraphs for execution to minimize the
communication overhead. It keeps the single-node coding logic while implementing
model parallelism, improving the development efficiency tenfold compared with manual
parallelization.
Even with powerful computing power, model execution faces huge challenges: the
memory wall problem, high interaction overhead, and difficult data supply. Some
operations are performed on the host, while the others are performed on the device. The
interaction overhead is much larger than the execution overhead, resulting in low
accelerator utilization.
MindSpore uses the chip-oriented deep graph optimization technology to minimize the
synchronization waiting time and maximize the parallelism of data, computing, and
communication. Data and the entire graph computation are on the Ascend AI Processor.
MindSpore also uses on-device execution to implement decentralization. The
optimization of adaptive graph segmentation driven by gradient data can implement
autonomous AllReduce and synchronize gradient aggregation, boosting computing
and communication efficiency.
In addition, MindSpore uses a distributed architecture of on-demand device-edge-cloud
collaboration. The unified model IR brings a consistent deployment experience, and the
graph optimization technology of software and hardware collaboration shields scenario
differences. Device-cloud collaboration based on federated meta learning breaks the boundaries
between device and cloud, and implements real-time updates of the multi-device collaboration
model.

2.1.3 Introduction to TensorFlow


TensorFlow is Google's second-generation open-source software library for numerical
computation. The TensorFlow computing framework supports various deep learning
algorithms and multiple computing platforms, ensuring high system stability.
TensorFlow has the following features:

2.1.3.1 Multi-platform
All platforms that support the Python development environment also support
TensorFlow. However, TensorFlow depends on other software such as the NVIDIA CUDA
Toolkit and cuDNN to access a supported GPU.

2.1.3.2 GPU
TensorFlow supports certain NVIDIA GPUs, which are compatible with NVIDIA CUDA
Toolkit versions that meet specific performance standards.

2.1.3.3 Distributed
TensorFlow supports distributed computing, allowing computational graphs to be
computed on different processes. These processes may be located on different servers.

2.1.3.4 Multi-lingual
The main programming language of TensorFlow is Python. APIs for C++, Java, and Go can
also be used, but their stability cannot be guaranteed, as is the case for many third-party
bindings for C#, Haskell, Julia, Rust, Ruby, Scala, R, and even PHP. Google recently released
the mobile-optimized TensorFlow Lite library for running TensorFlow applications on Android.

2.1.3.5 Scalability
One of the main advantages of using TensorFlow is that it has a modular, scalable, and
flexible design. Developers can easily port models among the CPU, GPU, and TPU with a
few code changes. Python developers can develop their own models by using the native,
low-level APIs (or core APIs) of TensorFlow, or use built-in models through the high-level
API libraries of TensorFlow. TensorFlow has many built-in and distributed libraries. A
high-level deep learning framework such as Keras can be layered on top of it to serve as a
high-level API.

2.1.3.6 Powerful Computing Performance


TensorFlow can achieve the best performance on Google TPU, but it also strives to
achieve high performance on a variety of platforms, including servers, desktops,
embedded systems, and mobile devices.
The distributed deployment of TensorFlow enables it to run on different computers.
From smartphones to computer clusters, the desired training models can be generated.
Currently, deep learning frameworks that natively support distributed training include TensorFlow,
CNTK, DeepLearning4J, and MXNet.
When a single GPU is used, most deep learning frameworks rely on cuDNN, and
therefore deliver almost the same training speed, provided that the hardware computing
capabilities and allocated memory differ only slightly. However, for large-scale deep learning,
massive data makes it difficult for a single GPU to complete training in a limited time.
To handle such cases, TensorFlow enables distributed training.
TensorFlow is considered one of the best libraries for neural networks, and can reduce
the difficulty of deep learning development. In addition, TensorFlow is an open-source
platform, which facilitates its maintenance and updates and improves its
efficiency.
Keras, ranking third in the number of stars on GitHub, is packaged into an advanced API
of TensorFlow 2.0, which makes TensorFlow 2.0 more flexible, and easier to debug.
After a tensor is created in TensorFlow 1.0, the result cannot be returned directly. To
obtain the result, a session needs to be created, which involves the concept
of a graph, and code cannot run without session.run. This style is more like the hardware
programming language VHDL.
Compared with some simple frameworks such as PyTorch, TensorFlow 1.0 adds the
preceding concepts, which are confusing for users.
It is complex to debug TensorFlow 1.0, and its APIs are disordered, making it difficult for
beginners. Learners will come across many difficulties in using TensorFlow 1.0 even after
gaining the basic knowledge. As a result, many researchers have turned to PyTorch.

2.2 TensorFlow 2.0 Basics


2.2.1 Introduction
The core function of TensorFlow 2.0 is the dynamic graph mechanism called eager
execution. It allows users to write and debug models as if they were writing normal programs,
making TensorFlow easier to learn and apply. It also supports more platforms and
languages, and improves the compatibility between components by standardizing the
exchange formats and alignment of APIs. Deprecated APIs have been deleted in this
version, and duplicate APIs have been reduced to avoid confusion. TensorFlow 2.0 also
delivers excellent performance in compatibility and continuity by providing the
TensorFlow 1.x compatibility module. In addition, the tf.contrib module has been
removed. Maintained modules are moved to separate repositories. Unused and
unmaintained modules are removed.

2.2.2 Tensors
Tensor is the most basic data structure in TensorFlow. All data is encapsulated in tensors.
It is defined as a multidimensional array. A scalar is a rank-0 tensor. A vector is a rank-1
tensor. A matrix is a rank-2 tensor. In TensorFlow, tensors are classified into constant
tensors and variable tensors.

2.2.3 Eager Execution Mode


Static graph: TensorFlow 1.0 uses static graphs (graph mode) to separate the definition
and execution by using computational graphs. This is a declarative programming model.
In graph mode, developers need to build a computational graph, start a session, and then
input data to obtain an execution result.
This static graph has many advantages in distributed training, performance optimization,
and deployment. However, it is inconvenient to perform debugging, which is similar to
invoking a compiled C language program. In this case, internal debugging cannot be
performed on the program. Therefore, eager execution based on dynamic calculation
graphs is provided. Eager execution is a type of imperative programming, which is
consistent with the native Python.
A result is returned immediately after an operation is performed. TensorFlow 2.0 uses the
eager execution mode by default.

2.2.4 AutoGraph
In TensorFlow 2.0, eager execution is enabled by default. Eager execution is intuitive and
flexible for users (easier and faster to run a one-time operation), but may compromise
performance and deployability.
To achieve optimal performance and make a model deployable anywhere, you can add
the @tf.function decorator to a Python function to build a graph from it, making Python code
more efficient.
tf.function builds the TensorFlow operations in a function into a graph. In this way, the
function can be executed in graph mode. Such practice can be considered as
encapsulating the function as a TensorFlow operation of a graph.
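The difference between eager execution and a function wrapped with tf.function can be seen in a short sketch. It assumes TensorFlow 2.x is installed; the function name matmul_graph is hypothetical.

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Eager execution: the operation runs immediately and returns a concrete value.
eager_result = tf.matmul(x, x)

@tf.function
def matmul_graph(a, b):
    # On the first call, TensorFlow traces this Python function into a graph;
    # later calls with compatible inputs reuse the traced graph.
    return tf.matmul(a, b)

graph_result = matmul_graph(x, x)
```

Both calls return the same values. The decorated version pays a one-time tracing cost, so the benefit usually shows up for functions that are called many times or contain many operations.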

2.3 TensorFlow 2.0 Modules


2.3.1 Common Modules
tf: Functions in the tf module are used to perform common arithmetic operations,
such as tf.abs (calculating an absolute value), tf.add (adding elements one by one), and
tf.concat (concatenating tensors). Most operations in this module can be performed by
NumPy.
1. tf.errors: error type module of TensorFlow

2. tf.data: implements operations on datasets. Input pipelines created by tf.data are used to
read training data. In addition, data can be easily input from in-memory sources such as
NumPy arrays.
3. tf.distributions: implements various statistical distributions, such as the Bernoulli
distribution, uniform distribution, and Gaussian distribution. (In TensorFlow 2.x, this
functionality is provided by the separate TensorFlow Probability library.)
4. tf.gfile: implements operations on files. Functions in this module can be used to
perform file I/O operations, copy files, and rename files. (In TensorFlow 2.x, this
module is available as tf.io.gfile.)
5. tf.image: implements operations on images. Functions in this module include image
processing functions. This module is similar to OpenCV, and provides functions
related to image luminance, saturation, phase inversion, cropping, resizing, image
format conversion (RGB to HSV, YUV, YIQ, or gray), rotation, and Sobel edge
detection. This module is equivalent to a small image processing package of
OpenCV.
6. tf.keras: a Python API for invoking Keras tools. This is a large module that enables
various network operations.
7. tf.nn: function support module of the neural network. It is the most commonly used
module, which is used to construct the classical convolutional network. It also
contains the sub-module of rnn_cell, which is used to construct the recurrent neural
network. Common functions include: avg_pool (...), batch_normalization (...),
bias_add (...), conv2d (...), dropout (...), relu (...),
sigmoid_cross_entropy_with_logits(...), and softmax (...).

2.3.2 Keras API


TensorFlow 2.0 recommends Keras for network building. Common neural networks are
included in keras.layers.
Keras is a high-level API used to build and train deep learning models. It can be used for
rapid prototype design, advanced research, and production. It has the following three
advantages:

2.3.2.1 Easy to Use


Keras provides simple and consistent APIs that are optimized for common cases. It also
provides practical and clear feedback on user errors.

2.3.2.2 Modular and Composable


You can build Keras models by connecting configurable building blocks together, with
little restriction.

2.3.2.3 Easy to Extend


You can customize building blocks to express new research ideas, create layers and loss
functions, and develop advanced models.
The common functional modules are as follows:

2.3.2.4 tf.keras.layers
The tf.keras.layers namespace provides a large number of common network layer APIs,
such as the fully connected layer, activation layer, pooling layer, convolutional layer, and
recurrent neural network layer. For these network layers, you only need to specify the
related parameters of the network layer during creation and invoke the __call__ method
to complete the forward computation. When the __call__ method is invoked, Keras
automatically invokes the forward propagation logic of each layer. Generally, the logic is
implemented in the call function of the class.

2.3.2.5 Network Container


For common networks, the class instance of each layer needs to be called manually to
complete the forward propagation computation. When the number of network layers
becomes large, the code becomes bloated. The network container Sequential provided by Keras
can be used to encapsulate multiple network layers into a large network model. Only the
instance of the network model needs to be invoked to complete the sequential computing of
data from the first layer to the last layer at one time.
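A minimal sketch of the Sequential container follows, assuming TensorFlow 2.x is installed. The layer sizes and the all-ones input batch are arbitrary choices for illustration.

```python
import tensorflow as tf

# Two fully connected layers wrapped into one model by the Sequential container.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

batch = tf.ones((2, 5))   # 2 samples with 5 features each
probs = model(batch)      # one call runs every layer in order
```

Calling the model instance invokes each layer's forward logic in sequence, so the per-layer calls do not need to be written out by hand.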

2.4 Basic Development Steps of TensorFlow 2.0


2.4.1 Environment Setup
2.4.1.1 Environment Setup in Windows
Operating system: Windows 10
pip, which is built into Anaconda 3 (for Python 3)
Install TensorFlow.
Open Anaconda Prompt and run the pip command to install TensorFlow.

Figure 2-1 Installation command

Run the pip install tensorflow command on the command line, as shown in Figure 2-1.

2.4.1.2 Environment Setup in Linux


The simplest way to install TensorFlow in Linux is to run the pip command. If the
installation speed is slow, switch to the Tsinghua mirror in China by running the following
commands on the terminal:
pip install pip -U
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
Run the pip install tensorflow==2.0.0 command to install TensorFlow.

2.4.2 Development Process


The development process includes the following steps:
1. Data preparation: includes data exploration and data processing.
2. Network construction: includes defining the network structure, the loss function, the
model evaluation indicators, and selecting the optimizer.
3. Model training and verification
4. Model saving
5. Model restoration and invoking
The following describes the preceding process based on an actual project, MNIST
handwritten digit recognition.
Handwritten digit recognition is a common image recognition task in which computers
recognize text in images of handwriting. Unlike printed fonts, the handwriting of
different people varies in size and style, making it difficult for computers to
recognize. This project applies deep learning and TensorFlow tools to train
and build models based on the MNIST handwriting datasets.

2.4.2.1 Data Preparation


Download the MNIST datasets from http://yann.lecun.com/exdb/mnist/.
The MNIST datasets consist of a training set and a test set.
• Training set: 60,000 handwriting images and corresponding labels
• Test set: 10,000 handwriting images and corresponding labels
Figure 2-2 shows a dataset example.

Figure 2-2 Dataset example
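The files on that site are distributed as gzip-compressed IDX binary files. The following
is a minimal sketch of how such a file could be parsed once it has been decompressed,
using only the Python standard library. The function names parse_idx_images and
parse_idx_labels are hypothetical, and the byte strings at the end are tiny synthetic
stand-ins for the real files, used only to show the layout.

```python
import struct

IMAGE_MAGIC = 2051  # magic number at the start of an IDX image file
LABEL_MAGIC = 2049  # magic number at the start of an IDX label file


def parse_idx_images(data):
    """Parse decompressed IDX image bytes into a list of flat pixel lists."""
    # Header: four big-endian 32-bit integers (magic, count, rows, columns).
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError("not an IDX image file")
    size = rows * cols
    pixels = data[16:16 + count * size]
    if len(pixels) != count * size:
        raise ValueError("image data is truncated")
    images = [list(pixels[i * size:(i + 1) * size]) for i in range(count)]
    return images, rows, cols


def parse_idx_labels(data):
    """Parse decompressed IDX label bytes into a list of integer labels."""
    # Header: two big-endian 32-bit integers (magic, count).
    magic, count = struct.unpack(">II", data[:8])
    if magic != LABEL_MAGIC:
        raise ValueError("not an IDX label file")
    return list(data[8:8 + count])


# Synthetic example: two 2x2 "images" and their two labels (not real MNIST data).
fake_images = struct.pack(">IIII", IMAGE_MAGIC, 2, 2, 2) + bytes(
    [0, 255, 128, 64, 1, 2, 3, 4])
fake_labels = struct.pack(">II", LABEL_MAGIC, 2) + bytes([7, 3])

images, rows, cols = parse_idx_images(fake_images)
labels = parse_idx_labels(fake_labels)
print(rows, cols, images, labels)
```

For a downloaded file, the input would be the bytes returned by reading it through
gzip.open(path, "rb"). Real MNIST images are 28 x 28 pixels, with one byte per pixel in
the range 0 to 255.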

2.4.2.2 Network Construction


The softmax function is also called the normalized exponential function. It generalizes
the sigmoid function, which is used for binary classification, to multi-class
classification. Figure 2-3 shows how softmax is calculated.

Figure 2-3 Softmax calculation method
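As a minimal sketch of the calculation, softmax exponentiates each input score and
divides by the sum of all the exponentials, so the outputs are positive and add up to 1.
The version below is plain Python rather than the TensorFlow implementation. Subtracting
the maximum score before exponentiating is a common numerical-stability step added here
and does not change the result.

```python
import math


def softmax(logits):
    """Return the normalized exponentials of a list of scores."""
    shift = max(logits)  # subtract the max so math.exp cannot overflow
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three illustrative class scores; the largest score gets the largest probability.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# With two classes and the second score fixed at 0, softmax reduces to sigmoid.
z = 1.5
sigmoid_z = 1.0 / (1.0 + math.exp(-z))
print(softmax([z, 0.0])[0], sigmoid_z)
```

The last two printed values agree, which is the sense in which softmax extends sigmoid
from two classes to many.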

Building the model is the core of defining the network structure. As shown in Figure
2-4, the model's computation defines how the output is calculated from the input.

Figure 2-4 Model calculation process

Figure 2-5 shows the core code for TensorFlow to implement the softmax regression
model.

Figure 2-5 Softmax implementation code

Model compilation involves the following two parts:


Loss function selection: In machine learning and deep learning, an indicator is needed
to measure how well a model fits the data. This indicator is called the cost or loss, and
training tries to make it as small as possible. In this project, the cross entropy loss
function is used.
Gradient descent method: Once a loss function has been constructed for the model, an
optimization algorithm is used to find the parameters that minimize the value of the
loss function. Among the optimization algorithms for solving machine learning
parameters, gradient descent is the one most commonly used.
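The two parts can be sketched together for a softmax regression model in plain Python.
This is an illustration under stated assumptions, not the code from Figure 2-5: the
function names, the single two-feature training sample, the zero initial weights, and the
learning rate of 0.1 are all hypothetical. It relies on the standard result that, for
softmax followed by cross entropy, the gradient of the loss with respect to each score is
the predicted probability minus the one-hot target.

```python
import math


def softmax(logits):
    shift = max(logits)  # numerical-stability shift
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, bias, x):
    """Forward pass: class probabilities softmax(W x + b)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def cross_entropy(probs, label):
    """Loss for one sample: negative log probability of the true class."""
    return -math.log(probs[label])


def gradient_descent_step(weights, bias, x, label, learning_rate):
    """Update the parameters in place by one gradient descent step."""
    probs = predict(weights, bias, x)
    for k in range(len(weights)):
        # Gradient of the loss with respect to score k: p_k - y_k.
        grad_logit = probs[k] - (1.0 if k == label else 0.0)
        for j in range(len(x)):
            weights[k][j] -= learning_rate * grad_logit * x[j]
        bias[k] -= learning_rate * grad_logit


# Hypothetical toy problem: 2 input features, 3 classes, one training sample.
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]
x, label = [1.0, 2.0], 2

loss_before = cross_entropy(predict(weights, bias, x), label)
for _ in range(20):
    gradient_descent_step(weights, bias, x, label, learning_rate=0.1)
loss_after = cross_entropy(predict(weights, bias, x), label)
print(loss_before, loss_after)
```

With all-zero parameters the model assigns equal probability to the three classes, so the
starting loss is ln 3. Each step moves the parameters against the gradient, and the loss
on this sample falls.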

2.4.2.3 Model Training and Verification


As shown in Figure 2-6, the model is trained by iterating over the training data, either
in batches or over the full dataset at once. In this experiment, the model makes five
passes over the entire training set. In TensorFlow, model.fit is called directly for
training, where the epochs parameter specifies the number of passes over the training data.

Figure 2-6 Training process
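The relationship between batches and passes over the data can be made concrete with a
short sketch. The helper iter_batches is hypothetical and only yields index ranges. The
60,000 samples and five passes come from the text above, while the batch size of 32 is an
assumption chosen for illustration.

```python
def iter_batches(num_samples, batch_size):
    """Yield the index range of each mini-batch in one pass over the data."""
    for start in range(0, num_samples, batch_size):
        yield range(start, min(start + batch_size, num_samples))


num_samples = 60000  # size of the MNIST training set
batch_size = 32      # assumed for illustration
epochs = 5           # number of full passes over the training set

# One parameter update is performed per batch.
steps_per_epoch = sum(1 for _ in iter_batches(num_samples, batch_size))
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)

# When the batch size does not divide the data evenly, the last batch is smaller.
print([len(batch) for batch in iter_batches(10, 4)])
```

Full iteration corresponds to setting batch_size equal to num_samples, which gives one
update per pass.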

As shown in Figure 2-7, you can test the model on the test set: compare the predicted
labels with the actual ones and count the correct predictions to calculate the accuracy
on the test set.

Figure 2-7 Test and verification
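The accuracy calculation described above can be sketched as follows. The predicted label
for each sample is the class with the highest probability, and accuracy is the fraction
of samples whose predicted label matches the actual one. The function names and the four
sample predictions are made up for illustration and are not results from the MNIST
experiment.

```python
def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predicted_probs, true_labels):
    """Fraction of samples whose predicted class equals the true label."""
    correct = sum(1 for probs, label in zip(predicted_probs, true_labels)
                  if argmax(probs) == label)
    return correct / len(true_labels)


# Made-up model outputs for four samples over three classes.
predicted_probs = [
    [0.1, 0.7, 0.2],  # predicted class 1
    [0.8, 0.1, 0.1],  # predicted class 0
    [0.3, 0.3, 0.4],  # predicted class 2
    [0.2, 0.5, 0.3],  # predicted class 1
]
true_labels = [1, 0, 2, 2]

test_accuracy = accuracy(predicted_probs, true_labels)
print(test_accuracy)  # 3 of 4 predictions are correct: 0.75
```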

2.5 Summary
This chapter describes the common frameworks in the AI field and their features, in
particular the modules and basic usage of TensorFlow. On this basis, a training code
example is provided to show how the framework's functions and modules are applied in
practice. You can set up the environment and run the sample project by following the
instructions in this chapter. Doing so should give you a deeper understanding of the AI
field.

2.6 Quiz
1. AI is widely used. What are the mainstream frameworks of AI? What are their
features?
2. As a typical AI framework, TensorFlow has a large number of users. The most
significant change in its history is the move from TensorFlow 1.0 to TensorFlow 2.0.
Please describe the differences between the two versions.
3. TensorFlow has many modules to meet users' actual needs. Please describe three
common TensorFlow modules.
4. Configure an AI development framework by following instructions in this chapter.
