Deep Learning: Huawei AI Academy Training Materials
Deep Learning
HiSilicon and other HiSilicon icons are trademarks of HiSilicon Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their respective
holders.
Notice
The purchased products, services and features are stipulated by the contract made between HiSilicon
and the customer. All or part of the products, services and features described in this document may
not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all
statements, information, and recommendations in this document are provided "AS IS" without
warranties, guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made in
the preparation of this document to ensure accuracy of the contents, but all statements, information,
and recommendations in this document do not constitute a warranty of any kind, express or implied.
Website: http://www.hisilicon.com/en/
Email: [email protected]
Contents
1 Deep Learning
1.1 Deep Learning
1.1.1 Overview
1.1.2 Deep Neural Network
1.1.3 Development History of Deep Learning
1.1.4 Perceptron Algorithm
1.2 Training Rules
1.2.1 Loss Function
1.2.2 Gradient Descent Method
1.2.3 BP Algorithm
1.3 Activation Function
1.4 Regularization
1.4.1 Parameter Penalty
1.4.2 Dataset Expansion
1.4.3 Dropout
1.4.4 Early Stopping of Training
1.5 Optimizers
1.5.1 Momentum Optimizer
1.5.2 AdaGrad Optimizer
1.5.3 RMSProp Optimizer
1.5.4 Adam Optimizer
1.6 Types of Neural Networks
1.6.1 CNN
1.6.2 RNN
1.6.3 GAN
1.7 Common Issues
1.7.1 Data Imbalance
1.7.2 Gradient Vanishing and Gradient Explosion
1.7.3 Overfitting
1.8 Summary
1.9 Quiz
2 Deep Learning Development Frameworks
2.1 Deep Learning Development Frameworks
2.1.1 Introduction to PyTorch
1 Deep Learning
Deep learning is a machine learning approach based on neural networks that has great
advantages in fields such as computer vision, speech recognition, and natural language
processing. This chapter introduces the basics of deep learning, including the
development history of deep learning, the components of deep neural networks, the
types of deep neural networks, and common problems in deep learning projects.
line indicates that the weight is –1, and the number in a circle indicates an offset. For
example, for the point (0, 1):
𝑥1 = 0, 𝑥2 = 1
The output of the purple neuron is as follows:
𝑆𝑔𝑛( 𝑥1 + 𝑥2 − 1.5) = 𝑆𝑔𝑛( − 0.5) = −1
The coefficients of 𝑥1 and 𝑥2 are both 1 because the two lines on the left of the purple
neuron are solid lines. The output of the yellow neuron is as follows:
𝑆𝑔𝑛( − 𝑥1 − 𝑥2 + 0.5) = 𝑆𝑔𝑛( − 0.5) = −1
The coefficients of 𝑥1 and 𝑥2 are both –1 because the two lines on the left of the yellow
neuron are dashed lines. The output of the rightmost neuron is as follows:
𝑆𝑔𝑛( − 1 − 1 + 1) = 𝑆𝑔𝑛( − 1) = −1
In the preceding formula, both the numbers –1 in the left part are the outputs of the
purple and yellow neurons, and the number +1 is the offset of the output neuron. You
can verify that the outputs of the MLP for (0, 0), (1, 0), and (1, 1) are 1, –1, and 1,
respectively, which are consistent with the results of the XOR operations. Actually, the
purple and yellow neurons correspond to the purple and yellow lines in the right part of
Figure 1-7, respectively, so that a linear classifier is used to classify nonlinear samples. As
the number of hidden layers increases, the nonlinear classification capability of the neural
network is gradually enhanced, as shown in Figure 1-8.
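To make the computation above concrete, the following is a minimal plain-Python sketch of the same two-layer perceptron. The weight and offset values (hidden weights ±1, offsets −1.5 and +0.5, output weights +1 and offset +1) are taken from the example above; the evaluation order is an assumption for illustration.

def sgn(x):
    # Sign activation: +1 for non-negative input, -1 otherwise.
    return 1 if x >= 0 else -1

def xor_mlp(x1, x2):
    purple = sgn(x1 + x2 - 1.5)      # hidden neuron 1: weights +1, +1, offset -1.5
    yellow = sgn(-x1 - x2 + 0.5)     # hidden neuron 2: weights -1, -1, offset +0.5
    return sgn(purple + yellow + 1)  # output neuron: weights +1, +1, offset +1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_mlp(x1, x2))
# Expected: (0,0) -> 1, (0,1) -> -1, (1,0) -> -1, (1,1) -> 1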
reflects the error between the target output and the actual output of a perceptron. The
most common error function is the mean squared error function.
𝐽(𝑤) = (1/(2𝑛)) ∑_{𝑥∈𝑋,𝑑∈𝐷} (𝑡𝑑 − 𝑜𝑑)²
In the formula, w is the model parameter, X is the training sample set, n is the size of X,
D is the collection of neurons at the output layer, t is the target output, and o is the
actual output. Although w does not appear explicitly on the right of the formula, the
actual output o is calculated by the model and therefore depends on the value of w. Once
a training sample is given, t is a constant, while o still varies with w, so the independent
variable of the error function is w. The mean squared error loss function takes the sum of
squared errors as its main body, where an error is the difference between the target
output t and the actual output o. The coefficient 1/2 may look arbitrary, but as described
below it makes the derivative of the loss function more concise: when the loss is
differentiated, the exponent 2 cancels the coefficient 1/2, leaving a coefficient of 1.
Cross entropy loss is another commonly used loss function.
𝐽(𝑤) = −(1/𝑛) ∑_{𝑥∈𝑋,𝑑∈𝐷} (𝑡𝑑 𝑙𝑛 𝑜𝑑 + (1 − 𝑡𝑑) 𝑙𝑛(1 − 𝑜𝑑))
The meanings of the symbols are the same as those in the mean squared error loss
function. The cross entropy loss expresses the distance between two probability
distributions. In general, the mean squared error loss function is mainly used for regression
problems, while the cross entropy loss function is used more often for classification problems.
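As a simple illustration, the following NumPy sketch computes both losses for a small batch using the 1/(2n) and 1/n scaling given above. The sample values are hypothetical, not taken from the text.

import numpy as np

def mse_loss(t, o):
    # Mean squared error: J(w) = 1/(2n) * sum((t - o)^2)
    n = t.shape[0]
    return np.sum((t - o) ** 2) / (2 * n)

def cross_entropy_loss(t, o, eps=1e-12):
    # Cross entropy: J(w) = -1/n * sum(t*ln(o) + (1-t)*ln(1-o))
    n = t.shape[0]
    o = np.clip(o, eps, 1 - eps)  # avoid ln(0)
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o)) / n

t = np.array([1.0, 0.0, 1.0])   # target outputs
o = np.array([0.9, 0.2, 0.7])   # actual (predicted) outputs
print(mse_loss(t, o), cross_entropy_loss(t, o))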
The objective of the training model is to search for a weight vector that minimizes the
loss function. However, the neural network model is highly complex, and there is no
effective method to obtain an analytical solution in mathematics. Therefore, the gradient
descent method is needed to calculate the minimum value of the loss function.
updated. As a result, when the weight approaches the extremum, the gradient direction
oscillates back and forth near the extremum, making it difficult to converge to the extremum.
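The following is a minimal gradient descent sketch on a hypothetical one-dimensional quadratic loss (not the network loss itself). With a small learning rate the iterate approaches the extremum smoothly; with a learning rate that is too large it overshoots and oscillates around the extremum, as described above.

def grad_descent(lr, steps=10, w=5.0):
    # Minimize J(w) = w^2; the gradient is dJ/dw = 2w.
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(grad_descent(lr=0.1))   # converges smoothly toward 0
print(grad_descent(lr=0.95))  # overshoots and oscillates around 0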
1.2.3 BP Algorithm
The gradient of the loss function needs to be calculated when the gradient descent
algorithm is used. For conventional machine learning algorithms, such as linear
regression and support vector machine (SVM), manual calculation of gradients is
sometimes feasible. However, the neural network model function is complex, and the
gradient of the loss function with respect to all parameters cannot be represented by
using one formula. Therefore, Hinton proposes the BP algorithm, which effectively
accelerates the training of neural networks by updating weight values layer by layer
during the backpropagation process.
Assume that there are L layers in the model (the input layer is excluded), and the
parameter of the lth layer is denoted as 𝑤𝑙 . During iteration, J(w) does not reach its
minimum because the parameter of each layer deviates from its optimal value; that is,
the loss function value results from errors in the parameter values. In the forward
propagation process, each layer causes a certain
error. These errors accumulate layer by layer and are represented in the form of a loss
function at the output layer. Without a given model function, we cannot determine the
relationship between the loss function and the parameters, but can determine the
relationship 𝜕𝐽/𝜕𝑜 between the loss function and the model output. This is a key step in
understanding the BP algorithm.
Assuming that an output of the last but one layer is 𝑜′, and an activation function of the
output layer is f, the loss function may be expressed as follows:
𝐽(𝑤) = (1/(2𝑛)) ∑_{𝑥∈𝑋,𝑑∈𝐷} (𝑡𝑑 − 𝑓(𝑤𝐿 𝑜′𝑑))²
𝑜′𝑑 is related to 𝑤1 , 𝑤2 , … , 𝑤𝐿−1 only. As illustrated, the loss function is split into two
parts: a part caused by 𝑤𝐿 and a part caused by other parameters. The latter is
accumulated by errors and acts on the loss function in the form of output at the last but
one layer. According to 𝜕𝐽/𝜕𝑜 obtained above, 𝜕𝐽/𝜕𝑜′ and 𝜕𝐽/𝜕𝑤𝐿 can be easily
calculated. In this way, the gradient of the loss function with respect to the parameters of
the output layer is calculated. It is easy to find that the derivative value 𝑓′(𝑤𝐿 𝑜′𝑑 ) of the
activation function participates in the calculation of 𝜕𝐽/𝜕𝑜′ and 𝜕𝐽/𝜕𝑤𝐿 in the form of
weight. When the derivative value of the activation function is always less than 1 (this is
the case with the sigmoid function), the value of 𝜕𝐽/𝜕𝑜 becomes increasingly small
during backpropagation. This phenomenon is called gradient vanishing, which will be
described in more detail below.
Other layer parameters may be similarly obtained based on the relationship between
𝜕𝐽/𝜕𝑜′ and 𝜕𝐽/𝜕𝑜′′. Intuitively, the BP algorithm is the process of distributing errors layer
by layer. It is essentially an algorithm that uses the chain rule to calculate the gradient of
the loss function with respect to the parameters of each layer.
Generally, the BP algorithm is shown in Figure 1-13.
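The following NumPy sketch makes the chain-rule bookkeeping explicit for a hypothetical two-layer network with sigmoid activations and the mean squared error loss above: the gradient with respect to the output-layer weights is computed first, and 𝜕𝐽/𝜕𝑜′ is then reused to obtain the gradient of the earlier layer. All sizes and data are made up for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # 4 samples, 3 features
t = rng.random(size=(4, 2))        # 4 samples, 2 target outputs
w1 = rng.normal(size=(3, 5))       # first-layer weights
w2 = rng.normal(size=(5, 2))       # output-layer weights (w_L)

# Forward pass
o_prime = sigmoid(x @ w1)          # output of the last-but-one layer
o = sigmoid(o_prime @ w2)          # actual output
J = np.sum((t - o) ** 2) / (2 * x.shape[0])

# Backward pass (chain rule, layer by layer)
dJ_do = (o - t) / x.shape[0]                   # dJ/do
delta2 = dJ_do * o * (1 - o)                   # multiply by f'(w_L o') for sigmoid
dJ_dw2 = o_prime.T @ delta2                    # gradient w.r.t. output-layer weights
dJ_do_prime = delta2 @ w2.T                    # error propagated to the previous layer
delta1 = dJ_do_prime * o_prime * (1 - o_prime)
dJ_dw1 = x.T @ delta1                          # gradient w.r.t. first-layer weights

print(J, dJ_dw2.shape, dJ_dw1.shape)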
and the capability of learning complex function mappings from data is low. This section
describes common activation functions of deep learning and their advantages and
disadvantages. You can use them as required.
derivative of an activation function always approaches 0 at a location far away from the
function center. As a result, the weight cannot be updated.
As shown in the lower left part of Figure 1-14, the Rectified Linear Unit (ReLU) function
is the most widely used activation function at present. Compared with sigmoid and other
activation functions, the ReLU function does not have an upper bound. Therefore, the
neurons are never saturated. This effectively alleviates the gradient vanishing problem,
and enables quick convergence in the gradient descent algorithm. Experiments show that
neural networks using the ReLU activation function can perform well without
unsupervised pre-training. In addition, functions such as sigmoid require an exponential
operation, so their computation cost is high; the ReLU function greatly reduces this
workload. Although the ReLU function has many advantages, its disadvantages are also
obvious. Because the ReLU function has no upper bound, training can easily diverge.
Moreover, the ReLU function is not differentiable at 0, so it is not smooth enough for
some regression problems. Most importantly, the ReLU function is constantly 0 in the
negative domain, which may cause neuron death.
As shown in the lower middle part of Figure 1-14, the Softplus function is a modification
of the ReLU function. Although the Softplus function requires more computation than
the ReLU function, it has a continuous derivative and a relatively smooth function surface.
The softmax function is an extension of the sigmoid function in high dimensions. The
softmax function is used to map any K-dimensional real number vector to a K-
dimensional probability distribution. Therefore, the softmax function is often used as the
output layer of a multiclass classification task.
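The following NumPy sketch shows how the activation functions discussed here could be implemented; these are standard definitions, not code from the text.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)          # 0 in the negative domain, identity otherwise

def softplus(x):
    return np.log1p(np.exp(x))         # smooth approximation of ReLU

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract the max for numerical stability
    return e / np.sum(e)               # maps a K-dimensional vector to a probability distribution

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), softplus(z), softmax(z))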
1.4 Regularization
Regularization is a very important and effective technique in machine learning to reduce
the generalization error. Compared with conventional machine learning models, a deep
learning model generally has a larger capacity, and therefore is more likely to cause
overfitting. To this end, researchers have proposed many effective techniques to prevent
overfitting, including:
Adding constraints to parameters, such as L1 and L2 norms.
Expanding the training dataset, for example, by adding noise or transforming data.
Dropout.
Stopping training in advance (early stopping).
This section describes these methods one by one.
||𝑤||₁ = ∑_{𝑖} |𝑤𝑖|
This formula represents the sum of absolute values of all elements in the vector. It can be
proved that the gradient of the L1 norm is Sgn(w). In this way, the gradient descent
method can be used to solve the L1 regularization model.
The L2 norm is a common Euclidean distance.
||𝑤||₂ = √(∑_{𝑖} 𝑤𝑖²)
The L2 norm is widely used, and is often denoted as ||w|| with the subscript ignored.
However, the gradient of the L2 norm itself is inconvenient to work with, so the penalty
term used in L2 regularization is generally the following squared form:
𝑍(𝑤) = (1/2)||𝑤||²
As illustrated, a derivative of the penalty term for L2 regularization is w. Therefore, when
gradient descent is performed on the L2 regularization model, the weight update formula
should be changed to the following, where 𝜂 is the learning rate and 𝑎 is the
regularization coefficient:
𝑤 = (1 − 𝜂𝑎)𝑤 − 𝜂𝛻𝐽
Compared with the normal gradient update formula, the preceding formula is equivalent
to multiplying the parameter by a reduction factor, thereby limiting the parameter
growth.
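A minimal sketch of the update above, with hypothetical gradient and hyperparameter values:

import numpy as np

def l2_regularized_update(w, grad_J, eta=0.1, a=0.01):
    # w = (1 - eta*a) * w - eta * grad(J): shrink the weights, then apply the data gradient.
    return (1 - eta * a) * w - eta * grad_J

w = np.array([0.5, -1.2, 2.0])
grad_J = np.array([0.1, -0.3, 0.05])   # gradient of the unregularized loss (made up here)
print(l2_regularized_update(w, grad_J))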
1.4.3 Dropout
Dropout is a common regularization method with simple calculation. It has been widely
used since 2014. To put it simply, Dropout randomly discards the output of some neurons
during training. The parameters of these discarded neurons are not updated. By randomly
discarding neuron outputs, Dropout constructs a series of subnets with different structures,
as shown in Figure 1-16. These subnets are merged in a certain manner within the same
deep neural network, which is equivalent to adopting an ensemble learning method. When
the model is used for inference, we want to draw on the collective wisdom of all the trained
subnets, so random discarding is no longer applied.
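A minimal NumPy sketch of (inverted) dropout during training, assuming a drop probability p; at inference time the mask is simply not applied, matching the description above.

import numpy as np

def dropout(x, p=0.5, training=True):
    if not training:
        return x                        # no discarding when the model is used for inference
    mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)         # rescale so the expected output is unchanged

h = np.array([0.2, 1.5, -0.7, 0.9])
print(dropout(h, p=0.5, training=True))
print(dropout(h, p=0.5, training=False))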
1.5 Optimizers
There are various optimized versions of gradient descent algorithms. In object-oriented
language implementation, different gradient descent algorithms are often encapsulated
into an object which is called an optimizer. Common optimizers include the SGD
optimizer, momentum optimizer, Nesterov, Adagrad, Adadelta, RMSprop, Adam, AdaMax,
and Nadam. These optimizers mainly improve the convergence speed of the algorithm,
improve its stability near local extrema, and reduce the difficulty of tuning
hyperparameters. This section describes the design of several of the most commonly used
optimizers.
model training, the distance between the initial value and the optimal solution of the loss
function is long. Therefore, a high learning rate is required. However, as the number of
updates increases, the weight parameter gets closer to the optimal solution, so the
learning rate decreases accordingly. The advantage of Adagrad lies in its automatic
update of the learning rate, but its disadvantage also comes from this. Because the
update of the learning rate depends on the gradient in previous iterations, it is likely that
the learning rate has been reduced to 0 when the weight parameter is far from the
optimal solution. In this case, the optimization is meaningless.
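A minimal sketch of the AdaGrad update in its standard form: r accumulates squared gradients, so the effective learning rate for each parameter keeps shrinking, which is exactly the behavior discussed above. The loss and values are hypothetical.

import numpy as np

def adagrad_step(w, grad, r, lr=0.1, eps=1e-8):
    r = r + grad ** 2                       # accumulate squared gradients
    w = w - lr * grad / (np.sqrt(r) + eps)  # per-parameter, ever-shrinking step size
    return w, r

w = np.array([1.0, -2.0])
r = np.zeros_like(w)
for _ in range(100):
    grad = 2 * w                            # gradient of the toy loss J(w) = ||w||^2
    w, r = adagrad_step(w, grad, r)
print(w)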
𝑣̂(𝑛) = 𝑣(𝑛) / (1 − 𝑏^𝑛)
The learning rate, a, and b all need to be set manually in Adam, but the difficulty of
setting them is greatly reduced. Experiments show that good default values are a = 0.9,
b = 0.999, and a learning rate of 0.0001. In practice, Adam converges quickly. When the
algorithm converges to saturation, the learning rate can be reduced appropriately, and the
other parameters do not need to be adjusted. Generally, the algorithm converges to a
satisfactory extremum after the learning rate is reduced a few times.
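In practice, these optimizers are usually used through a framework rather than implemented by hand. A minimal TensorFlow sketch with the default-style values mentioned above; the variable and toy loss are hypothetical.

import tensorflow as tf

w = tf.Variable([1.0, -2.0])
opt = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999)

for _ in range(100):
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum(w ** 2)          # toy loss J(w) = ||w||^2
    grads = tape.gradient(loss, [w])
    opt.apply_gradients(zip(grads, [w]))
print(w.numpy())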
1.6.1 CNN
1.6.1.1 Overview
A CNN is a feedforward neural network (FNN). Different from a fully connected neural
network, the CNN enables each artificial neuron to respond only to units within a local
receptive field, and it performs excellently in image processing. A CNN generally includes
convolutional layers, pooling layers, and fully connected layers.
In the 1960s, when studying neurons responsible for local sensitivity and direction
selection in the cat visual cortex, Hubel and Wiesel found a unique network structure that
could effectively reduce the complexity of FNNs; the CNN was later proposed based on
this structure. The CNN has become a research hotspot in many scientific fields, especially
in pattern recognition. It is widely used because it avoids complex image preprocessing
and can take the original image directly as input.
The name of CNN comes from convolution operations. Convolution is an inner product
operation performed on an image (or a feature map) and a filter matrix (also called a
filter or a convolution kernel). The image is an input of the neural network, and the
feature map is an output of each convolutional layer or pooling layer in the neural
network. The difference is that the values in the feature map are outputs of neurons.
Therefore, the values are not limited theoretically. The values in the image correspond to
the luminance of the RGB channels, and the values range from 0 to 255. Each
convolutional layer in the neural network corresponds to one or more filter matrices.
Different from a fully connected neural network, the CNN enables each neuron at a
convolutional layer to use only the output of neurons in a local window but not all
neurons at the upper layer as its input. This characteristic of convolution operations is
referred to as local perception.
It is generally considered that human perception of the outside world is from local to
global. Spatial correlations among local pixels of an image are closer than those among
pixels that are far away. Therefore, each neuron does not need to collect global
information of an image and needs to collect only local information. Then we can obtain
the global information at a higher layer by synthesizing the local information collected by
each neuron. The idea of sparse connectivity is inspired by the structure of the biological
visual system. The neurons in the visual cortex can respond to the stimuli in only certain
regions, and therefore can receive information locally.
Another characteristic of convolution operations is parameter sharing. One or more
convolution kernels can be used to scan an input image, and the parameters in a
convolution kernel are the weights of the model. At a convolutional layer, all neurons
share the same convolution kernels and therefore the same weights. Weight sharing
means that the parameters of a convolution kernel stay fixed while the kernel traverses
the entire image. For example, if a convolutional layer has three convolution kernels, each
kernel scans the entire image, and its parameter values remain unchanged during the
scan; that is, all positions of the image share the same weights. This means that the
features learned from one part of an image can also be applied to other parts of the
image or to other images, which is called position invariance.
obtained through calculation by using different convolution kernels must have the same
width and height so that they can be stitched together. In other words, all convolution
kernels at the same convolutional layer must have the same size.
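The following NumPy sketch illustrates local perception and parameter sharing on a single-channel image: one 3×3 kernel (the shared weights) slides over the image, and each output value depends only on a local 3×3 window. The image and kernel values are hypothetical.

import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Inner product of the kernel with a local window (local perception);
            # the same kernel weights are reused at every position (parameter sharing).
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])        # a simple vertical-edge filter
print(conv2d(image, kernel))              # 3x3 feature map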
1.6.2 RNN
The RNN is a neural network that captures dynamic information in sequential data
through periodical connections of hidden layer nodes. It can classify sequential data.
Unlike FNNs, an RNN can hold the context state in sequential data. It is no longer limited
to the spatial boundaries of conventional neural networks and can be extended along
time sequences. Intuitively, the memory unit at the current moment is connected to the
memory unit at the next moment. RNNs are
widely used in sequence-related scenarios, such as videos, audios, and sentences.
The RNN relies on the backpropagation through time (BPTT) algorithm, which is an
extension of the conventional BP algorithm on time sequences. The conventional BP
algorithm considers only the error propagation between different hidden layers, while the
BPTT algorithm further needs to consider the error propagation within the same hidden
layer between different time nodes. Specifically, the error of a memory unit at moment t
consists of two parts: a component propagated by the hidden layer at moment t, and a
component propagated by the memory unit at moment t+1. The method for calculating
the two components when they are separately propagated is the same as that of the
conventional BP algorithm. When propagated to the memory unit, the sum of the two
components is used as the error of the memory unit at moment t. It is easy to calculate
gradients of parameters U, V, and W at moment t based on the errors of the hidden layer
and the memory unit at moment t. After all time nodes are traversed reversely, T
gradients are obtained for each of the parameters U, V, and W, where T indicates a total
time length. The sum of the T gradients is the total gradient of the parameters U, V, and
W. After obtaining the gradient of each parameter, you can easily solve the problem by
using the gradient descent algorithm.
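A minimal NumPy sketch of the recurrent forward pass, using the parameter names U, V, and W from the text (U maps the input, W maps the previous memory-unit state, V maps the hidden state to the output). The sizes and data are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
T, in_dim, hid_dim, out_dim = 5, 3, 4, 2
U = rng.normal(size=(in_dim, hid_dim))    # input-to-hidden weights
W = rng.normal(size=(hid_dim, hid_dim))   # hidden-to-hidden (recurrent) weights
V = rng.normal(size=(hid_dim, out_dim))   # hidden-to-output weights

x = rng.normal(size=(T, in_dim))          # a sequence of T input vectors
h = np.zeros(hid_dim)                     # memory unit state
outputs = []
for t in range(T):
    h = np.tanh(x[t] @ U + h @ W)         # the memory unit reuses its previous state
    outputs.append(h @ V)                 # output at moment t
print(np.stack(outputs).shape)            # (T, out_dim)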
RNNs still have many problems. Because the memory unit receives its own output from
the previous moment at each step, problems that easily occur in deep fully connected
neural networks, such as gradient vanishing and gradient explosion, also trouble RNNs.
Moreover, the state of the memory unit at moment t cannot persist for a long time,
because the state is mapped by the activation function at every moment. By the time a
loop reaches the end of a long sequence, the input at the beginning of the sequence may
already have been washed out by the repeated activation mappings. In other words, the
RNN attenuates information that is stored for a long time.
unit can selectively remember key information, and the long short-term memory (LSTM)
network can implement this function. As shown in Figure 1-27 (Colah, 2015,
Understanding LSTM Networks), the core of the LSTM network is the LSTM block, which
replaces the hidden layer in RNNs. The LSTM block includes three computing units: an
input gate, a forget gate, and an output gate, so that the LSTM can selectively memorize,
forget, and output information. In this way, the selective memory function is
implemented. Notably, there are two lines connecting adjacent LSTM blocks, representing
the cell state and the hidden state of the LSTM, respectively.
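A minimal NumPy sketch of a single LSTM step, showing the input, forget, and output gates and the two states (cell state c and hidden state h) mentioned above. The weights, sizes, and data are hypothetical, and bias terms are omitted for brevity.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, Wf, Wi, Wo, Wc):
    z = np.concatenate([h, x])            # previous hidden state and current input
    f = sigmoid(Wf @ z)                   # forget gate: what to erase from the cell state
    i = sigmoid(Wi @ z)                   # input gate: what new information to write
    o = sigmoid(Wo @ z)                   # output gate: what part of the cell state to expose
    c = f * c + i * np.tanh(Wc @ z)       # updated cell state (long-term memory line)
    h = o * np.tanh(c)                    # updated hidden state (output line)
    return h, c

rng = np.random.default_rng(0)
dim, in_dim = 4, 3
Wf, Wi, Wo, Wc = (rng.normal(size=(dim, dim + in_dim)) for _ in range(4))
h, c = np.zeros(dim), np.zeros(dim)
for x in rng.normal(size=(6, in_dim)):    # a toy sequence of length 6
    h, c = lstm_step(x, h, c, Wf, Wi, Wo, Wc)
print(h, c)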
1.6.3 GAN
A GAN is a framework that can be used in scenarios such as image generation, semantic
segmentation, text generation, data augmentation, chatbots, information retrieval, and
information sorting. Before the emergence of GANs, a deep generative model usually
needed a Markov chain or maximum conditional likelihood estimation, which could easily
lead to many intractable probabilistic problems. Through the adversarial process, a GAN
trains generator G and discriminator D at the same time for the two parties to play the
game. Discriminator D is used to determine whether a sample is real or generated by
generator G. Generator G is used to try to generate a sample that cannot be
distinguished from real samples by discriminator D. The GAN adopts a mature BP
algorithm for training.
The objective function consists of two parts. The first part is related only to discriminator
D. If a real sample is input, the value of the first part is larger when the output of D is
closer to 1. The second part is related to both G and D. When the input is random noise,
G can generate a sample. Discriminator D receives this sample as input. The value of the
second part is larger when the output is closer to 0. Since the objective of D is to
maximize the objective function, it is necessary to output 1 in the first term and 0 in the
second term, that is, to correctly classify the samples. The objective of the generator is to
minimize the objective function, but the first term of the objective function is irrelevant
to the generator, so the generator can only minimize the second term. To minimize the
second term, the generator needs to output samples that make the discriminator output
1, that is, samples that the discriminator cannot distinguish from real ones.
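For reference, the standard form of this two-part objective, as proposed by Goodfellow et al. (2014) and presumably the function referred to above, is:
min_G max_D V(D, G) = E_{x∼p_data}[ln D(x)] + E_{z∼p_z}[ln(1 − D(G(z)))]
The first term involves only discriminator D evaluated on real samples, and the second term involves both G and D evaluated on generated samples.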
Since GAN was first proposed in 2014, more than 200 GAN variants have been derived
and widely used in many generation problems. However, the original GAN also has some
problems, for example, an unstable training process. The training processes of the fully
connected neural network, CNN, and RNN described above all minimize the cost function
by optimizing parameters. GAN training is different, mainly because the adversarial
1.7.3 Overfitting
Overfitting refers to the problem that a model performs well on the training set but
poorly on the test set. Overfitting may be caused by many reasons, such as excessively
high feature dimensions, excessively complex model assumptions, excessive parameters,
insufficient training data, and excessive noise. In essence, overfitting occurs because the
model overfits the training dataset without taking into account the generalization
capability. Consequently, the model can better predict the training set, but the prediction
result of the new data is poor.
If overfitting occurs due to insufficient training data, consider adding more data. One
approach is to obtain more data from the data source, but this is often time-consuming
and laborious. A more common practice is data augmentation.
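A minimal NumPy sketch of data augmentation for images (shapes and values are hypothetical): each original sample yields several modified copies, such as a horizontal flip and a noisy version.

import numpy as np

def augment(image, noise_std=0.05, seed=0):
    rng = np.random.default_rng(seed)
    flipped = image[:, ::-1]                                                    # horizontal flip
    noisy = np.clip(image + rng.normal(0.0, noise_std, image.shape), 0.0, 1.0)  # add Gaussian noise
    return [image, flipped, noisy]

image = np.random.rand(28, 28)     # one hypothetical grayscale image with values in [0, 1]
augmented = augment(image)
print(len(augmented), augmented[1].shape)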
If overfitting is caused by an excessively complex model, multiple methods can be used to
suppress it. The simplest method is to adjust the hyperparameters of the model and
reduce the number of layers and neurons in the network to limit its fitting capability.
Alternatively, regularization techniques may be introduced into the model. Related
content has been described above and is therefore omitted here.
1.8 Summary
This chapter mainly introduces the definition and development of neural networks, the
training rules of perceptrons, and common neural networks (CNNs, RNNs, and GANs). It
also describes common issues with neural networks in AI engineering and their solutions.
1.9 Quiz
1. Deep learning is a new research direction derived from machine learning. What are
the differences between deep learning and conventional machine learning?
2. In 1986, the introduction of MLP ended the first "cold winter" in the history of
machine learning. Why can MLP solve the XOR problem? What is the role of
activation functions in the problem solving?
3. The sigmoid activation function is widely used in the early stage of neural network
research. What problems does it have? Does the tanh activation function solve these
problems?
4. The regularization method is widely used in deep learning models. What is its
purpose? How does Dropout implement regularization?
5. An optimizer is the encapsulation of model training algorithms. Common optimizers
include SGD and Adam. Try to compare the performance differences between
optimizers.
6. Supplement the convolution operation result in Figure 1-22 by referring to the
example.
7. RNNs can save the context state in the sequential data. How is this memory function
implemented? What problems might occur when you deal with long sequences?
8. The GAN is a deep generative network framework. Please briefly describe its training
principle.
9. Gradient explosion and gradient vanishing are common problems in deep learning.
What are their causes? How can I avoid these problems?
2 Deep Learning Development Frameworks
This chapter introduces the common frameworks and their features in the AI field, and
describes the typical framework TensorFlow in detail to help you understand the concept
of AI and put it into practice to meet actual demands. This chapter also introduces
MindSpore, a Huawei-developed framework that boasts many unsurpassable advantages.
After reading this chapter, you can choose to use MindSpore based on your requirements.
In addition, PyTorch provides tensors that support CPUs and GPUs, greatly accelerating
computing.
2.1.3.1 Multi-platform
All platforms that support the Python development environment also support
TensorFlow. However, TensorFlow depends on other software such as the NVIDIA CUDA
Toolkit and cuDNN to access a supported GPU.
2.1.3.2 GPU
TensorFlow supports certain NVIDIA GPUs, which are compatible with NVIDIA CUDA
Toolkit versions that meet specific performance standards.
2.1.3.3 Distributed
TensorFlow supports distributed computing, allowing computational graphs to be
computed on different processes. These processes may be located on different servers.
2.1.3.4 Multi-lingual
The main programming language of TensorFlow is Python. C++, Java, and Go APIs can
also be used, but their stability is not guaranteed; the same applies to the many
third-party bindings for C#, Haskell, Julia, Rust, Ruby, Scala, R, and even PHP. Google also
released TensorFlow Lite, a mobile-optimized library for running TensorFlow applications
on Android.
2.1.3.5 Scalability
One of the main advantages of using TensorFlow is that it has a modular, scalable, and
flexible design. Developers can easily port models among the CPU, GPU, and TPU with a
few code changes. Python developers can develop their own models by using native and
low-level APIs (or core APIs) of TensorFlow, or develop built-in models by using advanced
API libraries of TensorFlow. TensorFlow has many built-in and distributed libraries. It can
be overlaid with an advanced deep learning framework such as Keras to serve as an
advanced API.
2.2.2 Tensors
Tensor is the most basic data structure in TensorFlow. All data is encapsulated in tensors.
It is defined as a multidimensional array. A scalar is a rank-0 tensor. A vector is a rank-1
tensor. A matrix is a rank-2 tensor. In TensorFlow, tensors are classified into constant
tensors and variable tensors.
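A minimal sketch of constant and variable tensors in standard TensorFlow 2 usage:

import tensorflow as tf

scalar = tf.constant(3.0)                       # rank-0 tensor
vector = tf.constant([1.0, 2.0, 3.0])           # rank-1 tensor
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # rank-2 tensor
weights = tf.Variable(tf.zeros((2, 2)))         # variable tensor, can be updated during training

weights.assign_add(matrix)                      # variables support in-place updates
print(scalar.shape, vector.shape, matrix.shape, weights.numpy())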
2.2.4 AutoGraph
In TensorFlow 2.0, eager execution is enabled by default. Eager execution is intuitive and
flexible for users (easier and faster to run a one-time operation), but may compromise
performance and deployability.
To achieve optimal performance and make a model deployable anywhere, you can use
the @tf.function decorator to build a graph from a program, which makes the Python
code more efficient.
tf.function can build a TensorFlow operation in the function into a graph. In this way, this
function can be executed in graph mode. Such practice can be considered as
encapsulating the function as a TensorFlow operation of a graph.
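A minimal sketch of the decorator described above, using standard TensorFlow 2 APIs; the small dense computation is hypothetical.

import tensorflow as tf

@tf.function   # traces the Python function into a TensorFlow graph
def dense_step(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal((4, 3))
w = tf.Variable(tf.random.normal((3, 2)))
b = tf.Variable(tf.zeros(2))
print(dense_step(x, w, b))   # executed in graph mode after the first trace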
2. tf.data: implements operations on datasets. Input pipes created by tf.data are used to
read training data. In addition, data can be easily input from memories such as
NumPy.
3. tf.distributions: implements various statistical distributions. The functions in this
module are used to implement various statistical distributions, such as Bernoulli
distribution, uniform distribution, and Gaussian distribution.
4. tf.gfile: implements operations on files. Functions in this module can be used to
perform file I/O operations, copy files, and rename files.
5. tf.image: implements operations on images. Functions in this module include image
processing functions. This module is similar to OpenCV, and provides functions
related to image luminance, saturation, phase inversion, cropping, resizing, image
format conversion (RGB to HSV, YUV, YIQ, or gray), rotation, and Sobel edge
detection. This module is equivalent to a small image processing package of
OpenCV.
6. tf.keras: a Python API for invoking Keras tools. This is a large module that enables
various network operations.
7. tf.nn: function support module of the neural network. It is the most commonly used
module, which is used to construct the classical convolutional network. It also
contains the sub-module of rnn_cell, which is used to construct the recurrent neural
network. Common functions include avg_pool(...), batch_normalization(...), bias_add(...),
conv2d(...), dropout(...), relu(...), sigmoid_cross_entropy_with_logits(...), and softmax(...).
2.3.2.4 tf.keras.layers
The tf.keras.layers namespace provides a large number of common network layer APIs,
such as fully connected layers, activation layers, pooling layers, convolutional layers, and
recurrent neural network layers. For these network layers, you only need to specify the
related parameters of the network layer during creation and invoke the __call__ method
to complete the forward computation. When invoking the __call__ method, Keras
automatically invokes the forward propagation logic of each layer. Generally, the logic is
implemented in the call function of the class.
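A minimal sketch of creating a layer and invoking it for forward computation, as described above; the layer size and input shape are hypothetical.

import tensorflow as tf

dense = tf.keras.layers.Dense(units=10, activation='relu')  # specify parameters at creation
x = tf.random.normal((32, 784))                             # a batch of 32 flattened inputs
y = dense(x)                                                # __call__ runs the forward pass
print(y.shape)                                              # (32, 10)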
Run the pip install tensorflow command on the command-line interface, as shown in
Figure 2-1.
The process of model establishment is the core process of network structure definition. As
shown in Figure 2-4, the network operation process defines how the output is calculated
based on the input.
Figure 2-5 shows the core code for TensorFlow to implement the softmax regression
model.
As shown in Figure 2-7, you can test the model using the test set, compare predicted
results with actual ones, and find correctly predicted labels, to calculate the accuracy of
the test set.
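A sketch of the comparison described above; the prediction and label values are hypothetical.

import tensorflow as tf

predictions = tf.constant([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])  # hypothetical model outputs
test_labels = tf.constant([1, 0, 0])                             # hypothetical true labels

predicted_labels = tf.argmax(predictions, axis=1)                # label with the highest score
correct = tf.equal(predicted_labels, tf.cast(test_labels, tf.int64))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
print(accuracy.numpy())   # 2 of 3 correct -> about 0.667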
2.5 Summary
This chapter describes the common frameworks and features in the AI field, especially
the module components and basic usage of TensorFlow. On this basis, a training code
example is provided to introduce the application of framework functions and modules in
the practical situation. You can set up the environment and run the sample project
according to the instruction in this chapter. It is believed that after this process, you will
have a deeper understanding of the AI field.
2.6 Quiz
1. AI is widely used. What are the mainstream frameworks of AI? What are their
features?
2. As a typical AI framework, TensorFlow has a large number of users. During the
maintenance of TensorFlow, the major change was the upgrade from TensorFlow 1.0
to TensorFlow 2.0. Please describe the differences between the two versions.
3. TensorFlow has many modules to meet users' actual needs. Please describe three
common TensorFlow modules.
4. Configure an AI development framework by following instructions in this chapter.