Machine Learning: Huawei AI Academy Training Materials
Huawei and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their respective
holders.
Notice
The purchased products, services, and features are stipulated by the contract made between Huawei
and the customer. All or part of the products, services, and features described in this document may not
be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all
statements, information, and recommendations in this document are provided "AS IS" without
warranties, guarantees, or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made in the
preparation of this document to ensure accuracy of the contents, but all statements, information, and
recommendations in this document do not constitute a warranty of any kind, express or implied.
Website: http://e.huawei.com
1 Machine Learning
Figure 1-3 Relationship between the hypothesis function and target function
This practice also applies to more complex situations, as shown in Figure 1-3. For a given
task, we can collect a large amount of training data. The data must conform to some target
function f; otherwise, learning the task would be meaningless. By analyzing the training
data, machine learning algorithms can produce a hypothesis function g that is as similar to
the target function f as possible. The output of a machine learning algorithm is therefore
generally not identical to the target function. However, as the amount of training data
increases, the hypothesis function g gradually approaches the target function f and can
achieve satisfactory precision.
Notably, the existence of the target function f is sometimes highly abstract. For a typical
image classification task, the target function is a mapping from an image set to a
category set. To enable a computer program to process logical information such as
images and categories, you need to map the images or categories to a scalar, a vector, or
a matrix in a particular encoding manner. For example, you can assign a sequence
number starting with 0 to each category to map the category to a scalar. Different one-
hot vectors can also be used to represent different categories, and this manner is referred
to as one-hot encoding. Encoding images is slightly more complex: an image is generally
represented by a three-dimensional matrix. With this encoding mode, the domain of the
target function f can be regarded as a set of three-dimensional matrices, and its range as a
set of label numbers. Although encoding is not part of the machine learning algorithm
itself, in some cases the choice of encoding mode also affects the efficiency of machine
learning algorithms.
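As an illustration of these encoding choices, the following minimal sketch (using NumPy, with hypothetical category names) maps categories to sequence numbers and to one-hot vectors, and shows the usual three-dimensional representation of an image:

```python
import numpy as np

# Hypothetical category set for an image classification task
categories = ["cat", "dog", "bird"]

# Scalar encoding: assign each category a sequence number starting with 0
label_to_index = {name: i for i, name in enumerate(categories)}
print(label_to_index["dog"])          # 1

# One-hot encoding: each category becomes a distinct unit vector
num_classes = len(categories)
one_hot = np.eye(num_classes)[label_to_index["dog"]]
print(one_hot)                        # [0. 1. 0.]

# An RGB image is typically encoded as a three-dimensional matrix
# of shape (height, width, channels), e.g. a 32x32 color image:
image = np.zeros((32, 32, 3), dtype=np.uint8)
```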
Although the range of a regression model can be an infinite set, the output of a
classification model is usually a finite set. This is because the size of a dataset cannot
grow infinitely, and the number of categories in the dataset is at most equal to the number
of training samples; therefore, the number of categories cannot be infinite. When a
classification model is trained, a category set L usually needs to be manually specified so
that the model can select a category from it as output. The size of the category set L is
generally denoted as K, which indicates the number of possible categories.
purple objects indicate Virginica samples, and gray objects indicate unknown samples.
Assume that the output of the clustering algorithm, which has been introduced in
unsupervised learning, is shown in the gray dashed circle in the figure. Collect statistics
on the number of samples of each category in these circles, and use the category with
the largest number of samples as the cluster category. For example, the cluster in the
upper left corner belongs to Setosa, and the cluster in the upper right corner belongs to
Virginica. By combining unsupervised learning algorithms with a small amount of
supervision information, semi-supervised learning algorithms can achieve higher accuracy
at a lower labeling cost.
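A minimal sketch of this idea, assuming scikit-learn is available: cluster all samples with k-means, then assign each cluster the majority label among its few labeled members (the number of clusters and of labeled samples is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Pretend only a handful of samples are labeled (semi-supervised setting)
rng = np.random.default_rng(0)
labeled_idx = rng.choice(len(X), size=15, replace=False)

# Unsupervised step: cluster all samples, labeled and unlabeled alike
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised step: give each cluster the majority label among its labeled members
cluster_to_label = {}
for c in np.unique(clusters):
    members = [i for i in labeled_idx if clusters[i] == c]
    if members:
        labels, counts = np.unique(y[members], return_counts=True)
        cluster_to_label[c] = labels[np.argmax(counts)]

# Every unlabeled sample inherits the label of its cluster
predictions = np.array([cluster_to_label.get(c, -1) for c in clusters])
```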
80% of the total number of samples, and the test set accounts for 20%. In this example,
there are four samples in the training set and one sample in the test set.
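A sketch of such a split with scikit-learn; the 80/20 ratio matches the example above, and the data here is a placeholder:

```python
from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4], [5]]      # five samples, as in the example
y = [0, 0, 1, 1, 1]

# Hold out 20% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))   # 4 1
```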
Training and optimization: if the data is thoroughly cleaned, the model is less susceptible
to interference from abnormal data, which ensures a good training result.
filter method. Statistical measures commonly used in filter methods include the Pearson
correlation coefficient, the chi-square statistic, and mutual information. Because filter
methods do not consider the relationships between features, they tend to select redundant
variables.
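A short sketch of a filter method using mutual information as the statistical measure, assuming scikit-learn (the choice of k is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each feature independently against the target and keep the best two.
# Because each feature is scored on its own, relationships between features
# are ignored, so redundant features may survive the filter.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)       # per-feature mutual information with the label
print(X_selected.shape)       # (150, 2)
```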
The output of the model is the probability that the target is the true value. As we know,
model accuracy generally increases as the amount of training data increases. So why not
use all of the data for training instead of setting part of it aside as a test set? The reason is
that we care about the model's performance on unknown data, not on data it has already
seen. The training set can be understood as the question bank a student studies when
preparing for an exam. No matter how high the student's accuracy on the question bank is,
it proves little, because the question bank is always limited: as long as the student's
memory is good enough, all the answers can be memorized. Only an examination can
really check the student's mastery of the knowledge, because the questions that appear in
the examination have never been seen by the student. The test set is equivalent to a test
paper prepared by the researcher for the model. That is, in the entire dataset (including the
training set and test set), the model can read the features of both sets and the targets of the
training set, whereas the targets of the test set can only be used by the researcher to
evaluate the performance of the model.
get when you run the model on new samples (test set). Obviously, we prefer a model
with a smaller generalization error.
Once the form of a model is given, all possible functions constitute a space called the
hypothesis space. A machine learning algorithm searches the hypothesis space for a
suitable fitting function. If the mathematical model is too simple or the training time is too
short, the training error of the model will be large; this phenomenon is called underfitting.
In the former case, a more complex model needs to be used for retraining; in the latter
case, underfitting can be effectively alleviated by prolonging the training time. However,
accurately determining the cause of underfitting often requires experience and proper
methods. In contrast, overfitting refers to the phenomenon in which the training error of a
model is very small (because the model is complex) but the generalization capability is
weak, that is, the generalization error is relatively large. There are many ways to mitigate
overfitting. Common ones include appropriately simplifying the model, ending training
before overfitting occurs, and using the Dropout and Weight Decay methods. The
corresponding figure shows the underfitting, good fitting, and overfitting results for the
same dataset.
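The following sketch illustrates how comparing training error and test error can reveal underfitting and overfitting, by fitting polynomials of different degrees to noisy data (all data and parameters here are illustrative, using scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):          # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # generalization error
```

A large error for both sets suggests underfitting; a small training error together with a much larger test error suggests overfitting.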
Bias and variance are two components of the generalization error that we should pay
attention to. As shown in the corresponding figure, variance is the deviation of the
prediction results from their average value; it is the error caused by the model's sensitivity
to small fluctuations in the training set. Bias is the difference between the average
prediction and the correct value we are trying to predict. Irreducible errors are errors
caused by the imperfection of models and the finiteness of data. In theory, with an infinite
amount of data and a perfect model, this error could be eliminated; however, such a
situation does not exist in practice, so the generalization error can never be completely
eliminated.
diagonal of the table, while the values outside the diagonal are 0 or close to 0. Each
symbol in the binary-classification confusion matrix shown below is described as follows:
(1) P: positive, indicating the number of real positive cases in the data.
(2) N: negative, indicating the number of real negative cases other than P in the data.
(3) TP: true positive, indicating the number of positive cases that are correctly classified
by the classifier.
(4) TN: true negative, indicating the number of negative cases that are correctly classified
by the classifier.
(5) FP: false positive, indicating the number of negative cases that are incorrectly
classified as positive by the classifier.
(6) FN: false negative, indicating the number of positive cases that are incorrectly
classified as negative by the classifier.
Other concepts derived from the binary-classification confusion matrix are listed in the corresponding table.
the model can be calculated as follows: the precision is 140/160 = 87.5%, the recall is
140/170 ≈ 82.4%, and the accuracy is (140 + 10)/200 = 75%.
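For reference, assuming these figures correspond to TP = 140, FP = 20, FN = 30, and TN = 10, the F1 score (the harmonic mean of precision and recall) would be F1 = 2PR/(P + R) = 2 × 0.875 × 0.824/(0.875 + 0.824) ≈ 0.849.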
algorithms, the concept of validation set, and hyperparameter search and cross
validation.
hyperparameters, and to evaluate the performance of the model on the test set.
Common methods used to search for model hyperparameters include grid search,
random search, heuristic intelligent search, and Bayesian search.
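A minimal sketch of grid search combined with cross validation, assuming scikit-learn (the model and parameter grid are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to try exhaustively
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Each combination is evaluated with 5-fold cross validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # best hyperparameter combination found
print(search.best_score_)    # its mean validation accuracy
```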
accuracies on validation sets. The average value of the k classification accuracies can be
used as the performance indicator of the model generalization capability.
K-fold cross validation avoids the randomness of a single validation split, so the
validation result is more convincing. However, k-fold cross validation requires the
training of k models. If the dataset is large, the training time is long. Therefore, k-fold
cross validation is generally applicable to small datasets.
The value of k in k-fold cross validation is itself a hyperparameter, which needs to be
determined through experiments. In an extreme case, the value of k equals the number of
samples in the training set. This practice is called leave-one-out cross validation, in which
a single training sample is left out as the validation set during each round of training. The
training effect of leave-one-out cross validation is better, because almost all training
samples participate in each round of training. However, leave-one-out cross validation
takes even longer, so it only applies to very small datasets.
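A short sketch of k-fold cross validation with scikit-learn (here k = 5 and the model is illustrative); averaging the per-fold accuracies gives the generalization indicator described above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Split the data into k folds; each fold serves as the validation set once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

print(scores)            # the k validation accuracies
print(np.mean(scores))   # averaged performance indicator
```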
In the formula, argmax indicates that we seek the maximizer, that is, the hypothesis $h$
that maximizes the value of the target function. In the target function, $(\sqrt{2\pi}\sigma)^{-1}$ is a
constant irrelevant to $h$. Multiplying or dividing the target function by a constant does
not change the position of the maximum or minimum point. Therefore, the optimization
target of the model can be expressed as follows:
$$\underset{h}{\mathrm{argmax}} \ \prod_{i=1}^{m} \exp\left(-\frac{(h(x_i)-y_i)^2}{2\sigma^2}\right)$$
Because the logarithmic function is monotonic, taking the logarithm of the target function
does not affect the positions of its maximum and minimum points.
$$\underset{h}{\mathrm{argmax}} \ \ln\left(\prod_{i=1}^{m} \exp\left(-\frac{(h(x_i)-y_i)^2}{2\sigma^2}\right)\right) = \underset{h}{\mathrm{argmax}} \ \sum_{i=1}^{m} -\frac{(h(x_i)-y_i)^2}{2\sigma^2}$$
Negating the target function (multiplying it by −1) turns the original maximum point into
a minimum point. In addition, we can multiply the target function by the constant
$\sigma^2/m$ to convert the optimization target of the model into:
$$\underset{h}{\mathrm{argmin}} \ \frac{1}{2m}\sum_{i=1}^{m}\bigl(h(x_i)-y_i\bigr)^2$$
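A minimal NumPy sketch of minimizing this target with batch gradient descent for a univariate hypothesis h(x) = wx + b; the synthetic data, learning rate, and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 5.0 + rng.normal(scale=1.0, size=100)   # noisy linear data

w, b = 0.0, 0.0
lr = 0.01

for _ in range(2000):
    residual = (w * x + b) - y            # h(x_i) - y_i
    # Gradients of J = (1/2m) * sum((h(x_i) - y_i)^2)
    grad_w = np.mean(residual * x)
    grad_b = np.mean(residual)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # should approach 3.0 and 5.0
```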
obvious underfitting occurs if the original linear regression model is used. The solution is
to use polynomial regression, as shown in Figure 1-32.
$$h(x) = w_1 x + w_2 x^2 + \cdots + w_n x^n + b$$
In the formula, n indicates the degree of the polynomial regression.
Because the polynomial degree is a hyperparameter, overfitting may occur if the degree is
chosen improperly. Applying regularization helps reduce overfitting. The most common
regularization method is to add the squared L2 norm of the weights to the target function.
$$J(w) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h(x_i)-y_i\bigr)^2 + \lambda\|w\|_2^2$$
In the formula, $\|\cdot\|_2$ denotes the L2 norm, so $\lambda\|w\|_2^2$ is an L2
regularization term. A linear regression model using this loss function is also known as a
Ridge regression model. Similarly, a linear regression model whose regularization term
uses the absolute values of the weights (the L1 norm) is called a Lasso regression model.
$$J(w) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h(x_i)-y_i\bigr)^2 + \lambda\|w\|_1$$
In the formula, ‖⋅‖1 indicates an L1 regular term.
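A brief scikit-learn sketch combining polynomial features with Ridge (L2) and Lasso (L1) regularization; the data, degree, and λ (called alpha in scikit-learn) are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=80)

# High-degree polynomial regression with an L2 penalty (Ridge)
ridge = make_pipeline(PolynomialFeatures(degree=10), StandardScaler(),
                      Ridge(alpha=1.0))
ridge.fit(X, y)

# The same features with an L1 penalty (Lasso) drives many weights to zero
lasso = make_pipeline(PolynomialFeatures(degree=10), StandardScaler(),
                      Lasso(alpha=0.01, max_iter=50000))
lasso.fit(X, y)

print(ridge[-1].coef_)
print(lasso[-1].coef_)    # typically sparse compared with Ridge
```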
Similar to the derivation process of linear regression, the logarithm of the target function
can be taken without changing the position of the maximum value point. Therefore, the
optimization target of the model is equivalent to:
$$\underset{h}{\mathrm{argmax}} \ \sum_{i=1}^{m}\Bigl(y_i \ln h(x_i) + (1-y_i)\ln\bigl(1-h(x_i)\bigr)\Bigr)$$
Multiplying the target function by the constant -1/m will cause the original maximum
value point to become the minimum value point, that is:
$$\underset{h}{\mathrm{argmin}} \ -\frac{1}{m}\sum_{i=1}^{m}\Bigl(y_i \ln h(x_i) + (1-y_i)\ln\bigl(1-h(x_i)\bigr)\Bigr)$$
$$H(X) = -\sum_{k=1}^{K} p_k \log_2 p_k$$

$$\mathrm{Gini} = 1 - \sum_{k=1}^{K} p_k^2$$
In the formulas, $p_k$ indicates the probability that a sample belongs to category k, and K
indicates the total number of categories. The larger the increase in purity before and after
splitting on a certain feature, the more that feature improves the model's accuracy, and
therefore the feature should be added to the decision tree model.
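The two purity measures above can be computed directly from class probabilities; the following small sketch does this for an illustrative node:

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_k p_k * log2(p_k), ignoring zero probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini = 1 - sum_k p_k^2."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# A node containing 40% of one class and 60% of another (illustrative)
probs = [0.4, 0.6]
print(entropy(probs))   # about 0.971 bits
print(gini(probs))      # 0.48
```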
Generally, the decision tree construction process can be divided into the following three
phases:
(1) Feature selection: Select a feature from the features of the training data as the split
standard of the current node. (Different standards generate different decision tree
algorithms.)
(2) Decision tree generation: Generate subnodes from the top down based on the selected
feature, and stop when the dataset can no longer be split.
(3) Pruning: Reduce the tree size and optimize its node structure to restrain overfitting of
the model. Pruning can be classified into pre-pruning and post-pruning.
The figure below shows an example of classification using a decision tree model. The
classification result is affected by the Refund, Marital Status, and Taxable Income
attributes. From this example, we can see that a decision tree model can handle not only
attributes with two values, but also attributes with multiple or even continuous values. In
addition, a decision tree model is interpretable: we can intuitively analyze the importance
relationships between attributes based on the structure diagram on the right of the figure.
1.5.5 SVMs
An SVM is a linear classifier defined in the feature space with the largest margin. By
means of the kernel trick, an SVM can in essence be turned into a nonlinear classifier.
SVM learning amounts to solving a convex quadratic programming problem. In general,
the main ideas of SVM include two points:
(1) Based on the structural risk minimization principle, an optimal hyperplane is
constructed in the feature space so that the learner is globally optimized and the expected
risk over the whole sample space satisfies an upper bound with a certain probability.
(2) In the case of linear inseparability, a nonlinear mapping is used to transform the
linearly inseparable samples of the low-dimensional input space into a high-dimensional
feature space in which they become linearly separable. A linear algorithm can then be
used to analyze the nonlinear features of the samples.
Straight lines are used to divide the data into different categories. In fact, many different
straight lines can divide the data, as shown in the corresponding figure. The core idea of
SVM is to find a line that meets the preceding conditions and keeps the points closest to
the line away from the
line as far as possible. This gives the model a strong generalization capability. These
points closest to the straight line are called support vectors.
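A compact scikit-learn sketch of the two ideas above: a linear (maximum-margin) SVM for separable data, and a kernel SVM for data that is not linearly separable (the datasets, the RBF kernel, and the parameters are illustrative):

```python
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Linearly separable data: a linear maximum-margin classifier suffices
X_lin, y_lin = make_blobs(n_samples=100, centers=2, random_state=0)
linear_svm = SVC(kernel="linear").fit(X_lin, y_lin)
print(len(linear_svm.support_vectors_))   # the points that define the margin

# Linearly inseparable data: the kernel trick implicitly maps it to a
# higher-dimensional feature space where it becomes separable
X_circ, y_circ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_circ, y_circ)
print(rbf_svm.score(X_circ, y_circ))
```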
1.5.6 KNN
The K-nearest neighbor (KNN) classification algorithm is a theoretically mature method
and one of the simplest machine learning algorithms. KNN is a non-parametric method,
which usually works well in datasets with irregular decision boundaries. According to this
method, if the majority of the K samples most similar to a given sample (its nearest
neighbors in the feature space) belong to a specific category, then this sample also belongs
to that category.
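A minimal KNN sketch with scikit-learn (K = 5 is illustrative); each prediction is the majority class among the K nearest neighbors in the feature space:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test sample is assigned the majority class of its 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```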
$$\prod_{i=1}^{n} P(X_i \mid C = c)$$
By applying the feature independence assumption, we can show that:
$$P(X \mid C = c) = \prod_{i=1}^{n} P(X_i \mid C = c)$$
The feature independence assumption states that, conditioned on the class of a sample, the
distribution of each attribute value is independent of the distributions of the other attribute
values. Naive Bayes is "naive" precisely because its model relies on this feature
independence assumption. Making this assumption effectively simplifies computation and
gives the Naive Bayes classifier high accuracy and training speed on large datasets.
For example, suppose we want to determine a person's gender C based on the height 𝑋1
and weight 𝑋2. Suppose that, for men, the probabilities of a height of 180 centimeters and
of 150 centimeters are 80% and 20% respectively, and the probabilities of a weight of
80 kilograms and of 50 kilograms are 70% and 30% respectively. According to the Naive
Bayesian model, the probability that a person with a height of 180 centimeters and a
weight of 50 kilograms is male is proportional to 0.8 × 0.3 = 0.24, while for a person with
a height of 150 centimeters and a weight of 80 kilograms it is proportional to only
0.2 × 0.7 = 0.14. Here the two features of height and weight are assumed to contribute
independently to the probability that a person is male.
The performance of the Naive Bayesian model usually depends on how well the feature
independence assumption holds. In the preceding example, the two features of height and
weight are not completely independent, and this correlation inevitably affects the accuracy
of the model. However, as long as the correlation is not strong, we can continue to use the
Naive Bayesian model. In actual applications, different features are seldom completely
independent of each other.
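The height/weight example can be reproduced in a few lines; the conditional probabilities below are the ones given above and are assumed rather than learned from data:

```python
# Conditional probabilities from the example: P(feature value | male)
p_height_given_male = {180: 0.8, 150: 0.2}
p_weight_given_male = {80: 0.7, 50: 0.3}

def male_likelihood(height, weight):
    """Naive Bayes likelihood P(X | male) under the feature independence assumption."""
    return p_height_given_male[height] * p_weight_given_male[weight]

print(male_likelihood(180, 50))   # 0.8 * 0.3 = 0.24
print(male_likelihood(150, 80))   # 0.2 * 0.7 = 0.14
```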
In the training process of the decision trees, sampling is performed at both the sample
level and the feature level. At the sample level, the sample subset used to train each
decision tree is drawn by Bootstrap sampling (sampling with replacement). At the feature
level, a random subset of features is selected for calculating the information gain before
each node of a decision tree is split. By aggregating the prediction results of multiple
decision trees, the random forest model can reduce the variance of a single decision tree
model, but it cannot effectively correct the bias. Therefore, the random forest model
requires that no decision tree be underfitted, even though this requirement may cause
some decision trees to overfit. In addition, the decision tree models in a random forest are
independent of each other, so the training and prediction processes can be executed in
parallel.
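A short scikit-learn sketch of these ideas: bootstrap sampling and per-split feature sampling are handled internally, the individual trees are grown deep so that no single tree underfits, and training runs in parallel (the parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,       # number of decision trees to aggregate
    max_features="sqrt",    # random feature subset considered at each split
    bootstrap=True,         # sample-level Bootstrap (with replacement) sampling
    max_depth=None,         # grow each tree fully so no single tree underfits
    n_jobs=-1,              # trees are independent, so train them in parallel
    random_state=0,
)

print(cross_val_score(forest, X, y, cv=5).mean())
```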
1.7 Summary
This chapter first describes the definition and classification of machine learning, as well
as problems machine learning solves. Then, it introduces key knowledge points of
machine learning, including the overall procedure (data collection, data cleansing,
feature selection, model training, model evaluation, and model deployment), common
algorithms (including linear regression, logistic regression, decision tree, SVM, Naive
Bayes, KNN, ensemble learning, and K-means), gradient descent algorithms, and
hyperparameters. Finally, a complete machine learning process is demonstrated through
the case of using linear regression to predict house prices.
1.8 Quiz
1. Machine learning is the core technology of AI. Please define machine learning.
2. The generalization error of a model can be divided into variance, bias, and
irreducible error. What is the difference between variance and bias? What are the
characteristics of variance and bias of an overfitting model?
3. Please calculate the value of F1 for the confusion matrix shown in the corresponding figure.
4. In machine learning, a dataset is generally divided into the training set, validation
set, and test set. What is the difference between the validation set and test set? Why
do we need to introduce the validation set?
5. Linear regression models use linear functions to fit data. How does a linear
regression model process non-linear data?
6. Many classification models can only deal with binary-classification problems. Try to
provide a method, using SVM as an example, to deal with multiclass classification
problems.
7. How does the Gaussian kernel function in the SVM map a feature to an infinite
dimensional space?
8. Is gradient descent the only way to train a model? What are the limitations of this
algorithm?