Machine Learning - Lecture 16 (Student)
⚫ Many in the field believe that the major reason for these successes is the
availability of ever-larger training datasets, made possible by
the wide-scale use of digitization in science and industry.
⚫ In this chapter we discuss the basics of neural networks and deep learning, and
then go into some of the specializations for specific problems, such as
convolutional neural networks (CNNs) for image classification,
and recurrent neural networks (RNNs)
for time series and other sequences.
⚫ In the terminology of neural networks, the four features X1,...,X4 make up the
units in the input layer. The arrows indicate that each of the
inputs from the input layer feeds into each of the K hidden units
(we get to pick K; here we chose K = 5).
⚫ The neural network model has $K$ hidden units and takes the form
$$f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k h_k(X), \qquad (10.1)$$
where the activations in the hidden layer are
$$h_k(X) = g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\Big) \qquad (10.2)$$
for a nonlinear activation function $g$.
⚫ The model (10.1) is then a linear regression model in the K = 5 activations. All the parameters
$\beta_0,\ldots,\beta_K$ and $w_{10},\ldots,w_{Kp}$ need to be estimated from data.
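Below is a minimal NumPy sketch of the model (10.1)-(10.2) with p = 4 inputs and K = 5 hidden units; the weights here are random placeholders standing in for parameters that would be estimated from data.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 4, 5
W = rng.normal(size=(K, p))    # w_kj, j = 1,...,p
w0 = rng.normal(size=K)        # w_k0
beta = rng.normal(size=K)      # beta_k
beta0 = rng.normal()           # beta_0

def g(z):
    return np.maximum(z, 0.0)  # ReLU activation

def f(x):
    h = g(w0 + W @ x)          # h_k(X) = g(w_k0 + sum_j w_kj X_j)   (10.2)
    return beta0 + beta @ h    # f(X) = beta_0 + sum_k beta_k h_k(X) (10.1)

print(f(rng.normal(size=p)))   # evaluate on one feature vector X1,...,X4
```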
⚫ In the early instances of neural networks, the sigmoid activation
function $g(z) = \frac{e^z}{1+e^z}$ was favored; the preferred choice in modern neural networks is the ReLU (rectified linear unit) activation $g(z) = (z)_+$.
A ReLU activation can be computed and stored more efficiently than a sigmoid
activation. Although it thresholds at zero, because we apply it to a linear
function (10.2) the constant term $w_{k0}$ will shift this inflection point.
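A small sketch of the two activation functions; note that relu is exactly zero below its threshold, and adding a constant before applying it shifts the inflection point, as described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # g(z) = e^z / (1 + e^z)

def relu(z):
    return np.maximum(z, 0.0)         # g(z) = (z)_+

z = np.linspace(-4.0, 4.0, 9)
print(sigmoid(z))       # smooth, strictly between 0 and 1
print(relu(z))          # exactly zero for z < 0
print(relu(z + 1.5))    # a constant term shifts the inflection point
```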
⚫ The parameters are fit by minimizing the squared-error loss $\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2$; details about how to perform this minimization are provided in Section 10.7.
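As an illustration only (the book's actual procedure is covered in Section 10.7), here is a toy full-batch gradient-descent fit of the single-layer model to simulated data; the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 200, 4, 5
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=n)   # simulated response

W = rng.normal(size=(K, p)) * 0.5   # w_kj
w0 = np.zeros(K)                    # w_k0
beta = rng.normal(size=K) * 0.5     # beta_k
beta0 = 0.0                         # beta_0
lr = 0.05

for _ in range(3000):
    Z = X @ W.T + w0                 # pre-activations, shape (n, K)
    H = np.maximum(Z, 0.0)           # ReLU activations h_k(x_i)
    resid = (beta0 + H @ beta) - y   # f(x_i) - y_i
    # Chain rule through the ReLU; its derivative is the 0/1 mask Z > 0.
    G = (resid[:, None] * (Z > 0)) * beta[None, :]
    beta0 -= lr * 2 * resid.mean()
    beta -= lr * 2 * (H.T @ resid) / n
    w0 -= lr * 2 * G.mean(axis=0)
    W -= lr * 2 * (G.T @ X) / n

print(np.mean(resid ** 2))           # training MSE after descent
```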
⚫ The multilayer network for the MNIST digit data differs from the single-layer network in several ways:
➢ It has two hidden layers L1 (256 units) and L2 (128 units) rather than
one. Later we will see a network with seven hidden layers.
➢ It has ten output variables, rather than one. In this case the ten variables
really represent a single qualitative variable and so are quite dependent.
➢ The loss function used for training the network is tailored for the
multiclass classification task.
⚫ The first hidden layer is as in (10.2), with
$$A_k^{(1)} = h_k^{(1)}(X) = g\Big(w_{k0}^{(1)} + \sum_{j=1}^{p} w_{kj}^{(1)} X_j\Big) \qquad (10.10)$$
for $k = 1,\ldots,K_1$. The second hidden layer treats the activations $A_k^{(1)}$ of the first
hidden layer as inputs and computes new activations
$$A_\ell^{(2)} = h_\ell^{(2)}(X) = g\Big(w_{\ell 0}^{(2)} + \sum_{k=1}^{K_1} w_{\ell k}^{(2)} A_k^{(1)}\Big) \qquad (10.11)$$
for $\ell = 1,\ldots,K_2$.
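A minimal NumPy sketch of this two-layer forward pass, with random placeholder weights and the MNIST dimensions p = 784, K1 = 256, K2 = 128:

```python
import numpy as np

rng = np.random.default_rng(0)
p, K1, K2 = 784, 256, 128
W1 = rng.normal(size=(K1, p)) * 0.01   # w_kj^(1)
b1 = np.zeros(K1)                      # w_k0^(1)
W2 = rng.normal(size=(K2, K1)) * 0.01  # w_lk^(2)
b2 = np.zeros(K2)                      # w_l0^(2)

def g(z):
    return np.maximum(z, 0.0)

x = rng.normal(size=p)      # one flattened 28 x 28 image
A1 = g(b1 + W1 @ x)         # A_k^(1), k = 1,...,K1   (10.10)
A2 = g(b2 + W2 @ A1)        # A_l^(2), l = 1,...,K2   (10.11)
print(A1.shape, A2.shape)   # (256,), (128,)
```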
⚫ We have introduced additional superscript notation such as $h_\ell^{(2)}(X)$ and $w_{\ell j}^{(2)}$ in (10.10) and (10.11) to indicate to which layer the activations and weights belong.
⚫ The output layer first computes ten linear combinations of the second-layer activations,
$$Z_m = \beta_{m0} + \sum_{\ell=1}^{K_2} \beta_{m\ell} A_\ell^{(2)}, \qquad m = 0,1,\ldots,9. \qquad (10.12)$$
If these were all separate quantitative responses, we would simply set each
$f_m(X) = Z_m$ and be done. However, we would like our estimates to represent
class probabilities $f_m(X) = \Pr(Y = m \mid X)$, just like in multinomial
logistic regression in Section 4.3.5. So we use the special softmax activation
function (see (4.13) on page 141),
$$f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{\ell=0}^{9} e^{Z_\ell}}. \qquad (10.13)$$
⚫ This ensures that the ten estimates behave like probabilities (non-negative and sum to one). Even though the goal is to build a
classifier, our model actually estimates a probability for each of the 10
classes. The classifier then assigns the image to the class with the highest estimated probability.
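A minimal sketch of the softmax computation; subtracting max(Z) before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(Z):
    e = np.exp(Z - Z.max())   # stabilized; same result as e^{Z_m}/sum e^{Z_l}
    return e / e.sum()

Z = np.array([1.2, -0.3, 0.5, 2.0, 0.0, -1.1, 0.7, 0.3, -0.5, 1.5])
probs = softmax(Z)
print(probs.sum())      # 1.0 -- the ten estimates behave like probabilities
print(probs.argmax())   # predicted class: the one with highest probability
```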
⚫ To train this network, since the response is qualitative, we look for coefficient
estimates that minimize the negative multinomial log-likelihood
$$-\sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log f_m(x_i), \qquad (10.14)$$
also known as the cross-entropy. Here $y_{im}$ equals 1 if the true class of image $i$ is $m$, and 0 otherwise.
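A minimal sketch of (10.14) for a single image: since $y_{im}$ is one-hot, only the log-probability of the true class contributes to the sum.

```python
import numpy as np

probs = np.array([0.05, 0.01, 0.02, 0.70, 0.02,
                  0.05, 0.03, 0.04, 0.03, 0.05])  # f_m(x_i), m = 0,...,9
y = np.zeros(10)
y[3] = 1.0                                        # true class is m = 3

loss = -np.sum(y * np.log(probs))                 # cross-entropy for one image
print(loss)                                       # equals -log(0.70)
```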
Computer Session
⚫ A Single Layer Network on the Hitters Data
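A minimal Keras sketch of this session, assuming TensorFlow is installed and the Hitters data (from the ISLR package) has been exported to a local Hitters.csv with player names as row labels; the 50-unit layer and 0.4 dropout follow the book's lab, while the split and epoch count are illustrative choices.

```python
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers

# Salary contains missing values in Hitters, so drop those rows first.
hitters = pd.read_csv("Hitters.csv", index_col=0).dropna()
y = hitters["Salary"].to_numpy().astype("float32")
X = pd.get_dummies(hitters.drop(columns=["Salary"]), drop_first=True)
X = ((X - X.mean()) / X.std()).to_numpy().astype("float32")  # standardize

# Hold out one third of the observations for testing.
rng = np.random.default_rng(0)
test = rng.choice(len(y), size=len(y) // 3, replace=False)
train = np.setdiff1d(np.arange(len(y)), test)

# One hidden layer with K = 50 ReLU units, plus dropout regularization.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(X.shape[1],)),
    layers.Dense(50, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(1),
])
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
model.fit(X[train], y[train], epochs=300, batch_size=32, verbose=0)

# Loss and mean absolute prediction error on the held-out players.
print(model.evaluate(X[test], y[test], verbose=0))
```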
⚫ A Multilayer Network on the MNIST Digit Data
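A minimal Keras sketch of this session, assuming TensorFlow is installed; it mirrors the 256/128-unit architecture and cross-entropy loss described above, with illustrative dropout rates and epoch count.

```python
import tensorflow as tf
from tensorflow.keras import layers

# MNIST: 60,000 training and 10,000 test images of 28 x 28 gray-scale pixels.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# Two hidden layers (L1 = 256 units, L2 = 128 units) and a 10-class
# softmax output layer; the dropout rates here are illustrative.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),
])

# Minimizing the negative multinomial log-likelihood (10.14) is exactly
# the (sparse) categorical cross-entropy loss on integer class labels.
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=30, batch_size=128,
          validation_split=0.2, verbose=0)
print(model.evaluate(x_test, y_test, verbose=0))  # test loss and accuracy
```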