Machine Learning - Lecture 16 (Student)

The document discusses the evolution and fundamentals of deep learning, particularly focusing on neural networks, which gained popularity after 2010 due to advancements in architecture and the availability of large datasets. It explains single-layer and multilayer neural networks, detailing their structure, activation functions, and training processes, including the use of regularization techniques. The document also highlights the performance improvements of neural networks over traditional models in tasks such as digit classification.

Lecture 16: Deep Learning

⚫ The cornerstone of deep learning is the neural network.
⚫ Neural networks rose to fame in the late 1980s. Then along came SVMs,
boosting, and random forests, and neural networks fell somewhat from favor.
Part of the reason was that neural networks required a lot of tinkering, while
the new methods were more automatic.
⚫ Neural networks resurfaced after 2010 with the new name deep learning, with
new architectures, additional bells and whistles, and a string of success stories
on some niche problems such as image and video classification and speech and
text modeling.

⚫ Many in the field believe that the major reason for these successes is the
availability of ever-larger training datasets, made possible by
the wide-scale use of digitization in science and industry.
⚫ In this chapter we discuss the basics of neural networks and deep learning, and
then go into some of the specializations for specific problems, such as
convolutional neural networks (CNNs) for image classification,
and recurrent neural networks (RNNs) for time series and other
sequences.

10.1 Single Layer Neural Networks


⚫ A neural network takes an input vector of p variables X = (X1, X2,...,Xp) and
builds a nonlinear function f(X) to predict the response Y .
⚫ Figure 10.1 shows a simple feed-forward neural network for modeling a
quantitative response using p = 4 predictors.

⚫ In the terminology of neural networks, the four features X1,...,X4 make up the
units in the input layer. The arrows indicate that each of the
inputs from the input layer feeds into each of the K hidden units
(we get to pick K; here we chose 5).
⚫ The neural network model has the form

   f(X) = β0 + Σ_{k=1}^{K} βk hk(X) = β0 + Σ_{k=1}^{K} βk g(wk0 + Σ_{j=1}^{p} wkj Xj).   (10.1)

⚫ It is built up here in two steps. First the K activations Ak, k = 1, . . . , K, in the
hidden layer are computed as functions of the input features X1,...,Xp,

   Ak = hk(X) = g(wk0 + Σ_{j=1}^{p} wkj Xj),   (10.2)

where g(z) is a nonlinear activation function that is specified in advance. These K
activations from the hidden layer then feed into the output layer, resulting in

   f(X) = β0 + Σ_{k=1}^{K} βk Ak,

a linear regression model in the K = 5 activations. All the parameters β0,...,βK
and w10,...,wKp need to be estimated from data.
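As a concrete illustration (not from the text), here is a minimal NumPy sketch of the
forward pass in (10.1) and (10.2), assuming g is the ReLU activation and using random
placeholder weights:

import numpy as np

def relu(z):
    # g(z) = max(0, z), applied elementwise
    return np.maximum(0, z)

def single_layer_forward(x, W, w0, beta, beta0):
    # x:     input vector of length p
    # W:     K x p matrix of hidden-layer weights w_kj
    # w0:    length-K vector of hidden-layer intercepts w_k0
    # beta:  length-K vector of output weights beta_k
    # beta0: output intercept beta_0
    A = relu(w0 + W @ x)          # K activations, equation (10.2)
    return beta0 + beta @ A       # f(X), the output of the model (10.1)

# Example with p = 4 predictors and K = 5 hidden units (random weights)
rng = np.random.default_rng(0)
x = rng.normal(size=4)
f = single_layer_forward(x, rng.normal(size=(5, 4)), rng.normal(size=5),
                         rng.normal(size=5), 0.0)
print(f)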
⚫ In the early instances of neural networks, the sigmoid activation
function was favored,

   g(z) = e^z / (1 + e^z) = 1 / (1 + e^{-z}),

which is the same function used in logistic regression to convert a linear
function into probabilities between zero and one (see Figure 10.2).
⚫ The preferred choice in modern neural networks is the ReLU (rectified linear unit)
activation function, which takes the form

   g(z) = (z)+ = { 0 if z < 0;  z otherwise }.
A ReLU activation can be computed and stored more efficiently than a sigmoid
activation. Although it thresholds at zero, because we apply it to a linear
function (10.2) the constant term will shift this inflection point.
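For a quick numerical sense of the two activation functions (illustrative values only):

import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
sigmoid = 1 / (1 + np.exp(-z))   # squashes any input into (0, 1)
relu = np.maximum(0, z)          # zero below the threshold, identity above it
print(np.round(sigmoid, 3))      # approximately [0.119 0.378 0.5 0.622 0.881]
print(relu)                      # [0. 0. 0. 0.5 2.]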

⚫ Fitting a neural network requires estimating the unknown parameters in (10.1).
For a quantitative response, typically squared-error loss is used, so
that the parameters are chosen to minimize

   Σ_{i=1}^{n} (yi - f(xi))^2.
⚫ Details about how to perform this minimization are provided in Section 10.7.

10.2 Multilayer Neural Networks


⚫ Modern neural networks typically have more than one hidden layer,
and often many units per layer.
⚫ In theory a single hidden layer with a large number of units has the ability to
approximate most functions. However, the learning task of discovering a good
solution is made much easier with multiple layers, each of
modest size.
⚫ We will illustrate a large dense network on the famous and
publicly available MNIST handwritten digit dataset. Figure 10.3 shows
examples of these digits.
⚫ The idea is to build a model to classify the images into their correct
digit class 0-9. Every image has p = 28 × 28 = 784 pixels, each of which is an
eight-bit grayscale value between 0 and 255 representing the relative amount
of the written digit in that tiny square. These pixels are stored in the input
vector X (in, say, column order). The output is the class label, represented by a
vector Y = (Y0, Y1,...,Y9) of 10 dummy variables, with a one in the position
corresponding to the label, and zeros elsewhere. In the machine learning
community, this is known as one-hot encoding. There are 60,000
training images, and 10,000 test images.
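For illustration (not part of the text), the conversion from an integer class label to
the 10-dimensional dummy vector can be written as:

import numpy as np

def one_hot(labels, num_classes=10):
    # Each row is a length-10 dummy vector with a single one in the label's position
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1
    return Y

print(one_hot(np.array([3, 0, 9])))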

⚫ Figure 10.4 shows a multilayer network architecture that works well
for solving the digit-classification task. It differs from Figure 10.1 in
several ways:

➢ It has two hidden layers L1 (256 units) and L2 (128 units) rather than
one. Later we will see a network with seven hidden layers.
➢ It has ten output variables, rather than one. In this case the variables
really represent a single qualitative variable and so are quite dependent.
➢ The loss function used for training the network is tailored for the
multiclass classification task.
⚫ The first hidden layer is as in (10.2), with

   Ak^(1) = hk^(1)(X) = g(wk0^(1) + Σ_{j=1}^{p} wkj^(1) Xj)   (10.10)

for k = 1,...,K1. The second hidden layer treats the activations Ak^(1) of the first
hidden layer as inputs and computes new activations

   Aℓ^(2) = hℓ^(2)(X) = g(wℓ0^(2) + Σ_{k=1}^{K1} wℓk^(2) Ak^(1))   (10.11)

for ℓ = 1,...,K2.
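An illustrative NumPy sketch of how the two hidden layers chain together; the shapes
follow Figure 10.4, and the random weights are placeholders:

import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(0, z)

X = rng.random(784)                                # one flattened 28 x 28 image
W1, b1 = rng.normal(size=(256, 784)), np.zeros(256)
W2, b2 = rng.normal(size=(128, 256)), np.zeros(128)

A1 = relu(b1 + W1 @ X)     # first-layer activations, as in (10.10)
A2 = relu(b2 + W2 @ A1)    # second-layer activations, as in (10.11)
print(A1.shape, A2.shape)  # (256,) (128,)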

⚫ We have introduced additional superscript notation such as hℓ^(2)(X) and wℓj^(2)
in (10.10) and (10.11) to indicate to which layer the activations and
weights belong, in this case layer 2. The notation W1 in Figure
10.4 represents the entire matrix of weights that feed from the input layer to
the first hidden layer L1. This matrix will have 785 × 256 = 200,960 elements;
there are 785 rather than 784 because we must account for the intercept or bias
term. Each element Ak^(1) feeds to the second hidden layer L2 via the matrix of
weights W2 of dimension 257 × 128 = 32,896.
Note:

⚫ We now get to the output layer, where we now have ten responses rather
than one. The first step is to compute ten different linear models similar to our
single model (10.1),

   Zm = βm0 + Σ_{ℓ=1}^{K2} βmℓ Aℓ^(2)

for m = 0, 1,..., 9. The matrix B stores all 129 × 10 = 1,290 of these weights.


Note:
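These weight counts can be checked directly; the "+ 1" terms account for the
intercept/bias units (illustrative arithmetic only):

W1 = (784 + 1) * 256   # input layer -> L1: 200,960 weights
W2 = (256 + 1) * 128   # L1 -> L2:           32,896 weights
B  = (128 + 1) * 10    # L2 -> output layer:  1,290 weights
print(W1, W2, B, W1 + W2 + B)   # 200960 32896 1290 235146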

⚫ If these were all separate quantitative responses, we would simply set each
fm(X) = Zm and be done. However, we would like our estimates to represent
class probabilities fm(X) = Pr(Y = m | X), just like in multinomial
logistic regression in Section 4.3.5. So we use the special softmax activation
function (see (4.13) on page 141),

   fm(X) = Pr(Y = m | X) = e^{Zm} / Σ_{ℓ=0}^{9} e^{Zℓ}.

This ensures that the 10 numbers behave like probabilities
(they are non-negative and sum to one). Even though the goal is to build a
classifier, our model actually estimates a probability for each of the 10
classes. The classifier then assigns the image to the class with the highest
estimated probability.
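A small numerical illustration of the softmax step (the Z values below are made up):

import numpy as np

Z = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 0.0, -0.3, 1.5, 0.2, -2.0])  # ten Z_m values
probs = np.exp(Z) / np.exp(Z).sum()   # softmax: non-negative and sums to one
print(probs.sum())                    # 1.0
print(probs.argmax())                 # class with the highest estimated probability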
⚫ To train this network, since the response is qualitative, we look for coefficient
estimates that minimize the negative multinomial log-likelihood

   - Σ_{i=1}^{n} Σ_{m=0}^{9} yim log(fm(xi)),

also known as the cross-entropy. Details on how to minimize this cross-
entropy objective are given in Section 10.7.
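A direct (illustrative) computation of this objective for two made-up training examples:

import numpy as np

# y_onehot: n x 10 one-hot labels; probs: n x 10 predicted class probabilities
y_onehot = np.array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
                     [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
probs = np.full((2, 10), 0.05)
probs[0, 3] = 0.55   # most predicted mass on the true class of example 1
probs[1, 0] = 0.55   # and of example 2
cross_entropy = -np.sum(y_onehot * np.log(probs))
print(cross_entropy)   # -(log 0.55 + log 0.55), approximately 1.196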
⚫ Table 10.1 compares the test performance of the neural network with two
simple models presented in Chapter 4 that make use of linear decision
boundaries: multinomial logistic regression and linear discriminant analysis.
The improvement of neural networks over both of these linear methods is
dramatic (see Table 10.1).

⚫ Adding the number of coefficients in W1, W2 and B, we get 235,146 in
all, more than 33 times the number 785 × 9 = 7,065 needed for multinomial
logistic regression. Recall that there are 60,000 images in the training set.
⚫ While this might seem like a large training set, there are almost four times as
many coefficients in the neural network model as there are observations in the
training set! To avoid overfitting, some regularization is needed. In this
example, we used two forms of regularization: ridge regularization, which is
similar to ridge regression from Chapter 6, and dropout regularization.
We discuss both forms of regularization in Section 10.7.
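As an informal illustration (not the text's notation) of what each form of
regularization does during training:

import numpy as np

def ridge_penalized_loss(loss, weights, lam=1e-4):
    # Ridge regularization: add lambda times the sum of squared weights to the objective
    return loss + lam * sum(np.sum(W ** 2) for W in weights)

def dropout(activations, rate=0.4, rng=np.random.default_rng(0)):
    # Dropout regularization: randomly zero a fraction `rate` of the activations
    # at training time, scaling the survivors so the expected activation is unchanged
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)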

Computer Session
⚫ A Single Layer Network on the Hitters Data
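A minimal sketch of what this lab could look like using the Keras API; the textbook's
lab may use a different framework, and the file path, hidden-layer size, and training
settings below are assumptions:

import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers

# Load and prepare the Hitters data (path and preprocessing are assumptions)
hitters = pd.read_csv("Hitters.csv").dropna()
y = hitters["Salary"].to_numpy()
X = pd.get_dummies(hitters.drop(columns=["Salary"]), drop_first=True).to_numpy(dtype="float32")
X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize the predictors

# Single hidden layer with ReLU activations and squared-error loss, as in (10.1)-(10.3)
model = tf.keras.Sequential([
    layers.Input(shape=(X.shape[1],)),
    layers.Dense(50, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(1),
])
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
model.fit(X, y, epochs=100, batch_size=32, validation_split=0.33, verbose=0)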

⚫ A Multilayer Network on the MNIST Digit Data
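A minimal sketch of the Figure 10.4 network using the Keras API, with ridge-style (L2)
weight penalties and dropout as the two forms of regularization discussed above; the
hyperparameter values are assumptions:

import tensorflow as tf
from tensorflow.keras import layers, regularizers

# MNIST: 60,000 training and 10,000 test images of 28 x 28 = 784 pixels
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

model = tf.keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(256, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.4),
    layers.Dense(128, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),   # softmax output -> class probabilities
])

# Cross-entropy objective; labels are integers 0-9, so use the sparse variant
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.1, verbose=0)
print(model.evaluate(x_test, y_test, verbose=0))   # [test loss, test accuracy]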

