
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki1 Wagner Meira Jr.2

1
Department of Computer Science
Rensselaer Polytechnic Institute, Troy, NY, USA
2
Department of Computer Science
Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 25: Neural Networks

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 1 / 59
Artificial Neural Networks

Artificial neural networks, or simply neural networks, are inspired by biological neuronal networks.
A real biological neuron, or a nerve cell, comprises dendrites, a cell body, and an
axon that leads to synaptic terminals. A neuron transmits information via
electrochemical signals.
When there is enough concentration of ions at the dendrites of a neuron it
generates an electric pulse along its axon called an action potential, which in turn
activates the synaptic terminals, releasing more ions and thus causing the
information to flow to dendrites of other neurons.
A human brain has on the order of 100 billion neurons, with each neuron having
between 1,000 and 10,000 connections to other neurons.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 2 / 59
Artificial Neural Networks

Artificial neural networks are composed of abstract neurons that try to mimic real neurons at a very high level.
They can be described via a weighted directed graph G = (V , E ), with each node
vi ∈ V representing a neuron, and each directed edge (vi , vj ) ∈ E representing a
synaptic to dendritic connection from vi to vj . The weight of the edge wij denotes
the synaptic strength.

[Figure: An artificial neuron zk. Inputs x1, ..., xd and the bias neuron x0 = 1 feed into zk with weights w1k, ..., wdk and bias bk; the neuron computes netk = bk + Σ_{i=1}^d wik · xi.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 3 / 59
Artificial Neuron

An artificial neuron acts as a processing unit that first aggregates the incoming
signals via a weighted sum, and then applies some function to generate an output.
A binary neuron outputs a 1 whenever the combined signal exceeds a threshold, or
0 otherwise.

netk = bk + Σ_{i=1}^d wik · xi = bk + w^T x

x0 is a special bias neuron whose value is always fixed at 1, and the weight from
x0 to zk is bk .
Finally, the output value of zk is given as some activation function, f (·), applied
to the net input at zk

zk = f (net k )
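
To make the aggregation and activation steps concrete, here is a minimal Python/NumPy sketch of a single artificial neuron; the particular weights, bias, and choice of sigmoid activation are illustrative assumptions, not values from the text.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def neuron_output(x, w, b, f=sigmoid):
    """Compute z_k = f(net_k), where net_k = b_k + w^T x."""
    net = b + w @ x      # weighted aggregation of the incoming signals
    return f(net)        # apply the activation function

# Hypothetical example with d = 3 inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, -0.1])
print(neuron_output(x, w, b=0.1))
```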

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 4 / 59
Linear Activation Function

Function: f(netk) = netk
Derivative: ∂f(netj)/∂netj = 1

[Plot: zk versus w^T x for the linear activation; the output grows linearly without bound.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 5 / 59
Step Activation Function

Function: f(netk) = 0 if netk ≤ 0, and 1 if netk > 0
Derivative: ∂f(netj)/∂netj = 0

[Plot: zk versus w^T x for the step activation; the output jumps from 0 to 1 at w^T x = −bk.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 6 / 59
Rectified Linear Activation Function

Function: f(netk) = 0 if netk ≤ 0, and netk if netk > 0
Derivative: ∂f(netj)/∂netj = 0 if netj ≤ 0, and 1 if netj > 0

[Plot: zk versus w^T x for the rectified linear (ReLU) activation; the output is 0 up to w^T x = −bk and grows linearly thereafter.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 7 / 59
Sigmoid Activation Function

Function: f(netk) = 1 / (1 + exp{−netk})
Derivative: ∂f(netj)/∂netj = f(netj) · (1 − f(netj))

[Plot: zk versus w^T x for the sigmoid activation; the output rises smoothly from 0 to 1, crossing 0.5 at w^T x = −bk.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 8 / 59
Hyperbolic Tangent Activation Function

Function: f(netk) = (exp{netk} − exp{−netk}) / (exp{netk} + exp{−netk}) = (exp{2·netk} − 1) / (exp{2·netk} + 1)
Derivative: ∂f(netj)/∂netj = 1 − f(netj)²

[Plot: zk versus w^T x for the hyperbolic tangent activation; the output rises smoothly from −1 to 1, crossing 0 at w^T x = −bk.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 9 / 59
Softmax Activation Function

Function: f(netk | net) = exp{netk} / Σ_{i=1}^p exp{neti}
Derivative: ∂f(netj | net)/∂netk = ∂oj/∂netk = oj · (1 − oj) if k = j, and −ok · oj if k ≠ j

[Plot: softmax output zk as a function of netj and netk.]
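
The activation functions above (and the derivatives expressed in terms of the output value) can be sketched in a few lines of NumPy; this is an illustrative transcription, not code from the book.

```python
import numpy as np

def relu(net):        return np.maximum(0.0, net)
def relu_deriv(z):    return (z > 0).astype(float)   # 0 for net <= 0, 1 for net > 0

def sigmoid(net):     return 1.0 / (1.0 + np.exp(-net))
def sigmoid_deriv(z): return z * (1.0 - z)            # uses z = f(net)

def tanh_deriv(z):    return 1.0 - z**2               # np.tanh computes the function itself

def softmax(net):
    e = np.exp(net - net.max())                       # shift by the max for numerical stability
    return e / e.sum()
```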

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 10 / 59
Error Functions
Squared Error: Given an input vector x ∈ Rd , the squared error loss function
measures the squared deviation between the predicted output vector o ∈ Rp and
the true response y ∈ Rp , defined as follows:

Ex = (1/2) ‖y − o‖² = (1/2) Σ_{j=1}^p (yj − oj)²

where Ex denotes the error on input x. The partial derivative of the squared error function with respect to a particular output neuron oj is

∂Ex/∂oj = (1/2) · 2 · (yj − oj) · (−1) = oj − yj

Across all the output neurons, we can write this as

∂Ex/∂o = o − y
As discussed, squared error is often used in regression.
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 11 / 59
Error Functions

Cross-Entropy Error: For classification tasks, with K classes {c1, c2, · · · , cK}, the number of output neurons p = K, with one output neuron per class.
Each of the classes is coded as a one-hot vector, with class ci encoded as the ith standard basis vector e_i = (e_i1, e_i2, · · · , e_iK)^T ∈ {0, 1}^K, with e_ii = 1 and e_ij = 0 for all j ≠ i.
The cross-entropy loss is defined as

Ex = − Σ_{i=1}^K yi · ln(oi) = −( y1 · ln(o1) + · · · + yK · ln(oK) )

Note that only one element of y is 1 and the rest are 0.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 12 / 59
Error Functions

The partial derivative of the cross-entropy error function with respect to a particular output neuron oj is

∂Ex/∂oj = −yj/oj

The vector of partial derivatives of the error function with respect to the output neurons is therefore given as

∂Ex/∂o = ( ∂Ex/∂o1, ∂Ex/∂o2, · · · , ∂Ex/∂oK )^T = ( −y1/o1, −y2/o2, · · · , −yK/oK )^T

Cross-entropy is often used for classification.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 13 / 59
Error Functions

Binary Cross-Entropy Error: For binary classes, the positive class is 1 and the
negative class is 0.
Given x ∈ Rd , with y ∈ {0, 1}, there is only one output neuron o.


Ex = −( y · ln(o) + (1 − y) · ln(1 − o) )

The partial derivative with respect to the output neuron o is


∂Ex/∂o = ∂/∂o { −y · ln(o) − (1 − y) · ln(1 − o) }
       = −y/o + (1 − y)/(1 − o) = ( −y · (1 − o) + (1 − y) · o ) / ( o · (1 − o) ) = (o − y) / ( o · (1 − o) )
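
As a quick numerical check of these expressions (an illustrative sketch, not from the text):

```python
import numpy as np

def binary_cross_entropy(y, o):
    return -(y * np.log(o) + (1 - y) * np.log(1 - o))

def binary_cross_entropy_grad(y, o):
    # dE_x/do = (o - y) / (o * (1 - o))
    return (o - y) / (o * (1 - o))

y, o = 1.0, 0.8
print(binary_cross_entropy(y, o))       # -ln(0.8), about 0.223
print(binary_cross_entropy_grad(y, o))  # (0.8 - 1) / (0.8 * 0.2) = -1.25
```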

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 14 / 59
Linear and Logistic Regression via Neural Networks

Networks of (artificial) neurons are capable of representing and learning arbitrarily complex functions for regression.

[Figure: Left, a single output neuron o with bias b and weights w1, ..., wd for multiple regression; right, p output neurons o1, ..., op with biases b1, ..., bp and weights w11, ..., wdp for multivariate regression.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 15 / 59
ANN for Multiple and Multivariate Regression
Example – Multiple Linear Regression

Consider the multiple regression of sepal length and petal length on the
dependent attribute petal width for the Iris dataset with n = 150 points.

ŷ = −0.014 − 0.082 · x1 + 0.45 · x2

The squared error for this optimal solution is 6.179 on the training data.
A neural network with linear activation, trained to minimize the SSE via gradient
descent, results in

o = 0.0096 − 0.087 · x1 + 0.452 · x2

with an SSE of 6.18, which is very close to the optimal solution.
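
A minimal NumPy sketch of this setup, a single neuron with linear activation trained by stochastic gradient descent on the squared error, is shown below; the synthetic data, learning rate, and epoch count are illustrative assumptions and do not reproduce the Iris experiment itself.

```python
import numpy as np

def train_linear_neuron(X, y, eta=0.01, epochs=500, seed=0):
    """Single output neuron o = b + w^T x, trained to minimize the SSE."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            o = b + w @ X[i]          # linear activation
            delta = o - y[i]          # net gradient for squared error
            w -= eta * delta * X[i]   # weight update
            b -= eta * delta          # bias update
    return w, b

# Illustrative usage on synthetic data standing in for the Iris attributes
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
y = -0.014 - 0.082 * X[:, 0] + 0.45 * X[:, 1] + rng.normal(scale=0.1, size=150)
print(train_linear_neuron(X, y))
```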

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 16 / 59
ANN for Multiple and Multivariate Regression
Example – Multivariate Linear Regression

We use the neural network architecture to learn the weights and bias for the Iris
dataset, where we use sepal length and sepal width as the independent
attributes, and petal length and petal width as the response or dependent
attributes.
Therefore, each x_i is 2-dimensional, and y_i is also 2-dimensional. Minimizing the SSE via gradient descent yields:

b1 = −1.83, b2 = −1.47; w11 = 1.72, w12 = 0.72; w21 = −1.46, w22 = −0.50, that is,

o1 = −1.83 + 1.72 · x1 − 1.46 · x2
o2 = −1.47 + 0.72 · x1 − 0.50 · x2

The SSE on the training set is 84.9. Optimal multivariate regression yields an SSE
of 84.16:

ŷ1 = −2.56 + 1.78 · x1 − 1.34 · x2
ŷ2 = −1.59 + 0.73 · x1 − 0.48 · x2

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 17 / 59
Classification via Neural Networks

Networks of artificial neurons can also learn to classify the inputs.


A simple change to the neural network allows it to solve the logistic regression
problem. All we have to do is use a sigmoid activation function at the output
neuron o, and use the cross-entropy error instead of squared error.

Ex = −( y · ln(o) + (1 − y) · ln(1 − o) )

The output of the neural network is


o = f(net_o) = sigmoid(b + w^T x) = 1 / (1 + exp{−(b + w^T x)}) = π(x)

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 18 / 59
Neural networks for multiclass logistic regression
Example

We applied the single output neural network with logistic activation at the output
neuron and cross-entropy error function, on the Iris principal components dataset.
The output is a binary response indicating Iris-virginica (Y = 1) or one of
the other Iris types (Y = 0).
As expected, the neural network learns an identical set of weights and bias as
shown for the logistic regression model, namely:
o = −6.79 − 5.07 · x1 − 3.29 · x2
Next, we used a softmax activation and cross-entropy error function, to the Iris
principal components data with three classes: Iris-setosa (Y = 1),
Iris-versicolor (Y = 2) and Iris-virginica (Y = 3).
We fix the weights and bias for output neuron o3 to be zero:
o1 = −3.49 + 3.61 · x1 + 2.65 · x2
o2 = −6.95 − 5.18 · x1 − 3.40 · x2
o3 = 0 + 0 · x1 + 0 · x2
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 19 / 59
Neural networks for multiclass logistic regression
Example

If we do not constrain the weights and bias for o3, we obtain the following model:
o1 = −0.89 + 4.54 · x1 + 1.96 · x2
o2 = −3.38 − 5.11 · x1 − 2.88 · x2
o3 = 4.24 + 0.52 · x1 + 0.92 · x2
Misclassified points are shown in dark gray color. Points in class c1 and c2 are
shown displaced with respect to the base class c3 only for illustration.
[Figure: The three class probability surfaces π1(x), π2(x), and π3(x) over the Iris principal components plane (X1, X2), with the points of the three classes plotted against the surfaces.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 20 / 59
Multilayer Perceptron: One Hidden Layer

A multilayer perceptron (MLP) is a neural network that has distinct layers of neurons.
The inputs to the neural network comprise the input layer, and the final outputs
from the MLP comprise the output layer.
Any intermediate layer is called a hidden layer, and an MLP can have one or many
hidden layers.
Networks with many hidden layers are called deep neural networks.
An MLP is also a feed-forward network. That is, information flows in the forward
direction, and from a layer only to the subsequent layer.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 21 / 59
Multilayer Perceptron: One Hidden Layer
[Figure: MLP with one hidden layer. Input layer x0 (bias), x1, ..., xd; hidden layer z0 (bias), z1, ..., zm; output layer o1, ..., op. Weight wik connects xi to zk, weight wkj connects zk to oj, and bk is the bias for zk.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 22 / 59
MLP: Feed-forward Phase

Given the input neuron values, the output for each hidden neuron zk is:
zk = f(netk) = f( bk + Σ_{i=1}^d wik · xi )

where f is some activation function, and wik denotes the weight between input
neuron xi and hidden neuron zk .
Next, given the hidden neuron values, the value for each output neuron oj is:
oj = f(netj) = f( bj + Σ_{i=1}^m wij · zi )

where wij denotes the weight between hidden neuron zi and output neuron oj .
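
In vectorized form the feed-forward pass is just two matrix products; the sketch below assumes sigmoid activations in both layers purely for illustration.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def feed_forward(x, Wh, bh, Wo, bo, f=sigmoid):
    """One-hidden-layer MLP: z = f(b_h + W_h^T x), o = f(b_o + W_o^T z)."""
    z = f(bh + Wh.T @ x)   # hidden layer values, shape (m,)
    o = f(bo + Wo.T @ z)   # output layer values, shape (p,)
    return z, o
```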

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 23 / 59
MLP: Feed-Forward Phase

The output vector can then be computed as follows:

net_o = b_o + W_o^T z
o = f(net_o) = f( b_o + W_o^T z )

To summarize, for a given input x ∈ D with desired response y, an MLP computes the output vector via the feed-forward process, as follows:

o = f( b_o + W_o^T z ) = f( b_o + W_o^T · f( b_h + W_h^T x ) )

where o = (o1, o2, · · · , op)^T is the vector of predicted outputs from the single hidden layer MLP.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 24 / 59
MLP: Backpropagation Phase

Backpropagation is the algorithm used to learn the weights between successive layers in an MLP.
The name comes from the manner in which the error gradient is propagated
backwards from the output to input layers via the hidden layers.
For a given input pair (x, y ) in the training data, the MLP first computes the
output vector o via the feed-forward step.
Next, it computes the error in the predicted output vis-a-vis the true response y
using the squared error function
Ex = (1/2) ‖y − o‖² = (1/2) Σ_{j=1}^p (yj − oj)²

The basic idea is to examine the extent to which an output neuron, say oj ,
deviates from the corresponding target response yj , and to modify the weights wij
between each hidden neuron zi and oj as some function of the error.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 25 / 59
MLP: Backpropagation Phase

The weight update is done via a gradient descent approach to minimize the error.
Let ∇wij be the gradient of the error function with respect to wij , or simply the
weight gradient at wij .
Given the previous weight estimate wij , a new weight is computed by taking a
small step η in a direction that is opposite to the weight gradient at wij

wij = wij − η · ∇wij

In a similar manner, the bias term bj is also updated via gradient descent

bj = bj − η · ∇bj

where ∇bj is the gradient of the error function with respect to bj , which we call
the bias gradient at bj .

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 26 / 59
MLP: Backpropagation Phase
Updating Parameters Between Hidden and Output Layer

Consider the weight wij between hidden neuron zi and output neuron oj , and the
bias term bj between z0 and oj .
We compute the weight gradient at wij and bias gradient at bj , as follows:

∂Ex ∂Ex ∂netj


∇wij = = · = δj · zi
∂wij ∂netj ∂wij
∂Ex ∂Ex ∂netj
∇bj = = · = δj
∂bj ∂netj ∂bj

where we use the symbol δj to denote the partial derivative of the error with respect to the net signal at oj, which we also call the net gradient at oj.
Next, we compute δj , the net gradient at oj .

δj = ∂Ex/∂netj = (∂Ex/∂f(netj)) · (∂f(netj)/∂netj)

Note that f (netj ) = oj .


Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 27 / 59
MLP: Backpropagation Phase
Updating Parameters Between Hidden and Output Layer

Using the squared error function for the former, we have


∂Ex/∂f(netj) = ∂Ex/∂oj = ∂/∂oj { (1/2) Σ_{k=1}^p (yk − ok)² } = (oj − yj)

where we used the observation that all ok for k ≠ j are constants with respect to
oj. Since we assume a sigmoid activation function, for the latter, we have

∂f(netj)/∂netj = oj · (1 − oj)

Putting it all together, we get

δj = (oj − yj ) · oj · (1 − oj )

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 28 / 59
MLP: Backpropagation Phase
Updating Parameters Between Input and Hidden Layer

Consider wij between xi and zj , and bj between x0 and zj .

∇wij = ∂Ex/∂wij = (∂Ex/∂netj) · (∂netj/∂wij) = δj · xi
∇bj = ∂Ex/∂bj = (∂Ex/∂netj) · (∂netj/∂bj) = δj

δj at zj has to consider the error gradients that flow back from all the output
neurons to zj .
δj = ∂Ex/∂netj = Σ_{k=1}^p (∂Ex/∂netk) · (∂netk/∂zj) · (∂zj/∂netj) = (∂zj/∂netj) · Σ_{k=1}^p (∂Ex/∂netk) · (∂netk/∂zj)
   = zj · (1 − zj) · Σ_{k=1}^p δk · wjk

where ∂zj/∂netj = zj · (1 − zj), assuming a sigmoid activation function.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 29 / 59
Backpropagation of gradients from output to hidden layer
To find the net gradient at zj we have to consider the net gradients at each of the
output neurons δk but weighted by the strength of the connection wjk between zj
and ok .
That is, we compute the weighted sum of gradients Σ_{k=1}^p δk · wjk, which is used to compute δj, the net gradient at hidden neuron zj.
[Figure: Backpropagation of gradients from the output layer to hidden neuron zj. Each output neuron ok has net gradient δk, and zj receives the weighted sum Σ_{k=1}^p δk · wjk over its outgoing weights wj1, ..., wjp.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 30 / 59
MLP Training: Stochastic Gradient Descent

The MLP training takes multiple iterations over the input points. For each input
x i , the MLP computes the output vector o i via the feed-forward step. In the
backpropagation phase, we compute the error gradient vector δ o with respect to
the net at output neurons, followed by δ h for hidden neurons.
In the stochastic gradient descent step, we compute the error gradients with
respect to the weights and biases, which are used to update the weight matrices
and bias vectors.
MLP-Training (D, m, η, maxiter):
// Initialize bias vectors
1 b h ← random m-dimensional vector with small values
2 b o ← random p-dimensional vector with small values
// Initialize weight matrices
3 W h ← random d × m matrix with small values
4 W o ← random m × p matrix with small values
5 t ← 0 // iteration counter

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 31 / 59
MLP Training: Stochastic Gradient Descent
6 repeat
7 foreach (x i , y i ) ∈ D in random order do
// Feed-forward phase
8 z_i ← f( b_h + W_h^T x_i )
9 o_i ← f( b_o + W_o^T z_i )
// Backpropagation phase: net gradients
10 δ o ← o i ⊙ (1 − o i ) ⊙ (o i − y i )
11 δ h ← z i ⊙ (1 − z i ) ⊙ (W o · δ o )
// Gradient descent for bias vectors
12 ∇bo ← δ o ; b o ← b o − η · ∇bo
13 ∇bh ← δ h ; b h ← b h − η · ∇bh
// Gradient descent for weight matrices
14 ∇W_o ← z_i · δ_o^T ; W_o ← W_o − η · ∇W_o
15 ∇W_h ← x_i · δ_h^T ; W_h ← W_h − η · ∇W_h
16 t ← t +1
17 until t ≥ maxiter
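
The pseudocode above can be transcribed almost line for line into NumPy. The sketch below assumes sigmoid activations at both layers and squared error, which is what the expressions for δ_o and δ_h on lines 10–11 correspond to; the initialization scale and default parameters are illustrative.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def mlp_train(X, Y, m, eta=0.1, maxiter=1000, seed=0):
    """One-hidden-layer MLP (sigmoid units, squared error) trained by stochastic gradient descent."""
    n, d = X.shape
    p = Y.shape[1]
    rng = np.random.default_rng(seed)
    bh = rng.normal(scale=0.1, size=m)        # hidden bias vector b_h
    bo = rng.normal(scale=0.1, size=p)        # output bias vector b_o
    Wh = rng.normal(scale=0.1, size=(d, m))   # input-to-hidden weights W_h
    Wo = rng.normal(scale=0.1, size=(m, p))   # hidden-to-output weights W_o
    for t in range(maxiter):                  # t counts passes (epochs) over the data here
        for i in rng.permutation(n):
            x, y = X[i], Y[i]
            # Feed-forward phase
            z = sigmoid(bh + Wh.T @ x)
            o = sigmoid(bo + Wo.T @ z)
            # Backpropagation phase: net gradients
            do = o * (1 - o) * (o - y)        # delta_o
            dh = z * (1 - z) * (Wo @ do)      # delta_h
            # Gradient descent for biases and weights
            bo -= eta * do
            bh -= eta * dh
            Wo -= eta * np.outer(z, do)       # gradient z_i delta_o^T
            Wh -= eta * np.outer(x, dh)       # gradient x_i delta_h^T
    return Wh, bh, Wo, bo
```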

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 32 / 59
MLP with one hidden layer
Example

Evaluate an MLP with a hidden layer using a non-linear activation function to learn the sine curve.
The training data comprises n = 25 points xi sampled randomly in the range
[−10, 10], with yi = sin(xi ).
The testing data comprises 1000 points sampled uniformly from the same range.
The figure also shows the desired output curve (thin line).
We used an MLP with one input neuron (d = 1), ten hidden neurons (m = 10)
and one output neuron (p = 1). The hidden neurons use tanh activations, whereas
the output unit uses an identity activation. The step size is η = 0.005.
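
A self-contained NumPy sketch of this experiment is given below; it follows the same training loop as above but with tanh hidden units and an identity output, as described. The random seed, initialization scale, and exact number of epochs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.uniform(-10, 10, size=25)     # n = 25 random training points
y_train = np.sin(X_train)
X_test = np.linspace(-10, 10, 1000)         # 1000 uniformly spaced test points

m, eta = 10, 0.005                          # ten tanh hidden units, step size 0.005
Wh, bh = rng.normal(scale=0.1, size=(1, m)), np.zeros(m)
Wo, bo = rng.normal(scale=0.1, size=(m, 1)), np.zeros(1)

for t in range(5000):                       # the slides show snapshots up to t = 30000
    for i in rng.permutation(len(X_train)):
        x, y = np.array([X_train[i]]), np.array([y_train[i]])
        z = np.tanh(bh + Wh.T @ x)          # hidden layer (tanh)
        o = bo + Wo.T @ z                   # output layer (identity)
        do = o - y                          # net gradient at output (SSE, linear output)
        dh = (1 - z**2) * (Wo @ do)         # net gradient at hidden (tanh)
        bo -= eta * do; Wo -= eta * np.outer(z, do)
        bh -= eta * dh; Wh -= eta * np.outer(x, dh)

# Predictions on the test points
pred = (np.tanh(bh + np.outer(X_test, Wh[0])) @ Wo + bo).ravel()
```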

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 33 / 59
MLP for sine curve
t = 1

[Figure: MLP predictions on the test points (circles) and the target sine curve (thin line) at t = 1; axes X and Y.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 34 / 59
MLP for sine curve
t = 1000

[Figure: MLP predictions on the test points (circles) and the target sine curve (thin line) at t = 1000; axes X and Y.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 35 / 59
MLP for sine curve
t = 5000

[Figure: MLP predictions on the test points (circles) and the target sine curve (thin line) at t = 5000; axes X and Y.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 36 / 59
MLP for sine curve
t = 10000

[Figure: MLP predictions on the test points (circles) and the target sine curve (thin line) at t = 10000; axes X and Y.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 37 / 59
MLP for sine curve
t = 15000

[Figure: MLP predictions on the test points (circles) and the target sine curve (thin line) at t = 15000; axes X and Y.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 38 / 59
MLP for sine curve
t = 30000

We can observe that, even with a very small training set of 25 points sampled
randomly from the sine curve, the MLP is able to learn the desired function.

[Figure: MLP predictions on the test points (circles) and the target sine curve (thin line) at t = 30000; the predictions closely follow the sine curve; axes X and Y.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 39 / 59
MLP for sine curve
Test range [−20, 20]

The MLP model has not really learned the sine function; rather, it has learned to
approximate it only in the specified range [−10, 10].
[Figure: MLP predictions (circles) and the true sine curve over the extended test range X ∈ [−20, 20]; outside the training range [−10, 10] the predictions deviate from the sine curve.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 40 / 59
MLP for handwritten digit classification
Example

We evaluate an MLP with one hidden layer for the task of predicting the correct
label for a hand-written digit from the MNIST database, which contains 60,000
training images that span the 10 digit labels, from 0 to 9.
Each (grayscale) image is a 28 × 28 matrix of pixels, with values between 0 and
255. Each pixel is converted to a value in the interval [0, 1] by dividing by 255.
[Figure: Sample MNIST digit images, each shown on a 28 × 28 pixel grid.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 41 / 59
MLP for handwritten digit classification
Example

Since images are 2-dimensional matrices, we first flatten them into a vector
x ∈ R784 with dimensionality d = 28 × 28 = 784. This is done by simply
concatenating all of the rows of the images to obtain one long vector.
Next, since the output labels are categorical values that denote the digits from 0
to 9, we need to convert them into binary (numerical) vectors, using one-hot
encoding.
Thus, the label 0 is encoded as e 1 = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)T ∈ R10 , the label 1 as
e 2 = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0)T ∈ R10 , and so on, and finally the label 9 is encoded
as e 10 = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1)T ∈ R10 . That is, each input image vector x has a
corresponding target response vector y ∈ {e 1 , e 2 , · · · , e 10 }.
Thus, the input layer for the MLP has d = 784 neurons, and the output layer has
p = 10 neurons.
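
A sketch of this preprocessing, assuming the images and labels are already loaded as NumPy arrays (the variable names are illustrative):

```python
import numpy as np

def preprocess(images, labels):
    """images: (n, 28, 28) array with pixel values 0..255; labels: (n,) array of digits 0..9."""
    X = images.reshape(len(images), -1) / 255.0    # flatten to d = 784 and scale to [0, 1]
    Y = np.zeros((len(labels), 10))
    Y[np.arange(len(labels)), labels] = 1.0        # one-hot encode: digit k -> standard basis vector e_{k+1}
    return X, Y
```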

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 42 / 59
MLP for handwritten digit classification
Example

For the hidden layer, we consider several MLP models, each with a different
number of hidden neurons m. We try m = 0, 7, 10, 49, 98, 196, 392, to study the effect
of increasing the number of hidden neurons, from small to large.
For the hidden layer, we use ReLU activation function, and for the output layer,
we use softmax activation, since the target response vector has only one neuron
with value 1, with the rest being 0.
Note that m = 0 means that there is no hidden layer – the input layer is directly
connected to the output layer, which is equivalent to a multiclass logistic
regression model. We train each MLP for t = 15 epochs, using step size η = 0.25.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 43 / 59
MLP for handwritten digit classification
Example

The final test error at the end of training is given as


m 0 7 10 49 98 196 392
errors 1677 901 792 546 495 470 454

We can observe that adding a hidden layer significantly improves the prediction
accuracy. Using even a small number of hidden neurons helps, compared to the
logistic regression model (m = 0). For example, using m = 7 results in 901 errors
(or error rate 9.01%) compared to using m = 0, which results in 1677 errors (or
error rate 16.77%).
On the other hand, as we increase the number of hidden neurons, the error rate
decreases, though with diminishing returns. Using m = 196, the error rate is
4.70%, but even after doubling the number of hidden neurons (m = 392), the
error rate goes down to only 4.54%. Further increasing m does not reduce the
error rate.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 44 / 59
MNIST: Prediction error as a function of epochs
During training, we plot the number of misclassified images after each epoch, on
the separate MNIST test set comprising 10,000 images.
The figure shows the number of errors from each of the models (with a different
number of hidden neurons m), after each epoch.
[Figure: Number of test errors (y-axis) versus epochs 1–15 (x-axis) for the MLPs with m = 0, 7, 10, 49, 98, 196, 392 hidden neurons.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 45 / 59
Deep Multilayer Perceptron
We now generalize the feed-forward and backpropagation steps for many hidden
layers, as well as arbitrary error and neuron activation functions.
Consider an MLP with h hidden layers; the input points x_i ∈ R^d form layer l = 0, and the response vectors y_i ∈ R^p correspond to the output layer l = h + 1.
[Figure: Deep MLP with h hidden layers. Layer l = 0 is the input x = (x1, ..., xd), layers l = 1, ..., h are hidden layers z^l = (z_1^l, ..., z_{n_l}^l) with bias neurons z_0^l, and layer l = h + 1 is the output o = (o1, ..., op). Layer l is connected to layer l + 1 via the weight matrix W_l and bias vector b_l.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 46 / 59
Deep Multilayer Perceptron
Feed-forward Phase

Typically in a deep MLP, the same activation function f l is used for all neurons in
a given layer l.
The input layer always uses the identity activation, so f 0 is the identity function.
All bias neurons likewise use the identity function with a fixed value of 1.
The hidden layers typically use sigmoid, tanh, or ReLU activations.
The output layer typically uses sigmoid or softmax activations for classification
tasks, or identity activations for regression tasks.
For (x, y ) ∈ D, the deep MLP computes the output vector as:
 
o = f^{h+1}( b_h + W_h^T · z^h )
  = f^{h+1}( b_h + W_h^T · f^h( b_{h−1} + W_{h−1}^T · z^{h−1} ) )
  = · · ·
  = f^{h+1}( b_h + W_h^T · f^h( b_{h−1} + W_{h−1}^T · f^{h−1}( · · · f^2( b_1 + W_1^T · f^1( b_0 + W_0^T · x ) ) · · · ) ) )

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 47 / 59
Deep Multilayer Perceptron
Backpropagation Phase

Consider the weight update between a given layer and another, including between
the input and hidden layer, or between two hidden layers, or between the last
hidden layer and the output layer.
Let z_i^l be a neuron in layer l, and z_j^{l+1} a neuron in the next layer l + 1. Let w_ij^l be the weight between z_i^l and z_j^{l+1}, and let b_j^l denote the bias term between z_0^l and z_j^{l+1}.
The weight and bias are updated using the gradient descent approach

w_ij^l = w_ij^l − η · ∇w_ij^l          b_j^l = b_j^l − η · ∇b_j^l

where ∇w_ij^l is the weight gradient and ∇b_j^l is the bias gradient, i.e., the partial derivative of the error function with respect to the weight and bias, respectively.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 48 / 59
Deep Multilayer Perceptron
Backpropagation Phase

We can use the chain rule to write the weight and bias gradient, as follows
∇w_ij^l = ∂Ex/∂w_ij^l = (∂Ex/∂netj) · (∂netj/∂w_ij^l) = δ_j^{l+1} · z_i^l = z_i^l · δ_j^{l+1}
∇b_j^l = ∂Ex/∂b_j^l = (∂Ex/∂netj) · (∂netj/∂b_j^l) = δ_j^{l+1}

In summary, the update of the weights and biases is

W l = W l − η · ∇W l
b l = b l − η · ∇b l

where η is the step size. However, we observe that to compute the weight and
bias gradients for layer l we need to compute the net gradients δ l +1 at layer l + 1.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 49 / 59
Deep Multilayer Perceptron
Net Gradients at Output Layer

If all of the output neurons are independent (for example, when using linear or
sigmoid activations), the net gradient is obtained by differentiating the error
function with respect to the net signal at the output neurons. That is,

δ_j^{h+1} = ∂Ex/∂netj = (∂Ex/∂f^{h+1}(netj)) · (∂f^{h+1}(netj)/∂netj) = (∂Ex/∂oj) · (∂f^{h+1}(netj)/∂netj)

If the output neurons are not independent (for example, when using a softmax activation), then:

δ_j^{h+1} = ∂Ex/∂netj = Σ_{i=1}^p (∂Ex/∂f^{h+1}(neti)) · (∂f^{h+1}(neti)/∂netj)

For regression, we use the SSE with linear activation function, whereas for logistic
regression and classification, we use the cross-entropy error function with a
sigmoid activation for binary classes, and softmax activation for multiclass
problems.
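
For example (a standard derivation, spelled out here for concreteness rather than taken from the slides), combining the cross-entropy error with the softmax derivative given earlier yields a simple net gradient at the output:

```latex
\delta_j^{h+1}
= \sum_{i=1}^{p} \frac{\partial E_x}{\partial o_i}\,\frac{\partial o_i}{\partial net_j}
= -\frac{y_j}{o_j}\, o_j (1 - o_j) + \sum_{i \neq j} \left(-\frac{y_i}{o_i}\right)(-o_i\, o_j)
= -y_j + o_j \sum_{i=1}^{p} y_i
= o_j - y_j
```

since the one-hot response y satisfies Σ_i y_i = 1.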

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 50 / 59
Deep Multilayer Perceptron
Net Gradients at Hidden Layers

Let us assume that we have already computed the net gradients at layer l + 1, namely δ^{l+1}.
Since neuron z_j^l in layer l is connected to all of the neurons in layer l + 1 (except for the bias neuron z_0^{l+1}), to compute the net gradient at z_j^l, we have to account for the error from each neuron in layer l + 1, as follows:
δ_j^l = ∂Ex/∂netj = Σ_{k=1}^{n_{l+1}} (∂Ex/∂netk) · (∂netk/∂f^l(netj)) · (∂f^l(netj)/∂netj)
      = (∂f^l(netj)/∂netj) · Σ_{k=1}^{n_{l+1}} δ_k^{l+1} · w_jk^l

So the net gradient at z_j^l in layer l depends on the derivative of the activation function with respect to its netj, and the weighted sum of the net gradients from all the neurons z_k^{l+1} at the next layer l + 1.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 51 / 59
Net Gradients at Hidden Layers
For the commonly used activation functions at the hidden layer, we have

1
 for linear
l l l
∂f = z (1 − z ) for sigmoid
 l l
(1 − z ⊙ z ) for tanh

The net gradients are computed recursively, starting from the output layer h + 1,
then hidden layer h, and so on, until we finally compute the net gradients at the
first hidden layer l = 1. That is,

δ^h = ∂f^h ⊙ ( W_h · δ^{h+1} )
δ^{h−1} = ∂f^{h−1} ⊙ ( W_{h−1} · δ^h ) = ∂f^{h−1} ⊙ ( W_{h−1} · ( ∂f^h ⊙ ( W_h · δ^{h+1} ) ) )
· · ·
δ^1 = ∂f^1 ⊙ ( W_1 · ( ∂f^2 ⊙ ( W_2 · ( · · · ∂f^h ⊙ ( W_h · δ^{h+1} ) · · · ) ) ) )

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 52 / 59
Deep MLP Training: Stochastic Gradient Descent

Deep-MLP-Training (D, h, η, maxiter, n1 , n2 , · · · , nh , f 1 , f 2 , · · · , f h+1 ):


1 n0 ← d // input layer size
2 nh+1 ← p // output layer size
// Initialize weight matrices and bias vectors
3 for l = 0, 1, 2, · · · , h do
4 b_l ← random n_{l+1}-dimensional vector with small values
5 W_l ← random n_l × n_{l+1} matrix with small values
6 t ← 0 // iteration counter

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 53 / 59
Deep MLP Training: Stochastic Gradient Descent
7 repeat
8 foreach (x i , y i ) ∈ D in random order do
9 z 0 ← x i // Feed-Forward Phase
 
10 for l = 0, 1, 2, . . . , h do z^{l+1} ← f^{l+1}( b_l + W_l^T · z^l )
11 o_i ← z^{h+1}
12 if independent outputs then // Backpropagation phase
13 δ^{h+1} ← ∂f^{h+1} ⊙ ∂E_{x_i} // net gradients at output
14 else
15 δ^{h+1} ← ∂F^{h+1} · ∂E_{x_i} // net gradients at output
16 for l = h, h − 1, · · · , 1 do δ^l ← ∂f^l ⊙ ( W_l · δ^{l+1} ) // net gradients
17 for l = 0, 1, · · · , h do // Gradient Descent Step
18 ∇W_l ← z^l · (δ^{l+1})^T // weight gradient matrix at layer l
19 ∇b_l ← δ^{l+1} // bias gradient vector at layer l
20 for l = 0, 1, · · · , h do
21 W_l ← W_l − η · ∇W_l // update W_l
22 b_l ← b_l − η · ∇b_l // update b_l

23 t ← t +1
24 until t ≥ maxiter
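
A generalized NumPy transcription of this algorithm is sketched below. It assumes sigmoid units in every layer and independent sigmoid outputs with squared error, so that ∂f^l and the output net gradient take the forms given earlier; the initialization scale and default parameters are illustrative.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def deep_mlp_train(X, Y, hidden_sizes, eta=0.1, maxiter=15, seed=0):
    """Deep MLP with hidden layer sizes n_1, ..., n_h; all layers use sigmoid here."""
    n, d = X.shape
    p = Y.shape[1]
    dims = [d] + list(hidden_sizes) + [p]           # n_0 = d, ..., n_{h+1} = p
    rng = np.random.default_rng(seed)
    W = [rng.normal(scale=0.1, size=(dims[l], dims[l + 1])) for l in range(len(dims) - 1)]
    b = [rng.normal(scale=0.1, size=dims[l + 1]) for l in range(len(dims) - 1)]
    for t in range(maxiter):
        for i in rng.permutation(n):
            # Feed-forward phase: store z^0, z^1, ..., z^{h+1}
            z = [X[i]]
            for l in range(len(W)):
                z.append(sigmoid(b[l] + W[l].T @ z[l]))
            o = z[-1]
            # Backpropagation phase: net gradients delta^{h+1}, ..., delta^1
            delta = [o * (1 - o) * (o - Y[i])]      # independent sigmoid outputs, squared error
            for l in range(len(W) - 1, 0, -1):
                delta.insert(0, z[l] * (1 - z[l]) * (W[l] @ delta[0]))
            # Gradient descent step: delta[l] plays the role of delta^{l+1}
            for l in range(len(W)):
                W[l] -= eta * np.outer(z[l], delta[l])
                b[l] -= eta * delta[l]
    return W, b
```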
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 54 / 59
Vanishing or Exploding Gradients

In the vanishing gradient problem, the norm of the net gradient can decay
exponentially with the distance from the output layer, that is, as we
backpropagate the gradients from the output layer to the input layer. In this case
the network will learn extremely slowly, if at all, since the gradient descent method
will make minuscule changes to the weights and biases.
On the other hand, in the exploding gradient problem, the norm of the net
gradient can grow exponentially with the distance from the output layer. In this
case, the weights and biases will become exponentially large, resulting in a failure
to learn. The gradient explosion problem can be mitigated to some extent by
gradient thresholding, that is, by resetting the value if it exceeds an upper bound.
The vanishing gradients problem is more difficult to address. Typically sigmoid
activations are more susceptible to this problem, and one solution is to use
alternative activation functions such as ReLU. In general, recurrent neural
networks, which are deep neural networks with feedback connections, are more
prone to vanishing and exploding gradients.
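
Gradient thresholding of the kind mentioned above is commonly implemented by rescaling the gradient when its norm exceeds a bound; a minimal sketch (the threshold value is an arbitrary example):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so that its norm never exceeds max_norm (gradient thresholding)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```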

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 55 / 59
Deep MLP
Example

We now examine deep MLPs for predicting the labels for the MNIST handwritten
images dataset.
Recall that this dataset has n = 60000 grayscale images of size 28 × 28 that we
treat as d = 784 dimensional vectors. The pixel values between 0 and 255 are
converted to the range 0 and 1 by dividing each value by 255. The target
response vector is a one-hot encoded vector for class labels {0, 1, . . . , 9}.
Thus, the input to the MLP x i has dimensionality d = 784, and the output layer
has dimensionality p = 10. We use softmax activation for the output layer. We
use ReLU activation for the hidden layers, and consider several deep models with
different number and sizes of the hidden layers. We use step size η = 0.3 and train
for t = 15 epochs. Training was done using minibatches, using batch size of 1000.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 56 / 59
Deep MLP
Example

We evaluate the performance of each MLP on the MNIST test dataset that
contains 10,000 images. The final test error at the end of training is given as
hidden layers errors
n1 = 392 396
n1 = 196, n2 = 49 303
n1 = 392, n2 = 196, n3 = 49 290
n1 = 392, n2 = 196, n3 = 98, n4 = 49 278

We can observe that as we increase the number of layers, we do get performance improvements.
The deep MLP with four hidden layers of sizes n1 = 392, n2 = 196, n3 = 98, n4 = 49
results in an error rate of 2.78% on the test set, whereas the MLP with a
single hidden layer of size n1 = 392 has an error rate of 3.96%.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 57 / 59
MNIST: Deep MLPs
Prediction error as a function of epochs.

The deeper MLPs improve the prediction accuracy, but adding more layers does
not keep reducing the error rate, and can also lead to performance degradation.

[Figure: Number of test errors (y-axis) versus epochs 1–15 (x-axis) for the deep MLPs with hidden layers n1 = 392; n1 = 196, n2 = 49; n1 = 392, n2 = 196, n3 = 49; and n1 = 392, n2 = 196, n3 = 98, n4 = 49.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 58 / 59
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki1 Wagner Meira Jr.2

1
Department of Computer Science
Rensselaer Polytechnic Institute, Troy, NY, USA
2
Department of Computer Science
Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 25: Neural Networks

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 59 / 59
