
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki1 Wagner Meira Jr.2

1
Department of Computer Science
Rensselaer Polytechnic Institute, Troy, NY, USA
2
Department of Computer Science
Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 25: Neural Networks

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 1 / 59
Artificial Neural Networks

Artificial neural networks, or simply neural networks, are inspired by biological neuronal networks.
A real biological neuron, or a nerve cell, comprises dendrites, a cell body, and an
axon that leads to synaptic terminals. A neuron transmits information via
electrochemical signals.
When there is enough concentration of ions at the dendrites of a neuron it
generates an electric pulse along its axon called an action potential, which in turn
activates the synaptic terminals, releasing more ions and thus causing the
information to flow to dendrites of other neurons.
A human brain has on the order of 100 billion neurons, with each neuron having
between 1,000 and 10,000 connections to other neurons.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 2 / 59
Artificial Neural Networks

Artificial neural networks are composed of abstract neurons that try to mimic real neurons at a very high level.
They can be described via a weighted directed graph G = (V , E ), with each node
vi ∈ V representing a neuron, and each directed edge (vi , vj ) ∈ E representing a
synaptic to dendritic connection from vi to vj . The weight of the edge wij denotes
the synaptic strength.

[Figure: An artificial neuron zk. Inputs x1, ..., xd and the bias neuron x0 = 1 feed into zk with weights w1k, ..., wdk and bias bk; the neuron computes netk = bk + Σ_{i=1}^d wik · xi.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 3 / 59
Artificial Neuron

An artificial neuron acts as a processing unit that first aggregates the incoming
signals via a weighted sum, and then applies some function to generate an output.
A binary neuron outputs a 1 whenever the combined signal exceeds a threshold, or
0 otherwise.

netk = bk + Σ_{i=1}^d wik · xi = bk + w^T x

x0 is a special bias neuron whose value is always fixed at 1, and the weight from
x0 to zk is bk .
Finally, the output value of zk is given as some activation function, f (·), applied
to the net input at zk

zk = f (net k )
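
To make the aggregation and activation steps concrete, here is a minimal Python/NumPy sketch of a single artificial neuron; the particular weights, bias, and choice of sigmoid activation are illustrative assumptions, not values from the text.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def neuron_output(x, w, b, f=sigmoid):
    """Compute z_k = f(net_k), where net_k = b_k + w^T x."""
    net = b + w @ x      # weighted aggregation of the incoming signals
    return f(net)        # apply the activation function

# Hypothetical example with d = 3 inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, -0.1])
print(neuron_output(x, w, b=0.1))
```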

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 4 / 59
Linear Activation Function

Function: f(netk) = netk
Derivative: ∂f(netj)/∂netj = 1

[Plot: zk versus w^T x for the linear activation; the output grows linearly without bound.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 5 / 59
Step Activation Function

Function: f(netk) = 0 if netk ≤ 0, and 1 if netk > 0
Derivative: ∂f(netj)/∂netj = 0

[Plot: zk versus w^T x for the step activation; the output jumps from 0 to 1 at w^T x = −bk.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 6 / 59
Rectified Linear Activation Function

Function: f(netk) = 0 if netk ≤ 0, and netk if netk > 0
Derivative: ∂f(netj)/∂netj = 0 if netj ≤ 0, and 1 if netj > 0

[Plot: zk versus w^T x for the rectified linear (ReLU) activation; the output is 0 up to w^T x = −bk and grows linearly thereafter.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 7 / 59
Sigmoid Activation Function

Function: f(netk) = 1 / (1 + exp{−netk})
Derivative: ∂f(netj)/∂netj = f(netj) · (1 − f(netj))

[Plot: zk versus w^T x for the sigmoid activation; the output rises smoothly from 0 to 1, crossing 0.5 at w^T x = −bk.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 8 / 59
Hyperbolic Tangent Activation Function

Function: f(netk) = (exp{netk} − exp{−netk}) / (exp{netk} + exp{−netk}) = (exp{2·netk} − 1) / (exp{2·netk} + 1)
Derivative: ∂f(netj)/∂netj = 1 − f(netj)²

[Plot: zk versus w^T x for the hyperbolic tangent activation; the output rises smoothly from −1 to 1, crossing 0 at w^T x = −bk.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 9 / 59
Softmax Activation Function

Function: f(netk | net) = exp{netk} / Σ_{i=1}^p exp{neti}
Derivative: ∂f(netj | net)/∂netk = ∂oj/∂netk = oj · (1 − oj) if k = j, and −ok · oj if k ≠ j

[Plot: softmax output zk as a function of netj and netk.]
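
The activation functions above (and the derivatives expressed in terms of the output value) can be sketched in a few lines of NumPy; this is an illustrative transcription, not code from the book.

```python
import numpy as np

def relu(net):        return np.maximum(0.0, net)
def relu_deriv(z):    return (z > 0).astype(float)   # 0 for net <= 0, 1 for net > 0

def sigmoid(net):     return 1.0 / (1.0 + np.exp(-net))
def sigmoid_deriv(z): return z * (1.0 - z)            # uses z = f(net)

def tanh_deriv(z):    return 1.0 - z**2               # np.tanh computes the function itself

def softmax(net):
    e = np.exp(net - net.max())                       # shift by the max for numerical stability
    return e / e.sum()
```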

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 10 / 59
Error Functions
Squared Error: Given an input vector x ∈ Rd , the squared error loss function
measures the squared deviation between the predicted output vector o ∈ Rp and
the true response y ∈ Rp , defined as follows:

Ex = (1/2) ‖y − o‖² = (1/2) Σ_{j=1}^p (yj − oj)²

where Ex denotes the error on input x. The partial derivative of the squared error function with respect to a particular output neuron oj is

∂Ex/∂oj = (1/2) · 2 · (yj − oj) · (−1) = oj − yj

Across all the output neurons, we can write this as

∂Ex/∂o = o − y
As discussed, squared error is often used in regression.
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 11 / 59
Error Functions

Cross-Entropy Error: For classification tasks, with K classes {c1, c2, · · · , cK}, the number of output neurons p = K, with one output neuron per class.
Each of the classes is coded as a one-hot vector, with class ci encoded as the ith standard basis vector e_i = (e_i1, e_i2, · · · , e_iK)^T ∈ {0, 1}^K, with e_ii = 1 and e_ij = 0 for all j ≠ i.
The cross-entropy loss is defined as

Ex = − Σ_{i=1}^K yi · ln(oi) = −( y1 · ln(o1) + · · · + yK · ln(oK) )

Note that only one element of y is 1 and the rest are 0.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 12 / 59
Error Functions

The partial derivative of the cross-entropy error function with respect to a particular output neuron oj is

∂Ex/∂oj = −yj/oj

The vector of partial derivatives of the error function with respect to the output neurons is therefore given as

∂Ex/∂o = ( ∂Ex/∂o1, ∂Ex/∂o2, · · · , ∂Ex/∂oK )^T = ( −y1/o1, −y2/o2, · · · , −yK/oK )^T

Cross-entropy is often used for classification.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 13 / 59
Error Functions

Binary Cross-Entropy Error: For binary classes, the positive class is 1 and the
negative class is 0.
Given x ∈ Rd , with y ∈ {0, 1}, there is only one output neuron o.


Ex = −( y · ln(o) + (1 − y) · ln(1 − o) )

The partial derivative with respect to the output neuron o is


∂Ex/∂o = ∂/∂o { −y · ln(o) − (1 − y) · ln(1 − o) }
       = −y/o + (1 − y)/(1 − o) = ( −y · (1 − o) + (1 − y) · o ) / ( o · (1 − o) ) = (o − y) / ( o · (1 − o) )
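
As a quick numerical check of these expressions (an illustrative sketch, not from the text):

```python
import numpy as np

def binary_cross_entropy(y, o):
    return -(y * np.log(o) + (1 - y) * np.log(1 - o))

def binary_cross_entropy_grad(y, o):
    # dE_x/do = (o - y) / (o * (1 - o))
    return (o - y) / (o * (1 - o))

y, o = 1.0, 0.8
print(binary_cross_entropy(y, o))       # -ln(0.8), about 0.223
print(binary_cross_entropy_grad(y, o))  # (0.8 - 1) / (0.8 * 0.2) = -1.25
```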

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 14 / 59
Linear and Logistic Regression via Neural Networks

Networks of (artificial) neurons are capable of representing and learning arbitrarily complex functions for regression.

[Figure: Left, a single output neuron o with bias b and weights w1, ..., wd for multiple regression; right, p output neurons o1, ..., op with biases b1, ..., bp and weights w11, ..., wdp for multivariate regression.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 15 / 59
ANN for Multiple and Multivariate Regression
Example – Multiple Linear Regression

Consider the multiple regression of sepal length and petal length on the
dependent attribute petal width for the Iris dataset with n = 150 points.

ŷ = −0.014 − 0.082 · x1 + 0.45 · x2

The squared error for this optimal solution is 6.179 on the training data.
A neural network with linear activation, trained to minimize the SSE via gradient
descent, results in

o = 0.0096 − 0.087 · x1 + 0.452 · x2

with an SSE of 6.18, which is very close to the optimal solution.
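
A minimal NumPy sketch of this setup, a single neuron with linear activation trained by stochastic gradient descent on the squared error, is shown below; the synthetic data, learning rate, and epoch count are illustrative assumptions and do not reproduce the Iris experiment itself.

```python
import numpy as np

def train_linear_neuron(X, y, eta=0.01, epochs=500, seed=0):
    """Single output neuron o = b + w^T x, trained to minimize the SSE."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            o = b + w @ X[i]          # linear activation
            delta = o - y[i]          # net gradient for squared error
            w -= eta * delta * X[i]   # weight update
            b -= eta * delta          # bias update
    return w, b

# Illustrative usage on synthetic data standing in for the Iris attributes
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
y = -0.014 - 0.082 * X[:, 0] + 0.45 * X[:, 1] + rng.normal(scale=0.1, size=150)
print(train_linear_neuron(X, y))
```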

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 16 / 59
ANN for Multiple and Multivariate Regression
Example – Multivariate Linear Regression

We use the neural network architecture to learn the weights and bias for the Iris
dataset, where we use sepal length and sepal width as the independent
attributes, and petal length and petal width as the response or dependent
attributes.
Therefore, each x_i is 2-dimensional, and y_i is also 2-dimensional. Minimizing the SSE via gradient descent yields:

b1 = −1.83, b2 = −1.47; w11 = 1.72, w12 = 0.72; w21 = −1.46, w22 = −0.50, that is,

o1 = −1.83 + 1.72 · x1 − 1.46 · x2
o2 = −1.47 + 0.72 · x1 − 0.50 · x2

The SSE on the training set is 84.9. Optimal multivariate regression yields an SSE
of 84.16:

ŷ1 = −2.56 + 1.78 · x1 − 1.34 · x2
ŷ2 = −1.59 + 0.73 · x1 − 0.48 · x2

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 17 / 59
Classification via Neural Networks

Networks of artificial neurons can also learn to classify the inputs.


A simple change to the neural network allows it to solve the logistic regression
problem. All we have to do is use a sigmoid activation function at the output
neuron o, and use the cross-entropy error instead of squared error.

Ex = −( y · ln(o) + (1 − y) · ln(1 − o) )

The output of the neural network is


o = f(net_o) = sigmoid(b + w^T x) = 1 / (1 + exp{−(b + w^T x)}) = π(x)

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 18 / 59
Neural networks for multiclass logistic regression
Example

We applied the single output neural network with logistic activation at the output
neuron and cross-entropy error function, on the Iris principal components dataset.
The output is a binary response indicating Iris-virginica (Y = 1) or one of
the other Iris types (Y = 0).
As expected, the neural network learns an identical set of weights and bias as
shown for the logistic regression model, namely:
o = −6.79 − 5.07 · x1 − 3.29 · x2
Next, we used a softmax activation and cross-entropy error function, to the Iris
principal components data with three classes: Iris-setosa (Y = 1),
Iris-versicolor (Y = 2) and Iris-virginica (Y = 3).
We fix the weights and bias for output neuron o3 to be zero:
o1 = −3.49 + 3.61 · x1 + 2.65 · x2
o2 = −6.95 − 5.18 · x1 − 3.40 · x2
o3 = 0 + 0 · x1 + 0 · x2
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 19 / 59
Neural networks for multiclass logistic regression
Example

If we do not constrain the weights and bias for o3, we obtain the following model:
o1 = −0.89 + 4.54 · x1 + 1.96 · x2
o2 = −3.38 − 5.11 · x1 − 2.88 · x2
o3 = 4.24 + 0.52 · x1 + 0.92 · x2
Misclassified points are shown in dark gray color. Points in class c1 and c2 are
shown displaced with respect to the base class c3 only for illustration.
[Figure: The three class probability surfaces π1(x), π2(x), and π3(x) over the Iris principal components plane (X1, X2), with the points of the three classes plotted against the surfaces.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 20 / 59
Multilayer Perceptron: One Hidden Layer

A multilayer perceptron (MLP) is a neural network that has distinct layers of neurons.
The inputs to the neural network comprise the input layer, and the final outputs
from the MLP comprise the output layer.
Any intermediate layer is called a hidden layer, and an MLP can have one or many
hidden layers.
Networks with many hidden layers are called deep neural networks.
An MLP is also a feed-forward network. That is, information flows in the forward
direction, and from a layer only to the subsequent layer.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 21 / 59
Multilayer Perceptron: One Hidden Layer
[Figure: MLP with one hidden layer. Input layer x0 (bias), x1, ..., xd; hidden layer z0 (bias), z1, ..., zm; output layer o1, ..., op. Weight wik connects xi to zk, weight wkj connects zk to oj, and bk is the bias for zk.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 22 / 59
MLP: Feed-forward Phase

Given the input neuron values, the output for each hidden neuron zk is:
zk = f(netk) = f( bk + Σ_{i=1}^d wik · xi )

where f is some activation function, and wik denotes the weight between input
neuron xi and hidden neuron zk .
Next, given the hidden neuron values, the value for each output neuron oj is:
oj = f(netj) = f( bj + Σ_{i=1}^m wij · zi )

where wij denotes the weight between hidden neuron zi and output neuron oj .
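
In vectorized form the feed-forward pass is just two matrix products; the sketch below assumes sigmoid activations in both layers purely for illustration.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def feed_forward(x, Wh, bh, Wo, bo, f=sigmoid):
    """One-hidden-layer MLP: z = f(b_h + W_h^T x), o = f(b_o + W_o^T z)."""
    z = f(bh + Wh.T @ x)   # hidden layer values, shape (m,)
    o = f(bo + Wo.T @ z)   # output layer values, shape (p,)
    return z, o
```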

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 23 / 59
MLP: Feed-Forward Phase

The output vector can then be computed as follows:

net_o = b_o + W_o^T z
o = f(net_o) = f( b_o + W_o^T z )

To summarize, for a given input x ∈ D with desired response y, an MLP computes the output vector via the feed-forward process, as follows:

o = f( b_o + W_o^T z ) = f( b_o + W_o^T · f( b_h + W_h^T x ) )

where o = (o1, o2, · · · , op)^T is the vector of predicted outputs from the single hidden layer MLP.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 24 / 59
MLP: Backpropagation Phase

Backpropagation is the algorithm used to learn the weights between successive layers in an MLP.
The name comes from the manner in which the error gradient is propagated
backwards from the output to input layers via the hidden layers.
For a given input pair (x, y ) in the training data, the MLP first computes the
output vector o via the feed-forward step.
Next, it computes the error in the predicted output vis-a-vis the true response y
using the squared error function
Ex = (1/2) ‖y − o‖² = (1/2) Σ_{j=1}^p (yj − oj)²

The basic idea is to examine the extent to which an output neuron, say oj ,
deviates from the corresponding target response yj , and to modify the weights wij
between each hidden neuron zi and oj as some function of the error.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 25 / 59
MLP: Backpropagation Phase

The weight update is done via a gradient descent approach to minimize the error.
Let ∇wij be the gradient of the error function with respect to wij , or simply the
weight gradient at wij .
Given the previous weight estimate wij , a new weight is computed by taking a
small step η in a direction that is opposite to the weight gradient at wij

wij = wij − η · ∇wij

In a similar manner, the bias term bj is also updated via gradient descent

bj = bj − η · ∇bj

where ∇bj is the gradient of the error function with respect to bj , which we call
the bias gradient at bj .

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 26 / 59
MLP: Backpropagation Phase
Updating Parameters Between Hidden and Output Layer

Consider the weight wij between hidden neuron zi and output neuron oj , and the
bias term bj between z0 and oj .
We compute the weight gradient at wij and bias gradient at bj , as follows:

∂Ex ∂Ex ∂netj


∇wij = = · = δj · zi
∂wij ∂netj ∂wij
∂Ex ∂Ex ∂netj
∇bj = = · = δj
∂bj ∂netj ∂bj

where we use the symbol δj to denote the partial derivative of the error with respect to the net signal at oj, which we also call the net gradient at oj.
Next, we compute δj , the net gradient at oj .

δj = ∂Ex/∂netj = (∂Ex/∂f(netj)) · (∂f(netj)/∂netj)

Note that f (netj ) = oj .


Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 27 / 59
MLP: Backpropagation Phase
Updating Parameters Between Hidden and Output Layer

Using the squared error function for the former, we have


∂Ex/∂f(netj) = ∂Ex/∂oj = ∂/∂oj { (1/2) Σ_{k=1}^p (yk − ok)² } = (oj − yj)

where we used the observation that all ok for k ≠ j are constants with respect to
oj. Since we assume a sigmoid activation function, for the latter, we have

∂f(netj)/∂netj = oj · (1 − oj)

Putting it all together, we get

δj = (oj − yj ) · oj · (1 − oj )

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 28 / 59
MLP: Backpropagation Phase
Updating Parameters Between Input and Hidden Layer

Consider wij between xi and zj , and bj between x0 and zj .

∇wij = ∂Ex/∂wij = (∂Ex/∂netj) · (∂netj/∂wij) = δj · xi
∇bj = ∂Ex/∂bj = (∂Ex/∂netj) · (∂netj/∂bj) = δj

δj at zj has to consider the error gradients that flow back from all the output
neurons to zj .
δj = ∂Ex/∂netj = Σ_{k=1}^p (∂Ex/∂netk) · (∂netk/∂zj) · (∂zj/∂netj) = (∂zj/∂netj) · Σ_{k=1}^p (∂Ex/∂netk) · (∂netk/∂zj)
   = zj · (1 − zj) · Σ_{k=1}^p δk · wjk

where ∂zj/∂netj = zj · (1 − zj), assuming a sigmoid activation function.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 29 / 59
Backpropagation of gradients from output to hidden layer
To find the net gradient at zj we have to consider the net gradients at each of the
output neurons δk but weighted by the strength of the connection wjk between zj
and ok .
That is, we compute the weighted sum of gradients Σ_{k=1}^p δk · wjk, which is used to compute δj, the net gradient at hidden neuron zj.
[Figure: Backpropagation of gradients from the output layer to hidden neuron zj. Each output neuron ok has net gradient δk, and zj receives the weighted sum Σ_{k=1}^p δk · wjk over its outgoing weights wj1, ..., wjp.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 30 / 59
MLP Training: Stochastic Gradient Descent

The MLP training takes multiple iterations over the input points. For each input
x i , the MLP computes the output vector o i via the feed-forward step. In the
backpropagation phase, we compute the error gradient vector δ o with respect to
the net at output neurons, followed by δ h for hidden neurons.
In the stochastic gradient descent step, we compute the error gradients with
respect to the weights and biases, which are used to update the weight matrices
and bias vectors.
MLP-Training (D, m, η, maxiter):
// Initialize bias vectors
1 b h ← random m-dimensional vector with small values
2 b o ← random p-dimensional vector with small values
// Initialize weight matrices
3 W h ← random d × m matrix with small values
4 W o ← random m × p matrix with small values
5 t ← 0 // iteration counter

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 31 / 59
MLP Training: Stochastic Gradient Descent
6 repeat
7 foreach (x i , y i ) ∈ D in random order do
// Feed-forward phase
8 z_i ← f( b_h + W_h^T x_i )
9 o_i ← f( b_o + W_o^T z_i )
// Backpropagation phase: net gradients
10 δ o ← o i ⊙ (1 − o i ) ⊙ (o i − y i )
11 δ h ← z i ⊙ (1 − z i ) ⊙ (W o · δ o )
// Gradient descent for bias vectors
12 ∇bo ← δ o ; b o ← b o − η · ∇bo
13 ∇bh ← δ h ; b h ← b h − η · ∇bh
// Gradient descent for weight matrices
14 ∇W_o ← z_i · δ_o^T ; W_o ← W_o − η · ∇W_o
15 ∇W_h ← x_i · δ_h^T ; W_h ← W_h − η · ∇W_h
16 t ← t +1
17 until t ≥ maxiter
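
The pseudocode above can be transcribed almost line for line into NumPy. The sketch below assumes sigmoid activations at both layers and squared error, which is what the expressions for δ_o and δ_h on lines 10–11 correspond to; the initialization scale and default parameters are illustrative.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def mlp_train(X, Y, m, eta=0.1, maxiter=1000, seed=0):
    """One-hidden-layer MLP (sigmoid units, squared error) trained by stochastic gradient descent."""
    n, d = X.shape
    p = Y.shape[1]
    rng = np.random.default_rng(seed)
    bh = rng.normal(scale=0.1, size=m)        # hidden bias vector b_h
    bo = rng.normal(scale=0.1, size=p)        # output bias vector b_o
    Wh = rng.normal(scale=0.1, size=(d, m))   # input-to-hidden weights W_h
    Wo = rng.normal(scale=0.1, size=(m, p))   # hidden-to-output weights W_o
    for t in range(maxiter):                  # t counts passes (epochs) over the data here
        for i in rng.permutation(n):
            x, y = X[i], Y[i]
            # Feed-forward phase
            z = sigmoid(bh + Wh.T @ x)
            o = sigmoid(bo + Wo.T @ z)
            # Backpropagation phase: net gradients
            do = o * (1 - o) * (o - y)        # delta_o
            dh = z * (1 - z) * (Wo @ do)      # delta_h
            # Gradient descent for biases and weights
            bo -= eta * do
            bh -= eta * dh
            Wo -= eta * np.outer(z, do)       # gradient z_i delta_o^T
            Wh -= eta * np.outer(x, dh)       # gradient x_i delta_h^T
    return Wh, bh, Wo, bo
```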

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 32 / 59
MLP with one hidden layer
Example

Evaluate an MLP with a hidden layer using a non-linear activation function to learn the sine curve.
The training data comprises n = 25 points xi sampled randomly in the range
[−10, 10], with yi = sin(xi ).
The testing data comprises 1000 points sampled uniformly from the same range.
The figure also shows the desired output curve (thin line).
We used an MLP with one input neuron (d = 1), ten hidden neurons (m = 10)
and one output neuron (p = 1). The hidden neurons use tanh activations, whereas
the output unit uses an identity activation. The step size is η = 0.005.
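
A self-contained NumPy sketch of this experiment is given below; it follows the same training loop as above but with tanh hidden units and an identity output, as described. The random seed, initialization scale, and exact number of epochs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.uniform(-10, 10, size=25)     # n = 25 random training points
y_train = np.sin(X_train)
X_test = np.linspace(-10, 10, 1000)         # 1000 uniformly spaced test points

m, eta = 10, 0.005                          # ten tanh hidden units, step size 0.005
Wh, bh = rng.normal(scale=0.1, size=(1, m)), np.zeros(m)
Wo, bo = rng.normal(scale=0.1, size=(m, 1)), np.zeros(1)

for t in range(5000):                       # the slides show snapshots up to t = 30000
    for i in rng.permutation(len(X_train)):
        x, y = np.array([X_train[i]]), np.array([y_train[i]])
        z = np.tanh(bh + Wh.T @ x)          # hidden layer (tanh)
        o = bo + Wo.T @ z                   # output layer (identity)
        do = o - y                          # net gradient at output (SSE, linear output)
        dh = (1 - z**2) * (Wo @ do)         # net gradient at hidden (tanh)
        bo -= eta * do; Wo -= eta * np.outer(z, do)
        bh -= eta * dh; Wh -= eta * np.outer(x, dh)

# Predictions on the test points
pred = (np.tanh(bh + np.outer(X_test, Wh[0])) @ Wo + bo).ravel()
```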

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 33 / 59
MLP for sine curve
t = 1

[Figure: MLP predictions on the test points (circles) and the target sine curve (thin line) at t = 1; axes X and Y.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 34 / 59
MLP for sine curve
t = 1000

[Figure: MLP predictions on the test points (circles) and the target sine curve (thin line) at t = 1000; axes X and Y.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 35 / 59
MLP for sine curve
t = 5000

[Figure: MLP predictions on the test points (circles) and the target sine curve (thin line) at t = 5000; axes X and Y.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 36 / 59
MLP for sine curve
t = 10000

[Figure: MLP predictions on the test points (circles) and the target sine curve (thin line) at t = 10000; axes X and Y.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 37 / 59
MLP for sine curve
t = 15000

[Figure: MLP predictions on the test points (circles) and the target sine curve (thin line) at t = 15000; axes X and Y.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 38 / 59
MLP for sine curve
t = 30000

We can observe that, even with a very small training set of 25 points sampled
randomly from the sine curve, the MLP is able to learn the desired function.

[Figure: MLP predictions on the test points (circles) and the target sine curve (thin line) at t = 30000; the predictions closely follow the sine curve; axes X and Y.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 39 / 59
MLP for sine curve
Test range [−20, 20]

The MLP model has not really learned the sine function; rather, it has learned to
approximate it only in the specified range [−10, 10].
[Figure: MLP predictions (circles) and the true sine curve over the extended test range X ∈ [−20, 20]; outside the training range [−10, 10] the predictions deviate from the sine curve.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 40 / 59
MLP for handwritten digit classification
Example

We evaluate an MLP with one hidden layer for the task of predicting the correct
label for a hand-written digit from the MNIST database, which contains 60,000
training images that span the 10 digit labels, from 0 to 9.
Each (grayscale) image is a 28 × 28 matrix of pixels, with values between 0 and
255. Each pixel is converted to a value in the interval [0, 1] by dividing by 255.
[Figure: Sample MNIST digit images, each shown on a 28 × 28 pixel grid.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 41 / 59
MLP for handwritten digit classification
Example

Since images are 2-dimensional matrices, we first flatten them into a vector
x ∈ R784 with dimensionality d = 28 × 28 = 784. This is done by simply
concatenating all of the rows of the images to obtain one long vector.
Next, since the output labels are categorical values that denote the digits from 0
to 9, we need to convert them into binary (numerical) vectors, using one-hot
encoding.
Thus, the label 0 is encoded as e 1 = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)T ∈ R10 , the label 1 as
e 2 = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0)T ∈ R10 , and so on, and finally the label 9 is encoded
as e 10 = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1)T ∈ R10 . That is, each input image vector x has a
corresponding target response vector y ∈ {e 1 , e 2 , · · · , e 10 }.
Thus, the input layer for the MLP has d = 784 neurons, and the output layer has
p = 10 neurons.
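
A sketch of this preprocessing, assuming the images and labels are already loaded as NumPy arrays (the variable names are illustrative):

```python
import numpy as np

def preprocess(images, labels):
    """images: (n, 28, 28) array with pixel values 0..255; labels: (n,) array of digits 0..9."""
    X = images.reshape(len(images), -1) / 255.0    # flatten to d = 784 and scale to [0, 1]
    Y = np.zeros((len(labels), 10))
    Y[np.arange(len(labels)), labels] = 1.0        # one-hot encode: digit k -> standard basis vector e_{k+1}
    return X, Y
```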

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 42 / 59
MLP for handwritten digit classification
Example

For the hidden layer, we consider several MLP models, each with a different
number of hidden neurons m. We try m = 0, 7, 10, 49, 98, 196, 392, to study the effect
of increasing the number of hidden neurons, from small to large.
For the hidden layer, we use ReLU activation function, and for the output layer,
we use softmax activation, since the target response vector has only one neuron
with value 1, with the rest being 0.
Note that m = 0 means that there is no hidden layer – the input layer is directly
connected to the output layer, which is equivalent to a multiclass logistic
regression model. We train each MLP for t = 15 epochs, using step size η = 0.25.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 43 / 59
MLP for handwritten digit classification
Example

The final test error at the end of training is given as


m 0 7 10 49 98 196 392
errors 1677 901 792 546 495 470 454

We can observe that adding a hidden layer significantly improves the prediction
accuracy. Using even a small number of hidden neurons helps, compared to the
logistic regression model (m = 0). For example, using m = 7 results in 901 errors
(or error rate 9.01%) compared to using m = 0, which results in 1677 errors (or
error rate 16.77%).
On the other hand, as we increase the number of hidden neurons, the error rate
decreases, though with diminishing returns. Using m = 196, the error rate is
4.70%, but even after doubling the number of hidden neurons (m = 392), the
error rate goes down to only 4.54%. Further increasing m does not reduce the
error rate.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 44 / 59
MNIST: Prediction error as a function of epochs
During training, we plot the number of misclassified images after each epoch, on
the separate MNIST test set comprising 10,000 images.
The figure shows the number of errors from each of the models (with a different
number of hidden neurons m), after each epoch.
[Figure: Number of test errors (y-axis) versus epochs 1–15 (x-axis) for the MLPs with m = 0, 7, 10, 49, 98, 196, 392 hidden neurons.]
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 45 / 59
Deep Multilayer Perceptron
We now generalize the feed-forward and backpropagation steps for many hidden
layers, as well as arbitrary error and neuron activation functions.
Consider an MLP with h hidden layers; the input points x_i ∈ R^d form layer l = 0, and the response vectors y_i ∈ R^p correspond to the output layer l = h + 1.
[Figure: Deep MLP with h hidden layers. Layer l = 0 is the input x = (x1, ..., xd), layers l = 1, ..., h are hidden layers z^l = (z_1^l, ..., z_{n_l}^l) with bias neurons z_0^l, and layer l = h + 1 is the output o = (o1, ..., op). Layer l is connected to layer l + 1 via the weight matrix W_l and bias vector b_l.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 46 / 59
Deep Multilayer Perceptron
Feed-forward Phase

Typically in a deep MLP, the same activation function f l is used for all neurons in
a given layer l.
The input layer always uses the identity activation, so f 0 is the identity function.
All bias neurons likewise use the identity function with a fixed value of 1.
The hidden layers typically use sigmoid, tanh, or ReLU activations.
The output layer typically uses sigmoid or softmax activations for classification
tasks, or identity activations for regression tasks.
For (x, y ) ∈ D, the deep MLP computes the output vector as:
 
o = f^{h+1}( b_h + W_h^T · z^h )
  = f^{h+1}( b_h + W_h^T · f^h( b_{h−1} + W_{h−1}^T · z^{h−1} ) )
  = · · ·
  = f^{h+1}( b_h + W_h^T · f^h( b_{h−1} + W_{h−1}^T · f^{h−1}( · · · f^2( b_1 + W_1^T · f^1( b_0 + W_0^T · x ) ) · · · ) ) )

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 47 / 59
Deep Multilayer Perceptron
Backpropagation Phase

Consider the weight update between a given layer and another, including between
the input and hidden layer, or between two hidden layers, or between the last
hidden layer and the output layer.
Let z_i^l be a neuron in layer l, and z_j^{l+1} a neuron in the next layer l + 1. Let w_ij^l be the weight between z_i^l and z_j^{l+1}, and let b_j^l denote the bias term between z_0^l and z_j^{l+1}.
The weight and bias are updated using the gradient descent approach

w_ij^l = w_ij^l − η · ∇w_ij^l          b_j^l = b_j^l − η · ∇b_j^l

where ∇w_ij^l is the weight gradient and ∇b_j^l is the bias gradient, i.e., the partial derivative of the error function with respect to the weight and bias, respectively.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 48 / 59
Deep Multilayer Perceptron
Backpropagation Phase

We can use the chain rule to write the weight and bias gradient, as follows
∇w_ij^l = ∂Ex/∂w_ij^l = (∂Ex/∂netj) · (∂netj/∂w_ij^l) = δ_j^{l+1} · z_i^l = z_i^l · δ_j^{l+1}
∇b_j^l = ∂Ex/∂b_j^l = (∂Ex/∂netj) · (∂netj/∂b_j^l) = δ_j^{l+1}

In summary, the update of the weights and biases is

W l = W l − η · ∇W l
b l = b l − η · ∇b l

where η is the step size. However, we observe that to compute the weight and
bias gradients for layer l we need to compute the net gradients δ l +1 at layer l + 1.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 49 / 59
Deep Multilayer Perceptron
Net Gradients at Output Layer

If all of the output neurons are independent (for example, when using linear or
sigmoid activations), the net gradient is obtained by differentiating the error
function with respect to the net signal at the output neurons. That is,

δ_j^{h+1} = ∂Ex/∂netj = (∂Ex/∂f^{h+1}(netj)) · (∂f^{h+1}(netj)/∂netj) = (∂Ex/∂oj) · (∂f^{h+1}(netj)/∂netj)

If the output neurons are not independent (for example, when using a softmax activation), then:

δ_j^{h+1} = ∂Ex/∂netj = Σ_{i=1}^p (∂Ex/∂f^{h+1}(neti)) · (∂f^{h+1}(neti)/∂netj)

For regression, we use the SSE with linear activation function, whereas for logistic
regression and classification, we use the cross-entropy error function with a
sigmoid activation for binary classes, and softmax activation for multiclass
problems.
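
For example (a standard derivation, spelled out here for concreteness rather than taken from the slides), combining the cross-entropy error with the softmax derivative given earlier yields a simple net gradient at the output:

```latex
\delta_j^{h+1}
= \sum_{i=1}^{p} \frac{\partial E_x}{\partial o_i}\,\frac{\partial o_i}{\partial net_j}
= -\frac{y_j}{o_j}\, o_j (1 - o_j) + \sum_{i \neq j} \left(-\frac{y_i}{o_i}\right)(-o_i\, o_j)
= -y_j + o_j \sum_{i=1}^{p} y_i
= o_j - y_j
```

since the one-hot response y satisfies Σ_i y_i = 1.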

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 50 / 59
Deep Multilayer Perceptron
Net Gradients at Hidden Layers

Let us assume that we have already computed the net gradients at layer l + 1, namely δ^{l+1}.
Since neuron z_j^l in layer l is connected to all of the neurons in layer l + 1 (except for the bias neuron z_0^{l+1}), to compute the net gradient at z_j^l, we have to account for the error from each neuron in layer l + 1, as follows:
δ_j^l = ∂Ex/∂netj = Σ_{k=1}^{n_{l+1}} (∂Ex/∂netk) · (∂netk/∂f^l(netj)) · (∂f^l(netj)/∂netj)
      = (∂f^l(netj)/∂netj) · Σ_{k=1}^{n_{l+1}} δ_k^{l+1} · w_jk^l

So the net gradient at z_j^l in layer l depends on the derivative of the activation function with respect to its netj, and the weighted sum of the net gradients from all the neurons z_k^{l+1} at the next layer l + 1.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 51 / 59
Net Gradients at Hidden Layers
For the commonly used activation functions at the hidden layer, we have

1
 for linear
l l l
∂f = z (1 − z ) for sigmoid
 l l
(1 − z ⊙ z ) for tanh

The net gradients are computed recursively, starting from the output layer h + 1,
then hidden layer h, and so on, until we finally compute the net gradients at the
first hidden layer l = 1. That is,

δ^h = ∂f^h ⊙ ( W_h · δ^{h+1} )
δ^{h−1} = ∂f^{h−1} ⊙ ( W_{h−1} · δ^h ) = ∂f^{h−1} ⊙ ( W_{h−1} · ( ∂f^h ⊙ ( W_h · δ^{h+1} ) ) )
· · ·
δ^1 = ∂f^1 ⊙ ( W_1 · ( ∂f^2 ⊙ ( W_2 · ( · · · ∂f^h ⊙ ( W_h · δ^{h+1} ) · · · ) ) ) )

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 52 / 59
Deep MLP Training: Stochastic Gradient Descent

Deep-MLP-Training (D, h, η, maxiter, n1 , n2 , · · · , nh , f 1 , f 2 , · · · , f h+1 ):


1 n0 ← d // input layer size
2 nh+1 ← p // output layer size
// Initialize weight matrices and bias vectors
3 for l = 0, 1, 2, · · · , h do
4 b_l ← random n_{l+1}-dimensional vector with small values
5 W_l ← random n_l × n_{l+1} matrix with small values
6 t ← 0 // iteration counter

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 53 / 59
Deep MLP Training: Stochastic Gradient Descent
7 repeat
8 foreach (x i , y i ) ∈ D in random order do
9 z 0 ← x i // Feed-Forward Phase
 
10 for l = 0, 1, 2, . . . , h do z^{l+1} ← f^{l+1}( b_l + W_l^T · z^l )
11 o_i ← z^{h+1}
12 if independent outputs then // Backpropagation phase
13 δ^{h+1} ← ∂f^{h+1} ⊙ ∂E_{x_i} // net gradients at output
14 else
15 δ^{h+1} ← ∂F^{h+1} · ∂E_{x_i} // net gradients at output
16 for l = h, h − 1, · · · , 1 do δ^l ← ∂f^l ⊙ ( W_l · δ^{l+1} ) // net gradients
17 for l = 0, 1, · · · , h do // Gradient Descent Step
18 ∇W_l ← z^l · (δ^{l+1})^T // weight gradient matrix at layer l
19 ∇b_l ← δ^{l+1} // bias gradient vector at layer l
20 for l = 0, 1, · · · , h do
21 W_l ← W_l − η · ∇W_l // update W_l
22 b_l ← b_l − η · ∇b_l // update b_l

23 t ← t +1
24 until t ≥ maxiter
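
A generalized NumPy transcription of this algorithm is sketched below. It assumes sigmoid units in every layer and independent sigmoid outputs with squared error, so that ∂f^l and the output net gradient take the forms given earlier; the initialization scale and default parameters are illustrative.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def deep_mlp_train(X, Y, hidden_sizes, eta=0.1, maxiter=15, seed=0):
    """Deep MLP with hidden layer sizes n_1, ..., n_h; all layers use sigmoid here."""
    n, d = X.shape
    p = Y.shape[1]
    dims = [d] + list(hidden_sizes) + [p]           # n_0 = d, ..., n_{h+1} = p
    rng = np.random.default_rng(seed)
    W = [rng.normal(scale=0.1, size=(dims[l], dims[l + 1])) for l in range(len(dims) - 1)]
    b = [rng.normal(scale=0.1, size=dims[l + 1]) for l in range(len(dims) - 1)]
    for t in range(maxiter):
        for i in rng.permutation(n):
            # Feed-forward phase: store z^0, z^1, ..., z^{h+1}
            z = [X[i]]
            for l in range(len(W)):
                z.append(sigmoid(b[l] + W[l].T @ z[l]))
            o = z[-1]
            # Backpropagation phase: net gradients delta^{h+1}, ..., delta^1
            delta = [o * (1 - o) * (o - Y[i])]      # independent sigmoid outputs, squared error
            for l in range(len(W) - 1, 0, -1):
                delta.insert(0, z[l] * (1 - z[l]) * (W[l] @ delta[0]))
            # Gradient descent step: delta[l] plays the role of delta^{l+1}
            for l in range(len(W)):
                W[l] -= eta * np.outer(z[l], delta[l])
                b[l] -= eta * delta[l]
    return W, b
```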
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 54 / 59
Vanishing or Exploding Gradients

In the vanishing gradient problem, the norm of the net gradient can decay
exponentially with the distance from the output layer, that is, as we
backpropagate the gradients from the output layer to the input layer. In this case
the network will learn extremely slowly, if at all, since the gradient descent method
will make minuscule changes to the weights and biases.
On the other hand, in the exploding gradient problem, the norm of the net
gradient can grow exponentially with the distance from the output layer. In this
case, the weights and biases will become exponentially large, resulting in a failure
to learn. The gradient explosion problem can be mitigated to some extent by
gradient thresholding, that is, by resetting the value if it exceeds an upper bound.
The vanishing gradients problem is more difficult to address. Typically sigmoid
activations are more susceptible to this problem, and one solution is to use
alternative activation functions such as ReLU. In general, recurrent neural
networks, which are deep neural networks with feedback connections, are more
prone to vanishing and exploding gradients.
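
Gradient thresholding of the kind mentioned above is commonly implemented by rescaling the gradient when its norm exceeds a bound; a minimal sketch (the threshold value is an arbitrary example):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so that its norm never exceeds max_norm (gradient thresholding)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```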

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 55 / 59
Deep MLP
Example

We now examine deep MLPs for predicting the labels for the MNIST handwritten
images dataset.
Recall that this dataset has n = 60000 grayscale images of size 28 × 28 that we
treat as d = 784 dimensional vectors. The pixel values between 0 and 255 are
converted to the range 0 and 1 by dividing each value by 255. The target
response vector is a one-hot encoded vector for class labels {0, 1, . . . , 9}.
Thus, the input to the MLP x i has dimensionality d = 784, and the output layer
has dimensionality p = 10. We use softmax activation for the output layer. We
use ReLU activation for the hidden layers, and consider several deep models with
different number and sizes of the hidden layers. We use step size η = 0.3 and train
for t = 15 epochs. Training was done using minibatches, using batch size of 1000.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 56 / 59
Deep MLP
Example

We evaluate the performance of each MLP on the MNIST test dataset that
contains 10,000 images. The final test error at the end of training is given as
hidden layers errors
n1 = 392 396
n1 = 196, n2 = 49 303
n1 = 392, n2 = 196, n3 = 49 290
n1 = 392, n2 = 196, n3 = 98, n4 = 49 278

We can observe that as we increase the number of layers, we do get performance improvements.
The deep MLP with four hidden layers of sizes n1 = 392, n2 = 196, n3 = 98, n4 = 49
results in an error rate of 2.78% on the test set, whereas the MLP with a
single hidden layer of size n1 = 392 has an error rate of 3.96%.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 57 / 59
MNIST: Deep MLPs
Prediction error as a function of epochs.

The deeper MLPs improve the prediction accuracy, but adding more layers does
not keep reducing the error rate, and can also lead to performance degradation.

[Figure: Number of test errors (y-axis) versus epochs 1–15 (x-axis) for the deep MLPs with hidden layers n1 = 392; n1 = 196, n2 = 49; n1 = 392, n2 = 196, n3 = 49; and n1 = 392, n2 = 196, n3 = 98, n4 = 49.]

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 58 / 59
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki1 Wagner Meira Jr.2

1
Department of Computer Science
Rensselaer Polytechnic Institute, Troy, NY, USA
2
Department of Computer Science
Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 25: Neural Networks

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 59 / 59
