Data Mining and Machine Learning: Fundamental Concepts and Algorithms
Chapter 25: Neural Networks

Mohammed J. Zaki (1) and Wagner Meira Jr. (2)

(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Artificial Neural Networks
Artificial Neural Networks
Artificial neural networks are composed of abstract neurons that try to mimic real
neurons at a very high level.
They can be described via a weighted directed graph G = (V , E ), with each node
vi ∈ V representing a neuron, and each directed edge (vi , vj ) ∈ E representing a
synaptic to dendritic connection from vi to vj . The weight of the edge wij denotes
the synaptic strength.
[Figure: an artificial neuron z_k. The inputs x_1, ..., x_d and the bias neuron x_0 = 1 connect to z_k with weights w_1k, ..., w_dk and bias b_k; the neuron computes Σ_{i=1}^{d} w_ik · x_i + b_k and applies an activation function to produce its output.]
Artificial Neuron
An artificial neuron acts as a processing unit that first aggregates the incoming
signals via a weighted sum, and then applies some function to generate an output.
A binary neuron outputs a 1 whenever the combined signal exceeds a threshold, and
0 otherwise.

The net input at neuron z_k is

net_k = b_k + Σ_{i=1}^{d} w_ik · x_i = b_k + w^T x

Here x_0 is a special bias neuron whose value is always fixed at 1, and the weight from
x_0 to z_k is b_k.

Finally, the output value of z_k is obtained by applying some activation function f(·)
to the net input at z_k:

z_k = f(net_k)
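As a concrete illustration, here is a minimal NumPy sketch of a single binary (step-activation) neuron; the function names are ours, not from the text.

import numpy as np

def net_input(x, w, b):
    # net_k = b_k + sum_i w_ik * x_i = b_k + w^T x
    return b + w @ x

def step(net):
    # binary neuron: fires (outputs 1) only if the net signal exceeds 0
    return 1.0 if net > 0 else 0.0

x = np.array([0.5, -1.2, 2.0])   # input signals x_1, ..., x_d
w = np.array([0.4, 0.1, 0.3])    # synaptic weights w_1k, ..., w_dk
b = -0.2                         # bias b_k (weight from the bias neuron x_0 = 1)
z = step(net_input(x, w, b))     # output z_k = f(net_k)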
Linear Activation Function
[Figure: linear (identity) activation. The output z_k = net_k = w^T x + b_k increases linearly from −∞ to +∞ and equals 0 at w^T x = −b_k.]
Step Activation Function
Function: f(net_k) = 0 if net_k ≤ 0, and 1 if net_k > 0

Derivative: ∂f(net_j)/∂net_j = 0

[Figure: step activation. The output z_k jumps from 0 to 1 at w^T x = −b_k.]
Rectified Linear Activation Function
Function: f(net_k) = 0 if net_k ≤ 0, and net_k if net_k > 0

Derivative: ∂f(net_j)/∂net_j = 0 if net_j ≤ 0, and 1 if net_j > 0

[Figure: rectified linear (ReLU) activation. The output z_k is 0 for w^T x ≤ −b_k and increases linearly thereafter.]
Sigmoid Activation Function
Function: f(net_k) = 1 / (1 + exp{−net_k})

Derivative: ∂f(net_j)/∂net_j = f(net_j) · (1 − f(net_j))

[Figure: sigmoid activation. The output z_k rises smoothly from 0 to 1, with value 0.5 at w^T x = −b_k.]
Hyperbolic Tangent Activation Function
Function: f(net_k) = tanh(net_k) = (exp{net_k} − exp{−net_k}) / (exp{net_k} + exp{−net_k})

Derivative: ∂f(net_j)/∂net_j = 1 − f(net_j)²

[Figure: hyperbolic tangent activation. The output z_k rises smoothly from −1 to +1, crossing 0 at w^T x = −b_k.]
Softmax Activation Function
Function: f(net_k | net) = exp{net_k} / Σ_{i=1}^{p} exp{net_i}

Derivative: ∂f(net_j | net)/∂net_k = ∂o_j/∂net_k = o_j · (1 − o_j) if k = j, and −o_k · o_j if k ≠ j

[Figure: softmax output z_k as a function of net_j and net_k.]
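For reference, here is a small NumPy sketch of these activation functions and their derivatives (function names are ours; np.tanh itself serves as the tanh activation):

import numpy as np

def step(net):
    # step activation: 1 if net > 0, else 0; its derivative is 0 everywhere
    return (net > 0).astype(float)

def relu(net):
    return np.maximum(0.0, net)

def d_relu(net):
    return (net > 0).astype(float)

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def d_sigmoid(net):
    f = sigmoid(net)
    return f * (1.0 - f)

def d_tanh(net):
    f = np.tanh(net)
    return 1.0 - f * f

def softmax(net):
    e = np.exp(net - np.max(net))   # shift by the max for numerical stability
    return e / e.sum()

def softmax_jacobian(net):
    # entry [k, j] holds do_j/dnet_k: o_j(1 - o_j) on the diagonal, -o_k * o_j off it
    o = softmax(net)
    return np.diag(o) - np.outer(o, o)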
Error Functions
Squared Error: Given an input vector x ∈ Rd , the squared error loss function
measures the squared deviation between the predicted output vector o ∈ Rp and
the true response y ∈ Rp , defined as follows:
E_x = (1/2) · ‖y − o‖² = (1/2) · Σ_{j=1}^{p} (y_j − o_j)²

where E_x denotes the error on input x. The partial derivative of the squared error
function with respect to a particular output neuron o_j is

∂E_x/∂o_j = (1/2) · 2 · (y_j − o_j) · (−1) = o_j − y_j
Cross-Entropy Error: For K classes, the cross-entropy error between the true response y and the predicted output o is defined as

E_x = − Σ_{i=1}^{K} y_i · ln(o_i) = −(y_1 · ln(o_1) + · · · + y_K · ln(o_K))
Error Functions
For the cross-entropy error, the partial derivative with respect to output neuron o_j is

∂E_x/∂o_j = − y_j / o_j

The vector of partial derivatives of the error function with respect to the output
neurons is therefore given as

∂E_x/∂o = (∂E_x/∂o_1, ∂E_x/∂o_2, · · · , ∂E_x/∂o_K)^T = (−y_1/o_1, −y_2/o_2, · · · , −y_K/o_K)^T
Error Functions
Binary Cross-Entropy Error: For binary classes, the positive class is 1 and the
negative class is 0.
Given x ∈ Rd , with y ∈ {0, 1}, there is only one output neuron o.
E_x = −(y · ln(o) + (1 − y) · ln(1 − o))
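A minimal NumPy sketch of the three error functions, each returning the error together with its gradient with respect to the output vector o (function names are ours):

import numpy as np

def squared_error(y, o):
    # E_x = 1/2 * ||y - o||^2 ; gradient dE/do_j = o_j - y_j
    return 0.5 * np.sum((y - o) ** 2), (o - y)

def cross_entropy(y, o):
    # multiclass: E_x = -sum_i y_i ln(o_i) ; gradient dE/do_j = -y_j / o_j
    return -np.sum(y * np.log(o)), -(y / o)

def binary_cross_entropy(y, o):
    # E_x = -(y ln(o) + (1 - y) ln(1 - o)) ; gradient dE/do = (o - y) / (o (1 - o))
    return -(y * np.log(o) + (1 - y) * np.log(1 - o)), (o - y) / (o * (1 - o))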
Linear and Logistic Regression via Neural Networks
[Figure: (left) a network with inputs x_1, ..., x_d, bias neuron x_0 = 1, weights w_1, ..., w_d, bias b, and a single output neuron o, implementing linear or logistic regression; (right) a network with p output neurons o_1, ..., o_p, weights w_11, ..., w_dp and biases b_1, ..., b_p, implementing multivariate regression.]
ANN for Multiple and Multivariate Regression
Example – Multiple Linear Regression
Consider the multiple regression of sepal length and petal length on the
dependent attribute petal width for the Iris dataset with n = 150 points.
The squared error for the optimal least-squares solution is 6.179 on the training data.

A neural network with linear activation, trained to minimize the SSE via gradient
descent, yields a comparable solution.
ANN for Multiple and Multivariate Regression
Example – Multivariate Linear Regression
We use the neural network architecture to learn the weights and bias for the Iris
dataset, using sepal length and sepal width as the independent attributes, and
petal length and petal width as the response or dependent attributes.

Therefore, each x_i is 2-dimensional, and each y_i is also 2-dimensional. Minimizing the
SSE via gradient descent yields:

b_1 = −1.83, b_2 = −1.47, w_11 = 1.72, w_12 = 0.72, w_21 = −1.46, w_22 = −0.50

that is,

o_1 = −1.83 + 1.72 · x_1 − 1.46 · x_2
o_2 = −1.47 + 0.72 · x_1 − 0.50 · x_2

The SSE on the training set is 84.9. Optimal multivariate regression yields an SSE
of 84.16:

ŷ_1 = −2.56 + 1.78 · x_1 − 1.34 · x_2
ŷ_2 = −1.59 + 0.73 · x_1 − 0.48 · x_2
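The multivariate regression network is just a single linear layer trained by gradient descent on the SSE. A sketch on synthetic data follows (the data, true weights, step size, and iteration count are illustrative; this is not the Iris setup above):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))                              # n x d independent attributes
B_true, b_true = np.array([[1.0, 0.5], [-0.5, 2.0]]), np.array([0.3, -1.0])
Y = X @ B_true + b_true + 0.1 * rng.normal(size=(150, 2))  # n x p noisy responses

W, b, eta = np.zeros((2, 2)), np.zeros(2), 0.01
for _ in range(5000):                  # (batch) gradient descent on the SSE
    G = (X @ W + b) - Y                # dE/dO = O - Y for squared error
    W -= eta * X.T @ G / len(X)        # averaged gradient for a stable step size
    b -= eta * G.mean(axis=0)

sse = 0.5 * np.sum((Y - (X @ W + b)) ** 2)   # approaches the least-squares SSE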
Classification via Neural Networks
Neural networks for multiclass logistic regression
Example
We applied the single-output neural network with logistic activation at the output
neuron and the cross-entropy error function to the Iris principal components dataset.
The output is a binary response indicating Iris-virginica (Y = 1) or one of
the other Iris types (Y = 0).
As expected, the neural network learns an identical set of weights and bias as
shown for the logistic regression model, namely:
o = −6.79 − 5.07 · x1 − 3.29 · x2
Next, we applied a softmax activation and the cross-entropy error function to the Iris
principal components data with three classes: Iris-setosa (Y = 1),
Iris-versicolor (Y = 2), and Iris-virginica (Y = 3).
We fix the weights and bias for output neuron o3 to be zero:
o1 = −3.49 + 3.61 · x1 + 2.65 · x2
o2 = −6.95 − 5.18 · x1 − 3.40 · x2
o3 = 0 + 0 · x1 + 0 · x2
Neural networks for multiclass logistic regression
Example
If we do not constrain the weights and bias for o3, we obtain the following model:
o1 = −0.89 + 4.54 · x1 + 1.96 · x2
o2 = −3.38 − 5.11 · x1 − 2.88 · x2
o3 = 4.24 + 0.52 · x1 + 0.92 · x2
Misclassified points are shown in dark gray color. Points in class c1 and c2 are
shown displaced with respect to the base class c3 only for illustration.
[Figure: the three softmax outputs π_1(x), π_2(x), π_3(x) plotted over the (X_1, X_2) principal components space, together with the points from the three Iris classes.]
Multilayer Perceptron: One Hidden Layer
Multilayer Perceptron: One Hidden Layer
[Figure: an MLP with one hidden layer. The input layer has neurons x_1, ..., x_d plus bias neuron x_0 = 1; the hidden layer has neurons z_1, ..., z_m plus bias neuron z_0 = 1; the output layer has neurons o_1, ..., o_p. Weight w_ik connects input x_i to hidden neuron z_k, weight w_kj connects z_k to output o_j, and b_k is the bias weight into z_k.]
MLP: Feed-forward Phase
Given the input neuron values, the output for each hidden neuron zk is:
z_k = f(net_k) = f( b_k + Σ_{i=1}^{d} w_ik · x_i )
where f is some activation function, and wik denotes the weight between input
neuron xi and hidden neuron zk .
Next, given the hidden neuron values, the value for each output neuron oj is:
o_j = f(net_j) = f( b_j + Σ_{i=1}^{m} w_ij · z_i )
where wij denotes the weight between hidden neuron zi and output neuron oj .
MLP: Feed-Forward Phase
In vector notation, the feed-forward phase from the input to the hidden layer is z = f(b_h + W_h^T x), and from the hidden to the output layer it is

net_o = b_o + W_o^T z

o = f(net_o) = f(b_o + W_o^T z)
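In code, the vectorized feed-forward phase for one hidden layer amounts to two matrix-vector products; a short sketch (variable and function names are ours):

import numpy as np

def feed_forward(x, W_h, b_h, W_o, b_o, f_hidden, f_output):
    # hidden layer: net_z = b_h + W_h^T x, then z = f(net_z)
    z = f_hidden(b_h + W_h.T @ x)
    # output layer: net_o = b_o + W_o^T z, then o = f(net_o)
    o = f_output(b_o + W_o.T @ z)
    return z, o

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
rng = np.random.default_rng(1)
d, m, p = 4, 3, 2
x = rng.normal(size=d)
W_h, b_h = 0.1 * rng.normal(size=(d, m)), np.zeros(m)
W_o, b_o = 0.1 * rng.normal(size=(m, p)), np.zeros(p)
z, o = feed_forward(x, W_h, b_h, W_o, b_o, sigmoid, sigmoid)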
MLP: Backpropagation Phase
The basic idea is to examine the extent to which an output neuron, say oj ,
deviates from the corresponding target response yj , and to modify the weights wij
between each hidden neuron zi and oj as some function of the error.
MLP: Backpropagation Phase
The weight update is done via a gradient descent approach to minimize the error.
Let ∇wij be the gradient of the error function with respect to wij , or simply the
weight gradient at wij .
Given the previous weight estimate w_ij, a new weight is computed by taking a
small step η in a direction that is opposite to the weight gradient at w_ij:

w_ij = w_ij − η · ∇w_ij
In a similar manner, the bias term bj is also updated via gradient descent
bj = bj − η · ∇bj
where ∇bj is the gradient of the error function with respect to bj , which we call
the bias gradient at bj .
MLP: Backpropagation Phase
Updating Parameters Between Hidden and Output Layer
Consider the weight wij between hidden neuron zi and output neuron oj , and the
bias term bj between z0 and oj .
We compute the weight gradient at w_ij and the bias gradient at b_j as follows:

∇w_ij = ∂E_x/∂w_ij = ∂E_x/∂net_j · ∂net_j/∂w_ij = δ_j · z_i

∇b_j = ∂E_x/∂b_j = ∂E_x/∂net_j · ∂net_j/∂b_j = δ_j

where we use the symbol δ_j to denote the partial derivative of the error with
respect to the net signal at o_j, which we also call the net gradient at o_j.

Next, we compute δ_j, the net gradient at o_j, via the chain rule:

δ_j = ∂E_x/∂net_j = ∂E_x/∂o_j · ∂o_j/∂net_j

where we used the observation that all o_k for k ≠ j are constants with respect to
o_j. For the squared error, ∂E_x/∂o_j = o_j − y_j. Since we assume a sigmoid
activation function, for the latter factor we have

∂f(net_j)/∂net_j = o_j · (1 − o_j)

Putting it together,

δ_j = (o_j − y_j) · o_j · (1 − o_j)
MLP: Backpropagation Phase
Updating Parameters Between Input and Hidden Layer
The net gradient δ_j at hidden neuron z_j has to consider the error gradients that flow back from all the output
neurons to z_j:

δ_j = ∂E_x/∂net_j = Σ_{k=1}^{p} ∂E_x/∂net_k · ∂net_k/∂z_j · ∂z_j/∂net_j = ∂z_j/∂net_j · Σ_{k=1}^{p} ∂E_x/∂net_k · ∂net_k/∂z_j

    = z_j · (1 − z_j) · Σ_{k=1}^{p} δ_k · w_jk

where ∂z_j/∂net_j = z_j · (1 − z_j), assuming a sigmoid activation function.
Backpropagation of gradients from output to hidden layer
To find the net gradient at z_j we have to consider the net gradients at each of the
output neurons, δ_k, but weighted by the strength of the connection w_jk between z_j
and o_k.

That is, we compute the weighted sum of gradients Σ_{k=1}^{p} δ_k · w_jk, which is used to
compute δ_j, the net gradient at hidden neuron z_j.

[Figure: backpropagation through one hidden neuron z_j. Each output neuron o_k has net gradient δ_k; the weighted sum Σ_{k=1}^{p} δ_k · w_jk flows back along the weights w_j1, ..., w_jp to z_j, whose incoming weight from input x_i is w_ij and whose bias is b_j.]
MLP Training: Stochastic Gradient Descent
The MLP training takes multiple iterations over the input points. For each input
x i , the MLP computes the output vector o i via the feed-forward step. In the
backpropagation phase, we compute the error gradient vector δ o with respect to
the net at output neurons, followed by δ h for hidden neurons.
In the stochastic gradient descent step, we compute the error gradients with
respect to the weights and biases, which are used to update the weight matrices
and bias vectors.
MLP-Training (D, m, η, maxiter):
// Initialize bias vectors
1 b h ← random m-dimensional vector with small values
2 b o ← random p-dimensional vector with small values
// Initialize weight matrices
3 W h ← random d × m matrix with small values
4 W o ← random m × p matrix with small values
5 t ← 0 // iteration counter
MLP Training: Stochastic Gradient Descent
6  repeat
7    foreach (x_i, y_i) ∈ D in random order do
       // Feed-forward phase
8      z_i ← f(b_h + W_h^T x_i)
9      o_i ← f(b_o + W_o^T z_i)
       // Backpropagation phase: net gradients
10     δ_o ← o_i ⊙ (1 − o_i) ⊙ (o_i − y_i)
11     δ_h ← z_i ⊙ (1 − z_i) ⊙ (W_o · δ_o)
       // Gradient descent for bias vectors
12     ∇b_o ← δ_o ;  b_o ← b_o − η · ∇b_o
13     ∇b_h ← δ_h ;  b_h ← b_h − η · ∇b_h
       // Gradient descent for weight matrices
14     ∇W_o ← z_i · δ_o^T ;  W_o ← W_o − η · ∇W_o
15     ∇W_h ← x_i · δ_h^T ;  W_h ← W_h − η · ∇W_h
16     t ← t + 1
17 until t ≥ maxiter
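The pseudocode above translates almost line for line into NumPy. The sketch below assumes sigmoid activations at both layers and the squared error, as in the derivation (function and variable names are ours):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def mlp_train(X, Y, m, eta, maxiter, seed=0):
    n, d = X.shape
    p = Y.shape[1]
    rng = np.random.default_rng(seed)
    b_h, b_o = 0.01 * rng.normal(size=m), 0.01 * rng.normal(size=p)            # bias vectors
    W_h, W_o = 0.01 * rng.normal(size=(d, m)), 0.01 * rng.normal(size=(m, p))  # weight matrices
    t = 0
    while t < maxiter:
        for i in rng.permutation(n):            # points in random order
            x, y = X[i], Y[i]
            # feed-forward phase
            z = sigmoid(b_h + W_h.T @ x)
            o = sigmoid(b_o + W_o.T @ z)
            # backpropagation phase: net gradients
            delta_o = o * (1 - o) * (o - y)
            delta_h = z * (1 - z) * (W_o @ delta_o)
            # gradient descent for biases and weights
            b_o -= eta * delta_o
            b_h -= eta * delta_h
            W_o -= eta * np.outer(z, delta_o)
            W_h -= eta * np.outer(x, delta_h)
            t += 1                              # t counts point updates, as in the pseudocode
    return W_h, b_h, W_o, b_o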
MLP with one hidden layer
Example
MLP for sine curve

We can observe that, even with a very small training dataset of 25 points sampled
randomly from the sine curve, the MLP is able to learn the desired function.

[Figures: the training points and the MLP's predicted outputs (Y versus X, over roughly X ∈ [−12, 12]) after t = 1, 1000, 5000, 10000, 15000, and 30000 training iterations.]
MLP for sine curve
Test range [−20, 20]
The MLP model has not really learned the sine function; rather, it has learned to
approximate it only in the specified range [−10, 10].
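An experiment of this kind is straightforward to reproduce; for example, one can sample a few noisy points from the sine curve over [−10, 10] and fit them with scikit-learn's MLPRegressor (a sketch only, not the exact configuration used for these figures; the network size, noise level, and learning rate are our own choices):

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(25, 1))            # 25 training points
y = np.sin(X).ravel() + 0.1 * rng.normal(size=25)

mlp = MLPRegressor(hidden_layer_sizes=(10,), activation='tanh',
                   solver='sgd', learning_rate_init=0.01,
                   max_iter=30000, random_state=0)
mlp.fit(X, y)

X_test = np.linspace(-20, 20, 400).reshape(-1, 1)
y_pred = mlp.predict(X_test)                      # good fit only inside [-10, 10]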
[Figure: the trained MLP's predictions over the test range X ∈ [−20, 20]; the predictions track the sine curve only within the training range [−10, 10] and deviate from it outside that range.]

MLP for handwritten digit classification
Example
We evaluate an MLP with one hidden layer for the task of predicting the correct
label for a hand-written digit from the MNIST database, which contains 60,000
training images that span the 10 digit labels, from 0 to 9.
Each (grayscale) image is a 28 × 28 matrix of pixels, with values between 0 and
255. Each pixel is converted to a value in the interval [0, 1] by dividing by 255.
[Figure: ten sample handwritten digit images from MNIST, each a 28 × 28 grid of grayscale pixels.]
MLP for handwritten digit classification
Example
Since images are 2-dimensional matrices, we first flatten them into a vector
x ∈ R784 with dimensionality d = 28 × 28 = 784. This is done by simply
concatenating all of the rows of the images to obtain one long vector.
Next, since the output labels are categorical values that denote the digits from 0
to 9, we need to convert them into binary (numerical) vectors, using one-hot
encoding.
Thus, the label 0 is encoded as e 1 = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)T ∈ R10 , the label 1 as
e 2 = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0)T ∈ R10 , and so on, and finally the label 9 is encoded
as e 10 = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1)T ∈ R10 . That is, each input image vector x has a
corresponding target response vector y ∈ {e 1 , e 2 , · · · , e 10 }.
Thus, the input layer for the MLP has d = 784 neurons, and the output layer has
p = 10 neurons.
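The flattening and one-hot encoding steps can be written in a few lines of NumPy (a sketch; images and labels stand for the raw MNIST arrays, however they are loaded):

import numpy as np

# images: (n, 28, 28) uint8 array; labels: (n,) integer array with values in {0, ..., 9}
def preprocess(images, labels):
    X = images.reshape(len(images), -1) / 255.0   # flatten to (n, 784) and scale to [0, 1]
    Y = np.zeros((len(labels), 10))
    Y[np.arange(len(labels)), labels] = 1.0       # one-hot encode: label k maps to e_{k+1}
    return X, Y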
MLP for handwritten digit classification
Example
For the hidden layer, we consider several MLP models, each with a different
number of hidden neurons m. We try m = 0, 7, 49, 98, 196, 392, to study the effect
of increasing the number of hidden neurons, from small to large.
For the hidden layer we use the ReLU activation function, and for the output layer
we use softmax activation, since the target response vector has exactly one element
with value 1 and the rest 0.
Note that m = 0 means that there is no hidden layer – the input layer is directly
connected to the output layer, which is equivalent to a multiclass logistic
regression model. We train each MLP for t = 15 epochs, using step size η = 0.25.
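One way to run a comparable experiment is with scikit-learn's MLPClassifier, which pairs a softmax output with the cross-entropy loss for multiclass targets (a sketch of the setup only; it will not reproduce the reported error counts exactly, and the m = 0 case is modeled as plain logistic regression):

from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

# m = 0 (no hidden layer) corresponds to multiclass logistic regression
models = {0: LogisticRegression(max_iter=100)}
for m in (7, 49, 98, 196, 392):
    models[m] = MLPClassifier(hidden_layer_sizes=(m,), activation='relu',
                              solver='sgd', learning_rate_init=0.25,
                              max_iter=15, random_state=0)

# X_train, y_train, X_test, y_test: preprocessed MNIST splits (assumed available)
# for m, clf in models.items():
#     clf.fit(X_train, y_train)
#     errors = (clf.predict(X_test) != y_test).sum()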
MLP for handwritten digit classification
Example
We can observe that adding a hidden layer significantly improves the prediction
accuracy. Using even a small number of hidden neurons helps, compared to the
logistic regression model (m = 0). For example, using m = 7 results in 901 errors
(or error rate 9.01%) compared to using m = 0, which results in 1677 errors (or
error rate 16.77%).
On the other hand, as we increase the number of hidden neurons, the error rate
decreases, though with diminishing returns. Using m = 196, the error rate is
4.70%, but even after doubling the number of hidden neurons (m = 392), the
error rate goes down to only 4.54%. Further increasing m does not reduce the
error rate.
MNIST: Prediction error as a function of epochs
During training, we plot the number of misclassified images after each epoch, on
the separate MNIST test set comprising 10,000 images.
Figure shows the number of errors from each of the models (with a different
number of hidden neurons m), after each epoch.
[Figure: number of test errors versus training epoch (1 to 15) for MLPs with m = 0, 7, 10, 49, 98, 196, and 392 hidden neurons; the error counts range from roughly 3,000 down to about 500.]
Deep Multilayer Perceptron
We now generalize the feed-forward and backpropagation steps for many hidden
layers, as well as arbitrary error and neuron activation functions.
Consider an MLP with h hidden layers, trained on n input points x_i ∈ R^d (forming layer
l = 0) with response vectors y_i ∈ R^p (predicted at layer l = h + 1).
[Figure: a deep MLP with input layer x (l = 0), hidden layers z_1, z_2, ..., z_h (l = 1, ..., h), and output layer o (l = h + 1); layer l is connected to layer l + 1 via the weight matrix W_l and bias vector b_l.]
Deep Multilayer Perceptron
Feed-forward Phase
Typically in a deep MLP, the same activation function f l is used for all neurons in
a given layer l.
The input layer always uses the identity activation, so f 0 is the identity function.
Also, all bias neurons also use the identity function with a fixed value of 1.
The hidden layers typically use sigmoid, tanh, or ReLU activations.
The output layer typically uses sigmoid or softmax activations for classification
tasks, or identity activations for regression tasks.
For (x, y ) ∈ D, the deep MLP computes the output vector as:
o = f h+1 b h + W hT · z h
= f h+1 b h + W hT · f h b h−1 + W hT−1 · z h−1
..
= .
!
h+1
=f b h + W hT ·f h
b h−1 + W hT−1 ·f h−1
···f 2
b 1 + W T1 ·f 1
b 0 + W T0 ·x
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 25: Neural Networks 47 / 59
Deep Multilayer Perceptron
Backpropagation Phase
Consider the weight update between a given layer and another, including between
the input and hidden layer, or between two hidden layers, or between the last
hidden layer and the output layer.
Let z_i^l be a neuron in layer l, and z_j^{l+1} a neuron in the next layer l + 1. Let w_ij^l be
the weight between z_i^l and z_j^{l+1}, and let b_j^l denote the bias term between z_0^l and
z_j^{l+1}.

The weight and bias are updated using the gradient descent approach

w_ij^l = w_ij^l − η · ∇w_ij^l
b_j^l = b_j^l − η · ∇b_j^l

where ∇w_ij^l is the weight gradient and ∇b_j^l is the bias gradient, i.e., the partial
derivative of the error function with respect to the weight and bias, respectively.
Deep Multilayer Perceptron
Backpropagation Phase
We can use the chain rule to write the weight and bias gradients as follows:

∇w_ij^l = ∂E_x/∂w_ij^l = ∂E_x/∂net_j · ∂net_j/∂w_ij^l = δ_j^{l+1} · z_i^l = z_i^l · δ_j^{l+1}

∇b_j^l = ∂E_x/∂b_j^l = ∂E_x/∂net_j · ∂net_j/∂b_j^l = δ_j^{l+1}

In matrix form, the updates are

W_l = W_l − η · ∇W_l
b_l = b_l − η · ∇b_l

where η is the step size. However, we observe that to compute the weight and
bias gradients for layer l we need to compute the net gradients δ^{l+1} at layer l + 1.
Deep Multilayer Perceptron
Net Gradients at Output Layer
If all of the output neurons are independent (for example, when using linear or
sigmoid activations), the net gradient is obtained by differentiating the error
function with respect to the net signal at each output neuron. That is,

δ_j^{h+1} = ∂E_x/∂net_j = ∂f^{h+1}(net_j)/∂net_j · ∂E_x/∂f^{h+1}(net_j)

If the output neurons are not independent (for example, when using a softmax
activation), then:

δ_j^{h+1} = ∂E_x/∂net_j = Σ_{i=1}^{p} ∂E_x/∂f^{h+1}(net_i) · ∂f^{h+1}(net_i)/∂net_j
For regression, we use the SSE with linear activation function, whereas for logistic
regression and classification, we use the cross-entropy error function with a
sigmoid activation for binary classes, and softmax activation for multiclass
problems.
Deep Multilayer Perceptron
Net Gradients at Hidden Layers
Let us assume that we have already computed the net gradients at layer l + 1,
namely δ^{l+1}.

Since neuron z_j^l in layer l is connected to all of the neurons in layer l + 1 (except
for the bias neuron z_0^{l+1}), to compute the net gradient at z_j^l we have to account
for the error from each neuron in layer l + 1, as follows:

δ_j^l = ∂E_x/∂net_j = Σ_{k=1}^{n_{l+1}} ∂E_x/∂net_k · ∂net_k/∂f^l(net_j) · ∂f^l(net_j)/∂net_j

      = ∂f^l(net_j)/∂net_j · Σ_{k=1}^{n_{l+1}} δ_k^{l+1} · w_jk^l

So the net gradient at z_j^l in layer l depends on the derivative of the activation
function with respect to its net input net_j, and on the weighted sum of the net gradients
from all the neurons z_k^{l+1} at the next layer l + 1.
Net Gradients at Hidden Layers
For the commonly used activation functions at the hidden layer, we have
∂f^l = 1                  for linear
∂f^l = z^l ⊙ (1 − z^l)    for sigmoid
∂f^l = 1 − z^l ⊙ z^l      for tanh

The net gradients are computed recursively, starting from the output layer h + 1,
then hidden layer h, and so on, until we finally compute the net gradients at the
first hidden layer l = 1. That is,

δ^h = ∂f^h ⊙ (W_h · δ^{h+1})

δ^{h−1} = ∂f^{h−1} ⊙ (W_{h−1} · δ^h) = ∂f^{h−1} ⊙ (W_{h−1} · (∂f^h ⊙ (W_h · δ^{h+1})))

⋮

δ^1 = ∂f^1 ⊙ (W_1 · (∂f^2 ⊙ (W_2 · ( · · · ∂f^h ⊙ (W_h · δ^{h+1}) · · · ))))
Deep MLP Training: Stochastic Gradient Descent
// Initialization (lines 1–6): for l = 0, 1, ..., h, set b_l and W_l to small random values; t ← 0
7  repeat
8    foreach (x_i, y_i) ∈ D in random order do
       // Feed-forward phase
9      z_0 ← x_i
10     for l = 0, 1, 2, . . . , h do z_{l+1} ← f^{l+1}(b_l + W_l^T · z_l)
11     o_i ← z_{h+1}
       // Backpropagation phase
12     if independent outputs then
13       δ^{h+1} ← ∂f^{h+1} ⊙ ∂E_{x_i} // net gradients at output
14     else
15       δ^{h+1} ← ∂F^{h+1} · ∂E_{x_i} // net gradients at output
16     for l = h, h − 1, · · · , 1 do δ^l ← ∂f^l ⊙ (W_l · δ^{l+1}) // net gradients
       // Gradient descent step
17     for l = 0, 1, · · · , h do
18       ∇W_l ← z_l · (δ^{l+1})^T // weight gradient matrix at layer l
19       ∇b_l ← δ^{l+1} // bias gradient vector at layer l
20     for l = 0, 1, · · · , h do
21       W_l ← W_l − η · ∇W_l // update W_l
22       b_l ← b_l − η · ∇b_l // update b_l
23     t ← t + 1
24 until t ≥ maxiter
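The deep training algorithm generalizes the earlier one-hidden-layer sketch by looping over layers. Below is a NumPy version assuming sigmoid activations at every layer, independent outputs, and squared error (a sketch; names are ours):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def deep_mlp_train(X, Y, sizes, eta, maxiter, seed=0):
    # sizes = [n_1, ..., n_h]: hidden layer widths; layer 0 is the input, h + 1 the output
    rng = np.random.default_rng(seed)
    dims = [X.shape[1]] + list(sizes) + [Y.shape[1]]
    W = [0.01 * rng.normal(size=(dims[l], dims[l + 1])) for l in range(len(dims) - 1)]
    b = [0.01 * rng.normal(size=dims[l + 1]) for l in range(len(dims) - 1)]
    h = len(sizes)
    t = 0
    while t < maxiter:
        for i in rng.permutation(len(X)):
            # feed-forward phase: z[0] = x, z[l+1] = f(b_l + W_l^T z[l])
            z = [X[i]]
            for l in range(h + 1):
                z.append(sigmoid(b[l] + W[l].T @ z[l]))
            o = z[h + 1]
            # backpropagation phase: net gradients, from the output layer backwards
            delta = [None] * (h + 2)
            delta[h + 1] = o * (1 - o) * (o - Y[i])
            for l in range(h, 0, -1):
                delta[l] = z[l] * (1 - z[l]) * (W[l] @ delta[l + 1])
            # gradient descent step for every layer
            for l in range(h + 1):
                W[l] -= eta * np.outer(z[l], delta[l + 1])
                b[l] -= eta * delta[l + 1]
            t += 1
    return W, b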
Vanishing or Exploding Gradients
In the vanishing gradient problem, the norm of the net gradient can decay
exponentially with the distance from the output layer, that is, as we
backpropagate the gradients from the output layer to the input layer. In this case
the network will learn extremely slowly, if at all, since the gradient descent method
will make minuscule changes to the weights and biases.
On the other hand, in the exploding gradient problem, the norm of the net
gradient can grow exponentially with the distance from the output layer. In this
case, the weights and biases will become exponentially large, resulting in a failure
to learn. The gradient explosion problem can be mitigated to some extent by
gradient thresholding, that is, by resetting the value if it exceeds an upper bound.
The vanishing gradients problem is more difficult to address. Typically sigmoid
activations are more susceptible to this problem, and one solution is to use
alternative activation functions such as ReLU. In general, recurrent neural
networks, which are deep neural networks with feedback connections, are more
prone to vanishing and exploding gradients.
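Gradient thresholding as mentioned above can be as simple as capping the norm of each gradient before the update; a minimal sketch (the threshold value is an arbitrary choice):

import numpy as np

def clip_gradient(grad, max_norm=5.0):
    # rescale the gradient if its norm exceeds max_norm, leaving its direction unchanged
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# usage inside a gradient descent step:
# W_l -= eta * clip_gradient(grad_W_l)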
Deep MLP
Example
We now examine deep MLPs for predicting the labels for the MNIST handwritten
images dataset.
Recall that this dataset has n = 60000 grayscale images of size 28 × 28 that we
treat as d = 784 dimensional vectors. The pixel values between 0 and 255 are
converted to the range 0 and 1 by dividing each value by 255. The target
response vector is a one-hot encoded vector for class labels {0, 1, . . . , 9}.
Thus, the input to the MLP x i has dimensionality d = 784, and the output layer
has dimensionality p = 10. We use softmax activation for the output layer. We
use ReLU activation for the hidden layers, and consider several deep models with
different numbers and sizes of hidden layers. We use step size η = 0.3 and train
for t = 15 epochs. Training was done using minibatches of size 1000.
Deep MLP
Example
We evaluate the performance of each MLP on the MNIST test dataset that
contains 10,000 images. The final test error at the end of training is given as:

hidden layers                                   errors
n_1 = 392                                       396
n_1 = 196, n_2 = 49                             303
n_1 = 392, n_2 = 196, n_3 = 49                  290
n_1 = 392, n_2 = 196, n_3 = 98, n_4 = 49        278
MNIST: Deep MLPs
Prediction error as a function of epochs.
The deeper MLPs improve the prediction accuracy, but adding more layers does
not necessarily reduce the error rate further, and can also lead to performance degradation.
[Figure: number of test errors versus training epoch (1 to 15) for the four deep MLP configurations listed above (n_1 = 392; n_1 = 196, n_2 = 49; n_1 = 392, n_2 = 196, n_3 = 49; n_1 = 392, n_2 = 196, n_3 = 98, n_4 = 49); the error counts range from roughly 5,000 down to about 1,000.]
Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki (1) and Wagner Meira Jr. (2)

(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil