Artificial Neural Networks

Biointelligence Laboratory
Department of Computer Engineering
Seoul National University
Contents

 Introduction

 Perceptron and Gradient Descent Algorithm

 Multilayer Neural Networks

 Designing an ANN for Face Recognition Application

Introduction
The Brain vs. Computer

Brain                        Computer
1. 10 billion neurons        1. Faster than a neuron (10^-9 sec; cf. neuron: 10^-3 sec)
2. 60 trillion synapses
3. Distributed processing    3. Central processing
4. Nonlinear processing      4. Arithmetic operation (linearity)
5. Parallel processing       5. Sequential processing
From Biological Neuron to
Artificial Neuron

[Figure: biological neuron with dendrite, cell body, and axon labeled]

From Biology to
Artificial Neural Networks
Properties of Artificial Neural Networks
 A network of artificial neurons

 Characteristics
 Nonlinear I/O mapping
 Adaptivity
 Generalization ability
 Fault-tolerance (graceful
degradation)
 Biological analogy

<Multilayer Perceptron Network>

Types of ANNs

 Single Layer Perceptron

 Multilayer Perceptrons (MLPs)
 Radial-Basis Function Networks (RBFs)
 Hopfield Network
 Boltzmann Machine
 Self-Organization Map (SOM)
 Modular Networks (Committee Machines)
Architectures of Networks

<Multilayer Perceptron Network> <Hopfield Network>

Features of Artificial Neural Networks

 Records (examples) need to be represented as a (possibly large) set of <attribute, value> tuples.
 The output values can be represented as a discrete value, a real value, or a vector of values.
 Tolerant of noise in the input data
 Time factor
 Training takes a long time.
 Once trained, an ANN produces output values (predictions) quickly.
 It is hard for humans to interpret how an ANN arrives at a prediction.
Example of Applications
 NETtalk [Sejnowski]
 Inputs: English text
 Output: Spoken phonemes

 Phoneme recognition [Waibel]

 Inputs: wave form features
 Outputs: b, c, d,…

 Robot control [Pomerleau]

 Inputs: perceived features
 Outputs: steering control
Application:
Autonomous Land Vehicle (ALV)
 NN learns to steer an autonomous vehicle.
 960 input units, 4 hidden units, 30 output units
 Driving at speeds up to 70 miles per hour

ALVINN System
[Figure: image from a forward-mounted camera, and weight values for one of the hidden units]
Application:
Error Correction by a Hopfield Network
[Figure: corrupted input data, original target data, and corrected data after 10, 20, and 35 iterations (fully corrected after 35 iterations)]
Perceptron
and
Gradient Descent Algorithm
Architecture of A Perceptron

 Input: a vector of real values

 Output: 1 or -1 (binary)
 Activation function: threshold function
NOTE: A perceptron is also called a TLU (Threshold Logic Unit).
Hypothesis Space of Perceptrons

 Free parameters: weights (and thresholds)

 Learning: choosing values for the weights

 Hypothesis space of perceptron learning
 $H = \{\vec{w} \mid \vec{w} \in \mathbb{R}^{n+1}\}$
 n: dimension of the input vector
 Linear function
 $f(\vec{x}) = w_0 + w_1 x_1 + \cdots + w_n x_n$
Perceptron and Decision Hyperplane
 A perceptron represents a ‘hyperplane’ decision surface in
the n-dimensional space of instances (i.e. points).
 The perceptron outputs 1 for instances lying on one side
of the hyperplane and outputs -1 for instances lying on the
other side.
 Equation for the decision hyperplane: $\vec{w} \cdot \vec{x} = 0$.
 Some sets of positive and negative examples cannot be separated by any hyperplane.
 A perceptron cannot learn a linearly nonseparable problem.
Linearly Separable vs. Linearly Nonseparable

(a) Decision surface for a linearly separable set of examples

(correctly classified by a straight line)
(b) A set of training examples that is not linearly separable.
Representational Power of Perceptrons

 A single perceptron can be used to represent many boolean

functions.
 AND function: w0 = -0.8, w1 = w2 = 0.5
 OR function: w0 = -0.3, w1 = w2 = 0.5

 Perceptrons can represent all of the primitive boolean functions

AND, OR, NAND, and NOR.
 Note: Some boolean functions cannot be represented by a single
perceptron (e.g. XOR). Why not?

 Every boolean function can be represented by some network of

perceptrons only two levels deep. How?
 One way is to represent the boolean function in DNF form (OR of ANDs).
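The weights quoted above can be checked directly. The following is a minimal sketch, assuming Boolean inputs are encoded as 0 (false) and 1 (true); the slides do not state the encoding, and the hidden-unit weights used for XOR are illustrative choices, not values from the source.

```python
def perceptron(weights, inputs):
    """Threshold unit: output 1 if w0 + sum(wi * xi) > 0, otherwise -1."""
    w0, rest = weights[0], weights[1:]
    net = w0 + sum(w * x for w, x in zip(rest, inputs))
    return 1 if net > 0 else -1

AND_W = (-0.8, 0.5, 0.5)  # weights quoted on the slide
OR_W = (-0.3, 0.5, 0.5)   # weights quoted on the slide

def as_bit(output):
    """Map a perceptron output (+1 / -1) back to the assumed 1 / 0 input encoding."""
    return 1 if output == 1 else 0

def xor_two_level(x1, x2):
    """XOR in DNF: (x1 AND NOT x2) OR (NOT x1 AND x2), two levels deep."""
    h1 = as_bit(perceptron((-0.3, 0.5, -0.5), (x1, x2)))  # x1 AND NOT x2 (illustrative weights)
    h2 = as_bit(perceptron((-0.3, -0.5, 0.5), (x1, x2)))  # NOT x1 AND x2 (illustrative weights)
    return perceptron(OR_W, (h1, h2))
```

No single choice of (w0, w1, w2) reproduces the XOR column, because its positive and negative examples are not linearly separable; the two-level network sidesteps this by first computing the two AND terms.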
Perceptron Training Rule

 Note: the output value o is +1 or -1 (not a real value)

 Perceptron rule: a learning rule for a threshold unit.
 Conditions for convergence
 Training examples are linearly separable.
 Learning rate is sufficiently small.
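The perceptron rule is commonly written as w_i ← w_i + η (t − o) x_i, applied after each example. A minimal sketch under that assumption follows; the function names, the 0/1 input encoding, and the default learning rate are illustrative, not taken from the slides.

```python
def predict(w, x):
    """Thresholded output: +1 if w0 + sum(wi * xi) > 0, otherwise -1."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1

def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until an epoch makes no mistakes."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)               # w[0] plays the role of the threshold weight w0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(w, x)
            if o != t:
                mistakes += 1
                w[0] += eta * (t - o)  # x0 is fixed at 1
                for i, xi in enumerate(x):
                    w[i + 1] += eta * (t - o) * xi
        if mistakes == 0:              # every training example is classified correctly
            break
    return w
```

On linearly separable data such as the AND function the loop stops after a handful of epochs; on nonseparable data it simply runs until `max_epochs`.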
Least Mean Square (LMS) Error

 Note: output value o is a real value (not binary)

 Delta rule: learning rule for an unthresholded perceptron
(i.e. linear unit).
 Delta rule is a gradient-descent rule.
 Also known as the Widrow-Hoff rule
Gradient Descent Method
Delta Rule for Error Minimization
$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}$

$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}$
Gradient Descent Algorithm for
Perceptron Learning
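The algorithm for this slide did not survive extraction. As a stand-in, here is a minimal sketch of batch gradient descent for an unthresholded linear unit using the delta rule Δw_i = η Σ_d (t_d − o_d) x_id; the function name, learning rate, and epoch count are assumptions chosen for illustration.

```python
def gradient_descent(examples, eta=0.05, epochs=500):
    """Batch delta rule for a linear unit o = w0 + w1*x1 + ... + wn*xn."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        delta = [0.0] * (n + 1)            # accumulated weight changes for this epoch
        for x, t in examples:
            xs = (1.0,) + tuple(x)         # x0 = 1 so that w0 acts as the bias
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i, xi in enumerate(xs):
                delta[i] += eta * (t - o) * xi
        w = [wi + di for wi, di in zip(w, delta)]  # update once per pass over D
    return w
```

For targets generated by t = 1 + 2x on x = 0, 1, 2, 3, the returned weights approach w0 = 1 and w1 = 2. A learning rate that is too large for the data makes the same loop diverge.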
Properties of Gradient Descent
 Because the error surface of a linear unit contains only a single global minimum, the gradient descent algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable.
 Condition: a sufficiently small learning rate
 If the learning rate is too large, the gradient descent search may overstep the minimum in the error surface.
 A solution: gradually reduce the learning rate value.
Conditions for Gradient Descent
 Gradient descent is an important general strategy for
searching through a large or infinite hypothesis space.

 Conditions for gradient descent search

 The hypothesis space contains continuously parameterized
hypotheses (e.g., the weights in a linear unit).
 The error can be differentiated w.r.t. these hypothesis parameters.
Difficulties with Gradient Descent

 Converging to a local minimum can sometimes be quite

slow (many thousands of gradient descent steps).

 If there are multiple local minima in the error surface, then

there is no guarantee that the procedure will find the global
minimum.
Perceptron Rule vs. Delta Rule
 Perceptron rule
 Thresholded output
 Converges after a finite number of iterations to a hypothesis that
perfectly classifies the training data, provided the training
examples are linearly separable.
 Can deal with only linearly separable data

 Delta rule
 Unthresholded output
 Converges only asymptotically toward the error minimum,
possibly requiring unbounded time, but converges regardless of
whether the training data are linearly separable.
 Can deal with linearly nonseparable data
Multilayer Perceptron
Multilayer Network and
Its Decision Boundaries

 Decision regions of a multilayer feedforward network.

 The network was trained to recognize 1 of 10 vowel sounds occurring
in the context “h_d”
 The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound.
 The 10 network outputs correspond to the 10 possible vowel sounds.
Differentiable Threshold Unit

 Sigmoid function: nonlinear, differentiable
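A short sketch of the sigmoid σ(y) = 1 / (1 + e^(−y)) and its derivative σ(y)(1 − σ(y)), which is the property that makes the unit convenient for gradient descent. The branch on the sign of y is an implementation detail added here to avoid overflow for large negative inputs.

```python
import math

def sigmoid(y):
    """Logistic function 1 / (1 + exp(-y)), written to avoid overflow."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)            # safe: y < 0, so exp(y) <= 1
    return e / (1.0 + e)

def sigmoid_derivative(y):
    """d sigma / dy = sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)
```

Near y = 0 the function is close to the straight line 0.5 + y/4, since the derivative at 0 is 0.25.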

Backpropagation (BP) Algorithm
 BP learns the weights for a multilayer network, given a
network with a fixed set of units and interconnections.

 BP employs gradient descent to attempt to minimize the

squared error between the network output values and the
target values for these outputs.

 Two stage learning

 forward stage: calculate outputs given input pattern x.
 backward stage: update weights by calculating delta.
Error Function for BP

$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$

 E defined as a sum of the squared errors over all the

output units k for all the training examples d.

 The error surface can have multiple local minima.
 BP is guaranteed to converge only toward some local minimum.
 There is no guarantee that it reaches the global minimum.
Backpropagation Algorithm for MLP
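The algorithm listing for this slide was an image and is not recoverable. The sketch below is one possible stochastic-gradient version for a network with a single hidden layer of sigmoid units, following the two stages described earlier (forward pass, then delta computation and weight update). All names, the initial weight range of ±0.05, and the default hyperparameters are assumptions made for illustration.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, w_hidden, w_out):
    """Forward stage: return the biased input, biased hidden outputs, and outputs."""
    xb = [1.0] + list(x)                       # x0 = 1 carries the bias weight
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = [1.0] + h
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return xb, hb, o

def backprop_step(x, t, w_hidden, w_out, eta):
    """Backward stage: compute deltas, then update all weights in place."""
    xb, hb, o = forward(x, w_hidden, w_out)
    d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # hb[j + 1] is the output of hidden unit j; w_out[k][j + 1] connects it to output k
    d_hid = [hb[j + 1] * (1 - hb[j + 1]) *
             sum(d_out[k] * w_out[k][j + 1] for k in range(len(w_out)))
             for j in range(len(w_hidden))]
    for k, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += eta * d_out[k] * hb[i]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * d_hid[j] * xb[i]

def total_error(data, w_hidden, w_out):
    """Half the sum of squared errors over all examples and output units."""
    return 0.5 * sum((tk - ok) ** 2
                     for x, t in data
                     for tk, ok in zip(t, forward(x, w_hidden, w_out)[2]))

def train(data, n_hidden=3, eta=0.5, epochs=2000, seed=0):
    """Initialize small random weights (assumed range) and run per-example updates."""
    rng = random.Random(seed)
    n_in, n_out = len(data[0][0]), len(data[0][1])
    w_hidden = [[rng.uniform(-0.05, 0.05) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in data:
            backprop_step(x, t, w_hidden, w_out, eta)
    return w_hidden, w_out
```

Calling `train` on a small dataset, such as the four input/target pairs of a Boolean function, and comparing `total_error` before and after training is a quick sanity check that the updates descend the error surface. Note that the hidden deltas are computed before any weight is changed, so they use the same weights as the forward pass.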
Termination Conditions for BP

 The weight update loop may be iterated thousands of times

in a typical application.
 The choice of termination condition is important because
 Too few iterations can fail to reduce error sufficiently.
 Too many iterations can lead to overfitting the training data.

 Termination Criteria
 After a fixed number of iterations (epochs)
 Once the error falls below some threshold
 Once the validation error meets some criterion
Adding Momentum

 Original weight update rule for BP: $\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji}$

 Adding momentum $\alpha$:
$\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji} + \alpha\, \Delta w_{ji}(n-1), \quad 0 \le \alpha < 1$
 Helps escape small local minima in the error surface.
 Speeds up convergence.
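A minimal sketch of the momentum update, written in terms of the gradient (for BP, −η ∂E/∂w_ji equals η δ_j x_ji). The function name and default values are illustrative assumptions.

```python
def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """Return (new_w, delta), where delta(n) = -eta * grad + alpha * delta(n - 1)."""
    delta = [-eta * g + alpha * d for g, d in zip(grad, prev_delta)]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta
```

With a gradient that keeps the same sign over successive steps, the step size grows from one iteration to the next, which is the source of the speed-up; with `alpha=0` the function reduces to the plain gradient step.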
Derivation of the BP Rule

 Notations
 x_ji : the ith input to unit j
 w_ji : the weight associated with the ith input to unit j
 net_j : the weighted sum of inputs for unit j
 o_j : the output computed by unit j
 t_j : the target output for unit j
 σ : the sigmoid function
 outputs : the set of units in the final layer of the network
 Downstream(j) : the set of units whose immediate inputs include the output of unit j
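Derivation of the BP Rule
 Error measure: $E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$

 Gradient descent: $\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}$

 Chain rule: $\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_{ji}$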
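Case 1: Rule for Output Unit Weights

 Step 1: $\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j}$, where $net_j = \sum_i w_{ji} x_{ji}$

 Step 2: $\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = -(t_j - o_j)$

 Step 3: $\frac{\partial o_j}{\partial net_j} = \frac{\partial \sigma(net_j)}{\partial net_j} = o_j (1 - o_j)$

 All together: $\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, x_{ji}$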
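Case 2: Rule for Hidden Unit Weights
 Step 1 (writing $\delta_k = -\frac{\partial E_d}{\partial net_k}$):
$\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j}$
$= \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j}$
$= \sum_{k \in Downstream(j)} -\delta_k\, w_{kj} \frac{\partial o_j}{\partial net_j}$
$= \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, o_j (1 - o_j)$

 Thus: $\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$, where $\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}$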
Backpropagation for MLP: revisited
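The figure for this slide is not recoverable. One way to revisit the rule is to check it numerically: the sketch below computes ∂E_d/∂w_ji = −δ_j x_ji from the output-unit and hidden-unit delta formulas and compares it with a central finite difference of E_d. The tiny bias-free network, the function names, and the step size `eps` are assumptions made for illustration.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, w_hid, w_out):
    """One hidden layer of sigmoid units feeding sigmoid output units (no bias terms)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    o = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_out]
    return h, o

def error(x, t, w_hid, w_out):
    """E_d = 1/2 * sum over output units of (t_k - o_k)^2."""
    _, o = forward(x, w_hid, w_out)
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))

def analytic_gradients(x, t, w_hid, w_out):
    """Gradients from the delta formulas: dE_d/dw_ji = -delta_j * x_ji."""
    h, o = forward(x, w_hid, w_out)
    d_out = [(tk - ok) * ok * (1 - ok) for tk, ok in zip(t, o)]
    d_hid = [hj * (1 - hj) * sum(d_out[k] * w_out[k][j] for k in range(len(o)))
             for j, hj in enumerate(h)]
    g_out = [[-dk * hj for hj in h] for dk in d_out]
    g_hid = [[-dj * xi for xi in x] for dj in d_hid]
    return g_hid, g_out

def numeric_gradient(x, t, w_hid, w_out, layer, j, i, eps=1e-6):
    """Central difference of E_d with respect to one weight; restores the weight afterwards."""
    weights = w_hid if layer == "hidden" else w_out
    original = weights[j][i]
    weights[j][i] = original + eps
    e_plus = error(x, t, w_hid, w_out)
    weights[j][i] = original - eps
    e_minus = error(x, t, w_hid, w_out)
    weights[j][i] = original
    return (e_plus - e_minus) / (2 * eps)
```

Agreement between the two gradients to several decimal places is a common way to catch sign or index mistakes in a hand-written backward pass.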
Convergence and Local Minima
 The error surface for multilayer networks may contain many different local minima.
 BP is guaranteed to converge only to a local minimum.
 BP is a highly effective function approximator in practice.
 The local minima problem has been found not to be severe in many applications.

 Notes
 Gradient descent over the complex error surfaces represented by ANNs is still poorly understood.
 No method is known for predicting with certainty when local minima will cause difficulties.
 Only heuristics are available for avoiding local minima.
Heuristics for Alleviating the Local Minima Problem
 Add a momentum term to the weight-update rule.
 Use stochastic gradient descent rather than true gradient descent.
 This descends a different error surface for each training example.
 Train multiple networks using the same data, but initialize each network with different random weights.
 Select the best network w.r.t. the validation set, or
 Make a committee of networks.
Why Does BP Work in Practice?
A Possible Scenario

 Weights are initialized to values near zero.

 Early gradient descent steps will represent a very smooth
function (approximately linear). Why?
 The sigmoid function is almost linear when the total input
(weighted sum of inputs to a sigmoid unit) is near 0.
 The weights gradually move close to the global minimum.
 As weights grow in the later stage of learning, they
represent highly nonlinear network functions.
 Gradient steps in this later stage move toward local
minima in this region, which is acceptable.
Representational Power of MLP
 Every boolean function can be represented exactly by
some network with two layers of units. How?
 Note: The number of hidden units required may grow
exponentially with the number of network inputs.

 Every bounded continuous function can be approximated

with arbitrarily small error by a network of two layers of
units.
 Sigmoid hidden units, linear output units
 How many hidden units?
NNs as Universal Function
Approximators
 Any function can be approximated to arbitrary accuracy
by a network with three layers of units (Cybenko 1988).
 Sigmoid units at two hidden layers
 Linear units at the output layer
 Any function can be approximated by a linear combination of
many localized functions having 0 everywhere except for some
small region.
 Two layers of sigmoid units are sufficient to produce good
approximations.

 Every bounded continuous function can be approximated, with arbitrarily small error, by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
BP Compared with CE & ID3
 For BP, every possible assignment of network weights
represents a syntactically distinct hypothesis.
 The hypothesis space is the n-dimensional Euclidean space of the
n network weights.

 Hypothesis space is continuous

 The hypothesis space of CE and ID3 is discrete.

 Differentiable
 Provides a useful structure for gradient search.
 This structure is quite different from the general-to-specific
ordering in CE, or the simple-to-complex ordering in ID3 or C4.5.
CE: candidate-elimination algorithm in ‘concept learning’ (T.M. Mitchell)
ID3: a learning scheme of ‘Decision Tree’ for discrete values (R. Quinlan)
C4.5: an improved scheme of ID3 for ‘real values’ (R. Quinlan)
Hidden Layer Representations
 BP has an ability to discover useful intermediate
representations at the hidden unit layers inside the
networks which capture properties of the input spaces that
are most relevant to learning the target function.

 When more layers of units are used in the network, more

complex features can be invented.

 But the representations of the hidden layers are very hard for humans to understand.
Hidden Layer Representation for Identity
Function

 The evolving sum of squared errors for each of the eight output units as the number of training iterations (epochs) increases
Hidden Layer Representation for Identity
Function

 The evolving hidden layer representation for the

 input string “01000000”
Hidden Layer Representation for Identity
Function

 The evolving weights for one of the three hidden units

Generalization and Overfitting

 Continuing training until the training error falls below some predetermined threshold is a poor strategy, since BP is susceptible to overfitting.
 Generalization accuracy needs to be measured over a validation set (distinct from the training set).

 Two different types of overfitting
 Generalization error first decreases, then increases, even as the training error continues to decrease.
 Generalization error decreases, then increases, then decreases again, while the training error continues to decrease.
Two Kinds of Overfitting Phenomena
Techniques for Overcoming the
Overfitting Problem
 Weight decay
 Decrease each weight by some small factor during each iteration.
 This is equivalent to modifying the definition of E to include a
penalty term corresponding to the total magnitude of the network
weights.
 The motivation for the approach is to keep weight values small, to
bias learning against complex decision surfaces.

 k-fold cross-validation
 Cross validation is performed k different times, each time using a
different partitioning of the data into training and validation sets
 The results are averaged over the k cross-validation runs.
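The partitioning step of k-fold cross-validation can be sketched as follows. The function names are hypothetical, and `evaluate` stands for any user-supplied routine that trains on the first index list and returns a score on the second; shuffling the examples beforehand is usually advisable but is left out here.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if f < n_examples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

def cross_validate(n_examples, k, evaluate):
    """Average the score returned by evaluate(train_idx, val_idx) over the k folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_splits(n_examples, k)]
    return sum(scores) / len(scores)
```

Each example appears in exactly one validation fold, so every example is used for validation once and for training k − 1 times.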
Designing an Artificial Neural
Network for Face Recognition
Application
Problem Definition

 Possible learning tasks

 Classifying camera images of faces of people in various poses.
 Direction, Identity, Gender, ...
 Data:
 624 grayscale images for 20 different people
 Approximately 32 images per person, varying
 person’s expression (happy, sad, angry, neutral)
 direction (left, right, straight ahead, up)
 with and without sunglasses
 resolution of images: 120 x 128, each pixel with a grayscale intensity between 0 (black) and 255 (white)

 Task: Learning the direction in which the person is facing.

Factors for ANN Design in the Face
Recognition Task

 Input encoding

 Output encoding

 Network graph structure

 Other learning algorithm parameters

Input Coding for Face Recognition
 Possible solutions
 Extract key features using preprocessing
 Coarse resolution
 Feature extraction
 Edges, regions of uniform intensity, other local image features
 Drawback: high preprocessing cost, variable number of features
 Coarse resolution
 Encode the image as a fixed set of 30 x 32 pixel intensity values, with one network input per pixel.
 The 30 x 32 pixel image is a coarse-resolution summary of the original 120 x 128 pixel image.
 Coarse resolution reduces the number of inputs and weights to a much more manageable size, thereby reducing computational demands.
Output Coding for Face Recognition
 Possible coding schemes
 Using one output unit with multiple threshold values
 Using multiple output units with single threshold value.
 One unit scheme
 Assign 0.2, 0.4, 0.6, 0.8 to encode four-way classification.
 Multiple units scheme (1-of-n output encoding)
 Use four distinct output units
 Each unit represents one of the four possible face directions, with
highest-valued output taken as the network prediction
Output Coding for Face Recognition
 Advantages of the 1-of-n output encoding scheme
 It provides more degrees of freedom to the network for representing the target function.
 The difference between the highest-valued output and the second-highest can be used as a measure of the confidence in the network prediction.
 Target values for the output units in the 1-of-n encoding scheme
 < 1, 0, 0, 0 > vs. < 0.9, 0.1, 0.1, 0.1 >
 < 1, 0, 0, 0 >: sigmoid units cannot produce exactly 0 or 1 with finite weights, so these targets force the weights to grow without bound.
 < 0.9, 0.1, 0.1, 0.1 >: these values are achievable by sigmoid units with finite weights.
Network Structure for Face Recognition
 One hidden layer vs. more hidden layers
 How many hidden nodes are used?
 Using 3 hidden units:
 Test accuracy for the face data = 90%
 Training time = 5 min on a Sun Sparc 5
 Using 30 hidden units:
 Test accuracy for the face data = 91.5%
 Training time = 1 hour on a Sun Sparc 5
Other Parameters for Face Recognition
 Learning rate η = 0.3
 Momentum α = 0.3
 Weight initialization: small random values near 0
 Number of iterations: cross-validation
 After every 50 iterations, the performance of the network was evaluated over the validation set.
 The final selected network is the one with the highest accuracy over the validation set.
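The selection procedure above can be sketched generically. In this hypothetical helper, `train_step` performs one training iteration on the weights in place and `validation_accuracy` scores a set of weights on the validation set; both are placeholders supplied by the caller, not functions from the source.

```python
import copy

def train_with_validation(weights, train_step, validation_accuracy,
                          total_iterations, check_every=50):
    """Train in place, but return a copy of the weights that scored best on validation."""
    best_weights = copy.deepcopy(weights)
    best_accuracy = validation_accuracy(weights)
    for iteration in range(1, total_iterations + 1):
        train_step(weights)
        if iteration % check_every == 0:      # evaluate every `check_every` iterations
            accuracy = validation_accuracy(weights)
            if accuracy > best_accuracy:
                best_weights = copy.deepcopy(weights)
                best_accuracy = accuracy
    return best_weights, best_accuracy
```

Keeping a copy of the best weights, rather than the final ones, means that later iterations that overfit the training data do not affect the selected network.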
ANN for Face Recognition

A 960 x 3 x 4 network is trained on gray-level images of faces to predict whether a person is looking to their left, right, ahead, or up.
