Unit 2: Artificial Neural Networks (ANNs)
History of Artificial Neural Networks
Appropriate Problems for Neural Network Learning
• Instances are represented by many attribute-value pairs (e.g., the pixels of an image, as in ALVINN [Mitchell, p. 84]).
• The target function output may be discrete-valued, real-valued, or a
vector of several real- or discrete-valued attributes.
• The training examples may contain errors.
• Long training times are acceptable.
• Fast evaluation of the learned target function may be required.
• The ability for humans to understand the learned target function is
not important.
Why Artificial Neural Networks?
Artificial Neural Networks
• The “building blocks” of neural networks are the
neurons.
• In technical systems, we also refer to them as units or nodes.
• Basically, each neuron
  • receives input from many other neurons,
  • changes its internal state (activation) based on the current input,
  • sends one output signal to many other neurons, possibly including its input neurons (recurrent network).
How do ANNs work?
An artificial neural network (ANN) is either a hardware
implementation or a computer program which strives to
simulate the information processing capabilities of its biological
exemplar. ANNs are typically composed of a great number of
interconnected artificial neurons. The artificial neurons are
simplified models of their biological counterparts.
An ANN is a technique for solving problems by constructing software
that works like our brains.
Consider humans
• Neuron - the basic computing unit
• Brain is a highly organized structure of networks of interconnected neurons
Properties of artificial neural networks (ANNs)
Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed processing
Emphasis on tuning weights automatically
When to consider neural networks
Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
Output is discrete or real-valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of result is unimportant
Application Examples
Speech phoneme recognition
Image Classification
Financial prediction
Handwriting recognition
Computer vision
Advantages of ANNs
• Distributed representation
• Representation and processing of fuzziness
• Highly parallel and distributed action
• Speed and fault-tolerance
How do our brains work?
The brain is a massively parallel information-processing system.
Our brains are a huge network of processing elements. A typical brain contains a
network of 10 billion neurons.
How do our brains work?
Biological Neuron - A processing element
Dendrites: Input
Soma/Cell body: Processor
Synapse: Link
Axon: Output
How do our brains work?
dendrites: nerve fibres carrying electrical signals to the cell
cell body: computes a non-linear function of its inputs
axon: single long fiber that carries the electrical signal from the cell body
to other neurons
synapse: the point of contact between the axon of one cell and the dendrite
of another, regulating a chemical connection whose strength affects the
input to the cell.
How do our brains work?
A neuron is connected to other neurons through about 10,000 synapses
A neuron receives input from other neurons. Inputs are combined
Once input exceeds a critical level, the neuron discharges a spike ‐ an
electrical pulse that travels from the body, down the axon, to the next
neuron(s)
The axon endings almost touch the dendrites or cell body of the next
neuron.
Transmission of an electrical signal from one neuron to the next is effected
by neurotransmitters.
Neurotransmitters are chemicals which are released from the first neuron
and which bind to the second neuron.
This link is called a synapse. The strength of the signal that reaches the
next neuron depends on factors such as the amount of neurotransmitter
available.
How do ANNs work?
Dendrites correspond to the inputs of an artificial neuron, and the axon to its output.
Artificial Neuron
Perceptron
R = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = w_0 + \sum_{i=1}^{n} w_i x_i

o = \operatorname{sign}(R) = \begin{cases} 1 & \text{if } R > 0 \\ -1 & \text{otherwise} \end{cases}
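A minimal Python sketch of this computation; the weights, inputs, and threshold weight w0 below are made-up illustrative values, not from the notes.

# Minimal sketch of a perceptron's output: weighted sum plus threshold
# weight w0, then the sign function. Values are illustrative only.
def perceptron_output(weights, x, w0):
    R = w0 + sum(w_i * x_i for w_i, x_i in zip(weights, x))
    return 1 if R > 0 else -1

print(perceptron_output([0.5, -0.6], [1.0, 0.4], w0=0.1))   # -> 1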
Perceptron – Decision Surface
• Perceptron can be regarded as representing a hyperplane decision
surface in the n-dimensional feature space of instances.
Perceptron – Decision Surface
• In 2-dimensional space, the decision surface is a line.
Perceptron – Representation Power
• The Decision Surface is linear
1. Set the weights to small random values, e.g., in the range (-1, 1).
2. Compute the output for a training example:
   R = w_0 + \sum_{i=1}^{n} w_i x_i, \qquad o = \operatorname{sign}(R) = \begin{cases} 1 & \text{if } R > 0 \\ -1 & \text{otherwise} \end{cases}
3. Update the weights:
   w_i \leftarrow w_i + \eta (d - o) x_i, \quad i = 1, 2, \ldots, n
   where d is the desired (target) output and \eta is the learning rate.
Perceptron – Training Algorithm
Perceptron training
Weights Adjusting
After each iteration, weights should be adjusted to
minimize the error.
Perceptron – Training Rule
Perceptron Learning algorithm
While epoch produces an error
    Present network with next inputs from epoch
    Error = T – O
    If Error <> 0 Then
        Wj = Wj + η * xj * Error
    End If
End While
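A runnable Python sketch of this learning loop; the learning rate, epoch limit, and the 0/1 output convention are assumptions chosen to match the AND example used later in these notes.

# Sketch of the perceptron learning rule above, trained on AND data.
import random

def train_perceptron(examples, eta=0.1, max_epochs=100):
    # examples: list of (inputs, target); a fixed -1 bias input is prepended
    n = len(examples[0][0]) + 1
    w = [random.uniform(-1, 1) for _ in range(n)]
    for _ in range(max_epochs):
        epoch_errors = 0
        for x, t in examples:
            x = [-1] + list(x)                                   # bias input
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            error = t - o                                        # Error = T - O
            if error != 0:
                w = [wi + eta * xi * error for wi, xi in zip(w, x)]
                epoch_errors += 1
        if epoch_errors == 0:                                    # epoch with no errors
            break
    return w

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(and_data))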
Error Estimation
The root mean square error (RMSE) is a frequently used measure of the
differences between the values predicted by a model or an estimator and
the values actually observed from the thing being modeled or estimated.
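As a formula, for n observed values y_i and predictions \hat{y}_i:

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}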
Perceptron – Training Rule
Training Perceptrons
For AND: inputs x and y, plus a bias input fixed at -1, with weights W1 (bias), W2, and W3 still unknown, and threshold t = 0.0.

A  B  Output
0  0  0
0  1  0
1  0  0
1  1  1
Training Perceptrons
For AND, with initial weights W1 = 0.3, W2 = 0.5, W3 = -0.4 (bias input = -1, threshold t = 0.0):

I1  I2  I3  Summation                                   Output
-1  0   0   (-1*0.3) + (0*0.5) + (0*-0.4) = -0.3        0
-1  0   1   (-1*0.3) + (0*0.5) + (1*-0.4) = -0.7        0
-1  1   0   (-1*0.3) + (1*0.5) + (0*-0.4) =  0.2        1
-1  1   1   (-1*0.3) + (1*0.5) + (1*-0.4) = -0.2        0

With these initial weights the outputs of the last two rows do not match the AND targets, so the weights must be adjusted by training.
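A quick Python check of the summation column above, using the weights and threshold from the table.

# Recompute the summation table for W1 = 0.3, W2 = 0.5, W3 = -0.4, t = 0.0.
weights = [0.3, 0.5, -0.4]
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    inputs = [-1, a, b]                      # -1 is the fixed bias input
    s = sum(w * i for w, i in zip(weights, inputs))
    output = 1 if s > 0.0 else 0
    print(inputs, round(s, 2), output)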
Learning in Neural Networks
• Learn values of weights from I/O pairs
• Start with random weights
• Load training example’s input
• Observe computed output
• Modify weights to reduce difference
• Iterate over all training examples
• Terminate when weights stop changing OR when error is very small
Gradient Descent algorithm and its variants
Gradient Descent is an optimization algorithm used for minimizing
the cost function in various machine learning algorithms. It is
basically used for updating the parameters of the learning model.
The Learning Rate
• If the learning rate is too high, we might OVERSHOOT the minima and keep bouncing without ever reaching it.
• If the learning rate is too small, training may take far too long.
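A minimal Python sketch of gradient descent and the effect of the learning rate; the one-parameter quadratic cost and the learning-rate values are illustrative choices, not from the notes.

# Gradient descent on a single parameter of an assumed cost (w - 3)^2.
def gradient_descent(grad, w0, eta, steps=50):
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)        # step against the gradient
    return w

cost_grad = lambda w: 2 * (w - 3)    # gradient of (w - 3)^2, minimum at w = 3
print(gradient_descent(cost_grad, w0=0.0, eta=0.1))   # small rate: converges near 3
print(gradient_descent(cost_grad, w0=0.0, eta=1.1))   # too large: overshoots and diverges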
[Figure: model of an artificial neuron, with inputs x1 … xm, weights w1 … wm, a weighted sum ∑, a transfer (activation) function f(vk), and output y.]
The output is a function of the inputs, affected by the weights and the transfer (activation) function.
ACTIVATION FUNCTION
• A function that transforms the values or states the conditions for the
decision of the output neuron is known as an activation function.
• What does an artificial neuron do? Simply, it calculates a “weighted
sum” of its input, adds a bias and then decides whether it should be
“fired” or not.
• So consider a neuron that computes Y = Σ (weight × input) + bias.
ACTIVATION FUNCTION
• The value of Y can be anything ranging from -inf to +inf. The neuron by itself doesn't know the bounds of the value. So how do we decide whether the neuron should fire or not? (Why this firing pattern? Because biology tells us that is the way the brain works, and the brain is a working testimony of an awesome and intelligent system.)
• We add "activation functions" for this purpose: to check the Y value produced by a neuron and decide whether outside connections should consider this neuron as "fired" or not, or rather, "activated" or not.
ACTIVATION FUNCTION
• If we do not apply an activation function, the output signal would simply be a linear function, i.e., a polynomial of degree one.
• Linear equations are easy to solve, but they are limited in their complexity and have less power to learn complex functional mappings from data.
• A Neural Network without an activation function would simply be a Linear Regression Model, which has limited power and does not perform well most of the time (see the sketch after this list).
• We want our Neural Network to not just learn and compute a linear function but something more complicated than that.
• Also, without activation functions our Neural Network would not be able to learn and model other complicated kinds of data such as images, videos, audio, and speech. That is why we use artificial neural network techniques such as deep learning, with many hidden layers and complicated architectures, to make sense of and extract knowledge from complicated, high-dimensional, non-linear, big datasets.
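A small Python sketch of the point made above: without a non-linear activation, stacking layers collapses into a single linear map. The matrix shapes and random values are illustrative.

# Two linear layers with no activation are equivalent to one linear layer.
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
x = rng.normal(size=4)

two_linear_layers = (x @ W1) @ W2       # "deep" network without activations
one_linear_layer = x @ (W1 @ W2)        # a single equivalent linear layer

print(np.allclose(two_linear_layers, one_linear_layer))   # True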
ACTIVATION FUNCTION
• Activation functions are really important for an Artificial Neural Network to learn and make sense of something really complicated: the non-linear, complex functional mappings between the inputs and the response variable. They introduce non-linear properties into our network.
• Their main purpose is to convert the input signal of a node in an A-NN to an output signal. That output signal is then used as an input in the next layer in the stack.
WHY DO WE NEED NON-LINEARITIES?
• Non-linear functions have degree greater than one, and they produce a curve when plotted. We need a Neural Network Model to learn and represent almost anything: any arbitrary complex function which maps inputs to outputs. Neural Networks are considered Universal Function Approximators, meaning they can compute and learn (approximate) practically any function. Almost any process we can think of can be represented as a functional computation in Neural Networks.
• DRAWBACK: Suppose you are creating a binary classifier, something which should say a "yes" or "no" (activate or not activate). A step function could do that for you: it outputs exactly a 1 or a 0. Now think about the use case where you want multiple such neurons connected to bring in more classes: class 1, class 2, class 3, etc. What will happen if more than one neuron is "activated"? All of them output a 1 (from the step function). Now which class is it? Hard to decide.
* https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-a9a5310cc8f
TYPES OF ACTIVATION FUNCTIONS
Linear function
• A = cx
• A straight-line function where the activation is proportional to the input (which is the weighted sum from the neuron).
• This way it gives a range of activations, so it is not a binary activation. We can definitely connect a few neurons together and, if more than one fires, take the max and decide based on that. So that is OK too. Then what is the problem with this?
• For A = cx, the derivative with respect to x is c. That means the gradient has no relationship with x: it is a constant gradient, and the descent proceeds along a constant gradient. If there is an error in prediction, the changes made by backpropagation are constant and do not depend on the change in input.
TYPES OF ACTIVATION FUNCTIONS
Sigmoid function
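The sigmoid here is the logistic function A = 1 / (1 + e^{-x}); the slide's figure is not reproduced, so this is a minimal Python sketch of the function and its derivative (the function names are illustrative).

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))      # squashes any real x into (0, 1)

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)                      # convenient form used by backpropagation

print(sigmoid(0.0), sigmoid(5.0), sigmoid(-5.0))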
• Feedforward networks
  • Single-layer feedforward network
  • Multilayer feedforward network
• Feedback networks
  • Recurrent networks
  • Fully recurrent network
  • Jordan network
Different Network Topologies
• Single layer feed-forward networks
• Input layer projecting into the output layer
[Diagram: single-layer network, with the input layer connected directly to the output layer.]
Different Network Topologies
• Multi-layer feed-forward networks
• One or more hidden layers. Input projects only from previous layers onto a
layer.
[Diagram: 2-layer (1-hidden-layer) fully connected network, with an input layer, a hidden layer, and an output layer.]
Multi-layer networks of sigmoid units
Error Gradient for a sigmoid unit
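The derivation itself is not reproduced in these notes; for a single sigmoid unit with net input net = \sum_i w_i x_i, output o = \sigma(net) = 1/(1 + e^{-net}), and squared error E = \tfrac{1}{2}\sum_{d \in D}(t_d - o_d)^2, the standard result is:

\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d\, (1 - o_d)\, x_{i,d}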
Backpropagation Algorithm
We're going to use a neural network with two inputs, two hidden neurons, and two output neurons.
Additionally, the hidden and output neurons will each include a bias.
In order to have some numbers to work with, the example starts from specific initial weights, biases, and training inputs/outputs.
The goal of backpropagation is to optimize the weights so that the neural network can learn how to
correctly map arbitrary inputs to outputs.
We’re going to work with a single training set: given inputs 0.05 and 0.10, we want the neural network
to output 0.01 and 0.99.
We figure out the total net input to each hidden layer neuron, squash the total net input using
an activation function (here we use the logistic function), then repeat the process with the output layer
neurons.
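Concretely, for one hidden neuron (call it h1, with input weights w_1, w_2 and bias b_1; the labels are illustrative, since the network diagram is not reproduced here), the forward pass is:

net_{h1} = w_1 i_1 + w_2 i_2 + b_1, \qquad out_{h1} = \frac{1}{1 + e^{-net_{h1}}}

The output neurons are computed the same way, with out_{h1} and out_{h2} taking the place of the inputs.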
The Backwards Pass
Our goal with backpropagation is to update each of the weights in the network so that they cause
the actual output to be closer to the target output, thereby minimizing the error for each output
neuron and the network as a whole.
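A minimal Python sketch of one forward and one backward pass for the 2-2-2 network described above (inputs 0.05, 0.10; targets 0.01, 0.99). The initial weights, biases, and learning rate below are illustrative assumptions, since the original diagram with the numbers is not reproduced in these notes.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Training example from the text
i1, i2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
eta = 0.5                                  # learning rate (assumed)

# Assumed initial weights and biases
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30    # input -> hidden
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55    # hidden -> output
b1, b2 = 0.35, 0.60                        # hidden bias, output bias

# ---- Forward pass ----
net_h1 = w1 * i1 + w2 * i2 + b1
net_h2 = w3 * i1 + w4 * i2 + b1
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)

net_o1 = w5 * out_h1 + w6 * out_h2 + b2
net_o2 = w7 * out_h1 + w8 * out_h2 + b2
out_o1, out_o2 = sigmoid(net_o1), sigmoid(net_o2)

# Total squared error: E = sum of 1/2 (target - output)^2
E = 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2

# ---- Backward pass ----
# Output-unit deltas: dE/dnet = -(t - out) * out * (1 - out)
delta_o1 = -(t1 - out_o1) * out_o1 * (1 - out_o1)
delta_o2 = -(t2 - out_o2) * out_o2 * (1 - out_o2)

# Hidden-unit deltas: weighted sum of downstream deltas times the
# derivative of the hidden unit's sigmoid
delta_h1 = (delta_o1 * w5 + delta_o2 * w7) * out_h1 * (1 - out_h1)
delta_h2 = (delta_o1 * w6 + delta_o2 * w8) * out_h2 * (1 - out_h2)

# ---- Weight updates: w <- w - eta * dE/dw ----
w5 -= eta * delta_o1 * out_h1
w6 -= eta * delta_o1 * out_h2
w7 -= eta * delta_o2 * out_h1
w8 -= eta * delta_o2 * out_h2

w1 -= eta * delta_h1 * i1
w2 -= eta * delta_h1 * i2
w3 -= eta * delta_h2 * i1
w4 -= eta * delta_h2 * i2

print("error before this update:", E)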
Backpropagation Algorithm
Hidden Layer Representations
One intriguing property of BACKPROPAGATION is its ability to discover useful intermediate
representations at the hidden unit layers inside the network
Here, the eight network inputs are connected to three hidden units, which are in turn connected to the eight output units.
Because of this structure, the three hidden units will be forced to re-represent the eight input values in some way that
captures their relevant features, so that this hidden layer representation can be used by the output units to compute the
correct target values.
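A small Python sketch of the 8 x 3 x 8 identity-mapping experiment described above: eight one-hot inputs are mapped back to themselves through three hidden units. The layer sizes and the identity task come from the text; the use of numpy, the weight-initialization range, learning rate, and epoch count are assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = np.eye(8)                             # the eight one-hot inputs (= targets)
W1 = rng.uniform(-0.1, 0.1, (8, 3))       # input -> hidden weights
b1 = np.zeros(3)
W2 = rng.uniform(-0.1, 0.1, (3, 8))       # hidden -> output weights
b2 = np.zeros(8)
eta = 0.3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    H = sigmoid(X @ W1 + b1)                      # hidden layer representation
    O = sigmoid(H @ W2 + b2)                      # network outputs
    delta_o = (O - X) * O * (1 - O)               # output-layer deltas
    delta_h = (delta_o @ W2.T) * H * (1 - H)      # hidden-layer deltas
    W2 -= eta * H.T @ delta_o; b2 -= eta * delta_o.sum(axis=0)
    W1 -= eta * X.T @ delta_h; b1 -= eta * delta_h.sum(axis=0)

print(np.round(sigmoid(X @ W1 + b1), 2))          # learned hidden representation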
This ability of multilayer networks to automatically discover useful representations at the hidden layers is
a key feature of ANN learning.
We can directly observe the effect of BACKPROPAGATION'S gradient descent search by plotting the
squared output error as a function of the number of gradient descent search steps
Learning the 8 x 3 x 8 Network. This plot shows the evolving sum of squared errors for each of the eight
output units, as the number of training iterations (epochs) increases.
Learning the 8 x 3 x 8 Network. This plot shows the evolving hidden layer representation for the
input string "01000000".
Learning the 8 x 3 x 8 Network. This bottom plot shows the evolving weights for one of the three
hidden units.
Convergence of Backpropagation
Expressive Capabilities of ANNs
Overfitting ANN
AN ILLUSTRATIVE EXAMPLE:
FACE RECOGNITION
The learning task here involves classifying camera images of faces of
various people in various poses. Images of 20 different people were
collected, including approximately 32 images per person, varying the
person's expression (happy, sad, angry, neutral), the direction in which
they were looking (left, right, straight ahead, up), and whether or not
they were wearing sunglasses. As can be seen from the example images in
the figure, there is also variation in the background behind the person, the
clothing worn by the person, and the position of the person's face within
the image. In total, 624 greyscale images were collected, each with a
resolution of 120 x 128, with each image pixel described by a greyscale
intensity value between 0 (black) and 255 (white).
Target function: learning the direction in which the person is facing (to
their left, right, straight ahead, or upward).
Learning an artificial neural network to recognize face pose. Here a 960 x 3 x
4 network is trained on grey-level images of faces (see top), to predict whether
a person is looking to their left, right, ahead, or up. After training on 260 such
images, the network achieves an accuracy of 90% over a separate test set. The
learned network weights are shown after one weight-tuning iteration through
the training examples and after 100 iterations. Each output unit (left, straight,
right, up) has four weights, shown by dark (negative) and light (positive)
blocks. The leftmost block corresponds to the weight w0, which determines
the unit threshold, and the three blocks to the right correspond to weights on
inputs from the three hidden units. The weights from the image pixels into
each hidden unit are also shown, with each weight plotted in the position of
the corresponding image pixel.
Design Choice
• Input encoding
• Output encoding
• Network graph structure
• Other learning algorithm parameters
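A Python sketch of plausible input and output encodings for the 960 x 3 x 4 face-pose network above. The 30x32 downsampling (960 inputs) and the 0.1/0.9 target values follow the common treatment of this task (Mitchell, ch. 4); treat the exact numbers as assumptions rather than something stated in these notes.

import numpy as np

def encode_image(pixels_120x128: np.ndarray) -> np.ndarray:
    """Downsample a 120x128 greyscale image to 30x32 by mean-pooling 4x4
    blocks, then scale intensities from 0..255 into 0..1 (960 inputs)."""
    img = pixels_120x128.reshape(30, 4, 32, 4).mean(axis=(1, 3))
    return (img / 255.0).ravel()              # shape (960,)

def encode_pose(pose: str) -> np.ndarray:
    """1-of-4 output encoding with 0.1/0.9 targets instead of hard 0/1,
    which keeps the sigmoid outputs away from their asymptotes."""
    poses = ["left", "straight", "right", "up"]
    target = np.full(4, 0.1)
    target[poses.index(pose)] = 0.9
    return target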
ADVANCED TOPICS IN ARTIFICIAL NEURAL NETWORKS
Alternative Error Functions
Adding a term for errors in the slope, or derivative of the target function
Alternative Error Functions
Minimizing the cross entropy of the network with respect to the target values
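The cross-entropy error referred to here has the standard form, for training examples d with target values t_d and network outputs o_d:

E = -\sum_{d \in D} \Big[\, t_d \log o_d + (1 - t_d) \log (1 - o_d) \,\Big]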
Recurrent Networks
A Recurrent Neural Network (RNN) is a class of Artificial Neural Network in which the connections
between nodes form a directed graph, giving the network temporal dynamic behaviour. RNNs are
derived from feedforward networks and are used to model sequential data, delivering predictive
results in a way loosely analogous to how the brain processes sequences.
They support a form of directed cycles in the network
We have added a new unit b to the hidden layer, and a new input unit c(t). The input value c(t) to
the network at one time step is simply copied from the value of unit b on the previous time step.
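A minimal Python sketch of this recurrent wiring, where the context input c(t) is a copy of unit b's value from the previous time step. The weight values and the use of a single hidden unit are illustrative assumptions.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# assumed weights: regular input x(t), context input c(t), and bias
w_x, w_c, w_0 = 0.8, 0.5, -0.2

def run(sequence):
    b_prev = 0.0                      # value of unit b at time t-1
    outputs = []
    for x_t in sequence:
        c_t = b_prev                  # c(t) is copied from b(t-1)
        b_t = sigmoid(w_0 + w_x * x_t + w_c * c_t)
        outputs.append(b_t)
        b_prev = b_t
    return outputs

print(run([1.0, 0.0, 1.0, 1.0]))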
Following are some applications of RNNs:
1. Machine Translation
2. Speech Recognition
3. Sentiment Analysis
4. Automatic Image Tagger
Dynamically Modifying Network Structure
Up to this point we have considered neural network learning as a problem of adjusting weights within
a fixed graph structure. A variety of methods have been proposed to dynamically grow or shrink the
number of network units and interconnections in an attempt to improve generalization accuracy and
training efficiency.
One idea is to begin with a network containing no hidden units, then grow the network as needed by
adding hidden units until the training error is reduced to some acceptable level.
The CASCADE-CORRELATION algorithm can be used for dynamically modifying the network structure in this way.
A second idea for dynamically altering network structure is to take the opposite approach. Instead of
beginning with the simplest possible network and adding complexity, we begin with a complex
network and prune it as we find that certain connections are inessential.
One way to decide whether a particular weight is inessential is to see whether its value is close to
zero. A second way, which appears to be more successful in practice, is to consider the effect that a
small variation in the weight has on the error E.