Lecture 2: Basics and Definitions: Networks As Data Models
The output of neuron i with inputs x_1, ..., x_m is

y_i = f( Σ_{j=1}^{m} w_ij x_j + b_i )

where w_ij are the weights, b_i is the bias and f is the activation function.
As the inputs and outputs are external, the parameters of this model are the weights, biases and activation function, and it is these parameters that DEFINE the model.
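As a minimal sketch in Python (the values and the choice of tanh as activation are purely illustrative), the output of a single neuron can be computed as:

```python
import numpy as np

def neuron_output(x, w, b, f=np.tanh):
    """y_i = f(sum_j w_ij * x_j + b_i) for one neuron with activation f."""
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x_1..x_m
w = np.array([0.1, 0.4, -0.2])   # weights w_i1..w_im for this neuron
b = 0.3                          # bias b_i
print(neuron_output(x, w, b))    # the neuron's output y_i
```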
More layers mean more of the same parameters (and more subscripts).
[Figure: multilayer network with inputs x_1 ... x_n (visual input), hidden layers, and outputs (motor output)]
The type of learning used depends on the task at hand. We will deal mainly with supervised and unsupervised learning. Reinforcement learning will be taught in the Adaptive Systems course, or can be found in e.g. Haykin, Hertz et al., or:
Sutton, R.S., and Barto, A.G. (1998): Reinforcement Learning: An Introduction. MIT Press.
Pattern recognition
Pattern: the opposite of chaos; an entity, vaguely defined, that could be given a name or a classification.
Examples: fingerprints, handwritten characters, human faces, speech (or deer/whale/bat etc.) signals, iris patterns, medical imaging (various screening procedures), remote sensing, etc.
Given a pattern, there are two kinds of classification:
a. supervised classification (discriminant analysis), in which the input pattern is identified as a member of a predefined class
b. unsupervised classification (e.g. clustering), in which the pattern is assigned to a hitherto unknown class.
[Figure: example images of the handwritten characters 'a' and 'b']
First we need a data set to learn from: sets of characters. How are they represented? E.g. as an input vector x = (x1, ..., xn) to the network (e.g. a vector of ones and zeroes, one per pixel, according to whether that pixel is black or white). The set of input vectors is our training set X, which has already been classified into a's and b's. (Note the conventions: capitals for the set, X; underlined small letters for an instance of the set, x_i, the ith training pattern/vector.) Given a training set X, our goal is to tell whether a new image is an a or a b, i.e. classify it into one of 2 classes C1 (all a's) or C2 (all b's) (in general, into one of k classes C1 ... Ck).
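As a rough illustration (the tiny image and its label here are made up), a character image can be flattened into such an input vector:

```python
import numpy as np

# A made-up 3x3 binary image: 1 = black pixel, 0 = white.
image = np.array([[0, 1, 0],
                  [1, 1, 0],
                  [0, 1, 1]])

x = image.flatten()   # input vector x = (x1, ..., xn), here n = 9
label = "a"           # the class this training pattern was assigned
print(x, label)
```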
Generalisation
Q. How do we tell if a new unseen image is an a or a b?
A. Brute force: have a library of all possible images.
But for a 256 x 256 image where each pixel takes one of, say, 256 values, there are 256^(256 x 256) ≈ 10^158,000 possible images. Impossible! We typically have fewer than a few thousand images in the training set.
Therefore, the system must be able to classify UNSEEN patterns based on the patterns it has seen, i.e. it must be able to generalise from the data in the training set.
Intuition: real neural networks do this well, so maybe artificial ones can do the same. As they are also shaped by experience, maybe we will also learn something about how the brain does it ...
For 2-class classification we want the network output y (a function of the inputs and the network parameters) to be:
y(x, w) = 1 if x is an a
y(x, w) = -1 if x is a b
where x is an input vector and the network parameters are grouped as a vector w. y is known as a discriminant function: it discriminates between the 2 classes.
As the network mapping is defined by the parameters, we must use the data set to perform learning (training, adaptation), i.e. change the weights or interactions between neurons according to the training examples (and possibly prior knowledge of the problem). The purpose of learning is to minimise:
- training errors on the learning data: the learning error
- prediction errors on new, unseen data: the generalisation error
since when these errors are minimised, the network discriminates between the 2 classes.
We therefore need an error function to measure the network's performance, based on the training error.
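For example (purely as an illustration; no particular error function has been fixed at this point in the course), a common choice is the sum-of-squares error over the N training patterns:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl(y(\mathbf{x}_n,\mathbf{w}) - t_n\bigr)^2$$

where $t_n$ is the target output (+1 or -1, as above) for the nth training pattern.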
An optimisation algorithm can then be used to minimise the learning error and train the network; a minimal sketch follows below.
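The sketch below assumes a tanh discriminant and the sum-of-squares error above; the data, learning rate and epoch count are all illustrative:

```python
# Gradient descent on a sum-of-squares error for a simple
# discriminant y(x, w) = tanh(w . x + b).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # training inputs
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # targets in {+1, -1}

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(200):
    y = np.tanh(X @ w + b)                   # network outputs
    err = y - t                              # per-pattern errors
    grad = (1 - y**2) * err                  # d(tanh)/da = 1 - tanh^2
    w -= lr * X.T @ grad / len(X)            # gradient step on the weights
    b -= lr * grad.mean()                    # gradient step on the bias

print("training error:", np.mean(np.sign(np.tanh(X @ w + b)) != t))
```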
Feature Extraction
However, if we use all the pixels as inputs we are going to have a long training procedure and a very big network. We may want to analyse the data first (pre-process it) and extract some (lower-dimensional) salient features to be the inputs to the network. For example, we could use the ratio of height to width of the letter, call it x*, as b's will tend to be taller than a's (prior knowledge); a sketch of this feature extraction is given below.
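A minimal sketch of computing such a feature (the bounding-box approach is one plausible way to measure height and width; the image is assumed binary with 1 = black):

```python
import numpy as np

def aspect_ratio(image):
    """x* = height/width of the bounding box around the black (1) pixels."""
    rows, cols = np.nonzero(image)
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    return height / width
```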
We could then make a decision based on this feature. Suppose we make a histogram of the values of x* for the input vectors in the training set X:
[Figure: histograms of x* for classes C1 and C2, with a new input at x* = A]
For a new input with an x* value of A, we would classify it as C1, as it is more likely to belong to this class.
[Figure: the same histograms of x*, now with a decision boundary at x* = d]
We therefore get the idea of a decision boundary: points on one side of the boundary are in one class, and points on the other side are in the other class, i.e. if x* < d the pattern is in C1, else it is in C2. Intuitively it makes sense (and is optimal in a Bayesian sense) to place the boundary where the 2 histograms cross, as in the sketch below.
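A rough sketch of estimating d this way, using made-up Gaussian samples for the two classes (all the numbers are illustrative):

```python
import numpy as np

# Made-up x* values: a's tend to have smaller x*, b's (taller) larger x*.
x_star_C1 = np.random.default_rng(1).normal(0.9, 0.15, 500)
x_star_C2 = np.random.default_rng(2).normal(1.4, 0.15, 500)

bins = np.linspace(0.4, 2.0, 40)
h1, _ = np.histogram(x_star_C1, bins)
h2, _ = np.histogram(x_star_C2, bins)

# The first bin where C2's count overtakes C1's approximates the crossing point.
cross = np.argmax(h2 > h1)
d = bins[cross]
print("decision boundary d ~", d)
```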
We can then view pattern recognition as the process of assigning patterns to one of a number of classes by dividing up the feature space with decision boundaries, which thus divides the original input space:
input space → feature extraction → feature space → classification
However, there can be a lot of overlap in this case, so we could use a rejection threshold e, where:
if x* < d - e the pattern is in C1
if x* > d + e the pattern is in C2
otherwise the pattern is rejected (see the sketch below).
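As a small sketch, the resulting decision rule (the function name is hypothetical) is:

```python
def classify(x_star, d, e):
    """Classify by the feature x*, rejecting inputs too close to the boundary d."""
    if x_star < d - e:
        return "C1"        # confidently an 'a'
    if x_star > d + e:
        return "C2"        # confidently a 'b'
    return "reject"        # within the rejection band; defer the decision
```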
This is related to the idea of minimising risk, where it may be more important not to misclassify patterns in one class than in the other; this is especially important in medical applications. It can serve to shift the decision boundary one way or the other, based on the loss function, which defines the relative importance/cost of the different errors.
However, we cannot keep increasing the number of features, as there will come a point where performance starts to degrade because there is not enough data to provide a good estimate (cf. using all 256 x 256 pixels).
Curse of dimensionality
Geometric example: suppose we want to approximate a 1-d function y from m-dimensional training data. We could:
- divide each dimension into intervals (like a histogram)
- set the y value for an interval to the mean y value of all points in that interval
- increase precision by increasing the number of intervals
However, we need at least 1 point in each interval, so for k intervals in each dimension we need > k^m data points. Thus the number of data points grows at least exponentially with the input dimension (see the sketch below). This is known as the Curse of Dimensionality: "A function defined in high dimensional space is likely to be much more complex than a function defined in a lower dimensional space and those complications are harder to discern" (Friedman 95, in Haykin 99).
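A tiny sketch of that growth, with k = 10 intervals per dimension (purely illustrative):

```python
k = 10                          # intervals per dimension
for m in (1, 2, 3, 5, 10):      # input dimensions
    print(f"m = {m:2d}: need > {k**m:,} data points")
```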
Of course, the above is a particularly inefficient way of using data, and most NNs are less susceptible. However, the only practical way to beat the curse is to incorporate correct prior knowledge. In practice, we must make the underlying function smoother (i.e. less complex) with increasing input dimensionality, and also try to reduce the input dimension by pre-processing. Mainly, we must learn to live with the fact that perfect performance is not possible: data in the real world sometimes overlaps. We treat the input data as random variables and instead look for a model which has the smallest probability of making a mistake.
Multivariate regression
A type of function approximation: we try to approximate a function from a set of (noisy) training data. E.g. suppose we have some underlying function of x (the specific function is not reproduced here). We generate training data at equal intervals of x (red circles in the original figure), add a little random Gaussian noise with s.d. 0.05, and train the model on this data. We then test the model (in this case a piecewise linear model) by plugging in many values of x and viewing the resultant function (solid blue line in the original figure). A sketch of this set-up follows below.
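A minimal sketch of this set-up in Python; since the original function is not reproduced here, h(x) = sin(2*pi*x) is assumed purely for illustration:

```python
import numpy as np

def h(x):
    return np.sin(2 * np.pi * x)   # ASSUMED target function, for illustration only

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)                  # equally spaced inputs
t_train = h(x_train) + rng.normal(0, 0.05, 10)   # targets with Gaussian noise, s.d. 0.05

# A piecewise linear "model": straight-line interpolation between training points.
x_test = np.linspace(0, 1, 200)
y_test = np.interp(x_test, x_train, t_train)
```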
Model Complexity
In the previous picture we used a piecewise linear function to approximate the data. It is better to use a polynomial y = Σ_i a_i x^i to approximate the data, i.e.:
y = a0 + a1x                              1st order (straight line)
y = a0 + a1x + a2x^2                      2nd order (quadratic)
y = a0 + a1x + a2x^2 + a3x^3              3rd order (cubic)
...
y = a0 + a1x + ... + anx^n                nth order
As the order (highest power of x) increases, so does the potential complexity of the model/polynomial. This means it can represent a more complex (non-smooth) function and thus approximate the data more accurately, as the sketch below illustrates.
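A short sketch fitting polynomials of increasing order to noisy data like the above (same assumed h(x) as before; all numbers are illustrative). The training error keeps falling with order, while the test error eventually rises:

```python
import numpy as np

def h(x):
    return np.sin(2 * np.pi * x)   # assumed target function, as before

rng = np.random.default_rng(0)
x_tr = np.linspace(0, 1, 10)
t_tr = h(x_tr) + rng.normal(0, 0.05, 10)
x_te = np.linspace(0, 1, 100)
t_te = h(x_te) + rng.normal(0, 0.05, 100)

for order in (1, 2, 3, 6, 9):
    a = np.polyfit(x_tr, t_tr, order)                    # coefficients a_n..a_0
    e_tr = np.mean((np.polyval(a, x_tr) - t_tr) ** 2)    # training error (MSE)
    e_te = np.mean((np.polyval(a, x_te) - t_te) ** 2)    # test/generalisation error
    print(f"order {order}: train MSE {e_tr:.5f}, test MSE {e_te:.5f}")
```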
As the model complexity grows, performance improves for a while but starts to degrade after reaching an optimal level. Note though that the training error continues to go down, as the model matches the fine-scale detail of the data (i.e. the noise). Rather, we want to model the intrinsic dimensionality of the data, otherwise we get the problem of overfitting. This is analogous to the problem of overtraining, where a model is trained for too long, models the data too exactly, and loses its generality.
Similar problems occur in classification: a model with too much flexibility does not generalise well, resulting in a non-smooth decision boundary. This is somewhat like giving a system enough capacity to remember all the training points: it has no need to generalise. With less memory, it must generalise to be able to model the training data. There is a trade-off between being a good fit to the training data and achieving good generalisation: cf. the bias-variance trade-off (later).