Machine Learning-4
Unit IV: Support Vector Machines (SVM): Introduction, Linear Discriminant Functions for Binary
Classification, Perceptron Algorithm, Large Margin Classifier for Linearly Separable Data, Linear Soft Margin
Classifier for Overlapping Classes, Kernel Induced Feature Spaces, Nonlinear Classifier, and Regression by
Support Vector Machines.
Learning with Neural Networks: Towards Cognitive Machine, Neuron Models, Network Architectures,
Perceptrons, Linear neuron and the Widrow-Hoff Learning Rule, The error correction delta rule.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen.
Our objective is to find a plane that has the maximum margin, i.e., the maximum distance between data points
of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be
classified with more confidence.
Hyperplanes and Support Vectors:
Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side
of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon
the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of
input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine
when the number of features exceeds 3.
Support vectors are the data points that lie closest to the hyperplane and influence the position and orientation
of the hyperplane. Using these support vectors, we maximize the margin of the classifier. Deleting the support
vectors will change the position of the hyperplane. These are the points that help us build our SVM.
Description:
Firstly, the features for an example are given as input to the perceptron.
These input features get multiplied by the corresponding weights (starting from initial values).
The weighted feature values are summed.
The value of the summation is added to the bias.
The step/activation function is applied to the resulting value (a minimal sketch of these steps follows below).
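As a minimal sketch of these steps (the feature values, weights, and bias below are purely illustrative, not taken from any particular dataset), the forward pass of a perceptron might look like this in Python:

```python
import numpy as np

def perceptron_predict(x, w, b):
    """One forward pass: multiply features by weights, sum, add bias, apply step function."""
    z = np.dot(w, x) + b          # weighted sum of the inputs plus the bias
    return 1 if z >= 0 else 0     # step/activation function

# Illustrative values only.
x = np.array([1.0, -2.0, 0.5])    # input features of one example
w = np.array([0.4, 0.1, -0.3])    # weights (would start from some initial values)
b = 0.1                           # bias
print(perceptron_predict(x, w, b))
```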
If a point X satisfies the hyperplane equation w · x + b = 0, then the point lies on the plane. Otherwise, it lies
on one side of the plane, and the sign of w · x + b tells us which side.
In general, if the data can be perfectly separated using a hyperplane, then there is an infinite number of
hyperplanes, since they can be shifted up or down, or slightly rotated without coming into contact with an
observation.
That is why we use the maximal margin hyperplane (or optimal separating hyperplane), which is the
separating hyperplane that is farthest from the observations. For a given hyperplane, we calculate the
perpendicular distance from each training observation to it; the smallest of these distances is known as the
margin. Hence, the optimal separating hyperplane is the one with the largest margin.
As you can see above, there are three points that are equidistant from the hyperplane. Those observations are
known as support vectors, because if their position shifts, the hyperplane shifts as well. Interestingly, this
means that the hyperplane depends only on the support vectors, and not on any other observations.
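As a small illustration of how the margin is computed (the hyperplane parameters and points below are made up for the example), the perpendicular distance |w · x + b| / ||w|| can be evaluated for each observation and the smallest one taken as the margin:

```python
import numpy as np

def margin(X, w, b):
    """Smallest perpendicular distance from the rows of X to the hyperplane w.x + b = 0."""
    distances = np.abs(X @ w + b) / np.linalg.norm(w)
    return distances.min()

# Illustrative hyperplane and observations.
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.0, 2.0], [3.0, -1.0]])
print(margin(X, w, b))   # the margin of this hyperplane with respect to these points
```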
What if no separating plane exists?
In this case, there is no maximal margin classifier. Instead, we use a support vector classifier, which can
almost separate the classes using a soft margin.
Here, a separating hyperplane simply does not exist, so we need to define another criterion for choosing one.
The idea is to relax the requirement that the hyperplane segregate all the observations perfectly, and instead
ask it to segregate most of them. By doing so, we can allow, with varying degrees of 'softness', some
observations to be on the wrong side of the margin and possibly even on the wrong side of the hyperplane
(and so be misclassified).
In this case, the support vectors are the observations that lie on the margin or violate it, including any that fall
on the wrong side of the hyperplane.
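As a rough sketch of the soft margin idea (this assumes scikit-learn is available; the toy data and the C values are purely illustrative), the regularization parameter C controls how 'soft' the margin is: a small C tolerates more margin violations, while a large C penalizes them heavily.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian classes (toy data, illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

soft = SVC(kernel="linear", C=0.1).fit(X, y)     # softer margin, more violations allowed
hard = SVC(kernel="linear", C=100.0).fit(X, y)   # harder margin, violations penalized heavily
print(len(soft.support_), len(hard.support_))    # a softer margin typically uses more support vectors
```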
The dataset is clearly a non-linear dataset and consists of two features (say, X and Y).
In order to use SVM to classify this data, introduce another feature Z = X² + Y² into the dataset, thus
projecting the 2-dimensional data into 3-dimensional space. The first dimension represents the feature X, the
second represents Y, and the third represents Z (which, mathematically, is the squared distance of the point
(x, y) from the origin, i.e. the square of the radius of the circle the point lies on). Now, clearly, for the data
shown above, the 'yellow' data points belong to a circle of smaller radius and the 'purple' data points belong to
a circle of larger radius. Thus, the data becomes linearly separable along the Z-axis.
Now, we can use SVM (or, for that matter, any other linear classifier) to learn a 2-dimensional separating
hyperplane in this 3-dimensional space; the hyperplane is simply a plane at a constant value of Z, parallel to
the X-Y plane.
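A minimal sketch of this feature mapping (the circular toy data below is generated just for illustration, and scikit-learn's linear SVM is assumed for the final step):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic circular data: inner circle = one class, outer ring = the other (illustrative).
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([0] * 100 + [1] * 100)

# Explicit feature map: add Z = X^2 + Y^2 as a third coordinate.
Z = (X ** 2).sum(axis=1)
X3 = np.column_stack([X, Z])

# A linear classifier in the 3-dimensional space now separates the two circles.
clf = SVC(kernel="linear").fit(X3, y)
print(clf.score(X3, y))   # should be (close to) 1.0
```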
1. Kernel: A kernel is a function used to map lower-dimensional data points into a higher-dimensional space.
Since SVR performs linear regression in that higher-dimensional space, this function is crucial. There are
many types of kernel, such as the Polynomial kernel, the Gaussian (RBF) kernel, and the Sigmoid kernel.
2. Hyperplane: In a Support Vector Machine, a hyperplane is the surface that separates two data classes in a
dimension higher than the original one. In SVR, the hyperplane is the line (or surface) used to predict the
continuous value.
3. Boundary Line: The two lines drawn parallel to the hyperplane, at a distance given by the error threshold
(epsilon), are known as boundary lines. These lines create a margin around the data points.
4. Support Vector: The data points that lie closest to (or outside) the boundary lines; these points determine
where the hyperplane and boundary lines are placed. (A minimal SVR sketch follows this list.)
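A minimal SVR sketch tying these terms together (this assumes scikit-learn; the data and the kernel, C, and epsilon values are illustrative only):

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1-D regression data (illustrative only).
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

# kernel  -> maps the data into a higher-dimensional space (Gaussian/RBF here)
# epsilon -> the error threshold defining the boundary lines around the hyperplane
# C       -> how strongly points outside the epsilon tube are penalized
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(model.predict(X[:5]))   # predicted continuous values
print(len(model.support_))    # points on or outside the tube act as support vectors
```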
The output is analogous to the axon of a biological neuron, and its value propagates to the input of the next
layer, through a synapse. It may also exit the system, possibly as part of an output vector. It has no learning
process as such; its transfer-function weights are calculated and its threshold value is predetermined.
The nodes of the input layer are passive: each simply duplicates the single value it receives onto its multiple
outputs. In comparison, the nodes of the hidden and output layer are active. The values held by the input layer
are the data to be evaluated; for example, they may be pixel values from an image, samples from an audio
signal, stock market prices on successive days, etc.
They may also be the output of some other algorithm, such as the classifiers in our cancer detection example:
diameter, brightness, edge sharpness, etc. Each value from the input layer is duplicated and sent to all of the
hidden nodes. This is called a fully interconnected structure.
The values entering a hidden node are multiplied by weights, a set of predetermined numbers stored in
the program. The weighted inputs are then added to produce a single number. This is shown in the diagram by
the symbol, ∑. Before leaving the node, this number is passed through a nonlinear mathematical function
called a sigmoid. This is an "s" shaped curve that limits the node's output. That is, the input to the sigmoid is a
value between -∞ and +∞, while its output can only be between 0 and 1.
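As a minimal sketch of what a single active node computes (the input values and weights below are made up for the example):

```python
import numpy as np

def hidden_node(inputs, weights):
    """One active node: weighted sum of the inputs, passed through a sigmoid."""
    s = np.dot(inputs, weights)          # the summation shown by the sigma symbol
    return 1.0 / (1.0 + np.exp(-s))      # the sigmoid limits the output to (0, 1)

# Illustrative values only.
x = np.array([0.2, -0.7, 1.5])
w = np.array([0.5, 0.3, -0.1])
print(hidden_node(x, w))
```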
The outputs from the hidden layer are represented in the flow diagram (Fig. 26-5) by the variables
X21, X22, X23, and X24. Just as before, each of these values is duplicated and applied to the next layer. The
active nodes of the output layer combine and modify the data to produce the two output values of this
network, X31 and X32.
Neural networks can have any number of layers, and any number of nodes per layer. Most applications
use the three layer structure with a maximum of a few hundred input nodes. The hidden layer is usually about
10% the size of the input layer. In the case of target detection, the output layer only needs a single node. The
output of this node is thresholded to provide a positive or negative indication of the target's presence or
absence in the input data.
Table 26-1 is a program to carry out the flow diagram of Fig. 26-5. The key point is that this architecture
is very simple and very generalized. This same flow diagram can be used for many problems, regardless of
their particular quirks. The ability of the neural network to provide useful data manipulation lies in the proper
selection of the weights. This is a dramatic departure from conventional information processing where
solutions are described in step-by-step procedures.
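As a hedged sketch of the kind of computation such a program carries out (this is not the code of Table 26-1 itself; the layer sizes and random weights are purely illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w_hidden, w_output):
    """Three-layer feedforward pass: input -> hidden (sigmoid) -> output (sigmoid)."""
    hidden = sigmoid(w_hidden @ x)      # every input value reaches every hidden node
    return sigmoid(w_output @ hidden)   # hidden outputs feed the active output nodes

# Illustrative sizes: 15 inputs, 4 hidden nodes, 2 outputs; weights chosen at random.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 15)
w_hidden = rng.normal(0, 1, (4, 15))
w_output = rng.normal(0, 1, (2, 4))
print(forward(x, w_hidden, w_output))   # the two output values, X31 and X32
```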
As an example, imagine a neural network for recognizing objects in a sonar signal. Suppose that 1000 samples
from the signal are stored in a computer. How does the computer determine if these data represent a
submarine, whale, undersea mountain, or nothing at all? Conventional DSP would approach this problem with
mathematics and algorithms, such as correlation and frequency spectrum analysis. With a neural network, the
1000 samples are simply fed into the input layer, resulting in values popping from the output layer. By
selecting the proper weights, the output can be configured to report a wide range of information. For instance,
there might be outputs for: submarine (yes/no), whale (yes/no), undersea mountain (yes/no), etc.
With other weights, the outputs might classify the objects as: metal or non-metal, biological or
nonbiological, enemy or ally, etc. No algorithms, no rules, no procedures; only a relationship between the
input and output dictated by the values of the weights selected.
Figure 26-7a shows a closer look at the sigmoid function, mathematically described by the equation (Eq. 26-1): s(x) = 1 / (1 + e^-x).
The exact shape of the sigmoid is not important, only that it is a smooth threshold. For comparison,
a simple threshold produces a value of one when x > 0, and a value of zero when x < 0. The sigmoid
performs this same basic thresholding function, but is also differentiable, as shown in Fig. 26-7b. While the
derivative is not used in the flow diagram (Fig. 26-5), it is a critical part of finding the proper weights to use.
More about this shortly. An advantage of the sigmoid is that there is a shortcut to calculating the value of its
derivative: s'(x) = s(x)[1 - s(x)].
For example, if x = 0, then s(x) = 0.5 (by Eq. 26-1), and the first derivative is calculated: s'(x) = 0.5(1 -
0.5) = 0.25. This isn't a critical concept, just a trick to make the algebra shorter.
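A short sketch of the sigmoid and its derivative shortcut in plain Python/NumPy, matching the worked example above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # Eq. 26-1

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)                # the shortcut: s'(x) = s(x)[1 - s(x)]

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25, as in the worked example above
```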
Wouldn't the neural network be more flexible if the sigmoid could be adjusted left-or-right, making it centered
on some other value than x = 0? The answer is yes, and most neural networks allow for this. It is very simple
to implement; an additional node is added to the input layer, with its input always having a value of one.
When this is multiplied by the weights of the hidden layer, it provides a bias (DC offset) to each sigmoid. This
addition is called a bias node. It is treated the same as the other nodes, except for the constant input.
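A minimal sketch of the bias node idea (the sizes and random weights are illustrative): appending a constant 1 to the input vector lets one extra weight per hidden node act as the bias that shifts each sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.3, -0.8, 1.2])
x_with_bias = np.append(x, 1.0)          # the bias node: its input is always 1

rng = np.random.default_rng(4)
w_hidden = rng.normal(0, 1, (4, 4))      # 4 hidden nodes, 3 input weights + 1 bias weight each
print(sigmoid(w_hidden @ x_with_bias))   # each sigmoid is shifted by its bias weight
```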
Can neural networks be made without a sigmoid or similar nonlinearity? To answer this, look at the three-
layer network of Fig. 26-5. If the sigmoids were not present, the three layers would collapse into only two
layers. In other words, the summations and weights of the hidden and output layers could be combined into a
single layer, resulting in only a two-layer network.
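This collapse can be checked numerically; in the sketch below (random weights, purely illustrative), two linear layers give exactly the same result as the single combined layer:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0, 1, 15)
w_hidden = rng.normal(0, 1, (4, 15))
w_output = rng.normal(0, 1, (2, 4))

# Without the sigmoids, the hidden and output layers are just two matrix multiplies,
# which combine into a single equivalent weight layer w_output @ w_hidden.
two_layers = w_output @ (w_hidden @ x)
one_layer = (w_output @ w_hidden) @ x
print(np.allclose(two_layers, one_layer))   # True: the two weight layers collapse into one
```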
(Fig. 3) Four lines, each dividing the plane into two linearly separable regions.
The top perceptron performs logical operations on the outputs of the hidden layer so that the whole network
classifies input points into two regions that might not be linearly separable. For instance, using the AND
operator on these four outputs, one gets the intersection of the four regions, which forms the center region.
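A small sketch of this construction (the four lines below form a square around the origin and are chosen just for illustration):

```python
def step(z):
    return 1 if z >= 0 else 0

def center_region(x, y):
    """AND of four half-plane perceptrons: returns 1 only inside the square |x| <= 1, |y| <= 1."""
    h1 = step(x + 1)    # right of the line x = -1
    h2 = step(-x + 1)   # left of the line x = +1
    h3 = step(y + 1)    # above the line y = -1
    h4 = step(-y + 1)   # below the line y = +1
    return step(h1 + h2 + h3 + h4 - 3.5)   # the top perceptron implements AND

print(center_region(0.0, 0.0))   # 1: inside the intersection (the center region)
print(center_region(2.0, 0.0))   # 0: outside it
```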
The delta rule updates the weight w_ji connecting the i-th input to neuron j as
Δw_ji = α (t_j − y_j) g′(h_j) x_i
where
α is a small constant called the learning rate,
g(x) is the neuron's activation function,
g′ is the derivative of g,
t_j is the target output,
h_j is the weighted sum of the neuron's inputs,
y_j is the actual output, and
x_i is the i-th input.
The delta rule is commonly stated in simplified form for a neuron with a linear activation function as
Δw_ji = α (t_j − y_j) x_i, since g′(h_j) = 1 in that case.
While the delta rule is similar to the perceptron's update rule, the derivation is different. The perceptron uses
the Heaviside step function as the activation function g(h), which means that g′(h) does not exist at zero and
is equal to zero elsewhere, making a direct application of the delta rule impossible.
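As a minimal sketch of the delta (Widrow-Hoff) rule applied to a single linear neuron (the target function and learning rate below are illustrative):

```python
import numpy as np

def delta_rule_update(w, x, t, alpha=0.1):
    """One delta-rule step for a linear neuron: w_i += alpha * (t - y) * x_i, since g'(h) = 1."""
    y = np.dot(w, x)                 # for a linear activation, y_j equals the weighted sum h_j
    return w + alpha * (t - y) * x

# Learn the linear target y = 2*x1 - x2 from noise-free samples (illustrative).
rng = np.random.default_rng(6)
w = np.zeros(2)
for _ in range(200):
    x = rng.uniform(-1, 1, 2)
    t = 2.0 * x[0] - 1.0 * x[1]
    w = delta_rule_update(w, x, t)
print(w)   # approaches [2, -1]
```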