CHAPTER 6 Machine Learning
Aims
To introduce
   the basics of machine learning, particularly inductive learning.
Objectives
You should be able to
 Describe how each learning method (version space learning, decision trees,
  genetic algorithms and neural networks) can be used to perform classification tasks.
 Use a tool to build a simple application with these methods.
                               Introduction
 The ability to learn is one of the most important
  characteristics of an intelligent entity.
 A system that can learn is more flexible, because it can
  respond to new problems and situations, and it may also be
  easier to program.
 Learning is still an expanding area of AI research.
 It overlaps with almost all other areas of AI. For example:
    in planning and robotics, there is interest in getting systems to learn rules of
     behavior from experience in some environment;
    in natural language, a system may learn syntactic rules from example
     sentences;
    in vision, a system may learn to recognize an object given some
     example images; and
    in expert systems, rules may be learned from example cases.
It is also an area which is attracting interest in industry, with
 many commercial products available. For example, there is
 interest in analyzing data obtained from supermarket loyalty
 cards in order to find rules that can be used in direct
 marketing campaigns.
There are several basic kinds of learning, including:
    learning from a teacher. A teacher may tell you something directly, so
     you just have to remember it, or may give some examples or present
     an analogy. This is known as supervised learning.
    discovering new knowledge through experimentation or experience,
     known as unsupervised learning.
In AI, most of the work to date has been on
 learning from examples, or inductive learning.
This may involve learning conceptual categories
 (like the concept of “dog”, from examples of dogs),
 learning rules to predict the weather, learning rules
 to diagnose a disease, and so on.
In each case, examples are given in some
 suitable formalism, and the system attempts to
 infer general rules, formulae or descriptions from
 those examples.
 In general, inductive learning is used to train a system
  to perform classification tasks. A classification task
  means there are a number of input features, and a set
  of possible output categories.
 For example, medical diagnosis is a classification task,
  where the input features are the patient’s symptoms,
  and the output categories are the possible diagnoses.
 Inductive learning methods may be used to build
  a system that automatically produces the correct
  classification given just the input feature values.
 The main techniques for inductive learning are:
    symbolic methods: learning is seen as a search problem, in which the
     space of possible concepts is searched to find one
     that matches the examples. One such approach involves building
     the best decision tree to categorize the given examples.
    genetic algorithms: based on the notion that good solutions
     can evolve out of a population, by combining possible solutions
     to produce “offspring” solutions and “killing off” the weaker of
     those solutions.
    neural networks: loosely based on the architecture of the brain,
     and a promising approach for certain tasks.
A Simple Inductive Learning Example
 Real machine learning applications typically require
  many hundreds or even thousands of examples
  (dataset) in order for interesting knowledge to be
  learned.
 For example, to learn rules to diagnose a particular
  disease, given that the patient has, say, stomach pains,
  data on thousands of patients would be required, listing
  the additional symptoms of each patient and the final
  diagnoses made by an expert.
 To illustrate the methods, a simpler problem and a smaller set of
  examples are needed, e.g. the “student” problem.
 Suppose we have data on a number of students in last
  year’s class, and are trying to find a rule that will allow
  us to determine whether current students are likely to
  get a first-class degree mark.
 The ones who did are referred to as positive examples,
  while the ones who didn’t are referred to as negative
  examples (Fig. 1).
Student   First last   Male?   Works    Drinks?   First this
          year?                hard?              year?
Richard   yes          yes     no       yes       no
Alan      yes          yes     yes      no        yes
Alison    no           no      yes      no        no
Jeff      no           yes     no       yes       no
Gail      yes          no      yes      yes       yes
Simon     no           yes     yes      yes       no
          Fig. 1 Student exam performance data
 A quick inspection shows that the two people
  who got firsts (Alan and Gail) both got firsts last year and
  work hard, and that none of the people who failed to get
  firsts both did well last year and work hard.
 So a reasonable learned rule is: if you did well last year
  and work hard this year, you should do well.
 However, other rules are possible. For example: if you
  EITHER are male and don’t drink OR are female and drink
  a lot, then you’ll do well. But this rule is a little odd, and
  more complex.
 Generally the best rule, the one that gets most predictions right
  on new cases, will be the simplest one, as it tends to capture
  generalities (hard-working students do well).
 In the example, the four attributes (or features) to focus
  on are first last year, male, works hard and
  drinks. All of these have yes/no answers, known as
  feature values.
 We use the letters L, M, W and D to represent the
  features, and Ts and Fs to represent the feature values.
 So, Richard’s feature values correspond to the row
  TTFT.
 The fact “doesn’t drink but does work hard” can also
  be represented as W ∧ ¬D.
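The encoding above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function names are assumptions, not part of the original notes. It writes the Fig. 1 rows as (L, M, W, D) truth values and checks the two candidate rules discussed earlier against them.

```python
# Fig. 1 rows encoded as (L, M, W, D) booleans plus the target "first this year".
# The dictionary layout and function names are illustrative, not from the notes.
STUDENTS = {
    "Richard": ((True, True, False, True), False),
    "Alan":    ((True, True, True, False), True),
    "Alison":  ((False, False, True, False), False),
    "Jeff":    ((False, True, False, True), False),
    "Gail":    ((True, False, True, True), True),
    "Simon":   ((False, True, True, True), False),
}

def as_row(features):
    """Render a feature tuple as a string of Ts and Fs, e.g. 'TTFT'."""
    return "".join("T" if value else "F" for value in features)

def simple_rule(features):
    """L AND W: got a first last year and works hard."""
    l, m, w, d = features
    return l and w

def odd_rule(features):
    """(M AND NOT D) OR (NOT M AND D): the more complex alternative rule."""
    l, m, w, d = features
    return (m and not d) or (not m and d)

def accuracy(rule):
    """Count how many Fig. 1 examples the rule classifies correctly."""
    return sum(rule(features) == target for features, target in STUDENTS.values())

print(as_row(STUDENTS["Richard"][0]), accuracy(simple_rule), accuracy(odd_rule))
```

Both rules fit all six examples, which is why simplicity, rather than fit alone, is used to choose between them.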
Version Space Learning
 This method treats learning as a search problem.
 The rule to be learned is a conjunction of facts, i.e.
  a rule involving only AND. E.g. “If they work hard and
  don’t drink a lot they’ll get a first” can be represented as
      W ∧ ¬D.
 The rule “Everyone will get a first” is T (always true) and
  “No-one will get a first” is F (always false).
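To make the "learning as search" idea concrete, here is a minimal sketch that searches the space of conjunctive rules by brute force. It is not the version space (candidate elimination) algorithm itself, only an enumeration of the same hypothesis space, and the names `covers`, `consistent_hypotheses` and `describe` are hypothetical. Each hypothesis requires a feature to be true, requires it to be false, or ignores it (None). The rule F is not part of this enumerated space.

```python
from itertools import product

FEATURES = ("L", "M", "W", "D")
# Fig. 1 examples: (L, M, W, D) values and whether the student got a first.
EXAMPLES = [
    ((True, True, False, True), False),    # Richard
    ((True, True, True, False), True),     # Alan
    ((False, False, True, False), False),  # Alison
    ((False, True, False, True), False),   # Jeff
    ((True, False, True, True), True),     # Gail
    ((False, True, True, True), False),    # Simon
]

def covers(hypothesis, features):
    """True if every feature the hypothesis cares about has the required value."""
    return all(want is None or want == have
               for want, have in zip(hypothesis, features))

def consistent_hypotheses(examples):
    """Return every conjunction that covers all positives and no negatives."""
    return [h for h in product((True, False, None), repeat=len(FEATURES))
            if all(covers(h, x) == label for x, label in examples)]

def describe(hypothesis):
    """Write a hypothesis in words; the empty conjunction is the rule T."""
    parts = [name if want else "not " + name
             for name, want in zip(FEATURES, hypothesis) if want is not None]
    return " and ".join(parts) if parts else "T"

for h in consistent_hypotheses(EXAMPLES):
    print(describe(h))
```

For the Fig. 1 data only one of the 81 conjunctions survives, the rule "L and W".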
Decision Tree
 This method can learn rules that contain disjunctions (OR).
 It is based on representing the rule as a decision tree.
 Example: Figure 7.3 shows a simplified decision tree to determine whether someone
  coming to the surgery with chest pains has had a heart attack. The
  “diagnosis” is made by going through the tree, answering the yes/no
  questions posed by the system.
 Decision tree induction systems try to construct the simplest decision
  tree that correctly classifies all the example data from past cases. The
  idea is that if the tree is simple it will capture generalities in the example
  data and be useful for making predictions or diagnoses given new cases.
 To illustrate the algorithm, consider the student data.
 Algorithm DT:
    1.   pick the best attribute, e.g. First last year
    2.   produce a branch for each of its values
    3.   find the records that fall into each branch
    4.   find the categories (decisions) of those records
    5.   if every record in a branch has the same category then stop, else go to 1 for that branch
 
 Try First last year (FLY) followed by Works hard (WH), and then WH followed by Drinks (D).
 The general idea is to look for features which are particularly good
  indicators of the result you’re interested in. These features are then
  placed (as questions) in nodes of the tree.
Student   First last   Male?   Works    Drinks?   First this
          year?                hard?              year?
Richard   yes          yes     no       yes       yes
Alan      yes          yes     yes      no        yes
Alison    no           no      yes      no        yes
Jeff      no           yes     no       yes       no
Gail      yes          no      yes      yes       yes
Simon     no           yes     no       yes       no
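Algorithm DT can be sketched as a short recursive function over the table above. This is a minimal sketch under stated assumptions: "best attribute" is taken to mean the one whose branches make the fewest mistakes when each branch predicts its majority category (the notes do not define "best"), ties are broken by column order, and all names are illustrative.

```python
from collections import Counter

ATTRIBUTES = ("first_last_year", "male", "works_hard", "drinks")
# Rows of the table above: attribute values (1 = yes, 0 = no) and the target.
DATA = [
    ((1, 1, 0, 1), 1),  # Richard
    ((1, 1, 1, 0), 1),  # Alan
    ((0, 0, 1, 0), 1),  # Alison
    ((0, 1, 0, 1), 0),  # Jeff
    ((1, 0, 1, 1), 1),  # Gail
    ((0, 1, 0, 1), 0),  # Simon
]

def majority(rows):
    """Most common category among the rows."""
    return Counter(label for _, label in rows).most_common(1)[0][0]

def errors_after_split(rows, index):
    """Misclassifications if each branch predicts its majority category."""
    total = 0
    for value in (0, 1):
        branch = [r for r in rows if r[0][index] == value]
        if branch:
            guess = majority(branch)
            total += sum(label != guess for _, label in branch)
    return total

def build_tree(rows, unused):
    """Steps 1-5: pick an attribute, branch on it, recurse until branches are pure."""
    labels = {label for _, label in rows}
    if len(labels) == 1 or not unused:
        return majority(rows)  # leaf: a single decision
    best = min(unused, key=lambda i: errors_after_split(rows, i))
    rest = [i for i in unused if i != best]
    node = {"ask": ATTRIBUTES[best]}
    for value in (0, 1):
        branch = [r for r in rows if r[0][best] == value]
        node[value] = build_tree(branch, rest) if branch else majority(rows)
    return node

def classify(tree, features):
    """Walk the tree, answering each yes/no question, until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree[features[ATTRIBUTES.index(tree["ask"])]]
    return tree

TREE = build_tree(DATA, list(range(len(ATTRIBUTES))))
print(TREE)
```

On this table the root question is "first last year?". In the "no" branch several attributes separate the remaining three records equally well, so the second question depends on the tie-break; trying the attributes in a different order, as suggested above, gives a different but equally correct tree.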
Genetic Algorithm
A very different sort of method.
A GA can be viewed as a kind of search technique.
GAs have been successfully applied to timetabling problems, which involve searching for a
possible assignment of events (e.g. lectures) to rooms and times, given
various constraints (e.g. people can’t be in two places at the same time).
When learning rules, there may be many millions of possible rules, and it may be hard to find the
best ones.
GAs are biologically inspired, being influenced by theories of evolution.
They are also sometimes called evolutionary algorithms.
The basic idea is to have a population of genomes representing possible
solutions, to mutate and combine these to produce new ones (offspring), and
to evaluate the performance of these offspring using some scoring function.
The fittest of these offspring (those with the highest scores) survive to “mate” again.
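The population, mutate, combine and score cycle described above can be sketched as follows. Everything specific here is an assumption made for illustration: a genome is a conjunctive rule over the four student features (require yes, require no, or don't care), the scoring function counts correctly classified Fig. 1 examples, and the population size, mutation rate and number of generations are arbitrary.

```python
import random

# Fig. 1 examples: (L, M, W, D) as 0/1 and the target (1 = first this year).
EXAMPLES = [((1, 1, 0, 1), 0), ((1, 1, 1, 0), 1), ((0, 0, 1, 0), 0),
            ((0, 1, 0, 1), 0), ((1, 0, 1, 1), 1), ((0, 1, 1, 1), 0)]
GENES = (0, 1, None)  # require "no", require "yes", or don't care

def predict(genome, features):
    """A genome is a conjunctive rule: every non-None gene must match."""
    return int(all(g is None or g == f for g, f in zip(genome, features)))

def fitness(genome):
    """Scoring function: number of examples classified correctly."""
    return sum(predict(genome, x) == y for x, y in EXAMPLES)

def crossover(mum, dad, rng):
    """Combine two parents by cutting both at the same random point."""
    cut = rng.randint(1, len(mum) - 1)
    return mum[:cut] + dad[cut:]

def mutate(genome, rng, rate=0.1):
    """Replace each gene by a random one with a small probability."""
    return tuple(rng.choice(GENES) if rng.random() < rate else g for g in genome)

def evolve(generations=30, size=12, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.choice(GENES) for _ in range(4)) for _ in range(size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: size // 2]  # the fittest survive to mate again
        children = [mutate(crossover(*rng.sample(survivors, 2), rng), rng)
                    for _ in range(size - len(survivors))]
        population = survivors + children
    best = max(population, key=fitness)
    return best, history + [fitness(best)]

best, history = evolve()
print(best, history[-1])
```

Because the fittest half of each generation is kept unchanged, the best score never decreases from one generation to the next. Whether the run reaches the perfect rule (1, None, 1, None) depends on the random seed.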
Neural Networks
Neural networks provide a rather different approach to reasoning and learning.
They consist of many simple processing units (or neurons) connected together.
The behaviour of each neuron is very simple, but together a collection of
neurons can show sophisticated behaviour and be used for complex tasks.
Examples: vision, speech, walking.
There are many kinds of neural networks, so this discussion will be limited
to perceptrons, including multilayer perceptrons.
The behaviour depends on the weights on the connections between neurons.
These weights are updated iteratively during learning so that, for each
example, the network’s output approaches its target output.
A trained network is not a symbolic structure (like a rule or a decision tree) that can easily be
interpreted; it is treated as a “black box” which, given some inputs,
returns some outputs.
NNs are biologically inspired by neurons in the human brain.
Biological Neurons
The human brain consists of approximately ten thousand million simple
processing units/neurons.
Each neuron is connected to many thousands of other neurons.
The basic idea is that a neuron receives inputs from its neighbours, and if
enough inputs are received at the same time that neuron will be excited or
activated and fire, giving an output that will be received by further neurons.
Figure 7.5 illustrates the basic features of a neuron.
       Soma is the body of the neuron.
       Dendrites are filaments that provide inputs to the cell.
       The axon sends output signals, and
       A synapse (or synaptic junction) is a special connection which can be
        strengthened or weakened to allow more or less of a signal through.
        Depending on the signals received from all its inputs, a neuron can be in either
        an excited or inhibited state. If excited, it will pass on that “excitation”
        through its axon, and may in turn excite neighbouring cells.
 The behaviour of a network depends on the strengths of the connections
  between neurons.
 In the biological neuron this is determined at the synapse.
 The synapse works by releasing special chemicals called neurotransmitters
  when it gets an input. More or less of such chemicals may be released,
  and this quantity may be adjusted over time.
 This can be thought of as a simple learning process.
                The Simple Perceptron: Simple learning 
•It takes a number of inputs (corresponding to the signals from neighbouring cells),
multiplies each by a weight representing the strength of the connection at the synapse,
sums the results, and fires if this sum exceeds some threshold.
•A neuron which fires has an output value of 1; otherwise its output is 0.
•More precisely, if there are n inputs (and n associated weights), the neuron finds the
weighted sum of the inputs and outputs 1 if this exceeds a threshold t and 0 otherwise.
•If the inputs are x1, ..., xn, with weights w1, ..., wn:
 
          if (w1 x1 + ... + wn xn) > t     (e.g. t = 0.5)
          then output = 1
          else output = 0
 
•This basic neuron is referred to as a simple perceptron, and is illustrated in the
following figure. The name “perceptron” was proposed by Frank Rosenblatt in the late 1950s. He
pioneered the simulation of neural networks on computers.
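The firing rule of the simple perceptron translates directly into code. The sketch below assumes a threshold of 0.5 as a default; the function name is illustrative.

```python
def perceptron_output(weights, inputs, threshold=0.5):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# Four inputs with equal weights of 0.2: three active inputs are enough to fire,
# one active input is not.
print(perceptron_output([0.2, 0.2, 0.2, 0.2], [1, 1, 0, 1]))
print(perceptron_output([0.2, 0.2, 0.2, 0.2], [0, 0, 1, 0]))
```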
• A serious neural network application would require a network of
  hundreds or thousands of neurons.  
• Learning in neural networks involves using example data to adjust the
  weights in a network.
• Each example will have specified input-output values.  
• These examples are considered one by one, and weights adjusted by a
  small amount if the current network gives the incorrect output.
• The way this is done is to increase the weights on active connections if the
  actual output of the network is 0 but the target output (from the example
  data) is 1, and decrease the weights if the actual output is 1 and the target
  is 0.
• The whole set of examples has to be considered again and again until
  eventually (we hope) the network converges to give the right results for
  all the given examples. 
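A single learning step, as described in the bullets above, can be sketched like this. The adjustment size d = 0.05 and the threshold 0.5 are assumed default values, and the function name is hypothetical. Inputs are taken to be 0 or 1, so multiplying by the input restricts the change to active connections.

```python
def update_weights(weights, inputs, target, threshold=0.5, d=0.05):
    """Apply one perceptron learning step and return the new weight list."""
    total = sum(w * x for w, x in zip(weights, inputs))
    output = 1 if total > threshold else 0
    if output == target:
        return list(weights)         # correct output: leave the weights alone
    step = d if target == 1 else -d  # raise weights if too low, lower if too high
    return [w + step * x for w, x in zip(weights, inputs)]

# Output 0 but target 1: only the weight on the one active input is increased.
print(update_weights([0.2, 0.2, 0.2, 0.2], [0, 0, 1, 0], target=1))
```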
• Example: Student problem.
No.   Student   First last   Male?   Works    Drinks?   First this
                year?                hard?              year? (target value)
1     Richard   yes          yes     no       yes       yes
2     Alan      yes          yes     yes      no        yes
3     Alison    no           no      yes      no        yes
4     Jeff      no           yes     no       yes       no
5     Gail      yes          no      yes      yes       yes
6     Simon     no           yes     yes      yes       no
• Each feature (male, works hard etc.) is represented by an input, so x1 = 1
  if the student in question got a first last year, x2 = 1 if they are male, and so
  on.
• The output corresponds to whether they end up getting a first: output = 1
  if they do, and 0 otherwise.
• Initially the weights are set to some small random values; in this example every weight starts at 0.2.
• The threshold is set to 0.5.
• The amount by which the weights are adjusted in this example is
  d = 0.05.
•   The following figure illustrates the example data for the first student
    (Richard).
•   Before any learning has taken place the output of this network is 1, as the weighted
    sum of the inputs is 0.2 + 0.2 + 0.2 = 0.6, which is higher than the threshold of 0.5.
•   Richard did get a first (target value 1), so the output is correct.
•   Therefore there is no change to the weights.
•   The next example (Alan) is now considered.
•   His inputs are 1, 1, 1 and 0.
•   The current network gives an output of 1 (the weighted sum is again 0.2 + 0.2 + 0.2 = 0.6), and the
    correct output is 1, so there is no change to the weights.
•   All the other examples are considered in the same way.
•   Learning doesn’t end there.
•   All the examples must be considered again and again until the network gives the
    right result for as many examples as possible (the error is minimized).
•   With the target values of Fig. 1 (only Alan and Gail get a first), the weights after a
    second run-through of the example data are 0.25, 0.1, 0.2 and 0.1.
•   These weights classify all six of those examples correctly, so the third run-through
    makes no changes and the process halts. Weights have now been learned such that the
    perceptron gives the correct output for each of the examples.
•   If a new student is encountered, we use the learned weights to predict their
    result. Suppose Tim got a first last year, works hard, is male, but drinks. We would
    predict that he will get a first.
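The prediction for Tim is a single weighted sum. The sketch below uses the learned weights quoted above, in the column order first last year, male, works hard, drinks (the variable names are illustrative).

```python
LEARNED_WEIGHTS = [0.25, 0.1, 0.2, 0.1]  # weights quoted in the notes
THRESHOLD = 0.5

tim = [1, 1, 1, 1]  # first last year, male, works hard, drinks

# 0.25 + 0.1 + 0.2 + 0.1 = 0.65, which exceeds the threshold of 0.5.
weighted_sum = sum(w * x for w, x in zip(LEARNED_WEIGHTS, tim))
prediction = 1 if weighted_sum > THRESHOLD else 0
print(weighted_sum, prediction)
```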
•   The basic algorithm:
Randomly initialize the weights.
Repeat
  For each record
      Calculate the weighted sum of the inputs (sum of W X)
      If sum > threshold (here 0.5) then output = 1 else output = 0
      If the calculated output is 1 and the target output is 0, decrement the weights on
      active connections by d (here 0.05);
      If the calculated output is 0 and the target output is 1, increment the weights on
      active connections by d (here 0.05);
Until the network gives the correct outputs (or some time limit is exceeded).
•   Example: Student performance:
    1. X = {1,1,0,1},{1,1,1,0},{0,0,1,0},{0,1,0,1},{1,0,1,1},{0,1,1,1}
    2. Target = {1,1,1,0,1,0}
    3. W1=W2=W3=W4=0.2
    4. REPEAT // start training
           FOR each record X with its Target
               sum = W*X // weighted sum of the inputs
               IF (sum > 0.5) // threshold
                      Output = 1 ELSE Output = 0
               IF (Output = 0 AND Target = 1)
                      W = W + 0.05; // increment weights on X != 0
               IF (Output = 1 AND Target = 0)
                      W = W - 0.05; // decrement weights on X != 0
       UNTIL ERROR = 0; // all correctly classified
     
•   Test the algorithm: check that every example is correctly classified with the final weights.
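A runnable sketch of the training loop is given below. Two choices here are assumptions rather than part of the notes. First, it uses exact fractions instead of floating point, so that a weighted sum of exactly 0.5 is compared with the threshold just as in a hand calculation. Second, the demonstration run uses the Fig. 1 target values (only Alan and Gail get a first) instead of the Target vector listed in the pseudocode, because those are the targets for which the quoted weights 0.25, 0.1, 0.2 and 0.1 can be reproduced. Any other target vector can be passed to `train` in the same way.

```python
from fractions import Fraction

# Inputs in the order: first last year, male, works hard, drinks.
X = [(1, 1, 0, 1), (1, 1, 1, 0), (0, 0, 1, 0),
     (0, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1)]
FIG1_TARGETS = [0, 1, 0, 0, 1, 0]  # Fig. 1: only Alan and Gail get firsts

def train(inputs, targets, start="0.2", threshold="0.5", d="0.05", max_epochs=100):
    """Repeat passes over the examples until none is misclassified."""
    weights = [Fraction(start)] * len(inputs[0])
    t, step = Fraction(threshold), Fraction(d)
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, target in zip(inputs, targets):
            total = sum(w * xi for w, xi in zip(weights, x))
            output = 1 if total > t else 0
            if output != target:
                errors += 1
                change = step if target == 1 else -step
                # Only active connections (xi = 1) are changed.
                weights = [w + change * xi for w, xi in zip(weights, x)]
        if errors == 0:
            return weights, epoch  # every example classified correctly
    return weights, None           # time limit exceeded

weights, epochs = train(X, FIG1_TARGETS)
print([float(w) for w in weights], epochs)
```

With these targets the third pass makes no corrections, so training stops there with the weights 0.25, 0.1, 0.2 and 0.1.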