
Chapter 6 - Learning

1
Outline
 Learning from Examples/Observations
 Knowledge in Learning
 Learning Probabilistic Models
 Neural Networks

2
Introduction to Learning
 Machine Learning is the study of how to build computer systems that adapt
and improve with experience.
 It is a subfield of Artificial Intelligence and intersects with cognitive science,
information theory, and probability theory, among others.
 Classical AI deals mainly with deductive reasoning, whereas learning represents inductive reasoning.
 Deductive reasoning arrives at answers to queries relating to a particular
situation starting from a set of general axioms, whereas inductive reasoning
arrives at general axioms from a set of particular instances.
 Classical AI often suffers from the knowledge acquisition problem in real life
applications where obtaining and updating the knowledge base is costly and
prone to errors.
 Machine learning serves to solve the knowledge acquisition bottleneck by
obtaining the result from data by induction.

3
Introduction to Learning … Cont’d
 Machine learning is particularly attractive in several real-life problems for the following reasons:
• Some tasks cannot be defined well except by example
• Working environment of machines may not be known at design time
• Explicit knowledge encoding may be difficult and not available
• Environments change over time
• Biological systems learn
 Recently, learning has been widely used in a number of application areas, including:
• Data mining and knowledge discovery
• Speech/image/video (pattern) recognition
• Adaptive control
• Autonomous vehicles/robots
• Decision support systems
• Bioinformatics
• WWW
4
Introduction to Learning … Cont’d
 Formally, a computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E.
 Thus a learning system is characterized by:
• task T
• experience E, and
• performance measure P
 Examples:
 Learning to play chess
 T: Play chess
 P: Percentage of games won in world tournament
 E: Opportunity to play against self or other players
 Learning to drive a van
 T: Drive on a public highway using vision sensors
 P: Average distance traveled before an error (according to human observer)
 E: Sequence of images and steering actions recorded during human driving.
5
Introduction to Learning … Cont’d
The block diagram of a generic learning system which can realize the above
definition is shown below:

6
Introduction to Learning … Cont’d
 As can be seen from the above diagram the system consists of the
following components:
 • Goal: Defined with respect to the task to be performed by the system
 • Model: A mathematical function which maps perceptions to actions
 • Learning rules: Rules which update the model parameters with new experience such that the performance measure with respect to the goal is optimized
 • Experience: A set of perceptions (and possibly the corresponding actions)

7
Taxonomy of Learning Systems
 Several classifications of learning systems are possible based on the above
components as follows:
 Goal/Task/Target Function:
 Prediction: To predict the desired output for a given input based on previous
input/output pairs. E.g., to predict the value of a stock given other inputs like
market index, interest rates etc.
 Categorization: To classify an object into one of several categories based on
features of the object. E.g., a robotic vision system to categorize a machine part
into one of the categories spanner, hammer, etc., based on the part's dimensions and shape.
 Clustering: To organize a group of objects into homogeneous segments. E.g., a
satellite image analysis system which groups land areas into forest, urban and
water body, for better utilization of natural resources.
 Planning: To generate an optimal sequence of actions to solve a particular
problem. E.g., an Unmanned Air Vehicle which plans its path to obtain a set of
pictures and avoid enemy anti-aircraft guns.
8
Taxonomy of Learning Systems … Cont’d
 Models:
 Propositional and FOL rules
 Decision trees
 Linear separators
 Neural networks
 Graphical models
 Temporal models like hidden Markov models
 Learning Rules:
 Learning rules are often tied to the model of learning used.
 Some common rules are gradient descent, least square error, expectation
maximization and margin maximization.

9
Taxonomy of Learning Systems … Cont’d
 Experiences:
 Learning algorithms use experiences in the form of perceptions or
perception action pairs to improve their performance. The nature of
experiences available varies with applications. Some common situations are
described below.
 Supervised learning: In supervised learning a teacher or oracle is available
which provides the desired action corresponding to a perception. A set of
perception-action pairs constitutes what is called a training set. Examples
include an automated vehicle where a set of vision inputs and the
corresponding steering actions are available to the learner.
 The problem of supervised learning involves learning a function from
examples of its inputs and outputs.
 In the example below, cases (1), (2) and (3) are all instances of the supervised learning problem.

10
 In (1), the agent learns a condition-action rule for braking; this is a function from states to a Boolean output (to brake or not to brake).
 In (2), the agent learns a function from images to a Boolean output (whether the image contains a bus).
 In (3), the theory of braking is a function from states and
braking actions to, say, stopping distance in feet. Notice
that in cases (1) and (2), a teacher provided the correct
output value of the examples; in the third, the output value
was available directly from the agent's percepts.

11
 E.g., an agent training to become a taxi driver. Every time
the instructor shouts "Brake!" the agent can learn a
condition-action rule for when to brake (component 1).
 By seeing many camera images that it is told contain
buses, it can learn to recognize them (2).
 By trying actions and observing the results (for example, braking hard on a wet road), it can learn the effects of its actions (3).
 Then, when it receives no tip from passengers who have
been thoroughly shaken up during the trip, it can learn a
useful component of its overall utility function (4).

12
 Unsupervised learning: In unsupervised learning no teacher is
available. The learner only discovers persistent patterns in the
data consisting of a collection of perceptions. This is also called
exploratory learning. Finding out malicious network attacks
from a sequence of anomalous data packets is an example of
unsupervised learning.
 Involves learning patterns in the input when no
specific output values are supplied. For example, a taxi agent might
gradually develop a concept of "good traffic days" and "bad traffic
days" without ever being given labelled examples
of each.
 A purely unsupervised learning agent cannot learn what to do,
because it has no information as to what constitutes a correct action
or a desirable state
13
Taxonomy of Learning Systems … Cont’d
 Active learning: Here not only is a teacher available, but the learner also has the
freedom to ask the teacher for suitable perception-action example pairs
which will help the learner to improve its performance. Consider a news
recommender system which tries to learn users' preferences and categorize
news articles as interesting or uninteresting to the user. The system may
present a particular article (of which it is not sure) to the user and ask
whether it is interesting or not.
 Reinforcement learning: In reinforcement learning a teacher is available, but instead of directly providing the desired action corresponding to a perception, the teacher returns a reward or punishment to the learner for its action. Examples include a robot in an unknown terrain which gets a punishment when it hits an obstacle and a reward when it moves smoothly.
 In order to design a learning system the designer has to make the
following choices based on the application.
14
Taxonomy of Learning Systems … Cont’d

15
Mathematical formulation of the inductive
learning problem
 Extrapolate from a given set of examples so that we can make accurate
predictions about future examples.
 Supervised versus unsupervised learning: We want to learn an unknown
function f(x) = y, where x is an input example and y is the desired output.
Supervised learning implies we are given a set of (x, y) pairs by a
"teacher." Unsupervised learning means we are only given the xs.
 In either case, the goal is to estimate f.

16
Mathematical formulation of the inductive
learning problem … Cont’d
 Inductive Bias
 Inductive learning is an inherently conjectural process because any
knowledge created by generalization from specific facts cannot be proven
true; it can only be proven false. Hence, inductive inference is falsity
preserving, not truth preserving.
 To generalize beyond the specific training examples, we need constraints or
biases on what f is best. That is, learning can be viewed as searching the
Hypothesis Space H of possible f functions.
 A bias allows us to choose one f over another one
 A completely unbiased inductive algorithm could only memorize the training
examples and could not say anything more about other unseen examples.
 Two types of biases are commonly used in machine learning:
 Restricted hypothesis space bias: allow only certain types of f functions, not arbitrary ones
17
Mathematical formulation of the inductive
learning problem … Cont’d
 Preference bias: define a metric for comparing fs so as to determine whether one is better than another
 Inductive Learning Framework
 Raw input data from sensors are preprocessed to obtain a feature vector,
x, that adequately describes all of the relevant features for classifying
examples.
 Each x is a list of (attribute, value) pairs. For example,
 x = (Person = Sue, Eye-Color = Brown, Age = Young, Sex = Female)
 The number of attributes (also called features) is fixed (positive, finite).
Each attribute has a fixed, finite number of possible values.
 Each example can be interpreted as a point in an n-dimensional feature
space, where n is the number of attributes.

18
Inductive learning hypothesis
 The learning task is to determine a hypothesis h identical to the target concept c over the entire set of instances X.
 The only information available about c is its value over the training examples.
 Inductive learning algorithms can at best guarantee that the
output hypothesis fits the target concept over the training data
 Lacking further information, the assumption is that the best
hypothesis regarding unseen instances is the hypothesis that best
fits the observed training data
 Any hypothesis found to approximate the target function well
over a sufficiently large set of training examples will also
approximate the target function well over other unobserved
examples (this is the inductive learning hypothesis)
19
Learning From Observations
 Concept Learning
 Definition:
 The problem is to learn a function mapping examples into two classes: positive
and negative. We are given a database of examples already classified as positive
or negative. Concept learning: the process of inducing a function mapping input
examples into a Boolean output.
 Tom M. Mitchell:
 Inferring a Boolean-valued function from training examples of its input and
output.
 Can be formulated as a problem of searching through a predefined space of
potential hypotheses for the hypothesis that best fits the training examples.
 Examples:
 Classifying objects in astronomical images as stars or galaxies
 Classifying animals as vertebrates or invertebrates

20
Learning From Observations …
Cont’d
 Example: Classifying Mushrooms
 Class of Tasks: Predicting poisonous mushrooms
 Performance: Accuracy of classification
 Experience: Database describing mushrooms with their
class
 Knowledge to learn: Function mapping mushrooms to
{0,1} where 0:not-poisonous and 1:poisonous
 In general, any concept learning task can be described by
 The set of instances over which the target function is defined
 The target function
 The set of candidate hypotheses considered by the learner
 The set of available training examples
21
Learning From Observations … Cont’d
 Representation of target knowledge: a conjunction of constraints on attribute values.
 Learning mechanism: candidate-elimination
 Representation of instances:
 Features:
 color {red, brown, gray}
 size {small, large}
 shape {round, elongated}
 land {humid, dry}
 air humidity {low, high}
 texture {smooth, rough}
 Input and Output Spaces:
 X : The space of all possible examples (input space).
 Y: The space of classes (output space).
 An example in X is a feature vector x.
 For instance: x = (red, small, elongated, humid, low, rough)
 X is the cross product of all feature values.
 Only a small subset of instances is available in the database of examples.
22
Learning From Observations … Cont’d

Training Examples:
D : The set of training examples.
D is a set of pairs { (x, c(x)) }, where c is the target concept (the concept or function to be learned). X is the universe of discourse, i.e., the set of all possible instances. In the current example, X is the set of all possible mushrooms, each represented by the attributes color, size, shape, land, air humidity and texture.
 c can be any Boolean-valued function defined over the instances x; that is, c : X → {0,1}. In the current example the target concept is whether a mushroom is poisonous (i.e., c(x) = 1 if the mushroom is poisonous and c(x) = 0 if it is not poisonous)
23
Learning From Observations …
Cont’d
 In learning the target concept, the learner is presented with a set of training examples, each consisting of an instance x from X (a possible mushroom), together with its target concept value c(x).
 Instances for which c(x)=1 are called positive examples, or members of the target
concept.
 Instances for which c(x)=0 are called negative examples or non members of the
target concept.
 We often write the ordered pair (x, c(x)) to describe the training example consisting
of the instance x and its target concept value c(x).
 D is the set of available training examples
 Given a set of training examples of the target concept c, the problem faced by the learner is to hypothesize, or estimate, c.
 H is the set of all possible hypotheses that the learner may consider regarding the identity of the target concept.
 Each h in H represents a Boolean-valued function defined over X; that is, h : X → {0,1}.
 The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X.

24
Learning From Observations …
Cont’d
 Example of D:
• ((red,small,round,humid,low,smooth), poisonous)
• ((red,small,elongated,humid,low,smooth), poisonous)
• ((gray,large,elongated,humid,low,rough), not-poisonous)
• ((red,small,elongated,humid,high,rough), poisonous)

25
Learning From Observations … Cont’d
 Hypothesis Representation
 Any hypothesis h is a function from X to Y
 h : X → Y
 We will explore the space of conjunctions.
 Special symbols:
 ? : any value is acceptable
 0 : no value is acceptable
 Consider the following hypotheses:
 (?,?,?,?,?,?): all mushrooms are poisonous
 (0,0,0,0,0,0): no mushroom is poisonous

26
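The conjunctive representation above can be made concrete in a few lines. The following is a sketch of my own (not code from the slides), assuming hypotheses and examples are stored as 6-tuples in the attribute order color, size, shape, land, air humidity, texture:

```python
def covers(h, x):
    """Return True if hypothesis h classifies example x as positive."""
    return all(hc == '?' or hc == xc for hc, xc in zip(h, x))

x = ('red', 'small', 'elongated', 'humid', 'low', 'rough')
print(covers(('?',) * 6, x))                            # True: everything is positive
print(covers(('0',) * 6, x))                            # False: nothing is positive
print(covers(('red', '?', '?', 'humid', '?', '?'), x))  # True
```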
Learning From Observations … Cont’d
 Hypotheses Space:
 The space of all hypotheses is represented by H
 Let h be a hypothesis in H.
 Let x be an example of a mushroom.
 If h(x) = 1 then x is classified as poisonous; otherwise x is classified as not-poisonous.
 Our goal is to find the hypothesis, h*, that is very “close” to target
concept c.
 A hypothesis is said to “cover” those examples it classifies as positive.

27
Learning From Observations … Cont’d
 Assumptions:
 We will explore the space of all conjunctions.
 We assume the target concept falls within this space.
 A hypothesis close to target concept c obtained after seeing many
training examples will result in high accuracy on the set of unobserved
examples. (Inductive Learning Hypothesis)

28
Concept Learning as Search
 We will see how the problem of concept learning can
be posed as a search problem.
 We will illustrate that there is a general to specific
ordering inherent to any hypothesis space.
 Concept learning is a search task through a large space of hypotheses implicitly defined by the hypothesis representation.
 The goal of this search is to find the hypothesis that best
fits the training examples.

29
Concept Learning as Search
 General to specific ordering hypotheses:
 Consider these two hypotheses:
 h1 = (red,?,?,humid,?,?)
 h2 = (red,?,?,?,?,?)
 Consider the sets of instances that are classified positive by h1 and h2.
 h2 imposes fewer constraints on the instances, so it classifies more instances as positive.
 Any instance classified positive by h1 will also be classified positive by h2.
 We say h2 is more general than h1 because h2 classifies more instances than h1 and h1 is covered by h2.
 For example, consider the following hypotheses

30
Concept Learning as Search … Cont’d
 h1 is more general than h2 and h3.
 h2 and h3 are neither more specific nor more general than each other.
 Definitions: let hj and hk be Boolean-valued functions defined over X. Then hj is more_general_than_or_equal_to hk (written hj ≥ hk) iff (∀x ∈ X)[(hk(x) = 1) → (hj(x) = 1)].
 hj is (strictly) more general than hk (written hj > hk) iff (hj ≥ hk) and (hk ≥ hj) does not hold.
 In other words, given hypotheses hj and hk, hj is more_general_than_or_equal_to hk iff any instance that satisfies hk also satisfies hj.
 The >= relation imposes a partial ordering over the hypothesis space H
(reflexive, antisymmetric, and transitive).
 Any input space X defines then a lattice of hypotheses ordered according to the
general-specific relation:

31
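For conjunctive hypotheses this semantic definition reduces to a per-attribute check. A minimal sketch, reusing the tuple representation above (the empty constraint 0 satisfies nothing, so any hypothesis is more general than or equal to one built entirely of 0s):

```python
def more_general_or_equal(hj, hk):
    """True iff every instance that satisfies hk also satisfies hj."""
    return all(a == '?' or a == b or b == '0' for a, b in zip(hj, hk))

h1 = ('red', '?', '?', 'humid', '?', '?')
h2 = ('red', '?', '?', '?', '?', '?')
print(more_general_or_equal(h2, h1))   # True:  h2 >= h1
print(more_general_or_equal(h1, h2))   # False: h1 is strictly less general
```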
32
Algorithm to Find a Maximally-Specific Hypothesis(Find-S)
 By saying the structure is a partial (as opposed to total) order, we mean there may be pairs of hypotheses h1 and h2 such that neither h1 ≥ h2 nor h2 ≥ h1.
 How can we use the more_general_than partial ordering to organize the search for a
hypothesis consistent with observed training examples?
 Algorithm to search the space of conjunctions:
 Start with the most specific hypothesis
 Generalize the hypothesis when it fails to cover a positive example (we say that a hypothesis "covers" a positive example if it correctly classifies the example as positive)
 Find-S Algorithm:
1. Initialize h to the most specific hypothesis
2. For each positive training example X
 For each attribute constraint ai in h
 If the constraint ai is satisfied by X,
 then do nothing
 Else generalize ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
33
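As a concrete illustration, here is a runnable sketch of Find-S for this hypothesis space (my own rendering of the pseudocode above). It reproduces the mushroom trace worked through on the next two slides:

```python
def find_s(examples):
    n = len(examples[0][0])
    h = ('0',) * n                        # 1. most specific hypothesis
    for x, label in examples:
        if label != 'poisonous':          # negative examples are ignored
            continue
        # 2. generalize each constraint just enough to cover x
        h = tuple(xi if hi == '0' else (hi if hi == xi else '?')
                  for hi, xi in zip(h, x))
    return h                              # 3. output hypothesis h

D = [(('red', 'small', 'round', 'humid', 'low', 'smooth'), 'poisonous'),
     (('red', 'small', 'elongated', 'humid', 'low', 'smooth'), 'poisonous'),
     (('gray', 'large', 'elongated', 'humid', 'low', 'rough'), 'not-poisonous'),
     (('red', 'small', 'elongated', 'humid', 'high', 'rough'), 'poisonous')]

print(find_s(D))   # ('red', 'small', '?', 'humid', '?', '?')
```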
Algorithm to Find a Maximally-Specific Hypothesis
 Example:
 Let’s run the learning algorithm above with the following examples:
 ((red,small,round,humid,low,smooth), poisonous)
 ((red,small,elongated,humid,low,smooth), poisonous)
 ((gray,large,elongated,humid,low,rough), not-poisonous)
 ((red,small,elongated,humid,high,rough), poisonous)
 We start with the most specific hypothesis: h = (0,0,0,0,0,0)

34
Algorithm to Find a Maximally-Specific Hypothesis
 The first example comes and since the example is positive and h fails to cover it, we simply
generalize h to cover exactly this example:
 h = (red,small,round,humid,low,smooth)
 Hypothesis h basically says that the first example is the only positive example; all other examples are negative.
 Then comes examples 2: ((red,small,elongated,humid,low,smooth), poisonous)
 This example is positive. All attributes match hypothesis h except for attribute shape: it has the value
elongated, not round. We generalize this attribute using symbol ? yielding:
 h: (red,small,?,humid,low,smooth)
 The third example is negative and so we just ignore it.
 Why is it that we don’t need to be concerned with negative examples?
 Because the target concept c is also assumed to be in H and to be consistent with the training examples, c must be more_general_than_or_equal_to h. But c will never cover a negative example, thus neither will h. No revision to h will be required in response to any negative example.
 Upon observing the 4th example, hypothesis h is generalized to the following:
 h = (red,small,?,humid,?,?)
 h is interpreted as any mushroom that is red, small and found on humid land should be classified as
poisonous.
35
Algorithm to Find a Maximally-Specific Hypothesis

 The Find-S algorithm illustrates one way in which the more_general_than partial ordering can be used to organize the search for an acceptable hypothesis.

36
Algorithm to Find a Maximally-Specific Hypothesis
 The algorithm is guaranteed to find the hypothesis that is most specific and
consistent with the set of training examples.
 It takes advantage of the general-specific ordering to move on the
corresponding lattice searching for the next most specific hypothesis.
 Note that:
 There are many hypotheses consistent with the training data D. Why should we
prefer the most specific hypothesis?
 What would happen if the examples are not consistent? What would happen if
they have errors, noise?
 What if there is a hypothesis space H where one can find more than one maximally specific hypothesis h? The search over the lattice must then be different to allow for this possibility.

37
Algorithm to Find a Maximally-Specific Hypothesis
 The algorithm that finds the maximally specific hypothesis is limited in
that it only finds one of many hypotheses consistent with the training
data.
 • The Candidate Elimination Algorithm (CEA) finds ALL hypotheses
consistent with the training data.
 • CEA does that without explicitly enumerating all consistent hypotheses.
 • Applications:
 Chemical Mass Spectroscopy
 Control Rules for Heuristic Search

38
Candidate Elimination Algorithm
 Consistency vs Coverage
 A hypothesis h is consistent with a set of training examples D iff h(x) = c(x) for each example (x, c(x)) in D.
 In the following example, h1 covers a different set of examples than h2; h2 is consistent with training set D, while h1 is not consistent with training set D.

39
Version Space
 The Candidate-elimination algorithm represents the set of all
hypotheses consistent with the observed training examples
 The version space, denoted VS_H,D, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D.

40
Version Space … Cont’d
 The version space for the mushroom example is as follows:

The candidate elimination algorithm generates the entire version space.

41
Cont’d
 The Candidate-elimination algorithm represents the version space by
storing only its most general members G and its most specific members S.
 Given only these two sets S and G, it is possible to enumerate all members
of the version space as needed by generating the hypotheses that lie
between these two sets in the general-to-specific partial ordering over
hypotheses.
 The general boundary G, with respect to hypothesis space H and training
data D, is the set of maximally general members of H consistent with D.
 The specific boundary S, with respect to hypothesis space H and training
data D, is the set of minimally general (i.e., maximally specific) members
of H consistent with D.
 The version space is the set of hypotheses contained in G, plus those contained in S, plus those that lie between G and S in the partially ordered hypothesis space.

42
The Candidate-Elimination Algorithm
 The candidate elimination algorithm keeps two lists of hypotheses
consistent with the training data: (i) The list of most specific hypotheses
S and, (ii) The list of most general hypotheses G. This is enough to
derive the whole version space VS.
 Steps:
 1. Initialize G to the set of maximally general hypotheses in H
 2. Initialize S to the set of maximally specific hypotheses in H
 3. For each training example X do
 a) If X is positive: generalize S if necessary
 b) If X is negative: specialize G if necessary
 4. Output {G,S}
 Step (a) Positive examples

43
The Candidate-Elimination Algorithm
 If X is positive:
 Remove from G any hypothesis inconsistent with X
 For each hypothesis h in S not consistent with X
 Remove h from S
 Add all minimal generalizations of h consistent with X such that some
member of G is more general than h
 Remove from S any hypothesis more general than any other hypothesis in S
 Step (b) Negative examples
 If X is negative:
 Remove from S any hypothesis inconsistent with X
 For each hypothesis h in G not consistent with X
 Remove h from G
 Add all minimal specializations of h consistent with X such that some member of S is more specific than h
 Remove from G any hypothesis less general than any other hypothesis in G
44
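A compact sketch of the algorithm for the conjunctive hypothesis space used in this chapter, reusing covers() and more_general_or_equal() from the earlier sketches. The attribute domains must be supplied, since minimal specializations enumerate the alternative values; pruning of duplicate or non-boundary members is omitted for brevity:

```python
def min_generalize(h, x):
    """Minimal generalization of h that covers positive example x."""
    return tuple(xi if hi == '0' else (hi if hi == xi else '?')
                 for hi, xi in zip(h, x))

def min_specializations(g, x, domains):
    """Minimal specializations of g that exclude negative example x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, gi in enumerate(g) if gi == '?'
            for v in domains[i] if v != x[i]]

def candidate_elimination(examples, domains):
    S = [('0',) * len(domains)]                     # most specific boundary
    G = [('?',) * len(domains)]                     # most general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]      # drop inconsistent g
            S = [min_generalize(s, x) if not covers(s, x) else s for s in S]
            S = [s for s in S                       # keep s only if some g covers it
                 if any(more_general_or_equal(g, s) for g in G)]
        else:
            S = [s for s in S if not covers(s, x)]  # drop inconsistent s
            G = [gs for g in G
                 for gs in ([g] if not covers(g, x)
                            else min_specializations(g, x, domains))
                 if any(more_general_or_equal(gs, s) for s in S)]
    return S, G

# Usage on the mushroom data, with domains[i] the value set of attribute i:
# S, G = candidate_elimination([(x, y == 'poisonous') for x, y in D], domains)
```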
The Candidate-Elimination Algorithm
 The candidate elimination algorithm is guaranteed to converge to the right
hypothesis provided the following:
 a) No errors exist in the examples
 b) The target concept is included in the hypothesis space H
 If there exist errors in the examples:
 a) The right hypothesis would be inconsistent and thus eliminated.
 b) If the S and G sets converge to an empty space we have evidence that the
true concept lies outside space H.

45
Rule Induction and Decision Tree - I
 Decision Trees
 Decision trees are a class of learning models that are more robust to noise and more powerful than concept learning.
 Consider the problem of classifying a star based on some astronomical
measurements.
 It can naturally be represented by the following set of decisions on each measurement, arranged in a tree-like fashion

46
Decision Tree: Definition
 A decision-tree learning algorithm approximates a target concept using a
tree representation, where each internal node corresponds to an attribute,
and every terminal node corresponds to a class.
 There are two types of nodes:
 Internal node: splits into different branches according to the different values the corresponding attribute can take. Example: luminosity <= T1 or luminosity > T1.
 Terminal node: decides the class assigned to the example.

Decision tree representation


 Each internal node tests an attribute
 Each branch corresponds to attribute value
 Each leaf node assigns a classification

47
Cont’d
 When to Consider Decision Trees:
 Instances describable by attribute-value pairs.
 e.g. the attribute temperature has values hot, mild and cold
 Target function is discrete valued
 e.g. assigning a Boolean classification (yes or no) to each example
 Disjunctive hypothesis may be required
 Decision trees represent disjunctive expressions
 Possibly noisy training data
 Decision trees can be used when some training examples have unknown attribute values

48
Cont’d
 Decision tree learning has been applied to problems:
 Learning to classify medical patients by their disease
 Equipment malfunctions by their cause
 Loan applicants by their likelihood of defaulting on payments

49
Classifying Examples Using Decision
Tree
 To classify an example X we start at the root of the tree and check the value on X of the attribute tested at that node.
 We follow the branch corresponding to that value and jump to the next
node.
 We continue until we reach a terminal node and take that class as our
best prediction.

50
Cont’d…
 Decision trees adopt a DNF (Disjunctive Normal Form) representation.
 For a fixed class, every branch from the root of the tree to a terminal
node with that class is a conjunction of attribute values; different
branches ending in that class form a disjunction.
 Learned trees can also be re-represented as sets of if-then rules to
improve human readability.

51
Decision Tree Construction
 There are different ways to construct trees from data.
 We will concentrate on the top-down, greedy search approach:
 Basic idea:
 1. Choose the best attribute a* to place at the root of the tree.

52
Cont’d

53
Cont’d

54
Cont’d

55
Cont’d
 Steps:
 • Create a root for the tree
 • If all examples are of the same class or the number of examples is
below a threshold return that class
 • If no attributes available return majority class
 • Let a* be the best attribute
 • For each possible value v of a*
 • Add a branch below a* labeled “a* = v”
 • Let Sv be the subset of examples where attribute a* = v
 • Recursively apply the algorithm to Sv

56
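A sketch of these steps in Python (hypothetical names of my own; examples are (attribute-dict, label) pairs, and best_attribute is any splitting function, such as the information-gain measure defined on the following slides):

```python
from collections import Counter

def build_tree(examples, attributes, best_attribute, min_size=1):
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: one class left, too few examples, or no attributes available.
    if len(set(labels)) == 1 or len(examples) <= min_size or not attributes:
        return majority
    a = best_attribute(examples, attributes)      # choose the best attribute a*
    tree = {'attribute': a, 'branches': {}, 'default': majority}
    for v in {x[a] for x, _ in examples}:         # one branch per value of a*
        Sv = [(x, y) for x, y in examples if x[a] == v]
        tree['branches'][v] = build_tree(Sv, [b for b in attributes if b != a],
                                         best_attribute, min_size)
    return tree
```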
Rule Induction and Decision Tree - II
 Splitting Functions
 What attribute is the best to split the data? Let us remember some definitions from
information theory.
 What is the quantitative measure of the worth of an attribute?
 Information gain measures how well a given attribute separates the training examples according to their target classification
 In order to define information gain precisely, we start by defining a measure of
information theory called entropy
 A measure of uncertainty or entropy that is associated with a random variable X is defined as
 H(X) = - Σi pi log pi
 where the logarithm is in base 2.
 This is the “average amount of information or entropy of a finite complete probability
scheme”.
 We will use an entropy-based splitting function.
 Decision trees use this information gain measure to select among the candidate attributes at each step while growing the tree.
57
 Consider the previous example:
Cont’d

 Size divides the sample in two; humidity divides the sample in three.
 Size: S1 = { 6P, 0NP }, S2 = { 3P, 5NP }
 H(S1) = 0
 H(S2) = -(3/8)log2(3/8) - (5/8)log2(5/8)
 Humidity: S1 = { 2P, 2NP }, S2 = { 5P, 0NP }, S3 = { 2P, 3NP }
 H(S1) = 1
 H(S2) = 0
 H(S3) = -(2/5)log2(2/5) - (3/5)log2(3/5)

58
Cont’d
 Let us define information gain as follows:
 Information gain IG over attribute A: IG (A)
 IG(A) = H(S) - Σv (|Sv|/|S|) H(Sv)
 H(S) is the entropy of all examples. H(Sv) is the entropy of one subsample after
partitioning S based on all possible values of attribute A.
 Consider the previous example: We have,
H(S1) = 0
H(S2) = -(3/8)log2(3/8)
-(5/8)log2(5/8)
H(S) = -(9/14)log2(9/14)
-(5/14)log2(5/14)
|S1|/|S| = 6/14
|S2|/|S| = 8/14
The principle for decision tree construction may be stated as follows:
Order the splits (attribute and value of the attribute) in decreasing order of information gain.
59
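The numbers above can be checked with a few lines of Python (my own code; the class counts come from the size and humidity splits of the 14-example sample):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a sample with pos positive and neg negative examples."""
    total = pos + neg
    return -sum(p * log2(p) for p in (pos / total, neg / total) if p > 0)

def info_gain(parent, subsets):
    """IG = H(S) - sum over v of (|Sv|/|S|) * H(Sv)."""
    n = sum(p + q for p, q in subsets)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q)
                                  for p, q in subsets)

S = (9, 5)                                     # 9 poisonous, 5 not-poisonous
print(entropy(*S))                             # H(S)         ~ 0.940
print(info_gain(S, [(6, 0), (3, 5)]))          # IG(size)     ~ 0.395
print(info_gain(S, [(2, 2), (5, 0), (2, 3)]))  # IG(humidity) ~ 0.308
# Size has the larger gain, so it would be chosen as the split.
```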
Hypothesis space search in decision tree
learning
 ID3 can be characterized as searching a space of hypotheses for one that fits the training examples
 The hypothesis space searched by ID3 is the set of possible decision trees
 ID3 performs a simple-to-complex, hill-climbing search
through this hypothesis space, beginning with the empty
tree, then considering progressively more elaborate
hypotheses in search of a decision tree that correctly
classifies the training data.
 The evaluation function that guides this hill-climbing
search is the information gain measure

60
Cont’d
Capabilities and limitations of ID3 in terms of its search space and search
strategy
 ID3’s hypothesis space of all decision trees is a complete space of finite
discrete-valued functions, relative to the available attributes
 ID3 maintains only a single current hypothesis as it searches through the
space of decision trees. By determining only a single hypothesis, ID3 loses
the capabilities that follow from explicitly representing all consistent
hypotheses
 ID3 in its pure form performs no backtracking in its search. Once it selects an attribute to test at a particular level in the tree, it never backtracks to reconsider this choice. It is therefore susceptible to the usual risks of hill-climbing search without backtracking
 ID3 uses all training examples at each step in the search to make statistically based decisions regarding how to refine its current hypothesis. The resulting search is much less sensitive to errors in individual training examples

61
Decision Tree Pruning
 Practical issues while building a decision tree can be enumerated as
follows:
 1) How deep should the tree be?
 2) How do we handle continuous attributes?
 3) What is a good splitting function?
 4) What happens when attribute values are missing?
 5) How do we improve the computational efficiency?
 The depth of the tree is related to the generalization capability of the tree.
If not carefully chosen it may lead to overfitting.
 A tree overfits the data if we let it grow deep enough so that it begins to
capture “aberrations” in the data that harm the predictive power on
unseen examples:

62
Cont’d
 A hypothesis overfits the training examples if some other
hypothesis that fits the training examples less well
actually performs better over the entire distribution of
instances (i.e., including instances beyond the training set)

63
Cont’d

 There are two main solutions to overfitting in a decision tree:


 1) Stop growing the tree early, before it begins to overfit the data
 + In practice this solution is hard to implement because it is difficult to
estimate precisely when to stop growing the tree.
 2) Grow the tree until the algorithm stops even if the overfitting problem
shows up.
 Then prune the tree as a post-processing step.
 + This method has found great popularity in the machine learning
community.
64
Cont’d

65
Cont’d
 Here, the available data has been split into three subsets:
the training examples, the validation examples used for
pruning the tree, and a set of test examples used to provide
an unbiased estimate of accuracy over future unseen
examples
 Rule post-pruning
 In practice, it is a successful technique for finding high-accuracy hypotheses
 Rule post-pruning involves the following steps:

66
Cont’d
1. Infer the decision tree from the training set, growing the
tree until the training data is fit as well as possible and
allowing overfitting to occur
2. Convert the learned tree into an equivalent set of rules
by creating one rule for each path from the root node to
a leaf node.
3. Prune (generalize) each rule by removing any
preconditions that result in improving its estimated
accuracy
4. Sort the pruned rules by their estimated accuracy, and
consider them in this sequence when classifying
subsequent instances
67
Learning and Neural Networks - I
 Neural Networks
 Artificial neural networks are among the most powerful learning models.
 They have the versatility to approximate a wide range of complex functions
representing multi-dimensional input-output maps.
 Neural networks also have inherent adaptability, and can perform robustly
even in noisy environments.
 An Artificial Neural Network (ANN) is an information processing
paradigm that is inspired by the way biological nervous systems, such as
the brain, process information.
 The key element of this paradigm is the novel structure of the information
processing system.
 It is composed of a large number of highly interconnected simple
processing elements (neurons) working in unison to solve specific
problems.
68
Cont’d
 ANNs, like people, learn by example. An ANN is configured for a specific
application, such as pattern recognition or data classification, through a
learning process.
 Learning in biological systems involves adjustments to the synaptic
connections that exist between the neurons. This is true of ANNs as well.
 ANNs can process information at great speed owing to their massive parallelism.
 Neural networks, with their remarkable ability to derive meaning from
complicated or imprecise data, can be used to extract patterns and detect
trends that are too complex to be noticed by either humans or other
computer techniques.

69
Cont’d
 A trained neural network can be thought of as an "expert" in the category of
information it has been given to analyse.
 This expert can then be used to provide projections given new situations of
interest and answer "what if" questions. Other advantages include:
 1. Adaptive learning: An ability to learn how to do tasks based on the data given
for training or initial experience.
 2. Self-Organisation: An ANN can create its own organisation or representation
of the information it receives during learning time.
 3. Real Time Operation: ANN computations may be carried out in parallel, and
special hardware devices are being designed and manufactured which take
advantage of this capability.
 4. Fault Tolerance via Redundant Information Coding: Partial destruction of a
network leads to the corresponding degradation of performance. However, some
network capabilities may be retained even with major network damage.

70
Biological Neural Networks
 Much is still unknown about how the brain trains itself to process
information, so theories abound. In the human brain, a typical neuron
collects signals from others through a host of fine structures called
dendrites.
 The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches.
 At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurons.
 When a neuron receives excitatory input that is sufficiently large compared
with its inhibitory input, it sends a spike of electrical activity down its axon.
 Learning occurs by changing the effectiveness of the synapses so that the
influence of one neuron on another changes.

71
Cont’d

72
Artificial Neural Networks
 Artificial neural networks are represented by a set of nodes or units,
often arranged in layers, and a set of weighted directed links connecting
them. The nodes are equivalent to neurons, while the links denote
synapses. The nodes are the information processing units and the links
acts as communicating media.
 There are a wide variety of networks depending on the nature of
information processing carried out at individual nodes, the topology of
the links, and the algorithm for adaptation of link weights. Some of the
popular among them include:
 Perceptron: This consists of a single neuron with multiple inputs and a
single output. It has restricted information processing capability. The
information processing is done through a transfer function which is
either linear or non-linear.

73
Cont’d
 Multi-layered Perceptron (MLP): It has a layered architecture consisting
of input, hidden and output layers. Each layer consists of a number of
perceptrons. The output of each layer is transmitted to the input of nodes
in other layers through weighted links. Usually, this transmission is done
only to nodes of the next layer, leading to what are known as feed forward
networks. MLPs were proposed to extend the limited information
processing capabilities of simple perceptrons, and are highly versatile in
terms of their approximation ability. Training or weight adaptation is done
in MLPs using supervised backpropagation learning.
 Recurrent Neural Networks: RNN topology involves backward links
from output to the input and hidden layers. The notion of time is encoded
in the RNN information processing scheme. They are thus used in
applications like speech processing where inputs are time-sequence data.

74
Cont’d
 Self-Organizing Maps: SOMs or Kohonen networks have a grid
topology, with unequal grid weights. The topology of the grid provides a
low dimensional visualization of the data distribution. These are thus
used in applications which typically involve organization and human
browsing of a large volume of data. Learning is performed using a winner-take-all strategy in an unsupervised mode.

75
Appropriate problems for NN learning
 The BACKPROPAGATION algorithm is the most
commonly used ANN learning technique with the
following characteristics:
 Instances are represented by many attribute-value pairs
 The target function output may be discrete-valued, real-valued,
or a vector of several real- or discrete-valued attributes
 The training examples may contain errors.
 Long training times are acceptable
 Fast evaluation of the learned target function may be required
 The ability of humans to understand the learned target function
is not important

76
Neural Networks - II
 Perceptron
 Definition: It’s a step function based on a linear combination of real-
valued inputs. If the combination is above a threshold it outputs a 1,
otherwise it outputs a –1.

77
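As a minimal sketch, the unit just described can be written as follows, with w[0] playing the role of the threshold (bias) weight:

```python
def perceptron_output(w, x):
    """Step unit: 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1
```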
Cont’d
 A perceptron draws a hyperplane as the decision boundary over the (n-
dimensional) input space.

• A perceptron can learn only examples that are called “linearly separable”.
• These are examples that can be perfectly separated by a hyperplane.

78
Cont’d

 Perceptrons can learn many boolean functions: AND, OR, NAND, NOR, but
not XOR
 However, every boolean function can be represented with a perceptron
network that has two levels of depth or more.
 The weights of a perceptron implementing the AND function are shown below.
79
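Since the figure is not reproduced here, one standard choice of AND weights (an assumption, not necessarily the slide's exact values) is w0 = -0.8 and w1 = w2 = 0.5 over inputs in {0, 1}:

```python
# Reuses perceptron_output() from the sketch above.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output((-0.8, 0.5, 0.5), x))   # outputs 1 only for (1, 1)
```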
Cont’d

80
Perceptron Learning
 The learning problem is to determine a weight vector w that causes the
perceptron to produce the correct output for each training example
 The hypothesis space of a perceptron is the space of all weight vectors.
 The perceptron learning algorithm can be stated as below.
 1. Assign random values to the weight vector
 2. Apply the weight update rule to every training example
 3. Are all training examples correctly classified?
 a. Yes. Quit
 b. No. Go back to Step 2.
 There are two popular weight update rules.
 i) The perceptron rule, and
 ii) Delta rule

81
The Perceptron Rule
 For a new training example X = (x1, x2, …, xn), update each weight
according to this rule:
 wi = wi + Δwi
 Where Δwi = η (t-o) xi
 t: target output
 o: output generated by the perceptron
 η: constant called the learning rate (e.g., 0.1)
 Comments about the perceptron training rule:
 If the example is correctly classified the term (t-o) equals zero, and no update on
the weight is necessary.
 • If the perceptron outputs –1 and the real answer is 1, the weight is increased.

82
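A runnable sketch of the rule (my own rendering, reusing perceptron_output() from above; the loop terminates only for linearly separable data, as noted on the next slide):

```python
import random

def train_perceptron(examples, n_inputs, eta=0.1):
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]  # 1. random weights
    while True:
        errors = 0
        for x, t in examples:                   # 2. apply the update rule
            o = perceptron_output(w, x)
            if o != t:
                errors += 1
                w[0] += eta * (t - o)           # bias update (its input is 1)
                for i, xi in enumerate(x, start=1):
                    w[i] += eta * (t - o) * xi  # wi <- wi + eta*(t - o)*xi
        if errors == 0:                         # 3. all classified correctly? quit
            return w
```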
Cont’d
 If the perceptron outputs a 1 and the real answer is -1,
the weight is decreased.
 Provided the examples are linearly separable and a small value for η is used, the rule is proven to classify all training examples correctly (i.e., it is consistent with the training data).
 convergence guaranteed provided linearly separable
training examples and sufficiently small η

83
The Delta Rule
 perceptron rule fails if data is not linearly separable
 delta rule converges toward a best-fit approximation
 uses gradient descent to search the hypothesis space
 the thresholded perceptron output cannot be used, because it is not differentiable
 hence, an unthresholded linear unit is appropriate
 It is done by minimizing the error E = ½ Σi (ti – oi)²
 where the sum goes over all training examples. Here oi is the inner product WX and not sgn(WX) as with the perceptron rule.
 The idea is to find a minimum in the space of weights and the error
function E.

84
Cont’d
 The delta rule is as follows:
 For a new training example X = (x1, x2, …, xn),
update each weight according to this rule:
 wi = wi + Δwi
 Where Δwi = -η ∂E(W)/∂wi
 η: learning rate (e.g., 0.1) which determines the step
size in the gradient descent search.

85
Cont’d
 The negative sign indicates we want to move the weight vector in the direction that decreases E
 It is easy to see that
 ∂E(W)/∂wi = Σi (ti – oi)(-xi)
 So that gives us the following equation:
 Δwi = η Σi (ti – oi) xi
 What are the key practical difficulties in applying gradient descent? How do we alleviate these difficulties? (Reading Assignment)

86
Cont’d
 Gradient descent algorithm for training a linear unit:
 Gradient_Descent(training_examples, η)
 Initialize each wi, to some small random value
 Until the termination condition is met, Do
 Initialize each Δwi to zero
 For each (x, t) in training_examples, Do
 Input the instance x to the unit and compute the output o
 For each linear unit weight wi, Do
 Δwi = Δwi + η (t – o) xi
 For each linear unit weight wi, Do
 wi = wi + Δwi

87
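The same pseudocode as a runnable sketch for a single linear unit, with a fixed epoch count standing in for the unspecified termination condition:

```python
import random

def gradient_descent(training_examples, eta=0.05, epochs=100):
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]  # small random init
    for _ in range(epochs):
        dw = [0.0] * (n + 1)                  # initialize each delta-wi to zero
        for x, t in training_examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))  # linear output
            dw[0] += eta * (t - o)            # accumulate eta*(t - o)*xi
            for i, xi in enumerate(x, start=1):
                dw[i] += eta * (t - o) * xi
        w = [wi + dwi for wi, dwi in zip(w, dw)]  # wi <- wi + delta-wi
    return w
```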
Perceptron training Vs. Delta rule
 perceptron training rule:
 uses thresholded unit
 converges after a finite number of iterations
 output hypothesis classifies training data perfectly
 linear separability necessary
 delta rule:
 uses unthresholded linear unit
 converges asymptotically toward a minimum error hypothesis
 termination is not guaranteed
 linear separability not necessary

88
Neural Networks - III
 Multi-Layer Perceptrons
 In contrast to perceptrons, multilayer networks can learn not only
multiple decision boundaries, but the boundaries may be nonlinear.
 The typical architecture of a multi-layer perceptron (MLP) is shown
below.

• To make nonlinear partitions on the space we need to define each unit as a nonlinear
function (unlike the perceptron). One solution is to use the sigmoid unit.
• Another reason for using sigmoids is that they are continuous, unlike linear thresholds, and are thus differentiable at all points.
89
Cont’d
 Each layer is made up of units.
 The inputs to the network correspond to the attributes measured for each
training tuple
 The inputs are fed simultaneously into the units making up the input layer.
 These inputs pass through the input layer and are then weighted and fed
simultaneously to a second layer of “neuronlike” units, known as a hidden
layer.
 The outputs of the hidden layer units can be input to another hidden layer,
and so on.
 The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network’s prediction for given tuples
 It is a feed-forward network since none of the weights cycles back to an
input unit or to a previous layer’s output unit.
 It is fully connected in that each unit provides input to each unit in the next
forward layer.
90
Cont’d
 Each output unit takes, as input, a weighted sum of the
outputs from units in the previous layer
 It applies a nonlinear (activation) function to the weighted
input.
 Multilayer feed-forward networks, given enough hidden
units and enough training samples, can closely
approximate any function.

91
Cont’d

where: σ(WX) = 1 / (1 + e^(-WX))
Function σ is called the sigmoid or logistic function. It has the following property:
dσ(y)/dy = σ(y)(1 – σ(y))

92
Back-Propagation Algorithm(1)
 Multi-layered perceptrons can be trained using the back-
propagation algorithm described next.
 Backpropagation learns by iteratively processing a data set of
training tuples, comparing the network’s prediction for each tuple
with the actual known target value.
 The target value may be the known class label of the training tuple
(for classification problems) or a continuous value (for numeric
prediction)
 It employs gradient descent to attempt to minimize the
squared error between the network output values and the
target values for these outputs
 The goal is to learn the weights for all links in an interconnected
multilayer network.
93
Back-Propagation Algorithm(2)
 We begin by defining our measure of error:
 E = ½ Σd Σk∈outputs (tkd – Okd)²
 where outputs is the set of output units in the network, and tkd and
Okd are the target and output values associated with the kth output
unit and training example d.
 The idea is to use again a gradient descent over the space of
weights to find a global minimum (no guarantee).

94
Back-Propagation Algorithm(3)

95
Back-Propagation Algorithm(4)

96
Back-Propagation Algorithm(5)
 Forward Propagation
 Given example X, compute the output of every node until we reach the
output nodes:

97
Back-Propagation Algorithm(6)
 A. For each output node k compute the error.
 The error is propagated backward by updating the weights to reflect the error of
the network’s prediction :
δk = Ok (1-Ok)(tk – Ok)
Where Ok is the actual output of node/unit k, and tk is the known target value
of the given training tuple
 B. For each hidden unit h, calculate the error.
 To compute the error of a hidden layer unit h, the weighted sum of the errors of
the units connected to unit h in the next layer are considered. The error of a
hidden layer unit h is
δh = Oh (1-Oh) Σk Whk δk
where whk is the weight of the connection from unit h to a unit k in the next higher
layer, and δk is the error of unit k.

98
Back-Propagation Algorithm(7)
 C. Update each network weight:
Wji = Wji + ΔWji
 where ΔWji = η δj Xji (Xji and Wji are the input from node i to node j and the corresponding weight). ΔWji is the change in weight Wji
 η Is the learning rate, a constant typically having a value between 0.0 and
1.0
 The learning rate helps avoid getting stuck at a local minimum in decision
space (i.e., where the weights appear to converge, but are not the optimum
solution) and encourages finding the global minimum.
 If the learning rate is too small, then learning will occur at a very slow
pace.
 If the learning rate is too large, then oscillation between inadequate
solutions may occur.
 A rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far
99
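Steps A-C can be collected into one stochastic update for a single-hidden-layer network. This is a sketch under assumed data shapes (one weight list per unit, with the bias folded in as weight 0 against a constant input of 1), not the slides' exact code:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fire(layer, inputs):
    """Outputs of one layer; each unit is a weight list [bias, w1, ..., wn]."""
    return [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], inputs)))
            for w in layer]

def backprop_step(W_hidden, W_output, x, t, eta=0.5):
    # Forward propagation.
    h = fire(W_hidden, x)
    o = fire(W_output, h)
    # A. Output errors: delta_k = Ok(1 - Ok)(tk - Ok)
    d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # B. Hidden errors: delta_h = Oh(1 - Oh) * sum_k Whk * delta_k
    d_hid = [hj * (1 - hj) * sum(W_output[k][j + 1] * d_out[k]
                                 for k in range(len(W_output)))
             for j, hj in enumerate(h)]
    # C. Weight updates: Wji <- Wji + eta * delta_j * Xji
    for k, w in enumerate(W_output):
        w[0] += eta * d_out[k]
        for i, hi in enumerate(h):
            w[i + 1] += eta * d_out[k] * hi
    for j, w in enumerate(W_hidden):
        w[0] += eta * d_hid[j]
        for i, xi in enumerate(x):
            w[i + 1] += eta * d_hid[j] * xi
```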
Back-Propagation Algorithm(8)
 A momentum term, depending on the weight value at last iteration, may also
be added to the update rule as follows. At iteration n we have the following:
 ΔWji (n) = η δj Xji + αΔWji (n-1)
 Where α ( 0 <= α <= 1) is a constant called the momentum.
 1. It can carry the search through small local minima.
 2. It increases the speed along flat regions of the error surface.

 There are two strategies for applying the updates:
 Updating the weights after the presentation of each tuple (i.e., case updating)
 Updating the weights after all the tuples in the training set have been presented (i.e., epoch updating)

100
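As a sketch, the momentum term only requires remembering the previous iteration's weight change:

```python
def momentum_update(w, grad_step, prev_dw, alpha=0.9):
    """grad_step is eta*delta_j*Xji; returns the new weight and new delta-W."""
    dw = grad_step + alpha * prev_dw   # dW(n) = eta*delta*X + alpha*dW(n-1)
    return w + dw, dw
```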
Back-Propagation Algorithm(9)
 Termination conditions
 fixed number of iterations
 error falls below some threshold
 error on a separate validation set falls below some threshold
 The choice of termination criterion is an important one, because:
 too few iterations reduce error insufficiently
 too many iterations can lead to overfitting the data

101
Back-Propagation Algorithm(10)
 Remarks on Back-propagation
 1. It implements a gradient descent search over the weight space.
 2. It may become trapped in local minima.
 3. In practice, it is very effective.
 4. How to avoid local minima?
 a) Add momentum
 b) Use stochastic gradient descent.
 c) Use different networks with different initial values for the weights.
 Multi-layered perceptrons have high representational power. They can represent
the following:
 1. Boolean functions. Every boolean function can be represented with a network
having two layers of units.
 2. Continuous functions. All bounded continuous functions can also be
approximated with a network having two layers of units.
 3. Arbitrary functions. Any arbitrary function can be approximated with a network
with three layers of units.
102
Generalization and overfitting
 One obvious stopping point for backpropagation is to continue iterating
until the error is below some threshold; this can lead to overfitting.
 Backpropagation is susceptible to overfitting the training examples at the cost of decreased generalization accuracy over other unseen examples.

Overfitting can be avoided using the following strategies.


• Use a validation set and stop when the error is small on this set.
• Use 10-fold cross-validation.
• Use weight decay; the weights are decreased slowly on each iteration.
103
Applications of Neural Networks
 Neural networks have broad applicability to real world business problems. They have
already been successfully applied in many industries.
 Since neural networks are best at identifying patterns or trends in data, they are well
suited for prediction or forecasting needs including:
 • sales forecasting
 • industrial process control
 • customer research
 • data validation
 • risk management
 • target marketing
 Because of their adaptive and non-linear nature they have also been used in a number of control system application domains, like:
• process control in chemical plants
• unmanned vehicles
• robotics
• consumer electronics
 Neural networks are also used in a number of other applications which are too hard to model using classical techniques. These include computer vision, path planning and user modeling.
104
References
 Machine Learning, Tom Mitchell, McGraw-Hill International Editions, 1997 (Chapter 3).
 Data Mining: Concepts and Techniques, Jiawei Han, Micheline Kamber and Jian Pei, Third Edition.

105
