Unit-III - Chapter 7: Learning Rule Sets

The document discusses inductive learning methods, contrasting them with analytical learning, and outlines the Sequential Covering Algorithm for learning sets of rules. It details the process of learning rules through a greedy search approach, emphasizing the importance of rule accuracy and the challenges of suboptimal choices. Additionally, it introduces the FOIL algorithm for learning first-order rules, highlighting its similarities and differences with previous algorithms.

Inductive Learning: Learning Sets of Rules

Sequential Covering Algorithm - Learning Rule Sets - Learning First-Order
Rules - Learning Sets of First-Order Rules: FOIL
6.1 Introduction - Inductive vs. Analytical Learning

Inductive learning methods such as neural network and decision tree
learning require a certain number of training examples to achieve a given level
of generalization accuracy; they generalize by identifying features that
empirically distinguish positive from negative training examples. Inductive
learners therefore classify poorly when insufficient data is available.

In inductive learning, the learner is given a hypothesis space H from
which it must select an output hypothesis, and a set of training examples
D = {<x1, f(x1)>, ..., <xn, f(xn)>}, where f(xi) is the target value for the
instance xi. The desired output of the learner is a hypothesis h from H that
is consistent with these training examples.
Analytical learning uses prior knowledge and deductive reasoning to
augment the information provided by the training examples. In the analytical
learning method called explanation-based learning (EBL), prior knowledge is
used to analyze, or explain, how each observed training example satisfies the
target concept. This explanation is then used to distinguish the relevant features
of the training example from the irrelevant, so that examples can be generalized
based on logical rather than statistical reasoning. Explanation-based learning
has been successfully applied to learning search-control rules for a variety of
planning and scheduling tasks.

In analytical learning, the input to the learner includes the same
hypothesis space H and training examples D as for inductive learning. In
addition, the learner is provided an additional input: a domain theory B
consisting of background knowledge that can be used to explain observed
training examples. The desired output of the learner is a hypothesis h from H
that is consistent with both the training examples D and the domain theory B.
6.2 LEARNING SETS OF RULES – Sequential Covering algorithm
In many classification settings it is useful to learn the target function
represented as a set of if-then rules that jointly define the function. One way to
learn sets of rules is to first learn or construct a decision tree, then translate the
tree into an equivalent set of rules, one rule for each leaf node in the tree. A
second method is to use a genetic algorithm that encodes each rule set as a bit
string and uses genetic search operators to explore this hypothesis space.
The algorithms that directly learn rule sets from the training data are
1. Learning rules using Sequential Covering Algorithms
2. Learning First-Order Rules

SEQUENTIAL COVERING ALGORITHMS
Sequential covering algorithms are a family of algorithms for learning rule sets based
on the strategy of learning one rule, removing the data it covers, then iterating this
process. This sequential covering algorithm is one of the most widespread
approaches to learning disjunctive sets of rules. It reduces the problem of learning a
disjunctive set of rules to a sequence of simpler problems, each requiring that a
single conjunctive rule be learned. Because it performs a greedy search, formulating
a sequence of rules without backtracking, it is not guaranteed to find the smallest or
best set of rules that cover the training examples. The following algorithm is applied
to learn a set of rules from the given training examples.
Block diagram for the sequential covering algorithm

The sequential covering algorithm learns a disjunctive set of rules.
LEARN-ONE-RULE must return a single rule that covers at least some of the
Examples. PERFORMANCE is a user-provided subroutine to evaluate rule quality.
The covering algorithm learns rules until it can no longer learn a rule whose
performance is above the given Threshold.
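Stated as code, the covering loop is short. Below is a minimal Python sketch,
assuming a user-supplied learn_one_rule routine (returning None when no rule
can be found) and a user-supplied performance scoring function; rule objects
are assumed to expose a covers(example) test. All names are illustrative.

def sequential_covering(target_attribute, attributes, examples,
                        learn_one_rule, performance, threshold):
    """Learn one rule, remove the positive examples it covers, iterate."""
    all_examples = list(examples)
    remaining = list(examples)
    learned_rules = []
    rule = learn_one_rule(target_attribute, attributes, remaining)
    while remaining and rule is not None and \
          performance(rule, remaining) > threshold:
        learned_rules.append(rule)
        # Remove the examples covered by the newly learned rule.
        remaining = [ex for ex in remaining if not rule.covers(ex)]
        rule = learn_one_rule(target_attribute, attributes, remaining)
    # Sort rules by performance over the full data so the most accurate
    # rules are considered first when a new instance must be classified.
    learned_rules.sort(key=lambda r: performance(r, all_examples), reverse=True)
    return learned_rules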
The subroutine LEARN-ONE-RULE accepts a set of positive and negative
training examples as input, then outputs a single rule that covers many of the
positive examples and few of the negative examples. We require that this output
rule have high accuracy, but not necessarily high coverage. By high accuracy, we
mean the predictions it makes should be correct. By accepting low coverage, we
mean it need not make predictions for every training example.
Given this LEARN-ONE-RULE subroutine for learning a single rule, one
obvious approach to learning a set of rules is to invoke LEARN-ONE-RULE on all the
available training examples, remove any positive examples covered by the rule it
learns, then invoke it again to learn a second rule based on the remaining training
examples. This procedure can be iterated as many times as desired to learn a
disjunctive set of rules that together cover any desired fraction of the positive
examples. This is called a sequential covering algorithm because it sequentially
learns a set of rules that together cover the full set of positive examples. The final
set of rules can then be sorted so that more accurate rules will be considered first
when a new instance must be classified.

One effective approach to implementing LEARN-ONE-RULE is to organize the
hypothesis space search in the same general fashion as the ID3 algorithm, but to
follow only the most promising branch in the tree at each step. This is
illustrated below for the PlayTennis training examples, using the search tree
shown for learning sets of rules.

Rule generation process


The search begins by considering the most general rule precondition possible
(the empty test that matches every instance), then greedily adding the
attribute test that most improves rule performance measured over the training
examples. Once this test has been added, the process is repeated by greedily
adding a second attribute test, and so on. Like ID3, this process grows the
hypothesis by greedily adding new attribute tests until the hypothesis reaches
an acceptable level of performance. Unlike ID3, this implementation of
LEARN-ONE-RULE follows only a single descendant at each search step (the
attribute-value pair yielding the best performance) rather than growing a
subtree that covers all possible values of the selected attribute.
This approach to implementing LEARN-ONE-RULE performs a general-to-specific
search through the space of possible rules, in search of a rule with high
accuracy, though perhaps incomplete coverage of the data.

The general-to-specific search suggested above for the LEARN-ONE-RULE
algorithm is a greedy depth-first search with no backtracking. As with any
greedy search, there is a danger that a suboptimal choice will be made at any
step.
To reduce this risk, we can extend the algorithm to perform a beam
search; that is, a search in which the algorithm maintains a list of the k best
candidates at each step, rather than a single best candidate. On each search step,
descendants (specializations) are generated for each of these k best candidates,
and the resulting set is again reduced to the k most promising members. Beam
search keeps track of the most promising alternatives to the current top-rated
hypothesis, so that all of their successors can be considered at each search step.
This general to specific beam search algorithm for learning sets of rules from the
training examples is given below.
LEARN-ONE-RULE(Target_attribute, Attributes, Examples, k)
Returns a single rule that covers some of the Examples.
Conducts a general-to-specific greedy beam search for the best
rule, guided by the PERFORMANCE metric.
• Initialize Best_hypothesis to the most general hypothesis ∅
• Initialize Candidate_hypotheses to the set {Best_hypothesis}
• While Candidate_hypotheses is not empty, Do
  1. Generate the next more specific candidate hypotheses: specialize each
     member of Candidate_hypotheses by adding one attribute-value constraint
     (a = v) occurring in Examples, removing duplicates and inconsistent
     hypotheses
  2. Update Best_hypothesis to any new candidate h for which
     PERFORMANCE(h, Examples, Target_attribute) exceeds that of the current
     Best_hypothesis
  3. Update Candidate_hypotheses to the k best new candidates, according
     to PERFORMANCE
• Return a rule of the form "IF Best_hypothesis THEN prediction", where
  prediction is the most frequent value of Target_attribute among the
  Examples that match Best_hypothesis
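A minimal Python sketch of this beam search follows, assuming each example is
a dict mapping attribute names (including the target attribute) to values, and
that performance is a user-supplied scoring function such as one of the
metrics described in the next section; all names are illustrative.

from collections import Counter

def learn_one_rule(target_attribute, attributes, examples, k, performance):
    """General-to-specific beam search for one conjunctive rule.
    A hypothesis is a dict of attribute-value constraints (preconditions)."""
    # Every constraint (attribute = value) that occurs in the data.
    constraints = {(a, ex[a]) for ex in examples for a in attributes}
    best = {}                # the empty, most general precondition
    candidates = [dict()]    # the current beam
    while candidates:
        # Specialize each beam member by one additional constraint.
        new_candidates = [dict(h, **{a: v}) for h in candidates
                          for (a, v) in constraints if a not in h]
        for h in new_candidates:
            if performance(h, examples, target_attribute) > \
               performance(best, examples, target_attribute):
                best = h
        # Keep only the k most promising hypotheses for the next round;
        # the loop ends once every hypothesis is maximally specific.
        new_candidates.sort(
            key=lambda h: performance(h, examples, target_attribute),
            reverse=True)
        candidates = new_candidates[:k]
    # The rule predicts the most common target value among matched examples.
    matched = [ex for ex in examples
               if all(ex[a] == v for a, v in best.items())]
    prediction = Counter(ex[target_attribute]
                         for ex in matched).most_common(1)[0][0]
    return best, prediction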
Methods for Calculation of PERFORMANCE:

1. Relative frequency:
   nc / n, where n is the number of examples matched by the rule and nc is the
   number of these that the rule classifies correctly.

2. m-estimate of accuracy:
   (nc + m*p) / (n + m)
   • p: the prior probability that a randomly drawn example will have the
     classification assigned by the rule (e.g., if 12 out of 100 examples have
     the value predicted by the rule, then p = 0.12)
   • m: the weight, i.e., the equivalent number of examples used for weighting
     this prior p (with m = 0 the m-estimate reduces to the relative frequency)

3. Entropy (PERFORMANCE is taken as the negative entropy of the set of
   examples matched by the rule, so that purer matched sets score higher)
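These measures translate directly into code. The sketch below is illustrative
(function names are not from any standard library); the entropy variant is
returned negated so that, like the other two, higher scores mean better rules.

import math

def relative_frequency(n_c, n):
    """Fraction of the n examples matched by the rule that it gets right."""
    return n_c / n

def m_estimate(n_c, n, p, m):
    """m-estimate of accuracy: shrinks the relative frequency toward the
    prior p; m controls how heavily the prior is weighted."""
    return (n_c + m * p) / (n + m)

def neg_entropy(class_counts):
    """Negative entropy of the class distribution over the matched examples;
    values closer to 0 indicate a purer matched set."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return sum(q * math.log2(q) for q in probs)

# Example: a rule matches 20 examples and classifies 16 correctly,
# with prior accuracy p = 0.12 and prior weight m = 5.
print(relative_frequency(16, 20))       # 0.8
print(m_estimate(16, 20, p=0.12, m=5))  # (16 + 0.6) / 25 = 0.664
print(neg_entropy([16, 4]))             # about -0.722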
Working on the Algorithm:
The algorithm builds a set of 'ordered rules', i.e., a decision list.
Step 1 – Create an empty learned rule list, 'R'.
Step 2 – Run the 'Learn-One-Rule' algorithm: it extracts the best rule for a
particular class 'y', where a rule has the general form shown below.
General Form of Rule

In the beginning,
Step 2.a – training examples that belong to class 'y' are treated as
positive examples.
Step 2.b – training examples that do not belong to class 'y' are treated as
negative examples.
Step 3 – The rule becomes 'desirable' when it covers a majority of the
positive examples.
Step 4 – Once this rule is obtained, delete all the training data covered by
the rule (i.e., when the rule is applied to the dataset, the positive examples
it covers are removed).
Step 5 – The new rule is added to the bottom of the learned rule list, 'R'.
Training examples
Method I :
Form a rule based on these positive samples and remove those from training examples.
Step 1: a) forming rule b) removal of +ve samples covered by rule

Rule R1

STEP2: Rule R2
STEP3 : Rule R3

Rule List: R1, R2, R3 → forwarded to decision list


Method II :
Form a rule based on these positive samples and remove those from training examples.
Step 1: a) forming rule b) removal of +ve samples covered by rule

STEP2:
STEP3 :

STEP4:

Rule List: R1, R2, R3, R4 → forwarded to decision list
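For illustration, classifying with the resulting ordered rule list might look
like the following sketch, where each rule is a (preconditions, prediction)
pair and the rules are hypothetical rules over the PlayTennis attributes
mentioned earlier, not rules taken from the figures above.

def classify(decision_list, example, default=None):
    """Apply an ordered rule list: the first rule whose preconditions
    all match the example decides its class; otherwise use a default."""
    for preconditions, prediction in decision_list:
        if all(example.get(attr) == value
               for attr, value in preconditions.items()):
            return prediction
    return default

# Hypothetical rules R1..R3 in the order they were learned.
rules = [({"Outlook": "Overcast"}, "Yes"),
         ({"Humidity": "High"}, "No"),
         ({"Wind": "Weak"}, "Yes")]
print(classify(rules, {"Outlook": "Sunny",
                       "Humidity": "High",
                       "Wind": "Weak"}))  # "No" (R2 fires first)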


Few remarks on the above LEARN-ONE-RULE algorithm are -
• Each hypothesis considered in the main loop of the algorithm is a conjunction of
attribute-value constraints. Each of these conjunctive hypotheses corresponds to a
candidate set of preconditions for the rule to be learned and is evaluated by the
entropy of the examples it covers.
• The search considers increasingly specific candidate hypotheses until it reaches
a maximally specific hypothesis that contains all available attributes. The rule that
is output by the algorithm is the rule encountered during the search whose
PERFORMANCE is greatest, not necessarily the final hypothesis generated in the
search. The postcondition for the output rule is chosen only in the final step of
the algorithm, after its precondition has been determined.
• The algorithm constructs the rule postcondition to predict the value of the
target attribute that is most common among the examples covered by the rule
precondition.
• Despite the use of beam search to reduce the risk, the greedy search may still
produce suboptimal rules. However, even when this occurs, the
SEQUENTIAL-COVERING algorithm can still learn a collection of rules that together
cover the training examples, because it repeatedly calls LEARN-ONE-RULE on the
remaining uncovered examples.
6.3 LEARNING FIRST-ORDER RULES
The previous algorithms learn sets of propositional (i.e., variable-free) rules
from samples. Now consider learning rules that contain variables, in particular
learning first-order Horn theories, which are much more expressive than
propositional rules. Inductive learning of first-order rules or theories is often
referred to as inductive logic programming (or ILP for short), because this process
can be viewed as automatically inferring PROLOG programs from examples.
PROLOG is a general-purpose, Turing-equivalent programming language in which
programs are expressed as collections of Horn clauses.
First-Order Horn Clauses
As an example of first-order rule sets, consider the following two rules that jointly
describe the target concept Ancestor. Here we use the predicate Parent(x, y) to
indicate that y is the mother or father of x, and the predicate Ancestor(x, y) to
indicate that y is an ancestor of x related by an arbitrary number of family
generations.

IF Parent(x, y) THEN Ancestor(x, y)
IF Parent(x, z) ∧ Ancestor(z, y) THEN Ancestor(x, y)

These two rules compactly describe a recursive function that would be very
difficult to represent using a decision tree or other propositional representation.
One way to see the representational power of first-order rules is to
consider the general-purpose programming language PROLOG. In PROLOG,
programs are sets of first-order rules such as the two shown above (rules of this
form are also called Horn clauses). In fact, when stated in a slightly different
syntax the above rules form a valid PROLOG program for computing the Ancestor
relation. In this light, a general purpose algorithm capable of learning such rule
sets may be viewed as an algorithm for automatically inferring PROLOG programs
from examples. In practice, learning systems based on first-order representations
have been successfully applied to problems such as learning which chemical
bonds fragment in a mass spectrometer, learning which chemical substructures
produce mutagenic activity, and learning to design finite element meshes to
analyze stresses in physical structures. In each of these applications, the
hypotheses that must be represented involve relational assertions that can be
conveniently expressed using first-order representations, while they are very
difficult to describe using propositional representations.
Terminology in First-Order Logic (First-Order Horn Clauses)
To see the advantages of first-order representations over propositional
representations, consider the task of learning the simple target concept Daughter
(x, y), defined over pairs of people x and y. The value of Daughter(x, y) is True when
x is the daughter of y, and False otherwise. Suppose each person in the data is
described by the attributes Name, Mother, Father, Male, Female. Hence, each
training example will consist of the description of two people in terms of these
attributes, along with the value of the target attribute Daughter. For example, the
following is a positive example in which Sharon is the daughter of Bob:

<Name1 = Sharon, Mother1 = Louise, Father1 = Bob, Male1 = False,
Female1 = True, Name2 = Bob, Mother2 = Nora, Father2 = Victor,
Male2 = True, Female2 = False, Daughter1,2 = True>
where the subscript on each attribute name indicates which of the two persons is
being described. Now if we were to collect a number of such training examples for
the target concept Daughter1,2 and provide them to a propositional rule learner
such as CN2 or C4.5, the result would be a collection of very specific rules such as

IF (Father1 = Bob) ∧ (Name2 = Bob) ∧ (Female1 = True) THEN Daughter1,2 = True

Although correct, this rule is so specific that it will rarely, if ever, be useful in
classifying future pairs of people.
The problem is that propositional representations offer no general way to describe
the essential relations among the values of the attributes. In contrast, a program
using first-order representations could learn the following general rule, where x and
y are variables that can be bound to any person:

IF Father(y, x) ∧ Female(y) THEN Daughter(x, y)
First-order Horn clauses may also refer to variables in the preconditions that do not
occur in the postconditions. For example, one rule for GrandDaughter might be

IF Father(y, z) ∧ Mother(z, x) ∧ Female(y) THEN GrandDaughter(x, y)
Note that the variable z in this rule, which refers to the father of y, is not
present in the rule postconditions. Whenever such a variable occurs only in the
preconditions, it is
assumed to be existentially quantified; that is, the rule preconditions are satisfied as
long as there exists at least one binding of the variable that satisfies the
corresponding literal.
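As an illustration of this existential semantics, the sketch below tests
whether some binding of the unbound variables satisfies a rule's preconditions
against a set of ground facts. The facts, the tuple encoding of literals, and
the convention that lowercase strings are variables are all assumptions made
for this example.

from itertools import product

def satisfies(preconditions, binding, facts, constants):
    """True if SOME extension of `binding` to the unbound (existentially
    quantified) variables makes every precondition literal a known fact."""
    free = sorted({arg for _, args in preconditions for arg in args
                   if arg.islower() and arg not in binding})
    for values in product(constants, repeat=len(free)):
        b = dict(binding, **dict(zip(free, values)))
        if all((pred, tuple(b.get(a, a) for a in args)) in facts
               for pred, args in preconditions):
            return True
    return False

# Illustrative ground facts, read as "the P of x is y".
facts = {("Father", ("Sharon", "Bob")), ("Mother", ("Bob", "Nora")),
         ("Female", ("Sharon",))}
constants = ["Sharon", "Bob", "Nora"]
# GrandDaughter(x, y) <- Father(y, z) ^ Mother(z, x) ^ Female(y)
pre = [("Father", ("y", "z")), ("Mother", ("z", "x")), ("Female", ("y",))]
print(satisfies(pre, {"x": "Nora", "y": "Sharon"},
                facts, constants))  # True: the binding z = Bob works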
6.4 LEARNING SETS OF FIRST-ORDER RULES using FOIL Algorithm –
( First-Order Inductive Learner (FOIL) Algorithm)
Consider a program called FOIL that employs an approach very similar to the
SEQUENTIAL-COVERING and LEARN-ONE-RULE algorithms. In fact, the FOIL
program is the natural extension of these earlier algorithms to first-order
representations. Formally, the hypotheses learned by FOIL are sets of first-order
rules, where each rule is similar to a Horn clause with two exceptions.
• First, the rules learned by FOIL are more restricted than general Horn clauses,
because the literals are not permitted to contain function symbols (this reduces
the complexity of the hypothesis space search).
• Second, FOIL rules are more expressive than Horn clauses, because the literals
appearing in the body of the rule may be negated.
Sequential Covering vs. FOIL
Similarity: both learn one rule at a time and remove the covered positive
examples from the training examples. Difference: FOIL learns first-order rules
and seeks only rules that predict when the target literal is True.

The basic FOIL algorithm is given below.

FOIL(Target_predicate, Predicates, Examples)
• Pos ← those Examples for which the Target_predicate is True
• Neg ← those Examples for which the Target_predicate is False
• Learned_rules ← { }
• While Pos is not empty, Do (learn a NewRule)
  • NewRule ← the rule that predicts Target_predicate with no preconditions
  • NewRuleNeg ← Neg
  • While NewRuleNeg is not empty, Do (add a new literal to specialize NewRule)
    1. Candidate_literals ← generate candidate new literals for NewRule,
       based on Predicates
    2. Best_literal ← the candidate literal with maximum Foil_Gain(L, NewRule)
    3. Add Best_literal to the preconditions of NewRule
    4. NewRuleNeg ← the subset of NewRuleNeg that satisfies the NewRule
       preconditions
  • Learned_rules ← Learned_rules + NewRule
  • Pos ← Pos − {members of Pos covered by NewRule}
• Return Learned_rules

• The outer loop corresponds to a variant of the SEQUENTIAL-COVERING
algorithm: it learns new rules one at a time, removing the positive examples
covered by the latest rule before attempting to learn the next rule.
• The inner loop corresponds to a variant of the earlier LEARN-ONE-RULE
algorithm, extended to accommodate first-order rules. In particular, FOIL seeks only rules
that predict when the target literal is True, whereas our earlier algorithm would
seek both rules that predict when it is True and rules that predict when it is False.
Also, FOIL performs a simple hill climbing search rather than a beam search.
The hypothesis space search performed by FOIL is best understood by
viewing it hierarchically. Each iteration through FOIL'S outer loop adds a new rule
to its disjunctive hypothesis, Learned_Rules.
The effect of each new rule is to generalize the current disjunctive
hypothesis (i.e., to increase the number of instances it classifies as positive), by
adding a new disjunct. Viewed at this level, the search is a specific-to-general
search through the space of hypotheses, beginning with the most specific empty
disjunction and terminating when the hypothesis is sufficiently general to cover
all positive training examples. The inner loop of FOIL performs a finer-grained
search to determine the exact definition of each new rule.
This inner loop searches a second hypothesis space, consisting of conjunctions of
literals, to find a conjunction that will form the preconditions for the new rule.
Within this hypothesis space, it conducts a general-to-specific, hill-climbing
search, beginning with the most general preconditions possible (the empty
precondition), then adding literals one at a time to specialize the rule until it
avoids all negative examples.
The two most substantial differences between FOIL and the earlier
SEQUENTIAL-COVERING and LEARN-ONE-RULE algorithms follow from the
requirement that FOIL accommodate first-order rules. These differences are:

1. In its general-to-specific search to learn each new rule, FOIL employs
different detailed steps to generate candidate specializations of the rule. This
difference follows from the need to accommodate variables in the rule
preconditions.

2. FOIL employs a PERFORMANCE measure, Foil_Gain. This difference
follows from the need to distinguish between different bindings of the rule
variables and from the fact that FOIL seeks only rules that cover positive
examples.

The following considers these two differences in greater detail.


6.4.1 Generating Candidate Specializations in FOIL
To generate candidate specializations of the current rule, FOIL generates a
variety of new literals, each of which may be individually added to the rule
preconditions. More precisely, suppose the current rule being considered
is

P(x1, x2, ..., xk) ← L1 ∧ L2 ∧ ... ∧ Ln

where L1 ... Ln are literals forming the current rule preconditions and
where P(x1, x2, . . . , xk) is the literal that forms the rule head, or post
conditions. FOIL generates candidate specializations of this rule by
considering new literals Ln+1 that fit one of the following forms:
• Q(v1, ..., vr), where Q is any predicate name occurring in Predicates and
where the vi are either new variables or variables already present in the
rule. At least one of the vi in the created literal must already exist as a
variable in the rule.
• Equal(xj, xk), where xj and xk are variables already present in the rule.
• The negation of either of the above forms of literals.
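A rough Python sketch of this candidate-literal generator is shown below. The
cap on the number of fresh variables, the tuple encoding of literals, and the
"NOT " prefix for negation are illustrative choices, not part of the FOIL
specification; a real implementation would also filter trivial literals such as
Father(x, x).

from itertools import product

def candidate_literals(rule_vars, predicates):
    """Generate FOIL's candidate specializations of the current rule.
    `rule_vars` are variables already in the rule; `predicates` maps each
    predicate name to its arity. Each new literal must reuse at least one
    existing variable, and every positive form is also offered negated."""
    # Illustrative cap: allow as many fresh variables as existing ones.
    new_vars = ["v%d" % i for i in range(len(rule_vars))]
    pool = list(rule_vars) + new_vars
    literals = []
    for pred, arity in predicates.items():
        for args in product(pool, repeat=arity):
            if any(a in rule_vars for a in args):  # reuse >= 1 old variable
                literals.append((pred, args))
    # Equality tests between variables already present in the rule.
    literals += [("Equal", (a, b)) for a in rule_vars
                 for b in rule_vars if a < b]
    # Negations of all of the above.
    literals += [("NOT " + pred, args) for pred, args in literals]
    return literals

# E.g. specializing GrandDaughter(x, y) <- , with Father/2 and Female/1:
for lit in candidate_literals(["x", "y"], {"Father": 2, "Female": 1})[:6]:
    print(lit)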
6.4.2 Guiding the Search in FOIL
To select the most promising literal from the candidates generated at
each step, FOIL considers the performance of the rule over the training
data. In doing this, it considers all possible bindings of each variable in
the current rule. To illustrate this process, consider again the example in
which we seek to learn a set of rules for the target literal
GrandDaughter(x, y). For illustration, assume the training data includes
the following simple set of assertions, where we use the convention
that P(x, y) can be read as "The P of x is y":

GrandDaughter(Victor, Sharon)   Father(Sharon, Bob)
Father(Tom, Bob)   Female(Sharon)   Father(Bob, Victor)
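The section names Foil_Gain but does not define it; the standard definition is
Foil_Gain(L, R) = t * ( log2( p1 / (p1 + n1) ) - log2( p0 / (p0 + n0) ) ),
where p0 and n0 are the numbers of positive and negative bindings of rule R,
p1 and n1 are the corresponding counts for the rule with literal L added, and
t is the number of positive bindings of R that are still covered after adding
L. A direct translation, with illustrative counts:

import math

def foil_gain(p0, n0, p1, n1, t):
    """Standard Foil_Gain: rule R has p0 positive / n0 negative bindings;
    after adding literal L the specialized rule has p1 / n1; t is the number
    of positive bindings of R still covered by the specialized rule."""
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# E.g. a literal that keeps 3 of 4 positive bindings while cutting
# negative bindings from 6 to 1:
print(foil_gain(p0=4, n0=6, p1=3, n1=1, t=3))  # about 2.72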
6.4.3 Learning Recursive Rule Sets
Earlier, we ignored the possibility that new literals added to the rule body could
refer to the target predicate itself (i.e., the predicate occurring in the rule head).
However, if we include the target predicate in the input list of Predicates, then
FOIL will consider it as well when generating candidate literals. This will allow it to
form recursive rules - rules that use the same predicate in the body and the head
of the rule.
For instance, recall the following rule set that provides a recursive
definition of the Ancestor relation.

IF Parent(x, y) THEN Ancestor(x, y)
IF Parent(x, z) ∧ Ancestor(z, y) THEN Ancestor(x, y)
Given an appropriate set of training examples, these two rules can be learned
following a trace similar to the one above for GrandDaughter. Note the second
rule is among the rules that are potentially within reach of FOIL'S search, provided
Ancestor is included in the list Predicates that determines which predicates may
be considered when generating new literals. Of course, whether this particular rule
would be learned or not depends on whether these particular literals outscore
competing candidates during FOIL's greedy search for increasingly specific rules.
FOIL Example :
• Say we are trying to predict the target predicate GrandDaughter(x,y).
• FOIL begins with
NewRule = GrandDaughter(x,y) ←
• To specialize it, FOIL generates these candidate additions to the preconditions:
Equal(x,y), Female(x), Female(y), Father(x,y), Father(y,x), Father(x,z),
Father(z,x), Father(y,z), Father(z,y) and their negations.
• FOIL might greedily select Father(y,z) as most promising, then
NewRule = GrandDaughter(x,y) ← Father(y,z).
• FOIL now considers all the literals from the previous step as well as:
Female(z), Equal(z,x), Equal(z,y), Father(z,w), Father(w,z) and their
negations.
• FOIL might select Father(z,x), and on the next step Female(y), leading to
NewRule = GrandDaughter (x,y) ← Father(y,z) ∧ Father(z,x) ∧ Female(y)
• If this covers only positive examples it terminates the search for further
specialization.
• FOIL now removes all positive examples covered by this new rule. If more
are left then the outer loop continues.
