ML Unit-3.-1

The document discusses machine learning concepts including computational learning theory, PAC learning, and online learning models. It provides definitions and explanations of key concepts: computational learning theory uses mathematical frameworks to quantify learning tasks and algorithms; PAC learning aims to measure learning-problem complexity and produce hypotheses that are probably approximately correct; the VC dimension measures the capacity of a hypothesis space to fit data; and online learning evaluates algorithms by the number of mistakes made while interacting sequentially with data.

UNIT-III(MACHINE LEARNING)

UNIT-III:
Computational Learning Theory:
Models of learnability: learning in the limit; probably approximately correct (PAC) learning.
Sample complexity for infinite hypothesis spaces, Vapnik- Chervonenkis dimension.
Rule Learning: Propositional and First-Order, Translating decision trees into rules,
Heuristic rule induction using separate and conquer and information gain, First-order
Horn-clause induction (Inductive Logic Programming) and Foil, Learning recursive rules,
Inverse resolution, Golem, and Progol.

COMPUTATIONAL LEARNING THEORY:


Computational learning theory (CoLT) applies mathematical methods to the design and analysis of computer learning programs. It involves using mathematical frameworks for the purpose of quantifying learning tasks and algorithms.
Computational learning theory can be considered an extension of statistical learning theory, or SLT for short, that makes use of formal methods for the purpose of quantifying learning algorithms.

• Computational Learning Theory (CoLT): Formal study of learning tasks.


• Statistical Learning Theory (SLT): Formal study of learning algorithms.

This division of learning tasks vs. learning algorithms is arbitrary, and in practice, there
is quite a large degree of overlap between these two fields.

How important is computational learning theory?


Computational learning theory provides a formal framework in which it is possible to precisely
formulate and address questions regarding the performance of different learning
algorithms. Comparisons of both the predictive power and the computational efficiency of

DEPT.OF ECE,SREC
competing learning algorithms can be made. Three key aspects that must be formalized
are:

• The way in which the learner interacts with its environment,


• The definition of success in completing the learning task,
• A formal definition of efficiency of both data usage (sample complexity) and
processing time (time complexity).

Computational learning theory in machine learning mainly deals with a type of inductive learning called supervised learning.

In supervised learning, an algorithm is given samples that are labeled in some useful
way. For example, the samples might be descriptions of mushrooms, and the labels
could be whether or not the mushrooms are edible. The algorithm takes these
previously labeled samples and uses them to induce a classifier. This classifier is a
function that assigns labels to samples, including samples that have not been seen
previously by the algorithm. The goal of the supervised learning algorithm is to optimize
some measure of performance such as minimizing the number of mistakes made on
new samples. In computational learning theory, a computation is considered feasible if
it can be done in polynomial time.

There are two kinds of time complexity results:

• Positive results – Showing that a certain class of functions is learnable in polynomial


time.
• Negative results – Showing that certain classes cannot be learned in polynomial
time.

Negative results often rely on commonly believed but as yet unproven assumptions, such as:

• Computational complexity – P ≠ NP (the P versus NP problem);


• Cryptographic – One-way functions exist.

VC-DIMENSION:

❖ The VC dimension theory or Vapnik-Chervonenkis dimension theory is the

theoretical study on Machine Learning Algorithms. It is a machine learning

framework developed by Vladimir Vapnik and Alexey Chervonenkis.

❖ It is a measure of the capacity of a class of functions that can be learned by a statistical binary classification algorithm, in terms of complexity, expressive power, richness, or flexibility. It is defined as the cardinality of the largest set of points that the method can shatter. In the context of a dataset, to shatter a set of points means that hypotheses in the space can separate the points from one another in every possible way, so that for any assignment of labels to the distinct groups there is a hypothesis that gets all the labels right.

❖ The VC dimension measures the complexity of a hypothesis space – for example, the space of models that can be fit given a representation and learning method. Counting the number of distinct hypotheses in the space, or measuring how the space can be traversed, are two rough ways to assess the complexity of a hypothesis space.

❖ The VC dimension is a more ingenious measure: it instead counts the number of instances from the target problem that can be distinguished by hypotheses in the space.

Mathematically, the VC dimension of a binary classifier is defined as follows. Given points in a d-dimensional space and a hypothesis space H, a set of n points S = {x1, x2, …, xn} is said to be shattered by H if, for every possible labeling of the points in S, there exists a hypothesis h ∈ H that classifies all of them correctly. The VC dimension is the size of the largest set of points that can be shattered:

VC(H) = max{ n | there exists a set of n points that can be shattered by H }

For example, a linear classifier in the plane can realize each of the 2² = 4 possible label assignments of N = 2 given points.

N = 3 points can likewise be classified correctly by H, with a separating hyperplane for every labeling, as shown in the following figure.


And that is why the VC dimension of H is 3: for any 4 points in the 2D plane, a linear classifier cannot shatter all labelings of the points. For example:

For this set of points, no separating hyperplane can be drawn that realizes the labeling. So the VC dimension is 3.

Similarly, for the degenerate pattern where three points coincide, we cannot draw a separating hyperplane between them. But such a pattern is not considered in the definition of the VC dimension, which only requires that some set of the given size be shattered.

PAC LEARNING MODEL:

➢ “PAC Learning, or Probably Approximately Correct Learning, is a framework in the theory of machine learning that aims to measure the complexity of a learning problem, and it is probably the most studied sub-field of computational learning theory. It originated in the seminal work of Leslie Valiant.”

➢ In the PAC model, examples are generated according to an arbitrary probability distribution D, and the aim of the learner is to classify further, unseen instances with high accuracy (with regard to the distribution D).

➢ To find out about an unknown target function, the learner is given access to labeled examples of the function, drawn randomly according to some unknown target distribution D.

➢ In general, a PAC algorithm may be run on supplied data, and the inaccuracy of the resulting hypothesis objectively measured.

➢ An exception arises when attempting to apply statistical query methods empirically, because most of these algorithms use the input for more than just establishing the necessary sample size.

PAC-learnability:

To discuss PAC-learnability we require some specific terminology and related notations.

• Let X be a set called the instance space which may be finite or infinite. For example, X

may be the set of all points in a plane.

• A concept class C for X is a family of functions c ∶ X → {0, 1}. A member of C is called a concept. A concept can also be thought of as a subset of X: a subset C ⊆ X defines a unique indicator function µC ∶ X → {0, 1} with µC(x) = 1 if x ∈ C and µC(x) = 0 otherwise.

• A hypothesis h is also a function h ∶ X → {0, 1}. So, as in the case of concepts, a

hypothesis can also be thought of as a subset of X. H will denote a set of hypotheses.

• We assume that F is an arbitrary, but fixed, probability distribution over X.

• Training examples are obtained by taking random samples from X. We assume that the

samples are randomly generated from X according to the probability distribution F.

Definition (informal) :

Let X be an instance space, C a concept class for X, h a hypothesis in C and F an

arbitrary, but fixed, probability distribution. The concept class C is said to be PAC-

learnable if there is an algorithm A which, for samples drawn with any probability

distribution F and any concept c ∈ C, will with high probability produce a hypothesis

h ∈ C whose error is small.

Additional notions:

• True error: To formally define PAC-learnability, we require the notion of the true error of a hypothesis h with respect to a target concept c, denoted error_F(h). It is defined by

error_F(h) = P(x∼F) ( h(x) ≠ c(x) )

where the notation P(x∼F) indicates that the probability is taken over instances x drawn from X according to the distribution F. This error is the probability that h will misclassify an instance x drawn at random from X according to F. The true error is not directly observable to the learner; the learner can only see the training error of each hypothesis (that is, how often h(x) ≠ c(x) over the training instances).
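The gap between true error and observable training error can be illustrated with a small Monte Carlo sketch. Everything in it (the uniform distribution F on [0, 1], the threshold concept c, and the hypothesis h) is a hypothetical choice for illustration, not from the text:

```python
import random

def c(x):  # target concept: positive iff x >= 0.5
    return x >= 0.5

def h(x):  # hypothesis: positive iff x >= 0.6 (disagrees with c on [0.5, 0.6))
    return x >= 0.6

def estimate_error(h, c, n=100_000, seed=0):
    """Monte Carlo estimate of error_F(h) = P(x~F)(h(x) != c(x)),
    with F taken to be uniform on [0, 1]."""
    rng = random.Random(seed)
    return sum(h(x) != c(x) for x in (rng.random() for _ in range(n))) / n

print(round(estimate_error(h, c), 2))  # → 0.1, the probability mass of [0.5, 0.6)
```

The exact true error here is 0.1 (the mass F puts on the disagreement region); the estimate converges to it as n grows.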

ONLINE LEARNING MODEL:

The online learning model, also known as the mistake-bounded learning model, is a learning model in which the worst case over all environments is considered. There is a known concept class for each situation, and the target concept is chosen from it. The target function and the sequence in which the instances are presented are then chosen by an adversary with unlimited computational power and full knowledge of the learner's algorithm. Throughout the learning session, the learner receives unlabeled examples one at a time.

Problem setting: The learner is told the correct label after each trial and can use this feedback to improve its hypothesis. The objective in this model is to find an efficient learning technique that minimizes the worst-case number of mistakes.

In this setting, the following scenario is repeated indefinitely:

1. The algorithm receives an unlabeled example.

2. The algorithm predicts a classification of this example.

3. The algorithm is then told the correct answer.

Definition 1: An algorithm A is said to learn C in the mistake bound model if for any

concept c ∈ C, and for any ordering of examples consistent with c, the total number of

mistakes ever made by A is bounded by p(n, size(c)), where p is a polynomial. We say that

A is a polynomial time learning algorithm if its running time per stage is also polynomial

in n and size(c). Let us now examine a few problems that are learnable in the mistake

bound model.

Conjunctions : Let us assume that we know that the target concept c will be a

conjunction of a set of (possibly negated) variables, with an example space of n-bit strings.

Consider the following algorithm:

1. Initialize hypothesis h to be the conjunction x1 ¬x1 x2 ¬x2 . . . xn ¬xn of all 2n literals.

2. Predict using h(x).

3. If the prediction is False but the label is actually True, remove all the literals in h which are False in x. (So if the first mistake is on 1001, the new h will be x1 ¬x2 ¬x3 x4.)

4. If the prediction is True but the label is actually False, then output “no consistent

conjunction”.

5. Return to step 2.

An invariant of this algorithm is that the set of literals in c will always be a subset of the

set of literals in h. The first mistake on a positive example will bring the size of h to n.

Each subsequent such mistake will remove at least one literal from h, so that the

maximum number of mistakes made will be at most n + 1.
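The elimination algorithm above can be sketched in Python. A literal is encoded as a pair (i, v), meaning "bit i must equal v" (so (i, 1) stands for x_{i+1} and (i, 0) for its negation); the example stream and target concept below are made up for illustration:

```python
def learn_conjunction(examples):
    """Mistake-bounded learner for conjunctions of (possibly negated) literals
    over n-bit strings; makes at most n + 1 mistakes on consistent data."""
    n = len(examples[0][0])
    h = {(i, v) for i in range(n) for v in (0, 1)}  # start with all 2n literals
    mistakes = 0
    for x, label in examples:
        pred = all(x[i] == v for (i, v) in h)  # h(x)
        if pred != label:
            mistakes += 1
            if label:  # predicted False on a positive: drop literals false in x
                h = {(i, v) for (i, v) in h if x[i] == v}
            else:      # predicted True on a negative: contradiction
                raise ValueError("no consistent conjunction")
    return h, mistakes

# Illustrative target concept: x1 AND NOT x2; the third bit is irrelevant.
data = [((1, 0, 0), True), ((1, 0, 1), True),
        ((0, 0, 1), False), ((1, 1, 0), False)]
h, m = learn_conjunction(data)
print(sorted(h), m)  # → [(0, 1), (1, 0)] 2   i.e. x1 ∧ ¬x2, after 2 mistakes
```

Note the invariant from the text: the literals of the target concept are never removed, so they always remain a subset of h.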

WEAK LEARNING:

The PAC learning model requires the learner to generate hypotheses that are arbitrarily close to the target concept. Simple hypotheses that are right only somewhat more often than not are easy to find; it is far more difficult to find a single hypothesis that is highly accurate.

A weak learning algorithm is one that, given training data, produces an output hypothesis that performs only slightly better than random guessing. The overall process of transforming a weak learner into a PAC (strong) learner is known as hypothesis boosting. Since Schapire's initial study, many boosting techniques have been presented.

AdaBoost is one such boosting algorithm, with strong empirical performance. The key to driving the weak learner to provide hypotheses that can be merged into a highly accurate hypothesis is to generate diverse distributions over the training data on which the weak learner is run.

SAMPLE COMPLEXITY FOR INFINITE HYPOTHESIS SPACES:

➢ The preceding bounds are restricted to finite hypothesis spaces, yet some infinite hypothesis spaces are more expressive than others – e.g., rectangles vs. 17-sided convex polygons vs. general convex polygons, or a single linear threshold function vs. a conjunction of LTUs. We therefore need a measure of the expressiveness of an infinite hypothesis space other than its size.

➢ The Vapnik-Chervonenkis dimension (VC dimension) provides such a measure.

Analogous to |H|, there are bounds for sample complexity using VC(H).

VC dimension (basic idea):

The VC dimension of a hypothesis space H measures the complexity of H by the number of distinct instances from X that can be completely discriminated ('shattered') using H, not by the number of distinct hypotheses |H|. An unbiased hypothesis space H shatters the entire instance space X (it is able to induce every possible partition of the set of all possible instances). The larger the subset of X that can be shattered, the more expressive the hypothesis space is, i.e., the less biased.

Shattering a set of instances

DEFINITION: A set of instances S is shattered by the hypothesis space H if and only if for every dichotomy of S there is a hypothesis h in H that is consistent with this dichotomy. Consider some subset of instances, say a subset S of three instances from X. Each hypothesis h from H imposes some dichotomy on S; that is, h partitions S into the two subsets {x ∈ S | h(x) = 1} and {x ∈ S | h(x) = 0}. Given some instance set S, there are 2^|S| possible dichotomies, though H may be unable to represent some of these. We say that H shatters S if every possible dichotomy of S can be represented by some hypothesis from H.

o dichotomy: a partition of the instances in S into + and –
o one dichotomy = label all instances in a subset P ⊆ S as +, and the instances in the complement S \ P as –
o The ability of H to shatter S is a measure of its capacity to represent concepts over S.

For example, taking intervals on the real line as hypotheses, we can shatter any dataset of two reals, but we cannot shatter a dataset of three real values (the labeling +, –, + is not realizable). The size of the largest set of instances that can be shattered is thus a measure of a hypothesis space's capacity to represent target concepts defined over those instances.
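The two-reals claim can be verified exhaustively in code. This is a minimal sketch assuming the hypothesis class of closed intervals, h(x) = 1 iff a ≤ x ≤ b (the point sets are illustrative):

```python
from itertools import product

def interval_realizes(points, labels, a, b):
    """True if the interval hypothesis h(x) = 1 iff a <= x <= b
    assigns exactly the given labels to the points."""
    return all((a <= x <= b) == bool(y) for x, y in zip(points, labels))

def is_shattered(points):
    """A point set is shattered by intervals iff every dichotomy is realizable.
    It suffices to try endpoints drawn from the points plus two sentinels
    (an empty interval, a > b, realizes the all-negative dichotomy)."""
    endpoints = [min(points) - 1] + sorted(points) + [max(points) + 1]
    return all(
        any(interval_realizes(points, labels, a, b)
            for a in endpoints for b in endpoints)
        for labels in product([0, 1], repeat=len(points))
    )

print(is_shattered([1.0, 2.0]))       # → True: any two reals are shattered
print(is_shattered([1.0, 2.0, 3.0])) # → False: the labeling (1, 0, 1) fails
```

So the VC dimension of intervals is 2: expressiveness is captured by the largest shatterable set, not by |H|, which is infinite here.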


VC dimension of H
The VC dimension of the hypothesis space H, VC(H), is the size of the largest finite
subset of the instance space X that can be shattered by H. If arbitrarily large finite subsets
of X can be shattered by H, then VC(H) = ∞.
VC Dimension
The VC dimension of hypothesis space H over instance space X is the size of the largest
finite subset of X that is shattered by H.
❖ If there exists one (or more) subsets of size d that can be shattered, then VC(H) ≥ d
❖ If no subset of size d can be shattered, then VC(H) < d.
❖ The VC dimension of a 2-d linear classifier is 3: the largest set of points that can be labeled arbitrarily has size 3. Note that |H| is infinite, but the expressiveness is quite low.
❖ If H is finite: VC(H) ≤ log2|H|. A set S with d instances has 2^d distinct subsets/dichotomies, so H requires at least 2^d distinct hypotheses to shatter d instances.
– If VC(H) = d: 2^d ≤ |H|, hence VC(H) = d ≤ log2|H|

VC Dimension of linear classifiers in 2 dimensions

SAMPLE COMPLEXITY AND VC DIMENSION :

VC dimension serves the same role as the size of the hypothesis space. Using the VC dimension as a measure of expressiveness, we can give an Occam algorithm for infinite hypothesis spaces. Given a sample D of m examples, we find an h ∈ H that is consistent with all m examples; if

m ≥ (1/ε) ( 4 log2(2/δ) + 8 VC(H) log2(13/ε) )

then with probability at least 1 − δ, h has error less than ε (the bound of Blumer et al.). Note that the hypothesis space need not be finite to use this bound. Also, if H is to shatter m points, then |H| must be at least 2^m, in order to realize every dichotomy of those m examples.
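A sufficient sample size for a consistent learner can be computed from the standard VC-based bound of Blumer et al., m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)); a small sketch:

```python
from math import ceil, log2

def sample_complexity(vc, eps, delta):
    """Number of examples sufficient for a consistent learner to reach
    error < eps with probability >= 1 - delta (Blumer et al. bound)."""
    return ceil((4 * log2(2 / delta) + 8 * vc * log2(13 / eps)) / eps)

# e.g. linear classifiers in the plane, where VC(H) = 3:
print(sample_complexity(vc=3, eps=0.1, delta=0.05))
```

Such bounds are loose in practice; their value lies in how m scales, growing linearly in VC(H) and in 1/ε.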

COLT SUMMARY:
The PAC framework provides a reasonable model for theoretically analyzing the
effectiveness of learning algorithms.
The sample complexity for any consistent learner using the hypothesis space, H,
can be determined from a measure of H’s expressiveness (|H|, VC(H)) .
If the sample complexity is tractable, then the computational complexity of finding
a consistent hypothesis governs the complexity of the problem.
Sample complexity bounds given here are far from tight, but they separate learnable classes from non-learnable classes (and show what is important).
Computational complexity results exhibit cases where learning is information-theoretically feasible, but finding a good hypothesis is computationally intractable.
The theoretical framework allows for a concrete analysis of the complexity of learning as a function of various assumptions (e.g., relevant variables).

*********************************************************************************

RULE LEARNING:
1. It is useful to learn the target function represented as a set of if-then rules that
jointly define the function. One way to learn sets of rules is to first learn a decision
tree, then translate the tree into an equivalent set of rules-one rule for each leaf
node in the tree.
2. A variety of algorithms instead learn rule sets directly, and they differ from these tree-based methods in two key respects. First, they are designed to learn sets of first-order rules that contain variables. This is significant because first-order rules are much more expressive than propositional rules. Second, the algorithms discussed here use sequential covering: they learn one rule at a time, incrementally growing the final set of rules.

Learn-One-Rule:

We consider a family of algorithms for learning rule sets based on the strategy of
learning one rule, removing the data it covers, then iterating this process. Such
algorithms are called sequential covering algorithms. To elaborate, imagine we have
a subroutine LEARN-ONE-RULE that accepts a set of positive and negative training
examples as input, then outputs a single rule that covers many of the positive
examples and few of the negative examples. We require that this output rule have
high accuracy, but not necessarily high coverage. By high accuracy, we mean the
predictions it makes should be correct. By accepting low coverage, we mean it need
not make predictions for every training example.

For example:
IF Mother(y, x) and Female(y), THEN Daughter(x, y).

Here, any person can be associated with the variables x and y. The Learn-One-Rule algorithm follows a greedy search paradigm: it searches for a rule with high accuracy, though its coverage may be low. It returns a single rule that covers some of the positive examples.


Learn-One-Rule(target_attribute, attributes, examples, k):

    Pos = positive examples
    Neg = negative examples
    best-hypothesis = the most general hypothesis
    candidate-hypotheses = {best-hypothesis}

    while candidate-hypotheses is not empty:

        // Generate the next, more specific candidate hypotheses
        constraints_list = all constraints of the form "attribute = value"
        new-candidate-hypotheses = all specializations of candidate-hypotheses
            formed by adding a constraint from constraints_list
        remove all duplicate/inconsistent hypotheses from new-candidate-hypotheses

        // Update best-hypothesis
        best-hypothesis = argmax(h ∈ new-candidate-hypotheses)
            Performance(h, examples, target_attribute)

        // Update candidate-hypotheses
        candidate-hypotheses = the k best of new-candidate-hypotheses
            according to Performance

    prediction = most frequent value of target_attribute among the
        examples that match best-hypothesis
    IF best-hypothesis:
        return prediction

It starts with the most general rule precondition, then greedily adds the variable
that most improves performance measured over the training examples.

Day Weather Temp Wind Rain PlayBadminton


D1 Sunny Hot Weak Heavy No
D2 Sunny Hot Strong Heavy No
D3 Overcast Hot Weak Heavy No
D4 Snowy Cold Weak Light Yes
D5 Snowy Cold Weak Light Yes
D6 Snowy Cold Strong Light Yes
D7 Overcast Mild Strong Heavy No
D8 Sunny Hot Weak Light Yes

Step 1 - best_hypothesis = IF h THEN PlayBadminton(x) = Yes

Step 2 - candidate-hypothesis = {best-hypothesis}

Step 3 - constraints_list = {Weather(x)=Sunny, Temp(x)=Hot, Wind(x)=Weak, ......}

Step 4 - new-candidate-hypothesis = {IF Weather=Sunny THEN PlayBadminton=YES,
          IF Weather=Overcast THEN PlayBadminton=YES, ...}

Step 5 - best-hypothesis = IF Weather=Sunny THEN PlayBadminton=YES

Step 6 - candidate-hypothesis = {the k best members of new-candidate-hypothesis
          according to Performance, e.g.
          IF Weather=Sunny THEN PlayBadminton=YES, ...}

Step 7 - Go to Step 2 and keep doing it till the best-hypothesis is obtained.
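The trace above can be reproduced with a minimal Learn-One-Rule sketch (beam width k = 1, plain accuracy as the Performance measure). Tie-breaking is an implementation choice here, so the rule this sketch finds (Weather=Snowy) differs from the Weather=Sunny step in the trace; both are greedy candidates:

```python
# The badminton table above, as (attributes, label) pairs.
DATA = [
    ({"Weather": "Sunny",    "Temp": "Hot",  "Wind": "Weak",   "Rain": "Heavy"}, "No"),
    ({"Weather": "Sunny",    "Temp": "Hot",  "Wind": "Strong", "Rain": "Heavy"}, "No"),
    ({"Weather": "Overcast", "Temp": "Hot",  "Wind": "Weak",   "Rain": "Heavy"}, "No"),
    ({"Weather": "Snowy",    "Temp": "Cold", "Wind": "Weak",   "Rain": "Light"}, "Yes"),
    ({"Weather": "Snowy",    "Temp": "Cold", "Wind": "Weak",   "Rain": "Light"}, "Yes"),
    ({"Weather": "Snowy",    "Temp": "Cold", "Wind": "Strong", "Rain": "Light"}, "Yes"),
    ({"Weather": "Overcast", "Temp": "Mild", "Wind": "Strong", "Rain": "Heavy"}, "No"),
    ({"Weather": "Sunny",    "Temp": "Hot",  "Wind": "Weak",   "Rain": "Light"}, "Yes"),
]

def accuracy(rule, data, target):
    """Fraction of examples covered by the rule that carry the target label."""
    covered = [y for x, y in data if all(x[a] == v for a, v in rule.items())]
    return sum(y == target for y in covered) / len(covered) if covered else 0.0

def learn_one_rule(data, target="Yes"):
    """Greedily add attribute=value constraints until the rule is perfectly
    accurate on the examples it covers (assumes noise-free data)."""
    rule, attrs = {}, list(data[0][0])
    while accuracy(rule, data, target) < 1.0:
        candidates = [dict(rule, **{a: x[a]})
                      for x, _ in data for a in attrs if a not in rule]
        rule = max(candidates, key=lambda r: accuracy(r, data, target))
    return rule

print(learn_one_rule(DATA))  # → {'Weather': 'Snowy'}
```

As in the text, the returned rule has high accuracy but covers only some of the examples.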

SEQUENTIAL COVERING ALGORITHM:

Sequential Covering is a popular rule-based classification algorithm used for learning a disjunctive set of rules. The basic idea is to learn one rule, remove the data that it covers, and then repeat the process. In this way, it builds up all the rules sequentially during the training phase.

Algorithm Involved:

Sequential_covering(Target_attribute, Attributes, Examples, Threshold):

    Learned_rules = {}
    Rule = Learn-One-Rule(Target_attribute, Attributes, Examples)

    while Performance(Rule, Examples) > Threshold:
        Learned_rules = Learned_rules + Rule
        Examples = Examples - {examples correctly classified by Rule}
        Rule = Learn-One-Rule(Target_attribute, Attributes, Examples)

    Learned_rules = sort Learned_rules according to performance over Examples
    return Learned_rules

The Sequential Covering algorithm takes care, to some extent, of the low-coverage problem of the Learn-One-Rule algorithm by covering the examples with a sequence of rules.
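The learn-remove-repeat loop above can be sketched compactly. The stand-in best_rule considers only single attribute=value tests, and the data and threshold below are illustrative:

```python
def covers(rule, x):
    return all(x[a] == v for a, v in rule.items())

def rule_accuracy(rule, data, target):
    covered = [y for x, y in data if covers(rule, x)]
    return sum(y == target for y in covered) / len(covered) if covered else 0.0

def best_rule(data, target):
    """Stand-in for Learn-One-Rule: the single attribute=value test with
    the best (accuracy, coverage) on the remaining examples."""
    candidates = [{a: x[a]} for x, _ in data for a in x]
    return max(candidates, key=lambda r: (
        rule_accuracy(r, data, target),
        sum(covers(r, x) for x, _ in data)))

def sequential_covering(data, target="Yes", threshold=0.9):
    learned, remaining = [], list(data)
    while remaining:
        rule = best_rule(remaining, target)
        if rule_accuracy(rule, remaining, target) <= threshold:
            break  # no remaining rule performs well enough
        learned.append(rule)  # keep the rule ...
        remaining = [(x, y) for x, y in remaining  # ... and remove the
                     if not covers(rule, x)]       # examples it covers
    return learned

data = [({"Weather": "Snowy"}, "Yes"), ({"Weather": "Snowy"}, "Yes"),
        ({"Weather": "Sunny"}, "No"), ({"Weather": "Rainy"}, "No")]
print(sequential_covering(data))  # → [{'Weather': 'Snowy'}]
```

Removing covered examples after each rule is what makes the learned rule set disjunctive: an instance is classified positive if any rule in the list fires.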
Working on the Algorithm:

The algorithm involves a set of ‘ordered rules’ or ‘list of decisions’ to be made.


Step 1 – create an empty decision list, ‘R’.
Step 2 – ‘Learn-One-Rule’ Algorithm
It extracts the best rule for a particular class ‘y’, where a rule is defined as: (Fig.)

General Form of Rule


In the beginning,
Step 2.a – if all training examples ∈ class ‘y’, then it’s classified as positive example.
Step 2.b – else if all training examples ∉ class ‘y’, then it’s classified as negative
example.

Step 3 – The rule becomes ‘desirable’ when it covers a majority of the positive
examples.
Step 4 – When this rule is obtained, delete all the training data associated with that
rule.
(i.e. when the rule is applied to the dataset, it covers most of the training data, and
has to be removed)

Step 5 – The new rule is added to the bottom of the decision list, ‘R’ (see the figure below).

➢ Let us understand step by step how the algorithm is working in the example
shown in the below figure.
➢ First, we created an empty decision list. During Step 1, we see that there are
three sets of positive examples present in the dataset. So, as per the
algorithm, we consider the one with the maximum number of positive examples (6, as shown in Step 1 of the above figure).

➢ Once we cover these 6 positive examples, we get our first rule R1, which is
then pushed into the decision list and those positive examples are removed
from the dataset. (as shown in Step 2 of below fig).
➢ Now, we take the next majority of positive examples (5, as shown in Step 2 of
below Fig ) and follow the same process until we get rule R2. (Same for R3)
➢ In the end, we obtain our final decision list with all the desirable rules.
Sequential Learning is a powerful algorithm for generating rule-based classifiers in
Machine Learning. It uses ‘Learn-One-Rule’ algorithm as its base to learn a sequence of
disjunctive rules.
FIRST-ORDER HORN CLAUSES:
To see the advantages of first-order representations over propositional (variable-free) representations, consider the task of learning the simple target concept Daughter(x, y), defined over pairs of people x and y. The value of Daughter(x, y) is True when x is the daughter of y, and False otherwise. Suppose each person in the data is described by the attributes Name, Mother, Father, Male, Female. Hence, each training example will consist of the description of two people in terms of these attributes, along with the value of the target attribute Daughter.
For example, the following is a positive example in which Sharon is the daughter of Bob:
(Name1 = Sharon, Mother1 = Louise, Father1 = Bob,
Male1 = False, Female1 = True,
Name2 = Bob, Mother2 = Nora, Father2 = Victor,
Male2 = True, Female2 = False, Daughter1,2 = True)
where the subscript on each attribute name indicates which of the two persons is being described. If we collect a number of such training examples for the target concept Daughter1,2 and provide them to a propositional rule learner such as CN2 or C4.5, the result would be a collection of very specific rules such as
IF (Father1 = Bob) ∧ (Name2 = Bob) ∧ (Female1 = True) THEN Daughter1,2 = True
whereas a first-order learner can express the general rule
IF Father(y, x) ∧ Female(y) THEN Daughter(x, y).

FIRST-ORDER INDUCTIVE LEARNER (FOIL) ALGORITHM - TERMINOLOGY:


✓ Constants — e.g. tyler, 23, a
✓ Variables — e.g. A, B, C
✓ Predicate symbols — e.g. male, father (take on True or False values only)
✓ Function symbols — e.g. age (can take on any constant as a value)
✓ Connectives — e.g. ∧, ∨, ¬, →, ←
✓ Quantifiers — e.g. ∀, ∃
✓ Term: any constant, any variable, or any function applied to any term.
e.g. age(bob)
✓ Literal: any predicate or negated predicate applied to any terms.
e.g. female(sue), father(X, Y). It has 3 types:
Ground Literal — a literal that contains no variables. e.g. female(sue)

Positive Literal — a literal that does not contain a negated predicate. e.g.
female(sue)
Negative Literal — a literal that contains a negated predicate. e.g. ¬father(X, Y)
✓ Clause: any disjunction of literals whose variables are universally quantified,
M1 ∨ M2 ∨ … ∨ Mn
where M1, M2, …, Mn are literals (with variables universally quantified) and
∨ is disjunction (logical OR).
✓ Horn clause: any clause containing exactly one positive literal,
H ∨ ¬L1 ∨ ¬L2 ∨ … ∨ ¬Ln, equivalently written H ← (L1 ∧ L2 ∧ … ∧ Ln)
where H is the Horn clause head and L1, L2, …, Ln are literals;
(A ← B) can be read as 'if B then A' [Inductive Logic],
∧ is conjunction (logical AND), ∨ is disjunction (logical OR),
¬ is logical NOT.

In the earlier rule notation this is equivalent to IF L1 ∧ … ∧ Ln THEN H. The Horn clause preconditions L1 ∧ … ∧ Ln are called the clause body or, alternatively, the clause antecedents. The literal H that forms the postcondition is called the clause head or, alternatively, the clause consequent.
FIRST ORDER INDUCTIVE LEARNER (FOIL)-IDEA
➢ A variety of algorithms have been proposed for learning first-order rules, or Horn clauses. Here we consider a program called FOIL.
➢ In machine learning, first-order inductive learner (FOIL) is a rule-based learning
algorithm. It is a natural extension of SEQUENTIAL-COVERING and LEARN-ONE-
RULE algorithms.
➢ It follows a greedy approach; the FOIL program is the natural extension of these earlier algorithms to first-order representations.
➢ The hypotheses learned by FOIL are sets of first-order rules, where each rule is
similar to a Horn clause with two exceptions.

➢ First, the rules learned by FOIL are more restricted than general Horn clauses, because the literals are not permitted to contain function symbols (this reduces the complexity of the hypothesis space search).
➢ Second, FOIL rules are more expressive than Horn clauses, because the literals
appearing in the body of the rule may be negated.
Inductive Learning:
Inductive learning means analyzing and understanding the evidence and then using it to determine the outcome. It is based on inductive logic.

Algorithm Involved

FOIL(Target_predicate, Predicates, Examples)

• Pos ← positive examples
• Neg ← negative examples
• Learned_rules ← {}
• while Pos is not empty, do
    // Learn a NewRule
    – NewRule ← the rule that predicts Target_predicate with no preconditions
    – NewRuleNeg ← Neg
    – while NewRuleNeg is not empty, do
        // Add a new literal to specialize NewRule
        1. Candidate_literals ← generate candidate literals for NewRule,
           based on Predicates
        2. Best_literal ← argmax(L ∈ Candidate_literals) Foil_Gain(L, NewRule)
        3. add Best_literal to the preconditions of NewRule
        4. NewRuleNeg ← the subset of NewRuleNeg that satisfies the
           preconditions of NewRule
    – Learned_rules ← Learned_rules + NewRule
    – Pos ← Pos − {members of Pos covered by NewRule}
• Return Learned_rules

Working of the Algorithm: In the algorithm, the inner loop is used to generate a new best
rule. Let us consider an example and understand the step-by-step working of the
algorithm.

To predict the target predicate GrandDaughter(x,y), we perform the following steps:


[Refer above fig]
Step 1 - NewRule = GrandDaughter(x,y)
Step 2 -
2.a - Generate the candidate literals:
(Female(x), Female(y), Father(x,y), Father(y,x),
Father(x,z), Father(z,x), Father(y,z), Father(z,y))
2.b - Generate the respective candidate literal negations:
(¬Female(x), ¬Female(y), ¬Father(x,y), ¬Father(y,x),
¬Father(x,z), ¬Father(z,x), ¬Father(y,z), ¬Father(z,y))

Step 3 - FOIL might greedily select Father(y,z) as the most promising literal, so
NewRule = GrandDaughter(x,y) ← Father(y,z) [greedy approach]

Step 4 - Foil now considers all the literals from the previous step as well as:
(Female(z), Father(z,w), Father(w,z), etc.) and their negations.

Step 5 - Foil might select Father(z,x), and on the next step Female(y) leading to
NewRule = GrandDaughter (x,y) ← Father(y,z) ∧ Father(z,x) ∧ Female(y)

Step 6 - If the rule now covers only positive examples, FOIL terminates the search for further specializations.
FOIL now removes all positive examples covered by this new rule. If more are left then the
outer while loop continues.
FOIL: PERFORMANCE EVALUATION MEASURE
The performance of a new rule is not measured by entropy (as in the PERFORMANCE method of the Learn-One-Rule algorithm). FOIL uses a gain measure to decide which new specialized rule to adopt: each candidate literal's utility is estimated by the number of bits saved in encoding the positive bindings,

Foil_Gain(L, R) = t · ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) )

where,
where,
L is the candidate literal to add to rule R
p0 = number of positive bindings of R
n0 = number of negative bindings of R
p1 = number of positive binding of R + L
n1 = number of negative bindings of R + L
t = number of positive bindings of R also covered by R + L
FOIL Algorithm is another rule-based learning algorithm that extends on the Sequential
Covering + Learn-One-Rule algorithms and uses a different Performance metrics (other
than entropy/information gain) to determine the best rule possible.
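Assuming the standard gain formula Foil_Gain(L, R) = t · (log2(p1/(p1+n1)) − log2(p0/(p0+n0))), the measure translates directly into code (the binding counts below are made up for illustration):

```python
from math import log2

def foil_gain(p0, n0, p1, n1, t):
    """Foil_Gain(L, R): reduction in bits needed to encode the positive
    bindings when literal L is added to rule R, weighted by the t
    positive bindings of R that survive the specialization."""
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# e.g. rule R covers 10 positive and 10 negative bindings; adding L leaves
# 8 positive and 2 negative bindings, all 8 positives carried over from R.
print(round(foil_gain(p0=10, n0=10, p1=8, n1=2, t=8), 3))
```

The gain is positive when L raises the fraction of positive bindings, zero when the fraction is unchanged, and scales with how many positives the specialized rule keeps.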
LEARNING RECURSIVE RULE SETS
New literals added to the rule body may refer to the target predicate itself (i.e., the predicate occurring in the rule head). If we include the target predicate in the input list of Predicates, then FOIL will consider it as well when generating candidate literals.
This allows it to form recursive rules – rules that use the same predicate in both the body and the head of the rule. For instance, recall the following rule set that

provides a recursive definition of the Ancestor relation.
IF Parent(x, y) THEN Ancestor(x, y)
IF Parent(x, z) ∧ Ancestor(z, y) THEN Ancestor(x, y)

1. Given an appropriate set of training examples, these two rules can be learned following a trace similar to the one above for GrandDaughter. The second rule is among the rules potentially within reach of FOIL's search, provided Ancestor is included in the list Predicates that determines which predicates may be considered when generating new literals.
2. Whether this particular rule is in fact learned depends on whether these literals outscore competing candidates during FOIL's greedy search for increasingly specific rules. A further practical question is how to avoid learning rule sets that produce infinite recursion.

RESOLUTION RULE
Resolution is closely related to the inference rule of transitivity of implication: resolving (¬A ∨ B) with (¬B ∨ C) yields (¬A ∨ C), which mirrors inferring (A ⇒ C) from (A ⇒ B) and (B ⇒ C).


Thus "unit" resolution produces a new clause with one fewer literal than its longer parent. As shown below, it is closely related to modus ponens.

Modus Ponens:
(A ⇒ B), A
-----------------
B
Resolution:
(∼ A ∨ B), A
-----------------
B

Procedure


Resolution in Predicate Logic E.g. 1


Resolution in Predicate Logic E.g. 2


Given
1. ∼ R
2. ∼ (P ∧ ∼ Q)
3. ∼ P → (R ∧ S)
Prove 4. Q by resolution.
Clauses:

1 → ∼R (C1)
2 → ∼P ∨ Q (C2)
3 → P ∨ R (C3a)
    P ∨ S (C3b)
negated goal → ∼Q (C4)
Resolutions:
C1, C3a → P (C5)
C5, C2 → Q (C6)
C4, C6 → ∅ (null clause, 'box')
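The refutation above can be checked mechanically. A minimal propositional resolution procedure, with clauses as frozensets of string literals and '~' marking negation:

```python
def resolve(c1, c2):
    """All resolvents of two clauses: for each complementary literal pair,
    merge the remainders of both clauses."""
    out = []
    for lit in c1:
        neg = lit[1:] if lit.startswith("~") else "~" + lit
        if neg in c2:
            out.append(frozenset((c1 - {lit}) | (c2 - {neg})))
    return out

def refutable(clauses):
    """Saturate under resolution; True iff the empty clause is derivable."""
    clauses = set(clauses)
    while True:
        new = {r for a in clauses for b in clauses if a != b
               for r in resolve(a, b)}
        if frozenset() in new:
            return True   # derived the null clause ('box')
        if new <= clauses:
            return False  # saturated without contradiction
        clauses |= new

# C1..C4 from the proof above (C4 being the negated goal ~Q):
kb = [frozenset({"~R"}), frozenset({"~P", "Q"}),
      frozenset({"P", "R"}), frozenset({"P", "S"}), frozenset({"~Q"})]
print(refutable(kb))  # → True: Q follows from the premises
```

Internally the procedure finds the same chain as the hand proof: C1 with C3a gives P, then Q, then the empty clause against ~Q.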
Resolution in First-Order Logic
To generalize resolution from propositional calculus (PC) to FOL, we must account for
• Predicates
• Unbound variables
• Existential, universal quantifiers

