UNIT-III:
Computational Learning Theory:
Models of learnability: learning in the limit; probably approximately correct (PAC) learning.
Sample complexity for infinite hypothesis spaces, Vapnik–Chervonenkis dimension.
Rule Learning: Propositional and First-Order, Translating decision trees into rules,
Heuristic rule induction using separate and conquer and information gain, First-order
Horn-clause induction (Inductive Logic Programming) and Foil, Learning recursive rules,
Inverse resolution, Golem, and Progol.
This division of learning tasks vs. learning algorithms is arbitrary, and in practice, there
is quite a large degree of overlap between these two fields.
DEPT.OF ECE,SREC
UNIT-III(MACHINE LEARNING)
Computational learning theory provides a formal framework within which precise comparisons of competing learning algorithms can be made. Three key aspects that must be formalized are: how the training examples are presented, how hypotheses are represented, and the criterion for successful learning.
In supervised learning, an algorithm is given samples that are labeled in some useful
way. For example, the samples might be descriptions of mushrooms, and the labels
could be whether or not the mushrooms are edible. The algorithm takes these
previously labeled samples and uses them to induce a classifier. This classifier is a
function that assigns labels to samples, including samples that have not been seen
previously by the algorithm. The goal of the supervised learning algorithm is to optimize
some measure of performance such as minimizing the number of mistakes made on
new samples. In computational learning theory, a computation is considered feasible if
it can be done in polynomial time.
Negative results often rely on commonly believed, but as yet unproven, assumptions, such as P ≠ NP or the hardness of standard cryptographic problems.
VC-DIMENSION:
The Vapnik–Chervonenkis (VC) dimension is a measure of the capacity of a hypothesis space: its richness, or flexibility. It is defined as the cardinality of the largest set of points that the method can shatter. In the context of a dataset, a shattered set is one whose points in the feature space can be picked out or divided from one another using hypotheses in the space, such that the labels of the samples in the distinct groups are all correct.
❖ The VC dimension measures the complexity of a hypothesis space, for example, the
models that can be fit given a representation and learning method. The number of
distinct hypotheses in a hypothesis space (the space of models that might be fit) and
the ways that space may be traversed are two ways to assess the complexity of a
hypothesis space.
❖ The VC dimension is an ingenious method that instead counts the number of cases
from the target issue that can be distinguished by hypotheses in the space.
Given a set of points in a d-dimensional space and a binary classifier h, the VC
dimension of h is the largest integer k such that there exists a set of k points that
can be shattered by h, i.e., for any labeling of the k points, there is a hypothesis
consistent with that labeling. Formally, the VC dimension of h is:
VC(h) = max{k | there exists a set of k points that can be shattered by h}.
And that is why the VC dimension of H (linear classifiers in the plane) is 3: for any 4 points
in the 2-D plane, a linear classifier cannot shatter all labelings of the points. For example,
for an XOR-like arrangement of four points, no separating hyperplane can be drawn that
realizes every labeling. So the VC dimension is 3.
Consider also the pattern where three points are collinear (or coincide): here too we
cannot draw a separating hyperplane for every labeling of the 3 points. But such a
configuration does not lower the VC dimension, because the definition only requires
that some set of 3 points can be shattered.
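The shattering argument above can be checked numerically. The sketch below is an illustration, not from the source; `linearly_separable` uses a perceptron with an epoch cap as a (heuristic) separability test, and `shattered` brute-forces every labeling of a point set:

```python
from itertools import product

def linearly_separable(points, labels, max_epochs=1000):
    """Heuristic check via the perceptron: it converges iff the labeling is
    linearly separable (the epoch cap treats non-convergence as 'no')."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
            if pred != y:
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shattered(points):
    """A set is shattered iff every possible labeling is separable."""
    return all(linearly_separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

# Three points in general position: shattered, so VC >= 3 for 2-D linear.
print(shattered([(0, 0), (1, 0), (0, 1)]))          # True
# Four points in XOR position: the +/-/-/+ labeling is not separable.
print(shattered([(0, 0), (1, 1), (0, 1), (1, 0)]))  # False
# Three collinear points: this particular set is not shattered,
# but that does not lower the VC dimension.
print(shattered([(0, 0), (1, 0), (2, 0)]))          # False
```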
➢ To find information about any unknown target function, the learner is given access
to different examples of the function, drawn randomly according to some
distribution D, and the aim of the learner is to classify any further unclassified
examples as accurately as possible.
➢ In general, a PAC algorithm may be applied to supplied data and its inaccuracy
measured, though care is needed, because most of these algorithms use the input for
more than just establishing the final hypothesis.
PAC-learnability:
• Let X be a set called the instance space, which may be finite or infinite. For example, X may be the set of all points in a plane.
• Training examples are obtained by taking random samples from X. We assume that the samples are drawn independently according to some fixed but unknown probability distribution.
Definition (informal) :
Assume samples are drawn from X according to an arbitrary, but fixed, probability
distribution. The concept class C is said to be PAC-learnable if there is an algorithm
A which, for samples drawn with any probability distribution F and any concept
c ∈ C, will with high probability produce a hypothesis with small error.
• True error: To formally define PAC-learnability, we require the notion of the true
error of a hypothesis h with respect to a target concept c and a distribution F. It
is defined by
error_F(h) = P_{x∈F}[h(x) ≠ c(x)],
where the notation P_{x∈F} indicates that the probability is taken for x drawn from X
according to the distribution F. This error is the probability that h will misclassify
an instance drawn at random from F. It is not directly observable to the learner,
which can only see the training error of each hypothesis (that is, how often
h(x) ≠ c(x) over training instances).
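For a finite hypothesis space, this definition yields a concrete sample-complexity bound for any consistent learner: m ≥ (1/ε)(ln|H| + ln(1/δ)). A small sketch (the function name and the example numbers are mine, for illustration):

```python
import math

def pac_sample_size(hypothesis_count, epsilon, delta):
    """Sufficient number of examples so that, with probability >= 1 - delta,
    any hypothesis consistent with the sample has true error < epsilon:
    m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# Conjunctions over n = 10 boolean variables: |H| = 3^10 (each variable is
# positive, negated, or absent from the conjunction).
print(pac_sample_size(3 ** 10, epsilon=0.1, delta=0.05))  # 140
```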
The online learning model, also known as the mistake-bounded learning model, is a form
of learning model in which the worst case over all environments is considered.
There is a known concept class for each situation, and the target concept is chosen
from it.
The target function and the sequence in which instances are presented are then chosen
by an adversary with unlimited computational power and knowledge of the learner's algorithm.
Problem setting : The learner is told the correct label after each trial and can
use this feedback to improve its hypothesis. The objective in this model is to bound
the worst-case total number of mistakes the learner makes.
Definition 1: An algorithm A is said to learn C in the mistake bound model if for any
concept c ∈ C, and for any ordering of examples consistent with c, the total number of
mistakes ever made by A is bounded by p(n, size(c)), where p is a polynomial. We say that
A is a polynomial time learning algorithm if its running time per stage is also polynomial
in n and size(c). Let us now examine a few problems that are learnable in the mistake
bound model.
Conjunctions : Let us assume that we know that the target concept c will be a
conjunction of a set of (possibly negated) variables, with an example space of n-bit strings.
1. Initialize h to the conjunction of all 2n literals: x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ ... ∧ xn ∧ ¬xn.
2. Given an example x, predict h(x).
3. If the prediction is False but the label is actually True, remove all the literals in h which
are False in x. (So if the first mistake is on 1001, the new h will be x1 ∧ ¬x2 ∧ ¬x3 ∧ x4.)
4. If the prediction is True but the label is actually False, then output “no consistent
conjunction”.
5. Return to step 2.
An invariant of this algorithm is that the set of literals in c will always be a subset of the
set of literals in h. The first mistake on a positive example will bring the size of h down to n.
Each subsequent such mistake will remove at least one literal from h, so the total number
of mistakes is at most n + 1.
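The elimination algorithm above can be written out directly. This sketch (the function name and the toy data are my own) processes a stream of labeled n-bit strings and counts mistakes:

```python
def learn_conjunction(examples):
    """Mistake-bound learner for conjunctions of (possibly negated) variables
    over n-bit strings. Each example is (bits, label), bits a string like
    '1001'. Returns (final hypothesis, number of mistakes), or (None, m)
    if no consistent conjunction exists."""
    n = len(examples[0][0])
    # h is a set of literals (i, v) meaning "bit i must equal v";
    # start with all 2n literals (the most specific hypothesis).
    h = {(i, v) for i in range(n) for v in '01'}
    mistakes = 0
    for bits, label in examples:
        pred = all(bits[i] == v for (i, v) in h)   # predict before updating
        if pred == label:
            continue
        mistakes += 1
        if label:   # false negative: drop every literal violated by x
            h = {(i, v) for (i, v) in h if bits[i] == v}
        else:       # false positive: no conjunction is consistent
            return None, mistakes
    return h, mistakes

# Target: x1 AND NOT x2 (1-indexed); one mistake on the first positive.
examples = [('10', True), ('11', False), ('00', False), ('10', True)]
h, m = learn_conjunction(examples)
print(sorted(h), m)  # [(0, '1'), (1, '0')] 1
```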
WEAK LEARNING:
The PAC learning model requires the learner to produce hypotheses that are arbitrarily
close to the target concept. A weak learner, in contrast, is only required to produce
hypotheses that do slightly better than random guessing. The process of converting
a weak learner into a PAC (strong) learner is known as hypothesis boosting: the weak
learner is run repeatedly, each time on a reweighted version of the data, driving it to
provide hypotheses that can be merged to produce an accurate combined hypothesis.
Schapire's result shows that weak learnability and strong (PAC) learnability are in
fact equivalent.
➢ The bounds so far are restricted to finite hypothesis spaces. Some infinite hypothesis
spaces are more expressive than others – e.g., rectangles vs. 17-sided convex polygons
vs. general convex shapes. We need a measure of the expressiveness of an infinite
hypothesis space other than its size.
Analogous to |H|, there are bounds for sample complexity using VC(H).
➢ Expressiveness is measured by the number of instances that can be completely
discriminated (‘shattered’) using H, not by the number of distinct hypotheses (|H|).
An unbiased hypothesis space H shatters the entire instance space X (is able to
induce every possible partition on the set of all possible instances). The larger the
subset of X that can be shattered, the more expressive the hypothesis space is, i.e.,
the less biased.
A hypothesis h imposes a dichotomy on an instance set S; that is, h partitions S into
the two subsets {x ∈ S | h(x) = 1} and {x ∈ S | h(x) = 0}. Given some instance set S,
there are 2^|S| possible dichotomies, though H may be unable to represent some of
these concepts over S; S is shattered by H only if H can represent every one of them.
For example, single intervals on the real line can shatter any dataset of two reals, but
cannot shatter some datasets of three real values (e.g., the labeling +, −, + of three
ordered points). The ability of H to shatter a set of instances is thus a measure of its
capacity to represent target concepts defined over these instances.
VC dimension of H
The VC dimension of the hypothesis space H, VC(H), is the size of the largest finite
subset of the instance space X that can be shattered by H. If arbitrarily large finite subsets
of X can be shattered by H, then VC(H) = ∞.
❖ If there exists one (or more) subsets of size d that can be shattered, then VC(H) ≥ d
❖ If no subset of size d can be shattered, then VC(H) < d.
❖ The VC dimension of a 2-D linear classifier is 3: this is the size of the largest set of
points that can be labeled arbitrarily. Note that |H| is infinite, but the expressiveness
is quite low.
❖ If H is finite: VC(H) ≤ log2 |H|. A set S with d instances has 2^d distinct
subsets/dichotomies. Hence, H requires 2^d distinct hypotheses to shatter d
instances.
– If VC(H) = d: 2^d ≤ |H|, hence VC(H) = d ≤ log2 |H|.
VC dimension serves the same role as the size of the hypothesis space. Using VC
dimension as a measure of expressiveness, we can give an Occam algorithm for infinite
hypothesis spaces. Given a sample D of m examples, we find an h ∈ H that is consistent
with all m examples; if
m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)),
then with probability at least 1 − δ, h has error less than ε. Note that the hypothesis
space does not have to be finite to use this bound. If H is to shatter m points, |H| has
to be at least 2^m, since shattering requires realizing every labeling of those m examples.
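The VC-based bound above can be evaluated for concrete hypothesis spaces. A quick sketch (the function name and the numbers are illustrative):

```python
import math

def vc_sample_bound(vc_dim, epsilon, delta):
    """Sufficient sample size for a consistent learner over an infinite
    hypothesis space, in terms of VC(H):
    m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / epsilon)) / epsilon)

# 2-D linear classifiers, VC(H) = 3:
print(vc_sample_bound(3, epsilon=0.1, delta=0.05))  # 1899
```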
COLT SUMMARY:
The PAC framework provides a reasonable model for theoretically analyzing the
effectiveness of learning algorithms.
The sample complexity for any consistent learner using the hypothesis space, H,
can be determined from a measure of H’s expressiveness (|H|, VC(H)) .
If the sample complexity is tractable, then the computational complexity of finding
a consistent hypothesis governs the complexity of the problem.
Sample complexity bounds given here are far from being tight, but they separate
learnable classes from non-learnable classes (and show what’s important).
Computational complexity results exhibit cases where learning is information-
theoretically feasible, but finding a good hypothesis is intractable.
The theoretical framework allows for a concrete analysis of the complexity of
learning as a function of various assumptions (e.g., relevant variables)
*********************************************************************************
RULE LEARNING:
1. It is useful to learn the target function represented as a set of if-then rules that
jointly define the function. One way to learn sets of rules is to first learn a decision
tree, then translate the tree into an equivalent set of rules-one rule for each leaf
node in the tree.
2. There are a variety of algorithms that directly learn rule sets and that differ from
tree-based algorithms in two key respects. First, they are designed to learn sets of first-order
rules that contain variables. This is significant because first-order rules are much
more expressive than propositional rules. Second, the algorithms discussed here
use sequential covering algorithms that learn one rule at a time to incrementally
grow the final set of rules.
Learn-One-Rule:
We consider a family of algorithms for learning rule sets based on the strategy of
learning one rule, removing the data it covers, then iterating this process. Such
algorithms are called sequential covering algorithms. To elaborate, imagine we have
a subroutine LEARN-ONE-RULE that accepts a set of positive and negative training
examples as input, then outputs a single rule that covers many of the positive
examples and few of the negative examples. We require that this output rule have
high accuracy, but not necessarily high coverage. By high accuracy, we mean the
predictions it makes should be correct. By accepting low coverage, we mean it need
not make predictions for every training example.
For example:
IF Mother(y, x) and Female(y), THEN Daughter(x, y).
Here, any person can be associated with the variables x and y. The Learn-One-Rule
algorithm follows a greedy search paradigm: it searches for a rule with high
accuracy, even if its coverage is very low. That is, the rule need not cover all the
positive examples; it returns a single rule that covers some of them.
Learn-One-Rule(target_attribute, attributes, examples):
  best_hypothesis = the most general hypothesis
  candidate_hypotheses = {best_hypothesis}
  while candidate_hypotheses is not empty:
    //Generate the next, more specific candidate hypotheses
    new_candidates = all specializations of candidate_hypotheses by one constraint
    //Update best-hypothesis
    best_hypothesis = argmax(h ∈ new_candidates) Performance(h, examples, target_attribute)
    //Update candidate-hypothesis
    candidate_hypotheses = the k best members of new_candidates
  return the rule: IF best_hypothesis THEN the most frequent value of
    target_attribute among the examples covered by best_hypothesis
It starts with the most general rule precondition, then greedily adds the attribute
test that most improves performance measured over the training examples.
Sequential Covering is a popular rule-based classification algorithm used for learning
a disjunctive set of rules. The basic idea is to learn one rule, remove the data that it
covers, then repeat the same process on the remaining data. In this way, it learns all
the rules sequentially during the training phase.
Algorithm Involved:
The Sequential Covering algorithm addresses, to some extent, the low-coverage
problem of the Learn-One-Rule algorithm by learning rules one after another until
the positive examples are covered.
Working on the Algorithm:
In the beginning, we start with the full training dataset and an empty decision list, ‘R’.
Step 2.a – if a training example ∈ class ‘y’, it is treated as a positive example.
Step 2.b – else if a training example ∉ class ‘y’, it is treated as a negative
example.
Step 3 – The rule becomes ‘desirable’ when it covers a majority of the positive
examples.
Step 4 – When this rule is obtained, delete all the training data covered by that
rule (i.e., when the rule is applied to the dataset, the examples it covers are removed).
Step 5 – The new rule is added to the bottom of the decision list, ‘R’.
➢ Let us understand step by step how the algorithm is working in the example
shown in the below figure.
➢ First, we create an empty decision list. During Step 1, we see that there are
three sets of positive examples present in the dataset. So, as per the
algorithm, we consider the set with the maximum number of positive
examples. (6, as shown in Step 1 of above fig)
➢ Once we cover these 6 positive examples, we get our first rule R1, which is
then pushed into the decision list and those positive examples are removed
from the dataset. (as shown in Step 2 of below fig).
➢ Now, we take the next majority of positive examples (5, as shown in Step 2 of
below Fig ) and follow the same process until we get rule R2. (Same for R3)
➢ In the end, we obtain our final decision list with all the desirable rules.
Sequential Learning is a powerful algorithm for generating rule-based classifiers in
Machine Learning. It uses ‘Learn-One-Rule’ algorithm as its base to learn a sequence of
disjunctive rules.
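The covering loop described above can be sketched in a few lines. Everything here (the callable rule representation, the toy Learn-One-Rule, and the weather-style data) is an invented illustration, not from the source:

```python
def sequential_covering(examples, learn_one_rule):
    """Learn a disjunctive rule set: learn one rule, remove the examples it
    covers, and repeat while positives remain and rules can be found."""
    rules, data = [], list(examples)
    while any(label for _, label in data):      # positives remain
        rule = learn_one_rule(data)
        if rule is None:
            break
        rules.append(rule)
        data = [(x, y) for x, y in data if not rule(x)]  # keep uncovered
    return rules

def learn_one_rule(data):
    """Toy Learn-One-Rule: pick the single (attribute, value) test that
    covers the most positives and no negatives; None if nothing qualifies."""
    best, best_pos = None, 0
    candidates = {(i, x[i]) for x, y in data if y for i in range(len(x))}
    for i, v in candidates:
        covered = [(x, y) for x, y in data if x[i] == v]
        pos = sum(1 for _, y in covered if y)
        if pos == len(covered) and pos > best_pos:
            best, best_pos = (lambda x, i=i, v=v: x[i] == v), pos
    return best

examples = [(('sunny', 'hot'), True), (('sunny', 'mild'), True),
            (('rainy', 'hot'), False), (('rainy', 'mild'), False)]
rules = sequential_covering(examples, learn_one_rule)
print(len(rules))  # 1: the test "attribute 0 == 'sunny'" covers both positives
```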
FIRST-ORDER HORN CLAUSES:
To see the advantages of first-order representations over propositional (variable-free)
representations, consider the task of learning the simple target concept Daughter(x, y),
defined over pairs of people x and y. The value of Daughter(x, y) is True when x is the
daughter of y, and False otherwise. Suppose each person in the data is described by the
daughter of y, and False otherwise. Suppose each person in the data is described by the
attributes Name, Mother, Father, Male, Female. Hence, each training example will consist
of the description of two people in terms of these attributes, along with the value of the
target attribute Daughter.
For example, the following is a positive example in which Sharon is the daughter of Bob:
(Name1 = Sharon, Mother1 = Louise, Father1 = Bob,
Male1 = False, Female1 = True,
Name2 = Bob, Mother2 = Nora, Father2 = Victor,
Male2 = True, Female2 = False, Daughter1,2 = True)
where the subscript on each attribute name indicates which of the two persons is being
described. If we collect a number of such training examples for the target concept
Daughter1,2 and provide them to a propositional rule learner such as CN2 or C4.5, the
result would be a collection of very specific rules such as
IF (Father1 = Bob) ∧ (Name2 = Bob) ∧ (Female1 = True) THEN Daughter1,2 = True
✓ Positive Literal — a literal that does not contain a negated predicate, e.g.
Female(sue)
✓ Negative Literal — a literal that contains a negated predicate, e.g. ¬Father(X,Y)
✓ Clause – It can be defined as any disjunction of literals whose variables are
universally quantified: M1 ∨ M2 ∨ ... ∨ Mn,
where M1, M2, ..., Mn --> literals (with variables universally quantified), ∨ -->
Disjunction (logical OR)
✓ Horn clause — It can be defined as any clause containing exactly one positive literal:
H ∨ ¬L1 ∨ ¬L2 ∨ ... ∨ ¬Ln, equivalently H ⇐ (L1 ∧ L2 ∧ ... ∧ Ln)
where, H --> Horn clause head, L1, L2, ..., Ln --> Literals
(A ⇐ B) --> can be read as 'if B then A' [Inductive Logic] and
∧ --> Conjunction (logical AND)
∨ --> Disjunction (logical OR)
¬ --> Logical NOT
This is equivalent, using the earlier rule notation, to IF L1 ∧ ... ∧ Ln THEN H. In this
notation, the Horn clause preconditions L1 ∧ ... ∧ Ln are called the clause body or,
alternatively, the clause antecedents. The literal H that forms the postcondition is called
the clause head or, alternatively, the clause consequent.
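Because each Horn clause has exactly one positive literal (its head), a set of propositional Horn clauses supports simple forward chaining. A minimal sketch with made-up proposition names:

```python
def forward_chain(rules, facts):
    """Forward chaining over propositional Horn clauses: each rule is
    (body_list, head); fire rules until no new facts are derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in facts and all(b in facts for b in body):
                facts.add(head)   # the head follows from the body
                changed = True
    return facts

rules = [(['B'], 'A'),        # A <- B          ('if B then A')
         (['A', 'C'], 'D')]   # D <- A AND C
print(sorted(forward_chain(rules, ['B', 'C'])))  # ['A', 'B', 'C', 'D']
```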
FIRST ORDER INDUCTIVE LEARNER (FOIL)-IDEA
➢ A variety of algorithms have been proposed for learning first-order rules, or Horn
clauses. We consider a program called FOIL.
➢ In machine learning, first-order inductive learner (FOIL) is a rule-based learning
algorithm. It is a natural extension of SEQUENTIAL-COVERING and LEARN-ONE-
RULE algorithms.
➢ It follows a greedy approach. The FOIL program is the natural extension of these
earlier algorithms to first-order representations.
➢ The hypotheses learned by FOIL are sets of first-order rules, where each rule is
similar to a Horn clause with two exceptions.
➢ First, the rules learned by FOIL are more restricted than general Horn clauses,
because the literals are not permitted to contain function symbols (this reduces the
complexity of the hypothesis space search).
➢ Second, FOIL rules are more expressive than Horn clauses, because the literals
appearing in the body of the rule may be negated.
Inductive Learning:
Inductive learning means analyzing and understanding the evidence and then using it to
determine the outcome. It is based on inductive logic.
Algorithm Involved
Working of the Algorithm: In the algorithm, the inner loop is used to generate a new best
rule. Let us consider an example and understand the step-by-step working of the
algorithm.
Step 3 - FOIL might greedily select Father(y,z) as most promising, then
NewRule = GrandDaughter(x,y) ← Father(y,z) [Greedy approach]
Step 4 - FOIL now considers all the literals from the previous step as well as:
(Female(z), Father(z,w), Father(w,z), etc.) and their negations.
Step 5 - FOIL might select Father(z,x), and on the next step Female(y), leading to
NewRule = GrandDaughter(x,y) ← Father(y,z) ∧ Father(z,x) ∧ Female(y)
The performance metric used is FOIL's information gain:
FOIL_Gain(L, R) = t · (log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0)))
where,
L is the candidate literal to add to rule R
p0 = number of positive bindings of R
n0 = number of negative bindings of R
p1 = number of positive bindings of R + L
n1 = number of negative bindings of R + L
t = number of positive bindings of R also covered by R + L
The FOIL algorithm is thus another rule-based learning algorithm that extends the
Sequential Covering + Learn-One-Rule approach, using a different performance metric
(FOIL gain, rather than entropy/information gain) to determine the best rule possible.
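The gain can be computed directly from the binding counts. The example numbers below are invented for illustration (a literal that keeps 4 of 6 positive bindings while cutting negative bindings from 6 to 1):

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """FOIL's metric for adding literal L to rule R:
    Gain = t * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))."""
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

print(round(foil_gain(p0=6, n0=6, p1=4, n1=1, t=4), 3))  # 2.712
```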
LEARNING RECURSIVE RULE SETS
New literals added to the rule body may refer to the target predicate itself (i.e., the
predicate occurring in the rule head). If we include the target predicate in the input
list of Predicates, then FOIL will consider it as well when generating candidate literals.
This will allow it to form recursive rules-rules that use the same predicate in
the body and the head of the rule. For instance, recall the following rule set that
provides a recursive definition of the Ancestor relation.
IF Parent(x, y) THEN Ancestor(x, y)
IF Parent(x, z) ∧ Ancestor(z, y) THEN Ancestor(x, y)
1. Given an appropriate set of training examples, these two rules can be learned
following a trace similar to the one above for GrandDaughter. The second rule is
among the rules that are potentially within reach of FOIL's search, provided
Ancestor is included in the list Predicates that determines which predicates may
be considered when generating new literals.
2. Whether this particular rule would be learned depends on whether these literals
outscore competing candidates during FOIL's greedy search for increasingly specific
rules. So learning recursive rules is possible in principle; a remaining practical issue
is how to avoid learning rule sets that produce infinite recursion.
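The recursive Ancestor definition above can be evaluated bottom-up over a set of Parent facts. This fixpoint sketch (the family data is invented) shows both rules firing:

```python
def ancestors(parent_facts):
    """Bottom-up evaluation of:
         IF Parent(x,y) THEN Ancestor(x,y)
         IF Parent(x,z) AND Ancestor(z,y) THEN Ancestor(x,y)
       iterated to a fixpoint."""
    ancestor = set(parent_facts)      # base rule
    while True:
        new = {(x, y)
               for (x, z) in parent_facts
               for (z2, y) in ancestor
               if z == z2} - ancestor
        if not new:
            return ancestor
        ancestor |= new               # recursive rule fired

parents = {('tom', 'bob'), ('bob', 'ann')}
print(sorted(ancestors(parents)))  # [('bob', 'ann'), ('tom', 'ann'), ('tom', 'bob')]
```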
RESOLUTION RULE
Resolution is closely related to the classical inference rule of transitivity of
implication (hypothetical syllogism): from (A ⇒ B) and (B ⇒ C), infer (A ⇒ C).
Thus "unit" resolution produces a new clause with one fewer literal than its longer
parent. As we have seen, it is closely related to modus ponens.
Modus Ponens:
(A ⇒ B), A
-----------------
B
Resolution:
(∼ A ∨ B), A
-----------------
B
Procedure
1. ∼R (C1)
2. ∼P ∨ Q (C2)
3. P ∨ R (C3a)
   P ∨ S (C3b)
4. ∼Q (C4)
Resolutions:
C1, C3a → P (C5)
C5, C2 → Q (C6)
C4, C6 → ∅ (null, 'box')
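The refutation above is mechanical enough to script. A sketch with clauses as sets of string literals (the representation is my own; negation is a leading `~`):

```python
def resolve(c1, c2):
    """Propositional resolution: if c1 contains a literal whose complement
    is in c2, return the resolvent with both removed; else None."""
    for lit in c1:
        comp = lit[1:] if lit.startswith('~') else '~' + lit
        if comp in c2:
            return (c1 - {lit}) | (c2 - {comp})
    return None

# Clauses from the refutation above:
C1 = {'~R'}
C2 = {'~P', 'Q'}
C3a = {'P', 'R'}
C4 = {'~Q'}

C5 = resolve(C1, C3a)   # {'P'}
C6 = resolve(C5, C2)    # {'Q'}
box = resolve(C4, C6)   # set(): the empty clause -- a contradiction
print(C5, C6, box == set())
```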
Resolution in First-Order Logic
To generalize resolution from propositional calculus (PC) to first-order logic (FOL), we must account for:
• Predicates
• Unbound variables
• Existential, universal quantifiers