Module-2: Knowledge Representation (18CS71)
KNOWLEDGE REPRESENTATION
Knowledge plays an important role in AI systems. The kinds of knowledge that might need to be
represented in AI systems include:
• Objects: Facts about objects in our world domain. e.g. Guitars have strings, trumpets
are brass instruments.
• Events: Actions that occur in our world. e.g. Steve Vai played the guitar in Frank
Zappa's Band.
• Performance: A behavior like playing the guitar involves knowledge about how to do
things.
• Meta-knowledge: Knowledge about what we know. e.g. Bobrow's robot, which plans a
trip: it knows that it can read street signs along the way to find out where it is.
For the purpose of solving complex problems encountered in AI, we need both a large amount
of knowledge and some mechanism for manipulating that knowledge to create solutions to new
problems. A variety of ways of representing knowledge (facts) have been exploited in AI
programs. In all variety of knowledge representations, we deal with two kinds of entities.
A. Facts: Truths in some relevant world. These are the things we want to represent.
B. Representations of facts in some chosen formalism. These are things we will actually be able
to manipulate.
One way to think of structuring these entities is at two levels :
(a) the knowledge level, at which facts are described, and
(b) the symbol level, at which representations of objects at the knowledge level are defined in
terms of symbols that can be manipulated by programs.
The facts and representations are linked with two-way mappings. This link is called
representation mappings. The forward representation mapping maps from facts to
representations. The backward representation mapping goes the other way, from
representations to facts.
One common representation is natural language (particularly English) sentences. Regardless of
the representation for facts we use in a program, we may also need to be concerned with an
English representation of those facts in order to facilitate getting information into and out of the
system. We need mapping functions from English sentences to the representation we actually
use and from it back to sentences.
Knowledge is broadly of two types:
1. Tacit (informal) knowledge
• Exists within a human being;
• It is embodied.
• Difficult to articulate formally.
• Difficult to communicate or share.
• Moreover, Hard to steal or copy.
• Drawn from experience, action, subjective insight
2. Explicit (formal) knowledge
• Exists outside a human being;
• It is embedded.
• Can be articulated formally.
• Also, Can be shared, copied, processed and stored.
• So, Easy to steal or copy
• Drawn from the artifact of some type as a principle, procedure, process, concepts.
FRAMEWORK OF KNOWLEDGE REPRESENTATION
• The computer requires a well-defined problem description to process, and it provides a
well-defined, acceptable solution.
• To collect fragments of knowledge, we first formulate a description in our spoken
language and then represent it in a formal language, so that the computer can
understand it.
• The computer can then use an algorithm to compute an answer.
This process is illustrated by the classic example. Suppose we start with two facts: "Spot is a
dog," represented as dog(Spot), and "Every dog has a tail," represented as ∀x: dog(x) → hastail(x).
➢ Then, using the deductive mechanisms of logic, we may generate the new
representation object hastail(Spot).
➢ Using an appropriate backward mapping function, the English sentence "Spot has a
tail" can be generated.
• Note that the mappings are not free of ambiguity: from the two statements we can
conclude that "Each dog has a tail"; but if the plural in "All dogs have tails" is read
literally, we might instead conclude that "Each dog has more than one tail."
• When we try to convert an English sentence into some other representation, such as
logical propositions, we first decode what facts the sentence represents and then convert
those facts into the new representation. When an AI program manipulates the internal
representation of facts, these new representations should also be interpretable as new
representations of facts.
Mutilated Checkerboard Problem:
Problem: From a normal chessboard, the two opposite corner squares have been eliminated. The
given task is to cover all the squares on the remaining board with dominoes so that each domino
covers two squares. No overlapping of dominoes is allowed. Can it be done? Consider three
candidate data structures (representations of the board).
The first representation does not directly suggest the answer to the problem. The second may
suggest it. The third representation does, when combined with the single additional fact that
each domino must cover exactly one white square and one black square.
The puzzle is impossible to complete. A domino placed on the chessboard will always cover
one white square and one black square. Therefore, a collection of dominoes placed on the board
will cover an equal number of squares of each color. If the two white corners are removed from
the board then 30 white squares and 32 black squares remain to be covered by dominoes, so
this is impossible. If the two black corners are removed instead, then 32 white squares and 30
black squares remain, so it is again impossible.
In short, a tiling can exist only if equal numbers of white and black squares remain.
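The parity argument can be checked mechanically. Below is a minimal Python sketch (the coordinate scheme and colour convention are our own choices, not part of the original notes) that counts the remaining squares of each colour:

    # Minimal sketch: verify the colour-count argument for the mutilated
    # checkerboard. Squares are (row, col); colour is (row + col) % 2.

    def colour_counts(removed):
        """Count the squares of each colour left after removing `removed`."""
        counts = [0, 0]
        for row in range(8):
            for col in range(8):
                if (row, col) not in removed:
                    counts[(row + col) % 2] += 1
        return counts

    # Remove two opposite corners, e.g. (0, 0) and (7, 7): both have the same colour.
    white, black = colour_counts({(0, 0), (7, 7)})
    print(white, black)  # 30 32 -- unequal, so no perfect domino tiling exists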
In the above figure, the dotted line across the top represents the abstract reasoning process that
a program is intended to model. The solid line across the bottom represents the concrete
reasoning process that a particular program performs. This program successfully models the
abstract process to the extent that, when the backward representation mapping is applied to the
program's output, an appropriate final fact is actually generated.
Ideally, the program itself would be able to control knowledge acquisition. No single system
that optimizes all of the capabilities for all kinds of knowledge has yet been found.
As a result, multiple techniques for knowledge representation exist.
Relational Knowledge
• The simplest way to represent declarative facts is a set of relations of the same sort used
in the database system.
• Provides a framework to compare two objects based on equivalent attributes.
o Any instance in which two different objects are compared is a relational type of
knowledge.
• A table of objects and their attributes shows a simple way to store facts: the facts about
a set of objects are put systematically in columns. This representation provides little
opportunity for inference.
• Given the facts alone, it is not possible to answer a simple question such as: "Who is the
heaviest player?"
• But if a procedure for finding the heaviest player is provided, then these facts will
enable that procedure to compute an answer.
• We can also ask things like who "bats left" and "throws right"; a sketch follows.
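As a concrete sketch of this representation, the following Python fragment stores a small table of hypothetical player facts (the names and numbers are illustrative only) and supplies the external procedures that the bare table lacks:

    # A minimal sketch of relational (table-style) knowledge. The table itself
    # supports no inference; separate procedures are needed even for simple
    # questions such as "who is the heaviest player?".

    players = [
        {"name": "Hank",    "height": 6.0, "weight": 180, "bats": "right", "throws": "right"},
        {"name": "Pee-Wee", "height": 5.8, "weight": 170, "bats": "right", "throws": "right"},
        {"name": "Willie",  "height": 5.9, "weight": 195, "bats": "left",  "throws": "right"},
    ]

    def heaviest(table):
        """The procedure that the bare table cannot supply by itself."""
        return max(table, key=lambda row: row["weight"])["name"]

    def select(table, **conditions):
        """Answer queries like: who bats left and throws right?"""
        return [row["name"] for row in table
                if all(row[attr] == value for attr, value in conditions.items())]

    print(heaviest(players))                             # Willie
    print(select(players, bats="left", throws="right"))  # ['Willie']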
Inheritable Knowledge
• Here the knowledge elements inherit attributes from their parents.
• The knowledge embodied in the design hierarchies found in the functional, physical and
process domains.
• Within the hierarchy, elements inherit attributes from their parents, but in many cases
not all attributes of the parent elements are prescribed to the child elements.
• Inheritance is a powerful form of inference, but it is not adequate by itself: the basic
KR (Knowledge Representation) scheme needs to be augmented with an inference
mechanism.
• Property inheritance: the objects or elements of specific classes inherit attributes and
values from more general classes.
• The classes are organized in a generalization hierarchy, as sketched below.
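A minimal sketch of property inheritance, assuming a frame-like dictionary of classes linked by isa (the class names and attribute values are illustrative):

    # Each frame may name its parent class via "isa"; attribute lookup climbs
    # the generalization hierarchy until a value is found.

    frames = {
        "person":          {"legs": 2},
        "adult-male":      {"isa": "person", "height": 5.9},
        "baseball-player": {"isa": "adult-male", "bats": "right"},
        "pitcher":         {"isa": "baseball-player"},
    }

    def get_value(frame, attribute):
        """Return the attribute value, inheriting from more general classes."""
        while frame is not None:
            if attribute in frames[frame]:
                return frames[frame][attribute]
            frame = frames[frame].get("isa")  # climb one level up the hierarchy
        return None

    print(get_value("pitcher", "legs"))  # 2, inherited from person
    print(get_value("pitcher", "bats"))  # 'right', inherited from baseball-player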
Declarative Knowledge
• A statement in which knowledge is specified, but the use to which that knowledge is to be put
is not given.
• Example: laws, people's names; these are facts which can stand alone, not dependent on other
knowledge.
Procedural Knowledge
• A representation in which the control information needed to use the knowledge is
embedded in the knowledge itself. For example, computer programs, directions, and
recipes; these indicate a specific use or implementation.
• Knowledge is encoded in procedures: small programs that know how to do specific
things, how to proceed.
Advantages:
• Heuristic or domain-specific knowledge can be represented.
• Extended logical inferences, such as default reasoning, are facilitated.
• Side effects of actions may be modeled. Some rules may become false in time; keeping
track of this in large systems may be tricky.
Disadvantages:
• Completeness: not all cases may be represented.
• Consistency: not all deductions may be correct. E.g., if we know that Fred is a bird we
might deduce that Fred can fly; later we might discover that Fred is an emu.
• Modularity is sacrificed: changes in the knowledge base might have far-reaching effects.
• Cumbersome control information.
1. Important Attributes:
There are two attributes that are of very general significance, and we have already seen
their use: instance and isa. They support property inheritance, and they are called a variety
of things in AI systems.
2. Relationships among Attributes
There are four such properties that deserve mention here:
• Inverses
• Existence in an isa hierarchy
• Techniques for reasoning about values
• Single-valued attributes
a. Inverses:
• Entities in the world are related to each other in many different ways. But as
soon as we decide to describe those relationships as attributes, we commit to a
perspective in which we focus on one object and look for binary relationships
between it and others. Attributes are those relationships.
• The first approach is to represent both relationships in a single representation that
ignores focus. Logical representations are usually interpreted as doing this. For example,
the assertion:
team(Pee-Wee-Reese, Brooklyn-Dodgers)
• The second approach is to use attributes that focus on a single entity but to use
them in pairs, one the inverse of the other. In this approach, we would represent
the team information with two attributes:
• one associated with Pee Wee Reese:
team = Brooklyn-Dodgers
• one associated with Brooklyn Dodgers:
team-members = Pee-Wee-Reese....
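A small sketch of this paired-attribute approach, keeping the two inverse attributes consistent under a single update (the data layout is our own; the names follow the example above):

    # Asserting team(player) also updates team-members on the team object,
    # so both focused views of the one underlying fact stay consistent.

    objects = {
        "Pee-Wee-Reese":    {},
        "Brooklyn-Dodgers": {"team-members": []},
    }

    def assert_team(player, team):
        """Store the fact once under each of the two inverse attributes."""
        objects[player]["team"] = team
        objects[team]["team-members"].append(player)

    assert_team("Pee-Wee-Reese", "Brooklyn-Dodgers")
    print(objects["Pee-Wee-Reese"]["team"])             # Brooklyn-Dodgers
    print(objects["Brooklyn-Dodgers"]["team-members"])  # ['Pee-Wee-Reese']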
b. Existence in an isa hierarchy:
• This is about generalization-specialization, like classes of objects and
specialized subsets of those classes. There are attributes and specialization of
attributes.
• Example: The attribute Height. It is actually a specialization of the more general
attribute physical-size which is, in turn, a specialization of physical-attribute.
These generalization-specialization relationships are important because they
support inheritance. This also provides information about constraints on the
values that the attribute can have and mechanisms for computing those values.
c. Techniques for Reasoning about Values
• Information about the type of the value. For example, the value of Height must
be a number measured in a unit of length.
• Constraints on the value, often stated in terms of related entities. For example,
the age of a person cannot be greater than the age of either of that person's
parents.
• Rules for computing the value when it is needed. We showed an example of
such a rule in Fig. 4.5 for the bats attribute. These rules are called backward
rules; such rules have also been called if-needed rules.
• Rules that describe actions that should be taken if a value ever becomes
known. These rules are called forward rules, or sometimes if-added rules.
d. Single-Valued Attributes
Knowledge-representation systems have taken several different approaches to
providing support for single-valued attributes, including:
• Introduce an explicit notation for temporal interval. If two different values are
ever asserted for the same temporal interval, signal a contradiction
automatically.
• Assume that the only temporal interval that is of interest is now. So if a new
value is asserted, replace the old value.
• Provide no explicit support. Logic-based systems are in this category. But in
these systems, knowledge base builders can add axioms that state that if an
attribute has one value then it is known not to have all other values.
PREDICATE LOGIC
Introduction
Predicate logic is used to represent knowledge. Predicate logic will be met in knowledge
representation schemes and reasoning methods. There are other ways, but this form is popular.
Propositional Logic
• It is simple to deal with, and a decision procedure for it exists. We can represent real-
world facts as logical propositions written as well-formed formulas.
• We explore the use of predicate logic as a way of representing knowledge by looking at
a specific example: the statements "Socrates is a man" and "Plato is a man."
• Written as propositions (say, SOCRATESMAN and PLATOMAN), the two statements
become totally separate assertions; we would not be able to draw any conclusions about
similarities between Socrates and Plato.
• Representations such as man(Socrates) and man(Plato), by contrast, reflect the structure
of the knowledge itself: they use predicates applied to arguments.
• Still, propositional logic fails to capture the relationship between any individual being
a man and that individual being mortal.
We need variables and quantification, unless we are willing to write separate statements for
every individual.
Predicate:
• A predicate is a truth assignment given for a particular statement, which is either true or
false. To solve commonsense problems by a computer system, we use predicate logic.
Logic Symbols used in predicate logic:
Predicate Logic
• Terms represent specific objects in the world and can be constants, variables or functions.
• Predicate Symbols refer to a particular relation among objects.
• Sentences represent facts, and are made of terms, quantifiers and predicate symbols.
• Functions allow us to refer to objects indirectly (via some relationship).
• Quantifiers and variables allow us to refer to a collection of objects without explicitly naming
each object.
• Some Examples
o Predicates: Brother, Sister, Mother , Father
o Objects: Bill, Hillary, Chelsea, Roger
o Facts expressed as atomic sentences, a.k.a. literals:
o Father(Bill,Chelsea)
o Mother(Hillary,Chelsea)
o Brother(Bill,Roger)
Variables and Universal Quantification
Universal Quantification allows us to make a statement about a collection of objects:
Nested Quantification
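A standard illustration (not from the original notes): the formula ∀x: ∃y: loves(x, y) says that everyone loves someone, whereas ∃y: ∀x: loves(x, y) says that there is a single person whom everyone loves. The order of nested quantifiers therefore changes the meaning.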
Functions
• Functions are terms - they refer to a specific object.
• We can use functions to symbolically refer to objects without naming them.
• Examples:
fatherof(x) age(x) times(x,y) succ(x)
• Using functions
If we use logical statements as a way of representing knowledge, then we have available a good
way of reasoning with that knowledge.
Representing facts with Predicate Logic
1) Marcus was a man: man(Marcus)
2) Marcus was a Pompeian: pompeian(Marcus)
3) All Pompeians were Romans: ∀x: pompeian(x) → roman(x)
4) Caesar was a ruler: ruler(Caesar)
5) All Romans were either loyal to Caesar or hated him: ∀x: roman(x) → loyalto(x, Caesar) ∨ hate(x, Caesar)
6) Everyone is loyal to someone: ∀x: ∃y: loyalto(x, y)
7) People only try to assassinate rulers they are not loyal to: ∀x: ∀y: person(x) ∧ ruler(y) ∧ tryassassinate(x, y) → ¬loyalto(x, y)
8) Marcus tried to assassinate Caesar: tryassassinate(Marcus, Caesar)
Clearly we do not want to have to write out the representation of each of these facts
individually. For one thing, there are infinitely many of them. But even if we only consider the
finite number of them that can be represented, say, using a single machine word per number, it
would be extremely inefficient to store explicitly a large set of statements when we could,
instead, so easily compute each one as we need it. Thus it becomes useful to augment our
representation with these computable predicates.
Consider the following set of facts, again involving Marcus.
1. Marcus was a man.
Man (Marcus)
Again we ignore the issue of tense.
2. Marcus was a Pompeian.
Pompeian(Marcus)
3. Marcus was born in 40 A.D.
born(Marcus, 40)
4. All men are mortal.
∀x: man(x) → mortal(x)
5. All Pompeians died when the volcano erupted in 79 A.D.
erupted(volcano, 79) ∧ ∀x: [pompeian(x) → died(x, 79)]
6. No mortal lives longer than 150 years.
∀x: ∀t1: ∀t2: mortal(x) ∧ born(x, t1) ∧ gt(t2 − t1, 150) → dead(x, t2)
7. It is now 1991.
now = 1991
8. Alive means not dead.
∀x: ∀t: [alive(x, t) → ¬dead(x, t)] ∧ [¬dead(x, t) → alive(x, t)]
9. If someone dies, then he is dead at all later times.
∀x: ∀t1: ∀t2: died(x, t1) ∧ gt(t2, t1) → dead(x, t2)
This representation says that one is dead in all years after the one in which one died. It ignores
the question of whether one is dead in the year in which one died.
The first four of these facts, restated in logical form:
1. man(Marcus)
2. pompeian(Marcus)
3. born(Marcus, 40)
4. ∀x: man(x) → mortal(x)
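As a worked illustration using only the facts above: from pompeian(Marcus) (fact 2) together with fact 5, we may derive died(Marcus, 79). Since now = 1991 (fact 7) and gt(1991, 79) holds, fact 9 then yields dead(Marcus, 1991); that is, Marcus is dead now.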
RESOLUTION
A procedure to prove a statement: resolution attempts to show that the negation of the
statement produces a contradiction with the known statements. It simplifies the proof procedure
by first converting the statements into canonical form. It is a simple iterative process: at each
step, two clauses called the parent clauses are compared (resolved), yielding a new clause that
has been inferred from them.
Resolution refutation:
• Convert all sentences to CNF (conjunctive normal form).
• Negate the desired conclusion (converted to CNF).
• Apply the resolution rule until you either derive false (a contradiction) or can't apply
the rule any more.
Resolution refutation is sound and complete:
• If we derive a contradiction, then the conclusion follows from the axioms.
• If we can't apply any more, then the conclusion cannot be proved from the axioms.
Sometimes, from the collection of statements we have, we want to know the answer to this
question: "Is it possible to prove some other statement from what we actually know?" In
order to do this we need to make some inferences, and those other statements can be shown
true using the refutation proof method, i.e. proof by contradiction using resolution. So for the
asked goal we negate the goal and add it to the given statements to prove the
contradiction.
Resolution refutation for propositional logic is thus a complete proof procedure: if the thing
that you're trying to prove is, in fact, entailed by the things that you've assumed, then you can
prove it using resolution refutation.
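To make the procedure concrete, here is a minimal propositional resolution-refutation sketch in Python (the clause encoding, with "~" marking negation, is our own convention):

    # Clauses are frozensets of literals. We negate the goal, add it to the
    # axioms, and resolve until the empty clause (a contradiction) appears
    # or no new clauses can be derived.

    def negate(literal):
        return literal[1:] if literal.startswith("~") else "~" + literal

    def resolve(c1, c2):
        """All resolvents of two clauses (one complementary pair removed)."""
        resolvents = []
        for literal in c1:
            if negate(literal) in c2:
                resolvents.append((c1 - {literal}) | (c2 - {negate(literal)}))
        return resolvents

    def refutes(axioms, goal):
        """True if the axioms entail goal, shown by deriving the empty clause."""
        clauses = set(axioms) | {frozenset({negate(goal)})}
        while True:
            new = set()
            pairs = [(a, b) for a in clauses for b in clauses if a != b]
            for a, b in pairs:
                for r in resolve(a, b):
                    if not r:            # empty clause: contradiction found
                        return True
                    new.add(frozenset(r))
            if new <= clauses:           # nothing new: goal not provable
                return False
            clauses |= new

    # p, p -> q  |-  q     (the implication p -> q is the clause {~p, q})
    print(refutes({frozenset({"p"}), frozenset({"~p", "q"})}, "q"))  # True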
Clauses:
▪ Resolution can be applied to certain class of wff called clauses.
▪ A clause is defined as a wff consisting of disjunction of literals.
Clause Form:
1. Eliminate →, using the fact that a → b is equivalent to ¬a ∨ b.
2. Reduce the scope of each ¬ to a single term, using the fact that ¬(¬p) = p, deMorgan's
laws, and the standard correspondences between the quantifiers.
3. Standardize variables so that each quantifier binds a unique variable.
4. Move all quantifiers to the left of the formula without changing their relative order.
5. Eliminate existential quantifiers. We can eliminate each such quantifier by substituting
for the variable a reference to a function (a Skolem function) that produces the desired value.
6. Drop the prefix. At this point, all remaining variables are universally quantified.
7. Convert the matrix into a conjunction of disjuncts, using the associative and distributive
properties.
8. Create a separate clause corresponding to each conjunct: in order for a well-formed
formula to be true, all the clauses that are generated from it must be true.
9. Standardize apart the variables in the set of clauses generated in step 8, i.e., rename the
variables so that no two clauses make reference to the same variable.
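As a brief illustration (our own worked example over the Marcus facts above): fact 5, ∀x: roman(x) → loyalto(x, Caesar) ∨ hate(x, Caesar), becomes ¬roman(x) ∨ loyalto(x, Caesar) ∨ hate(x, Caesar) after eliminating the implication (step 1) and dropping the universal prefix (step 6). For fact 6, ∀x: ∃y: loyalto(x, y), step 5 replaces y by a Skolem function of x, giving the clause loyalto(x, S1(x)).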
UNIFICATION ALGORITHM
• In propositional logic it is easy to determine that two literals cannot both be true
at the same time: simply look for L and ¬L. In predicate logic, this matching process is more
complicated, since bindings of variables must be considered.
Unification Example:
The object of the unification procedure is to discover at least one substitution that
causes two literals to match. Usually, if there is one such substitution, there are many.
In the unification algorithm, each literal is represented as a list, where the first element is the
name of a predicate and the remaining elements are arguments. An argument may be a single
element (atom) or may be another list.
The unification algorithm recursively matches pairs of elements, one pair at a time. The
matching rules are:
• Different constants, functions or predicates cannot match, whereas identical ones can.
• A variable can match another variable, any constant, or a function or predicate expression,
subject to the condition that the function or predicate expression must not contain any instance
of the variable being matched (otherwise it will lead to infinite recursion).
• The substitution must be consistent. Substituting y for x now and then z for x later is
inconsistent. (A substitution of y for x is written y/x.)
Example:
Suppose we want to unify p(X, Y, Y) with p(a, Z, b). Initially E is {p(X, Y, Y) = p(a, Z, b)}.
• It is important that if two instances of the same variable occur, then they must be
given identical substitutions.
Answer: matching the arguments pairwise gives X = a, then Y = Z, and finally Y = b (which
forces Z = b); the resulting unifier is {a/X, b/Y, b/Z}.
Resolution proof: (figure omitted)
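The matching rules above can be expressed as a small recursive procedure. The following Python sketch (with our own term encoding, and with the occurs check omitted for brevity) reproduces the answer to the example:

    # Terms are strings (constants/variables) or tuples like
    # ("p", "X", "Y", "Y") for p(X, Y, Y); names starting with an
    # uppercase letter are treated as variables here.

    def is_var(term):
        return isinstance(term, str) and term[:1].isupper()

    def substitute(term, subst):
        """Apply bindings, following chains like X -> Y -> b."""
        while is_var(term) and term in subst:
            term = subst[term]
        if isinstance(term, tuple):
            return tuple(substitute(arg, subst) for arg in term)
        return term

    def unify(t1, t2, subst=None):
        """Return a unifying substitution (dict) or None if none exists."""
        subst = {} if subst is None else subst
        t1, t2 = substitute(t1, subst), substitute(t2, subst)
        if t1 == t2:
            return subst
        if is_var(t1):
            subst[t1] = t2           # occurs check omitted for brevity
            return subst
        if is_var(t2):
            subst[t2] = t1
            return subst
        if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
            for a, b in zip(t1, t2):         # includes the predicate names
                subst = unify(a, b, subst)
                if subst is None:
                    return None
            return subst
        return None          # different constants/predicates cannot match

    print(unify(("p", "X", "Y", "Y"), ("p", "a", "Z", "b")))
    # {'X': 'a', 'Y': 'Z', 'Z': 'b'}  i.e. X = a and Y = Z = b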
Answering Questions
We can also use the proof procedure to answer questions such as “who tried to assassinate
Caesar” by proving:
– ¬tryassassinate(y, Caesar)
Once the proof is complete, we need to find out what substitution was made for y.
We show how resolution can be used to answer fill-in-the-blank questions, such as "When did
Marcus die?" or "Who tried to assassinate a ruler?” Answering these questions involves finding
a known statement that matches the terms given in the question and then responding with
another piece of the same statement that fills the slot demanded by the question.
From Clause Form to Horn Clauses
The operation is to convert clause form to Horn clauses. This operation is not always possible.
Horn clauses are clauses in normal form that have one or zero positive literals. The conversion
from a clause in normal form with one or zero positive literals to a Horn clause is done by using
the implication property.
Example: the clause ¬P(x) ∨ ¬Q(x) ∨ R(x) has exactly one positive literal, so by the implication
property it can be rewritten as the Horn clause P(x) ∧ Q(x) → R(x).
The procedural representation is one in which the control information that is necessary to
use the knowledge is considered to be embedded in the knowledge itself.
• Procedural knowledge answers the question 'What can you do?'
• While declarative knowledge is demonstrated using nouns,
• Procedural knowledge relies on action words, or verbs.
• It is a person's ability to carry out actions to complete a task.
The real difference between the declarative and procedural views of knowledge lies in where
the control information resides.
Example
In both cases, the control strategy must cause motion and be systematic. The production
system model of the search process provides an easy way of viewing forward and backward
reasoning as symmetric processes.
Consider the problem of solving a particular instance of the 8-puzzle problem. The rules
to be used for solving the puzzle can be written as:
To reason forward, the left sides (preconditions) are matched against the current state and the
right sides (results) are used to generate new nodes until the goal is reached. To reason
backward, the right sides are matched against the current node and the left sides are used to
generate new nodes representing new goal states to be achieved.
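A minimal sketch of the two directions, using hypothetical symbolic rules rather than 8-puzzle moves: each rule pairs a set of preconditions with a result, and the same rule set drives both procedures:

    # Forward: match left sides (preconditions) against known facts.
    # Backward: match right sides (results) against the current goal.

    rules = [
        ({"a", "b"}, "c"),
        ({"c"}, "goal"),
    ]

    def forward(facts, rules):
        """Apply left sides to known facts until nothing new is derived."""
        facts = set(facts)
        changed = True
        while changed:
            changed = False
            for pre, result in rules:
                if pre <= facts and result not in facts:
                    facts.add(result)
                    changed = True
        return facts

    def backward(goal, facts, rules):
        """Match right sides against the goal; left sides become subgoals."""
        if goal in facts:
            return True
        return any(all(backward(p, facts, rules) for p in pre)
                   for pre, result in rules if result == goal)

    print("goal" in forward({"a", "b"}, rules))  # True
    print(backward("goal", {"a", "b"}, rules))   # True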
The following 4 factors influence whether it is better to reason Forward or Backward:
1. Are there more possible start states or goal states? We would like to move from the
smaller set of states to the larger (and thus easier to find) set of states.
2. In which direction is the branching factor (i.e., the average number of nodes that can
be reached directly from a single node) greater? We would like to proceed in the
direction with the lower branching factor.
3. Will the program be asked to justify its reasoning process to a user? If so, it is
important to proceed in the direction that corresponds more closely with the way the
user will think.
4. What kind of event is going to trigger a problem-solving episode? If it is the arrival
of a new fact, forward reasoning often makes sense; if it is a query to which a response
is desired, backward reasoning is more natural.
The first two of these differences arise naturally from the fact that PROLOG programs are
actually sets of Horn Clauses that have been transformed as follows:
1. If the Horn clause contains no negative literals (i.e., it contains a single literal,
which is positive), then leave it as it is.
2. Otherwise, rewrite the Horn clause as an implication, combining all of the negative
literals into the antecedent of the implication and leaving the single positive
literal (if there is one) as the consequent.
This procedure causes a clause, which originally consisted of a disjunction of literals (all but
one of which were negative), to be transformed into a single implication whose antecedent is a
conjunction of (what are now positive) literals.
Matching
We described the process of using search to solve problems as the application of appropriate
rules to individual problem states to generate new states to which the rules can then be applied
and so forth until a solution is found.
How do we extract, from the entire collection of rules, those that can be applied at a given
point? Doing so requires some kind of matching between the current state and the preconditions
of the rules. How should this be done? The answer to this question can be critical to the success
of a rule-based system.
A more complex matching is required when the preconditions of rule specify required
properties that are not stated explicitly in the description of the current state. In this case, a
separate set of rules must be used to describe how some properties can be inferred from others.
An even more complex matching process is required if rules should be applied whenever
their preconditions approximately match the current situation. This is often the case in
situations involving physical descriptions of the world.
Indexing
One way to select applicable rules is to do a simple search through all the rules, comparing each
one's preconditions to the current state and extracting all the ones that match. There are two
problems with this simple solution:
i. A large number of rules may be necessary, and scanning through all of them at
every step would be inefficient.
ii. It's not always obvious whether a rule's preconditions are satisfied by a particular
state.
Solution: instead of searching through the rules, use the current state as an index into the rules
and select the matching ones immediately, as sketched below.
The matching process becomes easy, but at the price of a complete lack of generality in the
statement of the rules. Despite some limitations of this approach, indexing in some form is very
important to the efficient operation of rule-based systems.
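A minimal sketch of indexing, with hypothetical chess-flavoured rules: the rules are keyed by the state description they apply to, so rule selection becomes a dictionary lookup rather than a scan:

    # Build the index once, up front; thereafter the current state retrieves
    # its applicable rules directly.

    from collections import defaultdict

    rules = [
        ("white pawn at e2",   "advance to e4"),
        ("white pawn at d2",   "advance to d4"),
        ("black knight at g8", "develop to f6"),
    ]

    index = defaultdict(list)
    for precondition, action in rules:
        index[precondition].append(action)

    current_state = "white pawn at e2"
    print(index[current_state])   # ['advance to e4'] -- no scan over all rules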
Matching with Variables
The problem of selecting applicable rules is made more difficult when preconditions are not
stated as exact descriptions of particular situations but rather describe properties that the
situations must have. It often turns out that discovering whether there is a match between a
particular situation and the preconditions of a given rule must itself involve a significant search
process.
Backward-chaining systems usually use depth-first backtracking to select individual rules, but
forward-chaining systems generally employ sophisticated conflict resolution strategies to
choose among the applicable rules.
While it is possible to apply unification repeatedly over the cross product of preconditions
and state-description elements, it is more efficient to consider the many-many match problem,
in which many rules are matched against many elements in the state description
simultaneously. One efficient many-many match algorithm is RETE.
RETE Matching Algorithm
The recognize-act cycle is repeated until no rules are put in the conflict set or until a stopping
condition is reached. Verifying the many rule conditions on every cycle is a time-consuming
process; to eliminate the need to perform thousands of matches per cycle, an effective matching
algorithm called RETE is used.
The Algorithm consists of two Steps.
1. Working memory changes need to be examined.
2. Grouping rules which share the same condition & linking them to their common
terms.
The RETE algorithm is a many-many match algorithm (in which many rules are matched
against many elements). RETE is used in forward-chaining systems, which generally employ
sophisticated conflict resolution strategies to choose among applicable rules. RETE gains
efficiency from three major sources:
1. RETE maintains a network of rule conditions, and it uses changes in the state description to
determine which new rules might apply. Full matching is only pursued for candidates that could
be affected by incoming/outgoing data.
2. Structural similarity in rules: RETE stores the rules so that they share structures in
memory; sets of conditions that appear in several rules are matched only once per cycle.
3. Persistence of variable binding consistency: while all the individual preconditions of a rule
might be met, there may be variable binding conflicts that prevent the rule from firing, and
the work of detecting these can be minimized. RETE remembers its previous calculations and
is able to merge new binding information efficiently.
Approximate Matching:
Rules should be applied if their preconditions approximately match the current situation. E.g.,
in a speech-understanding program:
Rules: map a description of a physical waveform to phones.
Physical signal: varies because of differences in the way individuals speak and because of
background noise.
Conflict Resolution:
When several rules match at once, we must decide which one to apply; this decision problem
is called conflict resolution. There are three approaches to the problem of conflict resolution
in a production system:
1. Preference based on rule match:
a. Physical order of rules in which they are presented to the system
b. Priority is given to rules in the order in which they appear
2. Preference based on the objects matched:
a. Considers the importance of the objects that are matched.
b. Considers the position of the matchable objects in terms of Long-Term
Memory (LTM) and Short-Term Memory (STM).
LTM: stores the set of rules.
STM (working memory): serves as a storage area for the facts deduced by
the rules in long-term memory.
3. Preference based on the Action:
a. One way is to fire all of the matched rules temporarily and examine the
results of each. Using a heuristic function that can evaluate each of the
resulting states, compare the merits of the results and then select the
preferred one.
Search Control Knowledge:
➢ It is knowledge about which paths are most likely to lead quickly to a goal state
➢ Search Control Knowledge requires Meta Knowledge.
➢ It can take many forms: knowledge about
• which states are more preferable to others,
• which rule to apply in a given situation,
• the order in which to pursue subgoals, and
• useful sequences of rules to apply.
CONCEPT LEARNING
Much of learning involves acquiring general concepts from specific training examples. People,
for example, continually learn general concepts or categories such as "bird," "car," etc. Each
concept can be viewed as describing some subset of objects or events defined over a larger set.
We consider the problem of automatically inferring the general definition of some concept,
given examples labeled as members or nonmembers of the concept. This task is commonly
referred to as concept learning or approximating a boolean-valued function from examples.
Concept learning: Inferring a boolean-valued function from training examples of its input and
output.
“A task of acquiring potential hypothesis (solution) that best fits the given training examples.”
1. A CONCEPT LEARNING TASK
To ground our discussion of concept learning, consider the example task of learning the target
concept "Days on which my friend Sachin enjoys his favorite water sport” . Table given
below describes a set of example days, each represented by a set of attributes.
In general, any concept learning task can be described by the set of instances over which the
target function is defined, the target function, the set of candidate hypotheses considered by the
learner, and the set of available training examples.
Notation
• The set of items over which the concept is defined is called the set of instances, which we
denote by X. In the current example, X is the set of all possible days, each represented by the
attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.
• The concept or function to be learned is called the target concept, which we denote by c.
In general, c can be any Boolean valued function defined over the instances X;
that is, c: X → {0, 1}
In the current example, the target concept corresponds to the value of the attribute EnjoySport
(i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).
• When learning the target concept, the learner is presented with a set of training examples, each
consisting of an instance x from X, along with its target concept value c(x).
•Instances for which c(x) = 1 are called positive examples, or members of the target concept.
Instances for which c(x) = 0 are called negative examples. We will often write the ordered pair
(x, c(x)) to describe the training example consisting of the instance x and its target concept
value c(x).
•We use the symbol D to denote the set of available training examples.
•Given a set of training examples of the target concept c, the problem faced by the learner is to
hypothesize, or estimate, c. We use the symbol H to denote the set of all possible hypotheses
that the learner may consider regarding the identity of the target concept.
•In general, each hypothesis h in H represents a boolean-valued function defined over X; that
is,
h : X →{0, 1}. The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all
x in X.
• Given:
o Instances X: Possible days, each described by the attributes
▪ Sky (with possible values Sunny, Cloudy, and Rainy),
▪ AirTemp (with values Warm and Cold),
▪ Humidity (with values Normal and High),
▪ Wind (with values Strong and Weak),
▪ Water (with values Warm and Cool), and
▪ Forecast (with values Same and Change)
o Hypotheses H: each hypothesis is a conjunction of constraints on the six attributes,
where each constraint may be "?" (any value), "Ø" (no value), or a specific value
o Target concept c: EnjoySport : X → {0, 1}
o Training examples D: positive and negative examples of the target function
• Determine: a hypothesis h in H such that h(x) = c(x) for all x in X
Now consider the sets of instances that are classified positive by h1 and by h2. Because h2
imposes fewer constraints on the instance, it classifies more instances as positive. In fact, any
instance classified positive by h1 will also be classified positive by h2. Therefore, we say that
h2 is more general than h1.
This intuitive "more general than" relationship between hypotheses can be defined more
precisely as follows.
Definition: Let hj and hk be Boolean-valued functions defined over X. Then hj is more-general-
than-or-equal-to hk (written hj ≥ hk) if and only if
(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
• In the figure, the box on the left represents the set X of all instances, the box on the right
the set H of all hypotheses.
• Each hypothesis corresponds to some subset of X: the subset of instances that it classifies
positive.
• The arrows connecting hypotheses represent the more-general-than relation, with the
arrow pointing toward the less general hypothesis.
• Note that the subset of instances characterized by h2 subsumes the subset characterized by
h1; hence h2 is more-general-than h1.
To illustrate this algorithm, assume the learner is given the sequence of training examples from
the EnjoySport task.
• The first step of FIND-S is to initialize h to the most specific hypothesis in H:
h ← 〈Ø, Ø, Ø, Ø, Ø, Ø〉
• Consider the first training example
x1 = <Sunny Warm Normal Strong Warm Same>, +
• Observing the first training example, it is clear that hypothesis h is too specific. None
of the "Ø" constraints in h are satisfied by this example, so each is replaced by the next
more general constraint that fits the example
h1 = <Sunny Warm Normal Strong Warm Same>
• Consider the second training example
x2 = <Sunny, Warm, High, Strong, Warm, Same>, +
• The second training example forces the algorithm to further generalize h, this time
substituting a "?" in place of any attribute value in h that is not satisfied by the new example:
h2 = <Sunny, Warm, ?, Strong, Warm, Same>
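The trace above can be reproduced with a short FIND-S sketch in Python (the encoding of hypotheses as tuples, with None standing for the null constraint Ø, is our own):

    # FIND-S over the EnjoySport data: start from the most specific
    # hypothesis and minimally generalize it on each positive example.

    examples = [
        (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),  True),
        (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),  True),
        (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
        (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
    ]

    def find_s(examples):
        h = [None] * 6                       # most specific hypothesis: all null
        for x, positive in examples:
            if not positive:
                continue                     # FIND-S ignores negative examples
            for i, value in enumerate(x):
                if h[i] is None:
                    h[i] = value             # null -> first observed value
                elif h[i] != value:
                    h[i] = "?"               # conflict -> next more general
        return tuple(h)

    print(find_s(examples))
    # ('Sunny', 'Warm', '?', 'Strong', '?', '?')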
FIND-S leaves several questions unanswered, among them:
1. Has the learner converged to the correct target concept? Although FIND-S will find a
hypothesis consistent with the training data, it has no way to determine whether it is
the only hypothesis in H consistent with the data.
2. Why prefer the most specific hypothesis? It is not clear that we should prefer the most
specific consistent hypothesis over, say, the most general one, or one of intermediate
generality.
3. Are the training examples consistent? In most practical learning problems there is some
chance that the training examples will contain at least some errors or noise. Such
inconsistent sets of training examples can severely mislead FIND-S, given the fact that
it ignores negative examples. We would prefer an algorithm that could at least detect
when the training data is inconsistent and, preferably, accommodate such errors.
4. What if there are several maximally specific consistent hypotheses? In the hypothesis
language H for the EnjoySport task, there is always a unique, most specific hypothesis
consistent with any set of positive examples. However, for other hypothesis spaces there
can be several maximally specific hypotheses consistent with the data.
The LIST-THEN-ELIMINATE algorithm:
1. VersionSpace ← a list containing every hypothesis in H
2. For each training example 〈x, c(x)〉:
remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace
The List-Then-Eliminate algorithm first initializes the version space to contain all hypotheses
in H, then eliminates any hypothesis found inconsistent with any training example. The version
space of candidate hypotheses thus shrinks as more examples are observed, until ideally just
one hypothesis remains that is consistent with all the observed examples.
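A minimal LIST-THEN-ELIMINATE sketch over a deliberately tiny, hypothetical two-attribute hypothesis space (enumerating all of H is feasible only for such toy spaces, which is exactly the algorithm's practical weakness):

    # Each hypothesis is a pair of constraints; "?" accepts any value.

    from itertools import product

    H = list(product(["Sunny", "Rainy", "?"], ["Warm", "Cold", "?"]))

    def matches(h, x):
        return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

    def list_then_eliminate(H, examples):
        version_space = list(H)
        for x, label in examples:
            # keep only hypotheses that classify x the same way as its label
            version_space = [h for h in version_space if matches(h, x) == label]
        return version_space

    examples = [(("Sunny", "Warm"), True), (("Rainy", "Cold"), False)]
    print(list_then_eliminate(H, examples))
    # [('Sunny', 'Warm'), ('Sunny', '?'), ('?', 'Warm')]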
It is intuitively plausible that we can represent the version space in terms of its most specific
and most general members.
A More Compact Representation for Version Spaces
The version space is represented by its most general and least general members. These
members form general and specific boundary sets that delimit the version space within the
partially ordered hypothesis space.
Definition: The general boundary G, with respect to hypothesis space H and training data D,
is the set of maximally general members of H consistent with D
Definition: The specific boundary S, with respect to hypothesis space H and training data D,
is the set of minimally general (i.e., maximally specific) members of H consistent with D.
Sketch of proof:
• Let g, h, and s be arbitrary members of G, H, and S respectively, with g ≥ h ≥ s. By the
definitions of S and G, every hypothesis consistent with D lies between some member of S and
some member of G in this ordering.
The CANDIDATE-ELIMINATION algorithm refines the two boundaries as each training
example d arrives:
If d is a positive example:
remove from G any hypothesis inconsistent with d, and replace each inconsistent s in S
by its minimal generalizations that are consistent with d.
If d is a negative example:
remove from S any hypothesis inconsistent with d, and replace each inconsistent g in G
by its minimal specializations that are consistent with d and more general than some
hypothesis in S.
An Illustrative Example
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes
The CANDIDATE-ELIMINATION algorithm begins by initializing the version space to the set
of all hypotheses in H:
Initialize the G boundary set to contain the most general hypothesis in H:
G0 ← {〈?, ?, ?, ?, ?, ?〉}
Initialize the S boundary set to contain the most specific (least general) hypothesis:
S0 ← {〈Ø, Ø, Ø, Ø, Ø, Ø〉}
• Consider the third training example. This negative example reveals that the G
boundary of the version space is overly general, that is, the hypothesis in G
incorrectly predicts that this new example is a positive example.
Given that there are six attributes that could be specified to specialize G2, why are there only
three new hypotheses in G3? The other minimal specializations, such as 〈?, ?, Normal, ?, ?, ?〉,
do exclude the negative example, but they are inconsistent with the earlier positive examples
(they are not more general than the current S boundary), and so are not included.
• This positive example further generalizes the S boundary of the version space. It also results
in removing one member of the G boundary, because this member fails to cover the new
positive example
• After processing these four examples, the boundary sets S4 and G4 delimit the version space
of all hypotheses consistent with the set of incrementally observed training examples.
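Working through the four examples yields the following boundary sets (a standard trace of this example, stated here for reference):
S0 = 〈Ø, Ø, Ø, Ø, Ø, Ø〉; G0 = {〈?, ?, ?, ?, ?, ?〉}
After example 1: S1 = 〈Sunny, Warm, Normal, Strong, Warm, Same〉 (G1 = G0)
After example 2: S2 = 〈Sunny, Warm, ?, Strong, Warm, Same〉 (G2 = G0)
After example 3: S3 = S2; G3 = {〈Sunny, ?, ?, ?, ?, ?〉, 〈?, Warm, ?, ?, ?, ?〉, 〈?, ?, ?, ?, ?, Same〉}
After example 4: S4 = 〈Sunny, Warm, ?, Strong, ?, ?〉; G4 = {〈Sunny, ?, ?, ?, ?, ?〉, 〈?, Warm, ?, ?, ?, ?〉}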
INDUCTIVE BIAS
The fundamental questions for inductive inference
1. What if the target concept is not contained in the hypothesis space?
2. Can we avoid this difficulty by using a hypothesis space that includes every
possible hypothesis?
3. How does the size of this hypothesis space influence the ability of the algorithm to
generalize to unobserved instances?
Suppose, for instance, that the target concept is disjunctive (say, "Sky = Sunny or Sky =
Cloudy") while H contains only conjunctive hypotheses. The most specific conjunction
consistent with the positive examples is then forced to generalize away the Sky attribute
entirely.
• This new hypothesis is overly general and it incorrectly covers the third negative
training example! So H does not include the appropriate c.
• In this case, a more expressive hypothesis space is required.
An Unbiased Learner
• The solution to the problem of assuring that the target concept is in the hypothesis
space H is to provide a hypothesis space capable of representing every teachable
concept that is representing every possible subset of the instances X.
• The set of all subsets of a set X is called the power set of X
• In the EnjoySport learning task, the size of the instance space X of days described
by the six attributes is 96 (3 · 2 · 2 · 2 · 2 · 2). The power set of X therefore contains
2^96 ≈ 10^28 distinct target concepts, far more than any conjunctive hypothesis space
can represent.
Definition:
• Consider a concept learning algorithm L for the set of instances X.
• Let c be an arbitrary concept defined over X, and
• Let Dc = {〈x, c(x)〉} be an arbitrary set of training examples of c.
• Let L(xi, Dc) denote the classification assigned to the instance xi by L after training on
the data Dc.
• The inductive bias of L is any minimal set of assertions B such that for any target concept
c and corresponding training examples Dc:
(∀xi ∈ X) [(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]
Inductive bias of the CANDIDATE-ELIMINATION algorithm (CEA): the target concept c is
contained in the given hypothesis space H. The figure given below summarizes the situation
schematically.
Consider the following three learning algorithms, listed from weakest to strongest bias:
•Rote-Learner: Learning corresponds simply to storing each observed training example in
memory. Subsequent instances are classified by looking them up in memory. If the instance is
found in memory, the stored classification is returned. Otherwise, the system refuses to classify
the new instance.
The Rote-Learner has no inductive bias. The classifications it provides for new instances follow
deductively from the observed training examples, with no additional assumptions required.
•CEA: New instances are classified only in the case where all members of the current version
space agree on the classification. Otherwise, the system refuses to classify the new instance.
The CEA has a stronger inductive bias: that the target concept can be represented in its
hypothesis space. Because it has a stronger bias, it will classify some instances that the Rote-
Learner will not. Of course, the correctness of such classifications will depend completely on
the correctness of this inductive bias.
•FIND-S: This algorithm, described earlier, finds the most specific hypothesis consistent with
the training examples. It then uses this hypothesis to classify all subsequent instances.
The FIND-S algorithm has an even stronger inductive bias. In addition to the assumption that
the target concept can be described in its hypothesis space, it has an additional inductive bias
assumption: that all instances are negative instances unless the opposite is entailed by its other
knowledge.