1 - Concept Learning
UNIT 1 U1/0/2024
Machine Learning
COS4852
Year module
School of Computing
CONTENTS
This document contains the material for UNIT 1 for COS4852 for 2024.
1 OUTCOMES
In this unit you will learn to describe and solve a learning problem as a concept-learning task, with specific reference to the concept of the hypothesis space. Specifically, you should be able to:
• understand the General-to-Specific ordering of hypotheses, which forms the basis of what most Machine Learning algorithms do with data;
• understand the problem of Inductive Bias, which is built into all Machine Learning algorithms;
• discuss Concept Learning and the role that hypotheses play in machine learning;
• discuss the General-to-Specific ordering of hypotheses and why this ordering forms the basis of how Machine Learning algorithms work;
• discuss the problem of Inductive Bias and techniques to address it.
2 INTRODUCTION
In this unit you will investigate Tom Mitchell's theoretical background to learning theory, through his idea of Concept Learning. This is covered quite well in Mitchell's textbook, but we will be using other sources and you need not buy the book (though it gives a thorough treatment of the material).
To understand how computers can learn from data, you first need to understand a little about how human learning works, and which of its aspects can be used to create machine learning algorithms.
Traditional Artificial Intelligence focuses mostly on deductive processes, using logic to derive new facts; Expert Systems are a good example. Most Machine Learning algorithms are variants on the Concept Learning theme, where concepts are learned from specific examples of what needs to be learned. Classification, regression, clustering, and so on can all be seen as variants of the process of finding the hypothesis (solution), or set of hypotheses, that fits best, chosen from a (usually) predefined set of hypotheses.
Tom Mitchell defines this as the "problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples."
This unit covers the following topics:
• Learning systems
• Concept learning
• Version spaces
• Inductive bias
Many of these concepts were developed by Tom Mitchell in his PhD, and subsequently in papers
and books. They form a very useful theoretical framework to describe and assess machine learning
algorithms.
3 PREPARATION
Here are references to two textbooks, available online, that will give you a good overview of (and lots of detail on) Machine Learning.
Nils Nilsson wrote what he calls 'notes', the draft of a textbook he intended to publish. Go to http://robotics.stanford.edu/people/nilsson/mlbook.html and download the textbook "Introduction to Machine Learning" (http://robotics.stanford.edu/people/nilsson/MLBOOK.pdf).
You could also go online and try to find Max Welling’s textbook, titled “A first encounter with Machine
Learning”.
Also check on the myUnisa site under Additional Resources for copies of these textbooks.
Andrew Ng created one of the best free online courses on Supervised Machine Learning. You can find it at https://www.coursera.org/learn/machine-learning. As part of your activities later in this unit you will be asked to find other such courses.
There are many excellent resources available online on the topic of Machine Learning, and in this module we will use some of them. We will provide you with links to some of these, but they are neither exhaustive nor necessarily the best; we encourage you to do your own searches (using Google or DuckDuckGo) to find more. The more different angles on and approaches to this field you can find, the better.
To learn Machine Learning you need a solid background in Computer Science and Mathematics, and ideally some basic Statistics as well. The documents ML math essentials 1 and ML math essentials 2 (available under Additional Resources: Mathematics on the COS4852 site, and discussed under the activities below) give you an overview of the mathematics that you need to master to do well in Machine Learning.
The following article gives a brief overview of the fundamentals of Machine Learning:
• https://towardsdatascience.com/machine-learning-basics-part-1-a36d38c7916
Concept learning
• https://www.asquero.com/article/concept-and-concept-learning/
• Uses a simplified version of Mitchell's EnjoySport example to explain Concept Learning, and the Find-S and List-Then-Eliminate algorithms to find a list of valid hypotheses: https://www.studytonight.com/post/what-is-concept-learning-in-ml
General-to-specific ordering of hypotheses
These sources show that it is possible to define a hierarchy of specificity that allows us to order the hypotheses from most general to most specific. This creates a space in which we can search for an optimal hypothesis (or solution to the training problem).
• Here is a summary of Mitchell's Chapter 2 that discusses the concept of hypothesis ordering and gives a short example: https://www.i2tutorials.com/machine-learning-tutorial/machine-learning-general-to-specific-ordering-of-hypothesis/
• This link gives a good overview of the ordering principle, and also summarises most of the other concepts you are learning in this unit: https://bi.snu.ac.kr/Courses/g-ai04_2/ML02.pdf
Version spaces
• https://www.studytonight.com/post/what-is-concept-learning-in-ml
• https://machinelearningmastery.com/basic-concepts-in-machine-learning/
• A PDF document covering Specific-to-General search, General-to-Specific search, and the Candidate-Elimination algorithm: https://cs.ccsu.edu/~markov/ccsu_courses/lnml-ch4.pdf
• Gives a good definition of a Version Space: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/vspace/3_vspace.html
The Candidate-Elimination algorithm effectively extends Find-S and Find-G to approach the set of valid hypotheses from both ends, by considering the instances one at a time, whether positive or negative.
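The following outline (in Python) is a minimal sketch of this idea, not a complete implementation: the generalise, specialise, and covers helpers are placeholders whose details depend on the chosen hypothesis representation, and the usual pruning of redundant boundary members is omitted for brevity.

# Candidate-Elimination in outline. Positive examples pull the S-boundary
# up (generalise); negative examples push the G-boundary down (specialise).
def candidate_elimination(examples, S, G, generalise, specialise, covers):
    for instance, is_positive in examples:
        if is_positive:
            G = {g for g in G if covers(g, instance)}       # drop inconsistent g
            S = {generalise(s, instance) for s in S}        # minimal generalisation
        else:
            S = {s for s in S if not covers(s, instance)}   # drop inconsistent s
            G = {g2 for g in G
                     for g2 in specialise(g, instance)}     # minimal specialisations
    return S, G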
Inductive bias
In any learning system there is a trade-off between bias and the ability to generalise beyond the data that was learned. An unbiased learner cannot generalise, while a learning system that needs to classify unseen instances will always have some form of bias built into it.
• The link we saw earlier also gives a good summary of the problem of bias in a learning system: https://bi.snu.ac.kr/Courses/g-ai04_2/ML02.pdf
• Here you will find a set of examples of inductive learning in a range of learning problems, as well as examples of the different types of inductive bias found in various learning parameters, assumptions, and algorithms: https://www.i2tutorials.com/machine-learning-tutorial/machine-learning-inductive-bias-in-machine-learning/
4 DISCUSSION
A computer can be said to learn from experience E with respect to some class of tasks T and performance measure P if its performance P at tasks in T improves with experience E. This is Tom Mitchell's formal definition of learning, and such a formal definition allows us to build and compare machine learning systems.
If we can find suitable T, P, and E for a given learning problem, we can define the parameters and type of the target function, choose or define the target function to be learned, and decide on an appropriate learning algorithm.
Here are three examples of learning problems and their T, P and E values:
Handwritten character recognition
T: recognising and classifying handwritten characters in images.
P: the percentage of characters classified correctly.
E: A set of labeled images of handwritten characters, where the label is the associated alphabet character.
Spam filtering
T: classifying e-mail messages as SPAM or NOT SPAM.
P: the percentage of messages classified correctly.
E: A set of labeled e-mail messages, where the label is either SPAM or NOT SPAM.
Checkers
T: playing games of checkers.
P: the percentage of games won.
E: games played against itself.
The rest of this discussion works through the checkers example.
Choose a target function
Design an evaluation function that assigns a numerical value to any given board state, in such a way that higher values are associated with better board states (from the machine learner's perspective). Let:
• V : B → R, the value function V that maps any given board state b ∈ B to a real number.
We now need to find a working definition of the ideal target function V that can be implemented in a learning algorithm; in other words, we need to write the target function in terms of variables and coefficients. The learning algorithm will only acquire an approximation of the target function, and for this reason the process of learning the target function is called function approximation. This means that we need to find a balance between a representation of the target function that is more expressive (contains more detail) but requires much more training data, and a less expressive function that does not need as much data but may not learn as well.
We could define the target function as a linear combination of what we see in the current board position. Following Mitchell's formulation, the variables could be:
• x1: the number of black pieces on the board
• x2: the number of red pieces on the board
• x3: the number of black kings on the board
• x4: the number of red kings on the board
• x5: the number of black pieces threatened by red
• x6: the number of red pieces threatened by black
Threatened pieces are those that can be captured on the next move. Each variable is assigned a weight that can be learned by the system, so that
V̂(b) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5 + w6 x6
becomes our target function. The task of the learning system is to find values for the weights w1 to w6 that make V̂ fit the training values as closely as possible. The weight w0 is an additive constant.
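As an illustration, here is a minimal sketch (in Python, with hypothetical feature values and weights) of how V̂(b) would be evaluated once the six features have been extracted from a board state:

# A minimal sketch of evaluating the linear target function
# V̂(b) = w0 + w1*x1 + ... + w6*x6. The feature values and the
# weights below are hypothetical.
def v_hat(features, weights):
    """Evaluate V̂(b) given the board features [x1, ..., x6] and the
    weights [w0, w1, ..., w6] (w0 is the additive constant)."""
    value = weights[0]
    for x, w in zip(features, weights[1:]):
        value += w * x
    return value

# Example: 12 black pieces, 12 red pieces, no kings,
# one threatened piece on each side.
features = [12, 12, 0, 0, 1, 1]
weights = [0.0, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]   # hypothetical initial weights
print(v_hat(features, weights))                    # 0.0 for this symmetric board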
Training set
To learn the target function V we require a set of training examples. Each training example is an ordered pair of the form
⟨b, Vtrain(b)⟩
where b is a specific board state and Vtrain(b) is the training value for b.
We can estimate training values by looking at a possible subsequent board position of b and using the learner's current V̂-value for that board position:
Vtrain(b) ← V̂(Successor(b))
where Successor(b) is the next board state at which it is again the learner's turn to move.
There are many possible ways to define the best hypothesis (the set of weights in this case). A good first approach is to minimise the squared error E between the training values and those predicted by the hypothesis V̂:
E = Σ⟨b, Vtrain(b)⟩ (Vtrain(b) − V̂(b))²
where the sum runs over all training examples ⟨b, Vtrain(b)⟩. This means that we are trying to find the weights, and thereby the V̂, that minimise E for the observed training examples.
An appropriate training algorithm will search through the space of possible weight values to find the wi that minimise E. In a game such as checkers the number of legal board positions is huge, and an exhaustive search would be impossible.
The Least Mean Squares (LMS) training rule is one of many algorithms that can be used to incrementally adjust the weights. For each training example ⟨b, Vtrain(b)⟩ it computes the error Vtrain(b) − V̂(b) and updates each weight in the direction that reduces this error:
wi ← wi + η (Vtrain(b) − V̂(b)) xi
where η is a small constant (for example 0.1) that moderates the size of each update.
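A minimal sketch of one LMS update, reusing the hypothetical v_hat function from the earlier sketch; the learning rate η is the eta parameter:

def lms_update(features, v_train, weights, eta=0.1):
    """Apply one LMS update: wi <- wi + eta * (Vtrain(b) - V̂(b)) * xi.
    The weights list is modified in place."""
    error = v_train - v_hat(features, weights)
    weights[0] += eta * error                 # w0 has an implicit feature x0 = 1
    for i, x in enumerate(features):
        weights[i + 1] += eta * error * x
    return weights

# One training example: the board above with a training value of 10.
lms_update(features, 10.0, weights)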
[Figure 1: The final design of the learning system. An Experiment Generator proposes a new board, the Performance System plays the game, the Critic produces training examples, and the Generaliser outputs a new hypothesis.]
Let X be an instance space consisting of points in the Euclidean plane with integer coordinates (x, y), with positive and negative instances as shown in Figure 2. Positive instances are indicated as green circles, and negative instances are indicated as red triangles.
Let H be the set of hypotheses consisting of the region between two vertical lines. Formally, a hypothesis has the form h ← ⟨a < x < b⟩, where a < b and a, b ∈ Z (Z is the set of integers, {..., −3, −2, −1, 0, 1, 2, 3, ...}). This can be shortened to h ← ⟨a, b⟩. One such hypothesis, h ← ⟨−5, 1⟩, is shown in green in Figure 3.
If the instance space is not limited there is an infinite number of hypotheses. Assume (for purposes of explanation) that the instance space is limited to −11 ≤ x ≤ 7. The hypothesis space is then all the green regions that can be drawn with −11 ≤ a ≤ 6 and −10 ≤ b ≤ 7, with a < b (remember that a, b ∈ Z). A quick calculation will show that there are 18 + 17 + · · · + 1 = 171 possible hypotheses, given this limited instance space. One of these hypotheses is shown in Figure 3.
[Figure 2: The instance space, with positive instances P1 to P6 shown as green circles and negative instances N1 to N7 shown as red triangles, on integer coordinates with −11 ≤ x ≤ 7.]
[Figure 3: The same instance space, with the hypothesis h ← ⟨−5, 1⟩ shown as a green region between the vertical lines at x = −5 and x = 1.]
Most real-world problems have infinite instance and search spaces. Care needs to be taken when stating the assumptions (such as instance-space boundaries) so that the models remain valid.
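One quick way to verify this count is to enumerate the ⟨a, b⟩ pairs directly; a minimal sketch in Python:

# Count the hypotheses ⟨a, b⟩ with -11 <= a <= 6, -10 <= b <= 7 and a < b.
count = sum(1 for a in range(-11, 7) for b in range(-10, 8) if a < b)
print(count)   # 171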
H171 = {⟨−11, −10⟩, ⟨−11, −9⟩, ⟨−11, −8⟩, ⟨−11, −7⟩, ⟨−11, −6⟩, ⟨−11, −5⟩, ⟨−11, −4⟩, ⟨−11, −3⟩, ⟨−11, −2⟩, ⟨−11, −1⟩, ⟨−11, 0⟩, ⟨−11, 1⟩, ⟨−11, 2⟩, ⟨−11, 3⟩, ⟨−11, 4⟩, ⟨−11, 5⟩, ⟨−11, 6⟩, ⟨−11, 7⟩,
⟨−10, −9⟩, ⟨−10, −8⟩, ⟨−10, −7⟩, ⟨−10, −6⟩, ⟨−10, −5⟩, ⟨−10, −4⟩, ⟨−10, −3⟩, ⟨−10, −2⟩, ⟨−10, −1⟩, ⟨−10, 0⟩, ⟨−10, 1⟩, ⟨−10, 2⟩, ⟨−10, 3⟩, ⟨−10, 4⟩, ⟨−10, 5⟩, ⟨−10, 6⟩, ⟨−10, 7⟩,
⟨−9, −8⟩, ⟨−9, −7⟩, ⟨−9, −6⟩, ⟨−9, −5⟩, ⟨−9, −4⟩, ⟨−9, −3⟩, ⟨−9, −2⟩, ⟨−9, −1⟩, ⟨−9, 0⟩, ⟨−9, 1⟩, ⟨−9, 2⟩, ⟨−9, 3⟩, ⟨−9, 4⟩, ⟨−9, 5⟩, ⟨−9, 6⟩, ⟨−9, 7⟩,
⟨−8, −7⟩, ⟨−8, −6⟩, ⟨−8, −5⟩, ⟨−8, −4⟩, ⟨−8, −3⟩, ⟨−8, −2⟩, ⟨−8, −1⟩, ⟨−8, 0⟩, ⟨−8, 1⟩, ⟨−8, 2⟩, ⟨−8, 3⟩, ⟨−8, 4⟩, ⟨−8, 5⟩, ⟨−8, 6⟩, ⟨−8, 7⟩,
⟨−7, −6⟩, ⟨−7, −5⟩, ⟨−7, −4⟩, ⟨−7, −3⟩, ⟨−7, −2⟩, ⟨−7, −1⟩, ⟨−7, 0⟩, ⟨−7, 1⟩, ⟨−7, 2⟩, ⟨−7, 3⟩, ⟨−7, 4⟩, ⟨−7, 5⟩, ⟨−7, 6⟩, ⟨−7, 7⟩,
⟨−6, −5⟩, ⟨−6, −4⟩, ⟨−6, −3⟩, ⟨−6, −2⟩, ⟨−6, −1⟩, ⟨−6, 0⟩, ⟨−6, 1⟩, ⟨−6, 2⟩, ⟨−6, 3⟩, ⟨−6, 4⟩, ⟨−6, 5⟩, ⟨−6, 6⟩, ⟨−6, 7⟩,
⟨−5, −4⟩, ⟨−5, −3⟩, ⟨−5, −2⟩, ⟨−5, −1⟩, ⟨−5, 0⟩, ⟨−5, 1⟩, ⟨−5, 2⟩, ⟨−5, 3⟩, ⟨−5, 4⟩, ⟨−5, 5⟩, ⟨−5, 6⟩, ⟨−5, 7⟩,
⟨−4, −3⟩, ⟨−4, −2⟩, ⟨−4, −1⟩, ⟨−4, 0⟩, ⟨−4, 1⟩, ⟨−4, 2⟩, ⟨−4, 3⟩, ⟨−4, 4⟩, ⟨−4, 5⟩, ⟨−4, 6⟩, ⟨−4, 7⟩,
⟨−3, −2⟩, ⟨−3, −1⟩, ⟨−3, 0⟩, ⟨−3, 1⟩, ⟨−3, 2⟩, ⟨−3, 3⟩, ⟨−3, 4⟩, ⟨−3, 5⟩, ⟨−3, 6⟩, ⟨−3, 7⟩,
⟨−2, −1⟩, ⟨−2, 0⟩, ⟨−2, 1⟩, ⟨−2, 2⟩, ⟨−2, 3⟩, ⟨−2, 4⟩, ⟨−2, 5⟩, ⟨−2, 6⟩, ⟨−2, 7⟩,
⟨−1, 0⟩, ⟨−1, 1⟩, ⟨−1, 2⟩, ⟨−1, 3⟩, ⟨−1, 4⟩, ⟨−1, 5⟩, ⟨−1, 6⟩, ⟨−1, 7⟩,
⟨0, 1⟩, ⟨0, 2⟩, ⟨0, 3⟩, ⟨0, 4⟩, ⟨0, 5⟩, ⟨0, 6⟩, ⟨0, 7⟩,
⟨1, 2⟩, ⟨1, 3⟩, ⟨1, 4⟩, ⟨1, 5⟩, ⟨1, 6⟩, ⟨1, 7⟩,
⟨2, 3⟩, ⟨2, 4⟩, ⟨2, 5⟩, ⟨2, 6⟩, ⟨2, 7⟩,
⟨3, 4⟩, ⟨3, 5⟩, ⟨3, 6⟩, ⟨3, 7⟩,
⟨4, 5⟩, ⟨4, 6⟩, ⟨4, 7⟩,
⟨5, 6⟩, ⟨5, 7⟩,
⟨6, 7⟩}
Within this space the hypotheses can be ordered from the most specific to the most general. For example, use len(h) = |a − b| as the criterion to order the hypotheses. We then find that h⟨−5,1⟩ is more specific than h⟨−5,2⟩, because len(h⟨−5,1⟩) = 6 and len(h⟨−5,2⟩) = 7.
We can reduce the search space by considering only valid hypotheses. For example, the hypothesis h⟨−11,7⟩ is not valid because it contains both positive and negative instances. The only valid hypotheses are those between h⟨−6,1⟩ and h⟨−5,3⟩. This reduces the search space to:
H6 = {⟨−6, 1⟩, ⟨−6, 2⟩, ⟨−6, 3⟩, ⟨−5, 1⟩, ⟨−5, 2⟩, ⟨−5, 3⟩}
We can order the hypotheses from most specific to most general, as shown in Figure 4.
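The valid hypotheses can also be found programmatically. The sketch below assumes x-coordinates for the instances (the exact values must be read from Figure 2; those used here are hypothetical but consistent with the worked trace that follows):

# x-coordinates of the instances, assumed for illustration;
# read the exact values from Figure 2.
positives = [-4, -1, -2, -3, -1, 0]        # P1..P6
negatives = [-7, -9, 3, 4, -8, -6, 5]      # N1..N7

def consistent(a, b):
    """⟨a, b⟩ is consistent if it contains every positive instance
    and no negative instance."""
    return (all(a < x < b for x in positives)
            and not any(a < x < b for x in negatives))

version_space = [(a, b) for a in range(-11, 7) for b in range(-10, 8)
                 if a < b and consistent(a, b)]
print(version_space)
# [(-6, 1), (-6, 2), (-6, 3), (-5, 1), (-5, 2), (-5, 3)]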
FIND-S and FIND-G   We will now use the FIND-S algorithm to find the most specific set of hypotheses, called the S-boundary. The process looks only at the positive instances, so only the six positive instances need to be considered. The order in which the instances are considered does not affect the outcome of the algorithm, provided there are no errors in the data.
[Figure 4: The six valid hypotheses ordered by len(h), from the most general h⟨−6,3⟩ (len 9) at the top, through h⟨−5,3⟩ and h⟨−6,2⟩ (len 8), and h⟨−5,2⟩ and h⟨−6,1⟩ (len 7), to the most specific h⟨−5,1⟩ (len 6) at the bottom.]
S0 ← {∅}
After observing the first positive instance P1 = (−4, 2), choose a and b to get the smallest hypothesis that still contains all the instances considered up to this point. This gives us:
S1 ← {⟨−5, −3⟩}
After considering the second positive instance P2 = (−1, 6), S grows to:
S2 ← {⟨−5, 0⟩}
The next three positive instances do not expand S. P6 expands S to its final state:
S = S6 ← {⟨−5, 1⟩}
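The same trace can be expressed in code. A minimal FIND-S sketch for this interval representation, using the same assumed x-coordinates as in the earlier sketch:

def find_s(positive_xs):
    """FIND-S for the interval hypothesis ⟨a, b⟩ with integer bounds:
    start with the empty hypothesis and minimally generalise it to
    cover each positive instance in turn."""
    s = None                                    # S0: the empty hypothesis
    for x in positive_xs:
        if s is None:
            s = (x - 1, x + 1)                  # smallest interval with a < x < b
        else:
            a, b = s
            s = (min(a, x - 1), max(b, x + 1))  # minimal generalisation
    return s

print(find_s([-4, -1, -2, -3, -1, 0]))          # (-5, 1)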
In a similar way we can use the FIND-G algorithm to find the set of most general hypotheses, the G-boundary, by starting with G containing only unknown ('don't care') values:
G0 ← {⟨?, ?⟩}
We consider the negative instances one after the other. After N1 we get:
G1 ← {⟨−7, ?⟩}
N2 has no effect:
G2 ← {⟨−7, ?⟩}
N3 gives us:
G3 ← {⟨−7, 3⟩}
N4 and N5 have no effect:
G5 ← {⟨−7, 3⟩}
N6 reduces G to:
G6 ← {⟨−6, 3⟩}
N7 has no effect, and G reaches its final state:
G = G7 ← {⟨−6, 3⟩}
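A matching FIND-G sketch, again using the assumed coordinates; it specialises G minimally on each covered negative instance while keeping all positives covered:

def find_g(negative_xs, positive_xs):
    """FIND-G for the interval hypothesis: start with the most general
    hypothesis ⟨?, ?⟩ (modelled with bounds just outside the limited
    instance space) and minimally specialise it to exclude each
    negative instance while keeping all positives covered."""
    a, b = -12, 8                               # G0: ⟨?, ?⟩
    for x in negative_xs:
        if a < x < b:                           # the hypothesis wrongly covers x
            if all(x < p for p in positive_xs):
                a = max(a, x)                   # raise the lower bound
            elif all(p < x for p in positive_xs):
                b = min(b, x)                   # lower the upper bound
    return (a, b)

print(find_g([-7, -9, 3, 4, -8, -6, 5], [-4, -1, -2, -3, -1, 0]))   # (-6, 3)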
The hypothesis shown in Figure 3 is also our S-boundary, while the G-boundary is shown in Figure 5.
[Figure 5: The G-boundary hypothesis ⟨−6, 3⟩ shown as the region between the vertical lines at x = −6 and x = 3 on the same instance space.]
Other hypotheses   It is of course possible to define other kinds of hypothesis. Two different hypotheses are shown in Figure 6. Our current hypothesis needs two parameters (a and b) to describe it. The ellipse hypothesis needs three parameters: the centre and the two axes. The blue hypothesis would be much more complex to define.
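For illustration, a minimal sketch of an axis-aligned ellipse hypothesis; the centre and axis values are hypothetical:

def in_ellipse(x, y, cx, cy, rx, ry):
    """True if (x, y) lies inside the axis-aligned ellipse with
    centre (cx, cy) and semi-axes rx and ry."""
    return ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 < 1

# Hypothetical ellipse: centred at (-2, 0) with semi-axes 5 and 4.
print(in_ellipse(-4, 2, -2, 0, 5, 4))           # True: covers P1 = (-4, 2)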
[Figure 6: Two alternative hypotheses on the same instance space: an ellipse, and a more complex free-form region shown in blue.]
5 ACTIVITIES
Subtask 1: Online courses
Andrew Ng's course on Supervised Machine Learning can be found at:
https://www.coursera.org/learn/machine-learning
Find at least five (5) more online courses in Machine Learning. Discuss and compare their content, and the pros and cons of each, in detail. List the URLs so that we can find them as well. You can use them for later reference in the module.
Subtask 2: Universities
Find at least ten (10) university-based courses (not online courses, and not the same five as in Subtask 1), and discuss and compare the syllabus covered by each of these. List the URLs.
Here is one from the University of Washington, which you may include as an eleventh one:
http://courses.washington.edu/css490/2012.Winter/CSS%20490-590%20-%20Introduction%20to%20Machine%20Learning.html
Subtask 3: Structure and scope of Machine Learning
Summarise what you have learnt about the structure and scope of Machine Learning from the 15+1 websites that you found in the previous subtasks. Discuss what background knowledge you would require to master Machine Learning. Find resources on the web that contain the material needed to learn these background skills (HINT: see Subtask 4 for one such source). List the URLs.
Subtask 4: Mathematics background
Go to Additional Resources: Mathematics on the COS4852 site and download the files:
• ML math essentials 1
• ML math essentials 2
If you are not already well versed in these mathematical skills, use these documents as pointers and start learning the skills. These documents come from the University of Washington course listed under Subtask 2, where you can find other useful material as well.
6 SUMMARY
• Concept learning is the process of searching through a (potentially infinite) predefined space
of potential hypotheses.
• Hypotheses can be ordered from more general to more specific, which gives a structure in
which to search for valid hypotheses.
• The Find-S algorithm performs specific-to-general search to find the most specific hypothesis.
• The S- and G-boundaries delimit the entire set of valid hypotheses, namely those consistent with the data.
• Noisy or incomplete data breaks the Candidate-Elimination (CE) algorithm, as such data cannot be expressed as consistent hypotheses.
• The inductive bias of CE is the assumption that the target concept exists in the hypothesis space.
• If the hypothesis space is enlarged to include all possible hypotheses, the inductive bias is removed, but the learner then cannot generalise beyond the given data.
6.1 TASK 1
Find other online sources on the following topics (as discussed above). Study these in detail and summarise them. Use detailed worked examples to illustrate the concepts:
• Concept learning
• Version spaces
• Inductive bias
These are all concepts that were developed by Tom Mitchell as part of his PhD, and subsequently in
papers and books. They form a very useful theoretical framework to describe and assess machine
learning algorithms.
© UNISA 2024