Genetic Algorithms for Rule Discovery in Data Mining
Magnus Erik Hvass Pedersen (971055)
Daimi, University of Aarhus, October 2003
1 Introduction
The purpose of this document is to verify attendance of the author to the Data
Mining course at DAIMI, University of Aarhus. First the concept of genetic
algorithms (GAs) is outlined, then a brief introduction is given to data mining
in general and rule discovery in particular, and finally these are combined to
describe how GAs discover rules.
The referenced literature is used throughout, usually without explicit citation. The reader is assumed to be familiar with meta-heuristics and related topics.
2 Genetic Algorithms
The following meta-heuristic is inspired by genetics. It basically consists of combining the best solutions so far and changing them slightly, thereby combining Darwinian survival of the fittest with sexual reproduction. Specifically, for a population P of chromosomes the following operators are applied:
• Selection deals with the probabilistic survival of the fittest, in that more fit chromosomes are chosen to survive, where fitness is a comparable measure of how well a chromosome solves the problem at hand.
• Crossover takes individual chromosomes from P and combines them to form new ones.
• Mutation alters the new solutions so as to add stochasticity in the search for better solutions.
A variant known as elitist GA ensures the most fit solution survives intact,
leading to a higher degree of exploitation rather than exploration.
2.1 Selection
When selecting chromosomes from P , different methods are available. The first
one, roulette selection, chooses a chromosome with probability proportional to
its fitness:

    Pr(c) = Fitness(c) / Σ_{c' ∈ P} Fitness(c')
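In code, roulette selection might be sketched as follows (an illustration, not an implementation from the literature; the fitness function is assumed to return non-negative values):

```python
import random

def roulette_select(population, fitness):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitness(c) for c in population)
    r = random.uniform(0, total)
    cumulative = 0.0
    for c in population:
        cumulative += fitness(c)
        if cumulative >= r:
            return c
    return population[-1]  # guard against floating-point round-off
```

Note that chromosomes with higher fitness occupy a larger slice of the "wheel", but even the least fit chromosome retains a non-zero chance of being selected.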
2.2 Algorithm
The algorithm is as follows:
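In outline, the operators above combine into the following loop; this is a sketch with illustrative bit-string operators (one-point crossover, bit-flip mutation, tournament selection), not the paper's original listing:

```python
import random

def one_point_crossover(a, b):
    """Combine two equal-length chromosomes at a random cut point."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(c, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [1 - g if random.random() < rate else g for g in c]

def tournament_select(population, fitness, k=2):
    """Probabilistic survival of the fittest via a k-way tournament."""
    return max(random.sample(population, k), key=fitness)

def evolve(population, fitness, generations=50):
    """Elitist GA loop: the most fit chromosome always survives intact."""
    for _ in range(generations):
        best = max(population, key=fitness)
        offspring = [best]                       # elitism
        while len(offspring) < len(population):
            a = tournament_select(population, fitness)
            b = tournament_select(population, fitness)
            offspring.append(mutate(one_point_crossover(a, b)))
        population = offspring
    return max(population, key=fitness)
```

Because of elitism, the fitness of the best chromosome can never decrease from one generation to the next.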
Because the original binary realization spawned some theory¹ formalizing the validity of the algorithm as a meta-heuristic search procedure, other coding schemes were initially disregarded by some researchers. Abstraction is, however, one of the main tools not only of mathematics and computer science, but of most sciences - not to say of self-organization and life itself.
2.5 Observations
Notice how the population size remains constant, whereas populations in nature have a tendency to grow unless the environment prohibits it. One reason for keeping it constant is a matter of computational resources; the proper analogy to nature would be for each chromosome to execute on its own computer. Another reason is stability of convergence.
Furthermore, there is only one species and one race. Implementing race in a GA would be similar to having subsets of P with more similar chromosomes, also called niching. Species is more difficult, as the chromosomes are normally rather precisely sized candidate solutions. But it would be interesting to allow the evolution of chromosomes with different sizes (both smaller and larger), provided there is a sensible way of using them on the original problem.
A suggestion would be to use a window: if the chromosome is bigger, then choose only a portion of it matching the problem size. If it is smaller, only solve a certain part of the problem of size equal to the chromosome - with the fitness also somehow reflecting that only a part of the problem was solved. The actual growing or shrinking may be built into the mutation operator, and crossover between different species could be disallowed. Alternatively, the crossover operator could instead split the two chromosomes at different points, thus creating new chromosomes of unequal length.
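The alternative crossover just suggested might be sketched like this (a hypothetical illustration: each parent is cut at its own independently chosen point):

```python
import random

def unequal_crossover(a, b):
    """Split each parent at its own cut point, yielding children of new lengths."""
    i = random.randrange(1, len(a))
    j = random.randrange(1, len(b))
    # Swap tails; total genetic material is conserved, lengths are not.
    return a[:i] + b[j:], b[:j] + a[i:]
```

Note that the combined length of the two children always equals that of the two parents, so the gene pool neither grows nor shrinks overall.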
¹ Including the so-called Schema Theorem.
This is somewhat similar to the artificial immune system described in [3] (p. 231), in which germs are bit-strings. The protective agents, so-called antibodies, are bit-strings of arbitrary length, offering protection against any substring they encode. The antibodies may also learn from each other, and it turns out that the information they encode gets compressed, so that substrings recognizing germs start to overlap.
However, it may very well be that the larger chromosomes provide no improvement over simply increasing the number of fixed-size chromosomes.
3.2 Overfitting
Overfitting is the overinfluencing of the prediction model to anomalies in the
training set T , that are not representative for the entire data set D. This
may occur because the model is developed too much, when too few samples are
present in T , or if they are too noisy. The extremity of this is memorizing in
which an uncovered pattern is so specific that it only covers a single instance. In
the worst case, the entire training set may be memorized, rendering the model
useless. The inverse is known as underfitting, where the model is too general to
express essential subtleties of the data-set.
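The memorization extreme can be illustrated with a toy "classifier" that simply stores the training set verbatim (an illustrative sketch, not a model from the literature):

```python
class MemorizingClassifier:
    """Degenerate model: every training instance becomes its own rule."""

    def fit(self, samples, labels):
        # One table entry per instance - perfect recall of T.
        self.table = {tuple(s): y for s, y in zip(samples, labels)}
        return self

    def predict(self, sample, default=None):
        # Flawless on the training set, useless on anything unseen.
        return self.table.get(tuple(sample), default)
```

Such a model achieves 100% accuracy on T while generalizing to nothing outside it, which is precisely why accuracy must be measured on held-out data.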
Note that [4] mentions as an example that a person's credit cannot be deduced from her name, even though the data mining algorithm may in fact find such a pattern. Yet the purpose of knowledge discovery is precisely to uncover previously unknown patterns in vast data sets.
For example, the name Magnus was uncommon in Denmark 10 years ago; then suddenly a large number of infant males were given this name. Thus a Danish person of this name is more likely to belong to this younger generation. The name Olga is even more seldom used - perhaps there is not a single Danish woman under the age of 80 with this name. Assuming that people over 80 who live in houses employ cleaning assistants, it can then be deduced that a person whose name is Olga and who lives in a house employs cleaning assistants. Now, this sort of rule is not universally predictive, but for a given era in time it may be highly accurate.
The so-called memetic view of mental processes is that ideas - like evolution - do not develop at random: there is some mutation taking place, but often an idea is a combination of previous ideas. Given enough data (both in the sense of attributes for a given sample and the total number of samples) and a good enough predictive model, one might be able to foresee when the name Olga comes into vogue again, and how this affects previously discovered patterns.
4 Rule Discovery
The data mining task of classification revolves around discovering rules of the
form:
IF <antecedent> THEN <consequent>
Where the consequent is drawn from a finite set whose elements are called classes. That is, the task is to decide which class a single target attribute will take, given a number of predicting attributes. Naturally, the target attribute cannot occur in the antecedent of a rule.
For example the credit of a person may be discovered to be good, if she has
a job and a positive balance on her bank-account:
IF ((has job) AND (positive balance)) THEN (good credit)
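Applied to a single record, such a rule might be sketched as follows (the attribute names are illustrative, not taken from any particular data set):

```python
def credit_rule(record):
    """IF ((has job) AND (positive balance)) THEN (good credit)."""
    if record["has_job"] and record["balance"] > 0:
        return "good credit"
    # The rule does not fire; some other rule must classify this record.
    return None
```

A full classifier would consist of many such rules, with a default class for records covered by none of them.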
There are direct generalizations of the classification task, such as dependence modelling and association rules. But more interesting is data mining of first-order Horn clauses [1], or predicate logic, which discovers relationships with variables, for example the concepts of family relationships.
The actual representation or encoding of a chromosome is suggested in [4] to be binary. That seems a tad low-level, and it may be easier to develop and maintain an algorithm working on tree-based expressions, akin to genetic programming. The genetic operators are then modifications on trees that only need to ensure validity of the resulting trees. When the data set is to be accessed, if using relational databases, the tree may easily be mapped (flattened) to SQL statements - these are rather speedy with properly indexed tables. There is no need to fear more abstract chromosomes on the basis of execution speed of the genetic algorithm, as the manipulation of abstract data types is still negligible compared to the actual data access.
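The flattening of a rule tree to SQL might look roughly like this (a hypothetical tree encoding; the table and column names are assumptions for illustration):

```python
def to_sql(node):
    """Flatten a nested (operator, left, right) rule tree into a WHERE clause."""
    if isinstance(node, tuple):
        op, left, right = node
        return f"({to_sql(left)} {op} {to_sql(right)})"
    # Leaf: an attribute name or a constant, emitted as-is.
    return str(node)

# The antecedent of the credit rule as a tree, flattened to a query:
antecedent = ("AND", ("=", "has_job", "1"), (">", "balance", "0"))
query = "SELECT * FROM customers WHERE " + to_sql(antecedent)
```

Counting the records matched by the antecedent and by the full rule then gives the raw material for a fitness measure, with the database engine doing the heavy lifting.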
References
[1] Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997. ISBN 0-07-042807-7.
[2] David E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, 1989. ISBN 0-201-15767-5.