Introduction To Data Mining
Introduction
Scenarios in life
1. Human in vitro fertilization involves collecting several eggs from a woman’s ovaries, which,
after fertilization with partner or donor sperm, produce several embryos. Some of these are
selected and transferred to the woman’s uterus. The challenge is to select the “best” embryos
to use—the ones that are most likely to survive. Selection is based on around 60 recorded
features of the embryos—characterizing their morphology, oocyte, and follicle, and the
sperm sample. The number of features is large enough to make it difficult for an
embryologist to assess them all simultaneously and correlate historical data with the crucial
outcome of whether that embryo did or did not result in a live child. Machine learning has
been investigated as a technique for making the selection, using historical records of
embryos and their outcome as training data.
2. Every year, large-scale dairy farmers have to make a tough business decision: which cows to
retain in their herd and which to sell off to an abattoir. Typically, one-fifth of the cows in a
dairy herd are culled each year near the end of the milking season as feed reserves dwindle.
Each cow’s breeding and milk production history influences this decision. Other factors
include age (a cow nears the end of its productive life at eight years), health problems,
history of difficult calving, undesirable temperament traits (kicking or jumping fences), and
not being pregnant with calf for the following season. About 700 attributes for each of
several million cows have been recorded over the years. Machine learning has been
investigated as a way of ascertaining what factors are taken into account by successful
farmers—not to automate the decision but to propagate their skills and experience to others.
Machine learning is a growing new technology for mining knowledge from data.
Data mining is defined as the process of discovering patterns in data. The process must be
automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that
they lead to some advantage, usually an economic one. The data is invariably present in
substantial quantities.
Machine Learning provides techniques for finding and describing structural patterns in data.
Machine Learning
What is learning, anyway? What is machine learning? These are philosophical questions.
Dictionaries define “to learn” as:
To get knowledge of something by study, experience, or being taught.
To become aware by information or from observation
To commit to memory
To be informed of or to ascertain
To receive instruction
These meanings have some shortcomings when it comes to talking about computers. For the
first two, it is virtually impossible to test whether learning has been achieved or not.
How do you know whether a machine has got knowledge of something?
How do you know whether it has become aware of something?
For the last three meanings, merely committing to memory and receiving instruction seem to
fall far short of what we might mean by machine learning.
They are too passive – computers find these tasks trivial.
An entity can receive instruction without benefiting from it at all.
Note: Things learn when they change their behavior in a way that makes them perform better in
the future. This ties learning to performance rather than knowledge. You can test learning by
observing present behavior and comparing it with past behavior.
Examples
The Weather Problem
This weather problem concerns conditions that are suitable for playing some unspecified
game.
In general, instances in a dataset are characterized by the values of features, or attributes,
that measure different aspects of the instance. In this case there are four attributes: outlook,
temperature, humidity, and windy. The outcome is whether to play or not.
In table 1.2, all four attributes have values that are symbolic categories rather than numbers.
Outlook can be sunny, overcast, or rainy; temperature can be hot, mild, or cool; humidity
can be high or normal; and windy can be true or false. This creates 36 possible
combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of input examples.
The following rules, intended to be interpreted in sequence, could be derived from the data
above:
A set of rules that are intended to be interpreted in sequence is called a decision list.
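A decision list can be sketched in code as a list of rules tried in order, where the first rule whose conditions all hold supplies the prediction. The rules below are one plausible set for the weather data, not necessarily the ones a learning scheme would output:

```python
# A decision list: rules are tried in order and the first rule whose
# conditions all hold supplies the prediction. These rules are one
# plausible set for the weather data, not necessarily the ones a
# learning scheme would output.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
    ({}, "yes"),                      # default rule: always matches
]

def classify(instance, rules=RULES):
    for conditions, outcome in rules:
        if all(instance.get(a) == v for a, v in conditions.items()):
            return outcome

print(classify({"outlook": "sunny", "humidity": "high", "windy": "false"}))  # no
```

Because the rules are interpreted in sequence, the later rules only apply to instances not already covered by the earlier ones.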
Suppose that two of the attributes—temperature and humidity—have numeric values as
shown in table 1.3. This means that any learning scheme must create inequalities involving
these attributes rather than simple equality tests as in the former case. This is called a
numeric-attribute problem—in this case, a mixed-attribute problem because not all
attributes are numeric.
The first rule might take a form such as: if outlook = sunny and humidity > 83 then play = no.
The rules we have seen so far are classification rules: They predict the classification of the
example in terms of whether to play or not.
It is equally possible to disregard the classification and just look for any rules that strongly
associate different attribute values. These are called association rules. Examples from Table
1.2 include:
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
Applications
1. Web Mining: Search engine companies examine the hyperlinks in web pages to come up
with a measure for ranking web pages. One approach to the ranking problem is to use machine
learning, based on a training set of example queries, documents containing the terms in the
query, and human judgments about how relevant the documents are to each query. Then a
learning algorithm analyzes this training data and comes up with a way to predict the
relevance judgment for any document and query. For each document, a set of feature values
is calculated that depends on the query term—for example, whether it occurs in the title tag,
whether it occurs in the document’s URL, how often it occurs in the document itself, and
how often it appears in the anchor text of hyperlinks that point to the document. Search
engines mine the web to select advertisements that you might be interested in.
2. Judgment on loans: Loan defaulters are a risky group of customers. Credit industry
professionals point out that if only their future repayment behavior could be reliably determined,
it is precisely these customers whose business should be wooed since they will always be in
financial need. A machine learning procedure is used to produce a small set of classification
rules that made correct predictions on two-thirds of the borderline case customers in an
independently chosen test set. These rules improve the success rate of the loan decisions on
defaulters.
3. Screening of images for oil slicks on the ocean surface.
4. Marketing and sales: Market basket analysis is the use of association techniques to find
groups of items that tend to occur together in transactions, typically supermarket checkout
data.
Classification learning is sometimes called supervised, because, in a sense, the scheme operates
under supervision by being provided with the actual outcome for each of the training examples—the
play or don’t play judgment, the lens recommendation, the type of iris, the acceptability of the labor
contract. This outcome is called the class of the example. The success of classification learning can
be judged by trying out the concept description that is learned on an independent set of test data for
which the true classifications are known but not made available to the machine. The success rate on
test data gives an objective measure of how well the concept has been learned.
The input to a machine learning scheme is a set of instances. These instances are the things that are
to be classified or associated or clustered. Although until now we have called them examples,
henceforth we will generally use the more specific term instances to refer to the input. In the
standard scenario, each instance is an individual, independent example of the concept to be learned.
Instances are characterized by the values of a set of predetermined attributes.
Each instance that provides the input to machine learning is characterized by its values on a fixed,
predefined set of features or attributes. The instances are the rows of the tables that we have shown
for the weather and the iris problems while the attributes are the columns.
Attributes can be either numeric or nominal quantities. Numeric attributes, sometimes called
continuous attributes, measure numbers—either real or integer valued (an abuse of the term
“continuous,” which strictly excludes integer-valued quantities). Nominal attributes take on
values in a prespecified, finite set of possibilities and are sometimes called categorical.
1) TABLES
The simplest, most rudimentary way of representing the output from machine learning is to make it
just the same as the input—a table. For example, Table 1.2 is a decision table for the weather data:
You just look up the appropriate conditions to decide whether or not to play.
2) LINEAR MODELS
Another simple style of representation is a linear model, the output of which is just the sum of the
attribute values, except that weights are applied to each attribute before adding them together. The
trick is to come up with good values for the weights—ones that make the model’s output match the
desired output. For linear models, the output and the inputs—attribute values—are all numeric.
Regression is the process of predicting a numeric quantity, and a regression model is another
term for this kind of linear model. Linear models are easiest to visualize in two dimensions, where they
involve drawing a straight line through a set of data points.
Figure 3.1 shows a line fitted to the CPU performance data where only the cache attribute is used as
input. The class attribute performance is shown on the vertical axis, with cache on the horizontal
axis; both are numeric. The straight line represents the “best fit” prediction equation.
Given a test instance, a prediction can be produced by plugging the observed value of cache into
this expression to obtain a value for performance.
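A line like the one in Figure 3.1 can be fitted by ordinary least squares. The (cache, performance) pairs below are invented for illustration; only the fitting procedure is the point:

```python
# Ordinary least-squares fit of performance = w0 + w1 * cache.
# These (cache, performance) pairs are invented for illustration;
# the real CPU dataset has different values and more attributes.
points = [(8, 100), (16, 140), (32, 210), (64, 360), (128, 640)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
# Closed-form slope and intercept for simple linear regression
w1 = (sum((x - mean_x) * (y - mean_y) for x, y in points)
      / sum((x - mean_x) ** 2 for x, _ in points))
w0 = mean_y - w1 * mean_x

def predict(cache):
    return w0 + w1 * cache
```

For a least-squares line with an intercept, the residuals on the training points sum to zero, which is a quick sanity check on the fit.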
Linear models can also be applied to binary classification problems. In this case, the line produced
by the model separates the two classes: It defines where the decision changes from one class value
to the other. Such a line is often referred to as the decision boundary. Figure 3.2 shows a decision
boundary for the iris data that separates the Iris setosas from the Iris versicolors.
In this case, the data is plotted using two of the input attributes—petal length and petal width—and
the straight line defining the decision boundary is a function of these two attributes. As before,
given a test instance, a prediction is produced by plugging the observed values of the attributes in
question into the expression. But here we check the result and predict one class if it is greater than
or equal to 0 (in this case, Iris setosa) and the other if it is less than 0 (Iris versicolor).
3) TREES
Consider Table 1.1, which shows contact lens data specifying the kind of contact lens to prescribe, given
certain information about a patient.
The first column of Table 1.1 gives the age of the patient; presbyopia is a form of
longsightedness that accompanies the onset of middle age.
The second column gives the spectacle prescription: Myope means shortsighted and
hypermetrope means longsighted.
The third column shows whether the patient is astigmatic, while the fourth relates to the rate
of tear production, which is important in this context because tears lubricate contact lenses.
The final column shows which kind of lenses to prescribe, whether hard, soft, or none.
Rules shown in figure 1.1 can be derived from this table.
However, a smaller set of rules could perform better.
Machine learning techniques can be used to gain insight into the structure of data rather
than to make predictions for new cases.
Figure 1.2 shows a structural description for the contact lens data in the form of a decision
tree, which is a more concise representation of the rules and has the advantage that it can be
visualized more easily.
Nodes in a decision tree involve testing a particular attribute. Usually, the test compares an
attribute value with a constant. Leaf nodes give a classification that applies to all instances
that reach the leaf, or a set of classifications, or a probability distribution over all possible
classifications. To classify an unknown instance, it is routed down the tree according to the
values of the attributes tested in successive nodes, and when a leaf is reached the instance is
classified according to the class assigned to the leaf.
In summary, decision trees divide the data at a node by comparing the value of some
attribute with a constant.
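Routing an instance down a decision tree can be sketched directly. The tree below mirrors the usual contact lens tree, but treat its exact structure as illustrative rather than as a faithful copy of Figure 1.2:

```python
# A decision tree as nested structures: an internal node is a
# (attribute, branches) pair, a leaf is a class label string.
# The structure is illustrative, not a faithful copy of Figure 1.2.
TREE = ("tear production rate", {
    "reduced": "none",
    "normal": ("astigmatism", {
        "no": "soft",
        "yes": ("spectacle prescription", {
            "myope": "hard",
            "hypermetrope": "none",
        }),
    }),
})

def classify(instance, node=TREE):
    while isinstance(node, tuple):       # internal node: route by attribute value
        attribute, branches = node
        node = branches[instance[attribute]]
    return node                          # leaf: the class to assign

print(classify({"tear production rate": "reduced"}))  # none
```

The unknown instance is routed down by testing one attribute per node until a leaf supplies the class, exactly as described above.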
4) RULES
Rules are a popular alternative to decision trees.
The antecedent, or precondition, of a rule is a series of tests just like the tests at nodes in
decision trees, while the consequent, or conclusion, gives the class or classes that apply to
instances covered by that rule, or perhaps gives a probability distribution over the classes.
Classification Rules
It is easy to read a set of classification rules directly off a decision tree since one rule is
generated for each leaf.
The antecedent of the rule includes a condition for every node on the path from the root to
that leaf, and the consequent of the rule is the class assigned by the leaf.
The left diagram of Figure 3.6 shows an exclusive-or function for which the output is a if
x=1 or y=1 but not both. To make this into a tree, you have to split on one attribute first,
leading to a structure like the one shown in the center. In contrast, rules can faithfully reflect
the true symmetry of the problem with respect to the attributes, as shown on the right.
In this example the rules are not notably more compact than the tree.
Reasons why rules are popular:
1. Each rule seems to represent an independent “nugget” of knowledge.
2. New rules can be added to an existing rule set without disturbing ones already there,
whereas to add to a tree structure may require reshaping the whole tree. However,
this independence is something of an illusion because it ignores the question of how
the rule set is executed.
if rules are meant to be interpreted in order as a “decision list,” some of them,
taken individually and out of context, may be incorrect.
if the order of interpretation is supposed to be immaterial, then it is not clear
what to do when different rules lead to different conclusions for the same
instance.
This situation cannot arise for rules that are read directly off a decision tree
because the redundancy included in the structure of the rules prevents any
ambiguity in interpretation.
What if a rule set gives multiple classifications for a particular example?
1. One solution is to give no conclusion at all.
2. Another is to count how often each rule fires on the training data and go with the
most popular one.
Note: Individual rules are simple, and sets of rules seem deceptively simple—but given just a
set of rules with no additional information, it is not clear how it should be interpreted.
Association Rules
Association rules can predict any attribute, not just the class – this gives them the freedom to
predict combinations of attributes too.
Also, association rules are not intended to be used together as a set, as classification rules
are.
Different association rules express different regularities that underlie the dataset, and they
generally predict different things.
Since so many different association rules can be derived from even a very small dataset,
interest is restricted to those that apply to a reasonably large number of instances and have a
reasonably high accuracy on the instances to which they apply.
The coverage of an association rule is the number of instances for which it predicts
correctly—this is often called its support. Its accuracy – often called confidence – is the
number of instances that it predicts correctly, expressed as a proportion of all instances to
which it applies.
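Coverage (support) and accuracy (confidence) can be computed by simple counting. This sketch uses the standard 14-instance weather dataset, which matches the counts quoted earlier (for example, outlook is sunny for five days, two yes and three no):

```python
# Standard 14-instance weather dataset, one instance per line:
# outlook, temperature, humidity, windy, play.
ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
FIELDS = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [dict(zip(FIELDS, line.split())) for line in ROWS.splitlines()]

def support_confidence(antecedent, consequent, data):
    # Instances to which the rule applies (antecedent holds)...
    applies = [r for r in data if all(r[a] == v for a, v in antecedent.items())]
    # ...and among those, the instances it predicts correctly.
    correct = [r for r in applies if all(r[a] == v for a, v in consequent.items())]
    confidence = len(correct) / len(applies) if applies else 0.0
    return len(correct), confidence

# "If humidity = normal and windy = false then play = yes"
print(support_confidence({"humidity": "normal", "windy": "false"},
                         {"play": "yes"}, DATA))  # → (4, 1.0)
```

Here the rule covers four instances (its support) and is correct on all of them, a confidence of 100%.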
5) INSTANCE-BASED REPRESENTATION
The simplest form of learning is plain memorization, or rote learning.
Once a set of training instances has been memorized, on encountering a new instance the
memory is searched for the training instance that most strongly resembles the new one.
The only problem is how to interpret “resembles”.
In instance based learning, instances are stored after which new instances whose class is
unknown are related to existing ones whose class is known.
In this case, instead of trying to create rules, learning happens directly from the examples.
Note: In a sense, all the other learning methods are instance-based too, because we always
start with a set of instances as the initial training information.
But the instance-based knowledge representation uses the instances themselves to represent
what is learned, rather than inferring a rule set or decision tree and storing it instead.
The difference between this method and the others that we have seen is the time at which the
“learning” takes place.
1. Instance-based learning is lazy, deferring the real work as long as possible, whereas
other methods are eager, producing a generalization as soon as the data has been
seen.
2. In instance-based classification, each new instance is compared with existing ones
using a distance metric, and the closest existing instance is used to assign the class to
the new one. This is called the nearest-neighbor classification method. Sometimes
more than one nearest neighbor is used, and the majority class of the closest k
neighbors (or the distance-weighted average if the class is numeric) is assigned to the
new instance. This is termed the k-nearest-neighbor method.
3. Computing the distance between two examples is trivial when examples have just
one numeric attribute: It is just the difference between the two attribute values.
4. For examples with several numeric attributes, the standard Euclidean distance (the
straight-line distance between two points) is used, with the assumption that the attributes
are normalized and of equal importance. One of the main problems in learning is to
determine the “important” features. Some attributes will be more important than
others, and this is usually reflected in the distance metric by some kind of attribute
weighting. Deriving suitable attribute weights from the training set is a key problem
in instance-based learning.
5. When nominal attributes are present, it is necessary to come up with a “distance”
between different values of that attribute. What are the distances between, say, the
values red, green, and blue? Usually, a distance of zero is assigned if the values are
identical; otherwise, the distance is one. Thus, the distance between red and red is
zero but the distance between red and green is one.
6. However, it may be desirable to use a more sophisticated representation of the
attributes. For example, with more colors one could use a numeric measure of hue in
color space, making yellow closer to orange than it is to green and ocher closer still.
7. An apparent drawback to instance-based representations is that they do not make
explicit the structures that are learned i.e., instances do not really “describe” the
patterns in data.
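The nearest-neighbor method described in items 2 through 5 can be sketched as follows. Numeric attributes are assumed to be already normalized, nominal values use the 0/1 convention, and the tiny training set is invented:

```python
import math
from collections import Counter

# k-nearest-neighbor sketch. Numeric attributes are assumed to be
# already normalized; nominal values contribute 0 if identical and
# 1 otherwise. The training set below is invented for illustration.

def distance(a, b):
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)):
            total += (x - y) ** 2            # squared numeric difference
        else:
            total += 0.0 if x == y else 1.0  # nominal: 0/1 mismatch
    return math.sqrt(total)

def knn_classify(query, training, k=3):
    # Sort the training instances by distance and let the k closest vote.
    neighbors = sorted(training, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [
    ((0.1, "red"), "a"),
    ((0.2, "red"), "a"),
    ((0.9, "blue"), "b"),
    ((1.0, "blue"), "b"),
]
print(knn_classify((0.15, "red"), training, k=3))  # a
```

With k = 1 this is plain nearest-neighbor classification; with larger k the majority class of the k closest instances is assigned.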
Figure 3.10 shows different ways of partitioning the instance space. Given a single instance of each
of two classes, the nearest-neighbor rule effectively splits the instance space along the perpendicular
bisector of the line joining the instances. Given several instances of each class, the space is divided
by a set of lines that represent the perpendicular bisectors of selected lines joining an instance of
one class to one of another class.
1. Figure 3.10(a) illustrates a nine-sided polygon that separates the filled-circle class from the
open-circle class. This polygon is implicit in the operation of the nearest-neighbor rule.
2. When training instances are discarded (to save space and increase execution speed), the
result is to save just a few critical examples of each class. Figure 3.10(b) shows only the
examples that actually get used in nearest-neighbor decisions: The others (the light-gray
ones) can be discarded without affecting the result. These examples serve as a kind of
explicit knowledge representation.
3. Some instance-based representations go further and explicitly generalize the instances.
Typically, this is accomplished by creating rectangular regions that enclose examples of the
same class. Figure 3.10(c) shows the rectangular regions that might be produced. Unknown
examples that fall within one of the rectangles will be assigned the corresponding class; ones
that fall outside all rectangles will be subject to the usual nearest-neighbor rule. Of course,
this produces different decision boundaries from the straightforward nearest-neighbor rule,
as can be seen by superimposing the polygon in Figure 3.10(a) onto the rectangles. Any part
of the polygon that lies within a rectangle will be chopped off and replaced by the
rectangle’s boundary.
6) CLUSTERS
When clusters rather than a classifier are learned, the output takes the form of a diagram that shows
how the instances fall into clusters.
1. In the simplest case this involves associating a cluster number with each instance, which
might be depicted by laying the instances out in two dimensions and partitioning the space
to show each cluster, as illustrated in Figure 3.11(a).
2. Some clustering algorithms allow one instance to belong to more than one cluster, so the
diagram might lay the instances out in two dimensions and draw overlapping subsets
representing each cluster—a Venn diagram, as in Figure 3.11(b).
3. Some algorithms associate instances with clusters probabilistically rather than categorically.
In this case, for every instance there is a probability or degree of membership with which it
belongs to each of the clusters. This is shown in Figure 3.11(c). The numbers for each
example sum to 1.
4. Other algorithms produce a hierarchical structure of clusters so that at the top level the
instance space divides into just a few clusters, each of which divides into its own subcluster
at the next level down, and so on. In this case a diagram such as the one in Figure 3.11(d) is
used, in which elements joined together at lower levels are more tightly clustered than ones
joined together at higher levels. Such diagrams are called dendrograms (from the Greek for
“tree diagram”).
DATA MINING ALGORITHMS
This section explains the basic ideas behind the algorithms that are used in practical data mining.
Concept of 1R algorithm
Rules are made that test a single attribute, with one branch for each value of that attribute.
Each branch is assigned the class that occurs most often in the training data for that value.
The error rate of the rules is then determined by counting the errors that occur on the training
data—that is, the number of instances that do not have the majority class for their value.
Each attribute generates a different set of rules, one rule for every value of the attribute. An
evaluation of the error rate for each attribute’s rule set is done and the best is chosen.
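The 1R procedure can be sketched as follows, using the standard 14-instance weather dataset. On this data, outlook and humidity both make 4 errors out of 14; with ties broken in favor of the attribute tried first, outlook wins:

```python
from collections import Counter, defaultdict

# Standard 14-instance weather dataset, one instance per line:
# outlook, temperature, humidity, windy, play.
ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
FIELDS = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [dict(zip(FIELDS, line.split())) for line in ROWS.splitlines()]

def one_r(data, attributes, class_attr="play"):
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)             # value -> class counts
        for row in data:
            counts[row[attr]][row[class_attr]] += 1
        # One rule per value: predict that value's majority class.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Training-set errors: instances not in their value's majority class.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

attr, rules, errors = one_r(DATA, ["outlook", "temperature", "humidity", "windy"])
print(attr, errors)  # outlook 4
```

The winning rule set reads: if outlook = sunny then no; if outlook = overcast then yes; if outlook = rainy then yes.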
STATISTICAL MODELING
The 1R method uses a single attribute as the basis for its decisions and chooses the one that works
best. Another simple technique is to use all attributes and allow them to make contributions to the
decision that are equally important and independent of one another, given the class. This is
unrealistic, of course: What makes real-life datasets interesting is that the attributes are certainly not
equally important or independent. But it leads to a simple scheme that, again, works surprisingly
well in practice.
Table 4.2 (generate the table in class) shows a summary of the weather data obtained by
counting how many times each attribute–value pair occurs with each value (yes and no) for
play.
For example, from the weather data outlook is sunny for five examples, two of which have
play = yes and three of which have play = no.
The cells in the first row of the new table simply count these occurrences for all possible
values of each attribute, and the play figure in the final column counts the total number of
occurrences of yes and no.
The lower part of the table contains the same information expressed as fractions, or
observed probabilities. For example, of the nine days that play is yes, outlook is sunny for
two, yielding a fraction of 2/9.
For play the fractions are different: They are the proportion of days that play is yes and no,
respectively.
Now suppose we encounter a new example with the values shown in Table 4.3: outlook =
sunny, temperature = cool, humidity = high, and windy = true.
We treat the five features in Table 4.2—outlook, temperature, humidity, windy, and the
overall likelihood that play is yes or no—as equally important, independent pieces of
evidence and multiply the corresponding fractions.
Looking at the outcome yes gives:
likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
The fractions are taken from the yes entries in the table according to the values of the
attributes for the new day, and the final 9/14 is the overall fraction representing the
proportion of days on which play is yes. A similar calculation for the outcome no leads to:
likelihood of no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
The likelihood numbers can be turned into probabilities by normalizing them so that they
sum to 1:
probability of yes = 0.0053 / (0.0053 + 0.0206) = 20.5%
probability of no = 0.0206 / (0.0053 + 0.0206) = 79.5%
This indicates that for the new day, no is about four times more likely than yes.
This simple and intuitive method is based on Bayes’ rule of conditional probability. Bayes’
rule says that if you have a hypothesis H and evidence E that bears on that hypothesis, then
Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]
With independent pieces of evidence E1, …, E4 (the four attribute values) and the hypothesis
that play is yes, this becomes
Pr[yes | E] = Pr[E1 | yes] × Pr[E2 | yes] × Pr[E3 | yes] × Pr[E4 | yes] × Pr[yes] / Pr[E]
The Pr[yes] at the end is the probability of a yes outcome without knowing any of the
evidence E—that is, without knowing anything about the particular day in question—and
it is called the prior probability of the hypothesis H. In this case, it is just 9/14, because 9 of
the 14 training examples had a yes outcome for play.
Substituting the fractions in Table 4.2 for the appropriate evidence probabilities leads to
Pr[yes | E] = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
The denominator Pr[E] disappears when the likelihoods for yes and no are normalized to sum to 1.
This method goes by the name of Naïve Bayes because it is based on Bayes’ rule and
“naïvely” assumes independence—it is only valid to multiply probabilities when the events
are independent. The assumption that attributes are independent (given the class) is
certainly a simplistic one for real-life data.
Despite the name, Naïve Bayes works very effectively when tested on actual datasets.
However, things go wrong for this algorithm when one of the attribute probabilities is zero:
the overall likelihood then becomes zero, so a zero probability holds a veto over the other
ones. The standard remedy is the Laplace estimator—add 1 to each count—so that no
estimated probability is ever exactly zero.
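The whole calculation can be reproduced mechanically from the raw counts; this sketch recomputes the per-class likelihoods for a day with outlook = sunny, temperature = cool, humidity = high, windy = true, and normalizes them (note it includes no fix for the zero-probability problem):

```python
from collections import Counter

# Standard 14-instance weather dataset, one instance per line:
# outlook, temperature, humidity, windy, play.
ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
FIELDS = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [dict(zip(FIELDS, line.split())) for line in ROWS.splitlines()]

def naive_bayes(instance, data, class_attr="play"):
    class_counts = Counter(r[class_attr] for r in data)
    likelihoods = {}
    for cls, n_cls in class_counts.items():
        like = n_cls / len(data)                  # prior Pr[class]
        for attr, value in instance.items():
            matches = sum(1 for r in data
                          if r[class_attr] == cls and r[attr] == value)
            like *= matches / n_cls               # Pr[attr = value | class]
        likelihoods[cls] = like
    total = sum(likelihoods.values())             # normalize to sum to 1
    return {cls: like / total for cls, like in likelihoods.items()}

new_day = {"outlook": "sunny", "temperature": "cool",
           "humidity": "high", "windy": "true"}
probs = naive_bayes(new_day, DATA)
```

For this day the normalized probability of no comes out close to 80%, roughly four times that of yes, matching the hand calculation.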
CONSTRUCTING DECISION TREES
The problem is how to determine which attribute to split on, given a set of examples with different
classes. Consider the weather data. There are four possibilities for each split, and at the top level
they produce the trees in Figure 4.2.
Any leaf with only one class—yes or no—will not have to be split further, and the recursive process
down that branch will terminate. Because we seek small trees, we would like this to happen as soon
as possible. If we had a measure of the purity of each node, we could choose the attribute that
produces the purest daughter nodes.
The measure of purity that we will use is called the information and is measured in units called bits.
Associated with each node of the tree, it represents the expected amount of information that would
be needed to specify whether a new instance should be classified yes or no, given that the example
reached that node. Unlike the bits in computer memory, the expected amount of information usually
involves fractions of a bit—and is often less than 1! It is calculated from the number of yes and
no classes at the node: a node with p yes and n no instances has
info([p, n]) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
When evaluating the tree in Figure 4.2(a), the numbers of yes and no classes at the leaf nodes are [2,
3], [4, 0], and [3, 2], respectively, and the information values of these nodes are
info([2, 3]) = 0.971 bits, info([4, 0]) = 0.0 bits, info([3, 2]) = 0.971 bits
We calculate the average information value of these, taking into account the number of instances
that go down each branch, i.e., five down the first and third and four down the second:
info([2, 3], [4, 0], [3, 2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
The training examples at the root comprised nine yes and five no examples, corresponding to an
information value of
info([9, 5]) = 0.940 bits
Thus, the tree in Figure 4.2(a) is responsible for an information gain of
gain(outlook) = info([9, 5]) − info([2, 3], [4, 0], [3, 2]) = 0.940 − 0.693 = 0.247 bits
Thus we need to calculate the information gain for each attribute and split on the one that gains the
most information. In the situation that is shown in Figure 4.2:
gain(outlook) = 0.247 bits, gain(temperature) = 0.029 bits,
gain(humidity) = 0.152 bits, gain(windy) = 0.048 bits
Therefore, we select outlook as the splitting attribute at the root of the tree. It is the only choice for
which one daughter node is completely pure, which gives it a considerable advantage over the
other attributes. Humidity is the next best choice because it produces a larger daughter node that is
almost completely pure. Then we continue, recursively.
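The information and gain calculations can be checked with a few lines of code over the same 14-instance weather data:

```python
import math
from collections import Counter, defaultdict

# Standard 14-instance weather dataset, one instance per line:
# outlook, temperature, humidity, windy, play.
ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
FIELDS = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [dict(zip(FIELDS, line.split())) for line in ROWS.splitlines()]

def info(counts):
    # Expected information (entropy), in bits, of a class distribution.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def gain(data, attr, class_attr="play"):
    before = info(list(Counter(r[class_attr] for r in data).values()))
    groups = defaultdict(Counter)          # attribute value -> class counts
    for r in data:
        groups[r[attr]][r[class_attr]] += 1
    # Average information of the daughter nodes, weighted by size.
    after = sum(sum(c.values()) / len(data) * info(list(c.values()))
                for c in groups.values())
    return before - after
```

Evaluating `gain` for each attribute reproduces the ranking above, with outlook the clear winner.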
Figure 4.3 shows the possibilities for a further branch at the node reached when outlook is sunny.
Continued application of the same idea leads to the decision tree of Figure 4.4 for the weather data.
LINEAR REGRESSION
When the outcome, or class, is numeric, and all the attributes are numeric, linear regression
is a natural technique to consider. This is a staple method in statistics. The idea is to express
the class x as a linear combination of the attributes a1, a2, …, ak, with predetermined weights:
x = w0 + w1a1 + w2a2 + … + wkak
where the weights are calculated from the training data.
Linear regression allows one to make predictions about a variable Y based on knowledge of
a variable X. It summarizes how average values of a numeric outcome variable vary over
subpopulations defined by linear functions of predictors.
Linear regression is an excellent, simple method for numeric prediction.
However, linear models suffer from the disadvantage of linearity – if the data exhibits a
nonlinear dependency, the best-fitting straight line would be found. This line may not fit
very well.
LOGISTIC REGRESSION
Logistic regression is a discriminative classifier that employs a probabilistic approach to
directly model posterior probability by learning a discriminative function that maps an input
feature vector directly onto a class label. The goal of Logistic regression is to directly
estimate the posterior probability P(Y (T) = i|X ) from the training data. The Logistic
regression model is defined as;
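A minimal sketch of the model follows, with placeholder weights; in practice the weights are learned from the training data, typically by maximizing the conditional likelihood:

```python
import math

# Logistic regression posterior sketch. The weights passed in are
# arbitrary placeholders; in practice they are learned from training
# data (e.g., by maximizing the conditional log-likelihood).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def posterior(x, weights, bias):
    # P(Y = 1 | x): sigmoid of a weighted sum of the inputs.
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)
```

The output always lies strictly between 0 and 1, so it can be read as a posterior probability; P(Y = 0 | x) is simply one minus it.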
PRECISION AND RECALL
Precision and recall are computed from the counts of true positives (TP), false positives
(FP), and false negatives (FN):
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
In terms of defective software modules, precision is the ratio of actually defective modules
within the modules predicted as defective, while recall is the ratio of detected defective
modules among all defective modules. Recall (sensitivity) therefore measures how often we
find what we are looking for and evaluates to a value of 1 if all instances of the true class
are classified to the true class.
The main goal when learning from imbalanced datasets is to improve recall without hurting precision.
However, the recall and precision goals often conflict: increasing the number of true positives for the
minority class can also increase the number of false positives, which reduces precision. To resolve
this, an F-value metric that combines the trade-off between precision and recall is used. The
relationship between recall and precision is given by the following F-value equation:
F = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)
where β is usually set to 1 and corresponds to the relative importance of recall versus precision.
The single number that the F-value outputs reflects the “goodness” of a classifier in the
presence of rare classes. While recall (sensitivity) measures how often we find what we are
looking for, specificity, TN/(TN+FP), measures how often we correctly reject what we are
not looking for.
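All four quantities follow directly from the confusion-matrix counts; the counts passed in the usage line are invented:

```python
# Precision, recall (sensitivity), specificity, and the F-value from
# confusion-matrix counts; beta = 1 gives the usual F1 score. The
# counts used in the call below are invented for illustration.
def metrics(tp, fp, tn, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    f_value = ((1 + beta ** 2) * precision * recall
               / (beta ** 2 * precision + recall))
    return precision, recall, specificity, f_value

precision, recall, specificity, f1 = metrics(tp=8, fp=2, tn=85, fn=5)
```

Setting β above 1 weights recall more heavily, and below 1 weights precision more heavily.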
To plot a curve for a model, its sensitivity and specificity are calculated and plotted as a
point on the ROC graph, whose horizontal axis is the false-positive rate (1 − specificity) and
whose vertical axis is the true-positive rate (sensitivity).
An ideal model would be represented by the location (0, 1) on the graph in Fig. 2,
corresponding to 100% specificity and 100% sensitivity.
To obtain a curve on the ROC plot for a single classifier, a threshold on the quantity under
study is chosen and used to calculate one point for the classifier. As the threshold is varied,
several points are obtained, which when plotted generate an ROC curve, as shown by the
curves for classifiers A and B in Fig. 2.
For models with curves that do not overlap in the graph space, the curve that is more to the
upper left would indicate a better classifier.
However, the standard way of selecting an optimal model is to compute the Area Under the
Curve (AUC) of the ROC: the better-performing model has the larger AUC.
A perfect classifier has an AUC of 1, while a random classifier is expected to
achieve an AUC of 0.5.
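Given ROC points as (false-positive rate, true-positive rate) pairs, the AUC follows from the trapezoidal rule; the example points are illustrative:

```python
# AUC by the trapezoidal rule over ROC points given as
# (false-positive rate, true-positive rate) pairs.
def auc(points):
    pts = sorted(points)                 # sort by false-positive rate
    area = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0  # trapezoid between points
    return area

print(auc([(0, 0), (0.2, 0.7), (0.5, 0.9), (1, 1)]))  # 0.785
```

The diagonal from (0, 0) to (1, 1), a random classifier, gives 0.5; a curve rising straight to (0, 1) before reaching (1, 1), a perfect classifier, gives 1.0.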