
CBT 2301 Data Mining

Introduction
Scenarios in life
1. Human in vitro fertilization involves collecting several eggs from a woman’s ovaries, which,
after fertilization with partner or donor sperm, produce several embryos. Some of these are
selected and transferred to the woman’s uterus. The challenge is to select the “best” embryos
to use—the ones that are most likely to survive. Selection is based on around 60 recorded
features of the embryos—characterizing their morphology, oocyte, and follicle, and the
sperm sample. The number of features is large enough to make it difficult for an
embryologist to assess them all simultaneously and correlate historical data with the crucial
outcome of whether that embryo did or did not result in a live child. Machine learning has
been investigated as a technique for making the selection, using historical records of
embryos and their outcome as training data.
2. Every year, large-scale dairy farmers have to make a tough business decision: which cows to
retain in their herd and which to sell off to an abattoir. Typically, one-fifth of the cows in a
dairy herd are culled each year near the end of the milking season as feed reserves dwindle.
Each cow’s breeding and milk production history influences this decision. Other factors
include age (a cow nears the end of its productive life at eight years), health problems,
history of difficult calving, undesirable temperament traits (kicking or jumping fences), and
not being pregnant with calf for the following season. About 700 attributes for each of
several million cows have been recorded over the years. Machine learning has been
investigated as a way of ascertaining what factors are taken into account by successful
farmers—not to automate the decision but to propagate their skills and experience to others.

Machine learning is a growing new technology for mining knowledge from data.

DATA MINING AND MACHINE LEARNING


We are overwhelmed with data. The amount of data in the world and in our lives seems ever-
increasing—and there’s no end in sight.
a) Omnipresent computers make it too easy to save things that previously we would have
trashed.
b) Inexpensive disks and online storage make it too easy to postpone decisions about what to
do with all the data—we simply get more memory and keep it all.
c) Ubiquitous electronics record our decisions, our choices in the supermarket, our financial
habits, our comings and goings.
We swipe our way through the world, every swipe a record in a database. The World Wide Web
(WWW) overwhelms us with information; meanwhile, every choice we make is recorded.
 It is a fact that there is a growing gap between the generation of data and our understanding
of it.
 As the volume of data increases, inexorably, the proportion of it that people understand
decreases alarmingly.
 Lying hidden in all this data is information—potentially useful information—that is rarely
made explicit or taken advantage of.
 Data Mining is about looking for patterns in data.
 Hunters seek patterns in animal migration behavior,
 Farmers seek patterns in crop growth,
 Politicians seek patterns in voter opinion (Cambridge Analytica case), and
 Lovers seek patterns in their partners’ responses.
 In data mining, the data is stored electronically and the search is automated—or at least
augmented—by computer – this is not particularly new;
 Economists, statisticians, forecasters, and communication engineers have long
worked with the idea that patterns in data can be sought automatically, identified,
validated, and used for prediction.
 What is new is the staggering increase in opportunities for finding patterns in data.
 It has been estimated that the amount of data stored in the world’s databases doubles
every 20 months.
 As the world grows in complexity, overwhelming us with the data it generates, data
mining becomes our only hope for elucidating hidden patterns.
 Intelligently analyzed data is a valuable resource – it can lead to new insights, and, in
commercial settings, to competitive advantages.
 Data mining is about solving problems by analyzing data already present in databases.
 Patterns of behavior of former customers can be analyzed to identify present
customers who are likely to jump ship to a competitor company – such a group of
customers can be targeted for special treatment, treatment too costly to apply to the
customer base as a whole.
 The same techniques can be used to identify customers who might be attracted to
another service the enterprise provides, one they are not presently enjoying, to target
them for special offers that promote this service. In today’s highly competitive,
customer-centered, service-oriented economy, data is the raw material that fuels
business growth—if only it can be mined.

Data mining is defined as the process of discovering patterns in data. The process must be
automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that
they lead to some advantage, usually an economic one. The data is invariably present in
substantial quantities.

Machine Learning provides techniques for finding and describing structural patterns in data.

Machine Learning
What is learning, anyway? What is machine learning? These are philosophical questions.
Dictionaries define “to learn” as;
 To get knowledge of something by study, experience, or being taught.
 To become aware by information or from observation
 To commit to memory
 To be informed of or to ascertain
 To receive instruction
These meanings have some shortcomings when it comes to talking about computers. For the
first two, it is virtually impossible to test whether learning has been achieved or not.
 How do you know whether a machine has got knowledge of something?
 How do you know whether it has become aware of something?
For the last three meanings, merely committing to memory and receiving instruction seem to
fall far short of what we might mean by machine learning.
 They are too passive – computers find these tasks trivial.
 An entity can receive instruction without benefiting from it at all.
Note: Things learn when they change their behavior in a way that makes them perform better in
the future. This ties learning to performance rather than knowledge. You can test learning by
observing present behavior and comparing it with past behavior.

Examples
The Weather Problem
 This weather problem concerns conditions that are suitable for playing some unspecified
game.
 In general, instances in a dataset are characterized by the values of features, or attributes,
that measure different aspects of the instance. In this case there are four attributes: outlook,
temperature, humidity, and windy. The outcome is whether to play or not.
 In table 1.2, all four attributes have values that are symbolic categories rather than numbers.
Outlook can be sunny, overcast, or rainy; temperature can be hot, mild, or cool; humidity
can be high or normal; and windy can be true or false. This creates 36 possible
combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of input examples.
 The following rules, intended to be interpreted in sequence, could be derived from the data
above;
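One plausible decision list of this kind (shown for illustration; the exact rules depend on the learning scheme) is:

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes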

 A set of rules that are intended to be interpreted in sequence is called a decision list.
 Suppose that two of the attributes—temperature and humidity—have numeric values as
shown in table 1.3. This means that any learning scheme must create inequalities involving
these attributes rather than simple equality tests as in the former case. This is called a
numeric-attribute problem—in this case, a mixed-attribute problem because not all
attributes are numeric.
 The first rule might take the form:
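For example (the exact threshold is illustrative): if outlook = sunny and humidity > 83 then play = no.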

 The rules we have seen so far are classification rules: They predict the classification of the
example in terms of whether to play or not.
 It is equally possible to disregard the classification and just look for any rules that strongly
associate different attribute values. These are called association rules. Examples from table
1.2 include;
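For example (these are illustrative; many other association rules also hold for this data):

If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high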

Irises: A Classic Numeric Dataset


The iris dataset, arguably the most famous dataset used in data mining, contains 50 examples of
each of three types of plant: Iris setosa, Iris versicolor, and Iris virginica. It is excerpted in Table
1.4. There are four attributes: sepal length, sepal width, petal length, and petal width (all measured
in centimeters). Unlike previous datasets, all attributes have values that are numeric.
The following set of rules might be learned from the dataset;
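For illustration (the thresholds shown are typical of what such a scheme finds, not the only possibility):

If petal length < 2.45 then Iris setosa
If petal length ≥ 2.45 and petal width < 1.75 then Iris versicolor
If petal length ≥ 2.45 and petal width ≥ 1.75 then Iris virginica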
Other Examples of Applications of Data Mining

1. Web Mining: Search engine companies examine the hyperlinks in web pages to come up
with a measure for ranking web pages. One approach to this ranking problem is to use machine
learning based on a training set of example queries—documents that contain the terms in the
query and human judgments about how relevant the documents are to that query. Then a
learning algorithm analyzes this training data and comes up with a way to predict the
relevance judgment for any document and query. For each document, a set of feature values
is calculated that depends on the query term—for example, whether it occurs in the title tag,
whether it occurs in the document’s URL, how often it occurs in the document itself, and
how often it appears in the anchor text of hyperlinks that point to the document. Search
engines mine the web to select advertisements that you might be interested in.
2. Judgment on loans: Loan defaulters are a risky group of customers. Credit industry
professionals point out that if only their future repayment behavior could be reliably determined, it is
precisely these customers whose business should be wooed, since they will always be in
financial need. A machine learning procedure has been used to produce a small set of classification
rules that made correct predictions on two-thirds of the borderline-case customers in an
independently chosen test set. Such rules improve the success rate of loan decisions on these
borderline cases.
3. Screening of images for oil slicks on the ocean.
4. Marketing and sales: Market basket analysis is the use of association techniques to find
groups of items that tend to occur together in transactions, typically supermarket checkout
data.

Input: Concepts, Instances, and Attributes


Four basically different styles of learning appear in data mining applications:
1. In classification learning, the learning scheme is presented with a set of classified examples
from which it is expected to learn a way of classifying unseen examples.
2. In association learning, any association among features is sought, not just ones that predict
a particular class value.
3. In clustering, groups of examples that belong together are sought.
4. In numeric prediction, the outcome to be predicted is not a discrete class but a numeric
quantity.
Regardless of the type of learning involved, we call the thing to be learned the concept and the
output produced by a learning scheme the concept description.

Classification learning is sometimes called supervised, because, in a sense, the scheme operates
under supervision by being provided with the actual outcome for each of the training examples—the
play or don’t play judgment, the lens recommendation, the type of iris, the acceptability of the labor
contract. This outcome is called the class of the example. The success of classification learning can
be judged by trying out the concept description that is learned on an independent set of test data for
which the true classifications are known but not made available to the machine. The success rate on
test data gives an objective measure of how well the concept has been learned.

The input to a machine learning scheme is a set of instances. These instances are the things that are
to be classified or associated or clustered. Although until now we have called them examples,
henceforth we will generally use the more specific term instances to refer to the input. In the
standard scenario, each instance is an individual, independent example of the concept to be learned.
Instances are characterized by the values of a set of predetermined attributes.
Each instance that provides the input to machine learning is characterized by its values on a fixed,
predefined set of features or attributes. The instances are the rows of the tables that we have shown
for the weather and the iris problems while the attributes are the columns.

Attributes can be either numeric or nominal quantities. Numeric attributes, sometimes called
continuous attributes, measure numbers—either real- or integer-valued (strictly speaking, calling
integer-valued attributes “continuous” is an abuse of the term). Nominal attributes take on values
in a prespecified, finite set of possibilities and are sometimes called categorical.

OUTPUT: KNOWLEDGE REPRESENTATION

1) TABLES
The simplest, most rudimentary way of representing the output from machine learning is to make it
just the same as the input—a table. For example, Table 1.2 is a decision table for the weather data:
You just look up the appropriate conditions to decide whether or not to play.

2) LINEAR MODELS
Another simple style of representation is a linear model, the output of which is just the sum of the
attribute values, except that weights are applied to each attribute before adding them together. The
trick is to come up with good values for the weights—ones that make the model’s output match the
desired output. For linear models, the output and the inputs—attribute values—are all numeric.
Regression is the process of predicting a numeric quantity, and regression model is another term
for this kind of linear model. Linear models are easiest to visualize in two dimensions, where they
involve drawing a straight line through a set of data points.
Figure 3.1 shows a line fitted to the CPU performance data where only the cache attribute is used as
input. The class attribute performance is shown on the vertical axis, with cache on the horizontal
axis; both are numeric. The straight line represents the “best fit” prediction equation.

PRP = 37.06 + 2.47 CACH

Given a test instance, a prediction can be produced by plugging the observed value of cache into
this expression to obtain a value for performance.
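For instance, for a hypothetical machine with a cache value of 32, the predicted performance would be PRP = 37.06 + 2.47 × 32 ≈ 116.1.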

Linear models can also be applied to binary classification problems. In this case, the line produced
by the model separates the two classes: It defines where the decision changes from one class value
to the other. Such a line is often referred to as the decision boundary. Figure 3.2 shows a decision
boundary for the iris data that separates the Iris setosas from the Iris versicolors.

In this case, the data is plotted using two of the input attributes—petal length and petal width—and
the straight line defining the decision boundary is a function of these two attributes. As before,
given a test instance, a prediction is produced by plugging the observed values of the attributes in
question into the expression. But here we check the result and predict one class if it is greater than
or equal to 0 (in this case, Iris setosa) and the other if it is less than 0 (Iris versicolor).
3) TREES
Consider Table 1.1, which shows contact lens data indicating the kind of contact lens to prescribe, given
certain information about a patient.
 The first column of Table 1.1 gives the age of the patient where presbyopia is a form of
longsightedness that accompanies the onset of middle age.
 The second column gives the spectacle prescription: Myope means shortsighted and
hypermetrope means longsighted.
 The third column shows whether the patient is astigmatic, while the fourth relates to the rate
of tear production, which is important in this context because tears lubricate contact lenses.
 The final column shows which kind of lenses to prescribe, whether hard, soft, or none.
 Rules shown in figure 1.1 can be derived from this table.
 However, a smaller set of rules could perform better.
 Machine learning techniques can be used to gain insight into the structure of data rather
than to make predictions for new cases.
 Figure 1.2 shows a structural description for the contact lens data in the form of a decision
tree, which is a more concise representation of the rules and has the advantage that it can be
visualized more easily.
 Nodes in a decision tree involve testing a particular attribute. Usually, the test compares an
attribute value with a constant. Leaf nodes give a classification that applies to all instances
that reach the leaf, or a set of classifications, or a probability distribution over all possible
classifications. To classify an unknown instance, it is routed down the tree according to the
values of the attributes tested in successive nodes, and when a leaf is reached the instance is
classified according to the class assigned to the leaf.
 In summary, decision trees divide the data at a node by comparing the value of some
attribute with a constant.

4) RULES
 Rules are a popular alternative to decision trees.
 The antecedent, or precondition, of a rule is a series of tests just like the tests at nodes in
decision trees, while the consequent, or conclusion, gives the class or classes that apply to
instances covered by that rule, or perhaps gives a probability distribution over the classes.

Classification Rules
 It is easy to read a set of classification rules directly off a decision tree since one rule is
generated for each leaf.
 The antecedent of the rule includes a condition for every node on the path from the root to
that leaf, and the consequent of the rule is the class assigned by the leaf.
 The left diagram of Figure 3.6 shows an exclusive-or function for which the output is a if
x=1 or y=1 but not both. To make this into a tree, you have to split on one attribute first,
leading to a structure like the one shown in the center. In contrast, rules can faithfully reflect
the true symmetry of the problem with respect to the attributes, as shown on the right.

 In this example the rules are not notably more compact than the tree.
 Reasons why rules are popular:
1. Each rule seems to represent an independent “nugget” of knowledge.
2. New rules can be added to an existing rule set without disturbing ones already there,
whereas to add to a tree structure may require reshaping the whole tree. However,
this independence is something of an illusion because it ignores the question of how
the rule set is executed.
 if rules are meant to be interpreted in order as a “decision list,” some of them,
taken individually and out of context, may be incorrect.
 if the order of interpretation is supposed to be immaterial, then it is not clear
what to do when different rules lead to different conclusions for the same
instance.
 This situation cannot arise for rules that are read directly off a decision tree
because the redundancy included in the structure of the rules prevents any
ambiguity in interpretation.
 What if a rule set gives multiple classifications for a particular example?
1. One solution is to give no conclusion at all.
2. Another is to count how often each rule fires on the training data and go with the
most popular one.
Note: Individual rules are simple, and sets of rules seem deceptively simple—but given just a
set of rules with no additional information, it is not clear how the set should be interpreted.

Association Rules
 Association rules can predict any attribute, not just the class – this gives them the freedom to
predict combinations of attributes too.
 Also, association rules are not intended to be used together as a set, as classification rules
are.
 Different association rules express different regularities that underlie the dataset, and they
generally predict different things.
 Since so many different association rules can be derived from even a very small dataset,
interest is restricted to those that apply to a reasonably large number of instances and have a
reasonably high accuracy on the instances to which they apply.
 The coverage of an association rule is the number of instances for which it predicts
correctly—this is often called its support. Its accuracy – often called confidence – is the
number of instances that it predicts correctly, expressed as a proportion of all instances to
which it applies.

5) INSTANCE-BASED REPRESENTATION
 The simplest form of learning is plain memorization, or rote learning.
 Once a set of training instances has been memorized, on encountering a new instance the
memory is searched for the training instance that most strongly resembles the new one.
 The only problem is how to interpret “resembles”.
 In instance-based learning, the training instances are stored; new instances whose class is
unknown are then related to existing ones whose class is known.
 In this case, instead of trying to create rules, learning happens directly from the examples.
 Note: In a sense, all the other learning methods are instance-based too, because we always
start with a set of instances as the initial training information.
 But the instance-based knowledge representation uses the instances themselves to represent
what is learned, rather than inferring a rule set or decision tree and storing it instead.
 The difference between this method and the others that we have seen is the time at which the
“learning” takes place.
1. Instance-based learning is lazy, deferring the real work as long as possible, whereas
other methods are eager, producing a generalization as soon as the data has been
seen.
2. In instance-based classification, each new instance is compared with existing ones
using a distance metric, and the closest existing instance is used to assign the class to
the new one. This is called the nearest-neighbor classification method. Sometimes
more than one nearest neighbor is used, and the majority class of the closest k
neighbors (or the distance-weighted average if the class is numeric) is assigned to the
new instance. This is termed the k-nearest-neighbor method.
3. Computing the distance between two examples is trivial when examples have just
one numeric attribute: It is just the difference between the two attribute values.
4. For examples with several numeric attributes, the standard Euclidean distance
(the straight-line distance between two points) is used, with the assumption that the attributes are
normalized and of equal importance. One of the main problems in learning is to
determine which are the “important” features. Some attributes will be more important than
others, and this is usually reflected in the distance metric by some kind of attribute
weighting. Deriving suitable attribute weights from the training set is a key problem
in instance-based learning.
5. When nominal attributes are present, it is necessary to come up with a “distance”
between different values of that attribute. What are the distances between, say, the
values red, green, and blue? Usually, a distance of zero is assigned if the values are
identical; otherwise, the distance is one. Thus, the distance between red and red is
zero but the distance between red and green is one.
6. However, it may be desirable to use a more sophisticated representation of the
attributes. For example, with more colors one could use a numeric measure of hue in
color space, making yellow closer to orange than it is to green and ocher closer still.
7. An apparent drawback to instance-based representations is that they do not make
explicit the structures that are learned i.e., instances do not really “describe” the
patterns in data.

Figure 3.10 shows different ways of partitioning the instance space. Given a single instance of each
of two classes, the nearest-neighbor rule effectively splits the instance space along the perpendicular
bisector of the line joining the instances. Given several instances of each class, the space is divided
by a set of lines that represent the perpendicular bisectors of selected lines joining an instance of
one class to one of another class.
1. Figure 3.10(a) illustrates a nine-sided polygon that separates the filled-circle class from the
open-circle class. This polygon is implicit in the operation of the nearest-neighbor rule.
2. When training instances are discarded (to save space and increase execution speed), the
result is to save just a few critical examples of each class. Figure 3.10(b) shows only the
examples that actually get used in nearest-neighbor decisions: The others (the light-gray
ones) can be discarded without affecting the result. These examples serve as a kind of
explicit knowledge representation.
3. Some instance-based representations go further and explicitly generalize the instances.
Typically, this is accomplished by creating rectangular regions that enclose examples of the
same class. Figure 3.10(c) shows the rectangular regions that might be produced. Unknown
examples that fall within one of the rectangles will be assigned the corresponding class; ones
that fall outside all rectangles will be subject to the usual nearest-neighbor rule. Of course,
this produces different decision boundaries from the straightforward nearest-neighbor rule,
as can be seen by superimposing the polygon in Figure 3.10(a) onto the rectangles. Any part
of the polygon that lies within a rectangle will be chopped off and replaced by the
rectangle’s boundary.
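A minimal Python sketch of the nearest-neighbor idea described above, assuming purely numeric attributes that have already been normalized (the small dataset and query point are made up for illustration):

import math

def euclidean(a, b):
    # Distance between two instances described by numeric attributes only
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    # train is a list of (attribute_vector, class_label) pairs
    neighbours = sorted(train, key=lambda inst: euclidean(inst[0], query))[:k]
    labels = [label for _, label in neighbours]
    # Majority vote among the k closest training instances
    return max(set(labels), key=labels.count)

# Illustrative data: two numeric attributes, two classes
train = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((5.0, 5.1), "b"), ((4.8, 5.3), "b")]
print(knn_classify(train, (1.1, 1.0), k=3))   # -> a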
6) CLUSTERS
When a cluster rather than a classifier is learned, the output takes the form of a diagram that shows
how the instances fall into clusters.
1. In the simplest case this involves associating a cluster number with each instance, which
might be depicted by laying the instances out in two dimensions and partitioning the space
to show each cluster, as illustrated in Figure 3.11(a).
2. Some clustering algorithms allow one instance to belong to more than one cluster, so the
diagram might lay the instances out in two dimensions and draw overlapping subsets
representing each cluster—a Venn diagram, as in Figure 3.11(b).
3. Some algorithms associate instances with clusters probabilistically rather than categorically.
In this case, for every instance there is a probability or degree of membership with which it
belongs to each of the clusters. This is shown in Figure 3.11(c). The numbers for each
example sum to 1.
4. Other algorithms produce a hierarchical structure of clusters so that at the top level the
instance space divides into just a few clusters, each of which divides into its own subcluster
at the next level down, and so on. In this case a diagram such as the one in Figure 3.11(d) is
used, in which elements joined together at lower levels are more tightly clustered than ones
joined together at higher levels. Such diagrams are called dendrograms, from the Greek for
“tree diagram”.
DATA MINING ALGORITHMS

This section explains the basic ideas behind the techniques that are used in practical data mining.
The section looks at the basic ideas about algorithms.

INFERRING RUDIMENTARY RULES (1R, i.e., 1-Rule)


 1R or 1-rule is an easy way of finding very simple classification rules from a set of
instances.
 1R generates a one-level decision tree expressed in the form of a set of rules that all test one
particular attribute.
 1R is a simple, cheap method with quite good rules for characterizing the structure in data.
Simple rules frequently achieve surprisingly high accuracy. Perhaps this is because the
structure underlying many real-world datasets is quite rudimentary, and just one attribute is
sufficient to determine the class of an instance quite accurately.

Concept of 1R algorithm
 Rules are made that test a single attribute and branch accordingly. Each branch corresponds
to a different value of the attribute.
 For each branch, the class that occurs most often in the training data is used as the classification.
 The error rate of the rules is then determined by counting the errors that occur on the training
data—that is, the number of instances that do not have the majority class of their branch.
 Each attribute generates a different set of rules, one rule for every value of the attribute. An
evaluation of the error rate for each attribute’s rule set is done and the best is chosen.
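A minimal Python sketch of this procedure (the attribute indices and the small dataset below are made up for illustration; the real weather data has four attributes and 14 instances):

from collections import Counter, defaultdict

def one_r(instances, attributes, class_index):
    # instances: tuples of attribute values plus the class value
    best_attr, best_rules, best_errors = None, None, None
    for a in attributes:
        # For each value of attribute a, find the majority class
        value_classes = defaultdict(Counter)
        for inst in instances:
            value_classes[inst[a]][inst[class_index]] += 1
        rules = {v: counts.most_common(1)[0][0] for v, counts in value_classes.items()}
        # Errors = instances that do not have the majority class of their branch
        errors = sum(sum(c.values()) - max(c.values()) for c in value_classes.values())
        if best_errors is None or errors < best_errors:
            best_attr, best_rules, best_errors = a, rules, errors
    return best_attr, best_rules, best_errors

# Illustrative usage with (outlook, windy, play) tuples
data = [("sunny", "false", "no"), ("sunny", "true", "no"),
        ("overcast", "false", "yes"), ("rainy", "false", "yes"),
        ("rainy", "true", "no")]
print(one_r(data, attributes=[0, 1], class_index=2))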

Consider the weather data of Table 1.2.


 To classify on the final column, play, 1R considers four sets of rules, one for each attribute.
These rules are shown in Table 4.1.
 An asterisk indicates that a random choice has been made between two equally likely
outcomes.
 The number of errors is given for each rule, along with the total number of errors for the rule
set as a whole.
 1R chooses the attribute that produces rules with the smallest number of errors—that is, the
first and third rule sets.
Arbitrarily breaking the tie between these two rule sets gives;
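Taking the outlook rule set, for example, the result is:

outlook: sunny → no
         overcast → yes
         rainy → yes

This rule set makes a total of 4 errors on the 14 training instances (2 on the sunny branch, 0 on the overcast branch, and 2 on the rainy branch).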

STATISTICAL MODELING
The 1R method uses a single attribute as the basis for its decisions and chooses the one that works
best. Another simple technique is to use all attributes and allow them to make contributions to the
decision that are equally important and independent of one another, given the class. This is
unrealistic, of course: What makes real-life datasets interesting is that the attributes are certainly not
equally important or independent. But it leads to a simple scheme that, again, works surprisingly
well in practice.

 Table 4.2 (generate the table in class) shows a summary of the weather data obtained by
counting how many times each attribute–value pair occurs with each value (yes and no) for
play.
 For example, from the weather data outlook is sunny for five examples, two of which have
play = yes and three of which have play = no.
 The cells in the first row of the new table simply count these occurrences for all possible
values of each attribute, and the play figure in the final column counts the total number of
occurrences of yes and no.
 The lower part of the table contains the same information expressed as fractions, or
observed probabilities. For example, of the nine days that play is yes, outlook is sunny for
two, yielding a fraction of 2/9.
 For play the fractions are different: They are the proportion of days that play is yes and no,
respectively.

Now suppose we encounter a new example with the values that are shown in Table 4.3.

 We treat the five features in Table 4.2—outlook, temperature, humidity, windy, and the
overall likelihood that play is yes or no—as equally important, independent pieces of
evidence and multiply the corresponding fractions.
 Looking at the outcome yes gives;

Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053

 The fractions are taken from the yes entries in the table according to the values of the
attributes for the new day, and the final 9/14 is the overall fraction representing the
proportion of days on which play is yes. A similar calculation for the outcome no leads to

Likelihood of no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

 The likelihood numbers can be turned into probabilities by normalizing them so that they
sum to 1:

Probability of yes = 0.0053 / (0.0053 + 0.0206) = 20.5%

Probability of no = 0.0206 / (0.0053 + 0.0206) = 79.5%

 This indicates that for the new day, no is four times more likely than yes.
 This simple and intuitive method is based on Bayes’ rule of conditional probability. Bayes’
rule says that if you have a hypothesis H and evidence E that bears on that hypothesis, then
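Pr[H | E] = (Pr[E | H] × Pr[H]) / Pr[E]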

Pr[A] denotes the probability of an event A


Pr[A | B] denotes the probability of A conditional on another event B.

 The hypothesis H is that play will be, say, yes.


 The evidence E is the particular combination of attribute values for the new day — outlook
= sunny, temperature = cool, humidity = high, and windy = true.
 Let’s call these four pieces of evidence E1, E2 , E3, and E4 , respectively.
 Assuming that these pieces of evidence are independent (given the class), their combined
probability is obtained by multiplying the probabilities:
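Pr[yes | E] = Pr[E1 | yes] × Pr[E2 | yes] × Pr[E3 | yes] × Pr[E4 | yes] × Pr[yes] / Pr[E]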

 The Pr[yes] at the end is the probability of a yes outcome without knowing any of the
evidence E—that is, without knowing anything about the particular day in question—and
it’s called the prior probability of the hypothesis H. In this case, it’s just 9/14, because 9 of
the 14 training examples had a yes.
 Substituting the fractions in Table 4.2 for the appropriate evidence probabilities leads to;
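Pr[yes | E] = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]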

 This method goes by the name of Naïve Bayes because it’s based on Bayes’ rule and
“naïvely” assumes independence—it is only valid to multiply probabilities when the events
are independent. The assumption that attributes are independent (given the class) in real life
certainly is a simplistic one.
 Despite the name, Naïve Bayes works very effectively when tested on actual datasets.
However, things go wrong for this algorithm when one of the attribute–value probabilities is zero
(i.e., an attribute value never occurs with a given class in the training data). This results in the
overall product becoming zero: probabilities that are zero hold a veto over the other ones.
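A minimal Python sketch that reproduces the calculation above; the conditional fractions are the ones quoted from Table 4.2 for the new day, so this is an illustration of the arithmetic rather than a general implementation:

# Fractions for the new day (sunny, cool, high, true), taken from the text
p_given_yes = [2/9, 3/9, 3/9, 3/9]   # outlook, temperature, humidity, windy given yes
p_given_no = [3/5, 1/5, 4/5, 3/5]    # the same attribute values given no
prior_yes, prior_no = 9/14, 5/14

like_yes = prior_yes
for p in p_given_yes:
    like_yes *= p                    # multiply the independent pieces of evidence
like_no = prior_no
for p in p_given_no:
    like_no *= p

total = like_yes + like_no           # normalize so the two probabilities sum to 1
print(round(like_yes, 4), round(like_no, 4))       # -> 0.0053 0.0206
print(round(100 * like_yes / total, 1), "% yes")   # -> 20.5 % yes
print(round(100 * like_no / total, 1), "% no")     # -> 79.5 % no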

DIVIDE-AND-CONQUER: CONSTRUCTING DECISION TREES


The problem of constructing a decision tree can be expressed recursively. First, select an attribute to
place at the root node, and make one branch for each possible value. This splits up the example set
into subsets, one for every value of the attribute. The process can be repeated recursively for each
branch, using only those instances that actually reach the branch. If at any time all instances at a
node have the same classification, stop developing that part of the tree.

The problem is how to determine which attribute to split on, given a set of examples with different
classes. Consider the weather data. There are four possibilities for each split, and at the top level
they produce the trees in Figure 4.2.

Any leaf with only one class—yes or no—will not have to be split further, and the recursive process
down that branch will terminate. Because we seek small trees, we would like this to happen as soon
as possible. If we had a measure of the purity of each node, we could choose the attribute that
produces the purest daughter nodes.

The measure of purity that we will use is called the information and is measured in units called bits.
Associated with each node of the tree, it represents the expected amount of information that would
be needed to specify whether a new instance should be classified yes or no, given that the example
reached that node. Unlike the bits in computer memory, the expected amount of information usually
involves fractions of a bit—and is often less than 1! It is calculated based on the number of yes and
no classes at the node.
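For a node with a yes instances and b no instances, this quantity is the entropy

info([a, b]) = −(a/(a + b)) log2(a/(a + b)) − (b/(a + b)) log2(b/(a + b))

so that, for example, info([2,3]) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971 bits.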
When evaluating the tree in Figure 4.2(a), the number of yes and no classes at the leaf nodes are [2,
3], [4, 0], and [3, 2], respectively, and the information values of these nodes are

info ([ 2 , 3 ]) = 0.971 bits


info ([ 4 , 0 ]) = 0.0 bits
info ([ 3 , 2 ]) = 0.971 bits

We calculate the average information value of these, taking into account the number of instances
that go down each branch i.e., five down the first and third and four down the second:

info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14 ) × 0 + (5/14 ) × 0.971


= 0.693 bits

The training examples at the root comprised nine yes and five no nodes, corresponding to an
information value of;

info([9,5]) = 0.940 bits

Thus, the tree in Figure 4.2(a) is responsible for an information gain of:

gain(outlook) = info([9,5]) − info([2,3], [4,0], [3,2])


= 0.940 − 0.693
= 0.247 bits
Gain(outlook) can be interpreted as the informational value of creating a branch on the outlook
attribute.

Thus we need to calculate the information gain for each attribute and split on the one that gains the
most information. In the situation that is shown in Figure 4.2:

gain(outlook) = 0.247 bits


gain(temperature) = 0.029 bits
gain(humidity) = 0.152 bits
gain(windy) = 0.048 bits

Therefore, we select outlook as the splitting attribute at the root of the tree. It is the only choice for
which one daughter node is completely pure, which gives it a considerable advantage over the
other attributes. Humidity is the next best choice because it produces a larger daughter node that is
almost completely pure. Then we continue, recursively.

Figure 4.3 shows the possibilities for a further branch at the node reached when outlook is sunny.

The information gain for each turns out to be;

• gain(temperature) = 0.571 bits


• gain(humidity) = 0.971 bits
• gain(windy) = 0.020 bits

Therefore, we select humidity as the splitting attribute at this point.

Continued application of the same idea leads to the decision tree of Figure 4.4 for the weather data.
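A minimal Python sketch reproducing the gain(outlook) calculation above, using the [2,3], [4,0], [3,2] branch counts and the [9,5] root counts given earlier:

import math

def info(counts):
    # Entropy, in bits, of a list of class counts such as [2, 3]
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

branches = [[2, 3], [4, 0], [3, 2]]   # outlook = sunny, overcast, rainy
root = [9, 5]                         # yes/no counts at the root

n = sum(sum(b) for b in branches)
weighted = sum(sum(b) / n * info(b) for b in branches)
gain_outlook = info(root) - weighted

print(round(info(root), 3))       # -> 0.94
print(round(weighted, 3))         # -> 0.694 (0.693 in the text, a small rounding difference)
print(round(gain_outlook, 3))     # -> 0.247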
LINEAR REGRESSION
 When the outcome, or class, is numeric, and all the attributes are numeric, linear regression
is a natural technique to consider. This is a staple method in statistics. The idea is to express
the class as a linear combination of the attributes, with predetermined weights:
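class = w0 + w1·a1 + w2·a2 + … + wk·ak

where a1, a2, …, ak are the attribute values and w0, w1, …, wk are the weights, calculated from the training data.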

 Linear regression is a process that allows one to make predictions about variable “Y” based
on knowledge about variable “X” . It summarizes how average values of a numerical
outcome variable vary over subpopulations defined by linear functions of predictors.
 Linear regression is an excellent, simple method for numeric prediction.
 However, linear models suffer from the disadvantage of linearity – if the data exhibits a
nonlinear dependency, the best-fitting straight line will still be found, but it may not fit the data
very well.
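A minimal sketch of fitting such a line with NumPy least squares (the numbers below are illustrative, not the CPU performance data):

import numpy as np

# Illustrative numeric data: one attribute (x) and a numeric class (y)
x = np.array([16.0, 32.0, 64.0, 128.0, 256.0])
y = np.array([70.0, 120.0, 205.0, 350.0, 680.0])

# Design matrix with a column of ones so that w0 is the intercept
X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w)                      # [w0, w1]: intercept and slope of the fitted line
print(w[0] + w[1] * 100.0)    # prediction for a new instance with x = 100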

LOGISTIC REGRESSION
 Logistic regression is a discriminative classifier that employs a probabilistic approach: it directly
models the posterior probability of the class given an input feature vector, which can then be
mapped onto a class label. The goal of logistic regression is to directly
estimate the posterior probability P(Y = i | X) from the training data. The logistic
regression model is defined as;
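In its standard binary form, P(Y = 1 | X; θ) = 1 / (1 + exp(−θᵀX))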

where θ is the vector of parameters to be estimated.


 Logistic regression is a standard way to model binary outcomes, i.e., data yi that take on the
values 0 or 1. It can also be used for count data, using the binomial distribution to model the
number of “successes” out of a specified number of trials, with the probability of
success fitted by a logistic regression. Logistic regression analyzes the relationship
between multiple independent variables and a categorical dependent variable, and estimates
the probability of occurrence of an event by fitting the data to a logistic curve.
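A minimal sketch using scikit-learn's LogisticRegression (assuming scikit-learn is available; the one-attribute dataset is made up for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative binary-outcome data: one numeric attribute, labels 0 or 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

print(model.predict([[2.5]]))          # predicted class label
print(model.predict_proba([[2.5]]))    # estimated posterior probabilities for each class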

Other common machine learning algorithms


1. Random Forest
2. Support Vector Machines
3. Bayesian Networks

Validating model's predictability – Hold out vs cross-validation


 For a prediction model to be considered effective, it should be tested on data that is
different from the data it was trained on.
 Several ways of splitting experimental data into training and testing data exist. A common
strategy is to split the data randomly, using the rule of thumb that about 2/3
of the data should be used for training. This is called the holdout method.
 In the holdout method, the model is built using the 2/3 training data and tested on the remaining 1/3 test data.
However, holdout estimates are characterized by high bias, due to the reduced amount of training data, but low variance.
 The most common method used in prediction experiments is cross-validation.
 Cross-validation is used in situations where the data set is relatively small, such
that splitting it into just two parts does not leave enough data for good prediction. In k-fold cross-validation,
the data is split into k equal-sized subsets, where k − 1 parts are used for training and the remaining
part for testing and calculation of the prediction error. Ten-fold cross-validation is the most
commonly used form of k-fold cross-validation. In ten-fold cross-validation, the data is randomly
split into ten groups, and ten experiments are carried out, each using one
group as the test set and the other nine combined as the training set.
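A minimal sketch of both strategies with scikit-learn (assuming scikit-learn is available; the dataset is synthetic and purely illustrative):

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic data: 30 instances, two numeric attributes, balanced binary class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(15, 2)), rng.normal(loc=1.0, size=(15, 2))])
y = np.array([0] * 15 + [1] * 15)

# Holdout: roughly 2/3 for training, 1/3 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# Ten-fold cross-validation: each fold is used once as the test set
scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print("10-fold CV mean accuracy:", scores.mean())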

MEASURING PERFORMANCE OF PREDICTION MODELS


 Evaluation of prediction models should address the performance, accuracy, and cost of using the
models.
 Various measures of model performance exist.

Significance testing, Correlation and Error measures


 Significance testing is used to determine the probability of a pattern, such as a relationship
between two variables, occurring by chance alone. If the probability of the test statistic having
occurred by chance is very low (usually p < 0.05), then there exists a statistically significant
relationship.
 R-square indicates how well a linear model fits the training data.
 Adjusted R-square is like R-square but also takes into account the number of variables in
the model. The higher the value of R-square and Adjusted R-square, the better the fit.
 A correlation coefficient quantifies the strength of the relationship
between two numerical variables. Thus, Spearman and Pearson correlation
coefficients show how good a prediction is: the higher the correlation between predicted and
actual values, the better the prediction. Spearman (rank) correlation is more appropriate than
Pearson correlation for data that is not normally distributed.
Confusion matrix
 The confusion matrix is a commonly used method of evaluating classification algorithms.
 The matrix below shows a typical confusion matrix where columns correspond to the
predicted class while rows are the actual class;

Fig. 1: Confusion matrix

 In the confusion matrix shown in figure 1,


◦ TN is the number of negative examples correctly classified (True Negatives),
◦ FP is the number of negative examples incorrectly classified as positive (False
Positives),
◦ FN is the number of positive examples incorrectly classified as negative (False
Negatives) and
◦ TP is the number of positive examples correctly classified (True Positives).
 Several performance measures have been derived from the confusion matrix; these
include precision, recall (sensitivity), specificity, false discovery rate, F-measure, and
accuracy. From the confusion matrix, the expressions for precision and recall (sensitivity)
are;

Precision = TP/(TP+FP)

Recall (sensitivity) = TP/(TP+FN)

 In terms of defective software modules, precision is the ratio of actually defective modules
among the modules predicted as defective, while recall is the ratio of detected defective
modules among all defective modules. Recall (sensitivity) therefore measures how often we
find what we are looking for, and evaluates to “1” if all instances of the True class
are classified to the True class.
 The main goal when learning from imbalanced datasets is to improve recall without hurting precision.
However, the goals of recall and precision often conflict: when increasing TP for the minority class,
the number of FP can also increase, thus reducing precision. To resolve this, an F-value metric
that combines the trade-off between precision and recall is used. The relationship
between recall and precision is given by the following F-value equation,
where β is usually set to 1 and controls the relative weighting of recall and precision:
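F-value = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)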
 The single number that the F-value outputs reflects the “goodness” of a classifier in the
presence of rare classes. While recall (sensitivity) measures how often we find what we are
looking for, specificity measures how well we reject what we are not looking for: it evaluates to
“1” if all instances of the negative class are classified as negative.

Specificity = TN/(FP + TN)
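A minimal Python sketch computing these measures from the four confusion-matrix counts (the counts themselves are made up for illustration):

# Illustrative confusion-matrix counts
TP, FP, FN, TN = 40, 10, 5, 45

precision = TP / (TP + FP)
recall = TP / (TP + FN)              # sensitivity
specificity = TN / (FP + TN)
accuracy = (TP + TN) / (TP + FP + FN + TN)
f_value = 2 * precision * recall / (precision + recall)   # F-value with beta = 1

print(precision, recall, specificity, accuracy, f_value)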

Receiver Operator Characteristic (ROC) curve


 In addition to the confusion matrix, a Receiver Operator Characteristic (ROC) curve is
another method of evaluating classification models.
 The Receiver Operating Characteristic (ROC) curve is a standard technique for summarizing
classifier performance over a range of trade-offs between true positive and false positive
error rates.
 The ROC curve is a two-dimensional graphical representation in which the True Positive (TP) rate is
plotted on the y-axis and the False Positive (FP) rate on the x-axis. ROC
analysis is performed by drawing curves in this two-dimensional space,
with the y-axis representing Sensitivity = TP rate, while the x-axis represents 1 − Specificity
= FP rate, as shown below.

Fig. 2: Cios et al.'s ROC curves for classifiers A and B

 To plot a curve for a model, its sensitivity and specificity are calculated and plotted as a
point on the ROC graph.
 An ideal model would be represented by the location (0, 1) on the graph in Fig. 2
corresponding to 100% specificity and 100% sensitivity.
 To obtain a curve on the ROC plot that corresponds to a single classifier, a threshold on the
quantity under study (for example, the predicted probability) is chosen and used to calculate a
single point. If the value of the threshold is varied, several points are obtained which, when
plotted, generate an ROC curve, as shown by the curves for classifiers A and B in Fig. 2.
 For models with curves that do not overlap in the graph space, the curve that is more to the
upper left would indicate a better classifier.
 However, a more general way of selecting an optimal model is by determining the Area Under the
Curve (AUC) of the ROC, whereby the better-performing model will have the larger Area
Under the Curve (AUC).
 A perfect classifier would have an AUC of 1, while a random classifier is expected to
achieve an AUC of 0.5 .
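A minimal sketch of computing the ROC points and the AUC with scikit-learn (assuming scikit-learn is available; the labels and scores are made up for illustration):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative true labels and classifier scores (e.g., predicted probabilities)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.6, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FP rate, TP rate) point per threshold
auc = roc_auc_score(y_true, y_score)

print(list(zip(fpr, tpr)))
print("AUC =", auc)   # 1.0 is a perfect ranking; about 0.5 is expected from a random classifier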
