CS263 - Bayesian Decision Theory
Introduction
One of the goals of pattern recognition is to obtain an optimal decision rule for classifying data into their respective categories. Of the existing decision rules in pattern recognition, such as Chow's rule and the nearest-neighbor rule, Bayesian decision theory is often regarded as the optimal choice (Bow, 2002).
The Bayesian approach describes categories by probability distributions over the attributes of the objects, specified by a model function and its parameters. It also has several advantages over other methods, including that the number of categories is determined automatically, objects are not assigned to categories absolutely, all attributes are potentially significant, and data can be real-valued or discrete.
This report presents a review of Bayesian decision theory in pattern recognition. Decision theories deal with the development of methods and techniques appropriate for making decisions in an optimal fashion. The optimality of the Bayesian approach is exemplified in this paper by surveying real-world applications in artificial intelligence and pattern recognition research.
The survey revealed that, more often than not, the Bayesian approach outperforms the other machine learning models applied to the task at hand. This paper discusses five real-world examples of Bayesian decision-driven machine learning: English letter recognition, a computer-vision application, spam filtering, database clustering, and association football prediction.
$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}$$

Formula 1 – Bayes rule using probability density functions
The likelihood $p(x \mid \omega_j)$ is the probability of observing the value $x$ given that the class is $\omega_j$, while the prior probability $P(\omega_j)$ reflects how likely each class is before the actual observation is made.
The evidence, denoted $p(x)$, is usually considered a scaling term. For discrete features, Bayes' theorem takes the equivalent form:

$$P(\omega_j \mid x) = \frac{P(x \mid \omega_j)\,P(\omega_j)}{P(x)}$$

Formula 4 – Bayes rule using probability distributions
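As a concrete illustration, the short sketch below applies Bayes' rule to a two-class problem in Python; the likelihood and prior values are illustrative assumptions, not taken from any of the surveyed papers.

```python
# A minimal sketch of Bayes' rule for a discrete two-class problem.
# Likelihoods and priors below are illustrative, not from the paper.

def posterior(likelihoods, priors):
    """Return P(w_j | x) for every class j, given p(x | w_j) and P(w_j)."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # p(x), the scaling term
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# p(x | w_1) = 0.7, p(x | w_2) = 0.1, with priors 0.4 and 0.6
print(posterior([0.7, 0.1], [0.4, 0.6]))  # -> [0.8235..., 0.1764...]
```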
Nonetheless, the Bayes decision rule remains unchanged in both cases, as its purpose is to minimize the risk or cost of the decision.
Each action $a_i$ has an associated conditional risk $R(a_i \mid x)$, where $\lambda(a_i \mid \omega_j)$ is the loss incurred for taking action $a_i$ when the true state of nature is $\omega_j$. The conditional risk is computed so that minimizing it for every observation minimizes the overall risk; the expression is the same for both the continuous and the discrete case:

$$R(a_i \mid x) = \sum_{j=1}^{n} \lambda(a_i \mid \omega_j)\,P(\omega_j \mid x)$$

The risk of each action given the observation is therefore the sum of the losses for that action over all states, weighted by the probability of occurrence of each state. The action with the minimum risk is then selected.
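The following minimal sketch shows this minimum-risk selection for a two-class, two-action problem; the loss matrix and posterior values are illustrative assumptions.

```python
# A minimal sketch of minimum-risk action selection, assuming a loss matrix
# loss[i][j] = lambda(a_i | w_j) and posteriors P(w_j | x). Numbers are illustrative.

def conditional_risk(loss_row, posteriors):
    """R(a_i | x) = sum_j lambda(a_i | w_j) * P(w_j | x)."""
    return sum(l * p for l, p in zip(loss_row, posteriors))

loss = [[0.0, 1.0],   # action a_1: no loss if w_1 is true, unit loss if w_2
        [5.0, 0.0]]   # action a_2: heavy loss if w_1 is true
posteriors = [0.8, 0.2]

risks = [conditional_risk(row, posteriors) for row in loss]
best = min(range(len(risks)), key=risks.__getitem__)
print(risks, "-> choose action", best + 1)  # [0.2, 4.0] -> choose action 1
```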
Given the discriminant function $g(x)$, we decide $\omega_1$ if $g(x) > 0$, which gives us two equivalent forms of the discriminant function:

$$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$$

$$g(x) = \ln\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$

Formula 7 – Two forms of the two-category case discriminant function for dependent feature vectors
If the feature vector is binary and its components are assumed (correctly or incorrectly) to be independent, a simplified Bayes rule can be employed:

$$g(x) = \sum_{i=1}^{d} \omega_i x_i + \omega_0$$

where

$$\omega_i = \ln\frac{p_i(1 - q_i)}{q_i(1 - p_i)}, \quad i = 1, \dots, d$$

and

$$\omega_0 = \sum_{i=1}^{d} \ln\frac{1 - p_i}{1 - q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$

Formula 8 – Two-category case discriminant function for independent feature vectors
It is important to note that $\omega_i$ and $\omega_0$ are the weights of the resulting linear discriminant. The discriminant function $g(x)$ above indicates whether the current feature vector belongs to class 1 or class 2; the decision boundary lies wherever $g(x) = 0$ and can be a line or a hyperplane, depending on the dimension of the feature space.
Example of a Two-Category Case Problem
Consider the three-dimensional binary feature vector $x = (x_1, x_2, x_3) = (0, 1, 1)$, which we will attempt to classify as belonging to class 1 or class 2, given the prior probabilities $P(\omega_1) = 0.6$ and $P(\omega_2) = 0.4$. There is already an evident bias towards class 1.
The likelihoods of the independent features are $p = \{0.8, 0.2, 0.5\}$ and $q = \{0.2, 0.5, 0.9\}$. Since the problem definition assumes that the features are independent, the discriminant function can be calculated directly.
Plugging the $x_i$ values into the discriminant function gives $g(x) = -2.4849$. Since $g(x) = -2.4849 < 0$, the feature vector $x = (0, 1, 1)$ belongs to class 2.
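As a check, the short sketch below evaluates Formula 8 in Python with the values stated above and reproduces $g(x) = -2.4849$.

```python
import math

# Reproduces the worked two-category example above (Formula 8):
# binary features x = (0, 1, 1), p_i = P(x_i = 1 | w_1), q_i = P(x_i = 1 | w_2).
x = [0, 1, 1]
p = [0.8, 0.2, 0.5]
q = [0.2, 0.5, 0.9]
prior1, prior2 = 0.6, 0.4

w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
w0 = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)) \
     + math.log(prior1 / prior2)

g = sum(wi * xi for wi, xi in zip(w, x)) + w0
print(round(g, 4), "-> class 1" if g > 0 else "-> class 2")  # -2.4849 -> class 2
```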
The two-category discriminant can also be applied to an $n$-class problem. This is accomplished by setting $g_1(x) = g_i(x)$ and $g_2(x) = g_{\text{not } i}(x)$. The probability for $g_2(x)$ is obtained by summing the probabilities of classes $\{1, \dots, i-1, i+1, \dots, n\}$. If $x$ belongs to class $i$, then $g_i(x) > g_{\text{not } i}(x)$; otherwise $x$ belongs to some other class.
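A minimal sketch of this one-versus-rest reduction, with illustrative posterior values:

```python
# One-versus-rest reduction: g_1(x) = g_i(x), g_2(x) = g_not_i(x), where the
# "not i" posterior is the sum of the posteriors of all remaining classes.
# The posterior values below are illustrative.

def one_vs_rest(posteriors, i):
    """Return (g_i, g_not_i) for class index i."""
    g_i = posteriors[i]
    g_not_i = sum(p for j, p in enumerate(posteriors) if j != i)
    return g_i, g_not_i

posteriors = [0.6, 0.3, 0.1]          # P(w_1|x), P(w_2|x), P(w_3|x)
g_i, g_rest = one_vs_rest(posteriors, 0)
print("class 1" if g_i > g_rest else "some other class")  # class 1 (0.6 > 0.4)
```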
Literature Survey
English Letter Classification Using Bayesian Decision Theory and Feature Extraction Using
Principal Component Analysis
(Husnain & Naweed, 2009) utilized Bayesian decision theory to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters of the English alphabet. The character images were based on 20 different fonts, each randomly distorted to produce a file of 20,000 unique instances.
The image dataset used in the research was donated to the UCI data repository by David J. Slate and P. W. Frey in 1991. Different distortion techniques, such as compression and changes in aspect ratio along the x and y axes, were applied to add bearable noise to the dataset. For each black-and-white image of an English letter, the authors extracted a 16-dimensional feature vector summarizing the letter image.
The feature vector contains characteristic features of the image, such as the vertical and horizontal position of the rectangular box containing the letter, the total number of ON pixels, and the edge count. Each instance was converted into 16 primitive numerical attributes, such as mean, variance, moments, and covariance, scaled to fit an integer range from 0 to 15.
In the first set of experiments, 14,000 items were used as training data and the remaining 6,000 as the test set, from which instances were selected at random and fed to the classifier to check the predicted letter class. This achieved 92% accuracy over 100 random input instances, of which only 8 were misclassified.
In the second set of experiments the training data was increased from 14,000 to 16,000, which reduced the error rate to 2%: only 2 of the 100 random character inputs were misclassified. These results were far better than those (Frey & Slate, 1991) achieved using a Holland-style adaptive classifier for letter recognition, which had only 80% accuracy.
The research also revealed that the English letters 'N' and 'H' have almost the same shape and the same number of ON pixels, resulting in similar posterior probabilities.
The features were reduced from 16 to 8 using principal component analysis (PCA), an eigenvector/eigenvalue-based approach for reducing the dimensionality of multivariate data. PCA identifies patterns in data and expresses the data in a way that highlights their similarities and differences. The accuracy of the Bayesian decision theory classifier was checked again for 100 random inputs and reached 98%, with 16,000 instances kept as training data. As shown in the scree graph below, the first 8 principal components account for nearly 90% of the variance, while the remaining components have diminishing variance and little significance for classification.
Figure 4 – Scree graph of the principal components and their variance
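As an illustration of this reduction step, the sketch below applies PCA to a stand-in 16-attribute matrix using scikit-learn; the random data is only a placeholder for the actual UCI letter-recognition features.

```python
# A minimal PCA sketch: reduce 16 attributes to 8 principal components.
# X is a stand-in for the real (n_samples, 16) letter-recognition matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.integers(0, 16, size=(1000, 16)).astype(float)  # placeholder data

pca = PCA(n_components=8)
X_reduced = pca.fit_transform(X)                # 16 features -> 8 components
print(X_reduced.shape)                          # (1000, 8)
print(pca.explained_variance_ratio_.cumsum())   # cumulative variance captured
```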
It can be concluded that principal component analysis combined with Bayesian decision theory gives efficient results in document analysis and proved more effective than the Holland-style and statistical adaptive similarity classifiers on letter recognition tasks.
A Vision-based Method for Weeds Identification through the Bayesian Decision Theory
(Tellaeche, Burgos-Artizzu, Pajares, & Ribeiro, 2008) developed an automatic computer-vision-based approach for the detection and differential spraying of weeds in corn crops. Their strategy involves an image segmentation process that divides the incoming image into cells and extracts the features and attributes used in the decision-making procedure, which is based on the computation of a posterior probability under a Bayesian framework; the prior probability is derived from the dynamics of the tractor in which the method is embedded. The decision to be made is whether a cell is to be sprayed or not, and it requires a database containing a set of samples classified as items to be sprayed or not, built either offline or online.
The knowledge base upon which the online decisions are based is built during the offline stage, and the image segmentation process is identical in both stages. The training process is carried out offline; during the online stage, new images are processed and a decision is made about each of them, and the results are then stored in the knowledge base, augmenting the estimates obtained offline.
Figure 5 – Vision-based segmentation scheme and decision process
A set of 340 digital images acquired with an HPR817 digital camera over four different days in May 2006 and April/May 2007 was used to assess the validity and performance of the proposed approach. Eighteen video sequences, acquired at 15 frames per second according to the tractor motion, were selected, and 10 frames were extracted from each, so the $k$-th frame of sequence $i$ is denoted $f_i^k$, where $k = 1, \dots, 10$ and $i = 1, \dots, 18$. Two consecutive frames, $f_i^k$ and $f_i^{k+1}$, differ by $3u$ image rows: assuming the origin of coordinates is the bottom-left corner, row 1 of $f_i^{k+1}$ matches row $3u$ of $f_i^k$, where $u$ is a constant parameter set to 50.
The fourth and fifth rows of cells are expanded in frame $f_i^{k+1}$ into the first, second, and third rows of cells, which implies that the final spraying decision should be made for the first, second, and third rows of cells, while the fourth and fifth rows are used to compute the prior probability for the next frame. It should also be noted that the tractor speed is fixed at 4 km/h, so 12 m are covered in about 11 seconds; hence the time elapsed between frames $f_i^k$ and $f_i^{k+1}$ is about 11 seconds.
The authors designed a test strategy with an initialization step labeled STEP 0. This step simulates the offline phase with 160 images and was estimated by cross-validation, with 256 cells in the training set and 48 cells in the validation set, both randomly selected. Five training processes were performed, each using a different set for validation and the remaining cells as training data, which guarantees that the number of training samples is always greater than or equal to 256. For each validation set, $k$ was varied and the error computed; the errors were averaged for each set and each $k$, and the best $k$ is the one with the minimum mean error, obtained at $k = 0.3$.
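The sketch below mirrors the shape of this validation scheme in Python: five folds, each used once for validation, with the parameter $k$ chosen by minimum mean error. The data and the error function are illustrative placeholders, not the paper's classifier.

```python
# A sketch of 5-fold selection of a parameter k by minimum mean validation
# error. The cells, labels, and error function are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
cells = rng.normal(size=(320, 4))          # stand-in for cell feature vectors
labels = (cells[:, 0] > 0).astype(int)     # stand-in spray / no-spray labels

def validation_error(k, train_idx, val_idx):
    # Placeholder: in the paper this would train the Bayesian classifier with
    # parameter k and return its error rate on the validation fold.
    return abs(k - 0.3) + rng.normal(scale=0.01)

folds = np.array_split(rng.permutation(len(cells)), 5)
ks = np.arange(0.1, 1.0, 0.1)
mean_err = []
for k in ks:
    errs = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        errs.append(validation_error(k, train_idx, val_idx))
    mean_err.append(np.mean(errs))
print("best k:", round(float(ks[int(np.argmin(mean_err))]), 1))  # ~0.3 here
```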
In STEPs 1 to 3, a decision is made in each frame $k$ for the six cells in its bottom part, each cell described by its area-vector of attributes $x$. After the decision, a set $S_Y$ of cells belonging to $w_y$ (to be sprayed) and a set $S_N$ of cells belonging to $w_n$ (not requiring spraying) are obtained. For the first frame the prior probabilities are set to 0.5; otherwise the prior probabilities are the posterior probabilities computed for the four preceding cells in the previous frame.
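The sketch below illustrates the flavor of this prior-propagation scheme for a single cell: the posterior computed in one frame becomes the prior for the next, starting from an uninformative 0.5. The likelihood values are illustrative assumptions, not the paper's class-conditional densities.

```python
# Sequential Bayesian updating: the posterior for a cell in frame f_i^k
# becomes the prior for the matching cell in frame f_i^{k+1}.
# Likelihoods here are illustrative.

def update(prior_spray, lik_spray, lik_no_spray):
    """One Bayesian update for the two classes w_y (spray) and w_n (no spray)."""
    num = lik_spray * prior_spray
    den = num + lik_no_spray * (1.0 - prior_spray)
    return num / den

prior = 0.5                      # first frame: uninformative prior
for lik_y, lik_n in [(0.7, 0.3), (0.8, 0.4), (0.6, 0.5)]:  # successive frames
    prior = update(prior, lik_y, lik_n)   # posterior feeds the next frame
    print(round(prior, 3))                # 0.7, 0.824, 0.848
```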
The knowledge base is updated by adding both sets of cells ($S_Y$ and $S_N$) to the previous entries, classifying all cells as belonging to $w_y$ or $w_n$; the stored cells are then used to obtain a new estimate of the class-conditional probability density functions.
Performance is established by comparing the judgment of farmers and technical consultants against the results obtained in each test. The number of cells correctly identified as requiring spraying is denoted True Spraying (TS); the number of cells correctly detected as not requiring spraying, True No Spraying (TN); the number of cells that do not require spraying but are identified as cells to be sprayed, False Spraying (FS); and the number of cells requiring spraying that the method identifies as not requiring spraying, False No Spraying (FN).
Figure 7 – Number of images and number of cells to be sprayed or not according to the Bayesian classifier
Figure 8 – Correct classification percentage and Yule score values for the tests and steps
The correct classification percentage is computed as

$$CCP = \frac{TS + TN}{TS + FS + TN + FN}$$

while the Yule coefficient is

$$Yule = \left| \frac{TS}{TS + FS} + \frac{TN}{TN + FN} - 1 \right|$$

Figures 7 and 8 show that the best performance was achieved by Test 3 in STEP 3 and that the worst performer was Test 1. The best performance being achieved in STEP 3 reflects the degree of learning performed; as the tables also show, performance improves as the learning progresses.
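A minimal sketch of both measures, computed from illustrative counts rather than the paper's actual figures:

```python
# CCP and Yule coefficient from TS/TN/FS/FN counts (illustrative values).

def ccp(ts, tn, fs, fn):
    """Correct classification percentage."""
    return (ts + tn) / (ts + fs + tn + fn)

def yule(ts, tn, fs, fn):
    """Yule coefficient: |TS/(TS+FS) + TN/(TN+FN) - 1|."""
    return abs(ts / (ts + fs) + tn / (tn + fn) - 1.0)

ts, tn, fs, fn = 80, 90, 10, 20
print(f"CCP  = {ccp(ts, tn, fs, fn):.3f}")   # 0.850
print(f"Yule = {yule(ts, tn, fs, fn):.3f}")  # 0.707
```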
Overall, the research succeeded in developing an automated decision-making process for detecting weeds in corn crops using Bayesian decision theory. Although the paper notes that the robustness of the proposed approach against illumination variability remains in question, the approach still achieves important savings in cost and pollution.
A Bayesian Approach to Filtering Junk E-mail
(Sahami, Dumais, Heckerman, & Horvitz, 1998) employed a naïve Bayesian classifier to filter junk e-mail. A decision-theoretic notion of cost-sensitive classification was adopted, as the cost of misclassifying a legitimate e-mail as junk far outweighs the cost of marking a piece of junk as legitimate. Accordingly, a message is classified as junk only if the probability that it would be placed in the junk class is greater than 99.9%.
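The sketch below illustrates this cost-sensitive decision rule; the posterior probabilities are hypothetical stand-ins for the filter's output.

```python
# Cost-sensitive junk-mail rule: mark a message as junk only when its
# posterior junk probability exceeds 0.999. Posterior values are illustrative.

THRESHOLD = 0.999

def classify(p_junk):
    return "junk" if p_junk > THRESHOLD else "legitimate"

for p in (0.97, 0.9995, 0.99985):
    print(p, "->", classify(p))
# Only messages with posterior above 0.999 are filtered: 0.97 stays legitimate.
```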
Figure 9 shows the precision and recall for both junk and legitimate e-mail under each feature regime. While the phrasal information improves performance only slightly, the incorporation of even a small amount of domain knowledge clearly improves the resulting classifications.
Figure 10 – Precision and recall curves for junk e-mail using various feature sets
Figure 10 focuses on the range from 0.85 to 1.0 to show the region of greatest variation in the junk-mail precision/recall curves more clearly. It shows that incorporating additional features, especially non-textual domain-specific information, gives consistently superior results compared with considering only the words in the messages.
This research demonstrated that it is possible to automatically learn effective filters that eliminate a large portion of junk e-mail from a user's mail stream, and that the efficacy of these filters can be enhanced by a set of hand-crafted features specific to the task at hand. While the Bayesian framework used in the research was successful, it exposed the need for methods aimed at controlling the variance of parameter estimates in text categorization problems; hence the use of Support Vector Machines (SVMs) in a decision-theoretic framework incorporating asymmetric misclassification costs is a promising avenue for further research. The use of other Bayesian classifiers that are less restrictive than naïve Bayes is also expected to yield better classification probability estimates and more accurate cost-sensitive classifications.
Database Clustering Using the AutoClass Bayesian Classification System
The AutoClass program breaks the classification problem into two parts: determining the number of classes and determining the parameters that define them. It uses a Bayesian variant of the EM algorithm of Dempster, Laird, and Rubin to find the best class parameters for a given number of classes; the algorithm is derived by differentiating the posterior distribution with respect to the class parameters and setting the result to zero.
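The sketch below does not reproduce AutoClass; as a rough stand-in it uses scikit-learn's EM-based GaussianMixture on synthetic data and chooses the number of classes by the Bayesian Information Criterion, mirroring AutoClass's two sub-problems.

```python
# Stand-in for AutoClass-style clustering: EM-fitted Gaussian mixtures with
# the number of classes selected by BIC. Data here is synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(100, 2)) for m in (0, 4, 8)])

best_k, best_bic = None, np.inf
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)  # EM fit
    bic = gm.bic(X)
    if bic < best_bic:
        best_k, best_bic = k, bic
print("chosen number of classes:", best_k)  # 3 for this synthetic data
```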
The program classified data supplied by researchers active in various domains and yielded new and intriguing results, such as the discovery, with high confidence, of three classes in the Iris database, even though not all cases could be assigned to their classes with certainty. It also found the four known classes in Stepp's soybean disease database, exactly matching the result of Michalski's CLUSTER/2 system.
Finally, AutoClass assayed the Infrared Astronomical Satellite database, which contains 5,425 cases with 94 attributes and was considered the least thoroughly understood by domain experts. The program discovered classes that differed significantly from NASA's previous analysis but clearly reflect physical phenomena in the data.
Predicting Football Results using Bayesian Nets and Other Machine Learning Techniques
(Joseph, Fenton, & Neil, 2006) compared the performance of an expert-constructed Bayesian net (BN) with other machine learning techniques, namely a naïve BN, KNN, and decision trees, for predicting the outcome of matches played by the English football club Tottenham Hotspur FC from 1995 to 1997. Their objective was to see how the expert-constructed BN performs in terms of predictive accuracy and explanatory clarity regarding the factors affecting the results of the matches under investigation.
The expert-constructed BN uses features such as the presence or absence of three key players (Sheringham, Anderton, and Armstrong), whether Wilson is playing in midfield, the quality of the opposing team measured on a simple 3-point scale (high, medium, low), and whether the game is played at Spurs' home ground or away.
Aside from these, additional factors (the quality of the Spurs attacking force, the overall quality of the Spurs team, and how well the team will perform given its own quality and that of its opponents) were related to the outcome of the game (win, lose, or draw) to simplify the structure. All were measured as low, medium, or high.
Figure 11 – Expert-constructed Bayesian net for Tottenham Hotspur performance
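To give a feel for how such a network is queried, the pure-Python sketch below enumerates a drastically simplified version of the model, with only opponent quality and venue as parents of the outcome; every probability in it is hypothetical, not taken from the paper.

```python
# A tiny Bayesian net queried by enumeration: quality and venue -> outcome.
# All probabilities are hypothetical illustrations, not the paper's values.
from itertools import product

P_quality = {"high": 0.3, "medium": 0.4, "low": 0.3}     # opponent quality
P_venue = {"home": 0.5, "away": 0.5}
P_outcome = {   # P(outcome | quality, venue); each row sums to 1
    ("high", "home"):   {"win": 0.30, "draw": 0.30, "lose": 0.40},
    ("high", "away"):   {"win": 0.15, "draw": 0.25, "lose": 0.60},
    ("medium", "home"): {"win": 0.50, "draw": 0.30, "lose": 0.20},
    ("medium", "away"): {"win": 0.35, "draw": 0.30, "lose": 0.35},
    ("low", "home"):    {"win": 0.70, "draw": 0.20, "lose": 0.10},
    ("low", "away"):    {"win": 0.55, "draw": 0.25, "lose": 0.20},
}

def p_outcome(evidence=None):
    """P(outcome) given optional evidence, marginalizing unobserved parents."""
    evidence = evidence or {}
    totals = {"win": 0.0, "draw": 0.0, "lose": 0.0}
    for q, v in product(P_quality, P_venue):
        if evidence.get("quality", q) != q or evidence.get("venue", v) != v:
            continue                              # inconsistent with evidence
        weight = P_quality[q] * P_venue[v]
        for o, p in P_outcome[(q, v)].items():
            totals[o] += weight * p
    z = sum(totals.values())                      # renormalize under evidence
    return {o: t / z for o, t in totals.items()}

print(p_outcome({"quality": "high", "venue": "home"}))  # the CPT row itself
print(p_outcome())                                      # full marginal
```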
All machine learning models were implemented using the MLC++ package, apart from the expert-constructed BN, which was built with the Hugin tool. The match data was divided into disjoint subsets used as training and validation sets: the data for each season was divided into three groups of ten matches and one group of eight matches, organized chronologically.
Figure 12 – Comparison of learner accuracy with expert model data
The expert-constructed BN was the most accurate predictor of the outcome of the Spurs games, with a classification error of 40.79% over the disjoint training and test data sets. Its poorest performance was on the 1995/1996 data; however, with classification errors of 50% and 40.74% for the 1995/1996 and 1996/1997 seasons respectively, it was still the best classifier on the intra-season data. It also produced the best results among all the classifiers for every cross-season test period, with an average classification error of 33.62%.
Figure 12 shows the relative accuracy of the machine learning models implemented in the study. KNN was the best performer when the same training and test data for the complete seasons were used, but its accuracy dropped significantly when disjoint training and test data sets were used, in which case the expert-constructed BN outperformed all other learners.
The study reveals which of the selected attributes are the crucial factors affecting the outcome of a football game, and the relationships between these factors. One limitation of all the non-expert methods used is that they rely only on the supplied attributes, which constrains the learnt Bayesian nets. The performance of the expert-constructed Bayesian network was impressive given the inherent analysis bias against it. Although the study is now dated, since it involves variables relating to key players who have since retired or left the club, its results confirm the excellent potential of Bayesian networks when they are built by a reliable domain expert.
A direction for future work extending this study is the construction of a more symmetrical model using similar data for all the teams in the league, although this may multiply the amount of computational work by the number of additional teams. Another potential improvement is to quantify the inherent quality of each player who plays and to use abstract nodes, such as the quality of the attack and defence, to improve the model and ensure its longevity.
Conclusion
This paper has described the Bayesian approach to pattern recognition and exemplified it with five real-world classification tasks. Bayesian decision theory provides a simple and extensible approach that is not limited to classification but extends to prediction and general mixture separation. Its theoretical basis is free from ad hoc quantities and from measures that alter the data to suit the needs of the program. As a result, most of the Bayesian classification models described in this paper lend themselves readily to extension and further research.
References
Bow, S. (2002). Pattern recognition and image preprocessing. New York: Marcel Dekker.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. Toronto: John Wiley &
Sons.
Husnain, M., & Naweed, S. (2009). English Letter Classification Using Bayesian Decision
Theory and Feature Extraction Using Principal Component Analysis. European Journal of
Scientific Research, 34, 2nd ser., 196-203.
Joseph, A., Fenton, N., & Neil, M. (2006). Predicting football results using Bayesian nets and
other machine learning techniques. Knowledge-Based Systems, 19(7), 544-553.
doi:10.1016/j.knosys.2006.04.011
Nadler, M. (1993). Pattern recognition engineering. New York: John Wiley & Sons.
Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian approach to filtering junk e-mail. AAAI-98 Workshop on Learning for Text Categorization.
Tellaeche, A., Burgos-Artizzu, X. P., Pajares, G., & Ribeiro, A. (2008). A vision-based method for weeds identification through the Bayesian decision theory. Pattern Recognition, 41(2), 521-530. doi:10.1016/j.patcog.2007.07.007