
PANIMALAR ENGINEERING COLLEGE

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING


21EC1401 - ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING (LAB INTEGRATED)
UNIT V - Machine Learning Techniques
UNIT V - NOTES
Syllabus:

Statistical Learning, background and general method, learning belief networks, nearest
neighbor. Decision-trees, supervised learning of univariate decision trees, network
equivalent of decision trees.

5.1 Statistical Learning (background and general method)

• Consider Y = f(X) + ϵ, where the function f is unknown and ϵ is an error term, a random variable assumed to have zero mean; that is, E[ϵ] = 0.
• Statistical learning is the task of estimating f from given data, so that the dependence of Y on X can be better understood.
• That is, f conveys the systematic information that X provides about Y.
• X is a vector-valued quantity of the form X = (X1, X2, . . . , Xp), whose components are referred to as predictors or independent variables, while Y is also known as the dependent variable, or response.
Example: Suppose the data set consists of sales of a product in 200 different markets, and the
advertising budget for each market in three different media: TV, radio and newspaper. The
three budgets are predictors, and sales is the response.
Estimation of ‘f’:
Estimation of f is useful for both prediction and inference.
Prediction:
If the output Y is not easily obtained but the input X is, then it is desirable to be able to predict
what the output Y will be for a given value of X. Such a prediction has the form Yˆ = fˆ(X)
where fˆ is an estimate of the unknown function f .
Example: Suppose the predictors X1, X2, . . . , Xp are characteristics of a patient’s blood
sample that can easily be measured in a lab, and Y represents the patient’s risk for a severe
reaction to some drug. If a patient’s risk can be assessed before giving them the drug, then it
could be avoided if their risk turns out to be high.
Types of Prediction errors:
• The error in prediction has two components: reducible error and irreducible error.
• The reducible error is a measure of the deviation of f̂ from f, and can be reduced by choosing the right statistical learning technique or tuning it correctly. In some cases, it may be possible to reduce this error to zero and obtain f̂ = f.
• The irreducible error stems from the error term ϵ, which cannot be predicted using X. For example, it may depend on unmeasured variables not among the predictors in X.
For a fixed value of X, our overall error can be decomposed as follows:

E[(Y − Ŷ)²] = E[(f(X) + ϵ − f̂(X))²]

            = (f(X) − f̂(X))² + 2 E[ϵ] (f(X) − f̂(X)) + E[ϵ²]

            = (f(X) − f̂(X))² + Var(ϵ)        (since E[ϵ] = 0 and E[ϵ²] = Var(ϵ))

The first term is the reducible error, and the second term is the irreducible error.
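To make this decomposition concrete, here is a small numerical check. It is only an illustration: the "true" f, the estimate f̂, the noise level and the evaluation point are all made up, and the Monte Carlo estimate of E[(Y − Ŷ)²] is compared against (f(X) − f̂(X))² + Var(ϵ).

import numpy as np

rng = np.random.default_rng(0)

f = lambda x: 2.0 * x + 1.0        # assumed "true" function f (for illustration only)
f_hat = lambda x: 2.1 * x + 0.8    # assumed imperfect estimate f-hat
sigma = 0.5                        # standard deviation of the error term eps
x = 3.0                            # a fixed value of X

eps = rng.normal(0.0, sigma, size=1_000_000)
y = f(x) + eps                     # Y = f(X) + eps
y_hat = f_hat(x)                   # prediction Y-hat = f-hat(X)

mse = np.mean((y - y_hat) ** 2)               # Monte Carlo estimate of E[(Y - Y_hat)^2]
reducible = (f(x) - f_hat(x)) ** 2            # (f(X) - f_hat(X))^2
irreducible = sigma ** 2                      # Var(eps)

print(mse, reducible + irreducible)           # the two values agree closely (~0.26)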

Inference
Prediction is estimating the value of the unknown function f at values of X outside of the
given data. For such a task, f can be treated as a “black box”, meaning that the form of f
does not need to be known.
Inference, on the other hand, is about better understanding the relationship between each
predictor and the response, which means at least approximate knowledge of the form
of f is of importance.
Consider, for example, a linear model in which the response Y (sales) is an affine function
of X1, X2, X3 (advertising budgets), meaning that

Y ≈ m0 + m1·X1 + m2·X2 + m3·X3,

where m0 is the intercept and m1, m2 and m3 are the slopes. Inference here amounts to understanding these coefficients.

Methods of estimating f:
There are many methods for estimating f , but they generally work with a set of
observations, which are predictor-response pairs (xi, yi), i = 1, 2, . . . , n. These
observations are called the training data, as they are used to “train”, or “teach”, our
statistical learning method of choice.
Two categories of methods:
Estimation methods generally fall into two categories: parametric and non-parametric. We now examine each of these categories.

Parametric Methods
In a parametric method, we first assume f has a particular form. For example, we
may assume that f is a linear, or more precisely, affine function of X:
f (X) = β0 + β1X1 + β2X2 + · · · + βpXp

Thus what remains is to estimate the coefficients β0, . . . , βp.


Next, the training data is used to fit, or train, the model. In the case of a linear model,
that means using some algorithm to obtain values for the coefficients β0, . . . , βp such that
Y ≈ β0 + β1X1 + β2X2 + · · · + βpXp
for each predictor-response pair (xi, yi) in the training set.
A parametric method has the advantage of greatly simplifying the task of obtaining an
estimate fˆ of f , because it reduces the task to one of computing certain coefficients.
However, if the chosen form is not a good approximation of the true f, then the estimate will be far
less useful. This can be remedied by choosing a more flexible form for f, such as a
nonlinear function instead of a linear model.
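As an illustrative sketch of the parametric approach, the snippet below fits the affine model above by ordinary least squares with scikit-learn. The advertising-style data is synthetic, and the coefficient values are assumptions made up for the example.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic training data: n observations of p = 3 predictors (e.g. TV, radio, newspaper budgets)
n = 200
X = rng.uniform(0, 100, size=(n, 3))
true_beta = np.array([0.05, 0.19, 0.02])                 # assumed coefficients, illustration only
y = 3.0 + X @ true_beta + rng.normal(0, 1.0, size=n)     # sales = beta0 + beta1*X1 + ... + noise

# "Training" the parametric model = estimating beta0, ..., betap from the training data
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)                     # estimates of beta0 and (beta1, beta2, beta3)

# Prediction: Y_hat = f_hat(X) for a new budget vector
print(model.predict([[100.0, 20.0, 10.0]]))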

Non-parametric Methods
In non-parametric methods, it is not assumed that the function f has a particular
form, at least not over the entire domain of the training data. One example of a non-parametric
method is to construct a thin-plate spline, which is a linear combination of
radial basis functions. Another, for one-dimensional data sets, is a cubic spline, which
is a piecewise cubic polynomial.
The practical difference is that non-parametric methods require a much larger number of
parameters. In fact, the number of parameters is often proportional to n, the
number of observations, which is not the case for parametric methods such as least-squares fitting.
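For comparison, here is a minimal non-parametric sketch: a one-dimensional cubic spline fitted with SciPy to synthetic data (the data and noise level are invented). Note how the number of spline coefficients grows with the number of observations n.

import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(2)

# One-dimensional training data (x must be increasing for CubicSpline)
x = np.linspace(0, 10, 25)
y = np.sin(x) + rng.normal(0, 0.05, size=x.size)   # noisy observations of an unknown f

# No fixed functional form is assumed; the spline is a piecewise cubic polynomial
f_hat = CubicSpline(x, y)

print(f_hat(4.2))            # prediction at a new point
print(f_hat.c.size)          # number of spline coefficients grows with n (4 per interval)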

Regression vs. Classification


Statistical learning works with both quantitative variables, which have numerical values,
and qualitative, or categorical, variables, which do not. For example, temperature is a
quantitative variable, while gender or day of the week is not. When the response is
quantitative, it is often estimated by solving a regression problem, such as least-squares
fitting. When it is qualitative, the problem of estimating the response in terms of its
predictors is called a classification problem.

5.2 Learning belief networks


Bayesian Belief Network

• A Bayesian belief network is a key technique for dealing with probabilistic events and for solving problems that involve uncertainty.
• "A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph."
• It is also called a Bayes network, belief network, decision network, or Bayesian model.
• Bayesian networks are probabilistic because they are built from a probability distribution, and they use probability theory for prediction and anomaly detection.
• Real-world applications are probabilistic in nature, and to represent the relationships between multiple events we need a Bayesian network. It can be used in various tasks including prediction, anomaly detection, diagnostics, automated insight, reasoning, time-series prediction, and decision making under uncertainty.
• A Bayesian network can be used for building models from data and experts' opinions, and it consists of two parts:
  o Directed Acyclic Graph
  o Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:

o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs, or directed arrows, represent the causal relationships or conditional probabilities between random variables. These directed links connect pairs of nodes in the graph. A link indicates that one node directly influences the other; if there is no directed link between two nodes, they are independent of each other.
o For example, suppose the nodes of the graph represent the random variables A, B, C, and D.
o If node B is connected to node A by a directed arrow, then node A is called the parent of node B.
o Node C is independent of node A.

Note: The Bayesian network graph does not contain any cycles. Hence, it is known as
a directed acyclic graph, or DAG.

The Bayesian network has mainly two components:

o Causal Component
o Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)),
which quantifies the effect of its parents on that node.

A Bayesian network is based on the joint probability distribution and conditional probability, so let's
first understand the joint probability distribution:

Joint probability distribution:


If we have variables x1, x2, x3, ..., xn, then the probability of every combination of values of
x1, x2, ..., xn is given by the joint probability distribution.

P[x1, x2, x3, ..., xn] can be written in the following way using the chain rule:

= P[x1 | x2, x3, ..., xn] · P[x2, x3, ..., xn]

= P[x1 | x2, x3, ..., xn] · P[x2 | x3, ..., xn] · ... · P[xn-1 | xn] · P[xn]

In a Bayesian network, each factor simplifies: for each variable Xi we can write

P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))


Explanation of Bayesian network:

Let's understand the Bayesian network through an example by creating a directed acyclic graph:

Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm reliably
responds to a burglary but also responds to minor earthquakes. Harry has two
neighbors, David and Sophia, who have taken the responsibility of informing Harry at work when
they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets
confused with the phone ringing and calls then too. On the other hand, Sophia likes to
listen to loud music, so sometimes she misses the alarm. Here we would like to compute
the probability that the burglar alarm sounds.

Problem:

Calculate the probability that the alarm has sounded, but neither a burglary nor an
earthquake has occurred, and both David and Sophia have called Harry.

Solution:

o In the Bayesian network for the above problem, Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, while David's and Sophia's calls depend only on the Alarm node.
o The network thus encodes our assumptions that the neighbors do not directly perceive the burglary, do not notice the minor earthquake, and do not confer with each other before calling.
o The conditional distribution for each node is given as a conditional probability table, or CPT.
o Each row in a CPT must sum to 1, because the entries in the row represent an exhaustive set of cases for the variable.
o In a CPT, a boolean variable with k boolean parents contains 2^k probabilities. Hence, if there are two parents, the CPT will contain 4 probability values.

List of all events occurring in this network:

o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)

We can write the events of the problem statement in the form of the probability P[D, S, A, B, E], and
rewrite it using the joint probability distribution:

P[D, S, A, B, E] = P[D | S, A, B, E] · P[S, A, B, E]

= P[D | S, A, B, E] · P[S | A, B, E] · P[A, B, E]

= P[D | A] · P[S | A, B, E] · P[A, B, E]

= P[D | A] · P[S | A] · P[A | B, E] · P[B, E]

= P[D | A] · P[S | A] · P[A | B, E] · P[B | E] · P[E]

Since Burglary and Earthquake are independent, P[B | E] = P[B].


Let's take the prior probabilities for the Burglary and Earthquake components:

P(B = True) = 0.002, which is the probability of a burglary.

P(B = False) = 0.998, which is the probability of no burglary.

P(E = True) = 0.001, which is the probability of a minor earthquake.

P(E = False) = 0.999, which is the probability that no earthquake occurred.

We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm (A):

The conditional probability of Alarm A depends on Burglary and Earthquake:

B        E        P(A = True)    P(A = False)
True     True     0.94           0.06
True     False    0.95           0.05
False    True     0.31           0.69
False    False    0.001          0.999

Conditional probability table for David Calls (D):

The conditional probability that David calls depends on the state of the Alarm:

A        P(D = True)    P(D = False)
True     0.91           0.09
False    0.05           0.95

Conditional probability table for Sophia Calls (S):

The conditional probability that Sophia calls depends on its parent node, Alarm:

A        P(S = True)    P(S = False)
True     0.75           0.25
False    0.02           0.98

From the formula of joint distribution, we can write the problem statement in the form of
probability distribution:

P(S, D, A, ¬B, ¬E) = P(S|A) · P(D|A) · P(A|¬B, ¬E) · P(¬B) · P(¬E)

= 0.75 × 0.91 × 0.001 × 0.998 × 0.999

= 0.00068045.

Hence, a Bayesian network can answer any query about the domain by using the joint
distribution.
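To make the calculation concrete, the short sketch below stores the CPT values from the tables above and evaluates P(S, D, A, ¬B, ¬E) using the factored form P(D|A)·P(S|A)·P(A|B,E)·P(B)·P(E). This is only an illustrative hand-rolled computation, not a general inference engine.

# CPTs from the tables above (probability of True for each parent configuration)
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}
P_A = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}   # P(A=True | B, E)
P_D = {True: 0.91, False: 0.05}                      # P(D=True | A)
P_S = {True: 0.75, False: 0.02}                      # P(S=True | A)

def joint(d, s, a, b, e):
    """P(D=d, S=s, A=a, B=b, E=e) using the chain-rule factorisation of the network."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_d = P_D[a] if d else 1 - P_D[a]
    p_s = P_S[a] if s else 1 - P_S[a]
    return p_d * p_s * p_a * P_B[b] * P_E[e]

# P(S, D, A, not B, not E)
print(joint(d=True, s=True, a=True, b=False, e=False))   # ~0.00068045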

The semantics of Bayesian Network:

There are two ways to understand the semantics of a Bayesian network:

1. As a representation of the joint probability distribution. This view is helpful for understanding how to construct the network.

2. As an encoding of a collection of conditional independence statements. This view is helpful for designing inference procedures.

5.3 Nearest Neighbor Algorithm


K-Nearest Neighbor (KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data it classifies that data into the category most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, since it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to those of the cat and dog images, and based on the most similar features it will place it in either the cat or the dog category.

Why do we need the K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data
point x1: which of these categories does it belong to? To solve this type of problem,
we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class
of a particular data point.
Working of K-NN Algorithm:

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to the points in the training data.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distances.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category:

o Firstly, we choose the number of neighbors, say K = 5.
o Next, we calculate the Euclidean distance between the new point and the existing data points. The Euclidean distance is the ordinary distance between two points from geometry; in two dimensions it can be calculated as d = √((x2 − x1)² + (y2 − y1)²).
o By calculating the Euclidean distances we find the nearest neighbors: say three nearest neighbors in Category A and two nearest neighbors in Category B.
o Since the majority (3 of the 5) of nearest neighbors are from Category A, the new data point is assigned to Category A.
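A minimal sketch of these steps using scikit-learn follows; the two-dimensional points and class labels are invented purely for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic training set: points from "Category A" (0) and "Category B" (1)
X_train = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Step 1: choose K; Steps 2-5 (distances, neighbours, majority vote) are handled by the classifier
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

x_new = np.array([[3, 4]])           # the new data point to classify
print(knn.predict(x_new))            # majority class among the 5 nearest neighbours
print(knn.kneighbors(x_new))         # distances to and indices of those neighbours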

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try several values to find the best among them. The most commonly preferred value for K is 5.
o A very low value for K, such as K = 1 or K = 2, can be noisy and make the model sensitive to outliers.
o Larger values for K reduce the effect of noise, but they can smooth over genuine class boundaries and increase the amount of computation.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o The value of K always needs to be determined, which may be complex at times.
o The computation cost is high, because the distance to every training sample must be calculated for each new data point.

5.4 Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
o In a decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make decisions and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits
into subtrees.
o Below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking while making a decision, so they are
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies


Root Node: The root node is where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further after
a leaf node.
Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub-tree: A subtree formed by splitting a node of the tree.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child node: A node that is split into sub-nodes is called the parent of those
sub-nodes, and the sub-nodes are called its child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes
and moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; each such final node is called a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or not. To solve this problem, the decision tree starts with the root
node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next decision
node further splits into one decision node (cab facility) and one leaf node. Finally, that
decision node splits into two leaf nodes (Accepted offer and Declined offer).
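A minimal sketch of this procedure using scikit-learn's CART implementation is shown below; the job-offer data (salary, distance, cab facility) and labels are invented to mirror the example.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented job-offer data: [salary_in_lakhs, distance_km, cab_facility (0/1)]
X = np.array([[12, 5, 1], [12, 25, 1], [12, 25, 0],
              [6, 5, 1], [6, 25, 0], [15, 30, 1]])
y = np.array([1, 1, 0, 0, 0, 1])     # 1 = offer accepted, 0 = offer declined

# CART chooses the best attribute at each node using an attribute selection measure (Gini here)
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

print(export_text(tree, feature_names=["salary", "distance", "cab_facility"]))
print(tree.predict([[12, 10, 1]]))   # classify a new candidate's offer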
Attribute Selection Measures

While implementing a decision tree, the main issue that arises is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure, or ASM. Using this measure, we can easily select the
best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]


Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes)·log2 P(yes) − P(no)·log2 P(no)

Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no
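As an illustrative helper (the class counts used are arbitrary examples), entropy and information gain for a candidate split can be computed as:

import math

def entropy(counts):
    """Entropy(S) = -sum p_i * log2(p_i) over the class proportions in a node."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Entropy(S) minus the weighted average entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# Example: a parent node with 9 "yes" and 5 "no", split into two children
print(entropy([9, 5]))                                  # ~0.940
print(information_gain([9, 5], [[6, 2], [3, 3]]))       # gain of this candidate split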

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − Σj (Pj)²
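A corresponding minimal helper for the Gini index, again with made-up class counts:

def gini_index(counts):
    """Gini index = 1 - sum_j p_j^2 over the class proportions in a node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini_index([9, 5]))    # mixed node -> higher impurity (~0.459)
print(gini_index([14, 0]))   # pure node -> impurity 0.0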


Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A tree that is too large increases the risk of overfitting, while a small tree may not capture all the
important features of the dataset. A technique that decreases the size of the learned tree
without reducing accuracy is therefore needed; this is known as pruning. There are mainly two types of
tree pruning technology used:

o Cost Complexity Pruning


o Reduced Error Pruning.

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process a human follows while
making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less need for data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o As the number of class labels grows, the computational complexity of the decision tree may increase.

Splitting:

Node splitting, or simply splitting, divides a node into multiple sub-nodes to create relatively

pure nodes. This is done by finding the best split for a node and can be done in multiple ways.

The ways of splitting a node can be broadly divided into two categories based on the type of

target variable:

1. Continuous Target Variable: Reduction in Variance

2. Categorical Target Variable: Gini Impurity, Information Gain, and Chi-Square

Reduction in Variance in Decision Tree

Reduction in Variance is a method for splitting the node used when the target variable is

continuous, i.e., regression problems. It is called so because it uses variance as a measure for

deciding the feature on which a node is split into child nodes.

Variance is used for calculating the homogeneity of a node. If a node is entirely homogeneous,

then the variance is zero.

Here are the steps to split a decision tree using the reduction in variance method:

1. For each split, individually calculate the variance of each child node
2. Calculate the variance of each split as the weighted average variance of child nodes

3. Select the split with the lowest variance

4. Perform steps 1-3 until completely homogeneous nodes are achieved
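A small sketch of these steps for a single regression split, with made-up target values, is given below; the split with the lower weighted variance would be chosen.

import numpy as np

def split_variance(children):
    """Weighted average variance of the child nodes produced by a split."""
    n = sum(len(c) for c in children)
    return sum(len(c) / n * np.var(c) for c in children)

# Continuous target values falling into the two children of two candidate splits
split_a = [np.array([10.0, 11.0, 9.0]), np.array([30.0, 32.0, 31.0])]
split_b = [np.array([10.0, 30.0, 9.0]), np.array([11.0, 32.0, 31.0])]

# Step 3: choose the split with the lowest weighted variance (split_a here)
print(split_variance(split_a), split_variance(split_b))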

Information Gain:

Information Gain in Decision Tree

Now, what if we have a categorical target variable? For categorical variables, a reduction in
variance won't quite cut it. The answer is Information Gain. The Information Gain
method is used for splitting the nodes when the target variable is categorical. It works on the
concept of entropy and is given by:

Information Gain = Entropy(parent) − (weighted average) × Entropy(children)

Entropy is used for calculating the purity of a node. The lower the value of entropy, the higher
the purity of the node; the entropy of a homogeneous node is zero. Since the weighted entropy of
the children is subtracted from the entropy of the parent, the Information Gain is higher for splits
that produce purer nodes. The entropy of a node is calculated as:

Entropy = −Σi pi · log2(pi), where pi is the proportion of class i in the node.

Steps to split a decision tree using Information Gain:

1. For each split, individually calculate the entropy of each child node

2. Calculate the entropy of each split as the weighted average entropy of child nodes

3. Select the split with the lowest entropy or highest information gain

4. Until you achieve homogeneous nodes, repeat steps 1-3

Gini Impurity in Decision Tree


Gini Impurity is a method for splitting the nodes when the target variable is categorical. It is the
most popular and easiest way to split a decision tree. The Gini Impurity value is defined below.

Wait – what is Gini?

Gini is the probability of correctly labeling a randomly chosen element if it is randomly labeled
according to the distribution of labels in the node. The formula for Gini is:

Gini = Σi pi², where pi is the proportion of class i in the node.

And Gini Impurity is:

Gini Impurity = 1 − Σi pi²

The lower the Gini Impurity, the higher the homogeneity of the node. The Gini Impurity of a

pure node is zero. Now, you might be thinking: we already know about Information Gain, so why
do we need Gini Impurity?

Gini Impurity is often preferred to Information Gain because it does not involve logarithms, which
are computationally intensive.

Here are the steps to split a decision tree using Gini Impurity:

1. Similar to what we did for information gain: for each split, individually calculate the Gini
Impurity of each child node

2. Calculate the Gini Impurity of each split as the weighted average Gini Impurity of child

nodes

3. Select the split with the lowest value of Gini Impurity


4. Until you achieve homogeneous nodes, repeat steps 1-3

Types of Splits:

A split can be univariate or multivariate. A split is called univariate if it uses only a single
variable; otherwise it is multivariate.

Example:

“Petal.Width < 1.75” is univariate, while

“Petal.Width < 1.75 and Petal.Length < 4.95” is bivariate (a multivariate split).

Regularization:

Regularization Methods
There are several simple regularization methods:
Minimum number of points per cell: require that each cell (i.e., each leaf node) covers a given
minimum number of training points.
Maximum number of cells: limit the maximum number of cells of the partition (i.e., the number of
leaf nodes).
Maximum depth: limit the maximum depth of the tree.
The number of points per cell, the number of cells, the depth, etc. can be seen as hyperparameters
of the decision tree learning method.
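These regularization options correspond directly to hyperparameters of a typical decision-tree implementation; the sketch below shows them in scikit-learn on the Iris dataset, with arbitrarily chosen values.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    min_samples_leaf=5,     # minimum number of training points per cell (leaf)
    max_leaf_nodes=8,       # maximum number of cells (leaf nodes) of the partition
    max_depth=3,            # maximum depth of the tree
    random_state=0,
).fit(X, y)

print(tree.get_depth(), tree.get_n_leaves())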
Example:
Consider a decision network for a fire alarm (figure not reproduced in these notes). The agent can
receive a report of people leaving a building and has to decide whether or not to call the fire
department. Before calling, the agent can check for smoke, but this has some cost associated with
it. The utility depends on whether it calls, whether there is a fire, and the cost associated with
checking for smoke.
In this sequential decision problem, there are two decisions to be made. First, the agent
must decide whether to check for smoke. The information that will be available when it makes
this decision is whether there is a report of people leaving the building. Second, the agent must
decide whether or not to call the fire department. When making this decision, the agent will
know whether there was a report, whether it checked for smoke, and whether it can see smoke.
Assume that all of the variables are binary.
The information necessary for the decision network includes the conditional probabilities
of the belief network and the utility table for the fire alarm decision network (table not reproduced
in these notes).

This utility function expresses the cost structure: calling has a cost of 200, checking has a cost
of 20, and not calling when there is a fire has a cost of 5000. The utility is the negative of the cost.
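The cost structure described above can be expressed as a small utility function. This is only a sketch of the three costs quoted (calling = 200, checking = 20, fire with no call = 5000), not the full utility table of the decision network.

def utility(check_smoke: bool, call: bool, fire: bool) -> float:
    """Utility = -(cost); costs taken from the description above."""
    cost = 0
    if call:
        cost += 200            # calling the fire department costs 200
    if check_smoke:
        cost += 20             # checking for smoke costs 20
    if fire and not call:
        cost += 5000           # not calling when there is a fire costs 5000
    return -cost

print(utility(check_smoke=True, call=False, fire=True))   # -5020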
