Comp3308 Cheatsheet-4
BFS: level-by-level expansion. complete: yes. optimal: yes (if step costs are uniform). time/space: $O(b^d)$.
DFS: depth first. complete: no (if the search space is infinite). optimal: no. time: $O(b^m)$, space: $O(bm)$.
UCS: expands the least-cost node first (lowest $g(n)$). complete: yes (if step costs are positive). optimal: yes. time: $O(b^{\lceil C^*/\epsilon \rceil})$. similar to BFS, but the fringe is sorted by $g(n)$.
DLS: DFS with a depth limit $l$. complete: no. optimal: no. time: $O(b^l)$, space: $O(bl)$.
IDS: DLS with increasing depth limits. complete: yes (if the branching factor is finite). optimal: yes (uniform step costs). time: $O(b^d)$, space: $O(bd)$.
($b$ = branching factor, $d$ = depth of the shallowest solution, $m$ = maximum depth of the search tree.)

Informed Searches
Greedy Search: expands the node with the lowest $h(n)$. complete: yes (if $m$ is finite). optimal: no. time/space: $O(b^m)$.
A* search: expands the node with the lowest $f(n)$, where $f(n) = g(n) + h(n)$.
admissible heuristic: $h(n) \le h^*(n)$, where $h^*(n)$ is the true cost to the goal.
complete/optimal: yes, if $h(n)$ is admissible.
if $f(n) = g(n)$, A* = UCS. if $f(n) = depth(n)$, A* = BFS.
if $h(n)$ is consistent, then $h(n) \le c(n, n') + h(n')$ for all nodes $n'$ that are successors of $n$, where $c(n, n')$ is the step cost. consistent heuristics are also admissible.
dominant heuristic: if $h_1(n) \ge h_2(n)$ for all nodes $n$, then $h_1$ is dominant over $h_2$.

Local Search Algorithms
Hill Climbing: expands the node with the best $v(n)$, but only considers successors of the current node. the solution found depends on the initial node. complete/optimal: no (local optima, plateaus, ridges).
Beam Search: keeps the $k$ best nodes (best $v(n)$) at each level. complete/optimal: no.
Simulated Annealing: randomly selects a successor $m$ of the current node $n$. if $v(m)$ is better than $v(n)$, then $m$ is selected ($n = m$), else $n = m$ with probability $p$, where
$$p = e^{-\frac{v(m) - v(n)}{T}}$$
bad moves are those with $v(m) > v(n)$ if looking for a minimum.
$T$ decreases over time, starting from a high value.
Theorem: if $T$ starts high enough and is lowered slowly enough, the algorithm will eventually find the global optimum, i.e. it is complete/optimal.
Genetic Algorithms: select the best individuals, crossover, mutate.
the best individuals are selected based on a fitness function.
crossover combines two individuals by swapping their parts about a point to create a new individual.
mutation is a random change of bits in an individual.
complete/optimal: no.

Games
deterministic vs chance: e.g. games without dice vs games with dice.
perfect vs imperfect information: e.g. chess vs poker.
zero-sum vs non-zero-sum: one's gain is another's loss vs both can win.
search forward, not from the goal: there are too many possible goals, and we don't learn anything from an incomplete search from the goal.
Minimax: used for two-player games, where one player tries to maximize their score and the other tries to minimize it. perform DFS to the terminal nodes of the game tree, then backtrack to the root. at max: pick the maximum value of the children; at min: pick the minimum value of the children. complete/optimal: yes. same time/space complexity as DFS, time $O(b^d)$.
Alpha-beta pruning: pruning branches that cannot possibly influence the final decision. traverse the tree in DFS order, apply the utility function to terminal nodes, and compute values backwards. at each non-leaf node store the value indicating the best move for the player to play, i.e. pick the best move for the current player: if max, choose the maximum value of the children; if min, choose the minimum value of the children. if the value of a child is worse than what the parent can already achieve, prune it.
Imperfect decisions: minimax and alpha-beta pruning take too long for large trees, so we cut off the search at a certain depth and evaluate the nodes using a heuristic function. however, with a cutoff depth we may cut off just before a losing move, so we do a secondary search past the cutoff point to ensure that there are no hidden pitfalls, i.e. we do not want to stop at non-quiescent moves (moves that lead to a large change in advantage).
Expectiminimax: used for games with chance, where the outcome of a move is not deterministic. the value of the parent node is the expected value of the children nodes, weighted by the probability of each child node, i.e.
$$v(n) = \sum_{i=1}^{k} p_i \, v(n_i)$$
where $p_i$ is the probability of child node $n_i$ and $v(n_i)$ is the value of child node $n_i$.

Supervised Learning
Nearest Neighbour: example of a lazy classifier. it stores all training examples but does not build a classifier until an unlabelled example is given. commonly used distance metrics are:
euclidean:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
manhattan:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
minkowski:
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$$
need to normalise the data before using distance metrics, otherwise features with larger ranges will dominate the distance metric.
k-NN: majority vote among the $k$ nearest neighbours. very sensitive to $k$; generally we use $k = \sqrt{n}$, where $n$ is the number of training examples. can be used for classification and regression.
distance for nominal features is 0 if the feature is the same, 1 otherwise. for missing values or different values the distance is 1; if the two attributes are the same and not missing then 0.
NN is good with lower dimensions, but suffers from the curse of dimensionality: as the number of dimensions increases, the distance between points becomes less meaningful, and the nearest neighbours may not be representative of the data, so we can select a subset of features to reduce the dimensionality of the data. sensitive to noise. makes predictions based on the local structure of the data, so it is a local model, not a global model.
1-Rule: a simple classifier that uses only one feature to classify the data. it finds the feature that gives the best classification accuracy, i.e. fewest errors, and uses that feature to classify the data. missing values are treated as a separate value; numeric features are discretized. this may lead to overfitting, so we impose a minimum number of examples per rule and merge adjacent rules with the same class.
Naive Bayes: probabilistic classifier based on Bayes' theorem.
$$P(H|E) = \frac{P(E|H) P(H)}{P(E)}$$
where $H$ is the hypothesis and $E$ is the evidence. assumes that the features are equally important and independent of each other.
to calculate $P(E|H)$ we use the conditional independence assumption,
$$P(E|H) = \prod_{i=1}^{n} P(E_i|H)$$
where $E_i$ are the features of the evidence, and $n$ is the number of features.
Laplace correction is used to avoid zero probabilities: add 1 to the numerator and $k$ to the denominator, where $k$ is the number of possible values for the feature.
$$P(E_i|H) = \frac{N_{E_i,H} + 1}{N_H + k}$$
where $N_{E_i,H}$ is the number of examples with feature $E_i$ and class $H$, and $N_H$ is the number of examples with class $H$.
Missing attribute values are ignored. e.g. for outlook=?, temperature=cool, humidity=high, windy=true: ignore outlook, use the rest to classify.
Handling numeric attributes: we can use a PDF to model the distribution of the data.
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.
advantages: simple, clear, fast: $O(pk)$, where $p$ = # of training examples, $k$ = # of attributes. robust to noise, works well with small datasets.
correlation reduces the performance of Naive Bayes, as it violates the independence assumption, so disregard correlated features.
numeric attributes aren't always normally distributed, so we can: discretize to nominal; use alternative PDFs (Poisson, binomial, gamma); apply normalizing transformations; use kernel density estimation.

Evaluating Classifiers
holdout validation: split the data into training and test sets, train on the training set, and evaluate on the test set.
accuracy = 1 - error rate
alternatively, split the data into training, validation, and test sets: train on the training set, evaluate on the validation set, and test on the test set. holdout validation becomes more reliable by repeating the process multiple times and averaging the results; this is the idea behind cross-validation.
Stratification: ensures that each class is represented in the training and test sets, i.e. the proportion of each class in the training and test sets is the same as in the original dataset.
Comparing Classifiers: calculate the difference in accuracy between the classifiers, calculate the standard deviation of the difference, calculate the confidence interval for the difference. if the confidence interval includes 0 the difference is insignificant, else significant.
$$\sigma = \sqrt{\frac{\sum_{i=1}^{k} (d_i - d_{mean})^2}{k - 1}}$$
$$Z = d_{mean} \pm t_{(1-\alpha)(k-1)} \frac{\sigma}{\sqrt{k}}$$
$Z$ is the confidence interval for the difference in accuracy (generally we use a 95% confidence level), $k$ is the number of folds, $t$ is obtained from the t-distribution table, $(1 - \alpha)$ is the confidence level, and $(k - 1)$ is the degrees of freedom.

Confusion Matrix
|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
$$P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN} \qquad F = \frac{2PR}{P + R}$$
$$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
(P): proportion of true positives among all predicted positives. (left col accuracy)
(R): proportion of true positives among all actual positives. (top row accuracy)
(F1): harmonic mean of precision and recall, used to balance the two metrics.
Inductive learning: produce a general rule from specific examples.

Decision Trees
top-down approach: recursively split the data into subsets based on the features, until a stopping criterion is met.
Select the feature that maximizes the information gain, i.e. the feature that best separates the data into subsets.
Split the data into subsets based on the selected feature, and repeat the process for each subset (recurse on each subset).
Stop when all examples in the subset have the same class, or no further splits are possible (e.g. no features left, or all features are the same).
each root-to-leaf path represents a rule, i.e. a conjunction of tests $(x \wedge y)$, and the entire tree is a disjunction of the rules, i.e. $(X_1 \vee X_2 \vee X_3 \vee \ldots \vee X_i)$.
Entropy is a measure of the impurity of a set of examples,
$$H(S) = -\sum_{i=1}^{c} P_i \log_2(P_i)$$
where $P_i$ is the proportion of examples in class $i$ in the set $S$, and $c$ is the number of classes.
Information gain is the reduction in entropy after splitting the data on a feature.
$$IG(S|A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)$$
where $H(S)$ is the entropy of the set $S$, $S_v$ is the subset of examples in $S$ that have value $v$ for feature $A$, and $values(A)$ is the set of all possible values for feature $A$.
basically it is the entropy of the class minus, for each value of the feature, the proportion of examples with that value times the entropy of those examples.
DT Pruning:
pre-pruning: stop growing the tree before it is fully grown.
post-pruning: grow the tree fully, then prune.
post-pruning is more successful. sub-tree replacement: replace a subtree with the majority class of the subtree's examples. sub-tree raising: raise a subtree to the parent node if it is more accurate than the parent node, i.e. typically the subtree holding the most examples.
numeric attributes are handled by splitting the data into ranges, e.g. $x < 5$ or $x \ge 5$.

Neural Networks
neurons are connected in layers; each neuron has weights on its inputs and a bias. weights and biases are initialised randomly. each neuron performs the computation
$$a_i = f(w_i \cdot p + b_i)$$
Perceptron: a single-layer neural network. uses the step function:
$$a = f(w \cdot p + b)$$
where $f(x) = 1$ if $x \ge 0$, else $f(x) = 0$. the perceptron can only learn linearly separable functions. it is a binary classifier, i.e. it can only classify examples into two classes.
Perceptrons learn by:
- initialising weights and biases randomly
- feeding examples into the perceptron one at a time, updating after each, until a whole epoch is classified correctly.
$$e = t - a$$
$$w_{new} = w_{old} + e p^T$$
$$b_{new} = b_{old} + e$$
where $e$ is the error, $t$ is the true label, $a$ is the predicted label, and $p$ is the input vector.
Multi-Layer Perceptron: instead of a single layer we have multiple layers of neurons. neurons only receive input from previous layers, and all neurons in a layer are connected to all neurons in the next layer. rule of thumb: hidden layer size is the average of the input and output layer sizes. use one-hot vector encoding for the output layer, i.e. if there are $k$ classes, then the output layer has $k$ neurons and each neuron represents a class.
sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$, where $f'(x) = f(x)(1 - f(x))$
tanh: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, where $f'(x) = 1 - f(x)^2$

Backpropagation
backpropagation is used to train multi-layer perceptrons. it is a supervised learning algorithm that uses gradient descent to minimize the error of the network. the error is calculated from the difference between the true label and the predicted label.
each non-input neuron computes
$$net_j = \sum_{i=1}^{m} w_{ij} a_i + b_j \qquad o_j = f(net_j)$$
where $f$ is a differentiable activation.
output layer error: $\delta_j = f'(net_j)(d_j - o_j)$
hidden layer error: $\delta_j = f'(net_j) \sum_i w_{ji} \delta_i$
weight change: $\Delta w_{ij} = \eta \delta_j a_i$
$$E = \frac{1}{2} \sum_{i=1}^{n} (d_i - a_i)^2$$
backpropagation algorithm:
init: random weights and biases.
repeat:
forward pass: calculate the output of the network.
backward pass: calculate the error of the output layer and propagate it back to the hidden layers.
update weights and biases using the error and the learning rate $\eta$.
until the error falls below the threshold.

Deep Learning
deep learning networks have more than one hidden layer and automatically learn features from the data. we use multiple layers because they can learn more complex features than a single layer.
Autoencoder NN: used for pre-training, dimensionality reduction, compression, encryption. we train a first autoencoder to learn a compressed representation of the data, then we train a second autoencoder to learn a compressed representation of the first autoencoder's output. after pretraining, we add a supervised layer on top of the autoencoders, and train the entire network using backpropagation. other types of autoencoders include: sparse (more hidden neurons than input neurons) and denoising (add noise so the NN learns robust features).
Convolutional NN: process raw pixels, handle distortions, and learn hierarchical features. each neuron is connected to a small region of the input image, called a receptive field. a filter (kernel) is applied across the input image, which reduces the dimensionality of the input image. pooling is used to reduce the dimensionality of the output of the convolutional layer, e.g. max pooling, taking the maximum value of a region of the output image. LCN is used to normalise the output of the pooling layer: it subtracts the mean and divides by the standard deviation of the output.
$$LCN(x) = \frac{x - \mu}{\sigma}$$
LCN allows for brightness invariance.

SVM
in the linear case SVM finds the hyperplane with the maximum margin that separates the data into two classes. the margin is the distance between the hyperplane and the closest data points from each class, called support vectors. the hyperplane is
$$w^T x + b = 0$$
and $sign(w^T x + b)$ is the class label.
SVM chooses the hyperplane that maximises the margin, by minimising $\frac{1}{2} ||w||^2$ subject to the linear constraint:
$$y_i (w^T x_i + b) \ge 1$$
The optimization problem can be solved using Lagrange multipliers, leading to the dual problem:
$$\max \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i,j=1}^{n} \lambda_i \lambda_j y_i y_j \, x_i \cdot x_j$$
the optimal decision boundary is given by:
$$w = \sum_{i=1}^{n} \lambda_i y_i x_i$$
where $\lambda_i$ are the Lagrange multipliers, $y_i$ are the class labels, and $x_i$ are the support vectors.
we can allow misclassification when the problem is not linearly separable: add an additional parameter $C$ to the optimization problem to control the trade-off between margin and misclassification.
usually we use the kernel trick to map the data into a higher-dimensional space, where it is linearly separable. The optimization problem becomes:
$$\max \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i,j=1}^{n} \lambda_i \lambda_j y_i y_j K(x_i, x_j)$$
$$w = \sum_{i=1}^{n} \lambda_i y_i \Phi(x_i)$$
where $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ is the kernel function and $\Phi$ maps the input data into a higher-dimensional space. common kernel functions include:
polynomial kernel: $K(x_i, x_j) = (x_i \cdot x_j + c)^d$
RBF kernel: $K(x_i, x_j) = e^{-\frac{||x_i - x_j||^2}{2\sigma^2}}$
tanh kernel: $K(x_i, x_j) = \tanh(\alpha (x_i \cdot x_j) + c)$
classification becomes
$$f(x) = sign\left( \sum_{i=1}^{n} \lambda_i y_i K(x_i, x) + b \right)$$

Ensemble
Bagging: bootstrap aggregating. splits the training data into subsets by sampling with replacement, trains a classifier on each subset, and combines the predictions of the classifiers by averaging or voting.
Boosting: train models in sequence; increase the weights of misclassified examples and decrease the weights of correctly classified ones. future models use the new weights to classify the data. in the final ensemble, each model is weighted by its accuracy.
AdaBoost: works with weak learners, i.e. accuracy only slightly better than 50%. AdaBoost increases the weights of the misclassified examples and decreases the weights of the correctly classified examples.
Random Forest: an ensemble of decision trees; training data is obtained by bagging. each tree is trained on a random subset of the features, and the final prediction is made by averaging or voting over the predictions of the trees. each tree is built the way a normal decision tree is built, with no pruning.

Bayesian Networks
these are used to represent the probabilistic relationships between variables.
probability chain rule:
$$P(A, B, C) = P(A|B, C) P(B|C) P(C)$$
computing the joint probability in a BN:
$$P(x_1, \ldots, x_n) = \prod_{i} P(x_i | Parents(i))$$
Full-Joint Enumeration: sum over all hidden variables:
$$P(X|e) = \alpha \sum_{Y, Z, \ldots} P(X, e, Y, Z, \ldots)$$
Variable elimination: factorise the joint into conditional probability table (CPT) factors, eliminate hidden vars one at a time by multiplying the relevant factors and summing out the variable to produce a new factor. Multiply the remaining factors and normalise.
Diagrams: note chance = oval, decision = square, utility = diamond. To solve, roll back the network starting at the leaves, computing the expected utility at each decision node given the parents. Then select the action maximising $EU$:
$$EU(a) = \sum_{s} P(s | parents) \cdot U(s, a)$$

Clustering/Unsupervised Learning
centroid is the mean of the cluster. medoid is the most centrally located actual point of the cluster.
single link: smallest distance between elements of two clusters. complete link: largest. average link: average.
good clustering has high intra-cluster similarity and low inter-cluster similarity. measured with the Davies-Bouldin index (lower is better),
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j, j \ne i} \frac{dist(x, c_i) + dist(x, c_j)}{dist(c_i, c_j)}$$
$k$ is the number of clusters, $c_i$ is the centroid of cluster $i$, $dist(x, c_i)$ is the average distance between the points of cluster $i$ and its centroid, and $dist(c_i, c_j)$ is the distance between the centroids of clusters $i$ and $j$.
K-Means Clustering: we have $k$ centroids. assign each point to its closest centroid, then recalculate the centroids as the mean of the points assigned to them. repeat until the centroids do not change or a maximum number of iterations is reached.
Nearest Neighbour Clustering: start with one point as a cluster, then for each new point merge it with an existing cluster if the distance between the point and the cluster is within a threshold $t$; otherwise start a new cluster.
Hierarchical Clustering: builds a dendrogram, a tree-like structure that represents the clustering of the data. agglomerative: start with each point as a cluster, and merge clusters until a stopping criterion is met. divisive: start with all points in a single cluster, and split clusters until a stopping criterion is met.

Algorithm Examples
[Example graph, each node labelled with its heuristic value: A(4), B(2), C(4), D(1), E(2), F(0).]
For Simulated Annealing, with an initial temperature $T_0 = 3.0$ and cooling rate $\alpha = 0.7$ per iteration, at state $s$ with heuristic $h(s)$, we pick a random neighbour $s'$ and compute $\Delta = h(s') - h(s)$. If $\Delta \le 0$, always move to $s'$, else move with probability $\exp(-\Delta/T)$. Update $T \leftarrow \alpha T$. Repeat until goal.
Iteration 0 - Current node A, $h(A) = 4$, $T = 3.0$, random neighbour (in this case $s' = C$) $h(s') = 4$, $\Delta = 0$, accept? Yes, as equal/improve ($e^{-0/3} = 1$).
Iteration 1 - Current node C, $h(s) = 4$, $T = 2.1$, random neighbour $h(s') = h(A) = 4$, $\Delta = 0$, hence accept ($e^{-\Delta/2.1} = 1$). Continue until goal.
For Beam Search at $k = 2$, keep the 2 best nodes per depth according to $h(n)$. On the graph above:
• Step 0: beam {A(4)}.
• Step 1: expand A → {B(2), C(4)} → sorted by $h$: [B, C] → keep {B, C}.
• Step 2: expand B, C → {D(1), E(2), D(1)} → candidates {D, D, E} → sorted [D, D, E] → keep {D, E}.
• Step 3: expand D, E → {F(0), F(0)} → keep {F} (goal).
Resulting path is A → B → D → F.

| Outlook  | Temp | Humidity | Windy | Play? |
|----------|------|----------|-------|-------|
| Sunny    | Hot  | High     | False | No    |
| Sunny    | Hot  | High     | True  | No    |
| Overcast | Hot  | High     | False | Yes   |
| Rain     | Mild | High     | False | Yes   |
| Rain     | Cool | Normal   | False | Yes   |

For Naive Bayes, given the instance (Sunny, Cool, High, False) we first calculate the priors ($P(Yes) = \frac{3}{5}$, $P(No) = \frac{2}{5}$). Given the $k$-values per attribute (3 for outlook, 3 for temp, 2 for humidity, 2 for windy), we can use Laplace smoothing. Given the formula:
$$P(a = v \mid C) = \frac{count(a = v, C) + 1}{N_C + k}$$
where $N_C$ = # of examples in class $C$, we can calculate the likelihoods for both classes. For $P(\cdot \mid Yes)$ where $Outlook = Sunny$, $(0 + 1)/(3 + 3) = 1/6$; where $Temp = Cool$, $(1 + 1)/(3 + 3) = 2/6$; etc. for both Yes and No. Then, the product of the prior and all attribute likelihoods of the instance is the final score for each class. If the score for No is larger than the score for Yes, predict No, and vice versa.
For a Decision Tree, we first calculate the root entropy, that is $p_{Yes} = 3/5$ and $p_{No} = 2/5$, to find $H(S) = -3/5 \log_2 3/5 - 2/5 \log_2 2/5 \approx 0.971$. We then calculate the information gain for each attribute. For Outlook, it is very simple, as each value is pure (Sunny has both No, Rain has both Yes, Overcast has one Yes), hence the weighted entropy is 0 and the gain is $0.971 - 0 = 0.971$. However, for Windy, we find $p_{Yes \mid False} = 3/4$ as Windy = False has one No and three Yes. This means $H(Windy = False) = -1/4 \log_2 1/4 - 3/4 \log_2 3/4 \approx 0.811$ with weight $4/5$ (Windy = True has a single No, so its entropy is 0), hence the weighted $H = (4/5) \times 0.811 = 0.649$, making the $Gain = 0.971 - 0.649 = 0.322$. Ultimately, Outlook has the highest gain, so we split the decision tree on Outlook.
A Multi-Layered Perceptron, given two inputs $x_1, x_2$ plus a bias, two hidden neurons $h_1, h_2$ (with bias), and a single output neuron $o$, and where all activations are sigmoid:
$$\sigma(v) = \frac{1}{1 + e^{-v}}$$
can pass an epoch as follows. Where $x_1 = 0.05$, $x_2 = 0.10$, $x_0$ (bias) $= 1$, the target $t = 0.01$, learning rate $\eta = 0.5$, and initial weights/biases:
$w_{1,h_1} = 0.15$, $w_{2,h_1} = 0.20$, $b_{h_1} = 0.35$
$w_{1,h_2} = 0.25$, $w_{2,h_2} = 0.30$, $b_{h_2} = 0.35$
$w_{h_1,o} = 0.40$, $w_{h_2,o} = 0.45$, $b_o = 0.60$
We can forward pass by finding the hidden layer nets and outputs:
$net_{h_1} = 0.05 \cdot 0.15 + 0.10 \cdot 0.20 + 0.35 = 0.3775$
$h_1 = \sigma(0.3775) \approx 0.59327$
$net_{h_2} = 0.05 \cdot 0.25 + 0.10 \cdot 0.30 + 0.35 = 0.3925$
$h_2 = \sigma(0.3925) \approx 0.59688$
the output net and output:
$net_o = 0.59327 \cdot 0.40 + 0.59688 \cdot 0.45 + 0.60 \approx 1.1059$
$o = \sigma(1.1059) \approx 0.75136$
and the error:
$E = \frac{1}{2}(t - o)^2 = \frac{1}{2}(0.01 - 0.75136)^2 \approx 0.2748$
We then form our backward pass using the deltas, where the output delta is:
$\delta_o = (o - t) \cdot o \cdot (1 - o) = (0.75136 - 0.01) \times 0.75136 \times 0.24864 \approx 0.1385$
we then use this to calculate the hidden deltas:
$\delta_{h_1} = \delta_o \times w_{h_1,o} \times h_1 (1 - h_1) \approx 0.1385 \times 0.40 \times 0.59327 \times 0.40673 \approx 0.0134$
$\delta_{h_2} = \delta_o \times w_{h_2,o} \times h_2 (1 - h_2) \approx 0.1385 \times 0.45 \times 0.59688 \times 0.40312 \approx 0.0150$
We then must update our weights by applying $w' \leftarrow w - \eta (\delta \times input)$. The same rule applies to the biases, whose input is 1. For the hidden-layer weights (the output-layer weights are updated the same way using $\delta_o$ and the hidden outputs):
$w'_{1,h_1} = 0.15 - 0.5 \times 0.0134 \times 0.05 \approx 0.1497$
$w'_{2,h_1} = 0.20 - 0.5 \times 0.0134 \times 0.10 \approx 0.1993$
$b'_{h_1} = 0.35 - 0.5 \times 0.0134 \times 1 \approx 0.3433$
$w'_{1,h_2} = 0.25 - 0.5 \times 0.0150 \times 0.05 \approx 0.2496$
$w'_{2,h_2} = 0.30 - 0.5 \times 0.0150 \times 0.10 \approx 0.2993$
$b'_{h_2} = 0.35 - 0.5 \times 0.0150 \times 1 \approx 0.3425$
All these steps are equivalent to a single epoch in such a Multi-Layered Perceptron. When repeated, the output will progressively get closer to $t$.
To make a maximum-margin classifier (Hard-Margin Support Vector Machine) for the dataset $P_1 = (1, 1)$ with label $+1$ and $P_2 = (-1, -1)$ with label $-1$, we first find the primal problem, that is, to find $w \in R^2$ and $b$ minimising $\frac{1}{2} ||w||^2$ subject to the constraints for each $i$: $y_i (w \cdot x_i + b) \ge 1$. For our two points, this gives:
$(+1)(w_1 \cdot 1 + w_2 \cdot 1 + b) \ge 1$ ($P_1$)
$(-1)(w_1 \cdot (-1) + w_2 \cdot (-1) + b) \ge 1$ ($P_2$)
At the optimum, both constraints hold with equality (both points become support vectors):
1. $w_1 + w_2 + b = 1$
2. $-(-w_1 - w_2 + b) = 1 \implies -w_1 - w_2 + b = -1$
Adding the two equations:
$(w_1 + w_2 + b) + (-w_1 - w_2 + b) = 1 + (-1) \implies 2b = 0 \implies b = 0$
Substituting $b = 0$ gives $w_1 + w_2 = 1$. We still have infinitely many $(w_1, w_2)$ on that line, so we pick the one minimising $||w||^2 = w_1^2 + w_2^2$. By symmetry (and Lagrange multipliers) the minimum occurs at $w_1 = w_2 = \frac{1}{2}$. Thus $w = (\frac{1}{2}, \frac{1}{2})$, $b = 0$.
We can then find the decision boundary where $w \cdot x + b = 0 \implies \frac{1}{2} x_1 + \frac{1}{2} x_2 = 0$, or simply $x_1 + x_2 = 0$. The margin is
$$\gamma = \frac{1}{||w||} = \frac{1}{\sqrt{(1/2)^2 + (1/2)^2}} = \frac{1}{\sqrt{1/2}} = \sqrt{2}$$

Approaches to Questions
Two main approaches: Constraint Satisfaction Problems and STRIPS.
CSP asks: 'Can we assign values to a set of variables so that all constraints are satisfied?'. It is formed by its variables $X_1, X_2, \ldots, X_n$, domains $D_i$ for each $X_i$, and constraints $C_{ij}$ between variables (binary) or, more generally, constraints over several variables $C_{i_1, i_2, \ldots, i_n}$. The solution is an assignment $X_1 = v_1, \ldots, X_n = v_n$ that makes every constraint true.
To check, perform a backtracking search: pick an unassigned variable and assign it a value from its domain, then, if no constraints are violated, recurse, else backtrack and try another value. Once you assign $X_i = v$, immediately remove any value $w$ from each neighbour's domain if $(v, w)$ violates the constraint (forward checking).
One may also enforce Arc Consistency (AC-3): initialise a queue of all arcs $(X_i, X_j)$, then while the queue is not empty, pop $(X_i, X_j)$. For each $v \in D_i$, if there is no $w \in D_j$ with $(v, w) \in C_{ij}$, remove $v$ from $D_i$ and enqueue all $(X_k, X_i)$ for neighbours $X_k$.
Note you should always choose the variable with the fewest legal values left, breaking ties by picking the variable involved in more constraints, and when assigning, prefer the value that rules out the fewest choices for neighbours.
STRIPS planning problems follow: 'Given an initial state and a goal description, find a sequence of actions that achieves the goal'. A state $s$ is a set of grounded predicates (e.g. {At A, Free B}). Each action $a$ is a tuple $(Pre, Add, Del)$, where $Pre$ are the predicates that must hold in $s$, $Add$ are the predicates to be added to the state, and $Del$ are the predicates to be removed from the state. If $Pre \subseteq s$, then $a$ is applicable and $Result(s, a) = (s \setminus Del) \cup Add$.
One might choose a regressive (start from the goal conditions and regress) or progressive (from the initial state, applying actions) search to solve a problem. It is typically a smart idea to build a planning graph with alternating layers $\langle S_0, A_0, S_1, A_1, \ldots \rangle$ where $S_i$ is the set of predicates reachable after $i$ steps, and $A_i$ is the set of actions whose preconditions lie in $S_i$. Make sure to track mutexes (actions/literals that cannot co-occur) and stop when all goal literals appear in $S_k$ without mutual exclusion. One can then extract a plan by choosing a non-mutex action for each goal at level $k$, then regressing subgoals to level $k - 1$, and so on.
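The STRIPS definitions above (an action is applicable when $Pre \subseteq s$, and $Result(s, a) = (s \setminus Del) \cup Add$) can be sketched in a few lines. This is only a minimal illustration: the `Action` type, the two `Move` actions and the breadth-first `forward_plan` helper are hypothetical names made up for the example, not part of any planner library.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional


class Action(NamedTuple):
    """A STRIPS action as a (Pre, Add, Del) tuple of grounded predicates."""
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def applicable(state: FrozenSet[str], action: Action) -> bool:
    # An action is applicable when Pre is a subset of the state.
    return action.pre <= state


def result(state: FrozenSet[str], action: Action) -> FrozenSet[str]:
    # Result(s, a) = (s \ Del) U Add
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete) | action.add


def forward_plan(initial: FrozenSet[str], goal: FrozenSet[str],
                 actions: List[Action]) -> Optional[List[str]]:
    """Progressive (forward) breadth-first search from the initial state."""
    frontier = deque([(initial, [])])
    seen = {initial}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:  # every goal predicate holds
            return path
        for a in actions:
            if applicable(state, a):
                nxt = result(state, a)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [a.name]))
    return None


# Hypothetical two-location world built from the predicates {At A, Free B}.
move_a_b = Action("Move(A,B)", frozenset({"At A", "Free B"}),
                  frozenset({"At B", "Free A"}), frozenset({"At A", "Free B"}))
move_b_a = Action("Move(B,A)", frozenset({"At B", "Free A"}),
                  frozenset({"At A", "Free B"}), frozenset({"At B", "Free A"}))

s0 = frozenset({"At A", "Free B"})
s1 = result(s0, move_a_b)
plan = forward_plan(s0, frozenset({"At B"}), [move_a_b, move_b_a])
```

States are frozensets so they can be stored in the `seen` set; a regressive search would instead start from the goal and apply the actions backwards.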
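The decision tree walk-through in Algorithm Examples (root entropy of about 0.971, gain of about 0.971 for Outlook and about 0.322 for Windy) can be reproduced with a short script. This is a sketch over the five-row Play table only; the helper names `entropy` and `info_gain` are illustrative choices.

```python
from collections import Counter
from math import log2

# The five training rows from the Play table: (Outlook, Temp, Humidity, Windy, Play?)
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rain", "Mild", "High", "False", "Yes"),
    ("Rain", "Cool", "Normal", "False", "Yes"),
]
ATTRS = ["Outlook", "Temp", "Humidity", "Windy"]


def entropy(labels):
    """H(S) = -sum_i P_i log2 P_i over the class proportions in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(rows, attr_index):
    """IG(S|A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    labels = [r[-1] for r in rows]
    subsets = {}
    for r in rows:
        # Group the class labels by the value of the chosen attribute.
        subsets.setdefault(r[attr_index], []).append(r[-1])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


root_entropy = entropy([r[-1] for r in ROWS])
gains = {a: info_gain(ROWS, i) for i, a in enumerate(ATTRS)}
best = max(gains, key=gains.get)  # attribute chosen for the root split

for attr, g in gains.items():
    print(f"{attr}: {g:.3f}")
```

Because every Outlook value is pure in this table, its gain equals the root entropy, which is why it is chosen for the root.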
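The single MLP epoch in Algorithm Examples can also be checked numerically. The sketch below uses the same inputs, target, learning rate and initial weights; the variable names (`w_h`, `w_o`, `delta_h` and so on) are assumptions made for the illustration, and the hidden deltas are computed from the old output weights before any update is applied.

```python
from math import exp


def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))


# Values from the worked example.
x = [0.05, 0.10]          # inputs x1, x2
t = 0.01                  # target
eta = 0.5                 # learning rate
w_h = [[0.15, 0.20],      # weights into h1 (from x1, x2)
       [0.25, 0.30]]      # weights into h2 (from x1, x2)
b_h = [0.35, 0.35]        # hidden biases
w_o = [0.40, 0.45]        # weights from h1, h2 into o
b_o = 0.60                # output bias

# Forward pass: hidden outputs, then the network output and error.
h = [sigmoid(sum(w * xi for w, xi in zip(w_h[j], x)) + b_h[j]) for j in range(2)]
o = sigmoid(sum(w * hj for w, hj in zip(w_o, h)) + b_o)
error = 0.5 * (t - o) ** 2

# Backward pass: output delta, then hidden deltas (using the old w_o).
delta_o = (o - t) * o * (1 - o)
delta_h = [delta_o * w_o[j] * h[j] * (1 - h[j]) for j in range(2)]

# Gradient-descent update w' = w - eta * (delta * input); a bias has input 1.
new_w_o = [w_o[j] - eta * delta_o * h[j] for j in range(2)]
new_b_o = b_o - eta * delta_o
new_w_h = [[w_h[j][i] - eta * delta_h[j] * x[i] for i in range(2)] for j in range(2)]
new_b_h = [b_h[j] - eta * delta_h[j] for j in range(2)]

print(f"h = {h[0]:.5f}, {h[1]:.5f}  o = {o:.5f}  E = {error:.4f}")
```

Running the same forward and backward pass in a loop, with the new weights fed back in, is what moves the output towards $t$ over repeated epochs.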