ANSWERS TO 15-381 Final, Spring 2004: Friday May 7, 2004
1. Place your name and your andrew email address on the front page.
2. You may use any and all notes, as well as the class textbook. Keep in mind, however, that this final was designed
in full awareness of such. You may NOT use the Internet, but you can use a calculator.
3. We only require that you provide the answers. We don’t need to see your work.
4. The maximum possible score on this exam is 100. You have 180 minutes.
5. Good luck!
Question Score
1 20
2 6
3 6
4 9
5 14
6 9
7 9
8 9
9 9
10 9
Total 100
1 Short Questions
(a) When you run the Waltz algorithm on the following drawing, which of the following statements is true? Circle the correct answer.
[Line drawing of a 3-D object]
(i) The algorithm will label all edges uniquely.
(ii) The algorithm will report that some edges are ambiguous.
(iii) The algorithm will report that the image cannot be labeled consistently.
ANSWER: (ii)
(b) How many degrees of freedom does a rigid 3-d object have if it moves in a 3-d space?
ANSWER: 6
(c) How does randomized hill-climbing choose the next move each time? Circle the correct answer.
(i) It generates a random move from the moveset, and accepts this move.
(ii) It generates a random move from the whole state space, and accepts this move.
(iii) It generates a random move from the moveset, and accepts this move only if this move improves the
evaluation function.
(iv) It generates a random move from the whole state space, and accepts this move only if this move improves
the evaluation function.
ANSWER: (iii)
(d) Suppose you are using a genetic algorithm. Show the children of the following two strings if single point
crossover is performed with a cross-point between the 4th and the 5th digits:
1 4 6 2 5 7 2 3 and 8 5 3 4 6 7 6 1
ANSWER: 1 4 6 2 6 7 6 1 and 8 5 3 4 5 7 2 3
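For concreteness, here is a minimal Python sketch of single-point crossover; the cut index and parent strings are from the question, while the function name is ours:

    def single_point_crossover(parent1, parent2, cut):
        """Swap the tails of two equal-length sequences after position cut."""
        child1 = parent1[:cut] + parent2[cut:]
        child2 = parent2[:cut] + parent1[cut:]
        return child1, child2

    p1 = [1, 4, 6, 2, 5, 7, 2, 3]
    p2 = [8, 5, 3, 4, 6, 7, 6, 1]
    c1, c2 = single_point_crossover(p1, p2, 4)
    print(c1)  # [1, 4, 6, 2, 6, 7, 6, 1]
    print(c2)  # [8, 5, 3, 4, 5, 7, 2, 3]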
(f) Which of the following is the main reason for pruning a decision tree? Circle the correct answer.
(i) to save computational cost
(ii) to avoid over-fitting
(iii) to make the training error smaller
ANSWER: (ii)
(g) Which of the following does the Naive Bayes classifier assume? Circle the correct answer.
(i) All the attributes are independent.
(ii) All the attributes are conditionally independent given the output label.
(iii) All the attributes are jointly dependent on each other.
ANSWER: (ii)
(h) By which of the following networks can the XOR function be learned? Circle the correct answer.
(i) linear perceptron
(ii) single layer Neural Network
(iii) 1-hidden layer Neural Network
(iv) none of the above
ANSWER: (iii)
(i) If we use K-means on a finite set of samples, which of the following statements is true? Circle the correct answer.
(i) K-means is not guaranteed to terminate.
(ii) K-means is guaranteed to terminate, but is not guaranteed to find the optimal clustering.
(iii) K-means is guaranteed to terminate and find the optimal clustering.
ANSWER: (ii)
(j) In the worst case, what is the number of nodes that will be visited by Breadth-First Search in a (non-looping)
tree with depth d and branching factor b?
ANSWER: O(b^d)
(k) True or False : If a search tree has cycles, A* Search with an inadmissible heuristic might never converge when
run on that tree.
ANSWER: False
(l) Circle the Nash Equilibria in the following matrix-form game:
ANSWER:
                 Player 2
             D       E       F
         A  0, 1    3, 5    2, 1
Player 1 B  6, 3    1, 3    5, 2
         C  4, 2    3, 4    7, 7
The pure-strategy Nash equilibria are (A, E), (B, D), and (C, F).
(m) Assume the following zero-sum game, where player 1 is the maximizer:
             Player 2
             C       D
Player 1 A   2       0
         B   0       1
If Player 1 chooses strategy A with probability p, and if Player 2 always plays strategy C, what is the expected value of the game?
ANSWER: 2p + 0(1 - p) = 2p
(n) In the mixed strategy Nash equilibrium for the above game, with what probability does Player 1 use strategy A?
Player 2 is indifferent between C and D when:
2p = 1(1 - p)
3p = 1
ANSWER: p = 1/3
(o) True or False : In a second-price, sealed bid auction, it is optimal to bid your true value. There is no advantage
to bluffing.
ANSWER: True
(p) How many values does it take to represent the joint distribution of 4 boolean variables?
ANSWER: 2^4 = 16
(q) If P(A) = 0.3, P(B) = 0.4, and P(A|B) = 0.6
(a) What is P(A ∧ B)?
ANSWER: P(A ∧ B) = P(A|B) P(B) = 0.6 × 0.4 = 0.24
(b) What is P(B|A)?
ANSWER: P(B|A) = P(A|B) P(B) / P(A) = 0.24 / 0.3 = 0.8
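A quick Python check of the arithmetic (variable names are ours):

    p_a, p_b, p_a_given_b = 0.3, 0.4, 0.6
    p_a_and_b = p_a_given_b * p_b   # product rule: 0.24
    p_b_given_a = p_a_and_b / p_a   # Bayes rule: 0.8
    print(p_a_and_b, p_b_given_a)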
(r) For the following questions, use the diagram below. If you do not have enough information to answer a question,
answer False.
[Diagram: a Bayes net over variables A and B]
2 Hill Climbing, Simulated Annealing and Genetic Algorithm
The N-queens problem requires you to place N queens on an N-by-N chessboard such that no queen attacks another
queen. (A queen attacks any piece in the same row, column or diagonal.) Here are some important facts:
We define the states to be any configuration where the N queens are on the board, one per column.
The moveset includes all possible states generated by moving a single queen to another square in the same
column. The function to obtain these states is called the successor function.
The evaluation function Eval(state) is the number of non-attacking pairs of queens in this state. (Please note it is the number of NON-attacking pairs.)
In the following questions, we deal with the 6-queens problem (N=6).
1. How many possible states are there in total?
ANSWER: 6^6
2. For each state, how many successor states are there in the moveset?
ANSWER: 30 (each of the 6 queens can be moved to any of the 5 other squares in its column)
3. What value will the evaluation function Eval() return for the current state shown below?
[Board diagram: the current 6-queens state]
ANSWER: 9
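The board diagram is not reproduced here, but a short Python sketch shows how Eval counts non-attacking pairs for any state (the example state below is hypothetical, not the one from the exam):

    from itertools import combinations

    def eval_state(rows):
        """rows[c] = row of the queen in column c; count NON-attacking pairs."""
        ok = 0
        for (c1, r1), (c2, r2) in combinations(enumerate(rows), 2):
            attacking = (r1 == r2) or (abs(r1 - r2) == abs(c1 - c2))
            if not attacking:
                ok += 1
        return ok

    print(eval_state([2, 4, 1, 5, 3, 0]))  # a hypothetical 6-queens state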
4. If you use Simulated Annealing (currently T=3), and the current state and the random next state are shown below, will you accept this random next state immediately, or accept it with some probability? If it is the latter case, what is the probability?
[Board diagrams: current state and random next state]
ANSWER: For the current state, E1 = 9. For the next state, E2 = 6. So E2 < E1: the next state is worse (fewer non-attacking pairs), and we accept it only with probability
P = exp(-(E1 - E2)/T) = exp(-(9 - 6)/3) = 1/e
We will accept the next state with probability 1/e.
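The acceptance rule as a short Python sketch (nothing here beyond the formula and the numbers in the question; the function name is ours):

    import math, random

    def accept(e_current, e_next, temperature):
        """Metropolis-style acceptance for a maximization problem."""
        if e_next >= e_current:
            return True  # improving (or equal) moves are always taken
        p = math.exp(-(e_current - e_next) / temperature)
        return random.random() < p

    # E1 = 9, E2 = 6, T = 3  ->  acceptance probability exp(-1), about 0.368
    print(math.exp(-(9 - 6) / 3))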
5. Suppose you use a Genetic Algorithm. The current generation includes four states, S 1 through S 4. The evalu-
ation values for each of the four states are: Eval(S 1) = 9, Eval(S 2) = 12, Eval(S 3) = 11, Eval(S 4) = 8.
Calculate the probability that each of them would be chosen in the "selection" step (also called the "reproduction" step).
ANSWER: Selection is fitness-proportionate, so each state is chosen with probability Eval(S_i) divided by the total 9 + 12 + 11 + 8 = 40:
P(S1) = 9/40 = 0.225, P(S2) = 12/40 = 0.300, P(S3) = 11/40 = 0.275, P(S4) = 8/40 = 0.200
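The same computation in Python (a one-line fitness-proportionate normalization):

    evals = {"S1": 9, "S2": 12, "S3": 11, "S4": 8}
    total = sum(evals.values())                        # 40
    probs = {s: v / total for s, v in evals.items()}
    print(probs)  # {'S1': 0.225, 'S2': 0.3, 'S3': 0.275, 'S4': 0.2}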
6. In a Genetic Algorithm, each state of 6-queens can be represented as 6 digits, each indicating the position of the queen in that column. Which action in a genetic algorithm (among {selection, cross-over, mutation}) is most similar to the successor function described above?
ANSWER: Mutation
3 Cross Validation
Suppose you are running a majority classifier on the following training set. The training set is shown below. It consists
of 10 data points. Each data point has a class label of either 0 or 1. A majority classifier is defined to output the class
label that is in the majority in the training set, regardless of the input. If there is a tie in the training set, then always
output class label 1.
[Diagram: the training set of 10 labeled data points]
3. What is the two-fold Cross-Validation error? Assume the left 5 points belong to one partition while the right 5
points belong to the other partition. (report the error as a ratio)
ANSWER: 8/10
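Since the exam's training-set figure is not reproduced, here is a generic Python sketch of two-fold cross-validation with a majority classifier; the label list at the bottom is a made-up placeholder, not the exam's data:

    def majority_label(labels):
        """Majority class in the training fold; ties go to class 1."""
        ones = sum(labels)
        return 1 if 2 * ones >= len(labels) else 0

    def two_fold_cv_error(labels):
        left, right = labels[:5], labels[5:]
        errors = sum(y != majority_label(right) for y in left)   # train on right, test on left
        errors += sum(y != majority_label(left) for y in right)  # train on left, test on right
        return errors / len(labels)

    print(two_fold_cv_error([1, 0, 0, 0, 0, 0, 1, 1, 1, 1]))  # hypothetical labels -> 0.8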
4 Probabilistic Reasoning/Bayes Nets
1. If A and B are independent, then ¬A is independent of ¬B. True or False?
Show the work supporting your answer. You might find the following statements useful:
ANSWER: True.
P(¬A ∧ ¬B) = P(¬(A ∨ B)) = 1 - P(A ∨ B) = 1 - P(A) - P(B) + P(A ∧ B)
= P(¬A) - P(B) + P(A)P(B) = P(¬A) - P(B)(1 - P(A))
= P(¬A)(1 - P(B)) = P(¬A)P(¬B)
Suppose student A comes to class with probability P(A) = 0.8 and student B independently comes to class with probability P(B) = 0.6.
(a) What is the probability that neither shows up to class on any given day?
ANSWER: P(¬A ∧ ¬B) = P(¬A)P(¬B) = 0.2 × 0.4 = 0.08
(b) What is the probability that at least one of them is in class on any given day?
P(A ∨ B) = P(¬(¬A ∧ ¬B)) = 1 - P(¬A ∧ ¬B) = 1 - 0.08 = 0.92
Also, P(A ∨ B) = P(A) + P(B) - P(A ∧ B) = 0.8 + 0.6 - 0.8 × 0.6 = 0.92
ANSWER: 0.92
Suppose there is also a student C who always comes to class if and only if student A or student B (or both)
show up.
ANSWER:
P(A) = 0.8    P(B) = 0.6
[Bayes net: A → C ← B]
P(C | A ∧ B) = 1
P(C | ¬A ∧ B) = 1
P(C | A ∧ ¬B) = 1
P(C | ¬A ∧ ¬B) = 0
(e) Is A conditionally independent of B given C? (yes/no)
ANSWER: No, A and B are dependent given C since C unblocks the path between A and B.
(f) Suppose you know that C came to class. What is the probability of A coming if you know that B showed up too?
ANSWER: 0.8
Since B coming to class fully explains the appearance of C, P(A|B ∧ C) = P(A) = 0.8. The result can also be obtained from the probabilities:
P(A|B ∧ C) = P(A ∧ B ∧ C) / P(B ∧ C) = P(A)P(B)P(C|A ∧ B) / [P(A ∧ B ∧ C) + P(¬A ∧ B ∧ C)]
= P(C|A ∧ B)P(A)P(B) / [P(C|A ∧ B)P(A)P(B) + P(C|¬A ∧ B)P(¬A)P(B)]
= P(A) / [P(A) + P(¬A)]
= P(A) = 0.8
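A brute-force check by enumerating the joint distribution (a small Python sketch; the probabilities are the ones given above):

    from itertools import product

    p_a, p_b = 0.8, 0.6

    def p_c_given(a, b):
        return 1.0 if (a or b) else 0.0  # C comes iff A or B comes

    num = den = 0.0  # accumulate P(A and B and C) and P(B and C)
    for a, b, c in product([True, False], repeat=3):
        p = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
        p *= p_c_given(a, b) if c else 1 - p_c_given(a, b)
        if b and c:
            den += p
            if a:
                num += p
    print(num / den)  # P(A | B and C) = 0.8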
5 Neural Networks
1. Draw a linear perceptron network and calculate corresponding weights to correctly classify the 4 points below. The output node returns 1 if the weighted sum is greater than or equal to the threshold (0.5). If it looks too complicated you are probably wrong. You are allowed to make use of a "constant 1" unit input.
x y out
0 0 1
0 1 1
1 0 0
1 1 1
ANSWER:
[Network diagram: inputs x (weight w1) and y (weight w2) and a constant-1 input (weight w3), all feeding a single threshold output unit]
The dataset is linearly separable. Any set of weights such that w1 < 0, w3 ∈ [0.5, 0.5 - w1) and w2 ≥ max{0, 0.5 - w1 - w3} would have been a correct solution to the problem. A common solution was w1 = -1, w2 = 1, w3 = 1.
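A quick Python check that the common solution classifies all four points:

    def perceptron(x, y, w1=-1.0, w2=1.0, w3=1.0, threshold=0.5):
        return 1 if w1 * x + w2 * y + w3 * 1 >= threshold else 0

    for x, y, out in [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 1)]:
        assert perceptron(x, y) == out  # all four points classified correctly
    print("all four points correct")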
2. Is it possible to modify the network so that it will classify both the dataset above and the one below with 100% accuracy?
x y out
0 0 0
0 1 1
1 0 1
1 1 1
ANSWER: No. The two datasets assign different labels to the same inputs (for example, (1,0) is labeled 0 in the first dataset but 1 in the second), so no single network can classify both with 100% accuracy.
6 Naive Bayes
Assume we have a data set with three binary input attributes, A, B, C, and one binary outcome attribute Y. The three
input attributes, A, B, C take values in the set {0,1} while the Y attribute takes values in the set {True, False}.
A B C Y
0 1 1 True
1 1 0 True
1 0 1 False
1 1 1 False
0 1 1 True
0 0 0 True
0 1 1 False
1 0 1 False
0 1 0 True
1 1 1 True
If we are using a Naive Bayes Classifier with one binary valued output variable Y, the following theorem is true:
Theorem: A non-impossible set of input values, S, (i.e. a set of input values with P(S) > 0) will have an unambiguous predicted classification of Y = True ⇔ P(Y = True ∧ S) > P(Y = False ∧ S)
1. How would a Naive Bayes classifier classify the record (A=1,B=1,C=0)? (True/False)
ANSWER: True
2. [Question text lost in extraction]
ANSWER: FALSE
3. How would a Naive Bayes classifier classify the record (A=0,B=0,C=0)? (True/False)
ANSWER: TRUE
The estimate of P(A = 0|Y = False)P(B = 0|Y = False)P(C = 0|Y = False)P(Y = False) = (1/4)(2/4)(0/4)(4/10) = 0, while the corresponding estimate for Y = True is (4/6)(1/6)(3/6)(6/10) = 1/30, and 1/30 > 0.
4. Would it be possible to add just one record to the data set that would result in a Naive Bayes classifier changing
its classification of the record (A=1,B=0,C=1)? (Yes/No)
ANSWER: No
With the current data set the record (A=1,B=0,C=1) classifies to False since:
P(A = 1|Y = True)P(B = 0|Y = True)P(C = 1|Y = True)P(Y = True) = (2/6)(1/6)(3/6)(6/10) = 0.0166
< P(A = 1|Y = False)P(B = 0|Y = False)P(C = 1|Y = False)P(Y = False) = (3/4)(2/4)(4/4)(4/10) = 0.15
If the record we add has Y = True then the estimate of
P(A = 1|Y = True)P(B = 0|Y = True)P(C = 1|Y = True)P(Y = True) would increase the most if the added record also has A=1, B=0, and C=1, in which case the estimate would become (3/7)(2/7)(4/7)(7/11) = 0.0445.
However this value is still less than 0.15, the unchanged estimate of
P(A = 1|Y = False)P(B = 0|Y = False)P(C = 1|Y = False)P(Y = False).
If the record we add has Y = False then the estimate of
P(A = 1|Y = False)P(B = 0|Y = False)P(C = 1|Y = False)P(Y = False) would decrease the most if the added record has A=0, B=1, and C=0, in which case the estimate would become (3/5)(2/5)(4/5)(5/11) = 0.0873,
which is still greater than 0.0166, the unchanged estimate of P(A = 1|Y = True)P(B = 0|Y = True)P(C = 1|Y = True)P(Y = True).
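The two joint estimates can be recomputed from the table; a Python sketch (helper names ours):

    data = [  # (A, B, C, Y) from the table above
        (0, 1, 1, True), (1, 1, 0, True), (1, 0, 1, False), (1, 1, 1, False),
        (0, 1, 1, True), (0, 0, 0, True), (0, 1, 1, False), (1, 0, 1, False),
        (0, 1, 0, True), (1, 1, 1, True),
    ]

    def nb_score(a, b, c, y):
        rows = [r for r in data if r[3] == y]
        n = len(rows)
        p = n / len(data)                      # P(Y = y)
        p *= sum(r[0] == a for r in rows) / n  # P(A = a | Y = y)
        p *= sum(r[1] == b for r in rows) / n  # P(B = b | Y = y)
        p *= sum(r[2] == c for r in rows) / n  # P(C = c | Y = y)
        return p

    print(nb_score(1, 0, 1, True))   # ~0.0167
    print(nb_score(1, 0, 1, False))  # 0.15  -> classify False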
7 Decision Tree
For this problem we will use the same data set below as in the Naive Bayes question. Again assume we have three
binary input attributes, A, B, C, and one binary outcome attribute Y. The three input attributes, A, B, C take values in the set {0,1} while the Y attribute takes values in the set {True, False}.
A B C Y
0 1 1 True
1 1 0 True
1 0 1 False
1 1 1 False
0 1 1 True
0 0 0 True
0 1 1 False
1 0 1 False
0 1 0 True
1 1 1 True

Specific Conditional Entropies
H(Y|A=0)=0.72    H(Y|A=0,B=0)=0.00    H(Y|A=1,C=0)=0.00
H(Y|A=1)=0.97    H(Y|A=0,B=1)=0.81    H(Y|A=1,C=1)=0.81
H(Y|B=0)=0.92    H(Y|A=1,B=0)=0.00    H(Y|B=0,C=0)=0.00
H(Y|B=1)=0.86    H(Y|A=1,B=1)=0.92    H(Y|B=0,C=1)=0.00
H(Y|C=0)=0.00    H(Y|A=0,C=0)=0.00    H(Y|B=1,C=0)=0.00
H(Y|C=1)=0.99    H(Y|A=0,C=1)=0.92    H(Y|B=1,C=1)=0.97
ANSWER:
[Decision tree diagram: the root splits on C; C=0 is a True leaf; the C=1 branch splits on B (B=0 is a False leaf), and B=1 then splits on A]
Note the first attribute split on in the case C=1 is B instead of A, since
H(Y|A; C=1) = P(A=0|C=1)H(Y|A=0,C=1) + P(A=1|C=1)H(Y|A=1,C=1) = (3/7)(0.92) + (4/7)(0.81) = 0.857
> H(Y|B; C=1) = P(B=0|C=1)H(Y|B=0,C=1) + P(B=1|C=1)H(Y|B=1,C=1) = (2/7)(0.00) + (5/7)(0.97) = 0.693
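The comparison above can be reproduced directly from the data table; a Python sketch (helper names ours):

    import math

    data = [  # (A, B, C, Y) from the table above
        (0, 1, 1, True), (1, 1, 0, True), (1, 0, 1, False), (1, 1, 1, False),
        (0, 1, 1, True), (0, 0, 0, True), (0, 1, 1, False), (1, 0, 1, False),
        (0, 1, 0, True), (1, 1, 1, True),
    ]

    def entropy(labels):
        h = 0.0
        for cls in set(labels):
            p = labels.count(cls) / len(labels)
            h -= p * math.log2(p)
        return h

    def cond_entropy(attr, rows):
        """Weighted H(Y | attr) over the given rows (attr: 0=A, 1=B, 2=C)."""
        h = 0.0
        for v in (0, 1):
            subset = [r[3] for r in rows if r[attr] == v]
            if subset:
                h += len(subset) / len(rows) * entropy(subset)
        return h

    c1 = [r for r in data if r[2] == 1]  # the C=1 branch
    print(cond_entropy(0, c1))  # H(Y|A; C=1) ~ 0.857
    print(cond_entropy(1, c1))  # H(Y|B; C=1) ~ 0.693 -> split on B first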
3. How would your decision tree classify the record (A=0,B=0,C=1)? (True/False)
ANSWER: FALSE
4. How would your decision tree classify the record (A=1,B=0,C=0)? (True/False)
ANSWER: TRUE
5. If you pruned all nodes from your decision tree except the root node, now how would your decision tree classify
the record (A=0,B=0,C=1)? (True/False) Again, assume any ties are broken by choosing True. (NOTE: This
was clarified during the exam to mean that the tree would split on one attribute and then classify)
ANSWER: FALSE
6. If you pruned all nodes from your decision tree except the root node, now how would your decision tree classify
the record (A=1,B=0,C=0)? (True/False) Again, assume any ties are broken by choosing True.
ANSWER: TRUE
8 K-Means
The circles in the numbered boxes below represent the data points. In the first numbered box there are three squares,
representing the initial location of cluster centers of the k-means algorithm. Trace through the first nine iterations of
the k-means algorithm or until convergence is reached, whichever comes first. For each iteration draw three squares
corresponding to the location of the cluster centers during that iteration. (NOTE: It is not necessary to draw the exact location of the squares, but it should be clear from your placement of the squares that you understand how k-means performs qualitatively.)
ANSWER:
[Diagram: the traced cluster-center positions for each iteration until convergence]
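Since the point diagram is not reproduced, here is a minimal k-means loop in Python; the data points and initial centers below are made-up placeholders:

    def kmeans(points, centers, max_iter=9):
        for _ in range(max_iter):
            # Assignment step: attach each point to its nearest center.
            clusters = [[] for _ in centers]
            for p in points:
                i = min(range(len(centers)),
                        key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
                clusters[i].append(p)
            # Update step: move each center to the mean of its cluster.
            new_centers = [
                tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centers[i]
                for i, c in enumerate(clusters)
            ]
            if new_centers == centers:  # converged: centers stopped moving
                break
            centers = new_centers
        return centers

    pts = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 0)]  # hypothetical points
    print(kmeans(pts, [(0, 0), (1, 0), (2, 0)]))    # hypothetical initial centers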
9 Reinforcement Learning
9.1 Q-Learning
Perform Q-learning for a system with two states and two actions, given the following training examples. The discount factor is γ = 0.5 and the learning rate is α = 0.5. Assume that your Q-table is initialized to 0.0 for all values.
[Table of training examples and the resulting Q-table are not reproduced here]
ANSWER: π(1) = a1, π(2) = a1
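The exam's table of training examples is not reproduced above, so the sketch below only illustrates the update rule with γ = α = 0.5 on hypothetical experience tuples:

    gamma, alpha = 0.5, 0.5
    Q = {(s, a): 0.0 for s in (1, 2) for a in ("a1", "a2")}

    def q_update(s, a, r, s_next):
        """One Q-learning backup: Q(s,a) += alpha * (target - Q(s,a))."""
        target = r + gamma * max(Q[(s_next, b)] for b in ("a1", "a2"))
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    # Hypothetical experiences (state, action, reward, next state):
    for s, a, r, s2 in [(1, "a1", 10, 2), (2, "a1", -10, 1), (1, "a2", 0, 1)]:
        q_update(s, a, r, s2)
    print(Q)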
9.2 Certainty Equivalent Learning
In the diagram below, draw the state transitions and label them according to the values that would be discovered by Certainty Equivalent learning, given the following training examples.
(Start = S1 , Action = a1 , Reward = 10, End = S2 )
(Start = S2 , Action = a2 , Reward = -10, End = S1 )
(Start = S1 , Action = a2 , Reward = 10, End = S1 )
(Start = S1 , Action = a1 , Reward = 10, End = S1 )
(Start = S1 , Action = a2 , Reward = 10, End = S1 )
(Start = S1 , Action = a1 , Reward = 10, End = S2 )
(Start = S2 , Action = a1 , Reward = -10, End = S2 )
(Start = S2 , Action = a2 , Reward = -10, End = S2 )
(Start = S2 , Action = a2 , Reward = -10, End = S1 )
ANSWER:
[Diagram: the estimated model] R(S1) = +10, R(S2) = -10
From S1: a1 goes to S2 with probability 2/3 and stays in S1 with probability 1/3; a2 stays in S1 with probability 1.0.
From S2: a1 stays in S2 with probability 1.0; a2 goes to S1 with probability 2/3 and stays in S2 with probability 1/3.
π(S1) = a2, π(S2) = a2
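Certainty Equivalent learning just estimates the transition model from counts; a Python sketch over the nine experiences above:

    from collections import Counter, defaultdict

    experiences = [  # (start, action, reward, end), copied from the list above
        ("S1", "a1", 10, "S2"), ("S2", "a2", -10, "S1"), ("S1", "a2", 10, "S1"),
        ("S1", "a1", 10, "S1"), ("S1", "a2", 10, "S1"), ("S1", "a1", 10, "S2"),
        ("S2", "a1", -10, "S2"), ("S2", "a2", -10, "S2"), ("S2", "a2", -10, "S1"),
    ]

    counts = defaultdict(Counter)
    for s, a, r, s2 in experiences:
        counts[(s, a)][s2] += 1

    for (s, a), c in sorted(counts.items()):
        total = sum(c.values())
        print(s, a, {s2: n / total for s2, n in c.items()})
    # (S1, a1): {S2: 2/3, S1: 1/3}   (S1, a2): {S1: 1.0}
    # (S2, a1): {S2: 1.0}            (S2, a2): {S1: 2/3, S2: 1/3}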
10 Markov Decision Processes
You are a wildly implausible robot who wanders among the four areas depicted below. You hate rain and get a reward of -30 on any move that starts in the Deck and -40 on any move that starts in the Garden. You like parties, and you are indifferent to kitchens.
[Diagram: four rooms in a cycle, each with actions S (Stay), CL (Clockwise), CC (Counter-Clockwise)]
Deck (R = -30)      Party (R = +20)
Kitchen (R = 0)     Garden (R = -40)
Moving clockwise: Deck -> Party -> Garden -> Kitchen -> Deck
Actions: All states have three actions: Clockwise (CL), Counter-Clockwise (CC), Stay (S). Clockwise and Counter-Clockwise move you through a door into another room, and Stay keeps you in the same location. All transitions are deterministic (probability 1.0).
2. Let J*(Room) = expected discounted sum of future rewards assuming you start in "Room" and subsequently act optimally. Assuming a discount factor γ = 0.5, give the J* values for each room.
ANSWER: By eyeballing the problem and quickly checking the CL and S options in the Kitchen, you can quickly determine that the optimal policy is: Stay in the Party room, go Clockwise from the Deck, go Counter-Clockwise from the Garden, and Stay in the Kitchen. This gives:
J*(Deck) = -30 + 0.5 × 40 = -10
J*(Party) = 20 / (1 - 0.5) = 40
J*(Kitchen) = 0
J*(Garden) = -40 + 0.5 × 40 = -20
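A value-iteration check of these numbers (Python sketch; the room adjacency is encoded from the diagram):

    gamma = 0.5
    rewards = {"Deck": -30, "Party": 20, "Kitchen": 0, "Garden": -40}
    cl = {"Deck": "Party", "Party": "Garden", "Garden": "Kitchen", "Kitchen": "Deck"}
    cc = {v: k for k, v in cl.items()}

    J = {room: 0.0 for room in rewards}
    for _ in range(100):  # value iteration; converges well before 100 sweeps
        J = {r: rewards[r] + gamma * max(J[r], J[cl[r]], J[cc[r]]) for r in rewards}
    print(J)  # Deck: -10, Party: 40, Kitchen: 0, Garden: -20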
3. The optimal policy when the discount factor, γ, is small but non-zero (e.g. γ = 0.1) is different from the optimal policy when γ is large (e.g. γ = 0.9). If we began with γ = 0.1, and then gradually increased γ, what would be the threshold value of γ above which the optimal policy would change?
ANSWER: In the Kitchen, Stay gives
J_S(Kitchen) = 0 + γ J_S(Kitchen) = 0
while Clockwise gives
J_CL(Kitchen) = 0 + γ J*(Deck)
We already know that the optimal policy in the Deck is CL, regardless of the discount factor:
J*(Deck) = -30 + γ J*(Party) = -30 + γ (20 / (1 - γ))
The Kitchen policy flips from Stay to CL when J*(Deck) crosses 0:
-30 + γ (20 / (1 - γ)) = 0
20γ = 30(1 - γ)
γ = 3/5 = 0.6
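A numeric check of the threshold (Python sketch of the same equation):

    def j_deck(gamma):
        # Deck under CL: one -30 step, then stay in the Party room forever.
        return -30 + gamma * 20 / (1 - gamma)

    g = 0.0
    while j_deck(g) <= 0:  # Stay (value 0) beats CL in the Kitchen while J*(Deck) <= 0
        g += 0.001
    print(round(g, 3))     # first gamma where the Kitchen policy flips: ~0.6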