class-test-1
Total marks: 20
Note. Provide brief justifications and/or calculations along with each answer to illustrate how you
arrived at the answer.
Question 1. Describe two factors that have contributed to the dramatic growth of AI in the last
5–10 years. Keep your answer brief: ideally 2–3 lines. [2 marks]
Question 2. Answer these questions relating to the Perceptron Learning Algorithm, as discussed
in class. Assume that the input data set comprises two classes which are separable using an origin-
centred hyperplane.
2a. Recall that the Perceptron Learning Algorithm is free to pick an arbitrary point from the
currently misclassified set at each iteration, and update the weight vector based on that point.
Hence, for a given data set, we can end up with different output weight vectors depending
on how we resolve the choice at each iteration. Concretely, let w1 , w2 , . . . , wL be the output
weight vectors produced by L separate runs of the Perceptron Learning Algorithm. As we
have already shown, each of these weight vectors guarantees perfect separation of the training
data points. Now consider the “average” of the output vectors:
wavg = (w1 + w2 + · · · + wL) / L.
Is wavg also guaranteed to separate the training points perfectly? Prove that your answer is
correct. [2 marks]
2b. Consider a change to the Perceptron Learning Algorithm, wherein we use a “learning rate”
αk = 1/k for k ≥ 1. In other words, the update made by this variant at the k-th iteration is
wk+1 ← wk + αk y^j x^j,
with (x^j, y^j) being the arbitrarily-chosen misclassified point. By contrast, in class we had
used a constant learning rate of 1.
The use of a harmonically-annealed learning rate is common in algorithms such as gradient
descent. How does it affect the Perceptron Learning Algorithm? Does the algorithm still
converge, and if so, does it still yield a separating hyperplane? Prove that your answer is correct.
[5 marks]
Question 3. We consider a 1-dimensional example of gradient descent. For w ∈ R, let
Error(w) = 3w^4 − 4w^3 − 12w^2 + 50.
3a. What is G(w) = ∇w Error(w)? [1 mark]
3b. Consider a procedure that begins with some initial guess w0 ∈ R, and progressively obtains
iterates through gradient descent: for t ≥ 0,
wt+1 ← wt − (1/(t + 1)) · G(wt).
Draw a plot with w0 on the x axis and the limiting value of Error(wt) as t → ∞ on the y axis. [4 marks]
3c. What is G2 (w) = ∇w G(w)? Based on your answer to 3b, suggest why this function might be
useful to compute. [2 marks]
Question 4. Consider a data set containing every possible tuple of three binary variables x1 ,
x2 , and x3 . The label associated with each tuple is encoded by the decision tree T1 shown below.
Observe that the variables and the labels both take values in {0, 1}.
[Figure 1: Decision tree T1. Internal nodes are shown as circles, and leaves as squares. The root splits on x1; its 0-branch leads to a node that splits on x2 (leaf 1 for x2 = 0, leaf 0 for x2 = 1), and its 1-branch leads to a node that splits on x3 (leaf 0 for x3 = 0, leaf 1 for x3 = 1).]
Draw a decision tree T2 that assigns each possible (x1 , x2 , x3 )-tuple the same label as assigned
by T1 , but which splits on the variable x2 at its root node. Use the least number of internal nodes
possible to construct T2 , and argue why you cannot reduce this number further. [4 marks]
Solutions
1. The growth of the Internet has made it possible to collect large amounts of data, which can
now also be stored cheaply. Computing hardware and memory have likewise become orders of
magnitude faster and cheaper in the last few years, allowing algorithms to process the stored data.
Cameras and other sensors, themselves ubiquitous and cheap, have made many different types of
data available for processing. From a technical standpoint, the maturing of machine learning as a
field has led to many “off-the-shelf” solutions and libraries for AI. The resurgence of neural
networks as an effective model for tasks in domains such as vision and speech has also been behind
many recent success stories.
2a. We know from our proof in class that for each i ∈ {1, 2, . . . , n} and l ∈ {1, 2, . . . , L},
y^i (wl · x^i) > 0.
It follows that for each i ∈ {1, 2, . . . , n},
y^i (wavg · x^i) = (y^i (w1 · x^i) + y^i (w2 · x^i) + · · · + y^i (wL · x^i)) / L > 0,
implying that wavg also achieves perfect separation of the data.
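As an aside (not part of the required answer), the claim can also be checked numerically. The Python sketch below uses a made-up toy data set and a straightforward implementation of the Perceptron Learning Algorithm with random tie-breaking among the misclassified points; the data set, function names, and run count are our own choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy data set, separable by an origin-centred hyperplane; labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.5], [-1.0, -2.5]])
y = np.array([1, 1, -1, -1])

def perceptron(X, y, rng):
    # Perceptron Learning Algorithm: pick an arbitrary (here random)
    # misclassified point at each iteration; constant learning rate 1.
    w = np.zeros(X.shape[1])
    while True:
        mis = np.where(y * (X @ w) <= 0)[0]   # currently misclassified points
        if len(mis) == 0:
            return w
        j = rng.choice(mis)
        w = w + y[j] * X[j]

# L independent runs, followed by the average of the output weight vectors.
L = 5
ws = [perceptron(X, y, rng) for _ in range(L)]
w_avg = np.mean(ws, axis=0)
print("margins under w_avg:", y * (X @ w_avg))   # all strictly positive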
2b. We follow essentially the same steps as in our original proof, albeit with the different
learning rate. First, we observe
wk+1 · w⋆ = (wk + αk y^j x^j) · w⋆
= wk · w⋆ + αk y^j (x^j · w⋆)
≥ wk · w⋆ + αk γ.
It follows by induction that wk+1 · w⋆ ≥ (α1 + α2 + · · · + αk) γ. Since wk+1 · w⋆ ≤ ‖wk+1‖ ‖w⋆‖ =
‖wk+1‖, we get
‖wk+1‖ ≥ (α1 + α2 + · · · + αk) γ.    (1)
We can also upper-bound ‖wk+1‖ as follows.
‖wk+1‖² = ‖wk + αk y^j x^j‖²
= ‖wk‖² + ‖αk y^j x^j‖² + 2 αk y^j (wk · x^j)
= ‖wk‖² + (αk)² ‖x^j‖² + 2 αk y^j (wk · x^j)
≤ ‖wk‖² + (αk)² ‖x^j‖²    (since (x^j, y^j) is misclassified by wk, we have y^j (wk · x^j) ≤ 0)
≤ ‖wk‖² + (αk)² R²,
from which it follows by induction that
‖wk+1‖² ≤ ((α1)² + (α2)² + · · · + (αk)²) R².    (2)
For our particular choice of sequence αk = 1/k, we observe that
1. α1 + α2 + · · · + αk > ln(k), and
2. there is a constant C such that (α1)² + (α2)² + · · · + (αk)² < C.
It follows that (γ ln(k))² ≤ ‖wk+1‖² ≤ C R², which implies k ≤ exp(√C R / γ). Hence, the algo-
rithm can only make a finite number of iterations; by construction, termination implies correctness.
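Again as an aside, the variant with αk = 1/k can be simulated to confirm that it still terminates with a separating hyperplane. The sketch below reuses the toy data set from the previous aside; all names and choices are ours, purely for illustration.

import numpy as np

rng = np.random.default_rng(1)

# Toy data set, separable by an origin-centred hyperplane; labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.5], [-1.0, -2.5]])
y = np.array([1, 1, -1, -1])

def perceptron_annealed(X, y, rng):
    # Variant of the Perceptron Learning Algorithm with learning rate 1/k
    # at the k-th update.
    w = np.zeros(X.shape[1])
    k = 1
    while True:
        mis = np.where(y * (X @ w) <= 0)[0]
        if len(mis) == 0:
            return w, k - 1                 # separating w and number of updates made
        j = rng.choice(mis)
        w = w + (1.0 / k) * y[j] * X[j]
        k += 1

w, updates = perceptron_annealed(X, y, rng)
print("updates:", updates, "final margins:", y * (X @ w))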
3a. G(w) = ∇w (3w^4 − 4w^3 − 12w^2 + 50) = 12w^3 − 12w^2 − 24w.
3b. It is easy to see that G(w) factorises as 12w(w − 2)(w + 1), implying that Error(w) has its
local optima (maxima or minima) at −1, 0, and 2. By plotting Error(w), we observe that indeed
−1 and 2 are local minima, while 0 is a local maximum.
It follows that if we performed gradient descent with a “small enough” learning rate, the iterates
would converge to −1 from any w0 < 0, to 2 from any w0 > 0, and stay at 0 if w0 = 0; the plot
would therefore show Error(−1) = 45 for w0 < 0, Error(0) = 50 at w0 = 0, and Error(2) = 18
for w0 > 0. Unfortunately, there is a bug in the question, wherein the learning rate used is not
small enough compared to the gradient. (We acknowledge Utkarsh Gupta for pointing out this
bug.) Hence, although starting points in the vicinity of the local minima will converge to these
minima (and starting at the local maximum will keep the process there forever), starting from
other points could take the process through hard-to-characterise sequences, and in fact even to
divergence.
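The erratic behaviour described above is easy to observe in simulation. The sketch below (again only an illustration, not part of the expected answer; the starting points, step count, and blow-up threshold are arbitrary choices of ours) runs the update wt+1 ← wt − G(wt)/(t + 1) and reports whether the iterates settle down or blow up.

def G(w):
    # Gradient of Error(w) = 3w^4 - 4w^3 - 12w^2 + 50.
    return 12 * w**3 - 12 * w**2 - 24 * w

def run_gd(w0, steps=10000, blow_up=1e8):
    # Gradient descent with learning rate 1/(t+1); returns None on blow-up.
    w = w0
    for t in range(steps):
        w = w - G(w) / (t + 1)
        if abs(w) > blow_up:
            return None
    return w

for w0 in (-1.5, -1.0, -0.1, 0.0, 0.1, 2.0, 3.0):
    w = run_gd(w0)
    print(f"w0 = {w0:5.2f} -> " + ("diverged" if w is None else f"w = {w:.4f}"))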
3c. G2(w) = ∇w (12w^3 − 12w^2 − 24w) = 36w^2 − 24w − 24. This second derivative lets us determine
whether a given local optimum is a local maximum or a local minimum. Observe that G2(−1) and
G2(2) are positive, while G2(0) is negative, implying that Error achieves local minima at −1 and
2, and a local maximum at 0. In the unusual but plausible event that we initialise the procedure
at a local maximum, note that the gradient will be 0, and thus we would have converged. Knowing
the second derivative would inform us whether we are indeed at a local minimum, or whether we
can do better by starting with a small perturbation of the initial point.
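For completeness, here is a short check of this classification (again just an illustration, not part of the expected answer).

def G(w):
    return 12 * w**3 - 12 * w**2 - 24 * w      # first derivative of Error

def G2(w):
    return 36 * w**2 - 24 * w - 24             # second derivative of Error

for w in (-1.0, 0.0, 2.0):                     # the critical points of Error
    kind = "local minimum" if G2(w) > 0 else "local maximum"
    print(f"w = {w:4.1f}: G(w) = {G(w):6.1f}, G2(w) = {G2(w):6.1f} -> {kind}")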
4. The labels assigned by T1 to the eight possible tuples are tabulated below.

x1  x2  x3  |  y
 0   0   0  |  1
 0   0   1  |  1
 0   1   0  |  0
 0   1   1  |  0
 1   0   0  |  0
 1   0   1  |  1
 1   1   0  |  0
 1   1   1  |  1
We are to replicate the same labelling with T2, which has x2 at its root. Consider the left subtree
of T2, containing all points with x2 = 0. Since these four points do not all carry the same label,
the subtree cannot be a single leaf; nor can it consist of just one internal node, since the labels of
the four points cannot all be predicted correctly based on either x1 or x3 alone. Hence the left
subtree needs at least two internal nodes. For the same reason, the right subtree also needs at
least two internal nodes. All four trees shown below use exactly two internal nodes in each of the
left and right subtrees, and they classify all the data points exactly as T1 does. Any one of them
can be provided as the answer.
Possibility 1 for T2 (all paths of the tree shown; on each path, x −b→ x′ means that the b-branch
out of node x leads to x′, and the final digit is the leaf label).
• x2 −0→ x1 −0→ 1.
• x2 −0→ x1 −1→ x3 −0→ 0.
• x2 −0→ x1 −1→ x3 −1→ 1.
• x2 −1→ x1 −0→ 0.
• x2 −1→ x1 −1→ x3 −0→ 0.
• x2 −1→ x1 −1→ x3 −1→ 1.
Possibility 2.
• x2 −0→ x1 −0→ 1.
• x2 −0→ x1 −1→ x3 −0→ 0.
• x2 −0→ x1 −1→ x3 −1→ 1.
• x2 −1→ x3 −0→ 0.
• x2 −1→ x3 −1→ x1 −0→ 0.
• x2 −1→ x3 −1→ x1 −1→ 1.
Possibility 3.
• x2 −0→ x3 −0→ x1 −0→ 1.
• x2 −0→ x3 −0→ x1 −1→ 0.
• x2 −0→ x3 −1→ 1.
• x2 −1→ x1 −0→ 0.
• x2 −1→ x1 −1→ x3 −0→ 0.
• x2 −1→ x1 −1→ x3 −1→ 1.
Possibility 4.
• x2 −0→ x3 −0→ x1 −0→ 1.
• x2 −0→ x3 −0→ x1 −1→ 0.
• x2 −0→ x3 −1→ 1.
• x2 −1→ x3 −0→ 0.
• x2 −1→ x3 −1→ x1 −0→ 0.
• x2 −1→ x3 −1→ x1 −1→ 1.
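Finally, any candidate T2 can be verified mechanically by enumerating all eight tuples and comparing labels with those of T1. The sketch below (our own, with Possibility 1 hard-coded as nested conditionals) performs this check.

from itertools import product

def t1(x1, x2, x3):
    # Labels encoded by T1: root x1, then x2 on the 0-branch and x3 on the 1-branch.
    if x1 == 0:
        return 1 if x2 == 0 else 0
    return 0 if x3 == 0 else 1

def t2(x1, x2, x3):
    # Possibility 1 for T2: root x2, with x1 below it and x3 where needed.
    if x2 == 0:
        if x1 == 0:
            return 1
        return 0 if x3 == 0 else 1
    if x1 == 0:
        return 0
    return 0 if x3 == 0 else 1

assert all(t1(*x) == t2(*x) for x in product((0, 1), repeat=3))
print("Possibility 1 for T2 agrees with T1 on all 8 tuples.")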