HW2 Solution
Homework 2
Homework 2 is due on Canvas by Thursday November 10 at 23:59. The total number of points is 40.
1 Scoring

We consider scoring mechanisms that assign to a data point x ∈ Rn the score

$$f(x) = u^T x + v.$$

Without loss of generality, and in order to compare different scoring mechanisms, we may assume that the vector u is unit-norm (∥u∥2 = 1) and that the scores are centered, that is,

$$\sum_{i=1}^{m} f(x_i) = 0.$$
1. (2 points) Show that the centering requirement implies that v can be expressed as a function of u, which you will determine. Interpret the resulting scoring mechanism in terms of the centered data points xi − x̂, i = 1, . . . , m, where

   $$\hat{x} := \frac{1}{m} \sum_{i=1}^{m} x_i$$

   is the center of the data points.
Solution: The condition that the scores are centered implies that

$$0 = \sum_{i=1}^{m} \left( u^T x_i + v \right) = u^T \sum_{i=1}^{m} x_i + m v.$$

The zero-mean condition thus implies v = −uT x̂, where x̂ ∈ Rn is the center (sample average) of the data points defined above. The score can then be written as f(x) = uT(x − x̂); that is, the mechanism scores the centered data points xi − x̂.
2. (2 points) Interpret the scoring formula above as a projection on a line, which you will determine in terms of u.

Solution: From part 1, f(x) = uT(x − x̂) with ∥u∥2 = 1, so f(x) is the signed length of the projection of the centered point x − x̂ on the line {αu : α ∈ R} with (unit) direction u.
3. (6 points) Consider a data set of your choice (some possible choices are provided on Canvas), and try different vectors u (do not forget to normalize them):
   • Random vectors
   • All ones (normalized)
   • Any other choice
Look at the spread of the scores, as measured by their variance. What do you observe? Which vector u would you choose? Comment.

Solution: The choice of data set and the discussion of the results are up to the student. In general, the larger the spread of the scores, the better the visualization, so one should prefer a direction u along which the data varies the most.
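For illustration, here is a minimal NumPy sketch of the experiment; the synthetic data matrix and the candidate directions are arbitrary assumptions (the actual data sets on Canvas are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: m = 200 points in R^5, stretched along the first coordinate.
X = rng.normal(size=(200, 5)) @ np.diag([3.0, 1.0, 1.0, 0.5, 0.2])

def scores(X, u):
    u = u / np.linalg.norm(u)        # enforce the unit-norm assumption on u
    v = -u @ X.mean(axis=0)          # centering requirement: v = -u^T xhat
    return X @ u + v                 # f(x_i) = u^T x_i + v

candidates = {
    "random": rng.normal(size=5),
    "all ones": np.ones(5),
    "top eigenvector of covariance": np.linalg.eigh(np.cov(X.T))[1][:, -1],
}
for name, u in candidates.items():
    print(f"{name:32s} score variance: {scores(X, u).var():.3f}")
```

Since the variance of the centered scores equals uT Σ u, where Σ is the data covariance, the spread is maximized by the leading eigenvector of Σ (the first principal component), which is typically the preferred choice.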
2 Clustering
In clustering problems, we are given data points xi ∈ Rn, i = 1, . . . , m. We seek to assign each point to a cluster of points. The so-called k-means algorithm is one of the most widely used clustering methods. It is based on choosing a number of clusters, k (< m), and minimizing the average squared Euclidean distance from the data points to their closest cluster “representative”. The objective function to minimize is thus

$$J^{\mathrm{clust}} := \min_{c_1, \dots, c_k} \sum_{i=1}^{m} \min_{1 \le j \le k} \|x_i - c_j\|_2^2.$$

Each cj ∈ Rn is the “representative” point for the j-th cluster, denoted Cj. Note that the inner minimum expresses the assignment of each point xi to its closest cluster representative, so that the problem amounts to minimizing the sum of those squared distances.
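As a quick illustration of the objective, the following NumPy sketch evaluates J^clust for fixed candidate representatives (the function name and array layout are my own choices):

```python
import numpy as np

def j_clust(X, C):
    """Clustering objective for fixed centers.

    X: (m, n) array of data points, one per row.
    C: (k, n) array of candidate representatives, one per row.
    """
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (m, k) squared distances
    return d2.min(axis=1).sum()  # each point contributes the distance to its closest center
```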
1. (4 points) Show that the problem can be written as one involving two matrix variables C, U:

   $$\min_{C, U} \; \sum_{i=1}^{m} \Big\| x_i - \sum_{j=1}^{k} u_{ij} c_j \Big\|_2^2 \;:\; \sum_{j=1}^{k} u_{ij} = 1, \; 1 \le i \le m, \quad u_{ij} \in \{0, 1\}, \; 1 \le i \le m, \; 1 \le j \le k.$$

   In the above, the n × k matrix C has columns cj, 1 ≤ j ≤ k, the center representatives; you are asked to explain why the Boolean m × k matrix U with entries uij, 1 ≤ i ≤ m, 1 ≤ j ≤ k, is referred to as an assignment matrix. (The term “Boolean” refers to the fact that the entries of the matrix are either 0 or 1, after George Boole, a 19th-century contributor to the theory of logic.) Hint: show that, for a given point x ∈ Rn, we have A(x) = B(x), where

   $$A(x) := \min_{1 \le j \le k} \|x - c_j\|_2^2, \qquad B(x) := \min_{u \in \mathcal{U}} F(x, u), \qquad F(x, u) := \Big\| x - \sum_{j=1}^{k} u_j c_j \Big\|_2^2, \qquad \mathcal{U} := \{ u \in \{0,1\}^k : u^T \mathbf{1}_k = 1 \}.$$
Solution: The result in the hint is proven as follows. Any u ∈ {0, 1}k that satisfies uT 1k = 1 must be one of the unit vectors in Rk; denoting those as ej, j = 1, . . . , k, we thus obtain the desired equality:

$$B(x) = \min_{u \in \mathcal{U}} F(x, u) = \min_{1 \le j \le k} F(x, e_j) = A(x).$$

The Boolean variables uij specify which data point is assigned to which center, with exactly one center per data point. The result follows directly from the hint, by summing over all the data points.
2. (4 points) Show that in turn, the above problem is equivalent to finding an (approximate) factorization of the data matrix X into a product of two matrices with specific properties. Make sure to express the above problem in terms of matrices, matrix norms, and matrix constraints. It will be convenient to use the notation 1s for the vector of ones in Rs, and B := {0, 1}m×k for the set of Boolean matrices in Rm×k.

Solution: With X := [x1, . . . , xm] ∈ Rn×m the matrix of data points, the above can be written as

$$\min_{C, U} \; \| X - C U^T \|_F^2 \;:\; C \in \mathbb{R}^{n \times k}, \; U \in \mathcal{B}, \; U \mathbf{1}_k = \mathbf{1}_m.$$

Indeed, we have, for every i = 1, . . . , m:

$$\sum_{j=1}^{k} u_{ij} c_j = \begin{bmatrix} c_1 & \dots & c_k \end{bmatrix} \begin{bmatrix} u_{i1} \\ \vdots \\ u_{ik} \end{bmatrix} = C U^T e_i,$$

as claimed.
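A small numerical sanity check of the identity Σj uij cj = C Uᵀ ei (the shapes and random values below are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 6, 2, 3
C = rng.normal(size=(n, k))      # columns are the representatives c_j
assign = rng.integers(0, k, size=m)
U = np.eye(k)[assign]            # m x k Boolean assignment matrix; each row sums to 1
i = 4
lhs = sum(U[i, j] * C[:, j] for j in range(k))   # sum_j u_ij c_j
rhs = C @ U.T @ np.eye(m)[i]                     # C U^T e_i
print(np.allclose(lhs, rhs))                     # True
```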
3. (3 points) One idea to solve the above problem is to alternate over the matrices C and U. We start with an initial point (C0, U0) and update the pair by minimizing J(C, U) over U with C fixed, and then over C with U fixed. Derive the solution of the C-step, that is, minimizing over C for fixed U. Express the result in terms of mj, the number of points assigned to cluster j, and Ij, the index set of points assigned to cluster Cj; then express your result in words. Hint: using the fact that the gradient of a differentiable function is zero at its minimum, show that the vector c which minimizes the sum of squared distances to given vectors y1, . . . , yL is their average, c = (1/L) Σl yl.

Solution: Applying the hint, we obtain the optimal center as the average of the points assigned to that cluster:

$$c_j = \frac{1}{m_j} \sum_{i \in I_j} x_i.$$
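A minimal NumPy sketch of the C-step just derived, assuming assignments are given as an integer vector and every cluster is non-empty:

```python
import numpy as np

def c_step(X, assign, k):
    # Optimal centers for fixed assignments: the mean of each cluster's points.
    # Assumes every cluster j has at least one point (m_j > 0).
    return np.array([X[assign == j].mean(axis=0) for j in range(k)])
```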
4. (3 points) Find the solution to the U-step, where we fix C and solve for the assignment matrix U. Express your result in words.

Solution: We now fix C, and need to solve, for every data point i = 1, . . . , m, the problem

$$\min_{u} \; \Big\| x_i - \sum_{j=1}^{k} u_j c_j \Big\|_2 \;:\; u^T \mathbf{1}_k = 1, \; u \in \{0, 1\}^k.$$

As shown in part 1, the solution is nothing else than the closest cluster representative to xi. This means that we are looking for the best assignment of a single data point to one of the clusters. Precisely, the optimal j is the index of the center closest to data point xi.
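Combining the two steps gives the usual alternating scheme. The sketch below is one possible implementation; the initialization from k random data points and the fixed iteration count are simple assumptions, and empty clusters are not handled:

```python
import numpy as np

def u_step(X, C):
    # Optimal assignments for fixed centers: index of the closest representative.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (m, k)
    return d2.argmin(axis=1)

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]  # start from k distinct data points
    for _ in range(iters):
        assign = u_step(X, C)                                          # U-step
        C = np.array([X[assign == j].mean(axis=0) for j in range(k)])  # C-step
    return C, assign
```

Each step can only decrease the objective J(C, U), so the iteration converges to a (local) minimum.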
3 Matrices
Solution: Let h := f ∘ g, where g : Rn → Rm and f : Rm → Rk. Write the Jacobian of g column-wise,

$$J_g(x) = \begin{bmatrix} \partial g / \partial x_1 & \partial g / \partial x_2 & \dots & \partial g / \partial x_n \end{bmatrix},$$

where

$$\partial f_i / \partial g = \begin{bmatrix} \partial f_i / \partial g_1 & \partial f_i / \partial g_2 & \dots & \partial f_i / \partial g_m \end{bmatrix}, \qquad \partial g / \partial x_j = \begin{bmatrix} \partial g_1 / \partial x_j \\ \partial g_2 / \partial x_j \\ \vdots \\ \partial g_m / \partial x_j \end{bmatrix}.$$

By the scalar chain rule, the (i, j) entry of the product Jf(g(x)) · Jg(x) is (∂fi/∂g)(∂g/∂xj) = ∂fi/∂xj. Therefore:

$$J_f(g(x)) \cdot J_g(x) = \begin{bmatrix} \partial f_1 / \partial x_1 & \partial f_1 / \partial x_2 & \dots & \partial f_1 / \partial x_n \\ \partial f_2 / \partial x_1 & \partial f_2 / \partial x_2 & \dots & \partial f_2 / \partial x_n \\ \vdots & \vdots & & \vdots \\ \partial f_k / \partial x_1 & \partial f_k / \partial x_2 & \dots & \partial f_k / \partial x_n \end{bmatrix} = J_h(x).$$
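The identity can be checked numerically with finite differences; the particular maps f and g below are arbitrary examples chosen for the check:

```python
import numpy as np

def jac(f, x, eps=1e-6):
    # Numerical Jacobian of f at x by forward differences.
    fx = np.asarray(f(x))
    J = np.empty((fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (np.asarray(f(xp)) - fx) / eps
    return J

g = lambda x: np.array([x[0] * x[1], np.sin(x[1]), x[0] ** 2])  # g: R^2 -> R^3
f = lambda y: np.array([y[0] + y[2], y[1] * y[2]])              # f: R^3 -> R^2
h = lambda x: f(g(x))                                           # h = f o g

x0 = np.array([0.3, -1.2])
# Chain rule: J_h(x0) should equal J_f(g(x0)) @ J_g(x0)
print(np.allclose(jac(h, x0), jac(f, g(x0)) @ jac(g, x0), atol=1e-4))
```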
3. (a) (2 points) Show that a square matrix is invertible if and only if its determinant is non-zero. You can use the fact that the determinant of a product is the product of the determinants, together with the QR decomposition of the matrix A.

Solution: Any real square matrix A can be decomposed as A = QR, where Q is orthogonal and R is upper triangular:

$$R = \begin{bmatrix} r_1 & * & \dots & * \\ 0 & r_2 & \dots & * \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & r_n \end{bmatrix}.$$

Since Q is invertible, A is invertible if and only if R is. In turn, R is invertible if and only if all the rows of R are linearly independent, which is equivalent to r1 ≠ 0, r2 ≠ 0, . . . , rn ≠ 0. Since

$$\det R = r_1 \cdots r_n,$$

this holds if and only if det R ≠ 0. Finally, det A = det Q · det R with det Q = ±1 (Q orthogonal), so A is invertible if and only if det A ≠ 0.
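A quick numerical check of the determinant argument (the random test matrix is an arbitrary assumption); since det Q = ±1, absolute values are compared:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Q, R = np.linalg.qr(A)
# |det A| = |det Q| * |det R| = product of |r_i|, since |det Q| = 1 for orthogonal Q
print(np.isclose(abs(np.linalg.det(A)), np.abs(np.diag(R)).prod()))
```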
(b) Recall the definition of the ℓ2-induced norm of a matrix M:

$$\|M\| := \max_{z \ne 0} \frac{\|M z\|_2}{\|z\|_2}.$$

Solution: The definition of the ℓ2-induced norm implies that for any matrix M, and any vector z of appropriate size,

$$\|M z\|_2 \le \|M\|_2 \cdot \|z\|_2.$$

Let x be a non-zero p-vector. We have
4 Hermitian product and projection of complex vectors on a line
We consider the Hermitian product of two complex vectors x, y ∈ Cn, defined as

$$\langle x, y \rangle := \bar{x}^T y = x^H y,$$

where z̄ is the conjugate of the complex vector z. The ordinary scalar product results when x, y are both real vectors. In this exercise, we explain why this choice makes sense from the point of view of projections. Precisely, we show that the projection z of a point p ∈ Cn on the line L(u) := {αu : α ∈ C}, where u ∈ Cn satisfies uH u = 1 without loss of generality, is given by z = (uuH)p = (uH p)u.
1. (2 points) As a preliminary result, show that for any real vector z, the minimum value of

   $$\|w\|_2^2 - 2 z^T w \qquad (1)$$

   over real vectors w is obtained for w = z. Hint: express the objective function of the above problem as the difference of two squared terms, the second one independent of w.

Solution: Let z be a given real vector. For any real vector w,

$$\|w\|_2^2 - 2 z^T w = \|w - z\|_2^2 - \|z\|_2^2 \ge -\|z\|_2^2,$$

and the lower bound is attained when w = z. This proves that the minimum value is −∥z∥22, obtained with w = z, as claimed.
2. (2 points) Show that the proposed formula for the projected vector is correct when u, p are real.

Solution: When u, p are real vectors, we have uH p = uT p and uuH = uuT, so the formula reduces to the familiar real projection z = (uT p)u onto the line spanned by the unit vector u, which proves the desired result.
3. (4 points) Show that the proposed formula is also correct in the complex case. That is, solve the problem

   $$\min_{\alpha \in \mathbb{C}} \; \|p - \alpha u\|_2$$

   and show that the optimal α is α∗ = uH p. Hint: optimize over the real and imaginary parts of α, and transform the problem into one of the form (1) involving two-dimensional real vectors; then apply the result of part 1.

Solution: Using the fact that uH u = 1, we have for any α = a + ib, writing uH p = c + id:

$$\|p - \alpha u\|_2^2 = \|p\|_2^2 - 2 \,\mathrm{Re}(\bar{\alpha}\, u^H p) + |\alpha|^2 = \|p\|_2^2 - 2(ac + bd) + (a^2 + b^2),$$

which, up to the constant ∥p∥22, is of the form (1) with w = (a, b) and z = (c, d). Applying the result of part 1, we obtain an optimal (a, b) as (a∗, b∗) = (c, d), which leads to an optimal α∗ = c + id = uH p, as claimed.
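A numerical check of the complex projection formula (the dimension and random vectors are arbitrary); note that np.vdot conjugates its first argument, so np.vdot(u, p) computes uH p:

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.normal(size=3) + 1j * rng.normal(size=3)
u /= np.linalg.norm(u)                       # enforce u^H u = 1
p = rng.normal(size=3) + 1j * rng.normal(size=3)

alpha_star = np.vdot(u, p)                   # u^H p
residual = lambda a: np.linalg.norm(p - a * u)
# alpha_star should beat every alpha on a complex grid (up to float tolerance)
grid = (a + 1j * b for a in np.linspace(-3, 3, 61) for b in np.linspace(-3, 3, 61))
print(all(residual(alpha_star) <= residual(a) + 1e-12 for a in grid))
```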
5 Convolutions
3. (2 points) Show that

   $$a * b = T(a)\, b = T(b)\, a,$$

   where T(a), T(b) are two appropriate matrices. Specify those matrices for the case n = 3, m = 4.
Solution: Suppose c = a ∗ b; then the entries of c are

$$\begin{aligned}
c_1 &= a_1 b_1 \\
c_2 &= a_1 b_2 + a_2 b_1 \\
c_3 &= a_1 b_3 + a_2 b_2 + a_3 b_1 \\
c_4 &= a_1 b_4 + a_2 b_3 + a_3 b_2 \\
c_5 &= a_2 b_4 + a_3 b_3 \\
c_6 &= a_3 b_4
\end{aligned}$$
Therefore we can write

$$c = \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \\ c_6 \end{bmatrix} = \begin{bmatrix} a_1 & 0 & 0 & 0 \\ a_2 & a_1 & 0 & 0 \\ a_3 & a_2 & a_1 & 0 \\ 0 & a_3 & a_2 & a_1 \\ 0 & 0 & a_3 & a_2 \\ 0 & 0 & 0 & a_3 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} = \begin{bmatrix} b_1 & 0 & 0 \\ b_2 & b_1 & 0 \\ b_3 & b_2 & b_1 \\ b_4 & b_3 & b_2 \\ 0 & b_4 & b_3 \\ 0 & 0 & b_4 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}.$$

Now we have a ∗ b = T(a)b = T(b)a, where

$$T(a) = \begin{bmatrix} a_1 & 0 & 0 & 0 \\ a_2 & a_1 & 0 & 0 \\ a_3 & a_2 & a_1 & 0 \\ 0 & a_3 & a_2 & a_1 \\ 0 & 0 & a_3 & a_2 \\ 0 & 0 & 0 & a_3 \end{bmatrix}, \qquad T(b) = \begin{bmatrix} b_1 & 0 & 0 \\ b_2 & b_1 & 0 \\ b_3 & b_2 & b_1 \\ b_4 & b_3 & b_2 \\ 0 & b_4 & b_3 \\ 0 & 0 & b_4 \end{bmatrix}.$$
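These are Toeplitz matrices, so the identity is easy to check numerically; the sketch below uses scipy.linalg.toeplitz, and the example vectors are arbitrary:

```python
import numpy as np
from scipy.linalg import toeplitz

a = np.array([1.0, 2.0, 3.0])            # n = 3
b = np.array([4.0, 5.0, 6.0, 7.0])       # m = 4

# T(a) is 6 x 4 and T(b) is 6 x 3: the first column holds the zero-padded vector,
# and the first row is zero except for its first entry.
Ta = toeplitz(np.r_[a, np.zeros(3)], np.r_[a[0], np.zeros(3)])
Tb = toeplitz(np.r_[b, np.zeros(2)], np.r_[b[0], np.zeros(2)])

c = np.convolve(a, b)                    # length n + m - 1 = 6
print(np.allclose(c, Ta @ b), np.allclose(c, Tb @ a))  # True True
```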
4. (4 points) A T-vector r gives the average daily rainfall in some region over a period of T days. The vector h gives the daily height of a river in the region. Using model fitting, it is found that the two vectors are related by h = g ∗ r, where

   g := (0.1, 0.4, 0.5, 0.2).

   (a) If one day there is a heavy rainfall, assuming uniform rainfall for all other days, how many days after that day is the river at maximum height?
   (b) How many days does it take for the river to return to 0 after the rain stops?

Solution: Notice that the convolution is commutative: h = g ∗ r = r ∗ g. Since g is a weighted sum of shifted “delta” sequences, similarly to part 2, h is a weighted sum of shifted copies of r. Since g has length 4, hi is a weighted sum of ri, ri−1, ri−2, ri−3, where the weights are 0.1, 0.4, 0.5, 0.2:

$$h_i = 0.1\, r_i + 0.4\, r_{i-1} + 0.5\, r_{i-2} + 0.2\, r_{i-3}.$$

From the equation above, we note that the river height is most heavily affected by the rainfall 2 days earlier, and is not affected by rainfall more than 3 days earlier. Thus the answer is 2 days for part (a) and 4 days for part (b).
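The reasoning can be confirmed with a short simulation, modeling the heavy-rain day as a unit spike over a zero baseline (the day index and horizon are arbitrary choices; a uniform baseline would shift heights but not the timing):

```python
import numpy as np

g = np.array([0.1, 0.4, 0.5, 0.2])
r = np.zeros(10)
r[2] = 1.0                                  # unit rainfall spike on day 2, zero otherwise
h = np.convolve(g, r)[:10]                  # river height over the same horizon

print(int(h.argmax()) - 2)                  # 2: peak two days after the rain
print(int(np.nonzero(h)[0].max()) - 2 + 1)  # 4: height is back to 0 on the 4th day after
```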