
Math 2050 Fall 2022

Homework 2

Homework 2 is due on Canvas by Thursday November 10 at 23:59. The total number of points is 40.

1 Visualizing data on a line

In this exercise, we examine how to visualize a high-dimensional data set of points $x_i \in \mathbb{R}^n$, $i = 1, \dots, m$, by computing and visualizing a single scalar, or score, associated to the data points. Specifically, the score associated to a generic data point $x \in \mathbb{R}^n$ is obtained via the linear formula
$$f(x) = u^T x + v.$$
Without loss of generality, and in order to compare different scoring mechanisms, we may assume that the vector $u$ is unit-norm ($\|u\|_2 = 1$) and that the scores are centered, that is,
$$\sum_{i=1}^m f(x_i) = 0.$$

1. (2 points) Show that the centering requirement implies that $v$ can be expressed as a function of $u$, which you will determine. Interpret the resulting scoring mechanism in terms of the centered data points $x_i - \hat{x}$, $i = 1, \dots, m$, where
$$\hat{x} := \frac{1}{m} \sum_{i=1}^m x_i$$
is the center of the data points.

Solution: The condition that the scores are centered implies that
$$0 = \sum_{i=1}^m (u^T x_i + v) = u^T \Big( \sum_{i=1}^m x_i \Big) + m v.$$
Hence $v = -u^T \hat{x}$, where $\hat{x} := \frac{1}{m} \sum_{i=1}^m x_i \in \mathbb{R}^n$ is the center of the data points. The resulting scoring mechanism acts on the centered data points: $f(x_i) = u^T (x_i - \hat{x})$.

2. (2 points) Interpret the scoring formula above as a projection on a line, which you will determine in terms of $u$.

Solution: We have $f(x) = u^T x + v = u^T x - u^T \hat{x} = u^T (x - \hat{x})$. Since $\|u\|_2 = 1$, this is the (signed) coordinate of the projection of $x$ onto the line through the center $\hat{x}$ with direction $u$: the projected point is $\hat{x} + u^T (x - \hat{x})\, u$.

3. (6 points) Consider a data set of your choice¹, and try different vectors $u$ (do not forget to normalize them):

• Random vectors
• All ones (normalized)
• Any other choice

Look at the spread of the scores, as measured by their variance. What do you observe? Which vector $u$ would you choose? Comment.
Solution: Students' choice of data and explanation of the results; the better the spread of the scores, the better the visualization.

¹ Some possible choices are provided on Canvas.
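For illustration, here is a minimal numpy sketch of the experiment; the data matrix X is randomly generated as a stand-in for one of the Canvas data sets, and the candidate vectors are illustrative.

```python
import numpy as np

# Stand-in for a data set: m points in R^n, stored as rows of X,
# with deliberately unequal spread across coordinates.
rng = np.random.default_rng(0)
m, n = 200, 10
X = rng.normal(size=(m, n)) @ np.diag(np.linspace(1.0, 3.0, n))

x_hat = X.mean(axis=0)        # center of the data points
Xc = X - x_hat                # centered data

def score_variance(u):
    u = u / np.linalg.norm(u)     # normalize so scores are comparable
    return (Xc @ u).var()         # variance of f(x_i) = u^T (x_i - x_hat)

candidates = {
    "random":          rng.normal(size=n),
    "all ones":        np.ones(n),
    "last coordinate": np.eye(n)[-1],   # the coordinate with the largest spread here
}
for name, u in candidates.items():
    print(f"{name:16s} score variance = {score_variance(u):.3f}")
```

A vector $u$ aligned with a high-variance direction of the data spreads the scores out the most, and hence gives the most informative one-dimensional picture.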

2 Clustering

In clustering problems, we are given data points $x_i \in \mathbb{R}^n$, $i = 1, \dots, m$. We seek to assign each point to a cluster of points. The so-called k-means algorithm is one of the most widely used clustering methods. It is based on choosing a number of clusters $k$ (with $k < m$) and minimizing the average squared Euclidean distance from the data points to their closest cluster "representative". The objective function to minimize is thus
$$J^{\mathrm{clust}} := \min_{c_1, \dots, c_k} \sum_{i=1}^m \min_{1 \le j \le k} \|x_i - c_j\|_2^2.$$
Each $c_j \in \mathbb{R}^n$ is the "representative" point for the $j$-th cluster, denoted $C_j$. Note that each term inside the sum expresses the assignment of a specific point $x_i$ to its closest cluster representative, so that the problem amounts to minimizing the sum of those squared distances.
1. (4 points) Show that the problem can be written as one involving two matrix variables $C$, $U$:
$$\min_{C, U} \; \sum_{i=1}^m \Big\| x_i - \sum_{j=1}^k u_{ij} c_j \Big\|_2^2 \;:\quad \sum_{j=1}^k u_{ij} = 1, \; 1 \le i \le m, \qquad u_{ij} \in \{0, 1\}, \; 1 \le i \le m, \; 1 \le j \le k.$$
In the above, the $n \times k$ matrix $C$ has columns $c_j$, $1 \le j \le k$, the center representatives; you are asked to explain why the Boolean² $m \times k$ matrix $U$ with entries $u_{ij}$, $1 \le i \le m$, $1 \le j \le k$, is referred to as an assignment matrix. Hint: show that, for a given point $x \in \mathbb{R}^n$, we have $A(x) = B(x)$, where
$$A(x) := \min_{1 \le j \le k} \|x - c_j\|_2^2, \qquad B(x) := \min_{u \in \,\mathcal{U}} F(x, u),$$
$$F(x, u) := \Big\| x - \sum_{j=1}^k u_j c_j \Big\|_2^2, \qquad \mathcal{U} := \big\{ u \in \{0, 1\}^k : u^T 1_k = 1 \big\}.$$

Solution: The result in the hint is proven as follows. Any $u \in \{0, 1\}^k$ that satisfies $u^T 1_k = 1$ must be one of the unit vectors in $\mathbb{R}^k$; denoting those as $e_j$, $j = 1, \dots, k$, we thus obtain the desired equality:
$$B(x) = \min_{u \in \,\mathcal{U}} F(x, u) = \min_{1 \le j \le k} F(x, e_j) = A(x).$$
The Boolean variables $u_{ij}$ specify which data point is assigned to which center, with exactly one center per data point. The result follows directly from the hint, by summation over all the data points.
2. (4 points) Show that, in turn, the above problem is equivalent to finding an (approximate) factorization of the data matrix $X$ into a product of two matrices with specific properties. Make sure to express the above problem in terms of matrices, matrix norms, and matrix constraints. It will be convenient to use the notation $1_s$ for the vector of ones in $\mathbb{R}^s$, and $\mathcal{B} := \{0, 1\}^{m \times k}$ for the set of Boolean matrices in $\mathbb{R}^{m \times k}$.
Solution: With $X \in \mathbb{R}^{n \times m}$ the matrix whose columns are the data points $x_i$, the above can be written as
$$\min_{C, U} \; \|X - C U^T\|_F^2 \;:\quad C \in \mathbb{R}^{n \times k}, \; U 1_k = 1_m, \; U \in \mathcal{B}.$$

² The term refers to the fact that the entries of the matrix are either 0 or 1; the name honors George Boole, a 19th-century contributor to the theory of logic.

Indeed, we have, for every $i = 1, \dots, m$:
$$\sum_{j=1}^k u_{ij} c_j = \begin{pmatrix} c_1 & \dots & c_k \end{pmatrix} \begin{pmatrix} u_{i1} \\ \vdots \\ u_{ik} \end{pmatrix} = C U^T e_i,$$
where $e_i$ is the $i$-th unit vector in $\mathbb{R}^m$. This shows that
$$\sum_{i=1}^m \Big\| x_i - \sum_{j=1}^k u_{ij} c_j \Big\|_2^2 = \sum_{i=1}^m \big\| (X - C U^T) e_i \big\|_2^2 = \|X - C U^T\|_F^2,$$
as claimed.
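As a quick sanity check, here is a small numpy sketch (randomly generated $X$ and $C$, and a random but valid assignment matrix $U$; all names are illustrative) confirming that the summed form and the Frobenius-norm form agree:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 5, 20, 3
X = rng.normal(size=(n, m))        # data matrix, columns are the points x_i
C = rng.normal(size=(n, k))        # columns are candidate representatives c_j
assign = rng.integers(k, size=m)   # one cluster index per data point
U = np.eye(k)[assign]              # m-by-k Boolean matrix, each row sums to 1

lhs = sum(np.linalg.norm(X[:, i] - C @ U[i]) ** 2 for i in range(m))
rhs = np.linalg.norm(X - C @ U.T, "fro") ** 2
print(np.isclose(lhs, rhs))        # True
```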
3. (3 points) One idea to solve the above problem is to alternate over the matrices $C$ and $U$. We start with an initial point $(C^0, U^0)$ and update the pair by minimizing $J(C, U)$ over $U$ with $C$ fixed, and then over $C$ with $U$ fixed. Derive the solution of the C-step, that is, minimizing over $C$ for fixed $U$. Express the result in terms of $m_j$, the number of points assigned to cluster $j$, and $I_j$, the index set of points assigned to cluster $C_j$; then express your result in words. Hint: using the fact that the gradient of a differentiable function is zero at its minimum, show that the vector $c$ which minimizes the sum of squared distances to given vectors $y_1, \dots, y_L$,
$$F(c) = \|y_1 - c\|_2^2 + \dots + \|y_L - c\|_2^2,$$
is the average of the vectors, $c^* = (1/L)(y_1 + \dots + y_L)$.

Solution: First we prove the result mentioned in the hint. The gradient of the function $F$ at a point $c$ is
$$\nabla F(c) = 2\,\big( (c - y_1) + \dots + (c - y_L) \big) = 2\,\big( L c - (y_1 + \dots + y_L) \big).$$
At the minimum the gradient is zero, which gives $c^* = (1/L)(y_1 + \dots + y_L)$.
In the clustering problem with $U$ fixed, the objective decouples across clusters: to find the representative of each cluster, we have to solve, for every $j = 1, \dots, k$:
$$\min_c \; \sum_{i \in I_j} \|x_i - c\|_2^2.$$
Applying the hint, we obtain the optimal center as the average of the points assigned to that cluster:
$$c_j = \frac{1}{m_j} \sum_{i \in I_j} x_i.$$
In words: each cluster representative is the mean of the points currently assigned to its cluster.

4. (3 points) Find the solution to the U-step, where we fix $C$ and solve for the assignment matrix $U$. Express your result in words.
Solution: We now fix $C$, and need to solve, for every data point $i = 1, \dots, m$, the problem
$$\min_u \; \Big\| x_i - \sum_{j=1}^k u_j c_j \Big\|_2^2 \;:\quad u^T 1_k = 1, \; u \in \{0, 1\}^k.$$
As shown in part 1, the optimum is attained at a unit vector $u = e_j$, where $c_j$ is the cluster representative closest to $x_i$. In words: each data point is assigned to its closest cluster representative.
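Putting the two steps together gives the k-means iteration. Below is a minimal numpy sketch of the alternating scheme (illustrative only: no careful initialization, a fixed iteration count, and empty clusters are simply skipped):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Alternate the U-step and C-step on J(C, U); columns of X are the points x_i."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    C = X[:, rng.choice(m, size=k, replace=False)]   # initialize centers from the data
    for _ in range(n_iter):
        # U-step: assign each point to its closest representative.
        d2 = ((X[:, :, None] - C[:, None, :]) ** 2).sum(axis=0)   # m-by-k squared distances
        assign = d2.argmin(axis=1)
        # C-step: each representative becomes the mean of its assigned points.
        for j in range(k):
            if (assign == j).any():
                C[:, j] = X[:, assign == j].mean(axis=1)
    return C, assign
```

Each step can only decrease the objective $J$, so the iteration converges, though to a local minimum that depends on the initialization.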

3 Matrices

1. (2 points) Let $f : \mathbb{R}^m \to \mathbb{R}^k$ and $g : \mathbb{R}^n \to \mathbb{R}^m$ be two maps. Let $h : \mathbb{R}^n \to \mathbb{R}^k$ be the composite map $h = f \circ g$, with values $h(x) = f(g(x))$ for $x$ in $\mathbb{R}^n$. Show that the derivatives of $h$ can be expressed via a matrix-matrix product, as $J_h(x) = J_f(g(x)) \cdot J_g(x)$, where the Jacobian matrix of $h$ at $x$ is defined as the matrix $J_h(x)$ with $(i, j)$ element $\partial h_i / \partial x_j\,(x)$.
Solution: Here $J_f(g(x))$ is the $k \times m$ matrix with rows
$$\partial f_i / \partial g = \big[ \partial f_i / \partial g_1, \; \partial f_i / \partial g_2, \; \dots, \; \partial f_i / \partial g_m \big], \quad i = 1, \dots, k,$$
and $J_g(x)$ is the $m \times n$ matrix with columns
$$\partial g / \partial x_j = \big[ \partial g_1 / \partial x_j, \; \partial g_2 / \partial x_j, \; \dots, \; \partial g_m / \partial x_j \big]^T, \quad j = 1, \dots, n.$$
By the chain rule, the $(i, j)$ element of $J_h(x)$ is
$$\frac{\partial h_i}{\partial x_j}(x) = \sum_{l=1}^m \frac{\partial f_i}{\partial g_l}(g(x)) \, \frac{\partial g_l}{\partial x_j}(x),$$
which is exactly the inner product of the $i$-th row of $J_f(g(x))$ with the $j$-th column of $J_g(x)$. Therefore
$$J_f(g(x)) \cdot J_g(x) = J_h(x).$$
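A quick finite-difference check of this identity (the maps f and g below are arbitrary illustrative choices):

```python
import numpy as np

def num_jacobian(F, x, eps=1e-6):
    """Forward-difference Jacobian of F at x, built column by column."""
    Fx = F(x)
    J = np.zeros((Fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (F(x + dx) - Fx) / eps
    return J

g = lambda x: np.array([x[0] * x[1], np.sin(x[2]), x[0] + x[2]])  # R^3 -> R^3
f = lambda y: np.array([y[0] ** 2 + y[1], y[1] * y[2]])           # R^3 -> R^2
h = lambda x: f(g(x))

x = np.array([0.5, -1.0, 0.3])
lhs = num_jacobian(h, x)
rhs = num_jacobian(f, g(x)) @ num_jacobian(g, x)
print(np.allclose(lhs, rhs, atol=1e-4))   # True, up to finite-difference error
```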

2. (2 points) A matrix $P$ in $\mathbb{R}^{n \times n}$ is a permutation matrix if it is obtained by permuting the columns of the $n \times n$ identity matrix. For an $n \times n$ matrix $A$, we consider the products $PA$ and $AP$. Describe in simple terms what these matrices look like with respect to the original matrix $A$.
Solution: The columns of $AP$ are the columns of $A$, permuted in the same way the columns of the identity were permuted to form $P$.
The rows of $PA$ are the rows of $A$, permuted in the same way the rows of the identity appear permuted in $P$ (which is the inverse of the column permutation).
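A small numpy demonstration of both facts (sigma here is an arbitrary example permutation):

```python
import numpy as np

A = np.arange(9.0).reshape(3, 3)
sigma = [2, 0, 1]                    # an example permutation of (0, 1, 2)
P = np.eye(3)[:, sigma]              # identity with columns permuted by sigma

print((A @ P == A[:, sigma]).all())           # AP: columns of A permuted by sigma
print((P @ A == A[np.argsort(sigma)]).all())  # PA: rows of A permuted by sigma's inverse
```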

3. (a) (2 points) Show that a square matrix is invertible if and only if its determinant is non-zero. You can use the fact that the determinant of a product is the product of the determinants, together with the QR decomposition of the matrix $A$.
Solution: Any real square matrix $A$ can be decomposed as
$$A = QR,$$
where $Q$ is an orthogonal matrix and $R$ is an upper-triangular matrix. Since $Q Q^T = I$, we have $\det(Q)^2 = 1$, so $|\det(Q)| = 1 \ne 0$ and $Q$ is invertible. Hence $A = QR$ is invertible if and only if $R$ is invertible; and since $\det A = \det Q \cdot \det R$ with $|\det Q| = 1$, we have $\det A \ne 0$ if and only if $\det R \ne 0$. $R$ has the form
$$R = \begin{pmatrix} r_1 & * & \dots & * \\ 0 & r_2 & \dots & * \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & r_n \end{pmatrix}.$$
$R$ is invertible if and only if all its rows are linearly independent which, working upward from the last row, is equivalent to $r_n \ne 0, r_{n-1} \ne 0, \dots, r_1 \ne 0$. Since
$$\det R = r_1 \cdots r_n,$$
this is in turn equivalent to $\det(R) \ne 0$. This proves the result.
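A numerical illustration of the two facts used in the proof (a random matrix; numpy's qr returns Q orthogonal and R upper-triangular):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
Q, R = np.linalg.qr(A)

# |det Q| = 1, and det R is the product of its diagonal entries,
# so |det A| = |r_1 * ... * r_n|.
print(np.isclose(abs(np.linalg.det(Q)), 1.0))
print(np.isclose(abs(np.linalg.det(A)), abs(np.prod(np.diag(R)))))
```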


(b) (2 points) Let $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, and let $C := AB \in \mathbb{R}^{m \times p}$. Show that $\|C\| \le \|A\| \cdot \|B\|$, where $\|\cdot\|$ denotes the $\ell_2$-induced norm of its matrix argument, defined for a matrix $M$ as
$$\|M\| := \max_{z \ne 0} \frac{\|M z\|_2}{\|z\|_2}.$$
Solution: The definition of the $\ell_2$-induced norm implies that for any matrix $M$ and any vector $z$ of appropriate size,
$$\|M z\|_2 \le \|M\| \cdot \|z\|_2.$$
Let $x$ be a non-zero $p$-vector. We have
$$\|A(Bx)\|_2 \le \|A\| \cdot \|Bx\|_2 \le \|A\| \cdot \|B\| \cdot \|x\|_2,$$
which shows that
$$\|C\| = \max_{x \ne 0} \frac{\|ABx\|_2}{\|x\|_2} \le \|A\| \cdot \|B\|,$$
as claimed.
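In numpy, the $\ell_2$-induced norm is the largest singular value, available as ord=2; a quick check of the inequality on random matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))

# l2-induced norm of a matrix = its largest singular value (ord=2).
print(np.linalg.norm(A @ B, 2) <= np.linalg.norm(A, 2) * np.linalg.norm(B, 2))  # True
```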

4 Hermitian product and projection of complex vectors on a line

In lecture we have defined the scalar product of two complex vectors $x, y \in \mathbb{C}^n$ as
$$x^H y = \bar{x}^T y = \sum_{i=1}^n \bar{x}(i)\, y(i),$$
where $\bar{z}$ is the conjugate of the complex vector $z$. The ordinary scalar product results when $x, y$ are both real vectors. In this exercise, we explain why this choice makes sense from the point of view of projections. Precisely, we show that the projection $z$ of a point $p \in \mathbb{C}^n$ on the line $L(u) := \{\alpha u : \alpha \in \mathbb{C}\}$, where $u \in \mathbb{C}^n$ satisfies $u^H u = 1$ without loss of generality, is given by $z = (u u^H) p = (u^H p)\, u$.

1. (2 points) As a preliminary result, show that for any real vector $z$, the minimum value of
$$\|w\|_2^2 - 2 z^T w \qquad (1)$$
over real vectors $w$ is obtained for $w = z$. Hint: express the objective function of the above problem as the difference of two squared terms, the second one independent of $w$.
Solution: Let $z$ be a given real vector. For any real vector $w$:
$$\|w\|_2^2 - 2 z^T w = \|w - z\|_2^2 - \|z\|_2^2 \ge -\|z\|_2^2,$$
and the lower bound is attained when $w = z$. This proves that the minimum value is $-\|z\|_2^2$, obtained with $w = z$, as claimed.

2. (2 points) Show that the proposed formula for the projected vector is correct when $u, p$ are real.
Solution: When $u, p$ are real, we minimize over $\alpha \in \mathbb{R}$ the quantity $\|p - \alpha u\|_2^2 = \|p\|_2^2 + \alpha^2 - 2 \alpha\, u^T p$ (using $u^T u = 1$). Up to the constant $\|p\|_2^2$, this is a scalar instance of problem (1) with $w = \alpha$ and $z = u^T p$, so the optimum is $\alpha^* = u^T p = u^H p$, which proves the desired result.

3. (4 points) Show that the proposed formula is also correct in the complex case. That is, solve the problem
$$\min_{\alpha \in \mathbb{C}} \|p - \alpha u\|_2$$
and show that the optimal $\alpha$ is $\alpha^* = u^H p$. Hint: optimize over the real and imaginary parts of $\alpha$, and transform the problem into one of the form (1) involving two-dimensional real vectors; then apply the result of part 1.
Solution: Using the fact that $u^H u = 1$, we have for any $\alpha$:
$$\|p - \alpha u\|_2^2 = (p - \alpha u)^H (p - \alpha u) = \|p\|_2^2 + |\alpha|^2 - (\bar{\alpha}\, u^H p + \alpha\, p^H u).$$
Define four real numbers such that $\alpha = a + ib$, $u^H p = c + id$. Then $\bar{\alpha}\, u^H p + \alpha\, p^H u = 2\,\mathrm{Re}(\bar{\alpha}\, u^H p) = 2(ac + bd)$, and we obtain
$$\|p - \alpha u\|_2^2 = \|p\|_2^2 + a^2 + b^2 - 2(ac + bd) = \|p\|_2^2 + \left\| \begin{pmatrix} a \\ b \end{pmatrix} \right\|_2^2 - 2 \begin{pmatrix} a \\ b \end{pmatrix}^T \begin{pmatrix} c \\ d \end{pmatrix}.$$
Applying the result of part 1, we obtain the optimal $(a, b)$ as $(a^*, b^*) = (c, d)$, which leads to the optimal $\alpha^* = c + id = u^H p$, as claimed.
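A small numpy check of the projection formula (random complex p and unit-norm u; np.vdot conjugates its first argument, so np.vdot(u, p) computes $u^H p$):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
u = rng.normal(size=n) + 1j * rng.normal(size=n)
u /= np.linalg.norm(u)                 # enforce u^H u = 1
p = rng.normal(size=n) + 1j * rng.normal(size=n)

alpha_star = np.vdot(u, p)             # u^H p
z = alpha_star * u                     # claimed projection (u^H p) u

# The residual p - z is orthogonal to the line: u^H (p - z) = 0,
# and no nearby alpha gives a smaller distance than alpha_star.
print(np.isclose(np.vdot(u, p - z), 0))
alphas = alpha_star + 0.1 * (rng.normal(size=100) + 1j * rng.normal(size=100))
print(all(np.linalg.norm(p - a * u) >= np.linalg.norm(p - z) for a in alphas))
```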

5 Convolutions

The convolution of an $n$-vector $a$ and an $m$-vector $b$ is the $(n + m - 1)$-vector $c = a * b$, with entries
$$c_k = \sum_{i + j = k + 1} a_i b_j, \qquad k = 1, \dots, n + m - 1.$$

1. (2 points) Express the coefficients of the product of two polynomials
$$p(x) = a_1 + a_2 x + \dots + a_n x^{n-1}, \qquad q(x) = b_1 + b_2 x + \dots + b_m x^{m-1},$$
in terms of an appropriate convolution.
Solution: Write the product as
$$g(x) = p(x) q(x) = c_1 + c_2 x + \dots + c_{n+m-1} x^{n+m-2}.$$
Matching the powers of $x$, we obtain
$$c_1 = a_1 b_1, \quad c_2 = a_1 b_2 + a_2 b_1, \quad c_3 = a_1 b_3 + a_2 b_2 + a_3 b_1, \quad \dots, \quad c_{n+m-1} = a_n b_m.$$
The vector of coefficients $c = (c_1, c_2, \dots, c_{n+m-1})$ is exactly given by $c = a * b$: the calculation of $c_k$ in the convolution sums exactly those products $a_i b_j$ that contribute to the same power $x^{k-1}$.
2. (2 points) Given a time-series $x \in \mathbb{R}^n$, the (4-point) moving average of $x$ is a new time-series $y$ such that, for every $i = 4, 5, \dots, n$, $y_i$ is the average of $x_i, x_{i-1}, x_{i-2}, x_{i-3}$. Express $y$ in terms of a convolution of $x$ with an appropriate vector.
Hint: think about time-series with only a single 1 in them.
Solution: Notice that the convolution of $x$ with the "delta" sequence $d = (1, 0, \dots)$, with a 1 in the first place and 0 elsewhere, is $x$ itself:
$$x = x * d.$$
The convolution with the shifted "delta" sequence $d' = (0, 1, 0, \dots)$ is $x' = x * d'$, with entries
$$x'_k = \sum_{i + j = k + 1} x_i d'_j = x_{k-1} d'_2 = x_{k-1}, \quad k = 2, \dots, n, \qquad x'_1 = 0,$$
that is, $x'$ is $x$ shifted (delayed) by one step. Similarly, $d'' = (0, 0, 1, 0, \dots)$ shifts $x$ by 2 (call the result $x''$), and $d''' = (0, 0, 0, 1, 0, \dots)$ shifts $x$ by 3 (call it $x'''$). By linearity of convolution, setting $e := (d + d' + d'' + d''')/4 = (1, 1, 1, 1)/4$, we get
$$y = x * e = (x + x' + x'' + x''')/4.$$
Thus $y_i = (x_i + x_{i-1} + x_{i-2} + x_{i-3})/4$, and $y$ is the (4-point) moving average of $x$.
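For instance, in numpy (np.convolve implements exactly this convolution; the first three entries of the full result are partial averages, consistent with y_i being defined only for i ≥ 4):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
e = np.ones(4) / 4            # e = (d + d' + d'' + d''')/4
y = np.convolve(x, e)         # length n + 4 - 1
print(y[3 : len(x)])          # y_4, ..., y_n: [2.5, 3.5, 4.5]
```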

3. (2 points) Show that
$$a * b = T(a)\, b = T(b)\, a,$$
where $T(a)$, $T(b)$ are two appropriate matrices. Specify those matrices for the case $n = 3$, $m = 4$.
Solution: Writing out the entries of $c = a * b$ for $n = 3$, $m = 4$:
$$c_1 = a_1 b_1, \quad c_2 = a_1 b_2 + a_2 b_1, \quad c_3 = a_1 b_3 + a_2 b_2 + a_3 b_1,$$
$$c_4 = a_1 b_4 + a_2 b_3 + a_3 b_2, \quad c_5 = a_2 b_4 + a_3 b_3, \quad c_6 = a_3 b_4.$$
Therefore $a * b = T(a)\, b = T(b)\, a$, where
$$T(a) = \begin{pmatrix} a_1 & 0 & 0 & 0 \\ a_2 & a_1 & 0 & 0 \\ a_3 & a_2 & a_1 & 0 \\ 0 & a_3 & a_2 & a_1 \\ 0 & 0 & a_3 & a_2 \\ 0 & 0 & 0 & a_3 \end{pmatrix}, \qquad T(b) = \begin{pmatrix} b_1 & 0 & 0 \\ b_2 & b_1 & 0 \\ b_3 & b_2 & b_1 \\ b_4 & b_3 & b_2 \\ 0 & b_4 & b_3 \\ 0 & 0 & b_4 \end{pmatrix}.$$
Both are Toeplitz matrices: each column repeats $a$ (respectively $b$), shifted down by one position per column.
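A quick numerical confirmation (the helper T below builds this Toeplitz matrix for any vector; names are illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])          # n = 3
b = np.array([4.0, 5.0, 6.0, 7.0])     # m = 4

def T(v, cols):
    """(len(v) + cols - 1) x cols Toeplitz matrix: v shifted down one row per column."""
    M = np.zeros((len(v) + cols - 1, cols))
    for j in range(cols):
        M[j : j + len(v), j] = v
    return M

c = np.convolve(a, b)
print(np.allclose(c, T(a, len(b)) @ b))   # a * b = T(a) b
print(np.allclose(c, T(b, len(a)) @ a))   # a * b = T(b) a
```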

4. (4 points) A $T$-vector $r$ gives the average daily rainfall in some region over a period of $T$ days. The vector $h$ gives the daily height of a river in the region. Using model fitting, it is found that the two vectors are related by $h = g * r$, where
$$g := (0.1, 0.4, 0.5, 0.2).$$
(a) If one day there is a heavy rainfall, assuming uniform rainfall for all other days, how many days after that day is the river at maximum height?
(b) How many days does it take for the river to return to 0 after rain stops?
Solution: Notice that the convolution is commutative: $h = g * r = r * g$. Since $g$ is a weighted sum of shifted "delta" sequences, similar to part 2, $h$ is a weighted sum of shifted copies of $r$. Since the length of $g$ is 4, $h_i$ is a weighted sum of $r_i, r_{i-1}, r_{i-2}, r_{i-3}$, with weights $0.1, 0.4, 0.5, 0.2$:
$$h_i = 0.1\, r_i + 0.4\, r_{i-1} + 0.5\, r_{i-2} + 0.2\, r_{i-3}.$$
From the equation above, the river height is most heavily affected by the rainfall 2 days earlier (weight 0.5), and is not affected by rainfall more than 3 days earlier. Thus the answer to part (a) is 2 days; and since the last rainy day still influences the height for 3 more days, the river returns to 0 on the 4th day after the rain stops, so the answer to part (b) is 4 days.
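This can be confirmed with a short simulation (an illustrative rainfall series with a single heavy-rain day):

```python
import numpy as np

g = np.array([0.1, 0.4, 0.5, 0.2])
r = np.zeros(10)
r[2] = 1.0                              # heavy rain on day index 2 only
h = np.convolve(g, r)[: len(r)]         # river height

print(np.argmax(h) - 2)                 # days from the rainy day to peak height: 2
print(np.nonzero(h)[0][-1] - 2 + 1)     # days after the rain until height is 0: 4
```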
