
CS 7015 - Deep Learning - Assignment 1 Your Name, Roll Number

Instructions:
• This assignment is meant to help you grok certain concepts we will use in the course.
Please don’t copy solutions from any sources.
• Avoid verbosity.
• Questions marked with * are relatively difficult. Don’t be discouraged if you cannot
solve them right away!
• The assignment needs to be written in LaTeX using the attached tex file. The solution for each question should be written in the solution block, in the space already provided in the tex file. Handwritten assignments will not be accepted.

1. Partial Derivatives

(a) Consider the following computation, where an input $x$ is mapped to $f(x)$ with

$$f(x) = \frac{1 + \tanh(wx + b)}{2}$$

and by definition $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$.

The value $L$ is given by

$$L = \frac{1}{2}(y - f(x))^2$$

Here, $x$ and $y$ are constants and $w$ and $b$ are parameters that can be modified. In other words, $L$ is a function of $w$ and $b$.

Derive the partial derivatives $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$.

Solution:
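Once you have derived closed-form expressions, a central finite-difference check is a standard way to validate them. The sketch below (the sample values for $x$, $y$, $w$, $b$ are arbitrary assumptions, and the function names are my own) produces numerical reference values for $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ to compare your answers against:

```python
import math

def f(x, w, b):
    # f(x) = (1 + tanh(wx + b)) / 2, as defined in the question
    return (1 + math.tanh(w * x + b)) / 2

def loss(w, b, x, y):
    # L = 0.5 * (y - f(x))^2
    return 0.5 * (y - f(x, w, b)) ** 2

def numeric_deriv(fn, p, h=1e-6):
    # central finite-difference estimate of d(fn)/dp
    return (fn(p + h) - fn(p - h)) / (2 * h)

# arbitrary sample values, purely for checking (assumptions)
x, y, w, b = 0.5, 1.0, 0.3, -0.2
dL_dw = numeric_deriv(lambda w_: loss(w_, b, x, y), w)
dL_db = numeric_deriv(lambda b_: loss(w, b_, x, y), b)
```

Plugging your symbolic derivatives into the same sample values should reproduce `dL_dw` and `dL_db` to several decimal places.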

(b) Consider the evaluation of $E$ as given below:

$$E = g(x, y, z) = \sigma(c(ax + by) + dz)$$

Here $x, y, z$ are inputs (constants) and $a, b, c, d$ are parameters (variables). $\sigma$ is the logistic sigmoid function defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Note that here $E$ is a function of $a, b, c, d$.
Compute the partial derivatives of $E$ with respect to the parameters $a$, $b$ and $d$, i.e. $\frac{\partial E}{\partial a}$, $\frac{\partial E}{\partial b}$ and $\frac{\partial E}{\partial d}$.
Solution:

2. Erroneous Estimates
The first order derivative of a real valued function f is defined by the following limit (if
it exists),
$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \tag{1}$$
On observing the above definition we see that the derivative of a function is the ratio of
change in the function value to the change in the function input, when we change the
input by a small quantity (infinitesimally small).
Consider the function $f(x) = x^2 - 2x + 1$.

(a) Using the limit definition of the derivative, show that $\frac{df(x)}{dx} = 2x - 2$.

Solution:

(b) The function evaluates to 0 at 1, i.e. $f(1) = 0$.

Say we wanted to estimate the values of $f(1.01)$ and $f(1.5)$ without using the definition of $f(x)$. We could think of using the definition of the derivative to "extrapolate" the value of $f(1)$ to obtain $f(1.01)$ and $f(1.5)$.
A first-order approximation around $x = 1$ would be the following:

$$f(x + h) \approx f(x) + h \frac{df(x)}{dx} \tag{2}$$

Estimate f (1.01) and f (1.5) using the above formula.

Solution:
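Equation (2) is straightforward to turn into code; the sketch below implements the first-order extrapolation (the function names are my own) so you can compare its output with your hand-worked estimates:

```python
def f(x):
    # f(x) = x^2 - 2x + 1
    return x**2 - 2*x + 1

def df(x):
    # derivative shown in part (a): 2x - 2
    return 2*x - 2

def first_order_estimate(x0, h):
    # f(x0 + h) ~ f(x0) + h * f'(x0)  -- equation (2)
    return f(x0) + h * df(x0)

# extrapolate from x0 = 1 to 1.01 and 1.5
est_101 = first_order_estimate(1.0, 0.01)
est_15 = first_order_estimate(1.0, 0.5)
```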

(c) Compare your estimates to the actual values $f(1.01) = 0.0001$ and $f(1.5) = 0.25$.

Solution:

(d) Explain the discrepancy from the actual value. Why does it increase/decrease when
we move further away from 1?

Solution:

(e) Can we get a better estimate of $f(1.01)$ and $f(1.5)$ by "correcting" our estimate from part (b)? Can you suggest a way of doing this?

Solution:

3. Differentiation w.r.t. Vectors and Matrices

Consider vectors $u, x \in \mathbb{R}^n$ and a matrix $A \in \mathbb{R}^{n \times n}$.
The gradient of a scalar function $f$ w.r.t. a vector $x$ is a vector by itself, given by

$$\nabla_x f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$

The gradient of a scalar function w.r.t. a matrix is a matrix:

$$\nabla_A f = \begin{pmatrix} \frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}} & \cdots & \frac{\partial f}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{n1}} & \frac{\partial f}{\partial A_{n2}} & \cdots & \frac{\partial f}{\partial A_{nn}} \end{pmatrix}$$

The gradient of the gradient of a function w.r.t. a vector is a matrix, referred to as the Hessian:

$$H_x f = \nabla_x^2 f = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$$

(a) Derive the expressions for the following gradients.

1. $\nabla_x\, u^T x$
2. $\nabla_x\, x^T x$
3. $\nabla_x\, x^T A x$
4. $\nabla_A\, x^T A x$
5. $\nabla_x^2\, x^T A x$

(Aside: compare your results with the derivatives of the scalar equivalents of the above expressions, $ax$ and $x^2$.)
(The gradient of a scalar $f$ w.r.t. a matrix $X$ is a matrix whose $(i, j)$ component is $\frac{\partial f}{\partial X_{ij}}$, where $X_{ij}$ is the $(i, j)$ component of the matrix $X$.)

Solution:
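A numerical gradient checker is handy for validating each expression you derive here. The helper below is a sketch (names and sample dimensions are my own); it estimates $\nabla_x$ of any scalar-valued function by central differences, so you can compare against your closed forms:

```python
import numpy as np

def numeric_grad_vec(g, x, h=1e-6):
    # central-difference estimate of the gradient of scalar function g at vector x
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (g(x + e) - g(x - e)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
n = 4
x = rng.standard_normal(n)
u = rng.standard_normal(n)
A = rng.standard_normal((n, n))

g1 = numeric_grad_vec(lambda v: u @ v, x)       # compare with your expression for grad_x u^T x
g2 = numeric_grad_vec(lambda v: v @ v, x)       # compare with your expression for grad_x x^T x
g3 = numeric_grad_vec(lambda v: v @ A @ v, x)   # compare with your expression for grad_x x^T A x
```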

(b) Use the equations obtained in the previous part to derive the linear regression solution that you studied in ML or PR. Take $X$ to be the input example-feature matrix, $Y$ the vector of given outputs, and $w$ the weight vector.

Solution:
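As a sanity check, the closed-form least-squares solution you studied in ML/PR (the normal equations) can be verified on synthetic data; the shapes and the noiseless setup below are assumptions chosen only for illustration:

```python
import numpy as np

# Synthetic data: X is an N x d example-feature matrix, Y a length-N output vector
rng = np.random.default_rng(1)
N, d = 50, 3
X = rng.standard_normal((N, d))
true_w = np.array([2.0, -1.0, 0.5])
Y = X @ true_w  # noiseless, so least squares should recover true_w exactly

# Normal-equations solution: w solves (X^T X) w = X^T Y
w = np.linalg.solve(X.T @ X, X.T @ Y)
```

On noiseless data the recovered `w` should match `true_w` up to floating-point error.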

(c) By now you must have the intuition: the gradient w.r.t. a 1-dimensional array is 1-dimensional, and the gradient w.r.t. a 2-dimensional array is 2-dimensional. Higher-order arrays are referred to as tensors. Let $T$ be a 3-dimensional tensor. Write the expression for $\nabla_T f$. You may use gradients w.r.t. a vector or a matrix in the expression.

Solution:

4. Ordered Derivatives
An ordered network is a network where the state variables can be computed one at a
time in a specified order.
Answer the following questions regarding such a network.
(a) Given the ordered network below, give a formula for calculating the ordered derivative $\frac{\partial y_4}{\partial y_1}$ in terms of partial derivatives w.r.t. $y_1$ and $y_2$, where $y_1$, $y_2$ and $y_3$ are the outputs of nodes 1, 2 and 3 respectively.

[Figure: ordered network with nodes 1–4; node 1 outputs $y_1$, node 2 outputs $y_2$, node 3 outputs $y_3$, and node 4 outputs $y_4$]

Solution:

(b) The figure above can be viewed as a dependency graph as it tells us which variables
in the system depend on which other variables. For example, we see that y3 depends
on y1 and y2 which in turn also depends on y1 . Now consider the network given
below,
[Figure: recurrent network unrolled over time with states $s_{-1}, s_0, s_1, s_2, s_3$ and inputs $x_1, x_2, x_3$; weight $W$ connects $s_{i-1}$ to $s_i$, $Y$ connects $s_{i-2}$ to $s_i$, and $U$ connects $x_i$ to $s_i$]

Here, $s_i = \sigma(W s_{i-1} + Y s_{i-2} + U x_i + b)$ for all $i \geq 1$.

Can you draw a dependency graph involving the variables s3 , s2 , s1 , W, Y ?

Solution:

(c) Give a formula for computing $\frac{\partial s_3}{\partial W}$, $\frac{\partial s_3}{\partial Y}$ and $\frac{\partial s_3}{\partial U}$ for the network shown in part (b).

Solution:

5. Baby Steps
From basic calculus, we know that we can find the minima (local and global) of a function by finding the first and second order derivatives: we set the first derivative to zero and verify that the second derivative at the same point is positive. The reasoning behind this procedure is based on the interpretation of the derivative of a function as its slope at any given point.
The above procedure, even though correct, can be intractable in practice when trying to minimize functions. And this is not just a problem for the multivariable case, but even for single-variable functions. Consider minimizing the function $f(x) = x^5 + 5\sin(x) + 10\tan(x)$. Although $f$ is a contrived example, the point is that the standard derivative approach might not always be a feasible way to find the minima of functions.
In this course, we will be routinely dealing with minimizing functions of multiple variables (in fact, millions of variables). Of course, we will not be solving them by hand, but we need a more efficient way of minimizing functions. For the sake of this problem, consider that we are trying to minimize a convex function of one variable, $f(x)$ [1], which is guaranteed to have a single minimum. We will now build an iterative approach to finding the minima of functions.
The high-level idea is the following:
Start at a (random) point $x_0$. Verify if we are at the minimum. If not, change the value so that we move closer to the minimum. Keep repeating until we hit the minimum.
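The high-level idea above, specialized to one variable with the derivative as the slope signal, can be sketched as follows (the step size, tolerance, iteration cap, and the example function are my own assumptions):

```python
def iterative_minimize(df, x0, lr=0.1, tol=1e-8, max_iter=10000):
    # Repeatedly step against the derivative's sign until it (nearly) vanishes.
    x = x0
    for _ in range(max_iter):
        g = df(x)
        if abs(g) < tol:   # derivative ~ 0: we are (near) the minimum
            break
        x = x - lr * g     # move opposite the slope, i.e. downhill
    return x

# Example: a convex function f(x) = (x - 3)^2, whose derivative is 2(x - 3)
x_min = iterative_minimize(lambda x: 2 * (x - 3), x0=0.0)
```

For this convex example the iterates contract geometrically toward the minimizer at $x = 3$.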
(a) Use the intuition built from Q.3 to find a way to change the current value of x while
still ensuring that we are improving (i.e. minimizing) the function.

Solution:

(b) How would you use the same idea if you had to minimize a function of multiple variables?

[1] https://en.wikipedia.org/wiki/Convex_function

Solution:

(c) Does your method always lead to the global minimum (smallest value) for non-convex functions (which may have multiple local minima)? If yes, can you explain (prove or argue) why? If not, can you give a concrete example of a case where it fails?

Solution:

(d) Do you think this procedure always works for convex functions? (i.e., are we always guaranteed to reach the minimum?)

Solution:

(e) (Extra) Can you think of the number of steps needed to reach the minimum?

Solution:

(f) (Extra) Can you think of ways to improve the number of steps needed to reach the minimum?

Solution:

6. Constrained Optimization
Let $f(x, y)$ and $g(x, y)$ be smooth (continuous, differentiable, etc.) real-valued functions of two variables. We want to minimize $f$, which is a convex function of $x$ and $y$.
(a) Argue that at the minima of $f$, the partial derivatives $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ will be zero. Thus setting the partial derivatives to zero is a possible method for finding the minima.

Solution:

(b) Suppose we are only interested in minimizing f in the region where g(x, y) = c,
where c is some constant. Suppose this region is a curve in the x-y plane. Call this
the feasible curve. Will our previous technique still work in this case? Why or why
not?

Solution:

(c) What is the component of ∇g along the feasible curve, when computed at points
lying on the curve?

Solution:

(d) * At the point on the feasible curve which achieves the minimum value of $f$, what will be the component of $\nabla f$ along the curve?

Solution:

(e) Using the previous answers, show that at the point on the feasible curve achieving the minimum value of $f$, $\nabla f = \lambda \nabla g$ for some real number $\lambda$. Thus, this equation, combined with the constraint $g(x, y) = c$, should enable us to find the minima.

Solution:
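To see the $\nabla f = \lambda \nabla g$ recipe in action on a separate toy problem (not part (f)), consider minimizing $f(x, y) = x^2 + y^2$ subject to $g(x, y) = x + y = 1$:

```latex
\nabla f = (2x,\ 2y), \qquad \nabla g = (1,\ 1)
\\
\nabla f = \lambda \nabla g \;\Rightarrow\; 2x = \lambda,\ 2y = \lambda \;\Rightarrow\; x = y
\\
x + y = 1 \;\Rightarrow\; x = y = \tfrac{1}{2}, \qquad f_{\min} = \tfrac{1}{2}
```

Geometrically, the constrained minimum sits where a level curve of $f$ is tangent to the feasible line, which is exactly the condition parts (c) and (d) point toward.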

(f) * Using the insights from the discussion so far, solve the following optimization problem:

$$\max_{x,y,z} \; x^a y^b z^c \quad \text{where} \quad x + y + z = 1$$

and given $a, b, c > 0$.

Solution:

7. Billions of Balloons
Consider a large playground filled with 1 billion balloons. Of these there are k1 blue, k2
green and k3 red balloons. The values of k1 , k2 and k3 are not known to you but you are
interested in estimating them. Of course, you cannot go over all the 1 billion balloons
and count the number of blue, green and red balloons. So you decide to randomly
sample 1000 balloons and note down the number of blue, green and red balloons. Let
these counts be k̂1 , k̂2 and k̂3 respectively. You then estimate the total number of blue,
green and red balloons as 1000000 ∗ k̂1 , 1000000 ∗ k̂2 and 1000000 ∗ k̂3 .
(a) Your friend knows the values of k1 , k2 and k3 and wants to see how bad your
estimates are compared to the true values. Can you suggest some ways of calculating
this difference? [Hint: Think about probability!]

Solution:
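One concrete (though by no means the only) way to quantify how far the estimated proportions are from the true ones is the KL divergence between the two distributions; all numbers below are made up purely for illustration:

```python
import math

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical true and estimated colour proportions (made-up numbers)
true_dist = [0.5, 0.3, 0.2]     # k1, k2, k3 as fractions of the billion balloons
est_dist = [0.52, 0.29, 0.19]   # proportions from the 1000-ball sample

gap = kl_divergence(true_dist, est_dist)
```

KL divergence is zero exactly when the two distributions agree and grows as they diverge, which is the kind of behaviour a "badness of estimate" score should have.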

(b) * Consider two ways of converting $\hat{k}_1$, $\hat{k}_2$ and $\hat{k}_3$ to a probability distribution:

$$p_i = \frac{\hat{k}_i}{\sum_i \hat{k}_i}$$

$$q_i = \frac{e^{\hat{k}_i}}{\sum_i e^{\hat{k}_i}}$$

Would you prefer the distribution q = [q1 , q2 , ..., qn ] over p = [p1 , p2 , ..., pn ] for the
above task? Give reasons and provide an example to support your choice.

Solution:

8. ** Let $X$ be a real-valued random variable with $p$ as its probability density function (PDF). We define the cumulative distribution function (CDF) of $X$ as

$$F(x) = \Pr(X \leq x) = \int_{y=-\infty}^{y=x} p(y)\, dy$$

What is the value of $\mathbb{E}_X[F(X)]$ (the expected value of the CDF of $X$)? The answer is a real number. (Hint: The expectation can be formulated as a double integral. Try to plot the area over which you need to integrate in the x-y plane. Now look at the area over which you are not integrating. Do you notice any symmetries?)

Solution:

9. * Intuitive Urns
An urn initially contains 3 red balls and 3 blue balls. One of the balls is removed without
being observed. To find out the color of the removed ball, Alice and Bob independently
perform the same experiment: they randomly draw a ball, record the color, and put it
back. This is repeated several times and the number of red and blue balls observed by
each of them is recorded.
Alice draws 6 times and observes 6 red balls and 0 blue balls.
Bob draws 600 times and observes 303 red balls and 297 blue balls.
Obviously, both of them will predict that the removed ball was blue.
(a) Intuitively, who do you think has stronger evidence for claiming that the removed
ball was blue, and why? (Don’t cheat by computing the answer. This
subquestion has no marks, but is compulsory!)

Solution:

(b) What is the exact probability that the removed ball was blue, given Alice’s obser-
vations? (Hint: Think Bayesian Probability)

Solution:

(c) What is the exact probability that the removed ball was blue, given Bob’s observa-
tions? (Hint: Think Bayesian Probability)

Solution:

(d) Computationally, who do you think has stronger evidence for claiming that the
removed ball was blue?

Solution:

Did your intuition match up with the computations? If yes, awesome! If not, remember that probability can often seem deceptively straightforward. Try to avoid relying on intuition when dealing with probability by grounding it in formalism.

10. Plotting Functions for Great Good


(a) Consider the variable $x$ and functions $h_{11}(x)$, $h_{12}(x)$ and $h_{21}(x)$ such that

$$h_{11}(x) = \frac{1}{1 + e^{-(400x + 24)}}$$

$$h_{12}(x) = \frac{1}{1 + e^{-(400x - 24)}}$$

$$h_{21}(x) = h_{11}(x) - h_{12}(x)$$

The above set of functions is summarized in the graph below.

Plot the following functions: $h_{11}(x)$, $h_{12}(x)$ and $h_{21}(x)$ for $x \in (-1, 1)$.

Solution:
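A minimal plotting sketch (matplotlib is assumed available; the headless backend and output filename are my own choices):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-1, 1, 1000)
h11 = sigmoid(400 * x + 24)
h12 = sigmoid(400 * x - 24)
h21 = h11 - h12

plt.plot(x, h11, label="h11")
plt.plot(x, h12, label="h12")
plt.plot(x, h21, label="h21")
plt.legend()
plt.savefig("q10a.png")  # h21 should look like a narrow "bump" around x = 0
```

The steep slopes (factor 400) make each sigmoid an almost-step function, so their difference is nearly an indicator of a small interval around zero.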

(b) Now consider the variables $x_1, x_2$ and the functions $h_{11}(x_1, x_2)$, $h_{12}(x_1, x_2)$, $h_{13}(x_1, x_2)$, $h_{14}(x_1, x_2)$, $h_{21}(x_1, x_2)$, $h_{22}(x_1, x_2)$, $h_{31}(x_1, x_2)$ and $f(x_1, x_2)$ such that

$$h_{11}(x_1, x_2) = \frac{1}{1 + e^{-(x_1 + 100x_2 + 200)}}$$

$$h_{12}(x_1, x_2) = \frac{1}{1 + e^{-(x_1 + 100x_2 - 200)}}$$

$$h_{13}(x_1, x_2) = \frac{1}{1 + e^{-(100x_1 + x_2 + 200)}}$$

$$h_{14}(x_1, x_2) = \frac{1}{1 + e^{-(100x_1 + x_2 - 200)}}$$

$$h_{21}(x_1, x_2) = h_{11}(x_1, x_2) - h_{12}(x_1, x_2)$$

$$h_{22}(x_1, x_2) = h_{13}(x_1, x_2) - h_{14}(x_1, x_2)$$

$$h_{31}(x_1, x_2) = h_{21}(x_1, x_2) + h_{22}(x_1, x_2)$$

$$f(x_1, x_2) = \frac{1}{1 + e^{-(50 h_{31}(x_1, x_2) - 100)}}$$

The above set of functions is summarized in the graph below.

Plot the following functions: $h_{11}(x_1, x_2)$, $h_{12}(x_1, x_2)$, $h_{13}(x_1, x_2)$, $h_{14}(x_1, x_2)$, $h_{21}(x_1, x_2)$, $h_{22}(x_1, x_2)$, $h_{31}(x_1, x_2)$ and $f(x_1, x_2)$ for $x_1 \in (-5, 5)$ and $x_2 \in (-5, 5)$.

Solution:

