CS6910 Tutorial 1
Instructions:
• This assignment is meant to help you grok certain concepts we will use in the course.
Please don’t copy solutions from any sources.
• Avoid verbosity.
• Questions marked with * are relatively difficult. Don’t be discouraged if you cannot
solve them right away!
• The assignment must be written in LaTeX using the attached tex file. The solution
for each question should be written in the solution block, in the space already provided
in the tex file. Handwritten assignments will not be accepted.
1. Partial Derivatives
Consider the function

f(x) = (1 + tanh((wx + b)/2)) / 2

and by definition: tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
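While deriving the partial derivatives of f, a central-difference approximation makes a useful sanity check against any closed form you obtain (a minimal sketch; the point x = 2.0, w = 1.5, b = 0.5 and the step size h are illustrative assumptions, not part of the question):

```python
import math

def f(x, w, b):
    # f(x) = (1 + tanh((w*x + b)/2)) / 2
    return (1 + math.tanh((w * x + b) / 2)) / 2

def central_diff(g, t, h=1e-6):
    # symmetric difference quotient: (g(t+h) - g(t-h)) / (2h)
    return (g(t + h) - g(t - h)) / (2 * h)

# illustrative evaluation point (assumed values)
x, w, b = 2.0, 1.5, 0.5

df_dw = central_diff(lambda w_: f(x, w_, b), w)
df_db = central_diff(lambda b_: f(x, w, b_), b)
print(df_dw, df_db)
```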
Solution:
2. Erroneous Estimates
The first-order derivative of a real-valued function f is defined by the following limit (if
it exists):

df(x)/dx = lim_(h→0) (f(x + h) − f(x)) / h        (1)
On observing the above definition we see that the derivative of a function is the ratio of
change in the function value to the change in the function input, when we change the
input by a small quantity (infinitesimally small).
Consider the function f (x) = x2 − 2x + 1.
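The limit can be watched numerically: for f(x) = x² − 2x + 1, the forward-difference quotient approaches 2x − 2 as h shrinks (a minimal sketch; the evaluation point x = 3 is an arbitrary choice):

```python
def f(x):
    return x**2 - 2*x + 1

x = 3.0  # arbitrary evaluation point
for h in [1e-1, 1e-2, 1e-4, 1e-6]:
    # forward-difference quotient from the limit definition (1)
    print(h, (f(x + h) - f(x)) / h)
# the quotient tends to 2x - 2 = 4 as h -> 0
```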
df (x)
(a) Using the limit definition of derivative, show that the derivative of f (x) is dx
=
2x − 2.
Solution:
(b) Using the first-order approximation

f(x + h) ≈ f(x) + h · df(x)/dx        (2)

estimate f(1.01) and f(1.5) starting from x = 1.
Solution:
(c) Compare your estimates to the actual values f(1.01) = 0.0001 and f(1.5) = 0.25.
Solution:
(d) Explain the discrepancy from the actual value. Why does it increase/decrease when
we move further away from 1?
Solution:
(e) Can we get a better estimate of f(1.01) and f(1.5) by “correcting” our estimate
from part (b)? Can you suggest a way of doing this?
Solution:
Solution:
(b) Use the equations obtained in the previous part to derive the linear regression solution
that you studied in ML or PR. Let X be the input example-feature matrix, Y the vector
of given outputs, and w the weight vector.
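As a concrete check, the closed-form least-squares solution can be compared against a library solver on synthetic data (a minimal sketch; the shapes, the true weights, and the noise level are all arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # example-feature matrix
w_true = np.array([1.0, -2.0, 0.5])           # assumed ground-truth weights
Y = X @ w_true + 0.01 * rng.normal(size=50)   # noisy targets

# normal-equation solution: w = (X^T X)^{-1} X^T Y
w_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# library least-squares solver for comparison
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(w_closed, w_lstsq)
```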
Solution:
(c) By now you must have the intuition: the gradient w.r.t. a 1-dimensional array was
1-dimensional, and the gradient w.r.t. a 2-dimensional array was 2-dimensional. Higher-
order arrays are referred to as tensors. Let T be a 3-dimensional tensor. Write
the expression for ∇T f. You may use gradients w.r.t. a vector or a matrix in the
expression.
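The shape pattern can be confirmed numerically: the finite-difference gradient of a scalar function of a 3-dimensional array has the same shape as the array itself (a minimal sketch; the scalar function f(T) = Σ T² is an illustrative assumption):

```python
import numpy as np

def f(T):
    # an illustrative scalar function of a 3-D tensor
    return np.sum(T ** 2)

def num_grad(f, T, h=1e-6):
    # central differences, perturbing one entry of T at a time
    G = np.zeros_like(T)
    for idx in np.ndindex(T.shape):
        Tp, Tm = T.copy(), T.copy()
        Tp[idx] += h
        Tm[idx] -= h
        G[idx] = (f(Tp) - f(Tm)) / (2 * h)
    return G

T = np.arange(24, dtype=float).reshape(2, 3, 4)
G = num_grad(f, T)
print(G.shape)  # same shape as T: (2, 3, 4)
```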
Solution:
4. Ordered Derivatives
An ordered network is a network where the state variables can be computed one at a
time in a specified order.
Answer the following questions regarding such a network.
(a) Given the ordered network below, give a formula for calculating the ordered deriva-
tive ∂y4/∂y1 in terms of partial derivatives w.r.t. y1 and y2, where y1, y2 and y3 are the
outputs of nodes 1, 2 and 3 respectively.
[Figure: an ordered network with nodes 1, 2, 3 and 4 producing outputs y1, y2, y3 and y4; later nodes depend on the outputs of earlier ones.]
Solution:
(b) The figure above can be viewed as a dependency graph as it tells us which variables
in the system depend on which other variables. For example, we see that y3 depends
on y1 and y2 which in turn also depends on y1 . Now consider the network given
below,
[Figure: a recurrent network unrolled over states s−1, s0, s1, s2, s3 with inputs x1, x2, x3; weights W connect si−1 to si, weights Y connect si−2 to si, and weights U connect xi to si.]
Here, si = σ(W si−1 + Y si−2 + U xi + b) (∀i ≥ 1).
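The recurrence can be rolled out directly (a minimal sketch with scalar states; the values of W, Y, U, b, the inputs, and the initial states s−1 = s0 = 0 are all illustrative assumptions):

```python
import math

def sigma(z):
    # logistic sigmoid
    return 1 / (1 + math.exp(-z))

W, Y, U, b = 0.5, 0.3, 1.0, 0.1   # assumed scalar parameters
x = [0.2, -0.4, 0.7]              # inputs x1, x2, x3
s = {-1: 0.0, 0: 0.0}             # assumed initial states

for i in range(1, 4):
    # s_i = sigma(W s_{i-1} + Y s_{i-2} + U x_i + b)
    s[i] = sigma(W * s[i - 1] + Y * s[i - 2] + U * x[i - 1] + b)
print(s[1], s[2], s[3])
```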
Solution:
Solution:
5. Baby Steps
From basic calculus, we know that we can find the minima (local and global) of a function
by using its first- and second-order derivatives: we set the first derivative to zero and
verify that the second derivative at the same point is positive. The reasoning behind
this procedure is based on the interpretation of the derivative of a function as the
slope of the function at any given point.
The above procedure, even though correct, can be intractable in practice when trying to
minimize functions. And this is not just a problem for the multivariable case, but even
for single-variable functions. Consider minimizing the function f(x) = x^5 + 5 sin(x) +
10 tan(x). Although f is a contrived example, the point is that the standard
derivative approach might not always be a feasible way to find the minima of functions.
In this course, we will routinely be dealing with minimizing functions of multiple variables
(in fact, millions of variables). Of course, we will not be solving them by hand, but we
need a more efficient way of minimizing functions. For the sake of this problem, consider
that we are trying to minimize a convex function of one variable f(x) [1], which is
guaranteed to have a single minimum. We will now build an iterative approach to finding
the minima of functions.
The high level idea is the following:
Start at a (random) point x0. Check whether we are at the minimum. If not, change the
value of x so that we move closer to the minimum. Keep repeating until we hit the minimum.
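One well-known realization of this loop, worth comparing against whatever update you derive in part (a), is sketched below (the test function f(x) = (x − 2)², the step size, the starting point, and the stopping tolerance are all assumptions):

```python
def f(x):
    return (x - 2) ** 2      # assumed convex test function, minimum at x = 2

def df(x):
    return 2 * (x - 2)       # its derivative

x = -5.0                     # arbitrary starting point x0
eta = 0.1                    # assumed step size
while abs(df(x)) > 1e-8:     # "are we at the minimum?" check
    x = x - eta * df(x)      # move opposite to the slope
print(x)  # close to 2
```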
(a) Use the intuition built from Q.3 to find a way to change the current value of x while
still ensuring that we are improving (i.e. minimizing) the function.
Solution:
(b) How would you use the same idea, if you had to minimize a function of multiple
variables?

[1] https://en.wikipedia.org/wiki/Convex_function
Solution:
(c) Does your method always lead to the global minimum (smallest value) for non-convex
functions (which may have multiple local minima)? If yes, can you explain (prove
or argue) why? If not, can you give a concrete example of a case where it fails?
Solution:
(d) Do you think this procedure always works for convex functions? (i.e., are we always
guaranteed to reach the minimum?)
Solution:
(e) (Extra) Can you estimate the number of steps needed to reach the minimum?
Solution:
(f) (Extra) Can you think of ways to reduce the number of steps needed to reach the
minimum?
Solution:
Solution:
(b) Suppose we are only interested in minimizing f in the region where g(x, y) = c,
where c is some constant. Suppose this region is a curve in the x-y plane. Call this
the feasible curve. Will our previous technique still work in this case? Why or why
not?
Solution:
(c) What is the component of ∇g along the feasible curve, when computed at points
lying on the curve?
Solution:
(d) * At the point on the feasible curve that achieves the minimum value of f, what will
be the component of ∇f along the curve?
Solution:
(e) Using the previous answers, show that at the point on the feasible curve achieving the
minimum value of f, ∇f = λ∇g for some real number λ. Thus, this equation,
combined with the constraint g(x, y) = c, should enable us to find the minima.
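The parallel-gradients condition can be seen numerically on a toy instance (a sketch with assumed f(x, y) = x² + y² and constraint g(x, y) = x + y = 1, whose constrained minimum is at (0.5, 0.5)):

```python
import numpy as np

def grad_f(p):
    x, y = p
    return np.array([2 * x, 2 * y])   # ∇f for f = x^2 + y^2

def grad_g(p):
    return np.array([1.0, 1.0])       # ∇g for g = x + y

p_star = np.array([0.5, 0.5])         # constrained minimizer of f on g = 1
gf, gg = grad_f(p_star), grad_g(p_star)

# ∇f = λ∇g (here λ = 1); the 2-D cross term vanishes when they are parallel
cross = gf[0] * gg[1] - gf[1] * gg[0]
print(gf, gg, cross)  # cross ≈ 0
```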
Solution:
(f) * Using the insights from the discussion so far, solve the following optimization
problem:

max_(x,y,z) x^a y^b z^c

subject to

x + y + z = 1

and given a, b, c > 0.
Solution:
7. Billions of Balloons
Consider a large playground filled with 1 billion balloons. Of these there are k1 blue, k2
green and k3 red balloons. The values of k1 , k2 and k3 are not known to you but you are
interested in estimating them. Of course, you cannot go over all the 1 billion balloons
and count the number of blue, green and red balloons. So you decide to randomly
sample 1000 balloons and note down the number of blue, green and red balloons. Let
these counts be k̂1 , k̂2 and k̂3 respectively. You then estimate the total number of blue,
green and red balloons as 1000000 ∗ k̂1 , 1000000 ∗ k̂2 and 1000000 ∗ k̂3 .
(a) Your friend knows the values of k1 , k2 and k3 and wants to see how bad your
estimates are compared to the true values. Can you suggest some ways of calculating
this difference? [Hint: Think about probability!]
Solution:
(b) * Consider two ways of converting k̂1 , k̂2 and k̂3 to a probability distribution:
pi = k̂i / Σi k̂i

qi = e^(k̂i) / Σi e^(k̂i)
Would you prefer the distribution q = [q1 , q2 , ..., qn ] over p = [p1 , p2 , ..., pn ] for the
above task? Give reasons and provide an example to support your choice.
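The two normalizations behave very differently on count data: the exponential form concentrates almost all mass on the largest count (a minimal sketch with assumed counts for the three colors):

```python
import numpy as np

k_hat = np.array([600.0, 300.0, 100.0])   # assumed sample counts

p = k_hat / k_hat.sum()                    # proportional normalization
q = np.exp(k_hat - k_hat.max())            # softmax, shifted by the max for numerical stability
q = q / q.sum()

print(p)  # [0.6 0.3 0.1]
print(q)  # essentially all mass on the largest count
```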
Solution:
What is the value of E_X[F(X)] (the expected value of the CDF of X)? The answer is
a real number. (Hint: The expectation can be formulated as a double integral. Try to
plot the area over which you need to integrate in the x-y plane. Now look at the area
over which you are not integrating. Do you notice any symmetries?)
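A Monte Carlo spot check can hint at the answer before attempting the double-integral argument (a sketch assuming X is, say, exponentially distributed; any continuous distribution with a known CDF would do):

```python
import random, math

random.seed(0)

# assume X ~ Exponential(1), so F(x) = 1 - exp(-x)
def F(x):
    return 1 - math.exp(-x)

n = 200_000
samples = (random.expovariate(1.0) for _ in range(n))
estimate = sum(F(x) for x in samples) / n
print(estimate)
```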
Solution:
9. * Intuitive Urns
An urn initially contains 3 red balls and 3 blue balls. One of the balls is removed without
being observed. To find out the color of the removed ball, Alice and Bob independently
perform the same experiment: they randomly draw a ball, record the color, and put it
back. This is repeated several times and the number of red and blue balls observed by
each of them is recorded.
Alice draws 6 times and observes 6 red balls and 0 blue balls.
Bob draws 600 times and observes 303 red balls and 297 blue balls.
Obviously, both of them will predict that the removed ball was blue.
(a) Intuitively, who do you think has stronger evidence for claiming that the removed
ball was blue, and why? (Don’t cheat by computing the answer. This
subquestion has no marks, but is compulsory!)
Solution:
(b) What is the exact probability that the removed ball was blue, given Alice’s obser-
vations? (Hint: Think Bayesian Probability)
Solution:
(c) What is the exact probability that the removed ball was blue, given Bob’s observa-
tions? (Hint: Think Bayesian Probability)
Solution:
(d) Computationally, who do you think has stronger evidence for claiming that the
removed ball was blue?
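Once you have attempted (b)–(d) by hand, the two posteriors can be cross-checked with a short script (assuming a uniform prior over which color was removed):

```python
# Posterior P(removed = blue | observations), uniform prior over {red, blue}.
# If blue was removed the urn holds 3 red, 2 blue: P(red draw) = 3/5.
# If red was removed it holds 2 red, 3 blue: P(red draw) = 2/5.

def posterior_blue(reds, blues):
    like_blue = (3 / 5) ** reds * (2 / 5) ** blues   # likelihood if blue removed
    like_red = (2 / 5) ** reds * (3 / 5) ** blues    # likelihood if red removed
    return like_blue / (like_blue + like_red)

print(posterior_blue(6, 0))      # Alice: 6 red, 0 blue
print(posterior_blue(303, 297))  # Bob: 303 red, 297 blue
```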
Solution:
Did your intuition match up with the computations? If yes, awesome! If not,
remember that probability can often seem deceptively straightforward. Try to
avoid relying on intuition alone when dealing with probability by grounding it in formalism.
(a) Consider the variable x and the functions h11(x), h12(x) and h21(x) such that

h11(x) = 1 / (1 + e^(−(400x + 24)))

h12(x) = 1 / (1 + e^(−(400x − 24)))

h21(x) = h11(x) − h12(x)

Plot the following functions: h11(x), h12(x) and h21(x) for x ∈ (−1, 1).
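The three curves can be generated in a few lines (a minimal sketch using numpy and matplotlib; the headless Agg backend and the output filename are incidental choices):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

x = np.linspace(-1, 1, 1000)
with np.errstate(over="ignore"):   # exp overflow only saturates the sigmoids to 0
    h11 = 1 / (1 + np.exp(-(400 * x + 24)))
    h12 = 1 / (1 + np.exp(-(400 * x - 24)))
h21 = h11 - h12

for h, label in [(h11, "h11"), (h12, "h12"), (h21, "h21")]:
    plt.plot(x, h, label=label)
plt.legend()
plt.savefig("h_functions.png")
```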
Solution:
(b) Now consider the variables x1 , x2 and the functions h11 (x1 , x2 ), h12 (x1 , x2 ), h13 (x1 , x2 ), h14 (x1 , x2 ),
h21 (x1 , x2 ), h22 (x1 , x2 ), h31 (x1 , x2 ) and f (x1 , x2 ) such that
h11(x1, x2) = 1 / (1 + e^(−(x1 + 100x2 + 200)))

h12(x1, x2) = 1 / (1 + e^(−(x1 + 100x2 − 200)))

h13(x1, x2) = 1 / (1 + e^(−(100x1 + x2 + 200)))

h14(x1, x2) = 1 / (1 + e^(−(100x1 + x2 − 200)))

h21(x1, x2) = h11(x1, x2) − h12(x1, x2)

h22(x1, x2) = h13(x1, x2) − h14(x1, x2)

h31(x1, x2) = h21(x1, x2) + h22(x1, x2)

f(x1, x2) = 1 / (1 + e^(−(50·h31(x1, x2) − 100)))
Plot the following functions: h11 (x1 , x2 ), h12 (x1 , x2 ), h13 (x1 , x2 ), h14 (x1 , x2 ), h21 (x1 , x2 ),
h22 (x1 , x2 ), h31 (x1 , x2 ) and f (x1 , x2 ) for x1 ∈ (−5, 5) and x2 ∈ (−5, 5)
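These surfaces can be evaluated on a grid and rendered as heatmaps (a minimal sketch using numpy and matplotlib; only f is rendered here, the other functions follow the same pattern, and the grid resolution is an arbitrary choice):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

def sigmoid(z):
    with np.errstate(over="ignore"):   # overflow just saturates the sigmoid
        return 1 / (1 + np.exp(-z))

x1, x2 = np.meshgrid(np.linspace(-5, 5, 400), np.linspace(-5, 5, 400))

h11 = sigmoid(x1 + 100 * x2 + 200)
h12 = sigmoid(x1 + 100 * x2 - 200)
h13 = sigmoid(100 * x1 + x2 + 200)
h14 = sigmoid(100 * x1 + x2 - 200)
h31 = (h11 - h12) + (h13 - h14)
f = sigmoid(50 * h31 - 100)   # largest on the central square where both strips overlap

plt.imshow(f, extent=(-5, 5, -5, 5), origin="lower")
plt.colorbar()
plt.savefig("f_surface.png")
```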
Solution: