Assignment 1 cs771 Machine Learning
9. Your solutions must appear in proper order in the PDF file i.e. your solution to question
n must be complete in the PDF file (including all plots, tables, proofs etc) before you
present a solution to question n + 1.
10. We may impose a penalty on submissions that significantly deviate from the style file or that do not follow the formatting instructions.
• Please note that the packages required for this assignment are not readily available in standard form for Windows platforms. We recommend a Linux platform for the following steps. The IITK Computer Center provides Linux workstations to all students.
• Python will be the language used in all assignments in this course. If you are unfamiliar with the language, we recommend going to a source like https://www.codecademy.com/learn/learn-python for the basic syntax. All assignments will assume a passing familiarity with the language, and a certain level of comfort with programming in general.
• Feel free to contact the mentors with any queries you may have regarding the assignments. Python has two major versions, 2.7 and 3.5. There are not many differences between the two, and our assignments should work on both. We will assume 3.5 as the default version, but things should work on 2.7 without any additional changes.
• For this assignment, you will need the LMNN algorithm. It has an implementation in the
Shogun machine learning library. Installation steps for Shogun are given below.
• Create a conda environment using the command conda create --prefix ~/cs771, replacing ~/cs771 with whatever path you wish to use.
• In a terminal, activate this environment and execute the command: conda install -c conda-forge shogun
• Verify the installation by importing the LMNN class in Python (a short verification snippet is given after this list). If all goes well, the import will be successful and no error messages will be displayed.
• Please note that alternate implementations of the LMNN algorithm do exist e.g. https:
//all-umass.github.io/metric-learn/metric_learn.lmnn.html which may run on
Windows platforms but the Shogun implementation is highly optimized for speed and
accuracy (it actually runs in C++).
• Also note that Shogun does not have readily available binaries for Windows unless you
build the thing from scratch using MS Visual Studio etc. Hence, working with Linux
environments is preferred.
• If you are able to get Shogun running with LMNN on Windows then all is fine. However,
in that case, please teach the instructor how to do this as well :)
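As a quick sanity check of the installation, the following minimal Python snippet (a sketch; depending on your Shogun version the Python module may be named shogun or modshogun) simply verifies that the LMNN class can be imported:

    # Run inside the cs771 conda environment created above.
    # Depending on the Shogun version, the module is named 'shogun' or 'modshogun'.
    try:
        from shogun import LMNN, RealFeatures, MulticlassLabels
    except ImportError:
        from modshogun import LMNN, RealFeatures, MulticlassLabels
    print("Shogun LMNN imported successfully")

If this prints the success message without any error, the installation is usable for the metric learning problem below.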
1. This submission should be a single ZIP file. Loose PY/PYC/M files (outside the ZIP) will not be accepted.
2. The name of the ZIP file should be your roll number. Eg. 17001.zip. If your submission
is wrongly named, we may be unable to link it to you and you may lose credit.
3. You may resubmit but do not resubmit more than twice – you may incur a penalty for
excessive resubmission. We will simply accept your latest submission.
4. Submissions for this part will be accepted at the following URL:
https://www.dropbox.com/request/7s7l6UtgbuFv4hZ8jIwD
6. Do not include your PDF file from the theory part in this submission.
7. Do not include any data files (training features etc) in your ZIP archive. Your
archive should only contain code and model files.
8. Your code must be well commented and must execute/compile without need for special
packages or installations. If we are unable to execute your code, you may be asked to put
up a demonstration and incur a penalty too.
Problem 1.1 (V for Voronoi). Recall the learning with prototypes problem. Consider a two
class problem where the prototypes are the points (1, 0) (green) and (0, 1) (red). Calculate
the decision boundary when we use the learning with prototypes rule but with the following
Mahalanobis metrics. In the following, $z_1, z_2 \in \mathbb{R}^2$ denote two points on the real plane.
1. $d(z_1, z_2) = \left\langle z_1 - z_2, \, U(z_1 - z_2) \right\rangle$, where $U = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}$
2. $d(z_1, z_2) = \left\langle z_1 - z_2, \, V(z_1 - z_2) \right\rangle$, where $V = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$
Figure 1 pictorially depicts the prototypes as well as the sample solution if we had used the standard Euclidean metric to compute distances. In your submission, for each of the two parts above, you must include the following details:
1. The mathematical expression for the decision boundary. For example, in the Euclidean
case, it is the line y = x.
2. An image shading the red and green decision regions for the above cases, similar to the figure on the right in Figure 1.
To aid you, both figures in Figure 1 have been included in your assignment package (proto_blank.png
and proto_euclid_sample.png). Note that your images must be embedded in your PDF file
and not sent separately. Use the \includegraphics command in LaTeX to embed images in your submission PDF file. (5+5=10 marks)
Figure 1: Learning with Prototypes: the figure on the left shows the two prototypes. The figure on the right shows the decision boundary if the distance measure used is $d(z_1, z_2) = \|z_1 - z_2\|_2$, for any two points $z_1, z_2 \in \mathbb{R}^2$. The decision boundary in this case is the line $y = x$.
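For the image in part 2, one simple approach is to evaluate the two prototype distances on a dense grid and colour each grid point by its nearer prototype. The following is a minimal matplotlib sketch of that idea (not a required part of the submission); the metric matrix M and the output file name regions.png are placeholders that you should adapt to each part:

    import numpy as np
    import matplotlib.pyplot as plt

    # Prototypes from Problem 1.1: green at (1, 0), red at (0, 1).
    p_green = np.array([1.0, 0.0])
    p_red = np.array([0.0, 1.0])

    # Placeholder metric matrix; substitute U or V from the problem statement.
    M = np.array([[3.0, 0.0], [0.0, 1.0]])

    def maha(z, p, M):
        # Mahalanobis-style distance <z - p, M (z - p)> evaluated over a grid.
        d = z - p
        return np.einsum('...i,ij,...j->...', d, M, d)

    # Evaluate both distances on a dense grid and shade by the nearer prototype.
    xs, ys = np.meshgrid(np.linspace(-2, 3, 500), np.linspace(-2, 3, 500))
    grid = np.stack([xs, ys], axis=-1)
    greener = (maha(grid, p_green, M) < maha(grid, p_red, M)).astype(float)

    plt.contourf(xs, ys, greener, levels=[-0.5, 0.5, 1.5], colors=['red', 'green'], alpha=0.3)
    plt.scatter(*p_green, c='green', label='prototype (1, 0)')
    plt.scatter(*p_red, c='red', label='prototype (0, 1)')
    plt.xlabel('x'); plt.ylabel('y'); plt.legend()
    plt.savefig('regions.png')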
Problem 1.2 (PML For Constraints). Consider the following constrained least-squares regression problem on a data set $(x^i, y^i)_{i=1,\ldots,n}$, where $x^i \in \mathbb{R}^d$ and $y^i \in \mathbb{R}$.
$$\hat{w}_{\mathrm{cls}} = \arg\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^n \left( y^i - \left\langle w, x^i \right\rangle \right)^2 \qquad \text{s.t.} \quad \|w\|_2 \le r.$$
Design a likelihood distribution (on the responses, conditioned on the data covariates x) and
prior distribution (on the parameter) such that ŵcls is the MAP estimate for your model. Give
explicit forms for the density functions of your likelihood and prior distributions. The above
shows that PML approaches can also lead to constrained optimization problems. (5 marks)
Problem 1.3 (Fun with Features). Consider the following feature-regularized least-squares regression problem on a data set $(x^i, y^i)_{i=1,\ldots,n}$, where $x^i \in \mathbb{R}^d$, $y^i \in \mathbb{R}$, and $\alpha_j > 0$ for $j \in [d]$.
$$\hat{w}_{\mathrm{fr}} = \arg\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^n \left( y^i - \left\langle w, x^i \right\rangle \right)^2 + \sum_{j=1}^d \alpha_j (w_j)^2$$
Design a likelihood and prior distribution such that ŵfr is the MAP estimate for your model.
Give explicit forms for all distributions. It turns out that just as there exists a closed form ex-
pression for the solution to the L2 -regularized least-squares problem, one exists for this problem
too. Find a closed-form expression for ŵfr . (5+5=10 marks)
Problem 1.4 (Break Free from Constraints). Recall the OVA approach to multi-classification. Let us use a dataset $(x^i, y^i)_{i=1,\ldots,n}$, where $x^i \in \mathbb{R}^d$ and $y^i \in [K]$, i.e. there are $K$ classes. Denote by $W = [w^1, \ldots, w^K] \in \mathbb{R}^{d \times K}$ the set of $K$ linear models that make up the OVA classifier.
The Crammer-Singer formulation (P1) for a single machine learner for multi-classification is
$$\left( \widehat{W}, \{\hat{\xi}_i\} \right) = \arg\min_{W, \{\xi_i\}} \; \frac{1}{2} \sum_{k=1}^K \big\| w^k \big\|_2^2 + \sum_{i=1}^n \xi_i$$
$$\text{s.t.} \quad \left\langle w^{y^i}, x^i \right\rangle \ge \left\langle w^k, x^i \right\rangle + 1 - \xi_i, \quad \forall i, \; \forall k \ne y^i \qquad \text{(P1)}$$
$$\xi_i \ge 0, \quad \text{for all } i$$
Show that (P1) is equivalent to the following unconstrained problem (P2)
$$\widehat{W} = \arg\min_{W} \; \frac{1}{2} \sum_{k=1}^K \big\| w^k \big\|_2^2 + \sum_{i=1}^n \ell_{\mathrm{cs}}(y^i, \eta^i), \qquad \text{(P2)}$$
where $\eta^i = W^\top x^i \in \mathbb{R}^K$ and
$$\ell_{\mathrm{cs}}(y^i, \eta^i) = \Big[ 1 + \max_{k \ne y^i} \eta^i_k - \eta^i_{y^i} \Big]_+ .$$
To show equivalence, you will have to show that if $W^0, \{\xi^0_i\}$ are an optimum for (P1) then $W^0$ must be an optimum for (P2), as well as if $W^1$ is an optimum for (P2) then there must exist $\xi^1_i \ge 0$ such that $W^1, \{\xi^1_i\}$ are an optimum for (P1).
(15 marks)
Problem 1.5 (Sub-gradient Computation). Consider the following function, defined using a data set $(x^i, y^i)_{i=1,\ldots,n}$ where $x^i \in \mathbb{R}^d$ and $y^i \in \{-1, 1\}$, i.e. binary Rademacher labels:
$$f(w) = \sum_{i=1}^n \left[ 1 - y^i \left\langle w, x^i \right\rangle \right]_+$$
If for every $i$ we define
$$h^i = \begin{cases} -y^i \cdot x^i & \text{if } y^i \left\langle w, x^i \right\rangle < 1 \\ 0 & \text{if } y^i \left\langle w, x^i \right\rangle \ge 1, \end{cases}$$
and set $g = \sum_{i=1}^n h^i$, then show that $g \in \partial f(w)$, i.e. $g$ is a member of the subdifferential of $f$ at $w$. Recall that to show this, you have to show that for every $w' \in \mathbb{R}^d$, $f(w') \ge f(w) + \langle g, w' - w \rangle$. (5 marks)
Problem 1.6 (Metric Learning for NN classifiers). Recall that we commented that the nearest
neighbor (or more generally, the k-NN) algorithm can be made more powerful by learning an
application specific metric instead of using a fixed metric like Euclidean ($L_2$) or Manhattan ($L_1$). In this exercise, you will use the LMNN method [1] to learn a Mahalanobis metric and then
use the k-NN algorithm with this learnt metric to perform classification.
You have been supplied with a training data set with 60K data points and a test set with
20K data points, each point being 100 dimensional. Download these datasets from the URL
http://web.cse.iitk.ac.in/users/purushot/courses/ml/2017-18-a/material/assn1data.zip
This is a supervised multi-classification problem with 3 classes. You may use the training
data provided to you in any way to tune your parameters (split into validation sets in any
fashion (e.g. held out, k-fold), as well as use any fraction of data for validation, etc) but your
job is to do well on your test data points (as well as a secret test set which we have with us and
will not reveal to you).
Execute the following experiments using your data:
1. Use the k-NN algorithm with the Euclidean metric to perform classification. Report your
test error (on the 20K sized dataset) for different values of k = 1, 2, 3, 5, 10. Plot a graph
showing test accuracies (fraction of the 20K points that were correctly classified) vs k.
Explain your observations.
2. Tune a good value of k using your favorite validation technique. Do not touch your test
set while performing validation. Report which value of k you found to work best.
3. Fixing the above value of k, learn a good Mahalanobis metric for the classification problem
using the LMNN package. Note that your value of k was chosen for the Euclidean metric
and not for the metric learnt by LMNN. We may do a joint optimization of the k value
and the metric but let us leave that for now. Report your test accuracies using the learnt metric and the value of k chosen in step 2.
4. Save your learnt metric to a file called model.npy and include this model file in your submission ZIP file.
5. Submit all your training code inside your submission ZIP file. Do not include subdirectories inside the ZIP file. All files should be present directly inside the ZIP file. If you have used multiple files in the entire process of training, submit all of them.
6. Write a single Python script called test.py to allow others to use your learnt metric as well as your tuned value of k to perform classification on new points. The file must be able to accept the training and testing file names as system input, load the metric from the file model.npy, and save the predictions on the test set to a file called testY.dat. Include the file test.py inside your submission ZIP file (a minimal interface sketch is given below).
Please note that your submissions to parts (1, 2, 3) above go in the PDF file to Gradescope, while your submissions to parts (4, 5, 6) go in the ZIP file to Dropbox.
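The following is a minimal sketch of the expected test.py interface. It assumes the data files are plain whitespace-separated text with one point per row and the label in the last column of the training file, uses a simple Mahalanobis k-NN predictor in place of whatever routine you actually train, and treats k = 5 as a placeholder for your tuned value; the sample test.py shipped with the assignment package remains the authoritative template.

    import sys
    import numpy as np

    def predict_knn(train_X, train_y, test_X, M, k):
        # Plain k-NN under the metric <d, M d>; replace with your own routine if you prefer.
        preds = []
        for x in test_X:
            d = train_X - x
            dist = np.einsum('ij,jk,ik->i', d, M, d)    # distances to all training points
            nearest = np.argpartition(dist, k)[:k]      # indices of the k nearest neighbors
            labels, counts = np.unique(train_y[nearest], return_counts=True)
            preds.append(labels[np.argmax(counts)])     # majority vote
        return np.array(preds)

    if __name__ == '__main__':
        train_file, test_file = sys.argv[1], sys.argv[2]   # training and testing file names
        train = np.loadtxt(train_file)                     # assumed: features then label per row
        test_X = np.loadtxt(test_file)
        train_X, train_y = train[:, :-1], train[:, -1]
        M = np.load('model.npy')                           # the learnt metric
        k = 5                                              # placeholder for your tuned value of k
        np.savetxt('testY.dat', predict_knn(train_X, train_y, test_X, M, k))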
For your convenience, sample training (train.py) and testing (test.py) files which adhere to the above instructions have already been included in the assignment package. Please make changes to those files as per your wish, but be careful not to change the way the test file accepts input and writes output. You may also refer to the following Python notebook for a quick explanation on how to invoke the LMNN routine: http://nbviewer.jupyter.org/gist/iglesias/6576096
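For reference, the core LMNN call in that notebook looks roughly like the sketch below. Treat it as an assumption to verify against your installed version: the module name (shogun vs. modshogun) and minor API details differ across Shogun releases, and the arrays X, y and the value k below are synthetic stand-ins for the supplied training data and your tuned number of neighbors.

    import numpy as np
    try:
        from shogun import LMNN, RealFeatures, MulticlassLabels
    except ImportError:
        from modshogun import LMNN, RealFeatures, MulticlassLabels   # older Shogun versions

    # Synthetic stand-in data; replace with the supplied 60K x 100 training set and tuned k.
    rng = np.random.RandomState(0)
    X = rng.randn(300, 100)
    y = rng.randint(0, 3, 300).astype(np.float64)
    k = 5

    features = RealFeatures(X.T)              # Shogun expects one data point per column
    labels = MulticlassLabels(y)
    lmnn = LMNN(features, labels, k)
    lmnn.train()

    L = lmnn.get_linear_transform()           # linear map L; the Mahalanobis matrix is M = L^T L
    np.save('model.npy', L.T.dot(L))          # save the learnt metric for test.py to load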
You may use a brute force search or sorting operations to identify the k nearest neighbors and need not invest in faster NN data structures like k-d trees etc. However, you should use fast sorting operations offered by packages such as NumPy and not try to implement sorting etc. operations yourself – chances are your implementation will be really slow since Python is, by and large, an interpreted language. A short sketch of such a brute force search is given below.
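As an illustration, a fully vectorized Euclidean k-NN pass over a toy data set (synthetic stand-in arrays; substitute the supplied training and test sets) can be written entirely with NumPy operations:

    import numpy as np

    # Toy stand-ins for the real data; replace with the supplied training/test sets.
    rng = np.random.RandomState(0)
    train_X, train_y = rng.randn(600, 100), rng.randint(0, 3, 600)
    test_X, test_y = rng.randn(200, 100), rng.randint(0, 3, 200)

    # Full squared-distance matrix (test x train) via broadcasting, no Python-level loops.
    # For the real 60K x 20K data, compute this in chunks to keep memory usage reasonable.
    sq_dists = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)

    for k in [1, 2, 3, 5, 10]:
        # np.argpartition performs a partial sort in C, far faster than hand-rolled sorting.
        nearest = np.argpartition(sq_dists, k - 1, axis=1)[:, :k]
        votes = train_y[nearest]                                     # (n_test, k) label array
        preds = np.array([np.bincount(v).argmax() for v in votes])   # majority vote per point
        print(k, 'accuracy:', (preds == test_y).mean())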
Extra Credit: the ITML metric learning technique [2] is another excellent method for learning Mahalanobis metrics. This method won the best paper award at the ICML 2007 conference. You may obtain a Python implementation of ITML from https://all-umass.github.io/metric-learn/metric_learn.itml.html or else a Matlab implementation by the original authors from http://www.cs.utexas.edu/users/pjain/itml/.
The same GitHub site also hosts several other metric learning techniques like SDML and LSML. Use ITML (and optionally the other methods) to learn Mahalanobis metrics and see whether you can get superior accuracies. Report your findings on your test set. Do not submit test.py or model.npy files for extra credit experiments. Just include your observations in the PDF file and the training code in the ZIP file. A short usage sketch for the metric-learn implementation of ITML is given below.
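The sketch below shows roughly how the supervised ITML wrapper in the metric-learn package can be used. Class and method names may differ slightly across metric-learn versions (treat them as assumptions to check against the documentation linked above), the data arrays are synthetic stand-ins, and model_itml.npy is just an illustrative file name since model.npy must not be submitted for extra credit runs.

    import numpy as np
    from metric_learn import ITML_Supervised   # supervised wrapper that builds constraints from labels

    # Synthetic stand-in data; replace with the supplied training set.
    rng = np.random.RandomState(0)
    X, y = rng.randn(300, 100), rng.randint(0, 3, 300)

    itml = ITML_Supervised()        # default settings; tune the constraint options as needed
    itml.fit(X, y)

    # Extract the learnt Mahalanobis matrix for use in your own k-NN experiments.
    M = itml.get_mahalanobis_matrix()
    np.save('model_itml.npy', M)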
Instructions
1. All plots must be generated electronically - no hand-drawn plots will be accepted. All plots must have axes titles and a legend indicating what the plotted quantities are.
2. All plots must be embedded in the PDF file – no stray image files will be accepted. Use the \includegraphics command in LaTeX to embed images in your submission PDF file.
3. Your submission must describe neatly what the plotted quantities are as well as the main
inference that can be drawn from the plot. E.g. if varying k changes the accuracy, what
changes do you observe?
4. If a file name has been specified in the instructions above, please stick to it and do not
use any other file name. Our automated scripts will not be able to evaluate your code
otherwise and you will incur a penalty. For training, you may use file names as per your
wish but of course avoid the names we have specified above.
5. Do not include any data files (.dat etc.) inside your submission. Only your code files (in .py format) and your model file (model.npy) should be there inside your submission ZIP file. We have the training files with us already and can easily include them at our own end to test your code. Your total submission size should not exceed 500KB.
(35+extra marks)
[1] Kilian Q. Weinberger and Lawrence K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10:207-244, 2009.
[2] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra and Inderjit S. Dhillon. Information Theoretic Metric Learning. ICML 2007.