LDATS2470 Project
Form groups of two or three students and register your group on Moodle before Thursday, March 16, 2023, at 6 pm. Write a report of at most 10 pages in total. Deposit a zip file before the deadline, Friday, May 12, 2023, at 3 pm. The file must be deposited via your group on Moodle and must contain your report in PDF format, the data file, and the source code of your programs in separate files. The first page of the report should state the group name and its members. You may write in English or in French.
1. Choose your own dataset, which must satisfy the following criteria:
   - Formulate an objective of your analysis. What is the research question that you would like to answer? Be original. Ideally, your dataset is original. If it has been used elsewhere, make sure that your analysis of that dataset is original. To check this, Compilatio will be used.
   - Send the title, description, and dataset (or a link) of your project to the teacher before March 23, 2023, for approval.
2. Structure: In the introduction, explain the idea and objectives of your project. Give a short graphical or numerical summary and overview of your dataset, including an explanation of the different variables (a small sketch of such a summary follows this item). After the main part of the analysis, summarize your main results in the conclusions and state whether you have attained your objectives.
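A minimal sketch of such a data overview, assuming a tabular dataset read with pandas; the file name data.csv and the variable choices are placeholders for your own data:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Placeholder file name: replace with your own data file.
    df = pd.read_csv("data.csv")

    # Numerical overview: dimensions, variable types, summary statistics.
    print(df.shape)
    print(df.dtypes)
    print(df.describe(include="all"))

    # Graphical overview: histograms of the numerical variables and a
    # scatter-plot matrix to see pairwise relationships.
    num = df.select_dtypes(include="number")
    num.hist(bins=20, figsize=(10, 8))
    pd.plotting.scatter_matrix(num, figsize=(10, 10), diagonal="kde")
    plt.tight_layout()
    plt.show()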
3. Depending on your objective and dataset, choose one of the following methodologies:
(a) clustering: k-means, kernel k-means, and EM clustering
(b) classification: linear and nonlinear SVM
(c) dimension reduction: linear and nonlinear PCA
The tasks for each methodology are explained below.
4. Clustering: Use your numerical variables to perform a clustering for a given number of clusters, using k-means, kernel k-means, and a Gaussian mixture estimated by EM. For the latter, first assume diagonal covariance matrices and, if possible (depending on your dataset), also obtain results for EM clustering with full covariance matrices. Compare your results with at least one other number of clusters k. There is no simple statistical criterion, so argue which clustering method and which k you prefer, depending on your dataset and the interpretability of the clusters. Visualize the results by projecting the data onto the plane spanned by the first two principal components, and indicate the different clusters in the graph. Try to interpret the results (a sketch of these steps follows this item).
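A minimal sketch of these clustering steps, assuming standardized numerical data in an array X (make_blobs is only a stand-in for your own dataset). The small kernel k-means routine is a hand-rolled illustration, since scikit-learn does not ship one; the choices k = 3 and gamma = 0.5 are placeholders:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture
    from sklearn.decomposition import PCA
    from sklearn.metrics.pairwise import rbf_kernel

    # Stand-in data: replace with your own numerical variables.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    X = StandardScaler().fit_transform(X)
    k = 3

    # k-means.
    km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

    # Gaussian mixtures estimated by EM, diagonal and full covariances.
    gm_diag = GaussianMixture(n_components=k, covariance_type="diag", random_state=0).fit_predict(X)
    gm_full = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit_predict(X)

    # Simple kernel k-means (RBF kernel): assign each point to the cluster
    # whose centre in feature space is closest, using only the kernel matrix.
    def kernel_kmeans(K, k, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        labels = rng.integers(k, size=K.shape[0])
        for _ in range(n_iter):
            dist = np.zeros((K.shape[0], k))
            for c in range(k):
                mask = labels == c
                n_c = max(mask.sum(), 1)
                dist[:, c] = (np.diag(K)
                              - 2 * K[:, mask].sum(axis=1) / n_c
                              + K[np.ix_(mask, mask)].sum() / n_c**2)
            new_labels = dist.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        return labels

    kk_labels = kernel_kmeans(rbf_kernel(X, gamma=0.5), k)

    # Visualize each clustering in the plane of the first two principal components.
    Z = PCA(n_components=2).fit_transform(X)
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    for ax, lab, title in zip(axes,
                              [km_labels, kk_labels, gm_diag, gm_full],
                              ["k-means", "kernel k-means", "EM (diag)", "EM (full)"]):
        ax.scatter(Z[:, 0], Z[:, 1], c=lab, s=15)
        ax.set_title(title)
    plt.show()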
5. Classification: Begin with a linear SVM, i.e. an SVM with a linear kernel. Is the data linearly separable? Report the number of support vectors, the value of the objective function, and the proportion of correct classifications. Then fit a nonlinear SVM using a kernel such as the RBF kernel. Optimize the parameters (the capacity C and the kernel parameter) by cross-validation, e.g. five- or ten-fold. Report the number of support vectors, the value of the objective function, and the proportion of correct classifications. Visualize the results by projecting the data onto the plane spanned by the first two principal components. Interpret your results (a sketch follows this item).
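A minimal sketch of the SVM steps, assuming a binary classification problem with standardized features X and labels y (the breast-cancer data is only a stand-in for your own dataset, and the grids over C and gamma are placeholders). The dual objective is reconstructed from dual_coef_, which stores y_i * alpha_i for the support vectors, so this reconstruction is only valid for two classes:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.decomposition import PCA
    from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

    # Stand-in data: replace with your own variables and class labels.
    data = load_breast_cancer()
    X = StandardScaler().fit_transform(data.data)
    y = data.target

    def dual_objective(clf, K_sv):
        # Dual SVM objective: sum(alpha) - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_ij,
        # for binary problems; dual_coef_ holds y_i * alpha_i.
        a = clf.dual_coef_.ravel()
        return np.abs(a).sum() - 0.5 * a @ K_sv @ a

    # Linear SVM.
    lin = SVC(kernel="linear", C=1.0).fit(X, y)
    print("linear: #SV =", lin.n_support_.sum(),
          "objective =", dual_objective(lin, linear_kernel(lin.support_vectors_, lin.support_vectors_)),
          "CV accuracy =", cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=10).mean())

    # Nonlinear SVM with an RBF kernel; tune C (capacity) and gamma by 10-fold CV.
    grid = GridSearchCV(SVC(kernel="rbf"),
                        {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
                        cv=10).fit(X, y)
    best = grid.best_estimator_
    print("RBF:", grid.best_params_,
          "#SV =", best.n_support_.sum(),
          "objective =", dual_objective(best, rbf_kernel(best.support_vectors_,
                                                         best.support_vectors_,
                                                         gamma=best.gamma)),
          "CV accuracy =", grid.best_score_)

    # Visualize true classes and RBF-SVM predictions in the first two PCs.
    Z = PCA(n_components=2).fit_transform(X)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(Z[:, 0], Z[:, 1], c=y, s=10); ax1.set_title("true classes")
    ax2.scatter(Z[:, 0], Z[:, 1], c=best.predict(X), s=10); ax2.set_title("RBF SVM predictions")
    plt.show()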
6. Dimension reduction: Using the numerical variables, perform a linear PCA, selecting the number of dimensions and representing the data graphically in the principal component space. Show the duality with MDS by calculating the Euclidean distances between the data points and performing MDS on this distance matrix, using the same number of dimensions as for PCA. Finally, perform a kernel PCA with a suitable kernel. Show the projection of the data onto the kernel principal components graphically. Compare PCA and kernel PCA, e.g. in terms of percentage of variance explained. Interpret the results (a sketch follows this item).
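A minimal sketch of the dimension-reduction steps, again with stand-in data (load_wine) in place of your own numerical variables; the RBF kernel, its gamma, and the choice d = 2 are placeholders, and measuring the "variance explained" by kernel PCA through the share of retained kernel eigenvalues (eigenvalues_, scikit-learn >= 1.0) is an assumption about how to compare the two methods:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_wine
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.manifold import MDS
    from sklearn.metrics import pairwise_distances

    # Stand-in data: replace with your own numerical variables.
    X = StandardScaler().fit_transform(load_wine().data)
    d = 2  # chosen number of dimensions

    # Linear PCA.
    pca = PCA(n_components=d)
    Z_pca = pca.fit_transform(X)
    print("PCA explained variance:", pca.explained_variance_ratio_.sum())

    # Duality with MDS: metric MDS on the Euclidean distance matrix yields
    # essentially the same configuration as PCA, up to rotation and reflection
    # (exactly so for classical scaling).
    D = pairwise_distances(X, metric="euclidean")
    Z_mds = MDS(n_components=d, dissimilarity="precomputed", random_state=0).fit_transform(D)

    # Kernel PCA with an RBF kernel (gamma is a placeholder to be tuned).
    kpca = KernelPCA(n_components=d, kernel="rbf", gamma=0.05)
    Z_kpca = kpca.fit_transform(X)
    # Rough analogue of "variance explained": share of the retained kernel eigenvalues.
    full = KernelPCA(kernel="rbf", gamma=0.05).fit(X)
    print("kernel PCA eigenvalue share:", kpca.eigenvalues_.sum() / full.eigenvalues_.sum())

    # Side-by-side projections.
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    for ax, Z, title in zip(axes, [Z_pca, Z_mds, Z_kpca], ["PCA", "MDS", "kernel PCA"]):
        ax.scatter(Z[:, 0], Z[:, 1], s=15)
        ax.set_title(title)
    plt.show()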