Exercises - DSS - Part D - Handout
Predictive Analytics
1 Clustering
D.1.1 Cluster Analysis in Python
For the following exercise, load BigMac2003.csv using pandas. This data set contains, for 69 cities worldwide,
the average working hours, price level and income level for the year 1991. Your task is to perform a cluster analysis
of the cities based on selected variables. The data set contains the following variables (to list them, you might use,
e.g., df.columns once the data is loaded into a DataFrame df):
BigMac      Minutes of labor to purchase a Big Mac
Bread       Minutes of labor to purchase 1 kg of bread
Rice        Minutes of labor to purchase 1 kg of rice
FoodIndex   Food price index (Zurich = 100)
Bus         Cost in US dollars for a one-way 10 km ticket
Apt         Normal rent (US dollars) of a 3-room apartment
TeachGI     Primary teacher’s gross income, 1000s of US dollars
TeachNI     Primary teacher’s net income, 1000s of US dollars
TaxRate     Tax rate paid by a primary teacher
TeachHours  Primary teacher’s hours of work per week
For the class label, generate a vector with the cities’ regions (aggregated to 4 classes) as follows:
    import pandas as pd
    import numpy as np

    # File name as given in the exercise text; adjust the case if your copy differs.
    df = pd.read_csv("BigMac2003.csv")
    # The first column holds the city names; give it a proper header.
    df.rename(columns={df.columns[0]: "City"}, inplace=True)

    # categorical: region of each city, in the row order of the file (69 entries)
    df["region"] = [
        "EU", "EU", "AUNZ", "AS", "EU", "EU", "EU", "SA", "EU", "EU", "EU", "EU", "SA", "SA", "NA",
        "EU", "AF", "EU", "EU", "EU", "EU", "AS", "AF", "AS", "AF", "AS", "EU", "AS", "AF", "SA", "EU", "EU", "EU", "NA",
        "EU", "EU", "EU", "AF", "AS", "NA", "NA", "EU", "NA", "EU", "AS", "AF", "NA", "EU", "EU", "EU", "EU", "SA", "EU",
        "SA", "SA", "AS", "AS", "AS", "EU", "EU", "AUNZ", "AS", "EU", "AF", "AS", "NA", "EU", "EU", "EU",
    ]

    # Keep AF, AS and SA; aggregate all other regions into the class "N".
    approved = ["AF", "AS", "SA"]
    df["region"] = np.where(df["region"].isin(approved), df["region"], "N")

    df["region"] = df["region"].astype("category")

    df.head()
D.1.2 How would you distinguish between flat and hierarchical partitioning?
D.1.3 What is probabilistic clustering?
D.1.4 What is the difference between soft and hard partitioning?
D.1.5 What are the two main approaches of flat partitioning?
D.1.6 What are the two main approaches of hierarchical partitioning?
D.1.7 Discuss the Minkowski distances for r = 1 and r = 2, their differences and their
alternative names.
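As a concrete reference for D.1.7, the following is a minimal sketch of the Minkowski distance of order r, d_r(a, b) = (sum_i |a_i - b_i|^r)^(1/r), evaluated for r = 1 and r = 2. The function name minkowski and the two sample points are illustrative assumptions, not part of the handout.

```python
import numpy as np


def minkowski(a, b, r):
    """Minkowski distance of order r between two points a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum the r-th powers of the absolute coordinate differences,
    # then take the r-th root.
    return float(np.sum(np.abs(a - b) ** r) ** (1.0 / r))


# Two hypothetical points chosen so that the results are easy to check by hand.
p = [0.0, 0.0]
q = [3.0, 4.0]

d1 = minkowski(p, q, r=1)  # sums the coordinate differences
d2 = minkowski(p, q, r=2)  # length of the straight line between p and q
print(d1, d2)
```

Comparing d1 and d2 for a few more point pairs may help when discussing how the two orders treat large differences in a single coordinate.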
D.1.8 On which of the following datasets would you use (1) K-Means Clustering (or
EM), (2) single-linkage agglomerative hierarchical clustering, (3) density-based
clustering? Provide arguments for your recommendation.
D.1.9 Which clustering(s) tend(s) to form clusters that are chains of points?
( ) Single Linkage Agglomerative Hierarchical Clustering
[Figure: six scatterplots, panels (a) to (f), each showing X2 against X1]
( ) Mahalanobis distance
( ) Q-Correlation Coefficient
( ) Clustering aims to maximise similarity between instances within clusters and dissimilarity between instances in
different clusters.
( ) The Silhouette Index considers the intra-cluster and inter-cluster distances.
D.1.14 You want to compare two clusterings, one with three and one with five clusters.
Which external performance measures can you use?
( ) Purity
( ) Normalised Mutual Information
D.1.15 Based on the elbow method, determine in the plots in Fig. 2 below which
number of clusters k to choose.
[Figure 2: three plots, panels (a) to (c), each showing RSS against the number of clusters k = 1, ..., 8]
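To see how a curve like those in D.1.15 comes about, the sketch below computes the residual sum of squares (RSS) of a k-means clustering for k = 1, ..., 8. Everything in it is an assumption made for illustration: the synthetic data (three well-separated blobs), the helper name kmeans_rss, and the deterministic farthest-point initialisation, which is used here only to keep the example reproducible.

```python
import numpy as np


def kmeans_rss(X, k, n_iter=50):
    """Run Lloyd's algorithm and return the residual sum of squares (RSS)."""
    # Farthest-point initialisation: start with the first point, then
    # repeatedly add the point farthest away from all chosen centres.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Assign every point to its nearest centre.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Move every centre to the mean of its points (unchanged if empty).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)

    # RSS: squared distance of every point to its nearest final centre.
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(dist.min(axis=1).sum())


# Synthetic data: three blobs of 50 points each.
rng = np.random.default_rng(0)
blob_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([m + rng.normal(size=(50, 2)) for m in blob_means])

ks = list(range(1, 9))
rss = [kmeans_rss(X, k) for k in ks]
for k, value in zip(ks, rss):
    print(k, round(value, 1))
```

Plotting rss against ks gives a curve of the kind shown in the figure; the elbow is the value of k after which additional clusters reduce the RSS only marginally.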
2 Classification
D.2.1 Which of the following statements are correct?
( ) In classification, the task is to predict a dependent variable (class label) based on a set of explanatory (feature)
variables.
( ) A probabilistic classifier also returns estimates of the posterior class probability Pr(y|x).
( ) A lazy learner does not construct an (abstract) model from the data.
( ) A Bayes classifier requires an estimate of either the joint probability Pr(x, y), or estimates of the class prior
probability Pr(y) together with the class-conditional feature probability Pr(x|y).
( ) The (unconditional) feature probability Pr(x) can be computed by summing (marginalising) Pr(x, y) over the
different values of x.
( ) The (unconditional) feature probability Pr(x) can be computed by summing (marginalising) Pr(x, y) over the
different values of y.
( ) In contrast to a multivariate Bayes classifier, a Naive Bayes classifier assumes conditional independence.
( ) A predictive classifier also provides a model of the underlying data distribution, which can be used to generate
data.
( ) For k-fold cross validation, the data set is partitioned into k subsets, and each of them is used once for testing
and the other times for training.
( ) For obtaining a reliable estimate of the performance on new data, we need to evaluate a classifier on its training
data.
( ) False positives are instances that are classified as negative, but actually are positive.
( ) In a region, where we classify all instances as positive, our false negative rate is zero.
D.2.2 Curse of Dimensionality: If you have three feature variables X1, X2, and X3, each of
which you divide into two intervals (low vs. high values), into how many cells does
this split your feature space? Put differently, how many training instances do you
need to place exactly one in each cell? How does this change with four features
X1, X2, X3, and X4?
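One way to check the counting argument in D.2.2 is to enumerate the cells explicitly. The sketch below does this for an arbitrary number of features; the helper name count_cells and the default of two intervals per feature are assumptions chosen to match the exercise.

```python
from itertools import product


def count_cells(n_features, n_intervals=2):
    """Count the cells of a grid with n_intervals intervals per feature."""
    # Each cell is one combination of interval indices, e.g. (0, 1, 0)
    # for "low, high, low" with three features.
    cells = list(product(range(n_intervals), repeat=n_features))
    return len(cells)


cells_3 = count_cells(3)
cells_4 = count_cells(4)
print(cells_3, cells_4)
```

Calling count_cells with larger values of n_features shows how quickly the number of cells, and with it the number of training instances needed to cover them, grows.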
D.2.3 In Fig. 3, the scatterplots of two data sets (a) and (b) are given. In which of
these two is the conditional independence assumption clearly violated?
[Figure 3: scatterplots (a) and (b) of positives and negatives over X1 and X2, with the class-conditional
densities P(X1|+), P(X1|-), P(X2|+) and P(X2|-) drawn along the axes]
D.2.4 Based on the AUC/ROC plotted in Fig. 4, which of the following statement(s)
is/are correct?
[Plot: ROC curves of Classifier A, Classifier B, Classifier C and Classifier D]
Figure 4: Area Under the Receiver Operating Characteristics (ROC) Curve (AUC) Plots.
D.2.5 (Univariate) Bayesian Classification
Consider a binary classification problem, with class variable Y ∈ {pos, neg} and feature variable X ∈ (0, 15). The
joint distributions P(X, Y) are given in the plot below (P(X, Y = pos) is plotted in green, P(X, Y = neg) in blue).
[Plot: P(x, y=+) and P(x, y=-) over Feature X (axis from 0 to 14); vertical axis "Probability Density Distribution"
(0 to 0.3); the instances x1, x2, x3 and x4 are marked on the x-axis]
c) Mark the Bayes’ optimal decision boundary/boundaries. How would you classify the following instances: x1 = 4,
x2 = 6, x3 = 8, x4 = 12
d) Indicate the Bayes’ error rate. Assuming a Bayes’ optimal decision, annotate the number of false negatives, and
the number of true negatives in the plot.
D.2.6 Multivariate Bayesian Classification
Consider the following, simplified golf-player classification example:
Temp   Humidity   Windy   Play
hot    high       false   no
hot    high       true    no
hot    high       false   yes
mild   high       false   yes
cool   normal     false   yes
cool   normal     true    no
cool   normal     true    yes
mild   high       false   no
cool   normal     false   yes
mild   normal     false   yes
mild   normal     true    yes
mild   high       true    yes
hot    normal     false   yes
mild   high       true    no
• Pr(Xtemp = hot)
• Pr(y = yes)
• Pr(Xtemp = hot, y = yes)
• Pr(Xtemp = hot|y = yes)
• Pr(y = yes|Xtemp = hot)
b) Explain the classification of a multivariate Bayes-optimal classifier for an instance with Xtemp = cool and
Xhum = normal and calculate the necessary probabilities.
c) Explain the classification of a Naive Bayes classifier for an instance with Xtemp = cool and Xhum = normal
and calculate the necessary values.
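Hand calculations for D.2.6 can be cross-checked by counting rows in the table. The sketch below re-enters the 14 instances and estimates the listed probabilities as relative frequencies, then evaluates the instance Xtemp = cool, Xhum = normal once from the matching rows directly and once under the Naive Bayes independence assumption. The variable and column names (golf, temp, hum, windy, play, nb_score) are assumptions chosen for this illustration.

```python
import pandas as pd

rows = [
    ("hot", "high", False, "no"), ("hot", "high", True, "no"),
    ("hot", "high", False, "yes"), ("mild", "high", False, "yes"),
    ("cool", "normal", False, "yes"), ("cool", "normal", True, "no"),
    ("cool", "normal", True, "yes"), ("mild", "high", False, "no"),
    ("cool", "normal", False, "yes"), ("mild", "normal", False, "yes"),
    ("mild", "normal", True, "yes"), ("mild", "high", True, "yes"),
    ("hot", "normal", False, "yes"), ("mild", "high", True, "no"),
]
golf = pd.DataFrame(rows, columns=["temp", "hum", "windy", "play"])

is_hot = golf["temp"] == "hot"
is_yes = golf["play"] == "yes"

# Relative frequencies as probability estimates.
p_hot = is_hot.mean()
p_yes = is_yes.mean()
p_hot_and_yes = (is_hot & is_yes).mean()
p_hot_given_yes = p_hot_and_yes / p_yes
p_yes_given_hot = p_hot_and_yes / p_hot

# Multivariate estimate: posterior from the rows that match the instance.
query = (golf["temp"] == "cool") & (golf["hum"] == "normal")
p_yes_full = (query & is_yes).sum() / query.sum()


def nb_score(label):
    """Unnormalised Naive Bayes score Pr(y) * Pr(cool|y) * Pr(normal|y)."""
    in_class = golf["play"] == label
    prior = in_class.mean()
    p_temp = (golf.loc[in_class, "temp"] == "cool").mean()
    p_hum = (golf.loc[in_class, "hum"] == "normal").mean()
    return prior * p_temp * p_hum


scores = {label: nb_score(label) for label in ("yes", "no")}
# Normalise the two scores to obtain the Naive Bayes posterior for "yes".
p_yes_nb = scores["yes"] / (scores["yes"] + scores["no"])

print(p_hot, p_yes, p_hot_and_yes, p_hot_given_yes, p_yes_given_hot)
print(p_yes_full, p_yes_nb)
```

The two posteriors differ because the Naive Bayes score multiplies per-feature frequencies instead of counting the joint feature combination.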
a) In Python, train a Naive Bayes classifier on the attributes temp and hum, and predict the class label of the
test instances. Compare the results with your calculation in the exercise above!
Hint: from sklearn.naive_bayes use GaussianNB
Hint: from sklearn use preprocessing
b) In Python, train a decision tree on the attributes temp and hum, and predict the class label of the test
instances. Compare the results with those above!
Hint: from sklearn.tree use DecisionTreeClassifier
c) Visualize the tree and see how it decides how an instance is classified.
Hint: plot_tree
d) In this exercise you will get familiar with AutoML. What is AutoML? Automated Machine Learning (AutoML)
aims to produce machine learning solutions without requiring the data scientist to search manually over
data preparation, model selection, model hyperparameters and model compression parameters.
Find the best classifier for the breast cancer dataset from scikit-learn by following this example:
https://automl.github.io/auto-sklearn/master/examples/20_basic/example_classification.html#sphx-glr-examples-20-basic-example-classification-py
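For parts a) to c), a possible starting point is sketched below. It makes several assumptions that are not fixed by the handout: the golf table from D.2.6 is re-entered by hand, OrdinalEncoder from sklearn.preprocessing is used to turn the categories into numbers, the single test instance (cool, normal) is borrowed from D.2.6, and export_text is used as a text-only stand-in for plot_tree. Note that GaussianNB treats the encoded categories as numeric values, so its probabilities will generally differ from a hand calculation based on category frequencies.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

golf = pd.DataFrame({
    "temp": ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
             "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    "hum": ["high", "high", "high", "high", "normal", "normal", "normal",
            "high", "normal", "normal", "normal", "high", "normal", "high"],
    "play": ["no", "no", "yes", "yes", "yes", "no", "yes",
             "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

# Encode the categorical attributes as numbers (categories sorted alphabetically).
encoder = OrdinalEncoder()
X = encoder.fit_transform(golf[["temp", "hum"]])
y = golf["play"]

# Hypothetical test instance, taken from the previous exercise.
test = pd.DataFrame({"temp": ["cool"], "hum": ["normal"]})
X_test = encoder.transform(test)

# a) Naive Bayes and b) decision tree, both trained on temp and hum only.
nb = GaussianNB().fit(X, y)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

nb_pred = nb.predict(X_test)[0]
tree_pred = tree.predict(X_test)[0]
print("Naive Bayes:", nb_pred)
print("Decision tree:", tree_pred)

# c) Text rendering of the fitted tree; plot_tree draws the same structure.
print(export_text(tree, feature_names=["temp", "hum"]))
```

With matplotlib available, replacing the last line by sklearn.tree.plot_tree(tree, feature_names=["temp", "hum"]) followed by plt.show() produces the graphical version suggested in the hint.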
(a) List of Items
IID   Item Name
A     Grey’s Anatomy
B     The Big Bang Theory
C     Castle
D     Downton Abbey
G     Game of Thrones
H     How I Met Your Mother

(b) Transactions / Baskets
TID   Items in Basket
1     A,B,C,D,G,H
2     A,B,C,G
3     A,D,H
4     B,C
5     B,C,D,G
6     B,C,G
7     B,D,G
8     B,G
Appendix