Self Test Master Data Science SoSe 2021 2
• The participation in this assessment is required during the application process for the master programme
Data Science.
• The self assessment is also intended to give you an indication of the extent to which it is recommended that
you refresh your knowledge before beginning your studies. The exact result is solely for your orientation and
is not evaluated by us.
• To provide your answers, please use the input mask available on the ‘EvaSys’ website:
https://evaluation.tu-dortmund.de/evasys/online.php?p=V1HSK
There, the questions are numbered as below and also identified by a keyword.
• We recommend that you work on the assessment offline and then enter your prepared answers in
the input mask. Do not interrupt the process of filling in the input mask.
• After answering all the questions, you obtain a certificate of attendance, which you have to fill in, sign, and
submit during the application process.
• The questions are multiple-choice; each comes with a set of possible answers, every one of which is either
true or false. You can mark each answer as ‘true’ or ‘false’, or click on ‘no idea’; you receive 1 point per
answer for correctly marking it as ‘true’ or ‘false’, lose 1 point for an incorrect marking, and receive 0 points
for ‘no idea’. (Hence, you can receive as many points for a question as there are answers.)
• In Section 3.4 on statistical programming, you can choose between R and Python versions of the same
questions. If you do not know either of these programming languages, select the answer option ‘no idea’ for all
these questions.
1 Mathematics
• The symbol ln denotes the natural logarithm, that is, with base e.
• The symbols Z, Q, R, C denote the sets of integers, rational numbers, real numbers, and complex numbers,
respectively.
1.1 Calculus
Question 1. For a ∈ R, which statements do hold for the one-sided limit
$\lim_{x \to a^+} \frac{\ln(x - a)}{\ln(e^x - e^a)}$ ?
a) The limit exists.
b) Its value is 0.
c) Its value is 1.
d) Its value is $e^a$.
Question 2. Which statements do hold for the definite integral $\int_0^{\pi} e^{\cos x} \sin x \, dx$ ?
d) The derivative equals $f'(x) = \frac{1}{1 - x}$.
a) The system is consistent.
b) The sum of any two solutions is a solution.
c) The system has a unique solution.
d) The system has infinitely many solutions.
Question 6. Which are eigenvalues of the matrix
$\begin{pmatrix} 3 & 2 & 5 \\ 0 & 2 & 3 \\ 0 & 1 & 4 \end{pmatrix}$ ?
a) 2
b) 3
c) 5
d) 0
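A candidate value λ can be tested by checking whether det(A − λI) vanishes. The following is a minimal Python sketch of that check for the matrix above; the helper names `det3` and `is_eigenvalue` are illustrative and not part of the question.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def is_eigenvalue(matrix, lam):
    """lam is an eigenvalue exactly when det(matrix - lam * I) is zero."""
    shifted = [
        [matrix[r][c] - (lam if r == c else 0) for c in range(3)]
        for r in range(3)
    ]
    return det3(shifted) == 0


# Matrix from Question 6; all entries are integers, so the test is exact.
A = [[3, 2, 5], [0, 2, 3], [0, 1, 4]]
for candidate in (2, 3, 5, 0):
    print(candidate, is_eigenvalue(A, candidate))
```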
$0 \le z \le 4 - x^2 - y^2$
x(θ) = 4 cos θ
y(θ) = 4 sin θ
z(θ) = 3θ
Let L(θ) be the arclength of the helix from the point P (θ) = (x(θ), y(θ), z(θ)) to the point P (0) = (4, 0, 0), and let
D(θ) be the distance between P (θ) and the origin (0, 0, 0). Let L(θ) = 10. Which statements do hold?
a) θ = 4
b) θ = 2
c) To calculate the value of D for a given θ, x(θ) and y(θ) have to be evaluated explicitly.
d) $D(\theta) = \sqrt{52}$
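The quantities in this question can be checked numerically. The sketch below assumes θ ≥ 0 and uses the fact that the speed |P'(θ)| of this helix is constant; the variable names are illustrative.

```python
from math import sqrt

# Helix from the question: x = 4 cos(theta), y = 4 sin(theta), z = 3 theta.
RADIUS, PITCH = 4, 3

# The speed |P'(theta)| = sqrt(RADIUS^2 + PITCH^2) is constant,
# so the arclength from P(0) is L(theta) = speed * theta.
speed = sqrt(RADIUS**2 + PITCH**2)
theta = 10 / speed

# x^2 + y^2 = RADIUS^2 for every theta, so D depends on theta only through z.
distance = sqrt(RADIUS**2 + (PITCH * theta) ** 2)

print(theta, distance, sqrt(52))
```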
1.4 Differential Equations
Question 10. Let y : R → R be the real-valued function defined on the real line, which is the solution of the initial
value problem
$y' = -xy + x, \quad y(0) = 2.$
Which statements are correct?
a) The problem is not uniquely solvable.
b) The solution y(x) contains an exponential function.
c) $\lim_{x \to \infty} y(x) = 1$
d) $\lim_{x \to \infty} y(x) = 0$
2 Computer Science
2.1 Data Structures
Question 11. The number of steps taken for searching the value x in a binary tree with n nodes . . .
a) depends on x.
b) depends on n.
c) is $O(\log_2 n)$.
d) is $O(\log_x n)$.
Question 12. The average-case performance when looking up a single search key . . .
a) is better with a Linked List than with a Hash Table.
b) is better with a Hash Table than with an Array.
c) is better with a Binary Search Tree than with a Hash Table.
d) is the same with a Linked List, an Array, and a Hash Table.
Question 13. Given 100 000 numbers, the minimum height of a binary search tree that can store all these numbers . . .
a) depends on the numbers.
b) is larger than 20 levels.
c) is smaller than 19 levels.
d) can be calculated as $\log_{10}(100\,000)$.
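The minimum height can be found by filling the tree level by level. The sketch below assumes that height is counted in levels and that each number occupies its own node; the function name `min_levels` is illustrative.

```python
def min_levels(n):
    """Smallest number of levels of a binary tree that holds n nodes."""
    levels = 0
    capacity = 0  # a tree with `levels` completely filled levels holds 2**levels - 1 nodes
    while capacity < n:
        levels += 1
        capacity = 2**levels - 1
    return levels


print(min_levels(100_000))
```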
Question 14. Which of the following statements are correct for a max-heap?
a) The root always contains the largest key.
b) All keys in the left subtree are always smaller than any key in the corresponding right subtree.
c) All leaves are located on the same level.
d) Each subtree is also a max-heap.
Question 15. Which of the following statements are correct for a binary search tree?
a) The root always contains the largest key.
b) All keys in the left subtree are always smaller than any key in the corresponding right subtree.
c) All leaves are located on the same level.
d) Each subtree is also a binary search tree.
Question 16. The following operations are applied to an empty stack s:
s.push(1)
s.push(2)
s.push(3)
s.pop()
s.push(4)
s.pop()
d) 2
d) C2 is a subclass of C1.
Question 19. The following function f uses recursion:
def f(n):
    if n <= 1
        return n
    else
        return f(n-1) + f(n-2)
Let n be a valid input, i.e., a natural number. Which of the following functions returns the same result but without
recursion?
a) def f(n):
       a <- 0
       b <- 1
       if n = 0
           return a
       elsif n = 1
           return b
       else
           for i in 1..n
               c <- a + b
               a <- b
               b <- c
           return b
b) def f(n):
       a <- 0
       i <- n
       while i > 0
           a <- a + i + (i-1)
       return a
c) def f(n):
       arr[0] <- 0
       arr[1] <- 1
       if n <= 1
           return arr[n]
       else
           for i in 2..n
               arr[i] <- arr[i-1] + arr[i-2]
           return arr[n]
d) def f(n):
       arr[0..n] <- [0, ..., n]
       if n <= 1
           return arr[n]
       else
           a <- 0
           for i in 0..n
               a <- a + arr[i]
           return a
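One way to test candidates like the ones above is to compare them with the recursive definition for small inputs. The following is a minimal Python sketch; the names `f_recursive`, `agrees_with_recursion` and `candidate` are illustrative, and `candidate` is just one possible loop-based function, not a transcription of any answer option.

```python
def f_recursive(n):
    """Python transcription of the recursive function f from the question."""
    if n <= 1:
        return n
    return f_recursive(n - 1) + f_recursive(n - 2)


def agrees_with_recursion(candidate, upto=15):
    """Return True if candidate(n) equals f_recursive(n) for every n in 0..upto."""
    return all(candidate(n) == f_recursive(n) for n in range(upto + 1))


def candidate(n):
    """Illustrative loop-based function: keeps the last two values and updates them n times."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print(agrees_with_recursion(candidate))
```

To test a transcription of one of the answer options, pass it to `agrees_with_recursion` in place of `candidate`. A loop that never terminates has to be spotted by reading the code, since the comparison would not return.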
a) A ∧ (B ∨ C) = (A ∧ B) ∨ (A ∧ C)
b) A ∨ (B ∧ C) = (A ∨ B) ∧ (A ∨ C)
c) (A ∧ B) ∨ C = C ∨ (B ∧ A)
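Identities of this kind can be checked by brute force over all truth assignments. The sketch below assumes three Boolean variables A, B, C; the helper name `equivalent` is illustrative.

```python
from itertools import product


def equivalent(lhs, rhs, n_vars=3):
    """Return True if lhs and rhs agree on every assignment of n_vars Boolean variables."""
    return all(
        lhs(*values) == rhs(*values)
        for values in product([False, True], repeat=n_vars)
    )


# Identity a) from the list above: A and (B or C) versus (A and B) or (A and C).
print(equivalent(
    lambda a, b, c: a and (b or c),
    lambda a, b, c: (a and b) or (a and c),
))
```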
Question 21. A large retail company keeps sales data local to the individual branches where sales transactions
were performed. To compute overall sales statistics, the company wants to avoid sending the full sales data set to
a central server. Instead, only aggregated sales information (sum, average, minimum, variance, median, maximum)
is sent from each branch to the central site. Which of the following statements are correct?
a) The overall sum can be derived from the sums per branch.
b) The overall average can be derived from the averages per branch.
c) The overall minimum can be derived from the minimums per branch.
d) The overall variance can be derived from the variances per branch.
e) The overall median can be derived from the medians per branch.
f) The overall maximum can be derived from the maximums per branch.
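The behaviour of the different aggregates can be explored on a small example. The sketch below uses made-up sales figures for two hypothetical branches and compares values combined from per-branch aggregates with values computed on the full data. Note that the weighted average uses the branch sizes, which are not among the aggregates listed in the question.

```python
from statistics import mean, median

# Hypothetical sales figures for two branches (made up for illustration).
branches = [[10, 20, 30], [5, 100]]
all_sales = [s for branch in branches for s in branch]

# Sum, minimum and maximum combined from the per-branch values.
total = sum(sum(b) for b in branches)
lowest = min(min(b) for b in branches)
highest = max(max(b) for b in branches)

# Averages: weighted by branch size (extra information) versus unweighted.
counts = [len(b) for b in branches]
weighted_mean = sum(n * mean(b) for n, b in zip(counts, branches)) / sum(counts)
mean_of_means = mean(mean(b) for b in branches)

# Median of the per-branch medians versus the median of the full data.
median_of_medians = median(median(b) for b in branches)

print(total == sum(all_sales), lowest == min(all_sales), highest == max(all_sales))
print(weighted_mean, mean(all_sales), mean_of_means)
print(median_of_medians, median(all_sales))
```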
Question 22. Consider the following table in a relational database.
Last Name   Rank        Room   Shift
Smith       Manager     234    Morning
Jones       Custodian   33     Afternoon
Smith       Custodian   33     Evening
Doe         Clerical    222    Morning
According to the data shown in the table, which of the following could be candidate keys of the table?
a) {Last Name}
b) {Room}
c) {Shift}
d) {Rank, Room}
e) {Room, Shift}
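A set of attributes can only be a candidate key if its values identify each row uniquely and no proper subset already does so. The following minimal Python sketch applies this test to the rows shown in the table; the function names are illustrative, and the check only reflects the data shown, not constraints that might hold for the table in general.

```python
from itertools import combinations

# Rows of the table from the question: (Last Name, Rank, Room, Shift)
COLUMNS = ("Last Name", "Rank", "Room", "Shift")
ROWS = [
    ("Smith", "Manager", 234, "Morning"),
    ("Jones", "Custodian", 33, "Afternoon"),
    ("Smith", "Custodian", 33, "Evening"),
    ("Doe", "Clerical", 222, "Morning"),
]


def is_unique(attrs):
    """True if no two rows share the same values on the given attributes."""
    idx = [COLUMNS.index(a) for a in attrs]
    projected = [tuple(row[i] for i in idx) for row in ROWS]
    return len(set(projected)) == len(projected)


def could_be_candidate_key(attrs):
    """Unique on the shown data, and no proper subset is already unique (minimality)."""
    attrs = tuple(attrs)
    if not is_unique(attrs):
        return False
    return not any(
        is_unique(subset)
        for size in range(1, len(attrs))
        for subset in combinations(attrs, size)
    )


print(could_be_candidate_key(["Room", "Shift"]))
```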
Question 23. The database interface of a library allows searching only for a single attribute (such as Title
or Author) in each query. Your friend decided to extend its functionality and wrote an algorithm that allows
searching for books that satisfy multiple predicates over single attributes in conjunction. He tells you the algorithm
reuses the already implemented query functionality and works by intersecting the results (book IDs) of queries
over single attributes.
Which of the following assumptions on your friend’s algorithm are plausible?
a) Its worst-case run-time necessarily increases exponentially with respect to the number of attributes in the
query.
b) Its worst-case run-time depends on the length of the longest result of the single-attribute queries.
c) It might be implemented using a join.
d) It might be implemented using sorting.
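The approach described in the question can be sketched as follows. This is only an illustration under stated assumptions: `query` stands in for the existing single-attribute search, the catalogue is made up, and hash-based set intersection is one of several possible ways to intersect the results.

```python
def query(books, attribute, value):
    """Hypothetical single-attribute query: IDs of books whose attribute equals value."""
    return [book_id for book_id, book in books.items() if book.get(attribute) == value]


def conjunctive_query(books, predicates):
    """Intersect the ID lists returned by one single-attribute query per predicate."""
    results = [set(query(books, attr, value)) for attr, value in predicates.items()]
    if not results:
        return set(books)  # no predicates: every book qualifies
    return set.intersection(*results)


# Made-up catalogue for illustration.
catalogue = {
    1: {"Title": "Dune", "Author": "Herbert"},
    2: {"Title": "Dune", "Author": "Someone Else"},
    3: {"Title": "Emma", "Author": "Austen"},
}
print(conjunctive_query(catalogue, {"Title": "Dune", "Author": "Herbert"}))
```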
2.5 Computer Architecture
Question 26. In computer architecture, SIMD may refer to the situation where...
a) multiple CPU cores can access the same memory concurrently.
b) the same operation can be applied to multiple operands with only a single instruction.
c) multiple independent instructions can be executed at the same time in the same CPU core.
d) multiple independent memory banks show up as a single address space.
3 Statistics
3.1 Descriptive Statistics
Question 27. Which of the following sets have an arithmetic mean of 100, but a median smaller than 100?
a) {80, 100, 120}
b) {80, 80, 140}
c) {0, 50, 150}
d) {60, 120, 120}
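Both conditions can be evaluated directly with Python's `statistics` module. The sketch below prints the mean and the median of each set; the function name `mean_100_median_below` is illustrative.

```python
from statistics import mean, median


def mean_100_median_below(values):
    """True if the arithmetic mean is 100 and the median is smaller than 100."""
    return mean(values) == 100 and median(values) < 100


for candidate in ([80, 100, 120], [80, 80, 140], [0, 50, 150], [60, 120, 120]):
    print(candidate, mean(candidate), median(candidate), mean_100_median_below(candidate))
```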
Question 28. Can there be a set of data fitting to both the following histograms? Which of these answers are
correct?
[Figure: two histograms. Histogram 1: x from −4 to 8, Density from 0.00 to 0.15. Histogram 2: y from 0 to 50, Density from 0.00 to 0.06.]
a) No, because the right one is calculated from positive data only.
b) Yes, the right one includes all possible data from which the left one may be calculated.
c) No, the right one must be calculated with at least one value greater than 8.
d) No, the left one can not have been calculated with a value of 10 or more.
Question 29. Calculate estimates of the standard deviations $s_x$, $s_y$ of the samples x = (5, 9, 7) and y = (−1, 2, 5)
as well as the Pearson coefficient of correlation $r_{xy}$ of x and y. Which of the following answers are correct?
a) $s_x = 4$, $s_y = 9$
b) $r_{xy} = 0$
c) $s_x = 2$, $s_y = 3$
d) $r_{xy} = \frac{1}{2}$
e) $r_{xy} = \frac{1}{4}$
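The estimates can be computed by hand or with a few lines of Python. The sketch below assumes the usual sample estimators with the n − 1 denominator; the function names are illustrative.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation with the n - 1 denominator."""
    n = len(values)
    m = sum(values) / n
    return sqrt(sum((v - m) ** 2 for v in values) / (n - 1))


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (sample_sd(xs) * sample_sd(ys))


x = (5, 9, 7)
y = (-1, 2, 5)
print(sample_sd(x), sample_sd(y), pearson_r(x, y))
```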
[Figure: scatter plot of x2 (from −2 to 1) against x1 (from −2 to 1).]
3.2 Probability
Question 32. There are 8 socks in your drawer: 4 black and 4 red. You take 3 of them with you in the dark.
Which statements are correct?
a) It is sure that you get at least two socks (a pair) of the same colour.
b) It is sure that you get a pair of reds.
c) The probability to get 3 of the same colour is $\frac{1}{8}$.
d) The probability to get 3 of the same colour is $\frac{1}{7}$.
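The probability of drawing three socks of the same colour can be obtained by counting combinations. A minimal Python sketch, assuming every set of three socks is equally likely:

```python
from fractions import Fraction
from math import comb

BLACK, RED, DRAWN = 4, 4, 3

# Ways to draw 3 socks of one colour, divided by all ways to draw 3 of the 8 socks.
same_colour = comb(BLACK, DRAWN) + comb(RED, DRAWN)
all_draws = comb(BLACK + RED, DRAWN)
p_same_colour = Fraction(same_colour, all_draws)

print(p_same_colour)
```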
Question 33. In the sports injuries unit of a hospital, 40% of the patients are rugby players, 20% are swimmers
and the remaining 40% play soccer. For a rugby player, the probability to be released on the first day is 10%; for
a swimmer, it is 20%; for a soccer player, it is 80%. Which of the following statements are correct?
a) 40% of all patients are released on the first day.
b) Given a patient is released on the first day, the probability of her/him being a soccer player is 80%.
c) 80% of the non-swimmers have to stay for more than one day.
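The three statements can be checked with the law of total probability and Bayes' theorem. The sketch below uses exact fractions; the dictionary and variable names are illustrative.

```python
from fractions import Fraction

# Shares of patients and first-day release probabilities from the question.
share = {"rugby": Fraction(40, 100), "swimming": Fraction(20, 100), "soccer": Fraction(40, 100)}
release = {"rugby": Fraction(10, 100), "swimming": Fraction(20, 100), "soccer": Fraction(80, 100)}

# Law of total probability: P(released on the first day).
p_released = sum(share[s] * release[s] for s in share)

# Bayes' theorem: P(soccer player | released on the first day).
p_soccer_given_released = share["soccer"] * release["soccer"] / p_released

# Among non-swimmers: P(stays longer than one day | not a swimmer).
non_swimmers = ("rugby", "soccer")
p_stay_given_non_swimmer = (
    sum(share[s] * (1 - release[s]) for s in non_swimmers)
    / sum(share[s] for s in non_swimmers)
)

print(p_released, p_soccer_given_released, p_stay_given_non_swimmer)
```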
Question 34. Let X be a random variable with probability density function
$f(x) = \begin{cases} \frac{1}{9} x^2, & x \in [0, 3], \\ 0, & \text{else.} \end{cases}$
with parameters α > 0 and β > 0. We observe a sample {3, 4, 8}. Which of the following statements are correct?
a) The expected value of X exists for all combinations of α and β.
b) The expected value does only depend on α, but not on β.
c) A p-value is the probability that the null hypothesis is correct, given the observed data.
d) If we obtain a p-value of 0.04, we will reject (level α = 0.05) the null hypothesis.
Question 37. One of the lines in the following scatter plot is the regression line fitted to the data. Which of the
statements are correct?
[Figure: scatter plot of the data with the candidate lines; vertical axis y, from 10 to 80.]
a) The red and green line have the right direction, and, hence, one of them could be the regression line.
b) The blue line seems to represent the mean value of the data with respect to y and thus could be the regression
line.
c) The point in the top right corner has a strong influence on the regression line.
d) Leaving aside the point in the corner, the red line seems to fit better to the rest of the data.
Question 38. You have performed a linear regression analysis to explore sunflowers’ growth (in meters per month)
depending on the watering (in litres per day). You have estimated the regression coefficient to be β̂ = 1.6. What
can you conclude?
a) There is a significant correlation between watering and growth.
b) An average sunflower grows 1.6 meters per month.
c) If you give it an additional litre of water per day, there will be an additional average growth of 1.6 meters per
month.
d) According to the model assumptions, an additional litre of water per day will result in an additional 19.2 meters
of growth after one year.
e) You should consider further influencing quantities.
a) 5 >= 5
b) TRUE & FALSE | FALSE & TRUE
c) FALSE & FALSE & FALSE | TRUE
d) !(((TRUE > FALSE) > TRUE) & !TRUE)
x <- 0
while (x < 4) {
  x <- sample(1:3, 1)
  print(x)
}
It is not a good idea to run these lines because...
a) x is an invalid argument to print().
b) the condition x < 4 is never violated.
c) the function sample() does not exist.
d) x is initialised with the wrong type.
Question 41. Which of the following code lines return TRUE?
a) max(c(2, 3, 4, NA, 1, 5)) == NA
b) max(c(2, 3, 4, NA, 1, 5), na.rm = TRUE) == 5
c) typeof(sum(c(1, 2, 3, 4, NA))) == "double"
d) typeof(sum(1:4)) == "integer"
e) typeof(sum(c(1L, 2L, 3L, 4L, NA_real_), na.rm = TRUE)) == "integer"
Question 42. Which functions may have been used to generate the following plot and its underlying data?
a) lm()
b) points()
c) abline()
d) integrate()
Question 43. Consider the following code chunk and output and note that NA appears in the output of lm().
X1 <- rnorm(1e2)
X2 <- X1 + 3
Y <- X1 + X2 + rnorm(1e2)
lm(Y ~ X1 + X2)
Call:
lm(formula = Y ~ X1 + X2)
Coefficients:
(Intercept)           X1           X2
      2.979        2.019           NA
b) lm() excludes X2 from the regression so that there is a least squares solution.
c) NA indicates that the model fit to the data is perfect.
b) numpy.nanargmax(numpy.array([2,3,4,numpy.NAN,1,5])) == 5
c) type(numpy.array([1,2,3,4,numpy.NAN]).sum()) is numpy.float64
d) type(numpy.array([1,2,3,4],dtype=object).sum()) is int
e) type(numpy.array([1,2,3,4]).sum()) is int
Question 42. Which packages may have been used to generate the following plot and its underlying data?
[Figure: plot of Y against X.]
a) numpy
b) mathplotlib
c) statsmodels
d) math
Question 43. Consider the following code chunk and output and note that there are two warnings.
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm
import matplotlib.pyplot as plt
X1 = np.random.normal(0, 1, 100)
X2 = X1 + 3
Y = X1 + X2 + np.random.normal(0, 1, 100)
Output:
Intercept -0.0272
X1 1.1074
X2 1.0257
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.27e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Which of the following statements are correct?
a) Perfectly correlated regressors X1 and X2 are used.
b) Either X1 or X2 should be excluded, as the second regressor does not add any information to the model.
c) The second warning indicates that the model fit to the data is perfect.
4 Data Science
Question 44. Consider a data set data containing all German inhabitants, which is subsetted in the following
process:
Select which statements are true after all the subsets have been applied.
a) The data set contains all brown haired and married females worldwide.
b) The data set contains all brown haired German inhabitants.
c) The data set contains all brown haired and married female German inhabitants.
d) The data set contains all brown haired, married, female German inhabitants with at least 2 children.
Question 45. For what ultimate purposes may algorithms like Nelder-Mead, Newton-Raphson or gradient descent
be used?
a) To find the minimum of a function.
b) To find all zeros of a function.
c) To evaluate the derivative of a function.
d) To solve a generalised regression problem.
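As a reminder of how one of the named algorithms works, the following is a minimal Python sketch of gradient descent on a one-dimensional function. The objective g(x) = (x − 3)^2, the step size and the iteration count are illustrative assumptions, not part of the question.

```python
def gradient_descent(grad, x0, step=0.1, iterations=200):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(iterations):
        x = x - step * grad(x)
    return x


# Illustrative objective g(x) = (x - 3)^2 with derivative g'(x) = 2 (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```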
Question 46. The Titanic data set contains information on whether passengers of the Titanic survived the shipwreck,
along with their gender, age and passenger class. The following decision tree has been learned on this data. Which
of the statements are true?
[Figure: decision tree. Root node: died, 0.38, 100%. Further nodes: died, 0.19, 64%; survived, 0.73, 36%; survived, 0.58, 3%; died, 0.49, 17%.]
Question 47. Random forests are one of the most famous machine learning methods. They are easy to understand,
easy to implement and reach good prediction performances even without a hyper-parameter tuning. Which of the
following statements on random forests are correct?
a) The prediction of a classification forest is made by a majority vote of the trees’ predictions.
b) The prediction of a regression forest is the median of the tree predictions.
c) Each single tree in the forest uses only a part of the data available.
d) The training time of a random forest scales linearly with the number of trees used.
Question 48. Let us return to the Titanic data set. We now have learned several models and want to choose the
best one. We used three different methods to validate these models: The training error rate (apparent error rate),
the error rate on an external test set and the error rate estimated by a 10-fold cross validation.
Learner               Training Error   Error on the test set   Cross Validation Error
Decision Tree         0.18             0.22                    0.21
Random Forest         0.01             0.10                    0.12
1-Nearest-Neighbour   0                0.18                    0.19
a) 1-Nearest-Neighbour has a perfect training error and hence it should be used here.
b) The Random Forest outperforms both 1-Nearest-Neighbour and the Decision Tree in terms of prediction error.
c) Not just in this case, but in general, Cross Validation is the better validation strategy and should always be
preferred over the error on a single test set.
d) Not just in this case, but in general, Decision Trees always perform worse than Random Forests.
Question 49. We try a last model class to find the perfect model for the Titanic data-set: An SVM. The SVM is a
model class that is very sensitive to hyper-parameter tuning. Especially, the cost parameter C and the bandwidth
of the RBF kernel λ must be optimally adjusted in order to obtain a sensible model.
We use a nested resampling strategy to perform this hyper-parameter tuning: At first, 33% of the data are
laid aside as an external test set, to validate the result of the hyper-parameter tuning itself (the outer resampling
strategy). We use a random search as the tuning algorithm with a budget of 100 iterations. As parameter spaces,
we use all positive real numbers for both C and λ. The performance of a single hyper-parameter setting is evaluated
using a 10-fold cross validation (the inner resampling strategy). Moreover, in order to speed up the entire tuning
process, we utilise parallel computing.
Which of the following statements are correct?
[Figure: scatter plot of x2 (from −1.0 to 1.0) against x1.]
It is a classification data-set with the goal of separating the red and the black observations. Assume that the
number of red and black observations is approximately equal. Which of the following statements are correct?
a) A Decision Tree can reach a prediction error of (nearly) zero on this data-set.
b) When performing a variable selection using the step-wise forward selection algorithm, neither of the variables
x1, x2 will be added to the model.
c) A Linear Discriminant Analysis (LDA) can reach a prediction error of (nearly) zero on this data-set.
d) Every model using only one of the two variables x1, x2 will have a misclassification error of approximately
50%.