BUSI2045 Midterm Questions 2024 Spring
1. This is a 3-hour open-book test, which accounts for 20% of your final score.
2. Total Score is 100.
   1. Part I Multiple Choice (32%)
   2. Part II Empirical Questions (68%)
3. Submission format.
   1. Include your name and ID on the first line of your answer sheet.
   2. You can upload only one file via the Moodle submission link.
   3. You need to report both R code and results in the answer sheet.
Q1. Which of the following plots is used to check whether a variable is normally distributed?
A. Pie chart
B. Error bars
C. Box plot
D. QQ plot
Q2. If the median value for a variable is larger than its mean, and its mode value is larger than its median,
then the distribution of values of this variable tends to be
A. Negatively skewed
B. Positively skewed
C. Symmetrically distributed
D. None of the above
BUSI 2045 DATA ANALYTICS FOR BUSINESS DECISION MAKING
Q3. Which of the terms below refers to the procedure of random sampling with replacement to create
multiple re-samples from sample data?
A. Central Limit Theorem
B. Random Sampling
C. Selection Bias
D. Bootstrapping
Q5. Given x1 = 1:4, x2 = 5:8 and x3 = 9:12, you want to create a matrix with 4 rows and 3 columns named
m1 by combining the three vectors. Which of the following statements is correct?
A. You can create m1 by m1 = rbind(x1, x2, x3).
B. You can create m1 by m1 = cbind(x1, x2, x3).
C. You can create m1 by m1 = matrix(x1, x2, x3).
D. The output of length(m1) is 4.
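The difference between row-wise and column-wise binding can be checked directly with dim() and length(). The following is a minimal sketch using the vectors from Q5; the names by_row and by_col are illustrative and not part of the question.

```r
# Vectors as defined in Q5
x1 <- 1:4
x2 <- 5:8
x3 <- 9:12

# rbind() stacks each vector as a row; cbind() places each vector as a column
by_row <- rbind(x1, x2, x3)
by_col <- cbind(x1, x2, x3)

# dim() reports c(rows, columns); length() counts every cell of a matrix
print(dim(by_row))
print(dim(by_col))
print(length(by_col))
```

Comparing the two dim() results against the required shape (4 rows, 3 columns) shows which binding function fits.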
Q6. Which of the following lines of code will produce FALSE as the output?
A. is.numeric(1:5)
B. is.numeric(c(1,2,3))
C. is.numeric('123')
D. none of the above
Q7. Which of the following scenarios fulfils the principle of random sampling?
A. Estimating the average GPA of HKBU students with only BUSI2045 students in the sample.
B. Estimating the average income of all Hong Kong workers with only doctors included in the sample.
C. Estimating the average housing price in Hong Kong with only houses on Hong Kong Island in the sample.
D. None of the above.
Q8. If you would like to produce a scatter plot and add a small amount of random variation to the location of
each point, which of the following functions should you use?
A. geom_point()
B. geom_jitter()
C. geom_boxplot()
D. geom_pointrange()
Q9. Which of the following statements can produce a plot like the below?
Load the iris data from the package datasets, which is preinstalled with base R, and answer Q12-14
accordingly. (Hint: you may simply run the code data(iris) to load the data)
Q13. Create a subset of the iris data in which Sepal.Width values are larger than 2.5. For the variable
Species, how many times does the value ‘setosa’ appear in this subset?
A. 41
B. 49
C. 50
D. 139
Q14. Create another subset of the iris data in which Petal.Width values are larger than 1. How many unique
Species values are there in this subset?
A. 3
B. 2
C. 1
D. 0
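One possible way to approach Q13 and Q14 in base R is sketched below. The object names wide and thick are hypothetical, and the course may expect a different approach (for example dplyr::filter()), so treat this as an illustration only.

```r
# iris ships with base R in the datasets package
data(iris)

# Q13-style subset: keep rows where Sepal.Width exceeds 2.5
wide <- iris[iris$Sepal.Width > 2.5, ]

# Frequency of each Species value in the subset
print(table(wide$Species))

# Q14-style subset: subset() is an equivalent way to filter rows
thick <- subset(iris, Petal.Width > 1)

# unique() counts only the values actually present in the subset
print(length(unique(thick$Species)))

# Caution: Species is a factor, so nlevels() still reports every declared
# level, including levels with no rows left, unless droplevels() is applied
print(nlevels(thick$Species))
print(nlevels(droplevels(thick$Species)))
```

The distinction between unique() and nlevels() matters here because filtering a data frame does not drop unused factor levels automatically.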
Load the data set Duncan from the package carData, and answer Q15-16. This dataset records information on
the prestige and other characteristics of 45 U.S. occupations in 1950, based on social survey data. Occupation
names are set as row names.
(Hint: you may need to install and load the package carData before loading its data Duncan into R)
Q15. The variable prestige records the percentage of respondents who rated the occupation as “good” or
better in prestige. What is the maximum prestige value, and which occupation receives the highest prestige?
A. 97, physician
B. 17, engineer
C. 17, professor
D. 3, shoe.shiner
Q16. The variable type records the type of occupation, with “prof” representing professional and managerial,
“wc” representing white-collar, and “bc” representing blue-collar. Which occupation type occurs most often
in this dataset?
A. Professional and managerial
B. Blue-collar
C. White-collar
D. none of the above
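Because the occupation names are stored as row names, finding the occupation with the highest value means looking up a row name rather than a column. The sketch below shows one way to do this; row_of_max() is a hypothetical helper written for illustration, and the code assumes carData has already been installed with install.packages("carData").

```r
# Hypothetical helper: row name of the row holding the largest value in `column`
row_of_max <- function(df, column) {
  rownames(df)[which.max(df[[column]])]
}

# Run only when carData is available (assumption: it has been installed)
if (requireNamespace("carData", quietly = TRUE)) {
  data("Duncan", package = "carData")

  # Largest prestige value and the occupation (row name) that holds it
  print(max(Duncan$prestige))
  print(row_of_max(Duncan, "prestige"))

  # Frequency of each occupation type, and the most frequent one
  type_counts <- table(Duncan$type)
  print(type_counts)
  print(names(which.max(type_counts)))
}
```

Note that which.max() returns only the first position of the maximum, so ties would need a separate check such as which(df[[column]] == max(df[[column]])).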
(a) How many variables and observations are there? What are the data types for these variables?
(b) Find the mean, the 0.25 and 0.75 quantiles of the variable alpha.
(c) Construct a frequency table of the variable delta as below.
? ? ?
Create a subset named subset1 in which the variable beta contains no missing values, then answer the questions below.
(d) How many observations are there in subset1?
(e) Find the mean, the 0.25 and 0.75 quantiles of variable alpha in subset1.
(f) With subset1, visualize the average gamma value in each delta level. Your output should look similar
to the graph below. (Hints: pay attention to the axis labels, legend title, and plot title)
(b) Create a two-way table to show the number of customers separated by Education levels and
Marital_Status. How many customers are both “married” and with a master’s degree?
(c) Display the number of customers across different Marital_Status and Education levels with a bar plot.
The plot should look like the one below. (Hints: pay attention to plot title, legend title, and axis labels)
(d) Create a subset named subset2 in which the “Divorced” people are excluded (variable
Marital_Status). How many customers are there in the subset?
(e) With subset2, visualize the distribution of the variable Income across different Marital_Status and
Education levels with a boxplot. Your result should look like the one below. (Hints: pay attention to the
title, x-axis, and y-axis labels)
(f) Based on the boxplot in step (e), answer the following two questions:
i. Which education level tends to have lower income in general? Explain your answer.
ii. Is the income of the customers with education level ‘Graduation’ higher after getting married in
general? Explain your answer.
set.seed(2024)
X <- rnorm(400, mean=172, sd=10)
(a) Visualize values in X with a density plot and mark their mean with a red vertical line. Your result should
look like the one below. (Hint: you may need to convert the vector X to a data frame before plotting)
(b) Assume 𝑿 is a random sample representing the height of all residents in Hong Kong. If we collect
multiple random samples, each with the same sample size (i.e., 𝑛 = 400), from the Hong Kong
population, will these sample means be normally distributed? Why?
(c) Calculate the standard error (of the mean) with the mathematical approximation based on sample standard
deviation and sample size. What does the standard error measure?
(d) Calculate the 95% confidence interval (of the mean) via bootstrapping with 5000 resamples. What does it
tell us? (Note: set the random seed as 2024)
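The calculations in (c) and (d) can be sketched in base R as follows. This is one possible approach, not necessarily the one taught in the course: it assumes the percentile method for the bootstrap interval and assumes the seed is reset to 2024 immediately before resampling. The names se_formula, boot_means, ci_95, and se_boot are illustrative.

```r
# Simulated sample as given in the question
set.seed(2024)
X <- rnorm(400, mean = 172, sd = 10)

# (c) Standard error of the mean from the formula s / sqrt(n)
se_formula <- sd(X) / sqrt(length(X))

# (d) Bootstrap: draw 5000 resamples of X with replacement (same size as X)
#     and record the mean of each resample
set.seed(2024)  # assumption: seed reset right before resampling
n_boot <- 5000
boot_means <- replicate(
  n_boot,
  mean(sample(X, size = length(X), replace = TRUE))
)

# Percentile 95% interval: the 2.5th and 97.5th percentiles of the resample means
ci_95 <- quantile(boot_means, probs = c(0.025, 0.975))

# The standard deviation of the resample means is a bootstrap estimate of the
# standard error, which can be compared with the formula-based value
se_boot <- sd(boot_means)

print(se_formula)
print(ci_95)
print(se_boot)
```

The formula-based and bootstrap standard errors should be close to each other, since both estimate how much the sample mean varies from one sample of size 400 to another.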