Data Analysis
Number:
PART – A
A) Web development
B) Statistical computing and graphics
C) Mobile app development
D) Database management
A) = =
B) ->
C) :=
D) <-
A) load()
B) install.packages()
C) require()
D) library()
6. [1 mark] Which of the following datasets is an example of time series data?
10. [1 mark] Which statistical measure represents the most frequently occurring value in a dataset?
A) Mean
B) Median
C) Mode
D) Standard deviation
PART – B
Answer All the Questions (2 X 15 Marks = 30 Marks)
11. [2 marks] Match the types of healthcare data with their respective examples:
12. [2 marks] A researcher wants to estimate the average income of households in a city. The city is divided
into several neighborhoods, each with a different socioeconomic profile. To ensure a
representative sample, the researcher decides to use stratified sampling. Which of the following
best describes stratified sampling?
A) The researcher randomly selects households from each neighborhood and combines them
into the sample.
B) The researcher selects households from a single neighborhood and includes them in the
sample.
C) The researcher divides households into income groups and randomly selects households
from each group.
D) The researcher selects households based on their proximity to the city center.
13. [2 marks] Suppose you have two variables, X and Y, representing the number of hours spent studying (X)
and the corresponding exam scores (Y) of a group of students in a class. After calculating the
covariance between X and Y, you obtain a positive value. What does this positive covariance
value indicate about the relationship between the number of hours spent studying and exam
scores?
A) There is no relationship between the number of hours spent studying and exam scores.
B) There is a negative relationship between the number of hours spent studying and exam
scores.
C) There is a tendency for the number of hours spent studying and exam scores to change in the
same direction.
D) There is a tendency for the number of hours spent studying and exam scores to change in
opposite directions.
14. [2 marks] Given a dataset with the following values: 10, 20, 30, 40, 50. Which of the following is the
normalized value (min-max) for the data point 30 if the normalization range is [0, 1]?
A) 0.25
B) 0.50
C) 0.75
D) 0.60
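A minimal sketch of min-max normalization applied to the values in question 14, offered as a worked illustration rather than an official solution. The helper name `min_max_normalize` and its `new_min`/`new_max` parameters are assumptions introduced here.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined when all values are equal")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


data = [10, 20, 30, 40, 50]
normalized = min_max_normalize(data)

# Pair each original value with its rescaled counterpart.
for original, scaled in zip(data, normalized):
    print(original, "->", scaled)
```

The formula is (x - min) / (max - min), so the midpoint of the range maps to the midpoint of [0, 1].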
15. [2 marks] If a dataset has a mean of 50 and a standard deviation of 10, what is the z-score of a data point
with a value of 60?
A) 0.5
B) 1
C) 2
D) 10
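The z-score in question 15 measures how many standard deviations a value lies from the mean. The sketch below is illustrative only; the function name `z_score` is an assumption introduced here.

```python
def z_score(x, mean, std_dev):
    """Return the number of standard deviations by which x differs from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


z = z_score(60, mean=50, std_dev=10)
print(z)  # (60 - 50) / 10
```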
16. [2 marks] You are comparing the distribution of salaries between two departments, Department X and
Department Y, within a company. After plotting box plots for both departments, you observe
that Department X has a larger IQR compared to Department Y. Which of the following
statements provides the most likely explanation for this difference in IQR between the two
departments?
17. [2 marks] Suppose you have a dataset containing information about students' exam scores and their
corresponding study hours. Due to some technical issues, 20% of the study hours data is
missing. You decide to impute the missing values using a linear regression model fitted on the
observed data. After imputation, the correlation coefficient between study hours and exam
scores increases from 0.60 to 0.75. What can you infer from this change in correlation
coefficient?
18. [2 marks] In a study analyzing the relationship between dietary habits and cardiovascular health, missing
data is observed in the variable measuring daily sodium intake. After further investigation, it is
found that participants with a family history of heart disease are less likely to report their
sodium intake accurately.
Statement:
The missing data mechanism most likely at play in this scenario is Missing Not at
Random (MNAR).
A) True
B) False
19. [2 marks] Consider the following small dataset containing the values of two variables, X and Y:
X: [2, 4, 6, 8, 10]
Y: [1, 3, 5, 7, 9]
What is the covariance between variables X and Y?
A) 8.0
B) 7.5
C) 6.0
D) 45
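Covariance can be computed with either the population convention (divide by n) or the sample convention (divide by n - 1); question 19 does not state which is intended. The sketch below computes both for the X and Y values given, as an illustration only. The function name `covariance` and its `sample` flag are assumptions introduced here.

```python
def covariance(xs, ys, sample=False):
    """Covariance of two equal-length sequences.

    sample=False divides by n (population); sample=True divides by n - 1.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the respective means.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


X = [2, 4, 6, 8, 10]
Y = [1, 3, 5, 7, 9]

population_cov = covariance(X, Y)
sample_cov = covariance(X, Y, sample=True)
print(population_cov, sample_cov)
```

A positive result under either convention indicates that the two variables tend to move in the same direction.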
20. [2 marks] Consider the following dataset representing the scores of students in a mathematics test:
{50, 60, 70, 80, 85, 90, 95, 100}
Find the interquartile range (IQR) of the dataset.
A) 30
B) 35
C) 25
D) 20
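The IQR is Q3 - Q1, but quartiles have several common definitions that give different values on small datasets. The sketch below applies three conventions to the scores in question 20: the median-of-halves rule and the two interpolation methods of Python's `statistics.quantiles`. It is an illustration, not an official solution, and the helper `quartiles_median_of_halves` is an assumption introduced here.

```python
from statistics import median, quantiles


def quartiles_median_of_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data.

    For an odd number of values the middle element is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])


scores = [50, 60, 70, 80, 85, 90, 95, 100]

q1_h, q3_h = quartiles_median_of_halves(scores)
iqr_halves = q3_h - q1_h

q1_i, _, q3_i = quantiles(scores, n=4, method="inclusive")
iqr_inclusive = q3_i - q1_i

q1_e, _, q3_e = quantiles(scores, n=4, method="exclusive")
iqr_exclusive = q3_e - q1_e

print(iqr_halves, iqr_inclusive, iqr_exclusive)
```

Because the three results differ, an answer should state which quartile convention it uses.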
21. [2 marks] In the k-means clustering algorithm, the following step is performed iteratively until
convergence.
Select which of the statement(s) is/are correct:
A) A and B are true
B) C and B are true
C) A is true
D) C is true
22. [2 marks] Consider a dataset containing measurements of air quality parameters such as temperature,
humidity, and particulate matter concentration. Which of the following models for outlier
analysis would be most suitable for identifying regions with unusual concentrations of
pollutants compared to their surrounding areas?
A) Clustering models
B) Extreme value analysis
C) Distance-based models
D) Density-based models
23. [2 marks] In the dataset {1, 3, 3, 3, 50, 97, 97, 97, 100}, which value(s) would be considered outlier(s)
based on their extreme positions in the dataset?
A) 1 and 100
B) 1 and 97
C) 50
D) 100 and 97
24. [2 marks] Consider a dataset containing spatial data points representing various locations in a city. Which
outlier detection technique would be most suitable for identifying locations with unusual
patterns of spatial distribution compared to their neighboring areas?
25. [2 marks] Consider a dataset containing measurements of a physical quantity over time. Certain
measurements exhibit sudden and unexpected changes compared to the surrounding data
points. Which outlier detection technique would be most suitable for identifying these abrupt
changes?
PART – C
26. [10 marks] You are a data analyst for a retail company that sells products both in physical stores and
online. The company is interested in understanding customer behavior to optimize its
marketing strategies. You have been provided with a dataset containing information about
customer purchases, including the products bought, the purchase amounts, and whether the
purchase was made in-store or online. Describe the key steps you would follow in the EDA life
cycle to extract valuable insights from the dataset. Illustrate the analyses or techniques you
would apply at each step.
27. [10 marks] Consider the dataset provided below, containing information about the education level of
customers. Each customer is assigned a unique identifier, and their education level is
categorized into four categories: "High School," "Bachelor's Degree," "Master's Degree," and
"Ph.D."
Customer_ID   Education_Level
1             High School
2             Bachelor's Degree
3             Master's Degree
4             Ph.D.
5             High School
Illustrate the concepts of encoding in the context of categorical data during data preprocessing.
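Two common encodings for a categorical column like `Education_Level` are ordinal (label) encoding and one-hot encoding. The sketch below applies both to the five rows of question 27 in plain Python. The ranking of the four levels from "High School" to "Ph.D." is an assumption made here for the ordinal case, and the variable names are illustrative.

```python
education = [
    "High School",
    "Bachelor's Degree",
    "Master's Degree",
    "Ph.D.",
    "High School",
]

# Assumed ordering of the categories, lowest to highest.
levels = ["High School", "Bachelor's Degree", "Master's Degree", "Ph.D."]

# Ordinal (label) encoding: each category becomes its rank in `levels`.
rank_of = {level: rank for rank, level in enumerate(levels)}
ordinal_encoded = [rank_of[value] for value in education]

# One-hot encoding: one 0/1 indicator column per category.
one_hot_encoded = [
    [1 if value == level else 0 for level in levels] for value in education
]

for customer_id, (code, row) in enumerate(zip(ordinal_encoded, one_hot_encoded), start=1):
    print(customer_id, code, row)
```

Ordinal encoding preserves the order of the categories in a single column, while one-hot encoding avoids implying any numeric distance between them at the cost of extra columns.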
28. [10 marks] Consider the following dataset containing the ages of students in a class:
{18,20,22,23,25,25,26,27,30,32,35,40,45,50}
(a) Compute the mean, median, mode, range, interquartile range (IQR), quartiles, minimum,
and maximum of the dataset.
(b) Discuss the significance of each measure in summarizing the distribution of ages in the
class.
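The measures in part (a) can be computed with Python's `statistics` module, as sketched below. This is an illustration rather than a model answer. Quartiles use the median-of-halves convention, which is an assumption; other conventions can give slightly different Q1, Q3, and IQR values.

```python
from statistics import mean, median, multimode

ages = [18, 20, 22, 23, 25, 25, 26, 27, 30, 32, 35, 40, 45, 50]

data = sorted(ages)
half = len(data) // 2

# Median-of-halves quartiles (assumed convention).
q1 = median(data[:half])
q3 = median(data[-half:])

summary = {
    "mean": mean(data),
    "median": median(data),
    "mode": multimode(data),  # list, since a dataset can have several modes
    "minimum": data[0],
    "maximum": data[-1],
    "range": data[-1] - data[0],
    "Q1": q1,
    "Q3": q3,
    "IQR": q3 - q1,
}

for name, value in summary.items():
    print(name, value)
```

A mean above the median, as here, suggests the distribution is skewed toward the older ages.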
29. [10 marks] Consider the following dataset containing the coordinates of points in a two-dimensional space:
{(2,3),(5,4),(9,6),(4,7),(8,1),(7,2),(6,5),(3,8)}
(a) Apply an appropriate clustering algorithm to partition the dataset.
(b) Discuss the iterative steps involved in the clustering algorithm.
(c) Interpret the results obtained from clustering the dataset into N clusters.
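One possible approach to question 29 is k-means, sketched below in plain Python. Several choices are assumptions made here, not part of the question: N = 2 clusters, squared Euclidean distance, and initial centroids fixed at (2, 3) and (9, 6) so that the run is reproducible. The function name `kmeans` is also illustrative.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate assignment and update steps until labels stop changing."""
    labels = None
    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        new_labels = [
            min(
                range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels

        # Update step: move each centroid to the mean of its assigned points.
        updated = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(
                    (
                        sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members),
                    )
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid unchanged
        centroids = updated
    return labels, centroids


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2), (6, 5), (3, 8)]
labels, centroids = kmeans(points, centroids=[(2, 3), (9, 6)])
print(labels)
print(centroids)
```

The result of k-means depends on the initial centroids, so a different starting choice can lead to a different partition of the same points.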
30. [10 marks] Consider the following dataset containing information about houses, including their sizes (in
square feet), the number of bedrooms, and their prices. However, some data entries are missing.
House_ID   Size (sqft)   Price ($)
1          1500          25000
2          1800          30000
3          1670          27000
4          2000          32000
5          1700          NA
(a) Apply regression techniques to impute missing values in the dataset. Work out the steps
involved and the assumptions made during this process.
(b) After imputing the missing values, discuss the potential impact on the analysis and
interpretation of the dataset.
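A minimal sketch of regression imputation for part (a): fit a least-squares line of price on size using the four complete rows, then predict the missing price for house 5. It assumes a linear relationship between size and price and that the price is missing at random given size. The helper name `fit_line` is an assumption introduced here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# (House_ID, size in sqft, price in $); None marks the missing price.
houses = [
    (1, 1500, 25000),
    (2, 1800, 30000),
    (3, 1670, 27000),
    (4, 2000, 32000),
    (5, 1700, None),
]

# Fit on the complete rows only.
observed = [(size, price) for _, size, price in houses if price is not None]
slope, intercept = fit_line([s for s, _ in observed], [p for _, p in observed])

# Replace each missing price with the value predicted from size.
imputed = [
    (house_id, size, price if price is not None else slope * size + intercept)
    for house_id, size, price in houses
]
imputed_price = imputed[4][2]
print(round(slope, 3), round(intercept, 2), round(imputed_price, 2))
```

An imputed value lies exactly on the fitted line, so it carries no residual noise. Imputing this way tends to strengthen the apparent correlation and understate the variance, which is relevant to part (b).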
31. [10 marks] Consider the following dataset containing information about the prices of houses in a
neighborhood:
House_ID   Price ($)
1          250000
2          300000
3          270000
4          320000
5          1000000
Illustrate a method for detecting outliers in this dataset. Apply the method and identify any
outliers, if present.
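One of several possible methods for question 31 is the modified z-score, which uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation. The sketch below uses the conventional scaling constant 0.6745 and cutoff 3.5; the function name `modified_z_scores` is an assumption introduced here. With only five values, the common 1.5 x IQR rule depends on the quartile convention chosen, which is why a median-based method is shown instead.

```python
from statistics import median


def modified_z_scores(values):
    """Modified z-score of each value: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]


prices = [250000, 300000, 270000, 320000, 1000000]
scores = modified_z_scores(prices)

THRESHOLD = 3.5  # conventional cutoff for flagging a potential outlier
outliers = [p for p, z in zip(prices, scores) if abs(z) > THRESHOLD]

for price, z in zip(prices, scores):
    print(price, round(z, 2))
print("outliers:", outliers)
```

Because the median and MAD are barely affected by a single extreme value, the unusually high price does not mask itself the way it can with mean-based z-scores.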
32. [10 marks] Perform student performance analysis with given data from a high school. The dataset includes
information such as student IDs, test scores, grades, attendance records.
i. Design a data visualization to analyze the relationship between test scores and
attendance records.
ii. Discuss the insights that can be gained from the data visualization and how it can help
in understanding the relationship between test scores and attendance records.
******