Data Analysis

Uploaded by

ssaurabh_ss
Copyright
© All Rights Reserved

Reg. Number:

Model Question Paper

Programme : Online M.Sc. Data Science Semester : I


Course Title : Exploratory Data Analysis Course Code : OLMDS505
Faculty : Class Nbr :
Duration : 2 Hrs. 30 Mins. Max. Marks : 100

PART – A

Answer All the Questions (10 X 1 Marks = 10 Marks)

Q.No. Question Description Marks

1. What is R primarily used for? 1

A) Web development
B) Statistical computing and graphics
C) Mobile app development
D) Database management

2. Which symbol is used for assignment in R? 1

A) ==
B) ->
C) :=
D) <-

3. Which function is used to install packages in R from CRAN (Comprehensive R Archive Network)? 1

A) load()
B) install.packages()
C) require()
D) library()

4. What is the purpose of exploratory data analysis (EDA)? 1

A) To predict future outcomes with high accuracy
B) To summarize the main characteristics of a dataset
C) To train machine learning models
D) To visualize data for publication

5. What is the primary purpose of data transformation in data preprocessing? 1

A) To reduce the dimensionality of the dataset
B) To convert data types
C) To normalize or standardize the data
D) To impute missing values

6. Which of the following datasets is an example of time series data? 1

A) Monthly sales figures of a retail store
B) Survey responses from different age groups
C) Characteristics of different species of plants
D) Test scores of students in a class

7. What is the Mahalanobis distance primarily used for in outlier detection? 1

A) Detecting outliers in univariate data
B) Identifying anomalies in categorical data
C) Measuring the distance between two clusters in high-dimensional space
D) Assessing the distance of a data point from the centroid of a dataset, accounting for
correlations between variables

8. What is the purpose of a box plot in EDA? 1

A) To display the distribution of categorical variables
B) To visualize the relationship between two continuous variables
C) To identify outliers in a dataset
D) To summarize the central tendency and spread of a variable

9. How is the z-score calculated? 1

A) (Data value - Minimum value) / (Maximum value - Minimum value)
B) (Data value - Mean) / Standard deviation
C) (Data value - Median) / Interquartile range
D) (Data value - Mode) / Range

10. Which statistical measure represents the most frequently occurring value in a dataset? 1

A) Mean
B) Median
C) Mode
D) Standard deviation

PART – B

Answer All the Questions (15 X 2 Marks = 30 Marks)

11. Match the types of healthcare data with their respective examples: 2

   Type of Data            Example
1  Structured data         A. Insurance emails, electronic health records (EHRs) and patient clinical notes
2  Unstructured data       B. Medical sensor data and radiology reports
3  Semi-structured data    C. Patient IDs, bill amounts and date of birth

Select the correct options that match:

A. 1A, 2B, 3C
B. 1C, 2B, 3A
C. 1B, 2C, 3A
D. 1A, 3B, 2C

12. A researcher wants to estimate the average income of households in a city. The city is divided 2
into several neighborhoods, each with a different socioeconomic profile. To ensure a
representative sample, the researcher decides to use stratified sampling. Which of the following
best describes stratified sampling?

A) The researcher randomly selects households from each neighborhood and combines them
into the sample.
B) The researcher selects households from a single neighborhood and includes them in the
sample.
C) The researcher divides households into income groups and randomly selects households
from each group.
D) The researcher selects households based on their proximity to the city center.

13. Suppose you have two variables, X and Y, representing the number of hours spent studying (X) 2
and the corresponding exam scores (Y) of a group of students in a class. After calculating the
covariance between X and Y, you obtain a positive value. What does this positive covariance
value indicate about the relationship between the number of hours spent studying and exam
scores?

A) There is no relationship between the number of hours spent studying and exam scores.
B) There is a negative relationship between the number of hours spent studying and exam
scores.
C) There is a tendency for the number of hours spent studying and exam scores to change in the
same direction.
D) There is a tendency for the number of hours spent studying and exam scores to change in
opposite directions.

14. Given a dataset with the following values: 10, 20, 30, 40, 50. Which of the following is the 2
normalized value (min-max) for the data point 30 if the normalization range is [0, 1]?

A) 0.25
B) 0.50
C) 0.75
D) 0.60
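
As an illustration of the min-max formula this question refers to, the rescaling can be sketched as below. This is a minimal sketch, not a prescribed solution; the helper name `min_max_normalize` and its parameters are hypothetical and chosen only for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the minimum maps to new_min and the maximum to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; min-max scaling is undefined")
    # (v - lo) / (hi - lo) lies in [0, 1]; stretch and shift it to the target range.
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


data = [10, 20, 30, 40, 50]
normalized = min_max_normalize(data)
for original, scaled in zip(data, normalized):
    print(original, "->", scaled)
```

The smallest value maps to 0 and the largest to 1, so every other value lands at its proportional position between them.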

15. If a dataset has a mean of 50 and a standard deviation of 10, what is the z-score of a data point 2
with a value of 60?

A) 0.5
B) 1
C) 2
D) 10
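
The z-score formula from Question 9, (Data value - Mean) / Standard deviation, can be sketched for the figures given here. The function name `z_score` is a hypothetical helper introduced for illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations by which value lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / std_dev


z = z_score(60, mean=50, std_dev=10)
print(z)
```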

16. You are comparing the distribution of salaries between two departments, Department X and 2
Department Y, within a company. After plotting box plots for both departments, you observe
that Department X has a larger IQR compared to Department Y. Which of the following
statements provides the most likely explanation for this difference in IQR between the two
departments?

A) Employees in Department X have higher salaries than employees in Department Y.
B) There are more employees in Department X than in Department Y.
C) There is more variability in salaries among employees in Department X than in Department
Y.
D) The mean salary in Department X is lower than the mean salary in Department Y.

17. Suppose you have a dataset containing information about students' exam scores and their 2
corresponding study hours. Due to some technical issues, 20% of the study hours data is
missing. You decide to impute the missing values using a linear regression model fitted on the
observed data. After imputation, the correlation coefficient between study hours and exam
scores increases from 0.60 to 0.75. What can you infer from this change in correlation
coefficient?

A) The imputation method has introduced bias in the data.
B) The imputation method has reduced the accuracy of the correlation coefficient.
C) The imputation method has effectively captured the relationship between study hours and
exam scores.
D) The imputation method has artificially inflated the correlation coefficient.

18. In a study analyzing the relationship between dietary habits and cardiovascular health, missing 2
data is observed in the variable measuring daily sodium intake. After further investigation, it is
found that participants with a family history of heart disease are less likely to report their
sodium intake accurately.

Statement:
The missing data mechanism most likely at play in this scenario is Missing Not at
Random (MNAR).

Check whether the above statement is True or False:

A) True
B) False

19. Consider the following small dataset containing the values of two variables, X and Y: 2
X: [2, 4, 6, 8, 10]
Y: [1, 3, 5, 7, 9]
What is the covariance between variables X and Y?

A) 8.0
B) 7.5
C) 6.0
D) 45
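
The covariance calculation can be sketched as follows. The question does not say whether the population form (divide by n) or the sample form (divide by n - 1) is intended, so this sketch computes both; the helper `covariance` and its `sample` flag are hypothetical names used only here.

```python
def covariance(xs, ys, sample=False):
    """Covariance of two equal-length sequences.

    sample=False divides by n (population form);
    sample=True divides by n - 1 (sample form).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations from the respective means.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


X = [2, 4, 6, 8, 10]
Y = [1, 3, 5, 7, 9]
population_cov = covariance(X, Y)
sample_cov = covariance(X, Y, sample=True)
print("population:", population_cov, "sample:", sample_cov)
```

A positive result under either divisor means that X and Y tend to move in the same direction, which is the interpretation Question 13 asks about.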

20. Consider the following dataset representing the scores of students in a mathematics test: 2
{50,60,70,80,85,90,95,100}
Find the interquartile range (IQR) of the dataset.

A) 30
B) 35
C) 25
D) 20

21. In the k-means clustering algorithm, the following step is performed iteratively until 2
convergence.

A. Assigning each data point to the nearest cluster centroid
B. Computing the distance between each data point and each cluster size
C. Updating the positions of the cluster centroids based on the variance of data points in
each cluster
D. Determining the optimal number of clusters based on a predefined criterion

Select which of the statement(s) is/are correct:

A) A and B are true
B) C and B are true
C) A is true
D) C is true

22. Consider a dataset containing measurements of air quality parameters such as temperature, 2
humidity, and particulate matter concentration. Which of the following models for outlier
analysis would be most suitable for identifying regions with unusual concentrations of
pollutants compared to their surrounding areas?

A) Clustering models
B) Extreme value analysis
C) Distance-based models
D) Density-based models

23. In the dataset {1, 3, 3, 3, 50, 97, 97, 97, 100}, which value(s) would be considered outlier(s) 2
based on their extreme positions in the dataset?

A) 1 and 100
B) 1 and 97
C) 50
D) 100 and 97

24. Consider a dataset containing spatial data points representing various locations in a city. Which 2
outlier detection technique would be most suitable for identifying locations with unusual
patterns of spatial distribution compared to their neighboring areas?

A) Local Outlier Factor (LOF)
B) Histogram-based technique
C) Distance-based technique
D) Probabilistic models

25. Consider a dataset containing measurements of a physical quantity over time. Certain 2
measurements exhibit sudden and unexpected changes compared to the surrounding data
points. Which outlier detection technique would be most suitable for identifying these abrupt
changes?

A) Local Outlier Factor (LOF)
B) Histogram-based technique
C) Grid-based technique
D) Kernel Density Estimation (KDE)

PART – C

Answer any six Questions (6 X 10 Marks = 60 Marks)

26. You are a data analyst for a retail company that sells products both in physical stores and 10
online. The company is interested in understanding customer behavior to optimize its
marketing strategies. You have been provided with a dataset containing information about
customer purchases, including the products bought, the purchase amounts, and whether the
purchase was made in-store or online. Describe the key steps you would follow in the EDA life
cycle to extract valuable insights from the dataset. Illustrate the analyses or techniques you
would apply at each step.

Customer_ID  Purchase_Date  Product_ID  Product_Name  Purchase_Amount  Purchase_Channel
1            2023-01-01     101         Laptop        1200             Online
2            2023-01-02     102         Smartphone    800              In-store
3            2023-01-03     103         Headphones    100              Online
4            2023-01-04     104         Tablet        500              Online
5            2023-01-05     NA          Smartwatch    300              In-store
6            2023-01-06     106         NA            150              Online
7            2023-01-07     107         Camera        NA               In-store
8            2023-01-08     108         Earphones     50               Online
9            2023-01-09     109         TV            1000             Online
10           NA             110         Printer       NA               In-store

27. Consider the dataset provided below, containing information about the education level of 10
customers. Each customer is assigned a unique identifier, and their education level is
categorized into four categories: "High School," "Bachelor's Degree," "Master's Degree," and
"Ph.D."
Customer_ID Education_Level
1 High School
2 Bachelor's Degree
3 Master's Degree
4 Ph.D.
5 High School
Illustrate the concepts of encoding in the context of categorical data during data preprocessing.
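
Two common encodings for a categorical column like this one can be sketched as below: ordinal (label) encoding, which relies on the levels having a natural order, and one-hot encoding, which makes no ordering assumption. The level order listed in `levels` and all variable names are assumptions made for this sketch.

```python
# Assumed ordering of the education levels, lowest to highest.
levels = ["High School", "Bachelor's Degree", "Master's Degree", "Ph.D."]

# Customer_ID -> Education_Level, copied from the table above.
customers = {
    1: "High School",
    2: "Bachelor's Degree",
    3: "Master's Degree",
    4: "Ph.D.",
    5: "High School",
}

# Ordinal encoding: each level becomes its rank in the assumed order.
ordinal_code = {level: rank for rank, level in enumerate(levels)}
ordinal_encoded = {cid: ordinal_code[level] for cid, level in customers.items()}

# One-hot encoding: one 0/1 indicator per level, exactly one of them set.
one_hot_encoded = {
    cid: [1 if level == candidate else 0 for candidate in levels]
    for cid, level in customers.items()
}

for cid in customers:
    print(cid, customers[cid], ordinal_encoded[cid], one_hot_encoded[cid])
```

Ordinal codes keep the column count at one but imply equal spacing between levels; one-hot indicators avoid that implication at the cost of one column per category.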

28. Consider the following dataset containing the ages of students in a class: 10
{18,20,22,23,25,25,26,27,30,32,35,40,45,50}
(a) Compute the mean, median, mode, range, interquartile range (IQR), quartiles, minimum,
and maximum of the dataset.
(b) Discuss the significance of each measure in summarizing the distribution of ages in the
class.
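
The measures listed in part (a) can be computed with a short sketch like the one below. Quartile values depend on the convention used; this sketch assumes the "median of halves" convention (split the sorted data at the median and take the median of each half), and the helper name `quartiles_median_of_halves` is hypothetical.

```python
import statistics

ages = [18, 20, 22, 23, 25, 25, 26, 27, 30, 32, 35, 40, 45, 50]


def quartiles_median_of_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (an assumption)."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    # For an odd count, the middle value is excluded from both halves.
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return statistics.median(lower), statistics.median(s), statistics.median(upper)


q1, q2, q3 = quartiles_median_of_halves(ages)
summary = {
    "mean": statistics.mean(ages),
    "median": q2,
    "mode": statistics.mode(ages),
    "min": min(ages),
    "max": max(ages),
    "range": max(ages) - min(ages),
    "Q1": q1,
    "Q3": q3,
    "IQR": q3 - q1,
}
for name, value in summary.items():
    print(name, "=", value)
```

Other quartile conventions, such as those based on linear interpolation, may give slightly different Q1, Q3 and IQR values for the same data.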

29. Consider the following dataset containing the coordinates of points in a two-dimensional space: 10
{(2,3),(5,4),(9,6),(4,7),(8,1),(7,2),(6,5),(3,8)}
(a) Apply an appropriate clustering algorithm to partition the dataset.
(b) Discuss the iterative steps involved in the clustering algorithm.
(c) Interpret the results obtained from clustering the dataset into N clusters.
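
One possible choice for part (a) is k-means, whose two alternating steps (assign each point to the nearest centroid, then move each centroid to the mean of its members) can be sketched as below. The question leaves the algorithm and the number of clusters open, so k = 2, the first-k-points initialization and the function name `k_means` are all assumptions of this sketch.

```python
import math


def k_means(points, k, max_iterations=100):
    """Lloyd-style k-means on 2-D points; returns (centroids, assignments)."""
    # Assumption: seed the centroids with the first k points so the run is repeatable.
    centroids = list(points[:k])
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the procedure has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2), (6, 5), (3, 8)]
centroids, assignments = k_means(points, k=2)
for point, cluster in zip(points, assignments):
    print(point, "-> cluster", cluster)
print("centroids:", centroids)
```

The partition k-means settles on depends on the starting centroids, so a different initialization or a different k can produce a different grouping of the same eight points.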

30. Consider the following dataset containing information about houses, including their sizes (in 10
square feet), the number of bedrooms, and their prices. However, some data entries are missing.
House_ID Size (sqft) Price ($)
1 1500 25000
2 1800 30000
3 1670 27000
4 2000 32000
5 1700 NA
(a) Apply regression techniques to impute missing values in the dataset. Work out the steps
involved and the assumptions made during this process.
(b) After imputing the missing values, discuss the potential impact on the analysis and
interpretation of the dataset.
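
Regression imputation for part (a) can be sketched as follows: fit a least-squares line of price on size using the complete rows, then predict the missing price from its size. This is a minimal sketch assuming a simple linear relationship between size and price; the helper `fit_line` and the variable names are hypothetical.

```python
# (House_ID, Size in sqft, Price in $); None marks the missing price.
houses = [
    (1, 1500, 25000),
    (2, 1800, 30000),
    (3, 1670, 27000),
    (4, 2000, 32000),
    (5, 1700, None),
]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Fit only on rows where the price was actually observed.
observed = [(size, price) for _, size, price in houses if price is not None]
slope, intercept = fit_line([s for s, _ in observed], [p for _, p in observed])

# Replace each missing price with the value predicted from the fitted line.
imputed = [
    (house_id, size, price if price is not None else slope * size + intercept)
    for house_id, size, price in houses
]
for row in imputed:
    print(row)
```

Because the imputed price lies exactly on the fitted line, it carries no residual noise, which is one reason imputed values can strengthen the apparent size-price relationship.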

31. Consider the following dataset containing information about the prices of houses in a 10
neighborhood:
House_ID Price ($)
1 250000
2 300000
3 270000
4 320000
5 1000000
Illustrate a method for detecting outliers in this dataset. Apply the method and identify any
outliers, if present.
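
One method that fits this question is the 1.5 × IQR rule, sketched below. The question does not prescribe a method or a quartile convention, so the use of `statistics.quantiles` with `method="inclusive"` and the 1.5 multiplier are assumptions of this sketch.

```python
import statistics

prices = [250000, 300000, 270000, 320000, 1000000]

# Quartiles under the "inclusive" convention (an assumption; other conventions exist).
q1, _median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Conventional fences: values beyond 1.5 * IQR from the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [p for p in prices if p < lower_fence or p > upper_fence]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

With only five observations the quartiles are sensitive to the convention chosen, so the fences, and therefore the flagged values, should be reported together with the method used.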

32. Perform student performance analysis with given data from a high school. The dataset includes 10
information such as student IDs, test scores, grades, attendance records.

Student ID  Test Scores  Grades  Attendance Records
1           85           B       90%
2           95           A       99%
3           90           A       85%
4           65           C       50%
5           80           B       92%
6           60           C       68%
7           88           B       93%
8           62           D       47%
9           78           C       91%
10          65           D       46%

i. Design a data visualization to analyze the relationship between test scores and
attendance records.
ii. Discuss the insights that can be gained from the data visualization and how it can help
in understanding the relationship between test scores and attendance records.
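
A scatter plot of attendance against test score is one natural design for part (i). As a numeric companion to such a plot, the Pearson correlation coefficient can be sketched as below; the helper name `pearson_r` is hypothetical and the choice of this coefficient is an assumption, not something the question specifies.

```python
import math

test_scores = [85, 95, 90, 65, 80, 60, 88, 62, 78, 65]
attendance_pct = [90, 99, 85, 50, 92, 68, 93, 47, 91, 46]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(attendance_pct, test_scores)
print("Pearson r between attendance and test score:", round(r, 3))
```

A value of r close to +1 would correspond to points rising from lower left to upper right in the scatter plot; it describes association only and does not by itself show that attendance causes higher scores.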

******
