CSBS - AD3491 - FDSA - IA 1 - Answer Key

This document contains an answer key for an internal assessment test on fundamentals of data science and analytics. It includes multiple choice and short answer questions that assess students' understanding of key concepts like data quality, data science processes, applications of data science, outlier detection, frequency distributions, and measures of central tendency. The document provides the questions, learning outcomes, required skills, and model answers to aid instructors in evaluating students.

Uploaded by

R.Mohan Kumar

Saranathan College of Engineering

Tiruchirappalli - 620012

Internal Assessment Test – I – Answer Key

Date/Session: 21-09-2022    Marks: 50
Course code: AD3491    Course Title: FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS
Batch No.:    Duration: 90 MINUTES    Academic Year: 2022-2023/ODD
Year: II    Semester: III    Department: B.Tech - CSBS
Part – A (20 Marks)
Answer all the Questions (10 x 2 = 20 Marks)
Q. No. Questions CO Skills
1 What are the characteristics of quality data? C206.1 R
• Validity - The degree to which your data conforms to defined business
rules or constraints.
• Accuracy - The degree to which your data is close to the true values.
• Completeness - The degree to which all required data is known.
• Consistency - The degree to which your data is consistent within the same
dataset and/or across multiple data sets.
• Uniformity - The degree to which the data is specified using the same unit
of measure.
2 What do you mean by Data Science? C206.1 R
• Data science is the domain of study that deals with vast volumes of data
using modern tools and techniques to find hidden patterns, derive
meaningful information, and make business decisions.
• Data science can be explained as the entire process of gathering actionable
insights from raw data, which involves concepts like pre-processing of data,
data modelling, statistical analysis, data analysis, machine learning
algorithms, etc.
• The main purpose of data science is to support better decision making.
3 List out at least five applications of data science. C206.1 R
• Finance and Fraud & Risk Detection.
• Healthcare.
• Internet Search and Website Recommendations.
• Retail Marketing and Targeted Advertising.
• Advanced Image Recognition.
• Speech Recognition.
• Airline Route Planning.
4 Write a short note on outlier detection and state its real-time applications. C206.1 U
• In statistics, an outlier is a data point that differs significantly from
other observations.
• An outlier detection technique (ODT) is used to detect anomalous
observations/samples that do not fit the typical/normal statistical
distribution of a dataset.
• Applications of outlier detection include spam detection, credit card
fraud detection, and intrusion detection in cyber security.
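As an illustration of one common outlier detection technique, the sketch below applies the 1.5 x IQR rule to the maze-error counts given in Question 16 of this test. The choice of rule, the 1.5 multiplier, and the function name `iqr_outliers` are assumptions made for this example and are not part of the model answer.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles sorts the data and returns the three quartile cut points
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Maze-error counts from Question 16 of this test
errors = [2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3]
outliers = iqr_outliers(errors)
print(outliers)
```

Other rules, such as flagging values more than a fixed number of standard deviations from the mean, may give different results on the same data.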
5 What contents should be included in a project charter? C206.1 U
A project charter requires teamwork, and your input covers at least the
following:
i. A clear research goal
ii. The project mission and context
iii. How you’re going to perform your analysis
iv. What resources you expect to use
v. Proof that it’s an achievable project, or proof of concepts
vi. Deliverables and a measure of success
vii. A timeline
6 Define Data Cleansing. C206.1 R
• Data cleansing is the process of fixing or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete data within a dataset.
• When combining multiple data sources, there are many opportunities
for data to be duplicated or mislabeled. If data is incorrect, outcomes
and algorithms are unreliable, even though they may look correct.
• Data cleansing is also referred to as data cleaning or data scrubbing.
7 Why is frequency distribution important in Data Science? C206.2 AZ
• A frequency distribution is an organized tabulation/graphical representation
of the number of individuals in each category on the scale of measurement.
The reasons for constructing a frequency distribution are as follows:
• To organize the data in a meaningful, intelligible way.
• To determine the shape of the distribution.
• To facilitate computational procedures for measures of average and
spread.
• To draw charts and graphs for the presentation of data.
• To enable the reader to make comparisons among different data sets.
8 State the “Guidelines for frequency distribution”. C206.2 U
1. Each observation should be included in one, and only one, class.
2. List all classes, even those with zero frequencies.
3. All classes should have equal intervals.
4. All classes should have both an upper boundary and a lower boundary.
5. Select the class interval from convenient numbers, particularly 5 and 10 or
multiples of 5 and 10.
6. The lower boundary of each class should be a multiple of the class interval.
7. Aim for a total of approximately 10 classes.
9 Write a short note on the stem-and-leaf display. Represent the following data
in a stem-and-leaf display: 67, 74, 63, 88, 82, 97, 65, 79 C206.2 C
• A stem-and-leaf display is used to present quantitative data in a graphical
format, similar to a histogram, to assist in visualizing the shape of a
distribution.
• A stem-and-leaf plot displays data by splitting up each value in a dataset
into a “stem” and a “leaf.”

Raw Data: 67, 74, 63, 88, 82, 97, 65, 79

Stem | Leaf
6 | 7 3 5
7 | 4 9
8 | 8 2
9 | 7
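The display for this data can also be built programmatically. The following is a minimal sketch, assuming two-digit values so that the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by tens digit (stem) and collect units digits (leaves)."""
    stems = defaultdict(list)
    for v in values:
        stems[v // 10].append(v % 10)  # leaves are kept in input order
    return dict(sorted(stems.items()))

data = [67, 74, 63, 88, 82, 97, 65, 79]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Leaves are listed in the order the values appear in the raw data; sorting each leaf list would give an ordered display instead.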
10 How can the skewness of a data distribution be identified? C206.2 U

Part – B
(Answer all the questions 2 x 10 = 20 marks)

Q. No. Questions CO Skills
11 Discuss in detail the step-by-step process in Data Science with a neat diagram. C206.1 R
Ans: Refer to Unit-I study material Page No 15-18
The data science process typically consists of six steps, as follows:
1. Setting the research goal - Defining the what, the why, and the how of your
project in a project charter.
2. Retrieving data - Finding and getting access to data needed in your project.
This data is either found within the company or retrieved from a third party.
3. Data preparation - Checking and remediating data errors, enriching the data
with data from other data sources, and transforming it into a suitable format for
your models.
4. Data exploration - Diving deeper into your data using descriptive statistics
and visual techniques.
5. Data modelling - Using machine learning and statistical techniques to
achieve your project goal.
6. Presentation and automation - Presenting your results to the stakeholders
and industrializing your analysis process for repetitive reuse and integration with
other tools.

Or

12 Discuss briefly about: C206.1 R

i. Life cycle of Data Science (5)
Ans: Refer to Unit-I Study material – Page 04 to 06
Formulating a Business Problem, Data Extraction, Transformation,
Loading, Data Preprocessing, Data Modeling, Gathering Actionable
Insights, Solutions for the Business Problem

ii. Machine Learning in Data Science (5)
Ans: Refer to Unit-I Study material – Page 09 to 11
Regression, Decision tree, Clustering, Classification, Outlier Analysis

13 The IQ scores for a group of 35 high school dropouts are as follows: C206.2 A

91 85 84 79 80 87 96 75 86
104 95 71 105 90 77 123 80 100
93 108 98 69 99 95 90 110 109
94 100 103 112 90 90 98 89

i. Construct a frequency distribution for grouped data (4)
ii. Relative Frequency distribution (3)
iii. Cumulative Frequency distribution (3)

Solution:
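One way to produce the three tables asked for above is sketched below. It assumes a class width of 5 with lower boundaries at multiples of 5, following the guidelines in Question 8; the width and the names `WIDTH` and `grouped_distribution` are assumptions of this sketch, not values taken from the answer key.

```python
scores = [
    91, 85, 84, 79, 80, 87, 96, 75, 86,
    104, 95, 71, 105, 90, 77, 123, 80, 100,
    93, 108, 98, 69, 99, 95, 90, 110, 109,
    94, 100, 103, 112, 90, 90, 98, 89,
]

WIDTH = 5  # assumed class width

def grouped_distribution(values, width):
    """Return rows of (class, frequency, relative frequency, cumulative frequency)."""
    start = (min(values) // width) * width  # lowest class starts at a multiple of width
    stop = (max(values) // width) * width   # lower boundary of the highest class
    rows, cumulative = [], 0
    for lower in range(start, stop + width, width):
        upper = lower + width - 1
        freq = sum(lower <= v <= upper for v in values)
        cumulative += freq
        rows.append((f"{lower}-{upper}", freq, freq / len(values), cumulative))
    return rows

table = grouped_distribution(scores, WIDTH)
print("Class    f   rel.f  cum.f")
for label, freq, rel, cum in table:
    print(f"{label:>7} {freq:>3} {rel:7.3f} {cum:>6}")
```

Classes with zero frequency are kept in the table, and the cumulative column ends at the total number of scores.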
Or

14 i. Discuss in detail “Measures of Central Tendency” and calculate
each measure for the following retirement ages data: (6) C206.2 A
60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63
ii. Is it possible to calculate the “Mean” for qualitative data? Justify your
answer. (2) AZ
iii. Is the above data “Bimodal”? Justify your answer. (2) AZ
Answer:
ii) No. The mean (average) can be computed only for quantitative data, which is measurable in nature.
Qualitative data consists of non-numerical categories that cannot be added or averaged, so a mean cannot be calculated.

iii) Bimodal - A set of data with two modes is known as bimodal. This means that two data values
share the highest frequency.
60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63

Data Frequency
45 1
55 1
60 2
63 4
65 2
70 1

Since only one value (63) has the highest frequency, 4, the data is NOT bimodal.
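The three measures asked for in part (i) and the bimodality check in part (iii) can be sketched with Python's standard `statistics` module. The variable names are illustrative, and this is a check of the arithmetic rather than a replacement for the written definitions.

```python
import statistics

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

mean = statistics.mean(ages)        # sum of the values divided by their count
median = statistics.median(ages)    # middle value of the sorted data
modes = statistics.multimode(ages)  # every value tied for the highest frequency

print(f"mean={mean:.2f}, median={median}, modes={modes}")
print("bimodal" if len(modes) == 2 else "not bimodal")
```

`multimode` returns a list, so a bimodal data set would yield two entries; here it yields one.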

Part – C
(Answer all the questions 1 x 10 = 10 marks)

Q.No. Questions CO Skills


15 Discuss the following measures and calculate them with the given
“residence changes” data. C206.2 A
1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 4
i. Range (1 + 1)
ii. Variance (1 + 1)
iii. Standard Deviation (1 + 1)
iv. InterQuartile Range (IQR) (1 + 1)
v. Z-Score (1 + 1)

Answer:

• A definition of each of the above measures needs to be written. Each definition carries 1 mark.
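The calculation half of each part can be sketched as follows. Several conventions are assumed here and are not stated in the answer key: the population form of the variance (dividing by N), the IQR taken as the median of the upper half minus the median of the lower half, and the z-score shown for the single value 11. The helper names are illustrative.

```python
import math
import statistics

changes = [1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 4]

# i. Range: largest value minus smallest value
data_range = max(changes) - min(changes)

# ii. Variance (population form, assumed): mean squared deviation from the mean
mean = statistics.mean(changes)
variance = statistics.pvariance(changes)

# iii. Standard deviation: square root of the variance
std_dev = math.sqrt(variance)

def median_of_halves_iqr(values):
    """IQR as median(upper half) - median(lower half); one of several conventions."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]
    upper = ordered[(n + 1) // 2:]
    return statistics.median(upper) - statistics.median(lower)

# iv. Interquartile range
iqr = median_of_halves_iqr(changes)

def z_score(x):
    """v. Z-score: distance of x from the mean in standard-deviation units."""
    return (x - mean) / std_dev

print(data_range, round(variance, 2), round(std_dev, 2), iqr, round(z_score(11), 2))
```

Using the sample form (`statistics.variance`, dividing by N - 1) or an interpolating quartile method would change the variance, standard deviation, IQR, and z-score slightly.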
16 During their first swim through a water maze, 15 laboratory rats made the
following number of errors (blind alleyway entrances): C206.2 A

2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3.

(a) Find the mode, median, and mean for these data. (6)
(b) Draw the shape of uniform distribution, positively skewed
distribution, and negatively skewed distribution. (3)
(c) Without constructing a frequency distribution or graph, would you
characterize the shape of the above data distribution as balanced,
positively skewed, or negatively skewed? (1)

Answer:
(b) [Figure: sketches of a Normal Distribution, a Positively Skewed Distribution, and a Negatively Skewed Distribution]

(c)
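Parts (a) and (c) can be checked with a short script. This is a sketch using Python's standard `statistics` module; the mean-versus-median comparison is a common rule of thumb for judging skew, assumed here rather than taken from the answer key.

```python
import statistics

errors = [2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3]

mode = statistics.mode(errors)      # most frequent value
median = statistics.median(errors)  # middle value of the sorted data
mean = statistics.mean(errors)      # arithmetic average

# Rule of thumb (assumed): a mean above the median suggests positive skew,
# a mean below the median suggests negative skew, equality suggests balance.
if mean > median:
    shape = "positively skewed"
elif mean < median:
    shape = "negatively skewed"
else:
    shape = "balanced"

print(mode, median, mean, shape)
```

A few large error counts pull the mean above the median, which is what the comparison detects.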
