Exercise pdf
Select the suitable answer for the following Multiple Choice Questions (MCQs)
1 ______ is a structured or processed collection of data usually associated with a unique body of
work
A Database B Dataset
C Data and Information D Information
2 ______ refers to the process of carefully examining and studying data to identify patterns, draw
conclusions, or make the data meaningful
A Data analytics B Data Predictions
C Dataset D Database
3 ______ is the graphical representation of data through the use of common charts, plots, infographics,
and animations.
A Data cleaning B Missing values
C Data visualization D Data hiding
4 ______ is a subset of machine learning, with emphasis on the simulation or imitation of the human
brain’s behavior by using artificial neural networks
A Data visualization B Computer vision
C Deep learning D Big Data
5 ______ is the use of data to predict future trends and events based on historical data
A Statistical analysis B Predictive analysis
C Graphical learning D Deep learning
6 ______ is the fast rate at which data is received and acted on
A Volume B Velocity
C Variety D Vision
7 ______ includes the data which can only take certain values and cannot be further subdivided into
smaller units
A Discrete data B Continuous data
C Ordinal data D Referral data
8 ______ is a limitation of big data
A Statistical data B Unlimited growth of data
C Data visualization D Predictive maintenance
9 Customer satisfaction levels such as satisfied, dissatisfied, and neutral are examples of the ______ data
type
A Ordinal data B Continuous data
C Numeric data D Discrete data
10 ______ is a method of collecting information from individuals
A Survey B Data hiding
C Data visualization D Data finding
Give Short answers to the following Short Responsive Questions (SRQs)
Q1 Define data analytics and data science. Are they similar or different? Give reasons.
Ans: Data Analytics: Data analytics refers to the process of carefully examining and studying
data to identify patterns, draw conclusions or make the data meaningful.
Data Science: Data science is an interdisciplinary field that uses
mathematics, statistics, data analysis and machine learning to analyze data and to extract
knowledge and insights from it.
Conclusion: While data analytics and data science share commonalities and often overlap, they
differ in terms of scope, techniques, and purpose. Data science can be seen as a broader, more
encompassing field that includes data analytics as one of its components.
Q2 Can you relate how data science is helpful in solving business problems?
Ans: A business problem is a gap between the existing and desired state of a situation. Closing
that gap requires an action or series of actions to achieve an objective. Various business
problems can be solved through data science; some of them are as follows:
o To decide the best routes for shipping of goods or passenger airplanes
o To choose the best product among many, which one to buy A or B
o To foresee delays for flight/ship/train
o To create promotional offers
o To find the best suitable time to deliver goods to reduce cost
o To forecast next year’s revenue for a company
o To analyze health benefit of physical training programs
o To predict a forthcoming event, such as who will win an election
Q3 Database is useful in the field of data science. Defend this statement
Ans: Before the advent of database systems, computer scientists relied on file management
systems to store and manage data. However, without a structured method of storing data, it
would be of little use. This is why databases serve as the backbone of data science by providing
a centralized source for data storage, integration, retrieval, transformation, security, scalability,
and persistence. They enable data scientists to work with large and complex datasets effectively,
facilitating the extraction of actionable insights and the development of predictive models.
Q4 Compare machine learning and deep learning, in the context of formal and informal
education
Ans: Comparison between machine learning and deep learning
Machine Learning (Basic Tools):
o Great for helping teachers with common tasks.
o Spots students who need extra help and recommends practice problems.
o Grades basic quizzes and helps personalize learning materials (like recommending
easier or harder exercises).
o Works well in both classrooms and informal learning apps.
Deep Learning (Advanced Tools):
o More powerful but requires a lot of data and computing muscle.
o Can understand complex things like essays or speech.
o Acts like a smart tutor, giving feedback and creating personalized learning journeys.
o Great for creating rich educational experiences but needs careful handling to
avoid bias and ensure fair learning.
In conclusion, both machine learning and deep learning can make learning more engaging and
effective, but they work best for different situations
Q5 What is meant by sources of data? Give three sources of data excluding those
mentioned in the book
Ans: To analyze data for predictive analysis and decision making, the initial step is data collection
through various reliable sources. Data can be divided into two categories, primary data and secondary
data. Primary data is collected directly through questionnaires, surveys, and interviews. Primary data can also
be collected through experiments and recording observations. Secondary data is taken from sources
in which primary data was previously recorded. Following are some sources of data:
Questionnaire: Collecting answers to a number of questions on a specific topic
Transactions: Collecting data on transactions in the banks
Financial data: Collecting data of financial transactions in the banks, governments or private institutions
Website: Collecting tweets regarding some topic or thread
Surveys: Collecting firsthand data by performing surveys about some event, movie etc.
Sensors: Collecting seismic data regarding changes under the earth which cause earthquakes
Q6 Differentiate between database and dataset
Ans: Comparison between Dataset and Database
Dataset:
o A collection of data that is organized in a specific format
o Typically used for research, data analysis, and machine learning projects
o Can be stored in a variety of formats such as a spreadsheet, a CSV file or a database
o Can be a subset of data extracted from a larger database
o Typically used for a specific purpose
Database:
o A collection of organized data that is stored and accessed electronically
o Typically used to store and manage large amounts of data to support the operations of
an organization
o Can store a wide range of data types, including text, numbers, images or other types of data
o Can have multiple datasets and can be used for different applications
o Typically used as a comprehensive and long term storage solution
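The point that a dataset can be a subset of data extracted from a larger database can be sketched in Python with the standard `sqlite3` module. The table names, columns and rows below are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical in-memory database standing in for a school's long-term store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, grade INTEGER, city TEXT)")
conn.execute("CREATE TABLE fees (name TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Ali", 9, "Lahore"), ("Sara", 10, "Karachi"), ("Omar", 9, "Lahore")],
)
conn.executemany(
    "INSERT INTO fees VALUES (?, ?)",
    [("Ali", 1500.0), ("Sara", 1800.0)],
)

# A dataset: a subset pulled out of the database for one specific purpose,
# here an analysis of grade 9 students only.
grade9_dataset = conn.execute(
    "SELECT name, city FROM students WHERE grade = 9 ORDER BY name"
).fetchall()
print(grade9_dataset)
conn.close()
```

The database keeps several tables for different uses, while the extracted list of rows is a small dataset that could be saved as a CSV file and analyzed on its own.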
Q7 Argue about the trends, outliers and distribution of values in a data set? Describe
Ans: To effectively argue about the trends, outliers, and distribution of values in a data set,
let's break down each component and illustrate how they can be analyzed.
Trends
Definition: Trends refer to the general direction in which the data is moving over a period. This
can be upward, downward, or stable.
Example: Suppose we have a data set showing the sales figures of a company over the past 12
months. A line graph reveals an upward trend, indicating increasing sales. A positive slope in a
linear regression analysis would confirm this upward trend.
Outliers
Definition: Outliers are values that lie far away from most of the other values in a dataset.
Example: In the same sales data set, if most monthly sales figures range between 10,000 and
15,000 units, but one month shows 25,000 units, this data point would be an outlier.
Distribution of Values
Definition: Distribution refers to how values in a data set are spread out across the range. It
provides insight into the shape, central tendency, and variability of the data.
Example: For a data set of house prices, a histogram or a boxplot could help us visualize the
distribution. Perhaps there's a slight skew towards the higher end, indicating more expensive
houses on the market.
Q8 Why are summary statistics needed?
Ans: Summary statistics are information about the data in a sample. They help to understand the
values better. They may include the total number of values, minimum value, and maximum value, along
with the mean value and the standard deviation corresponding to a data collection. Summary
statistics help to understand the trends, outliers and distribution of values in a data set.
The summary statistics provide a quick overview of characteristics of data. It leads towards a
better understanding of data cleaning, data preprocessing, feature selection and data
visualization.
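The statistics named in this answer can be computed in a few lines of Python. This is a minimal sketch; the function name `summarize` and the test marks are hypothetical.

```python
import statistics

def summarize(values):
    """Return the count, minimum, maximum, mean and standard deviation."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

marks = [55, 62, 70, 70, 78, 85]  # hypothetical test marks
summary = summarize(marks)
print(summary)
```

A quick look at such a summary shows the range and spread of the data before any cleaning or visualization is attempted.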
Q9 Express big data in your own words. Explain three V’s of big data with reference to
email data. (Hint: An email box that contains hundreds of emails)
Ans: Big data refers to extremely large data sets that cannot be easily managed, processed or
analyzed using traditional data processing tools. These data sets are characterized by their
massive volume, rapid generation, and diverse formats, requiring advanced methods and
technologies to derive meaningful insights from them. The three Vs of big data are:
Volume: It refers to the amount of data. Big data deals with huge volumes of low-density,
unstructured data. The size/volume of data may vary from system to system. In an
email context, volume is represented by the large number of emails stored.
Velocity: It refers to the speed of data. Velocity is the fast rate at which data is received.
Normally, the highest velocity of data streams directly into memory rather than being written
to disk. In an email context, velocity is represented by the rapid and constant flow of incoming
and outgoing emails.
Variety: It refers to the various formats and types of data that are available. Traditional
data types were structured and fit neatly in a relational database. With the rise of big data,
data comes in new data types. In an email context, variety is represented by the different
formats and types of data within those emails.
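To make the three V's concrete for the email hint, the following Python sketch measures each one on a tiny, hypothetical mailbox. The timestamps and attachment types are invented for illustration; a real mailbox would hold hundreds of emails.

```python
from datetime import datetime

# Hypothetical mailbox: (time received, attachment type or None for plain text).
emails = [
    (datetime(2024, 1, 1, 9, 0), None),
    (datetime(2024, 1, 1, 9, 20), "pdf"),
    (datetime(2024, 1, 1, 9, 40), "image"),
    (datetime(2024, 1, 1, 10, 0), "pdf"),
    (datetime(2024, 1, 1, 11, 0), "video"),
]

# Volume: how many emails are stored.
volume = len(emails)

# Velocity: how fast emails arrive, in emails per hour over the observed span.
span_hours = (emails[-1][0] - emails[0][0]).total_seconds() / 3600
velocity = volume / span_hours

# Variety: the different kinds of content found in the emails.
variety = {kind or "text only" for _, kind in emails}

print(volume, velocity, sorted(variety))
```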
Q10 Illustrate the purpose of data storage?
Ans: After data collection, effective storage of data is an essential step for managing and
analyzing the large volumes of data. There are various data storage methods according to the
nature of data. Some common data storage methods are: (i) Relational database, (ii) Data
warehouse, (iii) Distributed file systems, (iv) Cloud based data storage, (v) Blockchain
Give Long answers to the following Extended Responsive Questions (ERQs)
Q1 Sketch the key concepts of data science in your own words
Ans: Data science is an interdisciplinary field that uses
mathematics, statistics, data analysis and machine learning to analyze data and to extract
knowledge and insights from it. It is like a pipeline from data to insights. This awareness or
knowledge is used to find patterns in the data. The results drawn can be used for making
informed decisions to solve real world problems in areas such as medicine, education,
scientific research, and business.
Concepts of Data Science
Data science consists of many components, theories and algorithms. To understand data science
and use it productively, the following are some key concepts or components that lay the
foundation of data science:
Data: Data is a collection of observations, facts or information collected from different sources.
This data can be in the form of numbers, measurements, words, observations or in audio/ video
form. It could be structured (processed) data which is in the form of tables or unstructured
(unprocessed) data in the form of audio/video, tweets, pdf files etc.
Dataset: A dataset is a structured or processed collection of data usually associated with a
unique body of work. The data in this collection is related in some way, for example a
collection of brain CT scans of brain tumor patients is a dataset which can be used to evaluate
a certain pattern or trend common in the entire dataset.
Statistics and Probability: Statistics is the analysis of the frequency of past events and
probability is to predict the likelihood of future events. Data scientists use statistics and
probability to find patterns and trends in data.
Mathematics: Mathematics is a fundamental part of data science which helps to solve
problems, optimize the model performances, and interpret huge complex data into simple and
clear results, for decision making.
Deep Learning: Deep learning is a subset of machine learning, with emphasis on the
simulation or imitation of the human brain’s behavior by using artificial neural networks.
Data Mining: Data mining is the subset of data science which primarily focuses on discovering
patterns and relationships in existing datasets. The usage of techniques and tools is limited in
data mining as compared to data science.
Data Visualization: Data visualization is the graphical representation of data using common
charts, plots, infographics and animations. These visual displays of information communicate
complex data relationships and data driven insights in a way that is easy to understand.
Big Data: Big data refers to handling large volumes of data. Data scientists use big data to
find patterns and trends in datasets, to obtain more accurate and reliable results. The huge size
of data provides more opportunities for machine learning and provides better results.
Predictive Analysis: Predictive analysis is the use of data to predict future trends and events
based on historical data.
Natural Language Processing (NLP): It is the study of interaction between human language and
computers. The common uses of NLP are chatbots, language translators and sentiment analysis.
Q2 Develop your own thinking on the various data types used in data science
Ans: In data science we can mainly classify data into two main types qualitative (categorical)
and quantitative (numeric).
Qualitative or Categorical Data
This type of data describes an object or a group of objects that can be labeled according to
some group or category. It cannot be represented in numerical form. For example, data
including colors, places, etc. It is further subdivided into two types:
(i) Ordinal Data: Ordinal data follows a specific order or ranking; it uses a certain scale or
measure to group data into categories. Examples are test grades, economic status or
military rank.
(ii) Nominal Data: Nominal data does not have any order; it can be labeled into mutually
exclusive categories, which cannot be ordered meaningfully. For example, the categories
of transportation may be car, bus or train. Similarly gender, city, color, and
employment status are also examples of nominal data.
Quantitative or Numerical Data
This data deals with numeric values; that can be computed mathematically to draw some
conclusions. Examples of numeric data are height, weight, number of students in a school, fruits
in a basket etc. Quantitative data can be further divided into two types:
(i) Discrete Data: It includes data which can only take certain values and cannot be further
subdivided into smaller units. This data can be counted and has a finite number of
values. For example, the number of product reviews, tickets sold, computers in certain
departments, employees in a company etc.
(ii) Continuous Data: It refers to the unspecified number of possible measurements
between two realistic points or numbers. For example, daily wind speed, weight of
newborn babies, freezer’s temperature etc.
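The four data types can be told apart by the operations that make sense on them. The following Python is a minimal sketch with hypothetical values: nominal labels can only be counted, ordinal labels can also be ranked, discrete values are whole-number counts, and continuous values are measurements that may fall anywhere in a range.

```python
from collections import Counter

# Hypothetical values illustrating each data type.
nominal = ["car", "bus", "train", "bus"]                   # labels with no order
ordinal_scale = ["dissatisfied", "neutral", "satisfied"]   # categories from low to high
responses = ["satisfied", "dissatisfied", "neutral", "satisfied"]
discrete = [3, 0, 12, 7]            # tickets sold: counted in whole numbers
continuous = [3.25, 2.9, 3.61]      # weight of newborn babies in kg: measured

# Nominal data: counting is meaningful, ordering is not.
most_common_transport = Counter(nominal).most_common(1)[0][0]

# Ordinal data: values can be sorted by their position on the scale.
ranked = sorted(responses, key=ordinal_scale.index)

# Discrete data: every value is a whole number.
all_whole = all(isinstance(v, int) for v in discrete)

# Continuous data: an average with a fractional part is still a valid measurement.
average_weight = sum(continuous) / len(continuous)

print(most_common_transport, ranked, all_whole, round(average_weight, 2))
```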
Q3 Compare how big data is applicable to various fields of life. Illustrate your answer with
suitable examples.
Ans: Big data applications can help companies to make better business decisions by analyzing
large volumes of data and discovering hidden patterns. The following are various domains
where big data can be applied:
1. Healthcare:
Big data is making a major impact on the huge healthcare industry. Wearable devices and
sensors collect patient data which is then fed in real time to an individual’s electronic health
records. Healthcare providers are now using big data to predict epidemic outbreaks, real time
alerting, predict and prevent serious medical conditions etc. Researchers analyze the data to
determine the best treatment for a particular disease, side effects of the drugs, forecasting the
health risks etc.
2. Media and Entertainment:
The media and entertainment industries are creating, advertising and distributing their content
using new business models. The media houses are targeting audiences by predicting what they
would like to see, how to target the ads, content monetization, etc. Big data systems are thus
increasing the revenues of such media houses by analyzing viewer patterns.
3. Internet of Things (IoT)
Big data plays an important role in enhancing the capabilities of IoT devices. IoT devices
generate continuous data. The analytics based on this huge data helps in personalized customer
experience. In brief, big data is essential for unlocking the full potential of IoT by providing
meaningful insight derived from the massive amount of data generated by IoT devices.
4. Manufacturing
Big data helps the manufacturing companies to make better products and smarter decisions. It
helps in predicting when machines might need a break, making sure they don’t unexpectedly
stop working. Big data also looks at how products can be made better and cheaper. It is like having a
smart assistant that guides the whole manufacturing process, making things more efficient and
helping companies build the best products. The following are some of the major advantages of
employing big data applications in manufacturing industries: (i) High product quality, (ii)
Tracking faults, (iii) Supply planning, (iv) Predicting the output, (v) Increasing energy efficiency,
(vi) Testing and simulation of new manufacturing process, (vii) Large scale customization of
manufacturing
5. Government
Analytics through big data management techniques allows governments to understand the
needs of their citizens, combat fraud, minimize system errors and improve operations, reducing
costs and improving the services of any government entity. By adopting big data systems, the
government can attain efficiency in terms of cost, output and novelty.
Big data can be applied almost everywhere. Other application areas include:
Agriculture
Aviation
Cyber security and intelligence
Crime prediction and prevention
E-commerce
Fake news detection
Fraud detection
Pharmaceutical drug evaluation
Scientific research
Weather forecasting
Tax compliance
Q4 Relate advantages and challenges of big data
Ans: Advantages and Benefits of Big Data
Big data contains more information therefore it helps individuals, organizations, and businesses
to optimize and generate cost effective solutions. Big data has many advantages for the
betterment and progress of business; some of them are as follows:
Product Development: Developing and creating new products, services or brands is much
easier when based on data collected from customers’ needs and wants. Companies use big data
to anticipate customer demand. They build predictive models for new products and services by
classifying key attributes of past and current products.
Predictive Maintenance: It is a proactive maintenance strategy that uses the analysis of
existing data to predict when equipment, machinery or product is likely to fail. Therefore, it
indicates the potential issues before the problems happen.
Customer Experience / Satisfaction: A clearer view of customer experience is more possible
now than ever before. Big data enables businesses to gather data from social media, web
visits, call logs, and other sources to improve customer satisfaction.
Fraud and Compliance: Big data analytics can identify and detect unusual, suspicious
patterns and anomalies. As a result, it provides an effective tool to detect fraudulent activities and
enhance cyber security measures.
Big Data Challenges
Although there are many advantages of big data, businesses also encounter many challenges with it.
Some of them are as follows:
Data Quality: Poor quality of data may lead to errors, insufficiency and misleading results after
data analysis.
Data Security and Privacy: It is difficult to manage the protection and privacy of massive
datasets to prevent unauthorized access.
Rapid Growth of Data: Making systems that can handle more and more data as it keeps
on growing without slowing down is challenging.
Big Data Tool Selection: Ensuring compatibility and seamless interaction between
different big data tools and platforms is challenging.
Data Integration: To create harmony among diverse data formats and structures is a
difficult task.
Q5 Design a case study about how data science and big data has revolutionized the field
of healthcare.
Ans: Case study with the use of data analytics in healthcare:
Case Study Analysis
1. Predictive Analytics in Patient Care
Example: Predicting Sepsis
Problem: Sepsis is a life-threatening condition that arises when the body's response to
infection causes injury to its own tissues and organs. Early detection is crucial.
Solution: A hospital implemented a predictive analytics system using machine learning
models trained on historical patient data. The system analyzed vital signs, lab results, and
clinical notes in real-time.
Outcome: The system was able to predict sepsis 24-48 hours before clinical recognition,
significantly reducing mortality rates.
2. Operational Efficiency
Example: Reducing Hospital Readmissions
Problem: High readmission rates are costly and often indicate suboptimal patient care.
Solution: By analyzing patterns in EHR data, hospitals identified risk factors for
readmission. Predictive models were developed to flag high-risk patients.
Outcome: Targeted interventions, such as personalized discharge plans and follow-up
calls, reduced readmission rates by 15%.
3. Personalized Medicine
Example: Tailoring Cancer Treatment
Problem: Cancer treatment often follows a one-size-fits-all approach, which may not be
effective for all patients.
Solution: Using genomic data and advanced analytics, oncologists could classify tumors
based on genetic mutations. Machine learning models predicted which treatments would be
most effective for individual patients.
Outcome: Personalized treatment plans improved patient response rates and reduced
side effects.
4. Public Health Surveillance
Example: Managing COVID-19
Problem: The COVID-19 pandemic required real-time tracking of virus spread and
resource allocation.
Solution: Public health authorities used big data analytics to integrate data from
multiple sources, including testing results, mobility data, and hospital capacities. Predictive
models forecasted infection hotspots and healthcare needs.
Outcome: Enhanced decision-making capabilities led to more effective containment
strategies and optimized resource distribution.
Conclusion
Data science and big data have revolutionized healthcare by enabling predictive
analytics, enhancing operational efficiency, personalizing treatment, and improving
public health management. While challenges remain, the continued evolution of these
technologies promises even greater advancements in patient care and healthcare
delivery.