Fds Question Bank
COMPILED BY
M.DEIVAMANI AP/CSE
VERIFIED BY
DR.NNCE II YEAR/03 FDS-QB
UNIT I INTRODUCTION
Data Science: Benefits and uses – facets of data – Data Science Process: Overview -Defining research goals-
Retrieving data – Data preparation – Exploratory Data analysis – Build the model– presenting findings and
building applications – Data Mining – Data Warehousing – Basic Statistical descriptions of Data
UNIT II DESCRIBING DATA
Types of Data - Types of Variables -Describing Data with Tables and Graphs –
Describing Data with Averages - Describing Variability - Normal Distributions and Standard (z) Scores
UNIT III DESCRIBING RELATIONSHIPS
Correlation –Scatter plots –correlation coefficient for quantitative data –computational formula for
correlation coefficient – Regression – regression line – least squares regression line – Standard error of
estimate – interpretation of r2 –multiple regression equations –regression towards the mean
UNIT IV PYTHON LIBRARIES FOR DATA WRANGLING
Basics of Numpy arrays –aggregations –computations on arrays –comparisons, masks, boolean logic –
fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and selection –
operating on data – missing data – Hierarchical indexing – combining datasets – aggregation and grouping -
pivot tables
UNIT V DATA VISUALIZATION
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots –
Histograms – legends – colors – subplots – text and annotation – customization – three dimensional
plotting - Geographic Data with Base map - Visualization with Seaborn.
COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO1: Define the data science process
CO2: Understand different types of data description for data science process
CO3: Gain knowledge on relationships between data
CO4: Use the Python Libraries for Data Wrangling
CO5: Apply visualization Libraries in Python to interpret and explore data
TEXT BOOKS
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning
Publications, 2016. (Unit I)
2. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
(Units II and III)
3. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.
(Units IV and V)
REFERENCES:
Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press,2014
Unit I: Introduction
Data Science and Big Data
Facets of Data
Data Science Process
Defining Research Goals
Retrieving Data
Data Preparation
Exploratory Data Analysis
Build the Models
Presenting Findings and Building Applications
Data Mining
Basic Statistical Descriptions of Data
UNIT I
PART – A
Ans.
• Data science is an interdisciplinary field that seeks to extract knowledge or insights
from various forms of data.
• At its core, data science aims to discover and extract actionable knowledge from data
that can be used to make sound business decisions and predictions.
• Data science uses advanced analytical theory and various methods, such as time series
analysis, to predict the future.
Q.2 Define structured data. (Nov/Dec 2023)
Ans. Structured data is arranged in a row and column format, which makes it easy for
applications to retrieve and process. A database management system is used to store
structured data. The term structured data refers to data that is identifiable because it is
organized in a structure.
Q.3 What is data?
Ans. A data set is a collection of related records or information. The information may be
about some entity or some subject area.
Q.4 What is unstructured data? (April/May 2023)
Ans. Unstructured data is data that does not follow a specified format. Rows and columns
are not used for unstructured data, so it is difficult to retrieve the required information.
Unstructured data has no identifiable structure.
Q.5 What is machine-generated data?
Ans. Machine-generated data is information that is created without human interaction
as a result of a computer process or application activity. This means that data entered
manually by an end-user is not considered machine-generated.
Q.13 What are the three challenges to data mining regarding data mining
methodology?
Ans. Challenges to data mining regarding data mining methodology include the following:
1. Mining different kinds of knowledge in databases,
2. Interactive mining of knowledge at multiple levels of abstraction,
3. Incorporation of background knowledge.
Q.14 What is predictive mining?
Ans. Predictive mining tasks perform inference on the current data in order to make
predictions. Predictive analysis answers questions about the future, using historical
data as the main basis for decisions.
Q.15 List the stages of data science process.
Ans. Data science process consists of six stages:
1. Discovery or Setting the research goal 2. Retrieving data 3. Data preparation
4. Data exploration 5. Data modelling 6. Presentation and automation
Q.16 Differentiate between structured and unstructured data. (Nov/Dec 2023)
PART B
Q.1 Elaborate on the steps in the data science process with a diagram.
(April/May 2023)
Data Science:
• Data is measurable units of information gathered or captured from the activity of people,
places and things.
• Data science is an interdisciplinary field that seeks to extract knowledge or insights
from various forms of data. At its core, data science aims to discover and extract
actionable knowledge from data that can be used to make sound business decisions
and predictions.
• Data science combines math and statistics, specialized programming, advanced
analytics, Artificial Intelligence (AI) and machine learning with specific subject
matter expertise to uncover actionable insights hidden in an organization's data.
Life cycle of data science:
1. Capture: Data acquisition, data entry, signal reception and data extraction.
2. Maintain: Data warehousing, data cleansing, data staging, data processing
and data architecture.
3. Process: Data mining, clustering and classification, data modelling and data
summarization.
4. Analyse: Data reporting, data visualization, business intelligence and decision making.
• These three dimensions (volume, velocity and variety) are also called the three V's of Big Data.
Q2. Explain the different facets of data with the challenges in processing.
(Nov/Dec 2022 & 2023)
A very large amount of data is generated in big data and data science. This data is of various
types, and the main categories are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
• Structured data is arranged in a row and column format, which makes it easy for
applications to retrieve and process. A database management system is used to store
structured data.
• The term structured data refers to data that is identifiable because it is organized in a
structure. The most common form of structured data or records is a database where
specific information is stored based on a methodology of columns and rows.
• An Excel table is an example of structured data.
Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are
not used for unstructured data, so it is difficult to retrieve the required
information. Unstructured data has no identifiable structure.
• Unstructured data can take the form of text (documents, email messages,
customer feedback), audio, video or images. Email is an example of unstructured data.
• Even today, in most organizations more than 80 % of the data is in unstructured
form. This data carries a lot of information, but extracting it from these various
sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restriction or sequence for unstructured data
Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words and
sentences, then apply meaning and understanding to that information. This helps
machines to understand language as humans do.
• Natural language processing is the driving force behind machine intelligence in many
modern real-world applications. The natural language processing community has had
success in entity recognition, topic recognition, summarization, text completion and
sentiment analysis.
Machine - Generated Data
• Machine-generated data is information that is created without human interaction as a
result of a computer process or application activity. This means that data entered manually
by an end-user is not considered machine-generated.
activity, information from social networks, financial trading floors or geospatial services
and telemetry from connected devices or instrumentation in data centres.
Q3. Explore the various steps associated with the data science process and explain any
three steps of it with suitable diagrams and examples. (Nov/Dec 2022)
The data science process consists of six stages:
1. Discovery or setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modelling
6. Presentation and automation
business understanding of the data the user has and deciphering what each piece of data
means. This could entail determining exactly what data is required and the best
methods for obtaining it. This also entails determining what each of the data points
means in terms of the company. If we are given a data set by a client, for example,
we need to know what each column and row represents.
• Step 3: Data preparation
Data can have many inconsistencies, such as missing values, blank columns or an incorrect data
format, which need to be cleaned. We need to process, explore and condition data
before modeling. Clean data gives better predictions.
• Step 4: Data exploration
Data exploration is about gaining a deeper understanding of the data. Try to understand how
variables interact with each other, the distribution of the data and whether there are outliers.
To achieve this, use descriptive statistics, visual techniques and simple modeling.
This step is also called Exploratory Data Analysis.
• Step 5: Data modeling
In this step, the actual model building process starts. Here, the data scientist splits the
dataset into training and testing sets. Techniques like association, classification and clustering
are applied to the training data set. The model, once prepared, is tested against the
"testing" dataset.
• Step 6: Presentation and automation
Deliver the final baselined model with reports, code and technical documents in this stage.
The model is deployed into a real-time production environment after thorough testing. In this
stage, the key findings are communicated to all stakeholders. This helps to decide if the
project results are a success or a failure based on the inputs from the model.
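Steps 3 to 5 can be sketched in a few lines of Python. This is only an illustrative outline using the standard library: the records, the 75/25 split ratio and the seed are assumptions made for the example, and the model-fitting itself is left out.

```python
import random
import statistics

# Hypothetical raw records: (hours_studied, passed); None marks a missing value.
raw = [(2.0, 0), (None, 0), (5.5, 1), (7.0, 1), (1.5, 0),
       (6.0, 1), (None, 1), (3.0, 0), (8.0, 1), (4.5, 1)]

# Step 3: data preparation - drop records with missing values.
clean = [row for row in raw if row[0] is not None]

# Step 4: data exploration - simple descriptive statistics.
hours = [h for h, _ in clean]
mean_hours = statistics.mean(hours)
spread_hours = statistics.pstdev(hours)

# Step 5: data modeling - shuffle, then hold out 25% of the rows for testing.
rng = random.Random(42)  # fixed seed (an assumption) so the split is repeatable
shuffled = clean[:]
rng.shuffle(shuffled)
cut = int(len(shuffled) * 0.75)
train, test = shuffled[:cut], shuffled[cut:]

print(len(clean), len(train), len(test))
```

A model would be fitted on `train` only and then scored on `test`, so that the evaluation uses rows the model has not seen.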
2. Resources :
• As part of the discovery phase, the team needs to assess the resources available to
support the project. In this context, resources include technology, tools, systems,
data and people.
3. Frame the problem :
• Each team member may hear slightly different things related to the needs and the problem
and have somewhat different ideas of possible solutions.
4. Identifying key stakeholders:
• The team can identify the success criteria, key risks and stakeholders, which should
include anyone who will benefit from the project or will be significantly impacted by the
project.
• When interviewing stakeholders, learn about the domain area and any relevant history
from similar analytics projects.
5. Interviewing the analytics sponsor:
• The team should plan to collaborate with the stakeholders to clarify and frame the
analytics problem.
• At the outset, project sponsors may have a predetermined solution that may not
necessarily realize the desired outcome.
• In these cases, the team must use its knowledge and expertise to identify the true
underlying problem and appropriate solution.
• When interviewing the main stakeholders, the team needs to take time to thoroughly
interview the project sponsor, who tends to be the one funding the project or providing
the high-level requirements.
• This person understands the problem and usually has an idea of a potential working
solution.
6. Developing initial hypotheses:
• This step involves forming ideas that the team can test with data. Generally, it is best to
come up with a few primary hypotheses to test and then be creative about developing
several more.
• These Initial Hypotheses form the basis of the analytical tests the team will use in later
phases and serve as the foundation for the findings in phase.
7. Identifying potential data sources:
• Consider the volume, type and time span of the data needed to test the hypotheses.
Ensure that the team can access more than simply aggregated data. In most cases,
the team will need the raw data to avoid introducing bias for the downstream analysis.
third parties.
• Much high-quality data is freely available for public and commercial use. Data
can be stored in various formats, such as text files and tables in a database.
Data may be internal or external.
1. Start working on internal data, i.e. data stored within the company
• The first step for data scientists is to verify the internal data. Assess the relevance and quality
of the data that is readily available within the company. Most companies have a program for
maintaining key data, so much of the cleaning work may already be done. This data can be stored in
official data repositories such as databases, data marts, data warehouses and
data lakes maintained by a team of IT professionals.
Data repository can be used to describe several ways to collect and store data:
a) Data warehouse is a large data repository that aggregates data usually from multiple
sources or segments of a business, without the data being necessarily related.
b) Data lake is a large data repository that stores unstructured data that is classified and
tagged with metadata.
c) Data marts are subsets of the data repository. These data marts are more targeted to
what the data user needs and easier to use.
d) Metadata repositories store data about data and databases.
The metadata explains where the data comes from, how it was captured and what it
represents.
e) Data cubes are lists of data with three or more dimensions stored as a table.
c) Data errors may point to a business process that isn't working as designed.
d) Data errors may point to defective equipment, such as broken transmission lines and
defective sensors.
e) Data errors can point to bugs in software or in the integration of software that
may be critical to the company
2. Appending tables
• Appending tables is also called stacking tables. It effectively adds the observations from one
table to another table. Fig. 1.6.3 shows appending tables. (See Fig. 1.6.3 on next page)
• Appending tables duplicates the data, so the appended table requires more storage space.
If the tables hold terabytes of data, duplicating them becomes problematic. For this reason,
the concept of a view was invented: a view combines the tables virtually, without copying them.
• Fig. 1.6.4 shows how the sales data from the different months is combined virtually
into a yearly sales table instead of duplicating the data.
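The difference between appending and a view can be illustrated with a small Python sketch. The monthly tables and the `yearly_view` helper are hypothetical names invented for this example; a real database would use its own view mechanism.

```python
from itertools import chain

# Hypothetical monthly sales tables: each row is (month, amount).
january = [("Jan", 120), ("Jan", 80)]
february = [("Feb", 95)]
march = [("Mar", 150), ("Mar", 60)]

# Appending (stacking): builds a new table, so every row is stored a second time.
appended = january + february + march

def yearly_view():
    """Virtual table: yields rows from the monthly tables on demand, copying nothing."""
    return chain(january, february, march)

total_from_copy = sum(amount for _, amount in appended)
total_from_view = sum(amount for _, amount in yearly_view())

# A later change to a monthly table is visible through the view, not in the copy.
february.append(("Feb", 40))
updated_view_total = sum(amount for _, amount in yearly_view())
```

The view costs no extra storage, but it has to read the underlying tables every time it is queried.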
Transforming Data
• In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Relationships between an input variable and an output variable
aren't always linear.
• Reducing the number of variables: Having too many variables in the model makes the
model difficult to handle, and certain techniques don't perform well when the user overloads
them with too many input variables.
• All the techniques based on a Euclidean distance perform well only up to 10 variables.
Data scientists use special methods to reduce the number of variables but retain the
maximum amount of data.
Euclidean distance:
• Euclidean distance is used to measure the similarity between observations. It is calculated
as the square root of the sum of the squared differences between the coordinates of two points.
Euclidean distance = $\sqrt{(X_1 - X_2)^2 + (Y_1 - Y_2)^2}$
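As a quick check of the formula, here is a minimal Python sketch. The function name and the sample points are chosen only for illustration; the function works for any number of coordinates, not just X and Y.

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Points (1, 2) and (4, 6): differences are 3 and 4, so the distance is 5.
d = euclidean_distance((1, 2), (4, 6))
print(d)  # 5.0
```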
Turning variables into dummies:
• Variables can be turned into dummy variables. Dummy variables can only take two
values: true (1) or false (0). They're used to indicate the presence or absence of a
categorical effect that may explain the observation.
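A small sketch of the idea in plain Python follows. The `to_dummies` helper and the weekday values are hypothetical, introduced only for this example; libraries such as pandas provide their own functions for the same job.

```python
def to_dummies(values):
    """Turn one categorical variable into one 0/1 column per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

weekdays = ["Mon", "Tue", "Mon", "Wed"]
rows = to_dummies(weekdays)
print(rows[0])  # {'is_Mon': 1, 'is_Tue': 0, 'is_Wed': 0}
```

Each observation gets exactly one 1 across the dummy columns, marking the category it belongs to.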
Q.6 Explain data mining in detail. OR Explain the data analytics life cycle.
Brief about time-series analysis. (April/May 2024)
• Data mining refers to extracting or mining knowledge from large amounts of data.
It is a process of discovering interesting patterns or knowledge from a large amount
of data stored either in databases, data warehouses or other information repositories.
Reasons for using data mining:
1. Knowledge discovery: To identify hidden correlations and patterns in the database.
2. Data visualization: To find a sensible way of displaying data.
3. Data correction: To identify and correct incomplete and inconsistent data.
Functions of Data Mining
• Different functions of data mining are characterization, association and correlation
analysis, classification, prediction, clustering analysis and evolution analysis.
1. Characterization is a summarization of the general characteristics or features of a
target class of data. For example, the characteristics of students can be summarized,
generating a profile of all first-year engineering students in the university.
2. Association is the discovery of association rules showing attribute-value conditions
methods at the intersection of AI, machine learning, statistics and database systems.
• Fig. 1.10.1 (See on next page) shows typical architecture of data mining system.
• Components of data mining system are data source, data warehouse server, data mining
engine, pattern evaluation module, graphical user interface and knowledge base.
• Database, data warehouse, WWW or other information repository: This is a set of databases,
data warehouses, spreadsheets or other kinds of data repositories. Data cleaning and
data integration techniques may be applied to the data.
• Data warehouse server: Based on the user's data request, the data warehouse server is
responsible for fetching the relevant data.
Classification of DM System
• Data mining system can be categorized according to various parameters. These are
database technology, machine learning, statistics, information science, visualization
and other disciplines.
• Fig. 1.10.2 shows classification of DM system.
• Data warehousing is the process of constructing and using a data warehouse. A data
warehouse is constructed by integrating data from multiple heterogeneous sources that
support analytical reporting, structured and/or ad hoc queries and decision making.
Data warehousing involves data cleaning, data integration and data consolidations.
• A data warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management's decision-making process.
A data warehouse stores historical data for purposes of decision support.
• A database is a way to record and access information from a single source. A database
often handles real-time data to support day-to-day business processes like
transaction processing.
• A data warehouse is a way to store historical information from multiple sources to
allow you to analyse and report on related data (e.g., your sales transaction data, mobile
app data and CRM data). Unlike a database, the information isn't updated in real time and
is better suited to analysing broader trends.
• Modern data warehouses are moving toward an Extract, Load, Transform
(ELT) architecture in which all or most data transformation is performed on the
database that hosts the data warehouse.
• Goals of data warehousing:
1. To help reporting as well as analysis.
2. Maintain the organization's historical information.
3. Be the foundation for decision making.
Characteristics of Data Warehouse
1. Subject-oriented: Data are organized based on how the users refer to them. A data
warehouse can be used to analyse a particular subject area. For example, "sales"
can be a particular subject.
2. Integrated: All inconsistencies regarding naming convention and value representations
are removed. For example, source A and source B may have different ways of
identifying a product, but in a data warehouse, there will be only a single way of
identifying a product.
3. Non-volatile: Data are stored in read-only format and do not change over time.
Typical activities such as deletes, inserts and changes that are performed in an
operational application environment are completely non-existent in a DW environment.
Key characteristics of a Data Warehouse
1. Data is structured for simplicity of access and high-speed query performance.
2. End users are time-sensitive and desire speed-of-thought response times.
3. Large amounts of historical data are used.
4. Queries often retrieve large amounts of data, perhaps many thousands of rows.
5. Both predefined and ad hoc queries are common.
a) First, it acts as the glue that links all parts of the data warehouses.
b) Next, it provides information about the contents and structures to the developers.
c) Finally, it opens the doors to the end-users and makes the contents recognizable in their
terms.
UNIT II
DESCRIBING DATA
Types of Data
Types of Variables
Describing Data with Tables and Graphs
Describing Data with Averages
Describing Variability
Normal Distributions and Standard (z) Score
PART – B
1. Describe types of Variables? (Nov/Dec2022 &13mark)
2. The IQ scores for a group of 35 school dropouts are as follows:
3. Given below are the weekly pocket expenses (in Rupees) of a group of 25
students selected at random.
37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37, 49, 27, 37, 33, 38, 49, 45,
44, 37, 36 Construct a grouped frequency distribution table with class
intervals of equal widths, starting from 25-30, 30-35 and so on. Also, find the
range of weekly pocket expenses? (April/May2022&13mark)
6. The heights of animals are: 600 mm, 470 mm, 170 mm, 430 mm and 300
mm. Find out the mean, the variance and the standard deviation?
(Nov/Dec 2022/2023)
1. Descriptive Statistics
2. Inferential Statistics
In descriptive statistics, we describe the data using the mean, standard deviation,
charts or probability distributions. As part of descriptive statistics, we measure the
following:
PART-B
Q.1 Describe types of variables. (Nov/Dec 2022, 13 marks)
A variable is a characteristic or property that can take on different values.
Discrete and Continuous Variables
Discrete variables:
• Quantitative variables can be further distinguished in terms of whether they are
discrete or continuous.
• The word discrete means countable. For example, the number of students in a
class is countable or discrete. The value could be 2, 24, 34 or 135 students, but it
cannot be 23/32 or 12.23 students.
• The number of pages in a book is a discrete variable. Discrete data can only take
on certain individual values.
Continuous variables:
• A continuous variable is a variable which can take all values within a given
interval or range. A continuous variable consists of numbers whose values, at
least in theory, have no restrictions.
• Examples of continuous variables are blood pressure, weight, height and income.
• Continuous data can take on any value in a certain range. Length of a file is a
continuous variable.
Approximate Numbers
• Approximate number is defined as a number approximated to the exact number
and there is always a difference between the exact and approximate numbers.
• For example, 2, 4, 9 are exact numbers as they do not need any approximation.
• But √2, π, √3 are approximate numbers as they cannot be expressed exactly by a
finite number of digits. They can be written as 1.414, 3.1416, 1.7320 etc., which are only
approximations to the true values.
• Whenever values are rounded off, as is always the case with actual values for
continuous variables, the resulting numbers are approximate, never exact.
• An approximate number is one that does have uncertainty. A number can be
approximate for one of two reasons:
a) The number can be the result of a measurement.
b) Certain numbers simply cannot be written exactly in decimal form. Many
fractions and all irrational numbers fall into this category
The two main variables in an experiment are the independent and dependent
variable.
Observational Study
• An observational study focuses on detecting relationships between variables not
manipulated by the investigator. An observational study is used to answer a
research question based purely on what the researcher observes. There is no
interference or manipulation of the research subjects and no control and treatment
groups.
• These studies are often qualitative in nature and can be used for both
exploratory and explanatory research purposes. While quantitative observational
studies exist, they are less common.
• Observational studies are generally used in hard science, medical and social
science fields. This is often due to ethical or practical concerns that prevent the
researcher from conducting a traditional experiment. However, the lack of control
and treatment groups means that forming inferences is difficult and there is a risk
of confounding variables impacting the analysis.
Confounding Variable
• Confounding variables are those that affect other variables in a way that
produces spurious or distorted associations between two variables. They
confound the "true" relationship between two variables. Confounding refers to
differences in outcomes that occur because of differences in the baseline risks of
the comparison groups.
• For example, if we have an association between two variables (X and Y) and
that association is due entirely to the fact that both X and Y are affected by a third
variable (Z), then we would say that the association between X and Y is spurious
and that it is a result of the effect of a confounding variable (Z).
• A difference between groups might be due not to the independent variable but to
a confounding variable.
• For a variable to be confounding:
a) It must be connected with the independent variable of interest and
b) It must be connected to the outcome or dependent variable directly.
• For example, consider research whose objective is to show that alcohol drinkers
have more heart disease than non-drinkers. This association can be influenced by
another factor: alcohol drinkers might smoke more cigarettes than non-drinkers,
so cigarette consumption acts as a confounding variable when studying the
association between drinking alcohol and heart disease.
• For example, suppose a researcher collects data on ice cream sales and shark
attacks and finds that the two variables are highly correlated. Does this mean that
increased ice cream sales cause more shark attacks? That's unlikely. The more
likely cause is the confounding variable temperature. When it is warmer outside,
more people buy ice cream and more people go in the ocean.
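The ice cream and shark attack example can be imitated with simulated data. The sketch below is only an illustration: every number in it (temperature range, slopes, noise levels, seed) is an assumption made up for the demonstration, not real data. Both variables are generated from temperature alone, yet they end up strongly correlated with each other.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed (an assumption) for repeatability
temperature = [rng.uniform(10, 35) for _ in range(5000)]
# Each outcome depends on temperature plus its own noise, never on the other outcome.
ice_cream = [5 * t + rng.gauss(0, 10) for t in temperature]
shark_attacks = [0.2 * t + rng.gauss(0, 1) for t in temperature]

r_all = pearson_r(ice_cream, shark_attacks)

# Hold the confounder roughly constant: keep only days between 20 and 22 degrees.
band = [i for i, t in enumerate(temperature) if 20 <= t < 22]
r_band = pearson_r([ice_cream[i] for i in band],
                   [shark_attacks[i] for i in band])

print(round(r_all, 2), round(r_band, 2))
```

Across all days the correlation is high, but once temperature is held nearly constant it almost disappears, which is the signature of a confounding variable.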
b) Specify the real limits for the lowest class interval in this frequency
distribution. (Nov/Dec 2023, 13 marks)
Class width = (123 - 69)/10 = 54/10 = 5.4 ≈ 5
b) Real limits for the lowest class interval in this frequency distribution =
64.5-69.5.
Q3. Given below are the weekly pocket expenses (in Rupees) of a
group of 25 students selected at random.
37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37, 49, 27, 37, 33, 38,
49, 45, 44, 37, 36
Construct a grouped frequency distribution table with class intervals
of equal widths, starting from 25-30, 30-35 and so on. Also, find the
range of weekly pocket expenses.
Solution:
• In the given data, the smallest value is 26 and the largest value is 49. So,
the range of the weekly pocket expenses = 49-26=23.
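The grouped frequency table can be built with a short Python sketch. It assumes that each class includes its lower limit and excludes its upper limit (so 30 falls in 30-35, not 25-30); the function name is chosen only for this example.

```python
expenses = [37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37, 49, 27,
            37, 33, 38, 49, 45, 44, 37, 36]

def grouped_frequency(data, start, width, stop):
    """Count values per class; each class is lower-inclusive, upper-exclusive."""
    table = {}
    for lower in range(start, stop, width):
        upper = lower + width
        table[f"{lower}-{upper}"] = sum(lower <= x < upper for x in data)
    return table

table = grouped_frequency(expenses, start=25, width=5, stop=50)
data_range = max(expenses) - min(expenses)

for interval, count in table.items():
    print(interval, count)
print("Range =", data_range)
```

Under that assumption the frequencies are 25-30: 2, 30-35: 4, 35-40: 10, 40-45: 4 and 45-50: 5, which add up to the 25 students.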
Example: Suppose we take a sample of 200 Indian families and record the
number of people living in each. We obtain the following:
Cumulative frequency:
• A cumulative frequency distribution can be useful for ordered data (e.g. data
arranged in intervals, measurement data, etc.). Instead of reporting frequencies,
the recorded values are the sum of all frequencies for values less than and
including the current value.
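A cumulative frequency column is just a running total of the frequency column. The sketch below uses hypothetical counts, invented only so that they add up to 200 families; they are not the figures from the original table.

```python
from itertools import accumulate

# Hypothetical counts (assumption): number of families per household size.
household_size = [1, 2, 3, 4, 5, 6]
frequency = [10, 30, 50, 60, 35, 15]

# Running total: families with this household size or smaller.
cumulative = list(accumulate(frequency))

for size, f, cf in zip(household_size, frequency, cumulative):
    print(size, f, cf)
```

The last cumulative value always equals the total number of observations, which is a handy check on the table.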
• Example: Suppose we take a sample of 200 Indian families and record the
number of people living in each. We obtain the following:
• It is a well known fact that statistics can be misleading. They are often used to
prove a point and can easily be twisted in favour of that point.
• Good graphs are extremely powerful tools for displaying large quantities of
complex data; they help turn the reams of information available today into
knowledge. But, unfortunately, some graphs deceive or mislead.
• This may happen because the designer chooses to give readers the impression of
better performance or results than is actually the situation. In other cases, the
person who prepares the graph may want to be accurate and honest, but may
mislead the reader by a poor choice of a graph form or poor graph construction.
1. Title
2. Labels on both axes of a line or bar chart and on all sections of a pie chart
4. Key to a pictograph
• A graph can be altered by changing the scale of the graph. For example, data in
the two graphs of Fig. 2.6.1 are identical, but scaling of the Y-axis changes the
impression of the magnitude of differences.
Solution:
1. Mean :
• The mean of a data set is the average of all the data values. The sample mean x̄
is the point estimator of the population mean μ.
2. Median :
• The median of a data set is the value in the middle when the data items are
arranged in ascending order. Whenever a data set has extreme values, the median
is the preferred measure of central location.
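• The median is the measure of location most often reported for annual income
and property value data. A few extremely large incomes or property values can
inflate the mean.
8 observations: 26, 18, 29, 12, 14, 27, 30, 19
Sorted: 12, 14, 18, 19, 26, 27, 29, 30
Median = (19 + 26)/2 = 22.5 (with an even number of observations, the median is
the average of the two middle values)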
3. Mode:
• The mode of a data set is the value that occurs with greatest frequency. The
greatest frequency can occur at two or more different values. If the data have
exactly two modes, the data are bimodal. If the
data have more than two modes, the data are multimodal.
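4. Variance
• Variance is the average of the squared deviations from the mean:
$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
• In the formula above, μ represents the mean of the data points, x is the value of
an individual data point and N is the total number of data points.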
Standard Deviation
• Standard deviation is simply the square root of the variance. Standard deviation
measures the standard distance between a score and the mean.
Standard deviation = √Variance
• The standard deviation is a measure of how the values in data differ from one
another or how spread out data is. There are two types of variance and standard
deviation: sample and population.
• The standard deviation measures how far the data points in the observations
lie from the mean. We can calculate it by subtracting the mean from each data point
and then finding the mean of the squared differences; this is called the
variance. The square root of the variance gives us the standard deviation.
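The measures described above can be computed with Python's standard `statistics` module. This is a minimal sketch: the eight observations are the ones listed under the median, while the small list passed to `multimode` is a hypothetical sample added only to show a bimodal case.

```python
import statistics

observations = [26, 18, 29, 12, 14, 27, 30, 19]

mean = statistics.mean(observations)      # sum of values / number of values
median = statistics.median(observations)  # average of the two middle values here

# Hypothetical bimodal sample: 2 and 3 both occur twice.
modes = statistics.multimode([1, 2, 2, 3, 3, 4])

pop_var = statistics.pvariance(observations)    # divides by N (population)
sample_var = statistics.variance(observations)  # divides by N - 1 (sample)
pop_sd = statistics.pstdev(observations)        # square root of pop_var

print(mean, median, modes)
print(pop_var, sample_var, round(pop_sd, 2))
```

The sample variance is always a little larger than the population variance for the same data because it divides by N - 1 instead of N.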
18
DR.NNCE II YEAR/03 FDS-QB
b) The center of the distribution (the mean) changes, but the standard deviation
remains the same.
• If user are given numerical values for the mean and the standard deviation, we
should be
Standard deviation distances always originate from the mean and are expressed as
positive deviations above the mean or negative deviations below the mean.
SS = Σ(X − X̄)²
SS = ΣX² − (ΣX)²/n
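The definitional and computational formulas for the sum of squares (SS) are algebraically equivalent. A minimal Python sketch, using a small made-up set of scores (the data are an assumption, not taken from the question bank), checks that both give the same value:

```python
# Hypothetical scores, used only to illustrate the two SS formulas.
scores = [2, 4, 4, 5, 7, 8]
n = len(scores)
mean = sum(scores) / n

# Definitional formula: SS = sum of squared deviations from the mean
ss_definitional = sum((x - mean) ** 2 for x in scores)

# Computational formula: SS = sum of squares minus (sum squared) / n
ss_computational = sum(x ** 2 for x in scores) - sum(scores) ** 2 / n

print(ss_definitional, ss_computational)
```

The computational form is usually easier by hand because it avoids computing each deviation separately.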
I) The heights of animals are: 600 mm, 470 mm, 170 mm, 430 mm and 300
mm. Find out the mean, the variance and the standard deviation.
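Solution:
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394 mm
Variance = [206² + 76² + (−224)² + 36² + (−94)²] / 5 = 108520 / 5 = 21704
Standard deviation = √21704 = 147.32 ≈ 147 mm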
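The arithmetic for the animal heights can be checked with Python's standard `statistics` module. This is only a sketch: it treats the five heights as a complete population (dividing by N), and the sample version (dividing by N − 1) is shown for comparison.

```python
import statistics

heights_mm = [600, 470, 170, 430, 300]

mean = statistics.mean(heights_mm)              # arithmetic mean
variance = statistics.pvariance(heights_mm)     # population variance (divide by N)
std_dev = statistics.pstdev(heights_mm)         # population standard deviation

# Sample variance divides by N - 1 instead of N, so it is larger.
sample_variance = statistics.variance(heights_mm)

print(mean, variance, round(std_dev, 2), sample_variance)
```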
II) Determine the values of the range and the IQR for the following sets of
data.
(a) Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63
Solution:
Range = 70 − 45 = 25
IQR:
Ordered data: 45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70
Median = 63 (the 6th of the 11 ordered values)
Q1 = 60, Q3 = 65
IQR = Q3 − Q1 = 65 − 60 = 5
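The same steps can be sketched in Python. The quartile rule here (median of the lower half and median of the upper half, leaving out the overall median when the count is odd) is the convention the worked solution appears to use; other conventions, such as the interpolation used by some libraries, can give slightly different quartiles.

```python
import statistics

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
ordered = sorted(ages)

data_range = ordered[-1] - ordered[0]

# Split around the median; with an odd count the median is left out of both halves.
half = len(ordered) // 2
lower_half = ordered[:half]
upper_half = ordered[-half:]

q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```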
• The normal distribution is often called the bell curve because the graph of its
probability density looks like a bell. It is also known as the Gaussian
distribution, after the German mathematician Carl Gauss who first described it.
z Scores
• The z-score, or standard score, expresses a value as a number of standard
deviations from the mean. Accordingly, a set of z-scores has a distribution
with a mean of 0 and a standard deviation of 1. Formally, the z-score is defined as :
z = (X − μ) / σ
• The z-score works by taking a sample score and subtracting the mean score,
before then dividing by the standard deviation of the total population. The z-score
is positive if the value lies above the mean and negative if it lies below the mean.
a) Positive or negative sign indicating whether it's above or below the mean; and
b) Number indicating the size of its deviation from the mean in standard deviation
units
(b) And enables us to compare two scores that are from different samples (which
may have different means and standard deviations).
• Using the z-score technique, one can now compare two different test results
based on relative performance, not individual grading scale.
(b) A score of 470 on the SAT math test, given a mean of 500 and a standard
deviation of 100.
Solution :
a) Given, Margaret's IQ (X) = 135, Mean (μ) = 100, Standard deviation (σ) = 15
z = (X − μ) / σ = (135 − 100) / 15 = 2.33
b) A score of 470 on the SAT math test, given a mean of 500 and a standard
deviation of 100
Given,
Score (X) = 470, Mean (μ) = 500, Standard deviation (σ) = 100
z = (X − μ) / σ = (470 − 500) / 100 = −0.30
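As a quick check of the two conversions above, a minimal Python sketch (the function name `z_score` is illustrative, not part of any library):

```python
def z_score(x, mean, std_dev):
    """Return how many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z_iq = z_score(135, 100, 15)     # Margaret's IQ
z_sat = z_score(470, 500, 100)   # SAT math score

print(round(z_iq, 2), round(z_sat, 2))
```

The positive sign for the IQ score and the negative sign for the SAT score show directly which score lies above its mean and which lies below.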
• Although there is an infinite number of different normal curves, each with its
own mean and standard deviation, there is only one standard normal curve, with a
mean of 0 and a standard deviation of 1.
Solution:
UNIT III
DESCRIBING RELATIONSHIPS
Correlation
Scatter plots
correlation coefficient for quantitative data
computational formula for correlation coefficient
Regression –regression line
least squares regression line
Standard error of estimate
interpretation of r²
multiple regression equations
regression towards the mean
PART –B
Q1. Explain in detail about the types of Correlation? April/MAY2022
Q2. A sample of 6 children was selected, data about their age in years
and weight in kilograms was recorded as shown in the following table.
It is required to find the correlation between age and weight? NOV/DEC2022
Q3. A sample of 12 fathers and their elder sons gave the following data about their
heights in inches. Calculate the coefficient of rank correlation? April/May2023
UNIT III
DESCRIBING RELATIONSHIPS
PART – A
1. What do you mean by Correlation?
Correlation is a statistical technique to ascertain the association or relationship
between two or more variables. Correlation analysis is a statistical technique to
study the degree and direction of relationship between two or more variables.
2. What do you mean by correlation coefficient?
A correlation coefficient is a statistical measure of the degree to which changes to
the value of one variable predict change to the value of another. When the
fluctuation of one variable reliably predicts a similar fluctuation in another
variable, there's often a tendency to think that means that the change in one causes
the change in the other.
3. Write down the Uses of correlations:
I. Correlation analysis helps in deriving precisely the degree and the
direction of such relationship.
II. The effect of correlation is to reduce the range of uncertainty of our
prediction. The prediction based on correlation analysis will be more
reliable and near to reality.
III. Correlation analysis contributes to the understanding of economic
behaviour, aids in locating the critically important variables on which
others depend, may reveal to the economist the connections by which
disturbances spread and suggest to him the paths through which stabilizing
forces may become effective.
IV. Economic theory and business studies show relationships between
variables like price and quantity demanded, advertising expenditure and
(c) No Correlation
Sl. No.   Particulars                                       Solution
1         Price of commodity and its demand                 Negative
2         Yield of crop and amount of rainfall              Positive
3         No. of fruits eaten and hunger of a person        Negative
4         No. of units produced and fixed cost per unit     Negative
5         No. of girls in the class and marks of boys       No Correlation
6         Ages of husbands and wives                        Positive
7         Temperature and sale of woollen garments          Negative
8         Number of cows and milk produced                  Positive
9         Weight of person and intelligence                 No Correlation
10        Advertisement expenditure and sales volume        Positive
Correlation Regression
PART – B
1.Explain in detail about the types of Correlation? April/MAY2022
• When one measurement is made on each observation, uni-variate analysis is applied.
If more than one measurement is made on each observation, multivariate analysis is
applied.
Here we focus on bivariate analysis, where exactly two measurements are made on
each observation.
• The two measurements will be called X and Y. Since X and Y are obtained for
each observation, the data for one observation is the pair (X, Y).
• Some examples :
1. Height (X) and weight (Y) are measured for each individual in a sample.
2. Stock market valuation (X) and quarterly corporate earnings (Y) are recorded for
each company in a sample.
3. A cell culture is treated with varying concentrations of a drug and the growth rate (X)
and drug concentrations (Y) are recorded for each trial.
4. Temperature (X) and precipitation (Y) are measured on a given day at a set of weather
stations.
• There is a difference between bivariate data and two-sample data. In two-sample data, the
X and Y values are not paired and there are not necessarily the same number of X and Y
values.
• Correlation refers to a relationship between two or more objects. In statistics,
the word correlation refers to the relationship between two variables. Correlation exists
between two variables when one of them is related to the other in some way.
• Examples: One variable might be the number of hunters in a region and the other
variable could be the deer population. Perhaps as the number of hunters increases, the
deer population decreases. This is an example of a negative correlation: As one variable
increases the other decreases.
A positive correlation is where the two variables react in the same way, increasing or
decreasing together. Temperature in Celsius and Fahrenheit has a positive correlation.
• The term "correlation" refers to a measure of the strength of association between two
variables.
• Covariance is the extent to which a change in one variable corresponds systematically
to a change in another. Correlation can be thought of as a standardized covariance.
• The correlation coefficient r is a function of the data, so it really should be called the
sample correlation coefficient. The (sample) correlation coefficient r estimates the
population correlation coefficient ρ.
• If either the X values or the Y values are constant (i.e. all have the same value), then one
of the sample standard deviations is zero and therefore the correlation coefficient is not
defined.
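The idea of correlation as a standardized covariance can be sketched in a few lines of Python. The function name `pearson_r` and the sample temperatures are illustrative assumptions; the Celsius/Fahrenheit pairing follows the example mentioned above, and the second call shows the undefined case when one variable is constant.

```python
import statistics

def pearson_r(xs, ys):
    """Sample correlation: covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    if sd_x == 0 or sd_y == 0:
        return None  # r is undefined when either variable is constant
    return covariance / (sd_x * sd_y)

celsius = [0, 10, 20, 30, 40]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]

r_temperature = pearson_r(celsius, fahrenheit)      # perfectly linear, so r is 1 (up to rounding)
r_constant = pearson_r(celsius, [5, 5, 5, 5, 5])    # constant Y, so r is undefined

print(r_temperature, r_constant)
```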
Types of Correlation
1. Positive and negative
2. Simple and multiple
3. Partial and total
4. Linear and non-linear.
1. Positive and negative
• Positive correlation : Association between variables such that high scores on one
variable tend to go with high scores on the other variable. A direct relation between the
variables.
• Negative correlation : Association between variables such that high scores on one
variable tend to go with low scores on the other variable. An inverse relation between the
variables.
2. Simple and multiple
• Simple: When only two variables are studied, the relationship is described as
simple correlation.
• Example: Quantity of money and price level, demand and price.
• Multiple: When more than two variables are studied simultaneously, the
relationship is described as multiple correlation.
• Example: The relationship of price, demand and supply of a commodity.
3. Partial and total correlation
• Partial correlation : Analysis recognizes more than two variables but considers
only two variables keeping the other constant. Example: Price and demand, eliminating
the supply side.
• Total correlation is based on all the relevant variables, which is normally not feasible.
In total correlation, all the facts are taken into account.
4. Linear and non-linear correlation
• Linear correlation : Correlation is said to be linear when the amount of change in one
variable tends to bear a constant ratio to the amount of change in the other. The graph
of the variables having a linear relationship will form a straight line.
• Non linear correlation : The correlation would be non linear if the amount of change
in one variable does not bear a constant ratio to the amount of change in the other
variable.
Classification of correlation
•Two methods are used for finding relationship between variables.
1. Graphic methods
2. Mathematical methods.
• Graphic methods contain two sub methods: Scatter diagram and simple graph.
• Types of mathematical methods are,
a. Karl Pearson's coefficient of correlation
b. Spearman's rank coefficient correlation
c. Coefficient of concurrent deviation
d. Method of least squares.
Coefficient of Correlation
Correlation : The degree of relationship between the variables under consideration is
measured through correlation analysis.
• The measure of correlation is called the correlation coefficient. The degree of
relationship is expressed by a coefficient that ranges from −1 to +1 (−1 ≤ r ≤ +1).
The direction of change is indicated by the sign.
• The correlation analysis enables us to have an idea about the degree and direction
of the relationship between the two variables under study.
• Correlation is a statistical tool that helps to measure and analyze the degree of
relationship between two variables. Correlation analysis deals with the association
between two or more variables.
• Correlation denotes the interdependence among variables. For a correlation between two
phenomena to be meaningful, there should be a plausible (for example cause-effect)
relationship between them; a correlation found where no such relationship exists may be
spurious, since correlation by itself does not establish causation.
• If two variables vary in such a way that movements in one are accompanied by
movements in the other, the variables are said to be correlated.
Properties of Correlation
1. Correlation requires that both variables be quantitative.
2. Positive r indicates positive association between the variables and negative r indicates
negative association.
3. The correlation coefficient (r) is always a number between - 1 and + 1.
4. The correlation coefficient (r) is a pure number without units.
5. The correlation coefficient measures clustering about a line, but only relative to the
Solution :
Q3. A sample of 12 fathers and their elder sons gave the following data about their
Solution:
Solution: In the problem statement, both series items are in small numbers. So there is no
r = 46 / (5.29 × 9.165)
r = 0.9488
• When two variables x and y have an association (or relationship), we say there
exists a correlation between them. Alternatively, we could say x and y are correlated.
To find such an association, we usually look at a scatterplot and try to find a pattern.
• Scatterplot (or scatter diagram) is a graph in which the paired (x, y) sample data are
plotted with a horizontal x axis and a vertical y axis. Each individual (x, y) pair is plotted as a
single point.
• One variable is called independent (X) and the second is called dependent (Y).
Example:
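The original figure is not reproduced here. As an illustrative stand-in, the following sketch draws a scatterplot with Matplotlib; the data (hours studied and exam scores) and the output file name are made up for this example, and the non-interactive "Agg" backend is chosen only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical paired data: hours studied (x, independent) and exam score (y, dependent).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 73, 79, 85]

fig, ax = plt.subplots()
points = ax.scatter(hours, scores)   # each (x, y) pair becomes a single point
ax.set_xlabel("Hours studied (X)")
ax.set_ylabel("Exam score (Y)")
ax.set_title("Scatterplot of paired (x, y) data")
fig.savefig("scatter_example.png")
```

Points rising from lower left to upper right, as in this made-up data, suggest a positive relationship.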
• The pattern of data is indicative of the type of relationship between your two variables :
1. Positive relationship
2. Negative relationship
3. No relationship (a zero relationship).
1. It is a simple and attractive method for finding out the nature of correlation.
2. It is easy to understand.
3. It gives the user a rough idea about the correlation (positive or negative).
n = 10
X = Maintenance cost
Y = Sales cost
ii) Find Karl Pearson's correlation coefficient for the following paired data.
• For an input x, if the output is continuous, this is called a regression problem. For
example, based on historical information about the demand for toothpaste in your supermarket,
you are asked to predict the demand for the next month.
• Regression is concerned with the prediction of continuous quantities. Linear regression
is one of the oldest and most widely used predictive models in the field of machine learning.
The goal is to fit a straight line to a set of data points by minimizing the sum of the
squared errors.
• It is one of the supervised learning algorithms. A regression model requires the
knowledge of both the dependent and the independent variables in the training data set.
• Simple Linear Regression (SLR) is a statistical model in which there is only one
independent variable and the functional relationship between the dependent variable
and the regression coefficient is linear.
• Regression line is the line which gives the best estimate of one variable from the
value of any other given variable.
• The regression line gives the average relationship between the two variables in
mathematical form. For two variables X and Y, there are always two lines of regression.
• Regression line of Y on X: Gives the best estimate for the value of Y for any specific
given value of X:
Y = a + bX
where
a = Y-intercept
b = Slope of the line
Y = Dependent variable
X = Independent variable
• By using the least squares method, we are able to construct a best fitting straight line
to the scatter diagram points and then formulate a regression equation in the form of:
ŷ = a + bx
ŷ = ȳ + b(x- x¯ )
• Regression analysis is the art and science of fitting straight lines to patterns of data. In
a linear regression model, the variable of interest ("dependent" variable) is predicted
from k other variables ("independent" variables) using a linear equation.
• If Y denotes the dependent variable and X1, ..., Xk are the independent variables,
then the assumption is that the value of Y at time t in the data sample is determined by the
linear equation:
Yt = β0 + β1 X1t + β2 X2t + … + βk Xkt + εt
where the betas are constants and the epsilons are independent and identically distributed
normal random variables with mean zero.
Regression Line
• The regression line is a way of making a fairly precise prediction based upon the relationship
between two variables. The regression line is placed so that it minimizes the predictive error.
• The regression line does not go through every point; instead it balances the difference
between all data points and the straight-line model. The difference between the
observed data value and the predicted value (the value on the straight line) is the error
or residual. The criterion to determine the line that best describes the relation
between two variables is based on the residuals.
Residual = Observed - Predicted
• A negative residual indicates that the model is over-predicting. A positive residual indicates
that the model is under-predicting.
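The sign convention for residuals can be illustrated with a short Python sketch. The fitted line ŷ = 2.2 + 0.6x and the five data points are hypothetical values chosen for illustration, not taken from the question bank.

```python
# Hypothetical fitted line y_hat = 2.2 + 0.6x and observed data, for illustration only.
a, b = 2.2, 0.6
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

residuals = []
for x, y in zip(xs, ys):
    predicted = a + b * x
    residual = y - predicted          # Residual = Observed - Predicted
    residuals.append(residual)
    if residual > 0:
        label = "under-predicting"    # the line lies below the observed point
    elif residual < 0:
        label = "over-predicting"     # the line lies above the observed point
    else:
        label = "exact"
    print(x, y, round(predicted, 2), round(residual, 2), label)
```

For a least squares line the residuals sum to zero, which is why the positive and negative residuals balance out here.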
Linear Regression
• The simplest form of regression to visualize is linear regression with a single predictor.
A linear regression technique can be used if the relationship between X and Y
can be approximated with a straight line.
• Linear regression with a single predictor can be expressed with the equation:
y = θ2x + θ1 + e
• The regression parameters in simple linear regression are the slope of the line (θ2),
which gives the change in y for a one-unit change in x, and the y-intercept (θ1), the point
where the line crosses the y axis (x = 0).
• The model Y is a linear function of X: as the value of X changes, the value of Y increases or
decreases in a linear manner.
Nonlinear Regression:
• Often the relationship between x and y cannot be approximated with a straight line.
In this case, a nonlinear regression technique may be used.
• Imagine the vertical distance between the line and a data point, E = Y − E(Y).
This error is the deviation of the data point from the imaginary line, the regression line.
What, then, are the best values of a and b? They are the values that minimize the sum of such errors.
• Raw deviations do not have good properties for computation, because positive and negative
deviations cancel out. For this reason we use squared deviations: we choose the a and b that
minimize the sum of squared deviations rather than the sum of deviations. This method is called least squares.
• The least squares method minimizes the sum of squared errors. The resulting a and b are
called least squares estimators, i.e. estimators of the parameters α and β.
• The process of obtaining parameter estimators (e.g., a and b) is called estimation.
The least squares method is the estimation method used in Ordinary Least Squares (OLS).
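A minimal sketch of least squares estimation for a single predictor is given below. It uses the standard closed-form results b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄, which are not derived in the text above; the function name and the data are assumptions made for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y_hat = a + b*x that minimizes the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept, so the line passes through (mean_x, mean_y)
    return a, b

# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # sum of squared errors at the optimum

print(a, b, sse)
```

Any other choice of a and b gives a larger sum of squared errors for the same data, which is what makes these values the least squares estimates.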
Disadvantages of least square
1. Lack robustness to outliers.
• It basically states that if a variable is extreme the first time we measure it, it will tend to be closer
to the average the next time we measure it. In technical terms, it describes how a random
variable that is outside the norm eventually tends to return to the norm.
• For example, our odds of winning on a slot machine stay the same. We might hit a
"winning streak" which is, technically speaking, a set of random variables outside the norm.
But play the machine long enough and the random variables will regress to the mean
(i.e. "return to normal") and we shall end up losing.
Regression fallacy
• Regression fallacy assumes that a situation has returned to normal due to corrective
actions having been taken while the situation was abnormal. It does not take into
consideration normal fluctuations.
• An example of this could be a business program failing and causing problems
which is then cancelled. The return to "normal", which might be somewhat different
from the original situation or a situation of "new normal" could fall into the
category of regression fallacy. This is considered an informal fallacy.
UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING
PART – B
1. Explain in detail about Data Wrangling in Python? Nov/Dec2022
2. Explain the two main ways to carry out Boolean masking? April/may2022
3. Explain in detail about Aggregation in Pandas? April/may2023
4. Pandas Data Frame - transform() function? Nov/Dec2023
5. Explain in detail about the pivot table using python?
PART – A
print(df.head())
PART – B
1. Explain in detail about Data Wrangling in Python? Nov/Dec2022
Data wrangling involves processing data in various ways, such as merging, grouping and
concatenating, for the purpose of analysing it or getting it ready to be used with another set of data.
Python libraries such as Pandas have built-in features to apply these wrangling methods to various data sets to achieve the
analytical goal. In this chapter we will look at a few examples describing these methods.
Merging Data
The Pandas library in python provides a single function, merge, as the entry point for all standard
database join operations between DataFrame objects −
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True)
Let us now create two different DataFrames and perform the merging operations on it.
3 Alice 4 sub6
4 Ayoung 5 sub5
Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5
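The code that produced the output above is missing from this copy. A sketch of how the two frames might be built and merged is given below. Only the printed rows come from the source; the first three rows of `left` are assumed (borrowed from the `one` frame used later in this answer), and the choice of `subject_id` as the join key is also an assumption.

```python
import pandas as pd

# Reconstructed inputs: only the rows printed above are from the source;
# the first three rows of `left` are assumed for illustration.
left = pd.DataFrame({
    "Name": ["Alex", "Amy", "Allen", "Alice", "Ayoung"],
    "id": [1, 2, 3, 4, 5],
    "subject_id": ["sub1", "sub2", "sub4", "sub6", "sub5"],
})
right = pd.DataFrame({
    "Name": ["Billy", "Brian", "Bran", "Bryce", "Betty"],
    "id": [1, 2, 3, 4, 5],
    "subject_id": ["sub2", "sub4", "sub3", "sub6", "sub5"],
})

# Inner join: keep only subject_id values present in both frames.
inner = pd.merge(left, right, on="subject_id", how="inner")

# Left join: keep every row of `left`; unmatched rows get NaN on the right side.
left_join = pd.merge(left, right, on="subject_id", how="left")

print(inner)
print(left_join)
```

Columns that appear in both frames but are not join keys (here `Name` and `id`) receive the default `_x` and `_y` suffixes in the result.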
Grouping Data
Grouping data sets is a frequent need in data analysis where we need the result in terms of various groups
present in the data set. Pandas has built-in methods which can roll the data into various groups.
In the example below we group the data by year and then get the result for a specific year.
grouped = df.groupby('Year')
print(grouped.get_group(2014))
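The snippet above assumes an existing DataFrame `df` with a `Year` column. A self-contained sketch with made-up data (team names, years and points are all hypothetical) shows both selecting one group and aggregating per group:

```python
import pandas as pd

# Hypothetical data with a 'Year' column, since the source does not show how df was built.
df = pd.DataFrame({
    "Team": ["Riders", "Devils", "Kings", "Riders", "Kings", "Devils"],
    "Year": [2014, 2014, 2014, 2015, 2015, 2015],
    "Points": [876, 863, 741, 789, 812, 673],
})

grouped = df.groupby("Year")

rows_2014 = grouped.get_group(2014)       # all rows whose Year is 2014
mean_points = grouped["Points"].mean()    # one aggregated value per year

print(rows_2014)
print(mean_points)
```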
import pandas as pd
# First DataFrame
one = pd.DataFrame({
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id':['sub1','sub2','sub4','sub6','sub5'],
    'Marks_scored':[98,90,87,69,78]},
    index=[1,2,3,4,5])
# Second DataFrame
two = pd.DataFrame({
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id':['sub2','sub4','sub3','sub6','sub5'],
    'Marks_scored':[89,80,79,97,88]},
    index=[1,2,3,4,5])
# Stack the rows of the two DataFrames one below the other
print(pd.concat([one, two]))
2. Explain the two main ways to carry out Boolean masking? April/May2022
The NumPy library in Python is a popular library for working with arrays. Boolean masking, also
called boolean indexing, is a feature in Python NumPy that allows for the filtering of values
in numpy arrays.
Syntax
arr[arr > 5]
Parameter values
arr: This is the array that we are querying.
The condition arr > 5 is the criterion with which values in the arr array will be filtered.
Return value
This method returns a NumPy array, ndarray, with values that satisfy the given condition. The line in the
example given above will return all the values in arr that are greater than 5.
Example
Let's try out this method in the following example:
# importing NumPy
import numpy as np
# Creating a NumPy array
arr = np.arange(15)
# Printing our array to observe
print(arr)
# Using boolean masking to filter elements greater than or equal to 8
print(arr[arr >= 8])
# Using boolean masking to filter elements equal to 12
print(arr[arr == 12])
Syntax
The code snippet given below shows us how to use this method:
mask = arr > 5
Return value
The line in the code snippet given above will:
Return an array with the same size and dimensions as arr. This array will only contain the
values True and False. All the True values represent elements in the same position in arr that
satisfy our condition, and all the False values represent elements in the same position in arr that do
not satisfy our condition.
Store this boolean array in a mask array.
The mask array can be passed in the index brackets of arr to return the values that satisfy our condition. We will
see how this works in our coding example.
Example
Let's try out this method in the following example:
# importing NumPy
import numpy as np
# Creating a NumPy array
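The example in the source is cut off at this point. A minimal sketch of the second approach, mirroring the first example (the array and the condition `arr > 5` are reused from above), might look like this:

```python
import numpy as np

arr = np.arange(15)

# Build the boolean mask first, then reuse it as an index.
mask = arr > 5
filtered = arr[mask]

print(mask)       # boolean array with the same shape as arr
print(filtered)   # only the elements where mask is True
```

Storing the mask in a variable is convenient when the same condition is reused, for example to count matches with `mask.sum()` or to select the complement with `arr[~mask]`.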
Pandas provides us with a variety of aggregate functions. These functions help to perform various
activities on the datasets. The functions are:
.count(): This gives a count of the non-missing values in a column.
.sum(): This gives the sum of data in a column.
.min() and .max(): These help to find the minimum value and maximum value in a column,
respectively.
.mean() and .median(): These help to find the mean and median of the values in a column,
respectively.
DataFrame.aggregate(func, axis=0, *args, **kwargs)
Parameters:
func: A callable, string, dictionary, or list of strings/callables.
It is used for aggregating the data. A function must work either when passed a DataFrame or when passed to
DataFrame.apply(). For a DataFrame, a dict can be passed if the keys are the column names.
Returns:
It returns a scalar, Series or DataFrame.
scalar: Returned when Series.agg is called with a single function.
Series: Returned when DataFrame.agg is called with a single function.
DataFrame: Returned when DataFrame.agg is called with several functions.
Example:
import pandas as pd
import numpy as np
info=pd.DataFrame([[1,5,7],[10,12,15],[18,21,24],[np.nan,np.nan,np.nan]],columns=['X','Y','Z'])
info.agg(['sum','min'])
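The three return types listed above can be seen side by side in a short sketch that reuses the `info` frame from the example. The variable names are illustrative, and string function names are used throughout.

```python
import numpy as np
import pandas as pd

info = pd.DataFrame(
    [[1, 5, 7], [10, 12, 15], [18, 21, 24], [np.nan, np.nan, np.nan]],
    columns=["X", "Y", "Z"],
)

scalar_result = info["X"].agg("sum")               # Series.agg with one function -> scalar
series_result = info.agg("sum")                    # DataFrame.agg with one function -> Series
frame_result = info.agg(["sum", "min"])            # DataFrame.agg with several functions -> DataFrame
per_column = info.agg({"X": "sum", "Y": "min"})    # dict keyed by column name

print(scalar_result)
print(series_result)
print(frame_result)
print(per_column)
```

The row of missing values is skipped by these aggregations, so the sums and minimums are computed from the first three rows only.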
Parameter: func (Required)
Type: function, str, list or dict
Description: Function to use for transforming the data. If a function, it must either work when
passed a DataFrame or when passed to DataFrame.apply.
Accepted combinations are:
function
string function name
list of functions and/or function names, e.g. [np.exp, 'sqrt']
dict of axis labels -> functions, function names or list of such.
0 0 2
1 1 3
2 2 4
3 3 5
In [3]:
df.transform(lambda x: x + 1)
Out[3]:
X Y
0 1 3
1 2 4
2 3 5
3 4 6
Even though the resulting DataFrame must have the same length as the input DataFrame,
it is possible to provide several input functions:
In [4]:
s = pd.Series(range(4))
s
Out[4]:
0 0
1 1
2 2
3 3
dtype: int64
In [5]:
s.transform([np.sqrt, np.exp])
Out[5]:
sqrt exp
0 0.000000 1.000000
1 1.000000 2.718282
2 1.414214 7.389056
3 1.732051 20.085537
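The first input cell of this example is missing from this copy. The sketch below rebuilds a DataFrame that matches the printed output (X = 0..3 and Y = 2..5, which is an inference from that output) and repeats both transform calls in one runnable script.

```python
import numpy as np
import pandas as pd

# Reconstructed from the printed output above: X = 0..3, Y = 2..5 (an assumption).
df = pd.DataFrame({"X": range(4), "Y": range(2, 6)})

# transform applies the function element-wise and returns a result of the same shape.
plus_one = df.transform(lambda x: x + 1)

# A list of functions gives one output column per function.
s = pd.Series(range(4))
roots_and_exps = s.transform([np.sqrt, np.exp])

print(plus_one)
print(roots_and_exps)
```

Unlike aggregation, transform never reduces the number of rows: the output always has the same length as the input.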
Q5. Explain in detail about the pivot table using python?
Most people likely have experience with pivot tables in Excel. Pandas provides a similar function called
(appropriately enough) pivot_table. While it is exceedingly useful, it can be hard to
remember how to use the syntax to format the output for a given need. This answer focuses on explaining
the pandas pivot_table function and how to use it for data analysis.
The Data
One of the challenges with using the pandas pivot_table is making sure we understand our data and what
questions we are trying to answer with the pivot table. It is a seemingly simple function but can produce
very powerful analysis very quickly.
In this scenario, we are going to track a sales pipeline (also called a funnel). The basic problem is that
some sales cycles are very long (think "enterprise software", capital equipment, etc.) and management
wants to understand it in more detail throughout the year.
Many companies will have CRM tools or other software that sales uses to track the process. While they
may have useful tools for analyzing the data, inevitably someone will export the data to Excel and use a
PivotTable to summarize the data.
import pandas as pd
import numpy as np
Version Warning
The pivot_table API has changed over time, so please make sure a recent version of pandas (>
0.15) is installed for this example to work. This example also uses the category data type, which requires a
recent version as well.
Read in our sales funnel data into our DataFrame
df = pd.read_excel("../in/sales-funnel.xlsx")
df.head()
For convenience, let's define the Status column as a category and set the order we want to view.
This isn't strictly required but helps us keep the order we want as we work through analyzing the data.
df["Status"] = df["Status"].astype("category")
df["Status"] = df["Status"].cat.set_categories(["won","pending","presented","declined"])
The simplest pivot table must have a dataframe and an index. In this case, let's use the Name as
our index.
pd.pivot_table(df,index=["Name"])
We can have multiple indexes as well. In fact, most of the pivot_table args can take multiple values via
a list.
pd.pivot_table(df,index=["Name","Rep","Manager"])
This is interesting but not particularly useful. What we probably want to do is look at this by Manager and
Rep. It's easy enough to do by changing the index.
pd.pivot_table(df,index=["Manager","Rep"])
We can see that the pivot table is smart enough to start aggregating the data and summarizing it by
grouping the reps with their managers. Now we start to get a glimpse of what a pivot table can do for us.
For this purpose, the Account and Quantity columns aren't really useful. Let's remove them by explicitly
defining the columns we care about using the values field.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])
The price column automatically averages the data but we can do a count or a sum. Adding them is simple
using aggfunc and np.sum .
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)
aggfunc can take a list of functions. Let's try a mean using the numpy mean function and len to get
a count.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])
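Because the sales-funnel.xlsx file is not included here, the calls above cannot be run as they stand. The sketch below uses a small made-up DataFrame with the same Manager, Rep and Price columns (all names and values are hypothetical). It passes aggregation functions by string name ("sum", "mean", "count") rather than np.sum, np.mean and len; this is a substitution made for the sketch, on the assumption that string names are the safer choice in recent pandas versions.

```python
import pandas as pd

# Hypothetical sales data; the real sales-funnel.xlsx file is not reproduced here.
sales = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Debra", "Fred", "Fred"],
    "Rep": ["Craig", "Craig", "Daniel", "Cedric", "Wendy"],
    "Price": [30000, 10000, 5000, 35000, 65000],
})

# Total price per (Manager, Rep) pair.
total_price = pd.pivot_table(
    sales, index=["Manager", "Rep"], values=["Price"], aggfunc="sum"
)

# Several aggregations at once: the result has one column group per function.
mean_and_count = pd.pivot_table(
    sales, index=["Manager", "Rep"], values=["Price"], aggfunc=["mean", "count"]
)

print(total_price)
print(mean_and_count)
```

With a list of functions the columns form a two-level index, so a single cell is addressed as, for example, `mean_and_count.loc[("Debra", "Craig"), ("mean", "Price")]`.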
UNIT V
DATA VISUALIZATION
PART – B
1. Explain various features of Matplotlib platform used for data visualization and
illustrate its challenges? Nov/Dec2023
2. Explain various types of plotting using Scatter Plots in python? (OR)
Write a code snippet that projects our globe as a 2-D flat surface (using
cylindrical project) and convey information about the location of any three
major Indian cities in the map (using scatter plot).April/May2024
Essentially, there are four libraries that are used for data visualization in python:
Matplotlib
Seaborn
Plotly
Bokeh
3. How can we visualize more than three dimensions of data in a single chart?
To visualize data beyond three dimensions, we need to use visual cues such as color, size,
and shape.
Color is used to depict both continuous and categorical data.
Marker Size is used to represent continuous data. Can be used for categorical data as
well. However, since size differences are difficult to detect, it is not considered the most
appropriate choice for categorical data.
Shapes are used to represent different classes.
4. How to plot the distribution of customers by age?
The distribution of customers by age can be plotted simply by creating a histogram from the
Age column of the customer’s Data Frame.
1. What is the purpose of a Scatter plot?
Scatter plots are used to observe relationships between two different numeric variables.
2. Define IQR in a box plot?
IQR stands for interquartile range. In a box plot, the IQR is the length of the box.
3. What is a Boxplot?
A Box and Whisker Plot (or Boxplot) is used to represent a data distribution through its
quartiles. The graph looks like a rectangle with lines extending from the top and bottom.
These lines are known as the “whiskers”, and represent the variability outside the upper
and lower quartiles.
5. What is a heat map in Python? Create a correlation matrix using the corr() function of
the data frame?
Heat maps are used to cross-examine multivariate data and represent it through color
variations.
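The correlation matrix asked for in this question can be produced with the DataFrame `corr()` method. A minimal sketch follows; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical numeric data, for illustration only.
df = pd.DataFrame({
    "age": [23, 35, 47, 52, 61],
    "income": [28, 42, 55, 61, 70],
    "hours_online": [30, 24, 15, 12, 8],
})

corr_matrix = df.corr()   # pairwise Pearson correlation by default
print(corr_matrix.round(2))
```

The resulting square matrix is what a heat map then colors; with Seaborn installed, a call such as `sns.heatmap(corr_matrix, annot=True)` would be one common way to draw it.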
6. What is a scatter plot? For what type of data is scatter plot usually used for?
A scatter plot is a chart used to plot a correlation between two or more variables at the
same time. It’s usually used for numeric data.
1. Correlation: the two variables might have a relationship, for example, one might depend on
another. But this is not the same as causation.
2. Outliers: there could be cases where the data in two dimensions does not follow the
general pattern.
3. Clusters: sometimes there could be groups of data that form a cluster on the plot.
5. Barriers: boundaries.
8. What type of plot would you use if you need to demonstrate “relationship” between
variables/parameters?
When we are trying to show the relationship between 2 variables, scatter plots or charts
are used. When we are trying to show “relationship” between three variables, bubble
charts are used.
9. When will you use a histogram and when will you use a bar chart?
Both plots are used to plot the distribution of a variable. Histograms are usually used for
a numerical (continuous) variable, while bar charts are used for a categorical variable.
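The difference can be seen side by side in the sketch below, which uses hypothetical data: a histogram cuts a numeric variable into adjacent bins, while a bar chart draws one separate bar per category.

```python
import matplotlib.pyplot as plt

# Hypothetical data for illustration.
heights_cm = [150, 152, 155, 158, 160, 161, 163, 165, 165, 168, 170, 172, 175, 180]
fruit_counts = {"apple": 5, "banana": 3, "cherry": 7}

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(9, 4))

# Histogram: a numeric variable is cut into adjacent bins.
ax_hist.hist(heights_cm, bins=6, edgecolor="black")
ax_hist.set_xlabel("height (cm)")
ax_hist.set_ylabel("frequency")
ax_hist.set_title("Histogram (numeric)")

# Bar chart: one separate bar per category.
ax_bar.bar(list(fruit_counts.keys()), list(fruit_counts.values()))
ax_bar.set_xlabel("fruit")
ax_bar.set_ylabel("count")
ax_bar.set_title("Bar chart (categorical)")

fig.tight_layout()
plt.show()
```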
PART – B
1. Explain the various features of the Matplotlib platform used for data visualization and
illustrate its challenges. Nov/Dec 2023
• Matplotlib is a cross-platform, data visualization and graphical plotting library for Python
and its numerical extension NumPy.
• Matplotlib is a comprehensive library for creating static, animated and interactive
visualizations in Python.
• Matplotlib is a plotting library for the Python programming language. It allows us to make
quality charts in a few lines of code. Several other Python plotting libraries, such as
Seaborn, are built on top of Matplotlib.
• The library is primarily designed for 2D output, which already provides the means to
express data patterns graphically; 3D plots are available through its mplot3d toolkit.
Visualizing Information: Starting with Graph
• Data visualization is the presentation of quantitative information in a graphical form.
In other words, data visualizations turn large and small datasets into visuals that are easier
for the human brain to understand and process.
• Good data visualizations are created when communication, data science, and
design come together. Data visualizations done right offer key insights into complicated
datasets in ways that are meaningful and intuitive.
• A graph is simply a visual representation of numeric data. Matplotlib supports a
large number of graph and chart types.
• Matplotlib is a popular Python package used to build plots. Matplotlib can also be used
to make 3D plots and animations.
• Line plots can be created in Python with Matplotlib's pyplot library. To build a line plot
first import Matplotlib. It is a standard convention to import Matplotlib's pyplot library as
plt.
• To define a plot, you need some values, the matplotlib.pyplot module, and an idea of
what you want to display.
import matplotlib.pyplot as plt
plt.plot([1,2,3],[5,7,4])
plt.show()
• plt.plot() draws the plot in the background, but we need to bring it to the screen
when we are ready, after graphing everything we intend to.
• plt.show(): With that, the graph window should pop up. If it does not, it may have opened
behind other windows, or an error may have occurred.
• This window is a Matplotlib window, which allows us to see our graph, as well as
interact with it and navigate it.
Line Plot
• More than one line can be drawn in the same plot. To add another line, just call the
plt.plot(x, y) function again. In the example below we have two different sets of y values
(y1, y2) that are plotted onto the chart.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-1, 1, 50)
y1 = 2*x + 1
y2 = 2**x + 1
# Solid line for y2, dashed line for y1
plt.plot(x, y2)
plt.plot(x, y1,
         linewidth=1.0,
         linestyle='--')
plt.show()
Scatter Plots
• A scatter plot is a visual representation of how two variables relate to each other.
We can use scatter plots to explore the relationship between two variables, for example by
looking for any correlation between them.
• Matplotlib also supports more advanced plots, such as scatter plots. In this case, the
scatter function is used to display data values as a collection of x, y coordinates
represented by standalone dots.
import matplotlib.pyplot as plt
# X axis values:
x = [2,3,7,29,8,5,13,11,22,33]
# Y axis values:
y = [4,7,55,43,2,4,11,22,33,44]
# Create scatter plot:
plt.scatter(x, y)
plt.show()
• Comparing plt.scatter() and plt.plot(): We can also produce the scatter plot shown
above using another function within matplotlib.pyplot. Matplotlib's plt.plot() is a
general-purpose plotting function that allows the user to create various line or marker plots.
• We can achieve the same scatter plot as the one obtained in the section above with the
following call to plt.plot(), using the same data:
plt.plot(x, y, "o")
plt.show()
• In this case, we had to include the marker "o" as a third argument, as otherwise plt.plot()
would plot a line graph. The plot created with this code is identical to the plot created
earlier with plt.scatter().
Creating Advanced Scatterplots
• Scatterplots are especially important for data science because they can show data patterns
that aren't obvious when viewed in other ways.
import matplotlib.pyplot as plt
x_axis1 =[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_axis1 =[5, 16, 34, 56, 32, 56, 32, 12, 76, 89]
x_axis2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_axis2 = [53, 6, 46, 36, 15, 64, 73, 25, 82, 9]
plt.title("Prices over 10 years")
plt.scatter(x_axis1, y_axis1, color = 'darkblue', marker='x', label="item 1")
plt.scatter(x_axis2, y_axis2, color='darkred', marker='x', label="item 2")
plt.xlabel("Time (years)")
plt.ylabel("Price (dollars)")
plt.grid(True)
plt.legend()
plt.show()
• The chart displays two data sets. We distinguish between them by the colour of the
marker.
Visualizing Errors
• Error bars can be added to Matplotlib line plots and graphs. Error is the difference
between the calculated value and the actual value.
• Without error bars, graphs give the impression that a measured or calculated
number is known to a high level of precision. The method matplotlib.pyplot.errorbar()
draws y vs. x as lines and/or markers with error bars attached.
• Adding an error bar in Matplotlib is very simple: we just have to supply the
value of the error. We use the command:
plt.errorbar(x, y, yerr = 2, capsize=3)
Where:
x = The data of the X axis.
y = The data of the Y axis.
yerr = The error value of the Y axis. It can be a single value shared by all points, or one
value per point.
xerr = The error value of the X axis.
capsize = The length of the caps at the lower and upper ends of the error bar.
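As a minimal sketch of these parameters used together, the example below gives each point its own yerr value and applies a single xerr value to every point. The data values are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4])
y = np.array([10, 14, 9, 17])
y_error = np.array([1.0, 0.5, 2.0, 1.5])  # one error value per point
x_error = 0.2                              # the same error for every point

container = plt.errorbar(x, y, yerr=y_error, xerr=x_error,
                         fmt="o", capsize=3)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```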
• A simple example, where we plot only one point. The error is 10% on the Y axis.
import matplotlib.pyplot as plt
x = 1
y = 20
y_error = y * 0.10  # 10% error
plt.errorbar(x, y, yerr=y_error, fmt="ob", capsize=3)
plt.show()
• We plot using the command "plt.errorbar (...)", giving it the desired characteristics.
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(1,8)
y = np.array([20,10,45,32,38,21,27])
y_error = y * 0.10  # 10% error on each point
plt.errorbar(x, y, yerr = y_error,
             linestyle="None", fmt="ob", capsize=3, ecolor="k")
plt.show()
• Parameters of the errorbar :
plt.legend(loc='upper left')
plt.title('Example')
plt.show()
• Histograms can display a large amount of data and the frequency of the data values.
The median and distribution of the data can be determined by a histogram. In addition,
it can show any outliers or gaps in the data.
• Matplotlib provides a dedicated function to compute and display histograms: plt.hist()
• Code for creating a histogram with randomized data:
import numpy as np
import matplotlib.pyplot as plt
x = 40 * np.random.randn(50000)
plt.hist(x, 20, range=(-50, 50), histtype='stepfilled',
         align='mid', color='r', label='Test Data')
plt.legend()
plt.title('Histogram')
plt.show()
Example :
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,8))
ax = plt.axes(projection='3d')
ax.grid()
# Parametric helix: x = sin(t), y = cos(t), z = t
t = np.arange(0, 10*np.pi, np.pi/50)
x = np.sin(t)
y = np.cos(t)
ax.plot3D(x, y, t)
ax.set_title('3D Parametric Plot')
# Set axes label
ax.set_xlabel('x', labelpad=20)
ax.set_ylabel('y', labelpad=20)
ax.set_zlabel('t', labelpad=20)
plt.show()