

CS3352 FOUNDATIONS OF DATA SCIENCE

II YEAR / III SEMESTER B.E / B.Tech- COMMON TO ALL

COMPILED BY

M.DEIVAMANI AP/CSE

VERIFIED BY

HOD PRINCIPAL CEO/CORRESPONDENT

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DR.NAVALAR NEDUNCHEZHIYAN COLLEGE OF ENGINEERING – THOLUDUR.


CS3352 FOUNDATIONS OF DATA SCIENCE


COURSE OBJECTIVES:
• To understand the data science fundamentals and process.
• To learn to describe the data for the data science process.
• To learn to describe the relationship between data.
• To utilize the Python libraries for Data Wrangling.
• To present and interpret data using visualization libraries in Python

UNIT I INTRODUCTION
Data Science: Benefits and uses – facets of data – Data Science Process: Overview -Defining research goals-
Retrieving data – Data preparation – Exploratory Data analysis – Build the model– presenting findings and
building applications – Data Mining – Data Warehousing – Basic Statistical descriptions of Data
UNIT II DESCRIBING DATA
Types of Data - Types of Variables -Describing Data with Tables and Graphs –
Describing Data with Averages - Describing Variability - Normal Distributions and Standard (z) Scores
UNIT III DESCRIBING RELATIONSHIPS
Correlation –Scatter plots –correlation coefficient for quantitative data –computational formula for
correlation coefficient – Regression – regression line – least squares regression line – Standard error of
estimate – interpretation of r2 –multiple regression equations –regression towards the mean
UNIT IV PYTHON LIBRARIES FOR DATA WRANGLING
Basics of Numpy arrays –aggregations –computations on arrays –comparisons, masks, boolean logic –
fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and selection –
operating on data – missing data – Hierarchical indexing – combining datasets – aggregation and grouping -
pivot tables
UNIT V DATA VISUALIZATION
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots –
Histograms – legends – colors – subplots – text and annotation – customization – three dimensional
plotting - Geographic Data with Base map - Visualization with Seaborn.

COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO1: Define the data science process
CO2: Understand different types of data description for data science process
CO3: Gain knowledge on relationships between data
CO4: Use the Python Libraries for Data Wrangling
CO5: Apply visualization Libraries in Python to interpret and explore data

TEXT BOOKS
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning
Publications, 2016. (Unit I)
2. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
(Units II and III)
3. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.
(Units IV and V)
REFERENCES:
Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press,2014


Unit I: Introduction
 Data Science and Big Data
 Facets of Data
 Data Science Process
 Defining Research Goals
 Retrieving Data
 Data Preparation
 Exploratory Data Analysis
 Build the Models
 Presenting Findings and Building Applications
 Data Mining
 Basic Statistical Descriptions of Data

LIST OF IMPORTANT QUESTIONS


PART-A
1 What is data science? Nov/Dec2022
2 Define structured data? Nov/Dec2023
3 What is data?
4 What is unstructured data ?April/May2023
5 What is machine - generated data ?
6 Define streaming data.
7 List the stages of data science process.
8 What are the advantages of data repositories?
9 What is data cleaning? April/May2024
10 What is outlier detection? Nov/Dec2022
11 Explain exploratory data analysis.
12 Define data mining? April/May2023
13 What are the three challenges to data mining regarding data mining methodology?
14 What is predictive mining?
15 List the stages of data science process.
16 Difference between Structured and Unstructured Data Nov/Dec 2023
17 What is Euclidean distance ?
18 List out Some applications of Data Science.
PART B
1. Elaborate about the steps in the data science process with a diagram?
April/May2023


2. Explain the different Facets of Data with the challenges in processing?


Nov/Dec2022&2023
3. Explore the various steps associated with data science process and explain any
three steps of it with suitable diagrams and example.? Nov/Dec2022
4. Defining Research Goals, Retrieving Data?
5. Explain in details about Data cleansing, integrating, transforming data and
build a model? April/may2022&2023
6. Explain in detail Data Mining? (OR) Explain the Data Analytics life cycle and brief about
Time-Series Analysis. April/may2024
7. What is a data warehouse? Outline the architecture of a data warehouse
with a diagram? April/May2023

UNIT I
PART – A

Two Marks Questions with Answers

Q.1 What is data science? Nov/Dec2022

Ans;
• Data science is an interdisciplinary field that seeks to extract knowledge or insights
from various forms of data.
• At its core, data science aims to discover and extract actionable knowledge from data
that can be used to make sound business decisions and predictions.
• Data science uses advanced analytical theory and various methods, such as time series
analysis, for predicting the future.
Q.2 Define structured data? Nov/Dec2023
Ans. Structured data is arranged in a rows-and-columns format, which helps applications
retrieve and process data easily. A database management system is used for storing
structured data. The term structured data refers to data that is identifiable because it is
organized in a structure.
Q.3 What is data?
Ans. A data set is a collection of related records or information. The information may be
on some entity or some subject area.
Q.4 What is unstructured data ?April/May2023
Ans. Unstructured data is data that does not follow a specified format. Rows and columns
are not used for unstructured data; therefore it is difficult to retrieve required information.
Unstructured data has no identifiable structure.
Q.5 What is machine - generated data ?
Ans. Machine-generated data is information that is created without human interaction
as a result of a computer process or application activity. This means that data entered
manually by an end-user is not recognized to be machine-generated.

Q.6 Define streaming data.


Ans; Streaming data is data that is generated continuously by thousands of data sources,
which typically send in the data records simultaneously and in small sizes
(order of Kilobytes).
Q.7 List the stages of data science process.
Ans.: Stages of data science process are as follows:
1. Discovery or Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modelling
6. Presentation and automation
Q.8 What are the advantages of data repositories?
Ans.: Advantages are as follows:
i. Data is preserved and archived.
ii. Data isolation allows for easier and faster data reporting.
iii. Database administrators have easier time tracking problems.
iv. There is value to storing and analysing data.
Q.9 What is data cleaning? April/May2024
Ans. Data cleaning means removing the inconsistent data or noise and collecting
necessary information of a collection of interrelated data.
Q.10 What is outlier detection? Nov/Dec2022
Ans. : Outlier detection is the process of detecting and subsequently excluding outliers
from a given set of data. The easiest way to find outliers is to use a plot or a table with the
minimum and maximum values.
Q.11 Explain exploratory data analysis.
Ans. : Exploratory Data Analysis (EDA) is a general approach to exploring datasets by
means of simple summary statistics and graphic visualizations in order to gain
a deeper understanding of data. EDA is used by data scientists to analyse and
investigate data sets and summarize their main characteristics, often employing
data visualization methods.
Q.12 Define data mining? April/May2023
Ans. : Data mining refers to extracting or mining knowledge from large amounts of data.
It is a process of discovering interesting patterns or Knowledge from a large amount of data
stored either in databases, data warehouses, or other information repositories.


Q.13 What are the three challenges to data mining regarding data mining
methodology?
Ans. Challenges to data mining regarding data mining methodology include the following:
1. Mining different kinds of knowledge in databases,
2. Interactive mining of knowledge at multiple levels of abstraction,
3. Incorporation of background knowledge.
Q.14 What is predictive mining?
Ans. Predictive mining tasks perform inference on the current data in order to make
predictions. Predictive analysis answers queries about the future by using historical data
as the chief basis for decisions.
Q.15 List the stages of data science process.
Ans. Data science process consists of six stages:
1. Discovery or Setting the research goal 2. Retrieving data 3. Data preparation
4. Data exploration 5. Data modelling 6. Presentation and automation
Q.16 Difference between Structured and Unstructured Data Nov/Dec 2023
Ans. Structured data is arranged in a rows-and-columns format (for example, a database
table), so applications can retrieve and process it easily. Unstructured data (for example,
email, text documents, audio and video) does not follow a specified format and has no
identifiable structure, which makes retrieving the required information difficult.

Q.17 What is Euclidean distance ?


Ans. Euclidean distance is used to measure the similarity between observations. It is
calculated as the square root of the sum of squared differences between the coordinates of two points.

Q.18 List out Some applications of Data Science.


Ans.  Internet Search Results (Google)
 Recommendation Engine (Spotify)


 Intelligent Digital Assistants (Google Assistant)
 Autonomous Driving Vehicle (Waymo, Tesla)
 Spam Filter (Gmail)
 Abusive Content and Hate Speech Filter (Facebook)
 Robotics (Boston Dynamics)
 Automatic Piracy Detection (YouTube)

PART B
Q.1 Elaborate about the steps in the data science process with a diagram?
(April/May2023 )
Data Science:
• Data is measurable units of information gathered or captured from the activity of people,
places and things.
• Data science is an interdisciplinary field that seeks to extract knowledge or insights
from various forms of data. At its core, data science aims to discover and extract
actionable knowledge from data that can be used to make sound business decisions
and predictions.
• Data science combines math and statistics, specialized programming, advanced
analytics, Artificial Intelligence (AI) and machine learning with specific subject
matter expertise to uncover actionable insights hidden in an organization's data.
Life cycle of data science:
1. Capture: Data acquisition, data entry, signal reception and data extraction.
2. Maintain: Data warehousing, data cleansing, data staging, data processing
and data architecture.
3. Process: Data mining, clustering and classification, data modelling and data
summarization.
4. Analyse: Exploratory and confirmatory analysis, predictive analysis, regression,
text mining and qualitative analysis.
5. Communicate: Data reporting, data visualization, business intelligence and
decision making.
Benefits and Uses of Data Science
• Data science example and applications :
a) Anomaly detection: Fraud, disease and crime
b) Classification: Background checks; an email server classifying emails as "important"

c) Forecasting: Sales, revenue and customer retention


d) Pattern detection: Weather patterns, financial market patterns


e) Recognition : Facial, voice and text
f) Recommendation: Based on learned preferences, recommendation engines can refer
user to movies, restaurants and books
g) Regression: Predicting food delivery times, predicting home prices based on amenities
h) Optimization: Scheduling ride-share pickups and package deliveries
Big Data:
• Big data is a blanket term for any collection of data sets so large or complex that it
becomes difficult to process them using traditional data management techniques such
as for example, the RDBMS.
Characteristics of Big Data
• Characteristics of big data are volume, velocity and variety. They are often referred to
as the three V's.
1. Volume: Volumes of data are larger than conventional relational database
infrastructure can cope with, consisting of terabytes or petabytes of data.
2. Velocity: The term 'velocity' refers to the speed of generation of data. How fast the
data is generated and processed to meet the demands, determines real potential in the data.
It is being created in or near real-time.
3. Variety: It refers to heterogeneous sources and the nature of data, both structured
and unstructured.

• These three dimensions are also called the three V's of big data.

• Two other characteristics of big data are veracity and value.


a) Veracity:
• Veracity refers to source reliability, information credibility and content validity.
• Veracity refers to the trustworthiness of the data. Can the manager rely on the fact that
the data is representative? Every good manager knows that there are inherent discrepancies
in all the data collected.
• Spatial veracity: For vector data (imagery based on points, lines and polygons), the
quality varies. It depends on whether the points have been GPS determined by unknown
origins or manually. Also, resolution and projection issues can alter veracity.
• For geo-coded points, there may be errors in the address tables and in the point
location algorithms associated with addresses.


• For raster data (imagery based on pixels), veracity depends on accuracy of
recording instruments in satellites or aerial devices and on timeliness.
b) Value :
• It represents the business value to be derived from big data.
• The ultimate objective of any big data project should be to generate some sort of value
for the company doing all the analysis. Otherwise, user just performing some
technological task for technology's sake.
• Exploration of data trends can include spatial proximities and relationships.
Benefits and Use of Big Data
• Benefits of Big Data :
1. Improved customer service
2. Businesses can utilize outside intelligence while taking decisions
3. Reducing maintenance costs
4. Re-develop our products : Big Data can also help us understand how others
perceive our products so that we can adapt them or our marketing, if need be.
5. Early identification of risk to the product/services, if any
6. Better operational efficiency
• Some of the examples of big data are:
1. Social media : Social media is one of the biggest contributors to the flood of data
we have today. Facebook generates around 500+ terabytes of data every day in the
form of content generated by the users like status messages, photos and video
uploads, messages, comments etc.
2. Stock exchange : Data generated by stock exchanges is also in terabytes per day.
Most of this data is the trade data of users and companies.
3. Aviation industry: A single jet engine can generate around 10 terabytes of data
during a 30 minute flight.
4. Survey data: Online or offline surveys conducted on various topics which
typically has hundreds and thousands of responses and needs to be processed for
analysis and visualization by creating a cluster of population and their
associated responses.
5. Compliance data: Many organizations, such as healthcare providers, hospitals, life
sciences and finance companies, have to file compliance reports.

Q2. Explain the different Facets of Data with the challenges in processing?
(Nov/Dec2022&2023)
Very large amounts of data are generated in big data and data science. These data are of
various types, and the main categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
• Structured data is arranged in a rows-and-columns format, which helps applications
retrieve and process data easily. A database management system is used for storing
structured data.
• The term structured data refers to data that is identifiable because it is organized in a
structure. The most common form of structured data or records is a database where
specific information is stored based on a methodology of columns and rows.
• An Excel table is an example of structured data.

Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are
not used for unstructured data; therefore it is difficult to retrieve the required
information. Unstructured data has no identifiable structure.
• The unstructured data can be in the form of Text: (Documents, email messages,
customer feedbacks), audio, video, images. Email is an example of unstructured data.
• Even today in most of the organizations more than 80 % of the data are in unstructured
form. This carries lots of information. But extracting information from these various
sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restriction or sequence for unstructured data
Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words and
sentences, then apply meaning and understanding to that information. This helps
machines to understand language as humans do.
• Natural language processing is the driving force behind machine intelligence in many
modern real-world applications. The natural language processing community has had
success in entity recognition, topic recognition, summarization, text completion and
sentiment analysis.
Machine - Generated Data
• Machine-generated data is an information that is created without human interaction as a
result of a computer process or application activity. This means that data entered manually
by an end-user is not recognized to be machine-generated.


• Machine data contains a definitive record of all activity and behaviour of our customers,

users, transactions, applications, servers, networks, factory machinery and so on.


• It's configuration data, data from APIs and message queues, change events, the
output of diagnostic commands and call detail records, sensor data from remote equipment
and more.
• Examples of machine data are web server logs, call detail records, network event logs
and telemetry.
Graph-based or Network Data
•Graphs are data structures to describe relationships and interactions between entities in
complex systems. In general, a graph contains a collection of entities called nodes and
another collection of interactions between a pair of nodes called edges.
• Nodes represent entities, which can be of any object type that is relevant to our
problem domain. By connecting nodes with edges, we will end up with a graph
(network) of nodes.
• A graph database stores nodes and relationships instead of tables or documents. Data is
stored just like we might sketch ideas on a whiteboard. Our data is stored without
restricting it to a predefined model, allowing a very flexible way of thinking about and
using it.
• Graph databases are used to store graph-based data and are queried with specialized
query languages such as SPARQL.
• Graph databases are capable of sophisticated fraud prevention. With graph databases,
we can use relationships to process financial and purchase transactions in near-real time.
With fast graph queries, we are able to detect that, for example, a potential purchaser is
using the same email address and credit card as included in a known fraud case.
Audio, Image and Video
• Audio, image and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out
to be challenging for computers.
• The terms audio and video commonly refer to the time-based media storage
format for sound/music and moving-picture information. Audio and video digital
recordings, also referred to as audio and video codecs, can be uncompressed, lossless
compressed or lossy compressed depending on the desired quality and use cases.
Streaming Data
Streaming data is data that is generated continuously by thousands of data sources,
which typically send in the data records simultaneously and in small sizes
(order of Kilobytes).
• Streaming data includes a wide variety of data such as log files generated by customers
using your mobile or web applications, ecommerce purchases, in-game player
activity, information from social networks, financial trading floors or geospatial services
and telemetry from connected devices or instrumentation in data centres.

Q3. Explore the various steps associated with data science process and explain any
three steps of it with suitable diagrams and example.? Nov/Dec2022
Data science process consists of six stages :

1. Discovery or Setting the research goal

2. Retrieving data

3. Data preparation

4. Data exploration

5. Data modelling

6. Presentation and automation

• Fig. 1.3.1 shows data science design process.

• Step 1: Discovery or Defining research goal
This step involves understanding the business problem, learning the relevant domain and
defining the research goal that the project is expected to answer.
• Step 2: Retrieving data
This step involves collecting the data required for the project from all the identified
internal and external sources. This is the process of gaining a
business understanding of the data you have and deciphering what each piece of data
means. This could entail determining exactly what data is required and the best
methods for obtaining it. This also entails determining what each of the data points
means in terms of the company. If we are given a data set from a client, for example,
we will need to know what each column and row represents.
• Step 3: Data preparation
Data can have many inconsistencies, such as missing values, blank columns or an incorrect
data format, which need to be cleaned. We need to process, explore and condition data
before modeling; clean data gives better predictions.
• Step 4: Data exploration
Data exploration is related to deeper understanding of data. Try to understand how
variables interact with each other, the distribution of the data and whether there are outliers.
To achieve this use descriptive statistics, visual techniques and simple modeling.
This steps is also called as Exploratory Data Analysis.
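For illustration only, a minimal pandas sketch of this exploration step (the file name sales.csv and its columns are hypothetical, not part of the syllabus answer):

    import pandas as pd

    # Hypothetical data set; replace with the project's own file.
    df = pd.read_csv("sales.csv")

    print(df.describe())       # summary statistics for the numeric columns
    print(df.dtypes)           # type of each variable
    print(df.isnull().sum())   # count of missing values per column
    df.hist(figsize=(8, 6))    # quick look at each distribution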
• Step 5: Data modeling
In this step, the actual model building process starts. Here, the data scientist splits the
dataset into training and testing sets. Techniques like association, classification and clustering
are applied to the training data set. The model, once prepared, is tested against the
"testing" dataset.
• Step 6: Presentation and automation
Deliver the final baselined model with reports, code and technical documents in this stage.
Model is deployed into a real-time production environment after thorough testing. In this
stage, the key findings are communicated to all stakeholders. This helps to decide if the
project results are a success or a failure based on the inputs from the model.
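Relating back to step 5 (data modeling), a hedged scikit-learn sketch of the train/test idea; the built-in iris data merely stands in for a project's own prepared data set:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)   # stand-in for the prepared data

    # Distribute the data set into training and testing parts.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)          # build the model on the training set
    pred = model.predict(X_test)         # test it against the "testing" set
    print("Accuracy:", accuracy_score(y_test, pred))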

Q.4 Explain Defining Research Goals and Retrieving Data.


• To understand the project, three concepts must be understood: what, why and how.
a) What does the company or organization expect?
b) Why does the company's higher authority assign such value to the research?
c) How is it part of a bigger strategic picture?
• The goal of the first phase is to answer these three questions.
• In this phase, the data science team must learn and investigate the problem, develop
context and understanding and learn about the data sources needed and available for the
project.
1. Learning the business domain :
• Understanding the domain area of the problem is essential. In many cases, data scientists
will have deep computational and quantitative knowledge that can be broadly applied
across many disciplines.
• Data scientists have deep knowledge of the methods, techniques and ways for
applying heuristics to a variety of business and conceptual problems.
2. Resources :
• As part of the discovery phase, the team needs to assess the resources available to
support the project. In this context, resources include technology, tools, systems,
data and people.
3. Frame the problem :
• Each team member may hear slightly different things related to the needs and the problem
and have somewhat different ideas of possible solutions.
4. Identifying key stakeholders:
• The team can identify the success criteria, key risks and stakeholders, which should
include anyone who will benefit from the project or will be significantly impacted by the
project.
• When interviewing stakeholders, learn about the domain area and any relevant history
from similar analytics projects.
5. Interviewing the analytics sponsor:
• The team should plan to collaborate with the stakeholders to clarify and frame the
analytics problem.
• At the outset, project sponsors may have a predetermined solution that may not
necessarily realize the desired outcome.
• In these cases, the team must use its knowledge and expertise to identify the true
underlying problem and appropriate solution.
• When interviewing the main stakeholders, the team needs to take time to thoroughly
interview the project sponsor, who tends to be the one funding the project or providing
the high-level requirements.
• This person understands the problem and usually has an idea of a potential working
solution.
6. Developing initial hypotheses:
• This step involves forming ideas that the team can test with data. Generally, it is best to
come up with a few primary hypotheses to test and then be creative about developing
several more.
• These Initial Hypotheses form the basis of the analytical tests the team will use in later
phases and serve as the foundation for the findings in phase.
7. Identifying potential data sources:
• Consider the volume, type and time span of the data needed to test the hypotheses.
Ensure that the team can access more than simply aggregated data. In most cases,
the team will need the raw data to avoid introducing bias for the downstream analysis.

ii) Retrieving Data (Nov/Dec 2023)


• Retrieving the required data is the second phase of a data science project. Sometimes data scientists
need to go into the field and design a data collection process. Many companies will have
already collected and stored the data and what they don't have can often be bought from
third parties.
• Most high-quality data is freely available for public and commercial use. Data
can be stored in various formats, such as text files and tables in a database.
Data may be internal or external.

1. Start working on internal data, i.e. data stored within the company
• The first step for data scientists is to verify the internal data. Assess the relevance and quality
of the data that is readily available in the company. Most companies have a program for maintaining
key data, so much of the cleaning work may already be done. This data can be stored in
official data repositories such as databases, data marts, data warehouses and
data lakes maintained by a team of IT professionals.

Data repository can be used to describe several ways to collect and store data:
a) Data warehouse is a large data repository that aggregates data usually from multiple
sources or segments of a business, without the data being necessarily related.
b) Data lake is a large data repository that stores unstructured data that is classified and
tagged with metadata.
c) Data marts are subsets of the data repository. These data marts are more targeted to
what the data user needs and easier to use.
d) Metadata repositories store data about data and databases.
The metadata explains where the data came from, how it was captured and what it
represents.
e) Data cubes are lists of data with three or more dimensions stored as a table.

Advantages of data repositories:


i. Data is preserved and archived.
ii. Data isolation allows for easier and faster data reporting.
iii. Database administrators have easier time tracking problems.
iv. There is value to storing and analyzing data.
Disadvantages of data repositories :
i. Growing data sets could slow down systems.
ii. A system crash could affect all the data.
iii. Unauthorized users can access all sensitive data more easily than if it was
distributed across several locations.

Q.5Explain in details about Data cleansing, integrating, transforming data


and build a model? April/may2022&2023
• Data preparation means cleansing, integrating and transforming data.
Data Cleaning
• Data is cleansed through processes such as filling in missing values, smoothing the
noisy data or resolving the inconsistencies in the data.


• Data cleaning tasks are as follows:
1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric

5. Identify outliers and smooth out noisy data


6. Correct inconsistent data
• Data cleaning is the first step in data pre-processing; it is used to find missing
values, smooth noisy data, recognize outliers and correct inconsistencies.
• Missing values: Such dirty data affect the mining procedure and lead to
unreliable and poor output, so some data cleaning routines are important.
For example, suppose that the average salary of staff is Rs. 65000/-. Use this value to
replace the missing value for salary.
• Data entry errors: Data collection and data entry are error-prone processes. They often
require human intervention and because humans are only human, they make typos or
lose their concentration for a second and introduce an error into the chain.
But data collected by machines or computers isn't free from errors either. Errors can
arise from human sloppiness, whereas others are due to machine or hardware failure.
Examples of errors originating from machines are transmission errors or bugs in the
extract, transform and load phase (ETL).
• Whitespace error: Whitespaces tend to be hard to detect but cause errors like other
redundant characters would. To remove the spaces present at start and end of the string,
we can use strip() function on the string in Python.
• Fixing capital letter mismatches: Capital letter mismatches are a common problem.
Most programming languages make a distinction between "Chennai" and "chennai".
• Python provides string conversion like to convert a string to lowercase, uppercase using
lower(), upper().
• The lower() Function in python converts the input string to lowercase. The upper()
Function in python converts the input string to uppercase.
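A small Python sketch of these cleaning steps (the city values are made up for illustration):

    # Removing whitespace at the start and end of a string.
    city = "  Chennai  "
    clean_city = city.strip()        # "Chennai"

    # Fixing capital letter mismatches.
    print(clean_city.lower())        # "chennai"
    print(clean_city.upper())        # "CHENNAI"

    # The same operations applied to a whole pandas column (column name is hypothetical).
    import pandas as pd
    df = pd.DataFrame({"city": [" Chennai", "chennai ", "CHENNAI"]})
    df["city"] = df["city"].str.strip().str.lower()
    print(df["city"].unique())       # ['chennai']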
Outlier
• Outlier detection is the process of detecting and subsequently excluding outliers from a
given set of data. The easiest way to find outliers is to use a plot or a table with the
minimum and maximum values.
• Fig. 1.6.1 shows outliers detection. Here O1 and O2 seem outliers from the rest.

• An outlier may be defined as a piece of data or observation that deviates drastically


from the given norm or average of the data set. An outlier may be caused simply by chance,
but it may also indicate measurement error or that the given data set has a heavy-
tailed distribution.
• The general idea is to find data which deviate from the normal behaviour of the
data set.
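An illustrative sketch of the minimum/maximum (and interquartile range) approach to spotting outliers; the values are made up:

    import pandas as pd

    values = pd.Series([52, 56, 53, 57, 55, 54, 300])   # 300 looks suspicious

    print(values.min(), values.max())   # a min/max table already reveals the outlier

    # A common rule of thumb: flag points farther than 1.5 * IQR from the quartiles.
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
    print(outliers)                     # 300 is flagged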
Dealing with Missing Value
• Such dirty data affect the mining procedure and lead to unreliable and poor
output, so some data cleaning routines are important.
How are missing values handled in data mining?
• The following methods are used for handling missing values (a short pandas sketch follows the list):
1. Ignore the tuple: Usually done when the class label is missing. This method is not
good unless the tuple contains several attributes with missing values.
2. Fill in the missing value manually: This is time-consuming and not suitable for a
large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute
values by the same constant.
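A short pandas sketch of these options; the salary column and the Rs. 65000 constant follow the example above and are only illustrative:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"name": ["A", "B", "C"],
                       "salary": [60000, np.nan, 70000]})

    # 1. Ignore (drop) the tuples that contain missing values.
    dropped = df.dropna()

    # 2./3. Fill in the missing value with a global constant, e.g. the average salary.
    constant_filled = df.fillna({"salary": 65000})

    # Or fill with a value computed from the data itself, such as the column mean.
    mean_filled = df.fillna({"salary": df["salary"].mean()})
    print(mean_filled)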
Correct Errors as Early as Possible
• If an error is not corrected in an early stage of the project, it creates problems in later
stages; most of the time is then spent on finding and correcting errors. Retrieving data is a difficult
task and organizations spend millions of dollars on it in the hope of making better
decisions.
The data collection process is error-prone and in a big organization it involves many steps
and teams.
• Data should be cleansed when acquired for many reasons:
a) Not everyone spots the data anomalies. Decision-makers may make costly
mistakes on information based on incorrect data from applications that fail to
correct for the faulty data.


b) If errors are not corrected early on in the process, the cleansing will have to be done
for every project that uses that data.

c) Data errors may point to a business process that isn't working as designed.

d) Data errors may point to defective equipment, such as broken transmission lines and

defective sensors.

e) Data errors can point to bugs in software or in the integration of software that
may be critical to the company

Combining Data from Different Data Sources


1. Joining table
• Joining tables allows the user to combine the information about one observation found in
one table with the information found in another table. The focus is on enriching a
single observation.
• A primary key is a value that cannot be duplicated within a table. This means that one
value can only be seen once within the primary key column. That same key can exist as a
foreign key in another table which creates the relationship. A foreign key can have
duplicate instances within a table.
• Fig. 1.6.2 shows Joining two tables on the Country ID and Country Name keys.
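A minimal pandas sketch of such a join; the country table mirrors the figure, and the exact column names are assumptions:

    import pandas as pd

    countries = pd.DataFrame({"country_id": [1, 2],
                              "country_name": ["India", "Japan"]})   # country_id is the primary key
    sales = pd.DataFrame({"country_id": [1, 1, 2],
                          "amount": [100, 250, 175]})                # country_id is a foreign key here

    # Enrich every sales observation with the matching country information.
    joined = sales.merge(countries, on="country_id", how="left")
    print(joined)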

2. Appending tables

• Appending tables is also called stacking tables. It effectively adds the observations of one
table to another. Fig. 1.6.3 shows appending tables.

• Table 1 contains an x3 value of 3 and Table 2 contains an x3 value of 33. The result of
appending these tables is a larger one with the observations from Table 1 as well as Table 2.
The equivalent operation in set theory would be the union and this is also the command in
SQL, the common language of relational databases. Other set operators are also used
in data science, such as set difference and intersection.
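A hedged sketch of appending (stacking) two tables with pandas, following the Table 1 / Table 2 example above:

    import pandas as pd

    table1 = pd.DataFrame({"x1": [1], "x2": [2], "x3": [3]})
    table2 = pd.DataFrame({"x1": [11], "x2": [22], "x3": [33]})

    # Stack the observations of Table 2 underneath Table 1 (like UNION ALL in SQL).
    appended = pd.concat([table1, table2], ignore_index=True)
    print(appended)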
3. Using views to simulate data joins and appends

• Duplication of data is avoided by using a view instead of a physical append, because an
appended table requires more storage space. If the table size is in terabytes, it becomes
problematic to duplicate the data; for this reason, the concept of a view was invented.
• Fig. 1.6.4 shows how the sales data from the different months is combined virtually
into a yearly sales table instead of duplicating the data.

Transforming Data
• In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Relationships between an input variable and an output variable
aren't always linear.
• Reducing the number of variables: Having too many variables in the model makes the
model difficult to handle and certain techniques don't perform well when user overload
them with too many input variables.
• All the techniques based on a Euclidean distance perform well only up to 10 variables.
Data scientists use special methods to reduce the number of variables but retain the
maximum amount of data.
Euclidean distance :
• Euclidean distance is used to measure the similarity between observations. It is calculated
as the square root of the sum of differences between each point.
Euclidean distance = √((X1 − X2)² + (Y1 − Y2)²)
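A short NumPy sketch of this formula with two made-up points:

    import numpy as np

    p1 = np.array([1.0, 2.0])   # (X1, Y1)
    p2 = np.array([4.0, 6.0])   # (X2, Y2)

    # Square root of the sum of squared coordinate differences.
    distance = np.sqrt(np.sum((p1 - p2) ** 2))
    print(distance)             # 5.0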
Turning variable into dummies :
• Variables can be turned into dummy variables. Dummy variables can only take two
values: true (1) or false (0). They're used to indicate the absence or presence of a categorical
effect that may explain the observation.
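An illustrative pandas sketch of turning a categorical variable into dummy (0/1) variables; the city column is made up:

    import pandas as pd

    df = pd.DataFrame({"city": ["Chennai", "Mumbai", "Chennai"]})

    # Each category becomes a 0/1 column marking its presence or absence.
    dummies = pd.get_dummies(df["city"], prefix="city", dtype=int)
    print(dummies)
    #    city_Chennai  city_Mumbai
    # 0             1            0
    # 1             0            1
    # 2             1            0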

Q.6 Explain in detail Data Mining? (OR) Explain the Data Analytics life cycle and
brief about Time-Series Analysis. April/may2024
• Data mining refers to extracting or mining knowledge from large amounts of data.
It is a process of discovering interesting patterns or Knowledge from a large amount
of data stored either in databases, data warehouses or other information repositories.
Reasons for using data mining:
1. Knowledge discovery: To identify the invisible correlation, patterns in the database.
2. Data visualization: To find sensible way of displaying data.
3. Data correction: To identify and correct incomplete and inconsistent data.
Functions of Data Mining
• Different functions of data mining are characterization, association and correlation
analysis, classification, prediction, clustering analysis and evolution analysis.
1. Characterization is a summarization of the general characteristics or features of a
target class of data. For example, the characteristics of students can be summarized,
generating a profile of all first-year engineering students at a university.
2. Association is the discovery of association rules showing attribute-value conditions

that occur frequently together in a given set of data.


3. Classification differs from prediction. Classification constructs a set of models that
describe and distinguish data classes and prediction builds a model to predict some
missing data values.
4. Clustering can also support taxonomy formation. The organization of observations
into a hierarchy of classes that group similar events together.

Predictive Mining Tasks


• Predictive mining tasks perform inference on the current data in order to make predictions.
Predictive analysis answers queries about the future by using historical data as the
chief basis for decisions.
• It involves the supervised learning functions used for the prediction of the target value.
The methods that fall under this mining category are classification, time-series analysis
and regression.
• Data modeling is the necessity of the predictive analysis, which works by utilizing
some variables to anticipate the unknown future data values for other variables.
• It provides organizations with actionable insights based on data. It provides an
estimation regarding the likelihood of a future outcome.
• To do this, a variety of techniques are used, such as machine learning, data mining,
modelling and game theory.
• Predictive modelling can, for example, help to identify any risks or opportunities in
the future.

Descriptive Mining Task


• Descriptive analytics, the conventional form of business intelligence and data analysis,
seeks to provide a depiction or "summary view" of facts and figures in an understandable
format, to either inform or prepare data for further analysis.
• Two primary techniques are used for reporting past events : data aggregation and data
mining.
• It presents past data in an easily digestible format for the benefit of a wide business
audience.
• A set of techniques for reviewing and examining the data set to understand the data and
analyze business performance.
• Descriptive analytics helps organisations to understand what happened in the past.
It helps to understand the relationship between product and customers.
Architecture of a Typical Data Mining System
• Data mining refers to extracting or mining knowledge from large amounts of data.
It is a process of discovering interesting patterns or knowledge from a large amount of data
stored either in databases, data warehouses.
• It is the computational process of discovering patterns in huge data sets involving
methods at the intersection of AI, machine learning, statistics and database systems.
• Fig. 1.10.1 (See on next page) shows typical architecture of data mining system.
• Components of data mining system are data source, data warehouse server, data mining
engine, pattern evaluation module, graphical user interface and knowledge base.
• Database, data warehouse, WWW or other information repository: This is a set of databases,
data warehouses, spreadsheets or other kinds of data repositories. Data cleaning and
data integration techniques may be applied to the data.

• Data warehouse server: Based on the user's data request, the data warehouse server is
responsible for fetching the relevant data.

Classification of DM System
• Data mining system can be categorized according to various parameters. These are
database technology, machine learning, statistics, information science, visualization
and other disciplines.
• Fig. 1.10.2 shows classification of DM system.

• Multi-dimensional view of data mining classification.

Q 7. What is a data warehouse? Outline the architecture of a data warehouse


with a diagram? April/May2023

• Data warehousing is the process of constructing and using a data warehouse. A data
warehouse is constructed by integrating data from multiple heterogeneous sources that
support analytical reporting, structured and/or ad hoc queries and decision making.
Data warehousing involves data cleaning, data integration and data consolidations.
• A data warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management's decision-making process.
A data warehouse stores historical data for purposes of decision support.

• A database is an application-oriented collection of data that is organized, structured,


coherent, with minimum and controlled redundancy, which may be accessed by several
users in due time.


• Data warehousing provides architectures and tools for business executives to
systematically organize, understand and use their data to make strategic decisions.
• A data warehouse is a subject-oriented collection of data that is integrated, time-variant,
non-volatile, which may be used to support the decision-making.
• Databases and data warehouses are related but not the same.

• A database is a way to record and access information from a single source. A database is
often handling real-time data to support day-to-day business processes like
transaction processing.
• A data warehouse is a way to store historical information from multiple sources to
allow you to analyse and report on related data (e.g., your sales transaction data, mobile
app data and CRM data). Unlike a database, the information isn't updated in real-time and
is better for data analysis of broader trends.
• Modern data warehouses are moving toward an Extract, Load, Transform
(ELT) architecture in which all or most data transformation is performed on the
database that hosts the data warehouse.
• Goals of data warehousing:
1. To help reporting as well as analysis.
2. Maintain the organization's historical information.
3. Be the foundation for decision making.
Characteristics of Data Warehouse
1. Subject-oriented: Data are organized based on how the users refer to them. A data
warehouse can be used to analyse a particular subject area. For example, "sales"
can be a particular subject.
2. Integrated: All inconsistencies regarding naming convention and value representations
are removed. For example, source A and source B may have different ways of
identifying a product, but in a data warehouse, there will be only a single way of
identifying a product.
3. Non-volatile: Data are stored in read-only format and do not change over time.
Typical activities such as deletes, inserts and changes that are performed in an
operational application environment are completely non-existent in a DW environment.
Key characteristics of a Data Warehouse
1. Data is structured for simplicity of access and high-speed query performance.
2. End users are time-sensitive and desire speed-of-thought response times.
3. Large amounts of historical data are used.
4. Queries often retrieve large amounts of data, perhaps many thousands of rows.
5. Both predefined and ad hoc queries are common.
6. The data load involves multiple sources and transformations.

Multitier Architecture of Data Warehouse


• Data warehouse architecture is the design of an organization's data storage framework.
A data warehouse architecture takes information from raw sets of data and stores
it in a structured and easily digestible format.
• A data warehouse system can be constructed in three ways. These approaches are
classified by the number of tiers in the architecture.
a) Single-tier architecture.
b) Two-tier architecture.
c) Three-tier architecture (Multi-tier architecture).
• Single-tier warehouse architecture focuses on creating a compact data set and minimizing
the amount of data stored. While it is useful for removing redundancies, it is not
effective for organizations with large data needs and multiple streams.
• Two-tier warehouse structures separate the physically available resources from the
warehouse itself. This is most commonly used in small organizations where a server is
used as a data mart. While it is more effective at storing and sorting data, a two-tier
architecture is not scalable and it supports only a minimal number of end users.
Three tier (Multi-tier) architecture:
• Three tier architecture creates a more structured flow for data from raw sets to
actionable insights. It is the most widely used architecture for data warehouse systems.
• Fig. 1.11.1 shows the three-tier architecture. Three-tier architecture is sometimes called
multi-tier architecture.
• The bottom tier is the database of the warehouse, where the cleansed and transformed
data is loaded. The bottom tier is a warehouse database server.

Needs of Data Warehouse


1) Business user: Business users require a data warehouse to view summarized data
from the past. Since these people are non-technical, the data may be presented to
them in an elementary form.
2) Store historical data: Data warehouse is required to store the time variable data from
the past. This input is made to be used for various purposes.
3) Make strategic decisions: Some strategies may be depending upon the data in
the data warehouse. So, data warehouse contributes to making strategic decisions.
Benefits of Data Warehouse
a) Understand business trends and make better forecasting decisions.
b) Data warehouses are designed to perform well enormous amounts of data.
c) The structure of data warehouses is more accessible for end-users to navigate,
understand and query.
d) Queries that would be complex in many normalized databases could be easier to
build and maintain in data warehouses.
Why is metadata necessary in a data warehouse ?

a) First, it acts as the glue that links all parts of the data warehouses.

b) Next, it provides information about the contents and structures to the developers.

c) Finally, it opens the doors to the end-users and makes the contents recognizable in their

terms.

• Fig. 1.11.2 shows warehouse metadata.

Difference between ODS and Data Warehouse

UNIT II
DESCRIBING DATA

 Types of Data
 Types of Variables
 Describing Data with Tables and Graphs
 Describing Data with Averages
 Describing Variability
 Normal Distributions and Standard (z) Score

LIST OF IMPORTANT QUESTIONS


PART – A
1. Write the three types of data.
2. What is Random sampling ?
3. What do you mean by Descriptive Statistics ?
4. Write the difference between Parameter and Statistic.
5. Write the Statistics and its types.
6. What is Descriptive Statistics ?
7. What is Inferential Statistics ?
8. What do you mean by Mean?
9. What do you mean by Median ?
10. What do you mean by Mode?
11. What do you mean by Variance?
12. What do you mean by Standard Deviation?
13. What do you mean by Range?
14. What is Probability distributions ?
15. What do you mean by Graphical representations?
16. What is relative frequency distribution?
17. How to form cumulative frequency distribution?
18.What is a percentile score indicating?
19. What is a histogram?
20. What is the use of frequency polygon?

PART – B
1. Describe types of Variables? (Nov/Dec2022 &13mark)
2. The IQ scores for a group of 35 school dropouts are as follows:

a) Construct a frequency distribution for grouped data.


b) Specify the real limits for the lowest class interval in this frequency
distribution?( Nov/Dec 2023 &13mark)

3. Given below are the weekly pocket expenses (in Rupees) of a group of 25
students selected at random.
37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37, 49, 27, 37, 33, 38, 49, 45,
44, 37, 36 Construct a grouped frequency distribution table with class
intervals of equal widths, starting from 25-30, 30-35 and so on. Also, find the
range of weekly pocket expenses? (April/May2022&13mark)

4. Explain Misleading Graph? (April/May2023 &13 mark)

5. Construct a frequency distribution for the number of different residences


occupied by graduating seniors during their college career, namely: 1, 4, 2, 3,
3, 1, 6, 7, 4, 3, 3, 9, 2, 4, 2, 2, 3, 2, 3, 4, 4, 2, 3, 3, 5. What is the shape of this
distribution? (April/May 2022&13mark)

6. The heights of animals are: 600 mm, 470 mm, 170 mm, 430 mm and 300
mm. Find out the mean, the variance and the standard deviation?
(Nov/Dec 2022/2023)

7. Normal Distributions and Standard (z) Scores? April/May2024

PART – A

1. Write the three types of data.


Generally, qualitative data consist of words (Yes or No), letters (Y or N), or
numerical codes (0 or 1) that represent a class or category. Ranked data consist of
numbers (1st, 2nd, . . . 40th place) that represent relative standing within a group.
Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs) that
represent an amount or a count.

2. What is Random sampling ?


Random sampling is a procedure designed to ensure that each potential
observation in the population has an equal chance of being selected in a survey.
Classic examples of random samples are a state lottery where each number from
1 to 99 in the population has an equal chance of being selected as one of the five
winning numbers or a nation-wide opinion survey in which each telephone
number has an equal chance of being selected as a result of a series of random
selections, beginning with a three-digit area code and ending with a specific
seven-digit telephone number.

3. What do you mean by Descriptive Statistics ?


Statistics exists because of the prevalence of variability in the real world.
In its simplest form, known as descriptive statistics, statistics provides us with
tools (graphs, averages, ranges and correlations) for organizing and summarizing the
inevitable variability in collections of actual observations or scores.
Examples:
• A graph showing the annual change in global temperature during the last 30 years.
• A report that describes the average difference in grade point average (GPA) between
college students who regularly drink alcoholic beverages and those who don't.

4. Write the difference between Parameter and Statistic.


In our day-to-day work, we keep speaking about the population and the sample, so it is
very important to know the terminology used to represent the population and the
sample. A parameter is a number that describes the data from the population, and a
statistic is a number that describes the data from a sample.

5. Write the Statistics and its types.


Statistics:
Statistics is the most critical unit of Data Science basics, and it is the
method or science of collecting and analyzing numerical data in large quantities
to get useful insights. The Wikipedia definition of Statistics states that “it is a
discipline that concerns the collection, organization, analysis, interpretation, and
presentation of data.”
It means, as part of statistical analysis, we collect, organize, and draw
meaningful insights from the data either through visualizations or mathematical
explanations.
Statistics is broadly categorized into two types:

1. Descriptive Statistics
2. Inferential Statistics

6. What is Descriptive Statistics ?

In descriptive statistics, we describe the data using the mean, standard deviation, charts, or
probability distributions. Basically, as part of descriptive statistics, we measure the
following:

1. Frequency: no. of times a data point occurs


2. Central tendency: the centrality of the data – mean, median, and mode
3. Dispersion: the spread of the data – range, variance, and standard deviation
4. The measure of position: percentiles and quantile ranks
7. What is Inferential Statistics ?
In Inferential statistics, we estimate the population parameters. Or we run
Hypothesis testing to assess the assumptions made about the population
parameters.

In simple terms, we interpret the meaning of the descriptive statistics by inferring


them to the population.
For example, we are conducting a survey on the number of two-wheelers in a
city. Assume the city has a total population of 5L people. So, we take a sample of
1000 people as it is impossible to run an analysis on entire population data.
From the survey conducted, it is found that 800 people out of 1000 (800 out of
1000 is 80%) are two-wheelers. So, we can infer these results to the population
and conclude that 4L people out of the 5L population are two-wheelers.
8. What do you mean by Mean?
It is the sum of all the data points divided by the total number of values in
the data set. Mean cannot always be relied upon because it is influenced by
outliers.
9. What do you mean by Median ?
It is the middlemost value of a sorted/ordered dataset. If the size of the
dataset is even, then the median is calculated by taking the average of the two
middle values.
10. What do you mean by Mode?
It is the most repeated value in the dataset. Data with a single mode is
called unimodal, data with two modes is called bimodal, and data with more than
two modes is called multimodal.
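For illustration, a small Python sketch computing all three averages on made-up data:

    from statistics import mean, median, mode

    data = [2, 4, 4, 5, 7, 9]

    print(mean(data))    # 5.166... : sum of the points divided by their count
    print(median(data))  # 4.5      : average of the two middle values (even-sized data)
    print(mode(data))    # 4        : the most repeated value (unimodal data)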

11. What do you mean by Variance?


It is the average squared distance of all the data points from their mean. The
problem with variance is that the units also get squared.

12. What do you mean by Standard Deviation?


It is the square root of the variance; it helps in retrieving the original units.

13. What do you mean by Range?


It is the difference between the maximum and the minimum values of a dataset.
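A short NumPy sketch of variance, standard deviation and range, using the animal-height data from Part B question 6:

    import numpy as np

    data = np.array([600, 470, 170, 430, 300])   # heights in mm

    variance = data.var()                  # average squared distance from the mean: 21704.0
    std_dev = data.std()                   # square root of the variance: about 147.32 mm
    value_range = data.max() - data.min()  # maximum minus minimum: 430

    print(variance, std_dev, value_range)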

14. What is Probability distributions ?


In statistical terms, a distribution function is a mathematical expression that
describes the probability of different possible outcomes for an experiment.

15. What do you mean by Graphical representations?


Graphical representation refers to the use of charts or graphs to visualize, analyse
and interpret numerical data. For a single variable (Univariate analysis), we have
a bar plot, line plot, frequency plot, dot plot, boxplot, and the Normal Q-Q plot.
We will be discussing the Boxplot and the Normal Q-Q plot.

16. What is relative frequency distribution?


A relative frequency distribution shows the proportion of the total number of
observations associated with each value or class of values and is related to a
probability distribution, which is extensively used in statistics

17. How to form cumulative frequency distribution?


The cumulative frequency can be calculated by adding the frequency of
the first class interval to the frequency of the second class interval. After
that, the sum is added to the frequency of the third class interval, etc.
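For illustration, a short sketch in Python (pandas) of the same running-total idea, using made-up class intervals and frequencies rather than any data from this question:

import pandas as pd

# hypothetical frequency table (made-up class intervals and counts)
freq = pd.DataFrame({'class_interval': ['0-10', '10-20', '20-30', '30-40'],
                     'frequency': [5, 8, 12, 3]})
# cumulative frequency = running total of the frequencies
freq['cumulative_frequency'] = freq['frequency'].cumsum()
print(freq)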

18. What is a percentile score indicating?


A percentile is a comparison score between a particular score and the scores
of the rest of a group. It shows the percentage of scores that a particular score
surpassed. For example, if we score 75 points on a test, and are ranked in the
85th percentile, it means that the score 75 is higher than 85% of the scores.

19. What is a histogram?


A histogram is a chart that plots the distribution of a numeric
variable's values as a series of bars. Each bar typically covers a range of
numeric values called a bin or class; a bar's height indicates the frequency of
data points with a value within the corresponding bin.

20. What is the use of frequency polygon?


A frequency polygon is a type of line graph in which line segments join
the midpoints of all the class intervals. The shape of the resulting line helps
in presenting the data accurately. Both line graphs and frequency polygons are
widely used when data needs to be compared.


PART-B
Q.1 Describe types of Variables?(Nov/Dec2022 13mark)
Variable is a characteristic or property that can take on different values.
Discrete and Continuous Variables
Discrete variables:
• Quantitative variables can be further distinguished in terms of whether they are
discrete or continuous.
• The word discrete means countable. For example, the number of students in a
class is countable or discrete. The value could be 2, 24, 34 or 135 students, but it
cannot be 23/32 or 12.23 students.
• Number of pages in a book is a discrete variable. Discrete data can only take
on certain individual values.
Continuous variables:
• Continuous variables are variables which can take all values within a given
interval or range. A continuous variable consists of numbers whose values, at
least in theory, have no restrictions.
• Examples of continuous variables are blood pressure, weight, height and income.
• Continuous data can take on any value in a certain range. Length of a file is a
continuous variable.

Difference between Discrete variables and Continuous variables


Approximate Numbers
• Approximate number is defined as a number approximated to the exact number
and there is always a difference between the exact and approximate numbers.
• For example, 2, 4, 9 are exact numbers as they do not need any approximation.
• But √2, π, √3 are approximate numbers as they cannot be expressed exactly by a
finite number of digits. They can be written as 1.414, 3.1416, 1.7320 etc., which are only
approximations to the true values.
• Whenever values are rounded off, as is always the case with actual values for
continuous variables, the resulting numbers are approximate, never exact.
• An approximate number is one that does have uncertainty. A number can be
approximate for one of two reasons:
a) The number can be the result of a measurement.
b) Certain numbers simply cannot be written exactly in decimal form. Many
fractions and all irrational numbers fall into this category

The two main variables in an experiment are the independent and dependent
variable.

An experiment is a study in which the investigator decides who receives the


special treatment.
1. Independent variables
• An independent variable is the variable that is changed or controlled in a
scientific experiment to test the effects on the dependent variable.
• An independent variable is a variable that represents a quantity that is being
manipulated in an experiment.
• The independent variable is the one that the researcher intentionally changes or
controls.
• In an experiment, an independent variable is the treatment manipulated by the
investigator. Mostly in mathematical equations, independent variables are
denoted by 'x'.
• Independent variables are also termed as "explanatory variables," "manipulated
variables," or "controlled variables." In a graph, the independent variable is
usually plotted on the X-axis.
2. Dependent variables
• A dependent variable is the variable being tested and measured in a scientific
experiment.
• The dependent variable is 'dependent' on the independent variable. As the
experimenter changes the independent variable, the effect on the dependent
variable is observed and recorded.
• The dependent variable is the factor that the research measures. It changes in
response to the independent variable or depends upon it.
• A dependent variable represents a quantity whose value depends on how the
independent variable is manipulated.
• Mostly in mathematical equations, dependent variables are denoted by 'y'.
• Dependent variables are also termed as "measured variable," the "responding
variable," or the "explained variable". In a graph, dependent variables are usually
plotted on the Y-axis.
• When a variable is believed to have been influenced by the independent
variable, it is called a dependent variable. In an experimental setting, the
dependent variable is measured, counted or recorded by the investigator.


Observational Study
• An observational study focuses on detecting relationships between variables not
manipulated by the investigator. An observational study is used to answer a
research question based purely on what the researcher observes. There is no
interference or manipulation of the research subjects and no control and treatment
groups.
• These studies are often qualitative in nature and can be used for both
exploratory and explanatory research purposes. While quantitative observational
studies exist, they are less common.
• Observational studies are generally used in hard science, medical and social
science fields. This is often due to ethical or practical concerns that prevent the
researcher from conducting a traditional experiment. However, the lack of control
and treatment groups means that forming inferences is difficult and there is a risk
of confounding variables impacting user analysis.
Confounding Variable
• Confounding variables are those that affect other variables in a way that
produces spurious or distorted associations between two variables. They
confound the "true" relationship between two variables. Confounding refers to
differences in outcomes that occur because of differences in the baseline risks of
the comparison groups.
• For example, if we have an association between two variables (X and Y) and
that association is due entirely to the fact that both X and Y are affected by a third
variable (Z), then we would say that the association between X and Y is spurious
and that it is a result of the effect of a confounding variable (Z).
• A difference between groups might be due not to the independent variable but to
a confounding variable.
• For a variable to be confounding:
a) It must have connected with independent variables of interest and
b) It must be connected to the outcome or dependent variable directly.
• Consider the example, in order to conduct research that has the objective that
alcohol drinkers can have more heart disease than non-alcohol drinkers such that
they can be influenced by another factor. For instance, alcohol drinkers might
consume cigarettes more than non drinkers that act as a confounding variable
(consuming cigarettes in this case) to study an association amidst drinking
alcohol and heart disease.

• For example, suppose a researcher collects data on ice cream sales and shark
attacks and finds that the two variables are highly correlated. Does this mean that
increased ice cream sales cause more shark attacks? That's unlikely. The more
likely cause is the confounding variable temperature. When it is warmer outside,
more people buy ice cream and more people go in the ocean.

2) The IQ scores for a group of 35 school dropouts are as follows:

a) Construct a frequency distribution for grouped data.

b) Specify the real limits for the lowest class interval in this frequency
distribution?(NOV/DEC2023 13MARK)

• Solution: Calculating the class width

(123 - 69) / 10 = 54 / 10 = 5.4 ≈ 5

a) Frequency distribution for grouped data


b) Real limits for the lowest class interval in this frequency distribution = 64.5 - 69.5.

Q3. Given below are the weekly pocket expenses (in Rupees) of a
group of 25 students selected at random.
37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37, 49, 27, 37, 33, 38,
49, 45, 44, 37, 36
Construct a grouped frequency distribution table with class intervals
of equal widths, starting from 25-30, 30-35 and so on. Also, find the
range of weekly pocket expenses.
Solution:

• In the given data, the smallest value is 26 and the largest value is 49. So,
the range of the weekly pocket expenses = 49-26=23.
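As an illustrative sketch (not part of the original answer), the same grouped frequency table and range can be produced in Python with pandas, assuming the class intervals 25-30, 30-35, ..., 45-50 used above:

import pandas as pd

expenses = [37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37,
            49, 27, 37, 33, 38, 49, 45, 44, 37, 36]
bins = [25, 30, 35, 40, 45, 50]   # class intervals 25-30, 30-35, ..., 45-50
table = pd.cut(pd.Series(expenses), bins=bins, right=False).value_counts().sort_index()
print(table)                                      # frequency of each class interval
print("Range =", max(expenses) - min(expenses))   # 49 - 26 = 23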


Relative and Cumulative Frequency Distribution


• Relative frequency distributions show the frequency of each class as a part or
fraction of the total frequency for the entire distribution. Frequency distributions
can show either the actual number of observations falling in each range or the
percentage of observations. In the latter instance, the distribution is called a
relative frequency distribution.

Example: Suppose we take a sample of 200 Indian families and record the
number of people living there. We obtain the following:

Cumulative frequency:

• A cumulative frequency distribution can be useful for ordered data (e.g. data
arranged in intervals, measurement data, etc.). Instead of reporting frequencies,
the recorded values are the sum of all frequencies for values less than and
including the current value.


• Example: Suppose we take a sample of 200 Indian families and record the
number of people living there. We obtain the following:

4.Explain Misleading Graph? April/May2023

• It is a well known fact that statistics can be misleading. They are often used to
prove a point and can easily be twisted in favour of that point.

• Good graphs are extremely powerful tools for displaying large quantities of
complex data; they help turn the realms of information available today into
knowledge. But, unfortunately, some graphs deceive or mislead.

• This may happen because the designer chooses to give readers the impression of
better performance or results than is actually the situation. In other cases, the
person who prepares the graph may want to be accurate and honest, but may
mislead the reader by a poor choice of a graph form or poor graph construction.

• The following things are important to consider when looking at a graph:


1. Title

2. Labels on both axes of a line or bar chart and on all sections of a pie chart

3. Source of the data

4. Key to a pictograph

5. Uniform size of a symbol in a pictograph

6. Scale: Does it start with zero? If not, is there a break shown

7. Scale: Are the numbers equally spaced?

• A graph can be altered by changing the scale of the graph. For example, data in
the two graphs of Fig. 2.6.1 are identical, but scaling of the Y-axis changes the
impression of the magnitude of differences.

5. Construct a frequency distribution for the number of different residences


occupied by graduating seniors during their college career, namely: 1, 4, 2, 3,
3, 1, 6, 7, 4, 3, 3, 9, 2, 4, 2, 2, 3, 2, 3, 4, 4, 2, 3, 3, 5. What is the shape of this
distribution? April/May 2022

Solution:

Normal distribution: The normal distribution is one of the most commonly


encountered types of data distribution, especially in social sciences. Due to its
bell-like shape, the normal distribution is also referred to as the bell curve.


Histogram of given data:


6. Describing Data with Averages? Nov/Dec2022&2023

1. Mean :

• The mean of a data set is the average of all the data values. The sample mean x
is the point estimator of the population mean μ.

Sample mean x̄ = (Sum of the values of the n observations) / (Number of observations in the sample)

Population mean μ = (Sum of the values of the N observations) / (Number of observations in the population)

2. Median :

• The median of a data set is the value in the middle when the data items are
arranged in ascending order. Whenever a data set has extreme values, the median
is the preferred measure of central location.

• The median is the measure of location most often reported for annual income
and property value data. A few extremely large incomes of property values can
inflate the mean.

• For an odd number of observations:

7 observations: 26, 18, 27, 12, 14, 29, 19

Numbers in ascending order: 12, 14, 18, 19, 26, 27, 29

• The median is the middle value.

Median = 19

• For an even number of observations:

8 observations: 26, 18, 29, 12, 14, 27, 30, 19

Numbers in ascending order: 12, 14, 18, 19, 26, 27, 29, 30

The median is the average of the middle two values.

Median = (19 + 26) / 2 = 22.5



3. Mode:

• The mode of a data set is the value that occurs with the greatest frequency. The
greatest frequency can occur at two or more different values. If the data have
exactly two modes, the data are bimodal. If the data have more than two modes,
the data are multimodal.

4. Variance

• Variance is the expected value of the squared deviation of a random variable


from its mean. In short, it is the measurement of the distance of a set of random
numbers from their collective average value. Variance is used in statistics as a
way of better understanding a data set's distribution.

• Variance is calculated by finding the square of the standard deviation of a


variable.

σ² = Σ(X - μ)² / N

• In the formula above, μ represents the mean of the data points, x is the value of
an individual data point and N is the total number of data points.

Standard Deviation

• Standard deviation is simply the square root of the variance. Standard deviation
measures the standard distance between a score and the mean.

Standard deviation=√Variance

• The standard deviation is a measure of how the values in data differ from one
another or how spread out data is. There are two types of variance and standard
deviation in terms of sample and population.

• The standard deviation measures how far apart the data points in the observations
are from each other. We can calculate it by subtracting each data point from the mean
value and then finding the mean of the squared differences; this is called the
variance. The square root of the variance gives us the standard deviation.
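A small illustrative sketch in Python with NumPy (made-up scores, not from the text), where ddof selects the population (N) or sample (n - 1) divisor:

import numpy as np

scores = np.array([2, 4, 4, 4, 5, 5, 7, 9])            # hypothetical data
print(np.var(scores), np.std(scores))                  # population variance and std deviation (divide by N)
print(np.var(scores, ddof=1), np.std(scores, ddof=1))  # sample variance and std deviation (divide by n - 1)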


• Properties of the Standard Deviation :

a) If a constant is added to every score in a distribution, the standard deviation


will not be changed.

b) The center of the distribution (the mean) changes, but the standard deviation
remains the same.

c) If each score is multiplied by a constant, the standard deviation will be


multiplied by the same constant.

d) Multiplying by a constant will multiply the distance between scores and


because the standard deviation is a measure of distance, it will also be multiplied.

• If we are given numerical values for the mean and the standard deviation, we
should be able to visualize the distribution of the scores.

• Standard deviation distances always originate from the mean and are expressed as
positive deviations above the mean or negative deviations below the mean.

• Sum of Square (SS) for population definition formula is given below:

Sum of Squares (SS) = Σ(X - μ)²

• Sum of Square (SS) for population computation formula is given below:

SS = ΣX² - (ΣX)² / N

• Sum of Squares for sample definition formula:

SS = Σ(X - X̄)²

• Sum of Squares for sample computation formula :

SS = ΣX² - (ΣX)² / n

I) The heights of animals are: 600 mm, 470 mm, 170 mm, 430 mm and 300
mm. Find out the mean, the variance and the standard deviation.


Solution:

Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394

σ² = Σ(X - μ)² / N

Variance = [(600 - 394)² + (470 - 394)² + (170 - 394)² + (430 - 394)² + (300 - 394)²] / 5

Variance = (42436 + 5776 + 50176 + 1296 + 8836) / 5 = 108520 / 5

Variance = 21704

Standard deviation = √Variance = √21704

= 147.32 ≈ 147
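The same result can be checked in a few lines of NumPy (an illustrative check, not part of the original solution):

import numpy as np

heights = np.array([600, 470, 170, 430, 300])
print(heights.mean())   # 394.0
print(heights.var())    # 21704.0 (population variance, divides by N)
print(heights.std())    # about 147.32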

II) Determine the values of the range and the IQR for the following sets of
data.

(a) Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63

(b) Residence changes: 1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 4

Solution:

a) Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63

Range = Max number - Min number = 70-45

Range = 25

IQR:

Step 1: Arrange the given numbers from lowest to highest.

45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70



Step 2: Locate the median (63), then the quartiles.

Q1 = 60, Q3 = 65

IQR = Q3 - Q1 = 65 - 60 = 5
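An illustrative NumPy check of part (a); note that np.percentile interpolates between data points, so its quartiles may differ slightly from the counting method used above:

import numpy as np

ages = np.array([60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63])
q1, q3 = np.percentile(ages, [25, 75])
print("Range =", ages.max() - ages.min())   # 25
print("IQR =", q3 - q1)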

7.Normal Distributions and Standard (z) Scores? April/May2024

• The normal distribution is a continuous probability distribution that is


symmetrical on both sides of the mean, so the right side of the center is a mirror
image of the left side. The area under the normal distribution curve represents
probability and the total area under the curve sums to one.

• The normal distribution is often called the bell curve because the graph of its
probability density looks like a bell. It is also called the Gaussian
distribution, after the German mathematician Carl Gauss who first described it.

• Fig. 2.9.1 shows normal curve.

• A normal distribution is determined by two parameters the mean and the


variance. A normal distribution with a mean of 0 and a standard deviation of 1 is
called a standard normal distribution.


z Scores
• The z-score, or standard score, is a fractional representation of standard
deviations from the mean value. Accordingly, z-scores have a distribution
with a mean of 0 and a standard deviation of 1. Formally, the z-score is defined as:

z = (X - μ) / σ

where μ is mean, X is score and σ is standard deviation

• The z-score works by taking a sample score and subtracting the mean score,
before then dividing by the standard deviation of the total population. The z-score
is positive if the value lies above the mean and negative if it lies below the mean.

• A z score consists of two parts:

a) Positive or negative sign indicating whether it's above or below the mean; and

b) Number indicating the size of its deviation from the mean in standard deviation
units


• Why are z-scores important?

• It is useful to standardize the values (raw scores) of a normal distribution by
converting them into z-scores because:

(a) It allows researchers to calculate the probability of a score occurring within a


standard normal distribution;

(b) And enables us to compare two scores that are from different samples (which
may have different means and standard deviations).

• Using the z-score technique, one can now compare two different test results
based on relative performance, not individual grading scale.
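A minimal sketch (hypothetical scores, not from the text) of comparing two results from differently scaled tests by converting both to z scores:

def z_score(x, mean, std):
    return (x - mean) / std

# hypothetical scores from two tests with different scales
print(z_score(75, mean=70, std=5))       # 1.0
print(z_score(620, mean=500, std=100))   # 1.2 -> relatively better performance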

Example 2.9.2: Express each of the following scores as a z score:

(a) Margaret's IQ of 135, given a mean of 100 and a standard deviation of 15

(b) A score of 470 on the SAT math test, given a mean of 500 and a standard
deviation of 100.

Solution :

a) Margaret's IQ of 135, given a mean of 100 and a standard deviation of 15

Given, Margaret's IQ (X) = 135, Mean (μ) = 100, Standard deviation (σ) = 15

The z score for Margaret's IQ is calculated using the formula as,

z = (X - μ) / σ = (135 - 100) / 15 = 2.33

b) A score of 470 on the SAT math test, given a mean of 500 and a standard
deviation of 100

Given,

Score (X) = 470, Mean (μ) = 500, Standard deviation (σ) = 100

The z score for the SAT math score is calculated using the formula as,

z = (X - μ) / σ = (470 - 500) / 100 = -0.30

Standard Normal Curve


• If the original distribution approximates a normal curve, then the shift to
standard or z-scores will always produce a new distribution that approximates the
standard normal curve.

• Although there is an infinite number of different normal curves, each with its
own mean and standard deviation, there is only one standard normal curve, with a
mean of 0 and a standard deviation of 1.

Example 2.9.3: Suppose a random variable is normally distributed with a


mean of 400 and a standard deviation 100. Draw a normal curve with
parameter label.

Solution:


UNIT III
DESCRIBING RELATIONSHIPS

 Correlation
 Scatter plots
 correlation coefficient for quantitative data
 computational formula for correlation coefficient
 Regression –regression line
 least squares regression line
 Standard error of estimate
 interpretation of r2
 multiple regression equations
 regression towards the mean


LIST OF IMPORTANT QUESTIONS


1. What do you mean by Correlation?
2. What do you mean by correlation coefficient?
3. Write down the Uses of correlations:
4. What is Multiple Correlation ?
5. State in each case whether there is
6. List out the Properties of Coefficient of Correlation.
7. What are the Uses of Regression Analysis?
8. Distinguish the Correlation and Regression.
9. What is Regression Coefficient?

PART –B
Q1. Explain in detail about the types of Correlation? April/MAY2022
Q2. A sample of 6 children was selected, data about their age in years
and weight in kilograms was recorded as shown in the following table.
It is required to find the correlation between age and weight? NOV/DEC2022

Q3. A sample of 12 fathers and their elder sons gave the following data about their
heights in inches. Calculate the coefficient of rank correlation? April/May2023


Q4. Calculate coefficient of correlation from the following data? April/May2023

Q5.Categorize the different types of relationships using Scatter Plots? NOV/DEC2023

Q6. Explain about Correlation Coefficient for Quantitative Data? Nov/Dec2022

Q7. Explain about Regression? Nov/Dec2023


UNIT III
DESCRIBING RELATIONSHIPS

PART – A
1. What do you mean by Correlation?
Correlation is a statistical technique to ascertain the association or relationship
between two or more variables. Correlation analysis is a statistical technique to
study the degree and direction of relationship between two or more variables.
2. What do you mean by correlation coefficient?
A correlation coefficient is a statistical measure of the degree to which changes to
the value of one variable predict change to the value of another. When the
fluctuation of one variable reliably predicts a similar fluctuation in another
variable, there's often a tendency to think that means that the change in one causes
the change in the other.
3. Write down the Uses of correlations:
I. Correlation analysis helps in deriving precisely the degree and the
direction of such relationship.
II. The effect of correlation is to reduce the range of uncertainty of our
prediction. The prediction based on correlation analysis will be more
reliable and near to reality.
III. Correlation analysis contributes to the understanding of economic
behaviour, aids in locating the critically important variables on which
others depend, may reveal to the economist the connections by which
disturbances spread and suggest to him the paths through which stabilizing
forces may become effective.
IV. Economic theory and business studies show relationships between
variables like price and quantity demanded, advertising expenditure and
sales promotion measures, etc.


V. The measure of coefficient of correlation is a relative measure of change.
Types of Correlation: Correlation is described or classified in several different
ways. Three of the most important are: I. Positive and Negative II. Simple,
Partial and Multiple III. Linear and non-linear
4. What is Multiple Correlation ?
When three or more variables are studied, it is a case of multiple correlation. For
example, in above example if study covers the relationship between student marks,
attendance of students, effectiveness of teacher, use of teaching aids etc, it is a case
of multiple correlation.

5. State in each case whether there is


(a) Positive Correlation

(b) Negative Correlation

(c) No Correlation

Sl. No.   Particulars                                         Solution
1         Price of commodity and its demand                   Negative
2         Yield of crop and amount of rainfall                Positive
3         No. of fruits eaten and hunger of a person          Negative
4         No. of units produced and fixed cost per unit       Negative
5         No. of girls in the class and marks of boys         No Correlation
6         Ages of husbands and wives                          Positive
7         Temperature and sale of woollen garments            Negative
8         Number of cows and milk produced                    Positive
9         Weight of person and intelligence                   No Correlation
10        Advertisement expenditure and sales volume          Positive


6. List out the Properties of Coefficient of Correlation.


1. The coefficient of correlation always lies between -1 and +1, symbolically
written as -1 ≤ r ≤ +1. The coefficient of correlation is independent of change
of origin and scale.
2. The coefficient of correlation is a pure number and is independent of the units of
measurement. It means if X represent say height in inches and Y represent say
weights in kgs, then the correlation coefficient will be neither in inches nor in kgs
but only a pure number.
3. The coefficient of correlation is the geometric mean of the two regression coefficients,
symbolically r² = bxy × byx.
4. If X and Y are independent variables then coefficient of correlation is zero.

5. A study measuring the relationship between associated variables, wherein one
variable is dependent on another independent variable, is called Regression. It was
developed by Sir Francis Galton in 1877 to measure the relationship of height
between parents and their children.
6. Regression analysis is a statistical tool to study the nature and extent of functional
relationship between two or more variables and to estimate (or predict) the
unknown values of dependent variable from the known values of independent
variable.
7) What are the Uses of Regression Analysis?
1. It provides estimates of values of the dependent variables from values of
independent variables.
2. It is used to obtain a measure of the error involved in using the regression line as
a basis for estimation.
3. With the help of regression analysis, we can obtain a measure of degree of
association or correlation that exists between the two variables.
4. It is a highly valuable tool in economics and business research, since most of
the problems of economic analysis are based on cause and effect relationships.


8. Distinguish between Correlation and Regression.

1. 'Correlation', as the name says, determines the interconnection or co-relationship
between the variables, whereas 'Regression' explains how an independent variable is
numerically associated with the dependent variable.

2. In correlation, there is no distinction between the independent and dependent
variables; in regression, the dependent and independent variables are different.

3. The primary objective of correlation is to find a quantitative/numerical value
expressing the association between the values, while the primary intent of regression
is to estimate the values of a random variable based on the values of the fixed variable.

4. Correlation stipulates the degree to which both variables can move together,
whereas regression specifies the effect of a unit change in the known variable (p)
on the estimated variable (q).

5. Correlation helps to establish the connection between the two variables;
regression helps in estimating a variable's value based on another given value.

9. What is Regression Coefficient?


The quantity 'b' in the regression equation is called the regression coefficient or
slope coefficient. Since there are two regression equations, we therefore have two
regression coefficients.
1. Regression coefficient of X on Y, symbolically written as 'bxy'

2. Regression coefficient of Y on X, symbolically written as 'byx'

PART – B
1.Explain in detail about the types of Correlation? April/MAY2022
• When one measurement is made on each observation, uni-variate analysis is applied.
If more than one measurement is made on each observation, multivariate analysis is
applied.
Here we focus on bivariate analysis, where exactly two measurements are made on
each observation.
• The two measurements will be called X and Y. Since X and Y are obtained for
each observation, the data for one observation is the pair (X, Y).

• Some examples :
1. Height (X) and weight (Y) are measured for each individual in a sample.
2. Stock market valuation (X) and quarterly corporate earnings (Y) are recorded for
each company in a sample.
3. A cell culture is treated with varying concentrations of a drug, and the growth rate (X)
and drug concentration (Y) are recorded for each trial.
4. Temperature (X) and precipitation (Y) are measured on a given day at a set of weather
stations.
• There is a difference between bivariate data and two-sample data. In two-sample data, the
X and Y values are not paired and there are not necessarily the same number of X and Y
values.
• Correlation refers to a relationship between two or more objects. In statistics,
the word correlation refers to the relationship between two variables. Correlation exists
between two variables when one of them is related to the other in some way.
• Examples: One variable might be the number of hunters in a region and the other
variable could be the deer population. Perhaps as the number of hunters increases, the
deer population decreases. This is an example of a negative correlation: As one variable
increases the other decreases.
A positive correlation is where the two variables react in the same way, increasing or
decreasing together. Temperature in Celsius and Fahrenheit has a positive correlation.
• The term "correlation" refers to a measure of the strength of association between two
variables.
• Covariance is the extent to which a change in one variable corresponds systematically
to a change in another. Correlation can be thought of as a standardized covariance.
• The correlation coefficient r is a function of the data, so it really should be called the
sample correlation coefficient. The (sample) correlation coefficient r estimates the
population correlation coefficient p.
• If either the X or the Y values are constant (i.e. all have the same value), then one
of the sample standard deviations is zero and therefore the correlation coefficient is not
defined.
Types of Correlation
1. Positive and negative
2. Simple and multiple
3. Partial and total
4. Linear and non-linear.
1. Positive and negative

• Positive correlation : Association between variables such that high scores on one
variable tends to have high scores on the other variable. A direct relation between the
variables.
• Negative correlation : Association between variables such that high scores on one
variable tends to have low scores on the other variable. An inverse relation between the
variables.
2. Simple and multiple
• Simple: It is about the study of only two variables, the relationship is described as
simple correlation.
• Example: Quantity of money and price level, demand and price.
• Multiple: It is about the study of more than two variables simultaneously, the
relationship is described as multiple correlations.
• Example: The relationship of price, demand and supply of a commodity.
3. Partial and total correlation
• Partial correlation : Analysis recognizes more than two variables but considers
only two variables keeping the other constant. Example: Price and demand, eliminating
the supply side.
• Total correlation is based on all the relevant variables, which is normally not feasible.
In total correlation, all the facts are taken into account.
4. Linear and non-linear correlation
• Linear correlation : Correlation is said to be linear when the amount of change in one
variable tends to bear a constant ratio to the amount of change in the other. The graph
of the variables having a linear relationship will form a straight line.
• Non linear correlation : The correlation would be non linear if the amount of change
in one variable does not bear a constant ratio to the amount of change in the other
variable.
Classification of correlation
•Two methods are used for finding relationship between variables.
1. Graphic methods
2. Mathematical methods.

• Graphic methods contain two sub methods: Scatter diagram and simple graph.
• Types of mathematical methods are,
a. Karl 'Pearson's coefficient of correlation
b. Spearman's rank coefficient correlation
c. Coefficient of concurrent deviation
d. Method of least squares.
Coefficient of Correlation
Correlation : The degree of relationship between the variables under consideration is
measured through correlation analysis.
• The measure of correlation is called the correlation coefficient. The degree of
relationship is expressed by a coefficient which ranges from -1 to +1 (-1 ≤ r ≤ +1).
The direction of the relationship is indicated by the sign of the coefficient.
• The correlation analysis enables us to have an idea about the degree and direction
of the relationship between the two variables under study.
• Correlation is a statistical tool that helps to measure and analyze the degree of
relationship between two variables. Correlation analysis deals with the association
between two or more variables.
• Correlation denotes the interdependency among the variables. For correlating two
phenomena, it is essential that the two phenomena have a cause-effect relationship;
if such a relationship does not exist, then the two phenomena cannot be correlated.
• If two variables vary in such a way that movement in one are accompanied by
movement in other, these variables are called cause and effect relationship.
Properties of Correlation
1. Correlation requires that both variables be quantitative.
2. Positive r indicates positive association between the variables and negative r indicates
negative association.
3. The correlation coefficient (r) is always a number between - 1 and + 1.
4. The correlation coefficient (r) is a pure number without units.
5. The correlation coefficient measures clustering about a line, but only relative to the
standard deviations of the variables.

6. The correlation can be misleading in the presence of outliers or nonlinear association.


7. Correlation measures association. But association does not necessarily show causation.
Q2. A sample of 6 children was selected, data about their age in years
and weight in kilograms was recorded as shown in the following table. It is required to
find the correlation between age and weight? NOV/DEC2022

Solution :

X = Variable age is the independent variable

Y = Variable weight is the dependent variable


• Another formula for calculating the correlation coefficient uses z scores: r = Σ(Zx Zy) / N

Interpreting the correlation coefficient

• Because the relationship between two sets of data is seldom perfect, the majority of
correlation coefficients are fractions (0.92, -0.80 and the like).
• When interpreting correlation coefficients it is sometimes difficult to determine
what is high, low and average.
• The value of correlation coefficient 'r' ranges from - 1 to +1.
• If r = + 1, then the correlation between the two variables is said to be perfect and positive.
•If r = -1, then the correlation between the two variables is said to be perfect and negative.

• If r = 0, then there exists no correlation between the variables.
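As a quick hedged sketch of how such a coefficient is obtained in Python (made-up age and weight values, not the data of this question):

import numpy as np

age = np.array([2, 4, 6, 8, 10, 12])          # hypothetical ages (years)
weight = np.array([12, 16, 20, 24, 29, 33])   # hypothetical weights (kg)
r = np.corrcoef(age, weight)[0, 1]            # Pearson correlation coefficient
print(round(r, 3))                            # close to +1: strong positive correlation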


Q3. A sample of 12 fathers and their elder sons gave the following data about their

heights in inches. Calculate the coefficient of rank correlation? April/May2023

Solution:


Q4. Calculate coefficient of correlation from the following data? April/May2023

Solution: In the problem statement, both series items are in small numbers. So there is no

need to take deviations. Computation of coefficient of correlation

r = 46 / (5.29 × 9.165)

r = 0.9488


Q5.Categorize the different types of relationships using Scatter Plots? NOV/DEC2023

• When two variables x and y have an association (or relationship), we say there
exists a correlation between them. Alternatively, we could say x and y are correlated.
To find such an association, we usually look at a scatterplot and try to find a pattern.
• Scatterplot (or scatter diagram) is a graph in which the paired (x, y) sample data are
plotted with a horizontal x axis and a vertical y axis. Each individual (x, y) pair is plotted as a
single point.
• One variable is called independent (X) and the second is called dependent (Y).

Example:

• Fig. 3.2.1 shows the scatter diagram.


• The pattern of data is indicative of the type of relationship between your two variables :

1. Positive relationship

2. Negative relationship

3. No relationship.

• The scattergram can indicate a positive relationship, a negative relationship
or a zero relationship.

Advantages of Scatter Diagram

1. It is a simple to implement and attractive method to find out the nature of correlation.

2. It is easy to understand.

3. User will get rough idea about correlation (positive or negative correlation).

4. Not influenced by the size of extreme items.

5. It is the first step in investigating the relationship between two variables.

Disadvantage of scatter diagram

• Can not adopt an exact degree of correlation.

Q6. Explain about Correlation Coefficient for Quantitative Data? Nov/Dec2022

• The product moment correlation, r, summarizes the strength of association between


two metric (interval or ratio scaled) variables, say X and Y. It is an index used to
determine whether a linear or straight-line relationship exists between X and Y.
• As it was originally proposed by Karl Pearson, it is also known as the Pearson
correlation coefficient. It is also referred to as simple correlation, bivariate correlation or
merely the correlation coefficient
• The correlation coefficient between two variables will be the same regardless
of their underlying units of measurement.
• It measures the nature and strength between two variables of the quantitative type.
• The sign of r denotes the nature of association. While the value of r denotes the
strength of association.
• If the sign is positive this means the relation is direct (an increase in one variable is
associated with an increase in the other variable and a decrease in one variable is associated
with a decrease in the other variable).


• While if the sign is negative this means an inverse or indirect relationship


(which means an increase in one variable is associated with a decrease in the other).
The value of r ranges between (-1) and (+ 1). The value of r denotes the

strength of the association as illustrated by the following diagram,

1. If r = Zero this means no association or correlation between the two variables.

2. If 0 < r <0.25 = Weak correlation.

3. If 0.25 ≤ r < 0.75 = Intermediate correlation.

4. If 0.75 ≤ r< 1 = Strong correlation.

5. If r=1= Perfect correlation

• Pearson's 'r' is the most common correlation coefficient. Karl Pearson's
coefficient of correlation is denoted by 'r'. The coefficient of correlation 'r' measures the
degree of linear relationship between two variables, say x and y.
• Formula for calculating correlation coefficient (r):

1. When deviations are taken from the actual mean:

r = Σxy / √(Σx² × Σy²), where x = X - X̄ and y = Y - Ȳ

2. When deviation taken from an assumed mean :

i) Compute Pearson's coefficient of correlation between maintenance cost and sales as

per the data given below.

Solution: Given data:

n = 10

x = Maintenance cost

y = Sales

Calculate coefficient of correlation.


The correlation coefficient is positive, i.e. maintenance cost and sales are positively correlated.

ii) Find Karl Pearson's correlation coefficient for the following paired data.

Solution: Let x = Wages y = Cost of living


Karl Pearson's correlation coefficient r = 0.847

Q7. Explain about Regression? Nov/Dec2023

• For an input x, if the output is continuous, this is called a regression problem. For
example, based on historical information of demand for tooth paste in your supermarket
you are asked to predict the demand for the next month.
• Regression is concerned with the prediction of continuous quantities. Linear regression
is the oldest and most widely used predictive model in the field of machine learning.
The goal is to minimize the sum of the squared errors to fit a straight line to a set of
data points.
• It is one of the supervised learning algorithms. A regression model requires the
knowledge of both the dependent and the independent variables in the training data set.


• Simple Linear Regression (SLR) is a statistical model in which there is only one
independent variable and the functional relationship between the dependent variable
and the regression coefficient is linear.
• Regression line is the line which gives the best estimate of one variable from the
value of any other given variable.
• The regression line gives the average relationship between the two variables in
mathematical form. For two variables X and Y, there are always two lines of regression.
• Regression line of Y on X: Gives the best estimate for the value of Y for any specific
given value of X:
Y = a + bX
where
a = Y-intercept
b = Slope of the line
Y = Dependent variable
X = Independent variable
• By using the least squares method, we are able to construct a best fitting straight line
to the scatter diagram points and then formulate a regression equation in the form of:
ŷ = a + bx
ŷ = ȳ + b(x - x̄)
• Regression analysis is the art and science of fitting straight lines to patterns of data. In
a linear regression model, the variable of interest ("dependent" variable) is predicted
from k other variables ("independent" variables) using a linear equation.
• If Y denotes the dependent variable and X1, ..., Xk are the independent variables,
then the assumption is that the value of Y at time t in the data sample is determined by the
linear equation:
Yt = β0 + β1X1t + β2X2t + … + βkXkt + εt
where the betas are constants and the epsilons are independent and identically distributed
normal random variables with mean zero.
Regression Line
• A way of making a somewhat precise prediction based upon the relationships

between two variables. The regression line is placed so that it minimizes the predictive error.
• The regression line does not go through every point; instead it balances the difference
between all data points and the straight-line model. The difference between the
observed data value and the predicted value (the value on the straight line) is the error
or residual. The criterion to determine the line that best describes the relation
between two variables is based on the residuals.
Residual = Observed - Predicted
• A negative residual indicates that the model is over-predicting. A positive residual indicates
that the model is under-predicting.
Linear Regression
• The simplest form of regression to visualize is linear regression with a single predictor.
A linear regression technique can be used if the relationship between X and Y
can be approximated with a straight line.
• Linear regression with a single predictor can be expressed with the equation:
y = Ɵ2x + Ɵ1 + e
• The regression parameters in simple linear regression are the slope of the line (Ɵ2),
the angle between a data point and the regression line and the y intercept (Ɵ1) the point
where x crosses the y axis (X = 0).
• Model 'Y' is a linear function of 'X'. The value of 'Y' increases or decreases in a
linear manner as the value of 'X' changes.
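A minimal illustrative sketch in Python (hypothetical data) that fits such a straight line by least squares with NumPy and inspects the residuals:

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])     # hypothetical observations
slope, intercept = np.polyfit(x, y, deg=1)  # least squares fit of a degree-1 polynomial
predicted = slope * x + intercept
residuals = y - predicted                   # residual = observed - predicted
print(slope, intercept)
print(residuals)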

Nonlinear Regression:
• Often the relationship between x and y cannot be approximated with a straight line.
In this case, a nonlinear regression technique may be used.


• Alternatively, the data could be pre-processed to make the relationship linear.


In Fig. 3.4.2 shows nonlinear regression. (Refer Fig. 3.4.2 on previous page)
• The X and Y have a nonlinear relationship.
• If data does not show a linear dependence we can get a more accurate model using
a nonlinear regression model.
• For example: y = w0 + w1x + w2x² + w3x³
• Generalized linear model is foundation on which linear regression can be applied to
modeling categorical response variables.
Advantages:
a. Training a linear regression model is usually much faster than methods such
as neural networks.
b. Linear regression models are simple and require minimum memory to implement.
c. By examining the magnitude and sign of the regression coefficients you can
infer how predictor variables affect the target outcome.
• There are two important shortcomings of linear regression:
1. Predictive ability: The linear regression fit often has low bias but high variance.
Recall that expected test error is a combination of these two quantities. Prediction
accuracy can sometimes be improved by sacrificing some small amount of bias
in order to decrease the variance.
2. Interpretative ability: Linear regression freely assigns a coefficient to each predictor
variable. When the number of variables p is large, we may sometimes seek, for
the sake of interpretation, a smaller set of important variables.


Least Squares Regression Line


Least square method
• The method of least squares is about estimating parameters by minimizing the
squared discrepancies between observed data, on the one hand and their expected values
on the other.
• The Least Squares (LS) criterion states that the sum of the squares of errors is minimum.
The least-squares solutions yield y(x) whose elements sum to 1, but do not ensure the
outputs to be in the range [0, 1].
• How do we draw such a line based on the observed data points? Suppose an imaginary line of
y = a + bx.

• Imagine a vertical distance between the line and a data point, E = Y - E(Y).
This error is the deviation of the data point from the imaginary line, the regression line.
Then what are the best values of a and b? The values of a and b that minimize the sum of such errors.
• Deviation does not have good properties for computation. Then why do we use
squares of deviation? Let us get a and b that can minimize the sum of squared deviations
rather than the sum of deviations. This method is called least squares.
• The least squares method minimizes the sum of squares of errors. Such a and b are
called least squares estimators, i.e. estimators of the parameters α and β.
• The process of getting parameter estimators (e.g., a and b) is called estimation.
The least squares method is the estimation method of Ordinary Least Squares (OLS).
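A hedged sketch of the same idea using the standard least-squares formulas b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and a = ȳ - b·x̄ (hypothetical data):

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)   # slope
a = y.mean() - b * x.mean()                                                 # intercept
print(a, b)   # should agree with np.polyfit(x, y, 1) up to rounding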
Disadvantages of least square
1. Lack robustness to outliers.


2. Certain datasets unsuitable for least squares classification.


3. Decision boundary corresponds to ML solution.
Define linear and nonlinear regression using figures. Calculate the value of
Y for X = 100 based on linear regression prediction method.
Solution


Regression Towards the Mean


• Regression toward the mean refers to a tendency for scores, particularly extreme
scores, to shrink toward the mean. Regression toward the mean appears among
subsets of extreme observations for a wide variety of distributions.
• The rule goes that, in any series with complex phenomena that are dependent
on many variables, where chance is involved, extreme outcomes tend to be followed
by more moderate ones.
• The effects of regression to the mean can frequently be observed in sports, where
the effect causes plenty of unjustified speculations.

• It basically states that if a variable is extreme the first time we measure it, it will be closer


to the average the next time we measure it. In technical terms, it describes how a random
variable that is outside the norm eventually tends to return to the norm.
• For example, our odds of winning on a slot machine stay the same. We might hit a
"winning streak" which is, technically speaking, a set of random variables outside the norm.
But play the machine long enough and the random variables will regress to the mean
(i.e. "return to normal") and we shall end up losing.

Regression fallacy
• Regression fallacy assumes that a situation has returned to normal due to corrective
actions having been taken while the situation was abnormal. It does not take into
consideration normal fluctuations.
• An example of this could be a business program failing and causing problems
which is then cancelled. The return to "normal", which might be somewhat different
from the original situation or a situation of "new normal" could fall into the
category of regression fallacy. This is considered an informal fallacy.


UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING

 Basics of Numpy arrays


 aggregations
 computations on arrays
 comparisons, masks,
 boolean logic
 fancy indexing
 structured arrays
 Data manipulation with Pandas
 data indexing and selection
 operating on data – missing data
 Hierarchical indexing – combining datasets – aggregation
and grouping
 pivot tables


LIST OF IMPORTANT QUESTIONS


UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING
PART – A

1. Write short on Importance of Data Wrangling.


2. Write short note on NumPy.
3. List the operations which are done using NumPy?
4. Write short notes on aggregate() function.
5. List the operations which are done using Pandas?
6. Write short note on Data Manipulation using Pandas.
7. How can Pandas get missing data?
8. How do you treat missing data in Python?
9. How to use Hierarchical Indexes with Pandas?
10. List some of the Aggregation functions in Pandas.
11. What is grouping in pandas?
12. What is the use of pivot table in Python?

PART – B
1. Explain in detail about Data Wrangling in Python? Nov/Dec2022
2. Explain the two main ways to carry out Boolean masking? April/may2022
3. Explain in detail about Aggregation in Pandas? April/may2023
4. Pandas Data Frame - transform() function? Nov/Dec2023
5. Explain in detail about the pivot table using python?




UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING

PART – A

1. Write short on Importance of Data Wrangling.


Data wrangling is a very important step. The example below explains its importance:
a book-selling website wants to show the top-selling books of different domains, according to user
preference. For example, when a new user searches for motivational books, the site wants to show those
motivational books which sell the most or have a high rating, etc.

2. Write short note on NumPy.


One of the most fundamental packages in Python, NumPy is a general-purpose array-processing
package. It provides high-performance multidimensional array objects and tools to work with the
arrays. NumPy is an efficient container of generic multi-dimensional data.
3. List the operations which are done using NumPy?
1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
4. Basic slicing and advanced indexing in NumPy
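A brief illustrative snippet touching a few of these operations:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a + 10)            # element-wise addition
print(a * 2)             # element-wise multiplication
print(a[:, 1])           # slice: the second column
print(a.flatten())       # flatten to one dimension
print(a.reshape(3, 2))   # reshape to 3 x 2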

4. Write short notes on aggregate() function.


The aggregate() method allows us to apply a function or a list of function names to be executed
along one of the axis of the DataFrame, default 0, which is the index (row) axis.
Note: the agg() method is an alias of the aggregate() method.
dataframe.aggregate(func, axis, *args, **kwargs)
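A small hedged example of agg() applied along the default (index) axis:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})
# apply several aggregation functions to every column
print(df.aggregate(['sum', 'min', 'max']))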

5. List the operations which are done using Pandas?


1. Indexing, manipulating, renaming, sorting, merging data frame
2. Update, Add, Delete columns from a data frame
3. Impute missing files, handle missing data or NANs
4. Plot data with histogram or box plot
6. Write short note on Data Manipulation using Pandas.
 Dropping columns in the data.
 Dropping rows in the data.
 Renaming a column in the dataset.
 Select columns with specific data types.
 Slicing the dataset.
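An illustrative sketch (hypothetical column names) of a few of these manipulations:

import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'score': [85, 90, 78],
                   'remarks': ['ok', 'good', 'ok']})
df = df.drop(columns=['remarks'])            # dropping a column
df = df.drop(index=[2])                      # dropping a row
df = df.rename(columns={'score': 'marks'})   # renaming a column
print(df.select_dtypes(include='number'))    # select columns with a specific data type
print(df.iloc[0:2])                          # slicing the dataset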


7. How can Pandas get missing data?


In order to check missing values in a Pandas DataFrame, we use the functions isnull() and notnull(). Both
functions help in checking whether a value is NaN or not. These functions can also be used on a Pandas Series
in order to find null values in a series.
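For example (a small sketch with made-up values):

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, None])
print(s.isnull())         # True where a value is missing
print(s.notnull())        # True where a value is present
print(s.isnull().sum())   # count of missing values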
8. How do you treat missing data in Python?
It is time to see the different methods to handle them.
1. Drop rows or columns that have a missing value.
2. Drop rows or columns that only have missing values.
3. Drop rows or columns based on a threshold value.
4. Drop based on a particular subset of columns.
5. Fill with a constant value.
6. Fill with an aggregated value.
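A short hedged sketch of a few of these options:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})
print(df.dropna())            # drop rows that have any missing value
print(df.dropna(how='all'))   # drop rows where every value is missing
print(df.fillna(0))           # fill with a constant value
print(df.fillna(df.mean()))   # fill with an aggregated value (the column mean)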
9. How to use Hierarchical Indexes with Pandas?
#importing pandas library as alias pd
import pandas as pd
# calling the pandas read_csv() function.
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')

print(df.head())
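The hierarchical (MultiIndex) part can then be built with set_index(); the column names 'region' and 'state' below are assumptions about the CSV and may differ in the actual file:

# assuming the file has 'region' and 'state' columns (hypothetical names)
df_hier = df.set_index(['region', 'state'])   # two-level hierarchical index
print(df_hier.head())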

10. List some of the Aggregation functions in Pandas.


Pandas provide us with a variety of aggregate functions. These functions help to perform various
activities on the datasets. The functions are:
 .count(): This gives a count of the data in a column.
 .sum(): This gives the sum of data in a column.
 .min() and .max(): These help to find the minimum value and maximum value in a column,
respectively.
 .mean() and .median(): Helps to find the mean and median, of the values in a column,
respectively.
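For instance (made-up data):

import pandas as pd

marks = pd.DataFrame({'maths': [80, 90, 70], 'science': [85, 95, 75]})
print(marks['maths'].count(), marks['maths'].sum())
print(marks['maths'].min(), marks['maths'].max())
print(marks['maths'].mean(), marks['maths'].median())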


11. What is grouping in pandas?


Pandas groupby is used for grouping the data according to categories and applying a
function to each category. It also helps to aggregate data efficiently. The Pandas
DataFrame.groupby() function is used to split the data into groups based on some criteria.
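A minimal illustrative example:

import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B'],
                   'points': [10, 15, 9, 21]})
# split into groups by team, then aggregate each group
print(df.groupby('team')['points'].sum())
print(df.groupby('team')['points'].mean())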

12. What is the use of pivot table in Python?


The Pandas pivot_table() function provides a familiar interface to create Excel-style pivot
tables. The function requires at a minimum either the index= or columns= parameters to specify
how to split data. The function can calculate one or multiple aggregation methods, including using
custom functions.
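A small hedged example:

import pandas as pd

sales = pd.DataFrame({'region': ['N', 'N', 'S', 'S'],
                      'product': ['pen', 'book', 'pen', 'book'],
                      'amount': [100, 200, 150, 250]})
# rows = region, columns = product, values aggregated with sum
print(pd.pivot_table(sales, index='region', columns='product',
                     values='amount', aggfunc='sum'))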

PART – B
1. Explain in detail about Data Wrangling in Python? Nov/Dec2022
Data wrangling involves processing the data in various formats like - merging, grouping,
concatenating etc. for the purpose of analysing or getting them ready to be used with another set of data.
Python has built-in features to apply these wrangling methods to various data sets to achieve the
analytical goal. In this chapter we will look at few examples describing these methods.
Merging Data
The Pandas library in python provides a single function, merge, as the entry point for all standard
database join operations between DataFrame objects −
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True)
Let us now create two different DataFrames and perform the merging operations on it.

# import the pandas library


import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(left)
print(right)

Its output is as follows −


Name id subject_id
0 Alex 1 sub1
1 Amy 2 sub2
2 Allen 3 sub4

3 Alice 4 sub6
4 Ayoung 5 sub5

Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5
Grouping Data
Grouping data sets is a frequent need in data analysis where we need the result in terms of various groups
present in the data set. Panadas has in-built methods which can roll the data into various groups.
In the below example we group the data by year and then get the result for a specific year.

# import the pandas library


import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',


'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print(grouped.get_group(2014))

Its output is as follows −


Points Rank Team Year
0 876 1 Riders 2014
2 863 2 Devils 2014
4 741 3 Kings 2014
9 701 4 Royals 2014
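
A grouped object can also be aggregated directly; for instance, continuing the same example, the mean
points per year:

print(grouped['Points'].mean())
# Year
# 2014    795.25
# 2015    769.50
# 2016    725.00
# 2017    739.00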
Concatenating Data
Pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects.
In the below example the concat function performs concatenation operations along an axis. Let us create
different objects and do concatenation.

import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])

two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two]))

Its output is as follows −


Marks_scored Name subject_id
1 98 Alex sub1
2 90 Amy sub2
3 87 Allen sub4
4 69 Alice sub6
5 78 Ayoung sub5
1 89 Billy sub2
2 80 Brian sub4
3 79 Bran sub3
4 97 Bryce sub6
5 88 Betty sub5
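
Because both DataFrames above reuse the index labels 1-5, concat can also be asked to tag each source
with a key, or to rebuild a fresh index (a short sketch continuing the same example):

# label each block of rows with a key
print(pd.concat([one, two], keys=['x', 'y']))
# or ignore the original indexes and generate a new 0..9 index
print(pd.concat([one, two], ignore_index=True))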

2. Explain the two main ways to carry out Boolean masking? April/May2022

The NumPy library in Python is a popular library for working with arrays. Boolean masking, also
called boolean indexing, is a feature in Python NumPy that allows for the filtering of values
in numpy arrays.

There are two main ways to carry out boolean masking:

 Method one: Returning the result array.

 Method two: Returning a boolean array.

Method one: Returning the result array


The first method returns an array with the required results. In this method, we pass a condition in the
indexing brackets, [ ], of an array. The condition can be any comparison, like arr > 5, for the array arr.

Syntax
arr[arr > 5]
Parameter values
 arr: This is the array that we are querying.
 The condition arr > 5 is the criterion with which values in the arr array will be filtered.


Return value
This method returns a NumPy array, ndarray, with values that satisfy the given condition. The line in the
example given above will return all the values in arr that are greater than 5.

Example
Let's try out this method in the following example:

# importing NumPy
import numpy as np
# Creating a NumPy array
arr = np.arange(15)
# Printing our array to observe
print(arr)
# Using boolean masking to filter elements greater than or equal to 8
print(arr[arr >= 8])
# Using boolean masking to filter elements equal to 12
print(arr[arr == 12])

Method two: Returning a boolean array


The second method returns a boolean array that has the same size as the array it represents. A boolean
array only contains the boolean values of either True or False. This boolean array is also called a mask
array, or simply a mask. We'll discuss boolean arrays in more detail in the "Return value" section.

Syntax
The code snippet given below shows us how to use this method:
mask = arr > 5
Return value
The line in the code snippet given above will:

 Return an array with the same size and dimensions as arr. This array will only contain the
values True and False. All the True values represent elements in the same position in arr that
satisfy our condition, and all the False values represent elements in the same position in arr that do
not satisfy our condition.
 Store this boolean array in a mask array.

The mask array can be passed in the index brackets of arr to return the values that satisfy our condition. We will
see how this works in our coding example.

Example
Let's try out this method in the following example:
# importing NumPy
import numpy as np
# Creating a NumPy array

arr = np.array([[ 0, 9, 0],


[ 0, 7, 8],
[ 6, 0, 1]])
# Printing our array to observe
print(arr)
# Creating a mask array
mask = arr > 5
# Printing the mask array
print(mask)
# Printing the filtered array using both methods
print(arr[mask])
print(arr[arr > 5])
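
Conditions can also be combined element-wise with & (and), | (or) and ~ (not); note the parentheses
around each comparison. A brief sketch using the same arr:

# elements greater than 5 AND less than 9
print(arr[(arr > 5) & (arr < 9)])
# elements equal to 0 OR greater than 8
print(arr[(arr == 0) | (arr > 8)])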

3. Explain in detail about Aggregation in Pandas? APRIL/MAY2023

Pandas provides us with a variety of aggregate functions. These functions help to perform various
activities on the datasets. The functions are:
 .count(): This gives a count of the data in a column.
 .sum(): This gives the sum of data in a column.
 .min() and .max(): These help to find the minimum value and maximum value in a column,
respectively.
 .mean() and .median(): These help to find the mean and median of the values in a column,
respectively.
 DataFrame.aggregate(func, axis=0, *args, **kwargs)
Parameters:
func: It refers to a callable, a string, a dictionary, or a list of strings/callables.

It is used for aggregating the data. If it is a function, it must either work when passed a DataFrame or
when passed to DataFrame.apply(). A dict can be passed if the keys are DataFrame column names.

axis: (default 0): It refers to 0 or 'index', 1 or 'columns'.

0 or 'index': the function is applied to each column.

1 or 'columns': the function is applied to each row.

*args: It is a positional argument that is to be passed to func.

**kwargs: It is a keyword argument that is to be passed to the func.

Returns:
It returns a scalar, Series or DataFrame.

scalar: It is returned when Series.agg is called with a single function.

Series: It is returned when DataFrame.agg is called with a single function.

DataFrame: It is returned when DataFrame.agg is called with several functions.

Example:
import pandas as pd
import numpy as np
info = pd.DataFrame([[1,5,7],[10,12,15],[18,21,24],[np.nan,np.nan,np.nan]], columns=['X','Y','Z'])
info.agg(['sum','min'])
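
The call above computes the sum and the minimum of every column while ignoring the NaN row, so the
result is of the following form:

        X     Y     Z
sum  29.0  38.0  46.0
min   1.0   5.0   7.0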


4. Pandas DataFrame - transform() function? April/May2023


The transform() function is used to call function (func) on self producing a DataFrame with transformed
values and that has the same axis length as self.
Syntax:
DataFrame.transform(self, func, axis=0, *args, **kwargs)
Parameters:
func: Function to use for transforming the data. If a function, it must either work when passed a
DataFrame or when passed to DataFrame.apply. Accepted combinations are: a function; a string
function name; a list of functions and/or function names, e.g. [np.exp, 'sqrt']; a dict of axis
labels -> functions, function names or a list of such. (Type: function, str, list or dict. Required.)

axis: If 0 or 'index', apply the function to each column; if 1 or 'columns', apply the function to
each row. (Type: {0 or 'index', 1 or 'columns'}. Default value: 0. Optional.)

*args: Positional arguments to pass to func. (Optional.)

**kwargs: Keyword arguments to pass to func. (Optional.)


Returns: DataFrame
A DataFrame that must have the same length as self.
Raises: ValueError - If the returned DataFrame has a different length than self.
Example:
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.DataFrame({'X': range(4), 'Y': range(2, 6)})
df
Out[2]:
X Y

0 0 2

1 1 3

2 2 4

3 3 5
In [3]:
df.transform(lambda x: x + 1)
Out[3]:
X Y

0 1 3

1 2 4

2 3 5

3 4 6
Even though the resulting DataFrame must have the same length as the input DataFrame,
it is possible to provide several input functions:
In [4]:
s = pd.Series(range(4))
s
Out[4]:
0 0
1 1
2 2

3 3
dtype: int64
In [5]:
s.transform([np.sqrt, np.exp])
Out[5]:
sqrt exp

0 0.000000 1.000000

1 1.000000 2.718282

2 1.414214 7.389056

3 1.732051 20.085537
5. Explain in detail about the pivot table using Python?

Most people likely have experience with pivot tables in Excel. Pandas provides a similar function called
(appropriately enough) pivot_table. While it is exceedingly useful, I frequently find myself struggling to
remember how to use the syntax to format the output for my needs. This section will focus on explaining
the pandas pivot_table function and how to use it for your data analysis.


The Data
One of the challenges with using the pandas pivot_table is making sure we understand our data and what
questions we are trying to answer with the pivot table. It is a seemingly simple function but can produce
very powerful analysis very quickly.

In this scenario, I'm going to be tracking a sales pipeline (also called a funnel). The basic problem is that
some sales cycles are very long (think “enterprise software”, capital equipment, etc.) and management
wants to understand it in more detail throughout the year.

Typical questions include:

 How much revenue is in the pipeline?


 What products are in the pipeline?
 Who has what products at what stage?
 How likely are we to close deals by year end?

Many companies will have CRM tools or other software that sales uses to track the process. While they
may have useful tools for analyzing the data, inevitably someone will export the data to Excel and use a
PivotTable to summarize the data.


Using a pandas pivot_table can be a good alternative because it is:

 Quicker (once it is set up)
 Self-documenting (look at the code and we know what it does)
 Easy to use to generate a report or email
 More flexible because we can define custom aggregation functions

Read in the data


Let's set up our environment first.

If we want to follow along, we can download the Excel file.

import pandas as pd
import numpy as np

Version Warning
The pivot_table API has changed over time so please make sure we have a recent version of pandas ( >
0.15) installed for this example to work. This example also uses the category data type which requires a
recent version as well.
Read in our sales funnel data into our DataFrame

df = pd.read_excel("../in/sales-funnel.xlsx")
df.head()

For convenience's sake, let's define the status column as a category and set the order we want to view.

This isn't strictly required but helps us keep the order we want as we work through analyzing the data.

df["Status"] = df["Status"].astype("category")
df["Status"].cat.set_categories(["won","pending","presented","declined"],inplace=True)

Pivot the data


As we build up the pivot table, I think it's easiest to take it one step at a time. Add items and check each
step to verify we are getting the results we expect. Don't be afraid to play with the order and the variables
to see what presentation makes the most sense for our needs.

The simplest pivot table must have a dataframe and an index. In this case, let's use the Name as
our index.

pd.pivot_table(df,index=["Name"])

We can have multiple indexes as well. In fact, most of the pivot_table args can take multiple values via
a list.

pd.pivot_table(df,index=["Name","Rep","Manager"])


This is interesting but not particularly useful. What we probably want to do is look at this by Manager and
Rep. It's easy enough to do by changing the index.

pd.pivot_table(df,index=["Manager","Rep"])

We can see that the pivot table is smart enough to start aggregating the data and summarizing it by
grouping the reps with their managers. Now we start to get a glimpse of what a pivot table can do for us.

For this purpose, the Account and Quantity columns aren't really useful. Let's remove them by explicitly
defining the columns we care about using the values field.


pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])

The price column automatically averages the data but we can do a count or a sum. Adding them is simple
using aggfunc and np.sum.

pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)

aggfunc can take a list of functions. Let's try a mean using the numpy mean function and len to get
a count.

pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])
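
Columns can additionally be split out. For example, assuming the sales-funnel data contains a Product
column (as in the original dataset), the Price can be summed per product, with missing combinations
filled with 0:

pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],
               columns=["Product"],aggfunc=[np.sum],fill_value=0)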


DATA VISUALIZATION

 Importing Matplotlib – Line plots – Scatter plots


 visualizing errors – density and contour plots
 Histograms – legends – colors
 subplots – text and annotation – customization
 three-dimensional plotting –
 Geographic Data with Basemap –
 Visualization with Seaborn.


LIST OF IMPORTANT QUESTIONS


UNIT V
DATA VISUALIZATION
PART – A
1. What is Data Visualization?
2. Which are the best libraries for data visualization in python?
3. How can we visualize more than three dimensions of data in a single chart?
4. How to plot the distribution of customers by age?
5. What is the purpose of a Scatter plot?
6. Define IQR in a box plot.
7. What is a Boxplot?
8. What is a heat map in Python? Create a correlation matrix using the corr() function of
the data frame?
9. What is a scatter plot? For what type of data is scatter plot usually used for?
10. What features might be visible in scatterplots?
11. What type of plot would you use if you need to demonstrate “relationship” between
variables/parameters?
12. When will you use a histogram and when will you use a bar chart?
13. What is an outlier?
14. What type of data is box-plots usually used for? Why?
15. What information could you gain from a box-plot?

PART – B
1. Explain various features of Matplotlib platform used for data visualization and
illustrate its challenges? Nov/Dec2023
2. Explain various types of plotting using Scatter Plots in python? (OR)
Write a code snippet that projects our globe as a 2-D flat surface (using
cylindrical project) and convey information about the location of any three
major Indian cities in the map (using scatter plot).April/May2024


3. Explain about Histogram? Nov/Dec2022


4. Explain in detail about three-dimensional Plotting using Matplotlib? (OR)
Outline any two three-dimensional plotting in Matplotlib with an example.
( April/May2023)
5. Mapping Geographical Data with Base map Python Package?
6. Explain in detail about Data Visualization Using Seaborn? (OR) Briefly
explain about visualization with Seaborn. Give an example working code
segment that represents a 2D kernel density plot for any data?
April/May2024

LIST OF IMPORTANT QUESTIONS


UNIT V
DATA VISUALIZATION
PART – A

1. What is Data Visualization?


After manipulating the data for our use, we have to present it for others to understand. This process
is known as Data Visualization. Data can be presented or visualized using tools and techniques
such as infographics, graphs, fever charts and histograms, to name a few. Data visualization
helps present trends and patterns to customers, stakeholders, and team members for
activities as varied as driving sales, product development and performance analysis.
2. Which are the best libraries for data visualization in python?

Essentially, there are four libraries that are used for data visualization in python:

 Matplotlib
 Seaborn
 Plotly
 Bokeh
3. How can we visualize more than three dimensions of data in a single chart?
To visualize data beyond three dimensions, we need to use visual cues such as color, size,
and shape.
 Color is used to depict both continuous and categorical data.


 Marker Size is used to represent continuous data. Can be used for categorical data as
well. However, since size differences are difficult to detect, it is not considered the most
appropriate choice for categorical data.
 Shapes are used to represent different classes.
4. How to plot the distribution of customers by age?
The distribution of customers by age can be plotted simply by creating a histogram from the
Age column of the customer’s Data Frame.
5. What is the purpose of a Scatter plot?
Scatter plots are used to observe relationships between two different numeric variables.
6. Define IQR in a box plot?
IQR stands for interquartile range. In a box plot and IQR is the length of the box.
7. What is a Boxplot?
A Box and Whisker Plot (or Boxplot) are used to represent data distribution through their
quartiles. The graph looks like a rectangle with lines extending from the top and bottom.
These lines are known as the “whiskers”, and represent the variability outside the upper
and lower quartiles.
8. What is a heat map in Python? Create a correlation matrix using the corr() function of
the data frame?
Heat maps are used to cross-examine multivariate data and represent it through color
variations.
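A minimal sketch of the second part of the question, assuming a numeric DataFrame df: the correlation
matrix comes from df.corr() and can be drawn as a heat map with seaborn:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# hypothetical numeric data
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [1, 3, 2, 5]})
corr = df.corr()               # correlation matrix
sns.heatmap(corr, annot=True)  # heat map with the coefficients shown in each cell
plt.show()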
9. What is a scatter plot? For what type of data is scatter plot usually used for?
A scatter plot is a chart used to plot a correlation between two or more variables at the
same time. It’s usually used for numeric data.


10. What features might be visible in scatterplots?

1. Correlation: the two variables might have a relationship, for example, one might depend on
another. But this is not the same as causation.

2. Associations: the variables may be associated with one another.

3. Outliers: there could be cases where the data in two dimensions does not follow the
general pattern.

4. Clusters: sometimes there could be groups of data that form a cluster on the plot.

5. Gaps: some combinations of values might not exist in a particular case.

6. Barriers: boundaries.

7. Conditional relationships: some relationships between the variables rely on a condition
to be met.

11. What type of plot would you use if you need to demonstrate “relationship” between
variables/parameters?
When we are trying to show the relationship between 2 variables, scatter plots or charts
are used. When we are trying to show “relationship” between three variables, bubble
charts are used.

12. When will you use a histogram and when will you use a bar chart?

Both plots are used to plot the distribution of a variable. Histograms are usually used for
a continuous (numeric) variable, while bar charts are used for a categorical variable.

13. What is an outlier?


An outlier is a term commonly used by analysts to refer to a value that appears far
away from, and diverges from, the overall pattern in a sample. There are two types of outliers:
univariate and multivariate.
14. What type of data is box-plots usually used for? Why?
Boxplots are usually used for continuous variables. The plot is generally not informative
when used for discrete data.

15. What information could you gain from a box-plot?


1. Minimum/maximum score
2. Lower/upper quartile
3. Median

4. The Interquartile Range


5. Skewness
6. Dispersion
7. Outliers
16. When do you use a boxplot and in what situation would you choose boxplot over
histograms?
Boxplots are used when trying to show a statistical distribution of one variable or compare the
distributions of several variables. It is a visual representation of the statistical five-number
summary. Histograms are better at determining the probability distribution of the data;
however, boxplots are better for comparison between datasets and they are more space
efficient.
17. When analyzing a histogram, what are some of the features to look for?
1. Asymmetry
2. Outliers
3. Multimodality
4. Gaps
5. Heaping/Rounding: Heaping example: temperature data can consist of common
values due to conversion from Fahrenheit to Celsius. Rounding example: weight
data that are all multiples of 5.
6. Impossibilities/Errors

PART – B
1. Explain various features of Matplotlib platform used for data visualization and
illustrate its challenges? Nov/Dec2023
• Matplotlib is a cross-platform, data visualization and graphical plotting library for Python
and its numerical extension NumPy.
• Matplotlib is a comprehensive library for creating static, animated and interactive
visualizations in Python.
• Matplotlib is a plotting library for the Python programming language. It allows you to make
quality charts in a few lines of code. Most of the other Python plotting libraries are built
on top of Matplotlib.
• The library is currently limited to 2D output, but it still provides you with the means to
express the data patterns graphically.
Visualizing Information: Starting with Graph
• Data visualization is the presentation of quantitative information in a graphical form.

In other words, data visualizations turn large and small datasets into visuals that are easier
for the human brain to understand and process.
• Good data visualizations are created when communication, data science, and
design collide. Data visualizations done right offer key insights into complicated
datasets in ways that are meaningful and intuitive.
• A graph is simply a visual representation of numeric data. MatPlotLib supports a
large number of graph and chart types.
• Matplotlib is a popular Python package used to build plots. Matplotlib can also be used
to make 3D plots and animations.
• Line plots can be created in Python with Matplotlib's pyplot library. To build a line plot
first import Matplotlib. It is a standard convention to import Matplotlib's pyplot library as
plt.
• To define a plot, you need some values, the matplotlib.pyplot module, and an idea of
what you want to display.
import matplotlib.pyplot as plt
plt.plot([1,2,3],[5,7,4])
plt.show()
• The plt.plot will "draw" this plot in the background, but we need to bring it to the screen
when we're ready, after graphing everything we intend to.
• plt.show(): With that, the graph should pop up. If not, sometimes it can pop under, or
you may have gotten an error. Your graph should look like this:
• This window is a matplotlib window, which allows us to see our graph, as well as
interact with it and navigate it

Line Plot
• More than one line can be in the plot. To add another line, just call the plot (x,y) function
again. In the example below we have two different values for y (y1, y2) that are plotted
onto the chart.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-1, 1, 50)
y1 = 2*x+ 1
y2 = 2**x + 1

plt.figure(num = 3, figsize=(8, 5))

plt.plot(x, y2)

plt.plot(x, y1, linewidth=1.0, linestyle='--')

plt.show()

• Output of the above code will look like this:

2. Explain various types of plotting using Scatter Plots in python? (OR)


Write a code snippet that projects our globe as a 2-D flat surface (using cylindrical
project) and convey information about the location of any three major Indian cities
in the map (using scatter plot).April/May2024

Scatter Plots
• A scatter plot is a visual representation of how two variables relate to each other.
We can use scatter plots to explore the relationship between two variables, for example by
looking for any correlation between them.
• Matplotlib also supports more advanced plots, such as scatter plots. In this case, the
scatter function is used to display data values as a collection of x, y coordinates
represented by standalone dots.
import matplotlib.pyplot as plt
#X axis values:
x = [2,3,7,29,8,5,13,11,22,33]
# Y axis values:

y = [4,7,55,43,2,4,11,22,33,44]
# Create scatter plot:
plt.scatter(x, y)
plt.show()
• Comparing plt.scatter() and plt.plot(): We can also produce the scatter plot shown
above using another function within matplotlib.pyplot. Matplotlib's plt.plot() is a general-
purpose plotting function that will allow the user to create various different line or marker plots.
• We can achieve the same scatter plot as the one obtained in the section above with the
following call to plt.plot(), using the same data:
plt.plot(x, y, "o")
plt.show()

• In this case, we had to include the marker "o" as a third argument, as otherwise plt.plot()
would plot a line graph. The plot created with this code is identical to the plot created
earlier with plt.scatter().
Creating Advanced Scatterplots
• Scatterplots are especially important for data science because they can show data patterns
that aren't obvious when viewed in other ways.
import matplotlib.pyplot as plt
x_axis1 =[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_axis1 =[5, 16, 34, 56, 32, 56, 32, 12, 76, 89]
x_axis2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_axis2 = [53, 6, 46, 36, 15, 64, 73, 25, 82, 9]
plt.title("Prices over 10 years")
plt.scatter(x_axis1, y_axis1, color = 'darkblue', marker='x', label="item 1")
plt.scatter(x_axis2, y_axis2, color='darkred', marker='x', label="item 2")
plt.xlabel("Time (years)")
plt.ylabel("Price (dollars)")
plt.grid(True)
plt.legend()
plt.show()
• The chart displays two data sets. We distinguish between them by the colour of the
marker.
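
For the map part of the question (April/May 2024), the globe can be projected onto a 2-D flat surface
with a cylindrical projection and city locations marked with a scatter plot. The sketch below assumes the
Basemap toolkit (covered later in this unit) is installed, and uses approximate longitude/latitude values
for three major Indian cities:

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

# cylindrical projection: the whole globe flattened onto a 2-D rectangle
m = Basemap(projection='cyl', resolution='c',
            llcrnrlat=-90, urcrnrlat=90, llcrnrlon=-180, urcrnrlon=180)
m.drawcoastlines()
m.drawcountries()

# approximate (longitude, latitude) of three major Indian cities
cities = {'Delhi': (77.21, 28.61),
          'Mumbai': (72.88, 19.08),
          'Chennai': (80.27, 13.08)}
for name, (lon, lat) in cities.items():
    x, y = m(lon, lat)                        # convert lon/lat to map coordinates
    m.scatter(x, y, s=40, color='red', zorder=5)
    plt.text(x + 1, y + 1, name, fontsize=9)

plt.title('Major Indian cities on a cylindrical projection')
plt.show()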

Visualizing Errors
• Error bars can be added to Matplotlib line plots and graphs. Error is the difference
between the calculated value and the actual value.
• Without error bars, bar graphs give the impression that a measured or calculated value
is known to a high level of precision. The method matplotlib.pyplot.errorbar()
plots y versus x as lines and/or markers with attached error bars.


• Adding the error bar in Matplotlib, Python. It's very simple, we just have to write the
value of the error. We use the command:
plt.errorbar(x, y, yerr = 2, capsize=3)
Where:
x = The data of the X axis.
Y = The data of the Y axis.
yerr = The error value of the Y axis. Each point has its own error value.
xerr = The error value of the X axis.
capsize = The size of the lower and upper lines of the error bar

• A simple example, where we only plot one point. The error is 10% on the Y axis.
import matplotlib.pyplot as plt
x = 1
y = 20

y_error = 20*0.10  # 10% error

plt.errorbar(x, y, yerr = y_error, capsize=3)

plt.show()

Output:

• We plot using the command "plt.errorbar(...)", giving it the desired characteristics.
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(1,8)
y = np.array([20,10,45,32,38,21,27])
y_error = y * 0.10  # 10% error
plt.errorbar(x, y, yerr = y_error,
linestyle="None", fmt="ob", capsize=3, ecolor="k")
plt.show()
• Parameters of the errorbar :

a) yerr is the error value at each point.
b) linestyle, here it indicates that we will not plot a line.
c) fmt is the type of marker, in this case a blue ("b") point ("o").
d) capsize is the size of the lower and upper lines of the error bar.
e) ecolor is the color of the error bar. The default color is the marker color.
Output:

• Multiple lines with Matplotlib errorbar in Python: it is often useful to draw several lines
in the same plot. We'll draw many error bars in the same graph using the scheme below.
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
x = np.arange(20)
y = 4 * np.sin(x / 20 * np.pi)
yerr = np.linspace(0.06, 0.3, 20)
plt.errorbar(x, y + 8, yerr=yerr, label='line 1')
plt.errorbar(x, y + 6, yerr=yerr, uplims=True, label='line 2')
plt.errorbar(x, y + 4, yerr=yerr, uplims=True, lolims=True, label='line 3')
upperlimits = [True, False] * 10
lowerlimits = [False, True] * 10
plt.errorbar(x, y, yerr=yerr, uplims=upperlimits, lolims=lowerlimits, label='line 4')
plt.legend(loc='upper left')
plt.title('Example')
plt.show()
Output:

3. Explain about Histogram? Nov/Dec2022


• In a histogram, the data are grouped into ranges (e.g. 10 - 19, 20 - 29) and then
plotted as connected bars. Each bar represents a range of data. The width of each
bar is proportional to the width of each category, and the height is proportional to the
frequency or percentage of that category.
• It provides a visual interpretation of numerical data by showing the number of data points
that fall within a specified range of values called "bins".
• Fig. 5.5.1 shows histogram.


• Histograms can display a large amount of data and the frequency of the data values.
The median and distribution of the data can be determined by a histogram. In addition,
it can show any outliers or gaps in the data.
• Matplotlib provides a dedicated function to compute and display histograms: plt.hist()
• Code for creating a histogram with randomized data:
import numpy as np
import matplotlib.pyplot as plt
x = 40 * np.random.randn(50000)
plt.hist(x, 20, range=(-50, 50), histtype='stepfilled',
align='mid', color='r', label='Test Data')
plt.legend()
plt.title('Histogram')
plt.show()
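
Several distributions can also be compared on one set of axes by overlaying semi-transparent
histograms (a small sketch with synthetic data):

import numpy as np
import matplotlib.pyplot as plt

a = np.random.normal(0, 1, 1000)     # synthetic sample 1
b = np.random.normal(2, 1.5, 1000)   # synthetic sample 2
plt.hist(a, bins=30, alpha=0.5, label='sample a')
plt.hist(b, bins=30, alpha=0.5, label='sample b')
plt.legend()
plt.title('Overlaid histograms')
plt.show()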

4. Explain in detail about three-dimensional Plotting using Matplotlib? (OR)


Outline any two three-dimensional plotting in Matplotlib with an example.
( April/May2023)
• Matplotlib is the most popular choice for data visualization. While initially
developed for plotting 2-D charts like histograms, bar charts, scatter plots,
line plots, etc., Matplotlib has extended its capabilities to offer 3D plotting modules as well.
• First import the libraries:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
• The first one is a standard import statement for plotting using matplotlib, which you
would see for 2D plotting as well. The second import of the Axes3D class is required
for enabling 3D projections. It is, otherwise, not used anywhere else.
• Create figure and axes
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(111, projection='3d')
Output:

Example :

import numpy as np

fig = plt.figure(figsize=(8,8))
ax = plt.axes(projection='3d')
ax.grid()
t = np.arange(0, 10*np.pi, np.pi/50)
x = np.sin(t)
y = np.cos(t)
ax.plot3D(x, y, t)
ax.set_title('3D Parametric Plot')
# Set axes labels
ax.set_xlabel('x', labelpad=20)
ax.set_ylabel('y', labelpad=20)
ax.set_zlabel('t', labelpad=20)
plt.show()
Output:
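
Since the question asks for two kinds of three-dimensional plots, a surface plot is a common second
example (a sketch that repeats the imports so it can run on its own):

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 8))
ax = plt.axes(projection='3d')
# build a grid and evaluate z = sin(sqrt(x^2 + y^2)) over it
x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))
ax.plot_surface(X, Y, Z, cmap='viridis')
ax.set_title('3D Surface Plot')
plt.show()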

5. Mapping Geographical Data with the Basemap Python Package?


• Basemap is a toolkit under the Python visualization library Matplotlib. Its main
function is to draw 2D maps, which are important for visualizing spatial data.
Basemap itself does not do any plotting, but provides the ability to transform
coordinates into one of 25 different map projections.
• Matplotlib can also be used to plot contours, images, vectors, lines or points in
transformed coordinates. Basemap includes the GSHHS coastline dataset, as well as
datasets from GMT for rivers, states and national boundaries.
• These datasets can be used to plot coastlines, rivers and political boundaries on a
map at several different resolutions. Basemap uses the Geometry Engine-Open Source
(GEOS) library at the bottom to clip coastline and boundary features to the desired
map projection
area. In addition, basemap provides the ability to read shapefiles.
• Basemap cannot be installed using pip install basemap. If Anaconda is installed, you
can install basemap using conda install basemap.

• Example objects in basemap:


a) contour(): Draw contour lines.
b) contourf(): Draw filled contours.
c) imshow(): Draw an image.
d) pcolor(): Draw a pseudocolor plot.
e) pcolormesh(): Draw a pseudocolor plot (faster version for regular meshes).
f) plot(): Draw lines and/or markers.
g) scatter(): Draw points with markers.
h) quiver(): Draw vectors.(draw vector map, 3D is surface map)
i) barbs(): Draw wind barbs (draw wind plume map)
j) drawgreatcircle(): Draw a great circle (draws a great circle route)
• For example, if we wanted to show all the different types of endangered plants within
a region, we would use a base map showing roads, provincial and state boundaries,
waterways and elevation. Onto this base map, we could add layers that show the
location of different categories of endangered plants. One added layer could be trees,
another layer could be mosses and lichens, another layer could be grasses.
Basemap basic usage:
import warnings
warnings.filterwarnings('ignore')
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
map = Basemap()
map.drawcoastlines()
# plt.show()
plt.savefig('test.png')
Output:
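
Other projections and background layers can be selected in the Basemap constructor. For instance, a
brief sketch of an orthographic view centred over India with country borders and the "blue marble"
background:

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

m = Basemap(projection='ortho', lat_0=20, lon_0=78, resolution='c')
m.bluemarble()        # NASA blue-marble image as the background
m.drawcoastlines()
m.drawcountries()
plt.title('Orthographic projection centred on India')
plt.show()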


6. Explain in detail about Data Visualization Using Seaborn? (OR) Briefly


explain about visualization with Seaborn. Give an example working code
segment that represents a 2D kernel density plot for any data?
April/May2024

• Seaborn is a Python data visualization library based on Matplotlib. It provides a


high-level interface for drawing attractive and informative statistical graphics.
Seaborn is an open-source Python library.
• Seaborn helps you explore and understand your data. Its plotting functions
operate on dataframes and arrays containing whole datasets and internally perform the
necessary semantic mapping and statistical aggregation to produce informative plots.
• It has a dataset-oriented, declarative API: the user should focus on what the different elements
of the plots mean, rather than on the details of how to draw them.
• Keys features:
a) Seaborn is a statistical plotting library
b) It has beautiful default styles
c) It also is designed to work very well with Pandas data frame objects.
Seaborn works easily with data frames and the Pandas library. The graphs created can
also be customized easily.
• Functionality that seaborn offers:
a) A dataset-oriented API for examining relationships between multiple variables
b) Convenient views onto the overall structure of complex datasets
c) Specialized support for using categorical variables to show observations or aggregate
statistics
d) Options for visualizing univariate or bivariate distributions and for comparing them
between subsets of data
e) Automatic estimation and plotting of linear regression models for different kinds of
dependent variables
f) High-level abstractions for structuring multi-plot grids that let you easily build
complex visualizations
g) Concise control over matplotlib figure styling with several built-in themes
h) Tools for choosing color palettes that faithfully reveal patterns in your data.
Plot a Scatter Plot in Seaborn:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv('worldHappiness2016.csv')

# the y column was truncated in the source; "Happiness Score" is an assumed column name
sns.scatterplot(data=df, x="Economy (GDP per Capita)", y="Happiness Score")

plt.show()
Output:
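
For the kernel density part of the question, a working sketch that draws a 2-D kernel density plot; it uses
synthetic data (since no dataset is specified) and assumes seaborn 0.11 or later, where kdeplot accepts
x= and y= data:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# synthetic bivariate data
x = np.random.normal(0, 1, 500)
y = x + np.random.normal(0, 0.5, 500)
sns.kdeplot(x=x, y=y, fill=True, cmap='Blues')   # 2D kernel density estimate
plt.title('2D kernel density plot')
plt.show()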

Difference between Matplotlib and Seaborn
