
PAAVAI COLLEGE OF ENGINEERING

ACADEMIC YEAR 2024-2025

PRESENTER’S MANUAL

DEPARTMENT OF ARTIFICIAL INTELLIGENCE &


DATA SCIENCE
COURSE NAME: DATA EXPLORATION AND
VISUALIZATION
COURSE CODE: AD3301
YEAR/SEM: II YEAR – III SEMESTER
What is the Paavai Teaching Methodology?
At Paavai Educational Institutions, inclusive, flexible and insightful learning aims to provide
engaging educational experiences and meet the needs of learners from all backgrounds.
Teachers should align their teaching with everyday life, which makes learning meaningful.

The teaching methodology for this semester follows three cardinal components:

1. Concept Class
2. Lab Manual
3. Directed Learning Class

Concept class: This is a theory class that will focus on the concepts. Whenever required, this
session will also demonstrate how these concepts apply to data exploration and visualization topics.

Lab Manual: The lab manual follows the concept class. The students learn to implement the
concepts learnt in the concept class.

Directed Learning Class: Learning and application may be challenging for some students.
One of the oldest and most comprehensive ways of delivering information, the directed learning
class allows students to apply themselves in a manner that makes understanding content more
accessible. In this process, learners take initiative in their own learning by planning,
implementing and evaluating their learning.

The entire purpose of this methodology is to make the students more:

 Concept focused
 Adapted to real-life work environments
Introduction to the Course

This Presenter’s Manual is to be used for the third semester of Artificial Intelligence & Data
Science for the course Data Exploration and Visualization. The syllabus of this course
enhances the students’ ability to understand and learn the importance of data exploration and
visualization.

Prerequisites for taking this course

 To excel in data visualization, you require a combination of knowledge and skills


such as analytical skills, mathematics, statistics, computer science, design,
narrative, and creativity.
 Data visualization employs graphical representations such as plots, charts, and
animations to effectively communicate complex data insights.
 With the daily generation of 2.5 quintillion bytes of data, there's a growing need for
data visualization to effectively share the insights from these large data sets.
 Learning data visualization involves mastering multiple related skills, including data
analytics and familiarity with tools such as Excel, Tableau, and programming
languages like Python.
 Noble Desktop offers comprehensive programs covering core data analytics and
visualization skills, providing a great starting point for anyone interested in this field.
 Before learning data visualization, possessing skills in data analytics, design, and
storytelling can expedite the learning process.

Course Objectives:
To enable the students to:

 To outline an overview of exploratory data analysis.


 To implement data visualization using Matplotlib.
 To perform univariate data exploration and analysis.
 To apply bivariate data exploration and analysis.
 To use Data exploration and visualization techniques for multivariate and time
series data.
SYLLABUS
AD3301 DATA EXPLORATION AND VISUALIZATION    L T P C
                                             3 0 2 4

UNIT I EXPLORATORY DATA ANALYSIS 9

EDA fundamentals – Understanding data science – Significance of EDA – Making sense of
data – Comparing EDA with classical and Bayesian analysis – Software tools for EDA –
Visual aids for EDA – Data transformation techniques: merging databases, reshaping and
pivoting – Transformation techniques – Grouping datasets – Data aggregation – Pivot tables
and cross-tabulations.

UNIT II VISUALIZING USING MATPLOTLIB 9

Importing Matplotlib – Simple line plots – Simple scatter plots – visualizing errors – density
and contour plots – Histograms – legends – colors – subplots – text and annotation –
customization – three dimensional plotting - Geographic Data with Basemap - Visualization
with Seaborn.

UNIT III UNIVARIATE ANALYSIS 10

Introduction to Single variable: Distributions and Variables - Numerical Summaries of Level


and Spread - Scaling and Standardizing – Inequality - Smoothing Time Series.

UNIT IV BIVARIATE ANALYSIS 8

Relationships between Two Variables - Percentage Tables - Analyzing Contingency Tables -


Handling Several Batches - Scatterplots and Resistant Lines – Transformation.

UNIT V MULTIVARIATE AND TIME SERIES ANALYSIS 9

Introducing a Third Variable - Causal Explanations - Three-Variable Contingency Tables and


Beyond - Longitudinal Data – Fundamentals of TSA – Characteristics of time series data –
Data Cleaning – Time-based indexing – Visualizing – Grouping – Resampling.

TOTAL: 45 PERIODS

Student Learning Outcomes:


Clearly written student learning outcomes are the foundation upon which effective courses are
designed. Outcomes inform both the ways students are evaluated in a course and the way a
course will be organized. Effective learning outcomes are student-centered, measurable,
concise, meaningful, achievable and outcome-based (rather than task-based). The course
contents are designed for those who are keen on building a career in the field of data
exploration and visualization.

At the end of the course, the students will be able to:

 Understand the fundamentals of exploratory data analysis.


 Implement the data visualization using Matplotlib.
 Perform univariate data exploration and analysis.
 Apply bivariate data exploration and analysis.
 Use Data exploration and visualization techniques for multivariate and time series data.

TEXT BOOKS PRESCRIBED:

1. Suresh Kumar Mukhiya, Usman Ahmed, “Hands-On Exploratory Data Analysis with
Python”, Packt Publishing, 2020.

2. Jake VanderPlas, “Python Data Science Handbook: Essential Tools for
Working with Data”, O’Reilly, 1st Edition, 2016.

3. Catherine Marsh, Jane Elliott, “Exploring Data: An Introduction to Data Analysis for
Social Scientists”, Wiley Publications, 2nd Edition, 2008.

REFERENCE BOOKS:

1. Eric Pimpler, “Data Visualization and Exploration with R”, GeoSpatial Training Services,
2017.
2. Claus O. Wilke, “Fundamentals of Data Visualization”, O’Reilly Publications, 2019.
3. Matthew O. Ward, Georges Grinstein, Daniel Keim, “Interactive Data
Visualization: Foundations, Techniques, and Applications”, 2nd Edition, CRC Press,
2015.
CO-PO & PSO Mapping

CO    PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12  PSO1 PSO2 PSO3
CO1    3   1   3   3   -   -   -   -   2    3    3    3     2    2    2
CO2    2   2   2   1   1   -   -   -   3    2    3    1     3    1    3
CO3    2   1   2   1   1   -   -   -   3    2    1    2     2    2    1
CO4    2   2   2   1   -   -   -   -   1    2    1    3     1    3    2
CO5    3   1   1   2   1   -   -   -   3    2    1    2     2    2    3
AVG    2   1   2   2   1   -   -   -   2    2    2    2     2    2    2

* For Entire Course, PO & PSO Mapping

POs & PSO REFERENCE:

PO1  Engineering Knowledge    PO7   Environment & Sustainability   PSO1  Professional Skills
PO2  Problem Analysis         PO8   Ethics                         PSO2  Problem-Solving Skills
PO3  Design & Development     PO9   Individual & Team Work         PSO3  Successful Career and Entrepreneurship
PO4  Investigations           PO10  Communication Skills
PO5  Modern Tools             PO11  Project Mgt. & Finance
PO6  Engineer & Society       PO12  Life Long Learning

UNIT WISE HOURS ALLOCATION

Course     Concept Class Hrs.   Dir. Learning Class Hrs.   Total Hrs.
Unit - 1          7                       2                    9
Unit - 2          7                       2                    9
Unit - 3          8                       2                   10
Unit - 4          6                       2                    8
Unit - 5          7                       2                    9
Total            35                      10                   45
I. CONCEPT CLASS

LESSON PLAN

UNIT I EXPLORATORY DATA ANALYSIS 9

EDA fundamentals – Understanding data science – Significance of EDA – Making sense of
data – Comparing EDA with classical and Bayesian analysis – Software tools for EDA –
Visual aids for EDA – Data transformation techniques: merging databases, reshaping and
pivoting – Transformation techniques – Grouping datasets – Data aggregation – Pivot tables
and cross-tabulations.

S.No.  Topics to be covered                               Method  Proposed   Lecture notes   Ref  Teaching
                                                                  Date       Page Number          Method
1.     EDA fundamentals                                   CC      15.07.24   1-2             T1   CB/L
2.     Understanding data science – Significance of EDA   CC      16.07.24   3-7             T1   CB/L
3.     Making sense of data – Comparing EDA with
       classical and Bayesian analysis                    CC      18.07.24   8-11            T1   CB/L
4.     Software tools for EDA                             CC      19.07.24   11-12           T1   CB/L
5.     Visual Aids for EDA                                CC      20.07.24   13-15           T1   CB/L
6.     Data transformation techniques                     CC      22.07.24   16-18           T1   CB/L
7.     Merging databases, reshaping and pivoting          CC      23.07.24   19-20           T1   CB/L
8.     Transformation techniques – Grouping Datasets –
       data aggregation                                   CC      24.07.24   23-24           T1   CB/L
9.     Transformation techniques – Pivot tables and
       cross-tabulations                                  CC      25.07.24   25-27           T1   CB/L
*T1 – Textbook 1, CC – Concept Class, DL – Directed Learning


PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION          Lecture no: 1
Subject Code: AD3301                                      Date: 12.08.2024    Day: 1
Unit 1: EXPLORATORY DATA ANALYSIS                         Hour:
Topic covered: EDA fundamentals

S.no   Time      Structure
1      2 Mins    Attendance
2      2 Mins    Technical Terms
3      7 Mins    Revision: Data and data collection
4      1 Min     Objective: EDA fundamentals
5      35 Mins   Content: Exploratory Data Analysis (EDA) is an analysis approach that
                 identifies general patterns in the data.
6      3 Mins    Questions by Students
7      3 Mins    Revision and Questions: What is exploratory data analysis?
8      1 Min     Outcome: The student should be able to understand the basics of EDA and
                 its basic terms.
9      1 Min     Next Class: Understanding data science – Significance of EDA

Remarks:
                                                          Faculty Incharge

PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION          Lecture no: 2
Subject Code: AD3301                                      Date:               Day:
Unit 1: EXPLORATORY DATA ANALYSIS                         Hour:
Topic covered: Understanding data science, Significance of EDA

S.no   Time      Structure
1      2 Mins    Attendance
2      2 Mins    Technical Terms
3      7 Mins    Revision: Fundamentals of exploratory data analysis
4      1 Min     Objective: Understand the concept of data science and the significance of EDA
5      35 Mins   Content: Data science is the study of data to extract meaningful insights for
                 business. EDA helps analysts look at the data before making any assumptions.
6      3 Mins    Questions by Students
7      3 Mins    Revision and Questions: What is the data science process?
                 What is the significance of EDA?
8      1 Min     Outcome: The student should be able to understand the concepts of data
                 science and the significance of EDA.
9      1 Min     Next Class: Making sense of data, Comparing EDA with classical and
                 Bayesian analysis

Remarks:
                                                          Faculty Incharge

PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION          Lecture no: 3
Subject Code: AD3301                                      Date:               Day:
Unit 1: EXPLORATORY DATA ANALYSIS                         Hour:
Topic covered: Making sense of data, Comparing EDA with classical and Bayesian analysis

S.no   Time      Structure
1      2 Mins    Attendance
2      2 Mins    Technical Terms
3      7 Mins    Revision: Data science process and significance of EDA
4      1 Min     Objective: Understand making sense of data and how EDA compares with
                 classical and Bayesian analysis
5      35 Mins   Content: Making sense of data means examining the information you have and
                 looking for patterns, relationships, or trends within it.
6      3 Mins    Questions by Students
7      3 Mins    Revision and Questions: What is making sense of data?
                 Compare EDA with classical and Bayesian analysis.
8      1 Min     Outcome: The student should be able to understand the concepts of making
                 sense of data and comparing EDA with classical and Bayesian analysis.
9      1 Min     Next Class: Software tools for EDA

Remarks:
                                                          Faculty Incharge

PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION          Lecture no: 4
Subject Code: AD3301                                      Date:               Day:
Unit 1: EXPLORATORY DATA ANALYSIS                         Hour:
Topic covered: Software tools for EDA

S.no   Time      Structure
1      2 Mins    Attendance
2      2 Mins    Technical Terms
3      7 Mins    Revision: Making sense of data, Comparing EDA with classical and Bayesian
                 analysis
4      1 Min     Objective: Understand the software tools for exploratory data analysis
5      35 Mins   Content: Tools required for exploratory data analysis: R, Python, Excel.
6      3 Mins    Questions by Students
7      3 Mins    Revision and Questions: What tools are used to perform exploratory data
                 analysis?
8      1 Min     Outcome: The student should be able to understand software tools for
                 exploratory data analysis.
9      1 Min     Next Class: Visual Aids for EDA

Remarks:
                                                          Faculty Incharge

PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION          Lecture no: 5
Subject Code: AD3301                                      Date:               Day:
Unit 1: EXPLORATORY DATA ANALYSIS                         Hour:
Topic covered: Visual Aids for EDA

S.no   Time      Structure
1      2 Mins    Attendance
2      2 Mins    Technical Terms
3      7 Mins    Revision: Software tools for EDA
4      1 Min     Objective: Understand the visual aids for EDA
5      35 Mins   Content: Visual tools such as box plots, scatter plots, and histograms help
                 EDA identify underlying patterns and relationships within the data.
6      3 Mins    Questions by Students
7      3 Mins    Revision and Questions: What are the visual aids for exploratory data
                 analysis?
8      1 Min     Outcome: The student should be able to understand the visual aids for
                 exploratory data analysis.
9      1 Min     Next Class: Data transformation techniques

Remarks:
                                                          Faculty Incharge

PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION          Lecture no: 6
Subject Code: AD3301                                      Date:               Day:
Unit 1: EXPLORATORY DATA ANALYSIS                         Hour:
Topic covered: Data transformation techniques

S.no   Time      Structure
1      2 Mins    Attendance
2      2 Mins    Technical Terms
3      7 Mins    Revision: Visual aids for EDA
4      1 Min     Objective: Understand the data transformation techniques
5      35 Mins   Content: The different types of data transformation techniques, such as
                 manipulation, normalization, attribute construction, generalization,
                 discretization, aggregation, and smoothing.
6      3 Mins    Questions by Students
7      3 Mins    Revision and Questions: What are the data transformation techniques?
8      1 Min     Outcome: The student should be able to understand the concept of data
                 transformation techniques.
9      1 Min     Next Class: Merging databases, reshaping and pivoting

Remarks:
                                                          Faculty Incharge

PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION          Lecture no: 7
Subject Code: AD3301                                      Date:               Day:
Unit 1: EXPLORATORY DATA ANALYSIS                         Hour:
Topic covered: Merging databases, reshaping and pivoting

S.no   Time      Structure
1      2 Mins    Attendance
2      2 Mins    Technical Terms
3      7 Mins    Revision: Data transformation techniques
4      1 Min     Objective: Understand the concept of merging databases, reshaping and
                 pivoting
5      35 Mins   Content: Data merging is the process of combining two or more similar
                 records into a single one. In Pandas, reshaping data refers to converting a
                 DataFrame from one format to another for better data visualization and
                 analysis.
6      3 Mins    Questions by Students
7      3 Mins    Revision and Questions: What is reshaping and pivoting in Pandas?
                 What is merging databases?
8      1 Min     Outcome: The student should be able to understand the concept of merging
                 databases, reshaping and pivoting.
9      1 Min     Next Class: Transformation techniques - Grouping Datasets - data
                 aggregation

Remarks:
                                                          Faculty Incharge

PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION          Lecture no: 8
Subject Code: AD3301                                      Date:               Day:
Unit 1: EXPLORATORY DATA ANALYSIS                         Hour:
Topic covered: Transformation techniques - Grouping Datasets - data aggregation

S.no   Time      Structure
1      2 Mins    Attendance
2      2 Mins    Technical Terms
3      7 Mins    Revision: Merging databases, reshaping and pivoting
4      1 Min     Objective: Understand the concept of transformation techniques, grouping
                 datasets, and data aggregation
5      35 Mins   Content: Grouping data involves aggregating data points based on a common
                 field to summarize and analyze the dataset more effectively.
6      3 Mins    Questions by Students
7      3 Mins    Revision and Questions: What are data aggregation and grouping datasets in
                 data transformation?
8      1 Min     Outcome: The student should be able to understand the concepts of
                 transformation techniques, grouping datasets, and data aggregation.
9      1 Min     Next Class: Pivot tables and cross-tabulations

Remarks:
                                                          Faculty Incharge

PAAVAI COLLEGE OF ENGINEERING
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

Subject Name: DATA EXPLORATION AND VISUALIZATION          Lecture no: 9
Subject Code: AD3301                                      Date:               Day:
Unit 1: EXPLORATORY DATA ANALYSIS                         Hour:
Topic covered: Transformation techniques - Pivot tables and cross-tabulations

S.no   Time      Structure
1      2 Mins    Attendance
2      2 Mins    Technical Terms
3      7 Mins    Revision: Transformation techniques - Grouping Datasets - data aggregation
4      1 Min     Objective: Understand the concept of transformation techniques, pivot tables
                 and cross-tabulations
5      35 Mins   Content: Crosstabs are used for categorical data, while pivot tables can be
                 used for both categorical and numerical data.
6      3 Mins    Questions by Students
7      3 Mins    Revision and Questions: What are pivot tables and cross-tabulations in data
                 transformation?
8      1 Min     Outcome: The student should be able to understand the concept of pivot
                 tables and cross-tabulations.
9      1 Min     Next Class: Revision

Remarks:
                                                          Faculty Incharge

UNIT -1 EXPLORATORY DATA ANALYSIS

Technical Terms

1. Data exploration
   Literal meaning: The first step in data analysis, involving the use of data visualization tools
   and statistical techniques to uncover data set characteristics and initial patterns.
   Subject meaning: Data exploration is the first step in the journey of extracting insights from
   raw datasets. It serves as the compass that guides data scientists through the vast sea of
   information. It involves getting to know the data intimately, understanding its structure, and
   uncovering valuable nuggets that lay hidden beneath the surface.

2. Data visualization
   Literal meaning: The process of using visual elements like charts, graphs, or maps to
   represent data.
   Subject meaning: Data visualization is the graphical representation of information and data.
   By using visual elements like charts, graphs, and maps, data visualization tools provide an
   accessible way to see and understand trends, outliers, and patterns in data.

3. Exploratory Data Analysis
   Literal meaning: An analysis technique for analyzing and investigating a data set and
   summarizing its main characteristics.
   Subject meaning: Exploratory Data Analysis (EDA) is a vital step in the process of
   understanding and analyzing data. It serves as the foundation stone for any data analysis
   project, providing valuable insights and revealing the true nature of the data.

4. Data science
   Literal meaning: The scientific analysis of large amounts of information held on computers.
   Subject meaning: Data science is the study of data to extract meaningful insights for
   business. It is a multidisciplinary approach that combines principles and practices from the
   fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze
   large amounts of data.

5. Data transformation
   Literal meaning: Transformations typically involve converting a raw data source into a
   cleansed, validated and ready-to-use format.
   Subject meaning: Data transformation is the process of converting data from one format,
   such as a database file, XML document or Excel spreadsheet, into another.

6. Data smoothing
   Literal meaning: A statistical approach for eliminating outliers from datasets.
   Subject meaning: Data smoothing refers to a statistical approach of eliminating outliers
   from datasets to make the patterns more noticeable.

7. Data generalization
   Literal meaning: The process of generating summary data with successive layers for a
   dataset.
   Subject meaning: Data generalization is the process of generating summary data with
   successive layers for a dataset. Its aim is to hide the characteristics of an individual within
   its group, such that an adversary will not be able to distinguish the individual from its peers.

8. Data aggregation
   Literal meaning: Any process whereby data is gathered and expressed in a summary form.
   Subject meaning: Data aggregation is the process where raw data is gathered and expressed
   in a summary form for statistical analysis. Raw data can be aggregated over a given time
   period to provide statistics such as average, minimum, maximum, sum, and count.

9. Data normalization
   Literal meaning: The process of reorganizing data within a database so that users can utilize
   it for further queries and analysis.
   Subject meaning: Data normalization is the process of reorganizing data within a database
   so that users can utilize it for further queries and analysis. Simply put, it is the process of
   developing clean data: eliminating redundant and unstructured data and making the data
   appear similar across all records and fields.

10. Data merging
    Literal meaning: The process of combining two or more data sets into a single, unified
    database.
    Subject meaning: Data merging is the process of combining two or more data sets into a
    single, unified database. It involves adding new details to existing data, appending cases,
    and removing any duplicate or incorrect information to ensure that the data at hand is
    comprehensive, complete, and accurate.

UNIT I
EXPLORATORY DATA ANALYSIS
CONTENTS

1. EDA Fundamentals
2. Understanding data science
3. Significance of EDA
4. Making sense of Data
5. Comparing EDA with classical and Bayesian Analysis
6. Software tools for EDA
7. Visual aids for EDA
8. Data transformation techniques
9. Merging databases, Reshaping and Pivoting
10. Transformation Techniques
10.1 Grouping datasets
10.2 Data aggregation
10.3 Pivot tables and Cross Tabulations

1. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a vital step in the process of understanding and
analyzing data. It serves as the foundation stone for any data analysis project, providing
valuable insights and revealing the true nature of the data. EDA can be compared to an
investigation carried out by a detective, where digging deep into piles of data helps uncover
clues that aid in the actual data analysis.

Data: The Collection of Facts

Data refers to a collection of facts. These facts can take the form of numbers, words,
observations, or descriptions. However, data on its own does not carry any meaning or
context. It is simply a raw representation of information.

Information: Understanding and Context

Information, on the other hand, is how we interpret and understand the facts within a specific
context. It is the structured or organized form of data that conveys a logical meaning. For
example, let's consider the following unorganized data:

 Rich (name of a person)


 Red (color)
 2022 (year)
 Blue car (object)

Individually, these words do not hold much significance. However, when we structure the
data and organize it, we can derive meaningful information. For instance, "Rich bought a blue
car" conveys a complete thought and provides useful information.

Data vs. Information

While data and information are closely related, there are key differences between the two:

1. Data is unorganized, while information is structured and organized.
2. Data is a part, while information is the whole.
3. Data alone is typically not useful, but information can be valuable on its own.
4. Information is dependent on data; without data, there can be no information.

DATA COLLECTION
Data collection is the process of collecting and evaluating information or data from multiple
sources to find answers to research problems, answer questions, evaluate outcomes, and
forecast trends and probabilities. It is an essential phase in all types of research, analysis, and
decision-making, including that done in the social sciences, business, and healthcare.

1. Primary Data Collection:


Primary data collection involves the collection of original data directly from the source or
through direct interaction with the respondents. This method allows researchers to obtain
firsthand information specifically tailored to their research objectives. There are various
techniques for primary data collection, including:

a. Surveys and Questionnaires: Researchers design structured questionnaires or surveys to


collect data from individuals or groups. These can be conducted through face-to-face
interviews, telephone calls, mail, or online platforms.

b. Interviews: Interviews involve direct interaction between the researcher and the
respondent. They can be conducted in person, over the phone, or through video conferencing.
Interviews can be structured (with predefined questions), semi-structured (allowing
flexibility), or unstructured (more conversational).

c. Observations: Researchers observe and record behaviors, actions, or events in their natural
setting. This method is useful for gathering data on human behavior.

2. Secondary Data Collection:

Secondary data collection involves using existing data collected by someone else for a
purpose different from the original intent. Researchers analyze and interpret this data to
extract relevant information. Secondary data can be obtained from various sources, including:

a. Published Sources: Researchers refer to books, academic journals, magazines, newspapers,


government reports, and other published materials that contain relevant data.

b. Online Databases: Numerous online databases provide access to a wide range of secondary
data, such as research articles, statistical information, economic data, and social surveys.

c. Government and Institutional Records: Government agencies, research institutions, and


organizations often maintain databases or records that can be used for research purposes.

COMMON ISSUES IN DATA

Duplicate Data
Duplicate data arises when a system or database stores multiple variations of the same data
record or the same information. Common causes of data duplication include data being
re-imported multiple times, data not being properly decoupled in data integration processes,
and data arriving from multiple data sources.

Irrelevant Data
Many organizations believe that capturing and storing every customer’s data will benefit
them at a certain point in time. However, that is not necessarily the case: because the amount
of data is massive and not all of it is useful immediately, businesses may face the problem of
irrelevant data instead. Common mitigations include:

 Use filters to remove irrelevant data from large data sets.


 Select and use the right data resources that are related to the project.
 Use data visualization to highlight relevant patterns.

Unstructured Data
Unstructured data can be considered a data quality issue due to many factors. Because
unstructured data refers to any type that does not conform to a particular data structure or
model, such as text, audio, or images, it can be challenging for businesses to store and analyze.

Data Downtime
Data downtime refers to periods when data is not ready, unavailable, or inaccessible. When
data downtime occurs, organizations and customers lose the ability to connect to the
information they need, which frustrates audiences and leads to poor analytical results and
customer complaints.

Inconsistent data
Because data is gained from many different sources, mismatches in the same information
across sources are inevitable. This condition is collectively known as “inconsistent data.”
Data inconsistencies arise from many factors, such as manual data entry errors and
inefficient data management practices.

Inaccurate data
Inaccurate data is data that contains errors that affect its quality and reliability. Since it is a
fairly broad concept, other data quality issues such as incomplete, outdated, inconsistent, or
typographical errors and missing or incorrect values are also considered inaccurate data.

Hidden data
Enterprises extract and analyze data for operational efficiency. However, with today’s huge
amount of data, most organizations use only part of it. The remaining unused or missing data
in data silos is referred to as hidden data. More specifically, hidden data can be valuable but
unused information stored within other files or documents, or information invisible to
customers, such as metadata.
For instance, a company’s sales team has data on customers, while the customer service team
doesn’t. Without sharing the needed information, the company may lose an opportunity to
create more accurate and complete customer profiles.

Outdated Data
Collected data can become obsolete quickly, and the continual development and
modernization of human life inevitably leads to data decay. All information that is no longer
accurate or relevant in the current state is considered outdated data. Information about a
customer, such as name, address, or contact details, can change over time and silently
become outdated.

2. UNDERSTANDING DATA SCIENCE

Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer engineering to analyze large
amounts of data.

DATA SCIENCE LIFE CYCLE

Phase                                     Description
Identifying problems and                  Discovering the answers to basic questions, including
understanding business                    requirements, priorities and budget of the project.
Data collection                           Collecting data from relevant sources, either in
                                          structured or unstructured form.
Data processing                           Processing and fine-tuning the raw data; critical for
                                          the goodness of the overall project.
Data analysis                             Capturing ideas about solutions and factors that
                                          influence the data life cycle.
Data modelling                            Preparing the appropriate model to achieve the
                                          desired performance.
Model deployment                          Executing the analyzed model in the desired format
                                          and channel.

Real-world Applications of Data Science

1. In Search Engines
The most visible application of Data Science is search engines. When we want to search for
something on the internet, we mostly use search engines like Google and Yahoo. Data
Science is used to return relevant search results faster.

2. In Transport
Data Science has also entered real-time applications in the transport field, such as driverless
cars. With the help of driverless cars, it becomes easier to reduce the number of accidents.
For example, in driverless cars, training data is fed into the algorithm, and with the help of
Data Science techniques the data is analyzed: what the speed limit is on highways, busy
streets, narrow roads, and so on, and how to handle different situations while driving.

3. In Finance
Data Science plays a key role in financial industries, which constantly face fraud and the risk
of losses. Financial firms need to automate risk-of-loss analysis in order to make strategic
decisions for the company. They also use Data Science analytics tools to predict the future,
which allows companies to predict customer lifetime value and stock market moves.

4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user
experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions
similar to our past choices, as well as recommendations based on the most-bought,
most-rated, and most-searched products. This is all done with the help of Data Science.

5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
 Detecting tumors.
 Drug discovery.
 Medical image analysis.
 Virtual medical bots.
 Genetics and genomics.
 Predictive modeling for diagnosis, etc.

6. Image Recognition
Data Science is also used in image recognition. For example, when we upload a photo with a
friend on Facebook, Facebook suggests tags for the people in the picture. This is done with
the help of machine learning and Data Science.

7. Targeting Recommendation
Targeted recommendation is one of the most important applications of Data Science.
Whatever a user searches for on the internet, he or she will later see related posts
everywhere. This can be explained with an example: suppose I want a mobile phone, so I
search for it on Google and then change my mind and decide to buy offline. Data Science
helps the companies that pay for advertisements for that mobile: everywhere on the internet,
in social media, on websites, and in apps, I will see recommendations for the mobile phone I
searched for, which may nudge me to buy it online.

8. Airline Route Planning

With the help of Data Science, the airline sector is also growing: it becomes easier to predict
flight delays and to decide whether to fly directly to the destination or take a halt in between.
For example, a flight from Delhi to the U.S.A. can take a direct route or stop en route before
reaching the destination.

9. Data Science in Gaming

In most games where a user plays against a computer opponent, Data Science concepts are
used together with machine learning: with the help of past data, the computer improves its
performance. Many games, such as chess and EA Sports titles, use Data Science concepts.

10. Medicine and Drug Development

The process of creating a medicine is difficult and time-consuming, and it has to be done
with full discipline because it is a matter of someone’s life. Without Data Science, it takes a
lot of time, resources, and money to develop a new medicine or drug; with the help of Data
Science it becomes easier, because the probability of success can be estimated from
biological data and other factors. Data science algorithms can forecast how a compound will
react in the human body without lab experiments.

11. In Delivery Logistics

Various logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science
helps these companies find the best route for the shipment of their products, the best time for
delivery, the best mode of transport to reach the destination, and so on.

12. Autocomplete
The autocomplete feature is an important application of Data Science: the user types a few
letters or words and is offered completions for the rest of the line. In Gmail, when we write a
formal mail, the Data Science based autocomplete feature suggests an efficient way to
complete the whole sentence. Autocomplete is also widely used in search engines, social
media, and various apps.

3. SIGNIFICANCE OF EDA
Exploratory Data Analysis (EDA) is a fundamental and crucial step in data science
projects. As a data scientist, approximately 70% of your work revolves around conducting
EDA on your dataset. Let us delve into the significance of EDA:

1. Data Cleaning:
o EDA involves meticulously examining the data for errors, missing values, and
inconsistencies.
o Techniques such as data imputation, handling missing data, and outlier
detection are employed.
o Ensuring data quality is essential before proceeding to more formal statistical
analyses or modeling.
2. Descriptive Statistics:
o EDA utilizes descriptive statistics to understand key tendencies, variability,
and distributions of variables.
o Measures like mean, median, mode, standard deviation, range, and percentiles
provide valuable insights.
3. Data Visualization:
o Visual techniques play a crucial role in EDA.
o Histograms, box plots, scatter plots, line plots, heatmaps, and bar charts help
identify patterns, trends, and relationships within the data.
4. Feature Engineering:
o EDA explores various variables and their transformations to create new
features or derive meaningful insights.
o Techniques include scaling, normalization, binning, encoding categorical
variables, and creating interaction or derived variables.
5. Correlation and Relationships:
o EDA uncovers relationships and dependencies between variables.
o Correlation analysis, scatter plots, and cross-tabulations reveal the strength
and direction of associations.
6. Data Segmentation:
o EDA involves dividing data into meaningful segments based on specific
criteria or characteristics.
o Segmentation provides insights into distinct subgroups within the data.
7. Hypothesis Generation:
o EDA aids in formulating hypotheses or research questions based on initial data
exploration.
o It lays the foundation for further analysis and model building.
8. Data Quality Assessment:
o EDA assesses data quality, integrity, consistency, and accuracy.
o Ensuring data reliability is crucial for valid analysis.
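
To make the cleaning and descriptive-statistics steps concrete, here is a minimal pandas
sketch; the file name sales.csv and the price column are hypothetical stand-ins for your own
dataset:

import pandas as pd

# Hypothetical dataset: adjust the path and column name to your own data
df = pd.read_csv("sales.csv")

# Data cleaning: inspect missing values, drop duplicates, impute
print(df.isnull().sum())                                # missing values per column
df = df.drop_duplicates()                               # remove duplicate rows
df["price"] = df["price"].fillna(df["price"].median())  # simple median imputation

# Descriptive statistics: central tendency, spread, percentiles
print(df["price"].describe())

# Data quality assessment: flag outliers using the 1.5 * IQR rule
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers found")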

TYPES OF EXPLORATORY DATA ANALYSIS


Exploratory Data Analysis (EDA) is a crucial step in understanding and summarizing data
before diving into more complex modeling or hypothesis testing. Let us explore the different
types of EDA:

1. Univariate Non-graphical EDA:


o In this simplest form of analysis, we focus on a single variable to understand its
distribution and characteristics.
o Goals include identifying central tendency (mean, median, mode), spread
(standard deviation, variance), skewness, and kurtosis.
2. Univariate Graphical EDA:
o Here, we use visualizations to explore a single variable.
o Histograms, boxplots, and density plots help us understand the data distribution
and spot anomalies
3. Multivariate Non-graphical EDA:
o This technique examines the relationship between two or more variables.
o Cross-tabulation and summary statistics are commonly used.
o For categorical data, cross tabulation provides insights into associations between
variables
4. Multivariate Graphical EDA:
o Visualizations involving multiple variables fall into this category.
o Scatterplots, heatmaps, and parallel coordinates plots reveal patterns and
correlations among variables
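
The following short Python sketch touches each of the four types in turn; the DataFrame and
its columns (income, age, segment) are synthetic, invented purely for demonstration:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic example data (hypothetical columns, for demonstration only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(50000, 12000, 500),
    "age": rng.integers(18, 70, 500),
    "segment": rng.choice(["retail", "corporate"], 500),
})

# 1. Univariate non-graphical: central tendency, spread, skewness, kurtosis
print(df["income"].describe())
print("skew:", df["income"].skew(), "kurtosis:", df["income"].kurt())

# 2. Univariate graphical: distribution of a single variable
df["income"].plot.hist(bins=30, title="Income distribution")
plt.show()

# 3. Multivariate non-graphical: cross-tabulation and correlation
print(pd.crosstab(df["segment"], df["age"] > 40))
print(df[["income", "age"]].corr())

# 4. Multivariate graphical: relationship between two variables
df.plot.scatter(x="age", y="income", title="Age vs. income")
plt.show()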

4. MAKING SENSE OF DATA

Making sense of data involves several steps, each crucial for deriving meaningful insights.
Let us explore the key stages in the data analysis process:

1. Defining the Question:


o Begin by clearly defining your objective or problem statement.
2. Collecting the Data:
o Gather relevant data from various sources. This could involve surveys,
experiments, or accessing existing datasets.
3. Cleaning the Data:
o Data can be messy, containing errors, missing values, or inconsistencies. Clean
and preprocess the data by removing duplicates, handling missing values, and
standardizing formats.
4. Analyzing the Data:
o Dive into exploratory data analysis (EDA). Explore patterns, relationships, and
trends within the data. Use descriptive statistics, visualizations, and tools to
gain insights.
5. Sharing Your Results:
o Communicate your findings effectively. Present your insights through reports,
visualizations, or presentations.
6. Embracing Failure:
o Not all analyses lead to groundbreaking discoveries. Be open to failure and
learn from it.

Remember, data analysis is iterative, and each step informs the next.

5. COMPARING EDA WITH CLASSICAL AND BAYESIAN ANALYSIS

Let us delve into the comparison between Exploratory Data Analysis (EDA), Classical
Analysis, and Bayesian Analysis.

1. Classical Data Analysis:


o Problem Definition: The process begins by defining the problem or research
question.
o Data Collection: Relevant data is collected.
o Model Development: A model (deterministic or probabilistic) is constructed.
o Analysis: The data is analyzed using the chosen model.
o Conclusions: Conclusions are drawn based on the analysis results.
2. Exploratory Data Analysis (EDA):
o Problem Definition: Similar to classical analysis, we start with defining the
problem.
o Data Collection: Data is collected.
o Analysis: Here’s where EDA differs. Instead of imposing a model upfront,
EDA focuses on understanding the data itself. We explore its structure,
identify outliers, and create visualizations.
o Model Imposition: Unlike classical analysis, EDA does not impose
deterministic or probabilistic models on the data.
o Conclusion: conclusions are drawn based on insights gained from the data
explorations
3. Bayesian Data Analysis:
o Problem Definition: Begin with defining the problem.
o Data Collection: Collect relevant data.
o Model Development: Construct a model.
o Prior Distribution: The Bayesian approach incorporates prior probability
distribution knowledge into the analysis. Prior distribution expresses beliefs
about a quantity before considering evidence.
o Analysis: Analyze the data using the model and prior distribution.
o Conclusion: draw conclusions based on the Bayesian analysis

In summary:

 EDA emphasizes understanding the data and its characteristics.


 Classical analysis follows a structured sequence of problem definition, model
development, and analysis.
 Bayesian analysis incorporates prior knowledge into the analysis process.

Remember, each approach has its paradigms and is suitable for different types of data
analysis.
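
To make the idea of a prior concrete, here is a minimal sketch of one common Bayesian
update, a Beta prior combined with a Binomial likelihood; the experiment and the counts are
hypothetical:

from scipy import stats

# Prior belief about a success probability p: Beta(2, 2), mildly centered on 0.5
prior_a, prior_b = 2, 2

# Hypothetical observed evidence: 30 successes in 100 trials
successes, trials = 30, 100

# Because the Beta prior is conjugate to the Binomial likelihood, the
# posterior is simply Beta(prior_a + successes, prior_b + failures)
posterior = stats.beta(prior_a + successes, prior_b + (trials - successes))
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))

A classical analysis of the same data would report only the point estimate 30/100 = 0.3 and a
confidence interval; the Bayesian posterior blends the prior belief with the observed evidence.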

6. SOFTWARE TOOLS FOR EDA

Let us explore some popular Exploratory Data Analysis (EDA) tools for data analysis and
visualization:

1. Python:
o Description: Python is a versatile programming language widely used for data
analysis, machine learning, and scientific computing.
o Key Features:
 Pandas: A powerful library for data manipulation and analysis.
 Matplotlib and Seaborn: Used for creating visualizations.
 Jupyter Notebooks: Interactive environment for data exploration.
 NumPy: Provides support for numerical operations.
 SciPy: Useful for scientific computing.

2. R:
o Description: R is a statistical programming language and environment
specifically designed for data analysis and visualization.
o Key Features:
 Tidyverse: A collection of R packages for data manipulation and
visualization.
 ggplot2: A popular package for creating high-quality plots.
 dplyr: Used for data wrangling.
 Shiny: Enables interactive web applications.

3. Microsoft Excel:
o Description: Excel is a spreadsheet software widely used for data analysis,
reporting, and visualization.
o Key Features:
 Data Analysis ToolPak: Provides statistical functions.
 PivotTables: Useful for summarizing and analyzing data.
 Charts and Graphs: Allows visual representation of data.
 Formulas and Functions: Supports calculations.
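
As a rough sketch of how the Python stack described above fits together in practice (the data
below is synthetic, generated only for illustration):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data standing in for real measurements
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=200)

print(df.describe())                  # Pandas: quick numerical summary
sns.jointplot(data=df, x="x", y="y")  # Seaborn: distributions plus relationship
plt.show()                            # Matplotlib: render the figure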

7. VISUAL AIDS FOR EDA

Visual aids for EDA fall into four groups: univariate plots, bivariate plots, special-purpose
plots, and multivariate plots. Univariate plots focus on understanding the distribution of one
variable at a time.

1. Univariate Plots:

1. Histogram: A histogram displays the frequency distribution of a continuous variable.


It divides the data into bins and shows how many observations fall into each bin. For
instance, if we analyze the weight of individuals, a histogram would show the
distribution of weights across different ranges.
2. Bar Chart: Bar charts are suitable for categorical or discrete variables. They
represent the frequency or count of each category. For example, if we have data on
movie ratings (such as “Good,” “Above average,” “Average,” and “Bad”), a bar chart
would illustrate the distribution of ratings.
3. Box Plot (Box-and-Whisker Plot): Box plots summarize the distribution of a
continuous variable. They display the median, quartiles, and potential outliers.
Suppose we analyze income levels—a box plot would reveal the central tendency and
spread of income data.
4. Density Plot (Kernel Density Plot): Density plots estimate the probability density
function of a continuous variable. They provide insights into the shape of the
distribution. If we examine age, a density plot would show how age values are
distributed.
5. Scatter Plot: Although commonly used for bivariate analysis, a scatter plot can also be
used for a single variable by plotting its values against the observation index.

2. Bivariate Plots:

o Scatter Plot: A scatter plot displays the relationship between two continuous
variables. It helps identify trends, correlations, and outliers.
o Bivariate Box Plot: This plot compares the distribution of a numerical
variable across different categories.
o Mosaic Plot: Useful for visualizing associations between categorical
variables.
o Pair Plot: A matrix of scatter plots showing pairwise relationships between
multiple variables.

3. Special-Purpose Plots:

o Histogram: Shows the distribution of a single variable.


o Box Plot: Visualizes the summary statistics (median, quartiles, and outliers) of
a numerical variable.
o Violin Plot: Combines a box plot with a kernel density estimate to show the
distribution of data.
o Contour Plot: Represents three-dimensional data on a two-dimensional plane
using contour lines.

4. Multivariate Plots:

o Heatmap: Displays the correlation between multiple variables using color


intensity.
o Pair Plot (again): Shows pairwise scatter plots for multiple variables.
o 3D Scatter Plot: Visualizes relationships among three continuous variables.
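
The sketch below draws three of the plots described above with Matplotlib; the weight and
height values are randomly generated stand-ins for real measurements:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: weights (kg) and a related height variable (cm)
rng = np.random.default_rng(1)
weights = rng.normal(70, 12, 300)
heights = 100 + 0.9 * weights + rng.normal(0, 5, 300)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: frequency distribution of a single continuous variable
axes[0].hist(weights, bins=25)
axes[0].set_title("Histogram of weight")

# Box plot: median, quartiles, and potential outliers
axes[1].boxplot(weights)
axes[1].set_title("Box plot of weight")

# Scatter plot: relationship between two continuous variables
axes[2].scatter(weights, heights, s=10)
axes[2].set_title("Weight vs. height")

plt.tight_layout()
plt.show()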

8. DATA TRANSFORMATION TECHNIQUES

Data transformation techniques play a crucial role in preparing raw data for analysis. Let us
explore some common techniques:

1. Data Smoothing: This technique involves applying algorithms to remove noise from
your dataset, making it easier to identify trends. There are three types of algorithms
for data smoothing:
o Clustering: Group similar values together and label any value outside the
cluster as an outlier.
o Binning: Split data into bins and smooth the data value within each bin.
o Regression: identify relationships between attributes and predict one
attribute based on the values of another.
2. Attribute Construction (Feature Construction): This technique creates new
features from existing attributes in the dataset. It is commonly used in data
transformation pipelines.
3. Data Generalization: Generalization simplifies data by aggregating it into higher-
level categories.
4. Data Aggregation: Aggregation combines multiple data points into summary
statistics. For example, calculating average sales per month from daily sales data.
5. Data Discretization: Discretization converts continuous data into discrete categories.
For example, grouping income levels into income brackets (e.g., low, medium, high).
6. Data Normalization: Normalization scales data to a common range. It ensures that
different attributes have comparable values. Techniques like min-max scaling and
z-score normalization are commonly used.
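
A short pandas sketch of three of these techniques; the sales figures are invented, with the
spike at 120 playing the role of noise:

import pandas as pd

# Hypothetical daily sales figures with one noisy spike
s = pd.Series([12, 15, 14, 120, 13, 16, 15, 14], name="sales")

# Smoothing: a centered 3-day rolling median dampens the spike
print(s.rolling(3, center=True).median())

# Discretization: continuous values into labelled brackets
print(pd.cut(s, bins=[0, 20, 60, 200], labels=["low", "medium", "high"]))

# Normalization: min-max scaling to [0, 1], then z-score standardization
print((s - s.min()) / (s.max() - s.min()))
print((s - s.mean()) / s.std())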

9. MERGING DATABASES (USING THE PANDAS LIBRARY)

When working with data in Python, pandas provides powerful tools for combining and
merging datasets. Let us explore how you can achieve this using pandas:

1. merge(): This function is like a database join. It allows you to combine data based
on common columns or indices. You can perform both many-to-one and many-to-
many joins. In a many-to-one join, one dataset has repeated values in the merge
column, while the other does not. In a many-to-many join, both datasets have
repeated values in the merge column.
2. join(): Use this method when you want to combine data based on a key column or
an index. It is particularly useful for combining data from different sources.
3. concat(): This function is used for combining DataFrames across rows or
columns. It is handy when you want to stack datasets vertically or horizontally.
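
A minimal sketch of the three combining operations on two hypothetical tables:

import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Ben", "Cai"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3], "amount": [250, 100, 75]})

# merge(): database-style join on a common column (many-to-one here)
print(customers.merge(orders, on="cust_id", how="left"))

# join(): combine two frames on their indices
print(orders.set_index("cust_id").join(customers.set_index("cust_id")))

# concat(): stack frames vertically (rows) or horizontally (columns)
more_orders = pd.DataFrame({"cust_id": [2], "amount": [40]})
print(pd.concat([orders, more_orders], ignore_index=True))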

RESHAPING AND PIVOTING

Let us explore how to reshape data using the melt() and pivot() methods in pandas.

1. melt() Method:
o The melt() method is used to reshape a DataFrame from a wide format to a
long format.
o It essentially “unpivots” the data, converting columns into rows.
o You can specify which columns should remain as-is (identifier variables) and
which columns should be melted into a single “variable” column (measured
variables).
o Here is an example using a sample DataFrame:

import pandas as pd

# Sample wide-format data: one row per date, one column per variable
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-03", "2020-01-04", "2020-01-05"]),
    "A": [0, 1, 2],
    "B": [3, 4, 5],
    "C": [6, 7, 8],
    "D": [9, 10, 11],
})

# Reshape from wide to long: each (date, variable) pair becomes one row
df_melted = df.melt(id_vars=["date"], value_vars=["A", "B", "C", "D"])
print(df_melted.head(10))

2. pivot() Method:
o The pivot() method is used to reshape data from a long format to a wide
format.
o It is particularly useful for time series operations where you want unique
variables as columns and dates as indices.
o Here is an example using the melted DataFrame from above:

# Reshape back from long to wide: dates as the index, variables as columns
pivoted = df_melted.pivot(index="date", columns="variable", values="value")
print(pivoted)

If you omit the values argument, the resulting DataFrame will have hierarchical
columns with the respective value column names.

You can then select subsets from the pivoted DataFrame as needed.

Remember that pivot() can only handle unique rows specified by index and columns.

10. TRANSFORMATION TECHNIQUES

Let us delve into data transformation techniques: grouping, aggregation, pivot tables, and
cross-tabulation.

1. Grouping Data Sets:


o Grouping involves categorizing data based on specific criteria. It’s useful for
aggregating information within subsets. For instance, you can group sales data
by product category or customer segment.
o In Python, the pandas library provides powerful tools for grouping data using
the groupby() function. You can then apply various aggregation functions to
the grouped data.
2. Data Aggregation:
o Aggregation summarizes data by calculating statistics like mean, sum, count,
or standard deviation. It condenses large datasets into more manageable forms.
o For example, you can aggregate sales data to find the total revenue per month
or the average order value.
o In pandas, you can use functions like sum(), mean(), or count() to perform
aggregation.
3. Pivot Tables:
o Pivot tables allow you to reorganize and manipulate data in a spreadsheet.
They are particularly useful for summarizing and analyzing large datasets.
o With pivot tables, you can group, sort, filter, and analyze data based on
multiple criteria. They are flexible and customizable.
o In Python, pandas provides a robust implementation of pivot tables.
4. Cross-Tabulation (Crosstabs):
o Crosstabs summarize the relationship between two categorical variables. They
display the frequency of observations falling into each combination of
categories for the two variables.
o For instance, you can create a crosstab to explore patterns between gender and
income levels or age and education.
o In Excel, you can use pivot tables to create crosstabs, which help identify
patterns and test hypotheses.

Remember that these techniques serve different purposes:

 Grouping helps organize data into meaningful subsets.


 Aggregation summarizes data for analysis.
 Pivot tables provide flexibility in data manipulation.
 Crosstabs reveal relationships between categorical variables.
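
The sketch below ties the four techniques together in pandas on a small, made-up sales table:

import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100, 150, 90, 110, 130],
})

# Grouping + aggregation: total and average revenue per region
print(sales.groupby("region")["revenue"].agg(["sum", "mean"]))

# Pivot table: regions as rows, products as columns, summed revenue as values
print(pd.pivot_table(sales, index="region", columns="product",
                     values="revenue", aggfunc="sum"))

# Cross-tabulation: frequency of observations for each (region, product) pair
print(pd.crosstab(sales["region"], sales["product"]))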

---------------------------------------------------------------------------------------
PART – A
1. What is the primary purpose of data exploration in the context of data analysis?
2. Name two common plots used for visualizing the distribution of a single variable.
3. How can a scatter plot be used to identify relationships between two variables?
4. What is the role of a box plot in data visualization, and what key statistics does it
display?
5. Explain the significance of using a correlation matrix in data exploration.
6. What does a heatmap represent, and when is it most useful?
7. Why is it important to visualize outliers in a dataset, and which plots can be used for
this purpose?
8. How can you use histograms to assess the skewness of a dataset?
9. What is the advantage of using pair plots for exploring relationships in a multi-
variable dataset?
10. How does Seaborn enhance the process of visualizing categorical data compared to
matplotlib?

PART – B&C

1. Explain the various stages in EDA.
2. Write down the steps in EDA and explain them.
3. Explain the software tools available for EDA.
4. Elaborate in detail the visual aids for EDA.
5. Elucidate the different transformation techniques in EDA.
6. Define pivot table and cross-tabulation, and explain.
