Unit 1
PRESENTER'S MANUAL
This semester is based on a teaching methodology built around three cardinal components:
1. Concept Class
2. Lab Manual
3. Directed Learning Class
Concept class: This is a theory class that will focus on the concepts. Whenever required, this session will also demonstrate how these concepts apply to data exploration and visualization topics.
Lab Manual: The lab manual follows the concept class. The students learn to implement the
concepts learnt in the concept class.
Directed Learning Class: Learning and application may be challenging for some students. One of the oldest and most comprehensive ways of delivering information, the self-directed class allows students to apply themselves in a manner that makes the content more accessible. In this process, learners take the initiative in their own learning by planning, implementing, and evaluating it.
Concept focused
Adapted to real life work environment
Introduction to the Course
This Presenter's Manual is to be used in the fifth semester of Artificial Intelligence & Data Science for the course Data Exploration and Visualization. The syllabus of this course enables the students to understand and learn the importance of data exploration and visualization.
Course Objectives:
To enable the students to:
Importing Matplotlib – Simple line plots – Simple scatter plots – Visualizing errors – Density and contour plots – Histograms – Legends – Colors – Subplots – Text and annotation – Customization – Three-dimensional plotting – Geographic data with Basemap – Visualization with Seaborn.
TOTAL: 45 PERIODS
TEXT BOOKS:
1. Suresh Kumar Mukhiya, Usman Ahmed, "Hands-On Exploratory Data Analysis with Python", Packt Publishing, 2020.
2. Jake VanderPlas, "Python Data Science Handbook: Essential Tools for Working with Data", O'Reilly, 1st Edition, 2016.
3. Catherine Marsh, Jane Elliott, "Exploring Data: An Introduction to Data Analysis for Social Scientists", Wiley Publications, 2nd Edition, 2008.
REFERENCE BOOKS:
1. Eric Pimpler, "Data Visualization and Exploration with R", Geospatial Training Services, 2017.
2. Claus O. Wilke, "Fundamentals of Data Visualization", O'Reilly Publications, 2019.
3. Matthew O. Ward, Georges Grinstein, Daniel Keim, "Interactive Data Visualization: Foundations, Techniques, and Applications", 2nd Edition, CRC Press, 2015.
Mapping of COs with POs and PSOs
CO   | PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 | PSO1 PSO2 PSO3
CO1  |  3   1   3   3   -   -   -   -   2   3    3    3   |  2    2    2
CO2  |  2   2   2   1   1   -   -   -   3   2    3    1   |  3    1    3
CO3  |  2   1   2   1   1   -   -   -   3   2    1    2   |  2    2    1
CO4  |  2   2   2   1   -   -   -   -   1   2    1    3   |  1    3    2
CO5  |  3   1   1   2   1   -   -   -   3   2    1    2   |  2    2    3
AVG  |  2   1   2   2   1   -   -   -   2   2    2    2   |  2    2    2
Unit - 1 7 2 9
Unit - 2 7 2 9
Unit - 3 7 2 10
Unit - 4 6 2 8
Unit - 5 7 2 9
Total 34 10 45
I. CONCEPT CLASS
LESSON PLAN
S.No. | Topics to be covered | Method | Proposed Date | Lecture Notes Page No. | Ref | Teaching Method
1. | EDA fundamentals | CC | 15.07.24 | 1-2 | T1 | CB/L
2. | Understanding data science – Significance of EDA | CC | 16.07.24 | 3-7 | T1 | CB/L
3. | Making sense of data – Comparing EDA with classical and Bayesian analysis | CC | 18.07.24 | 8-11 | T1 | CB/L
4. | Software tools for EDA | CC | 19.07.24 | 11-12 | T1 | CB/L
5. | Visual Aids for EDA | CC | 20.07.24 | 13-15 | T1 | CB/L
6. | Data transformation techniques | CC | 22.07.24 | 16-18 | T1 | CB/L
7. | Transformation techniques - Pivot tables and cross-tabulations
1. Attendance (2 Mins)
5. Content (35 Mins): Exploratory Data Analysis (EDA) is an analysis approach that identifies general patterns in the data.
Remarks:
Faculty Incharge
1. Attendance (2 Mins)
4. Objective (1 Min): Understand the concept of data science and the significance of EDA.
5. Content (35 Mins): Data science is the study of data to extract meaningful insights for business. It helps to look at the data before making any assumptions.
6. Questions by Students (3 Mins)
Remarks:
Faculty Incharge
Topic covered: Making sense of data, Comparing EDA with classical and Bayesian analysis.
1. Attendance (2 Mins)
Remarks:
Faculty Incharge
1. Attendance (2 Mins)
3. Revision (7 Mins): Making sense of data, Comparing EDA with classical and Bayesian analysis.
4. Objective (1 Min): Able to understand the software tools for exploratory data analysis.
5. Content (35 Mins): Tools required for exploratory data analysis: R, Python, Excel.
6. Questions by Students (3 Mins)
Faculty Incharge
1. Attendance (2 Mins)
3. Revision (7 Mins): Making sense of data, Comparing EDA with classical and Bayesian analysis.
5. Content (35 Mins): Using visual tools such as box plots, scatter plots, and histograms, EDA aids in identifying underlying patterns and relationships within the data.
6. Questions by Students (3 Mins)
Remarks:
Faculty Incharge
Topic covered: Transformation techniques - Pivot tables and cross-tabulations.
1. Attendance (2 Mins)
5. Content (35 Mins): Crosstabs are used for categorical data, while pivot tables can be used for both categorical and numerical data.
8. Outcome (1 Min): The student should be able to understand the concept of pivot tables and cross-tabulations.
Remarks:
Faculty Incharge
Technical Terms
4. Data science: The scientific analysis of large amounts of information held on computers. Data science is the study of data to extract meaningful insights for business; it is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data.
5. Data transformation: Transformations typically involve converting a raw data source into a cleansed, validated, and ready-to-use format. Data transformation is the process of converting data from one format, such as a database file, XML document, or Excel spreadsheet, into another.
6. Data smoothing: A statistical approach of eliminating outliers from datasets to make the patterns more noticeable.
7. Data generalization: The process of generating summary data with successive layers for a dataset. Its purpose is to hide the characteristics of an individual within its group, so that an adversary cannot distinguish that individual from its peers.
8. Data aggregation: Any process whereby raw data is gathered and expressed in a summary form for statistical analysis. Raw data can be aggregated over a given time period to provide statistics such as average, minimum, maximum, sum, and count.
9. Data normalization: The process of reorganizing data within a database so that users can utilize it for further queries and analysis. Simply put, it is the process of developing clean data, which includes eliminating redundant and unstructured data and making the data appear similar across all records and fields.
10. Data merging: The process of combining two or more data sets into a single, unified database. It involves adding new details to existing data, appending cases, and removing any duplicate or incorrect information to ensure that the data at hand is comprehensive, complete, and accurate.
UNIT I
EXPLORATORY DATA ANALYSIS
CONTENTS
1. EDA Fundamentals
2. Understanding data science
3. Significance of EDA
4. Making sense of Data
5. Comparing EDA with classical and Bayesian Analysis
6. Software tools for EDA
7. Visual aids for EDA
8. Data transformation techniques
9. Merging databases, Reshaping and Pivoting
10. Transformation Techniques
10.1 Grouping datasets
10.2 Data aggregation
10.3 Pivot tables and Cross Tabulations
1. EDA FUNDAMENTALS
Exploratory Data Analysis (EDA) is a vital step in the process of understanding and
analyzing data. It serves as the foundation stone for any data analysis project, providing
valuable insights and revealing the true nature of the data. EDA can be compared to an
investigation carried out by a detective, where digging deep into piles of data helps uncover
clues that aid in the actual data analysis.
Data refers to a collection of facts. These facts can take the form of numbers, words,
observations, or descriptions. However, data on its own does not carry any meaning or
context. It is simply a raw representation of information.
Information, on the other hand, is how we interpret and understand the facts within a specific
context. It is the structured or organized form of data that conveys a logical meaning. For
example, let's consider the following unorganized data: "blue", "a", "Rich", "car", "bought".
Individually, these words do not hold much significance. However, when we structure and organize the data, we can derive meaningful information. For instance, "Rich bought a blue car" conveys a complete thought and provides useful information.
While data and information are closely related, there are key differences between the two:
DATA COLLECTION
Data collection is the process of collecting and evaluating information or data from multiple
sources to find answers to research problems, answer questions, evaluate outcomes, and
forecast trends and probabilities. It is an essential phase in all types of research, analysis, and
decision-making, including that done in the social sciences, business, and healthcare.
b. Interviews: Interviews involve direct interaction between the researcher and the
respondent. They can be conducted in person, over the phone, or through video conferencing.
Interviews can be structured (with predefined questions), semi-structured (allowing flexibility), or unstructured (more conversational).
c. Observations: Researchers observe and record behaviors, actions, or events in their natural
setting. This method is useful for gathering data on human behavior
Secondary data collection involves using existing data collected by someone else for a
purpose different from the original intent. Researchers analyze and interpret this data to
extract relevant information. Secondary data can be obtained from various sources, including:
b. Online Databases: Numerous online databases provide access to a wide range of secondary
data, such as research articles, statistical information, economic data, and social surveys.
Duplicate Data
Duplicate data arises when a system or database stores multiple variations of the same data record or the same information. Common causes of data duplication include data being re-imported multiple times, data not being properly decoupled in data integration processes, and gathering data from multiple sources.
Irrelevant Data
Many organizations believe that capturing and storing every customer’s data will benefit
them at a certain point in time. However, that's not necessarily the case. Because the amount of data is massive and not all of it is immediately useful, businesses may face the irrelevant-data quality issue instead.
Unstructured Data
Unstructured data can be considered a data quality issue due to many factors. Because unstructured data refers to any data that does not conform to a particular data structure or model, such as text, audio, or images, it can be challenging for businesses to store and analyze.
Data Downtime
Data downtime refers to the period when data is not ready, unavailable, or inaccessible. When data downtime occurs, organizations and customers lose the ability to access the information they need, which leads to poor analytical results and customer complaints.
Inconsistent data
Because data is gathered from many different sources, mismatches in the same information across sources are inevitable. This condition is collectively known as "inconsistent data." Data inconsistencies arise from many factors, such as manual data-entry errors and inefficient data management practices.
Inaccurate data
Inaccurate data is data that contains errors that affect its quality and reliability. Since it is a fairly broad concept, other data quality issues, such as incomplete, outdated, or inconsistent records, typographical errors, and missing or incorrect values, are also considered forms of inaccurate data.
Hidden data
Enterprises extract and analyze data for operational efficiency. However, with today's huge amount of data, most organizations use only part of it. The remaining unused or missing data in data silos is referred to as hidden data. More specifically, hidden data can be valuable but unused information stored within other files or documents, or information invisible to customers, such as metadata.
For instance, a company’s sales team has data on customers, while the customer service team
doesn’t. Without sharing the needed information, the company may lose an opportunity to
create more accurate and complete customer profiles.
Outdated Data
Collected data can become obsolete quickly, and the continual development and modernization of human life inevitably leads to data decay. All information that is no longer accurate or relevant in the current state is considered outdated data. Information about a customer, such as name, address, or contact details, can change over time and become outdated.
2. UNDERSTANDING DATA SCIENCE
Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer engineering to analyze large
amounts of data.
Phases and their descriptions:
1. Identifying problems and understanding business: Discovering the answers to basic questions, including the requirements, priorities, and budget of the project.
2. Data processing: Processing and fine-tuning the raw data, critical for the goodness of the overall project.
3. Data analysis: Capturing ideas about solutions and factors that influence the data life cycle.
4. Model deployment: Executing the analyzed model in the desired format and channel.
Applications of Data Science
1. In Search Engines
The most visible application of data science is in search engines. When we want to search for something on the internet, we mostly use search engines such as Google and Yahoo, and data science is used to return relevant results faster.
2. In Transport
Data science has also entered real-time domains such as transport, for example, driverless cars. With the help of driverless cars, it is easier to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm, and with the help of data science techniques the data is analyzed: what the speed limit is on highways, busy streets, and narrow roads, and how to handle different situations while driving.
3. In Finance
Data science plays a key role in the financial industry, which constantly faces problems of fraud and risk of losses. Financial firms therefore need to automate risk-of-loss analysis in order to make strategic decisions for the company. They also use data science analytics tools to predict the future, which allows them to estimate customer lifetime value and anticipate stock market moves.
4. In E-Commerce
E-commerce websites such as Amazon and Flipkart use data science to create a better user experience through personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions based on our past behaviour, as well as recommendations for the most-bought, most-rated, and most-searched products. This is all done with the help of data science.
5. In Health Care
In the healthcare industry, data science acts as a boon. Data science is used for:
Detecting tumors.
Drug discoveries.
Medical Image Analysis.
Virtual Medical Bots.
Genetics and Genomics.
Predictive Modeling for Diagnosis etc.
6. Image Recognition
Currently, data science is also used in image recognition. For example, when we upload a photo with a friend on Facebook, Facebook suggests tags for the people in the picture. This is done with the help of machine learning and data science.
7. Targeting Recommendation
Targeted recommendation is one of the most important applications of data science. Whatever the user searches for on the internet, he or she will then see related posts everywhere. For example, suppose I want a mobile phone: I search for it on Google and then change my mind and decide to buy it offline. Data science helps the companies that pay to advertise that phone, so everywhere on the internet, on social media, on websites, and in apps, I will see recommendations for the phone I searched for, which nudges me to buy it online.
8. Autocomplete
The autocomplete feature is an important application of data science: the user types just a few letters or words, and the system completes the rest of the line. In Gmail, for example, when we are writing a formal mail, the data science based autocomplete feature suggests an efficient way to complete the whole line. Autocomplete is also widely used in search engines, in social media, and in various apps.
3. SIGNIFICANCE OF EDA
Exploratory Data Analysis (EDA) is a fundamental and crucial step in data science
projects. As a data scientist, approximately 70% of your work revolves around conducting
EDA on your dataset. Let’s delve into the significance of EDA:
1. Data Cleaning:
o EDA involves meticulously examining the data for errors, missing values, and
inconsistencies.
o Techniques such as data imputation, handling missing data, and outlier
detection are employed.
o Ensuring data quality is essential before proceeding to more formal statistical
analyses or modeling.
2. Descriptive Statistics:
o EDA utilizes descriptive statistics to understand key tendencies, variability,
and distributions of variables.
o Measures like mean, median, mode, standard deviation, range, and percentiles
provide valuable insights.
3. Data Visualization
o Visual techniques play a crucial role in EDA.
o Histograms, box plots, scatter plots, line plots, heatmaps, and bar charts help
identify patterns, trends, and relationships within the data.
4. Feature Engineering:
o EDA explores various variables and their transformations to create new
features or derive meaningful insights.
o Techniques include scaling, normalization, binning, encoding categorical
variables, and creating interaction or derived variables.
5. Correlation and Relationships:
o EDA uncovers relationships and dependencies between variables.
o Correlation analysis, scatter plots, and cross-tabulations reveal the strength
and direction of associations.
6. Data Segmentation:
o EDA involves dividing data into meaningful segments based on specific
criteria or characteristics.
o Segmentation provides insights into distinct subgroups within the data.
7. Hypothesis Generation:
o EDA aids in formulating hypotheses or research questions based on initial data
exploration.
o It lays the foundation for further analysis and model building.
8. Data Quality Assessment:
o EDA assesses data quality, integrity, consistency, and accuracy.
o Ensuring data reliability is crucial for valid analysis.
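A minimal sketch of a few of these steps using pandas; the small dataset and its column names here are hypothetical and used only for illustration:

import pandas as pd
import numpy as np

# Hypothetical data containing a missing value and an extreme value
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 29],
    "income": [30000, 42000, 58000, 61000, 250000, 39000],
})

# Data cleaning: count missing values per column, then impute with the median
print(df.isnull().sum())
df["age"] = df["age"].fillna(df["age"].median())

# Descriptive statistics: mean, standard deviation, min/max, and percentiles
print(df.describe())

# Correlation: strength and direction of the association between the variables
print(df.corr())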
4. MAKING SENSE OF DATA
Making sense of data involves several steps, each crucial for deriving meaningful insights. Remember, data analysis is iterative, and each step informs the next.
5. COMPARING EDA WITH CLASSICAL AND BAYESIAN ANALYSIS
Let's delve into the comparison between Exploratory Data Analysis (EDA), classical analysis, and Bayesian analysis. The three approaches differ mainly in the order of their steps: classical analysis imposes a model on the data and then analyzes it (problem, data, model, analysis, conclusions); EDA analyzes the data first and lets it suggest an appropriate model (problem, data, analysis, model, conclusions); and Bayesian analysis combines the data with prior knowledge expressed as a prior distribution (problem, data, model, prior distribution, analysis, conclusions).
In summary, each approach has its own paradigm and is suitable for different types of data analysis.
6. SOFTWARE TOOLS FOR EDA
Let's explore some popular Exploratory Data Analysis (EDA) tools for data analysis and visualization:
1. Python:
o Description: Python is a versatile programming language widely used for data
analysis, machine learning, and scientific computing.
o Key Features:
Pandas: A powerful library for data manipulation and analysis.
Matplotlib and Seaborn: Used for creating visualizations.
Jupyter Notebooks: Interactive environment for data exploration.
NumPy: Provides support for numerical operations.
SciPy: Useful for scientific computing.
2. R:
o Description: R is a statistical programming language and environment
specifically designed for data analysis and visualization.
o Key Features:
Tidyverse: A collection of R packages for data manipulation and
visualization.
ggplot2: A popular package for creating high-quality plots.
dplyr: Used for data wrangling.
Shiny: Enables interactive web applications.
3. Microsoft Excel:
o Description: Excel is a spreadsheet software widely used for data analysis,
reporting, and visualization.
o Key Features:
Data Analysis ToolPak: Provides statistical functions.
PivotTables: Useful for summarizing and analyzing data.
Charts and Graphs: Allows visual representation of data.
Formulas and Functions: Supports calculations.
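As a brief illustration of the Python stack listed above, the sketch below loads a dataset and takes a first look at it; the file name sales.csv and its amount column are hypothetical:

import pandas as pd
import matplotlib.pyplot as plt

# Load a hypothetical CSV file into a DataFrame (pandas)
df = pd.read_csv("sales.csv")

# First look at the data: shape, column types, and summary statistics
print(df.shape)
df.info()
print(df.describe())

# Quick visual check of one hypothetical numeric column (Matplotlib)
df["amount"].hist(bins=20)
plt.xlabel("amount")
plt.show()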
7. VISUAL AIDS FOR EDA
Univariate plots, bivariate plots, special plots, and multivariate plots each serve a different purpose: a univariate plot focuses on understanding the distribution of one variable at a time, while the others examine relationships between two or more variables.
1. Univariate Plots:
o Histogram: Shows the frequency distribution of a single variable.
o Box Plot: Summarizes a single variable through its quartiles and highlights outliers.
2.Bivariate Plots:
o Scatter Plot: A scatter plot displays the relationship between two continuous
variables. It helps identify trends, correlations, and outliers.
o Bivariate Box Plot: This plot compares the distribution of a numerical
variable across different categories.
o Mosaic Plot: Useful for visualizing associations between categorical
variables.
o Pair Plot: A matrix of scatter plots showing pairwise relationships between
multiple variables.
4. Multivariate Plots:
o Heatmap: Visualizes a matrix of values, such as a correlation matrix, using colour intensity so that patterns across many variables are easy to spot.
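A minimal sketch of several of these plot types, assuming seaborn and Matplotlib are installed and using seaborn's bundled tips sample dataset (fetched on first use):

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")  # small sample dataset bundled with seaborn

# Univariate: histogram of a single numerical variable
sns.histplot(tips["total_bill"])
plt.show()

# Univariate: box plot summarizing quartiles and outliers
sns.boxplot(x=tips["total_bill"])
plt.show()

# Bivariate: scatter plot of two continuous variables
sns.scatterplot(data=tips, x="total_bill", y="tip")
plt.show()

# Multivariate: pair plot of all numeric columns, then a correlation heatmap
sns.pairplot(tips)
plt.show()
sns.heatmap(tips.select_dtypes("number").corr(), annot=True)
plt.show()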
8. DATA TRANSFORMATION TECHNIQUES
Data transformation techniques play a crucial role in preparing raw data for analysis. Let us explore some common techniques:
1. Data Smoothing: This technique involves applying algorithms to remove noise from
your dataset, making it easier to identify trends. There are three types of algorithms
for data smoothing:
o Clustering: Group similar values together and label any value outside the
cluster as an outlier.
o Binning: Split data into bins and smooth the data value within each bin.
o Regression: Identify the relationship between attributes and predict the value of one attribute based on the values of another.
2. Attribute Construction (Feature Construction): This technique creates new features from existing attributes in the dataset. It is commonly used in data transformation pipelines.
3. Data Generalization: Generalization simplifies data by aggregating it into higher-
level categories.
4. Data Aggregation: Aggregation combines multiple data points into summary statistics, for example, calculating average sales per month from daily sales data.
5. Data Discretization: Discretization converts continuous data into discrete categories, for example, grouping income levels into income brackets (e.g. low, medium, high).
6. Data Normalization: Normalization scales data to a common range. It ensures that different attributes have comparable values. Techniques like min-max scaling and z-score normalization are commonly used.
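A minimal sketch of a few of these techniques with pandas; the income figures and column names are hypothetical:

import pandas as pd

# Hypothetical income figures recorded across three months
sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "income": [18000, 52000, 95000, 40000, 67000, 23000],
})

# Smoothing: a simple rolling mean reduces noise in the raw values
sales["income_smooth"] = sales["income"].rolling(window=2, min_periods=1).mean()

# Discretization: bin income into low/medium/high brackets
sales["bracket"] = pd.cut(sales["income"], bins=3, labels=["low", "medium", "high"])

# Aggregation: average income per month
print(sales.groupby("month")["income"].mean())

# Normalization: min-max scaling to [0, 1] and z-score standardization
sales["income_minmax"] = (sales["income"] - sales["income"].min()) / (
    sales["income"].max() - sales["income"].min()
)
sales["income_zscore"] = (sales["income"] - sales["income"].mean()) / sales["income"].std()
print(sales)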
9. MERGING DATABASES, RESHAPING AND PIVOTING
When working with data in Python, pandas provides powerful tools for combining and merging datasets. Let us explore how you can achieve this using pandas:
1. merge(): This function works like a database join. It allows you to combine data based on common columns or indices. You can perform both many-to-one and many-to-many joins. In a many-to-one join, one dataset has repeated values in the merge column while the other does not; in a many-to-many join, both datasets have repeated values in the merge columns.
2. join(): Use this method when you want to combine data based on a key column or an index. It is particularly useful for combining data from different sources.
3. concat(): This function is used for combining DataFrames across rows or columns. It is handy when you want to stack datasets vertically or horizontally.
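A minimal sketch of these three operations; the customer and order tables are hypothetical:

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meena"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [250, 125, 400],
})

# merge(): a many-to-one join on the common "customer_id" column
merged = pd.merge(orders, customers, on="customer_id", how="left")

# join(): combine a column of one frame with the index of another
profiles = customers.set_index("customer_id")
joined = orders.join(profiles, on="customer_id")

# concat(): stack DataFrames vertically (rows) or horizontally (columns)
more_orders = pd.DataFrame({"customer_id": [3], "amount": [90]})
stacked = pd.concat([orders, more_orders], ignore_index=True)

print(merged, joined, stacked, sep="\n\n")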
Let us explore how to reshape data using the melt() and pivot() methods in pandas.
1. melt() Method:
o The melt() method is used to reshape a DataFrame from a wide format to a long format.
o It essentially "unpivots" the data, converting columns into rows.
o You can specify which columns should remain as-is (identifier variables) and which columns should be melted into a single "variable" column (measured variables).
o Here's an example using a sample DataFrame in wide format:

import pandas as pd

# Sample data in wide format: one column per variable
data = {
    "date": pd.to_datetime(["2020-01-03", "2020-01-04", "2020-01-05"]),
    "A": [0, 1, 2],
    "B": [3, 4, 5],
    "C": [6, 7, 8],
    "D": [9, 10, 11],
}
df = pd.DataFrame(data)

# Reshape from wide to long: "date" stays as the identifier variable,
# while columns A-D are melted into "variable"/"value" pairs
df_melted = df.melt(id_vars=["date"], value_vars=["A", "B", "C", "D"])
print(df_melted.head(10))
2. pivot() Method:
o The pivot() method is used to reshape data from a long format to a wide
format.
o It’s particularly useful for time series operations where you want unique
variables as columns and dates as indices.
o Here's an example using the melted DataFrame from the previous step:
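A minimal sketch (an illustration, not the textbook's original listing), continuing from the df_melted frame built in the melt() example above:

# Pivot back to wide format: unique dates become the index,
# the variable names become columns, and the values fill the table
df_wide = df_melted.pivot(index="date", columns="variable", values="value")
print(df_wide)

# Omitting the values argument keeps every remaining value column and
# produces hierarchical columns; subsets can then be selected as needed
df_hier = df_melted.pivot(index="date", columns="variable")
print(df_hier["value"]["A"])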
If you omit the values argument, the resulting DataFrame will have hierarchical
columns with the respective value column names.
You can then select subsets from the pivoted DataFrame as needed.
Remember that pivot() can only handle data in which each index/columns combination identifies a unique row.
10. TRANSFORMATION TECHNIQUES
Let us delve into grouping datasets, data aggregation, pivot tables, and cross-tabulation.
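A minimal sketch of grouping, aggregation, a pivot table, and a cross-tabulation with pandas; the region/product sales data is hypothetical:

import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "sales": [100, 150, 200, 120, 80],
})

# Grouping and aggregation: total sales per region
print(df.groupby("region")["sales"].sum())

# Pivot table: aggregates the numerical sales column by region and product
print(pd.pivot_table(df, values="sales", index="region", columns="product", aggfunc="sum"))

# Cross-tabulation: frequency counts of the two categorical variables
print(pd.crosstab(df["region"], df["product"]))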
---------------------------------------------------------------------------------------
PART – A
1. What is the primary purpose of data exploration in the context of data analysis?
2. Name two common plots used for visualizing the distribution of a single variable.
3. How can a scatter plot be used to identify relationships between two variables?
4. What is the role of a box plot in data visualization, and what key statistics does it
display?
5. Explain the significance of using a correlation matrix in data exploration.
6. What does a heatmap represent, and when is it most useful?
7. Why is it important to visualize outliers in a dataset, and which plots can be used for
this purpose?
8. How can you use histograms to assess the skewness of a dataset?
9. What is the advantage of using pair plots for exploring relationships in a multi-
variable dataset?
10. How does Seaborn enhance the process of visualizing categorical data compared to
matplotlib?
PART – B&C