Chapter 5 - Data Exploration and Visualization With

This document discusses data exploration and visualization using Pandas and Matplotlib in Python. It introduces core Pandas concepts like Series and DataFrames, describes how to create and manipulate DataFrames to perform exploratory data analysis, and demonstrates how to generate basic visualizations of DataFrame data using Matplotlib. The key steps of the data science process are reviewed before diving into practical examples with real data.

Uploaded by

Khadar Yare

Jamhuriya University of Science & Technology (JUST)

CA416 - Principles of Data Science

Chapter 5

Practical Data Exploration and Visualization with Pandas and Matplotlib packages.

Lecturer: XYZ


Learning outcomes
By the end of this lecture, you will be able to:
• Describe core data analysis concepts, including dataframes and data exploration.
• Create and access the main data structures in Python, such as Series and dataframes.
• Perform exploratory data analysis in Python using the Pandas library.
• Build data visualizations with the Matplotlib library.


Data science workflow – recap

Source: https://www.dataquest.io/blog/what-is-data-science/

Types of variables
A variable is any characteristic or attribute that can be measured quantitatively or qualitatively.

Variable
– Numeric
  – Continuous: time, age, etc.
  – Discrete: counts of people, houses, etc.
– Categorical
  – Ordinal: grades, ratings, etc.
  – Nominal: nationality, race, etc.

Exploratory data analysis

• Exploratory Data Analysis (EDA) is an initial exploration of the data to understand its characteristics, patterns, and correlations, and to identify any anomalies in the data.
• Statistical data summarization and graphical visualization are the two primary forms of EDA.
• EDA is typically applied before formal data modeling and helps inform the development of appropriate statistical or machine learning models.


Pandas
• Data analysis is normally performed on data stored in a tabular format, e.g., an Excel spreadsheet.
• Each observation is recorded in a row and each of its attributes in a column (e.g., students and their grades in each assignment).
• Pandas is a Python library for manipulating data in tabular format and ships with the Anaconda Python distribution.
• In Pandas, data manipulation can be much more varied, can be programmed more easily, and can be performed more efficiently, which is critical in large-scale projects.


Pandas
• Pandas is a high-level library built on NumPy, providing tools that make it easier to work with real-world data:
– load data from a variety of sources (e.g. CSV, JSON, SQL)
– update (add, modify, delete, etc.) data
– select subsets of the data
– group data by a certain criterion
– clean the data and handle missing values (NaNs)
– visualize the data using different plotting tools
– perform statistical analysis of the data
– export the data to other file formats or databases
• Pandas provides two main data structures: the DataFrame (equivalent to a spreadsheet) and the Series (equivalent to a column in a spreadsheet).
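
As a rough illustration of a few of the tasks listed above, the sketch below loads a small table, fills a missing value, groups the rows, and exports the result. The CSV text, the column names, and the choice to replace the missing grade with 0 are all assumptions made up for this example; they do not come from the lecture.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """student,assignment,grade
Ali,A1,70
Ali,A2,
Fatima,A1,85
Fatima,A2,90
"""

grades = pd.read_csv(io.StringIO(csv_text))      # load data
grades["grade"] = grades["grade"].fillna(0)      # handle the missing value (NaN)
mean_by_student = grades.groupby("student")["grade"].mean()  # group and summarize
csv_out = grades.to_csv(index=False)             # export (to a string here)

print(mean_by_student)
```

Passing a file path to read_csv() and to_csv() instead of an in-memory string works the same way.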

Pandas Series
• A Series is just a column in a dataframe or spreadsheet.

A Pandas Series with indices is created from a list.

The Series attributes can also be accessed separately.

Notice that the values are just a NumPy array.
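
The slide's code is not part of this text, so the following is only a minimal sketch of what such a Series might look like. The region names and population figures are placeholders chosen for illustration, not official statistics.

```python
import numpy as np
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
population = pd.Series(
    [1_650_000, 720_000, 790_000],
    index=["Banaadir", "Bari", "Bay"],
    name="population",
)

print(population)                # values shown next to their index labels
print(population.index)          # the index labels on their own
print(population.values)         # the values on their own
print(type(population.values))   # a NumPy ndarray
```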

Operations on series data

Please read the Pandas documentation, listed on the resources slide, for a detailed list of the statistical functions applicable to Series.
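
A few common Series operations could look like the sketch below. The data are the same made-up placeholder values used for illustration, and the selection of functions is an assumption rather than the slide's own list.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
population = pd.Series(
    [1_650_000, 720_000, 790_000],
    index=["Banaadir", "Bari", "Bay"],
)

total = population.sum()              # sum of all values
average = population.mean()           # arithmetic mean
largest = population.idxmax()         # index label of the largest value
in_millions = population / 1_000_000  # element-wise arithmetic

print(total, average, largest)
print(in_millions)
```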


Pandas Dataframe
• A Pandas dataframe is a collection of Series: a 2D data structure with rows and columns, effectively a spreadsheet.
• Let's create an example dataframe with the population and area values for some regions in Somalia.

The head() method returns the first few rows of the dataframe. You can use tail() to see the last few rows.

The dataframe is just like a spreadsheet with indices (as with the Series).
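
A minimal sketch of such a dataframe follows. The slide's actual values are not available in this text, so the three regions and their figures are placeholders for illustration only, and the column names are assumed.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
data = {
    "region": ["Banaadir", "Bari", "Bay"],
    "population": [1_650_000, 720_000, 790_000],
    "area": [370, 70_000, 35_000],
}
df = pd.DataFrame(data)

print(df.head(2))   # first two rows
print(df.tail(1))   # last row
```

Without an argument, head() and tail() return five rows.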

Indexing the dataframe records

• Since we did not supply any particular values as the index, a range of integers was used as the index for the dataframe.
• However, it may be convenient to set the region names as the indices for our example dataframe.

The argument inplace=True is an instruction to modify the dataframe itself instead of returning a modified copy. As a result, the dataframe now has two columns.

Region names are now used as the dataframe indices.
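
One way to do this is with set_index(), sketched below on the assumption that the region names sit in a column called "region". The column name and the figures are placeholders for illustration.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame({
    "region": ["Banaadir", "Bari", "Bay"],
    "population": [1_650_000, 720_000, 790_000],
    "area": [370, 70_000, 35_000],
})

print(df.index)   # a default integer range: 0, 1, 2

# Move the "region" column into the index, modifying df itself.
df.set_index("region", inplace=True)

print(df)         # two columns remain: population and area
```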

Hands-on exercise 1
• We can create dataframes by supplying dictionaries with identical sets of keys as arguments to DataFrame().

• Can you represent each column (population and area) as a dictionary and create a dataframe from them?


Dataframe attributes

Using these attributes, one can access the dataframe's information separately.

This shows that dataframe values are just 2D NumPy arrays, which is why NumPy underpins Pandas and other data science libraries.
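
The attributes in question are most likely index, columns, values, and dtypes; the sketch below assumes so and uses placeholder figures for illustration.

```python
import numpy as np
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

print(df.index)     # row labels
print(df.columns)   # column names
print(df.dtypes)    # data type of each column
print(df.values)    # the data as a 2D NumPy array
```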


Descriptive statistics

The average population and area across all regions are 6.931368e+05 and 41079.333333, respectively.

These statistics include the mean, quartiles, median, and the number of observations.
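
Summary statistics of this kind are what describe() produces. The sketch below assumes that method and uses placeholder figures, so its numbers will not match the averages quoted on the slide.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

summary = df.describe()   # count, mean, std, min, quartiles, max per column
print(summary)
print(df.mean())          # the column means on their own
```

The row labelled 50% in the output is the median.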


Descriptive statistics

The info() method provides a concise description of the dataframe.

The shape attribute gives the shape of the dataframe as the number of rows and columns. It is an attribute rather than a method, so it is written df.shape, without parentheses.
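
A short sketch of both, using placeholder figures for illustration:

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

df.info()                    # prints index type, columns, non-null counts, dtypes

n_rows, n_cols = df.shape    # an attribute, so no parentheses
print(n_rows, n_cols)
```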


Matplotlib for plotting a dataframe

• Matplotlib is a comprehensive package for data visualization and comes as part of Anaconda.
• Pandas has a convenient integration with Matplotlib, which means that data contained in a dataframe can be plotted with plot().
• You can select the plot type that suits your data (e.g., scatter, bar, boxplot, pie, hist, …).
• Please see the resources at the end for more information on the various plots and arguments of the plot() function.


Matplotlib for plotting a dataframe

• Let us now plot the data contained in our dataframe, df, by simply calling its plot method:

The logy=True argument scales the y-axis logarithmically. Without it, the two columns would be hard to read on one plot because their scales are very different.

Bar plots are useful tools for viewing categorical variables.
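
A bar plot with a logarithmic y-axis might be produced as in the sketch below. The data are placeholder figures, and the axis label is an addition made for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

# One group of bars per region, one bar per column, log-scaled y-axis.
ax = df.plot(kind="bar", logy=True)
ax.set_ylabel("value (log scale)")
plt.tight_layout()
# In a script, call plt.show() here to display the figure.
```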

Selecting dataframe columns

• Let us extract the area column (variable) from our dataframe.

The extracted columns are Pandas Series and can be stored in a different variable or processed separately.
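
A minimal sketch of column selection, with placeholder figures for illustration:

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

area = df["area"]                     # a single column is a Series
both = df[["population", "area"]]     # a list of names gives a dataframe

print(type(area))
print(area.max())                     # the Series can be processed on its own
```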

Selecting dataframe cells

• We can also extract a single cell value from a dataframe.

Notice that the cell is accessed by its column name and row index. The index can be a number, and it is written in square brackets [].

You can use the same syntax to update the cell, e.g. change the number, although assigning through df.loc[row, column] is more reliable because chained indexing may modify a temporary copy.

Again, the extracted cell values can be processed separately.
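
The sketch below reads a cell by column name and then row index, and updates it through loc. Using loc for the assignment is a substitution made here because it does not depend on chained indexing; the figures are placeholders.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

value = df["area"]["Banaadir"]      # column name first, then row index
print(value)

df.loc["Banaadir", "area"] = 371    # update one cell: row label, then column
print(df.loc["Banaadir", "area"])
```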


Slicing the dataframe

The iloc attribute is used to access rows and columns by their integer positions. A ':' means extract all rows or all columns.

The loc attribute is used to access rows and columns by their labels (here, strings such as region and column names).
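
A small sketch contrasting the two, using placeholder figures for illustration:

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

first_two = df.iloc[0:2, :]                    # rows 0 and 1, all columns
one_cell = df.iloc[0, 1]                       # first row, second column
by_label = df.loc["Bari":"Bay", "population"]  # label slices include both ends
all_area = df.loc[:, "area"]                   # all rows of one column

print(first_two)
print(by_label)
```

Note the difference in slicing: iloc excludes the end position, while loc includes the end label.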

Adding columns to a dataframe

Remember that this is a vectorised (element-wise) math operation, just as with NumPy arrays.

The ‘density’ column is now created and added to the dataframe.

Clearly, Banaadir has the highest density.
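
The density column is presumably population divided by area; the sketch below assumes that definition and uses placeholder figures for illustration.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

# Element-wise division of one column by another creates the new column.
df["density"] = df["population"] / df["area"]

print(df)
print(df["density"].idxmax())   # region with the highest density
```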

Adding columns to a dataframe

The backslash ‘\’ continues the list definition on the next line. Inside square brackets Python already continues lines implicitly, so the backslash is optional there.

A new column ‘capital’ is now created and added to the dataframe.
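
Adding a column from a list could look like the sketch below. The three regions are the ones assumed for these illustrations, and the list must have one entry per row.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

# The backslash continues the statement on the next line.
capitals = ["Mogadishu", \
            "Bosaso", "Baidoa"]

df["capital"] = capitals   # one value per row, in row order
print(df)
```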


Adding rows to a dataframe

This now adds a new row with index ‘M_Shabelle’ to the dataframe.
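
One way to add such a row is to assign to loc with a new label, as sketched below. Whether the slide used this approach is an assumption, and the figures are placeholders.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

# Assigning to a label that does not exist yet appends a new row.
df.loc["M_Shabelle"] = [520_000, 22_000]

print(df)
```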


Conditional data selection

The selected data is itself a dataframe.

Such conditional extraction can be applied to any other dataframe column.
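
Conditional selection uses a boolean condition inside the square brackets. In the sketch below, the 750,000 threshold and the figures are made up for illustration.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

mask = df["population"] > 750_000   # a boolean Series, one value per row
large = df[mask]                    # keeps only the rows where mask is True

print(mask)
print(large)
```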

Conditional updating

This populates the entire new column with the single value ‘low’.


Conditional updating

The first index in loc specifies the rows to which the change applies, and the second specifies the column.
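
A conditional update can be sketched in two steps: fill a new column with a default value, then overwrite the rows that meet a condition. The column name density_status appears later in the lecture; the cutoff of 1000 and the figures are assumptions made for this example.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)
df["density"] = df["population"] / df["area"]

df["density_status"] = "low"     # every row gets the same value

# Rows selected by the condition, column selected by name.
df.loc[df["density"] > 1000, "density_status"] = "high"

print(df)
```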


Conditional updating
• We can also use the apply() function to apply an operation to every row or every column of a dataframe.
• apply() takes a custom function as an argument; the custom function receives one row or one column at a time and can return a modified row or column.


Conditional updating

The axis argument indicates how the dataframe is processed: axis=0 (the default) passes each column to the function, and axis=1 passes each row.
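
The sketch below shows apply() with both axis values. The helper density_label, its cutoff of 1000, and the figures are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)


def density_label(row):
    """Label one row as 'high' or 'low'; the cutoff of 1000 is made up."""
    return "high" if row["population"] / row["area"] > 1000 else "low"


# axis=0: the function receives one column at a time.
col_max = df.apply(lambda col: col.max(), axis=0)

# axis=1: the function receives one row at a time.
df["density_status"] = df.apply(density_label, axis=1)

print(col_max)
print(df)
```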


Deleting dataframe columns

The drop() method can be used to remove rows or columns, depending on the axis we specify and the column or row we name.

From this output, we can see that the density_status column is now removed.
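
Dropping a column could look like the following sketch, with placeholder data for illustration:

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000],
     "density_status": ["high", "low", "low"]},
    index=["Banaadir", "Bari", "Bay"],
)

# axis=1 tells drop() that the name refers to a column.
df.drop("density_status", axis=1, inplace=True)

print(df.columns)
```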


Deleting dataframe rows

As with columns, the inplace=True argument means we are updating the dataframe itself. As a result, the dataframe now has fewer records (rows).

From this output, we can see that the M_Shabelle row is now removed.
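
Dropping a row works the same way with axis=0, as in this sketch with placeholder figures:

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000, 520_000],
     "area": [370, 70_000, 35_000, 22_000]},
    index=["Banaadir", "Bari", "Bay", "M_Shabelle"],
)

# axis=0 tells drop() that the name refers to a row label.
df.drop("M_Shabelle", axis=0, inplace=True)

print(df)
```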


Exploring categorical variables

The unique() function returns the unique values of the column.

The value_counts() method returns the frequency of each unique value.

The value_counts() method returns a Series.
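
A short sketch on a categorical column. The density_status values here are placeholders for illustration.

```python
import pandas as pd

# Placeholder labels for illustration only.
df = pd.DataFrame(
    {"density_status": ["high", "low", "low"]},
    index=["Banaadir", "Bari", "Bay"],
)

statuses = df["density_status"].unique()        # distinct values
counts = df["density_status"].value_counts()    # frequency of each value

print(statuses)
print(counts)
```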

Exploring numerical variables

The unique() function returns the unique values of the column.

The value_counts() method returns the frequency of each unique value.

The value_counts() method returns a Series.

Visualizing numerical variables

This plot shows that 5 regions have a population ranging from about 375K to approximately 620K.

Histograms are useful tools for viewing the distribution of variables.
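
A histogram can be drawn directly from a column, as in the sketch below. The eight population values and the choice of five bins are made up for illustration, so the plot will not reproduce the one on the slide.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration only, not official statistics.
population = pd.Series(
    [380_000, 410_000, 450_000, 520_000,
     600_000, 900_000, 1_200_000, 1_650_000],
    name="population",
)

# Each bar counts how many values fall inside one of five equal-width bins.
ax = population.plot(kind="hist", bins=5)
ax.set_xlabel("population")
plt.tight_layout()
# In a script, call plt.show() here to display the figure.
```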

Visualizing numerical variables

Boxplot anatomy, from top to bottom: maximum, Q3, median, Q1, minimum. The IQR (interquartile range) spans Q1 to Q3.

Boxplots are used to represent summary statistics (the five-number summary) and to compare the summaries of different datasets.

The two variables (columns) could be drawn on the same plot, but they have been plotted separately because their scales differ.
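
The sketch below computes a five-number summary and draws one boxplot per column on separate axes. The side-by-side layout and the figures are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_num = df["population"].quantile([0, 0.25, 0.5, 0.75, 1])
print(five_num)

# Separate axes, because the two columns have very different scales.
fig, axes = plt.subplots(1, 2)
df["population"].plot(kind="box", ax=axes[0])
df["area"].plot(kind="box", ax=axes[1])
fig.tight_layout()
```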

Visualizing numerical variables

The rot and fontsize arguments are used here to rotate and size the x-axis labels.

Line plots are primarily used for viewing continuous variables.
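
A line plot with rotated, resized tick labels might look like this sketch. The rotation of 45 degrees, the font size of 8, and the figures are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000],
     "area": [370, 70_000, 35_000]},
    index=["Banaadir", "Bari", "Bay"],
)

# rot rotates the tick labels (in degrees); fontsize sets their size.
ax = df["population"].plot(kind="line", rot=45, fontsize=8)
ax.set_ylabel("population")
plt.tight_layout()
```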

Visualizing categorical variables

plt is the conventional alias for the matplotlib.pyplot module. The loc argument to the legend function sets the location of the legend.

Pie charts are primarily used for visualizing proportions, mostly of categorical variables.
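
A pie chart with a positioned legend could be sketched as follows. The legend location, the percentage format, and the figures are assumptions made for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [1_650_000, 720_000, 790_000]},
    index=["Banaadir", "Bari", "Bay"],
)

# Each wedge shows one region's share of the total population.
ax = df["population"].plot(kind="pie", autopct="%1.1f%%")
plt.legend(loc="upper left")   # loc sets where the legend is placed
plt.tight_layout()
```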

Hands-on exercise 2
• Suppose you have this data in a dictionary:
exam_data = {
    'name': ['Ali', 'Ahmed', 'Jama', 'Omar', 'Fatima', 'Mohamed',
             'Mohamud', 'Malin', 'Farah', 'Samad'],
    'score': [62.5, 79, 16.5, 65, 53, 81, 58, 45, 72, 66.5],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'yes']
}
• Create a dataframe from this data and retrieve the following subsets of data:
1. The first three rows
2. The next three rows
3. The score for 'Mohamed'
4. The scores of all students who qualify and who made just one attempt.

References & reading resources

Data Analysis with Pandas
• https://pandas.pydata.org/docs/getting_started/
• https://pandas.pydata.org/pandas-docs/stable/index.html
• https://www.youtube.com/watch?v=5JnMutdy6Fw
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
• Chapter 3, Python Data Science Handbook, Jake VanderPlas
• Chapter 5-6, Python for Data Analysis, Wes McKinney

Matplotlib
• https://matplotlib.org/tutorials/index.html

