Jamhuriya University of Science & Technology (JUST)
CA416 - Principles of Data Science
Chapter 5
Practical Data Exploration and Visualization with
Pandas and Matplotlib packages.
Lecturer: XYZ
Learning outcomes
By the end of this lecture, you will be able to:
Describe some core data analysis concepts including
dataframes, and data exploration.
Create and access main data structures in Python,
such as series and dataframes.
Perform exploratory data analysis in Python using the
Pandas library.
Build data visualizations with the Matplotlib library.
Data science workflow – recap
Source: https://www.dataquest.io/blog/what-is-data-science/
Types of variables
A variable is any characteristic or attribute that can be
measured quantitatively or qualitatively.
• Numeric
– Continuous: time, age, etc.
– Discrete: number of people, number of houses, etc.
• Categorical
– Ordinal: grades, ratings, etc.
– Nominal: nationality, race, etc.
Exploratory data analysis
Exploratory Data Analysis (EDA) is an initial exploration of
the data to understand its characteristics, patterns, and
correlations, and to identify any anomalies in the data.
Statistical data summarization and graphical visualizations
are two primary forms of EDA.
EDA is typically applied before formal data modeling and
helps inform the development of appropriate statistical models
or machine learning models.
Pandas
Data analysis is normally performed on data stored in a
tabular format, e.g., an Excel spreadsheet.
Each observation is recorded in a row and each of its
attributes is recorded in a column (e.g., students and
their grades in each assignment).
Pandas is a Python library for manipulating data in tabular
format and is included in the Anaconda Python distribution.
In Pandas, data manipulation can be much more varied, and it can
be programmed more easily and performed more efficiently,
which is critical in large-scale projects.
Pandas
• Pandas is a high-level library built on NumPy, providing
tools that make it easier to work with real-world data:
– load data from a variety of sources (e.g. CSV, JSON, SQL).
– update (add, modify, delete, etc.) data
– select subsets of the data
– group data by a certain criterion
– clean the data and handle missing values (NaNs)
– visualize the data using different plotting tools
– perform statistical analysis of the data and
– export the data to other file formats or databases
• Pandas provides two main data structures, namely,
DataFrame (equivalent to a spreadsheet) and a Series
(equivalent to a column in a spreadsheet).
Pandas Series
A Series is just a column in a dataframe or spreadsheet.
A Pandas series with indices is created from a list.
The series attributes can also be accessed separately.
Notice that the values are just NumPy arrays.
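As a minimal sketch of the points above, the following creates a series from a list and reads its attributes. The values and the index labels "a", "b", "c" are illustrative assumptions, not the data shown on the original slide.

```python
import numpy as np
import pandas as pd

# Illustrative values and labels only (hypothetical, not the slide's data).
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s)          # the values shown next to their index labels
print(s.index)    # the index labels on their own
print(s.values)   # the values on their own

# The values are held in a NumPy array.
print(isinstance(s.values, np.ndarray))
```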
Operations on series data
Please read the Pandas documentation for a detailed list of
the statistical functions applicable to series, as given in
the additional resources slide.
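A few common series operations can be sketched as follows. The numbers are made up for illustration; only standard Series methods are used.

```python
import pandas as pd

# Made-up numbers, for illustration only.
s = pd.Series([10, 20, 30, 40])

print(s.mean())      # arithmetic mean
print(s.sum())       # total
print(s.min(), s.max())
print(s.std())       # sample standard deviation
print(s * 2)         # element-wise arithmetic returns a new series
print(s.describe())  # several summary statistics at once
```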
Pandas Dataframe
A Pandas dataframe is a collection of Series (a 2D data
structure with rows and columns, effectively a spreadsheet).
Let's create an example dataframe with the population and area
values for some regions in Somalia.
The head() method returns the first few
rows of the dataframe. You can use
tail() to see the last few rows.
The dataframe is just like a
spreadsheet with indices (as in
the series).
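A dataframe of this kind might be built as in the sketch below. The region names and all figures are placeholders chosen for illustration; they are assumptions, not the values used in the lecture.

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
data = {
    "region": ["Region_A", "Region_B", "Region_C", "Region_D"],
    "population": [1_200_000, 300_000, 600_000, 450_000],
    "area": [300, 45_000, 60_000, 30_000],
}
df = pd.DataFrame(data)

print(df.head(2))  # first two rows
print(df.tail(2))  # last two rows
```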
Indexing the dataframe records
Since we did not supply any particular values as the index,
a range of integers was used as the index for the dataframe
However, it may be convenient to set the region names as the
indices for our example dataframe.
The argument inplace=True is an instruction to modify the
dataframe itself rather than return a new one. As a result,
the dataframe now has two columns, because the region column
has become the index.
Region names are now used as the dataframe indices.
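Setting the index can be sketched as follows, assuming a column named "region" and placeholder figures (both are assumptions made for this example).

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame({
    "region": ["Region_A", "Region_B", "Region_C"],
    "population": [1_200_000, 300_000, 600_000],
    "area": [300, 45_000, 60_000],
})

# Use the region names as the row index; inplace=True changes df itself.
df.set_index("region", inplace=True)

print(df)          # rows are now labelled by region name
print(df.columns)  # only 'population' and 'area' remain as columns
```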
Hands-on exercise 1
We can create dataframes by supplying dictionaries with
identical sets of keys as arguments to DataFrame().
Can you represent each column (population and area) as a
dictionary and create a dataframe from them?
Dataframe attributes
Using these attributes, each piece of information about the
dataframe (df) can be accessed separately.
This shows that the dataframe values are just a 2D NumPy array,
which illustrates how NumPy underpins Pandas and other
data science libraries.
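The main dataframe attributes can be inspected as in this sketch, which assumes a small dataframe with placeholder figures and hypothetical region names.

```python
import numpy as np
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

print(df.index)    # row labels
print(df.columns)  # column names
print(df.dtypes)   # data type of each column
print(df.values)   # the underlying 2D NumPy array
print(type(df.values), df.values.shape)
```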
Descriptive statistics
The average population and area across all regions are
6.931368e+05 and 41079.333333, respectively.
These statistics include the mean, quartiles, median,
number of observations (count), etc.
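Summary statistics of this kind are typically obtained with describe(). The sketch below uses placeholder figures, so its numbers differ from the averages quoted above.

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

stats = df.describe()  # count, mean, std, min, quartiles, max per column
print(stats)

# Individual statistics can be read from the result by label.
print(stats.loc["mean", "population"])
print(stats.loc["50%", "area"])  # the median is the 50% quartile
```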
Descriptive statistics
The info() method provides a concise description of the
dataframe.
The shape attribute (an attribute, not a method, so it is
written without parentheses) gives the shape of the dataframe
in terms of the number of rows and columns.
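A short sketch of both calls, again on a placeholder dataframe with hypothetical region names:

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

df.info()        # prints index type, columns, non-null counts and dtypes
print(df.shape)  # a (rows, columns) tuple; note there are no parentheses

n_rows, n_cols = df.shape
```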
Matplotlib for plotting a dataframe
• Matplotlib is a comprehensive package for
data visualization, and comes as part of Anaconda.
• Pandas integrates conveniently with matplotlib, which
means that data contained in a dataframe can be plotted with
plot().
• You can select the plot type that suits your data (e.g., scatter,
bar, box, pie, hist, …).
• Please see the resources at the end for more information on the
various plots and the arguments of the plot() function.
Matplotlib for plotting a dataframe
• Let us now plot the data contained in our dataframe, df, by
simply calling its plot() method.
The logy=True argument scales the y values logarithmically.
Without it, the columns would be hard to compare on one axis
because their scales are very different.
Bar plots are useful tools for viewing categorical variables.
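A bar plot with a logarithmic y axis might look like the sketch below. The figures and region names are placeholders, and the "Agg" backend line is an assumption added so that the example also runs without a display; it can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

# One group of bars per region; logy=True puts the y axis on a log scale
# so that the small area values remain visible next to the populations.
ax = df.plot(kind="bar", logy=True)
ax.set_ylabel("value (log scale)")
plt.tight_layout()
```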
Selecting dataframe columns
• Let us extract the area column/variable from our
dataframe.
The extracted columns are of the Pandas Series type and can be
stored in a different variable or processed separately.
Selecting dataframe cells
• We can also extract a single cell value from a dataframe.
Notice that the cell is accessed by its column name and row
index. The index can be a number, and it is written in
square brackets [].
You can use the same syntax to update the cell, e.g. to change
the number, although this chained form may not modify the
original dataframe in recent Pandas versions, where
df.loc[row, column] = value is more reliable.
Again, the extracted cell values can be processed separately.
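The sketch below reads a cell with the column-then-row bracket form and updates it with loc. Using loc for the update is a substitution made here on purpose, since chained assignment can fail to change the original dataframe in recent Pandas versions. The data is a placeholder.

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

# Read a cell: column name first, then the row label.
value = df["population"]["Region_B"]
print(value)

# Read by integer position within the column.
first = df["population"].iloc[0]
print(first)

# Update a cell with a single loc call (row label, column name).
df.loc["Region_B", "population"] = 310_000
```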
Slicing the dataframe
The iloc attribute is used to access rows and columns by
their integer positions.
‘:’ means extract all rows (or all columns).
The loc attribute is used to access rows and columns by
their labels, e.g. string indices.
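The difference between iloc and loc can be sketched as follows, on a placeholder dataframe with hypothetical region names. Note that a label slice with loc includes its end label, whereas an integer slice with iloc excludes its end position.

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

first_two = df.iloc[0:2, :]  # rows 0 and 1, all columns
first_col = df.iloc[:, 0]    # all rows, first column

# Label-based slice: both 'Region_A' and 'Region_B' are included.
areas = df.loc["Region_A":"Region_B", "area"]

print(first_two)
print(first_col)
print(areas)
```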
Adding columns to a dataframe
Remember that this is a vectorised (element-wise) math
operation, just as with NumPy arrays.
A ‘density’ column is now created and added to the
dataframe.
Banaadir clearly has the highest density.
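A derived column of this kind can be sketched as below, assuming density is population divided by area. The figures and region names are placeholders, so the region with the highest density here is not the one named above.

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

# Element-wise division of two columns creates the new column.
df["density"] = df["population"] / df["area"]

print(df)
print(df["density"].idxmax())  # label of the row with the highest density
```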
Adding columns to a dataframe
The backslash ‘\’ continues the list definition on the
next line (inside square brackets it is optional).
A new column ‘capital’ is now created and added to the
dataframe.
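Adding a column from a list can be sketched as follows. The capital names are hypothetical placeholders, and the list must contain one entry per row.

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

# Hypothetical capital names; the backslash continues the line.
df["capital"] = ["Capital_A", "Capital_B", \
                 "Capital_C"]

print(df)
```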
Adding rows to a dataframe
This now adds a new
row with index
‘M_Shabelle’ to the
dataframe
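One common way to add a row is to assign to a new label through loc, as in this sketch. The label 'Region_D' and all figures are hypothetical placeholders, and this is an assumption about the approach rather than a copy of the slide's code.

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

# Assigning to a label that does not exist yet appends a new row;
# the values are given in column order.
df.loc["Region_D"] = [450_000, 30_000]

print(df)
```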
Conditional data selection
The selected data is a
dataframe itself.
Such conditional
extraction can be
applied to any other
dataframe column
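Conditional selection uses a boolean condition inside the square brackets. The threshold of 500,000 and the data below are assumptions for illustration.

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

mask = df["population"] > 500_000  # a boolean series, one value per row
large = df[mask]                   # keeps only the rows where mask is True

print(mask)
print(large)
```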
Conditional updating
This populates
the entire new
column with the
single value
‘low’
Conditional updating
The first index in loc specifies the rows to which the
change applies, and the second index specifies the
column.
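The two steps of a conditional update, filling a new column with one value and then overwriting selected rows through loc, can be sketched as follows. The column name 'density_status' comes from the slides, while the threshold of 100 and the data are assumptions.

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)
df["density"] = df["population"] / df["area"]

# Step 1: a single value fills the whole new column.
df["density_status"] = "low"

# Step 2: rows matching the condition (first index) get a new value
# in the named column (second index). The threshold is hypothetical.
df.loc[df["density"] > 100, "density_status"] = "high"

print(df)
```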
Conditional updating
We can also use the apply() function to apply some
operation to every row or every column in a dataframe.
It takes a custom function as an argument. The custom
function receives one row or one column at a time and
can return a modified row or column.
Conditional updating
The axis argument indicates whether the function is
applied to each row (axis=1) or to each column
(axis=0, the default).
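The effect of the axis argument can be sketched with a small custom function. The function name classify, the threshold of 100, and the data are hypothetical choices made for this example.

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

def classify(row):
    """Hypothetical rule: label a region by its population density."""
    return "high" if row["population"] / row["area"] > 100 else "low"

# axis=1: the function receives one row at a time.
df["density_status"] = df.apply(classify, axis=1)

# axis=0: the function receives one column at a time.
col_max = df[["population", "area"]].apply(lambda col: col.max(), axis=0)

print(df)
print(col_max)
```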
Deleting dataframe columns
The drop() method can be used to remove rows or
columns, depending on the axis we specify and the
column/row we name.
From this output, we can see that the density_status
column is now removed.
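Dropping a column can be sketched as follows, on a placeholder dataframe that already contains a 'density_status' column.

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000],
     "density_status": ["high", "low", "low"]},
    index=["Region_A", "Region_B", "Region_C"],
)

# axis=1 means the name refers to a column; inplace=True changes df itself.
df.drop("density_status", axis=1, inplace=True)

print(df.columns)
```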
Deleting dataframe rows
As with columns, using the inplace=True argument means
we are updating the dataframe itself, and as a result the
dataframe will now have fewer records (rows).
From this output, we can see that the M_Shabelle row is
now removed.
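Dropping a row works the same way with axis=0. The row label 'Region_D' and the figures below are hypothetical placeholders.

```python
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000, 450_000],
     "area": [300, 45_000, 60_000, 30_000]},
    index=["Region_A", "Region_B", "Region_C", "Region_D"],
)

# axis=0 means the name refers to a row label.
df.drop("Region_D", axis=0, inplace=True)

print(df)
```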
Exploring categorical variables
The unique() method returns the unique values of the
column.
The value_counts() method returns the frequency of each
unique value.
The value_counts() method returns a series.
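Both methods can be sketched on a small categorical column. The 'density_status' values and region names below are placeholders.

```python
import pandas as pd

# Placeholder values for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"density_status": ["high", "low", "low"]},
    index=["Region_A", "Region_B", "Region_C"],
)

uniques = df["density_status"].unique()       # distinct values
counts = df["density_status"].value_counts()  # frequency of each value

print(uniques)
print(counts)
print(type(counts))  # value_counts() returns a Series
```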
Exploring numerical variables
The unique() method returns the unique values of the
column.
The value_counts() method returns the frequency of each
unique value.
The value_counts() method returns a series.
Visualizing numerical variables
This plot shows that 5 regions have a population
ranging from about 375K to approx. 620K.
Histograms are useful tools for viewing the
distribution of variables.
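A histogram of a numerical column can be sketched as below. The six population figures are placeholders, the choice of five bins is arbitrary, and the "Agg" backend line is an assumption so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder population figures, NOT the lecture's data.
population = pd.Series(
    [1_200_000, 300_000, 600_000, 450_000, 500_000, 380_000],
    name="population",
)

# Each bar counts how many values fall into that bin.
ax = population.plot(kind="hist", bins=5)
ax.set_xlabel("population")
plt.tight_layout()
```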
Visualizing numerical variables
[Figure: boxplot annotated with Maximum, Q3, Median, Q1, Minimum, and the IQR]
Boxplots are used to represent summary statistics (the five-number
summary) and to compare the summaries of different datasets.
The two variables (columns) could be drawn on the same plot, but they
have been plotted separately since their scales differ.
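Separate boxplots for two columns with different scales can be sketched as follows. The data is a placeholder, the side-by-side layout is one possible choice, and the "Agg" backend line is an assumption so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000],
     "area": [300, 45_000, 60_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

# One subplot per column so that each gets its own y scale.
fig, axes = plt.subplots(1, 2)
df["population"].plot(kind="box", ax=axes[0])
df["area"].plot(kind="box", ax=axes[1])
plt.tight_layout()

# The five-number summary that the box represents.
summary = df["population"].describe()[["min", "25%", "50%", "75%", "max"]]
print(summary)
```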
Visualizing numerical variables
The rot and fontsize
arguments are used
here to rotate and
size the x labels.
Line plots are
primarily used for
viewing continuous
variables.
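A line plot using the rot and fontsize arguments can be sketched as below. The data, the 45-degree rotation, and the font size of 8 are assumptions, and the "Agg" backend line is there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

# rot rotates the x tick labels; fontsize sets the tick label size.
ax = df["population"].plot(kind="line", rot=45, fontsize=8)
ax.set_ylabel("population")
plt.tight_layout()
```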
Visualizing categorical variables
plt is an alias for the matplotlib pyplot module.
The loc argument to the legend function sets the
location of the legend.
Pie charts are primarily used for visualizing the
proportions of (mostly) categorical variables.
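A pie chart with a positioned legend can be sketched as follows. The data, the percentage format, and the "upper left" location are assumptions, and the "Agg" backend line is there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt  # plt is the usual alias for pyplot
import pandas as pd

# Placeholder figures for hypothetical regions, NOT the lecture's data.
df = pd.DataFrame(
    {"population": [1_200_000, 300_000, 600_000]},
    index=["Region_A", "Region_B", "Region_C"],
)

# Each wedge shows one region's share of the total population.
ax = df["population"].plot(kind="pie", autopct="%1.1f%%")
ax.set_ylabel("")              # hide the default column-name label
plt.legend(loc="upper left")   # loc sets where the legend is placed
```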
Hands-on exercise 2
Suppose you have this data in a dictionary:
exam_data = {
    'name': ['Ali', 'Ahmed', 'Jama', 'Omar', 'Fatima', 'Mohamed',
             'Mohamud', 'Malin', 'Farah', 'Samad'],
    'score': [62.5, 79, 16.5, 65, 53, 81, 58, 45, 72, 66.5],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'yes']
}
Create a dataframe from this data and retrieve the following subsets of data:
1. The first three rows
2. The following three rows
3. The score for 'Mohamed'
4. The scores of all students who qualify and who made just one
attempt.
References & reading resources
Data Analysis with Pandas
• https://pandas.pydata.org/docs/getting_started/
• https://pandas.pydata.org/pandas-docs/stable/index.html
• https://www.youtube.com/watch?v=5JnMutdy6Fw
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
• Chapter 3, Python Data Science Handbook, Jake
VanderPlas
• Chapter 5-6, Python for Data Analysis, Wes
McKinney.
Matplotlib
• https://matplotlib.org/tutorials/index.html