Unit V Data Visualization
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and
contour plots – Histograms – legends – colors – subplots – text and annotation –
customization – three dimensional plotting – Geographic Data with Basemap –
Visualization with Seaborn.
DATA VISUALIZATION
Data visualization is the graphical representation of information and data.
By using visual elements like charts, graphs, and maps, data visualization tools provide an
accessible way to see and understand trends, outliers, and patterns in data.
Additionally, it provides an excellent way for employees or business owners to present data to non-
technical audiences without confusion.
ADVANTAGES
Importing Matplotlib
Most of the Matplotlib utilities lie under the pyplot submodule and are usually
imported under the plt alias:
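A minimal sketch of the import, assuming Matplotlib is already installed. The name plt is only the conventional alias, and the version print is added here just to confirm that the import worked:

```python
import matplotlib
import matplotlib.pyplot as plt

# Print the installed version to confirm that the import worked
print(matplotlib.__version__)
```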
Matplotlib Plotting
If we need to plot a line from (0, 0) to (6, 250), we have to pass two arrays [0, 6]
and [0, 250] to the plot function.
LINE PLOT
Example
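The line from (0, 0) to (6, 250) described above could be drawn with a sketch like the following. The variable names xpoints and ypoints are chosen here for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

# x-coordinates and y-coordinates of the two end points
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])

# By default, plot() draws a line between the points
lines = plt.plot(xpoints, ypoints)
plt.show()
```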
OUTPUT
Multiple Points
You can plot as many points as you like; just make sure you have the same number
of points on both axes.
Example
Draw a line in a diagram from position (1, 3) to (2, 8) then to (6, 1) and
finally to position (8, 10)
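One possible way to write this example, assuming the four positions are passed as two arrays of equal length (the variable names are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

# The i-th entries of the two arrays together form one point:
# (1, 3), (2, 8), (6, 1) and (8, 10)
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])

lines = plt.plot(xpoints, ypoints)
plt.show()
```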
OUTPUT
Default X-Points
If we do not specify the points on the x-axis, they get the default values 0, 1, 2,
3, and so on, depending on the number of y-points.
So, if we take the same example as above, and leave out the x-points, the diagram
will look like this:
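A short sketch of the same plot without x-points, reusing the y-values of the previous example. Only one array is passed, so Matplotlib fills in the x-values itself:

```python
import matplotlib.pyplot as plt
import numpy as np

# Only y-values are given; the x-values default to 0, 1, 2, 3
ypoints = np.array([3, 8, 1, 10])

lines = plt.plot(ypoints)
plt.show()
```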
Matplotlib Scatter
The scatter() function plots one dot for each observation. It needs two
arrays of the same length, one for the values of the x-axis, and one for
values on the y-axis:
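A minimal sketch of a scatter plot. The numbers below are illustrative values for 13 cars (age on the x-axis, speed on the y-axis) and are assumptions made for this example, not measured data:

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: age of each car (x) and its speed when passing (y)
car_age = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
car_speed = np.array([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86])

# scatter() draws one dot per (age, speed) pair
points = plt.scatter(car_age, car_speed)
plt.show()
```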
OUTPUT
The observations in the example above are the result of 13 cars passing by; the x-axis
shows the age of each car and the y-axis shows its speed.
It seems that the newer the car, the faster it drives, but that could be a coincidence;
after all, we only registered 13 cars.
Compare Plots
In the example above, there seems to be a relationship between speed and age, but
what if we plot the observations from another day as well? Will the scatter plot tell us
something else?
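A sketch of such a comparison, assuming two sets of illustrative observations (the values for both days are made up for this example). Calling scatter() twice before show() draws both sets in the same figure, each in its own default color:

```python
import matplotlib.pyplot as plt
import numpy as np

# Day one: illustrative ages and speeds of 13 cars
x1 = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
y1 = np.array([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86])
day_one = plt.scatter(x1, y1)

# Day two: illustrative ages and speeds of 15 cars
x2 = np.array([2, 2, 8, 1, 15, 8, 12, 9, 7, 3, 11, 4, 7, 14, 12])
y2 = np.array([100, 105, 84, 105, 90, 99, 90, 95, 94, 100, 79, 112, 91, 80, 85])
day_two = plt.scatter(x2, y2)

plt.show()
```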
OUTPUT
Colors
You can set your own color for each scatter plot with the color or the c argument:
Note: To color every dot from an array of numeric values, use the c argument; the color argument accepts only color specifications.
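A small sketch of setting one color per scatter plot. The data and the two color choices ('hotpink' and a hexadecimal value) are arbitrary and used here only for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

# Two illustrative data sets
x1 = np.array([5, 7, 8, 7, 2, 17])
y1 = np.array([99, 86, 87, 88, 111, 86])
x2 = np.array([2, 2, 8, 1, 15, 8])
y2 = np.array([100, 105, 84, 105, 90, 99])

# A named color for the first plot, a hexadecimal color for the second
first = plt.scatter(x1, y1, color='hotpink')
second = plt.scatter(x2, y2, color='#88c999')
plt.show()
```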
OUTPUT
matplotlib.pyplot.errorbar() in Python
Pyplot is a state-based interface to Matplotlib that provides a
MATLAB-like way of plotting.
matplotlib.pyplot.errorbar() Function:
The errorbar() function in the pyplot module of the matplotlib library plots
y versus x as lines and/or markers with attached error bars.
Parameters: This method accepts the following parameters:
x, y: The horizontal and vertical coordinates of the data points.
fmt: Optional. A format string for the data points and the connecting lines.
xerr, yerr: Optional. A scalar or an array of error sizes; all values must be non-negative.
ecolor: Optional. The color of the error bar lines; the default value is None.
elinewidth: Optional. The line width of the error bar lines; the default value is None.
capsize: Optional. The length of the error bar caps in points; the default value is None.
barsabove: Optional. A boolean; if True, the error bars are plotted above the plot symbols. The default value is False.
lolims, uplims, xlolims, xuplims: Optional. Booleans that indicate that a value gives only an upper or lower limit.
errorevery: Optional. An integer that draws error bars on only a subset of the data.
Returns: This returns a container comprising the following:
plotline: The Line2D instance of the x, y plot markers and/or line.
caplines: A tuple of Line2D instances of the error bar caps.
barlinecols: A tuple of LineCollection instances with the horizontal and vertical error ranges.
Example
import numpy as np
import matplotlib.pyplot as plt
xval = np.arange(0.1, 4, 0.5)
yval = np.exp(xval)
plt.errorbar(xval, yval, xerr=0.4, yerr=0.5)
plt.title('Matplotlib.pyplot.errorbar() example')
plt.show()
OUTPUT
Error bars visualize this information by drawing lines that extend from the center
of the plotted data point, or from the edge of the bar in a bar chart.
The length of an error bar helps to reveal the uncertainty of a data point, as shown in the
graph below.
A short error bar shows that the values are concentrated, signaling that the plotted average
value is more reliable, while a long error bar indicates that the values are more spread
out and less reliable.
Depending on the type of data, each pair of error bars tends to be of
equal length on both sides; however, if the data is skewed, the lengths on each side
are unbalanced.
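The unbalanced case can be sketched by passing yerr as a two-row array, where the first row holds the lower errors and the second row the upper errors. The data and error sizes below are hypothetical and chosen only to make the asymmetry visible:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 6)
y = np.array([2.0, 3.5, 3.0, 4.5, 4.0])

# Hypothetical skewed uncertainties: small below each point, larger above
lower_error = np.full(5, 0.2)
upper_error = np.array([0.5, 0.8, 0.6, 1.0, 0.9])

# A (2, N) array: row 0 holds the lower errors, row 1 the upper errors
asymmetric_error = np.vstack([lower_error, upper_error])

container = plt.errorbar(x, y, yerr=asymmetric_error, fmt='o', capsize=4)
plt.show()
```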
# importing matplotlib
import matplotlib.pyplot as plt
# making a simple plot
x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 2, 1, 2, 1, 2, 1]
# creating error (the same error for every point)
y_error = 0.2
# plotting graph with vertical error bars
plt.plot(x, y)
plt.errorbar(x, y,
             yerr=y_error,
             fmt='o')
OUTPUT
# importing matplotlib
import matplotlib.pyplot as plt
# making a simple plot
x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 2, 1, 2, 1, 2, 1]
# creating error (the same error for every point)
x_error = 0.5
# plotting graph with horizontal error bars
plt.plot(x, y)
plt.errorbar(x, y,
             xerr=x_error,
             fmt='o')
OUTPUT
<ErrorbarContainer object of 3 artists>
HISTOGRAMS
A histogram is a common way to visualize the frequency distribution of a dataset
by splitting it into small equal-sized intervals called bins. The NumPy histogram()
function is similar to the hist() function of the matplotlib library; the only difference
is that the NumPy histogram() gives a numerical representation of the dataset
while hist() gives a graphical representation of the dataset.
Attribute: Parameter
bins: int, sequence of scalars, or str. Defines the number of equal-width bins in a range; the default is 10.
normed: Optional. Same as the density attribute, but gives an incorrect result for unequal bin widths; deprecated in favor of density.
weights: Optional. Defines an array of weights having the same dimensions as the data.
density: Optional. If False, the result contains the number of samples in each bin; if True, the result contains the value of the probability density function at each bin.
The function has two return values: hist, which gives the array of values of the histogram,
and bin_edges, which is an array of float datatype containing the bin edges, with length one
more than hist.
Example:
# Import libraries
import numpy as np
# Creating dataset: 50 random integers in the range 0 to 99
a = np.random.randint(100, size=50)
# Creating histogram: counts per bin and the bin edges
hist, bins = np.histogram(a, bins=[0, 10, 20, 30, 40,
                                   50, 60, 70, 80, 90,
                                   100])
# printing histogram
print()
print(hist)
print(bins)
print()
OUTPUT
Graphical representation
The above numeric representation of the histogram can be converted into a graphical
form. The hist() function in the pyplot submodule of Matplotlib takes the
dataset array and the array of bins as parameters and creates a histogram of the
corresponding data values.
# import libraries
from matplotlib import pyplot as plt
import numpy as np
# Creating dataset
a = np.random.randint(100, size=50)
# Creating plot
fig = plt.figure(figsize=(10, 7))
plt.hist(a, bins=[0, 10, 20, 30,
                  40, 50, 60, 70,
                  80, 90, 100])
plt.title("Numpy Histogram")
# show plot
plt.show()
LEGENDS
The simplest legend can be created with the plt.legend() command, which
automatically creates a legend for any labeled plot elements.
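A minimal sketch of this behavior, assuming two curves that are each given a label argument (the sine and cosine curves are chosen here only as sample data):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# Each labeled element gets an entry in the legend
plt.plot(x, np.sin(x), label='sin(x)')
plt.plot(x, np.cos(x), label='cos(x)')

# legend() collects the labels automatically
legend = plt.legend()
plt.show()
```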
You can use the following basic syntax to add a legend to a plot in
pandas:
plt.legend(['A', 'B', 'C', 'D'], loc='center left', title='Legend Title')
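A self-contained sketch of this syntax on a pandas bar chart. The DataFrame df and its values are hypothetical and exist only so that the example can run on its own:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: four numeric columns, three rows
df = pd.DataFrame({'A': [7, 8, 6],
                   'B': [2, 5, 4],
                   'C': [5, 3, 9],
                   'D': [4, 6, 2]})

# One group of bars per row, one bar per column
ax = df.plot(kind='bar')

# Custom labels, location, title and font size for the legend
legend = plt.legend(['A Label', 'B Label', 'C Label', 'D Label'],
                    loc='upper left', title='Labels', prop={'size': 12})
plt.show()
```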
We can use the following syntax to create a bar chart to visualize the
values in the DataFrame and add a legend with custom labels:
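import matplotlib.pyplot as plt
#create bar chart (df is assumed to be an existing DataFrame with four numeric columns)
df.plot(kind='bar')
#add custom legend to bar chart
plt.legend(['A Label', 'B Label', 'C Label', 'D Label'])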
We can also use the loc argument and the title argument to modify the
location and the title of the legend:
import matplotlib.pyplot as plt
#create bar chart
df.plot(kind='bar')
#add custom legend to bar chart
plt.legend(['A Label', 'B Label', 'C Label', 'D Label'],
loc='upper left', title='Labels')
Lastly, we can use the prop argument with a size entry to modify the font size in the legend:
import matplotlib.pyplot as plt
#create bar chart
df.plot(kind='bar')
#add custom legend to bar chart
plt.legend(['A Label', 'B Label', 'C Label', 'D Label'], prop={'size': 20})
OUTPUT
<matplotlib.legend.Legend at 0x7fbc83fd9fa0>
df.style.highlight_null: With the help of this, we can highlight the missing or null values
inside the data frame.
df.style.highlight_null()
OUTPUT
df.style.highlight_max: Highlights the maximum value in each
column throughout the data frame.
df.style.highlight_max()
OUTPUT
SUBPLOTS
Example 1:
# importing packages
import matplotlib.pyplot as plt
import numpy as np
# making subplots objects
fig, ax = plt.subplots(2, 2)
# draw graph on every axes of the 2 x 2 grid
for row in ax:
    for axes in row:
        axes.plot(np.random.randint(0, 5, 5), np.random.randint(0, 5, 5))
plt.show()
OUTPUT
Example – 2
# importing packages
import matplotlib.pyplot as plt
import numpy as np
# making subplots objects
fig, ax = plt.subplots(2, 2)
# draw graph
ax[0][0].plot(np.random.randint(0, 5, 5), np.random.randint(0, 5, 5))
ax[0][1].plot(np.random.randint(0, 5, 5), np.random.randint(0, 5, 5))
ax[1][0].plot(np.random.randint(0, 5, 5), np.random.randint(0, 5, 5))
ax[1][1].plot(np.random.randint(0, 5, 5), np.random.randint(0, 5, 5))
plt.show()
OUTPUT
Example – 3
# importing packages
import matplotlib.pyplot as plt
import numpy as np
# making subplots objects
fig, ax = plt.subplots(2, 2)
# create data
x = np.linspace(0, 10, 1000)
# draw graph
ax[0, 0].plot(x, np.sin(x), 'r-.')
ax[0, 1].plot(x, np.cos(x), 'g--')
ax[1, 0].plot(x, np.tan(x), 'y-')
ax[1, 1].plot(x, np.sinc(x), 'c.-')
plt.show()
OUTPUT
TEXT AND ANNOTATION
Example 1:
# Implementation of matplotlib.pyplot.annotate()
# function
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
t = np.arange(0.0, 5.0, 0.001)
s = np.cos(3 * np.pi * t)
line = ax.plot(t, s, lw=2)
# Annotation: an arrow from the text position to the annotated point
ax.annotate('Local Max', xy=(3.3, 1),
            xytext=(3, 1.8),
            arrowprops=dict(facecolor='green',
                            shrink=0.05))
ax.set_ylim(-2, 2)
# Plot the Annotation in the graph
plt.show()
OUTPUT
Python Pandas - Options and Customization
Pandas provides an API to customize some aspects of its behavior; the display options are
the ones most commonly used.
The API is composed of five relevant functions. They are −
get_option()
set_option()
reset_option()
describe_option()
option_context()
Let us now understand how the functions operate.
get_option(param)
get_option takes a single parameter and returns its value, as shown in the output below −
display.max_rows
Returns the default maximum number of rows. The interpreter reads this value and uses it
as the upper limit on the number of rows to display.
import pandas as pd
print(pd.get_option("display.max_rows"))
Its output is as follows −
60
display.max_columns
Returns the default maximum number of columns. The interpreter reads this value and uses it
as the upper limit on the number of columns to display.
import pandas as pd
print(pd.get_option("display.max_columns"))
Its output is as follows −
20
Here, 60 and 20 are the default configuration parameter values.
set_option(param,value)
set_option takes two arguments and sets the parameter to the given value, as shown below −
display.max_rows
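A minimal sketch of set_option() for this parameter. The value 80 is an arbitrary choice made for this example:

```python
import pandas as pd

# Raise the upper limit on displayed rows to 80
pd.set_option("display.max_rows", 80)
print(pd.get_option("display.max_rows"))  # prints 80
```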
reset_option(param)
reset_option takes an argument and sets the value back to the default value.
display.max_rows
Using reset_option(), we can change the value back to the default number of rows to be
displayed.
import pandas as pd
pd.reset_option("display.max_rows")
print(pd.get_option("display.max_rows"))
Its output is as follows −
60
describe_option(param)
describe_option prints the description of the argument.
display.max_rows
Using describe_option(), we can print the description of the option together with its
default and current values.
import pandas as pd
pd.describe_option("display.max_rows")
Its output is as follows −
display.max_rows : int
If max_rows is exceeded, switch to truncate view. Depending on
'large_repr', objects are either centrally truncated or printed as
a summary view. 'None' value means unlimited.
In case python/IPython is running in a terminal and `large_repr`
equals 'truncate' this can be set to 0 and pandas will auto-detect
the height of the terminal and print a truncated object which fits
the screen height. The IPython notebook, IPython qtconsole, or
IDLE do not run in a terminal and hence it is not possible to do
correct auto-detection.
[default: 60] [currently: 60]
option_context()
option_context context manager is used to set the option in with statement temporarily.
Option values are restored automatically when you exit the with block −
display.max_rows
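A short sketch of option_context() for this parameter. The temporary value 10 is arbitrary, and the names inside and outside are used only for illustration:

```python
import pandas as pd

with pd.option_context("display.max_rows", 10):
    # Inside the with block the temporary value is in effect
    inside = pd.get_option("display.max_rows")
    print(inside)  # prints 10

# After the block the previous value is restored (60 unless it was changed earlier)
outside = pd.get_option("display.max_rows")
print(outside)
```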
The most frequently used display parameters are −
1. display.max_rows: Maximum number of rows to display
2. display.max_columns: Maximum number of columns to display
3. display.expand_frame_repr: Whether wide DataFrames are stretched across pages
4. display.max_colwidth: Maximum column width
5. display.precision: Precision for decimal numbers
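THREE DIMENSIONAL PLOTTING
Step 1: Import the libraries
Two imports are needed. The first one, import matplotlib.pyplot as plt, is the standard import
statement for plotting using matplotlib, which you would see for 2D plotting as well.
The second one, from mpl_toolkits.mplot3d import Axes3D, enables 3D projections; older Matplotlib
releases require it, while recent releases register the 3D projection automatically. It is,
otherwise, not used anywhere else.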
Step 2: Create figure and axes
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(111, projection='3d')
Output:
After we create the axes object, we can use it to create any type of plot we want in
the 3D space.
To plot a single point, we will use the scatter() method and pass the three
coordinates of the point.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(2,3,4) # plot the point (2,3,4) on the figure
plt.show()
OUTPUT
To plot a continuous line, we will use the plot() method and pass 3 arrays, one each for the x, y, and z
coordinates of the points on the line.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
x = np.linspace(-4*np.pi,4*np.pi,50)
y = np.linspace(-4*np.pi,4*np.pi,50)
z = x**2 + y**2
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(x,y,z)
plt.show()
OUTPUT
We are generating x, y, and z coordinates for 50 points.
The x and y coordinates are generated using np.linspace, which produces 50 uniformly
spaced points between -4π and +4π. The z coordinate is simply the sum of the
squares of the corresponding x and y coordinates.
Customizing a 3D plot
Let us plot a scatter plot in 3D space and look at how we can customize its
appearance in different ways based on our preferences.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
np.random.seed(42)
xs = np.random.random(100)*10+20
ys = np.random.random(100)*5+7
zs = np.random.random(100)*15+50
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(xs,ys,zs)
plt.show()
OUTPUT
Let us now add a title to this plot
Adding a title
We will call the set_title method of the axes object to add a title to the plot.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
np.random.seed(42)
xs = np.random.random(100)*10+20
ys = np.random.random(100)*5+7
zs = np.random.random(100)*15+50
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# the title can only be set after the axes object has been created
ax.set_title("Atom velocity distribution")
ax.scatter(xs,ys,zs)
plt.show()
OUTPUT
GEOGRAPHIC DATA WITH BASEMAP
The following is a series of examples that illustrate how to use Basemap instance methods
to plot your data on a map. More examples are included in the examples directory of
the basemap source distribution. There are a number of Basemap instance methods for
plotting data:
Here are the examples (many of which utilize the netcdf4-python module to
retrieve datasets over http):
OUTPUT
Pandas and Seaborn are Python packages that make importing and
analyzing data much easier.
Seaborn
Seaborn is a Python visualization library for statistical graphics
plotting.
It is built on top of the Matplotlib library and is closely integrated with
the data structures from pandas.
It provides a high-level interface for drawing attractive and informative statistical
graphics.
● Usage: Those who want to create amplified data visuals, especially in color.
About Seaborn’s Pros and Cons:
● Pro: Includes higher-level interfaces and settings than Matplotlib does.
● Pro: Relatively simple to use, just like Matplotlib.
● Pro: Easier to use when working with DataFrames.
● Con: Like Matplotlib, its visualizations seem simpler than those of other tools.
Its features include –
● Built-in themes for styling Matplotlib graphics
● Visualizing univariate and bivariate data
● Fitting and visualizing linear regression models
● Plotting statistical time series data
● Working well with NumPy and Pandas data structures
Seaborn - Installation
Installing Seaborn should also be straightforward.
The following command will help you import Pandas:
# Pandas for managing datasets
import pandas as pd
Now, let us import the Matplotlib library, which helps us customize our plots.
# Matplotlib for additional customization
from matplotlib import pyplot as plt
We will import the Seaborn library with the following command:
# Seaborn for plotting and styling
import seaborn as sb
Sample code:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
OUTPUT
To view all the available data sets in the Seaborn library, you can call the
get_dataset_names() function, for example print(sb.get_dataset_names()). The list is
fetched online, so an internet connection is needed.
Visualizing data is one step; making the visualized data more pleasing
is another. Visualization plays a vital role in communicating quantitative
insights to an audience and in catching their attention.
Aesthetics means a set of principles concerned with the nature and appreciation of
beauty, especially in art. Visualization is the art of representing data in the most effective and
easiest possible way. The Matplotlib library is highly customizable, but
to make use of it one has to know which settings to tweak to achieve an attractive
plot. Unlike Matplotlib, Seaborn comes packed
with customized themes and a high-level interface for customizing and controlling
the look of Matplotlib figures.
import numpy as np
from matplotlib import pyplot as plt
def sinplot(flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, 5):
        plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)
sinplot()
plt.show()
OUTPUT
To change the same plot to Seaborn defaults, use the set() function:
import numpy as np
from matplotlib import pyplot as plt
def sinplot(flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, 5):
        plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)
import seaborn as sb
sb.set()
sinplot()
plt.show()
OUTPUT
The above two figures show the difference between the default Matplotlib and Seaborn plots.
The representation of the data is the same, but the representation style differs.
Seaborn splits the Matplotlib parameters into two groups:
● Plot styles
● Plot scale
The interface for manipulating the styles is set_style(). Using this function you can set
the theme of the plot. As of the latest version, the five themes below are
available.
● darkgrid
● whitegrid
● dark
● white
● ticks
The default theme of the plot is darkgrid, which we have seen in the
previous example.
import numpy as np
from matplotlib import pyplot as plt
def sinplot(flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, 5):
        plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)
import seaborn as sb
sb.set_style("whitegrid")
sinplot()
plt.show()
OUTPUT
The difference between the above two plots is the background color.
In the white and ticks themes, we can remove the top and right axis spines using
the despine() function.
Example
import numpy as np
from matplotlib import pyplot as plt
def sinplot(flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, 5):
        plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)
import seaborn as sb
sb.set_style("white")
sinplot()
sb.despine()
plt.show()
OUTPUT
In regular plots, we use the left and bottom axes only. Using the despine() function, we
can remove the unnecessary right and top axes spines, which Matplotlib does not offer as a single function call.
Overriding the Element
If you want to customize the Seaborn styles, you can pass a dictionary of parameters to
the set_style() function. The available parameters can be viewed using the axes_style() function.
Altering the value of any parameter alters the plot style.
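A small sketch of listing the available parameters, assuming Seaborn is installed. Here axes_style() is called with the darkgrid theme, and the loop simply prints each parameter name with its current value:

```python
import seaborn as sb

# axes_style() returns a dictionary of style parameters for the given theme
style = sb.axes_style("darkgrid")

# Print every parameter that can be overridden through set_style()
for name in sorted(style):
    print(name, "=", style[name])
```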
Example
import numpy as np
from matplotlib import pyplot as plt
def sinplot(flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, 5):
        plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)
import seaborn as sb
sb.set_style("darkgrid", {'axes.axisbelow': False})
sinplot()
sb.despine()
plt.show()
OUTPUT
We also have control over the plot elements and can control the scale of the plot using
the set_context() function.
There are four preset templates for contexts; based on relative size, the contexts
are named as follows:
● paper
● notebook
● talk
● poster
By default, the context is set to notebook, which was used in the plots above.
Example
import numpy as np
from matplotlib import pyplot as plt
def sinplot(flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, 5):
        plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)
import seaborn as sb
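sb.set_style("darkgrid", {'axes.axisbelow': False})
# scale the plot elements; any of paper, notebook, talk or poster can be used
sb.set_context("talk")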
sinplot()
sb.despine()
plt.show()
OUTPUT
Color Palette
Color palettes are classified into the following types:
● qualitative
● sequential
● diverging
Qualitative or categorical palettes are best suited to plotting categorical data.
OUTPUT
Sequential palettes are suitable for expressing the distribution of data ranging from relatively lower
values to higher values within a range.
Appending an additional character ‘s’ to the color passed to the color parameter (for example,
Greens instead of Green) gives the sequential palette.
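A minimal sketch of a sequential palette, assuming Seaborn is installed. The palette name "Greens" and the number of colors (5) are choices made for this example; palplot() draws the colors as a row of patches:

```python
import seaborn as sb
from matplotlib import pyplot as plt

# "Greens" is the color name Green with the extra character s appended
palette = sb.color_palette("Greens", 5)

# Show the five colors, running from light to dark
sb.palplot(palette)
plt.show()
```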
Example
OUTPUT
Diverging palettes use two different colors. Assume we are plotting data ranging from -1 to 1: the values from -1 to 0 take one color
and the values from 0 to +1 take another color.
By default, the values are centered at zero. You can control this with the center parameter
by passing a value.
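A minimal sketch of a diverging palette, assuming Seaborn is installed. The palette name "BrBG" (brown to blue-green) and the odd number of colors (7), which leaves a neutral color in the middle, are choices made for this example:

```python
import seaborn as sb
from matplotlib import pyplot as plt

# Two hues that meet in a light, neutral middle color
palette = sb.color_palette("BrBG", 7)

# Show the seven colors as a row of patches
sb.palplot(palette)
plt.show()
```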
OUTPUT