FDS Notes Unit-5

Uploaded by

Disha Singhal
Copyright
© © All Rights Reserved

FUNDAMENTALS OF DATA SCIENCE

21CSS202T
Unit-5
Visualization

 Advantages and Use Cases


 Working with Matplotlib to Plot Different Visuals
 Working with Seaborn to plot different visuals
 Univariate Graphs for Numeric and Categorical Data
 Bivariate Graphs for Numeric and Categorical Data
 Multivariate Graphs
 Choosing appropriate Graphical Techniques
 Using Graph to explore the data insights
 Introduction to Dashboards
VISUALIZATION
 Data visualization is one of the most important steps in the life cycle of data science,
data analytics, and data engineering. A study or analysis is more engaging and easier to
understand when it is presented with the help of colors and graphics.
 Using visualization elements like graphs, charts, maps, etc., it becomes easier for
clients to understand the underlying structure, trends, patterns and relationships
among variables within the dataset.
 Explaining a data summary and analysis with plain numbers alone is hard to follow
for people from both technical and non-technical backgrounds.
 Data visualization gives us a clear idea of what the data conveys. It makes the
insights in the data natural for us to understand.
 Libraries:
o Matplotlib
o Seaborn
o Bokeh
o Plotly

ADVANTAGES AND USE CASES

USING GRAPH TO EXPLORE THE DATA INSIGHTS


WORKING WITH MATPLOTLIB TO PLOT DIFFERENT
VISUALS
 Matplotlib is a low-level graph plotting library in Python that serves as a visualization
utility.
 Matplotlib was created by John D. Hunter.
 Matplotlib is open source and we can use it freely.
 Matplotlib is mostly written in Python; a few segments are written in C, Objective-C
and JavaScript for platform compatibility.

 Installation of Matplotlib
pip install matplotlib
 Import it in Your Applications
import matplotlib
 Check Version
print(matplotlib.__version__)
 Pyplot
import matplotlib.pyplot as plt
Example: Draw a line in a diagram from position (0,0) to position (6,250):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()

 Plotting Without Line


To plot only the markers, you can use shortcut string notation parameter 'o', which means 'rings'.
Example: Draw two points in the diagram, one at position (1, 3) and one in position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints, 'o')
plt.show()

 Multiple Points
Example: Draw a line in a diagram from position (1, 3) to (2, 8) then to (6, 1) and finally to
position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])
plt.plot(xpoints, ypoints)
plt.show()

 Markers
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, marker = 'o')
plt.show()
 Marker Reference
Marker   Description
'o'      Circle
'*'      Star
'.'      Point
','      Pixel
'x'      X
'X'      X (filled)
'+'      Plus
'P'      Plus (filled)
's'      Square
'D'      Diamond
'd'      Diamond (thin)
'p'      Pentagon
'H'      Hexagon
'h'      Hexagon
'v'      Triangle Down
'^'      Triangle Up
'<'      Triangle Left
'>'      Triangle Right
'1'      Tri Down
'2'      Tri Up
'3'      Tri Left
'4'      Tri Right
'|'      Vline
'_'      Hline

 Line Reference
Line Syntax   Description
'-'           Solid line
':'           Dotted line
'--'          Dashed line
'-.'          Dashed/dotted line

 Color Reference
Color Syntax  Description
'r'           Red
'g'           Green
'b'           Blue
'c'           Cyan
'm'           Magenta
'y'           Yellow
'k'           Black
'w'           White

 Hexadecimal color values
Example: Mark each point with a green color given as a hexadecimal value:
plt.plot(ypoints, marker = 'o', ms = 20, mec = '#4CAF50', mfc = '#4CAF50')
ms = markersize
mec = markeredgecolor
mfc = markerfacecolor
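The marker, line, and color codes from the reference tables can also be combined into one shortcut string, written in the order marker, line, color. The sketch below is a minimal illustration that reuses the data values from the earlier examples; the variable name `line` is chosen here only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])

# 'o:r' = circle markers ('o'), dotted line (':'), red color ('r')
(line,) = plt.plot(ypoints, 'o:r')

plt.show()
```

Each part of the string is optional, so 'o' alone draws only markers and ':r' draws a red dotted line without markers.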
 Different Types of Plots in Matplotlib
 Line Graph
 Stem Plot
 Bar chart
 Histograms
 Scatter Plot
 Stack Plot
 Box Plot
 Pie Chart
 Error Plot
 Violin Plot
 3D Plots

 Matplotlib Line
Example: Use a dotted line
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, linestyle = 'dotted')
plt.show()

plt.plot(ypoints, linestyle = 'dashed')

 Line Styles
Style Or
'solid' (default) '-'
'dotted' ':'
'dashed' '--'
'dashdot' '-.'
'None' '' or ' '
 Line Color
plt.plot(ypoints, color = 'r')

 Line Width
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, linewidth = 20.5)

 Multiple Lines
y1 = np.array([3, 8, 1, 10])
y2 = np.array([6, 2, 7, 11])
plt.plot(y1)
plt.plot(y2)

Example: Draw two lines by specifying the x- and y-point values for both
lines:
import matplotlib.pyplot as plt
import numpy as np
x1 = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
x2 = np.array([0, 1, 2, 3])
y2 = np.array([6, 2, 7, 11])
plt.plot(x1, y1, x2, y2)
plt.show()
 Add Grid Lines to a Plot
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110,
115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290,
300, 310, 320, 330])
plt.title("Sports Watch Data")
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.plot(x, y)
plt.grid()
plt.show()

 SubPlot: Display Multiple Plots


import matplotlib.pyplot as plt
import numpy as np
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
#the figure has 1 row, 2 columns, and this plot is the first plot.
plt.plot(x,y)
plt.title("SALES")

#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
#the figure has 1 row, 2 columns, and this plot is the second plot.
plt.plot(x,y)
plt.title("INCOME")
plt.suptitle("MY SHOP")
plt.show()
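The same two-panel figure can also be built with Matplotlib's object-oriented interface, where `plt.subplots` creates the figure and all of its axes in one call. This is a sketch of an alternative to the `plt.subplot` calls, reusing the SALES / INCOME data from the example; it is not the only way to lay out panels.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])

# One row, two columns; axes is an array holding one Axes object per panel.
fig, axes = plt.subplots(1, 2)

axes[0].plot(x, np.array([3, 8, 1, 10]))
axes[0].set_title("SALES")

axes[1].plot(x, np.array([10, 20, 30, 40]))
axes[1].set_title("INCOME")

# The super title belongs to the whole figure, not to a single panel.
fig.suptitle("MY SHOP")
plt.show()
```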

 Matplotlib Scatter
import matplotlib.pyplot as plt
import numpy as np
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()

 Matplotlib Bars
import matplotlib.pyplot as plt
import numpy as np
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.bar(x,y)
plt.show()

 Horizontal Bars
import matplotlib.pyplot as plt
import numpy as np
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.barh(x, y)
plt.show()

 Bar Width
plt.bar(x, y, width = 0.1)
 Bar Height
plt.barh(x, y, height = 0.1)

 Histogram
A histogram is a graph showing frequency distributions.
It is a graph showing the number of observations within each given interval.
import matplotlib.pyplot as plt
import numpy as np
x = np.random.normal(170, 10, 250)
plt.hist(x)
plt.show()
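To see the numbers behind the bars, the counts per interval can be computed directly with `np.histogram`. The sketch below assumes 10 bins (Matplotlib's default for `plt.hist`) and uses a fixed random seed, chosen arbitrarily here so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability
x = rng.normal(170, 10, 250)

# counts[i] is the number of values between edges[i] and edges[i + 1]
# (the last interval also includes its right edge).
counts, edges = np.histogram(x, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {n}")
```

The counts always add up to the number of observations, here 250.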

 Matplotlib Pie Charts


import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas",
"Cherries", "Dates"]
plt.pie(y, labels = mylabels)
plt.legend(title = "Four Fruits:")
plt.show()
WORKING WITH SEABORN TO PLOT DIFFERENT
VISUALS
Seaborn is a library mostly used for statistical plotting in Python. It is built on top of Matplotlib
and provides beautiful default styles and color palettes to make statistical plots more attractive.
Installation of Seaborn
pip install seaborn

Note: Seaborn has the following dependencies (the Python versions listed apply to
older Seaborn releases; current releases require Python 3):

Python 2.7 or 3.4+
numpy
scipy
pandas
matplotlib

Example:Using Seaborn with Matplotlib


import seaborn as sns
import matplotlib.pyplot as plt
data = sns.load_dataset("iris")
sns.lineplot(x="sepal_length",
y="sepal_width", data=data)
plt.title('Title using Matplotlib Function')
plt.show()

Example: Setting the xlim and ylim


import seaborn as sns
import matplotlib.pyplot as plt
data = sns.load_dataset("iris")
sns.lineplot(x="sepal_length",
y="sepal_width", data=data)
plt.xlim(5)
plt.show()
Types Of Seaborn Plots
A. Relational Plots in Seaborn
1. Scatter Plot
2. Line plot
3. Relational Plot (relplot):
B. Categorical Plots in Seaborn
1. Bar Plot (barplot):
2. Count Plot (countplot):
3. Box Plot (boxplot):
4. Violin Plot (violinplot):
5. Strip Plot (stripplot):
6. Swarm Plot
C. Distribution Plots in Seaborn
1. Histogram (histplot):
2. Kernel Density Estimate Plot (kdeplot):
3. Distribution Plot (displot):
4. Empirical Cumulative Distribution Function Plot (ecdfplot):
5. Rug Plot (rugplot):
D. Matrix Plots in Seaborn
1. Heatmap (heatmap):
2. Cluster Map (clustermap):
E. Pair Grid (PairGrid) in Seaborn
1. Pair Plot (pairplot):

1. Line plot: Lineplot is the most popular plot to draw a relationship between x and y with
the possibility of several semantic groupings. It is often used to track changes over intervals.
Syntax : sns.lineplot(x=None, y=None)
Parameters:
x, y: Input data variables; must be numeric. Can pass data directly or reference columns in data.

Example:
import pandas as pd
import seaborn as sns
data = {'Weight':[ 254, 354, 230, 253 ],
'Age':[ 21 , 28 , 29 , 30 ]}
df = pd.DataFrame( data )
sns.lineplot(x=df['Age'], y=df['Weight'])
2. Scatter Plot: Scatter plots are used to visualize the relationship between two numerical
variables. They help identify correlations or patterns. It can draw a two-dimensional graph.
Syntax: seaborn.scatterplot(x=None, y=None)
Parameters:
x, y: Input data variables that should be numeric.
Returns: This method returns the Axes object with the plot drawn onto it.

Example:
import pandas as pd
import seaborn as sns
data = {'Age':[ 21 , 22, 23,24,25, 28 , 29 , 30
], 'Weight':[ 230 , 221 , 243, 246, 265, 268,
259 , 228 ] }
df = pd.DataFrame( data )
sns.scatterplot(x=df['Age'],y=df['Weight'])

3. Box plot: A box plot (or box-and-whisker plot) is a visual representation of groups of
numerical data through their quartiles, plotted against continuous or categorical data.
A box plot consists of 5 things.
 Minimum
 First Quartile or 25%
 Median (Second Quartile) or 50%
 Third Quartile or 75%
 Maximum
Syntax: seaborn.boxplot(x=None, y=None, hue=None, data=None)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting. If x and y are absent, this is interpreted as wide-form.
Returns: It returns the Axes object with the plot drawn onto it.

import pandas as pd
import seaborn as sns
data = {'Name':[ 'Mohe' , 'Karnal' , 'Yrik' ,
'jack' ],'Age':[ 21 , 28 , 29, 30 ]}
df = pd.DataFrame( data )
sns.boxplot(x=df['Age'])
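The five values a box plot summarizes can be computed directly. This is a minimal sketch using NumPy on the same ages as the example; it uses NumPy's default (linear interpolation) percentile method, which is an assumption here since other quartile conventions exist.

```python
import numpy as np

ages = np.array([21, 28, 29, 30])

minimum = ages.min()
# 25th, 50th and 75th percentiles = first quartile, median, third quartile
q1, median, q3 = np.percentile(ages, [25, 50, 75])
maximum = ages.max()

print(minimum, q1, median, q3, maximum)
```

Note that the whiskers drawn by default stop at the most extreme points within 1.5 times the interquartile range (q3 - q1), so they coincide with the minimum and maximum only when there are no outliers.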
4. Violin Plot: A violin plot is similar to a boxplot. It shows the distribution of quantitative
data across one or more categorical variables so that those distributions can be compared.
Syntax: seaborn.violinplot(x=None, y=None, hue=None, data=None)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting.

Example:
import pandas as pd
import seaborn as sns
data = {'Name':[ 'Mohe' , 'Karnal' , 'Yrik' ,
'jack' ],'Age':[ 30 , 21 , 29 , 28 ]}
df = pd.DataFrame( data )
sns.violinplot(x=df['Age'])

5. Swarm plot: A swarm plot draws non-overlapping points against categorical data.
Syntax: seaborn.swarmplot(x=None, y=None, hue=None, data=None)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting.
Example:
import pandas
import seaborn
seaborn.set(style = 'whitegrid')
data = pandas.read_csv( "nba.csv" )
seaborn.swarmplot(x = data["Age"])

6. Bar plot: Barplot represents an estimate of central tendency for a numeric variable with
the height of each rectangle and provides some indication of the uncertainty around that estimate
using error bars.
Syntax : seaborn.barplot(x=None, y=None, hue=None, data=None)
Parameters :
x, y : Names of variables in data or vector data; inputs for plotting long-form data.
hue : (optional) Column name for colour encoding.
data : (optional) DataFrame, array, or list of arrays; the dataset for plotting. If x
and y are absent, this is interpreted as wide-form. Otherwise it is expected to be long-form.
Returns : Returns the Axes object with the plot drawn onto it.

Example:
import pandas
import seaborn
seaborn.set(style = 'whitegrid')
# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.barplot(x ="Age", y ="Weight",
data = data)
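The "estimate of central tendency" that each bar shows is, by default, the mean of the numeric variable within each group. The sketch below reproduces those bar heights with a pandas group-by; the small DataFrame is a hypothetical stand-in for nba.csv, with values invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for nba.csv; the values are invented for illustration.
df = pd.DataFrame({
    "Age":    [22, 22, 25, 25, 25, 30],
    "Weight": [200, 210, 220, 230, 240, 250],
})

# barplot's default estimator is the mean, so each bar height equals
# the mean Weight within one Age group.
bar_heights = df.groupby("Age")["Weight"].mean()
print(bar_heights)
```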

7. Point plot: A point plot shows point estimates and confidence intervals using scatter
plot glyphs. It represents an estimate of central tendency for a numeric variable by the
position of scatter plot points and provides some indication of the uncertainty around that
estimate using error bars.
Syntax: seaborn.pointplot(x=None, y=None, hue=None, data=None)
Parameters:
x, y: Inputs for plotting long-form data.
hue: (optional) column name for color encoding.
data: dataframe as a Dataset for plotting.
Return: The Axes object with the plot drawn onto it.
Example
import pandas
import seaborn
seaborn.set(style = 'whitegrid')
# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.pointplot(x = "Age", y = "Weight",
data = data)

8. Count plot: A count plot shows the counts of observations in each categorical bin
using bars.
Syntax : seaborn.countplot(x=None, y=None, hue=None, data=None)
Parameters :
x, y: (optional) Names of variables in data or vector data; inputs for
plotting long-form data.
hue : (optional) Column name for color encoding.
data : (optional) DataFrame, array, or list of arrays; the dataset for
plotting. If x and y are absent, this is interpreted as wide-form. Otherwise, it is expected to
be long-form.
Returns: Returns the Axes object with the plot drawn onto it.
Example:
import pandas
import seaborn
seaborn.set(style = 'whitegrid')
data = pandas.read_csv("nba.csv")
seaborn.countplot(x = data["Age"])
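The bar heights in a count plot are simply the number of rows per distinct value, which pandas can report with `value_counts`. This is a minimal sketch; the ages are hypothetical values standing in for the "Age" column of nba.csv.

```python
import pandas as pd

# Hypothetical ages standing in for the "Age" column of nba.csv.
ages = pd.Series([25, 27, 25, 30, 27, 25], name="Age")

# Number of rows per distinct value, sorted by age for readability.
counts = ages.value_counts().sort_index()
print(counts)
```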

9. KDE Plot: A KDE (Kernel Density Estimate) plot is used for visualizing the
probability density of a continuous variable. It depicts the probability density at different values
of that variable. Several samples can also be drawn on a single graph, which makes them
easier to compare.
Syntax: seaborn.kdeplot(x=None, *, y=None, vertical=False, palette=None, **kwargs)
Parameters:
x, y : vectors or keys in data
vertical : boolean (True or False)
data : pandas.DataFrame, numpy.ndarray, mapping, or sequence

Example:
import seaborn as sns
import pandas
data = pandas.read_csv("nba.csv").head()
sns.kdeplot(x=data['Age'], y=data['Number'])
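The idea behind a kernel density estimate can be sketched by hand for one variable: place a small Gaussian bump on every sample and average the bumps. The function name `kde_estimate`, the sample values, and the fixed bandwidth of 2.0 are all assumptions made for illustration; seaborn chooses a bandwidth automatically.

```python
import numpy as np

def kde_estimate(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    # z[i, j] = scaled distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

samples = np.array([21.0, 25.0, 26.0, 30.0, 34.0])  # hypothetical ages
grid = np.linspace(0.0, 60.0, 601)
density = kde_estimate(samples, grid, bandwidth=2.0)

# A probability density should integrate to roughly 1 over its support.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual samples; a larger one gives a smoother curve that may hide structure.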

10. Heatmap: A heatmap is a graphical representation of data where values in a matrix are
represented as colors. It’s often used to visualize the magnitude of values in a matrix, allowing
patterns and correlations to be easily identified.
Example:
import seaborn as sns
import matplotlib.pyplot as plt
flights = sns.load_dataset("flights")
flights_pivot = flights.pivot(index="month",
columns="year", values="passengers")
sns.heatmap(flights_pivot, annot=True,
fmt="d", cmap="YlGnBu")
plt.show()
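The `pivot` step is what turns the long-form table (one row per month and year) into the matrix that the heatmap colors cell by cell. The sketch below shows that reshaping on a four-row sample shaped like the flights dataset; the name `long_df` is chosen here for illustration.

```python
import pandas as pd

# Small long-form sample shaped like the flights dataset (one row per observation).
long_df = pd.DataFrame({
    "year":       [1949, 1949, 1950, 1950],
    "month":      ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Rows become months, columns become years, cells hold passenger counts.
wide = long_df.pivot(index="month", columns="year", values="passengers")
print(wide)
```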

11. Cluster Map: A cluster map is a heatmap that organizes rows and columns of a
dataset based on their similarity, often using hierarchical clustering. It’s useful for identifying
patterns and relationships in complex datasets by grouping similar rows and columns together.
import seaborn as sns
import matplotlib.pyplot as plt
flights = sns.load_dataset("flights")
flights_pivot = flights.pivot(index="month",
columns="year", values="passengers")
sns.clustermap(flights_pivot,
cmap="viridis", standard_scale=1)
plt.show()

12. Pair Plot: A pair plot creates a grid of scatterplots and histograms for each pair of
variables in a dataset, allowing for visual exploration of relationships and distributions between
variables. It’s particularly useful for identifying patterns and correlations in multivariate data.
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset("tips")
sns.pairplot(tips, hue="smoker",
palette="coolwarm")
plt.show()
UNIVARIATE GRAPHS FOR NUMERIC AND
CATEGORICAL DATA
Univariate Analysis is a type of data visualization where we visualize only a single variable at a
time. Univariate Analysis helps us to analyze the distribution of the variable present in the data
so that we can perform further analysis.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('Employee_dataset.csv')
print(data.head())

Histogram: Perform univariate analysis on Numerical variables


sns.histplot(data['age'])

Bar Chart: Univariate analysis of categorical data.


sns.countplot(x=data['gender_full'])
Pie Chart: A pie chart helps us to visualize the percentage of the data belonging to each
category.
x = data['STATUS_YEAR'].value_counts()
plt.pie(x.values,
labels=x.index,
autopct='%1.1f%%')
plt.show()

BIVARIATE GRAPHS FOR NUMERIC AND
CATEGORICAL DATA
Bivariate analysis is the simultaneous analysis of two variables. It explores the relationship
between them: whether an association exists and how strong it is, or whether the two
variables differ and how significant those differences are.
The three main types:
1. Categorical v/s Numerical
2. Numerical v/s Numerical
3. Categorical v/s Categorical data

1. Categorical v/s Numerical


import matplotlib.pyplot as plt
plt.figure(figsize=(15, 5))
sns.barplot(x=data['department_name'], y=data['length_of_service'])
plt.xticks(rotation=90)

2. Numerical V/s Numerical


sns.scatterplot(x=data['length_of_service'], y=data['age'])
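The strength of a linear association seen in a scatter plot can be put into a single number with the Pearson correlation coefficient. This is a minimal sketch; the two arrays are hypothetical values standing in for the `length_of_service` and `age` columns.

```python
import numpy as np

# Hypothetical values standing in for two numeric columns.
length_of_service = np.array([1, 3, 5, 7, 9])
age = np.array([24, 29, 33, 40, 44])

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
r = np.corrcoef(length_of_service, age)[0, 1]
print(round(r, 3))
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.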

3. Categorical V/s Categorical data


sns.countplot(x=data['STATUS_YEAR'], hue=data['STATUS'])
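The counts behind a count plot with a hue variable form a contingency table, which `pd.crosstab` produces directly. The rows below are hypothetical and only stand in for the `STATUS_YEAR` and `STATUS` columns of the employee dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the STATUS_YEAR and STATUS columns.
df = pd.DataFrame({
    "STATUS_YEAR": [2014, 2014, 2014, 2015, 2015],
    "STATUS": ["ACTIVE", "ACTIVE", "TERMINATED", "ACTIVE", "TERMINATED"],
})

# One row per year, one column per status, cells hold the row counts.
table = pd.crosstab(df["STATUS_YEAR"], df["STATUS"])
print(table)
```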

MULTIVARIATE GRAPHS
Multivariate analysis is an extension of bivariate analysis: it involves several variables at the
same time in order to find correlations between them. It is a set of statistical techniques that
examine patterns in multidimensional data by considering several variables at once. In the
example, x and y set the position of each point and a third variable z sets its color.
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [7.50, 3.50]
plt.rcParams["figure.autolayout"] = True
def func(x, y):
    return 3 * x + 4 * y - 2 + np.random.randn(30)
x, y = np.random.randn(2, 30)
y *= 100
z = func(x, y)
fig, ax = plt.subplots()
s = ax.scatter(x, y, c=z, s=100, marker='*', cmap='plasma')
fig.colorbar(s)
plt.show()

CHOOSING APPROPRIATE GRAPHICAL TECHNIQUES


 Charts for showing change over time
 Bar charts encode value by the heights of bars from a baseline.
 Line charts encode value by the vertical positions of points connected by line segments.
This is useful when a baseline is not meaningful, or if the number of bars would be
overwhelming to plot.
 A box plot can be useful when a distribution of values needs to be plotted for each time
period; each set of box and whiskers can show where the most common data values lie.

 Charts for showing part-to-whole composition


 The pie chart and its cousin, the donut chart, represent the whole with a circle divided
into slices, one per part.
 A stacked bar chart modifies a bar chart by dividing each bar into multiple sub-bars,
showing a part-to-whole composition within each primary bar.
 Similarly, a stacked area chart modifies the line chart by using shading under the line to
divide the total into sub-group values.
 Charts for looking at how data is distributed
 Bar charts are used when a variable is qualitative and takes a number of discrete values.
 A histogram is used when a variable is quantitative, taking numeric values.
 Alternatively, a density curve can be used in place of a histogram, as a smoothed
estimate of the underlying distribution.
 A violin plot compares numeric value distributions between groups by plotting a density
curve for each group.
 The box plot is another way of comparing distributions between groups, but with a
summary of statistics rather than an estimated distributional shape.

 Charts for comparing values between groups


 A bar chart compares values between groups by assigning a bar to each group.
 A dot plot can be used similarly, except with value indicated by point positions instead
of bar lengths. This is like a line chart with the line segments removed, eliminating the
‘connection’ between sequential points. Also like a line chart, a dot plot is useful when
including a vertical baseline would not be meaningful.
 A line chart can be used to compare values between groups across time by plotting one
line per group.
 A grouped bar chart allows for comparison of data across two different grouping
variables by plotting multiple bars at each location, not just one.
 Violin plots and box plots are used to compare data distributions between groups.
 A funnel chart is a specialist chart for showing how quantities move through a process,
like tracking how many visitors get from being shown an ad to eventually making a
purchase.
 Bullet charts are another specialist chart for comparing a true value to one or more
benchmarks.

 Charts for observing relationships between variables


 The scatter plot is the standard way of showing the relationship between two variables.
 Scatter plots can also be expanded to additional variables by adding color, shape, or size
to each point as indicators, as in a bubble chart.
 When a third variable represents time, points in a scatter plot can be connected with line
segments, generating a connected scatter plot.
 Another alternative for a temporal third-variable is a dual-axis plot, such as plotting a
line chart and bar chart with a shared horizontal axis.
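One way to make the selection rules concrete is a small lookup keyed on variable types. The function `suggest_chart` below is a hypothetical helper, not part of any library; its suggestions simply follow the univariate and bivariate examples in these notes and are a starting point rather than a rule.

```python
def suggest_chart(x_kind, y_kind=None):
    """Suggest a chart for 'numeric' / 'categorical' variables (hypothetical helper)."""
    if y_kind is None:
        # Univariate case: one variable only.
        return {"numeric": "histogram", "categorical": "bar chart"}[x_kind]
    kinds = {x_kind, y_kind}
    if kinds == {"numeric"}:
        return "scatter plot"
    if kinds == {"categorical"}:
        return "count plot with hue"
    # One categorical and one numeric variable.
    return "bar plot"

print(suggest_chart("numeric"))
print(suggest_chart("categorical", "numeric"))
```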

INTRODUCTION TO DASHBOARDS
Dash is a Python framework for building analytical web applications. It helps in building
responsive web dashboards that look good and run fast, without the need to understand
complex front-end frameworks or languages such as HTML, CSS, and JavaScript. Let’s build our
first web dashboard using Dash.

pip install dash

Step 1: Importing all the required libraries: import Dash, Dash Core
Components (which provides components such as graphs and inputs) and Dash
HTML Components (which provides HTML components such as meta tags, body tags,
and paragraph tags). Since Dash 2.0 both component sets ship inside the dash package.
import dash
from dash import dcc, html
Step 2: Designing a layout: make a graph with parameters such as id (a unique
ID for a particular graph) and figure (the graph itself, which holds the data and
the layout: title of the graph, X-axis, Y-axis, etc.).

The figure parameter is essentially a dictionary whose 'data' entries have elements like x, y,
type, and name.
x refers to the X-axis values (a list or a single element); y is the same for the Y-axis.
The type parameter refers to the type of the graph, such as line or bar.
The name parameter refers to the name of the data series, as shown in the legend.

app = dash.Dash()

app.layout = html.Div(children=[
    html.H1("Dash Tutorial"),
    dcc.Graph(
        id="example",
        figure={
            'data': [
                {'x': [1, 2, 3, 4, 5],
                 'y': [5, 4, 7, 4, 8],
                 'type': 'line',
                 'name': 'Trucks'},
                {'x': [1, 2, 3, 4, 5],
                 'y': [6, 3, 5, 3, 7],
                 'type': 'bar',
                 'name': 'Ships'}
            ],
            'layout': {
                'title': 'Basic Dashboard'
            }
        }
    )
])

Step 3: Running the server: The dashboard is now ready, but it needs a server
to run on.
if __name__ == '__main__':
    app.run()  # older Dash releases use app.run_server()
Open the app in a web browser on localhost at the default port 8050.

http://127.0.0.1:8050/
