Unit 2 - Data Visualization Techniques
Dr Franklin Lam
The Nature of Data
• Data: a collection of facts
• usually obtained as the result of
experiences, observations, or
experiments
A Simple Taxonomy of Data
Types of Data Sets: (1) Record Data
• Relational records
• Relational tables, highly structured
• Data matrix, e.g., numerical matrix, crosstabs
• Transaction data
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Document-term matrix (term counts per document):
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
• Molecular Structures
Types of Data Sets: (4) Spatial, Image and Multimedia Data
• Image data:
• Video data:
Characteristics of Structured Data
• Dimensionality
• Curse of dimensionality
• The data required to densely populate the space increases
exponentially as the dimension (i.e., # of variables)
increases. For example, the number of possible combinations
of 100 binary variables is 2^100 ≈ 1.26765 × 10^30.
• Sparsity
• Only presence counts
• Sparsity refers to the extent to which a measure
contains null values, or “NA”.
• The available data become sparse when the volume
of space increases with dimensionality.
• Resolution
• Patterns depend on the scale
• Distribution
• Centrality and dispersion
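The count quoted for 100 binary variables can be checked directly in R. The sketch below also shows how quickly a fixed sample becomes sparse as variables are added; the sample size of 1,000 and the chosen dimensions are arbitrary assumptions for illustration.

```r
# Number of distinct combinations of 100 binary variables
combos <- 2^100
print(combos)

# Upper bound on the share of possible combinations that a sample of
# n_obs observations can cover as the number of binary variables d grows.
# n_obs = 1000 is an illustrative assumption, not a value from the slides.
n_obs <- 1000
d <- c(5, 10, 20, 30)
coverage <- pmin(1, n_obs / 2^d)
print(data.frame(d = d, max_coverage = coverage))
```

Even with only 30 binary variables, 1,000 observations can touch less than one millionth of the possible combinations.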
Data Objects
• Data sets are made up of data objects
• A data object represents an entity
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples, examples, instances, data points, objects,
tuples
• Data objects are described by attributes
• Database rows → data objects; columns → attributes
Attribute Types
• Nominal: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal
• Values have a meaningful order (ranking) but magnitude between successive
values is not known
• Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in °C or °F, calendar dates
• No true zero-point
• Ratio
• Inherent zero-point
• We can speak of values as being an order of magnitude larger than
the unit of measurement (10 K is twice as high as 5 K).
• e.g., temperature in Kelvin, length, counts, monetary quantities
Discrete vs Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented using a finite
number of digits
• Continuous attributes are typically represented as floating-point variables
• R implementation:
• Continuous variables – Numeric (real numbers)
• Discrete variables – Integer
• Binary variables – Logical (TRUE/FALSE)
• Categorical variables – Factor
• Datasets – Data frame
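A minimal sketch of this mapping in R follows. All variable names and values are made up for illustration.

```r
# Hypothetical toy values, one vector per variable type
height   <- c(1.72, 1.65, 1.80)        # continuous  -> numeric
n_cards  <- c(1L, 3L, 2L)              # discrete    -> integer
is_fraud <- c(FALSE, TRUE, FALSE)      # binary      -> logical
size     <- factor(c("small", "large", "medium"),
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)     # categorical -> factor (ordered here)

# A dataset combines the columns in a data frame
df <- data.frame(height, n_cards, is_fraud, size)
str(df)
```

Setting `ordered = TRUE` is optional; it is used here only to show that an ordinal attribute can keep its ranking.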
The Art and Science of Data Preprocessing
• The real-world data is dirty, misaligned, overly
complex, and inaccurate
• Not ready for analytics!
• Readying the data for analytics is needed
• Data preprocessing
• Data consolidation
• Data cleaning
• Data transformation
• Data reduction
Data Preprocessing for Analytics
• Data in its original form (i.e., real-world data) is not usually ready
to be used in analytics tasks.
• Therefore, a tedious and time-consuming process (so-called data
preprocessing) is necessary to convert the raw real-world data
into a well-refined form for analytics algorithms.
• Time spent on data preprocessing can be significantly longer than
the time spent on the analytics tasks.
• Steps of data preprocessing include: data consolidation, data
cleaning, data transformation, and data reduction.
• Data reduction
1. Variables: dimension reduction (or variable selection)
2. Cases/samples:
• Probability sampling – simple random sampling, systematic sampling,
stratified sampling, and cluster sampling
• Resampling – bootstrap sampling
• Balance skewed data: oversample the less-represented classes or undersample the more-represented classes.
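The case-reduction options above can be sketched in base R. The toy data frame, the 20% sampling rate, and the seed are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
# Toy, imbalanced data: 90 "No" cases and 10 "Yes" cases
df <- data.frame(id = 1:100,
                 class = rep(c("No", "Yes"), times = c(90, 10)))

# Simple random sampling: 20 rows without replacement
srs <- df[sample(nrow(df), 20), ]

# Systematic sampling: every 5th row after a random start
start <- sample(5, 1)
systematic <- df[seq(start, nrow(df), by = 5), ]

# Stratified sampling: 20% of the rows within each class
idx <- unlist(lapply(split(seq_len(nrow(df)), df$class), function(i) {
  i[sample.int(length(i), round(0.2 * length(i)))]
}))
stratified <- df[idx, ]

# Bootstrap sampling: same size as the data, drawn with replacement
boot <- df[sample(nrow(df), replace = TRUE), ]

# Balancing: oversample the minority class up to the majority count
yes <- which(df$class == "Yes")
no  <- which(df$class == "No")
balanced <- df[c(no, sample(yes, length(no), replace = TRUE)), ]
table(balanced$class)
```

Cluster sampling is omitted because the toy data has no natural clusters.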
Measuring the Central Tendency: (1) Mean
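• Mean (algebraic measure) (sample vs. population):
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]
Note: n is sample size and N is population size.
• Weighted arithmetic mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
• Trimmed mean: the mean computed after chopping off extreme values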
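The three kinds of mean can be compared with base R functions. The data vector and the weights below are hypothetical; the last value is a deliberate outlier so that the effect of trimming is visible.

```r
x <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 340)  # 340 is a deliberate outlier
w <- rep(c(1, 2), each = 6)                           # hypothetical weights

plain    <- mean(x)              # arithmetic mean, pulled up by the outlier
weighted <- weighted.mean(x, w)  # sum(w * x) / sum(w)
trimmed  <- mean(x, trim = 0.1)  # drops 10% of the values at each end first

print(c(mean = plain, weighted = weighted, trimmed = trimmed))
```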
Measuring the Central Tendency: (2) Median
• Median:
• Middle value if odd number of values, or average of the middle two values otherwise
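• Estimated by interpolation (for grouped data):
\[ \text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) (L_2 - L_1) \]
where L_1 and L_2 are the lower and upper boundaries of the median interval, (\sum freq)_l is the sum of the frequencies of all intervals below it, and freq_median is the frequency of the median interval.
• Example: n = 3194, n/2 = 1597, (\sum freq)_l = 950, L_1 = 21, L_2 = 50, freq_median = 1500
\[ \text{median} = 21 + \left( \frac{1597 - 950}{1500} \right) (50 - 21) \approx 33.5 \]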
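The interpolation can be checked with a small helper. `grouped_median` and its argument names are hypothetical; the numbers are the ones from the example (n = 3194, 950 cases below the median interval, 1500 cases inside it, interval from 21 to 50).

```r
# Interpolated median for grouped data (hypothetical helper)
grouped_median <- function(n, freq_below, freq_median, lower, upper) {
  lower + (n / 2 - freq_below) / freq_median * (upper - lower)
}

med <- grouped_median(n = 3194, freq_below = 950, freq_median = 1500,
                      lower = 21, upper = 50)
print(round(med, 1))
```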
• Unimodal
• Empirical formula:
mean − mode ≈ 3 × (mean − median)
• Multi-modal
• Bimodal
• Trimodal
Measuring the Dispersion: variance and Standard Deviation
• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)
• Q: Can you compute it incrementally and efficiently?
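\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] \]
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2 \]
• Note: the subtle difference between the formulae for sample vs. population
• n: the size of the sample
• N: the size of the population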
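One possible answer to the question about incremental computation: the second form of the sample variance needs only the count, the sum, and the sum of squares, so these can be updated as data arrives in chunks. The sketch below assumes that approach; `update_state` and `sample_var` are hypothetical helper names, and the data is split into three arbitrary chunks.

```r
# Running state: count, sum, and sum of squares
state <- list(n = 0, sum = 0, sumsq = 0)

# Fold a new chunk of observations into the running state
update_state <- function(state, x_new) {
  list(n     = state$n + length(x_new),
       sum   = state$sum + sum(x_new),
       sumsq = state$sumsq + sum(x_new^2))
}

# Sample variance from the running state:
# s^2 = [sum(x^2) - (sum(x))^2 / n] / (n - 1)
sample_var <- function(state) {
  (state$sumsq - state$sum^2 / state$n) / (state$n - 1)
}

x <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)
for (chunk in split(x, rep(1:3, each = 4))) {
  state <- update_state(state, chunk)
}

print(sample_var(state))  # incremental result
print(var(x))             # one-pass reference from base R
```

This form is simple but can lose precision when the mean is large relative to the spread; a numerically safer update such as Welford's algorithm is a common alternative.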
Measuring the Dispersion: Quartiles & Boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: Data is represented with a box
• Q1, Q3, IQR: The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
• Median (Q2) is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum
and Maximum
• Outliers: points beyond a specified outlier threshold,
plotted individually
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
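A short sketch of the quantities behind a boxplot follows, using a made-up vector with one deliberately extreme value. It applies the 1.5 × IQR rule with quartiles from `quantile()`; note that `boxplot()` itself uses hinges, which can differ slightly from these quartiles.

```r
x <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 95)  # 95 is deliberately extreme

q   <- quantile(x, c(0.25, 0.75))   # Q1 and Q3
iqr <- q[[2]] - q[[1]]              # inter-quartile range

# Outlier thresholds: 1.5 x IQR below Q1 and above Q3
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

print(fivenum(x))   # min, lower hinge, median, upper hinge, max
print(outliers)
```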
boxplot(DomesTransc + IntTransc ~ Gender, data = data)
title("Number of Domestic + International Transactions")
Shape of a Distribution
• Histogram – frequency chart
• Skewness
• Measure of asymmetry
• Kurtosis
• Peak/tall/skinny nature of the distribution
Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of five-number summary
• Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i %
of the data are ≤ x_i
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in
the plane
Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• Histograms are used to show distributions of variables, while bar charts are
used to compare variables
• Histograms plot binned quantitative data, while bar charts plot categorical data
• Bars can be reordered in bar charts but not in histograms
• A histogram differs from a bar chart in that it is the area of the bar that
denotes the value, not the height as in bar charts, a crucial distinction when
the categories are not of uniform width
[Figures: Histogram; Bar chart]
Histograms Often Tell More than Boxplots
hist(dat_new$mpg, breaks = 10, main = "histogram plot of mile per gallon")
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall behavior
and unusual occurrences)
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles
of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit
prices of items sold at Branch 1 tend to be lower than those at Branch 2
qqnorm(dat_new$mpg)
qqline(dat_new$mpg, col = 4, lwd = 2)
Scatter Plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in
the plane
Positively and Negatively Correlated Data
pairs(dat_new, pch = 16,
      main = "Scatter plots between numerical variables in 'mtcars'")
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., due to
faulty instruments, human or computer error, and transmission error
• Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation = “ ” (missing data)
• Noisy: containing noise, errors, or outliers
• e.g., Salary = “−10” (an error)
• Inconsistent: containing discrepancies in codes or names, e.g.,
• Age = “42”, Birthday = “03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
# Mapping US state
library(data.table)
data <-fread(file="fraud.csv",header=TRUE)
state_map<-fread(file="US_State_Code_Mapping.csv")
data<-merge(data, state_map, by = 'state')
## Mapping gender
gender_map<-fread(file="Gender Map.csv")
data<-merge(data, gender_map, by = 'gender')
library(dplyr)
data <- data %>%
select(-c(gender, state, PostalCode)) %>% ## delete original coded variables
rename(CustomerID=custID, ## change column names
Gender=code,
DomesTransc=numTrans,
IntTransc=numIntlTrans,
FraudFlag=fraudRisk,
NumOfCards=cardholder,
OutsBal=balance,
State=StateName) %>%
mutate(FraudFlag = factor(FraudFlag, levels=c(0, 1), labels=c("No", "Yes"))) ## change FraudFlag to factor
How to Handle Missing Data?
(Source: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4)
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitation
• Inconsistency in naming convention
• Other data problems
• Duplicate records
• Incomplete data
• Inconsistent data
How to Handle Noisy Data?
• Binning
• First sort data and partition into (equal-frequency) bins
• Then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Regression
• Smooth by fitting the data into regression functions
• Clustering
• Detect and remove outliers
#visualize missing values with VIM package
library(VIM)
# in number
aggr(df_miss, prop=FALSE, numbers=TRUE)
# Matrix plot. Red for missing values, Darker values are high values.
matrixplot(df_miss, interactive=FALSE, sortby="HouseNetWorth")
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of
replacement values such that each old value can be identified with one of the new
values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing
Normalization
• Min-max normalization: to [new_min_A, new_max_A]
\[ v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A \]
• Ex. Let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]
• Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000} (1.0 - 0) + 0 = 0.716 \]
library(dplyr)
# normalize the data to [0,1] use rescale function of scales package
library(scales)
df<-df %>% mutate(HousePrice = NULL,
StoreArea = rescale(StoreArea),
BasementArea = rescale(BasementArea),
LawnArea = rescale(LawnArea))
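For comparison with `rescale()`, the min-max formula can also be written out in base R. `min_max` is a hypothetical helper; the call below reuses the income example (range $12,000 to $98,000, value $73,600, target range [0, 1]).

```r
# Min-max normalization of v from [min_a, max_a] to [new_min, new_max]
min_max <- function(v, min_a, max_a, new_min = 0, new_max = 1) {
  (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
}

income_scaled <- min_max(73600, min_a = 12000, max_a = 98000)
print(round(income_scaled, 3))

# Applied to a whole column, the observed minimum and maximum are used
x <- c(12000, 30000, 73600, 98000)
print(min_max(x, min(x), max(x)))
```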
Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers
Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately same
number of samples
• Good data scaling
• Managing categorical attributes can be tricky
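The two partitioning schemes can be sketched in base R. The price vector and the choice of N = 3 bins are assumptions for illustration, and ranking the values is just one simple way to form equal-depth bins.

```r
price <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)  # illustrative data
n_bins <- 3

# Equal-width partitioning: W = (B - A) / N
width <- (max(price) - min(price)) / n_bins
equal_width <- cut(price,
                   breaks = seq(min(price), max(price), by = width),
                   include.lowest = TRUE)
print(table(equal_width))   # bin counts may differ a lot

# Equal-depth partitioning: the same number of values per bin, by rank
ranks <- rank(price, ties.method = "first")
equal_depth <- ceiling(ranks * n_bins / length(price))
print(table(equal_depth))   # about the same count in every bin
```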
Example: Binning Methods for Data Smoothing
•Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
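The worked example can be reproduced in base R. This is a sketch: `smooth_boundary` is a hypothetical helper, the bin means are rounded to whole dollars as in the example, and ties between the two boundaries go to the lower one by assumption.

```r
price <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)  # already sorted
bins  <- rep(1:3, each = 4)                              # equal-frequency bins

# Smoothing by bin means (rounded to whole dollars)
by_mean <- round(ave(price, bins, FUN = mean))

# Smoothing by bin boundaries: replace each value with the closer of
# the bin minimum and the bin maximum
smooth_boundary <- function(x) {
  lo <- min(x)
  hi <- max(x)
  ifelse(x - lo <= hi - x, lo, hi)
}
by_boundary <- ave(price, bins, FUN = smooth_boundary)

print(split(by_mean, bins))
print(split(by_boundary, bins))
```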
Discretization without Supervision: Binning vs Clustering
Dimensionality Reduction Techniques
• Dimensionality reduction methodologies
• Feature selection: Find a subset of the original variables (or features,
attributes)
• Feature extraction: Transform the data in the high-dimensional space to a
space of fewer dimensions
• Some typical dimensionality methods
• Principal Component Analysis
• Supervised and nonlinear techniques
• Feature subset selection
• Feature creation
Principal Component Analysis (PCA)
• PCA: A statistical procedure that uses an
orthogonal transformation to convert a set of
observations of possibly correlated variables into
a set of values of linearly uncorrelated variables
called principal components
• The original data are projected onto a much
smaller space, resulting in dimensionality
reduction
• Method: Find the eigenvectors of the covariance
matrix; these eigenvectors define the new space
[Figure: a ball travels in a straight line; data from three cameras contain much redundancy]
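The eigenvector method can be sketched on the built-in `mtcars` data. The four columns are an arbitrary choice, and standardizing them first is an assumption made so that no single unit dominates. The result is compared with base R's `prcomp()`.

```r
# Standardize four (arbitrarily chosen) numeric columns of mtcars
x <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])

# Eigen-decomposition of the covariance matrix
eig <- eigen(cov(x))

# Project the data onto the eigenvectors: these are the principal components
scores <- x %*% eig$vectors

# Share of total variance captured by each component
var_share <- eig$values / sum(eig$values)
print(round(var_share, 3))

# Cross-check against prcomp(): its squared sdev equals the eigenvalues
pca <- prcomp(x)
print(all.equal(eig$values, pca$sdev^2))

# The components are linearly uncorrelated (off-diagonals are ~0)
print(round(cov(scores), 6))
```

Keeping only the first one or two columns of `scores` is what gives the dimensionality reduction.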
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information
contained in one or more other attributes
• E.g., purchase price of a product and the
amount of sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the
data mining task at hand
• Ex. A student’s ID is often irrelevant to the
task of predicting his/her GPA
Heuristic Search in Attribute Selection
• There are 2^d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
• Best single attribute under the attribute independence assumption: choose by
significance tests
• Best step-wise feature selection:
• The best single-attribute is picked first
• Then the next best attribute conditioned on the first, ...
• Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
• Best combined attribute selection and elimination
• Optimal branch and bound:
• Use attribute elimination and backtracking
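Step-wise selection and elimination can be tried with base R's `step()` on the built-in `mtcars` data. This is a sketch under stated assumptions: `step()` ranks attributes by AIC rather than by significance tests, and the response (`mpg`) and the five candidate predictors are arbitrary choices.

```r
# Step-wise forward selection: start from the empty model and add the
# attribute that improves AIC most, until no addition helps
null_model <- lm(mpg ~ 1, data = mtcars)
forward <- step(null_model,
                scope = list(lower = ~ 1,
                             upper = ~ wt + hp + disp + qsec + drat),
                direction = "forward", trace = 0)

# Step-wise backward elimination: start from the full model and
# repeatedly drop the attribute whose removal improves AIC most
full_model <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)
backward <- step(full_model, direction = "backward", trace = 0)

print(formula(forward))
print(formula(backward))
```

Passing `direction = "both"` combines selection and elimination in one search.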
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important information in
a data set more effectively than the original ones
• Three general methodologies
• Attribute extraction
• Domain-specific
• Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)
• Attribute construction
• Combining features
• Data discretization
Business Reporting
Definitions and Concepts
• Report = Information → Decision
• Report?
• Any communication artifact prepared to convey
specific information
• A report can fulfill many functions
• To ensure proper departmental functioning
• To provide information
• To provide the results of an analysis
• To persuade others to act
• To create an organizational memory…
What is a Business Report?
• A written document that contains information
regarding business matters.
• Purpose: to improve managerial decisions
• Source: data from inside and outside the organization
(via the use of ETL or ELT)
• Format: text + tables + graphs/charts
• Distribution: in-print, email, portal/intranet
[Figure: deployment chart with DEPT 1 to DEPT 4 (steps 1 to 5), data repositories, information (reporting), and decision maker]
(Source: Business intelligence, analytics, and data science: a managerial perspective)
Types of Business Reports
• Metric Management Reports
• Help manage business performance through metrics (Service-level
agreements (SLAs) for externals; KPIs for internals)
• Can be used as part of Six Sigma and/or TQM
• Dashboard-Type Reports
• Graphical presentation of several performance indicators in a
single page using dials/gauges
• Balanced Scorecard–Type Reports
• Include financial, customer, business process, and learning &
growth indicators
Dashboard application examples
An example of a Dundas BI
dashboard, complete with data
visualizations.
https://selecthub.com/business-intelligence/business-intelligence-vs-business-analytics/
Dashboard application examples
A business analytics
dashboard from
Sisense.
https://selecthub.com/business-intelligence/business-intelligence-vs-business-analytics/
Data Visualization
“The use of visual representations to explore, make
sense of, and communicate data.”
• Data visualization vs. Information visualization
• Information = aggregation, summarization, and
contextualization of data
• Related to information graphics, scientific
visualization, and statistical graphics
• Often includes charts, graphs, illustrations, …
Data Visualization
• Why data visualization?
• Gain insight into an information space by mapping data onto graphical
primitives
• Provide qualitative overview of large data sets
• Search for patterns, trends, structure, irregularities, relationships among data
• Help find interesting regions and suitable parameters for further quantitative
analysis
• Provide a visual proof of computer representations derived
[Figure panels: (a) income, (b) credit limit, (c) transaction volume, (d) age]
Laying Out Pixels in Circle Segments
• To save space and show the connections among multiple dimensions, space
filling is often done in a circle segment
• Bullet graph
Geometric Projection Visualization Techniques
• Bar and column charts
• Line chart
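library(reshape2)  # melt() is assumed to come from reshape2
# Reshape the wide GDP table into long format, one row per country and year
GDP_Long_Format <- melt(GDP, id="Country")
names(GDP_Long_Format) <- c("Country", "Year","GDP_USD_Trillion")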
• Funnel chart
GCD_China <- read.csv("China - USD - Percentage.csv") %>%
melt(id = "Sector", variable.name="Income_Group",
value.name="Perc_Cont") # melt by sector
Geometric Projection Visualization Techniques
• Scatter and bubble charts
• Scatterplot matrices
• Scatter-high density
bc<- read.delim("BubbleChart_Data.txt") %>%
filter(continent != "Oceania", year==2007) %>% # filter by continent and year
droplevels() # drop unused levels
Thematic Map
A thematic map is a type of map specifically designed to show a particular theme
connected with a specific geographic area.
• Choropleth
• Proportional symbol
3D/Volumetric Visualization Techniques
• 3D scatter plot
• Surface rendering
• 3D computer graphics
[Figure: glass surface]
Temporal
• Time series
• Waterfall chart: a form of data visualization that helps in understanding the cumulative
effect of sequentially introduced positive or negative values
waterfallchart(Net~Time_Period, data=footfall,col =
as.factor(footfall$Type),
xlab = "Time Period(Month)",ylab="Footfall",
main="Footfall by Month")
Temporal
• Polar area chart: the polar area diagram is similar to a usual pie chart, except that sectors
have equal angles and differ in how far each sector extends from the center of the circle.
The polar area diagram is used to plot cyclic phenomena (e.g., counts of deaths by month).
Hierarchical Visualization Techniques
• Visualization of the data using a hierarchical partitioning into
subspaces
• Methods
• Treemap
• Heatmap
Treemap
• Treemaps display hierarchical (tree-structured) data as a set of nested rectangles.
• Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles
representing sub-branches.
• A leaf node's rectangle has an area proportional to a specified dimension of the data. Often the
leaf nodes are colored to show a separate dimension of the data.
Heatmap
• A heatmap is a graphical representation of
data where the individual values contained
in a matrix are represented as colors.
bc <- read.csv("Region Wise Data.csv") %>%
melt(id = c("Region","Indicator"), variable.name="Year",
value.name="Inc_Value") %>% # melt by region and indicator
mutate(Year = substring(as.character(Year), 2)) %>% # extract the year (drop the first character)
group_by(Indicator) %>%
mutate(Inc_Value = ifelse(is.na(Inc_Value),
mean(Inc_Value, na.rm=TRUE), Inc_Value)) %>% # replace NA by mean
mutate(rescale=rescale(Inc_Value)) # rescale the data
Visualizing Complex Data and Relations: Tag Cloud
• Tag cloud: Visualizing user-generated
tags
• The importance of tag is
represented by font size/color
• Popularly used to visualize
word/phrase distributions
Visualizing Complex Data and Relations: Social Networks
• Visualizing non-numerical data: social and information networks
[Figures: a social network; organizing information networks]
Which Chart or Graph Should You Use?
Visual Analytics
• A recently coined term
• Information visualization + predictive analytics
• Information visualization
• Descriptive, backward focused
• “what happened”, “what is happening”
• Predictive analytics
• Predictive, future focused
• “what will happen”, “why will it happen”
Visual Analytics by SAS Institute
• At teradatauniversitynetwork.com, you can learn
more about SAS VA, experiment with the tool
Performance Dashboards
• Performance dashboards are commonly used in BPM
(Business Process Management) software suites and BI
platforms
• Dashboards provide visual displays of important
information that is consolidated and arranged on a single
screen so that information can be digested at a single
glance and easily drilled in and further explored
Performance Dashboards
• Dashboard design
• The fundamental challenge of dashboard design is to display all the
required information on a single screen, clearly and without
distraction, in a manner that can be assimilated quickly
• Three layers of information
• Monitoring – graphical, abstracted data to monitor key performance
metrics
• Analysis – summarized dimensional data to analyze the root cause of
problems
• Management – detailed operational data that identify what actions to
take to resolve a problem
Performance Dashboards
• What to look for in a dashboard
• Use of visual components to highlight data and exceptions
that require action
• Transparent to the user, meaning that they require minimal
training and are extremely easy to use
• Combine data from a variety of systems into a single,
summarized, unified view of the business
• Enable drill-down or drill-through to underlying data sources
or reports
• Present a dynamic, real-world view with timely data
• Require little coding to implement, deploy, and maintain
Best Practices in Dashboard Design
• Benchmark KPIs with Industry Standards
• Do a gap assessment against industry benchmarks to align with
industry best practices.
Best Practices in Dashboard Design
• Pick the Right Visual Constructs
• In presenting information in a dashboard, some information is
presented best with bar charts, some with time series line graphs, and
when presenting correlations, a scatter plot is useful. Sometimes
merely rendering it as simple tables is effective.
• Explicitly document the dashboard design principles so that all the
developers working on the front end can adhere to the same principles
while rendering the reports and dashboard.