Unit 2 - Data Visualization Techniques

This document discusses various data visualization techniques and concepts related to data. It begins by defining data and explaining that data is the lowest level of abstraction from which information and knowledge are derived. It then presents a simple taxonomy of data, including record data, graphs and networks, ordered data, spatial/image data, and characteristics of structured data such as dimensionality and sparsity. The document also discusses data objects, attribute types, discrete vs continuous attributes, and data preprocessing techniques.


BIA B452F

Unit 2
Data visualization
techniques
Dr Franklin Lam
The Nature of Data
• Data: a collection of facts
• usually obtained as the result of
experiences, observations, or
experiments

• Data may consist of numbers,


words, images, …
• Data is the lowest level of
abstraction (from which information
and knowledge are derived)
• Data is the source for information
and knowledge
• Data quality and data integrity are
critical to analytics
(Source: Business intelligence, analytics, and data science : a managerial perspective)

2
A Simple Taxonomy of Data

(Source: Business intelligence, analytics, and data science : a managerial perspective)

3
Types of Data Sets: (1) Record Data
• Relational records
  • Relational tables, highly structured
  • Data matrix, e.g., numerical matrix, crosstabs

• Transaction data, e.g.:

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

• Document data: term-frequency vector (matrix) of text documents, e.g.:

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1     3      0     5     0      2     6    0     2        0       2
  Document 2     0      7     0     2      1     0    0     3        0       0
  Document 3     0      1     0     0      1     2    2     0        3       0

4
Types of Data Sets: (2) Graphs and Networks
• Transportation network

• World Wide Web

• Molecular Structures

• Social or information networks


5
Types of Data Sets: (3) Ordered Data
• Video data: sequence of images

• Temporal data: time-series

• Sequential Data: transaction sequences

• Genetic sequence data

6
Types of Data Sets: (4) Spatial, Image and Multimedia Data

• Spatial data: maps

• Image data:

• Video data:
7
Characteristics of Structured Data
• Dimensionality
• Curse of dimensionality
• Data required to densely populate the space increases
exponentially as the dimension (i.e., # of variables)
increases. For example, the number of possible combinations
of 100 binary variables = 2^100 ≈ 1.26765 × 10^30.

• Sparsity
• Only presence counts
• Sparsity refers to the extent to which a measure
contains null values, or “NA”.
• The available data become sparse when the volume
of space increases with dimensionality.

• Resolution
• Patterns depend on the scale
• Distribution
• Centrality and dispersion
8
Data Objects
• Data sets are made up of data objects
• A data object represents an entity
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples
• Data objects are described by attributes
• Database rows → data objects; columns → attributes
9
Attribute Types
• Nominal: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal
• Values have a meaningful order (ranking) but magnitude between successive
values is not known
• Size = {small, medium, large}, grades, army rankings
10
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in °C or °F, calendar dates
• No true zero-point
• Ratio
• Inherent zero-point
• We can speak of values as being an order of magnitude larger than
the unit of measurement (10 K is twice as high as 5 K).
• e.g., temperature in Kelvin, length, counts, monetary quantities

11
Discrete vs Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented using a finite
number of digits
• Continuous attributes are typically represented as floating-point variables
12
• R implementation:
• Continuous variables – Numeric (real numbers)
• Discrete variables – Integer
• Binary variables – Logical (True/False)
• Categorical variables – Factor
• Datasets – Data frame
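The mapping above can be illustrated with a short base-R sketch (the variable names and values are illustrative, not from the source):

```r
# Illustrative R representations of each attribute kind
height   <- c(1.72, 1.65, 1.80)           # continuous  -> numeric (real numbers)
children <- c(0L, 2L, 1L)                 # discrete    -> integer
employed <- c(TRUE, FALSE, TRUE)          # binary      -> logical
size     <- factor(c("small", "large", "medium"),
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)        # categorical/ordinal -> factor
df <- data.frame(height, children, employed, size)  # dataset -> data frame
str(df)  # shows the storage type of each column
```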

13
The Art and Science of Data Preprocessing
• Real-world data is dirty, misaligned, overly
complex, and inaccurate
• Not ready for analytics!
• Readying the data for analytics is needed
• Data preprocessing
• Data consolidation
• Data cleaning
• Data transformation
• Data reduction

• Art – it develops and improves with experience

14
Data Preprocessing for Analytics
• Data in its original form (i.e., real-world data) is not usually ready
to be used in analytics tasks.
• Therefore, a tedious and time-consuming process (so-called data
preprocessing) is necessary to convert the raw real-world data
into a well-refined form for analytics algorithms.
• Time spent on data preprocessing can be significantly longer than
the time spent on the analytics tasks.
• Steps of data preprocessing include: data consolidation, data
cleaning, data transformation, and data reduction.
• Data reduction
1. Variables: dimension reduction (or variable selection)
2. Cases/samples:
• Probability sampling – simple random sampling, systematic sampling,
stratified sampling, and cluster sampling
• Resampling – bootstrap sampling

(Source: Business intelligence, analytics, and data science : a managerial perspective)


15
Data Preprocessing Tasks and Methods
Main Task Subtasks Popular Methods
Data consolidation Access and collect the data SQL queries, software agents, Web services.
Select and filter the data Domain expertise, SQL queries, statistical tests.
Integrate and unify the data SQL queries, domain expertise, ontology-driven data mapping.
Data cleaning Handle missing values in Fill in missing values (imputations) with most appropriate values (mean, median, min/max,
the data mode, etc.); recode the missing values with a constant such as “ML”; remove the record of
the missing value; do nothing.
Identify and reduce noise in Identify the outliers in data with simple statistical techniques (such as averages and
the data standard deviations) or with cluster analysis; once identified, either remove the
outliers or smooth them by using binning, regression, or simple averages.
Find and eliminate Identify the erroneous values in data (other than outliers), such as odd values,
erroneous data inconsistent class labels, odd distributions; once identified, use domain expertise to
correct the values or remove the records holding the erroneous values.
Data transformation Normalize the data Reduce the range of values in each numerically valued variable to a standard range (e.g., 0
to 1 or –1 to +1) by using a variety of normalization or scaling techniques.
Discretize or aggregate the If needed, convert the numeric variables into discrete representations using range or
data frequency-based binning techniques; for categorical variables, reduce the number of values
by applying proper concept hierarchies.
Construct new attributes Derive new and more informative variables from the existing ones using a wide
range of mathematical functions (as simple as addition and multiplication or as
complex as a hybrid combination of log transformations).
Data reduction Reduce number of Principal component analysis, independent component analysis, chi-square testing,
attributes correlation analysis, and decision tree induction.
Reduce number of records Random sampling, stratified sampling, expert-knowledge-driven purposeful sampling.

Balance skewed data Oversample the less represented or undersample the more represented classes.

(Source: Business intelligence, analytics, and data science : a managerial perspective) 16


Application – Improving Student Retention with Data-driven Analytics: Proposed Approach to Predicting Student Attrition
• Student attrition (i.e., drop-out) has become one of the most
challenging problems in academic institutions.
• What are the common techniques to deal with student attrition?
• Analytics versus theoretical approaches to the student retention problem

2.1 Delen, D. (2010) “A comparative analysis of machine learning techniques
for student retention management”, Decision Support Systems, Vol. 49,
No. 4, 498-506.
(Source: Business intelligence, analytics, and data science : a managerial perspective)
17
Application – Improving Student Retention with Data-driven Analytics
• Variables were obtained from student records
• 5 years of freshman student data from a single institution with
an average enrollment of 23,000 students
• Freshman student retention rate is about 80% (i.e., “Second
Fall Registered” = Yes)
• The data contained variables related to students' academic,
financial, and demographic characteristics.
• Data cleaning:
  • Identify and remove anomalies and unusable records,
    e.g., all international student records were removed because
    they did not contain information about some of the most
    reputed predictors (e.g., high school GPA, SAT scores)
• Transformation:
  • Some variables were aggregated (e.g., the “Major” and
    “Concentration” variables were aggregated into the binary variables
    “MajorDeclared” and “ConcentrationSpecified”) for better
    interpretation in the predictive modeling.
  • Some variables were used to derive new variables, e.g.,
    “Earned/Registered = EarnedHours/RegisterHours” to
    represent the students' resiliency and determination in
    their first semester.
18
Application – Improving Student Retention with Data-driven Analytics
• The dependent variable (i.e., “Second Fall Registered”)
contained many more yes records (~80%) than no records (~20%).
• The imbalanced data has a negative impact on model performance.
• The dataset was balanced by taking all the samples from the
minority class (i.e., the “No” class) and randomly selecting an
equal number of samples from the majority class (i.e., the “Yes”
class), yielding samples that are 50% Yes; this process was
repeated ten times to reduce the bias of random sampling.
(Source: Business intelligence, analytics, and data science : a managerial perspective)
19
Application – Improving Student Retention with Data-driven Analytics
• Data mining methods can predict freshman student attrition with
approximately 80% accuracy
• The balanced dataset produced better predictions
• For individual methods, SVM > DT > ANN > LR
(Source: Business intelligence, analytics, and data science : a managerial perspective)
20
Basic Statistical Descriptions of Data
• Motivation
  • To better understand the data: central tendency, variation, and spread
• Data dispersion characteristics
  • Median, max, min, quantiles, outliers, variance, ...
• Numerical dimensions correspond to sorted intervals
• Data dispersion:
  • Analyzed with multiple granularities of precision
  • Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
  • Folding measures into numerical dimensions
  • Boxplot or quantile analysis on the transformed cube

21
Measuring the Central Tendency: (1) Mean
• Mean (algebraic measure), sample vs. population
(n is the sample size and N is the population size):

  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$

• Weighted arithmetic mean:

  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

• Trimmed mean:
  • Chopping extreme values (e.g., Olympics gymnastics score computation)
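The three means above map directly onto base R (the score vector and weights below are illustrative):

```r
x <- c(9.0, 9.2, 9.5, 9.4, 3.1, 9.8)   # e.g., judges' scores, with one extreme value
w <- c(1, 1, 1, 1, 1, 2)               # illustrative weights

mean(x)               # arithmetic mean
weighted.mean(x, w)   # weighted arithmetic mean: sum(w * x) / sum(w)
mean(x, trim = 0.2)   # trimmed mean: chops 20% of the values at each end
```

Note how the trimmed mean is far less affected by the extreme score 3.1 than the plain mean.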

22
Measuring the Central Tendency: (2) Median
• Median:
  • Middle value if odd number of values, or average of the middle two values otherwise
  • Estimated by interpolation (for grouped data):

  $median = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \times width$

  where $L_1$ is the lower limit of the median interval, $(\sum freq)_l$ is the
  sum of the frequencies of the intervals before the median interval,
  $freq_{median}$ is the frequency of the median interval, and
  $width = L_2 - L_1$ is the interval width.

  • Example: $n = 3194$, so $n/2 = 1597$; $(\sum freq)_l = 950$;
    $freq_{median} = 1500$; the median interval is $[21, 50)$, so
    $L_1 = 21$ and $width = 29$. Then

    $median = 21 + \left(\frac{1597 - 950}{1500}\right) \times 29 \approx 33.5$
23
Measuring the Central Tendency: (3) Mode
• Mode: Value that occurs most frequently in the
data

• Unimodal
• Empirical formula (for moderately skewed data):
  mean − mode ≈ 3 × (mean − median)

• Multi-modal
• Bimodal

• Trimodal
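Base R has no built-in statistical mode function; a common minimal sketch (the helper name `stat_mode` is ours, not a standard R function) is:

```r
# Statistical mode: the most frequent value(s).
# Returns every tied value, so it also reveals bi-/multi-modal data.
stat_mode <- function(v) {
  tab <- table(v)
  names(tab)[tab == max(tab)]
}

stat_mode(c(1, 2, 2, 3, 3, 3))        # unimodal: "3"
stat_mode(c("a", "a", "b", "b", "c")) # bimodal: "a" and "b"
```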

24
Measuring the Dispersion: Variance and Standard Deviation
• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)
• Q: Can you compute it incrementally and efficiently?

  Sample variance:

  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

  Population variance:

  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

  Note the subtle difference between the formulae for sample vs. population:
  • n: the size of the sample
  • N: the size of the population

• Standard deviation s (or σ) is the square root of variance s2 (or σ2)
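A base-R sketch of the formulas above (the data vector is illustrative); note that `var()` and `sd()` use the sample (n − 1) denominator:

```r
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)
n <- length(x)

var(x)                                     # sample variance (divides by n - 1)
sd(x)                                      # sample standard deviation = sqrt(var(x))
sum(x^2)/(n - 1) - sum(x)^2/(n * (n - 1))  # same value via the shortcut formula
mean((x - mean(x))^2)                      # population variance (divides by n)
```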

25
Measuring the Dispersion: Quartiles & Boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: Data is represented with a box
• Q1, Q3, IQR: The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
• Median (Q2) is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum
and Maximum
• Outliers: points beyond a specified outlier threshold,
plotted individually
  • Usually, values more than 1.5 × IQR above Q3 or below Q1
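The five-number summary and the 1.5 × IQR outlier rule can be computed directly in base R (illustrative data):

```r
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)

fivenum(x)                       # min, Q1, median, Q3, max
q   <- quantile(x, c(0.25, 0.75))
iqr <- IQR(x)                    # Q3 - Q1
low  <- q[1] - 1.5 * iqr         # lower outlier threshold
high <- q[2] + 1.5 * iqr         # upper outlier threshold
x[x < low | x > high]            # points flagged as outliers
```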

(Source: Business intelligence, analytics, and data science : a managerial perspective)


26
Visualization of Data Dispersion: 3-D Boxplots

27
boxplot(DomesTransc + IntTransc ~ Gender, data = data)
title("Number of Domestic + International Transaction")

28
Shape of a Distribution
• Histogram – frequency chart
• Skewness
• Measure of asymmetry

• Kurtosis
  • Measure of the peakedness and tail weight of the distribution

29
Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of five-number summary

• Histogram: x-axis are values, y-axis represents frequencies

• Quantile plot: each value xi is paired with fi indicating that approximately
100·fi % of data are ≤ xi

• Quantile-quantile (q-q) plot: graphs the quantiles of one univariant distribution


against the corresponding quantiles of another

• Scatter plot: each pair of values is a pair of coordinates and plotted as points in
the plane

30
Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• Differences between histograms and bar charts:
  • Histograms are used to show distributions of variables, while bar
    charts are used to compare variables
  • Histograms plot binned quantitative data, while bar charts plot
    categorical data
  • Bars can be reordered in bar charts but not in histograms
  • A histogram differs from a bar chart in that it is the area of the
    bar that denotes the value, not the height as in bar charts, a
    crucial distinction when the categories are not of uniform width
[Figures: an example histogram and an example bar chart]
31
Histograms Often Tell More than Boxplots

• The two histograms shown on the left may have the same boxplot
representation
  • The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions

32
hist(dat_new$mpg, breaks = 10, main = "histogram plot of mile per gallon")

33
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall behavior
and unusual occurrences)

• Plots quantile information


• For data xi sorted in increasing order, fi indicates that approximately
100·fi % of the data are below or equal to the value xi

34
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles
of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit
prices of items sold at Branch 1 tend to be lower than those at Branch 2

35
qqnorm(dat_new$mpg)
qqline(dat_new$mpg, col = 4, lwd = 2)

36
Scatter Plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in
the plane

37
Positively and Negatively Correlated Data

• The left half fragment is positively correlated

• The right half is negatively correlated
38
Uncorrelated Data

39
pairs(dat_new, pch = 16,
      main = "Scatter plots between numerical variables in 'mtcars' ")

plot(dat_new$hp, dat_new$qsec, type = "p", pch = 16,
     main = "horsepower and 1/4 mile time")

40
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument
faulty, human or computer error, and transmission error
• Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation = “ ” (missing data)
• Noisy: containing noise, errors, or outliers
• e.g., Salary = “−10” (an error)
• Inconsistent: containing discrepancies in codes or names, e.g.,
• Age = “42”, Birthday = “03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?

41
# Mapping US state
library(data.table)
data <-fread(file="fraud.csv",header=TRUE)
state_map<-fread(file="US_State_Code_Mapping.csv")
data<-merge(data, state_map, by = 'state')

## Mapping gender
gender_map<-fread(file="Gender Map.csv")
data<-merge(data, gender_map, by = 'gender')

## Mapping credit line


credit_map<-fread(file="credit line map.csv")
data<-merge(data, credit_map, by = 'creditLine')

library(dplyr)
data <- data %>%
select(-c(gender, state, PostalCode)) %>% ## delete original coded variables
rename(CustomerID=custID, ## change column names
Gender=code,
DomesTransc=numTrans,
IntTransc=numIntlTrans,
FraudFlag=fraudRisk,
NumOfCards=cardholder,
OutsBal=balance,
State=StateName) %>%
  mutate(FraudFlag = factor(FraudFlag, levels=c(0, 1),
                            labels=c("No", "Yes")))  ## change FraudFlag to factor
42
How to Handle Missing Data?

(Source: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4)
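The common simple strategies (impute with the mean/median, recode with a constant, or drop the record) can be sketched in R on a toy data frame (the data below are illustrative):

```r
df <- data.frame(age        = c(25, NA, 31, 40),
                 occupation = c("clerk", "nurse", NA, "chef"))

# 1. Impute a numeric column with its mean (or median)
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# 2. Recode missing categorical values with a constant such as "ML"
df$occupation[is.na(df$occupation)] <- "ML"

# 3. Alternatively, drop any record containing a missing value
# df <- na.omit(df)
```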

43
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitation
• Inconsistency in naming convention
• Other data problems
• Duplicate records
• Incomplete data
• Inconsistent data

44
How to Handle Noisy Data?
• Binning
• First sort data and partition into (equal-frequency) bins
• Then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Regression
• Smooth by fitting the data into regression functions

• Clustering
• Detect and remove outliers

• Semi-supervised: combined computer and human inspection
  • Detect suspicious values and have them checked by a human
    (e.g., to deal with possible outliers)

45
#visualize missing values with VIM package
library(VIM)

# in number
aggr(df_miss, prop=FALSE, numbers=TRUE)

# Matrix plot. Red for missing values, Darker values are high values.
matrixplot(df_miss, interactive=FALSE, sortby="HouseNetWorth")

# Margin plot. Red dots have at least one missing.


marginplot(df_miss[,c("StoreArea","HouseNetWorth")])

46
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of
replacement values s.t. each old value can be identified with one of the new
values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing

47
Normalization
• Min-max normalization: maps a value $v$ of attribute $A$ to $[new\_min_A, new\_max_A]$:

  $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$

  • Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
    Then $73,600 is mapped to
    $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$

• Z-score normalization ($\mu_A$: mean, $\sigma_A$: standard deviation of attribute A):

  $v' = \frac{v - \mu_A}{\sigma_A}$

  Z-score: the distance between the raw score and the
  population mean in units of the standard deviation

  • Ex. Let $\mu = 54{,}000$ and $\sigma = 16{,}000$.
    Then $v' = \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$

• Normalization by decimal scaling:

  $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
48
df <- read.csv("House Worth Data.csv",header=TRUE, stringsAsFactors=TRUE)
table(df$HouseNetWorth)

library(dplyr)
# normalize the data to [0,1] use rescale function of scales package
library(scales)
df<-df %>% mutate(HousePrice = NULL,
StoreArea = rescale(StoreArea),
BasementArea = rescale(BasementArea),
LawnArea = rescale(LawnArea))
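Z-score normalization can be sketched the same way with base R's scale(), alongside a manual min-max computation (the income vector is illustrative; the single-value z-score reproduces the worked example from the previous slide):

```r
income <- c(12000, 54000, 73600, 98000)

# Min-max normalization to [0, 1], computed by hand
(income - min(income)) / (max(income) - min(income))

# Z-score with a given mean and standard deviation
(73600 - 54000) / 16000     # 1.225, as in the worked example

# scale() standardizes using the sample mean and sd of the data itself
z <- scale(income)
```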

49
Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers

• Discretization: Divide the range of a continuous attribute into intervals


• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification
50
Data Discretization Methods
• Binning
• Top-down split, unsupervised
• Histogram analysis
• Top-down split, unsupervised
• Clustering analysis
• Unsupervised, top-down split or bottom-up merge
• Decision-tree analysis
• Supervised, top-down split
• Correlation (e.g., 2) analysis
• Unsupervised, bottom-up merge
• Note: All the methods can be applied recursively

51
Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately same
number of samples
• Good data scaling
• Managing categorical attributes can be tricky
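Both partitioning schemes can be sketched with base R's cut() (using the price data from the smoothing example on the next slide):

```r
x <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)

# Equal-width: 3 intervals of equal size, W = (max - min) / 3 = 10
table(cut(x, breaks = 3))

# Equal-depth: break points at quantiles, so each interval holds
# roughly the same number of samples
table(cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = 4)),
          include.lowest = TRUE))
```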
52
Example: Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equal-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
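The equal-depth partitioning and smoothing by bin means above can be sketched in R:

```r
price <- sort(c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34))

# Equal-depth (frequency) partitioning into 3 bins of 4 values each
bins <- split(price, rep(1:3, each = 4))

# Smoothing by bin means: replace each value by the (rounded) mean of its bin
smoothed <- unlist(lapply(bins, function(b) rep(round(mean(b)), length(b))))
smoothed  # 9 9 9 9 23 23 23 23 29 29 29 29, matching the slide
```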

53
Discretization without Supervision: Binning vs Clustering

[Figures: the original data; equal-width (distance) binning; equal-depth
(frequency) binning; K-means clustering leads to better results]
54
Discretization by Classification & Correlation Analysis
• Classification (e.g., decision tree analysis)
• Supervised: Given class labels, e.g., cancerous vs. benign
• Using entropy to determine split point (discretization point)
• Top-down, recursive split
• Details to be covered in Chapter “Classification”
• Correlation analysis (e.g., Chi-merge: χ2-based discretization)
• Supervised: use class information
• Bottom-up merge: Find the best neighboring intervals (those having similar distributions of
classes, i.e., low χ2 values) to merge
• Merge performed recursively, until a predefined stopping condition

55
Dimensionality Reduction Techniques
• Dimensionality reduction methodologies
• Feature selection: Find a subset of the original variables (or features,
attributes)
• Feature extraction: Transform the data in the high-dimensional space to a
space of fewer dimensions
• Some typical dimensionality reduction methods
• Principal Component Analysis
• Supervised and nonlinear techniques
• Feature subset selection
• Feature creation

56
Principal Component Analysis (PCA)
• PCA: A statistical procedure that uses an
orthogonal transformation to convert a set of
observations of possibly correlated variables into
a set of values of linearly uncorrelated variables
called principal components
• The original data are projected onto a much
smaller space, resulting in dimensionality
reduction
• Method: find the eigenvectors of the covariance
matrix; these eigenvectors define the new space
[Figure: a ball travels in a straight line; data from
three cameras contain much redundancy]

57
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information
contained in one or more other attributes
• E.g., purchase price of a product and the
amount of sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the
data mining task at hand
• Ex. A student’s ID is often irrelevant to the
task of predicting his/her GPA
58
Heuristic Search in Attribute Selection
• There are 2d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
• Best single attribute under the attribute independence assumption: choose by
significance tests
• Best step-wise feature selection:
• The best single-attribute is picked first
• Then next best attribute condition to the first, ...
• Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
• Best combined attribute selection and elimination
• Optimal branch and bound:
• Use attribute elimination and backtracking
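Step-wise selection can be sketched in R with step() on a linear model (mtcars is illustrative data; AIC stands in for the significance tests mentioned above):

```r
# Backward step-wise elimination: start from the full model and
# repeatedly drop the attribute whose removal most improves AIC
full <- lm(mpg ~ ., data = mtcars)
best <- step(full, direction = "backward", trace = 0)

formula(best)  # the retained subset of attributes
```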

59
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important information in
a data set more effectively than the original ones
• Three general methodologies
• Attribute extraction
• Domain-specific
• Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)
• Attribute construction
• Combining features
• Data discretization

60
Business Reporting
Definitions and Concepts
• Report = Information → Decision
• Report?
• Any communication artifact prepared to convey
specific information
• A report can fulfill many functions
• To ensure proper departmental functioning
• To provide information
• To provide the results of an analysis
• To persuade others to act
• To create an organizational memory…
61
What is a Business Report?
• A written document that contains information
regarding business matters.
• Purpose: to improve managerial decisions
• Source: data from inside and outside the organization
(via the use of ETL or ELT)
• Format: text + tables + graphs/charts
• Distribution: in-print, email, portal/intranet

Data acquisition → Information generation → Decision making → Process management
62
Business Reporting
[Figure: a business reporting loop in which business functions produce
transactional records and exception events, data are collected into data
repositories, and information is generated (reporting) for the decision
maker, whose actions and decisions feed back into the business functions]
(Source: Business intelligence, analytics, and data science : a managerial perspective)
63
Types of Business Reports
• Metric Management Reports
• Help manage business performance through metrics (Service-level
agreements (SLAs) for externals; KPIs for internals)
• Can be used as part of Six Sigma and/or TQM
• Dashboard-Type Reports
• Graphical presentation of several performance indicators in a
single page using dials/gauges
• Balanced Scorecard–Type Reports
• Include financial, customer, business process, and learning &
growth indicators

64
Dashboard application examples

An example of a Dundas BI dashboard, complete with data visualizations.
(Source: https://selecthub.com/business-intelligence/business-intelligence-vs-business-analytics/)

65
Dashboard application examples

A business analytics dashboard from Sisense.
(Source: https://selecthub.com/business-intelligence/business-intelligence-vs-business-analytics/)

66
Data Visualization
“The use of visual representations to explore, make
sense of, and communicate data.”
• Data visualization vs. Information visualization
• Information = aggregation, summarization, and
contextualization of data
• Related to information graphics, scientific
visualization, and statistical graphics
• Often includes charts, graphs, illustrations, …

67
Data Visualization
• Why data visualization?
• Gain insight into an information space by mapping data onto graphical
primitives
• Provide qualitative overview of large data sets
• Search for patterns, trends, structure, irregularities, relationships among data
• Help find interesting regions and suitable parameters for further quantitative
analysis
• Provide a visual proof of computer representations derived

• Categorization of visualization methods:


• Pixel-oriented visualization techniques
• Geometric projection visualization techniques
• Thematic Map visualization techniques
• Hierarchical visualization techniques
• Visualizing complex data and relations
68
Pixel-Oriented Visualization Techniques
• For a data set of m dimensions, create m windows on the screen, one for each
dimension
• The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
• The colors of the pixels reflect the corresponding values

[Figure: pixel-oriented visualizations of four dimensions:
(a) income, (b) credit limit, (c) transaction volume, (d) age]
69
Laying Out Pixels in Circle Segments
• To save space and show the connections among multiple dimensions, space
filling is often done in a circle segment

[Figures: (a) representing a data record in a circle segment;
(b) laying out pixels in circle segments; representing about 265,000
50-dimensional data items with the ‘Circle Segments’ technique]
70
Geometric Projection Visualization Techniques
• Gauge chart

• Bullet graph

71
Geometric Projection Visualization Techniques
• Bar and column charts • Line chart

• Area chart • Combo chart

72
GDP_Long_Format <- melt(GDP, id="Country")
names(GDP_Long_Format) <- c("Country", "Year","GDP_USD_Trillion")

ggplot(GDP_Long_Format, aes(x=Year, y=GDP_USD_Trillion, group=Country)) +
  geom_line(aes(colour=Country)) +
geom_point(aes(colour=Country),size = 5) +
labs(title="Gross Domestic Product - Top 10 Countries",
x="Year", y="GDP (in trillion USD)") +
theme(legend.title=element_text(size=20),
legend.text=element_text(face ="italic",size=15),
plot.title=element_text(face="bold", size=20),
axis.title.x=element_text(face="bold", size=12),
axis.title.y=element_text(face="bold", size=12))

## process the data
library(dplyr)
library(reshape2)
library(ggplot2)

Population <- read.csv("Population All Year.csv", header=TRUE) %>%
  melt(id = "Country", variable.name="Year",
       value.name="Pop_Billion") %>%          # melt by countries
  mutate(Year=substr(as.character(Year), 2,
                     nchar(as.character(Year)))) %>%  # strip the leading "X" from year columns
  filter(Country %in% c('India','China'))     # filter by countries

ggplot(Population, aes(Pop_Billion, fill = Country)) +
  geom_density(alpha = 0.8, col="black") +
  labs(title="Population (in Billion): Density Plot",  # geom_density draws a density, not a histogram
       x="Population (in Billion)", y="Density") +
  annotate("text", x=0.9, y=1.5, label="alpha = 0.8")  # add an annotation
Geometric Projection Visualization Techniques
• Pie chart
• Doughnut chart
• Funnel chart

library(dplyr)
library(reshape2)
library(ggplot2)

GCD_China <- read.csv("China - USD - Percentage.csv") %>%
  melt(id = "Sector", variable.name="Income_Group",
       value.name="Perc_Cont")   # melt by sector

ggplot(data=GCD_China, aes(x="", y=Perc_Cont, fill = Sector)) +
  geom_col() +
  coord_polar(theta="y", start = 0) +      # turn each stacked bar into a pie
  facet_grid(cols=vars(Income_Group)) +    # one pie per income group
  scale_fill_brewer(palette="Set3") +
  labs(title="China - Percentage share of each sector by consumption segment",
       x=NULL, y=NULL, fill="Sector")

Geometric Projection Visualization Techniques
• Scatter and bubble charts
• Scatter-high density
• Scatterplot matrices
• Matrix of scatterplots (x-y-diagrams) of the k-dim. data
• A total of k(k-1)/2 distinct scatterplots
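As a quick check of the k(k-1)/2 count, base R's pairs() draws a scatterplot matrix; with the built-in iris data, k = 4 numeric dimensions yield 6 distinct pairwise scatterplots.

```r
# Scatterplot matrix of 4-dimensional data: pairs() draws every pairwise
# x-y diagram; k(k-1)/2 of them are distinct (the rest are mirror images).
k <- ncol(iris[, 1:4])
n_plots <- k * (k - 1) / 2          # 4 * 3 / 2 = 6 distinct scatterplots
pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 19,
      main = "Scatterplot matrix of the iris data")
```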

library(dplyr)
library(ggplot2)

bc <- read.delim("BubbleChart_Data.txt") %>%
  filter(continent != "Oceania", year==2007) %>%  # filter by continent and year
  droplevels()                                    # drop unused factor levels

ggplot(bc, aes(x = gdpPercap, y = lifeExp, fill=continent)) +
  scale_x_log10() +
  geom_point(aes(size = sqrt(pop/pi)), pch = 21,
             show.legend = FALSE) +               # bubble area ~ population
  scale_size_continuous(range=c(1,40)) +
  facet_wrap(~ continent, ncol=2) +
  scale_fill_manual(values = c("#FAB25B", "#276419", "#529624", "#C6E79C")) +
  theme(text=element_text(size=12),
        title=element_text(size=14,face="bold")) +
  labs(title="Bubble Chart - GDP Per Capita vs Life Expectancy",
       x="GDP Per Capita (in US $)", y="Life Expectancy (in years)")

Thematic Map
A thematic map is a type of map specifically designed to show a particular theme
connected with a specific geographic area.
• Choropleth
• Proportional symbol

3D/Volumetric Visualization Techniques
• 3D scatter plot
• Surface rendering
• 3D computer graphics

Surface waves in water

Glass surface
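A minimal surface-rendering sketch can be done in base R with persp(), here using the classic sin(r)/r ripple surface (an illustrative shape, not the figures shown above).

```r
# Surface rendering with persp(): evaluate z = sin(r)/r on a grid and
# render the resulting wireframe surface.
x <- seq(-8, 8, length.out = 40)
y <- x
r <- sqrt(outer(x^2, y^2, `+`))    # distance of each grid point from origin
z <- sin(r) / r
z[r == 0] <- 1                     # limit of sin(r)/r at r = 0
persp(x, y, z, theta = 30, phi = 25, col = "lightblue",
      main = "Surface rendering: sin(r)/r ripple")
```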
Temporal
• Time series

• Waterfall chart: a form of data visualization that helps in understanding the cumulative effect of sequentially introduced positive or negative values

library(latticeExtra)   # provides waterfallchart()
library(dplyr)
library(reshape2)
library(ggplot2)

waterfallchart(Net ~ Time_Period, data=footfall,
               col = as.factor(footfall$Type),
               xlab = "Time Period (Month)", ylab="Footfall",
               main="Footfall by Month")

time_series <- read.csv("timeseries.csv", header=TRUE) %>%
  melt(id = c("Year"), variable.name="Country",
       value.name="GDP_Growth") %>%
  mutate(Date=as.Date(Year, format="%d/%m/%Y"))   # create the date column

# GDP growth rate from 1980 to 2015
ggplot(data=time_series, aes(x=Date, y=GDP_Growth)) +
  geom_line(aes(color=Country), size=1.5)

Temporal
• Polar area chart: similar to a usual pie chart, except sectors have equal angles and differ in how far each sector extends from the center of the circle. The polar area diagram is used to plot cyclic phenomena (e.g., counts of deaths by month).
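A polar area chart can be sketched in base R: twelve equal-angle sectors whose extent from the centre encodes an illustrative monthly count (radius is the square root of the count, so sector area is proportional to the count).

```r
# Polar area (Nightingale rose) sketch: 12 sectors of 30 degrees each;
# only the radius varies, scaled so that sector area ~ monthly count.
counts <- c(12, 18, 25, 30, 22, 15, 10, 8, 9, 14, 20, 28)  # illustrative
r <- sqrt(counts / max(counts))    # area = (theta/2) r^2, so radius ~ sqrt
plot(c(-1, 1), c(-1, 1), type = "n", asp = 1, axes = FALSE,
     xlab = "", ylab = "", main = "Polar area chart (illustrative)")
for (m in 1:12) {
  a <- seq((m - 1) * pi / 6, m * pi / 6, length.out = 30)  # sector arc
  polygon(c(0, r[m] * cos(a)), c(0, r[m] * sin(a)),
          col = hcl.colors(12)[m])
}
```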

Hierarchical Visualization Techniques
• Visualization of the data using a hierarchical partitioning into
subspaces
• Methods
• Treemap
• Heatmap

Treemap
• Treemaps display hierarchical (tree-structured) data as a set of nested rectangles.
• Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles
representing sub-branches.
• A leaf node's rectangle has an area proportional to a specified dimension of the data. Often the
leaf nodes are colored to show a separate dimension of the data.

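The nesting rule above can be illustrated with a minimal "slice-and-dice" treemap in base R (illustrative values, not the slide's data): the unit square is sliced vertically by branch, each strip is sliced horizontally by leaf, and every leaf's area ends up proportional to its value.

```r
# Slice-and-dice treemap sketch: strip width ~ branch share of the total,
# tile height ~ leaf share within the branch, so tile area ~ leaf value.
vals   <- c(A1 = 40, A2 = 25, B1 = 18, B2 = 12, B3 = 30)  # leaf values
groups <- c("A", "A", "B", "B", "B")                      # branch of each leaf
plot(0:1, 0:1, type = "n", axes = FALSE, xlab = "", ylab = "",
     main = "Slice-and-dice treemap (illustrative data)")
x <- 0
for (g in unique(groups)) {
  v <- vals[groups == g]
  w <- sum(v) / sum(vals)           # strip width ~ branch share
  y <- 0
  for (i in seq_along(v)) {
    h <- v[i] / sum(v)              # tile height ~ leaf share within branch
    rect(x, y, x + w, y + h, col = grey(0.5 + 0.4 * i / length(v)))
    text(x + w / 2, y + h / 2, names(v)[i])
    y <- y + h
  }
  x <- x + w
}
```

Production treemap layouts (e.g., the "squarified" algorithm) tile the same areas with rectangles closer to squares, but the area-proportionality rule is identical.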
Heatmap
• A heatmap is a graphical representation of
data where the individual values contained
in a matrix are represented as colors.

library(dplyr)
library(reshape2)
library(scales)     # provides rescale()
library(ggplot2)

bc <- read.csv("Region Wise Data.csv") %>%
  melt(id = c("Region","Indicator"), variable.name="Year",
       value.name="Inc_Value") %>%            # melt by region and indicator
  mutate(Year=substr(as.character(Year), 2,
                     nchar(as.character(Year)))) %>%  # strip the leading "X" from year columns
  group_by(Indicator) %>%
  mutate(Inc_Value = ifelse(is.na(Inc_Value),
                            mean(Inc_Value, na.rm=TRUE),
                            Inc_Value)) %>%   # replace NA by the indicator mean
  mutate(rescale = rescale(Inc_Value))        # rescale to [0, 1] within each indicator

ggplot(bc, aes(x=Indicator, y=Region, fill=rescale)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue") +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  theme(text=element_text(size=12),
        title=element_text(size=14,face="bold"),
        axis.text.x = element_text(size = 15 * 0.8, angle = 330, hjust = 0,
                                   colour = "black", face="bold"),
        axis.text.y = element_text(size = 15 * 0.8, colour = "black",
                                   face="bold")) +
  labs(title = "Heatmap - Region Vs World Development Indicators")

Visualizing Complex Data and Relations: Tag Cloud
• Tag cloud: Visualizing user-generated
tags
• The importance of a tag is represented by font size/color
• Popularly used to visualize
word/phrase distributions

KDD 2013 Research Paper Title Tag Cloud


Newsmap: Google News Stories in 2005
library(wordcloud)      # provides wordcloud()
library(RColorBrewer)   # provides brewer.pal()

set.seed(146)           # fix the layout for reproducibility
wordcloud(words = docs, scale=c(3, 0.5), max.words=100,
          min.freq=5,
          colors=brewer.pal(6, "Dark2"),
          random.order=FALSE, rot.per=0.10,
          use.r.layout=FALSE)

Visualizing Complex Data and Relations: Social Networks
• Visualizing non-numerical data: social and information networks

Organizing information networks

A typical network structure

A social network

Which Chart or Graph Should You Use?

(Source: Business intelligence, analytics, and data science : a managerial perspective)


The Emergence of Data Visualization and Visual Analytics

• Magic Quadrant for Analytics and Business Intelligence Platforms
• Data visualization companies
are in the “Leaders” quadrant
• There is a move towards
visualization

Visual Analytics
• A recently coined term
• Information visualization + predictive analytics
• Information visualization
• Descriptive, backward focused
• “what happened” “what is happening”

• Predictive analytics
• Predictive, future focused
• “what will happen” “why will it happen”

• There is a strong move toward visual analytics

Visual Analytics by SAS Institute

• SAS Visual Analytics Architecture
• Big data + In memory + Massively parallel processing + ..

Visual Analytics by SAS Institute
• At teradatauniversitynetwork.com, you can learn more about SAS VA and experiment with the tool

Performance Dashboards
• Performance dashboards are commonly used in BPM
(Business Process Management) software suites and BI
platforms
• Dashboards provide visual displays of important
information that is consolidated and arranged on a single
screen so that information can be digested at a single
glance and easily drilled in and further explored

Performance Dashboards

Performance Dashboards
• Dashboard design
• The fundamental challenge of dashboard design is to display all the
required information on a single screen, clearly and without
distraction, in a manner that can be assimilated quickly
• Three layers of information
• Monitoring – graphical, abstracted data to monitor key performance
metrics
• Analysis – summarized dimensional data to analyze the root cause of
problems
• Management – detailed operational data that identify what actions to
take to resolve a problem

Performance Dashboards
• What to look for in a dashboard
• Use of visual components to highlight data and exceptions
that require action
• Transparent to the user, meaning that they require minimal
training and are extremely easy to use
• Combine data from a variety of systems into a single,
summarized, unified view of the business
• Enable drill-down or drill-through to underlying data sources
or reports
• Present a dynamic, real-world view with timely data
• Require little coding to implement, deploy, and maintain

Best Practices in Dashboard Design
• Benchmark KPIs with Industry Standards
• Do a gap assessment against industry benchmarks to align with industry best practices.

• Wrap the Metrics with Contextual Metadata


• Often when a report or a visual dashboard/scorecard is presented to
business users, questions remain unanswered such as:
• Where did you source this data from?
• While loading the data warehouse, what percentage of the data got
rejected/encountered data quality problems?
• Is the dashboard presenting “fresh” information or “stale” information?

• Validate the Design by a Usability Specialist


• Even with a well-engineered data warehouse that performs well, many business users will not use the dashboard if it is perceived as not being user friendly, leading to poor adoption of the infrastructure and change management issues.
Best Practices in Dashboard Design
• Prioritize and Rank Alerts and Exceptions
• Because there is a huge volume of raw data, it is important to have a mechanism by which important exceptions/behaviors are proactively pushed to the information consumers.

• Enrich Dashboard with Business-User Comments


• When the same dashboard information is presented to multiple
business users, a small text box can be provided that can capture the
comments from an end-user’s perspective.

• Present Information in Three Different Levels


• Information can be presented in three layers depending on the
granularity of the information: the visual dashboard level, the static
report level, and the self-service cube level.

Best Practices in Dashboard Design
• Pick the Right Visual Constructs
• In presenting information in a dashboard, some information is
presented best with bar charts, some with time series line graphs, and
when presenting correlations, a scatter plot is useful. Sometimes
merely rendering it as simple tables is effective.
• Explicitly document the dashboard design principles so that all the
developers working on the front end can adhere to the same principles
while rendering the reports and dashboard.

• Provide for Guided Analytics


• The capability of the dashboard can be used to guide the “average” business user along the same navigational path as an analytically savvy business user.

