Data Visualization

This document discusses data visualization techniques for exploring and preprocessing data. It describes commonly used graphs like bar charts, line charts, scatterplots, histograms, and boxplots that can display one or two variables to explore trends, relationships, and distributions. For larger datasets, it recommends heatmaps and color-coded scatterplots to accommodate more data and variables. The goal of these visualization techniques is to gain insights from data in order to select relevant variables and inform data mining methods.

Uploaded by

ShyamBhatt
Copyright
© All Rights Reserved

Data Visualization

IIM Udaipur
“A picture is worth a thousand words” – Chinese proverb
Uses
• Preprocessing and cleaning the data: “illegal”
values, missing values, duplicate rows,
columns with all the same values, etc.
• Variable selection: deciding which variables to
include in the analysis
• Exploration: hidden trends in the data
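The preprocessing checks listed above can be sketched in a few lines before any graph is drawn. This is a minimal illustration only: the toy table, the column names, and the rule that a legal age lies between 0 and 120 are all assumptions made for the example, not part of the original slides.

```python
# Hypothetical toy table: each dict is one row; None marks a missing value.
rows = [
    {"age": 34, "city": "Udaipur", "country": "IN"},
    {"age": None, "city": "Jaipur", "country": "IN"},
    {"age": 34, "city": "Udaipur", "country": "IN"},  # exact duplicate of row 0
    {"age": -5, "city": "Delhi", "country": "IN"},    # "illegal" age
]
columns = list(rows[0])

# Missing values: count of None entries per column.
missing = {c: sum(r[c] is None for r in rows) for c in columns}

# Duplicate rows: indices of rows identical to an earlier row.
seen = set()
duplicates = []
for i, r in enumerate(rows):
    key = tuple(r[c] for c in columns)
    if key in seen:
        duplicates.append(i)
    seen.add(key)

# Columns whose non-missing values are all the same.
constant = [
    c for c in columns
    if len({r[c] for r in rows if r[c] is not None}) <= 1
]

# "Illegal" values: assumed rule that age must lie between 0 and 120.
illegal_age = [
    i for i, r in enumerate(rows)
    if r["age"] is not None and not 0 <= r["age"] <= 120
]

print(missing, duplicates, constant, illegal_age)
```

A heatmap of missing values or a bar chart of these counts would present the same information graphically.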
Some of the Most Commonly Used Graphs in the
Business World
• Bar chart
• Line chart
• Scatterplot
• Histogram
• Boxplot
• Scatterplot matrix
• Parallel co-ordinates plot
Use
• Generally speaking, these graphs display one
or two columns (i.e., variables) of data at a
time
• Useful for exploring data types, volume,
relationships, etc.
Bar Chart
• May be used to compare a single
statistic/measure (e.g., average, count,
percentage) across groups
• Typically used for qualitative (categorical)
data – represents frequencies/counts for
each category
• Height of the bar represents the value of the
statistic/measure (typically, count)
• Different bars correspond to different groups
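As a rough sketch of what a bar chart encodes, the snippet below computes one statistic (the average) per group and prints a text bar whose length is proportional to it. The regions and sales figures are hypothetical, and a plotting library would normally draw the bars; plain text is used here only to keep the example self-contained.

```python
from collections import defaultdict

# Hypothetical records: (group, value).
sales = [("North", 120), ("South", 80), ("North", 100),
         ("East", 60), ("South", 100)]

# Accumulate the sum and the count for each group.
totals = defaultdict(float)
counts = defaultdict(int)
for region, amount in sales:
    totals[region] += amount
    counts[region] += 1

# One statistic per group: here, the average.
average = {region: totals[region] / counts[region] for region in totals}

# One bar per group; bar length is proportional to the statistic.
for region, value in sorted(average.items()):
    bar = "#" * round(value / 10)
    print(f"{region:<6}{bar} {value:.0f}")
```

Replacing `average` with `counts` gives the frequency version of the chart that is typical for categorical data.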
Line Chart
• Useful for time series data
• Choice of the time frame should be made
keeping in mind the forecasting task
• Shows the trend in the values
Scatterplot
• Shows the relationship between two variables
• Suggests the type of correlation the two variables
may have – positive or negative
• Can show any non-linear relationship that may be
present between two variables
• Extremely important for prediction problems, as we
would like to see the relationship between the
response and predictors
• Cannot be used for a classification task in its basic form,
as the response in a classification task is categorical (often binary)
• If colour-coded, one more variable can be looked at
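One way to see why looking at the scatterplot matters is to compare it with a single summary number. The sketch below, on made-up data, computes the Pearson correlation by hand for a linear and a curved relationship; the helper name `pearson` and both data series are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

x = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * v + 1 for v in x]   # straight-line relationship
curved = [v * v for v in x]       # strong but non-linear relationship

r_linear = pearson(x, linear)
r_curved = pearson(x, curved)
print(round(r_linear, 3), round(r_curved, 3))
```

The curved series depends entirely on `x`, yet its correlation coefficient is essentially zero, so the relationship would be missed without plotting the points.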
Histogram
• Shows the distribution of a numerical
(continuous) variable
• Useful in supervised learning, for determining
potential data mining methods and variable
transformations (for example, transforming a
skewed variable to a symmetric one, for
regression)
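The quantities behind a histogram can be sketched as follows: values are sorted into equal-width bins and counted, and a skewness measure summarises the asymmetry. The income figures are hypothetical, and the base-10 logarithm is just one common choice of transformation; it reduces the skew here but does not guarantee a perfectly symmetric result.

```python
import math

def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def skewness(values):
    """Moment-based skewness: positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

incomes = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]   # hypothetical, right-skewed
logged = [math.log10(v) for v in incomes]      # assumed transformation

raw_counts = bin_counts(incomes, 5)
log_counts = bin_counts(logged, 5)
print(raw_counts, round(skewness(incomes), 2))
print(log_counts, round(skewness(logged), 2))
```

On the raw scale almost every value lands in the first bin; after the transformation the counts spread across the bins and the skewness drops.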
Box Plot
• Shows the distribution of a numerical
(continuous) variable
• Useful in supervised learning, for determining
potential data mining methods and variable
transformations (for example, transforming a
skewed variable to a symmetric one, for
regression)
• Effective for comparing subgroups by
generating side-by-side box plots
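A box plot is drawn from a handful of summary numbers, which can be computed per subgroup for a side-by-side comparison. The sketch below assumes the common convention of whiskers reaching to the most extreme points within 1.5 times the interquartile range; the slides do not specify a whisker rule, and the two groups are made-up data.

```python
import statistics

def box_stats(values):
    """Summary numbers a box plot draws (assumed 1.5 * IQR whisker rule)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical subgroups to compare side by side.
groups = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "B": [2, 3, 3, 4, 4, 5, 5, 6, 30],
}
summary = {name: box_stats(values) for name, values in groups.items()}
for name, stats in summary.items():
    print(name, stats)
```

Group B has a narrower box than group A but one point far above the upper whisker, which is the kind of difference side-by-side box plots make visible.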
Scatterplot Matrix
• Shows the pairwise relationships between variables
• Diagonal histograms show the individual
distribution of the variables
• Choose number of variables wisely for better
visibility
Parallel co-ordinates plot
• A vertical axis is drawn for each variable
• Each record is represented by drawing a line
that connects its values on different axes
• Creates a multivariate profile for every record
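The construction described above can be sketched by turning each record into a list of (axis position, height) points. Rescaling every variable to the range 0 to 1 so that all axes share the same height is an assumption here (a common choice, not stated in the slides), and the records and variable names are hypothetical.

```python
# Hypothetical records with three numerical variables.
records = [
    {"price": 10.0, "weight": 200.0, "rating": 3.0},
    {"price": 30.0, "weight": 100.0, "rating": 5.0},
    {"price": 20.0, "weight": 150.0, "rating": 4.0},
]
axes = ["price", "weight", "rating"]   # one vertical axis per variable

# Minimum and maximum of each variable, used to rescale it to [0, 1].
ranges = {
    a: (min(r[a] for r in records), max(r[a] for r in records))
    for a in axes
}

def profile(record):
    """Points (axis index, scaled value) that the record's line connects."""
    points = []
    for x, a in enumerate(axes):
        lo, hi = ranges[a]
        y = (record[a] - lo) / (hi - lo) if hi > lo else 0.5
        points.append((x, y))
    return points

profiles = [profile(r) for r in records]
for p in profiles:
    print(p)
```

Each list is the multivariate profile of one record; joining its points with line segments gives that record's line in the plot.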
Some Methods Useful for Large
Datasets
• The graphs/charts discussed so far are very
powerful and give accurate information
• But they cannot accommodate a large amount
of data and/or a large number of variables at one go
• Some methods: heatmap, colour-coded scatterplot
Heatmap
• Useful for: (i) visualizing correlation tables, (ii)
visualizing missing values in the data
• Colour is used to denote values
• In a correlation heatmap, darker shades typically
correspond to stronger correlation
• Caution: They are not replacements for more
accurate graphs already discussed, as colour
differences cannot be perceived accurately!
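A correlation heatmap can be sketched as a correlation table whose cells are mapped to shades. In the illustration below the four variables are made-up data, the five text characters stand in for colour shades, and shading by the absolute value of the correlation is an assumed design choice.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical variables (columns of a small dataset).
data = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
    "w": [2, 1, 3, 1, 2],
}
names = list(data)
corr = [[pearson(data[a], data[b]) for b in names] for a in names]

SHADES = " .:*#"   # lightest to darkest; stands in for a colour scale

def shade(r):
    """Map a correlation to a shade by its strength (sign is ignored)."""
    return SHADES[min(int(abs(r) * len(SHADES)), len(SHADES) - 1)]

for name, row in zip(names, corr):
    print(name, "".join(shade(r) for r in row))
```

With only five shades, correlations of 0.81 and 0.99 are drawn identically, which is a concrete instance of the caution above about reading exact values from colour.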
Colour-coded Scatterplot
• Colour-coding to bring in a categorical variable
• In this enhanced version, the scatterplot can be
used for a classification task