Exploratory Data Analysis (EDA)
WHAT IS EDA?
• The analysis of datasets based on various numerical methods and graphical tools.
• Exploring data for patterns, trends, underlying structure, deviations from the trend, anomalies and strange
structures.
• It facilitates discovering the unexpected as well as confirming the expected.
• Another definition: An approach/philosophy for data analysis that employs a variety of techniques (mostly
graphical).
AIM OF THE EDA
• Maximize insight into a dataset
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop valid models
• Determine optimal factor settings (Xs)
Classification of EDA
• Exploratory data analysis is generally cross-classified in two ways. First, each method is either non-graphical
or graphical. And second, each method is either univariate or multivariate (usually just bivariate).
• Non-graphical methods generally involve calculation of summary statistics, while graphical methods
obviously summarize the data in a diagrammatic or pictorial way.
• Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or
more variables at a time to explore relationships. Usually our multivariate EDA will be bivariate (looking at
exactly two variables), but occasionally it will involve three or more variables.
• It is almost always a good idea to perform univariate EDA on each of the components of a multivariate EDA
before performing the multivariate EDA.
Examine the entire data set using basic techniques before starting a formal statistical analysis.
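The four combinations of this cross-classification can be sketched in base R. This is only an illustration: the built-in mtcars data set and the choice of the mpg and wt columns are assumptions made for the example.

```r
# Built-in mtcars data; mpg and wt are chosen only for illustration
mpg <- mtcars$mpg
wt  <- mtcars$wt

# Univariate, non-graphical: summary statistics for one variable
mpg_summary <- summary(mpg)
print(mpg_summary)

# Univariate, graphical: distribution of one variable
hist(mpg, main = "Miles per Gallon", xlab = "mpg")

# Bivariate, non-graphical: correlation between two variables
mpg_wt_cor <- cor(mpg, wt)
print(mpg_wt_cor)

# Bivariate, graphical: scatterplot of two variables
plot(wt, mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

The univariate steps come first, in line with the advice to look at each variable on its own before looking at relationships.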
Summarizing Variables
• Categorical variables
Frequency tables - how many observations in each category?
Relative frequency table - percent in each category.
Bar chart and other plots.
• Continuous variables
Bin the observations (create categories, e.g., 0-10, 11-20, etc.), then treat the result as an ordered categorical variable.
Plots specific to Continuous variables.
The goal for both categorical and continuous data is data reduction while preserving/extracting key information
about the process under investigation.
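A frequency table and a relative frequency table for a categorical variable might be computed as follows. This is a minimal sketch that assumes the cyl column (number of cylinders) of the built-in mtcars data set as the categorical variable.

```r
# Frequency table: how many observations fall in each category?
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# Relative frequency table: percent of observations in each category
cyl_percent <- round(100 * prop.table(cyl_counts), 1)
print(cyl_percent)
```

The two tables reduce 32 rows to three numbers each while keeping the information about how the categories are distributed.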
Categorical Data Summaries
Tables
Frequency Table
Graphing a Frequency Table - Bar Chart:
Plot the number of observations in each category:
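One way to draw such a bar chart in base R is to pass a frequency table to barplot(). The sketch below assumes the cyl column of the built-in mtcars data set as the categorical variable; the axis labels are illustrative.

```r
# Count the observations in each category
cyl_counts <- table(mtcars$cyl)

# One bar per category, bar height = number of observations
bar_midpoints <- barplot(cyl_counts,
                         main = "Cars by Number of Cylinders",
                         xlab = "Cylinders",
                         ylab = "Number of observations")
```

barplot() returns the x positions of the bar centers, which can be useful for adding labels later.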
Continuous Data - Tables
Example: Ages of 10 adult leukemia patients:
35; 40; 52; 27; 31; 42; 43; 28; 50; 35
One option is to group these ages into decades and create a categorical
age variable:
We can then create a frequency table for this new categorical age
variable.
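The grouping into decades and the resulting frequency table can be sketched with cut() and table(). The variable names and the decade labels below are assumptions chosen for the example; the ages are the ten values listed above.

```r
# Ages of the 10 adult leukemia patients
ages <- c(35, 40, 52, 27, 31, 42, 43, 28, 50, 35)

# Group into decades; right = FALSE makes each interval [lower, upper),
# so an age of 40 falls in the 40-49 group
age_group <- cut(ages,
                 breaks = c(20, 30, 40, 50, 60),
                 labels = c("20-29", "30-39", "40-49", "50-59"),
                 right = FALSE)

# Frequency table for the new categorical age variable
age_counts <- table(age_group)
print(age_counts)
```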
Continuous data - plots
A histogram is a bar chart constructed using the frequencies or relative
frequencies of a grouped (or "binned") continuous variable.
Age histogram of 10 adult leukemia patients
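A histogram like this could be produced in base R with hist(). This is a sketch only: the decade-wide bins and the axis labels are assumptions, not necessarily the settings used for the original figure.

```r
# Ages of the 10 adult leukemia patients
ages <- c(35, 40, 52, 27, 31, 42, 43, 28, 50, 35)

# Decade-wide bins; right = FALSE counts an age of 40 in the 40-50 bin
age_hist <- hist(ages,
                 breaks = seq(20, 60, by = 10),
                 right = FALSE,
                 main = "Age of 10 Adult Leukemia Patients",
                 xlab = "Age (years)")

# Bar heights, one per bin
print(age_hist$counts)
```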
Plotting Functions
R has several distinct plotting systems
Base R functions
• hist()
• barplot()
• boxplot()
• plot()
lattice package
ggplot2 package
Boxplot
> boxplot(mtcars$mpg, main = "Miles per Gallon")
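The numbers behind a boxplot can be inspected directly. The sketch below uses fivenum(), which returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum); the box in a boxplot spans the two hinges and the line inside it marks the median.

```r
# Five-number summary of miles per gallon in the built-in mtcars data
mpg_five <- fivenum(mtcars$mpg)
names(mpg_five) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(mpg_five)

# The same variable drawn as a boxplot
boxplot(mtcars$mpg, main = "Miles per Gallon", ylab = "mpg")
```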
The mean is the sum of the data values divided by the number of data items.
Helpful Hint: The mean is sometimes called the average.
The median is the middle value of an odd number of data items arranged in order. For an even number of data items, the
median is the average of the two middle values.
Find the mean, median, mode, and range of the data set.
4, 7, 8, 2, 1, 2, 4, 2
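mean:
4 + 7 + 8 + 2 + 1 + 2 + 4 + 2 = 30 (sum of the 8 items)
30 ÷ 8 = 3.75 (divide the sum by the number of items)
The mean is 3.75.
median:
In order: 1, 2, 2, 2, 4, 4, 7, 8. There are 8 items, so average the two middle values: (2 + 4) ÷ 2 = 3.
The median is 3.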
mode:
The value 2 occurs three times, more often than any other value.
The mode is 2.
range:
8 − 1 = 7 (largest value minus smallest value)
The range is 7.
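The same four results can be checked in R. Two points in this sketch are worth noting: base R has no built-in function for the statistical mode, so it is derived here from a frequency table, and R's range() returns the minimum and maximum rather than their difference. The variable names are chosen for the example.

```r
# The data set from the worked example
x <- c(4, 7, 8, 2, 1, 2, 4, 2)

# Mean: sum of the values divided by the number of values
x_mean <- mean(x)

# Median: average of the two middle values, since there are 8 items
x_median <- median(x)

# Mode: the value(s) with the highest frequency
counts <- table(x)
x_mode <- as.numeric(names(counts)[as.vector(counts) == max(counts)])

# Range: largest value minus smallest value
x_range <- max(x) - min(x)

print(c(mean = x_mean, median = x_median, mode = x_mode, range = x_range))
```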
▪ The mean and median are measures of central tendency used to
represent the “middle” of a data set.
▪ To decide which measure is most appropriate for describing a set of
data, think about what each measure tells you about the data.
▪ The measure that you choose may depend on how the information in
the data set is being used.
With the Outlier