Exploratory Data Analysis
Outline of Presentation
Exploratory v. Confirmatory Data Analyses
Exploratory Data Analysis Techniques
Definition
EDA consists of methods of discovering unanticipated
patterns and relationships in a data set, by summarizing
data quantitatively or presenting them visually.
Exploratory v. Confirmatory
Exploratory Data Analysis
Descriptive Statistics - Inductive Approach
Look for flexible ways to examine data without preconceptions
Heavy reliance on graphical displays
Let data suggest questions
Advantages
Flexible ways to generate hypotheses
Does not require more than data can support
Promotes deeper understanding of processes
Disadvantages
Usually does not provide definitive answers
Requires judgment - cannot be cookbooked
Exploratory v. Confirmatory
Confirmatory Data Analysis
Advantages
Provide precise information in the right circumstances
Well-established theory and methods
Disadvantages
Misleading impression of precision in less than ideal circumstances
Analysis driven by preconceived ideas
Difficult to notice unexpected results
EDA Techniques
Graphical presentation of distribution
Stem-and-Leaf Plot
What is it?
A plot where each data value is split into a "leaf"
(usually the last digit) and a "stem" (the other digits).
By Mouse
Analyze-> Descriptive Statistics-> Explore-> Plots->
Stem-and-leaf
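The stem/leaf split described above can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS procedure: the function name `stem_and_leaf` and the scores are made up, and it assumes non-negative whole numbers with the last digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by stem (all digits but the last) and leaf (last digit)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 47 -> stem 4, leaf 7
        groups[stem].append(leaf)
    # Keep empty stems so that gaps in the distribution stay visible.
    return {s: groups.get(s, []) for s in range(min(groups), max(groups) + 1)}

scores = [12, 15, 21, 23, 23, 28, 34, 47, 49]  # made-up data
plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

Each printed row is one stem followed by its leaves, so the row lengths show the shape of the distribution in the same way a histogram does.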
Box Plot
What is it?
A way of graphically depicting groups of numerical data
through their five-number summaries: the smallest
observation (sample minimum), lower quartile (Q1),
median (Q2), upper quartile (Q3), and largest observation
(sample maximum). A box plot may also indicate which
observations, if any, might be considered outliers.
Location
Spread
Skewness
Outliers
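The five-number summary behind a box plot can be computed directly. The sketch below uses Python's `statistics.quantiles`; the helper names and the sample are made up, and the 1.5 x IQR fence used to flag outliers is a common convention assumed here, not something stated on the slide. Statistical packages may use a slightly different quartile rule, so results can differ a little from SPSS.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

def flag_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # made-up data
summary = five_number_summary(sample)
outliers = flag_outliers(sample)
print(summary, outliers)
```

The median gives the location, Q3 - Q1 the spread, the position of the median inside the box hints at skewness, and the flagged values are the candidate outliers.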
By mouse
Graphs-> Legacy Dialogs-> Boxplot-> Summaries of
separate variables-> Define-> scale variable-> Optional:
Label Cases by-> OK
By mouse
Graphs-> Legacy Dialogs-> Boxplot-> Summaries for
groups of cases-> Define-> Variable (scale)->
Category Axis (the grouping variable)-> Optional:
Label Cases by (ID or name)-> OK
Histogram
What is it?
A diagram consisting of rectangles whose area is
proportional to the frequency of a continuous variable
within each class interval (bin) and whose width is
equal to the width of that bin.
By Mouse
Graphs-> Legacy Dialogs-> Histogram-> Variable (scale)-> OK
By Mouse
Graphs-> Chart Builder-> Histogram-> drag Variable (scale) to the x-axis-> Set Parameters-> Custom-> Number of intervals-> Continue-> OK
Example: Histogram
A researcher may need to adjust the number of bins
to see the shape of the distribution clearly and to
judge what type of distribution the data follow.
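To see why the choice of bins matters, the sketch below counts the same made-up data with two and with four equal-width bins. The function `histogram_counts` is a hypothetical helper written for this illustration; it assumes the data are not all identical and places the maximum value in the last bin.

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up data
coarse = histogram_counts(data, 2)
fine = histogram_counts(data, 4)
print(coarse, fine)
```

With two bins the data look like one large lump; with four bins the peak near 3 and the isolated value at 9 become visible.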
Scatterplot
What is it?
A scatterplot is a plot of data points in the xy-plane
that displays the strength, direction and shape of
the relationship between two variables.
Used for
Analyzing relationships between two variables
Looking to see if there are any outliers in the data
By Mouse
Graphs-> Legacy Dialogs-> Scatter/Dot-> Simple
Scatter-> Define-> Y Axis (outcome)-> X Axis (predictor)->
OK
Example: Scatterplot
Researchers wanted to see if there is a link
between Height and Weight.
Bar Graph
What is it?
-- A diagram consisting of rectangles whose height is
proportional to the frequency of each level of a
categorical variable.
-- A bar graph is similar to a histogram, but for
categorical variables.
Used for
-- comparison of frequencies for different levels
Pie chart
What is it?
A type of graph in which a circle is divided into
sectors, one for each level of a categorical
variable, with the size of each sector showing the
proportion of observations at that level.
Used for
-- comparison of proportions for different levels
By Mouse
Graphs-> Legacy Dialogs-> Pie-> Summaries for
groups of cases-> Define-> categorical variable->
Define Slices by-> OK
Non-Graphical Techniques
Measures of Central Tendency
Central tendency is the location of the middle
(center) of the data.
-- Mean = sum of all data values divided by the
number of values (arithmetic average).
Measures of Spread
Spread is how far observations lie from each
other.
-- Variance=average of the squared distances from
the mean.
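The measures of central tendency and spread above can be checked with Python's standard `statistics` module. The data are made up. Note one assumption worth flagging: "average of the squared distances" divides by n (population variance, `pvariance`), whereas statistical packages usually report the sample variance, which divides by n - 1 (`variance`).

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

mean = statistics.mean(values)            # sum of values / number of values
median = statistics.median(values)        # middle value of the sorted data
mode = statistics.mode(values)            # most frequent value
value_range = max(values) - min(values)   # largest minus smallest

pop_variance = statistics.pvariance(values)  # divides by n
sample_variance = statistics.variance(values)  # divides by n - 1
pop_std = statistics.pstdev(values)          # square root of pop_variance

print(mean, median, mode, value_range)
print(pop_variance, sample_variance, pop_std)
```

The standard deviation is the square root of the variance, so it is expressed in the same units as the data.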
By Mouse
Analyze-> Descriptive Statistics-> Frequencies-> select
a scale variable-> click Statistics-> select Mean, Median,
Mode, Range, Maximum and Minimum.
Valid             60
Missing           0
Mean              940.3650
Median            943.7000
Mode              790.70a
Std. Deviation    62.20482
Variance          3869.439
Range             322.30
Correlation Coefficient
What is it?
-- A numeric measure of the linear relationship between two continuous
variables.
Correlation
Slight warning:
Correlation measures linear relationships only;
a curved relationship may exist even when the
coefficient is close to zero.
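The warning can be illustrated with a hand-computed Pearson coefficient. In this sketch `pearson_r` is a helper written for the illustration and the data are made up: one outcome is an exact straight line in x, the other is an exact parabola.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]  # perfect straight line
curved = [v ** 2 for v in x]     # perfect curve, but not a line

r_linear = pearson_r(x, linear)
r_curved = pearson_r(x, curved)
print(r_linear, r_curved)
```

The curved outcome is completely determined by x, yet its coefficient is zero, which is why a scatterplot should be inspected alongside the correlation.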
Linear Regression
What is it?
-- A statistical technique of fitting a linear function to
data points in an attempt to describe the relationship
between two variables.
Used for
-- prediction
-- interpretation of coefficients (change in y for a
unit increase in x)
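A minimal sketch of fitting a line by ordinary least squares, the usual method for simple linear regression (assumed here; the slide does not name the fitting method). The helper `fit_line` and the data are made up; the points lie exactly on y = 1 + 2x so the result is easy to verify.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx             # change in y for a unit increase in x
    intercept = my - slope * mx   # the line passes through (mean x, mean y)
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # made-up data on the line y = 1 + 2x
intercept, slope = fit_line(xs, ys)
predicted = intercept + slope * 6  # prediction for a new x value
print(intercept, slope, predicted)
```

The slope is the coefficient interpreted on the slide, and the last line shows the prediction use.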
By mouse
Analyze-> Regression-> Linear-> Y (the variable we want
to predict) to Dependent-> X (the variable we are using
to predict Y) to Independent(s)-> OK
Example: Correlation
Referring to our weight and height scatterplot,
the researchers want to check how strongly
these two variables are related.
Correlations
                               Weight    Height
Pearson Correlation   Weight    1.000      .717
                      Height     .717     1.000
Sig. (1-tailed)       Weight               .000
                      Height     .000
N                     Weight      507       507
                      Height      507       507
Example: Regression
Researchers want to create a linear model
using the height as an independent variable
(predictor) and weight as a dependent variable
(outcome or response).
The fitted line can be written as
Weight= -105.011+1.018 (Height)
Coefficients (a)
                  Unstandardized Coefficients   Standardized Coefficients
Model             B            Std. Error       Beta                         t          Sig.
1   (Constant)    -105.011     7.539                                         -13.928    .000
    Height           1.018      .044            .717                          23.135    .000
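As a worked check of the fitted line Weight = -105.011 + 1.018 (Height), the sketch below plugs in a height and reads off the slope. The function name `predict_weight` and the example heights are illustrative assumptions; the units are whatever the data set uses, which the slide does not state.

```python
INTERCEPT = -105.011  # (Constant) B from the coefficients table
SLOPE = 1.018         # Height B from the coefficients table

def predict_weight(height):
    """Predicted weight from the fitted line, in the data set's own units."""
    return INTERCEPT + SLOPE * height

w170 = predict_weight(170)            # -105.011 + 1.018 * 170
gain_per_10 = predict_weight(180) - w170  # ten times the slope
print(round(w170, 3), round(gain_per_10, 2))
```

Each additional unit of height is associated with 1.018 more units of predicted weight. The negative intercept has no practical meaning, since a height of zero lies far outside the observed data.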
Frequency Table
What is it?
-- A table that shows frequency (count) for each
level of a categorical variable.
Used for
-- comparison of frequencies for different levels
By mouse
Analyze-> Descriptive Statistics-> Frequencies-> Variable
-> Display frequency tables-> OK
By Mouse
Transform-> Visual Binning-> select the variable to convert into an ordinal variable->
OK-> Make Cutpoints-> enter the number of cutpoints and the width-> Apply-> OK
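Binning a continuous variable into ordered categories can be sketched with the standard `bisect` module. The cutpoints, labels and function name below are hypothetical. The sketch assumes that a value equal to a cutpoint belongs to the lower bin, meaning that each cutpoint is an included upper limit.

```python
from bisect import bisect_left

def bin_value(value, cutpoints, labels):
    """Return the label of the bin that value falls into.

    cutpoints must be sorted; labels needs one more entry than cutpoints.
    A value equal to a cutpoint goes to the lower bin.
    """
    return labels[bisect_left(cutpoints, value)]

cutpoints = [10, 20]                # hypothetical upper limits of the first two bins
labels = ["low", "medium", "high"]  # hypothetical ordinal levels

binned = [bin_value(v, cutpoints, labels) for v in [3, 10, 10.5, 20, 25]]
print(binned)
```

The binned variable can then be summarized in a frequency table or used as a row or column variable in a cross-tabulation.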
                             Frequency   Percent   Valid Percent   Cumulative Percent
Valid   9th Grade                    9      15.0            15.0                 15.0
        10th Grade                  19      31.7            31.7                 46.7
        11th Grade                  20      33.3            33.3                 80.0
        12th grade and up           12      20.0            20.0                100.0
        Total                       60     100.0           100.0
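The percent and cumulative percent columns follow directly from the counts. The sketch below rebuilds them from the four frequencies in the table (9, 19, 20 and 12 out of 60); the function name and the shortened level labels are assumptions made for the illustration.

```python
from collections import Counter

def frequency_table(observations, order):
    """Return (level, count, percent, cumulative percent) rows in the given order."""
    counts = Counter(observations)
    total = sum(counts.values())
    rows, cumulative = [], 0.0
    for level in order:
        percent = 100 * counts[level] / total
        cumulative += percent
        rows.append((level, counts[level], round(percent, 1), round(cumulative, 1)))
    return rows

levels = ["9th", "10th", "11th", "12th+"]  # shortened labels
observations = ["9th"] * 9 + ["10th"] * 19 + ["11th"] * 20 + ["12th+"] * 12
table = frequency_table(observations, levels)
for row in table:
    print(row)
```

The cumulative column only makes sense because the levels are ordered; for an unordered categorical variable it would normally be ignored.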
Cross-tabulation
What is it?
A two-way table containing frequencies (counts)
for each combination of levels of the row and
column variables.
Used for
Comparison of frequencies for different levels of
the variables (chi-squared test)
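The chi-squared statistic compares each observed count with the count expected if the row and column variables were independent. The sketch below is a hand computation on a made-up 2 x 2 table; the function name is hypothetical, and converting the statistic into a p-value (which needs the chi-squared distribution) is left to statistical software.

```python
def chi_square(observed):
    """Pearson chi-squared statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count under independence
            stat += (obs - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

counts = [[10, 20],
          [20, 10]]  # made-up counts: rows and columns are two categorical variables
stat, df = chi_square(counts)
print(round(stat, 3), df)
```

A large statistic relative to its degrees of freedom means the observed counts are far from what independence would predict.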
By Mouse
Analyze-> Descriptive Statistics-> Crosstabs-> select
variable for Row(s)-> select variable for Column(s)->
Statistics-> Chi-square-> Continue-> OK
Example: Cross-tabulation
Researchers wish to understand whether the
educational levels in the SMSA data are
distributed equally across the regions of the US.
The Pearson chi-square p-value (.002) is small, so
we conclude that the distribution of educational
levels differs among the regions of the US.
[Table: EDU (Binned) by US (1.00 to 4.00) crosstabulation of counts; N = 60]

Chi-Square Tests
                                Value      Asymp. Sig. (2-sided)
Pearson Chi-Square              26.078a    .002
Likelihood Ratio                25.377     .003
Linear-by-Linear Association     9.893     .002
N of Valid Cases                    60
Recommended Readings/Citations
Hartwig, F., & Dearing, B. E. (1979). Exploratory Data
Analysis. Beverly Hills: Sage Publications.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (1983).
Understanding Robust and Exploratory Data Analysis. New
York: John Wiley & Sons Inc.
Pampel, F. C. (2004). Exploratory Data Analysis. In M. S.
Lewis-Beck, A. Bryman, & T. F. Liao, The SAGE
Encyclopedia of Social Science Research Methods (pp. 359-360). Thousand Oaks, California: Sage Publications.
Vogt, W. P. (1999). Exploratory Data Analysis. In W. P. Vogt,
Dictionary of Statistics & Methodology: A Nontechnical
Guide for the Social Sciences (pp. 104-105). Thousand Oaks,
California: SAGE Publications, Inc.