Exploratory Data Analysis
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their
main characteristics, often using statistical graphics and other data visualization methods. A statistical model
may or may not be used, but EDA is primarily for seeing what the data can tell us beyond formal modeling,
and it thereby contrasts with traditional hypothesis testing. Exploratory data analysis has been promoted by
John Tukey since 1970 to encourage statisticians to explore the data and possibly formulate hypotheses that
could lead to new data collection and experiments. EDA is different from initial data analysis (IDA),[1][2]
which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing,
handling missing values, and transforming variables as needed. EDA encompasses IDA.
Overview
Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the
results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise
or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing
data."[3]
Tukey's championing of EDA encouraged the development of statistical computing packages, especially S
at Bell Labs.[4] The S programming language inspired the systems S-PLUS and R. This family of
statistical-computing environments featured vastly improved dynamic visualization capabilities, which
allowed statisticians to identify outliers, trends and patterns in data that merited further study.
Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric
statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating
statistical models. Tukey promoted the use of the five-number summary of numerical data: the two extremes
(maximum and minimum), the median, and the quartiles. He favored it because the median and quartiles, being
functions of the empirical distribution, are defined for all distributions, unlike the mean and standard
deviation; moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than
the traditional summaries (the mean and standard deviation). The packages S, S-PLUS, and R included
routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which
are nonparametric and robust (for many problems).
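As a minimal sketch of these ideas, the following Python code computes a five-number summary and a bootstrap standard error for the median. The sample values and the function names five_number_summary and bootstrap_se are hypothetical, chosen for illustration, and the "inclusive" quartile rule is only one of several conventions (Tukey's hinges can differ slightly).

```python
import random
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    # "inclusive" is one of several quartile conventions; Tukey's hinges
    # can give slightly different values for some sample sizes.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


def bootstrap_se(values, stat, n_resamples=2000, seed=0):
    """Estimate the standard error of stat(values) by resampling with replacement."""
    rng = random.Random(seed)
    n = len(values)
    replicates = [stat(rng.choices(values, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(replicates)


# Hypothetical sample: nine ordinary values, then the same with one gross outlier.
clean = [2, 3, 3, 4, 5, 5, 6, 7, 8]
contaminated = clean + [500]

print("summary (clean):       ", five_number_summary(clean))
print("summary (contaminated):", five_number_summary(contaminated))
print("mean (clean, contaminated):",
      statistics.mean(clean), statistics.mean(contaminated))
print("bootstrap SE of the median:",
      bootstrap_se(contaminated, statistics.median))
```

In this example the single outlier moves the maximum and the mean a great deal, while the median and quartiles barely change, which is the robustness property described above.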
Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical
programming languages facilitated statisticians' work on scientific and engineering problems. Such
problems included the fabrication of semiconductors and the understanding of communications networks,
which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to
complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's
emphasis on exponential families.[5]
Development
John W. Tukey wrote the book Exploratory Data Analysis in 1977.[6] Tukey held that too much emphasis
in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed
to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types
of analyses and employing them on the same set of data can lead to systematic bias owing to the issues
inherent in testing hypotheses suggested by the data.
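The bias from testing a hypothesis on the data that suggested it can be illustrated with a small simulation. The sketch below is an assumption-laden toy example, not Tukey's own procedure: every variable is pure noise, the "most promising" predictor is chosen on one half of the data, and its correlation is then rechecked on the other half. The names pearson_r and one_trial are hypothetical.

```python
import random
import statistics


def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def one_trial(rng, n_obs=40, n_vars=20):
    """Pure noise: pick the 'best' predictor on one half, recheck it on the other."""
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    xs = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    half = n_obs // 2
    # Exploration half: choose the variable with the largest |r|.
    explore_r = [abs(pearson_r(x[:half], y[:half])) for x in xs]
    best = max(range(n_vars), key=lambda k: explore_r[k])
    # Confirmation half: same variable, but data that played no part in the choice.
    confirm_r = abs(pearson_r(xs[best][half:], y[half:]))
    return explore_r[best], confirm_r


rng = random.Random(1)
trials = [one_trial(rng) for _ in range(300)]
mean_explore = statistics.fmean([t[0] for t in trials])
mean_confirm = statistics.fmean([t[1] for t in trials])
print(f"mean |r| on the data that suggested the hypothesis: {mean_explore:.2f}")
print(f"mean |r| on held-out data:                          {mean_confirm:.2f}")
```

Although no real relationship exists, the correlation measured on the exploration half is systematically larger than the one measured on held-out data, which is why exploratory and confirmatory steps are best kept on separate data.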
The objectives of EDA are to:
Many EDA techniques have been adopted into data mining. They are also being taught to
young students as a way to introduce them to statistical thinking.[8]
Figure: Data science process flowchart.
Box plot
Histogram
Multi-vari chart
Run chart
Pareto chart
Scatter plot (2D/3D)
Stem-and-leaf plot
Parallel coordinates
Odds ratio
Targeted projection pursuit
Heat map
Bar chart
Horizon graph
Glyph-based visualization methods such as PhenoPlot[10] and Chernoff faces
Projection methods such as grand tour, guided tour and manual tour
Interactive versions of these plots
Dimensionality reduction:
Multidimensional scaling
Principal component analysis (PCA)
Multilinear PCA
Nonlinear dimensionality reduction (NLDR)
Iconography of correlations
Median polish
Trimean
Ordination
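Two of the quantitative techniques listed above, the trimean and median polish, are simple enough to sketch with the Python standard library. The code below is an illustrative implementation under stated assumptions: quartiles use the "inclusive" convention rather than Tukey's hinges, the sweep order follows the common row-then-column scheme, and the example table is hypothetical.

```python
import statistics


def trimean(values):
    """Tukey's trimean: (Q1 + 2 * median + Q3) / 4."""
    q1, q2, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4


def median_polish(table, n_iter=10):
    """Decompose table[i][j] into overall + row[i] + col[j] + residual[i][j]."""
    resid = [[float(v) for v in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep each row's median out of the residuals into the row effects.
        for i in range(n_rows):
            m = statistics.median(resid[i])
            resid[i] = [v - m for v in resid[i]]
            row_eff[i] += m
        # Move the median of the column effects into the overall term.
        m = statistics.median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m
        # Sweep each column's median out of the residuals into the column effects.
        for j in range(n_cols):
            column = [resid[i][j] for i in range(n_rows)]
            m = statistics.median(column)
            for i in range(n_rows):
                resid[i][j] -= m
            col_eff[j] += m
        # Move the median of the row effects into the overall term.
        m = statistics.median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m
    return overall, row_eff, col_eff, resid


# Hypothetical two-way table with one unusual cell (row 1, column 2).
table = [[10, 12, 13],
         [11, 13, 30],
         [15, 17, 18]]
overall, rows, cols, resid = median_polish(table)
print("trimean of first row:", trimean(table[0]))
print("overall:", overall, "rows:", rows, "cols:", cols)
print("residuals:", resid)
```

Because medians rather than means are swept out, an unusual cell tends to show up as a large residual instead of distorting the row and column effects.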
History
Many EDA ideas can be traced back to earlier authors, for example:
The Open University course Statistics in Society (MDST 242) took the above ideas and merged them with
Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.
Example
Findings from EDA are orthogonal to the primary analysis task. To illustrate, consider an example from
Cook et al. where the analysis task is to find the variables which best predict the tip that a dining party will
give to the waiter.[12] The variables available in the data collected for this task are: the tip amount, total bill,
payer gender, smoking/non-smoking section, time of day, day of the week, and size of the party. The
primary analysis task is approached by fitting a regression model where the tip rate is the response variable.
The fitted model says that as the size of the dining party increases by one person (leading to a higher bill),
the tip rate decreases by about one percentage point, on average.
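A regression of this kind can be sketched in a few lines of Python. The records below are hypothetical and are not the data of Cook et al.; the helper fit_line is an illustrative ordinary least squares fit with a single predictor, whereas the published analysis may have used more variables.

```python
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical records: (party size, total bill, tip). NOT the original data.
records = [
    (1, 10.00, 1.80), (2, 21.50, 3.50), (2, 18.00, 3.10), (3, 30.00, 4.60),
    (4, 42.00, 6.00), (4, 39.50, 5.50), (5, 55.00, 7.30), (6, 61.00, 7.50),
]
sizes = [size for size, _, _ in records]
# The response is the tip rate (tip divided by bill), not the raw tip amount.
tip_rates = [tip / bill for _, bill, tip in records]

intercept, slope = fit_line(sizes, tip_rates)
print(f"tip rate = {intercept:.3f} + ({slope:.3f}) * party size")
```

A negative slope in the output corresponds to the interpretation given above: larger parties tip a smaller fraction of the bill, on average.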
However, exploring the data reveals other interesting features not described by this model.
Figure: Histogram of tip amounts where the bins cover $1 increments. The distribution of values is skewed right and unimodal, as is common in distributions of small, non-negative quantities.
Figure: Histogram of tip amounts where the bins cover $0.10 increments. An interesting phenomenon is visible: peaks occur at the whole-dollar and half-dollar amounts, which is caused by customers picking round numbers as tips. This behavior is common to other types of purchases too, like gasoline.
Figure: Scatterplot of tips vs. bill. Points below the line correspond to tips that are lower than expected (for that bill amount), and points above the line are higher than expected. We might expect to see a tight, positive linear association, but instead see variation that increases with tip amount. In particular, there are more points far away from the line in the lower right than in the upper left, indicating that more customers are very cheap than very generous.
Figure: Scatterplot of tips vs. bill separated by payer gender and smoking section status. Smoking parties have a lot more variability in the tips that they give. Males tend to pay the (few) higher bills, and the female non-smokers tend to be very consistent tippers (with three conspicuous exceptions shown in the sample).
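The effect of bin width described for the two histograms can be reproduced with a short sketch. The tip values below are hypothetical and are not the data of Cook et al.; the function names histogram and show are illustrative. The same values are binned twice, once with $1 bins and once with $0.10 bins.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each bin is labeled by its left edge."""
    width = round(bin_width * 100)
    counts = Counter()
    for v in values:
        # Work in integer cents to avoid floating-point edge effects.
        cents = round(v * 100)
        counts[(cents // width) * width / 100] += 1
    return dict(sorted(counts.items()))


def show(hist):
    """Print a crude text histogram, one row per non-empty bin."""
    for left_edge, count in hist.items():
        print(f"{left_edge:5.2f} | {'#' * count}")


# Hypothetical tips in dollars. NOT the original data.
tips = [1.00, 1.50, 2.00, 2.00, 2.00, 2.23, 2.50, 2.50, 3.00, 3.00,
        3.00, 3.18, 3.50, 3.50, 4.00, 4.00, 4.71, 5.00, 5.00, 6.50]

coarse = histogram(tips, 1.00)  # $1 bins: only the broad shape is visible
fine = histogram(tips, 0.10)    # $0.10 bins: round-number spikes appear

show(coarse)
print()
show(fine)
```

With the coarse bins the round-number habit is hidden inside each dollar-wide bar; with the fine bins the whole-dollar and half-dollar amounts stand out as separate peaks.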
What is learned from the plots is different from what is illustrated by the regression model, even though the
experiment was not designed to investigate any of these other trends. The patterns found by exploring the
data suggest hypotheses about tipping that may not have been anticipated in advance, and which could lead
to interesting follow-up experiments where the hypotheses are formally stated and tested by collecting new
data.
Software
JMP, an EDA package from SAS Institute.
KNIME (Konstanz Information Miner), an open-source data exploration platform based on
Eclipse.
Minitab, an EDA and general statistics package widely used in industrial and corporate
settings.
Orange, an open-source data mining and machine learning software suite.
Python, an open-source programming language widely used in data mining and machine
learning.
R, an open-source programming language for statistical computing and graphics. Together
with Python, it is one of the most popular languages for data science.
TinkerPlots, EDA software for upper elementary and middle school students.
Weka, an open-source data mining package that includes visualization and EDA tools such
as targeted projection pursuit.
See also
Anscombe's quartet, on importance of exploration
Data dredging
Predictive analytics
Structured data analysis (statistics)
Configural frequency analysis
Descriptive statistics
References
1. Chatfield, C. (1995). Problem Solving: A Statistician's Guide (2nd ed.). Chapman and Hall.
ISBN 978-0412606304.
2. Baillie, Mark; Le Cessie, Saskia; Schmidt, Carsten Oliver; Lusa, Lara; Huebner, Marianne;
Topic Group "Initial Data Analysis" of the STRATOS Initiative (2022). "Ten simple rules for
initial data analysis" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8870512). PLOS
Computational Biology. 18 (2): e1009819. Bibcode:2022PLSCB..18E9819B
(https://ui.adsabs.harvard.edu/abs/2022PLSCB..18E9819B). doi:10.1371/journal.pcbi.1009819
(https://doi.org/10.1371%2Fjournal.pcbi.1009819). PMC 8870512
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8870512). PMID 35202399
(https://pubmed.ncbi.nlm.nih.gov/35202399).
3. John Tukey-The Future of Data Analysis-July 1961
(http://projecteuclid.org/download/pdf_1/euclid.aoms/1177704711)
4. Becker, Richard A., A Brief History of S
(https://web.archive.org/web/20150723044213/http://www2.research.att.com/areas/stat/doc/94.11.ps),
Murray Hill, New Jersey: AT&T Bell Laboratories, archived from the original
(http://www2.research.att.com/areas/stat/doc/94.11.ps) (PS) on 2015-07-23, retrieved 2015-07-23,
"... we wanted to be able to interact with our
data, using Exploratory Data Analysis (Tukey, 1971) techniques."
5. Morgenthaler, Stephan; Fernholz, Luisa T. (2000). "Conversation with John W. Tukey and
Elizabeth Tukey, Luisa T. Fernholz and Stephan Morgenthaler"
(https://doi.org/10.1214%2Fss%2F1009212675). Statistical Science. 15 (1): 79–94.
doi:10.1214/ss/1009212675 (https://doi.org/10.1214%2Fss%2F1009212675).
6. Tukey, John W. (1977). Exploratory Data Analysis. Pearson. ISBN 978-0201076165.
7. Behrens-Principles and Procedures of Exploratory Data Analysis-American Psychological
Association-1997
(https://web.archive.org/web/20170808064326/cll.stanford.edu/~willb/course/behrens97pm.pdf)
8. Konold, C. (1999). "Statistics goes to school". Contemporary Psychology. 44 (1): 81–82.
doi:10.1037/001949 (https://doi.org/10.1037%2F001949).
9. Tukey, John W. (1980). "We need both exploratory and confirmatory". The American
Statistician. 34 (1): 23–25. doi:10.1080/00031305.1980.10482706
(https://doi.org/10.1080%2F00031305.1980.10482706).
10. Sailem, Heba Z.; Sero, Julia E.; Bakal, Chris (2015-01-08). "Visualizing cellular imaging
data using PhenoPlot" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4354266). Nature
Communications. 6 (1): 5825. Bibcode:2015NatCo...6.5825S
(https://ui.adsabs.harvard.edu/abs/2015NatCo...6.5825S). doi:10.1038/ncomms6825
(https://doi.org/10.1038%2Fncomms6825). ISSN 2041-1723
(https://www.worldcat.org/issn/2041-1723). PMC 4354266
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4354266). PMID 25569359
(https://pubmed.ncbi.nlm.nih.gov/25569359).
11. Elementary Manual of Statistics (3rd edn., 1920)
(https://archive.org/details/cu31924013702968/page/n5)
12. Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M.
Lawrence) (2007). Interactive and Dynamic Graphics for Data Analysis: With R and GGobi.
Springer. ISBN 978-0387717616.
Bibliography
Andrienko, N & Andrienko, G (2005) Exploratory Analysis of Spatial and Temporal Data. A
Systematic Approach. Springer. ISBN 3-540-25994-5
Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M.
Lawrence) (2007-12-12). Interactive and Dynamic Graphics for Data Analysis: With R and
GGobi. Springer. ISBN 9780387717616.
Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1985). Exploring Data Tables, Trends
and Shapes (https://archive.org/details/exploringdatatab0000unse). ISBN 978-0-471-09776-1.
Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1983). Understanding Robust and
Exploratory Data Analysis. ISBN 978-0-471-09777-8.
Inselberg, Alfred (2009). Parallel Coordinates: Visual Multidimensional Geometry and its
Applications. London New York: Springer. ISBN 978-0-387-68628-8.
Leinhardt, G., Leinhardt, S., Exploratory Data Analysis: New Tools for the Analysis of
Empirical Data (https://journals.sagepub.com/doi/pdf/10.3102/0091732X008001085),
Review of Research in Education, Vol. 8 (1980), pp. 85–157.
Martinez, W. L.; Martinez, A. R. & Solka, J. (2010). Exploratory Data Analysis with MATLAB,
second edition. Chapman & Hall/CRC. ISBN 9781439812204.
Theus, M., Urbanek, S. (2008), Interactive Graphics for Data Analysis: Principles and
Examples, CRC Press, Boca Raton, FL, ISBN 978-1-58488-594-8
Tucker, L; MacCallum, R. (1993). Exploratory Factor Analysis
(http://www.unc.edu/~rcm/book/factornew.htm).
Tukey, John Wilder (1977). Exploratory Data Analysis
(https://archive.org/details/exploratorydataa00tuke_0). Addison-Wesley. ISBN 978-0-201-07616-5.
Velleman, P. F.; Hoaglin, D. C. (1981). Applications, Basics and Computing of Exploratory
Data Analysis (https://archive.org/details/applicationsbasi00vell). ISBN 978-0-87150-409-8.
Young, F. W., Valero-Mora, P. and Friendly, M. (2006) Visual Statistics: Seeing your data with
Dynamic Interactive Graphics (http://www.uv.es/visualstats/Book). Wiley. ISBN 978-0-471-68160-1
Jambu M. (1991) Exploratory and Multivariate Data Analysis
(http://www.sciencedirect.com/science/book/9780123800909). Academic Press. ISBN 0123800900
S. H. C. DuToit, A. G. W. Steyn, R. H. Stumpf (1986) Graphical Exploratory Data Analysis
(https://link.springer.com/book/10.1007%2F978-1-4612-4950-4). Springer. ISBN 978-1-4612-9371-2
External links
Carnegie Mellon University: free online course on Probability and Statistics, with a module
on EDA (https://oli.cmu.edu/courses/free-open/statistics-course-details/)
Exploratory data analysis chapter: engineering statistics handbook
(https://www.itl.nist.gov/div898/handbook/eda/eda.htm)