Why Is Data Visualization Important? What Is Important in Data Visualization?
Harvard Data Science Review • Issue 2.1, Winter 2020
Column Editor’s Note: Data visualization, facilitated by the power of the computer, represents one of the
fundamental tools of modern data science. Professor Antony Unwin from the University of Augsburg describes
different ways in which data visualization is used, explores the opportunities for future research in the area,
and looks at how data visualization is taught.
Data visualization is useful for data cleaning, exploring data structure, detecting outliers and unusual groups,
identifying trends and clusters, spotting local patterns, evaluating modeling output, and presenting results. It is
essential for exploratory data analysis and data mining to check data quality and to help analysts become
familiar with the structure and features of the data before them. This is a part of data analysis that is
underplayed in textbooks, yet ever-present in actual investigations. Look, for instance, at the one-sided peaks in
the distributions of marathon finishing times (marastats, 2019).
Graphics reveal data features that statistics and models may miss: unusual distributions of data, local patterns,
clusterings, gaps, missing values, evidence of rounding or heaping, implicit boundaries, outliers, and so on.
Graphics raise questions that stimulate research and suggest ideas. It sounds easy. In fact, interpreting graphics
needs experience to identify potentially interesting features and statistical nous to guard against the dangers of
overinterpretation. Just as graphics are useful for checking model results, models are useful for checking ideas
derived from graphics (for more on models, see Hand, 2019).
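As a minimal sketch of how a display can reveal heaping that summary statistics miss, the following Python example compares two small data sets with the same mean and median. The data are hypothetical and constructed purely for illustration, and `text_histogram` is a helper name introduced here, not a library function. A crude text histogram stands in for a proper graphic.

```python
from collections import Counter
from statistics import mean, median


def text_histogram(values, bin_width=1):
    """Return one text line per bin, from the lowest to the highest bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    lo, hi = min(counts), max(counts)
    return [
        f"{start:>4} | {'#' * counts.get(start, 0)}"
        for start in range(lo, hi + 1, bin_width)
    ]


# Hypothetical "exact" measurements: each integer from 20 to 40 appears twice.
exact = list(range(20, 41)) * 2

# Hypothetical "reported" values: every second entry is rounded to the
# nearest multiple of 5, which produces heaping at 20, 25, 30, 35, and 40.
heaped = [5 * round(v / 5) if i % 2 == 0 else v for i, v in enumerate(exact)]

# The summary statistics cannot tell the two data sets apart.
print("exact :", mean(exact), median(exact))
print("heaped:", mean(heaped), median(heaped))

# The display shows spikes at the multiples of 5.
print("\n".join(text_histogram(heaped)))
```

With unit-width bins the spikes at multiples of 5 stand out immediately. Calling `text_histogram(heaped, bin_width=5)` on the same data hides them, which is one small illustration of why the choice of display matters.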
This overview concentrates on static graphics. Dynamic graphics and, more especially, interactive graphics are
in an exciting stage of development and have much to add. They require an article of their own. Superb
examples include Human Terrain, a dynamic graphic showing the world's population in 3-D, and the
interactive NameVoyager.
Harvard Data Science Review • Issue 2.1, Winter 2020 Why Is Data Visualization Important? What Is Important in Data
Visualization?
the data, how and why they were collected, whether more could be collected, the reasons for drawing the
displays, and how people with the necessary background knowledge advise they might be interpreted. There is
a story that M. G. Kendall reviewed a book of R. A. Fisher's with the words: "No one should read this book
who has not read it already." It is like that with graphics. If you have read all the supporting text, the display is
often memorable and readily understandable. If you have not, it is not. Graphics on their own are insufficient;
they are part of a whole. They complement text and are complemented by text. Student's reanalysis of the
Lanarkshire Milk Experiment (Student, 1931) is an excellent example (and is also interesting as an early
analysis of a large data set).
The potential synergy of text and graphics can be appreciated by talking through your own graphics, explaining
them to others. Why have you drawn those graphics? How have you drawn them? What can be seen? Are there
interesting patterns? What could be changed and improved? Which other graphics might be drawn? How can
conclusions be checked? There should be more talking about graphics and less relying on the graphics to speak
for themselves.
When it comes to graphics you have not drawn yourself, the same kinds of questions are still relevant, although
they may be more difficult to answer. Edward Tufte described Charles Minard's display of Napoleon's Russian
campaign as the best statistical graphic ever drawn (Tufte, 2001). It is a magnificent graphic, fully deserving of
the praise heaped on it, yet as Lee Wilkinson has pointed out in his book The Grammar of Graphics
(Wilkinson, 2005), there are inaccuracies and imprecisions in the display. Why did no one point them out
before? We are too used to accepting graphics uncritically, not asking enough questions of them.
Published graphics tend to be graphics for presentation, partly because they are for publication and partly
because no one wants to see hundreds of quick graphics that may or may not have been helpful. It is rather like
mathematical proofs: articles contain the elegant and concise final versions, not the scribbled notes and random
ideas that came before. How many graphics may have been drawn before the striking display was chosen to
show the resignations of U.K. cabinet ministers in recent years (Institute for Government, 2019)?
Exploratory graphics take advantage of how easy it is now to draw and redraw graphics. What used to be a
slow and wearisome process, even including having to print out displays, has become fast and flexible. At the
same time, new, additional skills are required. Identifying interesting features and knowing how to check them
in more detail among a myriad of possible graphics is not just a matter of drawing many graphics; it requires
interpretative skills and an appreciation of which graphics will provide what kinds of information. There is so
much that can be varied: the variables displayed, the types of graphics, the sizes of graphics and their aspect
ratios, the colors and symbols used, the scales and limits, the ordering of categorical variables, the ordering of
variables in multivariate displays. Selecting from the wide range of graphics wisely, and understanding how to
gain insights, are not trivial tasks. The lack of a theory of data visualization to guide and build on is a key issue.
More understanding of combining and linking graphics is needed, whether in static ensembles or in interactive
displays, just as better software is needed for these. The value of alignment and common scaling for making
effective comparisons, for instance in small multiples and faceting (displaying many graphics of the same
form, conditioned on other variables), is one part of this. It is a historical curiosity that the current exciting
work on interactive graphics on the Web still lags behind standalone systems that were already available more
than 30 years ago in linking multiple windows. Data Desk and JMP were commercial examples at the time (see
Velleman, 2019, and Sall, 2019, for current versions).
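The point about common scaling in small multiples can be sketched in a few lines of Python. This is an illustrative sketch only: the group data are hypothetical, `bar` and `small_multiples` are names introduced here, and text bars stand in for real panels. It renders one panel per group, either on a shared scale or with each panel scaled to its own maximum.

```python
def bar(value, scale_max, width=20):
    """Scale a value to a text bar of at most `width` characters."""
    return "#" * round(width * value / scale_max)


def small_multiples(groups, common_scale=True, width=20):
    """Render one text panel per group.

    With common_scale=True every panel uses the overall maximum as its
    scale; otherwise each panel is scaled to its own maximum.
    """
    overall_max = max(v for values in groups.values() for v in values)
    panels = {}
    for name, values in groups.items():
        scale_max = overall_max if common_scale else max(values)
        panels[name] = [bar(v, scale_max, width) for v in values]
    return panels


# Hypothetical data: group B has the same shape as A at one fifth the size.
groups = {"A": [10, 20, 40], "B": [2, 4, 8]}

shared = small_multiples(groups, common_scale=True)
free = small_multiples(groups, common_scale=False)

for label, panels in (("common scale", shared), ("free scales", free)):
    print(label)
    for name, bars in panels.items():
        print(f"  panel {name}")
        for b in bars:
            print(f"    |{b}")
```

With free scales the two panels are drawn identically, so the difference in magnitude between the groups cannot be seen. With a common scale the panels can be compared directly, at the cost of less detail in the smaller group.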
Published graphics are sometimes attractive and beautifully produced. The quality of the content does not always match.
That may be because authors and publishers do not expect the graphics to be examined in any detail. They may
be added as illustrations to balance the layout and make it look more agreeable. If you do not have a suitable
photograph, cartoon, or map, you could use a colorful statistical graphic. I have many times heard people say
that they do not understand numbers and were bad at mathematics in school. No one has ever said to me they
do not understand graphics, perhaps because they regard them as illustrations and not as central parts of an
argument. There is work to be done in educating researchers and readers in the value of graphics.
Research into new and innovative graphics is exciting and productive. Simultaneously, it is essential to make
the best use of known and well-understood graphics. There is a risk of emphasis on novelty at the expense of
familiarity. New, innovative graphics need instruction and experience to interpret them. Their designers have
spent much time developing them and reasonably enough believe that what is obvious to them should be
obvious to everyone. Just think of the humble scatterplot. It is only in recent years that scatterplots have
appeared in the media, although they are one of the most important statistical graphics. If you have never seen
one before, it can be intimidating, even more so when you are told ‘It is clear that...’ or ‘You can easily see
that...’ We should build on the familiar to carry our readers along with us.
Disclosure Statement
Antony Unwin has no financial or non-financial disclosures to share for this article.
References
Daniels, M. (2018). Human terrain. https://pudding.cool/2018/10/city_3d/
Hand, D. (2019). What is the purpose of statistical modelling? Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.4a85af74
Institute for Government. (2019). Ministerial resignations outside reshuffles, by prime minister. Retrieved August 14, 2019, from https://www.instituteforgovernment.org.uk/charts/ministerial-resignations-outside-reshuffles-prime-minister
Klimek, P., Yegorov, Y., Hanel, R., & Thurner, S. (2012). Statistical detection of systematic election irregularities. PNAS, 109(41), 16469–16473. https://doi.org/10.1073/pnas.1210722109
marastats. (2019). General marathon stats. Retrieved August 14, 2019, from https://marastats.com/marathon/
New York Times. (2018, December 31). 2018: The year in visual stories and graphics. https://www.nytimes.com/interactive/2018/us/2018-year-in-graphics.html
Theus, M. (2015). Tour de France 2015. Retrieved August 14, 2019, from http://www.theusrus.de/blog/tour-de-france-2015/
Tufte, E. (2001). The visual display of quantitative information (2nd ed.). Cheshire, CT: Graphics Press.
Unwin, A. (2015). Studying multivariate categorical data. Retrieved August 14, 2019, from http://www.gradaanwr.net/content/ch07/
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis (2nd ed.). New York, NY: Springer-Verlag. https://doi.org/10.1007/978-3-319-24277-4
Wilkinson, L. (2005). The grammar of graphics (2nd ed.). New York, NY: Springer. https://doi.org/10.1007/0-387-28695-0
©2020 Antony Unwin. This article is licensed under a Creative Commons Attribution (CC BY 4.0)
International license, except where otherwise indicated with respect to particular material included in the
article.