How To Read and Use A Box-and-Whisker Plot
That’s all there is to it, so the next time you’re thinking of making a bar graph or a
histogram, think about using Tukey’s beloved box-and-whisker plot too.
In my opinion, the box plot is one of the most underrated views in today's fashionable
approaches to information visualization. Modern chart libraries ship with many chart types,
but almost all of them lack the box plot. I therefore decided to write this article to put the
brilliant box plot back on the map and to provide a CSS/JavaScript solution for displaying
History:
The box plot goes back to John Tukey, who published this efficient method for
displaying robust statistics in 1977 (Tukey77).
Best Practice:
The most impressive use of a box plot that I have found is in the World Freedom Atlas:
Let’s first
look at the view at the top, which also shows a box plot. The red dot in the blue bar is the
median; the lines at the left and right represent the lower and upper quartiles (I
will explain later which numbers a box plot actually displays); 0 and 40 are the minimum
and maximum possible values. If you move the mouse over a country on the map, it is
highlighted in the box plot, as you can see in the picture above: the country with a raw
political rights score of 34 (it’s Mongolia, by the way). Another very nice feature here is the
stacking of elements with the observed value on top of the blue bar. For each value, this
shows how many countries have that score and thereby gives an immediately
comprehensible picture of the underlying distribution. Of course, this is only
possible if you have a predictable number of values to stack (otherwise you cannot
determine the necessary height) and if these values are integers (otherwise there is an
infinite number of possible positions for the values and stacking is not possible).
The box plot at the bottom of the above picture follows the recommendation of Edward Tufte
(Tufte01). Again, the red dot represents the median; the ends of the lines towards the red
dot are the lower and upper quartiles, respectively; the ends of the lines towards the
borders are the minimum and maximum values. Another nice feature here is the yellow line
showing the development of the displayed index (the raw political rights score) over the
last years for the selected country (the currently selected year is displayed in darker
blue). The values for the selected country in each year are connected by the yellow line.
As one can see immediately, the score is decreasing slightly.
As mentioned above, this is probably the most stunning example of a box plot, and
everything is done correctly. Still, in my opinion, there are some drawbacks to
Tufte's recommendation for box plots. Usually, a box plot is displayed in the following way
(this one was created with the data exploration tool KNIME, for which I implemented this
box plot myself):
Tufte's recommendation is based on the notion of avoiding chart junk and the principle of
maximizing data ink, i.e. the ink in the drawing should be used to display data and not
decoration or junk. While this is certainly a good guideline, the result is sometimes
difficult to read. In the example of the World Freedom Atlas, it is only possible to decipher
the actual values by looking at the box plot to the left. Maximizing the data ink sometimes
minimizes readability. In the example below definitely more “ink” was used, but in my
opinion the essential information (the key values and their exact numbers) is immediately
visible. This might not be as appealing as the box plot above, but if you are really
interested in the values, this version might fit your needs better. (Or maybe that is just
because I’m more familiar with it?)
Theory:
But what is this all about? What values are displayed in a box plot? What are the advantages
of a box plot? The image below should at least clarify the terms used; their meaning is
explained afterwards. A small example should make things
clear. Consider a small village with 25 inhabitants. This is what they earn, together with
the resulting box plot:
24 2,996.45
23 2,919.35
22 2,787.02
21 2,784.72
20 2,696.83
18 2,400.43
17 2,367.84
16 2,333.37
15 2,285.53
14 2,214.87
12 1,923.62
11 1,819.22
10 1,773.34
9 1,597.54
8 1,589.48
7 1,494.65 Q1: 0.25 * 25 = 6.25, rounded up to position 7.
6 1,423.74
5 1,391.92
4 1,334.88
3 1,184.53
2 1,125.78
1 1,005.85 Minimum
As you can see, the basic idea is to sort the data and then select the minimum, the maximum
and the values at the corresponding positions: median (0.5), lower quartile (Q1) (0.25) and
upper quartile (Q3) (0.75). Why are these values considered robust key statistics? To
explain this, consider a similar village with one rich person and the following incomes:
24 10,345.67 Maximum
22 2,787.02
21 2,784.72
20 2,696.83
19 2,412.51
16 2,333.37
15 2,285.53
14 2,214.87
13 2,069.79
11 1,819.22
10 1,773.34
9 1,597.54
8 1,589.48
7 1,494.65
5 1,391.92
4 1,334.88
3 1,184.53
2 1,125.78
Almost all programming languages start counting at zero, so the computed position has to
be floored rather than ceiled to get the correct index; if the position p is an integer,
the mean of the values at p and p-1 has to be taken.
2. The horizontal bars outside of the box in the middle (called whiskers: hence the name
box and whisker plot) are not always the maximum and the minimum.
The whiskers mark the minimum and maximum values unless these lie more than 1.5 * IQR
beyond the box. The IQR is the interquartile range: the distance between Q1 and Q3.
Observations that lie more than 1.5 * IQR or even more than 3 * IQR beyond the box are
considered mild and extreme outliers, respectively. The picture below depicts the concept
in a qualitative way (distances are not correct):
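The whisker and outlier rule can also be sketched in a few lines of Python. This is only a minimal illustration: it assumes that the 1.5 * IQR and 3 * IQR distances are measured from the edges of the box (Q1 and Q3), and the function name whisker_summary and the sample numbers are hypothetical, not taken from the KNIME implementation mentioned above.

```python
def whisker_summary(sorted_values, q1, q3):
    """Return whisker ends plus mild and extreme outliers.

    Assumption: distances are measured from the box edges Q1 and Q3.
    """
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    # Values within 1.5 * IQR of the box decide where the whiskers end.
    inside = [v for v in sorted_values if inner_low <= v <= inner_high]
    # Mild outliers lie between 1.5 * IQR and 3 * IQR away from the box.
    mild = [v for v in sorted_values
            if outer_low <= v < inner_low or inner_high < v <= outer_high]
    # Extreme outliers lie more than 3 * IQR away from the box.
    extreme = [v for v in sorted_values if v < outer_low or v > outer_high]

    return {
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "mild_outliers": mild,
        "extreme_outliers": extreme,
    }


# Hypothetical data with Q1 = 11 and Q3 = 16 supplied by the caller.
values = [1, 10, 11, 12, 13, 14, 15, 16, 22, 40]
print(whisker_summary(values, q1=11, q3=16))
```

With these numbers the whiskers end at 10 and 22, the value 1 is a mild outlier and the value 40 is an extreme outlier.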
Robust Statistics:
In the first case we have a median of 2,069.79 and a mean of 2,037.38, so they are quite
comparable. In the second case, according to the mean of 2,303.437, the village is richer,
while the median incorruptibly keeps telling the truth (1,996.705) and the only rich person
is displayed as what he or she is in this village: an outlier. The same holds for the other
key values, of course.
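To make the robustness argument tangible, here is a small Python sketch that applies the zero-based position rule described earlier (floor the position; if it is an integer p, take the mean of the values at p and p-1). The function name quantile_by_position and the eight incomes are hypothetical; they are not the village data from the tables above.

```python
import math
from statistics import mean


def quantile_by_position(sorted_values, q):
    """Pick the q-quantile from ascending data using zero-based positions."""
    n = len(sorted_values)
    position = n * q
    index = math.floor(position)
    if position == index and 0 < index < n:
        # Integer position p: take the mean of the values at p - 1 and p.
        return (sorted_values[index - 1] + sorted_values[index]) / 2
    # Non-integer position: flooring gives the zero-based index directly.
    return sorted_values[min(index, n - 1)]


# Hypothetical incomes, sorted in ascending order.
village = [1000, 1200, 1500, 1800, 2000, 2300, 2600, 3000]
# Same village, but the top earner is replaced by one rich person.
village_with_rich_person = village[:-1] + [10000]

for incomes in (village, village_with_rich_person):
    print(
        "mean:", mean(incomes),
        "Q1:", quantile_by_position(incomes, 0.25),
        "median:", quantile_by_position(incomes, 0.5),
        "Q3:", quantile_by_position(incomes, 0.75),
    )
```

In this sketch the mean jumps from 1925 to 2800 when the rich person moves in, while Q1 (1350), the median (1900) and Q3 (2450) stay exactly the same.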
Summary
At this point we can summarize what a box plot actually displays.
At least 25% of all values are below the lower quartile Q1.
At least 50% of all values are below (or above) the median.
At least 25% of all values are above the upper quartile Q3.
The box contains 50% of the data (Q3 (75%) - Q1 (25%) = 50%).
You can read the distribution of the values from the size of the box and the length of
the whiskers.
Between the median and each quartile lie 25% of the data (75% - 50% = 25% and
50% - 25% = 25%), i.e. the position of the median inside the box indicates whether
the values are concentrated towards the upper or the lower quartile.
Not to mention the outliers, which are those values that are far away from most of
the other values.
Median
The median is represented by the line in the box. The median is a common
measure of the center of your data.
Whiskers
The whiskers extend from either side of the box. The whiskers represent the
ranges for the bottom 25% and the top 25% of the data values, excluding
outliers.
Hold the pointer over the boxplot to display a tooltip that shows these statistics.
For example, the following boxplot of the heights of students shows that the
median height is 69. Most students have a height that is between 66 and 72, but
some students have heights that are as low as 61 and as high as 75.
N = 500
A boxplot works best when the sample size is at least 20. If the sample size is too
small, the quartiles and outliers shown by the boxplot may not be meaningful. If
the sample size is less than 20, consider using Individual Value Plot.
Skewed data
When data are skewed, the majority of the data are located on the high or low
side of the graph. Skewness indicates that the data may not be normally
distributed.
The following boxplots are skewed. The boxplot with right-skewed data shows
wait times. Most of the wait times are relatively short, and only a few wait times
are long. The boxplot with left-skewed data shows failure time data. A few items
fail immediately and many more items fail later.
Right-skewed
Left-skewed
Some analyses assume that your data come from a normal distribution. If your
data are skewed (nonnormal), read the data considerations topic for the analysis
to make sure that you can use data that are not normal.
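The direction of the skew can also be estimated from the three numbers that define the box. The following Python sketch uses a simple quartile-based indicator (often called Bowley's coefficient of skewness), which is an addition for illustration and not something the boxplot itself reports. The function name quartile_skew and the wait-time and failure-time quartiles are hypothetical.

```python
def quartile_skew(q1, median, q3):
    """Quartile-based skew indicator in the range -1 to 1.

    Positive: the median sits closer to Q1 (right-skewed).
    Negative: the median sits closer to Q3 (left-skewed).
    """
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("Q1 and Q3 are equal, so the indicator is undefined")
    return ((q3 - median) - (median - q1)) / iqr


# Hypothetical wait times: most are short, a few are long.
print(quartile_skew(q1=2, median=3, q3=8))     # positive value
# Hypothetical failure times: a few items fail early, most fail late.
print(quartile_skew(q1=10, median=18, q3=20))  # negative value
```

A value near zero means the median sits in the middle of the box. The indicator only looks at the box, so long whiskers or outliers on one side still have to be judged from the plot.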
Outliers
Outliers, which are data values that are far away from other data values, can
strongly affect your results. Often, outliers are easiest to identify on a boxplot. On
a boxplot, outliers are identified by asterisks (*).
TIP
Hold the pointer over the outlier to identify the data point.
Try to identify the cause of any outliers. Correct any data-entry errors or
measurement errors. Consider removing data values that are associated with
abnormal, one-time events (special causes). Then, repeat the analysis.
Centers
Look for differences between the centers of the groups. For example, the
following boxplot shows the thickness of wire from four suppliers. The median
thicknesses for some groups seem to be different.
Spreads
Look for differences between the spreads of the groups. For example, the
following boxplot shows the fill weights of cereal boxes from four production
lines. The median weights of the groups of cereal boxes are similar, but the
weights of some groups are more variable than others.
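The same comparison of centers and spreads can be done numerically. The sketch below uses Python's standard statistics module and made-up fill weights for two production lines; the function name center_and_spread and the data are hypothetical. Note that statistics.quantiles with method="inclusive" interpolates between values, so its quartiles can differ slightly from those of other software.

```python
from statistics import median, quantiles


def center_and_spread(groups):
    """Return the median and interquartile range for each group."""
    summary = {}
    for name, values in groups.items():
        # n=4 yields the three cut points Q1, median and Q3.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        summary[name] = {"median": median(values), "iqr": q3 - q1}
    return summary


# Hypothetical fill weights for two production lines.
fill_weights = {
    "Line 1": [365, 366, 367, 368, 369],
    "Line 2": [355, 361, 367, 373, 379],
}

for name, stats in center_and_spread(fill_weights).items():
    print(name, "median:", stats["median"], "IQR:", stats["iqr"])
```

Both hypothetical lines have a median of 367, but the interquartile range is 2 for Line 1 and 12 for Line 2, which is the situation described above: similar centers, different spreads.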