01 Descriptive Statistics For Exploring Data
In January 1986, the space shuttle Challenger broke apart shortly after liftoff. The
accident was caused by a part that was not designed to fly at the unusually cold
temperature of 29 °F at launch.
Here are the launch temperatures of the first 25 shuttle missions (in degrees F):
66,70,69,80,68,67,72,70,70,57,63,70,78,67,53,67,75,70,81,76,79,75,76,58,29
[Figure: histogram of the launch temperatures; x-axis: Temperature, y-axis: Frequency]
The two most important functions of descriptive statistics are:
• Communicate information
• Support reasoning about data
There are many ways to visualize data. The nature of the data and the goal of the
visualization determine which method to choose.
Pie chart and dot plot
[Figure: pie chart and dot plot of the same data; categories: California, International,
Other US, Washington, Oregon; dot plot x-axis: Percent]
The dot plot makes it easier to compare the frequencies of the categories, while the pie
chart makes it easier to eyeball what fraction of the total each category represents.
Bar graph
When the data are quantitative (i.e. numbers), then they should be put on a number
line. This is because the ordering and the distance between the numbers convey
important information.
The bar graph is essentially a dot plot put on its side.
[Figure: bar graph with the values 3 to 6 on the horizontal number line]
The histogram
In a histogram, the area of each block is proportional to the frequency. So the
percentage falling into a block can be figured without a vertical scale, since the
total area equals 100%.
But it’s helpful to have a vertical scale (density scale). Its unit is ‘% per unit’, so in the
above example the vertical unit is ‘% per year’.
The histogram gives two kinds of information about the data:
1. Density (crowding): The height of the bar tells how many subjects there are for one
unit on the horizontal scale. For example, the highest density is around age 19, where
the height is about 0.04 per year: roughly 4% of all subjects are age 19. In contrast,
only about 0.7% of subjects fall into each one-year range for ages 60–80.
2. Percentages (relative frequencies): Those are given by area = height x width.
For example, about 14% of all subjects fall into the age range 60–80, because the
corresponding area is (20 years) x (0.7% per year) = 14%. Alternatively, you can find
this answer by eyeballing that this area makes up roughly 1/7 of the total area of the
histogram, so roughly 1/7 ≈ 14% of all subjects fall in that range.
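The density scale can be checked with a short calculation. The sketch below uses the launch temperatures listed earlier rather than the age data, which is not reproduced here. The bin edges and the helper name density_histogram are assumptions chosen for illustration, with unequal bin widths on purpose. It computes each block's height in % per degree and recovers the percentage as height x width.

```python
# Launch temperatures (degrees F) of the first 25 shuttle missions, from the text.
temps = [66, 70, 69, 80, 68, 67, 72, 70, 70, 57, 63, 70, 78,
         67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58, 29]

# Assumed bin edges (not from the text); the widths are deliberately unequal.
edges = [20, 50, 60, 70, 75, 90]


def density_histogram(data, edges):
    """Return one (left, right, density) tuple per block, density in % per unit."""
    n = len(data)
    blocks = []
    for left, right in zip(edges[:-1], edges[1:]):
        # Each block includes its left edge and excludes its right edge.
        count = sum(left <= x < right for x in data)
        percent = 100 * count / n
        blocks.append((left, right, percent / (right - left)))
    return blocks


blocks = density_histogram(temps, edges)
for left, right, density in blocks:
    area = density * (right - left)  # area = height x width = percentage
    print(f"[{left}, {right}): {density:.2f} % per degree, area {area:.0f} %")

# The areas of all blocks add up to 100%.
total_area = sum(d * (r - l) for l, r, d in blocks)
print(f"total area: {total_area:.0f} %")
```

With these bins, the block from 70 to 75 degrees is the tallest even though the block from 75 to 90 degrees contains more launches, because its launches are crowded into a narrower range.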
The boxplot (box-and-whisker plot)
The boxplot depicts five key numbers of the data: the minimum, the 1st quartile, the
median, the 3rd quartile, and the maximum.
[Figure: boxplot of miles per gallon for 32 cars]
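As a rough illustration, the numbers behind a boxplot can be computed as a five-number summary (minimum, quartiles, maximum). The car data is not listed in the text, so this sketch uses the launch temperatures from the beginning of the section instead. The "inclusive" quartile method is an assumption; other conventions give slightly different quartiles, and plotting software may also draw outliers separately from the whiskers.

```python
import statistics

# Launch temperatures (degrees F) of the first 25 shuttle missions, from the text.
temps = [66, 70, 69, 80, 68, 67, 72, 70, 70, 57, 63, 70, 78,
         67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58, 29]

# Quartiles with the "inclusive" method (an assumed convention, see lead-in).
q1, median, q3 = statistics.quantiles(temps, n=4, method="inclusive")

five_numbers = (min(temps), q1, median, q3, max(temps))
print(five_numbers)  # (29, 67.0, 70.0, 75.0, 81)
```

The box would span 67 to 75 degrees, with the 29-degree launch far below it.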
The boxplot conveys less information than a histogram, but it takes up less space and
so is well suited to compare several datasets:
[Figure: side-by-side boxplots of miles per gallon by number of cylinders (4, 6, 8)]
The scatterplot
The scatterplot visualizes the relationship between two quantitative variables.
[Figure: scatterplot; x-axis: Education, y-axis: Income]
Providing context is important
Statistical analyses typically compare the observed data to a reference. Therefore
context is essential for graphical integrity.
• ‘The Visual Display of Quantitative Information’ by Edward Tufte (p.74)
One way to provide context is by using small multiples. The compact design of the
boxplot makes it well suited for this task:
Providing context with small multiples
Pitfalls when visualizing data
Sophisticated software makes it tempting to produce showy but poor visualizations:
[Figure: newspaper graphic, credited to Aaron Staple/The Stanford Daily]
Numerical summary measures
For summarizing data with one number, use the mean (=average) or the median.
The median is the number that is larger than half the data and smaller than the other
half.
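Both summaries are easy to compute by hand. The following is a minimal sketch applied to the launch temperatures from the beginning of the section. The function names are chosen for illustration, and taking the average of the two middle values for an even number of data points is a common convention that the text does not spell out.

```python
def mean(data):
    """Average: the sum of the data divided by the number of data points."""
    return sum(data) / len(data)


def median(data):
    """Middle value of the sorted data (average of the two middle values if even)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Launch temperatures (degrees F) of the first 25 shuttle missions, from the text.
temps = [66, 70, 69, 80, 68, 67, 72, 70, 70, 57, 63, 70, 78,
         67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58, 29]

print(mean(temps))    # 68.24
print(median(temps))  # 70
```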
Mean vs. median
Mean and median are the same when the histogram is symmetric.
[Figure: symmetric histogram; x-axis: km/sec]
When the histogram is skewed to the right, then the mean can be much larger than the
median.
[Figure: miles per gallon for 32 cars]
The two numbers x̄ (the mean) and s (the standard deviation) are often used to summarize
data. Both are sensitive to a few very large or very small values.
If that is a concern, use the median and the interquartile range.
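The sensitivity to a single extreme value can be seen by summarizing the launch temperatures with and without the 29-degree Challenger launch. This is a sketch under two assumptions: s is taken to be the sample standard deviation (divisor n - 1, as computed by statistics.stdev), and the quartiles use the "inclusive" method. The helper name summarize is hypothetical.

```python
import statistics

# Launch temperatures (degrees F) of the first 25 shuttle missions, from the text.
temps = [66, 70, 69, 80, 68, 67, 72, 70, 70, 57, 63, 70, 78,
         67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58, 29]
without_outlier = [t for t in temps if t != 29]


def summarize(data):
    """Return mean, SD, median and interquartile range of the data."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),  # sample SD (assumed convention)
        "median": statistics.median(data),
        "iqr": q3 - q1,
    }


for label, data in [("all 25 launches", temps), ("without 29 F", without_outlier)]:
    s = summarize(data)
    print(f"{label}: mean={s['mean']:.2f} sd={s['sd']:.2f} "
          f"median={s['median']} iqr={s['iqr']}")
```

Dropping the one cold launch moves the mean from 68.24 to 69.875 and shrinks the standard deviation noticeably, while the median stays at 70 and the interquartile range barely changes.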