Topic2-Numerical-Summary
Numerical Summaries
What are the main features of the data?
Outline
Centre
· Sample mean
· Sample median
· Robustness and comparisons
Spread
· Standard deviation
· Interquartile range
Write functions in R
Summary
Data story
How much does a property in Newtown cost?
Data on Newtown property sales
· Data is taken from domain.com.au:
- All properties sold in Newtown (NSW 2042) between April-June 2017.
- The variable Sold has price in $1000s.
dim(data)
## [1] 56 8
str(data)
## 'data.frame': 56 obs. of 8 variables:
## $ Property : chr "19 Watkin Street Newtown" "30 Pearl Street Newtown" "26 John Street Newtowmn" "23/617
## $ Type : chr "House" "House" "House" "Apartment" ...
## $ Agent : chr "RayWhite" "RayWhite" "Belle" "RayWhite" ...
## $ Bedrooms : int 4 2 2 1 1 5 1 1 1 3 ...
## $ Bathrooms: int 1 1 1 1 1 1 1 1 1 2 ...
## $ Carspots : int 1 0 0 1 1 1 0 1 1 0 ...
## $ Sold : int 1975 1250 1280 780 650 2100 675 740 625 1950 ...
## $ Date : chr "23/6/17" "23/6/17" "17/6/17" "17/6/17" ...
Numerical summaries
Recap: graphical summaries
For the Newtown property data we could produce a histogram or boxplot.
Advantages of numerical summaries
· A numerical summary reduces all the data to one simple number (“statistic”).
- This loses a lot of information.
- However it allows easy communication and comparisons.
· Major features that we can summarise numerically are:
- Maximum
- Minimum
- Centre [sample mean, median]
- Spread [standard deviation, range, interquartile range]
Which summaries might be useful for talking about Newtown house prices?
· It depends!
· Reporting the centre without the spread can be misleading!
Useful notation for data
· Observations of a single variable of size $n$ can be represented by
$$x_1, x_2, \dots, x_n$$
· The ranked observations (ordered from smallest to largest) are
$$x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$$
Sample mean
The sample mean is the average of the data.
$$\text{sample mean} = \frac{\text{sum of data}}{\text{size of data}}$$
or
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
· The sample mean of all the properties sold in Newtown is:
mean(data$Sold)
## [1] 1407.143
· Focusing specifically on houses with 4 bedrooms (large), the sample mean is:
mean(data$Sold[data$Type == "House" & data$Bedrooms == 4])
## [1] 2198.857
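As a rough check of the formula, the sample mean can be computed by hand as the sum divided by the size. The sketch below assumes the 56 sale prices are exactly those printed by sort(data$Sold) elsewhere in these slides; the names sold and mean_by_hand are introduced only for illustration.

```r
# Sale prices (in $1000s), copied from the sort(data$Sold) output in these slides.
sold = c(370, 625, 645, 650, 675, 692, 720, 740, 740, 755, 770, 780, 812, 860, 861,
         920, 935, 955, 955, 999, 1100, 1240, 1250, 1280, 1309, 1315, 1370, 1375, 1400, 1460,
         1553, 1575, 1590, 1600, 1600, 1600, 1605, 1662, 1701, 1710, 1750, 1780, 1790, 1806, 1850,
         1940, 1950, 1975, 2000, 2100, 2200, 2235, 2300, 2410, 2810, 3150)

n = length(sold)                # size of data
mean_by_hand = sum(sold) / n    # sum of data / size of data
mean_by_hand
mean(sold)                      # built-in function, should agree
```

Both lines should print the same value, which matches the 1407.143 reported above.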
Deviation from the mean
Given a data point $x_i$, its deviation from the sample mean $\bar{x}$ is
$$D_i = x_i - \bar{x}$$
For example,
Sample mean as a balancing point
The sample mean is the point at which the data is balanced, in the sense that the sum of
the absolute deviations for values to the left of the mean is the same as the sum of the
absolute deviations for values to the right of the mean.
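The balancing property can be checked numerically. This is a minimal sketch using only the first ten sale prices that str(data) displays in these slides; sold10, gaps, left and right are illustrative names, not part of the original code.

```r
# First ten sale prices (in $1000s) shown by str(data) in these slides.
sold10 = c(1975, 1250, 1280, 780, 650, 2100, 675, 740, 625, 1950)

gaps = sold10 - mean(sold10)        # deviations D_i = x_i - mean
left = sum(abs(gaps[gaps < 0]))     # total absolute deviation below the mean
right = sum(abs(gaps[gaps > 0]))    # total absolute deviation above the mean
c(left = left, right = right)       # the two totals agree
```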
Sample mean on the histogram
However, the sample mean may not be the balancing point of a histogram: the area to the
left of the mean may not be the same as the area to the right of the mean.
When the data is skewed, this effect is more pronounced.
Sample median
The sample median $\tilde{x}$ is the middle data point, when the observations are ordered
from smallest to largest.
Ordering observations
The ranked observations are:
sort(data$Sold)
## [1] 370 625 645 650 675 692 720 740 740 755 770 780 812 860 861
## [16] 920 935 955 955 999 1100 1240 1250 1280 1309 1315 1370 1375 1400 1460
## [31] 1553 1575 1590 1600 1600 1600 1605 1662 1701 1710 1750 1780 1790 1806 1850
## [46] 1940 1950 1975 2000 2100 2200 2235 2300 2410 2810 3150
length(data$Sold)
## [1] 56
· The sample median of all the properties sold in Newtown is:
median(data$Sold)
## [1] 1387.5
· Focusing specifically on houses with 4 bedrooms (large), the sample median is:
median(data$Sold[data$Type == "House" & data$Bedrooms == 4])
## [1] 1975
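To see what median() does, here is a sketch of the calculation by hand. It assumes the usual convention that for an even number of observations the median is the average of the two middle ranked values; sold, x and med are illustrative names.

```r
# Sale prices (in $1000s), copied from the sort(data$Sold) output in these slides.
sold = c(370, 625, 645, 650, 675, 692, 720, 740, 740, 755, 770, 780, 812, 860, 861,
         920, 935, 955, 955, 999, 1100, 1240, 1250, 1280, 1309, 1315, 1370, 1375, 1400, 1460,
         1553, 1575, 1590, 1600, 1600, 1600, 1605, 1662, 1701, 1710, 1750, 1780, 1790, 1806, 1850,
         1940, 1950, 1975, 2000, 2100, 2200, 2235, 2300, 2410, 2810, 3150)

x = sort(sold)      # ranked observations
n = length(x)
if (n %% 2 == 1) {
  med = x[(n + 1) / 2]                   # odd n: the single middle value
} else {
  med = (x[n / 2] + x[n / 2 + 1]) / 2    # even n: average of the two middle values
}
med
median(sold)        # built-in function, should agree
```

With n = 56 the two middle values are the 28th and 29th ranked prices.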
Sample median on the histogram
· The sample median is the halfway point on the histogram, i.e., 50% of the properties
sold are below $1.3875 million and 50% are above.
hist(data$Sold)
abline(v = mean(data$Sold), col = "green")
abline(v = median(data$Sold), col = "purple")
# Sale prices of the 4-bedroom houses
big = data$Sold[data$Type == "House" & data$Bedrooms == 4]
hist(big, main = "Newtown 4 Bedrooms", xlab = "Price (in 1000s)")
abline(v = mean(big), col = "green")      # sample mean
abline(v = median(big), col = "purple")   # sample median
Statistical Thinking
If you had to choose between reporting the sample mean or sample median for
Newtown properties, which would you choose and why?
· For the full property portfolio, the sample mean and the sample median are fairly
similar.
· For the 4 bedroom houses, the sample mean is higher than the sample median
because it is being “pulled up” by some very expensive houses.
· For the average buyer, the sample median would be more useful as an
indication of the sort of price needed to get into the market.
· For any agent selling houses in the area, the sample mean might be more useful
in order to predict their average commissions!
· In practice, we can report both!
Sample mean and median on the boxplot
· The sample median is the centre line on the boxplot.
boxplot(data$Sold, main = "Newtown properties")
abline(h = mean(data$Sold), col = "green")
abline(h = median(data$Sold), col = "purple")
# Sale prices of the 4-bedroom houses
big = data$Sold[data$Type == "House" & data$Bedrooms == 4]
boxplot(big, main = "Newtown 4B Properties")
abline(h = mean(big), col = "green")      # sample mean
abline(h = median(big), col = "purple")   # sample median
Robustness and comparisons
Robustness
The sample median is said to be robust and is a good summary for skewed data, as
it is barely affected by outliers (extreme data values).
Example
Recently a heritage building was sold for $13 million in Newtown.
How would the sample mean and sample median change if it was added to the
data?
· The sample median would be a bit higher: it moves from the average of the 28th and
29th points to the 29th point.
data2 = c(data$Sold, 13000)
sort(data2)
## [1] 370 625 645 650 675 692 720 740 740 755 770 780
## [13] 812 860 861 920 935 955 955 999 1100 1240 1250 1280
## [25] 1309 1315 1370 1375 1400 1460 1553 1575 1590 1600 1600 1600
## [37] 1605 1662 1701 1710 1750 1780 1790 1806 1850 1940 1950 1975
## [49] 2000 2100 2200 2235 2300 2410 2810 3150 13000
mean(data2)
## [1] 1610.526
median(data2)
## [1] 1400
Summary of changes
Change in data             Sample mean    Sample median
Original data              1407.143       1387.5
Extra property of 13000    1610.526       1400
Comparing the sample mean and the median
The difference between the sample mean and the sample median can be an indication
of the shape of the data.
· For symmetric data, the sample mean and sample median are the same: $\bar{x} = \tilde{x}$.
· For left skewed data (the most frequent data are concentrated on the right, with a left
tail), the sample mean is smaller than the sample median: $\bar{x} < \tilde{x}$.
· For right skewed data (the most frequent data are concentrated on the left, with a
right tail), the sample mean is larger than the sample median: $\bar{x} > \tilde{x}$.
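The three cases can be illustrated with small made-up data sets. The vectors below are hypothetical and chosen only to show the direction of the inequality; they are not taken from the Newtown data.

```r
# Hypothetical toy data, invented for illustration only.
symmetric = c(1, 2, 3, 4, 5)
right_skewed = c(1, 2, 2, 3, 3, 3, 4, 10, 20)   # long right tail
left_skewed = 21 - right_skewed                 # mirror image: long left tail

c(mean = mean(symmetric), median = median(symmetric))         # equal
c(mean = mean(right_skewed), median = median(right_skewed))   # mean > median
c(mean = mean(left_skewed), median = median(left_skewed))     # mean < median
```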
Which is optimal for describing centre?
· Both have strengths and weaknesses depending on the nature of the data.
· Sometimes neither gives a sensible sense of location, for example if the data is
bimodal.
· As the sample median is robust, it is preferable for data which is skewed or has
many outliers, like Sydney house prices.
· The sample mean is helpful for data which is basically symmetric, with not too
many outliers, and for theoretical analysis.
Limitations of both?
· Both the sample mean and sample median allow easy comparisons.
· However, they need to be paired with a measure of spread.
· Two data sets can have the same sample mean but be very different. For example,
consider the two data sets {−1, 0, 1} and {−100, 0, 100}.
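A quick sketch with the two small data sets makes the point concrete. It uses the range (maximum minus minimum) as the simplest measure of spread; the names a and b are illustrative.

```r
a = c(-1, 0, 1)
b = c(-100, 0, 100)

c(mean(a), mean(b))                  # same centre
c(median(a), median(b))              # same centre
c(diff(range(a)), diff(range(b)))    # very different spread
```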
Standard deviation
How to measure spread?
For each property sold, we could calculate the deviation (or gap) between its price and
the sample mean of 1407 (in $1000s): $D_i = x_i - \bar{x}$.
gaps = data$Sold - mean(data$Sold)
gaps
## [1] 567.857143 -157.142857 -127.142857 -627.142857 -757.142857
## [6] 692.857143 -732.142857 -667.142857 -782.142857 542.857143
## [11] -32.142857 167.857143 -408.142857 -452.142857 -547.142857
## [16] 197.857143 182.857143 -167.142857 -1037.142857 532.857143
## [21] -687.142857 -452.142857 -487.142857 442.857143 192.857143
## [26] -652.142857 -7.142857 145.857143 1402.857143 192.857143
## [31] 792.857143 372.857143 398.857143 293.857143 -98.142857
## [36] -307.142857 -472.142857 -762.142857 52.857143 -37.142857
## [41] -715.142857 1742.857143 1002.857143 -637.142857 254.857143
## [46] 827.857143 592.857143 382.857143 342.857143 302.857143
## [51] 192.857143 -546.142857 -667.142857 -92.142857 892.857143
## [56] -595.142857
max(gaps)
## [1] 1742.857
What are the biggest and smallest deviations?
1st attempt: The mean gap
We could calculate the average of the deviations.
Mean gap
round(mean(gaps))
## [1] 0
What’s the problem?
Note: It will always be 0.
· From the definition, the mean deviation must be 0, as the mean is the balancing
point of the deviations.
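The step behind this claim can be written out using the definitions of $D_i$ and $\bar{x}$ given earlier:

$$\frac{1}{n}\sum_{i=1}^{n} D_i = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}) = \frac{1}{n}\sum_{i=1}^{n} x_i - \bar{x} = \bar{x} - \bar{x} = 0.$$

In R the computed value may differ from 0 by a tiny rounding error, which is presumably why the result is wrapped in round().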
Better option: Standard deviation
First define the root mean square (RMS).
Root mean square
· The RMS measures the average size of a set of numbers, regardless of their signs.
· The steps are: Square the numbers, then Mean the result, then Root the result.
$$\text{RMS(numbers)} = \sqrt{\text{sample mean}(\text{numbers}^2)}$$
· The RMS has the same unit as the numbers themselves, and hence as their sample mean.
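The three steps translate directly into a one-line R function. This is a minimal sketch; the name rms is an assumption for illustration and is not a base R function.

```r
# Hypothetical helper: square, then mean, then root.
rms = function(numbers) {
  sqrt(mean(numbers^2))
}

rms(c(2, -2, 2, -2))   # every value has size 2, so the RMS is 2
rms(c(3, -4))          # signs are ignored: same as rms(c(3, 4))
```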
· Applying RMS to the deviations, we get
$$\text{RMS of deviations} = \sqrt{\text{sample mean}(\text{deviations}^2)} = \sqrt{\frac{\sum_{i=1}^{n} D_i^2}{n}}$$
· An alternative that also ignores the signs is the
$$\text{mean absolute deviation (MAD)} = \frac{\sum_{i=1}^{n} |D_i|}{n}.$$
However, MAD is much harder to analyse.
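Both measures are easy to compute side by side. The sketch below uses a small hypothetical data set (not from the Newtown data) chosen so that the numbers come out cleanly; x, gaps, rms_dev and mean_abs_dev are illustrative names.

```r
# Hypothetical toy data with mean 5, invented for illustration only.
x = c(2, 4, 4, 4, 5, 5, 7, 9)
gaps = x - mean(x)                 # deviations D_i

rms_dev = sqrt(mean(gaps^2))       # RMS of deviations
mean_abs_dev = mean(abs(gaps))     # mean absolute deviation
c(rms = rms_dev, mad = mean_abs_dev)
```

Note that R's built-in mad() computes a scaled median absolute deviation, which is a different quantity, so the mean absolute deviation is computed here directly from its formula.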
Standard deviation in terms of RMS
Population Standard deviation
· The population standard deviation is the RMS of the deviations from the mean. It
measures the spread of the data.
sqrt(mean(gaps^2))
## [1] 593.7166
Standard deviation in R?
It is easy to calculate in R.
sd(data$Sold)
## [1] 599.0897
But why is this slightly different?
Adjusting the standard deviation
· There are two different formulas for the standard deviation, depending on whether
the data is the population or a sample.
· The sd command in R always gives the sample version, as we most commonly
have samples.
· Formally, $SD_{pop} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} D_i^2}$ and $SD_{sample} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} D_i^2}$, where
$D_i = x_i - \bar{x}$ is the deviation.
sd(data$Sold) * sqrt(55/56) # adjust by sqrt((n-1)/n), it calculates the population SD.
## [1] 593.7166
gaps = data$Sold - mean(data$Sold) # calculate the gaps
sqrt(mean(gaps^2)) # calculates the population SD.
## [1] 593.7166
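The adjustment above hard-codes 55/56. A more general sketch takes n from the data; the function name sd_pop is an assumption for illustration (base R has no population SD function), and the input is just the first ten sale prices displayed by str(data).

```r
# Hypothetical helper: population SD obtained by rescaling R's sample SD.
sd_pop = function(x) {
  n = length(x)
  sd(x) * sqrt((n - 1) / n)
}

# First ten sale prices (in $1000s) shown by str(data) in these slides.
sold10 = c(1975, 1250, 1280, 780, 650, 2100, 675, 740, 625, 1950)
gaps = sold10 - mean(sold10)

sd_pop(sold10)          # rescaled sample SD
sqrt(mean(gaps^2))      # RMS of the gaps: the same number
sd(sold10)              # sample SD: slightly larger
```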
Why does the sample SD divide by $n - 1$ rather than $n$?
· Dividing by $n - 1$ makes the sample variance (the squared sample SD) an unbiased
estimator of the population variance (beyond the scope of this unit, will be covered in Year 2).
· Calculating the sample mean uses all of the $n$ data points. As a result, the sum (and
hence the mean) of the $n$ deviations is zero:
$$\sum_{i=1}^{n} D_i = \sum_{i=1}^{n} (x_i - \bar{x}) = 0.$$
This means that, given the first $n - 1$ deviations, we know the $n$-th deviation, because
$$\left(\sum_{i=1}^{n-1} D_i\right) + D_n = 0 \implies D_n = -\sum_{i=1}^{n-1} D_i.$$
So only $n - 1$ of the deviations are free to vary.
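This constraint can be checked numerically: the last deviation is recovered from the others. A small sketch, again using the first ten sale prices displayed by str(data); the variable names are illustrative.

```r
# First ten sale prices (in $1000s) shown by str(data) in these slides.
sold10 = c(1975, 1250, 1280, 780, 650, 2100, 675, 740, 625, 1950)
n = length(sold10)
gaps = sold10 - mean(sold10)              # all n deviations

last_from_others = -sum(gaps[1:(n - 1)])  # D_n implied by the first n - 1
c(implied = last_from_others, actual = gaps[n])
```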
Summary: population and sample
Summary                                       Formula                                         In R
Population or sample mean                     Sample mean (average)                           mean(data)
Population standard deviation $SD_{pop}$      RMS of gaps from the sample mean                sd(data)*sqrt((n-1)/n)
Sample standard deviation $SD_{sample}$       Adjusted RMS of gaps from the sample mean       sd(data)
How to tell the difference?
· It can be tricky to work out whether your data is a population or sample!
· Look at the information about the data story and the research questions.
- If we are just interested in the Newtown property prices during April-June 2017,
then the data is the whole population.
- If we are studying the property prices during April-June 2017 as a window into
more general property prices (for the rest of the year or for the Inner West area),
then the data could be considered a sample.
· Population SD and sample SD get closer with increasing sample size n.
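The last point follows from the adjustment factor sqrt((n-1)/n), which is the ratio of the population SD to the sample SD. A short sketch, with sample sizes chosen arbitrarily for illustration:

```r
n = c(5, 56, 1000, 100000)     # arbitrary sample sizes
ratio = sqrt((n - 1) / n)      # population SD divided by sample SD
data.frame(n = n, ratio = ratio)
```

The ratio is always below 1 and approaches 1 as n grows, so the choice of version matters mainly for small data sets.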
Variance
The squared standard deviation is called the variance. As with the SD, there are two
versions of the variance: the population variance $SD_{pop}^2$ and the sample variance $SD_{sample}^2$.
· For summarising spread, we often prefer SD, as it has the same unit as the data
points and the mean.
· In some situations, e.g., dealing with random variables (Part III) and understanding
the properties of the sample mean, using the variance can be much simpler.
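In R, var() gives the sample version, just as sd() does. The sketch below checks the relationships on the first ten sale prices displayed by str(data); pop_var is an illustrative name.

```r
# First ten sale prices (in $1000s) shown by str(data) in these slides.
sold10 = c(1975, 1250, 1280, 780, 650, 2100, 675, 740, 625, 1950)
n = length(sold10)
gaps = sold10 - mean(sold10)

var(sold10)               # sample variance (divides by n - 1)
sd(sold10)^2              # the same number: squared sample SD
pop_var = mean(gaps^2)    # population variance (divides by n)
pop_var
```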
Standard units
Standard units (“Z score”)
Standard units of a data point = how many standard deviations it is below or above
the mean
$$\text{standard units} = \frac{\text{data point} - \text{mean}}{\text{SD}}$$
Comparing 2 data points
To compare 2 data points, we can compare the standard units.
So 19 Watkin is a more unusual purchase than 30 Pearl St, relative to the mean.
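A sketch of this comparison in R follows. It assumes the sale prices of 1975 and 1250 (in $1000s) listed for these two properties in the str(data) output, and it uses the sample SD from sd(); the slides do not show which SD was used, so the exact values here may differ slightly from the original calculation.

```r
# Sale prices (in $1000s), copied from the sort(data$Sold) output in these slides.
sold = c(370, 625, 645, 650, 675, 692, 720, 740, 740, 755, 770, 780, 812, 860, 861,
         920, 935, 955, 955, 999, 1100, 1240, 1250, 1280, 1309, 1315, 1370, 1375, 1400, 1460,
         1553, 1575, 1590, 1600, 1600, 1600, 1605, 1662, 1701, 1710, 1750, 1780, 1790, 1806, 1850,
         1940, 1950, 1975, 2000, 2100, 2200, 2235, 2300, 2410, 2810, 3150)

prices = c("19 Watkin Street" = 1975, "30 Pearl Street" = 1250)
z = (prices - mean(sold)) / sd(sold)   # standard units (assumes sample SD)
z
abs(z)                                 # distance from the mean, in SDs
```

The property with the larger absolute standard units is the more unusual one relative to the mean.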
Interquartile range
Interquartile range (IQR)
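The IQR is another measure of spread, based on the ordered data.
Interquartile range (IQR)
The IQR is the range of the middle 50% of the data: $IQR = Q_3 - Q_1$, where $Q_1$ and $Q_3$
are the 1st and 3rd quartiles.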
Quantile, quartile, percentile
The set of q-quantiles divides the ordered data into q sets of equal size (in terms of
percentage of data).
Quartiles are the 4-quantiles and percentiles are the 100-quantiles.
summary(data$Sold)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 370.0 860.8 1387.5 1407.1 1782.5 3150.0
summary(data$Sold)[5] - summary(data$Sold)[2] # one way to calculate IQR
## 3rd Qu.
## 921.75
IQR(data$Sold) # use the built-in function
## [1] 921.75
So the range of the middle 50% of properties sold is almost a million dollars!
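The quartiles can also be requested directly with quantile() and its probs argument. The sketch assumes the 56 sale prices printed by sort(data$Sold) in these slides and R's default interpolation rule (type = 7); other quantile types can give slightly different quartiles.

```r
# Sale prices (in $1000s), copied from the sort(data$Sold) output in these slides.
sold = c(370, 625, 645, 650, 675, 692, 720, 740, 740, 755, 770, 780, 812, 860, 861,
         920, 935, 955, 955, 999, 1100, 1240, 1250, 1280, 1309, 1315, 1370, 1375, 1400, 1460,
         1553, 1575, 1590, 1600, 1600, 1600, 1605, 1662, 1701, 1710, 1750, 1780, 1790, 1806, 1850,
         1940, 1950, 1975, 2000, 2100, 2200, 2235, 2300, 2410, 2810, 3150)

q = quantile(sold, probs = c(0.25, 0.5, 0.75))   # Q1, median, Q3
q
iqr = unname(q[3] - q[1])                        # Q3 - Q1
iqr
```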
Reporting
· Like the median, the IQR is robust, so it’s suitable as a summary of spread for
skewed data.
· We report in pairs: (mean, SD) or (median, IQR).
IQR on the boxplot and outliers
· The IQR is the length of the box in the boxplot. It represents the span of the middle
50% of the houses sold.
· The lower and upper thresholds (expected minimum and maximum) are a distance
of $1.5 \times IQR$ from the 1st and 3rd quartiles (by Tukey’s convention).
$$LT = Q_1 - 1.5 \times IQR$$
and
$$UT = Q_3 + 1.5 \times IQR$$
· Data outside these thresholds is considered an outlier (“extreme reading”).
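The thresholds can be applied directly to flag outliers. This is a minimal sketch: find_outliers is a hypothetical helper (not a base R function), the quartiles come from quantile() with its default settings, and the data are the sale prices printed by sort(data$Sold) plus the 13000 heritage building from the earlier example.

```r
# Hypothetical helper: returns the values outside Tukey's thresholds.
find_outliers = function(x) {
  q1 = unname(quantile(x, 0.25))
  q3 = unname(quantile(x, 0.75))
  iqr = q3 - q1
  lt = q1 - 1.5 * iqr       # lower threshold
  ut = q3 + 1.5 * iqr       # upper threshold
  x[x < lt | x > ut]
}

# Sale prices (in $1000s), copied from the sort(data$Sold) output in these slides.
sold = c(370, 625, 645, 650, 675, 692, 720, 740, 740, 755, 770, 780, 812, 860, 861,
         920, 935, 955, 955, 999, 1100, 1240, 1250, 1280, 1309, 1315, 1370, 1375, 1400, 1460,
         1553, 1575, 1590, 1600, 1600, 1600, 1605, 1662, 1701, 1710, 1750, 1780, 1790, 1806, 1850,
         1940, 1950, 1975, 2000, 2100, 2200, 2235, 2300, 2410, 2810, 3150)

find_outliers(sold)             # nothing is flagged in the original data
find_outliers(c(sold, 13000))   # the heritage building is flagged
```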
boxplot(data$Sold, horizontal = TRUE)
q = quantile(data$Sold)   # q[2] is the 1st quartile, q[4] is the 3rd quartile
iqr = q[4] - q[2]
abline(v = median(data$Sold), col = "green")
abline(v = q[2], col = "red")
abline(v = q[4], col = "red")
abline(v = q[2] - 1.5 * iqr, col = "purple")   # lower threshold
abline(v = q[4] + 1.5 * iqr, col = "purple")   # upper threshold
boxplot(data$Sold, horizontal = TRUE)
q = quantile(data$Sold)   # q[2] is the 1st quartile, q[4] is the 3rd quartile
iqr = q[4] - q[2]
abline(v = median(data$Sold), col = "green")
abline(v = q[2], col = "red")
abline(v = q[4], col = "red")
abline(v = max(min(data$Sold), q[2] - 1.5 * iqr), col = "purple")   # LT kept within the data
abline(v = min(max(data$Sold), q[4] + 1.5 * iqr), col = "purple")   # UT kept within the data
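To keep LT and UT within the range of the data, R never draws the whiskers beyond the
smallest and largest observations, which for this data amounts to the convention
$$LT = \max(\min(x),\ Q_1 - 1.5 \times IQR)$$
and
$$UT = \min(\max(x),\ Q_3 + 1.5 \times IQR)$$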
Dealing with outliers (not for examination)
Sometimes outliers indicate that a better model is needed. We may remove outliers by
transforming the data. For example, a right skewed data set with outliers can be
transformed to the logarithmic scale.
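As a small illustration, the sketch below applies Tukey's thresholds before and after a log transformation. The data vector is hypothetical and chosen so that the effect is visible, and find_outliers is an illustrative helper; a log transformation will not remove every outlier in every data set.

```r
# Hypothetical helper: returns the values outside Tukey's thresholds.
find_outliers = function(x) {
  q1 = unname(quantile(x, 0.25))
  q3 = unname(quantile(x, 0.75))
  iqr = q3 - q1
  x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]
}

# Hypothetical right skewed data, invented for illustration only.
x = c(1, 2, 2, 3, 3, 4, 5, 6, 8, 20)

find_outliers(x)        # on the original scale, 20 is flagged
find_outliers(log(x))   # on the log scale, nothing is flagged
```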
Write functions in R
How to write a function in R
Functions are among the most used objects in R. For example, mean , median , sd are
all R functions. It is very important to understand the purpose and syntax of R functions
and to know how to create and use them.
A function declaration gives the function a name (say function_name ), lists its inputs
(say parameter1 , parameter2 ) and returns an output (say c ). A function can take any
number of inputs but returns only one output.
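The original declaration is not reproduced in this text, so the following is only a sketch of what such a declaration might look like. The names function_name, parameter1, parameter2 and c are the placeholders from the description; the body (adding the two inputs) is an assumption made so that the example runs.

```r
function_name = function(parameter1, parameter2) {
  # Hypothetical body: any computation on the inputs could go here.
  c = parameter1 + parameter2
  return(c)   # the single output of the function
}

function_name(2, 3)   # call the function with two inputs
```

To return several values at once, they can be bundled into one object, such as a vector or a list.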
Example
Here we want to write a function in R that calculates the sample mean and sample
standard deviation
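The function from the original slide is not reproduced in this text, so here is one possible sketch. The name mean_sd and its internals are assumptions; it follows the formulas for $\bar{x}$ and $SD_{sample}$ given earlier and returns both values in a single named vector.

```r
# Hypothetical helper: sample mean and sample SD computed from their formulas.
mean_sd = function(x) {
  n = length(x)
  xbar = sum(x) / n                         # sample mean
  gaps = x - xbar                           # deviations from the mean
  sd_sample = sqrt(sum(gaps^2) / (n - 1))   # sample SD (divides by n - 1)
  c(mean = xbar, sd = sd_sample)            # one output holding both values
}

# Sale prices (in $1000s), copied from the sort(data$Sold) output in these slides.
sold = c(370, 625, 645, 650, 675, 692, 720, 740, 740, 755, 770, 780, 812, 860, 861,
         920, 935, 955, 955, 999, 1100, 1240, 1250, 1280, 1309, 1315, 1370, 1375, 1400, 1460,
         1553, 1575, 1590, 1600, 1600, 1600, 1605, 1662, 1701, 1710, 1750, 1780, 1790, 1806, 1850,
         1940, 1950, 1975, 2000, 2100, 2200, 2235, 2300, 2410, 2810, 3150)

res = mean_sd(sold)
res
```

The result should agree with mean(sold) and sd(sold).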
Summary
Centre
· Sample mean
· Sample median
· Robustness and comparisons
Spread
· Standard deviation
- Population standard deviation and sample standard deviation
· Interquartile range
Write functions in R
Some R Functions
length , mean , median , sd , IQR , summary , boxplot