
Topic2-Numerical-Summary

The document provides an overview of numerical summaries in data analysis, focusing on property sales in Newtown, NSW. It discusses key concepts such as sample mean, sample median, and measures of spread like standard deviation and interquartile range, along with their implications for understanding property prices. The document emphasizes the importance of both mean and median in statistical reporting, particularly in skewed data scenarios.


Exploring Data

Numerical Summaries

© University of Sydney MATH1062


06 October 2024

 Module1 Exploring Data

Data & Graphical Summaries


What type of data do we have & how can we visualise it?

Numerical Summaries
What are the main features of the data?

Outline

Centre
· Sample mean
· Sample median
· Robustness and comparisons

Spread
· Standard deviation
· Interquartile range

Write functions in R

Summary

Data story
How much does a property in Newtown cost?
Data on Newtown property sales
· Data is taken from domain.com.au:
- All properties sold in Newtown (NSW 2042) between April-June 2017.
- The variable Sold has price in $1000s.

data <- read.csv("data/NewtownJune2017.csv", header = T)


head(data, n = 2)
## Property Type Agent Bedrooms Bathrooms Carspots Sold
## 1 19 Watkin Street Newtown House RayWhite 4 1 1 1975
## 2 30 Pearl Street Newtown House RayWhite 2 1 0 1250
## Date
## 1 23/6/17
## 2 23/6/17

dim(data)
## [1] 56 8
str(data)
## 'data.frame': 56 obs. of 8 variables:
## $ Property : chr "19 Watkin Street Newtown" "30 Pearl Street Newtown" "26 John Street Newtowmn" "23/617
## $ Type : chr "House" "House" "House" "Apartment" ...
## $ Agent : chr "RayWhite" "RayWhite" "Belle" "RayWhite" ...
## $ Bedrooms : int 4 2 2 1 1 5 1 1 1 3 ...
## $ Bathrooms: int 1 1 1 1 1 1 1 1 1 2 ...
## $ Carspots : int 1 0 0 1 1 1 0 1 1 0 ...
## $ Sold : int 1975 1250 1280 780 650 2100 675 740 625 1950 ...
## $ Date : chr "23/6/17" "23/6/17" "17/6/17" "17/6/17" ...

Numerical summaries
Recap: graphical summaries
For the Newtown property data we could produce a histogram or boxplot.

par(mfrow = c(1, 2))


hist(data$Sold, freq = F)
boxplot(data$Sold, horizontal = T)

 What do they reveal about Newtown house prices?

Advantages of numerical summaries
· A numerical summary reduces all the data to one simple number (“statistic”).
- This loses a lot of information.
- However it allows easy communication and comparisons.
· Major features that we can summarise numerically are:
- Maximum
- Minimum
- Centre [sample mean, median]
- Spread [standard deviation, range, interquartile range]
 Which summaries might be useful for talking about Newtown house prices?

· It depends!
· Reporting the centre without the spread can be misleading!

Useful notation for data
· Observations of a single variable of size n can be represented by

\[ x_1, x_2, \dots, x_n \]

· The ranked observations (ordered from smallest to largest) are

\[ x_{(1)}, x_{(2)}, \dots, x_{(n)} \]

such that \( x_{(1)} \le x_{(2)} \le \dots \le x_{(n)} \).

· The sum of the observations is

\[ \sum_{i=1}^{n} x_i \]
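As a minimal sketch of this notation in R: the vector x below is a made-up example (it is not part of the Newtown data), and length, sort and sum are base R functions.

```r
# Hypothetical data vector, for illustration only
x <- c(5, 2, 9, 4)

n <- length(x)        # sample size n
x_ranked <- sort(x)   # ranked observations x_(1) <= x_(2) <= ... <= x_(n)
total <- sum(x)       # sum of x_1, ..., x_n

n
x_ranked
total
```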

Sample mean
Sample mean
The sample mean is the average of the data.

\[ \text{sample mean} = \frac{\text{sum of data}}{\text{size of data}} \]

or

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Note that the sample mean involves all of the data.

· The sample mean of all the properties sold in Newtown is:
mean(data$Sold)
## [1] 1407.143

· Focusing specifically on houses with 4 bedrooms (large), the sample mean is:
mean(data$Sold[data$Type == "House" & data$Bedrooms == "4"])
## [1] 2198.857

Deviation from the mean
Given a data point xi , its deviation from the sample mean x̄ is

Di = xi − x̄
For example,

· 19 Watkin St sold for $1975 (thousands).
- This gives a gap of ($1975 − $1407.143) = $567.857 (thousands)
- $567.857 (thousands) above the sample mean
· 30 Pearl St sold for $1250 (thousands).
- This gives a gap of ($1250 − $1407.143) = −$157.143 (thousands)
- $157.143 (thousands) below the sample mean
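The same calculation can be sketched in R for several prices at once. The three values below are sale prices that appear in the Sold column, and x_bar is the sample mean reported earlier; the variable names are illustrative choices only.

```r
x_bar <- 1407.143             # sample mean of Sold, as reported earlier (in $1000s)
sold <- c(1975, 1250, 1950)   # three sale prices taken from the Sold column

deviation <- sold - x_bar     # D_i = x_i - x_bar
deviation

# A positive deviation is above the sample mean, a negative one is below
ifelse(deviation > 0, "above", "below")
```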

Sample mean as a balancing point
The sample mean is the point at which the data is balanced, in the sense that the sum of
the absolute deviations for values to the left of the mean equals the sum of the
absolute deviations for values to the right of the mean:

\[ \sum_{x_i < \bar{x}} |x_i - \bar{x}| = \sum_{x_i > \bar{x}} |x_i - \bar{x}| \]
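The balancing property can be checked numerically. The sketch below uses a small made-up vector (an assumption for illustration, not the Newtown data) and compares the total absolute deviation on each side of the mean.

```r
# Hypothetical right skewed data, for illustration only
x <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)
x_bar <- mean(x)

left  <- sum(abs(x[x < x_bar] - x_bar))   # total deviation below the mean
right <- sum(abs(x[x > x_bar] - x_bar))   # total deviation above the mean

c(left = left, right = right)
all.equal(left, right)
```

Note that seven values lie below the mean and only three above it, yet the two totals agree.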

Sample mean on the histogram
However, the sample mean may not be the balancing point of the histogram: the area to the
left of the mean may not be the same as the area to the right of the mean.

hist(data$Sold, main = "Newtown properties", xlab = "Price (in 1000s)")


abline(v = mean(data$Sold), col = "green")

When the data is skewed, this effect is more significant.

hist(data$Sold[data$Type == "House" & data$Bedrooms == "4"], main = "Newtown 4 Bedrooms",
    xlab = "Price (in 1000s)")
abline(v = mean(data$Sold[data$Type == "House" & data$Bedrooms == "4"]), col = "green")

Sample median
Sample median
The sample median \( \tilde{x} \) is the middle data point, when the observations are ordered
from smallest to largest.

· For an odd number of observations:

\[ \text{sample median} = \text{the unique middle point} = x_{\left(\frac{n+1}{2}\right)} \]

· For an even number of observations:

\[ \text{sample median} = \text{average of the 2 middle points} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} \]
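The two cases can be sketched as a small R function. The name my_median is a hypothetical helper introduced here for illustration; in practice the built-in median function does the same job.

```r
# Hypothetical helper: sample median from the ranked observations
my_median <- function(x) {
  s <- sort(x)       # ranked observations x_(1), ..., x_(n)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd n: the unique middle point
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # even n: average of the 2 middle points
  }
}

my_median(c(3, 1, 2))       # odd number of observations
my_median(c(4, 1, 3, 2))    # even number of observations
```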

Ordering observations
The ranked observations are:

sort(data$Sold)
## [1] 370 625 645 650 675 692 720 740 740 755 770 780 812 860 861
## [16] 920 935 955 955 999 1100 1240 1250 1280 1309 1315 1370 1375 1400 1460
## [31] 1553 1575 1590 1600 1600 1600 1605 1662 1701 1710 1750 1780 1790 1806 1850
## [46] 1940 1950 1975 2000 2100 2200 2235 2300 2410 2810 3150
length(data$Sold)
## [1] 56

As we have n = 56 observations (even), the sample median is found between the
(n/2) = 28th and (n/2 + 1) = 29th prices, or \( \tilde{x} = \frac{1375 + 1400}{2} = 1387.5 \).

· The sample median of all the properties sold in Newtown is:
median(data$Sold)
## [1] 1387.5

· Focusing specifically on houses with 4 bedrooms (large), the sample median is:
median(data$Sold[data$Type == "House" & data$Bedrooms == "4"])
## [1] 1975

Sample median on the histogram
· The sample median is the halfway point on the histogram, i.e., 50% of the properties
sold are below $1.3875 million and 50% are above.
hist(data$Sold)
abline(v = mean(data$Sold), col = "green")
abline(v = median(data$Sold), col = "purple")

hist(data$Sold[data$Type == "House" & data$Bedrooms == "4"], main = "Newtown 4 Bedrooms",
xlab = "Price (in 1000s)")
abline(v = mean(data$Sold[data$Type == "House" & data$Bedrooms == "4"]), col = "green")
abline(v = median(data$Sold[data$Type == "House" & data$Bedrooms == "4"]), col = "purple")

 What does this suggest?


 Statistical Thinking
If you had to choose between reporting the sample mean or sample median for
Newtown properties, which would you choose and why?

· For the full property portfolio, the sample mean and the sample median are fairly
similar.
· For the 4 bedroom houses, the sample mean is higher than the sample median
because it is being “pulled up” by some very expensive houses.
· For the average buyer, the sample median would be more useful as an
indication of the sort of price needed to get into the market.
· For any agent selling houses in the area, the sample mean might be more useful
in order to predict their average commissions!
· In practice, we can report both!

Sample mean and median on the boxplot
· The sample median is the centre line on the boxplot.
boxplot(data$Sold, main = "Newtown properties")
abline(h = mean(data$Sold), col = "green")
abline(h = median(data$Sold), col = "purple")

boxplot(data$Sold[data$Type == "House" & data$Bedrooms == "4"], main = "Newtown 4B Properties")
abline(h = mean(data$Sold[data$Type == "House" & data$Bedrooms == "4"]), col = "green")
abline(h = median(data$Sold[data$Type == "House" & data$Bedrooms == "4"]), col = "purple")

Robustness and comparisons
Robustness
The sample median is said to be robust and is a good summary for skewed data, as
it is barely affected by outliers (extreme data values).

Example
Recently a heritage building was sold for $13 million in Newtown.

How would the sample mean and sample median change if it was added to the
data?

· The sample mean would be a lot higher.

· The sample median would be a bit higher: it moves from the average of the 28th and
29th points to the 29th point.

data2 = c(data$Sold, 13000)
sort(data2)
## [1] 370 625 645 650 675 692 720 740 740 755 770 780
## [13] 812 860 861 920 935 955 955 999 1100 1240 1250 1280
## [25] 1309 1315 1370 1375 1400 1460 1553 1575 1590 1600 1600 1600
## [37] 1605 1662 1701 1710 1750 1780 1790 1806 1850 1940 1950 1975
## [49] 2000 2100 2200 2235 2300 2410 2810 3150 13000
mean(data2)
## [1] 1610.526
median(data2)
## [1] 1400

Summary of changes

| Change in data | Sample mean | Sample median |
|---|---|---|
| Original data | 1407.143 | 1387.5 |
| Extra property of 13000 | 1610.526 | 1400 |

Comparing the sample mean and the median
The difference between the sample mean and the sample median can be an indication
of the shape of the data.

· For symmetric data, the sample mean and sample median are the same: \( \bar{x} = \tilde{x} \).
· For left skewed data (the most frequent data are concentrated on the right, with a left
tail), the sample mean is smaller than the sample median: \( \bar{x} < \tilde{x} \).
· For right skewed data (the most frequent data are concentrated on the left, with a
right tail), the sample mean is larger than the sample median: \( \bar{x} > \tilde{x} \).
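A quick sketch of this comparison in R, assuming a made-up right skewed vector (not the Newtown data); negating it gives a left skewed version of the same values.

```r
# Hypothetical right skewed data: most values are small, with a long right tail
right_skewed <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)
left_skewed <- -right_skewed   # mirror image: long left tail

c(mean = mean(right_skewed), median = median(right_skewed))   # mean > median
c(mean = mean(left_skewed), median = median(left_skewed))     # mean < median
```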

Which is optimal for describing centre?
· Both have strengths and weaknesses depending on the nature of the data.
· Sometimes neither gives a sensible sense of location, for example if the data is
bimodal.
· As the sample median is robust, it is preferable for data which is skewed or has
many outliers, like Sydney house prices.
· The sample mean is helpful for data which is basically symmetric, with not too
many outliers, and for theoretical analysis.

Limitations of both?
· Both the sample mean and sample median allow easy comparisons.
· However, they need to be paired with a measure of spread.
· In the following example, the sample means are nearly the same, but the data are very
different. Or, consider the two data sets {−1, 0, 1} and {−100, 0, 100}: both have mean 0.

## [1] 10.22644 9.95406
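The two small data sets mentioned above can be compared directly. This is a minimal sketch: the names a and b are illustrative, and the range (maximum minus minimum) is used as the simplest measure of spread.

```r
a <- c(-1, 0, 1)
b <- c(-100, 0, 100)

c(mean(a), mean(b))                  # same centre
c(median(a), median(b))              # same median
c(diff(range(a)), diff(range(b)))    # very different spread (max - min)
```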

Standard deviation
How to measure spread?
For each property sold, we could calculate the deviation (or the gap) from the sample
mean, \( D_i = x_i - \bar{x} \), where the sample mean is $1407 (thousands).

| Property | Sold | Gap | Conclusion |
|---|---|---|---|
| 19 Watkin Street | $1975 (thousands) | 1975 − 1407 = 568 | More than half a million dollars more expensive than the average house price |
| 30 Pearl St | $1250 (thousands) | 1250 − 1407 = −157 | Cheaper than the average house price |

gaps = data$Sold - mean(data$Sold)
gaps
## [1] 567.857143 -157.142857 -127.142857 -627.142857 -757.142857
## [6] 692.857143 -732.142857 -667.142857 -782.142857 542.857143
## [11] -32.142857 167.857143 -408.142857 -452.142857 -547.142857
## [16] 197.857143 182.857143 -167.142857 -1037.142857 532.857143
## [21] -687.142857 -452.142857 -487.142857 442.857143 192.857143
## [26] -652.142857 -7.142857 145.857143 1402.857143 192.857143
## [31] 792.857143 372.857143 398.857143 293.857143 -98.142857
## [36] -307.142857 -472.142857 -762.142857 52.857143 -37.142857
## [41] -715.142857 1742.857143 1002.857143 -637.142857 254.857143
## [46] 827.857143 592.857143 382.857143 342.857143 302.857143
## [51] 192.857143 -546.142857 -667.142857 -92.142857 892.857143
## [56] -595.142857
max(gaps)
## [1] 1742.857


 What are the biggest and smallest deviations?

How do we summarise all the deviations into 1 number (“spread”)?

1st attempt: The mean gap
We could calculate the average of the deviations.


 Mean gap

mean deviation = sample mean(data - sample mean(data))

round(mean(gaps))
## [1] 0


 What’s the problem?

Note: It will always be 0.

· From the definition, the mean deviation must be 0, as the mean is the balancing
point of the deviations.

· The mean deviation is

\[ \frac{\sum_{i=1}^{n} D_i}{n} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n} = \frac{\sum_{i=1}^{n} x_i}{n} - \frac{n\bar{x}}{n} = 0. \]

Better option: Standard deviation
First define the root mean square (RMS).

Root mean square
· The RMS measures the typical size of a set of numbers, regardless of their signs.

· The steps are: Square the numbers, then Mean the result, then Root the result.

\[ \text{RMS(numbers)} = \sqrt{\text{sample mean}(\text{numbers}^2)} \]

· So effectively, the Square and Root operations “reverse” each other.

· The RMS retains the same unit as the numbers, and hence the same unit as the sample mean.
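The Square, Mean, Root steps translate into a one-line R function. The name rms is a hypothetical helper chosen here for illustration; base R does not provide a function of that name.

```r
# Hypothetical helper: Square the numbers, take the Mean, then take the Root
rms <- function(numbers) sqrt(mean(numbers^2))

rms(c(-3, 4))   # sqrt((9 + 16) / 2)
rms(c(3, -4))   # same value: the signs do not matter
```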

· Applying RMS to the deviations, we get

\[ \text{RMS of deviations} = \sqrt{\text{sample mean}(\text{deviations}^2)} = \sqrt{\frac{\sum_{i=1}^{n} D_i^2}{n}} \]

· To avoid the cancellation of the deviations, another possible method is to consider
the average of the absolute values of the deviations:

\[ \text{mean absolute deviation (MAD)} = \frac{\sum_{i=1}^{n} |D_i|}{n}. \]

However, MAD is much harder to analyse.
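Both summaries of the deviations can be computed side by side. The sketch below assumes a made-up data vector; the variable names are illustrative. Note that R's built-in mad function computes a different quantity (a scaled median absolute deviation), so the mean absolute deviation is written out by hand here.

```r
# Hypothetical data vector, for illustration only
x <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)
d <- x - mean(x)                 # deviations from the sample mean

rms_dev <- sqrt(mean(d^2))       # RMS of the deviations
mean_abs_dev <- mean(abs(d))     # mean absolute deviation

c(rms = rms_dev, mean_abs = mean_abs_dev)
```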

Standard deviation in terms of RMS

Population Standard deviation
· The standard deviation measures the spread of the data.

\[ SD_{pop} = \text{RMS of (deviations from the mean)} \]

· Formally,

\[ SD_{pop} = \sqrt{\text{Mean of (deviations from the mean)}^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]

sqrt(mean(gaps^2))
## [1] 593.7166

Standard deviation in R?
It is easy to calculate in R.

sd(data$Sold)
## [1] 599.0897


 But why is this slightly different?

Adjusting the standard deviation
· There are two different formulas for the standard deviation, depending on whether
the data is the population or a sample.
· The sd command in R always gives the sample version, as we most commonly
have samples.
· Formally,

\[ SD_{pop} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} D_i^2} \quad \text{and} \quad SD_{sample} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} D_i^2}, \]

where \( D_i = x_i - \bar{x} \) is the deviation.
sd(data$Sold) * sqrt(55/56) # adjust by sqrt((n-1)/n), it calculates the population SD.
## [1] 593.7166
gaps = data$Sold - mean(data$Sold) # calculate the gaps
sqrt(mean(gaps^2)) # calculates the population SD.
## [1] 593.7166
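The adjustment can be wrapped in a reusable helper so that n does not have to be typed in by hand. This is a sketch only: pop_sd is a hypothetical name (base R has no built-in population SD), and the data vector is made up for illustration.

```r
# Hypothetical helper: population SD obtained by rescaling R's sample SD
pop_sd <- function(x) {
  n <- length(x)
  sd(x) * sqrt((n - 1) / n)
}

x <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)   # made-up data
pop_sd(x)
sqrt(mean((x - mean(x))^2))   # RMS of the deviations gives the same value
```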

Why does the sample SD divide by \( n - 1 \) rather than \( n \), so that \( SD_{pop} = SD_{sample} \times \sqrt{(n-1)/n} \)?

· Dividing by \( n - 1 \) makes the sample variance \( SD_{sample}^2 \) an unbiased estimator of the
population variance (beyond the scope of this unit, will be covered in Year 2).

· Computing the sample mean uses all of the n data points. The sum (or the mean) of the
n deviations is zero:

\[ \sum_{i=1}^{n} D_i = \sum_{i=1}^{n} (x_i - \bar{x}) = 0. \]

This means, given the first n − 1 deviations, we know the n-th deviation, because

\[ \left( \sum_{i=1}^{n-1} D_i \right) + D_n = 0 \implies D_n = -\sum_{i=1}^{n-1} D_i. \]

Hence, there are only n − 1 effective pieces of information in the deviations.
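A small numerical illustration of this point, assuming a made-up vector of three values: once the first n − 1 deviations are known, the last one is determined.

```r
# Hypothetical data vector, for illustration only
x <- c(2, 4, 9)
d <- x - mean(x)                  # deviations from the sample mean
d

d_last <- -sum(d[-length(d)])     # D_n recovered from the first n - 1 deviations
all.equal(d_last, d[length(d)])
```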

Summary: population and sample
| Summary | Formula | In R |
|---|---|---|
| Population or sample mean | Sample mean (average) | mean(data) |
| Population standard deviation \( SD_{pop} \) | RMS of gaps from the sample mean | sd(data)*sqrt((n-1)/n) |
| Sample standard deviation \( SD_{sample} \) | Adjusted RMS of gaps from the sample mean | sd(data) |

· For the same data, the population standard deviation is never larger than the sample
standard deviation, \( SD_{pop} \le SD_{sample} \). Why? The sample version divides by n − 1 rather
than n, which allows for the extra variability due to sampling.
· Note that for large sample sizes, the difference becomes negligible.

How to tell the difference?
· It can be tricky to work out whether your data is a population or sample!
· Look at the information about the data story and the research questions.
- If we are just interested in the Newtown property prices during April-June 2017,
then the data is the whole population.
- If we are studying the property prices during April-June 2017 as a window into
more general property prices (for the rest of the year or for the Inner West area),
then the data could be considered a sample.
· Population SD and sample SD get closer with increasing sample size n.

Variance
The squared standard deviation is called the variance. As with the sample SD and the
population SD, there are two versions of the variance:

\[ Var_{sample} = SD_{sample}^2 \quad \text{and} \quad Var_{pop} = SD_{pop}^2. \]

· For summarising spread, we often prefer the SD, as it has the same unit as the data
points and the mean.
· In some situations, e.g., dealing with random variables (Part III) and understanding
the properties of the sample mean, using the variance can be much simpler.
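In R, the built-in var function gives the sample version, matching sd. The sketch below checks both relationships on a made-up vector; var_sample and var_pop are illustrative names.

```r
# Hypothetical data vector, for illustration only
x <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)
n <- length(x)

var_sample <- var(x)                  # sample variance (divides by n - 1)
var_pop <- mean((x - mean(x))^2)      # population variance (divides by n)

all.equal(var_sample, sd(x)^2)                  # variance is the squared SD
all.equal(var_pop, var_sample * (n - 1) / n)    # same adjustment, squared
```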

Standard units

Standard units (“Z score”)
Standard units of a data point = how many standard deviations it is below or above
the mean

\[ \text{standard units} = \frac{\text{data point} - \text{mean}}{\text{SD}} \]

This means that

\[ \text{data point} = \text{mean} + \text{SD} \times \text{standard units} \]

It gives the relative location of a data point in the data set. It also has other benefits
in data modelling (see later lectures).
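Both formulas can be sketched in R. The helper name standard_units and the data vector are assumptions made for illustration; the helper uses the sample SD, which is also what the base R function scale uses.

```r
# Hypothetical helper: convert every data point to standard units
standard_units <- function(x) (x - mean(x)) / sd(x)

x <- c(2, 4, 9)          # made-up data
z <- standard_units(x)
z                        # negative: below the mean, positive: above the mean

mean(x) + sd(x) * z      # converting back recovers the data points

all.equal(z, as.numeric(scale(x)))   # agrees with the built-in scale()
```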

Comparing 2 data points
To compare 2 data points, we can compare the standard units.

| Property | Sold | Standard units | Conclusion |
|---|---|---|---|
| 19 Watkin Street | $1975 (thousands) | (1975 − 1407)/599 = 0.95 | Almost 1 SD higher than the average house price |
| 30 Pearl St | $1250 (thousands) | (1250 − 1407)/599 = −0.26 | 0.26 SDs cheaper than the average house price |

So 19 Watkin is a more unusual purchase than 30 Pearl St, relative to the mean.

Interquartile range
Interquartile range (IQR)
The IQR is another measure of spread, based on the ordered data.

Interquartile Range (IQR)

IQR = range of the middle 50% of the data

More formally, \( IQR = Q_3 - Q_1 \), where
· \( Q_1 \) is the 25th percentile (1st quartile) and \( Q_3 \) is the 75th percentile (3rd
quartile).
· The median is the 50th percentile, or 2nd quartile: \( \tilde{x} = Q_2 \).
· p-th percentile: the value below which p% of the ordered data lie.

Quantile, quartile, percentile
The set of q-quantiles divides the ordered data into q equal size sets (in terms of
percentage of data).

The percentiles are the 100-quantiles.

The set of quartiles (the 4-quantiles) divides the data into four quarters.

summary(data$Sold)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 370.0 860.8 1387.5 1407.1 1782.5 3150.0
summary(data$Sold)[5] - summary(data$Sold)[2] # one way to calculate IQR
## 3rd Qu.
## 921.75
IQR(data$Sold) # use the built-in function
## [1] 921.75
So the range of the middle 50% of properties sold is almost a million dollars!
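The quartiles can also be requested directly with quantile and its probs argument. The sketch below uses a made-up vector rather than the Newtown data. R offers several interpolation rules for quantiles (the type argument), so values computed by hand or in other software may differ slightly.

```r
# Hypothetical data vector, for illustration only
x <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)

q <- quantile(x, probs = c(0.25, 0.5, 0.75))   # Q1, median, Q3
q

iqr_by_hand <- unname(q["75%"] - q["25%"])     # IQR = Q3 - Q1
all.equal(iqr_by_hand, IQR(x))
```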

Reporting
· Like the median, the IQR is robust, so it’s suitable as a summary of spread for
skewed data.
· We report in pairs: (mean,SD) or (median,IQR).

IQR on the boxplot and outliers
· The IQR is the length of the box in the boxplot. It represents the span of the middle
50% of the houses sold.

· The lower and upper thresholds (expected minimum and maximum) are a distance
of 1.5 × IQR from the 1st and 3rd quartiles (by Tukey’s convention).

\[ LT = Q_1 - 1.5 \times IQR \]

and

\[ UT = Q_3 + 1.5 \times IQR \]

· Data outside these thresholds is considered an outlier (“extreme reading”).
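The thresholds above can be turned into a small outlier check. This is a sketch under stated assumptions: tukey_outliers is a hypothetical helper name, the data vector is made up, and the quartiles come from quantile. R's boxplot computes its box from hinges, which can differ slightly from quantile, so the two may occasionally disagree on borderline points.

```r
# Hypothetical helper: values outside Tukey's lower and upper thresholds
tukey_outliers <- function(x) {
  q1 <- unname(quantile(x, 0.25))
  q3 <- unname(quantile(x, 0.75))
  iqr <- q3 - q1
  lt <- q1 - 1.5 * iqr     # lower threshold
  ut <- q3 + 1.5 * iqr     # upper threshold
  x[x < lt | x > ut]
}

x <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)   # made-up right skewed data
tukey_outliers(x)
```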

boxplot(data$Sold, horizontal = T)
iqr = quantile(data$Sold)[4] - quantile(data$Sold)[2]
abline(v = median(data$Sold), col = "green")
abline(v = quantile(data$Sold)[2], col = "red")
abline(v = quantile(data$Sold)[4], col = "red")
abline(v = quantile(data$Sold)[2] - 1.5 * iqr, col = "purple")
abline(v = quantile(data$Sold)[4] + 1.5 * iqr, col = "purple")

Note the lower threshold is not shown…why?

boxplot(data$Sold, horizontal = T)
abline(v = median(data$Sold), col = "green")
abline(v = quantile(data$Sold)[2], col = "red")
abline(v = quantile(data$Sold)[4], col = "red")
abline(v = max(min(data$Sold), quantile(data$Sold)[2] - 1.5 * iqr), col = "purple")
abline(v = min(max(data$Sold), quantile(data$Sold)[4] + 1.5 * iqr), col = "purple")

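To keep LT and UT within the range of the data, the plot above uses the convention

\[ LT = \max(\min(x),\; Q_1 - 1.5 \times IQR) \]

and

\[ UT = \min(\max(x),\; Q_3 + 1.5 \times IQR) \]

(R's boxplot itself draws each whisker at the most extreme observation that lies within the
thresholds, which agrees with this convention when there are no outliers on that side.)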

Dealing with outliers (not for examination)
Sometimes outliers indicate that a better model is needed. We may be able to remove
outliers by transforming the data. For example, a right skewed data set with outliers can
be transformed onto the logarithmic scale.

w = c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)


w1 = log(w, 10)
par(mfrow = c(1, 2))
boxplot(w, main = "Data", horizontal = T)
boxplot(w1, main = "Log of Data", horizontal = T)

Write functions in R
How to write a function in R
Functions are among the most used objects in R. For example, mean , median , sd are
all R functions. It is very important to understand the purpose and syntax of R functions
and to know how to create and use them.

To declare a user-defined function in R, we use the keyword function.

function_name <- function(parameter1, parameter2) {
    # function body
    c = parameter1 + parameter2
    # return the outputs
    return(c)
}

Here we declared a function named function_name . It takes the inputs
parameter1 and parameter2 and returns an output c . A function can take any number of
inputs but returns only one output.
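When several results are needed, one common approach is to bundle them into a single object such as a named list. The sketch below is illustrative only: centre_summary is a hypothetical function name and the input vector is made up.

```r
# Hypothetical function: return two summaries as one named list
centre_summary <- function(x) {
  list(mean = mean(x), median = median(x))
}

out <- centre_summary(c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300))
out$mean      # access each component by name
out$median
```

A list is used here because its components keep their names and may have different types; a plain vector works as well when all results are numbers.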

Example
Here we want to write a function in R that calculates the sample mean and sample
standard deviation

my_summary <- function(X) {
    # Write operations within the curly brackets
    m = sum(X)/length(X)
    s = sqrt(sum((X - m)^2)/(length(X) - 1))
    # put mean and sd in a vector, then return the vector as a single output
    return(c(m, s))
}

Then we can reuse all the operations defined in the function.

w = c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300) # a data vector


my_summary(w) # our function
## [1] 71.0000 100.5651
c(mean(w), sd(w)) # R built-in function
## [1] 71.0000 100.5651

Summary
Centre
· Sample mean
· Sample median
· Robustness and comparisons

Spread
· Standard deviation
- Population standard deviation and sample standard deviation
· Interquartile range

Write functions in R

Some R Functions
length , mean , median , sd , IQR , summary , boxplot
