Topic2-Numerical-Summary

The document provides an overview of numerical summaries in data analysis, focusing on property sales in Newtown, NSW. It discusses key concepts such as sample mean, sample median, and measures of spread like standard deviation and interquartile range, along with their implications for understanding property prices. The document emphasizes the importance of both mean and median in statistical reporting, particularly in skewed data scenarios.

Uploaded by

ishrat

Exploring Data

Numerical Summaries

© University of Sydney MATH1062

06 October 2024

 Module1 Exploring Data

Data & Graphical Summaries

What type of data do we have & how can we visualise it?

Numerical Summaries
What are the main features of the data?

Outline

Centre
· Sample mean
· Sample median
· Robustness and comparisons

Spread
· Standard deviation
· Interquartile range

Write functions in R

Summary

Data story
How much does a property in Newtown cost?
Data on Newtown property sales
· Data is taken from domain.com.au:
- All properties sold in Newtown (NSW 2042) between April-June 2017.
- The variable Sold has price in $1000s.

data <- read.csv("data/NewtownJune2017.csv", header = T)

head(data, n = 2)
## Property Type Agent Bedrooms Bathrooms Carspots Sold
## 1 19 Watkin Street Newtown House RayWhite 4 1 1 1975
## 2 30 Pearl Street Newtown House RayWhite 2 1 0 1250
## Date
## 1 23/6/17
## 2 23/6/17

dim(data)
## [1] 56 8
str(data)
## 'data.frame': 56 obs. of 8 variables:
## $ Property : chr "19 Watkin Street Newtown" "30 Pearl Street Newtown" "26 John Street Newtowmn" "23/617
## $ Type : chr "House" "House" "House" "Apartment" ...
## $ Agent : chr "RayWhite" "RayWhite" "Belle" "RayWhite" ...
## $ Bedrooms : int 4 2 2 1 1 5 1 1 1 3 ...
## $ Bathrooms: int 1 1 1 1 1 1 1 1 1 2 ...
## $ Carspots : int 1 0 0 1 1 1 0 1 1 0 ...
## $ Sold : int 1975 1250 1280 780 650 2100 675 740 625 1950 ...
## $ Date : chr "23/6/17" "23/6/17" "17/6/17" "17/6/17" ...

Numerical summaries
Recap: graphical summaries
For the Newtown property data we could produce a histogram or boxplot.

par(mfrow = c(1, 2))

hist(data$Sold, freq = F)
boxplot(data$Sold, horizontal = T)

 What do they reveal about Newtown house prices?

Advantages of numerical summaries
· A numerical summary reduces all the data to one simple number (“statistic”).
- This loses a lot of information.
- However it allows easy communication and comparisons.
· Major features that we can summarise numerically are:
- Maximum
- Minimum
- Centre [sample mean, median]
- Spread [standard deviation, range, interquartile range]
 Which summaries might be useful for talking about Newtown house prices?

· It depends!
· Reporting the centre without the spread can be misleading!

Useful notation for data
· Observations of a single variable of size n can be represented by

\[ x_1, x_2, \ldots, x_n \]

· The ranked observations (ordered from smallest to largest) are

\[ x_{(1)}, x_{(2)}, \ldots, x_{(n)} \]

such that \( x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} \).

· The sum of the observations is

\[ \sum_{i=1}^{n} x_i \]

Sample mean
Sample mean


 Sample mean
The sample mean is the average of the data.

\[ \text{sample mean} = \frac{\text{sum of data}}{\text{size of data}} \]

or

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Note that the sample mean involves all of the data.

· The sample mean of all the properties sold in Newtown is:
mean(data$Sold)
## [1] 1407.143

· Focusing specifically on houses with 4 bedrooms (large), the sample mean is:
mean(data$Sold[data$Type == "House" & data$Bedrooms == "4"])
## [1] 2198.857

Deviation from the mean
Given a data point \( x_i \), its deviation from the sample mean \( \bar{x} \) is

\[ D_i = x_i - \bar{x} \]

For example,

· 19 Watkin St sold for $1975 (thousands).
- This gives a gap of ($1975 - $1407.143) = $567.857 (thousands)
- $567.857 (thousands) above the sample mean
· 30 Pearl St sold for $1250 (thousands).
- This gives a gap of ($1250 - $1407.143) = -$157.143 (thousands)
- $157.143 (thousands) below the sample mean

Sample mean as a balancing point
The sample mean is the point at which the data is balanced, in the sense that the sum of the
absolute deviations for values to the left of the mean equals the sum of the
absolute deviations for values to the right of the mean.

\[ \sum_{x_i < \bar{x}} |x_i - \bar{x}| = \sum_{x_i > \bar{x}} |x_i - \bar{x}| \]
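The balancing property can be checked numerically. The following is a minimal sketch on a small illustrative vector w (the same values used for w later in these slides); the names left and right are chosen here for illustration only.

```r
# Small illustrative data vector
w <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)
m <- mean(w)                       # sample mean

# Total absolute deviation of the points below and above the mean
left  <- sum(abs(w[w < m] - m))
right <- sum(abs(w[w > m] - m))

c(mean = m, left = left, right = right)
```

For this vector the mean is 71 and both totals come out at 387, even though seven points lie below the mean and only three above it.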

Sample mean on the histogram
However, the sample mean may not be the balancing point of a histogram: the area to the left of
the mean may not equal the area to the right of the mean.

hist(data$Sold, main = "Newtown properties", xlab = "Price (in 1000s)")

abline(v = mean(data$Sold), col = "green")

When the data is skewed, this effect is more pronounced.

hist(data$Sold[data$Type == "House" & data$Bedrooms == "4"], main = "Newtown 4 Bedrooms",
    xlab = "Price (in 1000s)")
abline(v = mean(data$Sold[data$Type == "House" & data$Bedrooms == "4"]), col = "green")

Sample median
Sample median


 Sample Median
The sample median \( \tilde{x} \) is the middle data point, when the observations are ordered
from smallest to largest.

· For an odd number of observations:

\[ \text{sample median} = \text{the unique middle point} = x_{\left(\frac{n+1}{2}\right)} \]

· For an even number of observations:

\[ \text{sample median} = \text{average of the 2 middle points} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} \]

Ordering observations
The ranked observations are:

sort(data$Sold)
## [1] 370 625 645 650 675 692 720 740 740 755 770 780 812 860 861
## [16] 920 935 955 955 999 1100 1240 1250 1280 1309 1315 1370 1375 1400 1460
## [31] 1553 1575 1590 1600 1600 1600 1605 1662 1701 1710 1750 1780 1790 1806 1850
## [46] 1940 1950 1975 2000 2100 2200 2235 2300 2410 2810 3150
length(data$Sold)
## [1] 56

As we have n = 56 observations (even), the sample median is found between the n/2 = 28th
and (n/2 + 1) = 29th prices, or (1375 + 1400)/2 = 1387.5.
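The odd/even rule can also be written out step by step. This is a minimal sketch, assuming the 56 ranked prices printed above are typed in as a vector called sold (a name chosen here for illustration); in practice median() does the same job.

```r
# The 56 sale prices (in $1000s), copied from the sort(data$Sold) output
sold <- c(370, 625, 645, 650, 675, 692, 720, 740, 740, 755, 770, 780, 812, 860, 861,
          920, 935, 955, 955, 999, 1100, 1240, 1250, 1280, 1309, 1315, 1370, 1375, 1400, 1460,
          1553, 1575, 1590, 1600, 1600, 1600, 1605, 1662, 1701, 1710, 1750, 1780, 1790, 1806, 1850,
          1940, 1950, 1975, 2000, 2100, 2200, 2235, 2300, 2410, 2810, 3150)

x <- sort(sold)        # ranked observations x(1), ..., x(n)
n <- length(x)

if (n %% 2 == 1) {
  med <- x[(n + 1) / 2]                 # odd n: the unique middle point
} else {
  med <- (x[n / 2] + x[n / 2 + 1]) / 2  # even n: average of the 2 middle points
}

med                    # agrees with median(sold)
```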

· The sample median of all the properties sold in Newtown is:
median(data$Sold)
## [1] 1387.5

· Focusing specifically on houses with 4 bedrooms (large), the sample median is:
median(data$Sold[data$Type == "House" & data$Bedrooms == "4"])
## [1] 1975

Sample median on the histogram
· The sample median is the halfway point on the histogram, i.e., half of the properties
sold for less than $1.3875 million and half sold for more.
hist(data$Sold)
abline(v = mean(data$Sold), col = "green")
abline(v = median(data$Sold), col = "purple")

hist(data$Sold[data$Type == "House" & data$Bedrooms == "4"], main = "Newtown 4 Bedrooms",
xlab = "Price (in 1000s)")
abline(v = mean(data$Sold[data$Type == "House" & data$Bedrooms == "4"]), col = "green")
abline(v = median(data$Sold[data$Type == "House" & data$Bedrooms == "4"]), col = "purple")

 What does this suggest?


 Statistical Thinking
If you had to choose between reporting the sample mean or sample median for
Newtown properties, which would you choose and why?

· For the full property portfolio, the sample mean and the sample median are fairly
similar.
· For the 4 bedroom houses, the sample mean is higher than the sample median
because it is being “pulled up” by some very expensive houses.
· For the average buyer, the sample median would be more useful as an
indication of the sort of price needed to get into the market.
· For any agent selling houses in the area, the sample mean might be more useful
in order to predict their average commissions!
· In practice, we can report both!

Sample mean and median on the boxplot
· The sample median is the centre line on the boxplot.
boxplot(data$Sold, main = "Newtown properties")
abline(h = mean(data$Sold), col = "green")
abline(h = median(data$Sold), col = "purple")

boxplot(data$Sold[data$Type == "House" & data$Bedrooms == "4"], main = "Newtown 4B Properties")
abline(h = mean(data$Sold[data$Type == "House" & data$Bedrooms == "4"]), col = "green")
abline(h = median(data$Sold[data$Type == "House" & data$Bedrooms == "4"]), col = "purple")

Robustness and comparisons
Robustness


 Robustness
The sample median is said to be robust and is a good summary for skewed data, as
it is barely affected by outliers (extreme data values).

Example
Recently a heritage building in Newtown was sold for $13 million.


 How would the sample mean and sample median change if it was added to the
data?

· The sample mean would be a lot higher.

· The sample median would be a bit higher: it moves from the average of the 28th and
29th points to the 29th point.

data2 = c(data$Sold, 13000)
sort(data2)
## [1] 370 625 645 650 675 692 720 740 740 755 770 780
## [13] 812 860 861 920 935 955 955 999 1100 1240 1250 1280
## [25] 1309 1315 1370 1375 1400 1460 1553 1575 1590 1600 1600 1600
## [37] 1605 1662 1701 1710 1750 1780 1790 1806 1850 1940 1950 1975
## [49] 2000 2100 2200 2235 2300 2410 2810 3150 13000
mean(data2)
## [1] 1610.526
median(data2)
## [1] 1400

Summary of changes

Change in data             Sample mean    Sample median
Original data              1407.143       1387.5
Extra property of 13000    1610.526       1400

Comparing the sample mean and the median
The difference between the sample mean and the sample median can be an indication
of the shape of the data.

· For symmetric data, the sample mean and sample median are the same: \( \bar{x} = \tilde{x} \).
· For left skewed data (the most frequent data are concentrated on the right, with a left
tail), the sample mean is smaller than the sample median: \( \bar{x} < \tilde{x} \).
· For right skewed data (the most frequent data are concentrated on the left, with a
right tail), the sample mean is larger than the sample median: \( \bar{x} > \tilde{x} \).
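As a quick illustration of the right skewed case, the sketch below compares the two summaries on a small vector w with a long right tail (the same values used for w later in these slides). This is a rule of thumb rather than a guarantee, so it is worth checking against a histogram or boxplot as well.

```r
# Small right skewed data vector: most values are small, a few are very large
w <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)

centre <- c(mean = mean(w), median = median(w))
centre

# The mean is pulled towards the long right tail, so it exceeds the median
centre["mean"] > centre["median"]
```

Here the mean is 71 while the median is only 20.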

Which is optimal for describing centre?
· Both have strengths and weaknesses depending on the nature of the data.
· Sometimes neither gives a sensible indication of location, for example if the data is
bimodal.
· As the sample median is robust, it is preferable for data which is skewed or has
many outliers, like Sydney house prices.
· The sample mean is helpful for data which is basically symmetric, with not too
many outliers, and for theoretical analysis.

Limitations of both?
· Both the sample mean and sample median allow easy comparisons.
· However, they need to be paired with a measure of spread.
· In the following example, the sample means are the same, but the data are very
different. Or, consider two data sets {−1, 0, 1} and {−100, 0, 100}.

## [1] 10.22644 9.95406
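The two small data sets mentioned above make the point concrete. A minimal sketch (the names a and b are chosen here for illustration):

```r
a <- c(-1, 0, 1)
b <- c(-100, 0, 100)

# Same centre ...
c(mean(a), mean(b))

# ... but very different spread
c(sd(a), sd(b))
```

Both means are 0, while the sample standard deviations are 1 and 100, so reporting the centre alone would hide the difference entirely.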

Standard deviation
How to measure spread?
For each property sold, we could calculate the deviation (or the gap), \( D_i = x_i - \bar{x} \),
between its price and the sample mean of $1407 (thousands).

Property            Sold                 Gap                   Conclusion
19 Watkin Street    $1975 (thousands)    1975 - 1407 = 568     More than half a million dollars more expensive than the average house price
30 Pearl St         $1250 (thousands)    1250 - 1407 = -157    Cheaper than the average house price

gaps = data$Sold - mean(data$Sold)
gaps
## [1] 567.857143 -157.142857 -127.142857 -627.142857 -757.142857
## [6] 692.857143 -732.142857 -667.142857 -782.142857 542.857143
## [11] -32.142857 167.857143 -408.142857 -452.142857 -547.142857
## [16] 197.857143 182.857143 -167.142857 -1037.142857 532.857143
## [21] -687.142857 -452.142857 -487.142857 442.857143 192.857143
## [26] -652.142857 -7.142857 145.857143 1402.857143 192.857143
## [31] 792.857143 372.857143 398.857143 293.857143 -98.142857
## [36] -307.142857 -472.142857 -762.142857 52.857143 -37.142857
## [41] -715.142857 1742.857143 1002.857143 -637.142857 254.857143
## [46] 827.857143 592.857143 382.857143 342.857143 302.857143
## [51] 192.857143 -546.142857 -667.142857 -92.142857 892.857143
## [56] -595.142857
max(gaps)
## [1] 1742.857


 What are the biggest and smallest deviations?

How do we summarise all the deviations into 1 number (“spread”)?

1st attempt: The mean gap
We could calculate the average of the deviations.


 Mean gap

mean deviation = sample mean(data - sample mean(data))

round(mean(gaps))
## [1] 0


 What’s the problem?

Note: It will always be 0.

· From the definition, the mean deviation must be 0, as the mean is the balancing
point of the deviations.

· The mean deviation is

\[ \frac{\sum_{i=1}^{n} D_i}{n} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n} = \frac{\sum_{i=1}^{n} x_i}{n} - \frac{n\bar{x}}{n} = \bar{x} - \bar{x} = 0. \]
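The same cancellation can be seen on any data vector. The sketch below uses a small illustrative vector w (the values used for w later in these slides). With real data the result may differ from 0 by a tiny rounding error, which is why a comparison with a tolerance is used rather than an exact test.

```r
w <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)

gaps <- w - mean(w)     # deviations D_i from the sample mean
mean_gap <- mean(gaps)  # positive and negative gaps cancel

mean_gap
isTRUE(all.equal(mean_gap, 0))  # TRUE up to floating point rounding
```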

Better option: Standard deviation
First define the root mean square (RMS).


 Root mean square
· The RMS measures the average size of a set of numbers, regardless of their signs.

· The steps are: Square the numbers, then Mean the result, then Root the result.

\[ \text{RMS(numbers)} = \sqrt{\text{sample mean}(\text{numbers}^2)} \]

· So effectively, the Square and Root operations “reverse” each other.

· The RMS retains the same units as the data (and the sample mean).
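The three steps translate directly into R. This is a minimal sketch; rms is a helper name introduced here for illustration, not a built-in R function.

```r
# Root mean square: Square, then Mean, then Root
rms <- function(x) {
  sqrt(mean(x^2))
}

rms(c(-5, 5))      # 5, whereas mean(c(-5, 5)) is 0
rms(c(3, 4))       # same value as rms(c(-3, -4)): the signs do not matter
```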

· Applying RMS to the deviations, we get

\[ \text{RMS of deviations} = \sqrt{\text{sample mean}(\text{deviations}^2)} = \sqrt{\frac{\sum_{i=1}^{n} D_i^2}{n}} \]

· To avoid the cancellation of the deviations, another possible method is to consider
the average of the absolute values of the deviations:

\[ \text{mean absolute deviation (MAD)} = \frac{\sum_{i=1}^{n} |D_i|}{n}. \]

However, MAD is much harder to analyse.
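Both summaries are easy to compute side by side. The sketch below uses a small illustrative vector w (the values used for w later in these slides). One caution: R's built-in mad() computes the median absolute deviation (scaled by a constant), which is a different quantity from the mean absolute deviation defined here, so it is written out by hand.

```r
w <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)
gaps <- w - mean(w)                 # deviations from the sample mean

rms_dev      <- sqrt(mean(gaps^2))  # RMS of the deviations
mean_abs_dev <- mean(abs(gaps))     # mean absolute deviation, as defined above

round(c(rms = rms_dev, mean_abs = mean_abs_dev), 2)
```

For this vector the mean absolute deviation is 77.4 and the RMS of the deviations is about 95.4. Squaring gives extra weight to large deviations, so the RMS is never smaller than the mean absolute deviation.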

Standard deviation in terms of RMS


 Population Standard deviation
· The standard deviation measures the spread of the data.

\[ SD_{\text{pop}} = \text{RMS of (deviations from the mean)} \]

· Formally,

\[ SD_{\text{pop}} = \sqrt{\text{Mean of (deviations from the mean)}^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]

sqrt(mean(gaps^2))
## [1] 593.7166

Standard deviation in R?
It is easy to calculate in R.

sd(data$Sold)
## [1] 599.0897


 But why is this slightly different?

Adjusting the standard deviation
· There are two different formulas for the standard deviation, depending on whether
the data is the population or a sample.
· The sd command in R always gives the sample version, as we most commonly
have samples.
· Formally, \( SD_{\text{pop}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} D_i^2} \) and \( SD_{\text{sample}} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} D_i^2} \), where
\( D_i = x_i - \bar{x} \) is the deviation.
sd(data$Sold) * sqrt(55/56) # adjust by sqrt((n-1)/n), it calculates the population SD.
## [1] 593.7166
gaps = data$Sold - mean(data$Sold) # calculate the gaps
sqrt(mean(gaps^2)) # calculates the population SD.
## [1] 593.7166

Why does the sample SD divide by n − 1 instead of n?

· Dividing by n − 1 makes the sample variance (the square of the sample SD) an unbiased
estimator of the population variance (beyond the scope of this unit, will be covered in Year 2).

· Estimating the sample mean uses all of the n data points. The sum (or the mean) of
the n deviations is zero:

\[ \sum_{i=1}^{n} D_i = \sum_{i=1}^{n} (x_i - \bar{x}) = 0. \]

This means that, given the first n − 1 deviations, we know the n-th deviation, because

\[ \left( \sum_{i=1}^{n-1} D_i \right) + D_n = 0 \implies D_n = -\sum_{i=1}^{n-1} D_i. \]

Hence, there are only n − 1 effective pieces of information in the deviations.
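This "lost" piece of information can be seen directly. A minimal sketch on a small illustrative vector w (the values used for w later in these slides): the last deviation is recovered from the other n − 1 without looking at it.

```r
w <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)

d <- w - mean(w)        # all n deviations
n <- length(d)

# The n-th deviation is forced by the first n - 1, since they all sum to zero
last_from_rest <- -sum(d[1:(n - 1)])

c(actual = d[n], recovered = last_from_rest)
```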

Summary: population and sample
Summary                                 Formula                                      In R
Population or sample mean               Sample mean (average)                        mean(data)
Population standard deviation SDpop     RMS of gaps from the sample mean             sd(data)*sqrt((n-1)/n)
Sample standard deviation SDsample      Adjusted RMS of gaps from the sample mean    sd(data)

· Computed on the same data, the population standard deviation is never larger than the
sample standard deviation: \( SD_{\text{pop}} \le SD_{\text{sample}} \). Why? Dividing by n − 1 rather
than n allows for the extra variability due to sampling.
· Note for large sample sizes, the difference becomes negligible.
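How quickly the two versions agree can be read off the adjustment factor sqrt((n-1)/n), which is the ratio of the population SD to the sample SD. The sketch below simply evaluates it for a few sample sizes chosen here for illustration (56 is the size of the Newtown data).

```r
n <- c(5, 10, 56, 1000)

# Ratio SDpop / SDsample for each sample size
ratio <- sqrt((n - 1) / n)

round(setNames(ratio, n), 4)   # always below 1, approaching 1 as n grows
```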

How to tell the difference?
· It can be tricky to work out whether your data is a population or sample!
· Look at the information about the data story and the research questions.
- If we are just interested in the Newtown property prices during April-June 2017,
then the data is the whole population.
- If we are studying the property prices during April-June 2017 as a window into
more general property prices (for the rest of the year or for the Inner West area),
then the data could be considered a sample.
· Population SD and sample SD get closer with increasing sample size n.

Variance
The squared standard deviation is called the variance. As with the sample SD and the
population SD, there are two versions of the variance:

\[ Var_{\text{sample}} = SD_{\text{sample}}^2 \quad \text{and} \quad Var_{\text{pop}} = SD_{\text{pop}}^2. \]

· For summarising spread, we often prefer the SD, as it has the same units as the data
points and the mean.
· In some situations, e.g., dealing with random variables (Part III) and understanding
the properties of the sample mean, using the variance can be much simpler.

Standard units


 Standard units (“Z score”)
Standard units of a data point = how many standard deviations it is below or above
the mean

\[ \text{standard units} = \frac{\text{data point} - \text{mean}}{\text{SD}} \]

This means that

\[ \text{data point} = \text{mean} + \text{SD} \times \text{standard units} \]

Standard units give the relative location of a data point in the data set. They also have other benefits
in data modelling (see later lectures).
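Both directions of the conversion can be sketched in a few lines. The helper name standard_units is introduced here for illustration; the mean (1407.143) and sample SD (599.0897) are the values reported earlier for data$Sold, and the two prices are those listed for 19 Watkin Street and 30 Pearl Street in head(data).

```r
# Convert data points to standard units, given a centre and a spread
standard_units <- function(x, centre, spread) {
  (x - centre) / spread
}

m <- 1407.143            # sample mean of Sold (in $1000s), from mean(data$Sold)
s <- 599.0897            # sample SD of Sold, from sd(data$Sold)
prices <- c(1975, 1250)  # 19 Watkin Street and 30 Pearl Street

z <- standard_units(prices, m, s)
round(z, 2)

# Going back: data point = mean + SD * standard units
m + s * z
```

If the full data frame is available, (data$Sold - mean(data$Sold)) / sd(data$Sold) gives the standard units for every property at once.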

Comparing 2 data points
To compare 2 data points, we can compare the standard units.

Property            Sold                 Standard units               Conclusion
19 Watkin Street    $1975 (thousands)    (1975 − 1407)/599 = 0.95     Almost 1 SD higher than the average house price
30 Pearl St         $1250 (thousands)    (1250 − 1407)/599 = −0.26    0.26 SDs cheaper than the average house price

So 19 Watkin is a more unusual purchase than 30 Pearl St, relative to the mean.

Interquartile range
Interquartile range (IQR)
The IQR is another measure of spread, based on the ordered data.


 Interquartile Range (IQR)

IQR = range of the middle 50% of the data

More formally, \( IQR = Q_3 - Q_1 \), where
· \( Q_1 \) is the 25th percentile (1st quartile) and \( Q_3 \) is the 75th percentile (3rd
quartile).
· The median is the 50th percentile, or 2nd quartile: \( \tilde{x} = Q_2 \).
· p-th percentile: p% of the ordered data lie below the value of the p-th percentile.
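In R, percentiles come from quantile(). The sketch below is illustrative and assumes the 56 ranked prices printed earlier are typed in as a vector called sold (a name chosen here). Note that quantile() interpolates between neighbouring ordered values by default, so other software or hand rules may give slightly different quartiles.

```r
# The 56 sale prices (in $1000s), copied from the sort(data$Sold) output
sold <- c(370, 625, 645, 650, 675, 692, 720, 740, 740, 755, 770, 780, 812, 860, 861,
          920, 935, 955, 955, 999, 1100, 1240, 1250, 1280, 1309, 1315, 1370, 1375, 1400, 1460,
          1553, 1575, 1590, 1600, 1600, 1600, 1605, 1662, 1701, 1710, 1750, 1780, 1790, 1806, 1850,
          1940, 1950, 1975, 2000, 2100, 2200, 2235, 2300, 2410, 2810, 3150)

# Q1, median (Q2) and Q3 as the 25th, 50th and 75th percentiles
q <- quantile(sold, probs = c(0.25, 0.5, 0.75))
q

# IQR = Q3 - Q1; unname() drops the "75%" label from the result
iqr <- unname(q["75%"] - q["25%"])
iqr
```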

Quantile, quartile, percentile
The set of q-quantiles divides the ordered data into q sets of equal size (in terms of
percentage of data).

Percentiles are the 100-quantiles.

The set of quartiles divides the data into four quarters.

summary(data$Sold)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 370.0 860.8 1387.5 1407.1 1782.5 3150.0
summary(data$Sold)[5] - summary(data$Sold)[2] # one way to calculate IQR
## 3rd Qu.
## 921.75
IQR(data$Sold) # use the built-in function
## [1] 921.75
So the range of the middle 50% of properties sold is almost a million dollars!

Reporting
· Like the median, the IQR is robust, so it’s suitable as a summary of spread for
skewed data.
· We report in pairs: (mean,SD) or (median,IQR).

IQR on the boxplot and outliers
· The IQR is the length of the box in the boxplot. It represents the span of the middle
50% of the houses sold.

· The lower and upper thresholds (expected minimum and maximum) are a distance
of 1.5 × IQR from the 1st and 3rd quartiles (by Tukey’s convention):

\[ LT = Q_1 - 1.5 \times IQR \quad \text{and} \quad UT = Q_3 + 1.5 \times IQR \]

· A data point outside these thresholds is considered an outlier (“extreme reading”).
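The thresholds are straightforward to compute by hand. This is a minimal sketch on a small illustrative vector w (the values used for w later in these slides), with quartiles taken from quantile(); the names lt, ut and outliers are chosen here for illustration.

```r
w <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)

q1  <- unname(quantile(w, 0.25))   # 1st quartile
q3  <- unname(quantile(w, 0.75))   # 3rd quartile
iqr <- q3 - q1

lt <- q1 - 1.5 * iqr               # lower threshold
ut <- q3 + 1.5 * iqr               # upper threshold

# Points outside [lt, ut] are flagged as outliers
outliers <- w[w < lt | w > ut]

c(LT = lt, UT = ut)
outliers
```

For this vector the thresholds are −149.375 and 257.625, so only the value 300 is flagged.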

boxplot(data$Sold, horizontal = T)
iqr = quantile(data$Sold)[4] - quantile(data$Sold)[2]
abline(v = median(data$Sold), col = "green")
abline(v = quantile(data$Sold)[2], col = "red")
abline(v = quantile(data$Sold)[4], col = "red")
abline(v = quantile(data$Sold)[2] - 1.5 * iqr, col = "purple")
abline(v = quantile(data$Sold)[4] + 1.5 * iqr, col = "purple")

Note the lower threshold is not shown…why?

boxplot(data$Sold, horizontal = T)
abline(v = median(data$Sold), col = "green")
abline(v = quantile(data$Sold)[2], col = "red")
abline(v = quantile(data$Sold)[4], col = "red")
abline(v = max(min(data$Sold), quantile(data$Sold)[2] - 1.5 * iqr), col = "purple")
abline(v = min(max(data$Sold), quantile(data$Sold)[4] + 1.5 * iqr), col = "purple")

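To keep the LT and UT within the range of the data, R uses the convention

\[ LT = \max(\min(x), Q_1 - 1.5 \times IQR) \quad \text{and} \quad UT = \min(\max(x), Q_3 + 1.5 \times IQR) \]

(Strictly, boxplot draws each whisker at the most extreme data point that lies within the thresholds.)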

Dealing with outliers (not for examination)
Sometimes outliers indicate that a better model is needed. We may remove outliers by
transforming the data. For example, a right skewed data set with outliers can be
transformed to a logarithmic scale.

w = c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)

w1 = log(w, 10)
par(mfrow = c(1, 2))
boxplot(w, main = "Data", horizontal = T)
boxplot(w1, main = "Log of Data", horizontal = T)
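The effect of the transformation can also be checked without looking at the plots. The sketch below uses boxplot.stats(), the function that supplies the statistics behind boxplot(), to list the points flagged as outliers before and after taking logs; out_raw and out_log are names chosen here for illustration.

```r
w  <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)
w1 <- log(w, 10)                      # base 10 logarithm of the data

out_raw <- boxplot.stats(w)$out       # outliers on the original scale
out_log <- boxplot.stats(w1)$out      # outliers on the log scale

out_raw
length(out_log)
```

On the original scale the value 300 is flagged; on the log scale nothing is.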

Write functions in R
How to write a function in R
Functions are among the most used objects in R. For example, mean , median , sd are
all R functions. It is very important to understand the purpose and syntax of R functions
and to know how to create and use them.

To declare a user-defined function in R, we use the keyword function.

function_name <- function(parameter1, parameter2) {

# function body
c = parameter1 + parameter2
# return the outputs
return(c)
}

Here we declared a function named function_name . The function takes the inputs
parameter1 and parameter2 and returns an output c . A function can take any number of inputs
but returns only one output.
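When several results are needed, they can be bundled into a single object such as a vector or a named list, which still counts as one output. The following is a minimal sketch; spread_summary is a hypothetical name chosen for illustration.

```r
# Return several measures of spread as ONE output: a named list
spread_summary <- function(x) {
  list(
    sd    = sd(x),             # sample standard deviation
    iqr   = IQR(x),            # interquartile range
    range = max(x) - min(x)    # full range of the data
  )
}

w <- c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300)
res <- spread_summary(w)

res$iqr      # individual results are picked out by name
res$range
```

A named list keeps each result labelled, which is easier to read than remembering positions in an unnamed vector.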

Example
Here we want to write a function in R that calculates the sample mean and the sample
standard deviation.

my_summary <- function(X) {

# Write operations within the curly brackets
m = sum(X)/length(X)
s = sqrt(sum((X - m)^2)/(length(X) - 1))
# put mean and sd in a vector, then return the vector as a single output
return(c(m, s))
}

Then we can reuse all the operations defined in the function.

w = c(1, 2, 3, 4, 10, 30, 60, 120, 180, 300) # a data vector

my_summary(w) # our function
## [1] 71.0000 100.5651
c(mean(w), sd(w)) # R built-in function
## [1] 71.0000 100.5651

Summary
Centre
· Sample mean
· Sample median
· Robustness and comparisons

Spread
· Standard deviation
- Population standard deviation and sample standard deviation
· Interquartile range

Write functions in R

Some R Functions
length , mean , median , sd , IQR , summary , boxplot
