Lesson 5 (Descriptive Statistics Part 1)_Oct 2024

Uploaded by

2009 SAIFUL
Copyright
© All Rights Reserved

Lesson 5

Descriptive Statistics
(Part 1)
i. Introduction
ii. Types of data

i. Introduction

Definition of Statistics

A scientific method for collecting, organizing, summarizing, presenting and
interpreting data.
○ Collecting
○ Organizing
○ Summarizing
○ Presenting
○ Interpreting

Statistics

○ Statistics is a branch of mathematics that has applications in almost
every part of our daily life. It provides a powerful tool for data analysis
in many different fields of application.
○ Statistics has a wide range of uses in the fields of science, business,
industry, economics, medicine, education, agriculture and so on.

○ For example:
○ In the field of science – statistical techniques are used to analyze
data that is obtained from an experiment.
○ In manufacturing – quality control is achieved with the aid of
statistics.
○ In the area of business – marketing surveys are carried out to
determine the compatibility of the product with the economic
and social demand.
○ In the field of education – statistical techniques are used to
analyze the performance of students in an examination.

Definition
Population: The entire collection of individuals or
objects whose characteristics are being
studied.
Sample: A subset of the population selected for study.

Standard Notation

Measurement          Sample       Population
Mean                 $\bar{x}$    $\mu$
Standard deviation   $s$          $\sigma$
Variance             $s^2$        $\sigma^2$

Branches of Statistics

Statistics
  Descriptive
    Graphical
    Numerical
  Inferential
    Estimation
      Point
      Interval
    Hypothesis Testing

Definition

Descriptive Statistics:
Methods for organizing and summarizing data by using tables, graphs and
summary measures.

Inferential Statistics:
The branch of statistics that includes methods that use sample results to
help make decisions or predictions about a population.

Why Descriptive Statistics
Process of Statistical Inference (Data Analysis Process)

1. Understand the nature of the problem
2. Deciding what to measure and how to measure it
3. Data collection
4. Data summarization and preliminary analysis
5. Data analysis
6. Interpretation of results

ii. Types of data



Types of Data

Data
  Quantitative (ratio and interval)
    Discrete
    Continuous
  Qualitative (nominal and ordinal)

Quantitative Data

○ Data that can be measured numerically
○ Ex : income, heights, gross sales, prices of homes, numbers of cars owned
and numbers of accidents
○ can be measured by ratio and interval
○ may be classified as either discrete or continuous data

Quantitative Data

Discrete data:
o Values are countable with no intermediate values.
o Ex : We can count the number of cars but cannot count the height of a
person.

Continuous data:
o Values that cannot be counted but can be measured.
o Any numerical value between two numbers.
o Ex : time, height, weight

Qualitative Data

○ Data that cannot be measured but can be classified into different
categories
○ can be divided into nominal or ordinal measurement
○ Ex : gender, status of students, nationality, race

Scales of Measurement

Nominal
• gender, nationality, ethnicity, language, colours, genre, etc.

Ordinal
• class rank, students’ performance, satisfaction scales

Interval
• The intervals between each value are equally split
• Ex : temperature, date

Ratio
• Ex : Income, price, mass, length, duration, electric charge

Nominal Data

Example:
What is your gender? (please tick)
  Male
  Female

Did you enjoy the film? (please tick)
  Yes
  No

Ordinal Data

Example:
How satisfied are you with the level of service you have received?
(please tick)
  Very satisfied
  Somewhat satisfied
  Neutral
  Somewhat dissatisfied
  Very dissatisfied

Are you satisfied with your education at U of L?
  Dissatisfied  1  2  3  4  5  Satisfied

Interval and Ratio Data


o Both interval and ratio data are examples of scale data.
o Scale data :
• data is in numeric format (£50, £100, £150)
• data that can be measured on a continuous scale
• the distance between each value can be observed and, as a result, measured
• the data can be placed in rank order.

Ratio Data
o Ratio data are measured on a continuous scale and have a true zero point.

Examples:
o Age
o Weight
o Height
Summary of “types of data” and “scale of measurement”

Example 1
Classify each set of data as discrete or continuous.
1) The number of suitcases lost by an airline.
2) The height of corn plants.
3) The grade level of students.
4) The number of green M&M's in a bag.
5) The time it takes for a car battery to die.
6) The production of tomatoes by weight.
Example 2

Refer to the questionnaire in EXAMPLE 2 and decide on the types of data and
scales of measurement for

a) Gender
b) Age
c) Educational Background
d) Position in Botanic Gardens
e) Working Experience (years)
Example 3

Age   Month   Sex   Head.L   Head.W   Neck.G   Length   Chest.G   Weight   Name
19    7       1     10       5        15       45       23        65       Allen
19    7       2     11       6.5      20       47.5     24        70       Berta
20    8       2     12       6        17       57       27        74       Berta
23    11      2     12.5     5        20.5     59.5     38        142      Berta
29    5       2     12       6        18       62       31        121      Berta
19    7       1     11       5.5      16       53       26        80       Clyde
20    8       1     12       5.5      17       56       30.5      108      Clyde
55    7       1     16.5     9        28       67.5     45        344      Doc
67    7       1     16.5     9        27       78       49        371      Doc
81    9       1     15.5     8        31       72       54        416      Quincy
NA    10      1     16       8        32       77       52        432      Kooch
115   7       1     17       10       31.5     72       49        348      Charlie
117   9       1     15.5     7.5      32       75       54.5      476      Charlie
124   4       1     17.5     8        32       75       55        478      Charlie
140   8       1     15       9        33       75       49        386      Charlie

Refer to the data in EXAMPLE 3 and decide on the types of data and scales of
measurement for
a) Age
b) Month
c) Sex
d) Head Length
e) Head Width
f) Neck
g) Length
h) Chest
i) Weight
Lesson 6
Descriptive Statistics
(Part 2)
Statistical Measures

Statistical Measures

a) Measure of central tendency
   measure of location : to show where the center of the data is
b) Measure of dispersion
   measure of spread : to show how spread out the data are around the center
c) Measure of skewness
   measure of asymmetry : to show whether the frequency distribution is
   symmetrical about the mean or skewed

Numerical Descriptions
(a) Measures of central tendency
▪ Also called measures of location or average
▪ It refers to the middle point (central value) of a
distribution.
(b) Measures of dispersion
▪ It describes how spread out or scattered a set or distribution of numeric
data is about the central point, or “how far apart the data values are from
each other”.

Numerical Descriptions
(c) Measures of skewness
▪ Skewness is the statistical term for asymmetry or “lop-sidedness”.
▪ A measure of skewness summarizes to what extent the items are
symmetrically distributed.

a) Measures of Central Tendency

There are 3 main types :

Mean
• Sum of all measurements divided by the number of measurements / average.
• $\bar{x} = \frac{\sum x}{n}$

Median
• It is the value of the middle member of a distribution or array (or the
value of that item which lies exactly half way along the array).

Mode
• The most frequent measurement in the data.

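The three definitions above can be checked on a small data set. The sketch below uses Python's standard `statistics` module; the `measurements` list is hypothetical and chosen only for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical measurements, chosen only for illustration.
measurements = [2, 3, 3, 5, 7]

# Mean: sum of all measurements divided by the number of measurements.
sample_mean = sum(measurements) / len(measurements)

# Median: the middle value of the ordered data.
sample_median = median(measurements)

# Mode: the most frequent measurement(s). multimode returns every value
# that is tied for the highest frequency.
sample_modes = multimode(measurements)

print("mean  :", sample_mean)
print("median:", sample_median)
print("mode  :", sample_modes)
```

Here the mean (4.0) is not one of the observed values, while the median and the mode (both 3) are.
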
Mean

➢ Advantages
➢ it is widely understood
➢ the value of every item is included in the computation of the
mean.
➢ it is well suited to further statistical analysis.

➢ Disadvantages
➢ its value may not correspond to any actual value.
➢ it might be distorted by extremely high or low values.
Median

➢ Advantages
➢ can be used when certain end values of a set or distribution are
difficult, expensive or impossible to obtain, particularly appropriate
to ‘life’ data.
➢ can be used with non-numeric data if desired, providing the
measurements can be naturally ordered.
➢ will often assume a value equal to one of the original data values.

➢ Disadvantages
➢ it is difficult to handle theoretically in more advanced statistical
work, so its use is restricted to analysis at a basic level.
➢ it fails to reflect the full range of values.
Mode

➢ Advantages
➢ it is a more appropriate average to use in situations where it is useful
to know the most common value.
➢ easy to understand, not difficult to calculate and can be used when a
distribution has open-ended classes.
➢ it is not affected by extreme values.
➢ Disadvantages
➢ it ignores dispersion around the modal value and it does not take all
the values into account.
➢ it is unsuitable for further statistical analysis.
➢ although it ignores extreme values, it is thought to be too much
affected by the most popular class when a distribution is significantly
skewed.

Summary of when to use the Mean, Median & Mode

Type of Data                  Best measure of central tendency
Nominal                       Mode
Ordinal                       Median
Interval/Ratio (not skewed)   Mean
Interval/Ratio (skewed)       Median

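The summary table can be written as a small lookup. This is only a sketch of the rule of thumb in the table; the function name `best_measure` and its parameters are hypothetical.

```python
def best_measure(scale: str, skewed: bool = False) -> str:
    """Return the measure of central tendency suggested by the summary table.

    scale  : "nominal", "ordinal", "interval" or "ratio"
    skewed : only matters for interval/ratio data
    """
    scale = scale.strip().lower()
    if scale == "nominal":
        return "mode"
    if scale == "ordinal":
        return "median"
    if scale in ("interval", "ratio"):
        return "median" if skewed else "mean"
    raise ValueError(f"unknown scale of measurement: {scale!r}")


print(best_measure("nominal"))             # mode
print(best_measure("ratio", skewed=True))  # median
```
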
Example 4

Consider the salaries of staff at a factory below :

Determine the mean, median and mode of the data set.



b) Measures of Dispersion

Range
• it is the numerical difference between the smallest and largest values of
the items in a set or distribution.

Range = Highest Score – Lowest Score

Standard deviation
• it is a measure of the extent to which a particular random variable (x) is
spread about the mean.

$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}, \quad n < 30$

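As a minimal sketch, the range and the sample standard deviation (with n − 1 in the denominator) can be computed step by step as follows. The `scores` list is hypothetical, and the result is compared with `statistics.stdev`, which uses the same n − 1 formula.

```python
import math
from statistics import stdev

# Hypothetical scores, chosen only for illustration.
scores = [4, 8, 6, 5, 7]

# Range = highest score - lowest score.
data_range = max(scores) - min(scores)

# Sample standard deviation: square root of the sum of squared deviations
# from the mean, divided by n - 1.
n = len(scores)
x_bar = sum(scores) / n
sum_sq_dev = sum((x - x_bar) ** 2 for x in scores)
s = math.sqrt(sum_sq_dev / (n - 1))

print("range:", data_range)
print("s    :", round(s, 4))
print("matches statistics.stdev:", math.isclose(s, stdev(scores)))
```
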
Comparing Standard Deviation

Coefficient of variation
• A coefficient of variation (CV) can be calculated as the ratio of the
standard deviation to the mean and interpreted in two different settings:
analysing a single variable and interpreting a model.
• The higher the CV, the greater the dispersion in the variable.

$CV = \frac{s}{\bar{x}} \times 100\%$

Quartile deviation
• Known as semi-inter-quartile range, it is the dispersion which shows the
degree of spread around the middle of a set of data

$\text{Quartile Deviation} = \frac{Q_3 - Q_1}{2}$

Example 5

Two students worked on five similar projects during their FYP courses.
Student A completed each project in a mean of 30 days with a standard
deviation of 4 days, whereas Student B completed each project in a mean of
25 days with a standard deviation of 6 days. Which student is more
consistent in terms of completing the projects?

STUDENT A
$\bar{x} = 30$, $s = 4$
$CV = \frac{4}{30} \times 100\% = 13.33\%$

STUDENT B
$\bar{x} = 25$, $s = 6$
$CV = \frac{6}{25} \times 100\% = 24\%$

Since the coefficient of variation of Student A is lower than that of
Student B, we conclude that Student A is more consistent in completing the
projects.

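The comparison in Example 5 can be reproduced with a few lines of Python. The helper name `coefficient_of_variation` is hypothetical; the figures are the ones given in the example.

```python
def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """CV as a percentage: standard deviation divided by mean, times 100."""
    return std_dev / mean * 100


# Figures from Example 5 (mean and standard deviation in days).
cv_a = coefficient_of_variation(4, 30)
cv_b = coefficient_of_variation(6, 25)

# The lower CV indicates less relative dispersion, i.e. more consistency.
more_consistent = "Student A" if cv_a < cv_b else "Student B"

print(f"CV of Student A: {cv_a:.2f}%")
print(f"CV of Student B: {cv_b:.2f}%")
print("More consistent:", more_consistent)
```
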
Example 6

            MEAN WEEKLY    STANDARD          NO OF
            SALARY (RM)    DEVIATION (RM)    WORKERS
FACTORY A   345            50                476
FACTORY B   285            45                524

(a) Which factory pays out a larger amount of weekly salary?
(b) Which factory has higher variability in paying individual weekly
salary?

(a) Which factory pays out a larger amount of weekly salary?

Factory A : $\bar{x} = 345$, $s = 50$, $n = 476$
Total weekly salary paid by Factory A = 345 × 476 = RM164220 per week

Factory B : $\bar{x} = 285$, $s = 45$, $n = 524$
Total weekly salary paid by Factory B = 285 × 524 = RM149340 per week

Factory A pays a larger amount of weekly wages than Factory B.

(b) Which factory has higher variability in paying individual weekly salary?

Factory A : $CV = \frac{50}{345} \times 100\% = 14.49\%$
Factory B : $CV = \frac{45}{285} \times 100\% = 15.79\%$

Factory B has a higher coefficient of variation than Factory A. Therefore,
the variability in the payment of individual weekly salary is higher in
Factory B.

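Both parts of Example 6 can be checked with a short script. This is a sketch only; the dictionary layout and variable names are hypothetical, while the figures come from the table in the example.

```python
# Figures from Example 6: mean weekly salary (RM), standard deviation (RM)
# and number of workers for each factory.
factories = {
    "Factory A": {"mean": 345, "sd": 50, "workers": 476},
    "Factory B": {"mean": 285, "sd": 45, "workers": 524},
}

# (a) Total weekly salary = mean weekly salary x number of workers.
totals = {name: f["mean"] * f["workers"] for name, f in factories.items()}

# (b) Coefficient of variation = sd / mean x 100%.
cvs = {name: f["sd"] / f["mean"] * 100 for name, f in factories.items()}

largest_payer = max(totals, key=totals.get)
most_variable = max(cvs, key=cvs.get)

for name in factories:
    print(f"{name}: total = RM{totals[name]} per week, CV = {cvs[name]:.2f}%")
print("Larger total weekly salary:", largest_payer)
print("Higher variability        :", most_variable)
```
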
Quartiles

- Quartiles are defined as the values which divide the ordered data into
quarters.

$Q_1$ - first quartile
    - value below which 25% of the observations lie
$Q_2$ - second quartile
    - value below which half of the data lie (median)
$Q_3$ - third quartile
    - value below which 75% of the observations lie

$Q_1 = \left(\frac{n+1}{4}\right)$th value
$Q_3 = \left(\frac{3(n+1)}{4}\right)$th value
$\text{Interquartile Range} = Q_3 - Q_1$

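The position formulas for the quartiles can be sketched in Python as below. The data set is hypothetical, and the handling of fractional positions (linear interpolation between the two neighbouring values) is an assumption, since the slides do not say what to do when (n + 1)/4 is not a whole number. Other textbooks and software use different conventions and may give slightly different quartiles.

```python
from statistics import quantiles


def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional, position in ordered data.

    Assumption: fractional positions are resolved by linear interpolation.
    """
    lower = int(position)        # whole part of the position
    fraction = position - lower  # fractional part of the position
    if lower < 1:
        return sorted_values[0]
    if lower >= len(sorted_values):
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


# Hypothetical data, chosen only for illustration.
data = sorted([7, 2, 9, 4, 5, 4, 8])
n = len(data)

q1 = value_at_position(data, (n + 1) / 4)      # (n + 1)/4 th value
q3 = value_at_position(data, 3 * (n + 1) / 4)  # 3(n + 1)/4 th value
iqr = q3 - q1                                  # interquartile range
quartile_deviation = iqr / 2                   # semi-interquartile range

print("Q1 =", q1, " Q3 =", q3)
print("IQR =", iqr, " quartile deviation =", quartile_deviation)

# statistics.quantiles with its default "exclusive" method uses the same
# (n + 1)-based positions, so it should agree here.
print(quantiles(data, n=4))
```
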
Range

➢ Advantages
➢ it is widely understood
➢ it is simple to calculate, since only the highest and lowest values are
needed.

➢ Disadvantages
➢ it ignores every value except the two extremes, so it is not well suited
to further statistical analysis.
➢ it might be distorted by extremely high or low values.
Standard Deviation

➢ Advantages
➢ it takes all values into account; therefore, it can be regarded
as truly representative of the data.
➢ it is suitable for further statistical analysis.

➢ Disadvantages
➢ it is more difficult to understand than some other measures
of dispersion.

c) Measures of Skewness

[Figure: a negatively skewed curve and a positively skewed curve]

Skewness is a measure of symmetry or, more precisely, the lack of symmetry.
It describes the shape of the distribution of data. A distribution is
symmetric if it looks the same to the left and right of its center, as a
“bell-shaped curve” does.

From Histogram

Example 7

What type of distribution is described by the following information?

Mean = 56 Median = 58.1 Mode = 63

Mean < Median < Mode

Answer : Negatively skewed

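The reasoning in Example 7 can be sketched as a small function. The name `skew_direction` is hypothetical, and the ordering of mean, median and mode is only a rule of thumb for unimodal data; a computed skewness coefficient is the more reliable check.

```python
def skew_direction(mean: float, median: float, mode: float) -> str:
    """Classify skewness from the ordering of mean, median and mode."""
    if mean < median < mode:
        return "negatively skewed"
    if mean > median > mode:
        return "positively skewed"
    if mean == median == mode:
        return "symmetric"
    return "inconclusive"


# Figures from Example 7.
print(skew_direction(mean=56, median=58.1, mode=63))
```
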
Why is skewness important?

➢ Skewness gives the direction of the outliers and the concentration of
data.
➢ For example, if the distribution is right-skewed, most of the outliers
are on the right side of the distribution, while the majority of the data
lie on the left side of the mean. The long tail of less frequent values
extends to the right side of the curve.
➢ If it is left-skewed, most of the outliers are on the left side of the
distribution, while the majority of the data lie on the right side of the
mean. The long tail of less frequent values extends to the left side of the
curve.
Example 8

Let’s look at the distribution below. It is the distribution of horsepower
of cars.

You can clearly see that the distribution is positively skewed. Since our
data is positively skewed here, it means that it has a higher number of data
points having low values, i.e., cars with less horsepower.

Also, skewness tells us about the direction of outliers. You can see that
our distribution is positively skewed and most of the outliers are present
on the right side of the distribution.

Kurtosis

➢ kurtosis characterizes the relative peakedness or flatness of a
distribution compared to the normal distribution

➢ the kurtosis (measured as excess kurtosis) of a normal distribution is 0
(mesokurtic)

➢ types of kurtosis:
➢ Platykurtic – when the kurtosis < 0, the frequencies throughout the curve
are closer to being equal (i.e., the curve is flatter and wider). Thus,
negative kurtosis indicates a relatively flat distribution

➢ Leptokurtic – when the kurtosis > 0, there are high frequencies in only a
small part of the curve (i.e. the curve is more peaked). Thus, positive
kurtosis indicates a relatively peaked distribution
➢ In finance, kurtosis is used as a measure of financial risk. A large
kurtosis is associated with a high risk for an investment because it
indicates high probabilities of extremely large and extremely small returns.
➢ On the other hand, a small kurtosis signals a moderate level of risk
because the probabilities of extreme returns are relatively low.

➢ Please note that an investor is more comfortable with a platykurtic
distribution of return as it indicates stable returns and lower risk of
sudden shock of outliers, while leptokurtic distribution means chances of
higher return but with higher risk.

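The three kurtosis types can be illustrated numerically. The sketch below assumes the moment-based excess kurtosis, m4 / m2² − 3, where m2 and m4 are the second and fourth central moments; statistical packages often apply a sample-size correction, so their output can differ. The function names and both data sets are hypothetical.

```python
import math


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal curve)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((x - centre) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - centre) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


def kurtosis_type(kurtosis, tolerance=1e-9):
    """Name the shape implied by the sign of the excess kurtosis."""
    if kurtosis > tolerance:
        return "leptokurtic"
    if kurtosis < -tolerance:
        return "platykurtic"
    return "mesokurtic"


# Hypothetical data: evenly spread values versus values packed at the
# centre with two extreme observations.
flat = [1, 2, 3, 4, 5]
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]

for label, data in (("flat", flat), ("peaked", peaked)):
    k = excess_kurtosis(data)
    print(f"{label}: excess kurtosis = {k:.2f} ({kurtosis_type(k)})")
```
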
WHY DO WE NEED KURTOSIS?

➢ The concept of kurtosis is very important as it indicates how the
outliers are distributed across the distribution in comparison to a normal
distribution.
➢ For example:

➢ These two distributions have the same variance, approximately the same
skewness, but differ markedly in kurtosis.

[Figure: a leptokurtic curve and a platykurtic curve]

Comparison of Central Tendency for Three Curves
[Figure: Curve A, Curve B and Curve C]

Comparison of Dispersion of Two Curves

Comparison of Two Skewed Curves
Curve A: Positively Skewed
Curve B: Negatively Skewed

Example 9

Adam normally makes really good grades in Mathematics class. All but one of
his test scores are really high. His test scores are 97, 98, 94, 93, 99,
and 70.

(a) Obtain a computer output for the summary of data that includes the mean,
mode, median, range, quartiles, inter-quartile range (IQR), standard
deviation, skewness and kurtosis of the data set.

(b) Is Adam's test score data skewed to the left or to the right?
(c) Which measure of spread is larger? Which measure of spread
will give a more accurate picture of Adam's Maths
performance?
(d) Which measure of center is higher? Which measure of center
gives a more accurate picture of Adam's Maths performance?

(a)
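One way to obtain such a summary is sketched below in Python, using Adam's scores from the example. Several choices here are assumptions rather than the method behind the original output: quartiles use the "inclusive" convention of `statistics.quantiles`, and skewness and kurtosis use simple moment-based formulas. Software packages follow different conventions, so their figures (the quartiles and IQR in particular) may not match exactly.

```python
from statistics import mean, median, multimode, quantiles, stdev

scores = [97, 98, 94, 93, 99, 70]  # Adam's test scores from Example 9
n = len(scores)

x_bar = mean(scores)
med = median(scores)
modes = multimode(scores)  # every score occurs once, so no single mode
data_range = max(scores) - min(scores)
s = stdev(scores)          # sample standard deviation (n - 1 denominator)

# Assumption: "inclusive" quartile convention.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# Assumption: moment-based skewness and excess kurtosis.
m2 = sum((x - x_bar) ** 2 for x in scores) / n
m3 = sum((x - x_bar) ** 3 for x in scores) / n
m4 = sum((x - x_bar) ** 4 for x in scores) / n
skewness = m3 / m2 ** 1.5
kurtosis = m4 / m2 ** 2 - 3

print(f"mean = {x_bar:.2f}, median = {med}, modes = {modes}")
print(f"range = {data_range}, standard deviation = {s:.2f}")
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"skewness = {skewness:.2f}, excess kurtosis = {kurtosis:.2f}")
```

Under these assumptions the skewness is negative, the median exceeds the mean, and the standard deviation exceeds the IQR.
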

(b) The data set is negatively skewed because the skewness value is
negative.
(c) Measure of spread :
The standard deviation is the larger measure of spread when
compared to the interquartile range (IQR) because it takes into
account all the values in the data set. The mark 70 will increase
the value of the standard deviation.
In Adam’s case, the interquartile range gives a more accurate
picture of Adam's Maths performance because its value will not
be affected by the lower mark. IQR will only consider the middle
50% of the data set.

(d) Measure of center

The median is the higher measure of center as compared to the mean because
the score of 70 pulls the mean mark down.
The median will also give a more accurate picture of the data set because it
is not affected by the skewness of the data set.
