Week 1,2 Instructor
Population:
a collection of persons, objects, or items of interest
• All automobiles
• All Ford Escape crossover vehicles produced in 2021
Sample:
a portion of the whole, and if properly taken, representative of the whole
• Random sample of 4k automobiles
• Random sample of 100 Ford Escape crossover vehicles produced in 2021
Census: When analysts gather data from the whole population for a given measurement of interest. Example: Canadian Census
2 main branches of statistics
Descriptive Statistics
• using data gathered on a group to
describe or reach conclusions about
that same group
Inferential Statistics
• gathers data from a sample and uses
the statistics generated to reach
conclusions about the population
from which the sample was taken
Descriptive vs Inferential Statistics: an example
Inferential statistics (aka inductive statistics):
• Widely used in pharmaceutical research
• Starts from a hypothesis and makes a statement about the population
• Allows studying a wide range of phenomena without having to conduct a census
Parameter vs Statistics
Parameter
• Descriptive measure of the population
• Greek letters
• Population mean (μ)
• Population SD (σ)
Statistic
• Descriptive measure of a sample
• Roman letters
• Sample mean (x̄)
• Sample variance (s²)
Parameter vs Statistics
• Randomly select 10 students
• Inferences about parameters are made under uncertainty
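The difference between a parameter and a statistic can be sketched in a few lines of Python. The population here is simulated and purely hypothetical (500 student grades drawn from a normal distribution); only the idea of drawing a random sample of 10 students comes from the slide.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: final grades (out of 100) for 500 students.
population = [random.gauss(70, 10) for _ in range(500)]

# Parameter: a descriptive measure of the whole population.
mu = statistics.mean(population)

# Statistic: a descriptive measure of a random sample of 10 students.
sample = random.sample(population, 10)
x_bar = statistics.mean(sample)

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
print(f"sampling error:              {x_bar - mu:+.2f}")
```

The sample mean rarely equals the population mean exactly, which is why inferences about parameters are made under uncertainty.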
Statistical Inference is inference about a … ?
LEFT: Population    RIGHT: Sample
Variables, Data, and Data Measurement
Variable
• a characteristic of any entity being studied that is capable of taking on different values
• labour productivity
Measurement
• a standard process used to assign numbers to particular attributes or characteristics of a variable
• Products produced per hour
Data Case studies with Big Results
GOOGLE:
Working with the U.S. Centers for Disease Control, tracks when users are inputting search terms related to flu topics, to help predict which regions may experience outbreaks.
Netflix:
Collects data from its users, including viewing time, platform searches for keywords, and metadata related to content abandonment, such as content pause time, rewinds, and rewatches. Using these data, it predicts what a viewer is likely to watch and gives the user a personalized watchlist.
Levels of Data
Nonmetric/Qualitative Data
• Nominal: categories. Examples: Sex, Religion, Student ID
• Ordinal: ranks. Example: Patient CTAS in ED
Metric/Quantitative Data
• Interval: equal intervals. Example: % change in employment
• Ratio: known, meaningful zero. Examples: Height, Mass, time
A researcher collects demographic data from her
participants. She asks participants for their city of
birth.
Which level of measurement is this?
LEFT: Nominal    RIGHT: Interval
She then asks participants to report the number of
hours they spent exercising in the past week.
Which level of measurement is this?
LEFT: Interval    RIGHT: Ratio
Big Data
Application of processes and techniques that transform raw data into meaningful information to improve decision making
Source
Canadian Occupational Projectio
Business Analytics
bank that has disbursed 60816 auto loans in the
Age group   Loans   Defaults   Non-defaults   % default   % non-default
27-30       4000    172        3828           4.3%        95.7%
33-36       5698    188        5510           3.3%        96.7%
36-39       8209    197        8012           2.4%        97.6%
39-42       8117    211        7906           2.6%        97.4%
42-45       9000    216        8784           2.4%        97.6%
45-48       7600    152        7448           2.0%        98.0%
48-51       6000    84         5916           1.4%        98.6%
51-54       4000    64         3936           1.6%        98.4%
54-57       2000    26         1974           1.3%        98.7%
57-60       788     9          779            1.1%        98.9%
>60         6       0          6              0.0%        100.0%
As the borrowers are getting older, they are less likely to default on their loans.
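The percentage columns can be recomputed from the loan and default counts. The sketch below assumes the three numeric columns are total loans, defaults, and non-defaults (the slide does not label them); the function name `default_rate` is illustrative.

```python
# (age group, loans, defaults) copied from the rows of the table above.
rows = [
    ("27-30", 4000, 172),
    ("33-36", 5698, 188),
    ("36-39", 8209, 197),
    ("39-42", 8117, 211),
    ("42-45", 9000, 216),
    ("45-48", 7600, 152),
    ("48-51", 6000, 84),
    ("51-54", 4000, 64),
    ("54-57", 2000, 26),
    ("57-60", 788, 9),
    (">60", 6, 0),
]


def default_rate(loans, defaults):
    """Share of loans that defaulted, as a percentage."""
    return 100 * defaults / loans if loans else 0.0


rates = {age: round(default_rate(loans, defaults), 1) for age, loans, defaults in rows}

for age, rate in rates.items():
    print(f"{age:>6}: {rate:.1f}% default, {100 - rate:.1f}% non-default")
```

Reading the rates from top to bottom shows the downward trend with age that the slide highlights. The last group has only 6 loans, so its 0.0% rate carries little weight.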
Visualizing Data With Charts and Graphs
Data visualization is useful for:
• Exploring data structure
• Detecting outliers
• Data cleaning
• Identifying trends and clusters
• Spotting local patterns
• Presenting results
Frequency Distributions
Ungrouped Data
• Raw data that have not been summarized in any way
Grouped Data
• Data that have been organized into a frequency
distribution
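Turning ungrouped data into grouped data can be sketched as follows. The raw values, the starting point of 20, and the class width of 10 are all hypothetical choices made for illustration, and the helper assumes every value is at least `start`.

```python
from collections import Counter

# Hypothetical raw (ungrouped) data: ages of 12 survey respondents.
raw = [23, 37, 41, 29, 35, 52, 48, 31, 26, 44, 39, 33]


def frequency_distribution(data, start, width):
    """Count the values falling in each class interval [lower, lower + width)."""
    # Map each value to the index of its class interval, then count per index.
    counts = Counter((x - start) // width for x in data)
    return {
        (start + k * width, start + (k + 1) * width): counts[k]
        for k in range(max(counts) + 1)
    }


grouped = frequency_distribution(raw, start=20, width=10)
for (lower, upper), freq in grouped.items():
    print(f"{lower}-under {upper}: {freq}")
```

Each class interval includes its lower endpoint and excludes its upper endpoint, so no value is counted twice.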
Raw data refers to which type of data?
LEFT: Grouped    RIGHT: Ungrouped
Let’s solve Problem 2.1…
Let’s solve Problem 2.3…
Quantitative Data Graphs
Intervals: one year
Descriptive Statistics
Finding the median:
• Sort the numbers from smallest to largest
• Odd number of terms: find the middle number
• Even number of terms: find the average of the middle 2 terms
Median - Example
• Find the median temperature of the day August 1st over the last 10 years.
Date          Mean Daily Temperature (C)
1 Aug 2022    20.6
1 Aug 2021    22
1 Aug 2020    19.2
1 Aug 2019    20.5
1 Aug 2018    18.8
1 Aug 2017    19.9
1 Aug 2016    17.8
1 Aug 2015    20.6
1 Aug 2014    20.2
1 Aug 2013    17.9
• Sort the data: 17.8, 17.9, 18.8, 19.2, 19.9, 20.2, 20.5, 20.6, 20.6, 22
• The number of terms: 10 (even number)
• The median will be the average of the 2 middle terms: 19.9 and 20.2
• (19.9+20.2)/2=20.05
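The median steps (sort, then take the middle term or the average of the two middle terms) can be written as a short function and checked against the temperature example. This is a minimal sketch; the function name `median` is illustrative, and Python's built-in `statistics.median` follows the same rule.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Mean daily temperatures (C) on 1 Aug, 2022 back to 2013, from the slide.
temps = [20.6, 22, 19.2, 20.5, 18.8, 19.9, 17.8, 20.6, 20.2, 17.9]

result = median(temps)
print(round(result, 2))
```

Rounding is applied only when printing, because adding 19.9 and 20.2 in floating point may not give exactly 40.1.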
Mode
Percentiles
• Measures of central tendency that divide a group of data into 100 parts.
• There are 99 percentiles (and not 100) as it takes 99 dividers to separate a
group of data into 100 parts
• Example: the 87th percentile value:
• at least 87% of the data are below the value and
• no more than 13% of the data are above the value
• Widely used in reporting test results
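Percentiles can be computed in several slightly different ways, and the slide does not say which one is used. The sketch below assumes one common textbook convention: compute the location i = (P/100)·n; if i is a whole number, average the i-th and (i+1)-th sorted values, otherwise take the value at the integer part of i plus one. The data are hypothetical.

```python
def percentile(data, p):
    """Return the p-th percentile (p an integer from 1 to 99) by the location rule."""
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(data)
    n = len(ordered)
    # Location i = (p / 100) * n, kept in integer arithmetic to avoid float error.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # i is a whole number: average the i-th and (i + 1)-th values (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise take the value at position (integer part of i) + 1 (1-based).
    return ordered[whole]


values = [14, 12, 19, 23, 5, 13, 28, 17]  # hypothetical data
p30 = percentile(values, 30)
p25 = percentile(values, 25)
print(p30, p25)
```

Statistical software often interpolates between neighbouring values instead, so results for small data sets can differ slightly from this rule.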
Percentiles
• Range: the difference between the largest and the smallest values of a
data set.
• Advantage: ease of computation
• Disadvantage: affected by extreme values
• Interquartile Range (IQR): the range of values between the first and
third quartiles.
• It is the range of the middle 50% of the data
• Determined by IQR = Q3 - Q1
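A short sketch of both measures, using hypothetical data with one extreme value to show why the range is sensitive to outliers while the IQR is not. Quartile conventions vary; this example relies on the default "exclusive" method of Python's `statistics.quantiles`, which is an assumption rather than the method used in the course.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 30]  # hypothetical data; 30 is an extreme value

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Interquartile range: spread of the middle 50% of the data.
iqr = q3 - q1

print(f"range = {data_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Replacing 30 with 10 would shrink the range from 28 to 8 but leave the IQR unchanged.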
Interquartile Range - Example
• Variance: the average of the squared deviations about the arithmetic mean for a set of numbers
Population Variance: Apple annual revenue 2015-2021 ($bn)
Year   Revenue   Deviation from the mean   Squared deviation
2015   233.6     -29.8                     886.3
2016   215.4     -48.0                     2301.3
2017   229.0     -34.4                     1181.4
2018   265.4     2.0                       4.1
2019   260.1     -3.3                      10.7
2020   274.3     10.9                      119.4
2021   365.8     102.4                     10491.6
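The population variance, σ² = Σ(x − μ)² / N, can be recomputed step by step from the revenue column. This is a minimal sketch that treats the seven years as the whole population, as the slide's "Population Variance" label suggests; deviations are taken as x − μ.

```python
# Apple annual revenue ($bn), 2015-2021, from the table above.
revenue = {
    2015: 233.6,
    2016: 215.4,
    2017: 229.0,
    2018: 265.4,
    2019: 260.1,
    2020: 274.3,
    2021: 365.8,
}

n = len(revenue)
mean = sum(revenue.values()) / n

# Squared deviation of each year's revenue from the mean.
squared_deviations = {year: (x - mean) ** 2 for year, x in revenue.items()}

# Population variance: average of the squared deviations.
population_variance = sum(squared_deviations.values()) / n

print(f"mean = {mean:.1f}")
for year, sq in squared_deviations.items():
    print(f"{year}: deviation = {revenue[year] - mean:+.1f}, squared = {sq:.1f}")
print(f"population variance = {population_variance:.1f}")
```

For a sample rather than a population, the sum of squared deviations is divided by n − 1 instead of n.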
Standard Deviation
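The standard deviation is the square root of the variance, which puts the measure back in the original units. The sketch below uses the Apple revenue figures from the variance slide and Python's standard `statistics` module; whether the course treats these seven years as a population or a sample is an assumption, so both versions are shown.

```python
import math
import statistics

# Apple annual revenue ($bn), 2015-2021.
revenue = [233.6, 215.4, 229.0, 265.4, 260.1, 274.3, 365.8]

# Population SD: square root of the population variance (divide by N).
population_sd = statistics.pstdev(revenue)

# Sample SD: square root of the sample variance (divide by n - 1).
sample_sd = statistics.stdev(revenue)

# The same population SD obtained explicitly from the variance.
sd_from_variance = math.sqrt(statistics.pvariance(revenue))

print(f"population SD = {population_sd:.2f} $bn")
print(f"sample SD     = {sample_sd:.2f} $bn")
```

The sample SD is always at least as large as the population SD computed from the same numbers, because it divides by n − 1.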
Z Scores
• If the z score is
• Positive: the raw value (x) is above the mean
• Negative: the raw value (x) is below the mean
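A z score expresses how many standard deviations a raw value lies from the mean: z = (x − μ) / σ. The sketch below uses hypothetical exam figures (mean 70, SD 8) to show the sign rule.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical exam: mean 70, standard deviation 8.
z_above = z_score(86, mean=70, sd=8)  # raw value above the mean
z_below = z_score(62, mean=70, sd=8)  # raw value below the mean

print(z_above, z_below)
```

A raw value equal to the mean gives a z score of 0.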
z Scores - Example
Coefficient of Variation
• Useful in comparing the SDs that have been computed for data with different means. Assume the following 2 data sets:
• Data A: Mean=1000, SD=5
• Data B: Mean=10, SD=5
$CV_A = \frac{5}{1000}(100) = 0.5\%$
$CV_B = \frac{5}{10}(100) = 50\%$
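The two calculations follow the same pattern, SD divided by the mean and multiplied by 100, which a small helper makes explicit. The function name is illustrative.

```python
def coefficient_of_variation(mean, sd):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * sd / mean


cv_a = coefficient_of_variation(mean=1000, sd=5)  # Data A
cv_b = coefficient_of_variation(mean=10, sd=5)    # Data B

print(f"CV_A = {cv_a}%, CV_B = {cv_b}%")
```

Both data sets have the same SD, but relative to its mean Data B is 100 times more variable than Data A.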
Let’s solve Problem 3.20…
The more dispersed the data are, the larger the
range, the interquartile range, the variance, and
the standard deviation will be.
LEFT: True    RIGHT: False
Measures of Shape
Skewness Kurtosis
• Suppose: Mean=29, Median=26, SD=12.3
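The slide does not show which skewness measure these three numbers feed into. One measure that needs exactly a mean, a median, and an SD is the Pearsonian coefficient of skewness, Sk = 3(mean − median) / SD; the sketch below assumes that is the intended calculation.

```python
def pearsonian_skewness(mean, median, sd):
    """Pearsonian coefficient of skewness: 3 * (mean - median) / SD."""
    return 3 * (mean - median) / sd


sk = pearsonian_skewness(mean=29, median=26, sd=12.3)
print(f"Sk = {sk:.2f}")
```

A positive coefficient means the mean lies above the median, which indicates a distribution skewed to the right; a negative coefficient indicates skew to the left.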
Should the empirical rule be used for data sets
that are highly skewed?
LEFT: YES    RIGHT: NO
Kurtosis
• Suppose you have the math test results for a class of 15 students: 91 95 54 69 80 85 88 73 71 70 66 90 86 84 73
• First sort them:
• 54 66 69 70 71 73 73 80 84 85 86 88 90 91 95
• Then find: Min, Lower quartile, Median, Upper quartile, Max
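The five values named on the slide (min, lower quartile, median, upper quartile, max) can be computed for the 15 test scores as sketched below. Quartile conventions differ between textbooks and software; this example assumes the default "exclusive" method of Python's `statistics.quantiles`.

```python
import statistics

# Math test results for the class of 15 students, from the slide.
scores = [91, 95, 54, 69, 80, 85, 88, 73, 71, 70, 66, 90, 86, 84, 73]

ordered = sorted(scores)

# The three quartile cut points: lower quartile, median, upper quartile.
q1, med, q3 = statistics.quantiles(ordered, n=4)

summary = {
    "min": ordered[0],
    "lower quartile": q1,
    "median": med,
    "upper quartile": q3,
    "max": ordered[-1],
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

These five numbers are what a box-and-whisker plot draws: the box spans the lower to upper quartile, the line inside marks the median, and the whiskers extend toward the min and max.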
Let’s solve Problem 3.31…