Week 1,2 Instructor

1. The number of bad loans is highest for borrowers aged 42-45 years (216 loans, a bad loan rate of 2.4%). 2. Apart from the sparsely populated <21 and >60 buckets, the bad loan rate generally decreases with age, from 4.5% for borrowers aged 21-24 years to 1.1% for borrowers aged 57-60 years. 3. The number of loans is also largest for borrowers aged 42-45 years, so the high count of bad loans in this group does not by itself mean higher risk.

BUS 232

Data and Decisions (Business Statistics)
Instructor: Negar Ganjouhaghighi
What is Statistics?

A science dealing with the collection, analysis, interpretation, and presentation of numerical data

Collect Data → Analyze Data → Interpret Data → Present Findings
Data in Business Disciplines Examples

• Revenue
• Bond rates
• Salaries
• Forecast sales
• CPU time
• # of “hits” on the net
• Time spent on net per day
• # of productions/day
• Amount of inventory
• Quantity of raw materials
• Storage capacity
• Age of employees
• Foreign exchange rate
• Market share
• Taxes
Population vs Sample

Population: a collection of persons, objects, or items of interest
• All automobiles
• All Ford Escape crossover vehicles produced in 2021

Sample: a portion of the whole, and if properly taken, representative of the whole
• Random sample of 4k automobiles
• Random sample of 100 Ford Escape crossover vehicles produced in 2021

Census: When analysts gather data from the whole population for a given measurement of interest. Example: Canadian Census
2 main branches of statistics

Descriptive Statistics
• using data gathered on a group to
describe or reach conclusions about
that same group

Inferential Statistics
• gathers data from a sample and uses
the statistics generated to reach
conclusions about the population
from which the sample was taken
Descriptive vs Inferential Statistics: an example

DS (Descriptive Statistics):
• Heights of a random sample of 10 male basketball players are measured
• Mean/Median/Mode/Variance/Range…

IS (Inferential Statistics):
• Test and compare the mean height of the above sample with the mean height of the whole population to see if basketball players are taller than the average male
Inferential Statistics

• Aka. Inductive Statistics
• Widely used in pharmaceutical research
• Starts from a hypothesis and makes a statement about the population
• Allows studying a wide range of phenomena without having to conduct a census
Parameter vs Statistics

Parameter:
• Descriptive measure of the population
• Greek letters
• Population mean (\( \mu \))
• Population SD (\( \sigma \))

Statistic:
• Descriptive measure of a sample
• Roman letters
• Sample mean (\( \bar{x} \))
• Sample variance (\( s^2 \))
Parameter vs Statistics

Example: determine the average height of the students in this class
• The population?
• The parameter?

Randomly select 10 students → Inferences about parameters are made under uncertainty
Statistical Inference is inference about a … ?

LEFT RIGHT

Population Sample
Variables, Data, and Data Measurement

Most business statistics studies contain Variables, Measurement, and Data.

Variable:
• a characteristic of any entity being studied that is capable of taking on different values
• e.g., labour productivity

Measurement:
• a standard process used to assign numbers to particular attributes or characteristics of a variable
• e.g., Products produced per hour
Data Case studies with Big Results

Google:
Working with the U.S. Centers for Disease Control, Google tracks when users enter search terms related to flu topics, to help predict which regions may experience outbreaks.

Netflix:
Netflix collects data from its users, including viewing time, platform searches for keywords, and metadata related to content abandonment (such as pause time, rewinds, and rewatches). Using these data, it predicts what a viewer is likely to watch and gives each user a personalized watchlist.
Levels of Data
Nonmetric/Qualitative Data
• Nominal: Categories (e.g., Sex, Religion, Student ID)
• Ordinal: Ranks (e.g., Patient CTAS in ED)

Metric/Quantitative Data
• Interval: Known, equal intervals (e.g., % change in employment)
• Ratio: Meaningful zero (e.g., Height, Mass, time)
A researcher collects demographic data from her
participants. She asks participants for their city of
birth.
Which level of measurement is this?
LEFT RIGHT

Nominal Interval
She then asks participants to report the number of
hours they spent exercising in the past week.
Which level of measurement is this?
LEFT RIGHT

Interval Ratio
Big Data

Big data: a collection of large and complex datasets from different sources that are difficult to process using traditional data management and processing applications.
So:
BUSINESS ANALYTICS
Business Analytics

Application of processes and techniques that transform raw data into meaningful information to improve decision making

Source 
Canadian Occupational Projectio
Business Analytics

Descriptive Analytics
• Simplest and most commonly used
• Describe what is happening in business
• Data mining, data visualisation, statistics, …

Predictive Analytics
• Finds relationships in the data that are not found in the first step
• Make predictions about future
• Regression, Time-series, forecasting, Simulation, ML,…

Prescriptive Analytics
• Still in early stages of development
• Takes uncertainty into account
• Recommend ways to mitigate risk
• Aims to optimize the performance of a system
Case Study 1

Assume that you are the chief risk officer for a bank that has disbursed 60,816 auto loans in the quarter between April-June 2021.

According to the data, you have had a total of 1,524 bad loans, a bad rate of 2.5%.

You want to analyze the bad rate across several individual variables. Based on your experience, the borrower’s age is a critical factor.

Age     Total # of loans   # of bad loans   # of good loans
<21     9                  2                7
21-24   310                14               296
24-27   511                20               491
27-30   4000               172              3828
30-33   4568               169              4399
33-36   5698               188              5510
36-39   8209               197              8012
39-42   8117               211              7906
42-45   9000               216              8784
45-48   7600               152              7448
48-51   6000               84               5916
51-54   4000               64               3936
54-57   2000               26               1974
57-60   788                9                779
>60     6                  0                6
Case Study 1

1. The distribution of loans across age groups is a reasonably smooth normally distributed curve
2. The max bad loans are in the age bucket 42-45 years (doesn’t necessarily mean the risk is also higher)
3. Not enough data for the fringe buckets (<21 and >60 years)
Case Study 1

Normalized Plot

Age     Total # of loans   # of bad loans   # of good loans   % Bad loans   % Good loans
<21     9                  2                7                 22.2%         77.8%
21-24   310                14               296               4.5%          95.5%
24-27   511                20               491               3.9%          96.1%
27-30   4000               172              3828              4.3%          95.7%
30-33   4568               169              4399              3.7%          96.3%
33-36   5698               188              5510              3.3%          96.7%
36-39   8209               197              8012              2.4%          97.6%
39-42   8117               211              7906              2.6%          97.4%
42-45   9000               216              8784              2.4%          97.6%
45-48   7600               152              7448              2.0%          98.0%
48-51   6000               84               5916              1.4%          98.6%
51-54   4000               64               3936              1.6%          98.4%
54-57   2000               26               1974              1.3%          98.7%
57-60   788                9                779               1.1%          98.9%
>60     6                  0                6                 0.0%          100.0%

Conclusion: As the borrowers are getting older, they are less likely to default on their loans
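
The normalization in this case study is a division of bad loans by total loans in each age bucket. A minimal Python sketch of that calculation follows; the variable and function names (`loans`, `bad_rate`) are illustrative choices, and the figures are copied from the case study table.

```python
# Age buckets from the case study: (label, total loans, bad loans)
loans = [
    ("<21", 9, 2), ("21-24", 310, 14), ("24-27", 511, 20),
    ("27-30", 4000, 172), ("30-33", 4568, 169), ("33-36", 5698, 188),
    ("36-39", 8209, 197), ("39-42", 8117, 211), ("42-45", 9000, 216),
    ("45-48", 7600, 152), ("48-51", 6000, 84), ("51-54", 4000, 64),
    ("54-57", 2000, 26), ("57-60", 788, 9), (">60", 6, 0),
]


def bad_rate(total, bad):
    """Percentage of loans in a bucket that went bad."""
    return 100 * bad / total


# Bad rate per age bucket, plus the portfolio-wide rate
rates = {label: bad_rate(total, bad) for label, total, bad in loans}
total_loans = sum(total for _, total, _ in loans)
total_bad = sum(bad for _, _, bad in loans)
overall_rate = bad_rate(total_loans, total_bad)

for label, total, bad in loans:
    print(f"{label:>6} {total:>6} {bad:>4} {rates[label]:5.1f}%")
print(f"Overall: {total_bad}/{total_loans} = {overall_rate:.1f}%")
```

Comparing counts alone would point at the 42-45 bucket, while the rates show the highest percentages in the youngest buckets, which is why the plot is normalized.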
Visualizing Data With Charts and
Graphs
Data visualization is useful for:

• Exploring data structure
• Detecting outliers
• Data Cleaning
• Identifying trends and clusters
• Spotting local patterns
• Presenting Results
Frequency Distributions

Ungrouped Data
• Raw data, have not been summarized in any way

Grouped Data
• Data that have been organized into a frequency
distribution
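
To illustrate the step from ungrouped to grouped data, here is a small sketch that counts raw values into class intervals. The data set (`raw`), the starting point, and the class width are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical raw (ungrouped) data: ages of 12 customers
raw = [23, 35, 41, 28, 52, 37, 44, 31, 26, 48, 39, 33]


def frequency_distribution(values, start, width):
    """Count how many values fall in each class interval [lower, lower + width)."""
    counts = Counter((v - start) // width for v in values)
    return {
        (start + k * width, start + (k + 1) * width): counts[k]
        for k in sorted(counts)
    }


# Grouped data: class intervals of width 10 starting at 20
grouped = frequency_distribution(raw, start=20, width=10)
for (lower, upper), freq in grouped.items():
    print(f"{lower}-under {upper}: {freq}")
```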
Raw data refers to which type of data?

LEFT RIGHT

Grouped Ungrouped
Let’s solve Problem 2.1…
Let’s solve Problem 2.3…
Quantitative Data Graphs

Histograms
• Useful tool for differentiating the frequencies of class intervals

Frequency Polygons
• Similar to Histogram but each class frequency is plotted as a dot at the class midpoint

Ogive
• Cumulative frequency polygon
• Running totals

Stem and leaf
• Separating the digits for each number of the data into a stem and a leaf
• Finding the outliers
Let’s solve Problem 2.10…
Qualitative Data Graphs

Pie Charts
• Shows the relative magnitude of parts to a whole
• Less accurate

Bar Charts
• Easier to see the difference between similar categories

Pareto Charts
• A vertical bar chart that displays the most common types of defects, ranked in order of occurrence
Scatter plot data

• A 2-dimensional graph plot of pairs of points from 2 numerical variables
• Often used to examine possible relationships between 2 variables

Temperature °C   Ice Cream Sales
14.2°            $215
16.4°            $325
11.9°            $185
15.2°            $332
18.5°            $406
22.1°            $522
19.4°            $412
25.1°            $614
23.4°            $544
18.1°            $421
22.6°            $445
17.2°            $408
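
A scatter plot shows a relationship visually. One common way to put a number on it, which this slide does not cover, is the Pearson correlation coefficient. The sketch below computes it for the temperature and sales pairs above using only the standard library; `pearson_r` is an illustrative helper name.

```python
from math import sqrt

# Pairs from the table: temperature (°C) and ice cream sales ($)
temps = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(temps, sales)
print(f"r = {r:.3f}")
```

A value of r close to +1 indicates a strong positive linear relationship, which is consistent with warmer days having higher sales in this data.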
Visualizing Time-Series Data

• Time series data: data gathered on a particular characteristic over a period of time at regular intervals (hours/weeks/years…)
• Visualizing with a line chart: see the trend

Visualizing Time-Series Data with a line chart (Intervals: one year)
Descriptive Statistics

• Data visualization: general observations about the shape and spread of the data
• Statistics: more complete understanding of the data
• Measures of central tendency
• Measures of variability
• Measures of shape
Measures of Central Tendency

• Yield information about the centre, or middle part, of a group of numbers.
• Yield such information as the average, the middle point, and the most frequently occurring point
• Do not focus on the span of the data
• The common measures of central tendency: Mean, Median, Mode, Percentile, Quartile


Mean

• The arithmetic mean: the average of a group of numbers

Population Mean: \( \mu = \frac{\sum x}{N} \)
Sample Mean: \( \bar{x} = \frac{\sum x}{n} \)

• Example: Salaries of data analysts in top 6 companies in Vancouver, BC:

Company      Annual Salary
Si Systems   $131,508
DISYS        $107,808
UBC          $106,272
TransLink    $99,181
CRD          $97,160
MSi Corp     $94,944

• Mean salary of a data analyst: 636,873 / 6 = $106,145.50
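
The salary example can be checked with a few lines of Python. This is only a sketch of the sum-and-divide definition; the dictionary name `salaries` is an illustrative choice.

```python
# Annual salaries of data analysts from the example (in dollars)
salaries = {
    "Si Systems": 131508, "DISYS": 107808, "UBC": 106272,
    "TransLink": 99181, "CRD": 97160, "MSi Corp": 94944,
}

# Arithmetic mean: sum of the values divided by how many there are
mean_salary = sum(salaries.values()) / len(salaries)
print(f"Mean salary: ${mean_salary:,.2f}")
```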


Median

• The median: the middle value in an ordered array of numbers
• Sort the numbers from smallest to largest
• Odd number of terms: find the middle number
• Even number of terms: find the average of the middle 2 terms
Median -Example

• Find the median temperature of the day August 1st over the last 10 years.

Date         Mean Daily Temperature (C)
1 Aug 2022   20.6
1 Aug 2021   22
1 Aug 2020   19.2
1 Aug 2019   20.5
1 Aug 2018   18.8
1 Aug 2017   19.9
1 Aug 2016   17.8
1 Aug 2015   20.6
1 Aug 2014   20.2
1 Aug 2013   17.9

• Sort the data: 17.8, 17.9, 18.8, 19.2, 19.9, 20.2, 20.5, 20.6, 20.6, 22
• The number of terms: 10 (Even number)
• The median will be the average of 2 middle terms: 19.9 and 20.2
• (19.9+20.2)/2=20.05
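
The sort-then-pick-the-middle procedure can be sketched as a short function. The name `median` is an illustrative choice; Python's standard library also provides `statistics.median`, which follows the same odd/even rule.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Mean daily temperature on 1 August, 2022 back to 2013
temps = [20.6, 22, 19.2, 20.5, 18.8, 19.9, 17.8, 20.6, 20.2, 17.9]
median_temp = median(temps)
print(f"Median: {median_temp:.2f}")
```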
Mode

• The mode: the most frequently occurring value in a set of data


• Sorting the data helps to locate the mode
• Example: Inflation rate, Canada, 1995-2022
1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
2.1 1.6 1.6 1 1.7 2.7 2.5 2.3 2.8 1.9 2.2 2.2 2.1 2.4
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
0.3 1.8 2.9 1.5 0.9 1.9 1.1 1.4 1.6 2.3 2.0 0.7 3.4 5.6
• 0.3 0.7 0.9 1 1.1 1.4 1.5 1.6 1.6 1.6 1.7 1.8 1.9 1.9 2 2.1 2.1 2.2 2.2 2.3 2.3 2.4 2.5 2.7 2.8 2.9 3.4 5.6

• The mode: 1.6 %
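
Counting occurrences is an alternative to sorting by eye. The sketch below uses `collections.Counter` on the inflation rates above; it returns a list because a data set can have more than one mode.

```python
from collections import Counter

# Inflation rate (%), Canada, 1995-2022, in year order
inflation = [
    2.1, 1.6, 1.6, 1, 1.7, 2.7, 2.5, 2.3, 2.8, 1.9, 2.2, 2.2, 2.1, 2.4,
    0.3, 1.8, 2.9, 1.5, 0.9, 1.9, 1.1, 1.4, 1.6, 2.3, 2.0, 0.7, 3.4, 5.6,
]

counts = Counter(inflation)
highest = max(counts.values())
# All values that occur with the highest frequency
modes = sorted(value for value, count in counts.items() if count == highest)
print(f"Mode(s): {modes}, each occurring {highest} times")
```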


What is the median of the following dataset?
5 3 6 0 7
LEFT RIGHT

5.5 5
Percentiles

• Measures of central tendency that divide a group of data into 100 parts.
• There are 99 percentiles (and not 100) as it takes 99 dividers to separate a
group of data into 100 parts
• Example: the 87th percentile value:
• at least 87% of the data are below the value and
• no more than 13% of the data are above the value
• Widely used in reporting test results
Percentiles

• Steps in determining the location of a percentile:
• Sort the numbers in an ascending order
• Calculate the percentile location (i) by: \( i = \frac{P}{100}(N) \)
• P=the percentile of interest
• i=percentile location
• N=number in the data set
• a) if i is a whole number: the Pth percentile is the average of the values at the ith and (i+1)th locations
• b) if i is not a whole number: the Pth percentile value is located at the whole number part of i+1
Percentiles - example

• Determine the 40th percentile of the following 9 numbers:


14,12,19,23,5,13,28,17,2

• Step1: Sort the data from smallest to largest


• 2,5,12,13,14,17,19,23,28
• Step2: calculate the percentile location: 40% * 9 =3.6
• Since the location is not a whole number, we find the closest integer that is greater than 3.6, which is 4. So the 40th percentile is located at the 4th value: 13.
• So, 13 is the 40th percentile.
Percentiles - example

• Determine the 50th percentile of the following 8 numbers:


14,12,19,23,5,13,28,17

• Step1: Sort the data from smallest to largest


• 5,12,13,14,17,19,23,28
• Step2: calculate the percentile location: 50% * 8 =4
• Since the location is a whole number, we find the average value of the
4th and 5th number: (14+17)/2=15.5
• So, 15.5 is the 50th percentile.
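
The location rule from the two examples can be sketched as a small function. This follows the rule taught on these slides (other software, including Excel and Python's `statistics.quantiles`, may interpolate differently). The function name `percentile` is an illustrative choice, and it assumes P is an integer from 1 to 99.

```python
def percentile(values, p):
    """p-th percentile (integer p, 1 to 99) using the location rule i = (P/100) * N."""
    ordered = sorted(values)
    # Integer arithmetic: `whole` is the whole-number part of i,
    # and `remainder` is 0 exactly when i is a whole number.
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # Rule a): average the values at 1-based locations i and i+1
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Rule b): value at the whole-number part of i+1 (1-based),
    # which is index `whole` in 0-based indexing
    return ordered[whole]


p40 = percentile([14, 12, 19, 23, 5, 13, 28, 17, 2], 40)
p50 = percentile([14, 12, 19, 23, 5, 13, 28, 17], 50)
print(p40, p50)
```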
Quartiles

• Measures of central tendency that divide a group of data into 4 subgroups or parts
Quartiles - Example

• Suppose we want to determine the values of \( Q_1 \), \( Q_2 \), and \( Q_3 \) for the following numbers: 2, 5, 6, 7, 10, 22, 13, 14, 16, 65, 45, 12.
• Step 1: Put the numbers in order: 2, 5, 6, 7, 10, 12, 13, 14, 16, 22, 45, 65
• Count how many numbers there are in your set: 12
• 12/4=3: 3 numbers in each quartile:
• 2, 5, 6, | 7, 10, 12 | 13, 14, 16, | 22, 45, 65
• \( Q_1 \): \( (6+7)/2 = 6.5 \), so the first quartile value is 6.5
• \( Q_2 \): \( (12+13)/2 = 12.5 \)
• \( Q_3 \)? \( (16+22)/2 = 19 \)
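
Quartiles are the 25th, 50th, and 75th percentiles, so the percentile location rule from this deck reproduces the values above. The sketch below is one way to check them; the helper `percentile` is an illustrative name and assumes an integer P from 1 to 99.

```python
def percentile(values, p):
    """p-th percentile (integer p, 1 to 99) using the location rule i = (P/100) * N."""
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # i is a whole number: average the i-th and (i+1)-th values (1-based)
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not a whole number: take the value at the whole-number part of i+1
    return ordered[whole]


data = [2, 5, 6, 7, 10, 22, 13, 14, 16, 65, 45, 12]
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
print(q1, q2, q3)
```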
Let’s solve Problem 3.7…
Measures of Variability

• Describe the spread or the dispersion of a set of data
• Used together with measures of central tendency, they provide more accurate information about the data
• 7 main measures of variability: Range, Interquartile range, Mean absolute deviation, Variance, Standard deviation, z scores, Coefficient of variation
Range and Interquartile Range

• Range: the difference between the largest and the smallest values of a
data set.
• Advantage: ease of computation
• Disadvantage: affected by extreme values
• Interquartile Range (IQR): the range of values between the first and third quartiles.
• It is the range of the middle 50% of the data
• Determined by \( IQR = Q_3 - Q_1 \)
Interquartile Range - Example

• Canada Imports- Top Categories:

Category                  Value (Billion USD)
Cars                      28
Car parts                 20
Trucks                    15
Crude oil                 14
Processed petroleum oil   14
Phones                    11
Computers                 9
Medications               8
Turbo jets                6
Gold                      6

• \( Q_1 \) is the 3rd value from the bottom: 8
• \( Q_3 \) is the 8th value from the bottom: 15
• The IQR is \( Q_3 - Q_1 = 15 - 8 = 7 \)
• The middle 50% of the top 10 imports to Canada spans a range of 7 billion USD
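
The import example can be reproduced by locating the first and third quartiles with the percentile location rule and subtracting. This is a sketch under that rule; `percentile` and `imports` are illustrative names.

```python
def percentile(values, p):
    """p-th percentile (integer p, 1 to 99) using the location rule i = (P/100) * N."""
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[whole]


# Top 10 import categories, value in billion USD
imports = [28, 20, 15, 14, 14, 11, 9, 8, 6, 6]

q1 = percentile(imports, 25)   # location 2.5 -> 3rd value from the bottom
q3 = percentile(imports, 75)   # location 7.5 -> 8th value from the bottom
iqr = q3 - q1
data_range = max(imports) - min(imports)
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, range={data_range}")
```

The range here is 22 because of the single largest category, while the IQR of 7 is unaffected by it, which illustrates the sensitivity of the range to extreme values.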
Deviation from the mean

Apple annual revenue 2015-2021 ($bn)

• Subtracting the mean from each data value
• Mean=263.4
• Sum of the deviations from the Arithmetic Mean is Always Zero

YEAR   Revenue   Deviation from the mean
2015   233.6     233.6-263.4=-29.8
2016   215.4     215.4-263.4=-48.0
2017   229.0     229.0-263.4=-34.4
2018   265.4     265.4-263.4=2.0
2019   260.1     260.1-263.4=-3.3
2020   274.3     274.3-263.4=10.9
2021   365.8     365.8-263.4=102.4
Mean absolute Deviation

• The average of the absolute values of the deviations around the mean for a set of numbers
• \( MAD = \frac{\sum |x - \mu|}{N} \)
• Less useful in statistics than other measures of variability
• Occasionally used in the field of forecasting as a measure of error

Apple annual revenue 2015-2021 ($bn)

YEAR   Revenue   Deviation from the mean   Absolute Deviation
2015   233.6     -29.8                     29.8
2016   215.4     -48.0                     48.0
2017   229.0     -34.4                     34.4
2018   265.4     2.0                       2.0
2019   260.1     -3.3                      3.3
2020   274.3     10.9                      10.9
2021   365.8     102.4                     102.4
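
As a check on the arithmetic, the following sketch computes the mean, the deviations (each value minus the mean), and the mean absolute deviation directly from the revenue column. Variable names are illustrative.

```python
# Apple annual revenue 2015-2021 ($bn), from the table
revenue = {
    2015: 233.6, 2016: 215.4, 2017: 229.0, 2018: 265.4,
    2019: 260.1, 2020: 274.3, 2021: 365.8,
}

mean = sum(revenue.values()) / len(revenue)

# Deviation from the mean: each value minus the mean
deviations = {year: x - mean for year, x in revenue.items()}

# The deviations always sum to zero (up to floating-point rounding),
# so MAD averages their absolute values instead.
mad = sum(abs(d) for d in deviations.values()) / len(deviations)

print(f"Mean = {mean:.1f}")
print(f"Sum of deviations = {sum(deviations.values()):.6f}")
print(f"MAD = {mad:.1f}")
```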
Variance

• The average of the squared deviations about the arithmetic mean for a set of numbers
• Population Variance: \( \sigma^2 = \frac{\sum (x - \mu)^2}{N} \)

Apple annual revenue 2015-2021 ($bn)

YEAR   Revenue   Deviation from the mean   Squared Deviation
2015   233.6     -29.8                     886.3
2016   215.4     -48.0                     2301.3
2017   229.0     -34.4                     1181.4
2018   265.4     2.0                       4.1
2019   260.1     -3.3                      10.7
2020   274.3     10.9                      119.4
2021   365.8     102.4                     10491.6
Standard Deviation

• Square root of variance


• Popular measure of variability
Standard Deviation

• Advantage over Variance: SD is expressed in the same units as the raw data
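
The variance and standard deviation of the Apple revenue data can be sketched as follows. The population versions divide by N, as in the definition above. The sample versions, which divide by n − 1, are included as an assumption about what the population-versus-sample comparison covers. Python's `statistics.pvariance` and `statistics.variance` implement the same two formulas.

```python
from math import sqrt

# Apple annual revenue 2015-2021 ($bn)
revenue = [233.6, 215.4, 229.0, 265.4, 260.1, 274.3, 365.8]

n = len(revenue)
mean = sum(revenue) / n
sum_sq_dev = sum((x - mean) ** 2 for x in revenue)

# Population variance and SD: divide by N
pop_var = sum_sq_dev / n
pop_sd = sqrt(pop_var)

# Sample variance and SD: divide by n - 1
sample_var = sum_sq_dev / (n - 1)
sample_sd = sqrt(sample_var)

print(f"Population variance = {pop_var:.1f}, SD = {pop_sd:.1f} ($bn)")
print(f"Sample variance     = {sample_var:.1f}, SD = {sample_sd:.1f} ($bn)")
```

The variance is in squared units ($bn squared), while the SD is back in $bn, which is the advantage noted above.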
Meaning of Standard Deviation

Empirical Rule Chebyshev’s theorem


Population vs Sample Variance and SD
z Scores

• Represents the number of standard deviations a value (x) is above or below the mean of a set of numbers
• Most meaningful for normally distributed data, where the empirical rule applies
• Allows the distance of a raw value from the mean to be translated into SDs
• \( z = \frac{x - \mu}{\sigma} \)

Z Scores

• If the z score is
• Positive: the raw value (x) is above the mean
• Negative: the raw value (x) is below the mean
z Scores - Example

• A normally distributed data set:


• Mean=50
• SD=10
• What is the z score for a value of 70?
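
A z score is the distance from the mean divided by the standard deviation. The sketch below works through this example; `z_score` is an illustrative function name.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z = z_score(70, mean=50, sd=10)
print(f"z = {z}")
```

The value 70 is 20 units above the mean of 50, which is 2 standard deviations of 10.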

Z Scores
Coefficient of Variation

• The ratio of the Standard deviation to the mean expressed in percentage

Coefficient of Variation: \( CV = \frac{\sigma}{\mu}(100) \)

• Useful in comparing the SDs that have been computed for data with different means. Assume the following 2 data sets:
• Data A: Mean=1000, SD=5: \( CV_A = \frac{5}{1000}(100) = 0.5\% \)
• Data B: Mean=10, SD=5: \( CV_B = \frac{5}{10}(100) = 50\% \)
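
The two data sets have the same SD but very different relative variability, which a short sketch makes explicit. The function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * sd / mean


cv_a = coefficient_of_variation(sd=5, mean=1000)
cv_b = coefficient_of_variation(sd=5, mean=10)
print(f"CV_A = {cv_a}%, CV_B = {cv_b}%")
```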
Let’s solve Problem 3.20…
The more dispersed the data are, the larger the
range, the interquartile range, the variance, and
the standard deviation will be.
LEFT RIGHT

True False
Measures of Shape

• Tools that can be used to describe the shape of a distribution of data
• 2 important measures: Skewness, Kurtosis
• Box-and-whisker plots: great visualization tool

Skewness

• Distribution skewed Left: Negatively Skewed
• The right half is a mirror of the left half: Symmetrical
• Distribution skewed Right: Positively Skewed
Skewness

• Pearsonian Coefficient of Skewness: compares the mean and median in light of the magnitude of the standard deviation
• \( S_k = \frac{3(\mu - M_d)}{\sigma} \), where \( M_d \) = Median

• Suppose:
• Mean=29, Median=26,SD=12.3
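
Assuming the Pearsonian coefficient takes its usual form, 3 × (mean − median) / SD, the example values can be plugged in as follows. The function name `pearson_skewness` is illustrative.

```python
def pearson_skewness(mean, median, sd):
    """Pearsonian coefficient of skewness: 3 * (mean - median) / SD."""
    return 3 * (mean - median) / sd


sk = pearson_skewness(mean=29, median=26, sd=12.3)
print(f"Sk = {sk:.2f}")
```

The mean lies above the median, so the coefficient is positive and the distribution is positively (right) skewed.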
Should the empirical rule be used for data sets
that are highly skewed?
LEFT RIGHT

YES NO
Kurtosis

• Describes the amount of peakedness of a distribution


Box and Whisker Plot

• A.k.a. Box Plot
• Determined from 5 specific numbers:
• The median (\( Q_2 \))
• The lower quartile (\( Q_1 \))
• The upper quartile (\( Q_3 \))
• The minimum
• The maximum
Box and Whisker Plot - Example

• Suppose you have the math test results for a class of 15 students: 91, 95, 54, 69, 80, 85, 88, 73, 71, 70, 66, 90, 86, 84, 73
• First Sort them:
• 54, 66, 69, 70, 71, 73, 73, 80, 84, 85, 86, 88, 90, 91, 95
• Then locate the 5 numbers: Min, Lower quartile, Median, Upper quartile, Max
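
The five numbers behind the box plot can be located with the percentile location rule used earlier in this deck. This is a sketch under that rule, so plotting software that interpolates quartiles may give slightly different values. The names `percentile` and `five_number_summary` are illustrative.

```python
def percentile(values, p):
    """p-th percentile (integer p, 1 to 99) using the location rule i = (P/100) * N."""
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[whole]


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, and maximum."""
    return {
        "min": min(values),
        "q1": percentile(values, 25),
        "median": percentile(values, 50),
        "q3": percentile(values, 75),
        "max": max(values),
    }


scores = [91, 95, 54, 69, 80, 85, 88, 73, 71, 70, 66, 90, 86, 84, 73]
summary = five_number_summary(scores)
print(summary)
```

The box spans the lower to the upper quartile with a line at the median, and the whiskers extend toward the minimum and maximum.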
Let’s solve Problem 3.31…

Construct a box-and-whisker plot for the following data. Do the data contain any outliers? Is the distribution of data skewed?
Let’s solve Problem 3.48…

The Globe and Mail compiled a list of the top 100 public companies in Canada according to profit. Leading the list is the Toronto-Dominion Bank, followed by the Bank of Nova Scotia. The following Excel descriptive statistics output describes the profits for these 100 companies. Study the output and describe in your own words what you can learn about the profits (shown in $ thousands) of these top 100 Canadian public companies.
Thank you all, We did it!

• We just finished the first 3 chapters of the book:


• Please read the book
• solve the end of chapter questions
• review the cases
• answer the concept check questions
• Don’t worry about the formulas
• all the formulas listed at the end of the chapter will be
provided to you for the midterm and final exam.
• We will use the concepts learned so far in the next session to
solve different problems using Excel.
