Business Analytics: Describing The Distribution of A Single Variable
Slides adapted from Albright & Winston (2015)
Introduction
(slide 1 of 2)
The goal is to present data in a form that makes sense to people. Tools
that are used to do this include:
◦ Graphs: bar charts, pie charts, histograms, scatterplots, time series graphs
◦ Numerical summary measures: counts, percentages, averages, measures of
variability
◦ Tables of summary measures: totals, averages, counts, grouped by
categories
Another efficient way to find counts for a categorical variable is to use dummy
(0–1) variables.
◦ Recode each variable so that one category is replaced by 1 and all others by 0.
◦ This can be done using a simple IF formula.
◦ Find the count of that category by summing the 0s and 1s.
◦ Find the percentage of that category by averaging the 0s and 1s.
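The dummy-variable steps above can be sketched in Python. This is a minimal illustration, not the slides' Excel workflow: the list `positions` and its category labels are hypothetical data, and the conditional expression stands in for the IF formula.

```python
# Hypothetical categorical data; labels are illustrative, not from the data set.
positions = ["Pitcher", "Catcher", "Pitcher", "Outfielder", "Pitcher"]

# Recode: 1 for the category of interest, 0 for every other category
# (the Python analogue of a spreadsheet IF formula).
is_pitcher = [1 if p == "Pitcher" else 0 for p in positions]

count = sum(is_pitcher)                    # summing the 0s and 1s gives the count
share = sum(is_pitcher) / len(is_pitcher)  # averaging them gives the proportion

print(count, f"{share:.0%}")
```

Multiplying the average by 100 (or formatting it as a percentage, as here) gives the percentage of observations in the category.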
Descriptive Measures for
Numerical Variables
There are many ways to summarize numerical variables, both with
numerical summary measures and with charts.
To learn how the values of a variable are distributed, ask:
◦ What are the most “typical” values?
◦ How spread out are the values?
◦ What are the “extreme” values on either end?
◦ Is the chart of the values symmetric about some middle value, or is it skewed
in some direction? Does it have any other peculiar features besides possible
skewness?
Example 2.3:
Baseball Salaries 2011.xlsx (slide 1 of 2)
Objective: To learn how salaries are distributed across all 2011 MLB players.
Solution: Data set contains data on 843 Major League Baseball players in the
2011 season.
Variables are player’s name, team, position, and salary.
Create summary measures of baseball salaries using Excel functions.
Example 2.3:
Baseball Salaries 2011.xlsx (slide 2 of 2)
Measures of Central Tendency
(slide 1 of 3)
The mean is the average of all values.
◦ If the data set represents a sample from some larger population, this
measure is called the sample mean and is denoted by X̄ (“X-bar”).
◦ If the data set represents the entire population, it is called the population
mean and is denoted by μ.
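As a small sketch, the mean can be computed in Python with the standard `statistics` module. The list `values` is hypothetical data chosen for easy arithmetic; the calculation is the same whether the data are treated as a sample or as a population.

```python
import statistics

# Hypothetical observations (not from the baseball data set).
values = [3, 5, 8, 10, 14]

# The mean is the sum of the values divided by how many there are.
mean_by_hand = sum(values) / len(values)
mean_builtin = statistics.mean(values)

print(mean_by_hand, mean_builtin)
```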
The minimum and maximum values can be calculated with Excel’s MIN
and MAX functions, and the percentiles and quartiles with Excel’s
PERCENTILE and QUARTILE functions.
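For readers working outside Excel, a rough Python counterpart is sketched below. It assumes hypothetical data in `values` and uses `statistics.quantiles` with `method="inclusive"`, which interpolates between sorted values; other percentile conventions exist and can give slightly different answers.

```python
import statistics

# Hypothetical observations (not from the baseball data set).
values = [3, 5, 8, 10, 14]

smallest = min(values)
largest = max(values)

# n=4 returns the three quartile cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

# n=100 returns 99 cut points; index p - 1 holds the pth percentile.
percentiles = statistics.quantiles(values, n=100, method="inclusive")
p90 = percentiles[90 - 1]

print(smallest, largest, q1, median, q3, round(p90, 2))
```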
Percentile example: 0.7 of the way between the 2nd
and 3rd positions
Percentile distributions
Definition: Let 0 < p < 100. The pth percentile is a number x such that p% of all
measurements fall below x and (100 − p)% fall above it.
Here, the integer part of the position is 2, and the second value is 5. The fractional part
is 0.7, which is multiplied by the difference between the third and second values (8 − 5),
giving 5 + 0.7 × 3 = 7.1.
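The interpolation in this example can be sketched as a small Python function. The function name `value_at_position` and the list `data` are hypothetical; only the second value (5), the third value (8), and the position 2.7 come from the example. How the position is derived from p depends on the percentile convention in use, so the sketch takes the position as given.

```python
import math

def value_at_position(sorted_values, position):
    """Interpolate at a 1-based, possibly fractional position in sorted data."""
    whole = math.floor(position)        # integer part of the position
    frac = position - whole             # fractional part of the position
    lower = sorted_values[whole - 1]    # value at the integer position
    if frac == 0:
        return float(lower)
    upper = sorted_values[whole]        # next value in the sorted order
    return lower + frac * (upper - lower)

# Hypothetical sorted data; only the 2nd value (5) and 3rd value (8) are from the example.
data = [3, 5, 8, 10, 14]
result = value_at_position(data, 2.7)   # 5 + 0.7 * (8 - 5)

print(round(result, 2))
```

The result is approximately 7.1, up to floating-point rounding.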
Measures of Variability
(slide 1 of 3)
The range is the maximum value minus the minimum value.
The interquartile range (IQR) is the third quartile minus the first quartile.
◦ Thus, it is the range of the middle 50% of the data.
◦ It is less sensitive to extreme values than the range.
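A short Python sketch can illustrate why the IQR is less sensitive to extreme values than the range. Both data lists are hypothetical, the helper names are illustrative, and the quartiles use the `"inclusive"` interpolation method of `statistics.quantiles`, which is one of several conventions.

```python
import statistics

def value_range(values):
    """Maximum value minus minimum value."""
    return max(values) - min(values)

def iqr(values):
    """Third quartile minus first quartile (inclusive interpolation)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical data, and the same data with one extreme value.
typical = [3, 5, 8, 10, 14]
with_outlier = [3, 5, 8, 10, 140]

print(value_range(typical), iqr(typical))
print(value_range(with_outlier), iqr(with_outlier))
```

In this sketch the single extreme value inflates the range from 11 to 137, while the IQR stays at 5 because the middle 50% of the data is unchanged.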
The variance is essentially the average of the squared deviations from the
mean.
◦ If Xi is a typical observation, its squared deviation from the mean is (Xi – mean)².
Measures of Variability
(slide 2 of 3)
◦ The sample variance is denoted by s², and the population variance by σ².
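The two variances can be sketched in Python as follows. The list `values` is hypothetical data. The sketch relies on the standard convention that the sample variance divides the sum of squared deviations by n − 1, while the population variance divides by n, which is why the variance is only "essentially" an average.

```python
import statistics

# Hypothetical observations (not from any data set in the slides).
values = [3, 5, 8, 10, 14]
n = len(values)
mean = sum(values) / n

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in values]

population_variance = sum(squared_deviations) / n        # divide by n
sample_variance = sum(squared_deviations) / (n - 1)      # divide by n - 1

# The standard library provides both versions directly.
print(population_variance, statistics.pvariance(values))
print(sample_variance, statistics.variance(values))
```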
Population Totals
Example 2.5:
Crime in US.xlsx (slide 3 of 3)
Violent and Property Crime Rates
Example 2.6:
DJIA Monthly Close.xlsx (slide 1 of 2)