Jaggia BA 1e Chap003 PPT


3

Data Visualization and Summary Measures

Business Analytics, 1e
By Sanjiv Jaggia, Alison Kelly, Kevin Lertwachara, and
Leida Chen

Copyright © 2021 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.
5/13/2020
Chapter 3 Learning Objectives
(LOs)
LO 3.1 Visualize categorical and numerical
variables.
LO 3.2 Construct and interpret a contingency table
and a stacked bar chart.
LO 3.3 Construct and interpret a scatterplot.
LO 3.4 Construct and interpret a scatterplot with a
categorical variable, a bubble plot, a line
chart, and a heat map.
LO 3.5 Calculate and interpret summary measures.
LO 3.6 Use boxplots and z-scores to identify
outliers.

Introductory Case: Investment Decision
• Dorothy Brennan works as a financial advisor at a large
investment firm. She meets with an inexperienced investor who
has some questions regarding two approaches to mutual fund
investing: growth investing versus value investing. The investor
has heard that growth funds invest in companies whose stock
prices are expected to grow at a faster rate, relative to the
overall stock market, and value funds invest in companies
whose stock prices are below their true worth. The investor has
also heard that the main component of investment return is
through capital appreciation in growth funds and through
dividend income in value funds.

• Dorothy will use the sample information for the following tasks.
1. Calculate and interpret the typical return for these two mutual funds.
2. Calculate and interpret the investment risk for these two mutual funds.
3. Determine which mutual fund provides the greater return relative to risk.
3.1: Methods to Visualize Categorical
and Numerical Variables (1/6)
• A categorical variable consists of
observations that represent labels or names.
• Summarize the data with a frequency
distribution.
– Group the data into categories and record the number of
observations that fall into each category.
– The relative frequency for each category is the proportion of
observations in each category. Multiply the proportions by
100 to get percentages.
• A bar chart depicts the frequency or relative
frequency for each category of the categorical
variable.
– Horizontal or vertical bars
– Lengths proportional to the values they are depicting
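The frequency distribution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category labels in `types` are hypothetical and are not the data behind the slides, and `frequency_distribution` is a helper name introduced here.

```python
from collections import Counter

# Hypothetical category labels (not the data used in the slides).
types = ["Analyst", "Diplomat", "Explorer", "Sentinel",
         "Diplomat", "Analyst", "Diplomat", "Explorer"]

def frequency_distribution(observations):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(observations)
    n = len(observations)
    return {cat: (count, count / n) for cat, count in counts.items()}

dist = frequency_distribution(types)
for cat, (freq, rel) in sorted(dist.items()):
    # Crude text bar chart: the bar length is proportional to the frequency.
    print(f"{cat:<10} {freq:>2} {rel * 100:5.1f}% {'#' * freq}")
```

Multiplying each relative frequency by 100 gives the percentage, and the row of `#` characters plays the role of a horizontal bar.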
3.1: Methods to Visualize Categorical
and Numerical Variables (2/6)
• Example: Myers-Briggs assessment of employees

3.1: Methods to Visualize Categorical
and Numerical Variables (3/6)
• With a numerical variable, each observation
represents a meaningful amount or count.
• Use a frequency distribution to summarize a
numerical variable.
– Instead of categories, construct a series of
intervals (classes).
• The intervals are mutually exclusive.
• The total number of intervals usually ranges from 5 to
20.
• The intervals are exhaustive.
• The intervals are easy to recognize and interpret.
– The data are more manageable using a
frequency distribution, but some detail is lost.
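A minimal sketch of a frequency distribution for a numerical variable, assuming equal-width intervals that include their lower limit and exclude their upper limit. The `returns` values, the interval limits, and the helper name `interval_frequencies` are all hypothetical choices made for illustration.

```python
def interval_frequencies(values, lower, width, k):
    """Count observations in k intervals [lower + i*width, lower + (i+1)*width)."""
    counts = [0] * k
    for v in values:
        i = int((v - lower) // width)  # index of the interval containing v
        if not 0 <= i < k:
            # The intervals must be exhaustive: every observation needs a home.
            raise ValueError(f"{v} falls outside the intervals")
        counts[i] += 1  # mutually exclusive: each value lands in exactly one interval
    return counts

# Hypothetical annual returns (in percent); not the data from the introductory case.
returns = [-12.5, 3.1, 7.8, 14.2, 9.9, 21.0, 5.4, -1.3, 33.6, 11.7]
counts = interval_frequencies(returns, lower=-20, width=10, k=6)

for i, count in enumerate(counts):
    lo = -20 + 10 * i
    print(f"{lo:>4} up to {lo + 10:>4}: {count:>2}  rel. freq. {count / len(returns):.2f}")
```

The individual returns can no longer be recovered from `counts`, which is the loss of detail the slide mentions.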
3.1: Methods to Visualize Categorical
and Numerical Variables (4/6)
• A histogram is a series of rectangles where the width
and height of each rectangle represent the interval width
and frequency (or relative frequency) of the respective
interval.
• A histogram provides information about the shape of the
distribution.
– Symmetric: mirror image of itself on both sides of its center
– Skewed: positive (elongated right tail) or negative (elongated
left tail)

3.1: Methods to Visualize Categorical
and Numerical Variables (5/6)
• Example: Consider the Growth variable from the
introductory case.

3.1: Methods to Visualize Categorical
and Numerical Variables (6/6)
• The possibility exists for unintentional, as well as purposeful,
distortions of graphical information.
• Follow these basic guidelines.
– The simplest graph should be used for a given set of data. Strive for clarity and
avoid unnecessary adornments.
– Axes should be clearly marked with the numbers of their respective scales; each
axis should be labeled.
– When creating a bar chart or a histogram, each bar/rectangle should be of the
same width. Differing widths create distortions.
– The vertical axis should not be given a very high value as an upper limit. In
these instances, the data may appear compressed so that an increase (or
decrease) of the data is not as apparent as it perhaps should be. The vertical
axis should not be stretched so that an increase (or decrease) of the data
appears more pronounced than warranted.

3.2: Methods to Visualize The
Relationship Between Two Variables
(1/4)
• Use a contingency table to examine the
relationship between two categorical
variables.
– Frequencies for two categorical variables
– Each cell represents a mutually exclusive
combination of the pair of values
• Use a stacked column chart to visualize
more than one categorical variable.
– Graphically shows the contingency table
– Allows for the comparison of composition within
each category.
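A contingency table can be sketched as a count of each pair of category values. The `pairs` data and the helper name `contingency_table` are hypothetical; the row shares printed at the end are the numbers a stacked chart would display as segments.

```python
from collections import Counter

# Hypothetical (personality type, sex) pairs; not the data used in the slides.
pairs = [("Analyst", "Female"), ("Analyst", "Male"),
         ("Diplomat", "Female"), ("Diplomat", "Female"), ("Diplomat", "Male"),
         ("Explorer", "Male"), ("Explorer", "Male"),
         ("Sentinel", "Female")]

def contingency_table(observations):
    """Return row labels, column labels, and the table of cell frequencies."""
    cell_counts = Counter(observations)  # each pair falls in exactly one cell
    rows = sorted({r for r, _ in observations})
    cols = sorted({c for _, c in observations})
    table = [[cell_counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

rows, cols, table = contingency_table(pairs)
print(f"{'':<10}" + "".join(f"{c:>8}" for c in cols))
for label, row in zip(rows, table):
    total = sum(row)
    shares = ", ".join(f"{c}: {v / total:.0%}" for c, v in zip(cols, row))
    print(f"{label:<10}" + "".join(f"{v:>8}" for v in row) + f"   ({shares})")
```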
3.2: Methods to Visualize The
Relationship Between Two Variables
(2/4)
• Example: Myers-Briggs assessment and sex

3.2: Methods to Visualize The
Relationship Between Two Variables
(3/4)
• Use a scatterplot to examine the relationship between two
numerical variables.
– Determine whether or not two numerical variables are related in some
systematic way
– Each point represents a paired observation for the two variables
– Refer to one variable as x (x-axis) and the other as y (y-axis)
• Once plotted, the graph may reveal one of the following.
– A linear relationship
– A nonlinear relationship
– No relationship

3.2: Methods to Visualize The
Relationship Between Two Variables
(4/4)
• Example: the returns for the Growth and Value funds from
the introductory case

3.3: Other Data Visualization Methods
(1/5)
• Incorporate a categorical variable within a scatterplot by
using different colors or symbols. This allows you to
determine if the relationship between x and y differs
across the values of the categorical variable.
• Example: life expectancy vs. birth rate by country
development

3.3: Other Data Visualization Methods
(2/5)
• A bubble plot shows the relationship between three
numerical variables. The third variable is represented by
the size of the bubble (points).
• Example: life expectancy vs. birth rate by GNI

3.3: Other Data Visualization Methods
(3/5)
• A line chart displays a numerical
variable as a series of data points
connected by a line.
• A line chart is especially useful for
tracking changes or trends over time.
• It is also easy for us to identify any
major changes that happened in the past
on a line chart.
• When multiple lines are plotted in the
same chart, we can compare these
observations on one or more dimensions.
3.3: Other Data Visualization Methods
(4/5)
• Example: the returns of the Growth
and Value funds from the
introductory case

3.3: Other Data Visualization Methods
(5/5)
• A heat map uses color or color intensity
to display relationships between
variables.
• Example: bookstore and book type

3.4: Summary Measures (1/20)
• We can also use numerical descriptive
measures to extract meaningful
information from data.
• These measures provide precise,
objectively determined values that are
easy to calculate, interpret, and compare
with one another.
– Central location: a typical or central value in the data
– Dispersion: variability in the data
– Shape: whether or not the distribution is symmetric
– Association: whether two numerical variables have a
linear relationship

3.4: Summary Measures (2/20)
• The term central location refers to how numerical data tend to
cluster around some middle or central value.
• Measures of central location attempt to find a typical or
central value that describes the data.
• The arithmetic mean is the primary measure of central
location.
– Referred to as the mean or the average
– Simply add up all the observations and divide by the number of
observations.
• The only thing that differs between a population mean and a
sample mean is the notation.
• The population mean is denoted as $\mu$.
– $N$ observations in the population: $\mu = \frac{\sum x_i}{N}$
– $\mu$ is a parameter
• The sample mean is denoted as $\bar{x}$.
– $n$ observations in the sample: $\bar{x} = \frac{\sum x_i}{n}$
– $\bar{x}$ is a statistic
3.4: Summary Measures (3/20)
• The mean is used extensively in data analysis. However, it
can give a misleading description of the center in the
presence of extremely small or large observations, or
outliers.
• Also calculate the median as a measure of central location.
– Middle value of a data set: an equal number of observations lie above and
below the median
– Arrange the data in ascending order
– The middle value if the number of observations is odd
– The average of the two middle values if the number of observations is even
• If the mean and median differ substantially, it is likely the variable
contains outliers.
• The mode of a variable is the observation that occurs most
frequently.
– A variable can have no mode, one mode, or several modes
– One mode: unimodal
– Two modes: bimodal
– Less useful measure of centrality for more than three modes
– Mode is a useful summary for a categorical variable
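The three measures of central location can be compared with Python's standard `statistics` module. The `returns` values are hypothetical and were chosen so that one extreme observation pulls the mean away from the median.

```python
import statistics

# Hypothetical returns; 62.0 is a deliberately extreme observation.
returns = [4.0, 7.0, 7.0, 9.0, 13.0, 62.0]

mean = statistics.mean(returns)        # sum of observations / number of observations
median = statistics.median(returns)    # average of the two middle values (n is even)
modes = statistics.multimode(returns)  # list of the most frequent value(s)

print(f"mean = {mean}, median = {median}, mode(s) = {modes}")

# Dropping the extreme observation moves the mean a lot and the median only a little.
trimmed = returns[:-1]
print(f"without 62.0: mean = {statistics.mean(trimmed)}, "
      f"median = {statistics.median(trimmed)}")
```

Here the mean (17.0) is more than twice the median (8.0), which is the kind of gap that suggests an outlier.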
3.4: Summary Measures (4/20)
• Example: the mean and median for the Growth
and Value variables from the introductory case

3.4: Summary Measures (5/20)
• With Excel

3.4: Summary Measures (6/20)
• With Excel

3.4: Summary Measures (7/20)
• With R

3.4: Summary Measures (8/20)
• The median is the middle observation. Half of
the observations fall below the median and half
fall above it.
• The median is also called the 50th
percentile.
• A percentile is technically a measure of
location; however, it is also used as a
measure of relative position.
• The pth percentile divides a variable into two
parts.
– Approximately p percent of the observations are
less than the pth percentile.
– Approximately (100 - p) percent of the observations
are greater than the pth percentile.
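The slides do not state a formula for locating a percentile, so the sketch below assumes one common convention: the pth percentile sits at position (n + 1)p/100 in the sorted data, with linear interpolation between neighbors. Other software may use a different rule and return slightly different values. The data and the function name `percentile` are hypothetical.

```python
def percentile(values, p):
    """pth percentile via the location rule L = (n + 1) * p / 100 (an assumed convention)."""
    data = sorted(values)
    n = len(data)
    loc = (n + 1) * p / 100           # 1-based position in the sorted data
    if loc < 1 or loc > n:
        raise ValueError("percentile location falls outside the data")
    lower = int(loc)                  # position at or just below loc
    frac = loc - lower                # how far to move toward the next observation
    if lower == n:
        return data[-1]
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])

# Hypothetical observations, already sorted for readability.
data = [2, 4, 5, 7, 8, 10, 11, 13, 15, 18, 21]

for p in (25, 50, 75, 90):
    print(f"{p}th percentile: {percentile(data, p):.1f}")
```

With these 11 observations the 50th percentile is the sixth sorted value, which is also the median.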
3.4: Summary Measures (9/20)
• Example: the first quartile of the Growth
variable and the third quartile of the
Value variable from the introductory
case

3.4: Summary Measures (10/20)
• Measures of central location do not describe the
underlying dispersion.
• Measures of dispersion gauge the variability.
– 0 indicates all the observations are identical
– Increases as the observations become more diverse
• The range is the simplest measure.
– Difference between the maximum and minimum
– Not good because it focuses solely on extreme observations
• The interquartile range (IQR) is the difference between the
third quartile and the first quartile: $IQR = Q3 - Q1$.
– The range of the middle 50% of the variable
– Does not depend on the extreme observations
• The mean absolute deviation (MAD) is the average of the
absolute differences between the observations and the
mean.
– Population: $MAD = \frac{\sum |x_i - \mu|}{N}$
– Sample: $MAD = \frac{\sum |x_i - \bar{x}|}{n}$
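The range, the IQR, and the MAD can be computed side by side. This is a sketch on hypothetical data; the quartiles come from `statistics.quantiles` with its default method, which is one of several quartile conventions and may not match the values Excel or R report.

```python
import statistics

# Hypothetical returns; 40.0 is a deliberately extreme observation.
returns = [-3.0, 1.0, 4.0, 6.0, 8.0, 9.0, 12.0, 15.0, 18.0, 22.0, 40.0]

value_range = max(returns) - min(returns)       # uses only the two extremes

q1, q2, q3 = statistics.quantiles(returns, n=4)  # quartiles under the default rule
iqr = q3 - q1                                    # spread of the middle 50%

mean = statistics.mean(returns)
mad = sum(abs(x - mean) for x in returns) / len(returns)  # average absolute deviation

print(f"range = {value_range}, IQR = {iqr}, MAD = {mad:.3f}")
```

Replacing 40.0 with a much larger value would change the range and the MAD but leave the IQR untouched.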
3.4: Summary Measures (11/20)
• The variance and the standard deviation
are the two most widely used measures of
dispersion.
– Compute the average of the squared
differences
– The squaring of the differences emphasizes
larger differences
• The population variance is denoted $\sigma^2$.
• The sample variance is denoted $s^2$.
• The units of each are the units of the
underlying variable squared.
• The standard deviation of each is the
positive square root ($\sigma$ for the population, $s$ for the sample).
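A short sketch of the variance and standard deviation on hypothetical data. It assumes the usual convention that the population variance divides the sum of squared differences by N while the sample variance divides by n - 1, and cross-checks the hand computation against the standard `statistics` module.

```python
import math
import statistics

# Hypothetical observations.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

mean = sum(data) / n
squared_diffs = [(x - mean) ** 2 for x in data]  # squaring emphasizes larger differences

pop_var = sum(squared_diffs) / n         # population variance: divide by N
samp_var = sum(squared_diffs) / (n - 1)  # sample variance: divide by n - 1
pop_sd = math.sqrt(pop_var)              # back in the units of the data
samp_sd = math.sqrt(samp_var)

# Cross-check against the standard library.
assert math.isclose(pop_var, statistics.pvariance(data))
assert math.isclose(samp_var, statistics.variance(data))

print(f"population: variance = {pop_var}, sd = {pop_sd}")
print(f"sample:     variance = {samp_var:.4f}, sd = {samp_sd:.4f}")
```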
3.4: Summary Measures (12/20)
• Example: dispersion statistics for the
Growth and Value variables from the
introductory case

3.4: Summary Measures (13/20)
• In general, investments with higher returns also
carry higher risk.
• Investments include financial assets such as
stocks, bonds, and mutual funds.
• The average return represents an investor’s
reward, whereas variance, or equivalently
standard deviation, corresponds to risk.
• The Sharpe ratio is the “reward-to-variability” ratio.
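– Calculated as $\frac{\bar{x}_I - \bar{R}_f}{s_I}$, where $\bar{x}_I$ and $s_I$ are the
mean return and standard deviation of the investment
– $\bar{R}_f$ is the mean return for a risk-free asset such as a
Treasury bill (T-bill)
– The numerator measures the extra reward for the
added risk; this difference is called the excess return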
– The higher the Sharpe ratio, the better the investment
compensates its investors for risk
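The reward-to-variability idea can be sketched as the mean excess return divided by the sample standard deviation. The two return series, the risk-free mean of 2.0, and the names `fund_a`, `fund_b`, and `sharpe_ratio` are hypothetical; they are not the Growth and Value data.

```python
import statistics

def sharpe_ratio(returns, risk_free_mean):
    """(mean return - mean risk-free return) / sample standard deviation."""
    excess_return = statistics.mean(returns) - risk_free_mean
    return excess_return / statistics.stdev(returns)

# Hypothetical annual returns in percent.
fund_a = [10.0, -5.0, 20.0, 15.0, 0.0]  # higher mean, much more variable
fund_b = [6.0, 4.0, 8.0, 7.0, 5.0]      # lower mean, much steadier
risk_free = 2.0                          # assumed mean T-bill return

sharpe_a = sharpe_ratio(fund_a, risk_free)
sharpe_b = sharpe_ratio(fund_b, risk_free)
print(f"fund A: {sharpe_a:.4f}, fund B: {sharpe_b:.4f}")
```

In this made-up example fund A has the higher average return, yet fund B has the higher Sharpe ratio because its returns vary far less.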
3.4: Summary Measures (14/20)
• Example: Compute the Sharpe ratios for the
Growth and Value fund assuming .

• Growth:
• Value:

3.4: Summary Measures (15/20)
• The skewness coefficient measures
the degree to which a distribution
is not symmetric about its mean.
– Calculated as
– Symmetric: coefficient of 0
– Positively skewed: positive coefficient
– Negatively skewed: negative
coefficient
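The formula on this slide did not survive extraction, so the sketch below substitutes a common moment-based definition, the third central moment divided by the second central moment raised to the power 1.5. That is an assumption: the slide, Excel, and R packages may apply a sample-size adjustment and return somewhat different values, although the sign is interpreted the same way. The data and the function name `skewness` are hypothetical.

```python
def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (an assumed definition, no sample adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 3, 4, 20]          # elongated right tail
left_tail = [-x for x in right_tail]   # mirror image: elongated left tail

for name, data in [("symmetric", symmetric),
                   ("right tail", right_tail),
                   ("left tail", left_tail)]:
    print(f"{name:<10} skewness = {skewness(data):+.3f}")
```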

3.4: Summary Measures (16/20)
• The kurtosis coefficient is a summary measure that tells
us whether the tails of the distribution are more or less
extreme than the normal distribution.
– Tails that are more extreme than the normal
distribution are called leptokurtic; suggests outliers
– A platykurtic distribution is one that has shorter or less
extreme tails than the normal distribution
– Calculated as
• The kurtosis coefficient of a normal distribution is 3.
– Kurtosis more than three: more extreme tails than a
normal distribution
– Kurtosis less than three: less extreme tails than a
normal distribution
• The excess kurtosis is the kurtosis coefficient minus 3.
– Positive: more extreme tails than a normal distribution
– Negative: less extreme tails than a normal distribution
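The slide's own formula is missing from this extraction, so this sketch assumes the plain moment-based definition, the fourth central moment divided by the squared second central moment, for which a normal distribution gives 3. Software that applies a sample-size adjustment can report different numbers. The data sets and function names are hypothetical.

```python
def kurtosis(values):
    """Moment-based kurtosis m4 / m2**2 (an assumed definition, no sample adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2

def excess_kurtosis(values):
    """Kurtosis coefficient minus 3, so a normal distribution scores 0."""
    return kurtosis(values) - 3

heavy_tails = [0] * 8 + [-10, 10]  # mostly identical values plus two extremes
light_tails = [-1, 1, -1, 1]       # no tails at all

print(f"heavy tails: kurtosis = {kurtosis(heavy_tails)}, "
      f"excess = {excess_kurtosis(heavy_tails)}")
print(f"light tails: kurtosis = {kurtosis(light_tails)}, "
      f"excess = {excess_kurtosis(light_tails)}")
```

The first data set comes out leptokurtic (positive excess kurtosis) and the second platykurtic (negative excess kurtosis).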
3.4: Summary Measures (17/20)
• Example: skewness and kurtosis for the
Growth and Value variables from the
introductory case. Note Excel gives
excess kurtosis.

3.4: Summary Measures (18/20)
• We used a scatterplot to visually assess whether two
numerical variables have some type of systematic
relationship.
• There are two numerical measures of association that
quantify the direction (and strength) of the linear
relationship between x and y. These are not appropriate
when the relationship is not linear.
• Covariance measures the direction of the linear
relationship.
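– Population: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
– Sample: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$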
– Negative: negative linear relationship
– Positive: positive linear relationship
– Zero: no linear relationship
• Covariance is hard to interpret because it is sensitive
to the units of measurement. We cannot comment on
the strength of the linear relationship.
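The unit sensitivity of covariance is easy to demonstrate. This sketch assumes the usual sample covariance with an n - 1 divisor; the paired data and the function name `sample_covariance` are hypothetical.

```python
def sample_covariance(x, y):
    """Sum of (x_i - x_bar)(y_i - y_bar) divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

cov = sample_covariance(x, y)

# Re-express x in different units (for example, dollars to cents).
x_rescaled = [100 * v for v in x]
cov_rescaled = sample_covariance(x_rescaled, y)

print(f"covariance = {cov}, after rescaling x = {cov_rescaled}")
```

The sign stays positive, but the magnitude grows by a factor of 100 even though the relationship itself has not changed, so the size of a covariance says nothing about strength.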
3.4: Summary Measures (19/20)
• The correlation coefficient describes both the
direction and strength of the linear relationship
between x and y.
– Population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
– Sample: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
– Negative: negative linear relationship
– Positive: positive linear relationship
– Zero: no linear relationship
• The correlation is unit-free.
• The correlation is between -1 and 1.
– Correlation is -1: perfect negative linear relationship
– Correlation is 0: not linearly related
– Correlation is 1: perfect positive linear relationship
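A sketch of the sample correlation coefficient as the sample covariance divided by the product of the two sample standard deviations. The data and the function name `sample_correlation` are hypothetical; the example checks that rescaling a variable leaves the result unchanged and that an exact linear relationship gives -1 or 1.

```python
import statistics

def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(x)
    x_bar = statistics.mean(x)
    y_bar = statistics.mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    return s_xy / (statistics.stdev(x) * statistics.stdev(y))

# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = sample_correlation(x, y)
r_rescaled = sample_correlation([100 * v for v in x], y)  # change the units of x
r_perfect_neg = sample_correlation(x, [-2 * v for v in x])  # y is an exact linear function of x

print(f"r = {r:.4f}, after rescaling x = {r_rescaled:.4f}, "
      f"perfect negative = {r_perfect_neg:.4f}")
```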

3.4: Summary Measures (20/20)
• Example: the correlation between
the Growth and Value variables
from the introductory case
• With Excel: CORREL(B2:B36, C2:C36) = 0.6527
• With R:

3.5: Detecting Outliers (1/7)
• Extremely large or small observations for a variable are referred to as
outliers.
• Outliers can unduly influence summary statistics, such as the mean
or the standard deviation.
• In a small sample, the impact of outliers is particularly pronounced.
• Sometimes, outliers may just be due to random variations, in which
case the relevant observations should remain in the data set.
• Alternatively, outliers may indicate bad data due to incorrectly
recorded observations or incorrectly included observations in the
data set.
• In such cases, the relevant observations should be corrected or
simply deleted from the data set.
• However, there are no universally agreed upon methods for treating
outliers.
• In any event, it is important to be able to identify potential outliers so
that one can take corrective actions, if needed.
• We first construct a boxplot, which is an effective tool for identifying
outliers. A series of boxplots is also useful when comparing similar
information for a variable gathered at another place or time.
• Another method for detecting outliers is to calculate z-scores.
3.5: Detecting Outliers (2/7)
• A common way to quickly summarize a variable is to use a five-
number summary.
• A five-number summary shows the minimum, the quartiles (Q1,
Q2, and Q3), and the maximum.
• A boxplot, also referred to as a box-and-whisker plot, is a way to
graphically display a five-number summary.
– Draw a box encompassing the first and third quartiles.
– Draw a dashed vertical line in the box at the median.
– Calculate the IQR. Draw a whisker that extends from Q1 to the
minimum value that is not farther than 1.5*IQR from Q1. Similarly,
draw a whisker that extends from Q3 to the maximum value that is not
farther than 1.5*IQR from Q3.
– Use an asterisk (or another symbol) to indicate observations that
are farther than 1.5*IQR from the box. These observations are
considered outliers.
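The numbers behind a boxplot can be computed without drawing anything. This is a sketch on hypothetical data: the helper name `boxplot_summary` is introduced here, and the quartiles use the default method of `statistics.quantiles`, which is only one of several conventions, so other software may place the box edges slightly differently.

```python
import statistics

def boxplot_summary(values):
    """Five-number summary, whisker ends, and outliers under the 1.5*IQR rule."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr   # farthest a whisker may reach below the box
    high_limit = q3 + 1.5 * iqr  # farthest a whisker may reach above the box
    inside = [x for x in data if low_limit <= x <= high_limit]
    outliers = [x for x in data if x < low_limit or x > high_limit]
    return {
        "five_number": (data[0], q1, q2, q3, data[-1]),
        "whiskers": (inside[0], inside[-1]),  # most extreme non-outlying values
        "outliers": outliers,
    }

# Hypothetical returns; 40.0 is a deliberately extreme observation.
returns = [-3.0, 1.0, 4.0, 6.0, 8.0, 9.0, 12.0, 15.0, 18.0, 22.0, 40.0]
summary = boxplot_summary(returns)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Here the upper whisker stops at 22.0 because 40.0 lies more than 1.5*IQR above Q3 and is flagged as an outlier.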

3.5: Detecting Outliers (3/7)
• A boxplot is also used to informally gauge the shape of
the distribution.
• Symmetry is implied if the median is in the center of the
box and the left/right whiskers are equidistant from their
respective quartiles.
• If the median is left of center and the right whisker is
longer than the left whisker, then the distribution is
positively skewed.
• Similarly, if the median is right of center and the left
whisker is longer than the right whisker, then the
distribution is negatively skewed.
• If outliers exist, we need to include them when
comparing the lengths of the left and right whiskers.
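
As a rough sketch, the informal shape rules above can be written as a small Python function. The name `boxplot_shape`, the "inconclusive" label for mixed signals, and the sample values are assumptions made for illustration. Following the last bullet, the tail lengths run all the way to the minimum and maximum, so any outliers are included.

```python
import math
import statistics

def boxplot_shape(values):
    """Informally classify the shape of a distribution from its boxplot."""
    data = sorted(values)
    # Assumption: "inclusive" quartile interpolation; other conventions exist.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    # Position of the median inside the box.
    left_box, right_box = median - q1, q3 - median
    # Tail lengths measured to the minimum and maximum (outliers included).
    left_tail, right_tail = q1 - data[0], data[-1] - q3
    if math.isclose(left_box, right_box) and math.isclose(left_tail, right_tail):
        return "roughly symmetric"
    if left_box < right_box and left_tail < right_tail:
        return "positively skewed"
    if left_box > right_box and left_tail > right_tail:
        return "negatively skewed"
    return "inconclusive"  # the two signals disagree

# Made-up samples for illustration.
symmetric_shape = boxplot_shape([1, 2, 3, 4, 5, 6, 7, 8, 9])
right_shape = boxplot_shape([1, 2, 3, 4, 5, 7, 10, 15, 30])
left_shape = boxplot_shape([-30, -15, -10, -7, -5, -4, -3, -2, -1])
print(symmetric_shape, right_shape, left_shape, sep=" | ")
```

Real data rarely produce exactly equal distances, so in practice the comparison is made by eye on the plot rather than with a strict test like this one.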

3.5: Detecting Outliers (4/7)
• Example: construct a boxplot for the Growth and Value
variables from the introductory case.
• Excel does not provide this capability in a straightforward
way.
• With R

3.5: Detecting Outliers (5/7)
• The empirical rule makes precise statements regarding
the percentage of observations that fall within a
specified number of standard deviations from the mean.
• Assume the observations are drawn from a relatively
symmetric and bell-shaped distribution, perhaps confirmed by an
inspection of its histogram. With sample mean $\bar{x}$ and sample
standard deviation $s$:
– Approximately 68% of all observations fall in the interval $\bar{x} \pm s$.
– Approximately 95% of all observations fall in the interval $\bar{x} \pm 2s$.
– Almost all (about 99.7%) observations fall in the interval $\bar{x} \pm 3s$.
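
One way to see the empirical rule at work is to simulate bell-shaped data and count how many observations land within one, two, and three standard deviations of the mean. The Python sketch below is illustrative only: the helper `share_within`, the seed, and the simulated mean and standard deviation are all assumptions, and the shares will vary slightly from sample to sample.

```python
import random
import statistics

def share_within(values, k):
    """Fraction of observations within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    inside = [v for v in values if mean - k * sd <= v <= mean + k * sd]
    return len(inside) / len(values)

# Simulated bell-shaped data; the parameters are arbitrary.
rng = random.Random(42)
sample = [rng.gauss(10.0, 4.0) for _ in range(10_000)]

shares = {k: share_within(sample, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

With data that are strongly skewed, the same calculation can give shares far from 68%, 95%, and 99.7%, which is why the symmetry assumption matters.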

3.5: Detecting Outliers (6/7)
• It is often instructive to use the mean and the standard
deviation to find the relative location of an observation.
• We use the z-score to find the relative position of an
observation by dividing the difference of the observation from
the mean by the standard deviation: $z = \frac{x - \bar{x}}{s}$.
• A z-score is a unitless measure.
• It measures the distance of an observation from the mean in
terms of standard deviations.
• Converting observations into z-scores is also called
standardizing the observations.
• Standardization is a common technique used in data analytics
when dealing with variables measured using different scales.
• If the distribution of a variable is relatively symmetric and bell-
shaped, we can also use z-scores to detect outliers.
– Since almost all observations fall within three standard deviations of
the mean, it is common to treat an observation as an outlier if its z-
score is more than 3 or less than −3.
– Such observations must be reviewed to determine if they should
remain in the data set.
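
The standardization and outlier check described above might look like this in Python. It is a minimal sketch: the function names, the threshold parameter, and the `returns` values are hypothetical, and the sample standard deviation is assumed.

```python
import statistics

def z_scores(values):
    """Standardize observations: (observation - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return [(v - mean) / sd for v in values]

def z_score_outliers(values, threshold=3.0):
    """Return observations whose z-score is above threshold or below -threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Made-up data: twenty ordinary values and one extreme value.
returns = [9.0, 11.0] * 10 + [60.0]
scores = z_scores(returns)
outliers = z_score_outliers(returns)
print(round(max(scores), 2))  # z-score of the extreme value
print(outliers)               # [60.0]
```

The example uses 21 observations on purpose. With the sample standard deviation, no z-score can exceed (n − 1)/√n in absolute value, so the ±3 rule cannot flag anything in a data set of 10 or fewer observations.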
3.5: Detecting Outliers (7/7)
• Example: What are the z-scores for the
minimum and maximum values of the Growth
and Value variables from the introductory case?

• Growth minimum:
• Growth maximum:
• Value minimum:
• Value maximum:
• Recall that the box-and-whisker plot identified the
Growth maximum and the Value minimum as
outliers.