4 - SM and Data Visualization
DS342
3
Learning Objectives (LOs)
31 October 2024
Dr. Marwa Sabry 4
Purpose of analysis: Describe individual variables
• Cross-sectional data
  • Categorical variables
    • Counts of categories
    • Charts of counts
  • Numeric variables
    • Summary measures (mean, median, standard deviation, quartiles, etc.)
    • Histograms, box plots
• Time series
  • Time series charts for patterns
  • Trend lines
Purpose of analysis: Find relationships between variables
• Categorical vs categorical
  • Tables of joint counts (cross-tabs or pivot tables)
  • Charts of joint counts
• Categorical vs numeric
  • Summary measures by category
  • Side-by-side boxplots
  • Pivot tables
• Numeric vs numeric
  • Scatterplots
  • Trend lines (regression)
  • Correlations (and covariances)
  • Pivot tables
3.3: Detecting Outliers
• The empirical rule makes precise statements regarding the percentage of
observations that fall within a specified number of standard deviations from the
mean.
• Assume the observations are drawn from a relatively symmetric, bell-shaped
distribution, which can be checked by inspecting the histogram.
• Approximately 68% of all observations fall in the interval x̄ ± s.
• Approximately 95% of all observations fall in the interval x̄ ± 2s.
• Almost all (approximately 99.7%) of the observations fall in the interval x̄ ± 3s.
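As a rough check of these percentages, the sketch below simulates bell-shaped data and counts the share of observations within one, two, and three standard deviations of the mean. The simulated sample, the seed, and the function name `share_within` are illustrative assumptions, not course data.

```python
import random
import statistics


def share_within(data, k):
    """Return the proportion of observations within k sample standard deviations of the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation
    lower, upper = mean - k * s, mean + k * s
    return sum(lower <= x <= upper for x in data) / len(data)


random.seed(342)  # arbitrary seed so the run is repeatable
# Hypothetical bell-shaped sample (assumed mean 74, standard deviation 8)
sample = [random.gauss(74, 8) for _ in range(10_000)]

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(sample, k):.1%}")
```

The printed shares should land close to 68%, 95%, and 99.7%; they will not match exactly because the sample is random.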
Working Example 3.10:
P.101
Answer:
The empirical rule states that, for a bell-shaped distribution, approximately 95% of
the observations fall within two standard deviations of the mean.
The scores bounding this interval are:
1. x̄ − 2s = 74 − 2(8) = 58
2. x̄ + 2s = 74 + 2(8) = 90
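The same arithmetic can be sketched in Python. The mean of 74 and the standard deviation of 8 come from the answer above; the function name `empirical_interval` is an illustrative choice.

```python
def empirical_interval(mean, s, k=2):
    """Return (lower, upper) for the interval mean ± k*s."""
    return mean - k * s, mean + k * s


lower, upper = empirical_interval(74, 8)
print(lower, upper)  # 58 90
```

Passing `k=1` or `k=3` gives the intervals for the 68% and 99.7% statements of the rule.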
3.3: Detecting Outliers
• It is often instructive to use the mean and the standard deviation to find
the relative location of an observation.
• We use the z-score to find the relative position of an observation by
dividing the difference of the observation from the mean by the
standard deviation: z = (x − x̄) / s.
• A z-score is a unitless measure.
• It measures the distance of an observation from the mean in terms of
standard deviations.
• Converting observations into z-scores is also called standardizing the
observations.
• Growth minimum: 𝑧 = (−40.90 − 15.755) / 23.7993 = −2.38
• Growth maximum: 𝑧 = 2.68
• Value minimum: 𝑧 = −3.28
• Value maximum: 𝑧 = 1.78
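A minimal Python sketch of the same calculation follows. It reuses the Growth mean (15.755) and standard deviation (23.7993) from the computation above. The function names `z_score` and `is_outlier` are illustrative, and the cutoff of 3 is an assumption: flagging observations with |z| > 3 is a common convention, but the slides do not state a cutoff.

```python
def z_score(x, mean, s):
    """Standardize x: distance from the mean in units of standard deviations."""
    return (x - mean) / s


def is_outlier(z, cutoff=3.0):
    """Flag a z-score whose absolute value exceeds the cutoff (assumed to be 3)."""
    return abs(z) > cutoff


growth_min_z = z_score(-40.90, 15.755, 23.7993)
print(round(growth_min_z, 2))    # -2.38
print(is_outlier(growth_min_z))  # False
print(is_outlier(-3.28))         # True (the Value minimum)
```

Under that assumed cutoff, only the Value minimum among the four z-scores listed would be flagged.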
Working Example 3.12:
Growth-Value, P.103
=STANDARDIZE(B2,AVERAGE($B$2:$B$37),STDEV.S($B$2:$B$37))
=STANDARDIZE(C2,AVERAGE($C$2:$C$37),STDEV.S($C$2:$C$37))
Chapter 4
Data Visualization
Learning Objectives (LOs)
Transaction
Working Example 4.1:
Transactions, P.111
• Example: Consider the Repeat and Sex variables from the introductory case.
• With Excel
• Use COUNTIF
• Insert > Insert Column or Bar Chart > 2-D Column
• Add Chart Elements > Data Labels > Outside End
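Outside Excel, the COUNTIF step can be sketched in Python with `collections.Counter`. The handful of Sex values below is made up for illustration and is not the Transactions data; the text bars only stand in for the column chart.

```python
from collections import Counter

# Hypothetical values of a categorical variable (not the Transactions file)
sex = ["Female", "Male", "Female", "Female", "Male", "Female"]

counts = Counter(sex)  # frequency of each category
n = len(sex)

for category, count in counts.most_common():
    # category, frequency, relative frequency, and a crude text bar
    print(f"{category:<8}{count:>3}  {count / n:.1%}  {'#' * count}")
```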
4.1: Methods to Visualize a Single
Variable
• With a numerical variable, each observation represents a meaningful
amount or count.
• Although different from a categorical variable, we still use a frequency
distribution to summarize a numerical variable.
• For a numerical variable, a frequency distribution groups the observations
into intervals and records the number of observations that fall into each
interval.
• The relative frequency for each interval equals the proportion of
observations in each interval.
• The data are more manageable using a frequency distribution, but some
detail is lost.
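The grouping step can be sketched as follows. The income values and interval limits are hypothetical, and the choice to make each interval exclude its lower limit and include its upper limit is an assumption made for this sketch.

```python
def frequency_distribution(values, edges):
    """Count the values falling in each interval (edges[i], edges[i + 1]]."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        for i in range(len(edges) - 1):
            if edges[i] < x <= edges[i + 1]:
                counts[i] += 1
                break
    return counts


# Hypothetical annual incomes in $1,000s (not the Transactions data)
incomes = [32, 48, 55, 61, 67, 72, 75, 88, 95, 110, 140, 210]
edges = [0, 50, 100, 150, 200, 250]

counts = frequency_distribution(incomes, edges)
relative = [c / len(incomes) for c in counts]  # proportion in each interval

for i, (c, r) in enumerate(zip(counts, relative)):
    print(f"{edges[i]:>3} < x <= {edges[i + 1]:<3}  {c:>2}  {r:.2f}")
```

The five counts replace twelve individual values, which is the loss of detail the last bullet refers to.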
Working Example 4.2:
Transactions, P.117
• Example: The Income variable from the introductory case.
• With Excel: Data > Data Analysis > Histogram
✓ From the previous Table, we see that annual income for the 200 customers
ranges from $0 to $250,000.
✓ The majority of customers (66%) earned between $50,000 and
$100,000.
✓ Only 5% of the customers earned more than $150,000.
4.1: Methods to Visualize a Single
Variable
• The possibility exists for unintentional, as well as purposeful,
distortions of graphical information.
• The simplest graph should be used for a given set of data. Strive
for clarity and avoid unnecessary adornments.
• Axes should be clearly marked with the numbers of their respective
scales; each axis should be labeled.
• When creating a bar chart or a histogram, each bar/rectangle should
be of the same width.
• Differing widths create distortions.
4.2: Methods to Visualize the Relationship
between Two Variables
Categorical Variables:
• Use a contingency table to summarize and examine the
relationship between two categorical variables.
• Frequencies for two categorical variables
• Each cell represents a mutually exclusive combination of
the pair of values
• The contingency table allows us to present and interpret the raw
data in a much more manageable format.
• Contingency tables are widely used in marketing as well as other
business applications.
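A contingency table can be sketched in Python by counting pairs of values. The records below are hypothetical and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical (location, purchase) pairs, one per customer
records = [
    ("South", "Yes"), ("South", "Yes"), ("Midwest", "No"), ("West", "Yes"),
    ("Midwest", "Yes"), ("Midwest", "No"), ("West", "No"),
]

table = Counter(records)  # each key is one mutually exclusive cell
rows = sorted({location for location, _ in records})
cols = sorted({purchase for _, purchase in records})

print(f"{'':<10}" + "".join(f"{c:>6}" for c in cols))
for r in rows:
    # Counter returns 0 for combinations that never occur
    print(f"{r:<10}" + "".join(f"{table[(r, c)]:>6}" for c in cols))
```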
Working Example 4.3:
Promotion, P.126
• Example continued
• With Excel: Insert > Pivot Table
• Insert > Insert Column or Bar Chart > Stacked Column
✓ We can readily see from previous Tables that of the 600 e-mail recipients, 410 of
them made a purchase using the promotional discount. This translates into a
68.33% positive response rate, suggesting that this marketing strategy was
successful.
✓ However, there do appear to be some differences depending on location, and these
differences are apparent from the previous Figure.
✓ Recipients residing in the South and West were much more likely to make a
purchase (130 out of 154, about 84%, and 101 out of 119, about 85%) than those
residing in the Midwest (77 out of 184, about 42%).
✓ It would be wise for the retailer to examine if there are other traits that the
customers in the South and West share (age, gender, etc.). That way, in the next
marketing campaign, the e-mails can be even more targeted.
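The response rates behind this commentary can be recomputed directly from the counts quoted above. Only the three regions named in the text are included; the function name `rate` is an illustrative choice.

```python
# (purchases, recipients) as quoted in the commentary above
regions = {"South": (130, 154), "West": (101, 119), "Midwest": (77, 184)}
overall = (410, 600)


def rate(purchases, recipients):
    """Share of recipients who made a purchase."""
    return purchases / recipients


for name, (purchases, recipients) in regions.items():
    print(f"{name:<8}{rate(purchases, recipients):.2%}")
print(f"{'Overall':<8}{rate(*overall):.2%}")
```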
4.2: Methods to Visualize the Relationship
between Two Variables
Numerical Variables:
• A scatterplot is used to determine whether or not two numerical
variables are related in some systematic way.
• Each point represents a paired observation for the two variables
• Refer to one variable as x (x-axis) and the other as y (y-axis)
• Once plotted, the graph may reveal one of the following:
• A linear relationship
• A nonlinear relationship
• No relationship
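As a numeric companion to the scatterplot, the strength of a linear relationship can be summarized with the sample correlation coefficient. The sketch below computes it from scratch on hypothetical paired data; the function name `pearson_r` and the values are assumptions for illustration.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient for paired observations x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations (not the Transactions data)
income = [40, 55, 60, 75, 90, 120]      # x-axis, in $1,000s
purchase = [50, 80, 70, 110, 130, 160]  # y-axis, in $

print(round(pearson_r(income, purchase), 2))
```

A value near +1 or −1 points to a strong linear relationship. A value near zero does not rule out a nonlinear relationship, which is why the scatterplot is still worth inspecting.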
Working Example 4.4:
Transactions, P.130
• Example: Consider the customer’s purchase amount and annual
income.
• With Excel: Insert > Insert Scatter or Bubble Chart > Scatter
✓ From the previous Figure, we can infer that there seems to be a
positive relationship between Purchase and Income; that is, those
customers with higher incomes tend to make purchases of a higher
amount.
4.3: Other Data Visualization Methods
• With Excel: Insert > Insert Scatter or Bubble Chart > Scatter; add Edit
Series
Working Example 4.5:
Birth-Life, P.134
• Example continued
Working Example 4.6:
Birth-Life, P.136
• Example: life expectancy vs. birth rate by country development
• With Excel: Insert > Insert Scatter or Bubble Chart > Bubble
✓ From the previous Figure, we see that a country’s birth rate and its average
life expectancy display a negative linear relationship.
✓ We also see that countries with low birth rates and high life expectancies
have higher GNI per capita, which is indicative of developed countries.
4.3: Other Data Visualization Methods
• With Excel: Insert > Insert Line or Area Chart > Line
Working Example 4.7:
Apple-Merck, P.137
✓ The line charts in the previous Figure show the monthly stock prices for Apple and
Merck over the years 2016 through 2019. Both stocks rose over this period;
however, the rise in Apple’s stock price was far more dramatic than the rise in
Merck’s stock price.
✓ There is also a lot more volatility in Apple’s stock price. Specifically, we see a
dramatic decline in Apple’s stock at the end of 2018. This dip corresponded to
news that the company would no longer offer unit sales data for its products.
✓ At the time, some wondered if this lack of transparency presaged weaker iPhone
sales in the future. Fortunately for Apple, this prediction did not materialize.
Working Example 4.8:
Bookstores, P.139
• Example continued with Excel
• Insert > Pivot Table
• Home > Conditional Formatting > Color Scales > Green – Yellow – Red Color
Scale
✓ The heat maps in the previous Figure reveal that customers’ book preferences
differ across store locations.
✓ For example, romance fiction is the most popular category at Store2 but the
least popular at Store3.
✓ Self-help books are the least popular books at Store1.
✓ The management can use this information to make decisions about how many
copies of each type of book to stock at each store.
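The idea behind a color scale can be sketched in Python by mapping each cell of a pivot table to a color according to where its value sits between the smallest and largest values. This is a coarse three-bucket simplification, not Excel's gradient; the sales figures and the function name `color_bucket` are hypothetical.

```python
def color_bucket(value, low, high):
    """Map a value to green (high), yellow (middle), or red (low) by its position in [low, high]."""
    if high == low:
        return "yellow"  # no spread, so every cell gets the middle color
    position = (value - low) / (high - low)
    if position >= 2 / 3:
        return "green"
    if position >= 1 / 3:
        return "yellow"
    return "red"


# Hypothetical unit sales by store and book type (not the Bookstores data)
sales = {
    "Store1": {"Romance": 120, "Self-help": 40, "Mystery": 90},
    "Store2": {"Romance": 150, "Self-help": 80, "Mystery": 70},
}

all_values = [v for row in sales.values() for v in row.values()]
low, high = min(all_values), max(all_values)

for store, row in sales.items():
    cells = ", ".join(f"{book}: {color_bucket(v, low, high)}" for book, v in row.items())
    print(f"{store}  {cells}")
```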
Thank You ☺