4 - SM and Data Visualization

The document outlines key learning objectives related to summary measures in data analytics, including calculating and interpreting measures of location, dispersion, and identifying outliers using boxplots and z-scores. It emphasizes the importance of visualizing data through various methods such as frequency distributions, bar charts, and histograms to enhance understanding. Additionally, it discusses the detection of outliers and the use of z-scores for standardization in data analysis.

Uploaded by

omar Yousef

Data Analytics

DS342

31 October 2024 Dr. Marwa Sabry 1-2


Chapter 3
Summary Measures

Learning Objectives (LOs)

LO 3.1 Calculate and interpret measures of location.


LO 3.2 Calculate and interpret measures of dispersion, shape, and
association.
LO 3.3 Use boxplots and z-scores to identify outliers.

Purpose of analysis: Describe individual variables
• Categorical variables
  • Counts of categories
  • Charts of counts
• Numeric variables
  • Cross-sectional data
    • Summary measures (mean, median, standard deviation, quartiles, etc.)
    • Histograms, box plots
  • Time series
    • Time series charts for patterns
    • Trend lines
Purpose of analysis: Find relationships between variables
• Categorical vs categorical
  • Tables of joint counts (cross-tabs or pivot tables)
  • Charts of joint counts
• Categorical vs numeric
  • Summary measures by category
  • Side-by-side boxplots
  • Pivot tables
• Numeric vs numeric
  • Scatterplots
  • Trend lines (regression)
  • Correlations (and covariances)
  • Pivot tables

3.3: Detecting Outliers

• Extremely large or small observations for a variable are referred to as outliers.


• Outliers can unduly influence summary statistics, such as the mean or the standard
deviation.
• In a small sample, the impact of outliers is particularly pronounced.
• Sometimes, outliers may just be due to random variations, in which case the relevant
observations should remain in the data set.
• Alternatively, outliers may indicate bad data due to incorrectly recorded observations
or incorrectly included observations in the data set.
• In such cases, the relevant observations should be corrected or simply deleted from
the data set.



3.3: Detecting Outliers

• There are no universally agreed upon methods for treating outliers.


• It is important to be able to identify potential outliers so that one can take
corrective actions, if needed.
• We first (1) construct a boxplot, which is an effective tool for identifying
outliers.
• A series of boxplots is also useful when comparing similar information for a
variable gathered at another place or time.
• Another method for detecting outliers is to (2) calculate z-scores.



3.3: Detecting Outliers
• A common way to quickly summarize a variable is to use a five-number summary.
• A five-number summary shows the minimum, the quartiles (Q1, Q2, and Q3), and the
maximum.
• A boxplot, also referred to as a box-and-whisker plot, is a way to graphically display a
five-number summary.
• Draw a box encompassing the first and third quartiles.
• Draw a dashed vertical line in the box at the median.
• Calculate the IQR (Q3 − Q1). Draw a whisker that extends from Q1 to the minimum value that is not farther than
1.5*IQR from Q1.
• Similarly, draw a whisker that extends from Q3 to the maximum value that is not farther than 1.5*IQR from
Q3.
• Use an asterisk (or another symbol) to indicate observations that are farther than 1.5*IQR from the
box. These observations are considered outliers.
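
The boxplot steps above can be sketched in Python. This is a minimal illustration, not the textbook's procedure: the function name `boxplot_summary` and the sample data are made up, and the "inclusive" quartile method is an assumption (other quartile conventions give slightly different values).

```python
from statistics import quantiles


def boxplot_summary(values):
    """Return the five-number summary, whisker ends, and flagged outliers."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles; other conventions differ slightly.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "min": data[0], "q1": q1, "median": q2, "q3": q3, "max": data[-1],
        "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }


sample = [2, 4, 5, 6, 7, 8, 9, 10, 30]  # made-up data for illustration
summary = boxplot_summary(sample)
print(summary)
```

For this sample, Q1 = 5 and Q3 = 9, so the IQR is 4 and the fences sit at −1 and 15. The value 30 lies beyond the upper fence and is flagged, while the right whisker stops at 10.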



3.3: Detecting Outliers
• A boxplot is also used to informally gauge the shape of the distribution.
• Symmetry is implied if the median is in the center of the box and the left/right whiskers are
equidistant from their respective quartiles.
• If the median is left of center and the right whisker is longer than the left whisker, then the
distribution is positively skewed.
• Similarly, if the median is right of center and the left whisker is longer than the right whisker,
then the distribution is negatively skewed.
• If outliers exist, we need to include them when comparing the lengths of the left and right
whiskers.



Working Example 3.9:
Growth-Value, P.99
• Use Excel to construct a boxplot for the Growth and Value variables from the
introductory case.
• With Excel: Insert > Insert Statistic Chart > Box and Whisker

3.3: Detecting Outliers
• The empirical rule makes precise statements regarding the percentage of
observations that fall within a specified number of standard deviations from the
mean.
• Assume the observations are drawn from a relatively symmetric and bell-shaped
distribution, as judged perhaps by an inspection of its histogram.
• Approximately 68% of all observations fall in the interval $\bar{x} \pm s$.
• Approximately 95% of all observations fall in the interval $\bar{x} \pm 2s$.
• Almost all (approximately 99.7%) observations fall in the interval $\bar{x} \pm 3s$.



Working Example 3.10:
P.101
A large lecture class has 280 students. The professor has announced that the mean
score on an exam is 74 with a standard deviation of 8. The distribution of scores is
bell shaped.
a. Approximately what range of scores captures about 95% of the students?

Working Example 3.10:
P.101
Answer:
The empirical rule states that, for a bell-shaped distribution, approximately 95% of the
observations fall within two standard deviations of the mean.
The scores therefore range between:
1. $\bar{x} - 2s = 74 - 2(8) = 58$
2. $\bar{x} + 2s = 74 + 2(8) = 90$
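
The same arithmetic can be checked with a short Python sketch. The helper name `empirical_interval` is hypothetical; the mean (74) and standard deviation (8) come from the example.

```python
def empirical_interval(mean, sd, k):
    """Return the interval mean ± k standard deviations."""
    return mean - k * sd, mean + k * sd


mean_score, sd_score = 74, 8  # values stated in the example

for k, share in [(1, "about 68%"), (2, "about 95%"), (3, "almost all")]:
    low, high = empirical_interval(mean_score, sd_score, k)
    print(f"{share} of scores fall between {low} and {high}")
```

With k = 2 this reproduces the interval from 58 to 90 given in the answer.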

3.3: Detecting Outliers

• It is often instructive to use the mean and the standard deviation to find
the relative location of an observation.
• We use the z-score to find the relative position of an observation by
dividing the difference of the observation from the mean by the
standard deviation: $z = \frac{x - \bar{x}}{s}$.
• A z-score is a unitless measure.
• It measures the distance of an observation from the mean in terms of
standard deviations.
• Converting observations into z-scores is also called standardizing the
observations.



3.3: Detecting Outliers

• Standardization is a common technique used in data analytics when
dealing with variables measured using different scales.
• If the distribution of a variable is relatively symmetric and bell-
shaped, we can also use z-scores to detect outliers.
• Since almost all observations fall within three standard
deviations of the mean, it is common to treat an observation as
an outlier if its z-score is more than 3 or less than −3.
• Such observations must be reviewed to determine if they should
remain in the data set.
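
A minimal Python sketch of standardizing observations and applying the ±3 rule follows. The function names and the sample data are illustrative assumptions. The sample standard deviation (divisor n − 1) is used, and the rule presumes a roughly bell-shaped distribution.

```python
from statistics import mean, stdev


def z_score(x, x_bar, s):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - x_bar) / s


def flag_outliers(values, threshold=3.0):
    """Return observations whose z-score is above +threshold or below -threshold."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 divisor)
    return [x for x in values if abs(z_score(x, x_bar, s)) > threshold]


scores = [8, 9, 10, 11, 12] * 4 + [60]  # made-up data with one extreme value
print(flag_outliers(scores))
```

Here only the value 60 is flagged; flagged observations should still be reviewed before any decision to remove them.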



Working Example 3.12:
Growth-Value, P.103
• What are the z-scores for the minimum and maximum values of the Growth and
Value variables?

• Growth minimum: $z = \frac{-40.90 - 15.755}{23.7993} = -2.38$
• Growth maximum: 𝑧 = 2.68
• Value minimum: 𝑧 = −3.28
• Value maximum: 𝑧 = 1.78

Working Example 3.12:
Growth-Value, P.103

=STANDARDIZE(B2,AVERAGE($B$2:$B$37),STDEV.S($B$2:$B$37))

=STANDARDIZE(C2,AVERAGE($C$2:$C$37),STDEV.S($C$2:$C$37))

From the z-scores of the minimum and maximum values, it is clear that the
Value variable has an outlier: the z-score of its minimum (−3.28) is less than −3.

Chapter 4
Data Visualization

Learning Objectives (LOs)

LO 4.1 Visualize a single variable.


LO 4.2 Visualize the relationship between two variables.
LO 4.3 Visualize the relationship between two or more
variables.



4.1: Methods to Visualize a Single
Variable
• Tabular and graphical tools help us to organize and present data.
• We can summarize both a categorical variable as well as a numerical
variable.
• A categorical variable consists of observations that represent labels or
names.
• Categorical data presented in raw form, as a long list of labels, are very
difficult to interpret.
• When presented with a categorical variable, it is often useful to summarize
the variable with a frequency distribution and/or a bar chart.



4.1: Methods to Visualize a Single
Variable
• Converting the raw data into a frequency distribution makes the data more
manageable and easier to assess.
• A frequency distribution for a categorical variable groups the observations
into categories and records the number of observations that fall into each
category.
• The relative frequency for each category equals the proportion of
observations in each category.
• A bar chart is a graphical representation of the frequency or relative
frequency distribution.
• Horizontal or vertical bars
• Lengths proportional to the values they are depicting
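
As an illustration, a frequency and relative frequency distribution for a categorical variable can be built in a few lines of Python. The function name `frequency_table` and the sample labels are made up; in Excel the same counts come from COUNTIF.

```python
from collections import Counter


def frequency_table(labels):
    """Map each category to (frequency, relative frequency)."""
    counts = Counter(labels)
    n = len(labels)
    return {category: (count, count / n) for category, count in counts.items()}


responses = ["Yes", "No", "Yes", "Yes", "No"]  # made-up labels
table = frequency_table(responses)
for category, (count, share) in table.items():
    print(f"{category}: {count} ({share:.0%})")
```

The relative frequencies sum to 1, and a bar chart of either column has bar lengths proportional to these values.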



Introductory Case: Construction Clothing

• ReliableWorkWear.com is an online company that offers a large selection of construction
clothing, work boots, gloves, and more that keep workers safe and comfortable on the job.
• Brendan Moore is the marketing analyst for ReliableWorkWear.com. He has compiled
data on 200 recent transactions (the Transactions data set).

• Brendan will use the sample information to:
1. Convey the information from the variables in tabular form.
2. Convey the information from the variables in graphical form.
3. Discuss findings and provide strategies that may help increase sales.

Working Example 4.1:
Transactions, P.111
• Example: Consider the Repeat and Sex variables from the introductory case.
• With Excel
• Use COUNTIF
• Insert > Insert Column or Bar Chart > 2-D Column
• Add Chart Elements > Data Labels > Outside End

4.1: Methods to Visualize a Single
Variable
• With a numerical variable, each observation represents a meaningful
amount or count.
• Although different from a categorical variable, we still use a frequency
distribution to summarize a numerical variable.
• For a numerical variable, a frequency distribution groups the observations
into intervals and records the number of observations that falls into each
interval.
• The relative frequency for each interval equals the proportion of
observations in each interval.
• The data are more manageable using a frequency distribution, but some
detail is lost.



4.1: Methods to Visualize a Single
Variable
• Instead of categories, construct a series of intervals (classes).
• We must make certain decisions about the number of intervals, as well as the
width of each interval.
• The intervals are mutually exclusive.
• The total number of intervals usually ranges from 5 to 9.
• Smaller data sets tend to have fewer intervals than larger data sets.
• If we have too many intervals, then the advantage of the frequency
distribution is lost.
• If the frequency distribution has too few classes, then considerable
accuracy and detail are lost.
• The intervals are exhaustive.
• The intervals are easy to recognize and interpret.



4.1: Methods to Visualize a Single
Variable
• A histogram is the counterpart to the bar chart to visualize a frequency
distribution.
• A histogram is a series of rectangles where the width and height of each
rectangle represent the interval width and frequency (or relative
frequency) of the respective interval.
• Mark off the interval limits along the horizontal axis.
• The height of each bar represents either the frequency or the
relative frequency for each interval.
• No gaps appear between the interval limits.



4.1: Methods to Visualize a Single Variable

• A histogram provides information about the shape of the distribution.
• Symmetric: mirror image of itself on both sides of its center
• Skewed: positive (elongated right tail) or negative (elongated left tail)

Working Example 4.2:
Transactions, P.117
• Example: The Income variable from the introductory case.
• With Excel: Data > Data Analysis > Histogram

Working Example 4.2:
Transactions, P.117
✓ From the previous Table, we see that annual income for
the 200 customers ranges from $0 to $250,000.
✓ The majority of customers (66%) earned between $50,000 and
$100,000.
✓ Only 5% of the customers earned more than $150,000.

✓ From the Figure, we see that the distribution of Income is not
symmetric; it is positively skewed with a tail running off to the right.

4.1: Methods to Visualize a Single
Variable
• The possibility exists for unintentional, as well as purposeful,
distortions of graphical information.
• The simplest graph should be used for a given set of data. Strive
for clarity and avoid unnecessary adornments.
• Axes should be clearly marked with the numbers of their respective
scales; each axis should be labeled.
• When creating a bar chart or a histogram, each bar/rectangle should
be of the same width.
• Differing widths create distortions.



4.1: Methods to Visualize a Single Variable
• The vertical axis should not be given a very high value as an
upper limit.
• Otherwise, the data may appear compressed so that an increase (or decrease)
of the data is not as apparent as it perhaps should be.

4.1: Methods to Visualize a Single Variable

• The vertical axis should not be stretched so that an increase (or
decrease) of the data appears more pronounced than warranted.

4.2: Methods to Visualize the Relationship
between Two Variables

Categorical Variables:
• Use a contingency table to summarize and examine the
relationship between two categorical variables.
• Frequencies for two categorical variables
• Each cell represents a mutually exclusive combination of
the pair of values
• The contingency table allows us to present and interpret the raw
data in a much more manageable format.
• Contingency tables are widely used in marketing as well as other
business applications.
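
A contingency table can be sketched in Python by counting each pair of values. The function name and the six sample records below are made up for illustration; in Excel the same table comes from a pivot table.

```python
from collections import Counter


def contingency_table(row_values, col_values):
    """Count each (row level, column level) combination of paired observations."""
    joint = Counter(zip(row_values, col_values))
    row_levels = sorted(set(row_values))
    col_levels = sorted(set(col_values))
    # Counter returns 0 for combinations that never occur.
    return {r: {c: joint[(r, c)] for c in col_levels} for r in row_levels}


# Made-up paired observations: one (location, purchase) pair per customer.
location = ["South", "South", "West", "Midwest", "Midwest", "West"]
purchase = ["Yes", "Yes", "Yes", "No", "Yes", "No"]

table = contingency_table(location, purchase)
for region, counts in table.items():
    print(region, counts)
```

Each observation falls in exactly one cell, so the cell counts add up to the number of observations.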



4.2: Methods to Visualize the Relationship
between Two Variables

• The information in a contingency table can be shown graphically using a
stacked column chart.
• A stacked column chart is an advanced version of the bar chart.
• It is designed to visualize more than one categorical variable.
• It allows for the comparison of composition within each category.
• Each column represents the total number of responses for each level of one
variable.
• The segments within a column represent the other variable.



Working Example 4.3:
Promotion, P.126
• Example: An online retailer recently sent e-mails to customers that
included a promotional discount.
• The retailer wonders whether there is any relationship between a
customer’s location in the U.S. and whether the customer made a
purchase with the discount.

Working Example 4.3:
Promotion, P.126
• Example continued
• With Excel: Insert > Pivot Table

Working Example 4.3:
Promotion, P.126
• Example continued with Excel
• Insert > Insert Column or Bar Chart > Stacked Column

Working Example 4.3:
Promotion, P.126
✓ We can readily see from the previous tables that, of the 600 e-mail recipients, 410
made a purchase using the promotional discount. This translates into a
68.33% positive response rate, suggesting that this marketing strategy was
successful.
✓ However, there do appear to be some differences depending on location, and these
differences are apparent from the previous Figure.
✓ Recipients residing in the South and West were a lot more likely to make a
purchase (130 out of 154 and 101 out of 119) compared to those residing in the
Midwest (77 out of 184).
✓ It would be wise for the retailer to examine if there are other traits that the
customers in the South and West share (age, gender, etc.). That way, in the next
marketing campaign, the e-mails can be even more targeted.
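
The response rates behind these observations can be recomputed from the counts quoted above. The sketch below uses only the figures stated on this slide; the variable names are illustrative.

```python
# (purchases, recipients) for the regions quoted above
by_region = {
    "South": (130, 154),
    "West": (101, 119),
    "Midwest": (77, 184),
}

rates = {region: made / sent for region, (made, sent) in by_region.items()}
overall_rate = 410 / 600  # all recipients, as stated above

for region, rate in rates.items():
    print(f"{region}: {rate:.1%}")
print(f"Overall: {overall_rate:.2%}")
```

The South and West rates come out near 84% to 85%, roughly double the Midwest rate of about 42%, which is the gap the stacked column chart makes visible.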

4.2: Methods to Visualize the Relationship
between Two Variables
Numerical Variables:
• A scatterplot is used to determine whether or not two numerical
variables are related in some systematic way.
• Each point represents a paired observation for the two variables
• Refer to one variable as x (x-axis) and the other as y (y-axis)
• Once plotted, the graph may reveal one of the following:
• A linear relationship
• A nonlinear relationship
• No relationship
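
A scatterplot is a visual check. As a numerical complement, the correlation coefficient listed among the tools for numeric vs numeric variables summarizes the strength of a linear relationship. The sketch below is a plain-Python illustration with made-up data, and the function name `pearson_r` is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient between paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data: a perfectly linear pattern and a symmetric curved pattern.
linear_r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
curved_r = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(linear_r, curved_r)
```

The curved pattern has a clear nonlinear relationship yet a correlation of zero, which is why plotting the data first matters.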

Working Example 4.4:
Transactions, P.130
• Example: Consider the customer’s purchase amount and annual
income.
• With Excel: Insert > Insert Scatter or Bubble Chart > Scatter

Working Example 4.4:
Transactions, P.130
✓ From the previous Figure, we can infer that there seems to be a
positive relationship between Purchase and Income; that is, those
customers with higher incomes tend to make purchases of a higher
amount.

4.3: Other Data Visualization Methods

• A scatterplot with a categorical variable modifies a basic
scatterplot by incorporating a categorical variable.
• This allows us to see if the relationship between two numeric
variables differs across the levels of a categorical variable.
• It is common to encode the categorical variable through point
color.
• Giving the points of each category a distinct hue makes it easy to see
which category each point belongs to.



Working Example 4.5:
Birth-Life, P.134
• Example: Life expectancy vs. birth rate by country development.

• With Excel: Insert > Insert Scatter or Bubble Chart > Scatter; add Edit
Series

Working Example 4.5:
Birth-Life, P.134

✓ From the previous Figure, we see a negative linear relationship
between birth rate and life expectancy.
✓ That is, countries with lower birth rates tend to have higher life
expectancies. This relationship holds true for both developing
and developed countries.
✓ We also see that, in general, developed countries have lower
birth rates and higher life expectancies as compared to
developing countries.

4.3: Other Data Visualization Methods

• A bubble plot shows the relationship between three numerical
variables.
• The third variable is represented by the size of the bubbles (points).

Working Example 4.6:
Birth-Life, P.136
• Example: life expectancy vs. birth rate by country development
• With Excel: Insert > Insert Scatter or Bubble Chart > Bubble

Working Example 4.6:
Birth-Life, P.136
✓ From the previous Figure, we see that a country’s birth rate and its average
life expectancy display a negative linear relationship.
✓ We also see that countries with low birth rates and high life expectancies
have higher GNI per capita, which is indicative of developed countries.

4.3: Other Data Visualization Methods

• A line chart displays a numerical variable as a series of data points
connected by a line.
• It connects the consecutive observations of a numerical variable with
a line.
• A line chart is especially useful for tracking changes or trends over
time.
• It is also easy for us to identify any major changes that happened in the past
on a line chart.
• When multiple lines are plotted in the same chart, we can compare these
observations on one or more dimensions.



Working Example 4.7:
Apple-Merck, P.137
• Example: Monthly stock prices for Apple and Merck.

• With Excel: Insert > Insert Line or Area Chart > Line

Working Example 4.7:
Apple-Merck, P.137
✓ The line charts in the previous Figure show the monthly stock prices for Apple and
Merck over the years 2016 through 2019. Both stocks rose over this period;
however, the rise in Apple’s stock price is far more dramatic than the
rise in Merck’s stock price.
✓ There is also a lot more volatility in Apple’s stock price. Specifically, we see a
dramatic decline in Apple’s stock at the end of 2018. This dip corresponded to
news that the company would no longer offer unit sales data for its products.
✓ At the time, some wondered if this lack of transparency presaged weaker iPhone
sales in the future. Fortunately for Apple, this prediction did not materialize.

4.3: Other Data Visualization Methods

• A heat map uses color or color intensity to display relationships
between variables.
• A heat map is especially useful for identifying combinations of the
categorical variables that have economic significance.
• There are a number of ways to display a heat map.
• They use color to communicate relationships between the variables
that would be harder to understand by simply inspecting the raw
data.



Working Example 4.8:
Bookstores, P.139
• A national bookstore chain is trying to understand customer preferences at various store
locations. The marketing department has acquired a list of 500 of the most recent
transactions from four of its stores.
• The data set includes the record number (Record), which one of its four stores sold the
book (BookStore), and the type of book sold (BookType).
• The marketing department wants to visualize the data using a heat map to help it
understand customer preferences at different stores.

Working Example 4.8:
Bookstores, P.139
• Example continued with Excel
• Insert > Pivot Table
• Home > Conditional Formatting > Color Scales > Green – Yellow – Red Color
Scale

Working Example 4.8:
Bookstores, P.139
✓ The heat maps in the previous Figure reveal that customers’ book preferences do
differ across different store locations.
✓ For example, romance fiction is the most popular book type sold at Store2 but the
least popular at Store3.
✓ Self-help books are the least popular books at Store1.
✓ The management can use this information to make decisions about how many
copies of each type of book to stock at each store.

Thank You ☺
