GEA1000
Chapter 2 Review
WHERE WE WERE . . .
• PROBLEM
Research Question?
• PLAN
– What and how to measure?
Variable types
– Study Design?
Experiments vs Observational studies
– Collecting?
Census or sampling?
• DATA/ANALYSIS
Exploratory Data Analysis (EDA)
Summary statistics
WHERE WE ARE NOW . . .
In this tutorial. . .
ANALYSIS OF CATEGORICAL DATA
1. Bar graphs
2. Rates (marginal, conditional and joint)
3. Association between categorical variables
4. Symmetry, basic rule on rates
5. Simpson’s Paradox
1 BAR GRAPHS
The most common approach to visualizing amounts (i.e., numerical values shown for some set of categories) is using bars, either vertically or horizontally arranged.
However, beware of making your plot “too busy”!
Some people contend that . . . you can always write every single figure that your chart represents on top of your bars, lines, or pie segments; but then, what is the point of designing the chart in the first place? A good graphic should let you visualize trends and patterns without having to read all the numbers.
How Charts Lie by Alberto Cairo
Straits Times on 17 Jun 2021.
Here is my attempt to make it read better. This is known as a slope chart, first introduced by Edward Tufte.
Here is another set of plots that I have come across quite often recently, but am not too happy about.
Do you have ideas to make them better?
An app written using R to make plots:
https://david-chew.shinyapps.io/esquisse/
2 RATES (MARGINAL, CONDITIONAL AND JOINT)
Consider the following contingency table (2-by-2 table) that shows the smoking status and outcome (heart disease (HD) or not) for 390 people after 20 years.

             HD    No HD   Total
Smoker       100   100     200
Non-smoker   110    80     190
Total        210   180     390

• Marginal rate
rate(HD) = 210/390 = 53.8%
• Conditional rate
rate(HD | Smoker) = 100/200 = 50.0%
• Joint rate
rate(HD and Smoker) = 100/390 = 25.6%
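The three kinds of rate can be computed directly from the table counts. The following is a minimal sketch; the dictionary layout and the function names (`marginal_rate`, `conditional_rate`, `joint_rate`) are illustrative choices, not part of the course material.

```python
# Counts from the 2-by-2 table: (smoking status, outcome) -> number of people
counts = {
    ("Smoker", "HD"): 100,
    ("Smoker", "No HD"): 100,
    ("Non-smoker", "HD"): 110,
    ("Non-smoker", "No HD"): 80,
}

total = sum(counts.values())  # 390 people in all


def marginal_rate(outcome):
    """rate(outcome): share of all people with this outcome."""
    return sum(n for (_, o), n in counts.items() if o == outcome) / total


def conditional_rate(outcome, status):
    """rate(outcome | status): share of people with this status who have the outcome."""
    row_total = sum(n for (s, _), n in counts.items() if s == status)
    return counts[(status, outcome)] / row_total


def joint_rate(outcome, status):
    """rate(outcome and status): share of all people with both characteristics."""
    return counts[(status, outcome)] / total


print(f"rate(HD)            = {marginal_rate('HD'):.1%}")               # 53.8%
print(f"rate(HD | Smoker)   = {conditional_rate('HD', 'Smoker'):.1%}")  # 50.0%
print(f"rate(HD and Smoker) = {joint_rate('HD', 'Smoker'):.1%}")        # 25.6%
```

Note that the only difference between the conditional and the joint rate is the denominator: the row total (200) versus the grand total (390).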
3 ASSOCIATION BETWEEN CATEGORICAL VARIABLES
Continuing . . .

             HD    No HD   Total
Smoker       100   100     200
Non-smoker   110    80     190
Total        210   180     390

We compare the conditional rates
• rate(HD | Smoker) = 100/200 = 50.0%
• rate(HD | Non-smoker) = 110/190 = 57.9%
We say that HD and smoking are negatively associated, since
rate(HD | Smoker) < rate(HD | Non-smoker)
Does this mean that it is better to smoke?
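The comparison of conditional rates can be written as a small helper. This is only a sketch: the function name `association` and its return labels are hypothetical, and exact fractions are used so that the "equal rates" case is not disturbed by floating-point rounding.

```python
from fractions import Fraction


def association(rate_a_given_b, rate_a_given_not_b):
    """Classify the association between A and B from two conditional rates."""
    if rate_a_given_b > rate_a_given_not_b:
        return "positive"
    if rate_a_given_b < rate_a_given_not_b:
        return "negative"
    return "none"


# Conditional rates from the table, kept as exact fractions
hd_given_smoker = Fraction(100, 200)      # 50.0%
hd_given_non_smoker = Fraction(110, 190)  # 57.9%

print(association(hd_given_smoker, hd_given_non_smoker))  # negative
```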
4 SYMMETRY RULE ON RATES
Writing NA for “not A” and NB for “not B”, it can be shown that
(I) rate(A|B) > rate(A|NB) ⇐⇒ rate(B|A) > rate(B|NA)
(II) rate(A|B) < rate(A|NB) ⇐⇒ rate(B|A) < rate(B|NA)
(III) rate(A|B) = rate(A|NB) ⇐⇒ rate(B|A) = rate(B|NA)
• When we have (I), A and B are said to be positively associated;
• When we have (II), A and B are said to be negatively associated;
• When we have (III), A and B are said to be not associated.
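As a quick check of the symmetry rule on the smoking data, take A = HD and B = Smoker and compare the rates both by row and by column. This is an illustrative sketch; the variable names and the helper `sign` are assumptions made for the example.

```python
from fractions import Fraction

# Counts from the 2-by-2 table, with A = HD and B = Smoker
smoker_hd, smoker_no_hd = 100, 100
non_smoker_hd, non_smoker_no_hd = 110, 80


def sign(x, y):
    """Return 1, -1 or 0 according to whether x > y, x < y or x == y."""
    return (x > y) - (x < y)


# Row view: rate(HD | Smoker) versus rate(HD | Non-smoker)
row_view = sign(
    Fraction(smoker_hd, smoker_hd + smoker_no_hd),              # 100/200
    Fraction(non_smoker_hd, non_smoker_hd + non_smoker_no_hd),  # 110/190
)

# Column view: rate(Smoker | HD) versus rate(Smoker | No HD)
column_view = sign(
    Fraction(smoker_hd, smoker_hd + non_smoker_hd),           # 100/210
    Fraction(smoker_no_hd, smoker_no_hd + non_smoker_no_hd),  # 100/180
)

print(row_view, column_view)  # -1 -1
```

Both comparisons point the same way (case (II)), so it does not matter which variable is conditioned on when deciding the direction of the association.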
BASIC RULE ON RATES
In a population, let A and B be characteristics.
Denote the overall rate of A by rate(A), similarly for rate(B).
1. rate(A) always lies between rate(A|B) and rate(A|NB). Important!
2. The closer rate(B) is to 100%, the closer rate(A) is to rate(A|B). Important!
3. If rate(A|B) = rate(A|NB), then rate(A) = rate(A|B) = rate(A|NB).
4. If rate(B) = rate(NB) = 50%, then rate(A) = [rate(A|B) + rate(A|NB)] / 2.
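All four rules follow from one identity that is not written on the slide but holds in general: rate(A) is a weighted average of the two conditional rates, rate(A) = rate(B) × rate(A|B) + rate(NB) × rate(A|NB). A minimal sketch checking this on the smoking table, with A = HD and B = Smoker (the function name `overall_rate` is hypothetical):

```python
from fractions import Fraction


def overall_rate(rate_b, a_given_b, a_given_not_b):
    """Weighted average of the conditional rates, with weights rate(B) and rate(NB)."""
    return rate_b * a_given_b + (1 - rate_b) * a_given_not_b


rate_smoker = Fraction(200, 390)
hd_given_smoker = Fraction(100, 200)
hd_given_non_smoker = Fraction(110, 190)

rate_hd = overall_rate(rate_smoker, hd_given_smoker, hd_given_non_smoker)

# The weighted average reproduces the marginal rate 210/390.
print(rate_hd == Fraction(210, 390))  # True

# Rule 1: the overall rate lies between the two conditional rates.
print(hd_given_smoker <= rate_hd <= hd_given_non_smoker)  # True

# Rule 4: with a 50/50 split, the overall rate is the simple average.
even_split = overall_rate(Fraction(1, 2), hd_given_smoker, hd_given_non_smoker)
print(even_split == (hd_given_smoker + hd_given_non_smoker) / 2)  # True
```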
Now let us look at the same data, but "sliced" by gender.

                 Female           Male
              HD    No HD      HD    No HD    Total
Smoker        60    20         40    80       200
Non-smoker   100    50         10    30       190
Total        160    70         50   110       390

• For Female
rate(HD | Smoker) = 60/80 = 75.0%
• For Male
rate(HD | Smoker) = 40/120 = 33.3%
• For All
rate(HD | Smoker) = 100/200 = 50.0%
• Note that as stated by Rule 1, 33.3% ≤ 50.0% ≤ 75.0%.
• Further, note that the overall HD rate among smokers is closer to the HD rate among male smokers, since there are more male smokers than female smokers (Rule 2).
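Rule 2 can be seen numerically by weighting each gender's rate by its share of the smokers. The sketch below restricts attention to the Smoker row of the sliced table; the dictionary layout and variable names are illustrative assumptions.

```python
from fractions import Fraction

# (HD, No HD) counts among smokers only, by gender
smokers = {"Female": (60, 20), "Male": (40, 80)}

total_smokers = sum(hd + no_hd for hd, no_hd in smokers.values())  # 200

overall = Fraction(0)
for gender, (hd, no_hd) in smokers.items():
    group_size = hd + no_hd
    rate = Fraction(hd, group_size)                 # rate(HD | Smoker) within this gender
    weight = Fraction(group_size, total_smokers)    # this gender's share of all smokers
    overall += weight * rate
    print(f"{gender}: rate = {float(rate):.1%}, weight = {float(weight):.0%}")

print(f"All smokers: {float(overall):.1%}")
```

Males make up 120 of the 200 smokers, so their rate carries the larger weight and pulls the overall rate towards 33.3%.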
5 SIMPSON’S PARADOX
                  Female                  Male                   All
              HD    Total  Rate       HD    Total  Rate      HD    Total  Rate
Smoker        60    80     75.0%      40    120    33.3%     100   200    50.0%
Non-smoker   100    150    66.7%      10    40     25.0%     110   190    57.9%
• Note that the relationship between the percentages in the subgroups is reversed when the subgroups are combined (75.0% > 66.7% and 33.3% > 25.0%, but 50.0% < 57.9%).
An instance of the Yule-Simpson paradox!
• This is due to the confounder “gender”, which is associated with both HD and smoking.
• How to resolve? We control for gender by slicing the data on it.
• So this answers our earlier question: “Does this mean that it is better to smoke?”
Within each gender, smokers have the higher HD rate, so it is in fact not good to smoke.
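The reversal can be checked mechanically: compare smokers with non-smokers inside each gender, then again after pooling the counts. This is a sketch only; the table layout and the helper `smokers_higher` are hypothetical names introduced for the example.

```python
from fractions import Fraction

# (HD count, group total) for each gender and smoking status
table = {
    "Female": {"Smoker": (60, 80), "Non-smoker": (100, 150)},
    "Male": {"Smoker": (40, 120), "Non-smoker": (10, 40)},
}


def smokers_higher(groups):
    """True if rate(HD | Smoker) > rate(HD | Non-smoker) for these counts."""
    return Fraction(*groups["Smoker"]) > Fraction(*groups["Non-smoker"])


# Pool the subgroups by adding HD counts and totals for each smoking status.
combined = {
    status: tuple(sum(table[g][status][i] for g in table) for i in range(2))
    for status in ("Smoker", "Non-smoker")
}

within = [smokers_higher(table[g]) for g in table]  # one comparison per gender
pooled = smokers_higher(combined)                   # comparison on the pooled data

# A reversal: every subgroup points one way, the pooled data the other.
reversal = len(set(within)) == 1 and within[0] != pooled
print(within, pooled, reversal)  # [True, True] False True
```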
• Simpson’s Paradox does not necessarily occur whenever there is a confounder in the relationship between 2 variables.
• Slicing the data into male and female subgroups lets us study the association between smoking status and outcome while controlling for the confounder gender.
• Randomised assignment should be used whenever possible in experiments, to lower the possibility of having to deal with confounders.
• In observational studies, collecting data on variables that may be confounders is a good idea. But there may be too many of them!
Hence only association, and not causation, may be determined.