
This document describes analyzing a dataset of salaries for managers at a large firm to determine if there are statistically significant differences in salaries between genders. It is initially found that women earn less on average than men using a t-test. However, years of experience is identified as a confounding variable, as men have more experience on average. Restricting the analysis to managers with similar experience years reduces the salary difference. A multiple regression model including an interaction term for gender and experience finds no significant evidence of salary discrimination after accounting for experience differences.


Categorical Predictors
Shankar Venkatagiri

Circa 2003: Six women from Walmart file a gender discrimination lawsuit in SFO

Note: We don't have the Walmart dataset. Instead, we work with a sample of salaries for mid-level managers at a large firm.

Scenario 1

Filename:
Salaries.R

[Figure: Salaries by Gender. Salary in $'000s (axis from 110 to 170) for Female and Male managers]

Open up GenderSalaries.csv

59 women (Group 0) & 115 men (Group 1)

In this sample, women earn less than men on average

Q: Is this disparity due to chance variation in the sample data? Or is it a systematic phenomenon across the firm?

Conditions

Data from an SRS, so both samples can be considered random


[Figure: normal quantile plots. Normality of Male Salaries (Males$Salary vs. norm quantiles) and Normality of Female Salaries (Females$Salary vs. norm quantiles)]

Assuming equal group variances, we conduct a 2-sample t-test
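
The equal-variance (pooled) two-sample t statistic can be sketched in a few lines. This is a minimal Python illustration, not the slides' R script; the function name `pooled_t` and the toy salary lists are hypothetical and stand in for the columns of GenderSalaries.csv.

```python
import math
import statistics


def pooled_t(x, y):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((nx - 1) * statistics.variance(x)
           + (ny - 1) * statistics.variance(y)) / df
    # Standard error of the difference in means.
    se = math.sqrt(sp2 * (1 / nx + 1 / ny))
    t = (statistics.mean(x) - statistics.mean(y)) / se
    return t, df


# Hypothetical toy salaries in $'000s (NOT the slide data).
females = [130.0, 140.0, 150.0]
males = [140.0, 150.0, 160.0]
t_stat, df = pooled_t(females, males)
print(t_stat, df)
```

With the full dataset the degrees of freedom would be 59 + 115 - 2 = 172, matching the R output that follows.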

Results

Filename:
Salaries.R

Two Sample t-test


data: Females$Salary and Males$Salary
t = -2.3517, df = 172, p-value = 0.01982
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval for the difference:
-8.5905282 -0.7503708
sample estimates:
mean of x mean of y
140.0339 144.7043

Based on our SRS, average salaries differ by $4,670. With 95% confidence, this shortfall can be anywhere between $750 and $8,590. The difference is statistically significant (p = 0.0198)!
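
As a cross-check, the pieces of the reported output can be tied together with simple arithmetic. The sketch below (Python, using only numbers printed in the t-test output) backs out the standard error from the t statistic and the critical value implied by the interval; the variable names are illustrative.

```python
# Values copied from the reported two-sample t-test output (salary in $'000s).
mean_f, mean_m = 140.0339, 144.7043
t_stat = -2.3517
ci_low, ci_high = -8.5905282, -0.7503708

diff = mean_f - mean_m                  # estimated difference in means
se = diff / t_stat                      # standard error implied by t = diff / se
half_width = (ci_high - ci_low) / 2     # half-width of the 95% interval
t_crit = half_width / se                # critical value implied by the interval

print(round(diff, 4), round(se, 4), round(t_crit, 3))
```

The implied critical value should sit just above the normal value of 1.96, as expected for a t distribution with 172 degrees of freedom.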

Attribution
Q: Could there be lurking variables, some other explanation for the salary difference?
Years of experience? Do men have more?

[Figure: Scatterplot Matrix of Group, YearsExp and Salary; r = 0.36 between Salary and YearsExp]

Our t-test result may be confounded by correlation

Confounding

A predictor is a confounding variable if

It is correlated with the response variable

The two groups differ w.r.t. this variable

In our case, YearsExp is a confounding variable

r = 0.36; Ave. YearsExp differs: 7.7 yrs (women), 12 yrs (men)

Conclusion: The 2-sample t-test is not a fair comparison

Exp=5
Two Sample t-test of managers with 5 years of experience
data: Females5$Salary and Males5$Salary
t = -0.0885, df = 22, p-value = 0.9303
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval for the difference:
-9.228310 8.472755
sample estimates:
mean of x mean of y
137.7333 138.1111

To combat confounding, compare managers with the same YearsExp (here, 5 years)

If the CI contains 0, the difference between the group means is not statistically significant!

Estimated deficiency = $378, as compared to $4,670 earlier


Careful: This subset has only 24 managers!

What if we wish to compare managers at arbitrary experience levels?


Subsets

Filename:
Salaries.R

Regress for all female and all male managers separately

Trends reverse around an experience level of 11 years

Q: Are these findings due to chance variation?

Group-wise
WOMEN: lm(formula = Salary ~ YearsExp, data = Females)
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  130.9888      3.3241    39.41   < 2e-16 ***
YearsExp       1.1760      0.3882     3.03   0.00368 **

Residual standard error: 11.22 on 57 degrees of freedom
Multiple R-squared: 0.1387, Adjusted R-squared: 0.1236
F-statistic: 9.178 on 1 and 57 DF, p-value: 0.003677

MEN: lm(formula = Salary ~ YearsExp, data = Males)
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  135.6001      2.9015   46.735   < 2e-16 ***
YearsExp       0.7611      0.2236    3.404   0.00092 ***

Residual standard error: 12.06 on 113 degrees of freedom
Multiple R-squared: 0.093, Adjusted R-squared: 0.08497
F-statistic: 11.59 on 1 and 113 DF, p-value: 0.0009203

Contrast

Difference in salaries of F/M managers if we do not consider YearsExp: $140,034 - $144,704 = -$4,670

By restricting rows in the dataset to YearsExp = 5

Female: $137,733 versus Male: $138,111; Difference = -$378

Inferring from the regression models

YearsExp = 5: Females make $130,988 + $1,176 * 5 = $136,868 vs Males: $135,600 + $761 * 5 = $139,405; Difference = -$2,537

YearsExp = 0: Difference = $130,989 - $135,600 = -$4,611

We have just one snapshot of salary differences

Need: Std errors for differences between intercepts & slopes
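
The model-based differences at any experience level follow directly from the two group-wise fits. The sketch below is a Python illustration using the coefficients reported in the slides; the helper names `predict`, `gap` and `crossover` are hypothetical.

```python
# Group-wise fits reported in the slides (salary in $'000s).
FEMALE = (130.9888, 1.1760)  # (intercept, slope on YearsExp)
MALE = (135.6001, 0.7611)


def predict(line, years_exp):
    """Estimated salary on a fitted line at a given experience level."""
    intercept, slope = line
    return intercept + slope * years_exp


def gap(years_exp):
    """Female minus male estimated salary, in $'000s."""
    return predict(FEMALE, years_exp) - predict(MALE, years_exp)


def crossover():
    """Experience level at which the two fitted lines intersect."""
    return (MALE[0] - FEMALE[0]) / (FEMALE[1] - MALE[1])


print(round(gap(0), 3), round(gap(5), 3), round(crossover(), 1))
```

The crossover is the intercept gap divided by the slope gap, which is consistent with the earlier remark that the trends reverse at around 11 years. These are point estimates only; they carry no standard errors, which is the gap the combined model addresses.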

Combining

A product of two predictors is known as an interaction

E.g. Group x YearsExp


Remedy: Incorporate a dummy variable D for Gender and an interaction of Gender with YearsExp

Est. Salary = β0 + β1 YearsExp + β2 D + β3 D * YearsExp

For male managers: D = 1, Interaction = YearsExp


Female managers serve as the reference group (D = 0)

Est. SalaryFemale = β0 + β1 YearsExp

Est. SalaryMale = (β0 + β2) + (β1 + β3) YearsExp

Full model
lm(Salary ~ YearsExp + factor(Group) + factor(Group) * YearsExp, data = Salaries)
                          Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)               130.9888      3.4902   37.531   < 2e-16
YearsExp                    1.1760      0.4076    2.885   0.00442
factor(Group)1              4.6113      4.4970    1.025   0.30663
YearsExp:factor(Group)1    -0.4149      0.4625   -0.897   0.37088
Residual standard error: 11.79 on 170 degrees of freedom
Multiple R-squared: 0.1352, Adjusted R-squared: 0.1199
F-statistic: 8.859 on 3 and 170 DF, p-value: 1.728e-05

Case of déjà vu?

Women: Est. SalaryFemale = $130,989 + $1,176 * YearsExp

Men: Est. Salary = (130.989 + 4.611) + (1.176 - 0.415) YearsExp, giving us Est. SalaryMale = $135,600 + $761 * YearsExp
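
The way a dummy and its interaction encode two separate lines can be checked mechanically. The sketch below is a Python illustration with the full-model coefficients from the output above; `group_lines` is a hypothetical helper, with the dummy coefficient added to the intercept and the interaction coefficient added to the slope.

```python
def group_lines(b0, b1, b2, b3):
    """Return (intercept, slope) for the reference group (D = 0)
    and for the group flagged by the dummy (D = 1)."""
    return {
        "reference": (b0, b1),
        "dummy": (b0 + b2, b1 + b3),
    }


# Coefficients from the full model: intercept, YearsExp,
# factor(Group)1, and YearsExp:factor(Group)1 (salary in $'000s).
lines = group_lines(130.9888, 1.1760, 4.6113, -0.4149)
print(lines["reference"], lines["dummy"])
```

The "dummy" entry reproduces the separately fitted line for men, which is why the combined model looks familiar; what it adds is standard errors for the intercept and slope differences.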

SPSS: Dummy

In SPSS, use the Transform menu's Recode option to create dummy variables

D will henceforth be treated as a numeric variable


Interaction

Use the Transform menu's Compute Variable option to specify interaction variables

Redo the regression!


Interpret

Remember that Group 1 = Male

By regressing group-wise, we obtain

Men: Est. SalaryMale = $135,600 + $761 * YearsExp

Women: Est. SalaryFemale = $130,989 + $1,176 * YearsExp

Difference between intercepts = $135,600 - $130,989 = $4,611

Same as the slope for the dummy variable!

Men with no experience make this much more!

Difference between slopes = $761 - $1,176 = -$415

Same as the slope for the interaction variable!

Women catch up by this amount over every year!

Ok to Regress?

Check the conditions of the multiple regression model (MRM) before continuing

Linearity (demonstrated!), independent residuals (SRS), equal variance in residuals, and normal residuals

Variance

Similar variances are visually assessed using box-plots

Convention: Dissimilar only if IQR of one > twice the other. Not so in our case!

Proceed
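
The IQR convention above can also be applied numerically rather than by eye. The following is a minimal Python sketch; the function names and the toy values are hypothetical, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from the quartiles a box-plot routine draws.

```python
import statistics


def iqr(values):
    """Interquartile range using statistics.quantiles (default 'exclusive' method)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def variances_dissimilar(a, b):
    """Rule of thumb from the slide: dissimilar only if one IQR exceeds twice the other."""
    iqr_a, iqr_b = iqr(a), iqr(b)
    return max(iqr_a, iqr_b) > 2 * min(iqr_a, iqr_b)


# Hypothetical toy values (NOT the slide data).
group_a = [1, 2, 3, 4, 5, 6, 7, 8]
group_b = [10 * v for v in group_a]   # same shape, ten times the spread

print(variances_dissimilar(group_a, group_a))
print(variances_dissimilar(group_a, group_b))
```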

Principle of Marginality

Keep the components of a statistically significant interaction, regardless of the components' own significance

Q: What if the interaction is not statistically significant?

Remove it and re-estimate

Rationale 1: Model is easier to interpret without the interaction.

Full model: Est. Salary = β0 + β1 YearsExp + β2 D + β3 D * YearsExp; dropping the interaction sets β3 = 0

The regression model will then fit parallel lines to the groups, because D shifts only the intercept and both groups share the slope β1

Drop

Q: What if the interaction is not statistically significant?

Rationale 2: Collinearity between D and D * YearsExp inflates standard errors, which can reduce the significance of the dummy

Cannot reject the hypothesis that the slopes match

Decision: Drop the interaction term and re-estimate

VIFs

No interaction
lm(Salary ~ YearsExp + factor(Group), data = Salaries)
                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      133.4676      2.1315   62.616   < 2e-16 ***
YearsExp           0.8537      0.1925    4.435  1.64e-05 ***
factor(Group)1     1.0242      2.0576    0.498     0.619

Residual standard error: 11.78 on 171 degrees of freedom
Multiple R-squared: 0.1311, Adjusted R-squared: 0.1209
F-statistic: 12.9 on 2 and 171 DF, p-value: 6.047e-06

Q: So, what's your final conclusion?

SPSS

Q: Is D significant??

Conclusion

Based on the supplied information, there is no statistical evidence for salary discrimination based on gender at the firm once YearsExp is accounted for (p = 0.619 for the Gender dummy)

Final model: Est. Salary = $133,742 + $892 * YearsExp

R
lm(formula = Salary ~ YearsExp, data = Salaries)
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  133.7420      2.0545   65.098   < 2e-16 ***
YearsExp       0.8920      0.1761    5.066  1.04e-06 ***

Residual standard error: 11.75 on 172 degrees of freedom
Multiple R-squared: 0.1298, Adjusted R-squared: 0.1248
F-statistic: 25.67 on 1 and 172 DF, p-value: 1.039e-06
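
Two quick checks tie this final model back to earlier slides. In a one-predictor regression, R-squared is the square of the correlation between response and predictor, so its square root should reproduce the r = 0.36 quoted for Salary and YearsExp. The Python sketch below uses only the reported numbers; `predict_salary` is a hypothetical helper.

```python
import math

# Final one-predictor model reported in the slides (salary in $'000s).
INTERCEPT, SLOPE = 133.7420, 0.8920
R_SQUARED = 0.1298


def predict_salary(years_exp):
    """Estimated salary in $'000s for a manager of either gender."""
    return INTERCEPT + SLOPE * years_exp


# The slope is positive, so r is the positive square root of R-squared.
r = math.sqrt(R_SQUARED)

print(round(r, 2), round(predict_salary(10), 3))
```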

Scenario 2

A factor variable can have multiple categories

E.g. Drug trials involving the product, rivals & placebo

Unless data come from a randomised experiment, we need a regression to study the effects of confounding variables

Open up BankSalaries.csv

Courtesy: AWZ, data based on a real case from 1995

YrsExper: years with this bank; YrsPrior: years with an earlier bank

PCJob: Is the employee's job computer-related?

Q: Does the bank discriminate against women by salary?

Difference
[Figure: Salaries by Gender. Salary (axis from 30000 to 90000) for Female and Male employees]

[Figure: normal quantile plots. Normality of Male Salaries (Males$Salary vs. norm quantiles) and Normality of Female Salaries (Females$Salary vs. norm quantiles)]

Populations fairly normal, some skew, could transform with log

Two Sample t-test


data: Females$Salary and Males$Salary
t = -5.3024, df = 206, p-value = 2.935e-07
alternative hypothesis: true difference in means is not equal to 0
95% CI for Salary Diff
-11379.984 -5211.041
sample estimates:
mean of x mean of y
37209.93 45505.44

Dummy

First, let's regress with just a dummy variable for Gender

Model: Est. Salary = $37,210 + $8,295 * Gender

Gender = 1 for Male and 0 for Female (reference)

Q: Model for men? For women?

lm(formula = Salary ~ Gender, data = Salaries)
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   37209.9       894.5   41.597   < 2e-16 ***
GenderMale     8295.5      1564.5    5.302  2.94e-07 ***

Residual standard error: 10580 on 206 degrees of freedom
Multiple R-squared: 0.1201, Adjusted R-squared: 0.1158
F-statistic: 28.12 on 1 and 206 DF, p-value: 2.935e-07
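
A regression on a single 0/1 dummy simply reproduces the two group means: the intercept is the mean of the reference group and the dummy's coefficient is the difference in means. The Python sketch below demonstrates this with a hand-rolled least-squares fit; `simple_ols` and the toy salaries are hypothetical and are not the bank data.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy data: dummy = 0 for the reference group, 1 otherwise.
dummy = [0, 0, 0, 1, 1]
salary = [30.0, 32.0, 34.0, 40.0, 44.0]

intercept, slope = simple_ols(dummy, salary)
mean_0 = sum(s for d, s in zip(dummy, salary) if d == 0) / dummy.count(0)
mean_1 = sum(s for d, s in zip(dummy, salary) if d == 1) / dummy.count(1)
print(intercept, slope, mean_0, mean_1)
```

In the bank output the same pattern appears: the intercept matches the female mean from the t-test, and the intercept plus the GenderMale coefficient matches the male mean.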

Regress

Know: YrsPrior does not add substantively to the regression. Regress with Gender (dummy) & interaction with YrsExper


Q: Model for men? For women?

lm(Salary ~ YrsExper + Gender + Gender * YrsExper)
                     Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)           34528.3      1138.0   30.342   < 2e-16 ***
YrsExper                280.0       102.5    2.733   0.00684 **
GenderMale            -4098.3      1665.8   -2.460   0.01472 *
YrsExper:GenderMale    1247.8       136.7    9.130   < 2e-16 ***

Residual standard error: 6816 on 204 degrees of freedom
Multiple R-squared: 0.6386, Adjusted R-squared: 0.6333
F-statistic: 120.2 on 3 and 204 DF, p-value: < 2.2e-16

Visualise

Computing gender-wise, we obtain

Men: Est. SalaryMale = $30,430 + $1,528 * YrsExper

Women: Est. SalaryFemale = $34,528 + $280 * YrsExper

Case of déjà vu? Why??

Improved

Add in JobGrade & PCJob; JobGrade masks EducLev

lm(Salary ~ YrsExper + Gender + Gender * YrsExper + JobGrade + PCJob)
                     Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)          31366.35      990.54   31.666   < 2e-16 ***
YrsExper                93.45       77.48    1.206  0.229247
GenderMale           -5039.19     1269.96   -3.968  0.000101 ***
JobGrade2             2143.65      993.17    2.158  0.032101 *
JobGrade3             6048.03      973.93    6.210  3.06e-09 ***
JobGrade4            10685.04     1148.17    9.306   < 2e-16 ***
JobGrade5            14949.09     1305.90   11.447   < 2e-16 ***
JobGrade6            16877.82     2330.30    7.243  9.55e-12 ***
PCJobYes              4201.57     1233.33    3.407  0.000796 ***
YrsExper:GenderMale    975.98      116.44    8.382  9.58e-15 ***
Residual standard error: 4864 on 198 degrees of freedom
Multiple R-squared: 0.8214, Adjusted R-squared: 0.8133
F-statistic: 101.2 on 9 and 198 DF, p-value: < 2.2e-16

Final

Our choice of predictors reveals multiple versions of the truth. Do several regression models reveal discriminatory results?

Our final model

Estimated Salary = $31,366 + $93*YrsExper - $5,039*IsMale? + $2,144*JobGrade2 + $6,048*JobGrade3 + $10,685*JobGrade4 + $14,949*JobGrade5 + $16,878*JobGrade6 + $4,202*PCJob? + $976*(IsMale?*YrsExper)

82% of the variance in Salary is explained by YrsExper, Gender, JobGrade and PCJob

Q: The last term is the kicker - why?
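
To explore the role of the interaction term, the final model can be written as a small function. This is a Python sketch using the coefficients from the output above; it assumes JobGrade 1 is the reference level (it has no coefficient in the output), and the names `predict`, `gender_gap` and `break_even` are hypothetical.

```python
# Coefficients reported for the final bank model (salary in $).
COEF = {
    "intercept": 31366.35,
    "YrsExper": 93.45,
    "GenderMale": -5039.19,
    "PCJobYes": 4201.57,
    "YrsExper:GenderMale": 975.98,
}
# JobGrade enters as dummies; grade 1 is assumed to be the reference level.
JOB_GRADE = {1: 0.0, 2: 2143.65, 3: 6048.03, 4: 10685.04, 5: 14949.09, 6: 16877.82}


def predict(yrs_exper, is_male, job_grade, pc_job):
    """Estimated salary under the final model."""
    male = 1 if is_male else 0
    pc = 1 if pc_job else 0
    return (COEF["intercept"]
            + COEF["YrsExper"] * yrs_exper
            + COEF["GenderMale"] * male
            + JOB_GRADE[job_grade]
            + COEF["PCJobYes"] * pc
            + COEF["YrsExper:GenderMale"] * male * yrs_exper)


def gender_gap(yrs_exper):
    """Male minus female estimate, holding JobGrade and PCJob fixed."""
    return COEF["GenderMale"] + COEF["YrsExper:GenderMale"] * yrs_exper


# Experience level at which the estimated gap changes sign.
break_even = -COEF["GenderMale"] / COEF["YrsExper:GenderMale"]
print(round(gender_gap(0), 2), round(gender_gap(10), 2), round(break_even, 2))
```

Because of the interaction, the estimated gap is not a single number: it depends on YrsExper, starting negative for men and growing in their favour with each year at the bank.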
