A Reductions Approach to Fair Classification
Alekh Agarwal 1   Alina Beygelzimer 2   Miroslav Dudík 1   John Langford 1   Hanna Wallach 1
(2018) begin with a similar goal to ours, but they analyze the Bayes optimal classifier under fairness constraints in the limit of infinite data. In contrast, our focus is algorithmic, our approach applies to any classifier family, and we obtain finite-sample guarantees. Dwork et al. (2018) also begin with a similar goal to ours. Their approach partitions the training examples into subsets according to protected attribute values and then leverages transfer learning to jointly learn from these separate data sets. Our approach avoids partitioning the data and assumes access only to a classification algorithm rather than a transfer learning algorithm.

A preliminary version of this paper appeared at the FAT/ML workshop (Agarwal et al., 2017), and led to extensions with more general optimization objectives (Alabi et al., 2018) and combinatorial protected attributes (Kearns et al., 2018).

In the next section, we formalize our problem. While we focus on two well-known quantitative definitions of fairness, our approach also encompasses many other previously studied definitions of fairness as special cases. In Section 3, we describe our reductions approach to fair classification and its guarantees in detail. The experimental study in Section 4 shows that our reductions compare favorably to three baselines, while overcoming some of their disadvantages and also offering the flexibility of picking a suitable accuracy–fairness tradeoff. Our results demonstrate the utility of having a general-purpose approach for combining machine learning methods and quantitative fairness definitions.

2. Problem Formulation

We consider a binary classification setting where the training examples consist of triples (X, A, Y), where X ∈ 𝒳 is a feature vector, A ∈ 𝒜 is a protected attribute, and Y ∈ {0, 1} is a label. The feature vector X can either contain the protected attribute A as one of the features or contain other features that are arbitrarily indicative of A. For example, if the classification task is to predict whether or not someone will default on a loan, each training example might correspond to a person, where X represents their demographics, income level, past payment history, and loan amount; A represents their race; and Y represents whether or not they defaulted on that loan. Note that X might contain their race as one of the features or, for example, contain their zipcode—a feature that is often correlated with race. Our goal is to learn an accurate classifier h : 𝒳 → {0, 1} from some set (i.e., family) of classifiers ℋ, such as linear threshold rules, decision trees, or neural nets, while satisfying some definition of fairness. Note that the classifiers in ℋ do not explicitly depend on A.

2.1. Fairness Definitions

We focus on two well-known quantitative definitions of fairness that have been considered in previous work on fair classification; however, our approach also encompasses many other previously studied definitions of fairness as special cases, as we explain at the end of this section.

The first definition—demographic (or statistical) parity—can be thought of as a stronger version of the US Equal Employment Opportunity Commission's "four-fifths rule," which requires that the "selection rate for any race, sex, or ethnic group [must be at least] four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate."¹

¹See the Uniform Guidelines on Employment Selection Procedures, 29 C.F.R. §1607.4(D) (2015).

Definition 1 (Demographic parity—DP). A classifier h satisfies demographic parity under a distribution over (X, A, Y) if its prediction h(X) is statistically independent of the protected attribute A—that is, if P[h(X) = ŷ | A = a] = P[h(X) = ŷ] for all a, ŷ. Because ŷ ∈ {0, 1}, this is equivalent to E[h(X) | A = a] = E[h(X)] for all a.

The second definition—equalized odds—was recently proposed by Hardt et al. (2016) to remedy two previously noted flaws with demographic parity (Dwork et al., 2012). First, demographic parity permits a classifier which accurately classifies data points with one value A = a, such as the value a with the most data, but makes random predictions for data points with A ≠ a as long as the probabilities of h(X) = 1 match. Second, demographic parity rules out perfect classifiers whenever Y is correlated with A. In contrast, equalized odds suffers from neither of these flaws.

Definition 2 (Equalized odds—EO). A classifier h satisfies equalized odds under a distribution over (X, A, Y) if its prediction h(X) is conditionally independent of the protected attribute A given the label Y—that is, if P[h(X) = ŷ | A = a, Y = y] = P[h(X) = ŷ | Y = y] for all a, y, and ŷ. Because ŷ ∈ {0, 1}, this is equivalent to E[h(X) | A = a, Y = y] = E[h(X) | Y = y] for all a, y.
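To make the two definitions concrete, the following sketch (our own helper functions, not code from this paper's released implementation) estimates the demographic-parity and equalized-odds gaps of a classifier from arrays of predictions ŷ = h(X), protected attributes, and labels:

```python
import numpy as np

def dp_gap(pred, A):
    """max_a |E[h(X) | A=a] - E[h(X)]|, estimated on a sample."""
    overall = pred.mean()
    return max(abs(pred[A == a].mean() - overall) for a in np.unique(A))

def eo_gap(pred, A, Y):
    """max_{a,y} |E[h(X) | A=a, Y=y] - E[h(X) | Y=y]|."""
    gaps = []
    for y in np.unique(Y):
        base = pred[Y == y].mean()
        for a in np.unique(A):
            mask = (A == a) & (Y == y)
            if mask.any():
                gaps.append(abs(pred[mask].mean() - base))
    return max(gaps)
```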
We now show how each definition can be viewed as a special case of a general set of linear constraints of the form

  Mμ(h) ≤ c,  (1)

where the matrix M ∈ ℝ^{|𝒦|×|𝒥|} and the vector c ∈ ℝ^{|𝒦|} describe the linear constraints, each indexed by k ∈ 𝒦, and μ(h) ∈ ℝ^{|𝒥|} is a vector of conditional moments of the form

  μ_j(h) = E[g_j(X, A, Y, h(X)) | ℰ_j]  for j ∈ 𝒥,

where g_j : 𝒳 × 𝒜 × {0, 1} × {0, 1} → [0, 1] and ℰ_j is an event defined with respect to (X, A, Y). Crucially, g_j depends on h, while ℰ_j cannot depend on h in any way.

Example 1 (DP). In a binary classification setting, demographic parity can be expressed as a set of |𝒜| equality constraints, each of the form E[h(X) | A = a] = E[h(X)]. Letting 𝒥 = 𝒜 ∪ {★}, g_j(X, A, Y, h(X)) = h(X) for all j, ℰ_a = {A = a}, and ℰ_★ = {True}, where {True} refers to the event encompassing all points in the sample space, each equality constraint can be expressed as μ_a(h) = μ_★(h).² Finally, because each such constraint can be equivalently expressed as a pair of inequality constraints of the form

  μ_a(h) − μ_★(h) ≤ 0
  −μ_a(h) + μ_★(h) ≤ 0,

demographic parity can be expressed as equation (1), where 𝒦 = 𝒜 × {+, −}, M_{(a,+),a′} = 1{a′ = a}, M_{(a,+),★} = −1, M_{(a,−),a′} = −1{a′ = a}, M_{(a,−),★} = 1, and c = 0. Expressing each equality constraint as a pair of inequality constraints allows us to control the extent to which each constraint is enforced by positing c_k > 0 for some (or all) k.

²Note that μ_★(h) = E[h(X) | True] = E[h(X)].

Example 2 (EO). In a binary classification setting, equalized odds can be expressed as a set of 2|𝒜| equality constraints, each of the form E[h(X) | A = a, Y = y] = E[h(X) | Y = y]. Letting 𝒥 = (𝒜 ∪ {★}) × {0, 1}, g_j(X, A, Y, h(X)) = h(X) for all j, ℰ_{(a,y)} = {A = a, Y = y}, and ℰ_{(★,y)} = {Y = y}, each equality constraint can be equivalently expressed as

  μ_{(a,y)}(h) − μ_{(★,y)}(h) ≤ 0
  −μ_{(a,y)}(h) + μ_{(★,y)}(h) ≤ 0.

As a result, equalized odds can be expressed as equation (1), where 𝒦 = 𝒜 × 𝒴 × {+, −}, M_{(a,y,+),(a′,y′)} = 1{a′ = a, y′ = y}, M_{(a,y,+),(★,y′)} = −1{y′ = y}, M_{(a,y,−),(a′,y′)} = −1{a′ = a, y′ = y}, M_{(a,y,−),(★,y′)} = 1{y′ = y}, and c = 0. Again, we can posit c_k > 0 for some (or all) k to allow small violations of some (or all) of the constraints.
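To illustrate Example 1, here is a sketch (our own construction, with hypothetical helper names) that assembles μ̂(h) and M for demographic parity and checks Mμ̂(h) ≤ ĉ:

```python
import numpy as np

def dp_constraint_system(pred, A, eps):
    """Build mu-hat and M of Example 1 (DP) and test M @ mu <= c-hat.

    Moment index set J = groups + [star]; constraint set K = groups x {+, -}.
    """
    groups = list(np.unique(A))
    g = len(groups)
    # mu-hat_j = empirical mean of h(X) conditioned on event E_j
    mu = np.array([pred[A == a].mean() for a in groups] + [pred.mean()])
    M = np.zeros((2 * g, g + 1))
    for i in range(g):
        M[2 * i, i], M[2 * i, -1] = 1.0, -1.0          # mu_a - mu_star <= eps
        M[2 * i + 1, i], M[2 * i + 1, -1] = -1.0, 1.0  # mu_star - mu_a <= eps
    c_hat = np.full(2 * g, eps)                        # c = 0, so c-hat_k = eps_k
    return M, mu, bool(np.all(M @ mu <= c_hat))
```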
Although we omit the details, we note that many other previously studied definitions of fairness can also be expressed as equation (1). For example, equality of opportunity (Hardt et al., 2016) (also known as balance for the positive class; Kleinberg et al., 2017), balance for the negative class (Kleinberg et al., 2017), error-rate balance (Chouldechova, 2017), overall accuracy equality (Berk et al., 2017), and treatment equality (Berk et al., 2017) can all be expressed as equation (1); in contrast, calibration (Kleinberg et al., 2017) and predictive parity (Chouldechova, 2017) cannot, because to do so would require the event ℰ_j to depend on h. We note that our approach can also be used to satisfy multiple definitions of fairness, though if these definitions are mutually contradictory, e.g., as described by Kleinberg et al. (2017), then our guarantees become vacuous.

2.2. Fair Classification

In a standard (binary) classification setting, the goal is to learn the classifier h ∈ ℋ with the minimum classification error: err(h) := P[h(X) ≠ Y]. However, because our goal is to learn the most accurate classifier while satisfying fairness constraints, as formalized above, we instead seek to find the solution to the constrained optimization problem³

  min_{h∈ℋ} err(h)  subject to  Mμ(h) ≤ c.  (2)

³We consider misclassification error for concreteness, but all the results in this paper apply to any error of the form err(h) = E[g_err(X, A, Y, h(X))], where g_err(·, ·, ·, ·) ∈ [0, 1].

Furthermore, rather than just considering classifiers in the set ℋ, we can enlarge the space of possible classifiers by considering randomized classifiers that can be obtained via a distribution over ℋ. By considering randomized classifiers, we can achieve better accuracy–fairness tradeoffs than would otherwise be possible. A randomized classifier Q makes a prediction by first sampling a classifier h ∈ ℋ from Q and then using h to make the prediction. The resulting classification error is err(Q) = Σ_{h∈ℋ} Q(h) err(h) and the conditional moments are μ(Q) = Σ_{h∈ℋ} Q(h) μ(h) (see Appendix A for the derivation). Thus we seek to solve

  min_{Q∈Δ} err(Q)  subject to  Mμ(Q) ≤ c,  (3)

where Δ is the set of all distributions over ℋ.

In practice, we do not know the true distribution over (X, A, Y) and only have access to a data set of training examples {(X_i, A_i, Y_i)}_{i=1}^n. We therefore replace err(Q) and μ(Q) in equation (3) with their empirical versions êrr(Q) and μ̂(Q). Because of the sampling error in μ̂(Q), we also allow errors in satisfying the constraints by setting ĉ_k = c_k + ε_k for all k, where ε_k ≥ 0. After these modifications, we need to solve the empirical version of equation (3):

  min_{Q∈Δ} êrr(Q)  subject to  Mμ̂(Q) ≤ ĉ.  (4)
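Concretely, the randomized classifier Q of Section 2.2 can be stored as a list of base classifiers with mixture weights; a minimal sketch (our own helpers, assuming scikit-learn-style predict methods):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_randomized(classifiers, weights, X):
    """Predict with Q: draw h ~ Q independently for each example."""
    preds = np.column_stack([h.predict(X) for h in classifiers])
    idx = rng.choice(len(classifiers), size=len(X), p=weights)
    return preds[np.arange(len(X)), idx]

def err_randomized(errors, weights):
    """err(Q) = sum_h Q(h) err(h); the moments mu(Q) mix the same way."""
    return float(np.dot(weights, errors))
```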
3. Reductions Approach

We now show how the problem (4) can be reduced to a sequence of cost-sensitive classification problems. We further show that the solutions to our sequence of cost-sensitive classification problems yield a randomized classifier with the lowest (empirical) error subject to the desired constraints.

3.1. Cost-sensitive Classification

We assume access to a cost-sensitive classification algorithm for the set ℋ. The input to such an algorithm is a data set of training examples {(X_i, C_i^0, C_i^1)}_{i=1}^n, where C_i^0 and C_i^1 denote the losses—costs in this setting—for predicting the labels 0 or 1, respectively, for X_i. The algorithm outputs

  argmin_{h∈ℋ} Σ_{i=1}^n [h(X_i) C_i^1 + (1 − h(X_i)) C_i^0].  (5)

This abstraction allows us to specify different costs for different training examples, which is essential for incorporating fairness constraints. Moreover, efficient cost-sensitive classification algorithms are readily available for several common classifier representations (e.g., Beygelzimer et al., 2005; Langford & Beygelzimer, 2005; Fan et al., 1999). In particular, equation (5) is equivalent to a weighted classification problem, where the input consists of labeled examples {(X_i, Y_i, W_i)}_{i=1}^n with Y_i ∈ {0, 1} and W_i ≥ 0, and the goal is to minimize the weighted classification error Σ_{i=1}^n W_i 1{h(X_i) ≠ Y_i}. This is equivalent to equation (5) if we set W_i = |C_i^0 − C_i^1| and Y_i = 1{C_i^0 ≥ C_i^1}.
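For instance, with a learner that accepts per-example weights, the translation from cost pairs to a weighted classification problem looks as follows (a sketch under our own naming; any scikit-learn classifier supporting sample_weight would do):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cost_sensitive_fit(X, C0, C1):
    """Solve eq. (5) as weighted classification: weight W_i = |C0_i - C1_i|,
    label 1{C0_i >= C1_i}, then minimize the weighted classification error."""
    W = np.abs(C0 - C1)
    Z = (C0 >= C1).astype(int)
    clf = LogisticRegression(max_iter=1000)  # any learner with sample_weight
    clf.fit(X, Z, sample_weight=W)
    return clf
```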
3.2. Reduction

To derive our fair classification algorithm, we rewrite equation (4) as a saddle point problem. We begin by introducing a Lagrange multiplier λ_k ≥ 0 for each of the |𝒦| constraints, summarized as λ ∈ ℝ_+^{|𝒦|}, and form the Lagrangian

  L(Q, λ) = êrr(Q) + λ^⊤(Mμ̂(Q) − ĉ).

Thus, equation (4) is equivalent to the saddle point problem

  min_{Q∈Δ} max_{λ∈ℝ_+^{|𝒦|}} L(Q, λ).

For algorithmic tractability, the λ-player is restricted to the bounded set Λ := {λ ∈ ℝ_+^{|𝒦|} : ‖λ‖_1 ≤ B}, where the bound B is an input to Algorithm 1. The algorithm relies on two subroutines, BEST_h(λ) and BEST_λ(Q).

BEST_h(λ): the best response of the Q-player. Here, the best response minimizes L(Q, λ) over all Qs in the simplex. Because L is linear in Q, the minimizer can always be chosen to put all of the mass on a single classifier h. We show how to obtain the classifier constituting the best response via a reduction to cost-sensitive classification. Letting p_j := P̂[ℰ_j] be the empirical event probabilities, the Lagrangian for Q which puts all of the mass on a single h is then

  L(h, λ) = êrr(h) + λ^⊤(Mμ̂(h) − ĉ)
          = Ê[1{h(X) ≠ Y}] − λ^⊤ĉ + Σ_{k,j} M_{k,j} λ_k μ̂_j(h)
          = −λ^⊤ĉ + Ê[1{h(X) ≠ Y}] + Σ_{k,j} (M_{k,j} λ_k / p_j) Ê[g_j(X, A, Y, h(X)) 1{(X, A, Y) ∈ ℰ_j}].
Assuming a data set of training examples {(X_i, A_i, Y_i)}_{i=1}^n, the minimization of L(h, λ) over h then corresponds to cost-sensitive classification on {(X_i, C_i^0, C_i^1)}_{i=1}^n with costs⁴

  C_i^0 = 1{Y_i ≠ 0} + Σ_{k,j} (M_{k,j} λ_k / p_j) g_j(X_i, A_i, Y_i, 0) 1{(X_i, A_i, Y_i) ∈ ℰ_j}
  C_i^1 = 1{Y_i ≠ 1} + Σ_{k,j} (M_{k,j} λ_k / p_j) g_j(X_i, A_i, Y_i, 1) 1{(X_i, A_i, Y_i) ∈ ℰ_j}.

Algorithm 1 Exponentiated-gradient reduction for fair classification
  Input: training examples {(X_i, Y_i, A_i)}_{i=1}^n,
         fairness constraints specified by g_j, ℰ_j, M, ĉ,
         bound B, accuracy ν, learning rate η
  Set θ_1 = 0 ∈ ℝ^{|𝒦|}
  for t = 1, 2, . . . do
    Set λ_{t,k} = B exp{θ_{t,k}} / (1 + Σ_{k′∈𝒦} exp{θ_{t,k′}}) for all k ∈ 𝒦
    h_t ← BEST_h(λ_t)
    Q̂_t ← (1/t) Σ_{t′=1}^t h_{t′},  L̄ ← L(Q̂_t, BEST_λ(Q̂_t))
    λ̂_t ← (1/t) Σ_{t′=1}^t λ_{t′},  L̲ ← L(BEST_h(λ̂_t), λ̂_t)
    ν_t ← max{ L(Q̂_t, λ̂_t) − L̲,  L̄ − L(Q̂_t, λ̂_t) }
    if ν_t ≤ ν then Return (Q̂_t, λ̂_t)
    Set θ_{t+1} = θ_t + η(Mμ̂(h_t) − ĉ)
  end for

Algorithm 1 terminates once it reaches a ν-approximate equilibrium (where ν > 0 is an input to the algorithm). Such an approximate equilibrium corresponds to a ν-approximate saddle point of the Lagrangian, which is a pair (Q̂, λ̂) where L(Q̂, λ̂) ≤ L(Q, λ̂) + ν for all Q ∈ Δ and L(Q̂, λ̂) ≥ L(Q̂, λ) − ν for all λ ∈ Λ. Larger values of B tighten the constraint-violation guarantee but come at the cost of needing more iterations to reach any given suboptimality. In particular, as we derive in the theorem, achieving suboptimality ν may need up to 4ρ²B² log(|𝒦| + 1) / ν² iterations.

Example 3 (DP). Using the matrix M for demographic parity as described in Section 2, the cost-sensitive reduction for a vector of Lagrange multipliers λ uses costs

  C_i^0 = 1{Y_i ≠ 0},  C_i^1 = 1{Y_i ≠ 1} + λ_{A_i}/p_{A_i} − Σ_{a∈𝒜} λ_a,

where p_a := P̂[A = a] and λ_a := λ_{(a,+)} − λ_{(a,−)}, effectively replacing two non-negative Lagrange multipliers by a single multiplier, which can be either positive or negative. Because c_k = 0 for all k, ĉ_k = ε_k. Furthermore, because all empirical moments are bounded in [0, 1], we can assume ε_k ≤ 1, which yields the bound ρ ≤ 2. Thus, Algorithm 1 terminates in at most 16B² log(2|𝒜| + 1) / ν² iterations.

Example 4 (EO). For equalized odds, the cost-sensitive reduction for a vector of Lagrange multipliers λ uses costs

  C_i^0 = 1{Y_i ≠ 0},
  C_i^1 = 1{Y_i ≠ 1} + λ_{(A_i,Y_i)}/p_{(A_i,Y_i)} − Σ_{a∈𝒜} λ_{(a,Y_i)}/p_{(★,Y_i)},

where p_{(a,y)} := P̂[A = a, Y = y], p_{(★,y)} := P̂[Y = y], and λ_{(a,y)} := λ_{(a,y,+)} − λ_{(a,y,−)}. If we again assume ε_k ≤ 1, then we obtain the bound ρ ≤ 2. Thus, Algorithm 1 terminates in at most 16B² log(4|𝒜| + 1) / ν² iterations.
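Combining Algorithm 1's multiplier update with the Example 3 costs gives the following simplified sketch for demographic parity. It is our own minimal illustration: it runs for a fixed number of iterations instead of checking the ν_t stopping condition, it omits BEST_λ, and it is not the released fairlearn implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def expgrad_dp(X, Y, A, eps=0.05, T=50, B=10.0, eta=2.0):
    """Simplified sketch of Algorithm 1 for demographic parity.

    Returns the list h_1, ..., h_T; Q-hat is the uniform distribution over it.
    """
    groups = np.unique(A)
    p = np.array([np.mean(A == a) for a in groups])   # p_a = P-hat[A = a]
    theta = np.zeros(2 * len(groups))                 # one (+, -) pair per group
    hs = []
    for _ in range(T):
        # lambda_{t,k} = B exp(theta_k) / (1 + sum_k' exp(theta_k'))
        lam = B * np.exp(theta) / (1.0 + np.exp(theta).sum())
        lam_signed = lam[0::2] - lam[1::2]            # lambda_a = lam_(a,+) - lam_(a,-)
        # Example 3 costs: C0 = 1{Y != 0}; C1 = 1{Y != 1} + lam_A/p_A - sum_a lam_a
        C0 = (Y != 0).astype(float)
        C1 = (Y != 1).astype(float) - lam_signed.sum()
        for i, a in enumerate(groups):
            C1[A == a] += lam_signed[i] / p[i]
        # BEST_h(lambda): eq. (5) as weighted classification (assumes both
        # induced labels occur; a production version would guard this case)
        W = np.abs(C0 - C1)
        Z = (C0 >= C1).astype(int)
        h = LogisticRegression(max_iter=1000).fit(X, Z, sample_weight=W)
        hs.append(h)
        # theta_{t+1} = theta_t + eta * (M mu-hat(h_t) - c-hat)
        pred = h.predict(X)
        gaps = np.array([pred[A == a].mean() - pred.mean() for a in groups])
        theta += eta * np.column_stack([gaps - eps, -gaps - eps]).ravel()
    return hs
```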
α ≤ 1/2. We note that α = 1/2 in the vast majority of classifier families, including norm-bounded linear functions (see Theorem 1 of Kakade et al., 2009), neural networks (see Theorem 18 of Bartlett & Mendelson, 2002), and classifier families with bounded VC dimension (see Lemma 4 and Theorem 6 of Bartlett & Mendelson, 2002).

Recall that in our empirical optimization problem we assume that ĉ_k = c_k + ε_k, where ε_k ≥ 0 are error bounds that account for the discrepancy between μ(Q) and μ̂(Q). In our analysis, we assume that these error bounds have been set in accordance with the Rademacher complexity of ℋ.

Assumption 1. There exist C, C′ ≥ 0 and α ≤ 1/2 such that R_n(ℋ) ≤ Cn^{−α} and ε_k = C′ Σ_{j∈𝒥} |M_{k,j}| n_j^{−α}, where n_j is the number of data points that fall in ℰ_j, n_j := |{i : (X_i, A_i, Y_i) ∈ ℰ_j}|.

The optimization error can be bounded via a careful analysis of the Lagrangian and the optimality conditions of (P) and (D). Combining the three different sources of error yields the following bound, which we prove in Appendix C.

Theorem 2. Let Assumption 1 hold for C′ ≥ 2C + 2 + √(ln(4/δ)/2), where δ > 0. Let (Q̂, λ̂) be any ν-approximate saddle point of L, let Q★ minimize err(Q) subject to Mμ(Q) ≤ c, and let p★_j = P[ℰ_j]. Then, with probability at least 1 − (|𝒥| + 1)δ, the distribution Q̂ satisfies

  err(Q̂) ≤ err(Q★) + 2ν + Õ(n^{−α}),
  γ_k(Q̂) ≤ c_k + (1 + 2ν)/B + Σ_{j∈𝒥} |M_{k,j}| Õ(n_j^{−α})  for all k,

where Õ(·) suppresses polynomial dependence on ln(1/δ). If np★_j ≥ 8 log(2/δ) for all j, then, for all k,

  γ_k(Q̂) ≤ c_k + (1 + 2ν)/B + Σ_{j∈𝒥} |M_{k,j}| Õ((np★_j)^{−α}).

In other words, the solution returned by Algorithm 1 achieves the lowest feasible classification error on the true distribution up to the optimization error, which grows linearly with ν, and the statistical error, which grows as n^{−α}. Therefore, if we want to guarantee that the optimization error does not dominate the statistical error, we should set ν ∝ n^{−α}. The fairness constraints on the true distribution are satisfied up to the optimization error (1 + 2ν)/B and up to the statistical error. Because the statistical error depends on the moments, and the error in estimating the moments grows as n_j^{−α} ≥ n^{−α}, we can set B ∝ n^α to guarantee that the optimization error does not dominate the statistical error. Combining this reasoning with the learning rate setting of Theorem 1 yields the following theorem (proved in Appendix C).

Theorem 3. Let ρ := max_h ‖Mμ̂(h) − ĉ‖_∞. Let Assumption 1 hold for C′ ≥ 2C + 2 + √(ln(4/δ)/2), where δ > 0. Let Q★ minimize err(Q) subject to Mμ(Q) ≤ c. Then Algorithm 1 with ν ∝ n^{−α}, B ∝ n^α, and η ∝ ρ^{−2}n^{−2α} terminates in O(ρ²n^{4α} ln |𝒦|) iterations and returns Q̂, which with probability at least 1 − (|𝒥| + 1)δ satisfies

  err(Q̂) ≤ err(Q★) + Õ(n^{−α}),
  γ_k(Q̂) ≤ c_k + Σ_{j∈𝒥} |M_{k,j}| Õ(n_j^{−α})  for all k.

Example 5 (DP). If n_a denotes the number of training examples with A_i = a, then Assumption 1 states that we should set ε_{(a,+)} = ε_{(a,−)} = C′(n_a^{−α} + n^{−α}), and Theorem 3 then shows that for a suitable setting of C′, ν, B, and η, Algorithm 1 will return a randomized classifier Q̂ with the lowest feasible classification error up to Õ(n^{−α}) while also approximately satisfying the fairness constraints

  |E[h(X) | A = a] − E[h(X)]| ≤ Õ(n_a^{−α})  for all a,

where E is with respect to (X, A, Y) as well as h ∼ Q̂.

Example 6 (EO). Similarly, if n_{(a,y)} denotes the number of examples with A_i = a and Y_i = y and n_{(★,y)} denotes the number of examples with Y_i = y, then Assumption 1 states that we should set ε_{(a,y,+)} = ε_{(a,y,−)} = C′(n_{(a,y)}^{−α} + n_{(★,y)}^{−α}), and Theorem 3 then shows that for a suitable setting of C′, ν, B, and η, Algorithm 1 will return a randomized classifier Q̂ with the lowest feasible classification error up to Õ(n^{−α}) while also approximately satisfying the fairness constraints

  |E[h(X) | A = a, Y = y] − E[h(X) | Y = y]| ≤ Õ(n_{(a,y)}^{−α})  for all a, y.

Again, E includes randomness under the true distribution over (X, A, Y) as well as h ∼ Q̂.
3.4. Grid Search

In some situations, it is preferable to select a deterministic classifier, even if that means a lower accuracy or a modest violation of the fairness constraints. A set of candidate classifiers can be obtained from the saddle point (Q†, λ†). Specifically, because Q† is a minimizer of L(Q, λ†) and L is linear in Q, the distribution Q† puts non-zero mass only on classifiers that are the Q-player's best responses to λ†. If we knew λ†, we could retrieve one such best response via the reduction to cost-sensitive learning introduced in Section 3.2.

We can compute λ† using Algorithm 1, but when the number of constraints is very small, as is the case for demographic parity or equalized odds with a binary protected attribute, it is also reasonable to consider a grid of values λ, calculate the best response for each value, and then select the value with the desired tradeoff between accuracy and fairness.

Example 7 (DP). When the protected attribute is binary, e.g., A ∈ {a, a′}, then the grid search can in fact be conducted in a single dimension. The reduction formally takes
two real-valued arguments λ_a and λ_{a′}, and then adjusts the costs for predicting h(X_i) = 1 by the amounts

  δ_a = λ_a/p_a − λ_a − λ_{a′}  and  δ_{a′} = λ_{a′}/p_{a′} − λ_a − λ_{a′},

respectively, on the training examples with A_i = a and A_i = a′. These adjustments satisfy p_a δ_a + p_{a′} δ_{a′} = 0, so instead of searching over λ_a and λ_{a′}, we can carry out the grid search over δ_a alone and apply the adjustment δ_{a′} = −p_a δ_a / p_{a′} to the protected attribute value a′.

With three attribute values, e.g., A ∈ {a, a′, a″}, we similarly have p_a δ_a + p_{a′} δ_{a′} + p_{a″} δ_{a″} = 0, so it suffices to conduct grid search in two dimensions rather than three.
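A minimal sketch of this one-dimensional grid search (our own wiring; `fit` is any cost-sensitive learner, such as the weighted-classification helper sketched in Section 3.1):

```python
import numpy as np

def grid_search_dp(X, Y, A, deltas, fit):
    """One-dimensional DP grid search over the adjustment delta_a.

    A is binary (0/1) and `fit(X, C0, C1)` solves the cost-sensitive
    problem of eq. (5), e.g., via weighted classification.
    """
    p0, p1 = np.mean(A == 0), np.mean(A == 1)
    models = []
    for d0 in deltas:
        d1 = -p0 * d0 / p1                       # coupled adjustment for group 1
        C0 = (Y != 0).astype(float)
        C1 = (Y != 1).astype(float) + np.where(A == 0, d0, d1)
        models.append(fit(X, C0, C1))
    return models
```

Each returned model is deterministic; one then keeps the model with the preferred error–disparity tradeoff measured on held-out data.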
Example 8 (EO). If A ∈ {a, a′}, we obtain the adjustment

  δ_{(a,y)} = λ_{(a,y)}/p_{(a,y)} − (λ_{(a,y)} + λ_{(a′,y)})/p_{(★,y)}

for an example with protected attribute value a and label y, and similarly for protected attribute value a′. In this case, separately for each y, the adjustments satisfy

  p_{(a,y)} δ_{(a,y)} + p_{(a′,y)} δ_{(a′,y)} = 0,

so it suffices to do the grid search over δ_{(a,0)} and δ_{(a,1)} and set the parameters for a′ to δ_{(a′,y)} = −p_{(a,y)} δ_{(a,y)} / p_{(a′,y)}.
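The corresponding EO cost adjustments, computed separately for each label value, might look like this (again our own sketch, with A ∈ {0, 1}):

```python
import numpy as np

def eo_adjustments(Y, A, d_a0, d_a1):
    """Cost adjustments for the EO grid search with a binary A in {0, 1}.

    d_a0 and d_a1 are the grid values delta_(0,0) and delta_(0,1) for group 0;
    group-1 values follow from p_(0,y) delta_(0,y) + p_(1,y) delta_(1,y) = 0.
    Returns the amount to add to C1, the cost of predicting h(X_i) = 1.
    """
    adj = np.zeros(len(Y), dtype=float)
    for y, d in ((0, d_a0), (1, d_a1)):
        p0 = np.mean((A == 0) & (Y == y))        # p_(0,y) = P-hat[A = 0, Y = y]
        p1 = np.mean((A == 1) & (Y == y))        # p_(1,y)
        adj[(A == 0) & (Y == y)] = d
        adj[(A == 1) & (Y == y)] = -p0 * d / p1
    return adj
```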
4. Experimental Results

We now examine how our exponentiated-gradient reduction⁵ performs at the task of binary classification subject to either demographic parity or equalized odds. We provide an evaluation of our grid-search reduction in Appendix D.

⁵https://github.com/Microsoft/fairlearn

We compared our reduction with the score-based post-processing algorithm of Hardt et al. (2016), which takes as its input any classifier (i.e., a standard classifier without any fairness constraints) and derives a monotone transformation of the classifier's output to remove any disparity with respect to the training examples. This post-processing algorithm works with both demographic parity and equalized odds, as well as with binary and non-binary protected attributes.

For demographic parity, we also compared our reduction with the reweighting and relabeling approaches of Kamiran & Calders (2012). Reweighting can be applied to both binary and non-binary protected attributes and operates by changing importance weights on each example with the goal of removing any statistical dependence between the protected attribute and the label.⁶ Relabeling was developed for binary protected attributes. First, a classifier is trained on the original data (without considering fairness). The training examples close to the decision boundary are then relabeled to remove all disparity while minimally affecting accuracy. The final classifier is then trained on the relabeled data.

⁶Although reweighting was developed for demographic parity, the weights that it induces are achievable by our grid search, albeit the grid search for equalized odds rather than demographic parity.

As the base classifiers for our reductions, we used the weighted classification implementations of logistic regression and gradient-boosted decision trees in scikit-learn (Pedregosa et al., 2011). In addition to the three baselines described above, we also compared our reductions to the "unconstrained" classifiers trained to optimize accuracy only.

We used four data sets, randomly splitting each one into training examples (75%) and test examples (25%):

• The adult income data set (Lichman, 2013) (48,842 examples). Here the task is to predict whether someone makes more than $50k per year, with gender as the protected attribute. To examine the performance for non-binary protected attributes, we also conducted another experiment with the same data, using both gender and race (binarized into white and non-white) as the protected attribute. Relabeling, which requires binary protected attributes, was therefore not applicable here.

• ProPublica's COMPAS recidivism data (7,918 examples). The task is to predict recidivism from someone's criminal history, jail and prison time, demographics, and COMPAS risk scores, with race as the protected attribute (restricted to white and black defendants).

• Law School Admissions Council's National Longitudinal Bar Passage Study (Wightman, 1998) (20,649 examples). Here the task is to predict someone's eventual passage of the bar exam, with race (restricted to white and black only) as the protected attribute.

• The Dutch census data set (Dutch Central Bureau for Statistics, 2001) (60,420 examples). Here the task is to predict whether or not someone has a prestigious occupation, with gender as the protected attribute.

While all the evaluated algorithms require access to the protected attribute A at training time, only the post-processing algorithm requires access to A at test time. For a fair comparison, we included A in the feature vector X, so all algorithms had access to it at both training time and test time.

We used the test examples to measure the classification error for each approach, as well as the violation of the desired fairness constraints, i.e., max_a |E[h(X) | A = a] − E[h(X)]| and max_{a,y} |E[h(X) | A = a, Y = y] − E[h(X) | Y = y]| for demographic parity and equalized odds, respectively.

We ran our reduction across a wide range of tradeoffs between the classification error and fairness constraints. We considered ε ∈ {0.001, . . . , 0.1} and for each value ran Algorithm 1 with ĉ_k = ε across all k. As expected, the returned randomized classifiers tracked the training Pareto frontier (see Figure 2 in Appendix D). In Figure 1, we evaluate these classifiers alongside the baselines on the test data.
[Figure 1: twenty panels of test error (vertical axis) versus constraint violation (horizontal axis), one column per data set (adult, adult4, COMPAS, Dutch census, Law Schools) and one row per setting (DP/log. reg., DP/boosting, EO/log. reg., EO/boosting).]

Figure 1. Test classification error versus constraint violation with respect to DP (top two rows) and EO (bottom two rows). All data sets have binary protected attributes except for adult4, which has four protected attribute values, so relabeling is not applicable there. For our reduction approach we plot the convex envelope of the classifiers obtained on training data at various accuracy–fairness tradeoffs. We show 95% confidence bands for the classification error of our reduction approach and 95% confidence intervals for the constraint violation of post-processing. Our reduction approach dominates or matches the performance of the other approaches up to statistical uncertainty.
For all the data sets, the range of classification errors is much smaller than the range of constraint violations. Almost all the approaches were able to substantially reduce or remove disparity without much impact on classifier accuracy. One exception was the Dutch census data set, where the classification error increased the most in relative terms.

Our reduction generally dominated or matched the baselines. The relabeling approach frequently yielded solutions that were not Pareto optimal. Reweighting yielded solutions on the Pareto frontier, but often with substantial disparity. As expected, post-processing yielded disparities that were statistically indistinguishable from zero, but the resulting classification error was sometimes higher than achieved by our reduction under a statistically indistinguishable disparity. In addition, and unlike the post-processing algorithm, our reduction can achieve any desired accuracy–fairness tradeoff, allows a wider range of fairness definitions, and does not require access to the protected attribute at test time.

Our grid-search reduction, evaluated in Appendix D, sometimes failed to achieve the lowest disparities on the training data, but its performance on the test data very closely matched that of our exponentiated-gradient reduction. However, if the protected attribute is non-binary, then grid search is not feasible. For instance, for the version of the adult income data set where the protected attribute takes on four values, the grid search would need to span three dimensions for demographic parity and six dimensions for equalized odds, both of which are prohibitively costly.

5. Conclusion

We presented two reductions for achieving fairness in a binary classification setting. Our reductions work for any classifier representation, encompass many definitions of fairness, satisfy provable guarantees, and work well in practice.

Our reductions optimize the tradeoff between accuracy and any (single) definition of fairness given training-time access to protected attributes. Achieving fairness when training-time access to protected attributes is unavailable remains an open problem for future research, as does the navigation of tradeoffs between accuracy and multiple fairness definitions.
Acknowledgements

We would like to thank Aaron Roth, Sam Corbett-Davies, and Emma Pierson for helpful discussions.

References

Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806, 2017.

Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J., and Pontil, M. Empirical risk minimization under fairness constraints. 2018. arXiv:1802.08626.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

Kamishima, T., Akaho, S., and Sakuma, J. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 643–650, 2011.

Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
A. Error and Moments of a Randomized Classifier

The classification error of a randomized classifier Q is

  err(Q) := P_{(X,A,Y)∼D, h∼Q}[h(X) ≠ Y] = Σ_{h∈ℋ} Q(h) err(h).

The fairness constraints on a classifier h are Mμ(h) ≤ c. Recall that μ_j(h) := E_D[g_j(X, A, Y, h(X)) | ℰ_j]. For a randomized classifier Q we define its moment μ_j as

  μ_j(Q) := E_{(X,A,Y)∼D, h∼Q}[g_j(X, A, Y, h(X)) | ℰ_j] = Σ_{h∈ℋ} Q(h) μ_j(h).
B. Proof of Theorem 1

The proof follows immediately from the analysis of Freund & Schapire (1996) applied to the Exponentiated Gradient (EG) algorithm (Kivinen & Warmuth, 1997), which in our specific case is also equivalent to Hedge (Freund & Schapire, 1997).

Let Λ := {λ ∈ ℝ_+^{|𝒦|} : ‖λ‖_1 ≤ B} and Λ′ := {λ′ ∈ ℝ_+^{|𝒦|+1} : ‖λ′‖_1 = B}. We associate any λ ∈ Λ with the λ′ ∈ Λ′ that is equal to λ on coordinates 1 through |𝒦| and puts the remaining mass on the coordinate λ′_{|𝒦|+1}.

Consider a run of Algorithm 1. For each λ_t, let λ′_t ∈ Λ′ be the associated element of Λ′. Let r_t := Mμ̂(h_t) − ĉ and let r′_t ∈ ℝ^{|𝒦|+1} be equal to r_t on coordinates 1 through |𝒦| and put zero on the coordinate r′_{t,|𝒦|+1}. Thus, for any λ and the associated λ′, we have, for all t,

  λ^⊤ r_t = (λ′)^⊤ r′_t,

and, in particular, λ_t^⊤ r_t = (λ′_t)^⊤ r′_t.

We interpret r′_t as the reward vector for the λ-player. The choices of λ′_t then correspond to those of the EG algorithm with the learning rate η. By the assumption of the theorem we have ‖r′_t‖_∞ = ‖r_t‖_∞ ≤ ρ. The regret bound for EG, specifically, Corollary 2.14 of Shalev-Shwartz (2012), then states that for any λ′ ∈ Λ′,

  Σ_{t=1}^T (λ′)^⊤ r′_t ≤ Σ_{t=1}^T (λ′_t)^⊤ r′_t + B log(|𝒦| + 1)/η + ηρ²BT,

where we write ζ_T := B log(|𝒦| + 1)/η + ηρ²BT for the last two terms. Hence, for any λ ∈ Λ,

  Σ_{t=1}^T λ^⊤ r_t ≤ Σ_{t=1}^T λ_t^⊤ r_t + ζ_T.  (9)
For any λ ∈ Λ, we then have

  L(Q̂_T, λ) = (1/T) Σ_{t=1}^T [êrr(h_t) + λ^⊤(Mμ̂(h_t) − ĉ)]
            = (1/T) Σ_{t=1}^T [êrr(h_t) + λ^⊤ r_t]
            ≤ (1/T) Σ_{t=1}^T [êrr(h_t) + λ_t^⊤ r_t] + ζ_T/T   (10)
            = (1/T) Σ_{t=1}^T L(h_t, λ_t) + ζ_T/T
            ≤ (1/T) Σ_{t=1}^T L(Q̂_T, λ_t) + ζ_T/T   (11)
            = L(Q̂_T, (1/T) Σ_{t=1}^T λ_t) + ζ_T/T = L(Q̂_T, λ̂_T) + ζ_T/T.   (12)

Equation (10) follows from the regret bound (9). Equation (11) follows because L(h_t, λ_t) ≤ L(Q, λ_t) for all Q by the choice of h_t as the best response of the Q-player. Finally, equation (12) follows by linearity of L(Q, λ) in λ. Thus, we have for all λ ∈ Λ,

  L(Q̂_T, λ̂_T) ≥ L(Q̂_T, λ) − ζ_T/T.   (13)

Also, for any Q,

  L(Q, λ̂_T) = (1/T) Σ_{t=1}^T L(Q, λ_t)   (14)
            ≥ (1/T) Σ_{t=1}^T L(h_t, λ_t)   (15)
            ≥ (1/T) Σ_{t=1}^T L(h_t, λ̂_T) − ζ_T/T   (16)
            = L(Q̂_T, λ̂_T) − ζ_T/T,   (17)

where equation (14) follows by linearity of L(Q, λ) in λ, equation (15) follows by the optimality of h_t with respect to λ_t, equation (16) from the regret bound (9), and equation (17) by linearity of L(Q, λ) in Q. Thus, for all Q,

  L(Q̂_T, λ̂_T) ≤ L(Q, λ̂_T) + ζ_T/T.   (18)

Equations (13) and (18) immediately imply that for any T ≥ 1,

  ν_T ≤ ζ_T/T = B log(|𝒦| + 1)/(ηT) + ηρ²B.

Plugging in η = ν/(2ρ²B) and T = 4ρ²B² log(|𝒦| + 1)/ν² yields

  ν_T ≤ B log(|𝒦| + 1)/(ηT) + ηρ²B = ν/2 + ν/2 = ν.
Let Q★ minimize err(Q) subject to Mμ(Q) ≤ c. Then with probability at least 1 − (|𝒥| + 1)δ, the distribution Q̂ satisfies

  err(Q̂) ≤ err(Q★) + 2ν + 4R_n(ℋ) + 4/√n + √(2 ln(2/δ)/n),

and for all k,

  γ_k(Q̂) ≤ c_k + (1 + 2ν)/B + 2ε_k.
Let Λ := {λ ∈ ℝ_+^{|𝒦|} : ‖λ‖_1 ≤ B} denote the domain of λ. In the remainder of the section, we assume that we are given a pair (Q̂, λ̂) which is a ν-approximate saddle point of L, i.e.,

  L(Q̂, λ̂) ≤ L(Q, λ̂) + ν  for all Q ∈ Δ,
  L(Q̂, λ̂) ≥ L(Q̂, λ) − ν  for all λ ∈ Λ.   (19)
Proof. We show that the lemma follows from the optimality conditions (19). We consider a dual variable λ defined as

  λ = 0 if γ̂(Q̂) ≤ ĉ, and λ = B e_{k★} otherwise, where k★ = argmax_k [γ̂_k(Q̂) − ĉ_k],

and where e_k denotes the kth vector of the standard basis. Then we have by equations (19) and (20) that

  êrr(Q̂) + Σ_{k∈𝒦} λ̂_k (γ̂_k(Q̂) − ĉ_k) = L(Q̂, λ̂)
    ≥ L(Q̂, λ) − ν = êrr(Q̂) + Σ_{k∈𝒦} λ_k (γ̂_k(Q̂) − ĉ_k) − ν,
The next two lemmas bound the empirical error of Q̂ and the amount by which Q̂ violates the empirical fairness constraints.

Lemma 2 (Empirical error bound). The distribution Q̂ satisfies êrr(Q̂) ≤ êrr(Q) + 2ν for any Q satisfying the empirical fairness constraints, i.e., any Q such that γ̂(Q) ≤ ĉ.
Proof. Assume that Q satisfies γ̂(Q) ≤ ĉ. Since λ̂ ≥ 0, we have

  L(Q, λ̂) = êrr(Q) + λ̂^⊤(γ̂(Q) − ĉ) ≤ êrr(Q).

Combining this with the first condition in (19) yields

  L(Q̂, λ̂) ≤ L(Q, λ̂) + ν ≤ êrr(Q) + ν.

On the other hand, taking λ = 0 in the second condition in (19) gives

  L(Q̂, λ̂) ≥ L(Q̂, 0) − ν = êrr(Q̂) − ν.

Chaining the two bounds proves the lemma.
Lemma 3 (Empirical fairness violation). Assume that the empirical fairness constraints γ̂(Q) ≤ ĉ are feasible. Then the distribution Q̂ approximately satisfies all empirical fairness constraints:

  max_{k∈𝒦} [γ̂_k(Q̂) − ĉ_k] ≤ (1 + 2ν)/B.
Proof. Let Q satisfy γ̂(Q) ≤ ĉ. Applying the same upper and lower bound on L(Q̂, λ̂) as in the proof of Lemma 2, we obtain

  êrr(Q̂) + B max_{k∈𝒦} [γ̂_k(Q̂) − ĉ_k]_+ − ν ≤ L(Q̂, λ̂) ≤ êrr(Q) + ν.

We can further upper bound êrr(Q) − êrr(Q̂) by 1 and use x ≤ x_+ for any real number x to complete the proof.
It remains to lift the bounds on empirical classification error and constraint violation into the corresponding bounds on true classification error and the violation of true constraints. We will use the standard machinery of uniform convergence bounds via the (worst-case) Rademacher complexity.

Let ℱ be a class of functions f : 𝒵 → [0, 1] over some space 𝒵. Then the (worst-case) Rademacher complexity of ℱ is defined as

  R_n(ℱ) := sup_{z_1,…,z_n ∈ 𝒵} E[ sup_{f∈ℱ} (1/n) Σ_{i=1}^n σ_i f(z_i) ],

where the expectation is over the i.i.d. random variables σ_1, …, σ_n with P[σ_i = 1] = P[σ_i = −1] = 1/2.
We first prove concentration of generic moments derived from classifiers h ∈ ℋ and then move to bounding the deviations from true classification error and true fairness constraints.

Lemma 4 (Concentration of moments). Let g : 𝒳 × 𝒜 × {0, 1} × {0, 1} → [0, 1] be any function and let D be a distribution over (X, A, Y). Then with probability at least 1 − δ, for all h ∈ ℋ,

  |Ê[g(X, A, Y, h(X))] − E[g(X, A, Y, h(X))]| ≤ 2R_n(ℋ) + 2/√n + √(ln(2/δ)/(2n)),

where the expectation is with respect to D and the empirical expectation is based on n i.i.d. draws from D.
Proof. Let ℱ := {f_h}_{h∈ℋ} be the class of functions f_h : (x, a, y) ↦ g(x, a, y, h(x)). By Theorem 3.2 of Boucheron et al. (2005), with probability at least 1 − δ, for all h ∈ ℋ,

  |Ê[f_h] − E[f_h]| ≤ 2R_n(ℱ) + √(ln(2/δ)/(2n)).

We will next bound R_n(ℱ) in terms of R_n(ℋ). Since h(x) ∈ {0, 1}, we can write

  f_h(x, a, y) = h(x) g(x, a, y, 1) + (1 − h(x)) g(x, a, y, 0)
              = g(x, a, y, 0) + h(x) [g(x, a, y, 1) − g(x, a, y, 0)].

Since g(x, a, y, 0) ≤ 1 and |g(x, a, y, 1) − g(x, a, y, 0)| ≤ 1, we can invoke Theorem 12(5) of Bartlett & Mendelson (2002) for bounding function classes shifted by an offset, in our case g(x, a, y, 0), and Theorem 4.4 of Ledoux & Talagrand (1991) for bounding function classes under contraction, in our case g(x, a, y, 1) − g(x, a, y, 0), yielding

  R_n(ℱ) ≤ 1/√n + R_n(ℋ).

Combining the two bounds proves the lemma.
Proof. We first use Lemma 4 with g : (x, a, y, ŷ) ↦ 1{y ≠ ŷ} to obtain, with probability 1 − δ, for all h,

  |êrr(h) − err(h)| = |Ê[f_h] − E[f_h]| ≤ 2R_n(ℋ) + 2/√n + √(ln(2/δ)/(2n)).

The lemma now follows for any Q by taking a convex combination of the corresponding bounds on h ∈ ℋ.⁷
Finally, we show a result for the concentration of the empirical constraint violations to their population counterparts. We will actually show the concentration of the individual moments μ̂_j(Q) to μ_j(Q) uniformly for all Q ∈ Δ. Since M is a fixed matrix not dependent on the data, this also directly implies concentration of the constraints γ̂(Q) = Mμ̂(Q) to γ(Q) = Mμ(Q). For this result, recall that n_j = |{i ∈ [n] : (X_i, A_i, Y_i) ∈ ℰ_j}| and p★_j = P[ℰ_j].

Lemma 6 (Concentration of conditional moments). For any j ∈ 𝒥, with probability at least 1 − δ, for all Q,

  |μ̂_j(Q) − μ_j(Q)| ≤ 2R_{n_j}(ℋ) + 2/√n_j + √(ln(2/δ)/(2n_j)).

Proof. Our proof largely follows the proof of Lemma 2 of Woodworth et al. (2017), with appropriate modifications for our more general constraint definition. Let S_j := {i ∈ [n] : (X_i, A_i, Y_i) ∈ ℰ_j} be the set of indices such that the corresponding examples fall in the event ℰ_j. Note that we have defined n_j = |S_j|. Let D(·) denote the joint distribution of (X, A, Y). Then, conditioned on i ∈ S_j, the random variables g_j(X_i, A_i, Y_i, h(X_i)) are i.i.d. draws from the distribution D(· | ℰ_j), with mean μ_j(h). Applying Lemma 4 with g_j and the distribution D(· | ℰ_j) therefore yields, with probability 1 − δ, for all h,

  |μ̂_j(h) − μ_j(h)| ≤ 2R_{n_j}(ℋ) + 2/√n_j + √(ln(2/δ)/(2n_j)),
Proof of Theorem 4. We now use the lemmas derived so far to prove Theorem 4. We first use Lemma 6 to bound the gap between the empirical and population fairness constraints. The lemma implies that with probability at least 1 − |𝒥|δ, for all k ∈ 𝒦 and all Q ∈ Δ,

  |γ̂_k(Q) − γ_k(Q)| = |M_k(μ̂(Q) − μ(Q))|
    ≤ Σ_{j∈𝒥} |M_{k,j}| |μ̂_j(Q) − μ_j(Q)|
    ≤ Σ_{j∈𝒥} |M_{k,j}| (2R_{n_j}(ℋ) + 2/√n_j + √(ln(2/δ)/(2n_j)))
    ≤ ε_k.   (22)

The above shows that Q★ satisfies the empirical fairness constraints, so we can use Lemma 3, which together with equation (22) yields

  γ_k(Q̂) ≤ γ̂_k(Q̂) + ε_k ≤ ĉ_k + (1 + 2ν)/B + ε_k = c_k + (1 + 2ν)/B + 2ε_k,

proving the second statement of the theorem.
Proof of Theorem 2. The first part of the theorem follows immediately from Assumption 1 and Theorem 4 (with δ/2 instead of δ). The statement in fact holds with probability at least 1 − (|𝒥| + 1)δ/2. For the second part, we use the multiplicative Chernoff bound for binomial random variables. Note that E[n_j] = np★_j, and we assume that np★_j ≥ 8 ln(2/δ), so the multiplicative Chernoff bound implies that n_j ≤ np★_j/2 with probability at most δ/2. Taking the union bound across all j and combining with the first part of the theorem then proves the second part.

Proof of Theorem 3. This follows immediately from Theorem 1 and the first part of Theorem 2.
[Figure 2: twenty panels of training error versus constraint violation, in the same layout as Figure 1, with a shared legend: unconstrained, postprocessing, reweight, relabel, reduction (grid search), reduction (exp. grad.).]

Figure 2. Training classification error versus constraint violation, with respect to DP (top two rows) and EO (bottom two rows). Markers correspond to the baselines. For our two reductions and the interpolants between reweighting (or relabeling) and the unconstrained classifier, we varied their tradeoff parameters and plot the Pareto frontiers of the sets of classifiers obtained for each method. Because the curves of the different methods often overlap, we use vertical dashed lines to indicate the lowest constraint violations. All data sets have binary protected attributes except for adult4, which has four protected attribute values, so relabeling is not applicable and grid search is not feasible for this data set. The exponentiated-gradient reduction dominates or matches other approaches as expected since it solves exactly for the points on the Pareto frontier of the set of all classifiers in each considered class.
[Figure 3: twenty panels of test error versus constraint violation, in the same layout and with the same legend as Figure 2.]

Figure 3. Test classification error versus constraint violation, with respect to DP (top two rows) and EO (bottom two rows). Markers correspond to the baselines. For our two reductions and the interpolants between reweighting (or relabeling) and the unconstrained classifier, we show convex envelopes of the classifiers taken from the training Pareto frontier of each method (i.e., the same classifiers as shown in Figure 2). Because the curves of the different methods often overlap, we use vertical dashed lines to indicate the lowest constraint violations. All data sets have binary protected attributes except for adult4, which has four protected attribute values, so relabeling is not applicable and grid search is not feasible for this data set. We show 95% confidence bands for the classification error of the exponentiated-gradient reduction and 95% confidence intervals for the constraint violation of post-processing. The exponentiated-gradient reduction dominates or matches performance of all other methods up to statistical uncertainty.