
A Reductions Approach to Fair Classification

Alekh Agarwal 1   Alina Beygelzimer 2   Miroslav Dudík 1   John Langford 1   Hanna Wallach 1

Abstract

We present a systematic approach for achieving fairness in a binary classification setting. While we focus on two well-known quantitative definitions of fairness, our approach encompasses many other previously studied definitions as special cases. The key idea is to reduce fair classification to a sequence of cost-sensitive classification problems, whose solutions yield a randomized classifier with the lowest (empirical) error subject to the desired constraints. We introduce two reductions that work for any representation of the cost-sensitive classifier and compare favorably to prior baselines on a variety of data sets, while overcoming several of their disadvantages.

1. Introduction

Over the past few years, the media have paid considerable attention to machine learning systems and their ability to inadvertently discriminate against minorities, historically disadvantaged populations, and other protected groups when allocating resources (e.g., loans) or opportunities (e.g., jobs). In response to this scrutiny—and driven by ongoing debates and collaborations with lawyers, policy-makers, social scientists, and others (e.g., Barocas & Selbst, 2016)—machine learning researchers have begun to turn their attention to the topic of "fairness in machine learning," and, in particular, to the design of fair classification and regression algorithms.

In this paper we study the task of binary classification subject to fairness constraints with respect to a pre-defined protected attribute, such as race or sex. Previous work in this area can be divided into two broad groups of approaches.

The first group of approaches incorporates specific quantitative definitions of fairness into existing machine learning methods, often by relaxing the desired definitions of fairness and only enforcing weaker constraints, such as lack of correlation (e.g., Woodworth et al., 2017; Zafar et al., 2017; Johnson et al., 2016; Kamishima et al., 2011; Donini et al., 2018). The resulting fairness guarantees typically only hold under strong distributional assumptions, and the approaches are tied to specific families of classifiers, such as SVMs.

The second group of approaches eliminates the restriction to specific classifier families and treats the underlying classification method as a "black box," while implementing a wrapper that either works by pre-processing the data or post-processing the classifier's predictions (e.g., Kamiran & Calders, 2012; Feldman et al., 2015; Hardt et al., 2016; Calmon et al., 2017). Existing pre-processing approaches are specific to particular definitions of fairness and typically seek to come up with a single transformed data set that will work across all learning algorithms, which, in practice, leads to classifiers that still exhibit substantial unfairness (see our evaluation in Section 4). In contrast, post-processing allows a wider range of fairness definitions and results in provable fairness guarantees. However, it is not guaranteed to find the most accurate fair classifier, and it requires test-time access to the protected attribute, which might not be available.

We present a general-purpose approach that has the key advantage of this second group of approaches—i.e., the underlying classification method is treated as a black box—but without the noted disadvantages. Our approach encompasses a wide range of fairness definitions, is guaranteed to yield the most accurate fair classifier, and does not require test-time access to the protected attribute. Specifically, our approach allows any definition of fairness that can be formalized via linear inequalities on conditional moments, such as demographic parity or equalized odds (see Section 2.1). We show how binary classification subject to these constraints can be reduced to a sequence of cost-sensitive classification problems. We require only black-box access to a cost-sensitive classification algorithm, which does not need to have any knowledge of the desired definition of fairness or the protected attribute. We show that the solutions to our sequence of cost-sensitive classification problems yield a randomized classifier with the lowest (empirical) error subject to the desired fairness constraints.

1 Microsoft Research, New York. 2 Yahoo! Research, New York. Correspondence to: A. Agarwal <[email protected]>, A. Beygelzimer <[email protected]>, M. Dudík <[email protected]>, J. Langford <[email protected]>, H. Wallach <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).
Corbett-Davies et al. (2017) and Menon & Williamson (2018) begin with a similar goal to ours, but they analyze the Bayes optimal classifier under fairness constraints in the limit of infinite data. In contrast, our focus is algorithmic, our approach applies to any classifier family, and we obtain finite-sample guarantees. Dwork et al. (2018) also begin with a similar goal to ours. Their approach partitions the training examples into subsets according to protected attribute values and then leverages transfer learning to jointly learn from these separate data sets. Our approach avoids partitioning the data and assumes access only to a classification algorithm rather than a transfer learning algorithm.

A preliminary version of this paper appeared at the FAT/ML workshop (Agarwal et al., 2017), and led to extensions with more general optimization objectives (Alabi et al., 2018) and combinatorial protected attributes (Kearns et al., 2018).

In the next section, we formalize our problem. While we focus on two well-known quantitative definitions of fairness, our approach also encompasses many other previously studied definitions of fairness as special cases. In Section 3, we describe our reductions approach to fair classification and its guarantees in detail. The experimental study in Section 4 shows that our reductions compare favorably to three baselines, while overcoming some of their disadvantages and also offering the flexibility of picking a suitable accuracy–fairness tradeoff. Our results demonstrate the utility of having a general-purpose approach for combining machine learning methods and quantitative fairness definitions.

2. Problem Formulation

We consider a binary classification setting where the training examples consist of triples (X, A, Y), where X ∈ X is a feature vector, A ∈ A is a protected attribute, and Y ∈ {0, 1} is a label. The feature vector X can either contain the protected attribute A as one of the features or contain other features that are arbitrarily indicative of A. For example, if the classification task is to predict whether or not someone will default on a loan, each training example might correspond to a person, where X represents their demographics, income level, past payment history, and loan amount; A represents their race; and Y represents whether or not they defaulted on that loan. Note that X might contain their race as one of the features or, for example, contain their zipcode—a feature that is often correlated with race. Our goal is to learn an accurate classifier h : X → {0, 1} from some set (i.e., family) of classifiers H, such as linear threshold rules, decision trees, or neural nets, while satisfying some definition of fairness. Note that the classifiers in H do not explicitly depend on A.

2.1. Fairness Definitions

We focus on two well-known quantitative definitions of fairness that have been considered in previous work on fair classification; however, our approach also encompasses many other previously studied definitions of fairness as special cases, as we explain at the end of this section.

The first definition—demographic (or statistical) parity—can be thought of as a stronger version of the US Equal Employment Opportunity Commission's "four-fifths rule," which requires that the "selection rate for any race, sex, or ethnic group [must be at least] four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate."1

Definition 1 (Demographic parity—DP). A classifier h satisfies demographic parity under a distribution over (X, A, Y) if its prediction h(X) is statistically independent of the protected attribute A—that is, if P[h(X) = ŷ | A = a] = P[h(X) = ŷ] for all a, ŷ. Because ŷ ∈ {0, 1}, this is equivalent to E[h(X) | A = a] = E[h(X)] for all a.

The second definition—equalized odds—was recently proposed by Hardt et al. (2016) to remedy two previously noted flaws with demographic parity (Dwork et al., 2012). First, demographic parity permits a classifier which accurately classifies data points with one value A = a, such as the value a with the most data, but makes random predictions for data points with A ≠ a as long as the probabilities of h(X) = 1 match. Second, demographic parity rules out perfect classifiers whenever Y is correlated with A. In contrast, equalized odds suffers from neither of these flaws.

Definition 2 (Equalized odds—EO). A classifier h satisfies equalized odds under a distribution over (X, A, Y) if its prediction h(X) is conditionally independent of the protected attribute A given the label Y—that is, if P[h(X) = ŷ | A = a, Y = y] = P[h(X) = ŷ | Y = y] for all a, y, and ŷ. Because ŷ ∈ {0, 1}, this is equivalent to E[h(X) | A = a, Y = y] = E[h(X) | Y = y] for all a, y.
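To make the two definitions concrete, the following is a minimal sketch (not part of the paper's released code) of how the DP and EO gaps in Definitions 1 and 2 can be estimated from a classifier's predictions; `y_pred`, `y_true`, and `attr` are assumed to be equal-length NumPy arrays.

```python
import numpy as np

def dp_violation(y_pred, attr):
    """Estimate max_a |E[h(X) | A=a] - E[h(X)]| on a sample."""
    overall = y_pred.mean()
    return max(abs(y_pred[attr == a].mean() - overall) for a in np.unique(attr))

def eo_violation(y_pred, y_true, attr):
    """Estimate max_{a,y} |E[h(X) | A=a, Y=y] - E[h(X) | Y=y]| on a sample."""
    gaps = []
    for y in (0, 1):
        base = y_pred[y_true == y].mean()
        for a in np.unique(attr):
            mask = (attr == a) & (y_true == y)
            if mask.any():
                gaps.append(abs(y_pred[mask].mean() - base))
    return max(gaps)
```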
We now show how each definition can be viewed as a special case of a general set of linear constraints of the form

Mµ(h) ≤ c,    (1)

where the matrix M ∈ R^{|K|×|J|} and the vector c ∈ R^{|K|} describe the linear constraints, each indexed by k ∈ K, and µ(h) ∈ R^{|J|} is a vector of conditional moments of the form

µ_j(h) = E[ g_j(X, A, Y, h(X)) | E_j ]  for j ∈ J,

where g_j : X × A × {0, 1} × {0, 1} → [0, 1] and E_j is an event defined with respect to (X, A, Y). Crucially, g_j depends on h, while E_j cannot depend on h in any way.

1 See the Uniform Guidelines on Employment Selection Procedures, 29 C.F.R. §1607.4(D) (2015).
Example 1 (DP). In a binary classification setting, demographic parity can be expressed as a set of |A| equality constraints, each of the form E[h(X) | A = a] = E[h(X)]. Letting J = A ∪ {⋆}, g_j(X, A, Y, h(X)) = h(X) for all j, E_a = {A = a}, and E_⋆ = {True}, where {True} refers to the event encompassing all points in the sample space, each equality constraint can be expressed as µ_a(h) = µ_⋆(h).2 Finally, because each such constraint can be equivalently expressed as a pair of inequality constraints of the form

µ_a(h) − µ_⋆(h) ≤ 0
−µ_a(h) + µ_⋆(h) ≤ 0,

demographic parity can be expressed as equation (1), where K = A × {+, −}, M_(a,+),a′ = 1{a′ = a}, M_(a,+),⋆ = −1, M_(a,−),a′ = −1{a′ = a}, M_(a,−),⋆ = 1, and c = 0. Expressing each equality constraint as a pair of inequality constraints allows us to control the extent to which each constraint is enforced by positing c_k > 0 for some (or all) k.

Example 2 (EO). In a binary classification setting, equalized odds can be expressed as a set of 2|A| equality constraints, each of the form E[h(X) | A = a, Y = y] = E[h(X) | Y = y]. Letting J = (A ∪ {⋆}) × {0, 1}, g_j(X, A, Y, h(X)) = h(X) for all j, E_(a,y) = {A = a, Y = y}, and E_(⋆,y) = {Y = y}, each equality constraint can be equivalently expressed as

µ_(a,y)(h) − µ_(⋆,y)(h) ≤ 0
−µ_(a,y)(h) + µ_(⋆,y)(h) ≤ 0.

As a result, equalized odds can be expressed as equation (1), where K = A × Y × {+, −}, M_(a,y,+),(a′,y′) = 1{a′ = a, y′ = y}, M_(a,y,+),(⋆,y′) = −1, M_(a,y,−),(a′,y′) = −1{a′ = a, y′ = y}, M_(a,y,−),(⋆,y′) = 1, and c = 0. Again, we can posit c_k > 0 for some (or all) k to allow small violations of some (or all) of the constraints.

2 Note that µ_⋆(h) = E[h(X) | True] = E[h(X)].
Although we omit the details, we note that many other previously studied definitions of fairness can also be expressed as equation (1). For example, equality of opportunity (Hardt et al., 2016) (also known as balance for the positive class; Kleinberg et al., 2017), balance for the negative class (Kleinberg et al., 2017), error-rate balance (Chouldechova, 2017), overall accuracy equality (Berk et al., 2017), and treatment equality (Berk et al., 2017) can all be expressed as equation (1); in contrast, calibration (Kleinberg et al., 2017) and predictive parity (Chouldechova, 2017) cannot, because to do so would require the event E_j to depend on h. We note that our approach can also be used to satisfy multiple definitions of fairness, though if these definitions are mutually contradictory, e.g., as described by Kleinberg et al. (2017), then our guarantees become vacuous.

2.2. Fair Classification

In a standard (binary) classification setting, the goal is to learn the classifier h ∈ H with the minimum classification error: err(h) := P[h(X) ≠ Y]. However, because our goal is to learn the most accurate classifier while satisfying fairness constraints, as formalized above, we instead seek to find the solution to the constrained optimization problem3

min_{h∈H} err(h)  subject to  Mµ(h) ≤ c.    (2)

Furthermore, rather than just considering classifiers in the set H, we can enlarge the space of possible classifiers by considering randomized classifiers that can be obtained via a distribution over H. By considering randomized classifiers, we can achieve better accuracy–fairness tradeoffs than would otherwise be possible. A randomized classifier Q makes a prediction by first sampling a classifier h ∈ H from Q and then using h to make the prediction. The resulting classification error is err(Q) = Σ_{h∈H} Q(h) err(h) and the conditional moments are µ(Q) = Σ_{h∈H} Q(h) µ(h) (see Appendix A for the derivation). Thus we seek to solve

min_{Q∈∆} err(Q)  subject to  Mµ(Q) ≤ c,    (3)

where ∆ is the set of all distributions over H.

3 We consider misclassification error for concreteness, but all the results in this paper apply to any error of the form err(h) = E[g_err(X, A, Y, h(X))], where g_err(·, ·, ·, ·) ∈ [0, 1].
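The following is a minimal sketch (an illustration under our own naming assumptions, not the paper's implementation) of a randomized classifier Q as a finite mixture over fitted classifiers; its error and conditional moments are the Q-weighted averages of the corresponding per-classifier quantities.

```python
import numpy as np

class RandomizedClassifier:
    def __init__(self, classifiers, weights, rng=None):
        self.classifiers = classifiers             # list of fitted h in H
        self.weights = np.asarray(weights)         # Q(h), summing to 1
        self.rng = rng or np.random.default_rng()

    def predict(self, X):
        # Sample h ~ Q, then predict with the sampled classifier.
        idx = self.rng.choice(len(self.classifiers), p=self.weights)
        return self.classifiers[idx].predict(X)

    def error(self, X, y):
        # err(Q) = sum_h Q(h) err(h), estimated on (X, y).
        errs = [np.mean(h.predict(X) != y) for h in self.classifiers]
        return float(np.dot(self.weights, errs))
```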
In practice, we do not know the true distribution over (X, A, Y) and only have access to a data set of training examples {(X_i, A_i, Y_i)}_{i=1}^n. We therefore replace err(Q) and µ(Q) in equation (3) with their empirical versions êrr(Q) and µ̂(Q). Because of the sampling error in µ̂(Q), we also allow errors in satisfying the constraints by setting ĉ_k = c_k + ε_k for all k, where ε_k ≥ 0. After these modifications, we need to solve the empirical version of equation (3):

min_{Q∈∆} êrr(Q)  subject to  Mµ̂(Q) ≤ ĉ.    (4)

3. Reductions Approach

We now show how the problem (4) can be reduced to a sequence of cost-sensitive classification problems. We further show that the solutions to our sequence of cost-sensitive classification problems yield a randomized classifier with the lowest (empirical) error subject to the desired constraints.

3.1. Cost-sensitive Classification

We assume access to a cost-sensitive classification algorithm for the set H. The input to such an algorithm is a data set of training examples {(X_i, C_i^0, C_i^1)}_{i=1}^n, where C_i^0 and C_i^1 denote the losses—costs in this setting—for predicting the labels 0 or 1, respectively, for X_i. The algorithm outputs

argmin_{h∈H} Σ_{i=1}^n [ h(X_i) C_i^1 + (1 − h(X_i)) C_i^0 ].    (5)
This abstraction allows us to specify different costs for different training examples, which is essential for incorporating fairness constraints. Moreover, efficient cost-sensitive classification algorithms are readily available for several common classifier representations (e.g., Beygelzimer et al., 2005; Langford & Beygelzimer, 2005; Fan et al., 1999). In particular, equation (5) is equivalent to a weighted classification problem, where the input consists of labeled examples {(X_i, Y_i, W_i)}_{i=1}^n with Y_i ∈ {0, 1} and W_i ≥ 0, and the goal is to minimize the weighted classification error Σ_{i=1}^n W_i 1{h(X_i) ≠ Y_i}. This is equivalent to equation (5) if we set W_i = |C_i^0 − C_i^1| and Y_i = 1{C_i^0 ≥ C_i^1}.
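The following is a minimal sketch of this equivalence: a cost-sensitive problem with per-example costs (C0, C1) is solved as weighted classification with labels 1{C0 ≥ C1} and weights |C0 − C1|. Any learner accepting sample weights could serve as the oracle; logistic regression is used here only as an example, not as the paper's prescribed learner.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cost_sensitive_oracle(X, C0, C1):
    y = (C0 >= C1).astype(int)          # predict the cheaper label
    w = np.abs(C0 - C1)                 # weight = cost difference
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y, sample_weight=w)
    return clf
```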
3.2. Reduction

To derive our fair classification algorithm, we rewrite equation (4) as a saddle point problem. We begin by introducing a Lagrange multiplier λ_k ≥ 0 for each of the |K| constraints, summarized as λ ∈ R_+^{|K|}, and form the Lagrangian

L(Q, λ) = êrr(Q) + λᵀ (Mµ̂(Q) − ĉ).

Thus, equation (4) is equivalent to

min_{Q∈∆} max_{λ∈R_+^{|K|}} L(Q, λ).    (6)

For computational and statistical reasons, we impose an additional constraint on the ℓ1 norm of λ and seek to simultaneously find the solution to the constrained version of (6) as well as its dual, obtained by switching min and max:

min_{Q∈∆} max_{λ∈R_+^{|K|}, ‖λ‖1≤B} L(Q, λ),    (P)
max_{λ∈R_+^{|K|}, ‖λ‖1≤B} min_{Q∈∆} L(Q, λ).    (D)

Because L is linear in Q and λ and the domains of Q and λ are convex and compact, both problems have solutions (which we denote by Q† and λ†) and the minimum value of (P) and the maximum value of (D) are equal and coincide with L(Q†, λ†). Thus, (Q†, λ†) is the saddle point of L (Corollary 37.6.2 and Lemma 36.2 of Rockafellar, 1970).

We find the saddle point by using the standard scheme of Freund & Schapire (1996), developed for the equivalent problem of solving for an equilibrium in a zero-sum game. From a game-theoretic perspective, the saddle point can be viewed as an equilibrium of a game between two players: the Q-player choosing Q and the λ-player choosing λ. The Lagrangian L(Q, λ) specifies how much the Q-player has to pay to the λ-player after they make their choices. At the saddle point, neither player wants to deviate from their choice.

Our algorithm finds an approximate equilibrium in which neither player can gain more than ν by changing their choice (where ν > 0 is an input to the algorithm). Such an approximate equilibrium corresponds to a ν-approximate saddle point of the Lagrangian, which is a pair (Q̂, λ̂), where

L(Q̂, λ̂) ≤ L(Q, λ̂) + ν   for all Q ∈ ∆,
L(Q̂, λ̂) ≥ L(Q̂, λ) − ν   for all λ ∈ R_+^{|K|}, ‖λ‖1 ≤ B.

We proceed iteratively by running a no-regret algorithm for the λ-player, while executing the best response of the Q-player. Following Freund & Schapire (1996), the average play of both players converges to the saddle point. We run the exponentiated gradient algorithm (Kivinen & Warmuth, 1997) for the λ-player and terminate as soon as the suboptimality of the average play falls below the pre-specified accuracy ν. The best response of the Q-player can always be chosen to put all of the mass on one of the candidate classifiers h ∈ H, and can be implemented by a single call to a cost-sensitive classification algorithm for the set H.

Algorithm 1  Exp. gradient reduction for fair classification
  Input: training examples {(X_i, Y_i, A_i)}_{i=1}^n,
         fairness constraints specified by g_j, E_j, M, ĉ,
         bound B, accuracy ν, learning rate η
  Set θ_1 = 0 ∈ R^{|K|}
  for t = 1, 2, ... do
    Set λ_{t,k} = B exp{θ_k} / (1 + Σ_{k′∈K} exp{θ_{k′}}) for all k ∈ K
    h_t ← BEST_h(λ_t)
    Q̂_t ← (1/t) Σ_{t′=1}^t h_{t′},   L̄ ← L(Q̂_t, BEST_λ(Q̂_t))
    λ̂_t ← (1/t) Σ_{t′=1}^t λ_{t′},   L̲ ← L(BEST_h(λ̂_t), λ̂_t)
    ν_t ← max{ L(Q̂_t, λ̂_t) − L̲,  L̄ − L(Q̂_t, λ̂_t) }
    if ν_t ≤ ν then
      Return (Q̂_t, λ̂_t)
    end if
    Set θ_{t+1} = θ_t + η (Mµ̂(h_t) − ĉ)
  end for

Algorithm 1 fully implements this scheme, except for the functions BEST_λ and BEST_h, which correspond to the best-response algorithms of the two players. (We need the best response of the λ-player to evaluate whether the suboptimality of the current average play has fallen below ν.) The two best response functions can be calculated as follows.

BEST_λ(Q): the best response of the λ-player. The best response of the λ-player for a given Q is any maximizer of L(Q, λ) over all valid λs. In our setting, it can always be chosen to be either 0 or to put all of the mass on the most violated constraint. Letting γ̂(Q) := Mµ̂(Q) and letting e_k denote the k-th vector of the standard basis, BEST_λ(Q) returns

0 if γ̂(Q) ≤ ĉ;   B e_{k*} otherwise, where k* = argmax_k [γ̂_k(Q) − ĉ_k].
BEST_h(λ): the best response of the Q-player. Here, the best response minimizes L(Q, λ) over all Qs in the simplex. Because L is linear in Q, the minimizer can always be chosen to put all of the mass on a single classifier h. We show how to obtain the classifier constituting the best response via a reduction to cost-sensitive classification. Letting p_j := P̂[E_j] be the empirical event probabilities, the Lagrangian for Q which puts all of the mass on a single h is then

L(h, λ) = êrr(h) + λᵀ (Mµ̂(h) − ĉ)
        = Ê[1{h(X) ≠ Y}] − λᵀ ĉ + Σ_{k,j} M_{k,j} λ_k µ̂_j(h)
        = −λᵀ ĉ + Ê[1{h(X) ≠ Y}] + Σ_{k,j} (M_{k,j} λ_k / p_j) Ê[ g_j(X, A, Y, h(X)) 1{(X, A, Y) ∈ E_j} ].

Assuming a data set of training examples {(X_i, A_i, Y_i)}_{i=1}^n, the minimization of L(h, λ) over h then corresponds to cost-sensitive classification on {(X_i, C_i^0, C_i^1)}_{i=1}^n with costs4

C_i^0 = 1{Y_i ≠ 0} + Σ_{k,j} (M_{k,j} λ_k / p_j) g_j(X_i, A_i, Y_i, 0) 1{(X_i, A_i, Y_i) ∈ E_j},
C_i^1 = 1{Y_i ≠ 1} + Σ_{k,j} (M_{k,j} λ_k / p_j) g_j(X_i, A_i, Y_i, 1) 1{(X_i, A_i, Y_i) ∈ E_j}.

4 For general error, err(h) = E[g_err(X, A, Y, h(X))], the costs C_i^0 and C_i^1 contain, respectively, the terms g_err(X_i, A_i, Y_i, 0) and g_err(X_i, A_i, Y_i, 1) instead of 1{Y_i ≠ 0} and 1{Y_i ≠ 1}.
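A minimal sketch of BEST_h under the cost formulas above: form C_i^0 and C_i^1 from λ and call a cost-sensitive oracle. Here `g0[j]` and `g1[j]` are assumed to hold g_j(X_i, A_i, Y_i, 0) and g_j(X_i, A_i, Y_i, 1) for all examples, `in_E[j]` the 0/1 indicator of E_j, `p[j]` the empirical probability p_j, and `cost_sensitive_oracle` is the weighted-classification helper sketched in Section 3.1; all of these names are assumptions for illustration.

```python
import numpy as np

def best_h(X, Y, lam, M, g0, g1, in_E, p):
    Y = np.asarray(Y)
    C0 = (Y != 0).astype(float)                 # misclassification cost of label 0
    C1 = (Y != 1).astype(float)                 # misclassification cost of label 1
    for j in range(M.shape[1]):
        coeff = float(M[:, j] @ lam) / p[j]     # sum_k M_{k,j} lambda_k / p_j
        C0 = C0 + coeff * g0[j] * in_E[j]
        C1 = C1 + coeff * g1[j] * in_E[j]
    return cost_sensitive_oracle(X, C0, C1)
```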

Theorem 1. Letting ρ := max_h ‖Mµ̂(h) − ĉ‖_∞, Algorithm 1 satisfies the inequality

ν_t ≤ B log(|K| + 1) / (ηt) + ηρ²B.

Thus, for η = ν / (2ρ²B), Algorithm 1 will return a ν-approximate saddle point of L in at most 4ρ²B² log(|K| + 1) / ν² iterations.

This theorem, proved in Appendix B, bounds the suboptimality ν_t of the average play (Q̂_t, λ̂_t), which is equal to its suboptimality as a saddle point. The right-hand side of the bound is optimized by η = √(log(|K| + 1)) / (ρ√t), leading to the bound ν_t ≤ 2ρB √(log(|K| + 1) / t). This bound decreases with the number of iterations t and grows very slowly with the number of constraints |K|. The quantity ρ is a problem-specific constant that bounds how much any single classifier h ∈ H can violate the desired set of fairness constraints. Finally, B is the bound on the ℓ1-norm of λ, which we introduced to enable this specific algorithmic scheme. In general, larger values of B will bring the problem (P) closer to (6), and thus also to (4), but at the cost of needing more iterations to reach any given suboptimality. In particular, as we derive in the theorem, achieving suboptimality ν may need up to 4ρ²B² log(|K| + 1) / ν² iterations.

Example 3 (DP). Using the matrix M for demographic parity as described in Section 2, the cost-sensitive reduction for a vector of Lagrange multipliers λ uses costs

C_i^0 = 1{Y_i ≠ 0},   C_i^1 = 1{Y_i ≠ 1} + λ_{A_i} / p_{A_i} − Σ_{a∈A} λ_a,

where p_a := P̂[A = a] and λ_a := λ_(a,+) − λ_(a,−), effectively replacing two non-negative Lagrange multipliers by a single multiplier, which can be either positive or negative. Because c_k = 0 for all k, ĉ_k = ε_k. Furthermore, because all empirical moments are bounded in [0, 1], we can assume ε_k ≤ 1, which yields the bound ρ ≤ 2. Thus, Algorithm 1 terminates in at most 16B² log(2|A| + 1) / ν² iterations.

Example 4 (EO). For equalized odds, the cost-sensitive reduction for a vector of Lagrange multipliers λ uses costs

C_i^0 = 1{Y_i ≠ 0},   C_i^1 = 1{Y_i ≠ 1} + λ_(A_i,Y_i) / p_(A_i,Y_i) − Σ_{a∈A} λ_(a,Y_i) / p_(⋆,Y_i),

where p_(a,y) := P̂[A = a, Y = y], p_(⋆,y) := P̂[Y = y], and λ_(a,y) := λ_(a,y,+) − λ_(a,y,−). If we again assume ε_k ≤ 1, then we obtain the bound ρ ≤ 2. Thus, Algorithm 1 terminates in at most 16B² log(4|A| + 1) / ν² iterations.
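A minimal sketch of the main loop of Algorithm 1, using the best-response helpers sketched earlier (`best_h` via `compute_best_h`, and a function `gamma_hat(Q)` returning M µ̂(Q)); the names are assumptions for illustration, the suboptimality-based stopping check is omitted, and a fixed number of iterations T is used instead.

```python
import numpy as np

def exp_gradient(num_K, B, eta, T, compute_best_h, gamma_hat, c_hat):
    theta = np.zeros(num_K)
    hs = []                                          # iterates h_1, ..., h_t
    Q_t = None
    for t in range(T):
        z = np.exp(theta)
        lam = B * z / (1.0 + z.sum())                # lambda_{t,k} as in Algorithm 1
        h_t = compute_best_h(lam)                    # Q-player best response
        hs.append(h_t)
        Q_t = (hs, np.full(len(hs), 1.0 / len(hs)))  # uniform average of iterates
        theta += eta * (gamma_hat(Q_t) - c_hat)      # EG update for the lambda-player
    return Q_t
```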
3.3. Error Analysis

Our ultimate goal, as formalized in equation (3), is to minimize the classification error while satisfying fairness constraints under a true but unknown distribution over (X, A, Y). In the process of deriving Algorithm 1, we introduced three different sources of error. First, we replaced the true classification error and true moments with their empirical versions. Second, we introduced a bound B on the magnitude of λ. Finally, we only run the optimization algorithm for a fixed number of iterations, until it reaches suboptimality level ν. The first source of error, due to the use of empirical rather than true quantities, is unavoidable and constitutes the underlying statistical error. The other two sources of error, the bound B and the suboptimality level ν, stem from the optimization algorithm and can be driven arbitrarily small at the cost of additional iterations. In this section, we show how the statistical error and the optimization error affect the true accuracy and the fairness of the randomized classifier returned by Algorithm 1—in other words, how well Algorithm 1 solves our original problem (3).

To bound the statistical error, we use the Rademacher complexity of the classifier family H, which we denote by R_n(H), where n is the number of training examples. We assume that R_n(H) ≤ Cn^{−α} for some C ≥ 0 and α ≤ 1/2.
We note that α = 1/2 in the vast majority of classifier families, including norm-bounded linear functions (see Theorem 1 of Kakade et al., 2009), neural networks (see Theorem 18 of Bartlett & Mendelson, 2002), and classifier families with bounded VC dimension (see Lemma 4 and Theorem 6 of Bartlett & Mendelson, 2002).

Recall that in our empirical optimization problem we assume that ĉ_k = c_k + ε_k, where ε_k ≥ 0 are error bounds that account for the discrepancy between µ(Q) and µ̂(Q). In our analysis, we assume that these error bounds have been set in accordance with the Rademacher complexity of H.

Assumption 1. There exist C, C′ ≥ 0 and α ≤ 1/2 such that R_n(H) ≤ Cn^{−α} and ε_k = C′ Σ_{j∈J} |M_{k,j}| n_j^{−α}, where n_j is the number of data points that fall in E_j, n_j := |{i : (X_i, A_i, Y_i) ∈ E_j}|.
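A minimal sketch of setting the slacks per Assumption 1, ε_k = C′ Σ_j |M_{k,j}| n_j^{−α}, with n_j the number of training points falling in event E_j; `C_prime` and `alpha` are user-chosen constants (α = 1/2 for most classifier families), not values prescribed by the paper.

```python
import numpy as np

def slack_eps(M, n_j, C_prime=1.0, alpha=0.5):
    n_j = np.asarray(n_j, dtype=float)               # counts n_j, one per j in J
    return C_prime * (np.abs(M) @ n_j ** (-alpha))   # one eps_k per constraint k
```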


The optimization error can be bounded via a careful analysis of the Lagrangian and the optimality conditions of (P) and (D). Combining the three different sources of error yields the following bound, which we prove in Appendix C.

Theorem 2. Let Assumption 1 hold for C′ ≥ 2C + 2 + √(ln(4/δ) / 2), where δ > 0. Let (Q̂, λ̂) be any ν-approximate saddle point of L, let Q⋆ minimize err(Q) subject to Mµ(Q) ≤ c, and let p⋆_j = P[E_j]. Then, with probability at least 1 − (|J| + 1)δ, the distribution Q̂ satisfies

err(Q̂) ≤ err(Q⋆) + 2ν + Õ(n^{−α}),
γ_k(Q̂) ≤ c_k + (1 + 2ν)/B + Σ_{j∈J} |M_{k,j}| Õ(n_j^{−α})  for all k,

where Õ(·) suppresses polynomial dependence on ln(1/δ). If np⋆_j ≥ 8 log(2/δ) for all j, then, for all k,

γ_k(Q̂) ≤ c_k + (1 + 2ν)/B + Σ_{j∈J} |M_{k,j}| Õ((np⋆_j)^{−α}).

In other words, the solution returned by Algorithm 1 achieves the lowest feasible classification error on the true distribution up to the optimization error, which grows linearly with ν, and the statistical error, which grows as n^{−α}. Therefore, if we want to guarantee that the optimization error does not dominate the statistical error, we should set ν ∝ n^{−α}. The fairness constraints on the true distribution are satisfied up to the optimization error (1 + 2ν)/B and up to the statistical error. Because the statistical error depends on the moments, and the error in estimating the moments grows as n_j^{−α} ≥ n^{−α}, we can set B ∝ n^α to guarantee that the optimization error does not dominate the statistical error. Combining this reasoning with the learning rate setting of Theorem 1 yields the following theorem (proved in Appendix C).

Theorem 3. Let ρ := max_h ‖Mµ̂(h) − ĉ‖_∞. Let Assumption 1 hold for C′ ≥ 2C + 2 + √(ln(4/δ) / 2), where δ > 0. Let Q⋆ minimize err(Q) subject to Mµ(Q) ≤ c. Then Algorithm 1 with ν ∝ n^{−α}, B ∝ n^α, and η ∝ ρ^{−2} n^{−2α} terminates in O(ρ² n^{4α} ln |K|) iterations and returns Q̂, which with probability at least 1 − (|J| + 1)δ satisfies

err(Q̂) ≤ err(Q⋆) + Õ(n^{−α}),
γ_k(Q̂) ≤ c_k + Σ_{j∈J} |M_{k,j}| Õ(n_j^{−α})  for all k.
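A minimal sketch of the parameter choices suggested by Theorem 3, up to the constant factors that the theorem leaves unspecified: ν ∝ n^{−α}, B ∝ n^α, η ∝ ρ^{−2} n^{−2α}, with the iteration budget taken from the Theorem 1 bound.

```python
import math

def theorem3_parameters(n, rho, num_K, alpha=0.5):
    nu = n ** (-alpha)                    # target suboptimality
    B = n ** alpha                        # bound on ||lambda||_1
    eta = rho ** (-2) * n ** (-2 * alpha)
    T = math.ceil(4 * rho**2 * B**2 * math.log(num_K + 1) / nu**2)  # Theorem 1 bound
    return nu, B, eta, T
```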
Example 5 (DP). If n_a denotes the number of training examples with A_i = a, then Assumption 1 states that we should set ε_(a,+) = ε_(a,−) = C′(n_a^{−α} + n^{−α}), and Theorem 3 then shows that for a suitable setting of C′, ν, B, and η, Algorithm 1 will return a randomized classifier Q̂ with the lowest feasible classification error up to Õ(n^{−α}) while also approximately satisfying the fairness constraints

|E[h(X) | A = a] − E[h(X)]| ≤ Õ(n_a^{−α})  for all a,

where E is with respect to (X, A, Y) as well as h ∼ Q̂.

Example 6 (EO). Similarly, if n_(a,y) denotes the number of examples with A_i = a and Y_i = y and n_(⋆,y) denotes the number of examples with Y_i = y, then Assumption 1 states that we should set ε_(a,y,+) = ε_(a,y,−) = C′(n_(a,y)^{−α} + n_(⋆,y)^{−α}), and Theorem 3 then shows that for a suitable setting of C′, ν, B, and η, Algorithm 1 will return a randomized classifier Q̂ with the lowest feasible classification error up to Õ(n^{−α}) while also approximately satisfying the fairness constraints

|E[h(X) | A = a, Y = y] − E[h(X) | Y = y]| ≤ Õ(n_(a,y)^{−α})  for all a, y.

Again, E includes randomness under the true distribution over (X, A, Y) as well as h ∼ Q̂.

3.4. Grid Search

In some situations, it is preferable to select a deterministic classifier, even if that means a lower accuracy or a modest violation of the fairness constraints. A set of candidate classifiers can be obtained from the saddle point (Q†, λ†). Specifically, because Q† is a minimizer of L(Q, λ†) and L is linear in Q, the distribution Q† puts non-zero mass only on classifiers that are the Q-player's best responses to λ†. If we knew λ†, we could retrieve one such best response via the reduction to cost-sensitive learning introduced in Section 3.2.

We can compute λ† using Algorithm 1, but when the number of constraints is very small, as is the case for demographic parity or equalized odds with a binary protected attribute, it is also reasonable to consider a grid of values λ, calculate the best response for each value, and then select the value with the desired tradeoff between accuracy and fairness.
Example 7 (DP). When the protected attribute is binary, e.g., A ∈ {a, a′}, the grid search can in fact be conducted in a single dimension. The reduction formally takes two real-valued arguments λ_a and λ_{a′}, and then adjusts the costs for predicting h(X_i) = 1 by the amounts

δ_a = λ_a / p_a − λ_a − λ_{a′}   and   δ_{a′} = λ_{a′} / p_{a′} − λ_a − λ_{a′},

respectively, on the training examples with A_i = a and A_i = a′. These adjustments satisfy p_a δ_a + p_{a′} δ_{a′} = 0, so instead of searching over λ_a and λ_{a′}, we can carry out the grid search over δ_a alone and apply the adjustment δ_{a′} = −p_a δ_a / p_{a′} to the protected attribute value a′.

With three attribute values, e.g., A ∈ {a, a′, a″}, we similarly have p_a δ_a + p_{a′} δ_{a′} + p_{a″} δ_{a″} = 0, so it suffices to conduct the grid search in two dimensions rather than three.

Example 8 (EO). If A ∈ {a, a′}, we obtain the adjustment

δ_(a,y) = λ_(a,y) / p_(a,y) − (λ_(a,y) + λ_(a′,y)) / p_(⋆,y)

for an example with protected attribute value a and label y, and similarly for protected attribute value a′. In this case, separately for each y, the adjustments satisfy p_(a,y) δ_(a,y) + p_(a′,y) δ_(a′,y) = 0, so it suffices to do the grid search over δ_(a,0) and δ_(a,1) and set the parameters for a′ to δ_(a′,y) = −p_(a,y) δ_(a,y) / p_(a′,y).
and COMPAS risk scores, with race as the protected
attribute (restricted to white and black defendants).
4. Experimental Results • Law School Admissions Council’s National Longitu-
dinal Bar Passage Study (Wightman, 1998) (20,649
We now examine how our exponentiated-gradient reduc- examples). Here the task is to predict someone’s even-
tion5 performs at the task of binary classification subject to tual passage of the bar exam, with race (restricted to
either demographic parity or equalized odds. We provide an white and black only) as the protected attribute.
evaluation of our grid-search reduction in Appendix D. • The Dutch census data set (Dutch Central Bureau for
We compared our reduction with the score-based post- Statistics, 2001) (60,420 examples). Here the task is
processing algorithm of Hardt et al. (2016), which takes as to predict whether or not someone has a prestigious
its input any classifier, (i.e., a standard classifier without any occupation, with gender as the protected attribute.
fairness constraints) and derives a monotone transformation While all the evaluated algorithms require access to the pro-
of the classifier’s output to remove any disparity with respect tected attribute A at training time, only the post-processing
to the training examples. This post-processing algorithm algorithm requires access to A at test time. For a fair com-
works with both demographic parity and equalized odds, as parison, we included A in the feature vector X, so all algo-
well as with binary and non-binary protected attributes. rithms had access to it at both the training time and test time.
For demographic parity, we also compared our reduction We used the test examples to measure the classification error
with the reweighting and relabeling approaches of Kamiran for each approach, as well as the violation of the desired fair-
& Calders (2012). Reweighting can be applied to both ness constraints, i.e., maxa E[h(X) | A = a] − E[h(X)]
binary and non-binary protected attributes and operates by and maxa,y E[h(X) | A = a, Y = y] − E[h(X) | Y = y]
changing importance weights on each example with the for demographic parity and equalized odds, respectively.
goal of removing any statistical dependence between the
protected attribute and label.6 Relabeling was developed for We ran our reduction across a wide range of tradeoffs be-
5
tween the classification error and fairness constraints. We
https://github.com/Microsoft/fairlearn considered ε ∈ {0.001, . . . , 0.1} and for each value ran
6
Although reweighting was developed for demographic parity,
Algorithm 1 with b ck = ε across all k. As expected, the
the weights that it induces are achievable by our grid search, albeit
the grid search for equalized odds rather than demographic parity. returned randomized classifiers tracked the training Pareto
[Figure 1 appears here: a grid of panels (rows: DP and EO, each with logistic regression and boosting as the base learner; columns: adult, adult4, COMPAS, Dutch census, Law Schools) plotting classification error against violation of the fairness constraint. Legend: unconstrained, postprocessing, reweight, relabel, reduction (exp. grad.).]

Figure 1. Test classification error versus constraint violation with respect to DP (top two rows) and EO (bottom two rows). All data sets
have binary protected attributes except for adult4, which has four protected attribute values, so relabeling is not applicable there. For our
reduction approach we plot the convex envelope of the classifiers obtained on training data at various accuracy–fairness tradeoffs. We
show 95% confidence bands for the classification error of our reduction approach and 95% confidence intervals for the constraint violation
of post-processing. Our reduction approach dominates or matches the performance of the other approaches up to statistical uncertainty.

In Figure 1, we evaluate these classifiers alongside the baselines on the test data.

For all the data sets, the range of classification errors is much smaller than the range of constraint violations. Almost all the approaches were able to substantially reduce or remove disparity without much impact on classifier accuracy. One exception was the Dutch census data set, where the classification error increased the most in relative terms.

Our reduction generally dominated or matched the baselines. The relabeling approach frequently yielded solutions that were not Pareto optimal. Reweighting yielded solutions on the Pareto frontier, but often with substantial disparity. As expected, post-processing yielded disparities that were statistically indistinguishable from zero, but the resulting classification error was sometimes higher than that achieved by our reduction under a statistically indistinguishable disparity. In addition, and unlike the post-processing algorithm, our reduction can achieve any desired accuracy–fairness tradeoff, allows a wider range of fairness definitions, and does not require access to the protected attribute at test time.

Our grid-search reduction, evaluated in Appendix D, sometimes failed to achieve the lowest disparities on the training data, but its performance on the test data very closely matched that of our exponentiated-gradient reduction. However, if the protected attribute is non-binary, then grid search is not feasible. For instance, for the version of the adult income data set where the protected attribute takes on four values, the grid search would need to span three dimensions for demographic parity and six dimensions for equalized odds, both of which are prohibitively costly.

5. Conclusion

We presented two reductions for achieving fairness in a binary classification setting. Our reductions work for any classifier representation, encompass many definitions of fairness, satisfy provable guarantees, and work well in practice.

Our reductions optimize the tradeoff between accuracy and any (single) definition of fairness given training-time access to protected attributes. Achieving fairness when training-time access to protected attributes is unavailable remains an open problem for future research, as does the navigation of tradeoffs between accuracy and multiple fairness definitions.
Acknowledgements

We would like to thank Aaron Roth, Sam Corbett-Davies, and Emma Pierson for helpful discussions.

References

Agarwal, A., Beygelzimer, A., Dudík, M., and Langford, J. A reductions approach to fair classification. In Fairness, Accountability, and Transparency in Machine Learning (FATML), 2017.

Alabi, D., Immorlica, N., and Kalai, A. T. Unleashing linear optimizers for group-fair learning and optimization. In Proceedings of the 31st Annual Conference on Learning Theory (COLT), 2018.

Barocas, S. and Selbst, A. D. Big data's disparate impact. California Law Review, 104:671–732, 2016.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

Berk, R., Heidari, H., Jabbari, S., Kearns, M., and Roth, A. Fairness in criminal justice risk assessments: The state of the art. arXiv:1703.09207, 2017.

Beygelzimer, A., Dani, V., Hayes, T. P., Langford, J., and Zadrozny, B. Error limiting reductions between classification tasks. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML), pp. 49–56, 2005.

Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N., and Varshney, K. R. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems 30, 2017.

Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, Special Issue on Social and Technical Trade-Offs, 2017.

Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806, 2017.

Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J., and Pontil, M. Empirical risk minimization under fairness constraints. 2018. arXiv:1802.08626.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

Dwork, C., Immorlica, N., Kalai, A. T., and Leiserson, M. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency (FAT*), pp. 119–133, 2018.

Fan, W., Stolfo, S. J., Zhang, J., and Chan, P. K. AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML), pp. 97–105, 1999.

Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.

Freund, Y. and Schapire, R. E. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory (COLT), pp. 325–332, 1996.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Neural Information Processing Systems (NIPS), 2016.

Johnson, K. D., Foster, D. P., and Stine, R. A. Impartial predictive modeling: Ensuring fairness in arbitrary models. arXiv:1608.00528, 2016.

Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 793–800, 2009.

Kamiran, F. and Calders, T. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.

Kamishima, T., Akaho, S., and Sakuma, J. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 643–650, 2011.

Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

Kivinen, J. and Warmuth, M. K. Exponentiated gradient


versus gradient descent for linear predictors. Information
and Computation, 132(1):1–63, 1997.
Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent
trade-offs in the fair determination of risk scores. In Pro-
ceedings of the 8th Innovations in Theoretical Computer
Science Conference, 2017.

Langford, J. and Beygelzimer, A. Sensitive error correct-


ing output codes. In Proceedings of the 18th Annual
Conference on Learning Theory (COLT), pp. 158–172,
2005.

Ledoux, M. and Talagrand, M. Probability in Banach


Spaces: Isoperimetry and Processes. Springer, 1991.
Lichman, M. UCI machine learning repository, 2013. URL
http://archive.ics.uci.edu/ml.
Menon, A. K. and Williamson, R. C. The cost of fairness in
binary classification. In Proceedings of the Conference
on Fairness, Accountability, and Transparency, 2018.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cour-
napeau, D., Brucher, M., Perrot, M., and Duchesnay, E.
Scikit-learn: Machine learning in Python. Journal of
Machine Learning Research, 12:2825–2830, 2011.
Rockafellar, R. T. Convex analysis. Princeton University
Press, 1970.

Shalev-Shwartz, S. Online learning and online convex opti-


mization. Foundations and Trends in Machine Learning,
4(2):107–194, 2012.
Wightman, L. LSAC National Longitudinal Bar Passage
Study, 1998.

Woodworth, B. E., Gunasekar, S., Ohannessian, M. I., and


Srebro, N. Learning non-discriminatory predictors. In
Proceedings of the 30th Conference on Learning Theory
(COLT), pp. 1920–1953, 2017.
Zafar, M. B., Valera, I., Rodriguez, M. G., and Gummadi,
K. P. Fairness constraints: Mechanisms for fair classifica-
tion. In Proceedings of the 20th International Conference
on Artificial Intelligence and Statistics (AISTATS), pp.
962–970, 2017.
A. Error and Fairness for Randomized Classifiers

Let D denote the distribution over triples (X, A, Y). The accuracy of a classifier h ∈ H is measured by its 0-1 error, err(h) := P_D[h(X) ≠ Y], which for a randomized classifier Q becomes

err(Q) := P_{(X,A,Y)∼D, h∼Q}[h(X) ≠ Y] = Σ_{h∈H} Q(h) err(h).

The fairness constraints on a classifier h are Mµ(h) ≤ c. Recall that µ_j(h) := E_D[g_j(X, A, Y, h(X)) | E_j]. For a randomized classifier Q we define its moment µ_j as

µ_j(Q) := E_{(X,A,Y)∼D, h∼Q}[g_j(X, A, Y, h(X)) | E_j] = Σ_{h∈H} Q(h) µ_j(h),

where the last equality follows because E_j is independent of the choice of h.

B. Proof of Theorem 1

The proof follows immediately from the analysis of Freund & Schapire (1996) applied to the Exponentiated Gradient (EG) algorithm (Kivinen & Warmuth, 1997), which in our specific case is also equivalent to Hedge (Freund & Schapire, 1997).

Let Λ := {λ ∈ R_+^{|K|} : ‖λ‖1 ≤ B} and Λ′ := {λ′ ∈ R_+^{|K|+1} : ‖λ′‖1 = B}. We associate any λ ∈ Λ with the λ′ ∈ Λ′ that is equal to λ on coordinates 1 through |K| and puts the remaining mass on the coordinate λ′_{|K|+1}.

Consider a run of Algorithm 1. For each λ_t, let λ′_t ∈ Λ′ be the associated element of Λ′. Let r_t := Mµ̂(h_t) − ĉ and let r′_t ∈ R^{|K|+1} be equal to r_t on coordinates 1 through |K| and put zero on the coordinate r′_{t,|K|+1}. Thus, for any λ and the associated λ′, we have, for all t,

λᵀ r_t = (λ′)ᵀ r′_t,    (7)

and, in particular,

λ_tᵀ (Mµ̂(h_t) − ĉ) = λ_tᵀ r_t = (λ′_t)ᵀ r′_t.    (8)

We interpret r′_t as the reward vector for the λ-player. The choices of λ′_t then correspond to those of the EG algorithm with the learning rate η. By the assumption of the theorem we have ‖r′_t‖_∞ = ‖r_t‖_∞ ≤ ρ. The regret bound for EG, specifically Corollary 2.14 of Shalev-Shwartz (2012), then states that for any λ′ ∈ Λ′,

Σ_{t=1}^T (λ′)ᵀ r′_t ≤ Σ_{t=1}^T (λ′_t)ᵀ r′_t + B log(|K| + 1)/η + ηρ²BT,

where we write ζ_T := B log(|K| + 1)/η + ηρ²BT for the additive term. Therefore, by equations (7) and (8), we also have for any λ ∈ Λ,

Σ_{t=1}^T λᵀ r_t ≤ Σ_{t=1}^T λ_tᵀ r_t + ζ_T.    (9)

This regret bound can be used to bound the suboptimality of L(Q̂_T, λ̂_T) in λ̂_T as follows:

L(Q̂_T, λ) = (1/T) Σ_{t=1}^T [ êrr(h_t) + λᵀ (Mµ̂(h_t) − ĉ) ]
          = (1/T) Σ_{t=1}^T [ êrr(h_t) + λᵀ r_t ]
          ≤ (1/T) Σ_{t=1}^T [ êrr(h_t) + λ_tᵀ r_t ] + ζ_T / T    (10)
          = (1/T) Σ_{t=1}^T L(h_t, λ_t) + ζ_T / T
          ≤ (1/T) Σ_{t=1}^T L(Q̂_T, λ_t) + ζ_T / T    (11)
          = L(Q̂_T, (1/T) Σ_{t=1}^T λ_t) + ζ_T / T = L(Q̂_T, λ̂_T) + ζ_T / T.    (12)

Equation (10) follows from the regret bound (9). Equation (11) follows because L(h_t, λ_t) ≤ L(Q, λ_t) for all Q by the choice of h_t as the best response of the Q-player. Finally, equation (12) follows by linearity of L(Q, λ) in λ. Thus, we have for all λ ∈ Λ,

L(Q̂_T, λ̂_T) ≥ L(Q̂_T, λ) − ζ_T / T.    (13)

Also, for any Q,

L(Q, λ̂_T) = (1/T) Σ_{t=1}^T L(Q, λ_t)    (14)
          ≥ (1/T) Σ_{t=1}^T L(h_t, λ_t)    (15)
          ≥ (1/T) Σ_{t=1}^T L(h_t, λ̂_T) − ζ_T / T    (16)
          = L(Q̂_T, λ̂_T) − ζ_T / T,    (17)

where equation (14) follows by linearity of L(Q, λ) in λ, equation (15) follows by the optimality of h_t with respect to λ_t, equation (16) from the regret bound (9), and equation (17) by linearity of L(Q, λ) in Q. Thus, for all Q,

L(Q̂_T, λ̂_T) ≤ L(Q, λ̂_T) + ζ_T / T.    (18)

Equations (13) and (18) immediately imply that for any T ≥ 1,

ν_T ≤ ζ_T / T = B log(|K| + 1)/(ηT) + ηρ²B,

proving the first part of the theorem.

The second part of the theorem follows by plugging in η = ν/(2ρ²B) and verifying that if T ≥ 4ρ²B² log(|K| + 1)/ν², then

ν_T ≤ B log(|K| + 1) / ( (4ρ²B² log(|K| + 1)/ν²) · (ν/(2ρ²B)) ) + (ν/(2ρ²B)) · ρ²B = ν/2 + ν/2.
C. Proofs of Theorems 2 and 3

The bulk of this appendix proves the following theorem, which will immediately imply Theorems 2 and 3.

Theorem 4. Let (Q̂, λ̂) be any ν-approximate saddle point of L with

ĉ_k = c_k + ε_k   and   ε_k ≥ Σ_{j∈J} |M_{k,j}| ( 2R_{n_j}(H) + 2/√(n_j) + √(ln(2/δ) / (2n_j)) ).

Let Q⋆ minimize err(Q) subject to Mµ(Q) ≤ c. Then with probability at least 1 − (|J| + 1)δ, the distribution Q̂ satisfies

err(Q̂) ≤ err(Q⋆) + 2ν + 4R_n(H) + 4/√n + √(2 ln(2/δ) / n),

and, for all k,

γ_k(Q̂) ≤ c_k + (1 + 2ν)/B + 2ε_k.

Let Λ := {λ ∈ R_+^{|K|} : ‖λ‖1 ≤ B} denote the domain of λ. In the remainder of the section, we assume that we are given a pair (Q̂, λ̂) which is a ν-approximate saddle point of L, i.e.,

L(Q̂, λ̂) ≤ L(Q, λ̂) + ν   for all Q ∈ ∆,
and L(Q̂, λ̂) ≥ L(Q̂, λ) − ν   for all λ ∈ Λ.    (19)

We first establish that the pair (Q̂, λ̂) satisfies an approximate version of complementary slackness. For the statement and proof of the following lemma, recall that γ̂(Q) = Mµ̂(Q), so the empirical fairness constraints can be written as γ̂(Q) ≤ ĉ and the Lagrangian L can be written as

L(Q, λ) = êrr(Q) + Σ_{k∈K} λ_k (γ̂_k(Q) − ĉ_k).    (20)

Lemma 1 (Approximate complementary slackness). The pair (Q̂, λ̂) satisfies

Σ_{k∈K} λ̂_k (γ̂_k(Q̂) − ĉ_k) ≥ B max_{k∈K} [γ̂_k(Q̂) − ĉ_k]_+ − ν,

where we abbreviate x_+ = max{x, 0} for any real number x.

Proof. We show that the lemma follows from the optimality conditions (19). We consider a dual variable λ defined as

λ = 0 if γ̂(Q̂) ≤ ĉ;   λ = B e_{k⋆} otherwise, where k⋆ = argmax_k [γ̂_k(Q̂) − ĉ_k],

and where e_k denotes the k-th vector of the standard basis. Then we have by equations (19) and (20) that

êrr(Q̂) + Σ_{k∈K} λ̂_k (γ̂_k(Q̂) − ĉ_k) = L(Q̂, λ̂) ≥ L(Q̂, λ) − ν = êrr(Q̂) + Σ_{k∈K} λ_k (γ̂_k(Q̂) − ĉ_k) − ν,

and the lemma follows by our choice of λ.
Next two lemmas bound the empirical error of Q b and also bound the amount by which Q b violates the empirical fairness
constraints.
Lemma 2 (Empirical error bound). The distribution Q b ≤ ec
rr(Q)
b satisfies ec rr(Q) + 2ν for any Q satisfying the empirical
b (Q) ≤ b
fairness constraints, i.e., any Q such that γ c.
A Reductions Approach to Fair Classification

b (Q) ≤ b
Proof. Assume that Q satisfies γ b ≥ 0, we have
c. Since λ

b> γ

L(Q, λ)
b = ec
rr(Q) + λ b (Q) − b
c ≤ ec
rr(Q) .

The optimality conditions (19) imply that


L(Q, b ≤ L(Q, λ)
b λ) b +ν .

Putting these together, we obtain

L(Q, b ≤ ec
b λ) rr(Q) + ν .

We next invoke Lemma 1 to lower bound L(Q,


b λ)
b as
X 
L(Q,
b λ)
b = ec rr(Q)
b + λ
bk (b b −b
γk (Q) ck ) ≥ ec
rr(Q)
b + B max γ b −b
bk (Q) ck + − ν
k∈K
k∈K

≥ ec b −ν .
rr(Q)

Combining the upper and lower bounds on L(Q,


b λ)
b completes the proof.

b (Q) ≤ b
Lemma 3 (Empirical fairness violation). Assume that the empirical fairness constraints γ c are feasible. Then the
distribution Q approximately satisfies all empirical fairness constraints:
b
  1 + 2ν
max γ b −b
bk (Q) ck ≤ .
k∈K B

b (Q) ≤ b
Proof. Let Q satisfy γ c. Applying the same upper and lower bound on L(Q,
b λ)
b as in the proof of Lemma 2, we
obtain 
rr(Q)
ec b + B max γ b −b
bk (Q) ck + − ν ≤ L(Q, b ≤ ec
b λ) rr(Q) + ν .
k∈K

rr(Q) − ec
We can further upper bound ec b by 1 and use x ≤ x+ for any real number x to complete the proof.
rr(Q)

It remains to lift the bounds on empirical classification error and constraint violation into the corresponding bounds on true
classification error and the violation of true constraints. We will use the standard machinery of uniform convergence bounds
via the (worst-case) Rademacher complexity.
Let F be a class of functions f : Z → [0, 1] over some space Z. Then the (worst-case) Rademacher complexity of F is
defined as " #
n
1X
Rn (F) := sup E sup σi f (zi ) ,
z1 ,...,zn ∈Z f ∈F n i=1

where the expectation is over the i.i.d. random variables σ1 , . . . , σn with P[σi = 1] = P[σi = −1] = 1/2.
We first prove concentration of generic moments derived from classifiers h ∈ H and then move to bounding the deviations
from true classification error and true fairness constraints.
Lemma 4 (Concentration of moments). Let g : X × A × {0, 1} × {0, 1} → [0, 1] be any function and let D be a distribution
over (X, A, Y ). Then with probability at least 1 − δ, for all h ∈ H,
r
    2 ln(2/δ)
E g(X, A, Y, h(X)) − E g(X, A, Y, h(X)) ≤ 2Rn (H) + √ +
b ,
n 2n
where the expectation is with respect to D and the empirical expectation is based on n i.i.d. draws from D.

Proof. Let F := {fh }h∈H be the class of functions fh : (x, y, a) 7→ g x, y, a, h(x) . By Theorem 3.2 of Boucheron et al.


(2005), we then have with probability at least 1 − δ, for all h,


r
    ln(2/δ)
E g(X, A, Y, h(X)) − E g(X, A, Y, h(X)) = E[fh ] − E[fh ] ≤ 2Rn (F) +
b b . (21)
2n
A Reductions Approach to Fair Classification

We will next bound Rn (F) in terms of Rn (H). Since h(x) ∈ {0, 1}, we can write
   
fh (x, y, a) = h(x)g(x, a, y, 1) + 1 − h(x) g(x, a, y, 0) = g(x, a, y, 0) + h(x) g(x, a, y, 1) − g(x, a, y, 0) .

Since g(x, a, y, 0) ≤ 1 and g(x, a, y, 1) − g(x, a, y, 0) ≤ 1, we can invoke Theorem 12(5) of Bartlett & Mendelson
(2002) for bounding function classes shifted by an offset, in our case g(x, a, y, 0), and Theorem 4.4 of Ledoux & Talagrand
(1991) for bounding function classes under contraction, in our case g(x, a, y, 1) − g(x, a, y, 0), yielding

1
Rn (F) ≤ √ + Rn (H) .
n

Together with the bound (21), this proves the lemma.

Lemma 5 (Concentration of loss). With probability at least 1 − δ, for all Q ∈ ∆,


r
2 ln(2/δ)
err(Q) − err(Q)| ≤ 2Rn (H) + √ +
|c .
n 2n

Proof. We first use Lemma 4 with g : (x, a, y, ŷ) 7→ 1{y 6= ŷ} to obtain, with probability 1 − δ, for all h,
r
b h ] − E[fh ] ≤ 2Rn (H) + √2 +
rr(h) − err(h) = E[f
ec
ln(2/δ)
.
n 2n

The lemma now follows for any Q by taking a convex combination of the corresponding bounds on h ∈ H.7

Finally, we show a result for the concentration of the empirical constraint violations to their population counterparts. We will actually show the concentration of the individual moments $\hat{\mu}_j(Q)$ to $\mu_j(Q)$ uniformly for all $Q \in \Delta$. Since $\mathbf{M}$ is a fixed matrix not dependent on the data, this also directly implies concentration of the constraints $\hat{\gamma}(Q) = \mathbf{M}\hat{\mu}(Q)$ to $\gamma(Q) = \mathbf{M}\mu(Q)$. For this result, recall that $n_j = |\{i \in [n] : (X_i, A_i, Y_i) \in E_j\}|$ and $p^*_j = \mathbb{P}[E_j]$.
Lemma 6 (Concentration of conditional moments). For any $j \in \mathcal{J}$, with probability at least $1 - \delta$, for all $Q$,
\[
\bigl| \hat{\mu}_j(Q) - \mu_j(Q) \bigr| \;\le\; 2 R_{n_j}(\mathcal{H}) + \frac{2}{\sqrt{n_j}} + \sqrt{\frac{\ln(2/\delta)}{2 n_j}}\,.
\]
If $n p^*_j \ge 8 \ln(2/\delta)$, then with probability at least $1 - \delta$, for all $Q$,
\[
\bigl| \hat{\mu}_j(Q) - \mu_j(Q) \bigr| \;\le\; 2 R_{n p^*_j / 2}(\mathcal{H}) + 2\sqrt{\frac{2}{n p^*_j}} + \sqrt{\frac{\ln(4/\delta)}{n p^*_j}}\,.
\]

Proof. Our proof largely follows the proof of Lemma 2 of Woodworth et al. (2017), with appropriate modifications for our more general constraint definition. Let $S_j := \{i \in [n] : (X_i, A_i, Y_i) \in E_j\}$ be the set of indices such that the corresponding examples fall in the event $E_j$. Note that we have defined $n_j = |S_j|$. Let $D(\cdot)$ denote the joint distribution of $(X, A, Y)$. Then, conditioned on $i \in S_j$, the random variables $g_j(X_i, A_i, Y_i, h(X_i))$ are i.i.d. draws from the distribution $D(\cdot \mid E_j)$, with mean $\mu_j(h)$. Applying Lemma 4 with $g_j$ and the distribution $D(\cdot \mid E_j)$ therefore yields, with probability $1 - \delta$, for all $h$,
\[
\bigl| \hat{\mu}_j(h) - \mu_j(h) \bigr| \;\le\; 2 R_{n_j}(\mathcal{H}) + \frac{2}{\sqrt{n_j}} + \sqrt{\frac{\ln(2/\delta)}{2 n_j}}\,.
\]
The lemma now follows by taking a convex combination over $h$.
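The second bound of the lemma is not derived separately above; one way to obtain it from the first (a sketch, assuming the worst-case Rademacher complexity $R_n(\mathcal{H})$ is non-increasing in $n$) is to apply the first bound with $\delta/2$ in place of $\delta$, take a union bound with the event $n_j \ge n p^*_j / 2$—which, by the multiplicative Chernoff bound spelled out in the proof of Theorem 2 below, fails with probability at most $\delta/2$ when $n p^*_j \ge 8\ln(2/\delta)$—and then substitute $n_j \ge n p^*_j / 2$:
\[
2 R_{n_j}(\mathcal{H}) + \frac{2}{\sqrt{n_j}} + \sqrt{\frac{\ln(4/\delta)}{2 n_j}}
\;\le\; 2 R_{n p^*_j / 2}(\mathcal{H}) + 2\sqrt{\frac{2}{n p^*_j}} + \sqrt{\frac{\ln(4/\delta)}{n p^*_j}}\,.
\]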


$^7$ The same reasoning applies for general error, $\mathrm{err}(h) = \mathbb{E}\bigl[g_{\mathrm{err}}(X, A, Y, h(X))\bigr]$, by using $g = g_{\mathrm{err}}$ in Lemma 4.

Proof of Theorem 4. We now use the lemmas derived so far to prove Theorem 4. We first use Lemma 6 to bound the gap between the empirical and population fairness constraints. The lemma implies that with probability at least $1 - |\mathcal{J}|\delta$, for all $k \in \mathcal{K}$ and all $Q \in \Delta$,
\[
\bigl| \hat{\gamma}_k(Q) - \gamma_k(Q) \bigr| = \bigl| \mathbf{M}_k \bigl( \hat{\mu}(Q) - \mu(Q) \bigr) \bigr|
\;\le\; \sum_{j \in \mathcal{J}} |\mathbf{M}_{k,j}| \, \bigl| \hat{\mu}_j(Q) - \mu_j(Q) \bigr|
\;\le\; \sum_{j \in \mathcal{J}} |\mathbf{M}_{k,j}| \Biggl( 2 R_{n_j}(\mathcal{H}) + \frac{2}{\sqrt{n_j}} + \sqrt{\frac{\ln(2/\delta)}{2 n_j}} \Biggr)
\;\le\; \varepsilon_k\,. \tag{22}
\]

Note that our choice of $\hat{c}$ along with equation (22) ensures that $\hat{\gamma}_k(Q^\star) \le \hat{c}_k$ for all $k \in \mathcal{K}$. Using Lemma 2 allows us to conclude that
\[
\widehat{\mathrm{err}}(\hat{Q}) \;\le\; \widehat{\mathrm{err}}(Q^\star) + 2\nu\,.
\]
We now invoke Lemma 5 twice, once for $\widehat{\mathrm{err}}(\hat{Q})$ and once for $\widehat{\mathrm{err}}(Q^\star)$, proving the first statement of the theorem.

The above shows that $Q^\star$ satisfies the empirical fairness constraints, so we can use Lemma 3, which together with equation (22) yields
\[
\gamma_k(\hat{Q}) \;\le\; \hat{\gamma}_k(\hat{Q}) + \varepsilon_k \;\le\; \hat{c}_k + \frac{1 + 2\nu}{B} + \varepsilon_k \;=\; c_k + \frac{1 + 2\nu}{B} + 2\varepsilon_k\,,
\]
proving the second statement of the theorem.

We are now ready to prove Theorems 2 and 3.

Proof of Theorem 2. The first part of the theorem follows immediately from Assumption 1 and Theorem 4 (with $\delta/2$ instead of $\delta$). The statement in fact holds with probability at least $1 - (|\mathcal{J}| + 1)\delta/2$. For the second part, we use the multiplicative Chernoff bound for binomial random variables. Note that $\mathbb{E}[n_j] = n p^*_j$, and we assume that $n p^*_j \ge 8 \ln(2/\delta)$, so the multiplicative Chernoff bound implies that $n_j \le n p^*_j / 2$ with probability at most $\delta/2$. Taking the union bound across all $j$ and combining with the first part of the theorem then proves the second part.
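For completeness, the Chernoff calculation referenced here, written out under the stated assumption $n p^*_j \ge 8\ln(2/\delta)$:
\[
\mathbb{P}\bigl[ n_j \le \tfrac{1}{2}\, n p^*_j \bigr]
= \mathbb{P}\bigl[ n_j \le (1 - \tfrac{1}{2})\, \mathbb{E}[n_j] \bigr]
\;\le\; \exp\Bigl( - \frac{(1/2)^2 \, n p^*_j}{2} \Bigr)
= \exp\Bigl( - \frac{n p^*_j}{8} \Bigr)
\;\le\; \exp\bigl( -\ln(2/\delta) \bigr)
= \frac{\delta}{2}\,.
\]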

Proof of Theorem 3. This follows immediately from Theorem 1 and the first part of Theorem 2.

D. Additional Experimental Results


In this appendix we present more complete experimental results, for both the training and the test data. We evaluate the exponentiated-gradient as well as the grid-search variants of our reductions. Finally, we consider extensions of reweighting and relabeling beyond the specific tradeoffs proposed by Kamiran & Calders (2012). Specifically, we introduce a scaling parameter that interpolates between the prescribed tradeoff (specific importance weights or the number of examples to relabel) and the unconstrained classifier (uniform weights or zero examples to relabel); a sketch of this interpolation is given below. The training data results are shown in Figure 2. The test set results are shown in Figure 3.
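For the reweighting case, one simple way to realize such a scaling parameter is a convex combination of uniform weights and the prescribed importance weights. The code below is our own sketch (the paper does not specify an implementation); `fit_classifier` and `kamiran_calders_weights` are hypothetical placeholders, and the exact parametrization used in our experiments may differ.

```python
import numpy as np

def interpolate_weights(prescribed_weights, scale):
    """Interpolate between uniform weights (scale=0) and the prescribed
    reweighting (scale=1), e.g. Kamiran & Calders importance weights."""
    prescribed_weights = np.asarray(prescribed_weights, dtype=float)
    uniform = np.ones_like(prescribed_weights)
    return (1.0 - scale) * uniform + scale * prescribed_weights

# Sweep the scaling parameter and retrain the base learner at each setting
# to trace out an error-versus-violation tradeoff curve:
# for scale in np.linspace(0.0, 1.0, 11):
#     w = interpolate_weights(kamiran_calders_weights, scale)
#     model = fit_classifier(X_train, y_train, sample_weight=w)
```

The relabeling analogue scales the number of examples to relabel from zero up to the number prescribed by Kamiran & Calders (2012).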

[Figure 2 appears here: a 4×5 grid of panels. Rows: DP / log. reg., DP / boosting, EO / log. reg., EO / boosting; columns: adult, adult4, COMPAS, Dutch census, Law Schools. Each panel plots classification error (y-axis) against violation of the fairness constraint (x-axis). Legend: unconstrained, postprocessing, reweight, relabel, reduction (grid search), reduction (exp. grad.).]
Figure 2. Training classification error versus constraint violation, with respect to DP (top two rows) and EO (bottom two rows). Markers
correspond to the baselines. For our two reductions and the interpolants between reweighting (or relabeling) and the unconstrained
classifier, we varied their tradeoff parameters and plot the Pareto frontiers of the sets of classifiers obtained for each method. Because the
curves of the different methods often overlap, we use vertical dashed lines to indicate the lowest constraint violations. All data sets have
binary protected attributes except for adult4, which has four protected attribute values, so relabeling is not applicable and grid search is
not feasible for this data set. The exponentiated-gradient reduction dominates or matches other approaches as expected since it solves
exactly for the points on the Pareto frontier of the set of all classifiers in each considered class.

[Figure 3 appears here: a 4×5 grid of panels with the same layout as Figure 2. Rows: DP / log. reg., DP / boosting, EO / log. reg., EO / boosting; columns: adult, adult4, COMPAS, Dutch census, Law Schools. Each panel plots classification error (y-axis) against violation of the fairness constraint (x-axis). Legend: unconstrained, postprocessing, reweight, relabel, reduction (grid search), reduction (exp. grad.).]

Figure 3. Test classification error versus constraint violation, with respect to DP (top two rows) and EO (bottom two rows). Markers
correspond to the baselines. For our two reductions and the interpolants between reweighting (or relabeling) and the unconstrained
classifier, we show convex envelopes of the classifiers taken from the training Pareto frontier of each method (i.e., the same classifiers as
shown in Figure 2). Because the curves of the different methods often overlap, we use vertical dashed lines to indicate the lowest constraint
violations. All data sets have binary protected attributes except for adult4, which has four protected attribute values, so relabeling
is not applicable and grid search is not feasible for this data set. We show 95% confidence bands for the classification error of the
exponentiated-gradient reduction and 95% confidence intervals for the constraint violation of post-processing. The exponentiated-gradient
reduction dominates or matches performance of all other methods up to statistical uncertainty.
