
The Counterfactual-Shapley Value: Attributing Change in System Metrics

Amit Sharma (Microsoft Research), Hua Li (Microsoft Bing Ads), Jian Jiao (Microsoft Bing Ads)

arXiv:2208.08399v1 [cs.LG] 17 Aug 2022

Abstract
Given an unexpected change in the output metric of a large-scale system, it is important to answer
why the change occurred: which inputs caused the change in metric? A key component of such an
attribution question is estimating the counterfactual: the (hypothetical) change in the system metric
due to a specified change in a single input. However, due to inherent stochasticity and complex
interactions between parts of the system, it is difficult to model an output metric directly. We utilize
the computational structure of a system to break up the modelling task into sub-parts, such that
each sub-part corresponds to a more stable mechanism that can be modelled accurately over time.
Using the system’s structure also helps to view the metric as a computation over a structural causal
model (SCM), thus providing a principled way to estimate counterfactuals. Specifically, we propose
a method to estimate counterfactuals using time-series predictive models and construct an attribution
score, CF-Shapley, that is consistent with desirable axioms for attributing an observed change in the
output metric. Unlike past work on causal Shapley values, our proposed method can attribute a single
observed change in output (rather than a population-level effect) and thus provides more accurate
attribution scores when evaluated on simulated datasets. As a real-world application, we analyze a
query-ad matching system with the goal of attributing observed change in a metric for ad matching
density. Attribution scores explain how query volume and ad demand from different query categories
affect the ad matching density, leading to actionable insights and uncovering the role of external
events (e.g., “Cheetah Day”) in driving the matching density.

1 Introduction
In large-scale systems, a common problem is to explain the reasons for a change in the output, especially for unexpected
and big changes. Explaining the reasons or attributing the change to input factors can help isolate the cause and debug it
if the change is undesirable, or suggest ways to amplify the change if desirable. For example, in a distributed system,
system failure [29] and performance anomalies [19; 1] are important undesirable outcomes. In online platforms such as
e-commerce websites or search websites, a desirable outcome is an increase in revenue, and it is important to understand
why the revenue increased or decreased [25; 5].
Technically, this problem can be framed as an attribution problem [7; 28; 5]. Given a set of candidate factors, which
of them can best explain the observed change in output? Methods include statistical analysis based on conditional
probabilities [2; 13; 24] or computation of game-theoretic attribution scores like the Shapley value [17; 25; 5]. However,
most past work assumes that the output can be written as a function of the inputs, ignoring any structure in the
computation of the output.
In this paper, we consider large-scale systems such as search or ad systems where output metrics are aggregated over
different kinds of inputs or composed over multiple pipeline stages, leading to a natural computational structure (instead
of a single function of the inputs). For example, in an ad system, the number of ads that are matched per query is a
composite measure built from an analogous metric over each query category (see Figure 1). While the overall
matching density may fluctuate, the matching density per category is expected to be more stably associated with the
input queries and ads. As another example, the output metric may be a result of a series of modules in a pipeline,
e.g., recommendations that are finally shown to a user may be a result of multiple pipeline stages where each stage
filters some items. Our key insight is that utilizing the computational structure of a real-world system can break up
the system into smaller sub-parts that stay stable over time and thus can be modelled accurately. In other words, the
[Figure 1 diagram: per-category noise terms $U_{i,ad}$, $U_{i,qv}$ feed Cat$_i$ Ad Demand and Cat$_i$ Query Volume, which determine Cat$_i$ Density ($den_t^{c_i}$); the category densities combine into the Daily Density ($y_t$).]
Figure 1: Causal graph for an ad matching system, reflecting the computation of the matching density metric. For each
query category, the number of queries (query volume) and ads (ad demand) determines the categorical density for each
day. The different categorical densities combine to yield the daily density. The goal is to attribute the daily density to the
ad demand and query volume of different categories. The category density is directly affected by category-wise ad demand
and query volume and thus has a relatively more stable relationship with the inputs than the overall daily density.
system’s computation can be modelled as a set of independent, causal mechanisms [22] over a structural causal model
(SCM) [20].
Modeling the system's computation as an SCM also provides a principled way to define an attribution score. Specifically,
we show that attribution can be defined in terms of counterfactuals on the SCM. Following recent work on causal Shapley
values [10; 14], we posit four axioms that any desirable attribution method for an output metric should satisfy. We then
propose a counterfactual variant of the Shapley value that satisfies all these properties. Thus, given the computational
structure, our proposed CF-Shapley method has the following steps: 1) utilize machine learning algorithms to fit the
SCM and compute counterfactual values of the metric under any input, and 2) use the estimated counterfactuals to
construct an attribution score to rank the contribution of different inputs. On simulated data, our results show that the
proposed method is significantly more accurate for explaining inputs’ contribution to an observed change in a system
metric, compared to the Shapley value [17] or its recent causal variants [10; 14].
We apply the proposed method, CF-Shapley attribution, to a large-scale ad matching system that outputs relevant ads
for each search query issued by a user. The key outcome is matching density, the number of ads matched per query.
This density is roughly proportional to revenue generated, since only the queries for which ads are selected contribute
to revenue. There are two main causes for a change in matching density: change in query volume or change in demand
from advertisers. Given that queries are typically organized by categories, the attribution problem is to understand
which of these two factors is driving an observed change in matching density, and from which categories.
To do so, we construct a causal graph representing the system's computation pipeline (Figure 1). Given six months of the
system’s log data, we repurpose time-series prediction models to learn the structural equation for category-wise density
as a function of query volume and ad demand, its parents in the graph. For this system, we find that category-wise
attribution is possible with minimal assumptions, while attribution between query volume and ad demand requires
knowledge of the structural equations that generate category-wise density. In both cases, we show how the CF-Shapley
method can be used to estimate the system’s counterfactual outputs and the resultant attribution scores. As a sanity
check, CF-Shapley attribution scores satisfy the efficiency property for attributing the matching density metric: their sum
matches the observed change in density. We then use CF-Shapley scores to explain density changes on five outlier days
from November to December 2021, uncovering insights on how changes in query volume or ad demand for different
categories affect the density metric. We validate the results through an analysis of external events during the time
period.
To summarize, our contributions include:
• A method for attributing metrics in a large-scale system utilizing its computational structure as a causal graph, which
outperforms recent Shapley value-based methods on simulated data.
• A case study on estimating counterfactuals in a real-world ad matching system, providing a principled way for
attributing change in its output metric.

2 Related Work

Our work considers a causal interpretation of the attribution problem. Unlike attribution methods on predictions from a
(deterministic) machine learning model [17; 11], here we are interested in attributing real-world outcomes where the
data-generating process includes noise. Since the attribution problem concerns explaining a single outcome or event,
we focus on causality on individual outcomes [9] rather than general causality that deals with the average effect of a
cause on the outcome over a (sub)population [20]. In other words, we are interested in estimating the counterfactual,
given that we already have an observed event. Counterfactuals are the hardest problem on Pearl's ladder of causation,
above observation and intervention [21].
While counterfactuals have been applied in feature attribution for machine learning models [15; 27], less work has been
done for attributing real-world outcomes in systems using formal counterfactuals. Recent work uses the do-intervention
to propose do-Shapley values [10; 14] that attribute the interventional quantity $P(Y \mid do(\mathbf{v}))$ across different inputs
$v \in \mathbf{V}$. While do-Shapley values are useful for calculating the average effect of different inputs on the output $Y$, they
are not applicable for attributing an individual change in the output. For attributing individual changes, [12] analyze
root cause identification for outliers in a structural causal model, and find that attribution conditional on the parents of
a node is more effective than global attribution. They quantify the attribution using information theoretic scores, but
do not provide any axiomatic characterization of the resulting attribution score. In this work, we propose four axioms
that characterize desirable properties for an attribution score for explaining individual change in output and present the
CF-Shapley value that satisfies those axioms.
Attribution in ad systems. Multi-touch attribution is the most common attribution problem studied in online ad
systems. Given an ad click, the goal is to assign credit to the different preceding exposures of the same item to the user,
e.g., previous ad exposures, emails, or other media. Multiple methods have been proposed to estimate the attribution
such as attributing all to the last exposure [2], an average over all exposures, or using probabilistic models to model the
click data as a function of the input exposures [24; 13]. Recent methods use the game-theoretic attribution score
based on Shapley values, which summarizes the attribution over multiple simulations of the input variables, with [5] or
without [25] a causal interpretation. Multi-touch attribution can be considered a one-level SCM problem, where there is an
output node being affected by all input nodes. It does not cover more complex systems where there is a computational
structure.
Performance Anomaly Attribution. Computational structure (e.g., specific system capabilities or logs) has been
considered in the systems literature to root-cause performance anomalies [1] or system failures [29]. Some methods use
causal reasoning to motivate their attribution algorithm, but they do so informally. Our work provides a formal analysis
of the system attribution problem.

3 Defining the attribution problem

For a system’s outcome metric Y , let Y = yt be a value that needs to be explained (e.g., an extreme value). Our goal is
to explain the value by attributing it to a set of input variables, X. Can we rank the variables by their contribution in
causing the outcome?
For example, consider a system that crashes whenever its load crosses 0.9 units. The system's crash metric can
be described by the following structural equations: $Y = \mathbb{1}[\text{Load} \geq 0.9]$; $\text{Load} = 0.5 X_1 + 0.4 X_2 + 0.9 X_3$;
$X_i \sim \text{Bernoulli}(0.5)\ \forall i$. The corresponding graph for the system has the following edges: $X_1, X_2, X_3 \rightarrow \text{Load}$; $\text{Load} \rightarrow Y$.
The value of each input $X_i$ is affected by an independent error term through the Bernoulli distribution. Suppose the
initial reference value was $(X_1 = 0, X_2 = 0, X_3 = 0, Y = 0)$ and the next observed value is $(X_1 = 1, X_2 = 1, X_3 = 1, Y = 1)$.
Given that the system crashed ($Y = 1$), how do we attribute it to $X_1, X_2, X_3$? Intuitively, $X_3$ is a sufficient
cause of the crash, since setting $X_3 = 1$ leads to the crash irrespective of the values of the other variables. At the same
time, $X_1$ and $X_2$ together can equally be a reason for this particular crash since their coefficients sum to 0.9; but if
either of $X_1$ or $X_2$ is observed to be zero, the other one alone cannot explain the crash. This example indicates that
the attribution for any input variable depends on the equations of the data-generating process as well as on the values
of the other variables.
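As an illustration, here is a minimal sketch (ours, not from the paper) that encodes this toy crash SCM in Python and evaluates counterfactual crash indicators when subsets of inputs are reset to their reference value of 0:

```python
from itertools import combinations

# Toy crash SCM: Y = 1[Load >= 0.9], Load = 0.5*X1 + 0.4*X2 + 0.9*X3.
# All names here are illustrative.
def crash(x1, x2, x3):
    """Structural equation for the crash indicator Y."""
    load = 0.5 * x1 + 0.4 * x2 + 0.9 * x3
    return int(load >= 0.9)

observed = {"x1": 1, "x2": 1, "x3": 1}  # all inputs flipped; system crashed

def cf_crash(reset):
    """Counterfactual Y when the inputs in `reset` take their reference value 0."""
    vals = {k: (0 if k in reset else v) for k, v in observed.items()}
    return crash(**vals)

for size in (1, 2):
    for s in combinations(observed, size):
        print(f"reset {s}: Y = {cf_crash(set(s))}")
# Resetting any single input leaves Y = 1 (e.g., with X3 = 0, Load is still
# 0.5 + 0.4 = 0.9), but resetting {x1, x3} or {x2, x3} prevents the crash:
# no single change explains the crash on its own, so an attribution score
# must average over subsets of the other variables, as CF-Shapley does.
```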

3.1 Attribution score for system metric change

We now define the attribution score for explaining an observed value with respect to a reference value. While system
inputs can be continuous, we utilize the fact that system metrics are measured and compared over time. That is, we are
often interested in attribution for a metric value compared to a reference timestamp. Reference values are typically
chosen from previous values that are expected to be comparable (e.g., the metric's value in the last hour or last week).
By comparing to
a reference timestamp, we simplify the problem by considering only two values of a continuous variable: its observed
value, and its value on the reference timestamp.
Formally, we express the problem of attributing an outcome metric $Y = y_t$ as explaining the change in the metric w.r.t. a
reference, $\Delta Y = y_t - y'$: why did the outcome value change from $y'$ to $y_t$?
Definition 1 (Attribution Score). Let $Y = y_t$ and $Y = y'$ be the observed and reference values, respectively, of a
system metric. Let $\mathbf{V}$ be the set of input variables. Then, an attribution score for $X \in \mathbf{V}$ provides the contribution of
$X$ in causing the change from $y'$ to $y_t$.

3.2 The need for SCM and counterfactuals

To estimate the causal contribution, we need to model the data-generating process from input variables to the outcome.
This is usually done by a structural causal model (SCM) $M$, consisting of a causal graph and structural equations
describing the generating functions for each variable.
SCM. Formally, a structural causal model [20] is defined by a tuple $\langle \mathbf{V}, \mathbf{U}, \mathbf{F}, P(\mathbf{u}) \rangle$, where $\mathbf{V}$ is the set of observed
variables, $\mathbf{U}$ the unobserved variables, $\mathbf{F}$ a set of functions, and $P(\mathbf{U})$ a strictly positive probability measure
over $\mathbf{U}$. For each $V \in \mathbf{V}$, $f_V \in \mathbf{F}$ determines its data-generating process, $V = f_V(\mathit{Pa}_V, \mathbf{U}_V)$, where $\mathit{Pa}_V \subseteq \mathbf{V} \setminus \{V\}$
denotes the parents of $V$ and $\mathbf{U}_V \subseteq \mathbf{U}$. We consider a non-linear, additive noise SCM such that for all $V \in \mathbf{V}$, $f_V$ can be
written as an additive combination of some $f^*_V(\mathit{Pa}_V)$ and the unobserved variables (error terms). We assume a Markovian
SCM, in which the unobserved variables (corresponding to error terms) are mutually independent; the SCM thus corresponds
to a directed acyclic graph (DAG) over $\mathbf{V}$ with edges to each node from its parents. Note that a specific realization of the
unobserved variables, $\mathbf{U} = \mathbf{u}$, determines the values of all other variables.
Counterfactual. Given an SCM, values of the unobserved variables $\mathbf{U} = \mathbf{u}$, a target variable $Y \in \mathbf{V}$, and a subset of
inputs $\mathbf{X} \subseteq \mathbf{V} \setminus \{Y\}$, a counterfactual corresponds to the query, "What would have been the value of $Y$ (under $\mathbf{u}$),
had $\mathbf{X}$ been $\mathbf{x}$?" It is written as $Y_{\mathbf{x}}(\mathbf{u})$.
Using counterfactuals, we can formally express the attribution question in the above example. Suppose the
observed values are $Y = y_t$ and $X_i = x_i$ for some input $X_i$, under $\mathbf{U} = \mathbf{u}$. At an earlier reference timestamp with a
different value of the unobserved variables, $\mathbf{U} = \mathbf{u}'$, the values are $Y = y'$ and $X_i = x'_i$. Starting from the observed
value ($\mathbf{U} = \mathbf{u}$), the attribution for $X_i$ is characterized by the change in $Y$ after changing $X_i$ to its reference value,
$Y_{x_i}(\mathbf{u}) - Y_{x'_i}(\mathbf{u}) = y_t - Y_{x'_i}(\mathbf{u})$. That is, given that $Y$ is $y_t$ with $X_i = x_i$ and all other variables at their observed
values, how much would $Y$ change if $X_i$ were set to $x'_i$? Similarly, we can ask $Y_{x_i, x'_1}(\mathbf{u}) - Y_{x'_i, x'_1}(\mathbf{u})$ (for $i \neq 1$), denoting the
change in $Y$'s value upon setting $X_i = x_i$ when $X_1$ is set to its reference value. Thus, there can be multiple expressions
for the counterfactual impact of $X_i$, depending on the values of the other variables.

4 Attribution using CF-Shapley value


To develop an attribution score, we propose a way to average over the different possible counterfactual impacts. First,
we posit desirable axioms that an attribution score should satisfy, as in [17; 14].

4.1 Desirable axioms for an attribution score

Axioms. Given two values of the metric, observed $Y(\mathbf{u})$ and reference $Y(\mathbf{u}')$, corresponding to unobserved variables
$\mathbf{u}$ and $\mathbf{u}'$ respectively, the following properties are desirable for an attribution score $\phi$ that measures the causal contribution
of inputs $V \in \mathbf{V}$.
1. CF-Efficiency. The sum of attribution scores over all $V \in \mathbf{V}$ equals the counterfactual change in output from
reference to observed value: $Y(\mathbf{u}) - Y_{\mathbf{v}'}(\mathbf{u}) = Y_{\mathbf{v}}(\mathbf{u}') - Y(\mathbf{u}') = \sum_{V \in \mathbf{V}} \phi_V$.
2. CF-Irrelevance. If a variable $X$ has no effect on the counterfactual value of the output under all witnesses, $Y_{x', \mathbf{s}'}(\mathbf{u}) =
Y_{\mathbf{s}'}(\mathbf{u})\ \forall \mathbf{S} \subseteq \mathbf{V} \setminus \{X\}$, then $\phi_X = 0$.
3. CF-Symmetry. If two variables have the same effect on the counterfactual value of the output, $Y_{\mathbf{s}'}(\mathbf{u}) - Y_{x'_1, \mathbf{s}'}(\mathbf{u}) =
Y_{\mathbf{s}'}(\mathbf{u}) - Y_{x'_2, \mathbf{s}'}(\mathbf{u})\ \forall \mathbf{S} \subseteq \mathbf{V} \setminus \{X_1, X_2\}$, then their attribution scores are the same, $\phi_{X_1} = \phi_{X_2}$.
4. CF-Approximation. For any subset of variables $\mathbf{S} \subseteq \mathbf{V}$ set to their reference values $\mathbf{s}'$, the sum of attribution
scores approximates the counterfactual change from the observed value. That is, there exists a weight $\omega(\mathbf{S})$ such that the
vector $\phi$ is the solution to the weighted least squares problem $\arg\min_{\phi^*} \sum_{\mathbf{S} \subseteq \mathbf{V}} \omega(\mathbf{S}) \big( (Y(\mathbf{u}) - Y_{\mathbf{s}'}(\mathbf{u})) - \sum_{V \in \mathbf{S}} \phi^*_V \big)^2$.

Similar to the Shapley value axioms, these axioms convey intuitive properties that a counterfactual attribution score should
satisfy. CF-Efficiency states that the sum of attribution scores for the inputs should equal the difference between the observed
metric and the counterfactual metric when all inputs are set to their reference values. CF-Irrelevance states that if
changing the value of an input $X$ has no effect on the output counterfactual under all values of the other variables, then the
Shapley value of $X$ should be zero. CF-Symmetry states that if changing the values of two inputs has the same effect on
the counterfactual output under all values of the other variables, then both variables should receive identical attribution
scores. Finally, CF-Approximation states that the difference between the observed output and the counterfactual output
due to a change in any subset of variables is roughly equal to the sum of the attribution scores for those variables.
Note that CF-Efficiency does not necessarily imply that the sum of attribution scores equals the actual difference
between the observed value and the reference value. This is because the actual difference is a combination of the input
variables' contributions and statistical noise (error terms). That is, $y_t - y' = Y_{\mathbf{v}}(\mathbf{u}) - Y_{\mathbf{v}'}(\mathbf{u}') = \sum_V \phi_V + (Y_{\mathbf{v}'}(\mathbf{u}) - Y_{\mathbf{v}'}(\mathbf{u}'))$,
where we used the CF-Efficiency property of a desirable attribution score $\phi$. The second term corresponds
to the difference in the metric with the same input variables but different noise at the observed and reference
timestamps. This is the unavoidable noise component, since we are explaining the change due to a single observation.
Therefore, for any counterfactual attribution score to meaningfully explain the observed difference, it is useful to select a
reference timestamp that minimizes the difference over exogenous factors (e.g., using a previous value of the metric on the
same day of the week or the same hour). Given the true structural equations and an attribution score that satisfies the axioms, if
the scores do sum to the observed difference in the metric, then the reference timestamp was well selected.

4.2 The CF-Shapley value

We now define the CF-Shapley value, which satisfies all four axioms.
Definition 2. Given an observed output metric $Y = y_t$ and a reference value $y'$, the CF-Shapley value for the contribution
of input $X$ is given by,
$$\phi_X = \sum_{\mathbf{S} \subseteq \mathbf{V} \setminus \{X\}} \frac{Y_{\mathbf{s}'}(\mathbf{u}) - Y_{x', \mathbf{s}'}(\mathbf{u})}{n \binom{n-1}{|\mathbf{S}|}} \qquad (1)$$
where $n$ is the number of input variables $\mathbf{V}$, $\mathbf{S}$ is the subset of variables set to their reference values $\mathbf{s}'$, and $\mathbf{U} = \mathbf{u}$ is
the value of the unobserved variables such that $Y(\mathbf{u}) = y_t$.
Proposition 1. The CF-Shapley value satisfies all four axioms: Efficiency, Irrelevance, Symmetry, and Approximation.

Proof. Efficiency. Following [14; 26], the CF-Shapley value for an input $V_i$ can be written as,
$$\phi_{V_i} = \frac{1}{n!} \sum_{\pi \in \Pi(n)} \Big( Y_{\mathbf{w}'_{\text{pre}}(\pi)}(\mathbf{u}) - Y_{v'_i, \mathbf{w}'_{\text{pre}}(\pi)}(\mathbf{u}) \Big) \qquad (2)$$
where $\Pi(n)$ is the set of all permutations over the $n$ variables and $\mathbf{W}_{\text{pre}}(\pi)$ is the subset of variables that precede $V_i$ in the
permutation $\pi \in \Pi(n)$. The sum is,
$$\sum_{i=1}^{n} \phi_{V_i} = \frac{1}{n!} \sum_{\pi \in \Pi(n)} \sum_{i=1}^{n} \Big( Y_{\mathbf{w}'_{\text{pre}}(\pi)}(\mathbf{u}) - Y_{v'_i, \mathbf{w}'_{\text{pre}}(\pi)}(\mathbf{u}) \Big) = \frac{1}{n!} \sum_{\pi \in \Pi(n)} \Big( Y_{\emptyset}(\mathbf{u}) - Y_{\mathbf{v}'}(\mathbf{u}) \Big) = Y(\mathbf{u}) - Y_{\mathbf{v}'}(\mathbf{u}) \qquad (3)$$
since the inner sum telescopes within each permutation. We can show the result analogously under $\mathbf{U} = \mathbf{u}'$.


CF-Irrelevance. If $Y_{x', \mathbf{s}'}(\mathbf{u}) = Y_{\mathbf{s}'}(\mathbf{u})$ for all $\mathbf{S} \subseteq \mathbf{V} \setminus \{X\}$, then the numerator in Eqn. 1 for $\phi_X$ will be zero and the result
follows.
CF-Symmetry. Assuming the same effect on the counterfactual value, we write the CF-Shapley value for $V_i$ and show it is the
same for $V_j$:
$$\begin{aligned}
\phi_{V_i} &= \sum_{\mathbf{W} \subseteq \mathbf{V} \setminus \{V_i\}} \frac{Y_{\mathbf{w}'}(\mathbf{u}) - Y_{v'_i, \mathbf{w}'}(\mathbf{u})}{n \binom{n-1}{|\mathbf{W}|}} \\
&= \sum_{\mathbf{W} \subseteq \mathbf{V} \setminus \{V_i, V_j\}} \frac{Y_{\mathbf{w}'}(\mathbf{u}) - Y_{v'_i, \mathbf{w}'}(\mathbf{u})}{n \binom{n-1}{|\mathbf{W}|}} + \sum_{\mathbf{Z} \subseteq \mathbf{V} \setminus \{V_i, V_j\}} \frac{Y_{v'_j, \mathbf{z}'}(\mathbf{u}) - Y_{v'_i, v'_j, \mathbf{z}'}(\mathbf{u})}{n \binom{n-1}{|\mathbf{Z}|+1}} \\
&= \sum_{\mathbf{W} \subseteq \mathbf{V} \setminus \{V_i, V_j\}} \frac{Y_{\mathbf{w}'}(\mathbf{u}) - Y_{v'_j, \mathbf{w}'}(\mathbf{u})}{n \binom{n-1}{|\mathbf{W}|}} + \sum_{\mathbf{Z} \subseteq \mathbf{V} \setminus \{V_i, V_j\}} \frac{Y_{v'_i, \mathbf{z}'}(\mathbf{u}) - Y_{v'_i, v'_j, \mathbf{z}'}(\mathbf{u})}{n \binom{n-1}{|\mathbf{Z}|+1}} \\
&= \sum_{\mathbf{W} \subseteq \mathbf{V} \setminus \{V_j\}} \frac{Y_{\mathbf{w}'}(\mathbf{u}) - Y_{v'_j, \mathbf{w}'}(\mathbf{u})}{n \binom{n-1}{|\mathbf{W}|}} = \phi_{V_j}
\end{aligned}$$
where the third equality uses $Y_{v'_i, \mathbf{s}'}(\mathbf{u}) = Y_{v'_j, \mathbf{s}'}(\mathbf{u})$ for all $\mathbf{S} \subseteq \mathbf{V} \setminus \{V_i, V_j\}$.
CF-Approximation. Here we use a property [17] of the value functions of standard Shapley values: there exist specific
weights $\omega(\mathbf{S})$ such that the Shapley value is the solution to $\arg\min_{\phi^*} \sum_{\mathbf{S} \subseteq \mathbf{V}} \omega(\mathbf{S}) \big( \nu(\mathbf{S}) - \sum_{V \in \mathbf{S}} \phi^*_V \big)^2$, where $\nu(\mathbf{S})$
is the value function of any subset $\mathbf{S} \subseteq \mathbf{V}$. The result follows by selecting $\nu(\mathbf{S}) = Y(\mathbf{u}) - Y_{\mathbf{s}'}(\mathbf{u})$.

Comparison to do-Shapley. Unlike CF-Shapley, the do-Shapley value [14] takes the expectation over all values of
the unobserved $\mathbf{u}$, $E_{\mathbf{u}}[Y \mid do(\mathbf{s})] - E_{\mathbf{u}}[Y]$. Thus, it measures the average causal effect over values of $\mathbf{u}$, whereas for
attributing a single observed value, we want to know the contributions of inputs under the same $\mathbf{u}$.
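To make Eqn. 1 concrete, here is a minimal sketch (ours) that enumerates all subsets for small $n$, assuming access to a counterfactual oracle `cf` that returns $Y_{\mathbf{s}'}(\mathbf{u})$ for a given subset of reset inputs (the oracle itself is implemented via the procedure of Section 4.3):

```python
from itertools import combinations
from math import comb

def cf_shapley(inputs, cf):
    """Exact CF-Shapley values (Eqn. 1) by enumerating subsets.

    inputs: list of input variable names.
    cf: maps a frozenset of inputs (reset to their reference values) to the
        counterfactual output Y_{s'}(u); cf(frozenset()) = Y(u) = y_t.
    """
    n = len(inputs)
    phi = {}
    for x in inputs:
        rest = [v for v in inputs if v != x]
        total = 0.0
        for size in range(n):
            for s in combinations(rest, size):
                s = frozenset(s)
                # marginal effect of also resetting x, given S already reset,
                # weighted by 1 / (n * C(n-1, |S|)) as in Eqn. 1
                total += (cf(s) - cf(s | {x})) / (n * comb(n - 1, size))
        phi[x] = total
    return phi
```

With the toy crash SCM of Section 3, `cf` can simply reuse `cf_crash` (giving $\phi_{X_3} = 2/3$ and $\phi_{X_1} = \phi_{X_2} = 1/6$, which sum to the counterfactual change of 1); in the ad system, each call to `cf` runs the abduction-action-prediction procedure described next, and subsets are sampled by Monte Carlo rather than enumerated.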

4.3 Estimating CF-Shapley values

Eqn. 1 requires estimating the counterfactual output at different (hypothetical) values of the inputs, which in turn requires both
the causal graph and the structural equations of the SCM. Using knowledge of how the system's metric is computed, the
first step is to construct its computational graph. Then, for each node in the graph, we fit its generating function using a
predictive model over its parents, which we treat as the data-generating process (fitted SCM).
To fit the SCM equations, a common approach for each node $V$ is to use supervised learning to build a model $\hat{f}_V$ estimating
its value from the values of its parent nodes at the same timestamp. However, such a model will have high variance due
to natural temporal variation in the node's value over time. Since including variables predictive of the outcome generally reduces
the variance of an estimate [3], we utilize auto-correlation in time-series data and include the previous values
of the node as predictive features. Thus, the final model is expressed as, for all $V \in \mathbf{V}$,
$$\hat{v}_t = \hat{f}_V(\mathit{Pa}_V, v_{t-1}, v_{t-2}, \ldots, v_{t-r}) \qquad (4)$$
where $r$ is the number of auto-correlated features included. The model can be trained using a supervised
time-series prediction algorithm with auxiliary features, such as DeepAR [23].
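As an illustrative sketch (ours), the lagged features of Eqn. 4 can be assembled as follows, with a simple linear learner standing in for the time-series model (in practice, DeepAR with auxiliary features plays this role):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def build_lagged_dataset(parents, values, r=14):
    """Build (features, targets) for Eqn. 4: v_t ~ f(Pa_t, v_{t-1..t-r}).

    parents: array of shape (T, p) with the node's parent values per day.
    values:  array of shape (T,) with the node's own time series.
    """
    X, y = [], []
    for t in range(r, len(values)):
        lags = values[t - r:t][::-1]  # v_{t-1}, v_{t-2}, ..., v_{t-r}
        X.append(np.concatenate([parents[t], lags]))
        y.append(values[t])
    return np.array(X), np.array(y)

# Synthetic example: two parents (e.g., ad demand and query volume).
rng = np.random.default_rng(0)
parents = rng.normal(size=(200, 2))
values = 1.0 + 0.01 * rng.normal(size=200).cumsum()
X, y = build_lagged_dataset(parents, values)
f_hat = LinearRegression().fit(X, y)  # stands in for DeepAR here
```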
We then use the fitted SCM equations to estimate the counterfactual with the three-step algorithm from Pearl [20], assuming
additive error. To compute $Y_{\mathbf{s}'}(\mathbf{u})$ for any subset $\mathbf{S} \subseteq \mathbf{V}$, the three steps are,
1. Abduction. Infer the error of the structural equations for all observed variables. For each $V \in \mathbf{V}$, $\hat{\epsilon}_{V,t} = v_t -
\hat{f}_V(\mathit{Pa}_V, v_{t-1}, \ldots, v_{t-r})$, where $v_t$ is the observed value at timestamp $t$.
2. Action. Set the value of $\mathbf{S} \leftarrow \mathbf{s}'$, ignoring any parents of $\mathbf{S}$.
3. Prediction. Use the inferred error terms and the new value $\mathbf{s}'$ to estimate the new outcome, proceeding step-wise
through each level of the graph [20; 6] (i.e., following a topological sort of the graph), starting with $\mathbf{S}$'s children and
proceeding downstream until the $Y$ node's value is obtained. For each $X \in \mathbf{V}$ in topological order (after $\mathbf{S}$),
$x' = \hat{f}_X(\mathit{Pa}'(X), \cdots) + \hat{\epsilon}_{X,t}$, and finally we obtain $y' = \hat{f}_Y(\mathit{Pa}'(Y), \cdots) + \hat{\epsilon}_{Y,t}$.
Thus, the CF-Shapley score for any input is obtained by repeatedly applying the above algorithm and aggregating the
required counterfactuals in Eqn. 1; we use a common Monte Carlo approximation to sample a fixed number ($M = 1000$)
of subsets $\mathbf{S}$ [4; 8].
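A minimal sketch (ours) of the three steps for the two-level graph of Figure 1, where the aggregation from category densities to the daily metric is the known query-volume-weighted average (Eqn. 5 in Section 5.2):

```python
import numpy as np

def counterfactual_density(f_hat, obs, reset, ref):
    """Estimate Y_{s'}(u) under additive noise; all names are illustrative.

    f_hat(c, ad, qv): fitted structural equation for category c's density
                      (autoregressive features omitted here for brevity).
    obs:   dict c -> (ad_t, qv_t, den_t) observed on day t.
    reset: set of (c, "ad") / (c, "qv") inputs to reset to reference values.
    ref:   dict c -> (ad_ref, qv_ref) inputs on the reference day.
    """
    dens, qvs = [], []
    for c, (ad, qv, den) in obs.items():
        eps = den - f_hat(c, ad, qv)                     # 1. abduction
        ad_cf = ref[c][0] if (c, "ad") in reset else ad  # 2. action
        qv_cf = ref[c][1] if (c, "qv") in reset else qv
        dens.append(f_hat(c, ad_cf, qv_cf) + eps)        # 3. prediction
        qvs.append(qv_cf)
    dens, qvs = np.array(dens), np.array(qvs)
    return (dens * qvs).sum() / qvs.sum()  # known aggregation step
```

Plugging this function in as the oracle `cf` of the earlier `cf_shapley` sketch (with Monte Carlo sampling over subsets when $2k$ is large) yields the attribution scores.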

5 Evaluation
Our goal is to attribute observed changes in the output metric of an ad matching system. We first describe the system
and conduct a simulation study to evaluate CF-Shapley scores.

5.1 Description of the ad matching system

We consider an ad matching system whose goal is to retrieve all the relevant ads for a particular web search query issued by
a user (these ads are later ranked so that only the top ones are shown to the user). The outcome variable is the average number
of ads matched for each query, called the "matching density" (or simply density). This outcome can be affected by multiple
factors, including the availability of ads from advertisers, the distribution and volume of user queries issued on the system,
any algorithm changes, or any other system bug or unknown factor. For simplicity, we consider a matching algorithm
based on exact full-text matching between a query and the keyword phrases provided for an ad. This algorithm remains
stable over time due to its simplicity, so we can safely assume that there are no algorithm changes or code bugs
for the matching algorithm under study. Given an extreme or unexpected value of density, our goal then is to attribute
it between changes in ads and changes in queries.
Since there are millions of queries and ads, we categorize the data by nearly 250 semantic query categories. Examples
of query categories are "Fashion Apparel", "Health Devices", "Internet", and so on. A naive solution may be to simply
compare the magnitude of observed change in ad demand or query volume across categories. That is, given a change in
density on day t, choose a reference day r (e.g., same day last week) and compare the values of ad demand and query
volume. We may conclude that the factor with the highest percentage change is causing the overall change in density.
However, the limitation is that the factor with the highest percentage change may be neither necessary nor sufficient
to cause the change, because its effect depends on the values of the other factors. E.g., an increase in query volume for a
category can have a positive, negative, or no effect on the daily density, depending on its ad demand relative to
other categories. This is because the density is computed as a query volume-weighted average of category densities;
an increase in query volume for a low-demand (and hence low-density) category decreases the aggregate density (see
Eqn. 5).

5.2 Constructing an SCM for ad density metric

To apply the CF-Shapley method for attributing a matching density value, we define a causal graph based on how the
metric is computed, as shown in Figure 1. The number of queries for a category is measured by the number of search
result page views (SRPV). The number of ads is measured by the number of listings posted by advertisers. For simplicity,
we call these query volume and ad demand. We assume that, given a category, ad demand and query volume are
independent of each other, since they are driven by advertiser and user goals respectively. The combination of ad
demand and query volume for a category determines its category-wise density, which is then aggregated to yield the daily
density. As we are interested in attribution over days as a time unit, we refer to the aggregate density as the daily density, $y$.
Thus, the variables $\{ad^{c_1}, qv^{c_1}, ad^{c_2}, qv^{c_2}, \ldots, ad^{c_k}, qv^{c_k}\}$ are the $2k$ inputs to the system, where $c_i$ is the category, $ad$
refers to ad demand, $qv$ refers to query volume, and $k$ is the number of categories.
The structural equation from category-wise densities to daily density is known: it is simply a weighted average of the
category-wise densities, weighted by query volume,
$$y_t = f(den_t^{c_1}, qv_t^{c_1}, \ldots, den_t^{c_k}, qv_t^{c_k}) = \frac{\sum_c den_t^c \, qv_t^c}{\sum_c qv_t^c} \qquad (5)$$
where $den_t^c$ is the density of category $c$ on day $t$ and $qv_t^c$ is the query volume for the category on day $t$. However, the
equation from category-wise ad demand and query volume to category density is infeasible to obtain directly: it would
involve "replaying" a computationally expensive matching algorithm over real-time queries and ad listings, but the ad
listings are not available (only a daily snapshot of the ads inventory is stored in the logs). We show how to estimate
the structural equation for category density in Section 6.1.

5.3 Evaluating CF-Shapley on simulated data

Before applying CF-Shapley to the ad matching system, we first evaluate the method on simulated data motivated by
the causal graph of the system. This is because it is impossible to know the ground-truth attribution using data from a
real-world system, since we do not know how the change in input variables led to the observed metric value and which
inputs were the most important.
We construct simulated data based on the causal structure of Figure 1. For each category, we model ad demand and
query volume as independent Gaussian random variables (we simulate real-world variation in query volume using
a Beta prior). The category-wise density is constructed as a monotonic function of ad demand and has a biweekly
periodicity.
periodicity. The SCM equations are,
γ = B(0.5, 0.5); qv ct = N (1000γ, 100); adct = N (10000, 100)
denct = g(adct , qv ct , denct−1 ) + N (0, σ 2 )
= κ ∗ adct /qv ct + β ∗ a ∗ denct−1 + N (0, σ 2 ) (6)
c c
P
c dent qv t
yt = P c (7)
c qv t
where qv ct and adct are the query volume and ad demand respectively for category c at time t. They combine to produce
the ad matching density denct based on a function g and additive normal noise. The variance of the noise, σ 2 determines
the stochastic variation in the system. For the simulation, we construct g based on two observations about the category
density: 1) it is roughly a ratio of the relevant ads and the number of queries; and 2) it exhibits auto-correlation with its
previous value and periodicity over a longer time duration. We use κ to denote the fraction of relevant ads and add a
second term with parameter a to simulate a biweekly pattern, a = 1 if floor(t/7) is even else a = −1. β is the relative
importance of the previous value in determining the current category density. Finally, all the category-wise densities are
weighted by their query volume qv ct and averaged to produce the daily density metric, yt .
Each dataset generated using these equations has 1000 days and 10 categories; we set κ = 0.85, β = 0.15 for simplicity.
We intervene on the ad demand or query volume of the 1000th point to construct an outlier metric that needs to be
attributed. Given the biweekly pattern, reference date is chosen 14 days before the 1000th point.
Setting ground-truth attribution. Even with simulated data, setting the ground-truth attribution can be tricky. For
example, if there is an increase in ad demand for one category and an increase in query volume for another, it is not
clear which one would have the bigger impact on the daily density: that depends on their query volume and ad
demand respectively, and on any changes in other categories. To evaluate attribution methods, we therefore consider simple
interventions where an objective ground truth can be obtained. Specifically, for ease of interpretation, we intervene on
only two categories at a time, such that the first has a substantially higher chance of affecting the outcome metric than
the second.
We consider two configurations: a change in 1) ad demand and 2) query volume. For changing ad demand (Config 1), we
choose two categories such that the first has the highest query volume and the second has the lowest query volume.
We double the ad demand for both categories with a slight difference (×2 for the first category, ×2.1 for the second).
Since the category-wise densities are weighted by query volume to obtain the daily density metric, for the same (or
similar) change in demand, the first category naturally has a higher impact on the daily density (even though the two may
have similar impacts on their own category-wise densities). For Config 1, thus, the ground-truth attribution is the first category.
For changing query volume (Config 2), we choose two categories such that the first has the most extreme density and
the second has density equal to the reference daily density. Then, we change query volume as above: ×2 for the first
category and ×2.1 for the second. Following Eqn. 7, a query volume change in a category whose density equals the
daily density is expected to have low impact on the daily density (keeping other categories constant, if category density is
not affected by query volume, an increase in query volume for a category with density equal to the daily density causes
zero change in the daily density). Thus, the ground-truth attribution (the category with the highest impact on the output metric) is
again the first category. Note that query volume has higher variation across categories, so a higher multiplicative factor
does not necessarily mean a higher absolute difference.
Baseline attribution methods. We compare CF-Shapley to the standard Shapley value (as implemented in SHAP [17];
Shapley) and the do-Shapley value (DoShapley) [14]. The Shapley method ignores the structure and fits a model
directly predicting daily density $y_t$ from the (category-wise) ad demand and query volume features, using the predictions
of this model to compute the Shapley score. For the DoShapley method, we note that our causal graph corresponds
to the Direct-Causal graph structure in their paper and use the estimator from Eq. (5) in [14], which depends on the
same daily density predictor as the standard Shapley value. We also evaluate three intuitive baselines based on
absolute change in the inputs: the category with the biggest change in 1) ad demand (AdDemandDelta); 2) query volume
(QueryVolumeDelta); or 3) density multiplied by query volume (ProductDelta), since this product appears in the daily
density equation.
For the CF-Shapley algorithm, we fit the structural equation for category density using the following features: ad
demand, query volume, $den_{t-1}$, $den_{t-7}$, $den_{t-14}$. For both the CF-Shapley category density prediction model and the
Shapley daily density prediction model, we use a 3-layer feed-forward network. We use all data up to the 999th day for
training and validation of all models.
Results. For each attribution method, we measure accuracy against the ground truth as we increase the noise $\sigma$
in the true data-generating process ($\sigma \in \{0.1, 1, 10\}$). As the noise in the generating process increases, we expect
higher error in fitting the structural equations, and thus the attribution task becomes harder. Attribution accuracy is defined
[Figure 2 plot: category attribution accuracy (y-axis, 0-1) vs. noise in the data-generating function (x-axis: 0.1, 1.0, 10.0), with panels for Config 1 (change in ad demand) and Config 2 (change in query volume); methods: AdDemandDelta, QueryVolumeDelta, ProductDelta, Shapley, DoShapley, CFShapley.]

Figure 2: Category attribution accuracy for various methods.

as the fraction of times a method outputs the highest attribution score for the correct category (first category), over 20
simulations.
Figure 2 shows the results. CF-Shapley obtains the highest attribution accuracy for both Config 1 and 2. In general,
attribution for ad demand is easier than query volume because both the category density and daily density are monotonic
functions of the ad demand. That is why we observe near 100% accuracy for CF-Shapley under Config 1, even with
high noise. The attribution accuracy for Config 2 is 70-80%, decreasing as more noise is introduced.
In comparison, none of the baselines achieve more than 50% (random-guess) accuracy. Note that the Shapley and
DoShapley methods obtain similar attribution accuracies. While their attribution scores are different, the highest ranked
category often turns out to be the same since they rely on the same daily density model (but use different formulae).
Inspecting the predictive accuracy of the daily density model offers an explanation: error on the daily density prediction
is higher than that for category-wise density prediction (and it increases as the noise is increased). This indicates the
value of computing an individualized counterfactual using the full graph structure, rather than focusing on the average
(causal) effect. Finally, the other intuitive baselines fail on both tasks since they only look at the change in the input
variables.

6 Case study on ad matching system

We now apply the CF-Shapley attribution method to data logs of a real-world ad matching system from July 6 to Dec
28, 2021. For each query, we have log data on the number of ads matched by the system. In addition, each query is
marked with its category. The category query volume is measured as the number of queries issued by users in each
category. This allows us to calculate the ground-truth matching density on each day, both category-wise and aggregate.
Separately, to compute the category-wise input ad demand for a day, we fetch each ad listing available on the day and
assign it to a category if any query from that category contains a word present in the listing's keywords. This is the total
number of ad listings that are potentially relevant to queries in the category under the exact matching algorithm (which
matches the full query exactly to the full ad keyword phrase).

6.1 Implementing CF-Shapley: Fitting the SCM

We follow the method outlined in Section 4.3. The main task is to estimate the structural equations for category-wise ad
density. There are over 250 categories, so fitting a separate model for each is inefficient; moreover, it is beneficial
to exploit the common patterns across the different time series. We therefore use a deep learning-based model,
DeepAR [23], which fits a single recurrent network to multiple time series (we also tried a transformer-based model, the
temporal fusion transformer (TFT) [16], but found it hard to tune to comparable accuracy).
Model Mean APE (%) Median APE (%) sMAPE
LastWeek 21.2 11.5 0.20
Avg4Weeks 25.1 10.6 0.17
FeedForward 20.0 10.8 0.20
DeepAR 15.6 7.8 0.13
Table 1: Accuracy of category-wise density prediction models.
[Figure 3 plot: fractional change in daily density (y-axis), Nov 15-Dec 27, 2021 (x-axis); series: Actual Difference vs. Sum of Attribution Scores.]
Figure 3: Comparison of the actual percentage change in daily density with the sum of estimated CF-Shapley attribution scores.

As specified in Equation 4, for each category, the DeepAR model is given ad demand, query volume, and the autoregressive
values of density for the past 14 days. Note that rather than predicting over a range of days (which can be inaccurate), we
refit the time-series model for each day using data up to day $t-1$, to utilize the additional information available
from the previous day. To implement DeepAR, we used the open-source GluonTS library.
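A sketch of this setup with GluonTS (the exact API varies across GluonTS versions; this follows the classic MXNet interface, and `density`, `ad_demand`, `query_volume`, and `categories` are illustrative placeholders for the logged series):

```python
from gluonts.dataset.common import ListDataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer

# One entry per category: the density series as target, with ad demand and
# query volume as dynamic real-valued covariates, aligned by day.
train_ds = ListDataset(
    [
        {
            "start": "2021-07-06",
            "target": density[c],  # shape (T,)
            "feat_dynamic_real": [ad_demand[c], query_volume[c]],
        }
        for c in categories
    ],
    freq="D",
)

estimator = DeepAREstimator(
    freq="D",
    prediction_length=1,        # refit daily; predict one day ahead
    context_length=14,          # autoregressive window, as in Eqn. 4
    use_feat_dynamic_real=True,
    trainer=Trainer(epochs=20),
)
predictor = estimator.train(train_ds)
```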
We compare the DeepAR model to three baselines. As simple baselines that capture the weekly pattern, we consider 1) the
category density on the same day a week before, and 2) the average density over the last four weeks. We also consider a
3-layer feed-forward network that uses the same features as DeepAR. Table 1 shows the prediction error. The DeepAR
model obtains the lowest error on the validation set according to all three metrics: mean absolute percentage error
(MAPE), median APE, and symmetric MAPE [18]. For our results, we therefore choose DeepAR as the estimated SCM
equation and apply CF-Shapley to data from Nov 15 to Dec 28; we chose Nov 15 to allow sufficient days of training
data.
Choosing the reference timestamp. The CF-Shapley method requires specifying a reference day that provides the
"expected/usual" density value. Common choices are the previous day's value or the value on the same day last
week. We choose the latter due to known weekly patterns in the density metric.

6.2 Validating the CF-Efficiency axiom

We first check whether the obtained CF-Shapley scores sum to the observed percentage change in the daily density
metric (Figure 3). The difference between the sum of CF-Shapley scores and the actual change is less than 0.10% for all
days, indicating that our choice of reference timestamp is appropriate (Sec. 4.1) and that the approximate Shapley value
computation captures the relevant signal.

6.3 Choosing dates for evaluation

While we computed attribution scores for all days, one is typically interested in attribution for unexpected values of the
daily density.
To discover unusual days for attribution, we fit a standard time-series model to the aggregate daily density data. We
use four candidate models: 1) the daily density on the same day last week; 2) the mean density over the last 4 weeks; 3) a
feed-forward network; and 4) a DeepAR model. As for the category-wise prediction, all neural network models are provided
the last 14 days of daily density. Table 2 shows the mean APE, median APE, and sMAPE. The FeedForward model
Model Mean APE (%) Median APE (%) sMAPE
LastWeek 4.6 3.1 0.047
Last4Weeks 4.5 3.3 0.047
FeedForward 3.0 2.2 0.031
DeepAR 3.4 2.4 0.035
Table 2: Prediction error for daily density models.

[Figure 4 plot: actual vs. predicted daily density (y-axis, ~0.7-1.0), Nov 15-Dec 27, 2021 (x-axis), with shaded 95% prediction interval.]
Figure 4: Outliers detected via the FeedForward model's predictions. The shaded region represents the 95% prediction interval.

obtains the lowest error. While DeepAR is a more expressive model than FeedForward, a potential reason for its lower
accuracy here is the number of training samples (daily density prediction has only as many data points as the number of days,
unlike category-wise prediction). Given its simplicity, we use the FeedForward network for detecting outlier days. Its
predictions for different days, and the outliers detected, are shown in Figure 4. Like DeepAR, the FeedForward model is
implemented as a Bayesian probabilistic model, so it outputs prediction samples rather than a point prediction.
Days where the daily density goes beyond the 95% prediction interval are chosen for attribution. A visual inspection
shows two clusters, Thanksgiving/Black Friday and Christmas, which are expected given their significance in the US.
We also find an extreme value on Dec 4. In all three cases, the daily density increases. Intuitively, one may have
expected the opposite for holidays: density would decrease, since people are expected to spend less time online.
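This outlier rule is straightforward given a sample-based probabilistic forecaster; a sketch (ours, with illustrative names):

```python
import numpy as np

def flag_outlier_days(actual, samples, level=0.95):
    """Flag days whose observed daily density leaves the central
    prediction interval of a probabilistic forecaster.

    actual:  array of shape (T,) with observed daily densities.
    samples: array of shape (T, M) with M prediction samples per day.
    """
    alpha = (1 - level) / 2
    lo = np.quantile(samples, alpha, axis=1)
    hi = np.quantile(samples, 1 - alpha, axis=1)
    return np.where((actual < lo) | (actual > hi))[0]  # outlier day indices
```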

[Figure 5 bar chart: attribution scores (y-axis, up to ~0.04) for 22 high-level categories, split into AdDemand and QueryVolume components.]
Figure 5: Attribution scores of different categories by ad demand and query volume on December 4.

Category AdDemandAttrib QueryVolumeAttrib
Sort by AdDemandAttrib
Internet & Telecom 0.0450 0.00850
Apparel -0.00843 0.00151
Arts & Entertainment 0.00663 -0.00928
Hobbies & Leisure 0.00646 -0.0168
Travel & Tourism -0.00584 -0.000287
Sort by QueryVolumeAttrib
Hobbies & Leisure 0.00646 -0.0168
Arts & Entertainment 0.00633 -0.00928
Internet & Telecom 0.0450 0.00850
Law & Government 0.000203 -0.00743
Health 0.000161 -0.00645
Table 3: Ad demand and query volume attribution on Dec 4.
6.4 Qualitative analysis

We now use CF-Shapley to explain these observed changes.


December 4. Figure 5 shows the attribution across categories, aggregated up to 22 high-level categories.
The Internet & Telecom (IT) category has the biggest positive attribution score, while the Hobbies & Leisure (HL)
category has the biggest negative attribution score. That is, the daily density decreased on the day due to the HL category.
To understand why, we look at the attribution scores separately for ad demand and query volume for each category
in Table 3. The attribution score corresponds to the percentage change in daily density compared to last week, due to the ad
demand or query volume of a category. The only categories with an attribution score greater than 1% are IT and HL,
agreeing with the category-wise analysis. Specifically, the change in ad demand for IT leads to a 4.5% increase in
daily density. The query volume change in HL, on the other hand, leads to a 1.7% decrease in daily density. Considering
all categories together, ad demand change leads to a 6.5% increase in daily density and query volume change leads to
a 5.6% decrease. The net result is a 1% improvement over the last week. While an increase of 1% in daily density may
look small, note that the value last week was already inflated by the Black Friday week. This is why we detect
outliers using the expected time-series pattern rather than simply the difference from last week. On such days, one may
also consider an alternative baseline, e.g., two weeks before.
Are the attributions meaningful? In the absence of ground truth, we dive deeper into the query logs to check for
evidence. We do find a significant increase in queries for the HL category. In fact, more than 70% of the increase in
query volume for HL is due to cheetah-related queries. On manual inspection, we find that December 4 is International
Cheetah Day. Cheetah-related queries also contribute 86% of the ad demand increase for the HL category. Given that the
category density of HL is much lower than the daily density, this increase in query volume causes a decrease in daily
density, leading to the negative attribution score. Due to the ad demand increase (perhaps in anticipation of
Cheetah Day), HL also contributes a 0.6% increase in daily density (see Table 3). The IT category's
main contribution, on the other hand, comes from an increase in ad demand. Logs show a substantial (14%) increase in ads
compared to last week for the category on Dec 4, which explains its high attribution score for ad demand. This increase is
sustained across queries, possibly indicating a shift for the first Saturday after the holiday weekend.
Nov 25 and 26 (Thanksgiving). On the Thanksgiving holiday (Nov 25), we might have expected density to drop, since
many people in the US spend more time with their family and less time online. At the same time, online
shopping on Black Friday (Nov 26) might increase density. Instead, we find that the density increases significantly on
both days (see Figure 4). Specifically, compared to last week, daily density on Nov 26 increased by 18.3%, of
which 13.5% is contributed by query volume change and 4.8% by ad demand. How to explain this result? Using the
CF-Shapley method, for query volume change, we find that Health, Law & Government, and Business &
Industrial are the top-ranked categories. Each contributes more than 2% of the density increase, for a cumulative
7% increase. From the logs, we see that query volume for these categories decreased as people spent less time on work-
or health-related queries. Since these categories tend to have low density, the daily density increased as a result. On the
ad demand side, Online Media & Ecommerce contributed a nearly 3% increase in daily density, perhaps due to increased
demand for Black Friday shopping. Nov 25 exhibits similar patterns for query volume.

Dec 24 and Dec 25 (Christmas). On the Christmas days too, there is a significant increase in density. As on the
Thanksgiving days, health- and work-related queries are issued less often, leading to an overall increase in daily
density (all three categories have attribution scores > 1%). However, we find that the top categories by query volume
change are Hobbies & Leisure and Arts & Entertainment. Both categories experience a surge in their query
volume and, being high-density categories, cause 2.1% and 1.8% increases in daily density respectively. To explain this,
we look at the query logs and find that the rise in Hobbies & Leisure queries is fueled by the toys & games subcategory,
which aligns with expectations for the holidays. On Dec 25, Hobbies & Leisure is also the category with
the highest attribution score by ad demand (2.7%). Overall, the category contributes a 4.8% increase, nearly one-third of
the total density increase on Christmas day, signifying the importance of the toys & games subcategory for Christmas.

7 Discussion and Conclusion


We presented a counterfactual-based attribution method to explain changes in a large-scale ad system’s output metric.
Using the computational structure of the system, the method provides attribution scores that are more accurate than
prior methods.

References
[1] Mona Attariyan, Michael Chow, and Jason Flinn. 2012. X-ray: Automating root-cause diagnosis of perfor-
mance anomalies in production software. In 10th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 12).
[2] Ron Berman. 2018. Beyond the last touch: Attribution in online advertising. Marketing Science 37, 5 (2018),
771–792.
[3] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly,
Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual Reasoning and Learning Systems: The
Example of Computational Advertising. Journal of Machine Learning Research 14, 11 (2013).
[4] Javier Castro, Daniel Gómez, and Juan Tejada. 2009. Polynomial calculation of the Shapley value based on
sampling. Computers & Operations Research 36, 5 (2009), 1726–1730.
[5] Brian Dalessandro, Claudia Perlich, Ori Stitelman, and Foster Provost. 2012. Causally motivated attribution for
online advertising. In Proceedings of the sixth international workshop on data mining for online advertising and
internet economy.
[6] Saloni Dash, Vineeth N Balasubramanian, and Amit Sharma. 2022. Evaluating and mitigating bias in image
classifiers: A causal perspective using counterfactuals. In Proceedings of the IEEE/CVF WACV Conference.
915–924.
[7] Bradley Efron. 2020. Prediction, estimation, and attribution. International Statistical Review 88 (2020), S28–S59.
[8] Shaheen S Fatima, Michael Wooldridge, and Nicholas R Jennings. 2008. A linear approximation method for the
Shapley value. Artificial Intelligence 172, 14 (2008).
[9] Joseph Y Halpern. 2016. Actual causality. MiT Press.
[10] Tom Heskes, Evi Sijben, Ioan Gabriel Bucur, and Tom Claassen. 2020. Causal shapley values: Exploiting causal
knowledge to explain individual predictions of complex models. Advances in neural information processing
systems 33 (2020), 4778–4789.
[11] Tsuyoshi Idé, Amit Dhurandhar, Jiří Navrátil, Moninder Singh, and Naoki Abe. 2021. Anomaly attribution with
likelihood compensation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4131–4138.
[12] Dominik Janzing, Kailash Budhathoki, Lenon Minorics, and Patrick Blöbaum. 2019. Causal structure based root
cause analysis of outliers. arXiv preprint arXiv:1912.02724 (2019).
[13] Wendi Ji, Xiaoling Wang, and Dell Zhang. 2016. A probabilistic multi-touch attribution model for online adver-
tising. In Proceedings of the 25th acm international on conference on information and knowledge management.
1373–1382.
[14] Yonghan Jung, Shiva Kasiviswanathan, Jin Tian, Dominik Janzing, Patrick Bloebaum, and Elias Bareinboim. 2022.
On Measuring Causal Contributions via do-interventions. In Proceedings of the 39th International Conference on
Machine Learning (Proceedings of Machine Learning Research, Vol. 162). PMLR, 10476–10501.
[15] Ramaravind Kommiya Mothilal, Divyat Mahajan, Chenhao Tan, and Amit Sharma. 2021. Towards unifying
feature attribution and counterfactual explanations: Different means to the same end. In Proceedings of the 2021
AAAI/ACM AIES Conference.

[16] Bryan Lim, Sercan Ö Arık, Nicolas Loeff, and Tomas Pfister. 2021. Temporal fusion transformers for interpretable
multi-horizon time series forecasting. International Journal of Forecasting 37, 4 (2021), 1748–1764.
[17] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in neural
information processing systems 30 (2017).
[18] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. 2020. The M4 Competition: 100,000
time series and 61 forecasting methods. International Journal of Forecasting 36, 1 (2020), 54–74.
[19] Karthik Nagaraj, Charles Killian, and Jennifer Neville. 2012. Structured comparative analysis of systems logs to
diagnose performance problems. In 9th USENIX Symposium on Networked Systems Design and Implementation
(NSDI 12).
[20] Judea Pearl. 2009. Causality. Cambridge university press.
[21] Judea Pearl. 2019. The seven tools of causal inference, with reflections on machine learning. Commun. ACM 62,
3 (2019), 54–60.
[22] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. 2017. Elements of causal inference: foundations and
learning algorithms. The MIT Press.
[23] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. 2020. DeepAR: Probabilistic forecasting
with autoregressive recurrent networks. International Journal of Forecasting 36, 3 (2020), 1181–1191.
[24] Xuhui Shao and Lexin Li. 2011. Data-driven multi-touch attribution models. In Proceedings of the 17th ACM
SIGKDD international conference on Knowledge discovery and data mining. 258–264.
[25] Raghav Singal, Omar Besbes, Antoine Desir, Vineet Goyal, and Garud Iyengar. 2022. Shapley meets uniform: An
axiomatic framework for attribution in online advertising. Management Science (2022).
[26] Erik Štrumbelj and Igor Kononenko. 2014. Explaining prediction models and individual predictions with feature
contributions. Knowledge and information systems 41, 3 (2014), 647–665.
[27] Sahil Verma, John Dickerson, and Keegan Hines. 2020. Counterfactual explanations for machine learning: A
review. arXiv preprint arXiv:2010.10596 (2020).
[28] Teppei Yamamoto. 2012. Understanding the past: Statistical analysis of causal attribution. American Journal of
Political Science 56, 1 (2012), 237–256.
[29] Yongle Zhang, Kirk Rodrigues, Yu Luo, Michael Stumm, and Ding Yuan. 2019. The inflection point hypothesis: a
principled debugging approach for locating the root cause of a failure. In Proceedings of the 27th ACM Symposium
on Operating Systems Principles. 131–146.

