
2021-Modeling Labels For Conversion Value Prediction

Modeling labels for conversion value prediction

Ashwinkumar Badanidiyuru, Google Research, USA
Guru Guruganesh, Google Research, USA
ABSTRACT
In performance-based digital advertising, one of the key technical tools is predicting the expected value of post-ad-click purchases (a.k.a. conversions). Some salient aspects of this problem, such as having a non-binary label and advertisers reporting the label in different scales, make it a much harder problem than predicting the probability of a click. In this paper we ask what is a good way to model the label and extract as much information as possible from the features. We investigate three main issues that arise from advertiser-reported labels and come up with new techniques to address them. The first issue is that the label scale can affect how the model capacity is devoted to different advertisers. The second issue is how outlier labels can cause over-fitting. Finally, we also show that the distribution of the label contains vital information, and we train our models to use it rather than rely on the mean alone.

KEYWORDS
Conversion, Ads, Value prediction, Label

1 INTRODUCTION
The holy grail of advertising is to maximize its net effect while simultaneously spending as little as possible. Over many decades, the advertising industry has repeatedly changed to be able to achieve this. Historically, advertising included print and billboard messaging, and its effectiveness was measured via surveys. These were at best good proxies. While the introduction of radio and television increased the overall reach and effectiveness, they still suffered from the same measurement issues as previous forms. With the advent of internet advertising, we are able to measure engagement through more direct proxies or even end goals.

The most popular form of advertising on the internet was based on "pricing per view" (impressions), which is most suitable for "Brand" advertising. A breakthrough change was the introduction of pay-per-click ads, where the advertiser only pays if the user clicks on an ad [24]. This was substantially more effective, as users who click on the ad are typically more valuable for the advertiser. It also allows for better optimization in showing relevant ads, as one can predict click-through rates. While pay-per-click ads are indeed better for performance advertising, they still suffered from some nontrivial challenges. These include advertisers manually specifying a painstakingly computed bid for each segment of the population, a lack of automation due to manually specifying which users to target, and paying for users who are unlikely to buy on the advertiser's website.

Over the last few years, the industry has shifted towards optimizing end goals. In this case, advertisers report conversions (purchase events etc.) which happen on their website back to RTB (real time bidding) platforms, along with an associated value for these conversions (see [1]). In a CPA (cost per acquisition) product, there is a model to predict the probability of a conversion per click, which is used to proportionally adjust bids for these advertisers. In a ROAS (return on ad spend) product there is instead a model to predict the expected value of conversions per click, which is also used to proportionally adjust bids for these advertisers; i.e. in both cases, the bid is proportionally higher on users whose prediction is higher.

We observe that conversions and their purported value are both advertiser-reported data and are not observed by the search engine directly. As a result, a number of issues present themselves in this setting that are not present in more traditional models which predict, e.g., click-through rates. Building a machine learning model to predict expected value therefore turns out to be a lot more challenging than building a model to predict the probability of a click or a conversion. In this paper, our main goal is to construct a highly accurate model which predicts the advertiser-reported value.

The first salient issue in value prediction is that the label reported by each advertiser is in an arbitrary (but consistent) scale. As an example, one advertiser can report labels in the range of 10000, while another reports in the range of 0.00001. This doesn't necessarily mean that the dollar value generated per conversion by the first advertiser is larger than that of the second advertiser. As a result, if we train a generalized linear regression model, the model capacity is allocated disproportionately to the first advertiser in the above example (and more generally to advertisers who report labels in a larger scale). Our first idea is to normalize the labels such that the average label is 1 for every advertiser.

The resulting normalized label still has two important issues. First, the range of the normalized label is still quite wide due to the presence of many outliers. Secondly, the distribution is biased towards having many more zero-valued labels than any other value. To handle the outliers, we take inspiration from robust statistics and utilize a technique known as the winsorized mean, and merge it with multi-task learning. To fix the excess of zeros in the conversion data, we take inspiration from zero-inflated models. Traditionally one would incorporate these new ideas into a single objective. One of our main ideas is to instead create new auxiliary objectives and use this multi-objective approach to learn these properties of the resulting distribution.

Our final observation is that, while predicting the expected value, most standard models just predict the mean of the distribution and neglect the information in the complete distribution. We further use multi-task learning to predict different properties of the distribution, which further improves the accuracy of predicting the label. In particular, by asking the model to solve an additional classification task, we find that the model quality improves further. Note that this is not like reducing a regression task to a classification task, but rather adding new tasks which are only used in training and not used in computing the final prediction. The use of multi-task learning here is not like
the classical use where we train on different tasks (for example CTR and CVR prediction). The additional tasks don't correspond to any other natural product objective and solely drive performance gains.

2 RELATED WORK
Internet advertising has a very rich literature. It started with a long line of work on models for predicting clicks [25, 27, 34] and continued with topics such as designing auctions [4, 12, 30]. There has also been a series of papers studying conversion prediction [3, 21, 26, 28]. While conversion prediction has several aspects in common with classical click prediction, it also has many unique aspects such as delayed labels [7], attribution [10] and computing the causal effect [22].

Another line of work which is quite relevant is on loss functions. These works (see [6, 14, 29, 32]) give a complete characterization of proper scoring rules, i.e. loss functions which result in unbiased estimators. At a very high level, these are exactly the loss functions whose gradient has the form $\partial\,\mathrm{loss}/\partial\,\mathrm{parameters} = f(\mathrm{prediction}) \cdot (\mathrm{prediction} - \mathrm{label})$ for some reasonably "nice" function $f$. These loss functions have gradient 0 (in expectation) when the prediction is exactly the (expected) label. Throughout this paper, we will use Poisson regression, which belongs to the class of proper scoring rules. This is quite standard and is used for predicting positive float labels throughout the literature (see [2]). We won't be concerned with other classes of loss functions. Note that while Poisson regression can be derived using maximum likelihood for integer labels, the loss function is well defined for arbitrary positive real-valued labels and gives an unbiased estimator irrespective of the underlying distribution that the label comes from.

A second line of work has been on mean estimation with heavy-tailed distributions, both in the setting of i.i.d. random variables and in the setting of regression. An excellent survey of existing techniques and relevant references can be found in [23]. The setting relevant for us is that of regression, for which the survey describes and analyzes median-of-means tournaments. A number of different techniques have since been developed in the literature. Modern techniques such as [9, 15, 18] use more sophisticated tools and produce robust estimators but are not very practical.

A third line of work is on using information in the distribution of the label and not just predicting the mean of the distribution. One example of this is the work on Zero-Inflated Poisson (ZIP) regression [17, 20, 33], which assumes that the label is generated via a mixture distribution where we generate 0 with probability $p$ and a Poisson random variable with probability $1-p$. There are two challenges with this approach. One is that a ZIP model is not a proper scoring rule and hence might not give an unbiased estimator. The second is that it works only for integer labels and isn't defined for float labels.

A fourth line of work which is very relevant is how offline accuracy relates to final business metrics. While traditional machine learning research used various metrics for comparing the offline accuracy of models, such as log likelihood, l2 error etc., there was a need for better evaluation of models used in internet advertising due to how they can affect final metrics such as revenue. This was studied in [8, 16]. It was later extended by [31] to also allocate model capacity for different advertisers to optimize for the final metric. The solution in [31] is to weight the examples by CPA, and it works for binary labels. This work is orthogonal to ours and can work in conjunction with label normalization, where one can do CPA weighting on top of label normalization. We won't be touching upon this topic in the rest of this paper.

3 PRELIMINARIES
In this section we list the notation used in this paper and also discuss offline evaluation metrics.

Notation. We will use $X$ to denote the set of features used for prediction. For each click the advertiser reports to RTB a set of post-click conversions denoted by $C = \{c_1, c_2, \ldots\}$. For each conversion $c_i$ the advertiser also reports a corresponding value $\ell_i$. Hence the total value of the click for the advertiser is $\sum_i \ell_i$, which we denote by $\ell$. We will denote the advertiser by $A$, which is also a feature included in the general set of features $X$, i.e. $A \in X$. The goal of the prediction problem is to predict the quantity $E[\ell \mid X]$. Let $cpc$ be the cost per click that the advertiser needs to pay for that specific click.

Regression formulation as Poisson regression. There are several ways to predict a real-valued random variable, and we use one of the most popular formulations to predict this value. In particular, we model it as a Poisson regression task. Poisson regression is a type of generalized linear regression model in which the expected label satisfies

    $\log(E[Y \mid X]) = \langle \theta, X \rangle.$

While one can derive Poisson regression via maximum likelihood for an integer Poisson random variable, it belongs to the class of proper scoring rules [6], which give unbiased estimators even when the corresponding label is real valued. We will have a deep neural network output the parameter $\theta$ of the Poisson regression, and the prediction is $e^{\theta}$. If $l$ is the label for the regression formulation, then the Poisson log likelihood for the example is

    $l \cdot \theta - e^{\theta}. \qquad (1)$

Offline metric as Negative Poisson log likelihood (NPLL) with respect to normalized label. When training a Poisson regression model we maximize the Poisson log likelihood or, equivalently, minimize the negative Poisson log likelihood (NPLL). So it is natural to evaluate the accuracy of different models by comparing NPLL on a held-out dataset. Since we will be normalizing our label, we will be evaluating our ideas by comparing NPLL with respect to the normalized label.

4 LABEL NORMALIZATION
If we look at the gradient of the Poisson log likelihood with respect to $\theta$, it is exactly equal to label minus prediction (see eq. (1)). As a result, gradients for advertisers who report in a larger scale will be larger, and more model capacity will be devoted to these advertisers. Large gradients affect the model capacity because it takes many examples with smaller labels to compensate for one large gradient from a large label. While labels for different clicks do represent their relative value for a given advertiser (i.e. they are consistent), across different advertisers they don't necessarily correlate with dollars spent. In particular, advertisers who spend a small amount but report large labels will contribute disproportionately to the loss. As a result, the model will allocate more capacity to predict their labels correctly. Such a system is not
incentive compatible, as each advertiser now has an incentive to scale up their labels to a very large scale.

A naive method to resolve the above issue would be to pick a loss function whose gradients are scale invariant. Unfortunately, this introduces a different issue. If gradients are scale invariant, then the model will take a large number of steps to learn the mean prediction for advertisers with a large scale, and will not converge (i.e. it will bounce around) for advertisers who report labels on a small scale.

Figure 1: A heatmap of avg label per customer vs avg cpc for that customer.

We solve the above issues by using label normalization, where we multiplicatively normalize the labels of each advertiser so that their average is 1. To do this, we compute a normalization constant for each advertiser, $\eta_A = E[\ell \mid \ell > 0, A]$. The final normalized label is then $\ell_n = \ell/\eta_A$. Our final prediction is the normalization constant times the model prediction, i.e. $\eta_A \cdot E[\ell_n \mid X, A]$.

Note that the above design fixes both the issue of model capacity allocation and that of learning the mean for each advertiser. Since we do a per-advertiser normalization, the average label $E[\ell \mid \ell > 0, A]$ is easily computed online. Since the normalized labels of all advertisers are in the same range, the model converges in a few steps for all advertisers. As a result, it is easy to see why the final prediction $E[\ell \mid \ell > 0, A] \cdot E[\ell_n \mid X, A]$ becomes calibrated for each advertiser within a few gradient steps. Furthermore, the gradients no longer depend on each advertiser's reporting scale, and as a result the model capacity is allocated evenly to all advertisers.

We note that there are several alternatives to using multiplicative normalization. We discuss a few of them below:

• Another way to normalize the labels is to use an additive normalizer rather than a multiplicative normalizer. However, this approach doesn't yield favorable results for two reasons. First, advertisers use the relative values of the labels to indicate the relative value of conversions. Secondly, the wide range which causes poor predictions is still an issue.
• Another way is to weight each sample by $1/E[\ell \mid \ell > 0, A]$. This approach leads to numerical issues, as the gradient can be arbitrarily large. This is because even if the label is small (say $\ell \approx 0$), the prediction could be a constant, and after dividing by a very small $E[\ell \mid \ell > 0, A]$ the resulting gradient would be enormous.
• Another idea is to simply not try to normalize the label.

We show in the experimental section that the above performs poorly.

5 LEARNING PROPERTIES OF THE DISTRIBUTION
In this section, we construct a model to predict the normalized label as accurately as possible. We notice that these distributions have unique properties, and we show ways of exploiting them in the subsections below.

5.1 Outlier handling via Winsorized mean and Multi task learning
Even after normalization of the mean, the relative value of the label compared to its mean can take on large values. While this could be due to multiple effects which may be unique to each advertiser, the most salient hypothesis is that these are caused by outliers. There is a rich literature on handling outliers in machine learning models. However, many techniques such as "median-of-means" are hard to implement in online systems for two reasons.

(1) Simple techniques such as partitioning the input into smaller buckets do make the resulting median quite robust; however, the accuracy suffers due to the reduced batch size in each partition.
(2) More sophisticated techniques such as [9, 18] are not very efficient to implement at scale in a distributed manner, and they make it difficult to estimate the median in an online fashion.

Figure 2: Distribution of normalized label.

Instead we leverage a paradigm from statistics commonly referred to as the winsorized mean, which is known to perform well in the presence of outliers (see [5, 13]). To compute the mean of a random variable, we first winsorize the sample set, i.e. replace the extremes of the sample with truncated values. Observe that this is quite difficult to do with the extremes of the raw distribution directly, as each advertiser might have a different extreme due to the large range in their values. By normalizing the mean of each advertiser, we can truncate the relative value (the ratio of the label to the mean) for all advertisers simultaneously.

Simply predicting a truncated label can result in a very poor Poisson log loss. This is because some advertisers may indeed have a higher normalized label and may not be outliers. However, it is not easy to distinguish these advertisers from those for whom large labels are outliers. Instead, we take a two-pronged approach.

(1) We create a separate objective (by introducing a new head) that is trained on the winsorized relative label.
(2) We down-weight the objective for the un-truncated value, which is the one used for prediction.
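As a rough illustration of Sections 4 and 5.1 taken together, the sketch below keeps a running per-advertiser mean of positive labels, normalizes a label by it, winsorizes the normalized label, and combines two Poisson losses: a down-weighted one on the un-truncated label (the serving head) and an auxiliary one on the capped label. The cap `CAP`, the weight `MAIN_WEIGHT`, the default normalizer for unseen advertisers, and all class and function names are assumptions made for illustration; the paper does not specify them.

```python
import math
from collections import defaultdict


class RunningPositiveMean:
    """Running estimate of eta_A = E[label | label > 0, advertiser]."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def update(self, advertiser, label):
        # Only positive labels enter the normalization constant.
        if label > 0:
            self._sum[advertiser] += label
            self._count[advertiser] += 1

    def eta(self, advertiser, default=1.0):
        # `default` for unseen advertisers is a hypothetical choice.
        n = self._count[advertiser]
        return self._sum[advertiser] / n if n else default


def poisson_npll(label, theta):
    """Negative of eq. (1): exp(theta) - label * theta."""
    return math.exp(theta) - label * theta


CAP = 10.0         # hypothetical winsorization threshold on the normalized label
MAIN_WEIGHT = 0.1  # hypothetical down-weight for the un-truncated (serving) head


def two_head_loss(label, eta, theta_main, theta_capped):
    """Down-weighted loss on the normalized label plus loss on its capped version."""
    normalized = label / eta
    capped = min(normalized, CAP)
    return (MAIN_WEIGHT * poisson_npll(normalized, theta_main)
            + poisson_npll(capped, theta_capped))


def final_prediction(eta, theta_main):
    """Serving-time prediction: normalization constant times the model output."""
    return eta * math.exp(theta_main)
```

Only the head that sees the un-truncated label feeds `final_prediction`; the capped head exists purely as an auxiliary training signal, mirroring the two-pronged approach above.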
The success of the above approach can be interpreted in two different ways. First, having a capped label allocates the model capacity more evenly. Secondly, the winsorized mean objective can be viewed as a regularizer that forces the model to pick an equilibrium that better allocates model capacity across all advertisers.

5.2 Handling zero inflation
One common problem faced in modelling advertiser-reported values is that most of the labels are zero. This problem arises in many real-world datasets. One technique used to handle it is to model the distribution with a "Zero-Inflated Model". Such a model assumes that the observed phenomenon arises as the composition of two separate processes: the first determines whether the value is zero, and the second arises from some natural probabilistic process, in our case a Poisson model.

Perhaps the most natural way to capture this in a machine learning model is to split the label-generating process as $E[\ell] = \Pr[\ell > 0] \cdot E[\ell \mid \ell > 0]$. Surprisingly, the resulting model has poorer performance. We find that it is better to have the model predict the value directly and have a separate head predict $\Pr[\ell > 0]$. This is due to several compounding effects. The first is that it is inherently harder to optimize the product of two predictions, as both have to be accurate. The second is that the model can allocate its own capacity between these objectives as it sees fit, rather than being artificially forced to weight both objectives equally. Lastly, the additional head can serve as a regularizer for the serving objective.

Remark 1. Note that the splitting of the label (as mentioned above) is also compatible with the other ideas in our paper. For example, we can easily incorporate label normalization by predicting $E[\ell] = \Pr[\ell > 0 \mid X] \cdot E\left[\frac{\ell}{\eta_A} \mid \ell > 0, X\right] \cdot \eta_A$, where $\eta_A = E[\ell \mid A, \ell > 0]$.

Figure 3: A pictorial representation of DNN along with all the ideas in this paper.

As we show in our experiments, the addition of a new head that predicts this probability results in an improvement in the Poisson log loss (see Table 3).

5.3 Teaching label distribution
A novel aspect that we introduce in this work is to improve the regression task with the use of additional objectives which recast the problem as a classification task. In particular, we believe that this provides an additional regularization technique that helps to sort out the outliers and focus the model's capacity on more useful regions of the latent space. While training, most models aim to predict just the mean of the distribution. However, one can make the model learn various aspects of the distribution. We show that learning the quantiles of the distribution can help the overall accuracy of the mean prediction.

To do this, we add an additional objective that asks the model to predict the quantile of the normalized label. The number of quantiles to predict is a hyperparameter that can be tuned. We find that the addition of these heads improves the model performance, especially as the model size increases. In particular, a larger model with these additional heads outperforms a larger model without them.

One reason why this improves performance is that the additional head is further able to distinguish between extreme outliers. Secondly, deep models are very good at classification tasks, as evidenced by their performance on a number of such tasks (e.g. [19]). By breaking the regression problem into smaller classification tasks, we can leverage the improved performance in classification.

6 EXPERIMENTS
We have performed both live experiments and experiments on offline historical data. We report only the latter due to the proprietary nature of live experiments.

Data-set. We train the models on data from a commercial search engine's logs. Each example is a click, and the label is the total value of conversions for that click as reported by the advertisers. Like [7], we assume last-click attribution, where conversions are attributed to the most recent click. We train on XX-Billion training examples.

Features and Model. Similar to most machine learning models in advertising, we use several categorical features. In particular, we embed these features, concatenate the embeddings, and then pass the result through several layers of a fully connected feed-forward deep neural network.

Optimizer and Training Model. For each of the objectives suggested, we append a final layer that is attached to the appropriate objective. For the classification objective, we use a simple softmax function. For predicting $\Pr[\ell > 0]$, we use a sigmoid loss function.

All models use the AdaGrad optimizer [11] with the same hyper-parameters (including the same learning rate). With the exception of the down-weighting on the final objective, all other objectives are weighted uniformly. Lastly, these models are trained in an online fashion [25]. In online training you start training on the oldest examples and then train on examples in the order of time in which they occurred. This allows the models to continuously capture any drifts in the distribution, in either the mean or the actual value reported by the advertisers. Online training is quite standard in the industry and is widely used to train a wide variety of machine learning models. Note that in online training, each example is evaluated before it is trained on and the loss is noted. As a result, there is no need for a separate test set to evaluate the models.
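The head and loss setup described above can be sketched as follows: a per-example training loss assembled from the serving value head, the winsorized head, the $\Pr[\ell > 0]$ head, and the quantile head. This is only an illustration under stated assumptions. The bucket boundaries in `QUANTILE_EDGES`, the cap, the 0.1 down-weight, the reading of "quantile" as a bucket index, and every name below are hypothetical; the paper only states that the serving objective is down-weighted and the remaining objectives are weighted uniformly.

```python
import math
from bisect import bisect_left

CAP = 10.0                             # hypothetical winsorization threshold
MAIN_WEIGHT = 0.1                      # hypothetical down-weight for the serving head
QUANTILE_EDGES = [0.0, 0.5, 1.0, 2.0]  # hypothetical boundaries, giving 5 buckets


def poisson_npll(label, theta):
    """Negative Poisson log likelihood: exp(theta) - label * theta."""
    return math.exp(theta) - label * theta


def sigmoid_loss(is_positive, logit):
    """Binary cross-entropy on a logit, written in a numerically stable form."""
    z = logit if is_positive else -logit
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def softmax_loss(target_index, logits):
    """Cross-entropy of a softmax over `logits` against one target class."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_norm - logits[target_index]


def quantile_bucket(normalized_label):
    """Bucket index: {0}, (0, 0.5], (0.5, 1], (1, 2], (2, inf)."""
    return bisect_left(QUANTILE_EDGES, normalized_label)


def multi_head_loss(normalized_label, heads):
    """`heads` holds 'value' and 'capped' (log-rates), 'positive' (logit),
    and 'quantile' (list of len(QUANTILE_EDGES) + 1 logits)."""
    capped = min(normalized_label, CAP)
    return (MAIN_WEIGHT * poisson_npll(normalized_label, heads["value"])
            + poisson_npll(capped, heads["capped"])
            + sigmoid_loss(normalized_label > 0, heads["positive"])
            + softmax_loss(quantile_bucket(normalized_label), heads["quantile"]))
```

At serving time only `heads["value"]` would be used (multiplied back by the advertiser's normalization constant); the other three terms act as training-only auxiliary objectives.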
Evaluation and metrics. We primarily consider negative Poisson log likelihood (NPLL) as the evaluation metric. This is the same metric that is used in the loss function. Typically, a machine learning model trained in the batch setting is evaluated either on a hold-out set or via cross validation. But in the case of online training we can instead evaluate the model on each example before training on that example. That way we get exactly the performance that we would get at serving time. We use this form of evaluation in reporting all our metrics. While we train over several months of data and show plots over the complete period, we report aggregate numbers in the tables over the final 3 months.

6.1 Models Trained
We considered the following models and evaluate NPLL over a period of X months.

• Baseline Model (BN): We start with a baseline model that is a simple model trained on the normalized label. We also have a simple counting model that computes a running average of the labels seen for each advertiser, which gives the normalization constant, i.e. $\eta_A = E[\ell \mid A, \ell > 0]$.
• Simple (S): A model that contains no additional heads and directly tries to predict the final label. We re-weight each example by the same number for all events. This ensures that the learning rate is comparable across the baseline model and the simple model (due to the difference in global average label).
• Median of Means (MM): We take 3 copies of the baseline model, with each copy training on a third of the data. We output the median prediction of the three towers. Observe that this model is 3 times more expensive, as it contains 3 copies of the baseline model.
• Plain Model with Weighted Training (PW): A model where each event is weighted by $1/\eta_A$ instead of normalizing the label.
• Full Model (F): A model with label normalization and all improvements from learning the distribution. These include an additional head to predict the winsorized label, an additional head to predict $\Pr[\ell > 0]$, and a softmax head to predict the quantile of the normalized label.
• Full model without $\ell > 0$ head (FP): A model with the same configuration as F except for the head predicting the probability $\Pr[\ell > 0]$.
• Full model without Softmax Head (FS): A model with the same configuration as F except for the softmax head which predicts the quantile.
• Full model without Winsorized Label Head (FW1): A model with the same configuration as F except that the capped head is removed and the head predicting the original value is weighted normally.
• Full model without Winsorized Label Head (FW2): A model with the same configuration as F except that the capped head is removed and the head predicting the original value is down-weighted by 10x.
• Zero Inflated Full Model (ZI): A model with the same configuration as F except for the following change. We split the expected normalized label into a product of two terms: $E[\ell_n] = \Pr[\ell_n > 0] \cdot E[\ell_n \mid \ell_n > 0]$. We have two heads, one predicting each component. The first component uses a sigmoid loss function and the second one uses a Poisson log loss.

6.2 Results For Label Normalization
In this subsection, we compare the models S, BN and PW. We begin by noting that PW doesn't even train and has severe numerical issues. The reason is that the gradient for this formulation is $(\ell - e^{\theta})/E[\ell \mid A, \ell > 0]$, and while $\ell/E[\ell \mid A, \ell > 0]$ is numerically stable, we find that $e^{\theta}/E[\ell \mid A, \ell > 0]$ cannot be numerically stabilized. This quantity explodes when we consider an advertiser for whom $E[\ell \mid A, \ell > 0]$ is very small.

Models    Relative NPLL for         Relative NPLL for
          un-normalized labels      normalized labels
S         0.0%                      0.0%
BN        +1.53%                    -38.02%
Table 1: Poisson log likelihood for label changes

Increasing value of bucketized     S, un-normalized   BN, un-normalized   S, normalized   BN, normalized
$E[\ell \mid A, \ell > 0]$         label              label               label           label
Bucket0                            2.36               1.04                17.94           0.99
Bucket1                            1.29               0.85                1.75            0.99
Bucket2                            1.03               1.0                 1.03            1.0
Bucket3                            1.02               1.0                 1.02            1.0
Bucket4                            1.01               1.01                1.01            1.0
Bucket5                            1.03               1.10                1.11            1.04
Bucket6                            1.0                1.04                1.01            1.01
Table 2: Avg Prediction/Avg Label

Now we compare S and BN. As we can see from Table 1, S does better on NPLL with respect to the un-normalized label, since it trains directly on un-normalized labels. But when we consider the results with respect to normalized labels, it does terribly. To further showcase the issue, we look at the bias of both models on data sliced by bucketized values of the average per-advertiser label, i.e. $E[\ell \mid A, \ell > 0]$. In Table 2 we see that S overpredicts severely for smaller values of $E[\ell \mid A, \ell > 0]$. This is true even though we have the advertiser as a feature in the model.

6.3 Learning Properties of the Distribution
In this section, we evaluate the performance of each of the improvements that we added. Figure 4 shows how the relative accuracy of each model with respect to BN changes over time. To measure the total accuracy improvement from all the changes, we compare the baseline model BN to the full model F and see an overall improvement of 0.87% in NPLL, which is significant.

The next step in the evaluation is to show that each of the improvements we added is necessary. To do so we run an ablation, removing one change at a time, and compare F with the models FP, FS, FW1 and FW2. We can see in Table 3 that the model is indeed worse if we exclude any of the improvements that we add.
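The comparisons in this section rely on the evaluate-then-train protocol described under "Evaluation and metrics". A minimal sketch of that protocol, together with one plausible way of turning two NPLL values into a relative number, is given below. The toy single-parameter model, its learning rate, and the definition of relative NPLL as a percentage change against the baseline are assumptions for illustration only; the paper does not spell out how its relative NPLL is computed.

```python
import math


def poisson_npll(label, theta):
    """Negative Poisson log likelihood: exp(theta) - label * theta."""
    return math.exp(theta) - label * theta


class ConstantRateModel:
    """Hypothetical toy model: one log-rate parameter trained by plain SGD."""

    def __init__(self, learning_rate=0.05):
        self.theta = 0.0
        self.learning_rate = learning_rate

    def predict_theta(self, features):
        return self.theta

    def train(self, features, label):
        # Gradient of the NPLL with respect to theta is exp(theta) - label.
        self.theta -= self.learning_rate * (math.exp(self.theta) - label)


def progressive_npll(model, stream):
    """Average NPLL where each example is scored before it is trained on."""
    total, count = 0.0, 0
    for features, label in stream:
        total += poisson_npll(label, model.predict_theta(features))  # evaluate
        model.train(features, label)                                 # then train
        count += 1
    return total / count


def relative_npll_percent(model_npll, baseline_npll):
    """Percentage change versus the baseline; negative means lower (better) NPLL."""
    return 100.0 * (model_npll - baseline_npll) / abs(baseline_npll)
```

Because every example is scored with the parameters that existed before the model saw it, the running average plays the role of a held-out metric without a separate test set.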
We further show why our results are better than what is in the [9] Ilias Diakonikolas, Daniel M Kane, and Ankit Pensia. 2020. Outlier robust mean
literature. In table 3 we see that MM is neutral with respect to BN estimation with subgaussian rates via stability. arXiv preprint arXiv:2007.15618
(2020).
showing that median of means doesn’t seem to solve the outlier [10] Eustache Diemert, Julien Meynet, Pierre Galland, and Damien Lefortier. 2017.
problem. In addition, we see that splitting the prediction as 𝑃𝑟 (ℓ𝑛 > Attribution Modeling Increases Efficiency of Bidding in Display Advertising. In
ADKDD. ACM, 2:1–2:6. https://doi.org/10.1145/3124749.3124752
0)𝐸 [ℓ |ℓ𝑛 > 0] in ZI is stricly worse than directly predicting 𝐸 [ℓ𝑛 ] [11] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods
and adding an additional head for 𝑃𝑟 (ℓ𝑛 > 0) in F. for online learning and stochastic optimization. Journal of machine learning
On top of the above aggregate analysis we also look at metrics research 12, 7 (2011).
[12] Benjamin Edelman, Michael Ostrovsky, and Michael Schwarz. 2007. Internet
on an interesting slice of the dataset. More specifically, we look at Advertising and the Generalized Second-Price Auction: Selling Billions of Dollars
examples which belong to advertisers who have more than 2% of Worth of Keywords. American Economic Review 97, 1 (March 2007), 242–259.
positive labels which are winsorized. We see that our models tend to https://doi.org/10.1257/aer.97.1.242
[13] Wayne A Fuller. 1991. Simple estimators for the mean of skewed populations.
do better on these set of advertisers. Statistica Sinica (1991), 137–158.

Models   Relative NPLL   Relative NPLL for advertisers with
                         > 2% winsorized positive labels
BN        0.0%            0.0%
MM       +0.04%           0.12%
ZI       -0.47%          -0.75%
F        -0.87%          -1.23%
FP       -0.67%          -1.19%
FS       -0.79%          -1.07%
FW1      -0.50%          -0.81%
FW2      -0.50%          -0.33%

Table 3: NPLL improvements of various model variants with respect to
normalized label
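One plausible way to produce relative numbers like those in Table 3 is to compare a model's mean Poisson log loss with that of a reference model. The sketch below is an assumption, not the paper's definition: the exact normalization behind NPLL is not reproduced here, the function names are hypothetical, predictions must be strictly positive, and the baseline loss is assumed to be positive.

```python
import math


def poisson_log_loss(labels, predictions):
    """Mean Poisson negative log likelihood of labels under predicted means.

    lgamma(label + 1) generalizes log(label!) so that non-integer labels work.
    """
    total = 0.0
    for label, pred in zip(labels, predictions):
        total += pred - label * math.log(pred) + math.lgamma(label + 1)
    return total / len(labels)


def relative_change_percent(model_loss, baseline_loss):
    """Percent change in loss relative to the baseline; negative means lower loss."""
    return 100.0 * (model_loss - baseline_loss) / baseline_loss


# Toy usage: a baseline that predicts a constant versus a sharper model.
labels = [0, 0, 2, 0]
baseline = poisson_log_loss(labels, [0.5, 0.5, 0.5, 0.5])
model = poisson_log_loss(labels, [0.2, 0.2, 1.5, 0.2])
change = relative_change_percent(model, baseline)
```

If BN is the reference row, as its 0.0% entries suggest, a negative entry under this reading means a lower loss than BN.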

Figure 4: Comparison of relative Poisson log likelihood improvement
over time.
REFERENCES and Padhraic Smyth (Eds.). ACM, 141–149. https://doi.org/10.1145/2020408.
[1] [n.d.]. About Target ROAS bidding. https://support.google.com/google-ads/ 2020436
answer/6268637?hl=en [27] Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predict-
[2] Pravin K. Trivedi Adrian Colin Cameron. 1998. Regression analysis of count data. ing clicks: estimating the click-through rate for new ads. In WWW, Carey L.
Cambridge University Press. Williamson, Mary Ellen Zurko, Peter F. Patel-Schneider, and Prashant J. Shenoy
[3] Deepak Agarwal, Rahul Agrawal, Rajiv Khanna, and Nagaraj Kota. 2010. Es- (Eds.). ACM, 521–530. https://doi.org/10.1145/1242572.1242643
timating Rates of Rare Events with Multiple Hierarchies Through Scalable [28] Romer Rosales, Haibin Cheng, and Eren Manavoglu. 2012. Post-Click Conversion
Log-linear Models (KDD ’10). ACM, New York, NY, USA, 213–222. https: Modeling and Analysis for Non-Guaranteed Delivery Display Advertising. (2012),
//doi.org/10.1145/1835804.1835834 293–302.
[4] Gagan Aggarwal, Ashish Goel, and Rajeev Motwani. 2006. Truthful auctions [29] Leonard J. Savage. 1971. Elicitation of Personal Probabilities and Expectations. J.
for pricing search keywords. In Proceedings 7th ACM Conference on Elec- Amer. Statist. Assoc. 66, 336 (1971), 783–801.
tronic Commerce (EC-2006), Ann Arbor, Michigan, USA, June 11-15, 2006, [30] Hal R. Varian. 2007. Position auctions. International Journal of Industrial
Joan Feigenbaum, John C.-I. Chuang, and David M. Pennock (Eds.). ACM, 1–7. Organization 25, 6 (2007), 1163 – 1178. https://doi.org/10.1016/j.ijindorg.2006.
https://doi.org/10.1145/1134707.1134708 10.002
[5] N Balakrishnan and N Kannan. 2003. Variance of a Winsorized mean when [31] Flavian Vasile, Damien Lefortier, and Olivier Chapelle. 2017. Cost-sensitive
the sample contains multiple outliers. Communications in Statistics-Theory and Learning for Utility Optimization in Online Advertising Auctions. In ADKDD’17.
Methods 32, 1 (2003), 139–149. 8:1–8:6. https://doi.org/10.1145/3124749.3124751
[6] G. W. Brier. 1950. VERIFICATION OF FORECASTS EXPRESSED IN TERMS [32] Robert L. Winkler. 1969. Scoring Rules and the Evaluation of Probability Asses-
OF PROBABILITY. Monthly Weather Review 78 (1950), 1–3. sors. J. Amer. Statist. Assoc. 64, 327 (1969), 1073–1078.
[7] Olivier Chapelle. 2014. Modeling delayed feedback in display advertising. 1097– [33] M Xie, B He, and TN Goh. 2001. Zero-inflated Poisson model in statistical process
1105. https://doi.org/10.1145/2623330.2623634 control. Computational statistics & data analysis 38, 2 (2001), 191–201.
[8] Olivier Chapelle. 2015. Offline Evaluation of Response Prediction in Online Ad- [34] Zeyuan Allen Zhu, Weizhu Chen, Tom Minka, Chenguang Zhu, and Zheng Chen.
vertising Auctions. In WWW (WWW ’15 Companion). Association for Computing 2010. A novel click model and its applications to online advertising. In WSDM,
Machinery, New York, NY, USA, 919–922. https://doi.org/10.1145/2740908. Brian D. Davison, Torsten Suel, Nick Craswell, and Bing Liu (Eds.). ACM, 321–
2742566 330. https://doi.org/10.1145/1718487.1718528
6

You might also like