Using The Mean Absolute Percentage Error For Regression Models
Arnaud de Myttenaere (1,2), Boris Golden (1), Bénédicte Le Grand (3) & Fabrice Rossi (2)
1 - Viadeo
30 rue de la Victoire, 75009 Paris - France
2 - Université Paris 1 Panthéon - Sorbonne - SAMM EA 4534
90 rue de Tolbiac, 75013 Paris - France
3 - Université Paris 1 Panthéon - Sorbonne - Centre de Recherche en Informatique
90 rue de Tolbiac, 75013 Paris - France
1 Introduction
We study in this paper the classical regression setting in which we assume given
a random pair Z = (X, Y ) with values in X × R, where X is a metric space.
The goal is to learn a mapping g from X to R such that g(X) ≃ Y . To judge
the quality of the regression model g, we need a quality measure. While the
traditional measure is the quadratic error, in some applications, a more useful
measure of the quality of the predictions made by a regression model is given by
the mean absolute percentage error (MAPE). For a target y and a prediction p,
the MAPE is
\[
l_{\mathrm{MAPE}}(p, y) = \frac{|p - y|}{|y|},
\]
with the conventions that for all a ≠ 0, a/0 = ∞ and that 0/0 = 1. The MAPE-risk
of g is then L_MAPE(g) = E(l_MAPE(g(X), Y)).
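For illustration (ours, not part of the paper), a minimal Python sketch of the pointwise MAPE loss, encoding the conventions a/0 = ∞ and 0/0 = 1, could look as follows; the function name is ours.

```python
import numpy as np

def mape_loss(p, y):
    """Pointwise MAPE loss |p - y| / |y|, with a/0 = inf (a != 0) and 0/0 = 1."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    num = np.abs(p - y)
    den = np.abs(y)
    out = np.empty_like(num)
    zero = den == 0
    out[~zero] = num[~zero] / den[~zero]
    out[zero] = np.where(num[zero] == 0.0, 1.0, np.inf)  # conventions: 0/0 = 1, a/0 = inf
    return out
```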
We are interested in the consequences of choosing the best regression model
according to the MAPE as opposed to the Mean Absolute Error (MAE) or the
Mean Square Error (MSE), both from a practical point of view and from a theoretical
one. From a practical point of view, it seems obvious that if g is chosen so as
to minimize L_MAPE(g), it will perform better according to the MAPE than a
model selected in order to minimize L_MSE (and worse according to the MSE).
The practical issue is rather to determine how to perform this optimization: this
is studied in Section 3. From a theoretical point of view, it is well known (see e.g.
[4]) that consistent learning schemes can be obtained by adapting the complexity
of the model class to the data size. As the complexity of a class of models
depends in part on the loss function, using the MAPE instead of e.g. the
MSE has some implications that are investigated in this paper, in Section 4. The
following section introduces the material common to both parts of the analysis.
2 General setting
We use a classical statistical learning setting as in e.g. [4]. We assume given
N independently distributed copies of Z, the training set, D = (Z_i)_{1 ≤ i ≤ N} =
(X_i, Y_i)_{1 ≤ i ≤ N}. Given a loss function l from R² to R⁺ ∪ {∞}, we define the risk
of a predictor g, a (measurable) function from X to R, as the expected loss, that
is L_l(g) = E(l(g(X), Y)). The empirical risk is the empirical mean of the loss
computed on the training set, that is:
\[
\widehat{L}_l(g)_N = \frac{1}{N}\sum_{i=1}^{N} l(g(X_i), Y_i). \tag{1}
\]
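As a concrete illustration of equation (1) (again ours, not the paper's), the empirical risk can be written with a pluggable loss; mape_loss is the helper sketched above, the other names are hypothetical.

```python
import numpy as np

def empirical_risk(loss, g, X, Y):
    """Empirical risk (1): the mean of loss(g(X_i), Y_i) over the training set D."""
    predictions = np.asarray([g(x) for x in X], dtype=float)
    return float(np.mean(loss(predictions, np.asarray(Y, dtype=float))))

# Other standard losses, for comparison with the MAPE.
mae_loss = lambda p, y: np.abs(p - y)
mse_loss = lambda p, y: (p - y) ** 2
```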
3 Practical issues
3.1 Optimization
From a practical point of view, the problem is to minimize \widehat{L}_{MAPE}(g)_N over a
class of models G_N, that is to solve
\[
\widehat{g}_{MAPE,N} = \arg\min_{g \in G_N} \frac{1}{N}\sum_{i=1}^{N} \frac{|g(X_i) - Y_i|}{|Y_i|}.
\]
Footnotes:
1. … regularization term. That would not modify the key point, which is the use of the MAPE.
2. This is the case of the quantreg R package [5], among others.
3. Notice that the goal here is to verify the effects of optimizing with respect to different
types of loss functions, not to claim that one loss function is better than another, something
that would be meaningless. We therefore report the empirical risk, knowing that it is an
underestimation of the real risk for all loss functions.
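Concretely, since the empirical MAPE of a linear model is a weighted absolute error with weights 1/|Y_i|, it can be minimized either by rescaling the data and calling a quantile (median) regression routine such as the quantreg package mentioned in footnote 2, or by handing the objective to a generic optimizer. The sketch below takes the second, simpler route; it is our own illustration, and the choice of a linear model class and of the Nelder-Mead optimizer are assumptions, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def fit_linear_mape(X, y):
    """Fit a linear model (with intercept) by minimizing the empirical MAPE.

    X: (N, d) design matrix; y: (N,) targets assumed bounded away from 0.
    Returns the estimated coefficient vector, intercept last.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append an intercept column

    def empirical_mape(beta):
        return float(np.mean(np.abs(Xb @ beta - y) / np.abs(y)))

    # Nelder-Mead copes with the non-smooth, piecewise-linear objective.
    res = minimize(empirical_mape, np.zeros(Xb.shape[1]), method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-10, "maxiter": 20000})
    return res.x
```

Rescaling each row of X and each Y_i by 1/|Y_i| turns the same objective into an ordinary least-absolute-deviations (median regression) problem, which is why quantile regression solvers apply as well.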
The results are summarized in Table 1. As expected, optimizing for a particular loss function leads to the best
empirical model as measured via the same risk (or a related one). In practice, this
allowed one of us to win a recent datascience.net challenge about electricity
consumption prediction, which used the MAPE as the evaluation metric.
This can be done via covering numbers or via the Vapnik-Chervonenkis dimension
(VC-dim) of certain classes of functions derived from G_N. One might think
that general results about arbitrary loss functions can be used to handle the case
of the MAPE. This is not the case, as those results generally assume a uniform
Lipschitz property of l (see Lemma 17.6 in [1], for instance) that is not fulfilled
by the MAPE.
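To make the obstruction explicit (a remark of ours, not in the paper): as a function of the prediction p, the MAPE loss is Lipschitz only with a constant that depends on the target y. Indeed, for a fixed y ≠ 0 and any two predictions p_1, p_2,
\[
|l_{\mathrm{MAPE}}(p_1, y) - l_{\mathrm{MAPE}}(p_2, y)|
= \frac{\bigl|\,|p_1 - y| - |p_2 - y|\,\bigr|}{|y|}
\le \frac{|p_1 - p_2|}{|y|},
\]
and the constant 1/|y| is attained, so no Lipschitz constant uniform over targets y close to 0 exists.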
The supremum ǫ-covering number of a class F, N_∞(ǫ, F), is the size of the smallest
supremum ǫ-cover of F. If such a cover does not exist, the covering number is
∞. While controlling supremum covering numbers of H(G_N, l) leads easily to
consistency via a uniform law of large numbers (see e.g. Lemma 9.1 in [4]), they
cannot be used with the MAPE without additional assumptions. Indeed, let h_1
and h_2 be two functions from H_{N,MAPE}, generated by g_1 and g_2 in G_N. Then
\[
\|h_1 - h_2\|_\infty = \sup_{(x,y) \in \mathcal{X} \times \mathbb{R}} \frac{\bigl|\,|g_1(x) - y| - |g_2(x) - y|\,\bigr|}{|y|}.
\]
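Without restricting y to a region bounded away from 0 (e.g. |y| ≥ λ), this supremum is typically infinite; as an explicit example of ours, if |g_1(x_0)| ≠ |g_2(x_0)| for some x_0, then letting y → 0 at that point gives
\[
\frac{\bigl|\,|g_1(x_0) - y| - |g_2(x_0) - y|\,\bigr|}{|y|} \longrightarrow \infty,
\]
since the numerator tends to the nonzero constant ||g_1(x_0)| − |g_2(x_0)|| while the denominator vanishes, so no finite supremum ǫ-cover of H_{N,MAPE} can exist.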
Instead, one can rely on data-dependent covering numbers, defined from the empirical L_p norm
\[
\|f_1 - f_2\|_{p,D} = \left( \frac{1}{N} \sum_{i=1}^{N} |f_1(Z_i) - f_2(Z_i)|^p \right)^{\frac{1}{p}},
\]
and derive from this the associated notion of ǫ-cover and of covering number.
It is then easy to show that
The expressions of the two bounds above show that B_Y and λ play similar
roles in the exponential decrease of the right-hand side. Loosening the
condition on Y (i.e., taking a large B_Y or a small λ) slows down the exponential
decrease.
It might seem from the results on the covering numbers that the MAPE
suffers more from the bound needed on Y than e.g. the MAE. This is not the
case, as boundedness hypotheses on F are also needed to get finite covering numbers
(see the following section for an example). We can thus consider that the lower
bound on |Y| plays an equivalent role for the MAPE to the one played by the
upper bound on |Y| for the MAE/MSE.
4.4 VC-dimension
A convenient way to bound covering numbers is to use the VC-dimension. Interestingly,
replacing the MAE by the MAPE cannot increase the VC-dim of the
relevant class of functions.
Let us indeed consider a set of k points shattered by H^+(G_N, l_MAPE), (v_1, ..., v_k),
with v_j = (x_j, y_j, t_j). Then for each θ ∈ {0, 1}^k, there is h_θ ∈ H_{N,MAPE} such
that ∀j, I_{t ≤ h_θ(x,y)}(x_j, y_j, t_j) = θ_j. Each h_θ corresponds to a g_θ ∈ G_N, and
t ≤ h_θ(x, y) ⇔ t ≤ |g_θ(x) − y| / |y|. Then the set of k points defined by w_j = (z_j, |y_j| t_j)
is shattered by H^+(G_N, l_MAE), because the h′_θ associated in H_{N,MAE} to the g_θ
are such that ∀j, I_{t ≤ h′_θ(x,y)}(x_j, y_j, |y_j| t_j) = θ_j. Therefore
\[
\mathrm{VCdim}(H^+(G_N, l_{\mathrm{MAPE}})) \le \mathrm{VCdim}(H^+(G_N, l_{\mathrm{MAE}})).
\]
Using Theorem 9.4 from [4], we can bound the L_p covering number with a VC-dim
based value. If V_l = VCdim(H^+(G_N, l)) ≥ 2, p ≥ 1, and 0 < ǫ < B_{N,l}/4,
then
\[
N_p(\epsilon, H(G_N, l), D) \le 3 \left( \frac{2 e B_{N,l}^p}{\epsilon^p} \log \frac{3 e B_{N,l}^p}{\epsilon^p} \right)^{V_l}. \tag{4}
\]
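As an illustration (ours, not from the paper), the bound of equation (4) is straightforward to evaluate numerically for given V_l, B_{N,l}, p and ǫ; the helper below is a direct transcription, and the example values are hypothetical.

```python
import math

def covering_number_bound(eps, V, B, p=1):
    """Right-hand side of equation (4): 3 * (2e B^p / eps^p * log(3e B^p / eps^p))^V.

    Valid for V >= 2, p >= 1 and 0 < eps < B / 4.
    """
    if not (V >= 2 and p >= 1 and 0 < eps < B / 4):
        raise ValueError("bound (4) requires V >= 2, p >= 1 and 0 < eps < B/4")
    ratio = (B ** p) / (eps ** p)
    return 3 * (2 * math.e * ratio * math.log(3 * math.e * ratio)) ** V

# Hypothetical values: V_l = 5, B_{N,l} = 10, eps = 0.5, p = 1.
print(covering_number_bound(0.5, V=5, B=10.0, p=1))
```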
When this bound is plugged into equation (3), it shows the symmetry between
B_Y and λ, as both appear in the relevant B_{N,l}.
4.5 Consistency
Mimicking Theorem 10.1 from [4], we can prove a generic consistency result for
MAPE ERM learning. Assume given a series of classes of models, (G_n)_{n ≥ 1}, such
that ∪_{n ≥ 1} G_n is dense in the set of measurable functions from R^p to R according
to the L¹(µ) metric for any probability measure µ. Assume in addition that each
G_n leads to a finite VC-dim V_n = VCdim(H^+(G_n, l_MAPE)) and that each G_n is
uniformly bounded by B_{G_n}. Notice that those two conditions are compatible
with the density condition only if lim_{n→∞} V_n = ∞ and lim_{n→∞} B_{G_n} = ∞.
Assume finally that (X, Y) is such that |Y| ≥ λ (almost surely) and that
\[
\lim_{n\to\infty} \frac{V_n B_{G_n}^2 \log B_{G_n}}{n} = 0;
\]
then L_MAPE(\widehat{g}_{l_MAPE,n}) converges almost surely to
L*_MAPE, which shows the consistency of the ERM estimator for the MAPE.
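For instance (our own illustrative check, not a choice made in the paper), classes with VC-dim V_n of order √n and uniform bound B_{G_n} of order log n satisfy this growth condition, since
\[
\frac{V_n B_{G_n}^2 \log B_{G_n}}{n} \sim \frac{\sqrt{n}\,(\log n)^2 \log\log n}{n} = \frac{(\log n)^2 \log\log n}{\sqrt{n}} \longrightarrow 0,
\]
while still letting V_n → ∞ and B_{G_n} → ∞ as required by the density condition.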
The proof is based on the classical technique of exponential bounding. Plugging
equation (4) into equation (3) gives a bound on the deviation between the
empirical mean and the expectation of
\[
K(n, \epsilon) = 24 \left( \frac{16 e B_n}{\epsilon} \log \frac{24 e B_n}{\epsilon} \right)^{V_n} e^{-\frac{n \epsilon^2}{128 B_n^2}},
\]
with B_n = 1 + B_{G_n}/λ. Then it is easy to check that the conditions above guarantee
that Σ_{n ≥ 1} K(n, ǫ) < ∞ for all ǫ > 0. This is sufficient to show almost sure
convergence of L_MAPE(\widehat{g}_{l_MAPE,n}) − L*_{MAPE,G_n} to 0. The conclusion follows
from the density hypothesis.
5 Conclusion
We have shown that learning under the Mean Absolute Percentage Error is
feasible both from a practical point of view and from a theoretical one. In application
contexts where this error measure is suitable (in general, when the target variable
is positive by design and remains quite far away from zero, e.g. in price prediction
for expensive goods), there is therefore no reason to use the Mean Square Error
(or another measure) as a proxy for the MAPE. An open theoretical question
is whether the symmetry between the upper bound on |Y| for the MSE/MAE and
the lower bound on |Y| for the MAPE is strong enough to allow results such as
Theorem 10.3 in [4], in which a truncated estimator is used to lift the boundedness
hypothesis on |Y|.
References
[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] M. Ezekiel. Methods of Correlation Analysis. Wiley, 1930.
[4] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, New York, 2002.
[5] R. Koenker. quantreg: Quantile Regression, 2013. R package version 5.05.