
Minimum Lq -distance estimators for

non-normalized parametric models


S. Betsch, B. Ebner, and B. Klar
Karlsruhe Institute of Technology (KIT), Institute of Stochastics
Karlsruhe, Germany

September 4, 2019

arXiv:1909.00002v1 [math.ST] 30 Aug 2019

Abstract

We propose and investigate a new estimation method for the parameters of models consist-
ing of smooth density functions on the positive half axis. The procedure is based on a recently
introduced characterization result for the respective probability distributions, and is to be
classified as a minimum distance estimator, incorporating as a distance function the Lq -norm.
Throughout, we deal rigorously with issues of existence and measurability of these implicitly
defined estimators. Moreover, we provide consistency results in a common asymptotic setting,
and compare our new method with classical estimators for the exponential-, the Rayleigh-,
and the Burr Type XII distribution in Monte Carlo simulation studies. The procedure fares
extraordinarily well in terms of the bias of the estimators and, in the case of the Burr dis-
tribution, where computational issues occur with the maximum likelihood estimator for very
small sample sizes, the new method has no trouble in computations.

Keywords Burr Type XII distribution, Consistent parameter estimation, Empirical pro-
cesses, Existence and measurability, Measurable selections, Minimum distance estimators,
Rayleigh distribution

1 Introduction

One of the most classical problems in statistics is the estimation of the parameter vector of a
parametrized family of probability distributions. It presents itself in a significant share of applica-
tions because parametric models often contribute a reasonable compromise between flexibility in
the shape of the statistical model and meaningfulness of the conclusions that can be drawn from
the model. As a consequence, all kinds of professions are confronted with the issue of parameter
estimation, be it meteorologists, engineers or biologists. Throughout the last decades, a vast
number of highly focused estimation procedures for all kinds of situations has been provided,
but the procedure that is arguably used most often remains the maximum likelihood estimator.
Apart from its (asymptotic) optimality properties, its popularity is presumably in direct rela-
tion with its universality: For the professions mentioned above, and many more, whose prime
interest is not the study of sophisticated statistical procedures, it is essential to have at hand
a method that is both easily communicated and applicable to a wide range of model assump-
tions. A second class of methods incorporates the idea of using as an estimator the value that
minimizes some goodness-of-fit measure. To implement this type of estimators, the empirical
distribution, quantile or characteristic function is compared to its theoretical counterpart from
the underlying parametric model in a suitable distance, and the term is minimized over the pa-
rameter space, see Wolfowitz (1957), or Parr (1981) for an early bibliography. These procedures
provide some freedom in adapting the estimation method to the intended inferences from the
model and they regularly possess good robustness properties [see Parr and Schucany (1980) as
well as Millar (1981)]. An example which was discussed recently, and which goes by the name
of minimum CRPS estimation, see Gneiting et al. (2005), is tailored to the practice of issuing
forecasts: As argued by Gneiting et al. (2007), a good probabilistic forecast minimizes a (strictly)
proper scoring rule such as the continuous ranked probability score (CRPS) [Gneiting and Raftery (2007)], and after constructing a
suitable model it appears somewhat more natural to use as an estimator the one that minimizes
the scoring rule instead of a classical estimation method like maximum likelihood [for a compari-
son see Gebetsberger et al. (2018)]. As it happens, these rather universal procedures listed above
easily run into computational hardships. Just consider that even for ’basic’ models, density func-
tions can take complicated forms, and distribution or characteristic functions may be nowhere
near an explicit formula. This is where our work ties in. In a recent work, Betsch and Ebner
(2019a) established distributional characterizations that, from a practical point of view, are com-
parable to the characterization of a probability distribution through its distribution function.
Their results, which are given in terms of the derivative of a density function and the density
itself, provide explicit formulae that simplify the dependence of the terms on the parameters
(even for more complicated models), and extend characterizations via the zero-bias- or equilib-
rium transformation [Goldstein and Reinert (1997), Peköz and Röllin (2011), respectively] that

arise in the context of Stein’s method, cf. Chen et al. (2011). The aim of this work is to inves-
tigate these characterizations, which were already used to construct goodness-of-fit tests [see
Betsch and Ebner (2019c), Betsch and Ebner (2019b)], more closely in the context of parameter
estimation. An advantage of the resulting estimators lies in the way the density function of the
underlying model appears in the characterization, and thus also in the estimation method. When
considering for some (positive) density function p the quotient p′/p, the term no longer depends on
the integration constant which ensures that the function integrates to one, but only on the func-
tional form of the density. As indicated before, our estimators depend on the underlying model
precisely via this quotient, so they are applicable in cases where the normalization constant is
unknown. Models of this type occur (though often in discrete settings) in such applied areas as
image modeling [using Markov random fields, see Li (2009)] and machine learning, or in any other
area where models are complex enough to render the calculation of the normalization constant
impractical. For more specific discussions of such applications, we refer to the introduction of
the work by Uehara et al. (2019a). The problem was already addressed by Hyvärinen (2005),
who set out to find an estimation method which only takes into account the functional form of a
density. The approach introduced there goes by the name of ’score matching’, and the estimation
method involves terms of the form p′′/p − (1/2)(p′/p)² and hence does not depend on the normalization
constant either. In the univariate case we discuss here, our method provides a good supplement
as it contains no second derivatives and may thus be applicable to cases where other methods fail.
Also note that several other approaches by Pihlaja et al. (2010), Matsuda and Hyvärinen (2019),
and Uehara et al. (2019b), are available. All these references indicate that statistical inference
for non-normalized models is a topic of very recent research.
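As a small illustration of why only the functional form of the density matters here (our own sketch, not part of the paper): the quotient p′/p is unchanged if p is multiplied by any constant, since the constant drops out when the log-density is differentiated. The Burr-type expression below is purely illustrative.

```python
import sympy as sp

x, c, k = sp.symbols("x c k", positive=True)
# A Burr-type density specified only up to its normalizing constant:
unnormalized = x**(c - 1) * (1 + x**c)**(-k - 1)
# p'(x)/p(x) = d/dx log p(x); any multiplicative constant cancels here.
score = sp.simplify(sp.diff(sp.log(unnormalized), x))
print(score)   # (c - 1)/x - c*(k + 1)*x**(c - 1)/(1 + x**c)
```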

In Section 2 we introduce this new class of parameter estimators that are comparable, in their
universality in the given setting, to the maximum likelihood and minimum Cramér-von Mises
distance estimators [as discussed by Parr and Schucany (1980) or Parr and De Wet (1981)]. We
rigorously deal with the existence and measurability of our estimators in Section 3. In Section 4
we provide results on consistency. Thereafter, we provide as examples the exponential- (Section
5), the Rayleigh- (Section 7), and the Burr Type XII distribution (Section 8). For each of the
three parametric models we compare our new method to classical methods like the maximum
likelihood and minimum Cramér-von Mises distance estimator in competitive Monte Carlo sim-
ulation studies. Moreover, the exponential distribution will turn out to be very revealing insofar
as we can explicitly calculate the estimator, specify the consistency result further, and provide
additional theoretic results to explain observations from the simulations. The Burr distribution
[cf. Burr (1942), Rodriguez (1977), Tadikamalla (1980), Section 6.2 of Kleiber and Kotz (2003),
or Kumar (2017)] as a model is relevant in econometrics, initiated by Singh and Maddala (1976)
[see also Schmittlein (1983)], and other areas like engineering, hydrology, and quality assurance,

see Shah and Gokhale (1993) for corresponding references. However, the parameter estimation is
non-trivial and can even cause computational issues. Thus, providing a new estimation method
could prove useful in applications.

2 The new estimators

To be specific, recall that the problem of parameter estimation for continuous, univariate probabil-
ity distributions presents itself as follows. Consider for Θ ⊂ Rd a parametric family of probability
density functions

PΘ = { pϑ | ϑ ∈ Θ },

and let X1 , . . . , Xn be a sample consisting of independent real-valued random variables with a


distribution from PΘ , that is, there exists some ϑ0 ∈ Θ such that Xi has density function pϑ0
(Xi ∼ pϑ0 , for short) for i = 1, . . . , n. Denote with Pϑ the distribution function corresponding to
pϑ . The task is to construct an estimator of the unknown ϑ0 based on X1 , . . . , Xn . In this work,
we focus on density functions whose support is the positive half axis.
Thus, assume that the support of each density function in PΘ is [0, ∞). In particular, suppose
that each pϑ is positive and continuously differentiable on (0, ∞). Also assume that
$$\int_0^\infty \big|x\,p_\vartheta'(x)\big|\,\mathrm{d}x < \infty \qquad\text{and}\qquad \sup_{x>0}\,\frac{\big|p_\vartheta'(x)\big|\,\min\{P_\vartheta(x),\,1-P_\vartheta(x)\}}{p_\vartheta^2(x)} < \infty.$$

Moreover, suppose that lim_{x↘0} Pϑ(x)/pϑ(x) = 0. These assumptions were made by Betsch and Ebner
(2019a) to derive the characterization we recall in a minute, and are straightforward to check
for most common density functions. Particularly the last condition is exhaustively discussed in
Proposition 3.7 of Döbler (2015). Let X be a positive random variable with distribution function
FX and

$$\mathbb{E}\bigg[\bigg|\frac{p_\vartheta'(X)}{p_\vartheta(X)}\,X\bigg|\bigg] < \infty, \qquad \vartheta \in \Theta, \tag{2.1}$$

and define the function

$$\eta(t, \vartheta) = \mathbb{E}\bigg[-\frac{p_\vartheta'(X)}{p_\vartheta(X)}\,\min\{X, t\} - F_X(t)\bigg]$$

for (t, ϑ) ∈ (0, ∞) × Θ. Then Betsch and Ebner (2019a) have shown in Corollary 5.6 that X has
density function pϑ0 if, and only if, η(t, ϑ0 ) = 0 for every t > 0. Therefore, if we assume initially
that X ∼ pϑ0 [note that (2.1) is satisfied by the requirements on pϑ], then ‖η(·, ϑ)‖_{L^q} = 0 if, and
only if, ϑ = ϑ0 [using the continuity of t ↦ η(t, ϑ)]. Here, L^q = L^q((0, ∞), B(0, ∞), w(t) dt),
1 ≤ q < ∞, denotes the usual L^q-space over (0, ∞), where w is a positive weight function satisfying

$$\int_0^\infty w(t)\,\mathrm{d}t < \infty,$$


and for f ∈ L^q, g ∈ L^{q′} (1/q + 1/q′ = 1),

$$\|f\|_{L^q} = \bigg(\int_0^\infty |f(t)|^q\,w(t)\,\mathrm{d}t\bigg)^{1/q}, \qquad \langle f, g\rangle_{L^q} = \int_0^\infty f(t)\,g(t)\,w(t)\,\mathrm{d}t$$
are the usual norm and duality in Lq . Thus, with an empirical version
$$\eta_n(t, \vartheta) = -\frac{1}{n}\sum_{j=1}^{n}\frac{p_\vartheta'(X_j)}{p_\vartheta(X_j)}\,\min\{X_j, t\} - \frac{1}{n}\sum_{j=1}^{n}\mathbf{1}\{X_j \le t\}, \qquad t > 0, \tag{2.2}$$

of η, based on a sample of independent and identically distributed (i.i.d.) random variables


X1 , . . . , Xn with X1 ∼ pϑ0 , a reasonable estimator for the unknown ϑ0 is
   
$$\hat{\vartheta}_{n,q} = \arg\min\big\{\|\eta_n(\cdot\,, \vartheta)\|_{L^q}\;\big|\;\vartheta \in \Theta\big\} = \arg\min\big\{\|\eta_n(\cdot\,, \vartheta)\|_{L^q}^{q}\;\big|\;\vartheta \in \Theta\big\}, \tag{2.3}$$

that is, we choose ϑ̂n,q such that ‖ηn(·, ϑ̂n,q)‖_{L^q} ≤ ‖ηn(·, ϑ)‖_{L^q} for each ϑ ∈ Θ. Heuristically,
‖ηn(·, ϑ)‖_{L^q} approximates ‖η(·, ϑ)‖_{L^q}, so ϑ̂n,q should provide an estimate for the minimum of
ϑ ↦ ‖η(·, ϑ)‖_{L^q}, which coincides with ϑ0, the (unique) zero of this function. At this point
of course, there arise questions of existence and measurability of such an estimator, and we
will handle these questions in full detail in Section 3. Intuitively, one might argue to replace
FX and the empirical distribution function in the definition of η and ηn , respectively, with the
theoretical distribution function Pϑ . However, there is a bit of a technical point involved, and the
characterizations by Betsch and Ebner (2019a) do not include results that give a rigorous handle
for this slightly (yet decisively) different situation.
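Before turning to these questions, here is a minimal numerical sketch of (2.2) and (2.3) (our own illustration, not the closed-form expressions derived in later sections): ηn is evaluated directly from the sample, |ηn(·, ϑ)|^q is integrated against a weight w numerically, and the result is minimized over ϑ with scipy. The helper names and the default weight w(t) = e^{−t} are our own choices.

```python
import numpy as np
from scipy import integrate, optimize

def eta_n(t, theta, x, score):
    """Empirical eta_n(t, theta) from (2.2); score(x, theta) returns p'_theta(x)/p_theta(x)."""
    return np.mean(-score(x, theta) * np.minimum(x, t)) - np.mean(x <= t)

def psi_q(theta, x, score, weight, q):
    """||eta_n(., theta)||_{L^q}^q, computed by numerical integration over t > 0."""
    f = lambda t: abs(eta_n(t, theta, x, score)) ** q * weight(t)
    value, _ = integrate.quad(f, 0.0, np.inf, limit=200)
    return value

def minimum_lq_estimate(x, score, theta_init, weight=lambda t: np.exp(-t), q=2):
    """Approximate minimizer of theta -> ||eta_n(., theta)||_{L^q}^q, cf. (2.3)."""
    res = optimize.minimize(psi_q, x0=np.atleast_1d(theta_init),
                            args=(x, score, weight, q), method="Nelder-Mead")
    return res.x

# Illustration with the exponential model p_theta(x) = theta*exp(-theta*x),
# for which p'_theta(x)/p_theta(x) = -theta.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=0.5, size=50)            # true parameter theta_0 = 2
exp_score = lambda x, theta: -theta[0] * np.ones_like(x)
print(minimum_lq_estimate(sample, exp_score, theta_init=1.0))
```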

3 Existence and measurability

We discuss the measurability properties of ηn and derive an existence result for a measurable
version of (approximate) estimators of the type in (2.3). The result that is central to us in this
section can be found in Chapter III of Castaing and Valadier (1977) [see the references therein
and Chapter 8 by Cohn (2013) for further background]. Before we summarize these results, recall
that a Suslin space is a Hausdorff topological space which is the image of a separable, completely
metrizable topological space under a continuous map [for an overview, consult Chapter II of
Schwartz (1973)]. See also Remark A.1 in Appendix A for further information.

Theorem 3.1. Let (Ω, F, P) be a complete probability space and (S, OS ) a Suslin topological
space with Borel-σ-field B(S). Assume that Γ maps Ω into the non-empty subsets of S, and that

graph(Γ) = {(ω, x) ∈ Ω × S | x ∈ Γ(ω)} ∈ F ⊗ B(S).

Then there exists an (F, B(S))-measurable map ϑ̂: Ω → S such that ϑ̂(ω) ∈ Γ(ω) for every
ω ∈ Ω. Additionally, if ψ: Ω × S → R̄ is (F ⊗ B(S), B̄)-measurable, then

$$m(\omega) = \inf_{x\in\Gamma(\omega)}\psi(\omega, x) \qquad\text{and}\qquad M(\omega) = \sup_{x\in\Gamma(\omega)}\psi(\omega, x)$$

are (F, B̄)-measurable. Here, (R̄, B̄) denotes the extended real line with its usual σ-field, and we
write ⊗ for the product of σ-fields.

To apply Theorem 3.1, we first have to investigate the measurability properties of ηn . In the
setting of Section 2, assume the following regularity condition.
(R1) The map Θ ∋ ϑ ↦ p′ϑ(x)/pϑ(x) is continuous for every x > 0.

Let (Ω, F, P) be a complete probability space, which is assumed to underlie all random quantities
of the previous and subsequent sections. Exploiting the structure of ηn , we obtain the following
lemma. The proof is simple, and the basic thoughts can be found in Appendix A.

Lemma 3.2. The map ηn: Ω × (0, ∞) × Θ → R from (2.2) is (F ⊗ B(0, ∞) ⊗ B(Θ), B¹)-measurable.
Moreover, when seen as an element in L^q, ηn: Ω × Θ → L^q is (F ⊗ B(Θ), B(L^q))-measurable. In
particular,

$$(\omega, \vartheta) \mapsto \big\|\eta_n(\omega, \cdot\,, \vartheta)\big\|_{L^q}$$

is an (F ⊗ B(Θ), B[0, ∞))-measurable mapping.

Similar measurability results hold for η : (0, ∞) × Θ → R. For the remainder of this work
assume that

(R2) the parameter space ∅ ≠ Θ ⊂ Rd is a Borel set in Rd.

As such, Θ is a Suslin topological space [see Proposition 8.2.10 from Cohn (2013)] with the sub-
space topology induced by Rd . It is also a metric space with the standard metric in Rd restricted
to Θ. For n ∈ N, let εn be positive random variables such that εn → 0 P-almost surely (a.s.),
as n → ∞. Define ψn,q(ω, ϑ) = ‖ηn(ω, ·, ϑ)‖_{L^q} which, by Lemma 3.2, is a product-measurable
function from Ω × Θ into [0, ∞). Theorem 3.1 implies that mn,q(ω) = inf_{ϑ ∈ Θ} ψn,q(ω, ϑ) is (F, B¹)-
measurable. Hence the set-valued function

$$\Gamma_{n,q}(\omega) = \big\{\vartheta\in\Theta\;\big|\;\psi_{n,q}(\omega,\vartheta) \le m_{n,q}(\omega) + \varepsilon_n(\omega)\big\} \tag{3.1}$$

has a measurable graph. By construction, Γn,q takes as values only non-empty subsets of Θ. In
fact, Γn,q(ω) is also closed in Θ for every ω ∈ Ω, see Remark A.3 in Appendix A. Theorem 3.1
yields the existence of an (F, B(Θ))-measurable map ϑ̂n,q: Ω → Θ such that ϑ̂n,q(ω) ∈ Γn,q(ω),
which is, by definition of Γn,q,

$$\big\|\eta_n\big(\omega, \cdot\,, \hat{\vartheta}_{n,q}(\omega)\big)\big\|_{L^q} \le \inf_{\vartheta\in\Theta}\big\|\eta_n(\omega, \cdot\,, \vartheta)\big\|_{L^q} + \varepsilon_n(\omega) \tag{3.2}$$

for each ω ∈ Ω or, equivalently,

$$\psi_{n,q}\big(\omega, \hat{\vartheta}_{n,q}(\omega)\big) \le \inf_{\vartheta\in\Theta}\psi_{n,q}(\omega, \vartheta) + \varepsilon_n(\omega).$$

Whenever we refer to an estimator that satisfies (2.3), we mean precisely such an (approximate)
measurable version. This settles the existence problem and for our asymptotic studies we have
measurability of ϑbn,q at hand.

4 Consistency

In this section, we investigate the asymptotic behavior of our estimator. Unfortunately, we can
not apply the general results for minimum distance estimators given by Millar (1984), since a
major assumption in that work is that the term in the norm is differentiable (with respect to
ϑ) with derivative not depending on ω, that is, in a sense, the parameter and the ’uncertainty’
have to be separated, which is clearly not the case in our setting. Thus, we need to deal with the
empirical process involved.
Assume the setting from Section 2. For brevity, we keep the notation from the previous

section, ψn,q(ϑ) = ψn,q(ω, ϑ) = ‖ηn(ω, ·, ϑ)‖_{L^q} and set ψq(ϑ) = ‖η(·, ϑ)‖_{L^q}. Recall from the
construction that ϑbn,q (approximately) minimizes ψn,q [see (3.2)], and ϑ0 is the unique minimum
of ψq . The heuristic of the consistency statement proven in this section is as follows. If ψn,q
converges to ψq in a suitable function space, then the random minimal points ϑbn,q converge to ϑ0 .
In order to establish convergence of ψn,q , we need the functions to be sufficiently smooth in ϑ.
In most applications the mapping ϑ ↦ p′ϑ(x)/pϑ(x) will be continuously differentiable for every x > 0,
which can often be used to derive the following regularity condition.

(R3) For each non-empty compact subset K of Θ there exists some 0 < α = αK < ∞ and a
measurable function H = HK: (0, ∞) → [0, ∞) with E[H(X) X] < ∞ such that

$$\bigg|\frac{p_{\vartheta^{(2)}}'(x)}{p_{\vartheta^{(2)}}(x)} - \frac{p_{\vartheta^{(1)}}'(x)}{p_{\vartheta^{(1)}}(x)}\bigg| \le H(x)\,\big\|\vartheta^{(2)} - \vartheta^{(1)}\big\|^{\alpha}$$

for every x > 0 and all ϑ(1), ϑ(2) ∈ K.

Now, let K ≠ ∅ be an arbitrary compact subset of Θ. Then on Ω and for ϑ(1), ϑ(2) ∈ K, we have

$$\big|\psi_{n,q}\big(\vartheta^{(2)}\big) - \psi_{n,q}\big(\vartheta^{(1)}\big)\big| \le \big\|\vartheta^{(2)} - \vartheta^{(1)}\big\|^{\alpha}\cdot\bigg(\int_0^\infty w(t)\,\mathrm{d}t\bigg)^{1/q}\cdot\frac{1}{n}\sum_{j=1}^{n}H(X_j)\,X_j$$

with H and α as in (R3). In particular, K ∋ ϑ 7→ ψn,q (ω, ϑ) is continuous for every ω ∈ Ω, and,
by Lemma 3.2, it constitutes a product measurable map. This already implies that ϑ 7→ ψn,q (ϑ)
is a random element of C(K)+ [see Lemma 3.1 of Kallenberg (2002)], the space of continuous
functions from K to [0, ∞) which is a complete, separable metric space (endowed with the usual
metric that induces the uniform topology). From (R3) it also follows that K ∋ ϑ 7→ ψq (ϑ) is an
element of C(K)+ . We can now state the convergence results for ψn,q that are essential for our
consistency proof.

Lemma 4.1. In the setting of Section 2, assume that (R1) – (R3) are satisfied. Let K ≠ ∅ be a
compact subset of Θ. Then ψn,q → ψq in C(K)+ P-a.s., as n → ∞. Moreover,

$$\inf_{\vartheta\in F}\big\|\eta_n(\cdot\,, \vartheta)\big\|_{L^q} = \inf_{\vartheta\in F}\psi_{n,q}(\vartheta)\;\longrightarrow\;\inf_{\vartheta\in F}\psi_q(\vartheta) = \inf_{\vartheta\in F}\big\|\eta(\cdot\,, \vartheta)\big\|_{L^q}$$

P-a.s., as n → ∞, for every non-empty closed subset F of K.

The proof of this lemma is rather technical and deferred to Appendix B. Note that the
term inf ϑ ∈ F ψn,q (ϑ) is a random variable by Theorem 3.1 (cf. the measurability of mn,q in the
previous section). The following theorem uses the above lemma to establish consistency. In the
second statement, we assume that the parameter space Θ is compact, thus rendering Lemma 4.1
applicable on the whole of Θ, which will turn out essential to prove strong consistency. For most
practical purposes this is sufficient, when parameters relevant for modeling in applications can be
taken to stem from some (huge) compact set. Note that with this compactness assumption we
actually do not need the εn -term in (3.2) since ψn,q is lower semi-continuous by (R1) and Fatou’s
lemma, and thus attains its minimum in Θ. The first statement of the following theorem shows
that if the sequence ϑbn,q is already known to be tight, no compactness assumption is needed, but
P
we can only expect weak consistency in general, thus denoting by ’−→’ convergence in probability.
After the proof of the theorem, we provide an insight in which cases this is possible (Remark 4.3).

Theorem 4.2 (Consistency). Take the setting from Section 2, let ψn,q , ψq be as above, and
consider ϑbn,q from (3.2). Further assume that (R1) – (R3) are satisfied.
(i) If (ϑ̂n,q)n∈N is tight in Θ, then ϑ̂n,q →^P ϑ0, as n → ∞.

(ii) If Θ is compact, then ϑ̂n,q → ϑ0 P-a.s., as n → ∞.

Proof. In the proof of (i) we follow Theorem 3.2.2 from van der Vaart and Wellner (2000), but
we adapt the reasoning to our setting, using the measurability properties we established in Section
3, and Lemma 4.1. For completeness, as well as to prepare the proof of the second result, we
give a full proof. We start with a preliminary observation, establishing that the minimum at ϑ0
is (locally) well separated. If K is a compact subset of Θ, and O an open subset of Rd which
contains ϑ0 , then

$$0 = \psi_q(\vartheta_0) < \inf_{\vartheta\in K\setminus O}\psi_q(\vartheta). \tag{4.1}$$


Indeed, if this is not the case, we find a sequence ϑ(k) ∈ K \ O such that ψq(ϑ(k)) → 0 as k → ∞.
Since K \ O is compact, there exists a subsequence {ϑ(kj)}j∈N and some ϑ* ∈ K \ O such that
ϑ(kj) → ϑ* as j → ∞. By continuity of ψq, it holds that ψq(ϑ*) = lim_{j→∞} ψq(ϑ(kj)) = 0, but
K \ O ∋ ϑ* ≠ ϑ0 ∈ O, which contradicts the fact that ϑ0 is the unique zero of ψq.


Now, let ε, δ > 0. Choose a compact subset K = Kδ ⊂ Θ such that sup_{n ∈ N} P(ϑ̂n,q ∉ K) < δ,
and define F = K \ Bε(ϑ0), where Bε(ϑ0) denotes the open ball in Rd of radius ε around ϑ0.
Applying Lemma 4.1 and (4.1) to K and F, together with (3.2) and the Portmanteau theorem
[cf. Theorem 2.1 of Billingsley (1968)], we get

$$\limsup_{n\to\infty}\mathbb{P}\big(\|\hat{\vartheta}_{n,q} - \vartheta_0\| \ge \varepsilon\big) \le \limsup_{n\to\infty}\mathbb{P}\big(\hat{\vartheta}_{n,q}\in F\big) + \limsup_{n\to\infty}\mathbb{P}\big(\hat{\vartheta}_{n,q}\notin K\big)$$
$$\le \limsup_{n\to\infty}\mathbb{P}\Big(\inf_{\vartheta\in F}\psi_{n,q}(\vartheta) \le \psi_{n,q}(\vartheta_0) + \varepsilon_n\Big) + \delta \;\le\; \mathbb{P}\Big(\inf_{\vartheta\in F}\psi_q(\vartheta) \le \psi_q(\vartheta_0)\Big) + \delta \;=\; \delta.$$

Note that if F = ∅, the inequality holds trivially. Since both ε and δ were arbitrary, the claim
follows. For this first part of the proof, we only needed the convergences provided by Lemma
4.1 to be valid in probability. For the following proof of (ii), we rely on the stronger result. The
arguments we use are scattered over Section 3 of the work by Sahler (1970). For reasons alluded
to in Remark A.1, and since that work contains some typos, we provide the adapted arguments.
Let ε > 0 and define βε = inf_{ϑ ∈ Θ\Bε(ϑ0)} ψq(ϑ). By (4.1), we have βε > 0. Using the well-known
equivalent criterion for almost sure convergence, Lemma 4.1 gives

$$\lim_{n\to\infty}\mathbb{P}\Bigg(\bigcup_{k\ge n}\Big\{\sup_{\vartheta\in\Theta\setminus B_\varepsilon(\vartheta_0)}\big|\psi_{k,q}(\vartheta) - \psi_q(\vartheta)\big| \ge \frac{\beta_\varepsilon}{2}\Big\}\Bigg) = 0.$$

By definition of βε this implies

$$\lim_{n\to\infty}\mathbb{P}\Bigg(\bigcup_{k\ge n}\Big\{\inf_{\vartheta\in\Theta\setminus B_\varepsilon(\vartheta_0)}\psi_{k,q}(\vartheta) \le \frac{\beta_\varepsilon}{2}\Big\}\Bigg) = 0.$$

Moreover, ψn,q(ϑ0) + εn → ψq(ϑ0) = 0 P-a.s., and thus

$$\lim_{n\to\infty}\mathbb{P}\Bigg(\bigcup_{k\ge n}\Big\{\psi_{k,q}(\vartheta_0) + \varepsilon_k \ge \frac{\beta_\varepsilon}{2}\Big\}\Bigg) = 0.$$

Putting everything together,

$$\limsup_{n\to\infty}\mathbb{P}\Bigg(\bigcup_{k\ge n}\Big\{\big\|\hat{\vartheta}_{k,q} - \vartheta_0\big\| \ge \varepsilon\Big\}\Bigg) \le \limsup_{n\to\infty}\mathbb{P}\Bigg(\bigcup_{k\ge n}\Big\{\inf_{\vartheta\in\Theta\setminus B_\varepsilon(\vartheta_0)}\psi_{k,q}(\vartheta) \le \psi_{k,q}(\vartheta_0) + \varepsilon_k\Big\}\Bigg)$$
$$\le \limsup_{n\to\infty}\mathbb{P}\Bigg(\bigcup_{k\ge n}\Big\{\inf_{\vartheta\in\Theta\setminus B_\varepsilon(\vartheta_0)}\psi_{k,q}(\vartheta) \le \frac{\beta_\varepsilon}{2}\Big\}\Bigg) + \limsup_{n\to\infty}\mathbb{P}\Bigg(\bigcup_{k\ge n}\Big\{\psi_{k,q}(\vartheta_0) + \varepsilon_k \ge \frac{\beta_\varepsilon}{2}\Big\}\Bigg) = 0,$$

that is, ϑ̂n,q → ϑ0 P-a.s., as n → ∞.

Remark 4.3 (A priori tightness of the sequence of estimators). We provide a tool for
proving tightness of the estimators before having established consistency, which we can use in
Theorem 4.2 to get consistency even for unbounded parameter spaces. The statement essentially

yields that if ψn,q is strictly convex, ϑbn,q n ∈ N is tight. More precisely, suppose that conditions

(R1) – (R3) hold. Let Θ be convex with ϑ0 ∈ Θ◦ , the interior of Θ. Further, let ψn,q be strictly
convex (almost surely). Then the sequence of estimators ϑbn,q is tight in Θ. The proof is straight-
forward and some hints are given in exercise problem 4 in Section 3.2 of van der Vaart and Wellner
(2000) (for more details, see Appendix C).

5 Example: The exponential distribution

Let Θ = (0, ∞) and pϑ(x) = ϑ exp(−ϑx), x > 0. This trivially is an admissible class of density
functions. Moreover, let ϑ0 ∈ Θ, X ∼ pϑ0 , and take a sample X1 , . . . , Xn of i.i.d. copies of X.
An easy calculation gives
$$\psi_q^q(\vartheta) = \int_0^\infty\Big|\,\mathbb{E}\big[\vartheta\min\{X, t\}\big] - \big(1 - \exp(-\vartheta_0 t)\big)\Big|^q\,w(t)\,\mathrm{d}t = \Big|\frac{\vartheta}{\vartheta_0} - 1\Big|^q\int_0^\infty\big(1 - \exp(-\vartheta_0 t)\big)^q\,w(t)\,\mathrm{d}t,$$

which nicely illustrates that ϑ0 is indeed the unique zero of this function. For the particular
choice of weight w(t) = exp(−at), t > 0, with some tuning parameter a > 0, and in the case
q = 2, straightforward calculations give

$$\psi_{n,2}^2(\vartheta) = \vartheta^2\,\Psi_n^{(1)} + \vartheta\,\Psi_n^{(2)} + \Psi_n^{(3)},$$

where

$$\Psi_n^{(1)} = \frac{2}{a^3} + \frac{2}{a^2 n^2}\sum_{j=1}^{n}e^{-aX_{(j)}}\Big[X_{(j)}(-n+j-1) - \frac{1}{a}(2n-2j+1)\Big] - \frac{2}{a^2 n^2}\sum_{1\le j<k\le n}X_{(j)}\,e^{-aX_{(k)}},$$

$$\Psi_n^{(2)} = \frac{2}{a n^2}\sum_{j=1}^{n}e^{-aX_{(j)}}\Big[X_{(j)}(-n+j-1) - \frac{1}{a}(n-2j+1)\Big] - \frac{2}{a n^2}\sum_{1\le j<k\le n}X_{(j)}\,e^{-aX_{(k)}},$$

$$\Psi_n^{(3)} = \frac{1}{a n^2}\sum_{j=1}^{n}e^{-aX_{(j)}}\,(2j-1),$$

and X(1) < . . . < X(n) is the ordered sample. Using that exp(−aX(k) ) ≤ exp(−aX(j) ) P-a.s. for
j < k, we obtain

$$\Psi_n^{(1)} \ge \frac{2}{a^3 n^2}\sum_{j=1}^{n}(2n-2j+1)\Big[1 - \big(1 + aX_{(j)}\big)e^{-aX_{(j)}}\Big],$$

and since 1 + aX(j) < exp(aX(j)) P-a.s., we have Ψn^(1) > 0 almost surely. Therefore, ψ²n,2 is
strictly convex (almost surely), and has a unique minimum. By Remark 4.3 and Theorem 4.2 (i),
the estimator

$$\hat{\vartheta}^{(a)}_{n,2} = \arg\min\big\{\psi_{n,2}(\vartheta)\;\big|\;\vartheta > 0\big\} = \arg\min\big\{\psi_{n,2}^2(\vartheta)\;\big|\;\vartheta > 0\big\} = \arg\min\big\{\vartheta^2\,\Psi_n^{(1)} + \vartheta\,\Psi_n^{(2)} + \Psi_n^{(3)}\;\big|\;\vartheta > 0\big\}$$

is consistent for ϑ0 (over the whole of Θ). Note that we have not made the dependence of Ψn^(1),
Ψn^(2), and Ψn^(3) on 'a' explicit to prevent overloading the notation. With a similar argument as
above, we may show that Ψn^(2) < 0 almost surely, thus we can calculate ϑ̂(a)n,2 explicitly as

$$\hat{\vartheta}^{(a)}_{n,2} = -\,\frac{\Psi_n^{(2)}}{2\,\Psi_n^{(1)}}.$$
To provide insight on the performance of this estimator, we compare it with the maximum
likelihood estimator and the minimizer of the mean squared error (for n ≥ 3) which are given as
 −1  −1
n
X n
X
1 1
ϑbM
n
L
= Xj  and ϑbM
n
SE
= Xj  ,
n n−2
j=1 j=1

respectively, as well as with the minimum Cramér-von Mises distance estimator discussed in the
introduction, namely
$$\hat{\vartheta}_n^{CvM} = \arg\min\bigg\{\int_0^\infty\big(\hat{F}_n(t) - P_\vartheta(t)\big)^2\,\mathrm{d}P_\vartheta(t)\;\bigg|\;\vartheta > 0\bigg\} = \arg\min\Bigg\{\frac{1}{n}\sum_{j=1}^{n}\bigg[\exp\big(-2\vartheta X_{(j)}\big) + \exp\big(-\vartheta X_{(j)}\big)\Big(\frac{2j-1}{n} - 2\Big)\bigg]\;\Bigg|\;\vartheta > 0\Bigg\},$$

where Pϑ (x) = 1 − exp(−ϑx), x > 0, denotes the distribution function of the exponential distri-
bution, and where Fbn is the empirical distribution function of X1 , . . . , Xn . For this comparison
we simulate (for fixed values of n and ϑ0 ) D = 100, 000 samples of size n from an exponential
distribution with parameter ϑ0 , calculate the values of the estimator for each sample yielding
values ϑb1 , . . . , ϑbD , and approximate the bias and mean squared error (MSE) via

$$\frac{1}{D}\sum_{k=1}^{D}\big(\hat{\vartheta}_k - \vartheta_0\big) \qquad\text{and}\qquad \frac{1}{D}\sum_{k=1}^{D}\big(\hat{\vartheta}_k - \vartheta_0\big)^2$$

for each of the above estimators. We perform all simulations with Python 3.7.2 [1]. For the
minimization required to calculate the minimum Cramér-von Mises distance estimator, we choose as
initial value the maximum likelihood estimator and use a sequential least squares programming
method ('SLSQP') [cf. Kraft (1988)] implemented in the 'optimize.minimize' function of the
Python module 'scipy' [2]. Tables 1 and 2 below contain the results for the bias and MSE
values.
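The following sketch (hypothetical helper names, and far fewer Monte Carlo replications than the D = 100,000 used for the tables) computes the closed-form estimator −Ψn^(2)/(2Ψn^(1)) from the expressions displayed above and approximates bias and MSE alongside the maximum likelihood and minimum-MSE estimators.

```python
import numpy as np

def new_estimator_exp(x, a):
    """Closed-form minimum-L2-distance estimate -Psi_n^(2)/(2 Psi_n^(1)) for Exp(theta)."""
    xs = np.sort(x); n = len(xs)
    j = np.arange(1, n + 1)
    e = np.exp(-a * xs)
    # sum_{j<k} X_(j) * exp(-a X_(k)) via cumulative sums of the ordered sample
    cross = np.sum(e * np.concatenate(([0.0], np.cumsum(xs)[:-1])))
    psi1 = (2/a**3 + 2/(a**2*n**2)*np.sum(e*(xs*(-n + j - 1) - (2*n - 2*j + 1)/a))
            - 2/(a**2*n**2)*cross)
    psi2 = (2/(a*n**2)*np.sum(e*(xs*(-n + j - 1) - (n - 2*j + 1)/a))
            - 2/(a*n**2)*cross)
    return -psi2 / (2*psi1)

def monte_carlo(theta0, n, D=2000, a=1.0, seed=0):
    """Approximate bias and MSE of the ML, minimum-MSE and new estimator."""
    rng = np.random.default_rng(seed)
    est = {"ML": [], "MSE": [], "new": []}
    for _ in range(D):
        x = rng.exponential(scale=1/theta0, size=n)
        est["ML"].append(n / x.sum())
        est["MSE"].append((n - 2) / x.sum())
        est["new"].append(new_estimator_exp(x, a))
    for name, values in est.items():
        dev = np.asarray(values) - theta0
        print(f"{name:>4}: bias {dev.mean():+.4f}, MSE {np.mean(dev**2):.4f}")

monte_carlo(theta0=2.0, n=25)
```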
As for the biases, the maximum likelihood estimator and the minimum MSE estimator perform
almost identically in terms of the absolute bias, and the minimum Cramér-von Mises distance
[1] Provided by the Python Software Foundation, https://www.python.org, accessed 28 August 2019.
[2] Jones E., Oliphant T., Peterson P., et al. (2001–) SciPy: Open Source Scientific Tools for Python, http://www.scipy.org, accessed 28 August 2019.

ϑ0    n      ϑ̂nML      ϑ̂nMSE      ϑ̂nCvM      ϑ̂n,2 (a=0.25)   ϑ̂n,2 (a=0.5)   ϑ̂n,2 (a=1)   ϑ̂n,2 (a=2)   ϑ̂n,2 (a=3)
10 0.0557 -0.0554 0.051 0.0428 0.0376 0.0333 0.0302 0.0291
25 0.0212 -0.0205 0.0187 0.0161 0.0144 0.0129 0.0118 0.0114
0.5 50 0.0098 -0.0106 0.0089 0.0075 0.0067 0.0062 0.0057 0.0055
100 0.005 -0.0051 0.0045 0.0039 0.0035 0.0032 0.003 0.0029
200 0.0023 -0.0027 0.002 0.0018 0.0016 0.0015 0.0014 0.0013
10 0.2193 -0.2245 0.2004 0.2011 0.1871 0.168 0.1476 0.1371
25 0.083 -0.0836 0.0746 0.0754 0.0701 0.0634 0.0569 0.0538
2 50 0.0398 -0.0418 0.0345 0.0358 0.0332 0.0298 0.0263 0.0246
100 0.0191 -0.0213 0.0165 0.0171 0.0158 0.0142 0.0126 0.0118
200 0.0095 -0.0106 0.0074 0.0084 0.0077 0.0067 0.0057 0.0052
10 0.5437 -0.5651 0.4863 0.5238 0.5059 0.4753 0.4303 0.3997
25 0.2102 -0.2066 0.1832 0.2015 0.194 0.1818 0.1649 0.154
5 50 0.1048 -0.0994 0.0923 0.1004 0.0967 0.0908 0.0829 0.0779
100 0.052 -0.0491 0.044 0.0496 0.0477 0.0446 0.0404 0.0378
200 0.0264 -0.0238 0.0224 0.0253 0.0243 0.0229 0.0209 0.0196
10 1.123 -1.1016 1.0316 1.1028 1.0837 1.0484 0.9885 0.9401
25 0.4177 -0.4157 0.3719 0.4089 0.4008 0.3863 0.3628 0.3448
10 50 0.2041 -0.204 0.1826 0.1996 0.1955 0.1883 0.1768 0.1681
100 0.0991 -0.1029 0.0873 0.0967 0.0945 0.0908 0.0848 0.0804
200 0.0556 -0.045 0.0483 0.0544 0.0533 0.0513 0.0483 0.046

Table 1: Approximated biases calculated with 100,000 exponentially distributed Monte Carlo
samples for sample sizes n = 10, 25, 50, 100, 200.

estimator has a slight edge. Our new estimator outperforms all other methods (virtually) uni-
formly. More precisely, it seems as if for larger tuning parameters ’a’ the bias decreases. We will
show, however, that this observation is not correct in that generality. The results for the mean
squared error reveal that the minimum MSE estimator is the best method with respect to this
measure of quality, which is no surprise as it is constructed to minimize the MSE. For sample size
n = 10 the superiority is particularly obvious, but for larger samples, the maximum likelihood
estimator is only slightly worse. Our new estimator shows almost identical results (for a = 0.25)
as the maximum likelihood estimator, undermining that the method is sound and powerful. In
contrast to the observation with the bias values, the MSE appears to increase with ’a’. This nicely
illustrates the variance-bias trade-off commonly observed in the context of estimation problems.

ϑ0    n      ϑ̂nML      ϑ̂nMSE      ϑ̂nCvM      ϑ̂n,2 (a=0.25)   ϑ̂n,2 (a=0.5)   ϑ̂n,2 (a=1)   ϑ̂n,2 (a=2)   ϑ̂n,2 (a=3)
10 0.0416 0.0277 0.0593 0.0409 0.0428 0.0496 0.0662 0.0837
25 0.0123 0.0105 0.0162 0.0127 0.0138 0.0167 0.0229 0.0294
0.5 50 0.0055 0.0051 0.0072 0.0058 0.0064 0.0078 0.0109 0.014
100 0.0026 0.0025 0.0034 0.0028 0.0031 0.0038 0.0053 0.0069
200 0.0013 0.0013 0.0017 0.0014 0.0015 0.0019 0.0026 0.0034
10 0.6645 0.4449 0.9504 0.6569 0.6525 0.6537 0.6845 0.7346
25 0.1949 0.1661 0.2573 0.1942 0.1952 0.2006 0.2184 0.2403
2 50 0.0887 0.0821 0.1165 0.0889 0.0898 0.0932 0.1029 0.1141
100 0.0418 0.0402 0.0549 0.042 0.0426 0.0444 0.0493 0.0549
200 0.0205 0.0201 0.0269 0.0207 0.021 0.022 0.0244 0.0272
10 4.0739 2.7374 5.6848 4.0529 4.035 4.0092 3.9977 4.0335
25 1.2302 1.0465 1.621 1.2272 1.2259 1.2284 1.2493 1.2842
5 50 0.5522 0.5087 0.7246 0.5518 0.5524 0.5561 0.5706 0.5908
100 0.2635 0.2529 0.3445 0.2636 0.2641 0.2665 0.2745 0.285
200 0.1295 0.1268 0.1692 0.1296 0.1299 0.1313 0.1355 0.141
10 16.8106 11.1652 23.5189 16.7647 16.7219 16.646 16.5344 16.4779
25 4.885 4.1598 6.4565 4.8785 4.8739 4.8702 4.8831 4.9188
10 50 2.2069 2.0371 2.8967 2.2053 2.2048 2.2069 2.2213 2.2464
100 1.0473 1.007 1.3747 1.0471 1.0474 1.0497 1.0594 1.074
200 0.5126 0.5014 0.6658 0.5127 0.513 0.5144 0.5198 0.5274

Table 2: Approximated MSE calculated with 100,000 exponentially distributed Monte Carlo
samples for sample sizes n = 10, 25, 50, 100, 200.

6 The case a → ∞

As discussed previously, the simulation results for the exponential distribution somewhat indicate
that as the tuning parameter ’a’ grows, the bias decreases while the MSE increases. Interestingly,
we can lay observations for a → ∞ on a rigorous theoretical basis. To be precise, observe the
following general result.

Theorem 6.1. Consider the setting from Section 2 with weight function w(t) = e−at . For the
quantity ψn,q(ϑ, a) = ψn,q(ϑ) = ‖ηn(·, ϑ)‖_{L^q} from the end of Section 3, we make the dependence
on the tuning parameter 'a' explicit. Then

$$\lim_{a\to\infty}a^{q+1}\,\psi_{n,q}^{q}(\vartheta, a) = \Gamma(q+1)\,\frac{1}{n}\sum_{j=1}^{n}\bigg|\frac{p_\vartheta'(X_j)}{p_\vartheta(X_j)}\bigg|^{q},$$

on a set of measure one, where Γ(·) denotes the Gamma function.

The proof consists of an almost trivial application of an Abelian theorem for the Laplace
transform, see p.182 of Widder (1959), or the work by Baringhaus et al. (2000). Since a, q > 0,
the functions ψn,q(ϑ) and a^{(q+1)/q} ψn,q(ϑ) attain their minimum at the same point. Thus, in the
limit a → ∞, our procedure essentially yields as an estimator the minimizer of the quantity

$$\frac{1}{n}\sum_{j=1}^{n}\bigg|\frac{p_\vartheta'(X_j)}{p_\vartheta(X_j)}\bigg|^{q}.$$

In the situation of the exponential distribution as discussed in Section 5, the result reduces to
lim_{a→∞} a³ ψ²n,2(ϑ, a) = 2ϑ², so in the limit a → ∞, the procedure will choose ϑ̂ = 0 ∉ Θ as the
estimator, which leads to a bias of −ϑ0 and an MSE of ϑ0². The observation from the simulations
estimator, which leads to a bias of −ϑ0 and an MSE of ϑ20 . The observation from the simulations
is, therefore, not universal. An example for which the limit in Theorem 6.1 is less trivial is the
Rayleigh distribution.

7 Example: Rayleigh distribution

Let Θ = (0, ∞) and take the density function of the Rayleigh distribution with scale parameter
ϑ ∈ Θ,
$$p_\vartheta(x) = \frac{x}{\vartheta^2}\,\exp\Big(-\frac{x^2}{2\vartheta^2}\Big), \qquad x > 0.$$

It is easy to check that the Rayleigh density satisfies all regularity conditions stated throughout
the work, and that we have p′ϑ(x)/pϑ(x) = 1/x − x/ϑ². The limit in Theorem 6.1 thus takes the form
$$\Gamma(q+1)\,\frac{1}{n}\sum_{j=1}^{n}\bigg|\frac{p_\vartheta'(X_j)}{p_\vartheta(X_j)}\bigg|^{q} = \Gamma(q+1)\,\frac{1}{n}\sum_{j=1}^{n}\bigg|\frac{1}{X_j} - \frac{X_j}{\vartheta^2}\bigg|^{q},$$

where X1 , . . . , Xn are i.i.d. random variables which follow the Rayleigh law, X1 ∼ pϑ0 , for some
unknown scale ϑ0 ∈ Θ. In the case q = 2, it is easy to calculate that the minimum of the above
function over ϑ > 0 is given through
$$\hat{\vartheta}_n^{AM} = \sqrt{\frac{\frac{1}{n}\sum_{j=1}^{n}X_j}{\frac{1}{n}\sum_{j=1}^{n}\frac{1}{X_j}}}\,.$$

Strikingly, this asymptotically derived moment-type estimator is itself consistent for ϑ0 , as


$$\hat{\vartheta}_n^{AM} = \sqrt{\frac{\frac{1}{n}\sum_{j=1}^{n}X_j}{\frac{1}{n}\sum_{j=1}^{n}\frac{1}{X_j}}}\;\longrightarrow\;\sqrt{\frac{\mathbb{E}[X_1]}{\mathbb{E}[1/X_1]}} = \sqrt{\frac{\sqrt{\pi/2}\,\vartheta_0}{\sqrt{\pi/2}\cdot\frac{1}{\vartheta_0}}} = \vartheta_0$$

P-a.s., as n → ∞, where we used the law of large numbers, as well as the fact that X1 , . . . , Xn
all follow the Rayleigh distribution with parameter ϑ0 . We compare this estimator with other
methods. Among them is our new estimator
$$\hat{\vartheta}^{(a)}_{n,2} = \arg\min\big\{\psi_{n,2}(\vartheta)\;\big|\;\vartheta > 0\big\} = \arg\min\big\{\vartheta^{-4}\,\widetilde{\Psi}_n^{(1)} + \vartheta^{-2}\,\widetilde{\Psi}_n^{(2)} + \widetilde{\Psi}_n^{(3)}\;\big|\;\vartheta > 0\big\},$$

where
" 2 X
#
2 X 2  X(j) (k) 
e (1)
Ψ = 2 X(j) X(k) · 3 1 − e−aX(j)
− e−aX(j)
+e−aX(k)
n
n a a2
1≤j<k≤n
n
" 2 3
#
1 X 2X(j)  2X(j)
+ 2 1 − e−aX(j) − e−aX(j) ,
n a3 a2
j=1

" 2 e−aX(k)    
2 X X(j) 1 X(j) e−aX(j) X(j)
e (2)
Ψ n = 2 −1 + − X(k)
n a aX(k) a aX(k)
1≤j<k≤n
 #
2  X(k) X (j)
− 3 1 − e−aX(j) +
a X(j) X(k)
n    
1 X 2e−aX(j) 2j 4 −aX(j)

+ 2 · X(j) − X(j) − 3 1 − e ,
n a a a
j=1

and
X  
X(j) −aX 
e (3) = 2
Ψ e (j) + 3
2
1−e −aX(j)
n
n2 aX(k) a X(j) X(k)
1≤j<k≤n
" #
n −aX(j) 
1 X 2  e 2
+ 2 2 1 − e−aX(j) + 4j − 1 − (2j − 1) ,
n
j=1
a3 X(j) a aX(j)

and X(1) < ... < X(n) denotes the ordered sample. It is easily seen that if both Ψ̃n^(1) > 0 and
Ψ̃n^(2) < 0 P-a.s., then the minimum can be calculated explicitly as

$$\hat{\vartheta}^{(a)}_{n,2} = \sqrt{-\,\frac{2\,\widetilde{\Psi}_n^{(1)}}{\widetilde{\Psi}_n^{(2)}}}\,,$$

and indeed, using that e−aX(k) < e−aX(j) and 1 − e−aX(j) − aX(j) e−aX(j) > 0 P-a.s., we have

2 X 2X(j) X(k)  
e (1)
Ψ n > 2 1 − e−aX(j)
− aX(j) e−aX(j)
n a3
1≤j<k≤n

1 X 2X(j)  
n 2
−aX(j) −aX(j)
+ 2 1 − e − aX(j) e
n a3
j=1

> 0 P − a.s.,

and with similar thoughts, Ψ̃n^(2) < 0 P-a.s. Additionally, we consider the maximum likelihood
estimator and a moment estimator, which are given as

$$\hat{\vartheta}_n^{ML} = \sqrt{\frac{1}{2n}\sum_{j=1}^{n}X_j^2} \qquad\text{and}\qquad \hat{\vartheta}_n^{Mom} = \sqrt{\frac{2}{\pi}}\cdot\frac{1}{n}\sum_{j=1}^{n}X_j,$$

respectively. Note in particular that the moment estimator is unbiased and we can expect it to
outperform the other estimators in this regard. Finally, we include the minimum Cramér-von

Mises distance estimator given through

$$\hat{\vartheta}_n^{CvM} = \arg\min\Bigg\{\frac{1}{n}\sum_{j=1}^{n}\bigg[\Big(\frac{2j-1}{n} - 2\Big)\exp\Big(-\frac{X_{(j)}^2}{2\vartheta^2}\Big) + \exp\Big(-\frac{X_{(j)}^2}{\vartheta^2}\Big)\bigg]\;\Bigg|\;\vartheta > 0\Bigg\},$$

where we solve the minimization numerically via a sequential least squares programming method
as in the case of the exponential distribution in Section 5, using as an initial value the maximum
likelihood estimator. The execution of the comparison is as in the example on the exponential
distribution, and the results are displayed in Tables 3 and 4.
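A compact sketch of the simpler Rayleigh estimators (maximum likelihood, moment, and the asymptotically derived ϑ̂nAM), together with the new estimator computed by numerical minimization of ψ²n,2 rather than via the closed-form Ψ̃-expressions; the function names are our own and only illustrate the computations described above.

```python
import numpy as np
from scipy import integrate, optimize

def rayleigh_estimates(x, a=1.0):
    """ML, moment, asymptotic-moment (a -> infinity) and minimum-L2 estimates."""
    n = len(x)
    ml = np.sqrt(np.sum(x**2) / (2*n))
    mom = np.sqrt(2/np.pi) * x.mean()
    am = np.sqrt(x.mean() / np.mean(1/x))
    def psi2(theta):
        # eta_n(t, theta) with p'_theta(x)/p_theta(x) = 1/x - x/theta^2
        def integrand(t):
            eta = np.mean(-(1/x - x/theta**2) * np.minimum(x, t)) - np.mean(x <= t)
            return eta**2 * np.exp(-a*t)
        return integrate.quad(integrand, 0, np.inf, limit=200)[0]
    new = optimize.minimize_scalar(psi2, bounds=(1e-3, 1e3), method="bounded").x
    return ml, mom, am, new

rng = np.random.default_rng(0)
sample = rng.rayleigh(scale=2.0, size=50)   # true scale theta_0 = 2
print(rayleigh_estimates(sample))
```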
ϑ0    n      ϑ̂nML      ϑ̂nMom      ϑ̂nAM      ϑ̂nCvM      ϑ̂n,2 (a=0.25)   ϑ̂n,2 (a=0.5)   ϑ̂n,2 (a=1)   ϑ̂n,2 (a=2)   ϑ̂n,2 (a=3)
10 -0.0061 0.0001 0.0269 0.0029 -0.005 -0.004 -0.0023 0.0006 0.0028
25 -0.0025 0.0001 0.014 0.0012 -0.002 -0.0016 -0.0009 0.0002 0.0011
0.5 50 -0.0011 0.0001 0.0085 0.0007 -0.0009 -0.0007 -0.0003 0.0002 0.0007
100 -0.0006 0 0.0047 0.0003 -0.0005 -0.0004 -0.0002 0.0001 0.0003
200 -0.0004 0 0.0027 0.0001 -0.0003 -0.0002 -0.0002 0 0.0001
10 -0.0246 0.0001 0.1074 0.0107 -0.0093 0.0019 0.0189 0.0434 0.0598
25 -0.0097 0 0.0549 0.0039 -0.0036 0.0007 0.0073 0.0169 0.0237
2 50 -0.0057 -0.0007 0.032 0.0013 -0.0025 -0.0003 0.003 0.0079 0.0114
100 -0.0026 -0.0001 0.019 0.0008 -0.0011 0 0.0017 0.0041 0.0058
200 -0.0015 -0.0003 0.0104 0 -0.0007 -0.0002 0.0005 0.0016 0.0025
10 -0.0624 -0.0009 0.2642 0.0255 0.0156 0.064 0.1293 0.1926 0.2199
25 -0.0245 -0.0002 0.1388 0.0097 0.0063 0.0251 0.0519 0.0817 0.0973
5 50 -0.0132 -0.0002 0.0848 0.0049 0.003 0.0129 0.027 0.0432 0.0523
100 -0.0059 0 0.0477 0.0021 0.0016 0.0062 0.0128 0.0206 0.0253
200 -0.0028 0.0002 0.0279 0.001 0.001 0.0033 0.0066 0.0106 0.0129
10 -0.1248 -0.0004 0.5383 0.0537 0.1302 0.2617 0.3919 0.4777 0.5055
25 -0.0565 -0.0076 0.2699 0.0123 0.043 0.0965 0.1564 0.2074 0.2293
10 50 -0.0261 -0.0021 0.1582 0.0083 0.0225 0.048 0.0783 0.1073 0.1214
100 -0.0109 0.0013 0.0979 0.0057 0.0138 0.0272 0.043 0.0586 0.0671
200 -0.0077 -0.001 0.0545 0.0011 0.0057 0.0128 0.0207 0.0289 0.0334

Table 3: Approximated biases calculated with 100,000 Rayleigh-distributed Monte Carlo samples
for sample sizes n = 10, 25, 50, 100, 200.

Apparently, the moment estimator ϑ̂nMom outperforms the other estimators with respect to the
bias values, while the maximum likelihood estimator ϑ̂nML gets the smallest MSE. The estimator

we obtained via the limit results from the previous section seems sound in itself but is completely
negligible compared to the other methods. In terms of bias, the minimum Cramér-von Mises

ϑ0    n      ϑ̂nML      ϑ̂nMom      ϑ̂nAM      ϑ̂nCvM      ϑ̂n,2 (a=0.25)   ϑ̂n,2 (a=0.5)   ϑ̂n,2 (a=1)   ϑ̂n,2 (a=2)   ϑ̂n,2 (a=3)
10 0.0061 0.0067 0.0135 0.0082 0.0062 0.0062 0.0064 0.0068 0.0072
25 0.0025 0.0027 0.0061 0.0033 0.0025 0.0025 0.0025 0.0027 0.0028
0.5 50 0.0013 0.0014 0.0033 0.0017 0.0013 0.0013 0.0013 0.0013 0.0014
100 0.0006 0.0007 0.0019 0.0008 0.0006 0.0006 0.0006 0.0007 0.0007
200 0.0003 0.0003 0.001 0.0004 0.0003 0.0003 0.0003 0.0003 0.0004
10 0.0992 0.1089 0.2185 0.1316 0.103 0.109 0.1248 0.157 0.18
25 0.0401 0.0438 0.0977 0.0527 0.0412 0.0433 0.0488 0.0601 0.069
2 50 0.0199 0.0219 0.0543 0.0264 0.0205 0.0215 0.0242 0.0297 0.0341
100 0.01 0.0109 0.03 0.013 0.0102 0.0106 0.0119 0.0146 0.0167
200 0.005 0.0055 0.0167 0.0065 0.0051 0.0053 0.006 0.0073 0.0083
10 0.6205 0.6827 1.3695 0.8271 0.7057 0.8359 1.0635 1.2775 1.3473
25 0.2521 0.276 0.6122 0.3314 0.2803 0.3258 0.4097 0.5088 0.5566
5 50 0.125 0.1371 0.3398 0.1648 0.1385 0.1606 0.2015 0.2529 0.2811
100 0.0627 0.0684 0.1876 0.0819 0.0688 0.0793 0.0989 0.1242 0.1392
200 0.0311 0.0341 0.1039 0.0409 0.0343 0.0395 0.0491 0.0618 0.0695
10 2.4749 2.7278 5.4528 3.3202 3.3485 4.2534 5.0993 5.4722 5.523
25 0.9966 1.0933 2.4305 1.3171 1.2922 1.6219 2.0109 2.3024 2.3989
10 50 0.5 0.5445 1.3528 0.6504 0.6342 0.7926 0.9955 1.1792 1.26
100 0.2499 0.2735 0.7504 0.329 0.3179 0.3959 0.4966 0.5962 0.6473
200 0.1248 0.1364 0.42 0.1637 0.1579 0.1961 0.247 0.2999 0.3288

Table 4: Approximated MSE calculated with 100,000 Rayleigh-distributed Monte Carlo samples
for sample sizes n = 10, 25, 50, 100, 200.

distance estimator is preferable to the maximum likelihood method, and both are outdone by
our new estimator, which even keeps up with the unbiased moment estimator for the smaller
values of the parameter ϑ0 . Notice that the maximum likelihood and moment estimator tend to
underestimate the parameter, while the other procedures tend to a slight overestimation. As for
the MSE, the moment estimator and our new method perform similarly and follow the maximum
likelihood estimator closely. The minimum Cramér-von Mises distance estimator is a bit behind.
To summarize, the maximum likelihood and moment estimator for the Rayleigh parameter are
both simple and very convincing, but the newly proposed method keeps up (for suitably chosen
tuning parameter) and appears to find a good compromise between bias and MSE. The only
more serious weakness shows up for the large parameter value ϑ0 = 10 and small sample sizes n = 10, 25.

8 Example: The Burr Type XII distribution

Consider the density function of the Burr distribution,


$$p_\vartheta(x) = c\,k\,x^{c-1}\big(1 + x^c\big)^{-k-1}, \qquad x > 0,$$

where ϑ = (c, k) ∈ (0, ∞)² = Θ. The corresponding distribution function is given through

$$P_\vartheta(x) = 1 - \big(1 + x^c\big)^{-k}, \qquad x > 0.$$

It is not exactly trivial, but still straightforward, to prove that this is an admissible distribution in
terms of the setting in Section 2 [see also Betsch and Ebner (2019a)] and the regularity conditions
(R1) – (R3). With q = 2 and weight function w(t) = e−at, a > 0, the function ψn,2(ϑ) =
‖ηn(·, ϑ)‖_{L²} from Section 3 (see also Section 2) can be calculated explicitly as

 2 X  
2 2A(j) (c, k)  B(j) (c, k) −aX 
ψn,2 (ϑ) = 2 A(ℓ) (c, k) 3
1 − e−aX(j) + 2
e (j) + e−aX(ℓ)
n a a
1≤j<ℓ≤n
 
c − 2 −aX(j) X(j) −aX(j) B(j) (c, k) −aX
+ 2 e − e + e (ℓ)
a a a
n   
1 X 2 2X(j) −aX 2 −aX(j) 2
+ 2 A(j) (c, k) − 2 e (j) − 3e + 3
n a a a
j=1

2(j − 1) c −aX(j) 2B(j) (c, k) −aX
+ A(j) (c, k) e + e (j)
a2 a
n n
2c X −aX(j) 1 X −aX(j)
+ je − e ,
a n2 a n2
j=1 j=1

where
$$A_{(j)}(c, k) = c\,(k+1)\,\frac{X_{(j)}^{c-1}}{1 + X_{(j)}^{c}} - \frac{c-1}{X_{(j)}}$$

and

$$B_{(j)}(c, k) = -\,c\,(k+1)\,\frac{X_{(j)}^{c}}{1 + X_{(j)}^{c}},$$

and where X(1) < ... < X(n) denotes the ordered sample. Our estimator ϑ̂(a)n,2 = (ĉn^(a), k̂n^(a)), as
kn , as
defined in (2.3), can be calculated as the minimizer of the above function over Θ. We implement
the ’L-BFGS-B’-method [L-BFGS-B algorithm, see Byrd et al. (1995) and Zhu et al. (1997)] im-
plemented in the ’optimize.minimize’ function of ’scipy’ to solve the minimization numerically,
using (1, 1) as initial values. (Note that in preliminary simulations we have tried several other
optimization routines, like a truncated Newton algorithm or the ’SLSQP’ from previous sections,
but the ’L-BFGS-B’-method appeared to be the most reliable for our purpose.) As competitors to
our estimator we consider the maximum likelihood estimator with implementation as suggested

by Shah and Gokhale (1993) [for a different algorithm, see Wingo (1983)]. More precisely we use
the Newton-Raphson method (with initial value c = 1) to find the root
 −1 
n n n
n X 1 X X Xjc log(Xj ) !
+ log(Xj ) −  log(1 + Xjc ) + 1 · = 0,
c n 1 + Xjc
j=1 j=1 j=1

cM
giving an estimate bn
L for c which we then introduce into

 −1
n
X
b 1 c
b M L
knM L = log(1 + Xj n ) .
n
j=1

Both relations are easily derived from the likelihood equations. Note that there have been further
contributions to the estimation of the Burr parameters [see Schmittlein (1983), Shah and Gokhale
(1993), Wingo (1993), and Wang and Cheng (2010)], but letting our new method compete against
a highly modified version of some estimator would somewhat bias the results (as similar improve-
ments via numerical sophistication might be conceived for the new estimator as well). Instead
we let the methods compete in rather unmodified versions to see which algorithm fares better
in terms of the statistical methodology behind the approaches. Additionally, we consider the
minimum Cramér-von Mises distance estimator, which can be calculated from
$$\hat{\vartheta}_n^{CvM} = \big(\hat{c}_n^{CvM}, \hat{k}_n^{CvM}\big) = \arg\min\bigg\{\int_0^\infty\big(\hat{F}_n(t) - P_\vartheta(t)\big)^2\,\mathrm{d}P_\vartheta(t)\;\bigg|\;c, k > 0\bigg\}$$
$$= \arg\min\Bigg\{\frac{1}{n}\sum_{j=1}^{n}\big(1 + X_{(j)}^{c}\big)^{-k}\bigg[\frac{2j-1}{n} - 2 + \big(1 + X_{(j)}^{c}\big)^{-k}\bigg]\;\Bigg|\;c, k > 0\Bigg\}$$

(the minimization is solved numerically, similar to our new estimator). Like for the exponential-
and Rayleigh distribution, we approximate bias and MSE of these estimators and compare them
in Tables 5 and 6 below. For each value of ϑ0 and n, the first line corresponds to the bias/MSE
of the estimator for the c-parameter, and the second line corresponds to the k-parameter.
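As an illustration of the maximum likelihood computation just described, the following sketch solves the profile likelihood equation in c and then evaluates k̂nML; we use scipy's brentq root finder on a bracket instead of the Newton-Raphson iteration mentioned above, which is a substitution on our part, and the sample is generated by inverting Pϑ. All function names are hypothetical.

```python
import numpy as np
from scipy import optimize

def burr_ml(x, c_bracket=(1e-3, 50.0)):
    """Burr Type XII MLE via the profile likelihood equation in c, then k_hat."""
    n = len(x)
    log_x = np.log(x)

    def score_c(c):
        xc = x**c
        k_hat = 1.0 / np.mean(np.log1p(xc))          # k as a function of c
        return n/c + log_x.sum() - (k_hat + 1.0) * np.sum(xc * log_x / (1.0 + xc))

    c_hat = optimize.brentq(score_c, *c_bracket)      # root of the profile score
    k_hat = 1.0 / np.mean(np.log1p(x**c_hat))
    return c_hat, k_hat

# Illustration: Burr samples with c = 2, k = 5, generated by inverting P_theta.
rng = np.random.default_rng(0)
u = rng.uniform(size=200)
sample = ((1.0 - u)**(-1.0/5.0) - 1.0)**(1.0/2.0)     # inverse of 1 - (1 + x^c)^(-k)
print(burr_ml(sample))
```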

As before, it becomes evident that our new procedure outperforms the maximum likelihood
and minimum Cramér-von Mises distance estimator in terms of the bias. Unlike for the expo-
nential distribution, the dependence on the tuning parameter ’a’ is less clear: For a great deal of
parameter values and sample sizes, the estimator ϑ̂(3)n,2 yields the best result, but in some cases
(mostly for the k-parameter) the estimator ϑ̂(0.25)n,2, with tuning parameter from the other end of
the spectrum, performs best. Thus, if one seeks to minimize the bias, an optimal, data dependent
choice of the tuning parameter would be useful (more on this in Section 9).

ϑ0 = (c0, k0)    n      ϑ̂nML      ϑ̂nCvM      ϑ̂n,2 (a=0.25)   ϑ̂n,2 (a=0.5)   ϑ̂n,2 (a=1)   ϑ̂n,2 (a=2)   ϑ̂n,2 (a=3)
10 – 0.142 0.0094 -0.1608 0.0375 0.0636 0.0585
– 1.3014 0.175 -0.2579 0.0243 0.1635 0.1382
25 0.0406 0.0451 -0.0377 -0.196 0.024 0.0223 0.0152
0.1057 0.1311 0.0184 -0.301 0.0503 0.049 0.0131
0.8 
2 50 0.0197 0.0207 -0.0411 -0.1673 0.0121 0.0102 0.0034
0.0491 0.0565 -0.0089 -0.2435 0.0256 0.0214 -0.0122
100 0.0097 0.0102 -0.0307 -0.1102 0.0062 0.0056 0.001
0.0234 0.0266 -0.0128 -0.1505 0.0125 0.0125 -0.0095
200 0.0046 0.005 -0.012 -0.051 0.003 0.0029 0.0012
0.0114 0.0131 -0.0048 -0.066 0.0064 0.0071 -0.0001
10 0.2956 0.3458 -0.2755 -1.0773 0.2152 0.1987 0.1841
2.8551 37.1075 1.6188 -2.1985 2.35 2.287 2.1681
25 0.1027 0.1082 -0.1434 -1.2772 0.0725 0.0655 0.0618
0.6208 0.8619 0.2341 -2.6011 0.5033 0.4754 0.4647
2
5 50 0.0476 0.0492 -0.0347 -1.4268 0.0326 0.0298 0.0283
0.2669 0.3415 0.1278 -2.809 0.2126 0.2039 0.2021
100 0.0233 0.0233 0.0079 -1.5877 0.0159 0.0145 0.0138
0.1285 0.1565 0.0946 -3.0394 0.1025 0.0989 0.0983
200 0.012 0.0113 0.0089 -1.7622 0.0082 0.0076 0.0073
0.0627 0.0732 0.0526 -3.3064 0.05 0.0485 0.0483
10 – 2.1411 1.0267 1.0622 1.0927 1.1037 1.1374
– 0.0451 0.0322 0.0313 0.0327 0.038 0.0432
25 0.3731 0.4635 0.3177 0.3233 0.3167 0.3101 0.3146
0.0143 0.0113 0.0096 0.0095 0.0106 0.013 0.0151
5 
0.8 50 0.1748 0.2071 0.1519 0.153 0.1488 0.1453 0.1463
0.0063 0.0046 0.0038 0.0039 0.0045 0.0057 0.0067
100 0.0835 0.096 0.0731 0.0729 0.0708 0.069 0.0693
0.0031 0.0023 0.0019 0.0019 0.0022 0.0028 0.0033
200 0.0421 0.0481 0.0375 0.037 0.036 0.0348 0.0347
0.0016 0.0012 0.001 0.001 0.0012 0.0015 0.0017

Table 5: Approximated biases calculated with 100,000 Burr-distributed Monte Carlo samples for
sample sizes n = 10, 25, 50, 100, 200.

ϑ0 = (c0, k0)    n      ϑ̂nML      ϑ̂nCvM      ϑ̂n,2 (a=0.25)   ϑ̂n,2 (a=0.5)   ϑ̂n,2 (a=1)   ϑ̂n,2 (a=2)   ϑ̂n,2 (a=3)
10 – 0.179 0.1398 0.2562 0.1209 0.1092 0.1199
– 22686.9274 1.9083 0.9712 1.0083 2.2673 2.5852
25 0.0228 0.0293 0.0724 0.2017 0.031 0.0344 0.0406
0.2354 0.3578 0.2867 0.3025 0.3438 0.513 0.7007
0.8 
2 50 0.009 0.0121 0.0518 0.1577 0.0137 0.0159 0.02
0.0957 0.1242 0.1437 0.1853 0.1557 0.2298 0.343
100 0.0042 0.0056 0.0337 0.1009 0.0066 0.0076 0.01
0.0483 0.0545 0.0781 0.1136 0.0737 0.1066 0.171
200 0.002 0.0027 0.0144 0.0472 0.0032 0.0036 0.0045
0.207 0.0255 0.0368 0.0582 0.0358 0.0507 0.0769
10 0.4819 0.9394 1.0985 2.6895 0.4657 0.4352 0.4291
260.2839 1671094.1072 88.3851 22.0669 164.7637 180.7299 168.6383
25 0.1139 0.1717 0.4778 2.7278 0.1143 0.1135 0.1163
3.5036 10.6678 4.6982 9.531 3.3494 3.3022 3.3652
2
5 50 0.0477 0.0699 0.1719 2.9353 0.0497 0.0504 0.0526
1.059 1.8952 1.5019 10.0855 1.0549 1.089 1.1556
100 0.0221 0.0321 0.041 3.216 0.0237 0.0243 0.0254
0.4312 0.694 0.5032 10.9842 0.4435 0.4665 0.5016
200 0.0107 0.0153 0.0119 3.5414 0.0116 0.0119 0.0125
0.1954 0.2996 0.2009 12.0603 0.2039 0.2156 0.233
10 – 63.9726 14.0301 14.8418 16.3923 16.7479 17.2525
– 0.2941 0.1349 0.137 0.1405 0.1479 0.1565
25 1.6759 2.7939 1.6869 1.6671 1.631 1.6163 1.6767
0.0403 0.0496 0.0443 0.045 0.0461 0.0483 0.0508
5 
0.8 50 0.6177 0.8648 0.6609 0.646 0.6299 0.6307 0.6597
0.0189 0.0227 0.0212 0.0215 0.022 0.0229 0.0239
100 0.2746 0.3641 0.2958 0.2891 0.2827 0.2844 0.2992
0.0091 0.0108 0.0103 0.0105 0.0107 0.0111 0.0116
200 0.1293 0.1671 0.1395 0.1366 0.1338 0.135 0.143
0.0045 0.0054 0.0052 0.0052 0.0053 0.0055 0.0058

Table 6: Approximated MSE values calculated with 100,000 Burr-distributed Monte Carlo sam-
ples for sample sizes n = 10, 25, 50, 100, 200.

Both in the bias and in the MSE simulation, the maximum likelihood estimator ran into
computational issues for sample size n = 10. The minimum Cramér-von Mises distance estimator
is more stable in this regard, but still a lot less so than our new estimators which show notably
slighter outliers only for large values of the Burr parameters. Once samples get larger (n = 50+),
the asymptotic optimality properties of the maximum likelihood estimator appear to kick in, as
its performance stabilizes. Still for suitably chosen tuning parameter, our estimators are very
close in virtually all instances. The small sample behavior of the maximum likelihood estimator
poses a huge drawback for applications and the problem is well-known. Our new method provides
the seemingly better alternative, but of course larger-scale simulations could give further insight
on the results for large values of the Burr parameters. Moreover, the minimization routines,
and thus the whole estimation routines, would certainly profit from a better, maybe even data
dependent, choice of the initial values, which we set to unity generically.

9 Notes and comments

Note that there remain some problems for further research on our newly proposed estimators,
the discussion or extension of which would be too extensive for this contribution. First, for all
estimators we considered explicitly, we incorporate a tuning parameter ’a’ on which the perfor-
mance depends strongly. It would be beneficial to have an adaptive choice of this parameter
[see Allison and Santana (2015) or Tenreiro (2019) who discuss such a method in the context of
goodness-of-fit testing problems], probably adaptable to which criterion (minimal bias etc.) the
estimator should satisfy. Also we have not used in practice the flexibility gained by providing all
results for the general Lq -spaces, but restricted our attention to the case q = 2, mostly because
of the explicit formulae obtainable in that case. In the context of deriving results for a → ∞, we
obtained another consistent estimator for the Rayleigh parameter, and it would be interesting to
see if such results can be derived for other distributions. We have proven in a quite usual setting
the consistency of our estimators. Surely, a limit theorem of the type
$$s(n)\big(\hat{\vartheta}_{n,q} - \vartheta_0\big)\;\xrightarrow{\;d\;}\;P,$$

where s(n) −→ ∞, as n → ∞, and where P is some limit distribution (e.g. the normal dis-
tribution) is desirable. Such a result would pave the way for constructing confidence bands for
the true parameter based on our method. Moreover, a larger-scale simulation study, involving
more underlying parameters, sample sizes, and distributions could provide further insight into
the estimation method. Improvements from a numerical point of view would, of course, benefit
the approach. From a theoretical perspective, an important step in this last direction is to study
whether the minimization method will always find a global minimum, or if not, in which situa-
tions it is likely to get stuck in some local minimum.

Note that Betsch and Ebner (2019a) also give characterization results for density functions
on bounded intervals or on the whole real line. These can be used to construct similar estimation
methods in the corresponding cases. To sketch the idea in the case of parametric models on the
whole real line, assume that the support of each density function pϑ in PΘ is the whole real line
(and that some mild regularity conditions hold). Let X̃ be a real-valued random variable with

$$\mathbb{E}\bigg[\bigg|\frac{p_\vartheta'(\widetilde{X})}{p_\vartheta(\widetilde{X})}\bigg|\,\big(|\widetilde{X}| + 1\big)\bigg] < \infty, \qquad \vartheta \in \Theta,$$

and consider

$$\widetilde{\eta}(t, \vartheta) = \mathbb{E}\bigg[\frac{p_\vartheta'(\widetilde{X})}{p_\vartheta(\widetilde{X})}\,\big(t - \widetilde{X}\big)\,\mathbf{1}\{\widetilde{X} \le t\}\bigg] - F_{\widetilde{X}}(t)$$
for (t, ϑ) ∈ R×Θ. Then, similar to our elaborations in Section 2, Theorem 4.1 of Betsch and Ebner
(2019a) shows that X̃ ∼ pϑ0 if, and only if, η̃(t, ϑ0) = 0 for every t ∈ R. Therefore, if, initially,
X̃ ∼ pϑ0, then ‖η̃(·, ϑ)‖_{L^q} = 0 if, and only if, ϑ = ϑ0. Here, L^q = L^q(R, B¹, w̃(t) dt), 1 ≤ q < ∞,
with a positive weight function w̃ satisfying
$$\int_{\mathbb{R}}\big(|t|^q + 1\big)\,\widetilde{w}(t)\,\mathrm{d}t < \infty.$$

Thus, with
$$\widetilde{\eta}_n(t, \vartheta) = \frac{1}{n}\sum_{j=1}^{n}\frac{p_\vartheta'(\widetilde{X}_j)}{p_\vartheta(\widetilde{X}_j)}\,\big(t - \widetilde{X}_j\big)\,\mathbf{1}\{\widetilde{X}_j \le t\} - \frac{1}{n}\sum_{j=1}^{n}\mathbf{1}\{\widetilde{X}_j \le t\},$$

a reasonable estimator for ϑ0 is



$$\widetilde{\vartheta}_{n,q} = \arg\min\big\{\|\widetilde{\eta}_n(\cdot\,, \vartheta)\|_{L^q}\;\big|\;\vartheta \in \Theta\big\}.$$

Using the results from Section 3, we could prove existence and measurability for this type of
estimator, and give a formal definition as in (3.2). Moreover, a classical proof via the law of
large numbers for random elements in separable Banach spaces and the Arzelà-Ascoli theorem
[considering the modulus of continuity, as employed by Billingsley (1968)] yields the convergence
results from Lemma 4.1 for ψ̃n,q(ϑ) = ‖η̃n(·, ϑ)‖_{L^q}, but with all convergences only in probability.
That result can then be used to derive consistency as in Theorem 4.2, again with all convergences
only in probability. However, choosing a fixed (i.e. parameter-independent) weight function on R
with a mere scale-tuning, as we employ it throughout (using the weight t 7→ e−at ), appears not to
be sufficient to account for the possible location-dependence of the model. Thus, in simulations
(for instance with the Cauchy distribution) the problem is empirically more involved and therefore
not addressed in the work at hand.
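To make the real-line construction concrete, here is a minimal sketch of η̃n and an L²-objective for a normal location-scale model; the Gaussian weight w̃(t) = e^{−t²} (which satisfies the integrability condition above) and all names are our own choices, and, as noted above, a fixed weight of this kind need not be adequate for general location families.

```python
import numpy as np
from scipy import integrate, optimize

def eta_tilde_n(t, theta, x, score):
    """Empirical version for real-line models; score(x, theta) = p'_theta(x)/p_theta(x)."""
    return np.mean(score(x, theta) * (t - x) * (x <= t)) - np.mean(x <= t)

def objective(theta, x, score, a=1.0):
    """Integrate eta_tilde_n(., theta)^2 against the weight w(t) = exp(-a*t^2)."""
    f = lambda t: eta_tilde_n(t, theta, x, score)**2 * np.exp(-a*t**2)
    return integrate.quad(f, -np.inf, np.inf, limit=200)[0]

# Illustration: normal model with unknown mean and standard deviation,
# for which p'_theta(x)/p_theta(x) = -(x - mu)/sigma^2.
normal_score = lambda x, th: -(x - th[0]) / th[1]**2
rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=2.0, size=100)
res = optimize.minimize(objective, x0=[0.0, 1.0], args=(sample, normal_score),
                        method="Nelder-Mead")
print(res.x)
```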

A Remarks and proofs concerning measurability

Remark A.1 (Comments on Theorem 3.1). There is, in fact, another result which gives mea-
surable selections without the completeness assumption on the probability space [as provided by

Brown and Purves (1973)], but it requires σ-compactness of the parameter space, thus essentially
reducing the study to euclidean parameters (a Banach space is σ-compact if, and only if, it is of
finite dimension, which follows easily from Baire’s category theorem). Of course this is enough
for our purposes, but currently the interest in statistical inference for infinite dimensional models
grows remarkably. Hence if a statistician was to investigate measurability of an estimator for some
infinite dimensional quantity, she would have to resort to a result in the generality of Theorem 3.1.
Another reason for us to build on Theorem 3.1 is that other measurability results known to us do
not quite fit the construction of our estimators. For instance, Sahler (1970) considers minimum
discrepancy estimators, where discrepancies are (certain) functions on the Cartesian product of
a suitable set of probability measures with itself. It is (formally) not possible to identify such
a set of probability measures in our setting, as we ought to introduce the empirical distribution
of a sample into the discrepancy function, while only considering parametric distributions with
a continuously differentiable density. Even though we believe this to be a purely formal issue
which might be resolved to render results from Sahler (1970) applicable, additional caution is
needed that Theorem 3.1 does not require. What is more, Theorem 3.1 also provides measurabil-
ity properties for infima of functions, which we need for the consistency investigations. Likewise,
the setting considered by Pfanzagl (1969) does not cover our estimators.

Note that since completing (the σ-field of) an underlying probability space does not interfere
with measurability properties of random maps, nor does it meddle with push-forward measures,
the corresponding assumption in Theorem 3.1 is no restriction. If S is a complete, separable
metric space and the map Γ from Theorem 3.1 takes compact subsets of S as values, the condi-
tion imposed on the graph is equivalent to Γ being measurable with respect to the Borel-σ-field
generated by the Hausdorff topology [see Theorems III.2 and III.30 by Castaing and Valadier
(1977)]. Likewise, if S is a locally compact, complete, separable metric space and Γ maps into
the closed subsets of S, the condition is equivalent to Γ being measurable with respect to the
Borel-σ-field generated by the Fell topology [this can be proven using results from Beer (1993)
and Castaing and Valadier (1977)].

Proof of Lemma 3.2. First recall the following lemma on product-measurability, the proof of
which is an easy exercise.

Lemma A.2. Let (S, A, µ) be a measure space, I ⊂ R an open interval, and let (T , OT ) be a
topological vector space. Furthermore, let h : S × I → T be a map such that

• s ↦ h(s, x) is (A, B(T ))-measurable for every x ∈ I, and

• x ↦ h(s, x) is right-continuous for every s ∈ S.


Then h is (A ⊗ B(I), B(T ))-measurable.
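For completeness, here is one standard way to carry out this exercise; it is a sketch (stated for metrizable T , which covers the application with T = R below) rather than a verbatim argument from the literature. For each k ∈ N choose a countable partition of I into intervals (c_{k,i}, c_{k,i+1}], i ∈ Z, with all endpoints in I and lengths at most 1/k (possible since I is an open interval), and set
\[
  h_k(s, x) \;=\; \sum_{i \in \mathbb Z} h\big(s, c_{k,i+1}\big)\, \mathbf 1\big\{ x \in ( c_{k,i}, c_{k,i+1} ] \big\}, \qquad (s, x) \in S \times I .
\]
Each h_k is (A ⊗ B(I), B(T ))-measurable, since on every set S × (c_{k,i}, c_{k,i+1}] it coincides with the measurable map (s, x) ↦ h(s, c_{k,i+1}), and these sets form a countable measurable partition of S × I. Moreover, the evaluation points decrease to x from above as k → ∞, so h_k(s, x) → h(s, x) for every (s, x) by right-continuity, and h is measurable as a pointwise limit of measurable maps into a metrizable space.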

Notice that for fixed (t, ϑ) ∈ (0, ∞) × Θ the map ω ↦ ηn(ω, t, ϑ) is (F, B¹)-measurable, and
for fixed (ω, t) ∈ Ω × (0, ∞) the map ϑ ↦ ηn(ω, t, ϑ) is continuous. By a statement analogous to
Lemma A.2 [see for instance Lemma III.14 by Castaing and Valadier (1977)], (ω, ϑ) ↦ ηn(ω, t, ϑ)
is (F ⊗ B(Θ), B¹)-measurable for fixed t > 0. Since t ↦ ηn(ω, t, ϑ) is continuous for fixed
(ω, ϑ) ∈ Ω × Θ, Lemma A.2 implies that ηn is (F ⊗ B(0, ∞) ⊗ B(Θ), B¹)-measurable. Consequently,
the maps
\[
  (\omega, \vartheta) \;\mapsto\; \big\langle \eta_n(\omega, \cdot\,, \vartheta),\, g \big\rangle_{L^q}
\]
are measurable for every g ∈ Lq by Fubini’s theorem, and since Lq is a separable space, the map-
ping (ω, ϑ) ↦ ηn(ω, · , ϑ) is (F ⊗ B(Θ), B(Lq))-measurable [cf. Corollary 1.1.2 of Hytönen et al.
(2016)].

Remark A.3 (Γn,q from (3.1) is closed). Note that (R1) and Fatou’s lemma imply that
ϑ ↦ ψn,q(ω, ϑ) is lower semi-continuous. Thus if ϑ(k) ∈ Γn,q(ω), k ∈ N, converges (with respect
to the metric in Θ) to ϑ∗ ∈ Θ as k → ∞, then
\[
  \psi_{n,q}(\omega, \vartheta^*) \;=\; \big\| \eta_n(\omega, \cdot\,, \vartheta^*) \big\|_{L^q}
  \;\le\; \liminf_{k \to \infty} \big\| \eta_n\big(\omega, \cdot\,, \vartheta^{(k)}\big) \big\|_{L^q}
  \;\le\; m_{n,q}(\omega) + \varepsilon_n(\omega),
\]
that is, ϑ∗ ∈ Γn,q (ω), so Γn,q (ω) is closed in Θ for every ω ∈ Ω. Hence we can note that if Θ
is closed, and therefore locally compact [cf. p.42 of Kuratowski (1968)] and complete, Γn,q is a
random element in the space of all closed subsets of Θ endowed with the Fell topology (see also
Remark A.1).

B Proof of Lemma 4.1

First note that for any non-empty closed subset F of K,
\[
  \Big| \inf_{\vartheta \in F} \psi_{n,q}(\vartheta) \;-\; \inf_{\vartheta \in F} \psi_{q}(\vartheta) \Big|
  \;\le\; \sup_{\vartheta \in K} \big| \psi_{n,q}(\vartheta) - \psi_{q}(\vartheta) \big|,
\]

so the second claim of Lemma 4.1 follows from the first. For the first claim, let K ≠ ∅ be a
compact subset of Θ. Note that
\[
\begin{aligned}
  \sup_{\vartheta \in K} \big| \psi_{n,q}(\vartheta) - \psi_{q}(\vartheta) \big|
  &\le \sup_{\vartheta \in K} \big\| \eta_n(\cdot\,, \vartheta) - \eta(\cdot\,, \vartheta) \big\|_{L^q} \\
  &\le C \cdot \sup_{\vartheta \in K,\, t > 0} \left| \frac{1}{n} \sum_{j=1}^{n} \frac{p_\vartheta'(X_j)}{p_\vartheta(X_j)} \min\{X_j, t\}
     - \mathbb E\left[ \frac{p_\vartheta'(X)}{p_\vartheta(X)} \min\{X, t\} \right] \right| \\
  &\quad + C \cdot \sup_{t > 0} \left| \frac{1}{n} \sum_{j=1}^{n} \mathbf 1\{X_j \le t\} - F_X(t) \right|,
  \qquad (B.1)
\end{aligned}
\]
where C = ( ∫_0^∞ w(t) dt )^{1/q}. The second term on the right-hand side of (B.1) converges to 0
almost surely by the classical Glivenko-Cantelli theorem. For a function f : (0, ∞) → R we write
Pn f = (1/n) Σ_{j=1}^{n} f(Xj) and PX f = E[f(X)]. Then the first term on the right-hand side of (B.1)
can be written as

\[
\begin{aligned}
  \sup_{\vartheta \in K,\, t > 0} \left| \frac{1}{n} \sum_{j=1}^{n} \frac{p_\vartheta'(X_j)}{p_\vartheta(X_j)} \min\{X_j, t\}
    - \mathbb E\left[ \frac{p_\vartheta'(X)}{p_\vartheta(X)} \min\{X, t\} \right] \right|
  &= \sup_{\vartheta \in K,\, t > 0} \big| P_n f_{t,\vartheta} - P_X f_{t,\vartheta} \big| \\
  &= \sup_{f \in \mathcal H_\Theta} \big| P_n f - P_X f \big|,
  \qquad (B.2)
\end{aligned}
\]
where ft,ϑ(x) = (p′ϑ(x)/pϑ(x)) min{x, t}, x > 0, is a measurable function for every ϑ ∈ K and t > 0, and
HΘ = { ft,ϑ | ϑ ∈ K, t > 0 } is the collection of all such functions. Note that the supremum in
(B.2) is finite (P-a.s.) by (R1), (2.1), and (R3), and that the terms in (B.2) constitute measurable
maps from (Ω, F) to (R, B¹) by Theorem 3.1.
As is commonly done, we denote, for given functions l, u : (0, ∞) → R, by [l, u] the set
of all functions f such that l ≤ f ≤ u pointwise. An ε-bracket with respect to L¹(PX) =
L¹((0, ∞), B(0, ∞), PX) is one such set [l, u] with ‖u − l‖L¹(PX) < ε. The bracketing number
N[ ](ε, HΘ, L¹(PX)) of HΘ is the minimum number of ε-brackets needed to cover HΘ. If the
bracketing number of HΘ is finite for every ε > 0, then HΘ is a Glivenko-Cantelli class, that is,
sup_{f ∈ HΘ} |Pn f − PX f| → 0 almost surely [see Theorem 2.4.1 by van der Vaart and Wellner
(2000)], which, combined with (B.1) and (B.2), implies the claim. Note that the result by
van der Vaart and Wellner (2000) is formulated to give convergence outer almost surely, but as
we work on a complete probability space, the transition to an outer probability measure is not
necessary (since we can provide enough measurability on a complete probability space and the
notions of almost sure convergence and outer almost sure convergence agree).
Thus, to prove Lemma 4.1, it remains to show that the bracketing numbers of HΘ are finite.
The following argument combines ideas from the classical Glivenko-Cantelli theorem and from
Example 19.7 of van der Vaart (1998). Let ε > 0 be arbitrary, and set δ = ε^{1/α} (4 E[H(X) X])^{−1/α},
where H and α are as in (R3). Since K is compact there exist ϑ1, . . . , ϑm ∈ K, m = mε ∈ N,
such that ⋃_{i=1}^{m} Bδ(ϑi) ⊃ K. Additionally, since for each i = 1, . . . , m the function
\[
  [0, \infty) \ni t \;\mapsto\; E_i(t) \;=\; \mathbb E\left[ \left| \frac{p_{\vartheta_i}'(X)}{p_{\vartheta_i}(X)} \right| \min\{X, t\} \right]
\]
is continuous, monotonically increasing, and satisfies Ei(0) = lim_{t ↘ 0} Ei(t) = 0, as well as
Ei(∞) = lim_{t ↗ ∞} Ei(t) = E[ |p′ϑi(X)/pϑi(X)| X ] < ∞, there exist 0 = t0 < t1 < . . . < tℓ = ∞, ℓ = ℓε ∈ N,
such that
\[
  E_i(t_j) - E_i(t_{j-1}) \;<\; \varepsilon/4
\]
for j = 1, . . . , ℓ and i = 1, . . . , m. Upon setting f0,ϑ(x) = 0, f∞,ϑ(x) = (p′ϑ(x)/pϑ(x)) x, for x > 0 and
ϑ ∈ K, we define the brackets
\[
  H_{i,j} \;=\; \Big[\, f_{t_{j-1},\vartheta_i} - \big| f_{t_j,\vartheta_i} - f_{t_{j-1},\vartheta_i} \big| - \delta^\alpha \cdot H^*,\;\;
  f_{t_{j-1},\vartheta_i} + \big| f_{t_j,\vartheta_i} - f_{t_{j-1},\vartheta_i} \big| + \delta^\alpha \cdot H^* \,\Big],
\]
for j = 1, . . . , ℓ and i = 1, . . . , m, where H*(x) = H(x) · x, x > 0. These brackets cover HΘ.


Indeed, if ϑ ∈ K and t > 0 are arbitrary, there exist i ∈ {1, . . . , m} and j ∈ {1, . . . , ℓ} such that
ϑ ∈ Bδ(ϑi) and tj−1 ≤ t < tj, so ft,ϑ ∈ Hi,j since for every x > 0
\[
\begin{aligned}
  \big| f_{t,\vartheta}(x) - f_{t_{j-1},\vartheta_i}(x) \big|
  &\le \big| f_{t,\vartheta}(x) - f_{t,\vartheta_i}(x) \big| + \big| f_{t,\vartheta_i}(x) - f_{t_{j-1},\vartheta_i}(x) \big| \\
  &= \left| \frac{p_\vartheta'(x)}{p_\vartheta(x)} - \frac{p_{\vartheta_i}'(x)}{p_{\vartheta_i}(x)} \right| \min\{x, t\}
     + \left| \frac{p_{\vartheta_i}'(x)}{p_{\vartheta_i}(x)} \right| \big( \min\{x, t\} - \min\{x, t_{j-1}\} \big) \\
  &\le H(x) \cdot x \cdot \| \vartheta - \vartheta_i \|^\alpha
     + \left| \frac{p_{\vartheta_i}'(x)}{p_{\vartheta_i}(x)} \right| \big( \min\{x, t_j\} - \min\{x, t_{j-1}\} \big) \\
  &\le \delta^\alpha \cdot H^*(x) + \big| f_{t_j,\vartheta_i}(x) - f_{t_{j-1},\vartheta_i}(x) \big|.
\end{aligned}
\]

Moreover, the brackets Hi,j are ε-brackets with respect to L¹(PX), as
\[
\begin{aligned}
  2\, \Big\| \big| f_{t_j,\vartheta_i} - f_{t_{j-1},\vartheta_i} \big| + \delta^\alpha \cdot H^* \Big\|_{L^1(P_X)}
  &= 2\, \mathbb E\Big[ \big| f_{t_j,\vartheta_i}(X) - f_{t_{j-1},\vartheta_i}(X) \big| + \delta^\alpha \cdot H(X)\, X \Big] \\
  &= 2\, \big( E_i(t_j) - E_i(t_{j-1}) \big) + \frac{\varepsilon}{2} \\
  &< \varepsilon,
\end{aligned}
\]
where the choice of δ enters through 2 δ^α E[H(X) X] = ε/2. Hence N[ ](ε, HΘ, L¹(PX)) ≤ mε · ℓε < ∞.

C Proof of Remark 4.3

From Lemma 4.1 we know that ψn,q(ϑ) → ψq(ϑ) P-a.s., as n → ∞, for each ϑ ∈ Θ. Since ϑ0 ∈ Θ◦,
there exists a δ > 0 such that B2δ(ϑ0) ⊂ Θ. Then the closed ball B = Bδ(ϑ0) also lies in Θ.
Denote by R = ∂Bδ(ϑ0) the boundary of that ball. It follows from Lemma 4.1 that
\[
  \inf_{\vartheta \in B} \psi_{n,q}(\vartheta) \;\longrightarrow\; \inf_{\vartheta \in B} \psi_{q}(\vartheta) = 0
  \qquad \text{and} \qquad
  \inf_{\vartheta \in R} \psi_{n,q}(\vartheta) \;\longrightarrow\; \inf_{\vartheta \in R} \psi_{q}(\vartheta) > 0,
\]
both P-a.s., as n → ∞, where the positiveness of the last term follows from (4.1). Now, let ε > 0
and choose n0 = n0(ε) ∈ N such that
\[
  \mathbb P\Big( \inf_{\vartheta \in B} \psi_{n,q}(\vartheta) + \varepsilon_n < \inf_{\vartheta \in R} \psi_{n,q}(\vartheta) \Big)
  \;\ge\; 1 - \frac{\varepsilon}{2}, \qquad n \ge n_0.
\]
Next, note that if inf_{ϑ∈B} ψn,q(ϑ) + εn < inf_{ϑ∈R} ψn,q(ϑ), then ψn,q has a local minimum in Bδ(ϑ0)
(since εn > 0) which, by strict convexity, is the unique global minimum. Additionally, by convexity
we then have inf_{ϑ∈B} ψn,q(ϑ) + εn < inf_{ϑ∈Θ\Bδ(ϑ0)} ψn,q(ϑ). On the other hand, ϑ̂n,q ∈ Θ \ Bδ(ϑ0) implies
\[
  \inf_{\vartheta \in \Theta \setminus B_\delta(\vartheta_0)} \psi_{n,q}(\vartheta)
  \;\le\; \psi_{n,q}\big( \widehat\vartheta_{n,q} \big)
  \;\le\; \inf_{\vartheta \in \Theta} \psi_{n,q}(\vartheta) + \varepsilon_n
  \;=\; \inf_{\vartheta \in B} \psi_{n,q}(\vartheta) + \varepsilon_n.
\]

Consequently, for all n ≥ n0,
\[
\begin{aligned}
  1 - \frac{\varepsilon}{2}
  &\le \mathbb P\Big( \inf_{\vartheta \in B} \psi_{n,q}(\vartheta) + \varepsilon_n < \inf_{\vartheta \in R} \psi_{n,q}(\vartheta) \Big) \\
  &\le \mathbb P\Big( \inf_{\vartheta \in B} \psi_{n,q}(\vartheta) + \varepsilon_n < \inf_{\vartheta \in \Theta \setminus B_\delta(\vartheta_0)} \psi_{n,q}(\vartheta) \Big) \\
  &\le \mathbb P\big( \widehat\vartheta_{n,q} \in B \big).
\end{aligned}
\]
Since { P^{ϑ̂n,q} | n ≤ n0 } is a finite set of measures on R^d, there exists a compact set K = Kε ⊂ R^d
such that P(ϑ̂n,q ∈ K) ≥ 1 − ε/2 for all n ≤ n0. The set K ∩ B ⊂ Θ is a compact subset of R^d and
thus also of Θ, for a compact metric space is a compact subset of every metric space it embeds
into continuously [see p.21, Theorem 3, of Kuratowski (1968)]. By choice of the sets,
\[
  \mathbb P\big( \widehat\vartheta_{n,q} \in K \cap B \big) \;\ge\; 1 - \varepsilon,
\]

which is the claim.

References

Allison, J. S. and Santana, L. (2015). On a data-dependent choice of the tuning parameter


appearing in certain goodness-of-fit tests. Journal of Statistical Computation and Simulation,
85(16):3276–3288.

Baringhaus, L., Gürtler, N., and Henze, N. (2000). Weighted integral test statistics and com-
ponents of smooth tests of fit. Australian & New Zealand Journal of Statistics, 42(2):179–192.

Beer, G. (1993). Topologies on Closed and Closed Convex Sets. Mathematics and Its Application.
Kluwer Academic Publishers, Dordrecht.

Betsch, S. and Ebner, B. (2019a). Fixed point characterizations of continuous univariate proba-
bility distributions and their applications. ArXiv e-prints, 1810.06226v2.

Betsch, S. and Ebner, B. (2019b). A new characterization of the Gamma distribution and asso-
ciated goodness-of-fit tests. Metrika, https://doi.org/10.1007/s00184-019-00708-7.

Betsch, S. and Ebner, B. (2019c). Testing normality via a distributional fixed point property in
the Stein characterization. TEST, https://doi.org/10.1007/s11749-019-00630-0.

Billingsley, P. (1968). Convergence of Probability Measures. Wiley Series in Probability and


Mathematical Statistics. John Wiley & Sons, New York.

Brown, L. D. and Purves, R. (1973). Measurable selections of extrema. The Annals of Statistics,
1(5):902–912.

Burr, I. W. (1942). Cumulative frequency functions. The Annals of Mathematical Statistics,
13(2):215–232.

Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. (1995). A limited memory algorithm for bound
constrained optimization. SIAM Journal on Scientific Computing, 16:1190–1208.

Castaing, C. and Valadier, M. (1977). Convex Analysis and Measurable Multifunctions. Lecture
notes in mathematics 580. Springer-Verlag, Berlin - Heidelberg - New York.

Chen, L. H. Y., Goldstein, L., and Shao, Q.-M. (2011). Normal approximation by Stein’s method.
Springer-Verlag, Berlin - Heidelberg.

Cohn, D. L. (2013). Measure Theory (Second Edition). Birkhäuser, New York.

Döbler, C. (2015). Stein’s method of exchangeable pairs for the Beta distribution and general-
izations. Electronic Journal of Probability, 20(109):1–34.

Gebetsberger, M., Messner, J. W., Mayr, G. J., and Zeileis, A. (2018). Estimation methods
for nonhomogeneous regression models: Minimum continuous ranked probability score versus
maximum likelihood. Monthly Weather Review, 146(12):4323–4338.

Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic forecasts, calibration
and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
69(2):243–268.

Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation.
Journal of the American Statistical Association, 102(477):359–378.

Gneiting, T., Raftery, A. E., Westveld, A. H., and Goldman, T. (2005). Calibrated probabilistic
forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly
Weather Review, 133(5):1098–1118.

Goldstein, L. and Reinert, G. (1997). Stein’s method and the zero bias transformation with
application to simple random sampling. The Annals of Applied Probability, 7(4):935–952.

Hytönen, T., van Neerven, J., Veraar, M., and Weis, L. (2016). Analysis in Banach Spaces -
Volume I: Martingales and Littlewood-Paley Theory, volume 63 of Ergebnisse der Mathematik
und ihrer Grenzgebiete. 3. Folge / A Series of Modern Surveys in Mathematics. Springer
International Publishing AG, Cham.

Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal


of Machine Learning Research, 6:695–709.

Kallenberg, O. (2002). Foundations of Modern Probability (Second Edition). Probability and Its
Applications. Springer-Verlag, New York.

Kleiber, C. and Kotz, S. (2003). Statistical Size Distributions in Economics and Actuarial Sci-
ences. Wiley Series in Probability and Statistics. John Wiley and Sons, Inc., Hoboken, New
Jersey.

Kraft, D. (1988). A Software Package for Sequential Quadratic Programming. Deutsche


Forschungs- und Versuchsanstalt für Luft- und Raumfahrt Köln: Forschungsbericht. Wiss.
Berichtswesen d. DFVLR, Band 88, Ausgabe 28.

Kumar, D. (2017). The Burr type XII distribution with some statistical properties. Journal of
Data Science, 15(3):509–533.

Kuratowski, K. (1968). Topology Volume II. Academic Press / Polish Scientific Publishers, New
York / Warsaw.

Li, S. Z. (2009). Markov Random Field Modeling in Image Analysis (Third Edition). Springer-
Verlag, London.

Matsuda, T. and Hyvärinen, A. (2019). Estimation of non-normalized mixture models. In Pro-


ceedings of Machine Learning Research, volume 89, pages 2555–2563.

Millar, P. W. (1981). Robust estimation via minimum distance methods. Zeitschrift für
Wahrscheinlichkeitstheorie und Verwandte Gebiete, 55(1):73–89.

Millar, P. W. (1984). A general approach to the optimality of minimum distance estimators.


Transactions of The American Mathematical Society, 286(1):377–418.

Parr, W. C. (1981). Minimum distance estimation: a bibliography. Communications in Statistics


- Theory and Methods, 10(12):1205–1224.

Parr, W. C. and De Wet, T. (1981). On minimum Cramér-von Mises-norm parameter estimation.


Communications in Statistics - Theory and Methods, 10(12):1149–1166.

Parr, W. C. and Schucany, W. R. (1980). Minimum distance and robust estimation. Journal of
the American Statistical Association, 75(371):616–624.

Peköz, E. A. and Röllin, A. (2011). New rates for exponential approximation and the theorems
of Rényi and Yaglom. The Annals of Probability, 39(2):587–608.

Pfanzagl, J. (1969). On the measurability and consistency of minimum contrast estimates.


Metrika, 14(1):249–272.

Pihlaja, M., Gutmann, M., and Hyvärinen, A. (2010). A family of computationally efficient
and simple estimators for unnormalized statistical models. In Proceedings of the Twenty-Sixth
Conference on Uncertainty in Artificial Intelligence, UAI’10, pages 442–449, Catalina Island,
CA. AUAI Press, Arlington.

Rodriguez, R. N. (1977). A guide to the Burr type XII distributions. Biometrika, 64(1):129–134.

Sahler, W. (1970). Estimation by minimum-discrepancy methods. Metrika, 16(1):85–106.

Schmittlein, D. C. (1983). Some sampling properties of a model for income distribution. Journal
of Business & Economic Statistics, 1(2):147–153.

Schwartz, L. (1973). Radon Measures on Arbitrary Topological Spaces and Cylindrical Measures.
Oxford University Press, London.

Shah, A. and Gokhale, D. V. (1993). On maximum product of spacings (MPS) estimation for Burr
XII distributions. Communications in Statistics - Simulation and Computation, 22(3):615–641.

Singh, S. K. and Maddala, G. S. (1976). A function for size distribution of incomes. Econometrica,
44(5):963–970.

Tadikamalla, P. R. (1980). A look at the Burr and related distributions. International Statistical
Review / Revue Internationale de Statistique, 48(3):337–344.

Tenreiro, C. (2019). On the automatic selection of the tuning parameter appearing in certain fam-
ilies of goodness-of-fit tests. Journal of Statistical Computation and Simulation, 89(10):1780–
1797.

Uehara, M., Kanamori, T., Takenouchi, T., and Matsuda, T. (2019a). Unified estimation frame-
work for unnormalized models with statistical efficiency. ArXiv e-prints, 1901.07710v2.

Uehara, M., Matsuda, T., and Kim, J. K. (2019b). Imputation estimators for unnormalized
models with missing data. ArXiv e-prints, 1903.03630.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Proba-
bilistic Mathematics. Cambridge University Press, Cambridge.

van der Vaart, A. W. and Wellner, J. A. (2000). Weak Convergence and Empirical Processes -
With Applications to Statistics. Springer Series in Statistics. Springer-Verlag, New York.

Wang, F.-K. and Cheng, Y.-F. (2010). Robust regression for estimating the Burr XII parameters
with outliers. Journal of Applied Statistics, 37(5):807–819.

Widder, D. V. (1959). The Laplace Transform, 5th printing. Princeton University Press, Prince-
ton.

Wingo, D. R. (1983). Maximum likelihood methods for fitting the Burr type XII distribution to
life test data. Biometrical Journal, 25(1):77–84.

Wingo, D. R. (1993). Maximum likelihood methods for fitting the Burr type XII distribution to
multiply (progressively) censored life test data. Metrika, 40(1):203–210.

Wolfowitz, J. (1957). The minimum distance method. The Annals of Mathematical Statistics,
28(1):75–88.

Zhu, C., Byrd, R. H., Lu, P., and Nocedal, J. (1997). Algorithm 778: L-BFGS-B: Fortran sub-
routines for large-scale bound-constrained optimization. ACM Transactions on Mathematical
Software, 23(4):550–560.

