STATS 300C — Spring 2023
Lecture 3 — April 10
Lecturer: Prof. Emmanuel Candès Scribe: Leda Liang
Warning: These notes may contain factual and/or typographic errors. They are
based on Emmanuel Candès’s course from 2018 and 2023, and scribe notes written
by Logan Bell, John Cherian, Paula Gablenz, and David Ritzwoller.
• If our data exhibit small, widely distributed signals, Fisher's method excels and remains optimal even as the expected noise power overtakes the expected signal power, whereas Bonferroni's method is powerless.
In light of these facts, we would like a test that combines the strengths of Bonferroni and
Fisher. This lecture will examine a few such methods. We will proceed by considering n hypotheses $H_{0,1}, \dots, H_{0,n}$ with p-values $p_1, \dots, p_n$ such that $p_i \sim \mathrm{Unif}(0,1)$ under $H_{0,i}$, and we will be testing the global null $H_0 = \bigcap_{i=1}^{n} H_{0,i}$.
In the left panel, no p-value falls below the critical line, so H0 would not be rejected. In the right panel, one p-value falls below the critical line, and therefore the Simes test would reject H0.
Figure 3.1. Sketch of an example where the Simes test would not reject H0 (left panel) and where it would reject (right panel). The p-values are displayed as grey dots, and the critical line $\alpha i/n$ is shown in grey. The p-value below the critical line in the right panel is shown as a red dot.
As we observed earlier, the Simes test extends Bonferroni's method by considering every p-value rather than just the smallest one. As such, its rejection region contains Bonferroni's rejection region. We can visualize the case n = 2, shown in Figure 3.2 below, in which Simes's rejection region is shaded in gray while Bonferroni's rejection region, contained within it, is depicted by hatching.
Figure 3.2. Rejection regions for Simes and Bonferroni procedures in the case n = 2.
In the above example, the Simes test rejects if $p_{(1)} \le \alpha/2$ or $p_{(2)} \le \alpha$, while Bonferroni's method rejects only if $p_{(1)} \le \alpha/2$. Though the Simes test is less conservative than Bonferroni's, it is not too liberal. The following theorem shows that the Simes test, which rejects when the Simes statistic $T_n = \min_{1 \le i \le n} \frac{n}{i}\, p_{(i)}$ falls below $\alpha$, has size exactly $\alpha$ when the $p_i$ are independent under $H_0$.
Theorem 1. Under $H_0$ and independence of the $p_i$'s, $T_n \sim \mathrm{Unif}(0, 1)$.
Proof. We will show this by induction on n, following Simes [? ]. In the base case n = 1, under H0, it is clear that $T_1 = p_1 \sim \mathrm{Unif}(0,1)$. Now assume that $T_{n-1} \sim \mathrm{Unif}(0,1)$ for some
$n \ge 2$. We will show that $\mathbb{P}(T_n \le \alpha) = \alpha$ for all $\alpha \in [0, 1]$. First, we split the probability according to whether $p_{(n)} \le \alpha$ or $p_{(n)} > \alpha$:
$$
\mathbb{P}(T_n \le \alpha)
= \mathbb{P}\Big(\min_{1 \le i \le n} \frac{n}{i}\, p_{(i)} \le \alpha\Big)
= \mathbb{P}\big(p_{(n)} \le \alpha\big)
+ \mathbb{P}\Big(\min_{1 \le i \le n-1} \frac{n}{i}\, p_{(i)} \le \alpha,\; p_{(n)} > \alpha\Big). \tag{3.1}
$$
Now recall that the density of the n-th order statistic $p_{(n)}$ is given by $f(t) = n t^{n-1}$ for $t \in [0, 1]$. We can thus compute each term individually. The first term is straightforward:
$$
\mathbb{P}\big(p_{(n)} \le \alpha\big) = \int_0^\alpha n t^{n-1}\, dt = \alpha^n.
$$
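The notes break off here; a sketch of the remaining computation is as follows. Conditionally on $p_{(n)} = t$, the remaining order statistics are distributed as the order statistics of $n-1$ i.i.d. $\mathrm{Unif}(0,t)$ variables, so we may write $p_{(i)} = t\, q_{(i)}$ for $i \le n-1$, where the $q_{(i)}$ are order statistics of $n-1$ i.i.d. $\mathrm{Unif}(0,1)$ variables. Then
$$
\min_{1 \le i \le n-1} \frac{n}{i}\, p_{(i)} = \frac{n\,t}{n-1} \min_{1 \le i \le n-1} \frac{n-1}{i}\, q_{(i)} = \frac{n\,t}{n-1}\, T_{n-1},
$$
and the induction hypothesis $T_{n-1} \sim \mathrm{Unif}(0,1)$ gives, for $t > \alpha$,
$$
\mathbb{P}\Big(\min_{1 \le i \le n-1} \frac{n}{i}\, p_{(i)} \le \alpha \,\Big|\, p_{(n)} = t\Big) = \frac{\alpha(n-1)}{n t}.
$$
Integrating against the density $n t^{n-1}$ of $p_{(n)}$, the second term in (3.1) equals
$$
\int_\alpha^1 \frac{\alpha(n-1)}{n t}\, n t^{n-1}\, dt = \alpha(n-1) \int_\alpha^1 t^{n-2}\, dt = \alpha\big(1 - \alpha^{n-1}\big) = \alpha - \alpha^n,
$$
so that $\mathbb{P}(T_n \le \alpha) = \alpha^n + (\alpha - \alpha^n) = \alpha$, as claimed.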
We will see next week that the Simes test also has size at most α under a certain form of positive dependence [? ]. We also note that, with respect to power, the Simes test behaves much like Bonferroni's: it is powerful for data which exhibit a few strong effects, and it has only moderate power for many mild effects.
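As a quick sanity check on Theorem 1 (not part of the original notes), here is a minimal simulation sketch, assuming NumPy, that estimates the size of the Simes test under the global null:

```python
import numpy as np

rng = np.random.default_rng(0)

def simes_statistic(p):
    """Simes statistic T_n = min_i (n / i) * p_(i)."""
    p = np.sort(p)
    n = len(p)
    return np.min(n * p / np.arange(1, n + 1))

# Under the global null, the p-values are i.i.d. Unif(0, 1).
n, reps = 100, 20_000
T = np.array([simes_statistic(rng.uniform(size=n)) for _ in range(reps)])

# Theorem 1 says T_n ~ Unif(0, 1), so the empirical rejection rate
# at level alpha should be close to alpha.
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}: empirical size = {np.mean(T <= alpha):.4f}")
```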
A lot of work went into establishing this very nice bound, and we will not prove it here.
Interested students are referred to Massart [? ].
Evaluating the integral piecewise between consecutive order statistics, the terms simplify, and the Anderson-Darling statistic (call it $A_n^2$) reduces to the closed form
$$
A_n^2 = -n - \sum_{k=1}^{n} \frac{2k-1}{n}\Big(\log p_{(k)} + \log\big(1 - p_{(n+1-k)}\big)\Big). \tag{3.2}
$$
A quadratic statistic is thus akin to a weighted $L^2$-norm of $\hat F_n(t) - t$, while the KS statistic is like the $L^\infty$-norm.
This formula lets us connect the Anderson-Darling test to test statistics we have seen previously. Recall that Fisher's test statistic is $T_F = -2 \sum_{i=1}^{n} \log(p_i)$, and Pearson's test statistic is $T_P = 2 \sum_{i=1}^{n} \log(1 - p_i)$. Formula (3.2) shows that the Anderson-Darling statistic is a combination of Fisher's test and Pearson's test. Compared to Fisher's test, the Anderson-Darling test assigns greater weight to the p-values in the bulk because it re-weights the p-values according to their rank, something Fisher's test does not do. This alleviates the high sensitivity to very small p-values from which Fisher's test suffers.
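For concreteness, here is a short sketch (assuming NumPy, and p-values strictly inside (0, 1)) that evaluates the closed form (3.2) directly:

```python
import numpy as np

def anderson_darling(p):
    """Anderson-Darling statistic computed via the closed form (3.2)."""
    p = np.sort(p)
    n = len(p)
    k = np.arange(1, n + 1)
    # p[::-1][k - 1] is the order statistic p_(n+1-k).
    return -n - np.sum((2 * k - 1) / n * (np.log(p) + np.log(1 - p[::-1])))
```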
According to Donoho and Jin [? ], Tukey [? ] proposed in his class notes at Princeton to use a second-level significance test. Tukey's test combines ideas from the Kolmogorov-Smirnov test and the Anderson-Darling test: it takes the maximum of the difference $\hat F_n(t) - t$, as in the Kolmogorov-Smirnov test, but standardizes the difference, as in the Anderson-Darling test. Specifically, Tukey proposed using the higher-criticism statistic
$$
\mathrm{HC}_n^* = \max_{0 < \alpha \le \alpha_0} \frac{\hat F_n(\alpha) - \alpha}{\sqrt{\alpha(1-\alpha)/n}}.
$$
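In practice one scans only over the observed p-values, at which the empirical CDF jumps. A minimal sketch, assuming NumPy (the choice $\alpha_0 = 1/2$ is an assumption, though a common one, and at least one p-value is assumed to fall below $\alpha_0$):

```python
import numpy as np

def higher_criticism(p, alpha0=0.5):
    """Higher-criticism statistic HC*_n, scanning over the observed
    p-values (where the empirical CDF jumps) up to alpha0."""
    p = np.sort(p)
    n = len(p)
    i = np.arange(1, n + 1)
    # Empirical CDF at the order statistics: F_n(p_(i)) = i / n.
    w = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    return np.max(w[p <= alpha0])
```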
Theoretical Analysis
Donoho and Jin [? ], building on Jin [? ] and Ingster [? ], provide a theoretical analysis of Tukey's higher-criticism statistic. The higher-criticism statistic "scans" across the significance levels for departures from H0; hence, a large value of $\mathrm{HC}_n^*$ indicates significance of an overall body of tests. To understand the power of the higher-criticism statistic and to compare it to Bonferroni's method, we will next study sparse mixtures.
$$
H_{0,i}: X_i \sim \mathcal N(0, 1) \qquad \text{vs.} \qquad H_{1,i}: X_i \sim \mathcal N(\mu_i, 1), \quad \mu_i > 0.
$$
The expected number of non-nulls under H1 is nε. If ε = 1/n, then the above would become the "needle in a haystack" problem: on average, there would be one coordinate with a nonzero mean µ.
If ε and µ were known, then the optimal test would be the likelihood ratio test. The
likelihood ratio under the sparse mixture model is
$$
L = \prod_{i=1}^{n} \Big[ (1 - \varepsilon) + \varepsilon\, e^{\mu X_i - \mu^2/2} \Big].
$$
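A direct transcription as a sketch (assuming NumPy, with eps and mu known, and working on the log scale for numerical stability):

```python
import numpy as np

def sparse_mixture_log_lr(x, eps, mu):
    """Log-likelihood ratio for the sparse mixture vs. the global null."""
    # (1 - eps) + eps * exp(z) = 1 + eps * expm1(z), with z = mu*x - mu^2/2.
    return np.sum(np.log1p(eps * np.expm1(mu * x - mu**2 / 2)))
```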
The asymptotic analysis of Ingster [? ] and Jin [? ] specifies the dependency of ε and µ
on n as follows:
$$
\varepsilon_n = n^{-\beta}, \quad \tfrac{1}{2} < \beta < 1, \qquad
\mu_n = \sqrt{2 r \log n}, \quad 0 < r < 1.
$$
The parameter β controls the number of non-nulls, which ranges from 1 to $\sqrt n$, and thus the sparsity of the alternatives, while r parameterizes the mean shift. If β were large, then our problem would be very sparse, while if β were small, it would be only mildly sparse. If r = 1, then we get the detection threshold we have seen for Bonferroni. Hence, the "needle in a haystack" problem corresponds to β = 1 and r = 1.
Ingster [? ] and Jin [? ] find that there is a threshold curve for r of the form
$$
\rho^*(\beta) =
\begin{cases}
\beta - \tfrac{1}{2}, & \tfrac{1}{2} < \beta < \tfrac{3}{4}, \\[2pt]
\big(1 - \sqrt{1 - \beta}\big)^2, & \tfrac{3}{4} \le \beta \le 1,
\end{cases}
$$
such that
1. If r > ρ∗(β), then we can adjust the NP (likelihood ratio) test so that the sum of its type I and type II error probabilities tends to 0, i.e., we asymptotically separate.

2. If r < ρ∗(β), then for every test the sum of the type I and type II error probabilities tends to 1, i.e., we asymptotically merge.
Unfortunately, we generally cannot use the NP test since we do not know ε or µ. However, Donoho and Jin [? ], based on Ingster [? ] and Jin [? ], show that Tukey's higher-criticism statistic, which does not require knowledge of ε or µ, asymptotically achieves the optimal detection threshold, with
$$
p_i = \bar\Phi(X_i) = \mathbb{P}\big(\mathcal N(0,1) > X_i\big), \qquad
\mathrm{HC}_n^* = \max_{0 < \alpha \le \alpha_0} \frac{\sqrt n\,\big(\hat F_n(\alpha) - \alpha\big)}{\sqrt{\alpha(1-\alpha)}}.
$$
To better understand the results, Figure 3.3 visualizes the detection thresholds for NP
and Bonferroni.
If the amplitude of the signal is above the solid black curve (achievable with NP), then the NP test has full power; that is, we asymptotically separate. However, if it is below the curve, we asymptotically merge; that is, every test is no better than flipping a coin. The dashed black curve in Figure 3.3 shows the detection threshold for Bonferroni. Bonferroni's method achieves the optimal threshold for β ∈ [3/4, 1], corresponding to the sparsest setting, but has a suboptimal threshold for β ∈ (1/2, 3/4), where the signal is less sparse. This is seen in the figure as the dashed Bonferroni curve and the solid NP curve align for β ≥ 3/4 but separate below. Hence, in the region between the Bonferroni and the NP curves, the NP test has full power, while Bonferroni is no better than a coin toss.
Bonferroni’s threshold for 1/2 ≤ β ≤ 1 is
$$
\rho_{\mathrm{Bon}}(\beta) = \big(1 - \sqrt{1 - \beta}\big)^2.
$$
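To see how the two boundaries compare, here is a small sketch (assuming NumPy) that tabulates both curves; they agree for β ≥ 3/4 and separate below:

```python
import numpy as np

def rho_star(beta):
    """Optimal detection boundary of Ingster and Jin."""
    return np.where(beta < 0.75, beta - 0.5, (1 - np.sqrt(1 - beta))**2)

def rho_bon(beta):
    """Bonferroni's detection boundary."""
    return (1 - np.sqrt(1 - beta))**2

beta = np.linspace(0.55, 1.0, 10)
for b, rs, rb in zip(beta, rho_star(beta), rho_bon(beta)):
    print(f"beta = {b:.2f}: rho* = {rs:.3f}, rho_Bon = {rb:.3f}")
```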
Bonferroni is powerless if r < ρBon(β). Bonferroni correctly detects non-nulls if the maximum of the non-nulls exceeds the maximum of the nulls (there are roughly $n^{1-\beta}$ non-nulls), i.e., roughly
$$
\max_{\text{non-null}} X_i \simeq \sqrt{2 r \log n} + \sqrt{2 \log n^{1-\beta}} > \sqrt{2 \log n}
\;\Longleftrightarrow\; \sqrt r + \sqrt{1 - \beta} > 1
\;\Longleftrightarrow\; r > \big(1 - \sqrt{1 - \beta}\big)^2 = \rho_{\mathrm{Bon}}(\beta).
$$
By comparison, the higher-criticism test rejects when $\mathrm{HC}_n^*$ is large, i.e., when the p-values tend to be a bit too small overall. How small is "too small" will be made precise in the next lecture.
For now, consider the empirical process
$$
W_n(t) = \frac{\sqrt n\,\big(\hat F_n(t) - t\big)}{\sqrt{t(1-t)}},
$$
which converges in distribution to $\mathcal N(0,1)$ for each fixed t. Empirical process theory tells us that
1. $\{\sqrt n\,(\hat F_n(t) - t)\}_{0 \le t \le 1}$ converges in distribution to a Brownian bridge, and

2. $\max_{1/n \le t \le \alpha_0} W_n(t) \big/ \sqrt{2 \log \log n} \xrightarrow{p} 1$.

This suggests the threshold $\sqrt{2 \log \log n}$ for the HC statistic.
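As a rough null-calibration check (a sketch; convergence at the log-log scale is slow, so agreement is only approximate), reusing the higher_criticism sketch from above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 10_000, 200

# Null distribution of HC*_n: p-values i.i.d. Unif(0, 1).
hc_null = [higher_criticism(rng.uniform(size=n)) for _ in range(reps)]

print("median HC* under H0:", np.median(hc_null))
print("sqrt(2 log log n):  ", np.sqrt(2 * np.log(np.log(n))))
```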
Theorem 3 (Donoho and Jin [? ]). If we reject when $\mathrm{HC}_n^* \ge (1 + \varepsilon)\sqrt{2 \log \log n}$, then the size of the test tends to 0 and, for every fixed r > ρ∗(β), the power tends to 1.
The Berk-Jones statistic also achieves the optimal detection boundary asymptotically, and it has subexponential tails under H0, regardless of t. Walther ran simulations to compare the performance of $\mathrm{HC}_n^*$ and $\mathrm{BJ}_n^+$ and plotted the results in the graphs shown in Figure 3.5. The plots demonstrate that $\mathrm{HC}_n^*$ outperforms $\mathrm{BJ}_n^+$ when β ∈ [3/4, 1], the sparsest setting, while $\mathrm{BJ}_n^+$ is better when β is small.
Figure 3.5. Power of $\mathrm{HC}_n^*$ (red) and $\mathrm{BJ}_n^+$ (black) as a function of the sparsity parameter β. The left plot shows power for sample size $n = 10^4$, the right plot for $n = 10^6$. In the range β ∈ (1/2, 3/4), we see that $\mathrm{HC}_n^*$ has low but increasing power, while $\mathrm{BJ}_n^+$ has high but decreasing power. Their power curves cross at β = 3/4, and in the range β ∈ [3/4, 1], we see that $\mathrm{HC}_n^*$ outperforms $\mathrm{BJ}_n^+$.
Figure 3.6. Power of ALR (green), $\mathrm{HC}_n^*$ (red), and $\mathrm{BJ}_n^+$ (black) as a function of the sparsity parameter β. The left plot shows power for sample size $n = 10^4$, the right plot for $n = 10^6$.
where $\mathrm{LR}_i = \exp\big(n D(p_{(i)}, i/n)\big)$ and the weights are $w_i = \dfrac{1}{2 i \log(n/3)}$.
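A sketch of how one might compute this, assuming (since the defining display is cut off above) that D is the Bernoulli Kullback-Leibler divergence, as in the Berk-Jones statistic, and that the sum is restricted to $i \le n/2$ to keep the divergence finite; both are assumptions, not details confirmed by the notes:

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def alr(p):
    """Averaged likelihood ratio sum_i w_i * LR_i (sketch)."""
    p = np.sort(p)
    n = len(p)
    i = np.arange(1, n // 2 + 1)   # restricted range: an assumption
    pk = p[: n // 2]
    w = 1 / (2 * i * np.log(n / 3))
    lr = np.exp(n * bernoulli_kl(pk, i / n))
    return np.sum(w * lr)
```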
As with $\mathrm{HC}_n^*$ and $\mathrm{BJ}_n^+$, ALR also achieves the optimal detection threshold asymptotically. Plotting the power of ALR against $\mathrm{HC}_n^*$ and $\mathrm{BJ}_n^+$ gives Figure 3.6. We see that ALR achieves essentially the better of $\mathrm{HC}_n^*$ and $\mathrm{BJ}_n^+$ for any β.