
Identifying Homogeneous and Interpretable Groups for Conformal Prediction

Natalia Martinez Gil1 Dhaval Patel1 Chandra Reddy1 Giridhar Ganapavarapu1 Roman Vaculin1
Jayant Kalagnanam1

1 IBM Research, Yorktown Heights, New York, USA

Abstract

Conformal prediction methods are a tool for uncertainty quantification of a model's prediction, providing a model-agnostic and distribution-free statistical wrapper that generates prediction intervals/sets for a given model with finite-sample generalization guarantees. However, these guarantees hold only on average, or conditioned on the output values of the predictor or on a set of predefined groups, which a priori may not relate to the prediction task at hand. We propose a method to learn a generalizable partition function of the input space (or representation mapping) into interpretable groups of varying sizes where the non-conformity scores - a measure of discrepancy between prediction and target - are as homogeneous as possible when conditioned on the group. The learned partition can be integrated with any of the group-conditional conformal approaches to produce conformal sets with group-conditional guarantees on the discovered regions. Since these learned groups are expressed strictly as a function of the input, they can be used for downstream tasks such as data collection or model selection. We show the effectiveness of our method in reducing worst-case group coverage outcomes on a variety of datasets.

1 INTRODUCTION

Interest in applying Machine Learning (ML) models in different industrial settings has increased in recent years, in particular given the success of deep neural networks and the availability of large amounts of data. In general, predictive ML models are optimized to capture the behaviour of a target variable based on a finite set of observations. One of the major concerns when deploying these models into real-world decision making processes is how to quantify the uncertainty in their prediction, especially in high-stakes domains such as health care or finance where there is a steep penalty for making mistakes.

Typical ML models produce point predictions (e.g., expected values for regression or the most likely label in the case of classification); these are not a priori informative about the range of values the target variable can take within normal operation (i.e., the set of values expected to occur with high probability). Calibrated prediction sets (or intervals) can be of great value for a decision maker that wants to consider worst-case scenarios. Moreover, understanding how the uncertainty of a model's prediction differs across varying subsets of the available data can inform the data collection process, model improvements, or model selection/assessment.

Conformal prediction methods Vovk et al. [2005] have gained significant popularity in recent years since they offer a distribution-free approach to quantify the uncertainty of a black-box model's prediction with generalization guarantees Shafer and Vovk [2008], Angelopoulos and Bates [2021]. In particular, split conformal prediction (SCP) Papadopoulos et al. [2002] is an attractive post-hoc, model-agnostic approach that only requires access to the model's prediction and a calibration dataset. This is especially useful in settings where retraining or modifying an ML model to produce uncertainty estimates is infeasible, or when only query access to an ML model is possible (e.g., Large Language Models).

Given a desired miscoverage level α (i.e., error rate), conformal prediction methods produce prediction sets/intervals based on a black-box model's prediction that are guaranteed to contain, on average, the ground-truth value of the target variable with probability larger than or equal to 1 − α. They often rely on the quantile estimation of a non-conformity score, which is a measure of the disagreement between the target variable and the model prediction (e.g., absolute error), and only require that the calibration dataset be exchangeable (a weaker condition than full statistical independence) with the data samples the model will be tested on. Different works have studied how to adapt these methods to scenarios where the exchangeability assumption is violated, such as distribution shifts or time-series settings Gibbs and Candes [2021], Stankeviciute et al. [2021], Barber et al. [2023].
A significant amount of work has focused on understanding the feasibility of more efficient prediction sets and stronger-than-average coverage guarantees. An ideal goal would be to achieve input-conditional coverage (i.e., the coverage guarantees hold for each possible input), which has been proven to be impossible in practice Vovk [2012], Lei and Wasserman [2014]. Nonetheless, weaker guarantees such as local conditional coverage Foygel Barber et al. [2021] or group and level-set conditional coverage Jung et al. [2022] are possible. Providing prediction sets with close-to-conditional coverage guarantees is valuable in settings where the model's prediction uncertainty differs significantly across the input space (heteroskedastic uncertainty). Essentially, we want to avoid having subsets of samples with under-coverage and/or inefficient prediction sets Romano et al. [2020]. Marginal coverage guarantees hold only on average, and do not prevent high variation in the performance of the prediction sets across subgroups in the input space.

Many works have addressed relaxations of the conditional coverage objective by modifying the non-conformity score Papadopoulos et al. [2011], Lei and Wasserman [2014], Guan [2023], Han et al. [2022], Amoukou and Brunel [2023], Seedat et al. [2023], Ghosh et al. [2023], learning the non-conformity quantile threshold Jung et al. [2022], Bastani et al. [2022], Gibbs et al. [2023], or using a conformal quantile regression objective when the provided model can be retrained Romano et al. [2019]. In particular, a line of work with practical guarantees has focused on the notion of local or group-conditional coverage for a pre-specified set of groups that partitions the input space Vovk et al. [2003], Vovk [2012] and for overlapping groups Foygel Barber et al. [2021], Jung et al. [2022], Gibbs et al. [2023].

Figure 1: (1a, "Non-conformity region-based prediction framework") Overview of the proposed framework to produce prediction intervals/sets. We first decompose the input space into interpretable groups where each group contains homogeneous predictions of the 1 − α-th quantile of the non-conformity scores of an ML model's prediction f(·). We then build a prediction interval/set, denoted Cτ(X) = Cα(X, gτ(X)), where Cα is the group-conditional conformal predictor, which depends on both the input X and on the group prediction gτ(X). Cτ satisfies group-conditional coverage guarantees for the identified groups. (1b, "Split Conformal Prediction") Regression example of heteroskedastic uncertainty in the model's prediction (blue line); the x-axis indicates the input variable, the y-axis the target variable, and red dots the test samples. Prediction bands (blue) are produced by standard SCP with a coverage target of 0.95 (α = 0.05); the desired coverage is achieved on average, but there is significant disparity across regions of X. (1c, "Region-based SCP") Prediction bands obtained with the proposed region-based approach in conjunction with SCP. Five groups were identified, and the group-conditional coverage is improved significantly w.r.t. SCP.

Main Contributions. Most group-conditional conformal prediction approaches presented above rely on pre-defined groups or propose greedy approaches to slice the input space Lei and Wasserman [2014] or the prediction space Sesia and Romano [2021], Boström et al. [2021] into equal-sized regions, which scale poorly to higher-dimensional inputs. To address this issue, we propose a method to learn a generalizable partition function of the input space (or representation mapping) into interpretable groups (we use the terms regions, groups, partitions and clusters interchangeably) of varying sizes where the quantiles of the non-conformity scores are as homogeneous as possible when conditioned on the group. The main characteristics of the proposed approach are described next.

• We adopt an adversarial approach where an agent proposes a partition function that approximates the non-conformity-score conditional quantile, and a judge then evaluates it based on its worst group-conditional miscoverage with respect to the one achieved by an interpretable baseline. The agent and the judge use independent datasets drawn from the same distribution.

• We define a fitness score denoted as worst group miscoverage ratio (MCR) that allows the comparison of models across different partitions. We use this score to inform the regularization of a family of interpretable clustering functions with the goal of selecting the partition that best generalizes in terms of MCR over the set of partitions that accurately approximate the conditional quantile estimates of the non-conformity scores.

• We learn partitions using decision trees since the identified groups can be described based on interpretable input rules—a valuable property for downstream tasks such as data collection or model selection. The partition function can be integrated with any of the group-conditional conformal approaches discussed previously (see Figure 1) to produce conformal sets with group-conditional guarantees on the discovered regions.
The proposed method serves as an inexpensive alternative to a more strict and costly auditing approach where the auditor leverages an optimization procedure to find the worst computationally identifiable miscoverage group for a given model. In our experiments, we show that we discover meaningful groups that significantly benefit from their inclusion in a group-conditional conformal approach. Code is available at https://github.com/trustyai-explainability/trustyai-model-trust.

Manuscript Organization. Section 2 provides a summary of conformal prediction definitions that are used throughout this manuscript, and Section 3 summarizes additional related work. Section 4 describes the proposed objective for discovering the group partition function based on non-conformity score quantiles. Section 5 provides the method that integrates group identification with conformal prediction. Finally, Section 6 shows experimental results that validate our proposed approach.
2 BACKGROUND

Let us consider the supervised machine learning setting where we have an input variable X ∈ 𝒳 and a target variable Y ∈ 𝒴 jointly distributed according to an unknown distribution X, Y ∼ p(X)p(Y|X). Given a prediction function f : 𝒳 → 𝒴′, where 𝒴′ is an output space that approximates some statistic of p(Y|X) (e.g., 𝒴′ = Δ^{|𝒴|−1} if f outputs a probability vector over labels in the classification setting, or 𝒴′ = 𝒴 for regression), we consider a non-conformity score function S_f : 𝒳 × 𝒴 → ℝ that depends on f and measures the proximity between the prediction f(X) and the corresponding target Y. We use S to denote the non-conformity random variable S = S_f(X, Y) that depends on the input variable X, the target variable Y, and the model f.

In the split conformal setting we assume we have access to an i.i.d. calibration dataset D_cal = {(X_i, Y_i)}_{i=1}^n ∼ p(X, Y)^{⊗n} that is independent of f. Then, the set of non-conformity scores {s_i = S_f(X_i, Y_i)}_{i=1}^n is exchangeable with any unseen non-conformity score sample S_{n+1} = S_f(X_{n+1}, Y_{n+1}). This exchangeability property implies the following for any new sample X_{n+1}, Y_{n+1} ∼ P(X)P(Y|X):

    1 − α ≤ P( S_{n+1} ≤ Q_{1−α}({s_i}_{i=1}^n) ) ≤ 1 − α + 1/(n+1),
    Q_{1−α}({s_i}_{i=1}^n) = Q_{1−α}( Σ_{i=1}^n (1/(n+1)) δ_{s_i} + (1/(n+1)) δ_∞ ),      (1)

where Q_{1−α}(·) denotes the 1 − α quantile operator of its input (for Eq. 1 this is the ⌈(1 − α)(n + 1)⌉-th smallest of {s_i}_{i=1}^n), and α ∈ (0, 1) is a pre-specified miscoverage level. Conversely, we can define the conformal set for a given miscoverage level α based on the non-conformity score function as

    C_f(X_{n+1}) = { Y_{n+1} ∈ 𝒴 : S_{n+1} ≤ Q_{1−α}({s_i}_{i=1}^n) }.      (2)

This satisfies P(Y_{n+1} ∈ C_f(X_{n+1})) ≥ 1 − α.

Conditional and Local Coverage Guarantees. The set described in Eq. 2 provides guarantees on average across the entire data distribution, but not for any particular value of X, i.e., P(Y_{n+1} ∈ C_f(X_{n+1}) | X_{n+1} = x) ≥ 1 − α, ∀x ∈ 𝒳, also known as conditional coverage. This desirable guarantee cannot be achieved in practice Vovk [2012], Lei and Wasserman [2014], Foygel Barber et al. [2021], since it would require C_f(x) to have infinite expected length at any non-atom x. A relaxation of this setting is to consider local coverage over a partition of the support of P(X), denoted as g : 𝒳 → 𝒢 with 𝒢 a discrete finite set. Then, local conditional coverage implies P(Y_{n+1} ∈ C_f(X_{n+1}) | X_{n+1} ∈ g_j) ≥ 1 − α with g_j = {x : g(x) = j}, ∀j ∈ 𝒢.

Pinball Loss in the Infinite Sample Regime. In the ideal case where the conditional distribution of the non-conformity scores P(S|X) is known, the most efficient prediction interval for a given X and miscoverage level α is

    C(X) = { y ∈ 𝒴 : S(X, y) ≤ F_{S|X}^{−1}(1 − α) }      (3)

with F_{S|X}^{−1}(1 − α) = inf{ ŝ ∈ supp(P_{S|X}) : P(S ≤ ŝ | X) ≥ 1 − α }. We can approximate the 1 − α conditional quantile by minimizing the expected pinball loss

    F_{S|X}^{−1}(1 − α) = arg min_{q ∈ Q} E_{p(X,S)}[ ℓ_{1−α}(q(X), S) ],      (4)

where Q represents a universal class of functions and ℓ_{1−α}(·, ·) is the pinball loss, defined as

    ℓ_{1−α}(q, s) = max{ (1 − α)(s − q), α(q − s) }.      (5)

Section 4 leverages the pinball loss, in addition to a worst-case generalization objective, to identify a set of disjoint regions in the input space where the 1 − α quantile of the non-conformity score differs significantly. We use the discovered grouping in this prior step as an input to a group-conditional split conformal approach, which then holds local conditional guarantees on the identified groups. Section 5 presents an implementation of this approach based on decision trees, which provide an interpretable clustering of the input space based on the input features (or an interpretable embedding of the same).
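For concreteness, the split conformal threshold of Eq. 1 and the pinball loss of Eq. 5 can be computed in a few lines. The sketch below (NumPy, with a synthetic stand-in for the calibration scores) is only illustrative and is not part of the released code.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample 1 - alpha quantile of calibration scores (Eq. 1).

    Returns the ceil((1 - alpha) * (n + 1))-th smallest score, or +inf
    when that rank exceeds n (too few calibration samples).
    """
    n = len(scores)
    rank = int(np.ceil((1 - alpha) * (n + 1)))
    return np.sort(scores)[rank - 1] if rank <= n else np.inf

def pinball_loss(q, s, alpha):
    """Pinball loss l_{1-alpha}(q, s) of Eq. 5, vectorized over s."""
    q, s = np.asarray(q), np.asarray(s)
    return np.maximum((1 - alpha) * (s - q), alpha * (q - s))

# Example with absolute-error non-conformity scores s_i = |y_i - f(x_i)|:
# the split conformal interval for a new point x is f(x) +/- q_hat.
rng = np.random.default_rng(0)
cal_scores = np.abs(rng.normal(size=500))     # stand-in calibration scores
q_hat = conformal_quantile(cal_scores, alpha=0.1)
print(q_hat, pinball_loss(q_hat, cal_scores, alpha=0.1).mean())
```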
3 RELATED WORK

Adaptive Conformal Sets. Input-conditional coverage guarantees with finite samples are impossible without infinite-width intervals Vovk [2012], Lei and Wasserman [2014]. However, an extensive line of work has focused on providing adaptive conformal sets that can capture heteroskedastic uncertainty in the model's prediction Romano et al. [2019], Kivaranovic et al. [2020]. Some works up-weight the non-conformity scores of calibration samples based on some distance notion to the test instance Mao et al. [2022], Guan [2023], Ghosh et al. [2023] or make assumptions on the data distribution Lei and Wasserman [2014], Barber et al. [2023]. These approaches do not integrate information about the non-conformity score in the weighting process. In contrast, approaches such as Han et al. [2022], Jung et al. [2022], Amoukou and Brunel [2023] model some statistic of the (conditional) non-conformity score distribution to re-weight, correct, or learn the quantile threshold.

Local Conditional Coverage. Some works have proposed split conformal prediction methods for a predefined set of groups. For non-overlapping groups, Mondrian conformal prediction provides finite sample guarantees Vovk et al. [2003], Vovk [2012]. The assumption here is that the observations in each group of the partition are exchangeable. For overlapping groups, Foygel Barber et al. [2021] provides a conservative approach with finite sample guarantees (largest set from the groups that contain the test point). The work by Jung et al. [2022] learns the non-conformity score threshold conditioned on each group via quantile regression. Their approach has asymptotic guarantees, while Gibbs et al. [2023] proposed an alternative with finite sample guarantees.

Group Identification for Local Conformal Prediction. Lei and Wasserman [2014] proposes a "sandwich slicer" approach that bins the input features before applying a group/local conditional conformal approach, while Sesia and Romano [2021] proposes histogram binning of the ML model's output values. These approaches are simple but greedy, and do not leverage the information of the distribution of the non-conformity scores. Existing kernel-based localizers for conformal prediction Guan [2023], Han et al. [2022] do not partition the input space and do not integrate information about the non-conformity scores. The work by Amoukou and Brunel [2023] is the closest to our approach. They propose an adaptive conformal prediction approach that learns the non-conformity score weights with a random forest that approximates a statistic of the non-conformity scores. To achieve this objective, they use a quantile random forest that approximates the distribution of the non-conformity scores on the calibration dataset. Moreover, they provide an approach to approximate the forest's weights with a partition function using a graph clustering method based on Louvain-Leiden Traag et al. [2019] with Markov Stability Delvenne et al. [2010]. Therefore, we compare against two of their variants. It is important to note that the quantile random forest algorithm by Meinshausen and Ridgeway [2006] does not minimize (an approximation of) the 1 − α quantile objective; instead it minimizes the inter-leaf variance of the non-conformity scores, and the leaves of this QRF algorithm store the entire list of non-conformity scores of training samples falling in the leaf, rather than a single summary statistic. In our formulation, the learned partition function approximates the 1 − α quantile of the non-conformity score, since we minimize the pinball loss.

4 REGION IDENTIFICATION BASED ON NON-CONFORMITY SCORE QUANTILES

Given a non-conformity score, we want to discover regions in the input space that maximize intra-group homogeneity of the score distribution, but still differ significantly between groups. These regions, if interpretable, provide useful insights about the uncertainty of a model's prediction. Moreover, they can be leveraged at different steps in the ML life cycle, such as data filtering and collection.

Given a miscoverage objective α, we want to learn a mapping τ : 𝒳 → 𝒢 × ℝ (we can also consider soft clustering such that τ : 𝒳 → Δ^{|𝒢|−1} × ℝ) that outputs a computationally identifiable set of groups and an estimate of the 1 − α conformity score quantile for each group, τ(X) = (g_τ(X), q_τ(X)). We use g_τ(X) to denote the group label and q_τ(X) to denote the corresponding quantile estimate (i.e., score threshold). We consider τ to belong to a family of piece-wise constant models T such that ∀τ ∈ T, ∀x_1, x_2 ∈ 𝒳 : g_τ(x_1) = g_τ(x_2) → q_τ(x_1) = q_τ(x_2).

Piece-wise constant models provide an interpretable characterization of the identified groups based on the input features; this is especially true for models such as trees, where the decision rules used to identify each group (leaf node) are clearly laid out. Note that our approach could also be applied to some interpretable feature space of the input by choosing τ(ϕ(X)), where ϕ(·) is some mapping into an interpretable feature space. In particular, ϕ(X) = (X, f(X)) makes the partitioning depend directly on the output of f. This allows the implicit identification of different uncertainty regions based on the model's prediction.
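The τ = (g_τ, q_τ) interface can be illustrated with a decision tree whose leaves serve as groups. The minimal sketch below uses scikit-learn; note that it fits the tree with squared error rather than the pinball objective used in Section 5.1, and the class name and defaults are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class TreePartition:
    """Piece-wise constant tau(x) = (g_tau(x), q_tau(x)) backed by a decision tree.

    Each leaf is a group; the group's quantile estimate is the empirical
    1 - alpha quantile of the non-conformity scores falling in that leaf.
    """

    def __init__(self, alpha=0.1, max_depth=5, min_samples_leaf=50, ccp_alpha=0.0):
        self.alpha = alpha
        self.tree = DecisionTreeRegressor(max_depth=max_depth,
                                          min_samples_leaf=min_samples_leaf,
                                          ccp_alpha=ccp_alpha)

    def fit(self, X, scores):
        self.tree.fit(X, scores)              # squared-error surrogate, for illustration
        leaves = self.tree.apply(X)           # leaf id plays the role of g_tau(x)
        self.leaf_quantile_ = {
            leaf: np.quantile(scores[leaves == leaf], 1 - self.alpha)
            for leaf in np.unique(leaves)
        }
        return self

    def predict(self, X):
        groups = self.tree.apply(X)
        thresholds = np.array([self.leaf_quantile_[g] for g in groups])
        return groups, thresholds             # (g_tau(X), q_tau(X))
```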
4.1 GENERALIZATION OF WORST GROUP MIS-COVERAGE

We want to learn a partition function τ(·) ∈ T that approximates the conditional quantile F_{S|X}^{−1}(1 − α) = inf{ ŝ ∈ supp(P_{S|X}) : P(S ≤ ŝ | X) ≥ 1 − α }. In practice, we have access to a finite dataset D, on which the model family T may be prone to overfitting. Therefore, we want to choose a regularization parameter for T that ensures that the generalization properties of the final model are acceptable. In particular, we want to learn a partition where the worst group-conditional coverage for the identified groups is as close as possible to 1 − α. To do so, we first introduce our definitions of group-conditional miscoverage (Definition 4.1) and worst group miscoverage ratio (Definition 4.2), and then our proposed objective.

Definition 4.1. Consider a distance function d : ℝ × ℝ → ℝ≥0, a set of groups 𝒢 with membership function g : 𝒳 → 𝒢, a threshold function q : 𝒳 → ℝ, and a target coverage 1 − α. The group-conditional miscoverage of the threshold function q over variable S for a group g_j ∈ 𝒢 based on distance d is

    MC_α(q, g; g_j) = E_{X,S}[ d(1 − α, P(S ≤ q(X))) | g(X) = g_j ].      (6)

Following Definition 4.1, we are interested in measuring the worst group-conditional miscoverage w.r.t. the marginal baseline, that is, the model that outputs a single quantile estimate for the entire input space. This indicates whether the proposed grouping, and corresponding quantile estimates, provide a significant improvement in terms of worst-group coverage over a simple, marginal approach. Definition 4.2 presents the proposed worst group miscoverage ratio.

Definition 4.2. Consider a distance function d : ℝ × ℝ → ℝ≥0, the set of groups 𝒢_τ identified by τ(·), the membership function g_τ(·), the corresponding quantile estimator q_τ(·), and q̂ ≃ F_S^{−1}(1 − α) an empirical estimate of the marginal 1 − α quantile of S. Then, we define the worst miscoverage ratio as

    MCR_α(τ) = [ max_{g_j ∈ 𝒢_τ} MC_α(q_τ, g_τ; g_j) ] / [ max_{g_j ∈ 𝒢_τ} MC_α(q̂, g_τ; g_j) ].      (7)

For the distance function we consider d(1 − α, p) = |1 − α − p| or d(1 − α, p) = (1 − α − p)_+, where the latter only considers under-coverage violations. MCR_α(τ) is less than 1 if the worst group miscoverage on the proposed partition 𝒢_τ is lower (better) than the worst miscoverage of a single quantile estimate. In such a case we may prefer the proposed partition over the baseline model.

Given two different group partitions, MCR allows us to compare which of the two partitions identified a set of groups that would benefit most (in the worst-group sense) from the new model over a marginal quantile estimate. Note that we cannot directly compare the worst group miscoverage (MC) between two models, since the MCs are computed across different group definitions. The MCR uses the marginal baseline as an intermediary model, and allows us to compare the two models. The MCR serves a similar role to the R2 coefficient of determination (which compares the residual variance of a model against a constant baseline), but MCR is defined in terms of a pessimistic, worst-case scenario. MCR serves as a computationally efficient alternative to a full auditing approach where an auditor uses a sophisticated optimization procedure to identify the worst computationally identifiable group in terms of miscoverage.

In practice, we observe that the proposed MCR is a better criterion for model selection and group identification than average pinball loss or simply worst group miscoverage on a held-out dataset. As we show in Section 6, selecting a model based only on average pinball loss on a held-out dataset tends to favor models with smaller group sizes whose quantile estimates later fail to generalize, with worst-group coverages that fall behind even the marginal quantile estimate. On the other hand, choosing only based on worst group miscoverage (i.e., worst group MC instead of MCR) tends to discard groups of low probability even in the large sample regime. This is analyzed further in Section 6.

4.2 GROUP DISCOVERY OBJECTIVE

We want to learn a generalizable partition function τ(·) ∈ T that provides the best approximation of the conditional quantile F_{S|X}^{−1}(1 − α). Additionally, we want to ensure that the worst group miscoverage across the learned partition improves over the one achieved with a baseline model over the same partition. To do this, we consider a regularization function R_θ(τ) with parameters θ ∈ Θ that controls the complexity of model τ(·); the strength of the regularization function is chosen based on the empirical MCR score over a finite dataset D_a. This is shown below:

    θ* ∈ arg min_{θ ∈ Θ} MCR_α(τ_θ; D_a)
    s.t. : τ_θ ∈ arg min_{τ ∈ T} E_{D_b}[ ℓ_{1−α}(q_τ(X), S) ] + R_θ(τ).      (8)

The final partition function τ* is the one that minimizes the empirical expected pinball loss with regularization R_{θ*},

    τ* ∈ arg min_{τ ∈ T} E_{D_b}[ ℓ_{1−α}(q_τ(X), S) ] + R_{θ*}(τ).      (9)

Note that the average pinball loss is estimated over a dataset D_b that is independent from D_a but sampled from the same distribution. The objective we propose in Eq. 8 essentially chooses the best model in terms of MCR score among the set of regularized, pinball-loss-minimizing models.
We stress that this objective is meaningful as a finite-sample generalization constraint since, given access to a sufficiently large sample set to learn the group-conditional quantiles, the MCR would be zero. In essence, given sufficient samples, any quantile estimated for any partition of the input space would also have sufficient samples such that the estimated quantile would achieve near-exact group-conditional coverage. An algorithm to achieve Eq. (8), and a formalization of the above statement, are provided in the following section.
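To make Definitions 4.1 and 4.2 concrete, a minimal empirical estimator of the per-group miscoverage and of the MCR score is sketched below (NumPy). The under-coverage distance d(1 − α, p) = (1 − α − p)_+ is used, and the small `eps` guard against a perfectly covered baseline is our own addition rather than part of the paper's definition.

```python
import numpy as np

def group_miscoverage(scores, thresholds, groups, alpha):
    """Empirical MC_alpha per group (Definition 4.1) with the
    under-coverage distance d(1 - alpha, p) = max(1 - alpha - p, 0)."""
    mc = {}
    for g in np.unique(groups):
        mask = groups == g
        coverage = np.mean(scores[mask] <= thresholds[mask])
        mc[g] = max((1 - alpha) - coverage, 0.0)
    return mc

def worst_group_mcr(scores, groups, thresholds, alpha):
    """Worst-group miscoverage ratio (Definition 4.2) against the
    marginal 1 - alpha quantile evaluated on the same groups."""
    q_marginal = np.quantile(scores, 1 - alpha)
    mc_model = group_miscoverage(scores, thresholds, groups, alpha)
    mc_base = group_miscoverage(scores, np.full_like(scores, q_marginal),
                                groups, alpha)
    eps = 1e-12                      # guard against a perfectly covered baseline
    return max(mc_model.values()) / (max(mc_base.values()) + eps)
```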
5 DISCOVERING AND CONFORMALIZING GROUPS IN PRACTICE

We consider θ to be a regularization parameter that is monotonically decreasing with model complexity. In this setting we propose Algorithm 1 to find the regularization strength θ* that recovers the pinball loss minimizer with lowest MCR from a family of clustering methods T. Following this discovery step, we then run a group-conditional conformal prediction mechanism on the discovered regions to conformalize the score quantiles and produce conformal sets/intervals with local coverage guarantees.

Proposition 5.1 shows that Algorithm 1 is optimal in the infinite sample regime, where the generalization objective is easily achieved by any partition. That is, even in the absence of generalization issues, Algorithm 1 correctly approximates the conditional quantile F_{S|X}^{−1}(1 − α) within the desired model class (and finds the best pinball loss minimizer in the presence of generalization challenges otherwise). Although this particular result hinges on the 'infinite sample' assumption, we stress that Algorithm 1 also performs group-conditional conformal prediction on each of the recovered groups (last step in Algorithm 1), which does have finite-sample group-conditional guarantees as shown in Eq. 12.
Proposition 5.1. Given the objective in Eq. 8, if D_a = P(X, S) (infinite sample regime) and θ_0 in Algorithm 1 is the weakest admissible regularization, then τ* = τ_{θ_0}, which also minimizes pinball loss over all admissible regularizations: E_D[ℓ_{1−α}(q_{τ*}(X), S)] ≤ E_D[ℓ_{1−α}(q_{τ_θ}(X), S)], ∀θ ∈ Θ such that θ ≥ θ_0.

Learning Generalizable Quantile Score Regions. Algorithm 1 assumes access to a solver for the τ_θ objective, denoted as M_{1−α}, and a conformal prediction mechanism A_CP, in addition to a dataset (D_1) containing input samples and their corresponding non-conformity scores. The initial parameter θ_0 is the weakest regularization acceptable for interpretability purposes (e.g., maximum tree depth), Π_Θ(·) is a projection operator onto the regularization parameter space, and ∆_θ > 0 is a step size that guarantees a change in θ_t when projected onto Θ unless the minimum admissible complexity bound has been reached. The final clustering model τ*(·) = (g_{τ*}, q_{τ*})(·) is learned using the best regularization parameter θ* ∈ Θ in terms of MCR. This simple approach of steadily increasing regularization strength in finite increments ∆_θ and stopping when MCR fails to improve is sufficient for our purposes, but more sophisticated zero-order approaches could substitute this update strategy.

Conformalizing the Conditional Quantiles of the Discovered Regions. The learned clustering function g_{τ*}(·) is then fed into a group-conditional conformal prediction mechanism A_CP, such as Vovk [2012], Foygel Barber et al. [2021], Jung et al. [2022], Gibbs et al. [2023], to provide conformalized thresholds for each identified group.

For example, for clustering functions g_{τ*}(·) that partition the space with no overlaps, we consider a standard group-conditional split conformal method where A_CP provides the conformal quantile estimator q_τ^{cp}(·) based on the conformal quantile of each identified group. The corresponding conformal set C_τ(X_{n+1}) for a new sample is defined as

    C_τ(X_{n+1}) = { y ∈ 𝒴 : S_f(X_{n+1}, y) ≤ q_τ^{cp}(X_{n+1}) },      (10)

where the conformal quantile function q_τ^{cp}(·) is

    q_τ^{cp}(X_{n+1}) = Q_{1−α}( Σ_{i=1}^n ( 1[g_{n+1} = g_i] / (n_{g_i} + 1) ) δ_{s_i} + ( 1 / (n_{g_{n+1}} + 1) ) δ_∞ ),      (11)

with g_{τ*}(x_i) = g_i, and n_{g_i} = |{ j : g_{τ*}(x_j) = g_i }| the number of samples of group g_i in dataset D_2, ∀i ∈ [n + 1].

Moreover, for each identified group g ∈ 𝒢_{τ*} the coverage guarantees become

    1 − α ≤ P( Y_{n+1} ∈ C_τ(X_{n+1}) | g_{τ*}(X_{n+1}) = g ) ≤ 1 − α + 1/(n_g + 1).      (12)

Note that the upper bound depends on the number of samples n_g of group g in the calibration set.
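A short sketch of the group-conditional thresholds of Eq. 11 and the resulting interval of Eq. 10 is given below. It reuses the hypothetical TreePartition object sketched in Section 4, and assumes an absolute-error non-conformity score S_f(x, y) = |y − f(x)|, which is one common choice for regression rather than the only one supported by the framework.

```python
import numpy as np

def groupwise_conformal_thresholds(cal_scores, cal_groups, alpha):
    """Group-conditional split conformal thresholds (Eq. 11).

    For each group, the threshold is the ceil((1 - alpha) * (n_g + 1))-th
    smallest calibration score in that group (infinite if n_g is too small).
    """
    thresholds = {}
    for g in np.unique(cal_groups):
        s = np.sort(cal_scores[cal_groups == g])
        rank = int(np.ceil((1 - alpha) * (len(s) + 1)))
        thresholds[g] = s[rank - 1] if rank <= len(s) else np.inf
    return thresholds

def conformal_interval(x_new, model, partition, thresholds):
    """Interval of Eq. 10 for the absolute-error score |y - f(x)|."""
    group = partition.predict(x_new.reshape(1, -1))[0][0]   # g_tau*(x)
    q = thresholds[group]
    pred = model.predict(x_new.reshape(1, -1))[0]            # f(x)
    return pred - q, pred + q
```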
5.1 LEARNING DECISION-TREE-BASED REGIONS

Decision trees are a natural candidate for learning partition functions, since they are inherently interpretable, especially at lower tree depths. We need access to a solver M_{1−α} that, given a dataset and a regularization parameter, provides a tree that minimizes the 1 − α average pinball loss as in Eq. 8. The challenge we face with existing decision tree regression optimizers is that, as far as we know, available solvers do not support the pinball loss. Therefore, we first train a surrogate model h* ∈ H that does have access to pinball loss solvers. Then, we approximate the output of h* with the decision tree by minimizing the mean squared error against the surrogate model's predicted (input-dependent) quantile. The procedure described here to learn a decision tree for pinball loss minimization is summarized in the following objective:
    τ_θ ∈ arg min_{τ ∈ T} E_{D_b}[ (q_τ(X) − h*(X))^2 ] + R_θ(τ),
    s.t. h* ∈ arg min_{h ∈ H} E_{D_b}[ ℓ_{1−α}(h(X), S) ].      (13)

In our experiments, we take H to be a family of gradient boosting decision trees that support the pinball loss Ke et al. [2017], and use hyperparameter optimization Akiba et al. [2019] to minimize overfitting in the surrogate model h*.

Algorithm 1 Region Identification Meta-Algorithm

Require: i.i.d. dataset D_1 of input samples and corresponding non-conformity scores; M_{1−α}(·, ·) : D × Θ → T, solver for τ in Eq. 8; A_CP : D × 𝒢^{|D|} → ℝ^{|𝒢|}, group-conditional conformal prediction mechanism; θ_0 ∈ Θ, weakest acceptable regularization parameter; ∆_θ, regularization step size.

  // Region Identification
  MCR* ← ∞                                  ▷ Initialize best MCR
  for t = 0, . . . , T do
      MCR_t ← {}                            ▷ Initialize MCR set for step t
      // K-fold cross validation
      for k = 1, . . . , K do
          Split D_1 randomly into D_{a,k} and D_{b,k}
          τ_θ ← M_{1−α}(D_{b,k}, θ_t)
          MCR_t ← MCR_t ∪ MCR(τ_θ, D_{a,k})
      end for
      S_MCR ← mean(MCR_t) + std(MCR_t)
      if S_MCR < MCR* then
          MCR* ← S_MCR,  θ* ← θ_t
      end if
      θ_{t+1} ← Π_Θ(θ_t + ∆_θ)
  end for
  τ* ← M_{1−α}(D_1, θ*)
  // Conformalize group-conditional quantile predictor
  q_τ^{cp} ← A_CP(D_1, {g_{τ*}(x_i)}_{i ∈ D_1})
  output τ_cp = (q_τ^{cp}, g_{τ*})
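A compact sketch of the solver M_{1−α} of Eq. 13 (a LightGBM quantile surrogate distilled into a tree) and of the regularization sweep in Algorithm 1 is shown below. It assumes LightGBM and scikit-learn are available, reuses the hypothetical `worst_group_mcr` helper sketched in Section 4, and the parameter values (grid, folds, number of estimators) are illustrative rather than the exact experimental configuration.

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

def fit_tau(X, scores, alpha, ccp_alpha):
    """Solver M_{1-alpha}: pinball-loss surrogate distilled into a tree (Eq. 13)."""
    surrogate = LGBMRegressor(objective="quantile", alpha=1 - alpha,
                              n_estimators=100, verbose=-1)
    surrogate.fit(X, scores)
    tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=50,
                                 ccp_alpha=ccp_alpha)
    tree.fit(X, surrogate.predict(X))        # MSE fit against the surrogate quantile
    return tree

def select_theta(X, scores, alpha, thetas, n_splits=5):
    """Algorithm 1 sweep: pick the pruning strength with best mean + std MCR."""
    best_score, best_theta = np.inf, thetas[0]
    for theta in thetas:
        mcrs = []
        for fit_idx, eval_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
            tree = fit_tau(X[fit_idx], scores[fit_idx], alpha, theta)
            groups = tree.apply(X[eval_idx])           # leaf ids as group labels
            mcrs.append(worst_group_mcr(scores[eval_idx], groups,
                                        tree.predict(X[eval_idx]), alpha))
        score = np.mean(mcrs) + np.std(mcrs)
        if score < best_score:
            best_score, best_theta = score, theta
    return best_theta
```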
6 EXPERIMENTS

We evaluate the proposed method on a variety of datasets and show how the proposed MCR-score-based method is able to identify a set of groups whose local coverage is close to the desired target, and show that this diminishes the under- and over-coverage gaps compared to the alternatives.

Table 1: Comparison between the group discovery partition methods. We show MCR, and the marginal, minimum, and maximum group coverage on the identified partition. We also report the number of groups per approach. Standard deviations are computed across 5 data splits. The proposed MCR_DTREE is consistently better in terms of MCR, with values consistently below 1, indicating that the discovered groups improve worst-group under-coverage w.r.t. single-threshold SCP. Every dataset uses an LGBM regressor as the base model. We highlight the lowest MCR and the smallest average coverage above the objective (0.9), since models with larger coverages are less efficient. For methods that achieved the marginal coverage objective we highlight the max and min group coverage closest to the 0.9 objective.

model       | MCR        | avg coverage | max group cov. | min group cov. | num groups
--- Housing: n_samples = 506, n_features = 13 | LGBM-Regressor R2 = 0.64 ± 0.03
LCP-RF-G    | 1.45± 1.14 | .8± .04      | .91± .07       | .64± .15       | 3.6± .55
RF-G        | .77± .6    | .93± .03     | .99± .01       | .86± .06       | 3.6± .55
PB-KMEANS   | .81± .3    | .92± .02     | .97± .04       | .68± .33       | 8.4± 8.65
MCR-KMEANS  | .75± .12   | .91± .05     | .95± .05       | .84± .13       | 2.2± 1.64
PB_DTREE    | .68± .31   | .89± .02     | .94± .03       | .83± .04       | 3.4± .55
MCR_DTREE   | .65± .17   | .92± .03     | .95± .04       | .88± .07       | 2.2± 1.3
--- Concrete: n_samples = 1030, n_features = 8 | LGBM-Regressor R2 = 0.82 ± 0.026
LCP-RF-G    | 1.84± 1.66 | .83± .01     | .94± .05       | .69± .11       | 4.6± .55
RF-G        | .82± .68   | .9± .05      | .97± .02       | .81± .11       | 4.6± .55
PB-KMEANS   | .66± .48   | .91± .05     | .97± .05       | .83± .07       | 7.0± 3.24
MCR-KMEANS  | .88± .27   | .91± .05     | .92± .06       | .88± .05       | 4.2± 7.16
PB_DTREE    | .94± .57   | .89± .04     | .98± .02       | .77± .07       | 6.6± .55
MCR_DTREE   | .55± .72   | .9± .04      | .92± .06       | .88± .04       | 2.4± 2.61
--- Energy: n_samples = 768, n_features = 8 | LGBM-Regressor R2 = 0.93 ± 0.05
LCP-RF-G    | .99± 1.31  | .87± .06     | .97± .03       | .65± 0.05      | 5.0± 1.0
RF-G        | .65± .1    | .92± .03     | .99± .02       | 0.87± .06      | 4.8± 1.64
PB-KMEANS   | 1.04± .34  | .85± .07     | 1.0± .0        | .07± .15       | 47.8± 1.79
MCR-KMEANS  | .68± .3    | .94± .03     | .96± .05       | .78± .17       | 1.6± 9.5
PB_DTREE    | .63± .5    | .93± .03     | .97± .02       | .87± .07       | 3.6± 1.52
MCR_DTREE   | .5± .46    | .92± .03     | .96± .03       | .88± .07       | 3.2± 1.64
--- Power: n_samples = 9568, n_features = 4 | LGBM-Regressor R2 = 0.95 ± 0.01
LCP-RF-G    | 3.67± 2.26 | .82± .05     | .86± .03       | .78± .07       | 4.4± 1.95
RF-G        | .47± .22   | .9± .0       | .92± .01       | .88± .01       | 5.0± .71
PB-KMEANS   | .76± .18   | .9± .01      | .95± .03       | .85± .02       | 15.0± 7.55
MCR-KMEANS  | .66± .23   | .91± .01     | .96± .03       | .86± .02       | 16.6± 10.26
PB_DTREE    | 1.13± .6   | .9± .0       | .98± .04       | .76± .09       | 17.2± 9.26
MCR_DTREE   | .57± .2    | .9± .01      | .92± .03       | .88± .03       | 5.8± 8.56
--- Protein: n_samples = 45730, n_features = 9 | LGBM-Regressor R2 = 0.46 ± 0.04
LCP-RF-G    | .83± .56   | .9± .01      | .94± .05       | .85± .02       | 10.5± 6.4
RF-G        | .61± .36   | .9± .0       | .95± .05       | .88± .03       | 11.0± 7.0
PB-KMEANS   | .59± .57   | .9± .0       | 1.0± .0        | .71± .22       | 4.8± 5.67
MCR-KMEANS  | .47± .3    | .9± .0       | .97± .05       | .87± .03       | 11.4± 8.26
PB_DTREE    | .79± .27   | .9± .0       | 1.0± .0        | .81± .01       | 31.2± .45
MCR_DTREE   | .17± .14   | .9± .0       | .91± .01       | .89± .01       | 4.4± .89
--- kin8mn: n_samples = 8192, n_features = 8 | LGBM-Regressor R2 = 0.62 ± 0.03
LCP-RF-G    | 2.32± 1.1  | .8± .02      | .84± .02       | .75± .04       | 4.6± 1.34
RF-G        | .32± .18   | .9± .0       | .93± .01       | .87± .01       | 5.2± .45
PB-KMEANS   | .96± 0.67  | .92± .0      | 1.0± .0        | .72± .03       | 41.0± 8.57
MCR-KMEANS  | .76± .16   | .91± .02     | .94± .05       | .82± .11       | 20.6± 7.06
PB_DTREE    | .73± .39   | .9± .01      | .97± .03       | .8± .07        | 16.4± 6.58
MCR_DTREE   | .4± .2     | .9± .01      | .91± .02       | .89± .02       | 3.0± 1.41

6.1 REGRESSION DATASET RESULTS
We used gradient boosting (LGBM) Ke et al. [2017] as our base regressor f; the hyperparameters for each dataset were selected using hyperparameter optimization Akiba et al. [2019] to minimize validation loss. Additional results using Lasso are shown in Appendix A.3. For all experiments, we split the available training data as follows: 40% train, 40% calibration, 20% test. We use a target coverage/validity of 0.9 (90%, α = 0.1).

Datasets. We considered six regression tasks based on datasets from the UCI repository Asuncion and Newman [2007]. These are the Boston Housing price prediction (14 attributes, Housing) Harrison Jr and Rubinfeld [1978]; Energy efficiency prediction (12 building parameters, Energy) Tsanas and Xifara [2012]; Concrete compressive strength prediction (8 attributes, Concrete) Yeh [2007]; estimation of the size of the residue based on different physical and chemical properties of protein tertiary structure (Protein) Rana [2013]; net hourly electrical energy output prediction of a combined cycle power plant (4 features, Power) Tfekci and Kaya [2014]; and prediction of the distance of the end-effector from a target based on the forward kinematics of a robot arm (kin8mn) Rasmussen et al. [1996], Corke [1996].

Methods. We evaluate the performance of Algorithm 1 choosing τ to be a decision tree that minimizes the pinball loss as described in Section 5.1. We use standard group-conditional split conformal (A_CP) Vovk [2012] and denote the final model as MCR_DTREE. For the MCR score (Eq. 7) we selected d(1 − α, p) = (1 − α − p)_+ as our under-coverage distance function. We constrain our decision trees to a minimum of 50 samples per leaf and a maximum depth of 5. We set the cost-complexity pruning variable as the regularization parameter θ, with θ_0 = 1e−5 and ∆_{θ_t} = 9 × θ_t. We compare against a decision tree that minimizes average pinball loss (i.e., Algorithm 1 where MCR is replaced by average pinball loss), which we denote as PB_DTREE. Additionally, we compare against the group-wise random forest localizer conformalization method (LCP-RF-G) proposed by Amoukou and Brunel [2023], which generates a partition using conformity score weights extracted from a random forest, and we later use a standard split conformal approach based on their identified groups (RF-G). Finally, we examine a simple K-means clustering in the input space, where the number of clusters is chosen based on best average pinball loss (PB-KMEANS) and best MCR (MCR-KMEANS) with cross validation.

Coverage on Identified Groups. Table 1 shows the minimum and maximum group coverage for the partitions recovered by each approach. We observe that the proposed MCR_DTREE identifies partitions that consistently provide the best (or second best) minimum coverage, and the smallest gap between maximum and minimum group coverage, all while achieving the target marginal coverage of 0.9. In general, MCR_DTREE tends to identify a smaller set of groups, with a wide range of interval widths as shown in Figure 2. Moreover, it achieves the smallest MCR when compared to the competing baselines. The MCR of MCR_DTREE is consistently below 1, indicating that a baseline SCP approach would yield worse results in terms of worst-group under-coverage. We note that the partition identified by RF-G, once integrated with split conformal prediction, has significantly better performance than their LCP-RF-G alternative. RF-G achieves comparable results on some of the datasets, with larger disparity in terms of coverage gap between the identified groups, and worse MCR. PB-KMEANS and MCR-KMEANS have large variances in their performance, potentially due to the fact that K-means clusters do not leverage the non-conformity scores.

Figure 2: Scatter and distribution plots of the prediction interval widths (x-axis) versus coverage (y-axis) of the groups discovered by the proposed MCR_DTREE and PB_DTREE methods across 6 datasets; panels (a) Energy, (b) Power, (c) Kin8mn, (d) Protein. We plot all the groups obtained across 5-fold realizations. The size of a group's point represents the group size. The target coverage is 0.9; we observe that MCR_DTREE tends to identify a smaller number of groups of varying sizes, with group-conditional coverages concentrated around the 0.9 objective. Moreover, the identified groups show diversity in the range of interval widths. PB_DTREE detects a significantly larger number of (smaller) groups, with a larger variance in terms of group-conditional coverage. Additional plots in Appendix A.3.
Size and Efficiency of the Identified Groups. Figure 2 shows the joint distribution of the mean width and coverage of the groups identified by the MCR_DTREE and PB_DTREE approaches across all datasets. We observe that MCR_DTREE tends to identify a smaller number of groups when compared to PB_DTREE. PB_DTREE tends to identify multiple groups of small sizes, with a wide range of widths and coverage ranges. MCR_DTREE is able to identify groups with diverse widths (as we can see in the marginal distribution of the mean width), but the identified groups have coverages closer to the desired objective of 0.9.

Interpretable Groups. Figure 3 in Appendix A.3 shows the trees discovered by MCR_DTREE. The discovered groups have different interval widths, indicating that the uncertainty in the model's prediction is non-uniform across the input space. Moreover, groups with higher uncertainty (larger mean width) tend to have a smaller size. This can inform a data collection process by encouraging the collection of samples from the identified high-uncertainty minorities.

7 CONCLUSION

Here we propose a method to learn an interpretable partition of the input space based on the uncertainty of a black-box model's prediction. We leverage the conformal prediction framework and decision tree models to identify a set of groups of varying sizes where the quantiles of the non-conformity scores are as homogeneous as possible within each group but differ significantly across groups. We propose a fitness criterion (the worst group miscoverage ratio, MCR) and an accompanying algorithm to achieve this, and show its effectiveness on a varied set of regression datasets. Our proposed method is able to discover a set of groups with better local coverage performance than competing methods.
References

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631, 2019.

Salim I Amoukou and Nicolas JB Brunel. Adaptive conformal prediction by reweighting nonconformity score. arXiv preprint arXiv:2303.12695, 2023.

Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.

Arthur Asuncion and David Newman. UCI machine learning repository, 2007.

Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816–845, 2023.

Osbert Bastani, Varun Gupta, Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth. Practical adversarial multivalid conformal prediction. Advances in Neural Information Processing Systems, 35:29362–29373, 2022.

Henrik Boström, Ulf Johansson, and Tuwe Löfström. Mondrian conformal predictive distributions. In Conformal and Probabilistic Prediction and Applications, pages 24–38. PMLR, 2021.

Peter I Corke. A robotics toolbox for MATLAB. IEEE Robotics & Automation Magazine, 3(1):24–32, 1996.

J-C Delvenne, Sophia N Yaliraki, and Mauricio Barahona. Stability of graph communities across time scales. Proceedings of the National Academy of Sciences, 107(29):12755–12760, 2010.

Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2):455–482, 2021.

Subhankar Ghosh, Taha Belkhouja, Yan Yan, and Janardhan Rao Doppa. Improving uncertainty quantification of deep classifiers via neighborhood conformal prediction: Novel algorithm and theoretical analysis. arXiv preprint arXiv:2303.10694, 2023.

Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34:1660–1672, 2021.

Isaac Gibbs, John J Cherian, and Emmanuel J Candès. Conformal prediction with conditional guarantees. arXiv preprint arXiv:2305.12616, 2023.

Leying Guan. Localized conformal prediction: A generalized inference framework for conformal prediction. Biometrika, 110(1):33–50, 2023.

Xing Han, Ziyang Tang, Joydeep Ghosh, and Qiang Liu. Split localized conformal prediction. arXiv preprint arXiv:2206.13092, 2022.

David Harrison Jr and Daniel L Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102, 1978.

Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth. Batch multivalid conformal prediction. arXiv preprint arXiv:2209.15145, 2022.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 2017.

Danijel Kivaranovic, Kory D Johnson, and Hannes Leeb. Adaptive, distribution-free prediction intervals for deep networks. In International Conference on Artificial Intelligence and Statistics, pages 4346–4356. PMLR, 2020.

Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):71–96, 2014.

Huiying Mao, Ryan Martin, and Brian J Reich. Valid model-free spatial prediction. Journal of the American Statistical Association, pages 1–11, 2022.

Nicolai Meinshausen and Greg Ridgeway. Quantile regression forests. Journal of Machine Learning Research, 7(6), 2006.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Machine Learning: ECML 2002: 13th European Conference on Machine Learning, Helsinki, Finland, August 19–23, 2002, Proceedings 13, pages 345–356. Springer, 2002.

Harris Papadopoulos, Vladimir Vovk, and Alexander Gammerman. Regression conformal prediction with nearest neighbours. Journal of Artificial Intelligence Research, 40:815–840, 2011.

Prashant Rana. Physicochemical Properties of Protein Tertiary Structure. UCI Machine Learning Repository, 2013. DOI: https://doi.org/10.24432/C5QW3H.

Carl E Rasmussen, Radford M Neal, Geoffrey E Hinton, Drew van Camp, Michael Revow, Zoubin Ghahramani, R Kustra, and Robert Tibshirani. The DELVE manual. URL http://www.cs.toronto.edu/~delve, 1996.

Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. Advances in Neural Information Processing Systems, 32, 2019.

Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel Candès. With malice toward none: Assessing uncertainty via equalized coverage. Harvard Data Science Review, 2(2):4, 2020.

Nabeel Seedat, Alan Jeffares, Fergus Imrie, and Mihaela van der Schaar. Improving adaptive conformal prediction using self-supervised learning. In International Conference on Artificial Intelligence and Statistics, pages 10160–10177. PMLR, 2023.

Matteo Sesia and Yaniv Romano. Conformal prediction using conditional histograms. Advances in Neural Information Processing Systems, 34:6304–6315, 2021.

Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.

Kamile Stankeviciute, Ahmed M Alaa, and Mihaela van der Schaar. Conformal time-series forecasting. Advances in Neural Information Processing Systems, 34:6216–6228, 2021.

Pınar Tüfekci and Heysem Kaya. Combined Cycle Power Plant. UCI Machine Learning Repository, 2014. DOI: https://doi.org/10.24432/C5002N.

Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1):5233, 2019.

Athanasios Tsanas and Angeliki Xifara. Energy Efficiency. UCI Machine Learning Repository, 2012. DOI: https://doi.org/10.24432/C51307.

Vladimir Vovk. Conditional validity of inductive conformal predictors. In Asian Conference on Machine Learning, pages 475–490. PMLR, 2012.

Vladimir Vovk, David Lindsay, Ilia Nouretdinov, and Alex Gammerman. Mondrian confidence machine. Technical Report, 2003.

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World, volume 29. Springer, 2005.

I-Cheng Yeh. Concrete Compressive Strength. UCI Machine Learning Repository, 2007. DOI: https://doi.org/10.24432/C5PK67.
Identifying Homogeneous and Interpretable Groups for Conformal Prediction
(Supplementary Material)

Natalia Martinez Gil1 Dhaval Patel1 Chandra Reddy1 Giridhar Ganapavarapu1 Roman Vaculin1
Jayant Kalagnanam1

1 IBM Research, Yorktown Heights, New York, USA

A APPENDIX

A.1 PROOFS

Restatement of Proposition 5.1

Proposition A.1. Given the objective in Eq. 8, if D_1 = P(X, S) (infinite sample regime) and θ_0 in Algorithm 1 is the weakest admissible regularization, then τ* = τ_{θ_0}, which also minimizes pinball loss over all admissible regularizations: E_D[ℓ_{1−α}(q_{τ*}(X), S)] ≤ E_D[ℓ_{1−α}(q_{τ_θ}(X), S)], ∀θ ∈ Θ such that θ ≥ θ_0.

Proof. We first show that in the infinite sample regime the MCR is zero ∀θ ∈ Θ, making all θ equivalent according to the MCR criterion. Then we show that Algorithm 1 would choose θ* = θ_0, and since θ_0 is the lowest regularization it achieves the smallest expected pinball loss.

Given access to the real distribution D_1 = P(X, S), for any θ ∈ Θ we get a finite set partition 𝒢_{τ_θ} such that the 1 − α quantile estimate q_{τ_θ}(X) is the exact group-conditional quantile of the non-conformity score distribution for the group that contains the instance X:

    q_{τ_θ}(X) = F_{S|G = g_{τ_θ}(X)}^{−1}(1 − α),      (14)

where g_{τ_θ}(X) ∈ 𝒢_{τ_θ}, ∀X ∈ 𝒳. Then, in this asymptotic regime the group-conditional miscoverage (Definition 4.1) satisfies MC_α(q_{τ_θ}, g_{τ_θ}; g_j) = 0, ∀g_j ∈ 𝒢_{τ_θ} and ∀θ ∈ Θ. Then MCR_α(τ_θ), as defined in Eq. 7, is 0 ∀θ ∈ Θ.

Since Algorithm 1 terminates on the first θ that achieves the minimum MCR, then θ* = θ_0. Since θ_0 is the weakest regularization, and we assume the infinite sample regime to learn τ_θ for all θ ∈ Θ, then E_D[ℓ_{1−α}(q_{τ*}(X), S)] ≤ E_D[ℓ_{1−α}(q_{τ_θ}(X), S)], ∀θ ∈ Θ such that θ ≥ θ_0.
Θ such that θ ≥ θ0 .

A.2 EXPERIMENTAL DETAILS

A.2.1 Learning Decision Tree Based Regions

We learn a decision tree that approximates the non-conformity score quantile by optimizing Eq. 13. To do so, we first learn a
surrogate model h that minimizes the pinball loss 1 − α of the non-conformity scores.

Step 1: Learn Surrogate Model h In our experiments h is an LGBM quantile regressor that we learn using Optuna Akiba
et al. [2019] with the following hyperparameters over 5-fold validation where the final set of parameters for h∗ is chosen
based on best average pinball loss plus one standard deviation.

• Optimizer Configuration: N _ TRIALS = 200, TIMEOUT = 11200.


• LGBM Model Parameters Exploration: LAMBDA _1 ∼ loguniform(1e − 8, 10.0),LAMBDA _2 ∼ loguniform(1e −
8, 10.0), LEARNING _ RATE ∼ loguniform(1e − 8, 10.0), bagging_fraction ∈ [0.4, 1.0], bagging_freq ∈ [1, 7],
num_leaves ∈ [2, 100], num_boost_round ∈ [1, 100], min_child_samples ∈ [50, 200], max_depth = 2 ,

Step 2: Learn The Decision Tree Model τ To learn τ we optimize the mean square error distance w.r.t. the prediction of
the quantile LGBM regressor h∗ learned in the previous step. As stated in Section 6 in Algorithm 1 we consider trees up to a
maximum depth of 5 and at least 50 samples per leaf. The regularization parameter θ is the cost complexity pruning variable.
We set θ0 = 1e − 5 and a step size ∆θt = 9 × θt .
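A sketch of the surrogate-model search in Step 1 is shown below, assuming Optuna, LightGBM, and scikit-learn ≥ 1.0. The search-space names mirror the bullet list above; mapping lambda_1/lambda_2 to LightGBM's reg_alpha/reg_lambda is our own assumption, and the variable names in the commented usage example (X_cal, nonconformity_scores) are placeholders, not names from the released code.

```python
import optuna
from lightgbm import LGBMRegressor
from sklearn.metrics import make_scorer, mean_pinball_loss
from sklearn.model_selection import cross_val_score

def make_objective(X, scores, alpha=0.1):
    # Scorer for the 1 - alpha pinball loss (lower is better).
    scorer = make_scorer(mean_pinball_loss, alpha=1 - alpha, greater_is_better=False)

    def objective(trial):
        params = {
            "objective": "quantile",
            "alpha": 1 - alpha,                  # target quantile of the scores
            "reg_alpha": trial.suggest_float("lambda_1", 1e-8, 10.0, log=True),
            "reg_lambda": trial.suggest_float("lambda_2", 1e-8, 10.0, log=True),
            "learning_rate": trial.suggest_float("learning_rate", 1e-8, 10.0, log=True),
            "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
            "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
            "num_leaves": trial.suggest_int("num_leaves", 2, 100),
            "n_estimators": trial.suggest_int("num_boost_round", 1, 100),
            "min_child_samples": trial.suggest_int("min_child_samples", 50, 200),
            "max_depth": 2,
            "verbose": -1,
        }
        model = LGBMRegressor(**params)
        # 5-fold CV; negate because scikit-learn scorers are maximized.
        return -cross_val_score(model, X, scores, cv=5, scoring=scorer).mean()

    return objective

# study = optuna.create_study(direction="minimize")
# study.optimize(make_objective(X_cal, nonconformity_scores), n_trials=200, timeout=11200)
```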

A.3 ADDITIONAL EXPERIMENTS

Figure 3 shows the decision trees that were obtained for the different datasets. We observe that the discovered regions have different prediction interval widths, indicating that the model's prediction uncertainty differs significantly across regions. Figure 4 shows the scatter and joint distribution between the prediction interval widths and coverage of the discovered groups. It extends Figure 2 in the main manuscript by including all datasets and the groups discovered by the RF-G approach proposed by Amoukou and Brunel [2023]. Table 2 shows the same comparison presented in Table 1 but for a LASSO base model. We observe that the number of groups discovered by the proposed method MCR_DTREE is larger than with an LGBM regression model on the same dataset (Table 1). In most cases, the LGBM model is equal to or better than LASSO in terms of R2 score, and therefore reduces the unexplained variance of the target Y|X. This leads to fewer regions of different uncertainty and tighter prediction sets.
[Figure 3 panels: (a) Housing, (b) Concrete, (c) Energy, (d) Power, (e) Kin8nm, (f) Protein]

Figure 3: Example of decision trees identified for each regression dataset. (3a) In the Housing dataset groups are defined
based on the features corresponding to average number of rooms per dwelling (RM) and weighted distances to five Boston
employment centers (DIS). (3b) In the Concrete dataset the groups are defined based on the Cement and Fine Aggregate
components (kg in a m3 mixture). (3c) the groups in the Energy dataset are defined based on Glazing Area Distribution (X8),
Glazing Area (X7) and Wall Area (X3). (3d) In the Power dataset groups are defined based on Ambient Temperature (AT),
Exhaust Vacuum (V) and Relative Humidity (RH). (3e) In the kin8nm dataset the groups are defined by the measurements
on sensors from links 3, 5 and 6 from the robot arm. (3f) In the protein dataset the groups are defined by the features
corresponding to fractional area of exposed non polar residue (F3) and fractional area of exposed non polar part of residue
(F4).
[Figure 4 panels: (a) Housing, (b) Concrete, (c) Energy, (d) Power, (e) Kin8mn, (f) Protein]

Figure 4: Scatter and distribution plots of the prediction interval widths (x-axis) versus coverage (y-axis) of the groups discovered by the proposed MCR_DTREE, PB_DTREE and RF-G methods across 6 datasets. Here we plot all the groups obtained across 5-fold realizations. The size of a group's point represents the group size (number of samples). The target coverage is 0.9; we observe that MCR_DTREE tends to identify a smaller number of groups of varying sizes, with group-conditional coverages concentrated around the 0.9 objective. Moreover, the identified groups show diversity in the range of interval widths. PB_DTREE detects a significantly larger number of (smaller) groups, with a larger variance in terms of group-conditional coverage.
model | MCR | avg coverage | max group cov. | min group cov. | num groups

Housing: nsamples = 506, nfeatures = 13 | LASSO-Regressor R2 = 0.69 ± 0.04


LCP - RF - G 2.71±0.77 0.8±0.06 0.91±0.08 0.75±0.07 2.6±0.55
RF - G 0.42±0.38 0.91±0.03 0.96±0.03 0.81±0.15 3.2±0.45
PB - KMEANS 1.47±0.49 0.86±0.03 0.98±0.03 0.44±0.43 14.2±15.02
MCR - KMEANS 1.35±0.74 0.88±0.04 0.97±0.03 0.69±0.38 7.4±11.52
PB _ DTREE 0.32±0.21 0.88±0.03 0.98±0.05 0.83±0.05 4.0±1.87
MCR _ DTREE 0.25±0.39 0.89±0.04 0.95±0.04 0.84±0.07 3.6±2.07
Concrete: nsamples = 1030, nfeatures = 8 | LASSO-Regressor R2 = 0.60 ± 0.05
LCP - RF - G 1.37±1.12 0.83±0.02 0.96±0.04 0.7±0.05 5.4±0.55
RF - G 0.29 ±0.15 0.91±0.02 0.98±0.03 0.8±0.08 5.0±0.71
PB - KMEANS 0.89±0.48 0.9±0.05 1.0±0.0 0.26±0.37 37.2±16.93
MCR - KMEANS 0.43±0.43 0.92±0.02 0.97±0.03 0.7±0.3 15.8±18.98
PB _ DTREE 0.25±0.14 0.9±0.03 1.0±0.0 0.8±0.07 7.0±2.24
MCR _ DTREE 0.15±0.09 0.9±0.03 1.0±0.0 0.84±0.04 6.8±2.39
Energy: nsamples = 768, nfeatures = 8 | LASSO-Regressor R2 = 0.91 ± 0.005
LCP - RF - G 0.38±0.19 0.88±0.05 0.98±0.03 0.8±0.08 4.8±0.45
RF - G 0.12±0.12 0.94±0.02 1.0±0.0 0.87±0.06 5.0±0.71
PB - KMEANS 1.07±0.77 0.87±0.04 0.99±0.02 0.18±0.4 38.2±19.15
MCR - KMEANS 0.32±0.41 0.94±0.03 0.98±0.04 0.83±0.13 13.0±11.92
PB _ DTREE 0.12±0.16 0.94±0.02 0.99±0.03 0.84±0.11 9.0±3.46
MCR _ DTREE 0.05±0.09 0.94±0.02 0.98±0.02 0.89±0.03 6.0±3.24
Power: nsamples = 9568, nfeatures = 4 | LASSO-Regressor R2 = 0.93 ± 0.003
LCP - RF - G 2.04±1.26 0.82±0.05 0.86±0.08 0.78±0.05 6.0±2.24
RF - G 0.83±0.57 0.9±0.0 0.93±0.02 0.87±0.01 5.2±0.84
PB - KMEANS 0.73±0.27 0.91±0.01 0.99±0.02 0.78±0.05 37.2±5.22
MCR - KMEANS 0.46±0.15 0.9±0.0 0.93±0.03 0.88±0.03 6.0±7.28
PB _ DTREE 0.08±0.05 0.9±0.01 0.94±0.03 0.87±0.02 6.4±4.16
MCR _ DTREE 0.06±0.05 0.9±0.0 0.94±0.01 0.88±0.02 7.4±3.71
Protein: : nsamples = 45730, nfeatures = 9 | LASSO-Regressor R2 = 0.28 ± 0.01
LCP - RF - G 0.89±0.56 0.87±0.03 0.92±0.02 0.75±0.04 5.8±1.6
RF - G 0.44±0.37 0.9±0.0 0.95±0.05 0.87±0.02 6.00±1.59
PB - KMEANS 0.71±0.75 0.9±0.0 1.0±0.0 0.65±0.21 42.6±7.86
MCR - KMEANS 0.52±0.21 0.9±0.0 0.96±0.05 0.76±0.24 16.2±12.91
PB _ DTREE 0.44±0.37 0.9±0.0 1.0±0.0 0.83±0.02 15.6±0.89
MCR _ DTREE 0.2±0.08 0.9±0.0 0.93±0.03 0.89±0.01 5.6±2.19
kin8mn: : nsamples = 8192, nfeatures = 8 | LASSO-Regressor R2 = 0.40 ± 0.007
LCP - RF - G 1.68±0.29 0.79±0.01 0.81±0.01 0.77±0.01 3.0±0.0
RF - G 0.21±0.04 0.9±0.01 0.91±0.01 0.88±0.0 3.2±0.45
PB - KMEANS 0.67±0.16 0.92±0.01 0.99±0.01 0.76±0.04 39.4±14.06
MCR - KMEANS 0.44±0.37 0.9±0.01 0.93±0.04 0.87±0.05 11.6±21.47
PB _ DTREE 0.41±0.36 0.89±0.01 0.98±0.04 0.82±0.07 14.2±3.03
MCR _ DTREE 0.24±0.18 0.9±0.01 0.94±0.04 0.88±0.02 6.4±5.37

Table 2: Comparison between the group discovery partition methods. We show MCR, and the marginal, minimum, and maximum group coverage on the identified partition. We also report the number of groups per approach. Standard deviations are computed across 5 data splits. The proposed MCR_DTREE is consistently better in terms of MCR, with values consistently below 1, indicating that the discovered groups improve worst-group under-coverage w.r.t. single-threshold SCP. Every dataset uses a LASSO regressor as the base model. We highlight the lowest MCR and the smallest average coverage above the objective (0.9). For methods that achieved the marginal coverage objective we highlight the max and min group coverage closest to the 0.9 objective.
