Natalia Martinez Gil¹, Dhaval Patel¹, Chandra Reddy¹, Giridhar Ganapavarapu¹, Roman Vaculin¹, Jayant Kalagnanam¹
¹IBM Research, Yorktown Heights, New York, USA
prediction (8 attributes, Concrete) Yeh [2007]; estimation of the size of the residue based on different physical and chemical properties of protein tertiary structure (Protein) Rana [2013]; net hourly electrical energy output prediction of a combined cycle power plant (4 features, Power) Tfekci and Kaya [2014]; prediction of the distance of the end-effector from a target based on the forward kinematics of a robot arm (kin8nm) Rasmussen et al. [1996], Corke [1996].

Methods. We evaluate the performance of Algorithm 1, choosing τ to be a decision tree that minimizes the pinball loss as described in Section 5.1. We use standard group-conditional split conformal (ACP) Vovk [2012] and denote the final model as MCR_DTREE. For the MCR score (Eq. 7) we selected d(1 − α, p) = (1 − α − p)+ as our under-coverage distance function. We constrain our decision trees to a maximum depth of 5 with at least 50 samples per leaf.

Size and Efficiency of the Identified Groups. Figure 2 shows the joint distribution of the mean width and coverage of the groups identified by the MCR_DTREE and PB_DTREE approaches across all datasets. We observe that MCR_DTREE tends to identify a smaller number of groups than PB_DTREE. PB_DTREE tends to identify multiple groups of small size, with a wide range of widths and coverage values. MCR_DTREE is able to identify groups with diverse widths (as seen in the marginal distribution of the mean width), but the identified groups have coverages closer to the desired objective of 0.9.

Interpretable Groups. Figure 3 in Appendix A.3 shows the trees discovered by MCR_DTREE. The discovered groups have different interval widths, indicating that the uncertainty of the model's prediction is non-uniform across the input space. Moreover, groups with higher uncertainty (larger mean width) tend to have a smaller size. This can inform a data collection process by encouraging the collection of samples from the identified high-uncertainty minorities.

7 CONCLUSION

Here we propose a method to learn an interpretable partition of the input space based on the uncertainty of a black-box model's prediction. We leverage the conformal prediction framework and decision tree models to identify a set of groups of varying sizes in which the quantiles of the non-conformity scores are as homogeneous as possible within each group but significantly different across groups. We propose a fitness criterion (the group miscoverage ratio, MCR) and an accompanying algorithm to achieve this, and show its effectiveness on a varied set of regression datasets. Our proposed method is able to discover a set of groups with better local coverage performance than competing methods.
References
Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631, 2019.

Salim I Amoukou and Nicolas JB Brunel. Adaptive conformal prediction by reweighting nonconformity score. arXiv preprint arXiv:2303.12695, 2023.

Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.

Arthur Asuncion and David Newman. UCI Machine Learning Repository, 2007.

Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2):455–482, 2021.

Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816–845, 2023.

Osbert Bastani, Varun Gupta, Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth. Practical adversarial multivalid conformal prediction. Advances in Neural Information Processing Systems, 35:29362–29373, 2022.

Henrik Boström, Ulf Johansson, and Tuwe Löfström. Mondrian conformal predictive distributions. In Conformal and Probabilistic Prediction and Applications, pages 24–38. PMLR, 2021.

Peter I Corke. A robotics toolbox for MATLAB. IEEE Robotics & Automation Magazine, 3(1):24–32, 1996.

J-C Delvenne, Sophia N Yaliraki, and Mauricio Barahona. Stability of graph communities across time scales. Proceedings of the National Academy of Sciences, 107(29):12755–12760, 2010.

Subhankar Ghosh, Taha Belkhouja, Yan Yan, and Janardhan Rao Doppa. Improving uncertainty quantification of deep classifiers via neighborhood conformal prediction: Novel algorithm and theoretical analysis. arXiv preprint arXiv:2303.10694, 2023.

Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34:1660–1672, 2021.

Isaac Gibbs, John J Cherian, and Emmanuel J Candès. Conformal prediction with conditional guarantees. arXiv preprint arXiv:2305.12616, 2023.

Leying Guan. Localized conformal prediction: A generalized inference framework for conformal prediction. Biometrika, 110(1):33–50, 2023.

Xing Han, Ziyang Tang, Joydeep Ghosh, and Qiang Liu. Split localized conformal prediction. arXiv preprint arXiv:2206.13092, 2022.

David Harrison Jr and Daniel L Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102, 1978.

Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth. Batch multivalid conformal prediction. arXiv preprint arXiv:2209.15145, 2022.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 2017.

Danijel Kivaranovic, Kory D Johnson, and Hannes Leeb. Adaptive, distribution-free prediction intervals for deep networks. In International Conference on Artificial Intelligence and Statistics, pages 4346–4356. PMLR, 2020.

Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):71–96, 2014.

Huiying Mao, Ryan Martin, and Brian J Reich. Valid model-free spatial prediction. Journal of the American Statistical Association, pages 1–11, 2022.

Nicolai Meinshausen and Greg Ridgeway. Quantile regression forests. Journal of Machine Learning Research, 7(6), 2006.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Machine Learning: ECML 2002, 13th European Conference on Machine Learning, Helsinki, Finland, August 19–23, 2002, Proceedings, pages 345–356. Springer, 2002.

Harris Papadopoulos, Vladimir Vovk, and Alexander Gammerman. Regression conformal prediction with nearest neighbours. Journal of Artificial Intelligence Research, 40:815–840, 2011.

Prashant Rana. Physicochemical Properties of Protein Tertiary Structure. UCI Machine Learning Repository, 2013. DOI: https://doi.org/10.24432/C5QW3H.

Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1):5233, 2019.

Athanasios Tsanas and Angeliki Xifara. Energy Efficiency. UCI Machine Learning Repository, 2012. DOI: https://doi.org/10.24432/C51307.

Vladimir Vovk. Conditional validity of inductive conformal predictors. In Asian Conference on Machine Learning, pages 475–490. PMLR, 2012.

Vladimir Vovk, David Lindsay, Ilia Nouretdinov, and Alex Gammerman. Mondrian confidence machine. Technical Report, 2003.

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World, volume 29. Springer, 2005.

I-Cheng Yeh. Concrete Compressive Strength. UCI Machine Learning Repository, 2007. DOI: https://doi.org/10.24432/C5PK67.
A APPENDIX
A.1 PROOFS
Proposition A.1. Given the objective in Eq. 8, if $\mathcal{D}_1 = P(X, S)$ (infinite sample regime) and $\theta_0$ in Algorithm 1 is the weakest admissible regularization, then $\tau^* = \tau_{\theta_0}$, which also minimizes the pinball loss over all admissible regularizations:
$$\mathbb{E}_{\mathcal{D}}\big[\ell_{1-\alpha}(q_{\tau^*}(X), S)\big] \leq \mathbb{E}_{\mathcal{D}}\big[\ell_{1-\alpha}(q_{\tau_\theta}(X), S)\big], \quad \forall \theta \in \Theta \text{ such that } \theta \geq \theta_0.$$
Proof. We first show that in the infinite sample regime the MCR is zero for all $\theta \in \Theta$, making all $\theta$ equivalent according to the MCR criterion. Then we show that Algorithm 1 chooses $\theta^* = \theta_0$, and since $\theta_0$ is the weakest regularization it achieves the smallest expected pinball loss.
Given access to the real distribution $\mathcal{D}_1 = P(X, S)$, for any $\theta \in \Theta$ we obtain a finite partition $\mathcal{G}_{\tau_\theta}$ such that the $1-\alpha$ quantile estimate $q_{\tau_\theta}(X)$ is the exact group-conditional quantile of the non-conformity score distribution for the group that contains the instance $X$:
$$q_{\tau_\theta}(X) = F^{-1}_{S \mid G = g_{\tau_\theta}(X)}(1 - \alpha), \qquad (14)$$
where $g_{\tau_\theta}(X) \in \mathcal{G}_{\tau_\theta}$ for all $X \in \mathcal{X}$. Then, in this asymptotic regime, the group-conditional miscoverage (Definition 4.1) satisfies $MC_\alpha(q_{\tau_\theta}, g_{\tau_\theta}; g_j) = 0$ for all $g_j \in \mathcal{G}_{\tau_\theta}$ and all $\theta \in \Theta$. Hence $\mathrm{MCR}_\alpha(\tau_\theta)$ as defined in Eq. 7 is 0 for all $\theta \in \Theta$.
Since Algorithm 1 terminates at the first $\theta$ that achieves the minimum MCR, $\theta^* = \theta_0$. Since $\theta_0$ is the weakest regularization, and we assume the infinite sample regime when learning $\tau_\theta$ for all $\theta \in \Theta$, we obtain $\mathbb{E}_{\mathcal{D}}\big[\ell_{1-\alpha}(q_{\tau^*}(X), S)\big] \leq \mathbb{E}_{\mathcal{D}}\big[\ell_{1-\alpha}(q_{\tau_\theta}(X), S)\big]$ for all $\theta \in \Theta$ such that $\theta \geq \theta_0$.
We learn a decision tree that approximates the non-conformity score quantile by optimizing Eq. 13. To do so, we first learn a surrogate model h that minimizes the 1 − α pinball loss of the non-conformity scores.
Step 1: Learn Surrogate Model h. In our experiments h is an LGBM quantile regressor that we tune using Optuna Akiba et al. [2019] over the following hyperparameters with 5-fold validation; the final set of parameters for h∗ is chosen based on the best average pinball loss plus one standard deviation.
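As a minimal illustration of this step, the sketch below tunes an LGBM quantile regressor with Optuna over a 5-fold split. The search space shown (num_leaves, learning_rate, n_estimators) and the scikit-learn-style interface are assumptions for illustration, not the exact hyperparameters or code used in our experiments.

```python
import numpy as np
import optuna
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold

alpha = 0.1  # target miscoverage level; the surrogate models the 1 - alpha quantile


def pinball_loss(s, q, level):
    # Pinball (quantile) loss between observed scores s and predicted quantiles q.
    diff = s - q
    return float(np.mean(np.maximum(level * diff, (level - 1) * diff)))


def objective(trial, X, s):
    # Illustrative search space only; the ranges used in the paper may differ.
    params = {
        "objective": "quantile",
        "alpha": 1 - alpha,  # quantile level fitted by LightGBM
        "num_leaves": trial.suggest_int("num_leaves", 8, 128),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "verbose": -1,
    }
    losses = []
    for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        h = LGBMRegressor(**params).fit(X[tr], s[tr])
        losses.append(pinball_loss(s[va], h.predict(X[va]), 1 - alpha))
    # "best average pinball loss plus one standard deviation" selection criterion
    return float(np.mean(losses) + np.std(losses))


# Usage, with X_cal and s_cal the calibration features and non-conformity scores:
# study = optuna.create_study(direction="minimize")
# study.optimize(lambda t: objective(t, X_cal, s_cal), n_trials=50)
# h_star = LGBMRegressor(objective="quantile", alpha=1 - alpha,
#                        **study.best_params).fit(X_cal, s_cal)
```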
Step 2: Learn the Decision Tree Model τ. To learn τ we minimize the mean squared error w.r.t. the predictions of the quantile LGBM regressor h∗ learned in the previous step. As stated in Section 6, in Algorithm 1 we consider trees up to a maximum depth of 5 with at least 50 samples per leaf. The regularization parameter θ is the cost-complexity pruning variable. We set θ0 = 1e−5 and a step size ∆θt = 9 × θt (i.e., θt+1 = 10 θt).
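The sketch below mirrors this step and the θ-sweep of Algorithm 1 as described above: it fits scikit-learn DecisionTreeRegressor models (max_depth=5, min_samples_leaf=50, with ccp_alpha playing the role of θ) to the surrogate's predicted quantiles, increases θ with the schedule θt+1 = 10 θt starting from θ0 = 1e−5, and keeps the weakest regularization attaining the smallest MCR. The mcr_score helper is a simplified stand-in for Eq. 7 (mean group under-coverage relative to a single marginal threshold), an assumption for illustration rather than the paper's exact definition.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def mean_under_coverage(groups_fit, s_fit, groups_eval, s_eval, alpha):
    # Mean under-coverage distance d(1 - alpha, p) = max(1 - alpha - p, 0) across groups:
    # group quantiles are estimated on the fit split, coverage is measured on the eval split.
    dists = []
    for g in np.unique(groups_fit):
        q_g = np.quantile(s_fit[groups_fit == g], 1 - alpha)
        in_group = s_eval[groups_eval == g]
        p_g = float(np.mean(in_group <= q_g)) if in_group.size else 1.0
        dists.append(max(1 - alpha - p_g, 0.0))
    return float(np.mean(dists))


def mcr_score(tree, X_fit, s_fit, X_eval, s_eval, alpha, eps=1e-12):
    # Simplified stand-in for the group miscoverage ratio (Eq. 7): under-coverage of the
    # group-wise quantiles relative to that of a single marginal (SCP-style) quantile.
    grouped = mean_under_coverage(tree.apply(X_fit), s_fit,
                                  tree.apply(X_eval), s_eval, alpha)
    q_marginal = np.quantile(s_fit, 1 - alpha)
    marginal = max(1 - alpha - float(np.mean(s_eval <= q_marginal)), 0.0)
    return grouped / (marginal + eps)


def fit_mcr_dtree(X_fit, q_hat, s_fit, X_eval, s_eval,
                  alpha=0.1, theta0=1e-5, n_steps=10):
    # Sweep the cost-complexity pruning parameter theta (theta_{t+1} = 10 * theta_t),
    # regressing on the surrogate quantiles q_hat = h*(X_fit), and keep the first
    # (weakest) regularization that attains the minimal MCR.
    best_tree, best_mcr, theta = None, np.inf, theta0
    for _ in range(n_steps):
        tree = DecisionTreeRegressor(
            max_depth=5, min_samples_leaf=50, ccp_alpha=theta
        ).fit(X_fit, q_hat)
        score = mcr_score(tree, X_fit, s_fit, X_eval, s_eval, alpha)
        if score < best_mcr:  # strict "<" retains the weakest theta in case of ties
            best_tree, best_mcr = tree, score
        theta += 9 * theta
    return best_tree
```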
Figure 3 shows the decision trees obtained for the different datasets. We observe that the discovered regions have different prediction interval widths, indicating that the model's prediction uncertainty differs significantly across the input space. Figure 4 shows the scatter and joint distribution between the prediction interval widths and the coverage of the discovered groups. It extends Figure 2 in the main manuscript by including all datasets and the groups discovered by the RF-G approach proposed by Amoukou and Brunel [2023]. Table 2 shows the same comparison presented in Table 1 but for a LASSO base model. We observe that the number of groups discovered by the proposed MCR_DTREE is larger than with an LGBM base model on the same dataset (Table 1). In most cases, the LGBM model is equal to or better than LASSO in terms of R² score, and therefore reduces the unexplained variance of the target Y | X. This leads to fewer regions of differing uncertainty and tighter prediction sets.
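To make the plotted quantities concrete, the helper below shows one way the per-group coverage and mean interval width could be computed on a test split, assuming absolute-residual non-conformity scores so that the interval for a sample assigned to group g is the base model's prediction ± the group's calibrated quantile. This is an illustrative reconstruction, not the evaluation code used to produce the figures.

```python
import numpy as np


def group_coverage_and_width(groups_test, y_test, y_pred, group_quantiles):
    # Per-group empirical coverage and interval width, assuming absolute-residual
    # scores so the prediction interval for group g is [y_pred - q_g, y_pred + q_g].
    stats = {}
    for g, q_g in group_quantiles.items():
        mask = groups_test == g
        if not mask.any():
            continue
        covered = np.abs(y_test[mask] - y_pred[mask]) <= q_g
        stats[g] = {
            "coverage": float(np.mean(covered)),
            "mean_width": float(2.0 * q_g),
            "size": int(mask.sum()),
        }
    return stats
```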
Figure 3: Example of the decision trees identified for each regression dataset. (3a) In the Housing dataset, groups are defined based on the features corresponding to the average number of rooms per dwelling (RM) and the weighted distances to five Boston employment centers (DIS). (3b) In the Concrete dataset, the groups are defined based on the Cement and Fine Aggregate components (kg per m³ of mixture). (3c) In the Energy dataset, the groups are defined based on Glazing Area Distribution (X8), Glazing Area (X7), and Wall Area (X3). (3d) In the Power dataset, groups are defined based on Ambient Temperature (AT), Exhaust Vacuum (V), and Relative Humidity (RH). (3e) In the kin8nm dataset, the groups are defined by the measurements on sensors from links 3, 5, and 6 of the robot arm. (3f) In the Protein dataset, the groups are defined by the features corresponding to the fractional area of exposed non-polar residue (F3) and the fractional area of exposed non-polar part of residue (F4).
Figure 4: Scatter and distribution plot of the prediction interval widths (x-axis) versus coverage (y-axis) of the groups discovered by the proposed MCR_DTREE, PB_DTREE, and RF-G methods across 6 datasets. Here we plot all the groups obtained across 5-fold realizations. The size of each point represents the group size (number of samples). The target coverage is 0.9; we observe that MCR_DTREE tends to identify a smaller number of groups of varying sizes, with group-conditional coverages concentrated around the 0.9 objective. Moreover, the identified groups show diversity in the range of interval widths. PB_DTREE detects a significantly larger number of (smaller) groups, with a larger variance in group-conditional coverage.
Table 2 columns: model, MCR, average coverage, max group coverage, min group coverage, number of groups.
Table 2: Comparison between the group discovery partition methods. We show the MCR and the marginal, minimum, and maximum group coverage on the identified partition. We also report the number of groups per approach. Standard deviations are computed across 5 data splits. The proposed MCR_DTREE is consistently better in terms of MCR, with values below 1, indicating that the discovered groups improve worst-group under-coverage w.r.t. the single-threshold SCP. Every dataset uses a LASSO regressor as the base model. We highlight the lowest MCR and the smallest average coverage above the objective (0.9). For methods that achieved the marginal coverage objective we highlight the max and min group coverage closest to the 0.9 objective.