On Calibration of Modern Neural Networks
Figure 2. The effect of network depth (far left), width (middle left), Batch Normalization (middle right), and weight decay (far right) on
miscalibration, as measured by ECE (lower is better).
Table 1. ECE (%) (with M = 15 bins) on standard vision and NLP datasets before calibration and with various calibration methods.
The number following a model’s name denotes the network depth.
Extension of binning methods. One common way of extending binary calibration methods to the multiclass setting is by treating the problem as K one-versus-all problems (Zadrozny & Elkan, 2002). For k = 1, . . . , K, we form a binary calibration problem where the label is 1(y_i = k) and the predicted probability is σ_SM(z_i)^(k). This gives us K calibration models, each for a particular class. At test time, we obtain an unnormalized probability vector [q̂_i^(1), . . . , q̂_i^(K)], where q̂_i^(k) is the calibrated probability for class k. The new class prediction ŷ_i′ is the argmax of the vector, and the new confidence q̂_i′ is the max of the vector normalized by Σ_{k=1}^K q̂_i^(k). This extension can be applied to histogram binning, isotonic regression, and BBQ.
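To make the one-versus-all construction concrete, here is a minimal NumPy sketch. The HistogramBinningBinary helper (equal-width bins, an arbitrary 0.5 default for empty bins) and its fit/predict interface are our own illustrative choices, not necessarily the exact binning scheme of the original methods.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class HistogramBinningBinary:
    """Toy binary calibrator: map a predicted probability to the empirical
    frequency of positives in its bin (equal-width bins for simplicity)."""
    def __init__(self, n_bins=15):
        self.edges = np.linspace(0.0, 1.0, n_bins + 1)
        self.theta = np.full(n_bins, 0.5)

    def fit(self, p, y):
        idx = np.clip(np.digitize(p, self.edges[1:-1]), 0, len(self.theta) - 1)
        for b in range(len(self.theta)):
            mask = idx == b
            if mask.any():
                self.theta[b] = y[mask].mean()
        return self

    def predict(self, p):
        idx = np.clip(np.digitize(p, self.edges[1:-1]), 0, len(self.theta) - 1)
        return self.theta[idx]

def fit_one_vs_all(logits_val, labels_val, make_calibrator=HistogramBinningBinary):
    # one binary calibration problem per class: label 1(y_i = k), probability softmax(z_i)^(k)
    probs = softmax(logits_val)
    K = probs.shape[1]
    return [make_calibrator().fit(probs[:, k], (labels_val == k).astype(float))
            for k in range(K)]

def predict_one_vs_all(calibrators, logits_test):
    # unnormalized calibrated vector; prediction = argmax, confidence = max / sum
    probs = softmax(logits_test)
    q = np.column_stack([c.predict(probs[:, k]) for k, c in enumerate(calibrators)])
    return q.argmax(axis=1), q.max(axis=1) / q.sum(axis=1)
```

Swapping HistogramBinningBinary for an isotonic regression or BBQ calibrator with the same fit/predict interface yields the corresponding multiclass extensions.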
Matrix and vector scaling are two multi-class extensions of Platt scaling. Let z_i be the logits vector produced before the softmax layer for input x_i. Matrix scaling applies a linear transformation W z_i + b to the logits:

    q̂_i = max_k σ_SM(W z_i + b)^(k),
    ŷ_i′ = argmax_k (W z_i + b)^(k).        (8)

The parameters W and b are optimized with respect to NLL on the validation set. As the number of parameters for matrix scaling grows quadratically with the number of classes K, we define vector scaling as a variant where W is restricted to be a diagonal matrix.
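As a concrete sketch of how W and b can be fit by NLL minimization, the following NumPy/SciPy code handles both variants. The L-BFGS-B solver with numerical gradients and the identity initialization are illustrative choices for brevity (an autodiff framework with analytic gradients is preferable for large K), not necessarily the procedure used in our experiments.

```python
import numpy as np
from scipy.optimize import minimize

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def nll(logits, labels):
    # average negative log-likelihood of the true classes
    return -log_softmax(logits)[np.arange(len(labels)), labels].mean()

def fit_linear_scaling(logits_val, labels_val, diagonal=False):
    """Fit the transformation W z + b by minimizing validation NLL.
    diagonal=True restricts W to a diagonal matrix, i.e. vector scaling."""
    K = logits_val.shape[1]

    def unpack(theta):
        if diagonal:
            return np.diag(theta[:K]), theta[K:]
        return theta[:K * K].reshape(K, K), theta[K * K:]

    def objective(theta):
        W, b = unpack(theta)
        return nll(logits_val @ W.T + b, labels_val)

    # start from the identity transformation (logits left unchanged)
    init = np.concatenate([np.ones(K) if diagonal else np.eye(K).ravel(), np.zeros(K)])
    return unpack(minimize(objective, init, method="L-BFGS-B").x)
```

With diagonal=True the number of free parameters drops from K² + K to 2K, which is the only difference between matrix and vector scaling; after fitting, the calibrated confidence is q̂_i = max_k σ_SM(W z_i + b)^(k) on the rescaled test logits.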
Temperature scaling, the simplest extension of Platt scaling, uses a single scalar parameter T > 0 for all classes. Given the logit vector z_i, the new confidence prediction is

    q̂_i = max_k σ_SM(z_i / T)^(k).        (9)

T is called the temperature, and it "softens" the softmax (i.e. raises the output entropy) with T > 1. As T → ∞, the probability q̂_i approaches 1/K, which represents maximum uncertainty. With T = 1, we recover the original probability p̂_i. As T → 0, the probability collapses to a point mass (i.e. q̂_i = 1). T is optimized with respect to NLL on the validation set. Because the parameter T does not change the maximum of the softmax function, the class prediction ŷ_i′ remains unchanged. In other words, temperature scaling does not affect the model's accuracy.

Temperature scaling is commonly used in settings such as knowledge distillation (Hinton et al., 2015) and statistical mechanics (Jaynes, 1957). To the best of our knowledge, it has not previously been used in the context of calibrating probabilistic models.³ The model is equivalent to maximizing the entropy of the output probability distribution subject to certain constraints on the logits (see Section S2).

³ To highlight the connection with prior works we define temperature scaling in terms of 1/T instead of a multiplicative scalar.
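Because T enters only by dividing the logits, fitting it is a one-dimensional problem. A minimal sketch, assuming validation logits and labels are available as NumPy arrays (the bounded scalar search and its bracket are arbitrary illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def fit_temperature(logits_val, labels_val):
    """Find T > 0 minimizing NLL on the validation set."""
    idx = np.arange(len(labels_val))

    def nll(T):
        return -log_softmax(logits_val / T)[idx, labels_val].mean()

    # bounded 1-D search; the bracket (0.05, 20) is an arbitrary but generous range
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

def calibrated_confidence(logits, T):
    # q̂_i = max_k softmax(z_i / T)^(k); the argmax, and hence accuracy, is unchanged
    return np.exp(log_softmax(logits / T)).max(axis=1)
```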
4.3. Other Related Works

Calibration and confidence scores have been studied in various contexts in recent years. Kuleshov & Ermon (2016) study the problem of calibration in the online setting, where the inputs can come from a potentially adversarial source. Kuleshov & Liang (2015) investigate how to produce calibrated probabilities when the output space is a structured object. Lakshminarayanan et al. (2016) use ensembles of networks to obtain uncertainty estimates. Pereyra et al. (2017) penalize overconfident predictions as a form of regularization. Hendrycks & Gimpel (2017) use confidence scores to determine if samples are out-of-distribution.

Bayesian neural networks (Denker & Lecun, 1990; MacKay, 1992) return a probability distribution over outputs as an alternative way to represent model uncertainty. Gal & Ghahramani (2016) draw a connection between Dropout (Srivastava et al., 2014) and model uncertainty, claiming that sampling models with dropped nodes is a way to estimate the probability distribution over all possible models for a given sample. Kendall & Gal (2017) combine this approach with a model that outputs a predictive mean and variance for each data point. This notion of uncertainty is not restricted to classification problems. Additionally, neural networks can be used in conjunction with Bayesian models that output complete distributions. For example, deep kernel learning (Wilson et al., 2016a;b; Al-Shedivat et al., 2016) combines deep neural networks with Gaussian processes on classification and regression problems. In contrast, our framework, which does not augment the neural network model, returns a confidence score rather than a distribution of possible outputs.
5. Results

We apply the calibration methods in Section 4 to image classification and document classification neural networks. For image classification we use 6 datasets:

1. Caltech-UCSD Birds (Welinder et al., 2010): 200 bird species. 5994/2897/2897 images for train/validation/test sets.
2. Stanford Cars (Krause et al., 2013): 196 classes of cars by make, model, and year. 8041/4020/4020 images for train/validation/test.
3. ImageNet 2012 (Deng et al., 2009): Natural scene images from 1000 classes. 1.3 million/25,000/25,000 images for train/validation/test.
4. CIFAR-10/CIFAR-100 (Krizhevsky & Hinton, 2009): Color images (32 × 32) from 10/100 classes. 45,000/5,000/10,000 images for train/validation/test.
5. Street View House Numbers (SVHN) (Netzer et al., 2011): 32 × 32 colored images of cropped-out house numbers from Google Street View. 598,388/6,000/26,032 images for train/validation/test.

We train state-of-the-art convolutional networks: ResNets (He et al., 2016), ResNets with stochastic depth (SD) (Huang et al., 2016), Wide ResNets (Zagoruyko & Komodakis, 2016), and DenseNets (Huang et al., 2017). We use the data preprocessing, training procedures, and hyperparameters as described in each paper. For Birds and Cars, we fine-tune networks pretrained on ImageNet.

For document classification we experiment with 4 datasets:

1. 20 News: News articles, partitioned into 20 categories by content. 9034/2259/7528 documents for train/validation/test.
2. Reuters: News articles, partitioned into 8 categories by topic. 4388/1097/2189 documents for train/validation/test.
3. Stanford Sentiment Treebank (SST) (Socher et al., 2013): Movie reviews, represented as sentence parse trees that are annotated by sentiment. Each sample includes a coarse binary label and a fine-grained 5-class label. As described in (Tai et al., 2015), the training/validation/test sets contain 6920/872/1821 documents for binary, and 544/1101/2210 for fine-grained.

On 20 News and Reuters, we train Deep Averaging Networks (DANs) (Iyyer et al., 2015) with 3 feed-forward layers and Batch Normalization. On SST, we train TreeLSTMs (Long Short Term Memory) (Tai et al., 2015). For both models we use the default hyperparameters suggested by the authors.

Calibration Results. Table 1 displays model calibration, as measured by ECE (with M = 15 bins), before and after applying the various methods (see Section S3 for MCE, NLL, and error tables). It is worth noting that most datasets and models experience some degree of miscalibration, with ECE typically between 4% and 10%. This is not architecture specific: we observe miscalibration on convolutional networks (with and without skip connections), recurrent networks, and deep averaging networks. The two notable exceptions are SVHN and Reuters, both of which experience ECE values below 1%. Both of these datasets have very low error (1.98% and 2.97%, respectively); therefore the ratio of ECE to error is comparable to other datasets.

Our most important discovery is the surprising effectiveness of temperature scaling despite its remarkable simplicity. Temperature scaling outperforms all other methods on the vision tasks, and performs comparably to other methods on the NLP datasets. What is perhaps even more surprising is that temperature scaling outperforms the vector and matrix Platt scaling variants, which are strictly more general methods. In fact, vector scaling recovers essentially the same solution as temperature scaling – the learned vector has nearly constant entries, and therefore is no different than a scalar transformation. In other words, network miscalibration is intrinsically low dimensional.

The only dataset that temperature scaling does not calibrate is Reuters. In this instance, only one of the above methods is able to improve calibration. Because this dataset is well-calibrated to begin with (ECE ≤ 1%), there is not much room for improvement with any method, and post-processing may not even be necessary. It is also possible that our measurements are affected by dataset split or by the particular binning scheme.
[Figure 4: reliability diagrams (accuracy vs. confidence) for ResNet-110 (SD) on CIFAR-100. Panels: Uncalibrated (ECE = 12.67), Temperature Scaling (ECE = 0.96), Histogram Binning (ECE = 2.46), Isotonic Regression (ECE = 4.16).]
Figure 4. Reliability diagrams for CIFAR-100 before (far left) and after calibration (middle left, middle right, far right).
Matrix scaling performs poorly on datasets with hundreds of classes (i.e. Birds, Cars, and CIFAR-100), and fails to converge on the 1000-class ImageNet dataset. This is expected, since the number of parameters scales quadratically with the number of classes. Any calibration model with tens of thousands (or more) parameters will overfit to a small validation set, even when applying regularization.

Binning methods improve calibration on most datasets, but do not outperform temperature scaling. Additionally, binning methods tend to change class predictions, which hurts accuracy (see Section S3). Histogram binning, the simplest binning method, typically outperforms isotonic regression and BBQ, despite the fact that both methods are strictly more general. This further supports our finding that calibration is best corrected by simple models.

Reliability diagrams. Figure 4 contains reliability diagrams for 110-layer ResNets on CIFAR-100 before and after calibration. From the far left diagram, we see that the uncalibrated ResNet tends to be overconfident in its predictions. We can then observe the effects of temperature scaling (middle left), histogram binning (middle right), and isotonic regression (far right) on calibration. All three displayed methods produce much better confidence estimates. Of the three methods, temperature scaling most closely recovers the desired diagonal function. Each of the bins is well calibrated, which is remarkable given that all the probabilities were modified by only a single parameter. We include reliability diagrams for other datasets in Section S4.

Computation time. All methods scale linearly with the number of validation set samples. Temperature scaling is by far the fastest method, as it amounts to a one-dimensional convex optimization problem. Using a conjugate gradient solver, the optimal temperature can be found in 10 iterations, or a fraction of a second on most modern hardware. In fact, even a naive line-search for the optimal temperature is faster than any of the other methods. The computational complexities of vector and matrix scaling are linear and quadratic, respectively, in the number of classes, reflecting the number of parameters in each method. For CIFAR-100 (K = 100), finding a near-optimal vector scaling solution with conjugate gradient descent requires at least 2 orders of magnitude more time. Histogram binning and isotonic regression take an order of magnitude longer than temperature scaling, and BBQ takes roughly 3 orders of magnitude more time.

Ease of implementation. BBQ is arguably the most difficult to implement, as it requires implementing a model averaging scheme. While all other methods are relatively easy to implement, temperature scaling may arguably be the most straightforward to incorporate into a neural network pipeline. In Torch7 (Collobert et al., 2011), for example, we implement temperature scaling by inserting a nn.MulConstant between the logits and the softmax, whose parameter is 1/T. We set T = 1 during training, and subsequently find its optimal value on the validation set.⁴

⁴ For an example implementation, see http://github.com/gpleiss/temperature_scaling.

6. Conclusion

Modern neural networks exhibit a strange phenomenon: probabilistic error and miscalibration worsen even as classification error is reduced. We have demonstrated that recent advances in neural network architecture and training – model capacity, normalization, and regularization – have strong effects on network calibration. It remains future work to understand why these trends affect calibration while improving accuracy. Nevertheless, simple techniques can effectively remedy the miscalibration phenomenon in neural networks. Temperature scaling is the simplest, fastest, and most straightforward of the methods, and surprisingly is often the most effective.
Acknowledgments

The authors are supported in part by the III-1618134, III-1526012, and IIS-1149882 grants from the National Science Foundation, as well as the Bill and Melinda Gates Foundation and the Office of Naval Research.

References

Al-Shedivat, Maruan, Wilson, Andrew Gordon, Saatchi, Yunus, Hu, Zhiting, and Xing, Eric P. Learning scalable deep kernels with recurrent structure. arXiv preprint arXiv:1610.08936, 2016.
Bengio, Yoshua, Goodfellow, Ian J, and Courville, Aaron. Deep learning. Nature, 521:436–444, 2015.
Bojarski, Mariusz, Del Testa, Davide, Dworakowski, Daniel, Firner, Bernhard, Flepp, Beat, Goyal, Prasoon, Jackel, Lawrence D, Monfort, Mathew, Muller, Urs, Zhang, Jiakai, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
Caruana, Rich, Lou, Yin, Gehrke, Johannes, Koch, Paul, Sturm, Marc, and Elhadad, Noemie. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In KDD, 2015.
Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clément. Torch7: A matlab-like environment for machine learning. In BigLearn Workshop, NIPS, 2011.
Cosmides, Leda and Tooby, John. Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty. Cognition, 58(1):1–73, 1996.
DeGroot, Morris H and Fienberg, Stephen E. The comparison and evaluation of forecasters. The Statistician, pp. 12–22, 1983.
Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248–255, 2009.
Denker, John S and Lecun, Yann. Transforming neural-net output levels to probability distributions. In NIPS, pp. 853–859, 1990.
Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.
Gal, Yarin and Ghahramani, Zoubin. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
Girshick, Ross. Fast R-CNN. In ICCV, pp. 1440–1448, 2015.
Hannun, Awni, Case, Carl, Casper, Jared, Catanzaro, Bryan, Diamos, Greg, Elsen, Erich, Prenger, Ryan, Satheesh, Sanjeev, Sengupta, Shubho, Coates, Adam, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
Hendrycks, Dan and Gimpel, Kevin. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.
Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the knowledge in a neural network. 2015.
Huang, Gao, Sun, Yu, Liu, Zhuang, Sedra, Daniel, and Weinberger, Kilian. Deep networks with stochastic depth. In ECCV, 2016.
Huang, Gao, Liu, Zhuang, Weinberger, Kilian Q, and van der Maaten, Laurens. Densely connected convolutional networks. In CVPR, 2017.
Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.
Iyyer, Mohit, Manjunatha, Varun, Boyd-Graber, Jordan, and Daumé III, Hal. Deep unordered composition rivals syntactic methods for text classification. In ACL, 2015.
Jaynes, Edwin T. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.
Jiang, Xiaoqian, Osl, Melanie, Kim, Jihoon, and Ohno-Machado, Lucila. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274, 2012.
Kendall, Alex and Cipolla, Roberto. Modelling uncertainty in deep learning for camera relocalization. 2016.
Kendall, Alex and Gal, Yarin. What uncertainties do we need in bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.
Krause, Jonathan, Stark, Michael, Deng, Jia, and Fei-Fei, Li. 3D object representations for fine-grained categorization. In IEEE Workshop on 3D Representation and Recognition (3dRR), Sydney, Australia, 2013.
Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images, 2009.
Kuleshov, Volodymyr and Ermon, Stefano. Reliable confidence estimation via online learning. arXiv preprint arXiv:1607.03594, 2016.
Kuleshov, Volodymyr and Liang, Percy. Calibrated structured prediction. In NIPS, pp. 3474–3482, 2015.
Lakshminarayanan, Balaji, Pritzel, Alexander, and Blundell, Charles. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2016.
LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
MacKay, David JC. A practical bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
Naeini, Mahdi Pakdaman, Cooper, Gregory F, and Hauskrecht, Milos. Obtaining well calibrated probabilities using bayesian binning. In AAAI, pp. 2901, 2015.
Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading digits in natural images with unsupervised feature learning. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS, 2011.
Niculescu-Mizil, Alexandru and Caruana, Rich. Predicting good probabilities with supervised learning. In ICML, pp. 625–632, 2005.
Pereyra, Gabriel, Tucker, George, Chorowski, Jan, Kaiser, Łukasz, and Hinton, Geoffrey. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
Platt, John et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
Srivastava, Rupesh Kumar, Greff, Klaus, and Schmidhuber, Jürgen. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
Tai, Kai Sheng, Socher, Richard, and Manning, Christopher D. Improved semantic representations from tree-structured long short-term memory networks. 2015.
Vapnik, Vladimir N. Statistical Learning Theory. Wiley-Interscience, 1998.
Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
Wilson, Andrew G, Hu, Zhiting, Salakhutdinov, Ruslan R, and Xing, Eric P. Stochastic variational deep kernel learning. In NIPS, pp. 2586–2594, 2016a.
Wilson, Andrew Gordon, Hu, Zhiting, Salakhutdinov, Ruslan, and Xing, Eric P. Deep kernel learning. In AISTATS, pp. 370–378, 2016b.
Xiong, Wayne, Droppo, Jasha, Huang, Xuedong, Seide, Frank, Seltzer, Mike, Stolcke, Andreas, Yu, Dong, and Zweig, Geoffrey. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256, 2016.
Zadrozny, Bianca and Elkan, Charles. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In ICML, pp. 609–616, 2001.
Zadrozny, Bianca and Elkan, Charles. Transforming classifier scores into accurate multiclass probability estimates. In KDD, pp. 694–699, 2002.
Zagoruyko, Sergey and Komodakis, Nikos. Wide residual networks. In BMVC, 2016.
Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, and Vinyals, Oriol. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
S1. Further Information on Calibration Metrics

We can connect the ECE metric with our exact miscalibration definition, which is restated here:

    E_P̂ [ | P(Ŷ = Y | P̂ = p) − p | ]

Let F_P̂(p) be the cumulative distribution function of P̂ so that F_P̂(b) − F_P̂(a) = P(P̂ ∈ [a, b]). Using the Riemann-Stieltjes integral we have

    E_P̂ [ | P(Ŷ = Y | P̂ = p) − p | ]
      = ∫₀¹ | P(Ŷ = Y | P̂ = p) − p | dF_P̂(p)
      ≈ Σ_{m=1}^M | P(Ŷ = Y | P̂ = p_m) − p_m | P(P̂ ∈ I_m)

where I_m represents the interval of bin B_m.

| P(Ŷ = Y | P̂ = p_m) − p_m | is closely approximated by |acc(B_m) − p̂(B_m)| for n large. Hence ECE using M bins converges to the M-term Riemann-Stieltjes sum of E_P̂ [ | P(Ŷ = Y | P̂ = p) − p | ].
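The finite-sample quantity that this sum approximates is easy to compute directly. Below is a minimal NumPy sketch of ECE with M equal-width bins (M = 15, as in Table 1); the function name and the treatment of empty bins are our own choices.

```python
import numpy as np

def ece(confidences, correct, n_bins=15):
    """Expected Calibration Error: sum over M equal-width bins of
    (|B_m| / n) * |acc(B_m) - conf(B_m)|, where `confidences` are the predicted
    (max softmax) probabilities and `correct` flags whether each prediction was right."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()         # empirical accuracy in the bin
            conf = confidences[in_bin].mean()    # average confidence in the bin
            total += (in_bin.sum() / n) * abs(acc - conf)
    return total
```

The same per-bin accuracies and confidences are what the reliability diagrams in Figures 4 and S2-S4 plot.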
S2. Further Information on Temperature Scaling

Here we derive the temperature scaling model using the entropy maximization principle with an appropriate balanced equation.

Claim 1. Given n samples' logit vectors z_1, . . . , z_n and class labels y_1, . . . , y_n, temperature scaling is the unique solution q to the following entropy maximization problem:

    max_q  − Σ_{i=1}^n Σ_{k=1}^K q(z_i)^(k) log q(z_i)^(k)
    subject to  q(z_i)^(k) ≥ 0   ∀i, k
                Σ_{k=1}^K q(z_i)^(k) = 1   ∀i
                Σ_{i=1}^n z_i^(y_i) = Σ_{i=1}^n Σ_{k=1}^K z_i^(k) q(z_i)^(k).

The first two constraints ensure that q is a probability distribution, while the last constraint limits the scope of distributions. Intuitively, the constraint specifies that the average true-class logit is equal to the average weighted logit.

Proof. We solve this constrained optimization problem using the Lagrangian. We first ignore the constraint q(z_i)^(k) ≥ 0 and later show that the solution satisfies this condition. Let λ, β_1, . . . , β_n ∈ R be the Lagrange multipliers and define

    L = − Σ_{i=1}^n Σ_{k=1}^K q(z_i)^(k) log q(z_i)^(k)
        + λ Σ_{i=1}^n [ Σ_{k=1}^K z_i^(k) q(z_i)^(k) − z_i^(y_i) ]
        + Σ_{i=1}^n β_i Σ_{k=1}^K ( q(z_i)^(k) − 1 ).

Taking the derivative with respect to q(z_i)^(k) gives

    ∂L/∂q(z_i)^(k) = −nK − log q(z_i)^(k) + λ z_i^(k) + β_i.

Setting the gradient of the Lagrangian L to 0 and rearranging gives

    q(z_i)^(k) = exp( λ z_i^(k) + β_i − nK ).

Since Σ_{k=1}^K q(z_i)^(k) = 1 for all i, we must have

    q(z_i)^(k) = exp( λ z_i^(k) ) / Σ_{j=1}^K exp( λ z_i^(j) ),

which recovers the temperature scaling model by setting T = 1/λ. ∎

Figure S1 visualizes Claim 1. We see that, as training continues, the model begins to overfit with respect to NLL (red line). This results in a low-entropy softmax distribution over classes (blue line), which explains the model's overconfidence. Temperature scaling not only lowers the NLL but also raises the entropy of the distribution (green line).
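The balance constraint in Claim 1 is precisely the first-order optimality condition for T when NLL is minimized on the validation set, which is also why the post-calibration NLL and entropy in Figure S1 coincide. The sketch below checks this numerically; the synthetic logits and the bounded scalar search are purely hypothetical and for illustration only.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def check_claim(logits, labels):
    """Fit T by minimizing NLL, then check (averaged per sample) that the true-class
    logit equals the q-weighted logit, and that NLL equals the entropy of q."""
    idx = np.arange(len(labels))
    nll = lambda T: -log_softmax(logits / T)[idx, labels].mean()
    T = minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

    logq = log_softmax(logits / T)
    q = np.exp(logq)
    lhs = logits[idx, labels].mean()           # average true-class logit
    rhs = (logits * q).sum(axis=1).mean()      # average weighted logit under q
    entropy = -(q * logq).sum(axis=1).mean()
    print(f"T = {T:.3f}  |lhs - rhs| = {abs(lhs - rhs):.2e}  "
          f"NLL = {nll(T):.4f}  entropy = {entropy:.4f}")

# Synthetic, overconfident logits (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=5000)
logits = 3.0 * rng.normal(size=(5000, 10))
logits[np.arange(5000), labels] += 4.0         # true class usually wins by a large margin
check_claim(logits, labels)
```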
S3. Additional Tables

Tables S1, S2, and S3 display the MCE, test error, and NLL for all the experimental settings outlined in Section 5.
Figure S1. Entropy and NLL for CIFAR-100 before and after calibration. The optimal T selected by temperature scaling rises throughout
optimization, as the pre-calibration entropy decreases steadily. The post-calibration entropy and NLL on the validation set coincide
(which can be derived from the gradient optimality condition of T ).
Dataset Model Uncalibrated Hist. Binning Isotonic BBQ Temp. Scaling Vector Scaling Matrix Scaling
Birds ResNet 50 30.06% 25.35% 16.59% 11.72% 9.08% 9.81% 38.67%
Cars ResNet 50 41.55% 5.16% 15.23% 9.31% 20.23% 8.59% 29.65%
CIFAR-10 ResNet 110 33.78% 26.87% 7.8% 72.64% 8.56% 27.39% 22.89%
CIFAR-10 ResNet 110 (SD) 34.52% 17.0% 16.45% 19.26% 15.45% 15.55% 10.74%
CIFAR-10 Wide ResNet 32 27.97% 12.19% 6.19% 9.22% 9.11% 4.43% 9.65%
CIFAR-10 DenseNet 40 22.44% 7.77% 19.54% 14.57% 4.58% 3.17% 4.36%
CIFAR-10 LeNet 5 8.02% 16.49% 18.34% 82.35% 5.14% 19.39% 16.89%
CIFAR-100 ResNet 110 35.5% 7.03% 10.36% 10.9% 4.74% 2.5% 45.62%
CIFAR-100 ResNet 110 (SD) 26.42% 9.12% 10.95% 9.12% 8.85% 8.85% 35.6%
CIFAR-100 Wide ResNet 32 33.11% 6.22% 14.87% 11.88% 5.33% 6.31% 44.73%
CIFAR-100 DenseNet 40 21.52% 9.36% 10.59% 8.67% 19.4% 8.82% 38.64%
CIFAR-100 LeNet 5 10.25% 18.61% 3.64% 9.96% 5.22% 8.65% 18.77%
ImageNet DenseNet 161 14.07% 13.14% 11.57% 10.96% 12.29% 9.61% -
ImageNet ResNet 152 12.2% 14.57% 8.74% 8.85% 12.29% 9.61% -
SVHN ResNet 152 (SD) 19.36% 11.16% 18.67% 9.09% 18.05% 30.78% 18.76%
20 News DAN 3 17.03% 10.47% 9.13% 6.28% 8.21% 8.24% 17.43%
Reuters DAN 3 14.01% 16.78% 44.95% 36.18% 25.46% 18.88% 19.39%
SST Binary TreeLSTM 21.66% 3.22% 13.91% 36.43% 6.03% 6.03% 6.03%
SST Fine Grained TreeLSTM 27.85% 28.35% 19.0% 8.67% 44.75% 11.47% 11.78%
Table S1. MCE (%) (with M = 15 bins) on standard vision and NLP datasets before calibration and with various calibration methods.
The number following a model’s name denotes the network depth. MCE seems very sensitive to the binning scheme and is less suited
for small test sets.
S4. Additional Reliability Diagrams

We include reliability diagrams for additional datasets: CIFAR-10 (Figure S2) and SST (Figure S3 and Figure S4). Note that, as mentioned in Section 2, the reliability diagrams do not represent the proportion of predictions that belong to a given bin.
Dataset Model Uncalibrated Hist. Binning Isotonic BBQ Temp. Scaling Vector Scaling Matrix Scaling
Birds ResNet 50 22.54% 55.02% 23.37% 37.76% 22.54% 22.99% 29.51%
Cars ResNet 50 14.28% 16.24% 14.9% 19.25% 14.28% 14.15% 17.98%
CIFAR-10 ResNet 110 6.21% 6.45% 6.36% 6.25% 6.21% 6.37% 6.42%
CIFAR-10 ResNet 110 (SD) 5.64% 5.59% 5.62% 5.55% 5.64% 5.62% 5.69%
CIFAR-10 Wide ResNet 32 6.96% 7.3% 7.01% 7.35% 6.96% 7.1% 7.27%
CIFAR-10 DenseNet 40 5.91% 6.12% 5.96% 6.0% 5.91% 5.96% 6.0%
CIFAR-10 LeNet 5 15.57% 15.63% 15.69% 15.64% 15.57% 15.53% 15.81%
CIFAR-100 ResNet 110 27.83% 34.78% 28.41% 28.56% 27.83% 27.82% 38.77%
CIFAR-100 ResNet 110 (SD) 24.91% 33.78% 25.42% 25.17% 24.91% 24.99% 35.09%
CIFAR-100 Wide ResNet 32 28.0% 34.29% 28.61% 29.08% 28.0% 28.45% 37.4%
CIFAR-100 DenseNet 40 26.45% 34.78% 26.73% 26.4% 26.45% 26.25% 36.14%
CIFAR-100 LeNet 5 44.92% 54.06% 45.77% 46.82% 44.92% 45.53% 52.44%
ImageNet DenseNet 161 22.57% 48.32% 23.2% 47.58% 22.57% 22.54% -
ImageNet ResNet 152 22.31% 48.1% 22.94% 47.6% 22.31% 22.56% -
SVHN ResNet 152 (SD) 1.98% 2.06% 2.04% 2.04% 1.98% 2.0% 2.08%
20 News DAN 3 20.06% 25.12% 20.29% 20.81% 20.06% 19.89% 22.0%
Reuters DAN 3 2.97% 7.81% 3.52% 3.93% 2.97% 2.83% 3.52%
SST Binary TreeLSTM 11.81% 12.08% 11.75% 11.26% 11.81% 11.81% 11.81%
SST Fine Grained TreeLSTM 49.5% 49.91% 48.55% 49.86% 49.5% 49.77% 48.51%
Table S2. Test error (%) on standard vision and NLP datasets before calibration and with various calibration methods. The number
following a model’s name denotes the network depth. Error with temperature scaling is exactly the same as uncalibrated.
Dataset Model Uncalibrated Hist. Binning Isotonic BBQ Temp. Scaling Vector Scaling Matrix Scaling
Birds ResNet 50 0.9786 1.6226 1.4128 1.2539 0.8792 0.9021 2.334
Cars ResNet 50 0.5488 0.7977 0.8793 0.6986 0.5311 0.5299 1.0206
CIFAR-10 ResNet 110 0.3285 0.2532 0.2237 0.263 0.2102 0.2088 0.2048
CIFAR-10 ResNet 110 (SD) 0.2959 0.2027 0.1867 0.2159 0.1718 0.1709 0.1766
CIFAR-10 Wide ResNet 32 0.3293 0.2778 0.2428 0.2774 0.2283 0.2275 0.2229
CIFAR-10 DenseNet 40 0.2228 0.212 0.1969 0.2087 0.1750 0.1757 0.176
CIFAR-10 LeNet 5 0.4688 0.529 0.4757 0.4984 0.459 0.4568 0.4607
CIFAR-100 ResNet 110 1.4978 1.4379 1.207 1.5466 1.0442 1.0485 2.5637
CIFAR-100 ResNet 110 (SD) 1.1157 1.1985 1.0317 1.1982 0.8613 0.8655 1.8182
CIFAR-100 Wide ResNet 32 1.3434 1.4499 1.2086 1.459 1.0565 1.0648 2.5507
CIFAR-100 DenseNet 40 1.0134 1.2156 1.0615 1.1572 0.9026 0.9011 1.9639
CIFAR-100 LeNet 5 1.6639 2.2574 1.8173 1.9893 1.6560 1.6648 2.1405
ImageNet DenseNet 161 0.9338 1.4716 1.1912 1.4272 0.8885 0.8879 -
ImageNet ResNet 152 0.8961 1.4507 1.1859 1.3987 0.8657 0.8742 -
SVHN ResNet 152 (SD) 0.0842 0.1137 0.095 0.1062 0.0821 0.0844 0.0924
20 News DAN 3 0.7949 1.0499 0.8968 0.9519 0.7387 0.7296 0.9089
Reuters DAN 3 0.102 0.2403 0.1475 0.1167 0.0994 0.0990 0.1491
SST Binary TreeLSTM 0.3367 0.2842 0.2908 0.2778 0.2739 0.2739 0.2739
SST Fine Grained TreeLSTM 1.1475 1.1717 1.1661 1.149 1.1168 1.1085 1.1112
Table S3. NLL on standard vision and NLP datasets before calibration and with various calibration methods. The number following a model's name denotes the network depth. To summarize, NLL roughly follows the trends of ECE.
[Figure S2: reliability diagrams (accuracy vs. confidence) for ResNet-110 (SD) on CIFAR-10. Panels: Uncalibrated (ECE = 4.12), Temperature Scaling (ECE = 0.60), Histogram Binning (ECE = 0.67), Isotonic Regression (ECE = 1.11).]
Figure S2. Reliability diagrams for CIFAR-10 before (far left) and after calibration (middle left, middle right, far right).
[Figure S3: reliability diagrams (accuracy vs. confidence) for the TreeLSTM on SST Fine Grained (SST-FG). Panels: Uncalibrated (ECE = 6.71), Temperature Scaling (ECE = 2.56), Histogram Binning (ECE = 2.09), Isotonic Regression (ECE = 1.65).]
Figure S3. Reliability diagrams for SST Fine Grained before (far left) and after calibration (middle left, middle right, far right).
[Figure S4: reliability diagrams (accuracy vs. confidence) for the TreeLSTM on SST Binary (SST-BIN). Panels: Uncalibrated (ECE = 6.63), Temperature Scaling (ECE = 1.84), Histogram Binning (ECE = 1.93), Isotonic Regression (ECE = 1.65).]
Figure S4. Reliability diagrams for SST Binary before (far left) and after calibration (middle left, middle right, far right).