Curvature Tuning: Provable Training-free Model Steering From a Single Parameter
Figure 1: Illustration of the Curvature Tuning (CT) mechanism for model steering. CT steers a pretrained model by replacing ReLUs
with a β-parameterized activation function and tuning β from 1 to 0, progressively smoothing the model’s decision boundary
across tasks (e.g., classification and regression). The β-parameterized activation function is defined in Equation (7).
model’s curvature, its optimal value can be determined via cross-validation without requiring training or backpropagation. This property ensures maximal computational and memory efficiency.

CT is interpretable for any value of β. CT replaces internal activation functions such as ReLU and Leaky ReLU with a convex combination of a reparameterized Swish function (Ramachandran et al., 2017) and a Softplus function, controlled by the parameter β. This theoretically grounded construction directly modulates the model’s decision boundary curvature. When β = 1, the original activation function is recovered, resulting in a piecewise affine decision boundary. When β = 0, the model becomes entirely linear, making the decision boundary globally affine. Intermediate values of β gradually smooth the decision boundary, offering a continuous transition between these two extremes.

CT significantly improves a model’s performance across tasks and domains while enhancing robustness. We empirically validate the effectiveness of CT through extensive experiments, demonstrating improvements in both generalization and robustness. For same-task generalization, transferring ResNet-18 (He et al., 2016) across seventeen image classification datasets, including MNIST, CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009), yields a relative accuracy gain of 2.57%. For cross-task generalization, CT achieves a relative improvement of 0.41% in the mIoU of a PSPNet (Zhao et al., 2017) using an ImageNet-pretrained ResNet-50 as backbone on VOC2012 (Everingham et al.). Moreover, CT delivers a relative improvement of 11.76% in the robust accuracy of an ImageNet-pretrained ResNet-18 on RobustBench (Croce et al., 2020). Additional experiments with models such as ResNet-50/152, Swin-T/S, as well as additional datasets, further confirm CT’s effectiveness.

A visual depiction of the CT mechanism is shown in Figure 1, and our key contributions are summarized below:

1. Theoretical Contribution: We introduce Curvature Tuning (CT), a training-free model steering technique that provably adjusts the curvature of model decision boundaries using a single parameter. This principled design ensures both efficiency and interpretability. Details are provided in Section 3.

2. Empirical Contribution: We demonstrate in Section 4 that CT enhances generalization and robustness across various models, datasets, and tasks. For example, CT improves the out-of-distribution transfer performance of ResNet-18/50 by 2.57%/1.74% across seventeen downstream datasets, and improves RobustBench robust accuracy by 11.76%/348.44%. It also improves generalization of ReLU-based Swin-T/S on nine downstream datasets by 2.43%/3.33%.

The remainder of this paper is organized as follows: Section 2 reviews current fine-tuning techniques and introduces relevant spline concepts, the foundation for our method. Section 3 details our proposed method and its theoretical guarantees. Section 4 presents experimental results, and Section 5 summarizes our findings and potential future directions.
multiple PEFT approaches, such as UniPELT (Mao et al., 2021) and S4 (Chen et al., 2023).

While these techniques vary in the parameters they modify, they all require further training, which remains computationally expensive. In particular, backpropagation presents significant challenges for larger models. Additionally, their application often involves tuning numerous hyperparameters, typically guided by heuristics with limited theoretical justification, making it difficult to determine optimal values.

Moreover, deep learning training remains largely opaque, complicating the understanding of how pretrained knowledge is preserved and limiting interpretability. For instance, deploying LoRA involves multiple design choices, including selecting the layers where it should be applied (Gao et al., 2024), determining its rank (Valipour et al., 2022; Chen et al., 2024), choosing the scaling factor during inference

where $\mathbb{1}_{\{x \in \omega_r\}}$ is an indicator function that equals 1 if $x$ belongs to region $\omega_r$ and 0 otherwise.

A max-affine spline function is a special case of an affine spline function that does not require explicit knowledge of Ω. Instead, its output is computed as the maximum value over the affine functions:

$$ s[A, b](x) = \max_{r = 1, \dots, R} \left( \langle A_{r,\cdot}, x \rangle + b_r \right) \tag{2} $$

The key result underpinning our study is that many deep network layers, such as fully connected layers, convolutional layers, and convex piecewise-affine activations (e.g., ReLU, max pooling, or maxout), can be exactly represented as max-affine spline functions (Balestriero et al., 2018). Further details on this connection are provided in Appendix B.
While we primarily leverage this result from the spline formulation of deep networks, interested readers can refer to (Balestriero et al., 2019) for a deeper exploration of how the spline of each layer composes to make the entire input-output mapping of a DN an affine spline.

Now that we have reviewed the current fine-tuning methods and their limitations, and introduced relevant concepts in splines along with their connection to deep networks, which provide the theoretical foundation for our method, we proceed to present our proposed method in Section 3.

3. Curvature Tuning: A Provable Method for Model Steering

In this section, we introduce our proposed method, Curvature Tuning (CT). We first dive into its motivation and construction in Sections 3.1 and 3.2, followed by implementation details in Section 3.3. Readers focused on practical applications of CT will find our experiments in Section 4.

3.1. The β-VQ Inference Framework

We know that each unit of a layer is a max-affine spline. The inference process of each unit can thus be decomposed into two steps:

1. VQ Inference Step (region selection): Determine the affine transformation that maximizes the output, which can be viewed as a vector quantization process. The decision is encoded in a selection variable $t \in \mathbb{R}^R$, where $R$ is the number of input region partitions of the max-affine spline function. In a MASO setting, the selection variable $t$ is a one-hot vector with the $r^*$-th entry set to 1, where:

$$ r^* = \arg\max_{r \in \{1, \dots, R\}} \left( \langle A_{r,\cdot}, x \rangle + b_r \right). \tag{3} $$

2. Computation Step (affine transformation): Compute the output of the neuron based on the selection variable $t$:

$$ f(x) = \sum_{r=1}^{R} t_r \cdot \left( \langle A_{r,\cdot}, x \rangle + b_r \right). \tag{4} $$
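The two-step decomposition of a unit's inference (region selection followed by an affine computation) can be sketched as follows. This is a minimal illustration with made-up slopes and offsets; the function names are hypothetical and ties in the argmax are broken by taking the first maximizer.

```python
def affine_outputs(A, b, x):
    """Return the R candidate values <A_r, x> + b_r."""
    return [sum(a * xi for a, xi in zip(A_r, x)) + b_r for A_r, b_r in zip(A, b)]


def vq_inference_step(A, b, x):
    """Region selection (Equation (3)): one-hot t with a 1 at r* = argmax_r."""
    values = affine_outputs(A, b, x)
    r_star = max(range(len(values)), key=values.__getitem__)
    return [1.0 if r == r_star else 0.0 for r in range(len(values))]


def computation_step(A, b, x, t):
    """Affine transformation (Equation (4)): f(x) = sum_r t_r * (<A_r, x> + b_r)."""
    return sum(t_r * v for t_r, v in zip(t, affine_outputs(A, b, x)))


# Toy unit with R = 3 affine pieces on a 2-D input (values are made up).
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.5, 0.0]
x = [0.2, 0.1]

t = vq_inference_step(A, b, x)       # hard, one-hot region assignment
f = computation_step(A, b, x, t)     # equals the max over the affine pieces
print(t, f)
```

Because `t` is one-hot, the weighted sum in the computation step returns exactly the maximizing affine value, so the two steps together reproduce the max-affine spline of Equation (2).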
Figure 2: Visualization of nonlinearity smoothing through region assignment smoothing, max smoothing, and their combination. The
combined approach mitigates the opposing biases introduced by the individual methods.
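The opposing biases mentioned in the caption can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a Swish-style function for region assignment smoothing and a Softplus-style function for max smoothing, both sharing a single hypothetical temperature `tau`, and blends them with equal weight. It does not reproduce the exact parameterization of Equation (7), and the mapping from β to `tau` is left unspecified.

```python
import math


def relu(x):
    return max(0.0, x)


def swish(x, tau):
    """Soft region assignment: x * sigmoid(x / tau), using a stable sigmoid."""
    return x * 0.5 * (1.0 + math.tanh(0.5 * x / tau))


def softplus(x, tau):
    """Soft maximum: tau * log(1 + exp(x / tau)), written to avoid overflow."""
    z = x / tau
    return tau * (max(z, 0.0) + math.log1p(math.exp(-abs(z))))


def blend(x, tau, c=0.5):
    """Convex combination of the two smoothings (c = 0.5 is an assumption)."""
    return c * swish(x, tau) + (1.0 - c) * softplus(x, tau)


def mean_signed_gap(fn, tau, lo=-10.0, hi=10.0, n=2001):
    """Average of fn(x) - relu(x) over an evenly spaced grid on [lo, hi]."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return sum(fn(x, tau) - relu(x) for x in xs) / n


tau = 0.5  # hypothetical temperature; smaller tau means closer to ReLU
gap_swish = mean_signed_gap(swish, tau)
gap_softplus = mean_signed_gap(softplus, tau)
gap_blend = mean_signed_gap(blend, tau)
print(gap_swish, gap_softplus, gap_blend)
```

In this sketch the Swish-style curve never exceeds ReLU and the Softplus-style curve never falls below it, so the two signed gaps have opposite signs and largely offset each other in the blend.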
4. Curvature Tuning: Enhancing Model Generalization and Robustness

In this section, we empirically validate the effectiveness of CT by demonstrating its ability to improve model generalization on natural image classification (Section 4.1), as well as on medical image classification and more fine-grained tasks (Section 4.2). Additionally, we show that CT enhances robustness against adversarial and corrupted data (Section 4.3). We then demonstrate that CT is effective for transformers, even with partial theoretical guarantees (Section 4.4). Finally, we conduct ablation studies on its implementation in Section 4.5.

For all experiments, the parameter β is searched within the range [0.5, 1) with a step size of 0.01. Each result is reported as the mean across three independent runs with seeds 42, 43, and 44. Additional experimental details and full results with standard deviations are provided in Appendix D.

4.1. Improving Generalization on Natural Image Datasets

In this subsection, we evaluate the effectiveness of CT in improving model generalization across natural image classification datasets through two experiments:

1. Cross-Dataset Transfer: We pretrain ResNet-18, ResNet-50, and ResNet-152 on MNIST, CIFAR-10, CIFAR-100, or ImageNet, apply CT to the pretrained model, and evaluate its performance when transferred to the remaining datasets¹. This experiment assesses CT’s impact when models are pretrained and transferred across datasets of varying sizes.

2. ImageNet-to-Multiple-Datasets Transfer: We apply CT to ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 before transferring them to 9 additional datasets: Arabic Characters (El-Sawy et al., 2017), Arabic Digits (El-Sawy et al., 2016), Beans (Makerere AI Lab, 2020), CUB-200-2011 (Wah et al., 2011), DTD (Cimpoi et al., 2014), Fashion MNIST (Xiao et al., 2017), FGVC-Aircraft (Maji et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014). This experiment evaluates CT in a more realistic setting, where ImageNet-pretrained models are commonly used for downstream tasks.

While CT is applied consistently, the transfer learning approach differs depending on the dataset. For the same dataset (test set evaluation), the model is directly tested after applying CT, whereas for a new dataset transfer, the classification layer is removed, and linear probing (via logistic regression) is used for evaluation.

Table 1 presents the results of the first experiment for ResNet-18. The complete results, including ResNet-50 and ResNet-152, are provided in Table 8. CT consistently improves generalization across models and datasets, yielding average relative accuracy gains of 1.68%, 1.96%, and 0.40% for ResNet-18, ResNet-50, and ResNet-152, respectively. Notably, in 25 out of 27 transfer cases, CT enhances accuracy. Even when evaluated on the test set of the pretraining dataset, where distribution shift is minimal, CT still provides improvements in 50% of cases, albeit with reduced effectiveness. Additionally, the average β values² for the three models are 0.82, 0.89, and 0.95, all close to 1. This suggests that CT efficiently identifies an appropriate β. To validate CT’s robustness³ under different linear probing configurations, we provide additional experiments in Appendix D.2.

The results for the second experiment are shown in Table 2. CT improves performance in 26 out of 27 cases, achieving average relative improvements of 3.53%, 1.36%, and 0.77% for ResNet-18, ResNet-50, and ResNet-152, respectively. These improvements even exceed those in the first part of the experiment, further highlighting CT’s effectiveness in real-world generalization scenarios. Moreover, the average β values are 0.80, 0.93, and 0.95, respectively, once again demonstrating CT’s efficiency. Additionally, we provide visualizations of accuracy trends during the β search process in Figure 3, showing a sharp increase leading to a distinct peak, followed by a gradual decline as β increases.

In summary, our results demonstrate that CT effectively enhances model generalization across natural image classification datasets. In the following section, we extend this analysis to medical image datasets and more fine-grained tasks to further assess CT’s impact on generalization.

¹ We exclude transfers to ImageNet due to computational costs.
² Computed only for cases where improvements are observed.
³ Referring to the robustness of CT itself, not the models.

4.2. Improving Generalization on Medical Image Datasets and Fine-grained Tasks

In this subsection, we further evaluate CT’s impact on model generalization in more complex scenarios, specifically on medical image datasets and more fine-grained tasks than single-label classification:

1. Medical Image Datasets: We apply CT to ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 before transferring them to three medical image datasets: PathMNIST, OCTMNIST, and DermaMNIST from MedMNIST (Yang et al., 2023).

2. Fine-grained Tasks: We apply CT to ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 for more fine-grained downstream tasks beyond single-
label classification, including multi-label prediction on CelebA (Liu et al., 2015), regression on dSprites (Matthey et al., 2017)⁴, and semantic segmentation on VOC2012 (Everingham et al.)⁵. Detailed settings are provided in Appendix D.3.

Table 1: Accuracy of ResNet-18 trained and tested across MNIST, CIFAR-10, CIFAR-100, and ImageNet (bold entries indicate improvement with CT). CT consistently enhances generalization across models and datasets, with β values close to 1. Reported values are means over three runs; the complete results for ResNet-18, ResNet-50, and ResNet-152, including standard deviations, are provided in Table 8.

Table 2: Accuracy of ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 when transferred to 9 downstream datasets (bold entries indicate improvement with CT). CT consistently enhances generalization across models and datasets, with β values close to 1. Reported values are means over three runs; the complete results, including standard deviations, are provided in Table 9.

The results, summarized in Table 3, show that CT consistently enhances generalization across these more challenging datasets and tasks. CT achieves average relative improvements of 2.69%, 1.74%, and 4.25% for ResNet-18, ResNet-50, and ResNet-152, respectively. Additionally, the average β values (0.83, 0.86, and 0.85) further underscore CT’s effectiveness and efficiency in complex scenarios.

Furthermore, we compare per-attribute accuracy, balanced accuracy, and F1-score of an ImageNet-pretrained ResNet-18/50/152 when transferred to multi-label prediction on CelebA, evaluating two selection methods for β: one optimized per attribute based on the best metric value and one using a globally optimal β selected for the highest mean accuracy across attributes. Table 4 presents the partial results for ResNet-18, while complete results for all models are provided in Table 13, Table 14, Table 15, Table 16, Table 17 and Table 18. The performance gap between the two methods is minimal: using a globally optimal β results in an average relative reduction of 0.21%/0.18%/0.07% in accuracy, 0.96%/0.65%/0.30% in balanced accuracy, and 2.81%/1.87%/1.00% in F1-score for ResNet-18/50/152 compared to per-attribute optimization. These results highlight the stability of CT in fine-grained downstream tasks like multi-label prediction.

In conclusion, we demonstrate that CT significantly improves model generalization even in more challenging settings, including medical imaging and fine-grained tasks. Next, we investigate its role in enhancing model robustness.

⁴ We regress the orientation of the shapes in dSprites.
⁵ Only tested on ResNet-50 due to the architecture of the PSPNet used.

4.3. Improving Robustness on Adversarial and Corrupted Data

In this subsection, we demonstrate that CT enhances model robustness using RobustBench (Croce et al., 2020), a standardized benchmark for evaluating model robustness. RobustBench includes both adversarial examples, such as ℓ2- and ℓ∞-norm bounded perturbations, which measure a model’s resistance to adversarial changes, and naturally corrupted examples (Hendrycks & Dietterich, 2019), such as noise and fog, which assess a model’s robustness to real-world data distribution shifts.

To assess CT’s impact, we apply it to ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 and evaluate their robustness on CIFAR-10, CIFAR-100, and ImageNet under both adversarial attacks (ℓ2 and ℓ∞) and corruption-based distortions. For each dataset, we sample 1,000 instances for evaluation. More detailed settings are provided in Appendix D.4.

As summarized in Table 19, CT consistently improves
model robustness across the tested scenarios, achieving average relative improvements in robust accuracy⁶ of 11.76%, 348.44%, and 498.41% for ResNet-18, ResNet-50, and ResNet-152, respectively. Notably, the trend of increasing improvements as model size grows suggests that CT has the potential to be even more effective for larger models. Moreover, the average β values are even closer to 1 compared to the generalization experiments, with values of 0.92, 0.95, and 0.98 for the three models, highlighting CT’s efficiency in optimizing robustness.

These results demonstrate that CT effectively enhances the robustness of pretrained models against both adversarial perturbations and common corruptions, further reinforcing its practical benefits. Having demonstrated CT’s impact on generalization and robustness in models that fully comply with the max-affine spline framework, where curvature tuning is provable from an end-to-end perspective, we now show that CT also works for transformers, where curvature tuning is only provable from a layer-wise perspective.

⁶ Cases where the ReLU baseline has zero robust accuracy are excluded from computation.

Table 3: Performance of ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 when transferred to challenging medical image datasets and fine-grained tasks (bold entries indicate improvement with CT). CT consistently improves generalization across diverse datasets and tasks. Reported values are means over three runs; the complete results, including standard deviations, are provided in Table 12.

(a) ResNet-18. Avg rel improve: 2.69%. Avg β: 0.83.

Dataset        Metric            ReLU     + CT     Rel Improve (%)   β
PathMNIST      Acc (%) ↑         86.21    87.27    1.23              0.81
OCTMNIST       Acc (%) ↑         65.47    69.00    5.40              0.80
DermaMNIST     Acc (%) ↑         73.44    77.74    5.86              0.80
CelebA         Mean Acc (%) ↑    87.88    88.44    0.64              0.75
dSprites       MSE ↓             4.09     4.08     0.33              0.98
VOC2012        mIoU ↑            -        -        -                 -

(b) ResNet-50. Avg rel improve: 1.74%. Avg β: 0.86.

Dataset        Metric            ReLU     + CT     Rel Improve (%)   β
PathMNIST      Acc (%) ↑         89.83    89.88    0.06              0.98
OCTMNIST       Acc (%) ↑         68.60    69.93    1.94              0.90
DermaMNIST     Acc (%) ↑         73.67    77.44    5.12              0.89
CelebA         Mean Acc (%) ↑    89.18    89.42    0.27              0.91
dSprites       MSE ↓             4.40     4.28     2.62              0.53
VOC2012        mIoU ↑            0.68     0.69     0.41              0.96

(c) ResNet-152. Avg rel improve: 4.25%. Avg β: 0.85.

Dataset        Metric            ReLU     + CT     Rel Improve (%)   β
PathMNIST      Acc (%) ↑         90.15    90.75    0.66              0.92
OCTMNIST       Acc (%) ↑         70.23    71.03    1.14              0.98
DermaMNIST     Acc (%) ↑         74.39    77.92    4.75              0.93
CelebA         Mean Acc (%) ↑    89.16    89.40    0.27              0.92
dSprites       MSE ↓             4.46     3.82     14.27             0.52
VOC2012        mIoU ↑            -        -        -                 -

Table 4: Comparison of per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-18 when transferred to multi-label prediction on CelebA, evaluating two selection methods for β: one optimized per attribute (Micro Best) and one using a globally optimal β (Macro Best) (bold values indicate cases where the metric remains the same under both selection methods). The small performance gap between Micro Best and Macro Best demonstrates the stability of CT on more fine-grained downstream tasks. Reported values are means over three runs for the first four attributes; the complete results for ResNet-18, ResNet-50, and ResNet-152, including standard deviations, are provided in Table 13, Table 14, Table 15, Table 16, Table 17 and Table 18.

(a) Micro Best.

Attribute             Accuracy (%)   Balanced Accuracy (%)   F1 (%)
5 Shadow              91.54          65.91                   44.16
Arch. Eyebrows        79.63          72.46                   60.91
Attractive            79.44          79.47                   80.15
Bags Un. Eyes         82.80          65.29                   45.80

(b) Macro Best.

Attribute             Accuracy (%)   Balanced Accuracy (%)   F1 (%)
5 Shadow              91.34          65.57                   43.56
Arch. Eyebrows        79.34          71.97                   60.14
Attractive            79.41          79.46                   80.14
Bags Un. Eyes         82.80          65.12                   45.48
Avg Rel Reduct (%)    0.16           0.37                    0.83

4.4. Improving Generalization of Transformers

In this subsection, we demonstrate that CT is effective for transformers. Unlike models such as ResNets, where all layers adhere to the max-affine spline framework, transformers include attention layers that do not fit directly within this framework. Consequently, we lose strict theoretical guarantees when considering the end-to-end mapping of transformers. However, from a layer-wise perspective, the feed-forward layer combined with the activation function (if convex and piecewise-affine like ReLU) retains partial theoretical guarantees.

To show that CT remains effective even in cases with only partial theoretical guarantees, we modify the Swin Transformer (Liu et al., 2021) by replacing all GELU activations following the feed-forward layers with ReLU, enabling CT to be applied. The network is then pretrained on Imagenette (fastai, 2019), a subset of 10 easily classified classes from ImageNet, and transferred to the 9 downstream datasets used in the ImageNet-to-Multiple-Datasets Transfer experiment in Section 4.1. Further details on the experimental setup are provided in Appendix D.5.

As shown in Table 6, CT improves the generalization of transformers even when only partial theoretical guarantees are available. Specifically, CT achieves relative improvements of 2.43% on Swin-T and 3.33% on Swin-S, with average β values of 0.92 and 0.94, respectively, demonstrating CT’s
effectiveness and efficiency once again.

Table 5: Robust accuracy of ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 under ℓ2/ℓ∞ adversarial attacks and common corruptions on CIFAR-10, CIFAR-100, and ImageNet (bold entries indicate improvement with CT). CT consistently enhances robustness across models, datasets, and robustness settings, with β values close to 1. Reported values are means over three runs; the complete results, including standard deviations, are provided in Table 19.

Table 6: Accuracy of Imagenette-pretrained Swin-T and Swin-S (ReLU-based) when transferred to 9 downstream datasets (bold entries indicate improvement with CT). CT consistently enhances generalization across models and datasets, with β values close to 1, demonstrating its effectiveness even with partial theoretical guarantees. Reported values are means over three runs; the complete results, including standard deviations, are provided in Table 20.

the robustness experiment (evaluating on RobustBench). The results in Table 21 and Table 22 show that while both individual components provide improvements (0.23% and 2.96% for Swish and SoftPlus in generalization; 8.06% and 9.91% in robustness), the full CT formulation achieves the highest gains (3.46% in generalization; 11.76% in robustness). These findings align with our theoretical insights in Section 3.2, which show that combining both functions helps mitigate decision boundary drift.
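To recap the search protocol stated at the start of Section 4 (β in [0.5, 1) with step 0.01, results averaged over seeds 42, 43, and 44), the loop below gives a minimal sketch of a training-free grid search. The callable `evaluate` and the synthetic `toy_evaluate` are hypothetical placeholders for "apply CT with this β and score the model on held-out data"; averaging over seeds before selecting β is one plausible reading of the protocol, not a detail confirmed in this section.

```python
def beta_grid(start=0.5, stop=1.0, step=0.01):
    """Candidate values in [start, stop), rounded to 2 decimals to match the 0.01 step."""
    n = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(n)]


def search_beta(evaluate, seeds=(42, 43, 44)):
    """Return (best_beta, best_mean_score) over the grid.

    `evaluate(beta, seed)` is a user-supplied callable returning a validation
    score where higher is better; no gradients or weight updates are involved.
    """
    best_beta, best_score = None, float("-inf")
    for beta in beta_grid():
        scores = [evaluate(beta, seed) for seed in seeds]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_beta, best_score = beta, mean_score
    return best_beta, best_score


def toy_evaluate(beta, seed):
    """Synthetic stand-in score that peaks at beta = 0.8 (purely illustrative)."""
    return 1.0 - (beta - 0.8) ** 2 + 1e-4 * (seed - 43)


best_beta, best_score = search_beta(toy_evaluate)
print(best_beta, best_score)
```

The grid has only 50 candidates, so the whole search costs 50 forward-only evaluations per seed, which is what makes selection by validation practical without backpropagation.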
to an even broader range of state-of-the-art models with minimal effort.

Although CT and other Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA cater to distinct fine-tuning needs, the theoretical foundations of CT have the potential to inspire further advancements in state-of-the-art techniques such as LoRA. We hope this work serves as a stepping stone for future research into efficient, principled approaches to post-training model steering.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

Balestriero, R. and Baraniuk, R. G. From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference. arXiv preprint arXiv:1810.09274, 2018.
Balestriero, R., Cosentino, R., Aazhang, B., and Baraniuk, R. The geometry of deep networks: Power diagram subdivision. Advances in Neural Information Processing Systems, 32, 2019.
Balestriero, R. et al. A spline theory of deep learning. In International Conference on Machine Learning, pp. 374–383. PMLR, 2018.
Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014.
Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., and Varoquaux, G. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122, 2013.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9650–9660, 2021.
Chen, J., Zhang, A., Shi, X., Li, M., Smola, A., and Yang, D. Parameter-efficient fine-tuning design spaces. arXiv preprint arXiv:2301.01821, 2023.
Chen, S., Tavallaie, O., Nazemi, N., Chen, X., and Zomaya, A. Y. AutoRank: MCDA based rank personalization for LoRA-enabled distributed learning. arXiv preprint arXiv:2412.15553, 2024.
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and Vedaldi, A. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
Croce, F., Andriushchenko, M., Sehwag, V., Debenedetti, E., Flammarion, N., Chiang, M., Mittal, P., and Hein, M. RobustBench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE, 2009.
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
El-Sawy, A., Hazem, E.-B., and Loey, M. CNN for handwritten Arabic digits recognition based on LeNet-5. In International conference on advanced intelligent systems and informatics, pp. 566–575. Springer, 2016.
El-Sawy, A., Loey, M., and El-Bakry, H. Arabic handwritten characters recognition using convolutional neural network. WSEAS Transactions on Computer Research, 5:11–19, 2017.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
fastai. Imagenette: A subset of 10 easily classified classes from ImageNet. https://github.com/fastai/imagenette, 2019.
Gao, C., Chen, K., Rao, J., Sun, B., Liu, R., Peng, D., Zhang, Y., Guo, X., Yang, J., and Subrahmanian, V. Higher layers need more LoRA experts. arXiv preprint arXiv:2402.08562, 2024.
Guo, D., Rush, A. M., and Kim, Y. Parameter-efficient transfer learning with diff pruning. arXiv preprint arXiv:2012.07463, 2020.
Han, Z., Gao, C., Liu, J., Zhang, J., and Zhang, S. Q. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608, 2024.
Hayou, S., Ghosh, N., and Yu, B. The impact of initialization on LoRA finetuning dynamics. arXiv preprint arXiv:2406.08447, 2024.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP, 2019. URL https://arxiv.org/abs/1902.00751.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685.
Jeddi, A., Shafiee, M. J., and Wong, A. A simple fine-tuning is all you need: Towards robust deep learning via adversarial fine-tuning. arXiv preprint arXiv:2012.13628, 2020.
Kalajdzievski, D. A rank stabilization scaling factor for fine-tuning with LoRA. arXiv preprint arXiv:2312.03732, 2023.
Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., and Raffel, C. A. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022.
Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021.
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
Makerere AI Lab. Bean Disease Dataset, January 2020. URL https://github.com/AI-Lab-Makerere/ibean/.
Mao, Y., Mathias, L., Hou, R., Almahairi, A., Ma, H., Han, J., Yih, W.-t., and Khabsa, M. UniPELT: A unified framework for parameter-efficient language model tuning. arXiv preprint arXiv:2110.07577, 2021.
Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
Mirzadeh, I., Alizadeh, K., Mehta, S., Del Mundo, C. C., Tuzel, O., Samei, G., Rastegari, M., and Farajtabar, M. ReLU strikes back: Exploiting activation sparsity in large language models. arXiv preprint arXiv:2310.04564, 2023.
Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. Advances in neural information processing systems, 27, 2014.
Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pp. 722–729. IEEE, 2008.
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
Radford, A. Improving language understanding by generative pre-training. 2018.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020.
Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
Yang, J., Shi, R., Wei, D., Liu, Z., Zhao, L., Ke, B., Pfister, H., and Ni, B. Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 10(1):41, 2023.
Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11975–11986, 2023.
Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890, 2017.
Table 7: Number of parameters to be tuned during fine-tuning of different ResNet models (total parameters shown in parentheses) using
LoRA with varying ranks. While LoRA with low ranks significantly reduces the number of parameters required for fine-tuning,
these values still range from tens of thousands to over a million, reinforcing the need for CT in highly resource-constrained
scenarios.
B. Spline Theory
The spline theory of deep learning establishes that a large class of deep network (DN) layers can be modeled as max-affine
spline operators (MASOs). More precisely:
Theorem B.1. Any DN layer comprising a linear operator (e.g., fully connected or convolutional layer) followed by a
convex and piecewise affine non-linear operator (e.g., ReLU, leaky-ReLU, absolute value activation, max/average/channel
pooling, maxout; with or without skip connections) is a MASO (Balestriero et al., 2018).
Consequently, a deep network (e.g., MLP, CNN, RNN, ResNet) composed of such linear operators and convex, piecewise
affine non-linear operators is a composition of MASOs. Note, however, that the network as a whole is not a MASO but only
an affine spline operator (ASO): conditioned on the input, such a network is equivalent to an affine transformation,
but globally the transformation is not convex.
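To make the "affine conditioned on the input" statement concrete, the following minimal sketch builds a toy two-layer ReLU network in plain Python and recovers the affine map that the network applies on the region containing a given input. The weights, the helper names (`matvec`, `forward`, `local_affine`), and the two test inputs are illustrative assumptions and are not taken from the paper.

```python
# Toy two-layer ReLU network f(x) = W2 relu(W1 x + b1) + b2.
# All weights are arbitrary illustrative values (an assumption, not from the paper).
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, -1.0, 2.0]]
b2 = [0.25]


def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def forward(x):
    """Evaluate the network at x."""
    pre = [p + b for p, b in zip(matvec(W1, x), b1)]
    hidden = [max(0.0, p) for p in pre]
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]


def local_affine(x):
    """Return (A, b) with forward(z) == A z + b for every z sharing x's ReLU pattern."""
    pre = [p + b for p, b in zip(matvec(W1, x), b1)]
    q = [1.0 if p > 0 else 0.0 for p in pre]  # which hidden units are active at x
    units = range(len(W1))
    # A = W2 diag(q) W1 and b = W2 diag(q) b1 + b2
    A = [[sum(W2[i][k] * q[k] * W1[k][j] for k in units) for j in range(len(x))]
         for i in range(len(W2))]
    b = [sum(W2[i][k] * q[k] * b1[k] for k in units) + b2[i] for i in range(len(W2))]
    return A, b


x, y = [1.0, 0.2], [-1.0, 1.0]  # two inputs in different activation regions
A_x, b_x = local_affine(x)
A_y, b_y = local_affine(y)
out_x = [a + c for a, c in zip(matvec(A_x, x), b_x)]  # matches forward(x)
out_y = [a + c for a, c in zip(matvec(A_y, y), b_y)]  # matches forward(y)
```

Each input is reproduced exactly by its own affine map, but `A_x` and `A_y` differ because the two inputs activate different hidden units. This is the sense in which the network is affine on each region while the overall map is not a single affine (or convex) function.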
import torch
from torch import nn


class CT(nn.Module):
    def __init__(self, beta=0, coeff=0.5, threshold=20, trainable=False):
        assert 0 <= beta < 1
        super().__init__()
        # Store beta as a float tensor: an integer tensor (e.g. from the
        # default beta=0) cannot be wrapped in a gradient-tracking Parameter.
        self.beta = nn.Parameter(torch.tensor(float(beta)), requires_grad=trainable)
        self.coeff = coeff
        self.threshold = threshold

    # ...


class ReplacementMapping:
    def __init__(self, old_module, new_module, **kwargs):
        self.old_module = old_module
        self.new_module = new_module
        self.kwargs = kwargs

    # ...
    return model
Table 8: Complete accuracy results of ResNet-18, ResNet-50, and ResNet-152 trained and tested across MNIST, CIFAR-10, CIFAR-100,
and ImageNet (bold entries indicate improvement with CT). CT consistently enhances generalization across models and datasets,
with β values close to 1. Reported values include means and standard deviations over three runs.
Accuracy Trends During β Search: Accuracy trends observed during β search across multiple models and datasets are
illustrated in Figure 3. As β increases, we observe a sharp rise in accuracy leading to a distinct peak, followed by a gradual
decline as β continues to increase.
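The β search described above amounts to evaluating the steered model on a grid of β values and keeping the best one. The sketch below shows that selection step only. The function name `search_beta`, the grid range, and the synthetic `toy_accuracy` curve are assumptions made for illustration; in practice `evaluate` would apply CT with the given β and return a validation accuracy.

```python
def search_beta(evaluate, betas):
    """Return (best_beta, best_score); `evaluate` maps a beta to a score."""
    scores = {beta: evaluate(beta) for beta in betas}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical grid: 0.70, 0.71, ..., 1.00.
betas = [round(0.70 + 0.01 * i, 2) for i in range(31)]


def toy_accuracy(beta):
    # Synthetic stand-in, NOT measured data: a steep rise up to beta = 0.9
    # followed by a slow decline, mimicking the qualitative trend in the text.
    if beta < 0.9:
        return 0.9 - 3.0 * (0.9 - beta)
    return 0.9 - 0.2 * (beta - 0.9)


best_beta, best_score = search_beta(toy_accuracy, betas)
```

Because the search only needs forward passes through the steered model, no gradients or weight updates are involved in this step.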
Robustness of CT to Linear Probing Configurations: We demonstrate that CT consistently improves the generalization
performance of models across various linear probing settings.
The first setting we modify is the regularization parameter c of the logistic regression used to train the new classifier layer.
Here c is the inverse of the regularization strength (as implemented in scikit-learn
(Buitinck et al., 2013)), so a larger value corresponds to weaker regularization.
Table 9: Complete accuracy results of ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 when transferred to 9 downstream
datasets (bold entries indicate improvement with CT). CT consistently enhances generalization across models and datasets, with β
values close to 1. Reported values include means and standard deviations over three runs.
Figure 3: The accuracy trends during the β search process on various models and datasets. A sharp increase leading to a distinct peak,
followed by a gradual decline as β increases can be observed across models and datasets.
For the baseline described in Section 4.1, we set c = 1. To evaluate the impact of different regularization strengths, we also
test c = 0.1 (stronger regularization) and c = 10 (weaker regularization). The results, presented in Table 10, demonstrate
that CT consistently enhances generalization performance across all tested regularization strengths.
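As a rough illustration of how c acts as an inverse regularization strength, the sketch below fits scikit-learn's `LogisticRegression` with `C` set to 0.1, 1, and 10 on synthetic features and records the norm of the learned weights. The data, dimensions, and variable names are assumptions standing in for frozen backbone features; only the three c values come from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for frozen backbone features (200 samples, 8 dimensions).
X = rng.normal(size=(200, 8))
# Noisy labels, so the two classes are not cleanly separable.
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

weight_norms = {}
for c in (0.1, 1.0, 10.0):
    # In scikit-learn, C is the inverse of the L2 regularization strength.
    clf = LogisticRegression(C=c, max_iter=1000).fit(X, y)
    weight_norms[c] = float(np.linalg.norm(clf.coef_))
```

Smaller `C` penalizes the weights more heavily, so the weight norm grows as `C` increases from 0.1 to 10.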
The second setting we adjust is the feature map used for linear probing, specifically the number of layers contributing
features to the linear classifier. The baseline configuration uses only the last layer’s features. Here, we test using features
from the last 2 and 3 layers for linear probing. Due to the increased dimensionality of the combined feature maps, we train a
fully connected classifier using Adam with a learning rate of 10^{-3}, optimizing for 30 epochs with cross-entropy loss. The
results, shown in Table 11, indicate that CT consistently improves generalization performance regardless of the number of
layers used for feature extraction.
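A minimal sketch of this multi-layer probing setup is given below, assuming that the per-layer features have already been pooled into vectors and that the classifier is a single linear layer trained full-batch. The feature dimensions, sample count, and random data are placeholders; only the optimizer (Adam), learning rate (10^{-3}), epoch count (30), and loss (cross-entropy) follow the text.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-ins for pooled features from the last three layers
# (dimensions and data are illustrative, not taken from the paper).
n, num_classes = 256, 10
feats = [torch.randn(n, d) for d in (128, 256, 512)]
labels = torch.randint(0, num_classes, (n,))

# Concatenate the per-layer features into one vector per sample.
x = torch.cat(feats, dim=1)

# Assumed probe: one fully connected layer on top of the frozen features.
probe = nn.Linear(x.shape[1], num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

losses = []
for epoch in range(30):
    optimizer.zero_grad()
    loss = criterion(probe(x), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Only the probe's parameters are passed to the optimizer, so the (here simulated) backbone features stay fixed throughout training.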
• Multi-label Classification on CelebA: For the multi-label classification task, the output dimension differs from
that of single-label classification, so we use MultiOutputClassifier from scikit-learn (Buitinck et al.,
2013) for linear probing, which trains 40 independent linear classifiers, one for each label.
• Regression on dSprites: For the regression task, we use the orientation of the shapes in dSprites as targets. Due to the
large dataset size, we sample only 50,000 images for training and 10,000 for testing. For transfer learning, we use
linear regression.
• Semantic Segmentation on VOC2012: For the semantic segmentation task, we use a PSPNet whose ResNet-50 encoder is
pretrained on ImageNet and whose remaining layers are untrained. We first apply CT to the encoder, then freeze it while
training the remaining network (i.e., the decoder) on semantic segmentation, following the common practice of using a
pretrained ResNet as a fixed feature extractor for downstream tasks.
Table 10: Complete accuracy results of ResNet-18 trained and tested across MNIST, CIFAR-10, CIFAR-100, and ImageNet with varying
regularization strengths (c) for logistic regression during linear probing (bold entries indicate improvement with CT). CT consistently
enhances generalization across different c values, with β values close to 1. Reported values include means and standard deviations
over three runs.
Table 11: Complete accuracy results of ResNet-18 trained and tested across MNIST, CIFAR-10, CIFAR-100, and ImageNet with a varying
number of layers from which features are extracted for linear probing (bold entries indicate improvement with CT). CT consistently
enhances generalization across different numbers of layers used, with β values close to 1. Reported values include means and
standard deviations over three runs.
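For the multi-label probing on CelebA described in the list above, the following sketch shows how `MultiOutputClassifier` fits one independent linear classifier per label. The synthetic features and labels, the choice of `LogisticRegression` as the base estimator, and all sizes other than the 40 labels are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
n_samples, n_features, n_labels = 300, 16, 40

# Synthetic stand-in for frozen backbone features and 40 binary attributes.
X = rng.normal(size=(n_samples, n_features))
W_true = rng.normal(size=(n_features, n_labels))
Y = (X @ W_true + rng.normal(size=(n_samples, n_labels)) > 0).astype(int)

# One LogisticRegression is cloned and fitted separately for each label column.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(X)  # shape: (n_samples, n_labels)
```

After fitting, `clf.estimators_` holds the 40 per-label classifiers, which is what allows metrics and the best β to be computed per attribute.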
Complete Experimental Results: The full results for generalization on medical image datasets and fine-grained tasks are
presented in Table 12. Additionally, the complete per-attribute metrics for β selection using the Micro Best and Macro Best
strategies are provided in Table 13 and Table 14 for ResNet-18, Table 15 and Table 16 for ResNet-50, and Table 17 and
Table 18 for ResNet-152.
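The two β selection strategies can be sketched as follows, based on the descriptions in the table captions: Micro Best picks a β per attribute, while Macro Best picks one β that maximizes the mean accuracy across attributes. The function names and the toy accuracy values are hypothetical and serve only to show the selection logic.

```python
def micro_best(acc):
    """Per-attribute selection: for each attribute, the beta with the highest accuracy."""
    attributes = next(iter(acc.values())).keys()
    return {a: max(acc, key=lambda beta: acc[beta][a]) for a in attributes}


def macro_best(acc):
    """Global selection: the single beta with the highest mean accuracy over attributes."""
    def mean_acc(beta):
        return sum(acc[beta].values()) / len(acc[beta])
    return max(acc, key=mean_acc)


# Toy accuracies acc[beta][attribute]; the numbers are made up for illustration.
acc = {
    0.8: {"Bald": 0.90, "Smiling": 0.80},
    0.9: {"Bald": 0.93, "Smiling": 0.84},
    1.0: {"Bald": 0.94, "Smiling": 0.78},
}

per_attribute = micro_best(acc)  # one beta per attribute
shared = macro_best(acc)         # one beta shared by all attributes
```

In this toy case the per-attribute choice is β = 1.0 for "Bald" and β = 0.9 for "Smiling", whereas the shared choice is β = 0.9, so "Bald" gives up a little accuracy under the shared setting.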
Table 12: Complete results of the performance of ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 when transferred to
challenging medical image datasets and fine-grained tasks (bold entries indicate improvement with CT). CT consistently improves
generalization across diverse datasets and tasks. Reported values include means and standard deviations over three runs.
Figure 4: Training and validation accuracy curves for Swin-T and Swin-S.
Complete Experimental Results: The full results of the generalization experiments for Swin transformers are provided in
Table 20.
Table 13: Per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-18 when transferred to multi-label
prediction on CelebA, with the β of CT optimized per attribute (Micro Best). Reported values include means and standard deviations
over three runs.
Attribute    Accuracy (%)    Balanced Accuracy (%)    F1 (%)
5 Shadow 91.54 ± 0.02 65.91 ± 0.01 44.16 ± 0.03
Arch. Eyebrows 79.63 ± 0.01 72.46 ± 0.02 60.91 ± 0.03
Attractive 79.44 ± 0.00 79.47 ± 0.00 80.15 ± 0.00
Bags Un. Eyes 82.80 ± 0.00 65.29 ± 0.01 45.80 ± 0.02
Bald 98.71 ± 0.00 80.49 ± 0.10 66.95 ± 0.18
Bangs 93.51 ± 0.01 84.83 ± 0.05 77.62 ± 0.06
Big Lips 69.53 ± 0.01 54.83 ± 0.03 20.94 ± 0.08
Big Nose 81.71 ± 0.02 66.77 ± 0.02 48.63 ± 0.05
Black Hair 86.41 ± 0.01 80.91 ± 0.01 73.35 ± 0.02
Blond Hair 94.92 ± 0.01 87.45 ± 0.04 80.20 ± 0.05
Blurry 96.05 ± 0.01 69.44 ± 0.06 50.48 ± 0.13
Brown Hair 85.89 ± 0.01 72.45 ± 0.02 56.72 ± 0.04
Bushy Eyebrows 89.14 ± 0.02 65.36 ± 0.03 44.23 ± 0.07
Chubby 95.09 ± 0.02 62.03 ± 0.06 34.98 ± 0.10
Double Chin 95.71 ± 0.02 61.68 ± 0.13 34.04 ± 0.34
Eyeglasses 98.91 ± 0.00 93.42 ± 0.03 91.15 ± 0.03
Goatee 96.07 ± 0.01 71.22 ± 0.07 50.55 ± 0.15
Gray Hair 97.83 ± 0.00 79.04 ± 0.11 63.38 ± 0.10
Heavy Makeup 88.01 ± 0.01 87.70 ± 0.01 85.32 ± 0.02
H. Cheekbones 81.45 ± 0.00 81.35 ± 0.00 80.39 ± 0.03
Male 93.80 ± 0.00 93.69 ± 0.00 92.08 ± 0.00
Mouth S. O. 80.45 ± 0.01 80.43 ± 0.01 79.90 ± 0.03
Mustache 96.23 ± 0.00 57.92 ± 0.00 25.01 ± 0.03
Narrow Eyes 85.90 ± 0.00 55.09 ± 0.01 19.17 ± 0.04
No Beard 91.32 ± 0.01 78.68 ± 0.01 95.00 ± 0.01
Oval Face 73.70 ± 0.00 60.37 ± 0.03 38.47 ± 0.05
Pale Skin 96.65 ± 0.02 67.28 ± 0.13 46.94 ± 0.32
Pointy Nose 73.96 ± 0.02 59.41 ± 0.01 35.86 ± 0.06
Reced. Hairline 91.87 ± 0.01 59.94 ± 0.02 30.94 ± 0.07
Rosy Cheeks 94.08 ± 0.01 67.31 ± 0.14 46.46 ± 0.21
Sideburns 96.59 ± 0.02 72.92 ± 0.09 55.83 ± 0.18
Smiling 84.71 ± 0.01 84.71 ± 0.01 84.60 ± 0.01
Straight Hair 81.62 ± 0.00 63.30 ± 0.01 42.02 ± 0.02
Wavy Hair 81.32 ± 0.02 77.06 ± 0.01 70.52 ± 0.02
Wear. Earrings 85.37 ± 0.00 71.29 ± 0.03 57.11 ± 0.04
Wear. Hat 98.76 ± 0.00 91.38 ± 0.00 85.00 ± 0.03
Wear. Lipstick 90.97 ± 0.00 91.03 ± 0.02 91.21 ± 0.01
Wear. Necklace 86.27 ± 0.00 51.21 ± 0.03 5.37 ± 0.09
Wear. Necktie 94.72 ± 0.01 71.55 ± 0.03 54.24 ± 0.04
Young 84.10 ± 0.01 72.23 ± 0.02 90.08 ± 0.00
Table 14: Per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-18 when transferred to multi-label
prediction on CelebA, with the β of CT optimized globally for the highest mean accuracy across attributes (Macro Best) (bold values
indicate cases where the metric remains the same under both Micro Best and Macro Best β). The small performance gap between
Micro Best and Macro Best demonstrates the stability of CT on more fine-grained downstream tasks. Reported values include
means and standard deviations over three runs.
Attribute    Accuracy (%)    Balanced Accuracy (%)    F1 (%)
5 Shadow 91.34 ± 0.02 65.57 ± 0.03 43.56 ± 0.08
Arch. Eyebrows 79.34 ± 0.01 71.97 ± 0.02 60.14 ± 0.02
Attractive 79.41 ± 0.02 79.46 ± 0.02 80.14 ± 0.02
Bags Un. Eyes 82.80 ± 0.00 65.12 ± 0.00 45.48 ± 0.02
Bald 98.55 ± 0.01 78.56 ± 0.10 62.94 ± 0.22
Bangs 93.33 ± 0.01 84.34 ± 0.02 76.91 ± 0.03
Big Lips 69.28 ± 0.01 54.39 ± 0.02 19.54 ± 0.07
Big Nose 81.65 ± 0.01 66.64 ± 0.02 48.36 ± 0.04
Black Hair 85.16 ± 0.01 79.45 ± 0.01 71.27 ± 0.01
Blond Hair 94.85 ± 0.00 87.33 ± 0.01 79.94 ± 0.01
Blurry 95.89 ± 0.00 68.48 ± 0.13 48.10 ± 0.28
Brown Hair 85.47 ± 0.01 70.99 ± 0.05 54.57 ± 0.08
Bushy Eyebrows 89.14 ± 0.02 64.74 ± 0.03 42.94 ± 0.07
Chubby 95.04 ± 0.01 61.84 ± 0.12 34.49 ± 0.26
Double Chin 95.70 ± 0.00 61.16 ± 0.07 32.83 ± 0.20
Eyeglasses 98.90 ± 0.00 93.40 ± 0.02 91.15 ± 0.03
Goatee 95.94 ± 0.00 69.30 ± 0.05 47.66 ± 0.08
Gray Hair 97.77 ± 0.01 78.74 ± 0.19 62.72 ± 0.25
Heavy Makeup 87.76 ± 0.01 87.50 ± 0.01 85.07 ± 0.01
H. Cheekbones 81.38 ± 0.00 81.18 ± 0.01 80.23 ± 0.01
Male 93.41 ± 0.02 93.44 ± 0.00 91.76 ± 0.01
Mouth S. O. 80.44 ± 0.01 80.35 ± 0.04 79.90 ± 0.03
Mustache 96.23 ± 0.01 57.30 ± 0.13 23.51 ± 0.31
Narrow Eyes 85.88 ± 0.01 54.91 ± 0.02 18.67 ± 0.04
No Beard 90.95 ± 0.01 77.63 ± 0.04 94.83 ± 0.01
Oval Face 73.63 ± 0.01 60.08 ± 0.02 37.76 ± 0.05
Pale Skin 96.46 ± 0.01 63.13 ± 0.06 38.83 ± 0.14
Pointy Nose 73.83 ± 0.00 58.86 ± 0.05 34.57 ± 0.11
Reced. Hairline 91.68 ± 0.01 59.12 ± 0.05 28.97 ± 0.12
Rosy Cheeks 93.91 ± 0.01 66.99 ± 0.09 45.60 ± 0.18
Sideburns 96.54 ± 0.01 71.33 ± 0.06 53.65 ± 0.04
Smiling 84.31 ± 0.01 84.15 ± 0.01 84.10 ± 0.02
Straight Hair 81.26 ± 0.01 62.67 ± 0.01 40.69 ± 0.02
Wavy Hair 80.98 ± 0.03 76.92 ± 0.02 70.32 ± 0.03
Wear. Earrings 85.26 ± 0.01 71.15 ± 0.01 56.84 ± 0.02
Wear. Hat 98.73 ± 0.00 90.19 ± 0.07 84.03 ± 0.11
Wear. Lipstick 90.92 ± 0.00 91.03 ± 0.02 91.19 ± 0.02
Wear. Necklace 86.05 ± 0.01 51.02 ± 0.04 4.79 ± 0.14
Wear. Necktie 94.38 ± 0.02 69.16 ± 0.07 50.14 ± 0.16
Young 84.03 ± 0.01 72.13 ± 0.01 90.04 ± 0.00
Avg Rel Reduction (%) 0.21 0.96 2.81
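The "Avg Rel Reduction (%)" row is not defined in this excerpt. One plausible reading, assumed here, is the mean over attributes of 100 × (Micro Best − Macro Best) / Micro Best. The sketch below applies that assumed definition to three accuracy rows copied from Table 13 and Table 14; the function name is hypothetical.

```python
def avg_relative_reduction(micro, macro):
    """Mean over attributes of 100 * (micro - macro) / micro (assumed definition)."""
    reductions = [100.0 * (micro[a] - macro[a]) / micro[a] for a in micro]
    return sum(reductions) / len(reductions)


# Accuracy values for three attributes, copied from Table 13 (Micro Best)
# and Table 14 (Macro Best).
micro_acc = {"5 Shadow": 91.54, "Bald": 98.71, "Smiling": 84.71}
macro_acc = {"5 Shadow": 91.34, "Bald": 98.55, "Smiling": 84.31}

subset_reduction = avg_relative_reduction(micro_acc, macro_acc)
```

This three-row subset gives roughly 0.28%, which is not expected to equal the 0.21 reported in the table because that figure summarizes all 40 attributes. Applied to the accuracy column of all 40 rows, the assumed definition appears consistent with the reported value.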
Table 15: Per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-50 when transferred to multi-label
prediction on CelebA, with the β of CT optimized per attribute (Micro Best). Reported values include means and standard deviations
over three runs.
Attribute    Accuracy (%)    Balanced Accuracy (%)    F1 (%)
5 Shadow 92.32 ± 0.02 71.34 ± 0.12 53.97 ± 0.21
Arched Eyebrows 80.61 ± 0.06 74.22 ± 0.10 63.54 ± 0.14
Attractive 80.68 ± 0.02 80.71 ± 0.02 81.20 ± 0.01
Bags Under Eyes 83.11 ± 0.02 67.27 ± 0.03 49.37 ± 0.06
Bald 98.73 ± 0.01 81.58 ± 0.30 67.90 ± 0.20
Bangs 94.79 ± 0.01 87.96 ± 0.03 82.34 ± 0.05
Big Lips 70.05 ± 0.01 56.10 ± 0.02 25.67 ± 0.09
Big Nose 82.52 ± 0.03 69.63 ± 0.03 53.41 ± 0.06
Black Hair 88.01 ± 0.02 83.17 ± 0.04 76.68 ± 0.05
Blond Hair 95.10 ± 0.02 88.07 ± 0.05 81.00 ± 0.06
Blurry 96.05 ± 0.02 71.40 ± 0.09 52.88 ± 0.11
Brown Hair 85.89 ± 0.03 74.11 ± 0.04 58.67 ± 0.07
Bushy Eyebrows 89.42 ± 0.00 67.66 ± 0.04 48.40 ± 0.06
Chubby 95.23 ± 0.02 65.82 ± 0.06 42.21 ± 0.13
Double Chin 96.00 ± 0.01 64.78 ± 0.09 41.04 ± 0.19
Eyeglasses 99.18 ± 0.00 95.01 ± 0.03 93.41 ± 0.04
Goatee 96.33 ± 0.01 75.45 ± 0.05 56.72 ± 0.10
Gray Hair 98.01 ± 0.01 80.34 ± 0.13 66.27 ± 0.16
Heavy Makeup 88.85 ± 0.01 88.66 ± 0.01 86.42 ± 0.02
High Cheekbones 84.00 ± 0.05 83.91 ± 0.05 83.05 ± 0.05
Male 95.74 ± 0.02 95.50 ± 0.02 94.49 ± 0.02
Mouth Slightly Open 86.89 ± 0.02 86.87 ± 0.02 86.48 ± 0.02
Mustache 96.35 ± 0.02 60.56 ± 0.17 31.46 ± 0.30
Narrow Eyes 86.34 ± 0.02 57.40 ± 0.08 26.10 ± 0.24
No Beard 93.19 ± 0.01 83.34 ± 0.07 96.06 ± 0.00
Oval Face 74.31 ± 0.03 62.11 ± 0.03 42.61 ± 0.07
Pale Skin 96.59 ± 0.02 69.14 ± 0.05 48.66 ± 0.05
Pointy Nose 74.84 ± 0.02 61.83 ± 0.03 41.67 ± 0.05
Receding Hairline 92.69 ± 0.01 65.82 ± 0.04 43.72 ± 0.05
Rosy Cheeks 94.44 ± 0.02 70.59 ± 0.04 52.41 ± 0.04
Sideburns 97.04 ± 0.02 77.68 ± 0.11 63.88 ± 0.14
Smiling 88.23 ± 0.00 88.23 ± 0.00 88.12 ± 0.00
Straight Hair 82.77 ± 0.01 67.49 ± 0.05 50.05 ± 0.09
Wavy Hair 83.21 ± 0.01 79.41 ± 0.02 73.95 ± 0.02
Wearing Earrings 86.97 ± 0.03 75.77 ± 0.04 64.25 ± 0.08
Wearing Hat 98.94 ± 0.00 93.44 ± 0.03 87.36 ± 0.06
Wearing Lipstick 92.23 ± 0.03 92.26 ± 0.03 92.48 ± 0.03
Wearing Necklace 86.30 ± 0.01 52.81 ± 0.02 11.70 ± 0.14
Wearing Necktie 95.28 ± 0.00 76.54 ± 0.05 61.91 ± 0.06
Young 85.85 ± 0.01 75.81 ± 0.02 91.07 ± 0.01
Table 16: Per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-50 when transferred to multi-label
prediction on CelebA, with the β of CT optimized globally for the highest mean accuracy across attributes (Macro Best) (bold values
indicate cases where the metric remains the same under both Micro Best and Macro Best β). The small performance gap between
Micro Best and Macro Best demonstrates the stability of CT on more fine-grained downstream tasks. Reported values include
means and standard deviations over three runs.
Attribute    Accuracy (%)    Balanced Accuracy (%)    F1 (%)
5 Shadow 92.24 ± 0.01 70.98 ± 0.06 53.47 ± 0.08
Arched Eyebrows 79.49 ± 0.02 72.08 ± 0.04 60.34 ± 0.07
Attractive 80.68 ± 0.02 80.57 ± 0.02 81.05 ± 0.02
Bags Under Eyes 83.11 ± 0.02 67.27 ± 0.04 49.36 ± 0.06
Bald 98.72 ± 0.00 81.18 ± 0.25 67.42 ± 0.36
Bangs 94.52 ± 0.00 87.58 ± 0.01 81.53 ± 0.01
Big Lips 70.03 ± 0.01 56.04 ± 0.03 25.48 ± 0.11
Big Nose 82.52 ± 0.03 69.48 ± 0.03 53.18 ± 0.05
Black Hair 87.58 ± 0.02 83.12 ± 0.04 76.57 ± 0.07
Blond Hair 95.10 ± 0.02 87.68 ± 0.02 80.42 ± 0.03
Blurry 95.96 ± 0.01 69.99 ± 0.06 50.43 ± 0.09
Brown Hair 85.59 ± 0.02 72.95 ± 0.06 56.99 ± 0.09
Bushy Eyebrows 88.96 ± 0.01 66.35 ± 0.02 45.76 ± 0.05
Chubby 95.23 ± 0.02 64.93 ± 0.06 40.46 ± 0.11
Double Chin 95.86 ± 0.00 64.08 ± 0.00 38.95 ± 0.07
Eyeglasses 99.06 ± 0.00 94.81 ± 0.07 92.96 ± 0.11
Goatee 96.19 ± 0.02 74.74 ± 0.07 55.57 ± 0.08
Gray Hair 97.94 ± 0.01 80.34 ± 0.13 66.27 ± 0.16
Heavy Makeup 88.76 ± 0.01 88.66 ± 0.01 86.42 ± 0.02
High Cheekbones 83.92 ± 0.03 83.82 ± 0.02 83.03 ± 0.02
Male 95.62 ± 0.02 95.50 ± 0.02 94.49 ± 0.02
Mouth Slightly Open 86.34 ± 0.03 85.90 ± 0.00 85.50 ± 0.01
Mustache 96.18 ± 0.01 60.41 ± 0.04 30.68 ± 0.16
Narrow Eyes 85.89 ± 0.01 55.31 ± 0.02 20.08 ± 0.05
No Beard 92.97 ± 0.03 83.04 ± 0.09 96.02 ± 0.01
Oval Face 74.01 ± 0.02 61.74 ± 0.01 41.92 ± 0.03
Pale Skin 96.50 ± 0.01 68.31 ± 0.09 47.87 ± 0.11
Pointy Nose 74.84 ± 0.01 61.61 ± 0.04 41.22 ± 0.09
Receding Hairline 92.59 ± 0.01 65.82 ± 0.04 43.72 ± 0.05
Rosy Cheeks 94.32 ± 0.00 69.86 ± 0.09 50.94 ± 0.18
Sideburns 96.91 ± 0.02 76.67 ± 0.22 62.11 ± 0.41
Smiling 88.23 ± 0.00 88.15 ± 0.03 88.04 ± 0.03
Straight Hair 82.72 ± 0.02 67.25 ± 0.05 49.58 ± 0.08
Wavy Hair 82.91 ± 0.02 79.34 ± 0.02 73.85 ± 0.03
Wearing Earrings 86.80 ± 0.01 75.40 ± 0.02 63.68 ± 0.04
Wearing Hat 98.88 ± 0.01 92.99 ± 0.12 86.57 ± 0.09
Wearing Lipstick 92.22 ± 0.01 92.26 ± 0.03 92.48 ± 0.03
Wearing Necklace 86.25 ± 0.01 52.81 ± 0.02 11.68 ± 0.08
Wearing Necktie 95.27 ± 0.00 76.41 ± 0.20 61.73 ± 0.39
Young 85.77 ± 0.04 75.71 ± 0.02 91.04 ± 0.01
Avg Rel Reduction (%) 0.18 0.65 1.87
Table 17: Per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-152 when transferred to multi-label
prediction on CelebA, with the β of CT optimized per attribute (Micro Best). Reported values include means and standard deviations
over three runs.
Attribute    Accuracy (%)    Balanced Accuracy (%)    F1 (%)
5 Shadow 92.22 ± 0.05 70.68 ± 0.32 52.89 ± 0.55
Arch. Eyebrows 79.90 ± 0.29 72.86 ± 0.37 61.54 ± 0.57
Attractive 80.68 ± 0.10 80.71 ± 0.10 81.17 ± 0.09
Bags Un. Eyes 83.30 ± 0.14 67.64 ± 0.25 50.06 ± 0.47
Bald 98.69 ± 0.03 79.25 ± 0.98 65.60 ± 1.06
Bangs 94.43 ± 0.01 87.16 ± 0.10 81.08 ± 0.08
Big Lips 70.15 ± 0.14 56.21 ± 0.10 25.95 ± 0.29
Big Nose 82.62 ± 0.07 69.63 ± 0.08 53.46 ± 0.15
Black Hair 88.07 ± 0.11 83.39 ± 0.21 76.91 ± 0.27
Blond Hair 95.01 ± 0.03 87.75 ± 0.05 80.59 ± 0.09
Blurry 96.09 ± 0.08 71.55 ± 1.06 53.15 ± 1.87
Brown Hair 86.44 ± 0.43 75.15 ± 1.21 60.29 ± 1.92
Bushy Eyebrows 89.50 ± 0.35 67.69 ± 0.95 48.56 ± 1.98
Chubby 95.10 ± 0.04 65.00 ± 0.53 40.24 ± 0.33
Double Chin 95.84 ± 0.02 64.87 ± 0.56 40.32 ± 0.93
Eyeglasses 99.04 ± 0.09 94.56 ± 0.34 92.35 ± 0.70
Goatee 96.41 ± 0.07 75.42 ± 0.17 57.21 ± 0.59
Gray Hair 97.89 ± 0.05 80.04 ± 0.15 64.72 ± 0.54
Heavy Makeup 88.97 ± 0.08 88.72 ± 0.03 86.51 ± 0.06
H. Cheekbones 84.04 ± 0.10 83.95 ± 0.09 83.11 ± 0.06
Male 95.56 ± 0.01 95.34 ± 0.00 94.26 ± 0.00
Mouth S. O. 84.97 ± 0.67 84.94 ± 0.68 84.44 ± 0.76
Mustache 96.30 ± 0.09 61.01 ± 0.27 32.10 ± 0.45
Narrow Eyes 86.04 ± 0.10 56.27 ± 0.70 22.85 ± 2.02
No Beard 93.00 ± 0.15 82.94 ± 0.17 95.95 ± 0.09
Oval Face 74.29 ± 0.18 62.11 ± 0.25 42.63 ± 0.47
Pale Skin 96.71 ± 0.11 69.98 ± 1.09 51.08 ± 2.16
Pointy Nose 74.95 ± 0.08 61.83 ± 0.06 41.59 ± 0.11
Reced. Hairline 92.42 ± 0.20 64.20 ± 1.15 40.32 ± 2.44
Rosy Cheeks 94.34 ± 0.06 70.34 ± 0.05 51.73 ± 0.11
Sideburns 96.98 ± 0.01 76.50 ± 0.58 62.38 ± 0.56
Smiling 87.93 ± 0.16 87.93 ± 0.16 87.85 ± 0.15
Straight Hair 82.49 ± 0.03 66.92 ± 0.21 48.98 ± 0.37
Wavy Hair 83.08 ± 0.02 79.22 ± 0.03 73.68 ± 0.04
Wear. Earrings 86.83 ± 0.14 75.37 ± 0.17 63.61 ± 0.33
Wear. Hat 98.95 ± 0.04 93.47 ± 0.22 87.52 ± 0.48
Wear. Lipstick 91.87 ± 0.23 91.91 ± 0.22 92.10 ± 0.24
Wear. Necklace 86.28 ± 0.01 52.80 ± 0.03 11.82 ± 0.10
Wear. Necktie 95.36 ± 0.06 77.28 ± 0.59 62.92 ± 0.81
Young 85.79 ± 0.02 75.89 ± 0.08 91.02 ± 0.02
Table 18: Per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-152 when transferred to multi-label
prediction on CelebA, with the β of CT optimized globally for the highest mean accuracy across attributes (Macro Best) (bold values
indicate cases where the metric remains the same under both Micro Best and Macro Best β). The small performance gap between
Micro Best and Macro Best demonstrates the stability of CT on more fine-grained downstream tasks. Reported values include
means and standard deviations over three runs.
Attribute    Accuracy (%)    Balanced Accuracy (%)    F1 (%)
5 Shadow 92.18 ± 0.10 70.58 ± 0.64 52.64 ± 1.14
Arch. Eyebrows 79.70 ± 0.23 72.44 ± 0.51 60.86 ± 0.78
Attractive 80.68 ± 0.10 80.65 ± 0.07 81.12 ± 0.05
Bags Un. Eyes 83.30 ± 0.14 67.49 ± 0.24 49.77 ± 0.43
Bald 98.58 ± 0.09 78.18 ± 1.97 63.82 ± 2.62
Bangs 94.39 ± 0.10 87.16 ± 0.10 81.08 ± 0.08
Big Lips 70.11 ± 0.06 56.16 ± 0.19 25.62 ± 0.50
Big Nose 82.62 ± 0.07 69.35 ± 0.14 52.91 ± 0.22
Black Hair 88.07 ± 0.11 83.32 ± 0.14 76.79 ± 0.15
Blond Hair 95.01 ± 0.03 87.75 ± 0.05 80.56 ± 0.11
Blurry 96.06 ± 0.10 71.19 ± 0.88 52.38 ± 1.83
Brown Hair 86.30 ± 0.53 74.96 ± 1.22 60.02 ± 1.87
Bushy Eyebrows 89.50 ± 0.35 67.24 ± 0.98 47.59 ± 2.12
Chubby 95.10 ± 0.04 64.92 ± 0.11 40.24 ± 0.33
Double Chin 95.84 ± 0.01 64.67 ± 0.43 39.71 ± 0.70
Eyeglasses 98.96 ± 0.10 94.38 ± 0.41 92.20 ± 0.67
Goatee 96.29 ± 0.04 75.13 ± 0.19 56.80 ± 0.22
Gray Hair 97.85 ± 0.12 80.00 ± 0.09 64.72 ± 0.54
Heavy Makeup 88.97 ± 0.08 88.57 ± 0.07 86.35 ± 0.09
H. Cheekbones 84.04 ± 0.10 83.77 ± 0.18 82.96 ± 0.18
Male 95.38 ± 0.24 95.25 ± 0.12 94.12 ± 0.20
Mouth S. O. 84.97 ± 0.67 84.62 ± 0.51 84.17 ± 0.52
Mustache 96.24 ± 0.01 60.15 ± 0.15 30.39 ± 0.50
Narrow Eyes 85.98 ± 0.22 55.54 ± 0.33 20.75 ± 0.97
No Beard 92.80 ± 0.24 82.82 ± 0.22 95.92 ± 0.05
Oval Face 74.29 ± 0.18 62.04 ± 0.21 42.56 ± 0.42
Pale Skin 96.71 ± 0.11 69.80 ± 1.01 50.41 ± 1.76
Pointy Nose 74.74 ± 0.02 61.52 ± 0.02 41.15 ± 0.16
Reced. Hairline 92.42 ± 0.20 64.04 ± 1.03 39.85 ± 2.23
Rosy Cheeks 94.28 ± 0.02 69.58 ± 0.05 50.24 ± 0.20
Sideburns 96.82 ± 0.08 76.50 ± 0.58 62.38 ± 0.56
Smiling 87.85 ± 0.21 87.93 ± 0.16 87.85 ± 0.15
Straight Hair 82.49 ± 0.03 66.83 ± 0.15 48.80 ± 0.25
Wavy Hair 82.82 ± 0.24 79.10 ± 0.21 73.50 ± 0.30
Wear. Earrings 86.78 ± 0.01 75.37 ± 0.17 63.61 ± 0.33
Wear. Hat 98.89 ± 0.02 93.47 ± 0.22 87.52 ± 0.48
Wear. Lipstick 91.87 ± 0.23 91.80 ± 0.24 92.01 ± 0.24
Wear. Necklace 86.17 ± 0.09 52.70 ± 0.06 11.44 ± 0.17
Wear. Necktie 95.34 ± 0.01 77.19 ± 0.50 62.83 ± 0.71
Young 85.62 ± 0.11 75.89 ± 0.08 91.02 ± 0.02
Avg Rel Reduction (%) 0.07 0.30 1.00
Table 19: Complete robust accuracy results of ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 under ℓ2/ℓ∞ adversarial
attacks and common corruptions on CIFAR-10, CIFAR-100, and ImageNet (bold entries indicate improvement with CT). CT consistently
enhances robustness across models, datasets, and robustness settings, with β values close to 1. Reported values include means and
standard deviations over three runs.
Table 20: Complete accuracy results of Imagenette-pretrained Swin-T and Swin-S (ReLU-based) when transferred to 9 downstream
datasets (bold entries indicate improvement with CT). CT consistently enhances generalization across models and datasets, with
β values close to 1, demonstrating its effectiveness even with partial theoretical guarantees. Reported values include means and
standard deviations over three runs.
Table 21: Complete accuracy results of ImageNet-pretrained ResNet-18 when transferred to 13 downstream datasets, steered with
the Swish-only, SoftPlus-only, and combined (baseline) versions of CT (bold entries indicate improvement with CT). While both the
Swish-only and SoftPlus-only versions enhance model generalization, their improvements are less significant than the combined
version, validating our CT implementation. Reported values include means and standard deviations over three runs.
Table 22: Complete robust accuracy results of ImageNet-pretrained ResNet-18 under ℓ2/ℓ∞ adversarial attacks and common corruptions
on CIFAR-10, CIFAR-100, and ImageNet, steered with the Swish-only, SoftPlus-only, and combined (baseline) versions of CT (bold
entries indicate improvement with CT). While both the Swish-only and SoftPlus-only versions enhance model robustness, their
improvements are less significant than the combined version, validating our CT implementation. Reported values include means
and standard deviations over three runs.