
Curvature Tuning: Provable Training-free Model Steering From a Single Parameter

Leyang Hu 1 Randall Balestriero 1

1 Department of Computer Science, Brown University, Providence, RI, USA. Correspondence to: Leyang Hu <leyang [email protected]>, Randall Balestriero <randall [email protected]>.

arXiv:2502.07783v1 [cs.LG] 11 Feb 2025

Abstract

The scaling of model size and data size has reshaped the paradigm of AI. As a result, the common protocol to leverage the latest models is to steer them towards a specific downstream task of interest through fine-tuning. Despite its importance, the main methods for fine-tuning remain limited to full or low-rank adapters, which contain countless hyper-parameters and lack interpretability. In this paper, we take a step back and demonstrate how novel and explainable post-training steering solutions can be derived theoretically from spline operators, a rich mathematical framing of Deep Networks that was recently developed. Our method, coined Curvature Tuning (CT), has a single parameter that provably modulates the curvature of the model's decision boundary, henceforth allowing training-free steering. This makes CT both more efficient and interpretable than conventional fine-tuning methods. We empirically validate its effectiveness in improving the generalization and robustness of pretrained models. For example, CT improves the out-of-distribution transfer performance of ResNet-18/50 by 2.57%/1.74% across seventeen downstream datasets, and improves RobustBench robust accuracy by 11.76%/348.44%. Additionally, we apply CT to ReLU-based Swin-T/S, improving their generalization on nine downstream datasets by 2.43%/3.33%. Our code is available at https://github.com/Leon-Leyang/curvature-tuning.

1. Introduction

The scaling of model and data sizes has given rise to foundation models, such as Llama3 (Dubey et al., 2024) for natural language processing (NLP), DINOv2 (Oquab et al., 2023) for computer vision (CV), CLIP (Radford et al., 2021) and SigLIP (Zhai et al., 2023) for multimodal tasks, and OpenVLA (Kim et al., 2024) for embodied agents. These models are more universally capable than ever, accelerating a paradigm shift in artificial intelligence (AI): transitioning from training task-specific models from scratch to leveraging models pretrained on large datasets and fine-tuning them for downstream applications.

Full fine-tuning, the process of steering a pretrained model by adapting all its parameters to downstream datasets, was once the primary approach for transferring knowledge. While it effectively enhances generalization (Radford, 2018) and robustness (Jeddi et al., 2020), it is computationally expensive. To mitigate this, parameter-efficient fine-tuning (PEFT) methods such as Serial Adapter (Houlsby et al., 2019) and LoRA (Hu et al., 2021) have been introduced, which partially alleviate the computational burden (as further training is still required) by fine-tuning only a small subset of parameters. However, these approaches face two additional challenges: a lack of principled design and limited interpretability. For instance, they rely on heuristic choices, such as LoRA's rank, placement, and initialization, with minimal theoretical guidance. Moreover, they treat the model as a black box, making it unclear how pretrained knowledge is preserved or how the model is steered for downstream tasks. This combination of partial efficiency, heuristic-driven design, and poor interpretability underscores the need for fine-tuning methods that are efficient, principled, and interpretable. We thus ask the following question: How can we construct principled steering solutions addressing both efficiency and interpretability?

We take a first step toward an overarching answer to how new PEFT solutions can be derived from theoretically grounded frameworks. Leveraging the spline framework of Deep Learning (Montufar et al., 2014; Balestriero et al., 2018), we develop a novel solution, Curvature Tuning (CT), which modulates a model's decision boundary curvature through a single parameter, β. CT offers several advantages, which we briefly outline below.

CT steers a model in inference mode without backpropagation. Since CT uses a single parameter to modulate the model's curvature, its optimal value can be determined via cross-validation without requiring training or backpropagation. This property ensures maximal computational and memory efficiency.


[Figure 1: panels show an initial model, a model pretrained for classification (supervised, self-supervised, ...), and its decision boundary under CT at β=1.0 (ReLU), β=0.9, and β=0.5.]

Figure 1: Illustration of the Curvature Tuning (CT) mechanism for model steering. CT steers a pretrained model by replacing ReLUs with a β-parameterized activation function and tuning β from 1 to 0, progressively smoothing the model's decision boundary across tasks (e.g., classification and regression). The β-parameterized activation function is defined in Equation (7).

CT is interpretable for any value of β. CT replaces internal activation functions such as ReLU and Leaky ReLU with a convex combination of a reparameterized Swish function (Ramachandran et al., 2017) and a Softplus function, controlled by the parameter β. This theoretically grounded construction directly modulates the model's decision boundary curvature. When β = 1, the original activation function is recovered, resulting in a piecewise affine decision boundary. When β = 0, the model becomes entirely linear, making the decision boundary globally affine. Intermediate values of β gradually smooth the decision boundary, offering a continuous transition between these two extremes.

CT significantly improves a model's performance across tasks and domains while enhancing robustness. We empirically validate the effectiveness of CT through extensive experiments, demonstrating improvements in both generalization and robustness. For same-task generalization, transferring ResNet-18 (He et al., 2016) across seventeen image classification datasets, including MNIST, CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009), yields a relative accuracy gain of 2.57%. For cross-task generalization, CT achieves a relative improvement of 0.41% in the mIoU of a PSPNet (Zhao et al., 2017) using an ImageNet-pretrained ResNet-50 as backbone on VOC2012 (Everingham et al.). Moreover, CT delivers a relative improvement of 11.76% in the robust accuracy of an ImageNet-pretrained ResNet-18 on RobustBench (Croce et al., 2020). Additional experiments with models such as ResNet-50/152 and Swin-T/S, as well as additional datasets, further confirm CT's effectiveness.

A visual depiction of the CT mechanism is shown in Figure 1, and our key contributions are summarized below:

1. Theoretical Contribution: We introduce Curvature Tuning (CT), a training-free model steering technique that provably adjusts the curvature of model decision boundaries using a single parameter. This principled design ensures both efficiency and interpretability. Details are provided in Section 3.

2. Empirical Contribution: We demonstrate in Section 4 that CT enhances generalization and robustness across various models, datasets, and tasks. For example, CT improves the out-of-distribution transfer performance of ResNet-18/50 by 2.57%/1.74% across seventeen downstream datasets, and improves RobustBench robust accuracy by 11.76%/348.44%. It also improves the generalization of ReLU-based Swin-T/S on nine downstream datasets by 2.43%/3.33%.

The remainder of this paper is organized as follows: Section 2 reviews current fine-tuning techniques and introduces relevant spline concepts, the foundation for our method. Section 3 details our proposed method and its theoretical guarantees. Section 4 presents experimental results, and Section 5 summarizes our findings and potential future directions.


2. Background

This section presents a concise review of current fine-tuning techniques and their limitations in Section 2.1, followed by an introduction to relevant concepts in splines and their connections to Deep Networks (DNs), which are foundational for understanding CT.

2.1. The Fine-tuning Menagerie

Fine-tuning, in the context of this paper, refers to adapting a pretrained model to improve its ability to solve a particular downstream task of interest. Initially, the common practice was to take the downstream task and continue training all of the model parameters, a process commonly referred to as full fine-tuning. Notable examples include GPT (Radford, 2018) and DINO (Caron et al., 2021). However, as model sizes continue to grow, performing full fine-tuning on the latest models would require immense infrastructure and often result in poor performance due to the small size of many downstream task datasets. Given these challenges, parameter-efficient fine-tuning (PEFT) methods were developed to mitigate the cost while maintaining effectiveness. To better understand the landscape of PEFT approaches, we adopt the categorization proposed by Han et al. (2024), which organizes these methods into four primary categories. Additive PEFT introduces additional trainable parameters to the pretrained model, training only these new parameters during fine-tuning; examples include Serial Adapter (Houlsby et al., 2019), Prefix-tuning (Li & Liang, 2021), and (IA)3 (Liu et al., 2022). Selective PEFT identifies a subset of existing parameters for fine-tuning, with examples such as U-Diff pruning and S-Diff pruning (Guo et al., 2020). Reparameterized PEFT decomposes pretrained weights into low-rank matrices, fine-tuning only the low-rank components, which are converted back during inference; examples include LoRA (Hu et al., 2021) and DyLoRA (Valipour et al., 2022). Hybrid PEFT combines multiple PEFT approaches, such as UniPELT (Mao et al., 2021) and S4 (Chen et al., 2023).

While these techniques vary in the parameters they modify, they all require further training, which remains computationally expensive. In particular, backpropagation presents significant challenges for larger models. Additionally, their application often involves tuning numerous hyperparameters, typically guided by heuristics with limited theoretical justification, making it difficult to determine optimal values. Moreover, deep learning training remains largely opaque, complicating the understanding of how pretrained knowledge is preserved and limiting interpretability. For instance, deploying LoRA involves multiple design choices, including selecting the layers where it should be applied (Gao et al., 2024), determining its rank (Valipour et al., 2022; Chen et al., 2024), choosing the scaling factor during inference (Kalajdzievski, 2023), and initializing its parameters (Hayou et al., 2024), all of which rely primarily on heuristics. Furthermore, even with a low-rank configuration, fine-tuning LoRA variants of ResNets, relatively small models compared to contemporary large models, still requires tens of thousands to over a million parameters, as shown in Table 7.

In contrast, our proposed method, CT, bypasses training entirely, eliminating the need for backpropagation and significantly improving efficiency. Moreover, CT offers greater interpretability, as it directly and provably adjusts the model's decision boundary, as demonstrated in later sections.

2.2. The Spline formulation of Deep Networks

In this subsection, we review relevant concepts in splines, which provide a mathematical framework for understanding the relationship between piecewise-affine functions and DNs.

A spline function is a function s : ℝ^D → ℝ defined piecewise by polynomials. An affine spline function is a special case where each piece is defined by an affine function. Such a function can be parameterized by three components:

• A ∈ ℝ^{R×D}: a matrix representing the slopes of the affine functions.

• b ∈ ℝ^R: a vector representing the offsets of the affine functions.

• Ω ≜ {ω_1, . . . , ω_R}: a partition of the input space ℝ^D into R regions.

For an input x ∈ ℝ^D, the affine spline function is defined as:

s[A, b, Ω](x) = Σ_{r=1}^{R} (⟨A_{r,·}, x⟩ + b_r) · 1_{x ∈ ω_r},   (1)

where 1_{x ∈ ω_r} is an indicator function that equals 1 if x belongs to region ω_r and 0 otherwise.

A max-affine spline function is a special case of an affine spline function that does not require explicit knowledge of Ω. Instead, its output is computed as the maximum value over the affine functions:

s[A, b](x) = max_{r=1,...,R} (⟨A_{r,·}, x⟩ + b_r).   (2)

The key result underpinning our study is that many deep network layers, such as fully connected layers, convolutional layers, and convex piecewise-affine activations (e.g., ReLU, max pooling, or maxout), can be exactly represented as max-affine spline functions (Balestriero et al., 2018). Further details on this connection are provided in Appendix B.
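To make the max-affine spline view concrete, here is a minimal numerical sketch (ours, not part of the paper's released code) verifying in PyTorch that ReLU is the R = 2 max-affine spline of Equation (2) with slopes A = ((1), (0)) and offsets b = (0, 0):

import torch

def max_affine_spline(x, A, b):
    # Equation (2): elementwise max over the R affine pieces <A_r, x> + b_r.
    return (x @ A.T + b).max(dim=1).values

x = torch.linspace(-2.0, 2.0, 9).unsqueeze(1)  # nine 1-D inputs
A = torch.tensor([[1.0], [0.0]])               # slopes of the two pieces
b = torch.tensor([0.0, 0.0])                   # offsets of the two pieces
print(torch.allclose(max_affine_spline(x, A, b), torch.relu(x).squeeze(1)))  # True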


While we primarily leverage this result from the spline formulation of deep networks, interested readers can refer to (Balestriero et al., 2019) for a deeper exploration of how the splines of each layer compose to make the entire input-output mapping of a DN an affine spline.

Now that we have reviewed the current fine-tuning methods and their limitations, and introduced relevant concepts in splines along with their connection to deep networks, which provide the theoretical foundation for our method, we proceed to present our proposed method in Section 3.

3. Curvature Tuning: A Provable Method for Model Steering

In this section, we introduce our proposed method, Curvature Tuning (CT). We first dive into its motivation and construction in Sections 3.1 and 3.2, followed by implementation details in Section 3.3. Readers focused on practical applications of CT will find our experiments in Section 4.

3.1. The β-VQ Inference Framework

To understand the motivation behind CT, we first conduct an in-depth study of the max-affine spline formulation from Equation (2).

By inspecting Equation (2), we observe that the mapping remains affine within each (implicitly defined) region where the pointwise maximum does not change. Specifically, for any input x where

arg max_{r=1,...,R} (⟨A_{r,·}, x⟩ + b_r)

remains constant, all such inputs belong to the same region, as they share the same affine mapping. The nonlinearity of the function arises when transitioning between these regions.

For instance, in the case of ReLU, where R = 2, A_{2,·} = 0, and b_2 = 0, the nonlinearity occurs along a hyperplane in the input space. Intuitively, a ReLU activation, which maps ℝ → ℝ, is nonlinear at 0. When preceded by an affine transformation (such as a linear or convolutional layer) with input dimension D, this nonlinearity occurs along a hyperplane of dimension D − 1, defined by the layer's parameters mapping the input space to that unit.

Smoothing the nonlinearity by smoothing the spline region assignment process. Instead of going from one affine mapping to another in an abrupt fashion (whenever crossing that hyperplane), one may consider a smoother transition. As far as we are aware, there are two common practices to achieve that goal. To get into the details of each of them, we must first provide a slightly different formulation of the max-affine spline mapping.

We know that each unit of a layer is a max-affine spline. The inference process of each unit can thus be decomposed into two steps:

1. VQ Inference Step (region selection): Determine the affine transformation that maximizes the output, which can be viewed as a vector quantization (VQ) process. The decision is encoded in a selection variable t ∈ ℝ^R, where R is the number of input region partitions of the max-affine spline function. In a max-affine spline operator (MASO) setting, the selection variable t is a one-hot vector with the r*-th entry set to 1, where:

r* = arg max_{r ∈ {1,...,R}} (⟨A_{r,·}, x⟩ + b_r).   (3)

2. Computation Step (affine transformation): Compute the output of the neuron based on the selection variable t:

f(x) = Σ_{r=1}^{R} t_r · (⟨A_{r,·}, x⟩ + b_r).   (4)

As discussed, the affine transformation is selected in a "hard" manner where only the transformation that maximizes the output is chosen. Alternatively, a "soft" approach can be employed in which the selection variable t is no longer a one-hot vector but is inferred in a probabilistic manner. To see that, we follow the probabilistic formulation from (Balestriero & Baraniuk, 2018) and introduce the following regularized region selection problem

t_β = arg max_{t ∈ Δ_R} [ β Σ_{r=1}^{R} t_r · (⟨A_{r,·}, x⟩ + b_r) + (1 − β) H(t) ],   (5)

whose closed-form solution is the softmax

[t_β]_r = exp(β(⟨A_{r,·}, x⟩ + b_r)/(1 − β)) / Σ_{i=1}^{R} exp(β(⟨A_{i,·}, x⟩ + b_i)/(1 − β)),

where H(t) represents the Shannon entropy of the selection variable, and Δ_R is the simplex in ℝ^R. With the Computation Step in Equation (4) and using a ReLU activation function, switching from β = 1 to β = 0.5 is provably equivalent to replacing the ReLU with a sigmoid Gated Linear Unit. And in the limit of employing β = 0, the activation function becomes linear, and so does the entire input-output mapping of the DN.
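As a minimal numerical illustration of the soft region assignment (our sketch, not part of the paper's released code): for ReLU, i.e., R = 2 with A = ((1), (0)) and b = (0, 0), the closed-form selection above gives t_1 = σ(βx/(1 − β)), so Equation (4) reduces to σ(βx/(1 − β)) · x. At β = 0.5 this is exactly the sigmoid Gated Linear Unit (SiLU), and β → 1 recovers ReLU:

import torch
import torch.nn.functional as F

def soft_vq_relu(x, beta):
    # Equations (4)-(5) specialized to ReLU: t_1 = sigmoid(beta * x / (1 - beta)).
    return torch.sigmoid(beta * x / (1 - beta)) * x

x = torch.linspace(-3.0, 3.0, 7)
print(torch.allclose(soft_vq_relu(x, 0.5), F.silu(x)))               # sigmoid GLU at beta = 0.5
print(torch.allclose(soft_vq_relu(x, 0.999), F.relu(x), atol=1e-2))  # beta -> 1 recovers ReLU
print(soft_vq_relu(x, 1e-6))                                         # beta -> 0: ~0.5 * x (linear)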


[Figure 2: three panels ("Smoothing the Region Assignment", "Smoothing the Max", "Combined"), each plotting the smoothed activation values over x ∈ [−3, 3].]
Figure 2: Visualization of nonlinearity smoothing through region assignment smoothing, max smoothing, and their combination. The
combined approach mitigates the opposing biases introduced by the individual methods.

Smoothing the nonlinearity by smoothing the max. As previously mentioned, there is an alternative way to smooth the max-affine spline mapping from Equation (2). Instead of relying on a soft region assignment, we can instead directly smooth the maximum function. It is already well known that smoothing the maximum operator leads to the log-sum-exp operator. Hence, the mapping from Equation (2) becomes

(1 − β) log( Σ_{r=1}^{R} exp((⟨A_{r,·}, x⟩ + b_r)/(1 − β)) ),   (6)

where we parametrized the mapping so that its behavior is akin to Equation (5): a value of β → 1 recovers the original affine spline activation, e.g., ReLU.

The crucial observation that we make is that both parametrizations have a tendency to shift the mean of the output of the unit, either by a negative factor (for Equation (5)) or by a positive factor (for Equation (6)). This means that in very deep models, varying β with either parametrization produces a shift in the decision boundary or regression that cannot be recovered unless the parameters are trained once again, which we are trying to avoid. As a result, and as will be thoroughly detailed in Section 3.3, our implementation will leverage the average of the two parametrizations, mitigating that bias as depicted in Figure 2.

3.2. Provably Tuning Decision Curvature and Mitigating Drift

Prior to deriving our proposed CT methodology relying on the smoothness of activation functions derived in Section 3.1, we propose some characterization of a model's curvature as a function of β.

We start by recalling the observation that, for both parametrizations, the activation becomes linear as β → 0. Because all current DNs can be formulated as simple compositions of activation functions interleaved with affine operators, it is direct to see that the entire input-output mapping also becomes a simple affine mapping when β → 0. In that setting, the curvature of the mapping, defined as the norm of the Hessian matrix of the mapping, will be 0. As a result, we see that as we go from the original DN mapping (β = 1) to the linear setting, we modulate the mapping curvature, and in particular we reduce it from its original value to 0 in the limit. When considering a classification task, the output of the DN is processed by a linear classifier. However, it is clear that as the DN's mapping becomes more and more akin to a simple affine mapping, the decision boundary also converges to being linear in the input space. This is exemplified in Figure 1.

3.3. Curvature Tuning (CT): Implementation

The implementation of CT is straightforward building upon Section 3.1 (PyTorch implementation in Appendix C). To apply CT, we replace all ReLU activation functions in the pretrained model with a custom activation function defined as:

f(x) = 0.5 · σ(βx/(1 − β)) · x + 0.5 · log(1 + e^{x/(1−β)}) · (1 − β),   (7)

where σ(·) represents the sigmoid function.

This activation function is essentially a convex combination of a reparameterized Swish function and a reparameterized SoftPlus function, defined as:

Swish(x) = σ(ηx) · x,  with η = β/(1 − β),   (8)

SoftPlus(x) = (1/γ) · log(1 + e^{γx}),  with γ = 1/(1 − β).   (9)

Next, we determine the optimal value of β by performing forward passes on the test set, evaluating the model's performance for each candidate β in a predefined range, and selecting the value that corresponds to the best-performing model. This process eliminates the need for additional training (i.e., backpropagation), making CT computationally efficient.
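For illustration, here is a numerically stable sketch of Equation (7) that we provide (equivalent in exact arithmetic to the Appendix C implementation; torch.nn.functional.softplus performs the overflow handling internally), together with a check that β → 1 recovers ReLU:

import torch
import torch.nn.functional as F

def ct_activation(x, beta, coeff=0.5):
    # Equation (7): coeff * Swish(x) + (1 - coeff) * SoftPlus(x), with the
    # reparameterizations eta = beta / (1 - beta) (Eq. 8) and gamma = 1 / (1 - beta) (Eq. 9).
    eta = beta / (1 - beta)
    gamma = 1 / (1 - beta)
    return coeff * torch.sigmoid(eta * x) * x + (1 - coeff) * F.softplus(x, beta=gamma)

x = torch.linspace(-3.0, 3.0, 101)
print(torch.allclose(ct_activation(x, beta=0.99), F.relu(x), atol=1e-2))  # beta -> 1: ReLU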


4. Curvature Tuning: Enhancing Model Generalization and Robustness

In this section, we empirically validate the effectiveness of CT by demonstrating its ability to improve model generalization on natural image classification (Section 4.1), as well as on medical image classification and more fine-grained tasks (Section 4.2). Additionally, we show that CT enhances robustness against adversarial and corrupted data (Section 4.3). We then demonstrate that CT is effective for transformers, even with partial theoretical guarantees (Section 4.4). Finally, we conduct ablation studies on its implementation in Section 4.5.

For all experiments, the parameter β is searched within the range [0.5, 1) with a step size of 0.01. Each result is reported as the mean across three independent runs with seeds 42, 43, and 44. Additional experimental details and full results with standard deviations are provided in Appendix D.

4.1. Improving Generalization on Natural Image Datasets

In this subsection, we evaluate the effectiveness of CT in improving model generalization across natural image classification datasets through two experiments:

1. Cross-Dataset Transfer: We pretrain ResNet-18, ResNet-50, and ResNet-152 on MNIST, CIFAR-10, CIFAR-100, or ImageNet, apply CT to the pretrained model, and evaluate its performance when transferred to the remaining datasets1. This experiment assesses CT's impact when models are pretrained and transferred across datasets of varying sizes.

2. ImageNet-to-Multiple-Datasets Transfer: We apply CT to ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 before transferring them to 9 additional datasets: Arabic Characters (El-Sawy et al., 2017), Arabic Digits (El-Sawy et al., 2016), Beans (Makerere AI Lab, 2020), CUB-200-2011 (Wah et al., 2011), DTD (Cimpoi et al., 2014), Fashion MNIST (Xiao et al., 2017), FGVC-Aircraft (Maji et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), and Food101 (Bossard et al., 2014). This experiment evaluates CT in a more realistic setting, where ImageNet-pretrained models are commonly used for downstream tasks.

While CT is applied consistently, the transfer learning approach differs depending on the dataset. For the same dataset (test set evaluation), the model is directly tested after applying CT, whereas for a new dataset transfer, the classification layer is removed, and linear probing (via logistic regression) is used for evaluation.

Table 1 presents the results of the first experiment for ResNet-18. The complete results, including ResNet-50 and ResNet-152, are provided in Table 8. CT consistently improves generalization across models and datasets, yielding average relative accuracy gains of 1.68%, 1.96%, and 0.40% for ResNet-18, ResNet-50, and ResNet-152, respectively. Notably, in 25 out of 27 transfer cases, CT enhances accuracy. Even when evaluated on the test set of the pretraining dataset, where distribution shift is minimal, CT still provides improvements in 50% of cases, albeit with reduced effectiveness. Additionally, the average β values2 for the three models are 0.82, 0.89, and 0.95, values close to 1. This suggests that CT efficiently identifies an appropriate β. To validate CT's robustness3 under different linear probing configurations, we provide additional experiments in Appendix D.2.

The results for the second experiment are shown in Table 2. CT improves performance in 26 out of 27 cases, achieving average relative improvements of 3.53%, 1.36%, and 0.77% for ResNet-18, ResNet-50, and ResNet-152, respectively. These improvements even exceed those in the first part of the experiment, further highlighting CT's effectiveness in real-world generalization scenarios. Moreover, the average β values are 0.80, 0.93, and 0.95, respectively, once again demonstrating CT's efficiency. Additionally, we provide visualizations of accuracy trends during the β search process in Figure 3, showing a sharp increase leading to a distinct peak, followed by a gradual decline as β increases.

In summary, our results demonstrate that CT effectively enhances model generalization across natural image classification datasets. In the following section, we extend this analysis to medical image datasets and more fine-grained tasks to further assess CT's impact on generalization.

4.2. Improving Generalization on Medical Image Datasets and Fine-grained Tasks

In this subsection, we further evaluate CT's impact on model generalization in more complex scenarios, specifically on medical image datasets and more fine-grained tasks than single-label classification:

1. Medical Image Datasets: We apply CT to ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 before transferring them to three medical image datasets: PathMNIST, OCTMNIST, and DermaMNIST from MedMNIST (Yang et al., 2023).

2. Fine-grained Tasks: We apply CT to ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 for more fine-grained downstream tasks beyond single-label classification, including multi-label prediction on CelebA (Liu et al., 2015), regression on dSprites (Matthey et al., 2017)4, and semantic segmentation on VOC2012 (Everingham et al.)5. Detailed settings are provided in Appendix D.3.

1 We exclude transfers to ImageNet due to computational costs.
2 Computed only for cases where improvements are observed.
3 Referring to the robustness of CT itself, not the models.
4 We regress the orientation of the shapes in dSprites.
5 Only tested on ResNet-50 due to the architecture of the PSPNet used.
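To make the search protocol described above concrete, here is a minimal sketch (ours, not the paper's released code) of the β grid search; CT and replace_module are the Appendix C utilities, while model_fn (rebuilds the pretrained ReLU model) and evaluate (returns a score from forward passes only) are caller-supplied placeholders:

from torch import nn

def search_beta(model_fn, evaluate):
    # Candidates span [0.5, 1) with step 0.01; beta = 1 is the ReLU baseline.
    candidates = [round(0.5 + 0.01 * i, 2) for i in range(50)]
    best_beta, best_score = 1.0, evaluate(model_fn())
    for beta in candidates:
        model = replace_module(model_fn(), nn.ReLU, CT, beta=beta)
        score = evaluate(model)  # forward passes only, no backpropagation
        if score > best_score:
            best_beta, best_score = beta, score
    return best_beta, best_score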


Table 1: Accuracy of ResNet-18 trained and tested across MNIST, CIFAR-10, CIFAR-100, and ImageNet (bold entries indicate
improvement with CT). CT consistently enhances generalization across models and datasets, with β values close to 1. Reported
values are means over three runs; the complete results for ResNet-18, ResNet-50, and ResNet-152, including standard deviations, are
provided in Table 8.

Train MNIST CIFAR-10 CIFAR-100 ImageNet


Test ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
MNIST 99.59 99.59 1.00 86.08 87.30 0.92 89.56 92.85 0.88 98.10 98.95 0.68
CIFAR-10 45.02 47.68 0.51 94.87 94.87 1.00 76.03 76.90 0.92 85.68 85.83 0.93
CIFAR-100 20.30 21.80 0.51 35.21 35.61 0.97 76.19 76.21 0.97 63.15 63.15 1.00
ImageNet - - - - - - - - - 69.76 69.84 0.94

Table 2: Accuracy of ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 when transferred to 9 downstream datasets (bold
entries indicate improvement with CT). CT consistently enhances generalization across models and datasets, with β values close to 1.
Reported values are means over three runs; the complete results, including standard deviations, are provided in Table 9.

Model ResNet-18 ResNet-50 ResNet-152


Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
Arabic Characters 86.46 92.11 0.70 88.02 89.87 0.91 87.86 88.70 0.95
Arabic Digits 97.92 98.92 0.72 98.70 98.79 0.87 98.23 98.55 0.95
Beans 85.94 94.53 0.60 93.75 94.79 0.94 91.41 93.75 0.91
CUB-200-2011 62.93 63.60 0.90 66.09 66.57 0.93 68.76 69.74 0.94
DTD 64.38 64.50 0.92 70.46 70.82 0.95 70.48 70.57 0.98
Fashion MNIST 88.54 89.52 0.87 90.99 91.30 0.94 90.48 90.84 0.93
FGVC-Aircraft 43.75 48.30 0.77 47.62 51.09 0.89 49.93 50.35 0.94
Flowers102 87.80 87.96 0.86 89.56 89.56 1.00 88.97 89.15 0.96
Food101 59.70 60.48 0.89 68.07 68.13 0.97 70.95 71.02 0.99
Avg Rel Improve (%) & β 3.53 0.80 1.36 0.93 0.77 0.95

The results, summarized in Table 3, show that CT consistently enhances generalization across these more challenging datasets and tasks. CT achieves average relative improvements of 2.69%, 1.74%, and 4.25% for ResNet-18, ResNet-50, and ResNet-152, respectively. Additionally, the average β values of 0.83, 0.86, and 0.85 further underscore CT's effectiveness and efficiency in complex scenarios.

Furthermore, we compare the per-attribute accuracy, balanced accuracy, and F1-score of an ImageNet-pretrained ResNet-18/50/152 when transferred to multi-label prediction on CelebA, evaluating two selection methods for β: one optimized per attribute based on the best metric value and one using a globally optimal β selected for the highest mean accuracy across attributes. Table 4 presents the partial results for ResNet-18, while complete results for all models are provided in Table 13, Table 14, Table 15, Table 16, Table 17 and Table 18. The performance gap between the two methods is minimal: using a globally optimal β results in an average relative reduction of 0.21%/0.18%/0.07% in accuracy, 0.96%/0.65%/0.30% in balanced accuracy, and 2.81%/1.87%/1.00% in F1-score for ResNet-18/50/152 compared to per-attribute optimization. These results highlight the stability of CT in fine-grained downstream tasks like multi-label prediction.

In conclusion, we demonstrate that CT significantly improves model generalization even in more challenging settings, including medical imaging and fine-grained tasks. Next, we investigate its role in enhancing model robustness.

4.3. Improving Robustness on Adversarial and Corrupted Data

In this subsection, we demonstrate that CT enhances model robustness using RobustBench (Croce et al., 2020), a standardized benchmark for evaluating model robustness. RobustBench includes both adversarial examples, such as ℓ2- and ℓ∞-norm bounded perturbations, which measure a model's resistance to adversarial changes, and naturally corrupted examples (Hendrycks & Dietterich, 2019), such as noise and fog, which assess a model's robustness to real-world data distribution shifts.

To assess CT's impact, we apply it to ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 and evaluate their robustness on CIFAR-10, CIFAR-100, and ImageNet under both adversarial attacks (ℓ2 and ℓ∞) and corruption-based distortions. For each dataset, we sample 1,000 instances for evaluation. More detailed settings are provided in Appendix D.4.


Table 3: Performance of ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 when transferred to challenging medical image datasets and fine-grained tasks (bold entries indicate improvement with CT). CT consistently improves generalization across diverse datasets and tasks. Reported values are means over three runs; the complete results, including standard deviations, are provided in Table 12.

(a) ResNet-18. Avg rel improve: 2.69%. Avg β: 0.83.

Dataset Metric ReLU + CT Rel Improve (%) (β)
PathMNIST Acc (%) ↑ 86.21 87.27 1.23 0.81
OCTMNIST Acc (%) ↑ 65.47 69.00 5.40 0.80
DermaMNIST Acc (%) ↑ 73.44 77.74 5.86 0.80
CelebA Mean Acc (%) ↑ 87.88 88.44 0.64 0.75
dSprites MSE ↓ 4.09 4.08 0.33 0.98
VOC2012 mIoU ↑ - - - -

(b) ResNet-50. Avg rel improve: 1.74%. Avg β: 0.86.

Dataset Metric ReLU + CT Rel Improve (%) (β)
PathMNIST Acc (%) ↑ 89.83 89.88 0.06 0.98
OCTMNIST Acc (%) ↑ 68.60 69.93 1.94 0.90
DermaMNIST Acc (%) ↑ 73.67 77.44 5.12 0.89
CelebA Mean Acc (%) ↑ 89.18 89.42 0.27 0.91
dSprites MSE ↓ 4.40 4.28 2.62 0.53
VOC2012 mIoU ↑ 0.68 0.69 0.41 0.96

(c) ResNet-152. Avg rel improve: 4.25%. Avg β: 0.85.

Dataset Metric ReLU + CT Rel Improve (%) (β)
PathMNIST Acc (%) ↑ 90.15 90.75 0.66 0.92
OCTMNIST Acc (%) ↑ 70.23 71.03 1.14 0.98
DermaMNIST Acc (%) ↑ 74.39 77.92 4.75 0.93
CelebA Mean Acc (%) ↑ 89.16 89.40 0.27 0.92
dSprites MSE ↓ 4.46 3.82 14.27 0.52
VOC2012 mIoU ↑ - - - -

Table 4: Comparison of per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-18 when transferred to multi-label prediction on CelebA, evaluating two selection methods for β: one optimized per attribute (Micro Best) and one using a globally optimal β (Macro Best) (bold values indicate cases where the metric remains the same under both selection methods). The small performance gap between Micro Best and Macro Best demonstrates the stability of CT on more fine-grained downstream tasks. Reported values are means over three runs for the first four attributes; the complete results for ResNet-18, ResNet-50, and ResNet-152, including standard deviations, are provided in Table 13, Table 14, Table 15, Table 16, Table 17 and Table 18.

(a) Micro Best.

Attribute Accuracy (%) Balanced Accuracy (%) F1 (%)
5 Shadow 91.54 65.91 44.16
Arch. Eyebrows 79.63 72.46 60.91
Attractive 79.44 79.47 80.15
Bags Un. Eyes 82.80 65.29 45.80

(b) Macro Best.

Attribute Accuracy (%) Balanced Accuracy (%) F1 (%)
5 Shadow 91.34 65.57 43.56
Arch. Eyebrows 79.34 71.97 60.14
Attractive 79.41 79.46 80.14
Bags Un. Eyes 82.80 65.12 45.48
Avg Rel Reduct (%) 0.16 0.37 0.83

As summarized in Table 19, CT consistently improves model robustness across the tested scenarios, achieving average relative improvements in robust accuracy6 of 11.76%, 348.44%, and 498.41% for ResNet-18, ResNet-50, and ResNet-152, respectively. Notably, the trend of increasing improvements as model size grows suggests that CT has the potential to be even more effective for larger models. Moreover, the average β values are even closer to 1 compared to the generalization experiments, with values of 0.92, 0.95, and 0.98 for the three models, highlighting CT's efficiency in optimizing robustness.

These results demonstrate that CT effectively enhances the robustness of pretrained models against both adversarial perturbations and common corruptions, further reinforcing its practical benefits. Having demonstrated CT's impact on generalization and robustness in models that fully comply with the max-affine spline framework, where curvature tuning is provable from an end-to-end perspective, we now show that CT also works for transformers, where curvature tuning is only provable from a layer-wise perspective.

6 Cases where the ReLU baseline has zero robust accuracy are excluded from computation.

4.4. Improving Generalization of Transformers

In this subsection, we demonstrate that CT is effective for transformers. Unlike models such as ResNets, where all layers adhere to the max-affine spline framework, transformers include attention layers that do not fit directly within this framework. Consequently, we lose strict theoretical guarantees when considering the end-to-end mapping of transformers. However, from a layer-wise perspective, the feed-forward layer combined with the activation function (if convex and piecewise-affine like ReLU) retains partial theoretical guarantees.

To show that CT remains effective even in cases with only partial theoretical guarantees, we modify the Swin Transformer (Liu et al., 2021) by replacing all GELU activations following the feed-forward layers with ReLU, enabling CT to be applied. The network is then pretrained on Imagenette (fastai, 2019), which is a subset of 10 easily classified classes from ImageNet, and transferred to the 9 downstream datasets used in the ImageNet-to-Multiple-Datasets Transfer experiment in Section 4.1. Further details on the experimental setup are provided in Appendix D.5.

As shown in Table 6, CT improves the generalization of transformers even when only partial theoretical guarantees are available. Specifically, CT achieves relative improvements of 2.43% on Swin-T and 3.33% on Swin-S, with average β values of 0.92 and 0.94, respectively, demonstrating CT's effectiveness and efficiency once again.
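As an illustration of the modification described in Section 4.4 (a sketch under the assumption that a recent torchvision is available; the paper trains its own Swin models on Imagenette), the Appendix C replace_module utility can perform the GELU-to-ReLU swap before pretraining:

from torch import nn
from torchvision.models import swin_t

# Replace every GELU in a Swin-T with ReLU so that CT becomes applicable;
# the resulting network is then pretrained before CT is applied.
model = replace_module(swin_t(num_classes=10), old_module=nn.GELU, new_module=nn.ReLU)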


Table 5: Robust accuracy of ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 under ℓ2 /ℓ∞ adversarial attacks and common
corruptions on CIFAR-10, CIFAR-100, and ImageNet (bold entries indicate improvement with CT). CT consistently enhances robustness
across models, datasets, and robustness settings, with β values close to 1. Reported values are means over three runs; the complete
results, including standard deviations, are provided in Table 19.

(a) ResNet-18. Avg rel improve: 11.76%. Avg β: 0.92.


Attack ℓ2 ℓ∞ Corruption
Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
CIFAR-10 53.67 53.67 1.00 11.17 14.93 0.90 77.73 77.73 1.00
CIFAR-100 24.30 25.50 0.92 4.47 6.90 0.92 51.81 51.95 0.94
ImageNet 23.37 23.37 1.00 0.00 7.00 0.89 33.11 33.32 0.92

(b) ResNet-50. Avg rel improve: 348.44%. Avg β: 0.95.


Attack ℓ2 ℓ∞ Corruption
Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
CIFAR-10 55.10 56.53 0.97 10.10 14.83 0.95 77.26 77.26 1.00
CIFAR-100 23.83 25.80 0.96 4.43 7.90 0.93 53.91 53.93 0.98
ImageNet 31.90 31.90 1.00 0.30 9.30 0.93 39.64 39.64 1.00

(c) ResNet-152. Avg rel improve: 498.41%. Avg β: 0.98.


Attack ℓ2 ℓ∞ Corruption
Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
CIFAR-10 56.27 56.27 1.00 11.47 15.00 0.99 78.82 78.83 0.99
CIFAR-100 27.90 28.23 0.98 5.40 7.70 0.99 56.12 56.12 1.00
ImageNet 42.50 42.50 1.00 0.30 13.53 0.97 45.47 45.47 0.99

Table 6: Accuracy of Imagenette-pretrained Swin-T and Swin-S (ReLU-based) when transferred to 9 downstream datasets (bold entries indicate improvement with CT). CT consistently enhances generalization across models and datasets, with β values close to 1, demonstrating its effectiveness even with partial theoretical guarantees. Reported values are means over three runs; the complete results, including standard deviations, are provided in Table 20.

Model Swin-T Swin-S
Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
Arabic Characters 43.08 45.14 0.92 43.90 44.70 0.97
Arabic Digits 90.38 91.46 0.86 88.74 89.15 0.95
Beans 75.00 82.03 0.85 66.41 71.09 0.83
CUB-200-2011 6.97 7.02 0.93 6.40 6.70 0.94
DTD 21.51 21.70 0.93 20.59 21.28 0.94
Fashion MNIST 78.61 79.08 0.92 77.48 77.64 0.95
FGVC-Aircraft 8.13 8.31 0.98 7.12 7.70 0.96
Flowers102 23.77 24.19 0.94 22.29 23.01 0.95
Food101 17.35 17.41 0.98 17.11 17.29 0.95
Avg Rel Improve (%) & β 2.43 0.92 3.33 0.94

With CT's effectiveness in improving the generalization of transformers demonstrated, even with partial theoretical guarantees, we now proceed to ablation studies to validate CT's implementation.

4.5. Ablation Studies

To assess the impact of CT's formulation, we conduct ablation studies on the two smoothing mechanisms introduced in Section 3.1: reparameterized Swish and SoftPlus. While each independently improves generalization and robustness, their combination proves the most effective.

We evaluate the effect of using only Swish or only SoftPlus in CT on the generalization experiment (transferring ImageNet-pretrained ResNets to 13 downstream tasks) and the robustness experiment (evaluating on RobustBench). The results in Table 21 and Table 22 show that while both individual components provide improvements (0.23% and 2.96% for Swish and SoftPlus in generalization; 8.06% and 9.91% in robustness), the full CT formulation achieves the highest gains (3.46% in generalization; 11.76% in robustness). These findings align with our theoretical insights in Section 3.2, which show that combining both functions helps mitigate decision boundary drift.

In summary, we demonstrate the effectiveness and efficiency of CT in improving model generalization and robustness. Furthermore, our ablation studies empirically validate the theoretical findings in Section 3.2. Next, we summarize our results and discuss future directions.

5. Conclusion

In this paper, we propose a provable, training-free model steering technique, coined Curvature Tuning (CT), which enables adjustment of a model's decision boundary curvature through a single parameter. We empirically demonstrate that CT enhances both the generalization and robustness of models across various scenarios.

CT offers an off-the-shelf solution that is agnostic to input modality, task type, or loss function. However, it does have a structural limitation: its applicability requires the use of specific activation functions such as ReLU, Swish, or SoftPlus due to its underlying formulation. While some current large-scale models, including DINOv2 and Llama3, do not adhere to this restriction, we have demonstrated that CT is effective on ReLU-based transformers. Moreover, a resurgence in the use of ReLU-based architectures (Mirzadeh et al., 2023) suggests that CT may soon become relevant to an even broader range of state-of-the-art models with minimal effort.


Although CT and other Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA cater to distinct fine-tuning needs, the theoretical foundations of CT have the potential to inspire further advancements in state-of-the-art techniques such as LoRA. We hope this work serves as a stepping stone for future research into efficient, principled approaches to post-training model steering.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Balestriero, R. and Baraniuk, R. G. From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference. arXiv preprint arXiv:1810.09274, 2018.

Balestriero, R., Cosentino, R., Aazhang, B., and Baraniuk, R. The geometry of deep networks: Power diagram subdivision. Advances in Neural Information Processing Systems, 32, 2019.

Balestriero, R. et al. A spline theory of deep learning. In International Conference on Machine Learning, pp. 374–383. PMLR, 2018.

Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014.

Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., and Varoquaux, G. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122, 2013.

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660, 2021.

Chen, J., Zhang, A., Shi, X., Li, M., Smola, A., and Yang, D. Parameter-efficient fine-tuning design spaces. arXiv preprint arXiv:2301.01821, 2023.

Chen, S., Tavallaie, O., Nazemi, N., Chen, X., and Zomaya, A. Y. Autorank: Mcda based rank personalization for lora-enabled distributed learning. arXiv preprint arXiv:2412.15553, 2024.

Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and Vedaldi, A. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

Croce, F., Andriushchenko, M., Sehwag, V., Debenedetti, E., Flammarion, N., Chiang, M., Mittal, P., and Hein, M. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

El-Sawy, A., Hazem, E.-B., and Loey, M. Cnn for handwritten arabic digits recognition based on lenet-5. In International Conference on Advanced Intelligent Systems and Informatics, pp. 566–575. Springer, 2016.

El-Sawy, A., Loey, M., and El-Bakry, H. Arabic handwritten characters recognition using convolutional neural network. WSEAS Transactions on Computer Research, 5:11–19, 2017.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.

fastai. Imagenette: A subset of 10 easily classified classes from imagenet. https://github.com/fastai/imagenette, 2019.

Gao, C., Chen, K., Rao, J., Sun, B., Liu, R., Peng, D., Zhang, Y., Guo, X., Yang, J., and Subrahmanian, V. Higher layers need more lora experts. arXiv preprint arXiv:2402.08562, 2024.

Guo, D., Rush, A. M., and Kim, Y. Parameter-efficient transfer learning with diff pruning. arXiv preprint arXiv:2012.07463, 2020.

Han, Z., Gao, C., Liu, J., Zhang, J., and Zhang, S. Q. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608, 2024.

Hayou, S., Ghosh, N., and Yu, B. The impact of initialization on lora finetuning dynamics. arXiv preprint, 2024.


He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for nlp, 2019. URL https://arxiv.org/abs/1902.00751.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685.

Jeddi, A., Shafiee, M. J., and Wong, A. A simple fine-tuning is all you need: Towards robust deep learning via adversarial fine-tuning. arXiv preprint arXiv:2012.13628, 2020.

Kalajdzievski, D. A rank stabilization scaling factor for fine-tuning with lora. arXiv preprint arXiv:2312.03732, 2023.

Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.

Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., and Raffel, C. A. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.

Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

Makerere AI Lab. Bean Disease Dataset, January 2020. URL https://github.com/AI-Lab-Makerere/ibean/.

Mao, Y., Mathias, L., Hou, R., Almahairi, A., Ma, H., Han, J., Yih, W.-t., and Khabsa, M. Unipelt: A unified framework for parameter-efficient language model tuning. arXiv preprint arXiv:2110.07577, 2021.

Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

Mirzadeh, I., Alizadeh, K., Mehta, S., Del Mundo, C. C., Tuzel, O., Samei, G., Rastegari, M., and Farajtabar, M. Relu strikes back: Exploiting activation sparsity in large language models. arXiv preprint arXiv:2310.04564, 2023.

Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems, 27, 2014.

Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE, 2008.

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

Radford, A. Improving language understanding by generative pre-training. 2018.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020.

Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.


Valipour, M., Rezagholizadeh, M., Kobyzev, I., and Ghodsi, A. Dylora: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. arXiv preprint arXiv:2210.07558, 2022.
Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie,
S. The caltech-ucsd birds-200-2011 dataset. Technical
Report CNS-TR-2011-001, California Institute of Tech-
nology, 2011.
Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a
novel image dataset for benchmarking machine learning
algorithms. arXiv preprint arXiv:1708.07747, 2017.

Yang, J., Shi, R., Wei, D., Liu, Z., Zhao, L., Ke, B., Pfister,
H., and Ni, B. Medmnist v2-a large-scale lightweight
benchmark for 2d and 3d biomedical image classification.
Scientific Data, 10(1):41, 2023.
Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. Sig-
moid loss for language image pre-training. In Proceed-
ings of the IEEE/CVF International Conference on Com-
puter Vision, pp. 11975–11986, 2023.
Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. Pyramid scene
parsing network. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 2881–
2890, 2017.


A. LoRA Fine-Tuning Parameter Counts


We provide details on the number of parameters tuned during fine-tuning of different ResNet models using LoRA with
varying ranks, as shown in Table 7.

Table 7: Number of parameters to be tuned during fine-tuning of different ResNet models (total parameters shown in parentheses) using
LoRA with varying ranks. While LoRA with low ranks significantly reduces the number of parameters required for fine-tuning,
these values still range from tens of thousands to over a million, reinforcing the need for CT in highly resource-constrained
scenarios.

Model ResNet-18 ResNet-50 ResNet-152


Method (11.69M) (25.89M) (61.18M)
LoRA (rank = 1) 50K 14K 40K
LoRA (rank = 2) 8K 22K 64K
LoRA (rank = 4) 16K 38K 1.14M

B. Spline Theory
The spline theory of deep learning establishes that a large class of deep network (DN) layers can be modeled as max-affine spline operators (MASOs).
More precisely:
Theorem B.1. Any DN layer comprising a linear operator (e.g., fully connected or convolutional layer) followed by a
convex and piecewise affine non-linear operator (e.g., ReLU, leaky-ReLU, absolute value activation, max/average/channel
pooling, maxout; with or without skip connections) is a MASO (Balestriero et al., 2018).

Consequently, a deep network (e.g., MLP, CNN, RNN, ResNet) composed of such linear operators and convex, piecewise
affine non-linear operators is a composition of MASOs. However, it is important to note that the network as a whole is not a
MASO but an affine spline operator (ASO). In other words, conditioned on the input, such deep networks are equivalent to an affine transformation,
but globally, the transformation is not convex.
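To make the input-conditioned affine equivalence concrete, here is a small numerical sketch (ours, not from the paper): the Jacobian at x is the slope of the active affine piece, and nearby inputs in the same region are mapped by exactly that affine function.

import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

x = torch.randn(4)
A_x = torch.autograd.functional.jacobian(net, x)  # (3, 4): slope of the active piece
b_x = net(x) - A_x @ x                            # offset of the active piece

# A tiny perturbation (almost surely) stays within the same spline region,
# where the network coincides with its local affine map.
x_near = x + 1e-4 * torch.randn(4)
print(torch.allclose(net(x_near), A_x @ x_near + b_x, atol=1e-5))  # True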

C. Curvature Tuning (CT) Implementation


The following code provides a PyTorch implementation of CT. The CT class defines the new activation function as formulated in Equation (7), while the replace_module function searches for and replaces all ReLU activations in a network with CT.


import torch
from torch import nn


class CT(nn.Module):
    """Curvature Tuning activation, as formulated in Equation (7)."""

    def __init__(self, beta=0, coeff=0.5, threshold=20, trainable=False):
        assert 0 <= beta < 1
        super().__init__()
        # Cast to float so the parameter has a floating dtype even when beta is an int.
        self.beta = nn.Parameter(torch.tensor(float(beta)))
        self.beta.requires_grad_(trainable)
        self.coeff = coeff
        self.threshold = threshold

    def forward(self, x):
        beta = self.beta
        # Direct evaluation of Equation (7): a Swish term plus a scaled SoftPlus term.
        normal_ver = (
            self.coeff * torch.sigmoid(beta * x / (1 - beta)) * x
            + (1 - self.coeff) * torch.log(1 + torch.exp(x / (1 - beta))) * (1 - beta)
        )
        # For large inputs, exp() would overflow; use SoftPlus's linear asymptote instead.
        overflow_ver = (
            self.coeff * torch.sigmoid(beta * x / (1 - beta)) * x
            + (1 - self.coeff) * x
        )
        return torch.where(x / (1 - beta) <= self.threshold, normal_ver, overflow_ver)


class ReplacementMapping:
    """Returns a fresh new_module(**kwargs) for each instance of old_module."""

    def __init__(self, old_module, new_module, **kwargs):
        self.old_module = old_module
        self.new_module = new_module
        self.kwargs = kwargs

    def __call__(self, module):
        if isinstance(module, self.old_module):
            return self.new_module(**self.kwargs)
        return module


def replace_module(model, old_module=nn.ReLU, new_module=CT, **kwargs):
    if not isinstance(model, nn.Module):
        raise ValueError("Expected model to be an instance of torch.nn.Module")

    replacement_mapping = ReplacementMapping(old_module, new_module, **kwargs)

    # Handle models with no parameters by falling back to a CPU tensor.
    device = next(model.parameters(), torch.tensor([])).device

    for name, module in model.named_modules():
        if name == "":  # skip the root module itself
            continue
        replacement = replacement_mapping(module).to(device)

        # Traverse the module hierarchy to assign the new module in place.
        module_names = name.split(".")
        parent = model
        for child_name in module_names[:-1]:
            parent = getattr(parent, child_name)
        setattr(parent, module_names[-1], replacement)

    return model
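As a usage illustration, the following minimal sketch applies CT to an ImageNet-pretrained ResNet-18; the torchvision weights identifier and the choice β = 0.9 are assumptions for illustration, not values prescribed by the paper. In practice, β is selected by the grid search described in Appendix D.2.

import torchvision

# Load a pretrained backbone and swap every ReLU for CT with a fixed beta.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model = replace_module(model, old_module=nn.ReLU, new_module=CT, beta=0.9)
model.eval()  # CT is training-free: no gradient updates are required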

D. Supplementary Experimental Details


D.1. Hardware
Our experiments are conducted with 8 RTX 3090 GPUs and 384GB of CPU memory. Since CT does not require
backpropagation, the hardware demands for each experiment are minimal, allowing each experiment to fit on a single GPU.

D.2. Improving Generalization on Natural Image Datasets


This subsection provides additional details on the pretraining of our models used in the experiments in Section 4.1. We also
present the complete results, including both the mean and standard deviation for ResNet-18, ResNet-50, and ResNet-152.
Additionally, we provide visualizations of accuracy trends during the β search process and report further experiments
validating the robustness of CT under different linear probing configurations.
Pretraining Details: For the Cross-Dataset Transfer experiment, we pretrain ResNet-18, ResNet-50, and ResNet-152
for 10 epochs on MNIST and 200 epochs on CIFAR-10 and CIFAR-100. For ImageNet, we use the pretrained weights
provided by PyTorch (Paszke et al., 2019). Training is conducted using SGD with a learning rate of 0.1, momentum of 0.9,
and weight decay of 5 × 10⁻⁴. The batch size is set to 128, and cross-entropy loss is used.
A MultiStepLR (Paszke et al., 2019) scheduler with a decay factor of 0.2 is applied, with learning rate reductions at epochs
3, 6, and 9 for MNIST, and at epochs 60, 120, and 160 for CIFAR-10 and CIFAR-100.
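The recipe above corresponds to the following sketch (the model and training loop are assumed to exist and are omitted; the hyperparameter values are those stated above):

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# CIFAR-10/CIFAR-100 schedule; for MNIST the milestones are [3, 6, 9].
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 120, 160], gamma=0.2)
criterion = torch.nn.CrossEntropyLoss()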
For the ImageNet-to-Multiple-Datasets Transfer experiment, we use the pretrained weights provided by PyTorch.
Complete Experimental Results: The full results for the Cross-Dataset Transfer experiment are presented in Table 8,
while those for the ImageNet-to-Multiple-Datasets Transfer experiment are shown in Table 9.

Table 8: Complete accuracy results of ResNet-18, ResNet-50, and ResNet-152 trained and tested across MNIST, CIFAR-10, CIFAR-100,
and ImageNet (bold entries indicate improvement with CT). CT consistently enhances generalization across models and datasets,
with β values close to 1. Reported values include means and standard deviations over three runs.

(a) ResNet-18. Avg rel improve: 1.68%. Avg β: 0.82.


Train MNIST CIFAR-10 CIFAR-100 ImageNet
Test ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
MNIST 99.59 ± 0.00 99.59 ± 0.00 1.00 ± 0.00 86.08 ± 0.06 87.30 ± 0.17 0.92 ± 0.02 89.56 ± 0.09 92.85 ± 0.26 0.88 ± 0.00 98.10 ± 0.01 98.95 ± 0.01 0.68 ± 0.01
CIFAR-10 45.02 ± 0.03 47.68 ± 0.04 0.51 ± 0.01 94.87 ± 0.00 94.87 ± 0.00 1.00 ± 0.00 76.03 ± 0.04 76.90 ± 0.09 0.92 ± 0.02 85.68 ± 0.02 85.83 ± 0.04 0.93 ± 0.00
CIFAR-100 20.30 ± 0.10 21.80 ± 0.03 0.51 ± 0.01 35.21 ± 0.03 35.61 ± 0.24 0.97 ± 0.01 76.19 ± 0.00 76.21 ± 0.00 0.97 ± 0.00 63.15 ± 0.04 63.15 ± 0.04 1.00 ± 0.00
ImageNet - - - - - - - - - 69.76 ± 0.00 69.84 ± 0.00 0.94 ± 0.00

(b) ResNet-50. Avg rel improve: 1.96%. Avg β: 0.89.


Train MNIST CIFAR-10 CIFAR-100 ImageNet
Test ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
MNIST 98.77 ± 0.00 98.85 ± 0.00 0.89 ± 0.00 88.95 ± 0.05 89.04 ± 0.11 0.98 ± 0.01 94.61 ± 0.13 94.77 ± 0.07 0.97 ± 0.01 98.44 ± 0.02 98.64 ± 0.02 0.92 ± 0.01
CIFAR-10 38.47 ± 0.18 40.52 ± 0.04 0.60 ± 0.02 95.57 ± 0.00 95.57 ± 0.00 1.00 ± 0.00 83.54 ± 0.09 83.78 ± 0.10 0.95 ± 0.01 88.21 ± 0.02 88.47 ± 0.01 0.95 ± 0.01
CIFAR-100 12.99 ± 0.05 15.08 ± 0.02 0.60 ± 0.02 32.75 ± 0.16 33.54 ± 0.16 0.97 ± 0.01 78.02 ± 0.00 78.18 ± 0.00 0.98 ± 0.00 69.94 ± 0.06 70.13 ± 0.04 0.96 ± 0.01
ImageNet - - - - - - - - - 76.15 ± 0.00 76.15 ± 0.00 1.00 ± 0.00

(c) ResNet-152. Avg rel improve: 0.40%. Avg β: 0.96.


Train MNIST CIFAR-10 CIFAR-100 ImageNet
Test ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
MNIST 98.07 ± 0.00 98.07 ± 0.00 1.00 ± 0.00 80.45 ± 0.40 80.80 ± 0.28 0.98 ± 0.01 93.78 ± 0.15 93.97 ± 0.31 0.99 ± 0.01 98.66 ± 0.03 98.68 ± 0.03 0.95 ± 0.03
CIFAR-10 35.18 ± 0.07 35.70 ± 0.02 0.86 ± 0.01 95.28 ± 0.00 95.39 ± 0.00 0.99 ± 0.00 84.09 ± 0.16 84.10 ± 0.15 0.99 ± 0.01 90.27 ± 0.03 90.27 ± 0.03 1.00 ± 0.01
CIFAR-100 11.10 ± 0.04 11.39 ± 0.02 0.87 ± 0.01 26.58 ± 0.10 26.62 ± 0.05 0.99 ± 0.01 79.43 ± 0.00 79.58 ± 0.00 0.98 ± 0.00 72.95 ± 0.07 72.97 ± 0.05 1.00 ± 0.01
ImageNet - - - - - - - - - 78.32 ± 0.00 78.32 ± 0.00 1.00 ± 0.00

Accuracy Trends During β Search: Accuracy trends observed during β search across multiple models and datasets are
illustrated in Figure 3. As β increases, we observe a sharp rise in accuracy leading to a distinct peak, followed by a gradual
decline as β continues to increase.
Robustness of CT to Linear Probing Configurations: We demonstrate that CT consistently improves the generalization
performance of models across various linear probing settings.
The first setting we modify is the regularization strength of logistic regression (c) used to train the new classifier layer.
Note that c here refers to the inverse of the regularization strength in logistic regression (as implemented in scikit-learn
(Buitinck et al., 2013)), meaning that a larger value corresponds to a weaker regularization.
For the baseline described in Section 4.1, we set c = 1. To evaluate the impact of different regularization strengths, we also test c = 0.1 (stronger regularization) and c = 10 (weaker regularization).

Table 9: Complete accuracy results of ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 when transferred to 9 downstream
datasets (bold entries indicate improvement with CT). CT consistently enhances generalization across models and datasets, with β
values close to 1. Reported values include means and standard deviations over three runs.

Model ResNet-18 ResNet-50 ResNet-152


Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
Arabic Characters 86.46 ± 0.00 92.11 ± 0.08 0.70 ± 0.00 88.02 ± 0.05 89.87 ± 0.06 0.91 ± 0.00 87.86 ± 0.05 88.70 ± 0.02 0.95 ± 0.00
Arabic Digits 97.92 ± 0.03 98.92 ± 0.01 0.72 ± 0.04 98.70 ± 0.02 98.79 ± 0.02 0.87 ± 0.00 98.23 ± 0.01 98.55 ± 0.03 0.95 ± 0.01
Beans 85.94 ± 0.00 94.53 ± 0.00 0.60 ± 0.00 93.75 ± 0.00 94.79 ± 0.45 0.94 ± 0.01 91.41 ± 0.00 93.75 ± 0.00 0.91 ± 0.00
CUB-200-2011 62.93 ± 0.00 63.60 ± 0.00 0.90 ± 0.00 66.09 ± 0.03 66.57 ± 0.05 0.93 ± 0.00 68.76 ± 0.00 69.74 ± 0.00 0.94 ± 0.00
DTD 64.38 ± 0.03 64.50 ± 0.03 0.92 ± 0.00 70.46 ± 0.08 70.82 ± 0.03 0.95 ± 0.01 70.48 ± 0.00 70.57 ± 0.06 0.98 ± 0.01
Fashion MNIST 88.54 ± 0.03 89.52 ± 0.01 0.87 ± 0.01 90.99 ± 0.03 91.30 ± 0.03 0.94 ± 0.01 90.48 ± 0.03 90.84 ± 0.04 0.93 ± 0.00
FGVC-Aircraft 43.75 ± 0.06 48.30 ± 0.04 0.77 ± 0.01 47.62 ± 0.06 51.09 ± 0.12 0.89 ± 0.00 49.93 ± 0.05 50.35 ± 0.03 0.94 ± 0.00
Flowers102 87.80 ± 0.01 87.96 ± 0.01 0.86 ± 0.00 89.56 ± 0.00 89.56 ± 0.00 1.00 ± 0.00 88.97 ± 0.00 89.15 ± 0.01 0.96 ± 0.00
Food101 59.70 ± 0.05 60.48 ± 0.02 0.89 ± 0.02 68.07 ± 0.05 68.13 ± 0.03 0.97 ± 0.01 70.95 ± 0.04 71.02 ± 0.01 0.99 ± 0.01
Avg Rel Improve (%) / β: ResNet-18: 3.53 / 0.80; ResNet-50: 1.36 / 0.93; ResNet-152: 0.77 / 0.95

(a) ResNet-18 on Food-101 (b) ResNet-50 on CelebA (c) ResNet-152 on MedMNIST

Figure 3: The accuracy trends during the β search process on various models and datasets. A sharp increase leading to a distinct peak, followed by a gradual decline as β increases, can be observed across models and datasets.

The results, presented in Table 10, demonstrate that CT consistently enhances generalization performance across all tested regularization strengths.
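The probing protocol can be summarized by the following sketch (the feature and label arrays are hypothetical placeholders for frozen-backbone features):

from sklearn.linear_model import LogisticRegression

for c in (0.1, 1.0, 10.0):  # scikit-learn's C: inverse regularization strength
    clf = LogisticRegression(C=c, max_iter=1000)
    clf.fit(train_features, train_labels)
    print(f"c={c}: test accuracy={clf.score(test_features, test_labels):.4f}")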
The second setting we adjust is the feature map used for linear probing, specifically the number of layers contributing
features to the linear classifier. The baseline configuration uses only the last layer’s features. Here, we test using features
from the last 2 and 3 layers for linear probing. Due to the increased dimensionality of the combined feature maps, we train a
fully connected classifier using Adam with a learning rate of 10⁻³, optimizing for 30 epochs with cross-entropy loss. The
results, shown in Table 11, indicate that CT consistently improves generalization performance regardless of the number of
layers used for feature extraction.
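A sketch of this multi-layer probe is given below (the feature tensors, labels, and num_classes are assumed placeholders; full-batch training is used for brevity):

import torch
from torch import nn

features = torch.cat(feats_per_layer, dim=1)  # concatenate features from the last k layers
classifier = nn.Linear(features.shape[1], num_classes)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for _ in range(30):  # 30 epochs, as described above
    optimizer.zero_grad()
    loss = criterion(classifier(features), labels)
    loss.backward()
    optimizer.step()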

D.3. Improving Generalization on Medical Image Datasets and Fine-grained Tasks


This subsection provides further details on the experimental setup for generalization in fine-grained tasks, as discussed
in Section 4.2. We also present the complete results of the generalization experiments on medical image datasets and
fine-grained tasks, along with a comparison of the per-attribute metrics for β selected using different methods.
Detailed Settings for Generalization on Fine-grained Tasks: Since the ResNets are pretrained on single-label classification, using them for more fine-grained downstream tasks, i.e., multi-label classification, regression, and semantic segmentation, requires task-specific adaptations:

• Multi-label Classification on CelebA: For the multi-label classification task, since the output dimension differs from that of the original single-label classification, we use MultiOutputClassifier from scikit-learn (Buitinck et al., 2013) for linear probing, which trains 40 independent linear classifiers, one for each label (see the sketch after this list).

• Regression on dSprites: For the regression task, we use the orientation of the shapes in dSprites as targets. Due to the
large dataset size, we only sample 50,000 images for training and 10,000 for testing. For transfer learning, we use
linear regression.

Table 10: Complete accuracy results of ResNet-18 trained and tested across MNIST, CIFAR-10, CIFAR-100, and ImageNet with varying
regularization strengths (c) for logistic regression during linear probing (bold entries indicate improvement with CT). CT consistently
enhances generalization across different c values, with β values close to 1. Reported values include means and standard deviations
over three runs.

(a) c = 0.1. Avg rel improve: 1.37%. Avg β: 0.86.


Train MNIST CIFAR-10 CIFAR-100 ImageNet
Test ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
MNIST 99.59 ± 0.00 99.59 ± 0.00 1.00 ± 0.00 75.36 ± 0.08 76.94 ± 0.15 0.93 ± 0.01 86.36 ± 0.04 90.21 ± 0.15 0.88 ± 0.00 97.99 ± 0.01 98.65 ± 0.01 0.71 ± 0.01
CIFAR-10 43.19 ± 0.04 44.32 ± 0.03 0.53 ± 0.01 94.87 ± 0.00 94.87 ± 0.00 1.00 ± 0.00 76.10 ± 0.22 77.04 ± 0.07 0.91 ± 0.01 86.10 ± 0.01 86.24 ± 0.00 0.96 ± 0.00
CIFAR-100 15.66 ± 0.03 16.42 ± 0.04 0.67 ± 0.01 24.02 ± 0.08 24.23 ± 0.07 0.98 ± 0.00 76.19 ± 0.00 76.21 ± 0.00 0.97 ± 0.00 66.94 ± 0.03 67.35 ± 0.04 0.94 ± 0.00
ImageNet - - - - - - - - - 69.76 ± 0.00 69.84 ± 0.00 0.94 ± 0.00

(b) c = 10. Avg rel improve: 2.48%. Avg β: 0.83.


Train MNIST CIFAR-10 CIFAR-100 ImageNet
Test ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
MNIST 99.59 ± 0.00 99.59 ± 0.00 1.00 ± 0.00 91.21 ± 0.12 92.04 ± 0.18 0.90 ± 0.01 89.94 ± 0.30 92.73 ± 0.51 0.88 ± 0.00 97.78 ± 0.02 98.99 ± 0.02 0.62 ± 0.01
CIFAR-10 44.72 ± 0.03 47.46 ± 0.06 0.53 ± 0.01 94.87 ± 0.00 94.87 ± 0.00 1.00 ± 0.00 75.92 ± 0.13 76.74 ± 0.04 0.91 ± 0.02 85.51 ± 0.03 85.69 ± 0.02 0.93 ± 0.00
CIFAR-100 19.89 ± 0.01 23.29 ± 0.02 0.51 ± 0.01 42.88 ± 0.10 43.54 ± 0.06 0.95 ± 0.01 76.19 ± 0.00 76.21 ± 0.00 0.97 ± 0.00 58.75 ± 0.02 59.27 ± 0.04 0.95 ± 0.01
ImageNet - - - - - - - - - 69.76 ± 0.00 69.84 ± 0.00 0.94 ± 0.00

Table 11: Complete accuracy results of ResNet-18 trained and tested across MNIST, CIFAR-10, CIFAR-100, and ImageNet with varying
number of layers from which features are extracted for linear probing (bold entries indicate improvement with CT). CT consistently
enhances generalization across different numbers of layers used, with β values close to 1. Reported values include means and
standard deviations over three runs.

(a) layer = 2. Avg rel improve: 2.76%. Avg β: 0.85.


Train MNIST CIFAR-10 CIFAR-100 ImageNet
Test ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
MNIST 99.59 ± 0.00 99.59 ± 0.00 1.00 ± 0.00 93.07 ± 0.46 94.66 ± 0.21 0.92 ± 0.01 92.62 ± 1.06 95.00 ± 0.49 0.87 ± 0.01 99.15 ± 0.08 99.32 ± 0.02 0.77 ± 0.06
CIFAR-10 51.90 ± 0.21 54.49 ± 0.17 0.57 ± 0.01 94.87 ± 0.00 94.87 ± 0.00 1.00 ± 0.00 79.68 ± 0.17 80.37 ± 0.28 0.94 ± 0.03 87.14 ± 0.09 87.43 ± 0.03 0.94 ± 0.02
CIFAR-100 23.06 ± 0.32 25.50 ± 0.27 0.56 ± 0.06 40.31 ± 0.44 43.10 ± 0.40 0.93 ± 0.02 76.19 ± 0.00 76.21 ± 0.00 0.97 ± 0.00 63.41 ± 4.79 68.27 ± 0.24 0.99 ± 0.00
ImageNet - - - - - - - - - 69.76 ± 0.00 69.84 ± 0.00 0.94 ± 0.00

(b) layer = 3. Avg rel improve: 2.38%. Avg β: 0.86.


Train MNIST CIFAR-10 CIFAR-100 ImageNet
Test ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
MNIST 99.59 ± 0.00 99.59 ± 0.00 1.00 ± 0.00 98.16 ± 0.03 98.37 ± 0.09 0.90 ± 0.01 97.44 ± 0.19 97.28 ± 0.09 0.91 ± 0.03 99.26 ± 0.03 99.36 ± 0.01 0.85 ± 0.02
CIFAR-10 56.19 ± 0.48 58.46 ± 0.13 0.54 ± 0.04 94.87 ± 0.00 94.87 ± 0.00 1.00 ± 0.00 84.49 ± 0.11 84.82 ± 0.06 0.96 ± 0.03 88.71 ± 0.07 89.15 ± 0.01 0.94 ± 0.03
CIFAR-100 25.57 ± 0.15 28.03 ± 0.11 0.53 ± 0.02 56.92 ± 0.10 57.30 ± 0.25 0.97 ± 0.02 76.19 ± 0.00 76.21 ± 0.00 0.97 ± 0.00 60.69 ± 5.12 69.86 ± 0.23 0.94 ± 0.04
ImageNet - - - - - - - - - 69.76 ± 0.00 69.84 ± 0.00 0.94 ± 0.00

• Semantic Segmentation on VOC2012: For the semantic segmentation task, we use a PSPNet with an ImageNet-pretrained ResNet-50 encoder and a randomly initialized decoder. We first apply CT to the encoder, then freeze it while training the remaining network (i.e., the decoder) on semantic segmentation, as this setup follows the common practice of using a pretrained ResNet as a fixed feature extractor for downstream tasks.
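For the CelebA probe described in the first item above, a minimal scikit-learn sketch follows (the array names are hypothetical placeholders; train_attributes is an (N, 40) binary matrix):

from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

probe = MultiOutputClassifier(LogisticRegression(max_iter=1000))
probe.fit(train_features, train_attributes)  # one independent probe per attribute
preds = probe.predict(test_features)
per_attribute_acc = (preds == test_attributes).mean(axis=0)
print(per_attribute_acc.mean())  # mean accuracy across the 40 attributes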

Complete Experimental Results: The full results for generalization on medical image datasets and fine-grained tasks are
presented in Table 12. Additionally, the complete per-attribute metrics for β selection using the Micro Best and Macro Best
strategies are provided in Table 13 and Table 14 for ResNet-18, Table 15 and Table 16 for ResNet-50, and Table 17 and
Table 18 for ResNet-152.

D.4. Improving Robustness on Adversarial and Corrupted Data


This subsection provides details on the setup for robustness experiments and presents the complete results.
Detailed Settings of Robustness Experiments: To evaluate model robustness, we apply both ℓ2 and ℓ∞ adversarial attacks and common corruptions. For the ℓ2 attack, we use an untargeted attack with budget ℓ2 = 0.5; for the ℓ∞ attack, an untargeted attack with budget ℓ∞ = 8/255. For corruption testing, we use CIFAR-10-C, CIFAR-100-C, and ImageNet-C (Hendrycks & Dietterich, 2019).
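The paper does not specify the attack algorithm or library; as one illustrative assumption, PGD attacks from the torchattacks library with the stated budgets could be instantiated as follows (the step sizes and iteration counts are likewise assumptions):

import torchattacks

atk_l2 = torchattacks.PGDL2(model, eps=0.5, alpha=0.1, steps=10)          # untargeted l2 attack
atk_linf = torchattacks.PGD(model, eps=8 / 255, alpha=2 / 255, steps=10)  # untargeted l_inf attack
adv_images = atk_linf(images, labels)  # robust accuracy is then measured on adv_images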
Complete Experimental Results: The full results for the robustness experiments are presented in Table 19.

Table 12: Complete results of the performance of ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 when transferred to
challenging medical image datasets and fine-grained tasks (bold entries indicate improvement with CT). CT consistently improves
generalization across diverse datasets and tasks. Reported values include means and standard deviations over three runs.

(a) ResNet-18. Avg rel improve: 2.69%. Avg β: 0.83.


Dataset Metric ReLU + CT Rel Improve (%) (β)
PathMNIST Acc (%) ↑ 86.21 ± 0.00 87.27 ± 0.01 1.23 0.81 ± 0.00
OCTMNIST Acc (%) ↑ 65.47 ± 0.06 69.00 ± 0.26 5.40 0.80 ± 0.04
DermaMNIST Acc (%) ↑ 73.44 ± 0.03 77.74 ± 0.08 5.86 0.80 ± 0.01
CelebA Mean Acc (%) ↑ 87.88 ± 0.00 88.44 ± 0.00 0.64 0.75 ± 0.01
dSprites MSE ↓ 4.09 ±0.01 4.08 ± 0.00 0.33 0.98 ± 0.00
VOC2012 mIoU ↑ - - - -

(b) ResNet-50. Avg rel improve: 1.74%. Avg β: 0.86.


Dataset Metric ReLU + CT Rel Improve (%) (β)
PathMNIST Acc (%) ↑ 89.83 ± 0.04 89.88 ± 0.02 0.06 0.98 ± 0.01
OCTMNIST Acc (%) ↑ 68.60 ± 0.20 69.93 ± 0.15 1.94 0.90 ± 0.02
DermaMNIST Acc (%) ↑ 73.67 ±0.00 77.44 ± 0.06 5.12 0.89 ± 0.00
CelebA Mean Acc (%) ↑ 89.18 ± 0.01 89.42 ± 0.00 0.27 0.91 ± 0.00
dSprites MSE ↓ 4.40 ± 0.01 4.28 ± 0.02 2.62 0.53 ± 0.03
VOC2012 mIoU ↑ 0.68 ± 0.00 0.69 ± 0.00 0.41 0.96 ±0.02

(c) ResNet-152. Avg rel improve: 4.22%. Avg β: 0.85.


Dataset Metric ReLU + CT Rel Improve (%) (β)
PathMNIST Acc (%) ↑ 90.15 ± 0.04 90.75 ± 0.06 0.66 0.92 ± 0.00
OCTMNIST Acc (%) ↑ 70.23 ± 0.15 71.03 ± 0.06 1.14 0.98 ± 0.00
DermaMNIST Acc (%) ↑ 74.39 ± 0.12 77.92 ± 0.07 4.75 0.93 ± 0.00
CelebA Mean Acc (%) ↑ 89.16 ± 0.02 89.40 ± 0.02 0.27 0.92 ± 0.01
dSprites MSE ↓ 4.46 ± 0.02 3.82 ± 0.02 14.27 0.52 ± 0.02
VOC2012 mIoU ↑ - - - -

D.5. Improving Generalization of Transformers


This subsection provides details on the pretraining of ReLU-based Swin transformers discussed in Section 4.4. It also
includes the complete experimental results for Swin-T and Swin-S.
Pretraining Details: The pretraining of ReLU-based Swin transformers on Imagenette follows the same training configura-
tion as used for ResNet-18/ResNet-50/ResNet-152 on CIFAR-10 and CIFAR-100, detailed in Appendix D.2. The training
curves for Swin-T and Swin-S are shown in Figure 4.

(a) Train accuracy. (b) Validation accuracy.

Figure 4: Training and validation accuracy curves for Swin-T and Swin-S.

Complete Experimental Results: The full results of the generalization experiments for Swin transformers are provided in
Table 20.

D.6. Ablation Studies


This subsection presents the complete results of the ablation studies introduced in Section 4.5. Table 21 provides results for the generalization experiments, while Table 22 presents those for the robustness experiments.

Table 13: Per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-18 when transferred to multi-label
prediction on CelebA, with the β of CT optimized per attribute (Micro Best). Reported values include means and standard deviations
over three runs.

Metric
Accuracy Balanced Accuracy F1
Attribute
5 Shadow 91.54 ± 0.02 65.91 ± 0.01 44.16 ± 0.03
Arch. Eyebrows 79.63 ± 0.01 72.46 ± 0.02 60.91 ± 0.03
Attractive 79.44 ± 0.00 79.47 ± 0.00 80.15 ± 0.00
Bags Un. Eyes 82.80 ± 0.00 65.29 ± 0.01 45.80 ± 0.02
Bald 98.71 ± 0.00 80.49 ± 0.10 66.95 ± 0.18
Bangs 93.51 ± 0.01 84.83 ± 0.05 77.62 ± 0.06
Big Lips 69.53 ± 0.01 54.83 ± 0.03 20.94 ± 0.08
Big Nose 81.71 ± 0.02 66.77 ± 0.02 48.63 ± 0.05
Black Hair 86.41 ± 0.01 80.91 ± 0.01 73.35 ± 0.02
Blond Hair 94.92 ± 0.01 87.45 ± 0.04 80.20 ± 0.05
Blurry 96.05 ± 0.01 69.44 ± 0.06 50.48 ± 0.13
Brown Hair 85.89 ± 0.01 72.45 ± 0.02 56.72 ± 0.04
Bushy Eyebrows 89.14 ± 0.02 65.36 ± 0.03 44.23 ± 0.07
Chubby 95.09 ± 0.02 62.03 ± 0.06 34.98 ± 0.10
Double Chin 95.71 ± 0.02 61.68 ± 0.13 34.04 ± 0.34
Eyeglasses 98.91 ± 0.00 93.42 ± 0.03 91.15 ± 0.03
Goatee 96.07 ± 0.01 71.22 ± 0.07 50.55 ± 0.15
Gray Hair 97.83 ± 0.00 79.04 ± 0.11 63.38 ± 0.10
Heavy Makeup 88.01 ± 0.01 87.70 ± 0.01 85.32 ± 0.02
H. Cheekbones 81.45 ± 0.00 81.35 ± 0.00 80.39 ± 0.03
Male 93.80 ± 0.00 93.69 ± 0.00 92.08 ± 0.00
Mouth S. O. 80.45 ± 0.01 80.43 ± 0.01 79.90 ± 0.03
Mustache 96.23 ± 0.00 57.92 ± 0.00 25.01 ± 0.03
Narrow Eyes 85.90 ± 0.00 55.09 ± 0.01 19.17 ± 0.04
No Beard 91.32 ± 0.01 78.68 ± 0.01 95.00 ± 0.01
Oval Face 73.70 ± 0.00 60.37 ± 0.03 38.47 ± 0.05
Pale Skin 96.65 ± 0.02 67.28 ± 0.13 46.94 ± 0.32
Pointy Nose 73.96 ± 0.02 59.41 ± 0.01 35.86 ± 0.06
Reced. Hairline 91.87 ± 0.01 59.94 ± 0.02 30.94 ± 0.07
Rosy Cheeks 94.08 ± 0.01 67.31 ± 0.14 46.46 ± 0.21
Sideburns 96.59 ± 0.02 72.92 ± 0.09 55.83 ± 0.18
Smiling 84.71 ± 0.01 84.71 ± 0.01 84.60 ± 0.01
Straight Hair 81.62 ± 0.00 63.30 ± 0.01 42.02 ± 0.02
Wavy Hair 81.32 ± 0.02 77.06 ± 0.01 70.52 ± 0.02
Wear. Earrings 85.37 ± 0.00 71.29 ± 0.03 57.11 ± 0.04
Wear. Hat 98.76 ± 0.00 91.38 ± 0.00 85.00 ± 0.03
Wear. Lipstick 90.97 ± 0.00 91.03 ± 0.02 91.21 ± 0.01
Wear. Necklace 86.27 ± 0.00 51.21 ± 0.03 5.37 ± 0.09
Wear. Necktie 94.72 ± 0.01 71.55 ± 0.03 54.24 ± 0.04
Young 84.10 ± 0.01 72.23 ± 0.02 90.08 ± 0.00

Table 14: Per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-18 when transferred to multi-label
prediction on CelebA, with the β of CT optimized globally for the highest mean accuracy across attributes (Macro Best) (bold values
indicate cases where the metric remains the same under both Micro Best and Macro Best β). The small performance gap between
Micro Best and Macro Best demonstrates the stability of CT on more fine-grained downstream tasks. Reported values include
means and standard deviations over three runs.

Metric
Accuracy Balanced Accuracy F1
Attribute
5 Shadow 91.34 ± 0.02 65.57 ± 0.03 43.56 ± 0.08
Arch. Eyebrows 79.34 ± 0.01 71.97 ± 0.02 60.14 ± 0.02
Attractive 79.41 ± 0.02 79.46 ± 0.02 80.14 ± 0.02
Bags Un. Eyes 82.80 ± 0.00 65.12 ± 0.00 45.48 ± 0.02
Bald 98.55 ± 0.01 78.56 ± 0.10 62.94 ± 0.22
Bangs 93.33 ± 0.01 84.34 ± 0.02 76.91 ± 0.03
Big Lips 69.28 ± 0.01 54.39 ± 0.02 19.54 ± 0.07
Big Nose 81.65 ± 0.01 66.64 ± 0.02 48.36 ± 0.04
Black Hair 85.16 ± 0.01 79.45 ± 0.01 71.27 ± 0.01
Blond Hair 94.85 ± 0.00 87.33 ± 0.01 79.94 ± 0.01
Blurry 95.89 ± 0.00 68.48 ± 0.13 48.10 ± 0.28
Brown Hair 85.47 ± 0.01 70.99 ± 0.05 54.57 ± 0.08
Bushy Eyebrows 89.14 ± 0.02 64.74 ± 0.03 42.94 ± 0.07
Chubby 95.04 ± 0.01 61.84 ± 0.12 34.49 ± 0.26
Double Chin 95.70 ± 0.00 61.16 ± 0.07 32.83 ± 0.20
Eyeglasses 98.90 ± 0.00 93.40 ± 0.02 91.15 ± 0.03
Goatee 95.94 ± 0.00 69.30 ± 0.05 47.66 ± 0.08
Gray Hair 97.77 ± 0.01 78.74 ± 0.19 62.72 ± 0.25
Heavy Makeup 87.76 ± 0.01 87.50 ± 0.01 85.07 ± 0.01
H. Cheekbones 81.38 ± 0.00 81.18 ± 0.01 80.23 ± 0.01
Male 93.41 ± 0.02 93.44 ± 0.00 91.76 ± 0.01
Mouth S. O. 80.44 ± 0.01 80.35 ± 0.04 79.90 ± 0.03
Mustache 96.23 ± 0.01 57.30 ± 0.13 23.51 ± 0.31
Narrow Eyes 85.88 ± 0.01 54.91 ± 0.02 18.67 ± 0.04
No Beard 90.95 ± 0.01 77.63 ± 0.04 94.83 ± 0.01
Oval Face 73.63 ± 0.01 60.08 ± 0.02 37.76 ± 0.05
Pale Skin 96.46 ± 0.01 63.13 ± 0.06 38.83 ± 0.14
Pointy Nose 73.83 ± 0.00 58.86 ± 0.05 34.57 ± 0.11
Reced. Hairline 91.68 ± 0.01 59.12 ± 0.05 28.97 ± 0.12
Rosy Cheeks 93.91 ± 0.01 66.99 ± 0.09 45.60 ± 0.18
Sideburns 96.54 ± 0.01 71.33 ± 0.06 53.65 ± 0.04
Smiling 84.31 ± 0.01 84.15 ± 0.01 84.10 ± 0.02
Straight Hair 81.26 ± 0.01 62.67 ± 0.01 40.69 ± 0.02
Wavy Hair 80.98 ± 0.03 76.92 ± 0.02 70.32 ± 0.03
Wear. Earrings 85.26 ± 0.01 71.15 ± 0.01 56.84 ± 0.02
Wear. Hat 98.73 ± 0.00 90.19 ± 0.07 84.03 ± 0.11
Wear. Lipstick 90.92 ± 0.00 91.03 ± 0.02 91.19 ± 0.02
Wear. Necklace 86.05 ± 0.01 51.02 ± 0.04 4.79 ± 0.14
Wear. Necktie 94.38 ± 0.02 69.16 ± 0.07 50.14 ± 0.16
Young 84.03 ± 0.01 72.13 ± 0.01 90.04 ± 0.00
Avg Rel Reduction (%) 0.21 0.96 2.81

Table 15: Per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-50 when transferred to multi-label
prediction on CelebA, with the β of CT optimized per attribute (Micro Best). Reported values include means and standard deviations
over three runs.

Metric
Accuracy Balanced Accuracy F1
Attribute
5 Shadow 92.32 ± 0.02 71.34 ± 0.12 53.97 ± 0.21
Arched Eyebrows 80.61 ± 0.06 74.22 ± 0.10 63.54 ± 0.14
Attractive 80.68 ± 0.02 80.71 ± 0.02 81.20 ± 0.01
Bags Under Eyes 83.11 ± 0.02 67.27 ± 0.03 49.37 ± 0.06
Bald 98.73 ± 0.01 81.58 ± 0.30 67.90 ± 0.20
Bangs 94.79 ± 0.01 87.96 ± 0.03 82.34 ± 0.05
Big Lips 70.05 ± 0.01 56.10 ± 0.02 25.67 ± 0.09
Big Nose 82.52 ± 0.03 69.63 ± 0.03 53.41 ± 0.06
Black Hair 88.01 ± 0.02 83.17 ± 0.04 76.68 ± 0.05
Blond Hair 95.10 ± 0.02 88.07 ± 0.05 81.00 ± 0.06
Blurry 96.05 ± 0.02 71.40 ± 0.09 52.88 ± 0.11
Brown Hair 85.89 ± 0.03 74.11 ± 0.04 58.67 ± 0.07
Bushy Eyebrows 89.42 ± 0.00 67.66 ± 0.04 48.40 ± 0.06
Chubby 95.23 ± 0.02 65.82 ± 0.06 42.21 ± 0.13
Double Chin 96.00 ± 0.01 64.78 ± 0.09 41.04 ± 0.19
Eyeglasses 99.18 ± 0.00 95.01 ± 0.03 93.41 ± 0.04
Goatee 96.33 ± 0.01 75.45 ± 0.05 56.72 ± 0.10
Gray Hair 98.01 ± 0.01 80.34 ± 0.13 66.27 ± 0.16
Heavy Makeup 88.85 ± 0.01 88.66 ± 0.01 86.42 ± 0.02
High Cheekbones 84.00 ± 0.05 83.91 ± 0.05 83.05 ± 0.05
Male 95.74 ± 0.02 95.50 ± 0.02 94.49 ± 0.02
Mouth Slightly Open 86.89 ± 0.02 86.87 ± 0.02 86.48 ± 0.02
Mustache 96.35 ± 0.02 60.56 ± 0.17 31.46 ± 0.30
Narrow Eyes 86.34 ± 0.02 57.40 ± 0.08 26.10 ± 0.24
No Beard 93.19 ± 0.01 83.34 ± 0.07 96.06 ± 0.00
Oval Face 74.31 ± 0.03 62.11 ± 0.03 42.61 ± 0.07
Pale Skin 96.59 ± 0.02 69.14 ± 0.05 48.66 ± 0.05
Pointy Nose 74.84 ± 0.02 61.83 ± 0.03 41.67 ± 0.05
Receding Hairline 92.69 ± 0.01 65.82 ± 0.04 43.72 ± 0.05
Rosy Cheeks 94.44 ± 0.02 70.59 ± 0.04 52.41 ± 0.04
Sideburns 97.04 ± 0.02 77.68 ± 0.11 63.88 ± 0.14
Smiling 88.23 ± 0.00 88.23 ± 0.00 88.12 ± 0.00
Straight Hair 82.77 ± 0.01 67.49 ± 0.05 50.05 ± 0.09
Wavy Hair 83.21 ± 0.01 79.41 ± 0.02 73.95 ± 0.02
Wearing Earrings 86.97 ± 0.03 75.77 ± 0.04 64.25 ± 0.08
Wearing Hat 98.94 ± 0.00 93.44 ± 0.03 87.36 ± 0.06
Wearing Lipstick 92.23 ± 0.03 92.26 ± 0.03 92.48 ± 0.03
Wearing Necklace 86.30 ± 0.01 52.81 ± 0.02 11.70 ± 0.14
Wearing Necktie 95.28 ± 0.00 76.54 ± 0.05 61.91 ± 0.06
Young 85.85 ± 0.01 75.81 ± 0.02 91.07 ± 0.01

Table 16: Per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-50 when transferred to multi-label
prediction on CelebA, with the β of CT optimized globally for the highest mean accuracy across attributes (Macro Best) (bold values
indicate cases where the metric remains the same under both Micro Best and Macro Best β). The small performance gap between
Micro Best and Macro Best demonstrates the stability of CT on more fine-grained downstream tasks. Reported values include
means and standard deviations over three runs.

Metric
Accuracy Balanced Accuracy F1
Attribute
5 Shadow 92.24 ± 0.01 70.98 ± 0.06 53.47 ± 0.08
Arched Eyebrows 79.49 ± 0.02 72.08 ± 0.04 60.34 ± 0.07
Attractive 80.68 ± 0.02 80.57 ± 0.02 81.05 ± 0.02
Bags Under Eyes 83.11 ± 0.02 67.27 ± 0.04 49.36 ± 0.06
Bald 98.72 ± 0.00 81.18 ± 0.25 67.42 ± 0.36
Bangs 94.52 ± 0.00 87.58 ± 0.01 81.53 ± 0.01
Big Lips 70.03 ± 0.01 56.04 ± 0.03 25.48 ± 0.11
Big Nose 82.52 ± 0.03 69.48 ± 0.03 53.18 ± 0.05
Black Hair 87.58 ± 0.02 83.12 ± 0.04 76.57 ± 0.07
Blond Hair 95.10 ± 0.02 87.68 ± 0.02 80.42 ± 0.03
Blurry 95.96 ± 0.01 69.99 ± 0.06 50.43 ± 0.09
Brown Hair 85.59 ± 0.02 72.95 ± 0.06 56.99 ± 0.09
Bushy Eyebrows 88.96 ± 0.01 66.35 ± 0.02 45.76 ± 0.05
Chubby 95.23 ± 0.02 64.93 ± 0.06 40.46 ± 0.11
Double Chin 95.86 ± 0.00 64.08 ± 0.00 38.95 ± 0.07
Eyeglasses 99.06 ± 0.00 94.81 ± 0.07 92.96 ± 0.11
Goatee 96.19 ± 0.02 74.74 ± 0.07 55.57 ± 0.08
Gray Hair 97.94 ± 0.01 80.34 ± 0.13 66.27 ± 0.16
Heavy Makeup 88.76 ± 0.01 88.66 ± 0.01 86.42 ± 0.02
High Cheekbones 83.92 ± 0.03 83.82 ± 0.02 83.03 ± 0.02
Male 95.62 ± 0.02 95.50 ± 0.02 94.49 ± 0.02
Mouth Slightly Open 86.34 ± 0.03 85.90 ± 0.00 85.50 ± 0.01
Mustache 96.18 ± 0.01 60.41 ± 0.04 30.68 ± 0.16
Narrow Eyes 85.89 ± 0.01 55.31 ± 0.02 20.08 ± 0.05
No Beard 92.97 ± 0.03 83.04 ± 0.09 96.02 ± 0.01
Oval Face 74.01 ± 0.02 61.74 ± 0.01 41.92 ± 0.03
Pale Skin 96.50 ± 0.01 68.31 ± 0.09 47.87 ± 0.11
Pointy Nose 74.84 ± 0.01 61.61 ± 0.04 41.22 ± 0.09
Receding Hairline 92.59 ± 0.01 65.82 ± 0.04 43.72 ± 0.05
Rosy Cheeks 94.32 ± 0.00 69.86 ± 0.09 50.94 ± 0.18
Sideburns 96.91 ± 0.02 76.67 ± 0.22 62.11 ± 0.41
Smiling 88.23 ± 0.00 88.15 ± 0.03 88.04 ± 0.03
Straight Hair 82.72 ± 0.02 67.25 ± 0.05 49.58 ± 0.08
Wavy Hair 82.91 ± 0.02 79.34 ± 0.02 73.85 ± 0.03
Wearing Earrings 86.80 ± 0.01 75.40 ± 0.02 63.68 ± 0.04
Wearing Hat 98.88 ± 0.01 92.99 ± 0.12 86.57 ± 0.09
Wearing Lipstick 92.22 ± 0.01 92.26 ± 0.03 92.48 ± 0.03
Wearing Necklace 86.25 ± 0.01 52.81 ± 0.02 11.68 ± 0.08
Wearing Necktie 95.27 ± 0.00 76.41 ± 0.20 61.73 ± 0.39
Young 85.77 ± 0.04 75.71 ± 0.02 91.04 ± 0.01
Avg Rel Reduction (%) 0.18 0.65 1.87

Table 17: Per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-152 when transferred to multi-label
prediction on CelebA, with the β of CT optimized per attribute (Micro Best). Reported values include means and standard deviations
over three runs.

Metric
Accuracy Balanced Accuracy F1
Attribute
5 Shadow 92.22 ± 0.05 70.68 ± 0.32 52.89 ± 0.55
Arch. Eyebrows 79.90 ± 0.29 72.86 ± 0.37 61.54 ± 0.57
Attractive 80.68 ± 0.10 80.71 ± 0.10 81.17 ± 0.09
Bags Un. Eyes 83.30 ± 0.14 67.64 ± 0.25 50.06 ± 0.47
Bald 98.69 ± 0.03 79.25 ± 0.98 65.60 ± 1.06
Bangs 94.43 ± 0.01 87.16 ± 0.10 81.08 ± 0.08
Big Lips 70.15 ± 0.14 56.21 ± 0.10 25.95 ± 0.29
Big Nose 82.62 ± 0.07 69.63 ± 0.08 53.46 ± 0.15
Black Hair 88.07 ± 0.11 83.39 ± 0.21 76.91 ± 0.27
Blond Hair 95.01 ± 0.03 87.75 ± 0.05 80.59 ± 0.09
Blurry 96.09 ± 0.08 71.55 ± 1.06 53.15 ± 1.87
Brown Hair 86.44 ± 0.43 75.15 ± 1.21 60.29 ± 1.92
Bushy Eyebrows 89.50 ± 0.35 67.69 ± 0.95 48.56 ± 1.98
Chubby 95.10 ± 0.04 65.00 ± 0.53 40.24 ± 0.33
Double Chin 95.84 ± 0.02 64.87 ± 0.56 40.32 ± 0.93
Eyeglasses 99.04 ± 0.09 94.56 ± 0.34 92.35 ± 0.70
Goatee 96.41 ± 0.07 75.42 ± 0.17 57.21 ± 0.59
Gray Hair 97.89 ± 0.05 80.04 ± 0.15 64.72 ± 0.54
Heavy Makeup 88.97 ± 0.08 88.72 ± 0.03 86.51 ± 0.06
H. Cheekbones 84.04 ± 0.10 83.95 ± 0.09 83.11 ± 0.06
Male 95.56 ± 0.01 95.34 ± 0.00 94.26 ± 0.00
Mouth S. O. 84.97 ± 0.67 84.94 ± 0.68 84.44 ± 0.76
Mustache 96.30 ± 0.09 61.01 ± 0.27 32.10 ± 0.45
Narrow Eyes 86.04 ± 0.10 56.27 ± 0.70 22.85 ± 2.02
No Beard 93.00 ± 0.15 82.94 ± 0.17 95.95 ± 0.09
Oval Face 74.29 ± 0.18 62.11 ± 0.25 42.63 ± 0.47
Pale Skin 96.71 ± 0.11 69.98 ± 1.09 51.08 ± 2.16
Pointy Nose 74.95 ± 0.08 61.83 ± 0.06 41.59 ± 0.11
Reced. Hairline 92.42 ± 0.20 64.20 ± 1.15 40.32 ± 2.44
Rosy Cheeks 94.34 ± 0.06 70.34 ± 0.05 51.73 ± 0.11
Sideburns 96.98 ± 0.01 76.50 ± 0.58 62.38 ± 0.56
Smiling 87.93 ± 0.16 87.93 ± 0.16 87.85 ± 0.15
Straight Hair 82.49 ± 0.03 66.92 ± 0.21 48.98 ± 0.37
Wavy Hair 83.08 ± 0.02 79.22 ± 0.03 73.68 ± 0.04
Wear. Earrings 86.83 ± 0.14 75.37 ± 0.17 63.61 ± 0.33
Wear. Hat 98.95 ± 0.04 93.47 ± 0.22 87.52 ± 0.48
Wear. Lipstick 91.87 ± 0.23 91.91 ± 0.22 92.10 ± 0.24
Wear. Necklace 86.28 ± 0.01 52.80 ± 0.03 11.82 ± 0.10
Wear. Necktie 95.36 ± 0.06 77.28 ± 0.59 62.92 ± 0.81
Young 85.79 ± 0.02 75.89 ± 0.08 91.02 ± 0.02

Table 18: Per-attribute accuracy, balanced accuracy, and F1-score of ImageNet-pretrained ResNet-152 when transferred to multi-label
prediction on CelebA, with the β of CT optimized globally for the highest mean accuracy across attributes (Macro Best) (bold values
indicate cases where the metric remains the same under both Micro Best and Macro Best β). The small performance gap between
Micro Best and Macro Best demonstrates the stability of CT on more fine-grained downstream tasks. Reported values include
means and standard deviations over three runs.

Metric
Accuracy Balanced Accuracy F1
Attribute
5 Shadow 92.18 ± 0.10 70.58 ± 0.64 52.64 ± 1.14
Arch. Eyebrows 79.70 ± 0.23 72.44 ± 0.51 60.86 ± 0.78
Attractive 80.68 ± 0.10 80.65 ± 0.07 81.12 ± 0.05
Bags Un. Eyes 83.30 ± 0.14 67.49 ± 0.24 49.77 ± 0.43
Bald 98.58 ± 0.09 78.18 ± 1.97 63.82 ± 2.62
Bangs 94.39 ± 0.10 87.16 ± 0.10 81.08 ± 0.08
Big Lips 70.11 ± 0.06 56.16 ± 0.19 25.62 ± 0.50
Big Nose 82.62 ± 0.07 69.35 ± 0.14 52.91 ± 0.22
Black Hair 88.07 ± 0.11 83.32 ± 0.14 76.79 ± 0.15
Blond Hair 95.01 ± 0.03 87.75 ± 0.05 80.56 ± 0.11
Blurry 96.06 ± 0.10 71.19 ± 0.88 52.38 ± 1.83
Brown Hair 86.30 ± 0.53 74.96 ± 1.22 60.02 ± 1.87
Bushy Eyebrows 89.50 ± 0.35 67.24 ± 0.98 47.59 ± 2.12
Chubby 95.10 ± 0.04 64.92 ± 0.11 40.24 ± 0.33
Double Chin 95.84 ± 0.01 64.67 ± 0.43 39.71 ± 0.70
Eyeglasses 98.96 ± 0.10 94.38 ± 0.41 92.20 ± 0.67
Goatee 96.29 ± 0.04 75.13 ± 0.19 56.80 ± 0.22
Gray Hair 97.85 ± 0.12 80.00 ± 0.09 64.72 ± 0.54
Heavy Makeup 88.97 ± 0.08 88.57 ± 0.07 86.35 ± 0.09
H. Cheekbones 84.04 ± 0.10 83.77 ± 0.18 82.96 ± 0.18
Male 95.38 ± 0.24 95.25 ± 0.12 94.12 ± 0.20
Mouth S. O. 84.97 ± 0.67 84.62 ± 0.51 84.17 ± 0.52
Mustache 96.24 ± 0.01 60.15 ± 0.15 30.39 ± 0.50
Narrow Eyes 85.98 ± 0.22 55.54 ± 0.33 20.75 ± 0.97
No Beard 92.80 ± 0.24 82.82 ± 0.22 95.92 ± 0.05
Oval Face 74.29 ± 0.18 62.04 ± 0.21 42.56 ± 0.42
Pale Skin 96.71 ± 0.11 69.80 ± 1.01 50.41 ± 1.76
Pointy Nose 74.74 ± 0.02 61.52 ± 0.02 41.15 ± 0.16
Reced. Hairline 92.42 ± 0.20 64.04 ± 1.03 39.85 ± 2.23
Rosy Cheeks 94.28 ± 0.02 69.58 ± 0.05 50.24 ± 0.20
Sideburns 96.82 ± 0.08 76.50 ± 0.58 62.38 ± 0.56
Smiling 87.85 ± 0.21 87.93 ± 0.16 87.85 ± 0.15
Straight Hair 82.49 ± 0.03 66.83 ± 0.15 48.80 ± 0.25
Wavy Hair 82.82 ± 0.24 79.10 ± 0.21 73.50 ± 0.30
Wear. Earrings 86.78 ± 0.01 75.37 ± 0.17 63.61 ± 0.33
Wear. Hat 98.89 ± 0.02 93.47 ± 0.22 87.52 ± 0.48
Wear. Lipstick 91.87 ± 0.23 91.80 ± 0.24 92.01 ± 0.24
Wear. Necklace 86.17 ± 0.09 52.70 ± 0.06 11.44 ± 0.17
Wear. Necktie 95.34 ± 0.01 77.19 ± 0.50 62.83 ± 0.71
Young 85.62 ± 0.11 75.89 ± 0.08 91.02 ± 0.02
Avg Rel Reduction (%) 0.07 0.30 1.00

Table 19: Complete robust accuracy results of ImageNet-pretrained ResNet-18, ResNet-50, and ResNet-152 under ℓ2 /ℓ∞ adversarial
attacks and common corruptions on CIFAR-10, CIFAR-100, and ImageNet (bold entries indicate improvement with CT). CT consistently
enhances robustness across models, datasets, and robustness settings, with β values close to 1. Reported values include means and
standard deviations over three runs.

(a) ResNet-18. Avg rel improve: 11.76%. Avg β: 0.92.


Attack ℓ2 ℓ∞ Corruption
Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
CIFAR-10 53.67 ± 0.32 53.67 ± 0.32 1.00 ± 0.00 11.17 ± 0.06 14.93 ± 0.06 0.90 ± 0.00 77.73 ± 0.00 77.73 ± 0.00 1.00 ± 0.00
CIFAR-100 24.30 ± 0.10 25.50 ± 0.00 0.92 ± 0.00 4.47 ± 0.06 6.90 ± 0.00 0.92 ± 0.00 51.81 ± 0.00 51.95 ± 0.00 0.94 ± 0.00
ImageNet 23.37 ± 0.06 23.37 ± 0.06 1.00 ± 0.00 0.00 ± 0.00 7.00 ± 0.10 0.89 ± 0.00 33.11 ± 0.00 33.32 ± 0.00 0.92 ± 0.00

(b) ResNet-50. Avg rel improve: 348.44%. Avg β: 0.95.


Attack ℓ2 ℓ∞ Corruption
Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
CIFAR-10 55.10 ± 0.10 56.53 ± 0.21 0.97 ± 0.00 10.10 ± 0.17 14.83 ± 0.06 0.95 ± 0.00 77.26 ± 0.00 77.26 ± 0.00 1.00 ± 0.00
CIFAR-100 23.83 ± 0.06 25.80 ± 0.20 0.96 ± 0.00 4.43 ± 0.06 7.90 ± 0.00 0.93 ± 0.00 53.91 ± 0.00 53.93 ± 0.00 0.98 ± 0.00
ImageNet 31.90 ± 0.00 31.90 ± 0.00 1.00 ± 0.00 0.30 ± 0.00 9.30 ± 0.17 0.93 ± 0.00 39.64 ± 0.00 39.64 ± 0.00 1.00 ± 0.00

(c) ResNet-152. Avg rel improve: 498.41%. Avg β: 0.98.


Attack ℓ2 ℓ∞ Corruption
Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
CIFAR-10 56.27 ± 0.23 56.27 ± 0.23 1.00 ± 0.00 11.47 ± 0.06 15.00 ± 0.20 0.99 ± 0.00 78.82 ± 0.00 78.83 ± 0.00 0.99 ± 0.00
CIFAR-100 27.90 ± 0.10 28.23 ± 0.12 0.98 ± 0.00 5.40 ± 0.00 7.70 ± 0.17 0.99 ± 0.00 56.12 ± 0.00 56.12 ± 0.00 1.00 ± 0.00
ImageNet 42.50 ± 0.00 42.50 ± 0.00 1.00 ± 0.00 0.30 ± 0.00 13.53 ± 0.06 0.97 ± 0.01 45.47 ± 0.00 45.47 ± 0.00 0.99 ± 0.00

Table 20: Complete accuracy results of Imagenette-pretrained Swin-T and Swin-S (ReLU-based) when transferred to 9 downstream
datasets (bold entries indicate improvement with CT). CT consistently enhances generalization across models and datasets, with
β values close to 1, demonstrating its effectiveness even with partial theoretical guarantees. Reported values include means and
standard deviations over three runs.

Model Swin-T Swin-S


Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
Arabic Characters 43.08 ± 0.06 45.14 ± 0.07 0.92 ± 0.00 43.90 ± 0.08 44.70 ± 0.09 0.97 ± 0.00
Arabic Digits 90.38 ± 0.04 91.46 ± 0.03 0.86 ± 0.01 88.74 ± 0.01 89.15 ± 0.06 0.95 ± 0.01
Beans 75.00 ± 0.00 82.03 ± 0.00 0.85 ± 0.00 66.41 ± 0.00 71.09 ± 0.00 0.83 ± 0.00
CUB-200-2011 6.97 ± 0.00 7.02 ± 0.00 0.93 ± 0.01 6.40 ± 0.00 6.70 ± 0.00 0.94 ± 0.00
DTD 21.51 ± 0.06 21.70 ± 0.00 0.93 ± 0.00 20.59 ± 0.00 21.28 ± 0.00 0.94 ± 0.00
Fashion MNIST 78.61 ± 0.02 79.08 ± 0.01 0.92 ± 0.02 77.48 ± 0.01 77.64 ± 0.05 0.95 ± 0.00
FGVC-Aircraft 8.13 ± 0.00 8.31 ± 0.00 0.98 ± 0.00 7.12 ± 0.07 7.70 ± 0.02 0.96 ± 0.00
Flowers102 23.77 ± 0.02 24.19 ± 0.07 0.94 ± 0.00 22.29 ± 0.02 23.01 ± 0.05 0.95 ± 0.00
Food101 17.35 ± 0.02 17.41 ± 0.04 0.98 ± 0.01 17.11 ± 0.01 17.29 ± 0.03 0.95 ± 0.01
Avg Rel Improve (%) / β: Swin-T: 2.43 / 0.92; Swin-S: 3.33 / 0.94

Table 21: Complete accuracy results of ImageNet-pretrained ResNet-18 when transferred to 13 downstream datasets, steered with
the Swish-only, SoftPlus-only, and combined (baseline) versions of CT (bold entries indicate improvement with CT). While both the
Swish-only and SoftPlus-only versions enhance model generalization, their improvements are less significant than the combined
version, validating our CT implementation. Reported values include means and standard deviations over three runs.

CT Implementation Swish only SoftPlus only Combination


Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
Arabic Characters 86.49 ± 0.05 86.49 ± 0.05 1.00 ± 0.00 86.48 ± 0.12 89.90 ± 0.05 0.92 ± 0.00 86.46 ± 0.00 92.11 ± 0.08 0.70 ± 0.00
Arabic Digits 97.92 ± 0.03 97.99 ± 0.01 0.95 ± 0.01 97.91 ± 0.01 98.81 ± 0.02 0.89 ± 0.00 97.92 ± 0.03 98.92 ± 0.01 0.72 ± 0.04
Beans 85.94 ± 0.00 86.72 ± 0.00 0.92 ± 0.00 85.94 ± 0.00 94.53 ± 0.00 0.85 ± 0.00 85.94 ± 0.00 94.53 ± 0.00 0.60 ± 0.00
CelebA 87.88 ± 0.00 87.88 ± 0.00 1.00 ± 0.00 87.88 ± 0.00 88.53 ± 0.01 0.89 ± 0.01 87.88 ± 0.00 88.44 ± 0.00 0.75 ± 0.01
CUB-200-2011 62.93 ± 0.00 62.93 ± 0.00 1.00 ± 0.00 62.97 ± 0.04 64.13 ± 0.06 0.95 ± 0.00 62.93 ± 0.00 63.60 ± 0.00 0.90 ± 0.00
DTD 64.38 ± 0.03 64.38 ± 0.03 1.00 ± 0.00 64.41 ± 0.06 64.54 ± 0.03 0.95 ± 0.00 64.38 ± 0.03 64.50 ± 0.03 0.92 ± 0.00
Fashion MNIST 88.54 ± 0.02 88.54 ± 0.02 1.00 ± 0.00 88.53 ± 0.01 89.38 ± 0.02 0.94 ± 0.00 88.54 ± 0.03 89.52 ± 0.01 0.87 ± 0.01
FGVC-Aircraft 43.70 ± 0.11 43.82 ± 0.02 0.99 ± 0.01 43.74 ± 0.03 46.84 ± 0.06 0.91 ± 0.00 43.75 ± 0.06 48.30 ± 0.04 0.77 ± 0.01
Flowers102 87.80 ± 0.01 87.80 ± 0.01 1.00 ± 0.00 87.79 ± 0.01 88.59 ± 0.06 0.95 ± 0.00 87.80 ± 0.01 87.96 ± 0.01 0.86 ± 0.00
Food101 59.70 ± 0.05 59.74 ± 0.02 0.99 ± 0.01 59.77 ± 0.07 61.06 ± 0.05 0.93 ± 0.00 59.70 ± 0.05 60.48 ± 0.02 0.89 ± 0.02
MedMNIST (PathMNIST) 86.22 ± 0.01 86.27 ± 0.02 0.99 ± 0.00 86.23 ± 0.02 88.58 ± 0.10 0.90 ± 0.00 86.21 ± 0.00 87.27 ± 0.01 0.81 ± 0.00
MedMNIST (OCTMNIST) 65.47 ± 0.06 65.47 ± 0.06 1.00 ± 0.00 65.53 ± 0.15 67.27 ± 0.06 0.93 ± 0.00 65.47 ± 0.06 69.00 ± 0.26 0.80 ± 0.04
MedMNIST (DermaMNIST) 73.44 ± 0.03 74.03 ± 0.07 0.63 ± 0.01 73.39 ± 0.03 76.61 ± 0.00 0.88 ± 0.01 73.44 ± 0.03 77.74 ± 0.08 0.80 ± 0.01
Avg Rel Improve (%) / β: Swish only: 0.23 / 0.87; SoftPlus only: 2.96 / 0.92; Combination: 3.46 / 0.80

Table 22: Complete robust accuracy results of ImageNet-pretrained ResNet-18 under ℓ2 /ℓ∞ adversarial attacks and common corruptions
on CIFAR-10, CIFAR-100, and ImageNet, steered with the Swish-only, SoftPlus-only, and combined (baseline) versions of CT (bold
entries indicate improvement with CT). While both the Swish-only and SoftPlus-only versions enhance model robustness, their
improvements are less significant than the combined version, validating our CT implementation. Reported values include means
and standard deviations over three runs.

(a) Swish only. Avg rel improve: 8.06%. Avg β: 0.99.


Attack ℓ2 ℓ∞ Corruption
Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
CIFAR-10 53.67 ± 0.32 53.67 ± 0.32 1.00 ± 0.00 11.17 ± 0.06 13.60 ± 0.26 0.99 ± 0.00 77.73 ± 0.00 77.77 ± 0.00 0.99 ± 0.00
CIFAR-100 24.30 ± 0.10 24.30 ± 0.10 1.00 ± 0.00 4.47 ± 0.06 6.37 ± 0.15 0.99 ± 0.00 51.81 ± 0.00 51.81 ± 0.00 1.00 ± 0.00
ImageNet 23.37 ± 0.06 23.37 ± 0.06 1.00 ± 0.00 0.00 ± 0.00 4.50 ± 0.30 0.99 ± 0.00 33.11 ± 0.00 33.13 ± 0.00 0.99 ± 0.00

(b) SoftPlus only. Avg rel improve: 9.91%. Avg β: 0.98.


Attack ℓ2 ℓ∞ Corruption
Dataset ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β) ReLU (%) + CT (%) (β)
CIFAR-10 53.67 ± 0.32 54.07 ± 0.06 0.98 ± 0.00 11.17 ± 0.06 14.47 ± 0.06 0.99 ± 0.00 77.73 ± 0.00 77.73 ± 0.00 1.00 ± 0.00
CIFAR-100 24.30 ± 0.10 24.93 ± 0.31 0.97 ± 0.00 4.47 ± 0.06 6.53 ± 0.15 0.99 ± 0.00 51.81 ± 0.00 51.81 ± 0.00 1.00 ± 0.00
ImageNet 23.37 ± 0.06 23.37 ± 0.06 1.00 ± 0.00 0.00 ± 0.00 6.60 ± 0.10 0.94 ± 0.01 33.11 ± 0.00 33.15 ± 0.00 0.99 ± 0.00
