AATN Merged
Illia Polosukhin∗ ‡
[email protected]
Abstract
∗ Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started
the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and
has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head
attention and the parameter-free position representation and became the other person involved in nearly every
detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and
tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and
efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and
implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating
our research.
† Work performed while at Google Brain.
‡ Work performed while at Google Research.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks
in particular, have been firmly established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous
efforts have since continued to push the boundaries of recurrent language models and encoder-decoder
architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
states ht , as a function of the previous hidden state ht−1 and the input for position t. This inherently
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficiency through factorization tricks [21] and conditional
computation [32], while also improving model performance in case of the latter. The fundamental
constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduc-
tion models in various tasks, allowing modeling of dependencies without regard to their distance in
the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms
are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead
relying entirely on an attention mechanism to draw global dependencies between input and output.
The Transformer allows for significantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on eight P100 GPUs.
2 Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU
[16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building
block, computing hidden representations in parallel for all input and output positions. In these models,
the number of operations required to relate signals from two arbitrary input or output positions grows
in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes
it more difficult to learn dependencies between distant positions [12]. In the Transformer this is
reduced to a constant number of operations, albeit at the cost of reduced effective resolution due
to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as
described in section 3.2.
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions
of a single sequence in order to compute a representation of the sequence. Self-attention has been
used successfully in a variety of tasks including reading comprehension, abstractive summarization,
textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-
aligned recurrence and have been shown to perform well on simple-language question answering and
language modeling tasks [34].
To the best of our knowledge, however, the Transformer is the first transduction model relying
entirely on self-attention to compute representations of its input and output without using sequence-
aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate
self-attention and discuss its advantages over models such as [17, 18] and [9].
3 Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35].
Here, the encoder maps an input sequence of symbol representations (x1 , ..., xn ) to a sequence
of continuous representations z = (z1 , ..., zn ). Given z, the decoder then generates an output
sequence (y1 , ..., ym ) of symbols one element at a time. At each step the model is auto-regressive
[10], consuming the previously generated symbols as additional input when generating the next.
Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [11] around each of
the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is
LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512.
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two
sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head
attention over the output of the encoder stack. Similar to the encoder, we employ residual connections
around each of the sub-layers, followed by layer normalization. We also modify the self-attention
sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This
masking, combined with the fact that the output embeddings are offset by one position, ensures that the
predictions for position i can depend only on the known outputs at positions less than i.
3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output,
where the query, keys, values, and output are all vectors. The output is computed as a weighted sum
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several
attention layers running in parallel.
of the values, where the weight assigned to each value is computed by a compatibility function of the
query with the corresponding key.
Attention(Q, K, V) = softmax(QK^T / √dk) V    (1)
The two most commonly used attention functions are additive attention [2], and dot-product (multi-
plicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor
of 1/√dk. Additive attention computes the compatibility function using a feed-forward network with
a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is
much faster and more space-efficient in practice, since it can be implemented using highly optimized
matrix multiplication code.
While for small values of dk the two mechanisms perform similarly, additive attention outperforms
dot product attention without scaling for larger values of dk [3]. We suspect that for large values of
dk, the dot products grow large in magnitude (if the components of q and k are independent random
variables with mean 0 and variance 1, their dot product q · k has mean 0 and variance dk), pushing the
softmax function into regions where it has extremely small gradients. To counteract this effect, we
scale the dot products by 1/√dk.
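For concreteness, a minimal NumPy sketch of Eq. (1); the optional additive mask anticipates the decoder masking described later, and all names here are illustrative rather than part of the original implementation.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V, mask=None):
        # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v); returns (n_q, d_v) as in Eq. (1)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        if mask is not None:
            scores = scores + mask              # additive mask: 0 for allowed, -inf for illegal positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
        return weights @ V

    # toy check: 3 queries, 5 keys, d_k = d_v = 4
    out = scaled_dot_product_attention(np.random.randn(3, 4), np.random.randn(5, 4), np.random.randn(5, 4))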
output values. These are concatenated and once again projected, resulting in the final values, as
depicted in Figure 2.
Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions. With a single attention head, averaging inhibits this.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,   where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

where the projections are parameter matrices W_i^Q ∈ R^(dmodel×dk), W_i^K ∈ R^(dmodel×dk), W_i^V ∈ R^(dmodel×dv)
and W^O ∈ R^(hdv×dmodel).
In this work we employ h = 8 parallel attention layers, or heads. For each of these we use
dk = dv = dmodel /h = 64. Due to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full dimensionality.
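The head splitting can be illustrated at the level of tensor shapes with a short sketch, reusing scaled_dot_product_attention from the sketch above; the random matrices below are placeholders for learned projections, and the sequence length is an arbitrary example value.

    import numpy as np

    d_model, h = 512, 8
    d_k = d_v = d_model // h                       # 64, as stated in the text
    n = 10                                         # hypothetical sequence length
    rng = np.random.default_rng(0)
    x = rng.standard_normal((n, d_model))

    W_Q = rng.standard_normal((h, d_model, d_k))   # one projection per head
    W_K = rng.standard_normal((h, d_model, d_k))
    W_V = rng.standard_normal((h, d_model, d_v))
    W_O = rng.standard_normal((h * d_v, d_model))

    heads = [scaled_dot_product_attention(x @ W_Q[i], x @ W_K[i], x @ W_V[i]) for i in range(h)]
    out = np.concatenate(heads, axis=-1) @ W_O     # concatenate the h heads and project back to d_model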
• In "encoder-decoder attention" layers, the queries come from the previous decoder layer,
and the memory keys and values come from the output of the encoder. This allows every
position in the decoder to attend over all positions in the input sequence. This mimics the
typical encoder-decoder attention mechanisms in sequence-to-sequence models such as
[38, 2, 9].
• The encoder contains self-attention layers. In a self-attention layer all of the keys, values
and queries come from the same place, in this case, the output of the previous layer in the
encoder. Each position in the encoder can attend to all positions in the previous layer of the
encoder.
• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to
all positions in the decoder up to and including that position. We need to prevent leftward
information flow in the decoder to preserve the auto-regressive property. We implement this
inside of scaled dot-product attention by masking out (setting to −∞) all values in the input
of the softmax which correspond to illegal connections. See Figure 2.
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully
connected feed-forward network, which is applied to each position separately and identically. This
consists of two linear transformations with a ReLU activation in between.
While the linear transformations are the same across different positions, they use different parameters
from layer to layer. Another way of describing this is as two convolutions with kernel size 1.
The dimensionality of input and output is dmodel = 512, and the inner-layer has dimensionality
df f = 2048.
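A minimal sketch of this position-wise feed-forward network, assuming the stated d_model = 512 and d_ff = 2048; biases are initialized to zero here purely for illustration.

    import numpy as np

    def position_wise_ffn(x, W1, b1, W2, b2):
        # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently
        return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

    d_model, d_ff = 512, 2048
    rng = np.random.default_rng(0)
    out = position_wise_ffn(rng.standard_normal((10, d_model)),
                            rng.standard_normal((d_model, d_ff)), np.zeros(d_ff),
                            rng.standard_normal((d_ff, d_model)), np.zeros(d_model))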
Similarly to other sequence transduction models, we use learned embeddings to convert the input
tokens and output tokens to vectors of dimension dmodel . We also use the usual learned linear transfor-
mation and softmax function to convert the decoder output to predicted next-token probabilities. In
our model, we share the same weight matrix between the two embedding layers and the pre-softmax
linear transformation, similar to [30]. In the embedding layers, we multiply those weights by √dmodel.
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
for different layer types. n is the sequence length, d is the representation dimension, k is the kernel
size of convolutions and r the size of the neighborhood in restricted self-attention.
Since our model contains no recurrence and no convolution, in order for the model to make use of the
order of the sequence, we must inject some information about the relative or absolute position of the
tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the
bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel
as the embeddings, so that the two can be summed. There are many choices of positional encodings,
learned and fixed [9].
In this work, we use sine and cosine functions of different frequencies:
PE(pos, 2i)   = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))
where pos is the position and i is the dimension. That is, each dimension of the positional encoding
corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We
chose this function because we hypothesized it would allow the model to easily learn to attend by
relative positions, since for any fixed offset k, P Epos+k can be represented as a linear function of
P Epos .
We also experimented with using learned positional embeddings [9] instead, and found that the two
versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version
because it may allow the model to extrapolate to sequence lengths longer than the ones encountered
during training.
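A minimal sketch of the sinusoidal encoding defined above (assuming an even d_model); the linear-offset property claimed earlier follows from the identities sin(a + b) = sin a cos b + cos a sin b and cos(a + b) = cos a cos b − sin a sin b applied at each frequency.

    import numpy as np

    def sinusoidal_positional_encoding(max_len, d_model):
        pos = np.arange(max_len)[:, None]                 # (max_len, 1)
        i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
        angles = pos / np.power(10000.0, i / d_model)     # pos / 10000^(2i/d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe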
4 Why Self-Attention
In this section we compare various aspects of self-attention layers to the recurrent and convolu-
tional layers commonly used for mapping one variable-length sequence of symbol representations
(x1 , ..., xn ) to another sequence of equal length (z1 , ..., zn ), with xi , zi ∈ Rd , such as a hidden
layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we
consider three desiderata.
One is the total computational complexity per layer. Another is the amount of computation that can
be parallelized, as measured by the minimum number of sequential operations required.
The third is the path length between long-range dependencies in the network. Learning long-range
dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the
ability to learn such dependencies is the length of the paths forward and backward signals have to
traverse in the network. The shorter these paths between any combination of positions in the input
and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare
the maximum path length between any two input and output positions in networks composed of the
different layer types.
As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially
executed operations, whereas a recurrent layer requires O(n) sequential operations. In terms of
computational complexity, self-attention layers are faster than recurrent layers when the sequence
length n is smaller than the representation dimensionality d, which is most often the case with
sentence representations used by state-of-the-art models in machine translations, such as word-piece
[38] and byte-pair [31] representations. To improve computational performance for tasks involving
very long sequences, self-attention could be restricted to considering only a neighborhood of size r in
the input sequence centered around the respective output position. This would increase the maximum
path length to O(n/r). We plan to investigate this approach further in future work.
A single convolutional layer with kernel width k < n does not connect all pairs of input and output
positions. Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels,
or O(logk (n)) in the case of dilated convolutions [18], increasing the length of the longest paths
between any two positions in the network. Convolutional layers are generally more expensive than
recurrent layers, by a factor of k. Separable convolutions [6], however, decrease the complexity
considerably, to O(k · n · d + n · d2 ). Even with k = n, however, the complexity of a separable
convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer,
the approach we take in our model.
As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions
from our models and present and discuss examples in the appendix. Not only do individual attention
heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic
and semantic structure of the sentences.
5 Training
We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million
sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-
target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT
2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece
vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training
batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000
target tokens.
We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using
the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We
trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the
bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps
(3.5 days).
5.3 Optimizer
We used the Adam optimizer [20] with β1 = 0.9, β2 = 0.98 and ϵ = 10−9 . We varied the learning
rate over the course of training, according to the formula:
lrate = dmodel^(−0.5) · min(step_num^(−0.5), step_num · warmup_steps^(−1.5))    (3)
This corresponds to increasing the learning rate linearly for the first warmup_steps training steps,
and decreasing it thereafter proportionally to the inverse square root of the step number. We used
warmup_steps = 4000.
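A one-line sketch of Eq. (3) with the stated defaults; step_num is assumed to start at 1.

    def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
        # linear warmup for warmup_steps steps, then decay proportional to 1/sqrt(step_num)
        assert step_num >= 1
        return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)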
5.4 Regularization
Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the
English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.
Model                              BLEU EN-DE   BLEU EN-FR   Training Cost EN-DE (FLOPs)   Training Cost EN-FR (FLOPs)
ByteNet [18]                       23.75
Deep-Att + PosUnk [39]                          39.2                                       1.0 · 10^20
GNMT + RL [38]                     24.6         39.92        2.3 · 10^19                   1.4 · 10^20
ConvS2S [9]                        25.16        40.46        9.6 · 10^18                   1.5 · 10^20
MoE [32]                           26.03        40.56        2.0 · 10^19                   1.2 · 10^20
Deep-Att + PosUnk Ensemble [39]                 40.4                                       8.0 · 10^20
GNMT + RL Ensemble [38]            26.30        41.16        1.8 · 10^20                   1.1 · 10^21
ConvS2S Ensemble [9]               26.36        41.29        7.7 · 10^19                   1.2 · 10^21
Transformer (base model)           27.3         38.1         3.3 · 10^18
Transformer (big)                  28.4         41.8         2.3 · 10^19
Residual Dropout We apply dropout [33] to the output of each sub-layer, before it is added to the
sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the
positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of
Pdrop = 0.1.
Label Smoothing During training, we employed label smoothing of value ϵls = 0.1 [36]. This
hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
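A minimal sketch of one common label-smoothing formulation (the true token gets probability 1 − ϵ and the remainder is spread uniformly over the other tokens); the paper specifies only ϵls = 0.1, so the exact smoothing distribution below is an assumption.

    import numpy as np

    def smoothed_targets(labels, vocab_size, eps=0.1):
        # (1 - eps) on the reference token, eps spread uniformly over the remaining tokens
        t = np.full((len(labels), vocab_size), eps / (vocab_size - 1))
        t[np.arange(len(labels)), labels] = 1.0 - eps
        return t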
6 Results
On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big)
in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0
BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is
listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model
surpasses all previously published models and ensembles, at a fraction of the training cost of any of
the competitive models.
On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0,
outperforming all of the previously published single models, at less than 1/4 the training cost of the
previous state-of-the-art model. The Transformer (big) model trained for English-to-French used
dropout rate Pdrop = 0.1, instead of 0.3.
For the base models, we used a single model obtained by averaging the last 5 checkpoints, which
were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We
used beam search with a beam size of 4 and length penalty α = 0.6 [38]. These hyperparameters
were chosen after experimentation on the development set. We set the maximum output length during
inference to input length + 50, but terminate early when possible [38].
Table 2 summarizes our results and compares our translation quality and training costs to other model
architectures from the literature. We estimate the number of floating point operations used to train a
model by multiplying the training time, the number of GPUs used, and an estimate of the sustained
single-precision floating-point capacity of each GPU 5 .
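As an illustration of this estimate, using the P100 figure of 9.5 TFLOPS from the footnote and the base-model schedule of roughly 12 hours on 8 GPUs: 12 × 3600 s × 8 × 9.5 × 10^12 FLOPS ≈ 3.3 × 10^18 FLOPs, which matches the base-model entry in Table 2.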
To evaluate the importance of different components of the Transformer, we varied our base model
in different ways, measuring the change in performance on English-to-German translation on the
5 We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.
Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base
model. All metrics are on the English-to-German translation development set, newstest2013. Listed
perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to
per-word perplexities.
development set, newstest2013. We used beam search as described in the previous section, but no
checkpoint averaging. We present these results in Table 3.
In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions,
keeping the amount of computation constant, as described in Section 3.2.2. While single-head
attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
In Table 3 rows (B), we observe that reducing the attention key size dk hurts model quality. This
suggests that determining compatibility is not easy and that a more sophisticated compatibility
function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected,
bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our
sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical
results to the base model.
To evaluate if the Transformer can generalize to other tasks we performed experiments on English
constituency parsing. This task presents specific challenges: the output is subject to strong structural
constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence
models have not been able to attain state-of-the-art results in small-data regimes [37].
We trained a 4-layer transformer with dmodel = 1024 on the Wall Street Journal (WSJ) portion of the
Penn Treebank [25], about 40K training sentences. We also trained it in a semi-supervised setting,
using the larger high-confidence and BerkeleyParser corpora with approximately 17M sentences
[37]. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens
for the semi-supervised setting.
We performed only a small number of experiments to select the dropout, both attention and residual
(section 5.4), learning rates and beam size on the Section 22 development set, all other parameters
remained unchanged from the English-to-German base translation model. During inference, we
Table 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23
of WSJ)
Parser Training WSJ 23 F1
Vinyals & Kaiser et al. (2014) [37] WSJ only, discriminative 88.3
Petrov et al. (2006) [29] WSJ only, discriminative 90.4
Zhu et al. (2013) [40] WSJ only, discriminative 90.4
Dyer et al. (2016) [8] WSJ only, discriminative 91.7
Transformer (4 layers) WSJ only, discriminative 91.3
Zhu et al. (2013) [40] semi-supervised 91.3
Huang & Harper (2009) [14] semi-supervised 91.3
McClosky et al. (2006) [26] semi-supervised 92.1
Vinyals & Kaiser et al. (2014) [37] semi-supervised 92.1
Transformer (4 layers) semi-supervised 92.7
Luong et al. (2015) [23] multi-task 93.0
Dyer et al. (2016) [8] generative 93.3
increased the maximum output length to input length + 300. We used a beam size of 21 and α = 0.3
for both WSJ only and the semi-supervised setting.
Our results in Table 4 show that despite the lack of task-specific tuning our model performs sur-
prisingly well, yielding better results than all previously reported models with the exception of the
Recurrent Neural Network Grammar [8].
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-
Parser [29] even when training only on the WSJ training set of 40K sentences.
7 Conclusion
In this work, we presented the Transformer, the first sequence transduction model based entirely on
attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with
multi-headed self-attention.
For translation tasks, the Transformer can be trained significantly faster than architectures based
on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014
English-to-French translation tasks, we achieve a new state of the art. In the former task our best
model outperforms even all previously reported ensembles.
We are excited about the future of attention-based models and plan to apply them to other tasks. We
plan to extend the Transformer to problems involving input and output modalities other than text and
to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs
such as images, audio and video. Making generation less sequential is another research goal of ours.
The code we used to train and evaluate our models is available at https://github.com/
tensorflow/tensor2tensor.
Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful
comments, corrections and inspiration.
References
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. CoRR, abs/1409.0473, 2014.
[3] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural
machine translation architectures. CoRR, abs/1703.03906, 2017.
[4] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine
reading. arXiv preprint arXiv:1601.06733, 2016.
[5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk,
and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical
machine translation. CoRR, abs/1406.1078, 2014.
[6] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv
preprint arXiv:1610.02357, 2016.
[7] Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation
of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
[8] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural
network grammars. In Proc. of NAACL, 2016.
[9] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolu-
tional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.
[10] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint
arXiv:1308.0850, 2013.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im-
age recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 770–778, 2016.
[12] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in
recurrent nets: the difficulty of learning long-term dependencies, 2001.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[14] Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations
across languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural
Language Processing, pages 832–841. ACL, August 2009.
[15] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring
the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
[16] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural
Information Processing Systems, (NIPS), 2016.
[17] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference
on Learning Representations (ICLR), 2016.
[18] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Ko-
ray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2,
2017.
[19] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks.
In International Conference on Learning Representations, 2017.
[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[21] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint
arXiv:1703.10722, 2017.
[22] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen
Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint
arXiv:1703.03130, 2017.
[23] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task
sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015.
[24] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-
based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[25] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated
corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
[26] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference,
pages 152–159. ACL, June 2006.
[27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention
model. In Empirical Methods in Natural Language Processing, 2016.
[28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive
summarization. arXiv preprint arXiv:1705.04304, 2017.
[29] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact,
and interpretable tree annotation. In Proceedings of the 21st International Conference on
Computational Linguistics and 44th Annual Meeting of the ACL, pages 433–440. ACL, July
2006.
[30] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv
preprint arXiv:1608.05859, 2016.
[31] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words
with subword units. arXiv preprint arXiv:1508.07909, 2015.
[32] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton,
and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts
layer. arXiv preprint arXiv:1701.06538, 2017.
[33] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdi-
nov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine
Learning Research, 15(1):1929–1958, 2014.
[34] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory
networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors,
Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates,
Inc., 2015.
[35] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. Sequence to sequence learning with neural
networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna.
Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
[37] Vinyals & Kaiser, Koo, Petrov, Sutskever, and Hinton. Grammar as a foreign language. In
Advances in Neural Information Processing Systems, 2015.
[38] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine
translation system: Bridging the gap between human and machine translation. arXiv preprint
arXiv:1609.08144, 2016.
[39] Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with
fast-forward connections for neural machine translation. CoRR, abs/1606.04199, 2016.
[40] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. Fast and accurate
shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the ACL (Volume
1: Long Papers), pages 434–443. ACL, August 2013.
Attention Visualizations

[Figure 3 visualizes encoder self-attention weights for the input sentence "It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or voting process more difficult."]
Figure 3: An example of the attention mechanism following long-distance dependencies in the
encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of
the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for
the word ‘making’. Different colors represent different heads. Best viewed in color.
[Figures 4 and 5 show attention-weight visualizations over the sentence "The Law will never be perfect, but its application should be just - this is what we are missing, in my opinion."]

Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word 'its' for attention heads 5 and 6. Note that the attentions are very sharp for this word.

Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention.
Focus Your Attention (with Adaptive IIR Filters)
Shahar Lutati Itamar Zimerman Lior Wolf
The School of Computer Science
Tel Aviv University
[email protected]
[email protected]
[email protected]
the input sequence prior to applying conventional attention. The input is split into chunks, and the coefficients of these filters are determined based on previous chunks to maintain causality. Despite their relatively low order, the causal adaptive filters are shown to focus attention on the relevant sequence elements. The new layer is grounded in control theory, and is shown to generalize diagonal state-space layers. The layer performs on-par with state-of-the-art networks, with a fraction of their parameters and with time complexity that is sub-quadratic with input size. The obtained layer is favorable to layers such as Heyna, GPT2, and Mega, both with respect to the number of parameters and the obtained level of performance on multiple long-range sequence problems.

1 Introduction

Designing sequence models that capture short- and long-term dependencies is a central goal of sequence modeling. Besides performance, computational complexity also plays a part when dealing with long sequences. Although transformers (Vaswani et al., 2017) excel at tasks that involve short-range dependencies, their performance on data with long-range dependencies can be poor. For example, regardless of the high time complexity, on the Long Range Arena benchmark (LRA) (Tay et al., 2020) transformers perform poorly compared to other sequence models.

Another approach that has emerged to address long-range data processing is the utilization of regularized implicit global (long) convolutions. In this technique, convolutions are employed along the sequence dimension, enabling convolutions with a global receptive field. Initially, this approach was implemented through state-space layers (Gu et al., 2021b,a), which introduced a recurrent layer [...] parameterization (Fu et al., 2023; Li et al., 2022). These methods have demonstrated improved performance on tasks involving long-range dependencies and with sub-quadratic complexity. They have also shown effectiveness in enhancing long-range transformer capabilities (Ma et al., 2022; Zuo et al., 2022; Saon et al., 2023). However, it remains uncertain whether these models can scale up or function similarly to transformers across a diverse range of tasks (Vardasbi et al., 2023).

This work strives to efficiently integrate convolution-based sequence models and transformers, to provide a model that is capable of handling both short and long dependencies. The attempt to combine these components was first presented by Ma et al. (2022), who used a simple global convolution before each transformer block. This convolution is parameterized by the Exponential Moving Average (EMA) recurrent rule and can be seen as an IIR filter. In this work, instead of using first-order IIR filters, we introduce learnable adaptive IIR filters, which allow us to propose Focus, a layer that combines local attention and a novel type of regularized global convolution grounded on a hypernetwork that produces adaptive IIR filters.

Our main contribution is the focus layer, which has several unique properties: (i) We are the first to use data-dependent global filters, which are implemented by a global hyper-network mechanism that focuses local attention. (ii) In contrast to other works in the domain that employ FIR filters, it relies on IIR filters. (iii) We present an efficient and stable computation of those IIR filters. (iv) Theoretically, our layer is grounded in the theory of control systems, similar to state-space layers, which are built on the state-space model (SSM) of control theory. Furthermore, in Sec. 4 we show that IIR filters are a generalization of SSMs and diagonal-linear RNNs, which have recently been recognized as remarkable long-range learning architectures (Gupta et al., 2022a; Gu et al., 2022; Orvieto et al., 2023; Gupta et al., 2022b; Saon et al., 2023; David et al., 2023). By drawing upon the extensive research conducted on IIR filters, our findings can provide additional insights into the effectiveness, stability, expressiveness, and initialization of those layers.

2 Background and related work

IIR filters, known as infinite impulse response filters, are digital filters that utilize feedback to generate an output signal. Their primary applications involve signal smoothing, filtering, and signal modification. These filters are extensively employed in various fields, such as audio processing, speech processing, and image processing. One notable advantage of IIR filters is their ability to achieve a significantly sharper roll-off in the transition region compared to an FIR filter of the same order. This is made possible by the presence of complex poles in the IIR filters, which enable them to attenuate frequencies more rapidly.

The state-space representation of an IIR filter is a convenient way to represent the filter's dynamics and to implement it in software. It consists of three parts: the state vector, the state transition matrix, and the output matrix. The state vector contains the filter's internal state variables, the state transition matrix describes how the state vector changes over time, and the output matrix describes how the output signal is computed from the state vector. Such representations are described in (Zhang et al., 2023).

2.1 Learnable IIR Filters

Since IIR filters are computationally efficient, yet expressive, it is natural to design IIR filters with Deep Learning. (Kuznetsov et al., 2020) proposes an approach to using traditional digital IIR filter structures inside deep-learning networks trained using backpropagation. The authors establish the link between such structures and recurrent neural networks and present three different differentiable IIR filter topologies. They compare the proposed topologies against each other and an established baseline and show that the proposed topologies can achieve better performance in some cases. Additionally, the authors present a simple Wiener-Hammerstein model, using differentiable IIRs as its filtering component, and train it on a guitar signal.

2.2 Global Convolutions

The global convolution, also known as a long convolution, is a layer that applies scalar convolutions along the sequence dimension, enabling the handling of unrestricted 1-D sequences. Empirically, these layers have shown strong performance in tasks involving long-range dependencies, particularly in domains such as NLP (Dao et al., 2022b; Mehta et al., 2022; Wang et al., 2022), audio (Goel et al., 2022), speech (Saon et al., 2023), video (Islam et al., 2022; Wang et al., 2023), time-series analysis (Zhang et al., 2023) and more. Moreover, they exhibit computational efficiency as their cost is sub-quadratic. However, to achieve SOTA results, appropriate regularization is necessary. The approach of (Gu et al., 2021b,a; Ma et al., 2022; Li et al., 2022) incorporates a parameterization that inherently regularizes the kernel and decouples sequence length from parameter count. (Romero et al., 2021; Poli et al., 2023) utilizes an implicit parameterization learned by FFNs operating on positional encodings, while (Fu et al., 2023) explicitly regularizes the convolution kernels using squash or smooth operators.

2.3 Long Range Transformers

Transformers (Vaswani et al., 2017) have emerged as highly effective models for various tasks, but their widespread adoption has been limited by the quadratic cost of the self-attention mechanism and poor performance on long-range tasks. Researchers have pursued diverse approaches to overcome this challenge and to create efficient transformer architectures (Fournier et al., 2021; Tay et al., 2022). From the perspective of efficiency, techniques such as sparse attention (Child et al., 2019), low-rank attention (Wang et al., 2020; Winata et al., 2020), kernel-based attention (Choromanski et al., 2020), recurrent mechanisms (Hutchins et al., 2022; Dai et al., 2019), and efficient IO-awareness-based implementation (Dao et al., 2022a) proved efficient. From the perspective of effectiveness, (Yu et al., 2023; Ivgi et al., 2023) combines local and global attention models hierarchically, enhancing the model's ability to handle extensive context. Other techniques employ global memory-based Attention (Gupta and Berant, 2020; Al Adel, 2022; Burtsev), and (Zhou et al., 2022) applies attention in the frequency domain to expand long-range capabilities.

2.4 Hyper Networks

A hypernetwork (Ha et al., 2016) is a function that maps a set of inputs to a set of weights, which are used as the parameters of a "primary network". Hypernetworks have been shown to be effective for a variety of tasks, including, for example, image classification (Lutati and Wolf, 2023), natural language processing (He et al., 2022), and speech recognition (Szatkowski et al., 2022). They have also been shown to be able to improve the performance of neural networks on meta-learning tasks, such as few-shot learning (Bertinetto et al., 2016), continual learning (Von Oswald et al., 2019), and neural architecture search (Zhang et al., 2019).

2.5 Adaptive Filtering

Adaptive filtering is a technique used to improve the quality of a signal by removing noise or interference. Adaptive filters are able to adapt to changes in the signal or the environment, making them well-suited for a variety of applications.

One common technique used in adaptive filtering is the short-time Fourier transform (STFT), which provides a time-frequency representation of a signal. It enables the analysis of time-varying properties of a signal by dividing it into short-time windows and applying the Fourier transform to each window. The STFT reveals the distribution of frequency content over time, which allows the adaptive filter to track the frequency content of the signal and adapt its coefficients accordingly. However, the STFT introduces a non-causal implementation due to overlapping time-bins. To mitigate this, we introduce chunked-FFT, a degenerate form of the STFT.

Recent research in AI has focused on using deep learning to improve the performance of adaptive filters. For example, deep learning has been used to improve the performance of adaptive filters for noise cancellation (Zhang and Wang, 2021), echo cancellation (Haubner and Kellermann, 2022), and equalization (Zhou et al., 2020). Deep learning has also been used to develop new adaptive filter architectures that are more robust to noise and interference (Alwan and Hussain, 2022). Revach et al. (2022) demonstrate how deep learning can be used to improve the performance of Kalman filtering (Kalman, 1960), a classical control algorithm.

3 Method

3.1 Overview

We start by discussing the main design choices of our architecture.

Chunking and the combination of local and global models  Given the quadratic complexity of transformers, chunking is a common practice for computing short-range attention efficiently. However, despite excelling in short-range tasks, full-length transformers often struggle to handle long-range dependencies and often perform comparably to local-attention-based transformers (Xiong et al., 2021). Recent studies have demonstrated that a combination of local and global transformers can achieve state-of-the-art performance on long-range tasks (Ivgi et al., 2023; Yu et al., 2023; Hutchins et al., 2022). Inspired by these findings, we introduce local attention as the local model, which is combined with a novel type of global convolution as the global model. Furthermore, in contrast to (Hutchins et al., 2022; Bulatov et al., 2023), our global model does not use recurrent computations, since it severely restricts parallelization.

Adaptive IIR Filters  In MEGA (Ma et al., 2022), it was demonstrated that incorporating an EMA at the beginning of each transformer block improves transformer performance in long-range tasks. EMA can be viewed as a convolution operation using simple first-order IIR filters. Motivated by this finding, we adopt a more versatile and expressive convolution approach that utilizes adaptive filters generated by a hypernetwork. Since the hypernetwork is an integral part of our global model, it employs global convolutions. Specifically, the regularized global convolution of (Fu et al., 2023) is used, as the most straightforward option. A common challenge with hypernetworks is ensuring relatively small output sizes. In this regard, leveraging IIR filters, which have only a few parameters, is a reasonable choice.
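To make the EMA-as-first-order-IIR view concrete, here is a minimal sketch of an exponential moving average written as an IIR recurrence on one channel; the decay value is an arbitrary placeholder rather than a MEGA hyperparameter.

    import numpy as np

    def ema_first_order_iir(x, alpha=0.9):
        # EMA as a first-order IIR filter: y[t] = alpha * y[t-1] + (1 - alpha) * x[t]
        y = np.zeros(len(x))
        for t in range(len(x)):
            y[t] = alpha * (y[t - 1] if t > 0 else 0.0) + (1.0 - alpha) * x[t]
        return y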
3.2 The Focus Layer

In this section, we describe the focus attention head, our primary contribution. This head is integrated into the MEGA backbone, as visualized in Fig. 1. Let x be the input for the focus layer, where x ∈ R^(L×D), L is the sequence length, and D is the input's dimension. Our method, termed Focus, utilizes the foundations of adaptive filtering theory to cope with very long stochastic sequences.
Figure 1: Focus Architecture: (a) The architecture of a single head. (b) The obtained layer. (c) The entire model.
The architecture of the model and layer are defined similarly to MEGA (Ma et al., 2022). Blocks in blue are not
learned, while blocks in red are learned parameters. S2P (serial to parallel) and P 2S (parallel to serial) are the
chunking and the de-chunking operations, respectively.
Given the seasonality of the sequence, the resolution of the FFT is determined by its size in each time-bin. Denote the size of the FFT in a single time-bin as NFFT.

The first component of the Focus layer is the hypernetwork, H. The output of H is Θ, which is the set of IIR kernels used for the forward processing of the sequence. Θ has a dimension of Nbins × D × F × 2, denoting F kernels, each with a kernel size of two for D feature channels. The kernel is unique per time-bin, Nbins, which makes the filter adaptive to changes over time.

Θ = H(x)    (1)

H has two main components. The first is a shallow global convolution (Fu et al., 2023) based sub-model that is followed by adaptive max pooling (over each feature channel) (Pytorch) with a size of O × Nbins, where O is the oversampling factor.

e = MaxPool(GlobalConv(x))    (2)

[...] reducing substantially the computational cost. Furthermore, the embedding is permuted such that the feature space has the size O while Nbins is added to the batch dimension for parallel computing.

The second component of H is a 2-layer MLP with sigmoid activations that maps the embedding e to a tensor with size Nbins × D × F × 2. With mapping of latent dimension O to 2 · F,

Θ = MLP(e) ,    (3)

where Θ is the IIR kernel, with size Nbins × D × F × 2. MLP is the forward MLP mapping, as described above. Since H is a hypernetwork, the initialization of the last layer of the MLPs follows (Chang et al., 2020). The rest of the layers follow the Xavier initialization (Glorot and Bengio, 2010). The input x is split into non-overlapping time bins, where each time bin is passed through the FFT of the size NFFT. Denote the input in the r-th time bin as x_r.
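The following PyTorch sketch mirrors Eqs. (1)-(3) at the level of tensor shapes only: a short 1-D convolution stands in for the regularized global convolution of (Fu et al., 2023), and the layer sizes (kernel size 7, the 2·F hidden width) are illustrative assumptions rather than the paper's settings.

    import torch
    import torch.nn as nn

    class FocusHypernetSketch(nn.Module):
        # H: global conv -> adaptive max pool (O values per bin) -> 2-layer MLP -> IIR coefficients Theta
        def __init__(self, D, F, n_bins, O):
            super().__init__()
            self.D, self.F, self.n_bins, self.O = D, F, n_bins, O
            self.conv = nn.Conv1d(D, D, kernel_size=7, padding=3)   # placeholder for the global convolution
            self.pool = nn.AdaptiveMaxPool1d(O * n_bins)            # size O x Nbins per feature channel
            self.mlp = nn.Sequential(nn.Linear(O, 2 * F), nn.Sigmoid(), nn.Linear(2 * F, 2 * F))

        def forward(self, x):                                       # x: (batch, L, D)
            e = self.pool(self.conv(x.transpose(1, 2)))             # (batch, D, O * Nbins), Eq. (2)
            e = e.view(x.shape[0], self.D, self.n_bins, self.O)     # bins alongside channels, features of size O
            theta = self.mlp(e)                                     # (batch, D, Nbins, 2F), Eq. (3)
            return theta.permute(0, 2, 1, 3).reshape(x.shape[0], self.n_bins, self.D, self.F, 2)

    theta = FocusHypernetSketch(D=32, F=4, n_bins=8, O=16)(torch.randn(2, 1024, 32))
    print(theta.shape)                                              # torch.Size([2, 8, 32, 4, 2]), i.e. Nbins x D x F x 2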
This can be achieved if b is positive. In the scenario where ∆ ≤ 0, the exponents have an imaginary part, causing it to oscillate. This is called an under-damped response. This response is stable, [...]

Next, the time complexity of the attention head depends on the size of the context length M,

Atten ≈ O(CM²D + CMD²)    (26)

The total time complexity of the Focus layer is, therefore,

Focus ≈ O(L log(L) · D + CM²D)    (27)

where we neglected smaller terms when the sequence length is large (greater than dimensions). Recalling that L = MC, and rearranging terms, we have,

Focus ≈ O(DL · (log(L) + M))    (28)

obtaining sub-quadratic time complexity with respect to input sequence length. A visual comparison of overall complexity versus the standard attention head is depicted in Fig. 3.

Figure 3: Time Complexity of the Focus layer and of Attention, increasing sequence length.

Expressiveness  An emerging class of diagonal linear RNNs (Orvieto et al., 2023; Gupta et al., 2022b) recently achieved near SOTA results in several long-range tasks. They include complex and real variants, as well as diagonal state-space layers (Gupta et al., 2022a; Gu et al., 2022). The following recurrent rule describes each channel of those layers:

s[t] = A s[t − 1] + B x[t],   y[t] = C s[t] + D x[t]    (29)

where s[t] is the recurrent state at time t. By isolating s[t − 1], we can rewrite Eq. 29 as follows:

s[t − 1] = (1/C) y[t − 1] − (D/C) x[t − 1]    (30)

y[t] = CA s[t − 1] + (CB + D) x[t] = A y[t − 1] + (CB + D) x[t] − AD x[t − 1]    (31)

Recall that the differential equation of an IIR filter of order 2 can be represented as follows:

y[t] = b0 x[t] + b1 x[t − 1] + b2 x[t − 2] − a1 y[t − 1] − a2 y[t − 2]    (32)

By substituting the values b0 = CB + D, b1 = −AD, a1 = −A, b2 = a2 = 0, it becomes evident that the IIR filter can be constrained to a linear SSM. In machine learning, D is often omitted in SSMs or diagonal RNNs, since it can be seen as a parameter-based skip-connection. In this case, the SSM can be represented by an IIR filter of order 1.

As mentioned earlier, higher-order filters can introduce stability issues. Therefore, our decision to utilize IIR filters of order 2 is justified, as we opt for the most expressive IIR filters that still maintain stability during training.
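The SSM-to-IIR reduction above can be checked numerically; below is a minimal sketch for one scalar channel, with arbitrary illustrative coefficients.

    import numpy as np

    def iir2(x, b0, b1, b2, a1, a2):
        # order-2 IIR: y[t] = b0*x[t] + b1*x[t-1] + b2*x[t-2] - a1*y[t-1] - a2*y[t-2], as in Eq. (32)
        y = np.zeros(len(x))
        for t in range(len(x)):
            y[t] = b0 * x[t]
            if t >= 1:
                y[t] += b1 * x[t - 1] - a1 * y[t - 1]
            if t >= 2:
                y[t] += b2 * x[t - 2] - a2 * y[t - 2]
        return y

    def diagonal_ssm(x, A, B, C, D):
        # one scalar channel of Eq. (29): s[t] = A*s[t-1] + B*x[t], y[t] = C*s[t] + D*x[t]
        s, y = 0.0, np.zeros(len(x))
        for t in range(len(x)):
            s = A * s + B * x[t]
            y[t] = C * s + D * x[t]
        return y

    A, B, C, D = 0.8, 0.5, 1.3, 0.2                       # arbitrary values for the check
    x = np.random.default_rng(0).standard_normal(64)
    assert np.allclose(diagonal_ssm(x, A, B, C, D),
                       iir2(x, b0=C * B + D, b1=-A * D, b2=0.0, a1=-A, a2=0.0))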
5 Experiments

Below we present experimental results for the proposed Focus layer. In addition to our full method, we introduce an ablation to evaluate the importance of adaptive filtering, in which instead of the hypernetwork H, the IIR filters are conventional learned parameters. This ablation is denoted by "Focus-H".

5.1 In-context learning

In order to evaluate our method relative to other state-of-the-art long-range architectures, such as (Poli et al., 2023), (Dai et al., 2019), the associative recall synthetic task is evaluated. The associative recall task was first introduced in (Elhage et al., 2021) and is part of a number of simple yet informative tasks that test the capabilities of the model in processing long-range sequences.

In the associative recall task, each string is formed by concatenating key-value pairs sampled randomly from a dictionary. The model should output the correct value given a singular key, regardless of whether the key is in the long sequence. Similarly to Poli et al. (2023), we employ the associative recall task in order to explore the memory capabilities of our model.

In all synthetic data experiments the same shared hyperparameters are used, with the exception of the sequence length. The hyperparameters are depicted in Appendix A. The AdamW optimizer (Loshchilov and Hutter, 2017) is used.

As can be seen in Tab. 1, our model is able to obtain an accuracy of 100% for all sequence lengths, without overfitting, despite the low number of examples (2000), and with no memory explosion thanks to linear scaling with input size. These results show that the Focus mechanism is able to improve the performance of regular transformers
Table 1: Test accuracy (%) for associative recall on long sequences of length L and a vocabulary size of 30. NF -
not feasible to test. NR = not reported.
3 Northeastern University
4 The NSF Institute for Artificial Intelligence and Fundamental Interactions
Abstract
Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-
Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs).
While MLPs have fixed activation functions on nodes (“neurons”), KANs have learnable
activation functions on edges (“weights”). KANs have no linear weights at all – every
weight parameter is replaced by a univariate function parametrized as a spline. We show
that this seemingly simple change makes KANs outperform MLPs in terms of accuracy
and interpretability, on small-scale AI + Science tasks. For accuracy, smaller KANs can
achieve comparable or better accuracy than larger MLPs in function fitting tasks. Theo-
retically and empirically, KANs possess faster neural scaling laws than MLPs. For inter-
pretability, KANs can be intuitively visualized and can easily interact with human users.
Through two examples in mathematics and physics, KANs are shown to be useful “collabo-
rators” helping scientists (re)discover mathematical and physical laws. In summary, KANs
are promising alternatives for MLPs, opening opportunities for further improving today’s
deep learning models which rely heavily on MLPs.
Formula (shallow), MLP:  f(x) ≈ Σ_{i=1}^{N(ϵ)} a_i σ(w_i · x + b_i)
Formula (shallow), KAN:  f(x) = Σ_{q=1}^{2n+1} Φ_q ( Σ_{p=1}^{n} ϕ_{q,p}(x_p) )
Formula (deep), MLP:     MLP(x) = (W_3 ∘ σ_2 ∘ W_2 ∘ σ_1 ∘ W_1)(x)
Formula (deep), KAN:     KAN(x) = (Φ_3 ∘ Φ_2 ∘ Φ_1)(x)
Figure 2.1: Our proposed Kolmogorov-Arnold networks are in honor of two great late mathematicians, Andrey
Kolmogorov and Vladimir Arnold. KANs are mathematically sound, accurate and interpretable.
splines would fail for large N due to COD; MLPs can potentially learn the generalized additive
structure, but they are very inefficient for approximating the exponential and sine functions with say,
ReLU activations. In contrast, KANs can learn both the compositional structure and the univariate
functions quite well, hence outperforming MLPs by a large margin (see Figure 3.1).
Throughout this paper, we will use extensive numerical experiments to show that KANs can lead to
accuracy and interpretability improvement over MLPs, at least on small-scale AI + Science tasks.
The organization of the paper is illustrated in Figure 2.1. In Section 2, we introduce the KAN archi-
tecture and its mathematical foundation, introduce network simplification techniques to make KANs
interpretable, and introduce a grid extension technique to make KANs more accurate. In Section 3,
we show that KANs are more accurate than MLPs for data fitting: KANs can beat the curse of dimen-
sionality when there is a compositional structure in data, achieving better scaling laws than MLPs.
We also demonstrate the potential of KANs in PDE solving via a simple example of the Poisson
equation. In Section 4, we show that KANs are interpretable and can be used for scientific discov-
eries. We use two examples from mathematics (knot theory) and physics (Anderson localization) to
demonstrate that KANs can be helpful “collaborators” for scientists to (re)discover math and phys-
ical laws. Section 5 summarizes related works. In Section 6, we conclude by discussing broad im-
pacts and future directions. Codes are available at https://github.com/KindXiaoming/pykan
and can also be installed via pip install pykan.
Figure 2.2: Left: Notations of activations that flow through the network. Right: an activation function is
parameterized as a B-spline, which allows switching between coarse-grained and fine-grained grids.
single variable and the binary operation of addition. More specifically, for a smooth f : [0, 1]^n → R,

f(x) = f(x_1, · · · , x_n) = \sum_{q=1}^{2n+1} Φ_q\left( \sum_{p=1}^{n} ϕ_{q,p}(x_p) \right),    (2.1)
where ϕq,p : [0, 1] → R and Φq : R → R. In a sense, they showed that the only true multivariate
function is addition, since every other function can be written using univariate functions and sum.
One might naively consider this great news for machine learning: learning a high-dimensional func-
tion boils down to learning a polynomial number of 1D functions. However, these 1D functions can
be non-smooth and even fractal, so they may not be learnable in practice [19, 20]. Because of this
pathological behavior, the Kolmogorov-Arnold representation theorem was basically sentenced to
death in machine learning, regarded as theoretically sound but practically useless [19, 20].
However, we are more optimistic about the usefulness of the Kolmogorov-Arnold theorem for ma-
chine learning. First of all, we need not stick to the original Eq. (2.1) which has only two-layer non-
linearities and a small number of terms (2n + 1) in the hidden layer: we will generalize the network
to arbitrary widths and depths. Secondly, most functions in science and daily life are often smooth
and have sparse compositional structures, potentially facilitating smooth Kolmogorov-Arnold rep-
resentations. The philosophy here is close to the mindset of physicists, who often care more about
typical cases rather than worst cases. After all, our physical world and machine learning tasks must
have structures to make physics and machine learning useful or generalizable at all [21].
As mentioned, such a network is known to be too simple to approximate any function arbitrarily
well in practice with smooth splines! We therefore generalize our KAN to be wider and deeper.
It is not immediately clear how to make KANs deeper, since Kolmogorov-Arnold representations
correspond to two-layer KANs. To the best of our knowledge, there is not yet a “generalized”
version of the theorem that corresponds to deeper KANs.
The breakthrough occurs when we notice the analogy between MLPs and KANs. In MLPs, once we
define a layer (which is composed of a linear transformation and nonlinearities), we can stack more
layers to make the network deeper. To build deep KANs, we should first answer: “what is a KAN
layer?” It turns out that a KAN layer with nin -dimensional inputs and nout -dimensional outputs can
be defined as a matrix of 1D functions

Φ = {ϕ_{q,p}},    p = 1, 2, · · · , n_in,    q = 1, 2, · · · , n_out,    (2.2)

where the functions ϕ_{q,p} have trainable parameters, as detailed below. In the Kolmogorov-Arnold theo-
rem, the inner functions form a KAN layer with nin = n and nout = 2n + 1, and the outer functions
form a KAN layer with nin = 2n + 1 and nout = 1. So the Kolmogorov-Arnold representations in
Eq. (2.1) are simply compositions of two KAN layers. Now it becomes clear what it means to have
deeper Kolmogorov-Arnold representations: simply stack more KAN layers!
Let us introduce some notation. This paragraph will be a bit technical, but readers can refer to Fig-
ure 2.2 (left) for a concrete example and intuitive understanding. The shape of a KAN is represented
by an integer array
[n0 , n1 , · · · , nL ], (2.3)
where ni is the number of nodes in the ith layer of the computational graph. We denote the ith
neuron in the lth layer by (l, i), and the activation value of the (l, i)-neuron by xl,i . Between layer l
and layer l + 1, there are nl nl+1 activation functions: the activation function that connects (l, i) and
(l + 1, j) is denoted by

ϕ_{l,j,i},    l = 0, · · · , L − 1,    i = 1, · · · , n_l,    j = 1, · · · , n_{l+1}.    (2.4)
The pre-activation of ϕl,j,i is simply xl,i ; the post-activation of ϕl,j,i is denoted by x̃l,j,i ≡
ϕl,j,i (xl,i ). The activation value of the (l + 1, j) neuron is simply the sum of all incoming post-
activations:
x_{l+1,j} = \sum_{i=1}^{n_l} x̃_{l,j,i} = \sum_{i=1}^{n_l} ϕ_{l,j,i}(x_{l,i}),    j = 1, · · · , n_{l+1}.    (2.5)
In matrix form, this reads
x_{l+1} = \underbrace{\begin{pmatrix} ϕ_{l,1,1}(·) & ϕ_{l,1,2}(·) & \cdots & ϕ_{l,1,n_l}(·) \\ ϕ_{l,2,1}(·) & ϕ_{l,2,2}(·) & \cdots & ϕ_{l,2,n_l}(·) \\ \vdots & \vdots & & \vdots \\ ϕ_{l,n_{l+1},1}(·) & ϕ_{l,n_{l+1},2}(·) & \cdots & ϕ_{l,n_{l+1},n_l}(·) \end{pmatrix}}_{Φ_l} x_l,    (2.6)
where Φl is the function matrix corresponding to the lth KAN layer. A general KAN network is a
composition of L layers: given an input vector x_0 ∈ R^{n_0}, the output of KAN is

KAN(x) = (Φ_{L−1} ◦ Φ_{L−2} ◦ · · · ◦ Φ_1 ◦ Φ_0) x.    (2.7)
We can also rewrite the above equation to make it more analogous to Eq. (2.1), assuming output
dimension nL = 1, and define f (x) ≡ KAN(x):
f(x) = \sum_{i_{L-1}=1}^{n_{L-1}} ϕ_{L-1,\,i_L,\,i_{L-1}}\left( \sum_{i_{L-2}=1}^{n_{L-2}} \cdots \left( \sum_{i_2=1}^{n_2} ϕ_{2,\,i_3,\,i_2}\left( \sum_{i_1=1}^{n_1} ϕ_{1,\,i_2,\,i_1}\left( \sum_{i_0=1}^{n_0} ϕ_{0,\,i_1,\,i_0}(x_{i_0}) \right) \right) \right) \cdots \right),    (2.8)
which is quite cumbersome. In contrast, our abstraction of KAN layers and their visualizations are
cleaner and intuitive. The original Kolmogorov-Arnold representation Eq. (2.1) corresponds to a
2-Layer KAN with shape [n, 2n + 1, 1]. Notice that all the operations are differentiable, so we can
train KANs with backpropagation. For comparison, an MLP can be written as an interleaving of affine transformations W and non-linearities σ:

MLP(x) = (W_{L−1} ◦ σ ◦ W_{L−2} ◦ σ ◦ · · · ◦ W_1 ◦ σ ◦ W_0) x.
It is clear that MLPs treat linear transformations and nonlinearities separately as W and σ, while
KANs treat them all together in Φ. In Figure 0.1 (c) and (d), we visualize a three-layer MLP and a
three-layer KAN, to clarify their differences.
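To make the layer abstraction concrete, here is a small illustrative sketch (ours, using plain NumPy rather than the pykan package) of Eqs. (2.5)-(2.7): a KAN layer as a matrix of univariate functions, and a KAN as a composition of such layers. The hand-picked sin/square/exp functions stand in for trainable splines.

import numpy as np

class ToyKANLayer:
    def __init__(self, phis):
        # phis[j][i] is the univariate function phi_{j,i}: R -> R on the edge
        # from input neuron i to output neuron j (cf. Eq. (2.6)).
        self.phis = phis

    def __call__(self, x):
        # Eq. (2.5): output j is the sum of post-activations phi_{j,i}(x_i).
        return np.array([sum(phi(xi) for phi, xi in zip(row, x))
                         for row in self.phis])

# Example: a [2, 1, 1] KAN computing exp(sin(pi*x1) + x2^2), the toy function
# used repeatedly in this paper, as a composition of two layers (Eq. (2.7)).
inner = ToyKANLayer([[lambda t: np.sin(np.pi * t), lambda t: t ** 2]])
outer = ToyKANLayer([[np.exp]])

x0 = np.array([0.3, 0.5])
print(outer(inner(x0)))
print(np.exp(np.sin(np.pi * 0.3) + 0.5 ** 2))   # same value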
Implementation details. Although a KAN layer Eq. (2.5) looks extremely simple, it is non-trivial
to make it well optimizable. The key tricks are:
(1) Residual activation functions. We include a basis function b(x) (similar to residual connec-
tions) such that the activation function ϕ(x) is the sum of the basis function b(x) and the spline
function:

ϕ(x) = w_b b(x) + w_s spline(x).

We set

b(x) = silu(x) = x / (1 + e^{−x}),    spline(x) = \sum_i c_i B_i(x),

where the c_i are trainable (see Figure 2.2 for an illustration, and the sketch after this list). In principle w_b and w_s are redundant since they can be absorbed into b(x) and spline(x). However, we still include these factors (which
are by default trainable) to better control the overall magnitude of the activation function.
(2) Initialization scales. Each activation function is initialized to have ws = 1 and spline(x) ≈ 0 2 .
wb is initialized according to the Xavier initialization, which has been used to initialize linear
layers in MLPs.
(3) Update of spline grids. We update each grid on the fly according to its input activations, to
address the issue that splines are defined on bounded regions but activation values can evolve
out of the fixed region during training 3 .
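A correspondingly minimal sketch (ours, not the pykan code) of one such activation ϕ(x) = w_b·silu(x) + w_s·spline(x), using SciPy B-splines on an augmented grid. The grid range, G, and the random seed are arbitrary choices, and w_b, w_s are simply fixed to 1 here rather than trained.

import numpy as np
from scipy.interpolate import BSpline

def silu(x):
    return x / (1.0 + np.exp(-x))

def make_phi(a=-1.0, b=1.0, G=5, k=3, rng=np.random.default_rng(0)):
    # Augmented knot vector: G+1 grid points extended by k knots on each side,
    # giving G + k B-spline basis functions (cf. Section 2.5).
    h = (b - a) / G
    t = a + h * np.arange(-k, G + k + 1)
    c = rng.normal(0.0, 0.1, size=G + k)   # small init so spline(x) ~ 0
    w_b, w_s = 1.0, 1.0                     # scale factors, fixed here for illustration
    spline = BSpline(t, c, k, extrapolate=True)
    return lambda x: w_b * silu(x) + w_s * spline(x)

phi = make_phi()
print(phi(np.linspace(-1, 1, 5)))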
² This is done by drawing B-spline coefficients c_i ∼ N(0, σ^2) with a small σ; typically we set σ = 0.1.
³ Other possibilities are: (a) the grid is learnable with gradient descent, e.g., [22]; (b) use normalization such that the input range is fixed. We tried (b) at first but its performance is inferior to our current approach.

Parameter count. For a KAN
(1) of depth L,
(2) with layers of equal width n_0 = n_1 = · · · = n_L = N,
(3) with each spline of order k (usually k = 3) on G intervals (for G + 1 grid points).
Then there are in total O(N^2 L(G + k)) ∼ O(N^2 LG) parameters. In contrast, an MLP with depth L and width N only needs O(N^2 L) parameters, which appears to be more efficient than KAN. For-
tunately, KANs usually require much smaller N than MLPs, which not only saves parameters, but
also achieves better generalization (see e.g., Figure 3.1 and 3.3) and facilitates interpretability. We
remark that for 1D problems, we can take N = L = 1 and the KAN network in our implementation
is nothing but a spline approximation. For higher dimensions, we characterize the generalization
behavior of KANs with a theorem below.
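As a concrete instance of this count (our own arithmetic; the per-edge scale factors w_b, w_s are ignored), take a [2, 5, 1] KAN with cubic splines (k = 3) on G = 5 intervals:

\underbrace{(2 \cdot 5 + 5 \cdot 1)}_{15 \text{ edges}} \times \underbrace{(G + k)}_{\text{coefficients per spline}} = 15 \times 8 = 120 \text{ spline parameters},

which is O(N^2 L G) in the sense above and, for large G, approaches the 15G count quoted later for the [2, 5, 1] KAN in the grid-extension toy example.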
can be smoothly represented by a [4, 2, 1, 1] KAN which is 3-Layer, but may not admit a 2-Layer
KAN with smooth activations. To facilitate an approximation analysis, we still assume smoothness
of activations, but allow the representations to be arbitrarily wide and deep, as in Eq. (2.7). To
emphasize the dependence of our KAN on the finite set of grid points, we use Φ^G_l and Φ^G_{l,i,j} below to replace the notation Φ_l and Φ_{l,i,j} used in Eq. (2.5) and (2.6).
Theorem 2.1 (Approximation theory, KAT). Let x = (x_1, x_2, · · · , x_n). Suppose that a function f(x) admits a representation

f = (Φ_{L−1} ◦ Φ_{L−2} ◦ · · · ◦ Φ_1 ◦ Φ_0) x,

as in Eq. (2.7), where each one of the Φ_{l,i,j} is (k + 1)-times continuously differentiable. Then there exists a constant C depending on f and its representation, such that we have the following approximation bound in terms of the grid size G: there exist k-th order B-spline functions Φ^G_{l,i,j} such that for any 0 ≤ m ≤ k, we have the bound

∥f − (Φ^G_{L−1} ◦ Φ^G_{L−2} ◦ · · · ◦ Φ^G_1 ◦ Φ^G_0) x∥_{C^m} ≤ C G^{−k−1+m}.    (2.15)
Here we adopt the notation of the C^m-norm, measuring the magnitude of derivatives up to order m.
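Concretely, we assume the standard convention for this norm (our convention, consistent with the bound above):

∥g∥_{C^m} = \max_{|β| ≤ m} \sup_{x} \left| D^{β} g(x) \right|,

i.e., the largest magnitude attained by any derivative of g up to order m on the domain.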
Proof. By the classical 1D B-spline theory [23] and the fact that Φ_{l,i,j} as continuous functions can be uniformly bounded on a bounded domain, we know that there exist finite-grid B-spline functions Φ^G_{l,i,j} such that for any 0 ≤ m ≤ k,

∥(Φ_{l,i,j} ◦ Φ_{l−1} ◦ Φ_{l−2} ◦ · · · ◦ Φ_1 ◦ Φ_0)x − (Φ^G_{l,i,j} ◦ Φ_{l−1} ◦ Φ_{l−2} ◦ · · · ◦ Φ_1 ◦ Φ_0)x∥_{C^m} ≤ C G^{−k−1+m},

with a constant C independent of G. We fix those B-spline approximations. Therefore the residue R_l defined via

R_l := (Φ^G_{L−1} ◦ · · · ◦ Φ^G_{l+1} ◦ Φ_l ◦ Φ_{l−1} ◦ · · · ◦ Φ_0)x − (Φ^G_{L−1} ◦ · · · ◦ Φ^G_{l+1} ◦ Φ^G_l ◦ Φ_{l−1} ◦ · · · ◦ Φ_0)x

satisfies

∥R_l∥_{C^m} ≤ C G^{−k−1+m},

with a constant independent of G. Finally notice that

f − (Φ^G_{L−1} ◦ Φ^G_{L−2} ◦ · · · ◦ Φ^G_1 ◦ Φ^G_0)x = R_{L−1} + R_{L−2} + · · · + R_1 + R_0,

and the claim follows from the triangle inequality.
We know that asymptotically, provided that the assumption in Theorem 2.1 holds, KANs with fi-
nite grid size can approximate the function well with a residue rate independent of the dimension,
hence beating the curse of dimensionality! This comes naturally since we only use splines to approx-
imate 1D functions. In particular, for m = 0, we recover the accuracy in L∞ norm, which in turn
provides a bound of RMSE on the finite domain, which gives a scaling exponent k + 1. Of course,
the constant C is dependent on the representation; hence it will depend on the dimension. We will
leave the discussion of the dependence of the constant on the dimension as a future work.
We remark that although the Kolmogorov-Arnold theorem Eq. (2.1) corresponds to a KAN repre-
sentation with shape [d, 2d + 1, 1], its functions are not necessarily smooth. On the other hand, if we
are able to identify a smooth representation (maybe at the cost of extra layers or making the KAN
wider than the theory prescribes), then Theorem 2.1 indicates that we can beat the curse of dimen-
sionality (COD). This should not come as a surprise since we can inherently learn the structure of
the function and make our finite-sample KAN approximation interpretable.
Neural scaling laws: comparison to other theories. Neural scaling laws are the phenomenon
where test loss decreases with more model parameters, i.e., ℓ ∝ N −α where ℓ is test RMSE, N is
the number of parameters, and α is the scaling exponent. A larger α promises more improvement
by simply scaling up the model. Different theories have been proposed to predict α. Sharma &
Kaplan [24] suggest that α comes from data fitting on an input manifold of intrinsic dimensionality
d. If the model function class is piecewise polynomials of order k (k = 1 for ReLU), then the
standard approximation theory implies α = (k + 1)/d from the approximation theory. This bound
suffers from the curse of dimensionality, so people have sought other bounds independent of d by
leveraging compositional structures. In particular, Michaud et al. [25] considered computational
graphs that only involve unary (e.g., squared, sine, exp) and binary (+ and ×) operations, finding
α = (k + 1)/d∗ = (k + 1)/2, where d∗ = 2 is the maximum arity. Poggio et al. [19] leveraged the
idea of compositional sparsity and proved that, given the function class W_m (functions whose derivatives are continuous up to m-th order), one needs N = O(ϵ^{−2/m}) parameters to achieve error ϵ, which is equivalent to α = m/2. Our approach, which assumes the existence of smooth Kolmogorov-
Arnold representations, decomposes the high-dimensional function into several 1D functions, giving
α = k+1 (where k is the piecewise polynomial order of the splines). We choose k = 3 cubic splines
so α = 4 which is the largest and best scaling exponent compared to other works. We will show in
Section 3.1 that this bound α = 4 can in fact be achieved empirically with KANs, while previous
work [25] reported that MLPs have problems even saturating slower bounds (e.g., α = 1) and
plateau quickly. Of course, we can increase k to match the smoothness of functions, but too high k
might be too oscillatory, leading to optimization issues.
Comparison between KAT and UAT. The power of fully-connected neural networks is justified by
the universal approximation theorem (UAT), which states that given a function and error tolerance
ϵ > 0, a two-layer network with k > N (ϵ) neurons can approximate the function within error ϵ.
However, the UAT guarantees no bound for how N (ϵ) scales with ϵ. Indeed, it suffers from the COD,
and N has been shown to grow exponentially with d in some cases [21]. The difference between
Figure 2.3: We can make KANs more accurate by grid extension (fine-graining spline grids). Top left (right):
training dynamics of a [2, 5, 1] ([2, 1, 1]) KAN. Both models display staircases in their loss curves, i.e., loss
suddenly drops and then plateaus after grid extension. Bottom left: test RMSE follows scaling laws against grid
size G. Bottom right: training time scales favorably with grid size G.
KAT and UAT is a consequence that KANs take advantage of the intrinsically low-dimensional rep-
resentation of the function while MLPs do not. In KAT, we highlight quantifying the approximation
error in the compositional space. In the literature, generalization error bounds, taking into account
finite samples of training data, for a similar space have been studied for regression problems; see
[26, 27], and also specifically for MLPs with ReLU activations [28]. On the other hand, for general
function spaces like Sobolev or Besov spaces, the nonlinear n-widths theory [29, 30, 31] indicates
that we can never beat the curse of dimensionality, while MLPs with ReLU activations can achieve
the tight rate [32, 33, 34]. This fact again motivates us to consider functions of compositional struc-
ture, the much "nicer" functions that we encounter in practice and in science, to overcome the COD.
Compared with MLPs, we may use a smaller architecture in practice, since we learn general non-
linear activation functions; see also [28] where the depth of the ReLU MLPs needs to reach at least
log n to have the desired rate, where n is the number of samples. Indeed, we will show that KANs
are nicely aligned with symbolic functions while MLPs are not.
grid with G1 intervals has grid points at {t0 = a, t1 , t2 , · · · , tG1 = b}, which is augmented to
{t−k , · · · , t−1 , t0 , · · · , tG1 , tG1 +1 , · · · , tG1 +k }. There are G1 + k B-spline basis functions, with
the ith B-spline Bi (x) being non-zero only on [t−k+i , ti+1 ] (i = 0, · · · , G1 + k − 1). Then f
on the coarse grid is expressed as a linear combination of these B-spline basis functions, f_coarse(x) = \sum_{i=0}^{G_1+k-1} c_i B_i(x). Given a finer grid with G_2 intervals, f on the fine grid is correspondingly f_fine(x) = \sum_{j=0}^{G_2+k-1} c'_j B'_j(x). The parameters c'_j can be initialized from the parameters c_i by minimizing the distance between f_fine(x) and f_coarse(x) (over some distribution of x):

{c'_j} = \underset{\{c'_j\}}{\mathrm{argmin}}\; \mathbb{E}_{x \sim p(x)} \left( \sum_{j=0}^{G_2+k-1} c'_j B'_j(x) - \sum_{i=0}^{G_1+k-1} c_i B_i(x) \right)^2,    (2.16)
which can be implemented by the least squares algorithm. We perform grid extension for all splines
in a KAN independently.
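The following NumPy/SciPy sketch (ours, not the pykan implementation) illustrates Eq. (2.16) for a single spline: the fine-grid coefficients are obtained by least squares on samples of x. The uniform p(x), the grid sizes G_1, G_2, and the random coarse coefficients are arbitrary illustrative choices.

import numpy as np
from scipy.interpolate import BSpline

def knots(a, b, G, k):
    # augmented knot vector: G+1 grid points on [a, b] extended by k on each side
    h = (b - a) / G
    return a + h * np.arange(-k, G + k + 1)

a, b, k = -1.0, 1.0, 3
G1, G2 = 5, 10
rng = np.random.default_rng(0)
c_coarse = rng.normal(size=G1 + k)
f_coarse = BSpline(knots(a, b, G1, k), c_coarse, k)

xs = rng.uniform(a, b, size=2000)                 # x ~ p(x), here uniform
# Design matrix: column j holds the j-th fine-grid basis function B'_j(xs)
B_fine = np.stack([BSpline.basis_element(knots(a, b, G2, k)[j:j + k + 2],
                                         extrapolate=False)(xs)
                   for j in range(G2 + k)], axis=1)
B_fine = np.nan_to_num(B_fine)                    # each basis is zero outside its support
c_fine, *_ = np.linalg.lstsq(B_fine, f_coarse(xs), rcond=None)
f_fine = BSpline(knots(a, b, G2, k), c_fine, k)
print(np.max(np.abs(f_fine(xs) - f_coarse(xs))))  # residual of the least-squares fit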
Toy example: staircase-like loss curves. We use a toy example f(x, y) = exp(sin(πx) + y^2) to
demonstrate the effect of grid extension. In Figure 2.3 (top left), we show the train and test RMSE for
a [2, 5, 1] KAN. The number of grid points starts as 3, increases to a higher value every 200 LBFGS
steps, ending up with 1000 grid points. It is clear that every time fine graining happens, the training
loss drops faster than before (except for the finest grid with 1000 points, where optimization ceases
to work probably due to bad loss landscapes). However, the test losses first go down then go up,
displaying a U-shape, due to the bias-variance tradeoff (underfitting vs. overfitting). We conjecture
that the optimal test loss is achieved at the interpolation threshold when the number of parameters
match the number of data points. Since our training samples are 1000 and the total parameters of a
[2, 5, 1] KAN is 15G (G is the number of grid intervals), we expect the interpolation threshold to be
G = 1000/15 ≈ 67, which roughly agrees with our experimentally observed value G ∼ 50.
Small KANs generalize better. Is this the best test performance we can achieve? Notice that the
synthetic task can be represented exactly by a [2, 1, 1] KAN, so we train a [2, 1, 1] KAN and present
the training dynamics in Figure 2.3 top right. Interestingly, it can achieve even lower test losses
than the [2, 5, 1] KAN, with clearer staircase structures and the interpolation threshold is delayed
to a larger grid size as a result of fewer parameters. This highlights a subtlety of choosing KAN
architectures. If we do not know the problem structure, how can we determine the minimal KAN
shape? In Section 2.5, we will propose a method to auto-discover such minimal KAN architecture
via regularization and pruning.
Scaling laws: comparison with theory. We are also interested in how the test loss decreases as the
number of grid parameters increases. In Figure 2.3 (bottom left), a [2,1,1] KAN scales roughly as test RMSE ∝ G^{−3}. However, according to Theorem 2.1, we would expect test RMSE ∝ G^{−4}.
We found that the errors across samples are not uniform. This is probably attributed to boundary
effects [25]. In fact, there are a few samples that have significantly larger errors than others, making
the overall scaling slow down. If we plot the square root of the median (not mean) of the squared
losses, we get a scaling closer to G^{−4}. Despite this suboptimality (probably due to optimization),
KANs still have much better scaling laws than MLPs, for data fitting (Figure 3.1) and PDE solving
(Figure 3.3). In addition, the training time scales favorably with the number of grid points G, shown
in Figure 2.3 bottom right 4 .
⁴ When G = 1000, training becomes significantly slower, which is specific to the use of the LBFGS optimizer with line search. We conjecture that the loss landscape becomes bad for G = 1000, so the line search keeps trying to find an optimal step size, running for the maximal number of iterations without stopping early.
External vs Internal degrees of freedom. A new concept that KANs highlight is the distinction between external and internal degrees of freedom (parameters). The computational graph of how
nodes are connected represents external degrees of freedom (“dofs”), while the grid points inside
an activation function are internal degrees of freedom. KANs benefit from the fact that they have
both external dofs and internal dofs. External dofs (that MLPs also have but splines do not) are
responsible for learning compositional structures of multiple variables. Internal dofs (that splines
also have but MLPs do not) are responsible for learning univariate functions.
(1) There is no linear “weight” in KANs. Linear weights are replaced by learnable activation func-
tions, so we should define the L1 norm of these activation functions.
(2) We find L1 to be insufficient for sparsification of KANs; instead an additional entropy regular-
ization is necessary (see Appendix C for more details).
We define the L1 norm of an activation function ϕ to be its average magnitude over its Np inputs,
i.e.,
|ϕ|_1 ≡ \frac{1}{N_p} \sum_{s=1}^{N_p} \left| ϕ(x^{(s)}) \right|.    (2.17)
Then for a KAN layer Φ with nin inputs and nout outputs, we define the L1 norm of Φ to be the
sum of L1 norms of all activation functions, i.e.,
|Φ|_1 ≡ \sum_{i=1}^{n_{in}} \sum_{j=1}^{n_{out}} |ϕ_{i,j}|_1.    (2.18)
The total training objective ℓ_total is the prediction loss ℓ_pred plus L1 and entropy regularization of all KAN layers:

ℓ_total = ℓ_pred + λ \left( µ_1 \sum_{l=0}^{L−1} |Φ_l|_1 + µ_2 \sum_{l=0}^{L−1} S(Φ_l) \right),    (2.20)
where µ1 , µ2 are relative magnitudes usually set to µ1 = µ2 = 1, and λ controls overall regulariza-
tion magnitude.
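A compact sketch (ours) of Eqs. (2.17)-(2.20), operating on precomputed activation magnitudes. The entropy S(Φ) is taken here to be the entropy of the normalized edge L1 magnitudes, which is our reading of the regularizer referenced above (details in Appendix C, not reproduced here).

import numpy as np

def layer_l1(acts_l):
    # acts_l: phi_{j,i}(x^(s)) values of one layer on a batch, shape (N_p, n_out, n_in).
    # |phi|_1 per edge: average magnitude over the N_p inputs, Eq. (2.17);
    # |Phi|_1: sum over all edges, Eq. (2.18).
    phi_l1 = np.abs(acts_l).mean(axis=0)
    return phi_l1, phi_l1.sum()

def layer_entropy(phi_l1, Phi_l1, eps=1e-12):
    p = phi_l1 / (Phi_l1 + eps)               # normalized edge magnitudes
    return -(p * np.log(p + eps)).sum()

def total_loss(pred_loss, acts, lam=1e-2, mu1=1.0, mu2=1.0):
    reg = 0.0
    for acts_l in acts:
        phi_l1, Phi_l1 = layer_l1(acts_l)
        reg += mu1 * Phi_l1 + mu2 * layer_entropy(phi_l1, Phi_l1)
    return pred_loss + lam * reg              # Eq. (2.20)

# toy usage: two layers of a [2, 5, 1] KAN with random activation magnitudes
rng = np.random.default_rng(0)
acts = [rng.random((128, 5, 2)), rng.random((128, 1, 5))]
print(total_loss(pred_loss=0.3, acts=acts))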
Figure 2.4: An example of how to do symbolic regression with KAN.
2. Visualization. When we visualize a KAN, to get a sense of magnitudes, we set the transparency
of an activation function ϕl,i,j proportional to tanh(βAl,i,j ) where β = 3 . Hence, functions with
small magnitude appear faded out to allow us to focus on important ones.
3. Pruning. After training with sparsification penalty, we may also want to prune the network to a
smaller subnetwork. We sparsify KANs on the node level (rather than on the edge level). For each
node (say the ith neuron in the lth layer), we define its incoming and outgoing score as
and consider a node to be important if both incoming and outgoing scores are greater than a threshold
hyperparameter θ = 10−2 by default. All unimportant neurons are pruned.
4. Symbolification. In cases where we suspect that some activation functions are in fact sym-
bolic (e.g., cos or log), we provide an interface to set them to a specified symbolic form: fix_symbolic(l,i,j,f) sets the (l, i, j) activation to be f. However, we cannot simply set
the activation function to be the exact symbolic formula, since its inputs and outputs may have shifts
and scalings. So, we obtain preactivations x and postactivations y from samples, and fit affine pa-
rameters (a, b, c, d) such that y ≈ cf (ax + b) + d. The fitting is done by iterative grid search of a, b
and linear regression.
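A minimal sketch (ours) of this fitting step; the grid ranges and resolutions for a and b are arbitrary choices, and pykan's own implementation may differ in details.

import numpy as np

def fit_affine(x, y, f, a_grid=np.linspace(-10, 10, 101),
               b_grid=np.linspace(-10, 10, 101)):
    # Grid-search (a, b); for each pair, solve y ~ c*f(a*x + b) + d by linear regression.
    best_r2, best_params = -np.inf, None
    for a in a_grid:
        for b in b_grid:
            g = f(a * x + b)
            if not np.all(np.isfinite(g)):
                continue
            A = np.stack([g, np.ones_like(g)], axis=1)
            coef, res, rank, _ = np.linalg.lstsq(A, y, rcond=None)
            if rank < 2 or res.size == 0:      # skip degenerate fits (e.g., a = 0)
                continue
            r2 = 1.0 - res[0] / (np.var(y) * len(y))
            if r2 > best_r2:
                best_r2, best_params = r2, (a, b, coef[0], coef[1])
    return best_params, best_r2

# toy usage: recover y = 2*sin(3x + 1) - 0.5 with f = sin
x = np.linspace(-2, 2, 200)
y = 2 * np.sin(3 * x + 1) - 0.5
print(fit_affine(x, y, np.sin))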
Besides these techniques, we provide additional tools that allow users to apply more fine-grained
control to KANs, listed in Appendix A.
Given data points (xi , yi , fi ), i = 1, 2, · · · , Np , a hypothetical user Alice is interested in figuring
out the symbolic formula. The steps of Alice’s interaction with the KANs are described below
(illustrated in Figure 2.4):
Step 1: Training with sparsification. Starting from a fully-connected [2, 5, 1] KAN, training with
sparsification regularization can make it quite sparse. 4 out of 5 neurons in the hidden layer appear
useless, hence we want to prune them away.
Step 2: Pruning. Automatic pruning is seen to discard all hidden neurons except the last one,
leaving a [2, 1, 1] KAN. The activation functions appear to be known symbolic functions.
Step 3: Setting symbolic functions. Assuming that the user can correctly guess these symbolic
formulas from staring at the KAN plot, they can set
fix_symbolic(0,0,0,'sin')
fix_symbolic(0,1,0,'x^2')    (2.23)
fix_symbolic(1,0,0,'exp').
In case the user has no domain knowledge or no idea which symbolic functions these activation
functions might be, we provide a function suggest_symbolic to suggest symbolic candidates.
Step 4: Further training. After symbolifying all the activation functions in the network, the only
remaining parameters are the affine parameters. We continue training these affine parameters, and
when we see the loss dropping to machine precision, we know that we have found the correct sym-
bolic expression.
Step 5: Output the symbolic formula. Sympy is used to compute the symbolic formula of the output node. The user obtains 1.0 e^{1.0 y^2 + 1.0 sin(3.14x)}, which is the true answer (we only displayed two decimals for π).
Remark: Why not symbolic regression (SR)? It is reasonable to use symbolic regression for this
example. However, symbolic regression methods are in general brittle and hard to debug. They
either return a success or a failure in the end without outputting interpretable intermediate results.
In contrast, KANs do continuous search (with gradient descent) in function space, so their results
are more continuous and hence more robust. Moreover, users have more control over KANs as
compared to SR due to KANs’ transparency. The way we visualize KANs is like displaying KANs’
“brain” to users, and users can perform “surgery” (debugging) on KANs. This level of control is
typically unavailable for SR. We will show examples of this in Section 4.4. More generally, when the
target function is not symbolic, symbolic regression will fail but KANs can still provide something
meaningful. For example, a special function (e.g., a Bessel function) is impossible for SR to learn
unless it is provided in advance, but KANs can use splines to approximate it numerically anyway
(see Figure 4.1 (d)).
Figure 3.1: Compare KANs to MLPs on five toy examples. KANs can almost saturate the fastest scaling law predicted by our theory (α = 4), while MLPs scale slowly and plateau quickly.
(1) f(x) = J_0(20x), which is the Bessel function. Since it is a univariate function, it can be represented by a spline, which is a [1, 1] KAN.
(2) f(x, y) = exp(sin(πx) + y^2). We know that it can be exactly represented by a [2, 1, 1] KAN.
(3) f(x, y) = xy. We know from Figure 4.1 that it can be exactly represented by a [2, 2, 1] KAN.
(4) A high-dimensional example f(x_1, · · · , x_100) = exp\left(\frac{1}{100}\sum_{i=1}^{100} \sin^2\left(\frac{πx_i}{2}\right)\right), which can be represented by a [100, 1, 1] KAN.
(5) A four-dimensional example f(x_1, x_2, x_3, x_4) = exp\left(\frac{1}{2}\left(\sin(π(x_1^2 + x_2^2)) + \sin(π(x_3^2 + x_4^2))\right)\right), which can be represented by a [4, 4, 2, 1] KAN.
We train these KANs by increasing grid points every 200 steps, in total covering G =
{3, 5, 10, 20, 50, 100, 200, 500, 1000}. We train MLPs with different depths and widths as base-
lines. Both MLPs and KANs are trained with LBFGS for 1800 steps in total. We plot test RMSE as
a function of the number of parameters for KANs and MLPs in Figure 3.1, showing that KANs have
better scaling curves than MLPs, especially for the high-dimensional example. For comparison, we
plot the lines predicted from our KAN theory as red dashed (α = k + 1 = 4), and the lines predicted
from Sharma & Kaplan [24] as black-dashed (α = (k + 1)/d = 4/d). KANs can almost saturate the
steeper red lines, while MLPs struggle to converge even as fast as the slower black lines and plateau
quickly. We also note that for the last example, the 2-Layer KAN [4, 9, 1] behaves much worse than
the 3-Layer KAN (shape [4, 2, 2, 1]). This highlights the greater expressive power of deeper KANs,
which is the same for MLPs: deeper MLPs have more expressive power than shallower ones. Note
that we have adopted the vanilla setup where both KANs and MLPs are trained with LBFGS with-
out advanced techniques, e.g., switching between Adam and LBFGS, or boosting [35]. We leave the
comparison of KANs and MLPs in advanced setups for future work.
Figure 3.2: Fitting special functions. We show the Pareto Frontier of KANs and MLPs in the plane spanned by
the number of model parameters and RMSE loss. Consistently accross all special functions, KANs have better
Pareto Frontiers than MLPs. The definitions of these special functions are in Table 1.
Name | scipy.special API | Minimal KAN shape (test RMSE < 10^{-2}) | Minimal KAN test RMSE | Best KAN shape | Best KAN test RMSE | MLP test RMSE
Jacobian elliptic functions ellipj(x, y) [2,2,1] 7.29 × 10−3 [2,3,2,1,1,1] 1.33 × 10−4 6.48 × 10−4
Incomplete elliptic integral of the first kind ellipkinc(x, y) [2,2,1,1] 1.00 × 10−3 [2,2,1,1,1] 1.24 × 10−4 5.52 × 10−4
Incomplete elliptic integral of the second kind ellipeinc(x, y) [2,2,1,1] 8.36 × 10−5 [2,2,1,1] 8.26 × 10−5 3.04 × 10−4
Bessel function of the first kind jv(x, y) [2,2,1] 4.93 × 10−3 [2,3,1,1,1] 1.64 × 10−3 5.52 × 10−3
Bessel function of the second kind yv(x, y) [2,3,1] 1.89 × 10−3 [2,2,2,1] 1.49 × 10−5 3.45 × 10−4
Modified Bessel function of the second kind kv(x, y) [2,1,1] 4.89 × 10−3 [2,2,1] 2.52 × 10−5 1.67 × 10−4
Modified Bessel function of the first kind iv(x, y) [2,4,3,2,1,1] 9.28 × 10−3 [2,4,3,2,1,1] 9.28 × 10−3 1.07 × 10−2
Associated Legendre function (m = 0) lpmv(0, x, y) [2,2,1] 5.25 × 10−5 [2,2,1] 5.25 × 10−5 1.74 × 10−2
Associated Legendre function (m = 1) lpmv(1, x, y) [2,4,1] 6.90 × 10−4 [2,4,1] 6.90 × 10−4 1.50 × 10−3
Associated Legendre function (m = 2) lpmv(2, x, y) [2,2,1] 4.88 × 10−3 [2,3,2,1] 2.26 × 10−4 9.43 × 10−4
spherical harmonics (m = 0, n = 1) sph_harm(0, 1, x, y) [2,1,1] 2.21 × 10−7 [2,1,1] 2.21 × 10−7 1.25 × 10−6
spherical harmonics (m = 1, n = 1) sph_harm(1, 1, x, y) [2,2,1] 7.86 × 10−4 [2,3,2,1] 1.22 × 10−4 6.70 × 10−4
spherical harmonics (m = 0, n = 2) sph_harm(0, 2, x, y) [2,1,1] 1.95 × 10−7 [2,1,1] 1.95 × 10−7 2.85 × 10−6
spherical harmonics (m = 1, n = 2) sph_harm(1, 2, x, y) [2,2,1] 4.70 × 10−4 [2,2,1,1] 1.50 × 10−5 1.84 × 10−3
spherical harmonics (m = 2, n = 2) sph_harm(2, 2, x, y) [2,2,1] 1.12 × 10−3 [2,2,3,2,1] 9.45 × 10−5 6.21 × 10−4
Feynman Eq. | Original Formula | Dimensionless formula | Variables | Human-constructed KAN shape | Pruned KAN shape (smallest shape that achieves RMSE < 10^{-2}) | Pruned KAN shape (lowest loss) | Human-constructed KAN loss (lowest test RMSE) | Pruned KAN loss (lowest test RMSE) | Unpruned KAN loss (lowest test RMSE) | MLP loss (lowest test RMSE)
θ2
√ θ2
√
I.6.2 exp(− 2σ 2 )/ 2πσ 2 exp(− 2σ 2 )/ 2πσ 2 θ, σ [2,2,1,1] [2,2,1] [2,2,1,1] 7.66 × 10−5 2.86 × 10−5 4.60 × 10−5 1.45 × 10−4
2 √ 2 √
I.6.2b exp(− (θ−θ 1)
2σ 2 )/ 2πσ
2 exp(− (θ−θ 1)
2σ 2 )/ 2πσ
2 θ, θ1 , σ [3,2,2,1,1] [3,4,1] [3,2,2,1,1] 1.22 × 10−3 4.45 × 10−4 1.25 × 10−3 7.40 × 10−4
Gm1 m2
I.9.18 (x2 −x1 )2 +(y2 −y1 )2 +(z2 −z1 )2
a
(b−1)2 +(c−d)2 +(e−f )2 a, b, c, d, e, f [6,4,2,1,1] [6,4,1,1] [6,4,1,1] 1.48 × 10−3 8.62 × 10−3 6.56 × 10−3 1.59 × 10−3
I.12.11 q(Ef + Bvsinθ) 1 + asinθ a, θ [2,2,2,1] [2,2,1] [2,2,1] 2.07 × 10−3 1.39 × 10−3 9.13 × 10−4 6.71 × 10−4
I.13.12 Gm1 m2 ( r12 − 1
r1 ) a( 1b − 1) a, b [2,2,1] [2,2,1] [2,2,1] 7.22 × 10−3 4.81 × 10−3 2.72 × 10−3 1.42 × 10−3
I.15.3x √x−utu √1−a
1−b2
a, b [2,2,1,1] [2,1,1] [2,2,1,1,1] 7.35 × 10−3 1.58 × 10−3 1.14 × 10−3 8.54 × 10−4
1−( c )2
I.16.6 u+v
1+ uv
a+b
1+ab a, b [2,2,2,2,2,1] [2,2,1] [2,2,1] 1.06 × 10−3 1.19 × 10−3 1.53 × 10−3 6.20 × 10−4
c2
m1 r1 +m2 r2
I.18.4 m1 +m2
1+ab
1+a a, b [2,2,2,1,1] [2,2,1] [2,2,1] 3.92 × 10−4 1.50 × 10−4 1.32 × 10−3 3.68 × 10−4
I.26.2 arcsin(nsinθ2 ) arcsin(nsinθ2 ) n, θ2 [2,2,2,1,1] [2,2,1] [2,2,2,1,1] 1.22 × 10−1 7.90 × 10−4 8.63 × 10−4 1.24 × 10−3
I.27.6 1
1
+ dn
1
1+ab a, b [2,2,1,1] [2,1,1] [2,1,1] 2.22 × 10−4 1.94 × 10−4 2.14 × 10−4 2.46 × 10−4
d1 2
p p
I.29.16 x21 + x22 − 2x1 x2 cos(θ1 − θ2 ) 1 + a2 − 2acos(θ1 − θ2 ) a, θ1 , θ2 [3,2,2,3,2,1,1] [3,2,2,1] [3,2,3,1] 2.36 × 10−1 3.99 × 10−3 3.20 × 10−3 4.64 × 10−3
sin2 ( nθ2 ) sin2 ( nθ2 )
I.30.3 I∗,0 sin2 ( θ2 ) sin2 ( θ2 )
n, θ [2,3,2,2,1,1] [2,4,3,1] [2,3,2,3,1,1] 3.85 × 10−1 1.03 × 10−3 1.11 × 10−2 1.50 × 10−2
I.30.5 λ
arcsin( nd ) arcsin( na ) a, n [2,1,1] [2,1,1] [2,1,1,1,1,1] 2.23 × 10−4 3.49 × 10−5 6.92 × 10−5 9.45 × 10−5
√ √
I.37.4 I∗ = I1 + I2 + 2 I1 I2 cosδ 1 + a + 2 acosδ a, δ [2,3,2,1] [2,2,1] [2,2,1] 7.57 × 10−5 4.91 × 10−6 3.41 × 10−4 5.67 × 10−4
I.40.1 n0 exp(− mgx
kb T ) n0 e−a n0 , a [2,1,1] [2,2,1] [2,2,1,1,1,2,1] 3.45 × 10−3 5.01 × 10−4 3.12 × 10−4 3.99 × 10−4
I.44.4 nkb T ln( VV21 ) nlna n, a [2,2,1] [2,2,1] [2,2,1] 2.30 × 10 −5
2.43 × 10 −5
1.10 × 10 −4
3.99 × 10−4
−4 −4 −4
I.50.26 x1 (cos(ωt) + αcos (wt)) 2
cosa + αcos a 2
a, α [2,2,3,1] [2,3,1] [2,3,2,1] 1.52 × 10 5.82 × 10 4.90 × 10 1.53 × 10−3
k(T2 −T1 )A
II.2.42 d (a − 1)b a, b [2,2,1] [2,2,1] [2,2,2,1] 8.54 × 10−4 7.22 × 10−4 1.22 × 10−3 1.81 × 10−4
3 pd z
p √
II.6.15a 4πϵ r 5 x2 + y 2 1 2
4π c a + b
2 a, b, c [3,2,2,2,1] [3,2,1,1] [3,2,1,1] 2.61 × 10−3 3.28 × 10−3 1.35 × 10−3 5.92 × 10−4
pd Ef cosθ
II.11.7 n0 (1 + kb T ) n0 (1 + acosθ) n0 , a, θ [3,3,3,2,2,1] [3,3,1,1] [3,3,1,1] 7.10 × 10−3 8.52 × 10−3 5.03 × 10−3 5.92 × 10−4
II.11.27 nα
1− nα ϵEf nα
1− nα n, α [2,2,1,2,1] [2,1,1] [2,2,1] 2.67 × 10−5 4.40 × 10−5 1.43 × 10−5 7.18 × 10−5
3 3
n0 n0
II.35.18 exp( µmB µmB exp(a)+exp(−a) n0 , a [2,1,1] [2,1,1] [2,1,1,1] 4.13 × 10−4 1.58 × 10−4 7.71 × 10−5 7.92 × 10−5
k T )+exp(− k T )
b b
µm B µm αM
II.36.38 kb T + ϵc2 kb T a + αb a, α, b [3,3,1] [3,2,1] [3,2,1] 2.85 × 10−3 1.15 × 10−3 3.03 × 10−3 2.15 × 10−3
II.38.3 Y Ax
d
a
b a, b [2,1,1] [2,1,1] [2,2,1,1,1] 1.47 × 10−4 8.78 × 10−5 6.43 × 10−4 5.26 × 10−4
2
pd Ef sin ((ω−ω0 )t/2) sin2 ( b−c
2 )
III.9.52 h ((ω−ω0 )t/2)2 a ( b−c 2
a, b, c [3,2,3,1,1] [3,3,2,1] [3,3,2,1,1,1] 4.43 × 10−2 3.90 × 10−3 2.11 × 10−2 9.07 × 10−4
2 )
q √
III.10.19 µm Bx2 + By2 + Bz2 1 + a2 + b2 a, b [2,1,1] [2,1,1] [2,1,2,1] 2.54 × 10−3 1.18 × 10−3 8.16 × 10−4 1.67 × 10−4
III.17.37 β(1 + αcosθ) β(1 + αcosθ) α, β, θ [3,3,3,2,2,1] [3,3,1] [3,3,1] 1.10 × 10−3 5.03 × 10−4 4.12 × 10−4 6.80 × 10−4
Given the structure of the dataset, we may construct KANs by hand, but we are not sure if they are
optimal. In this regime, it is interesting to compare human-constructed KANs and auto-discovered
KANs via pruning (techniques in Section 2.5.1).
Feynman dataset. The Feynman dataset collects many physics equations from Feynman’s text-
books [36, 37]. For our purpose, we are interested in problems in the Feynman_no_units dataset
that have at least 2 variables, since univariate problems are trivial for KANs (they simplify to 1D
splines). A sample equation from the Feynman dataset is the relativistic velocity addition formula f(u, v) = (u + v)/(1 + uv). The dataset can be constructed by randomly drawing u_i ∈ (−1, 1), v_i ∈ (−1, 1), and computing
fi = f (ui , vi ). Given many tuples (ui , vi , fi ), a neural network is trained and aims to predict f
from u and v. We are interested in (1) how well a neural network can perform on test samples; (2)
how much we can learn about the structure of the problem from neural networks.
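For concreteness, here is a small sketch (ours) of how such a dataset can be generated for the velocity-addition example; the sample count and train/test split are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n = 3000
u = rng.uniform(-1, 1, size=n)
v = rng.uniform(-1, 1, size=n)
f = (u + v) / (1 + u * v)          # relativistic velocity addition

# supervised task: predict f from (u, v)
X = np.stack([u, v], axis=1)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = f[:2000], f[2000:]
print(X_train.shape, y_train.shape)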
We compare four kinds of neural networks:
(1) Human-constructed KANs, where we write the formula in Kolmogorov-Arnold form by hand; for example, multiplication can be written as xy = (x + y)^2/4 − (x − y)^2/4, which corresponds to a [2, 2, 1] KAN. The constructed shapes are listed in the “Human-constructed KAN shape” column in Table 2.
(2) KANs without pruning. We fix the KAN shape to width 5 and depths are swept over {2,3,4,5,6}.
(3) KAN with pruning. We use the sparsification (λ = 10−2 or 10−3 ) and the pruning technique
from Section 2.5.1 to obtain a smaller KAN from a fixed-shape KAN from (2).
(4) MLPs with fixed width 5, depths swept in {2, 3, 4, 5, 6}, and activations chosen from
{Tanh, ReLU, SiLU}.
Each KAN is initialized to have G = 3, trained with LBFGS, with increasing number of grid points
every 200 steps to cover G = {3, 5, 10, 20, 50, 100, 200}. For each hyperparameter combination,
we try 3 random seeds. For each dataset (equation) and each method, we report the results of the
best model (minimal KAN shape, or lowest test loss) over random seeds and depths in Table 2. We
find that MLPs and KANs behave comparably on average. For each dataset and each model family
(KANs or MLPs), we plot the Pareto frontier in the plane spanned by the number of parameters
and RMSE losses, shown in Figure D.1 in Appendix D. We conjecture that the Feynman datasets
are too simple to let KANs make further improvements, in the sense that variable dependence is
usually smooth or monotonic, which is in contrast to the complexity of special functions which
often demonstrate oscillatory behavior.
Auto-discovered KANs are smaller than human-constructed ones. We report the pruned KAN
shape in two columns of Table 2; one column is for the minimal pruned KAN shape that can achieve
reasonable loss (i.e., test RMSE smaller than 10−2 ); the other column is for the pruned KAN that
achieves lowest test loss. For completeness, we visualize all 54 pruned KANs in Appendix D (Fig-
ure D.2 and D.3). It is interesting to observe that auto-discovered KAN shapes (for both minimal
and best) are usually smaller than our human constructions. This means that KA representations can
be more efficient than we imagine. At the same time, this may make interpretability subtle because
information is being squashed into a smaller space than what we are comfortable with.
Consider the relativistic velocity composition f(u, v) = (u + v)/(1 + uv), for example. Our construction is
quite deep because we were assuming that multiplication of u, v would use two layers (see Figure 4.1
(a)), inversion of 1 + uv would use one layer, and multiplication of u + v and 1/(1 + uv) would
use another two layers⁶, resulting in a total of 5 layers. However, the auto-discovered KANs are only 2
layers deep! In hindsight, this is actually expected if we recall the rapidity trick in relativity: define
the two “rapidities” a ≡ arctanh u and b ≡ arctanh v. The relativistic composition of velocities is a simple addition in rapidity space, i.e., (u + v)/(1 + uv) = tanh(arctanh u + arctanh v), which can be
realized by a two-layer KAN. Pretending we do not know the notion of rapidity in physics, we could
potentially discover this concept right from KANs without trial-and-error symbolic manipulations.
The interpretability of KANs which can facilitate scientific discovery is the main topic in Section 4.
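As a quick check of the identity used above, the hyperbolic addition formula gives

\tanh(\mathrm{arctanh}\,u + \mathrm{arctanh}\,v) = \frac{\tanh(\mathrm{arctanh}\,u) + \tanh(\mathrm{arctanh}\,v)}{1 + \tanh(\mathrm{arctanh}\,u)\,\tanh(\mathrm{arctanh}\,v)} = \frac{u + v}{1 + uv},

so a layer computing the two arctanh's followed by a layer computing tanh of their sum indeed reproduces f(u, v).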
Figure 3.3: The PDE example. We plot L2 squared and H1 squared losses between the predicted solution and
ground truth solution. First and second: training dynamics of losses. Third and fourth: scaling laws of losses
against the number of parameters. KANs converge faster, achieve lower losses, and have steeper scaling laws
than MLPs.
where we use loss_i to denote the interior loss, discretized and evaluated by a uniform sampling of n_i points z_i = (x_i, y_i) inside the domain, and similarly we use loss_b to denote the boundary loss, discretized and evaluated by a uniform sampling of n_b points on the boundary. α is the hyperparameter
balancing the effect of the two terms.
We compare the KAN architecture with that of MLPs using the same hyperparameters ni = 10000,
nb = 800, and α = 0.01. We measure both the error in the L2 norm and energy (H 1 ) norm
and see that KAN achieves a much better scaling law with a smaller error, using smaller networks
and fewer parameters; see Figure 3.3. A 2-Layer width-10 KAN is 100 times more accurate than a 4-Layer width-100 MLP (10^{−7} vs 10^{−5} MSE) and 100 times more parameter efficient (10^2 vs 10^4 parameters). Therefore we speculate that KANs might have the potential of serving as a good
neural network representation for model reduction of PDEs. However, we want to note that our
implementation of KANs is typically 10x slower than MLPs to train. The ground truth being a
symbolic formula might be an unfair comparison for MLPs since KANs are good at representing
symbolic formulas. In general, KANs and MLPs are good at representing different function classes
of PDE solutions, which needs detailed future study to understand their respective boundaries.
Figure 3.4: A toy continual learning problem. The dataset is a 1D regression task with 5 Gaussian peaks (top
row). Data around each peak is presented sequentially (instead of all at once) to KANs and MLPs. KANs
(middle row) can perfectly avoid catastrophic forgetting, while MLPs (bottom row) display severe catastrophic
forgetting.
ally distinct modules placed locally in space. When a new task is learned, structure re-organization
only occurs in local regions responsible for relevant skills [41, 42], leaving other regions intact. Most
artificial neural networks, including MLPs, do not have this notion of locality, which is probably the
reason for catastrophic forgetting.
We show that KANs have local plasticity and can avoid catastrophic forgetting by leveraging the
locality of splines. The idea is simple: since spline bases are local, a sample will only affect a
few nearby spline coefficients, leaving far-away coefficients intact (which is desirable since far-
away regions may have already stored information that we want to preserve). By contrast, since
MLPs usually use global activations, e.g., ReLU/Tanh/SiLU etc., any local change may propagate
uncontrollably to regions far away, destroying the information being stored there.
We use a toy example to validate this intuition. The 1D regression task is composed of 5 Gaussian
peaks. Data around each peak is presented sequentially (instead of all at once) to KANs and MLPs,
as shown in Figure 3.4 top row. KAN and MLP predictions after each training phase are shown in
the middle and bottom rows. As expected, KAN only remodels regions where data is present in the current phase, leaving previous regions unchanged. By contrast, MLPs remodel the whole
region after seeing new data samples, leading to catastrophic forgetting.
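A small sketch (ours) of the data-generation side of this experiment; the peak centers, widths, and per-phase sample counts are illustrative choices, and model.fit is a stand-in for whichever KAN or MLP trainer is used.

import numpy as np

centers = np.linspace(-0.8, 0.8, 5)
width = 0.07

def target(x):
    # 1D regression target: sum of 5 Gaussian peaks
    return sum(np.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centers)

rng = np.random.default_rng(0)
phases = []
for c in centers:                      # phase i: samples only around peak i
    x = rng.uniform(c - 0.2, c + 0.2, size=200)
    phases.append((x, target(x)))

for i, (x, y) in enumerate(phases):
    # model.fit(x, y)  # train on this phase only, then move on to the next
    print(f"phase {i}: x in [{x.min():.2f}, {x.max():.2f}], {len(x)} samples")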
Here we simply present our preliminary results on an extremely simple example, to demonstrate
how one could possibly leverage locality in KANs (thanks to spline parametrizations) to reduce
catastrophic forgetting. However, it remains unclear whether our method can generalize to more
realistic setups, especially in high-dimensional cases where it is unclear how to define “locality”. In
future work, we would also like to study how our method can be connected to and combined with
SOTA methods in continual learning [43, 44].
Figure 4.1: KANs are interpretable for simple symbolic tasks.
matter physics (Section 4.4). KANs could potentially be the foundation model for AI + Science due
to their accuracy (last section) and interpretability (this section).
(a) Multiplication f (x, y) = xy. A [2, 5, 1] KAN is pruned to a [2, 2, 1] KAN. The learned acti-
vation functions are linear and quadratic. From the computation graph, we see that the way it
computes xy is by leveraging 2xy = (x + y)^2 − (x^2 + y^2).
(b) Division of positive numbers f (x, y) = x/y. A [2, 5, 1] KAN is pruned to a [2, 1, 1] KAN.
The learned activation functions are logarithmic and exponential functions, and the KAN is
computing x/y by leveraging the identity x/y = exp(log x − log y).
(c) Numerical to categorical. The task is to convert a real number in [0, 1] to its first decimal digit (as
one hots), e.g., 0.0618 → [1, 0, 0, 0, 0, · · · ], 0.314 → [0, 0, 0, 1, 0, · · · ]. Notice that activation
functions are learned to be spikes located around the corresponding decimal digits.
(d) Special function f(x, y) = exp(J_0(20x) + y^2). One limitation of symbolic regression is that it
will never find the correct formula of a special function if the special function is not provided as
prior knowledge. KANs can learn special functions – the highly wiggly Bessel function J0 (20x)
is learned (numerically) by KAN.
(e) Phase transition f(x_1, x_2, x_3) = tanh(5(x_1^4 + x_2^4 + x_3^4 − 1)). Phase transitions are of great
interest in physics, so we want KANs to be able to detect phase transitions and to identify the
correct order parameters. We use the tanh function to simulate the phase transition behavior,
and the order parameter is the combination of the quartic terms of x1 , x2 , x3 . Both the quartic
dependence and tanh dependence emerge after KAN training. This is a simplified case of a
localization phase transition discussed in Section 4.4.
(f) Deeper compositions f(x_1, x_2, x_3, x_4) = \sqrt{(x_1 − x_2)^2 + (x_3 − x_4)^2}. To compute this, we
would need the identity function, squared function, and square root, which requires at least
a three-layer KAN. Indeed, we find that a [4, 3, 3, 1] KAN can be auto-pruned to a [4, 2, 1, 1]
KAN, which exactly corresponds to the computation graph we would expect.
More examples from the Feynman dataset and the special function dataset are visualized in Fig-
ure D.2, D.3, F.1, F.2 in Appendices D and F.
f (x1 , x2 , · · · , xd ) ≈ 0. (4.1)
For example, consider a set of features (x1 , x2 , x3 ) that satisfies x3 = exp(sin(πx1 ) + x22 ). Then a
valid f is f (x1 , x2 , x3 ) = sin(πx1 ) + x22 − log(x3 ) = 0, implying that points of (x1 , x2 , x3 ) form
a 2D submanifold specified by f = 0 instead of filling the whole 3D space.
If an algorithm for solving the unsupervised problem can be devised, it has a considerable advantage
over the supervised problem, since it requires only the sets of features S = (x1 , x2 , · · · , xd ). The
supervised problem, on the other hand, tries to predict subsets of features in terms of the others, i.e.
it splits S = Sin ∪ Sout into input and output features of the function to be learned. Without domain
expertise to advise the splitting, there are 2^d − 2 possibilities such that |S_in| > 0 and |S_out| > 0.
This exponentially large space of supervised problems can be avoided by using the unsupervised
approach. This unsupervised learning approach will be valuable to the knot dataset in Section 4.3.
A Google Deepmind team [45] manually chose signature to be the target variable, otherwise they
would face this combinatorial problem described above. This raises the question whether we can
instead tackle the unsupervised learning directly. We present our method and a toy example below.
We tackle the unsupervised learning problem by turning it into a supervised learning problem on
all of the d features, without requiring the choice of a splitting. The essential idea is to learn a
function f (x1 , . . . , xd ) = 0 such that f is not the 0-function. To do this, similar to contrastive
learning, we define positive samples and negative samples: positive samples are feature vectors of
real data. Negative samples are constructed by feature corruption. To ensure that the overall feature
distribution for each topological invariant stays the same, we perform feature corruption by random
permutation of each feature across the entire training set. Now we want to train a network g such
that g(xreal ) = 1 and g(xfake ) = 0 which turns the problem into a supervised problem. However,
remember that we originally want f (xreal ) = 0 and f (xfake ) ̸= 0. We can achieve this by having
g = σ ◦ f where σ(x) = exp(−x^2/(2w^2)) is a Gaussian function with a small width w, which can be
conveniently realized by a KAN with shape [..., 1, 1] whose last activation is set to be the Gaussian
function σ and all previous layers form f . Except for the modifications mentioned above, everything
else is the same for supervised training.
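A brief sketch (ours) of the positive/negative sample construction described above, using the 6D toy example introduced in the next paragraph; feature permutation is done independently per column, so each feature's marginal distribution is preserved.

import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
x3 = np.exp(np.sin(x1) + x2 ** 2)      # dependent feature, as in the 6D toy example
x4 = rng.uniform(-1, 1, n)
x5 = x4 ** 3                           # second dependent group
x6 = rng.uniform(-1, 1, n)             # independent feature
X_pos = np.stack([x1, x2, x3, x4, x5, x6], axis=1)

# negative samples: shuffle each feature column independently across the set
X_neg = np.column_stack([rng.permutation(X_pos[:, j]) for j in range(X_pos.shape[1])])

X = np.concatenate([X_pos, X_neg])
y = np.concatenate([np.ones(n), np.zeros(n)])   # g(x_real) = 1, g(x_fake) = 0
print(X.shape, y.mean())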
Now we demonstrate that the unsupervised paradigm works for a synthetic example. Let us consider
a 6D dataset, where (x_1, x_2, x_3) are dependent variables such that x_3 = exp(sin(x_1) + x_2^2); (x_4, x_5) are dependent variables with x_5 = x_4^3; x_6 is independent of the other variables.
Figure 4.2: Unsupervised learning of a toy task. KANs can identify groups of dependent variables, i.e., (x_1, x_2, x_3) and (x_4, x_5) in this case.
In Figure 4.2, we
show that for seed = 0, KAN reveals the functional dependence among x1 ,x2 , and x3 ; for another
seed = 2024, KAN reveals the functional dependence between x4 and x5 . Our preliminary results
rely on randomness (different seeds) to discover different relations; in the future we would like to
investigate a more systematic and more controlled way to discover a complete set of relations. Even
so, our tool in its current status can provide insights for scientific tasks. We present our results with
the knot dataset in Section 4.3.
Figure 4.3: Knot dataset, supervised mode. With KANs, we rediscover Deepmind’s results that signature is mainly dependent on meridional translation (real and imaginary parts).
We show below that KANs not only rediscover these results with much smaller networks and much
more automation, but also present some interesting new results and insights.
To investigate (1), we treat 17 knot invariants as inputs and signature as outputs. Similar to the
setup in [45], signatures (which are even numbers) are encoded as one-hot vectors and networks are
trained with cross-entropy loss. We find that an extremely small [17, 1, 14] KAN is able to achieve
81.6% test accuracy (while Deepmind’s 4-layer width-300 MLP achieves 78% test accuracy). The
[17, 1, 14] KAN (G = 3, k = 3) has ≈ 200 parameters, while the MLP has ≈ 3 × 10^5 parameters,
shown in Table 3. It is remarkable that KANs can be both more accurate and much more parameter
efficient than MLPs at the same time. In terms of interpretability, we scale the transparency of each
activation according to its magnitude, so it becomes immediately clear which input variables are
important without the need for feature attribution (see Figure 4.3 left): signature is mostly dependent
on µr , and slightly dependent on µi and λ, while dependence on other variables is small. We then
train a [3, 1, 14] KAN on the three important variables, obtaining test accuracy 78.2%. Our results
have one subtle difference from results in [45]: they find that signature is mostly dependent on µi ,
while we find that signature is mostly dependent on µr . This difference could be due to subtle
algorithmic choices, but has led us to carry out the following experiments: (a) ablation studies. We
show that µr contributes more to accuracy than µi (see Figure 4.3): for example, µr alone can
achieve 65.0% accuracy, while µi alone can only achieve 43.8% accuracy. (b) We find a symbolic
formula (in Table 4) which only involves µr and λ, but can achieve 77.8% test accuracy.
To investigate (2), i.e., obtain the symbolic form of σ, we formulate the problem as a regression
task. Using auto-symbolic regression introduced in Section 2.5.1, we can convert a trained KAN
into symbolic formulas.

Id | Formula | Discovered by | test acc | r^2 with Signature | r^2 with DM formula
A | λµ_r / (µ_r^2 + µ_i^2) | Human (DM) | 83.1% | 0.946 | 1
B | −0.02 sin(4.98µ_i + 0.85) + 0.08 abs(4.02µ_r + 6.28) − 0.52 − 0.04 e^{−0.88(1−0.45λ)^2} | [3, 1] KAN | 62.6% | 0.837 | 0.897
C | 0.17 tan(−1.51 + 0.1 e^{−1.43(1−0.4µ_i)^2} + 0.09 e^{−0.06(1−0.21λ)^2} + 1.32 e^{−3.18(1−0.43µ_r)^2}) | [3, 1, 1] KAN | 71.9% | 0.871 | 0.934
D | −0.09 + 1.04 exp(−9.59(−0.62 sin(0.61µ_r + 7.26) − 0.32 tan(0.03λ − 6.59) + 1 − 0.11 e^{−1.77(0.31−µ_i)^2})^2) − 1.09 e^{−7.6(0.65(1−0.01λ)^3 + 0.27 atan(0.53µ_i − 0.6) + 0.09 + exp(−2.58(1 − 0.36µ_r)^2))} | [3, 2, 1] KAN | 84.0% | 0.947 | 0.997
E | 4.76λµ_r / (3.09µ_i + 6.05µ_r^2 + 3.54µ_i^2) | [3, 2, 1] KAN + Pade approx | 82.8% | 0.946 | 0.997
F | (2.94 − 2.92(1−0.10µ_r)^2) / (0.32(0.18−µ_r)^2 + 5.36(1−0.04λ)^2 + 0.50) | [3, 1] KAN / [3, 1] KAN | 77.8% | 0.925 | 0.977

Table 4: Symbolic formulas of signature as a function of meridional translation µ (real µ_r, imaginary µ_i) and longitudinal translation λ. In [45], formula A was discovered by human scientists inspired by neural network attribution results. Formulas B-F are auto-discovered by KANs. KANs can trade off simplicity against accuracy (B, C, D). By adding more inductive biases, KAN is able to discover formula E, which is not too dissimilar from formula A. KANs also discovered a formula F which involves only two variables (µ_r and λ) instead of all three, with little sacrifice in accuracy.

We train KANs with shapes [3, 1], [3, 1, 1], [3, 2, 1], whose corresponding
symbolic formulas are displayed in Table 4 B-D. It is clear that by having a larger KAN, both
accuracy and complexity increase. So KANs provide not just a single symbolic formula, but a whole
Pareto frontier of formulas, trading off simplicity and accuracy. However, KANs need additional
inductive biases to further simplify these equations to rediscover the formula from [45] (Table 4 A).
We have tested two scenarios: (1) in the first scenario, we assume the ground truth formula has a
multi-variate Pade representation (division of two multi-variate Taylor series). We first train [3, 2, 1]
and then fit it to a Pade representation. We can obtain Formula E in Table 4, which bears similarity
with Deepmind’s formula. (2) We hypothesize that the division is not very interpretable for KANs,
so we train two KANs (one for the numerator and the other for the denominator) and divide them
manually. Surprisingly, we end up with formula F (in Table 4), which involves only µ_r and λ; although µ_i is also provided as an input, it is ignored by the KANs.
So far, we have rediscovered the main results from [45]. It is remarkable to see that KANs made
this discovery very intuitive and convenient. Instead of using feature attribution methods (which
are great methods), one can instead simply stare at visualizations of KANs. Moreover, automatic
symbolic regression also makes the discovery of symbolic formulas much easier.
In the next part, we propose a new paradigm of “AI for Math” not included in the Deepmind pa-
per, where we aim to use KANs’ unsupervised learning mode to discover more relations (besides
signature) in knot invariants.
Unsupervised learning As we mentioned in Section 4.2, unsupervised learning is the setup that is
more promising since it avoids manual partition of input and output variables which have combina-
torially many possibilities. In the unsupervised learning mode, we treat all 18 variables (including
signature) as inputs such that they are on the same footing. Knot data are positive samples, and
we randomly shuffle features to obtain negative samples. An [18, 1, 1] KAN is trained to classify
whether a given feature vector belongs to a positive sample (1) or a negative sample (0). We manually set the second-layer activation to be a Gaussian function peaked at zero, so positive samples will have activations at (around) zero, implicitly giving a relation among knot invariants \sum_{i=1}^{18} g_i(x_i) = 0, where x_i stands for a feature (invariant) and g_i is the corresponding activation function, which can be readily read off from KAN diagrams.
Figure 4.4: Knot dataset, unsupervised mode. With KANs, we rediscover three mathematical relations in the knot dataset.
We train the KANs with λ ∈ {10^{−2}, 10^{−3}} to favor sparse combinations of inputs, and seed = {0, 1, · · · , 99}. All 200 net-
works can be grouped into three clusters, with representative KANs displayed in Figure 4.4. These
three groups of dependent variables are:
(1) The first group of dependent variables is the signature, the real part of the meridional translation, and the longitudinal translation (plus two other variables which can be removed because of (3)). This is the
signature dependence studied above, so it is very interesting to see that this dependence relation
is rediscovered again in the unsupervised mode.
(2) The second group of variables involves the cusp volume V, the real part of the meridional translation µr and
longitudinal translation λ. Their activations all look like logarithmic functions (which can be
verified by the implied symbolic functionality in Section 2.5.1). So the relation is − log V +
log µr + log λ = 0 which is equivalent to V = µr λ, which is true by definition. It is, however,
reassuring that we discover this relation without any prior knowledge.
(3) The third group of variables includes the real part of the short geodesic gr and the injectivity radius r. Their activations look qualitatively the same but differ by a minus sign, so it is conjectured that these two variables have a linear correlation. We plot 2D scatter plots, finding that 2r upper bounds gr, which is also a well-known relation [47].
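As promised above, a minimal sketch of this unsupervised setup, assuming the pykan interface summarized in Table 6 (Appendix A); the knot_features placeholder, the negative-sample construction, and the availability of a 'gaussian' entry in the symbolic library are our assumptions, and the test split simply reuses the training data for brevity.

    import numpy as np
    import torch
    from kan import KAN

    X_pos = knot_features                                          # placeholder: (num_knots, 18) invariants, incl. signature
    X_neg = np.apply_along_axis(np.random.permutation, 0, X_pos)   # shuffle each feature column -> negative samples
    X = np.vstack([X_pos, X_neg]).astype(np.float32)
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))]).astype(np.float32)[:, None]
    dataset = {'train_input': torch.tensor(X), 'train_label': torch.tensor(y),
               'test_input': torch.tensor(X), 'test_label': torch.tensor(y)}

    model = KAN(width=[18, 1, 1], grid=5, k=3, seed=0)
    model.fix_symbolic(1, 0, 0, 'gaussian')                  # second-layer activation: Gaussian peaked at zero
    model.train(dataset, opt='LBFGS', steps=50, lamb=1e-2)   # lamb in {1e-2, 1e-3} favors sparse combinations of inputs
    model.plot()                                             # read the g_i off the first-layer activations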
It is interesting that KANs’ unsupervised mode can rediscover several known mathematical rela-
tions. The good news is that the results discovered by KANs are probably reliable; the bad news
is that we have not discovered anything new yet. It is worth noting that we have chosen a shallow
KAN for simple visualization, but deeper KANs can probably find more relations if they exist. We
would like to investigate how to discover more complicated relations with deeper KANs in future
work.
4.4 Application to Physics: Anderson localization
Anderson localization is the fundamental phenomenon in which disorder in a quantum system leads to the localization of electronic wave functions, causing all transport to cease [48]. In one and
two dimensions, scaling arguments show that all electronic eigenstates are exponentially localized
for an infinitesimal amount of random disorder [49, 50]. In contrast, in three dimensions, a critical
energy forms a phase boundary that separates the extended states from the localized states, known
as a mobility edge. The understanding of these mobility edges is crucial for explaining various
fundamental phenomena such as the metal-insulator transition in solids [51], as well as localization
effects of light in photonic devices [52, 53, 54, 55, 56]. It is therefore necessary to develop micro-
scopic models that exhibit mobility edges to enable detailed investigations. Developing such models
is often more practical in lower dimensions, where introducing quasiperiodicity instead of random
disorder can also result in mobility edges that separate localized and extended phases. Furthermore,
experimental realizations of analytical mobility edges can help resolve the debate on localization in
interacting systems [57, 58]. Indeed, several recent studies have focused on identifying such models
and deriving exact analytic expressions for their mobility edges [59, 60, 61, 62, 63, 64, 65].
Here, we apply KANs to numerical data generated from quasiperiodic tight-binding models to ex-
tract their mobility edges. In particular, we examine three classes of models: the Mosaic model
(MM) [63], the generalized Aubry-André model (GAAM) [62] and the modified Aubry-André
model (MAAM) [60]. For the MM, we test KANs' ability to accurately extract the mobility edge as a 1D function of energy. For the GAAM, we find that the formula obtained from a KAN closely matches the ground truth. For the more complicated MAAM, we demonstrate yet another example of the symbolic interpretability of this framework. A user can simplify the complex expression obtained from KANs (and the corresponding symbolic formulas) by means of a “collaboration” in which the human generates hypotheses to obtain a better match (e.g., making an assumption about the form of certain activation functions), after which KANs can carry out quick hypothesis testing.
To quantify the localization of states in these models, the inverse participation ratio (IPR) is commonly used. The IPR for the k-th eigenstate, ψ^(k), is given by

IPR_k = Σ_n |ψ_n^(k)|⁴ / (Σ_n |ψ_n^(k)|²)²,   (4.2)
where the sum runs over the site index n. Here, we use the related measure of localization – the fractal dimension of the states, given by

D_k = − log(IPR_k) / log(N),   (4.3)

where N is the system size. D_k = 0 (1) indicates localized (extended) states.
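For concreteness, Eqs. (4.2) and (4.3) amount to a few lines of NumPy. The sketch below is illustrative only; the helper name fractal_dimensions is ours and not part of any released code.

    import numpy as np

    def fractal_dimensions(H):
        # Diagonalize H; states[:, k] is the k-th eigenstate psi^(k).
        energies, states = np.linalg.eigh(H)
        prob = np.abs(states) ** 2                               # |psi_n^(k)|^2
        ipr = (prob ** 2).sum(axis=0) / prob.sum(axis=0) ** 2    # Eq. (4.2)
        D = -np.log(ipr) / np.log(H.shape[0])                    # Eq. (4.3), N = system size
        return energies, D                                       # D ~ 0: localized, D ~ 1: extended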
Mosaic Model (MM) We first consider a class of tight-binding models defined by the Hamilto-
nian [63]
H = t Σ_n (c†_{n+1} c_n + H.c.) + Σ_n V_n(λ, ϕ) c†_n c_n,   (4.4)
Figure 4.5: Results for the Mosaic Model. Top: phase diagram. Middle and Bottom: KANs can obtain both qualitative intuition (bottom) and extract quantitative results (middle). φ = (1 + √5)/2 is the golden ratio.
where t is the nearest-neighbor coupling, cn (c†n ) is the annihilation (creation) operator at site n and
the potential energy Vn is given by
V_n(λ, ϕ) = λ cos(2πnb + ϕ) for n = mκ (m ∈ Z), and V_n(λ, ϕ) = 0 otherwise.   (4.5)
To introduce quasiperiodicity, we set b to be irrational (in particular, we choose b to be the golden ratio (1 + √5)/2). κ is an integer and the quasiperiodic potential occurs with interval κ. The energy
(E) spectrum for this model generically contains extended and localized regimes separated by a
mobility edge. Interestingly, a unique feature found here is that the mobility edges are present for an
arbitrarily strong quasiperiodic potential (i.e. there are always extended states present in the system
that co-exist with localized ones).
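As an illustration of Eqs. (4.4)–(4.5), a minimal NumPy sketch of the mosaic-model Hamiltonian on N sites with open boundary conditions is given below; combined with the fractal_dimensions helper above, it generates (E, λ, D_k) samples of the kind used as training data. The function name and default arguments are ours.

    import numpy as np

    def mosaic_hamiltonian(N, lam, phi, kappa=2, t=1.0):
        b = (1 + np.sqrt(5)) / 2                      # irrational b: the golden ratio
        n = np.arange(N)
        # Quasiperiodic potential on every kappa-th site only, Eq. (4.5).
        V = np.where(n % kappa == 0, lam * np.cos(2 * np.pi * n * b + phi), 0.0)
        # Nearest-neighbor hopping plus on-site potential, Eq. (4.4).
        hop = t * (np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1))
        return np.diag(V) + hop

    energies, D = fractal_dimensions(mosaic_hamiltonian(N=610, lam=1.5, phi=0.3))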
The mobility edge can be described by g(λ, E) ≡ λ − |fκ (E)| = 0. g(λ, E) > 0 and g(λ, E) <
0 correspond to localized and extended phases, respectively. Learning the mobility edge therefore
hinges on learning the “order parameter” g(λ, E). Admittedly, this problem can be tackled by many
other theoretical methods for this class of models [63], but we will demonstrate below that our KAN framework is well suited to take in assumptions and inductive biases from human users.
Let us assume a hypothetical user Alice, a new PhD student in condensed matter physics, who is provided with a [2, 1] KAN as an assistant for the task. Firstly, she understands that
this is a classification task, so it is wise to set the activation function in the second layer to be
sigmoid by using the fix_symbolic functionality. Secondly, she realizes that learning the whole
2D function g(λ, E) is unnecessary because in the end she only cares about λ = λ(E) determined
by g(λ, E) = 0. In so doing, it is reasonable to assume g(λ, E) = λ − h(E) = 0. Alice simply sets
the activation function of λ to be linear by again using the fix_symbolic functionality. Now Alice
trains the KAN network and conveniently obtains the mobility edge, as shown in Figure 4.5. Alice
can get both intuitive qualitative understanding (bottom) and quantitative results (middle), which
well match the ground truth (top).
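Alice's workflow can be written compactly with the pykan interface summarized in Table 6 (Appendix A). The sketch below is ours: we write the model as a [2, 1, 1] KAN so that the sigmoid output activation occupies its own layer, dataset denotes the labelled (λ, E) data described above, and the exact argument names may differ across pykan versions.

    from kan import KAN

    model = KAN(width=[2, 1, 1], grid=5, k=3, seed=0)   # inputs assumed ordered as (lambda, E)
    model.fix_symbolic(1, 0, 0, 'sigmoid')              # classification head
    model.fix_symbolic(0, 0, 0, 'x')                    # lambda enters linearly: g = lambda - h(E)
    model.train(dataset, opt='LBFGS', steps=50)         # dataset: localized/extended labels
    model.plot()                                        # read off the mobility edge lambda = |f_kappa(E)|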
System | Origin | Mobility Edge Formula | Accuracy
GAAM | Theory | αE + 2λ − 2 = 0 | 99.2%
GAAM | KAN auto | 1.52E² + 21.06αE + 0.66E + 3.55α² + 0.91α + 45.13λ − 54.45 = 0 | 99.0%
MAAM | Theory | E + exp(p) − λ cosh(p) = 0 | 98.6%
MAAM | KAN auto | 13.99 sin(0.28 sin(0.87λ + 2.22) − 0.84 arctan(0.58E − 0.26) + 0.85 arctan(0.94p + 0.13) − 8.14) − 16.74 + 43.08 exp(−0.93(0.06(0.13 − p)² − 0.27 tanh(0.65E + 0.25) + 0.63 arctan(0.54λ − 0.62) + 1)²) = 0 | 97.1%
MAAM | KAN man (step 2) + auto | 4.19(0.28 sin(0.97λ + 2.17) − 0.77 arctan(0.83E − 0.19) + arctan(0.97p + 0.15) − 0.35)² − 28.93 + 39.27 exp(−0.6(0.28 cosh²(0.49p − 0.16) − 0.34 arctan(0.65E + 0.51) + 0.83 arctan(0.54λ − 0.62) + 1)²) = 0 | 97.7%
MAAM | KAN man (step 3) + auto | −4.63E − 10.25(−0.94 sin(0.97λ − 6.81) + tanh(0.8p − 0.45) + 0.09)² + 11.78 sin(0.76p − 1.41) + 22.49 arctan(1.08λ − 1.32) + 31.72 = 0 | 97.7%
MAAM | KAN man (step 4A) | 6.92E − 6.23(−0.92λ − 1)² + 2572.45(−0.05λ + 0.95 cosh(0.11p + 0.4) − 1)² − 12.96 cosh²(0.53p + 0.16) + 19.89 = 0 | 96.6%
MAAM | KAN man (step 4B) | 7.25E − 8.81(−0.83λ − 1)² − 4.08(−p − 0.04)² + 12.71(−0.71λ + (0.3p + 1)² − 0.86)² + 10.29 = 0 | 95.4%
Table 5: Symbolic formulas for two systems GAAM and MAAM, ground truth ones and KAN-discovered
ones.
Generalized Aubry-André Model (GAAM) We next consider a class of tight-binding models de-
fined by the Hamiltonian [62]
H = t Σ_n (c†_{n+1} c_n + H.c.) + Σ_n V_n(α, λ, ϕ) c†_n c_n,   (4.6)
where t is the nearest-neighbor coupling, cn (c†n ) is the annihilation (creation) operator at site n and
the potential energy Vn is given by
V_n(α, λ, ϕ) = 2λ cos(2πnb + ϕ) / (1 − α cos(2πnb + ϕ)),   (4.7)
which is smooth for α ∈ (−1, 1). To introduce quasiperiodicity, we again set b to be irrational (in
particular, we choose b to be the golden ratio). As before, we would like to obtain an expression for
the mobility edge. For these models, the mobility edge is given by the closed-form expression [62, 64]

αE + 2(λ − 1) = 0.   (4.8)
We randomly sample the model parameters: ϕ, α and λ (setting the energy scale t = 1) and calculate
the energy eigenvalues as well as the fractal dimension of the corresponding eigenstates, which
forms our training dataset.
Here the “order parameter” to be learned is g(α, E, λ, ϕ) = αE + 2(λ − 1), and the mobility edge corresponds to g = 0. Let us again assume that Alice wants to figure out the mobility edge but
only has access to IPR or fractal dimension data, so she decides to use KAN to help her with the
task. Alice wants the model to be as small as possible, so she could either start from a large model
and use auto-pruning to get a small model, or she could guess a reasonable small model based on
her understanding of the complexity of the given problem. Either way, let us assume she arrives
at a [4, 2, 1, 1] KAN. First, she sets the last activation to be sigmoid because this is a classification
problem. She trains her KAN with some sparsity regularization to accuracy 98.7% and visualizes the
trained KAN in Figure 4.6 (a) step 1. She observes that ϕ is not picked up on at all, which makes her
realize that the mobility edge is independent of ϕ (agreeing with Eq. (4.8)). In addition, she observes
that almost all other activation functions are linear or quadratic, so she turns on automatic symbolic
snapping, constraining the library to be only linear or quadratic. After that, she immediately gets a
network which is already symbolic (shown in Figure 4.6 (a) step 2), with comparable (even slightly
better) accuracy 98.9%. By using symbolic_formula functionality, Alice conveniently gets the
symbolic form of g, shown in Table 5 GAAM-KAN auto (row three). Perhaps she wants to cross out some small terms and snap coefficients to small integers, which takes her close to the true answer.
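Using the same Table 6 interface, Alice's two GAAM steps can be sketched as follows; the lib argument of auto_symbolic, used here to restrict the library to linear and quadratic terms, is an assumption based on the behaviour described above.

    from kan import KAN

    model = KAN(width=[4, 2, 1, 1], grid=5, k=3, seed=0)     # inputs (alpha, E, lambda, phi)
    model.fix_symbolic(2, 0, 0, 'sigmoid')                   # classification head (step 1)
    model.train(dataset, opt='LBFGS', steps=50, lamb=1e-3)   # sparsity regularization
    model.auto_symbolic(lib=['x', 'x^2'])                    # snap to linear/quadratic only (step 2)
    print(model.symbolic_formula())                          # symbolic g, cf. Table 5, GAAM-KAN auto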
This hypothetical story for Alice would be completely different if she were using a symbolic regression method. If she is lucky, SR can return the exact correct formula. However, the vast majority of the time SR does not return useful results, and it is impossible for Alice to “debug” or interact with the underlying process of symbolic regression. Furthermore, Alice may feel uncomfortable or inexperienced providing a library of symbolic terms as prior knowledge to SR before SR is run. By contrast, with KANs Alice does not need to supply any prior information. She can first get some clues by staring at a trained KAN, and only then decide which hypothesis she wants to make (e.g., “all activations are linear or quadratic”) and implement her hypothesis in KANs. Although KANs are not likely to return the correct answer immediately, KANs will always return something useful, and Alice can collaborate with them to refine the results.
Modified Aubry-André Model (MAAM) The last class of models we consider is defined by the
Hamiltonian [60]
H = Σ_{n≠n′} t e^{−p|n−n′|} c†_n c_{n′} + H.c. + Σ_n V_n(λ, ϕ) c†_n c_n,   (4.9)
where t is the strength of the exponentially decaying coupling in space, cn (c†n ) is the annihilation
(creation) operator at site n and the potential energy Vn is given by

V_n(λ, ϕ) = λ cos(2πnb + ϕ).   (4.10)

As before, to introduce quasiperiodicity, we set b to be irrational (the golden ratio). For these models, the mobility edge is given by the closed-form expression [60]

λ cosh(p) = E + t,   (4.11)

where we define t1 ≡ t exp(−p) as the nearest-neighbor hopping strength, and we set t1 = 1 below.
Let us assume Alice wants to figure out the mobility edge for MAAM. This task is more complicated
and requires more human wisdom. As in the last example, Alice starts from a [4, 2, 1, 1] KAN and
trains it but gets an accuracy around 75% which is less than acceptable. She then chooses a larger
[4, 3, 1, 1] KAN and successfully gets 98.4% which is acceptable (Figure 4.6 (b) step 1). Alice
notices that ϕ is not picked up on by KANs, which means that the mobility edge is independent of
the phase factor ϕ (agreeing with Eq. (4.11)). If Alice turns on the automatic symbolic regression
(using a large library consisting of exp, tanh, etc.), she would get the complicated formula in Table 5, MAAM-KAN auto, which has 97.1% accuracy. However, if Alice wants to find a simpler symbolic
formula, she will want to use the manual mode where she does the symbolic snapping by herself.
Before that, she finds that the trained [4, 3, 1, 1] KAN can be pruned to [4, 2, 1, 1] while maintaining 97.7% accuracy (Figure 4.6 (b)). Alice may think that all activation functions except
those dependent on p are linear or quadratic and snap them to be either linear or quadratic manually
by using fix_symbolic. After snapping and retraining, the updated KAN is shown in Figure 4.6 (c)
step 3, maintaining 97.7% accuracy. From now on, Alice may make two different choices based on
her prior knowledge. In one case, Alice may have guessed that the dependence on p is cosh, so she
sets the activations of p to be the cosh function. She retrains the KAN and gets 96.9% accuracy (Figure 4.6
(c) Step 4A). In another case, Alice does not know the cosh p dependence, so she pursues simplicity
Figure 4.6: Human-KAN collaboration to discover mobility edges of GAAM and MAAM. The human user can
choose to be lazy (using the auto mode) or more involved (using the manual mode). More details in text.
and again assumes the functions of p to be quadratic. She retrains the KAN and gets 95.4% accuracy (Figure 4.6 (c) Step 4B). If she tried both, she would realize that cosh is better in terms of accuracy, while quadratic is better in terms of simplicity. The formulas corresponding to these steps are listed in Table 5. It is clear that the more manual operations Alice performs, the simpler the symbolic formula is (with a slight sacrifice in accuracy). KANs have a “knob” that a user can tune to trade off between simplicity and accuracy (sometimes simplicity can even lead to better accuracy, as in the
GAAM case).
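The MAAM manual mode follows the same pattern; a brief sketch (symbolic names such as 'cosh' are assumed to be available in the library, and the layer/edge indices are illustrative):

    from kan import KAN

    model = KAN(width=[4, 3, 1, 1], grid=5, k=3, seed=0)     # inputs (E, lambda, p, phi)
    model.fix_symbolic(2, 0, 0, 'sigmoid')
    model.train(dataset, opt='LBFGS', steps=50, lamb=1e-3)   # step 1
    model = model.prune()                                    # step 2: pruned to [4, 2, 1, 1]
    # Step 3: manually snap the non-p activations to 'x' or 'x^2' with fix_symbolic, then retrain.
    # Step 4A: hypothesize a cosh dependence on p and retrain.
    model.fix_symbolic(0, 2, 0, 'cosh')
    model.train(dataset, opt='LBFGS', steps=50)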
5 Related works
Kolmogorov-Arnold theorem and neural networks. The connection between the Kolmogorov-
Arnold theorem (KAT) and neural networks is not new in the literature [66, 67, 9, 10, 11, 12, 13,
14, 68, 69], but the pathological behavior of inner functions makes KAT appear unpromising in
practice [66]. Most of these prior works stick to the original 2-layer, width-(2n + 1) networks, which were limited in expressive power, and many of them even predate back-propagation. Therefore, most studies were built on theory with rather limited or artificial toy experiments. More broadly
speaking, KANs are also somewhat related to generalized additive models (GAMs) [70], graph
neural networks [71] and kernel machines [72]. The connections are intriguing and fundamental but
might be out of the scope of the current paper. Our contribution lies in generalizing the Kolmogorov
network to arbitrary widths and depths, revitalizing and contextualizing them in today's deep learning
stream, as well as highlighting its potential role as a foundation model for AI + Science.
Neural Scaling Laws (NSLs). NSLs are the phenomena where test losses behave as power laws
against model size, data, compute etc [73, 74, 75, 76, 24, 77, 78, 79]. The origin of NSLs still
remains mysterious, but competitive theories include intrinsic dimensionality [73], quantization of
tasks [78], resource theory [79], random features [77], compositional sparsity [66], and maximum arity [25]. This paper contributes to this space by showing that a high-dimensional function can
surprisingly scale as a 1D function (which is the best possible bound one can hope for) if it has
a smooth Kolmogorov-Arnold representation. Our paper brings fresh optimism to neural scaling
laws, since it promises the fastest scaling exponent ever. We have shown in our experiments that
this fast neural scaling law can be achieved on synthetic datasets, but future research is required
to address the question whether this fast scaling is achievable for more complicated tasks (e.g.,
language modeling): Do KA representations exist for general tasks? If so, does our training find
these representations in practice?
Mechanistic Interpretability (MI). MI is an emerging field that aims to mechanistically understand
the inner workings of neural networks [80, 81, 82, 83, 84, 85, 86, 87, 5]. MI research can be roughly
divided into passive and active MI research. Most MI research is passive, focusing on understanding existing neural networks trained with standard methods. Active MI research attempts to achieve
interpretability by designing intrinsically interpretable architectures or developing training methods
to explicitly encourage interpretability [86, 87]. Our work lies in the second category, where the
model and training method are by design interpretable.
Learnable activations. The idea of learnable activations in neural networks is not new in ma-
chine learning. Trainable activation functions are learned in a differentiable way [88, 14, 89, 90] or searched in a discrete way [91]. Activation functions have been parametrized as polynomials [88], splines [14, 92, 93], sigmoid linear units [89], or neural networks [90]. KANs use B-splines to parametrize their activation functions. We also present preliminary results on learnable activation networks (LANs), whose properties lie between those of KANs and MLPs; their results are deferred to Appendix B so that the main paper can focus on KANs.
Symbolic Regression. There are many off-the-shelf symbolic regression methods based on genetic algorithms (Eureqa [94], GPLearn [95], PySR [96]), neural-network-based methods (EQL [97], OccamNet [98]), physics-inspired methods (AI Feynman [36, 37]), and reinforcement-learning-based
methods [99]. KANs are most similar to neural network-based methods, but differ from previous
works in that our activation functions are continuously learned before symbolic snapping rather than
manually fixed [94, 98].
Physics-Informed Neural Networks (PINNs) and Physics-Informed Neural Operators
(PINOs). In Subsection 3.4, we demonstrate that KANs can replace the paradigm of using MLPs for
imposing the PDE loss when solving PDEs. We refer to the Deep Ritz Method [100] and PINNs [38, 39, 101] for PDE solving, and to the Fourier Neural Operator [102], PINOs [103, 104, 105], and DeepONet [106] for operator-learning methods that learn the solution map. There is potential to replace MLPs with KANs
in all the aforementioned networks.
AI for Mathematics. As we saw in Subsection 4.3, AI has recently been applied to several problems
in Knot theory, including detecting whether a knot is the unknot [107, 108] or a ribbon knot [46], and
predicting knot invariants and uncovering relations among them [109, 110, 111, 45]. For a summary
of data science applications to datasets in mathematics and theoretical physics see e.g. [112, 113],
and for ideas how to obtain rigorous results from ML techniques in these fields, see [114].
6 Discussion
In this section, we discuss KANs’ limitations and future directions from the perspective of mathe-
matical foundation, algorithms and applications.
Mathematical aspects: Although we have presented preliminary mathematical analysis of KANs
(Theorem 2.1), our mathematical understanding of them is still very limited. The Kolmogorov-
Arnold representation theorem has been studied thoroughly in mathematics, but the theorem corre-
sponds to KANs with shape [n, 2n + 1, 1], which is a very restricted subclass of KANs. Does our
empirical success with deeper KANs imply something fundamental in mathematics? An appeal-
ing generalized Kolmogorov-Arnold theorem could define “deeper” Kolmogorov-Arnold represen-
tations beyond depth-2 compositions, and potentially relate smoothness of activation functions to
depth. Hypothetically, there exist functions which cannot be represented smoothly in the original
(depth-2) Kolmogorov-Arnold representations, but might be smoothly represented with depth-3 or
beyond. Can we use this notion of “Kolmogorov-Arnold depth” to characterize function classes?
Algorithmic aspects:
(1) Accuracy. Multiple choices in architecture design and training are not fully investigated, so alternatives can potentially further improve accuracy. For example, spline activation functions
might be replaced by radial basis functions or other local kernels. Adaptive grid strategies can
be used.
(2) Efficiency. One major reason why KANs run slowly is that different activation functions cannot leverage batch computation (passing large amounts of data through the same function). Actually, one can
interpolate between activation functions being all the same (MLPs) and all different (KANs),
by grouping activation functions into multiple groups (“multi-head”), where members within a
group share the same activation function.
(3) Hybrid of KANs and MLPs. KANs have two major differences compared to MLPs: (i) activation functions are on edges instead of on nodes, and (ii) activation functions are learnable instead of fixed.
Which change is more essential to explain KAN’s advantage? We present our preliminary results
in Appendix B where we study a model which has (ii), i.e., activation functions are learnable
(like KANs), but not (i), i.e., activation functions are on nodes (like MLPs). Moreover, one can
also construct another model with fixed activations (like MLPs) but on edges (like KANs).
(4) Adaptivity. Thanks to the intrinsic locality of spline basis functions, we can introduce adap-
tivity in the design and training of KANs to enhance both accuracy and efficiency: see the
idea of multi-level training like multigrid methods as in [115, 116], or domain-dependent basis
functions like multiscale methods as in [117].
Application aspects: We have presented some preliminary evidence that KANs are more effective
than MLPs in science-related tasks, e.g., fitting physical equations and PDE solving. We would like
to apply KANs to solve Navier-Stokes equations, density functional theory, or any other tasks that
can be formulated as regression or PDE solving. We would also like to apply KANs to machine-
learning-related tasks, which would require integrating KANs into current architectures, e.g., trans-
formers – one may propose “kansformers” which replace MLPs by KANs in transformers.
KAN as a “language model” for AI + Science The reason why large language models are so transformative is that they are useful to anyone who can speak natural language. The language
of science is functions. KANs are composed of interpretable functions, so when a human user stares
at a KAN, it is like communicating with it using the language of functions. This paragraph aims
to promote the AI-Scientist-Collaboration paradigm rather than our specific tool KANs. Just like
people use different languages to communicate, we expect that in the future KANs will be just one
Figure 6.1: Should I use KANs or MLPs?
of the languages for AI + Science, although KANs will be one of the very first languages that enable AI and humans to communicate. However, enabled by KANs, the AI-Scientist-Collaboration
paradigm has never been this easy and convenient, which leads us to rethink the paradigm of how
we want to approach AI + Science: Do we want AI scientists, or do we want AI that helps scientists?
The intrinsic difficulty of (fully automated) AI scientists is that it is hard to make human preferences quantitative, which would be needed to codify human preferences into AI objectives. In fact, scientists in different
fields may feel differently about which functions are simple or interpretable. As a result, it is more
desirable for scientists to have an AI that can speak the scientific language (functions) and can
conveniently interact with inductive biases of individual scientist(s) to adapt to a specific scientific
domain.
Currently, the biggest bottleneck of KANs lies in their slow training. KANs are usually 10x slower than MLPs, given the same number of parameters. To be honest, we did not try hard to optimize KANs' efficiency, so we deem KANs' slow training more an engineering problem to be improved in the future than a fundamental limitation. If one wants to train a
model fast, one should use MLPs. In other cases, however, KANs should be comparable or better
than MLPs, which makes them worth trying. The decision tree in Figure 6.1 can help decide when
to use a KAN. In short, if you care about interpretability and/or accuracy, and slow training is not a
major concern, we suggest trying KANs, at least for small-scale AI + Science problems.
Acknowledgement
We would like to thank Mikail Khona, Tomaso Poggio, Pingchuan Ma, Rui Wang, Di Luo, Sara
Beery, Catherine Liang, Yiping Lu, Nicholas H. Nelsen, Nikola Kovachki, Jonathan W. Siegel,
Hongkai Zhao, Juncai He, Shi Lab (Humphrey Shi, Steven Walton, Chuanhao Yan) and Matthieu
Darcy for fruitful discussion and constructive suggestions. Z.L., F.R., J.H., M.S. and M.T. are sup-
ported by IAIFI through NSF grant PHY-2019786. The work of FR is in addition supported by
the NSF grant PHY-2210333 and by startup funding from Northeastern University. Y.W and T.H
are supported by the NSF Grant DMS-2205590 and the Choi Family Gift Fund. S. V. and M. S.
acknowledge support from the U.S. Office of Naval Research (ONR) Multidisciplinary University
Research Initiative (MURI) under Grant No. N00014-20-1-2325 on Robust Photonic Materials with
Higher-Order Topological Protection.
References
[1] Simon Haykin. Neural networks: a comprehensive foundation. Prentice Hall PTR, 1994.
[3] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks
are universal approximators. Neural networks, 2(5):359–366, 1989.
[4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural informa-
tion processing systems, 30, 2017.
[5] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse
autoencoders find highly interpretable features in language models. arXiv preprint
arXiv:2309.08600, 2023.
[8] Jürgen Braun and Michael Griebel. On a constructive proof of kolmogorov’s superposition
theorem. Constructive approximation, 30:653–675, 2009.
[9] David A Sprecher and Sorin Draghici. Space-filling curves and kolmogorov superposition-
based neural networks. Neural Networks, 15(1):57–67, 2002.
[10] Mario Köppen. On the training of a kolmogorov network. In Artificial Neural Net-
works—ICANN 2002: International Conference Madrid, Spain, August 28–30, 2002 Pro-
ceedings 12, pages 474–479. Springer, 2002.
[11] Ji-Nan Lin and Rolf Unbehauen. On the realization of a kolmogorov network. Neural Com-
putation, 5(1):18–20, 1993.
[12] Ming-Jun Lai and Zhaiming Shen. The kolmogorov superposition theorem can break the
curse of dimensionality when approximating high dimensional functions. arXiv preprint
arXiv:2112.09963, 2021.
[13] Pierre-Emmanuel Leni, Yohan D Fougerolle, and Frédéric Truchetet. The kolmogorov spline
network for image processing. In Image Processing: Concepts, Methodologies, Tools, and
Applications, pages 54–78. IGI Global, 2013.
[14] Daniele Fakhoury, Emanuele Fakhoury, and Hendrik Speleers. Exsplinet: An interpretable
and expressive spline-based neural network. Neural Networks, 152:332–346, 2022.
[15] Hadrien Montanelli and Haizhao Yang. Error bounds for deep relu networks using the
kolmogorov–arnold superposition theorem. Neural Networks, 129:1–6, 2020.
[16] Juncai He. On the optimal expressive power of relu dnns and its application in approximation
with kolmogorov superposition theorem. arXiv preprint arXiv:2308.05509, 2023.
[17] Juncai He, Lin Li, Jinchao Xu, and Chunyue Zheng. Relu deep neural networks and linear
finite elements. arXiv preprint arXiv:1807.03973, 2018.
[18] Juncai He and Jinchao Xu. Deep neural networks and finite elements of any order on arbitrary
dimensions. arXiv preprint arXiv:2312.14276, 2023.
[19] Tomaso Poggio, Andrzej Banburski, and Qianli Liao. Theoretical issues in deep networks.
Proceedings of the National Academy of Sciences, 117(48):30039–30045, 2020.
[20] Federico Girosi and Tomaso Poggio. Representation properties of networks: Kolmogorov’s
theorem is irrelevant. Neural Computation, 1(4):465–469, 1989.
[21] Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work
so well? Journal of Statistical Physics, 168:1223–1247, 2017.
[22] Hongyi Xu, Funshing Sin, Yufeng Zhu, and Jernej Barbič. Nonlinear material design using
principal stretches. ACM Transactions on Graphics (TOG), 34(4):1–11, 2015.
[23] Carl De Boor. A practical guide to splines, volume 27. springer-verlag New York, 1978.
[24] Utkarsh Sharma and Jared Kaplan. A neural scaling law from the dimension of the data
manifold. arXiv preprint arXiv:2004.10802, 2020.
[25] Eric J Michaud, Ziming Liu, and Max Tegmark. Precision machine learning. Entropy,
25(1):175, 2023.
[26] Joel L Horowitz and Enno Mammen. Rate-optimal estimation for a general class of nonpara-
metric regression models with unknown link functions. 2007.
[27] Michael Kohler and Sophie Langer. On the rate of convergence of fully connected deep neural
network regression estimates. The Annals of Statistics, 49(4):2231–2249, 2021.
[28] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu
activation function. 2020.
[29] Ronald A DeVore, Ralph Howard, and Charles Micchelli. Optimal nonlinear approximation.
Manuscripta mathematica, 63:469–478, 1989.
[30] Ronald A DeVore, George Kyriazis, Dany Leviatan, and Vladimir M Tikhomirov. Wavelet
compression and nonlinear n-widths. Adv. Comput. Math., 1(2):197–214, 1993.
[31] Jonathan W Siegel. Sharp lower bounds on the manifold widths of sobolev and besov spaces.
arXiv preprint arXiv:2402.04407, 2024.
[32] Dmitry Yarotsky. Error bounds for approximations with deep relu networks. Neural Net-
works, 94:103–114, 2017.
[33] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight vc-
dimension and pseudodimension bounds for piecewise linear neural networks. Journal of
Machine Learning Research, 20(63):1–17, 2019.
[34] Jonathan W Siegel. Optimal approximation rates for deep relu neural networks on sobolev
and besov spaces. Journal of Machine Learning Research, 24(357):1–52, 2023.
[35] Yongji Wang and Ching-Yao Lai. Multi-stage neural networks: Function approximator of
machine precision. Journal of Computational Physics, page 112865, 2024.
[36] Silviu-Marian Udrescu and Max Tegmark. Ai feynman: A physics-inspired method for sym-
bolic regression. Science Advances, 6(16):eaay2631, 2020.
[37] Silviu-Marian Udrescu, Andrew Tan, Jiahai Feng, Orisvaldo Neto, Tailin Wu, and Max
Tegmark. Ai feynman 2.0: Pareto-optimal symbolic regression exploiting graph modular-
ity. Advances in Neural Information Processing Systems, 33:4860–4871, 2020.
[38] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural net-
works: A deep learning framework for solving forward and inverse problems involving non-
linear partial differential equations. Journal of Computational physics, 378:686–707, 2019.
[39] George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu
Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.
[40] Ronald Kemker, Marc McClure, Angelina Abitino, Tyler Hayes, and Christopher Kanan.
Measuring catastrophic forgetting in neural networks. In Proceedings of the AAAI conference
on artificial intelligence, volume 32, 2018.
[41] Bryan Kolb and Ian Q Whishaw. Brain plasticity and behavior. Annual review of psychology,
49(1):43–64, 1998.
[42] David Meunier, Renaud Lambiotte, and Edward T Bullmore. Modular and hierarchically
modular organization of brain networks. Frontiers in neuroscience, 4:7572, 2010.
[43] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins,
Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska,
et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national
academy of sciences, 114(13):3521–3526, 2017.
[44] Aojun Lu, Tao Feng, Hangjie Yuan, Xiaotian Song, and Yanan Sun. Revisiting neural net-
works for continual learning: An architectural perspective, 2024.
[45] Alex Davies, Petar Veličković, Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Tomašev,
Richard Tanburn, Peter Battaglia, Charles Blundell, András Juhász, et al. Advancing mathe-
matics by guiding human intuition with ai. Nature, 600(7887):70–74, 2021.
[46] Sergei Gukov, James Halverson, Ciprian Manolescu, and Fabian Ruehle. Searching for rib-
bons with machine learning, 2023.
[47] P. Petersen. Riemannian Geometry. Graduate Texts in Mathematics. Springer New York,
2006.
[48] Philip W Anderson. Absence of diffusion in certain random lattices. Physical review,
109(5):1492, 1958.
[49] David J Thouless. A relation between the density of states and range of localization for one
dimensional random systems. Journal of Physics C: Solid State Physics, 5(1):77, 1972.
[50] Elihu Abrahams, PW Anderson, DC Licciardello, and TV Ramakrishnan. Scaling theory
of localization: Absence of quantum diffusion in two dimensions. Physical Review Letters,
42(10):673, 1979.
[51] Ad Lagendijk, Bart van Tiggelen, and Diederik S Wiersma. Fifty years of anderson localiza-
tion. Physics today, 62(8):24–29, 2009.
[52] Mordechai Segev, Yaron Silberberg, and Demetrios N Christodoulides. Anderson localization
of light. Nature Photonics, 7(3):197–204, 2013.
[53] Z Valy Vardeny, Ajay Nahata, and Amit Agrawal. Optics of photonic quasicrystals. Nature
photonics, 7(3):177–187, 2013.
[54] Sajeev John. Strong localization of photons in certain disordered dielectric superlattices.
Physical review letters, 58(23):2486, 1987.
[55] Yoav Lahini, Rami Pugatch, Francesca Pozzi, Marc Sorel, Roberto Morandotti, Nir David-
son, and Yaron Silberberg. Observation of a localization transition in quasiperiodic photonic
lattices. Physical review letters, 103(1):013901, 2009.
[56] Sachin Vaidya, Christina Jörg, Kyle Linn, Megan Goh, and Mikael C Rechtsman. Reen-
trant delocalization transition in one-dimensional photonic quasicrystals. Physical Review
Research, 5(3):033170, 2023.
[57] Wojciech De Roeck, Francois Huveneers, Markus Müller, and Mauro Schiulaz. Absence of
many-body mobility edges. Physical Review B, 93(1):014203, 2016.
[58] Xiaopeng Li, Sriram Ganeshan, JH Pixley, and S Das Sarma. Many-body localization and
quantum nonergodicity in a model with a single-particle mobility edge. Physical review
letters, 115(18):186601, 2015.
[59] Fangzhao Alex An, Karmela Padavić, Eric J Meier, Suraj Hegde, Sriram Ganeshan, JH Pixley,
Smitha Vishveshwara, and Bryce Gadway. Interactions and mobility edges: Observing the
generalized aubry-andré model. Physical review letters, 126(4):040603, 2021.
[60] J Biddle and S Das Sarma. Predicted mobility edges in one-dimensional incommensurate
optical lattices: An exactly solvable model of anderson localization. Physical review letters,
104(7):070601, 2010.
[61] Alexander Duthie, Sthitadhi Roy, and David E Logan. Self-consistent theory of mobility
edges in quasiperiodic chains. Physical Review B, 103(6):L060201, 2021.
[62] Sriram Ganeshan, JH Pixley, and S Das Sarma. Nearest neighbor tight binding models with
an exact mobility edge in one dimension. Physical review letters, 114(14):146601, 2015.
[63] Yucheng Wang, Xu Xia, Long Zhang, Hepeng Yao, Shu Chen, Jiangong You, Qi Zhou, and
Xiong-Jun Liu. One-dimensional quasiperiodic mosaic lattice with exact mobility edges.
Physical Review Letters, 125(19):196604, 2020.
[64] Yucheng Wang, Xu Xia, Yongjian Wang, Zuohuan Zheng, and Xiong-Jun Liu. Duality be-
tween two generalized aubry-andré models with exact mobility edges. Physical Review B,
103(17):174205, 2021.
[65] Xin-Chi Zhou, Yongjian Wang, Ting-Fung Jeffrey Poon, Qi Zhou, and Xiong-Jun Liu.
Exact new mobility edges between critical and localized states. Physical Review Letters,
131(17):176401, 2023.
[66] Tomaso Poggio. How deep sparse networks avoid the curse of dimensionality: Efficiently
computable functions are compositionally sparse. CBMM Memo, 10:2022, 2022.
[68] Aysu Ismayilova and Vugar E Ismailov. On the kolmogorov neural networks. Neural Net-
works, page 106333, 2024.
[69] Michael Poluektov and Andrew Polar. A new iterative method for construction of the
kolmogorov-arnold representation. arXiv preprint arXiv:2305.08194, 2023.
[70] Rishabh Agarwal, Levi Melnick, Nicholas Frosst, Xuezhou Zhang, Ben Lengerich, Rich
Caruana, and Geoffrey E Hinton. Neural additive models: Interpretable machine learning
with neural nets. Advances in neural information processing systems, 34:4699–4711, 2021.
[71] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdi-
nov, and Alexander J Smola. Deep sets. Advances in neural information processing systems,
30, 2017.
[72] Huan Song, Jayaraman J Thiagarajan, Prasanna Sattigeri, and Andreas Spanias. Optimizing
kernel machines using deep learning. IEEE transactions on neural networks and learning
systems, 29(11):5528–5540, 2018.
[73] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon
Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural
language models. arXiv preprint arXiv:2001.08361, 2020.
[74] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Hee-
woo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive
generative modeling. arXiv preprint arXiv:2010.14701, 2020.
[75] Mitchell A Gordon, Kevin Duh, and Jared Kaplan. Data and parameter scaling laws for neural
machine translation. In ACL Rolling Review - May 2021, 2021.
[76] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kia-
ninejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is
predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
[77] Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining
neural scaling laws. arXiv preprint arXiv:2102.06701, 2021.
[78] Eric J Michaud, Ziming Liu, Uzay Girit, and Max Tegmark. The quantization model of neural
scaling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[79] Jinyeop Song, Ziming Liu, Max Tegmark, and Jeff Gore. A resource model for neural scaling
law. arXiv preprint arXiv:2402.05164, 2024.
[80] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom
Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning
and induction heads. arXiv preprint arXiv:2209.11895, 2022.
[81] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual
associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372,
2022.
[82] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt.
Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The
Eleventh International Conference on Learning Representations, 2023.
[83] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna
Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of
superposition. arXiv preprint arXiv:2209.10652, 2022.
[84] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress
measures for grokking via mechanistic interpretability. In The Eleventh International Con-
ference on Learning Representations, 2023.
[85] Ziqian Zhong, Ziming Liu, Max Tegmark, and Jacob Andreas. The clock and the pizza:
Two stories in mechanistic explanation of neural networks. In Thirty-seventh Conference on
Neural Information Processing Systems, 2023.
[86] Ziming Liu, Eric Gan, and Max Tegmark. Seeing is believing: Brain-inspired modular train-
ing for mechanistic interpretability. Entropy, 26(1):41, 2023.
[87] Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston,
Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda
Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli,
Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav
Fort, Saurav Kadavath, Josh Jacobson, Eli Tran-Johnson, Jared Kaplan, Jack Clark, Tom
Brown, Sam McCandlish, Dario Amodei, and Christopher Olah. Softmax linear units. Trans-
former Circuits Thread, 2022. https://transformer-circuits.pub/2022/solu/index.html.
[88] Mohit Goyal, Rajan Goyal, and Brejesh Lall. Learning activation functions: A new paradigm
for understanding neural networks. arXiv preprint arXiv:1906.09529, 2019.
[89] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv
preprint arXiv:1710.05941, 2017.
[90] Shijun Zhang, Zuowei Shen, and Haizhao Yang. Neural network architecture beyond width
and depth. Advances in Neural Information Processing Systems, 35:5669–5681, 2022.
[91] Garrett Bingham and Risto Miikkulainen. Discovering parametric activation functions. Neu-
ral Networks, 148:48–65, 2022.
[92] Pakshal Bohra, Joaquim Campos, Harshit Gupta, Shayan Aziznejad, and Michael Unser.
Learning activation functions in deep (spline) neural networks. IEEE Open Journal of Signal
Processing, 1:295–309, 2020.
[93] Shayan Aziznejad and Michael Unser. Deep spline networks with control of lipschitz regular-
ity. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 3242–3246. IEEE, 2019.
[94] Renáta Dubcáková. Eureqa: software review. Genetic Programming and Evolvable Ma-
chines, 12:173–178, 2011.
[96] Miles Cranmer. Interpretable machine learning for science with PySR and SymbolicRegression.jl. arXiv preprint arXiv:2305.01582, 2023.
[97] Georg Martius and Christoph H Lampert. Extrapolation and learning equations. arXiv
preprint arXiv:1610.02995, 2016.
[98] Owen Dugan, Rumen Dangovski, Allan Costa, Samuel Kim, Pawan Goyal, Joseph Jacobson,
and Marin Soljačić. Occamnet: A fast neural model for symbolic regression at scale. arXiv
preprint arXiv:2007.10784, 2020.
[99] Terrell N. Mundhenk, Mikel Landajuela, Ruben Glatt, Claudio P. Santiago, Daniel faissol,
and Brenden K. Petersen. Symbolic regression via deep reinforcement learning enhanced
genetic programming seeding. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman
Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
[100] Bing Yu et al. The deep ritz method: a deep learning-based numerical algorithm for solving
variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.
[101] Junwoo Cho, Seungtae Nam, Hyunmo Yang, Seok-Bae Yun, Youngjoon Hong, and Eun-
byung Park. Separable physics-informed neural networks. Advances in Neural Information
Processing Systems, 36, 2024.
[102] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya,
Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial dif-
ferential equations. arXiv preprint arXiv:2010.08895, 2020.
[103] Zongyi Li, Hongkai Zheng, Nikola Kovachki, David Jin, Haoxuan Chen, Burigede Liu, Kam-
yar Azizzadenesheli, and Anima Anandkumar. Physics-informed neural operator for learning
partial differential equations. ACM/JMS Journal of Data Science, 2021.
[104] Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya,
Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function
spaces with applications to pdes. Journal of Machine Learning Research, 24(89):1–97, 2023.
[105] Haydn Maust, Zongyi Li, Yixuan Wang, Daniel Leibovici, Oscar Bruno, Thomas Hou,
and Anima Anandkumar. Fourier continuation for exact derivative computation in physics-
informed neural operators. arXiv preprint arXiv:2211.15960, 2022.
[106] Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learn-
ing nonlinear operators via deeponet based on the universal approximation theorem of opera-
tors. Nature machine intelligence, 3(3):218–229, 2021.
[107] Sergei Gukov, James Halverson, Fabian Ruehle, and Piotr Sułkowski. Learning to Unknot.
Mach. Learn. Sci. Tech., 2(2):025035, 2021.
[109] Mark C Hughes. A neural network approach to predicting and computing knot invariants.
Journal of Knot Theory and Its Ramifications, 29(03):2050005, 2020.
[110] Jessica Craven, Vishnu Jejjala, and Arjun Kar. Disentangling a deep learned volume formula.
JHEP, 06:040, 2021.
[111] Jessica Craven, Mark Hughes, Vishnu Jejjala, and Arjun Kar. Illuminating new and known
relations between knot invariants. 11 2022.
[112] Fabian Ruehle. Data science applications to string theory. Phys. Rept., 839:1–117, 2020.
[113] Y.H. He. Machine Learning in Pure Mathematics and Theoretical Physics. G - Refer-
ence,Information and Interdisciplinary Subjects Series. World Scientific, 2023.
[114] Sergei Gukov, James Halverson, and Fabian Ruehle. Rigor with machine learning from field theory to the Poincaré conjecture. Nature Reviews Physics, 2024.
[115] Shumao Zhang, Pengchuan Zhang, and Thomas Y Hou. Multiscale invertible generative
networks for high-dimensional bayesian inference. In International Conference on Machine
Learning, pages 12632–12641. PMLR, 2021.
[116] Jinchao Xu and Ludmil Zikatanov. Algebraic multigrid methods. Acta Numerica, 26:591–
721, 2017.
[117] Yifan Chen, Thomas Y Hou, and Yixuan Wang. Exponentially convergent multiscale finite
element method. Communications on Applied Mathematics and Computation, pages 1–17,
2023.
[118] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein.
Implicit neural representations with periodic activation functions. Advances in neural infor-
mation processing systems, 33:7462–7473, 2020.
Appendix
A KAN Functionalities
Table 6 includes common functionalities that users may find useful.
Functionality | Description
model.train(dataset) | train the model on a dataset
model.plot() | plot the model
model.prune() | prune the model
model.fix_symbolic(l,i,j,fun) | fix the activation function ϕl,i,j to be the symbolic function fun
model.suggest_symbolic(l,i,j) | suggest symbolic functions that match the numerical value of ϕl,i,j
model.auto_symbolic() | use the top-1 symbolic suggestion from suggest_symbolic to replace all activation functions
model.symbolic_formula() | return the symbolic formula
Table 6: KAN functionalities
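For orientation, a typical session strings these calls together as below. This is a sketch only; exact signatures may vary across pykan versions, and dataset denotes the usual dictionary of train/test inputs and labels.

    from kan import KAN

    model = KAN(width=[2, 5, 1], grid=5, k=3)
    model.train(dataset)                  # fit the splines on the dataset
    model.plot()                          # inspect the learned activations
    model = model.prune()                 # remove inactive edges/nodes
    model.suggest_symbolic(0, 0, 0)       # candidate symbolic matches for phi_{0,0,0}
    model.auto_symbolic()                 # snap every activation to its top suggestion
    print(model.symbolic_formula())       # final symbolic expression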
Figure B.1: Training of a learnable activation network (LAN) on the toy example f (x, y) = exp(sin(πx)+y 2 ).
Figure B.2: LANs on synthetic examples. LANs do not appear to be very interpretable. We conjecture that the weight matrices leave too many degrees of freedom.
the existence of weight matrices. First, weight matrices are less readily interpretable than learn-
able activation functions. Second, weight matrices bring in too many degrees of freedom, making
learnable activation functions too unconstrained. Our preliminary results with LANs seem to imply
that getting rid of linear weight matrices (by having learnable activations on edges, like KANs) is
necessary for interpretability.
Figure B.3: A SIREN network (fixed sine activations) can be adapted to LANs (learnable activations) to im-
prove image representations.
b(x) to sine functions, the same setup as in SIREN but let spline(x) be trainable. For both MLP
and LAN, the shape is [2,128,128,128,128,128,1]. We train them with the Adam optimizer, batch
size 4096, for 5000 steps with learning rate 10−3 and 5000 steps with learning rate 10−4 . As shown
in Figure B.3, the LAN (orange) can achieve higher PSNR than the MLP (blue) due to the LAN’s
flexibility to fine tune activation functions. We show that it is also possible to initialize a LAN from
an MLP and further fine tune the LAN (green) for better PSNR. We have chosen G = 5 in our
experiments, so the additional parameter increase is roughly G/N = 5/128 ≈ 4% over the original
parameters.
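The two-stage schedule above can be summarized in a short PyTorch sketch. LAN and sample_batch are hypothetical placeholders for the learnable-activation network and the image-pixel sampler used here (they are not public pykan APIs); only the optimizer settings mirror the text.

    import torch

    model = LAN(shape=[2, 128, 128, 128, 128, 128, 1], G=5)   # hypothetical LAN constructor
    loss_fn = torch.nn.MSELoss()
    for lr, steps in [(1e-3, 5000), (1e-4, 5000)]:            # two-stage Adam schedule
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            coords, pixels = sample_batch(batch_size=4096)    # hypothetical image sampler
            opt.zero_grad()
            loss_fn(model(coords), pixels).backward()
            opt.step()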
C Dependence on hyperparameters
We show the effects of hyperparameters on the f(x, y) = exp(sin(πx) + y²) case in Figure C.1. To
get an interpretable graph, we want the number of active activation functions to be as small (ideally
3) as possible.
(1) We need entropy penalty to reduce the number of active activation functions. Without entropy
penalty, there are many duplicate functions.
(2) Results can depend on random seeds. With some unlucky seed, the pruned network could be
larger than needed.
(4) The grid number G also has a subtle effect on interpretability. When G is too small, because each activation function is not very expressive, the network tends to use an ensembling strategy, making interpretation harder.
(5) The piecewise polynomial order k only has a subtle effect on interpretability. However, it be-
haves a bit like the random seeds which do not display any visible pattern in this toy example.
D Feynman KANs
We include more results on the Feynman dataset (Section 3.3). Figure D.1 shows the Pareto frontiers of KANs and MLPs for each Feynman dataset. Figures D.3 and D.2 visualize minimal KANs (under the constraint test RMSE < 10⁻²) and best KANs (with the lowest test RMSE loss) for each
Feynman equation fitting task.
Figure C.1: Effects of hyperparameters on interpretability results.
[Figure D.1 panels: Feynman equations I.6.2, I.6.2b, I.9.18, I.12.11; axes: RMSE versus number of parameters; curves: KAN train, KAN test.]
Figure D.1: The Pareto Frontiers of KANs and MLPs for Feynman datasets.
Figure D.2: Best Feynman KANs
Figure D.3: Minimal Feynman KANs
Figure F.1: Best special KANs
Figure F.2: Minimal special KANs
KAN 2.0:
Kolmogorov-Arnold Networks Meet Science
Ziming Liu1,4∗ Pingchuan Ma1,3 Yixuan Wang2 Wojciech Matusik1,3 Max Tegmark1,4
1 Massachusetts Institute of Technology
2 California Institute of Technology
3 Computer Science and Artificial Intelligence Laboratory (CSAIL), MIT
4 The NSF Institute for Artificial Intelligence and Fundamental Interactions
arXiv:2408.10205v1 [cs.LG] 19 Aug 2024
Abstract
To be more concrete, scientific explanations may have different levels, ranging from the coars-
est/easiest/correlational to the finest/hardest/causal:
• Important features: For example, “y is fully determined by x1 and x2, while other factors do not matter.” In other words, there exists a function f such that y = f (x1, x2).
• Modular structures: For instance, “x1 and x2 contribute to y independently in an additive way.” This means there exist functions g and h such that y = g(x1) + h(x2).
• Symbolic formulas: For example, “y depends on x1 as a sine function and on x2 as an
exponential function”. In other words, y = sin(x1 ) + exp(x2 ).
The paper reports on how to incorporate and extract these properties from KANs. The structure of
the paper is as follows (illustrated in Figure 1): In Section 2, we augment the original KAN with
multiplication nodes, introducing a new model called MultKAN. In Section 3, we explore ways to
embed scientific inductive biases into KANs, focusing on important features (Section 3.1), modular
Figure 2: Top: comparing KAN and MultKAN diagrams. MultKAN has extra multiplication layers M. Bot-
tom: After training on f (x, y) = xy, KAN learns an algorithm requiring two addition nodes, while MultKAN
requires only one multiplication node.
structures (Section 3.2), and symbolic formulas (Section 3.3). In Section 4, we propose methods to
extract scientific knowledge from KANs, again covering important features (Section 4.1), modular
structures (Section 4.2), and symbolic formulas (Section 4.3). In Section 5, we apply KANs to
various scientific discovery tasks using the tools developed in the previous sections. These tasks
include discovering conserved quantities, symmetries, Lagrangians, and constitutive laws. Codes
are available at https://github.com/KindXiaoming/pykan and can also be installed via pip
install pykan. Although the title of the paper is “KAN 2.0”, the release version of pykan is
0.2.x.
This implies that addition is the only true multivariate operation, while other multivariate operations (including multiplication) can be expressed as additions combined with univariate functions. For example, to multiply two positive numbers x and y, we can express this as xy = exp(log x + log y),² whose right-hand side only consists of addition and univariate functions (log and exp).
² If x and y can be negative, one may choose a large c > 0 and express xy = exp(log(x + c) + log(y + c)) − c(x + y) − c². Other constructions include quadratic functions, such as xy = ((x + y)² − (x − y)²)/4 or xy = ((x + y)² − x² − y²)/2.
However, given the prevalence of multiplications in both science and everyday life, it is desirable
to explicitly include multiplications in KANs, which could potentially enhance both interpretability
and capacity.
Kolmogorov-Arnold Network (KAN) While the KART Eq. (1) corresponds to a two-layer net-
work, Liu et al. [57] managed to extend it to arbitrary depths by recognizing that seemingly different
outer functions Φq and inner functions ϕq,p can be unified through their proposed KAN layers. A
depth-L KAN can be constructed simply by stacking L KAN layers. The shape of a depth-L KAN is
represented by an integer array [n0, n1, · · · , nL], where nl denotes the number of neurons in the l-th neuron layer. The l-th KAN layer, with nl input dimensions and nl+1 output dimensions, transforms an input vector xl ∈ R^{nl} to xl+1 ∈ R^{nl+1} via

xl+1 = Φl xl,   Φl = [ϕl,i,j(·)],  1 ≤ i ≤ nl,  1 ≤ j ≤ nl+1,   (2)

where Φl is the nl+1 × nl matrix of univariate activation functions, with ϕl,i,j placed in row j and column i, so that (xl+1)j = Σ_{i=1}^{nl} ϕl,i,j((xl)i).
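In code, Eq. (2) is simply a sum of univariate functions per output neuron. The following minimal NumPy sketch is ours, with toy activations standing in for learned splines.

    import numpy as np

    def kan_layer(x, phi):
        # phi[j][i] is the univariate function on the edge from input i to output j,
        # so (x_{l+1})_j = sum_i phi[j][i]((x_l)_i), as in Eq. (2).
        return np.array([sum(f(x[i]) for i, f in enumerate(row)) for row in phi])

    phi = [[np.sin, np.exp]]                        # n_l = 2 inputs, n_{l+1} = 1 output
    x_next = kan_layer(np.array([0.5, -1.0]), phi)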
Figure 3: Adding auxiliary variables to inputs enhances interpretability. For the relativistic mass equation, m = m0/√(1 − v²/c²), (a) a two-layer KAN is needed if only (m0, v, c) are used as inputs. (b) If we add β ≡ v/c and γ ≡ 1/√(1 − β²) as auxiliary variables to KANs, a one-layer KAN suffices (seed 0). (c) Seed 1 finds a different solution, which is sub-optimal and can be avoided through hypothesis testing (Section 4.3).
the multiplication task f (x, y) = xy, the MultKAN indeed learns to use one multiplication node,
making it perform simple multiplication, as all the learned activation functions are linear (Figure 2
bottom right).
Although KANs have previously been seen as a special case of MultKANs, we extend the definition
and treat “KAN” and “MultKAN” as synonyms. By default, when we refer to KANs, multiplication
is allowed. If we specifically refer to a KAN without multiplication, we will explicitly state so.
3 Science to KANs
In science, domain knowledge is crucial, allowing us to work effectively even with small or zero
data. Therefore, it is beneficial to adopt a physics-informed approach for KANs: we should in-
corporate available inductive biases into KANs while preserving their flexibility to discover new
physics from data.
We explore three types of inductive biases that can be integrated into KANs. From the coars-
est/easiest/correlational to the finest/hardest/causal, they are important features (Section 3.1), mod-
ular structures (Section 3.2) and symbolic formulas (Section 3.3).
3.1 Adding important features to KANs
In a regression problem, the goal is to find a function f such that y = f (x1 , x2 , · · · , xn ). Suppose
we want to introduce an auxiliary input variable a = a(x1 , x2 , . . . , xn ), transforming the function
to y = f (x1 , · · · , xn , a). Although the auxiliary variable a does not add new information, it can
increase the expressive power of the neural network. This is because the network does not need to
expend resources to calculate the auxiliary variable. Additionally, the computations may become
simpler, leading to improved interpretability. Users can add auxiliary features to inputs using the
augment_input method:
model.augment_input(original_variables, auxiliary_variables, dataset) (6)
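As a rough sketch of how this call might look for the relativistic-mass example discussed next, assuming pykan-style KAN and create_dataset helpers: the ranges keyword, the model width, and the fit call are assumptions that may differ across pykan versions, while the augment_input signature follows Eq. (6).

```python
import torch
from sympy import symbols, sqrt
from kan import KAN, create_dataset   # import paths assumed; may differ across pykan versions

# Relativistic mass m(m0, v, c) = m0 / sqrt(1 - (v/c)^2), with v < c in the sampled ranges.
f = lambda x: x[:, [0]] / torch.sqrt(1 - (x[:, [1]] / x[:, [2]]) ** 2)
dataset = create_dataset(f, n_var=3, ranges=[[1, 2], [0, 0.9], [1, 1.1]])  # 'ranges' kwarg assumed

m0, v, c = symbols('m0 v c')
beta = v / c                        # auxiliary variable beta = v/c
gamma = 1 / sqrt(1 - beta ** 2)     # auxiliary variable gamma = 1/sqrt(1 - beta^2)

model = KAN(width=[5, 1])           # 3 original + 2 auxiliary inputs (width is an assumption)
model.augment_input([m0, v, c], [beta, gamma], dataset)   # signature as in Eq. (6)
model.fit(dataset, steps=50)        # 'fit' assumed; older pykan versions use 'train'
```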
As an example, consider the formula for relativistic mass $m(m_0, v, c) = m_0/\sqrt{1 - (v/c)^2}$, where $m_0$ is the rest mass, $v$ is the velocity of the point mass, and $c$ is the speed of light. Since physicists often work with the dimensionless numbers $\beta \equiv v/c$ and $\gamma \equiv 1/\sqrt{1 - \beta^2} \equiv 1/\sqrt{1 - (v/c)^2}$, they might introduce $\beta$ and $\gamma$ alongside $v$ and $c$ as inputs. Figure 3 shows KANs with and without these auxiliary variables: (a) illustrates the KAN compiled from the symbolic formula (see Section 3.3 for the KAN compiler), which requires 5 edges; (b)(c) show KANs with auxiliary variables, requiring only 2 or 3 edges and achieving losses of $10^{-6}$ and $10^{-4}$, respectively. Note that (b) and (c) differ only in random seeds. Seed 1 represents a sub-optimal solution because it also identifies $\beta = v/c$ as a key feature. This is not surprising, since in the classical limit $v \ll c$, $\gamma \equiv 1/\sqrt{1 - (v/c)^2} \approx 1 + (v/c)^2/2 = 1 + \beta^2/2$. The variation due to different seeds can be seen either as a feature or
a bug: as a feature, this diversity can help find sub-optimal solutions which may nevertheless offer interesting insights; as a bug, it can be eliminated using the hypothesis testing method proposed in Section 4.3.
Figure 4: Building modular structures to KANs: (a) multiplicative separability; (b) symmetries.
3.2 Building modular structures to KANs
Modularity is prevalent in nature: for example, the human cerebral cortex is divided into several
functionally distinct modules, each responsible for specific tasks such as percep-
tion or decision making. This modularity simplifies the understanding of neural networks, as it
allows us to interpret clusters of neurons collectively rather than analyzing each neuron individually.
Structural modularity is characterized by clusters of connections where intra-cluster connections are
much stronger than inter-cluster ones. To enforce modularity, we introduce the module method,
which preserves intra-cluster connections while removing inter-cluster connections. The modules
are specified by users. The syntax is
model.module(start_layer_id, ‘[nodes_id]->[subnodes_id]->[nodes_id]...’)
(7)
For example, if a user wants to assign specific nodes/subnodes to a module – say, the 0th node
in layer 1, the 1st and 3rd subnode in layer 1, the 1st and 3rd node in layer 2 – they might use
module(1,‘[0]->[1,3]->[1,3]’). To be concrete, there are two types of modularity: separabil-
ity and symmetry.
Separability We say a function is separable if it can be expressed as a
sum or product of functions of non-overlapping variable groups. For example, a four-
variable function f (x1 , x2 , x3 , x4 ) is maximally multiplicatively separable if it has the form
f1 (x1 )f2 (x2 )f3 (x3 )f4 (x4 ), creating four distinct groups (1), (2), (3), (4). Users can create these
modules by calling the module method four times: module(0,‘[i]->[i]’), i = 0, 1, 2, 3,
as shown in Figure 4 (a). The final call may be skipped since the first three are sufficient to de-
fine the groups. Weaker forms of multiplicative separability might be f1 (x1 , x2 )f2 (x3 , x4 ) (calling
module(0,‘[0,1]->[0,1]’)) or f1 (x1 )f2 (x2 , x3 , x4 ) (calling module(0,‘[0]->[0]’)).
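A minimal sketch of these calls, assuming a pykan-style model whose layer 0 has four inputs; the width and import path are assumptions, while the module syntax follows Eq. (7).

```python
from kan import KAN   # import path assumed

model = KAN(width=[4, 4, 1])   # width is an assumption; layer 0 has four inputs

# Maximal multiplicative separability f1(x1) f2(x2) f3(x3) f4(x4):
# input i may only connect to subnode/node i in layer 0 (syntax as in Eq. (7)).
for i in range(3):                      # the fourth call is implied by the first three
    model.module(0, f'[{i}]->[{i}]')

# Weaker separability f1(x1, x2) f2(x3, x4):
# model.module(0, '[0,1]->[0,1]')
```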
Generalized Symmetry We say a function is symmetric in variables $(x_1, x_2)$ if $f(x_1, x_2, x_3, \cdots) = g(h(x_1, x_2), x_3, \cdots)$. This property is termed symmetry because the value of f remains unchanged as long as $h(x_1, x_2)$ is constant, even if $x_1$ and $x_2$ vary. For example, a function f is rotationally invariant in 2D if $f(x_1, x_2) = g(r)$, where $r \equiv \sqrt{x_1^2 + x_2^2}$. When symmetry involves only a subset of variables, it can be considered hierarchical, since $x_1$ and $x_2$ interact first through h (a two-layer KAN), and then h interacts with the other variables via g (another two-layer KAN). Suppose a four-variable function has the hierarchical form $f(x_1, x_2, x_3, x_4) = h(f_1(x_1, x_2), f_2(x_3, x_4))$,
as illustrated in Figure 4 (b). We can use the module method to create this structure by calling module(0,‘[0,1]->[0,1]->[0,1]->[0]’), ensuring that the variable groups $(x_1, x_2)$ and $(x_3, x_4)$ do not interact in the first two layers.
Figure 5: The KAN compiler (kanpiler) converts symbolic expressions to KANs. (a) How the kanpiler works: the symbolic formula is first parsed into an expression tree, which is then converted to a KAN. (b) Applying the kanpiler to 10 equations (selected from the Feynman dataset). (c) Expanding a compiled KAN to increase its expressive power.
all nuances due to their specific functional forms. In contrast, neural networks are highly expressive
but may inefficiently spend training time and data to learn domain knowledge already known to
scientists. To leverage the strengths of both approaches, we propose a two-step procedure: (1)
compile symbolic equations into KANs and (2) fine-tune these KANs using data. The first step aims
to embed known domain knowledge into KANs, while the second step focuses on learning new
“physics” from data.
kanpiler (KAN compiler) The goal of the kanpiler is to convert a symbolic formula to a KAN. The
process, illustrated in Figure 5 (a), involves three main steps: (1) The symbolic formula is parsed
into a tree structure, where nodes represent expressions, and edges denote operations/functions. (2)
This tree is then modified to align with the structure of a KAN graph. Modifications include moving
all leaf nodes to the input layer via dummy edges, and adding dummy subnodes/nodes to match
KAN architecture. These dummy edges/nodes/subnodes only perform identity transformation. (3)
The variables are combined in the first layer, effectively converting the tree into a graph. For visual
clarity, 1D curves are placed on edges to represent functions. We have benchmarked the kanpiler on
the Feynman dataset and it successfully handles all 120 equations. Examples are shown in Figure 5
(b). The kanpiler takes input variables (as sympy symbols) and an output expression (as a sympy expression), and returns a KAN model
model = kanpiler(input_variables, output_expression) (8)
Note that the returned KAN model is in symbolic mode, i.e., the symbolic functions are exactly encoded. If we instead use cubic splines to approximate these symbolic functions, we get MSE losses $\ell \propto N^{-8}$ [57], where N is the number of grid intervals (proportional to the number of model parameters).
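For instance, compiling the relativistic-mass formula might look like the sketch below; the import path is an assumption, while the kanpiler call follows Eq. (8) and model.plot() is the plotting call mentioned later in this paper.

```python
from sympy import symbols, sqrt
from kan.compiler import kanpiler   # import path assumed

m0, v, c = symbols('m0 v c')
expr = m0 / sqrt(1 - (v / c) ** 2)   # relativistic mass formula

model = kanpiler([m0, v, c], expr)   # call follows Eq. (8); returns a KAN in symbolic mode
model.plot()                         # draw the compiled KAN diagram
```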
Width/depth expansion for increased expressive power The KAN network generated by the kanpiler is compact, with no redundant edges, which might limit its expressive power and hinder further fine-tuning. To address this, we propose the expand_width and expand_depth methods to make the network wider and deeper, as shown in Figure 5 (c). The expansion methods initially add zero activation functions, which suffer from zero gradients during training. Therefore, the perturb method should be used to perturb these zero functions into non-zero values, making them trainable with non-zero gradients.
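A sketch of this expansion workflow on a stand-in model (in practice the model would come from the kanpiler); the argument layouts and the fit call are assumptions, while expand_width, expand_depth and perturb are the method names introduced above.

```python
import torch
from kan import KAN, create_dataset   # import paths assumed

# Stand-in for a compiled KAN and a dataset (in practice, model = kanpiler(...)).
f = lambda x: x[:, [0]] * x[:, [1]] + torch.sin(x[:, [2]])
dataset = create_dataset(f, n_var=3)
model = KAN(width=[3, 1])

model.expand_width(1, 2)        # add 2 nodes to layer 1 (argument layout assumed)
model.expand_depth()            # append one more KAN layer
model.perturb(mag=0.1)          # turn zero activations into small non-zero ones; 'mag' assumed
model.fit(dataset, steps=100)   # fine-tune; 'fit' assumed (older versions use 'train')
```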
4 KANs to Science
Today’s black box deep neural networks are powerful, but interpreting these models remains chal-
lenging. Scientists seek not only high-performing models but also the ability to extract meaningful
knowledge from the models. In this section, we focus on enhancing the interpretability of KANs for scientific purposes. We will explore three levels of knowledge extraction from KANs, from the most
basic to the most complex: important features (Section 4.1), modular structures (Section 4.2), and
symbolic formulas (Section 4.3).
Figure 6: Identifying important features in KANs. (a) comparing the attribution score to the L1 norm used
in Liu et al. [57]. On two synthetic tasks, the attribution score brings more insights than the L1 norm. (b)
Attribution scores can be computed for inputs and used for input pruning.
iteratively from the output layer to the input layer. We set all output dimensions to have unit scores, i.e., $A_{L,i} = 1$, $i = 0, 1, \cdots, n_L - 1$, and compute scores as follows:
$$B_{l-1,i,j} = \frac{E_{l-1,i,j}}{N_{l,j}} A_{l,j}, \qquad A_{l-1,i} = \sum_{j=0}^{n_l - 1} B_{l-1,i,j}, \qquad l = L, L-1, \cdots, 1. \tag{9}$$
Comparing $E_{l,i,j}$ and $B_{l,i,j}$ We find that $B_{l,i,j}$ provides a more accurate reflection of edge importance. In Figure 6, we compare KANs trained on two equations, $y = \exp(\sin(\pi x_1) + x_2^2)$ and $y = (x_1^2 + x_2^2)^2 + (x_3^2 + x_4^2)^2$, and visualize the KANs with importance scores being E (L1 norm) or B (attribution score). For the first equation, attribution scores reveal a cleaner graph than L1 norms, as many active edges in the first layer do not contribute to the final output due to inactive subsequent edges. The attribution score accounts for this, resulting in a more meaningful graph. For the second equation, $y = (x_1^2 + x_2^2)^2 + (x_3^2 + x_4^2)^2$, we can tell from the symbolic equation that all four variables are equally important. The attribution scores correctly reflect the equal importance of all four variables, whereas the L1 norm incorrectly suggests that $x_3$ and $x_4$ are more important than $x_1$ and $x_2$.
Pruning inputs based on attribution scores In real datasets, input dimensionality can be large, but only a few variables may be relevant. To address this, we propose pruning away irrelevant features based on attribution scores so that we can focus on the most relevant ones. Users can apply the prune_input method to retain only the most relevant variables. For instance, if there are 100 input features ordered by decreasing relevance in the function $y = \sum_{i=0}^{99} x_i^2/2^i$, $x_i \in [-1, 1]$, and after training only the first five features show significantly higher attribution scores, the prune_input method will retain only these five features. The pruned network becomes compact and interpretable, whereas the original KAN with 100 inputs is too dense for straightforward interpretation.
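A sketch of this workflow on the 100-input example above; the import paths, the training call, and whether prune_input returns a new model are assumptions.

```python
import torch
from kan import KAN, create_dataset   # import paths assumed

# y = sum_{i=0}^{99} x_i^2 / 2^i: only the first few inputs matter in practice.
weights = 0.5 ** torch.arange(100)
f = lambda x: (weights * x ** 2).sum(dim=1, keepdim=True)
dataset = create_dataset(f, n_var=100)

model = KAN(width=[100, 5, 1])
model.fit(dataset, steps=50)      # 'fit' assumed; older pykan versions use 'train'
model = model.prune_input()       # keep only high-attribution inputs (return value assumed)
model.plot()                      # the pruned KAN is compact enough to inspect
```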
Figure 7: Inducing anatomical modularity in neural networks through neuron swapping. The approach involves
assigning spatial coordinates to neurons and permuting them to minimize the overall connection cost. For two
tasks (left: multitask parity, right: hierarchical majority voting), neuron swapping works for KANs (top) in
both cases and works for MLPs (bottom) for multitask parity.
Figure 8: Detecting functional modularity in KANs. (a) We study three types of functional modularity: sep-
arability (additive or multiplicative), general separability, and symmetry. (b) Applying these tests recursively
converts a function into a tree. Here the function can be symbolic functions (top), KANs (middle) or MLPs
(bottom). Both KANs and MLPs produce correct tree graphs at the end of training but show different training
dynamics.
all $1 \le i \le k$ and $k+1 \le j \le n$, then we know f is additively separable. For multiplicative separability, we can convert it to additive separability by taking the logarithm:
$$f(x_1, x_2, \cdots, x_n) = g(x_1, \ldots, x_k) \times h(x_{k+1}, \ldots, x_n),$$
$$\log|f(x_1, x_2, \cdots, x_n)| = \log|g(x_1, \ldots, x_k)| + \log|h(x_{k+1}, \ldots, x_n)|. \tag{11}$$
To detect multiplicative separability, we define $H_{ij} \equiv \frac{\partial^2 \log|f|}{\partial x_i \partial x_j}$ and check for block structure. Users can call test_separability to test (additive or multiplicative) separability.
Generalized separability: A function f has generalized separability if
$$f(x_1, x_2, \cdots, x_n) = F\big(g(x_1, \ldots, x_k) + h(x_{k+1}, \ldots, x_n)\big). \tag{12}$$
To detect generalized separability, we compute
$$\frac{\partial f}{\partial x_i} = \frac{\partial F}{\partial g}\frac{\partial g}{\partial x_i} \ (1 \le i \le k), \qquad \frac{\partial f}{\partial x_j} = \frac{\partial F}{\partial h}\frac{\partial h}{\partial x_j} \ (k+1 \le j \le n),$$
$$\frac{\partial f/\partial x_i}{\partial f/\partial x_j} = \frac{\partial F/\partial g}{\partial F/\partial h}\,\frac{\partial g/\partial x_i}{\partial h/\partial x_j} = \frac{\partial g/\partial x_i}{\partial h/\partial x_j} = g_{x_i}(x_1, x_2, \cdots, x_k) \times \frac{1}{h_{x_j}(x_{k+1}, \cdots, x_n)}, \tag{13}$$
where we have used $\frac{\partial F}{\partial g} = \frac{\partial F}{\partial h}$. Since $\frac{\partial f/\partial x_i}{\partial f/\partial x_j}$ is multiplicatively separable, it can be detected by the separability test proposed above. Users can call test_general_separability to check for generalized separability.
Generalized Symmetry: A function has generalized symmetry (in the first k variables) if
$$f(x_1, x_2, \cdots, x_n) = g\big(h(x_1, \cdots, x_k), x_{k+1}, \cdots, x_n\big). \tag{14}$$
We denote $\mathbf{y} = (x_1, \cdots, x_k)$ and $\mathbf{z} = (x_{k+1}, \cdots, x_n)$. This property is called generalized symmetry because f retains the same value as long as h is held constant, regardless of the individual values of $x_1, \cdots, x_k$. We compute the gradient of f with respect to $\mathbf{y}$: $\nabla_{\mathbf{y}} f = \frac{\partial g}{\partial h}\nabla_{\mathbf{y}} h$. Since $\frac{\partial g}{\partial h}$ is a scalar function, it does not change the direction of $\nabla_{\mathbf{y}} h$. Thus, the direction $\widehat{\nabla_{\mathbf{y}} f} \equiv \nabla_{\mathbf{y}} f/|\nabla_{\mathbf{y}} f|$ is independent of $\mathbf{z}$, i.e.,
$$\nabla_{\mathbf{z}}\big(\widehat{\nabla_{\mathbf{y}} f}\big) = 0, \tag{15}$$
which is the condition for symmetry. Users can call the test_symmetry method to check for symmetries.
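As a sketch, the three tests can be applied to a plain Python function (or, in principle, to a trained model); the function names are those used in the text, while the import path and keyword arguments are assumptions.

```python
import torch
from kan.hypothesis import test_separability, test_general_separability, test_symmetry
# ^ import path assumed; the three function names are those used in the text

def f(x):
    # f(x1, x2, x3) = sin(x1) / sqrt(x2^2 + x3^2)
    return torch.sin(x[:, [0]]) / torch.sqrt(x[:, [1]] ** 2 + x[:, [2]] ** 2)

x = torch.rand(1000, 3) * 2 - 1            # samples from [-1, 1]^3

# Is f multiplicatively separable into the groups (x1) and (x2, x3)?
print(test_separability(f, x, mode='mul', groups=[[0], [1, 2]]))   # kwargs assumed
# Is f of the form F(g(x1) + h(x2, x3))?
print(test_general_separability(f, x, groups=[[0], [1, 2]]))       # kwargs assumed
# Is f symmetric in (x2, x3), i.e. f = g(h(x2, x3), x1)?
print(test_symmetry(f, x, group=[1, 2]))                           # kwargs assumed
```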
Tree converter The three types of functional modularity form a hierarchy: symmetry is the most
general, generalized separability is intermediate, and separability is the most specific. Mathematically,
Separability ⊂ Generalized Separability ⊂ Generalized Symmetry (16)
To obtain the maximal hierarchy of modular structures, we apply generalized symmetry detection
recursively, forming groups as small as k = 2 variables and extending to all k = n variables. For
example, let us consider an 8-variable function
$$f(x_1, \cdots, x_8) = \big((x_1^2 + x_2^2)^2 + (x_3^2 + x_4^2)^2\big)^2 + \big((x_5^2 + x_6^2)^2 + (x_7^2 + x_8^2)^2\big)^2, \tag{17}$$
which has four k = 2 generalized symmetries, involving the groups $(x_1, x_2)$, $(x_3, x_4)$, $(x_5, x_6)$, $(x_7, x_8)$, and two k = 4 generalized symmetries, involving the groups $(x_1, x_2, x_3, x_4)$ and $(x_5, x_6, x_7, x_8)$. As such, each k = 4 group contains two k = 2 groups, demonstrating a hierarchy. For each generalized symmetry, we can also test whether it is further generalized separable or separable. Users can use the plot_tree method to obtain the tree graph for a function (the function can be any Python expression, neural network, etc.). For a neural network model, users can simply call model.tree(). The tree plot can have the style ‘tree’ (by default) or ‘box’.
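A sketch of both entry points: plot_tree, model.tree() and the 'tree'/'box' styles are as described above, while the import path and the (function, samples) call signature are assumptions.

```python
import torch
from kan.hypothesis import plot_tree   # import path assumed

# Hierarchical ground-truth function in the spirit of Eq. (17), on 4 variables.
def f(x):
    return (x[:, [0]] ** 2 + x[:, [1]] ** 2) ** 2 + (x[:, [2]] ** 2 + x[:, [3]] ** 2) ** 2

x = torch.rand(1000, 4) * 2 - 1
plot_tree(f, x, style='tree')   # call signature assumed; 'box' is the alternative style

# For a trained KAN `model`, the equivalent call is simply:
# model.tree()
```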
Examples Figure 8 (b) provides two examples. When the exact symbolic functions are input to
plot_tree, the ground truth tree graphs are obtained. We are particularly interested in whether the
tree converter works for neural networks. For these simple cases, both KANs and MLPs can find
the correct graph if sufficiently trained. Figure 8 (b) (bottom) shows the evolution of the tree graphs
during KAN and MLP training. It is particularly interesting to see how neural networks gradually
learn the correct modular structure. In the first case, $f(x_1, x_2, x_3, x_4) = (x_1^2 + x_2^2)^2 + (x_3^2 + x_4^2)^2$, both the KAN and the MLP gradually pick up more inductive biases (their intermediate states are different) until they reach the correct structure. In the second case, $f(x_1, x_2, x_3) = \sin(x_1)/\sqrt{x_2^2 + x_3^2}$, both models initially detect multiplicative separability for all three variables, showing even higher symmetry than the correct structure. As training progresses, both models “realize” that, in order to fit the data better (i.e., reach a lower loss), such a highly symmetric structure can no longer be maintained and should be relaxed to a less stringent one. An additional observation is that the KAN passes through an intermediate structure not found in the MLP. There are two caveats we would like to mention: (1) results can be seed- and/or threshold-dependent; (2) all tests rely on second-order derivatives, which may not be robust because the model is trained only on zeroth-order information. Adversarial constructions such as $f_\epsilon(x) = f(x) + \epsilon\sin(x/\epsilon)$ could lead to issues, because although $|f_\epsilon(x) - f(x)| \to 0$ as $\epsilon \to 0$, $|f_\epsilon''(x) - f''(x)| \to \infty$ as $\epsilon \to 0$. Although such extreme cases are unlikely in practice, smoothness is necessary to ensure the success of our methods.
Figure 9: Three tricks to facilitate symbolic regression. Trick A (top row): detecting and leveraging modular structures. Trick B (middle row): sparse connection initialization. Trick C (bottom row): hypothesis testing.
We first initialize a large KAN (presumably expressive enough) to fit the dataset to reasonable accuracy. After training, the tree graph is extracted from the trained KAN (see Section 4.2), which shows multiplicative separability. We can then build the modular structure into a second KAN (see Section 3.2), train it, and symbolify all 1D functions to derive the formula.
Trick B: Sparse initialization Symbolic formulas typically correspond to KANs with sparse connections (see Figure 5 (b)), so initializing KANs sparsely aligns them better with the inductive biases of symbolic formulas. Otherwise, densely initialized KANs require careful regularization to promote sparsity. Sparse initialization can be achieved by passing the argument “sparse_init=True” to the KAN initializer. For example, for the function $f(q, E, v, B, \theta) = q(E + vB\sin\theta)$, a sparsely initialized KAN closely resembles the final trained KAN, requiring only minor adjustments in training. In contrast, a dense initialization would involve extensive training to remove unnecessary edges.
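A sketch for the example above; sparse_init=True is the argument named in the text, while the width, import paths and the fit call are assumptions.

```python
import torch
from kan import KAN, create_dataset   # import paths assumed

# f(q, E, v, B, theta) = q (E + v B sin(theta))
f = lambda x: x[:, [0]] * (x[:, [1]] + x[:, [2]] * x[:, [3]] * torch.sin(x[:, [4]]))
dataset = create_dataset(f, n_var=5)

# sparse_init=True starts from a sparse connectivity pattern, closer to the
# inductive bias of symbolic formulas than a dense initialization.
model = KAN(width=[5, 4, 1], sparse_init=True)   # width is an assumption
model.fit(dataset, steps=100)                    # 'fit' assumed; older versions use 'train'
```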
Trick C: Hypothesis Testing When faced with multiple reasonable hypotheses, we can try all of them (branching into “parallel universes”) to test which hypothesis is the most accurate and/or simplest. To facilitate hypothesis testing, we build a checkpoint system that automatically saves model versions whenever changes (e.g., training, pruning) are made. For example, consider the function $f(m_0, v, c) = m_0/\sqrt{1 - (v/c)^2}$. We start from a randomly initialized KAN, which has version 0.0. After training, it evolves to version 0.1, where it activates on both $\beta = v/c$ and $\gamma = 1/\sqrt{1 - (v/c)^2}$. We hypothesize that only β or only γ might be needed. We first set the edge on γ to zero and train the model, obtaining a $6.5 \times 10^{-4}$ test RMSE (version 0.2). To test the alternative hypothesis, we want to revert to the branching point (version 0.1), so we call model.rewind(‘0.1’), which rewinds the model back to version 0.1. To indicate that rewind has been called, version 0.1 is renamed to version 1.1. Now we set the edge on β to zero and train the model, obtaining a $2.0 \times 10^{-6}$ test RMSE (the version becomes 1.2). Comparing versions 0.2 and 1.2 indicates that the second hypothesis is better due to the lower loss at the same complexity (both hypotheses have two non-zero edges).
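A sketch of this hypothesis-testing loop, assuming a model and dataset already set up with the auxiliary inputs β and γ as described; rewind is the method named in the text, while zeroing an edge via fix_symbolic, the edge indices, and the fit call are assumptions.

```python
model.fit(dataset, steps=100)        # checkpointing is automatic -> version 0.1 ('fit' assumed)

# Hypothesis 1: only beta is needed, so switch the gamma edge off.
model.fix_symbolic(0, 4, 0, '0')     # zeroing an edge via fix_symbolic is an assumption
model.fit(dataset, steps=100)        # -> version 0.2

# Hypothesis 2: only gamma is needed; first rewind to the branching point.
model.rewind('0.1')                  # version 0.1 is renamed to 1.1
model.fix_symbolic(0, 3, 0, '0')     # switch the beta edge off (indices are placeholders)
model.fit(dataset, steps=100)        # -> version 1.2; compare test RMSE of 0.2 and 1.2
```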
5 Applications
The previous sections primarily focused on regression problems for pedagogical purposes. In this
section, we apply KANs to discover physical concepts, such as conserved quantities, Lagrangians,
hidden symmetries, and constitutive laws. These examples illustrate how the tools proposed in this
paper can be effectively integrated into real-life scientific research to tackle these complex tasks.
Figure 10: Using KANs to discover conserved quantities for the 2D harmonic oscillator.
Conserved quantities are physical quantities that remain constant over time. For example, a free-
falling ball converts its gravitational potential energy into kinetic energy, while the total energy (the
sum of both forms of energy) remains constant (assuming negligible air resistance). Conserved
quantities are crucial because they often correspond to symmetries in physical systems and can sim-
plify calculations by reducing the dimensionality of the system. Traditionally, deriving conserved
quantities with paper and pencil can be time-consuming and demands extensive domain knowl-
edge. Recently, machine learning techniques have been explored to discover conserved quanti-
ties [55, 53, 54, 58, 32, 89].
We follow the approach of Liu et al. [53], who derived a differential equation that conserved quantities must satisfy, thus transforming the problem of finding conserved quantities into solving a differential equation. They used multi-layer perceptrons (MLPs) to parameterize conserved quantities. We basically follow their procedure but replace MLPs with KANs. To be specific, they consider a dynamical system with state variable $\mathbf{z} \in \mathbb{R}^d$ governed by the equation $\frac{d\mathbf{z}}{dt} = \mathbf{f}(\mathbf{z})$. The necessary and sufficient condition for a function $H(\mathbf{z})$ to be a conserved quantity is that $\mathbf{f}(\mathbf{z}) \cdot \nabla H(\mathbf{z}) = 0$ for all $\mathbf{z}$. For example, in a 1D harmonic oscillator, the phase space is characterized by position and momentum, $\mathbf{z} = (x, p)$, and the evolution equation is $d(x, p)/dt = (p, -x)$. The energy $H = \frac{1}{2}(x^2 + p^2)$ is a conserved quantity because $\mathbf{f}(\mathbf{z}) \cdot \nabla H(\mathbf{z}) = (p, -x) \cdot (x, p) = 0$. We parameterize H using a KAN and train it with the loss function $\ell = \sum_{i=1}^{N} \big(\mathbf{f}(\mathbf{z}^{(i)}) \cdot \widehat{\nabla} H(\mathbf{z}^{(i)})\big)^2$, where $\widehat{\nabla}$ is the normalized gradient and $\mathbf{z}^{(i)}$ is the $i$th data point, uniformly drawn from the hypercube $[-1, 1]^d$.
We choose the 2D harmonic oscillator to test KANs, characterized by $(x, y, p_x, p_y)$. It has three conserved quantities: (1) energy along the x direction, $H_1 = \frac{1}{2}(x^2 + p_x^2)$; (2) energy along the y direction, $H_2 = \frac{1}{2}(y^2 + p_y^2)$; (3) angular momentum, $H_3 = x p_y - y p_x$. We train [4, [0, 2], 1] KANs with three different random seeds, as shown in Figure 10; these correspond to $H_1$, $H_2$ and $H_3$, respectively.
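A sketch of this training loss in plain PyTorch, with H parameterized by a KAN of the shape given above (any differentiable torch model would work); the import path and training hyperparameters are assumptions.

```python
import torch
from kan import KAN   # import path assumed

d = 4                                # state (x, y, px, py) of the 2D harmonic oscillator
H = KAN(width=[4, [0, 2], 1])        # shape from the text; adjust for your pykan version

def flow(z):
    # dz/dt = f(z): (dx, dy, dpx, dpy) = (px, py, -x, -y)
    x, y, px, py = z.unbind(dim=1)
    return torch.stack([px, py, -x, -y], dim=1)

def conservation_loss(H, z):
    z = z.requires_grad_(True)
    grad = torch.autograd.grad(H(z).sum(), z, create_graph=True)[0]
    grad_hat = grad / (grad.norm(dim=1, keepdim=True) + 1e-8)    # normalized gradient
    return ((flow(z) * grad_hat).sum(dim=1) ** 2).sum()          # sum_i (f . grad_hat H)^2

z = torch.rand(1024, d) * 2 - 1      # uniform samples from [-1, 1]^d
opt = torch.optim.Adam(H.parameters(), lr=1e-2)
for step in range(200):
    opt.zero_grad()
    loss = conservation_loss(H, z)
    loss.backward()
    opt.step()
```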
Figure 11: Use KANs to learn Lagrangians for the single pendulum (top) and a relativistic mass in a uniform
field (bottom).
$$\ddot{\mathbf{q}} = \left(\nabla_{\dot{\mathbf{q}}} \nabla_{\dot{\mathbf{q}}}^T L\right)^{-1}\left[\nabla_{\mathbf{q}} L - \left(\nabla_{\mathbf{q}} \nabla_{\dot{\mathbf{q}}}^T L\right)\dot{\mathbf{q}}\right] \tag{18}$$
Given the fundamental role of the Lagrangian, an interesting question is whether we can infer the La-
grangian from data. Following [19], we train a Lagrangian neural network to predict q̈ from (q, q̇).
An LNN uses an MLP to parameterize $L(\mathbf{q}, \dot{\mathbf{q}})$ and evaluates Eq. (18) to predict instantaneous accelerations $\ddot{\mathbf{q}}$. However, LNNs face two main challenges: (1) The training of LNNs can be unstable
due to the second-order derivatives and matrix inversion in Eq. (18). (2) LNNs lack interpretability
because MLPs themselves are not easily interpretable. We address these issues using KANs.
To tackle the first challenge, we note that the inversion of the Hessian $(\nabla_{\dot{\mathbf{q}}} \nabla_{\dot{\mathbf{q}}}^T L)^{-1}$ becomes problematic when the Hessian has eigenvalues close to zero. To mitigate this, we initialize $(\nabla_{\dot{\mathbf{q}}} \nabla_{\dot{\mathbf{q}}}^T L)$ as a positive definite matrix (or a positive number in 1D). Since $(\nabla_{\dot{\mathbf{q}}} \nabla_{\dot{\mathbf{q}}}^T L)$ is the mass m in classical mechanics and the kinetic energy is usually $T = \frac{1}{2}m\dot{\mathbf{q}}^2$, encoding this prior knowledge into KANs is more straightforward than into MLPs (using the kanpiler introduced in Section 3.3). The kanpiler can convert the symbolic formula T into a KAN (as shown in Figure 11). We use this converted KAN for initialization and continue training, resulting in much greater stability compared to random initialization. After training, symbolic regression can be applied to each edge to extract symbolic formulas, addressing the second challenge.
We show two 1D examples in Figure 11, a single pendulum and a relativistic mass in a uniform field.
The compiled KANs are displayed on the left, with edges on q̇ displaying quadratic functions and
edges on q as zero functions.
Single pendulum The $\dot q$ part remains a quadratic function $T(\dot q) = \frac{1}{2}\dot q^2$, while the q part learns to be a cosine function, as $V(q) = 1 - \cos(q)$. In Figure 11 (top), the results from suggest_symbolic display the top five functions that best match the splines, considering both fitness and simplicity. As expected, the cosine and the quadratic function appear at the top of the lists.
Relativistic mass in a uniform field After training, the kinetic energy part deviates from $T = \frac{1}{2}\dot q^2$ because, for a relativistic particle, $T_r = (1 - \dot q^2)^{-1/2} - 1$. In Figure 11 (bottom), symbolic regression successfully finds $V(q) = q$ but fails to identify $T_r$ due to its compositional nature, as our symbolic regression only searches for simple functions. By assuming that the first function composition is quadratic, we create another [1, 1, 1] KAN to fit $T_r$, set the first function to be the quadratic function using fix_symbolic, and train only the second learnable function. After training, we see that the ground truth $x^{-1/2}$ appears among the top five candidates. However, $x^{1/2}$ fits the spline slightly better, as indicated by a higher R-squared value. This suggests that symbolic regression is sensitive to noise (due to imperfect learning) and that prior knowledge is crucial for correct judgment. For instance, knowing that the kinetic energy should diverge as the velocity approaches the speed of light helps confirm $x^{-1/2}$ as the correct term, since $x^{1/2}$ does not exhibit the expected divergence.
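A sketch of this interaction on the [1, 1, 1] KAN; suggest_symbolic and fix_symbolic are the methods named in the text, while the edge indices, the symbolic name 'x^2', and the fit call are assumptions.

```python
# Inspect the learned 1D function on edge (layer 0, input 0, output 0).
model.suggest_symbolic(0, 0, 0)      # prints top candidate functions ranked by fit and simplicity

# Pin the first composition to a quadratic, then train only the remaining function.
model.fix_symbolic(0, 0, 0, 'x^2')   # symbolic name 'x^2' assumed to be in the library
model.fit(dataset, steps=100)        # 'fit' assumed; older versions use 'train'

# Finally, inspect the second edge, where x^{-1/2} should appear among the candidates.
model.suggest_symbolic(1, 0, 0)
```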
5.3 Discovering hidden symmetry
Figure 12: Rediscovering the hidden symmetry of the Schwarzschild black hole with MLPs and KANs. (a)
∆t(r) learned by the MLP is a globally smooth solution; (b) ∆t(r) learned by the KAN is a domain-wall
solution; (c) The KAN shows a loss spike at the domain wall; (d) A KAN can be used to fine-tune the MLP
solution close to machine precision.
Philip Anderson famously argued that “it is only slightly overstating the case to say that physics is
the study of symmetry”, emphasizing how the discovery of symmetries has been invaluable for both
deepening our understanding and solving problems more efficiently.
However, symmetries are sometimes not manifest but hidden, only revealed by applying some co-
ordinate transformation. For example, after Schwarzschild discovered his eponymous black hole
metric, it took 17 years for Painlevé, Gullstrand and Lemaître to uncover its hidden translational
symmetry. They demonstrated that the spatial sections could be made translationally invariant with
a clever coordinate transformation, thereby deepening our understanding of black holes [65]. Liu &
Tegmark [56] showed that the Gullstrand-Painlevé transformation can be discovered by training an
MLP in minutes. However, they did not get extremely high precision (i.e., machine precision) for
the solution. We attempt to revisit this problem using KANs.
Suppose there is a Schwarzschild black hole in spacetime $(t, x, y, z)$ with mass $2M = 1$, centered at $x = y = z = 0$ with radius $r_s = 2M = 1$. The Schwarzschild metric describes how space and time distort around it:
$$g_{\mu\nu} = \begin{pmatrix} 1 - \frac{2M}{r} & 0 & 0 & 0 \\ 0 & -1 - \frac{2Mx^2}{(r-2M)r^2} & -\frac{2Mxy}{(r-2M)r^2} & -\frac{2Mxz}{(r-2M)r^2} \\ 0 & -\frac{2Mxy}{(r-2M)r^2} & -1 - \frac{2My^2}{(r-2M)r^2} & -\frac{2Myz}{(r-2M)r^2} \\ 0 & -\frac{2Mxz}{(r-2M)r^2} & -\frac{2Myz}{(r-2M)r^2} & -1 - \frac{2Mz^2}{(r-2M)r^2} \end{pmatrix}, \tag{19}$$
The hidden symmetry is translation invariance in the spatial section: there exists a coordinate transformation after which the lower right $3 \times 3$ block of the metric is the Euclidean metric. Liu & Tegmark [56] used an MLP to learn the mapping from $(t, x, y, z)$ to
$(t', x', y', z')$. Defining the Jacobian matrix $\mathbf{J} \equiv \frac{\partial(t', x', y', z')}{\partial(t, x, y, z)}$, the metric transforms as $\mathbf{g}' = \mathbf{J}^{-T}\mathbf{g}\mathbf{J}^{-1}$. We take the bottom right $3 \times 3$ block of $\mathbf{g}'$ and use its difference from the Euclidean metric as the MSE loss, which is minimized by gradient descent on the MLP. To keep things simple, they assume $x' = x$, $y' = y$, $z' = z$ are known, and only use an MLP (1 input and 1 output) to predict the temporal difference $\Delta t(r) = t' - t = 2M\big(2u + \ln\frac{u-1}{u+1}\big)$, $u \equiv \sqrt{r/2M}$, from the radius r.
MLP and KAN find different solutions We trained both an MLP and a KAN to minimize this
loss function, with results shown in Figure 12. Since the task has 1 input dimension and 1 output
dimension, the KAN effectively reduces to a spline. We originally expected KANs to outperform
MLPs, because splines are known to be superior in low-dimensional settings [63]. However, while the MLP can achieve a $10^{-8}$ loss, the KAN gets stuck at a $10^{-3}$ loss despite grid refinements. It turned out that the KAN and the MLP learned two different solutions: while the MLP found a globally smooth
solution (Figure 12 (a)), the KAN learned a domain-wall solution (Figure 12 (b)). The domain wall
solution has a singular point that separates the whole curve into two segments. The left segment
learns ∆t(r) correctly, while the right segment learns −∆t(r), which is also a valid solution but
differs from the left segment by a minus sign. There is a loss spike appearing at the singular point
(Figure 12 (c)). One might consider this as a feature of KANs because domain wall solutions are
prevalent in nature. However, if one considers this a flaw, KANs can still obtain globally smooth
solutions by adding regularizations (to reduce spline oscillations) or experimenting with different
random seeds (roughly 1 out of 3 random seeds finds a globally smooth solution).
KANs can achieve extreme precision Although the MLP finds the globally smooth solution and
achieves a $10^{-8}$ loss, the loss is still far from machine precision. We found that neither longer training
nor increasing the MLP’s size significantly reduced the loss. Therefore, we turned to KANs, which,
as splines in 1D, can achieve arbitrary accuracy by refining the grid (given infinite data). We first
used the MLP as a teacher, generating supervised pairs (x, y) to train the KAN to fit the supervised
data. This way, the KAN is initialized to a globally smooth solution. We then iteratively refined the
KAN by increasing the number of grid intervals to 1000. In the end, the fine-tuned KANs achieve a
loss of $10^{-15}$, close to machine precision (Figure 12 (d)).
5.4 Learning constitutive laws
A constitutive law defines the behavior and properties of a material by modeling how it responds to
external forces or deformations. One of the simplest forms of constitutive law is Hooke’s Law [34],
which relates the strain and stress of elastic materials linearly. Constitutive laws encompass a wide
range of materials, including elastic materials [80, 68], plastic materials [64], and fluids [8]. Tra-
ditionally, these laws were derived from first principles based on theoretical and experimental stud-
ies [79, 81, 6, 29]. Recent advancements, however, have introduced data-driven approaches that
leverage machine learning to discover and refine these laws from dedicated datasets [73, 91, 59, 60].
Figure 13: Discovering constitutive laws (relations between the stress tensor P and the deformation tensor F) with KANs by interacting with them. Top: predicting the diagonal element P11; bottom: predicting the off-diagonal element P12.
We follow the standard notation and experimental setup in the elasticity part of NCLaw [59] and define the constitutive law as a parameterized function $E_\theta(\mathbf{F}) \to \mathbf{P}$, where $\mathbf{F}$ denotes the deformation tensor, $\mathbf{P}$ the first Piola–Kirchhoff stress tensor, and $\theta$ the parameters of the constitutive law.
Many isotropic materials have linear constitutive laws when the deformation is small:
$$\mathbf{P}_l = \mu(\mathbf{F} + \mathbf{F}^T - 2\mathbf{I}) + \lambda(\mathrm{Tr}(\mathbf{F}) - 3)\mathbf{I}. \tag{21}$$
However, when the deformation gets larger, nonlinear effects start to kick in. For example, a Neo-Hookean material has the following constitutive law:
$$\mathbf{P} = \mu(\mathbf{F}\mathbf{F}^T - \mathbf{I}) + \lambda\log(\det(\mathbf{F}))\mathbf{I}, \tag{22}$$
where $\mu$ and $\lambda$ are the so-called Lamé parameters, determined by Young's modulus Y and the Poisson ratio $\nu$ as $\mu = \frac{Y}{2(1+\nu)}$, $\lambda = \frac{Y\nu}{(1+\nu)(1-2\nu)}$. For simplicity, we choose $Y = 1$ and $\nu = 0.2$, hence $\mu = \frac{5}{12} \approx 0.42$ and $\lambda = \frac{5}{18} \approx 0.28$.
Assume we are working with Neo-Hookean materials; our goal is to use KANs to predict the $\mathbf{P}$ tensor from the $\mathbf{F}$ tensor. Suppose we do not know that the materials are Neo-Hookean, but we have the prior knowledge that the linear constitutive law is approximately valid for small deformations. Due to symmetries, it suffices to demonstrate that we can accurately predict $P_{11}$ and $P_{12}$ from the 9 matrix elements of $\mathbf{F}$. We want to compile the linear constitutive laws into KANs, which are $P_{11} = 2\mu(F_{11} - 1) + \lambda(F_{11} + F_{22} + F_{33} - 3)$ and $P_{12} = \mu(F_{12} + F_{21})$, and to extract the Neo-Hookean laws from the trained KANs, which are $P_{11} = \mu(F_{11}^2 + F_{12}^2 + F_{13}^2 - 1) + \lambda\log(\det(\mathbf{F}))$ and $P_{12} = \mu(F_{11}F_{21} + F_{12}F_{22} + F_{13}F_{23})$. We generate a synthetic dataset by sampling $F_{ij}$ independently from $U[\delta_{ij} - w, \delta_{ij} + w]$ ($w = 0.2$) and using the Neo-Hookean constitutive law to compute $\mathbf{P}$. Our interaction with KANs is illustrated in Figure 13. In both cases, we successfully figured out the true symbolic formulas in the end, with the aid of some inductive biases. However, the key takeaway is not that we can rediscover the exact symbolic formulas (given that prior knowledge skews the process) but rather that, in real-world scenarios where the answers are unknown and users can only make guesses based on prior knowledge, the pykan package makes it easy to test or incorporate such knowledge.
Predicting P11 In step 1, we compile the linear constitutive law $P_{11} = 2\mu(F_{11} - 1) + \lambda(F_{11} + F_{22} + F_{33} - 3)$ to a KAN using the kanpiler, resulting in a $10^{-2}$ loss. In step 2, we perturb the KAN so that it becomes trainable (indicated by the color change from red to purple; red denotes a purely symbolic part, while purple indicates that both symbolic and spline parts are active). In step 3, we train the perturbed model until convergence, giving a $6 \times 10^{-3}$ loss. In step 4, assuming that the determinant is a key auxiliary variable, we use expand_width (for the KAN) and augment_input (for the dataset) to include the determinant $|\mathbf{F}|$. In step 5, we train the KAN until convergence, giving a $2 \times 10^{-4}$ loss. In step 6, we symbolify the KAN to obtain a symbolic formula $P_{11} = 0.42(F_{11}^2 + F_{12}^2 + F_{13}^2 - 1) + 0.28\log(|\mathbf{F}|)$, which achieves a $3 \times 10^{-11}$ loss.
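Scripted roughly, the P11 interaction might look like the sketch below; the method names are those used in this paper, while the import path, argument layouts, the dataset construction (omitted), and the auto_symbolic call for the symbolification step are assumptions.

```python
from sympy import symbols, Matrix
from kan.compiler import kanpiler   # import path assumed

mu, lam = 5 / 12, 5 / 18
Fs = symbols('F11 F12 F13 F21 F22 F23 F31 F32 F33')
F11, F12, F13, F21, F22, F23, F31, F32, F33 = Fs

# Step 1: compile the linear constitutive law for P11 into a KAN.
P11_linear = 2 * mu * (F11 - 1) + lam * (F11 + F22 + F33 - 3)
model = kanpiler(list(Fs), P11_linear)

# Steps 2-3: make the compiled KAN trainable and fine-tune it on Neo-Hookean data.
model.perturb()
model.fit(dataset, steps=100)       # 'fit' assumed; `dataset` holds (F, P11) pairs (omitted)

# Steps 4-5: add det(F) as an auxiliary input, widen the network, and retrain.
detF = Matrix(3, 3, list(Fs)).det()
model.expand_width(0, 1)            # make room for the new input (argument layout assumed)
model.augment_input(list(Fs), [detF], dataset)
model.fit(dataset, steps=100)

# Step 6: symbolify every remaining 1D function to read off the formula.
model.auto_symbolic()               # 'auto_symbolic' assumed for the symbolification step
```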
Predicting P12 We experimented both with and without encoding the linear constitutive law as prior knowledge. With prior knowledge: in step 1, we compile the linear constitutive law to a KAN, resulting in a loss of $10^{-2}$. We then perform a series of operations, including expand (step 2), perturb (step 3), train (step 4), prune (step 5) and finally symbolify (step 6). The influence of prior knowledge is evident, as the final KAN only identifies minor correction terms to the linear constitutive law. The final KAN is symbolified as $P_{12} = 0.42(F_{12} + F_{21}) + 0.44F_{13}F_{23} - 0.03F_{21}^2 + 0.02F_{12}^2$, which yields a $7 \times 10^{-3}$ loss, only slightly better than the linear constitutive law. Without prior knowledge: in step 1, we randomly initialize the KAN model. In step 2, we train the KAN with regularization. In step 3, we prune the KAN into a more compact model. In step 4, we symbolify the KAN, yielding $P_{12} = 0.42(F_{11}F_{21} + F_{12}F_{22} + F_{13}F_{23})$, which closely matches the exact formula, achieving a $6 \times 10^{-9}$ loss. Comparing the two scenarios, one with and one without prior knowledge, reveals a surprising outcome: in this example, prior knowledge appears harmful, possibly because the linear constitutive law lies near a (bad) local minimum that is hard for the model to escape. However, we should not hastily extrapolate this conclusion to more complicated tasks and larger networks. For more complicated tasks, finding a local minimum via gradient descent might be challenging enough, making an approximate initial solution desirable. Additionally, larger networks might be sufficiently over-parameterized to eliminate bad local minima, ensuring that all local minima are global and interconnected.
Figure 14: KAN interpolates between software 1.0 and 2.0. (a) KANs strike a balance between interpretability
(software 1.0) and learnability (software 2.0). (b) KANs’ Pareto frontier on the interpretability-scale plane. The
amount of interpretation we can get from KANs depends on problem scales and interpretability methods.
6 Related works
Kolmogorov-Arnold Networks (KANs), inspired by the Kolmogorov-Arnold representation the-
orem (KART), were recently proposed by Liu et al. [57]. Although the connection between
KART and networks has long been deemed irrelevant [30], Liu et al. generalized the origi-
nal two-layer network to arbitrary depths and demonstrated their promise for science-oriented
tasks given their accuracy and interpretability. Subsequent research has explored the application
of KANs across various domains, including graphs [12, 22, 38, 99], partial differential equa-
tions [87, 78] and operator learning [1, 78, 67], tabular data [70], time series [85, 28, 93, 27],
human activity recognition [49, 50], neuroscience [96, 33], quantum science [40, 46, 4], computer vision [17, 7, 44, 16, 76, 10], kernel learning [101], nuclear physics [48], electrical engineering [69], and biology [71]. Liu et al. used B-splines to parameterize 1D functions, and other works have explored various activation functions, including wavelets [11, 76], radial basis functions [47], Fourier series [92], finite basis functions [35, 82], Jacobi basis functions [2], polynomial basis functions [75], and rational functions [3]. Other techniques for KANs have also been proposed, including regularization [5], Kansformer (combining transformer and KAN) [15], adaptive grid updates [72], federated learning [98], and convolutional KANs [10]. There have been ongoing debates regarding whether KANs
really outperform other neural networks (especially MLPs) on various domains [7, 16, 42, 77, 97],
which suggests that while KANs show promise for machine learning tasks, further development is
needed to surpass state-of-the-art models.
Machine Learning for Physical Laws A major goal for KANs is to aid in the discovery of
new physical laws from data. Previous research has shown that machine learning can be used to
learn various types of physical laws, including equations of motion [90, 13, 43, 20], conservation
laws [55, 53, 54, 58, 32, 89], symmetries [39, 56, 94], phase transitions [88, 14], Lagrangian and
Hamiltonian [19, 31], and symbolic regression [18, 61, 23, 74], etc. However, making neural net-
works interpretable often requires domain-specific knowledge, limiting their generality. We hope
that KANs will evolve into universal foundation models for physical discoveries.
Mechanistic Interpretability seeks to understand how neural networks operate at a fundamental
level [21, 62, 86, 25, 66, 100, 51, 24, 45, 26]. Some research in this area focuses on designing
models that are inherently interpretable [24] or proposing training methods that explicitly promote
interpretability [51]. KANs fall into this category since the Kolmogorov-Arnold theorem decom-
poses a high-dimensional function into a collection of 1D functions, which are significantly easier
to interpret than high-dimensional functions.
7 Discussion
KAN interpolates between software 1.0 and 2.0 The key difference between Kolmogorov-Arnold
Networks (KANs) and other neural networks (software 2.0, a term coined by Andrej Karpathy) lies
in their greater interpretability, which allows for manipulation by users, similar to traditional soft-
ware (software 1.0). However, KANs are not entirely traditional software, as they offer (1) learnability (good), enabling them to learn new things from data, and (2) reduced interpretability (bad), since they
become less interpretable and controllable as the network scale increases. Figure 14 (a) visualizes
the position of software 1.0, software 2.0, and KANs on the interpretability-learnability plane, illus-
trating how KANs can balance the trade-offs between these two paradigms. The goal of this paper
is to propose various tools that make KANs more like software 1.0, while leveraging the learnability
of software 2.0.
Efficiency improvement The original pykan package [57] was inefficient. We have incorporated a few techniques to improve its efficiency.
1. Efficient spline evaluations. Inspired by Efficient KAN [9], we have optimized spline evaluations by avoiding unnecessary input expansions. For a KAN with L layers, N neurons per layer, and grid size G, memory usage has been reduced from $O(LN^2G)$ to $O(LNG)$.
2. Enabling the symbolic branch only when needed. A KAN layer contains both a spline
branch and a symbolic branch. The symbolic branch is much more time-consuming than the
spline branch since it cannot be parallelized (disastrous double loops are needed). However,
in many applications, the symbolic branch is unnecessary, so we can skip it when possible,
significantly reducing runtime, especially when the network is large.
3. Saving intermediate activations only when needed. To plot KAN diagrams, intermediate
activations must be saved. Initially, activations were saved by default, leading to slower
runtime and excessive memory usage. We now save intermediate activations only when
needed (e.g., for plotting or applying regularizations in training). Users can enable these
efficiency improvements with a single line: model.speed().
4. GPU acceleration. Initially, all models were run on CPUs due to the small-scale nature of the problems. We have now made the model GPU-compatible.⁶ For example, training a [4,100,100,100,1] KAN with Adam for 100 steps used to take an entire day on a CPU (before implementing 1, 2, 3), but now takes 20 seconds on a CPU and less than one second on a GPU. However, KANs still lag behind MLPs in efficiency, especially at large scales. The community has been working towards benchmarking and improving KAN efficiency, and the efficiency gap has been significantly reduced [36].
Since the objective of this paper is to make KANs more like software 1.0, when facing trade-offs
between 1.0 (being interactive and versatile) and 2.0 (being efficient and specific), we prioritize
interactivity and versatility over efficiency. For example, we store cached data within models (which
consumes additional memory), so users can simply call model.plot() to generate a KAN diagram
without manually doing a forward pass to collect data.
Interpretability Although the learnable univariate functions in KANs are more interpretable than
weight matrices in MLPs, scalability remains a challenge. As KAN models scale up, even if all
spline functions are interpretable individually, it becomes increasingly difficult to manage the com-
bined output of these 1D functions. Consequently, a KAN may only remain interpretable when
the network scale is relatively small (Figure 14 (b), thick red line). It is important to note that
interpretability depends on both intrinsic factors (related to the model itself) and extrinsic factors
(related to interpretability methods). Advanced interpretability methods should be able to handle
interpretability at various levels. For example, by interpreting KANs with symbolic regression,
modularity discovery and feature attribution (Figure 14 (b), thin red lines), the Pareto Frontier of
interpretability versus scale extends beyond what a KAN alone can achieve. A promising direction
for future research is to develop more advanced interpretability methods that can further push the
current Pareto Frontiers.
Future work This paper introduces a framework that integrates KANs with scientific knowledge,
focusing primarily on small-scale, physics-related examples. Moving forward, two promising direc-
tions include applying this framework to larger-scale problems and extending it to other scientific
disciplines beyond physics.
Acknowledgement
We would like to thank Yizhou Liu, Di Luo, Akash Kundu and many GitHub users for fruitful
discussion and constructive suggestions. We extend special thanks to GitHub user Blealtan for
⁶ Models can be trained on GPUs, but not all functionalities support GPUs yet.
making public their awesome work on making KANs efficient. Z.L. and M.T. are supported by
IAIFI through NSF grant PHY-2019786.
References
[1] D. W. Abueidda, P. Pantidis, and M. E. Mobasher. Deepokan: Deep operator network based
on kolmogorov arnold networks for mechanics problems. arXiv preprint arXiv:2405.19143,
2024.
[2] A. A. Aghaei. fkan: Fractional kolmogorov-arnold networks with trainable jacobi basis func-
tions. arXiv preprint arXiv:2406.07456, 2024.
[3] A. A. Aghaei. rkan: Rational kolmogorov-arnold networks. arXiv preprint arXiv:2406.14495,
2024.
[4] T. Ahmed and M. H. R. Sifat. Graphkan: Graph kolmogorov arnold network for small
molecule-protein interaction predictions. In ICML’24 Workshop ML for Life and Material
Science: From Theory to Industry Applications, 2024.
[5] M. G. Altarabichi. Dropkan: Regularizing kans by masking post-activations. arXiv preprint
arXiv:2407.13044, 2024.
[6] E. M. Arruda and M. C. Boyce. A three-dimensional constitutive model for the large
stretch behavior of rubber elastic materials. Journal of the Mechanics and Physics of Solids,
41(2):389–412, 1993.
[7] B. Azam and N. Akhtar. Suitability of kans for computer vision: A preliminary investigation.
arXiv preprint arXiv:2406.09087, 2024.
[8] G. K. Batchelor. An introduction to fluid dynamics. Cambridge university press, 2000.
[9] Blealtan. Blealtan/efficient-kan: An efficient pure-pytorch implementation of kolmogorov-
arnold network (kan).
[10] A. D. Bodner, A. S. Tepsich, J. N. Spolski, and S. Pourteau. Convolutional kolmogorov-
arnold networks. arXiv preprint arXiv:2406.13155, 2024.
[11] Z. Bozorgasl and H. Chen. Wav-kan: Wavelet kolmogorov-arnold networks. arXiv preprint
arXiv:2405.12832, 2024.
[12] R. Bresson, G. Nikolentzos, G. Panagopoulos, M. Chatzianastasis, J. Pang, and M. Vazir-
giannis. Kagnns: Kolmogorov-arnold networks meet graph learning. arXiv preprint
arXiv:2406.18380, 2024.
[13] S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by
sparse identification of nonlinear dynamical systems. Proceedings of the national academy
of sciences, 113(15):3932–3937, 2016.
[14] J. Carrasquilla and R. G. Melko. Machine learning phases of matter. Nature Physics,
13(5):431–434, 2017.
[15] Y. Chen, Z. Zhu, S. Zhu, L. Qiu, B. Zou, F. Jia, Y. Zhu, C. Zhang, Z. Fang, F. Qin, et al.
Sckansformer: Fine-grained classification of bone marrow cells via kansformer backbone
and hierarchical attention mechanisms. arXiv preprint arXiv:2406.09931, 2024.
[16] M. Cheon. Demonstrating the efficacy of kolmogorov-arnold networks in vision tasks. arXiv
preprint arXiv:2406.14916, 2024.
[17] M. Cheon. Kolmogorov-arnold network for satellite image classification in remote sensing.
arXiv preprint arXiv:2406.00600, 2024.
[18] M. Cranmer. Interpretable machine learning for science with pysr and symbolicregression.jl. arXiv preprint arXiv:2305.01582, 2023.
[19] M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, and S. Ho. Lagrangian neural
networks. arXiv preprint arXiv:2003.04630, 2020.
[20] M. Cranmer, A. Sanchez Gonzalez, P. Battaglia, R. Xu, K. Cranmer, D. Spergel, and S. Ho.
Discovering symbolic models from deep learning with inductive biases. Advances in neural
information processing systems, 33:17429–17442, 2020.
[21] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse autoencoders find
highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
[22] G. De Carlo, A. Mastropietro, and A. Anagnostopoulos. Kolmogorov-arnold graph neural
networks. arXiv preprint arXiv:2406.18354, 2024.
[23] O. Dugan, R. Dangovski, A. Costa, S. Kim, P. Goyal, J. Jacobson, and M. Soljačić. Occamnet:
A fast neural model for symbolic regression at scale. arXiv preprint arXiv:2007.10784, 2020.
[24] N. Elhage, T. Hume, C. Olsson, N. Nanda, T. Henighan, S. Johnston, S. ElShowk, N. Joseph,
N. DasSarma, B. Mann, D. Hernandez, A. Askell, K. Ndousse, A. Jones, D. Drain, A. Chen,
Y. Bai, D. Ganguli, L. Lovitt, Z. Hatfield-Dodds, J. Kernion, T. Conerly, S. Kravec, S. Fort,
S. Kadavath, J. Jacobson, E. Tran-Johnson, J. Kaplan, J. Clark, T. Brown, S. McCan-
dlish, D. Amodei, and C. Olah. Softmax linear units. Transformer Circuits Thread, 2022.
https://transformer-circuits.pub/2022/solu/index.html.
[25] N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds,
R. Lasenby, D. Drain, C. Chen, et al. Toy models of superposition. arXiv preprint
arXiv:2209.10652, 2022.
[26] J. Engels, I. Liao, E. J. Michaud, W. Gurnee, and M. Tegmark. Not all language model
features are linear. arXiv preprint arXiv:2405.14860, 2024.
[27] R. Genet and H. Inzirillo. A temporal kolmogorov-arnold transformer for time series fore-
casting. arXiv preprint arXiv:2406.02486, 2024.
[28] R. Genet and H. Inzirillo. Tkan: Temporal kolmogorov-arnold networks. arXiv preprint
arXiv:2405.07344, 2024.
[29] A. N. Gent. A new constitutive relation for rubber. Rubber chemistry and technology,
69(1):59–61, 1996.
[30] F. Girosi and T. Poggio. Representation properties of networks: Kolmogorov’s theorem is
irrelevant. Neural Computation, 1(4):465–469, 1989.
[31] S. Greydanus, M. Dzamba, and J. Yosinski. Hamiltonian neural networks. Advances in neural
information processing systems, 32, 2019.
[32] S. Ha and H. Jeong. Discovering conservation laws from trajectories via machine learning.
arXiv preprint arXiv:2102.04008, 2021.
[33] L. F. Herbozo Contreras, J. Cui, L. Yu, Z. Huang, A. Nikpour, and O. Kavehei. Kan-eeg:
Towards replacing backbone-mlp for an effective seizure detection system. medRxiv, pages
2024–06, 2024.
[34] R. Hooke. Lectures de potentia restitutiva, or of spring explaining the power of springing
bodies. Number 6. John Martyn, 2016.
[35] A. A. Howard, B. Jacob, S. H. Murphy, A. Heinlein, and P. Stinis. Finite basis kolmogorov-
arnold networks: domain decomposition for data-driven and physics-informed problems.
arXiv preprint arXiv:2406.19662, 2024.
[36] Jerry-Master. Jerry-master/kan-benchmarking.
[37] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool,
R. Bates, A. Žídek, A. Potapenko, et al. Highly accurate protein structure prediction with
alphafold. nature, 596(7873):583–589, 2021.
[38] M. Kiamari, M. Kiamari, and B. Krishnamachari. Gkan: Graph kolmogorov-arnold networks.
arXiv preprint arXiv:2406.06470, 2024.
[39] S. Krippendorf and M. Syvaeri. Detecting symmetries with neural networks. Machine Learn-
ing: Science and Technology, 2(1):015010, 2020.
[40] A. Kundu, A. Sarkar, and A. Sadhu. Kanqas: Kolmogorov arnold network for quantum
architecture search. arXiv preprint arXiv:2406.17630, 2024.
[41] R. Lam, A. Sanchez-Gonzalez, M. Willson, P. Wirnsberger, M. Fortunato, F. Alet, S. Ravuri,
T. Ewalds, Z. Eaton-Rosen, W. Hu, et al. Learning skillful medium-range global weather
forecasting. Science, 382(6677):1416–1421, 2023.
[42] T. X. H. Le, T. D. Tran, H. L. Pham, V. T. D. Le, T. H. Vu, V. T. Nguyen, Y. Nakashima,
et al. Exploring the limitations of kolmogorov-arnold networks in classification: Insights to
software training and hardware implementation. arXiv preprint arXiv:2407.17790, 2024.
[43] P. Lemos, N. Jeffrey, M. Cranmer, S. Ho, and P. Battaglia. Rediscovering orbital mechanics
with machine learning. Machine Learning: Science and Technology, 4(4):045002, 2023.
[44] C. Li, X. Liu, W. Li, C. Wang, H. Liu, and Y. Yuan. U-kan makes strong backbone for medical
image segmentation and generation. arXiv preprint arXiv:2406.02918, 2024.
[45] K. Li, A. K. Hopkins, D. Bau, F. Viégas, H. Pfister, and M. Wattenberg. Emergent world
representations: Exploring a sequence model trained on a synthetic task. In The Eleventh
International Conference on Learning Representations, 2023.
[46] X. Li, Z. Feng, Y. Chen, W. Dai, Z. He, Y. Zhou, and S. Jiao. Coeff-kans: A paradigm to
address the electrolyte field with kans. arXiv preprint arXiv:2407.20265, 2024.
[47] Z. Li. Kolmogorov-arnold networks are radial basis function networks. arXiv preprint
arXiv:2405.06721, 2024.
[48] H. Liu, J. Lei, and Z. Ren. From complexity to clarity: Kolmogorov-arnold networks in
nuclear binding energy prediction, 2024.
[49] M. Liu, S. Bian, B. Zhou, and P. Lukowicz. ikan: Global incremental learning with kan for
human activity recognition across heterogeneous datasets. arXiv preprint arXiv:2406.01646,
2024.
[50] M. Liu, D. Geißler, D. Nshimyimana, S. Bian, B. Zhou, and P. Lukowicz. Initial investigation
of kolmogorov-arnold networks (kans) as feature extractors for imu based human activity
recognition. arXiv preprint arXiv:2406.11914, 2024.
[51] Z. Liu, E. Gan, and M. Tegmark. Seeing is believing: Brain-inspired modular training for
mechanistic interpretability. Entropy, 26(1):41, 2023.
[52] Z. Liu, M. Khona, I. R. Fiete, and M. Tegmark. Growing brains: Co-emergence of anatomical
and functional modularity in recurrent neural networks. arXiv preprint arXiv:2310.07711,
2023.
[53] Z. Liu, V. Madhavan, and M. Tegmark. Machine learning conservation laws from differential
equations. Physical Review E, 106(4):045307, 2022.
[54] Z. Liu, P. O. Sturm, S. Bharadwaj, S. J. Silva, and M. Tegmark. Interpretable conservation
laws as sparse invariants. Phys. Rev. E, 109:L023301, Feb 2024.
[55] Z. Liu and M. Tegmark. Machine learning conservation laws from trajectories. Phys. Rev.
Lett., 126:180604, May 2021.
[56] Z. Liu and M. Tegmark. Machine learning hidden symmetries. Physical Review Letters,
128(18):180201, 2022.
[57] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark.
Kan: Kolmogorov-arnold networks. arXiv preprint arXiv:2404.19756, 2024.
[58] P. Y. Lu, R. Dangovski, and M. Soljačić. Discovering conservation laws using optimal trans-
port and manifold learning. Nature Communications, 14(1):4744, 2023.
[59] P. Ma, P. Y. Chen, B. Deng, J. B. Tenenbaum, T. Du, C. Gan, and W. Matusik. Learning neural
constitutive laws from motion observations for generalizable pde dynamics. In International
Conference on Machine Learning, pages 23279–23300. PMLR, 2023.
[60] P. Ma, T.-H. Wang, M. Guo, Z. Sun, J. B. Tenenbaum, D. Rus, C. Gan, and W. Matusik.
Llm and simulation as bilevel optimizers: A new paradigm to advance physical scientific
discovery. In Forty-first International Conference on Machine Learning, 2024.
[61] G. Martius and C. H. Lampert. Extrapolation and learning equations. arXiv preprint
arXiv:1610.02995, 2016.
[62] K. Meng, D. Bau, A. J. Andonian, and Y. Belinkov. Locating and editing factual associations
in GPT. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural
Information Processing Systems, 2022.
[63] E. J. Michaud, Z. Liu, and M. Tegmark. Precision machine learning. Entropy, 25(1):175,
2023.
[64] R. v. Mises. Mechanik der festen körper im plastisch-deformablen zustand. Nachrichten
von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische Klasse,
1913:582–592, 1913.
[65] C. Misner, K. Thorne, and J. Wheeler. Gravitation. Princeton University Press, 2017.
[66] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking
via mechanistic interpretability. In The Eleventh International Conference on Learning Rep-
resentations, 2023.
[67] G. Nehma and M. Tiwari. Leveraging kans for enhanced deep koopman operator discovery.
arXiv preprint arXiv:2406.02875, 2024.
[68] R. W. Ogden. Non-linear elastic deformations. Courier Corporation, 1997.
[69] Y. Peng, M. He, F. Hu, Z. Mao, X. Huang, and J. Ding. Predictive modeling of flexible ehd
pumps using kolmogorov-arnold networks. arXiv preprint arXiv:2405.07488, 2024.
[70] E. Poeta, F. Giobergia, E. Pastor, T. Cerquitelli, and E. Baralis. A benchmarking study of
kolmogorov-arnold networks on tabular data. arXiv preprint arXiv:2406.14529, 2024.
[71] P. Pratyush, C. Carrier, S. Pokharel, H. D. Ismail, M. Chaudhari, and D. B. KC. Calmphoskan:
Prediction of general phosphorylation sites in proteins via fusion of codon aware embeddings
with amino acid aware embeddings and wavelet-based kolmogorov arnold network. bioRxiv,
pages 2024–07, 2024.
[72] S. Rigas, M. Papachristou, T. Papadopoulos, F. Anagnostopoulos, and G. Alexandridis.
Adaptive training of grid-dependent physics-informed kolmogorov-arnold networks. arXiv
preprint arXiv:2407.17611, 2024.
[73] A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and P. Battaglia. Learning
to simulate complex physics with graph networks. In International conference on machine
learning, pages 8459–8468. PMLR, 2020.
[74] M. Schmidt and H. Lipson. Distilling free-form natural laws from experimental data. science,
324(5923):81–85, 2009.
[75] S. T. Seydi. Exploring the potential of polynomial basis functions in kolmogorov-arnold
networks: A comparative study of different groups of polynomials. arXiv preprint
arXiv:2406.02583, 2024.
[76] S. T. Seydi. Unveiling the power of wavelets: A wavelet-based kolmogorov-arnold network
for hyperspectral image classification. arXiv preprint arXiv:2406.07869, 2024.
[77] H. Shen, C. Zeng, J. Wang, and Q. Wang. Reduced effectiveness of kolmogorov-arnold
networks on functions with noise. arXiv preprint arXiv:2407.14882, 2024.
[78] K. Shukla, J. D. Toscano, Z. Wang, Z. Zou, and G. E. Karniadakis. A comprehensive and
fair comparison between mlp and kan representations for differential equations and operator
networks. arXiv preprint arXiv:2406.02917, 2024.
[79] E. Sifakis and J. Barbic. Fem simulation of 3d deformable solids: a practitioner’s guide to
theory, discretization and model reduction. In Acm siggraph 2012 courses, pages 1–50. 2012.
[80] W. S. Slaughter. The linearized theory of elasticity. Springer Science & Business Media,
2012.
[81] B. Smith, F. D. Goes, and T. Kim. Stable neo-hookean flesh simulation. ACM Transactions
on Graphics (TOG), 37(2):1–15, 2018.
[82] H.-T. Ta. Bsrbf-kan: A combination of b-splines and radial basic functions in kolmogorov-
arnold networks. arXiv preprint arXiv:2406.11173, 2024.
[83] T. H. Trinh, Y. Wu, Q. V. Le, H. He, and T. Luong. Solving olympiad geometry without
human demonstrations. Nature, 625(7995):476–482, 2024.
[84] S.-M. Udrescu, A. Tan, J. Feng, O. Neto, T. Wu, and M. Tegmark. Ai feynman 2.0: Pareto-
optimal symbolic regression exploiting graph modularity. Advances in Neural Information
Processing Systems, 33:4860–4871, 2020.
[85] C. J. Vaca-Rubio, L. Blanco, R. Pereira, and M. Caus. Kolmogorov-arnold networks (kans)
for time series analysis. arXiv preprint arXiv:2405.08790, 2024.
[86] K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the
wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International
Conference on Learning Representations, 2023.
[87] Y. Wang, J. Sun, J. Bai, C. Anitescu, M. S. Eshaghi, X. Zhuang, T. Rabczuk, and Y. Liu.
Kolmogorov arnold informed neural network: A physics-informed deep learning framework
for solving pdes based on kolmogorov arnold networks. arXiv preprint arXiv:2406.11045,
2024.
[88] S. J. Wetzel. Unsupervised learning of phase transitions: From principal component analysis
to variational autoencoders. Physical Review E, 96(2):022140, 2017.
[89] S. J. Wetzel, R. G. Melko, J. Scott, M. Panju, and V. Ganesh. Discovering symmetry invariants
and conserved quantities by interpreting siamese neural networks. Phys. Rev. Res., 2:033499,
Sep 2020.
[90] T. Wu and M. Tegmark. Toward an artificial intelligence physicist for unsupervised learning.
Physical Review E, 100(3):033311, 2019.
[91] H. Xu, F. Sin, Y. Zhu, and J. Barbič. Nonlinear material design using principal stretches.
ACM Transactions on Graphics (TOG), 34(4):1–11, 2015.
[92] J. Xu, Z. Chen, J. Li, S. Yang, W. Wang, X. Hu, and E. C.-H. Ngai. Fourierkan-gcf: Fourier
kolmogorov-arnold network–an effective and efficient feature transformation for graph col-
laborative filtering. arXiv preprint arXiv:2406.01034, 2024.
[93] K. Xu, L. Chen, and S. Wang. Kolmogorov-arnold networks for time series: Bridging pre-
dictive power and interpretability. arXiv preprint arXiv:2406.02496, 2024.
[94] J. Yang, R. Walters, N. Dehmamy, and R. Yu. Generative adversarial symmetry discovery. In
International Conference on Machine Learning, pages 39488–39508. PMLR, 2023.
[95] K. Yang, A. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. J. Prenger, and
A. Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models.
Advances in Neural Information Processing Systems, 36, 2024.
[96] S. Yang, L. Qin, and X. Yu. Endowing interpretability for neural cognitive diagnosis by
efficient kolmogorov-arnold networks. arXiv preprint arXiv:2405.14399, 2024.
[97] R. Yu, W. Yu, and X. Wang. Kan or mlp: A fairer comparison. arXiv preprint
arXiv:2407.16674, 2024.
[98] E. Zeydan, C. J. Vaca-Rubio, L. Blanco, R. Pereira, M. Caus, and A. Aydeger. F-kans:
Federated kolmogorov-arnold networks. arXiv preprint arXiv:2407.20100, 2024.
[99] F. Zhang and X. Zhang. Graphkan: Enhancing feature extraction with graph kolmogorov
arnold networks. arXiv preprint arXiv:2406.13597, 2024.
26
[100] Z. Zhong, Z. Liu, M. Tegmark, and J. Andreas. The clock and the pizza: Two stories in mech-
anistic explanation of neural networks. In Thirty-seventh Conference on Neural Information
Processing Systems, 2023.
[101] S. Zinage, S. Mondal, and S. Sarkar. Dkl-kan: Scalable deep kernel learning using
kolmogorov-arnold networks. arXiv preprint arXiv:2407.21176, 2024.
27
TKAN: Temporal Kolmogorov-Arnold Networks
Rémi Genet1 and Hugo Inzirillo2
1
DRM, Université Paris Dauphine - PSL
2
CREST-ENSAE, Institut Polytechnique de Paris
Abstract—Recurrent Neural Networks (RNNs) have revolutionized many areas of machine learning, particularly in natural language and data sequence processing. Long Short-Term Memory (LSTM) has demonstrated its ability to capture long-term dependencies in sequential data. Inspired by the

Econometric models (ARIMA [2], [3]) and other extensions, and Recurrent Neural Networks (RNNs) [4], are families of models that have been recognized for their proven effectiveness in many forecasting scenarios [5].
Each RKAN layer manages this short-term memory throughout the processing in each layer.
• Gating mechanism: these mechanisms help to manage the information flow. The model decides which information should be retained or forgotten over time.

A. Recurring Kolmogorov-Arnold Networks (RKAN)

In neural networks, particularly in the context of recurrent neural networks (RNNs), a recurrent kernel refers to the set of weights applied to the hidden state from the previous timestep during the network's operation. This kernel plays a crucial role in how an RNN processes sequential data over time. Let us denote by τ = 1, 2, ... the discrete time steps. Each step has a forward pass and a backward pass: during the forward pass the outputs or activations of the units are computed, and during the backward pass the error for all weights is computed. At each timestep, an RNN receives an input vector and the hidden state from the previous timestep h_{t−1}. The recurrent kernel is a matrix of weights that transforms this previous hidden state. This operation is usually followed by an addition, and the result of this transformation is passed through a non-linear activation function f(·), which can take many forms (tanh, ReLU, etc.). The update can be formulated as

h_t = f(W_{hh} h_{t−1} + W_{hx} x_t + b_h),   (7)

where h_t is the hidden state at time t ∈ τ, and W_{hh} and W_{hx} are the recurrent kernel (a weight matrix that transforms the previous hidden state h_{t−1}) and the input kernel (a weight matrix transforming the current input x_t), respectively. In the following we propose a new kind of update for KANs: a process that maintains the memory of past inputs by incorporating previous hidden states into the current states, enabling the network to exhibit dynamic temporal behavior. The recurrent kernel is the key that lets RKAN layers learn from sequences where context and order of the data matter. We design the TKAN to leverage the power of Kolmogorov-Arnold Networks while offering memory management to handle time dependency. To introduce time dependency we modify each transformation function ϕ_{l,j,i} to be time dependent. Let us denote by h_{l,i}(t) a memory function capturing the history of node i in the l-th layer:

x_{l+1,j}(t) = \sum_{i=1}^{n_l} \tilde{x}_{l,j,i}(t) = \sum_{i=1}^{n_l} \phi_{l,j,i,t}\big(x_{l,i}(t), h_{l,i}(t)\big), \qquad j = 1, \dots, n_{l+1}.   (8)

The "memory" step h_{l,i}(t) is defined as a combination of past hidden states, such that

h_{l,i}(t) = W_{hh} h_{l,i}(t−1) + W_{hz} x_{l,i}(t),   (9)

where W is a vector of weights that balances the importance of past values relative to the most recent input. With this Recurring KAN layer, the network now embeds memory management at each layer:

KAN(x, t) = (Φ_{L−1,t} ◦ Φ_{L−2,t} ◦ · · · ◦ Φ_{1,t} ◦ Φ_{0,t})(x, t).   (10)

B. TKAN Architecture

Fig. 2: A two-layer Temporal Kolmogorov-Arnold Network (TKAN) block.

For the next step, to maintain the memory, we took inspiration from the LSTM [18], [19]. Let x_t denote the input vector of dimension d. The unit uses several internal vectors and gates to manage information flow. The forget gate, with activation vector f_t,

f_t = σ(W_f x_t + U_f h_{t−1} + b_f),   (11)

decides what information to forget from the previous state. The input gate, with activation vector i_t,

i_t = σ(W_i x_t + U_i h_{t−1} + b_i),   (12)

controls which new information to include. The output gate, with activation vector o_t,

o_t = σ(KAN(x, t)),   (13)

determines what information from the current state to output, given KAN(x, t) from Eq. (10). The hidden state h_t captures the unit's output, while the cell state c_t is updated as

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t,   (14)

where c̃_t = σ(W_c x_t + U_c h_{t−1} + b_c) represents the internal memory. All these internal states have dimensionality h. The output h_t is given by

h_t = o_t ⊙ tanh(c_t).   (15)

Finally, the unit relies on the parameter matrices W and U and a bias vector b. The prediction is obtained with a linear layer,
ŷ_t = W_{hy} h_t + b_y.   (16)

In the following sections we describe the learning task and the tests we performed. We first tested the predictive power of our model for one-step-ahead forecasting, and then for multi-step-ahead forecasting [20], [21].
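To make the recurrence concrete, the sketch below implements a single TKAN-style cell step following Eqs. (11)–(16), with the time-dependent KAN sub-network of Eq. (10) replaced by a plain linear stand-in (the actual RKAN sub-layers use learnable spline activations). The function name `tkan_cell_step`, the weight dictionary, and the stub are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tkan_cell_step(x_t, h_prev, c_prev, p, kan_forward):
    """One TKAN-style update following Eqs. (11)-(16).

    x_t: input vector; h_prev / c_prev: previous hidden and cell states;
    p: dict of weight matrices and biases; kan_forward: callable standing
    in for the (time-dependent) KAN sub-network of Eq. (10).
    """
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])       # forget gate, Eq. (11)
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])       # input gate,  Eq. (12)
    o_t = sigmoid(kan_forward(x_t))                                 # output gate, Eq. (13)
    c_tilde = sigmoid(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])   # internal memory
    c_t = f_t * c_prev + i_t * c_tilde                              # cell state, Eq. (14)
    h_t = o_t * np.tanh(c_t)                                        # hidden state, Eq. (15)
    y_t = p["Why"] @ h_t + p["by"]                                  # linear read-out, Eq. (16)
    return h_t, c_t, y_t

# Toy usage with random weights; dimensions: input d = 3, hidden h = 4.
rng = np.random.default_rng(0)
d, h = 3, 4
p = {k: rng.normal(size=(h, d)) for k in ("Wf", "Wi", "Wc")}
p.update({k: rng.normal(size=(h, h)) for k in ("Uf", "Ui", "Uc")})
p.update({k: rng.normal(size=h) for k in ("bf", "bi", "bc")})
p.update({"Why": rng.normal(size=(1, h)), "by": rng.normal(size=1)})
Wk = rng.normal(size=(h, d))
kan_stub = lambda x: Wk @ x   # stand-in for KAN(x, t)
h_t, c_t, y_t = tkan_cell_step(rng.normal(size=d), np.zeros(h), np.zeros(h), p, kan_stub)
```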
IV. LEARNING TASK

In order to assess the relevance of extending our model, we created a simple training task to judge its ability to improve out-of-sample prediction quality, several steps ahead, over standard layers such as GRU or LSTM. We chose to carry out our study not on synthetic data but on real market data, as these seem more relevant to us: synthetic market data can easily be biased to match an experiment.

A. Task Definition and Dataset

The task we set is to predict the notional amounts traded on the market over future periods. This is not an easy task, as the market is difficult to predict, but it is not impossible either: unlike returns, volumes have internal patterns such as seasonality, autocorrelation, and so on.

To do this, we used Binance as our only data source, as it is the largest player in the crypto-currency market; given the lack of regulation, taking all exchange data into account would be problematic, since exchanges are subject to a lot of falsified data from small players who use wash trading to boost their numbers.

Our dataset consists of the notional amounts traded each hour on several assets: BTC, ETH, ADA, XMR, EOS, MATIC, TRX, FTM, BNB, XLM, ENJ, CHZ, BUSD, ATOM, LINK, ETC, XRP, BCH and LTC, which are used to predict just one of them, BTC. The data period runs from January 1, 2020 to December 31, 2022.

B. Preprocessing

Data preparation is a necessary step for most machine learning methods, in order to help gradient descent when the data are on different scales, to obtain stationarity in the series, and so on. This is even more true when using the Kolmogorov-Arnold network, as the underlying B-spline activations have a power exponent, so poorly scaled data would result in an over- or underflow that would hinder learning. This is also true for market volume (notional) data, as this is a series with very different scales between assets and between points in time, not to mention total non-stationarity over a long period.

To obtain data that can be used for training, but that also returns meaningful losses to optimize, we use a two-stage scaling. The first step is to divide the values in the series by the moving median of the last two weeks. This moving median window is also shifted by the number of steps forward we want to predict, so as not to include foresight. This first pre-treatment aims to make the series more stationary over time. The second pre-processing step is a simple MinMax scaling per asset; since here the minimum of the series is 0, it is simply a matter of dividing by the maximum value. The objective is to scale the data to the [0, 1] interval to avoid an explosive effect during learning due to the power exponent. This pre-processing is, however, fitted on the training set only: the adjustment amounts to finding the maximum value of each series, which is then used directly on the test set. This means that on the test set it is possible to have data greater than 1, but since no optimization is involved, this is not a problem.

Finally, we split our dataset into a training set and a test set, with a standard proportion of 80-20. This represents over 21,000 points in the training set and 5,000 in the test set.
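A minimal sketch of the two-stage scaling described above, assuming the raw notionals sit in a pandas DataFrame indexed by hour with one column per asset; the function name and the `horizon` argument are illustrative, not taken from the authors' code.

```python
import pandas as pd

def two_stage_scale(notional: pd.DataFrame, horizon: int, train_frac: float = 0.8):
    """Stage 1: divide by a two-week moving median, shifted by the forecast
    horizon to avoid look-ahead. Stage 2: divide by the per-asset maximum
    fitted on the training portion only (the minimum is 0, so MinMax scaling
    reduces to this division)."""
    window = 14 * 24                               # two weeks of hourly points
    med = notional.rolling(window).median().shift(horizon)
    ratio = notional / med                         # stage 1: more stationary series

    split = int(len(ratio) * train_frac)
    train_max = ratio.iloc[:split].max()           # fitted on the training set only
    scaled = ratio / train_max                     # stage 2: roughly in [0, 1]
    return scaled.iloc[:split].dropna(), scaled.iloc[split:]

# Usage sketch: df has hourly notional columns ["BTC", "ETH", ...]
# train, test = two_stage_scale(df, horizon=6)
```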
C. Loss Function for Model Training

Since we have a numerical prediction problem, we opted to optimize our model with the mean squared error (MSE) as the loss function, whose formula is simply

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{X}^{(i)}_{t+1} - X^{(i)}_{t+1} \right)^2,$$

where N is the number of samples in the dataset, \hat{X}^{(i)}_{t+1} denotes the predicted notional value of Bitcoin at time t+1 for the i-th sample, and X^{(i)}_{t+1} is the corresponding true value. The first reason for this choice is that it is the most widely used and standard objective in machine learning for this type of problem. The second reason is the metric we want to use to display the results, namely the R-squared (R²). R² is interesting as a metric because it gives information not only on the error but on the error relative to the variance of the estimated series, which makes it much easier to tell whether the model is performing well or not. It is also a measure widely used by econometricians and practitioners for this very reason. Moreover, minimizing the MSE is exactly the same as maximizing R², as its formula indicates:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \big(\hat{X}^{(i)}_{t+1} - X^{(i)}_{t+1}\big)^2}{\sum_{i=1}^{N} \big(X^{(i)}_{t+1} - \bar{X}_{t+1}\big)^2}.$$

As can be seen, the numerator of the quotient is the sum of squared errors; since the other two components are fixed, optimizing the mean squared error is equivalent to optimizing R².

D. Benchmarks

1) Model Architectures: In order to compare like with like, we tested our TKAN layers against two of the most widely used RNN layers for multi-step prediction, namely gated recurrent units (GRU) and long short-term memory (LSTM). Note that we are not comparing ourselves to complete model architectures such as the Temporal Fusion Transformer, as what we propose is a layer rather than a complete model architecture.

To compare the three fairly, we opted for a very simple configuration, in which we create three models built in the same way:
1) an initial recurrent layer of 100 units that returns complete sequences;
2) an intermediate recurrent layer of 100 units, which returns only the last hidden state;
3) a final dense layer with linear activation, with as many units as there are timesteps to predict ahead.

For the TKAN model, we used 5 B-spline activations of orders 0 to 4 as sub-layer activations, while we used the standard activation functions for GRU and LSTM. Finally, we also compared the three models to the most naive benchmark, which consists of using the last observed value as the predictor of the future, the value being repeated when predicting several steps ahead.
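As an illustration of this three-layer setup, here is a hedged Keras-style sketch of the GRU and LSTM baselines; the TKAN variant would swap a TKAN layer in place of the recurrent cells, which we do not reproduce here. The function name, `n_ahead`, and the feature count are placeholders, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def make_baseline(cell: str, n_features: int, n_ahead: int) -> tf.keras.Model:
    """Two recurrent layers of 100 units followed by a linear head,
    mirroring the benchmark configuration described above."""
    Rec = {"gru": layers.GRU, "lstm": layers.LSTM}[cell]
    return models.Sequential([
        layers.Input(shape=(None, n_features)),
        Rec(100, return_sequences=True),              # 1) returns the full sequence
        Rec(100, return_sequences=False),             # 2) returns only the last hidden state
        layers.Dense(n_ahead, activation="linear"),   # 3) one unit per step ahead
    ])

# model = make_baseline("gru", n_features=19, n_ahead=6)   # 19 asset columns, 6 steps ahead
# model.compile(optimizer="adam", loss="mse")
```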
2) Note on training details: Metrics are calculated directly on the scaled data rather than on unscaled data. There are two reasons for this. Firstly, MinMax scaling has no impact on the metric: since the minimum is 0 and the data interval is [0, 1], rescaling would simply multiply all the points, which would not change the R-squared. Secondly, we do not rescale back from the median split either, as otherwise the series mean would not be stable over time, leading to a drift in error magnitude on certain parts of the series and rendering the metric meaningless. In terms of optimization details, we used the Adam optimizer, one of the most popular choices, used 20% of our training set as a validation set, and included two training callbacks. The first is an early stopping callback, which interrupts training after 6 consecutive epochs without improvement on the validation set and restores the weights associated with the best validation loss. The second is a plateau learning-rate reduction, which halves the learning rate after three consecutive epochs showing no improvement on the validation set.
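In Keras terms, the two callbacks described above would plausibly look like the following; the patience values come from the text, while everything else (including `validation_split`) is an assumption about the setup.

```python
from tensorflow.keras import callbacks

early_stop = callbacks.EarlyStopping(
    monitor="val_loss", patience=6,              # stop after 6 epochs without improvement
    restore_best_weights=True)                   # restore the best validation-loss weights

reduce_lr = callbacks.ReduceLROnPlateau(
    monitor="val_loss", patience=3, factor=0.5)  # halve the LR after 3 stagnant epochs

# model.fit(x_train, y_train, validation_split=0.2, epochs=100,
#           callbacks=[early_stop, reduce_lr])
```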
E. Results

To evaluate model performance, taking into account the risk that a single run may fit poorly, we repeat the experiment 5 times for each model and display below the mean and standard deviation of the results obtained from these 5 experiments.

TABLE I: Average R² obtained over 5 runs

Steps ahead | TKAN (5 B-spline) | GRU (default) | LSTM (default) | Last value
1           | 0.350845          | 0.365136      | 0.355532       | 0.292171
3           | 0.198842          | 0.200674      | 0.061220       | -0.062813
6           | 0.140543          | 0.082504      | -0.225838      | -0.331346
9           | 0.117477          | 0.087164      | -0.290584      | -0.457718
12          | 0.105111          | 0.017864      | -0.473220      | -0.518252
15          | 0.086077          | 0.033423      | -0.404432      | -0.555633

1) Performance Metrics Summary: The results show a very logical decrease in R² with the number of forward steps we want to predict, which is expected since we have less information for steps further ahead. However, the results clearly show two things. First, while all models show relatively small differences in performance over a short horizon such as 1 to 3 steps ahead, performance diverges much more over longer horizons: the LSTM model even becomes useless and counter-productive at 6 steps, while TKAN and GRU always achieve a higher average R-squared value. Moreover, TKAN stands out for longer horizons, with an R-squared value at least 25% higher than that of GRU. Another very interesting point is model stability, i.e. the ability to calibrate well-functioning weights from samples without too much variation from one experiment to the next; here again, TKAN showed much better stability than all the other models.

2) Training Dynamics and Model Stability: To better understand the differences in performance between the models, we visualized training and validation losses over several training sessions for each model. These graphs offer a dynamic view of each model's learning process and its ability to generalize beyond the training data.

Fig. 3: TKAN training and validation loss over epochs
Fig. 5: LSTM training and validation loss over epochs

The visual representations clearly corroborate the statistical results presented above. The GRU and LSTM models show a significant divergence between their training-loss and validation-loss trajectories, particularly as the number of epochs increases. This divergence suggests potential over-fitting, where the model learns idiosyncrasies of the training data rather than generalizing from them. The stability of the TKAN model's learning process, evident in the closer alignment of its training and validation loss curves, indicates a consistent learner that effectively captures the underlying patterns in the data without overfitting.

V. CONCLUSION

In this paper, we proposed an adaptation of the Kolmogorov-Arnold Network architecture for time series that incorporates both recurrence and gating mechanisms. The architecture, while not complicated, improves multi-step performance and stability compared to traditional methods and seems promising. Temporal Kolmogorov-Arnold Networks (TKANs) combine features of recurrent neural networks (RNNs) and Kolmogorov-Arnold Networks (KANs), and this new architecture tackles the usual problem of RNNs with long-term dependencies. TKANs embed Recurrent Kolmogorov-Arnold Network (RKAN) layers, which help the system memorize and use both new and old information efficiently. Compared with traditional models such as LSTM and GRU, TKAN particularly stands out for longer-term predictions, showing that it is capable of handling different situations and longer horizons. Our experiments show that it is usable and more stable than GRU and LSTM on real historical market data. While it offers no particular advantage for short-term predictions, it demonstrates an ability to largely outperform the other models for multi-step predictions. This also confirms that the idea developed in the original KAN paper works well in real use cases and is relevant for time series analysis. This paper opens interesting new ways to improve our ability to calibrate accurate time-series models over multiple steps, which is one of the hardest tasks in temporal analysis.

REFERENCES

[1] O. B. Sezer, M. U. Gudelek, and A. M. Ozbayoglu, "Financial time series forecasting with deep learning: A systematic literature review: 2005–2019," Applied Soft Computing, vol. 90, p. 106181, 2020.
[2] S. Makridakis and M. Hibon, "Arma models and the box–jenkins methodology," Journal of Forecasting, vol. 16, no. 3, pp. 147–163, 1997.
[3] P. Mondal, L. Shit, and S. Goswami, "Study of effectiveness of time series modeling (arima) in forecasting stock prices," International Journal of Computer Science, Engineering and Applications, vol. 4, no. 2, p. 13, 2014.
[4] L. R. Medsker, L. Jain et al., "Recurrent neural networks," Design and Applications, vol. 5, no. 64-67, p. 2, 2001.
[5] H. Hewamalage, C. Bergmeir, and K. Bandara, "Recurrent neural networks for time series forecasting: Current status and future directions," International Journal of Forecasting, vol. 37, no. 1, pp. 388–427, 2021.
[6] Q. V. Le, N. Jaitly, and G. E. Hinton, "A simple way to initialize recurrent networks of rectified linear units," arXiv preprint arXiv:1504.00941, 2015.
[7] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[8] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014.
[9] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[10] S. Haykin, Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, 1998.
[11] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[12] P. J. Werbos, "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[13] R. J. Williams, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1, pp. 256–263, 1989.
[14] S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.
[15] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark, "Kan: Kolmogorov-arnold networks," arXiv preprint arXiv:2404.19756, 2024.
[16] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, pp. 386–408, 1958.
[17] A. N. Kolmogorov, On the Representation of Continuous Functions of Several Variables by Superpositions of Continuous Functions of a Smaller Number of Variables. American Mathematical Society, 1961.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[19] R. C. Staudemeyer and E. R. Morris, "Understanding lstm: a tutorial into long short-term memory recurrent neural networks," arXiv preprint arXiv:1909.09586, 2019.
[20] C. Fan, Y. Zhang, Y. Pan, X. Li, C. Zhang, R. Yuan, D. Wu, W. Wang, J. Pei, and H. Huang, "Multi-horizon time series forecasting with temporal attention learning," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2527–2535.
[21] B. Lim, S. Ö. Arık, N. Loeff, and T. Pfister, "Temporal fusion transformers for interpretable multi-horizon time series forecasting," International Journal of Forecasting, vol. 37, no. 4, pp. 1748–1764, 2021.
CONVOLUTIONAL KOLMOGOROV–ARNOLD NETWORKS

November 5, 2024

ABSTRACT
In this paper, we introduce Convolutional Kolmogorov-Arnold Networks (Convolutional KANs), an innovative alternative to the standard Convolutional Neural Networks (CNNs) that have revolutionized the field of computer vision. By integrating the learnable non-linear activation functions presented in Kolmogorov-Arnold Networks (KANs) into convolutions, we propose a new layer. Throughout the paper, we empirically validate the performance of Convolutional KANs against traditional architectures on the Fashion-MNIST dataset, finding that, in some cases, this new approach maintains a similar level of accuracy while using half the number of parameters. KAN convolutions seem to learn more per kernel, which opens up a new horizon of possibilities in deep learning for computer vision.
1 Introduction
The field of deep learning is constantly changing, and the fast improvement of architectures has advanced computer vision in tasks involving complex spatial data. Convolutional Neural Networks, proposed by LeCun et al. [8], are widely used due to their ability to handle high-dimensional data arrays such as images. Typically, these networks rely on linear transformations followed by an optional activation function in their convolutional layers to capture spatial relationships, which significantly reduces the number of parameters needed to capture complex patterns in images.
In recent years, there has been an increase in the integration of advanced mathematical theories into deep learning architectures, which has helped neural networks handle complex data structures. Kolmogorov-Arnold Networks (KANs) [10] are a promising alternative to Multi-Layer Perceptrons (MLPs) [6] that use the Kolmogorov-Arnold theorem to integrate splines as a key component of their architecture.
In light of these advancements, this paper explores the adaptation of KANs to convolutional layers, a common element in many CNN architectures used in computer vision. Traditional CNNs use fixed activation functions and linear transformations which, while effective, can benefit from the flexibility and reduced parametric complexity offered by KANs. By employing spline-based convolutional layers, as proposed in SplineCNN by Fey and Lenssen [4], networks can capture non-linear relationships more effectively.
Throughout this paper, we begin with a high-level overview of the KAN architecture to set the stage for a comprehensive mathematical treatment of Convolutional KANs. We provide a detailed examination of different Convolutional KAN architectures and benchmark their performance against traditional models, focusing on parameter efficiency within the MNIST and Fashion-MNIST datasets. Our hypothesis is that Convolutional KANs, by leveraging spline-based layers, will require fewer parameters while achieving accuracy levels competitive with established benchmarks, potentially setting a new standard in neural network architectures for image-related tasks [3]. For further exploration and practical application, the code for this layer and all the experiments is available at our GitHub repository: GitHub/Convolutional-KANs.
2 Related work
Kolmogorov-Arnold theorem and neural networks
The application of the Kolmogorov-Arnold theorem in neural networks marks a significant theoretical integration that
enhances the expressiveness and efficiency of neural models. The theorem, which provides a way to represent any
multivariate continuous function as a composition of univariate functions and additions, has been adapted in the design
of Kolmogorov-Arnold Networks (KANs). KANs differ from traditional MLPs by replacing linear weight matrices
with learnable splines, thus reducing the number of parameters required and potentially improving the generalization
capabilities of the network [10].
Splines in Convolutional Neural Networks
One noteworthy development in neural network architecture involves the use of splines, particularly in the context of
Convolutional Neural Networks (CNNs). The SplineCNN, as proposed by Fey and Lenssen [4], introduces
spline-based convolutional layers that enhance the network’s ability to capture non-linear relationships in the data. This
approach is particularly effective in geometric deep learning, where the adaptability of splines plays a crucial role in
handling non-Euclidean data.
A significant aspect of the method proposed by the authors is its treatment of images like those in the MNIST dataset,
where it first processes the images by interpreting them as graphs before classification. This graph-based approach
allows SplineCNN to handle irregular data structures effectively. Unlike SplineCNN, our Convolutional Kolmogorov-
Arnold Networks (Convolutional KANs) apply spline functions directly on structured data such as images and matrices
without needing their conversion into graphs.
3.1 Architecture
The core of KANs resides in their unique architecture. Unlike traditional MLPs that use fixed activation functions
at nodes, KANs implement learnable activation functions on the network edges. This critical shift from static to
dynamic node functions involves replacing conventional linear weight matrices with adaptive spline functions, which
are parametrized and optimized during training. This allows for a more flexible and responsive model architecture that
can dynamically adapt to complex data patterns.
In more detail, the Kolmogorov-Arnold representation theorem posits that a multivariate function f (x1 , . . . , xn ) can be
expressed as:
$$f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)$$

Here, the ϕ_{q,p} are univariate functions mapping each input variable x_p, with ϕ_{q,p} : [0, 1] → R, and each Φ_q : R → R is also a univariate function.
KANs structure each layer as a matrix of these learnable 1D functions:
Φ = {ϕq,p }, p = 1, 2, . . . , nin , q = 1, 2, . . . , nout
Particularly, each function ϕq,p can be defined as a B-spline, a type of spline function defined by a linear combination of
basis splines, enhancing the network’s ability to learn complex data representations. Here, nin represents the number of
input features to a particular layer, while nout denotes the number of output features produced by that layer, reflecting
2
A PREPRINT - N OVEMBER 5, 2024
the dimensionality transformations across the network layers. The activation functions ϕl,j,i in this matrix are such
learnable spline functions, expressed as:
$$\mathrm{spline}(x) = \sum_i c_i B_i(x), \qquad c_i \ \text{trainable coefficients}$$

This formulation allows each ϕ_{l,j,i} to adapt its shape based on the data, offering unprecedented flexibility in how the network models interactions between inputs.

The overall structure of a KAN is analogous to stacking layers in MLPs, but with the enhancement of utilizing complex functional mappings instead of simple linear transformations and nonlinear activations:

$$\mathrm{KAN}(x) = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_0)(x)$$

Each layer's transformation, Φ_l, acts on the input x_l to produce the next layer's input x_{l+1}, described as:

$$x_{l+1} = \Phi_l(x_l) = \begin{pmatrix} \phi_{l,1,1}(\cdot) & \cdots & \phi_{l,1,n_l}(\cdot) \\ \vdots & \ddots & \vdots \\ \phi_{l,n_{l+1},1}(\cdot) & \cdots & \phi_{l,n_{l+1},n_l}(\cdot) \end{pmatrix} x_l$$

where each activation function ϕ_{l,j,i} is a spline, providing a rich, adaptable response surface to model inputs.
The motivation for the KAN architecture is that learnable activation functions on the edges enhance its expressive power and efficiency. By replacing linear weight matrices with spline functions, KANs reduce the number of parameters needed to achieve high accuracy, leading to faster convergence and better generalization.
In computer vision, "convolution" normally refers to the mathematical operation used in Convolutional Neural Networks: a kernel, or filter, is passed across the input and the dot product is computed at each position. The main idea of KAN Convolutions is to propose an alternative implementation of this operation following the Kolmogorov-Arnold Network approach. The main difference between KAN Convolutions and the convolutions used in CNNs lies in the kernel: in CNNs it is made of scalar weights, whereas in Convolutional KANs each element of the kernel, ϕ, is a learnable non-linear function built from B-splines. Formally, each element is defined as:
$$(\text{Image} * K)_{i,j} = \sum_{k=1}^{N} \sum_{l=1}^{M} \phi_{kl}(a_{i+k,\,j+l}) \tag{3}$$

$$\text{KAN Kernel} = \begin{pmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{pmatrix}$$

$$\text{Image} * \text{KAN Kernel} = \begin{pmatrix}
\phi_{11}(a_{11}) + \phi_{12}(a_{12}) + \phi_{21}(a_{21}) + \phi_{22}(a_{22}) & \cdots & r_{1(p-1)} \\
\phi_{11}(a_{21}) + \phi_{12}(a_{22}) + \phi_{21}(a_{31}) + \phi_{22}(a_{32}) & \cdots & r_{2(p-1)} \\
\vdots & \ddots & \vdots \\
\phi_{11}(a_{m1}) + \phi_{12}(a_{m2}) + \phi_{21}(a_{(m+1)1}) + \phi_{22}(a_{(m+1)2}) & \cdots & r_{m(p-1)}
\end{pmatrix} \tag{4}$$
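A minimal sketch of this operation for a single-channel image and a 2×2 KAN kernel. Each kernel element is a learnable univariate function; here we use a simplified parameterization (SiLU plus a few fixed Gaussian bumps with trainable coefficients) as a plainly-named stand-in for the B-spline basis used in the paper, and all names are illustrative rather than taken from the authors' code.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

class EdgeFunction:
    """Simplified phi(x) = w1*silu(x) + sum_i c_i * exp(-(x - g_i)^2),
    a stand-in for the SiLU + B-spline parameterization of KANs."""
    def __init__(self, rng, grid=np.linspace(-1, 1, 5)):
        self.grid = grid
        self.w1 = rng.normal()
        self.coef = rng.normal(size=grid.shape)

    def __call__(self, x):
        bumps = np.exp(-(x[..., None] - self.grid) ** 2)   # (..., n_basis)
        return self.w1 * silu(x) + bumps @ self.coef

def kan_conv2d(image, kernel):
    """Valid convolution where each kernel entry applies its own phi
    to the corresponding pixel before summation, as in Eq. (3)."""
    K = len(kernel)                       # kernel is a K x K grid of EdgeFunctions
    H, W = image.shape
    out = np.zeros((H - K + 1, W - K + 1))
    for k in range(K):
        for l in range(K):
            patch = image[k:k + H - K + 1, l:l + W - K + 1]
            out += kernel[k][l](patch)    # phi_kl applied element-wise, then summed
    return out

rng = np.random.default_rng(0)
kernel = [[EdgeFunction(rng) for _ in range(2)] for _ in range(2)]
feature_map = kan_conv2d(rng.uniform(-1, 1, size=(28, 28)), kernel)  # -> (27, 27)
```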
The grid refers to the set of points over which the spline is discretized. It is initialized with a predefined interval and number of control points, but during training some of the inputs to each ϕ may fall outside the grid limits. To tackle this, we extend the grid so that it can capture the variables that escape the original limits. The grid-extension method is described in the original KAN paper and consists of the following optimization problem:

$$\{c'_j\} = \arg\min_{\{c'_j\}} \; \mathbb{E}_{x \sim p(x)}\left[\sum_{j=0}^{G_2+k-1} c'_j B'_j(x) - \sum_{j=0}^{G_1+k-1} c_j B_j(x)\right] \tag{5}$$

where G_1 is the previous grid size, G_2 is the new grid size, and k is the B-spline degree.

While testing the models we verified that the outputs of a KAN convolutional layer were not bounded to the default grid range of [−1, 1]. This is a problem, especially when stacking multiple convolutional layers, since the input of a convolutional layer should lie in the range where the B-spline operates, so that the "learning" is done by the splines and not by the weight that scales the SiLU. To solve this issue, during training, each time an input falls outside the grid range the grid is updated: the spline shape is maintained over the original grid range with the same number of control points, and the spline is extended to a range that contains the input. Another solution to this issue is batch normalization: after each convolutional layer a batch-normalization layer is applied. This approach adds a small number of learnable µ and σ parameters that standardize the inputs to the layer to µ = 0 and σ = 1, ensuring that most, but not all, of the outputs are in range. As seen in Figure 1, when the input is out of the spline range the layer acts as a SiLU activation, as expected; so if the range is not updated and most inputs land out of range, a KAN will not differ from an MLP that uses SiLU activations.
Figure 1: Splines learned by the first convolution at the first position for different ranges. The left plot shows the spline
learned within the range [−1, 1], while the right plot shows the spline learned within the range [−10, 10]. The SILU
(Sigmoid Linear Unit) function is added to the spline across the entire range, but the spline is only defined within
[−1, 1]. Thus, outside this range, the SILU function predominates.
Upon analyzing the splines learned by the network in different convolutions, we did not find any recognizable pattern.
The behavior of the splines varies significantly across different convolutional layers and positions, indicating that the
learning process is highly context-dependent and does not conform to a simple, uniform structure.
As previously mentioned, the number of parameters is one of the main advantages of using Convolutional KANs. With ϕ defined as in Equation 1, the parameters of each ϕ are the two weights, w_1 and w_2, together with the control points, which can be adjusted to change the shape of each spline. There are therefore gridsize + 2 parameters per ϕ. For a convolution kernel of size K × K, a Convolutional KAN layer has K²(gridsize + 2) parameters in total, compared to only K² for a CNN convolutional layer. In our experiments the gridsize is typically between k and k², with k tending to be a small value between 2 and 16.
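As a quick sanity check of these counts (a hypothetical helper, not from the paper): a 3×3 KAN kernel with gridsize 10 carries 9 × 12 = 108 parameters, versus 9 for a classic kernel.

```python
def kan_kernel_params(K: int, gridsize: int) -> int:
    # Each phi has gridsize control points plus the two weights w1, w2.
    return K * K * (gridsize + 2)

def cnn_kernel_params(K: int) -> int:
    return K * K

assert kan_kernel_params(3, 10) == 108
assert cnn_kernel_params(3) == 9
```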
In the convolutional layers, Convolutional KANs have more parameters, but because they use splines they adapt more flexibly to the spatial information and thus require fewer fully connected layers, which are what significantly increase the parameter count. That is where the advantage of splines really shows: we are able to reduce the number of non-convolutional layers and thus reduce the overall parameter count.
5 Experiments
In this section we explain the different experiments we conducted to analyze the performance of different models that use KAN convolutional layers against a classical convolutional neural network.

During experimentation we used two datasets, MNIST and Fashion-MNIST. Running all the experiments on both datasets would double the time and cost, making it difficult to properly test the different architectures. As Fashion-MNIST is somewhat more complex than MNIST, we decided to tune hyperparameters and present results only on Fashion-MNIST.

We proposed architectures that mix Fully Connected (MLP), KAN, KAN convolutional and standard convolutional layers. Figure 2 shows the different architectures that contain KANs:
Figure 2: KAN architectures used in the experiments. A Max Pooling layer follows every convolutional layer, but for simplicity of the diagram we show it only at the end. Every architecture ends with a Log Softmax layer.
Figure 3: Standard architectures used in the experiments. A Max Pooling layer follows every convolutional layer, but for simplicity of the diagram we show it only at the end. Every architecture ends with a Log Softmax layer.
We conducted a hyperparameter search using grid search, testing 8 different combinations for each model. For KAN models, we performed separate grid searches for KAN grid sizes of 10 and 20 (the grid size of the B-spline), since the parameter count increases significantly with grid size and we wanted to compare how accuracy changes with it. Another possible hyperparameter is the B-spline degree. We decided not to tune this hyperparameter and fixed it to degree 3, as the KAN authors [10] suggest as a default and because of the wide use of cubic splines. It is important to keep the degree sufficiently small so as not to overfit.

Because of the long training times of these models, we opted not to do K-fold cross-validation, but to split the training set into train and validation sets and tune hyperparameters on those. Once the best hyperparameter sets were found, we joined the train and validation sets, trained a model on the combined data with the optimal hyperparameters, and finally reported the results on the test set.

The possible values for the hyperparameters are defined in Table 1. In addition to these values, the number of epochs is also selected on the validation set.
5.1 Loss

For every model we trained, the categorical cross-entropy loss was used as the base loss; for KAN models there are two additional regularization terms proposed in the KAN paper [10].

The cross-entropy loss is defined as:

$$L_{ce} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c}) \tag{6}$$

$$L_{reg} = L_{ce} + \lambda\left(\mu_1 \sum_{l=1}^{L-1} |\Phi_l|_1 + \mu_2 \sum_{l=1}^{L-1} S(\Phi_l)\right) \tag{7}$$

with

$$S(\Phi_l) = -\sum_{i=1}^{n_{in}} \sum_{j=1}^{n_{out}} \frac{|\phi_{i,j}|_1}{|\Phi_l|_1} \log\!\left(\frac{|\phi_{i,j}|_1}{|\Phi_l|_1}\right) \tag{8}$$

In our early experiments we found that the trained models performed better and trained faster without the regularization terms, so in our experiments we ended up setting λ = 0.
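For concreteness, a hedged PyTorch-style sketch of Eqs. (6)–(8), assuming each KAN layer can expose its per-edge L1 activation magnitudes as a matrix of shape (n_in, n_out); the argument names are assumptions, not the actual API of any KAN library.

```python
import torch
import torch.nn.functional as F

def kan_regularized_loss(logits, targets, phi_l1_per_layer, lam=0.0, mu1=1.0, mu2=1.0):
    """Cross entropy (Eq. 6) plus the L1 and entropy terms of Eqs. (7)-(8).
    phi_l1_per_layer: list of tensors |phi_ij|_1 with shape (n_in, n_out)."""
    loss = F.cross_entropy(logits, targets)             # Eq. (6)
    reg = 0.0
    for phi_l1 in phi_l1_per_layer:
        total = phi_l1.sum()                             # |Phi_l|_1
        p = phi_l1 / (total + 1e-12)
        entropy = -(p * torch.log(p + 1e-12)).sum()      # S(Phi_l), Eq. (8)
        reg = reg + mu1 * total + mu2 * entropy
    return loss + lam * reg                              # Eq. (7); the paper ends up using lam = 0
```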
5.2 Hardware
For the hyperparameter tuning of all the models, we used a 'g2-standard-4' Google Cloud instance, which has 4 vCPUs (2 cores), 16 GB of RAM and an NVIDIA L4 GPU.
6 Results
This section presents an analysis of the performance of the different proposed models in the previously described experiments. Figure ?? shows the final accuracies of the proposed models. Obtaining the hyperparameters and final models took almost two and a half days of training time.

Table ?? presents a comparison of accuracy, precision, recall, F1 score, parameter count, and training time per epoch for the proposed models tested on the Fashion-MNIST dataset.
The results of the table can also be seen in Figure 4, which illustrates the relationship between parameter count and accuracy for these models, providing a graphical representation of the data.
Following the results analysis, Figure 4 and Table ?? illustrate the relationship between parameter count and accuracy for the described models applied to the Fashion-MNIST dataset. In the experiments with this dataset, we find the following.

In the smaller models, whether using MLPs or KANs after the flatten layer, KAN convolutions outperform classic convolutions. These are the Small and Medium models, and here the comparisons are fair. Although KANC MLP and a CNN of equal size share the same architecture and differ only in the type of convolution, KAN convolutions add a considerable number of parameters. Setting aside that difference, within the same architecture both the Small and Medium KANC MLPs outperform the corresponding CNNs. But, for example, KANC MLP (Small) is larger than CNN (Medium) and has slightly lower accuracy (88.34% vs 88.15%).
However, when we increase the depth of the MLP head, classic convolutions show a small advantage over KAN convolutions. This can be seen in the comparison of CNN (Big) vs KANC MLP (Big), where the CNN reaches 89.44% and KANC MLP 89.15%. This may be explained by the fact that the learning is being done by the MLP, while the KAN convolutions may be compressing the information more, losing detail that the MLP might need. We also tried a model with even more convolutions, CNN (Medium, but with more convs), which shows even better accuracy than CNN (Big), which has fewer convolutions but one more fully connected layer and, because of that, more parameters.

In the case where we use a KAN network after the convolutions, using KAN convolutions (KKAN Small) reaches 87.67% accuracy, while with classic convolutions Conv & KAN (Small) reaches 88.01% and Conv & KAN (Medium) 87.92%. Conv & KAN is 6 or 7 times faster to train (depending on the grid size), but KKAN Small achieves slightly better accuracy than Conv & KAN (Medium) (for any grid size) with almost half the parameters (when Conv & KAN (Medium) has grid size 20 and KKAN grid size 10). Needless to say, KKAN (Medium), which only increases the number of convolutions, surpasses this accuracy with grid size 20, reaching 88.56%, but the parameter count increases from 38,000 to 74,875 (because of the increased number of convolutions and KAN neurons).
In the current experiments, adding KAN kernels while keeping the same number of convolutional layers seems to hit a ceiling on accuracy gains sooner, whereas with classic convolutions adding layers still seems necessary to achieve higher accuracy. One explanation might be that KAN kernels are more expressive, so that one kernel can learn what might take several classic kernels.

In addition, we can verify that the grid size can change model accuracy abruptly. To achieve the best performance this parameter should be tuned for every KAN layer, although this process demands significantly more computational resources. In the cases of KKAN Small, KANC MLP Medium and Big, and both Conv & KANs, no accuracy was gained by increasing the grid size from 10 to 20, and in some instances accuracy was even reduced. However, in models such as CNN Small and KKAN Medium, accuracy improved by over 0.5%, which represents a noticeable gain given the minimal accuracy differences across these models.
7 Conclusions
In this paper we proposed a new way to adapt the idea of learnable splines, proposed in Kolmogorov-Arnold Networks, to the convolutional layers widely used in computer vision. We implemented a KAN convolutional layer that uses a kernel made of learnable non-linear functions based on B-splines. We found that, with equal architectures, KAN convolutions seem to "learn more", showing better accuracy in the analogous models except in the "Big" ones, where the difference is that we added more fully connected layers, which may be doing the learning instead of the convolutions.

When using KANs after the flatten layer, KAN convolutions achieve better accuracy than classic convolutions even while using half the parameters. This is seen in the comparisons between KKAN and the normal Conv & KAN.

When using an MLP head, KAN convolutions achieve higher accuracy in the small models, but with a 2-layer MLP the classic CNN wins by 0.41% with ∼26.62k parameters. While KAN convolutions seem to learn more per kernel, we have to consider that each KAN kernel has many more parameters, so comparing the same architectures gives an advantage in expressiveness to KAN convolutions.
A key factor here is that KANs seem to maintain accuracy at a lower parameter count: KANC MLP (Medium) achieves 88.99% with almost 15k parameters, but training is almost 10 times slower with the current implementations of KANs and their derivatives.
Additionally, MLPs show better performance than KANs as the classification head for these image tasks. In the "Small" and "Medium" cases, when using KAN convolutions, using an MLP after the flatten layer gives both better accuracy and a smaller parameter count than using KANs. As expected, KKANs achieve better accuracy than their CNN counterparts, but the parameter-count difference is too large (22k vs 1.5k and 38k vs 3k). Since the parameters of KAN layers grow much faster than those of MLPs, with no apparent gain in accuracy, KANs might not be suitable as dense layers, especially because CNNs require a large number of neurons after the flatten layer.
Based on the original KAN paper's [10] claim that KANs are more interpretable, we tried to find a way to interpret this new type of convolution, but at the moment we have not found any clear way to visualize the B-splines learned at each position of the kernel. Given that we are working with images, the classic approach of visualizing what a filter does to the image seems the more 'human' way to get a sense of what is being learned; but because of the high dimensionality of a KAN kernel's input-output relationship, we have not yet come up with such an interpretation.
This investigation helps us understand the limitations of this new and promising idea, which are similar to those reported in the original KAN paper. The KAN linear layer and our KAN convolutional layer are new and need to be optimized before they can be scaled properly, as shown by the time-per-epoch metrics. This paper is a starting point for integrating KANs into computer vision, and shows that Convolutional KANs have the potential to be a real alternative to Convolutional Neural Networks.
References
[1] Vladimir I. Arnold and Andrey N. Kolmogorov. On functions of three variables. Doklady Akademii Nauk SSSR,
114:679–681, 1957.
[2] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals,
and Systems, 2(4):303–314, 1989.
[3] Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. Predicting parameters
in deep learning. In Advances in neural information processing systems, pages 2148–2156, 2013.
[4] Matthias Fey and Jan Eric Lenssen. Splinecnn: Fast geometric deep learning with continuous b-spline kernels.
arXiv preprint arXiv:1711.08920, 2018.
[5] Simon Haykin. Neural networks: a comprehensive foundation. Prentice Hall PTR, 1994.
[6] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approxi-
mators. Neural Networks, 2(5):359–366, 1989.
[7] A. K. Kolmogorov. On the representation of continuous functions of several variables by superposition of
continuous functions of one variable and addition. Doklady Akademii Nauk SSSR, 114:369–373, 1957.
[8] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[9] Ziyao Li. Kolmogorov-arnold networks are radial basis function networks, 2024.
[10] Ziming Liu. Kan: Kolmogorov–arnold networks. arXiv preprint arXiv:2404.19756, 2024.
Kolmogorov–Arnold Transformer
Figure 1: (Left) Architecture of the standard transformer (e.g. ViT), of ViT+KAN, which substitutes the MLP with a KAN, and of our KAT model. In KAT, the MLP layers in transformers are replaced with GR-KAN layers. (Right) Performance on the ImageNet dataset. KAT∗ indicates that the model was initialized using a pre-trained ViT. Generally, KAT outperforms both the ViT and DeiT models. ViT+KAN performs poorly on ImageNet-level training.
Abstract
Transformers stand as the cornerstone of modern deep learning. Traditionally, these models rely on multi-layer
perceptron (MLP) layers to mix the information between channels. In this paper, we introduce the Kolmogorov–Arnold
Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers to
enhance the expressiveness and performance of the model. Integrating KANs into transformers, however, is no easy
feat, especially when scaled up. Specifically, we identify three key challenges: (C1) Base function. The standard B-spline
function used in KANs is not optimized for parallel computing on modern hardware, resulting in slower inference speeds.
(C2) Parameter and Computation Inefficiency. KAN requires a unique function for each input-output pair, making the
computation extremely large. (C3) Weight initialization. The initialization of weights in KANs is particularly challenging
due to their learnable activation functions, which are critical for achieving convergence in deep neural networks. To
overcome the aforementioned challenges, we propose three key solutions: (S1) Rational basis. We replace B-spline functions
with rational functions to improve compatibility with modern GPUs. By implementing this in CUDA, we achieve faster
computations. (S2) Group KAN. We share the activation weights through a group of neurons, to reduce the computational
load without sacrificing performance. (S3) Variance-preserving initialization. We carefully initialize the activation weights
to make sure that the activation variance is maintained across layers. With these designs, KAT scales effectively and readily
outperforms traditional MLP-based transformers. We demonstrate the advantages of KAT across various tasks, including
image recognition, object detection, and semantic segmentation. It consistently enhances performance over the standard
transformer architectures of different model sizes. Our code is openly available at https://github.com/Adamdad/kat.
1 Introduction
Transformers have become the de facto architecture in deep learning, widely adopted in computer vision [DBK+ 21] and
natural language processing [VSP+ 17]. At their core, transformers are built upon two fundamental components: attention
modules and multi-layer perceptrons (MLPs). Although significant research has focused on replacing the traditional
attention mechanism with alternative operations [LLC+ 21, LMW+ 22, THK+ 21], these variants still lean heavily on MLPs.
Surprisingly, there have been relatively few efforts [Sha20] aimed at enhancing MLPs themselves.
Opening up the box, MLPs are composed of stacked linear layers coupled with non-linear activations. What makes
it so popular is that, theoretically, they can approximate any function, assuming that there are enough neurons avail-
able [HSW89].
However, despite their versatility, MLPs face limitations in modeling complex functions. For example, when using ReLU-like
activation, a two-layer MLP may struggle to fit periodic functions. Moreover, employing gradient descent to train these
networks often results in prolonged convergence times for high-frequency components [RBA+ 19, BGG+ 20, RJKK19]. These
challenges have led researchers to explore alternative, perhaps more expressive architectures than MLPs.
Recently, Kolmogorov-Arnold Networks (KANs) emerged as a powerful alternative. KANs are noted for their theoretical
parameter efficiency, potentially requiring fewer parameters to model complex functions [LWV+ 24]. They are particularly
suitable for mathematical or symbolic regression tasks [YYW24, BC24a, LMW+ 24]. The key to such success is the learnable
base function in each input-output pair. Those functions are often parameterized by B-spline curves [UAE93, GR74]. This
design allows KANs to approximate more intricate functions through a summation of spline bases.
Given its potential, integrating KAN layers into transformers [VSP+ 17] becomes an exciting topic. Such integration
may boost the expressiveness and efficiency of transformers, enhancing their competitiveness across a wide range of
applications.
Unfortunately, this ambition has been met with limited success. In particular, KANs have been reported to be “10× slower
than MLPs, given the same number of parameters”. Initial attempts to apply KANs to vision recognition tasks have yielded
disappointing results. Even on a small scale, these studies have consistently fallen short of matching, let alone surpassing,
the performance of traditional architectures. This lack of improvement is often attributed to the limited computational
resources and ongoing scalability problems [Che24a, BTSP24, Che24b].
In a preliminary experiment, we attempted to replace MLP layers in the Vision Transformer (ViT) with KAN layers. It
creates a model, which we call ViT+KAN. However, as shown in Figure 1 (Right), this straightforward substitution led to
significant challenges when performing ImageNet-scale training, resulting in poor performance. Scalability, therefore,
remains a significant obstacle for KAN-based models.
Motivation and Challenges. Through dedicated analysis, we have identified several key challenges that hinder the
effectiveness of KANs in large-scale applications, ultimately limiting their scalability.
• (C1) Base function. The standard B-spline functions in KANs are not ideal for parallel computing architectures typical
of modern GPUs. B-splines require recursive computation, which significantly slows down even the most optimized
implementations.
• (C2) Parameter and Computation Inefficiency. Each unique input-output pair in a KAN requires a distinct set of
parameters and base functions. This necessity causes an exponential growth in the number of parameters as the
network’s hidden size increases, resulting in substantial computational overhead and scalability issues.
• (C3) Weight initialization. The weight initialization in KANs is similar to that in MLPs, but it does not meet KANs’
needs for convergence. This mismatch can lead to instability and degraded performance during the training process.
Our Approach. In this paper, we introduce the Kolmogorov–Arnold Transformer (KAT), which successfully integrates KANs into transformers for large-scale training scenarios such as ImageNet. Beyond simple replacement, we have developed three key innovations (S1-S3) to address these challenges (C1-C3), respectively.
• (S1) Rational activation. We employ rational function as our base function and provide full CUDA implementation. It
aligns better with modern GPU architectures, enhancing computational efficiency and compatibility.
• (S2) Group KAN. We share function coefficients and base functions among groups of edges. This strategy reduces
computational load significantly without sacrificing performance.
• (S3) Variance-preserving initialization. We carefully initialize weights to maintain consistent variance in activations
across the model’s layers. This ensures stability during training and improves the model’s learning dynamics.
By combining all solutions S1-S3, we present a new variant of KAN, called Group-Rational KAN (GR-KAN), to replace
the MLP in transformer. We show that GR-KAN is computationally efficient, easy to implement, and can be seamlessly
integrated into vision transformers, replacing MLP layers to achieve superior performance. Furthermore, our designs allow
KAT to load pre-trained weights from ViT models and continue training to achieve even better results.
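To illustrate the idea of rational base functions shared across groups of channels, here is a hedged PyTorch sketch of a grouped rational activation, y = P(x) / (1 + |Q(x)|), with the denominator kept positive for numerical safety. It is a simplified stand-in for the GR-KAN idea, not the authors' CUDA implementation, and all names, orders and shapes are assumptions.

```python
import torch
import torch.nn as nn

class GroupRationalActivation(nn.Module):
    """Safe rational function P(x) / (1 + |Q(x)|), with one set of coefficients
    per channel group, applied element-wise (a sketch of the GR-KAN idea)."""
    def __init__(self, channels: int, groups: int = 8, p_order: int = 5, q_order: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.a = nn.Parameter(torch.randn(groups, p_order + 1) * 0.1)  # numerator coefficients
        self.b = nn.Parameter(torch.randn(groups, q_order) * 0.1)      # denominator coefficients

    def forward(self, x):                     # x: (..., channels)
        shape = x.shape
        x = x.reshape(*shape[:-1], self.groups, shape[-1] // self.groups)
        powers_p = torch.stack([x ** i for i in range(self.a.shape[1])], dim=-1)
        powers_q = torch.stack([x ** (i + 1) for i in range(self.b.shape[1])], dim=-1)
        num = (powers_p * self.a[:, None, :]).sum(-1)               # P(x), per group
        den = 1.0 + (powers_q * self.b[:, None, :]).sum(-1).abs()   # 1 + |Q(x)| > 0
        return (num / den).reshape(shape)

# act = GroupRationalActivation(channels=768, groups=8)
# y = act(torch.randn(4, 197, 768))   # e.g. ViT-like token features
```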
We empirically validate KAT across a range of vision tasks, including image recognition, object detection, and semantic
segmentation. The results demonstrate that KAT outperforms traditional MLP-based transformers, achieving enhanced
performance with comparable computational requirements. As illustrated in Figure 1, KAT-B achieves 82.3% accuracy on
ImageNet-1K, surpassing the ViT model of the same size by 3.1%. When initialized with pre-trained weights from ViT, the
performance further improves to 82.7%.
The contributions of our paper are threefold. First, we conduct a thorough analysis of the challenges in scaling KAN-based
models, particularly focusing on inefficiencies in base functions, parameterization, and weight initialization. Based on this
analysis, we propose a set of solutions: rational activation functions tailored for GPU efficiency, Group KAN to reduce
computational overhead, and variance-preserving initialization to ensure stable training. Second, leveraging these insights,
we introduce the Kolmogorov–Arnold Transformer (KAT) and scale it to ImageNet-level training, successfully integrating
KANs into large-scale models. Third, we validate our approach through extensive experiments, showing that KAT not only
matches but surpasses the performance of ViT models, all under similar computational requirements.
2 Preliminary
2.1 Kolmogorov-Arnold representation theorem
The Kolmogorov-Arnold representation theorem [HN87] states that any multivariate continuous function 𝑓 , defined on a
bounded domain, can be expressed as a finite composition of continuous univariate functions and addition. Specifically, for
a smooth function 𝑓 : [0, 1] 𝑛 → R, it can be represented as:
$$f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)$$
Here, each function $\phi_{q,p} : [0,1] \to \mathbb{R}$ and $\Phi_q : \mathbb{R} \to \mathbb{R}$ is continuous. This means that the $(2n+1)(n+1)$ univariate functions $\Phi_q$ and $\phi_{q,p}$ are enough for an exact representation of an $n$-variate function.
This theorem can be written in matrix form as follows:
$$f(\mathbf{x}) = \Phi_{\text{out}} \circ \Phi_{\text{in}} \circ \mathbf{x} \quad (1)$$
where Φin and Φout are defined as:
$$\Phi_{\text{in}} = \begin{bmatrix} \phi_{1,1}(\cdot) & \cdots & \phi_{1,n}(\cdot) \\ \vdots & \ddots & \vdots \\ \phi_{2n+1,1}(\cdot) & \cdots & \phi_{2n+1,n}(\cdot) \end{bmatrix} \quad (2)$$
$$\Phi_{\text{out}} = \begin{bmatrix} \Phi_1(\cdot) & \cdots & \Phi_{2n+1}(\cdot) \end{bmatrix} \quad (3)$$
This decomposition illustrates how 𝑓 can be built from simpler functions, showcasing an essential property of multivariate
continuous functions.
Note that Eq. 4 can be seen as a generalized form of Eq. 1, such that $\Phi = \Phi_{\text{in}} \circ \Phi_{\text{out}}$. A general KAN network is a stack of $L$ such layers: given an input vector $\mathbf{x}_0 \in \mathbb{R}^{d_0}$, the output of the KAN is $\mathrm{KAN}(\mathbf{x}_0) = \Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_0 \circ \mathbf{x}_0$.
In practice, [LWV+24] parameterizes each $\phi$ in $\Phi$ as a linear combination of the SiLU activation [EUD18] and a B-spline function:
$$\phi(x) = w_b\,\mathrm{silu}(x) + w_s\,\mathrm{spline}(x), \quad \text{where } \mathrm{silu}(x) = \frac{x}{1 + e^{-x}}, \quad \mathrm{spline}(x) = \sum_i c_i B_i(x) \quad (5)$$
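For concreteness, here is a minimal PyTorch sketch of this parameterization, assuming a uniform, strictly increasing knot grid; the helper names are illustrative and this is not the reference implementation of [LWV+24]:

```python
import torch
import torch.nn.functional as F

def bspline_bases(x, knots, k):
    # Cox-de Boor recursion: order-k B-spline bases on a strictly increasing knot vector.
    # x: (N,) inputs, knots: (m,) knots. Returns (N, m - k - 1) basis values.
    x = x.unsqueeze(-1)
    bases = ((x >= knots[:-1]) & (x < knots[1:])).float()          # degree-0 indicators
    for d in range(1, k + 1):
        left = (x - knots[:-(d + 1)]) / (knots[d:-1] - knots[:-(d + 1)]) * bases[:, :-1]
        right = (knots[d + 1:] - x) / (knots[d + 1:] - knots[1:-d]) * bases[:, 1:]
        bases = left + right
    return bases

def kan_phi(x, w_b, w_s, coeffs, knots, k=3):
    # Eq. (5): phi(x) = w_b * silu(x) + w_s * sum_i c_i B_i(x)
    return w_b * F.silu(x) + w_s * (bspline_bases(x, knots, k) @ coeffs)

# Example: G = 5 intervals on [-1, 1], order k = 3, giving G + k = 8 basis functions.
knots = torch.linspace(-1 - 3 * 0.4, 1 + 3 * 0.4, 5 + 2 * 3 + 1)
coeffs = torch.randn(5 + 3) * 0.1
print(kan_phi(torch.linspace(-1, 1, 4), 1.0, 1.0, coeffs, knots).shape)  # torch.Size([4])
```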
B-spline is not GPU Friendly. The use of B-spline functions in KAN layers introduces challenges when implemented
on GPUs. First, B-splines are not standard functions within CUDA. Implementing them using pure PyTorch and NumPy
results in slower performance on modern GPU devices due to the lack of optimized CUDA support. Second, the localized
nature of B-Spline computations complicates their use in parallel GPU processes. Typically, each control point influences
only a small adjacent area of the curve. This leads to sparse or recursive computations, a type of operation that GPUs
manage less efficiently. Although efficient implementations exist for cubic B-splines [RT12, RtHRS08, SH05], scaling these methods to higher orders is not straightforward.
Parameter and Computation Inefficiency. Unlike standard neural networks, KAN employs a learnable base func-
tion for each pair of input-output channels. This design inherently leads to an increased parameter count and higher
computational demands, especially when scaling up the width and depth of a neural network.
In the standard configuration of KAN, a layer with $d_{in}$ input and $d_{out}$ output channels incorporates a B-spline function of order $K$ on $G$ intervals for each input-output pair. This results in a total of $(d_{in} \times d_{out}) \times (G + K + 3) + d_{out}$ learnable parameters. In contrast, a typical MLP needs only $(d_{in} \times d_{out}) + d_{out}$ parameters.
In terms of computation,¹ the FLOPs for one sample of the B-spline with the De Boor-Cox iterative formulation [Boo71] are
$$\left\{\text{FLOPs of non-linear function}\right\} \times d_{in} + (d_{in} \times d_{out}) \times \left[9K \times (G + 1.5K) + 2G - 2.5K + 3\right].$$
Meanwhile, the FLOPs for an equivalent MLP layer are merely
$$\left\{\text{FLOPs of non-linear function}\right\} \times d_{out} + 2 \times (d_{in} \times d_{out}).$$
Overall, the parameter size and computational effort of KAN are on the order of 𝑂 (𝐺 + 𝐾) and 𝑂 (𝐺𝐾) times greater than
those of a conventional MLP, respectively. This significant increase in complexity is a primary reason why KAN struggles
to scale effectively.
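As a sanity check, plugging the default KAN setting $G = K = 3$ into the bracketed term above reproduces the 204 FLOPs listed in Table 1. A small script for the full comparison follows; the per-activation FLOP count is an assumed round number, not a measured value:

```python
# Rough comparison of KAN vs. MLP cost using the formulas above (G = K = 3).
d_in = d_out = 768
G, K = 3, 3
act_flops = 8                         # assumed FLOPs of one non-linear activation (rough guess)

kan_params = d_in * d_out * (G + K + 3) + d_out
mlp_params = d_in * d_out + d_out
bspline_term = 9 * K * (G + 1.5 * K) + 2 * G - 2.5 * K + 3     # = 204, cf. Table 1
kan_flops = act_flops * d_in + d_in * d_out * bspline_term
mlp_flops = act_flops * d_out + 2 * d_in * d_out

print(f"params: KAN/MLP ~ {kan_params / mlp_params:.1f}x")      # ~9x  (= G + K + 3)
print(f"FLOPs:  KAN/MLP ~ {kan_flops / mlp_flops:.1f}x")        # ~100x for this setting
```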
Weights are not Properly Initialized. Deep learning heavily relies on good weight initialization to enable trainability
and convergence. A fundamental principle is variance preservation: the variance of the signal should remain constant as it propagates through the layers, whether forward or backward [LBOM02, GB10, HZRS15]. This ensures that activations and gradients remain stable across layers.
However, in the KAN paper, the initialization strategy deviates from this principle. Specifically, the B-spline coefficients $c_i$ are initialized as $\mathcal{N}(0, \sigma^2)$ with $\sigma = 0.1$, while $w_s = 1$ and $w_b \sim U\!\left[-\sqrt{\tfrac{6}{d_{in}+d_{out}}}, \sqrt{\tfrac{6}{d_{in}+d_{out}}}\right]$ follow the Xavier initialization [GB10]. The combined output variance of the model can be expressed as:
$$\mathrm{Var}[\phi(x)] = \mathrm{Var}[w_b\,\mathrm{silu}(x)] + \mathrm{Var}[w_s\,\mathrm{spline}(x)] = 3\,\mathbb{E}[\mathrm{silu}^2(x)] + \mathbb{E}[\mathrm{spline}^2(x)] \quad (6)$$
If we assume the input is normally distributed, $x \sim \mathcal{N}(0, \sigma_x^2)$, and consider a zeroth-order spline, the variance of $\mathrm{spline}(x)$ at any point $x$ is simply
$$\mathbb{E}[\mathrm{spline}^2(x)] = \sum_i c_i^2\,\mathrm{Var}[B_i(x)] = \sigma^2 \sum_i \mathrm{Var}[B_i(x)] = \sigma^2 = 0.01 \quad (7)$$
1 For the full derivation of this computation, please see [YYW24].
For the SiLU activation function, although exact variance calculations are complex, numerical estimation indicates $\mathbb{E}[\mathrm{silu}^2(x)] \approx 0.355\,\sigma_x^2$. Combining these, we find $\mathrm{Var}[\phi(x)] \approx 0.01 + 1.064\,\sigma_x^2 \neq \mathrm{Var}[x]$.
This shows that, even with a zeroth-order spline, $\mathrm{Var}[\phi(x)] \neq \mathrm{Var}[x]$; with higher-order splines, the variance instability may grow further. Thus, the default initialization violates the essential variance-preserving principle.
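A quick Monte Carlo check of these numbers, assuming $x \sim \mathcal{N}(0, 1)$ and taking the factor of 3 in Eq. (6) as given:

```python
import torch

# Monte Carlo check: E[silu^2(x)] ~ 0.355 * Var[x] and the resulting variance mismatch.
torch.manual_seed(0)
x = torch.randn(1_000_000)                               # x ~ N(0, 1), so Var[x] = 1
e_silu_sq = torch.nn.functional.silu(x).pow(2).mean().item()
var_phi = 3 * e_silu_sq + 0.01                           # Eq. (6) with E[spline^2(x)] = 0.01
print(f"E[silu^2(x)] ~ {e_silu_sq:.3f}")                 # ~0.355
print(f"Var[phi(x)]  ~ {var_phi:.3f}, but Var[x] = 1")   # ~1.07, not variance-preserving
```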
4 Kolmogorov–Arnold Transformer
As discussed earlier, the standard KAN faces three major challenges that limit its use in large, deep neural networks. In this
section, we refine its design to better suit modern transformers, allowing us to replace MLP layers with KANs.
where $\mathbf{x}_\ell$ denotes the output feature sequence of the $\ell$-th layer. As illustrated, we replace all two-layer MLPs with two-layer KANs while keeping the attention layers unchanged. Although similar efforts have been made in specific domains [CGD24, CZZ+24], a simple replacement is not enough to achieve scalability in large models.
Most importantly, we introduce a special variant, Group-Rational KAN (GR-KAN). We use rational functions as the base functions for KAN (Section 4.2) and share parameters within each group of edges (Section 4.3). We also specify a weight initialization scheme to ensure stable training (Section 4.4). Together, these enhancements make KAT more scalable and improve its performance.
$$\phi(x) = w F(x) = w\,\frac{P(x)}{Q(x)} = w\,\frac{a_0 + a_1 x + \cdots + a_m x^m}{b_0 + b_1 x + \cdots + b_n x^n} \quad (11)$$
Here, $a_m$ and $b_n$ are the coefficients of the rational function and $w$ is a scaling factor. Such a function is said to have degree $m/n$. We learn $a_m$, $b_n$, and $w$ through end-to-end backpropagation.
To avoid instability caused by poles, where $Q(x) \to 0$ and $\phi(x) \to \pm\infty$, we employ a Safe Padé Activation Unit (PAU) [MSK20] as our basis, which is a modified form of the standard rational function:
$$F(x) = \frac{a_0 + a_1 x + \cdots + a_m x^m}{1 + |b_1 x + \cdots + b_n x^n|} \quad (12)$$
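For reference, a direct PyTorch version of Eq. (12); this is an illustrative sketch, not the paper's CUDA kernel:

```python
import torch

def safe_rational(x, a, b):
    # F(x) = (a_0 + a_1 x + ... + a_m x^m) / (1 + |b_1 x + ... + b_n x^n|), Eq. (12)
    # a: (m + 1,) numerator coefficients; b: (n,) denominator coefficients.
    numerator = sum(a[i] * x.pow(i) for i in range(len(a)))
    denominator = 1.0 + sum(b[j] * x.pow(j + 1) for j in range(len(b))).abs()
    return numerator / denominator

x = torch.linspace(-3, 3, 5)
a, b = torch.randn(6) * 0.1, torch.randn(4) * 0.1     # m = 5, n = 4
print(safe_rational(x, a, b))
```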
Why use Rational Function? There are practical and theoretical reasons for selecting rational functions as our base
functions.
First, from an efficiency perspective, evaluating polynomials involves simple operations that are highly suitable for parallel
computing. This makes rational functions computationally efficient for large-scale models.
Second, from a theoretical perspective, rational functions can approximate a wider range of functions—including those
with singularities or sharp variations—more efficiently and accurately than polynomials [Wal35, BJG61]. Since B-splines
are essentially sums of local polynomials, rational functions offer a theoretical advantage over B-splines for modeling
complex behaviors.
Third, from a practical perspective, rational activations have already been successfully used as activation functions in
neural networks [BNT20, MSK20].
Given these reasons, we adopt rational functions as the base functions in our KAN layers to enhance the model’s
expressiveness, stability, and computational efficiency.
Implement Rational Function on GPU. With the rational function as the base, a core contribution of this paper is an efficient implementation on parallel devices such as GPUs. Instead of relying on PyTorch with automatic differentiation, we implement it fully in CUDA [NBGS08].
• Similar to [MSK20], we compute the explicit gradients $\frac{\partial F}{\partial a_m}$, $\frac{\partial F}{\partial b_n}$, and $\frac{\partial F}{\partial x}$.
• We evaluate the polynomials with Horner's method [Hor15] (see the sketch below):
$$a_0 + a_1 x + \cdots + a_m x^m = a_0 + x\bigl(a_1 + x(a_2 + x(\ldots))\bigr) \quad (14)$$
This allows evaluating a polynomial of degree $n$ with only $n$ multiplications and $n$ additions. By default, we use $m = 5$ and $n = 4$.
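A plain-Python sketch of Horner evaluation for the safe rational form; this is illustrative only, while the actual implementation is a CUDA kernel that also provides the explicit gradients above:

```python
def horner(coeffs, x):
    # Evaluate c[0] + c[1]*x + ... + c[k]*x^k with k multiplications and k additions (Eq. 14).
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def rational_horner(a, b, x):
    # Numerator of degree m over the safe denominator 1 + |b_1 x + ... + b_n x^n|.
    return horner(a, x) / (1.0 + abs(x * horner(b, x)))

a = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # a_0..a_5: here the numerator is just x
b = [0.0, 0.0, 0.0, 0.0]             # b_1..b_4: zero denominator polynomial
print(rational_horner(a, b, 2.0))    # 2.0
```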
Through this efficient CUDA implementation, we greatly reduce the computation required for each evaluation of the base function. As shown in Table 1, for a scalar input, the rational function evaluated with Horner's method is much cheaper than the B-spline used in the KAN paper.
Table 1: Comparison of FLOPs for different functions. Compared to the B-spline function, using Horner's method with the rational function reduces FLOPs by approximately 9.3×.
Name FLOPs
B-Spline (G=3, K=3) 204
Rational (m=5, n=4) 46
Rational (m=5, n=4) w Horner 21
Figure 2: Comparing our Group KAN with vanilla KAN and MLPs. While KAN has a unique function on each input-output pair, Group KAN shares these functions within a group of edges.
Table 2: Comparison of parameter counts and computation among different models. Func FLOPs stands for the FLOPs of the non-linear activation used. In KAN, $K$ is the spline order and $G$ the grid number. For our GR-KAN, $m$ and $n$ denote the polynomial orders, and $g$ is the number of groups. GR-KAN has a parameter count that is only a constant overhead over the MLP, whereas the KAN model's parameters scale with $(G + K + 3)$.
Suppose 𝑖 is the index of the input channel. With 𝑔 groups, each group contains 𝑑𝑔 = 𝑑𝑖𝑛 /𝑔 channels, where ⌊𝑖/𝑑𝑔 ⌋ is the
group index. The operation of GR-KAN on input vector x can be expressed as
$$\mathrm{GR\text{-}KAN}(\mathbf{x}) = \Phi \circ \mathbf{x} = \left[\, \sum_{i=1}^{d_{in}} w_{i,1} F_{\lfloor i/d_g \rfloor}(x_i) \;\; \cdots \;\; \sum_{i=1}^{d_{in}} w_{i,d_{out}} F_{\lfloor i/d_g \rfloor}(x_i) \,\right] \quad (15)$$
With a simple rewrite, this can be expressed in matrix form as the product of a weight matrix $\mathbf{W} \in \mathbb{R}^{d_{out} \times d_{in}}$ and an input-wise rational function $F$:
$$\mathrm{GR\text{-}KAN}(\mathbf{x}) = \mathbf{W} F(\mathbf{x}) = \begin{bmatrix} w_{1,1} & \cdots & w_{1,d_{in}} \\ \vdots & \ddots & \vdots \\ w_{d_{out},1} & \cdots & w_{d_{out},d_{in}} \end{bmatrix} \times \begin{bmatrix} F_{\lfloor 1/d_g \rfloor}(x_1) & \cdots & F_{\lfloor d_{in}/d_g \rfloor}(x_{d_{in}}) \end{bmatrix}^{\top} \quad (16)$$
As such, we can implement this GR-KAN layer as a group-wise rational function F followed by a linear layer
GR-KAN(x) = linear(group_rational(x)) (17)
In this form, sharing parameters across each input channel allows direct application of the rational function to the input
vector, equivalently applying it across each grouped edge. In this way, GR-KAN functions as a specialized MLP, with 1)
learnable non-linear functions, 2) activation preceding the linear layer, and 3) unique activation functions tailored for each
group of edges.
In experiments, we find that sharing the denominator coefficients $b_n$ across all groups while learning separate numerator coefficients $a_m$ for each group yields better performance.
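To make Eq. 17 concrete, here is a compact PyTorch sketch of a GR-KAN layer with per-group numerator coefficients and a shared denominator; the class name, shapes, and initial values are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class GRKAN(nn.Module):
    # Group-wise rational activation followed by a linear layer: linear(group_rational(x)).
    def __init__(self, d_in, d_out, groups=8, m=5, n=4):
        super().__init__()
        assert d_in % groups == 0
        self.groups = groups
        self.a = nn.Parameter(torch.randn(groups, m + 1) * 0.1)  # per-group numerator coeffs
        self.b = nn.Parameter(torch.randn(n) * 0.1)              # denominator coeffs, shared by all groups
        self.linear = nn.Linear(d_in, d_out)

    def forward(self, x):                                        # x: (..., d_in)
        xg = x.view(*x.shape[:-1], self.groups, -1)              # split channels into g groups
        num = sum(self.a[:, i, None] * xg.pow(i) for i in range(self.a.shape[1]))
        den = 1.0 + sum(self.b[j] * xg.pow(j + 1) for j in range(self.b.shape[0])).abs()
        return self.linear((num / den).reshape(x.shape))

layer = GRKAN(d_in=512, d_out=2048, groups=8)
print(layer(torch.randn(4, 197, 512)).shape)                     # torch.Size([4, 197, 2048])
```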
Parameter and Computation Savings. The original KAN requires 𝑑𝑖𝑛 × 𝑑𝑜𝑢𝑡 unique activation functions. Through our
grouping strategy, only 𝑔 unique functions are needed, reducing the parameter count to a constant overhead compared to a
standard MLP.
Beyond the savings in parameter count, this grouping also reduces computational demands. Each input channel computes its activation function $\phi$ once, and the result is shared across all corresponding output channels. In contrast, the original KAN requires each output channel $j$ to independently compute $\phi_{i,j}$. This results in significant computational savings. The comparison of parameter counts and computation is listed in Table 2.
Table 3: Gain values $\alpha = \mathrm{Var}[x] / \mathbb{E}[F(x)^2]$ for different activation functions $F$, estimated with $x \sim \mathcal{N}(0, 1)$.
Name α
Identity 1
ReLU 2
GELU 2.3568
Swish/SiLU 2.8178
GEGLU 0.7112
SwishGLU 0.8434
where $x$, $y$, and $w$ denote the random variables corresponding to the elements of $x_i$, $y_j$, and $w_{i,j}$, respectively. When layers are stacked, we aim for the variance of the activations to remain consistent from input to output, i.e., $\mathrm{Var}[y] = \mathrm{Var}[x]$.
Since $F(x)$ is a rational function containing coefficients $a_m$ and $b_n$, the initialization of $w$ and these coefficients is interdependent: the form of $F(x)$ influences the appropriate initialization of $w$. The crucial step is to calculate $\frac{\mathrm{Var}[x]}{\mathbb{E}[F(x)^2]}$ and adjust $w$ to maintain consistent activation scaling.
For our rational function defined in Equation 12, computing $\mathbb{E}[F(x)^2]$ involves evaluating:
$$\mathbb{E}[F(x)^2] = \int_{-\infty}^{+\infty} F^2(x) f(x)\, dx = \int_{-\infty}^{+\infty} \left( \frac{a_0 + a_1 x + \cdots + a_m x^m}{1 + |b_1 x + \cdots + b_n x^n|} \right)^{2} f(x)\, dx \quad (21)$$
where $f(x)$ is the density function of $x$. Unlike activation functions such as ReLU, for which $\mathbb{E}[F(x)^2] = \frac{1}{2}\mathrm{Var}[x]$, computing $\mathbb{E}[F(x)^2]$ for the rational function is challenging due to the lack of a closed-form solution.
Initialize $a$, $b$ first, then initialize $w$. To make the process manageable, instead of sampling $w$, $a$, and $b$ jointly, we proceed sequentially. First, we determine $a$ and $b$ such that $F$ fits established activations like ReLU, GELU, and Swish. Figure 3 illustrates the fitted functions.
Once $a$ and $b$ are set, we estimate the gain $\alpha = \frac{\mathrm{Var}[x]}{\mathbb{E}[F(x)^2]}$ numerically, assuming $x \sim \mathcal{N}(0, 1)$.² The calculated gains $\alpha$ are documented in Table 3. We use the gain value to initialize $w$ from $\mathcal{N}(0, \frac{\alpha}{d_{in}})$.
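A minimal sketch of how such a gain can be estimated and used, assuming $x \sim \mathcal{N}(0, 1)$; function and variable names are illustrative:

```python
import torch

def estimate_gain(F, n_samples=1_000_000):
    # alpha = Var[x] / E[F(x)^2] with x ~ N(0, 1), estimated by Monte Carlo.
    torch.manual_seed(0)
    x = torch.randn(n_samples)
    return 1.0 / F(x).pow(2).mean().item()

alpha = estimate_gain(torch.nn.functional.gelu)           # ~2.36, cf. Table 3
d_in, d_out = 512, 512
w = torch.randn(d_out, d_in) * (alpha / d_in) ** 0.5      # w ~ N(0, alpha / d_in)
print(round(alpha, 3), round(w.var().item() * d_in, 3))   # the second value is ~alpha
```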
Initialize KAT from ViT. In addition to random weight initialization, we can also transfer weights from a pre-trained ViT
to our KAT model. This transfer is straightforward for most layers, as KAT can replicate the micro-architecture of ViT,
except for the KAN layer.
2 This assumption is justified because the inputs to the KAN layer are normalized using layer normalization as in Equation 10. The LN layers are initialized with unit scale and zero bias, so their outputs are approximately standard normal.
For the GR-KAN layer, weight transfer is still feasible, as shown in Figure 4. Because the GR-KAN layer consists of a linear layer and a group-wise rational layer, we can directly load the weights of the linear layers from the MLP of the pre-trained ViT, while the rational layers are initialized to mimic the identity and the original activation.
[Figure 4: A ViT MLP block (identity → FC1 → σ → FC2) maps onto two GR-KAN layers (group rational 1 → linear 1 → group rational 2 → linear 2).]
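A hypothetical sketch of this transfer; the attribute names (fc1, fc2, linear) and the helper function are assumptions for illustration, not the actual checkpoint layout:

```python
import torch

@torch.no_grad()
def load_vit_mlp_into_grkan(vit_mlp, grkan1, grkan2):
    # Copy the two linear layers of a pre-trained ViT MLP block into the linear parts
    # of the two GR-KAN layers. The rational parts are initialized separately (to fit
    # the identity and the original activation, respectively).
    grkan1.linear.weight.copy_(vit_mlp.fc1.weight)
    grkan1.linear.bias.copy_(vit_mlp.fc1.bias)
    grkan2.linear.weight.copy_(vit_mlp.fc2.weight)
    grkan2.linear.bias.copy_(vit_mlp.fc2.bias)
```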
5 Experiments
5.1 Experimental Setup
We modify the original ViT [DBK+ 21] architecture by substituting its MLP layers with GR-KAN layers. By default, these
KAN layers employ a rational function with parameters 𝑚 = 5 and 𝑛 = 4, and are organized into groups of 8 (𝑔 = 8). Each
transformer block contains 2 KAN layers. The first GR-KAN layer’s 𝑎𝑚 and 𝑏𝑛 are initialized to fit the identity function,
while the second is initialized to mimic the Swish function [RZL17]. The attention layers are initialized with Mimetic
Initialization [TK23]. The remainder of the architecture remains unchanged. We intentionally do not use hierarchical
architectures [YLZ+ 22] for simplicity.
Model Variants. We select the configurations of KAT to be identical to those used in ViT [DBK+21], as summarized in Table 4. All variants use an input patch size of 16 × 16.
Table 5: Comparative analysis of model performance and computational efficiency on ImageNet-1K. We measure the FLOPs at 224² input resolution using the fvcore package. ∗ indicates that the model is initialized from a pre-trained ViT model; otherwise it is trained from scratch.
These findings highlight the efficacy of the KAT approach in balancing computational efficiency with improved performance,
suggesting valuable directions for further research in optimizing transformer architectures.
5.4 Semantic Segmentation
Experiment Setup. We evaluated our KAT model on the ADE20K dataset [ZZP+ 17]. This dataset comprises 150 semantic
categories with 20,000 images in the training set and 2,000 in the validation set. For our experiments, we utilized KAT
as the backbone for the UperNet framework [XLZ+ 18], initializing it with ImageNet pre-trained weights. The training
was conducted using the AdamW optimizer [LH19] with a learning rate of 0.0001 and a batch size of 16, across 160,000
iterations. Our implementation was carried out using the PyTorch and mmsegmentation libraries, and the experiments
were performed on two NVIDIA H100 GPUs. For comparison, we evaluated UperNet with other backbones, including
DeiT, Swin Transformer, and ConvNeXt.
Results. Table 7 summarizes the segmentation results. Overall, KAT demonstrates a competitive improvement over plain
ViT-based architectures, achieving a 2.4% improvement over DeiT-S and a 0.2% improvement over DeiT-B. This performance
boost comes with a slight increase in computational cost, reflected in the higher FLOPs. Similar to the detection results,
KAT shows more significant gains in smaller models. However, it still falls short compared to models with hierarchical
architectures, such as ConvNeXt, which benefit from more efficient structural design.
Table 9: Throughput and peak memory for different activations on an A5000 GPU. Input size is fixed to [64, 1000, 512].
Benefit of CUDA Implementation. To evaluate the efficiency improvements introduced by our CUDA implementation
discussed in Section 4.2, we conducted experiments to measure both forward pass speed and peak memory usage. Specifically,
we compared our CUDA implementation against two alternative methods. The first is called Torch Looped, which loops over
each channel group, applies the rational function, and then concatenates the results. The second is called Torch Vectorized.
In this method, the input tensor is reshaped according to the channel groups, the rational function is applied in a vectorized
manner, and the tensor is reshaped back to its original form. We compare these three implementations on an A5000 GPU under 1) different group numbers $g \in \{1, 2, 4, 8, 16\}$ and 2) different input dimensions $D \in \{128, 256, 512, 1024, 2048\}$.
Figure 5: Comparison of throughput and peak memory for Torch Looped, Torch Vectorized, and our CUDA implementation across group sizes. Input size is fixed to [64, 1000, 512]. (a) Throughput (batch/s); larger is better. (b) Peak memory (MB); smaller is better.
Figure 6: Comparison of throughput and peak memory for Torch Looped, Torch Vectorized, and our CUDA implementation across input dimension sizes. Group size is fixed to 8. (a) Throughput (batch/s); larger is better. (b) Peak memory (MB); smaller is better.
The results, presented in Figure 5 and Figure 6, clearly demonstrate that our CUDA implementation significantly outperforms
both the Torch Looped and Torch Vectorized implementations, offering superior speed and memory efficiency.
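For reference, here is a minimal sketch of how the two PyTorch baselines and a timing loop might look, assuming a CUDA device and the same tensor shapes as above; absolute numbers will differ from those reported here:

```python
import time
import torch

def rational(x, a, b):                                   # safe rational form of Eq. (12)
    num = sum(a[i] * x.pow(i) for i in range(len(a)))
    den = 1.0 + sum(b[j] * x.pow(j + 1) for j in range(len(b))).abs()
    return num / den

def torch_looped(x, a, b, g):                            # loop over groups, then concatenate
    return torch.cat([rational(c, a[i], b) for i, c in enumerate(x.chunk(g, dim=-1))], dim=-1)

def torch_vectorized(x, a, b, g):                        # reshape into groups, apply once, reshape back
    xg = x.view(*x.shape[:-1], g, -1)
    num = sum(a[:, i, None] * xg.pow(i) for i in range(a.shape[1]))
    den = 1.0 + sum(b[j] * xg.pow(j + 1) for j in range(b.shape[0])).abs()
    return (num / den).reshape(x.shape)

x = torch.randn(64, 1000, 512, device="cuda")
a, b = torch.randn(8, 6, device="cuda"), torch.randn(4, device="cuda")   # g = 8, m = 5, n = 4
for fn in (torch_looped, torch_vectorized):
    torch.cuda.synchronize(); start = time.time()
    for _ in range(100):
        fn(x, a, b, 8)
    torch.cuda.synchronize()
    print(fn.__name__, f"{100 / (time.time() - start):.1f} batch/s")
```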
Rational Initialization. We tested our KAT-T model with different initializations of the rational functions when training
from scratch. As shown in Table 10, the “Identity - Swish” initialization achieves the best performance, which we have
adopted as our default setting.
Visualization of Trained Functions
An important aspect to examine is the behavior of the trained rational functions. As shown in Figure 7, we plot the
functions for KAT-S with 𝑔 = 8 across all 12 layers. The results indicate that within each layer, the rational functions
exhibit similar trends, while the functions across different layers tend to differ from one another.
Acknowledgement
We would like to acknowledge that the computational work involved in this research was partially supported by NUS IT's Research Computing group under grant NUSREC-HPC-00001. We thank Weihao Yu, Qiuhong Shen, and Runpeng Yu for valuable discussions.
References
[Agh24] Alireza Afzal Aghaei. rkan: Rational kolmogorov-arnold networks. arXiv preprint arXiv:2406.14495, 2024.
[BC24a] Zavareh Bozorgasl and Hao Chen. Wav-kan: Wavelet kolmogorov-arnold networks. arXiv preprint
arXiv:2405.12832, 2024.
[BC24b] Zavareh Bozorgasl and Hao Chen. Wav-kan: Wavelet kolmogorov-arnold networks, 2024.
[BGG+ 20] Ronen Basri, Meirav Galun, Amnon Geifman, David Jacobs, Yoni Kasten, and Shira Kritchman. Frequency bias
in neural networks for input of non-uniform density. In International Conference on Machine Learning, pages
685–694. PMLR, 2020.
[BJG61] George A Baker Jr and John L Gammel. The padé approximant. Journal of Mathematical Analysis and
Applications, 2(1):21–30, 1961.
[BNT20] Nicolas Boullé, Yuji Nakatsukasa, and Alex Townsend. Rational neural networks. Advances in neural information
processing systems, 33:14243–14253, 2020.
[Boo71] C de Boor. Subroutine package for calculating with b-splines, 1971.
[BTSP24] Alexander Dylan Bodner, Antonio Santiago Tepsich, Jack Natan Spolski, and Santiago Pourteau. Convolutional
kolmogorov-arnold networks. arXiv preprint arXiv:2406.13155, 2024.
[CGD24] Ziwen Chen, Gundavarapu, and WU DI. Vision-kan: Exploring the possibility of kan replacing mlp in vision
transformer. https://github.com/chenziwenhaoshuai/Vision-KAN.git, 2024.
[Che24a] Minjong Cheon. Demonstrating the efficacy of kolmogorov-arnold networks in vision tasks. arXiv preprint
arXiv:2406.14916, 2024.
[Che24b] Minjong Cheon. Kolmogorov-arnold network for satellite image classification in remote sensing. arXiv preprint
arXiv:2406.00600, 2024.
[CWP+ 19] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng,
Ziwei Liu, Jiarui Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint
arXiv:1906.07155, 2019.
[CZSL20] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data
augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition workshops, pages 702–703, 2020.
[CZZ+ 24] Yifei Chen, Zhu Zhu, Shenghao Zhu, Linwei Qiu, Binfeng Zou, Fan Jia, Yunpeng Zhu, Chenyan Zhang, Zhaojie
Fang, Feiwei Qin, et al. Sckansformer: Fine-grained classification of bone marrow cells via kansformer backbone
and hierarchical attention mechanisms. arXiv preprint arXiv:2406.09931, 2024.
[DBK+ 21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An
image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[EUD18] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function
approximation in reinforcement learning. Neural networks, 107:3–11, 2018.
[Fuk69] Kunihiko Fukushima. Visual feature extraction by a multilayered network of analog threshold elements. IEEE
Transactions on Systems Science and Cybernetics, 5(4):322–333, 1969.
[GB10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks.
In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256.
JMLR Workshop and Conference Proceedings, 2010.
[GR74] William J Gordon and Richard F Riesenfeld. B-spline curves and surfaces. In Computer aided geometric design,
pages 95–126. Elsevier, 1974.
[HG16] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[HGDG17] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE
international conference on computer vision, pages 2961–2969, 2017.
[HN87] Robert Hecht-Nielsen. Kolmogorov’s mapping neural network existence theorem. In Proceedings of the
international conference on Neural Networks, volume 3, pages 11–14. IEEE press New York, NY, USA, 1987.
[Hor15] WG Horner. A new method of solving numerical equations of all orders, by continuous approximation. In
Abstracts of the Papers Printed in the Philosophical Transactions of the Royal Society of London, volume 2, pages
117–117. JSTOR, 1815.
[HSL+ 16] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth.
In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016,
Proceedings, Part IV 14, pages 646–661. Springer, 2016.
[HSW89] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal
approximators. Neural networks, 2(5):359–366, 1989.
[HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level
performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision,
pages 1026–1034, 2015.
[LB+ 95] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook
of brain theory and neural networks, 3361(10):1995, 1995.
[LBD+ 89] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and
Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation,
1(4):541–551, 1989.
[LBOM02] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks:
Tricks of the trade, pages 9–50. Springer, 2002.
[LH93] Henry Leung and Simon Haykin. Rational function neural network. Neural Computation, 5(6):928–938, 1993.
[LH19] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
[Li24] Ziyao Li. Kolmogorov-arnold networks are radial basis function networks. ArXiv, abs/2405.06721, 2024.
[LLC+ 21] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin
transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international
conference on computer vision, pages 10012–10022, 2021.
[LMB+ 14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and
C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th
European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer,
2014.
[LMGH22] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for
object detection. In European conference on computer vision, pages 280–296. Springer, 2022.
[LMW+ 22] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet
for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages
11976–11986, 2022.
[LMW+ 24] Ziming Liu, Pingchuan Ma, Yixuan Wang, Wojciech Matusik, and Max Tegmark. Kan 2.0: Kolmogorov-arnold
networks meet science. arXiv preprint arXiv:2408.10205, 2024.
[LWV+ 24] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y Hou, and
Max Tegmark. Kan: Kolmogorov-arnold networks. arXiv preprint arXiv:2404.19756, 2024.
[MSK20] Alejandro Molina, Patrick Schramowski, and Kristian Kersting. Padé activation units: End-to-end learning of
flexible activation functions in deep networks. In International Conference on Learning Representations, 2020.
[NBGS08] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with cuda: Is
cuda the parallel programming model that application developers have been waiting for? Queue, 6(2):40–53,
2008.
[Noe24] Gist Noesis. Fourierkan, 2024.
[RBA+ 19] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio,
and Aaron Courville. On the spectral bias of neural networks. In International conference on machine learning,
pages 5301–5310. PMLR, 2019.
[RJKK19] Basri Ronen, David Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for
learned functions of different frequencies. Advances in Neural Information Processing Systems, 32, 2019.
[RT12] Daniel Ruijters and Philippe Thévenaz. Gpu prefilter for accurate cubic b-spline interpolation. The Computer
Journal, 55(1):15–20, 2012.
[RtHRS08] Daniel Ruijters, Bart M ter Haar Romeny, and Paul Suetens. Efficient gpu-based texture interpolation using
uniform b-splines. Journal of Graphics Tools, 13(4):61–69, 2008.
[RZL17] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint
arXiv:1710.05941, 2017.
[SH05] Christian Sigg and Markus Hadwiger. Fast third-order texture filtering. GPU gems, 2:313–329, 2005.
[Sha20] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
[SVI+ 16] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception
architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 2818–2826, 2016.
[TCD+ 21] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou.
Training data-efficient image transformers & distillation through attention. In International conference on
machine learning, pages 10347–10357. PMLR, 2021.
[Tel17] Matus Telgarsky. Neural networks and rational functions. In International Conference on Machine Learning,
pages 3387–3393. PMLR, 2017.
[THK+ 21] Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica
Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision.
Advances in neural information processing systems, 34:24261–24272, 2021.
[TK23] Asher Trockman and J Zico Kolter. Mimetic initialization of self-attention layers. In International Conference
on Machine Learning, pages 34456–34468. PMLR, 2023.
[UAE93] Michael Unser, Akram Aldroubi, and Murray Eden. B-spline signal processing. i. theory. IEEE transactions on
signal processing, 41(2):821–833, 1993.
[VSP+ 17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser,
and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M.
Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information
Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017,
Long Beach, CA, USA, pages 5998–6008, 2017.
[Wal35] Joseph Leonard Walsh. Interpolation and approximation by rational functions in the complex domain, volume 20.
American Mathematical Soc., 1935.
[WH18] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision
(ECCV), pages 3–19, 2018.
[XLZ+ 18] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene
understanding. In Proceedings of the European conference on computer vision (ECCV), pages 418–434, 2018.
[YHO+ 19] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix:
Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF
international conference on computer vision, pages 6023–6032, 2019.
[YLZ+ 22] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan.
Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, pages 10819–10829, 2022.
[YSZ+ 23] Weihao Yu, Chenyang Si, Pan Zhou, Mi Luo, Yichen Zhou, Jiashi Feng, Shuicheng Yan, and Xinchao Wang.
Metaformer baselines for vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[YYW24] Runpeng Yu, Weihao Yu, and Xinchao Wang. Kan or mlp: A fairer comparison. arXiv preprint arXiv:2407.16674,
2024.
[YZL+ 22] Xingyi Yang, Daquan Zhou, Songhua Liu, Jingwen Ye, and Xinchao Wang. Deep model reassembly. Advances
in neural information processing systems, 35:25739–25753, 2022.
[ZCDL18] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk
minimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada,
April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
[ZZK+ 20] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In
Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13001–13008, 2020.
[ZZP+ 17] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through
ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641,
2017.
Denominator
The denominator involves the absolute value of a polynomial of degree 𝑛:
• Multiplications: There are $\frac{n(n+1)}{2}$ multiplications for the powers of $x$ and $n$ multiplications for the coefficients $b_i$, giving $\frac{n(n+1)}{2} + n$.
• Additions: There are 𝑛 additions for polynomial terms and 1 additional addition after the absolute value operation.
• Absolute value operation: 1 absolute value calculation.
Division. There is 1 division operation for the final computation of 𝐹 (𝑥).
Total FLOPs. The total FLOPs for any $m$ and $n$ are:
$$\text{Multiplications: } \frac{m(m+1)}{2} + \frac{n(n+1)}{2} + m + n, \quad \text{Additions: } m + n + 1, \quad \text{Absolute value: } 1, \quad \text{Division: } 1$$
For $m = 5$ and $n = 4$, this gives 34 multiplications, 10 additions, 1 absolute value, and 1 division: 46 FLOPs in total.
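A quick arithmetic check of these totals:

```python
m, n = 5, 4
mults = m * (m + 1) // 2 + n * (n + 1) // 2 + m + n     # 15 + 10 + 5 + 4 = 34
adds = m + n + 1                                        # 10
print(mults, adds, mults + adds + 1 + 1)                # 34 10 46 (with 1 abs and 1 division)
```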
7.2 Horner's Method
Using Horner's method, a polynomial of order $m$ requires $m$ additions and $m$ multiplications.
Thus, the numerator needs $m$ additions and $m$ multiplications, and the denominator needs $n + 1$ additions and $n$ multiplications. In total, we need $m + n + 1$ additions, $m + n$ multiplications, 1 absolute value, and 1 division.
For $m = 5$ and $n = 4$, this amounts to 21 FLOPs: 9 multiplications, 10 additions, 1 absolute value, and 1 division.
Training hyperparameters for KAT (Tiny / Small / Base).
Input resolution 224²
Epochs 300
Batch size 1024
Optimizer AdamW
Adam ε 1 × 10⁻⁸
Adam (β₁, β₂) (0.9, 0.999)
Learning rate 1 × 10⁻³
Learning rate decay Cosine
Gradient clipping None
Warmup epochs 5
Weight decay 0.05
RandAugment 9/0.5
Repeated augmentation off
CutMix 1.0
Mixup 0.8
CutMix-Mixup switch prob 0.5
Random erasing prob 0.25
Label smoothing 0.1
Peak stochastic depth rate 0.1 (Tiny) / 0.1 (Small) / 0.4 (Base)
EMA decay rate 0.9999
Figure 7: Fitted rational functions for the KAT-S model, with 12 layers and 8 groups (panels blocks.0-blocks.11, kan1/kan2, groups 0-7).