There are multiple training methods for Kolmogorov-Arnold networks. This article explains only the Kaczmarz and Newton-Kaczmarz methods,
previously published by the authors. Other methods,
such as Broyden, Adam, and stochastic gradient descent, can be found in the literature.
Introduction
The Kolmogorov-Arnold model is a set of continuous functions $\Phi_q, f_{q,p}$ converting input vectors $X$, represented in the formula by their elements $x_p$,
called features, into modeled scalars $m$, called targets:
$$ m = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} f_{q,p}(x_{p})\right). $$
Training the model means identifying these functions, given a dataset as a list of corresponding input vectors $X^k$ and targets $m^k$.
This model is called the Kolmogorov-Arnold representation (KAR). The generalization of this model to more than two layers or multiple targets
is called a Kolmogorov-Arnold network (KAN) and is regarded as an alternative to the multilayer perceptron (MLP).
We use subscripts for indexes in formulas and superscripts to denote particular values; for example, $x_i = x^k$
means that the variable $x$ sitting at the indexed position $i$ takes the numerical value $x^k$. The equality ${x_i}^{k+1} = {x_i}^{k} + \Delta$
means modification of the element $x_i$ by adding $\Delta$ in the $k$-th computational step.
One function only
Let us start with one function only and move on to KAN later. A 1D dataset is given by points with coordinates
$x^k, y^k$, and the goal is to adjust an arbitrary piecewise linear function given by its ordinates $f_i$, arranged with an even abscissa interval $D$
over the field of definition $[x_{min}, x_{max}]$, assuming the given points are spread near a certain curve, which makes this approximation reasonable.
For every abscissa $x$ we can identify the corresponding linear segment by computing the index
$$ i = \lfloor (x - x_{min}) / D \rfloor, $$
(where $\lfloor \cdot \rfloor$ is the floor function) and the relative offset from the left edge
$$L = (x - x_{min}) / D - i = R / D,$$
where $R = x - x_{min} - iD$ is the absolute offset within the segment, $L \in [0, 1)$, and the relative offset from the right edge is $1 - L$.
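A minimal sketch of this computation, assuming the ordinates are stored in a plain std::vector (the names here are illustrative, not taken from the published code):

```cpp
#include <cmath>
#include <vector>

// Position of an abscissa x inside a piecewise linear function:
// index of the left ordinate and the relative offset L in [0, 1).
struct SegmentPosition {
    int i;
    double L;
};

// xmin and D are the left limit and the abscissa interval,
// nSegments is the number of ordinates minus one.
SegmentPosition Locate(double x, double xmin, double D, int nSegments) {
    int i = static_cast<int>(std::floor((x - xmin) / D));
    if (i < 0) i = 0;                     // clamp points that fall slightly outside
    if (i > nSegments - 1) i = nSegments - 1;
    return {i, (x - xmin) / D - i};
}
```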
Having that, we can specify the desired equality as the linear interpolation $\mathbf{v}^T \mathbf{f} = y$, where $\mathbf{f}$ is the vector of
ordered ordinates and $\mathbf{v}^T$ is a vector with only two non-zero elements matching the positions of the corresponding ordinates:
$$\mathbf{f}^T = [f_0\: f_1\: f_2\: ... f_i\: f_{i+1} ...],$$
$$\mathbf{v}^T = [0,\: 0,\: 0,\: ... 1-L,\: L, ...].$$
Note that $1-L$ matches the left edge of the linear segment and $L$ matches the right edge: the interpolation weight of each ordinate
decreases linearly with the distance of $x$ from that edge.
Collecting a system of linear algebraic equations for this case is not even necessary.
The training is actually a much simpler procedure known as projection descent, introduced by Stefan Kaczmarz in 1937. It only needs a pass through the dataset point by point,
identification of the particular linear segment for each point, and an update of its left and right ordinates by the modifications
$${f_i}^{k+1} = {f_i}^{k} + \frac{(y - \hat y) \cdot (1 - L) }{(1 - L)^2 + L^2}$$
$${f_{i + 1}}^{k+1} = {f_{i + 1}}^{k} + \frac{(y - \hat y) \cdot L }{(1-L)^2 + L^2},$$
where $y$ and $\hat y = {f_i}^{k} \cdot (1 - L) + {f_{i + 1}}^{k} \cdot L = {f_i}^{k} + ({f_{i + 1}}^{k} - {f_i}^{k}) \cdot L$ are the provided and
computed targets. The squared norm $(1-L)^2 + L^2$ is always in the range $[0.5, 1]$ and can be replaced by
a regularization parameter $\alpha$ for computational stability.
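A minimal sketch of one such update, assuming the function is stored as a vector of ordinates and the position $(i, L)$ has been found as above, with the squared norm replaced by the regularization parameter $\alpha$ (illustrative names, not the library code):

```cpp
// One Kaczmarz (projection descent) step for a single piecewise linear function.
// f holds the ordinates, (i, L) locate the segment of the current point (x, y).
void KaczmarzStep(std::vector<double>& f, int i, double L, double y, double alpha) {
    double yHat = f[i] + (f[i + 1] - f[i]) * L;   // computed target (interpolation)
    double residual = y - yHat;                   // y - y_hat
    f[i]     += alpha * residual * (1.0 - L);     // update the left ordinate
    f[i + 1] += alpha * residual * L;             // update the right ordinate
}
```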
The essence of projection descent is the navigation of an arbitrary point through the hyperplanes of linear equations by projecting it from one hyperplane to another. Convergence
is extremely fast when all hyperplanes are nearly pairwise orthogonal, which is true in this case. The point being navigated here is the vector
$\mathbf{f}^T = [f_0\: f_1\: f_2\: ... f_i\: f_{i+1} ...]$.
The training may be arranged in a so-called online manner, that is, while reading
the dataset. Experimental points or observations of real systems usually do not fit any model exactly, which means that training must be stopped after achieving
a certain accuracy. Regularization provides computational stability to the training process.
Usually multiple runs through the dataset (epochs) are needed.
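Putting the two sketches above together, the online training of a single function could look like the following (again only an illustration of the procedure):

```cpp
// Several epochs of online training for one piecewise linear function.
void Train(std::vector<double>& f,
           const std::vector<double>& xs, const std::vector<double>& ys,
           double xmin, double D, double alpha, int epochs) {
    int nSegments = static_cast<int>(f.size()) - 1;
    for (int e = 0; e < epochs; ++e) {
        for (size_t k = 0; k < xs.size(); ++k) {
            SegmentPosition p = Locate(xs[k], xmin, D, nSegments);
            KaczmarzStep(f, p.i, p.L, ys[k], alpha);
        }
    }
}
```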
That was the hardest part of training KAN; the remaining part is even simpler.
Generalized additive models (GAMs) or discrete Urysohn operators (DUOs)
Now we consider a model with a sum of multiple functions.
Such models are called GAMs or DUOs:
$$ z = \sum_{p=1}^{n} f_p(x_p).$$
All we need to do is reuse the previous section. The difference between the given and estimated targets, $z - \hat z$, is now split evenly between
all functions of the model, and everything else is performed as already explained. It is still the same projection descent introduced by
Stefan Kaczmarz, only with block vectors:
$$[f_{0,0}\: f_{0,1}\: f_{0,2}\: ... f_{0,i}\: f_{0,i+1} ... f_{1,0}\: f_{1,1}\: f_{1,2}\: ... f_{1,i}\: f_{1,i+1} ...],$$
$$[0,\: 0,\: 0,\: ... (1-L_0),\: L_0, ... 0,\: 0,\: 0,\: ... (1-L_1),\: L_1, ...].$$
The name Generalized Additive Model does not need an explanation. The name Discrete Urysohn Operator needs a short clarification. There is a well-known
integral equation named after its researcher, Pavel Urysohn. It may take several forms; one of them looks as follows:
$$ z = \int_0^T U(x(s), s)ds. $$
Here $x(s)$ is a function defined on $[0,T]$ and $U$ is a function of two arguments called the kernel. These two formulas, the sum and the integral, can be regarded
as discrete and continuous forms of the same model. The values $x_p$ can be
interpreted as sequential points of the function $x(s)$ and the functions $f_p$ as slices of $U$, so the sum is a finite difference approximation to the
Urysohn integral equation. The term discrete Urysohn operator is used in the mathematical literature; it is not new terminology.
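For example, splitting $[0, T]$ into $n$ equal steps of length $\Delta s$ with nodes $s_p$ gives
$$ z = \int_0^T U(x(s), s)\,ds \approx \sum_{p=1}^{n} U(x(s_p), s_p)\,\Delta s = \sum_{p=1}^{n} f_p(x_p), \quad \text{where } f_p(\cdot) = U(\cdot, s_p)\,\Delta s, \;\; x_p = x(s_p). $$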
Training the model goes record by record. For each record we compute the estimation $\hat z$, find the difference $\Delta z = z - \hat z$, divide it evenly
between all functions as $\Delta z / n$, multiply it by the regularization parameter to get $\alpha \cdot \Delta z / n$, and update the corresponding single linear
segment of every $f_p$ by adding $\alpha \cdot \Delta z/n \cdot (1-L)$ to the left ordinate and $\alpha \cdot \Delta z/n \cdot L$ to the
right ordinate. The convergence is extremely fast because, if the training process were expressed as an iterative solution of the
corresponding system of linear algebraic equations, that system would be very sparse.
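A sketch of one such training step for the sum of $n$ functions, reusing the Locate routine from above; each function keeps its own range and interval (illustrative, not the published code):

```cpp
// One projection descent step for a GAM / discrete Urysohn operator
//   z = f_1(x_1) + ... + f_n(x_n).
void UrysohnStep(std::vector<std::vector<double>>& f,
                 const std::vector<double>& xmin, const std::vector<double>& D,
                 const std::vector<double>& x, double z, double alpha) {
    size_t n = f.size();
    std::vector<SegmentPosition> pos(n);
    double zHat = 0.0;
    for (size_t p = 0; p < n; ++p) {     // compute the estimation z_hat
        pos[p] = Locate(x[p], xmin[p], D[p], static_cast<int>(f[p].size()) - 1);
        zHat += f[p][pos[p].i] + (f[p][pos[p].i + 1] - f[p][pos[p].i]) * pos[p].L;
    }
    double delta = alpha * (z - zHat) / static_cast<double>(n);   // residual split evenly
    for (size_t p = 0; p < n; ++p) {     // update one linear segment of every f_p
        f[p][pos[p].i]     += delta * (1.0 - pos[p].L);
        f[p][pos[p].i + 1] += delta * pos[p].L;
    }
}
```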
KAN and Newton-Kaczmarz method
Kolmogorov-Arnold networks (KAN) are a generalization of the Kolmogorov-Arnold representation (KAR). A KAN has multiple layers, each mapping one vector
into another by a row of discrete Urysohn operators; for example, with five inputs and three outputs:
$$ u_i = \sum_{j=1}^{5} f_{i,j}(x_j), \:\:i=1,2,3.$$
If we introduce the matrix of first-order derivatives $F$ in the neighborhood of a given $X$,
$$
F =
\begin{bmatrix}
\frac{\partial f_{1,1}}{\partial x_1} & \frac{\partial f_{1,2}}{\partial x_2} & \dots & \frac{\partial f_{1,5}}{\partial x_5} \\
\frac{\partial f_{2,1}}{\partial x_1} & \frac{\partial f_{2,2}}{\partial x_2} & \dots & \frac{\partial f_{2,5}}{\partial x_5} \\
\frac{\partial f_{3,1}}{\partial x_1} & \frac{\partial f_{3,2}}{\partial x_2} & \dots & \frac{\partial f_{3,5}}{\partial x_5} \\
\end{bmatrix}
$$
then a step of gradient descent, improving the model by a residual-reducing increment $\Delta U$ for a particular record, is
$\Delta X = \alpha F^T \Delta U$, where $\alpha$ is the regularization. Having this $\Delta X$, we can improve all Urysohn operators in the layer below,
and having the current $X$ and $\Delta U$, we update the functions of the current layer itself.
Multiple layers mean that the vector $X$ may not be the given features but the output of the previous layer, and the vector $U$ may not
necessarily be the targets but the input to the next layer. In the case of multiple layers, the gradient update is propagated backwards
according to the elementary chain rule, which is expressed by the simple product of matrices $\alpha \dots F_3^T F_2^T F_1^T \Delta U$,
where each individual matrix is constructed for the row of Urysohn operators in its layer in the same way as in the example above.
This algorithm is called Newton-Kaczmarz; a strict
mathematical proof of its local convergence can be found in PREPRINT.
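A sketch of this backward step for a single layer, computing $\Delta X = \alpha F^T \Delta U$ from the residual $\Delta U$ and the matrix of derivatives (for piecewise linear functions each entry of $F$ is simply the slope of the active segment); dense storage and the names below are assumptions for illustration:

```cpp
// dX = alpha * F^T * dU for one layer with nOut outputs and nIn inputs,
// where F[i][j] approximates the derivative of f_{i,j} at the current x_j.
std::vector<double> Backpropagate(const std::vector<std::vector<double>>& F,
                                  const std::vector<double>& dU, double alpha) {
    size_t nOut = F.size();
    size_t nIn = F[0].size();
    std::vector<double> dX(nIn, 0.0);
    for (size_t j = 0; j < nIn; ++j) {
        for (size_t i = 0; i < nOut; ++i) {
            dX[j] += F[i][j] * dU[i];    // (F^T dU)_j
        }
        dX[j] *= alpha;
    }
    return dX;
}
```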
There is a big similarity with neural networks (NN). The product of matrices $F_{i+1}^T F_{i}^T \dots$ is similar to backpropagation, and each
individual Urysohn operator is connected to the others in a way similar to the individual connections of neurons in an NN. The application of matrices $F_i^T$ is a well-tested,
fairly standard approach known for decades. It was used, for example, for the solution of systems of linear algebraic equations $Ax = y$ by the
iterations $x_{i+1} = x_i + \alpha A^T(y - Ax_i)$. The method is called gradient-based iterative or least squares gradient descent, and the
regularization in this case can be chosen as $1 / ||A^TA||$.
The vector $y - Ax$ in this example is equivalent
to $\Delta U$, and the small increments $\Delta X$ and $\Delta U$ are related by the matrix of first-order derivatives $F$.
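For reference, a minimal dense-matrix sketch of that classical iteration $x_{i+1} = x_i + \alpha A^T(y - Ax_i)$ (illustration only):

```cpp
// Solve A x = y in the least squares sense by x <- x + alpha * A^T (y - A x).
std::vector<double> LeastSquaresIteration(const std::vector<std::vector<double>>& A,
                                          const std::vector<double>& y,
                                          double alpha, int iterations) {
    size_t rows = A.size(), cols = A[0].size();
    std::vector<double> x(cols, 0.0);
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> r(rows);
        for (size_t i = 0; i < rows; ++i) {        // residual r = y - A x
            double ax = 0.0;
            for (size_t j = 0; j < cols; ++j) ax += A[i][j] * x[j];
            r[i] = y[i] - ax;
        }
        for (size_t j = 0; j < cols; ++j) {        // x += alpha * A^T r
            double g = 0.0;
            for (size_t i = 0; i < rows; ++i) g += A[i][j] * r[i];
            x[j] += alpha * g;
        }
    }
    return x;
}
```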
Remarks on the belated solution to the problem
Until very recently, the training of the Kolmogorov-Arnold model was considered impossible. You
can find scientific papers published around 2020 still saying that building a Kolmogorov-Arnold model is mission impossible. It is an
interesting phenomenon, because all the methods used for this identification have been known since the 1930s and 1950s, and a programmatic
implementation needs roughly 200 lines of code. The number of arithmetic operations for training a KAN is significantly lower than for an
NN. There are benchmark examples where training a KAN goes 50 times quicker than even well-optimized NN implementations
designed for quick execution (not Python code). This is the reason for the recent hype; the surprising part is why now and
not in the 1960s.
Rescaling of intermediate arguments
The limits of definition of the intermediate arguments $\hat z_q$ may change during the training. This is easily fixed by moving
$x_{min}, x_{max}$ and recalculating $D$. These limits can be chosen freely at the start of the training and updated when
necessary. Below is an example of rescaling; in the published code it is executed each time the function FitDefinition(x) is called, that is, before
processing the argument $x$.
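A minimal sketch of such a rescaling step, assuming the function keeps its ordinates together with $x_{min}$, $x_{max}$ and $D$; the structure and names below are illustrative, not the actual published code:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// A piecewise linear function that can extend its range of definition
// when an argument falls outside [xmin, xmax].
struct PLFunction {
    std::vector<double> f;   // ordinates
    double xmin, xmax, D;    // range of definition and abscissa interval

    double Value(double x) const {
        int n = static_cast<int>(f.size()) - 1;   // number of segments
        int i = std::min(std::max(static_cast<int>(std::floor((x - xmin) / D)), 0), n - 1);
        double L = (x - xmin) / D - i;
        return f[i] + (f[i + 1] - f[i]) * L;
    }

    // Called before processing x: if x is outside the current range,
    // move the limits, recompute D, and re-sample the already trained
    // shape onto the new grid.
    void FitDefinition(double x) {
        if (x >= xmin && x <= xmax) return;
        double newMin = std::min(xmin, x);
        double newMax = std::max(xmax, x);
        int nOrd = static_cast<int>(f.size());
        double newD = (newMax - newMin) / (nOrd - 1);
        std::vector<double> g(nOrd);
        for (int i = 0; i < nOrd; ++i) {
            double xi = newMin + i * newD;
            g[i] = Value(std::min(std::max(xi, xmin), xmax));  // nearest edge value outside
        }
        f = g; xmin = newMin; xmax = newMax; D = newD;
    }
};
```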
The original set of limits is straightforward to choose. All inner functions use features, so the ranges for their arguments are known. The limits for the inner function values can be chosen freely.
The computed inner function values are summed and used as the argument for the top functions, so that argument is a sum of a known number of random values with unknown distribution. But
since the limits are adjusted anyway and need to be assigned only approximately, the distribution may be assumed to be uniform. The outer function values must match the
target range. All of this can be found in the code.
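For example, if each inner function value is kept within $[y_{min}, y_{max}]$, one simple (and deliberately rough) initial choice for the range of an outer function argument, before any rescaling, is the extreme interval
$$ \left[\, n\, y_{min},\; n\, y_{max} \,\right], $$
which is then tightened or extended automatically by the rescaling described above. This particular choice is only an illustration, not necessarily what the published code does.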
Using splines does not improve accuracy
It may look like the piecewise linear model is an approximation to a smooth-function model and is for that reason less accurate. That is WRONG. There are multiple
coding examples showing that the accuracy is the same. The reason is clear: the target is the sum of $2n + 1$ piecewise linear functions, and each
argument for them is the sum of another $n$ piecewise linear functions with freely chosen segment sizes. The basis functions
can be replaced during training while preserving the already achieved accuracy. There are coding examples on this site where
training starts as a piecewise linear model and ends as a spline model.
Computational complexity
A simple way to evaluate computational complexity is to count the number of long operations, such as divisions and multiplications, during training. The other operations,
such as allocating memory for arrays and objects and navigating through them, should not be a big burden, because these lists and sizes are generally tiny.
We estimate it for the piecewise linear case.
The main time is spent in two operations of the training loop: computing the model output for a record and updating the functions.
In the published code, the 'addend' object is one term of the Kolmogorov-Arnold representation
$$ \Phi_q\left(\sum_{p=1}^{n} f_{q,p}(x_{p})\right). $$
The function 'ComputeUsingInput' computes the value of each function of the addend; the number of such functions is the number of inputs plus one. The function
'UpdateUsingMemory' updates each of these functions, reusing a few already computed parameters. These parts can be found in
'UnivariatePL::GetFunctionUsingInput' and 'UnivariatePL::UpdateUsingMemory'. You can see that they have only 3 long operations
altogether. There are also additional computations of numerical derivatives and a multiplication by the difference for each
outer function, which adds 2 long operations per outer function. And, finally, there is one more multiplication by the regularization parameter.
So for $n$ inputs (features) we need $3(2n + 1)(n + 1) + 2(2n + 1) + 1$ long operations per record.
The number of required epochs is
in general similar to neural networks, 20 to 100. This is how we achieve quick processing.
For example, 5 inputs and 10 000 records may take from 0.1 to 0.4 seconds on a single CPU.
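As a rough check of that figure: for $n = 5$ the count above gives
$$ 3(2 \cdot 5 + 1)(5 + 1) + 2(2 \cdot 5 + 1) + 1 = 198 + 22 + 1 = 221 $$
long operations per record, so 10 000 records and 20 to 100 epochs need roughly $4 \cdot 10^7$ to $2 \cdot 10^8$ long operations, which is indeed a fraction of a second of arithmetic on a single modern CPU core.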
If we want to compare training time to an MLP, we need a large dataset; examples that train in a fraction of a second cannot provide a fair comparison
due to overhead. Such examples can be found on this site. The C++ KAN is typically 3 to 10 times faster than well-optimized
NN code.
Proof of the claimed efficiency
All statements of this essay can be verified by downloading and testing the code provided on this site.
This concept has been published in highly rated journals and has been publicly available since 2021.