The complete idiot's guide to Kolmogorov-Arnold Networks
In 1924 Pavel Urysohn introduced an integral equation. In one of its forms it maps a function $z(s)$ to a scalar $u$ via a kernel function $F[z,s]$:
$$
u = \int_{0}^{T} F[z(s),s]ds.
$$
If we replace the continuous $z$ by a sequence of values and slice the surface $F$ into a set of univariate functions, we obtain its
quadrature formula, known in the literature as the discrete Urysohn operator or generalized additive model:
$$
u = \sum_{j = 1}^{n} f_j(z_j).
$$
If we combine multiple such operators, we obtain a mapping of a vector $Z$ to a vector $U$:
$$
u_i = \sum_{j = 1}^{n} f_{i,j}(z_j), \quad i=1,2,3...
$$
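As a quick illustration, such a vector-to-vector mapping can be evaluated in a few lines of Python, assuming for the moment that each univariate function $f_{i,j}$ is stored as an ordinary callable (the concrete piecewise linear representation comes later):

```python
import numpy as np

def urysohn_layer(funcs, z):
    """Evaluate u_i = sum_j f_ij(z_j) for a table of univariate functions."""
    return np.array([sum(f(zj) for f, zj in zip(row, z)) for row in funcs])

# Toy table: m = 2 outputs, n = 3 inputs, arbitrary univariate functions.
funcs = [
    [np.sin, np.cos, lambda t: t**2],
    [np.tanh, abs,   lambda t: 0.5 * t],
]
print(urysohn_layer(funcs, np.array([0.1, 0.2, 0.3])))
```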
If we combine multiple layers of such vector-to-vector mappings, we obtain a Kolmogorov-Arnold network (KAN), for example three layers as shown
below:
$$
u_i = \sum_{j = 1}^{n} f_{i,j}(z_j), \quad i=1,2,3...m \\
z_i = \sum_{j = 1}^{k} g_{i,j}(y_j), \quad i=1,2,3...n \\
y_i = \sum_{j = 1}^{s} q_{i,j}(x_j), \quad i=1,2,3...k
$$
The bottom equation is the inner layer, where the $x_j$ are the inputs or features. The top equation is the outer layer, where the $u_i$ are the outputs or targets.
To train a KAN we need to identify several tables of univariate functions: the $s \times k$ functions $q_{i,j}$ of the inner layer, the
$k \times n$ functions $g_{i,j}$ of the hidden layer and the $n \times m$
functions $f_{i,j}$ of the outer layer, given only a dataset of corresponding input and output vectors $X$ and $U$.
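To get a feel for the sizes involved: with, say, $s = 4$ features, $k = 3$, $n = 2$ and $m = 1$ (hypothetical numbers chosen only for illustration), the network contains $4 \cdot 3 + 3 \cdot 2 + 2 \cdot 1 = 20$ univariate functions to identify.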
KAN training
I should note at this point that the method works with any choice of the sought function type, including B-splines, but I explain everything for the piecewise linear type, because
it is the key to understanding the training method.
The second assumption, made only for convenience of explanation, is that the abscissa spacing of the linear segments is uniform within each function. The spacing need not be the same for all functions,
only within each individual function. As will be seen later, this is not a limitation of the method, only a convenience.
First, we consider a single urysohn which maps a vector $Z$ to a scalar $\hat u$ with an error $e = \Delta u = u - \hat u$, which we wish to reduce at each update step.
This can be done in multiple different ways, but we suggest the projection descent introduced by Stefan Kaczmarz in 1937.
The difference $\Delta u$ is divided evenly among all functions as $\Delta u/n$; then, for each function, the corresponding linear segment is identified and
its two adjacent points are moved either up or down.
In order to do that, we compute the relative parameter $R$:
$$
R = (z - z_{min}) / \Delta z.
$$
Its integer part $\lfloor R \rfloor$ is the index $i$ in the picture, and its fractional part $L = R - i \in [0,1]$ is the relative offset, i.e. the distance from the left end of the corresponding segment.
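For example, if a function is defined on $[0, 2]$ with four equal segments, then $\Delta z = 0.5$; the argument $z = 1.3$ gives $R = (1.3 - 0)/0.5 = 2.6$, hence $i = 2$ and $L = 0.6$.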
Having $i$ and $L$, we update the left and right function points:
$$
f_i += \alpha \times \Delta u/n \times (1.0 - L) \\
f_{i+1} += \alpha \times \Delta u/n \times L,
$$
where $+=$ is the programmatic 'add and assign' operator and $\alpha$ is the learning rate. With a properly chosen learning rate or regularization, the factor $n$
can be dropped:
$$
f_i += \alpha \times \Delta u \times (1.0 - L) \\
f_{i+1} += \alpha \times \Delta u \times L.
$$
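Here is a minimal sketch of this update for a single urysohn, assuming each univariate function is stored simply as an array of ordinate values on a uniform grid (the class name and layout are illustrative, not taken from any particular library):

```python
import numpy as np

class PiecewiseLinear:
    """A univariate piecewise linear function on a uniform grid (illustrative layout)."""
    def __init__(self, z_min, z_max, n_points, rng):
        self.z_min = z_min
        self.dz = (z_max - z_min) / (n_points - 1)
        self.f = rng.uniform(-1.0, 1.0, n_points)      # random initialization

    def locate(self, z):
        r = (z - self.z_min) / self.dz                 # the relative parameter R
        i = int(np.clip(np.floor(r), 0, len(self.f) - 2))
        return i, r - i                                # segment index i and offset L

    def __call__(self, z):
        i, L = self.locate(z)
        return self.f[i] * (1.0 - L) + self.f[i + 1] * L

    def update(self, z, delta, alpha):
        i, L = self.locate(z)
        self.f[i]     += alpha * delta * (1.0 - L)
        self.f[i + 1] += alpha * delta * L

rng = np.random.default_rng(0)
funcs = [PiecewiseLinear(-1.0, 1.0, 5, rng) for _ in range(3)]   # one urysohn, n = 3
z, u = np.array([0.2, -0.4, 0.7]), 1.5                           # one training record
u_hat = sum(f(zj) for f, zj in zip(funcs, z))
for f, zj in zip(funcs, z):                                      # Kaczmarz-style step
    f.update(zj, u - u_hat, alpha=0.1)
```

Repeating this step record by record drives $\hat u$ toward $u$; the division by $n$ from the first formula is absorbed into the learning rate here.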
Now we only need to find an increment for each individual urysohn that reduces the entire network discrepancy for a given record.
Let us do it in a way widely used since the 1950s. We introduce a loss function for the outer layer:
$$
Loss = \frac{1}{2} \sum_{i=1}^m \left [ u_i - \sum_{j=1}^n f_{i,j}(z_j) \right] ^2
$$
and perform a gradient descent step
by taking partial derivatives of the loss with respect to $z_j$ and stepping in the direction that reduces the loss. Simple algebraic transformations give the following result:
$$
\Delta Z = \alpha F^T \Delta U,
$$
where $F$ is the matrix of partial derivatives evaluated at the points $z_1, z_2, \dots$:
$$
\begin{bmatrix}
\frac{\partial f_{1,1}}{\partial z_1} & \frac{\partial f_{1,2}}{\partial z_2} & \dots & \frac{\partial f_{1,n}}{\partial z_n} \\
\frac{\partial f_{2,1}}{\partial z_1} & \frac{\partial f_{2,2}}{\partial z_2} & \dots & \frac{\partial f_{2,n}}{\partial z_n} \\
&& \dots &\\
\frac{\partial f_{m,1}}{\partial z_1} & \frac{\partial f_{m,2}}{\partial z_2} & \dots & \frac{\partial f_{m,n}}{\partial z_n} \\
\end{bmatrix}.
$$
Since all functions and arguments are known at the training step, the matrix is simply a table of numbers. At the start of the training process, the
functions are initialized randomly.
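For piecewise linear functions each partial derivative $\partial f_{i,j}/\partial z_j$ is just the slope of the segment in which $z_j$ currently falls, so assembling $F$ amounts to a table lookup. A sketch, assuming the same uniform-grid storage as above and a common argument range (both are simplifying assumptions):

```python
import numpy as np

def derivative_matrix(layer, z, z_min=-1.0, z_max=1.0):
    """Assemble F[i, j] = slope of the active segment of f_ij at z_j.

    layer is a 3D array of shape (m, n, points): ordinate values of the
    m x n piecewise linear functions on a shared uniform grid (an assumption).
    """
    _, n, points = layer.shape
    dz = (z_max - z_min) / (points - 1)
    idx = np.clip(((z - z_min) / dz).astype(int), 0, points - 2)  # active segment per input
    slopes = (layer[:, :, 1:] - layer[:, :, :-1]) / dz            # slopes of all segments
    return slopes[:, np.arange(n), idx]                           # pick one slope per column

layer = np.random.default_rng(1).uniform(-1, 1, (2, 3, 5))  # m = 2, n = 3, 4 segments each
F = derivative_matrix(layer, np.array([0.2, -0.4, 0.7]))
print(F.shape)                                              # (2, 3)
```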
The positive or negative sign in front of $\alpha F^T \Delta U$ defines the gradient direction and must be chosen from the meaning, which should be obvious and depends on
whether you define $\Delta U$ as $\hat u - u$ or as $u - \hat u$.
Since the vector $\Delta U$ is the known model error and the matrix $F$ is defined by the current function values (the state of the model), we can obtain the wanted increments for each layer and
update it. In the case of more than three layers, the increments are propagated further down in the same way, by multiplication
with the corresponding matrices of partial derivatives:
$$
\Delta Y = G^T \Delta Z = \alpha \, G^T F^T \Delta U.
$$
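With the derivative matrices at hand, this propagation is just a couple of matrix products; a short sketch with NumPy (the shapes are arbitrary and only need to be consistent with the formulas above):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.05
F = rng.uniform(-1, 1, (4, 3))               # outer layer derivatives, m = 4, n = 3
G = rng.uniform(-1, 1, (3, 2))               # hidden layer derivatives, n = 3, k = 2
delta_U = np.array([0.1, -0.2, 0.05, 0.3])   # model error for one record

delta_Z = alpha * F.T @ delta_U              # increments for the hidden layer outputs
delta_Y = G.T @ delta_Z                      # propagated one layer further down
```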
So to train a KAN we need a few sequential steps: we compute the model outputs, compare them with the targets to obtain the vector of model errors, we compute the wanted increments for each urysohn in
the network and, having them, update each urysohn. By applying this step record by record over multiple passes through the data we train the network.
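Putting the pieces together, here is a compact sketch of the whole procedure for a minimal two-layer case. The uniform grid, the fixed argument range $[-1, 1]$ with clipping, the layer sizes and the toy target are my own simplifying assumptions; a practical implementation would adapt the argument ranges as it learns.

```python
import numpy as np

rng = np.random.default_rng(0)
LO, HI, PTS = -1.0, 1.0, 8                 # assumed common argument range and grid size
DZ = (HI - LO) / (PTS - 1)

def init(out_dim, in_dim):                 # one layer = 3D table of function values
    return rng.uniform(-0.1, 0.1, (out_dim, in_dim, PTS))

def locate(z):                             # segment index i and offset L for each input
    r = (np.clip(z, LO, HI) - LO) / DZ
    i = np.minimum(r.astype(int), PTS - 2)
    return i, r - i

def forward(layer, z):                     # u_i = sum_j f_ij(z_j)
    i, L = locate(z)
    cols = np.arange(layer.shape[1])
    vals = layer[:, cols, i] * (1 - L) + layer[:, cols, i + 1] * L
    return vals.sum(axis=1)

def slopes(layer, z):                      # F[i, j] = derivative of f_ij at z_j
    i, _ = locate(z)
    cols = np.arange(layer.shape[1])
    return (layer[:, cols, i + 1] - layer[:, cols, i]) / DZ

def update(layer, z, delta, alpha):        # Kaczmarz-style update of the two segment ends
    i, L = locate(z)
    cols = np.arange(layer.shape[1])
    layer[:, cols, i]     += alpha * np.outer(delta, 1 - L)
    layer[:, cols, i + 1] += alpha * np.outer(delta, L)

inner, outer = init(3, 2), init(1, 3)      # a tiny two-layer KAN: 2 features -> 3 -> 1
alpha = 0.05
for _ in range(50):                        # several passes through freshly sampled data
    for x in rng.uniform(-1, 1, (200, 2)): # record-by-record training
        u = x[0] * x[1]                    # toy target: product of the two features
        y = forward(inner, x)
        du = u - forward(outer, y)         # model error for this record
        dy = slopes(outer, y).T @ du       # wanted increments for inner layer outputs
        update(outer, y, du, alpha)
        update(inner, x, dy, alpha)
```

In this sketch the learning rate is applied once per layer inside `update`; where exactly the $\alpha$ and $1/n$ factors are placed is a matter of convention, as noted above.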
Some clarification notes
Each layer is defined by a 3D matrix, i.e. a tensor. The matrix of derivatives is 2D because it is a subset of this tensor, selected by the particular input vector.
It may become clearer if we consider the elementary linearization step for a single urysohn and a scalar element of the output vector:
$$
u = \sum_{j = 1}^{n} f_j(z_j),
$$
$$
\Delta u = \sum_{j = 1}^{n} \frac {\partial f_j(z_j)}{\partial z_j} \Delta z_j.
$$
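For example, with $n = 2$, segment slopes $0.5$ and $-1.2$ at the current arguments, and increments $\Delta z_1 = 0.1$, $\Delta z_2 = 0.05$ (hypothetical numbers), the linearized change is $\Delta u = 0.5 \cdot 0.1 - 1.2 \cdot 0.05 = -0.01$.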
The same vector of input increments is used in every urysohn of the layer when assembling the derivative matrix.
If we convert the discrete form of a layer back to a continuous one, we obtain the Urysohn integral equation in almost its classic form:
$$
u(t) = \int_{s_1}^{s_2} F[z(s),s,t]ds, \quad t \in [t_1, t_2]
$$
with a 3D kernel. It was introduced as a mathematical problem different from the one considered here. The word 'equation' usually means an equality
which contains an unknown object to be found. That
unknown object was $z(s)$ for given $u(t)$ and $F[z(s),s,t]$. Training the network is a different problem: it is building the kernel given a
set of inputs and outputs. More than that, we want to identify multiple kernels which define a chain of equations with several unobserved
layers. In one of the references below it is mathematically proven that there is an infinity of solutions for every possible dataset
providing the same accuracy level. But that part is for researchers. Here I try to explain the concept for engineers.
The immediate question 'how does it all work?' has a quick and simple answer. I show here one challenging scalability test. The features are random $5 \times 5$ matrices,
i.e. $25$ features. The target is the determinant. Due to the significant non-linearity and sensitivity to small changes we need a significant number of records. We have
10,000,000 training records, 2,000,000 validation records (not used in training), and the accuracy metric is the Pearson correlation coefficient. The baseline
benchmark was MATLAB neural networks, because MATLAB is designed for quick processing and is a very good candidate for comparison. The test results are
shown in the chart below.
The red label shows the time and accuracy for the Kolmogorov-Arnold model with piecewise linear functions. The number of urysohns in the inner layer is 200, the number of
urysohns in the outer layer is 1, the number of linear segments in the inner layer is 3 (4 points), and the number of linear segments in the outer layer is 17 (18 points). The other labels in the chart show
results for different MATLAB-designed neural networks.
The code to reproduce this experiment is available.
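For readers who want to reproduce a similar setup from scratch, generating the dataset itself takes only a few lines; a sketch (the range of the random matrix entries and the record counts below are assumptions for illustration, not necessarily those used in the benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_records(count):
    """Random 5x5 matrices flattened into 25 features, determinant as the target."""
    M = rng.uniform(0.0, 1.0, (count, 5, 5))
    return M.reshape(count, 25), np.linalg.det(M)

X_train, u_train = make_records(1_000_000)   # the full test used 10,000,000 records
X_valid, u_valid = make_records(200_000)     # and 2,000,000 validation records

# Accuracy metric from the text: Pearson correlation between model predictions
# and u_valid, e.g. np.corrcoef(predictions, u_valid)[0, 1].
```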