KAN's Core

Kolmogorov-Arnold model is a set of continuous functions $\Phi_q, \phi_{q,p}$ converting input vectors $X$, shown in formula by their elements $x_p$, called features, into modeled scalars $m$, called targets $$ m = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_{p})\right). $$ Training the model is identification of these functions provided dataset as a list of corresponding input vectors $X^k$ and targets $m^k$. This model has recently acquired name Kolmogorov-Arnold network (KAN) and regarded as an alternative to multilayer perceptron (MLP).

We use subscripts for indexes in formulas and superscripts to denote particular values, for example, when we say that $x_i = x^k$, that means that variable $x$ sitting at the indexed position $i$ takes the numerical value $x^k$.

One function only

Let us start from the one function only and move on further to KAN. The dataset is given by points with their coordinates $x^k, y^k$ and our goal is to adjust arbitrary piecewise linear function given by their ordinates arranged with an even abscissa intervals $D$ in the field of definition $[x_{min}, x_{max}]$. In the picture we showed only one linear segment and one point for the explanation.

For every red point abscissa $x$ we can find the corresponding linear segment and offset from the left $L$ by computing relative scaled distance $S$ $$ S = (x - x_{min}) / D. $$ The integer part $i = floor(S)$ of this scaled distance is an index of the left point in the involved segment. The fractional part $L = S - i$ is a relative offset from the left which is non negative and less than one, $L \in [0, 1.0)$. The offset from the right is $R = 1 - L$.

If we stretch all ordinates of the piecewise linear function into a vector $\mathbf{f}^T = [f_0\: f_1\: f_2\: ... f_i\: f_{i+1} ...]$ than corresponding ordinate $y$ of the red point, for the perfect model, must be equal to inner product $\mathbf{v}^T \mathbf{f}$, where vector $\mathbf{v}^T = [0\: 0\: 0\: ... R\: L ...]$ contains only two non null elements on the positions matching left and right function values of involved linear segment. Please note that position of $R$ is matching left border and $L$ is matching right border. The reason must be obvious.

We don't even need to build system of linear algebraic equations to tune our function values, although we can do it. The training is actually much simpler procedure known as projection descent introduced by Stefan Kaczmarz in 1937. It only needs navigation dataset with red points data, identification of particular linear segment for each point, and update left and right points by adding $$\frac{(y - \hat y) \cdot R }{(R^2 + L^2)}$$ to the left ordinate $f_i$ and $$\frac{(y - \hat y) \cdot L }{(R^2 + L^2)}$$ to the right ordinate $f_{i+1}$. Here $y$ is actual and $\hat y$ computed values. The norms $(R^2 + L^2) \in [1, 0.5]$ and can be replaced by regularization $\alpha$ for computational stability.

The matter of projection descent is navigation of arbitrary point through hyperplanes of linear equations by projecting it from one hyperplane to another. The convergence is extremely fast when all hyperplanes are near pairwise orthogonal which is true in our case. The point that is navigated, in our case, is vector $\mathbf{f}^T = [f_0\: f_1\: f_2\: ... f_i\: f_{i+1} ...]$. The training may be arranged in so-called online manner, that means while reading dataset. The experimental points or observations of real systems usually not fit any model exactly that means we have to interrupt training after achieving certain accuracy. Regularization makes the training stable. We need multiple runs through data (epochs) and updates of involved segments by adding $\alpha \cdot \Delta y \cdot R$ and $\alpha \cdot \Delta y \cdot L$ to corresponding ordinates.

That was the hardest part of training KAN, the outstanding part is even simpler.

Generalized additive models (GAM) or discrete Urysohn operators (DUO)

Now we consider the model with the sum of multiple functions. They are called GAM or DUO $$ z = \sum_{p=1}^{n} \phi_p(x_p).$$ All we need to do is to reuse previous section. The difference between given and estimated targets $z - \hat z$ is now evenly split between all functions of the model and everything else is performed as already explained. It is still same projection descent introduced by Stefan Kaczmarz, only with block vectors. $$[f_{0,0}\: f_{0,1}\: f_{0,2}\: ... f_{0,i}\: f_{0,i+1} ... f_{1,0}\: f_{1,1}\: f_{1,2}\: ... f_{1,i}\: f_{1,i+1} ...],$$ $$[0\: 0\: 0\: ... R\: L ... 0\: 0\: 0\: ... R\: L ...].$$
The name Generalized Additive Model does not need an explanation. Name Discrete Urysohn Operator needs short clarification. There is well known integral equation named after its researcher Pavel Urysohn. It may take several forms, one of them looks like follows:

$$ z = \int_0^T U(x(s), s)ds. $$ In this case $x(s)$ is a function defined on $[0,T]$ and $U$ is a function with of two arguments called kernel. These two formulas can be regarded as discrete and continuous forms of the same model. The values $x_p$ can be interpreted as sequential points of function $x(s)$ and functions $\phi_p$ as slices of $U$, so sum is a finite difference approximation to Urysohn integral equation. The term Discrete Urysohn operator is used in mathematical literature, I don't introduce new terminology here.

Training a model goes record by record. For each of them we compute estimation $\hat z$, find the difference $\Delta z = z - \hat z$, divide it evenly between all functions $\Delta z / n$, multiply by parameter of regularization $\alpha \cdot \Delta z / n$ and update corresponding single linear segment of every $\phi_p$ by adding $\alpha \cdot \Delta z/n \cdot R$ to the left ordinate and $\alpha \cdot \Delta z/n \cdot L$ to the right ordinate. The convergence is extremely fast, because if to express training process as an iterative solution, the corresponding system of linear algebraic equations will be very sparse.

Kolmogorov + Arnold = quick, simple and accurate

Obviously, KAN is a tree of GAMs or DUOs. If we make $2n + 1$ different GAMs or DUOs for the same inputs $x_p$ and consider their outputs $z_q$ as latent or unobserved variables (similar to hidden layer in MLP) than we can express final model output also as GAM or DUO $$ m = \sum_{q=0}^{2n} \Phi_q(z_q),$$ where $z_q$ are $$ z_q = \sum_{p=1}^{n} \phi_{q,p}(x_p).$$ The training is update of inner and outer operators for each new record. Inner operators are those which have features as inputs and outer operator computes the target. The outer operator is updated as already explained by reducing difference $\Delta m = m - \hat m$ for computed hidden layer $z_q$ values. For the inner operators, all we need to do is to divide the difference $\Delta m = m - \hat m$ evenly by the number of functions in outer GAM which yields $\Delta m /(2n +1)$, and pass it down to inner GAMs for updating. We need regularization for computational stability and the convergence is faster when we multiply this term by corresponding derivative $$ \frac {\alpha \cdot \Delta m}{2n + 1} \: \frac{d\Phi_q}{dz_q}.$$ Derivatives exist in every point of definition fields since all functions are piecewise linear. Multiplication by derivatives $d\Phi_q/dz_q$ expedites convergence and increases accuracy of the model.

Proof of the claimed simplicity

All statements of this essay can be verified by downloading and testing the code provided on this site. The performance and accuracy is the same or better than MLP. This concept is published in high rated journals and publicly available since 2021.


Video lecture
Code download