The complete idiot's guide to Kolmogorov-Arnold Networks
In 1924 Pavel Urysohn introduced an integral equation. In one of its forms it maps a function $z(s)$ to a scalar $u$ via a kernel function $F[z,s]$:
$$
u = \int_{0}^{T} F[z(s),s]ds.
$$
If we replace the continuous $z$ by a sequence of values and slice the surface $F$ into a set of univariate functions, we obtain its
quadrature formula, known in the literature as the discrete Urysohn operator or generalized additive model:
$$
u = \sum_{j = 1}^{n} f_j(z_j).
$$
If we combine multiple such operators, we obtain a mapping of a vector $Z$ to a vector $U$:
$$
u_i = \sum_{j = 1}^{n} f_{i,j}(z_j), \quad i=1,2,3...
$$
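As a quick illustration, such a vector-to-vector mapping can be evaluated in a few lines of Python, assuming for the moment that each univariate function $f_{i,j}$ is stored as an ordinary callable (the concrete piecewise linear representation comes later):

```python
import numpy as np

def urysohn_layer(funcs, z):
    """Evaluate u_i = sum_j f_ij(z_j) for a table of univariate functions."""
    return np.array([sum(f(zj) for f, zj in zip(row, z)) for row in funcs])

# Toy table: m = 2 outputs, n = 3 inputs, arbitrary univariate functions.
funcs = [
    [np.sin, np.cos, lambda t: t**2],
    [np.tanh, abs,   lambda t: 0.5 * t],
]
print(urysohn_layer(funcs, np.array([0.1, 0.2, 0.3])))
```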
If we combine multiple layers of such vector-to-vector mappings, we obtain a Kolmogorov-Arnold network (KAN), for example three layers as shown
below:
$$
u_i = \sum_{j = 1}^{n} f_{i,j}(z_j), \quad i=1,2,3...m \\
z_i = \sum_{j = 1}^{k} g_{i,j}(y_j), \quad i=1,2,3...n \\
y_i = \sum_{j = 1}^{s} q_{i,j}(x_j), \quad i=1,2,3...k
$$
The bottom equation is the inner layer, where the $x_j$ are the inputs or features. The top equation is the outer layer, where the $u_i$ are the outputs or targets.
To train a KAN we need to identify several tables of univariate functions: the $s \times k$ functions $q_{i,j}$ of the inner layer, the
$k \times n$ functions $g_{i,j}$ of the hidden layer and the $n \times m$
functions $f_{i,j}$ of the outer layer, given only a dataset of corresponding input and output vectors $X$ and $U$.
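To get a feel for the sizes involved: with, say, $s = 4$ features, $k = 3$, $n = 2$ and $m = 1$ (hypothetical numbers chosen only for illustration), the network contains $4 \cdot 3 + 3 \cdot 2 + 2 \cdot 1 = 20$ univariate functions to identify.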
KAN training
I should note at this point that the method works with any choice of the sought function type, including B-splines, but I explain everything for the piecewise linear type, because
it is the key to understanding the training method.
The second assumption, made only for convenience of explanation, is that the abscissa spacing of the linear segments is uniform within each function. The spacing need not be the same for all functions,
only within each individual function. As will be seen later, this is not a limitation of the method, only a convenience.
First, we consider a single urysohn which maps a vector $Z$ to a scalar $\hat u$ with an error $e = \Delta u = u - \hat u$, which we wish to reduce at each update step.
This can be done in multiple different ways, but we suggest the projection descent introduced by Stefan Kaczmarz in 1937.
The difference $\Delta u$ is divided evenly among all functions as $\Delta u/n$; then, for each function, the corresponding linear segment is identified and
its two adjacent points are moved either up or down.
In order to do that, we compute the relative parameter $R$:
$$
R = (z - z_{min}) / \Delta z.
$$
Its integer part $\lfloor R \rfloor$ is the index $i$ in the picture, and its fractional part $L = R - i \in [0,1]$ is the relative offset, i.e. the distance from the left end of the corresponding segment.
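For example, if a function is defined on $[0, 2]$ with four equal segments, then $\Delta z = 0.5$; the argument $z = 1.3$ gives $R = (1.3 - 0)/0.5 = 2.6$, hence $i = 2$ and $L = 0.6$.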
Having $i$ and $L$, we update the left and right function points:
$$
f_i += \alpha \times \Delta u/n \times (1.0 - L) \\
f_{i+1} += \alpha \times \Delta u/n \times L,
$$
where $+=$ is the programmatic 'add and assign' operator and $\alpha$ is the learning rate. With a properly chosen learning rate or regularization, the factor $n$
can be dropped:
$$
f_i += \alpha \times \Delta u \times (1.0 - L) \\
f_{i+1} += \alpha \times \Delta u \times L.
$$
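Here is a minimal sketch of this update for a single urysohn, assuming each univariate function is stored simply as an array of ordinate values on a uniform grid (the class name and layout are illustrative, not taken from any particular library):

```python
import numpy as np

class PiecewiseLinear:
    """A univariate piecewise linear function on a uniform grid (illustrative layout)."""
    def __init__(self, z_min, z_max, n_points, rng):
        self.z_min = z_min
        self.dz = (z_max - z_min) / (n_points - 1)
        self.f = rng.uniform(-1.0, 1.0, n_points)      # random initialization

    def locate(self, z):
        r = (z - self.z_min) / self.dz                 # the relative parameter R
        i = int(np.clip(np.floor(r), 0, len(self.f) - 2))
        return i, r - i                                # segment index i and offset L

    def __call__(self, z):
        i, L = self.locate(z)
        return self.f[i] * (1.0 - L) + self.f[i + 1] * L

    def update(self, z, delta, alpha):
        i, L = self.locate(z)
        self.f[i]     += alpha * delta * (1.0 - L)
        self.f[i + 1] += alpha * delta * L

rng = np.random.default_rng(0)
funcs = [PiecewiseLinear(-1.0, 1.0, 5, rng) for _ in range(3)]   # one urysohn, n = 3
z, u = np.array([0.2, -0.4, 0.7]), 1.5                           # one training record
u_hat = sum(f(zj) for f, zj in zip(funcs, z))
for f, zj in zip(funcs, z):                                      # Kaczmarz-style step
    f.update(zj, u - u_hat, alpha=0.1)
```

Repeating this step record by record drives $\hat u$ toward $u$; the division by $n$ from the first formula is absorbed into the learning rate here.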
Now we only need to find an increment for each individual urysohn that reduces the entire network discrepancy for a given record.
Let us do it in a way widely used since the 1950s. We introduce a loss function for the outer layer:
$$
Loss = \frac{1}{2} \sum_{i=1}^m \left [ u_i - \sum_{j=1}^n f_{i,j}(z_j) \right] ^2
$$
and perform a gradient descent step
by taking partial derivatives of the loss with respect to $z_j$ and stepping in the direction that reduces the loss. Simple algebraic transformations give the following result:
$$
\Delta Z = \alpha F^T \Delta U,
$$
where $F$ is the matrix of partial derivatives evaluated at the points $z_1, z_2, \dots$:
$$
\begin{bmatrix}
\frac{\partial f_{1,1}}{\partial z_1} & \frac{\partial f_{1,2}}{\partial z_2} & \dots & \frac{\partial f_{1,n}}{\partial z_n} \\
\frac{\partial f_{2,1}}{\partial z_1} & \frac{\partial f_{2,2}}{\partial z_2} & \dots & \frac{\partial f_{2,n}}{\partial z_n} \\
&& \dots &\\
\frac{\partial f_{m,1}}{\partial z_1} & \frac{\partial f_{m,2}}{\partial z_2} & \dots & \frac{\partial f_{m,n}}{\partial z_n} \\
\end{bmatrix}.
$$
Since all functions and arguments are known at the training step, the matrix is simply a table of numbers. At the start of the training process, the
functions are initialized randomly.
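For piecewise linear functions each partial derivative $\partial f_{i,j}/\partial z_j$ is just the slope of the segment in which $z_j$ currently falls, so assembling $F$ amounts to a table lookup. A sketch, assuming the same uniform-grid storage as above and a common argument range (both are simplifying assumptions):

```python
import numpy as np

def derivative_matrix(layer, z, z_min=-1.0, z_max=1.0):
    """Assemble F[i, j] = slope of the active segment of f_ij at z_j.

    layer is a 3D array of shape (m, n, points): ordinate values of the
    m x n piecewise linear functions on a shared uniform grid (an assumption).
    """
    _, n, points = layer.shape
    dz = (z_max - z_min) / (points - 1)
    idx = np.clip(((z - z_min) / dz).astype(int), 0, points - 2)  # active segment per input
    slopes = (layer[:, :, 1:] - layer[:, :, :-1]) / dz            # slopes of all segments
    return slopes[:, np.arange(n), idx]                           # pick one slope per column

layer = np.random.default_rng(1).uniform(-1, 1, (2, 3, 5))  # m = 2, n = 3, 4 segments each
F = derivative_matrix(layer, np.array([0.2, -0.4, 0.7]))
print(F.shape)                                              # (2, 3)
```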
The positive or negative sign in front of $\alpha F^T \Delta U$ defines the gradient direction and must be chosen from the meaning, which should be obvious and depends on
whether you define $\Delta U$ as $\hat u - u$ or as $u - \hat u$.
Since the vector $\Delta U$ is the known model error and the matrix $F$ is defined by the current function values (the state of the model), we can obtain the wanted increments for each layer and
update it. In the case of more than three layers, the increments are propagated further down in the same way, by multiplication
with the corresponding matrices of partial derivatives:
$$
\Delta Y = G^T \Delta Z = \alpha \, G^T F^T \Delta U.
$$
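With the derivative matrices at hand, this propagation is just a couple of matrix products; a short sketch with NumPy (the shapes are arbitrary and only need to be consistent with the formulas above):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.05
F = rng.uniform(-1, 1, (4, 3))               # outer layer derivatives, m = 4, n = 3
G = rng.uniform(-1, 1, (3, 2))               # hidden layer derivatives, n = 3, k = 2
delta_U = np.array([0.1, -0.2, 0.05, 0.3])   # model error for one record

delta_Z = alpha * F.T @ delta_U              # increments for the hidden layer outputs
delta_Y = G.T @ delta_Z                      # propagated one layer further down
```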
So to train a KAN we need a few sequential steps: we compute the model outputs, compare them with the targets to obtain the vector of model errors, we compute the wanted increments for each urysohn in
the network and, having them, update each urysohn. By applying this step record by record over multiple passes through the data we train the network.
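Putting the pieces together, here is a compact sketch of the whole procedure for a minimal two-layer case. The uniform grid, the fixed argument range $[-1, 1]$ with clipping, the layer sizes and the toy target are my own simplifying assumptions; a practical implementation would adapt the argument ranges as it learns.

```python
import numpy as np

rng = np.random.default_rng(0)
LO, HI, PTS = -1.0, 1.0, 8                 # assumed common argument range and grid size
DZ = (HI - LO) / (PTS - 1)

def init(out_dim, in_dim):                 # one layer = 3D table of function values
    return rng.uniform(-0.1, 0.1, (out_dim, in_dim, PTS))

def locate(z):                             # segment index i and offset L for each input
    r = (np.clip(z, LO, HI) - LO) / DZ
    i = np.minimum(r.astype(int), PTS - 2)
    return i, r - i

def forward(layer, z):                     # u_i = sum_j f_ij(z_j)
    i, L = locate(z)
    cols = np.arange(layer.shape[1])
    vals = layer[:, cols, i] * (1 - L) + layer[:, cols, i + 1] * L
    return vals.sum(axis=1)

def slopes(layer, z):                      # F[i, j] = derivative of f_ij at z_j
    i, _ = locate(z)
    cols = np.arange(layer.shape[1])
    return (layer[:, cols, i + 1] - layer[:, cols, i]) / DZ

def update(layer, z, delta, alpha):        # Kaczmarz-style update of the two segment ends
    i, L = locate(z)
    cols = np.arange(layer.shape[1])
    layer[:, cols, i]     += alpha * np.outer(delta, 1 - L)
    layer[:, cols, i + 1] += alpha * np.outer(delta, L)

inner, outer = init(3, 2), init(1, 3)      # a tiny two-layer KAN: 2 features -> 3 -> 1
alpha = 0.05
for _ in range(50):                        # several passes through freshly sampled data
    for x in rng.uniform(-1, 1, (200, 2)): # record-by-record training
        u = x[0] * x[1]                    # toy target: product of the two features
        y = forward(inner, x)
        du = u - forward(outer, y)         # model error for this record
        dy = slopes(outer, y).T @ du       # wanted increments for inner layer outputs
        update(outer, y, du, alpha)
        update(inner, x, dy, alpha)
```

In this sketch the learning rate is applied once per layer inside `update`; where exactly the $\alpha$ and $1/n$ factors are placed is a matter of convention, as noted above.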
Some clarification notes
Each layer is defined by a 3D matrix, i.e. a tensor. The matrix of derivatives is 2D because it is a subset of this tensor, selected by the particular input vector.
It may become clearer if we consider the elementary linearization step for a single urysohn and a scalar element of the output vector:
$$
u = \sum_{j = 1}^{n} f_j(z_j),
$$
$$
\Delta u = \sum_{j = 1}^{n} \frac {\partial f_j(z_j)}{\partial z_j} \Delta z_j.
$$
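For example, with $n = 2$, segment slopes $0.5$ and $-1.2$ at the current arguments, and increments $\Delta z_1 = 0.1$, $\Delta z_2 = 0.05$ (hypothetical numbers), the linearized change is $\Delta u = 0.5 \cdot 0.1 - 1.2 \cdot 0.05 = -0.01$.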
The same vector of input increments is used in every urysohn of the layer when assembling the derivative matrix.
If we convert the discrete form of a layer back to a continuous one, we obtain the Urysohn integral equation in almost its classic form:
$$
u(t) = \int_{s_1}^{s_2} F[z(s),s,t]ds, \quad t \in [t_1, t_2]
$$
with a 3D kernel. It was introduced as a mathematical problem different from the one considered here. The word 'equation' usually means an equality
which contains an unknown object to be found. That
unknown object was $z(s)$ for given $u(t)$ and $F[z(s),s,t]$. Training the network is a different problem: it is building the kernel given a
set of inputs and outputs. More than that, we want to identify multiple kernels which define a chain of equations with several unobserved
layers. In one of the references below it is mathematically proven that there is an infinity of solutions for every possible dataset
providing the same accuracy level. But that part is for researchers. Here I try to explain the concept for engineers.
The immediate question 'how does it all work?' has a quick and simple answer. I show here one challenging scalability test. The features are random $5 \times 5$ matrices,
i.e. $25$ features. The target is the determinant. Due to the significant non-linearity and sensitivity to small changes we need a significant number of records. We have
10,000,000 training records, 2,000,000 validation records (not used in training), and the accuracy metric is the Pearson correlation coefficient. The baseline
benchmark was MATLAB neural networks, because MATLAB is designed for quick processing and is a very good candidate for comparison. The test results are
shown in the chart below.
The red label shows the time and accuracy for the Kolmogorov-Arnold model with piecewise linear functions. The number of urysohns in the inner layer is 200, the number of
urysohns in the outer layer is 1, the number of linear segments in the inner layer is 3 (4 points), and the number of linear segments in the outer layer is 17 (18 points). The other labels in the chart show
results for different MATLAB-designed neural networks.
The code to reproduce this experiment is available.
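For readers who want to reproduce a similar setup from scratch, generating the dataset itself takes only a few lines; a sketch (the range of the random matrix entries and the record counts below are assumptions for illustration, not necessarily those used in the benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_records(count):
    """Random 5x5 matrices flattened into 25 features, determinant as the target."""
    M = rng.uniform(0.0, 1.0, (count, 5, 5))
    return M.reshape(count, 25), np.linalg.det(M)

X_train, u_train = make_records(1_000_000)   # the full test used 10,000,000 records
X_valid, u_valid = make_records(200_000)     # and 2,000,000 validation records

# Accuracy metric from the text: Pearson correlation between model predictions
# and u_valid, e.g. np.corrcoef(predictions, u_valid)[0, 1].
```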