


GPU friendly code

DEMO of theoretical concept Jacobian
DEMO 2 Jacobian2
DEMO 3 Newton-Kaczmarz vs. Gauss-Newton





Unfortunately, GPU code cannot be very sophisticated and handle deep structures of nested objects, but GPUs are effective at processing concurrent loops over mutually exclusive data segments. For that reason the choice of GPU friendly algorithms usually revolves around the family of Gauss-Newton methods. There are different flavors of them, but the principal idea is to concurrently assemble a large system of linear algebraic equations $$ J \Delta \beta = \Delta R $$ and solve it for large data segments. Here $\Delta R$ is the vector of residuals, computed for particular model parameters denoted by $\beta$ and a large enough subset of the training data, and $J$ is the Jacobian matrix. The method is very well explained in the literature, so there is no need to repeat it here.
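One standard way to carry out a single step, stated here only for reference (the exact damping or regularization used in the demos may differ), is through the normal equations $$ J^T J \, \Delta \beta = J^T \Delta R, \:\:\:\: \beta \leftarrow \beta + \Delta \beta, $$ where the matrix $J^TJ$ is small and square even when the number of records is huge.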

Room for counter-intuitive tricks

Common sense tells us that anything straightforward and primitive is not very effective, and in this particular case it turned out to be true. Straightforward assembly of the matrix and solution of the system turned out to be about 10 times slower than the Newton-Kaczmarz method. At least 15 concurrent threads assembling the matrix are needed just to approach Newton-Kaczmarz performance.

However, some time later I found a few coding tricks that made this method faster than Kaczmarz even with sequential processing. I demonstrate it on a single Urysohn object. Below is the printout:
Initial relative RMSE for validation set 0.219072
Relative RMSE for validation set after Gauss-Newton step 0.000371, training time 0.004000
Relative RMSE for validation set after 16 Newton-Kaczmarz steps 0.000548, training time 0.006000
Gauss-Newton is quicker and more accurate even without threading, so concurrent processing can improve it significantly.

What can be done

      The first thing is a very simple and quick computation of partial derivatives when all functions are piecewise linear:

      $$\frac {\partial F(x)}{\partial \beta_k} = 0.75, \:\:\:\: \frac {\partial F(x)}{\partial \beta_{k+1}} = 0.25.$$ These particular numbers correspond to a point $x$ lying a quarter of the way from node $k$ to node $k+1$: on that segment $F(x) = (1-t)\beta_k + t\beta_{k+1}$ with $t = 0.25$. Those who wonder how I got these numbers I redirect to Euclid's theorems on the similarity of triangles (circa 300 BC).

      The second thing is using operations with sparse matrices, since normally $J$ has about 10% non-zero elements. The example above covers only one linear segment, but there are more, and $F$ also depends on $\beta_{k-1}, \beta_{k-2}, \ldots$ and $\beta_{k+2}, \ldots$; their partial derivatives for this particular $x$ are zero, however, so each row of $J$ has only a few non-zero elements.

      And the third thing is to build the matrix $J^TJ$ and the vector $J^T\Delta R$ directly, bypassing the assembly of $J$. The size of $J^TJ$ equals the total number of nodes of all model functions, so it is somewhere in [100, 10000] regardless of the dataset size, which can reach several million records.
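
      Below is a minimal sequential C++ sketch of these three tricks. It assumes one piecewise linear function per input with equally spaced nodes; names such as NormalEquations, addRecord and linearWeights are illustrative only and do not come from the demo code.

#include <cstddef>
#include <vector>

// Accumulate J^T J and J^T dR directly, record by record, without storing J.
struct NormalEquations {
    int n;                           // total number of nodes in the model
    std::vector<double> JtJ;         // n x n matrix, row-major
    std::vector<double> JtR;         // right-hand side J^T * dR
    explicit NormalEquations(int nodes) : n(nodes), JtJ(nodes * nodes, 0.0), JtR(nodes, 0.0) {}

    // One record: 'cols' holds the global node indices with non-zero derivatives
    // (two per function), 'w' holds the corresponding weights, e.g. 0.75 and 0.25,
    // 'residual' is the target minus the current model output for this record.
    void addRecord(const std::vector<int>& cols, const std::vector<double>& w, double residual) {
        for (std::size_t a = 0; a < cols.size(); ++a) {
            JtR[cols[a]] += w[a] * residual;
            for (std::size_t b = 0; b < cols.size(); ++b) {
                JtJ[cols[a] * n + cols[b]] += w[a] * w[b];
            }
        }
    }
};

// For one input x of a piecewise linear function with 'nodes' equally spaced
// points on [xmin, xmax], return the left node index and the two weights,
// which are also the partial derivatives with respect to beta_k and beta_{k+1}.
inline void linearWeights(double x, double xmin, double xmax, int nodes,
                          int& k, double& wLeft, double& wRight) {
    double step = (xmax - xmin) / (nodes - 1);
    double p = (x - xmin) / step;
    k = static_cast<int>(p);
    if (k < 0) k = 0;
    if (k > nodes - 2) k = nodes - 2;
    double t = p - k;                // relative position inside the segment
    wLeft = 1.0 - t;                 // dF / d beta_k
    wRight = t;                      // dF / d beta_{k+1}
}

      After all records are processed, the small system $J^TJ \, \Delta \beta = J^T\Delta R$ is solved with any dense solver and the nodes are updated; the per-record accumulation is the part that maps naturally onto concurrent threads or a GPU.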

      It is already demonstrated on the site how easy it is to change the number of linear segments in the functions without retraining and without losing already acquired accuracy. Given that, training may start from 2 linear segments, for which the $J^TJ$ matrix is tiny, and move on to more complex models gradually. This is the biggest advantage of KAN over NN. In a NN, adding neurons requires retraining (at least of the modified part); in a KAN, the number of linear segments can be incremented without loss of already acquired accuracy, and even the basis functions can be switched from piecewise linear to splines or others.
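
      As a small illustration of that property (an assumed sketch, not the site's actual code): for piecewise linear functions the refined node values are obtained simply by evaluating the current function at the new node positions.

#include <vector>

// Refine a piecewise linear function, given by equally spaced node values on
// [xmin, xmax], from beta.size() nodes to newCount nodes by sampling the
// current function at the new node positions.
std::vector<double> refineNodes(const std::vector<double>& beta,
                                double xmin, double xmax, int newCount) {
    int oldCount = static_cast<int>(beta.size());
    double oldStep = (xmax - xmin) / (oldCount - 1);
    double newStep = (xmax - xmin) / (newCount - 1);
    std::vector<double> refined(newCount);
    for (int j = 0; j < newCount; ++j) {
        double x = xmin + j * newStep;
        double p = (x - xmin) / oldStep;
        int k = static_cast<int>(p);
        if (k > oldCount - 2) k = oldCount - 2;
        double t = p - k;
        refined[j] = (1.0 - t) * beta[k] + t * beta[k + 1];  // linear interpolation
    }
    return refined;
}

      When the new grid contains the old node positions (for example, going from $n$ to $2n-1$ equally spaced nodes), the represented function does not change at all, so no accuracy is lost.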

Multi layered KAN

A multi layered KAN is a tree of nested functions, so the chain rule for derivatives is applied; everything else is the same. A sample of an elementary two layer KAN is behind the second link at the top.
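As a reference formula (the notation here is mine and may differ from the demo code), a two layer model can be written as $$ M(x_1, ..., x_n) = \sum_j \Phi_j \Big( \sum_i f_{ji}(x_i) \Big), $$ and the derivative with respect to a node $\beta$ of an inner function $f_{ji}$ is $$ \frac{\partial M}{\partial \beta} = \Phi_j' \Big( \sum_i f_{ji}(x_i) \Big) \, \frac{\partial f_{ji}(x_i)}{\partial \beta}, $$ where the last factor is the same kind of 0.75 or 0.25 weight as above and $\Phi_j'$ is the slope of the current linear segment of the outer function.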

Demo for a realistic case

It is DEMO 3 at the top. The dataset is 10 000 records; the features are random $3 \times 3$ matrices and the targets are their determinants. The operations are: assembling the large matrix, which can be done concurrently for each record; solving the system of linear equations, which can be done concurrently as well; and updating the model parameters.
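
For completeness, here is a minimal sketch of generating such a dataset (the value range and random generator are my assumptions, not necessarily the demo's exact settings):

#include <array>
#include <random>
#include <vector>

int main() {
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<std::array<double, 9>> features(10000);   // row-major 3 x 3 matrices
    std::vector<double> targets(10000);                   // their determinants
    for (int r = 0; r < 10000; ++r) {
        auto& m = features[r];
        for (double& v : m) v = dist(gen);
        targets[r] = m[0] * (m[4] * m[8] - m[5] * m[7])
                   - m[1] * (m[3] * m[8] - m[5] * m[6])
                   + m[2] * (m[3] * m[7] - m[4] * m[6]);
    }
    return 0;
}

The per-record work (computing the residual and the few non-zero derivatives) is independent across records, which is exactly the part that can be spread over GPU threads; only the reduction into $J^TJ$ and $J^T\Delta R$ and the small linear solve remain shared.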

I did not use threading or GPU code, in order to keep everything clear for end users. It is all implemented on a single processor, and for that reason it is about 8 times slower than Kaczmarz. Obviously, CPU threading alone will not beat Kaczmarz significantly, and GPU code is needed. This DEMO actually shows GPU friendly code.



April 6, 2025