


GPU friendly code

DEMO of theoretical concept Jacobian
DEMO 2 Jacobian2
DEMO 3 Newton-Kaczmarz vs. Gauss-Newton





Unfortunately, GPU code cannot be very sophisticated and handle deep structures of nested objects, but GPUs are effective at processing concurrent loops over mutually exclusive data segments. For that reason the choice of GPU friendly algorithms usually revolves around the family of Gauss-Newton methods. There are different flavors of them, but the principal idea is to concurrently assemble a large system of linear algebraic equations $$ J \Delta \beta = \Delta R $$ and solve it for large data segments. Here $\Delta R$ is the vector of residuals, computed for particular model parameters denoted by $\beta$ and a large enough subset of the training data, and $J$ is the Jacobian matrix. The method is very well explained in the literature, so there is no need to repeat it here.
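One standard way to carry out a single step, stated here only for reference (the exact damping or regularization used in the demos may differ), is through the normal equations $$ J^T J \, \Delta \beta = J^T \Delta R, \:\:\:\: \beta \leftarrow \beta + \Delta \beta, $$ where the matrix $J^TJ$ is small and square even when the number of records is huge.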

Room for counter-intuitive tricks

Common sense tells us that anything straightforward and primitive is not very effective, and in this particular case it turned out to be true. Straightforward assembly of the matrix and solution of the system turned out to be about 10 times slower than the Newton-Kaczmarz method. At least 15 concurrent threads assembling the matrix are needed just to approach Newton-Kaczmarz performance.

However, some time later I found a few coding tricks that made this method faster than Kaczmarz even with sequential processing. I demonstrate it on a single Urysohn object. Below is the printout:
Initial relative RMSE for validation set 0.219072
Relative RMSE for validation set after Gauss-Newton step 0.000371, training time 0.004000
Relative RMSE for validation set after 16 Newton-Kaczmarz steps 0.000548, training time 0.006000
Gauss-Newton is quicker and more accurate even without threading, so concurrent processing can improve it significantly.

What can be done

      The first thing is a very simple and quick computation of partial derivatives when all functions are piecewise linear:

      $$\frac {\partial F(x)}{\partial \beta_k} = 0.75, \:\:\:\: \frac {\partial F(x)}{\partial \beta_{k+1}} = 0.25.$$ These particular numbers correspond to a point $x$ lying a quarter of the way from node $k$ to node $k+1$: on that segment $F(x) = (1-t)\beta_k + t\beta_{k+1}$ with $t = 0.25$. Those who wonder how I got these numbers I redirect to Euclid's theorems on the similarity of triangles (circa 300 BC).

      The second thing is using operations with sparse matrices, since normally $J$ has about 10% non-zero elements. The example above covers only one linear segment, but there are more, and $F$ also depends on $\beta_{k-1}, \beta_{k-2}, \ldots$ and $\beta_{k+2}, \ldots$; their partial derivatives for this particular $x$ are zero, however, so each row of $J$ has only a few non-zero elements.

      And the third thing is to build the matrix $J^TJ$ and the vector $J^T\Delta R$ directly, bypassing the assembly of $J$. The size of $J^TJ$ equals the total number of nodes of all model functions, so it is somewhere in [100, 10000] regardless of the dataset size, which can reach several million records.
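
      Below is a minimal sequential C++ sketch of these three tricks. It assumes one piecewise linear function per input with equally spaced nodes; names such as NormalEquations, addRecord and linearWeights are illustrative only and do not come from the demo code.

#include <cstddef>
#include <vector>

// Accumulate J^T J and J^T dR directly, record by record, without storing J.
struct NormalEquations {
    int n;                           // total number of nodes in the model
    std::vector<double> JtJ;         // n x n matrix, row-major
    std::vector<double> JtR;         // right-hand side J^T * dR
    explicit NormalEquations(int nodes) : n(nodes), JtJ(nodes * nodes, 0.0), JtR(nodes, 0.0) {}

    // One record: 'cols' holds the global node indices with non-zero derivatives
    // (two per function), 'w' holds the corresponding weights, e.g. 0.75 and 0.25,
    // 'residual' is the target minus the current model output for this record.
    void addRecord(const std::vector<int>& cols, const std::vector<double>& w, double residual) {
        for (std::size_t a = 0; a < cols.size(); ++a) {
            JtR[cols[a]] += w[a] * residual;
            for (std::size_t b = 0; b < cols.size(); ++b) {
                JtJ[cols[a] * n + cols[b]] += w[a] * w[b];
            }
        }
    }
};

// For one input x of a piecewise linear function with 'nodes' equally spaced
// points on [xmin, xmax], return the left node index and the two weights,
// which are also the partial derivatives with respect to beta_k and beta_{k+1}.
inline void linearWeights(double x, double xmin, double xmax, int nodes,
                          int& k, double& wLeft, double& wRight) {
    double step = (xmax - xmin) / (nodes - 1);
    double p = (x - xmin) / step;
    k = static_cast<int>(p);
    if (k < 0) k = 0;
    if (k > nodes - 2) k = nodes - 2;
    double t = p - k;                // relative position inside the segment
    wLeft = 1.0 - t;                 // dF / d beta_k
    wRight = t;                      // dF / d beta_{k+1}
}

      After all records are processed, the small system $J^TJ \, \Delta \beta = J^T\Delta R$ is solved with any dense solver and the nodes are updated; the per-record accumulation is the part that maps naturally onto concurrent threads or a GPU.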

      It is already demonstrated on the site how easy it is to change the number of linear segments in the functions without retraining and without losing already acquired accuracy. Given that, training may start from 2 linear segments, for which the $J^TJ$ matrix is tiny, and move on to more complex models gradually. This is the biggest advantage of KAN over NN. In a NN, adding neurons requires retraining (at least of the modified part); in a KAN, the number of linear segments can be incremented without loss of already acquired accuracy, and even the basis functions can be switched from piecewise linear to splines or others.
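
      As a small illustration of that property (an assumed sketch, not the site's actual code): for piecewise linear functions the refined node values are obtained simply by evaluating the current function at the new node positions.

#include <vector>

// Refine a piecewise linear function, given by equally spaced node values on
// [xmin, xmax], from beta.size() nodes to newCount nodes by sampling the
// current function at the new node positions.
std::vector<double> refineNodes(const std::vector<double>& beta,
                                double xmin, double xmax, int newCount) {
    int oldCount = static_cast<int>(beta.size());
    double oldStep = (xmax - xmin) / (oldCount - 1);
    double newStep = (xmax - xmin) / (newCount - 1);
    std::vector<double> refined(newCount);
    for (int j = 0; j < newCount; ++j) {
        double x = xmin + j * newStep;
        double p = (x - xmin) / oldStep;
        int k = static_cast<int>(p);
        if (k > oldCount - 2) k = oldCount - 2;
        double t = p - k;
        refined[j] = (1.0 - t) * beta[k] + t * beta[k + 1];  // linear interpolation
    }
    return refined;
}

      When the new grid contains the old node positions (for example, going from $n$ to $2n-1$ equally spaced nodes), the represented function does not change at all, so no accuracy is lost.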

Multi layered KAN

A multi layered KAN is a tree of nested functions, so the chain rule for derivatives is applied; everything else is the same. A sample of an elementary two layer KAN is behind the second link at the top.
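As a reference formula (the notation here is mine and may differ from the demo code), a two layer model can be written as $$ M(x_1, ..., x_n) = \sum_j \Phi_j \Big( \sum_i f_{ji}(x_i) \Big), $$ and the derivative with respect to a node $\beta$ of an inner function $f_{ji}$ is $$ \frac{\partial M}{\partial \beta} = \Phi_j' \Big( \sum_i f_{ji}(x_i) \Big) \, \frac{\partial f_{ji}(x_i)}{\partial \beta}, $$ where the last factor is the same kind of 0.75 or 0.25 weight as above and $\Phi_j'$ is the slope of the current linear segment of the outer function.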

Demo for a realistic case

It is DEMO 3 at the top. The dataset is 10 000 records; the features are random $3 \times 3$ matrices and the targets are their determinants. The operations are: assembling the large matrix, which can be done concurrently for each record; solving the system of linear equations, which can be done concurrently as well; and updating the model parameters.
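
For completeness, here is a minimal sketch of generating such a dataset (the value range and random generator are my assumptions, not necessarily the demo's exact settings):

#include <array>
#include <random>
#include <vector>

int main() {
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<std::array<double, 9>> features(10000);   // row-major 3 x 3 matrices
    std::vector<double> targets(10000);                   // their determinants
    for (int r = 0; r < 10000; ++r) {
        auto& m = features[r];
        for (double& v : m) v = dist(gen);
        targets[r] = m[0] * (m[4] * m[8] - m[5] * m[7])
                   - m[1] * (m[3] * m[8] - m[5] * m[6])
                   + m[2] * (m[3] * m[7] - m[4] * m[6]);
    }
    return 0;
}

The per-record work (computing the residual and the few non-zero derivatives) is independent across records, which is exactly the part that can be spread over GPU threads; only the reduction into $J^TJ$ and $J^T\Delta R$ and the small linear solve remain shared.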

I did not use threading or GPU code, in order to keep everything clear for end users. It is all implemented on a single processor, and for that reason it is about 8 times slower than Kaczmarz. Obviously, CPU threading alone will not beat Kaczmarz significantly, and GPU code is needed. This DEMO actually shows GPU friendly code.



April 6, 2025