When a Kolmogorov-Arnold network $$ M(x_1, x_2, x_3, ... , x_n) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_{p})\right), $$ where $\Phi_q, \phi_{q,p}$ are univariate functions, is trained on physical objects, a piecewise-linear model is usually accurate enough. Data recorded from physical systems are typically approximate, due to recording errors, unobserved factors (epistemic uncertainty), and drift of the system during recording when the object is not absolutely stationary. However, there are cases when the dataset is exact and the features and targets follow a definite but unknown analytical formula. For example, a KAN may be used for the numerical solution of partial differential equations. In such cases the accuracy can be increased by approximating the modeled functions with splines or other smooth functions.
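The superposition formula above can be illustrated with a toy sketch. The helper name `kan_eval` and the choice of `tanh`/`sin` as stand-in univariate functions are illustrative assumptions, not the article's actual model:

```python
import numpy as np

def kan_eval(x, inner, outer):
    """Evaluate M(x) = sum_q Phi_q( sum_p phi_{q,p}(x_p) )
    for lists of univariate callables (toy stand-ins)."""
    return sum(outer[q](sum(inner[q][p](x[p]) for p in range(len(x))))
               for q in range(len(outer)))

# Toy model with n = 2 features, hence 2n + 1 = 5 outer functions.
n = 2
inner = [[np.tanh for _ in range(n)] for _ in range(2 * n + 1)]  # phi_{q,p}
outer = [np.sin for _ in range(2 * n + 1)]                       # Phi_q
print(kan_eval([0.3, -0.7], inner, outer))
```

In the trained network each of these univariate callables would be a learned piecewise function rather than a fixed `tanh` or `sin`.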

Splines can be applied in several different ways; it is a wide topic, so this article documents the particular approach used in our code.

Each function $f(x)$ in the model is expressed as a linear combination of basis functions $B_i(x)$, and every basis function is built from cubic splines. The tricky part of the program is reducing the number of splines. For example, for 5 features the model needs 66 functions. With a 10-point grid that means 660 basis functions, 5940 cubic splines, and 23760 parameters to express them. They must all be organized in data structures, and navigating through them repeatedly in long loops takes significant time. The reduction is achieved by using dimensionless basis functions whose shapes depend only on the number of points.
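The counts above follow directly from the Kolmogorov-Arnold form: $(2n+1)n$ inner functions $\phi_{q,p}$ plus $(2n+1)$ outer functions $\Phi_q$, with $m$ basis functions per function, $m-1$ cubic splines per basis function, and 4 coefficients per spline. A quick sketch of the arithmetic (the function name is illustrative):

```python
def kan_counts(n, m):
    """Count model parts for a KAN with n features and an m-point grid."""
    functions = (2 * n + 1) * n + (2 * n + 1)   # inner phi's + outer Phi's
    basis_functions = functions * m             # m basis functions each
    splines = basis_functions * (m - 1)         # one spline per segment
    parameters = splines * 4                    # 4 coefficients per cubic
    return functions, basis_functions, splines, parameters

print(kan_counts(5, 10))  # (66, 660, 5940, 23760)
```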

Some critical choices

  • Each basis function is created for a unit vector $e^T = [0\: 0\: 1\: ... \: 0 \:0]$, which has only one non-null element.
  • All unit vectors are pairwise orthogonal and their number equals the number of points in the modeled function.
  • The intervals between points are equal.
The actual code for building the splines for the unit functions sits in the class AlgebraHelper; it follows an example provided in an article by James Keesling and has a unit test. Two samples of such basis functions are shown in the image below.
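The idea behind these unit basis functions can be sketched in a few lines: fit a natural cubic spline through a unit vector on an equally spaced grid, so that $B_j$ equals 1 at its own grid point and 0 at all others. This is a minimal stand-alone sketch, not the AlgebraHelper code; the function names are illustrative:

```python
import numpy as np

def natural_cubic_coeffs(y):
    """Second derivatives M of a natural cubic spline through the
    points (0, y[0]), (1, y[1]), ..., with equal spacing h = 1.
    Natural boundary: M[0] = M[-1] = 0."""
    n = len(y)
    M = np.zeros(n)
    if n > 2:
        A = (np.diag(np.full(n - 2, 4.0))
             + np.diag(np.ones(n - 3), 1) + np.diag(np.ones(n - 3), -1))
        rhs = 6.0 * (y[:-2] - 2.0 * y[1:-1] + y[2:])
        M[1:-1] = np.linalg.solve(A, rhs)
    return M

def spline_eval(y, M, x):
    """Evaluate the spline at relative argument x in [0, n-1]."""
    i = min(int(x), len(y) - 2)   # segment index
    t = x - i                     # local coordinate in [0, 1]
    return ((M[i] * (1 - t)**3 + M[i + 1] * t**3) / 6.0
            + (y[i] - M[i] / 6.0) * (1 - t)
            + (y[i + 1] - M[i + 1] / 6.0) * t)

# Basis function B_2 for a 5-point grid: unit vector e = [0, 0, 1, 0, 0].
e = np.zeros(5); e[2] = 1.0
M = natural_cubic_coeffs(e)
print(spline_eval(e, M, 2.0))   # 1 at its own grid point
print(spline_eval(e, M, 1.0))   # 0 at every other grid point
```

Because the grid is dimensionless (unit spacing, unit heights), the same set of `M` arrays serves every modeled function with the same number of points.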

Since the entire set of basis functions depends only on the number of points, the object holding them can be shared between the modeled functions. This shared object is instantiated from the class Basis. The parameters $c^T = [c_1 \: c_2 ...]$ of the linear combination of basis functions belong to each individual function (an object of class Univariate).

To make everything clear: one particular function $f_k$ is expressed as a linear combination of basis functions as $$f_k(x) = \sum_{j=0}^n c_{j}^{(k)} B_j(x).$$ Another function $f_m$ may be expressed as a different linear combination of the same basis functions, provided both have the same number of points in the unit vectors used for instantiation: $$f_m(x) = \sum_{j=0}^n c_{j}^{(m)} B_j(x).$$ The domains of definition of $f_k$ and $f_m$ may differ, but we use a relative argument $x$ that points to the segment associated with a particular cubic spline. All basis functions are generated once at the start of the training process, and only the coefficients $c$ are updated at each training step.
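The sharing scheme can be sketched as follows. The class names echo Basis and Univariate from the article, but the bodies are illustrative assumptions; in particular, piecewise-linear "hat" functions stand in for the cubic basis to keep the sketch short:

```python
import numpy as np

class Basis:
    """Shared set of basis values B_j(x) on an n-point unit grid.
    Sketch only: uses hat functions in place of the cubic splines."""
    def __init__(self, n_points):
        self.n = n_points
    def values(self, x):
        j = np.arange(self.n)
        return np.maximum(0.0, 1.0 - np.abs(x - j))

class Univariate:
    """One modeled function: its own coefficients, a shared basis."""
    def __init__(self, basis, coeffs):
        self.basis = basis
        self.c = np.asarray(coeffs, float)
    def __call__(self, x):
        return self.c @ self.basis.values(x)   # f(x) = sum_j c_j B_j(x)

shared = Basis(4)                       # one object for all functions
f_k = Univariate(shared, [0.0, 1.0, 0.0, 0.0])
f_m = Univariate(shared, [2.0, 2.0, 3.0, 5.0])
print(f_k(1.5))   # -> 0.5
print(f_m(2.5))   # -> 4.0
```

Only the small `c` arrays are per-function state; the heavy basis object exists once, which is exactly the reduction described above.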

Projection descent for a single function

Let us drop the indexes for convenience. The function value is computed via the basis functions $$\hat f(x) = \sum_{j=0}^n c_{j} B_j(x).$$ The computed value $\hat f$ differs from the wanted value $f$. According to the projection descent algorithm, all coefficients $c_j$ are updated as follows: $$c_j = c_j + \alpha [f(x) - \hat f(x)] B_j(x), $$ where $\alpha$ is the learning-rate (relaxation) parameter. Applying this to each coefficient of every function, for all records, over multiple epochs trains the model to output the targets. The wanted function values are computed similarly to the back-propagation algorithm explained in another article.
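The single-function update loop can be sketched as below. This is a minimal illustration, again using hat functions as a stand-in for the cubic basis; the function names, the target $f(x)=x^2$, and the values of $\alpha$ and the epoch count are all assumptions for the demo:

```python
import numpy as np

def basis_values(x, n):
    """Piecewise-linear ("hat") stand-in for the cubic basis B_j."""
    j = np.arange(n)
    return np.maximum(0.0, 1.0 - np.abs(x - j))

def projection_descent(xs, targets, n, alpha=0.5, epochs=200):
    """Fit coefficients c so that sum_j c_j B_j(x) tracks the targets."""
    c = np.zeros(n)
    for _ in range(epochs):
        for x, f in zip(xs, targets):
            B = basis_values(x, n)
            c += alpha * (f - c @ B) * B   # c_j += alpha [f - f_hat] B_j(x)
    return c

xs = np.linspace(0.0, 3.0, 25)
targets = xs**2                                  # wanted values f(x) = x^2
c = projection_descent(xs, targets, n=4)
fit = np.array([c @ basis_values(x, 4) for x in xs])
print(np.max(np.abs(fit - targets)))  # small residual from the linear stand-in
```

The residual that remains comes from approximating $x^2$ with a 4-point piecewise-linear basis; with the cubic basis of the article it would be far smaller.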