Infinite concurrency
Why it is possible
Neural networks allow parallel processing. In theory it may reduce training time up to 100 times compared to the sequential case; in practice it is only 10 to 20 times quicker. For some datasets and large networks training takes days or weeks, and our goal is to achieve concurrency higher than that. Here we explain why it is possible and how the training algorithm can be customized for
parallel processing.
The model can be defined in continuous form
$$
z(t) = \int_{s_1}^{s_2} F_1[y(s),s,t]ds, \quad t \in [t_1, t_2]
$$
$$
y(s) = \int_{p_1}^{p_2} F_2[x(p),p,s]dp, \quad s \in [s_1, s_2]
$$
$$
...
$$
Transforming this system to finite differences via quadratures gives a tree of functions. The simplest KAN may be expressed via three
univariate functions $g, p, q$ for a dataset presented by matrix $X$ and vector $Y$:
$$
y_i = g[p(x_{i1}) + q(x_{i2})] \quad i = 1,2,...
$$
Record-by-record descent amounts to computing the difference $\Delta y$ which reduces the residual error for an individual record,
and modifying the functions $g, p, q$ by moving local segments in the vicinity of the current arguments either up or down. Despite its simplicity, this
method turned out to be very effective. For example, on a dataset of 10 million records where the features were random $5 \times 5$ matrices and the target was the
determinant, sequential training by this method was about 50 times faster than neural networks built with software optimized for fast execution (not Python code).
The model had two layers: the inner layer had $200 \times 25$ piecewise linear functions with $5$ points each, and the outer layer had $200$ piecewise linear
functions with $22$ points each. The entire model was defined by $29,400$ parameters, and the baseline neural network had nearly the same number of parameters.
When training goes record by record, the next difference $\Delta y_{i+1}$ is computed for the model modified on the previous step, so we
cannot compute residuals concurrently and cannot modify functions concurrently. It looks like concurrency is impossible and this algorithm,
although very effective, cannot be further improved. That is not true: there is a way to make training parallel.
Some critical properties of the model and training process:
-
The critical part of the computed residual $\Delta y$ is its sign, not its magnitude. The sign of $\Delta y$ defines the direction in which
segments of $g, p, q$ are moved, and the magnitude defines how far we move them. Since there is regularization (a learning rate), the changes are small and,
when the modification goes in the right direction, the training eventually converges.
-
A partial update also works. If we found the right residual $\Delta y$ but updated only two of the three functions, we still reduced
the error for this specific record.
-
If we compute mean residual errors for a large validation subset, we can see that not every individual update improves
the model. The model improves statistically, not on every step.
-
Given the current model, we can concurrently compute multiple residuals and choose the record with the largest
difference. It is logical to assume that by choosing records with large discrepancies and skipping those with small
discrepancies we can optimize training.
-
When selecting records with large discrepancies, we can use a model that is a few training steps old, because the modifications
are small, the features $X_i$ are different, the computations are applied to different segments of the functions, the residual error
is relatively large, and what is critical in the residual error is the sign, not the magnitude.
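The screening described in the last two points can be sketched with plain threads; the model is just a callable here, and the strided work split and all names (`worstRecord`) are illustrative assumptions:

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <thread>
#include <vector>

// Evaluate residuals for a batch of records concurrently (against a model
// that may be a few steps stale) and return the index of the record with
// the largest residual magnitude -- the one worth training on next.
template <typename Model>
size_t worstRecord(const Model& model,
                   const std::vector<std::array<double, 2>>& X,
                   const std::vector<double>& y, int nThreads) {
    std::vector<double> res(X.size(), 0.0);
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t)
        pool.emplace_back([&, t] {
            // strided split: thread t handles records t, t+n, t+2n, ...
            for (size_t i = t; i < X.size(); i += nThreads)
                res[i] = std::fabs(y[i] - model(X[i][0], X[i][1]));
        });
    for (auto& th : pool) th.join();
    return std::max_element(res.begin(), res.end()) - res.begin();
}
```

On Linux this needs `-pthread`; a natural extension is returning the top few records instead of the single worst one.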
How it can be designed
Given all the properties listed above, we can derive the following parallel concept:
-
Records are processed concurrently by multiple clients that compute residuals and select records with large residual errors.
After a record is selected, the client computes updates for several individual functions in the model. Each
client computes updates for a different set of functions. Clients can be threads or processes running on
different processors; that is optional, they only have to be concurrent and independent.
-
Multiple independent servers perform only updates of the model. Each server receives messages from clients
saying which function must be updated and for which argument. Each server is responsible for a part of the model and receives
update messages only for its area of responsibility.
-
Clients deliver updates to servers via a messaging system.
-
There is a synchronization mechanism for the model between clients and servers. Ideally the clients should always have the latest model produced
by modifications on the servers, but since the model changes gradually, short delays are not critical. The servers provide a copy of the model in a read-only memory
segment or on disk, from which clients read the model periodically. This reading occurs in concurrent processes and will not be
a bottleneck.
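The client-to-server traffic in this scheme can be as small as one message per function touched. The message layout below and the static partitioning by function id are assumptions, one of several reasonable routing choices:

```cpp
#include <cstdint>

// One update message from a client: which univariate function to modify,
// at which argument, and the signed step (sign = direction of the move,
// magnitude = how far, already scaled by the learning rate).
struct UpdateMsg {
    uint32_t functionId;
    double   argument;
    double   delta;
};

// Static partitioning: each server owns a contiguous block of function
// ids, so routing a message to its owner is a single integer division.
inline uint32_t serverFor(uint32_t functionId, uint32_t functionsPerServer) {
    return functionId / functionsPerServer;
}
```

Because each function id maps to exactly one server, two servers never touch the same part of the model and no locking between them is needed.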
When it is needed
If AI is used to optimize furniture sales and tells the manufacturer where to deliver goods, training time may not be that important.
Management can wait a few minutes before loading goods onto trucks and giving drivers directions.
There are other cases where datasets contain hundreds of millions of records, records have several hundred features, and the targets are vectors with related
elements. Such data needs models with half a million parameters and several months of training.
These are the cases where concurrent training is really helpful.
Light weight concept test
A full concept test needs a cloud service with a messaging system and a database. The clients should probably run on Linux boxes, and the servers may
write the model into a database used by the clients for synchronization following the "read committed" principle. The model, once trained, can be serialized,
transferred, used, and even further tuned by 200 lines of code. That sounds like a big project.
A quick test may be done on a single PC. It will not be as effective, but it must show an advantage over the sequential concept. Clients and servers
are threads. The messaging system can be queues, which unfortunately need mutexes for their operations. The model can be stored in memory in several copies
and made available to clients by an atomic swap of the corresponding global pointer. That sounds like a regular programming project for a couple of
weeks.
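The atomic-swap part of that single-PC test could look like this. The double-buffered snapshot below is a deliberately naive sketch (names and layout assumed); a real implementation would need more copies, RCU, or hazard pointers so a client never reads a buffer while it is being overwritten:

```cpp
#include <atomic>
#include <vector>

struct Model { std::vector<double> params; };

// Double-buffered read-only snapshot: servers write into the spare copy
// and publish it with one atomic pointer store; clients only ever read
// through the pointer and never take a mutex.
// NOTE: unsafe if a slow reader still holds the old pointer when the
// spare is rewritten -- a sketch, not production code.
class Snapshot {
    Model buf[2];
    std::atomic<const Model*> current{&buf[0]};
public:
    const Model* read() const { return current.load(std::memory_order_acquire); }

    void publish(const Model& updated) {
        Model* spare = (current.load(std::memory_order_relaxed) == &buf[0])
                           ? &buf[1] : &buf[0];
        *spare = updated;                                // copy off the hot path
        current.store(spare, std::memory_order_release); // one-word publish
    }
};
```

The release/acquire pair guarantees that a client which sees the new pointer also sees the fully written model copy behind it.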