Infinite concurrency
Why it is possible
Neural networks allow parallel processing. In theory it may reduce training time up to 100 times compared to the sequential case; in practice it is only 10 to 20 times quicker. For some datasets and large networks training takes days or weeks, and our goal is to achieve concurrency higher than that. Here we explain why it is possible and how the training algorithm can be customized for
parallel processing.
The model can be defined in continuous form
$$
z(t) = \int_{s_1}^{s_2} F_1[y(s),s,t]ds, \quad t \in [t_1, t_2]
$$
$$
y(s) = \int_{p_1}^{p_2} F_2[x(p),p,s]dp, \quad s \in [s_1, s_2]
$$
$$
...
$$
Transforming this system to finite differences via quadratures gives a tree of functions. The simplest KAN may be expressed via three
univariate functions $g, p, q$ for a dataset presented by matrix $X$ and vector $Y$:
$$
y_i = g[p(x_{i1}) + q(x_{i2})] \quad i = 1,2,...
$$
Record-by-record descent amounts to computing the difference $\Delta y$ which reduces the residual error for an individual record,
and modifying the functions $g, p, q$ by moving local segments in the vicinity of the current arguments either up or down. Despite its simplicity, this
method turned out to be very effective. For example, on a dataset of 10 million records where the features were random $5 \times 5$ matrices and the target was the
determinant, sequential training by this method was about 50 times faster than neural networks built with software optimized for fast execution (not Python code).
The model had two layers: the inner layer had $200 \times 25$ piecewise linear functions with $5$ points each, and the outer layer had $200$ piecewise linear
functions with $22$ points each. The entire model was defined by $29,400$ parameters, and the baseline neural network had nearly the same number of parameters.
When training goes record by record, the next difference $\Delta y_{i+1}$ is computed for the model modified on the previous step, so we
cannot compute residuals concurrently and cannot modify functions concurrently. It looks like concurrency is impossible and this algorithm,
although very effective, cannot be further improved. That is not true: there is a way to make training parallel.
Some critical properties of the model and training process:
-
The critical part of the computed residual $\Delta y$ is its sign, not its magnitude. The sign of $\Delta y$ defines the direction in which
segments of $g, p, q$ are moved, and the magnitude defines how far we move them. Since there is regularization (a learning rate), the changes are small and,
when the modification goes in the right direction, the training eventually converges.
-
A partial update also works. If we found the right residual $\Delta y$ but updated only two of the three functions, we still reduced
the error for this specific record.
-
If we compute mean residual errors for a large validation subset, we can see that not every individual update improves
the model. The model improves statistically, not on every step.
-
Given the current model, we can concurrently compute multiple residuals and choose the record with the largest
difference. It is logical to assume that by choosing records with large discrepancies and skipping those with small
discrepancies we can optimize training.
-
When selecting records with large discrepancies, we can use a model that is a few training steps old, because the modifications
are small, the features $X_i$ are different, the computations are applied to different segments of the functions, the residual error
is relatively large, and what is critical in the residual error is the sign, not the magnitude.
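The screening described in the last two points can be sketched with plain threads; the model is just a callable here, and the strided work split and all names (`worstRecord`) are illustrative assumptions:

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <thread>
#include <vector>

// Evaluate residuals for a batch of records concurrently (against a model
// that may be a few steps stale) and return the index of the record with
// the largest residual magnitude -- the one worth training on next.
template <typename Model>
size_t worstRecord(const Model& model,
                   const std::vector<std::array<double, 2>>& X,
                   const std::vector<double>& y, int nThreads) {
    std::vector<double> res(X.size(), 0.0);
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t)
        pool.emplace_back([&, t] {
            // strided split: thread t handles records t, t+n, t+2n, ...
            for (size_t i = t; i < X.size(); i += nThreads)
                res[i] = std::fabs(y[i] - model(X[i][0], X[i][1]));
        });
    for (auto& th : pool) th.join();
    return std::max_element(res.begin(), res.end()) - res.begin();
}
```

On Linux this needs `-pthread`; a natural extension is returning the top few records instead of the single worst one.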
How it can be designed
Given all the properties listed above, we can derive the following parallel concept:
-
Records are processed concurrently by multiple clients that compute residuals and select records with large residual errors.
After a record is selected, the client computes updates for several individual functions in the model. Each
client computes updates for a different set of functions. Clients can be threads or processes running on
different processors; that is optional, they only have to be concurrent and independent.
-
Multiple independent servers perform only updates of the model. Each server receives messages from clients
saying which function must be updated and for which argument. Each server is responsible for a part of the model and receives
update messages only for its area of responsibility.
-
Clients deliver updates to servers via a messaging system.
-
There is a synchronization mechanism for the model between clients and servers. Ideally the clients should always have the latest model produced
by modifications on the servers, but since the model changes gradually, short delays are not critical. The servers provide a copy of the model in a read-only memory
segment or on disk, from which clients read the model periodically. This reading occurs in concurrent processes and will not be
a bottleneck.
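The client-to-server traffic in this scheme can be as small as one message per function touched. The message layout below and the static partitioning by function id are assumptions, one of several reasonable routing choices:

```cpp
#include <cstdint>

// One update message from a client: which univariate function to modify,
// at which argument, and the signed step (sign = direction of the move,
// magnitude = how far, already scaled by the learning rate).
struct UpdateMsg {
    uint32_t functionId;
    double   argument;
    double   delta;
};

// Static partitioning: each server owns a contiguous block of function
// ids, so routing a message to its owner is a single integer division.
inline uint32_t serverFor(uint32_t functionId, uint32_t functionsPerServer) {
    return functionId / functionsPerServer;
}
```

Because each function id maps to exactly one server, two servers never touch the same part of the model and no locking between them is needed.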
When it is needed
If AI is used to optimize furniture sales and tells the manufacturer where to deliver goods, training time may not be that important.
Management can wait a few minutes before loading goods onto trucks and giving drivers directions.
There are other cases where datasets contain hundreds of millions of records, records have several hundred features, and the targets are vectors with related
elements. Such data needs models with half a million parameters and several months of training.
These are the cases where concurrent training is really helpful.
Light weight concept test
A full concept test needs a cloud service with a messaging system and a database. The clients should probably run on Linux boxes, and the servers may
write the model into a database used by the clients for synchronization following the "read committed" principle. The model, once trained, can be serialized,
transferred, used, and even further tuned by 200 lines of code. That sounds like a big project.
A quick test may be done on a single PC. It will not be as effective, but it must show an advantage over the sequential concept. Clients and servers
are threads. The messaging system can be queues, which unfortunately need mutexes for their operations. The model can be stored in memory in several copies
and made available to clients by an atomic swap of the corresponding global pointer. That sounds like a regular programming project for a couple of
weeks.
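The atomic-swap part of that single-PC test could look like this. The double-buffered snapshot below is a deliberately naive sketch (names and layout assumed); a real implementation would need more copies, RCU, or hazard pointers so a client never reads a buffer while it is being overwritten:

```cpp
#include <atomic>
#include <vector>

struct Model { std::vector<double> params; };

// Double-buffered read-only snapshot: servers write into the spare copy
// and publish it with one atomic pointer store; clients only ever read
// through the pointer and never take a mutex.
// NOTE: unsafe if a slow reader still holds the old pointer when the
// spare is rewritten -- a sketch, not production code.
class Snapshot {
    Model buf[2];
    std::atomic<const Model*> current{&buf[0]};
public:
    const Model* read() const { return current.load(std::memory_order_acquire); }

    void publish(const Model& updated) {
        Model* spare = (current.load(std::memory_order_relaxed) == &buf[0])
                           ? &buf[1] : &buf[0];
        *spare = updated;                                // copy off the hot path
        current.store(spare, std::memory_order_release); // one-word publish
    }
};
```

The release/acquire pair guarantees that a client which sees the new pointer also sees the fully written model copy behind it.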