LBFGS
Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm
This abbreviation for some reason causes stupid grins. It took me a long time to learn to reproduce the acronym accurately,
and I am still working on memorizing the last names of everyone involved.
This algorithm has something in common with Newton-Kaczmarz. In theory its training time grows linearly with model size.
In practice it is only near linear, and for large models a significant slowdown of the training process has been observed.
However, I decided to make a quick comparison. The LBFGS algorithm is not that simple to implement after reading the
Wikipedia article, so I had to use one of the available and popular libraries. I chose the one provided by
Jorge Nocedal and Naoaki Okazaki. It appeared well written and easy to use; great respect to all the gentlemen involved
in the development of this library. It took literally a couple of hours
to build it, link it to my application, and adapt a few interface functions accordingly.
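For anyone who wants to repeat this route, here is a simplified sketch of how liblbfgs is wired up. The toy objective $f(x) = \sum_i (x_i - 1)^2$ and everything in `main` are placeholders for illustration, not my actual Urysohn objective; only the `evaluate`/`progress` callback signatures and the `lbfgs()` entry point come from the library.
```cpp
// Minimal liblbfgs wiring sketch with a toy quadratic objective.
#include <cstdio>
#include <lbfgs.h>

// Callback: fill the gradient g and return the objective value at x.
static lbfgsfloatval_t evaluate(void *instance, const lbfgsfloatval_t *x,
                                lbfgsfloatval_t *g, const int n,
                                const lbfgsfloatval_t step)
{
    lbfgsfloatval_t fx = 0.0;
    for (int i = 0; i < n; ++i) {
        lbfgsfloatval_t d = x[i] - 1.0;
        fx += d * d;       // f(x) = sum (x_i - 1)^2, a stand-in objective
        g[i] = 2.0 * d;
    }
    return fx;
}

// Callback: invoked once per iteration; return 0 to continue optimizing.
static int progress(void *instance, const lbfgsfloatval_t *x,
                    const lbfgsfloatval_t *g, const lbfgsfloatval_t fx,
                    const lbfgsfloatval_t xnorm, const lbfgsfloatval_t gnorm,
                    const lbfgsfloatval_t step, int n, int k, int ls)
{
    printf("iteration %d, f = %f\n", k, fx);
    return 0;
}

int main()
{
    const int n = 192;                     // parameter count of the tiny model
    lbfgsfloatval_t fx;
    lbfgsfloatval_t *x = lbfgs_malloc(n);  // library-aligned allocation
    for (int i = 0; i < n; ++i) x[i] = 0.0;

    lbfgs_parameter_t param;
    lbfgs_parameter_init(&param);          // defaults; history size m = 6

    int ret = lbfgs(n, x, &fx, evaluate, progress, nullptr, &param);
    printf("lbfgs returned %d, f = %f\n", ret, fx);
    lbfgs_free(x);
    return 0;
}
```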
I only needed a quick comparison on a small model to see the big picture. I chose a single Urysohn operator
$$
y = \sum_{j=1}^{n}f_j(x_j)
$$
with the data generated by the formula
$$
y = \sqrt{\sum_{j=1}^{24}(x_j^2)}
$$
where $x_j$ are uniformly distributed random numbers from $[0, 10]$. The training set has 2000 records, the validation set 200.
The functions $f_j$ were piecewise linear with 8 points each, so the total size of the model was 24 * 8 = 192 parameters, which makes the model tiny.
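Here is a sketch of how such a dataset can be generated, assuming the setup above; the type and function names are mine, chosen for illustration.
```cpp
// Synthetic data: 24 inputs uniform in [0, 10], target is their Euclidean norm.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

struct Record {
    std::vector<double> x;  // 24 inputs, uniform in [0, 10]
    double y;               // target: sqrt(sum of squared inputs)
};

static std::vector<Record> makeData(int count, std::mt19937 &rng)
{
    std::uniform_real_distribution<double> dist(0.0, 10.0);
    std::vector<Record> data(count);
    for (auto &r : data) {
        r.x.resize(24);
        double s = 0.0;
        for (double &v : r.x) { v = dist(rng); s += v * v; }
        r.y = std::sqrt(s);
    }
    return data;
}

int main()
{
    std::mt19937 rng(42);
    auto train = makeData(2000, rng);   // training set
    auto valid = makeData(200, rng);    // validation set
    printf("train %zu, valid %zu\n", train.size(), valid.size());
}
```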
Below is the printout of the program:
LBFGS
RRMSE for validation = 0.906832
RRMSE for validation = 0.162657
RRMSE for validation = 0.136328
RRMSE for validation = 0.050677
RRMSE for validation = 0.037353
RRMSE for validation = 0.015971
RRMSE for validation = 0.011046
RRMSE for validation = 0.008607
Training time for LBFGS 0.018000
Kaczmarz training
RRMSE for validation = 0.034167, epoch = 0, training time 0.001000
RRMSE for validation = 0.019246, epoch = 1, training time 0.002000
RRMSE for validation = 0.013395, epoch = 2, training time 0.002000
RRMSE for validation = 0.010886, epoch = 3, training time 0.003000
The results vary from run to run, but Kaczmarz is almost always 3 to 8 times faster, even for this tiny model.
The biggest advantage of Kaczmarz over LBFGS is that its training time depends on the number of functions (here 24) and not on
the number of nodes (here 8) in each function: the evaluation determines the active linear segment and uses only it, and the update
likewise finds the corresponding linear segment and modifies only it. In LBFGS the model size equals the dimension of the Hessian matrix,
whose inverse is approximated during training. The algorithm handles large sizes, but its scaling is not perfectly linear.
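To make that concrete, here is a simplified sketch of one Kaczmarz projection step for the piecewise-linear Urysohn above, not my actual implementation. Note that the loops run over the 24 functions and touch only the two nodes bounding the active segment, never all 8 nodes.
```cpp
// One Kaczmarz projection step for y = sum_j f_j(x_j), f_j piecewise linear.
#include <vector>

struct Urysohn {
    int points = 8;                       // nodes per function
    double xmin = 0.0, xmax = 10.0;
    std::vector<std::vector<double>> f;   // f[j][k]: node values, 24 x 8

    Urysohn() : f(24, std::vector<double>(8, 0.0)) {}

    // Project the model toward a single record (x, y); mu is relaxation.
    void kaczmarzStep(const std::vector<double> &x, double y, double mu)
    {
        double h = (xmax - xmin) / (points - 1);
        std::vector<int>    seg(f.size());
        std::vector<double> t(f.size());
        double pred = 0.0, norm2 = 0.0;

        // Evaluation: locate the active segment of each f_j and interpolate.
        for (size_t j = 0; j < f.size(); ++j) {
            double pos = (x[j] - xmin) / h;
            int k = (int)pos;
            if (k > points - 2) k = points - 2;
            double u = pos - k;            // position inside the segment
            seg[j] = k; t[j] = u;
            pred  += (1.0 - u) * f[j][k] + u * f[j][k + 1];
            norm2 += (1.0 - u) * (1.0 - u) + u * u;
        }

        // Update: spread the residual over the same two nodes per function
        // (the model is linear in the node values, so this is a Kaczmarz
        // projection with coefficients 1-u and u).
        double c = mu * (y - pred) / norm2;
        for (size_t j = 0; j < f.size(); ++j) {
            f[j][seg[j]]     += c * (1.0 - t[j]);
            f[j][seg[j] + 1] += c * t[j];
        }
    }
};
```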
I may provide a test on a larger model later, when I find time for it.