Benchmark Study: Accelerating AI Financial Forecasting Models

We have tested our hyper-efficient AI training solution by training DeepLOB multi-horizon networks on Limit Order Book data to predict stock price changes. We demonstrated an acceleration of 36x compared to a newer-generation GPU.

When time to train is a key constraint on model performance, as in High Frequency Trading, we offer a new dimension beyond the conventional hardware-scalability approach. By using a smarter Deep Neural Network training process, our solutions can close the performance gap between general-purpose and specialized hardware, or push specialized hardware far beyond what is currently possible.


Decision making under strict time constraints is one of the hardest challenges for AI (Artificial Intelligence), as the time constraint directly limits the size and complexity of the computations that can be performed. This in turn limits the usefulness of the resulting models and of the decisions based on them.

This is even more true in fields like HFT (High Frequency Trading), where the natural evolution of market behaviour over time directly affects the performance of AI models. Such models inevitably become less accurate over their lifetime, so continuous retraining is necessary to stay relevant, or an entirely new model must be developed.

Increasing the number of computations performed within a given time is the natural way to shorten a model's time to train, thereby mitigating the model-ageing problem. Traditionally, once software optimization has exhausted the performance of the hardware at hand, the most common strategies for increasing training throughput are horizontal scaling or chasing the latest hardware upgrades.

At Hard Sums we aim to make smarter use of the available time by training DNN (Deep Neural Network) models hyper-efficiently.

Rather than accelerating training through faster number crunching, our unique approach makes each calculation more informative. The result is a dramatic reduction in the time to train complex models. This acceleration is independent of both the number of parallel replicas that can be handled and the generation of the hardware.

Benchmark Study

We have recently benchmarked our performance by applying our training engine to the models presented in the study by Z. Zhang and S. Zohren of the Oxford-Man Institute of Quantitative Finance[1], which was reported in the June 2021 Graphcore blog[2].

The study presents the DeepLOB-Seq2Seq and DeepLOB-Attention neural network architectures. Both are sequence-to-sequence models for multi-horizon forecasting of financial movements from LOB (Limit Order Book) data.
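As a rough illustration of what multi-horizon forecasting from LOB data means in terms of input and output shapes, here is a minimal NumPy sketch. The window length, feature count, number of horizons, and the toy encoder/decoder weights are all assumptions chosen for illustration; the actual DeepLOB models use the convolutional and recurrent layers described in the paper.

```python
import numpy as np

# Assumed shapes: a window of T recent LOB snapshots, each with
# 10 price levels x (bid/ask) x (price/volume) = 40 features.
T, F = 100, 40       # lookback window, LOB features per snapshot
HORIZONS = 5         # number of forecast horizons (multi-horizon output)
CLASSES = 3          # price movement: down / stationary / up

rng = np.random.default_rng(0)
lob_window = rng.standard_normal((T, F))

# Stand-in "encoder": collapse the window to a fixed-size state vector.
# (The real models use convolutional blocks plus an LSTM here.)
W_enc = rng.standard_normal((F, 64)) * 0.1
state = np.tanh(lob_window @ W_enc).mean(axis=0)      # shape (64,)

# Stand-in "decoder": one softmax head per horizon, mimicking the
# sequence-to-sequence multi-horizon output.
W_dec = rng.standard_normal((HORIZONS, 64, CLASSES)) * 0.1
logits = np.einsum("d,hdc->hc", state, W_dec)         # (HORIZONS, CLASSES)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(probs.shape)   # one movement distribution per horizon: (5, 3)
```

The point is only the shape of the problem: one input window of order-book states in, one probability distribution over price movements out, per forecast horizon.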

In summary, the performance of the DeepLOB multi-horizon networks is tested on a publicly available dataset[3], and the training times are compared on a single Nvidia GeForce RTX 2080 GPU and a Graphcore IPU. The authors state that the training procedure for each of the two architectures consisted of 200 epochs. In this setting, the study found the IPU to be roughly 8 times faster than the GPU, as shown in the following table:

Model | GPU time for 200 epochs (time for 1 epoch) | IPU time for 200 epochs (time for 1 epoch)
DeepLOB-Seq2Seq | 12h (215s) | 1:40h (30s)
DeepLOB-Attention | 15h (270s) | 1:50h (33s)
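The roughly 8x figure can be recovered directly from the per-epoch times in the table:

```python
# Per-epoch training times in seconds, taken from the study's table.
gpu_s = {"DeepLOB-Seq2Seq": 215, "DeepLOB-Attention": 270}
ipu_s = {"DeepLOB-Seq2Seq": 30, "DeepLOB-Attention": 33}

for model in gpu_s:
    ratio = gpu_s[model] / ipu_s[model]
    print(f"{model}: IPU is {ratio:.1f}x faster per epoch")
```

This gives about 7.2x for DeepLOB-Seq2Seq and 8.2x for DeepLOB-Attention, consistent with the study's headline figure.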

For our benchmarking exercise, we ran our tests on two different hardware types:

  1. an Nvidia GeForce 1080Ti GPU, which is one generation older than the GPU used in the study
  2. an Intel i9 9920x CPU

We trained both the DeepLOB-Seq2Seq and DeepLOB-Attention architectures using the same open-source dataset to make a like-for-like comparison.

Results & Conclusions

We demonstrated that our hyper-efficient approach obtains equivalent testing accuracies in a fraction of the time. We trained the DeepLOB multi-horizon networks 36 times faster than an Nvidia GPU of comparable computational power (20 minutes vs 12 hours), and five times faster than the innovative IPU from Graphcore (20 minutes vs 1.6 hours).
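The headline factors follow directly from the wall-clock times quoted above:

```python
# Wall-clock training times in minutes, from the study and from our run.
ours = 20        # our run on the older GTX 1080Ti
gpu = 12 * 60    # 12 h on the RTX 2080 (from the study)
ipu = 100        # 1:40 h on the Graphcore IPU (from the study)

print(f"vs GPU: {gpu / ours:.0f}x faster")  # 36x
print(f"vs IPU: {ipu / ours:.0f}x faster")  # 5x
```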

In this setting, we show that a single five-year-old GPU using our innovative training approach makes it possible to push into production a new advanced AI trading strategy trained on data acquired no more than 22 minutes ago. It is no longer necessary to invest in new hardware to achieve such performance.

With Hard Sums Technologies, new trading strategies can be tested and put into production more rapidly, and models can be retrained more frequently, on more current data, than ever before. Furthermore, the acceleration frees up time that can be used to better test and optimise a trading strategy before deployment, achieving better outcomes for the fund.

And of course, running our approach on the very latest hardware would multiply the benefits even further. We would expect our training approach deployed on a Graphcore IPU to yield an approximately 250x speed-up compared to the reference Nvidia GeForce RTX 2080 used in the study.
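One way to arrive at a figure of this order is to compose our measured software speed-up with the IPU's hardware advantage reported in the study. This assumes the two factors multiply cleanly, which is an approximation rather than a measurement:

```python
software_speedup = 36        # our measured speed-up vs the RTX 2080
ipu_hw_advantage = 215 / 30  # IPU vs GPU per-epoch ratio from the study

projected = software_speedup * ipu_hw_advantage
print(f"~{projected:.0f}x vs the RTX 2080")  # ~258x
```

That back-of-the-envelope product lands in the region of 250x, matching the expectation stated above.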

External links

[1] Full paper for the study:

[2] Graphcore blog entry:

[3] Dataset:
