Intel TBB (C++) and the MIC architecture for massively multicore HFT
TBB is a C++ library built for massively parallel multicore processors.
NOTE: most offloading examples rely on OpenMP. OpenMP itself is an open standard, but offloading to the Phi typically goes through Intel's commercial compiler. I will stick with TBB for now.
CUDA programmers need to remember that the Phi is designed as a coprocessor that runs Linux. Unlike GPUs that act only as accelerators, the Phi coprocessor has the ability to be used as a very capable support processor merely by compiling existing applications to run natively on it. Although this means that Phi coprocessors will probably not be a performance star for non-vector applications, they can still be used to speed applications via their many-core parallelism and high memory bandwidth.
Xeon Phi might have the edge on Nvidia GPUs when it comes to double-precision FP. IIRC the performance of Pascal (and other GPUs) on DP is pretty awful, and that’s a big problem for many real-world HPC applications…
The poor double-precision performance is only an issue on consumer-grade Nvidia cards (i.e. anything that is not in their Tesla line of compute cards aimed at HPC). In recent years, Nvidia has intentionally crippled DP performance on non-professional cards to ensure that those who need it aren’t tempted to purchase the much cheaper GeForce devices instead…
Intel needs to stop framing this as Xeon Phi vs GPGPUs. They are very different, and their strengths are different. After benchmarking both many times, I realized that Intel should just be clear about which problem domains are better suited to the Xeon Phi. GPGPU cores are “much dumber” and you get a lot more of them, which is perfectly fine for linear algebra. So for any task that asks the GPGPU to do straight, repeated linear algebra (machine learning, for example), the GPGPU will obviously be faster, because that’s pretty much all it can do.
But the Xeon Phi has much faster data transfers, much faster memory allocation, can be used with standard MPI/OpenMP/OpenACC, and Knights Landing will be binary-compatible with x86. Do you have code you already set up with MPI or OpenMP? As long as the memory requirements aren’t too high, you probably already set it up to minimize communication, and so you get a free 240 threads for every node you put a Xeon Phi in (without changing your program!). Does your program run for an indeterminate amount of time and have to allocate memory? Then the Xeon Phi will be faster. Do you have to use it simply as an accelerator, i.e. the problem size is too large for the memory of the card, so you will have to keep pushing things back to the CPU? Then the Xeon Phi will be faster (and Knights Landing will have more memory, alleviating this problem even more).
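To make the "without changing your program" point concrete, here is a minimal sketch of an OpenMP loop in C++. The names `available_threads` and `dot` are hypothetical and chosen only for illustration. The source never hard-codes a thread count: the OpenMP runtime picks it up from the machine (or from `OMP_NUM_THREADS`), so the same loop should spread over however many hardware threads the target exposes. If the compiler is run without OpenMP enabled, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Threads the OpenMP runtime would use; 1 when built without OpenMP.
int available_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Dot product whose iterations are shared among the available threads.
// The reduction clause gives every thread a private partial sum and
// combines the partial sums when the loop ends.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        sum += a[k] * b[k];
    }
    return sum;
}
```

With GCC this would typically be built with `-fopenmp`; whether a given toolchain can target the Phi natively is a separate question that this sketch does not answer.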
See 18:10 for the speed-up in the vanilla options pricing engine example.
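The pricing engine from that example is not reproduced here. As a rough illustration of why vanilla option pricing parallelizes so well, the sketch below prices a book of European calls with the Black-Scholes formula, one independent calculation per option. `VanillaOption`, `norm_cdf`, `bs_call`, and `price_calls` are hypothetical names, and the loop uses an OpenMP pragma only to keep the example dependency-free; the same loop could be expressed with TBB's `tbb::parallel_for`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for one European call (no dividends).
struct VanillaOption {
    double spot;    // current underlying price
    double strike;  // strike price
    double rate;    // continuously compounded risk-free rate
    double vol;     // annualized volatility
    double expiry;  // time to expiry in years
};

// Standard normal CDF expressed through the complementary error function.
double norm_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Black-Scholes price of a European call.
double bs_call(const VanillaOption& o) {
    assert(o.vol > 0.0 && o.expiry > 0.0);
    const double sd = o.vol * std::sqrt(o.expiry);
    const double d1 = (std::log(o.spot / o.strike)
                       + (o.rate + 0.5 * o.vol * o.vol) * o.expiry) / sd;
    const double d2 = d1 - sd;
    return o.spot * norm_cdf(d1)
         - o.strike * std::exp(-o.rate * o.expiry) * norm_cdf(d2);
}

// Price every option in the book. Each iteration reads one option and
// writes one output slot, so iterations can run on different threads
// without any locking.
std::vector<double> price_calls(const std::vector<VanillaOption>& book) {
    std::vector<double> prices(book.size());
    const long n = static_cast<long>(book.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        prices[k] = bs_call(book[k]);
    }
    return prices;
}
```

For an at-the-money call with spot 100, strike 100, rate 5%, volatility 20%, and one year to expiry, `bs_call` should return roughly 10.45. Any actual speed-up depends on the book size and the hardware, so treat this as a starting point for benchmarking rather than a performance claim.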