Tag Archives: multicore

Who knows this C++ multicore programming book?

This one is old school: it dates from 2007, which is quite old. I am looking for anyone who knows of an equivalent to this book.


Join my FREE newsletter to learn more about which C++ books are good for multicore, low-latency programming

NOTE I now post my TRADING ALERTS into my personal FACEBOOK ACCOUNT and TWITTER. Don't worry as I don't post stupid cat videos or what I eat!

Intel TBB C++ with MIC available for massive multicore HFT

C++ library built for massively parallel multicore processors





NOTE: most offloading libraries use OpenMP. OpenMP itself is an open standard, but the offload tooling I have seen is commercial. I will stick with TBB for now.




CUDA programmers need to remember that the Phi is designed as a coprocessor that runs Linux. Unlike GPUs that act only as accelerators, the Phi coprocessor has the ability to be used as a very capable support processor merely by compiling existing applications to run natively on it. Although this means that Phi coprocessors will probably not be a performance star for non-vector applications, they can still be used to speed applications via their many-core parallelism and high memory bandwidth.


Xeon Phi might have the edge on Nvidia GPUs when it comes to double-precision FP. IIRC the performance of Pascal (and other GPUs) on DP is pretty awful, and that’s a big problem for many real-world HPC applications…

The poor double-precision performance is only an issue on consumer-grade Nvidia cards (i.e. anything that is not in their Tesla line of compute cards aimed toward HPC). In recent years, Nvidia has intentionally crippled DP performance on non-professional cards in order to ensure that those who need it aren’t tempted to purchase the much cheaper GeForce devices instead…

Intel needs to stop playing this game of Xeon Phi vs GPGPUs. They are very different, and their strengths are different. After benchmarking both of these many times, I realized that Intel should just be clear about which problem domains are better on the Xeon Phi. GPGPU cores are “much dumber” and you get a lot more of them, which is perfectly fine for linear algebra. So for any task that asks the GPGPU to do straight repeated linear algebra (machine learning), the GPGPU will obviously be faster, because that’s pretty much all it can do.

But the Xeon Phi has much faster data transfers, much faster memory allocation, can be used with standard MPI/OpenMP/OpenACC, and Knights Landing will be binary-compatible with x86. Do you have a code you have already set up with MPI or OpenMP? As long as the memory requirements aren’t too high, you have probably already set it up to minimize communications, and so you get a free 240 threads for every node you put a Xeon Phi in (without changing your program!). Does your program run for an indeterminate amount of time and have to allocate memory? Then the Xeon Phi will be faster. Do you have to use it simply as an accelerator, i.e. the problem size is too large for the memory of the card so you will have to keep pushing things back to the CPU? Then the Xeon Phi will be faster (and Knights Landing will have more memory, alleviating this problem even more).


See 18:10 for an example speed-up of a vanilla options pricing engine

Multicore research led to these HFT-like resources

After researching Intel TBB, I came across these resources:

This guy looks knowledgeable: http://www.1024cores.net/home/about-me

HFT latency talk with Intel TBB/FastFlow: https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/287132

Parallelizing High-Frequency Trading Applications by Using C++11 Attributes

FastFlow: http://ieeexplore.ieee.org/xpl/login.jsp?reload=true&tp=&arnumber=7345639&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel7%2F7293439%2F7345607%2F07345639.pdf%3Farnumber%3D7345639


FastFlow multicore C++ for potential HFT-like rapid trading

Over the coming weeks, you’ll start to see me focus more on trading concepts versus technology. There seems to be a much more lucrative interest from professional traders recently so I’d like to start focusing on them in the future. I’ve already done an email on that a few days ago.


I am starting to focus more on low-latency, lower-level trading software to meet the higher speeds needed for both market tick capture and order management. I would not really call this high-frequency trading, but it is being designed with that in mind. The biggest and most crucial aspect of it all is multi-core processing, so that all your algorithms' analysis is done in parallel. There are a few languages that can do this, but C++ is the most efficient, which is why all the HFT shops use it. We all know it is the standard.


So if you go onto my YouTube channel, you will find a playlist that focuses on a C++ multi-core framework called FastFlow. Most recently, I just made a 20-minute video on the future of my high speed automated trading software.


 Check out this video here

The final dance? High speed trading software architecture

In fact, I have set aside Monday night to do a live webinar Meetup where you can challenge me on all these concepts. Any technical person is invited to talk about it. Just to let you know, this will be the only time I will talk about it in detail. Also, don’t expect any source code to be released for it. I will only be presenting high-level concepts in my usual ghetto presentation style.


Get details on this one-time-only event here

Join one of my Meetups listed in the link above.

Pretty trading charts with Matplotlib and PyQtChart

The following night, on Tuesday, I will be presenting all the Python and Qt charting choices I have at my disposal. They are of course open source.


Get the details here


This is the second-to-last module I am presenting as part of my “Independent trading business in Python” course series. The very last module will be presented on Tuesday, May 3, where I will be discussing rapid graphical user interface development with Qt Designer.

I always have my source code available with live demos for question-and-answer sessions at these times.


We are coming down to the wire before I start getting into the next phase, strategy development, starting the following Tuesday on May 10. This first phase will focus on pair trading or arbitrage for any equity that Yahoo Finance tracks. That actually gives you access to over 158,000 tradeable instruments! Consider this: this is where pure maximum potential resides! Why trade like a chump? And in the coming weeks to months, all of this will disappear. Vanish! Goodbye!


Once again, this is part of my Algorithmic Trading course series.


Get immediate access here


Or get full benefit description and details here


Thanks, Bryan

Can Intel TBB be used for multicore and cluster programming together? On Linux, which is best for multicore and cluster programming?

Hi Friends, I would like to ask a question: can Intel TBB be used for multicore and cluster programming together? On Linux, which is best for multicore and cluster programming?

Can MPI be used for multicore programming and cluster programming on Linux? Can MPI be used on Windows?


Yes, yes and yes. TBB doesn’t go across MPI processes though, because each process is in a different memory space. Not sure I understand the Linux question though.


TBB can be used for multicore programming. However, it doesn’t support cluster programming on its own. For that purpose you have to combine it with MPI.

As for MPI, you can use it for *both* multicore programming and cluster programming on Linux. However, if you use it solely on a multicore machine, there might be a noticeable impact on performance as well as programming effort compared to thread-based parallel programming (TBB/OpenMP), due to the message-passing nature of MPI.

And yes, MPI also runs on Windows.
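To make the contrast above a bit more concrete, here is a minimal sketch in standard C++ only (no MPI or TBB required). `sum_shared` stands in for the thread-based model, where workers read one array in place. `sum_messages` imitates message passing by handing every worker a private copy of its slice and collecting one reply. Both function names are made up for this illustration, and the copy is only a rough proxy for real MPI buffering.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// Shared-memory style: every worker reads its slice of `data` in place.
long long sum_shared(const std::vector<int>& data, std::size_t workers) {
    assert(workers > 0);
    std::vector<long long> partial(workers, 0);
    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            const std::size_t begin = std::min(w * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[w] = std::accumulate(first, last, 0LL);  // own slot only
        });
    }
    for (auto& t : pool) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}

// Message-passing style (imitation): each worker gets a private copy of
// its slice (the "message") and sends back a single number (the reply).
long long sum_messages(const std::vector<int>& data, std::size_t workers) {
    assert(workers > 0);
    std::vector<std::future<long long>> replies;
    const std::size_t chunk = (data.size() + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(w * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        // This copy is the extra memory and time that buffering costs.
        std::vector<int> message(data.begin() + static_cast<std::ptrdiff_t>(begin),
                                 data.begin() + static_cast<std::ptrdiff_t>(end));
        replies.push_back(std::async(std::launch::async,
            [msg = std::move(message)] {
                return std::accumulate(msg.begin(), msg.end(), 0LL);
            }));
    }
    long long total = 0;
    for (auto& r : replies) total += r.get();
    return total;
}
```

Calling `sum_shared(v, 4)` and `sum_messages(v, 4)` on the same vector gives the same total. The only difference is the per-worker copy, which is roughly where the extra memory and the intranode overhead of message passing come from.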


On Linux, if we use Intel TBB and MPI together, will it cause errors? Is there any document, PDF or book on how to use MPI and Intel TBB together?


You should probably ask that in the Intel TBB forum on Intel’s website. It’s impossible to answer without knowing what the error is or what you’re trying to do and a whole raft of other information.


Disclaimer 1… I work for Intel.
Disclaimer 2… Your mileage may vary…

Specific questions are best sent to the Intel developer forums, and it is impossible to answer the question as you have phrased it. That said, if used properly, the tools can work together. If used improperly, like a hammer, yes, it is easy for you to hit your thumb, but the trick is not to. 😉

Most production-quality MPI implementations, such as Intel’s, can be set up to use shared-memory messaging when it is available and to avoid it when it is not. It’s all about the hardware you have and how the system has been put together. But… the system has to be correctly set up and maintained.

So to answer your >>development<< question, many (??most??) commercial & research HPC codes are written first using the message-passing paradigm, as mentioned, and then tuned for shared memory where possible. If done carefully, a huge advantage is that it tends to make the code easier to move between HPC systems of different scale.

The problem is that for TBB and the like, SMP, NUMA, SUMA hardware (where the memory is meant to be kept coherent so that a thread-like programming model works) is extremely expensive to build. Traditionally, message-based systems (call them multicomputers, distributed processors or clusters) are much more cost effective and can be made to scale better if the codes are written for them, but they must be carefully assembled, provisioned and maintained.

But as D hinted, there are few absolutes here. It is possible to build an extremely large scale MP such as the Altix machines at Pittsburgh Supercomputing Center and LRZ in Germany. But as I said, such machines are extremely expensive.

A cluster, of course, is difficult to put together >>properly<<, and keeping it that way can be tough. The SMP and its system software are correct by design.

Some thoughts – hopefully helpful hints ….

1) Google: “Intel Cluster Tools”

Intel spends a lot of money developing and producing production tools for HPC application developers, in particular the “Intel Cluster Tools”. These tools work for any Intel 64 architecture and are >>targeted<< at commercial and large-scale academic projects. [Note I use the GNU tools also. All I’m saying is that besides a great compiler and libraries like MPI, TBB and MKL, you will also find some excellent tools for debugging message-based code, as well as some of the best performance tuning tools].

2) Google “Intel Cluster Ready”

Intel also spends a lot of money helping its direct volume customers, the system integrators that build HPC application execution platforms for those codes. The Intel experience with “beowulf” style clusters is that to make them consistent and keep them that way, they must be designed and manufactured consistently from the start and provisioned with solid cluster provisioning tools, and effort must be spent >>keeping<< a cluster consistent (i.e. fighting entropy). Because that is not an easy task, Intel has a tool that it >>gives<< to its partners to do that. The tool (called the Intel Cluster Checker) makes the job of designing, manufacturing, deploying, operating and administering a cluster >>much<< easier.

Please note that Intel does >>not<< sell it, so BYO folks, please do not ask me for it. You get the tool from your cluster vendor. All Intel Cluster Ready certified clusters are required to have it. So if you ask your platform vendor for an ICR certified system, you will know that:
+ the system is a correctly executing platform when it is delivered
+ it is staying that way as you use it.


MPI itself can be used on multicore systems. Ironically, even in a shared memory space, MPI can provide better scalability and performance than most of the threading-based approaches.


Yes, there is an impact from MPI compared to multithreaded programming: it is faster because it does not have thread overhead. However, it takes a little more memory because of all the buffering.


I agree with both of you that MPI is faster and more scalable.

However, sometimes the cost of MPI inter-communication has to be taken into account too, even intranode, hence the *message-passing overhead*. Of course this depends heavily on your parallel implementation (and memory performance).


I would like to know how efficient MPI will be on





Can we use MPI for integrating the cluster (connecting and passing data between clusters, not multicore programming) and Intel TBB for multicore programming within MPI at the same time?


SGI Altix UV comes with NUMAlink 5, a fast and low latency interconnect which is perfect for MPI.

A combination of TBB and MPI (hybrid) is possible. However, depending on the combination of TBB and MPI application you picked, you might have to tinker a bit with the MPI compilation script (mpicc, mpiCC, etc…).

Or, for a working out-of-the-box solution, you could follow the earlier advice and evaluate the “Intel Cluster Tools” (it is free for 30 days).


First of all, I would be very reluctant to use a solution that locks you into one vendor’s compiler. I’d use OpenMP for threads and MPI for message passing.

Secondly, I don’t understand the argument of “thread overhead” being used against using threads. Threads are lightweight processes, while MPI uses full processes, which by definition have more overhead.

I agree that hybrid thread/MPI programming is a nightmare, especially when you want to remain portable. At the very least, only one thread on a node should do communication. The safest bet would be to only communicate outside thread-parallel code regions, but that severely restricts your parallelisation options.
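Here is a rough sketch of that "safest bet" in plain standard C++: workers compute in a thread-parallel region, and only after they have all been joined does a single thread do the communication. `compute_then_send` and `SendFn` are hypothetical names for this illustration, and the callback is a stand-in for a real network send so that the sketch runs without MPI. In real MPI code, the matching idea is to request the `MPI_THREAD_FUNNELED` level via `MPI_Init_thread`, which promises that only the main thread makes MPI calls.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a network send (an MPI send in real code).
using SendFn = std::function<void(long long)>;

// Workers compute in parallel and write only to their own result slot.
// Once they are all joined, the calling thread alone performs every send,
// so communication never overlaps the thread-parallel region.
std::size_t compute_then_send(std::size_t workers, const SendFn& send) {
    assert(workers > 0);
    std::vector<long long> results(workers, 0);
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&results, w] {
            // Dummy workload: sum of 0..(w * 1000).
            long long acc = 0;
            const long long limit = static_cast<long long>(w) * 1000;
            for (long long i = 0; i <= limit; ++i) acc += i;
            results[w] = acc;
        });
    }
    for (auto& t : pool) t.join();         // end of the thread-parallel region
    for (long long r : results) send(r);   // single-threaded communication
    return results.size();
}
```

The price is the one mentioned above: no communication can be overlapped with computation, because every send waits until the whole parallel region has finished.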


+1 on MPI vs thread overhead. Also, Intel TBB isn’t a compiler solution, it is a C++ library, and the code is available as open source or as a commercial binary library. It works very well on AMD or Intel CPUs. In our software we have MPI, OpenMP and TBB, doing different things for different tasks in a shared-memory system. The work they do doesn’t always overlap (see the comment above for why) but it works very well.

If you’re not in a shared-memory system and are on a traditional distributed cluster, you will more or less have to use MPI; that’s the current standard for distributed parallel programming. If you can reasonably isolate distributing the data over the cluster from doing the parallel work on each node, then you can use OpenMP or Intel TBB, or both. If you just have to do some parallel loops, with the same work and workloads on a large set of data, OpenMP is less work for you to program. If you have parallel or concurrent work that has a changing workload, like many different calls with varying amounts of time to complete, then TBB might be better because it’s task-based and will automatically load balance, giving you better scale-up.
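To illustrate what "automatically load balance" means, here is a minimal sketch in standard C++ only. It is not how TBB works internally (TBB uses a work-stealing task scheduler); it is a simpler stand-in where threads pull the next task index from a shared atomic counter, so a thread that drew a short task just comes back for more. `parallel_for_dynamic` is a made-up name for this example, and the task callback must be safe to call from several threads at once.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Runs task(i) once for every i in [0, count) on `workers` threads.
// Indices are claimed one at a time from a shared counter instead of
// being split into fixed blocks up front, so uneven task durations
// do not leave some threads idle while others are still busy.
template <typename Task>
void parallel_for_dynamic(std::size_t count, std::size_t workers, Task task) {
    assert(workers > 0);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            for (;;) {
                const std::size_t i = next.fetch_add(1);  // claim one task
                if (i >= count) break;                    // nothing left
                task(i);
            }
        });
    }
    for (auto& t : pool) t.join();
}
```

With a fixed split (the usual default for a simple parallel loop), a worker that happens to get all the slow items becomes the bottleneck. With the counter, the slow items are spread over whichever threads are free, at the cost of one atomic operation per task.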

Having all of us here shouting out specific solutions or implementations isn’t completely helpful, we don’t know exactly what you’re trying to do. You’ll have to think about what your problem is, how you’re going to decompose it, how you’ll move data around and what work can be done independently or in parallel. That’s often not a 5-minute job that can be answered in a forum.

Can you recommend books for using MPI and Intel TBB together? Thanks.


