Can Intel TBB be used for multicore and cluster programming together? On Linux, which is best for multicore and cluster programming?

(Last Updated On: August 4, 2011)

Hi Friends, I would like to ask a question: can Intel TBB be used for multicore and cluster programming together? On Linux, which is best for multicore and cluster programming?

Can MPI be used for multicore programming and cluster programming on Linux? Can MPI be used on Windows?


Yes, yes and yes. TBB doesn’t go across MPI processes though, because each process is in a different memory space. Not sure I understand the Linux question though.


TBB could be used for multicore programming. However, it doesn’t support cluster programming on its own. For this purpose you have to combine it with MPI.
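To make the division of labour concrete, here is a minimal sketch of the two-level split that a hybrid program implies: the data is first cut into one slice per process, and each slice is then cut into chunks for the threads on that node. It is plain standard C++ and calls neither MPI nor TBB; `block_range` and `thread_chunks` are hypothetical helper names. In a real program the rank and process count would come from MPI, and a threading library such as TBB or OpenMP would normally do the inner chunking for you.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open index range [begin, end).
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Split n items into `parts` contiguous blocks whose sizes differ by at
// most one, and return block number `index` (0-based).
// The first (n % parts) blocks receive one extra item.
// Assumes parts >= 1 and index < parts.
Range block_range(std::size_t n, std::size_t parts, std::size_t index) {
    const std::size_t base = n / parts;
    const std::size_t extra = n % parts;
    const std::size_t begin = index * base + (index < extra ? index : extra);
    const std::size_t size = base + (index < extra ? 1 : 0);
    return Range{begin, begin + size};
}

// Level 1: find the slice owned by one process (`rank` out of `num_ranks`).
// Level 2: cut that slice into one chunk per thread on the node.
// The returned ranges are expressed in global indices.
std::vector<Range> thread_chunks(std::size_t n, std::size_t num_ranks,
                                 std::size_t rank, std::size_t num_threads) {
    const Range mine = block_range(n, num_ranks, rank);
    std::vector<Range> chunks;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const Range local = block_range(mine.end - mine.begin, num_threads, t);
        chunks.push_back(Range{mine.begin + local.begin, mine.begin + local.end});
    }
    return chunks;
}
```

With 10 items, 3 processes and 2 threads per process, `thread_chunks(10, 3, 2, 2)` gives the last process the global indices 7 to 9, split as [7, 9) and [9, 10). Only the outer level has to be agreed between processes; the inner level is private to each node, which is why the two libraries can coexist.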

As for MPI, you could use it for *both* multicore programming and cluster programming in Linux. However, if you use it solely on a multicore machine, there might be a noticeable cost in performance as well as programming effort compared to thread-based parallel programming (TBB/OpenMP), due to the nature of MPI (message passing).

And yes, MPI also runs in Windows.


In Linux, if we use Intel TBB and MPI together, will it cause errors? Is there any document, PDF or book on how to use MPI and Intel TBB together?


You should probably ask that in the Intel TBB forum on Intel’s website. It’s impossible to answer without knowing what the error is or what you’re trying to do and a whole raft of other information.


Disclaimer 1… I work for Intel.
Disclaimer 2… Your mileage may vary…

Specific questions are best sent to the Intel developer forums, and it’s impossible to answer the question as you have phrased it. That said, if used properly, the tools can work together. If used improperly, like a hammer, yes it is easy for you to hit your thumb, but the trick is not to. 😉

Most production quality MPI implementations such as Intel’s can be set up to use shared memory messaging when it is available and to fall back when it is not. It’s all about the hardware you have and how the system has been put together. But… the system has to be correctly set up and maintained.

So to answer your >>development<< question: many (??most??) commercial & research HPC codes are written first using the message passing paradigm, as mentioned, and then tuned for shared memory where possible. If done carefully, a huge advantage is that the code tends to move more easily between HPC systems of different scale.

The problem is that for TBB and the like, SMP, NUMA, SUMA hardware (where memory is kept coherent so that a thread-like programming model works) is extremely expensive to build. Traditionally, message-based systems (call them multicomputers, distributed processors or clusters) are much more cost effective and can be made to scale better if the codes are written for them, but they must be carefully assembled, provisioned and maintained.

But as D hinted, there are few absolutes here. It is possible to build an extremely large scale MP such as the Altix machines at Pittsburgh Supercomputing Center and LRZ in Germany. But as I said, such machines are extremely expensive.

But a cluster, of course, is difficult to put together >>properly<<, and keeping it that way can be tough. The SMP and its system software are correct by design.

Some thoughts – hopefully helpful hints ….

1) Google: “Intel Cluster Tools”

Intel spends a lot of money developing and producing production tools for HPC application developers, in particular the “Intel Cluster Tools”. These tools work on any Intel 64 architecture and are >>targeted<< at commercial and large scale academic projects. [Note I use the GNU tools also. All I’m saying is that besides a great compiler and libraries like MPI, TBB and MKL, you will also find some excellent tools for debugging message-based code, as well as some of the best performance tuning tools].

2) Google “Intel Cluster Ready”

Intel also spends a lot of money helping its direct volume customers, the integrators that build HPC application execution platforms for those codes. The Intel experience with “beowulf” style clusters is that to make them consistent and keep them that way, they must be designed and manufactured consistently from the start and provisioned with solid cluster provisioning tools, and effort must be spent >>keeping<< a cluster consistent (i.e. fighting entropy). Because that is not an easy task, Intel has a tool that it >>gives<< to its partners to do that. The tool (called the Intel Cluster Checker) makes the job of designing, manufacturing, deploying, operating and administering a cluster >>much<< easier.

Please note that Intel does >>not<< sell it, so BYO folks, please do not ask me for it. You get the tool from your cluster vendor. All Intel Cluster Ready certified clusters are required to have it. So if you ask your platform vendor for an ICR certified system, you will:
+ know the system is a correctly executing platform when it is delivered
+ know it is staying that way as you use it.


MPI itself can be used on a multicore system. Ironically, even in a shared memory space, MPI can provide better scalability and performance than most of the threading-based approaches.


Yes, there is an impact from MPI compared to multithreaded programming: it is faster because it does not have thread overhead. However, it takes a little more memory because of all the buffering.


I agree with both of you that MPI is faster and more scalable.

However, sometimes the cost of MPI inter-process communication has to be taken into account too, even within a node, hence the *message-passing overhead*. Of course this depends heavily on your parallel implementation (and memory performance).


I would like to know how efficient MPI will be on



Can we use MPI for integrating the cluster (connecting and passing data between nodes, not multicore programming) and Intel TBB for multicore programming within each MPI process at the same time?


SGI Altix UV comes with NUMAlink 5, a fast and low latency interconnect which is perfect for MPI.

A combination of TBB and MPI (hybrid) is possible. However, depending on the combination of TBB and MPI implementation you pick, you might have to tinker a bit with the MPI compilation script (mpicc, mpiCC, etc…).

Or for a working out-of-the-box solution, you could follow the advice above and evaluate the “Intel Cluster Tools” (it is free for 30 days).


First of all, I would be very reluctant to use a solution that locks you into one vendor’s compiler. I’d use OpenMP for threads and MPI for message passing.

Secondly, I don’t understand the argument of “thread overhead” being used against using threads. Threads are lightweight processes, while MPI uses full processes, which by definition have more overhead.

I agree that hybrid thread/MPI programming is a nightmare, especially when you want to remain portable. At the very least, only one thread on a node should do communication. The safest bet would be to only communicate outside thread-parallel code regions, but that severely restricts your parallelisation options.


+1 on MPI vs thread overhead. Also, Intel TBB isn’t a compiler solution, it’s a C++ library, and the code is available as open source or as a commercial binary library. It works very well on AMD or Intel CPUs. In our software we have MPI, OpenMP and TBB, doing different things for different tasks in a shared-memory system. The work they do doesn’t always overlap (see the comment above for why) but it works very well.

If you’re not in a shared-memory system and are on a traditional distributed cluster, you will more or less have to use MPI; that’s the current standard for distributed parallel programming. If you can reasonably isolate distributing the data over the cluster from doing the parallel work on each node, then you can use OpenMP or Intel TBB, or both. If you just have to do some parallel loops, with the same work and workloads on a large set of data, OpenMP is less work for you to program. If you have parallel or concurrent work that has a changing workload, like many different calls with varying amounts of time to complete, then TBB might be better because it’s task-based and will automatically load balance, giving you better scale-up.
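To put a number on the load-balancing point, here is a small idealized model, not TBB code. It compares the finish time of a fixed up-front split of the tasks against a scheme where each task goes to whichever worker frees up first. The function names and the task costs in the example are made up for illustration, scheduling overhead is ignored, and TBB's real scheduler (work stealing) is more sophisticated than this greedy rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Static split: worker w gets the contiguous block of tasks
// [w*n/workers, (w+1)*n/workers). The run finishes when the most
// heavily loaded worker finishes. Assumes workers >= 1.
long static_makespan(const std::vector<long>& cost, std::size_t workers) {
    const std::size_t n = cost.size();
    long worst = 0;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = w * n / workers;
        const std::size_t end = (w + 1) * n / workers;
        long sum = 0;
        for (std::size_t i = begin; i < end; ++i) {
            sum += cost[i];
        }
        worst = std::max(worst, sum);
    }
    return worst;
}

// Dynamic assignment: tasks are handed out in order, each one to the
// worker that becomes free earliest. Assumes workers >= 1.
long dynamic_makespan(const std::vector<long>& cost, std::size_t workers) {
    std::vector<long> busy_until(workers, 0);
    for (long c : cost) {
        auto next_free = std::min_element(busy_until.begin(), busy_until.end());
        *next_free += c;
    }
    return *std::max_element(busy_until.begin(), busy_until.end());
}
```

For the uneven costs {8, 1, 1, 1, 1, 1, 1, 2} on 2 workers, the static split finishes at time 11 while the dynamic one finishes at time 8. For uniform costs such as {2, 2, 2, 2} both finish at time 4, which matches the advice above: a plain parallel loop is fine for even work, and task-based scheduling pays off when task times vary.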

Having all of us here shouting out specific solutions or implementations isn’t completely helpful, we don’t know exactly what you’re trying to do. You’ll have to think about what your problem is, how you’re going to decompose it, how you’ll move data around and what work can be done independently or in parallel. That’s often not a 5-minute job that can be answered in a forum.

Can you recommend books for using MPI and Intel TBB together? Thanks.





About caustic

Hi there. My name is Bryan Downing. I am part of a company called QuantLabs.Net. This is specifically a company with a high profile blog about technology, trading, financial, investment, quant, etc. It posts things on how to do job interviews with large companies like Morgan Stanley, Bloomberg, Citibank, and IBM. It also posts different unique tips and tricks on Java, C++, or C programming. It posts about different techniques in learning about Matlab and building models or strategies. There is a lot here if you are venturing into the financial world, like quant or technical analysis. It also discusses the future generation of trading and programming. Specialties: C++, Java, C#, Matlab, quant, models, strategies, technical analysis, linux, windows. P.S. I have been known to be the worst typist. Do not be offended by it, as I like to bang stuff out and put priority on what I do over typing. Maybe one day I can get a full time copy editor to help out. Do note I prefer videos as they are much easier to produce, so check out my many videos at