
Monoid design pattern applied to the FastFlow multithreading C++ library for high-speed trading like HFT

As I started digging further into the Monoid pattern, these articles looked promising:




One of these links includes a source code demo. To be honest, the code was not commented or documented at all, which means I could not go beyond a quick look. It did not compile with my GCC 4.9, and I did not have the patience to figure it out from there.

Check out the history here:

C++ event driven meta programming libraries

Or watch the video here:

Event-driven C++ Metaprogramming

These links base the methodology on Haskell, and there are a number of reasons this functional programming language can approach the speed of C or C++. Here are some other comparisons:


I am no expert here, but it was suggested that another multithreading library could achieve the same performance using these Monoid design patterns. It looked promising, but after some further digging, Intel TBB and Boost Futures came up. That is a yucky proposition when I already knew about a faster (and easier) multithreading library called FastFlow.
I took a look to see if the project was abandoned. To my surprise, version 2.1 came out just two days ago. Talk about perfect timing!


There is a performance graph showing how FastFlow performs against other libraries. It seems to keep up with OpenMP, which is the fastest compared to Cilk or TBB. Again, I am no expert here, but I think this multithreading library is worth revisiting.


It is also comforting to know that the FIX8 project chose FastFlow for concurrency, and they claim their library is quite fast. Knowing all this, with my revisit to C++ on Linux, it seemed important to showcase my earlier posts on this library:


Videos: https://www.youtube.com/user/quantlabs/search?query=fastflow

Keep your eyes peeled for more up-to-date topics on FastFlow.


NOTE I now post my TRADING ALERTS into my personal FACEBOOK ACCOUNT and TWITTER. Don't worry as I don't post stupid cat videos or what I eat!

Reasons to use FastFlow for C++ HFT multithreading over TBB, Cilk, Open MPI, and CUDA, compared to a high-end server

Most of these references came from http://calvados.di.unipi.it/dokuwiki/doku.php?id=ffnamespace:about

Get more development news on this HFT platform as I build this thing

Go to the end for them.

The most useful tutorial is the last link in this post!

Despite being very efficient on some classes of applications, OpenMP and MPI share a common set of problems: poor separation of concerns among application and system aspects, a rather low level of abstraction presented to the application programmers, and poor support for really fine-grained applications are all considerations hindering easy use of MPI and OpenMP. Actually, it is not even clear yet if the mixed MPI/OpenMP programming model always offers the most effective mechanisms for programming clusters of SMP systems [4].

An ff_dnode cannot have both external input and output channels at the same time, since the minimal pure FastFlow application is composed of at least 2 nodes (a pipeline of two sequential nodes, or a farm with an Emitter node and a sequential worker node).

In FastFlow, we used ZeroMQ as the external transport for the ff_dnode concurrent entity.

Note that Java has a garbage collector, but: "In general, lock-free dynamic concurrent data structures that use CAS operations should be supported by safe memory reclamation techniques in programming environments without automatic garbage collection."
FastFlow is a C++ parallel programming framework aimed at simplifying the development of efficient applications for multi-core platforms. The key vision of FastFlow is that ease-of-development and runtime efficiency can both be achieved by raising the abstraction level of the design phase, thus providing developers with a suitable set of parallel programming patterns that can be efficiently compiled onto the target.

The word accelerator is often used in the context of hardware accelerators. Usually accelerators feature a different architecture with respect to standard CPUs and thus, in order to ease exploitation of their computational power, specific libraries are developed. In the case of GPGPUs those (low-level) libraries include Brook [13], NVidia CUDA, and OpenCL. At a higher level, Offload [14] enables offloading of parts of a C++ application, which are wrapped in offload blocks, onto hardware accelerators for asynchronous execution; OMPSs [15] enables the offloading of OpenCL and CUDA kernels as an OpenMP extension [16]. FastFlow, in contrast with these frameworks, does not target specific (hardware) accelerators but realizes a **virtual accelerator** running on the main CPUs and thus does not require the development of specific code.
See pg 36 on the highly IMPRESSIVE performance with a Tesla C2050.
The shift from 1 Gbit to 10 Gbit networks has pushed hardware manufacturers to find
new solutions for exploiting multicore architectures. The first goal is to use all the available cores for improving packet receive/transmission. For this reason, modern network
adapters feature multi-queue RX/TX, so that a physical network adapter is logically partitioned into several logical adapters each sharing the same MAC address and Ethernet port. Incoming packets are decoded in hardware, and a hash value based on various
packet fields such as IP address, protocol and port is computed. Based on the hash value,
the network adapter places the packets in a specific queue. This way the kernel can simultaneously poll and transmit packets from each queue, thus maximizing the overall
performance. Unfortunately the operating systems are not mature enough to exploit this
feature, as they do not expose queues to user-space applications thus limiting them to
viewing the network adapter as a single entity. The outcome is that fetching packets in
parallel from the network adapter is not possible unless applications can directly access
the various queues. PF_RING [8] is a packet processing framework that we have developed that implements various mechanisms for enhancing packet processing and that also
allows applications to natively access adapters’ queues. Recently PF_RING has been enhanced with support of 10 Gbit DNA (Direct NIC Access) that allows applications to receive/transmit packets while completely bypassing the kernel, as they can access directly
the queues that have been previously mapped into user space memory. This means that
the cost of receiving/transmitting a packet is basically the cost of a memory read/write,
thus making 10 Gbit wire-rate packet RX/TX now manageable using commodity network adapters.

Important tutorials for FastFlow

