Best language to code (C, C++ or Java) for an app where speed and low latency are important
The problem is that there seems to be a lot of hype about FPGAs. Actually, not just FPGAs but competing systems in the low latency space. Various systems are quoting figures of 1, 3 or 5 microseconds. Comparing these is not straightforward, as some systems are stat arb and some are OMS. Even within a category the functionality will differ, so comparisons need to factor this in.
I saw an independent study on the web which implemented a solution both in software and on an FPGA. The serial solution ran four times faster in software than on the FPGA. For the parallel solution, once the number of concurrent threads exceeded the number of available cores, the FPGA solution was faster.
For an investment bank, a low latency DMA/program trading solution needs to trade across multiple exchanges. In Europe there are large differences in behaviour across exchanges. The trading system needs to normalise these differences as well as support custom transformations, franchise protection checks, risk checks (client position limits, restricted stocks, etc.) and exchange checks (e.g. price tolerance, tick scale, etc.). Normalisation requires full order management and correct handling of trade busts/cancels as well as amends and IOCs. This is a complex problem: do you (can you?) put all this in an FPGA card? Do you use some hybrid solution with an FPGA NIC? With old NICs and app-to-wire TCP times greater than 15ms I see the potential. But with OS and BIOS tuning and the latest NICs like Solarflare, with app-to-wire times around 2 micros (and Mellanox quoting 1 usec), what's the gain?
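To make the exchange-check part concrete, here is a minimal Java sketch of stateless per-order checks: a restricted-stock list, a tick scale (the tick size depends on the price band) and a price tolerance band around a reference price. The class, enum and parameter names, the integer price units and the band values used in the example are all hypothetical; real tick tables and tolerance rules differ per venue and would have to be loaded per exchange.

```java
import java.util.Set;

/** Outcome of a stateless per-order check (hypothetical names). */
enum CheckResult { ACCEPT, REJECT_RESTRICTED, REJECT_TICK, REJECT_PRICE_BAND }

/**
 * Stateless pre-trade checks: restricted list, tick scale and price tolerance.
 * Prices are integers in a hypothetical minimum price unit to avoid floating-point rounding.
 */
final class PreTradeChecks {
    private final Set<String> restricted;
    private final long[] bandFloors;  // ascending lower bounds of the tick-scale bands; first must be 0
    private final long[] ticks;       // tick size that applies from the matching floor upwards
    private final long toleranceBps;  // max distance from the reference price, in basis points

    PreTradeChecks(Set<String> restricted, long[] bandFloors, long[] ticks, long toleranceBps) {
        if (bandFloors.length == 0 || bandFloors.length != ticks.length || bandFloors[0] != 0) {
            throw new IllegalArgumentException("invalid tick scale");
        }
        this.restricted = Set.copyOf(restricted);
        this.bandFloors = bandFloors.clone();
        this.ticks = ticks.clone();
        this.toleranceBps = toleranceBps;
    }

    /** Tick size of the highest band whose floor is <= price. */
    private long tickFor(long price) {
        long tick = ticks[0];
        for (int i = 0; i < bandFloors.length && price >= bandFloors[i]; i++) {
            tick = ticks[i];
        }
        return tick;
    }

    CheckResult check(String symbol, long price, long referencePrice) {
        if (restricted.contains(symbol)) return CheckResult.REJECT_RESTRICTED;
        if (price % tickFor(price) != 0) return CheckResult.REJECT_TICK;
        // |price - ref| / ref > bps / 10_000, rearranged to stay in integer arithmetic
        // (assumes prices small enough that the products do not overflow a long)
        long diff = Math.abs(price - referencePrice);
        if (diff * 10_000 > toleranceBps * referencePrice) return CheckResult.REJECT_PRICE_BAND;
        return CheckResult.ACCEPT;
    }
}
```

For example, with band floors {0, 10_000}, ticks {1, 5} and a 500 bps tolerance, a price of 10_003 is rejected on tick size, and 11_000 against a reference of 10_000 is rejected on the price band.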
An object-orientated system using Java may have 70,000 lines of app code with 30,000 lines of unit test code. While the unit tests don't make the code faster, they do provide confidence and help facilitate fast development and bug fixing. Given the complexity of the problem and the requirement for fast code changes, I believe an OO solution is best. A boundary system with multiple input and output sessions will encode/decode FIX/exchange binary messages into real objects that represent the event (e.g. NewOrderSingle). How would you use an FPGA NIC to do the event decoding/encoding for a hybrid system?
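A minimal sketch of the decode side of such a boundary, assuming FIX tag=value encoding with the standard SOH (0x01) delimiter: it splits the raw message into tags and builds an immutable NewOrderSingle object (MsgType 35=D) from ClOrdID (11), Symbol (55), Side (54), OrderQty (38) and Price (44). The class and method names are hypothetical. The sketch assumes a limit order (so tag 44 is present), ignores repeating groups, skips BodyLength/CheckSum validation and allocates strings freely, which a latency-sensitive decoder would normally avoid.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

/** Immutable event object for a FIX NewOrderSingle (35=D); hypothetical field set. */
final class NewOrderSingle {
    final String clOrdId;
    final String symbol;
    final char side;         // tag 54: '1' = Buy, '2' = Sell
    final long orderQty;
    final BigDecimal price;  // decimal, avoids binary floating-point rounding

    NewOrderSingle(String clOrdId, String symbol, char side, long orderQty, BigDecimal price) {
        this.clOrdId = clOrdId;
        this.symbol = symbol;
        this.side = side;
        this.orderQty = orderQty;
        this.price = price;
    }
}

/** Decodes SOH-delimited FIX tag=value messages (no BodyLength/CheckSum validation). */
final class FixDecoder {
    private static final char SOH = '\u0001';

    static Map<Integer, String> parseFields(String msg) {
        Map<Integer, String> fields = new HashMap<>();
        int start = 0;
        while (start < msg.length()) {
            int eq = msg.indexOf('=', start);
            if (eq < 0) throw new IllegalArgumentException("missing '=' at offset " + start);
            int end = msg.indexOf(SOH, eq + 1);
            if (end < 0) throw new IllegalArgumentException("missing SOH after offset " + eq);
            fields.put(Integer.parseInt(msg.substring(start, eq)), msg.substring(eq + 1, end));
            start = end + 1;
        }
        return fields;
    }

    static NewOrderSingle decodeNewOrderSingle(String msg) {
        Map<Integer, String> f = parseFields(msg);
        if (!"D".equals(f.get(35))) {
            throw new IllegalArgumentException("not a NewOrderSingle: 35=" + f.get(35));
        }
        return new NewOrderSingle(
                require(f, 11),                    // ClOrdID
                require(f, 55),                    // Symbol
                require(f, 54).charAt(0),          // Side
                Long.parseLong(require(f, 38)),    // OrderQty
                new BigDecimal(require(f, 44)));   // Price (limit orders only in this sketch)
    }

    private static String require(Map<Integer, String> f, int tag) {
        String v = f.get(tag);
        if (v == null) throw new IllegalArgumentException("missing tag " + tag);
        return v;
    }
}
```

The encode direction would be the mirror image: the session layer turns an outbound event object back into tag=value (or an exchange binary format) for the target venue.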
I would expect FPGAs to have lower outliers than a CPU-based solution. But the key figures are the 90th and 95th percentiles. If you have the fastest system at the 95th percentile then, all other things being equal, you have the best chance of hitting the order book and successfully trading 95% of the time.
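As a sketch of how such figures can be computed from recorded latencies, the nearest-rank method below returns the smallest sample such that at least p% of all samples are less than or equal to it. The class and method names are hypothetical, and other percentile definitions (for example, interpolating ones) give slightly different values on small sample sets.

```java
import java.util.Arrays;

/** Percentiles over recorded latency samples (nearest-rank method). */
final class LatencyPercentiles {
    /**
     * Returns the smallest sample such that at least p percent of all samples are <= it.
     * p must be in (0, 100]; the input array is not modified.
     */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) throw new IllegalArgumentException("no samples");
        if (!(p > 0 && p <= 100)) throw new IllegalArgumentException("p out of range: " + p);
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        // 1-based nearest rank; p * n is computed first so whole-number results stay exact
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

Calling `percentile(samples, 95)` gives the figure being compared here; calling it again with 99 shows how far the tail extends beyond it.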
It's clear some people are building FPGA and FPGA hybrid systems. All I can say is good luck. It will be interesting to see how this evolves.
The original question for the thread is which is best for building a low latency system: C, C++ or Java (or FPGA). As already stated in other feedback, all three can be used to build a low latency system. Which is best to use is clearly debatable.
Yes, clearly it is a very complex system, and the conversation thread has strayed very far from which language is best… but as was suggested before, that topic's been beaten to death, and I am enjoying this conversation, so:
I think the key to using FPGAs effectively is to understand which parts of the application are invariant and which require multiple modalities because of exchange or market rules, etc. Depending on complexity, these variations can be accommodated as modes within the FPGA, or you can use separate FPGA program binaries if you can dedicate an FPGA resource to each market/exchange. What you want to avoid is a lot of hardware/software interaction, because then you lose the speed that the hardware was supposed to deliver. Another principle of optimization is to identify the “hot” code and prioritize it for embedding into the FPGA. I look at it as more of a tuning process than a once-and-done classic waterfall design. That's why, ideally, you would have firmware engineers who could work on both sides, so functionality could shift from software to hardware seamlessly. In this way the hardware/software mix is optimized based on meeting requirements and technical analysis, rather than on a given team's technical capabilities, which unfortunately is the de facto method that usually determines the mix in a real-world project.
I agree with many of your points and disagree with others. I believe the latency numbers for software you mention are minimums (not even averages). STAC and Solarflare published latency tests here: http://www.stacresearch.com/solarflare (free registration required). Under higher loads the latencies are much higher and maximums go into the milliseconds. I have no problem with ignoring the outliers beyond the 99th percentile or higher (personally I think that is waiting for a Black Swan to hit), but what were the conditions under which the outliers occurred? Heavy trade volume is when things matter most, and software does not fare well compared to hardware under those conditions.
I am curious to see the paper that shows the FPGA being slower than the CPU. The latest Xilinx Journal has a case study where Maxeler and JP Morgan used FPGAs to accelerate financial analytics. FPGAs do not have the overhead of processing instructions that CPUs do (there is a Microsoft Research paper for this posted in our LinkedIn group). We also have some papers which demonstrate where FPGAs can be used very well in financial applications. I do agree that a hybrid approach (FPGA, GPU, CPU) is best.
Comparing an FPGA-based NIC to a standard high performance NIC is comparing apples and oranges. The FPGA will have done a lot of processing before it hands off the data to the OS. I think this is a paradigm shift that will take time to adjust to.
This has been a very useful discussion. I would like to keep it going, since FPGA is a hot topic in the industry today. As with any technology, there are always pros and cons, and I think it is important to weigh the pros vs. cons and then decide which suits best based on the business model being approached. Any idea of any future technology in the R&D stages that could beat out the software and FPGA model?
The latency numbers I mentioned were taken using a 10GE Corvil device with taps on the NICs.
At 125,000 order events per second on a single socket using TCP_NODELAY, the trading system's 90th percentile internal latency is 6 micros, with a mean of 5 micros. The app-to-wire time is 2.5 micros (it should be 2 usecs with a single-port 10GE Solarflare). At the 99th percentile the app latency increases to 15 usecs, with an app-to-wire time of 3 usecs. At 1,000 order events per second the 99th percentile is 8 micros, with an app-to-wire time of 2.5 usecs. Note this is a full DMA FE+OM+MA trading system.
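A rough reading of these figures, assuming events arrive evenly at rate λ, are processed one at a time on that socket, and each take no more than the 5 µs mean (s̄) to process: the mean gap between events Δt, and an upper bound on the fraction of time the processing thread is busy ρ, are

```latex
\[
\Delta t = \frac{1}{\lambda}, \qquad \rho \le \frac{\bar{s}}{\Delta t}
\]
\[
\lambda = 125\,000\ \mathrm{s}^{-1}:\quad \Delta t = 8\ \mu\mathrm{s},\quad \rho \le \tfrac{5}{8} = 0.625
\]
\[
\lambda = 1\,000\ \mathrm{s}^{-1}:\quad \Delta t = 1000\ \mu\mathrm{s},\quad \rho \le \tfrac{5}{1000} = 0.005
\]
```

At the higher rate a new event plausibly often arrives while the previous one is still being processed, which is consistent with, though does not by itself explain, the 99th percentile rising from 8 µs at 1,000 events per second to 15 µs at 125,000.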
Of course, to get these timings you have to tune the BIOS, the OS (which was Red Hat 5.5) and the JVM.
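In Java, the TCP_NODELAY option mentioned above is set per socket through `java.net.Socket.setTcpNoDelay`, which disables Nagle's algorithm so that small order messages are sent immediately rather than coalesced. A minimal sketch, with hypothetical class and method names; the BIOS, OS and JVM tuning is configured outside the application and is not shown.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Helpers for creating order-session sockets with Nagle's algorithm disabled (hypothetical). */
final class LowLatencySockets {
    /** Creates an unconnected socket with TCP_NODELAY set, ready to connect. */
    static Socket newNoDelaySocket() throws IOException {
        Socket socket = new Socket();
        socket.setTcpNoDelay(true);  // send small writes immediately instead of waiting to coalesce
        return socket;
    }

    /** Connects to an exchange or client endpoint with TCP_NODELAY already enabled. */
    static Socket connect(String host, int port, int timeoutMillis) throws IOException {
        Socket socket = newNoDelaySocket();
        socket.connect(new InetSocketAddress(host, port), timeoutMillis);
        return socket;
    }
}
```

Setting the option before `connect` means it is already in effect for the first message sent on the session.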
Regarding the FPGA report, I will post a link when I can dig it out.
With all this talk of FPGA design, I am wondering if anyone is writing custom VHDL, or are most financial FPGA applications relying on vendor-supplied binaries with canned, off-the-shelf functionality? Or are people mixing vendor-supplied IP for vanilla functionality with their proprietary “secret sauce”?
Having the design ability in-house makes a big difference with FPGA designs. Modern FPGA design flows have a routing step which is actually non-deterministic, so once you finalize the VHDL “circuit”, the hardware designer has to run the route and verify timing. So you could end up with technically correct VHDL that results in a binary that fails on timing.
I would imagine that a financial firm that was overly reliant on external expertise would have a hard time working with FPGAs because of issues like these. But I would be curious about any experiences that people working on financial FPGA applications care to relate.
Another technology to be aware of is CUDA: NVIDIA's GPU multicore platform, a potential challenger to Intel's CPU hegemony.
A major issue is cumulative risk checking (position based checking). E.g. short-sell vs loan availability or long sell vs settled position. As soon as you have multiple accounts trading large numbers of securities, and you add different types of booking (cash, CFD, Swap-settled, etc.), the business logic becomes very complex and can’t really be offloaded. Not to mention margin risk checks for derivatives trading.
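A minimal sketch of why such checks are stateful: each accepted sell reserves quantity against the settled position (long sell) or the available loan (short sell) for that account and symbol, and the reservation must be released when the order is cancelled. Class and method names are hypothetical, loans are simplified to per-account, per-symbol limits, and fills, settlement, booking types (cash, CFD, swap) and margin are deliberately left out; they are exactly what makes the real logic complex.

```java
import java.util.HashMap;
import java.util.Map;

/** Cumulative sell-capacity check per account and symbol (hypothetical, heavily simplified). */
final class SellCapacityCheck {
    private final Map<String, Long> settledPosition = new HashMap<>();
    private final Map<String, Long> loanAvailable = new HashMap<>();
    private final Map<String, Long> openLongSells = new HashMap<>();   // reserved against settled position
    private final Map<String, Long> openShortSells = new HashMap<>();  // reserved against loan availability

    private static String key(String account, String symbol) {
        return account + '|' + symbol;
    }

    void setSettledPosition(String account, String symbol, long qty) {
        settledPosition.put(key(account, symbol), qty);
    }

    void setLoanAvailable(String account, String symbol, long qty) {
        loanAvailable.put(key(account, symbol), qty);
    }

    /** Long sell: open long sells plus this order must not exceed the settled position. */
    boolean tryLongSell(String account, String symbol, long qty) {
        return reserve(openLongSells, settledPosition, key(account, symbol), qty);
    }

    /** Short sell: open short sells plus this order must not exceed the available loan. */
    boolean tryShortSell(String account, String symbol, long qty) {
        return reserve(openShortSells, loanAvailable, key(account, symbol), qty);
    }

    /** Releases a reservation, e.g. when a sell order is cancelled or rejected by the exchange. */
    void release(String account, String symbol, long qty, boolean shortSell) {
        Map<String, Long> open = shortSell ? openShortSells : openLongSells;
        open.merge(key(account, symbol), -qty, Long::sum);
    }

    private static boolean reserve(Map<String, Long> open, Map<String, Long> limit, String k, long qty) {
        long used = open.getOrDefault(k, 0L);
        if (used + qty > limit.getOrDefault(k, 0L)) return false;
        open.put(k, used + qty);
        return true;
    }
}
```

Even this toy version has to see every order, cancel and reject for an account in sequence, which is why it is hard to offload as a stateless filter.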
A lot of solutions address this by doing minimal stateless compliance-required per-order checking in-band (restricted list, price range etc.), but doing the complex cumulative risk-checks in a separate system that simply monitors the flow without delaying orders. But a lot of brokers won’t accept the risk of being momentarily exposed and having to chase-cancel. Also, in Asia, you have to load balance across multiple market connections for all markets, so simple pass-through type systems won’t work.
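A minimal sketch of spreading orders over several market connections: a round-robin selector that skips sessions marked as down. The names are hypothetical, and a real gateway would also have to honour per-session throttles and keep amends and cancels on the session that carried the original order where the exchange requires it.

```java
import java.util.Arrays;

/** Round-robin selection over N market sessions, skipping sessions marked down (hypothetical). */
final class SessionBalancer {
    private final boolean[] up;
    private int next = 0;

    SessionBalancer(int sessions) {
        if (sessions <= 0) throw new IllegalArgumentException("need at least one session");
        up = new boolean[sessions];
        Arrays.fill(up, true);
    }

    void setUp(int session, boolean isUp) {
        up[session] = isUp;
    }

    /** Returns the next available session index, or -1 if every session is down. */
    int pick() {
        for (int i = 0; i < up.length; i++) {
            int candidate = (next + i) % up.length;
            if (up[candidate]) {
                next = (candidate + 1) % up.length;  // resume after the session just used
                return candidate;
            }
        }
        return -1;
    }
}
```

With three sessions the picks cycle 0, 1, 2, 0; if session 1 goes down, the cycle continues over 2 and 0 only.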
There are a few very sophisticated clients that are load balancing themselves and want essentially naked access to the market. But the vast majority of electronic trading clients need the gateway to be more intelligent than a simple FIFO queue with some stateless filtering. All these headline latency claims are basically meaningless given the diversity of solutions in the space and the lack of transparency about what is being measured at what rate in what environment with what market.
We’ve looked at FPGAs but our judgment is that iWarp gets you most of what you would get with FPGA acceleration. The core application will still need to be done in software with an OS. Especially if you also add replication & failover to the requirements.
On the original topic, I think Java is doable but requires a lot of work, and unless you need to express your business logic in Java you’re probably better off with C++.
There’s also the issue of FIX. The most demanding clients hate it and want to use market-native protocols. But you typically have to enrich those with account/strategy info etc., plus any pass-back parameters (e.g. parent order IDs). So you end up with e.g. OUCH+ or Arrowhead+, which isn’t really any kind of standard, so cue broker lock-in. But if you do minimal message transformation and minimal compliance risk checks, and pass messages through essentially untransformed, you can get down to single microseconds. That is a very niche solution, attractive to a very small number of players, and the more generic solutions are not far behind in latency anyway while offering much richer functionality.
From a LinkedIn group discussion