Tag Archives: super computer

Mexican super computer for HFT


I have thought about this cross exchange stock pricing strategy once

Alberto Alonso, director of GACS, next to the Breogan supercomputer.

On a gated residential street about an hour’s drive south of Mexico City’s main business district lives Breogan, a $350,000 computer that Alberto Alonso built to shake up the nation’s stock market.

To read the entire article, go to http://bloom.bg/25NZMgp

Join my FREE newsletter to learn more about building these types of systems for automated trading


NOTE I now post my TRADING ALERTS into my personal FACEBOOK ACCOUNT and TWITTER. Don't worry as I don't post stupid cat videos or what I eat!

Here is how JP Morgan decided to use FPGAs as a super computer for their risk management needs


Yes I want to stick with low level high frequency trading stuff. Here are some considerations for it.

Here is how JP Morgan decided to use FPGAs as a super computer for their risk management needs

An old technology, but I am sure it is still useful for them. I think I mentioned Maxeler before, thanks to that Barclays video on HFT.


Is it that easy to build your own custom Linux kernel? Oh wait, I am still hung over from last night

It does look as hard as one would have thought, but these are only words. What hell awaiteth me?


That alone is a lot to chew on. Some requested this, so I decided to listen: I extended this offering until Sunday midnight due to all these technical firewall issues.

Introduction to Quant Elite Membership

It doubles after that so I hope that is ok.






Join my FREE newsletter if you are ever interested in this HFT stuff



CUDA as a super computer: exciting, with libraries for neural net learning, genetic algos, FFT, PCA, BLAS

Hi there

The QuantLabs.net rate increase is coming and is scheduled for the first week of January 2013. See details below, but why wait?

The march to my high frequency trading system continues! I have listed below the latest accomplishments in the world of GPU, CUDA, and Matlab. One major takeaway in all this is that if you choose CUDA, you will be happy to develop with the new CUDA 5 math libraries. All in all, it is very powerful and mind-blowingly fast!

1. HFT Youtube video Demo of $30 96 core Nvidia CUDA GPU with Microsoft Visual C++ for ultra fast quant analysis


2. How to get your Microsoft Windows Visual C++ CUDA sample files working with Nvidia Geforce CUDA GPU board


3. GPU CUDA 3rd party high level C++ library for math awesomeness with genetic algorithm, neural net learning, PCA, FFT, BLAS


4. Using CUDA GPU to build a HFT super computer. The debate of Windows versus Linux is also over!


CUDA is very valuable to any trading firm that uses it, so I am making a huge investment in it. I have also confirmed that CUDA software developers charge top dollar, so my thinking is that if I supply ready-to-drop-in code for an HFT system, I should charge top dollar for the QuantLabs.net Premium membership. As a result, I am leaning that way and am scheduling the rate increase for the first week of January 2013. I am not sure if it will be 25% or 50%. Either way, you had better get in on the membership action while it is still very, very affordable.


Membership benefits here.

P.S. Tomorrow, further secrets will be revealed about Matlab’s data analysis power.



Using CUDA GPU to build a HFT super computer. The debate of Windows versus Linux is also over! #linux #cuda #gpu #windows #hft


So you want to build a super computer?  Nvidia has this interesting article:


You could build your solution with a decent-quality provider such as Super Micro: http://www.supermicro.com/products/system/4U/7046/SYS-7046GT-TRF.cfm

I can confirm some big HFT shops use this brand. You can play with the calculator to customize your hardware, including the CUDA boards. Go play.

From my calculations, you could potentially build a quick 8000-core system for under $10K, but maybe you should also hold out for the future as prices may come down.

A few notes from my 12 hours of CUDA knowledge: commodity PCs have only 1 slot for a CUDA board, so you cannot put in multiple boards as I originally thought you could. The system above can, so keep that in mind.

As for operating systems:

The operating system debate is kind of over on my end. I have wasted countless weeks on Linux, while Windows has saved me in productivity. As the Nvidia link mentioned, for optimal performance you should be on Windows XP 64 bit or Linux 64 bit. To benchmark the comparison, I would like to see the exact same CUDA hardware configuration set up with the exact same software (like the Nvidia C++ financial examples) on both. I won’t entertain this debate further until a video is provided showing the results. Or better yet, I would like to see Nvidia do this benchmark comparison.

Also, XP is easier to work with compared to the horrendous setup with Linux. I really don’t need to hear the argument that Linux is more stable and secure. Security is not a concern when your server is properly ring-fenced with a decent firewall and other configuration. As for stability, I wonder how banks use XP throughout their ecosystem with little concern. So you reboot the box late every night to clear out any dangling processes. If XP is indeed just as fast, I ask myself why I would go down the path of pulling out my hair over subpar Linux administration and setup.

From my high-level research, the difference comes down to drivers. I wonder if the Nvidia-provided drivers are really much better on Linux or Mac OS X, but only Nvidia can really answer that. As said, they do offer the choice of XP and Linux, so maybe the performance is the same. Who knows? Who cares? I just want to build a working HFT system with strategies/algos that generate lots of $$.

All in all, if you are really worried about it, develop your platform using only C++ libraries with no Microsoft .NET calls. That way it should port over fairly quickly no matter which operating system you choose for your PRODUCTION LIVE system. Just a note: don’t you hate it when Microsoft bastardizes a language like C++ anyhow?

See which way I go by joining my free newsletter




Amazon Builds World’s Fastest Nonexistent Super Computer according to Wired.com


The 42nd fastest supercomputer on earth doesn’t exist. This fall, Amazon built a virtual supercomputer atop its Elastic Compute Cloud — a web service that spins up virtual servers whenever you want them — and this nonexistent mega



How to measure a supercomputer in the cloud? Isn’t it the same as just adding up the GFLOPS of the home computers which provide compute power to the SETI project? Of course this is a rather philosophical discussion: Is it a supercomputer if it is virtual? In other words: “Calculo ergo sum”?

The point is IMHO that an existing supercomputer is able to provide a certain compute power at a certain point in time. Nonexistent, cloud-based computers can do so if there are dedicated (reserved) distributed compute nodes assigned to a defined task. Just a number of compute nodes which could possibly deliver some TFLOPS does not make a supercomputer.

What do you think?


The Top500 list, on which the Amazon computer is number 42, is based on the LINPACK benchmark (http://en.wikipedia.org/wiki/LINPACK). While some argue that LINPACK doesn’t stress the cluster network, all these computers had to do some task jointly. In the Amazon cloud you can probably reserve entire machines, so your nodes won’t be perturbed by other applications.





Building one’s own super-computer?

I’d like to build my own super-computer. First, money is an object, so I’d like to keep the expenses down. I’ve tried using the ATI 5870 for off-loading computation, but because my app has a lot of random memory probes, there wasn’t any speed-up. So, I am limited to using multiple CPUs. Is using a quad Opteron board the best solution? Are there any exotic high-count MIMD architectures available? Also, is the 12 CPU Opteron better than an 8-way HT Intel chip?

The cluster to which I have daily access has both quad Opteron and quad Xeon nodes, and in terms of memory access performance the Xeon nodes perform significantly better. The performance difference probably comes from the way the cache memory is being used, although the Quick Path Interconnect (QPI) probably plays an important role there. I haven’t had the time to look into these things as seriously as I wanted; there’s some really fascinating stuff to be read in the optimization guides AMD and Intel offer.

On the other hand, if your performance requirements don’t go through the ceiling you can stick with the Opterons and you’ll be fine. The performance of the Xeon nodes in the cluster I’m working on is about 1.4-1.9 times better than that of the Opteron-based nodes, but I’m really happy with the execution time on both of them. In fact, a 12 CPU Opteron might be a better option for a development platform since you can get a better idea about the scalability of your application. If your application scales well you’ll probably get comparable results against an 8-way HT Intel in the same performance class. The benefit you get from HT does depend on your app; read through Intel’s specs and see if anything applies to your case. My programs have to do heavy data crunching so hyperthreading doesn’t help me with anything, but I have seen cases where it provides some performance improvement (albeit fairly modest).

In terms of exotic architectures there’s probably some luck to be had with ARM, although you are probably going to be limited to distributed-memory systems as multi-core ARM systems able to sustain the kind of work required in HPC are in their early days. But I think this is something to look for in a year or two.

Henry Neeman from the OU Supercomputing Center for Education and Research in Norman goes through this exercise where he looks at transfer speed for his big cluster, and he says: “So, even at theoretical peak, any code that does less than 73 calculations per byte transferred between RAM and cache has speed limited by RAM bandwidth.”

For that reason, I went with 8-core instead of 12-core.
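The break-even figure in the quote is the ratio of peak compute rate to memory bandwidth. Here is a minimal sketch of that arithmetic in Python. The hardware numbers are made-up placeholders (not the figures behind the 73 in the quote) and the function names are my own.

```python
def break_even_intensity(peak_flops, mem_bandwidth_bytes):
    """Calculations per byte needed before compute, not RAM, is the limit."""
    return peak_flops / mem_bandwidth_bytes


def attainable_flops(peak_flops, mem_bandwidth_bytes, calcs_per_byte):
    """Simple roofline estimate: the lower of the compute and memory ceilings."""
    return min(peak_flops, mem_bandwidth_bytes * calcs_per_byte)


# Hypothetical node (assumed values): 100 GFLOPS peak, 10 GB/s RAM bandwidth.
PEAK = 100e9
BANDWIDTH = 10e9

print("break-even calcs/byte:", break_even_intensity(PEAK, BANDWIDTH))
print("attainable at 2 calcs/byte:", attainable_flops(PEAK, BANDWIDTH, 2.0))
print("attainable at 50 calcs/byte:", attainable_flops(PEAK, BANDWIDTH, 50.0))
```

Adding cores raises the peak but not the RAM bandwidth, so the break-even point moves higher. That is one way to read the 8-core choice above, though the poster does not spell out the reasoning.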

His “Supercomputing in Plain English” Workshop is great and will be offered weekly starting Tue Jan 25, 2-3 Central, via H.323, WebEx, QuickTime, EVO, and Access Grid. See http://www.oscer.ou.edu/education.php

I think some more info would really help, mainly what is your biggest goal with this project? Are you just looking to work more with multicore application development? HPC architecture, engineering, and designs? HPC OS deployment and management? MPI? Are you trying to build a POC to show your company the benefits of an HPC environment? Is this going into production for a company or research group? How many users? The list goes on.

Starting at the more basic end –

If you’re looking to build a platform for application development and deployment testing, you can build a “supercomputer” out of spare desktop PCs. If you can learn/use Linux, there are several free HPC implementations of Linux distributions out there that you can deploy, then build your own applications on and run on a multi-node/multi-core platform. I’ve got two clusters I run at our facility, one built out of spare desktop PCs that were heading for the scrap heap, and another out of servers that came out of our production datacenter. The former I use exclusively; the latter our application developers have access to for testing their code outside a production environment. Neither gives production-quality performance (they are jokingly referred to as “Low Performance Compute Clusters”), but to get in and run the software, test applications out, and test new settings, features, designs, etc. with our OS, they’re perfect.

If, however, you’re looking to build a production HPC cluster for users/researchers/developers to run and actually use for research, modeling, simulations, number crunching, etc., then I should warn you there are dozens of other considerations beyond CPU design and architecture to take into account. Buying a rack’s (or 20 racks’) worth of servers and stringing them all together won’t give you the best bang for the buck. Power, cooling, hardware vendor, memory/hard-drive/clock speed on each node, the node interconnect technology, OS selection/licensing, support contracts… the list goes on again.

As far as raw CPU architectures go, we recently evaluated the 12-core AMD Opterons against the 6-core Intel Nehalem/Westmere chips. On similar (near identical) HP hardware, with a standard 4 GB of memory per core (2 chips on each board meant 24 cores on the AMDs and 12 cores on the Intels), the performance was night and day: the 2.6 GHz Intels smoked the 2.3 GHz AMDs by a factor of 1.5. We also tried various math libraries: the Intel mathlib, the AMD mathlib, and a few others. In short, core for core the Intels perform much better. You’ll also pay a higher price.

The AMDs had a different advantage though: per /node/ the total Mflops/Gflops performance came out much higher, mainly because they had 2x the number of cores. Note, however, that the Gflops were NOT 2x the Intels’. With 2x the cores, they got about 1.4x, maybe 1.5x, the performance. The performance numbers don’t scale linearly between architectures. Our group chose to base performance on core capabilities, not node capabilities, so the advantage was lost. However, the price difference was high enough that I had to give the AMDs serious evaluation or I’d be remiss in my responsibility to the company. We ended up selecting the 3.3 GHz Intel 6-core Westmeres. They’re higher wattage, and thus cost more in power, but the performance is overwhelmingly better.
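To make the per-node versus per-core distinction concrete, here is a small Python sketch that converts the node-level ratios quoted above into per-core ratios. Only the 2x core count and the 1.4x to 1.5x node figures come from the post; the function name is my own.

```python
def per_core_ratio(node_ratio, core_count_ratio):
    """Relative per-core throughput, given node-level and core-count ratios."""
    return node_ratio / core_count_ratio


# From the post: the AMD node has 2x the cores and 1.4x to 1.5x the throughput.
for node_ratio in (1.4, 1.5):
    amd_core = per_core_ratio(node_ratio, 2.0)
    print(f"node {node_ratio}x -> one AMD core does about {amd_core:.2f} "
          f"of an Intel core (Intel core about {1 / amd_core:.2f}x faster)")
```

On these numbers an Intel core comes out roughly 1.3x to 1.4x faster than an AMD core, which is in the same ballpark as the factor of 1.5 quoted for the core-for-core comparison.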

We also tested the difference between using Intel’s hyperthreading and not. We found that there was actually an increase in performance by using hyperthreading. We are able to run 24 simultaneous threads, with a 10% to 15% drop in overall thread performance. Given that you’re running 2x the threads on the same hardware, that performance-per-thread drop is already outdone by the amount of processing accomplished with 2x the threads, on the same hardware. Our engineers were quite surprised, and pleasantly so. (followup…)
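The hyperthreading trade-off described above is simple arithmetic, sketched here in Python. It takes the quoted figures at face value and assumes the 10% to 15% drop is per thread, measured against running one thread per core; the function name is hypothetical.

```python
def ht_throughput_gain(thread_multiplier, per_thread_drop):
    """Overall throughput relative to the run without hyperthreading."""
    return thread_multiplier * (1.0 - per_thread_drop)


# From the post: 2x the threads, each running 10% to 15% slower.
for drop in (0.10, 0.15):
    gain = ht_throughput_gain(2.0, drop)
    print(f"per-thread drop {drop:.0%} -> overall throughput x{gain:.2f}")
```

Under that reading, the per-thread slowdown would have to reach 50% before doubling the thread count stopped paying off.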
If performance requirements aren’t overwhelming, and you’re working on a budget, the AMDs are absolutely worth evaluating. They don’t perform as well, but they don’t cost nearly as much as the Intels. And his point about experimenting with application scaling based on the number of cores is spot on.

We chose to stick with x86 architectures because of the custom code we’re working with, and nobody felt like porting over to ARM or PowerPC, even though there are arguably some strong performance benefits. Most of the x86 hardware is also much more accessible, and can be replaced/augmented at a wider selection of shops/vendors/resellers.

I hope this helps. Do feel free to send me a note or a message if you’d like any more info on our results. I’m also curious to hear more about what you’re looking to accomplish. Best of luck –

I am working on a neural network application with 200K neurons in 4 2-D layers and 100 connections per neuron. Connections follow a probability distribution (think of a Mexican hat), mostly between adjacent layers. The point is that there are quite a few random RAM probes when summing up impulses. I have a quad-XEON, which I’m running multi-threaded with 8 threads built on OpenMP. Conceptually at least, the application partitions nicely, and I’ve gotten near-linear speed-up with more threads. Additionally, I tried using an ATI 5870, along with every trick I could think of in OpenCL to get speed, but the XEON was still faster.

The neural network takes 1-3 days to train on simple problems. I’d like to get this way down, as I would like to start working on more complex problems. This is why I’m wondering what I can do to get real speed. I’m considering getting another XEON, as I have a dual CPU board; hopefully, this will halve the computation time.

Sounds like you’re running on a single server at the moment?

There are some types of computation (mostly linear algebra) that the GPUs excel at. I’ve heard from several developers that the architecture is not well suited for other types of computations, especially those involving lots of branching or forking. I’m not a dev so I don’t know the specifics, but I’m not surprised you’re still getting better performance under x86 hardware. You might find that CUDA and the Nvidia GPU cards give you better performance as those are dedicated computation boards, rather than video cards that also interpret OpenCL.

How chatty is your application over MPI? Is there lots of inter-thread communication? Or just the occasional reduce/sync?

To get “real speed” as you call it, I’d look into getting several servers with at minimum a 1 Gbit interconnect, lots of RAM, RAID-based storage on each node rather than a single hard drive (depending on your disk I/O needs), and possibly shared storage depending on your output file sizes and throughput. It’s probably safe to start with more nodes, RAM, and CPU, then beef up storage if you find that disk and file reads/writes are a choke point. Start with the basics then build from there, but at least 3 or 4 nodes with 2x quad or 6-core chips will give you a lot of bang for the buck.

As others have said, it would be helpful to learn more about your application, etc.

Also, you might consider renting time on something like EC2, especially if you’re not going to use the machine a lot. Here’s an interesting article on the subject:

I tried EC2 earlier; unfortunately, that gave me rather poor performance, something like 5X slower than on a single box. There are enough dependencies between neural connections, and subsequently memory accesses, that the best approach appears to be having the entire model sitting in RAM on one platform.

I am running a single CPU right now. I will probably add a second CPU next month, when Intel comes out with their new XEON chips; I’d like to get a 6-core of some sort. I’m concerned about the long-term scalability of this solution, as my current quad-CPU takes over a day to run a test. It sure would be nice to come up with a 10X solution, and see the way to 100X, as I plan to expand the number of connections per neuron to at least 1000, from the current 100.
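A rough size estimate helps show why the random RAM probes hurt. The Python sketch below multiplies out the connection table for the current and planned network. The 8 bytes per connection (a 4-byte weight plus a 4-byte source index) is purely an assumption for illustration; the post does not describe the actual storage layout.

```python
def connection_bytes(neurons, connections_per_neuron, bytes_per_connection=8):
    """Total bytes for the connection table (8 bytes/connection is assumed)."""
    return neurons * connections_per_neuron * bytes_per_connection


NEURONS = 200_000  # from the post

# 100 is the current fan-in from the post, 1000 the planned one.
for fan_in in (100, 1000):
    size = connection_bytes(NEURONS, fan_in)
    print(f"{fan_in} connections/neuron: about {size / 1e6:.0f} MB")
```

Under that assumption the table is about 160 MB today and about 1600 MB at 1000 connections per neuron. Either is far larger than a CPU’s on-chip cache, which is measured in megabytes, so scattered lookups mostly go out to RAM.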

Do beware that if you’re going to put 2 CPUs in one server, you *have* to make them the same. Same model, same core count, same clock speed. There are no two ways about this I’m afraid. So if you’re planning to get an upgraded chip you’ll need to buy two (making sure your server will support that specific model) or buy a new server.

I don’t know where in the world you are, but sometimes computer or corporate recycling centers get some above-average equipment through at very good prices. You would probably need to beef it up with RAM or hard drives or such, but it could be worth it. You can also get off-corporate-lease stuff at a very good price that will only be a generation or two behind: still more than adequate for most tasks, especially strung together in parallel. If you’d like more info let me know and I’ll find you some links.
Thanks for the heads up. Do you know why they need to be the same? Is it the motherboard, the OS, or the chips themselves? Now I’m wondering if I’d be better off getting the same quad XEON, or upgrading to a 6-core with a faster clock. Also, I don’t know how to check if the 6-core will be compatible with my motherboard; the socket is the same, but are there other considerations?

It’s clock synchronization. The system won’t know how to handle two different speeds. Logic says if you put in a faster chip of the same line that it might just downclock it to match the slower, but I’ve never tried and everything I read says they must be the same.

If you can find out the type of motherboard you have, you can check its web site and Intel’s ARK page to find a list of compatible chips. 4-core boards may not be compatible with 6-core chips. They’re pretty specific. You’ll definitely want to do the research before you buy to make sure you get hardware that plays nicely together.

Is it a name brand computer like Dell or HP? If so, what’s the model?

I am using a Tyan 7025 motherboard. I’ve briefly looked at their online docs; I didn’t see anything about matching processors, although what you’ve posted makes sense. Tyan did change the online spec to indicate that the board is compatible with Nehalem-EP; when I bought the board last year, it was rated for Nehalem. I wonder if they’ve changed the board, or just the doc.

How’s this going? Did you get your upgrade?

I did get an additional processor; I’ll put it in over the weekend. I contacted Tyan, and they told me that I had to match the processor, just as you had suggested. Thanks.

It seems to me that the biggest issue with parallel processing is memory contention. I wonder when someone is going to tile a chip with the 64-bit version of ARM cores, and put a lot of cache on the chip, instead of pushing vectorization.

There’s been a lot of discussion recently about that topic. People are wondering where ccNUMA is going, and someone just the other day asked me if ARM was going to become more prevalent in HPC technology (questions to which I don’t have even a foggy answer). Writing a program, you see things from a very different perspective than I do. I’ll be curious to see what happens with your designs – please send me a note some time and let me know how it’s going!

One thing I’m wondering about, in this dual CPU environment, is whether memory access is uniform between processors. If it is not, does OpenMP have memory allocators to bind given memory allocations to specific processors? This goes along with the notion of thread affinity for processors: having memory allocation affinity as well.

Even if memory access is uniform, as there appear to be 2 memory controllers on the board, it would still be preferable to have memory allocation affinity.

Doug Eadline’s Clustermonkey site will be a good resource for you:

You can go the Cluster Monkey route with clustered PS3s/motherboards à la Beowulf solutions, or look at the surplus option of used Crays or older SGIs. Get ’em cheap, reload them with Linux, and cluster ’em. You may have to do some low-level driver development (time to earn your spurs), but in the end you have a DIY HPC solution.

Here’s a recent article I came across that I thought might help you out as well: http://arstechnica.com/science/news/2011/03/high-performance-computing-on-gamer-pcs-part-1-hardware.ars

ARM chips are very efficient, but they are not up to large-scale HPC, and you will find that they need quite a lot more code to do the same thing as a Xeon. This is the same sort of problem that you would see with the ATI solution you tried: they work very well for specific data sets, but not so well for generic large-scale algorithms. It’s usually a balancing act to find the best hardware for the specific problem.

A possible way around these issues is a change in ideology: developing software that runs as systems of very finite packets. FBP is an example of this (as is some of the IP cloud software). This way data and code are closely associated and can maximise the use of most hardware systems.

Finally, to get the best out of a system, profile. Get hold of VTune or similar and profile every little bit of execution. This will give you the best hope of gaining performance improvements; with some luck, there could be some inline asm replacements that make large changes in performance. A classic example of a big runtime improvement is replacing any std::map calls at runtime with hash arrays or dual coupled arrays: huge performance improvements.

