Building one’s own super-computer?
I’d like to build my own super-computer. First, money is an object, so I’d like to keep the expenses down. I’ve tried using the ATI 5870 for off-loading computation, but because my app has a lot of random memory probes, there wasn’t any speed-up. So, I am limited to using multiple CPUs. Is using a quad Opteron board the best solution? Are there any exotic high-count MIMD architectures available? Also, is the 12-core Opteron better than an 8-way HT Intel chip?
The cluster I have daily access to has both quad Opteron and quad Xeon nodes, and in terms of memory access performance the Xeon nodes perform significantly better. The performance difference probably comes from the way the cache memory is being used, although the Quick Path Interconnect (QPI) probably plays an important role there. I haven’t had the time to look into these things as seriously as I wanted; there’s some really fascinating stuff to be read in the optimization guides AMD and Intel offer.
On the other hand, if your performance requirements don’t go through the ceiling you can stick with the Opterons and you’ll be fine. The performance of the Xeon nodes in the cluster I’m working on is about 1.4-1.9 times better than that of the Opteron-based nodes, but I’m really happy with the execution time on both of them. In fact, a 12-core Opteron might be a better option for a development platform since you can get a better idea about the scalability of your application. If your application scales well you’ll probably get comparable results against an 8-way HT Intel in the same performance class. The benefit you get from HT does depend on your app; read through Intel’s specs and see if anything applies to your case. My programs have to do heavy data crunching so hyperthreading doesn’t help me with anything, but I have seen cases where it provides some performance improvement (albeit fairly modest).
In terms of exotic architectures there’s probably some luck to be had with ARM, although you are probably going to be limited to distributed-memory systems as multi-core ARM systems able to sustain the kind of work required in HPC are in their early days. But I think this is something to look for in a year or two.
Henry Neeman from the OU Supercomputing Center for Education and Research in Norman goes through this exercise where he looks at transfer speed for his big cluster, and he says: “So, even at theoretical peak, any code that does less than 73 calculations per byte transferred between RAM and cache has speed limited by RAM bandwidth.”
For that reason, I went with 8-core instead of 12-core.
His “Supercomputing in Plain English” Workshop is great and will be offered weekly starting Tue Jan 25 2-3 Central via H.323, WebEx, QuickTime, EVO, Access Grid. See http://www.oscer.ou.edu/education.php
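Neeman’s rule of thumb quoted above is a machine-balance argument: divide the peak calculation rate by the RAM bandwidth, and the result is how many calculations per byte a code must do before it stops being limited by memory. A minimal sketch in Python follows. The function names and the sample figures (100 GFLOP/s peak, 25 GB/s bandwidth) are illustrative assumptions, not the numbers for Neeman’s cluster.

```python
def machine_balance(peak_flops_per_s, ram_bytes_per_s):
    """Calculations per byte needed to keep the cores busy at peak."""
    if ram_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return peak_flops_per_s / ram_bytes_per_s


def is_bandwidth_bound(calcs_per_byte, balance):
    """True when a code does too little work per byte to hide RAM traffic."""
    return calcs_per_byte < balance


# Hypothetical node: 100 GFLOP/s peak, 25 GB/s RAM bandwidth (assumed figures).
balance = machine_balance(100e9, 25e9)
print(f"balance: {balance:.1f} calculations per byte")

# Assumed workload: one calculation per 8-byte double loaded from RAM.
print("bandwidth-bound:", is_bandwidth_bound(1 / 8, balance))
```

Adding cores raises the peak rate without raising the bandwidth, so the balance figure grows and more codes end up waiting on RAM. That is presumably the reasoning behind choosing 8 cores over 12.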
I think some more info would really help, mainly what is your biggest goal with this project? Are you just looking to work more with multicore application development? HPC architecture, engineering, and designs? HPC OS deployment and management? MPI? Are you trying to build a POC to show your company the benefits of an HPC environment? Is this going into production for a company or research group? How many users? The list goes on.
Starting at the more basic end –
If you’re looking to build a platform for application development and deployment testing, you can build a “supercomputer” out of spare desktop PCs. If you can learn/use Linux, there are several free HPC implementations of Linux distributions out there you can deploy, then build your own applications on and run on a multi-node/multi-core platform. I’ve got two clusters I run at our facility, one built out of spare desktop PCs that were heading for the scrap-heap, and another out of servers that came out of our production datacenter. The former I use exclusively; the latter our application developers have access to for testing their code outside of a production environment. Neither gives production-quality performance (jokingly referred to as “Low Performance Compute Clusters”), but to get in and run the software, test applications out, and test new settings, features, designs, etc. with our OS, they’re perfect.
If, however, you’re looking to build a production HPC cluster for users/researchers/developers to run and actually use for research, modeling, simulations, number crunching, etc., then I should warn you there are dozens of other considerations beyond CPU design and architecture you need to take into account. Buying a rack’s (or 20 racks’) worth of servers and stringing them all together won’t give you the best bang for the buck. Power, cooling, hardware vendor, memory/hard-drive/clock speed on each node, the node interconnect technology, OS selection/licensing, support contracts… the list goes on again.
As far as raw CPU architectures go, we recently evaluated the 12-core AMD Opterons against the 6-core Intel Nehalem/Westmere chips. On similar (near identical) HP hardware, with a standard 4 GB of memory per core (2 chips on each board meant 24 cores on the AMDs and 12 cores on the Intels), the performance was night and day: the 2.6 GHz Intels smoked the 2.3 GHz AMDs by a factor of 1.5. We also tried various math libraries: the Intel mathlib, the AMD mathlib, a few others. In short, core-for-core the Intels perform much better. You’ll also pay a higher price.
The AMDs had a different advantage though: per /node/ the total Mflops/Gflops performance came out much higher, mainly because they had 2x the number of cores. Note however that the Gflops were NOT 2x the Intels. With 2x the cores, they got about 1.4x, maybe 1.5x the performance. The performance numbers don’t scale linearly between architectures. Our group chose to base performance on core capabilities, not node capabilities, so the advantage was lost. However, the price difference was high enough that I had to give the AMDs serious evaluation or I’d be remiss in my responsibility to the company. We ended up selecting the 3.3 GHz Intel 6-core Westmeres. They’re higher wattage, and thus cost more in power, but the performance is overwhelmingly better.
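The per-node and per-core figures above can be cross-checked with a little arithmetic. This is only a sketch: the helper name `per_core_ratio` is made up for illustration, and the inputs are the rough ratios quoted in the post (2x the cores, about 1.4x to 1.5x the node throughput).

```python
def per_core_ratio(node_perf_ratio, core_count_ratio):
    """Per-core performance of chip A relative to chip B.

    node_perf_ratio:  node throughput of A divided by node throughput of B
    core_count_ratio: cores per node of A divided by cores per node of B
    """
    if core_count_ratio <= 0:
        raise ValueError("core_count_ratio must be positive")
    return node_perf_ratio / core_count_ratio


# Ratios quoted in the post: AMD node vs Intel node, 24 cores vs 12 cores.
for node_ratio in (1.4, 1.5):
    amd_core = per_core_ratio(node_ratio, 2.0)
    print(f"node ratio {node_ratio}x: one AMD core is about {amd_core:.2f} "
          f"of an Intel core (Intel core about {1 / amd_core:.2f}x)")
```

By these node-level numbers an Intel core comes out somewhat ahead of an AMD core, in the same ballpark as the per-core factor of 1.5 mentioned earlier, while the AMD node still wins on total throughput.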
We also tested the difference between using Intel’s hyperthreading and not. We found that there was actually an increase in performance by using hyperthreading. We are able to run 24 simultaneous threads, with a 10% to 15% drop in per-thread performance. Given that you’re running 2x the threads on the same hardware, that per-thread drop is more than made up for by the amount of processing accomplished with 2x the threads. Our engineers were quite surprised, and pleasantly so. (followup…)
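Taking the hyperthreading figures above at face value, the net effect is easy to work out: twice the threads, each running 10% to 15% slower. The sketch below uses a hypothetical helper, `ht_throughput_gain`, and assumes the slowdown applies equally to every thread.

```python
def ht_throughput_gain(thread_multiplier, per_thread_drop):
    """Total throughput relative to running one thread per core."""
    if not 0.0 <= per_thread_drop < 1.0:
        raise ValueError("per_thread_drop must be in [0, 1)")
    return thread_multiplier * (1.0 - per_thread_drop)


# Figures from the post: 2x the threads, 10% to 15% slower per thread.
for drop in (0.10, 0.15):
    gain = ht_throughput_gain(2, drop)
    print(f"{drop:.0%} per-thread drop: {gain:.2f}x total throughput")
```

The break-even point for doubling the threads is a 50% per-thread drop, so a 10% to 15% drop leaves a wide margin. As noted elsewhere in this thread, the benefit depends heavily on the application.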
If performance requirements aren’t overwhelming, and you’re working on a budget, the AMDs are absolutely worth evaluating. They don’t perform as well, but they don’t cost nearly as much as the Intels. And his point about experimenting with application scaling based on the number of cores is spot on.
We chose to stick with x86 architectures because of the custom code we’re working with, and nobody felt like porting over to ARM or PowerPC, even though there are arguably some strong performance benefits. Most of the x86 hardware is also much more accessible, and can be replaced/augmented at a wider selection of shops/vendors/resellers.
I hope this helps. Do feel free to send me a note or a message if you’d like any more info on our results. I’m also curious to hear more about what you’re looking to accomplish. Best of luck –
I am working on a neural network application, with 200K neurons in 4 2-D layers, and 100 connections per neuron. Connections follow a probability distribution (think of a Mexican hat), mostly between adjacent layers. The point is that there are quite a lot of random RAM probes when summing up impulses. I have a quad-XEON, which I’m running multi-threaded with 8 threads built on OpenMP. Conceptually at least, the application partitions nicely, and I’ve gotten near-linear speed-up with more threads. Additionally, I tried using an ATI 5870, along with every trick I could think of in OpenCL to get speed, but the XEON was still faster.
The neural network takes 1-3 days to train on simple problems. I’d like to get this way down, as I would like to start working on more complex problems. This is why I’m wondering what I can do to get real speed. I’m considering getting another XEON, as I have a dual-CPU board; hopefully, this will halve the computation time.
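Whether a second CPU really halves the run time depends on how much of the run scales with the extra cores. One standard way to estimate this is Amdahl’s law, sketched below. It is a textbook model brought in here for illustration, not something measured in this thread; the parallel fractions are assumed values, and the model ignores memory bandwidth, which matters for this workload.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup over one worker when only part of the run scales with workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumed parallel fractions; the thread gives no measured value.
for fraction in (1.0, 0.95, 0.80):
    gain = amdahl_speedup(fraction, 2)  # one CPU -> two identical CPUs
    print(f"parallel fraction {fraction:.2f}: {gain:.2f}x from the second CPU")
```

Only a perfectly parallel run gets the full 2x. The near-linear speed-up already seen with 8 threads suggests the parallel fraction is high, so the gain should land close to that, provided memory access does not become the limit.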
Sounds like you’re running on a single server at the moment?
There are some types of computation (mostly linear algebra) that the GPUs excel at. I’ve heard from several developers that the architecture is not well suited for other types of computations, especially those involving lots of branching or forking. I’m not a dev so I don’t know the specifics, but I’m not surprised you’re still getting better performance under x86 hardware. You might find that CUDA and the Nvidia GPU cards give you better performance as those are dedicated computation boards, rather than video cards that also interpret OpenCL.
How chatty is your application over MPI? Is there lots of inter-thread communication? Or just the occasional reduce/sync?
To get “real speed” as you call it, I’d look into getting several servers with at minimum a 1 Gbit interconnect, lots of RAM, RAID-based storage on each node rather than a single hard drive (depending on your disk I/O needs), and possibly shared storage depending on your output file sizes and throughput. It’s probably safe to start with more nodes, RAM, and CPU, then beef up storage if you find that disk and file reads/writes are a choke point. Start with the basics then build from there, but at least 3 or 4 nodes with 2x quad or 6-core chips will give you a lot of bang for the buck.
As others have said, it would be helpful to learn more about your application, etc.
Also, you might consider renting time on something like EC2, especially if you’re not going to use the machine a lot. Here’s an interesting article on the subject:
I tried EC2 earlier; unfortunately, that gave me rather poor performance, something like 5X slower than on a single box. There are enough dependencies between neural connections, and subsequently memory accesses, that the best approach appears to be having the entire model sitting in RAM, on one platform.
I am running a single CPU right now. I will probably add a second CPU next month, when Intel comes out with their new XEON chips; I’d like to get a 6-core of some sort. I’m concerned about the long-term scalability of this solution, as my current quad-core CPU takes over a day to run a test. It sure would be nice to come up with a 10X solution, and see the way to 100X, as I plan to expand the number of connections per neuron to at least 1000, from the current 100.
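A rough size estimate shows why the random memory probes hurt and what the jump from 100 to 1000 connections per neuron implies. The sketch below assumes 200K neurons in total and a flat table with 8 bytes per connection (a 4-byte target index plus a 4-byte weight). Both the layout and the helper name are assumptions for illustration, not the actual data structure of the application.

```python
def connection_table_bytes(neurons, connections_per_neuron, bytes_per_connection=8):
    """Size of a flat table holding one (target index, weight) pair per connection."""
    return neurons * connections_per_neuron * bytes_per_connection


NEURONS = 200_000  # neuron count from the post, assumed to be the total

for fan_out in (100, 1000):
    size = connection_table_bytes(NEURONS, fan_out)
    print(f"{fan_out} connections per neuron: {size / 1e6:.0f} MB")
```

Under these assumptions the table is already far larger than the 8 to 12 MB of L3 cache on Nehalem/Westmere-class Xeons, so most random probes go out to RAM, and ten times the connections means ten times the memory traffic per pass.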
Do beware that if you’re going to put 2 CPUs in one server, you
*have* to make them the same. Same model, same core count, same clock
speed. There are no two ways about this I’m afraid. So if you’re planning
to get an upgraded chip you’ll need to buy two (making sure your server
will support that specific model) or buy a new server.
I don’t know where in the world you are but sometimes computer or corporate
recycling centers get some above average equipment through at very good
prices. You would probably need to beef it up with RAM or hard drives or
such, but it could be worth it. You can also get off-lease corporate stuff at
a very good price, that will only be a generation or two behind – still
more than adequate for most tasks, especially strung together in parallel.
If you’d like more info let me know and I’ll find you some links.
Thanks for the heads up. Do you know why they need to be the same? Is it the motherboard, the OS, or the chips themselves? Now, I’m wondering if I’d be better off getting the same quad XEON, or upgrading to a 6-core with a faster clock. Also, I don’t know how to check if the 6-core will be compatible with my motherboard; the socket is the same, but are there other considerations?
It’s clock synchronization. The system won’t know how to handle two
different speeds. Logic says if you put in a faster chip of the same line
that it might just downclock it to match the slower, but I’ve never tried
and everything I read says they must be the same.
If you can find out the type of motherboard you have, you can check the
manufacturer’s web site and Intel’s ARK page for a list of compatible chips.
4 core boards may not be compatible with 6 core chips. They’re pretty
specific. You’ll definitely want to do the research before you buy to make
sure you get hardware that plays nicely together.
Is it a name brand computer like Dell or HP? If so, what’s the model?
I am using a Tyan 7025 motherboard. I’ve briefly looked at their online docs and didn’t see anything about matching processors, although what you’ve posted makes sense. Tyan did change the online spec to indicate that the board is compatible with Nehalem-EP; when I bought the board last year, it was rated for Nehalem. I wonder if they’ve changed the board, or just the doc.
How’s this going? Did you get your upgrade?
I did get an additional processor; I’ll put it in over the weekend. I contacted Tyan, and they told me that I had to match the processor, just as you had suggested. Thanks.
It seems to me that the biggest issue with parallel processing is memory contention. I wonder when someone is going to tile a chip with the 64-bit version of ARM cores, and put a lot of cache on the chip, instead of pushing vectorization.
There’s been a lot of discussion recently about that topic. People are wondering where ccNUMA is going, and someone just the other day asked me if ARM was going to become more prevalent in HPC technology (questions to which I don’t have even a foggy answer). Writing a program, you see things from a very different perspective than I do. I’ll be curious to see what happens with your designs – please send me a note some time and let me know how it’s going!
One thing I’m wondering about, in this dual CPU environment, is whether memory access is uniform between processors. If it is not, does OpenMP have memory allocators to bind given memory allocations to specific processors? This goes along with the notion of thread affinity for processors: having memory allocation affinity as well.
Even if memory access is uniform, as there appear to be 2 memory controllers on the board, it would still be preferable to have memory allocation affinity.
Doug Eadline’s Clustermonkey site will be a good resource for you:
You can go the Cluster Monkey route with clustered PS3s/motherboards à la Beowulf solutions, or look at the surplus option of used Crays or older SGIs: get ’em cheap, reload them with Linux, and cluster ’em. You may have to do some low-level driver development (time to earn your spurs), but in the end you have a DIY HPC solution.
Here’s a recent article I came across that I thought might help you out as well: http://arstechnica.com/science/news/2011/03/high-performance-computing-on-gamer-pcs-part-1-hardware.ars
ARM chips are very efficient, but they are not up to large-scale HPC, and you will find that they need quite a lot more code to do the same thing as a Xeon. This is the same sort of problem that you saw with the ATI solution you tried: they work very well for specific data sets, but not so well for generic large-scale algorithms. It’s usually a balancing act to find the best hardware for the specific problem.
A possible way around these issues is a change in ideology: developing software that runs as very finite packet systems. FBP (flow-based programming) is an example of this (as is some of the IP cloud software). This way data and code are closely associated and can maximise the use of most hardware systems.
Finally, to get the best out of a system: profile. Get a hold of VTune or similar and profile every little bit of execution. This will give you the best hope of finding real performance improvements; with some luck, there could be some inline asm replacements that make large changes in performance. A classic example of a big runtime improvement is replacing any std::map calls at runtime with hash arrays or dual coupled arrays, which can give huge performance improvements.