GPU, OpenCL, CUDA and Hadoop in quant analytics and high-frequency trading (HFT)
After reading http://hgpu.org/?p=7413 (and having been interested in Hadoop for quite some time), I got curious whether more work has been done or is under development. It is clear that workers can be sped up a lot with OpenCL and similar techniques, increasing the speed of a cluster. Do note that I am an OpenCL specialist, so somewhat biased.
Do you guys know of any project where Big Data has been combined with OpenCL, CUDA, Aparapi, etc.?
An overview I found is from 2010: http://jimmyraywv.blogspot.com/2010/12/java-parallelization-options.html
To answer your first question: have there been any projects with Hadoop and GPUs?
Can I talk about the ones I know? No. Sorry.
Is it possible? Sure, there are a couple of ways of using GPUs in Hadoop; however, there are some issues that you have to consider.
Recommendations? Yes. JNI is your friend.
Does using a GPU make sense?
It depends. You have to consider what you are doing, and how much of a performance boost you gain. You have to weigh this against the cost of the GPU, assuming it fits in your chassis, versus the cost of just expanding your cluster.
Here, as with other advanced concepts, YMMV based on the quality of your code and the approach of your solution.
Thanks for the cliff hanger.
One of the things I have questions about is how best to do the double map: first from total to node, and then from node to NDRange. I would prefer to do it in just one step and send the data packed per X items to each node, which in turn only needs to unpack it.
Also JavaCL, JOCL and Aparapi are my friends – JNI is their friend in turn.
If a solution to the problem exists in both Hadoop and GPGPU, it makes a lot of sense.
Cliffhanger? Sorry no.
You ask a set of questions where:
1) it depends on what you are doing…
2) is it cheaper to bulk up or grow out your cluster…
3) to give you a full answer would require the potential of violating an NDA or two…
Sorry to be cryptic, I’ll try harder…
CUDA, Java API or C API?
(I would suggest C, hence JNI.)
Note you mention a couple of ‘friends’… Another free clue… KISS. You want to keep your code small and tight…
Problem… One CUDA, multiple map/reduce slots… How are you going to solve that?
(again there are multiple solutions… YMMV)
Still cryptic? Sorry. Maybe take it offline?
Not sure what you mean by double map…
Hadoop is not particularly impressive when it comes to performance. One can look at TPC-H Hive benchmarks for a 100 GB file:
Basically, on an 11-node cluster consisting of IBM eServers, each with 8GB of memory and 4 hard drives, the first TPC-H query (Q1) takes 500 seconds when using Hadoop.
When using a GPU on a single box with 8GB of memory, the query Q1 runs in just 18 seconds (using some home-written software to run SQL on GPU).
So yes, there is definitely room for improvement in Hadoop.
I think you need to learn more about Hadoop…
So I take it that your NDA is so restrictive that you cannot even tell us what you are disagreeing with?
We are all curious what you do, but an NDA is an NDA. There are loads of hints, but it is always the description of the problem field itself that explains why something works, not the techniques used. I just wanted to know what types of problems have shown good results, or to hear in which cases MPI, for example, is better.
By double map I meant that mapping needs to be done twice (sorry, a translation problem from Dutch): first mapping from data to nodes, then mapping from node to GPU cores.
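A minimal sketch of doing both mappings in one step, assuming a flat one-dimensional range of items, nodes of equal capacity and a fixed work-group size. The names RangePlanner, Chunk and plan are hypothetical, and this is plain Java rather than any Hadoop or OpenCL API. The driver computes each node's offset and length up front, and pads the length to a multiple of the work-group size so the node can use it directly as its NDRange global size.

```java
import java.util.ArrayList;
import java.util.List;

final class RangePlanner {
    /** One node's share of the global range, plus the padded NDRange size for its GPU. */
    static final class Chunk {
        final long offset;      // index of the first item handled by this node
        final long length;      // number of real items in this chunk
        final long globalSize;  // length rounded up to a multiple of the work-group size

        Chunk(long offset, long length, long globalSize) {
            this.offset = offset;
            this.length = length;
            this.globalSize = globalSize;
        }
    }

    /** Splits totalItems across nodes as evenly as possible and pads each chunk for the GPU. */
    static List<Chunk> plan(long totalItems, int nodes, int workGroupSize) {
        if (totalItems < 0 || nodes <= 0 || workGroupSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        List<Chunk> chunks = new ArrayList<>();
        long base = totalItems / nodes;       // every node gets at least this many items
        long remainder = totalItems % nodes;  // the first 'remainder' nodes get one extra
        long offset = 0;
        for (int n = 0; n < nodes; n++) {
            long length = base + (n < remainder ? 1 : 0);
            long groups = (length + workGroupSize - 1) / workGroupSize; // ceiling division
            chunks.add(new Chunk(offset, length, groups * workGroupSize));
            offset += length;
        }
        return chunks;
    }
}
```

With 1000 items, 3 nodes and a work-group size of 64, the first node gets items 0 to 333 and the other two get 333 items each. Every node launches a padded range of 384 work-items, so the kernel has to skip indices at or beyond the chunk length.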
The reason for this question is that I don’t want to tackle problems that can be solved in under a minute on one GPU-powered machine, but ones that need to be distributed over several GPU-powered machines. Hadoop is interesting for doing that distribution.
You were comparing the distribution system (local vs. distributed via Hadoop) with the hardware (CPU vs. GPU), hence that remark.
What would you like me to tell you?
Sorry, but as a consultant, I end up signing NDAs all the time. So when you ask a question, the answers tend to be incomplete. Why? Because there are some things I can talk about, some things I probably can talk about but that are in a grey area, and then there are things I really can’t talk about. The hard part is trying to figure out what my client thinks I can and can’t talk about, so the easiest thing to do is not talk about it, period.
But you asked a question about people using GPUs which is an interesting problem.
Here’s what I can probably say to help point you in the right direction….
1) Solve the problem using Java and no GPUs. Now if you want to speed things up you can just add more nodes.
2) You can also see if you can use a GPU. Note: Not all problems work well with GPUs. (More on that in a second…) If you use a GPU, you have two choices: the Java API or the C API. We chose C, because it’s a more robust API for our problem.
If you choose C, then you will want to wrap your CUDA code in JNI. Note… you want to avoid using frameworks because you end up increasing the size of your executable JAR, and you also need to use the distributed cache to move the CUDA object code around the cluster.
So now, in each Mapper.map() call, you set up the GPU connection, send your code to the GPU, get your result and clean up the connection. Not too terribly efficient, but depending on the problem, faster than not using a GPU.
3) You look for ways to speed that up.
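One way to read steps 2 and 3, as a sketch only: the per-record pattern opens and closes a GPU session inside every map() call, while the faster variant opens it once per task and reuses it. Everything here is a hypothetical stand-in (GpuSession, CountingSession, the squaring "kernel"); a real version would put JNI-wrapped CUDA calls behind the interface, and in Hadoop's newer API the once-per-task work would plausibly live in Mapper.setup() and Mapper.cleanup().

```java
import java.util.function.Supplier;

/** Hypothetical handle to a GPU context; a real one would wrap JNI calls into CUDA code. */
interface GpuSession extends AutoCloseable {
    double compute(double input);

    @Override
    void close();
}

/** CPU stand-in that counts how often a session is opened, to make the overhead visible. */
final class CountingSession implements GpuSession {
    static int opened = 0;

    CountingSession() { opened++; }                    // stands in for context setup and code upload

    public double compute(double x) { return x * x; }  // stands in for a kernel launch

    public void close() { }                            // stands in for releasing the GPU connection
}

final class MapperSketch {
    /** Step 2: set up, compute and clean up for every single record. */
    static double[] perRecord(double[] records, Supplier<GpuSession> factory) {
        double[] out = new double[records.length];
        for (int i = 0; i < records.length; i++) {
            try (GpuSession session = factory.get()) {
                out[i] = session.compute(records[i]);
            }
        }
        return out;
    }

    /** Step 3: open one session per task and reuse it for all records. */
    static double[] perTask(double[] records, Supplier<GpuSession> factory) {
        double[] out = new double[records.length];
        try (GpuSession session = factory.get()) {
            for (int i = 0; i < records.length; i++) {
                out[i] = session.compute(records[i]);
            }
        }
        return out;
    }
}
```

Both variants return the same results; the difference is that perRecord opens one session per record, while perTask opens exactly one for the whole batch.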
Now you also may have an additional problem… you can have N map slots on a node in the cluster. Each slot will want to talk to the GPU at roughly the same time. That could become an issue.
As to your Double Mapper problem… it’s more of an issue of allowing concurrent access to the GPU from multiple copies of the code running at the same time.
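One of several possible ways to handle that contention, sketched under the assumption that each map slot runs as its own JVM process on the node: serialize GPU access with an OS-level file lock. GpuGate, withExclusiveGpu and the lock-file path are hypothetical names; the only real APIs used are java.nio's FileChannel and FileLock.

```java
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Callable;

final class GpuGate {
    /**
     * Runs gpuWork while holding an exclusive lock on lockFile.
     * Other processes calling this with the same file wait their turn.
     */
    static <T> T withExclusiveGpu(Path lockFile, Callable<T> gpuWork) throws Exception {
        try (FileChannel channel = FileChannel.open(lockFile,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) {  // blocks until no other process holds the lock
            return gpuWork.call();              // stands in for the JNI call into CUDA code
        }
    }
}
```

Every map task on a node would call withExclusiveGpu with the same node-local lock file. Note that a FileLock is held on behalf of the whole JVM, so this does not coordinate threads inside one process; those would need an in-process lock or semaphore instead.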
If you go back to my earlier post…
I said that the effectiveness of the GPU will depend on what you want to do and how you can utilize it in the problem that you want to solve.
You also have to consider that the GPU will add a significant increase to the cost of your node. So you have to ask yourself if you would be better off trying to get the GPU to work or if you would be better off just adding nodes to the cluster.
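A rough break-even for that decision, as a sketch under simplifying assumptions: throughput scales linearly with node count, and power, rack space and licensing are ignored. If you spend the GPU budget on extra nodes instead, an n-node cluster grows to n × (1 + gpuCost / nodeCost) nodes, so the GPU only wins when its end-to-end speedup per node exceeds 1 + gpuCost / nodeCost. The class name and the prices in the usage note are hypothetical.

```java
final class GpuOrNodes {
    /** End-to-end per-node speedup a GPU must exceed to beat buying more nodes with the same money. */
    static double breakEvenSpeedup(double nodeCost, double gpuCost) {
        if (nodeCost <= 0 || gpuCost < 0) {
            throw new IllegalArgumentException("costs must be positive");
        }
        return 1.0 + gpuCost / nodeCost;
    }

    /** True if adding GPUs gives more throughput than spending the same budget on nodes. */
    static boolean gpuWins(double nodeCost, double gpuCost, double endToEndSpeedup) {
        return endToEndSpeedup > breakEvenSpeedup(nodeCost, gpuCost);
    }
}
```

With made-up prices of 4000 per node and 2000 per GPU card, the break-even speedup is 1.5x: a GPU that makes each node twice as fast end to end is worth it, one that yields 1.2x is not.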
There are other issues. At one client, we had 1U boxes where we couldn’t fit the GPU cards in and had to use a vendor specific approach. Unfortunately, the vendor’s solution wasn’t ready for prime time and we had issues.
The big thing that you have to realize is that there is a cost in moving data to and from the GPU. (Ok, you probably already know this.) So while the GPU is blazingly fast, the speed improvement you actually get may be much less than you expect.
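The effect of the transfer cost can be made concrete with a back-of-the-envelope model, sketched here under the assumption that transfers and computation do not overlap: if the CPU takes cpuSeconds, the GPU kernel alone is kernelSpeedup times faster, and moving data to and from the card takes transferSeconds, then the end-to-end speedup is cpuSeconds / (transferSeconds + cpuSeconds / kernelSpeedup). The names and the figures in the note below are illustrative, not measurements.

```java
final class SpeedupModel {
    /** End-to-end speedup once the time to move data to and from the GPU is included. */
    static double effectiveSpeedup(double cpuSeconds, double kernelSpeedup, double transferSeconds) {
        if (cpuSeconds <= 0 || kernelSpeedup <= 0 || transferSeconds < 0) {
            throw new IllegalArgumentException("times and speedup must be positive");
        }
        // Assumption: transfer and kernel time simply add up (no overlap or streaming).
        double gpuSeconds = transferSeconds + cpuSeconds / kernelSpeedup;
        return cpuSeconds / gpuSeconds;
    }
}
```

For example, with 100 s on the CPU, a kernel that is 50x faster and 8 s of transfers, the model gives a 10x end-to-end speedup rather than 50x.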
Does that help?
Bottom line… if you have a small cluster with GPUs and you’re doing something fairly simple, like applying a complex equation to each tuple in a large data set, you may find some value. If you’ve got to move a lot of data in and out of the GPU? Not so much.
Seems you’ve got a chip on your shoulder when it comes to Hadoop.
Hive is only one part of the Hadoop ecosystem.
100GB would be considered small to some who post here.
The whole argument that you could write custom code to do the same thing that the TPC-H query does is a bit of a fallacy since it would fall outside of the scope of what is permitted by the TPC org.
Maybe you don’t remember when Oracle gamed the system with their TPC-C benchmarks or
There seems to be some misunderstanding in this conversation. Sorry!
My two questions are now:
* Are there examples of Hadoop with GPGPU? I mean specifically the Hadoop part, as I already know GPGPU.
* If another distribution technique is used instead of Hadoop, do you know why?
For the GPGPU-part, I know something about that already. http://www.streamcomputing.eu/blog/ is my company’s blog.
100 GB is a lot of data for a GPU. It greatly exceeds any existing GPU memory. So it is a good indicator of a GPU’s ability to handle large amounts of data. About custom code – I believe that people care about correct and fast results and not whether the code is permitted by some organization.