1 Terabyte of Text in 7 Seconds
Here’s a recipe for building a fast large scale query system in weeks from mostly open source software.
We’re starting to open up about the large data infrastructure work at Quantcast. We process about 6 PB of data a day and have interesting stories to tell. This is the first of a series of technical blog posts. Stay tuned.
Turbo Mapreduce (Or How to Run Sawzall on 1 Terabyte of Text in 7 Seconds) By Silvius Rus, on behalf of the Cluster Team at Quantcast. More Quantcast infrastructure stories coming soon, stay tuned.
Can you say more about the 7 sec sawzall job in the whitepaper?
It’s essentially a text line counter, the simplest Sawzall program that traverses all the input and exercises the map, combine, and reduce code.
Interesting. Does sawzill mean that its more a log analysis app or (as in engineering support),or user data analysis? I’m trying to do text summarization benchmarking…generally trying to get a sense of state of the art throughput. Do you all have any public numbers for those kinds of workloads?
Sawzall runs on text or structured data and it does not care much where the data came from :). Sorry, I cannot discuss the workload or the hardware setup. Raw performance depends on several factors. On top of raw performance, smart execution engines make use of software and hardware acceleration to improve the user-perceived performance (thus tuned for a specific process). This makes benchmarking challenging. What is your exact benchmark definition?That makes sense. I am looking at text summarization in particular more on the algorithmic side in terms of approaches that could run in a MR framework. In terms of benchmarking what I have in mind is to start by taking the spinn3r social feed ICWSM data set, restrict to documents < 200 words. A simple measure is streaming throughput: given a data feed of a given rate, what is the rate at which sentence length summary can be generated. Would appreciate any thoughts you might have.–
Sawzall runs on text or structured data and it does not care much where the data came from :). Sorry, I cannot discuss the workload or the hardware setup. Raw performance depends on several factors. On top of raw performance, smart execution engines make use of software and hardware acceleration to improve the user-perceived performance (thus tuned for a specific process). This makes benchmarking challenging. What is your exact benchmark definition?
That makes sense. I am looking at text summarization in particular more on the algorithmic side in terms of approaches that could run in a MR framework. In terms of benchmarking what I have in mind is to start by taking the spinn3r social feed ICWSM data set, restrict to documents < 200 words. A simple measure is streaming throughput: given a data feed of a given rate, what is the rate at which sentence length summary can be generated. Would appreciate any thoughts you might have.
Yes… well that is like using excavator for beating a speed record of F1. This could be done by single 1U machine, problem is you need to load 1,2 Terabit/s to the machine. Which is possible, but quite expensive, but surely less expensive than cluster of fairly powerful servers and bunch of academics using basic concepts of parallel computing to do so. My point is that challenge here is storage and storage I/O not really the HPC computing…
I’d like to learn how to run data-dependent text processing and aggregation at 1.2 Tbps on a 1U machine. On a general purpose CPU core I see about 0.2 to 1.2 Gbps on Sawzall one-liners for text input. Are you implying running a general record processor like Sawzall on a GPU or FPGA? Can you actually do that and get the rate you claim? Can you at least share some numbers to make this claim believable?
By the way, your cost comparison may make sense to some but not to me. If I already need a cluster to manipulate large data (it’s how large data often gets produced), it costs close to nothing to use the exact same hardware to run quick data mining queries. Why would I ever research, build, develop for, and maintain yet another architecture? Even if I started fresh, I’m not sure I’d buy the argument against the cluster of commodity hardware. I guess I’m again not sold on the GPU/FPGA argument, but I’m certainly open to find out more about it.
It seems the question is the cost of what you’re trying to extract from the data, which for summarization could be anywhere from constant time to O(n^3). Perhaps one point is that if there is already a distributed storage infrastructure, its sensible to distribute the analysis among the stores instead of having to move the data yet again. I’m taking as your general point that for low-latency the analysis pipeline cannot overwhelm I/O, and this should drive architecture decisions. I think Silvius’ whitepaper mentions the use of an empirically derived set of job queues to mitigate impact runtime on overall throughput.
You obviously need very high throughput, this could be achieved most economically through using multiple FPGAs, it would be possible on any other highly paralelistic/highly throughput architecture, though in a case of eg Nvidia Tesla I am not really sure it would fit in 1U. But back to case based on FPGA … yes it’s definitely possible. But you need to design proper shared nothing architecture system (no central point just synced “agents” on the same level). But for this simple operation you could even use SINGLE Virtex 7 which can achieve this sort of throughput on single chip, however it would be still better use two or just because of handling input maybe four (this is actually board about half size of 1U 19” box). Each chip can achieve 2.515Tb/s peak bi-directional bandwidth, however there are many constraints achieving this sort of I/O, hence four chips. Each Virtex 7 can achieve 400G sustained aggregate bandwidth handling network protocol and doing filtering, which is MUCH more complex task than actually counting a lines of text.
FPGA is developing fast, build that simple core is not really challenge currently.
And to my example, better to say, it’s like using excavator to beat F1 speed record and believing using many will help you to achieve better results. In computing is this sort of thing possible but I am not sure about efficiency. Call me conservative but I am rather help advancing technology itself than trying to find out how to use excavator to beat F1 speed record.
Jakub Jochec • Charles Earl: Yes I agree, this problem needs to be divided and distributed among multiple storage and then parallely loaded, yes. Quite simple when actually solving this on multiple head cluster. However I bet this operation takes much more than 7 seconds. I a case with single board the biggest challenge is how to push the data through the I/O of the board.
Thank you for the information on FPGAs. Let me emphasize that I/O and network aren’t the bottleneck in our case. It’s CPU, for most nontrivial programs. However, I’m not convinced that we can use FPGAs (though I’d love to). Here’s why.
Counting lines of text is the trivial example. We built this system to run arbitrary Sawzall programs. We don’t know the queries until they are executed, so they need to be compiled on the fly. Some code paths could possibly be done efficiently in FPGA, but in general you may still need to go through a general purpose CPU for every record, and that’s the bottleneck, so you end up needing thousands of general purpose cores. It would be interesting to see exactly how much could be offloaded to FPGA. Until then you actually do need a cluster (or a tighter supercomputer) to host these CPUs. Which comes with the inherent data path problems. So it’s not really excavator vs. F1, unless you are able to compile (a large subset of) Sawzall to FPGA. That would be very cool.
About distributing and loading the data. At large scale data is usually already distributed. To read it fast you need a network with the right bisection and enough spindles. And the right software stack to keep it all in balance, so that thousands of CPUs stay busy. We read data from disk with every single query. It doesn’t make much sense to cache it in RAM in our setup, as different users pick at different TBs from a data set of several PBs.
Well then let’s just specify a problems and we see if we can offer you ways how we can offload it to fpga.