Quant analytics: Do big data use cases require the data to be stored? Hadoop? MapReduce?
Recently I checked out some Hadoop use cases like index building, text analytics, etc, and it seems that most of them don’t really want to save the data, only do the analysis, and then save the results – indexes, summary of text with link to source. It seems like Stream processing would be much more suitable and cost effect.
you need to do both. Regardless of whether or not you’re a product manager for a stream processing solution. Normally, I would have deleted this post but am going to let it stand as a stark reminder to others who might think of posting something so self serving.
I would love to learn more how Stream processing can help there.
From my limited perspective I can say, many use Hadoop for performing machine-learning tasks for which batch-oriented computation is a great fit. Many use Hadoop for Top-K-like analysis which can be quite memory-demanding if not temporary results are serialized to disk. Recently, Hadoop gets used to build indexes on highly-dimensional data which used to be Relational DB’s turf.
Can you give us some examples how Stream processing helps with these problems, or which tasks specifically can be better solved with Stream processing than with MapReduce?
Every use case is different. However many people like hadoop because it gives them a simple and inexpensive way to retain raw data for longer. IE before hadoop an enterprise could only store 2 months of raw logs and would have to summarize and delete, now with hadoop users can keep that raw data longer.
there are implementations of stream based map/reduce (several co’s have them now). This allows the capabilities to be mixed and matched depending upon the specific use case. I think here is a cost/benefit analysis of processing data in flight vs the cost of storage because he’s the product manager for Infostreams. The better way to look at this is to understand what real-time benefits can be realized in addition to storing the data for later research, etc. I have asked for some use cases before from the group to highlight what might be a standard big data, low latency data flow. I will post one of those diagrams later today that we use to explain this process.
you need both. Cloudscale (cloudscale.com) allows the same analytics program to run unchanged on, for example, two years of historical data (MapReduce-style) or on a live realtime stream coming in at a rate of millions of events per second (StreamProcessing-style). If you want our in-memory analytics engine in the cloud (AWS) we use S3 storage for the former, and AWS cluster nodes with dedicated 10GigE for the latter.
Stream beats MapReduce by 1000x or so in any case where low latency matters, i.e. where you have a window of between say 0.1sec to 10secs to go from raw data to analytics to action. The list of such applications in business, web, finance and gov is long and getting longer every day.
I just posted a presentation that shows one or two use cases incorporating streaming mr and traditional batch mr. Please comment as voraciously as I do….
Hadoop is not really meant for real time stream processing. I am interested in this area and have been learning about S4 from Apache and Storm from Twitter for real time distributed scalable stream processing.
Its important to remember that a lot of companies treat their Hadoop cluster as if its part of their ‘sekret’ sauce. That is its protected IP and it doesn’t get talked about.
Also you need to stop propagating the ‘urban myth’ that Hadoop is good or not good for any one thing. Volkmar of HStreaming would disagree w Pranab. 😉
The real truth is that even if you are processing streaming data, you still tee it off to be stored because you will want to do analytics on the data over time. So that today’s stream is last weeks ‘old news’ and if you want to do trend analysis… well you’re going to need both. 😉
Agree that if you are doing analytic over long time horizon stream processing is not the way. However if your processing does not require any kind of history or even if it does and the time window is very short, stream processing is the right way, if real time response is critical.
One scenario where you need both is fraud prediction. You might build the prediction model using Hadoop through batch processing. But to detect fraud based on events that are happening now, you need real time stream processing, which will use the prediction model. You want to detect fraud now rather than through some batch process 2 hours from now, when the damage may already have been done.
Thanks for all the feedback – I’m certainly aware of many use cases that require both – create mining models, or text models with warehouses or hadoop historic data, then load into a Streaming system to detect and take action in real time. The group I work with offers a hadoop distribution and also warehouses, so it’s natural we have those use cases. You can certainly interpret that I was fishing, but I really was interested to learn. It seems there are many use cases where they don’t really want to save the data, such as sentiment analysis or geospatial or event processing. So, why was hadoop used? There weren’t robust enough (or any depending on how long ago) streaming solutions?
Hadoop was used because you need to save the data. Disk is cheap. Infostreams is not. There have been streaming processing solutions now for close to 10 years. There have been EAI solutions (remember event processing before CEP?) for the past 20 years. They save the data because it’s easier and so that it can be examined later. Specifically, in many instances, the creating of inverted indexes is often re-examined in the construction of complex keys (keys with embedded meaning) and to do that, even if you created the initial index with a stream processor, you might need to re-calculate those indexes down the road. If you didn’t store the base info, you can’t do that. You say you work with the group at IBM that offers Hadoop; perhaps they might have some insight into why they’ve persisted data for your clients?
The most obvious (extreme) cases where it *sometimes* makes sense to delete the primary raw data after high velocity live streaming analysis are certain RFID, sensor and location apps, where the raw data is a highly synchronous high frequency firehose, but it’s amenable to simple, fast extreme compression, with no significant loss of information with respect to those apps.
NOTE I now post my TRADING ALERTS into my personal FACEBOOK ACCOUNT and TWITTER. Don't worry as I don't post stupid cat videos or what I eat!