Tag Archives: MapReduce

Youtube video response on Linear Regression MapReduce in R and Hadoop

Want to learn more? Join my FREE newsletter

Thanks to Minstand on Youtube for this comment: an example script for a simple MapReduce job with the RevolutionAnalytics rmr2 package. Example and package source: github.com/RevolutionAnalytics/rmr2. Hadoop is running on a VirtualBox image with Zentyal Server 3.01 (x86_64 GNU/Linux 3.2.0-29-generic #46 Ubuntu). Installed applications: • Java 1.6.0_27 (OpenJDK Runtime Environment, OpenJDK 64-Bit Server) • Apache Hadoop 1.0.4, Apache Pig 0.11.0, Apache Hive 0.9.0, Apache Thrift 0.9.0 • R 2.15.2 with rhbase 1.1, rhdfs 1.0.5 and rmr2 2.1.0. Installation and configuration made my hair grey, but I finally ran about 200 jobs and everything is fast and stable. 🙂 There was a video response on Youtube called Linear Regression MapReduce in R
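The rmr2 script itself isn't reproduced in the comment, but the idea behind a linear-regression MapReduce job is simple: each mapper emits the partial normal-equation sums for its chunk of data, and the reducer adds them up and solves for the coefficients. A minimal stand-alone sketch in plain Python (no Hadoop required; the function names are mine, purely illustrative):

```python
import numpy as np

def lr_map(chunk):
    """Map step: emit the partial normal-equation sums (X'X, X'y) for one chunk."""
    X, y = chunk
    return X.T @ X, X.T @ y

def lr_reduce(partials):
    """Reduce step: sum the partial matrices and solve the normal equations."""
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)

def linear_regression_mapreduce(X, y, n_chunks=4):
    """Split the data into chunks, map over each, then reduce to coefficients."""
    chunks = zip(np.array_split(X, n_chunks), np.array_split(y, n_chunks))
    return lr_reduce([lr_map(c) for c in chunks])
```

Because only the small X'X and X'y matrices travel between map and reduce, the same pattern scales to data that never fits on one machine, which is exactly what rmr2 delegates to Hadoop.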

HOW DO YOU START A PROFITABLE TRADING BUSINESS? Read more NOW >>>

NOTE I now post my TRADING ALERTS into my personal FACEBOOK ACCOUNT and TWITTER. Don't worry as I don't post stupid cat videos or what I eat!

More links for Google use of MapReduce in Hadoop and use of Scala at Twitter which replaced some Ruby back end

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf

http://www.artima.com/scalazine/articles/twitter_on_scala.html

Watch Matlab do parallel computing on a GPU very easily and quickly? Why use R, Hadoop, MapReduce, HBASE, etc?

If you watched this webinar, you would be asking yourself several questions:

  1. Why use R, which seems somewhat immature compared to Matlab?
  2. Even if you used this Revolution Analytics project with Hadoop, you still have so much to set up and manage compared to Matlab’s Parallel Computing Toolbox. This R project is nowhere near complete compared to the maturity of Matlab.

Regardless of cost, I still feel Matlab makes you much more productive compared to setting up and managing hardware and underlying technologies like Hadoop/HBASE/MapReduce with R. It just seems to me that it would not add up compared to the productivity you get out of the box with Matlab. Sometimes the cost of time needs to be factored into open source technologies, even though they are free. Just my worthless two cents.

Parallel Computing with MATLAB in Computational Finance

 

http://www.mathworks.com/company/events/webinars/wbnr51891.html?id=51891&p1=801727294&p2=801727312

 

Quant analytics: Do big data use cases require the data to be stored? Hadoop? MapReduce?

Recently I checked out some Hadoop use cases like index building, text analytics, etc., and it seems that most of them don’t really want to save the data; they only do the analysis and then save the results – indexes, summaries of text with links to the source. It seems like stream processing would be much more suitable and cost-effective.

 

You need to do both, regardless of whether or not you’re a product manager for a stream processing solution. Normally I would have deleted this post, but I am going to let it stand as a stark reminder to others who might think of posting something so self-serving.

 

I would love to learn more how Stream processing can help there.

From my limited perspective I can say that many use Hadoop for machine-learning tasks, for which batch-oriented computation is a great fit. Many use Hadoop for Top-K-like analysis, which can be quite memory-demanding if temporary results are not serialized to disk. Recently, Hadoop is being used to build indexes on highly dimensional data, which used to be relational databases’ turf.

Can you give us some examples how Stream processing helps with these problems, or which tasks specifically can be better solved with Stream processing than with MapReduce?
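As one concrete illustration of the trade-off, a Top-K over a stream can be maintained with bounded working state in a single pass, instead of a batch sort over stored raw data. A toy Python sketch (the names and approach are mine, not from any product discussed in this thread):

```python
import heapq
from collections import Counter

def streaming_top_k(events, k):
    """Consume a stream of items one at a time, keeping only running counts,
    and return the k most frequent items without retaining the raw stream."""
    counts = Counter()
    for item in events:          # single pass; raw events are never stored
        counts[item] += 1
    # heapq.nlargest keeps a k-sized heap internally: O(n log k)
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])
```

Memory here grows with the number of distinct items rather than the number of events; a production streaming system would typically cap even that, e.g. with a count-min sketch.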

 

Every use case is different. However, many people like Hadoop because it gives them a simple and inexpensive way to retain raw data for longer. For example, before Hadoop an enterprise could only store two months of raw logs and would have to summarize and delete; now with Hadoop, users can keep that raw data longer.

 

There are implementations of stream-based map/reduce (several companies have them now). This allows the capabilities to be mixed and matched depending on the specific use case. I think his is a cost/benefit analysis of processing data in flight vs. the cost of storage, because he’s the product manager for Infostreams. The better way to look at this is to understand what real-time benefits can be realized in addition to storing the data for later research, etc. I have asked the group for use cases before, to highlight what a standard big data, low-latency data flow might look like. I will post one of those diagrams later today that we use to explain this process.

 

You need both. Cloudscale (cloudscale.com) allows the same analytics program to run unchanged on, for example, two years of historical data (MapReduce-style) or on a live real-time stream coming in at a rate of millions of events per second (stream-processing-style). If you run our in-memory analytics engine in the cloud (AWS), we use S3 storage for the former, and AWS cluster nodes with dedicated 10GigE for the latter.

Stream processing beats MapReduce by 1000x or so in any case where low latency matters, i.e. where you have a window of between, say, 0.1 sec and 10 sec to go from raw data to analytics to action. The list of such applications in business, web, finance and government is long and getting longer every day.
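To make the latency point concrete: a streaming analytic keeps just enough state to answer immediately as each event arrives, rather than re-scanning stored data. A toy rolling-mean example in Python (illustrative only, not any vendor's engine):

```python
from collections import deque

class SlidingWindowMean:
    """Maintain a rolling mean over the last `window` events, so each new
    event can trigger an action immediately without touching stored data."""
    def __init__(self, window):
        self.window = window
        self.buf = deque()
        self.total = 0.0

    def update(self, value):
        """O(1) per event: add the new value, evict the oldest if needed."""
        self.buf.append(value)
        self.total += value
        if len(self.buf) > self.window:
            self.total -= self.buf.popleft()
        return self.total / len(self.buf)
```

The per-event cost is constant regardless of how much history has flowed past, which is where the latency advantage over a batch re-scan comes from.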

 

-=-

I just posted a presentation that shows one or two use cases incorporating streaming MR and traditional batch MR. Please comment as vociferously as I do….

 

Hadoop is not really meant for real-time stream processing. I am interested in this area and have been learning about S4 (Apache) and Storm (Twitter) for real-time, distributed, scalable stream processing.

http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html

https://github.com/s4/core

 

It’s important to remember that a lot of companies treat their Hadoop cluster as if it’s part of their ‘sekret’ sauce. That is, it’s protected IP and it doesn’t get talked about.

Also, you need to stop propagating the ‘urban myth’ that Hadoop is good or not good for any one thing. Volkmar of HStreaming would disagree with Pranab. 😉

 

The real truth is that even if you are processing streaming data, you still tee it off to be stored, because you will want to do analytics on the data over time. Today’s stream is last week’s ‘old news’, and if you want to do trend analysis… well, you’re going to need both. 😉
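The ‘tee it off’ pattern described above is simple to sketch: every event takes two paths, one for immediate action and one into durable storage for later trend analysis. A toy Python version (the names are mine, purely illustrative):

```python
def tee_stream(events, store, analyze):
    """Act on each event in real time while also appending it to storage,
    so batch analytics over the full history remains possible later."""
    for e in events:
        analyze(e)       # real-time path: act now
        store.append(e)  # durable path: keep raw data for trend analysis
```

In a real deployment `store` would be HDFS or S3 and `analyze` a streaming engine, but the shape of the data flow is the same.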

 

==

Agreed that if you are doing analytics over a long time horizon, stream processing is not the way. However, if your processing does not require any history, or if it does but the time window is very short, stream processing is the right way when real-time response is critical.

One scenario where you need both is fraud prediction. You might build the prediction model using Hadoop through batch processing. But to detect fraud based on events that are happening now, you need real time stream processing, which will use the prediction model. You want to detect fraud now rather than through some batch process 2 hours from now, when the damage may already have been done.
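The fraud scenario above can be sketched in a few lines: a model is fit offline over historical data (the batch/Hadoop side), then applied per event as transactions arrive (the streaming side). A deliberately simplified Python sketch; the threshold model here is my stand-in for whatever real model the batch job would produce:

```python
import statistics

def train_batch(historical_amounts):
    """Batch side: fit a simple anomaly threshold over historical data
    (stand-in for the prediction model a Hadoop job would build)."""
    mu = statistics.mean(historical_amounts)
    sigma = statistics.stdev(historical_amounts)
    return {"threshold": mu + 3 * sigma}

def score_stream(events, model):
    """Streaming side: flag each incoming transaction immediately,
    using the model built offline."""
    for amount in events:
        yield amount, amount > model["threshold"]
```

The division of labor is the point: the expensive fit happens in batch, while the per-event decision is cheap enough to make now, not two hours from now.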

 

Thanks for all the feedback. I’m certainly aware of many use cases that require both: create mining models or text models from warehouses or Hadoop historical data, then load them into a streaming system to detect and take action in real time. The group I work with offers a Hadoop distribution and also warehouses, so it’s natural that we have those use cases. You can certainly interpret that I was fishing, but I really was interested to learn. It seems there are many use cases where they don’t really want to save the data, such as sentiment analysis, geospatial, or event processing. So why was Hadoop used? Were there no streaming solutions robust enough (or any at all, depending on how long ago)?

 

==

Hadoop was used because you need to save the data. Disk is cheap; Infostreams is not. There have been stream processing solutions for close to 10 years now, and EAI solutions (remember event processing before CEP?) for the past 20 years. They save the data because it’s easier, and so that it can be examined later. Specifically, in many instances the creation of inverted indexes is re-examined when constructing complex keys (keys with embedded meaning), and to do that, even if you created the initial index with a stream processor, you might need to re-calculate those indexes down the road. If you didn’t store the base info, you can’t do that. You say you work with the group at IBM that offers Hadoop; perhaps they might have some insight into why they’ve persisted data for your clients?

 

The most obvious (extreme) cases where it *sometimes* makes sense to delete the primary raw data after high velocity live streaming analysis are certain RFID, sensor and location apps, where the raw data is a highly synchronous high frequency firehose, but it’s amenable to simple, fast extreme compression, with no significant loss of information with respect to those apps.
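As an illustration of that kind of compression: a high-frequency sensor stream whose values change only rarely collapses to almost nothing under simple run-length encoding, with no loss of information. A toy Python sketch (the encoding choice is mine, purely illustrative):

```python
def run_length_encode(samples):
    """Collapse runs of identical sensor readings into (value, count) pairs,
    so a slowly-changing high-frequency stream shrinks dramatically."""
    runs = []
    for s in samples:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return [(v, n) for v, n in runs]

def run_length_decode(runs):
    """Recover the original stream exactly from the (value, count) pairs."""
    return [v for v, n in runs for _ in range(n)]
```

Because the decode is exact, deleting the primary raw stream after encoding loses nothing for apps that only care about the values; what is lost is the ability to ask questions the chosen encoding didn't anticipate, which is the caveat behind "sometimes".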

 

 

Introducing Hortonworks – The Next Generation of Apache Hadoop MapReduce

This presentation gives some loose details about the future of Hadoop as Hortonworks sees it. This is important because one of the founders of Hortonworks is Arun Murthy. Arun has been at Yahoo for a number of years, focused on Hadoop. To say that Arun knows Hadoop would be somewhat of an understatement.

The big news to me here is that the limitations of Hadoop are finally being acknowledged, and it would seem that they’re in the process of being addressed, especially if they could take Hadoop into the realm of continuous query.

As I pointed out to the guys at BackType about Storm: everyone who knows anything about this space is going real time. I’ve watched that shift happen over the last two years. It started as a trickle, and people said it was impossible. Look at the space now; the pace is mind-boggling.

Take a look at the presentation – and let’s discuss!

Quant development : Are Hadoop and MapReduce Just Too Complicated for the Average Enterprise?

Seems like these technologies might be in danger of “organ transplant” rejection.

 

Bringing Hadoop into the Enterprise itbusinessedge.com

New technologies in the enterprise can be a lot like organ transplants: there’s always a chance that the IT body is going to reject them. That’s why sometimes…

 
