Is this big data? What metrics would you run against this dataset for quant analysis?
The dataset consists of 65,000+ customers and 2,000+ SKUs, served through 20+ channels. It is comprehensive: granular profitability is calculated at the intersection of customer, channel and product. ~300 million line items are calculated monthly, including YTD data. This is accomplished in a closing window of <18 hours, and subsegment P&Ls are generated from the recordset. Do you consider this big data? Either way, why?
Nope. Too small. Easily fits inside a relational model.
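A rough back-of-envelope check on the "too small" verdict, using the figures from the question. The 200-byte average row width is purely an assumption for illustration (the thread gives no row size), so treat the storage numbers as an order-of-magnitude sketch only.

```python
# Figures quoted in the question (lower bounds, since each is given as "N+").
CUSTOMERS = 65_000
SKUS = 2_000
CHANNELS = 20
LINE_ITEMS_PER_MONTH = 300_000_000

# Assumption, not from the thread: average stored width of one line item.
ASSUMED_BYTES_PER_ROW = 200


def potential_intersections(customers, skus, channels):
    """Upper bound on distinct customer/SKU/channel combinations."""
    return customers * skus * channels


def gigabytes(rows, bytes_per_row):
    """Raw storage in decimal gigabytes, ignoring indexes and compression."""
    return rows * bytes_per_row / 1e9


cells = potential_intersections(CUSTOMERS, SKUS, CHANNELS)
monthly_gb = gigabytes(LINE_ITEMS_PER_MONTH, ASSUMED_BYTES_PER_ROW)
yearly_gb = 12 * monthly_gb

print(f"possible intersections: {cells:,}")
print(f"raw size per month: {monthly_gb:.0f} GB, per year: {yearly_gb:.0f} GB")
```

Under that assumed row width, a month of results is tens of gigabytes and a year is well under a terabyte, which a single relational server can hold.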
Do you plan to merge this data with any social media or other semi-structured data (images, financial documents, etc.)? What is the velocity associated with the collection of this data? I am guessing there has been some data modeling done here, but what rate of change do you anticipate for the different data assets that you plan to use? Some of these items will provide insight into whether this would benefit from a big data solution or not.
However, up to a point: if the data footprint fits into a database and the data assets and their elements can be serviced well by a relational solution, pursue that first.
This is certainly at the smaller end of the spectrum of what is generally accepted as Big Data. Big Data is typically defined by variety and volume of data, but add in velocity and that can all change. You say you’re analyzing this all in monthly batches in under 18 hours. Your data can become ‘big’ if your company needs to start getting this information weekly or even daily, to benefit from tracking customer/market trends or evaluating channel metrics and performance, and to enable more agile decision making. Then you need to deal with bringing that 18 hour window down to, say, 3 or 4, as well as reducing your nightly batch windows by enough to accommodate the new process. That is a lot easier and cheaper than people tend to think, but it still requires an investment. Decide what data is most important to you now, how often and how quickly you need that data, and base your solution strategy on that.
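To put numbers on shrinking the window, here is a small sketch of the sustained throughput implied by the figures in this thread. It assumes the work is spread evenly across the window, which a real closing process almost certainly is not; the function names are made up for illustration.

```python
LINE_ITEMS = 300_000_000  # monthly line items, from the original question


def rows_per_second(rows, window_hours):
    """Average throughput needed to finish `rows` within the window."""
    return rows / (window_hours * 3600)


def required_speedup(current_hours, target_hours):
    """How many times faster the job must run to hit the target window."""
    return current_hours / target_hours


current = rows_per_second(LINE_ITEMS, 18)
target = rows_per_second(LINE_ITEMS, 4)
speedup = required_speedup(18, 4)

print(f"18 h window: {current:,.0f} rows/s")
print(f" 4 h window: {target:,.0f} rows/s ({speedup:.1f}x faster)")
```

Going from 18 hours to 4 is a 4.5x improvement, and to 3 hours a 6x improvement, on the same monthly volume.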
Thanks, appreciate the responses… more to come, but I’ll start some new threads. Some issues: velocity (definition and appropriate standard measures), and RDBMS (don’t we have to get the data into some type of OLAP to actually analyze it?).
It depends on what the analytics are. If you can do what you need to in a relational format easily and are thinking in terms of cubes, you’ll likely be better served doing that. If, however, your needs include the flexibility to expand/add/drop data sets for experimentation, or analytics where a query structure isn’t the most natural way to express the question or is computationally very expensive (think machine learning for pattern identification), then big data tools may be worth a look even if the data set itself is not “big”.
There is this expression “… If all you have is a hammer, then all your problems look like a nail”. … I’m sure I fractured that statement, but you get the idea…
Based on your initial problem definition, you have to ask yourself if you can easily fit the problem into an RDBMS, meaning the data does not require the scale of Hadoop.
Based on your initial numbers… Not even close.
It’s surprising that while Tom works for a large vendor that’s spending lots of money attempting to buy market share, he forgets that his company also sells two RDBMS engines that are probably better equipped to solve your use case. Or rather, one of those engines is. (This happens to be that proverbial hammer. 😉)
Using IBM as an example, check out IDS.
Here you can use the engine as your OLTP source, which is what you will want since you are talking about a system of record. You have built-in extensibility in that you can extend the relational model. See Stonebraker’s Illustra, which IFMX bought in ’95. You have this thing called RTL where you can load over 50k of tick data a second, so you can really handle velocity. (Note: you added velocity as an afterthought and your data uses do not suggest that level of velocity. RTL would be overkill.) But that same extensibility allowed them to create IWA, essentially an all-in-memory appliance which you can attach to your RDBMS engine to run queries across both machines using industry-standard SQL.
Of course there are limitations in terms of scalability. However for your data set size, they could be an option.
In terms of Analytics… There used to be a partnership with NAG hence the NAG data blade… So you’ve got that covered.
It’s a pity that Janet killed Arrowhead. Had it gone through, things would have looked a bit different in terms of the big data space.
But I digress. The point is that I can solve your problem with a different toolset. As someone who is not a talking head but is actually working as a solutions architect, I can tell you that each solution has its share of trade-offs. You have to balance the pluses and minuses when trying to find the right solution for you…
You evidently missed the part where I said “If you can do what you need to in a relational format easily and are thinking in terms of cubes, you’ll likely be better served doing that.”
You were just regurgitating what was already posted in earlier responses. 🙂
You went on to make a comment about how it’s not always the size of the data, but the complexity of the analytics…
For which I again point to advances in IDS, stable for the past 10+ years, that handle complex analytics.
Again, size and complexity point to a non-Hadoop solution. Add velocity, which the OP did, and again IDS solves that issue.
While you work for IBM, in IM, you don’t really know your own product sets. Typical for IBM. Don’t feel bad though, I seriously doubt there are more than a handful of people, if any, who could tie all the products in IBM’s portfolio together…
Like I said in a terse post way at the top… Not a big data problem…;-)
Statistics are maintained at each of the 300 million data points. The stats first drive costs to the appropriate channel/customer/product and, second, intersect the costs to arrive at a cost at the channel/customer/product intersection.
Some of the statistics and sources include:
..a) Route Management System (#Services, Channel_ID, TimeOnRoute);
..b) “Hand Held” System – time stamps (#MinutesAtService for various activities);
..c) Warehouse management system (labor activity distribution);
..d) 3rd Party Freight Management System (Freight Lanes and Costs);
..e) Inventory (#QuantityOnHand);
..f) Certain specialized databases (Assets, Payroll);
..g) ERP transactional system including Inventory Transfer (SKU counts across lanes), G/L (costs); Order Entry (order counts by SKU & Customer)
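As an illustration of how a statistic can "drive" a cost to the channel/customer/product level, here is a minimal sketch that spreads a cost pool in proportion to a driver such as #MinutesAtService. The proportional rule, the keys, and the figures are all assumptions for the example; the actual allocation logic is not described in this thread.

```python
def allocate(cost_pool, driver_by_key):
    """Spread `cost_pool` across keys in proportion to each key's driver value.

    `driver_by_key` maps a (channel, customer, sku) tuple to a driver
    statistic, e.g. minutes at service. Hypothetical helper for illustration.
    """
    total = sum(driver_by_key.values())
    if total == 0:
        raise ValueError("driver statistic sums to zero; nothing to allocate on")
    return {key: cost_pool * value / total for key, value in driver_by_key.items()}


# Made-up driver values: minutes at service per channel/customer/SKU.
minutes_at_service = {
    ("retail", "C1", "SKU1"): 30,
    ("retail", "C2", "SKU1"): 10,
    ("online", "C1", "SKU2"): 20,
}

# Made-up cost pool, e.g. one route's delivery labor for the month.
allocated = allocate(600.0, minutes_at_service)

for key, cost in allocated.items():
    print(key, cost)
```

The allocated amounts always sum back to the pool, which is what keeps a result set like this reconcilable to the G/L.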
The 300 million result set is a single source for all SKU/customer/channel reporting and analytics, giving management transparency into the customer supply chain. It drives tactical and strategic decisions including: a) Subsegment P&Ls; b) Pricing (value of customer relationship); c) Process improvement; d) Logistics optimization; e) Product release profile; f) New customer profile.
Yes, it is RDBMS/OLAP driven, monthly (auditable to the G/L). It seems that the consensus is that calling this “big data” is a stretch… maybe “very large data”. However, this is driving decision making. Is “big data” doing the same?
If you want to fix the 18 hour job cycle then, yes, some big data technologies might be useful for parallelizing the analysis. Hadoop processes don’t really allow you to pack much algorithm in one pass. You could make a preprocessing phase that creates a common input dataset, and then run separate parallel algorithms on that for your various analyses.
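The "preprocess once, then fan out" shape described above can be sketched on a single machine as follows. This is only an illustration: the row fields, the analyses, and the use of a thread pool are assumptions, and a real deployment would swap in a process pool, several servers, or a cluster framework for the CPU-heavy work.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def preprocess(raw_rows):
    """Build the common input dataset once: drop rows with no cost."""
    return [row for row in raw_rows if row.get("cost") is not None]


def cost_by(field):
    """Return an analysis function that totals cost per value of `field`."""
    def analysis(rows):
        totals = defaultdict(float)
        for row in rows:
            totals[row[field]] += row["cost"]
        return dict(totals)
    return analysis


def run_analyses(raw_rows, analyses):
    """Preprocess once, then run every analysis concurrently on the result."""
    common = preprocess(raw_rows)
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, common) for name, fn in analyses.items()}
        return {name: future.result() for name, future in futures.items()}


# Made-up sample rows standing in for the monthly line items.
RAW = [
    {"channel": "retail", "customer": "C1", "cost": 10.0},
    {"channel": "retail", "customer": "C2", "cost": 5.0},
    {"channel": "online", "customer": "C1", "cost": 7.0},
    {"channel": "online", "customer": "C3", "cost": None},
]

RESULTS = run_analyses(
    RAW,
    {"by_channel": cost_by("channel"), "by_customer": cost_by("customer")},
)
print(RESULTS)
```

Because each analysis only reads the shared dataset, the analyses stay independent and can be moved to separate workers without changing their logic.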
There are open source OLAP databases (Pentaho for one) so you can have several servers, each running a different analysis.