September 26, 2012

When should analytics be in-memory?

I was asked today for rules or guidance regarding “analytical problems, situations, or techniques better suited for in-database versus in-memory processing”. There are actually two kinds of distinction to be drawn:

Let’s focus on the first part of that — what work, in principle, should be done in memory? 

Please note that:

Thus, the choice of whether to do something entirely in memory usually isn’t a simple one; even in theory, it depends on factors such as database size and hardware capacity, as well as on the specific approaches used to manage data in the in-memory and on-disk alternatives.
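
To make that concrete, here is a minimal back-of-the-envelope sizing check in Python; every figure in it (raw data size, compression ratio, overhead factor, node specs) is a hypothetical placeholder, not a recommendation:

    # Rough check of whether a working set plausibly fits in RAM.
    raw_size_tb = 5.0          # uncompressed size of the data to be analyzed
    compression_ratio = 4.0    # columnar stores often compress severalfold
    overhead_factor = 1.5      # indexes, intermediate results, OS and process overhead

    ram_per_node_gb = 256
    node_count = 16

    needed_gb = raw_size_tb * 1024 / compression_ratio * overhead_factor
    available_gb = ram_per_node_gb * node_count

    print(f"need ~{needed_gb:,.0f} GB, have {available_gb:,.0f} GB")
    print("fits in memory" if needed_gb <= available_gb else "spills to disk")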

The two biggest exceptions that come to mind are:

To be more specific, let’s look at everybody’s two favorite kinds of analytics — business intelligence (BI) and predictive modeling — which haven’t yet been well-integrated. Two of the best reasons for putting BI in memory are probably:

Some BI vendors, especially the visualization-intensive ones, address these needs via proprietary in-memory data stores. Others just create temporary in-memory databases from reports and other query result sets. Either can work.
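
As a sketch of the second approach, here is what loading a query result set into a throwaway in-memory database can look like, using SQLite's :memory: mode (the table, columns, and data are invented for illustration):

    import sqlite3

    # Pretend this came back from a report or a warehouse query.
    result_set = [
        ("2012-09-01", "EMEA", 125000.0),
        ("2012-09-01", "APAC",  98000.0),
        ("2012-09-02", "EMEA", 131000.0),
    ]

    # Load it into a temporary in-memory database for fast follow-up queries.
    mem = sqlite3.connect(":memory:")
    mem.execute("CREATE TABLE sales (day TEXT, region TEXT, revenue REAL)")
    mem.executemany("INSERT INTO sales VALUES (?, ?, ?)", result_set)

    # Ad hoc drill-down now runs entirely in memory.
    for row in mem.execute(
        "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY 2 DESC"
    ):
        print(row)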

Two other reasons to do BI in-memory are:

Reasons such as “Prebuilding aggregates is annoying; in-memory lets you use the raw data” are often just disguised forms of the performance argument.
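
A tiny Python sketch of why: a prebuilt aggregate and an on-the-fly aggregation of the raw detail rows give the same answer; the only real difference is how fast the second one runs (the data is invented):

    # Hypothetical detail rows: (customer, product, quantity).
    detail = [
        ("alice", "widget", 3),
        ("bob",   "widget", 1),
        ("alice", "gadget", 2),
    ]

    # Option 1: a prebuilt aggregate, maintained ahead of query time.
    prebuilt = {"widget": 4, "gadget": 2}

    # Option 2: aggregate the raw detail rows at query time.
    on_the_fly = {}
    for _, product, qty in detail:
        on_the_fly[product] = on_the_fly.get(product, 0) + qty

    # Same answer either way; in-memory processing just makes option 2 fast enough.
    assert on_the_fly == prebuilt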

Finally, in the case of predictive modeling, I find it hard to separate the in-memory question from parallelism. The default way of doing predictive modeling is:

The biggest problems with that plan seem to occur when:

That said, I don’t currently have much to add to what I wrote about the in-memory/in-database trade-off for parallel predictive modeling back in April, 2011.
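
For concreteness, here is a minimal sketch of one common version of that default flow: pull an extract out of the database, fit a model in memory, then push the (embarrassingly parallel) scoring back to the database as a per-row expression. SQLite and a hand-rolled one-variable regression keep it self-contained; the table, columns, and numbers are invented:

    import sqlite3

    # A stand-in for the warehouse: an invented table of (spend, outcome) pairs.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customers (spend REAL, outcome REAL)")
    db.executemany("INSERT INTO customers VALUES (?, ?)",
                   [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)])

    # Step 1: pull a (possibly sampled) extract into memory.
    rows = db.execute("SELECT spend, outcome FROM customers").fetchall()

    # Step 2: fit a least-squares line outcome = a * spend + b, entirely in memory.
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n

    # Step 3: scoring is a per-row calculation, so it can run back in the database.
    scored = db.execute(f"SELECT spend, {a} * spend + {b} FROM customers").fetchall()
    print(scored)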

Comments

2 Responses to “When should analytics be in-memory?”

  1. Thomas W Dinsmore on September 27th, 2012 9:22 am

    All analytic computations run in memory. The distinction between traditional analytics, in-database analytics and the new category of “in-memory analytics” is the size of the memory.

    Take SAS, for example. Legacy SAS software runs in memory; when the size of the data set exceeds memory, SAS swaps back and forth between memory and disk, which causes a performance hit. SAS’ new in-memory product (HPA) runs on a box with a large enough memory to avoid swapping on large problems.

    Of course, since the HPA box isn’t large enough to hold the entire data warehouse, you’re still moving data around. (Teradata’s largest Model 700 maxes out at 40TB uncompressed).

    By eliminating data movement, in-database analytics deployed in an MPP architecture always trump “in-memory” analytics when two conditions are true:

    (1) The data is already in the database (because it’s your warehouse)

    (2) The analytics problem is embarrassingly parallel. This is always true for scoring and is true for the most commonly used analytics. Case-Based Reasoning, on the other hand, is not embarrassingly parallel, and is an appropriate application for pure in-memory analytics.

    SAS has produced no evidence to date that in-memory analytics outperform in-database analytics on comparable problems. The claimed performance benefits (100x) are similar to those reported by users of in-database analytics.

  2. Neil Hepburn on September 30th, 2012 7:45 pm

    For me it comes down to flexible interactive ad hoc querying. This is the sweet spot of analytics. You can’t beat the combination of human intelligence and experience working with an apparatus that can answer virtually any question in less than a second. This is when the magic happens.

    But it’s not just about memory. It’s about overall latency. Namely, never leaving the bus to begin with, not even to go to the network. In other words, scale-out doesn’t work here. This is why they still build supercomputers, and why the Cray was always designed to be cylindrical – for minimum latency.

    Associative tools like QlikView and PowerPivot simply will not work when there is latency.

    On a related note, I believe we reached an inflection point in 2009 when Windows 7 was released. That was the first time Microsoft really got the 64-bit OS right, and 64-bit became the de facto standard, leading to abundant cheap memory. For example, I can now buy a server with 2 TB RAM for < $200k.

    Once you've been working with post-OLAP technologies like QlikView and PowerPivot, it becomes pretty obvious that OLAP is turning into a niche technology (possibly suitable for EPM, but that's about it).

    We've seen this play out before. Those who recall the slow but steady transition away from pre-relational hierarchical (e.g. IMS) and network (e.g. IDS) databases will recognize this pattern. Unfortunately, in the B2B world, these changes are glacial.
