I was asked today for rules or guidance regarding “analytical problems, situations, or techniques better suited for in-database versus in-memory processing”. There are actually two kinds of distinction to be drawn:
- Some workloads, in principle, should run on data to which there’s very fast and unfettered access — so fast and unfettered that you’d love the whole data set to be in RAM. For others, there is little such need.
- Some products, in practice, are coupled to specific in-memory data stores or to a specific DBMS, even though other similar products don’t make the same storage assumptions.
Let’s focus on the first part of that — what work, in principle, should be done in memory?
Please note that:
- (Almost) anything you can do in-memory can also be done without the whole data set being in RAM. It’s all a matter of performance.
- If all your data fits into RAM, that’s great, and you can leave it there.
- A lot depends on how you manage data in memory.
Thus, the choice whether or not to do something entirely in memory usually isn’t a simple one; even in theory, it depends on metrics such as database size, hardware capacity and the like, as well as on the specific approaches used to manage data in the in-memory and on-disk alternatives.
The two biggest exceptions that come to mind are:
- Some algorithms rely on fairly random access to the data. Those are typically best served by putting the whole data set into memory. In particular, approaches to relationship analytics and graph processing tend to be fundamentally in-memory, as in the case of YarcData.
- Some workloads need such low latency there’s no time to write the data to disk before analyzing it. This is the core use case for complex event/stream processing.
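To see why graph work wants everything in RAM, consider a plain breadth-first search. This is a minimal sketch, not any particular vendor’s implementation; the graph and names are invented for illustration. The key point is in the access pattern: each hop jumps to an arbitrary node’s neighbor list, which is cheap in memory but would mean roughly one disk seek per hop if the graph lived on disk.

```python
from collections import deque

def shortest_hops(adj, start, goal):
    """Breadth-first search over an in-memory adjacency map.

    Every iteration does a random-access lookup into `adj` -- the
    pattern that makes disk-resident graphs painfully slow to traverse.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for neighbor in adj.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None

# Toy who-knows-whom graph, kept entirely in RAM.
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(shortest_hops(adj, "a", "e"))  # prints 3
```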
To be more specific, let’s look at everybody’s two favorite kinds of analytics — business intelligence (BI) and predictive modeling — which haven’t yet been well integrated. Two of the best reasons for putting BI in memory are probably:
- You want to keep drilling down on the result set to your original query. That makes a lot of sense; you shouldn’t have to go to disk each time.
- You want to keep trying new visualizations of substantially the same data set. Ditto.
Some BI vendors, especially the visualization-intensive ones, address these needs via proprietary in-memory data stores. Others just create temporary in-memory databases from reports and other query result sets. Either can work.
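The “temporary in-memory database from a query result set” pattern can be sketched in a few lines. This is just an illustration, using SQLite’s in-memory mode as a stand-in for whatever cache a BI tool actually maintains; the sales rows and column names are invented.

```python
import sqlite3

# Hypothetical result set, pulled once from the warehouse.
result_set = [
    ("east", "widgets", 120),
    ("east", "gadgets", 80),
    ("west", "widgets", 200),
]

# Load it into a temporary in-memory database...
cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE rs (region TEXT, product TEXT, revenue INT)")
cache.executemany("INSERT INTO rs VALUES (?, ?, ?)", result_set)

# ...so every subsequent drill-down or re-visualization is a local
# query against RAM, not another round trip to the warehouse.
total = cache.execute("SELECT SUM(revenue) FROM rs").fetchone()[0]
east = cache.execute(
    "SELECT product, revenue FROM rs WHERE region = 'east'"
).fetchall()
print(total)  # prints 400
print(east)   # prints [('widgets', 120), ('gadgets', 80)]
```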
Two other reasons to do BI in-memory are:
- You really want low latency. So far this is a fairly niche use case, but it’s one that could well grow.
- You just don’t think your DBMS is fast enough. That one has led to considerable marketing hype; while Oracle may indeed not be fast enough at an affordable cost, in many such cases an analytic RDBMS would do a great job.
Reasons such as “Prebuilding aggregates is annoying; in-memory lets you use the raw data” are often just disguised forms of the performance argument.
Finally, in the case of predictive modeling, I find it hard to separate the in-memory question from parallelism. The default way of doing predictive modeling is:
- Run a big query.
- Put the result set in RAM.
- Model on it.
The biggest problems with that plan seem to occur when:
- The data is on many scale-out nodes.
- The extracts don’t really fit well into RAM.
That said, I don’t currently have much to add to what I wrote about the in-memory/in-database trade-off for parallel predictive modeling back in April 2011.