When I posted a long list of architectural options for analytic DBMS, I left a couple of IOUs for missing parts. One was in the area of what is sometimes called advanced-analytics functionality, which roughly speaking means aspects of analytic database management systems that are not directly related to conventional* SQL queries.
*Main examples of “conventional” = filtering, simple aggregations.
The point of such functionality is generally twofold. First, it helps you execute analytic algorithms with high performance, by reducing data movement and/or executing the analytics in parallel. Second, it helps you create and execute sophisticated analytic processes with (relatively) little effort.
For now, I’m going to refer to an analytic RDBMS that has been extended by advanced-analytics functionality as an analytic computing system, rather than as some kind of “platform,” although I suspect the latter term is more likely to wind up winning. So far, there have been five major categories of subsystem or add-on module that contribute to making an analytic DBMS a more fully-fledged analytic computing system:
- SQL extensions. Examples include SQL-2003 analytics (notably windowing) and vendor-specific temporal functionality. (A windowing example is sketched just after this list.)
- A framework for UDFs (User-Defined Functions) to further extend SQL. At its core, a relational DBMS is a big SQL interpreter. SQL, while powerful, only does a limited number of things; User-Defined Functions are, in effect, new predicates added to the SQL language that do additional things. (A UDF sketch also follows the list.)
- An execution engine for analytic processes that is less coupled to the SQL engine than a pure UDF framework might be. The two main approaches are MapReduce (e.g., Aster Data) and general C++ libraries (Netezza, ParAccel). (A MapReduce-style invocation is likewise sketched below.)
- Libraries of pre-built analytic processes. Commonly included are statistics (and other machine learning), general linear algebra, and Monte Carlo analysis. Some of these functions are fully parallelized (perhaps tens per vendor). Others just play nicely with the vendor’s execution framework, in that a separate copy can be run on each node (up to thousands per vendor, for those who bring in open-source statistics libraries).
- Development tools such as integrated development environments (IDEs). Aster keeps trying to convince me that having built a nice Eclipse IDE is a major competitive differentiator.
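To make the first bullet concrete, here is a minimal sketch of SQL-2003 windowing. The table and column names (trades, ticker, trade_date, price) are hypothetical:

```sql
-- A moving average of price per ticker, over the current row plus the six
-- preceding ones, using standard SQL-2003 windowing instead of a self-join.
SELECT
    ticker,
    trade_date,
    price,
    AVG(price) OVER (
        PARTITION BY ticker
        ORDER BY trade_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg
FROM trades;
```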
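Similarly, here is a rough sketch of what defining and using a UDF can look like. Declaration syntax varies considerably from vendor to vendor; the function name, signature, and external library below are all made up, and the function body would typically be written in C++, Java, or another host language:

```sql
-- Hypothetical scalar UDF computing great-circle distance; the EXTERNAL NAME
-- clause (SQL/PSM-style external routine binding) points at compiled code.
CREATE FUNCTION haversine_km(
    lat1 DOUBLE PRECISION, lon1 DOUBLE PRECISION,
    lat2 DOUBLE PRECISION, lon2 DOUBLE PRECISION
) RETURNS DOUBLE PRECISION
EXTERNAL NAME 'geo_lib.haversine';

-- Once registered, it can be used like any built-in predicate or expression:
SELECT store_id
FROM stores
WHERE haversine_km(lat, lon, 40.75, -73.99) < 10;
```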
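And for the MapReduce style of execution engine, here is a hypothetical invocation in the general spirit of Aster Data’s SQL/MapReduce. The sessionize function and the clickstream table are illustrative only; actual vendor syntax will differ:

```sql
-- A table function consumes rows and emits rows; the PARTITION BY clause
-- tells the engine how to split the work across nodes, MapReduce-style.
SELECT user_id, session_id, page_count
FROM sessionize(
    ON clickstream
    PARTITION BY user_id
    ORDER BY click_time
);
```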
The most structural or architectural are the UDF framework and the non-UDF analytic execution engine. But even those are in essence add-on modules, which means that pretty much any vendor can do any part of them if they invest enough resources in the effort. So I expect considerable convergence over time as the industry and market discover which capabilities are or aren’t particularly useful.
When I’m being told about an analytic DBMS that supposedly has evolved into an analytic computing system, some of my top-of-mind questions are:
- How does the execution work? UDFs? MapReduce? Something else? What forms can the inputs and outputs of a UDF take? And by the way, what’s your complete list of MapReduce integration possibilities?
- What languages are currently supported? The obvious choices are C++ (if that’s the style of execution engine); SAS, R, or other statistical languages; and anything that is commonly associated with MapReduce (Java, Python, et al.).
- In-process, out-of-process, or both? In-process runs faster; out-of-process is more stable, in that the advanced-analytics part can crash without bringing down the whole DBMS. Even if other languages are available out-of-process, C++ might be the only in-process choice. If you don’t have out-of-process execution, you may not be serious about offering really broad analytic capabilities.
- Is there anything special about your library of pre-built, fully parallel processes? For example, I like Netezza’s broad approach to linear algebra, Greenplum’s sparse vector manipulation, and a number of Aster Data’s packages. ParAccel’s list looks interesting too, although I haven’t grilled them about what is or isn’t fully parallel.
- How does the associated memory management work? Can you create temporary data structures that survive longer than the process that spawned them? (Those can be useful for various kinds of lookup table; a sketch follows this list.) Do they have to be tabular? (Graphs and other alternatives can be useful.) And by the way, do UDFs or other processes have enough RAM under their control to run efficiently?
- How is workload management handled? Hopefully, everything that runs on the same cluster is handled by one integrated workload management system. Vendors for whom that isn’t true today should have a clear road map for getting there, because the alternative is something of a mess.
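On the memory-management point, standard SQL already gives one narrow answer in the form of global temporary tables, which outlive individual statements within a session. A sketch, with hypothetical table names; the more interesting vendor-specific question is whether UDFs can keep comparable structures — tabular or otherwise — in RAM between invocations:

```sql
-- A session-scoped lookup table: ON COMMIT PRESERVE ROWS (standard SQL for
-- global temporary tables) keeps rows alive across statements within the
-- session, though not beyond it.
CREATE GLOBAL TEMPORARY TABLE zip_to_region (
    zip    CHAR(5) PRIMARY KEY,
    region VARCHAR(32)
) ON COMMIT PRESERVE ROWS;

INSERT INTO zip_to_region
SELECT zip, region FROM reference_geography;

-- Later analytic steps (including UDF-laden queries) can reuse the table:
SELECT r.region, COUNT(*) AS customer_count
FROM customers c
JOIN zip_to_region r ON c.zip = r.zip
GROUP BY r.region;
```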
Please note what I’m not including in this discussion — the integration of DBMS and fairly ordinary business intelligence. That may have virtues, for reasons of price or performance, and the virtues may grow as in-memory BI and/or data management capabilities evolve. But for the foreseeable future, BI/DBMS integration is a fairly separate matter from the integration of analytic DBMS with sophisticated investigative analytics.