I posted last October about PADB (ParAccel Analytic DataBase), but held back on various topics since PADB 3.0 was still under NDA. By the time PADB 3.0 was released, I was on blogging hiatus. Let’s do a bit of ParAccel catch-up now.
One big part of PADB 3.0 was an analytics extensibility framework. If we match PADB against my recent analytic computing system checklist:
- ParAccel is proud of PADB’s coverage in analytics-oriented SQL standard capabilities.
- I’m not aware of any PADB SQL goodies that go beyond the ANSI standards.
- PADB has a pretty flexible framework for user-defined functions (UDFs). In particular, ParAccel asserts this framework is even better than MapReduce, because it lets you do more steps at once, although I have trouble convincing myself that this matters in an important way.
- Anyhow — like Aster Data, ParAccel asserts that the same framework on which its DBMS is built has now been exposed to people wanting to write other kinds of analytic processes. (But Aster Data describes its framework as being pretty straight MapReduce.)
- All of PADB’s analytic process execution capabilities are subsumed in the UDF framework.
- PADB does not yet contain much in the way of fully parallelized analytic libraries. Exception: Like many of its competitors, ParAccel has a Fuzzy Logix partnership.
- ParAccel hasn’t focused yet on analytic development ease of use. (And that’s putting it mildly.)
- The only language now supported for PADB analytics is C++. ParAccel promises more language support, with (at least) Java and R coming in the summer.
- In line with its extreme focus on speed, ParAccel for now offers only in-process analytics execution.
- In a near-future release (just heading into QA now), ParAccel promises that PADB UDFs will be very flexible in terms of what kinds of memory structures they manage. However, if you want a structure to persist past the end of a query, you need to map it to a row architecture.
- ParAccel’s workload management is still primitive — just a short-query bias, rather than any kind of explicit prioritization. Hence, the question as to whether workload management extends to analytic process execution is fairly moot.
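ParAccel's "more steps at once" claim can be illustrated abstractly. Here is a minimal Python sketch (purely illustrative; PADB's actual UDF API is C++ and nothing below comes from it) contrasting a chained multi-step pipeline with a strict map-then-reduce round:

```python
# Illustrative sketch only: contrasts a multi-step in-engine pipeline with
# classic MapReduce, where aggregation forces a shuffle by key.
# None of these names are PADB (or Aster Data) APIs.

from collections import defaultdict

def mapreduce(records, mapper, reducer):
    """One classic MapReduce round: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]

def pipeline(records, *steps):
    """A multi-step pipeline: each step transforms the whole stream,
    with no forced shuffle between steps."""
    for step in steps:
        records = step(records)
    return records

# Toy workload: count words longer than 3 characters.
text = ["the quick brown fox", "the lazy dog"]

# Pure MapReduce folds the filter into the mapper (or needs a second round).
counts_mr = mapreduce(
    text,
    mapper=lambda line: [(w, 1) for w in line.split() if len(w) > 3],
    reducer=lambda word, ones: (word, sum(ones)),
)

# The pipeline expresses the same job as three separate chained steps.
counts_pl = pipeline(
    text,
    lambda recs: (w for line in recs for w in line.split()),   # tokenize
    lambda recs: (w for w in recs if len(w) > 3),              # filter
    lambda recs: list(mapreduce(recs,                          # aggregate
                                lambda w: [(w, 1)],
                                lambda w, ones: (w, sum(ones)))),
)

print(sorted(counts_mr) == sorted(counts_pl))  # both yield the same counts
```

The point of the sketch is only that a pipeline model lets you insert arbitrary intermediate steps without an extra shuffle-and-materialize round between each pair; whether that matters much in practice is, as noted above, debatable.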
In other news, ParAccel’s Bala Narasimhan wrote:
Historically, an analyst who wants to spin up a new data mart with all of this data will have to wait for a number of days for the data copy to be made available. Instead, if you deploy PADB with a SAN that has fast and efficient snapshot and cloning capabilities, you can spin up multi-TB data marts in seconds.
That turns out to be not quite as ridiculous as it sounds. The scenario is:
- You’re using storage-area network technology with a copy-on-write option.
- You use the SAN’s copy-on-write option to make a second virtual copy of the database in question (or of certain tables/files/blocks from it).
- You point a separate instance of PADB at it, either on a separate cluster (“in seconds” — yeah, right) or else via virtualization (e.g. VMware — that sounds more plausible).
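The copy-on-write mechanism behind that scenario can be sketched in a few lines. This is a generic illustration of copy-on-write semantics, not PADB's or any SAN vendor's actual API: the clone is just a new block map pointing at the same data blocks, which is why it takes seconds regardless of data volume.

```python
# Generic copy-on-write sketch (no vendor API): a clone shares every data
# block with the original; only a subsequent write diverges, one block
# at a time.

class CowVolume:
    def __init__(self, block_map):
        # Logical block number -> block data; the data objects are shared.
        self.block_map = dict(block_map)

    def clone(self):
        # Snapshot cost is proportional to the metadata (the map),
        # not to the terabytes of data the blocks hold.
        return CowVolume(self.block_map)

    def write(self, block_no, data):
        # Divergence is per-volume: only this volume's map is updated.
        self.block_map[block_no] = data

    def read(self, block_no):
        return self.block_map[block_no]

# The "multi-TB mart in seconds": the clone shares data with production.
prod = CowVolume({0: b"sales", 1: b"customers"})
mart = prod.clone()
assert mart.read(0) is prod.read(0)   # same underlying block, no copy made

mart.write(0, b"sales-transformed")   # the mart diverges...
assert prod.read(0) == b"sales"       # ...while production stays untouched
```

A real SAN does this at the block-device layer rather than in a dictionary, but the economics are the same: cloning is metadata work, and storage is consumed only as the mart's data diverges from the source.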
Hmm. I have no actual knowledge of this, but it sounds like a capability that EMC should also offer soon, given the historical Greenplum focus on data mart spin-out.