When you are selecting an analytic DBMS or appliance, most of the evaluation boils down to two questions:
- How quickly and cost-effectively does it execute SQL?
- What analytic functionality, SQL or otherwise, does it do a good job of executing?
And so, in undertaking such a selection, you need to start by addressing three issues:
- What does “speed” mean to you?
- What does “cost” mean to you?
- What analytic functionality do you need anyway?
Key elements of cost* include:
- Software license and maintenance
- Hardware purchase cost, maintenance, electric power, and computer room burden
- Database and system administration
- (For some use cases) Programming
*Assuming a classical in-house IT shop, where products are typically bought rather than leased/rented. With outsourced and/or monthly-fee structures, the details change but the principles remain the same.
Most of that can be evaluated pretty well via a spreadsheet, although things can get a bit tricky when you get to people costs, which are a large fraction of the whole. In particular, different analytic DBMS product suites have great, high-performance support for different (and often rapidly growing) sets of functionality – basic and advanced SQL, statistics, and more. Figuring out which ones will be best for your programmers, and how significant the differences are — well, that’s a lot like any other programming language evaluation, and those are rarely neat or clean-cut.
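To make the spreadsheet point concrete, here's a minimal sketch of such a cost model in Python, with line items taken from the list above. Every dollar figure is hypothetical; you'd substitute your own vendor quotes and loaded staff costs.

```python
# Hypothetical multi-year total-cost-of-ownership sketch for an analytic DBMS.
# All figures below are made up for illustration; plug in your own numbers.

ONE_TIME_COSTS = {
    "software_license": 300_000,
    "hardware_purchase": 250_000,
}

ANNUAL_COSTS = {
    "software_maintenance": 60_000,   # typically a percentage of license
    "hardware_maintenance": 25_000,
    "power_and_facilities": 15_000,   # electric power + computer room burden
    "dba_and_sysadmin":     120_000,  # fraction of staff time, loaded cost
    "programming":          80_000,   # only relevant for some use cases
}

def total_cost(years: int) -> int:
    """Sum one-time costs plus recurring costs over the given horizon."""
    return sum(ONE_TIME_COSTS.values()) + years * sum(ANNUAL_COSTS.values())

if __name__ == "__main__":
    for horizon in (1, 3, 5):
        print(f"{horizon}-year TCO: ${total_cost(horizon):,}")
```

Note that in this made-up example the recurring (largely people) costs overtake the purchase costs within two years, which is why the people-cost estimates deserve the most scrutiny.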
But when it comes to evaluating speed, there’s no substitute for a well-designed proof of concept (POC). Many analytic DBMS and appliance vendors are happy to let you do a POC, on your own premises (or remotely if you prefer), under your control, at no cost to you. And that’s great. It is crucial that a POC be run either by you, by a consultant* answerable to you, or – if you decide the vendor must run it for you – at least with you watching every step of the way and knowing exactly what is being done. Appliance vendors do find it cheaper to run POCs on their own premises, so a certain reluctance to ship you a box is understandable. But make no compromises about the transparency of a POC, or about your control of exactly what it is that gets tested.
*Since I sell consulting services for users evaluating analytic DBMS, I naturally am biased to think that consultants can be very useful in the process. But whether you should use them a little (sanity check), a medium amount (work with you through the process), or heavily (actually drive the process for you and/or execute the POCs) is very dependent upon your specific situation.
So far as I’ve been able to tell:
- Netezza loves to ship boxes to prospects for POCs, and to have prospects set up the boxes and do the POCs themselves. That’s a big reason why Netezza wants to call attention to this subject.
- Oracle has generally been pretty reluctant to ship Exadata boxes out for POCs. That’s the other reason Netezza wants to call attention to the issue.
- Open source vendors make it easy for you to download and test at least their community editions.
- Vertica makes it pretty easy for you to test its software too (download or cloud).
- ParAccel has generally insisted on running POCs itself, although it will do so on your premises if you insist.
- Teradata naturally tries to do POCs on its own premises, but doesn’t insist too hard. (Edit: Randy Lea of Teradata says that Teradata is now doing over half its POCs onsite.)
Most of the criticisms I’ve heard of vendors’ POC practices have been directed at Oracle or ParAccel.
For most POCs, it’s a good conceptual template to form and then test a hypothesis to the effect of:
- For a given technology product assemblage (brand of DBMS, number of nodes, etc.), and
- For a given level of human effort (e.g., administrative effort), you can
- Run a given workload, with
- Satisfactory and satisfactorily consistent response times
Sometimes absolute throughput and price/performance are important secondary considerations; sometimes they’re less germane. But either way, it’s almost always right to focus primarily on the questions of “What do I want this system to do?” and “What do I think we’re going to have to invest in it?” By way of contrast, it’s often misleading to focus too much on questions like “What’s the one number that best describes the performance of this system?” — even if you customize that calculation for your environment – or, even worse, “How much speed-up can I get on my single worst Query from Hell?”
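One way to operationalize “satisfactory and satisfactorily consistent response times” (as opposed to a single summary number) is to look at percentiles and worst cases rather than averages. Here's a minimal sketch, assuming you've already captured per-query wall-clock timings from a POC run; the sample timings below are hypothetical.

```python
# Summarize POC query timings by percentile rather than by a single average.
# The sample timings are hypothetical; in practice you'd load measured
# wall-clock times from your POC harness's logs.
import statistics

def summarize(timings_sec):
    """Return median, ~p95, and max response times for one query class."""
    ordered = sorted(timings_sec)
    # Index of the ~95th-percentile observation (nearest-rank style).
    p95_index = max(0, round(0.95 * len(ordered)) - 1)
    return {
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }

sample = [1.2, 1.3, 1.1, 1.4, 9.8, 1.2, 1.3, 1.5, 1.2, 1.3]
stats = summarize(sample)
# A median near 1.3 seconds alongside a worst case near 10 seconds signals
# inconsistent response times -- exactly what a throughput average hides.
print(stats)
```

The point of the design is that a single outlier query (the 9.8-second run above) barely moves the mean but shows up immediately in the max, and it's the outliers that blow SLAs.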
The fundamental rule of POC construction is: Model your entire use case as best you can. That means you need to consider, at a minimum:
- Your whole concurrent query, other analytic, and low-latency update workload (peak).
- Your whole query, analytic, load, backup, and maintenance workload (ongoing).
- Partial-failure scenarios.
- Your core SLAs (Service-Level Agreements).
Of course, that’s not as easy as it sounds. Presumably, the main reason you’re getting a new analytic DBMS is that you want to do new kinds of analysis. By the very nature of analytics, you won’t know what analytic operations are most useful until you try them out and see what their results are. On the other hand – if you haven’t done considerable thinking about how you’re going to use your new analytic database, how did you ever get funding for the project in the first place?
Seriously, I could write multiple posts, each as long as this one (but more application-oriented), about how to upgrade your analytic capabilities (and which fool’s gold to avoid). But this has gotten pretty long already, so for now I’ll just stop here.
Note: My clients at Netezza asked me to write something short about POCs they could use as a kind of foreword to some collateral, where by “short” they meant single-paragraph or something like that. They’re great clients, so I said yes, under the condition I could also use it as a blog post. Except … this post didn’t turn out to be nearly as short as they envisioned. Oops.
My February 2009 slide deck on how to select an analytic DBMS is in many parts still pretty current.