Vertica 6 was recently announced, and so it seemed like a good time to catch up on Vertica features. The main topics I want to address are:
- External tables and the associated new Hadoop connector.
- Online schema evolution.
- Workload management.
- I have some tidbits to add to my June, 2011 coverage of Vertica’s analytic functionality.
- I’ll stand for now on my previous coverage of Vertica’s database organization.
In general, the main themes of Vertica 6 appear to be:
- Enterprise/SaaS-friendliness, high uptime, and so on.
- Improved analytic usefulness.
Let’s do the analytic functionality first. Notes on that include:
- Vertica has extended its user-defined function/analytic procedure/whatever functionality to include user-defined load. (Same SDK, different specific classes.)
- One of the languages Vertica supports is R. But for now, parallel R is limited to “Of course, you can run the same functions and procedures on many nodes at once.”
- Based on community activity around bugs and so on, it seems there are users for Vertica’s JSON-based Twitter sentiment analysis plug-in.
I’ll also take this opportunity to expand on something I wrote about a few vendors — including Vertica — at the end of my post on approximate query results. When I probed how customers of Vertica and other RDBMS-based analytic platform vendors used vendor-proprietary advanced analytic SQL and other analytic capabilities, answers included:
- Serious custom coding. Market segments mentioned in this connection included financial services and OEM/SaaS.
- Create a view or a view-like function, then proceed from there in the usual way.
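To make the view approach concrete, here's a hedged sketch. The table, columns, and `sentiment_score()` function are all made up stand-ins for whatever proprietary analytic capability is actually in play:

```sql
-- Illustrative only: hide a vendor-specific analytic function behind a
-- view, so downstream SQL tools see plain columns. sentiment_score() is
-- a hypothetical stand-in for a proprietary function.
CREATE VIEW scored_tweets AS
SELECT tweet_id, tweet_ts, sentiment_score(tweet_text) AS score
FROM tweets;

-- Any SQL client can then proceed "in the usual way":
SELECT AVG(score)
FROM scored_tweets
WHERE tweet_ts > '2012-06-01';
```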
Tighter business intelligence tool integrations are “under development”.
Vertica’s external tables capability sounds a lot like what other vendors have. In particular, it sounds a lot like Teradata Aster’s SQL-H, only without the HCatalog integration and with a different set of analytic functionality. The pitch is similar too — leave the data where it is when you’re analyzing it more experimentally, but if you want real performance, bring it into the RDBMS.
Datameer has a similar story almost in reverse — take the data out of an RDBMS into less performant Hadoop for easier analysis.
As is the case for other RDBMS with similar features, what Vertica’s external tables capability may be used for the most is Hadoop connectivity. Indeed, Vertica has a new Hadoop connector in beta, based on external tables. Technical notes on that include:
- Unlike Vertica’s old Hadoop connector, Vertica’s new Hadoop connector doesn’t require any MapReduce programming.
- There’s a bulk (re)load every time you query an external table. The workaround, if you want to avoid the repeated loads, is CREATE TABLE AS SELECT.
- I gather Vertica’s new connector has normal transactional guarantees, albeit for very bulky transactions. The same can’t be said for the old connector.
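If I understand the mechanics correctly, the flow looks something like the sketch below. The table name, columns, and file path are invented, and I'm going from memory on the exact CREATE EXTERNAL TABLE … AS COPY syntax, so treat this as illustrative rather than gospel:

```sql
-- Hypothetical sketch: define an external table over data that stays
-- outside Vertica. Every query against ext_clicks re-reads the files.
CREATE EXTERNAL TABLE ext_clicks (
    user_id  INT,
    click_ts TIMESTAMP,
    url      VARCHAR(2048)
) AS COPY FROM '/mnt/hdfs/clicks/*.txt' DELIMITER '|';

-- The workaround for the per-query (re)load: materialize the data once
-- inside the RDBMS, then query the internal copy at full speed.
CREATE TABLE clicks AS SELECT * FROM ext_clicks;
```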
Ever more RDBMS vendors, for example Clustrix, allow schema changes without taking the database down. So does Vertica. Notes on that include:
- It relies on the fact that Vertica uses MVCC (Multi-Version Concurrency Control) rather than doing updates in place. MVCC is pretty much a best practice now, at least for disk storage.
- In particular, a Vertica query locates all the data it needs when it starts running. Database changes made while the query is running don’t affect it. That’s true of ordinary updates and of schema changes alike.
- Schema changes really boil down to dropping and adding columns. Or you can make very simple column changes, such as widening one.
- If you widen a column, that has no effect on the data already in place. (Columnar storage, variable length records — what would there be to change?)
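In SQL terms, the schema changes in question are garden-variety ones, something like the following (table and column names are invented, and I haven't verified the exact column-widening syntax against Vertica's documentation):

```sql
-- Illustrative only: the kinds of online schema change discussed above.
ALTER TABLE clicks ADD COLUMN referrer VARCHAR(1024);   -- add a column
ALTER TABLE clicks DROP COLUMN referrer;                -- drop a column
ALTER TABLE clicks
    ALTER COLUMN url SET DATA TYPE VARCHAR(4096);       -- widen a column
```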
As for Vertica 6 workload management, core concepts include:
- Queries are assigned to various pools based on configurable rules. So far as I can tell:
- The rules are set via an administrative GUI …
- … and are pretty flexible.
- There’s SQL under the covers.
- Pools can have low, medium, or high priority.
- There’s a short-query bias for prioritizing queries within pools. (Based on actual query duration, not estimates.)
There are also further manual or automatic reprioritizations. And of course runaway queries eventually get shut down.
I’m finding it surprisingly hard to pin down what things Vertica’s workload management actually manages, because:
- Vertica seems to prefer soft resource management over hard; for example, if one query pool isn’t using all the CPU it’s entitled to, Vertica wants to give those cycles to other queries.*
- Vertica seems to prefer focusing on query priorities rather than on the specification of actual resource allocations.
*There is also a hard throttle, à la DB2 or Netezza, for cases when customers don’t want the best possible performance, so as to better manage user expectations. But I don’t think it’s been as important in Vertica’s customer base as it has been for those other products.
That said, the story is approximately that you can configure a query pool according to:
- How much RAM it gets.
- How many queries it’s expected to run at once.
- How many queries it’s limited to running at once.
- How much CPU, network bandwidth, and I/O bandwidth it’s expected to consume.
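A pool configured along those lines might look something like the sketch below. The parameter names are my guesses at Vertica's resource-pool settings, and the values are arbitrary, so check the actual documentation before relying on any of this:

```sql
-- Hypothetical sketch of a query pool; names and values are assumptions.
CREATE RESOURCE POOL reporting_pool
    MEMORYSIZE '4G'         -- how much RAM it gets
    PLANNEDCONCURRENCY 8    -- how many queries it's expected to run at once
    MAXCONCURRENCY 16       -- how many queries it's limited to running at once
    PRIORITY 10;            -- relative priority versus other pools

-- Assign a user (and hence that user's queries) to the pool.
CREATE USER report_user RESOURCE POOL reporting_pool;
```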
Unfortunately, I neglected to ask how fully user-defined analytic procedures were integrated under Vertica’s workload management umbrella.
And finally, I’ll close with notes on two more Vertica features.
- Vertica is proud that backups can be scheduled separately on a table-by-table basis, seeing that ability as part of the “Make it easier to manage Vertica as an always-up system” theme.
- As opposed to Oracle’s annoying 1,000-column limit, Vertica tables can have an effectively unlimited number of columns (as many as the machine’s RAM can support).