Theory and architecture
Analysis of design choices in databases and database management systems. Related subjects include:
- Any subcategory
- Database diversity
- Explicit support for specific data types
- (in Text Technologies) Text search
Netezza on concurrency and workload management
I visited Netezza Friday for what was mainly an NDA meeting. But while I was there I asked where Netezza stood on concurrency, workload management, and rapid data mart spin-out. Netezza’s claims in those regards turned out to be surprisingly strong.
In the biggest surprise, Netezza claimed at least one customer had >5,000 simultaneous users, and a second had >4,000. Both are household names. Other unspecified Netezza customers apparently also have >1,000 simultaneous users. Read more
Categories: Data warehouse appliances, Data warehousing, Netezza, Teradata, Theory and architecture, Workload management | 13 Comments |
While I’m venting about benchmarks
Late last year, Vertica made hoo-hah about what it called a world-record data warehouse load speed benchmark. I wrote at the time that this showed Vertica wasn’t painfully slow at loading, always a concern with column stores. But otherwise I mocked the idea that there was something useful to be learned from the whole exercise.
Well, guess what? In a throwaway line in a comment on Daniel Abadi’s blog, Barry Zane of ParAccel pointed out
we posted a load rate of almost 9TB/hour, which is, of course record breaking on its own
Quite right.
I hope the nonsense stops there, but I’m not optimistic …
Categories: Benchmarks and POCs, Columnar database management, Data integration and middleware, EAI, EII, ETL, ELT, ETLT, Vertica Systems | Leave a Comment |
Progress in figuring out what ParAccel is doing
Barry Zane of ParAccel has — finally! — started a blog. Barrry’s first post, probably in connection with ParAccel’s recent TPC-H submission and subsequent brouhaha, consisted mainly of metaphor + very elementary and well-known arguments for column stores. Barry’s second post, however, was in direct response to Daniel Abadi’s speculation about ParAccel’s architecture. That post also promises a follow-up addressing the TPC-H in a more substantive way.
(Edit: As of October, 2010, those links have been redirected away from the original posts, which seem to have been taken down.)
Barry’s points include:
- ParAccel never used the row-oriented Postgres execution engine. This is contrary to Daniel’s speculation.
- ParAccel previously used an adaption of the Postgres cost-based optimizer, but now has written a new one from scratch.
- ParAccel has designed its optimizer to handle lots and lots of joins. One reason Barry offers is that ParAccel wants to run customers’ old schemas unaltered, whether or not those are really optimal for the ParAccel DBMS. That approach is somewhat in contrast to Vertica, which originally focused entirely on star schemas. And it goes well with ParAccel’s interest in appealing to customers who at least think they want to run ParAccel in Oracle or SQL Server emulation mode.
Also in the post, Barry:
- Makes an extremely silly marketing exaggeration by referring to ” the only other vendor that was able to run the 30TB TPC-H” (emphasis mine).
- Makes the more excusable marketing exaggeration “Publishing the benchmark with unmatched performance is simply one way to demonstrate robustness and flexibility. Nothing more, nothing less.”
- Makes the very clear marketing claim “For customers, the real test will be their own bake-offs, where our performance has never been beaten.” (Emphasis mine.) That last one directly contradicts what I’ve been told by at least two ParAccel competitors, so I’ll be curious to see what they come up with to substantiate their version of the story.
Anyhow, it’s great to see ParAccel retreating from its obsessive secrecy, which in my opinion has been even worse than Netezza’s used to be.
Categories: Columnar database management, Data warehousing, ParAccel | 2 Comments |
Hasso Plattner calls for in-memory OLTP column stores
Former SAP CEO Hasso Plattner has written a paper called A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database, in association with a SIGMOD keynote address.* The approach Plattner advocates is an MPP in-memory column store, presumably somewhat akin to SAP’s frequently renamed Business Warehouse Accelerator/Business Intelligence Accelerator/BWA/BIA/Son-of-TREX technology. There also are strong similarities to the MPP in-memory row store project H-Store/VoltDB, although I don’t know whether Plattner would go so far as to adopt the H-Store view that all transactions should run in stored procedures. Unsurprisingly, SAP applications are used as the OLTP paradigm throughout.
*Thanks to Dave Kellogg for tipping me off to Plattner’s paper. I only went to two SIGMOD sessions, neither of which was Plattner’s. Nobody actually mentioned Plattner’s talk to me when I was down at SIGMOD.
Perhaps the most interesting part is Plattner’s claim that what’s demanding about OLTP isn’t database updating per se, but rather maintaining aggregates for quick-response analytics. In his main example of that point, Plattner proposes a real-life “more than 18” table schema, of which 2 are base tables, and (most of?) the rest are materialized views that his proposed database architecture dispenses with (because analytic performance is sufficiently good without them). Thus, Plattner’s core columnar argument seemingly is
columnar –> natively fast analytics –> no need to maintain aggregates –> much lower update burden.
That said — if Plattner’s paper contained a clear statement of how much more expensive it is to insert or update a single row in a columnar vs. row-based system, I overlooked it. Instead, Plattner seems to be arguing that the volume of base-table updates is low enough that — whatever it may be — column-store update overhead is an acceptable price to pay. (At one point he claims that only 5% of the data inserted in a financial application ever gets changed.) That may actually be true in a financial accounting system, but seems more questionable in a sufficiently large application that gets its updates from automatic devices, or from the consumer web.
Other highlights include: Read more
Daniel Abadi has a theory about ParAccel
When I was at SIGMOD last week, ParAccel and its SIGMOD talk were mentioned several times, always in puzzled and at least slightly unflattering terms. (Typical comment: “Why did they present a paper about that? We were doing the same thing in our company years ago.”) That doesn’t prove much per se, since most of the mentions were by competitors and/or Vertica-affiliated academics, and since my own unflattering ParAccel-related comments were rather fresh at the time.
But now Daniel Abadi has done a brilliant, detailed, speculative analysis of ParAccel’s publications. Here’s the meat, emphasis mine: Read more
Categories: Benchmarks and POCs, Columnar database management, Data warehousing, ParAccel, Theory and architecture | 30 Comments |
Yahoo is up to 10 petabytes now?
According to somebody (I forget who) who attended Yahoo’s SIGMOD presentation last week, the big Yahoo database is now up to 10 petabytes in size, in line with Yahoo’s predictions last year. Apparently, Yahoo also gave more details of how the technology works.
Categories: Columnar database management, Data warehousing, Web analytics, Yahoo | 5 Comments |
Notes on columnar/TPC-H compression
I was chatting with Omer Trajman of Vertica, and he said that a 70% compression figure for ParAccel’s recent TPC-H filing sounded about right.* When I noted that seemed kind of low, Omer pointed out that TPC-H data is pseudo-random, while real-life data has much more correlation among the values in different columns. E.g., in retail, a customer is likely to consistently shop at the same stores and to put similar items into his shopping basket).
*Omer was involved in Vertica’s TPC-H-data-based load speed benchmark, and is Vertica’s representative to the TPC.
But why does this matter? After all, Vertica compresses one column at a time (unlike, say, Clearpace). Well, the reason is that Vertica — like other column stores — wants to store different columns in the same row order, for obvious benefits in both reading and writing. So, for example, if all the rows that include Gotham City are grouped sequentially, then all the rows mentioning Bruce Wayne are likely to be near each other as well, while none of the rows that mention Clark Kent will be mixed in.
And when a set of consecutive entries has low cardinality, it’s easier to get high levels of compression.
Categories: Benchmarks and POCs, Columnar database management, Data warehousing, Database compression, Vertica Systems | Leave a Comment |
NoSQL?
Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:
- In most cases, no.
- Most of these technologies are designed for simple, high-volume OLTP (OnLine Transaction Processing.) Most large enterprises have an established way of doing OLTP, probably via relational database management systems. Why change?
- MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS.
- There’s one big countervailing factor to all these generalities — schema flexibility.
As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL. Read more
Correction to a recent quote
I’m quoted in a recent article around Aster’s appliance announcement as saying data warehouse appliances are more suitable for small workgroups of analysts crunching small amounts of data than they are for other uses.
But that’s not what I think at all.
I do think the ease-of-administration pitch for appliances makes them particularly well suited for users who want to scrape by without doing much database adminstration. This is especially appealing to departments or smaller enterprises. And the first/best scenario that comes to mind is indeed a small team of analysts, with good SQL skills but lightweight DBA experience, although Netezza has proved that many other kinds of users can find appliances appealing as well.
But that small team of analysts may maintain the largest database in the firm.
And by the way — notwithstanding the MySpace counterexample, most of Aster’s initial customers had <10 terabyte databases, and I think indeed <5 terabyte. The “frontline” pitch succeeded for Aster before (MySpace again aside) any better-big-data-crunching story did.
Categories: Analytic technologies, Aster Data, Data warehouse appliances, Data warehousing, Theory and architecture | Leave a Comment |
Xtreme Data readies a different kind of FPGA-based data warehouse appliance
Xtreme Data called me to talk about its plans in the data warehouse appliance business, almost all details of which are currently embargoed. Still, a few points may be worth noting ahead of more precise information, namely:
- Xtreme Data’s basic idea is to take a custom board and build a data warehouse appliance around it.
- An Xtreme Data board looks a lot like a conventional two-socket board, but has only one four-core CPU. In addition, it sports some FPGAs (Field-Programmable Gate Arrays).
- In the Xtreme Data appliance, the FPGAs will be used for core SQL processing, after the data is ingested via conventional I/O. This is different from Netezza’s approach to FPGA-based data warehouse appliances, in which the FPGA sits in the place of a disk controller and touches the data first, before passing it off to a more or less conventional CPU.
- While preparing entry into the data warehouse appliance business, Xtreme Data has sold its board to 150 other outfits, many quite impressive. Buyers seem to be FPGA users who previously had to craft their own custom boards. According to Xtreme Data, major uses by these customers include:
- Military/intelligence/digital signal processing.
- Military/intelligence/cybersecurity (a newish area for Xtreme Data)
- Bioinformatics/high-throughput gene sequencing (a “handful” of customers)
- Medical imaging
- More or less pure university research of various sorts (around 50 customers)
- … but not database management.
- Xtreme Data’s website has a non-obvious URL. 🙂
So far as I can tell, Xtreme Data’s 1.0 product will — like most other 1.0 analytic database management products — be focused on price/performance, without little or no positive differentiation in the way of features.