Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data. Related subjects include:
Way back in 2006, I wrote about a cool Netezza feature called the zone map, which in essence allows you to do partition elimination even in the absence of strict range partitioning.
Netezza’s substitute for range partitioning is very simple. Netezza features “zone maps,” which note the minimum and maximum of each column value (if such concepts are meaningful) in each extent. This can amount to effective range partitioning over dates; if data is added over time, there’s a good chance that the data in any particular date range is clustered, and a zone map lets you pick out which data falls in the desired data range.
I further wrote
… that seems to be the primary scenario in which zone maps confer a large benefit.
But I now think that part was too pessimistic. For example, in bulk load scenarios, it’s easy to imagine ways in which data can be clustered or skewed. And in such cases, zone maps can let you skip a large fraction of potential I/O.
Over the years I’ve said that other things were reminiscent of Netezza zone maps, e.g. features of Infobright, SenSage, InfiniDB and even Microsoft SQL Server. But truth be told, when I actually use the phrase “zone map”, people usually give me a blank look.
In a recent briefing about BLU, IBM introduced me to a better term — data skipping. I like it and, unless somebody comes up with a good reason not to, I plan to start using it myself.
Actian, which already owns VectorWise, is also buying ParAccel. The argument for why this kills VectorWise is simple. ParAccel does most things VectorWise does, more or less as well. It also does a lot more:
- ParAccel scales out.
- ParAccel has added analytic platform capabilities.
- I don’t know for sure, but I’d guess ParAccel has more mature management/plumbing capabilities as well.
One might conjecture that ParAccel is bad at highly concurrent, single-node use cases, and VectorWise is better at them — but at the link above, ParAccel bragged of supporting 5,000 concurrent connections. Besides, if one is just looking for a high-use reporting server, why not get Sybase IQ?? Anyhow, Actian hasn’t been investing enough in VectorWise to make it a major market player, and they’re unlikely to start now that they own ParAccel as well.
But I expect ParAccel to fail too. Reasons include:
- ParAccel’s small market share and traction.
- The disruption of any acquisition like this one.
- My general view of Actian as a company.
|Categories: Actian and Ingres, Columnar database management, Data warehousing, HP and Neoview, ParAccel, Sybase, VectorWise, Vertica Systems||10 Comments|
Teradata is announcing its new high-end systems, the Teradata 6700 series. Notes on that include:
- Teradata tends to get 35-55% (roughly speaking) annual performance improvements, as measured by its internal blended measure Tperf. A big part of this is exploiting new-generation Intel processors.
- This year the figure is around 40%.
- The 6700 is based on Intel’s Sandy Bridge.
- Teradata previously told me that Ivy Bridge — the next one after Sandy Bridge — could offer a performance “discontinuity”. So, while this is just a guess, I expect that next year’s Teradata performance improvement will beat this year’s.
- Teradata has now largely switched over to InfiniBand.
Teradata is also talking about data integration and best-of-breed systems, with buzzwords such as:
- Teradata Unified Data Architecture.
- Fabric-based computing, even though this isn’t really about storage.
- Teradata SQL-H.
|Categories: Data integration and middleware, Data warehouse appliances, Data warehousing, Pricing, SAS Institute, Teradata||3 Comments|
As vendors so often do, Teradata has caused itself some naming confusion. SQL-H was introduced as a facility of Teradata Aster, to complement SQL-MR.* But while SQL-MR is in essence a set of SQL extensions, SQL-H is not. Rather, SQL-H is a transparency interface that makes Hadoop data responsive to the same code that would work on Teradata Aster …
*Speaking of confusion — Teradata Aster seems to use the spellings SQL/MR and SQL-MR interchangeably.
… except that now there’s also a SQL-H for regular Teradata systems as well. While it has the same general features and benefits as SQL-H for Teradata Aster, the details are different, since the underlying systems are.
I hope that’s clear.
|Categories: Data integration and middleware, Data warehousing, Emulation, transparency, portability, Hadoop, SQL/Hadoop integration, Teradata||2 Comments|
Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.
Discussions of DBMS performance are always odd, for starters because:
- Workloads and use cases vary greatly.
- In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.
- It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
- Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
- In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
- Many workloads are inherently single node (replication aside). Others are not.
MongoDB and 10gen
I caught up with Ron Avnur at 10gen. Technical highlights included: Read more
Well-resourced Silicon Valley start-ups typically announce their existence multiple times. Company formation, angel funding, Series A funding, Series B funding, company launch, product beta, and product general availability may not be 7 different “news events”, but they’re apt to be at least 3-4. Platfora, no exception to this rule, is hitting general availability today, and in connection with that I learned a bit more about what they are up to.
In simplest terms, Platfora offers exploratory business intelligence against Hadoop-based data. As per last weekend’s post about exploratory BI, a key requirement is speed; and so far as I can tell, any technological innovation Platfora offers relates to the need for speed. Specifically, I drilled into Platfora’s performance architecture on the query processing side (and associated data movement); Platfora also brags of rendering 100s of 1000s of “marks” quickly in HTML5 visualizations, but I haven’t a clue as to whether that’s much of an accomplishment in itself.
Platfora’s marketing suggests it obviates the need for a data warehouse at all; for most enterprises, of course, that is a great exaggeration. But another dubious aspect of Platfora marketing actually serves to understate the product’s merits — Platfora claims to have an “in-memory” product, when what’s really the case is that Platfora’s memory-centric technology uses both RAM and disk to manage larger data marts than could reasonably be fit into RAM alone. Expanding on what I wrote about Platfora when it de-stealthed: Read more
|Categories: Business intelligence, Columnar database management, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Market share and customer counts, Memory-centric data management, Platfora, Workload management||12 Comments|
Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times.
Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:
- Every level of storage (disk, RAM, etc.).
- Indexes, aggregates and raw data alike.
To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to: Read more
It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to an analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- “Forget all that transformation foofarah — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start: Read more
I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:
- How big is your database? How big is your budget?
- How do you feel about appliances?
- How do you feel about the cloud?
- What are the size and shape of your workload?
- How fresh does the data need to be?
Let’s drill down. Read more
Comments on Gartner’s 2012 Magic Quadrant for Data Warehouse Database Management Systems — evaluations
To my taste, the most glaring mis-rankings in the 2012/2013 Gartner Magic Quadrant for Data Warehouse Database Management are that it is too positive on Kognitio and too negative on Infobright. Secondarily, it is too negative on HP Vertica, and too positive on ParAccel and Actian/VectorWise. So let’s consider those vendors first.
Gartner seems confused about Kognitio’s products and history alike.
- Gartner calls Kognitio an “in-memory” DBMS, which is not accurate.
- Gartner doesn’t remark on Kognitio’s worst-in-class* compression.
- Gartner gives Kognitio oddly high marks for a late, me-too Hadoop integration strategy.
- Gartner writes as if Kognitio’s next attempt at the US market will be the first one, which is not the case.
- Gartner says that Kognitio pioneered data warehouse SaaS (Software as a Service), which actually has existed since the pre-relational 1970s.
Gartner is correct, however, to note that Kognitio doesn’t sell much stuff overall.
In the cases of HP Vertica, Infobright, ParAccel, and Actian/VectorWise, the 2012 Gartner Magic Quadrant for Data Warehouse Database Management’s facts are fairly accurate, but I dispute Gartner’s evaluation. When it comes to Vertica: Read more