Data warehouse appliances
Analysis of data warehouse appliances – i.e., of hardware/software bundles optimized for fast query and analysis of large volumes of (usually) relational data. Related subjects include:
In a call Monday with a prominent company, I was told:
- Teradata, Netezza, Greenplum and Vertica aren’t relational.
- Teradata, Netezza, Greenplum and Vertica are all data warehouse appliances.
That, to put it mildly, is not accurate. So I shall try, yet again, to set the record straight.
In an industry where people often call a DBMS just a “database” — so that a database is something that manages a database! — one may wonder why I bother. Anyhow …
1. The products commonly known as Oracle, Exadata, DB2, Sybase, SQL Server, Teradata, Sybase IQ, Netezza, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio et al. all either are or incorporate relational database management systems, aka RDBMS or relational DBMS.
2. In principle, there can be difficulties in judging whether or not a DBMS is “relational”. In practice, those difficulties don’t arise — yet. Every significant DBMS still falls into one of two categories:
- Either it:
  - Was designed to do relational stuff* from the get-go, even if it now does other things too.
  - Supports a lot of SQL. (A small illustration of this case follows the footnote below.)
- Or it:
  - Was designed primarily to do non-relational things.*
  - Doesn’t support all that much SQL.
*I expect the distinction to get more confusing soon, at which point I’ll adopt terms more precise than “relational things” and “relational stuff”.
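To make the “relational stuff plus lots of SQL” case concrete, here is a minimal sketch in Python, using the standard-library sqlite3 module purely as a stand-in relational engine. The schema, data, and query are invented for illustration and aren’t taken from any product named above.

```python
# Minimal illustration of "relational stuff": declare tables, then ask a
# declarative question (join + aggregate) and let the DBMS decide how to
# execute it. SQLite is just a convenient stand-in; the schema and data
# are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West');
    INSERT INTO orders VALUES (10, 1, 99.50), (11, 1, 12.00), (12, 2, 45.25);
""")

# The "relational" part: we state WHAT we want (total sales by region),
# not HOW to fetch it -- no loops, no access paths, no storage details.
for region, total in conn.execute("""
    SELECT c.region, SUM(o.amount)
      FROM orders o JOIN customers c ON o.customer_id = c.customer_id
     GROUP BY c.region
"""):
    print(region, total)
```

A DBMS in the second category might store and retrieve the same records perfectly well, but it wouldn’t let you pose the question that declaratively.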
3. There are two chief kinds of relational DBMS:
A reporter asked me to speculate about the next releases of Oracle and Exadata. He and I agreed:
- It seems likely that they’ll be discussed at Oracle OpenWorld in a couple of months.
- Exadata in particular is due for a hardware refresh.
- Oracle 12c is a good guess at a name, where the “c” stands for “cloud”.
My answers mixed together thoughts on what Oracle should and will emphasize (which aren’t the same thing, but hopefully bear some relationship to each other). They were (lightly edited):
- The worst thing about Oracle is the ongoing DBA work for what should be automatic.
- Oracle RAC still makes scale-out too difficult. Presumably, Oracle is looking to build aggressively on recent steps in automating parallelism.
- For Exadata, assume that Oracle is always looking to improve how data gets allocated among disk, flash, and RAM. Look also for Exadata versions with different silicon-disk ratios than are available now.
- Tighter integration among the various appliances is surely a goal, …
- … but I don’t know whether Oracle will pick them apart and let you put various kinds of hardware in the same racks or not. I’d guess against that, because the current set-up gives them a pretext to sell you more capacity than you need.
- I wonder whether Oracle will finally introduce a true columnar storage option, a year behind Teradata. That would be the obvious enhancement on the data warehousing side, if they can pull it off. If they can’t, it’s a damning commentary on the core Oracle codebase.
- Probably Oracle will have something that it portrays as good multi-tenancy support. Some of that could be based on Label Security and so on.
- Anything that makes schema change easier could be a win on the DBA and multi-tenancy sides alike, which would be a nice two-fer.
Chris Kanaracus uncovered a case of Oracle actually pulling an ad after having been found “guilty” of false advertising. The essence seems to be that Oracle claimed 20X hardware performance vs. IBM, based on a comparison done against 6-year-old hardware running an earlier version of the Oracle DBMS. My quotes in the article were:
- “Everybody’s guilty of that kind of exaggeration.”
- “Oracle tends to be even a little guiltier than others.”
- “If your new system can’t outperform somebody else’s old system by a huge factor on at least some queries, you’re doing something wrong.”
- “Use newer, better hardware; use newer, better software; have a top sales engineer do a great job of tuning it and of course you’ll see huge performance results.”
Another example of Oracle exaggeration was around the Exadata replacement of Teradata at Softbank. But the bogosity flows both ways. Netezza used to make a flat claim of 50X better performance than Oracle, while Vertica’s standard press release boilerplate long boasted
50x-1000x faster performance at 30% the cost of traditional solutions
Of course, reality is a lot more complicated. Even if you assume apples-to-apples comparisons in terms of hardware and software versions, performance comparisons can vary greatly depending upon queries, databases, or use cases. For example:
- Many queries are inherently much faster over columnar storage than over row-based storage (a toy sketch of this effect appears below).
- Different data sets respond very differently to various compression algorithms.
- Some analytic RDBMS can maintain strong performance at high levels of concurrent usage. Some can’t.
- Some queries that run very fast on one DBMS without tuning might require careful tuning in another system.
- Some DBMS scale out much better than others.
- Vendors optimize for different usage assumptions, which may or may not apply in your particular case.
And so, vendor marketing claims about across-the-board performance should be viewed with the utmost suspicion.
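To illustrate the columnar-versus-row point above, here is a toy Python sketch of the basic effect: a query that touches one column out of many scans far less data against a columnar layout than against a row-based one. The table shape and data are invented, and real analytic DBMS layer compression, vectorized execution, and smarter I/O on top of this, so treat it as a sketch of the idea rather than a benchmark.

```python
# Toy comparison of row-oriented vs. column-oriented layouts for a query
# that reads a single column, e.g. SELECT SUM(col_3) FROM t.
import random

NUM_ROWS, NUM_COLS = 100_000, 20

# Row store: one tuple (here, a list) per row.
rows = [[random.random() for _ in range(NUM_COLS)] for _ in range(NUM_ROWS)]

# Column store: the same data pivoted into one list per column.
columns = [[row[c] for row in rows] for c in range(NUM_COLS)]

def sum_row_store(rows, col):
    # Has to walk every row object just to pick out one field.
    return sum(row[col] for row in rows)

def sum_column_store(columns, col):
    # Walks one contiguous list holding exactly the data the query needs.
    return sum(columns[col])

assert abs(sum_row_store(rows, 3) - sum_column_store(columns, 3)) < 1e-6
```

The same toy also hints at why results vary so much by query: a SELECT * that needs every column gets no such benefit from the columnar layout.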
I love talking with Carson Schmidt, chief of Teradata’s hardware engineering (among other things), even if I don’t always understand the details of what he’s talking about. It had been way too long since our last chat, so I requested another one. We were joined by Keith Muller, who I presume is pictured here. Takeaways included:
- Teradata performance growth was slow in the early 2000s, but has accelerated since then; Intel gets a lot of the credit (and blame) for that.
- Carson hopes for a performance “discontinuity” with Intel Ivy Bridge.
- Teradata is not afraid to use niche special-purpose chips.
- Teradata’s views can be taken as well-informed endorsements of InfiniBand and SAS 2.0.
SAP HANA has gotten much attention, mainly for its potential. I finally got briefed on HANA a few weeks ago. While we didn’t have time for all that much detail, it still might be interesting to talk about where SAP HANA stands today.
SAP HANA is positioned as an “appliance”. So far as I can tell, that really means it’s a software product for which there are a variety of emphatically-recommended hardware configurations — Intel-only, from what right now are eight usual-suspect hardware partners. Anyhow, the core of SAP HANA is an in-memory DBMS. Particulars include:
- Mainly, HANA is an in-memory columnar DBMS, based on SAP’s confusingly-renamed BI Accelerator/BW Accelerator. Analytics and most OLTP (OnLine Transaction Processing) go against the columnar part of HANA.
- The HANA DBMS also has an in-memory row storage option, used to store metadata, small tables, and so on.
- SAP HANA talks both SQL and MDX.
- The HANA DBMS is shared-nothing across blades or rack servers. I imagine that within an individual blade it’s shared-everything. The usual-suspect data distribution or partitioning strategies are available — hash, range, round-robin (a toy sketch of these appears below).
- SAP HANA has what sounds like a natural disk-based persistence strategy — logs, snapshots, and so on. SAP says that this is synchronous enough to give ACID compliance. For some hardware partners, those “disks” are actually Fusion-io cards.
- HANA is fault-tolerant “across servers”.
- Text support is “coming soon”, which makes sense, given that BI Accelerator was based on the TREX search engine in the first place. Inxight is also in the HANA text mix.
- You can put data into SAP HANA in a variety of obvious ways:
- Writing it directly.
- Trigger-based replication (perhaps from the DBMS that runs your SAP apps).
- Log-based replication (based on Sybase Replication Server).
- SAP Business Objects’ ETL tool.
SAP says that the row-store part is based both on P*Time, an acquisition from Korea some time ago, and also on SAP’s own MaxDB. The IBM white paper mentions only the MaxDB aspect. (Edit: Actually, see the comment thread below.) Based on a variety of clues, I conjecture that this was an aspect of SAP HANA development that did not go entirely smoothly.
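As a companion to the data-distribution bullet above, here is a toy Python sketch of how hash, range, and round-robin partitioning each decide which node a row lands on. The node count, keys, and range boundaries are invented for illustration; HANA’s actual placement logic is of course more involved than this.

```python
# Toy versions of the three distribution strategies named above, each mapping
# a partitioning key to one of NUM_NODES nodes.
import bisect
from itertools import count

NUM_NODES = 4

def hash_partition(key):
    # Same key always lands on the same node -- good for key lookups and
    # co-located joins, and usually spreads load evenly.
    return hash(key) % NUM_NODES

RANGE_BOUNDARIES = [1_000, 2_000, 3_000]   # node 0: < 1000, node 1: < 2000, ...
def range_partition(key):
    # Keeps adjacent keys together -- good for range scans, but can skew
    # if the data clusters in one range.
    return bisect.bisect_right(RANGE_BOUNDARIES, key)

_next_row = count()
def round_robin_partition(_key):
    # Ignores the key entirely -- perfectly even load, no locality at all.
    return next(_next_row) % NUM_NODES

for key in (17, 1_042, 2_999, 9_001):
    print(key, hash_partition(key), range_partition(key), round_robin_partition(key))
```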
Other SAP HANA components include:
This year’s Gartner Magic Quadrant for Data Warehouse Database Management Systems is out.* I shall now comment, just as I did on the 2010, 2009, 2008, 2007, and 2006 Gartner Data Warehouse Database Management System Magic Quadrants, to varying extents. To frame the discussion, let me start by saying:
- In general, I regard Gartner Magic Quadrants as a bad use of good research.
- Illustrating the uselessness of — or at least poor execution on — the overall quadrant metaphor, a large majority of the vendors covered are lined up near the line x = y, each outpacing the one below in both of the quadrant’s dimensions.
- I find fewer specifics to disagree with in this Gartner Magic Quadrant than in previous years’ versions. Two factors jump to mind as possible reasons:
- This year’s Gartner Magic Quadrant for Data Warehouse Database Management Systems is somewhat less ambitious than others; while it gives as much company detail as its predecessors, it doesn’t add as much discussion of overall trends. So there’s less to (potentially) disagree with.
- Merv Adrian is now at Gartner.
- Whatever the problems may be with Gartner’s approach, the whole thing comes out better than do Forrester’s failed imitations.
*As of February, 2012 — and surely for many months thereafter — Teradata is graciously paying for a link to the report.
Specific company comments, roughly in line with Gartner’s rough single-dimensional rank ordering, include:
Predictably, I wasn’t pre-briefed on the details of Oracle’s Big Data Appliance announcement today, and an inquiry to partner Cloudera doesn’t happen to have been immediately answered.* But anyhow, it’s clear from coverage by Larry Dignan and Derrick Harris that Oracle’s Big Data Appliance includes:
- Some version of Cloudera Manager (I’m guessing more or less the best one).*
- Some version of Apache Hadoop (I’m guessing the same distribution that Cloudera prefers to use).*
- Some kind of support.
In other words, it’s a lot like getting Cloudera Enterprise,* plus some hardware, plus some other stuff.
*Edit: About 2 minutes after I posted this, I got email from Cloudera CEO Mike Olson. Yes, the Oracle Big Data Appliance bundles Cloudera Enterprise.
That raises a question that recurs anyway: What exactly is Cloudera Manager?
When I drafted a list of key analytics-sector issues in honor of look-ahead season, the first item was “execution of various big vendors’ ambitious initiatives”. By “execute” I mean mainly:
- “Deliver products that really meet customers’ desires and needs.”
- “Successfully convince them that you’re doing so …”
- “… at an attractive overall cost.”
Vendors mentioned here are Oracle, SAP, HP, and IBM. Anybody smaller got left out due to the length of this post. Among the bigger omissions were:
As a new year approaches, it’s the season for lists, forecasts and general look-ahead. Press interviews of that nature have already begun. And so I’m working on a trilogy of related posts, all based on an inquiry about hot analytic trends for 2012.
This post is a moderately edited form of an actual interview. Two other posts cover analytic trends to watch (planned) and analytic vendor execution challenges to watch (already up).
1. EMC Greenplum has evolved its appliance product line. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says “Hooray! Our appliance is all-in-one!” Big whoop.
2. That said, the Hadoop part of EMC’s story is based on MapR, which so far as I can tell is actually a pretty good Hadoop implementation. More precisely, MapR makes strong claims about performance and so on, and Apache Hadoop folks don’t reply “MapR is full of &#$!” Rather, they say “We’re going to close the gap with MapR a lot faster than the MapR folks like to think — and by the way, guys, thanks for the butt-kick.” A lot more precision about MapR may be found in this M. C. Srivas SlideShare.