I just ran across a December 10 blog post by Chuck Hollis outlining some of EMC’s — or at least Chuck’s — views on data warehousing and business intelligence. It’s worth scanning, a certain “Where you stand depends upon where you sit” flavor to it notwithstanding. In contrast to my usual blogging style, Chuck’s post is excerpted at length below, with comments from me interspersed.
There’s an interesting statistic floating around that something like 20% of all enterprise storage is used in DW/BI environments. On that basis alone, it shouldn’t be a surprise that EMC would take a serious interest in this topic.
Hmm. It’s hard to think of anything in an OLTP database, log file, or whatever that shouldn’t be copied into a data warehouse at this point. Sometimes it should then be discarded, sometimes not. E-mail, other text-based messaging, other media, etc. are an exception to that rule, however.
OK, I can believe the 20% figure.
Sure, there’s always more data and more users, but occasionally a data warehouse grows from simply being a nice-to-have decision support application into an operational part of the business — and the game changes.
True. On the other hand, older data warehouses are most commonly managed either by general-purpose DBMS such as Oracle and DB2, or else by Teradata, and were generally designed with more robustness than they perhaps even needed.
But I guess Sybase IQ and Essbase are two big exceptions to that generalization.
… usually there’s no “one smokin’ query” to go consider — the reality is usually hundreds of ad-hoc requests, some optimized, some not, all hammering the data store. Designing for optimal performance isn’t always about maximizing sequential access. Unoptimized queries can hit the data store with random reads, radically changing the performance profile.
I.e., workloads aren’t trivially simple. But Chuck’s way of saying it has more dramatic flair.
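To see why those unoptimized, random-read queries change the performance profile so radically, a little back-of-envelope arithmetic helps. All the figures below are my own illustrative assumptions (commodity-disk-era numbers), not anything from Chuck’s post:

```python
# Rough sketch: reading the same 100 GB sequentially vs. via random
# page reads on a single spinning disk. Numbers are illustrative.

GB = 10**9

data_size = 100 * GB          # assumed table size
seq_throughput = 100 * 10**6  # ~100 MB/s sequential scan rate (assumption)
page_size = 8 * 1024          # 8 KB database pages (assumption)
random_iops = 150             # ~150 random reads/sec, 15K-rpm class disk

seq_seconds = data_size / seq_throughput
random_seconds = (data_size / page_size) / random_iops

print(f"sequential scan: {seq_seconds / 60:.0f} minutes")
print(f"random reads:    {random_seconds / 3600:.0f} hours")
```

Under those assumptions the sequential scan finishes in well under half an hour, while the same data read one random page at a time takes the better part of a day — which is the “radically changed performance profile” in a nutshell.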
Many operational DW/BI environments are now continually updated to provide near-real-time results to queries, meaning write performance now becomes a very interesting topic.
Environments with very large numbers of spindles also may have to figure in rebuild times for failed disks (they do fail occasionally, you know …) and considering what the resulting performance impact might look like.
More and more DW/BI environments are being designed as HA environments — with varying degrees of redundancy and failover throughout — up to and including a remote failover site.
Good points all.
And, oh yes, we’ve got an entirely new magic ingredient to play with — enterprise flash drives. While usually impractical for the primary data store, they can do amazing performance magic when applied against the temporary caches used by the DW/BI application.
DBMS of different architectures seem to differ as to the extent they write intermediate result sets out to disk. But yes — that can be a very big deal.
As to whether temporary caches are all that flash is usually practical for in data warehousing: today, that’s probably true.
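For readers who haven’t seen why a DBMS writes intermediate result sets to disk at all, here’s a toy external-sort sketch (my own illustrative code, with deliberately tiny sizes): when the working set exceeds the memory budget, sorted runs get spilled to temp files and merged afterward. Those temp files are precisely the “temporary caches” that flash can accelerate.

```python
# Minimal external merge sort: spill sorted runs to temp files when the
# (simulated) memory budget is exceeded, then k-way merge the runs.
# Sizes are toy assumptions for illustration.

import heapq
import tempfile

def external_sort(values, memory_budget=4):
    runs = []
    for start in range(0, len(values), memory_budget):
        chunk = sorted(values[start:start + memory_budget])  # in-memory sort
        f = tempfile.TemporaryFile(mode="w+")  # the spill file on disk/flash
        f.writelines(f"{v}\n" for v in chunk)
        f.seek(0)
        runs.append(f)
    # Merge the sorted runs, streaming one line at a time from each file
    merged = heapq.merge(*((int(line) for line in f) for f in runs))
    return list(merged)

print(external_sort([9, 3, 7, 1, 8, 2, 6, 4, 5]))  # [1, 2, ..., 9]
```

The spill files are read and written in access patterns that mix sequential and random I/O, which is why low-latency flash can pay off there even when the primary data store stays on spinning disk.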
Then there’s the operational aspect of all this — how is it all backed up and recovered? Waiting days or weeks for a 100TB environment to be recreated from tape or source data isn’t usually an option for most businesses.
What about development, testing and staging of larger DW/BI applications? Things like snaps and replication come into play. And, yes, we’ve got more than a few customers doing remote DR for their DW/BI environments.
More good points.
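The “days or weeks” claim for a 100TB restore is easy to sanity-check. Using an assumed LTO-4-class native rate (my figure, not Chuck’s):

```python
# Restore-time arithmetic for a 100 TB warehouse from tape.
# Drive speed is an illustrative assumption (~LTO-4 native rate).

TB = 10**12

warehouse_size = 100 * TB
tape_rate = 120 * 10**6       # ~120 MB/s native per tape drive (assumption)

one_drive_days = warehouse_size / tape_rate / 86400
print(f" 1 tape drive:  ~{one_drive_days:.1f} days")  # ~9.6 days

for drives in (4, 16):
    print(f"{drives:2d} tape drives: ~{one_drive_days / drives:.1f} days")
```

Even parallelized across many drives — and assuming the restore streams at full speed, which it rarely does — the window is long enough that most businesses need some other recovery strategy.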
Stepping back from the application itself, all DW/BI environments produce mountains of downstream data — lots of query results, analysis, data cubes, reports, and more. Sometimes these downstream environments can be much larger than the DW/BI that creates them all.
That shouldn’t be happening. If it is, you should get a fast analytic DBMS and reduce your dependence on OLAP cubes.
Over the last few years, I’ve been amazed at how many new software players are coming into this application space.
Sure, we’re working with industry standard players like Oracle, IBM and Microsoft. We’re also starting to do more work with the second wave of appliance vendors like Teradata and Netezza.
Most interesting to me are the newest wave of software-only players who can take a relatively standard scale-out server/storage environment and do some amazing things: vendors like GreenPlum, Vertica, DatAllegro and ParAccel.
The market success of those four seems inversely proportional to the supposed closeness of their relationships with EMC.
Just teasing, Chuck.
The other heated debate in the industry is the “dedicated appliance vs. standardized infrastructure”.
Some will argue that a customized and bespoke all-in-one environment is best for DW/BI applications. EMC’s point of view is that intelligent and optimized use of standardized infrastructure can deliver similar — or sometimes better — results, and deliver an operational environment that works pretty much the same as the rest of the landscape.
One of the useful aspects of our new Competency Center is that all of these approaches can be evaluated — side-by-side if needed — using a relatively standard server/storage infrastructure environment as a starting point.
Non-EMC alternatives can be evaluated side-by-side in EMC’s facilities with EMC-based ones??
Somehow, I don’t think that’s exactly what Chuck meant …
And then there’s the extremely interesting prospect of running DW/BI environments under VMware. We’ve already established that there’s no disk I/O tax using VMware, and in some cases we get better I/O results.
More intriguing is the ability for VMware to take modern four-socket, six-core large-memory server designs, and partition them into comfortable virtualized chunks that DW/BI software can comfortably exploit, with the potential of delivering substantially more aggregate server performance in a given server environment.
As with the OLAP cube references above, this makes me think Chuck is thinking not just of high-end but also of sub-terabyte data warehousing.
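The partitioning arithmetic Chuck is gesturing at is simple enough to sketch. The per-VM sizing below is my assumption, not anything EMC or VMware publishes:

```python
# Carving a 4-socket, 6-core, large-memory server into VM-sized chunks.
# Per-VM core count and RAM figure are illustrative assumptions.

sockets, cores_per_socket = 4, 6
total_cores = sockets * cores_per_socket   # 24 cores
ram_gb = 128                               # assumed large-memory config

vm_cores = 4                               # a "comfortable" per-VM size
n_vms = total_cores // vm_cores
print(f"{n_vms} VMs of {vm_cores} cores, ~{ram_gb // n_vms} GB RAM each")
```

A single DW/BI instance that can’t scale past a handful of cores uses a 4-core VM far more fully than it would the whole 24-core box — which is where the “more aggregate server performance” claim comes from, and why this sounds more like sub-terabyte than high-end warehousing to me.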