March 26th, 2007 Curt Monash
Many of my thoughts on data warehouse DBMS and appliances have been collected in a white paper, sponsored by DATAllegro. As in a couple of other white papers — collected here — I coined a phrase to describe the core concept: Index-light. MPP row-oriented data warehouse DBMSs certainly have indices, which are occasionally even used. But the approaches to database design that are supported or make sense to use are simply different for DATAllegro, Netezza (the most extreme example of all) or Teradata than for Oracle or Microsoft. And the differences are all in the direction of less indexing.
Here’s an excerpt from the paper. Please pardon the formatting; it reads better in the actual PDF.
Posted in DATAllegro, Data warehouse appliances, Data warehousing, Database theory and practice, Relational database management systems | 2 Comments »
March 25th, 2007 Curt Monash
Oracle made a slick move in picking up Tangosol, a leader in object/data caching for all sorts of major OLTP apps. They do financial trading, telecom operations, big web sites (Fedex, Geico), and other good stuff. This is a reminder that the list of important memory-centric data handling technologies is getting fairly long, including:
- Object caching (e.g., Tangosol, Progress ObjectStore)
- In-memory RDBMS (e.g., Oracle TimesTen, Solid BoostEngine, McObject eXtremeDB)
- Stream processing (e.g., Progress Apama, Streambase)
And that’s just for OLTP; there’s a whole other set of memory-centric technologies for analytics as well.
When one connects the dots, I think three major points jump out:
- There’s a lot more to high-end OLTP than relational database management.
- Oracle is determined to be the leader in as many of those areas as possible.
- This all fits the market disruption narrative.
I write about Point #1 all the time. So this time around let me expand a little more on #2 and #3.
Posted in Cache, Complex event/stream processing (CEP), Database diversity, Database theory and practice, Memory-centric data management, OLTP database management, Oracle, Oracle TimesTen, Progress, Apama, and DataDirect, Relational database management systems, Specialized data management in general, StreamBase, solidDB | 2 Comments »
March 24th, 2007 Curt Monash
I’ve recently made a lot of posts about database compression. 3X or more compression is rapidly becoming standard; 5X+ is coming soon as processor power increases; 10X or more is not unrealistic. True, this applies mainly to data warehouses, but that’s where the big database growth is happening. And new kinds of data — geospatial, telemetry, document, video, whatever — are highly compressible as well.
This trend suggests a few interesting possibilities for hardware, semiconductors, and storage.
- The growth in demand for storage might actually slow. That said, I frankly think it’s more likely that Parkinson’s Law of Data will continue to hold: Data expands to fill the space available. E.g., video and other media have near-infinite potential to consume storage; it’s just a question of resolution and fidelity.
- Solid-state (aka semiconductor or flash) persistent storage might become practical sooner than we think. If you really can fit a terabyte of data onto 100 gigs of flash, that’s a pretty affordable alternative. And by the way — if that happens, a lot of what I’ve been saying about random vs. sequential reads might be irrelevant.
- Similarly, memory-centric data management is more affordable when compression is aggressive. That’s a key point of schemes such as SAP’s or QlikTech’s. Who needs flash? Just put it in RAM, persisting it to disk just for backup.
- There’s a use for faster processors. Compression isn’t free. What you save on disk space and I/O you pay for at the CPU level. Those 5X+ compression levels do depend on faster processors, at least for the row store vendors.
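The arithmetic behind the "terabyte on 100 gigs of flash" remark is simple enough to sketch. The function name and the decimal-terabyte assumption below are mine, purely for illustration; real ratios vary by data and by product.

```python
def compressed_size_gb(raw_gb: float, ratio: float) -> float:
    """Storage needed for raw_gb of data at a given compression ratio (10 means 10X)."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return raw_gb / ratio


RAW_GB = 1000  # one terabyte, assuming decimal units

# The three ratios mentioned in the post: 3X, 5X, and 10X.
for ratio in (3, 5, 10):
    print(f"{ratio}X: {compressed_size_gb(RAW_GB, ratio):.0f} GB")
```

At 10X, the terabyte does indeed come down to 100 GB; at 3X it is still more than 300 GB.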
Posted in Data warehousing, Database compression, Memory-centric data management, QlikTech and QlikView, SAP, BI Accelerator, and MaxDB | 6 Comments »
March 24th, 2007 Curt Monash
In my opinion, the key part of Mike Stonebraker’s fascinating note on data compression was (emphasis mine):
The standard wisdom in most row stores is to use block compression. Hence, a storage block is compressed using a single technique (say Lempel-Ziv or dictionary). The technique chosen then compresses all the attributes in all the columns which occur on the block. In contrast, Vertica compresses a storage block that only contains one attribute. Hence, it can use a different compression scheme for each attribute. Obviously a compression scheme that is type-specific will beat an implementation that is “one size fits all”.
It is possible for a row store to use a type-specific compression scheme. However, if there are 50 attributes in a record, then it must remember the state for 50 type-specific implementations, and complexity increases significantly.
In addition, all row stores we are familiar with decompress each storage block on access, so that the query executor processes uncompressed tuples. In contrast, the Vertica executor processes compressed tuples. This results in better L2 cache locality, less main memory copying and generally much better performance.
Of course, any row store implementation can rewrite their executor to run on compressed data. However, this is a rewrite – and a lot of work.
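To make the "type-specific" idea concrete, here is a minimal Python sketch. It is not Vertica's implementation; the three encoders, the column names, and the sample values are all illustrative assumptions. The point is only that when a storage block holds a single attribute, each attribute can get whichever scheme suits its values.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encoding: suits sorted or low-cardinality columns."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    return [value for value, count in runs for _ in range(count)]


def delta_encode(values):
    """First value plus successive differences: suits slowly increasing integers."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    out, running = [], 0
    for d in deltas:
        running += d
        out.append(running)
    return out


def dict_encode(values):
    """Lookup table plus small integer codes: suits repeated strings."""
    table = sorted(set(values))
    code = {v: i for i, v in enumerate(table)}
    return table, [code[v] for v in values]


# Hypothetical columns of one table, each stored (and encoded) separately.
state = ["CA", "CA", "CA", "NY", "NY"]
timestamp = [1000, 1001, 1003, 1006, 1010]
plan = ["gold", "basic", "gold", "gold", "basic"]

encoded_state = rle_encode(state)          # [("CA", 3), ("NY", 2)]
encoded_timestamp = delta_encode(timestamp)  # [1000, 1, 2, 3, 4]
plan_table, plan_codes = dict_encode(plan)   # ["basic", "gold"], [1, 0, 1, 1, 0]
```

A row store that wanted the same effect would have to track one such encoder per attribute within every block, which is the bookkeeping burden the note describes.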
Posted in Columnar architectures, Data warehousing, Database compression, Sybase, Vertica Systems | 6 Comments »
March 24th, 2007 Curt Monash
The following is by Mike Stonebraker, CTO of Vertica Systems, copyright 2007, as part of our ongoing discussion of data compression. My comments are in a separate post.
Row Store Compression versus Column Store Compression
I Introduction
There are three aspects of space requirements, which we discuss in this short note, namely:
structural space requirements
index space requirements
attribute space requirements.
Posted in Data warehousing, Database compression, Database theory and practice, Michael Stonebraker, Vertica Systems | 3 Comments »
March 21st, 2007 Curt Monash
We have lively discussions going on about columnar data stores vs. vertically partitioned row stores. Part is visible in the comment thread to a recent post. Other parts come in private comments from Stuart Frost of DATAllegro and Mike Stonebraker of Vertica et al.
To me, the most interesting part of what the Vertica guys are saying is twofold. One is that data compression just works better in column stores than row stores, perhaps by a factor of 3, because “the next thing in storage is the same data type, rather than a different one.” Frankly, although Mike has said this a couple of times, I haven’t understood yet why row stores can’t be smart enough to compress just as well. Yes, it’s a little harder than it would be in a columnar system; but I don’t see why the challenge would be insuperable.
The second part is even cooler, namely the claim that column stores allow the processors to operate directly on compressed data. But once again, I don’t see why row stores can’t do that too. For example, when you join via bitmapped indices, exactly what you’re doing is operating on highly-compressed data.
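Here is a rough Python sketch of the bitmap point, under my own simplifying assumptions: each bitmap is a plain Python integer, the column names and values are made up, and no further bitmap compression (which real products typically add) is applied. The query is answered entirely on the bitmaps, without touching the underlying rows.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


# Two hypothetical columns of the same six-row table.
region = ["east", "west", "east", "east", "west", "east"]
status = ["open", "open", "closed", "open", "closed", "open"]

region_idx = build_bitmap_index(region)
status_idx = build_bitmap_index(status)

# "region = east AND status = open" becomes a single bitwise AND.
matches = region_idx["east"] & status_idx["open"]

match_count = bin(matches).count("1")
match_rows = [i for i in range(len(region)) if (matches >> i) & 1]
print(match_count, match_rows)  # 3 [0, 3, 5]
```

Nothing in this depends on whether the base table is stored by row or by column, which is the substance of my skepticism.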
Posted in Columnar architectures, DATAllegro, Data warehouse appliances, Data warehousing, Database compression, Relational database management systems, Vertica Systems | 1 Comment »
March 19th, 2007 Curt Monash
Stuart Frost of DATAllegro offered an interesting counter today to columnar DBMS architectures: vertical partitioning. In particular, he told me of a 120 terabyte (growing soon to 250 terabytes) call data record database, in which a few key columns were separated out.
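As a minimal sketch of what "separating out a few key columns" means, consider the Python below. The column names, sample rows, and helper functions are hypothetical, not DATAllegro's design. The wide table is split into a narrow "hot" table and a "cold" remainder that share a row id, so a query touching only the key columns scans far less data.

```python
def vertical_partition(rows, hot_columns):
    """Split wide rows into a narrow 'hot' table and a 'cold' remainder, linked by row_id."""
    hot, cold = [], []
    for row_id, row in enumerate(rows):
        hot.append({"row_id": row_id, **{c: row[c] for c in hot_columns}})
        cold.append({"row_id": row_id,
                     **{c: v for c, v in row.items() if c not in hot_columns}})
    return hot, cold


# Hypothetical call records; real ones would have many more columns.
calls = [
    {"caller": "555-0100", "callee": "555-0199", "duration_sec": 60, "cell_tower": "T1"},
    {"caller": "555-0101", "callee": "555-0150", "duration_sec": 120, "cell_tower": "T2"},
    {"caller": "555-0100", "callee": "555-0177", "duration_sec": 30, "cell_tower": "T1"},
]

hot, cold = vertical_partition(calls, ["caller", "duration_sec"])


def total_duration(hot_rows, caller):
    """Answer a common query from the narrow table alone."""
    return sum(r["duration_sec"] for r in hot_rows if r["caller"] == caller)


def rejoin(hot_rows, cold_rows, row_id):
    """Reassemble a full record; here row_id doubles as the list position."""
    return {**hot_rows[row_id], **cold_rows[row_id]}


print(total_duration(hot, "555-0100"))  # 90
```

The price is the rejoin whenever a query needs columns from both halves.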
Posted in Columnar architectures, DATAllegro, Data warehouse appliances, Data warehousing, Kognitio and WX2, Relational database management systems, Vertica Systems | 9 Comments »
March 17th, 2007 Curt Monash
SaaS integration is in the air.
- I recently talked with Pervasive Software about their data integration line. A large part of Pervasive’s new business is Salesforce.com integration, including at some big-name software vendors as customer/partner switch-hitters.
- I just rechecked my notes from my January talk with Cast Iron Systems. A large part of Cast Iron’s new business is also integration with Salesforce.com, Netsuite, and other SaaS vendors.
- Informatica keeps putting out press releases about Salesforce.com integration, most recently by offering replication in SaaS form itself.
But of course this makes sense. Without good data integration, SaaS applications would be pretty useless, at least at large and medium-sized enterprises.
Posted in Cast Iron Systems, EII, ETL, and/or EAI, Informatica, Pervasive Software, SaaS | No Comments »
March 16th, 2007 Curt Monash
I talk to a lot of data warehouse software and/or appliance start-ups. Naturally, they’re all gunning for Netezza, and regale me with stories about competitive replacements, competitive wins, benchmark wins, and the like. And there have been a couple of personnel departures too, notably development chief Bill Blake. Netezza insists this is because he got a CEO offer he couldn’t refuse, he’s still friendly with the company, development plans are entirely on track, and news of some sort is coming out in a few weeks. Also, Greenplum brags that its Asia/Pacific manager was snagged from Netezza.
On the other hand, Netezza claims lots of sales momentum, and that’s certainly consistent with what I hear from its competitors.
Posted in Business Objects, Data warehouse appliances, Data warehousing, Greenplum, Netezza, Relational database management systems | No Comments »
March 16th, 2007 Curt Monash
IBM sent over a bunch of success stories recently, with DB2’s new aggressive compression prominently mentioned. Mike Stonebraker made a big point of Vertica’s compression when last we talked; other column-oriented data warehouse/mart software vendors (e.g. Kognitio, SAP, Sybase) get strong compression benefits as well. Other data warehouse/mart specialists are doing a lot with compression too, although some of that is governed by please-don’t-say-anything-good-about-us NDA agreements.
Compression is important for at least three reasons:
- It saves disk space, which is a major cost issue in data warehousing.
- It saves I/O, which is the major performance issue in data warehousing.
- In well-designed systems, it can actually make on-chip execution faster, because the gains in memory speed and movement can exceed the cost of actually packing/unpacking the data. (Or so I’m told; I haven’t aggressively investigated that claim.)
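A back-of-the-envelope model of the I/O trade-off, in Python. Every number here is a made-up assumption rather than a measurement, and the model pessimistically assumes decompression does not overlap with disk reads.

```python
def scan_seconds(raw_gb, disk_gb_per_sec, ratio=1.0, decompress_gb_per_sec=None):
    """Estimated time for a full scan: disk read time plus (optional) decompression time.

    raw_gb                 uncompressed data size
    disk_gb_per_sec        sequential read throughput of the storage
    ratio                  compression ratio (4.0 means 4X); 1.0 means uncompressed
    decompress_gb_per_sec  rate at which the CPU produces uncompressed output
    """
    io_time = (raw_gb / ratio) / disk_gb_per_sec
    cpu_time = 0.0 if decompress_gb_per_sec is None else raw_gb / decompress_gb_per_sec
    return io_time + cpu_time


# Hypothetical system: 1,000 GB of raw data, 1 GB/sec of disk throughput.
uncompressed = scan_seconds(1000, 1.0)
compressed = scan_seconds(1000, 1.0, ratio=4.0, decompress_gb_per_sec=2.0)
print(uncompressed, compressed)  # 1000.0 750.0
```

In this toy model, compression only pays off if the CPU decompresses faster than the I/O savings accrue, which is why processor speed keeps coming up in these discussions.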
When evaluating data warehouse/mart software, take a look at the vendor’s compression story. It’s important stuff.
EDIT: DATAllegro claims in a note to me that they get 3-4x storage savings via compression. They also make the observation that fewer disks ==> fewer disk failures, and spin that (as it were) into a claim of greater reliability.
Posted in DATAllegro, Data warehouse appliances, Data warehousing, Database compression, IBM and DB2, Relational database management systems, SAP, BI Accelerator, and MaxDB, Vertica Systems | 2 Comments »
March 14th, 2007 Curt Monash
Like Greenplum, EnterpriseDB is a PostgreSQL-based DBMS vendor with an interesting story, whose technical merits I don’t yet know enough to judge. In particular, CEO Andy Astor:
- Confirms that EnterpriseDB is OLTP-focused, unlike Greenplum. That said, they are also used for some reporting and so on. But they don’t run 10s-of-terabytes sized data marts.
- Claims EnterpriseDB has a high level of Oracle compatibility – SQL, datatypes, stored procedures (so that would be PL/SQL too), packages, functions, etc.
- Claims ANTs isn’t nearly as Oracle-compatible.
- Claims 50-100% better OLTP performance out of the box than vanilla PostgreSQL, due to auto-tuning.
Also, EnterpriseDB has added a bunch of tools to PostgreSQL – debugging, DBA, etc. And it provides actual-company customer support, something that seems desirable when using a DBMS. It should also be noted that the product is definitely closed-source, notwithstanding EnterpriseDB’s open-source-like business model and its close ties to the open source community.
Posted in ANTs Software, Data warehousing, EnterpriseDB and Postgres Plus, Ingres, Mid-range DBMS, OLTP database management, Open source RDBMS, Oracle, Portability, transparency, and plug-compatibility, PostgreSQL | No Comments »
March 13th, 2007 Curt Monash
I asked Jeff Jones of IBM to explain the various DB2 code lines to me. His answer was so clear that I asked further permission to post it verbatim. Here it is. The main takeaway is that one shouldn’t confuse the shared-everything z/OS (mainframe) version with the more loosely-coupled Unix/Linux/Windows version.
1. DB2 9 for z/OS (CAM note: i.e., mainframe) is a unique code base designed in cooperation with and integrated tightly with the operating system (z/OS) and the hardware (System z). That said, our development and administration tools (the externals of the product), as well as the SQL language supported, are built to be nearly the same across DB2 platforms. DB2 9 for z/OS has a shared-resource architecture similar to Oracle RAC. Parallel Sysplex and other specialized System z hardware enable this high performance, high reliability scenario (that even Oracle has said is well built). Born in 1983.
http://ibm.com/db2/zos
2. DB2 9 for Linux, UNIX and Windows is a second unique code base. (CAM note: i.e., “open systems”) Roughly 10% of that code base is reserved for platform-specific code to optimize to threading, security, clustering etc. across Linux (quite a few), UNIX (AIX, Solaris, HP-UX) and Windows (many versions). This code base is designed for portability given that we don’t own the underlying hardware in all cases (as we do for DB2 on System z). Much tooling is shared across the other DB2 platforms. Born in 1993.
http://ibm.com/db2/9
http://ibm.com/software/data/db2/linux/validate <--- Linux platforms supported
NOTE: DB2 for Linux runs on all four IBM servers (System z, System p, System i and System x), same code base.
Posted in IBM and DB2, Relational database management systems | No Comments »
March 13th, 2007 Curt Monash
I talked with Greenplum honchos Bill Cook and Scott Yara yesterday. Bill is the new CEO, formerly head of Sun’s field operations. Scott is president, and in effect the marketing-guy co-founder. I still don’t know whether I really believe their technical story. But I do think I have a feel for what they’re trying to do. Key aspects of the Greenplum strategy include:
- Greenplum rewrote a lot of PostgreSQL to parallelize it, in the correct belief that MPP is the best way to go for high-end data warehousing.
- Indeed, Greenplum claims to have a general solution to DBMS parallelization. Unlike Netezza, DATAllegro, Vertica, and Kognitio, Greenplum offers a row-oriented data store with a fairly full set of indexing techniques. You want star indices or bitmaps? They have them. (They even claimed to be used for some text management when last we talked, although that was for O’Reilly and Mark Logic seems to be O’Reilly’s main text-indexing vendor.)
- Greenplum’s main sales strategy is to be part of Sun’s product line, bundled into Thumper boxes as single-part-number Sun offerings. They certainly could add other hardware OEMs, just like Checkpoint sells firewalls through multiple appliance vendors. But at least for now it’s all about Sun.
Posted in Data warehouse appliances, Data warehousing, Greenplum, Open source RDBMS, PostgreSQL, Relational database management systems | 1 Comment »
March 8th, 2007 Curt Monash
Ingres has non-trivial resources – 300 employees, 10,000 “real” customers, and some additional large number of installations embedded in CA products. It has a fairly pure support-only open source revenue model, although there may be exceptions to that in cases such as the DATAllegro relationship.
Should anybody care?
Yes and no. To compete effectively in the mid-range OLTP relational database management system market, you need a product that’s much easier to administer than Oracle, and preferably easier even than Microsoft SQL*Server. Ingres doesn’t meet that standard. Until it does, it probably won’t have much of a market outside its current installed base. But some of Ingres’s strategies and directions are pretty clever, and may be interesting to people who’d never actually consider using Ingres technology. Specifically, Ingres has plans in the areas of appliances and database services, two subjects that are close to my heart.
Posted in DATAllegro, Ingres, Relational database management systems | 2 Comments »
March 6th, 2007 Curt Monash
Monash Advantage members just received an exclusive nine-page Monash Letter with a competitive overview of the DBMS industry. The full analysis is exclusive to them, but I’ll give some highlights here.
1. As per my recent “deck-clearing” posts, there’s a lot more competitive opportunity in the DBMS industry than many observers recognize.
2. One reason is the considerable number of separate niches in the DBMS space.
3. Oracle is a classical Geoffrey Moore “gorilla” only in the market for high-end OLTP and mixed-use DBMS. Everything else is up for grabs.
4. As discussed here extensively, simpler appliance-like architectures are beating the overly complex general-purpose DBMS vendors’ solutions for VLDB data warehousing.
5. MPP/shared-nothing architectures are deservedly beating SMP/shared-everything approaches for VLDB data warehousing.
That’s not the only Monash Letter recently released; another one covered online marketing strategy and tactics.
Posted in Data warehousing, Database diversity, Database theory and practice, Oracle | No Comments »
March 6th, 2007 Curt Monash
I haven’t been as clear as I could have been in explaining why I think MPP/shared-nothing beats SMP/shared-everything. The answer is in a short white paper, currently bottlenecked at the sponsor’s end of the process. Here’s an excerpt from the latest draft:
There are two ways to make more powerful computers:
1. Use more powerful parts – processors, disk drives, etc.
2. Just use more parts of the same power.
Of the two, the more-parts strategy is much more cost-effective. Smaller* parts are much more economical, since the bigger the part, the harder and more costly it is to avoid defects, in manufacturing and initial design alike. Consequently, all high-end computers rely on some kind of parallel processing.
*As measured in terms of capacity, transistor count, etc., not physical size.
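For readers who want to see the more-parts idea in miniature, here is a hedged Python sketch of a shared-nothing aggregation. The function names, the CRC32 partitioning rule, and the sample rows are my own illustrative assumptions, not any vendor's design, and the "nodes" here run one after another in a single process rather than truly in parallel.

```python
import zlib
from collections import Counter


def partition(rows, node_count):
    """Assign each (key, amount) row to exactly one node by hashing its key."""
    nodes = [[] for _ in range(node_count)]
    for key, amount in rows:
        nodes[zlib.crc32(key.encode()) % node_count].append((key, amount))
    return nodes


def local_aggregate(node_rows):
    """Work a single node does using only its own slice of the data."""
    totals = Counter()
    for key, amount in node_rows:
        totals[key] += amount
    return totals


def parallel_sum(rows, node_count):
    """Combine the per-node partial results into the final answer."""
    partials = [local_aggregate(node) for node in partition(rows, node_count)]
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)


# Hypothetical sales rows: (customer key, amount).
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
print(parallel_sum(rows, node_count=3))
```

The answer is the same whether there is one node or a hundred; adding nodes just shrinks each node's slice, which is the more-parts strategy applied to a query.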
Posted in DATAllegro, Data warehouse appliances, Data warehousing, Database theory and practice, Microsoft and SQL*Server, Netezza, Oracle, Relational database management systems, Teradata, Vertica Systems | 6 Comments »
March 1st, 2007 Curt Monash
Oracle is evidently buying Hyperion Software. Much like Gaul, Hyperion can be divided into three parts:
- Budgeting and consolidation applications, descended from the original Hyperion and Pillar.
- Essbase, the definitive MOLAP engine, descended from Arbor Software.
- A business intelligence suite, descended from Brio.
The most important part is budgeting/planning, because it could help Oracle change the rules for application software. But Essbase could be just the nudge Oracle needs to finally renounce its one-server-fits-all dogma.
Posted in Data warehousing, Hierarchies, networks, graphs, and trees, MOLAP, Microsoft and SQL*Server, Oracle | 2 Comments »