Theory and architecture
Analysis of design choices in databases and database management systems. Related subjects include:
- Any subcategory
- Database diversity
- Explicit support for specific data types
- (in Text Technologies) Text search
Oracle Exadata 2 capacity pricing
Summary of Oracle Exadata 2 capacity pricing
Analyzing Oracle Exadata pricing is always harder than one would first think. But I’ve finally gotten around to doing an Oracle Exadata 2 pricing spreadsheet. The main takeaways are:
- If we believe Oracle’s claims of 10X compression, Exadata 2 costs more per terabyte of user data than Netezza TwinFin — $22-26K/TB vs. TwinFin’s <$20K — but less than the Teradata 2550.
- These figures are highly sensitive to assumptions about Oracle’s hybrid columnar compression.
- Similarly, if Netezza or Teradata were to significantly upgrade their own compression, the price comparison would look quite different.
- Options such as Data Mining or Oracle Spatial add 12% or so each to Exadata’s total system price.
Longer version
When Oracle introduced Exadata last year it was, well, expensive. Exadata 2 has now been announced, and it is significantly cheaper than Exadata 1 per terabyte of user data, based on:
- Similar overall pricing
- Twice the disk capacity
- Better compression
Categories: Analytic technologies, Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Exadata, Netezza, Oracle, Pricing, Teradata | 13 Comments |
Yahoo wants to do decapetabyte-scale data warehousing in Hadoop
My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo’s Hadoop effort — everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans to, within a year or so, get Hadoop to the point that it is managing 10s of petabytes of data for Yahoo, with reasonable data warehousing functionality.
Highlights of our visit included:
- There are dozens of people at Yahoo doing Hadoop development that will wind up getting open sourced. (Full-time or close to it.) In particular, everything Mark’s team does goes to open source.
- Yahoo is moving as much of its analytics to Hadoop as possible. Much of this is being moved away from Oracle and from Yahoo’s own Everest.
- A column store is being put on top of HDFS, based on Yahoo technology. Columns will be striped across nodes. Perhaps that’s why the effort is called Project Zebra.
- Mark believes that in a year Hadoop will be much further along in meeting traditional data warehousing requirements, in areas such as:
- Metadata
- SLAs/high availability/other workload management
- Data retention policies
- Security/privacy*
- Yahoo views the time-to-market benefits of Hadoop as being more important than TCO.
Categories: Analytic technologies, Data warehousing, Hadoop, MapReduce, Open source, Oracle, Petabyte-scale data management, Web analytics, Yahoo | 6 Comments |
Facts and rumors
- Vertica is putting out a press release today touting its 100th customer, and talking of triple digit growth last year.
- Multiple sources have told me that the DATAllegro system is being thrown out of Dell, so evidently Dell is telling this to one and all. If that goes through, this would presumably leave TEOCO as DATAllegro’s single happy customer. (I haven’t checked with Microsoft for its view.)
- A rumor has it that Infiniband technology vendor Voltaire, Ltd. privately claims triple-digit sales of switches for Exadata 1 (I think that one would be one switch per Exadata installation, not per rack). Based just on a quick glance, this is far from confirmed by Voltaire’s earnings conference call transcripts or SEC filings. However, the most recent transcript does seem to indicate Voltaire got multiple Exadata deals in the telecommunications sector, and suggests some Exadata penetration in other sectors as well.
- I was told of a classified-agency user that has >1 petabyte of data on Exadata 1 and 600 terabytes or so on Netezza. My not-obviously-biased source says the agency is distinctly happier with Netezza than Exadata.
- Like ParAccel, Oracle just got dinged for TPC-related misbehavior.
- Rumor has it that Sun has no intention of helping ParAccel rerun its withdrawn TPC-H benchmark.
- ParAccel has withdrawn the claim from its home page to be the “CERTIFIED” price-performance leader. This seems to confirm that the claim was a reference to the TPC-H. In my opinion, that was a gross misrepresentation of what the TPC-H shows.
Thoughts on the integration of OLTP and data warehousing, especially in Exadata 2
Oracle is pushing Exadata 2 as being a great system for any of OLTP (OnLine Transaction Processing), data warehousing or, presumably, the integration of same. This claim rests on a few premises, namely: Read more
Categories: Analytic technologies, Data warehouse appliances, Data warehousing, Exadata, OLTP, Oracle, Solid-state memory, Theory and architecture | 36 Comments |
Notes on the Oracle Database 11g Release 2 white paper
The Oracle Database 11g Release 2 white paper I cited a couple of weeks ago has evidently been edited, given that a phrase I quoted last month is no longer to be found. Anyhow, here are some quotes from and comments on what evidently is the latest version. Read more
HadoopDB
Despite a thoughtful heads-up from Daniel Abadi at the time of his original posting about HadoopDB, I’m just getting around to writing about it now. HadoopDB is a research project carried out by a couple of Abadi’s students. Further research is definitely planned. But it seems too early to say that HadoopDB will ever get past the “research and oh by the way the code is open sourced” stage and become a real code line — whether commercialized, open source, or both.
The basic idea of HadoopDB is to put copies of a DBMS at different nodes of a grid, and use Hadoop to parcel work among them. Major benefits when compared with massively parallel DBMS are said to be:
- Open/cheap/free
- Query fault-tolerance
- The related concept of tolerating node degradation that isn’t an outright node failure.
HadoopDB has actually been built with PostgreSQL. That version achieved performance well below that of a commercial DBMS “DBX”, where X=2. Column-store guru Abadi has repeatedly signaled his intention to try out HadoopDB with VectorWise at the nodes instead. (Recall that VectorWise is shared-everything.) It will be interesting to see how that configuration performs.
The real opportunity for HadoopDB, however, in my opinion may lie elsewhere. Read more
Fault-tolerant queries
MapReduce/Hadoop fans sometimes raise the question of query fault-tolerance. That is — if a node fails, does the query need to be restarted, or can it keep going? For example, Daniel Abadi et al. trumpet query fault-tolerance as one of the virtues of HadoopDB. Some of the scientists at XLDB spoke of query fault-tolerance as being a good reason to leave 100s or 1000s of terabytes of data in Hadoop-managed file systems.
When we discussed this subject a few months ago in a couple of comment threads, it seemed to be the case that:
- Hadoop generally has query fault-tolerance. Intermediate result sets are materialized, and data isn’t tied to nodes anyway. So if a node goes down, its work can be sent to another node.
- Hive actually did not have query fault-tolerance at that time, but it was on the roadmap. (Edit: Actually, it did within a single MapReduce job. But one Hive job can comprise several rounds of MapReduce.)
- Most DBMS vendors do not have query fault-tolerance. If a query fails, it gets restarted from scratch.
- Aster Data’s nCluster, however, does appear to have some kind of query fault-tolerance.
This raises an obvious (pair of) question(s) — why and/or when would anybody ever care about query fault-tolerance? Read more
Categories: Analytic technologies, Aster Data, Data warehousing, Hadoop, Parallelization, Scientific research, Theory and architecture | 10 Comments |
Introduction to the XLDB and SciDB projects
Before I write anything else about the overlapping efforts known as XLDB and SciDB, I probably should explain and disambiguate what they are as best I can. XLDB was organized and still is run by guys who want to solve a scientific problem in eXtremely Large DataBase Management, most especially Jacek Becla of SLAC (the organization previously known as Stanford Linear Accelerator Center). Becla’s original motivation was that he needs a DBMS to manage what will be 55 petabytes of raw image data and 100 petabytes of astronomical data total for LSST (Large Synoptic Survey Telescope). Read more
Categories: Data models and architecture, Database diversity, eBay, Michael Stonebraker, Open source, Petabyte-scale data management, Scientific research, Theory and architecture | 2 Comments |
Oracle Exadata hybrid columnar compression
Oracle Database 11g Release 2 is out, and as usual I wasn’t briefed — perhaps because Oracle is more scared than its competitors are of hard questions, perhaps for some other reason entirely.* Anyhow, Oracle Database 11 Release 2 contains an Exadata-only feature called hybrid columnar compression. The Oracle Database 11g Release 2 white paper says “data is grouped, ordered, and stored one column at a time.” But Kevin Closson clarifies:
The word hybrid is important.
Rows are still used. They are stored in an object called a Compression Unit. Compression Units can span multiple blocks. Like values are stored in the compression unit with metadata that maps back to the rows.
So, “hybrid” is the word. But, none of that matters as much as the effectiveness. This form of compression is extremely effective.
That sounds a whole lot like PAX. Specifically, in Oracle’s case I would guess “hybrid columnar compression” provides the compression benefits of column stores, but not column stores’ I/O benefits, and also not any kind of in-memory compression. Read more
Categories: Columnar database management, Data warehousing, Database compression, Exadata, Oracle, Theory and architecture | 20 Comments |
Sybase IQ technical highlights
General highlights of the Sybase IQ technical story include:
- Sybase IQ is an analytic DBMS with a columnar/column-store architecture
- Unlike most analytic DBMS, Sybase IQ has a shared-disk architecture.
- The Sybase IQ indexing story is a bit complicated, with a bunch of different index kinds. Most are focused on columns with low cardinality, and it least in some cases are a lot like bitmaps. (Sybase IQ when first introduced was a pure bitmap index product, with a single index type “Fast Project”.) But one index kind, “High Group” — designed for columns with high cardinality – is an exception to most generalities about other Sybase IQ index kinds, and instead is more akin to a b-tree.
- Unlike Vertica, Sybase stores each column of data only once. I don’t see how it would make sense to have multiple indexes on the same column, but I didn’t actually ask whether doing so is possible or common.
- Sybase estimates that Sybase IQ requires ¼ the DBA effort of, say, Oracle. (Frankly, that’s not a particularly good figure.) Obviously, this is just a broad-brush average.
- Sybase recently repurposed an acquired ETL tool to be focused on Sybase IQ. IQ of course also works with various third-party tools, certified or otherwise.
- Sybase’s Power Designer CASE (Computer-Aided Software Engineering)/database design tool works with Sybase IQ.
- Sybase is proud of Sybase IQ’s new in-database analytics capabilities, but I haven’t yet grasped what, if anything, is differentiated about them.
- Sybase has an ILM (Information Lifecycle Management) story built around the point that different columns can be stored on different kinds of media.
Highlights of the Sybase IQ compression story include: Read more