Data warehousing
Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data.
Three big myths about MapReduce
Once again, I find myself writing and talking a lot about MapReduce. But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:
- MapReduce is something very new
- MapReduce involves strict adherence to the Map-Reduce programming paradigm
- MapReduce is a single technology
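To be concrete about what the second myth refers to, here is a toy, single-process illustration of the map/reduce programming paradigm in Python, using the classic word-count example. It is purely illustrative and not tied to any particular MapReduce implementation:

```python
from itertools import groupby
from operator import itemgetter

# Illustrative only: a single-process imitation of the map/reduce paradigm.
# Real MapReduce frameworks (Hadoop, Google's implementation, etc.) distribute
# these phases across many machines and handle the shuffle/sort for you.

def map_phase(doc_id, text):
    """Emit (key, value) pairs: here, (word, 1) for each word in a document."""
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    """Combine all values that share a key: here, sum the counts."""
    yield (key, sum(values))

def run_mapreduce(documents):
    # Map: apply map_phase to every input record
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_phase(doc_id, text))
    # Shuffle/sort: group intermediate pairs by key
    intermediate.sort(key=itemgetter(0))
    # Reduce: apply reduce_phase to each group
    results = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reduce_phase(key, (value for _, value in group)):
            results[k] = v
    return results

if __name__ == "__main__":
    docs = {"a.txt": "the quick brown fox", "b.txt": "the lazy dog"}
    print(run_mapreduce(docs))  # {'the': 2, 'quick': 1, ...}
```

Part of the point of the second myth is that real-world "MapReduce" usage often departs from this strict two-phase shape.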
Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Google, Greenplum, Hadoop, Log analysis, MapReduce, Michael Stonebraker, Parallelization, Web analytics | 11 Comments |
Introduction to SenSage
I visited with SenSage on my two most recent trips to San Francisco. Both visits were, through no fault of SenSage’s, hasty. Still, I think I have enough of a handle on SenSage basics to be worth writing up.
General SenSage highlights include:
Kickfire capacity and pricing
Kickfire’s marketing communication efforts are still a work in progress. Kickfire did finally relax its secrecy about FPGA-vs.-custom-silicon – not coincidentally during Netezza’s recent publicity cycle. That wise choice helped Kickfire get some favorable attention recently for its technical and market strategy, e.g. from Daniel Abadi, Merv Adrian and, kicking things off — as it were — me. Weeks after a recent Kickfire product release, there’s finally a fairly accurate data sheet up, although there’s still one self-defeatingly misleading line I’ll comment on below. Pricing is a whole other area of confusion, although it seems that current list prices have been inadvertently* leaked in Merv’s post linked above, with only one inaccuracy that I can detect.**
*I gather from the company that they forgot to tell Merv pricing was NDA.
** Merv cited a price as “starting” that I believe to be top-of-the-line. No criticism of Merv is implied in that; Kickfire has not been very clear in communicating hard numbers.
All that said, if one takes Kickfire’s marketing statements literally, Kickfire list pricing is around $20-50K per terabyte for a few small, fixed, high-performance configurations. That’s all-in, for plug-and-play appliances. What’s more, that range is based on the actual published user data capacity numbers for various Kickfire models, which I think are low for several reasons (a rough sketch of the per-terabyte arithmetic follows this list):
- Kickfire doesn’t officially admit that its model with 14.4 terabytes of disk can manage more than 6 terabytes of data, even though it clearly can.
- Actually, those 14.4 terabytes of disk can be increased or decreased as you choose.
- The basic compression figures implied in those calculations seem conservative.
- Compression figures are a lot more conservative yet, in that Kickfire assumes you’ll have a lot of actual indexes on your data. I’m not sure that’s necessary for most workloads.
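To make the per-terabyte arithmetic concrete, here is a rough sketch. The dollar figure and capacity numbers below are hypothetical placeholders, not Kickfire's actual (NDA'd) list prices; the point is only how the calculation works and how much it shifts with the capacity assumptions:

```python
# Back-of-envelope price-per-terabyte arithmetic for a fixed-configuration
# appliance. All numbers are hypothetical placeholders, chosen only to show
# how the calculation works and how sensitive it is to capacity assumptions.

def price_per_tb(list_price_usd, user_data_tb):
    return list_price_usd / user_data_tb

appliance_price = 300_000      # hypothetical all-in appliance list price
stated_capacity_tb = 6         # vendor's officially stated user-data capacity

print(f"At stated capacity:     ${price_per_tb(appliance_price, stated_capacity_tb):,.0f}/TB")

# If the stated capacity is conservative (e.g. because the compression
# assumption bakes in heavy indexing), the effective price per terabyte drops.
achievable_capacity_tb = 10    # hypothetical, assuming better compression / fewer indexes
print(f"At achievable capacity: ${price_per_tb(appliance_price, achievable_capacity_tb):,.0f}/TB")
```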
Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Kickfire, Pricing | 3 Comments |
Infobright notes
I had lunch w/ Bob Zurek and Susan Davis of Infobright today. This wasn’t primarily a briefing, but a few takeaways are:
- Infobright now has >100 paying customers.
- Typical database size is from the low 100s of gigabytes to low single-digit terabytes.
- Agile development is at or approaching two-week release cycles.
- Like Kickfire, Infobright has a multi-year deal with MySQL that insulates it against many potential Oracle/MySQL shenanigans.
- From an industry perspective, Infobright’s customer base sounds a lot like other vendors’:
- Data mart outsourcing/online analytics
- Log files for websites
- Telecommunications
- Financial services
- OEM, especially in the markets cited above
- “Hey, we’re beginning to see the occasional energy deal”
- A few random others
- Infobright is seeing some household-name customers, who surely have big-name analytic DBMS products, but who also have a policy that open source is the default choice: if open source can get the job done, the favorite closed-source choices aren’t used.
- Infobright has the usual open-source community story — lots of involvement and engagement in the forums, but contributions are limited mainly to connectivity, utility scripts, etc. (Maybe some national language translation too; I’m not sure.)
Greenplum is going hybrid columnar as well
Over the past summer, Vertica, VectorWise, and Oracle all announced flavors of hybrid row/columnar storage. Now it’s Greenplum’s turn. Greenplum is actually offering true columnar storage, as opposed to Oracle’s PAX-like scheme — and also as opposed to the kind of Frankencolumn storage Daniel Abadi decries. For example, you don’t have to do a join to retrieve multiple columns; you just ask for them and there they are. Similarly, Greenplum doesn’t maintain explicit row IDs – whether in row-oriented or column-oriented append-only storage – relying instead on block-level header information. Read more
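To illustrate the row ID point, here is a conceptual Python sketch of the general technique of deriving row identity from a block header plus an offset within the block, rather than storing an explicit row ID with every value. This is not a description of Greenplum's actual on-disk format, just the general idea:

```python
# Conceptual sketch only: deriving row IDs from block-level header information
# instead of storing an explicit per-row ID alongside every column value.

from dataclasses import dataclass
from typing import List

@dataclass
class ColumnBlock:
    first_row_id: int   # block header: row ID of the first value in the block
    values: List        # the column values themselves, with no per-row IDs stored

class AppendOnlyColumn:
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks: List[ColumnBlock] = []
        self.next_row_id = 0

    def append(self, value):
        if not self.blocks or len(self.blocks[-1].values) == self.block_size:
            self.blocks.append(ColumnBlock(first_row_id=self.next_row_id, values=[]))
        self.blocks[-1].values.append(value)
        self.next_row_id += 1

    def get(self, row_id):
        # Row identity is implicit: header's first_row_id plus offset within the block.
        for block in self.blocks:
            offset = row_id - block.first_row_id
            if 0 <= offset < len(block.values):
                return block.values[offset]
        raise KeyError(row_id)

# Retrieving several columns for the same row is just a positional lookup in
# each column; there is no join against a table of explicit row IDs.
name, price = AppendOnlyColumn(), AppendOnlyColumn()
for n, p in [("widget", 9.99), ("gadget", 19.99), ("gizmo", 4.99)]:
    name.append(n)
    price.append(p)
print(name.get(1), price.get(1))   # gadget 19.99
```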
Categories: Analytic technologies, Columnar database management, Data warehousing, Database compression, Greenplum, Theory and architecture | 12 Comments |
How 30+ enterprises are using Hadoop
MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:
- Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
- Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
- Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
- Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
- Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
- Application areas mentioned — and these overlap in some cases — include:
- Log and/or clickstream analysis of various kinds (see the sketch after this list)
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance
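As one concrete flavor of the log/clickstream use case above, here is a minimal Hadoop Streaming-style mapper and reducer in Python that count page hits per URL. It is an illustrative sketch only; the log format (space-separated fields with the URL in field 7, as in common/combined web log format) is an assumption for the example:

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming-style job: count page hits per URL from web logs.
# Illustrative sketch only; the assumed log format puts the URL in field 7.

import sys

def mapper():
    """Read raw log lines from stdin, emit 'url<TAB>1' per hit."""
    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 6:
            print(f"{fields[6]}\t1")

def reducer():
    """Hadoop Streaming sorts mapper output by key before the reducer sees it,
    so identical URLs arrive on consecutive lines; sum their counts."""
    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t")
        if url != current_url:
            if current_url is not None:
                print(f"{current_url}\t{count}")
            current_url, count = url, 0
        count += int(value)
    if current_url is not None:
        print(f"{current_url}\t{count}")

if __name__ == "__main__":
    # Run as: python job.py map    (mapper)
    #         python job.py reduce (reducer)
    if sys.argv[1] == "map":
        mapper()
    else:
        reducer()
```

Under Hadoop Streaming such scripts are passed via the -mapper and -reducer options of the streaming jar; they can also be tested locally with a pipeline like `cat access.log | python job.py map | sort | python job.py reduce`.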
Scientific data sharing
I’ve been posting recently about some issues in scientific data management. One topic I haven’t addressed yet is policies around data sharing. Generally:
- Scientists, like other academics, have their research judged largely on the basis of their published papers.
- The data that scientists capture benefits their careers mainly by informing, and being used in, their published papers.
- Scientists are correspondingly uninterested in, if not actively opposed to, sharing their data with the rest of the world:
- Promptly (for the data they use to directly support their publications)
- Perhaps ever (for the rest of the data)
On the other hand, it’s blindingly obvious that the world as a whole would be better off with widespread scientific data sharing, provided that making data “free” doesn’t significantly undermine scientists’ incentives to capture it in the first place. And institutions such as funding agencies are taking note. Thus:
Scientific data management technology should be suitable for either of two scenarios:
- Data is widely shared among scientists.
- Data is jealously guarded by the scientists who first gather it.
Categories: Data warehousing, Scientific research | 7 Comments |
Oracle Exadata customers presenting at Oracle Open World
Greg Rahn tweeted a list of Exadata-focused sessions at Oracle Open World next week. As Oracle employees and supporters have been foreshadowing, there will be Exadata users and user-like folks presenting. I identified what look like half a dozen (not counting any who, for example, will make surprise appearances at keynote addresses), specifically: Read more
Categories: Data warehousing, Exadata, Market share and customer counts, Oracle, Teradata | 5 Comments |
Oracle and Vertica on compression and other physical data layout features
In my recent post on Exadata pricing, I highlighted the importance of Oracle’s compression figures to the discussion, and the uncertainty about same. This led to a Twitter discussion featuring Greg Rahn* of Oracle and Dave Menninger and Omer Trajman of Vertica. I also followed up with Omer on the phone. Read more
Categories: Columnar database management, Data models and architecture, Data warehousing, Database compression, Oracle, Theory and architecture, Vertica Systems | 14 Comments |
Oracle Exadata 2 capacity pricing
Summary of Oracle Exadata 2 capacity pricing
Analyzing Oracle Exadata pricing is always harder than one would first think. But I’ve finally gotten around to doing an Oracle Exadata 2 pricing spreadsheet. The main takeaways are:
- If we believe Oracle’s claims of 10X compression, Exadata 2 costs more per terabyte of user data than Netezza TwinFin — $22-26K/TB vs. TwinFin’s <$20K — but less than the Teradata 2550.
- These figures are highly sensitive to assumptions about Oracle’s hybrid columnar compression (see the back-of-envelope sketch at the end of this post).
- Similarly, if Netezza or Teradata were to significantly upgrade their own compression, the price comparison would look quite different.
- Options such as Data Mining or Oracle Spatial add 12% or so each to Exadata’s total system price.
Longer version
When Oracle introduced Exadata last year it was, well, expensive. Exadata 2 has now been announced, and it is significantly cheaper than Exadata 1 per terabyte of user data, based on:
- Similar overall pricing
- Twice the disk capacity
- Better compression
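To show how a per-terabyte figure of this kind is derived, and why it is so sensitive to the compression assumption, here is a back-of-envelope sketch. The system price, disk capacity, and usable-fraction inputs are hypothetical placeholders standing in for the spreadsheet's inputs, not Oracle's actual price list, and they are not chosen to reproduce the figures above:

```python
# Back-of-envelope sketch of appliance pricing per terabyte of user data.
# All inputs are hypothetical placeholders; the real analysis lives in the
# pricing spreadsheet referenced above.

def dollars_per_user_tb(total_system_price, raw_disk_tb, usable_fraction, compression_ratio):
    """Price per TB of user data = price / (raw disk * usable fraction * compression)."""
    user_data_tb = raw_disk_tb * usable_fraction * compression_ratio
    return total_system_price / user_data_tb

system_price = 4_000_000   # hypothetical all-in price: hardware plus software licenses
raw_disk_tb = 100          # hypothetical raw disk capacity of the rack
usable_fraction = 0.4      # hypothetical, after mirroring, temp space, etc.

# Sensitivity to the compression claim: the same configuration looks very
# different at a claimed 10X compression than at a more conservative 4X.
for compression in (10, 4):
    cost = dollars_per_user_tb(system_price, raw_disk_tb, usable_fraction, compression)
    print(f"{compression}X compression: ${cost:,.0f} per TB of user data")
```

The same arithmetic is why the comparison would look quite different if Netezza or Teradata significantly upgraded their own compression.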