Infobright notes
I had lunch w/ Bob Zurek and Susan Davis of Infobright today. This wasn’t primarily a briefing, but a few takeaways are:
- Infobright now has >100 paying customers.
- Typical database sizes range from the low hundreds of gigabytes to the low single digits of terabytes.
- Agile development is at or approaching two-week release cycles.
- Like Kickfire, Infobright has a multi-year deal with MySQL that insulates it against many potential Oracle/MySQL shenanigans.
- From an industry perspective, Infobright’s customer base sounds a lot like other vendors’:
- Data mart outsourcing/online analytics
- Log files for websites
- Telecommunications
- Financial services
- OEM, especially in the markets cited above
- “Hey, we’re beginning to see the occasional energy deal”
- A few random others
- Infobright is seeing some household-name customers, who surely own big-name analytic DBMS products, but who also have a policy that open source is the default choice: if open source can get the job done, the favored closed-source products aren’t used.
- Infobright has the usual open-source community story — lots of involvement and engagement in the forums, but contributions are limited mainly to connectivity, utility scripts, etc. (Maybe some national language translation too; I’m not sure.)
Greenplum is going hybrid columnar as well
Over the past summer, Vertica, VectorWise, and Oracle all announced flavors of hybrid row/columnar storage. Now it’s Greenplum’s turn. Greenplum is actually offering true columnar storage, as opposed to Oracle’s PAX-like scheme — and also as opposed to the kind of Frankencolumn storage Daniel Abadi decries. For example, you don’t have to do a join to retrieve multiple columns; you just ask for them and there they are. Similarly, Greenplum doesn’t maintain explicit row IDs – whether in row-oriented or column-oriented append-only storage – relying instead on block-level header information. Read more
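To make the “true columnar” point concrete, here is a minimal sketch (plain Python, purely illustrative rather than anything Greenplum ships) of the core idea: each column lives in its own sequence of blocks, and a row is reassembled purely by position, so fetching several columns requires neither a join nor a stored per-row ID.

```python
# Illustrative sketch only -- not Greenplum code. It mimics the idea that in a
# true column store, each column is kept as its own contiguous sequence, and a
# "row" is reassembled purely by position (block number + offset), so no
# explicit per-row ID column has to be stored or joined on.

from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # rows per block, kept tiny for readability

class ColumnStore:
    def __init__(self, column_names: List[str]):
        # one list of blocks per column; a block is just a list of values
        self.columns: Dict[str, List[List]] = {name: [[]] for name in column_names}

    def append_row(self, row: Dict[str, object]) -> None:
        # append-only: each value goes to the tail block of its own column
        for name, blocks in self.columns.items():
            if len(blocks[-1]) >= BLOCK_SIZE:
                blocks.append([])          # start a new block; position info stays implicit
            blocks[-1].append(row[name])

    def fetch(self, row_number: int, wanted: List[str]) -> Tuple:
        # multi-column retrieval is just positional lookup per column -- no join
        block, offset = divmod(row_number, BLOCK_SIZE)
        return tuple(self.columns[name][block][offset] for name in wanted)

store = ColumnStore(["customer", "region", "revenue"])
store.append_row({"customer": "acme", "region": "east", "revenue": 100})
store.append_row({"customer": "beta", "region": "west", "revenue": 250})
print(store.fetch(1, ["customer", "revenue"]))   # ('beta', 250)
```

The real system obviously adds compression, block headers, and much else, but the no-join, no-row-ID retrieval path is the essence of the distinction being drawn.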
Categories: Analytic technologies, Columnar database management, Data warehousing, Database compression, Greenplum, Theory and architecture | 12 Comments |
How 30+ enterprises are using Hadoop
MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:
- Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
- Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
- Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
- Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
- Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
- Application areas mentioned — and these overlap in some cases — include:
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance
Scientific data sharing
I’ve been posting recently about some issues in scientific data management. One topic I haven’t addressed yet is policies around data sharing. Generally:
- Scientists, like other academics, have their research judged largely on the basis of their published papers.
- The data that scientists capture benefits their careers mainly by informing, and being used in, their published papers.
- Scientists are correspondingly uninterested in, if not actively opposed to, sharing their data with the rest of the world:
- Promptly (for the data they use to directly support their publications)
- Perhaps ever (for the rest of the data)
On the other hand, it’s blindingly obvious that the world as a whole would be better off with widespread scientific data sharing, provided that making data “free” doesn’t significantly undermine scientists’ incentives to capture it in the first place. And institutions such as funding agencies are taking note. Thus:
Scientific data management technology should be suitable for either of two scenarios:
- Data is widely shared among scientists.
- Data is jealously guarded by the scientists who first gather it.
Categories: Data warehousing, Scientific research | 7 Comments |
I have some presentations coming up (all on October Thursdays)
On Thursday, October 15, at two different times (10:00 am and 1:00 pm Eastern time), I’ll be giving a webinar for Aster Data on MapReduce. The content is very much a work in progress, but it definitely will:
- Be overviewy in nature
- Emphasize SQL/MapReduce integration
Then, on the evening of Thursday, October 22, there’s something called the Boston Big Data Summit, in Waltham, where “Big Data” evidently is to be construed as anything from a few terabytes on up. (Things are smaller in the Northeast than in California …) It’s being put together by Amrith Kumar (who I don’t really know) and Bob Zurek (who everybody knows). This is the inaugural meeting. It seems I’m both giving the keynote and running the subsequent panel, one of whose participants will be Ellen Rubin. Read more
Categories: Analytic technologies, Aster Data, Cloud computing, MapReduce, Presentations | 4 Comments |
Oracle Exadata customers presenting at Oracle Open World
Greg Rahn tweeted a list of Exadata-focused sessions at Oracle Open World next week. As Oracle employees and supporters have been foreshadowing, there will be Exadata users and user-like folks presenting. I identified what look like half a dozen (not counting any who, for example, will make surprise appearances at keynote addresses), specifically: Read more
Categories: Data warehousing, Exadata, Market share and customer counts, Oracle, Teradata | 5 Comments |
Oracle and Vertica on compression and other physical data layout features
In my recent post on Exadata pricing, I highlighted the importance of Oracle’s compression figures to the discussion, and the uncertainty about same. This led to a Twitter discussion featuring Greg Rahn* of Oracle and Dave Menninger and Omer Trajman of Vertica. I also followed up with Omer on the phone. Read more
Categories: Columnar database management, Data models and architecture, Data warehousing, Database compression, Oracle, Theory and architecture, Vertica Systems | 14 Comments |
Oracle’s version of “actually, we’ve been doing MapReduce all along too”
In a recent blog post, Jean-Pierre Dijcks of Oracle makes the argument that Oracle has supported MapReduce all along, essentially because:
- You can do lots of procedural logic in the Oracle database, in a broad choice of languages, so in particular you can do Map steps.
- You can do lots of procedural logic in the Oracle database, in a broad choice of languages, so in particular you can do Reduce steps.
- Oracle offers a mechanism for parallelizing procedural logic.
Oracle doesn’t appear to have an explicit Map/Reduce programming interface, but I wouldn’t be surprised if Oracle Consulting cranked one out at some point to meet customer demand.
The post goes on to claim the usual in-database MapReduce benefit of avoiding the overhead of intermediate query result materialization. Presumably, then, Oracle’s quasi-MapReduce also lacks the step-by-step query fault-tolerance that materializing intermediate results gives Hadoop-style MapReduce.
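For readers who want the pattern itself spelled out, here is a generic sketch of what Map and Reduce steps are, in plain Python and with nothing Oracle-specific assumed: a Map step emits key/value pairs, the intermediate pairs are grouped by key, and a Reduce step aggregates each group.

```python
# A generic illustration of the MapReduce pattern itself (plain Python, nothing
# Oracle-specific): a Map step that emits key/value pairs, a shuffle that groups
# them by key, and a Reduce step that aggregates each group. The point is only
# to show what "Map" and "Reduce" steps are; whether the intermediate pairs are
# materialized to disk (Hadoop's approach, which buys fault-tolerance) or
# streamed in memory is a separate question.

from collections import defaultdict
from typing import Iterable, Tuple

def map_step(line: str) -> Iterable[Tuple[str, int]]:
    # emit (word, 1) for every word in the input record
    for word in line.split():
        yield (word.lower(), 1)

def reduce_step(key: str, values: Iterable[int]) -> Tuple[str, int]:
    # aggregate all values emitted for one key
    return (key, sum(values))

def run_mapreduce(records: Iterable[str]):
    groups = defaultdict(list)
    for record in records:                 # Map phase
        for key, value in map_step(record):
            groups[key].append(value)      # shuffle: group intermediate pairs by key
    return [reduce_step(k, vs) for k, vs in groups.items()]   # Reduce phase

print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
# [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]
```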
Categories: Analytic technologies, MapReduce, Oracle, Parallelization | 1 Comment |
Oracle Exadata 2 capacity pricing
Summary of Oracle Exadata 2 capacity pricing
Analyzing Oracle Exadata pricing is always harder than one would first think. But I’ve finally gotten around to doing an Oracle Exadata 2 pricing spreadsheet. The main takeaways are:
- If we believe Oracle’s claims of 10X compression, Exadata 2 costs more per terabyte of user data than Netezza TwinFin — $22-26K/TB vs. TwinFin’s <$20K — but less than the Teradata 2550.
- These figures are highly sensitive to assumptions about Oracle’s hybrid columnar compression, as the sketch below spells out.
- Similarly, if Netezza or Teradata were to significantly upgrade their own compression, the price comparison would look quite different.
- Options such as Data Mining or Oracle Spatial add 12% or so each to Exadata’s total system price.
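Here is the sketch referenced above: a one-formula illustration of that sensitivity, taking roughly the midpoint of the $22-26K/TB range as the baseline at Oracle’s claimed 10X compression. Everything else follows from simple proportionality, since user-data capacity scales linearly with whatever compression ratio you believe.

```python
# Back-of-the-envelope arithmetic only, to show why the per-terabyte figures
# swing with the compression assumption. The 10X baseline below is roughly the
# midpoint of the $22-26K/TB range quoted above; everything else is simple
# proportionality, since user-data capacity scales linearly with the assumed
# compression ratio.

BASELINE_PRICE_PER_TB = 24_000   # $/TB of user data, assuming 10X compression
BASELINE_COMPRESSION = 10

def price_per_tb(assumed_compression: float) -> float:
    # same hardware and software price, so $/TB of user data is inversely
    # proportional to the compression ratio you believe
    return BASELINE_PRICE_PER_TB * BASELINE_COMPRESSION / assumed_compression

for ratio in (5, 10, 20):
    print(f"{ratio:>2}X compression -> ${price_per_tb(ratio):,.0f}/TB of user data")
# 5X compression -> $48,000/TB ; 10X -> $24,000/TB ; 20X -> $12,000/TB
```

The same proportionality is why a significant compression upgrade from Netezza or Teradata would change the comparison just as sharply in the other direction.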
Longer version
When Oracle introduced Exadata last year it was, well, expensive. Exadata 2 has now been announced, and it is significantly cheaper than Exadata 1 per terabyte of user data, based on:
- Similar overall pricing
- Twice the disk capacity
- Better compression
Categories: Analytic technologies, Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Exadata, Netezza, Oracle, Pricing, Teradata | 13 Comments |
Jacek Becla on issues in scientific data management
Just as Martin Kersten did, Jacek Becla emailed a response to my post on issues in scientific data management. With his permission, I’ve lightly edited his email too, and am posting it below, with some interspersed comments of my own. Read more