In a short October, 2011 post about Datameer, I wrote:
Datameer is designed to let you do simple stuff on large amounts of data, where “large amounts of data” typically means data in Hadoop, and “simple stuff” includes basic versions of a spreadsheet, of BI, and of ETL (Extract/Transform/Load, without much in the way of T).
That’s all still mainly true, although with the recent Datameer 2.0:
- You can run Datameer and the underlying Hadoop on a desktop or workgroup server.
- There are some infographic-style pretty-picture-drawing capabilities, which will surely delight those who like vector-based HTML5 pictures of coffee cups, saucers and macaroons.
- No doubt Datameer has been generally enhanced on multiple fronts.
In essence, Datameer has two positionings.
- One is “OK, you’ve got Hadoop — now wouldn’t you like to do something useful with it?” That can include both business intelligence and ETL.
- Beyond that, Datameer founder/CEO Stefan Groschupf’s core argument is that schema-on-read is really, really useful, even at the cost of absorbing a potentially large performance hit. In other words, he’s making a case for a form of non-relational BI.
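To make the schema-on-read idea concrete, here is a minimal Python sketch. It is not Datameer's implementation or API; the record layout, the field names, and the `read_with_schema` helper are all hypothetical, chosen only to illustrate the general technique. The raw records are stored exactly as they arrived, and each query supplies its own schema when it reads them.

```python
import json

# Raw records are kept exactly as they arrived; nothing was validated on write.
# (Hypothetical sample data, for illustration only.)
RAW_LINES = [
    '{"user": "alice", "amount": "12.50", "country": "US"}',
    '{"user": "bob", "amount": "7"}',
    '{"user": "carol", "amount": "n/a", "referrer": "ad-42"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time (hypothetical helper).

    `schema` maps a field name to a converter function. Missing or
    unparseable values become None instead of rejecting the record.
    """
    for line in lines:
        record = json.loads(line)  # parsing cost is paid on every read
        row = {}
        for field, convert in schema.items():
            try:
                row[field] = convert(record[field])
            except (KeyError, ValueError):
                row[field] = None
        yield row

# Two different "views" over the same raw data, each chosen at query time.
spend_view = list(read_with_schema(RAW_LINES, {"user": str, "amount": float}))
geo_view = list(read_with_schema(RAW_LINES, {"user": str, "country": str}))

# Aggregate over whatever parsed cleanly.
total_spend = sum(r["amount"] for r in spend_view if r["amount"] is not None)
print(total_spend)
```

The tradeoff shows up even in a toy like this: nobody had to agree on columns before loading the data, but every read re-parses and re-coerces the raw records, which is where the performance hit comes from.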
The single-server (desktop or workgroup) Datameer story is something like:
- Stefan used to work on Nutch, which is something of a single-server predecessor to Hadoop. So it seems natural to him to run Hadoop on a single server. Besides, he likes its sequential file access speed.
- When you get single-server Datameer, Hadoop is bundled into the same installation.
- That’s Apache Hadoop 1.0, with no back-porting, mix-matching of sub-projects, or anything — just the purest of pure Apache Hadoop.
What you get in such a package is a competitor to Excel-based business intelligence, which in particular is meant to work well with data sets that nobody’s ever bothered putting into any kind of relational database.
Notes on Datameer’s data integration story (for any of its configurations) include:
- Datameer wants you to import data into Hadoop and operate on it there. Theoretically, Datameer also lets you operate on data in place, for example doing joins in an RDBMS in the usual way. But that’s not recommended.
- All Datameer connectors are bidirectional.
- Datameer’s metadata integration story seems pretty basic, focused on column names and so on.
- However, there’s more ETL-related metadata somewhere in the picture. For example, Stefan showed me some nice-looking live data lineage pictures (flowchart style).
- Data profiling and histograms are in the mix somewhere too.
- One of the Datameer application areas Stefan cited was data cleaning.
And of course, once the data’s in Hadoop, you can do all sorts of transformations on it, including many that would be impractical in any conventional ETL tool.
And finally, here are a few numbers, customer tidbits, and so on:
- Datameer has “under 100” enterprise customers, including:
  - The 3 largest banks in the world.
  - The 2 largest credit card companies in the world.
  - The US government.
  - The London Olympics.
  - At least 4 of the Fortune 10.
- Datameer customers seem to be concentrated in financial services, retailing, and web/internet businesses.
- Application areas Stefan mentioned included:
  - Data cleaning, aggregation, and so on.
  - Multiple kinds of anti-fraud.
  - Various forms of security.
- Datameer has >60 employees, up from >40 last time I asked.
- Datameer has had 1000s of downloads.
- Datameer has a financing round underway.
Related links:
- An early version of ETL for Hadoop was discussed in my August, 2010 post about Pentaho’s Hadoop ETL strategy and the associated comment thread. Note that Pentaho is, of course, also a BI vendor itself.
- I returned to the subject of Hadoop-based data integration in a short post in May, 2011.
- HCatalog is a big part of my ongoing research into Hadoop ETL.
- I’m a fan of dynamic schemas too.