The most straightforward approach to the applications business is:
- Take general-purpose technology and think through how to apply it to a specific application domain.
- Produce packaged application software accordingly.
However, this strategy is not as successful in analytics as in the transactional world, for two main reasons:
- Analytic applications of that kind are rarely complete.
- Incomplete applications rarely sell well.
I first realized all this about a decade ago, after Henry Morris coined the term analytic applications and business intelligence companies thought it was their future. In particular, when Dave Kellogg ran marketing for Business Objects, he rattled off an argument to the effect that Business Objects had generated more analytic app revenue over the lifetime of the company than Cognos had. I retorted, with only mild hyperbole, that the lifetime numbers he was citing amounted to “a bad week for SAP”. Somewhat hoist by his own petard, Dave quickly conceded that he agreed with my skepticism, and we changed the subject accordingly.
Reasons that analytic applications are commonly less complete than the transactional kind include: Read more
|Categories: Business intelligence, Business Objects, Data mart outsourcing, Investment research and trading, Log analysis, Metamarkets and Druid, Oracle, SAP AG, SAS Institute, Web analytics, WibiData||16 Comments|
My clients at Odiago, vendors of WibiData, have changed their company name simply to WibiData. Even better, they blogged with more detail as to how WibiData works, in what is essentially a follow-on to my original WibiData post last October. Among other virtues, WibiData turns out to be a poster child for my views on derived data and the corresponding schema evolution.
Interesting quotes include:
WibiData is designed to store … transactional data side-by-side with profile and other derived data attributes.
… the ability to add new ad-hoc columns to a table enables more flexible analysis: output data that is the result of one analytic pipeline is stored adjacent to its input data, meaning that you can easily use this as input to second- or third-order derived data as well.
schemas can vary over time; you can easily add a field to a record, or delete a field. … But even though you start collecting that new data, your existing analysis pipelines can treat records like they always did; programs that don’t yet know about the new cookie are still compatible with both the old records already collected, and the new records with the additional field. New programs fill in default values for old data recorded before a field was added, applying the new schema at read time.
schemas for every column are stored in a data dictionary that matches column names with their schemas, as well as human-readable descriptions of the data.
Interesting aspects of the post that don’t lend themselves as well to being excerpted include:
- How the Produce-Gather “analysis calculus” — i.e. framework — works.
- How this all ties into Apache projects (and sub-projects) such as Hadoop, HBase, and Avro.
|Categories: Data models and architecture, Data warehousing, Derived data, NoSQL, WibiData||3 Comments|
Christophe Bisciglia and Aaron Kimball have a new company.
- It’s called Odiago, and is one of my gratifyingly more numerous tiny clients.
- Odiago’s product line is called WibiData, after the justly popular We Be Sushi restaurants.
- We’ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on TechCrunch. But this is the place for — well, for the tech crunch.
WibiData is designed for management of, investigative analytics on, and operational analytics on consumer internet data, the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop,* is a data management and analytic execution layer. That’s where the secret sauce resides. Also included are:
- REST APIs for interactive access.
- Import/export tools, including JDBC access.
- Management tools.
- Analytic libraries — data mining, predictive analytics, machine learning, and so on.
The whole thing is in beta, with about three (paying) beta customers.
*And Avro and so on.
The core ideas of WibiData include:
- ALL data pertaining to a single user (or mobile device) is kept in a single, possibly very long, HBase row.
- There are two primary operators in WibiData, Produce and Gather.
- Produce operates on single rows. It can operate on one row at HBase speed (milliseconds) if you need to inform an interactive user response. Or it can operate on the whole database in batch via Hadoop MapReduce.
- It is reasonable to think of Produce as mainly doing two things. One is the aforementioned serving of data out of WibiData into interactive applications. The other is scoring, classifying, recommending, etc. on individual users (i.e. rows), in line with an analytic model.
- Gather typically operates on all your rows at once, and emits suitable input for a MapReduce Reduce step. It is reasonable to think of Gather as being a key cog in the training of analytic models.
- HBase schema management is done at the WibiData system level, not directly in applications. There’s a WibiData HBase data dictionary, powered by a set of system tables, that specifies cell data types/record types and, in effect, primitive schemas.
|Categories: Data models and architecture, Hadoop, HBase, NoSQL, Predictive modeling and advanced analytics, Web analytics, WibiData||14 Comments|