I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:
1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:
- Storing different parts of a database in different geographies, often for reasons of regulatory compliance around data privacy.
- Replicating an entire database into different geographies, often for reasons of latency and/or availability/disaster recovery.
The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.
What made me think of this was a phone call with MongoDB in which I learned that the limit on the number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.
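To make the full-replication case concrete, here's a minimal sketch, not anything MongoDB or a client described to me, of a geo-replicated replica set with a clear single master: one region holds the primary (and hence all first writes), while other regions hold secondaries for local reads and disaster recovery. Hostnames, regions, and tags are invented.

```python
# Hypothetical sketch of a geo-replicated MongoDB replica set, via pymongo.
from pymongo import MongoClient

config = {
    "_id": "rs0",
    "members": [
        # Highest priority keeps the primary, and therefore all first writes, in one region.
        {"_id": 0, "host": "us-east.example.com:27017", "priority": 2,
         "tags": {"region": "us"}},
        # Secondaries elsewhere serve low-latency local reads and act as disaster-recovery copies.
        {"_id": 1, "host": "eu-west.example.com:27017", "priority": 1,
         "tags": {"region": "eu"}},
        # Priority 0 means this member can never be elected primary.
        {"_id": 2, "host": "ap-south.example.com:27017", "priority": 0,
         "tags": {"region": "apac"}},
    ],
}

client = MongoClient("us-east.example.com", 27017)
client.admin.command("replSetInitiate", config)
```

An application in Europe could then read from its nearby secondary via a tag-aware or "nearest" read preference, trading a little staleness for latency, while writes continue to land on the single primary. The active-active case is the harder one, because conflicting first writes then have to be reconciled.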
2. Three years ago I posted about agile (predictive) analytics. One of the points was:
… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.
Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
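For readers who haven't run into bandit testing, here's a hedged sketch of one common variant, Thompson sampling over a few hypothetical offers; the offer names and the Bernoulli conversion model are my assumptions, not anything WibiData described.

```python
# Illustrative Thompson-sampling bandit for choosing among offers.
import random

offers = ["offer_a", "offer_b", "offer_c"]
successes = {o: 1 for o in offers}  # Beta(1, 1) priors over conversion rates
failures = {o: 1 for o in offers}

def choose_offer():
    # Sample a plausible conversion rate for each offer and show the best draw,
    # so traffic shifts toward apparent winners while still exploring the rest.
    draws = {o: random.betavariate(successes[o], failures[o]) for o in offers}
    return max(draws, key=draws.get)

def record_outcome(offer, converted):
    # Every observed outcome updates the posterior immediately, which is the
    # "new information" feedback loop the quoted passage is about.
    if converted:
        successes[offer] += 1
    else:
        failures[offer] += 1
```

The appeal is exactly the point quoted above: each change in what you show customers generates information, and a bandit folds that information back into the next decision right away.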
3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with several others.
In a general pontification on positioning, I wrote:
every product in a category is positioned along the same set of attributes,
and went on to suggest that summary attributes were more important than picky detailed ones. So how does that play out for investigative analytics?
First, summary attributes that matter for almost any kind of enterprise software include:
- Performance and scalability. I write about analytic performance and scalability a lot. Usually that’s in the context of analytic DBMS, but it also arises in analytic stacks such as Platfora, Metamarkets or even QlikView, and also in the challenges of making predictive modeling scale.
- Reliability, availability and security.* This is more crucial for short-request applications than analytic ones, but even your analytic systems shouldn’t leak data or crash.
- Goodness of fit with legacy systems. I hate that one, because enterprises often sacrifice way too much in favor of that benefit.
- Price. Duh.
*I picked up that phrase when — abbreviated as RAS — it was used to characterize the emphasis for Oracle 8. I like it better than a general and ambiguous concept of “enterprise-ready”.
The reason I’m writing this post, however, is to call out two summary attributes of special importance in investigative analytics, which regrettably often conflict with each other, namely:
- Agility. People don’t want to submit requests for reports or statistical analyses; they want to get answers as soon as the questions come to mind.
- Completeness of feature set — for a particular use case, that is. There’s no such thing as an investigative analytics offering with a feature set that’s close to complete for all purposes; even SAS, IBM and other behemoths fall short.
Much of what I work on boils down to those two subjects.
First, some quick history.
- I first heard of KXEN 7-8 years ago from Roman Bukary, then of SAP. He positioned KXEN as an easy-to-embed predictive modeling tool, which was getting various interesting partnerships and OEM deals.
- Returning to those near-roots, KXEN is being bought by SAP, with the deal expected to close in Q4.
- I say “near roots” because KXEN’s original story had something to do with SVMs (Support Vector Machines).
- But that was already old news back in 2006, and KXEN had pivoted to a simpler and more automated modeling approach. Presumably, this ease of modeling was part of the reason for KXEN’s OEM/partnership appeal.
However, I don’t want to give the impression that KXEN is the second coming of Crystal Reports. Most of what I heard about KXEN’s partnership chops, after Roman’s original heads-up, came from Teradata. Even KXEN itself didn’t seem to see that as a major part of their strategy.
And by the way, KXEN is yet another example of my observation that fancy math rarely drives great enterprise software success.
KXEN’s most recent strategies are perhaps best described by contrasting it to the vastly larger SAS.
My July 2 comments on predictive modeling were far from my best work. Let’s try again.
1. Predictive analytics has two very different aspects.
Developing models, aka “modeling”:
- Is a big part of investigative analytics.
- May or may not be difficult to parallelize and/or integrate into an analytic RDBMS.
- May or may not require use of your whole database.
- Generally is done by humans.
- Often is done by people with special skills, e.g. “statisticians” or “data scientists”.
More precisely, some modeling algorithms are straightforward to parallelize and/or integrate into RDBMS, but many are not.
Using models, most commonly:
- Is done by machines …
- … that “score” data according to the models.
- May be done in batch or at run-time.
- Is embarrassingly parallel, and is much more commonly integrated into analytic RDBMS than modeling is (a sketch of the modeling/scoring split follows this list).
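To illustrate the split, here's a minimal sketch using scikit-learn and pandas; the dataset, column names, and file layout are all hypothetical.

```python
# Hypothetical sketch of the modeling-vs-scoring split.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Modeling: typically done once, by a human analyst, often on a sample.
train = pd.read_csv("training_sample.csv")
features = ["recency", "frequency", "spend"]
model = LogisticRegression()
model.fit(train[features], train["churned"])

# Scoring: done by machines, in batch, and embarrassingly parallel;
# each chunk (or each partition inside an analytic RDBMS) scores independently.
for chunk in pd.read_csv("all_customers.csv", chunksize=100_000):
    chunk["churn_score"] = model.predict_proba(chunk[features])[:, 1]
    chunk.to_csv("scored_customers.csv", mode="a", header=False, index=False)
```

The fit step is the part that may or may not parallelize well; the scoring loop is the part that almost always does, which is why scoring is the side more commonly pushed into analytic RDBMS.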
2. Some people think that all a modeler needs are a few basic algorithms. (That’s why, for example, analytic RDBMS vendors are proud of integrating a few specific modeling routines.) Other people think that’s ridiculous. Depending on use case, either group can be right.
3. If adoption of DBMS-integrated modeling is high, I haven’t noticed.
Last November, I wrote two posts on agile predictive analytics. It’s time to return to the subject. I’m used to KXEN talking about the ability to do predictive modeling, very quickly, perhaps without professional statisticians; that’s the core of what KXEN does. But I was surprised when Revolution Analytics told me a similar story, based on a different approach, because ordinarily that’s not how R is used at all.
Ultimately, there seem to be three reasons why you’d want quick turnaround on your predictive modeling.
I had one of “those” trips last week:
- 20 meetings, a number of them very multi-hour.
- A broken laptop.
- Flights that arrived 10:30ish Sunday night and left 7:00 Saturday morning.
So please pardon me if things are a bit disjointed …
I’ve argued for a while that:
- All human-generated data should be retained.
- The more important kinds of machine-generated data should be retained as well.
- Raw data isn’t enough; it’s really important to store derived data as well.
Here are a few notes on the derived data trend.
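As one hedged illustration of the raw-vs-derived distinction, here's a sketch that retains raw clickstream events and also stores a session-level table derived from them; the column names and the 30-minute session gap are assumptions for the example.

```python
# Hypothetical sketch: keep raw events, and store derived session-level data too.
import pandas as pd

raw = pd.read_csv("clickstream_raw.csv", parse_dates=["timestamp"])
raw = raw.sort_values(["user_id", "timestamp"])

# Derive sessions: more than 30 idle minutes starts a new session for a user.
idle_gap = raw.groupby("user_id")["timestamp"].diff() > pd.Timedelta(minutes=30)
raw["session_id"] = idle_gap.groupby(raw["user_id"]).cumsum()

# The derived, session-level table is stored alongside the raw events.
sessions = raw.groupby(["user_id", "session_id"]).agg(
    session_start=("timestamp", "min"),
    session_end=("timestamp", "max"),
    page_views=("timestamp", "count"),
)
sessions.to_csv("clickstream_sessions.csv")
```

Sessionization is just one example; model scores, aggregates, and cleaned or enriched versions of a raw feed are all derived data in the same sense.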
- This is a list of Monash Advantage members.
- All our vendor clients are Monash Advantage members, unless …
- … we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
- We do not usually disclose our user clients.
- We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
- Excluded from this round of disclosure is one vendor I have never written about.
- Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.
For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me.
A reporter interviewed me via IM about how CIOs should view SAS Institute and its products. Naturally, I have edited my comments (lightly) into a blog post. They turned out to be clustered into three groups, as follows:
- SAS faces a number of challenges, not unlike those faced by other high-priced legacy technology vendors.
- It is used by organizations that have large budgets to pay for the product and to pay people to be expert on the product’s intricacies.
- SAS has not integrated with scale-out analytic DBMS technologies as well or quickly as had been hoped, or as earlier marketing suggested was likely.
- SAS has not been strong in helping its users do agile predictive analytics.
- SAS’ strengths are concentrated in product breadth:
- Lots of statistical algorithms.
- Various vertical products that make the modeling techniques more accessible in specific application domains.
- Various approaches to engineering for scalability — no one of those has been a table-thumping success to date, but SAS has the resources to keep trying.
- Some level of integration with its own business intelligence and text analytics products.
- For any particular use case, the burden of proof is on SAS alternatives to show that they have enough pieces in the toolkit to meet the needs.
- SPSS (now owned by IBM) also has legacy issues.
- KXEN is focused on marketing use cases.
- Mahout has been one of the less successful Hadoop-related open source projects.
- R-based technology is still maturing.
- The modeling capabilities (as opposed to just scoring) that are bundled into RDBMS and well-parallelized tend to be pretty limited. Apparent exceptions tend to just be R repackaged.
I frequently badger my clients to tell their story in the form of a company blog, where they can say what needs saying without being restricted by the rules of other formats. KXEN actually listened, and put up a pair of CTO posts that make the company story a lot clearer.
Excerpts from the first post include (with minor edits for formatting, including added emphasis):
Back in 1995, Vladimir Vapnik … changed the machine learning game with his new ‘Statistical Learning Theory’: he provided the machine learning guys with a mathematical framework that allowed them finally to understand, at the core, why some techniques were working and some others were not. All of a sudden, a new realm of algorithms could be written that would use mathematical equations instead of engineering data science tricks (don’t get me wrong here: I am an engineer at heart and I know the value of “tricks,” but tricks cannot overcome the drawbacks of a bad mathematical framework). Here was a foundation for automated data mining techniques that would perform as well as the best data scientists deploying these tricks. Luck is not enough though; it was because we knew a lot about statistics and machine learning that we were able to decipher the nuggets of gold in Vladimir’s theory.