Analysis of data mining powerhouse SAS, and the especially the relationship between SAS’s data mining products and various database management systems. Related subjects include:
1. There are multiple ways in which analytics is inherently modular. For example:
- Business intelligence tools can reasonably be viewed as application development tools. But the “applications” may be developed one report at a time.
- The point of a predictive modeling exercise may be to develop a single scoring function that is then integrated into a pre-existing operational application.
- Conversely, a recommendation-driven website may be developed a few pages — and hence also a few recommendations — at a time.
Also, analytics is inherently iterative.
- Everything I just called “modular” can reasonably be called “iterative” as well.
- So can any work process of the nature “OK, we got an insight. Let’s pursue it and get more accuracy.”
If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need.
2. In 2011, I wrote, in the context of agile predictive analytics, that
… the “business analyst” role should be expanded beyond BI and planning to include lightweight predictive analytics as well.
I gather that a similar point is at the heart of Gartner’s new term citizen data scientist. I am told that the term resonates with at least some enterprises. Read more
|Categories: Business intelligence, Data warehousing, Datameer, Hadoop, Log analysis, Oracle, Platfora, Predictive modeling and advanced analytics, SAS Institute, Software as a Service (SaaS), Tableau Software, Web analytics||2 Comments|
In a general pontification on positioning, I wrote:
every product in a category is positioned along the same set of attributes,
and went on to suggest that summary attributes were more important than picky detailed ones. So how does that play out for investigative analytics?
First, summary attributes that matter for almost any kind of enterprise software include:
- Performance and scalability. I write about analytic performance and scalability a lot. Usually that’s in the context of analytic DBMS, but it also arises in analytic stacks such as Platfora, Metamarkets or even QlikView, and also in the challenges of making predictive modeling scale.
- Reliability, availability and security.* This is more crucial for short-request applications than analytic ones, but even your analytic systems shouldn’t leak data or crash.
- Goodness of fit with legacy systems. I hate that one, because enterprises often sacrifice way too much in favor of that benefit.
- Price. Duh.
*I picked up that phrase when — abbreviated as RAS — it was used to characterize the emphasis for Oracle 8. I like it better than a general and ambiguous concept of “enterprise-ready”.
The reason I’m writing this post, however, is to call out two summary attributes of special importance in investigative analytics — which regrettably which often conflict with each other — namely:
- Agility. People don’t want to submit requests for reports or statistical analyses; they want to get answers as soon as the questions come to mind.
- Completeness of feature set — for a particular use case, that is. There’s no such thing as an investigative analytics offering with a feature set that’s close to complete for all purposes; even SAS, IBM and other behemoths fall short.
Much of what I work on boils down to those two subjects. For example: Read more
|Categories: Aster Data, Business intelligence, Data warehousing, KXEN, Predictive modeling and advanced analytics, SAS Institute, Teradata||10 Comments|
I talked with Teradata about a bunch of stuff yesterday, including this week’s announcements in in-database predictive modeling. The specific news was about partnerships with Fuzzy Logix and Revolution Analytics. But what I found more interesting was the surrounding discussion. In a nutshell:
- Teradata is finally seeing substantial interest in in-database modeling, rather than just in-database scoring (which has been important for years) and in-database data preparation (which is a lot like ELT — Extract/Load/transform).
- Teradata is seeing substantial interest in R.
- It seems as if similar groups of customers are interested in both parts of that, such as:
This is the strongest statement of perceived demand for in-database modeling I’ve heard. (Compare Point #3 of my July predictive modeling post.) And fits with what I’ve been hearing about R.
|Categories: EAI, EII, ETL, ELT, ETLT, Parallelization, Predictive modeling and advanced analytics, Revolution Analytics, SAS Institute, Telecommunications, Teradata||1 Comment|
First, some quick history.
- I first heard of KXEN 7-8 years ago from Roman Bukary, then of SAP. He positioned KXEN as an easy-to-embed predictive modeling tool, which was getting various interesting partnerships and OEM deals.
- Returning those near-roots, KXEN is being bought (Q4 expected close) by SAP.
- I say “near roots” because KXEN’s original story had something to do with SVMs (Support Vector Machines).
- But that was already old news back in 2006, and KXEN had pivoted to a simpler and more automated modeling approach. Presumably, this ease of modeling was part of the reason for KXEN’s OEM/partnership appeal.
However, I don’t want to give the impression that KXEN is the second coming of Crystal Reports. Most of what I heard about KXEN’s partnership chops, after Roman’s original heads-up, came from Teradata. Even KXEN itself didn’t seem to see that as a major part of their strategy.
And by the way, KXEN is yet another example of my observation that fancy math rarely drives great enterprise software success.
KXEN’s most recent strategies are perhaps best described by contrasting it to the vastly larger SAS. Read more
When we scheduled a call to talk about Sentry, Cloudera’s Charles Zedlewski and I found time to discuss other stuff as well. One interesting part of our discussion was around the processing “frameworks” Cloudera sees as most important.
- The four biggies are:
- MapReduce. Duh.
- SQL, specifically Impala. This is as opposed to the uneasy Hive/MapReduce layering.
- “Math” , which seems to mainly be through partnerships with SAS and Revolution Analytics. I don’t know a lot about how these work, but I presume they bypass MapReduce, in which case I could imagine them greatly outperforming Mahout.
- Stream processing (Storm) is next in line.
- Graph — e.g. Giraph — rises to at least the proof-of-concept level. Again, the hope would be that this well outperforms graph-on-MapReduce.
- Charles is also seeing at least POC interest in Spark.
- But MPI (Message Passing Interface) on Hadoop isn’t going anywhere fast, except to the extent it’s baked into SAS or other “math” frameworks. Generic MPI use cases evidently turn out to be a bad fit for Hadoop, due to factors such as:
- Low data volumes.
- Latencies in various parts of the system
HBase was artificially omitted from this “frameworks” discussion because Cloudera sees it as a little bit more of a “storage” system than a processing one.
Another good subject was offloading work to Hadoop, in a couple different senses of “offload”: Read more
My July 2 comments on predictive modeling were far from my best work. Let’s try again.
1. Predictive analytics has two very different aspects.
Developing models, aka “modeling”:
- Is a big part of investigative analytics.
- May or may not be difficult to parallelize and/or integrate into an analytic RDBMS.
- May or may not require use of your whole database.
- Generally is done by humans.
- Often is done by people with special skills, e.g. “statisticians” or “data scientists”.
More precisely, some modeling algorithms are straightforward to parallelize and/or integrate into RDBMS, but many are not.
Using models, most commonly:
- Is done by machines …
- … that “score” data according to the models.
- May be done in batch or at run-time.
- Is embarrassingly parallel, and is much more commonly integrated into analytic RDBMS than modeling is.
2. Some people think that all a modeler needs are a few basic algorithms. (That’s why, for example, analytic RDBMS vendors are proud of integrating a few specific modeling routines.) Other people think that’s ridiculous. Depending on use case, either group can be right.
3. If adoption of DBMS-integrated modeling is high, I haven’t noticed.
|Categories: Ayasdi, Data warehousing, Hadoop, Health care, IBM and DB2, KXEN, Predictive modeling and advanced analytics, SAS Institute||4 Comments|
Teradata is announcing its new high-end systems, the Teradata 6700 series. Notes on that include:
- Teradata tends to get 35-55% (roughly speaking) annual performance improvements, as measured by its internal blended measure Tperf. A big part of this is exploiting new-generation Intel processors.
- This year the figure is around 40%.
- The 6700 is based on Intel’s Sandy Bridge.
- Teradata previously told me that Ivy Bridge — the next one after Sandy Bridge — could offer a performance “discontinuity”. So, while this is just a guess, I expect that next year’s Teradata performance improvement will beat this year’s.
- Teradata has now largely switched over to InfiniBand.
Teradata is also talking about data integration and best-of-breed systems, with buzzwords such as:
- Teradata Unified Data Architecture.
- Fabric-based computing, even though this isn’t really about storage.
- Teradata SQL-H.
|Categories: Data integration and middleware, Data warehouse appliances, Data warehousing, Pricing, SAS Institute, Teradata||3 Comments|
The 2011/2012 Gartner Magic Quadrant for Business Intelligence Platforms — company-by-company comments
This is one of a series of posts on business intelligence and related analytic technology subjects, keying off the 2011/2012 version of the Gartner Magic Quadrant for Business Intelligence Platforms. The four posts in the series cover:
- Overview comments about the 2011/2012 Gartner Magic Quadrant for Business Intelligence Platforms, as well as a link to the actual document.
- Business intelligence industry trends — some of Gartner’s thoughts but mainly my own.
- (This post) Company-by-company comments based on the 2011/2012 Gartner Magic Quadrant for Business Intelligence Platforms.
- Third-party analytics, pulling together and expanding on some points I made in the first three posts.
The heart of Gartner Group’s 2011/2012 Magic Quadrant for Business Intelligence Platforms was the company comments. I shall expound upon some, roughly in declining order of Gartner’s “Completeness of Vision” scores, dubious though those rankings may be. Read more
|Categories: Business intelligence, Business Objects, Cognos, IBM and DB2, Information Builders, Jaspersoft, MicroStrategy, Open source, Oracle, Pentaho, QlikTech and QlikView, SAP AG, SAS Institute, Tableau Software||5 Comments|
The most straightforward approach to the applications business is:
- Take general-purpose technology and think through how to apply it to a specific application domain.
- Produce packaged application software accordingly.
However, this strategy is not as successful in analytics as in the transactional world, for two main reasons:
- Analytic applications of that kind are rarely complete.
- Incomplete applications rarely sell well.
I first realized all this about a decade ago, after Henry Morris coined the term analytic applications and business intelligence companies thought it was their future. In particular, when Dave Kellogg ran marketing for Business Objects, he rattled off an argument to the effect that Business Objects had generated more analytic app revenue over the lifetime of the company than Cognos had. I retorted, with only mild hyperbole, that the lifetime numbers he was citing amounted to “a bad week for SAP”. Somewhat hoist by his own petard, Dave quickly conceded that he agreed with my skepticism, and we changed the subject accordingly.
Reasons that analytic applications are commonly less complete than the transactional kind include: Read more
|Categories: Business intelligence, Business Objects, Data mart outsourcing, Investment research and trading, Log analysis, Metamarkets and Druid, Oracle, SAP AG, SAS Institute, Web analytics, WibiData||17 Comments|
A reporter interviewed me via IM about how CIOs should view SAS Institute and its products. Naturally, I have edited my comments (lightly) into a blog post. They turned out to be clustered into three groups, as follows:
- SAS faces a number of challenges, not unlike those faced by other high-priced legacy technology vendors.
- It is used by organizations who have large budgets to pay for the product and to pay people to be expert on the product’s intricacies.
- SAS has not integrated with scale-out analytic DBMS technologies as well or quickly as had been hoped, or as earlier marketing suggested was likely.
- SAS has not been strong in helping its users do agile predictive analytics.
- SAS’ strengths are concentrated in product breadth:
- Lots of statistical algorithms.
- Various vertical products that make the modeling techniques more accessible in specific application domains.
- Various approaches to engineering for scalability — no one of those has been a table-thumping success to date, but SAS has the resources to keep trying.
- Some level of integration with its own business intelligence and text analytics products.
- For any particular use case, the burden of proof is on SAS alternatives to show that they have enough pieces in the toolkit to meet the needs.
- SPSS (now owned by IBM) also has legacy issues.
- KXEN is focused on marketing use cases.
- Mahout has been one of the less successful Hadoop-related open source projects.
- R-based technology is still maturing.
- The modeling capabilities (as opposed to just scoring) bundled into RDBMS and well-parallelized tend to be pretty limited. Apparent exceptions tend to just be R repackaged.
|Categories: Analytic technologies, Data warehousing, Hadoop, IBM and DB2, KXEN, Predictive modeling and advanced analytics, SAS Institute||18 Comments|