IBM and DB2 – DBMS 2 : DataBase Management System Services

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
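To make the ELT pattern concrete, here is a minimal sketch, assuming an in-memory SQLite database as a stand-in for an analytic RDBMS. The table names, column names and sample rows are all hypothetical. The point is only the ordering: raw data is loaded first, and the transformation happens afterwards, in SQL, inside the database.

```python
import sqlite3

# Hypothetical raw order records, as they might arrive from a source system.
raw_rows = [
    ("2016-08-01", "widget", "3", "2.50"),
    ("2016-08-01", "gadget", "1", "10.00"),
    ("2016-08-02", "widget", "2", "2.50"),
]

# SQLite in memory is only a stand-in for a real analytic RDBMS.
conn = sqlite3.connect(":memory:")

# Extract + Load: land the data untransformed, every column as text.
conn.execute(
    "CREATE TABLE raw_orders "
    "(order_date TEXT, product TEXT, qty TEXT, unit_price TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_rows)

# Transform: done after loading, in SQL, inside the database.
conn.execute(
    """
    CREATE TABLE daily_revenue AS
    SELECT order_date,
           SUM(CAST(qty AS INTEGER) * CAST(unit_price AS REAL)) AS revenue
    FROM raw_orders
    GROUP BY order_date
    """
)

daily_revenue = conn.execute(
    "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

In a traditional ETL flow, the casting and aggregation would instead happen in a separate engine before anything reached the database.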

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those systems uses some SQL, at least for initial data extraction. They are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer these trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (database administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Notes on vendor lock-in

Curt Monash — Wed, 20 Jul 2016 01:35:32 +0000

Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.

1. The most basic form of lock-in is:

You do application development for a target set of platform technologies.
Your applications can’t run without those platforms underneath.
Hence, you’re locked into those platforms.

2. Enterprise vendor standardization is closely associated with lock-in. The core idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:

That simplifies your environment, requiring less integration and interoperability.
That simplifies your staffing; the same skill sets apply to multiple needs and projects.
That simplifies your vendor support relationships; there’s “one throat to choke”.
That simplifies your price negotiation.

3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works as follows:

For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
A few years later, the price is renegotiated, based on then-current levels of usage.

Thus, doing an additional project using ELAed products may appear low-cost.

Incremental license and maintenance fees may be zero in the short-term.
Incremental personnel costs may be controlled because the needed skills are already in-house.

Often those appearances are substantially correct. That’s a big reason why incumbent software is difficult to supplant unless the upstart substitute is superior in fundamental and important ways.

4. Subscriptions are closely associated with lock-in.

Most obviously, the traditional software industry gets its profits from high-margin support/maintenance services.
Cloud lock-in has rapidly become a big deal.
The open source vendors meeting lock-in resistance, noted below, have subscription business models.

Much of why customers care about lock-in is the subscription costs it’s likely to commit them to.

5. Also related to lock-in are thick single-vendor technology stacks. If you run Oracle applications, you’re going to run the Oracle DBMS too. And if you run that, you’re likely to run other Oracle software, and perhaps use Exadata hardware as well. The cloud ==> lock-in truism is an example of this point as well.

6. There’s a lot of truth to the generality that central IT cares about overall technology architecture, while line-of-business departments just want to get the job done. This causes departments to both:

Oppose standardization.
Like thick technology stacks.

Thus, departmental influence on IT both encourages and discourages lock-in.

7. IBM is all about lock-in. IBM’s support for Linux, Eclipse and so on doesn’t really contradict that. IBM’s business model is to ~~squeeze~~ serve its still-large number of strongly loyal customers as well as it can.

8. Microsoft’s business model over the decades has also greatly depended on lock-in.

Indeed, it exploited Windows/Office lock-in so vigorously as to incur substantial anti-trust difficulties.
Server-side Windows tends to be involved in thick stacks — DBMS, middleware, business intelligence, SharePoint and more. Many customers (smaller enterprises or in some cases departments) are firmly locked into these stacks.
Microsoft is making a strong cloud push with Azure, which inherently involves lock-in.

Yet sometimes, Microsoft is more free and open.

Office for Macintosh allowed the Mac to be a viable Windows competitor. (And Microsoft was well-paid for that, generating comparable revenue per Mac to what it got for each Windows PC.)
Visual Studio is useful for writing apps to run against multiple DBMS.
Just recently, Microsoft SQL Server was ported to Linux.

9. SAP applications run over several different DBMS, including its own cheap MaxDB. That counteracts potential DBMS lock-in. But some of its newer apps are HANA-specific. That, of course, has the opposite effect.

10. And with that as background, we can get to what finally led me to write this post. Multiple clients have complaints that may be paraphrased as:

Customers are locked into expensive traditional DBMS such as Oracle.
Yet they’re so afraid of lock-in now that they don’t want to pay for our vendor-supplied versions of open source database technologies; they prefer to roll their own.
Further confusing matters, they also are happy to use cloud technologies, including the associated database technologies (e.g. Redshift or other Amazon offerings), creating whole new stacks of lock-in.

So open source vendors of NoSQL data managers and similar technologies felt like they were the only kind of vendor suffering from fear of lock-in.

I agree with them that enterprises who feel this way are getting it wrong. Indeed:

The management of even NoSQL DBMS is a big issue, and help in that area has high cash value for customers.
Serious users need support.
Support and management tools happen to be synergistic with each other.

This is the value proposition that propelled Cloudera. It’s also a strong reason to give money to whichever of MongoDB, DataStax, Neo Technology et al. sponsors open source technology that you use.

General disclosure: My fingerprints have been on this industry strategy since before the term “NoSQL” was coined. It’s been an aspect of many different consulting relationships.

Some enterprises push back, logically or emotionally as the case may be, by observing that the best internet companies — e.g., Facebook — are allergic to paying for software, even open source. My refutations of that argument include:

Facebook has more and better engineers than you do.
Facebook has a lot more servers than you do, and would presumably face much higher prices than you would if you each chose to forgo the in-house alternative.
Facebook pays for open source software in a different way than through subscription fees — it invents and enhances it. Multiple important projects have originated at Facebook, and it contributes to many others. Are you in a position to do the same thing?

And finally — most of Facebook’s users get its service for free. (Advertisers are the ones who pay cash; all others just pay in attention to the ads.) So if getting its software for free actually does screw up its SLAs (Service Level Agreements) — well, free generally comes with poorer SLAs than paid. But if you’re in the business of serving paying customers, then you might want to have paying-customer kinds of SLAs, even on the parts of your technology — e.g. websites urging people to do business with you — that you provide for free yourself.

Related links

The technology underlying packaged applications (November, 2015, but it has a historical focus)
Topics in migration (January, 2015)
Much of the vendor advice on Strategic Messaging.

Oracle as the new IBM — has a long decline started?

Curt Monash — Thu, 31 Dec 2015 09:15:34 +0000

When I find myself making the same observation fairly frequently, that’s a good impetus to write a post based on it. And so this post is based on the thought that there are many analogies between:

Oracle and the Oracle DBMS.
IBM and the IBM mainframe.

And when you look at things that way, Oracle seems to be swimming against the tide.

Drilling down, there are basically three things that can seriously threaten Oracle’s market position:

Growth in apps of the sort for which Oracle’s RDBMS is not well-suited. Much of “Big Data” fits that description.
Outright, widespread replacement of Oracle’s application suites. This is the least of Oracle’s concerns at the moment, but could of course be a disaster in the long term.
Transition to “the cloud”. This trend amplifies the other two.

Oracle’s decline, if any, will be slow — but I think it has begun.

Oracle/IBM analogies

There’s a clear market lead in the core product category. IBM was dominant in mainframe computing. While not as dominant, Oracle is definitely a strong leader in high-end OLTP/mixed-use (OnLine Transaction Processing) RDBMS.

That market lead is even greater than it looks, because some of the strongest competitors deserve asterisks. Many of IBM’s mainframe competitors were “national champions” — Fujitsu and Hitachi in Japan, Bull in France and so on. Those were probably stronger competitors to IBM than the classic BUNCH companies (Burroughs, Univac, NCR, Control Data, Honeywell).

Similarly, Oracle’s strongest direct competitors are IBM DB2 and Microsoft SQL Server, each of which is sold primarily to customers loyal to the respective vendors’ full stacks. SAP is now trying to play a similar game.

The core product is stable, secure, richly featured, and generally very mature. Duh.

The core product is complicated to administer — which provides great job security for administrators. IBM had JCL (Job Control Language). Oracle has a whole lot of manual work overseeing indexes. In each case, there are many further examples of the point. Edit: A Twitter discussion suggests the specific issue with indexes has been long fixed.

Niche products can actually be more reliable than the big, super-complicated leader. Tandem Nonstop computers were super-reliable. Simple, “embeddable” RDBMS — e.g. Progress or SQL Anywhere — in many cases just work. Still, if you want one system to run most of your workload 24×7, it’s natural to choose the category leader.

The category leader has a great “whole product” story. Here I’m using “whole product” in the sense popularized by Geoffrey Moore, to encompass ancillary products, professional services, training, and so on, from the vendor and third parties alike. There was a time when most serious packaged apps ran exclusively on IBM mainframes. Oracle doesn’t have quite the same dominance, but there are plenty of packaged apps for which it is the natural choice of engine.

Notwithstanding all the foregoing, there’s strong vulnerability to alternative product categories. IBM mainframes eventually were surpassed by UNIX boxes, which had grown up from the minicomputer and even workstation categories. Similarly, the Oracle DBMS has trouble against analytic RDBMS specialists, NoSQL, text search engines and more.

IBM’s fate, and Oracle’s

Given that background, what does it teach us about possible futures for Oracle? The golden age of the IBM mainframe lasted 25 or 30 years — 1965-1990 is a good way to think about it, although there’s a little wiggle room at both ends of the interval. Since then it’s been a fairly stagnant cash-cow business, in which a large minority or perhaps even small majority of IBM’s customers have remained intensely loyal, while others have aligned with other vendors.

Oracle’s DBMS business seems pretty stagnant now too. There’s no new on-premises challenger to Oracle now as strong as UNIX boxes were to IBM mainframes 20-25 years ago, but as noted above, traditional competitors are stronger in Oracle’s case than they were in IBM’s. Further, the transition to the cloud is a huge deal, currently in its early stages, and there’s no particular reason to think Oracle will hold any more share there than IBM did in the transition to UNIX.

Within its loyal customer base, IBM has been successful at selling a broad variety of new products (typically software) and services, often via acquired firms. Oracle, of course, has also extended its product lines immensely from RDBMS, to encompass “engineered systems” hardware, app server, apps, business intelligence and more. On the whole, this aspect of Oracle’s strategy is working well.

That said, in most respects Oracle is weaker at account control than peak IBM.

Oracle’s core competitors, IBM and Microsoft, are stronger than IBM’s were.
DB2 and SQL Server are much closer to Oracle compatibility than most mainframes were to IBM. (Amdahl is an obvious exception.) This is especially true as of the past 10-15 years, when it has become increasingly clear that reliance on stored procedures is a questionable programming practice. Edit: But please see the discussion below challenging this claim.
Oracle (the company) is widely hated, in a way that IBM generally wasn’t.
Oracle doesn’t dominate a data center the way hardware monopolist IBM did in a hardware-first era.

Above all, Oracle doesn’t have the “Trust us; we’ll make sure your IT works” story that IBM did. Appliances, aka “engineered systems”, are a step in that direction, but those are only — or at least mainly — to run Oracle software, which generally isn’t everything a customer has.

But think of the apps!

Oracle does have one area in which it has more account control power than IBM ever did — applications. If you run Oracle apps, you probably should be running the Oracle RDBMS and perhaps an Exadata rack as well. And perhaps you’ll use Oracle BI too, at least in use cases where you don’t prefer something that emphasizes a more modern UI.

As a practical matter, most enterprise app rip-and-replace happens in a few scenarios:

Merger/acquisition. An enterprise that winds up with different apps for the same functions may consolidate and throw the loser out. I’m sure Oracle loses a few customers this way to SAP every year, and vice-versa.
Drastic obsolescence. This can take a few forms, mainly:
- Been there, done that.
- Enterprise outgrows the capabilities of the current app suite. Oracle’s not going to lose much business that way.
- Major platform shift. Going forward, that means SaaS/“cloud” (Software as a Service).

And so the main “opportunity” for Oracle to lose application market share is in the transition to the cloud.

Putting this all together …

A typical large-enterprise Oracle customer has 1000s of apps running on Oracle. The majority would be easy to port to some other system, but the exceptions to that rule are numerous enough to matter — a lot. Thus, Oracle has a secure place at that customer until such time as its applications are mainly swept away and replaced with something new.

But what about new apps? In many cases, they’ll arise in areas where Oracle’s position isn’t strong.

New third-party apps are likely to come from SaaS vendors. Oracle can reasonably claim to be a major SaaS vendor itself, and salesforce.com has a complex relationship with the Oracle RDBMS. But on the whole, SaaS vendors aren’t enthusiastic Oracle adopters.
New internet-oriented apps are likely to focus on customer/prospect interactions (here I’m drawing the (trans)action/interaction distinction) or even more purely machine-generated data (“Internet of Things”). The Oracle RDBMS has few advantages in those realms.
Further, new apps — especially those that focus on data external to the company — will in many cases be designed for the cloud. This is not a realm of traditional Oracle strength.

And that is why I think the answer to this post’s title question is probably “Yes”.

Related links

A significant fraction of my posts, in this blog and Software Memories alike, are probably at least somewhat relevant to this sweeping discussion. Particularly germane is my 2012 overview of Oracle’s evolution. Other posts to call out are my recent piece on transitioning to the cloud, and my series on enterprise application history.

Transitioning to the cloud(s)

Curt Monash — Mon, 07 Dec 2015 17:48:53 +0000

There’s a lot of talk these days about transitioning to the cloud, by IT customers and vendors alike. Of course, I have thoughts on the subject, some of which are below.

1. The economies of scale of not running your own data centers are real. That’s the kind of non-core activity almost all enterprises should outsource. Of course, those considerations taken alone argue equally for true cloud, co-location or SaaS (Software as a Service).

2. When the (Amazon) cloud was newer, I used to hear that certain kinds of workloads didn’t map well to the architecture Amazon had chosen. In particular, shared-nothing analytic query processing was necessarily inefficient. But I’m not hearing nearly as much about that any more.

3. Notwithstanding the foregoing, not everybody loves Amazon pricing.

4. Infrastructure vendors such as Oracle would like to also offer their infrastructure to you in the cloud. As per the above, that could work. However:

Is all your computing on Oracle’s infrastructure? Probably not.
Do you want to move the Oracle part and the non-Oracle part to different clouds? Ideally, no.
Do you like the idea of being even more locked in to Oracle than you are now? [Insert BDSM joke here.]
Will Oracle do so much better of a job hosting its own infrastructure that you use its cloud anyway? Well, that’s an interesting question.

Actually, if we replace “Oracle” by “Microsoft”, the whole idea sounds better. While Microsoft doesn’t have a proprietary server hardware story like Oracle’s, many folks are content in the Microsoft walled garden. IBM has fiercely loyal customers as well, and so may a couple of Japanese computer manufacturers.

5. Even when running stuff in the cloud is otherwise a bad idea, there’s still:

Test and dev(elopment) — usually phrased that way, although the opposite order makes more sense.
Short-term projects — the most obvious examples are in investigative analytics.
Disaster recovery.

So in many software categories, almost every vendor should have a cloud option of some kind.

6. Reasons for your data to wind up in a plurality of remote data centers include:

High availability, and similarly disaster recovery. Duh.
Second-source/avoidance of lock-in.
Geo-compliance.
Particular SaaS offerings being hosted in different places.
Use of both true cloud and co-location for different parts of your business.

7. “Mostly compatible” is by no means the same as “compatible”, and confusing the two leads to tears. Even so, “mostly compatible” has stood the IT industry in good stead multiple times. My favorite examples are:

SQL
UNIX (before Linux).
IBM-compatible PCs (or, as Ben Rosen used to joke, Compaq-compatible).
Many cases in which vendors upgrade their own products.

I raise this point for two reasons:

I think Amazon/OpenStack could be another important example.
A vendor offering both cloud and on-premises versions of their offering, with minor incompatibilities between the two, isn’t automatically crazy.

8. SaaS vendors, in many cases, will need to deploy in many different clouds. Reasons include:

If they want customers around the world, they may need to process data in customers’ home country or region.
It could be the simplest way to meet the need of offering customers an on-premises option.

That said, there are of course significant differences between, for example:

Deploying to Amazon in multiple regions around the world.
Deploying to Amazon plus a variety of OpenStack-based cloud providers around the world, e.g. some “national champions” (perhaps subsidiaries of the main telecommunications firms).*
Deploying to Amazon, to other OpenStack-based cloud providers, and also to an OpenStack-based system that resides on customer premises (or in their co-location facility).

9. The previous point, and the last bullet of the one before that, are why I wrote in a post about enterprise app history:

There’s a huge difference between designing applications to run on one particular technology stack, vs. needing them to be portable across several. As a general rule, offering an application across several different brands of almost-compatible technology — e.g. market-leading RDBMS or (before the Linux era) proprietary UNIX boxes — commonly works out well. The application vendor just has to confine itself to relying on the intersection of the various brands’ feature sets.*

*The usual term for that is the spectacularly incorrect phrase “lowest common denominator”.

Offering the “same” apps over fundamentally different platform technologies is much harder, and I struggle to think of any cases of great success.
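The "intersection of feature sets" idea in the quoted passage can be sketched in a few lines. This is only an illustration: the brand names and feature names below are hypothetical and do not describe any real product.

```python
# Hypothetical feature sets for three almost-compatible platforms.
features = {
    "brand_a": {"joins", "views", "window_functions", "proprietary_hints"},
    "brand_b": {"joins", "views", "window_functions", "custom_procedures"},
    "brand_c": {"joins", "views", "custom_procedures"},
}


def portable_subset(feature_sets):
    """Return the features available on every platform (the intersection)."""
    sets = list(feature_sets.values())
    if not sets:
        return set()
    return set.intersection(*sets)


def non_portable(required, feature_sets):
    """Return the required features that at least one platform lacks."""
    return set(required) - portable_subset(feature_sets)


portable = portable_subset(features)
blocked = non_portable({"joins", "window_functions"}, features)
print(sorted(portable))
print(sorted(blocked))
```

An application that stays inside `portable` runs on all three hypothetical brands. Anything listed in `blocked` would need per-platform code or would have to be dropped.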

10. Decisions on where to process and store data are of course strongly influenced by where and how the data originates. In broadest terms:

Traditional business transaction data at large enterprises is typically managed by on-premises legacy systems. So legacy issues arise in full force.
Internet interaction data — e.g. web site clicks — typically originates in systems that are hosted remotely. (Few enterprises run their websites on premises.) It is tempting to manage and analyze that data where it originates. That said:
- You often want to enhance that data with what you know from your business records …
- … which is information that you may or may not be willing to send off-premises.
“Phone-home” IoT (Internet of Things) data, from devices at — for example — many customer locations, often makes sense to receive in the cloud. Once it’s there, why not process and analyze it there as well?
Machine-generated data that originates on your premises may never need to leave them. Even if its origins are as geographically distributed as customer devices are, there’s a good chance that you won’t need other cloud features (e.g. elastic scalability) as much as in customer-device use cases.

Related link

While the nuances of my views may change over time, I continue to think that computing platforms will almost all be appliances, clusters or clouds.

Machine learning’s connection to (the rest of) AI

Curt Monash — Tue, 01 Dec 2015 09:28:22 +0000

This is part of a four-post series spanning two blogs.

One post gives a general historical overview of the artificial intelligence business.
One post specifically covers the history of expert systems.
One post gives a general present-day overview of the artificial intelligence business.
One post (this one) explores the close connection between machine learning and (the rest of) AI.

1. I think the technical essence of AI is usually:

Inputs come in.
Decisions or actions come out.
More precisely — inputs come in, something intermediate is calculated, and the intermediate result is mapped to a decision or action.
The intermediate results are commonly either numerical (a scalar or perhaps a vector of scalars) or a classification/partition into finitely many possible intermediate outputs.

Of course, a lot of non-AI software can be described the same way.

To check my claim, please consider:

It fits rules engines/expert systems so simply it’s barely worth saying.
It fits any kind of natural language processing; the intermediate results might be words or phrases or concepts or whatever.
It fits machine vision beautifully.

To see why it’s true from a bottom-up standpoint, please consider the next two points.

2. It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response. Examples of what I mean include:

Think of what’s on an IQ test, or a commonly accepted substitute for same. (The SAT sometimes substitutes.) A lot of that is pattern recognition.
When the “multiple intelligences” or just “emotional intelligence” concepts gained currency, the core idea was the recognition of various different kinds of pattern. (E.g., reading somebody else’s emotions, something that I’m not nearly as good at as I am at the skills measured by standard IQ tests.)
The central mechanism of neurotransmission is a neuron recognizing that an action potential has crossed a certain threshold, and firing as a result.
Traditional areas of AI include natural language recognition, machine vision, and so on.
Another traditional area of AI is rules-based processing — conditions in, decision out.
Back in the 1980s (less so today), it was thought that a core underpinning for AI technology was knowledge representation. That said, as much as I like interesting data structures, I have my doubts.
- The Semantic Web grew out of this idea.
- Also, the single most enduring proponent of the centrality of knowledge representation was probably Doug Lenat, who gave his name to a famed unit of bogosity.
- While the previous two points are probably just coincidence, the juxtaposition is suggestive.

3. In most computational cases, pattern recognition and response boil down to scoring and/or classification (whether in a narrow machine learning sense of “classification” or otherwise). What I mean by this is:

I’m thinking of scoring as a function that maps inputs into scalar values. (Or a vector of scalars.)
I’m thinking of classification as a function that maps inputs into a finite range of possible values. (Note that this is mathematically equivalent to a finite partition on the set of inputs.)
I’m also assuming that the system maps each possible score or classification to a decision or response (deterministically or probabilistically as the case may be).
Then if you compose the two maps, you wind up with a function from {possible input patterns} to {possible responses}.
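The composition described in the list above can be sketched as follows. Everything here is a toy under stated assumptions: the keyword-based `score` function, the thresholds in `classify`, and the `RESPONSES` table are hypothetical, and only the deterministic case is shown.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, TypeVar

X = TypeVar("X")
M = TypeVar("M")
R = TypeVar("R")


def compose(intermediate: Callable[[X], M],
            respond: Callable[[M], R]) -> Callable[[X], R]:
    """Compose input -> intermediate result with intermediate -> response."""
    return lambda x: respond(intermediate(x))


def score(message: str) -> float:
    """Scoring: map an input to a scalar (share of suspicious words)."""
    suspicious = ("free", "winner", "urgent")  # hypothetical keyword list
    words = message.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in suspicious) / len(words)


def classify(message: str) -> str:
    """Classification: map an input to one of finitely many labels."""
    s = score(message)
    if s >= 0.5:
        return "spam"
    if s > 0:
        return "suspect"
    return "ok"


# Deterministic map from each possible classification to a response.
RESPONSES: Dict[str, str] = {
    "spam": "discard",
    "suspect": "quarantine",
    "ok": "deliver",
}

# The composed function from possible inputs to possible responses.
decide = compose(classify, RESPONSES.__getitem__)


def partition(inputs: Iterable[str],
              classifier: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group inputs by label, i.e. the finite partition a classifier induces."""
    blocks: Dict[str, List[str]] = defaultdict(list)
    for x in inputs:
        blocks[classifier(x)].append(x)
    return dict(blocks)


messages = ["free winner", "urgent meeting at noon", "lunch at noon"]
decisions = [decide(m) for m in messages]
blocks = partition(messages, classify)
print(decisions)
print(blocks)
```

Swapping the hand-written `score` for a learned model would change nothing else in the sketch.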

4. If you want a good algorithm for classification, of course, it’s natural to pursue it via machine learning. And the same is true of scoring, at least if we recall that the domains of machine learning and statistics have essentially merged.

5. It took people remarkably long to figure out the previous point. Through at least the end of the previous century, it was generally assumed that the way to come up with clever algorithms for, for example, text analytics or machine vision was — well, to think them up.

6. As spelled out in my overview of present-day commercial AI, there’s a somewhat paradoxical industry structure, in that:

Even though machine learning is a sine qua non of many businesses, tech and non-tech alike …
… the rest of AI is largely concentrated at a few behemoth technology companies.

Of course, there are plenty of startups hoping to change that structure. I hope some of them succeed.

What is AI, and who has it?

Curt Monash — Tue, 01 Dec 2015 09:25:46 +0000

This is part of a four-post series spanning two blogs.

One post gives a general historical overview of the artificial intelligence business.
One post specifically covers the history of expert systems.
One post (this one) gives a general present-day overview of the artificial intelligence business.
One post explores the close connection between machine learning and (the rest of) AI.

1. “Artificial intelligence” is a term that usually means one or more of:

“Smart things that computers can’t do yet.”
“Smart things that computers couldn’t do until recently.”
“Technology that has emerged from the work of computer scientists who said they were doing AI.”
“Underpinnings for other things that might be called AI.”

But that covers a lot of ground, especially since reasonable people might disagree as to what constitutes “smart”.

2. Examples of what has been called “AI” include:

Rule-based processing, especially if it is referred to as “expert systems”.
Machine learning.
Many aspects of “natural language processing” — a term almost as overloaded as “artificial intelligence” — including but not limited to:
- Text search.
- Speech recognition, especially but not only if it seems somewhat lifelike.
- Automated language translation.
- Natural language database query.
Machine vision.
Autonomous vehicles.
Robots, especially but not only ones that seem somewhat lifelike.
Automated theorem proving.
Playing chess at an Elo rating of 1600 or better.
Beating the world champion at chess.
Beating the world champion at Jeopardy.
Anything that IBM brands or rebrands as “Watson”.

That last bit is awkward, as IBM is doing the industry a major disservice via its recklessly confusing Watson marketing, which is instantiating Monash’s First Law of Commercial Semantics — Bad jargon drowns out good. I suspect there’s an interesting debate under it all, in which IBM stands almost alone against the whole rest of the industry by sticking to the old academic belief that sophisticated knowledge representation is the key to AI. But it’s hard to be sure, because IBM’s Watson marketing is so full of smoke that reality, if any, doesn’t show through.

3. When I think of present-day AI commercialization, what comes to mind is mainly:

Multiple efforts in speech recognition, from Google, Microsoft, Apple, and Nuance Communications. (I’m not sure whether Apple’s is mainly in-house or mainly outsourced.)
Other natural language efforts, such as Google’s in machine translation.
Technology related to robots and autonomous vehicles, specifically in machine vision, other senses (e.g. touch), and reactions (e.g. driving decisions).
- Google is the most visible player here. It’s gotten a lot of press for driverless automobiles, and it bought up a lot of robotics companies when they were hurting due to a hiatus in DARPA funding.
- Large auto companies will surely compete.
Gesture interpretation and similar kinds of recognition.
- Microsoft has the most visibility here, due to Kinect, and is trying to bring similar technology to general computing.
- Facebook, Google et al. are making major investments into the closely related area of virtual reality. Facebook is also building an AI team.
Machine learning.
- Machine learning in general can be regarded as part of AI, at least historically.
- Machine learning is a key component of many AI efforts. Google in particular has made a big fuss about it, suggesting that data is generally more important than algorithms.
Whatever parts of the IBM story, if any, are actually real.

So with one big exception, commercial AI seems to be concentrated at a small number of behemoth companies. The exception is machine learning itself, which is being adopted and developed on a much broader basis.

4. AngelList seems to say I’m wrong, citing 576 different AI startups. CrunchBase offers 436 AI startups. So maybe some of those startups will succeed. We’ll see.

5. Some of the reasons for AI’s concentrated industry structure lie in general business and economics.

A large company can risk research with unclear payoffs a lot more easily than a small one can.
AI is prestigious and/or cool. Some large companies like to indulge in stuff like that.

Yes, those reasons are somewhat counteracted by the facts that:

VCs know they’re investing in companies whose eventual exit will likely be an acquisition.
Some of those acquisitions are for a LOT of money.

But I think they apply even so. And by the way — to date, most AI companies have not been acquired for very high prices.

6. Some of the reasons for AI industry concentration are more specifically technological.

Some AI — e.g. speech recognition or autonomous vehicle navigation — could be the “sizzle” that differentiates offerings in huge business sectors. Thus, a “win” in AI could have more value to an already-large electronics, search or automobile company than to a startup.
The largest companies in those huge sectors can afford huge amounts of training data, or may even get it as a byproduct of their other activities. Hence they can more easily afford massive exercises in the relevant machine learning.

My paradigmatic example for the latter point is Google with anything connected to search, such as translation (which it does of search results) or natural language recognition (which it does of search queries).

If you want to do an AI startup, those are some of the competitive factors that you need to beat.

Related links

An earlier version of some of this material was in my January, 2014 post on The games of Watson.
Earlier this year, I posted about robotics.
There is quite a bit of AI humor.

MariaDB and MaxScale

Curt Monash — Fri, 10 Apr 2015 16:48:11 +0000

I chatted with the MariaDB folks on Tuesday. Let me start by noting:

MariaDB, the product, is a MySQL fork.
MariaDB, product and company alike, are essentially a reaction to Oracle’s acquisition of MySQL. A lot of the key players are previously from MySQL.
MariaDB, the company, is the former SkySQL …
… which acquired or is the surviving entity of a merger with The Monty Program, which originated MariaDB. According to Wikipedia, something called the MariaDB Foundation is also in the mix.
I get the impression SkySQL mainly provided services around MySQL, especially remote DBA.
It appears that a lot of MariaDB’s technical differentiation going forward is planned to be in a companion product called MaxScale, which was released into Version 1.0 general availability earlier this year.

The numbers around MariaDB are a little vague. I was given the figure that there were ~500 customers total, but I couldn’t figure out what they were customers for. Remote DBA services? MariaDB support subscriptions? Something else? I presume there are some customers in each category, but I don’t know the mix. Other notes on MariaDB the company are:

~80 people in ~15 countries.
20-25 engineers, which hopefully doesn’t count a few field support people.
“Tiny” headquarters in Helsinki.
Business leadership growing in the US and especially the SF area.

MariaDB, the company, also has an OEM business. Part of their pitch is licensing for connectors — specifically LGPL — that hopefully gets around some of the legal headaches for MySQL engine suppliers.

MaxScale is a proxy, which starts out by intercepting and parsing MariaDB queries.

As you might guess, MaxScale has a sharding story.
- All MaxScale sharding is transparent.
- Right now MaxScale sharding is “schema-based”, which I interpret to mean that different tables can be on different servers.
- Planned to come soon is “key-based” sharding, which I interpret to mean the kind of sharding that lets you scale a table across multiple servers without the application needing to know that is happening.
- I didn’t ask about join performance when tables are key-sharded.
MaxScale includes a firewall.
MaxScale has 5 “well-defined” APIs, which were described as:
- Authentication.
- Protocol.
- Monitoring.
- Routing.
- Filtering/logging.
I think MaxScale’s development schedule is “asynchronous” from that of the MariaDB product.
Further, MaxScale has a “plug-in” architecture that is said to make it easy to extend.
One plug-in on the roadmap is replication into Hadoop-based tables. (I think “into” is correct.)
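
The difference between the two sharding styles, as I interpret them above, can be sketched in a few lines of Python. This is not MaxScale code or a MaxScale API. The server names, the table map and both routing functions are hypothetical, and hashing the key is just one plausible way a proxy could do key-based routing.

```python
import zlib

# Hypothetical back-end servers sitting behind the proxy.
SERVERS = ["server_a", "server_b", "server_c"]

# Schema-based: each table lives entirely on one server (assumed mapping).
TABLE_TO_SERVER = {"orders": "server_a", "customers": "server_b"}


def route_by_schema(table: str) -> str:
    """Pick a server from the table name alone."""
    return TABLE_TO_SERVER[table]


def route_by_key(shard_key: str) -> str:
    """Pick a server by hashing the row's shard key, so that a single
    table can be spread across all servers."""
    digest = zlib.crc32(shard_key.encode("utf-8"))
    return SERVERS[digest % len(SERVERS)]
```

In the schema-based case, every query against `orders` goes to one server. In the key-based case, rows of the same table land on different servers depending on the key, which is why join performance across key-sharded tables is the natural follow-up question.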

I had trouble figuring out the differences between MariaDB’s free and enterprise editions. Specifically, I thought I heard that there were no feature differences, but I also thought I heard examples of feature differences. Further, there are third-party products included, but plans to replace some of those with in-house developed products in the future.

A few more notes:

MariaDB’s optimizer is rewritten vs. MySQL.
Like other vendors before it, MariaDB has gotten bored with its old version numbering scheme and jumped to 10.0.
One of the storage engines MariaDB ships is TokuDB. Surprisingly, TokuDB’s most appreciated benefit seems to be compression, not performance.
As an example of significant outside code contributions, MariaDB cites Google contributing whole-database encryption into what will be MariaDB 10.1.
Online schema change is on the roadmap.
There’s ~$20 million of venture capital in the backstory.
Engineering is mainly in Germany, Eastern Europe, and the US.
MariaDB Power8 performance is reportedly great (2X Intel Sandy Bridge or a little better). Power8 sales are mainly in Europe.

Hadoop: And then there were three

Curt Monash — Wed, 18 Feb 2015 21:50:37 +0000

Hortonworks, IBM, EMC Pivotal and others have announced a project called “Open Data Platform” to do … well, I’m not exactly sure what. Mainly, it sounds like:

An attempt to minimize the importance of any technical advantages Cloudera or MapR might have.
A face-saving way to admit that IBM’s and Pivotal’s insistence on having their own Hadoop distributions has been silly.
An excuse for press releases.
A source of an extra logo graphic to put on marketing slides.

Edit: Now there’s a press report saying explicitly that Hortonworks is taking over Pivotal’s Hadoop distro customers (which basically would mean taking over the support contracts and then working to migrate them to Hortonworks’ distro).

The claim is being made that this announcement solves some kind of problem about developing to multiple versions of the Hadoop platform, but to my knowledge that’s a problem rarely encountered in real life. When you already have a multi-enterprise open source community agreeing on APIs (Application Programming Interfaces), what API inconsistency remains for a vendor consortium to painstakingly resolve?

Anyhow, it now seems clear that if you want to use a Hadoop distribution, there are three main choices:

Cloudera’s flavor, whether as software (from Cloudera) or in an appliance (e.g. from Oracle).
MapR’s flavor, as software from MapR.
Hortonworks’ flavor, from a number of vendors, including Hortonworks, IBM, Pivotal, Teradata et al.

In saying that, I’m glossing over a few points, such as:

There are various remote services that run Hadoop, most famously Amazon’s Elastic MapReduce.
You could get Apache Hadoop directly, rather than using the free or paid versions of a vendor distro. But why would you make that choice, unless you’re an internet bad-ass on the level of Facebook, or at least think that you are?
There will surely always be some proprietary stuff mixed into, for example, IBM’s BigInsights, so as to preserve at least the perception of all-important vendor lock-in.

But the main point stands — big computer companies, such as IBM, EMC (Pivotal) and previously Intel, are figuring out that they can’t bigfoot something that started out as an elephant — stuffed or otherwise — in the first place.

If you think I’m not taking this whole ODP thing very seriously, you’re right.

Related links

It’s a bit eyebrow-raising to see Mike Olson take a “more open source than thou” stance about something, but basically his post about this news is spot-on.
My take on Hadoop distributions two years ago might offer context. Trivia question: What’s the connection between the song that begins that post and the joke that ends it?

Thoughts and notes, Thanksgiving weekend 2014

Curt Monash — Mon, 01 Dec 2014 01:48:43 +0000

I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:

1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:

Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
Replicating an entire database into different geographies, often for reasons of latency and/or availability/disaster recovery.

The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.

What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.

2. Three years ago I posted about agile (predictive) analytics. One of the points was:

… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.

Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
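
For readers who haven’t seen bandit testing, here is a minimal epsilon-greedy sketch in Python. It is a textbook variant written for illustration, not a description of what WibiData or its clients run. The class name, the arm names and the default epsilon of 0.1 are all assumptions.

```python
import random


class EpsilonGreedyBandit:
    """Mostly show the best-performing option so far, but keep
    exploring the others a fraction `epsilon` of the time."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}  # running mean reward
        self.rng = random.Random(seed)

    def choose(self):
        # Explore with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[arm])

    def update(self, arm, reward):
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n
```

A typical loop would call `choose()` to pick, say, an offer to display, observe whether the customer responded, and feed that back through `update()`. Unlike a fixed A/B test, traffic shifts toward the better option while the experiment is still running.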

3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with:

Also, the flashiest application I know of for only-moderately-successful KXEN came when one or more large retailers decided to run separate models for each of thousands of stores.

4. MongoDB, the product, has been refactored to support pluggable storage engines. In connection with that, MongoDB does/will ship with two storage engines – the traditional one and a new one from WiredTiger (but not TokuMX). Both will be equally supported by MongoDB, the company, although there surely are some tiers of support that will get bounced back to WiredTiger.

WiredTiger has the same techie principals as Sleepycat – get the wordplay?! – which was Mike Olson’s company before Cloudera. When asked, Mike spoke of those techies in remarkably glowing terms.

I wouldn’t be shocked if WiredTiger wound up playing the role for MongoDB that InnoDB played for MySQL. What I mean is that there were a lot of use cases for which the MySQL/MyISAM combination was insufficiently serious, but InnoDB turned MySQL into a respectable DBMS.

5. Hadoop’s traditional data distribution story goes something like:

Data lives on every non-special Hadoop node that does processing.
This gives the advantage of parallel data scans.
Sometimes data locality works well; sometimes it doesn’t.
Of course, if the output of every MapReduce step is persisted to disk, as is the case with Hadoop MapReduce 1, you might create some of your own data locality …
… but Hadoop is getting away from that kind of strict, I/O-intensive processing model.

However, Cloudera has noticed that some large enterprises really, really like to have storage separate from processing. Hence its recent partnership to work with EMC Isilon. Other storage partnerships, as well as a better fit with S3/object storage kinds of environments, are sure to follow, but I have no details to offer at this time.

6. Cloudera’s count of Spark users in its customer base is currently around 60. That includes everything from playing around to full production.

7. Things still seem to be going well at MemSQL, but I didn’t press for any details that I would be free to report.

8. Speaking of MemSQL, one would think that at some point something newer would replace Oracle et al. in the general-purpose RDBMS world, much as Unix and Linux grew to overshadow the powerful, secure, reliable, cumbersome IBM mainframe operating systems. On the other hand:

IBM blew away its mainframe competitors and had pretty close to a monopoly. But Oracle has some close and somewhat newer competitors in DB2 and Microsoft SQL Server. Therefore …
… upstarts have three behemoths to outdo, not just one.
MySQL, PostgreSQL and to some extent Sybase are still around as well.

Also, perhaps no replacement will be needed. If we subdivide the database management world into multiple categories including:

General-purpose RDBMS.
Analytic RDBMS.
NoSQL.
Non-relational analytic data stores (perhaps Hadoop-based).

it’s not obvious that the general-purpose RDBMS category on its own requires any new entrants to ever supplant the current leaders.

All that said – if any of the current new entrants do pull off the feat, SAP HANA is probably the best (longshot) guess to do so, and MemSQL the second-best.

9. If you’re a PostgreSQL user with performance or scalability concerns, you might want to check what Citus Data is doing.

An idealized log management and analysis system — from whom?

Curt Monash — Sun, 07 Sep 2014 12:38:52 +0000

I’ve talked with many companies recently that believe they are:

Focused on building a great data management and analytic stack for log management …
… unlike all the other companies that might be saying the same thing …
… and certainly unlike expensive, poorly-scalable Splunk …
… and also unlike less-focused vendors of analytic RDBMS (which are also expensive) and/or Hadoop distributions.

At best, I think such competitive claims are overwrought. Still, it’s a genuinely important subject and opportunity, so let’s consider what a great log management and analysis system might look like.

Much of this discussion could apply to machine-generated data in general. But right now I think more players are doing product management with an explicit conception either of log management or event-series analytics, so for this post I’ll share that focus too.

A short answer might be “Splunk, but with more analytic functionality and more scalable performance, at lower cost, plus numerous coupons for free pizza.” A more constructive and bottom-up approach might start with:

Agents for any kind of machine that emits streams of data.
Parsers that:
- Immediately identify explicit name-value pairs in popular formats such as JSON or XML.
- Also immediately extract a significant fraction of all implicit fields in text strings — timestamps for sure, but also a lot else. (Splunk is the current gold standard for such capabilities.)
- Allow you to easily write rules for more such extractions.
Immediate indexing in line with everything the parsers do.
Easy import of log files, relational tables, and other relevant data structures.
Queries that can exploit all the indexes, at least up to the functionality level of SQL 2003 analytics (including windowing) and StreamSQL, of course with …
… blazing scalable performance.
Strong workload management and concurrent performance support. (Teradata is the gold standard for such capabilities in the analytic sphere.)
Various other mature-DBMS features, e.g. in backup, manageability, and uptime.
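
As a rough illustration of the parsing step in that list, here is a small Python sketch. It is far cruder than what Splunk or any real product does, and the regular expressions, the function name and the sample formats are my own assumptions. It takes explicit name-value pairs from JSON when a line is JSON, and otherwise pulls out an implicit timestamp plus any `key=value` pairs it can find.

```python
import json
import re

# Assumed timestamp shape: 2014-09-07 12:38:52 or 2014-09-07T12:38:52
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")
# key=value pairs, where the value may be double-quoted
KV_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_line(line: str) -> dict:
    """Return the fields found in one log line."""
    line = line.strip()
    # Explicit name-value pairs: the line is a JSON object.
    if line.startswith("{"):
        try:
            obj = json.loads(line)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # fall through and treat it as free text
    # Implicit fields in a text string.
    fields = {}
    match = TIMESTAMP_RE.search(line)
    if match:
        fields["timestamp"] = match.group(0)
    for key, value in KV_RE.findall(line):
        fields[key] = value.strip('"')
    return fields
```

Each dictionary returned here is what an indexer would consume immediately, and user-written extraction rules would amount to additional patterns alongside `TIMESTAMP_RE` and `KV_RE`.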

Further, there would be numerous styles of business intelligence interface, at least including:

Generic BI like we generally see for tabular data.
Constantly-changing displays of streaming data.
BI with an event-series orientation.
Strong alerting.
Mobile versions of everything.

And there would be good support for quick-turnaround, easily-operationalized predictive analytics, of the sort that’s fairly central to the visions for Kiji and Spark.

The data management part of that is particularly hard, in that:

Different architectures seem naturally well-suited for different parts of the problem.
Maturing a new data management product is always difficult, costly and slow.

My thoughts on strengths and weaknesses of some obvious log data management contenders start:

Oracle, IBM, and Microsoft have a lot of heft in all things database. But while each of those vendors has great resources and occasionally impressive pieces of new database engineering, none shows much evidence of framing, let alone solving, the problem in the right way(s).
SAP owns Sybase, HANA, several old CEP companies, and Business Objects. Add them to the Oracle/IBM/Microsoft list.
Teradata has a lot going for them. Their core analytic data management strengths are obvious. They’ve owned Aster for a while, and Aster innovated nPath quite some time ago. They recently added Hadapt, a leader in schema-on-need, as well as Revelytix, which has some good ideas in dataset management. Like most other DBMS vendors, however, Teradata doesn’t yet have much of a story for streaming data, and anyhow the most optimistic case for Teradata involves the difficult task of stitching together disparate data management technologies.
HP Vertica has a decent position as well. Probably more proven in general concurrent, scalable performance than others in their peer group (Netezza, Greenplum, et al.), Vertica also was relatively early in innovations relevant to log analysis, including a range of time series/event series features and its own schema-on-need effort. Vertica was also founded by streaming pioneers (there were heavily overlapping groups of academics behind StreamBase, Vertica and VoltDB), but it’s not clear how that background is reflected in the present Vertica product.
Splunk, of course, has a complete stack. At the data acquisition and parsing layers, it’s second to none, and it has a considerable set of log-appropriate BI capabilities as well. And for data management it in effect is stitching together two different inverted-list data stores, plus Hadoop.
Hadoop distribution vendors such as Cloudera, MapR or Hortonworks typically bundle a range of relevant capabilities. HDFS (Hadoop Distributed File System) is the default place to dump entire logs. In most distros, Spark offers a new approach to streaming. Impala, Drill and so on offer query. Flume gathers the log data in the first place. But a lot of the cooler capabilities are immature or unproven, and in some cases that’s putting it mildly.

In the interest of length, I’ll omit discussion of smaller vendors, except to say that Platfora’s integrated-stack event series analytics story deserves attention, and I’m disappointed that I never hear about Sumo Logic. And I don’t know a lot about companies positioned as SIEM (Security Information and Event Management), especially now that SenSage has left the scene.