Exadata – DBMS 2 : DataBase Management System Services

Notes on vendor lock-in

Curt Monash — Wed, 20 Jul 2016 01:35:32 +0000

Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.

1. The most basic form of lock-in is:

You do application development for a target set of platform technologies.
Your applications can’t run without those platforms underneath.
Hence, you’re locked into those platforms.

2. Enterprise vendor standardization is closely associated with lock-in. The core idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:

That simplifies your environment, requiring less integration and interoperability.
That simplifies your staffing; the same skill sets apply to multiple needs and projects.
That simplifies your vendor support relationships; there’s “one throat to choke”.
That simplifies your price negotiation.

3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works like this:

For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
A few years later, the price is renegotiated, based on then-current levels of usage.

Thus, doing an additional project using ELAed products may appear low-cost.

Incremental license and maintenance fees may be zero in the short-term.
Incremental personnel costs may be controlled because the needed skills are already in-house.

Often those appearances are substantially correct. That’s a big reason why incumbent software is difficult to supplant unless the upstart substitute is superior in fundamental and important ways.

4. Subscriptions are closely associated with lock-in.

Most obviously, the traditional software industry gets its profits from high-margin support/maintenance services.
Cloud lock-in has rapidly become a big deal.
The open source vendors meeting lock-in resistance, noted below, have subscription business models.

Much of why customers care about lock-in is the subscription costs it’s likely to commit them to.

5. Also related to lock-in are thick single-vendor technology stacks. If you run Oracle applications, you’re going to run the Oracle DBMS too. And if you run that, you’re likely to run other Oracle software, and perhaps use Exadata hardware as well. The cloud ==> lock-in truism is an example of this point as well.

6. There’s a lot of truth to the generality that central IT cares about overall technology architecture, while line-of-business departments just want to get the job done. This causes departments to both:

Oppose standardization.
Like thick technology stacks.

Thus, departmental influence on IT both encourages and discourages lock-in.

7. IBM is all about lock-in. IBM’s support for Linux, Eclipse and so on doesn’t really contradict that. IBM’s business model is to ~~squeeze~~ serve its still-large number of strongly loyal customers as well as it can.

8. Microsoft’s business model over the decades has also greatly depended on lock-in.

Indeed, it exploited Windows/Office lock-in so vigorously as to incur substantial anti-trust difficulties.
Server-side Windows tends to be involved in thick stacks — DBMS, middleware, business intelligence, SharePoint and more. Many customers (smaller enterprises or in some cases departments) are firmly locked into these stacks.
Microsoft is making a strong cloud push with Azure, which inherently involves lock-in.

Yet sometimes, Microsoft is more free and open.

Office for Macintosh allowed the Mac to be a viable Windows competitor. (And Microsoft was well-paid for that, generating comparable revenue per Mac to what it got for each Windows PC.)
Visual Studio is useful for writing apps to run against multiple DBMS.
Just recently, Microsoft SQL Server was ported to Linux.

9. SAP applications run over several different DBMS, including its own cheap MaxDB. That counteracts potential DBMS lock-in. But some of its newer apps are HANA-specific. That, of course, has the opposite effect.

10. And with that as background, we can finally get to what led me to finally write this post. Multiple clients have complaints that may be paraphrased as:

Customers are locked into expensive traditional DBMS such as Oracle.
Yet they’re so afraid of lock-in now that they don’t want to pay for our vendor-supplied versions of open source database technologies; they prefer to roll their own.
Further confusing matters, they also are happy to use cloud technologies, including the associated database technologies (e.g. Redshift or other Amazon offerings), creating whole new stacks of lock-in.

So open source vendors of NoSQL data managers and similar technologies felt like they were the only kind of vendor suffering from fear of lock-in.

I agree with them that enterprises who feel this way are getting it wrong. Indeed:

The management of even NoSQL DBMS is a big issue, and help in that area has high cash value for customers.
Serious users need support.
Support and management tools happen to be synergistic with each other.

This is the value proposition that propelled Cloudera. It’s also a strong reason to give money to whichever of MongoDB, DataStax, Neo Technology et al. sponsors open source technology that you use.

General disclosure: My fingerprints have been on this industry strategy since before the term “NoSQL” was coined. It’s been an aspect of many different consulting relationships.

Some enterprises push back, logically or emotionally as the case may be, by observing that the best internet companies — e.g., Facebook — are allergic to paying for software, even open source. My refutations of that argument include:

Facebook has more and better engineers than you do.
Facebook has a lot more servers than you do, and would presumably face much higher prices than you would if you each chose to forgo the in-house alternative.
Facebook pays for open source software in a different way than through subscription fees — it invents and enhances it. Multiple important projects have originated at Facebook, and it contributes to many others. Are you in a position to do the same thing?

And finally — most of Facebook’s users get its service for free. (Advertisers are the ones who pay cash; all others just pay in attention to the ads.) So if getting its software for free actually does screw up its SLAs (Service Level Agreements) — well, free generally comes with poorer SLAs than paid. But if you’re in the business of serving paying customers, then you might want to have paying-customer kinds of SLAs, even on the parts of your technology — e.g. websites urging people to do business with you — that you provide for free yourself.

Related links

The technology underlying packaged applications (November, 2015, but it has a historical focus)
Topics in migration (January, 2015)
Much of the vendor advice on Strategic Messaging.

Oracle as the new IBM — has a long decline started?

Curt Monash — Thu, 31 Dec 2015 09:15:34 +0000

When I find myself making the same observation fairly frequently, that’s a good impetus to write a post based on it. And so this post is based on the thought that there are many analogies between:

Oracle and the Oracle DBMS.
IBM and the IBM mainframe.

And when you look at things that way, Oracle seems to be swimming against the tide.

Drilling down, there are basically three things that can seriously threaten Oracle’s market position:

Growth in apps of the sort for which Oracle’s RDBMS is not well-suited. Much of “Big Data” fits that description.
Outright, widespread replacement of Oracle’s application suites. This is the least of Oracle’s concerns at the moment, but could of course be a disaster in the long term.
Transition to “the cloud”. This trend amplifies the other two.

Oracle’s decline, if any, will be slow — but I think it has begun.

Oracle/IBM analogies

There’s a clear market lead in the core product category. IBM was dominant in mainframe computing. While not as dominant, Oracle is definitely a strong leader in high-end OLTP/mixed-use (OnLine Transaction Processing) RDBMS.

That market lead is even greater than it looks, because some of the strongest competitors deserve asterisks. Many of IBM’s mainframe competitors were “national champions” — Fujitsu and Hitachi in Japan, Bull in France and so on. Those were probably stronger competitors to IBM than the classic BUNCH companies (Burroughs, Univac, NCR, Control Data, Honeywell).

Similarly, Oracle’s strongest direct competitors are IBM DB2 and Microsoft SQL Server, each of which is sold primarily to customers loyal to the respective vendors’ full stacks. SAP is now trying to play a similar game.

The core product is stable, secure, richly featured, and generally very mature. Duh.

The core product is complicated to administer — which provides great job security for administrators. IBM had JCL (Job Control Language). Oracle has a whole lot of manual work overseeing indexes. In each case, there are many further examples of the point. Edit: A Twitter discussion suggests the specific issue with indexes has been long fixed.

Niche products can actually be more reliable than the big, super-complicated leader. Tandem Nonstop computers were super-reliable. Simple, “embeddable” RDBMS — e.g. Progress or SQL Anywhere — in many cases just work. Still, if you want one system to run most of your workload 24×7, it’s natural to choose the category leader.

The category leader has a great “whole product” story. Here I’m using “whole product” in the sense popularized by Geoffrey Moore, to encompass ancillary products, professional services, training, and so on, from the vendor and third parties alike. There was a time when most serious packaged apps ran exclusively on IBM mainframes. Oracle doesn’t have quite the same dominance, but there are plenty of packaged apps for which it is the natural choice of engine.

Notwithstanding all the foregoing, there’s strong vulnerability to alternative product categories. IBM mainframes eventually were surpassed by UNIX boxes, which had grown up from the minicomputer and even workstation categories. Similarly, the Oracle DBMS has trouble against analytic RDBMS specialists, NoSQL, text search engines and more.

IBM’s fate, and Oracle’s

Given that background, what does it teach us about possible futures for Oracle? The golden age of the IBM mainframe lasted 25 or 30 years — 1965-1990 is a good way to think about it, although there’s a little wiggle room at both ends of the interval. Since then it’s been a fairly stagnant cash-cow business, in which a large minority or perhaps even small majority of IBM’s customers have remained intensely loyal, while others have aligned with other vendors.

Oracle’s DBMS business seems pretty stagnant now too. There’s no new on-premises challenger to Oracle now as strong as UNIX boxes were to IBM mainframes 20-25 years ago, but as noted above, traditional competitors are stronger in Oracle’s case than they were in IBM’s. Further, the transition to the cloud is a huge deal, currently in its early stages, and there’s no particular reason to think Oracle will hold any more share there than IBM did in the transition to UNIX.

Within its loyal customer base, IBM has been successful at selling a broad variety of new products (typically software) and services, often via acquired firms. Oracle, of course, has also extended its product lines immensely from RDBMS, to encompass “engineered systems” hardware, app server, apps, business intelligence and more. On the whole, this aspect of Oracle’s strategy is working well.

That said, in most respects Oracle is weaker at account control than peak IBM.

Oracle’s core competitors, IBM and Microsoft, are stronger than IBM’s were.
DB2 and SQL Server are much closer to Oracle compatibility than most mainframes were to IBM. (Amdahl is an obvious exception.) This is especially true as of the past 10-15 years, when it has become increasingly clear that reliance on stored procedures is a questionable programming practice. Edit: But please see the discussion below challenging this claim.
Oracle (the company) is widely hated, in a way that IBM generally wasn’t.
Oracle doesn’t dominate a data center the way hardware monopolist IBM did in a hardware-first era.

Above all, Oracle doesn’t have the “Trust us; we’ll make sure your IT works” story that IBM did. Appliances, aka “engineered systems”, are a step in that direction, but those are only — or at least mainly — to run Oracle software, which generally isn’t everything a customer has.

But think of the apps!

Oracle does have one area in which it has more account control power than IBM ever did — applications. If you run Oracle apps, you probably should be running the Oracle RDBMS and perhaps an Exadata rack as well. And perhaps you’ll use Oracle BI too, at least in use cases where you don’t prefer something that emphasizes a more modern UI.

As a practical matter, most enterprise app rip-and-replace happens in a few scenarios:

Merger/acquisition. An enterprise that winds up with different apps for the same functions may consolidate and throw the loser out. I’m sure Oracle loses a few customers this way to SAP every year, and vice-versa.
Drastic obsolescence. This can take a few forms, mainly:
- Been there, done that.
- Enterprise outgrows the capabilities of the current app suite. Oracle’s not going to lose much business that way.
- Major platform shift. Going forward, that means SaaS/“cloud” (Software as a Service).

And so the main “opportunity” for Oracle to lose application market share is in the transition to the cloud.

Putting this all together …

A typical large-enterprise Oracle customer has 1000s of apps running on Oracle. The majority would be easy to port to some other system, but the exceptions to that rule are numerous enough to matter — a lot. Thus, Oracle has a secure place at that customer until such time as its applications are mainly swept away and replaced with something new.

But what about new apps? In many cases, they’ll arise in areas where Oracle’s position isn’t strong.

New third-party apps are likely to come from SaaS vendors. Oracle can reasonably claim to be a major SaaS vendor itself, and salesforce.com has a complex relationship with the Oracle RDBMS. But on the whole, SaaS vendors aren’t enthusiastic Oracle adopters.
New internet-oriented apps are likely to focus on customer/prospect interactions (here I’m drawing the (trans)action/interaction distinction) or even more purely machine-generated data (“Internet of Things”). The Oracle RDBMS has few advantages in those realms.
Further, new apps — especially those that focus on data external to the company — will in many cases be designed for the cloud. This is not a realm of traditional Oracle strength.

And that is why I think the answer to this post’s title question is probably “Yes”.

Related links

A significant fraction of my posts, in this blog and Software Memories alike, are probably at least somewhat relevant to this sweeping discussion. Particularly germane is my 2012 overview of Oracle’s evolution. Other posts to call out are my recent piece on transitioning to the cloud, and my series on enterprise application history.

Couchbase 4.0 and related subjects

Curt Monash — Thu, 15 Oct 2015 15:17:44 +0000

I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.

Now that you’re duly warned, let me remind you of aspects of the Couchbase timeline.

2 corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL.
Couchbase’s original value proposition, under the name Membase, was to provide persistence and of course support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
Couchbase 4.0 has been in beta for a few months.

Technical notes on Couchbase 4.0 — and related riffs — start:

There’s a new SQL-like language called N1QL (pronounced like “nickel”). I’m hearing a lot about SQL-on-NoSQL these days. More on that below.
“Index”, “data” and “query” are three different services/tiers.
- You can run them all on the same nodes or separately. Couchbase doesn’t have enough experience yet with the technology to know which choice will wind up as a best practice.
- I’m hearing a lot about heterogeneous-node/multi-tier DBMS architectures these days, and would no longer stand by my 2009 statement that they are unusual. Other examples include Oracle Exadata, MySQL, MongoDB (now that it has pluggable storage engines), MarkLogic, and of course the whole worlds of Hadoop and Spark.
To be clear — the secondary indexes are global, and not tied to the same nodes as the data they index.
There’s a new back end called ForestDB, but if I understood correctly, it’s used just for the indexes, not for the underlying data.
ForestDB represents Couchbase indexes in something that resembles b-trees, but also relies on tries. Indeed, if I’m reading the relevant poster correctly, it’s based on a trie of b-trees.
In another increasingly common trend, Couchbase uses Bloom filters to help decide which partitions to retrieve for any particular query.
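The Bloom filter point can be made concrete with a minimal Python sketch. This is not Couchbase’s implementation; the `BloomFilter` class, the partition names and the keys are all hypothetical, chosen only to show how a per-partition filter lets a query skip partitions that definitely don’t hold a key.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_filters(partitions):
    """Build one filter per partition from the keys stored there."""
    filters = {}
    for name, keys in partitions.items():
        bf = BloomFilter()
        for key in keys:
            bf.add(key)
        filters[name] = bf
    return filters


def partitions_to_read(filters, key):
    """Skip any partition whose filter says the key is definitely absent."""
    return [name for name, bf in filters.items() if bf.might_contain(key)]


# Hypothetical partitions and keys, for illustration only.
partitions = {
    "p0": ["user:1", "user:2"],
    "p1": ["user:3", "user:4"],
    "p2": ["user:5", "user:6"],
}
filters = build_filters(partitions)
candidates = partitions_to_read(filters, "user:3")
```

The partition that really holds the key always shows up in `candidates`. Other partitions can occasionally show up too (a false positive), which costs an unnecessary read but never a wrong answer.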

Up to a point, SQL-on-NoSQL stories can be fairly straightforward.

You define some kind of a table,* perhaps in a SQL-like DDL (Data Description Language).
SELECT, FROM and WHERE clauses work in the usual way.
Hopefully, if a column is going to have a lot of WHERE clauses on it, it also has an index.

For example, I think that’s the idea behind most ODBC/JDBC drivers for NoSQL systems. I think it’s also the idea behind most “SQL-like” languages that NoSQL vendors ship.

*Nobody I talk to about this ever wants to call it a “view”, but it sure sounds like a view to me — not a materialized view, of course, but a view nonetheless.

JOIN syntax can actually be straightforward as well under these assumptions. As for JOIN execution, Couchbase pulls all the data into the relevant tier, and does nested loop execution there. My new clients at SequoiaDB have a similar strategy, by the way, although in their case there’s a hash join option as well.
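For readers who want the difference between nested loop execution and a hash join spelled out, here is a minimal Python sketch. It assumes both inputs have already been pulled into one tier as in-memory lists of dicts; the function names and sample documents are hypothetical, and neither vendor’s actual executor is being shown.

```python
def nested_loop_join(left, right, left_key, right_key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    joined = []
    for l_row in left:
        for r_row in right:
            if l_row[left_key] == r_row[right_key]:
                joined.append({**l_row, **r_row})
    return joined


def hash_join(left, right, left_key, right_key):
    """Build a hash table on one side, then probe it once per row of the other."""
    buckets = {}
    for r_row in right:
        buckets.setdefault(r_row[right_key], []).append(r_row)
    joined = []
    for l_row in left:
        for r_row in buckets.get(l_row[left_key], []):
            joined.append({**l_row, **r_row})
    return joined


# Hypothetical documents, already pulled into the tier that runs the join.
orders = [
    {"order_id": 1, "cust": "a"},
    {"order_id": 2, "cust": "b"},
    {"order_id": 3, "cust": "a"},
]
customers = [
    {"cust_id": "a", "name": "Ann"},
    {"cust_id": "b", "name": "Bob"},
]

by_loop = nested_loop_join(orders, customers, "cust", "cust_id")
by_hash = hash_join(orders, customers, "cust", "cust_id")
```

Both functions return the same rows. The nested loop needs no extra memory but does work proportional to the product of the input sizes, while the hash join spends memory on the lookup table to get roughly linear work.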

But if things stopped there, they would miss an important complication: NoSQL has nested data. I.e., a value can actually be an array, whose entries are arrays themselves, and so on. That said, the “turtles all the way down” joke doesn’t quite apply, because at some point there are actual scalar or string values, and those are the ones SQL wants to actually operate on.

Most approaches I know of to that problem boil down to identifying particular fields as table columns, with or without aliases/renaming; I think that’s the old Hadapt/Vertica strategy, for example. Couchbase claims to be doing something a little different however, with a SQL-extending operator called UNNEST. Truth be told, I’m finding the N1QL language reference a bit terse, and haven’t figured out what the practical differences vs. the usual approach are, if any. But it sounds like there may be some interesting ideas in there somewhere.
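The two general approaches to nested data can be contrasted in a short Python sketch. This is an illustration of the ideas only, not of N1QL’s actual UNNEST semantics, which I haven’t pinned down; `project_fields`, `unnest` and the sample document are hypothetical names made up for the example.

```python
def project_fields(doc, columns):
    """Treat chosen paths as table columns; columns maps alias -> dotted path."""
    row = {}
    for alias, path in columns.items():
        value = doc
        for part in path.split("."):
            value = value.get(part) if isinstance(value, dict) else None
        row[alias] = value
    return row


def unnest(doc, array_field):
    """Emit one row per array element, repeating the parent's scalar fields."""
    parent = {k: v for k, v in doc.items() if not isinstance(v, (list, dict))}
    rows = []
    for element in doc.get(array_field, []):
        if isinstance(element, dict):
            rows.append({**parent, **element})
        else:
            rows.append({**parent, array_field: element})
    return rows


# Hypothetical document with a nested object and a nested array.
order = {
    "order_id": 7,
    "customer": {"name": "Ann", "city": "Boston"},
    "items": [
        {"sku": "x1", "qty": 2},
        {"sku": "y9", "qty": 1},
    ],
}

flat = project_fields(order, {"id": "order_id", "city": "customer.city"})
rows = unnest(order, "items")
```

Field projection yields exactly one row per document, so it suits nested objects. Unnesting yields one row per array element, which is what SQL operators such as WHERE or GROUP BY need in order to work on the scalar values inside the array.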

The point of predicate pushdown

Curt Monash — Tue, 15 Jul 2014 13:52:45 +0000

Oracle is announcing today what it’s calling “Oracle Big Data SQL”. As usual, I haven’t been briefed, but highlights seem to include:

Oracle Big Data SQL is basically data federation using the External Tables capability of the Oracle DBMS.
Unlike independent products — e.g. Cirro — Oracle Big Data SQL federates SQL queries only across Oracle offerings, such as the Oracle DBMS, the Oracle NoSQL offering, or Oracle’s Cloudera-based Hadoop appliance.
Also unlike independent products, Oracle Big Data SQL is claimed to be compatible with Oracle’s usual security model and SQL dialect.
At least when it talks to Hadoop, Oracle Big Data SQL exploits predicate pushdown to reduce network traffic.

And by the way — Oracle Big Data SQL is NOT “SQL-on-Hadoop” as that term is commonly construed, unless the complete Oracle DBMS is running on every node of a Hadoop cluster.

Predicate pushdown is actually a simple concept:

If you issue a query in one place to run against a lot of data that’s in another place, you could spawn a lot of network traffic, which could be slow and costly. However …
… if you can “push down” parts of the query to where the data is stored, and thus filter out most of the data, then you can greatly reduce network traffic.

“Predicate pushdown” gets its name from the fact that portions of SQL statements, specifically ones that filter data, are properly referred to as predicates. They earn that name because predicates in mathematical logic and clauses in SQL are the same kind of thing — statements that, upon evaluation, can be TRUE or FALSE for different values of variables or data.
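A minimal Python sketch shows the effect on network traffic. It assumes a toy `RemoteStore` class standing in for wherever the data lives, and it counts rows shipped rather than bytes; every name here is hypothetical, and no Oracle or Hadoop API is being modeled.

```python
class RemoteStore:
    """Stands in for the system where the data is stored."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_sent = 0  # counts rows shipped over the "network"

    def scan(self, predicate=None):
        """Return rows; if a predicate is given, filter before sending."""
        result = [r for r in self.rows if predicate is None or predicate(r)]
        self.rows_sent += len(result)
        return result


def query_without_pushdown(store, predicate):
    # Ship every row, then filter where the query was issued.
    return [r for r in store.scan() if predicate(r)]


def query_with_pushdown(store, predicate):
    # Ship the predicate to the data; only matching rows come back.
    return store.scan(predicate)


def is_eu(row):
    """The predicate: TRUE or FALSE for each row."""
    return row["region"] == "EU"


# Hypothetical data: 1,000 rows, of which 10 satisfy the predicate.
rows = [{"id": i, "region": "EU" if i % 100 == 0 else "US"} for i in range(1000)]

naive = RemoteStore(rows)
pushed = RemoteStore(rows)
result_naive = query_without_pushdown(naive, is_eu)
result_pushed = query_with_pushdown(pushed, is_eu)
```

Both queries return the same 10 rows, but `naive.rows_sent` is 1000 while `pushed.rows_sent` is 10. The more selective the predicate, the bigger the saving.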

The most famous example of predicate pushdown is Oracle Exadata, with the story there being:

Oracle’s shared-everything architecture created a huge I/O bottleneck when querying large amounts of data, making Oracle inappropriate for very large data warehouses.
Oracle Exadata added a second tier of servers each tied to a subset of the overall storage; certain predicates are pushed down to that tier.
The I/O between Exadata’s two sets of servers is now tolerable, and so Oracle is now often competitive in the high-end data warehousing market.

Oracle evidently calls this “SmartScan”, and says Oracle Big Data SQL does something similar with predicate pushdown into Hadoop.

Oracle also hints at using predicate pushdown to do non-tabular operations on the non-relational systems, rather than shoehorning operations on multi-structured data into the Oracle DBMS, but my details on that are sparse.

Related link

Chris Kanaracus’ coverage of the announcement quotes me at length.

21st Century DBMS success and failure

Curt Monash — Mon, 14 Jul 2014 06:37:31 +0000

As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.

DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.

In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.

Buyer inertia is a greater concern.

A significant minority of enterprises are highly committed to their enterprise DBMS standards.
Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.

A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.

First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
Meanwhile, Hadoop has emerged as a serious competitor for at least some analytic data management, especially but not only at internet companies.

Otherwise I’d say:

At large enterprises, their internet operations perhaps excepted:
- Short-request/general-purpose SQL alternatives to the behemoths — e.g. MySQL, PostgreSQL, NewSQL — have had tremendous difficulty getting established. The last big success was the rise of Microsoft SQL Server in the 1990s. That’s why I haven’t mentioned the term mid-range DBMS in years.
- NoSQL/non-SQL has penetrated large enterprises mainly for a few specific use cases, for example the lists I posted for MongoDB or graph databases.
Internet-only companies have few inertia issues when it comes to database managers. They’ll consider anything they regard as being in their price ballpark (which is however often restricted to open source). I think part of the reason is that as quickly as they rewrite their applications, DBMS are vastly less “strategic” to them than they are to most larger enterprises.
The internet operations of large companies — especially large retailers — in many cases behave like internet-only companies, but in many other cases behave like the rest of the enterprise.

The major reasons for DBMS categories to get established in the first place are:

Performance and/or scalability (many examples).
Developer features (for example dynamic schema).
License/maintenance cost (for example several open source categories).
Ease of installation and administration (for example open source again, and also data warehouse appliances).

Those same characteristics are major bases for competition among members of a new category, although as noted above behemoth-loyalty can also come into play.

Cool-vs.-weird tradeoffs are somewhat secondary among SQL DBMS.

There’s not much of a “cool” factor, because new products aren’t that different in what they do vs. older ones.
There’s not a terrible “weird” factor either, but of course any smaller offering faces FUD, and also …
… appliances are anti-strategic for many buyers (especially ones who demand a smooth path to the cloud).

They’re huge, however, in the non-SQL world. Most non-SQL data managers have a major “weird” factor. Fortunately, NoSQL and Hadoop both have huge “cool” cred to offset it. XML/XQuery unfortunately did not.

Finally, in most DBMS categories there are massive issues with product completeness, more in the area of maturity than that of whole product. The biggest whole product issues are concentrated on the matter of interoperating with other software — business intelligence tools, packaged applications (if relevant to the category), etc. Most notably, the handful of DBMS that are certified to run SAP share a huge market that other DBMS can’t touch. But BI tools are less of a differentiator — I yawn when vendors tell me they are certified for/partnered with MicroStrategy, Tableau, Pentaho and Jaspersoft, and I’m surprised at any product that isn’t.

DBMS maturity has a lot of aspects, but the toughest challenges are concentrated in two main areas:

Reliability, especially but not only in short-request use cases.
Performance across a great variety of use cases. I observe frequently that performance in best-case scenarios, performance in the lab and performance in real-world environments are much further apart than vendors like to think.

In particular:

Maturity demands seem to be much higher for SQL DBMS than for NoSQL.
- I think this is one of several reasons NoSQL has been much more successful than NewSQL.
- It’s why I think MarkLogic’s “Enterprise NoSQL” positioning is a mistake.
As for MySQL:
- MySQL wasn’t close to reliable enough for enterprises to trust it until InnoDB became the default storage engine.
- MySQL 5 point releases have added major features, or decent performance for major features. I’ll confess to having lost track of what’s been fixed and what’s still missing.
- In saying all that I’m holding MySQL to a much higher maturity standard than I’m holding NoSQL — because that’s what I think enterprise customers do.
PostgreSQL “should” be doing a lot better than it is. I have an extremely low opinion of its promoters, and not just for personal reasons. (That said, the personal reasons don’t just apply to EnterpriseDB anymore. I’ve also run out of patience waiting for Josh Berkus to retract untruths he posted about me years ago.)
SAP HANA checks boxes for performance (In-memory rah rah rah!!) and whole product (Runs SAP!!). That puts it well ahead of most other newish SQL DBMS, purely analytic ones perhaps excepted.
Any other new short-request SQL DBMS that sounds like it has traction is also memory-centric.
Analytic RDBMS are in most respects held to lower maturity standards than DBMS used for write-intensive workloads. Even so, products in the category are still frequently tripped up by considerations of concurrent performance and mixed workload management.

Related links

There have been 1,470 previous posts in the 9-year history of this blog, many of which could serve as background material for this one. A couple that seem particularly germane and didn’t already get linked above are:

The drive for uninterrupted DBMS operation.
Short-request DBMS trade-offs and alternatives.

RDBMS and their bundle-mates

Curt Monash — Sun, 10 Nov 2013 19:22:48 +0000

Relational DBMS used to be fairly straightforward product suites, which boiled down to:

A big SQL interpreter.
A bunch of administrative and operational tools.
Some very optional add-ons, often including an application development tool.

Now, however, most RDBMS are sold as part of something bigger.

Oracle has hugely thickened its stack, as part of an Innovator’s Solution strategy — hardware, middleware, applications, business intelligence, and more.
IBM has moved aggressively to a bundled “appliance” strategy. Even before that, IBM DB2 long sold much better to committed IBM accounts than as a software-only offering.
Microsoft SQL Server is part of a stack, starting with the Windows operating system.
Sybase was an exception to this rule, with thin(ner) stacks for both Adaptive Server Enterprise and Sybase IQ. But Sybase is now owned by SAP, and increasingly integrated as a business with …
… SAP HANA, which is closely associated with SAP’s applications.
Teradata has always been a hardware/software vendor. The most successful of its analytic DBMS rivals, in some order, are:
- Netezza, a pure appliance vendor, now part of IBM.
- Greenplum, an appliance-mainly vendor for most (not all) of its existence, and in particular now as a part of EMC Pivotal.
- Vertica, more of a software-only vendor than the others, but now owned by and increasingly mainstreamed into hardware vendor HP.
MySQL’s glory years were as part of the “LAMP” stack.
Various thin-stack RDBMS that once were or could have been important market players … aren’t. Examples include Progress OpenEdge, IBM Informix, and the various strays adopted by Actian.

This phenomenon is, I think, much more driven by vendors than users. Most of the examples I listed work or could work perfectly well on their own.* But relational database management systems are seen as “strategic” products, which means in particular:

They’re often expensive to adopt (software, hardware, people costs).
They’re also often expensive to switch away from.

And strategic products, high price tags, and thick product stacks commonly go together.

*Netezza is an exception. But Exadata is not; while Oracle data warehousing was in a bad technical place before Exadata, Exadata software is what cleaned the problem up.

Also relevant is that I took those examples from relatively mature RDBMS market segments: high-end OLTP (OnLine Transaction Processing)/general-purpose, mid-range OLTP/general-purpose, and analytic. Products in those sectors have had enough time to be built out. They also tend to have fairly close competitors, as the most important product features (e.g. columnar storage in analytic RDBMS, or online backup across the board) have been imitated numerous times each.

NewSQL, by way of contrast, is just as thin-stack as NoSQL is. Products in those sectors are immature; vendors are completing them first before wedding them to other technology layers. They’re also strongly differentiated; if you tell me what topology you need and which style(s) of API or DML (Data Manipulation Language) you prefer, the list of product candidates I give you may be short indeed.

HBase is the obvious exception to my “NoSQL products stand alone” generalization, but its market position is a matter of debate.

I have mixed feelings about this trend. For starters, I’m grudgingly becoming more sympathetic to DBMS/hardware bundles, notwithstanding their role as a way to gouge more money from customers than the hardware is actually worth. Why? Because of my opinion that there’s a general move toward appliances, clusters and clouds. In particular:

As DBMS become better at straddling and melding RAM, flash and disk, legitimate reasons to optimize hardware/software integration will increase.
Microsoft (with Parallel Data Warehouse) and SAP (with HANA) induce customers to adopt hardware “appliances” even though they don’t sell and profit from the hardware themselves. This shoots down the argument that appliances are only a vendor trick to squeeze out more profits.
Netezza’s super-easy installation was a really nice feature.

When it comes to RDBMS/business intelligence bundles, my thoughts start:

As a general rule, a benefit of BI is that it can get at data from lots of different sources. This speaks against tying it to a specific DBMS.
The vendor-specific evidence is mixed:
- IBM has never explained any user advantages to including Cognos in its analytic “appliance” product lines.
- Teradata did some special optimizations for MicroStrategy. This suggests that, conversely, MicroStrategy could benefit from DBMS-specific features.
- QlikView built a custom in-memory data store.
- Specialized business intelligence stacks are on the rise, although generally with a beyond-just-relational flavor.

And so I’m skeptical about RDBMS/BI integration, but willing to be persuaded otherwise.

The integration of advanced analytics with RDBMS leaves me perplexed. Gains in performance, scalability and/or development ease would seem, in many cases, too great to pass up. (E.g., the Teradata Aster 6 story, analytic libraries and all.) And indeed most analytic platform vendors report some level of adoption. But the whole thing is moving more slowly than I expected. Meanwhile in the Hadoop world, a much lesser SQL capability, Hive, seems to be integrated into other analytic processing with enthusiasm. Perhaps the problem is that enterprises have to figure out which analytic techniques to use in the first place, before they worry too much about making them efficient.

And finally, when it comes to bundling of packaged applications with RDBMS — that depends on the class of application.

At the high end, it’s almost purely a pricing ploy, as those apps are usually written for lowest-common-denominator SQL functionality, so as to preserve portability.
A lot of mid-range apps are written against a specific DBMS, which is then resold along with the app. What’s more …
… most of those apps will migrate over time to a SaaS (Software as a Service) delivery model, which allows for a wholly integrated stack. And as the Workday example teaches us, database choices for SaaS apps can be pretty imaginative.

Related links

The refactoring of everything (July, 2013)
Comments about Gartner’s comments about a bunch of DBMS products (November, 2013)
The cardinal rules of DBMS development (March, 2013)

The refactoring of everything

Curt Monash — Sat, 20 Jul 2013 16:13:02 +0000

I’ll start with three observations:

Computer systems can’t be entirely tightly coupled — nothing would ever get developed or tested.
Computer systems can’t be entirely loosely coupled — nothing would ever get optimized, in performance and functionality alike.
In an ongoing trend, there is and will be dramatic refactoring as to which connections wind up being loose or tight.

As written, that’s probably pretty obvious. Even so, it’s easy to forget just how pervasive the refactoring is and is likely to be. Let’s survey some examples first, and then speculate about consequences.

NoSQL and schema-on-read.
- The main point of relational DBMS is the Ted Codd guarantee, which says that applications and database designs can be loosely coupled. The price is that database designs for different applications are tightly coupled into one comprehensive schema.
- NoSQL and dynamic schemas turn that around. For any one application, application design and database design are tightly coupled; but database designs for different apps are often unrelated.
- What you think of these alternatives probably has a lot to do with what you think about separating the developer and DBA job functions. If you like that separation, the relational approach should look good; if you don’t, then dynamic schemas may be more suitable.
BI with dedicated data stores. Instead of just running against relational DBMS, various business intelligence tools feature proprietary data stores, often memory-centric. Two examples I’ve written about are Platfora and QlikView, but “in-memory BI” goes far beyond those two vendors.
BI integrated into operational apps. A trend that’s been developing for years is the tight integration of BI into operational apps. I’ve written mainly about Workday’s version of this, but it’s at least as big an issue going forward in competition among SAP, Oracle, and most other application vendors you can think of.
SAP HANA and competitors. It’s an understatement to call SAP HANA “overhyped”. But technology will surely someday get to the point that SAP implies is already here, with a lot of the silo-merging that that suggests. For databases that grow at much less than Moore’s Law speeds, it will be possible to integrate in-memory database capabilities that previously called for a variety of disk-based specialty systems.
Analytic application subsystems. Customer-facing analytic applications will have a whole different standard of completeness than more traditional back-office transactional ones. The base case is analytic subsystems loosely coupled to a variety of front-end technologies.
Other “real-time analytics”. I expected the short-request/analytic distinction to blur, but even so I’m astonished by the number of NoSQL and NewSQL vendors who’ve adopted “real-time analytics” as a core message. (For more on that, I refer you again to my recent webinar on the topic.)
Appliances. For a variety of technical and business reasons, vendors love selling appliances, aka “engineered systems”. Frankly, some of the integrations between hardware, operating system, and other software are tighter than others, so only some of this is a true refactoring. Anyhow, appliance stories can be heard from a large fraction of the computer industry, including for example:
- Apple and other mobile device makers.
- IBM — PureThisAndThat.
- Oracle — ExaEverything.
- Teradata.
- Microsoft — Xbox, Surface, Parallel Data Warehouse, etc.
- SAP — HANA appliances
- The whole telecom equipment industry.
Cluster management. We have entered the era of cluster computing. This has several consequences:
- Software designed to run on dedicated clusters often has cluster management software integrated in.
- Virtualization has evolved to break the old pairing between applications and the specific servers on which they run.
- OpenStack and similar cloud stacks are trying to take the evolution further.
SaaS/IaaS/DBaaS/PaaS. Software as a Service, in all its acronymic variations, can integrate software, hardware, the surrounding bricks and mortar, service, and everything else. Conversely, different SaaS systems can be a lot more stand-offish from each other than multiple vendors’ packaged apps, all running in the same data center, perhaps even on the same DBMS.

I could keep going on for a while; for example, I haven’t said anything yet about “intelligent storage”; indeed, I haven’t even mentioned analytic platforms or their SQL-on-Hadoop cousins. But hopefully I’ve run through enough different cases to justify the slightly hyperbolic title of this post.

So what are some possible implications? My candidates start:

As previously noted, I expect most computing to eventually wind up on a combination of appliances, clusters and/or clouds. That applies even to organizations whose workloads are small enough to run on single servers, because most of their computing (except for personal devices) will either be SaaS, or else the kind of public-facing internet applications that already tend to be in the cloud.
Also as previously noted, I expect traditional databases — i.e. ones that focus on human-generated data — to eventually wind up in RAM. I imagine that there will be both relational and dynamic-schema APIs to the memory-centric DBMS that manage them.
The current and near-future technology stacks underneath application suites such as Oracle’s, SAP’s or Infor’s are of little importance to their ultimate success. (Yes, that applies even to the wonder of HANA.) Much more significant will be the subsequent cloud/SaaS generation.
Similarly, I think there is plenty of opportunity for large new application software companies, SaaS or not as the case may be, just as there usually is in connection with major technological change.
Notwithstanding the various points of integration between analytics and short-request processing, there will also be analytics-only technology stacks for a long time to come.

The IT industry seems likely to remain interesting for a long time to come.

Appliances, clusters and clouds

Curt Monash — Sun, 24 Mar 2013 05:05:15 +0000

I believe:

The trend to clustered computing is sustainable.
The trend to appliances is also sustainable.
The “single” enterprise cluster is almost as much of a pipe dream as the single enterprise database.

I shall explain.

Arguments for hosting applications on some kind of cluster include:

If the workload requires more than one server — well, you’re in cluster territory!
If the workload requires less than one server — throw it into the virtualization pool.
If the workload is uneven — throw it into the virtualization pool.

Arguments specific to the public cloud include:

A large fraction of new third-party applications are SaaS (Software as a Service). Those naturally live in the cloud.
Cloud providers have efficiencies that you don’t.

That’s all pretty compelling. However, these are not persuasive reasons to put everything on a SINGLE cluster or cloud. They could as easily lead you to have your VMware cluster and your Exadata rack and your Hadoop cluster and your NoSQL cluster and your object storage OpenStack cluster — among others — all while participating in several different public clouds as well.

Why would you not move work into a cluster at all? First, if it ain’t broken, you might not want to fix it. Some of the cluster options make it easy for you to consolidate existing workloads (that’s a central goal of VMware and Exadata), but others only make sense to adopt in connection with new application projects. Second, you might just want device locality. I have a gaming-class PC next to my desk; it drives a couple of monitors; I like that arrangement. Away from home I carry a laptop computer instead. Arguments can be made for small remote-office servers as well.

To put all that more simply:

Moving existing applications to new platforms often isn’t worth the trouble.
Many needs can be best met by single, physically local devices.

Appliances are a natural form factor for single-purpose computing. It is reasonable to characterize as “appliances” — in the computing sense of the term — medical equipment, vehicles, cash machines, cash registers, enterprise security devices, home entertainment, exercise machines and, yes, refrigerators; computers, in some form, can be found almost anywhere. But appliances also are a convenient way to package enterprise systems — configurations will be correct, installation will be simpler, and fortunate software-centric appliance vendors may capture margins on hardware sales and support. And the idea of SaaS-like continuous updates to your enterprise systems seems much more reasonable in the case of a locked-down appliance-like configuration.

Circling back to the beginning, I’d say there are multiple reasons not to expect all your computing to be done on a single cluster:

You might want to use appliances that don’t fit into that cluster.
You might want to use SaaS offerings that don’t fit into that cluster.
The efficiency gains from using a single cluster aren’t that much greater than the gains from using a few of them.
You might want different parts of your computing work to be done in-house and in the public cloud.
You might want different parts of your data to be kept in different countries.
Different kinds of work might fit better onto differently-configured nodes, and current cloud/cluster technology doesn’t do a wonderful job with heterogeneity.
A lot of computing is so inherently small and local that it shouldn’t be clustered at all.

Ceteris paribus, fewer clusters are better than more of them. But all things are not equal, and it’s not reasonable to try to reduce your clusters to one — not even if that one is administered with splendid efficiency by low-cost workers, in a low-cost building, drawing low-cost electric power, in a low-cost part of the world.

Notes and links, February 17, 2013

Curt Monash — Mon, 18 Feb 2013 03:54:22 +0000

1. It boggles my mind that some database technology companies still don’t view compression as a major issue. Compression directly affects storage and bandwidth usage alike — for all kinds of storage (potentially including RAM) and for all kinds of bandwidth (network, I/O, and potentially on-server).

Trading off less-than-maximal compression so as to minimize CPU impact can make sense. Having no compression at all, however, is an admission of defeat.
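
A minimal sketch of that trade-off, using Python's standard zlib module rather than any DBMS's own compression. The sample rows and the choice of levels are made up for illustration; real ratios and CPU costs depend heavily on the data and the codec.

```python
import time
import zlib

# Hypothetical sample: repetitive, log-like rows, the kind of data that compresses well.
rows = [f"{i % 100},customer_{i % 50},2013-02-17,status_ok\n" for i in range(50_000)]
raw = "".join(rows).encode("utf-8")


def measure(level: int) -> tuple[float, float]:
    """Return (compression ratio, CPU seconds spent compressing) for one zlib level."""
    start = time.process_time()
    packed = zlib.compress(raw, level)
    elapsed = time.process_time() - start
    # Round-trip check: compression must be lossless.
    assert zlib.decompress(packed) == raw
    return len(raw) / len(packed), elapsed


if __name__ == "__main__":
    # Level 1 favors speed, level 9 favors size, level 6 is zlib's usual middle ground.
    for level in (1, 6, 9):
        ratio, seconds = measure(level)
        print(f"level {level}: ratio {ratio:.1f}x, cpu {seconds:.4f}s")
```

Even the cheapest level stores and ships far fewer bytes than the uncompressed data, which is the point: the interesting choice is how much CPU to spend, not whether to compress.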

2. People tend to misjudge Hadoop’s development pace in either of two directions. An overly expansive view is to note that some people working on Hadoop are trying to make it be all things for all people, and to somehow imagine those goals will soon be achieved. An overly narrow view is to note an important missing feature in Hadoop, and think there’s a big business to be made out of offering it alone.

At this point, I’d guess that Cloudera and Hortonworks have 500ish employees combined, many of whom are engineers. That allows for a low double-digit number of 5+ person engineering teams, along with a number of smaller projects. The most urgently needed features are indeed being built. On the other hand, a complete monument to computing will not soon emerge.

3. Schooner’s acquisition by SanDisk has led to the discontinuation of Schooner’s SQL DBMS SchoonerSQL. Schooner’s flash-optimized key-value store Membrain continues. I don’t have details, but the Membrain web page suggests both data store and cache use cases.

4. There’s considerable personnel movement at Boston-area database technology companies right now. Please ping me directly if you care.

5. I talked recently with Ashish Thusoo of Qubole. Qubole’s initial offering is a Hive-in-the-cloud, started by the guys who invented Hive. Qubole’s coolest new technical feature vs. generic Hive seems to be a disk-based columnar cache that lives with the servers, to help “smooth over the jitters” between Amazon EC2 and S3. Qubole company basics include:

Founded last year.
15 early adopters, generally from mid-sized internet companies. Some of the adopters are already paying.
12 employees.

6. In my recent When I am a VC Overlord post, I wrote:

4. I will not fund any software whose primary feature is that it is implemented in the “cloud” or via “SaaS”. A me-too product on a different platform is still a me-too product.

5. I will not fund any pitch that emphasizes the word “elastic”. Elastic is an important feature of underwear and pajamas, but even in those domains it does not provide differentiation.

Cloud/SaaS deployments give you a chance at providing superior ease of use/installation/administration, without compromising functionality — but they don’t automatically guarantee it. It’s hard work to make your customers’ lives easier.*

*This is the second consecutive post in which I’ve used a similar line. I’ll try to stop now. What’s really scary is that I was inspired by the old Frank Perdue ad “It takes a tough man to make a tender chicken.”

7. Ofir Manor of EMC is skeptical about Oracle’s claims for Hybrid Columnar Compression. But he didn’t really dig up that much dirt, except that he seems to think 10X compression is more of a ceiling than the floor that Oracle marketing suggests it is. The money quote is:

Oracle used to provide 3x compression, now it provides 10x compression, so no wonder the best references customers are seeing about 3.4x savings…

That 3X is from Oracle’s Basic Compression, which seems to be a block-level dictionary scheme.
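
For readers unfamiliar with the term, here is a generic sketch of dictionary encoding applied to one block of values. It illustrates the general technique only; it is not Oracle's implementation, and the function names are hypothetical.

```python
def encode_block(values: list[str]) -> tuple[list[str], list[int]]:
    """Dictionary-encode one block: keep each distinct value once, plus small integer codes."""
    dictionary: list[str] = []
    positions: dict[str, int] = {}
    codes: list[int] = []
    for value in values:
        if value not in positions:
            # First time this value appears in the block: add it to the dictionary.
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes


def decode_block(dictionary: list[str], codes: list[int]) -> list[str]:
    """Rebuild the original values by looking each code up in the block's dictionary."""
    return [dictionary[code] for code in codes]


if __name__ == "__main__":
    block = ["Massachusetts", "New York", "Massachusetts", "Massachusetts"]
    dictionary, codes = encode_block(block)
    print(dictionary, codes)  # ['Massachusetts', 'New York'] [0, 1, 0, 0]
    assert decode_block(dictionary, codes) == block
```

The savings come from repeated values within a block, so the achievable ratio depends on how repetitive each block happens to be.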

8. Nong Li of Cloudera wrote in praise of the code generation option in Impala. 3x performance is mentioned. What interested me was a nice observation that goes beyond Impala:

Code generation is most beneficial for queries that execute simple expressions and the interpretation overhead is most pronounced. For example, a query that is doing a regular expression match over each row is not going to benefit from code generation much because the interpretation overhead is low compared to the regex processing time.

Code generation may end up like compression — an architectural feature that DBMS just obviously should have.
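
To make the interpretation-overhead point concrete, here is a toy sketch in Python. It is not how Impala does it; the expression format and the helper names (`interpret`, `to_source`, `generate`) are hypothetical. It evaluates the same simple expression two ways: walking an expression tree for every row, versus turning the tree into a function once and reusing it.

```python
import timeit

# A tiny expression tree for: (a + b) * 2 > c
# Tuples are (operator, left, right); strings name columns; ints are literals.
EXPR = (">", ("*", ("+", "a", "b"), 2), "c")

OPS = {
    "+": lambda x, y: x + y,
    "*": lambda x, y: x * y,
    ">": lambda x, y: x > y,
}


def interpret(expr, row):
    """Walk the tree for every row: flexible, but pays dispatch cost each time."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](interpret(left, row), interpret(right, row))
    if isinstance(expr, str):
        return row[expr]
    return expr


def to_source(expr):
    """Render the tree once as a Python expression string."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return f"({to_source(left)} {op} {to_source(right)})"
    if isinstance(expr, str):
        return f"row[{expr!r}]"
    return repr(expr)


def generate(expr):
    """'Code generation': build the function once, then reuse it for every row."""
    return eval(f"lambda row: {to_source(expr)}")


if __name__ == "__main__":
    rows = [{"a": i, "b": i + 1, "c": 3 * i} for i in range(10_000)]
    compiled = generate(EXPR)
    # Both paths must agree on every row before timing means anything.
    assert [interpret(EXPR, r) for r in rows] == [compiled(r) for r in rows]
    print("interpreted:", timeit.timeit(lambda: [interpret(EXPR, r) for r in rows], number=10))
    print("generated:  ", timeit.timeit(lambda: [compiled(r) for r in rows], number=10))
```

With an expression this cheap, most of the interpreted path's time goes to tree walking and dispatch. Swap in an expensive per-row operation such as a regex match and the gap should narrow, which is the observation quoted above.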

Key questions when selecting an analytic RDBMS

Curt Monash — Wed, 06 Feb 2013 16:32:28 +0000

I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:

How big is your database? How big is your budget?
How do you feel about appliances?
How do you feel about the cloud?
What are the size and shape of your workload?
How fresh does the data need to be?

Let’s drill down.

How big is your database? How big is your budget?

Taken together, these questions tell you which choices are even feasible. Does your database fit into RAM, at a price you can afford? Does it fit onto a single, perhaps large, server? If both answers are “No”, then you need a real scale-out system, querying disk or flash (which itself could be hard to afford). Otherwise, you have more options.

Note that database compression has a big influence on what fits where.
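
The feasibility questions above can be sketched as a small decision function. Everything in it is an illustrative assumption: the function name, the capacity figures in the usage example, and the simplification of dividing raw size by a single compression ratio.

```python
def feasible_architectures(raw_tb: float, compression_ratio: float,
                           affordable_ram_tb: float, single_server_tb: float) -> list[str]:
    """List deployment options that can hold the compressed database.

    All thresholds are caller-supplied assumptions, not vendor figures.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    stored_tb = raw_tb / compression_ratio
    options = []
    if stored_tb <= affordable_ram_tb:
        options.append("in-memory")
    if stored_tb <= single_server_tb:
        options.append("single server")
    # Scale-out on disk or flash can hold any size in principle;
    # it is the only choice when both tests above fail.
    options.append("scale-out on disk/flash")
    return options


if __name__ == "__main__":
    # 100 TB raw, 2 TB of affordable RAM, 30 TB on one large server (made-up numbers).
    print(feasible_architectures(100, 1, 2, 30))  # ['scale-out on disk/flash']
    print(feasible_architectures(100, 4, 2, 30))  # ['single server', 'scale-out on disk/flash']
```

The two calls differ only in the assumed compression ratio, which is enough to move the same database from scale-out-only territory onto a single server.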

How do you feel about appliances?

Depending on considerations such as database size, the choice of Oracle, Teradata, IBM Netezza, or Microsoft SQL Server may mandate or at least strongly suggest an appliance form factor. For most other analytic DBMS, an appliance is more optional. Are appliances good for you? Bad? Indifferent? Trade-offs include:

Appliances often involve paying a premium for hardware purchase and/or support.
Appliances often are easy(ier) to install and manage.
Appliances are easier to upgrade in some ways (everything’s integrated), but harder in others (less ability to upgrade bottlenecked parts).
Appliances often don’t play well in the cloud.

How do you feel about the cloud?

Analytic DBMS run better on good hardware and predictable bandwidth (hence all those appliances). These can be hard to find in the cloud. Not coincidentally, so can analytic DBMS references, although most vendors can muster a few.

If you feel you need to run your analytic RDBMS in the cloud now, check references carefully. If you only are concerned about the cloud as some indefinite future, then you might want to rule out a few appliance-only vendors, but otherwise you probably shouldn’t worry. Cloud hardware and networking are getting better, and RDBMS software vendors are gaining experience in cloud deployments.

What are the size and shape of your workload?

Different analytic databases can have very different kinds of workloads. Tasks include:

Complex, long-running queries.
Repetitive reports of varying degrees of complexity.
Simple queries.
Large, scheduled loads.
Continuous or near-continuous/micro-batch loads.

The big issue is: how many of each kind of task need to be performed concurrently, and in what combinations? If you’re refreshing 10,000 dashboards, several hundred of which might be getting drill-down queries at once, while trying to do a few scan-heavy queries in the background and some 15-way joins, most analytic DBMS might disappoint you. (Indeed, I’d ask whether you might want to split up that work among two or more systems.) Different DBMS, and different hardware/storage/networking configurations, shine in different scenarios.

How fresh does the data need to be?

Any serious analytic DBMS can be loaded daily or hourly, edge cases perhaps excepted. In most cases 15-minute intervals work as well, or even 5-minute ones, but check whether those load latencies would interfere with any performance optimizations. But if you want sub-second data freshness, or even several-second freshness, then that has to be a top-tier architectural issue.

If your analytics are simple enough, it’s appealing to do the immediate-response ones straight from your transactional database. If not, you may need some kind of streaming-replication setup. Usually, I wind up recommending replication approaches that don’t yet have a lot of maturity or references. Tread carefully here.

Related links

Eight kinds of analytic database (July 2011, 2-part post)
How to select a data warehouse DBMS (February 2009, slide deck)