Database diversity – DBMS 2 : DataBase Management System Services

Some stuff that’s always on my mind

Curt Monash — Sun, 20 May 2018 19:27:21 +0000

I have a LOT of partially-written blog posts, but am struggling to get any of them finished (obviously). Much of the problem is that they have so many dependencies on each other. Clearly, then, I should consider refactoring my writing plans.

So let’s start with this. Here, in no particular order, is a list of some things that I’ve said in the past, and which I still think are or should be of interest today. It’s meant to be background for numerous posts I write in the near future, and indeed a few hooks for such posts are included below.

1. Data(base) management technology is progressing pretty much as I expected.

Vendors generally recognize that maturing a data store is an important, many-years-long process.
Multiple kinds of data model are viable …
… but it’s usually helpful to be able to do some kind of JOIN.
To deal with the variety of hardware/network/storage arrangements out there, layering/tiering is on the rise. (An amazing number of vendors each seem to think they invented the idea.)

2. Rightly or wrongly, enterprises are often quite sloppy about analytic accuracy.

My two central examples have long been inaccurate metrics and false-positive alerts.
In predictive analytics, it’s straightforward to quantify how much additional value you’re leaving on the table with your imperfect accuracy.
Enterprise search and other text technologies are still often terrible.
After years of “real-time” overhype, organizations have seemingly swung to under-valuing real-time analytics.

3. Outside traditional enterprises, the accuracy problem can be even worse, and the consequences of analytic inaccuracy can be severe. In some cases this is well understood; autonomous vehicle researchers, for example, seem properly attentive to the challenge of not-killing-pedestrians. But in others it’s a mess. For example, I don’t think the “fake news on social media” challenge will be resolved without new technical approaches that, to my knowledge, aren’t yet even being tried.

4. More generally, I’ve long argued that the technology industry would someday have to deal with a variety of public policy and social concerns. That day has come. In anticipation, I wrote at length about privacy/surveillance, and a little about some other areas, including net neutrality, patents, economic development, and public technology spending. Missing subjects include censorship (private and public alike), and perhaps also the efforts to tie data ownership into anti-trust policy.

5. Given all the tech-specific public policy work that’s needed, I’m pulling back from some my broader political efforts. However, I stand by my overview opinions of last February, and I delivered on some of its IOUs in a two-part series on persuasion.

6. The ongoing rise of “edge computing” and the “Internet of Things” fit into the general trend that in 2013 I summarized as appliances, clusters and clouds.

7. I continue to think that a huge fraction of analytics is properly characterized as monitoring. That ties into a number of areas of interest. For example:

Platform technologies — including distributed data management — are often compete on the maturity of their built-in monitoring.
My complaints about BI inaccuracy commonly relate to use cases in monitoring.
Privacy/surveillance issues are commonly about monitoring. It’s common to worry that such monitoring is actually too accurate.
But I also worry that privacy/surveillance monitoring isn’t accurate enough … and hence that it leads to people being discriminated against who absolutely don’t need to be.
Edge computing involves a lot of devices that need to be monitored.
Censorship obviously has a lot to do with monitoring.

8. And finally for now, my core precepts for strategic messaging haven’t changed.

Related link

As you may have already guessed, the title of this post is based on a classic song.

Analytics on the edge?

Curt Monash — Fri, 30 Jun 2017 08:27:18 +0000

There’s a theory going around to the effect that:

Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.

There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.

1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.

2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.

3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration.

There are any number of situations in which decisions are made on or about remote systems, based on models or rules that should be improved over time. For example, such decisions might be made in:

Machine vision or other “recognition”-oriented areas of AI.
Detection or prediction of malfunctions.
Choices as to what data is significant enough to ship back upstream.

In the canonical case, we might envision a system in which:

Huge amounts of data are collected and are used to make real-time decisions.
The models are trained centrally, and updated remotely over time as they are improved.
The remote systems can only ship back selected or aggregated data to help train the models.

This all seems like an awkward fit for any common computing architecture I can think of.

But it’s hard to pin down important examples of that “canonical” case. The story implicitly assumes:

A model is widely deployed.
The model does a decent job but not a perfect one.
Based on its successes and failures, the model gets improved.

And now we’re begging a huge question: What exactly is there that keeps score as to when the model succeeds and fails? Mathematically speaking, I can’t imagine what a general answer would be like.

4. So when it comes to predictive models executed on real-world appliances I think that analytic workflows will:

Differ for different (categories) of applications.
Rely in most cases on simple patterns of data movement, such as:
- Stream everything to central servers and sort it out there, or if that’s not workable …
- … instrument a limited number of test nodes to store everything, and recover the data in batch for analysis.
- Update models only in timeframes that you’re doing a full app update/refresh.

And with that much of the apparent need for fancy distributed analytic architectures evaporates.

5. Finally, and notwithstanding the previous point: Across many use cases, there’s some kind of remote log data being shipped back to a central location. It may be the complete log. It may be periodic aggregates. It may happen only what the edge nodes regard as significant events. But something is getting shipped home.

The architectures for shipping, receiving and analyzing such data are in many cases immature. That’s obvious if there’s any kind of streaming involved, or if analysis is done in Spark. Ditto if there’s anything we might call “non-tabular business intelligence”. As this stuff matures, it will in many cases fit very well with today’s cloud thinking. But in any case — it needs to mature.

Truth be told, even the relational case is immature, in that it can easily rely on what I called:

data warehouses (perhaps really data marts) that are updated in human real-time

That quote is from a recent post about Kudu, which:

Is designed for exactly that use case.
Went GA early this year.

As always, technology is in flux.

Related links

Interana is another example of very new technology that seems applicable to these use cases.
My 2013 post on the future of IT architectures still rings true.

MongoDB 3.4 and “multimodel” query

Curt Monash — Wed, 23 Nov 2016 12:01:25 +0000

“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the train as well, but they’ve taken a clear and interesting stance:

A query layer with multiple ways to query and analyze data.
A separate data storage layer in which you have a choice of data storage engines …
… each of which has the same logical (JSON-based) data structure.

When I pointed out that it would make sense to call this “multimodel query” — because the storage isn’t “multimodel” at all — they quickly agreed.

To be clear: While there are multiple ways to read data in MongoDB, there’s still only one way to write it. Letting that sink in helps clear up confusion as to what about MongoDB is or isn’t “multimodel”. To spell that out a bit further:

In query, MongoDB mixes multiple paradigms for DML (Data Manipulation Language). The main one is of course JSON.
When writing, the DML paradigm is unmixed — it’s just JSON.

Further, MongoDB query DML statements can be mixed with analytic functions rooted in Spark.

The main ways to query data in MongoDB, to my knowledge, are:

Native/JSON. Duh.
SQL.
- MongoDB has used MySQL as a guide to what SQL coverage they think the market is calling for.
- More to the point, they’re trying to provide enough SQL so that standard business intelligence tools work well (enough) against MongoDB.
- I neglected to ask why this changed from MongoDB’s adamantly non-SQL approach of 2 1/2 years ago.
Search.
- MongoDB has been adding text search features for a few releases.
- MongoDB’s newest search feature revolves around “facets”, in the Endeca sense of the term. MongoDB characterizes as a kind of text-oriented GroupBy.
Graph. MongoDB just introduced a kind of recursive join capability, which is useful for detecting multi-hop relationships (e.g. ancestor/descendant rather than just parent/child). MongoDB declares that the “graph” box is thereby checked.

Three years ago, in an overview of layered and multi-DML architectures, I suggested:

Layered DBMS and multimodel functionality fit well together.
Both carried performance costs.
In most cases, the costs could be affordable.

MongoDB seems to have bought strongly into that view on the query side — which is, of course, exactly the right way for them to have started.

Notes on the transition to the cloud

Curt Monash — Tue, 04 Oct 2016 02:22:21 +0000

1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:

The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
Software as a service, aka SaaS.
Co-location in off-premises data centers, aka colo.
On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.

Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.

This is a good example of Monash’s Laws of Commercial Semantics.

2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.

This fact now seems to be widely understood.

3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.

This is frequently a good reason for new — i.e. “greenfield” — apps to run in the cloud.

4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.

In connection to that point, it might be interesting to note:

In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
Other big users were hospitals and stockbrokers.
The US intelligence agencies are building out their own shared, dedicated cloud.

5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.

In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.

6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.

7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:

Does your application need to run close to your customers’ largest databases?
Do your customers still avoid the public cloud?

If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.

8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:

Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
Departments don’t necessarily care about issues that give central IT pause.
Departments sometimes buy things that only are available via remote delivery, e.g. narrowly focused SaaS applications or market data.

I discussed some of this in my recent post on vendor lock-in.

9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.

Those concerns seem to have been largely alleviated.

10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.

11. If you think about system requirements:

There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
Server-side computing, however, is more homogenous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.

I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.

So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.

Otherwise, clusters will probably be organized by database, in the most expansive sense of term. For example, there could be separate clusters for:

Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
Relational data warehousing, whether in an analytic RDBMS or otherwise.
Log files, perhaps managed in Hadoop or Splunk.
Your website and other internet back-ends, perhaps running over NoSQL data stores.
Text documents managed by some kind of search engine.
Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)

Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.

Oracle as the new IBM — has a long decline started?

Curt Monash — Thu, 31 Dec 2015 09:15:34 +0000

When I find myself making the same observation fairly frequently, that’s a good impetus to write a post based on it. And so this post is based on the thought that there are many analogies between:

Oracle and the Oracle DBMS.
IBM and the IBM mainframe.

And when you look at things that way, Oracle seems to be swimming against the tide.

Drilling down, there are basically three things that can seriously threaten Oracle’s market position:

Growth in apps of the sort for which Oracle’s RDBMS is not well-suited. Much of “Big Data” fits that description.
Outright, widespread replacement of Oracle’s application suites. This is the least of Oracle’s concerns at the moment, but could of course be a disaster in the long term.
Transition to “the cloud”. This trend amplifies the other two.

Oracle’s decline, if any, will be slow — but I think it has begun.

Oracle/IBM analogies

There’s a clear market lead in the core product category. IBM was dominant in mainframe computing. While not as dominant, Oracle is definitely a strong leader in high-end OTLP/mixed-use (OnLine Transaction Processing) RDBMS.

That market lead is even greater than it looks, because some of the strongest competitors deserve asterisks. Many of IBM’s mainframe competitors were “national champions” — Fujitsu and Hitachi in Japan, Bull in France and so on. Those were probably stronger competitors to IBM than the classic BUNCH companies (Burroughs, Univac, NCR, Control Data, Honeywell).

Similarly, Oracle’s strongest direct competitors are IBM DB2 and Microsoft SQL Server, each of which is sold primarily to customers loyal to the respective vendors’ full stacks. SAP is now trying to play a similar game.

The core product is stable, secure, richly featured, and generally very mature. Duh.

The core product is complicated to administer — which provides great job security for administrators. IBM had JCL (Job Control Language). Oracle has a whole lot of manual work overseeing indexes. In each case, there are many further examples of the point. Edit: A Twitter discussion suggests the specific issue with indexes has been long fixed.

Niche products can actually be more reliable than the big, super-complicated leader. Tandem Nonstop computers were super-reliable. Simple, “embeddable” RDBMS — e.g. Progress or SQL Anywhere — in many cases just work. Still, if you want one system to run most of your workload 24×7, it’s natural to choose the category leader.

The category leader has a great “whole product” story. Here I’m using “whole product” in the sense popularized by Geoffrey Moore, to encompass ancillary products, professional services, training, and so on, from the vendor and third parties alike. There was a time when most serious packaged apps ran exclusively on IBM mainframes. Oracle doesn’t have quite the same dominance, but there are plenty of packaged apps for which it is the natural choice of engine.

Notwithstanding all the foregoing, there’s strong vulnerability to alternative product categories. IBM mainframes eventually were surpassed by UNIX boxes, which had grown up from the minicomputer and even workstation categories. Similarly, the Oracle DBMS has trouble against analytic RDBMS specialists, NoSQL, text search engines and more.

IBM’s fate, and Oracle’s

Given that background, what does it teach us about possible futures for Oracle? The golden age of the IBM mainframe lasted 25 or 30 years — 1965-1990 is a good way to think about it, although there’s a little wiggle room at both ends of the interval. Since then it’s been a fairly stagnant cash-cow business, in which a large minority or perhaps even small majority of IBM’s customers have remained intensely loyal, while others have aligned with other vendors.

Oracle’s DBMS business seems pretty stagnant now too. There’s no new on-premises challenger to Oracle now as strong as UNIX boxes were to IBM mainframes 20-25 years ago, but as noted above, traditional competitors are stronger in Oracle’s case than they were in IBM’s. Further, the transition to the cloud is a huge deal, currently in its early stages, and there’s no particular reason to think Oracle will hold any more share there than IBM did in the transition to UNIX.

Within its loyal customer base, IBM has been successful at selling a broad variety of new products (typically software) and services, often via acquired firms. Oracle, of course, has also extended its product lines immensely from RDBMS, to encompass “engineered systems” hardware, app server, apps, business intelligence and more. On the whole, this aspect of Oracle’s strategy is working well.

That said, in most respects Oracle is weaker at account control than peak IBM.

Oracle’s core competitors, IBM and Microsoft, are stronger than IBM’s were.
DB2 and SQL Server are much closer to Oracle compatibility than most mainframes were to IBM. (Amdahl is an obvious exception.) This is especially true as of the past 10-15 years, when it has become increasingly clear that reliance on stored procedures is a questionable programming practice. Edit: But please see the discussion below challenging this claim.
Oracle (the company) is widely hated, in a way that IBM generally wasn’t.
Oracle doesn’t dominate a data center the way hardware monopolist IBM did in a hardware-first era.

Above all, Oracle doesn’t have the “Trust us; we’ll make sure your IT works” story that IBM did. Appliances, aka “engineered systems”, are a step in that direction, but those are only — or at least mainly — to run Oracle software, which generally isn’t everything a customer has.

But think of the apps!

Oracle does have one area in which it has more account control power than IBM ever did — applications. If you run Oracle apps, you probably should be running the Oracle RDBMS and perhaps an Exadata rack as well. And perhaps you’ll use Oracle BI too, at least in use cases where you don’t prefer something that emphasizes a more modern UI.

As a practical matter, most enterprise app rip-and-replace happens in a few scenarios:

Merger/acquisition. An enterprise that winds up with different apps for the same functions may consolidate and throw the loser out. I’m sure Oracle loses a few customers this way to SAP every year, and vice-versa.
Drastic obsolescence. This can take a few forms, mainly:
- Been there, done that.
- Enterprise outgrows the capabilities of the current app suite. Oracle’s not going to lose much business that way.
- Major platform shift. Going forward, that means SaaS/”cloud” (Software as a Service).

And so the main “opportunity” for Oracle to lose application market share is in the transition to the cloud.

Putting this all together …

A typical large-enterprise Oracle customer has 1000s of apps running on Oracle. The majority would be easy to port to some other system, but the exceptions to that rule are numerous enough to matter — a lot. Thus, Oracle has a secure place at that customer until such time as its applications are mainly swept away and replaced with something new.

But what about new apps? In many cases, they’ll arise in areas where Oracle’s position isn’t strong.

New third-party apps are likely to come from SaaS vendors. Oracle can reasonably claim to be a major SaaS vendor itself, and salesforce.com has a complex relationship with the Oracle RDBMS. But on the whole, SaaS vendors aren’t enthusiastic Oracle adopters.
New internet-oriented apps are likely to focus on customer/prospect interactions (here I’m drawing the (trans)action/interaction distinction) or even more purely machine-generated data (“Internet of Things”). The Oracle RDBMS has few advantages in those realms.
Further, new apps — especially those that focus on data external to the company — will in many cases be designed for the cloud. This is not a realm of traditional Oracle strength.

And that is why I think the answer to this post’s title question is probably “Yes”.

Related links

A significant fraction of my posts, in this blog and Software Memories alike, are probably at least somewhat relevant to this sweeping discussion. Particularly germane is my 2012 overview of Oracle’s evolution. Other posts to call out are my recent piece on transitioning to the cloud, and my series on enterprise application history.

Readings in Database Systems

Curt Monash — Thu, 10 Dec 2015 12:26:40 +0000

Mike Stonebraker and Larry Ellison have numerous things in common. If nothing else:

They’re both titanic figures in the database industry.
They both gave me testimonials on the home page of my business website.
They both have been known to use the present tense when the future tense would be more accurate.

I mention the latter because there’s a new edition of Readings in Database Systems, aka the Red Book, available online, courtesy of Mike, Joe Hellerstein and Peter Bailis. Besides the recommended-reading academic papers themselves, there are 12 survey articles by the editors, and an occasional response where, for example, editors disagree. Whether or not one chooses to tackle the papers themselves — and I in fact have not dived into them — the commentary is of great interest.

But I would not take every word as the gospel truth, especially when academics describe what they see as commercial market realities. In particular, as per my quip in the first paragraph, the data warehouse market has not yet gone to the extremes that Mike suggests,* if indeed it ever will. And while Joe is close to correct when he says that the company Essbase was acquired by Oracle, what actually happened is that Arbor Software, which made Essbase, merged with Hyperion Software, and the latter was eventually indeed bought by the giant of Redwood Shores.**

*When it comes to data warehouse market assessment, Mike seems to often be ahead of the trend.

**Let me interrupt my tweaking of very smart people to confess that my own commentary on the Oracle/Hyperion deal was not, in retrospect, especially prescient.

Mike pretty much opened the discussion with a blistering attack against hierarchical data models such as JSON or XML. To a first approximation, his views might be summarized as:

Logical hierarchical models can be OK in certain cases. In particular, JSON could be a somewhat useful datatype in an RDBMS.
Physical hierarchical models are horrible.
Rather, you should implement the logical hierarchical model over a columnar RDBMS.

My responses start:

Nested data structures are more important than Mike’s discussion seems to suggest.
Native XML and JSON stores are apt to have an index on every field. If you squint, that index looks a lot like a column store.
Even NoSQL stores should and I think in most cases will have some kind of SQL-like DML (Data Manipulation Language). In particular, there should be some ability to do joins, because total denormalization is not always a good choice.

In no particular order, here are some other thoughts about or inspired by the survey articles in Readings in Database Systems, 5th Edition.

I agree that OLTP (OnLine Transaction Processing) is transitioning to main memory.
I agree with the emphasis on “data in motion”.
While I needle him for overstating the speed of the transition, Mike is right that columnar architectures are winning for analytics. (Or you could say they’ve won, if you recognize that mop-up from the victory will still take 1 or 2 decades.)
The guys seem to really hate MapReduce, which is an old story for Mike, but a bit of a reversal for Joe.
MapReduce is many things, but it’s not a data model, and it’s also not something that Hadoop 1.0 was an alternative to. Saying each of those things was sloppy writing.
The guys characterize consistency/transaction isolation as a rather ghastly mess. That part was an eye-opener.
Mike is a big fan of arrays. I suspect he’s right in general, although I also suspect he’s overrating SciDB. I also think he’s somewhat overrating the market penetration of cube stores, aka MOLAP.
The point about Hadoop (in particular) and modern technologies in general showing the way to modularization of DBMS is an excellent one.
Joe and Mike disagreed about analytics; Joe’s approach rang truer for me. My own opinion is:
- Business intelligence has been important for quite a while, and won’t stop.
- Machine learning is becoming ever more important.
- It’s still early days for the integration of the two areas, but much more will come.
The challenge of whether anybody wants to do machine learning (or other advanced analytics) over a DBMS is sidestepped in part by the previously mentioned point about the modularization of a DBMS. Hadoop, for example, can be both an OK analytic DBMS (although not fully competitive with mature, dedicated products) and of course also an advanced analytics framework.
Similarly, except in the short-term I’m not worried about the limitations of Spark’s persistence mechanisms. Almost every commercial distribution of Spark I can think of is part of a package that also contains a more mature data store.
Versatile DBMS and analytic frameworks suffer strategic contention for memory, with different parts of the system wanting to use it in different ways. Raising that as a concern about the integration of analytic DBMS with advanced analytic frameworks is valid.
I used to overrate the importance of abstract datatypes, in large part due to Mike’s influence. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. A big part of the problem is what I mentioned in the previous point — different parts of a versatile DBMS would prefer to do different things with memory.
I used to overrate the importance of user-defined functions in an analytic RDBMS. Mike had nothing to do with my error. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. Looser coupling between analytics and data management seems more flexible.
Excellent points are made about the difficulties of “First we build the perfect schema” data warehouse projects and, similarly, MDM (Master Data Management).
There’s an interesting discussion that helps explain why optimizer progress is so slow (both for the industry in general and for each individual product).

Related links

I did a deep dive into MarkLogic’s indexing strategy in 2008, which informed my comment about XML/JSON stores above.
Again with MarkLogic as the focus, in 2010 I was skeptical about document stores not offering joins. MarkLogic has since capitulated.
I’m not current on SciDB, but I did write a bit about it in 2010.
I’m surprised that I can’t find a post to point to about modularization of DBMS. I’ll leave this here as a placeholder until I can.
Edit: As promised, I’ve now posted about the object-relational/abstract datatype boom of the 1990s.

Differentiation in data management

Curt Monash — Mon, 26 Oct 2015 19:32:34 +0000

In the previous post I broke product differentiation into 6-8 overlapping categories, which may be abbreviated as:

Scope
Accuracy
(Other) trustworthiness
Speed
User experience
Cost

and sometimes also issues in adoption and administration.

Now let’s use this framework to examine two market categories I cover — data management and, in separate post, business intelligence.

Applying this taxonomy to data management:

Scope: Different subcategories of data management technology are suitable for different kinds of data, different scale of data, etc. To a lesser extent that may be true within a subcategory as well.
Scope: Further, products may differ in what you can do with the data, especially analytically.
Accuracy: Don’t … lose … data.
Other trustworthiness:
- Uptime, availability and so on are big deals in many data management sectors.
- Security is hugely important for data that both belongs to other people — usually your customers — and is accessible via the internet. It’s important in numerous other database use cases as well.
- Awkwardly, the CAP Theorem teaches us that there can be a bit of a trade-off between availability and (temporary) accuracy.
Speed:
- Different kinds of data management products perform differently in different use cases.
- If your use case is down the middle of what a mature data management subsector focuses on, performance may not vary much among individual leading products.
- Even then, tuning effort may be quite different for different products.
User experience:
- Users rarely interact directly with database management products.
- There can be clear differentiation in database administration UIs. (The most dramatic example was perhaps the rise of Microsoft SQL Server.)
- Data manipulation languages (DMLs) can make a huge difference in programmers’ lives.
Cost:
- License and maintenance costs can be a huge issue, especially if you’re buying from traditional vendors.
- Performance affects cost in a few ways: hardware costs for sure, tuning effort in some cases, and occasionally even vendor license/maintenance fees.
- Ongoing operations costs can vary greatly by database product in general, and by your pre-existing in-house expertise in particular.
- Ease of programming can sometimes lead to significant programming cost differences as well.
Adoption: This one is often misunderstood.
- The effort of adopting new database technology for new applications is often overrated. When projects are huge, it’s often because of what you’re doing with the technology, not because of the technology itself.
- Migration, however, is usually a bitch.

For reasons of length, I’m doing a separate post on differentiation in business intelligence.

Notes on packaged applications (including SaaS)

Curt Monash — Thu, 08 Oct 2015 00:27:56 +0000

1. The rise of SAP (and later Siebel Systems) was greatly helped by Anderson Consulting, even before it was split off from the accounting firm and renamed as Accenture. My main contact in that group was Rob Kelley, but it’s possible that Brian Sommer was even more central to the industry-watching part of the operation. Brian is still around, and he just leveled a blast at the ERP* industry, which I encourage you to read. I agree with most of it.

*Enterprise Resource Planning

Brian’s argument, as I interpret it, boils down mainly to two points:

Big ERP companies selling big ERP systems are pathetically slow at adding new functionality. He’s right. My favorite example is the multi-decade slog to integrate useful analytics into operational apps.
The world of “Big Data” is fundamentally antithetical to the design of current-generation ERP systems. I think he’s right in that as well.

I’d add that SaaS (Software As A Service)/on-premises tensions aren’t helping incumbent vendors either.

But no article addresses all the subjects it ideally should, and I’d like to call out two omissions. First, what Brian said is in many cases applicable just to large and/or internet-first companies. Plenty of smaller, more traditional businesses could get by just fine with no more functionality than is in “Big ERP” today, if we stipulate that it should be:

Delivered via SaaS.
Much easier to adopt and use.

Second, even within the huge enterprise/huge app vendor world, it’s not entirely clear how integrated ERP supposedly is or isn’t with CRM (Customer Relationship Management). And a lot of what Brian talks about fits pretty cleanly into the CRM bucket.

2. In any case, there are many application areas that — again assuming that we’re in the large enterprise or large internet company world — fit well neither with classical ERP nor with its CRM sibling. For starters, investigative analytics doesn’t fit well into packaged application suites, for a myriad of reasons, the most basic of which are:

The whole point of investigative analytics is to discover things that are new. Therefore, business processes are inherently unpredictable.
So are data inputs.

If somebody does claim to be selling an app in investigative analytics, it is usually really an analytic application subsystem or else something very disconnected from other apps. Indeed, in almost all cases it’s both.

3. When it comes to customer-facing websites, I stand by my arguments three years ago in the post just linked above, which boil down to:

What I just said above about investigative analytics, plus the observation that …
… websites have a strong creative aspect that fits badly with soup-to-nuts packaged applications.

Also, complex websites are likely to rely on dynamic schemas, and packaged apps have trouble adapting to those.

4. This is actually an example of a more general point — packaged or SaaS apps generally assume rather fixed schemas. (The weasel word “rather” is included to allow for customization-through-configuration, but I think the overall point holds.) Indeed, database design is commonly the essence of packaged app technology.

5. However, those schemas do not have to be relational. It would be inaccurate to say that packaged apps always assume tabular data, because of examples such as:

SAP has built on top of quasi-objects for a long time, although the underpinnings are technically relational.
There are some cases of building entirely on an object-oriented or hierarchical data model, especially in health care.
Business has some inherent hierarchies that get reflected in data structures, e.g. in bills of materials or organization charts.

But even non-tabular data structures are, in the minds of app developers, usually assumed to have fixed schemas.

Related links

The refactoring of everything (July, 2013)
Schema issues for complex websites (October, 2011)
Workday’s architecture (June, 2012)

Multi-model database managers

Curt Monash — Mon, 24 Aug 2015 08:07:00 +0000

I’d say:

Multi-model database management has been around for decades. Marketers who say otherwise are being ridiculous.
Thus, “multi-model”-centric marketing is the last refuge of the incompetent. Vendors who say “We have a great DBMS, and by the way it’s multi-model (now/too)” are being smart. Vendors who say “You need a multi-model DBMS, and that’s the reason you should buy from us” are being pathetic.
Multi-logical-model data management and multi-latency-assumption data management are greatly intertwined.

Before supporting my claims directly, let me note that this is one of those posts that grew out of a Twitter conversation. The first round went:

Merv Adrian: 2 kinds of multimodel from DBMS vendors: multi-model DBMSs and multimodel portfolios. The latter create more complexity, not less.

Me: “Owned by the same vendor” does not imply “well integrated”. Indeed, not a single example is coming to mind.

Merv: We are clearly in violent agreement on that one.

Around the same time I suggested that Intersystems Cache’ was the last significant object-oriented DBMS, only to get the pushback that they were “multi-model” as well. That led to some reasonable-sounding justification — although the buzzwords of course aren’t from me — namely:

Caché supports #SQL, #NoSQL. Interchange across tables, hierarchical, document storage.

Along the way, I was reminded that some of the marketing claims around “multi-model” are absurd. For example, at the time I am writing this, the Wikipedia article on “multi-model database” claims that “The first multi-model database was OrientDB, created in 2010…” In fact, however, by the definitions used in that article, multi-model DBMS date back to the 1980s, when relational functionality was grafted onto pre-relational systems such as TOTAL and IDMS.

What’s more, since the 1990s, multi-model functionality has been downright common, specifically in major products such as Oracle, DB2 and Informix, not to mention PostgreSQL. (But not so much Microsoft or Sybase.) Indeed, there was significant SQL standards work done around datatype extensions, especially in the contexts of SQL/MM and SQL3.

I tackled this all in 2013, when I argued:

One database to rule them all systems aren’t very realistic, but even so, …
… single-model systems will become increasingly obsolete.

Developments since then have been in line with my thoughts. For example, Spark added DataFrames, which promise substantial data model flexibility for Spark use cases, but more mature products have progressed in a more deliberate way.

What’s new in all this is a growing desire to re-integrate short-request and analytic processing — hence Gartner’s new-ish buzzword of HTAP (Hybrid Transactional/Analytic Processing). The more sensible reasons for this trend are:

Operational applications have always needed to accept immediate writes. (Losing data is bad.)
Operational applications have always needed to serve small query result sets based on the freshest data. (If you write something into a database, you might need to immediately retrieve it to finish the business operation.)
It is increasingly common for predictive decisions to be made at similar speeds. (That’s what recommenders and personalizers do.) Ideally, such decisions can be based on fresh and historical data alike.
The long-standing desire for business intelligence to operate on super-fresh data is, increasingly, making sense, as we get ever more stuff to monitor. However …
… most such analysis should look at historical data as well.
Streaming technology is supplying ever more fresh data.

But here’s the catch — the best models for writing data are the worst for reading it, and vice-versa, because you want to write data as a lightly-structured document or log, but read it from a Ted-Codd-approved RDBMS or MOLAP system. And if you don’t have the time to move data among multiple stores, then you want one store to do a decent job of imitating both kinds of architecture. The interesting new developments in multi-model data management will largely be focused on that need.

Related links

The two-policemen joke seems ever more relevant.
My April, 2015 post on indexing technology reminds us that one DBMS can do multiple things.
Back in 2009 integrating OLTP and data warehousing was clearly a bad idea.

Notes on HBase

Curt Monash — Tue, 10 Mar 2015 18:24:40 +0000

I talked with a couple of Cloudera folks about HBase last week. Let me frame things by saying:

The closest thing to an HBase company, ala MongoDB/MongoDB or DataStax/Cassandra, is Cloudera.
Cloudera still uses a figure of 20% of its customers being HBase-centric.
HBaseCon and so on notwithstanding, that figure isn’t really reflected in Cloudera’s marketing efforts. Cloudera’s marketing commitment to HBase has never risen to nearly the level of MongoDB’s or DataStax’s push behind their respective core products.
With Cloudera’s move to “zero/one/many” pricing, Cloudera salespeople have little incentive to push HBase hard to accounts other than HBase-first buyers.

Also:

Cloudera no longer dominates HBase development, if it ever did.
- Cloudera is the single biggest contributor to HBase, by its count, but doesn’t make a majority of the contributions on its own.
- Cloudera sees Hortonworks as having become a strong HBase contributor.
- Intel is also a strong contributor, as are end user organizations such as Chinese telcos. Not coincidentally, Intel was a major Hadoop provider in China before the Intel/Cloudera deal.
As far as Cloudera is concerned, HBase is just one data storage technology of several, focused on high-volume, high-concurrency, low-latency short-request processing. Cloudera thinks this is OK because of HBase’s strong integration with the rest of the Hadoop stack.
Others who may be inclined to disagree are in several cases doing projects on top of HBase to extend its reach. (In particular, please see the discussion below about Apache Phoenix and Trafodion, both of which want to offer relational-like functionality.)

Cloudera’s views on HBase history — in response to the priorities I brought to the conversation — include:

HBase initially favored consistency over performance/availability, while Cassandra initially favored the opposite choice. Both products, however, have subsequently become more tunable in those tradeoffs.
Cloudera’s initial contributions to HBase focused on replication, disaster recovery and so on. I guess that could be summarized as “scaling”.
Hortonworks’ early HBase contributions included (but were not necessarily limited to):
- Making recovery much faster (10s of seconds or less, rather than minutes or more).
- Some of that consistency vs. availability tuning.
“Coprocessors” were added to HBase ~3 years ago, to add extensibility, with the first use being in security/permissions.
With more typical marketing-oriented version numbers:
- HBase .90, the first release that did a good job on durability, could have been 1.0.
- HBase .92 and .94, which introduced coprocessors, could have been Version 2.
- HBase .96 and .98 could have been Version 3.
- The recent HBase 1.0 could have been 4.0.

The HBase roadmap includes:

A kind of BLOB/CLOB (Binary/Character Large OBject) support.
- Intel is heavily involved in this feature.
- The initial limit is 10 megabytes or so, due to some limitations in the API (I didn’t ask why that made sense). This happens to be all the motivating Chinese customer needs for the traffic photographs it wants to store.
Various kinds of “multi-tenancy” support (multi-tenancy is one of those terms whose meaning is getting stretched beyond recognition), including:
- Mixed workload support (short-request and analytic) on the same nodes.
- Mixed workload support on different nodes in the same cluster.
- Security between different apps in the same cluster.
(Still in the design phase) Bottleneck Whack-A-Mole, with goals including but not limited to:
- Scale-out beyond the current assumed limit of ~1200 nodes.
- More predictable performance, based on smaller partition sizes.
(Possibly) Multi-data-center fail-over.

Not on the HBase roadmap per se are global/secondary indexes. Rather, we talked about projects on top of HBase which are meant to provide those. One is Apache Phoenix, which supposedly:

Makes it simple to manage compound keys. (E.g., City/State/ZipCode)
Provides global secondary indexes (but not in a fully ACID way).
Offers some very basic JOIN support.
Provides a JDBC interface.
Offers efficiencies in storage utilization, scan optimizations, and aggregate calculations.

Another such project is Trafodion — supposedly the Welsh word for “transaction” — open sourced by HP. This seems to be based on NonStop SQL and Neoview code, which counter-intuitively have always been joined at the hip.

There was a lot more to the conversation, but I’ll stop here for two reasons:

This post is pretty long already.
I’m reserving some of the discussion until after I’ve chatted with vendors of other NoSQL systems.

Related link

My July 2011 post on HBase offers context, as do the comments on it.