Buying processes – DBMS 2 : DataBase Management System Services

Brittleness and incremental improvement

Curt Monash — Wed, 20 Jun 2018 09:13:56 +0000

Every system, computer or otherwise, needs to deal with possibilities of damage or error. If it does this well, it may be regarded as “robust”, “mature(d)”, “strengthened”, or simply “improved”.* Otherwise, it can reasonably be called “brittle”.

*It’s also common to use the word “harden(ed)”. But I think that’s a poor choice, as brittle things are often also hard.

0. As a general rule in IT:

New technologies and products are brittle.
They are strengthened incrementally over time.

There are many categories of IT strengthening. Two of the broadest are:

Bug-fixing.
Bottleneck Whack-A-Mole.

1. One of my more popular posts stated:

Developing a good DBMS requires 5-7 years and tens of millions of dollars.

The reasons I gave all spoke to brittleness/strengthening, most obviously in:

Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.

Similar things are true for other kinds of “platform software” or distributed systems.

2. The UI brittleness/improvement story starts similarly:

Graphical user interfaces can present users’ choices clearly, making them great antidotes to users’ initial lack of system knowledge or training.
Usability testing and engineering can lead to improvements and the removal of glitches.

Unfortunately, however, as systems add or change features, UI navigation can get more difficult over time rather than easier.

In at least one scenario — plane crashes due to confused-pilot error — the consequences can be literally fatal.

3. Sometimes brittleness just doesn’t get solved.

Security is perhaps the most visible example. Almost every security system can be broken, and bad actors actively do so.
Another example was 1980s-90s CASE (Computer-Aided Software Engineering), specifically in the area of generating code from specifications. The technology was only able to generate apps that performed a limited set of functions — too limited to be useful outside of certain niches — and never successfully evolved further.

4. Large organizations are riddled with screw-ups. One of the most successful large enterprises in world history was the US military of World War 2 — and that is literally the organization where the word snafu was coined.

The response is often bureaucracy. Somebody makes a mistake; procedures and rules are then instituted to ensure the mistake is never repeated. Over time, many rules and procedures build up, until organizational systems are hardened. Business processes wind up taking many steps, each of which represents both a cost and a potential for failure. And sometimes the only decisions that successfully get through the process are uninspired, uncreative or flat-out wrong.

This is the classic example of “hardening” — commonly expressed via its rough synonym “calcification” — adding even more brittleness than it removes.

Outward-facing/regulatory bureaucracies can be even worse.

Regulators generally have two constituencies — consumers/general public and businesses — with the benefits of regulation going to one and the costs going to the other. Anything regulators do will likely displease at least one constituency.
These ever-unsatisfactory regulations can often only be changed through a long administrative or legislative process.
Violating regulations has unpredictable but sometimes severe consequences.

The whole thing is a colossal mess.

5. Many of the previous points apply to enterprise applications, which facilitate business processes, have UIs, and commonly involve platform-like technology as well.

Changing enterprise apps can take both discontinuous and incremental forms; you can either replace your old apps outright or change something in (or in the implementation of) the ones you have.

If you rip-and-replace your apps, you’re likely to also do so to your business processes, and vice-versa. Discontinuous business process change is often seen as a great virtue, sometimes under the buzzname business process reengineering (BPR).
If you want to change your processes more incrementally, you likely need one or both of two things:
- App software with more features than you initially need. That may be easy to get, but it isn’t cheap.
- A nimble IT department. That one is neither cheap nor easy.

6. My biggest reason for writing about brittleness and improvement is to approach some topics around analytics and AI. As previously noted:

Artificial intelligence is facing public skepticism both for being too accurate (!) and not accurate enough.
Analytics in general is often surprisingly inaccurate.

Please stay tuned.

Related link(s)

Most of what I wrote in my December, 2015 series about artificial intelligence still holds true.

Notes on Spark and Databricks — technology

Curt Monash — Sun, 31 Jul 2016 14:30:18 +0000

During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:

Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
A lot of this is via a focus on simplified APIs, based on Spark DataFrames.
- Unlike similarly named APIs in R and Python, Spark DataFrames work with nested data.
- Machine learning and Spark Streaming both work with Spark DataFrames.
There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-DS queries.

The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are NDA, but it seems fair to mention at least:

Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
Databricks has been working on security, and even on the associated certifications.

Two of the technical initiatives Reynold told me about seemed particularly cool. One, on the machine learning side, was a focus on training models online as new data streams in. In most cases this seems to require new algorithms for old model types, with a core idea being that the algorithm does a mini gradient descent for each new data point.
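To make the "mini gradient descent for each new data point" idea concrete, here is a minimal sketch in plain Python. It is not Spark code and not Databricks' algorithm; the linear model, the squared-error loss, the learning rate and the simulated stream are all assumptions chosen for illustration.

```python
import random

def sgd_step(weights, bias, features, target, learning_rate=0.01):
    """Apply one gradient-descent step for a single (features, target) pair.

    Uses half squared-error loss on a linear model and returns the
    updated (weights, bias).
    """
    prediction = sum(w * x for w, x in zip(weights, features)) + bias
    error = prediction - target
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias

def simulated_stream(n, seed=0):
    """Yield n points from the hypothetical relation y = 2*x1 - 3*x2 + 1."""
    rng = random.Random(seed)
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        yield [x1, x2], 2 * x1 - 3 * x2 + 1

weights, bias = [0.0, 0.0], 0.0
for features, target in simulated_stream(20000):
    # The model is updated as each point arrives; no batch is ever stored.
    weights, bias = sgd_step(weights, bias, features, target, learning_rate=0.05)

print(weights, bias)
```

The point of the design is that the model is always current and memory use is constant, at the price of needing an update rule that works one observation at a time.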

The other cool idea fits the trend of alternatives to the “lambda architecture”. Under the name “structured streaming”, which seems to be a replacement for “DStreaming”, the idea is to do set-based SQL processing even though membership of the set changes over time. Result sets are extracted on a snapshot basis; you can keep either all the results from each snapshot query or just the deltas.
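The snapshot idea can be sketched without any Spark APIs. In the toy Python below, the same set-based query (a count per key) is re-evaluated as the set grows, and each snapshot is kept both as a complete result and as a delta. The event data and function names are hypothetical, and a real engine would presumably update results incrementally rather than recompute them from scratch.

```python
from collections import Counter

def snapshot_counts(events):
    """Set-based query over everything seen so far: count events per key."""
    return Counter(key for key, _ in events)

def delta(previous, current):
    """Rows whose result changed between two snapshots."""
    return {k: v for k, v in current.items() if previous.get(k) != v}

# Hypothetical micro-batches of (page, user) click events arriving over time.
batches = [
    [("home", "u1"), ("cart", "u2")],
    [("home", "u3")],
    [("cart", "u1"), ("help", "u4")],
]

seen = []
previous = Counter()
complete_results = []  # "keep all the results" mode
delta_results = []     # "keep just the deltas" mode
for batch in batches:
    seen.extend(batch)               # membership of the set changes
    current = snapshot_counts(seen)  # same query, re-evaluated per snapshot
    complete_results.append(dict(current))
    delta_results.append(delta(previous, current))
    previous = current

print(complete_results[-1])
print(delta_results)
```

Keeping only deltas is cheaper to ship downstream, while keeping complete results lets a consumer read any snapshot without replaying history.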

Despite all this, there’s some non-trivial dissatisfaction with Spark, fair or otherwise.

Some of the reason is that SparkSQL is too immature to be great.
Some is annoyance that Databricks isn’t putting everything it has into open source.
Some is that everything has its architectural trade-offs.

To the last point, I raised one of the biggest specifics with Reynold, namely Spark’s lack of a strong built-in data persistence capability. Reynold’s answer was that they’re always working to speed up reading and writing from other forms of persistent storage. E.g., he cited a figure of ~100 million rows/core/second decoded from Parquet.

Notes on vendor lock-in

Curt Monash — Wed, 20 Jul 2016 01:35:32 +0000

Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.

1. The most basic form of lock-in is:

You do application development for a target set of platform technologies.
Your applications can’t run without those platforms underneath.
Hence, you’re locked into those platforms.

2. Enterprise vendor standardization is closely associated with lock-in. The core idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:

That simplifies your environment, requiring less integration and interoperability.
That simplifies your staffing; the same skill sets apply to multiple needs and projects.
That simplifies your vendor support relationships; there’s “one throat to choke”.
That simplifies your price negotiation.

3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works:

For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
A few years later, the price is renegotiated, based on then-current levels of usage.

Thus, doing an additional project using ELAed products may appear low-cost.

Incremental license and maintenance fees may be zero in the short-term.
Incremental personnel costs may be controlled because the needed skills are already in-house.

Often those appearances are substantially correct. That’s a big reason why incumbent software is difficult to supplant unless the upstart substitute is superior in fundamental and important ways.

4. Subscriptions are closely associated with lock-in.

Most obviously, the traditional software industry gets its profits from high-margin support/maintenance services.
Cloud lock-in has rapidly become a big deal.
The open source vendors meeting lock-in resistance, noted below, have subscription business models.

Much of why customers care about lock-in is the subscription costs it’s likely to commit them to.

5. Also related to lock-in are thick single-vendor technology stacks. If you run Oracle applications, you’re going to run the Oracle DBMS too. And if you run that, you’re likely to run other Oracle software, and perhaps use Exadata hardware as well. The cloud ==> lock-in truism is an example of this point as well.

6. There’s a lot of truth to the generality that central IT cares about overall technology architecture, while line-of-business departments just want to get the job done. This causes departments to both:

Oppose standardization.
Like thick technology stacks.

Thus, departmental influence on IT both encourages and discourages lock-in.

7. IBM is all about lock-in. IBM’s support for Linux, Eclipse and so on doesn’t really contradict that. IBM’s business model is to ~~squeeze~~ serve its still-large number of strongly loyal customers as well as it can.

8. Microsoft’s business model over the decades has also greatly depended on lock-in.

Indeed, it exploited Windows/Office lock-in so vigorously as to incur substantial anti-trust difficulties.
Server-side Windows tends to be involved in thick stacks — DBMS, middleware, business intelligence, SharePoint and more. Many customers (smaller enterprises or in some cases departments) are firmly locked into these stacks.
Microsoft is making a strong cloud push with Azure, which inherently involves lock-in.

Yet sometimes, Microsoft is more free and open.

Office for Macintosh allowed the Mac to be a viable Windows competitor. (And Microsoft was well-paid for that, generating comparable revenue per Mac to what it got for each Windows PC.)
Visual Studio is useful for writing apps to run against multiple DBMS.
Just recently, Microsoft SQL Server was ported to Linux.

9. SAP applications run over several different DBMS, including its own cheap MaxDB. That counteracts potential DBMS lock-in. But some of its newer apps are HANA-specific. That, of course, has the opposite effect.

10. And with that as background, we can finally get to what led me to write this post. Multiple clients have complaints that may be paraphrased as:

Customers are locked into expensive traditional DBMS such as Oracle.
Yet they’re so afraid of lock-in now that they don’t want to pay for our vendor-supplied versions of open source database technologies; they prefer to roll their own.
Further confusing matters, they also are happy to use cloud technologies, including the associated database technologies (e.g. Redshift or other Amazon offerings), creating whole new stacks of lock-in.

So open source vendors of NoSQL data managers and similar technologies felt like they were the only kind of vendor suffering from fear of lock-in.

I agree with them that enterprises who feel this way are getting it wrong. Indeed:

The management of even NoSQL DBMS is a big issue, and help in that area has high cash value for customers.
Serious users need support.
Support and management tools happen to be synergistic with each other.

This is the value proposition that propelled Cloudera. It’s also a strong reason to give money to whichever of MongoDB, DataStax, Neo Technology et al. sponsors open source technology that you use.

General disclosure: My fingerprints have been on this industry strategy since before the term “NoSQL” was coined. It’s been an aspect of many different consulting relationships.

Some enterprises push back, logically or emotionally as the case may be, by observing that the best internet companies — e.g., Facebook — are allergic to paying for software, even open source. My refutations of that argument include:

Facebook has more and better engineers than you do.
Facebook has a lot more servers than you do, and would presumably face much higher prices than you would if you each chose to forgo the in-house alternative.
Facebook pays for open source software in a different way than through subscription fees — it invents and enhances it. Multiple important projects have originated at Facebook, and it contributes to many others. Are you in a position to do the same thing?

And finally — most of Facebook’s users get its service for free. (Advertisers are the ones who pay cash; all others just pay in attention to the ads.) So if getting its software for free actually does screw up its SLAs (Service Level Agreements) — well, free generally comes with poorer SLAs than paid. But if you’re in the business of serving paying customers, then you might want to have paying-customer kinds of SLAs, even on the parts of your technology — e.g. websites urging people to do business with you — that you provide for free yourself.

Related links

The technology underlying packaged applications (November, 2015, but it has a historical focus)
Topics in migration (January, 2015)
Much of the vendor advice on Strategic Messaging.

Some checklists for making technical choices

Curt Monash — Mon, 15 Feb 2016 16:27:10 +0000

Whenever somebody asks for my help on application technology strategy, I start by trying to ascertain three things. The absolute first is actually a prerequisite to almost any kind of useful conversation, which is to ascertain in general terms what the hell it is that we are talking about.

My second goal is to ascertain technology constraints. Three common types are:

Compatible with legacy systems and/or enterprise standards.
Cheap, free and/or open source.
Proven, vetted by sufficiently many references, and/or generally having an “enterprise-y” reputation.

That’s often a short and straightforward discussion, except in those awkward situations when all three of my bullet points above are applicable at once.

The third item is usually more interesting. I try to figure out what is to be accomplished. That’s usually not a simple matter, because the initial list of goals and requirements is almost never accurate. It’s actually more common that I have to tell somebody to be more ambitious than that I need to rein them in.

Commonly overlooked needs include:

If you want to sell something and have happy users, you need a good UI.
You will also soon need tools and a UI for administration.
Customers demand low-latency/fresh data. Your explanation of why they don’t really need it doesn’t contradict the fact that they want it.
Providing data access and saying “You can hook up any BI tool you want and build charts” is not generally regarded as offering a good UI.
When “adding analytics” to something previously focused on short-request processing, it is common to underestimate the variety of things users will soon want to do. (One common reason for this under-estimate is that after years of being told it can’t be done, they’ve learned not to ask.)

And if you take one thing away from this post, then take this:

If you “know” exactly which features are or aren’t helpful to users, …
… and if you supply only what you “know” they should use, …
… then you will discover that what you “knew” wasn’t really accurate.

I guarantee it.

So far what I’ve said can be summarized as “Figure out what you’re trying to do, and what constraints there are on your choices for doing it.” The natural next step is to list the better-thought-of choices that meet your constraints, and — voila! — you have a short list. That’s basically correct, but there’s one significant complication.

Speaking of complications, what I’m portraying as a kind of linear/waterfall decision process of course usually involves lots of iteration, meandering around and general wheel-spinning. Real life is messy.

Simply put, there are many different kinds of application project. Other folks’ experience may not be as applicable to your case as you hope, because your case is different. So the rest of this post contains a checklist of distinctions among various different kinds of application project.

For starters, there are at least two major kind(s) of software development.

Many projects fit the traditional development model, elements of which are:
- You — and this is very much a plural “you” — code something up more or less from scratch, using whatever language(s) and/or framework(s) you think make sense.
- You break the main project into pieces in obvious ways (e.g. server back end vs. mobile front), and then into further pieces for manageability.
- There may also be database designs, test harnesses, connectors to other apps and so on.
But there are many other projects in which smaller bits of configuration and/or scripting are the essence of what you do.
- This is particularly common in analytics, where there might be business intelligence tools, ETL tools, scripts running against Hadoop and so on. The original building of a data warehouse/hub/lake/reservoir may also fit this model.
- It’s also what you do to get a major purchased packaged application into actual production.
- It also is often what happens for websites that serve “content”.

Other significant distinctions include:

In-house vs. software-for-resale. If the developing organization is handing code to somebody else, then we’re probably talking about a more traditional kind of project. But if the whole thing is growing organically in-house, the script-spaghetti alternative may well be viable (in those projects for which it seems appropriate). Important subsidiary distinctions start with:
- (If in-house) Truly in-house vs. out-sourced.
- (If for resale) On-premises vs. SaaS. Or maybe not.
Kind(s) of analytics, if any. Technologies and development processes used can be very different depending upon whether the application features:
- Business intelligence (not particularly real-time) as its essence.
- Reporting or other BI as added functionality to an essentially operational app.
- Low-latency BI, perhaps supported by (other) short-request processing.
- Predictive model scoring.
The role(s) of the user(s). This influences how appealing and easy the UI needs to be.* Requirements are very different, for example, among:
- Classic consumer-facing websites, with recommenders and so on.
- Marketing websites targeted at a small group of business-to-business customers.
- Data-sharing websites for existing consumer stakeholders.
- Cheery benefits-information websites that the HR department wants employees to look at.
- Purely internal apps meant to be used by (self-)important executives.
- Internal apps meant to be used by line workers who will be given substantial training on them.
Certain kinds of application project stand almost separately from the rest of these considerations, because their starting point is legacy apps. Examples may be found among:
- Migration/consolidation projects.
- Refactoring projects.
- Addition of incremental functionality.

*It also influences security, all good practices for securing internal apps notwithstanding.

Much also depends on the size and sophistication of the organization. What the “organization” is depends a bit on context:

In the case of software products, SaaS (Software as a Service) or other internet services, it is primarily the vendor. However …
… in B2B cases the sophistication of the customer organizations can also matter.
In the case of in-house enterprise development, there’s only one enterprise involved (duh). However …
… the “department” vs. “IT” distinction may be very important.

Specific considerations of this kind start:

Is me-too functionality enough, or does the enterprise seek competitive advantage through technology?
What kinds of technical risk does it seem prudent and desirable to take?

And that, in a nutshell, is why strategizing about application technology is often more complicated than it first appears.

Related links

My November, 2015 post on issues in enterprise application software links to a number of other relevant posts.
One of those (the same month) briefly surveyed actual choices in technology support for enterprise apps.
A number of my posts draw distinction among different analytic use cases. An April, 2015 example points to some of the earlier ones.
My July, 2012 categorization of kinds of BI is particularly relevant.
A November, 2012 post focused on assessing the supposed need for speed.
My September, 2011 strategic worksheet is evergreen.

CDH 5.5

Curt Monash — Thu, 19 Nov 2015 11:52:01 +0000

I talked with Cloudera shortly ahead of today’s announcement of Cloudera 5.5. Much of what we talked about had something or other to do with SQL data management. Highlights include:

Impala and Kudu are being donated to Apache. This actually was already announced Tuesday. (Due to Apache’s rules, if I had any discussion with Cloudera speculating on the likelihood of Apache accepting the donations, I would not be free to relay it.)
Cloudera is introducing SQL extensions so that Impala can query nested data structures. More on that below.
- The basic idea for the nested datatype support is that there are SQL extensions with a “dot” notation to let you get at the specific columns you need.
- From a feature standpoint, we’re definitely still in the early days.
  - When I asked about indexes on these quasi-columns, I gathered that they’re not present in beta but are hoped for by the time of general availability.
  - Basic data skipping, also absent in beta, seems to be more confidently expected in GA.
- This is for Parquet first, Avro next, and presumably eventually native JSON as well.
- This is said to be Dremel-like, at least in the case of Parquet. I must confess that I’m not familiar enough with Apache Drill to compare the two efforts.
Cloudera is increasing its coverage of Spark in several ways.
- Cloudera is adding support for MLlib.
- Cloudera is adding support for SparkSQL. More on that below.
- Cloudera is adding support for Spark going against S3. The short answer to “How is this different from the Databricks service?” is:
  - More “platform” stuff from the Hadoop stack (e.g. for data ingest).
  - Less in the way of specific Spark usability stuff.
Cloudera is putting into beta what it got in the Xplain.io acquisition, which it unfortunately is naming Cloudera Navigator Optimizer. More on that in a separate post.
Impala and Hive are getting column-level security via Apache Sentry.
There are other security enhancements.
Some policy-based information lifecycle management is being added as well.
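As a rough illustration of the “dot” notation idea for nested data mentioned above, the following Python sketch resolves a dotted path against a nested record. It is not Impala's SQL syntax or semantics; the `resolve` helper and the sample record are hypothetical, and the fan-out over arrays is just one plausible way to treat repeated fields.

```python
def resolve(record, dotted_path):
    """Follow a dotted path such as 'address.city' through nested dicts.

    When a list is encountered, the rest of the path is applied to every
    element, so one input record can yield several values.
    """
    values = [record]
    for part in dotted_path.split("."):
        next_values = []
        for value in values:
            # A nested array fans out: the path applies to each element.
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, dict) and part in item:
                    next_values.append(item[part])
        values = next_values
    return values

# Hypothetical nested record, of the kind that might be stored in Parquet or Avro.
customer = {
    "name": "Acme",
    "address": {"city": "Boston", "zip": "02110"},
    "orders": [
        {"id": 1, "total": 40.0},
        {"id": 2, "total": 15.5},
    ],
}

print(resolve(customer, "address.city"))
print(resolve(customer, "orders.total"))
```

The appeal of this kind of notation is that a query can name just the nested fields it needs, rather than flattening the whole structure first.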

While I had Cloudera on the phone, I asked a few questions about Impala adoption, specifically focused on concurrency. There was mention of:

Petabyte scale databases — at least one clear case for Impala/business intelligence only, and the likelihood that the Impala/BI part of other bigger installations was also in that range.
Hundreds of nodes.
10s of simultaneous queries in dashboard use cases.
1 – 3 million queries/month as a common figure.

Cloudera also expressed the opinions that:

An “overwhelming majority” of Cloudera customers have adopted Impala. (I imagine there’s a bit of hyperbole in that — for one thing, Cloudera has a pricing option in which Impala is not included.)
It is common for Impala customers to use Hive for “data preparation”.
SparkSQL has “order of magnitude” less performance than Impala, but a little more performance than Hive running over either Spark or Tez.
SparkSQL’s main use cases are (and these overlap heavily):
- As part of an analytic process (as opposed to straightforwardly DBMS-like use).
- To persist data outside the confines of a single Spark job.

Differentiation in business intelligence

Curt Monash — Mon, 26 Oct 2015 19:34:09 +0000

Parts of the business intelligence differentiation story resemble the one I just posted for data management. After all:

Both kinds of products query and aggregate data.
Both are offered by big “enterprise standard” behemoth companies and also by younger, nimbler specialists.
You really, really, really don’t want your customer data to leak via a security breach in either kind of product.

That said, insofar as BI’s competitive issues resemble those of DBMS, they are those of DBMS-lite. For example:

BI is less mission-critical than some other database uses.
BI has done a lot less than DBMS to deal with multi-structured data.
Scalability demands on BI are less than those on DBMS — indeed, they’re the ones that are left over after the DBMS has done its data crunching first.

And full-stack analytic systems — perhaps delivered via SaaS (Software as a Service) — can moot the BI/data management distinction anyway.

Of course, there are major differences between how DBMS and BI are differentiated. The biggest are in user experience. I’d say:

For many people, BI is the user experience over the underlying data store(s).
Two crucial aspects of user experience are navigational power and speed of response.
- At one extreme, people hated the old green paper reports.
- At the other, BI in the QlikView/Tableau era is one of the few kinds of enterprise software that competes on the basis of being
- This is also somewhat true with respect to snazzy BI demos, such as interactive maps or way-before-their-day touch screens.*
Features like collaboration and mobile UIs also matter.
Since BI is commonly adopted via quick departmental projects — at least as the hoped-for first-step of a “land-and-expand” campaign — administrative usability is at a premium as well.

* Computer Pictures and thus Cullinet used a touch screen over 30 years ago. Great demo, but not so useful as an actual product, due to the limitations on data structure.

Where things get tricky is in my category of accuracy. In the early 2000s, I pitched and wrote a white paper arguing that BI helps bring “integrity” to an enterprise in various ways. But I don’t think BI vendors have done a good job of living up to that promise.

They’ve moved slowly in accuracy-intensive areas such as alerting or predictive modeling.
“Single source of truth” and similar protestations turned out to be much oversold.

Indeed, it’s tempting to say that business intelligence has been much too stupid. I really like some attempts to make BI sharper, e.g. at Rocana or ClearStory, but it remains to be seen whether many customers care about their business intelligence actually being smart.

So how does all this fit into my differentiation taxonomy/framework? Referring liberally to what has already been written above, we get:

Scope:
- For traditional tabular analysis, BI products compete on a bunch of UI features.
- Non-tabular analysis is much more primitive. Event series interfaces may be the closest thing to an exception.
- Collaboration is in the mix as well.
Accuracy: I discussed this one above.
Other trustworthiness:
- Security is a big deal.
- Mission-critical robustness is usually, in truth, just a nice-to-have. But some (self-)important executives may disagree.
Speed:
- For some functionality — e.g. cross-database joins — BI tools almost have to rely on their own DBMS-like engines for performance.
- For other functionality it’s more optional. You can run single-RDBMS queries straight against the underlying system, or you can pre-position some of the data in memory.
- Please also see the adoption and administration section below.
User experience: I discussed this one above.
Adoption and administration:
- When BI is “owned” by a department, especially one that also doesn’t manage the underlying data, set-up and administration need to be super-easy.
- Sometimes, departmental BI is used as an excuse to pressure central IT into making data available.
- Much like analytic DBMS, BI adoption can sometimes be tied to huge first-time-data-warehouse building projects.
- Administration of big enterprise-standard BI is, to re-use a term, much like DBMS-lite.
Cost: The true cost of BI usage is commonly governed more by the underlying data management (and data acquisition) than by the BI software (and supporting servers) itself. That said:
- BI “hard” costs — licenses, servers, cloud fees, whatever — commonly have to fit into departmental budgets.
- So do BI people costs.
- BI people requirements also often have to fit into departmental skill sets.

Differentiation in data management

Curt Monash — Mon, 26 Oct 2015 19:32:34 +0000

In the previous post I broke product differentiation into 6-8 overlapping categories, which may be abbreviated as:

Scope
Accuracy
(Other) trustworthiness
Speed
User experience
Cost

and sometimes also issues in adoption and administration.

Now let’s use this framework to examine two market categories I cover: data management and, in a separate post, business intelligence.

Applying this taxonomy to data management:

Scope:
- Different subcategories of data management technology are suitable for different kinds of data, different scale of data, etc. To a lesser extent that may be true within a subcategory as well.
- Further, products may differ in what you can do with the data, especially analytically.
Accuracy: Don’t … lose … data.
Other trustworthiness:
- Uptime, availability and so on are big deals in many data management sectors.
- Security is hugely important for data that both belongs to other people — usually your customers — and is accessible via the internet. It’s important in numerous other database use cases as well.
- Awkwardly, the CAP Theorem teaches us that there can be a bit of a trade-off between availability and (temporary) accuracy.
Speed:
- Different kinds of data management products perform differently in different use cases.
- If your use case is down the middle of what a mature data management subsector focuses on, performance may not vary much among individual leading products.
- Even then, tuning effort may be quite different for different products.
User experience:
- Users rarely interact directly with database management products.
- There can be clear differentiation in database administration UIs. (The most dramatic example was perhaps the rise of Microsoft SQL Server.)
- Data manipulation languages (DMLs) can make a huge difference in programmers’ lives.
Cost:
- License and maintenance costs can be a huge issue, especially if you’re buying from traditional vendors.
- Performance affects cost in a few ways: hardware costs for sure, tuning effort in some cases, and occasionally even vendor license/maintenance fees.
- Ongoing operations costs can vary greatly by database product in general, and by your pre-existing in-house expertise in particular.
- Ease of programming can sometimes lead to significant programming cost differences as well.
Adoption: This one is often misunderstood.
- The effort of adopting new database technology for new applications is often overrated. When projects are huge, it’s often because of what you’re doing with the technology, not because of the technology itself.
- Migration, however, is usually a bitch.
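
To make the cost bullets above concrete, here is a minimal sketch of how the components might be added up when comparing two products. Every name and dollar figure in it is hypothetical and chosen only for illustration; real comparisons would also need to account for migration effort and multi-year horizons.

```python
from dataclasses import dataclass


@dataclass
class DbmsCosts:
    """Annual cost components in dollars (all figures hypothetical)."""
    license_and_maintenance: float
    hardware: float
    operations_staff: float
    programming: float

    def total(self) -> float:
        # Simple sum of the four cost categories named in the list above.
        return (self.license_and_maintenance + self.hardware
                + self.operations_staff + self.programming)


def cheaper_option(a: DbmsCosts, b: DbmsCosts) -> str:
    """Return 'a', 'b', or 'tie' depending on which total is lower."""
    if a.total() < b.total():
        return "a"
    if b.total() < a.total():
        return "b"
    return "tie"


# Illustrative numbers only: a product with high license fees but low
# staffing needs, versus a free product that needs more in-house effort.
traditional = DbmsCosts(400_000, 150_000, 200_000, 250_000)
open_source = DbmsCosts(0, 200_000, 450_000, 300_000)

print(traditional.total(), open_source.total(),
      cheaper_option(traditional, open_source))
```

With these made-up inputs the gap between the two totals is far smaller than the gap in license fees alone, which is the point of looking at all four components together.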

For reasons of length, I’m doing a separate post on differentiation in business intelligence.

Sources of differentiation

Curt Monash — Mon, 26 Oct 2015 19:31:38 +0000

Obviously, a large fraction of what I write about involves technical differentiation. So let’s try for a framework where differentiation claims can be placed in context. This post will get through the generalities. The sequels will apply them to specific cases.

Many buying and design considerations for IT fall into six interrelated areas:

Scope: What does the technology even purport to do? This consideration applies to pretty much everything.
- Usually, this means something like features.
- However, there’s an important special case in which the important features are the information content. (Examples: Arguably Google, and the Bloomberg service for sure.)
Accuracy: How correctly does the technology do it? This can take multiple forms.
- Sometimes, a binary right/wrong distinction pretty much suffices, with an acceptable error rate of zero. If you’re writing data, it shouldn’t get lost. If you’re doing arithmetic, it should be correct. Etc.
- Sometimes, there’s a clear right/wrong distinction, but error rates are necessarily non-zero, often with a trade-off between the rates for false positives and false negatives. (In text search and similar areas, those rates are measured respectively as precision and recall.) Security is a classic example. Many other cases arise when trying to identify problems or
- Sometimes accuracy is on a scale. Predictive modeling results are commonly of that kind. So are text search, voice recognition and so on.
Other trustworthiness.
- Reliability, availability and security are considerations in almost any IT scenario.
- Also crucial are any factors that are perceived as affecting the risk of project failure. Sometimes, these are lumped together as (part of) maturity.
Speed. There’s a great real and/or perceived “need for speed”.
- On the user level:
  - There are many advantages to quick results, “real time” or otherwise.
  - In particular, analysis is often more accurate if you have time for more iterations or intermediate steps.
  - Please recall that speed can actually have multiple kinds of benefit. For example, it can reduce costs, it can improve accuracy, it can improve user experience, or it can enable capabilities that would otherwise be wholly impractical.
- There can also be considerations of time to (initial) value, although people sometimes overrate how often this is a function of the technology itself.
- Consistency of performance can be an important aspect of product maturity.
User experience. Ideally, using a system is easy and pleasurable, or at least not unpleasant.
- Ease of use often equates to ease of (re)learning …
- … but there are exceptions, generally for what might be considered “power users”.
- Speed and performance can avoid a lot of unpleasant frustration.
- In some cases you can compel somebody — usually an employee — to use your interface. Often, however, you can’t, and that’s when user experience may matter most.
- An important category of user experience that doesn’t directly equate to ease is recommendations. Of course, the more accurate the recommendations are, the better.
- Most systems have at least two categories of user experience — one for the true users, and one for the IT folks who manage it. The IT folks’ experience often depends not just on true UI features, but on how easy or difficult the underlying system is to deal with in the first place.
Cost, or more precisely TCO (Total Cost of Ownership). Cost is always important, and especially so if there are numerous viable alternatives.
- Sometimes money paid to the vendor really is the largest component of TCO.
- Often, however, hardware or IT personnel expenditures are the lion’s share of overall cost.
- Administrators’ user experience can affect a large chunk of TCO.
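
The trade-off between false positives and false negatives mentioned under Accuracy can be illustrated with the standard definitions of precision and recall. This is a minimal sketch; the function name and the counts are assumptions made up for the example, not measurements of any real system.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Compute (precision, recall) from raw counts.

    precision = share of flagged items that were actually relevant
    recall    = share of relevant items that were actually flagged
    """
    flagged = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


# Hypothetical detector run at two thresholds against 100 relevant items.
# A strict threshold flags little: few false positives, many misses.
strict = precision_recall(true_pos=40, false_pos=10, false_neg=60)
# A lenient threshold flags a lot: few misses, many false positives.
lenient = precision_recall(true_pos=90, false_pos=110, false_neg=10)

print("strict:", strict)
print("lenient:", lenient)
```

In this toy case, loosening the threshold raises recall from 0.4 to 0.9 while precision drops from 0.8 to 0.45, so neither setting is simply "more accurate" than the other.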

Related links

This post is starting out with two sequels, on data management and business intelligence respectively.
Issues of differentiation are central to my strategic worksheet.
When thinking about differentiation, keep in mind the distinction between wants and needs.
This post fits well with my claim that every product in a category is positioned along the same set of attributes.
In a post last year about differentiation, I wrote “Your spiffy innovation is important in fewer situations than you would like to believe.”
If you think you’re a rare exception to that rule, please see my post about over-optimism.

21st Century DBMS success and failure

Curt Monash — Mon, 14 Jul 2014 06:37:31 +0000

As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.

DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.

In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.

Buyer inertia is a greater concern.

A significant minority of enterprises are highly committed to their enterprise DBMS standards.
Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.

A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.

First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
Meanwhile, Hadoop has emerged as a serious competitor for at least some analytic data management, especially but not only at internet companies.

Otherwise I’d say:

At large enterprises, their internet operations perhaps excepted:
- Short-request/general-purpose SQL alternatives to the behemoths — e.g. MySQL, PostgreSQL, NewSQL — have had tremendous difficulty getting established. The last big success was the rise of Microsoft SQL Server in the 1990s. That’s why I haven’t mentioned the term mid-range DBMS in years.
- NoSQL/non-SQL has penetrated large enterprises mainly for a few specific use cases, for example the lists I posted for MongoDB or graph databases.
Internet-only companies have few inertia issues when it comes to database managers. They’ll consider anything they regard as being in their price ballpark (which is however often restricted to open source). I think part of the reason is that as quickly as they rewrite their applications, DBMS are vastly less “strategic” to them than they are to most larger enterprises.
The internet operations of large companies — especially large retailers — in many cases behave like internet-only companies, but in many other cases behave like the rest of the enterprise.

The major reasons for DBMS categories to get established in the first place are:

Performance and/or scalability (many examples).
Developer features (for example dynamic schema).
License/maintenance cost (for example several open source categories).
Ease of installation and administration (for example open source again, and also data warehouse appliances).

Those same characteristics are major bases for competition among members of a new category, although as noted above behemoth-loyalty can also come into play.

Cool-vs.-weird tradeoffs are somewhat secondary among SQL DBMS.

There’s not much of a “cool” factor, because new products aren’t that different in what they do vs. older ones.
There’s not a terrible “weird” factor either, but of course any smaller offering faces FUD, and also …
… appliances are anti-strategic for many buyers, especially ones who demand a smooth path to the cloud.

They’re huge, however, in the non-SQL world. Most non-SQL data managers have a major “weird” factor. Fortunately, NoSQL and Hadoop both have huge “cool” cred to offset it. XML/XQuery unfortunately did not.

Finally, in most DBMS categories there are massive issues with product completeness, more in the area of maturity than that of whole product. The biggest whole product issues are concentrated on the matter of interoperating with other software — business intelligence tools, packaged applications (if relevant to the category), etc. Most notably, the handful of DBMS that are certified to run SAP share a huge market that other DBMS can’t touch. But BI tools are less of a differentiator — I yawn when vendors tell me they are certified for/partnered with MicroStrategy, Tableau, Pentaho and Jaspersoft, and I’m surprised at any product that isn’t.

DBMS maturity has a lot of aspects, but the toughest challenges are concentrated in two main areas:

Reliability, especially but not only in short-request use cases.
Performance across a great variety of use cases. I observe frequently that performance in best-case scenarios, performance in the lab and performance in real-world environments are much further apart than vendors like to think.

In particular:

Maturity demands seem to be much higher for SQL DBMS than for NoSQL.
- I think this is one of several reasons NoSQL has been much more successful than NewSQL.
- It’s why I think MarkLogic’s “Enterprise NoSQL” positioning is a mistake.
As for MySQL:
- MySQL wasn’t close to reliable enough for enterprises to trust it until InnoDB became the default storage engine.
- MySQL 5 point releases have added major features, or decent performance for major features. I’ll confess to having lost track of what’s been fixed and what’s still missing.
- In saying all that I’m holding MySQL to a much higher maturity standard than I’m holding NoSQL — because that’s what I think enterprise customers do.
PostgreSQL “should” be doing a lot better than it is. I have an extremely low opinion of its promoters, and not just for personal reasons. (That said, the personal reasons don’t just apply to EnterpriseDB anymore. I’ve also run out of patience waiting for Josh Berkus to retract untruths he posted about me years ago.)
SAP HANA checks boxes for performance (In-memory rah rah rah!!) and whole product (Runs SAP!!). That puts it well ahead of most other newish SQL DBMS, purely analytic ones perhaps excepted.
Any other new short-request SQL DBMS that sounds like it has traction is also memory-centric.
Analytic RDBMS are in most respects held to lower maturity standards than DBMS used for write-intensive workloads. Even so, products in the category are still frequently tripped up by considerations of concurrent performance and mixed workload management.

Related links

There have been 1,470 previous posts in the 9-year history of this blog, many of which could serve as background material for this one. A couple that seem particularly germane and didn’t already get linked above are:

The drive for uninterrupted DBMS operation.
Short-request DBMS trade-offs and alternatives.

Hardware and storage notes

Curt Monash — Thu, 01 May 2014 02:05:16 +0000

My California trip last week focused mainly on software — duh! — but I had some interesting hardware/storage/architecture discussions as well, especially in the areas of:

Rack- or data-center-scale systems.
The real or imagined demise of Moore’s Law.
Flash.

I also got updated as to typical Hadoop hardware.

If systems are designed at the whole-rack level or higher, then there can be much more flexibility and efficiency in terms of mixing and connecting CPU, RAM and storage. The Google/Facebook/Amazon cool kids are widely understood to be following this approach, so others are naturally considering it as well. The most interesting of several mentions of that point came when I got the chance to talk with Berkeley computer architecture guru Dave Patterson, who’s working on plans for 100-petabyte/terabit-networking kinds of systems, for usage after 2020 or so. (If you’re interested, you might want to contact him; I’m sure he’d love more commercial sponsorship.)

One of Dave’s design assumptions is that Moore’s Law really will end soon (or at least greatly slow down), if by Moore’s Law you mean that every 18 months or so one can get twice as many transistors onto a chip of the same area and cost as one could before. However, while he thinks that applies to CPU and RAM, Dave thinks flash is an exception. I gathered that he thinks the power/heat reasons for Moore’s Law to end will be much harder to defeat than the other ones; note that flash, because of what it’s used for, has vastly less power running through it than CPU or RAM do.
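
For a sense of what the 18-month doubling version of Moore’s Law implies, and what a slowdown would cost, here is a small sketch of the compounding arithmetic. The function name and the 36-month "slowed" figure are assumptions for illustration only; nothing here is a forecast.

```python
def transistor_growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth in transistors per chip (same area and cost)
    after `years`, if the count doubles every `doubling_months` months."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


# Classic 18-month doubling over 6 years: 4 doublings.
classic = transistor_growth_factor(6)
# Hypothetical slowdown to a 36-month doubling over the same 6 years: 2 doublings.
slowed = transistor_growth_factor(6, doubling_months=36.0)

print(classic, slowed)
```

Under these assumptions, six years of the classic rate gives a 16-fold increase, while the slowed rate gives only 4-fold, which is why system designers planning for 2020 and beyond care so much about which case applies.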

Otherwise, I didn’t gain much new insight into actual flash uptake. Everybody thinks flash is or soon will be very important; but in many segments, folks are trading off disk vs. RAM without worrying much about the intermediate flash alternative.

I visited two Hadoop distribution vendors this trip, namely the ones who are my clients – Cloudera and MapR. I remembered to ask one of them, Cloudera, about typical Hadoop hardware, and got answers that sounded consistent with hardware trends Hortonworks told me about last August. The story is, more or less:

The default assumption remains $20-30K/node, 2 sockets, 12 disks. (Edit: See lively price discussion in the comments below.)
Most hardware vendors have standard/default Hadoop boxes by now, and in many cases customers just buy what’s on offer.
The aforementioned disks sometimes get up to 4 terabytes now.
128GB is now the norm for RAM. 256GB is common. Higher amounts are seen, up to – in rare cases – 2-4 TB.
Flash is of interest, but isn’t being demanded much yet. This could change when flash’s storage density matches disk’s.
Flash interest is highest for Impala.
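
As a rough back-of-the-envelope check on those figures, the sketch below works out raw disk capacity and cost per terabyte for a node at the quoted price range with 12 disks of 4 terabytes each. The 3x replication factor is an assumption (it is the usual HDFS default, but the vendors did not state it here), and the function names are made up for the example.

```python
def raw_tb_per_node(disks: int, tb_per_disk: float) -> float:
    """Raw disk capacity of one node, in terabytes."""
    return disks * tb_per_disk


def cost_per_tb(node_cost: float, capacity_tb: float) -> float:
    """Node price divided by capacity, in dollars per terabyte."""
    return node_cost / capacity_tb


DISKS = 12          # from the list above
TB_PER_DISK = 4.0   # upper end mentioned above
REPLICATION = 3     # assumption: typical HDFS default, not stated in the post

raw_tb = raw_tb_per_node(DISKS, TB_PER_DISK)
usable_tb = raw_tb / REPLICATION

for node_cost in (20_000, 30_000):
    print(node_cost,
          round(cost_per_tb(node_cost, raw_tb), 2),
          round(cost_per_tb(node_cost, usable_tb), 2))
```

On these assumptions a node holds 48 TB raw, or about 16 TB after replication, which works out to roughly $417 to $625 per raw terabyte and $1,250 to $1,875 per replicated terabyte. That ignores compression, temp space and the RAM and CPU that the node price also buys.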

Cloudera suggested that the larger amounts of RAM tend to be used when customers frame the need as putting certain analytic datasets entirely in RAM. This rings true to me; there’s lots of evidence that users think that way, and not just in analytic cases. This is probably one of the reasons that they often jump straight from disk to RAM without fully exploring the opportunities of flash.

One last thing — the big cloud vendors are at least considering the use of their own non-Intel chip designs, which might be part of the reason for Intel’s large Hadoop investment.