Netezza – DBMS 2 : DataBase Management System Services

Generally available Kudu

Curt Monash — Fri, 16 Jun 2017 15:52:45 +0000

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:

Security is an ever bigger deal.
There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliances (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share) than from general-purpose DBMS such as Oracle or SQL Server.
Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.

Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:

A data storage system introduced by Cloudera (and subsequently open-sourced).
Columnar.
Updatable in human real-time.
Meant to serve as the data storage tier for Impala and Spark.

Kudu’s adoption and roll-out story starts:

Kudu went to general availability on January 31. I gather this spawned an uptick in trial activity.
A subsequent release with some basic security features spawned another uptick.
I don’t think Cloudera will mind my saying that there are many hundreds of active Kudu clusters.
But Cloudera believes that, this soon after GA, very few Kudu users are in actual production.

Early Kudu interest is focused on 2-3 kinds of use case. The biggest is the kind of “data warehousing” highlighted above. Cloudera characterizes the others by the kinds of data stored, specifically the overlapping categories of time series — including financial trading — and machine-generated data. A lot of early Kudu use is with Spark, even ahead of (or in conjunction with) Impala. A small amount has no relational front-end at all.

Other notes on Kudu include:

Solid-state storage is recommended, with a few terabytes per node.
You can also use spinning disk. If you do, your write-ahead logs can still go to flash.
Cloudera said Kudu compression ratios can be as low as 2-5X, or as high as 10-20X. With that broad a range, I didn’t drill down into specifics of what they meant.
There seem to be a number of Kudu clusters with 50+ nodes each. By way of contrast, a “typical” Cloudera customer has 100s of nodes overall.
As you might imagine from their newness, Kudu security features — Kerberos-based — are at the database level rather than anything more granular.

And finally, the Cloudera folks woke me up to some issues around streaming data ingest. If you stream data in, there will be retries resulting in duplicate delivery. So your system needs to deal with those one way or another. Kudu’s way is:

Primary keys will be unique. (Note: This is not obvious in a system that isn’t an entire RDBMS in itself.)
You can configure the uniqueness to be guaranteed either through an upsert mechanism or simply by rejecting duplicates.
Alternatively, you can write code to handle duplication errors, e.g. via Spark.
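To make the difference between those options concrete, here is a minimal sketch in plain Python. It is a toy in-memory stand-in for a table with a unique primary key, not the Kudu client API; `KeyedTable`, `DuplicateKeyError`, and the sample stream are all hypothetical names invented for illustration.

```python
# Toy in-memory stand-in for a table with a unique primary key.
# This is NOT the Kudu API; all names here are hypothetical.


class DuplicateKeyError(Exception):
    """Raised when a strict insert hits an existing primary key."""


class KeyedTable:
    def __init__(self):
        self.rows = {}  # primary key -> row payload

    def insert(self, key, row):
        # Strict insert: a redelivered row is rejected.
        if key in self.rows:
            raise DuplicateKeyError(key)
        self.rows[key] = row

    def upsert(self, key, row):
        # Upsert: insert if new, overwrite if present, so replays are harmless.
        self.rows[key] = row


# A stream in which event 2 is delivered twice because of a retry.
stream = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]

# Option 1: upsert everything and let the key absorb the duplicate.
upserted = KeyedTable()
for key, row in stream:
    upserted.upsert(key, row)

# Option 2: strict inserts, with application code handling the error.
strict = KeyedTable()
rejected = []
for key, row in stream:
    try:
        strict.insert(key, row)
    except DuplicateKeyError:
        rejected.append(key)  # record the duplicate and move on

print(upserted.rows)
print(strict.rows, rejected)
```

Both paths end with one row per key. The difference is whether the duplicate is silently absorbed (upsert) or surfaced to the caller, which matters if a repeated key could also mean a genuine data error rather than a retry.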

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, raises the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
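The ELT pattern described above can be sketched with Python's built-in `sqlite3` module standing in for the analytic RDBMS. The table names, columns, and sample rows are invented for illustration; a real installation would bulk-load far larger data into a far more capable engine.

```python
import sqlite3

# sqlite3 is only a stand-in for an analytic RDBMS; schema and data are made up.
conn = sqlite3.connect(":memory:")

# Extract + Load: land the data as-is, untidy values and all.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, amount_text TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, " 19.50", "us"), (2, "5.25 ", "US"), (3, "100", "de")],
)

# Transform: clean up inside the database, using its own SQL engine.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT order_id,
           CAST(TRIM(amount_text) AS REAL) AS amount,
           UPPER(country) AS country
    FROM raw_orders
    """
)

rows = conn.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(rows)
```

In the traditional ETL arrangement, the trimming, casting, and upper-casing would happen in a separate engine before the load, and only the clean `orders` table would ever reach the database.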

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (database administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Which analytic technology problems are important to solve for whom?

Curt Monash — Thu, 09 Apr 2015 11:52:50 +0000

I hear much discussion of shortfalls in analytic technology, especially from companies that want to fill in the gaps. But how much do these gaps actually matter? In many cases, that depends on what the analytic technology is being used for. So let’s think about some different kinds of analytic task, and where they each might most stress today’s available technology.

In separating out the task areas, I’ll focus first on the spectrum “To what extent is this supposed to produce novel insights?” and second on the dimension “To what extent is this supposed to be integrated into a production/operational system?” Issues of latency, algorithmic novelty, etc. can follow after those. In particular, let’s consider the tasks:

Reporting for regulatory compliance (financial or otherwise). The purpose of this is to follow rules.
- This is non-innovative almost by design.
- Somebody probably originally issued the regulations for a reason, so the reports may be useful for monitoring purposes. Failing that, they probably are supported by the same infrastructure that also tries to do useful monitoring.
- Data governance is crucial. Submitting incorrect data to regulators can have dire consequences. That said, when we hear about poor governance of poly-structured data, I question whether that data is being used in the applications where strong governance is actually needed.
Other routine, monitoring-oriented business intelligence. The purpose can be general monitoring or general communication. Sometimes the purpose is lost to history entirely. This is generally lame, at least technically, unless interesting requirements are added.
- Displaying it on mobile devices makes it snazzier, and in some cases more convenient. Whoop-de-do.
- Usually what makes it interesting these days is the desire to actually explore the data and gain new insights. More on that below.
- BI for inherently non-tabular data is definitely an unsolved problem.
- Integration of BI with enterprise apps continues to be an interesting subject, but one I haven’t learned anything new about recently.
- All that said, this is an area for some of the most demanding classical data warehouse installations, specifically ones that are demanding along dimensions such as concurrency or schema complexity. (Recall that the most complicated data warehouses are often not the largest ones.) Data governance can be important here as well.
Investigation by business analysts or line-of-business executives. Much of the action is here, not least because …
- … it’s something of a catch-all category.
  - “Business analyst” is a flexible job description, and business analysts can have a variety of goals.
  - Alleged line-of-business executives doing business-analyst work are commonly delegating it to fuller-time business analysts.
- These folks can probably manage departmental analytic RDBMS if they need to (that was one of Netezza’s early value propositions), but a Hadoop cluster stretches them. So easy deployment and administration stories (e.g. “Hadoop with less strain”/“Spark with less strain”) can have merit. This could be true even if there’s a separate team of data wranglers pre-processing data that the analysts will then work with.
- Further, when it comes to business intelligence:
  - Tableau and its predecessors have set a high bar for quality of user interface.
  - The non-tabular BI challenges are present in spades.
  - ETL reduction/elimination (Extract/Transform/Load) is a major need.
- Predictive modeling by business analysts is problematic from beginning to end; much progress needs to be made here.
Investigation by data scientists. The “data scientist”/“business analyst” distinction is hardly precise. But for the purpose of this post, a business analyst may be presumed to excel at elementary mathematics (even stock analysts just use math at a high school level) and at using tabular databases, while data scientists (individuals or teams) have broader skill sets and address harder technical or mathematical problems.
- The technology for “data science” is generally on the newish side. Management and performance at scale are still improving.
- There’s a need and/or desire for more sophisticated analytic tools, in predictive modeling and graph.
Rapid-response trouble-shooting. There are some folks — for example network operators — whose job includes monitoring things moment to moment and, when there’s a problem, reacting quickly.
- Splunk and/or Flume commonly suffice to collect that data, but of course that’s a moving target as data sources proliferate.
- I expect a lot of innovation relevant to the analytic side, in areas such as streaming, low-latency BI, event series analytics, and BI/predictive modeling integration.
“Operationalization” of investigative results. This is a hot area, because doing something with insights — “insights” being a hot analytic buzzword these days — is more valuable than merely having them.
- This is where short-request kinds of data stores — NoSQL or otherwise — are often stressed, especially in the low-latency analytics they need to support.
- This is the big area for any kind of “closed loop” predictive modeling story, e.g. in experimentation.
- At least in theory, this is another big area for streaming.

And finally — across multiple kinds of user group and use case, there are some applications that will only be possible when sensors or other data sources improve.

Bottom line: Almost every interesting analytic technology problem is worth solving for some market, but please be careful about finding the right match.

Related links

Where the innovation is (January, 2015)
Various notes (November, 2014)
“Freeing business analysts from IT” (August, 2014)
Data integration as a business opportunity (July, 2014)
Differentiation in BI usability (March, 2014)
Analytic database distinctions (February, 2013)
Juggling analytic databases (March, 2012)
Applications of an analytic kind (February, 2012)
Agile predictive analytics (November, 2011)
Eight kinds of analytic database (July, 2011)
Use cases for low-latency analytics (April, 2011)
The three principle kinds of analytic business benefit (March, 2011)

Is analytic data management finally headed for the cloud?

Curt Monash — Wed, 22 Oct 2014 08:48:43 +0000

It seems reasonable to wonder whether analytic data management is headed for the cloud. In no particular order:

Amazon Redshift appears to be prospering.
So are some SaaS (Software as a Service) business intelligence vendors.
Amazon Elastic MapReduce is still around.
Snowflake Computing launched with a cloud strategy.
Cazena, with vague intentions for cloud data warehousing, destealthed.*
Cloudera made various cloud-related announcements.
Data is increasingly machine-generated, and machine-generated data commonly originates off-premises.
The general argument for cloud-or-at-least-colocation has compelling aspects.
Analytic workloads can be “bursty”, and so could benefit from true cloud elasticity.

Also — although the specifics on this are generally vague and/or confidential — I sense a narrowing of the gap between:

The hardware + networking required for performant analytic data management.
The hardware + networking available in the cloud.

*Cazena is proud of its team of advisors. However, the only person yet announced for a Cazena operating role is Prat Moghe, and his time period in Netezza’s mainstream happens not to have been one in which Netezza had much technical or market accomplishment.

On the other hand:

If you have processing power very close to the data, then you can avoid a lot of I/O or data movement. Many cloud configurations do not support this.
Many optimizations depend upon controlling or at least knowing the hardware and networking set-up. Public clouds rarely offer that level of control.

And so I’m still more confident in SaaS/colocation analytic data management, or in Redshift, than I am in true arm’s-length cloud-based systems.

Thoughts on SaaS

Curt Monash — Mon, 25 Nov 2013 01:16:05 +0000

Generalizing about SaaS (Software as a Service) is hard. To prune some of the confusion, let’s start by noting:

SaaS has been around for over half a century, and at times has been the dominant mode of application delivery.
The term multi-tenancy is being used in several different ways.
Multi-tenancy, in the purest sense, is inessential to SaaS. It’s simply an implementation choice that has certain benefits for the SaaS provider. And by the way, …
… salesforce.com, the chief proponent of the theory that true multi-tenancy is the hallmark of true SaaS, abandoned that position this week.
Internet-based services are commonly, if you squint a little, SaaS. Examples include but are hardly limited to Google, Twitter, Dropbox, Intuit, Amazon Web Services, and the company that hosts this blog (KnownHost).
Some of the core arguments for SaaS’ rise, namely the various efficiencies of data center outsourcing and scale, apply equally to the public cloud, to SaaS, and to AEaaS (Anything Else as a Service).
These benefits are particularly strong for inherently networked use cases. For example, you really don’t want to be hosting your website yourself. And salesforce.com got its start supporting salespeople who worked out of remote offices.
In theory and occasionally in practice, certain SaaS benefits, namely the outsourcing of software maintenance and updates, could be enjoyed on-premises as well. Whether I think that could be a bigger deal going forward will be explored in future posts.

For smaller enterprises, the core outsourcing argument is compelling. How small? Well:

What’s the minimum level of IT operations headcount needed for mission-critical systems? Let’s just say “several”.
What does that cost? Fully burdened, somewhere in the six figures.
What fraction of the IT budget should such headcount be? As low a double digit percentage as possible.
What fraction of revenues should be spent on IT? Some single-digit percentage.

So except for special cases, an enterprise with less than $100 million or so in revenue may have trouble affording on-site data processing, at least at a mission-critical level of robustness. It may well be better to use NetSuite or something like that, assuming needed features are available in SaaS form.*

*Truth be told, I’m not up to speed on mid-range SaaS application suite alternatives.
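The back-of-the-envelope reasoning behind that $100 million figure can be written out as arithmetic. The specific inputs below (4 people, $150,000 each, operations at 10% of the IT budget, IT at 6% of revenue) are illustrative assumptions chosen to fit the ranges given above, not figures from any survey.

```python
def min_revenue(headcount, cost_per_head, ops_pct_of_it, it_pct_of_revenue):
    """Revenue at which the given operations team fits the stated budget ratios.

    Percentages are passed as whole numbers (10 means 10%).
    """
    ops_cost = headcount * cost_per_head          # fully burdened operations staff
    it_budget = ops_cost * 100 / ops_pct_of_it    # ops is only a slice of IT spend
    return it_budget * 100 / it_pct_of_revenue    # IT is only a slice of revenue


# Assumed inputs: "several" = 4, "six figures" = $150,000 per head,
# "low double digit percentage" = 10, "single-digit percentage" = 6.
threshold = min_revenue(4, 150_000, 10, 6)
print(f"${threshold:,.0f}")
```

Under these assumptions the threshold comes out at $100 million. Nudging any one input moves the answer a lot, which is why the figure is best read as an order of magnitude.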

Continuing that thought: if you’re a mid-range application software provider, you have to develop a SaaS version of your product line. That’s a very different business model than the apps + OEMed platform you’re probably providing now, but it’s the best way to serve your customers going forward. And by the way, while mid-range application software is commonly sold on a regional basis, SaaS can be sold more globally; after all, the need for onsite service is eliminated, and price points should in most cases fit with telephone sales. Yes, national language and regional data privacy rules are both concerns, but they still leave the available markets looking much bigger than regional resellers have traditionally enjoyed. So expect shake-outs in a whole lot of vertical markets, as vendors horn in on each other’s territories, and a few elephantine winners perhaps emerge.

The argument above assumes that extreme reliability is needed. So there’s nothing necessarily wrong with a small team of business analysts sticking an RDBMS appliance* in a corner and managing it themselves. If it sputters from time to time, who cares; using it still may be easier than getting that data in and out of the cloud. But eventually, if all the data is remote anyway — SaaS, website, etc. — then it may make sense to do analytics remotely as well.

*Previously, that appliance might have been from Netezza; now, my first thought is the cheaper — albeit more limited — Infobright.

The arguments that direct smaller companies toward SaaS apply to large enterprises too, but they aren’t as dispositive. Larger enterprises can actually afford to do their own IT operations if they want to. What’s more, moving away from in-house operations is harder for big firms, due to the larger and more customized portfolio of legacy systems they’re likely to have. That said:

Almost all enterprises should have their internet-facing systems offsite, even if just via co-location. The core reasons are that ingesting high-volume inbound network traffic is inherently difficult, and security issues make it much tougher yet. In addressing these challenges, specialists enjoy significant economies of scale.
Most enterprises will have plenty of SaaS silos. If nothing else:
- Complex machinery will increasingly “phone home” for help staying in good working order. That’s a form of SaaS.
- Information providers and aggregators tend to deliver via SaaS.
- Various kinds of collaboration and communication apps, from Google Mail to Dropbox, live in the cloud. Personal productivity applications, from word processing to Photoshop, may be following.
- “Rodney Dangerfield” departments — i.e., ones unhappy with the respect and attention they get from central IT — often turn to SaaS or similar outsourcing. Human resources is an obvious example, from Automatic Data Processing to Employease to, these days, Workday.

That leaves us with the questions as to when and how large enterprises should or will move their core applications to SaaS and/or the cloud. Given the length of this post, I won’t try to answer them now. But for starters:

Enterprises don’t like to rip and replace their apps, except in consolidation projects, as long as they can avoid doing so.
Cloud/remote computing economies are less convincing if you already have your computer rooms staffed and set up.
A key benefit of SaaS is that vendors control and drive the upgrade cycles. One cost of that is restrictions on customization, although you can also build apps and app extensions on PaaS/DBaaS/WaaS (Platform/DataBase/Whatever as a Service) offerings such as force.com.
Lock-in is a serious concern, for application and platform offerings alike. Not only are you betting on one vendor’s software black box, you’re also betting on its remote computing operation. If you grow dissatisfied with either, or with their pricing, you may not have much opportunity to escape.

How Revolution Analytics parallelizes R

Curt Monash — Tue, 19 Nov 2013 05:18:41 +0000

I talked tonight with Lee Edlefsen, Chief Scientist of Revolution Analytics, and now think I understand Revolution’s parallel R much better than I did before.

There are four primary ways that people try to parallelize predictive modeling:

They can run the same algorithm on different parts of a dataset on different nodes, then return all the results, and claim they’ve parallelized. This is trivial and not really a solution. It is also the last-ditch fallback position for those who parallelize more seriously.
They can generate intermediate results from different parts of a dataset on different nodes, then generate and return a single final result. This is what Revolution does.
They can parallelize the linear algebra that underlies so many algorithms. Netezza and Greenplum tried this, but I don’t think it worked out very well in either case. Lee cited a saying in statistical computing “If you’re using matrices, you’re doing it wrong”; he thinks shortcuts and workarounds are almost always the better way to go.
They can jack up the speed of inter-node communication, perhaps via MPI (Message Passing Interface), so that full parallelization isn’t needed. That’s SAS’ main approach.

One confusing aspect of this discussion is that it could reference several heavily-overlapping but not identical categories of algorithms, including:

External memory algorithms, which operate on datasets too big to fit in main memory by, for starters, reading in and working on a part of the data at a time. Lee observes that these are almost always parallelizable.
What Revolution markets as External Memory Algorithms, which are those external memory algorithms it has gotten around to implementing so far. These are all parallelized. They are also all in the category of …
… algorithms that can be parallelized by:
- Operating on data in parts.
- Getting intermediate results.
- Combining them in some way for a final result.
Algorithms of the previous category, where the way of combining them specifically is in the form of summation, such as those discussed in the famous paper Map-Reduce for Machine Learning on Multicore. Not all of Revolution’s current parallel algorithms fall into this group.

To be clear, numbering the categories above in the order listed, all Revolution’s parallel algorithms are in Category #2 by definition and Category #3 in practice. However, they aren’t all in Category #4.

The canonical example of how to parallelize an algorithm via intermediate results is taking the mean of a large set of numbers. Specifically:

For each subset of data, you both count the entries and sum the values.
Then to combine those intermediate results:
- You sum the sums.
- You also sum the counts.
- You divide the former result by the latter.
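The steps above can be sketched in a few lines of Python. This is a single-process illustration of the technique, not Revolution's code; the function names and the sample data are made up, and in a real deployment each call to `partial_stats` would run on a different node or core.

```python
def partial_stats(chunk):
    """Intermediate result for one subset of the data: (count, sum)."""
    return len(chunk), sum(chunk)


def combine(a, b):
    """Merge two intermediate results by summing counts and summing sums."""
    return a[0] + b[0], a[1] + b[1]


def parallel_mean(chunks):
    """Mean computed only from per-chunk intermediate results."""
    total = (0, 0)
    for chunk in chunks:
        # Each partial_stats call is independent, so it could run anywhere.
        total = combine(total, partial_stats(chunk))
    count, value_sum = total
    return value_sum / count


data = list(range(1, 101))
chunks = [data[:10], data[10:55], data[55:]]  # deliberately uneven split

print(parallel_mean(chunks))
print(sum(data) / len(data))
```

Only two numbers per chunk ever cross the network, regardless of how large the chunk is. That small, fixed-size intermediate result is what makes this style of parallelization cheap.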

Unfortunately, it’s hard to clearly articulate a precise description of these parallelizable algorithms. That said:

What you want is for the end result to be identical irrespective of how the data is split up. (Duh!)
Lee suggested that it is sufficient but not necessary that the way of combining the intermediate results be both commutative and associative.
To date, all of Revolution’s algorithms are — you guessed it! — commutative and associative.
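A small experiment shows why those two properties matter. The sketch below, with made-up data and function names, combines intermediate results in every possible order. Summing (count, sum) pairs is commutative and associative, so the answer never changes. Averaging two means is commutative but not associative, so the answer depends on the order in which the partial results arrive.

```python
from functools import reduce
from itertools import permutations


def combine_good(a, b):
    """Commutative and associative: add (count, sum) pairs elementwise."""
    return a[0] + b[0], a[1] + b[1]


def combine_bad(a, b):
    """Commutative but NOT associative: average two means, ignoring weights."""
    return (a + b) / 2


chunks = [[1, 2, 3, 4], [10], [20, 30]]

good_parts = [(len(c), sum(c)) for c in chunks]   # per-chunk (count, sum)
bad_parts = [sum(c) / len(c) for c in chunks]     # per-chunk means

# Fold the intermediate results together in every possible arrival order.
good_results = {reduce(combine_good, p) for p in permutations(good_parts)}
bad_results = {reduce(combine_bad, p) for p in permutations(bad_parts)}

print(good_results)
print(sorted(bad_results))
```

The first set contains a single value and the second contains several. This is the practical meaning of wanting the end result to be identical irrespective of how the data is split up.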

I asked Lee about algorithms that were inherently difficult to parallelize in this style, and he expressed optimism that some other approach would usually work in practice. In particular, we had a lively discussion about finding the exact median, or more generally finding n-tiles and the whole “empirical distribution”. Lee said that, for example, it is extremely fast to bin billions of values into 10,000 buckets. Further, he suggested it is very fast in general to do the operation for integer values, and hence also for any values with a reasonably limited number of significant digits.
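One plausible reading of the integer case, offered as an assumption rather than a description of Revolution's implementation, is that each node builds a histogram of value counts, the histograms are summed, and the median is read off the combined counts. The sketch below uses invented data and function names.

```python
from collections import Counter


def partial_counts(chunk):
    """Intermediate result for one subset: a value -> count histogram."""
    return Counter(chunk)


def lower_median_from_counts(counts):
    """Exact lower median, found by walking the histogram in value order."""
    n = sum(counts.values())
    target = (n - 1) // 2  # 0-based rank of the lower median
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen > target:
            return value
    raise ValueError("empty histogram")


chunks = [[5, 1, 9, 3], [7, 3, 3], [8, 2]]

# Combining histograms is just addition, so it is commutative and associative.
total = Counter()
for chunk in chunks:
    total += partial_counts(chunk)

median = lower_median_from_counts(total)

# Cross-check against the brute-force approach of sorting everything.
flat = sorted(v for chunk in chunks for v in chunk)
print(median, flat[(len(flat) - 1) // 2])
```

The intermediate result grows with the number of distinct values rather than the number of rows, which fits the remark about values with a limited number of significant digits. For wider-ranging values, fixed-width buckets would give an approximate answer that a second pass over the winning bucket could refine.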

As should be clear from this discussion, Revolution’s parallel algorithms are indeed parallel for any reasonable kind of distribution of work. Although they were shipped first for multi-core single-server and MPI environments, the recent ports to Teradata and generic Hadoop MapReduce seem to have been fairly straightforward. Revolution seems to have good modularity between the algorithms themselves, the intermediate data passing, and the original algorithm launch, and hence makes strong claims of R code portability — but the list of exceptions in “portable except for ____” did seem to lengthen a bit each time we returned to the subject.

Finally, notes on Revolution’s Teradata implementation include:

There’s a master process (external stored procedure) which then generates SQL and invokes table operators.
The whole thing runs in protected mode (i.e. out-of-process). Lee thinks that there’s only a small performance penalty vs. in-process.
(For some reason I found this amusing) When you send an R job to Teradata, the R code itself is shipped via ODBC.

while notes on Revolution’s initial Hadoop implementation start:

One way it talks to data in HDFS (Hadoop Distributed File System) is through LibHDFS. The other, when available, is ODBC.
It uses generic MapReduce. Faster alternatives may be implemented down the road.

Related link

Teradata is seeing interest in in-database R. (September, 2013)

RDBMS and their bundle-mates

Curt Monash — Sun, 10 Nov 2013 19:22:48 +0000

Relational DBMS used to be fairly straightforward product suites, which boiled down to:

A big SQL interpreter.
A bunch of administrative and operational tools.
Some very optional add-ons, often including an application development tool.

Now, however, most RDBMS are sold as part of something bigger.

Oracle has hugely thickened its stack, as part of an Innovator’s Solution strategy — hardware, middleware, applications, business intelligence, and more.
IBM has moved aggressively to a bundled “appliance” strategy. Even before that, IBM DB2 long sold much better to committed IBM accounts than as a software-only offering.
Microsoft SQL Server is part of a stack, starting with the Windows operating system.
Sybase was an exception to this rule, with thin(ner) stacks for both Adaptive Server Enterprise and Sybase IQ. But Sybase is now owned by SAP, and increasingly integrated as a business with …
… SAP HANA, which is closely associated with SAP’s applications.
Teradata has always been a hardware/software vendor. The most successful of its analytic DBMS rivals, in some order, are:
- Netezza, a pure appliance vendor, now part of IBM.
- Greenplum, an appliance-mainly vendor for most (not all) of its existence, and in particular now as a part of EMC Pivotal.
- Vertica, more of a software-only vendor than the others, but now owned by and increasingly mainstreamed into hardware vendor HP.
MySQL’s glory years were as part of the “LAMP” stack.
Various thin-stack RDBMS that once were or could have been important market players … aren’t. Examples include Progress OpenEdge, IBM Informix, and the various strays adopted by Actian.

This phenomenon is, I think, much more driven by vendors than users. Most of the examples I listed work or could work perfectly well on their own.* But relational database management systems are seen as “strategic” products, which means in particular:

They’re often expensive to adopt (software, hardware, people costs).
They’re also often expensive to switch away from.

And strategic products, high price tags, and thick product stacks commonly go together.

*Netezza is an exception. But Exadata is not; while Oracle data warehousing was in a bad technical place before Exadata, Exadata software is what cleaned the problem up.

Also relevant is that I took those examples from relatively mature RDBMS market segments: high-end OLTP (OnLine Transaction Processing)/general-purpose, mid-range OLTP/general-purpose, and analytic. Products in those sectors have had enough time to be built out. They also tend to have fairly close competitors, as the most important product features (e.g. columnar storage in analytic RDBMS, or online backup across the board) have been imitated numerous times each.

NewSQL, by way of contrast, is just as thin-stack as NoSQL is. Products in those sectors are immature; vendors are completing them first before wedding them to other technology layers. They’re also strongly differentiated; if you tell me what topology you need and which style(s) of API or DML (Data Manipulation Language) you prefer, the list of product candidates I give you may be short indeed.

HBase is the obvious exception to my “NoSQL products stand alone” generalization, but its market position is a matter of debate.

I have mixed feelings about this trend. For starters, I’m grudgingly becoming more sympathetic to DBMS/hardware bundles, notwithstanding their role as a way to gouge more money from customers than the hardware is actually worth. Why? Because of my opinion that there’s a general move toward appliances, clusters and clouds. In particular:

As DBMS become better at straddling and melding RAM, flash and disk, legitimate reasons to optimize hardware/software integration will increase.
Microsoft (with Parallel Data Warehouse) and SAP (with HANA) induce customers to adopt hardware “appliances” even though they don’t sell and profit from the hardware themselves. This shoots down the argument that appliances are only a vendor trick to squeeze out more profits.
Netezza’s super-easy installation was a really nice feature.

When it comes to RDBMS/business intelligence bundles, my thoughts start:

As a general rule, a benefit of BI is that it can get at data from lots of different sources. This speaks against tying it to a specific DBMS.
The vendor-specific evidence is mixed:
- IBM has never explained any user advantages to including Cognos in its analytic “appliance” product lines.
- Teradata did some special optimizations for MicroStrategy. This suggests that, conversely, MicroStrategy could benefit from DBMS-specific features.
- QlikView built a custom in-memory data store.
- Specialized business intelligence stacks are on the rise, although generally with a beyond-just-relational flavor.

And so I’m skeptical about RDBMS/BI integration, but willing to be persuaded otherwise.

The integration of advanced analytics with RDBMS leaves me perplexed. Gains in performance, scalability and/or development ease would seem, in many cases, too great to pass up. (E.g., the Teradata Aster 6 story, analytic libraries and all.) And indeed most analytic platform vendors report some level of adoption. But the whole thing is moving more slowly than I expected. Meanwhile in the Hadoop world, a much lesser SQL capability — Hive — seems to be integrated into other analytic processing with enthusiasm. Perhaps the problem is that enterprises have to figure out which analytic techniques to use in the first place, before they worry too much about making them efficient.

And finally, when it comes to bundling of packaged applications with RDBMS — that depends on the class of application.

At the high end, it’s almost purely a pricing ploy, as those apps are usually written for lowest-common-denominator SQL functionality, so as to preserve portability.
A lot of mid-range apps are written against a specific DBMS, which is then resold along with the app. What’s more …
… most of those apps will migrate over time to a SaaS (Software as a Service) delivery model, which allows for a wholly integrated stack. And as the Workday example teaches us, database choices for SaaS apps can be pretty imaginative.

Related links

The refactoring of everything (July, 2013)
Comments about Gartner’s comments about a bunch of DBMS products (November, 2013)
The cardinal rules of DBMS development (March, 2013)

“Disruption” in the software industry

Curt Monash — Thu, 01 Aug 2013 01:02:41 +0000

I lampoon the word “disruptive” for being badly overused. On the other hand, I often refer to the concept myself. Perhaps I should clarify.

You probably know that the modern concept of disruption comes from Clayton Christensen, specifically in The Innovator’s Dilemma and its sequel, The Innovator’s Solution. The basic ideas are:

Market leaders serve high-end customers with complex, high-end products and services, often distributed through a costly sales channel.
Upstarts serve a different market segment, often cheaply and/or simply, perhaps with a different business model (e.g. a different sales channel).
Upstarts expand their offerings, and eventually attack the leaders in their core markets.

In response (this is the Innovator’s Solution part):

Leaders expand their product lines, increasing the value of their offerings in their core markets.
In particular, leaders expand into adjacent market segments, capturing margins and value even if their historical core businesses are commoditized.
Leaders may also diversify into direct competition with the upstarts, but that generally works only if it’s via a separate division, perhaps acquired, that has permission to compete hard with the main business.

But not all cleverness is “disruption”.

Routine product advancement by leaders — even when it’s admirably clever — is “sustaining” innovation, as opposed to the disruptive stuff.
Innovative new technology from small companies is not, in itself, disruption either.

Here are some of the examples that make me think of the whole subject.

1. The best example of DBMS industry disruption is Microsoft SQL Server in the 1990s. Every time I talked with Microsoft, they asserted that Oracle would have great difficulty competing with SQL Server, because SQL Server was much cheaper than Oracle, and was offered to businesses and departments who would be satisfied with its features, in many cases through sales channels that Microsoft dominated. Microsoft also had a massive advantage over Oracle in ease-of-administration. Dan Rosenberg and Andy Mendelsohn eventually led Oracle to narrow that gap, but for years Microsoft’s administrability edge fit perfectly into the “simpler/cheaper” part of the disruption story.

Microsoft turned out to be correct in its optimism, and is now a formidable competitor to Oracle in enterprises much larger than it once appeared able to serve.

2. Oracle executed an awesome “Innovator’s Solution” response (to Microsoft and other threats). Oracle has gone bonkers expanding its stack, with massive acquisitions in applications, hardware, middleware and more. And while upstart DBMS vendors certainly get some projects Oracle would want, on the whole Oracle has done an excellent job of maintaining both margins and account control.

All good things come to an end, and I think the finish to Oracle’s glory days is closer than the beginning. But the length of Oracle’s run at the top is a testimony, in large part, to its excellent record of strategic decision-making.

That said, the end of my nicely lucrative consulting relationship with Oracle came in the late 1990s, when I sent over dire and in retrospect accurate warnings that Oracle was blowing the internet opportunity. Even Oracle’s strategies aren’t always correct.

3. MySQL wasn’t a huge success at disruption. MySQL had a textbook disruption strategy — cheap, simple, pursuing markets Oracle wasn’t strong in. But MySQL never accomplished much in Oracle’s core or semi-core markets.

Even so, Oracle went well out of its way to buy up MySQL, limiting future threats from that source.

4. BI isn’t really undergoing disruption. QlikTech and now Tableau have gained business intelligence market share by offering good-looking, easy-to-deploy systems to departments. But that’s exactly what the current market leaders did in the 1990s, and they haven’t entirely forgotten the land-and-expand model. Yes, I’ve predicted a much-needed reinvention of/revolution in BI — but better technology alone rarely a disruption makes.

5. SaaS can be a vector of disruption. Salesforce.com rode software-as-a-service to disruption of the sales force automation and customer relationship management (CRM) markets. SaaS was easy to deploy in the distributed way that field sales forces needed, and just plain easy to deploy as marketing departments took control of their IT.

I also think SaaS is going to dominate the market at small enterprises, specifically ones too small to have a lot of domain specialists on their IT staffs, on the strength of what are new business models for sellers and buyers alike. But it’s not yet clear whether SaaS vendors will disrupt many more large-enterprise software markets.

6. Netezza was disruptive:

Much cheaper than alternatives.
Much easier to deploy and administer.
A smart purchase even for “departments” of 3 analysts or fewer.

But like MySQL, Netezza didn’t grow up to take over the world.

7. Hadoop is disruptive. For reasons of price, scale, and capabilities, Hadoop doesn’t compete much with analytic (or other) RDBMS. But it aspires to. It also aspires to compete with object storage, predictive modeling tools, and various other categories of software as well.

Hadoop’s disruptive success has already surpassed Netezza’s, and probably MySQL’s as well. How far it goes will be one of the big stories of IT’s next 7-10 years.

8. NoSQL, taken together, is disruptive, for similar reasons to why MySQL was. But that doesn’t mean that any one particular NoSQL product should be viewed as particularly disruptive. Even MongoDB has accomplished less to date than MySQL eventually did.*

*But also in less time, as my friends at 10gen would surely hasten to point out.

9. There isn’t much disruption in NewSQL. Mainly, NewSQL is a collection of efforts to win with better technology than what came before.

Data skipping

Curt Monash — Mon, 27 May 2013 05:11:53 +0000

Way back in 2006, I wrote about a cool Netezza feature called the zone map, which in essence allows you to do partition elimination even in the absence of strict range partitioning.

Netezza’s substitute for range partitioning is very simple. Netezza features “zone maps,” which note the minimum and maximum of each column value (if such concepts are meaningful) in each extent. This can amount to effective range partitioning over dates; if data is added over time, there’s a good chance that the data in any particular date range is clustered, and a zone map lets you pick out which data falls in the desired data range.

I further wrote:

… that seems to be the primary scenario in which zone maps confer a large benefit.

But I now think that part was too pessimistic. For example, in bulk load scenarios, it’s easy to imagine ways in which data can be clustered or skewed. And in such cases, zone maps can let you skip a large fraction of potential I/O.
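To make the mechanism concrete, here is a minimal sketch of zone-map-style skipping. It is an illustration under stated assumptions, not Netezza's actual implementation: the names `Extent`, `build_extents` and `range_scan` are hypothetical, the column holds plain integers (think of dates stored as day numbers), and the extent size is arbitrary.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Extent:
    """One storage extent of a single column, plus its zone-map entry."""
    values: List[int]  # column values stored in this extent
    lo: int            # minimum value in the extent
    hi: int            # maximum value in the extent


def build_extents(column: List[int], extent_size: int) -> List[Extent]:
    """Split a column into fixed-size extents, recording min/max for each."""
    extents = []
    for start in range(0, len(column), extent_size):
        chunk = column[start:start + extent_size]
        extents.append(Extent(chunk, min(chunk), max(chunk)))
    return extents


def range_scan(extents: List[Extent], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return values in [lo, hi] and the number of extents actually read."""
    matches: List[int] = []
    extents_read = 0
    for ext in extents:
        # The zone map proves this extent holds nothing in range: skip the I/O.
        if ext.hi < lo or ext.lo > hi:
            continue
        extents_read += 1
        matches.extend(v for v in ext.values if lo <= v <= hi)
    return matches, extents_read


# Hypothetical data: 100 day numbers loaded in arrival order, 10 per extent.
ordered = build_extents(list(range(100)), extent_size=10)
rows, extents_read = range_scan(ordered, 25, 34)
```

In this sketch the query touches only the two extents whose min/max ranges overlap 25 to 34, out of ten. If the same values were stored in random order, almost every extent's range would overlap the predicate and little would be skipped, which is why clustering or skew in the loaded data matters so much.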

Over the years I’ve said that other things were reminiscent of Netezza zone maps, e.g. features of Infobright, SenSage, InfiniDB and even Microsoft SQL Server. But truth be told, when I actually use the phrase “zone map”, people usually give me a blank look.

In a recent briefing about BLU, IBM introduced me to a better term — data skipping. I like it and, unless somebody comes up with a good reason not to, I plan to start using it myself.

DBMS development and other subjects

Curt Monash — Mon, 18 Mar 2013 05:29:42 +0000

The cardinal rules of DBMS development

Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.

That’s if things go extremely well.

Rule 2: You aren’t an exception to Rule 1.

In particular:

Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
Mixed workload management is harder than you’re assuming it is.
Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.

DBMS with Hadoop underpinnings …

… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact.

But note that the HadoopDB prototype — on which Hadapt was based — was completed and the paper presented in 2009.

MarkLogic …

… has been around long enough to make a good DBMS. It used to make a solid XML DBMS. Now SQL and JSON are also in the mix. The SQL part is a reversal of MarkLogic’s long-time stance. The JSON part gets MarkLogic out of the usually-losing side of the XML/JSON debate.

As for MarkLogic’s Enterprise NoSQL messaging — it basically equates “NoSQL” to “short-request dynamic-schema”, and in 2013 I have little quarrel with that definition.

RDBMS-oriented Hadoop file formats are confusing

I’ve recently tried asking both Cloudera and Hortonworks about the “columnar” file formats beneath their respective better-Hive efforts, each time getting the response “Let me set you up with a call with the right person.” Cloudera also emailed over a link to Parquet, evidently the latest such project.

Specific areas about which I’m confused (and the same questions apply to any of these projects, as they seem similarly-intended) include but are not limited to:

Is it truly columnar (doesn’t seem so, based on the verbiage), or more PAX-like, or something else entirely?
What’s the nested data structure story? (It seems there is one.)
What’s the compression story?

Come to think of it, the name “Parquet” suggests that either:

Rows and columns are mixed together.
Somebody has the good taste to be a Celtics fan.

Whither analytic platforms?

I’ve been a big advocate of analytic platform technology, but interest hasn’t increased as much as I expected. Teradata Aster seems to be doing well, but not so extremely well that IBM Netezza, Sybase IQ, et al. feel the need to be aggressive in their responses. Vendors have, for the most part, put decent capabilities in place; but the energy I’d looked for isn’t there.

I think that problems include:

Analytic platforms are marketed too purely as a development play. Selling six-to-seven figure application development deals is hard.
But selling analytic performance — the other main benefit — is harder than it used to be. Good enough is often good enough. In particular …
… a lot of analytic work is being conceded, rightly or wrongly, to Hadoop.
More generally, selling advanced analytic tools is commonly a tough, niche-oriented business.

Also, some of the investigative analytics energy has been absorbed by business intelligence tools, specifically ones with “discovery” interfaces — Tableau, QlikView, and so on.

Related links

One database to rule them all? (February, 2013)
NewSQL thoughts (January, 2013)
Bottleneck Whack-A-Mole (August, 2009)