Data warehouse appliances – DBMS 2 : DataBase Management System Services

Generally available Kudu

Curt Monash — Fri, 16 Jun 2017 15:52:45 +0000

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:

Security is an ever bigger deal.
There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliances (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general purpose DBMS such as Oracle or SQL Server.
Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.

Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:

A data storage system introduced by Cloudera (and subsequently open-sourced).
Columnar.
Updatable in human real-time.
Meant to serve as the data storage tier for Impala and Spark.

Kudu’s adoption and roll-out story starts:

Kudu went to general availability on January 31. I gather this spawned an uptick in trial activity.
A subsequent release with some basic security features spawned another uptick.
I don’t think Cloudera will mind my saying that there are many hundreds of active Kudu clusters.
But Cloudera believes that, this soon after GA, very few Kudu users are in actual production.

Early Kudu interest is focused on 2-3 kinds of use case. The biggest is the kind of “data warehousing” highlighted above. Cloudera characterizes the others by the kinds of data stored, specifically the overlapping categories of time series — including financial trading — and machine-generated data. A lot of early Kudu use is with Spark, even ahead of (or in conjunction with) Impala. A small amount has no relational front-end at all.

Other notes on Kudu include:

Solid-state storage is recommended, with a few terabytes per node.
You can also use spinning disk. If you do, your write-ahead logs can still go to flash.
Cloudera said Kudu compression ratios can be as low as 2-5X, or as high as 10-20X. With that broad a range, I didn’t drill down into specifics of what they meant.
There seem to be a number of Kudu clusters with 50+ nodes each. By way of contrast, a “typical” Cloudera customer has 100s of nodes overall.
As you might imagine from their newness, Kudu security features — Kerberos-based — are at the database level rather than anything more granular.

And finally, the Cloudera folks woke me up to some issues around streaming data ingest. If you stream data in, there will be retries resulting in duplicate delivery. So your system needs to deal with those one way or another. Kudu’s way is:

Primary keys will be unique. (Note: This is not obvious in a system that isn’t an entire RDBMS in itself.)
You can configure the uniqueness to be guaranteed either through an upsert mechanism or just by simply rejecting duplicates.
Alternatively, you can write code to handle duplication errors, e.g. via Spark.
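
The duplicate-handling options above can be sketched in a few lines of Python. This is a toy in-memory table with a unique primary key, not Kudu's client API; `KeyedTable`, `DuplicateKeyError` and the sample stream are all hypothetical names made up for illustration.

```python
class DuplicateKeyError(Exception):
    """Raised when an insert hits an existing primary key."""


class KeyedTable:
    """Toy in-memory table with a unique primary key (hypothetical, not Kudu's API)."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, row):
        # Reject-duplicates policy: a redelivered record fails loudly.
        if key in self.rows:
            raise DuplicateKeyError(key)
        self.rows[key] = row

    def upsert(self, key, row):
        # Upsert policy: a redelivered record overwrites itself, so retries are harmless.
        self.rows[key] = row


# A stream in which a retry has delivered event 2 twice.
stream = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]

upserted = KeyedTable()
for key, value in stream:
    upserted.upsert(key, value)

rejected = KeyedTable()
duplicates = []
for key, value in stream:
    try:
        rejected.insert(key, value)
    except DuplicateKeyError:
        # Application-level handling of the duplication error.
        duplicates.append(key)
```

Either way the table ends up with one row per key. The difference is whether the application ever hears about the redelivery.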

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
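
To make the ETL vs. ELT distinction concrete, here is a minimal sketch. It uses SQLite purely as a stand-in for an analytic RDBMS, and the table names and sample rows are invented for illustration. In the ETL path the cleanup happens before loading; in the ELT path raw strings are loaded first and the database does the transformation in SQL.

```python
import sqlite3

# Raw extract: amounts arrive as untidy strings (invented sample data).
raw_events = [("2016-08-01", " 19.99 "), ("2016-08-01", "5.01"), ("2016-08-02", "7.50")]


def etl(conn, events):
    # ETL: transform outside the database, then load finished rows.
    conn.execute("CREATE TABLE sales_etl (day TEXT, amount REAL)")
    cleaned = [(day, float(amount.strip())) for day, amount in events]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def elt(conn, events):
    # ELT: load raw strings first, then transform inside the database with SQL.
    conn.execute("CREATE TABLE staging (day TEXT, amount TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", events)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT day, CAST(TRIM(amount) AS REAL) AS amount FROM staging"
    )


conn = sqlite3.connect(":memory:")
etl(conn, raw_events)
elt(conn, raw_events)

query = "SELECT day, ROUND(SUM(amount), 2) FROM {} GROUP BY day ORDER BY day"
etl_totals = conn.execute(query.format("sales_etl")).fetchall()
elt_totals = conn.execute(query.format("sales_elt")).fetchall()
```

Both paths produce the same daily totals. What changes is which engine spends the cycles on transformation, and that is the part of the workload Hadoop tends to take over.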

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Data models

Curt Monash — Mon, 23 Feb 2015 03:08:10 +0000

7-10 years ago, I repeatedly argued the viewpoints:

Relational DBMS were the right choice in most cases.
Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
There were a variety of specialized use cases in which non-relational data models were best.

Since then, however:

Hadoop has flourished.
NoSQL has flourished.
Graph DBMS have matured somewhat.
Much of the action has shifted to machine-generated data, of which there are many kinds.

So it’s probably best to revisit all that in a somewhat organized way.

To make the subject somewhat manageable, I’ll focus on fielded data — i.e. data that represents values of something — rather than, for example, video or images. Fielded data always arrives as a string of bits, whose meaning boils down to a set of (name, value) pairs. Here by “string of bits” I mean mainly a single record or document (for example), although most of what I say can apply to a whole stream of data instead.

Important distinctions include:

Are the field names implicit or explicit? In relational use cases field names tend to be implicit, governed by the metadata. In some log files they may be space-savingly implicit as well. In other logs, XML streams, JSON streams and so on they are explicit.
If the field names are implicit, is any processing needed to recover them? Think Hadoop or Splunk acting on “dumb-looking” log data.
In any one record/document/whatever, are the field names unique? If not, then the current data model is not relational.
Are the field names the same from one record/document/whatever to the next? I.e., does the data fit into a consistent schema?
Is there a structure connecting the field names (and if so what kind)? E.g., hierarchical documents, or relational foreign keys.
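
A few of these distinctions can be illustrated with a small Python sketch. The sample streams, the `schema` list and the two helper functions are assumptions made up for the example, not anything from a particular product.

```python
import csv
import io
import json

# Explicit field names: every JSON record carries its own names.
json_stream = '{"user": "ann", "bytes": 512}\n{"user": "bob", "status": 404}\n'
explicit_records = [json.loads(line) for line in json_stream.splitlines()]

# Implicit field names: the rows hold only values; the names live in metadata.
schema = ["user", "bytes"]  # hypothetical table metadata
csv_stream = "ann,512\nbob,2048\n"
implicit_records = [dict(zip(schema, row)) for row in csv.reader(io.StringIO(csv_stream))]


def has_consistent_schema(records):
    """True if every record uses the same set of field names."""
    return len({frozenset(record) for record in records}) <= 1


def has_unique_field_names(pairs):
    """True if no field name repeats within one record of (name, value) pairs."""
    names = [name for name, _ in pairs]
    return len(names) == len(set(names))
```

Here the JSON stream fails the consistent-schema check (one record has `bytes`, the other `status`), while the metadata-governed rows pass it. A record such as `[("tag", "a"), ("tag", "b")]` fails the uniqueness check, so it is not relational as it stands.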

Some major data models can be put into a fairly strict ordering of query desirability by noting:

The best thing to query is a relational DBMS. Everything has a known field name, so SELECTs are straightforward. You also have JOINs, which are commonly very valuable. And RDBMS are a mature technology with in many cases great query performance.
The next-best thing to query is another kind of data store with known field names. In such data stores:
- SQL or SQL-like SELECTs will still work, or can easily be made to do so.
- Useful indexing systems can be grafted on to them (although they are typically less mature than in RDBMS).
- In the (mainly) future, perhaps JOINs can be grafted on as well.
The worst thing to query is a data store in which you only have a schema on read. You have to do work to make the thing queryable in the first place.
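
As a rough illustration of the work that schema on read involves, the sketch below stores log lines untouched and only imposes field names at query time. The log layout and the regular expression are assumed for the example.

```python
import re

# Raw log lines stored as-is; no field names were imposed at write time.
raw_log = [
    "2015-02-23 03:08:10 GET /index.html 200",
    "2015-02-23 03:08:11 GET /missing 404",
    "garbage line that does not parse",
]

# The "schema" is applied only when reading (this pattern is an assumed log layout).
LINE = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d{3})$"
)


def read_with_schema(lines):
    """Yield one dict per line that matches the pattern; skip the rest."""
    for line in lines:
        match = LINE.match(line)
        if match:
            record = match.groupdict()
            record["status"] = int(record["status"])
            yield record


# Only after that work can we run the equivalent of SELECT path WHERE status = 404.
not_found = [r["path"] for r in read_with_schema(raw_log) if r["status"] == 404]
```

Every query pays the parsing cost again, and lines that do not fit the pattern silently drop out. Both are reasons this style sits at the bottom of the query-desirability ordering.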

Unsurprisingly, that ordering is reversed when it comes to writing data.

The easiest thing to write to is a data store with no structure.
Next-easiest is to write to a data store that lets you make up the structure as you go along.
The hardest thing to write to is a relational DBMS, because of the requirements that must be obeyed, notably:
- Implicit field names, governed by metadata.
- Unique field names within any one record.
- The same (ordered) set of field names for each record — more precisely, a limited collection of such ordered sets, one per table.
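
The write-side ordering can be sketched the same way. Below, one record is written to three toy stores; SQLite stands in for a relational DBMS, and the store names and the record itself are invented for illustration.

```python
import sqlite3

record = {"visitor": "ann", "bytes": 512, "referrer": "example.org"}

# 1. No structure: just append the bytes.
blob_store = []
blob_store.append(repr(record).encode())

# 2. Structure made up as you go: any set of field names is accepted.
document_store = []
document_store.append(record)

# 3. Relational: the record must fit the table's fixed set of columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (visitor TEXT, bytes INTEGER)")
try:
    conn.execute(
        "INSERT INTO hits (visitor, bytes, referrer) "
        "VALUES (:visitor, :bytes, :referrer)",
        record,
    )
    relational_write_ok = True
except sqlite3.OperationalError:
    # The table has no "referrer" column, so the write is refused.
    relational_write_ok = False
```

The first two writes succeed without any planning. The third fails until somebody alters the table or trims the record to match it.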

And so, for starters, most large enterprises will have important use cases for data stores in all of the obvious categories. In particular:

Usually it is best to have separate brands of general-purpose/OLTP (OnLine Transaction Processing) and analytic RDBMS. Further:
- I have in the past also advocated for a mid-range — i.e. lighter-weight — general purpose RDBMS.
- SAP really, really wants you to use HANA to run SAP’s apps.
- You might want an in-memory RDBMS (MemSQL) or a particularly cloudy one or whatever.
Your website alone is reason enough to use a NoSQL DBMS, most likely MongoDB or Cassandra. And it often makes sense to have multiple NoSQL systems used for different purposes, because:
- They’re all immature right now, with advantages over each other.
- The apps you’re using them for are likely to be thrown out in a few years, so you won’t have great pain switching if you ever do decide to standardize.
Whatever else Hadoop is — and it’s a lot of things — it’s also a happy home for log files. And enterprises have lots of log files.

Beyond that:

You may want something to manage organizational hierarchies and so on, if you build enough custom systems in areas such as security, knowledge management, or MDM (Master Data Management). I’m increasingly persuaded by the argument that this should be a graph DBMS rather than an LDAP (Lightweight Directory Access Protocol) system.
Splunk is cool.
Use cases for various other kinds of data stores can often be found.
Of course you’ll be implicitly using whatever is bundled into your SaaS (Software as a Service) systems, your app-specific appliances and so on.

And finally, I think in-memory data grids:

Will be widely used and important.
Will be used to instantiate multiple data models at once.

Related links

One reason for writing this post was for some deck-clearing before I revisit the white-hot topic of data streaming. (October, 2014)
I’ve long mused about the challenges of getting by without joins. (November, 2010)
In 2013 I observed that data models will be in perpetual, rapid flux.
In 2013 I also discussed attempts to combine multiple data models (or access methods) in a single DBMS.
I surveyed data models and access methods back in 2008.

Greenplum is being open sourced

Curt Monash — Wed, 18 Feb 2015 21:51:39 +0000

While I don’t find the Open Data Platform thing very significant, an associated piece of news seems cooler — Pivotal is open sourcing a bunch of software, with Greenplum as the crown jewel. Notes on that start:

Greenplum has been an on-again/off-again low-cost player since before its acquisition by EMC, but open source is basically a commitment to having low license cost be permanently on.
In most regards, “free like beer” is what’s important here, not “free like speech”. I doubt non-Pivotal employees are going to do much hacking on the long-closed Greenplum code base.
That said, Greenplum forked PostgreSQL a long time ago, and the general PostgreSQL community might gain ideas from some of the work Greenplum has done.
The only other bit of newly open-sourced stuff I find interesting is HAWQ. Redis was already open source, and I’ve never been persuaded to care about GemFire.

Greenplum, let us recall, is a pretty decent MPP (Massively Parallel Processing) analytic RDBMS. Various aspects of it were oversold at various times, and I’ve never heard that they actually licked concurrency. But Greenplum has long had good SQL coverage and petabyte-scale deployments and a columnar option and some in-database analytics and so on; i.e., it’s legit. When somebody asks me about open source analytic RDBMS to consider, I expect Greenplum to consistently be on the short list.

Further, the low-cost alternatives for analytic RDBMS are adding up.

Amazon Redshift has considerable traction.
Hadoop (even just with Hive) has offloaded a lot of ELT (Extract/Load/Transform) from analytic RDBMS such as Teradata.
Now Greenplum is in the mix as well.

For many analytic RDBMS use cases, at least one of those three will be an appealing possibility.

By no means do I want to suggest those are the only alternatives.

Smaller-vendor offerings, such as CitusDB or Infobright, may well be competitive too.
Larger vendors can always slash price in specific deals.
MonetDB is still around.

But the three possibilities I cited first should suffice as proof for almost all enterprises that, for most use cases not requiring high concurrency, analytic RDBMS need not cost an arm and a leg.

Related link

Greenplum revenue at EMC was problematic from the get-go.

Notes from a visit to Teradata

Curt Monash — Sun, 31 Aug 2014 09:17:29 +0000

I spent a day with Teradata in Rancho Bernardo last week. Most of what we discussed is confidential, but I think the non-confidential parts and my general impressions add up to enough for a post.

First, let’s catch up with some personnel gossip. So far as I can tell:

Scott Gnau runs most of Teradata’s development, product management, and product marketing, the big exception being that …
… Darryl McDonald runs the apps part (Aprimo and so on), and no longer is head of marketing.
Oliver Ratzesberger runs Teradata’s software development.
Jeff Carter has returned to his roots and runs the hardware part, in place of Carson Schmidt.
Aster founders Mayank Bawa and Tasso Argyros have left Teradata (perhaps some earn-out period ended).
Carson is temporarily running Aster development (in place of Mayank), and has some sort of evangelism role waiting after that.
With the acquisition of Hadapt, Teradata gets some attention from Dan Abadi. Also, they’re retaining Justin Borgman.

The biggest change in my general impressions about Teradata is that they’re having smart thoughts about the cloud. At least, Oliver is. All details are confidential, and I wouldn’t necessarily expect them to become clear even in October (which once again is the month for Teradata’s user conference). My main concern about all that is whether Teradata’s engineering team can successfully execute on Oliver’s directives. I’m optimistic, but I don’t have a lot of detail to support my good feelings.

In some quick-and-dirty positioning and sales qualification notes, which crystallize what we already knew before:

The Teradata 1xxx series is focused on cost-per-bit.
The Teradata 2xxx series is focused on cost-per-query. It is commonly Teradata’s “lead” product, at least for new customers.
The Teradata 6xxx series is supposed to be able to do “everything”.
The Teradata Aster “Discovery Analytics” platform is sold mainly to customers who have a specific high-value problem to solve. (Randy Lea gave me a nice round dollar number, but I won’t share it.) I like that approach, as it obviates much of the concern about “Wait — is this strategic for us long-term, given that we also have both Teradata database and Hadoop clusters?”

Also:

1xxx and 2xxx systems are meant to be I/O-constrained. 6xxx systems are meant to be constrained mainly by CPU, but every system will be I/O-constrained at some point.
There is at least one example of a Very Well Known organization buying Teradata’s Hadoop-only appliance despite not otherwise being a Hadoop customer. Teradata concedes, however, that this is not a common occurrence.
Customers are increasingly using co-location rather than their own data centers. Many colo organizations charge more or less strictly by floor space. Hence, there’s a push for maximum processing density per rack, power density and weight be damned.

Speaking of not being CPU-constrained — I heard 7-10% as an estimate for typical Hadoop utilization, and also 10-15%. While I didn’t ask, I presume these figures assume traditional MapReduce types of Hadoop workloads. I’m not sure why these figures are yet lower than eBay’s long-ago estimates of Hadoop “parallel efficiency”.

Like Carson used to do, Jeff shared a variety of hardware and networking tidbits with me. In particular:

Jeff is confident in Moore’s Law continuing for at least 5 more years. (I think that’s a near-consensus; the 2020s, however, are another matter.)
Teradata still uses SAS rather than SATA for all disk (spinning or solid-state) controllers. They’re now seeing 600-700 MB/sec/device on SSDs (Solid State Disks), up from 300-400.
SSD prices are down 60% over the past 6 months, vs. much slower declines previously.
Formerly a SanDisk/Pliant partisan, Teradata now thinks there are multiple vendors of good SSDs. (I’m not sure whether they’d be happy if I said which one they currently like best.)
Jeff foresees InfiniBand and Ethernet more or less merging. Right now Teradata is using a lot of 56 Gb/sec InfiniBand.

Since Oliver is now a Teradata mucky-muck, I asked about virtual data marts, an idea that he pretty much invented or at least popularized back in his eBay days. Comments included:

Teradata now calls them Data Labs.
Adoption is very high.
One major feature is “time boxing” — they expire after a period of time unless you renew them.
Analysis of virtual data mart usage is a good guide as to what you might want to add to your permanent data warehouse.
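
The “time boxing” idea is simple enough to sketch. The following is a hypothetical model, not Teradata's implementation; the `DataLab` class, the 90-day default lifetime and the lab names are all assumptions.

```python
from datetime import datetime, timedelta


class DataLab:
    """Hypothetical time-boxed sandbox: it expires unless renewed."""

    def __init__(self, name, created, lifetime=timedelta(days=90)):
        self.name = name
        self.lifetime = lifetime  # assumed default; real policies will differ
        self.expires = created + lifetime

    def renew(self, now):
        # Renewal restarts the clock from the renewal time.
        self.expires = now + self.lifetime

    def is_expired(self, now):
        return now >= self.expires


def sweep(labs, now):
    """Return the labs that survive; expired ones would be dropped."""
    return [lab for lab in labs if not lab.is_expired(now)]


start = datetime(2014, 8, 1)
labs = [DataLab("churn_test", start), DataLab("pricing_test", start)]
labs[0].renew(start + timedelta(days=80))  # its owner renews before the deadline
survivors = sweep(labs, start + timedelta(days=100))
```

In this run only `churn_test` survives the sweep on day 100. A log of which labs keep getting renewed is the kind of usage signal that could feed decisions about the permanent warehouse.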

And I’ll stop here, although I hope that a couple more-focused posts will also eventually flow from the visit.

21st Century DBMS success and failure

Curt Monash — Mon, 14 Jul 2014 06:37:31 +0000

As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.

DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.

In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.

Buyer inertia is a greater concern.

A significant minority of enterprises are highly committed to their enterprise DBMS standards.
Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.

A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.

First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
Meanwhile, Hadoop has emerged as a serious competitor for at least some analytic data management, especially but not only at internet companies.

Otherwise I’d say:

At large enterprises, their internet operations perhaps excepted:
- Short-request/general-purpose SQL alternatives to the behemoths — e.g. MySQL, PostgreSQL, NewSQL — have had tremendous difficulty getting established. The last big success was the rise of Microsoft SQL Server in the 1990s. That’s why I haven’t mentioned the term mid-range DBMS in years.
- NoSQL/non-SQL has penetrated large enterprises mainly for a few specific use cases, for example the lists I posted for MongoDB or graph databases.
Internet-only companies have few inertia issues when it comes to database managers. They’ll consider anything they regard as being in their price ballpark (which is however often restricted to open source). I think part of the reason is that as quickly as they rewrite their applications, DBMS are vastly less “strategic” to them than they are to most larger enterprises.
The internet operations of large companies — especially large retailers — in many cases behave like internet-only companies, but in many other cases behave like the rest of the enterprise.

The major reasons for DBMS categories to get established in the first place are:

Performance and/or scalability (many examples).
Developer features (for example dynamic schema).
License/maintenance cost (for example several open source categories).
Ease of installation and administration (for example open source again, and also data warehouse appliances).

Those same characteristics are major bases for competition among members of a new category, although as noted above behemoth-loyalty can also come into play.

Cool-vs.-weird tradeoffs are somewhat secondary among SQL DBMS.

There’s not much of a “cool” factor, because new products aren’t that different in what they do vs. older ones.
There’s not a terrible “weird” factor either, but of course any smaller offering faces FUD, and also …
… appliances are anti-strategic for many buyers, especially ones who demand a smooth path to the cloud.

They’re huge, however, in the non-SQL world. Most non-SQL data managers have a major “weird” factor. Fortunately, NoSQL and Hadoop both have huge “cool” cred to offset it. XML/XQuery unfortunately did not.

Finally, in most DBMS categories there are massive issues with product completeness, more in the area of maturity than that of whole product. The biggest whole product issues are concentrated on the matter of interoperating with other software — business intelligence tools, packaged applications (if relevant to the category), etc. Most notably, the handful of DBMS that are certified to run SAP share a huge market that other DBMS can’t touch. But BI tools are less of a differentiator — I yawn when vendors tell me they are certified for/partnered with MicroStrategy, Tableau, Pentaho and Jaspersoft, and I’m surprised at any product that isn’t.

DBMS maturity has a lot of aspects, but the toughest challenges are concentrated in two main areas:

Reliability, especially but not only in short-request use cases.
Performance across a great variety of use cases. I observe frequently that performance in best-case scenarios, performance in the lab and performance in real-world environments are much further apart than vendors like to think.

In particular:

Maturity demands seem to be much higher for SQL DBMS than for NoSQL.
- I think this is one of several reasons NoSQL has been much more successful than NewSQL.
- It’s why I think MarkLogic’s “Enterprise NoSQL” positioning is a mistake.
As for MySQL:
- MySQL wasn’t close to reliable enough for enterprises to trust it until InnoDB became the default storage engine.
- MySQL 5 point releases have added major features, or decent performance for major features. I’ll confess to having lost track of what’s been fixed and what’s still missing.
- In saying all that I’m holding MySQL to a much higher maturity standard than I’m holding NoSQL — because that’s what I think enterprise customers do.
PostgreSQL “should” be doing a lot better than it is. I have an extremely low opinion of its promoters, and not just for personal reasons. (That said, the personal reasons don’t just apply to EnterpriseDB anymore. I’ve also run out of patience waiting for Josh Berkus to retract untruths he posted about me years ago.)
SAP HANA checks boxes for performance (In-memory rah rah rah!!) and whole product (Runs SAP!!). That puts it well ahead of most other newish SQL DBMS, purely analytic ones perhaps excepted.
Any other new short-request SQL DBMS that sounds like it has traction is also memory-centric.
Analytic RDBMS are in most respects held to lower maturity standards than DBMS used for write-intensive workloads. Even so, products in the category are still frequently tripped up by considerations of concurrent performance and mixed workload management.

Related links

There have been 1,470 previous posts in the 9-year history of this blog, many of which could serve as background material for this one. A couple that seem particularly germane and didn’t already get linked above are:

The drive for uninterrupted DBMS operation.
Short-request DBMS trade-offs and alternatives.

Notes and comments, March 17, 2014

Curt Monash — Mon, 17 Mar 2014 07:09:15 +0000

I have ever more business-advice posts up on Strategic Messaging. Recent subjects include pricing and stealth-mode marketing. Other stuff I’ve been up to includes:

The Spark buzz keeps increasing; almost everybody I talk with expects Spark to win big, probably across several use cases.

Disclosure: I’ll soon be in a substantial client relationship with Databricks, hoping to improve their stealth-mode marketing.

The “real-time analytics” gold rush I called out last year continues. A large fraction of the vendors I talk with have some variant of “real-time analytics” as a central message.

Basho had a major change in leadership. A Twitter exchange ensued. Joab Jackson offered a more sober — figuratively and literally — take.

Hadapt laid off its sales and marketing folks, and perhaps some engineers as well. In a nutshell, Hadapt’s approach to SQL-on-Hadoop wasn’t selling vs. the many alternatives, and Hadapt is doubling down on poly-structured data*/schema-on-need.

*While Hadapt doesn’t to my knowledge use the term “poly-structured data”, some other vendors do. And so I may start using it more myself, at least when the poly-structured/multi-structured distinction actually seems significant.

WibiData is partnering with DataStax. WibiData is of course pleased to get access to Cassandra’s user base, which gave me the opportunity to ask why they thought Cassandra had beaten HBase in those accounts. The answer was performance and availability, while Cassandra’s traditional lead in geo-distribution wasn’t mentioned at all.

Disclosure: My fingerprints are all over that deal.

In other news, WibiData has had some executive departures as well, but seems to be staying the course on its strategy. I continue to think that WibiData has a really interesting vision about how to do large-data-volume interactive computing, and anybody in that space would do well to talk with them or at least look into the open source projects WibiData sponsors.

I encountered another apparently-popular machine-learning term — bandit model. It seems to be glorified A/B testing. I think the point is that it tries to optimize for just how much you invest in testing unproven (for good or bad) alternatives.
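
One common way to make that trade-off explicit is an epsilon-greedy strategy, sketched below as a small simulation. This is just one bandit algorithm among several, chosen here for brevity; the conversion rates, the round count and the value of `epsilon` are made-up inputs.

```python
import random


def epsilon_greedy(true_rates, rounds, epsilon, seed=0):
    """Simulate an epsilon-greedy bandit over variants with the given conversion rates."""
    rng = random.Random(seed)
    pulls = [0] * len(true_rates)
    wins = [0] * len(true_rates)
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: spend a trial on a randomly chosen variant.
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: show the variant with the best observed rate so far.
            arm = max(range(len(true_rates)), key=lambda i: wins[i] / pulls[i])
        pulls[arm] += 1
        # A trial "converts" with the variant's true (hidden) probability.
        wins[arm] += rng.random() < true_rates[arm]
    return pulls, wins


pulls, wins = epsilon_greedy(true_rates=[0.05, 0.10], rounds=10_000, epsilon=0.1)
```

The `epsilon` parameter is the share of traffic invested in testing alternatives that may or may not pan out. A plain A/B test splits traffic evenly for the whole experiment, whereas this approach typically shifts most traffic to the apparent winner while it is still learning.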

I had an awkward set of interactions with Gooddata, including my longest conversations with them since 2009. Gooddata is in the early days of trying to offer an all-things-to-all-people analytic stack via SaaS (Software as a Service). I gather that Hadoop, Vertica, PostgreSQL (a cheaper Vertica alternative), Spark, Shark (as a faster version of Hive) and Cassandra (under the covers) are all in the mix — but please don’t hold me to those details.

I continue to think that computing is moving to a combination of appliances, clusters, and clouds. That said, I recently bought a new gaming-class computer, and spent many hours gaming on it just yesterday.* I.e., there’s room for general-purpose workstations as well. But otherwise, I’m not hearing anything that contradicts my core point.

*The last beta weekend for The Elder Scrolls Online; I loved Morrowind.

Some stuff I’m thinking about (early 2014)

Curt Monash — Sun, 02 Feb 2014 18:51:49 +0000

From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that includes:

Hadoop (always, and please see below).
Analytic RDBMS (ditto).
NoSQL and NewSQL.
Specifically, SQL-on-Hadoop.
Schema-on-need.
Spark and other memory-centric technology, including streaming.
Public policy, mainly but not only in the area of surveillance/privacy.
General strategic advice for all sizes of tech company.

Other stuff on my mind includes but is not limited to:

1. Certain categories of buying organizations are inherently leading-edge.

Internet companies have adopted Hadoop, NoSQL, NewSQL and all that en masse. Often, they won’t even look at things that are conventional or expensive.
US telecom companies have been buying 1 each of every DBMS on the market since pre-relational days.
Financial services firms — specifically algorithmic traders and broker-dealers — have been in their own technical world for decades …
… as have national-security agencies …
… as have pharmaceutical research departments.

Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.

2. In particular, I hope to figure out where Hadoop is or soon will be getting major adoption.

Widespread Hadoop adoption at ordinary large enterprises is, I think, inevitable and imminent. But it hasn’t quite happened yet.
I think that part of the “enterprise data hub” story is a great bet to come true — Hadoop is becoming a key destination for data to land and be transformed. MapReduce was invented for data transformation; Hadoop was invented to do MapReduce; data transformation workloads have already been moving from expensive analytic RDBMS to cheaper Hadoop.
I also think Hadoop — enhanced with Spark or whatever — will win as a platform for sophisticated predictive modeling; Hadoop’s (and Spark’s) flexibility is at least as useful for the purpose as RDBMS’ SQL execution speed.
I’m still skeptical about ordinary enterprises’ adoption of Hadoop as a business intelligence platform, but it’s definitely another area to track.

3. Analytic RDBMS and data warehouse appliance pricing is always a big deal. Hadoop’s great price advantage doesn’t have to be permanent, and in fact there are a number of fairly low-cost RDBMS offerings, such as petascale Vertica, the Teradata 1000 series, or Infobright.

Speaking of that, it turns out Teradata now publishes per-terabyte pricing. Please note that those prices are per terabyte of uncompressed data; effective prices can be assumed to be lower, at least for databases that compress well.

Analytic RDBMS prices are still shaking out.

4. As I previously noted, ensemble models have become the norm for machine learning. I want to learn more about the implications of that.

One conjecture — everything we learned in school about statistics is wrong, or at least it’s less important than we thought. Predictive modeling is not mainly about least squares, regressions, curve-fitting, etc. Rather, it’s first and foremost about data segmentation and clustering, with all the curve-fitting stuff being secondary.

Besides fitting — as it were — what I hear, this hypothesis also matches common sense. How do businesses use predictive modeling? For each customer/prospect/site-visitor/whatever, they decide which of a limited number of possible actions to take. At its core, that’s an exercise in segmentation.

5. I think data integration is getting a lot smarter than it was. Hadoop-based transformation is the obvious example. But there’s also ClearStory’s data intelligence pitch. (And yes, I know I need to talk with Paxata. There’s been a lot of ball-dropping on that one, including by me.)

6. There’s a meta-theme in the above — stuff that’s not exactly a DBMS or DBMS-like data store. Streaming fits into that. So does smart data integration. So, arguably, does Spark. So do data grids, another of those topics I’d like to know more about but haven’t nailed down yet.

Data management is getting ever more complex.

SaaS appliances, SaaS data centers, and customer-premises SaaS

Curt Monash — Fri, 29 Nov 2013 11:03:28 +0000

Conclusions

I think that most sufficiently large enterprise SaaS vendors should offer an appliance option, as an alternative to the core multi-tenant service. In particular:

SaaS appliances address customer fears about security, privacy, compliance, performance isolation, and lock-in.
Some of these benefits occur even if the appliance runs in the same data centers that host the vendor’s standard multi-tenant SaaS. Most of the rest occur if the customer can choose a co-location facility in which to place the appliance.
Whether many customers should or will use the SaaS appliance option is somewhat secondary; it’s a check-mark item. I.e., many customers and prospects will be pleased that the option at least exists.

How I reached them

Core reasons for selling or using SaaS (Software as a Service) as opposed to licensed software start:

The SaaS vendor handles all software upgrades, and makes them promptly. In principle, this benefit could also be achieved on a dedicated system on customer premises (or at the customer’s choice of co-location facility).
In addition, the SaaS vendor handles all the platform and operational stuff — hardware, operating system, computer room, etc. This benefit is antithetical to direct customer control.
The SaaS vendor only has to develop for and operate on a tightly restricted platform stack that it knows very well. This benefit is also enjoyed in the case of customer-premises appliances.

Conceptually, then, customer-premises SaaS is not impossible, even though one of the standard Big Three SaaS benefits is lost. Indeed:

Microsoft Windows and many other client software packages already offer to let their updates be automagically handled by the vendor.
In that vein, consumer devices such as game consoles already are a kind of SaaS appliance.
Complex devices of any kind, including computers, will see ever more in the way of “phone-home” features or optional services, often including routine maintenance and upgrades.

But from an enterprise standpoint, that’s all (relatively) simple stuff. So we’re left with a more challenging question — does customer-premises SaaS make sense in the case of enterprise applications or other server software?

Why would a customer actually want on-premises SaaS, as opposed to the standard remote version? The first ideas that come to mind are:

Security and/or privacy considerations, real or imagined. This is in fact the motivation behind the single case of on-premises enterprise SaaS I have confirmed, namely one that Cloudant told me about.* (I don’t have similar levels of detail about Glassbeam’s one on-premises subscription customer.)
Similarly, a less specific desire for isolation …
… and/or control.
Avoiding the expense of data movement to/from a remote location. For example, an enterprise might use SaaS OLTP (OnLine Transaction Processing) apps whose results it wants to stream to an on-premises data warehouse. Or the enterprise might have lower-volume but also lower-latency — and hence more costly — data integration needs, perhaps between different OLTP application suites, or with some MDM (Master Data Management) in the mix.

And, um — that’s about all I’ve got.

*Yes, I know Cloudant is DBaaS — but to me that’s a kind of SaaS, in which the S just happens to center around a DBMS.

Confusing matters further, there’s a middle option as well. salesforce.com and HP just announced that salesforce.com apps will, for the first time, run on dedicated customer-specific racks. But this will only be within the same data centers and operation groups that handle the rest of salesforce.com’s system. Notes on what’s being called a “pod” strategy start:

This suggests a perceived demand for isolation.
If these pods are offered to accounts large enough to saturate a bunch of servers each, they need not be much more expensive than the multi-tenant version of salesforce’s offering.
Other than (fully-loaded) cost, it’s hard to see a downside to this vs. multi-tenant SaaS.

Notwithstanding ever-increasing levels of comfort with SaaS and cloud computing, I’d guess that a number of enterprises will find the cost of single-tenant SaaS more palatable than the queasiness they feel about multi-tenant alternatives. And so I think there’s a place for single-customer enterprise SaaS stacks somewhere; the main remaining question is where they will be located.

Ducking that question a bit longer, let me note that:

In any scenario, we’re most likely talking about something like SaaS appliances. Customer-premises server SaaS that isn’t in some kind of appliance form is madness (unless the SaaS vendor is paid for on-site support as well), because no SaaS vendor wants to support hardware it can’t specify or control.
Some enterprises in some countries will surely insist on keeping data within national borders, for reasons of geo-compliance. Hence there will be a need to deploy SaaS appliances either literally to their premises, or else to an in-country co-location facility, perhaps managed by a big telecom firm. Of course, that need can only arise if vendors first overcome issues of software nationalization — language, regulations, other business customs, whatever.
The final point in my recent SaaS discussion post was about lock-in. If you use something that only runs in the supplier’s data centers, your lock-in is even worse than it is with most enterprise IT technology.
I can’t currently think of many examples in which SaaS appliances need to be located directly in the customers’ main data centers. When you get to data flows and volumes big enough for that to matter, you’re likely talking about the kinds of internet applications that probably shouldn’t be on-premises in the first place.

And that finally brings us to the opinions I copied up top.

I think that most sufficiently large enterprise SaaS vendors should offer an appliance option, as an alternative to the core multi-tenant service. In particular:

SaaS appliances address customer fears about security, privacy, compliance, performance isolation, and lock-in.
Some of these benefits occur even if the appliance runs in the same data centers that host the vendor’s standard multi-tenant SaaS. Most of the rest occur if the customer can choose a co-location facility in which to place the appliance.
Whether many customers should or will use the SaaS appliance option is somewhat secondary; it’s a check-mark item. I.e., many customers and prospects will be pleased that the option at least exists.

Related link

Naomi Bloom is a SaaS purist, who would presumably deplore the whole concept of “customer-premises SaaS”.

Thoughts on SaaS

Curt Monash — Mon, 25 Nov 2013 01:16:05 +0000

Generalizing about SaaS (Software as a Service) is hard. To prune some of the confusion, let’s start by noting:

SaaS has been around for over half a century, and at times has been the dominant mode of application delivery.
The term multi-tenancy is being used in several different ways.
Multi-tenancy, in the purest sense, is inessential to SaaS. It’s simply an implementation choice that has certain benefits for the SaaS provider. And by the way, …
… salesforce.com, the chief proponent of the theory that true multi-tenancy is the hallmark of true SaaS, abandoned that position this week.
Internet-based services are commonly, if you squint a little, SaaS. Examples include but are hardly limited to Google, Twitter, Dropbox, Intuit, Amazon Web Services, and the company that hosts this blog (KnownHost).
Some of the core arguments for SaaS’ rise, namely the various efficiencies of data center outsourcing and scale, apply equally to the public cloud, to SaaS, and to AEaaS (Anything Else as a Service).
These benefits are particularly strong for inherently networked use cases. For example, you really don’t want to be hosting your website yourself. And salesforce.com got its start supporting salespeople who worked out of remote offices.
In theory and occasionally in practice, certain SaaS benefits, namely the outsourcing of software maintenance and updates, could be enjoyed on-premises as well. Whether I think that could be a bigger deal going forward will be explored in future posts.

For smaller enterprises, the core outsourcing argument is compelling. How small? Well:

What’s the minimum level of IT operations headcount needed for mission-critical systems? Let’s just say “several”.
What does that cost? Fully burdened, somewhere in the six figures.
What fraction of the IT budget should such headcount be? As low a double digit percentage as possible.
What fraction of revenues should be spent on IT? Some single-digit percentage.

So except for special cases, an enterprise with less than $100 million or so in revenue may have trouble affording on-site data processing, at least at a mission-critical level of robustness. It may well be better to use NetSuite or something like that, assuming needed features are available in SaaS form.*

*Truth be told, I’m not up to speed on mid-range SaaS application suite alternatives.
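The back-of-envelope arithmetic behind that $100 million figure can be sketched as follows. The specific inputs are illustrative assumptions chosen to match the rough ranges above, not measured data: $500,000 for fully burdened operations headcount ("six figures"), 10% as the lowest double-digit share of the IT budget, and 5% of revenue spent on IT ("some single-digit percentage"). The function name is likewise hypothetical.

```python
def revenue_threshold(ops_cost, ops_pct_of_it_budget, it_pct_of_revenue):
    """Revenue at which ops_cost fits within the given budget percentages."""
    # IT budget needed for ops headcount to be only the stated share of it.
    it_budget = ops_cost * 100 / ops_pct_of_it_budget
    # Revenue needed for that IT budget to be only the stated share of revenue.
    return it_budget * 100 / it_pct_of_revenue

# Assumed inputs: $500K of ops headcount, 10% of IT budget, IT at 5% of revenue.
threshold = revenue_threshold(500_000, 10, 5)
print(f"${threshold:,.0f}")
```

Under those assumptions the implied IT budget is $5 million, and the revenue threshold lands at $100 million. Different but still plausible inputs move the result a lot in either direction, which is why "or so" belongs in the estimate.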

Continuing that thought: if you’re a mid-range application software provider, you have to develop a SaaS version of your product line. That’s a very different business model than the apps + OEMed platform you’re probably providing now, but it’s the best way to serve your customers going forward. And by the way, while mid-range application software is commonly sold on a regional basis, SaaS can be sold more globally; after all, the need for onsite service is eliminated, and price points should in most cases fit with telephone sales. Yes, national language and regional data privacy rules are both concerns, but they still leave the available markets looking much bigger than regional resellers have traditionally enjoyed. So expect shake-outs in a whole lot of vertical markets, as vendors horn in on each other’s territories, and a few elephantine winners perhaps emerge.

The argument above assumes that extreme reliability is needed. So there’s nothing necessarily wrong with a small team of business analysts sticking an RDBMS appliance* in a corner and managing it themselves. If it sputters from time to time, who cares; using it still may be easier than getting that data in and out of the cloud. But eventually, if all the data is remote anyway — SaaS, website, etc. — then it may make sense to do analytics remotely as well.

*Previously, that appliance might have been from Netezza; now, my first thought is the cheaper — albeit more limited — Infobright.

The arguments that direct smaller companies toward SaaS apply to large enterprises too, but they aren’t as dispositive. Larger enterprises can actually afford to do their own IT operations if they want to. What’s more, moving away from in-house operations is harder for big firms, due to the larger and more customized portfolio of legacy systems they’re likely to have. That said:

Almost all enterprises should have their internet-facing systems offsite, even if just via co-location. The core reasons are that ingesting high-volume inbound network traffic is inherently difficult, and security issues make it much tougher yet. In addressing these challenges, specialists enjoy significant economies of scale.
Most enterprises will have plenty of SaaS silos. If nothing else:
- Complex machinery will increasingly “phone home” for help staying in good working order. That’s a form of SaaS.
- Information providers and aggregators tend to deliver via SaaS.
- Various kinds of collaboration and communication apps, from Google Mail to Dropbox, live in the cloud. Personal productivity applications, from word processing to Photoshop, may be following.
- “Rodney Dangerfield” departments — i.e., ones unhappy with the respect and attention they get from central IT — often turn to SaaS or similar outsourcing. Human resources is an obvious example, from Automatic Data Processing to Employease to, these days, Workday.

That leaves us with the questions as to when and how large enterprises should or will move their core applications to SaaS and/or the cloud. Given the length of this post, I won’t try to answer them now. But for starters:

As long as they can avoid doing so, enterprises don’t like to rip and replace their apps, except in consolidation projects.
Cloud/remote computing economies are less convincing if you already have your computer rooms staffed and set up.
A key benefit of SaaS is that vendors control and drive the upgrade cycles. One cost of that is restrictions on customization, although you can also build apps and app extensions on PaaS/DBaaS/WaaS (Platform/DataBase/Whatever as a Service) offerings such as force.com.
Lock-in is a serious concern, for application and platform offerings alike. Not only are you betting on one vendor’s software black box, you’re also betting on its remote computing operation. If you grow dissatisfied with either, or with their pricing, you may not have much opportunity to escape.