Hadoop – DBMS 2 : DataBase Management System Services

Imanis Data

Curt Monash — Tue, 22 Aug 2017 12:46:20 +0000

I talked recently with the folks at Imanis Data. For starters:

The point of Imanis is to make copies of your databases, for purposes such as backup/restore, test/analysis, or compliance-driven archiving. (That’s in declining order of current customer activity.) Another use is migration via restoring to a different cluster than the one that created the data in the first place.
The data can come from NoSQL database managers, from Hadoop, or from Vertica. (Again, that’s in declining order.)
As you might imagine, Imanis makes incremental backups; the only full backup is the first one you do for that database.
“Imanis” is a new name; the previous name was “Talena”.

Also:

Imanis has ~35 subscription customers, a significant majority of which are in the Fortune 1000.
Customer industries, in roughly declining order, include:
- Financial services other than insurance.
- Insurance.
- Retail.
- “Technology”.
~40% of Imanis customers are in the public cloud.
Imanis is focused on the North American market at this time.
Imanis has ~45 employees.
The Imanis product just hit Version 3.

Imanis correctly observes that there are multiple reasons you might want to recover from backup, including:

General disaster/system failure.
Bug in an application that writes data.
Malicious acts, including encryption-by-ransomware.

Imanis uses the phrase “point-in-time backup” to emphasize its flexibility in letting you choose your favorite time-version of your rolling backup.

Imanis also correctly draws the inference that the right backup strategy is some version of:

Make backups very frequently. This boils down to “Do a great job of making incremental backups (and restoring from them when necessary). This is where Imanis has spent the bulk of its technical effort to date.
In case recovery is needed, identify the last clean (or provably/confidently clean) version of the database and restore from that. The identification part boils down to letting the backup databases be queried directly. That’s largely a roadmap item.
- Imanis has recently added the capability to build its own functionality querying the backup database.
- JDBC/whatever general access is still in the future.

Note: When Imanis backups offer direct query access, the possibility will of course exist to use the backup data for general query processing. But while that kind of capability sounds great in theory, I’m not aware of it being a big deal (on technology stacks that already offer it) in practice.

The most technically notable other use cases Imanis mentioned are probably:

Data science dataset generation. Imanis lets you generate a partial copy of the database for analytic or test purposes.
- You can project, select or sample your data, which suggests use of the current query capabilities.
- There’s an API to let you mask Personally Identifiable Information by writing your own data transformations.
Archiving/tiering/ILM (Information Lifecycle Management). Imanis lets you divide data according to its hotness.

Imanis views its competition as:

Native utilities of the data stores.
Hand-coded scripts.
Datos.io, principally in the Cassandra market (so far).

Beyond those, the obvious comparison to Imanis is Delphix. I haven’t spoken with Delphix for a few years, but I believe that key differences between Delphix and Imanis start:

Delphix is focused on widely-installed RDBMS such as Oracle.
Delphix actually tries to have different production logical copies of your database run off of the same physical copy. Imanis, in contrast, offers technology to help you copy your databases quickly and effectively, but the copies you actually use will indeed be separate from each other.

Imanis software runs on its own cluster, based on hacked Hadoop. A lot of the hacking seems to relate to a metadata store, which supports things like:

Understanding which (incrementally backed up) blocks need to be pulled together to make a specific copy of the database.
Putting data in different places for ILM/tiering.

Another piece of Imanis tech is machine-learning-based anomaly detection.

As incrementally backed-up blocks arrive, Imanis flags anomalous ones and states a reason for them.
A flag is given a reason.
You can denounce the flag as a false alert, and hopefully similar flags won’t be raised in the future.

The technology for this seems rather basic:

Random forests for the flagging.
No drilldown w/in the Imanis system for follow-up.

But in general concept this is something a lot more systems should be doing.

Most of the rest of Imanis’ tech story is straightforward — support various alternatives for computing platforms, offer the usual security choices, etc. One exception that was new to me was the use of erasure codes, which seem to be a generalization of the concept of parity bits. Allegedly, when used in a storage context these have the near-magical property of offering 4X replication safety with only a 1.5X expansion of data volume. I won’t claim to have understood the subject well enough to see how that could make sense, or what tradeoffs it would entail.

Notes on data security

Curt Monash — Thu, 10 Aug 2017 09:15:50 +0000

1. In June I wrote about burgeoning interest in data security. I’d now like to add:

Even more than I previously thought, demand seems to be driven largely by issues of regulatory compliance.
In an exception to that general rule, many enterprise have vague mandates for data encryption.
In awkward contradiction to that general rule, there’s a general sense that it’s just security’s “turn” to be a differentiating feature, since various other “enterprise” needs are already being well-addressed.

We can reconcile these anecdata pretty well if we postulate that:

Enterprises generally agree that data security is an important need.
Exactly how they meet this need depends upon what regulators choose to require.

2. My current impressions of the legal privacy vs. surveillance tradeoffs are basically:

The freer non-English-speaking countries are more concerned about ensuring data privacy. In particular, the European Union’s upcoming GDPR (General Data Protection Regulation) seems like a massive addition to the compliance challenge.
The “Five Eyes” (US, UK, Canada, Australia, New Zealand) are more concerned about maintaining the efficacy of surveillance.
Authoritarian countries, of course, emphasize surveillance as well.

3. Multiple people have told me that security concerns include (data) lineage and (data) governance as well. I’m fairly OK with that conflation.

By citing “lineage” I think they’re referring to the point that if you don’t know where data came from, you don’t know if it’s trustworthy. This fits well with standard uses of the “data lineage” term.
By “data governance” they seem to mean policies and procedures to limit the chance of unauthorized or uncontrolled data change, or technology to support those policies. Calling that “data governance” is a bit of a stretch, but it’s not so ridiculous that we need to make a big fuss about it.

In other words: If your data transformation pipelines aren’t locked down, then your data isn’t locked down either.

4. But how seriously does that last point need to be taken? For starters, the possibility of erroneous calculations:

Is a strong threat to analytic accuracy, as has been recognized at least for the decades that “one version of the truth” has been a catchphrase.
Has some regulatory risk, e.g. in the United States around Sarbanes-Oxley.
Is not as a big a deal for the core security threat of data theft/exfiltration.

Further, it’s not too hard architecturally to have a divide between:

Data transformation for operational use cases, which may need to be locked down.
Data transformation for purely investigative analytics, which can be very fluid, for transformation technologies such as Hadoop, Spark and Excel alike.

Bottom line: Data transformation security is an accessible must-have in some use cases, but an impractical nice-to-have in others.

Generally available Kudu

Curt Monash — Fri, 16 Jun 2017 15:52:45 +0000

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:

Security is an ever bigger deal.
There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliance (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general purpose DBMS such as Oracle or SQL Server.
Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.

Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:

A data storage system introduced by Cloudera (and subsequently open-sourced).
Columnar.
Updatable in human real-time.
Meant to serve as the data storage tier for Impala and Spark.

Kudu’s adoption and roll-out story starts:

Kudu went to general availability on January 31. I gather this spawned an uptick in trial activity.
A subsequent release with some basic security features spawned another uptick.
I don’t think Cloudera will mind my saying that there are many hundreds of active Kudu clusters.
But Cloudera believes that, this soon after GA, very few Kudu users are in actual production.

Early Kudu interest is focused on 2-3 kinds of use case. The biggest is the kind of “data warehousing” highlighted above. Cloudera characterizes the others by the kinds of data stored, specifically the overlapping categories of time series — including financial trading — and machine-generated data. A lot of early Kudu use is with Spark, even ahead of (or in conjunction with) Impala. A small amount has no relational front-end at all.

Other notes on Kudu include:

Solid-state storage is recommended, with a few terabytes per node.
You can also use spinning disk. If you do, your write-ahead logs can still go to flash.
Cloudera said Kudu compression ratios can be as low as 2-5X, or as high as 10-20X. With that broad a range, I didn’t drill down into specifics of what they meant.
There seem to be a number of Kudu clusters with 50+ nodes each. By way of contrast, a “typical” Cloudera customer has 100s of nodes overall.
As you might imagine from their newness, Kudu security features — Kerberos-based — are at the database level rather than anything more granular.

And finally, the Cloudera folks woke me up to some issues around streaming data ingest. If you stream data in, there will be retries resulting in duplicate delivery. So your system needs to deal with those one way or another. Kudu’s way is:

Primary keys will be unique. (Note: This is not obvious in a system that isn’t an entire RDBMS in itself.)
You can configure the uniqueness to be guaranteed either through an upsert mechanism or just by simply rejecting duplicates.
Alternatively, you can write code to handle duplication errors, e.g. via Spark.

The data security mess

Curt Monash — Wed, 14 Jun 2017 13:21:11 +0000

A large fraction of my briefings this year have included a focus on data security. This is the first year in the past 35 that that’s been true.* I believe that reasons for this trend include:

Security is an important aspect of being “enterprise-grade”. Other important checkboxes have been largely filled in. Now it’s security’s turn.
A major platform shift, namely to the cloud, is underway or at least being planned for. Security is an important thing to think about as that happens.
The cloud even aside, technology trends have created new ways to lose data, which security technology needs to address.
Traditionally paranoid industries are still paranoid.
Other industries are newly (and rightfully) terrified of exposing customer data.
My clients at Cloudera thought they had a chance to get significant messaging leverage from emphasizing security. So far, it seems that they were correct.

*Not really an exception: I did once make it a project to learn about classic network security, including firewall appliances and so on.

Certain security requirements, desires or features keep coming up. These include (and as in many of my lists, these overlap):

Easy, comprehensive access control. More on this below.
Encryption. If other forms of security were perfect, encryption would never be needed. But they’re not.
Auditing. Ideally, auditing can alert you to trouble before (much) damage is done. If not, then it can at least help you do proactive damage control in the face of breach.
Whatever regulators mandate.
Whatever is generally regarded as best practices. Security “best practices” generally keep enterprises out of legal and regulatory trouble, or at least minimize same. They also keep employees out of legal and career trouble, or minimize same. Hopefully, they even keep data safe.
Whatever the government is known to use. This is a common proxy for “best practices”.

More specific or extreme requirements include:

Security certifications.
Ways for enterprises to always hold their own encryption keys, even for cloud data.
Value/label-based security.
Isolation of audit logs onto separate (and separately-protected) systems.
Keeping data out of SaaS vendors’ control altogether.

I don’t know how widely these latter kinds of requirements will spread.

The most confusing part of all this may be access control.

Security has a concept called AAA, standing for Authentication, Authorization and Accounting/Auditing/Other things that start with”A”. Yes — even the core acronym in this area is ill-defined.
The new standard for authentication is Kerberos. Or maybe it’s SAML (Security Assertion Markup Language). But SAML is actually an old, now-fragmented standard. But it’s also particularly popular in new, cloud use cases. And Kerberos is actually even older than SAML.
Suppose we want to deny somebody authorization to access certain raw data, but let them see certain aggregated or derived information. How can we be sure they can’t really see the forbidden underlying data, except through a case-by-case analysis? And if that case-by-case analysis is needed, how can the authorization rules ever be simple?

Further confusing matters, it is an extremely common analytic practice to extract data from somewhere and put it somewhere else to be analyzed. Such extracts are an obvious vector for data breaches, especially when the target system is managed by an individual or IT-weak department. Excel-on-laptops is probably the worst case, but even fat-client BI — both QlikView and Tableau are commonly used with local in-memory data staging — can present substantial security risks. To limit such risks, IT departments are trying to impose new standards and controls on departmental analytics. But IT has been fighting that war for many decades, and it hasn’t won yet.

And that’s all when data is controlled by a single enterprise. Inter-enterprise data sharing confuses things even more. For example, national security breaches in the US tend to come from government contractors more than government employees. (Ed Snowden is the most famous example. Chelsea Manning is the most famous exception.) And as was already acknowledged above, even putting your data under control of a SaaS vendor opens hard-to-plug security holes.

Data security is a real mess.

Edit (July 10, 2017): Matt Asay evidently agrees with this post, specifically in the context of Hadoop.

Light-touch managed services

Curt Monash — Wed, 14 Jun 2017 13:14:06 +0000

Cloudera recently introduced Cloudera Altus, a Hadoop-in-the-cloud offering with an interesting processing model:

Altus manages jobs for you.
But you actually run them on your own cluster, and so you never have to put your data under Altus’ control.

Thus, you avoid a potential security risk (shipping your data to Cloudera’s service). I’ve tentatively named this strategy light-touch managed services, and am interested in exploring how broadly applicable it might or might not be.

For light-touch to be a good approach, there should be (sufficiently) little downside in performance, reliability and so on from having your service not actually control the data. That assumption is trivially satisfied in the case of Cloudera Altus, because it’s not an ordinary kind of app; rather, its whole function is to improve the job-running part of your stack. Most kinds of apps, however, want to operate on your data directly. For those, it is more challenging to meet acceptable SLAs (Service-Level Agreements) on a light-touch basis.

Let’s back up and consider what “light-touch” for data-interacting apps (i.e., almost all apps) would actually mean. The basics are:

The user has some kind of environment that manages data and executes programs.
The light-touch service, running outside this environment, spawns one or more app processes inside it.
Useful work ensues …
… with acceptable reliability and performance.
The environment’s security guarantees ensure that data doesn’t leak out.

Cases where that doesn’t even make sense include but are not limited to:

Transaction-processing applications that are carefully tuned for efficient database access.
Applications that need to be carefully installed on or in connection with a particular server, DBMS, app server or whatever.

On the other hand:

A light-touch service is at least somewhat reasonable in connection with analytics-oriented data-management-plus-processing environments such as Hadoop/Spark clusters.
There are many workloads over Hadoop clusters that don’t need efficient database access. (Otherwise Hive use would not be so prevalent.)
Light-touch efforts seem more likely to be helped than hurt by abstraction environments such as the public cloud.

So we can imagine some kind of outside service that spawns analytic jobs to be run on your preferred — perhaps cloudy — Hadoop/Spark cluster. That could be a safe way to get analytics done over data that really, really, really shouldn’t be allowed to leak.

But before we anoint light-touch managed services as the NBT (Next Big Thing/Newest Bright Thought), there’s one more hurdle for it to overcome — why bother at all? What would a light-touch managed service provide that you wouldn’t also get from installing packaged software onto your cluster and running it in the usual way? The simplest answer is “The benefits of SaaS (Software as a Service)”, and so we can rephrase the challenge as “Which benefits of SaaS still apply in the light-touch managed service scenario?”

The vendor perspective might start, with special cases such as Cloudera Altus excepted:

The cost-saving benefits of multi-tenancy mostly don’t apply. Each instance winds up running on a separate cluster, namely the customer’s own. (But that’s likely to be SaaS/cloud itself.)
The benefits of controlling your execution environment apply at best in part. You may be able to assume the customer’s core cluster is through some cloud service, but you don’t get to run the operation yourself.
The benefits of a SaaS-like product release cycle do mainly apply.
- Only having to support the current version(s) of the product is a little limited when you don’t wholly control your execution environment.
- Light-touch doesn’t seem to interfere with the traditional SaaS approach of a rapid, incremental product release cycle.

When we flip to the user perspective, however, the idea looks a little better.

Cloudy analytics is for folks who favor convenience of various kinds over tightly-managed, blazing performance.
Security and data privacy are ongoing (and increasing) concerns.
Light-touch services are in line with those priorities.

Bottom line: Light-touch managed services are well worth thinking about. But they’re not likely to be a big deal soon.

Cloudera Altus

Curt Monash — Wed, 14 Jun 2017 13:12:48 +0000

I talked with Cloudera before the recent release of Altus. In simplest terms, Cloudera’s cloud strategy aspires to:

Provide all the important advantages of on-premises Cloudera.
Provide all the important advantages of native cloud offerings such as Amazon EMR (Elastic MapReduce, or at least come sufficiently close to that goal.
Benefit from customers’ desire to have on-premises and cloud deployments that work:
- Alike in any case.
- Together, to the extent that that makes use-case sense.

In other words, Cloudera is porting its software to an important new platform.* And this port isn’t complete yet, in that Altus is geared only for certain workloads. Specifically, Altus is focused on “data pipelines”, aka data transformation, aka “data processing”, aka new-age ETL (Extract/Transform/Load). (Other kinds of workload are on the roadmap, including several different styles of Impala use.) So what about that is particularly interesting? Well, let’s drill down.

*Or, if you prefer, improving on early versions of the port.

Since so much of the Hadoop and Spark stacks is open source, competition often isn’t based on core product architecture or features, but rather on factors such as:

Ease of management. This one is nuanced in the case of cloud/Altus. For starters:
- One of Cloudera’s main areas of differentiation has always been Cloudera Manager.
- Cloudera Director was Cloudera’s first foray into cloud-specific management.
- Cloudera Altus features easier/simpler management than Cloudera Director, meant to be analogous to native Amazon management tools, and good-enough for use cases that don’t require strenuous optimization.
- Cloudera Altus also includes an optional workload analyzer, in slight conflict with other parts of the Altus story. More on that below.
Ease of development. Frankly, this rarely seems to come up as a differentiator in the Hadoop/Spark world, various “notebook” offerings such as Databricks’ or Cloudera’s notwithstanding.
Price. When price is the major determinant, Cloudera is sad.
Open source purity. Ditto. But at most enterprises — at least those with hefty IT budgets — emphasis on open source purity either is a proxy for price shopping, or else boils down to largely bogus concerns about vendor lock-in.

Of course, “core” kinds of considerations are present to some extent too, including:

Performance, concurrency, etc. I no longer hear many allegations of differences in across-the-board Hadoop performance. But the subject does arise in specific areas, most obviously in analytic SQL processing. It arises in the case of Altus as well, in that Cloudera improved in a couple of areas that it concedes were previously Amazon EMR advantages, namely:
- Interacting with S3 data stores.
- Spinning instances up and down.
Reliability and data safety. Cloudera mentioned that it did some work so as to be comfortable with S3’s eventual consistency model.

Recently, Cloudera has succeeded at blowing security up into a major competitive consideration. Of course, they’re trying that with Altus as well. Much of the Cloudera Altus story is the usual — rah-rah Cloudera security, Sentry, Kerberos everywhere, etc. But there’s one aspect that I find to be simple yet really interesting:

Cloudera Altus doesn’t manage data for you.
Rather, it launches and manages jobs on a separate Hadoop cluster.

Thus, there are very few new security risks to running Cloudera Altus, beyond whatever risks are inherent to running any version of Hadoop in the public cloud.

Where things get a bit more complicated is some features for workload analysis.

Cloudera recently introduced some capabilities for on-the-fly trouble-shooting. That’s fine.
Cloudera has also now announced an offline workload analyzer, which compares actual metrics computed from your log files to “normal” ones from well-running jobs. For that, you really do have to ship information to a separate cluster managed by Cloudera.

The information shipped is logs rather than actual query results or raw data. In theory, an attacker who had all those logs could conceivably make inferences about the data itself; but in practice, that doesn’t seem like an important security risk at all.

So is this an odd situation where that strategy works, or could what we might call light-touch managed services turn out to be widespread and important? That’s a good question to address in a separate post.

Cloudera’s Data Science Workbench

Curt Monash — Mon, 20 Mar 2017 00:41:15 +0000

0. Matt Brandwein of Cloudera briefed me on the new Cloudera Data Science Workbench. The problem it purports to solve is:

One way to do data science is to repeatedly jump through the hoops of working with a properly-secured Hadoop cluster. This is difficult.
Another way is to extract data from a Hadoop cluster onto your personal machine. This is insecure (once the data arrives) and not very parallelized.
A third way is needed.

Cloudera’s idea for a third way is:

You don’t run anything on your desktop/laptop machine except a browser.
The browser connects you to a Docker container that holds (and isolates) a kind of virtual desktop for you.
The Docker container runs on your Cloudera cluster, so connectivity-to-Hadoop and security are handled rather automagically.

In theory, that’s pure goodness … assuming that the automagic works sufficiently well. I gather that Cloudera Data Science Workbench has been beta tested by 5 large organizations and many 10s of users. We’ll see what is or isn’t missing as more customers take it for a spin.

1. Recall that Cloudera installations have 4 kinds of nodes. 3 are obvious:

Hadoop worker nodes.
Hadoop master nodes.
Nodes that run Cloudera Manager.

The fourth kind are edge/gateway nodes. Those handle connections to the outside world, and can also run selected third-party software. They also are where Cloudera Data Science Workbench lives.

2. One point of this architecture is to let each data scientist run the languages and tools of her choice. Docker isolation is supposed to make that practical and safe.

And so we have a case of the workbench metaphor actually being accurate! While a “workbench” is commonly just an integrated set of tools, in this case it’s also a place for you to use other tools your personally like and bring in.

Surely there are some restrictions as to which tools you can use, but I didn’t ask for those to be spelled out.

3. Matt kept talking about security, to an extent I recall in almost no other analytics-oriented briefing. This had several aspects.

As noted above, a lot of the hassle of Hadoop-based data science relates to security.
As also noted above, evading the hassle by extracting data is a huge security risk. (If you lose customer data, you’re going to have a very, very bad day.)
According to Matt, standard uses of notebook tools such as Jupyter or Zeppelin wind up having data stored wherever code is. Cloudera’s otherwise similar notebook-style interface evidently avoids that flaw. (Presumably, it you want to see the output, you rerun the script against the data store yourself.)

4. To a first approximation, the target users of Cloudera Data Science Workbench can be characterized the same way BI-oriented business analysts are. They’re people with:

Sufficiently good quantitative skills to do the analysis.
Sufficiently good computer skills to do SQL queries and so on, but not a lot more than that.

Of course, “sufficiently good quantitative skills” can mean something quite different in data science than it does for the glorified arithmetic of ordinary business intelligence.

5. Cloudera Data Science Workbench doesn’t have any special magic in parallelization. It just helps you access the parallelism that’s already out there. Some algorithms are easy to parallelize. Some libraries have parallelized a few algorithms beyond that. Otherwise, you’re on your own.

6. When I asked whether Cloudera Data Science Workbench was open source (like most of what Cloudera provides) or closed source (like Cloudera Manager), I didn’t get the clearest of answers. On the one hand, it’s a Cloudera-specific product, as the name suggests; on the other, it’s positioned as having been stitched together almost entirely from a collection of open source projects.

DBAs of the future

Curt Monash — Wed, 23 Nov 2016 12:02:31 +0000

After a July visit to DataStax, I wrote

The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.

Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

That turns out to understate the core point, which is that DBAs still matter in non-RDBMS environments. Specifically, it’s too narrow in two ways.

First, it’s generally too narrow as to what DBAs do; people with DBA-like skills are also involved in other areas such as “data governance”, “information lifecycle management”, storage, or what I like to call data mustering.
Second — and more narrowly — the first bullet point of the quote is actually incorrect. In fact, the database design part of application development can be done by a specialized person up front in the NoSQL world, just as it commonly is for RDBMS apps.

My wake-up call for that latter bit was a recent MongoDB 3.4 briefing. MongoDB certainly has various efforts in administrative tools, which I won’t recapitulate here. But to my surprise, MongoDB also found a role for something resembling relational database design. The idea is simple: A database administrator defines a view against a MongoDB database, where views:

Are logical rather than materialized. (At least at this time.)
Have their permissions and so on set by the DBA.
Are the sole thing the programmer writes against.

Besides the obvious benefits in development ease and security, MongoDB says that performance can be better as well.* This is of course a new feature, without a lot of adoption at this time. Even so, it seems likely that NoSQL doesn’t obsolete any part of the traditional DBA role.

*I didn’t actually ask what a naive programmer can do to trash performance that views can forestall, but … well, I was once a naive programmer myself.

Two trends that I think could make DBA’s lives even more interesting and challenging in the future are:

The integration of quick data management into complex analytic processes. Here by “quick data management” I mean, for example, what you do in connection with a complex Hadoop or Spark (set of) job(s). Leaving the data management to a combination of magic and Python scripts doesn’t seem to respect how central data operations are to analytic tasks.
The integration of data management and streaming. I should probably write about this point separately, but in any case — it seems that streaming stacks will increasingly look like over-caffeinated DBMS.

Bottom line: Database administration skills will be needed for a long time to come.

Notes on the transition to the cloud

Curt Monash — Tue, 04 Oct 2016 02:22:21 +0000

1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:

The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
Software as a service, aka SaaS.
Co-location in off-premises data centers, aka colo.
On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.

Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.

This is a good example of Monash’s Laws of Commercial Semantics.

2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.

This fact now seems to be widely understood.

3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.

This is frequently a good reason for new — i.e. “greenfield” — apps to run in the cloud.

4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.

In connection to that point, it might be interesting to note:

In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
Other big users were hospitals and stockbrokers.
The US intelligence agencies are building out their own shared, dedicated cloud.

5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.

In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.

6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.

7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:

Does your application need to run close to your customers’ largest databases?
Do your customers still avoid the public cloud?

If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.

8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:

Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
Departments don’t necessarily care about issues that give central IT pause.
Departments sometimes buy things that only are available via remote delivery, e.g. narrowly focused SaaS applications or market data.

I discussed some of this in my recent post on vendor lock-in.

9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.

Those concerns seem to have been largely alleviated.

10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.

11. If you think about system requirements:

There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
Server-side computing, however, is more homogenous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.

I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.

So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.

Otherwise, clusters will probably be organized by database, in the most expansive sense of term. For example, there could be separate clusters for:

Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
Relational data warehousing, whether in an analytic RDBMS or otherwise.
Log files, perhaps managed in Hadoop or Splunk.
Your website and other internet back-ends, perhaps running over NoSQL data stores.
Text documents managed by some kind of search engine.
Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)

Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.