Cloudera – DBMS 2 : DataBase Management System Services

Generally available Kudu

Curt Monash — Fri, 16 Jun 2017 15:52:45 +0000

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:

Security is an ever bigger deal.
There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliance (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general purpose DBMS such as Oracle or SQL Server.
Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.

Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:

A data storage system introduced by Cloudera (and subsequently open-sourced).
Columnar.
Updatable in human real-time.
Meant to serve as the data storage tier for Impala and Spark.

Kudu’s adoption and roll-out story starts:

Kudu went to general availability on January 31. I gather this spawned an uptick in trial activity.
A subsequent release with some basic security features spawned another uptick.
I don’t think Cloudera will mind my saying that there are many hundreds of active Kudu clusters.
But Cloudera believes that, this soon after GA, very few Kudu users are in actual production.

Early Kudu interest is focused on 2-3 kinds of use case. The biggest is the kind of “data warehousing” highlighted above. Cloudera characterizes the others by the kinds of data stored, specifically the overlapping categories of time series — including financial trading — and machine-generated data. A lot of early Kudu use is with Spark, even ahead of (or in conjunction with) Impala. A small amount has no relational front-end at all.

Other notes on Kudu include:

Solid-state storage is recommended, with a few terabytes per node.
You can also use spinning disk. If you do, your write-ahead logs can still go to flash.
Cloudera said Kudu compression ratios can be as low as 2-5X, or as high as 10-20X. With that broad a range, I didn’t drill down into specifics of what they meant.
There seem to be a number of Kudu clusters with 50+ nodes each. By way of contrast, a “typical” Cloudera customer has 100s of nodes overall.
As you might imagine from their newness, Kudu security features — Kerberos-based — are at the database level rather than anything more granular.

And finally, the Cloudera folks woke me up to some issues around streaming data ingest. If you stream data in, there will be retries resulting in duplicate delivery. So your system needs to deal with those one way or another. Kudu’s way is:

Primary keys will be unique. (Note: This is not obvious in a system that isn’t an entire RDBMS in itself.)
You can configure the uniqueness to be guaranteed either through an upsert mechanism or just by simply rejecting duplicates.
Alternatively, you can write code to handle duplication errors, e.g. via Spark.

Light-touch managed services

Curt Monash — Wed, 14 Jun 2017 13:14:06 +0000

Cloudera recently introduced Cloudera Altus, a Hadoop-in-the-cloud offering with an interesting processing model:

Altus manages jobs for you.
But you actually run them on your own cluster, and so you never have to put your data under Altus’ control.

Thus, you avoid a potential security risk (shipping your data to Cloudera’s service). I’ve tentatively named this strategy light-touch managed services, and am interested in exploring how broadly applicable it might or might not be.

For light-touch to be a good approach, there should be (sufficiently) little downside in performance, reliability and so on from having your service not actually control the data. That assumption is trivially satisfied in the case of Cloudera Altus, because it’s not an ordinary kind of app; rather, its whole function is to improve the job-running part of your stack. Most kinds of apps, however, want to operate on your data directly. For those, it is more challenging to meet acceptable SLAs (Service-Level Agreements) on a light-touch basis.

Let’s back up and consider what “light-touch” for data-interacting apps (i.e., almost all apps) would actually mean. The basics are:

The user has some kind of environment that manages data and executes programs.
The light-touch service, running outside this environment, spawns one or more app processes inside it.
Useful work ensues …
… with acceptable reliability and performance.
The environment’s security guarantees ensure that data doesn’t leak out.

Cases where that doesn’t even make sense include but are not limited to:

Transaction-processing applications that are carefully tuned for efficient database access.
Applications that need to be carefully installed on or in connection with a particular server, DBMS, app server or whatever.

On the other hand:

A light-touch service is at least somewhat reasonable in connection with analytics-oriented data-management-plus-processing environments such as Hadoop/Spark clusters.
There are many workloads over Hadoop clusters that don’t need efficient database access. (Otherwise Hive use would not be so prevalent.)
Light-touch efforts seem more likely to be helped than hurt by abstraction environments such as the public cloud.

So we can imagine some kind of outside service that spawns analytic jobs to be run on your preferred — perhaps cloudy — Hadoop/Spark cluster. That could be a safe way to get analytics done over data that really, really, really shouldn’t be allowed to leak.

But before we anoint light-touch managed services as the NBT (Next Big Thing/Newest Bright Thought), there’s one more hurdle for it to overcome — why bother at all? What would a light-touch managed service provide that you wouldn’t also get from installing packaged software onto your cluster and running it in the usual way? The simplest answer is “The benefits of SaaS (Software as a Service)”, and so we can rephrase the challenge as “Which benefits of SaaS still apply in the light-touch managed service scenario?”

The vendor perspective might start, with special cases such as Cloudera Altus excepted:

The cost-saving benefits of multi-tenancy mostly don’t apply. Each instance winds up running on a separate cluster, namely the customer’s own. (But that’s likely to be SaaS/cloud itself.)
The benefits of controlling your execution environment apply at best in part. You may be able to assume the customer’s core cluster is through some cloud service, but you don’t get to run the operation yourself.
The benefits of a SaaS-like product release cycle do mainly apply.
- Only having to support the current version(s) of the product is a little limited when you don’t wholly control your execution environment.
- Light-touch doesn’t seem to interfere with the traditional SaaS approach of a rapid, incremental product release cycle.

When we flip to the user perspective, however, the idea looks a little better.

Cloudy analytics is for folks who favor convenience of various kinds over tightly-managed, blazing performance.
Security and data privacy are ongoing (and increasing) concerns.
Light-touch services are in line with those priorities.

Bottom line: Light-touch managed services are well worth thinking about. But they’re not likely to be a big deal soon.

Cloudera Altus

Curt Monash — Wed, 14 Jun 2017 13:12:48 +0000

I talked with Cloudera before the recent release of Altus. In simplest terms, Cloudera’s cloud strategy aspires to:

Provide all the important advantages of on-premises Cloudera.
Provide all the important advantages of native cloud offerings such as Amazon EMR (Elastic MapReduce, or at least come sufficiently close to that goal.
Benefit from customers’ desire to have on-premises and cloud deployments that work:
- Alike in any case.
- Together, to the extent that that makes use-case sense.

In other words, Cloudera is porting its software to an important new platform.* And this port isn’t complete yet, in that Altus is geared only for certain workloads. Specifically, Altus is focused on “data pipelines”, aka data transformation, aka “data processing”, aka new-age ETL (Extract/Transform/Load). (Other kinds of workload are on the roadmap, including several different styles of Impala use.) So what about that is particularly interesting? Well, let’s drill down.

*Or, if you prefer, improving on early versions of the port.

Since so much of the Hadoop and Spark stacks is open source, competition often isn’t based on core product architecture or features, but rather on factors such as:

Ease of management. This one is nuanced in the case of cloud/Altus. For starters:
- One of Cloudera’s main areas of differentiation has always been Cloudera Manager.
- Cloudera Director was Cloudera’s first foray into cloud-specific management.
- Cloudera Altus features easier/simpler management than Cloudera Director, meant to be analogous to native Amazon management tools, and good-enough for use cases that don’t require strenuous optimization.
- Cloudera Altus also includes an optional workload analyzer, in slight conflict with other parts of the Altus story. More on that below.
Ease of development. Frankly, this rarely seems to come up as a differentiator in the Hadoop/Spark world, various “notebook” offerings such as Databricks’ or Cloudera’s notwithstanding.
Price. When price is the major determinant, Cloudera is sad.
Open source purity. Ditto. But at most enterprises — at least those with hefty IT budgets — emphasis on open source purity either is a proxy for price shopping, or else boils down to largely bogus concerns about vendor lock-in.

Of course, “core” kinds of considerations are present to some extent too, including:

Performance, concurrency, etc. I no longer hear many allegations of differences in across-the-board Hadoop performance. But the subject does arise in specific areas, most obviously in analytic SQL processing. It arises in the case of Altus as well, in that Cloudera improved in a couple of areas that it concedes were previously Amazon EMR advantages, namely:
- Interacting with S3 data stores.
- Spinning instances up and down.
Reliability and data safety. Cloudera mentioned that it did some work so as to be comfortable with S3’s eventual consistency model.

Recently, Cloudera has succeeded at blowing security up into a major competitive consideration. Of course, they’re trying that with Altus as well. Much of the Cloudera Altus story is the usual — rah-rah Cloudera security, Sentry, Kerberos everywhere, etc. But there’s one aspect that I find to be simple yet really interesting:

Cloudera Altus doesn’t manage data for you.
Rather, it launches and manages jobs on a separate Hadoop cluster.

Thus, there are very few new security risks to running Cloudera Altus, beyond whatever risks are inherent to running any version of Hadoop in the public cloud.

Where things get a bit more complicated is some features for workload analysis.

Cloudera recently introduced some capabilities for on-the-fly trouble-shooting. That’s fine.
Cloudera has also now announced an offline workload analyzer, which compares actual metrics computed from your log files to “normal” ones from well-running jobs. For that, you really do have to ship information to a separate cluster managed by Cloudera.

The information shipped is logs rather than actual query results or raw data. In theory, an attacker who had all those logs could conceivably make inferences about the data itself; but in practice, that doesn’t seem like an important security risk at all.

So is this an odd situation where that strategy works, or could what we might call light-touch managed services turn out to be widespread and important? That’s a good question to address in a separate post.

Cloudera’s Data Science Workbench

Curt Monash — Mon, 20 Mar 2017 00:41:15 +0000

0. Matt Brandwein of Cloudera briefed me on the new Cloudera Data Science Workbench. The problem it purports to solve is:

One way to do data science is to repeatedly jump through the hoops of working with a properly-secured Hadoop cluster. This is difficult.
Another way is to extract data from a Hadoop cluster onto your personal machine. This is insecure (once the data arrives) and not very parallelized.
A third way is needed.

Cloudera’s idea for a third way is:

You don’t run anything on your desktop/laptop machine except a browser.
The browser connects you to a Docker container that holds (and isolates) a kind of virtual desktop for you.
The Docker container runs on your Cloudera cluster, so connectivity-to-Hadoop and security are handled rather automagically.

In theory, that’s pure goodness … assuming that the automagic works sufficiently well. I gather that Cloudera Data Science Workbench has been beta tested by 5 large organizations and many 10s of users. We’ll see what is or isn’t missing as more customers take it for a spin.

1. Recall that Cloudera installations have 4 kinds of nodes. 3 are obvious:

Hadoop worker nodes.
Hadoop master nodes.
Nodes that run Cloudera Manager.

The fourth kind are edge/gateway nodes. Those handle connections to the outside world, and can also run selected third-party software. They also are where Cloudera Data Science Workbench lives.

2. One point of this architecture is to let each data scientist run the languages and tools of her choice. Docker isolation is supposed to make that practical and safe.

And so we have a case of the workbench metaphor actually being accurate! While a “workbench” is commonly just an integrated set of tools, in this case it’s also a place for you to use other tools your personally like and bring in.

Surely there are some restrictions as to which tools you can use, but I didn’t ask for those to be spelled out.

3. Matt kept talking about security, to an extent I recall in almost no other analytics-oriented briefing. This had several aspects.

As noted above, a lot of the hassle of Hadoop-based data science relates to security.
As also noted above, evading the hassle by extracting data is a huge security risk. (If you lose customer data, you’re going to have a very, very bad day.)
According to Matt, standard uses of notebook tools such as Jupyter or Zeppelin wind up having data stored wherever code is. Cloudera’s otherwise similar notebook-style interface evidently avoids that flaw. (Presumably, it you want to see the output, you rerun the script against the data store yourself.)

4. To a first approximation, the target users of Cloudera Data Science Workbench can be characterized the same way BI-oriented business analysts are. They’re people with:

Sufficiently good quantitative skills to do the analysis.
Sufficiently good computer skills to do SQL queries and so on, but not a lot more than that.

Of course, “sufficiently good quantitative skills” can mean something quite different in data science than it does for the glorified arithmetic of ordinary business intelligence.

5. Cloudera Data Science Workbench doesn’t have any special magic in parallelization. It just helps you access the parallelism that’s already out there. Some algorithms are easy to parallelize. Some libraries have parallelized a few algorithms beyond that. Otherwise, you’re on your own.

6. When I asked whether Cloudera Data Science Workbench was open source (like most of what Cloudera provides) or closed source (like Cloudera Manager), I didn’t get the clearest of answers. On the one hand, it’s a Cloudera-specific product, as the name suggests; on the other, it’s positioned as having been stitched together almost entirely from a collection of open source projects.

Introduction to data Artisans and Flink

Curt Monash — Sun, 21 Aug 2016 21:15:59 +0000

data Artisans and Flink basics start:

Flink is an Apache project sponsored by the Berlin-based company data Artisans.
Flink has been viewed in a few different ways, all of which are similar to how Spark is seen. In particular, per co-founder Kostas Tzoumas:
- Flink’s original goal was “Hadoop done right”.
- Now Flink is focused on streaming analytics, as an alternative to Spark Streaming, Samza, et al.
Kostas seems to see Flink as a batch-plus-streaming engine that’s streaming-first.

Like many open source projects, Flink seems to have been partly inspired by a Google paper.

To this point, data Artisans and Flink have less maturity and traction than Databricks and Spark. For example:

The first line of Flink code dates back to 2010.
data Artisans and the Flink open source project both started in 2014.
When I met him in late June, Kostas told me that Data Artisans had raised $7 million and had 15 employees.
Flink’s current SQL support is very minor.

Per Kostas, about half of Flink committers are at data Artisans; others are at Cloudera, Hortonworks, Confluent, Intel, at least one production user, and some universities. Kostas provided about 5 examples of production Flink users, plus a couple of very big names that were sort-of-users (one was using a forked version of Flink, while another is becoming a user “soon”).

The technical story at data Artisans/Flink revolves around the assertion “We have the right architecture for streaming.” If I understood data Artisans co-founder Stephan Ewen correctly on a later call, the two key principles in support of that seem to be:

The key is to keep data “transport” running smoothly without interruptions, delays or bottlenecks, where the relevant sense of “transport” is movement from one operator/operation to the next.
In this case, the Flink folks feel that modularity supports efficiency.

In particular:

Anything that relates to consistency/recovery is kept almost entirely separate from basic processing, with minimal overhead and nothing that resembles a lock.
Windowing and so on operate separately from basic “transport” as well.
The core idea is that special markers — currently in the ~20 byte range in size — are injected into the streams. When the marker gets to an operator, the operator snapshots the then-current state of its part of the stream.
Should recovery ever be needed, consistency is achieved by assembling all the snapshots corresponding to a single marker, and replaying any processing that happened after those snapshots were taken.
- Actually, this is oversimplified, in that it assumes there’s only a single input stream.
- A lot of Flink’s cleverness, I gather, is involved in assembling a consistent snapshot despite the realities of multiple input streams.

The upshot, Flink partisans believe, is to match the high throughput of Spark Streaming while also matching the low latency of Storm.

The Flink folks naturally have a rich set of opinions about streaming. Besides the points already noted, these include:

“Exactly once” semantics are best in almost all use cases, as opposed to “at least once”, or to turning off fault tolerance altogether. (Exceptions might arise in extreme performance scenarios, or because of legacy systems’ expectations.)
Repetitive, scheduled batch jobs are often “streaming processes in disguise”. Besides any latency benefits, reimplementing them using streaming technology might simplify certain issues that can occur around the boundaries of batch windows. (The phrase “continuous processing” could reasonably be used here.)

We discussed joins quite a bit, but this was before I realized that Flink didn’t have much SQL support. Let’s just say they sounded rather primitive even when I assumed they were done via SQL.

Our discussion of windowing was more upbeat. Flink supports windows based either on timestamps or data arrival time, and these can be combined as needed. Stephan thinks this flexibility is important.

As for Flink use cases, they’re about what you’d expect:

Plenty of data transformation, because that’s how all these systems start out. Indeed, the earliest Flink adoption was for batch transformation.
Plenty of stream processing.

But Flink doesn’t have all the capabilities one would want for the kinds of investigative analytics commonly done on Spark.

Related links

My recent series of Spark posts offer comparison or background to this one.
I surveyed Spark Streaming, Storm et al. in January.
How you factor things is always important.
data Artisans has a non-obvious URL.

Notes on Spark and Databricks — generalities

Curt Monash — Sun, 31 Jul 2016 14:29:46 +0000

I visited Databricks in early July to chat with Ion Stoica and Reynold Xin. Spark also comes up in a large fraction of the conversations I have. So let’s do some catch-up on Databricks and Spark. In a nutshell:

Spark is indeed the replacement for Hadoop MapReduce.
Spark is becoming the default platform for machine learning.
SparkSQL (nee’ Shark) is puttering along predictably.
Databricks reports good success in its core business of cloud-based machine learning support.
Spark Streaming has strong adoption, but its position is at risk.
Databricks, the original authority on Spark, is not keeping a tight grip on that role.

I shall explain below. I also am posting separately about Spark evolution, especially Spark 2.0. I’ll also talk a bit in that post about Databricks’ proprietary/closed-source technology.

Spark is the replacement for Hadoop MapReduce.

This point is so obvious that I don’t know what to say in its support. The trend is happening, as originally decreed by Cloudera (and me), among others. People are rightly fed up with the limitations of MapReduce, and — niches perhaps aside — there are no serious alternatives other than Spark.

The greatest use for Spark seems to be the same as the canonical first use for MapReduce: data transformation. Also in line with the Spark/MapReduce analogy:

Data-transformation-only use cases are important, but they don’t dominate.
Most other use cases typically have a data transformation element as well …
… which has to be started before any other work can be done.

And so it seems likely that, at least for as long as Spark is growing rapidly, data transformation will appear to be the biggest Spark use case.

Spark is becoming the default platform for machine learning.

Largely, this is a corollary of:

The previous point.
The fact that Spark was originally designed with machine learning as its principal use case.

To do machine learning you need two things in your software:

A collection of algorithms. Spark, I gather, is one of numerous good alternatives there.
Support for machine learning workflows. That’s where Spark evidently stands alone.

And thus I have conversations like:

“Are you doing anything with Spark?”
“We’ve gotten more serious about machine learning, so yes.”

SparkSQL (nee’ Shark) is puttering along.

SparkSQL is pretty much following the Hive trajectory.

Useful from Day One as an adjunct to other kinds of processing.
A tease and occasionally useful as a SQL engine for its own sake, but really not very good, pending years to mature.

Databricks reports good success in its core business of cloud-based machine learning support.

Databricks, to an even greater extent than I previously realized, is focused on its cloud business, for which there are well over 200 paying customers. Notes on that include:

As you might expect based on my comments above, the majority of usage is for data transformation, but a lot of that is in anticipation of doing machine learning/predictive modeling in the near future.
Databricks customers typically already have their data in the Amazon cloud.
Naturally, a lot of Databricks customers are internet companies — ad tech startups and the like. Databricks also reports “strong” traction in the segments:
- Media
- Financial services (especially but not only insurance)
- Health care/pharma
The main languages Databricks customers use are R and Python. Ion said that Python was used more on the West Coast, while R was used more in the East.

Databricks’ core marketing concept seems to be “just-in-time data platform”. I don’t know why they picked that, as opposed to something that emphasizes Spark’s flexibility and functionality.

Spark Streaming’s long-term success is not assured.

To a first approximation, things look good for Spark Streaming.

Spark Streaming is definitely the leading companion to Kafka, and perhaps also to cloud equivalents (e.g. Amazon Kinesis).
The “traditional” alternatives of Storm and Samza are pretty much done.
Newer alternatives from Twitter, Confluent and Flink aren’t yet established.
Cloudera is a big fan of Spark Streaming.
Even if Spark Streaming were to generally decline, it might keep substantial “good enough” usage, analogously to Hive and SparkSQL.
Cool new Spark Streaming technology is coming out.

But I’m also hearing rumbles and grumbles about Spark Streaming. What’s more, we know that Spark Streaming wasn’t a core part of Spark’s design; the use case just happened to emerge. Demanding streaming use cases typically involve a lot of short-request inserts (or updates/upserts/whatever). And if you were designing a system to handle those … would it really be based on Spark?

Databricks is not keeping a tight grip on Spark leadership.

For starters:

Databricks’ main business, as noted above, is its cloud service. That seems to be going well.
Databricks’ secondary business is licensing stuff to Spark distributors. That doesn’t seem to amount to much; it’s too easy to go straight to the Apache distribution and bypass Databricks. No worries; this never seemed it would be a big revenue opportunity for Databricks.

At the moment, Databricks is pretty clearly the general leader of Spark. Indeed:

If you want the story on where Spark is going, you do what I did — you ask Databricks.
Similarly, if you’re thinking of pushing the boundaries on Spark use, and you have access to the Databricks folks, that’s who you’ll probably talk to.
Databricks employs ~1/3 of Spark committers.
Databricks organizes the Spark Summit.

But overall, Databricks doesn’t seem to care much about keeping Spark leadership. Its marketing efforts in that respect are minimal. Word-of-mouth buzz paints a similar picture. My direct relationship with the company gives the same impression. Oh, I’m sure Databricks would like to remain the Spark leader. But it doesn’t seem to devote much energy toward keeping the role.

Related links

Starting with my introduction to Spark, previous overview posts include those in:

Notes from a long trip, July 19, 2016

Curt Monash — Wed, 20 Jul 2016 01:34:31 +0000

For starters:

I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.

A running list of recent posts is:

As a companion to this post, I’m publishing a very long one on vendor lock-in.
Spark and Databricks are both prospering, and of course enhancing their technology as well.
Ditto DataStax.
Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.

Subjects I’d like to add to that list include:

MemSQL, Zoomdata, and Neo Technology (also prospering).
Cloudera (multiple topics, as usual).
Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
Microsoft’s reinvention (it feels real).
Metadata (it’s ever more of a thing).
Machine learning (it’s going to be a big portion of my research going forward).
Transitions to the cloud — this subject affects almost everything else.

I’ll edit these lists as appropriate when further posts go up. Last update: August 23, 2016.

Let’s cover some other subjects right here.

1. While Kafka is widely agreed to be the universal delivery mechanism for streams, the landscape for companion technologies is confused.

Back in January I wrote that the leaders were mainly Spark Streaming, followed by Storm.
I overlooked the fact that Storm creator Twitter was replacing Storm with something called Heron.*
If there’s any buzz about Confluent’s replacement for distant-third-place contender Samza, I missed it.
Opinions about Spark Streaming are mixed. Some folks want to get away from it; others like it just fine.

And of course Flink is hoping to blow everybody else in the space away.

*But that kind of thing is not necessarily a death knell. Cassandra inventor Facebook soon replaced Cassandra with HBase, yet Cassandra is doing just fine.

As for the “lambda architecture” — that has always felt like a kludge, and various outfits are trying to obsolete it in various ways. As just one example, Cloudera described that to me during my visit as one of the main points of Kudu.

2. The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

I had a moment of clarity on this point while visiting my clients at DataStax, and discussing their goal — shared by numerous companies — of being properly appreciated for the management tools they provide. In the room with me were CEO Billy Bosworth and chief evangelist Patrick McFadin — both of whom are former DBAs themselves.

3. I visited ClearStory, and Sharmila Mulligan showed me her actual sales database, as well as telling me some things about funding. The details are all confidential, but ClearStory is clearly doing better than rumor might suggest.

4. Platfora insisted on meeting circumstances in which it was inconvenient for me to take notes. So I have no details to share. But they sounded happy.

Edit: On July 22, it was announced that Workday is acquiring Platfora. Now I understand why Platfora gave me a bit of a runaround.

5. Pneubotics — with a cool new video on its home page — has found its first excellent product/market fit. Traditional heavy metallic robots are great at painting and related tasks when they can remain stationary, or move on rigid metal rails. Neither of those options works well, however, for large curved or irregular surfaces as might be found in the aerospace industry. Customer success for the leading soft robot company has ensued.

This all seems pretty close to the inspection/maintenance/repair area that I previously suggested could be a good soft robotics fit.

Cloudera in the cloud(s)

Curt Monash — Fri, 22 Jan 2016 07:46:34 +0000

Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.

Making Cloudera run in the cloud has three major aspects:

Cloudera’s usual software, ported to run on the cloud platform(s).
Cloudera Director, which for example launches cloud instances.
Points of integration, e.g. taking information about security-oriented roles from the platform and feeding then to the role-based security that is specific to Cloudera Enterprise.

Features new in this week’s release of Cloudera Director include:

An API for job submission.
Support for spot and preemptable instances.
High availability.
Kerberos.
Some cluster repair.
Some cluster cloning.

I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.

As for porting, let me start by noting:

Shared-nothing analytic systems, RDBMS and Hadoop alike, run much better in the cloud than they used to.
Even so, it seems that the future of Hadoop in the cloud is to rely on object storage, such as Amazon S3.

That makes sense in part because:

The applications where shared nothing most drastically outshines object storage are probably the ones in which data can just be filtered from disk — spinning-rust or solid-state as the case may be — and processed in place.
By way of contrast, if data is being redistributed a lot then the shared nothing benefit applies to a much smaller fraction of the overall workload.
The latter group of apps are probably the harder ones to optimize for.

But while it makes sense, much of what’s hardest about the ports involves the move to object storage. The status of that is roughly:

Cloudera already has a lot of its software running on Amazon S3, with Impala/Parquet in beta.
Object storage integration for Windows Azure is “in progress”.
Object storage integration for Google GCP it is “to be determined”.
Security for object storage — e.g. encryption — is a work in progress.
Cloudera Navigator for object storage is a roadmap item.

When I asked about particularly hard parts of porting to object storage, I got three specifics. Two of them sounded like challenges around having less detailed control, specifically in the area of consistency model and capacity planning. The third I frankly didn’t understand,* which was the semantics of move operations, relating to the fact that they were constant time in HDFS, but linear in size on object stores.

*It’s rarely obvious to me why something is o(1) until it is explained to me.

Naturally, we talked about competition, differentiation, adoption and all that stuff. Highlights included:

In general, Cloudera’s three big marketing messages these days can be summarized as “Fast”, “Easy”, and “Secure”.
Notwithstanding the differences as to which parts of the Cloudera stack run on premises, on Amazon AWS, on Microsoft Azure or on Google GCP, Cloudera thinks it’s important that its offering is the “same” on all platforms, which allows “hybrid” deployment.
In general, Cloudera still sees Hortonworks as a much bigger competitor than MapR or IBM.
Cloudera fondly believes that Cloudera Manager is a significant competitive advantage vs. Ambari. (This would presumably be part of the “Easy” claim.)
In particular, Cloudera asserts it has better troubleshooting/monitoring than the cloud alternatives do, because of superior drilldown into details.
Cloudera’s big competitor on the Amazon platform is Elastic MapReduce (EMR). Cloudera points out that EMR lacks various capabilities that are in the Cloudera stack. Of course, versions of these capabilities are sometimes found in other Amazon offerings, such as Redshift.
Cloudera’s big competitor on Azure is HDInsight. Cloudera sells against that via:
- General Cloudera vs. Hortonworks distinctions.
- “Hybrid”/portability.

Cloudera also offered a distinction among three types of workload:

ETL (Extract/Transform/Load) and “modeling” (by which Cloudera seems to mean predictive modeling).
- Cloudera pitches this as batch work.
- Cloudera tries to deposition competitors as being good mainly at these kinds of jobs.
- This can be reasonably said to be the original sweet spot of Hadoop and MapReduce — which fits with Cloudera’s attempt to portray competitors as technical laggards.
- Cloudera observes that these workloads tend to call for “transient” jobs. Lazier marketers might trot out the word “elasticity”.
BI (Business Intelligence) and “analytics”, by which Cloudera seems to mainly mean Impala and Spark.
“Application delivery”, by which Cloudera means operational stuff that can’t be allowed to go down. Presumably, this is a rough match to what I — and by now a lot of other folks as well — call short-request processing.

While I don’t agree with terminology that says modeling is not analytics, the basic distinction being drawn here make considerable sense.

The questionably named Cloudera Navigator Optimizer

Curt Monash — Thu, 19 Nov 2015 11:55:30 +0000

I only have mixed success at getting my clients to reach out to me for messaging advice when they’re introducing something new. Cloudera Navigator Optimizer, which is being announced along with Cloudera 5.5, is one of my failures in that respect; I heard about it for the first time Tuesday afternoon. I hate the name. I hate some of the slides I saw. But I do like one part of the messaging, namely the statement that this is about “refactoring” queries.

All messaging quibbles aside, I think the Cloudera Navigator Optimizer story is actually pretty interesting, and perhaps not just to users of SQL-on-Hadoop technologies such as Hive (which I guess I’d put in that category for simplicity) or Impala. As I understand Cloudera Navigator Optimizer:

It’s all about analytic SQL queries.
Specifically, it’s about reducing duplicated work.
It is not an “optimizer” in the ordinary RDBMS sense of the word.
It’s delivered via SaaS (Software as a Service).
Conceptually, it’s not really tied to SQL-on-Hadoop. However, …
… in practice it likely will be used by customers who want to optimize performance of Cloudera’s preferred styles of SQL-on-Hadoop, either because they’re already using SQL-on-Hadoop or in connection with an initial migration.

It grows out of Xplain.io, which started with the intention of being a general workload optimizer for Hadoop and wound up with this beta announcement of a tuning adviser for analytic SQL.

Right now, the Cloudera Navigator Optimizer service is:

Query code in.
Information and advice out.

Naturally, Cloudera’s intention — perhaps as early as at first general availability — is for the output to start including something that’s more like automation, e.g. hints for the Impala optimizer.

As Anupam Singh describes it, there are basically four kinds of problems that Cloudera Navigator Optimizer can help with:

ETL (Extract/Transform/Load) might repeat the same operation over and over again, e.g. joining to a reference table to help with data cleaning. It can be an optimization to consolidate some of that work. (The same would surely also be true in cases where the workload is more properly described as ELT.)
For business intelligence it is often helpful to materialize aggregates or result sets. (This is, of course, why materialized views were invented in the first place.)
Queries-from-hell — perhaps thousands of lines of SQL long — can perhaps be usefully rewritten into a sequence of much shorter queries.
Ad-hoc query workloads can have enough repetition that there’s opportunity for similar optimizations. Anupam thinks his technology has enough intelligence to detect some of these patterns.

Actually, all four of these cases can involve materializing tables so that they don’t need to keep being in part or whole recreated.

In essence, then, this is a way to add in more query pipelining than the underlying data store automagically provides on its own. And that seems to me like a very good idea to try. The whole thing might be worth trying out at least once, even if your analytic RDBMS installation has nothing to do with SQL at all.

CDH 5.5

Curt Monash — Thu, 19 Nov 2015 11:52:01 +0000

I talked with Cloudera shortly ahead of today’s announcement of Cloudera 5.5. Much of what we talked about had something or other to do with SQL data management. Highlights include:

Impala and Kudu are being donated to Apache. This actually was already announced Tuesday. (Due to Apache’s rules, if I had any discussion with Cloudera speculating on the likelihood of Apache accepting the donations, I would not be free to relay it.)
Cloudera is introducing SQL extensions so that Impala can query nested data structures. More on that below.
- The basic idea for the nested datatype support is that there are SQL extensions with a “dot” notation to let you get at the specific columns you need.
- From a feature standpoint, we’re definitely still in the early days.
  - When I asked about indexes on these quasi-columns, I gathered that they’re not present in beta but are hoped for by the time of general availability.
  - Basic data skipping, also absent in beta, seems to be more confidently expected in GA.
- This is for Parquet first, Avro next, and presumably eventually native JSON as well.
- This is said to be Dremel-like, at least in the case of Parquet. I must confess that I’m not familiar enough with Apache Drill to compare the two efforts.
Cloudera is increasing its coverage of Spark in several ways.
- Cloudera is adding support for MLlib.
- Cloudera is adding support for SparkSQL. More on that below.
- Cloudera is adding support for Spark going against S3. The short answer to “How is this different from the Databricks service?” is:
  - More “platform” stuff from the Hadoop stack (e.g. for data ingest).
  - Less in the way of specific Spark usability stuff.
- Cloudera is putting into beta what it got in the Xplain.io acquisition, which it unfortunately is naming Cloudera Navigator Optimizer. More on that in a separate post.
- Impala and Hive are getting column-level security via Apache Sentry.
- There are other security enhancements.
- Some policy-based information lifecycle management is being added as well.

While I had Cloudera on the phone, I asked a few questions about Impala adoption, specifically focused on concurrency. There was mention of:

Petabyte scale databases — at least one clear case for Impala/business intelligence only, and the likelihood that the Impala/BI part of other bigger installations was also in that range.
Hundreds of nodes.
10s of simultaneous queries in dashboard use cases.
1 – 3 million queries/month as a common figure.

Cloudera also expressed the opinions that:

An “overwhelming majority” of Cloudera customers have adopted Impala. (I imagine there’s a bit of hyperbole in that — for one thing, Cloudera has a pricing option in which Impala is not included.)
It is common for Impala customers to use Hive for “data preparation”.
SparkSQL has “order of magnitude” less performance than Impala, but a little more than performance than Hive running over either Spark or Tez.
SparkSQL’s main use cases are (and these overlap heavily):
- As part of an analytic process (as opposed to straightforwardly DBMS-like use).
- To persist data outside the confines of a single Spark job.