March 19, 2017

Cloudera’s Data Science Workbench

0. Matt Brandwein of Cloudera briefed me on the new Cloudera Data Science Workbench. The problem it purports to solve is:

Cloudera’s idea for a third way is:

In theory, that’s pure goodness … assuming that the automagic works sufficiently well. I gather that Cloudera Data Science Workbench has been beta tested by 5 large organizations and many tens of users. We’ll see what is or isn’t missing as more customers take it for a spin.

1. Recall that Cloudera installations have 4 kinds of nodes. 3 are obvious:

The fourth kind are edge/gateway nodes. Those handle connections to the outside world, and can also run selected third-party software. They also are where Cloudera Data Science Workbench lives.

2. One point of this architecture is to let each data scientist run the languages and tools of her choice. Docker isolation is supposed to make that practical and safe.

And so we have a case of the workbench metaphor actually being accurate! While a “workbench” is commonly just an integrated set of tools, in this case it’s also a place for you to bring in and use other tools you personally like.

Surely there are some restrictions as to which tools you can use, but I didn’t ask for those to be spelled out.

3. Matt kept talking about security, to an extent I recall in almost no other analytics-oriented briefing. This had several aspects.

4. To a first approximation, the target users of Cloudera Data Science Workbench can be characterized the same way BI-oriented business analysts are. They’re people with:

Of course, “sufficiently good quantitative skills” can mean something quite different in data science than it does for the glorified arithmetic of ordinary business intelligence.

5. Cloudera Data Science Workbench doesn’t have any special magic in parallelization. It just helps you access the parallelism that’s already out there. Some algorithms are easy to parallelize. Some libraries have parallelized a few algorithms beyond that. Otherwise, you’re on your own.
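To make the "easy to parallelize" distinction concrete, here is a minimal Python sketch. It is not Cloudera code, and the function names are made up for illustration. The first computation splits cleanly into independent chunks whose results are simply added up; the second carries state from one step to the next, so it can't be split the same way. Threads are used only to show the structure. In CPython they won't speed up CPU-bound work, and on a real cluster a framework such as Spark would presumably do the distributing.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_of_squares(chunk):
    # Each chunk is processed with no knowledge of the others.
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, n_chunks=4):
    # Easy case: split the data, compute per-chunk results, then combine.
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        return sum(pool.map(partial_sum_of_squares, chunks))


def sequential_recurrence(values, x0=0.0, decay=0.5):
    # Hard case: every step depends on the previous result,
    # so naive chunking would give the wrong answer.
    x = x0
    for v in values:
        x = decay * x + v
    return x


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
    print(sequential_recurrence([1.0, 1.0, 1.0]))
```

The point of the contrast: the workbench can hand you the first kind of parallelism if your library or framework already supports it, but it won't restructure the second kind for you.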

6. When I asked whether Cloudera Data Science Workbench was open source (like most of what Cloudera provides) or closed source (like Cloudera Manager), I didn’t get the clearest of answers. On the one hand, it’s a Cloudera-specific product, as the name suggests; on the other, it’s positioned as having been stitched together almost entirely from a collection of open source projects.


2 Responses to “Cloudera’s Data Science Workbench”

  1. J-Luc Billy on March 20th, 2017 2:33 pm

    “… standard uses of notebook tools such as Jupyter or Zeppelin wind up having data stored wherever code is …” > and that place is, IMHO, an Edge Node, because code requires the Hadoop libraries & Kerberos plumbing to access the secured Hadoop data. I see no real difference between Cloudera Workbench and Hortonworks-backed Zeppelin in that respect (maybe Jupyter users are less aware of the jargon because they don’t come from a Hadoop background… but if it connects like an Edge and quacks like an Edge, then it’s an Edge Node!)

  2. Episode 38 – Dataworks Summit 2017 – Preview – Roaring Elephant on March 28th, 2017 4:01 am
