Cloud computing – DBMS 2 : DataBase Management System Services

Notes on artificial intelligence, December 2017

Curt Monash — Tue, 12 Dec 2017 18:53:16 +0000

Most of my comments about artificial intelligence in December, 2015 still hold true. But there are a few points I’d like to add, reiterate or amplify.

1. As I wrote back then in a post about the connection between machine learning and the rest of AI,

It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response.

2. Accordingly, it can be reasonable to equate machine learning and AI.

AI based on machine learning frequently works, on more than a toy level. (Examples: Various projects by Google)
AI based on knowledge representation usually doesn’t. (Examples: IBM Watson, 1980s expert systems)
“AI” can be the sexier marketing or fund-raising term.

3. Similarly, it can be reasonable to equate AI and pattern recognition. Glitzy applications of AI include:

Understanding or translation of language (written or spoken as the case may be).
Machine vision or autonomous vehicles.
Facial recognition.
Disease diagnosis via radiology interpretation.

4. The importance of AI and of recent AI advances differs greatly according to application or data category.

Machine learning and AI have little relevance to most traditional transactional apps.
Predictive modeling is a huge deal in customer-relationship apps. The most advanced organizations developing and using those rely on machine learning. I don’t see an important distinction between machine learning and “artificial intelligence” in this area.
Voice interaction is already revolutionary in certain niches (e.g. smartphones — Siri et al.). The same will likely hold for other natural language or virtual/augmented reality interfaces if and when they go more mainstream. AI seems likely to make a huge impact on user interfaces.
AI also seems likely to have huge impact upon the understanding and reduction of machine-generated data.

5. Right now it seems as if large companies are the runaway leaders in AI commercialization. There are several reasons to think that could last.

They have deep pockets. Yes, but the same is true in any other area of technology. Small companies commonly out-innovate large ones even so.
They have access to lots of data for model training. I find this argument persuasive in some specific areas, most notably any kind of language recognition that can be informed by search engine uses.
AI technology is sometimes part of a much larger whole. That argument is not obviously persuasive. After all, software can often be developed by one company and included as a module in somebody else’s systems. Machine vision has worked that way for decades.

I’m sure there are many niches in which decision-making, decision implementation and feedback are so tightly integrated that they all need to be developed by the same organization. But every example that remotely comes to mind is indeed the kind of niche that smaller companies are commonly able to address.

6. China and Russia are both vowing to lead the world in artificial intelligence. From a privacy/surveillance standpoint, this is worrisome. China also has a reasonable path to doing so (Russia not so much), in line with the “Lots of data makes models strong” line of argument.

The fiasco of Japan’s 1980s “Fifth-Generation Computing” initiative is only partly reassuring.

7. It seems that “deep learning” and GPUs fit well for AI/machine learning uses. I see no natural barriers to that trend, assuming it holds up on its own merits.

Since silicon clock speeds stopped increasing, chip power improvements have mainly taken the form of increased on-chip parallelism.
The general move to the cloud is also not a barrier. I have little doubt major cloud providers could do a good job of providing GPU-based capacity, given that:
They build their own computer systems.
They showed similar flexibility when they adopted flash storage.
Several of them are AI research leaders themselves.

Maybe CPU vendors will co-opt GPU functionality. Maybe not. I haven’t looked into that issue. But either way, it should be OK to adopt software that calls for GPU-style parallel computation.

8. Computer chess is in the news, so of course I have to comment. The core claim is something like:

Google’s AlphaZero technology was trained for four hours playing against itself, with no human heuristic input.
It then decisively beat Stockfish, previously the strongest computer chess program in the world.

My thoughts on that start:

AlphaZero actually beat a very crippled version of Stockfish.
That’s still impressive.
Google only released a small fraction of the games. But in the ones it did release, about half had a common theme — AlphaZero seemed to place great value on what chess analysts call “space”.
This all fits my view that recent splashy AI accomplishments are focused on pattern recognition.

More notes on the transition to the cloud

Curt Monash — Thu, 17 Aug 2017 09:11:01 +0000

Last year I posted observations about the transition to the cloud. Here are some further thoughts.

0. In case any doubt remained, the big questions about transitioning to the cloud are “When?” and “How?”. “Whether”, by way of contrast, is pretty much settled.

1. The answer to “When?” is generally “Over many years”. In particular, at most enterprises the cloud transition will span multiple CIOs’ tenures in their positions.

Few enterprises will ever execute on simple, consistent, unchanging “cloud strategies”.

2. The SaaS (Software as a Service) vs. on-premises tradeoffs are being reargued, except that proponents now spell SaaS C-L-O-U-D. (Ali Ghodsi of Databricks made a particularly energetic version of that case in a recent meeting.)

3. In most countries (at least in the US and the rest of the West), the cloud vendors deemed to matter are Amazon, followed by Microsoft, followed by Google. And so, when it comes to the public cloud, Microsoft is much, much more enterprise-savvy than its key competitors.

4. In another non-technical competitive factor: Wal-Mart isn’t the only huge company that is hostile to the Amazon cloud because of competition with other Amazon businesses.

5. It was once thought that in many small countries around the world, there would be OpenStack-based “national champion” cloud winners, perhaps as subsidiaries of the leading telecom vendors. This doesn’t seem to be happening.

Even so, some of the larger managed-economy and/or generally authoritarian countries will have one or more “national champion” cloud winners each — surely China, presumably Russia, obviously Iran, and probably some others as well.

6. While OpenStack in general seems to have fizzled, S3 compatibility has momentum.

7. Finally, let’s return to our opening points: The cloud transition will happen, but it will take considerable time. A principal reason for slowness is that, as a general rule, apps aren’t migrated to platforms directly; rather, they get replaced by new apps on new platforms when the time is right for them to be phased out anyway.

However, there’s a codicil to those generalities — in some cases it’s easier to migrate to the new platform than in others. The hardest migration was probably when the rise of RDBMS, the shift from mainframes to UNIX and the switch to client/server all happened at once; just about nothing got ported from the old platforms to the new. Easier migrations included:

The switch from Unix to Linux. They were very similar.
The adoption of virtualization. A major purpose of the technology was to make migration easy.
The initial adoption of DBMS. Then-legacy apps relied on flat file systems, which DBMS often found easy to emulate.

The cloud transition is somewhere in the middle between those extremes. On the “easy” side:

Popular database management technologies and so on are available in the cloud just as they are on-premise.
Major app vendors are doing the hard work of cloud ports themselves.

Nonetheless, the public cloud is in many ways a whole new computing environment — and so for the most part, customer-built apps will prove too difficult to migrate. Hence my belief that overall migration to the cloud will be very incremental.

Analytics on the edge?

Curt Monash — Fri, 30 Jun 2017 08:27:18 +0000

There’s a theory going around to the effect that:

Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.

There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.

1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.

2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.

3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration.

There are any number of situations in which decisions are made on or about remote systems, based on models or rules that should be improved over time. For example, such decisions might be made in:

Machine vision or other “recognition”-oriented areas of AI.
Detection or prediction of malfunctions.
Choices as to what data is significant enough to ship back upstream.

In the canonical case, we might envision a system in which:

Huge amounts of data are collected and are used to make real-time decisions.
The models are trained centrally, and updated remotely over time as they are improved.
The remote systems can only ship back selected or aggregated data to help train the models.
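To make the three bullets above concrete, here is a minimal Python sketch of that loop. Everything in it is hypothetical: the names `ThresholdModel`, `EdgeNode` and `retrain`, the readings, and especially the retraining rule, which is an arbitrary stand-in. It does not address how a real system would score the model's successes and failures.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class ThresholdModel:
    """Toy model: flag a reading as anomalous if it exceeds a threshold."""
    version: int
    threshold: float

    def decide(self, reading: float) -> bool:
        return reading > self.threshold


@dataclass
class EdgeNode:
    """Hypothetical appliance: decides locally and keeps only a small summary."""
    model: ThresholdModel
    count: int = 0
    total: float = 0.0
    flagged: list = field(default_factory=list)

    def observe(self, reading: float) -> bool:
        # Real-time decision made on the device, with no round trip.
        alarm = self.model.decide(reading)
        self.count += 1
        self.total += reading
        if alarm:
            self.flagged.append(reading)  # only "significant" raw data is kept
        return alarm

    def summary(self) -> dict:
        # Selected/aggregated data shipped upstream instead of the raw stream.
        return {"count": self.count,
                "mean": self.total / self.count,
                "flagged": list(self.flagged)}

    def install(self, model: ThresholdModel) -> None:
        # Remote model update pushed from the center.
        self.model = model


def retrain(current: ThresholdModel, summaries: list) -> ThresholdModel:
    """Central step. Arbitrary toy rule: threshold = 1.5 x the fleet-wide mean."""
    fleet_mean = mean(s["mean"] for s in summaries)
    return ThresholdModel(version=current.version + 1,
                          threshold=1.5 * fleet_mean)


nodes = [EdgeNode(ThresholdModel(1, 100.0)) for _ in range(2)]
for reading in (10.0, 20.0, 30.0):
    nodes[0].observe(reading)
for reading in (20.0, 40.0, 120.0):
    nodes[1].observe(reading)

new_model = retrain(nodes[0].model, [n.summary() for n in nodes])
for n in nodes:
    n.install(new_model)
```

Only the summaries cross the network in the upstream direction, and only the small model object crosses it in the downstream direction.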

This all seems like an awkward fit for any common computing architecture I can think of.

But it’s hard to pin down important examples of that “canonical” case. The story implicitly assumes:

A model is widely deployed.
The model does a decent job but not a perfect one.
Based on its successes and failures, the model gets improved.

And now we’re begging a huge question: What exactly is there that keeps score as to when the model succeeds and fails? Mathematically speaking, I can’t imagine what a general answer would be like.

4. So when it comes to predictive models executed on real-world appliances I think that analytic workflows will:

Differ for different (categories of) applications.
Rely in most cases on simple patterns of data movement, such as:
- Stream everything to central servers and sort it out there, or if that’s not workable …
- … instrument a limited number of test nodes to store everything, and recover the data in batch for analysis.
- Update models only in timeframes that you’re doing a full app update/refresh.

And with that, much of the apparent need for fancy distributed analytic architectures evaporates.

5. Finally, and notwithstanding the previous point: Across many use cases, there’s some kind of remote log data being shipped back to a central location. It may be the complete log. It may be periodic aggregates. It may happen only for what the edge nodes regard as significant events. But something is getting shipped home.
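The three shipping options just listed (complete log, periodic aggregates, significant events only) can be sketched as three small functions over the same log. The event layout (`t`, `level`, `msg`), the function names and the choice of `ERROR` as the "significant" level are all assumptions made for illustration.

```python
from collections import Counter


def ship_complete(events):
    """Ship every event as-is."""
    return list(events)


def ship_aggregates(events, window):
    """Ship per-window counts of events by level instead of the events."""
    buckets = {}
    for e in events:
        start = (e["t"] // window) * window  # start of the window holding e
        buckets.setdefault(start, Counter())[e["level"]] += 1
    return [{"window_start": s, "counts": dict(c)}
            for s, c in sorted(buckets.items())]


def ship_significant(events, levels=("ERROR",)):
    """Ship only events the edge node regards as significant."""
    return [e for e in events if e["level"] in levels]


log = [
    {"t": 0, "level": "INFO", "msg": "boot"},
    {"t": 5, "level": "INFO", "msg": "heartbeat"},
    {"t": 12, "level": "ERROR", "msg": "sensor timeout"},
    {"t": 18, "level": "INFO", "msg": "heartbeat"},
]

for policy in (ship_complete(log), ship_aggregates(log, 10), ship_significant(log)):
    print(len(policy), "records shipped")
```

The receiving side has to cope with a different record shape for each policy, which is part of why these architectures feel immature.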

The architectures for shipping, receiving and analyzing such data are in many cases immature. That’s obvious if there’s any kind of streaming involved, or if analysis is done in Spark. Ditto if there’s anything we might call “non-tabular business intelligence”. As this stuff matures, it will in many cases fit very well with today’s cloud thinking. But in any case — it needs to mature.

Truth be told, even the relational case is immature, in that it can easily rely on what I called:

data warehouses (perhaps really data marts) that are updated in human real-time

That quote is from a recent post about Kudu, which:

Is designed for exactly that use case.
Went GA early this year.

As always, technology is in flux.

Related links

Interana is another example of very new technology that seems applicable to these use cases.
My 2013 post on the future of IT architectures still rings true.

Light-touch managed services

Curt Monash — Wed, 14 Jun 2017 13:14:06 +0000

Cloudera recently introduced Cloudera Altus, a Hadoop-in-the-cloud offering with an interesting processing model:

Altus manages jobs for you.
But you actually run them on your own cluster, and so you never have to put your data under Altus’ control.

Thus, you avoid a potential security risk (shipping your data to Cloudera’s service). I’ve tentatively named this strategy light-touch managed services, and am interested in exploring how broadly applicable it might or might not be.

For light-touch to be a good approach, there should be (sufficiently) little downside in performance, reliability and so on from having your service not actually control the data. That assumption is trivially satisfied in the case of Cloudera Altus, because it’s not an ordinary kind of app; rather, its whole function is to improve the job-running part of your stack. Most kinds of apps, however, want to operate on your data directly. For those, it is more challenging to meet acceptable SLAs (Service-Level Agreements) on a light-touch basis.

Let’s back up and consider what “light-touch” for data-interacting apps (i.e., almost all apps) would actually mean. The basics are:

The user has some kind of environment that manages data and executes programs.
The light-touch service, running outside this environment, spawns one or more app processes inside it.
Useful work ensues …
… with acceptable reliability and performance.
The environment’s security guarantees ensure that data doesn’t leak out.
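As a rough sketch of those basics, assume a hypothetical `UserEnvironment` that owns both the data and the compute, and a hypothetical `LightTouchService` that only submits jobs and records status. Neither name comes from Cloudera or any real product. The Python callable stands in for job code shipped into the cluster, and the sketch does not itself enforce the security guarantee in the last bullet; it only shows what the service is meant to see.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class JobResult:
    """What the outside service gets back: status and counts, never rows."""
    job_name: str
    ok: bool
    rows_processed: int


class UserEnvironment:
    """Hypothetical customer-controlled cluster that manages data and runs jobs."""

    def __init__(self, tables: Dict[str, List[dict]]):
        self._tables = tables                  # data stays inside this object
        self._outputs: Dict[str, List[dict]] = {}

    def run(self, job_name: str, source: str,
            transform: Callable[[dict], dict]) -> JobResult:
        try:
            rows = self._tables[source]
            self._outputs[job_name] = [transform(r) for r in rows]
            return JobResult(job_name, True, len(rows))
        except Exception:
            # Failures are reported as status only.
            return JobResult(job_name, False, 0)

    def output(self, job_name: str) -> List[dict]:
        # Results are read from inside the environment, not via the service.
        return self._outputs[job_name]


class LightTouchService:
    """Hypothetical outside service: launches and tracks jobs, holds no data."""

    def __init__(self):
        self.history: List[JobResult] = []

    def submit(self, env: UserEnvironment, job_name: str, source: str,
               transform: Callable[[dict], dict]) -> JobResult:
        result = env.run(job_name, source, transform)
        self.history.append(result)
        return result


env = UserEnvironment({"orders": [{"id": 1, "amount": 10.0},
                                  {"id": 2, "amount": 25.0}]})
service = LightTouchService()

result = service.submit(env, "to_cents", "orders",
                        lambda r: {**r, "amount_cents": r["amount"] * 100})
failed = service.submit(env, "bad_job", "missing_table", lambda r: r)
```

The service's `history` holds job names, success flags and row counts, which is roughly the level of information a light-touch design would let leave the cluster.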

Cases where that doesn’t even make sense include but are not limited to:

Transaction-processing applications that are carefully tuned for efficient database access.
Applications that need to be carefully installed on or in connection with a particular server, DBMS, app server or whatever.

On the other hand:

A light-touch service is at least somewhat reasonable in connection with analytics-oriented data-management-plus-processing environments such as Hadoop/Spark clusters.
There are many workloads over Hadoop clusters that don’t need efficient database access. (Otherwise Hive use would not be so prevalent.)
Light-touch efforts seem more likely to be helped than hurt by abstraction environments such as the public cloud.

So we can imagine some kind of outside service that spawns analytic jobs to be run on your preferred — perhaps cloudy — Hadoop/Spark cluster. That could be a safe way to get analytics done over data that really, really, really shouldn’t be allowed to leak.

But before we anoint light-touch managed services as the NBT (Next Big Thing/Newest Bright Thought), there’s one more hurdle for it to overcome — why bother at all? What would a light-touch managed service provide that you wouldn’t also get from installing packaged software onto your cluster and running it in the usual way? The simplest answer is “The benefits of SaaS (Software as a Service)”, and so we can rephrase the challenge as “Which benefits of SaaS still apply in the light-touch managed service scenario?”

The vendor perspective might start, with special cases such as Cloudera Altus excepted:

The cost-saving benefits of multi-tenancy mostly don’t apply. Each instance winds up running on a separate cluster, namely the customer’s own. (But that’s likely to be SaaS/cloud itself.)
The benefits of controlling your execution environment apply at best in part. You may be able to assume the customer’s core cluster is through some cloud service, but you don’t get to run the operation yourself.
The benefits of a SaaS-like product release cycle do mainly apply.
- Only having to support the current version(s) of the product is a little limited when you don’t wholly control your execution environment.
- Light-touch doesn’t seem to interfere with the traditional SaaS approach of a rapid, incremental product release cycle.

When we flip to the user perspective, however, the idea looks a little better.

Cloudy analytics is for folks who favor convenience of various kinds over tightly-managed, blazing performance.
Security and data privacy are ongoing (and increasing) concerns.
Light-touch services are in line with those priorities.

Bottom line: Light-touch managed services are well worth thinking about. But they’re not likely to be a big deal soon.

Cloudera Altus

Curt Monash — Wed, 14 Jun 2017 13:12:48 +0000

I talked with Cloudera before the recent release of Altus. In simplest terms, Cloudera’s cloud strategy aspires to:

Provide all the important advantages of on-premises Cloudera.
Provide all the important advantages of native cloud offerings such as Amazon EMR (Elastic MapReduce), or at least come sufficiently close to that goal.
Benefit from customers’ desire to have on-premises and cloud deployments that work:
- Alike in any case.
- Together, to the extent that that makes use-case sense.

In other words, Cloudera is porting its software to an important new platform.* And this port isn’t complete yet, in that Altus is geared only for certain workloads. Specifically, Altus is focused on “data pipelines”, aka data transformation, aka “data processing”, aka new-age ETL (Extract/Transform/Load). (Other kinds of workload are on the roadmap, including several different styles of Impala use.) So what about that is particularly interesting? Well, let’s drill down.

*Or, if you prefer, improving on early versions of the port.

Since so much of the Hadoop and Spark stacks is open source, competition often isn’t based on core product architecture or features, but rather on factors such as:

Ease of management. This one is nuanced in the case of cloud/Altus. For starters:
- One of Cloudera’s main areas of differentiation has always been Cloudera Manager.
- Cloudera Director was Cloudera’s first foray into cloud-specific management.
- Cloudera Altus features easier/simpler management than Cloudera Director, meant to be analogous to native Amazon management tools, and good-enough for use cases that don’t require strenuous optimization.
- Cloudera Altus also includes an optional workload analyzer, in slight conflict with other parts of the Altus story. More on that below.
Ease of development. Frankly, this rarely seems to come up as a differentiator in the Hadoop/Spark world, various “notebook” offerings such as Databricks’ or Cloudera’s notwithstanding.
Price. When price is the major determinant, Cloudera is sad.
Open source purity. Ditto. But at most enterprises — at least those with hefty IT budgets — emphasis on open source purity either is a proxy for price shopping, or else boils down to largely bogus concerns about vendor lock-in.

Of course, “core” kinds of considerations are present to some extent too, including:

Performance, concurrency, etc. I no longer hear many allegations of differences in across-the-board Hadoop performance. But the subject does arise in specific areas, most obviously in analytic SQL processing. It arises in the case of Altus as well, in that Cloudera improved in a couple of areas that it concedes were previously Amazon EMR advantages, namely:
- Interacting with S3 data stores.
- Spinning instances up and down.
Reliability and data safety. Cloudera mentioned that it did some work so as to be comfortable with S3’s eventual consistency model.

Recently, Cloudera has succeeded at blowing security up into a major competitive consideration. Of course, they’re trying that with Altus as well. Much of the Cloudera Altus story is the usual — rah-rah Cloudera security, Sentry, Kerberos everywhere, etc. But there’s one aspect that I find to be simple yet really interesting:

Cloudera Altus doesn’t manage data for you.
Rather, it launches and manages jobs on a separate Hadoop cluster.

Thus, there are very few new security risks to running Cloudera Altus, beyond whatever risks are inherent to running any version of Hadoop in the public cloud.

Where things get a bit more complicated is some features for workload analysis.

Cloudera recently introduced some capabilities for on-the-fly trouble-shooting. That’s fine.
Cloudera has also now announced an offline workload analyzer, which compares actual metrics computed from your log files to “normal” ones from well-running jobs. For that, you really do have to ship information to a separate cluster managed by Cloudera.

The information shipped is logs rather than actual query results or raw data. In theory, an attacker who had all those logs could conceivably make inferences about the data itself; but in practice, that doesn’t seem like an important security risk at all.
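To illustrate the general idea of comparing a job's log-derived metrics to “normal” ones, here is a minimal Python sketch. It is not Cloudera's method, which the post does not describe. The metric names, the sample numbers and the rule of flagging anything more than `k` standard deviations from the baseline mean are all assumptions.

```python
from statistics import mean, stdev


def build_baseline(good_runs):
    """Per-metric (mean, standard deviation) over well-running jobs."""
    metrics = good_runs[0].keys()
    return {m: (mean([r[m] for r in good_runs]),
                stdev([r[m] for r in good_runs]))
            for m in metrics}


def find_anomalies(job_metrics, baseline, k=3.0):
    """Return the metrics that sit more than k standard deviations from normal."""
    anomalies = {}
    for metric, (mu, sigma) in baseline.items():
        value = job_metrics[metric]
        if sigma > 0 and abs(value - mu) > k * sigma:
            anomalies[metric] = value
    return anomalies


# Hypothetical metrics extracted from log files of healthy runs.
good_runs = [
    {"runtime_s": 100, "shuffle_mb": 500},
    {"runtime_s": 110, "shuffle_mb": 520},
    {"runtime_s": 90, "shuffle_mb": 480},
]
baseline = build_baseline(good_runs)

# Hypothetical metrics for the job being analyzed.
suspect = {"runtime_s": 105, "shuffle_mb": 900}
print(find_anomalies(suspect, baseline))
```

Note that only numeric summaries of the kind found in logs are needed here, not query results or raw data.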

So is this an odd situation where that strategy works, or could what we might call light-touch managed services turn out to be widespread and important? That’s a good question to address in a separate post.

Notes on the transition to the cloud

Curt Monash — Tue, 04 Oct 2016 02:22:21 +0000

1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:

The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
Software as a service, aka SaaS.
Co-location in off-premises data centers, aka colo.
On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.

Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.

This is a good example of Monash’s Laws of Commercial Semantics.

2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.

This fact now seems to be widely understood.

3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.

This is frequently a good reason for new — i.e. “greenfield” — apps to run in the cloud.

4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.

In connection to that point, it might be interesting to note:

In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
Other big users were hospitals and stockbrokers.
The US intelligence agencies are building out their own shared, dedicated cloud.

5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.

In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.

6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.

7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:

Does your application need to run close to your customers’ largest databases?
Do your customers still avoid the public cloud?

If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.

8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:

Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
Departments don’t necessarily care about issues that give central IT pause.
Departments sometimes buy things that only are available via remote delivery, e.g. narrowly focused SaaS applications or market data.

I discussed some of this in my recent post on vendor lock-in.

9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.

Those concerns seem to have been largely alleviated.

10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.

11. If you think about system requirements:

There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
Server-side computing, however, is more homogeneous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.

I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.

So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.

Otherwise, clusters will probably be organized by database, in the most expansive sense of the term. For example, there could be separate clusters for:

Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
Relational data warehousing, whether in an analytic RDBMS or otherwise.
Log files, perhaps managed in Hadoop or Splunk.
Your website and other internet back-ends, perhaps running over NoSQL data stores.
Text documents managed by some kind of search engine.
Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)

Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
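The ETL vs. ELT distinction can be shown in miniature with Python's built-in `sqlite3` module standing in for the analytic RDBMS. This is only a sketch: the table names, the sample rows and the cleanup rules are invented for illustration, and SQLite is obviously not an analytic RDBMS.

```python
import sqlite3

# Hypothetical messy source rows: (day, product, amount as text).
raw_rows = [
    ("2016-08-01", " widget ", "19.99"),
    ("2016-08-01", "GADGET", "5.00"),
    ("2016-08-02", "Widget", "20.01"),
]


def etl(conn, rows):
    """ETL: transform in application code first, then load clean rows."""
    conn.execute("CREATE TABLE sales_etl (day TEXT, product TEXT, amount REAL)")
    cleaned = [(d, p.strip().lower(), float(a)) for d, p, a in rows]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?, ?)", cleaned)


def elt(conn, rows):
    """ELT: load raw strings first, then transform with SQL in the database."""
    conn.execute("CREATE TABLE sales_raw (day TEXT, product TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT day, lower(trim(product)) AS product, "
        "CAST(amount AS REAL) AS amount "
        "FROM sales_raw"
    )


conn = sqlite3.connect(":memory:")
etl(conn, raw_rows)
elt(conn, raw_rows)

query = "SELECT product, COUNT(*) FROM {} GROUP BY product ORDER BY product"
etl_result = conn.execute(query.format("sales_etl")).fetchall()
elt_result = conn.execute(query.format("sales_elt")).fetchall()
print(etl_result == elt_result)
```

Both routes end with the same clean table; what differs is which engine does the transformation work, which is the part of the workload Hadoop tends to take over.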

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer these trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Notes on Spark and Databricks — technology

Curt Monash — Sun, 31 Jul 2016 14:30:18 +0000

During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:

Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
A lot of this is via a focus on simplified APIs, based on Spark DataFrames.
- Unlike similarly named APIs in R and Python, Spark DataFrames work with nested data.
- Machine learning and Spark Streaming both work with Spark DataFrames.
There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-DS queries.

The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are NDA, but it seems fair to mention at least:

Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
Databricks has been working on security, and even on the associated certifications.

Two of the technical initiatives Reynold told me about seemed particularly cool. One, on the machine learning side, was a focus on training models online as new data streams in. In most cases this seems to require new algorithms for old model types, with a core idea being that the algorithm does a mini gradient descent for each new data point.
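
As a rough illustration of the "mini gradient descent for each new data point" idea, here is a minimal sketch of online training for a one-variable linear model. It is not Databricks' algorithm; the class name, learning rate and simulated stream are all assumptions made for the example.

```python
import random


class OnlineLinearModel:
    """Linear regression y ~ w*x + b, updated one observation at a time."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        """One small gradient step on squared error for a single new point."""
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error


def simulated_stream(n, seed=0):
    """Hypothetical data stream drawn from y = 2x + 1 plus a little noise."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        yield x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.01)


model = OnlineLinearModel()
for x, y in simulated_stream(5000):
    model.update(x, y)  # the model is usable after every single update
```

The model never sees the data set as a whole. Each arriving point nudges the parameters slightly, which is why the approach suits data that streams in continuously.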

The other cool idea fits the trend of alternatives to the “lambda architecture”. Under the name “structured streaming”, which seems to be a replacement for “DStreaming”, the idea is to do set-based SQL processing even though membership of the set changes over time. Result sets are extracted on a snapshot basis; you can keep either all the results from each snapshot query or just the deltas.
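
The snapshot idea can be sketched in plain Python, without using Spark's actual API. In this toy version, a grouped count is re-evaluated whenever a new batch of rows arrives, and the caller chooses between keeping the full result of each snapshot or only the rows that changed. The function name, the mode names and the sample events are all hypothetical.

```python
from collections import Counter


def run_snapshots(batches, mode="complete"):
    """Re-evaluate a GROUP BY count each time a new batch of rows arrives.

    mode="complete" emits the whole result set per snapshot;
    mode="delta" emits only the groups whose count changed.
    """
    counts = Counter()  # running result of: SELECT key, COUNT(*) ... GROUP BY key
    outputs = []
    for batch in batches:
        previous = dict(counts)
        counts.update(batch)
        if mode == "complete":
            outputs.append(dict(counts))
        else:
            outputs.append({k: v for k, v in counts.items() if previous.get(k) != v})
    return outputs


# Hypothetical event stream, arriving in three micro-batches.
batches = [["click", "view"], ["click"], ["buy", "view"]]
complete = run_snapshots(batches, mode="complete")
deltas = run_snapshots(batches, mode="delta")
```

The query logic is the same set-based aggregation in both modes; only the membership of the set and the form of the output change over time.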

Despite all this, there’s some non-trivial dissatisfaction with Spark, fair or otherwise.

Some of the reason is that SparkSQL is too immature to be great.
Some is annoyance that Databricks isn’t putting everything it has into open source.
Some is that everything has its architectural trade-offs.

To the last point, I raised one of the biggest specifics with Reynold, namely Spark’s lack of a strong built-in data persistence capability. Reynold’s answer was that they’re always working to speed up reading and writing from other forms of persistent storage. E.g., he cited a figure of ~100 million rows/core/second decoded from Parquet.

Notes from a long trip, July 19, 2016

Curt Monash — Wed, 20 Jul 2016 01:34:31 +0000

For starters:

I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.

A running list of recent posts is:

As a companion to this post, I’m publishing a very long one on vendor lock-in.
Spark and Databricks are both prospering, and of course enhancing their technology as well.
Ditto DataStax.
Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.

Subjects I’d like to add to that list include:

MemSQL, Zoomdata, and Neo Technology (also prospering).
Cloudera (multiple topics, as usual).
Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
Microsoft’s reinvention (it feels real).
Metadata (it’s ever more of a thing).
Machine learning (it’s going to be a big portion of my research going forward).
Transitions to the cloud — this subject affects almost everything else.

I’ll edit these lists as appropriate when further posts go up. Last update: August 23, 2016.

Let’s cover some other subjects right here.

1. While Kafka is widely agreed to be the universal delivery mechanism for streams, the landscape for companion technologies is confused.

Back in January I wrote that the leaders were mainly Spark Streaming, followed by Storm.
I overlooked the fact that Storm creator Twitter was replacing Storm with something called Heron.*
If there’s any buzz about Confluent’s replacement for distant-third-place contender Samza, I missed it.
Opinions about Spark Streaming are mixed. Some folks want to get away from it; others like it just fine.

And of course Flink is hoping to blow everybody else in the space away.

*But that kind of thing is not necessarily a death knell. Cassandra inventor Facebook soon replaced Cassandra with HBase, yet Cassandra is doing just fine.

As for the “lambda architecture” — that has always felt like a kludge, and various outfits are trying to obsolete it in various ways. As just one example, Cloudera described that to me during my visit as one of the main points of Kudu.

2. The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

I had a moment of clarity on this point while visiting my clients at DataStax, and discussing their goal — shared by numerous companies — of being properly appreciated for the management tools they provide. In the room with me were CEO Billy Bosworth and chief evangelist Patrick McFadin — both of whom are former DBAs themselves.

3. I visited ClearStory, and Sharmila Mulligan showed me her actual sales database, as well as telling me some things about funding. The details are all confidential, but ClearStory is clearly doing better than rumor might suggest.

4. Platfora insisted on meeting in circumstances in which it was inconvenient for me to take notes. So I have no details to share. But they sounded happy.

Edit: On July 22, it was announced that Workday is acquiring Platfora. Now I understand why Platfora gave me a bit of a runaround.

5. Pneubotics — with a cool new video on its home page — has found its first excellent product/market fit. Traditional heavy metallic robots are great at painting and related tasks when they can remain stationary, or move on rigid metal rails. Neither of those options works well, however, for large curved or irregular surfaces as might be found in the aerospace industry. Customer success for the leading soft robot company has ensued.

This all seems pretty close to the inspection/maintenance/repair area that I previously suggested could be a good soft robotics fit.

Cloudera in the cloud(s)

Curt Monash — Fri, 22 Jan 2016 07:46:34 +0000

Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.

Making Cloudera run in the cloud has three major aspects:

Cloudera’s usual software, ported to run on the cloud platform(s).
Cloudera Director, which for example launches cloud instances.
Points of integration, e.g. taking information about security-oriented roles from the platform and feeding them to the role-based security that is specific to Cloudera Enterprise.

Features new in this week’s release of Cloudera Director include:

An API for job submission.
Support for spot and preemptible instances.
High availability.
Kerberos.
Some cluster repair.
Some cluster cloning.

I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.

As for porting, let me start by noting:

Shared-nothing analytic systems, RDBMS and Hadoop alike, run much better in the cloud than they used to.
Even so, it seems that the future of Hadoop in the cloud is to rely on object storage, such as Amazon S3.

That makes sense in part because:

The applications where shared nothing most drastically outshines object storage are probably the ones in which data can just be filtered from disk — spinning-rust or solid-state as the case may be — and processed in place.
By way of contrast, if data is being redistributed a lot then the shared nothing benefit applies to a much smaller fraction of the overall workload.
The latter group of apps are probably the harder ones to optimize for.

But while it makes sense, much of what’s hardest about the ports involves the move to object storage. The status of that is roughly:

Cloudera already has a lot of its software running on Amazon S3, with Impala/Parquet in beta.
Object storage integration for Windows Azure is “in progress”.
Object storage integration for Google GCP is “to be determined”.
Security for object storage — e.g. encryption — is a work in progress.
Cloudera Navigator for object storage is a roadmap item.

When I asked about particularly hard parts of porting to object storage, I got three specifics. Two of them sounded like challenges around having less detailed control, specifically in the area of consistency model and capacity planning. The third I frankly didn’t understand,* which was the semantics of move operations, relating to the fact that they were constant time in HDFS, but linear in size on object stores.

*It’s rarely obvious to me why something is O(1) until it is explained to me.
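
One plausible reading of the move-semantics point can be sketched with two toy models. The assumption, labeled as such, is that the file system keeps directories as nodes in a namespace tree, while the object store has flat keys and offers only copy and delete, so that a "directory" is just a key prefix. The class names and the operation counter are hypothetical, and the sketch counts per-object operations rather than bytes copied.

```python
class TreeNamespace:
    """HDFS-like: directories are nodes, so a move relinks one entry."""

    def __init__(self):
        self.root = {}
        self.operations = 0

    def put(self, directory, name, data):
        self.root.setdefault(directory, {})[name] = data

    def move_directory(self, src, dst):
        self.root[dst] = self.root.pop(src)  # one metadata change
        self.operations += 1


class FlatObjectStore:
    """Object-store-like: keys are flat strings; a "directory" is only a prefix."""

    def __init__(self):
        self.objects = {}
        self.operations = 0

    def put(self, key, data):
        self.objects[key] = data

    def move_prefix(self, src, dst):
        for key in [k for k in self.objects if k.startswith(src)]:
            self.objects[dst + key[len(src):]] = self.objects.pop(key)  # copy + delete
            self.operations += 1


tree, store = TreeNamespace(), FlatObjectStore()
for i in range(1000):
    tree.put("tmp", f"part-{i}", b"x")
    store.put(f"tmp/part-{i}", b"x")
tree.move_directory("tmp", "final")
store.move_prefix("tmp/", "final/")
```

Under these assumptions the tree move costs one operation regardless of how much data sits under the directory, while the prefix move touches every object. That matters for jobs that write output to a temporary location and then rename it into place.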

Naturally, we talked about competition, differentiation, adoption and all that stuff. Highlights included:

In general, Cloudera’s three big marketing messages these days can be summarized as “Fast”, “Easy”, and “Secure”.
Notwithstanding the differences as to which parts of the Cloudera stack run on premises, on Amazon AWS, on Microsoft Azure or on Google GCP, Cloudera thinks it’s important that its offering is the “same” on all platforms, which allows “hybrid” deployment.
In general, Cloudera still sees Hortonworks as a much bigger competitor than MapR or IBM.
Cloudera fondly believes that Cloudera Manager is a significant competitive advantage vs. Ambari. (This would presumably be part of the “Easy” claim.)
In particular, Cloudera asserts it has better troubleshooting/monitoring than the cloud alternatives do, because of superior drilldown into details.
Cloudera’s big competitor on the Amazon platform is Elastic MapReduce (EMR). Cloudera points out that EMR lacks various capabilities that are in the Cloudera stack. Of course, versions of these capabilities are sometimes found in other Amazon offerings, such as Redshift.
Cloudera’s big competitor on Azure is HDInsight. Cloudera sells against that via:
- General Cloudera vs. Hortonworks distinctions.
- “Hybrid”/portability.

Cloudera also offered a distinction among three types of workload:

ETL (Extract/Transform/Load) and “modeling” (by which Cloudera seems to mean predictive modeling).
- Cloudera pitches this as batch work.
- Cloudera tries to deposition competitors as being good mainly at these kinds of jobs.
- This can be reasonably said to be the original sweet spot of Hadoop and MapReduce — which fits with Cloudera’s attempt to portray competitors as technical laggards.
- Cloudera observes that these workloads tend to call for “transient” jobs. Lazier marketers might trot out the word “elasticity”.
BI (Business Intelligence) and “analytics”, by which Cloudera seems to mainly mean Impala and Spark.
“Application delivery”, by which Cloudera means operational stuff that can’t be allowed to go down. Presumably, this is a rough match to what I — and by now a lot of other folks as well — call short-request processing.

While I don’t agree with terminology that says modeling is not analytics, the basic distinction being drawn here makes considerable sense.