Amazon and its cloud – DBMS 2 : DataBase Management System Services

The technology industry is under broad political attack

Curt Monash — Fri, 15 Dec 2017 09:25:55 +0000

I apologize for posting a December downer, but this needs to be said.

The technology industry is under attack:

From politicians and political pundits …
… especially from “populists” and/or the political right …
… in the United States and other countries.

These attacks:

Are in some cases specific to internet companies such as Google and Facebook.
In some cases threaten the tech industry more broadly.
Are in some cases part of general attacks on the educated/ professional/“globalist”/”coastal” “elites”.

You’ve surely noticed some of these attacks. But you may not have noticed just how many different attacks and criticisms there are, on multiple levels.

1. Concerns about jobs, disruption, gentrification and so on are a Really Big Deal, causing large swaths of the population to regard technology as bad for their pocketbooks. In particular:

There’s tremendous concern about job loss to automation and/or globalization. Technology helps cause the first and enable the second.
Generally, when an industry destroys jobs, one hopes that it will create new ones to take their place. But while US technology companies have created many jobs, a lot of those are overseas.
Flaps about overseas finances, taxes, and so on aren’t helping. Apple, for example, has major issues in Europe and the US alike.
Working-class jobs that tech companies do create are often attacked for their pay and conditions, e.g. for Amazon warehouse workers or Uber drivers.
Even when the technology industry unquestionably creates good, domestic jobs, the industry may be attacked for them. Consider for example the concerns about cost of living/gentrification in Northern California.
“Sharing economy” companies such as Uber and Airbnb and others are involved in local political fights all around the world, as they undercut traditional service providers.

People who believe that technologists harm them are a major political force.

2. The technology industry is under considerable legislative, regulatory, and judicial pressure. For starters:

Tech companies are attacked for doing too little to aid law enforcement and government surveillance.
Tech companies are attacked for doing too much to aid law enforcement and government surveillance.
Tech companies are attacked for doing too little censorship.
Tech companies are attacked for doing too much censorship.
Privacy regulations are ever-changing.

Complicating things further, these challenges take different forms in different countries around the world.

Also:

China pressures foreign vendors to transfer technology into China.
Recent network neutrality developments in the US favor older telecom providers, at the expense of newer internet companies.
Anti-immigrant policies in the US threaten tech vendors.

I could keep going much longer than that. Government relations are a major, major issue for tech.

3. It is traditional to claim that advances in communication/media technologies will wreck society.

Television was going to make us mass-conformist couch potatoes.
Video games were going to make us violent couch potatoes.

This era brings similar concerns.

Social media makes us couch potatoes sitting in niche-conformist echo chambers.
Modern media over-stimulate us and wreck our attention spans.

I.e., the apocalypse is imminent, and tech is what will bring it on.

The most compelling version of that argument I’ve seen is Jean Twenge’s claims that there’s a teen mental health crisis perfectly matched in time to the rise of the smartphone. And to make any such claim seem particularly damning, please recall: Social media and gaming companies are clearly trying to foster a form of addiction in — well, in their users.

Current concern may ebb just like previous generations’ did. But for now, they’re yet another aspect of a threat-filled environment.

4. What worries me most is this: The United States and other countries face relentless attacks on education, educators, science, scientists, and rationality itself. And there are no obvious limits to how bad these can get. China’s Cultural Revolution and the Cambodian genocide happened during my lifetime. Stalin and Hitler ruled during my parents’. All four took particular aim at people like us.

Bottom line: EVERYBODY in the technology industry should be or quickly become politically aware. We have an awful lot of politics to deal with.

More notes on the transition to the cloud

Curt Monash — Thu, 17 Aug 2017 09:11:01 +0000

Last year I posted observations about the transition to the cloud. Here are some further thoughts.

0. In case any doubt remained, the big questions about transitioning to the cloud are “When?” and “How?”. “Whether”, by way of contrast, is pretty much settled.

1. The answer to “When?” is generally “Over many years”. In particular, at most enterprises the cloud transition will span multiple CIO’s tenure in their positions.

Few enterprises will ever execute on simple, consistent, unchanging “cloud strategies”.

2. The SaaS (Software as a Service) vs. on-premises tradeoffs are being reargued, except that proponents now spell SaaS C-L-O-U-D. (Ali Ghodsi of Databricks made a particularly energetic version of that case in a recent meeting.)

3. In most countries (at least in the US and the rest of the West), the cloud vendors deemed to matter are Amazon, followed by Microsoft, followed by Google. And so, when it comes to the public cloud, Microsoft is much, much more enterprise-savvy than its key competitors.

4. In another non-technical competitive factor: Wal-Mart isn’t the only huge company that is hostile to the Amazon cloud because of competition with other Amazon businesses.

5. It was once thought that in many small countries around the world, there would be OpenStack-based “national champion” cloud winners, perhaps as subsidiaries of the leading telecom vendors. This doesn’t seem to be happening.

Even so, some of the larger managed-economy and/or generally authoritarian countries will have one or more “national champion” cloud winners each — surely China, presumably Russia, obviously Iran, and probably some others as well.

6. While OpenStack in general seems to have fizzled, S3 compatibility has momentum.

7. Finally, let’s return to our opening points: The cloud transition will happen, but it will take considerable time. A principal reason for slowness is that, as a general rule, apps aren’t migrated to platforms directly; rather, they get replaced by new apps on new platforms when the time is right for them to be phased out anyway.

However, there’s a codicil to those generalities — in some cases it’s easier to migrate to the new platform than in others. The hardest migration was probably when the rise of RDBMS, the shift from mainframes to UNIX and the switch to client/server all happened at once; just about nothing got ported from the old platforms to the new. Easier migrations included:

The switch from Unix to Linux. They were very similar.
The adoption of virtualization. A major purpose of the technology was to make migration easy.
The initial adoption of DBMS. Then-legacy apps relied on flat file systems, which DBMS often found easy to emulate.

The cloud transition is somewhere in the middle between those extremes. On the “easy” side:

Popular database management technologies and so on are available in the cloud just as they are on-premise.
Major app vendors are doing the hard work of cloud ports themselves.

Nonetheless, the public cloud is in many ways a whole new computing environment — and so for the most part, customer-built apps will prove too difficult to migrate. Hence my belief that overall migration to the cloud will be very incremental.

Cloudera Altus

Curt Monash — Wed, 14 Jun 2017 13:12:48 +0000

I talked with Cloudera before the recent release of Altus. In simplest terms, Cloudera’s cloud strategy aspires to:

Provide all the important advantages of on-premises Cloudera.
Provide all the important advantages of native cloud offerings such as Amazon EMR (Elastic MapReduce, or at least come sufficiently close to that goal.
Benefit from customers’ desire to have on-premises and cloud deployments that work:
- Alike in any case.
- Together, to the extent that that makes use-case sense.

In other words, Cloudera is porting its software to an important new platform.* And this port isn’t complete yet, in that Altus is geared only for certain workloads. Specifically, Altus is focused on “data pipelines”, aka data transformation, aka “data processing”, aka new-age ETL (Extract/Transform/Load). (Other kinds of workload are on the roadmap, including several different styles of Impala use.) So what about that is particularly interesting? Well, let’s drill down.

*Or, if you prefer, improving on early versions of the port.

Since so much of the Hadoop and Spark stacks is open source, competition often isn’t based on core product architecture or features, but rather on factors such as:

Ease of management. This one is nuanced in the case of cloud/Altus. For starters:
- One of Cloudera’s main areas of differentiation has always been Cloudera Manager.
- Cloudera Director was Cloudera’s first foray into cloud-specific management.
- Cloudera Altus features easier/simpler management than Cloudera Director, meant to be analogous to native Amazon management tools, and good-enough for use cases that don’t require strenuous optimization.
- Cloudera Altus also includes an optional workload analyzer, in slight conflict with other parts of the Altus story. More on that below.
Ease of development. Frankly, this rarely seems to come up as a differentiator in the Hadoop/Spark world, various “notebook” offerings such as Databricks’ or Cloudera’s notwithstanding.
Price. When price is the major determinant, Cloudera is sad.
Open source purity. Ditto. But at most enterprises — at least those with hefty IT budgets — emphasis on open source purity either is a proxy for price shopping, or else boils down to largely bogus concerns about vendor lock-in.

Of course, “core” kinds of considerations are present to some extent too, including:

Performance, concurrency, etc. I no longer hear many allegations of differences in across-the-board Hadoop performance. But the subject does arise in specific areas, most obviously in analytic SQL processing. It arises in the case of Altus as well, in that Cloudera improved in a couple of areas that it concedes were previously Amazon EMR advantages, namely:
- Interacting with S3 data stores.
- Spinning instances up and down.
Reliability and data safety. Cloudera mentioned that it did some work so as to be comfortable with S3’s eventual consistency model.

Recently, Cloudera has succeeded at blowing security up into a major competitive consideration. Of course, they’re trying that with Altus as well. Much of the Cloudera Altus story is the usual — rah-rah Cloudera security, Sentry, Kerberos everywhere, etc. But there’s one aspect that I find to be simple yet really interesting:

Cloudera Altus doesn’t manage data for you.
Rather, it launches and manages jobs on a separate Hadoop cluster.

Thus, there are very few new security risks to running Cloudera Altus, beyond whatever risks are inherent to running any version of Hadoop in the public cloud.

Where things get a bit more complicated is some features for workload analysis.

Cloudera recently introduced some capabilities for on-the-fly trouble-shooting. That’s fine.
Cloudera has also now announced an offline workload analyzer, which compares actual metrics computed from your log files to “normal” ones from well-running jobs. For that, you really do have to ship information to a separate cluster managed by Cloudera.

The information shipped is logs rather than actual query results or raw data. In theory, an attacker who had all those logs could conceivably make inferences about the data itself; but in practice, that doesn’t seem like an important security risk at all.

So is this an odd situation where that strategy works, or could what we might call light-touch managed services turn out to be widespread and important? That’s a good question to address in a separate post.

Notes on the transition to the cloud

Curt Monash — Tue, 04 Oct 2016 02:22:21 +0000

1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:

The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
Software as a service, aka SaaS.
Co-location in off-premises data centers, aka colo.
On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.

Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.

This is a good example of Monash’s Laws of Commercial Semantics.

2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.

This fact now seems to be widely understood.

3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.

This is frequently a good reason for new — i.e. “greenfield” — apps to run in the cloud.

4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.

In connection to that point, it might be interesting to note:

In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
Other big users were hospitals and stockbrokers.
The US intelligence agencies are building out their own shared, dedicated cloud.

5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.

In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.

6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.

7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:

Does your application need to run close to your customers’ largest databases?
Do your customers still avoid the public cloud?

If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.

8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:

Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
Departments don’t necessarily care about issues that give central IT pause.
Departments sometimes buy things that only are available via remote delivery, e.g. narrowly focused SaaS applications or market data.

I discussed some of this in my recent post on vendor lock-in.

9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.

Those concerns seem to have been largely alleviated.

10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.

11. If you think about system requirements:

There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
Server-side computing, however, is more homogenous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.

I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.

So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.

Otherwise, clusters will probably be organized by database, in the most expansive sense of term. For example, there could be separate clusters for:

Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
Relational data warehousing, whether in an analytic RDBMS or otherwise.
Log files, perhaps managed in Hadoop or Splunk.
Your website and other internet back-ends, perhaps running over NoSQL data stores.
Text documents managed by some kind of search engine.
Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)

Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Notes on vendor lock-in

Curt Monash — Wed, 20 Jul 2016 01:35:32 +0000

Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.

1. The most basic form of lock-in is:

You do application development for a target set of platform technologies.
Your applications can’t run without those platforms underneath.
Hence, you’re locked into those platforms.

2. Enterprise vendor standardization is closely associated with lock-in. The core idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:

That simplifies your environment, requiring less integration and interoperability.
That simplifies your staffing; the same skill sets apply to multiple needs and projects.
That simplifies your vendor support relationships; there’s “one throat to choke”.
That simplifies your price negotiation.

3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works:

For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
A few years later, the price is renegotiated, based on then-current levels of usage.

Thus, doing an additional project using ELAed products may appear low-cost.

Incremental license and maintenance fees may be zero in the short-term.
Incremental personnel costs may be controlled because the needed skills are already in-house.

Often those appearances are substantially correct. That’s a big reason why incumbent software is difficult to supplant unless the upstart substitute is superior in fundamental and important ways.

4. Subscriptions are closely associated with lock-in.

Most obviously, the traditional software industry gets its profits from high-margin support/maintenance services.
Cloud lock-in has rapidly become a big deal.
The open source vendors meeting lock-in resistance, noted below, have subscription business models.

Much of why customers care about lock-in is the subscription costs it’s likely to commit them to.

5. Also related to lock-in are thick single-vendor technology stacks. If you run Oracle applications, you’re going to run the Oracle DBMS too. And if you run that, you’re likely to run other Oracle software, and perhaps use Exadata hardware as well. The cloud ==> lock-in truism is an example of this point as well.

6. There’s a lot of truth to the generality that central IT cares about overall technology architecture, while line-of-business departments just want to get the job done. This causes departments to both:

Oppose standardization.
Like thick technology stacks.

Thus, departmental influence on IT both encourages and discourages lock-in.

7. IBM is all about lock-in. IBM’s support for Linux, Eclipse and so on don’t really contradict that. IBM’s business model is to ~~squeeze~~ serve its still-large number of strongly loyal customers as well as it can.

8. Microsoft’s business model over the decades has also greatly depended on lock-in.

Indeed, it exploited Windows/Office lock-in so vigorously as to incur substantial anti-trust difficulties.
Server-side Windows tends to be involved in thick stacks — DBMS, middleware, business intelligence, SharePoint and more. Many customers (smaller enterprises or in some cases departments) are firmly locked into these stacks.
Microsoft is making a strong cloud push with Azure, which inherently involves lock-in.

Yet sometimes, Microsoft is more free and open.

Office for Macintosh allowed the Mac to be a viable Windows competitor. (And Microsoft was well-paid for that, generating comparable revenue per Mac to what it got for each Windows PC.)
Visual Studio is useful for writing apps to run against multiple DBMS.
Just recently, Microsoft SQL Server was ported to Linux.

9. SAP applications run over several different DBMS, including its own cheap MaxDB. That counteracts potential DBMS lock-in. But some of its newer apps are HANA-specific. That, of course, has the opposite effect.

10. And with that as background, we can finally get to what led me to finally write this post. Multiple clients have complaints that may be paraphrased as:

Customers are locked into expensive traditional DBMS such as Oracle.
Yet they’re so afraid of lock-in now that they don’t want to pay for our vendor-supplied versions of open source database technologies; they prefer to roll their own.
Further confusing matters, they also are happy to use cloud technologies, including the associated database technologies (e.g. . Redshift or other Amazon offerings), creating whole new stacks of lock-in.

So open source vendors of NoSQL data managers and similar technologies felt like they were the only kind of vendor suffering from fear of lock-in.

I agree with them that enterprises who feel this way are getting it wrong. Indeed:

The management of even NoSQL DBMS is a big issue, and help in that area has high cash value for customers.
Serious users need support.
Support and management tools happen to be synergistic with each other.

This is the value proposition that propelled Cloudera. It’s also a strong reason to give money to whichever of MongoDB, DataStax, Neo Technology et al. sponsors open source technology that you use.

General disclosure: My fingerprints have been on this industry strategy since before the term “NoSQL” was coined. It’s been an aspect of many different consulting relationships.

Some enterprises push back, logically or emotionally as the case may be, by observing that the best internet companies — e.g., Facebook — are allergic to paying for software, even open source. My refutations of that argument include:

Facebook has more and better engineers than you do.
Facebook has a lot more servers than you do, and would presumably face much higher prices than you would if you each chose to forgo the in-house alternative.
Facebook pays for open source software in a different way than through subscription fees — it invents and enhances it. Multiple important projects have originated at Facebook, and it contributes to many others. Are you in a position to do the same thing?

And finally — most of Facebook’s users get its service for free. (Advertisers are the ones who pay cash; all others just pay in attention to the ads.) So if getting its software for free actually does screw up its SLAs (Service Level Agreements) — well, free generally comes with poorer SLAs than paid. But if you’re in the business of serving paying customers, then you might want to have paying-customer kinds of SLAs, even on the parts of your technology — e.g. websites urging people to do business with you — that you provide for free yourself.

Related links

The technology underlying packaged applications (November, 2015, but it has a historical focus)
Topics in migration (January, 2015)
Much of the vendor advice on Strategic Messaging.

Governments vs. tech companies — it’s complicated

Curt Monash — Thu, 19 May 2016 03:42:01 +0000

Numerous tussles fit the template:

A government wants access to data contained in one or more devices (mobile/personal or server as the case may be).
The computer’s manufacturer or operator doesn’t want to provide it, for reasons including:
- That’s what customers prefer.
- That’s what other governments require.
- Being pro-liberty is the right and moral choice. (Yes, right and wrong do sometimes actually come into play. )

As a general rule, what’s best for any kind of company is — pricing and so on aside — whatever is best or most pleasing for their customers or users. This would suggest that it is in tech companies’ best interest to favor privacy, but there are two important quasi-exceptions:

Recommendation/personalization. E-commerce and related businesses rely heavily on customer analysis and tracking.
When the customer is the surveiller. Governments pay well for technology that is used to watch over their citizens.

I used the “quasi-” prefix because screwing the public is risky, especially in the long term.

Something that is not even a quasi-exception to the tech industry’s actual or potential pro-privacy bias is governmental mandates to let their users be watched. In many cases, governments compel privacy violations, by threat of severe commercial or criminal penalties. Tech companies should and often do resist these mandates as vigorously as they can, in the courts and/or via lobbying as the case may be. Yes, companies have to comply with the law. However, it’s against their interests for the law to compel privacy violations, because those make their products and services less appealing.

The most visible example of all this right now is the FBI/Apple kerfuffle. To borrow a phrase — it’s complicated. Among other aspects:

Syed Rizwan Farook, one of the San Bernardino terrorist murderers, had 3 cell phones. He carefully destroyed his 2 personal phones before his attack, but didn’t bother with his iPhone from work.
Notwithstanding this clue that the surviving phone contained nothing of interest, the FBI wanted to unlock it. It needed technical help to do so.
The FBI got a court order commanding Apple’s help. Apple refused and appealed the order.
The FBI eventually hired a third party to unlock Farook’s phone, for a price that was undisclosed but >$1.3 million.
Nothing of interest was found on the phone.
Stories popped up of the FBI asking for Apple’s help unlocking numerous other iPhones. The courts backed Apple or not depending on how they interpreted the All Writs Act. The All Writs Act was passed in the first-ever session of the US Congress, in 1789, and can reasonably be assumed to reflect all the knowledge that the Founders possessed about mobile telephony.
It’s widely assumed that the NSA could have unlocked the phones for the FBI — but it didn’t.

Russell Brandom of The Verge collected links explaining most of the points above.

With that as illustration, let’s go to some vendor examples:

Apple — which sells devices much more than advertising — has clearly decided that being (seen as) pro-privacy is its preferred course.
Microsoft — all rumors about Skype backdoors and the like notwithstanding — has made a similar choice. Notably, it is struggling to keep data hosted on its European servers out of US subpoena reach.
Amazon and Google, by way of contrast, whose core consumer businesses depend on recommendation/personalization, have not been so visible about protecting the privacy of their cloud services’ data.
Blackberry, meanwhile, seems to split the difference, being pro-privacy in its enterprise server business but acquiescing to surveillance in its consumer operations.

All of these cases seem consistent with my comments about vendors’ privacy interests above.

Bottom line: The technology industry is correct to resist government anti-privacy mandates by all means possible.

Cloudera in the cloud(s)

Curt Monash — Fri, 22 Jan 2016 07:46:34 +0000

Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.

Making Cloudera run in the cloud has three major aspects:

Cloudera’s usual software, ported to run on the cloud platform(s).
Cloudera Director, which for example launches cloud instances.
Points of integration, e.g. taking information about security-oriented roles from the platform and feeding then to the role-based security that is specific to Cloudera Enterprise.

Features new in this week’s release of Cloudera Director include:

An API for job submission.
Support for spot and preemptable instances.
High availability.
Kerberos.
Some cluster repair.
Some cluster cloning.

I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.

As for porting, let me start by noting:

Shared-nothing analytic systems, RDBMS and Hadoop alike, run much better in the cloud than they used to.
Even so, it seems that the future of Hadoop in the cloud is to rely on object storage, such as Amazon S3.

That makes sense in part because:

The applications where shared nothing most drastically outshines object storage are probably the ones in which data can just be filtered from disk — spinning-rust or solid-state as the case may be — and processed in place.
By way of contrast, if data is being redistributed a lot then the shared nothing benefit applies to a much smaller fraction of the overall workload.
The latter group of apps are probably the harder ones to optimize for.

But while it makes sense, much of what’s hardest about the ports involves the move to object storage. The status of that is roughly:

Cloudera already has a lot of its software running on Amazon S3, with Impala/Parquet in beta.
Object storage integration for Windows Azure is “in progress”.
Object storage integration for Google GCP it is “to be determined”.
Security for object storage — e.g. encryption — is a work in progress.
Cloudera Navigator for object storage is a roadmap item.

When I asked about particularly hard parts of porting to object storage, I got three specifics. Two of them sounded like challenges around having less detailed control, specifically in the area of consistency model and capacity planning. The third I frankly didn’t understand,* which was the semantics of move operations, relating to the fact that they were constant time in HDFS, but linear in size on object stores.

*It’s rarely obvious to me why something is o(1) until it is explained to me.

Naturally, we talked about competition, differentiation, adoption and all that stuff. Highlights included:

In general, Cloudera’s three big marketing messages these days can be summarized as “Fast”, “Easy”, and “Secure”.
Notwithstanding the differences as to which parts of the Cloudera stack run on premises, on Amazon AWS, on Microsoft Azure or on Google GCP, Cloudera thinks it’s important that its offering is the “same” on all platforms, which allows “hybrid” deployment.
In general, Cloudera still sees Hortonworks as a much bigger competitor than MapR or IBM.
Cloudera fondly believes that Cloudera Manager is a significant competitive advantage vs. Ambari. (This would presumably be part of the “Easy” claim.)
In particular, Cloudera asserts it has better troubleshooting/monitoring than the cloud alternatives do, because of superior drilldown into details.
Cloudera’s big competitor on the Amazon platform is Elastic MapReduce (EMR). Cloudera points out that EMR lacks various capabilities that are in the Cloudera stack. Of course, versions of these capabilities are sometimes found in other Amazon offerings, such as Redshift.
Cloudera’s big competitor on Azure is HDInsight. Cloudera sells against that via:
- General Cloudera vs. Hortonworks distinctions.
- “Hybrid”/portability.

Cloudera also offered a distinction among three types of workload:

ETL (Extract/Transform/Load) and “modeling” (by which Cloudera seems to mean predictive modeling).
- Cloudera pitches this as batch work.
- Cloudera tries to deposition competitors as being good mainly at these kinds of jobs.
- This can be reasonably said to be the original sweet spot of Hadoop and MapReduce — which fits with Cloudera’s attempt to portray competitors as technical laggards.
- Cloudera observes that these workloads tend to call for “transient” jobs. Lazier marketers might trot out the word “elasticity”.
BI (Business Intelligence) and “analytics”, by which Cloudera seems to mainly mean Impala and Spark.
“Application delivery”, by which Cloudera means operational stuff that can’t be allowed to go down. Presumably, this is a rough match to what I — and by now a lot of other folks as well — call short-request processing.

While I don’t agree with terminology that says modeling is not analytics, the basic distinction being drawn here make considerable sense.

Transitioning to the cloud(s)

Curt Monash — Mon, 07 Dec 2015 17:48:53 +0000

There’s a lot of talk these days about transitioning to the cloud, by IT customers and vendors alike. Of course, I have thoughts on the subject, some of which are below.

1. The economies of scale of not running your own data centers are real. That’s the kind of non-core activity almost all enterprises should outsource. Of course, those considerations taken alone argue equally for true cloud, co-location or SaaS (Software as a Service).

2. When the (Amazon) cloud was newer, I used to hear that certain kinds of workloads didn’t map well to the architecture Amazon had chosen. In particular, shared-nothing analytic query processing was necessarily inefficient. But I’m not hearing nearly as much about that any more.

3. Notwithstanding the foregoing, not everybody loves Amazon pricing.

4. Infrastructure vendors such as Oracle would like to also offer their infrastructure to you in the cloud. As per the above, that could work. However:

Is all your computing on Oracle’s infrastructure? Probably not.
Do you want to move the Oracle part and the non-Oracle part to different clouds? Ideally, no.
Do you like the idea of being even more locked in to Oracle than you are now? [Insert BDSM joke here.]
Will Oracle do so much better of a job hosting its own infrastructure that you use its cloud anyway? Well, that’s an interesting question.

Actually, if we replace “Oracle” by “Microsoft”, the whole idea sounds better. While Microsoft doesn’t have a proprietary server hardware story like Oracle’s, many folks are content in the Microsoft walled garden. IBM has fiercely loyal customers as well, and so may a couple of Japanese computer manufacturers.

5. Even when running stuff in the cloud is otherwise a bad idea, there’s still:

Test and dev(elopment) — usually phrased that way, although the opposite order makes more sense.
Short-term projects — the most obvious examples are in investigative analytics.
Disaster recovery.

So in many software categories, almost every vendor should have a cloud option of some kind.

6. Reasons for your data to wind up in a plurality of remote data centers include:

High availability, and similarly disaster recovery. Duh.
Second-source/avoidance of lock-in.
Geo-compliance.
Particular SaaS offerings being hosted in different places.
Use of both true cloud and co-location for different parts of your business.

7. “Mostly compatible” is by no means the same as “compatible”, and confusing the two leads to tears. Even so, “mostly compatible” has stood the IT industry in good stead multiple times. My favorite examples are:

SQL
UNIX (before LINUX).
IBM-compatible PCs (or, as Ben Rosen used to joke, Compaq-compatible).
Many cases in which vendors upgrade their own products.

I raise this point for two reasons:

I think Amazon/OpenStack could be another important example.
A vendor offering both cloud and on-premises versions of their offering, with minor incompatibilities between the two, isn’t automatically crazy.

8. SaaS vendors, in many cases, will need to deploy in many different clouds. Reasons include:

If they want customers around the world, they may need to process data in customers’ home country or region.
It could be the simplest way to meet the need of offering customers an on-premises option.

That said, there are of course significant differences between, for example:

Deploying to Amazon in multiple regions around the world.
Deploying to Amazon plus a variety of OpenStack-based cloud providers around the world, e.g. some “national champions” (perhaps subsidiaries of the main telecommunications firms).*
Deploying to Amazon, to other OpenStack-based cloud providers, and also to an OpenStack-based system that resides on customer premises (or in their co-location facility).

9. The previous point, and the last bullet of the one before that, are why I wrote in a post about enterprise app history:

There’s a huge difference between designing applications to run on one particular technology stack, vs. needing them to be portable across several. As a general rule, offering an application across several different brands of almost-compatible technology — e.g. market-leading RDBMS or (before the Linux era) proprietary UNIX boxes — commonly works out well. The application vendor just has to confine itself to relying on the intersection of the various brands’ feature sets.*

*The usual term for that is the spectacularly incorrect phrase “lowest common denominator”.

Offering the “same” apps over fundamentally different platform technologies is much harder, and I struggle to think of any cases of great success.

10. Decisions on where to process and store data are of course strongly influenced by where and how the data originates. In broadest terms:

Traditional business transaction data at large enterprises is typically managed by on-premises legacy systems. So legacy issues arise in full force.
Internet interaction data — e.g. web site clicks — typically originates in systems that are hosted remotely. (Few enterprises run their websites on premises.) It is tempting to manage and analyze that data where it originates. That said:
- You often want to enhance that data with what you know from your business records …
- … which is information that you may or may not be willing to send off-premises.
“Phone-home” IoT (Internet of Things) data, from devices at — for example — many customer locations, often makes sense to receive in the cloud. Once it’s there, why not process and analyze it there as well?
Machine-generated data that originates on your premises may never need to leave them. Even if their origins are as geographically distributed as customer devices are, there’s a good chance that you won’t need other cloud features (e.g. elastic scalability) as much as in customer-device use cases.

Related link

While the nuances of my views may change over time, I continue to think that computing platforms will almost all be appliances, clusters or clouds.

Zoomdata and the Vs

Curt Monash — Tue, 07 Jul 2015 23:23:43 +0000

Let’s start with some terminology biases:

I dislike the term “big data” but like the Vs that define it — Volume, Velocity, Variety and Variability.
Though I think it’s silly, I understand why BI innovators flee from the term “business intelligence” (they’re afraid of not sounding new).

So when my clients at Zoomdata told me that they’re in the business of providing “the fastest visual analytics for big data”, I understood their choice, but rolled my eyes anyway. And then I immediately started to check how their strategy actually plays against the “big data” Vs.

It turns out that:

Zoomdata does its processing server-side, which allows for load-balancing and scale-out. Scale-out and claims of great query speed are relevant when data is of high volume.
Zoomdata depends heavily on Spark.
Zoomdata’s UI assumes data can be a mix of historical and streaming, and that if looking at streaming data you might want to also check history. This addresses velocity.
Zoomdata assumes data can be in a variety of data stores, including:
- Relational (operational RDBMS, analytic RDBMS, or SQL-on-Hadoop).
- Files (generic HDFS — Hadoop Distributed File System or S3).*
- NoSQL (MongoDB and HBase were mentioned).
- Search (Elasticsearch was mentioned among others).
Zoomdata also tries to detect data variability.
Zoomdata is OEM/embedding-friendly.

*The HDFS/S3 aspect seems to be a major part of Zoomdata’s current story.

Core aspects of Zoomdata’s technical strategy include:

QlikView/Tableau-style navigation, at least up to a point. (I hope that vendors with a much longer track record have more nuances in their UIs.)
Suitable UI for wholly or partially “real-time” data. In particular:
- Time is an easy dimension to get along the X-axis.
- You can select current or historical regions from the same graph, aka “data rewind”.
Federated query with some predicate pushdown, aka “data fusion”.
- Data filtering and some GroupBys are pushed down to the underlying data stores — SQL or NoSQL — when it makes sense.*
- Pushing down joins (assuming that both sides of the join are from the same data store) is a roadmap item.
Approximate query results, aka “data sharpening”. Zoomdata simulates high-speed query by first serving you approximate query results, ala Datameer.
Spark to finish up queries. Anything that isn’t pushed down to the underlying data store is probably happening in Spark DataFrames.
Spark for other kinds of calculations.

*Apparently it doesn’t make sense in some major operational/general-purpose — as opposed to analytic — RDBMS. From those systems, Zoomdata may actually extract and pre-cube data.

The technology story for “data sharpening” starts:

Zoomdata more-or-less samples the underlying data, and returns a result just for the sample. Since this is a small query, it resolves quickly.
More precisely, there’s a sequence of approximations, with results based on ever larger samples, until eventually the whole query is answered.
Zoomdata has a couple of roadmap items for making these approximations more accurate:
- The integration of BlinkDB with Spark will hopefully result in actual error bars for the approximations.
- Zoomdata is working itself on how to avoid sample skew.

The point of data sharpening, besides simply giving immediate gratification, is that hopefully the results for even a small sample will be enough for the user to determine:

Where in particular she wants to drill down.
Whether she asked the right query in the first place.

I like this early drilldown story for a couple of reasons:

I think it matches the way a lot of people work. First you get to the query of the right general structure; then you refine the parameters.
It’s good for exact-results performance too. Most of what otherwise might have been a long-running query may not need to happen at all.

Aka “Honey, I shrunk the query!”

Zoomdata’s query execution strategy depends heavily on doing lots of “micro-queries” and unioning their result sets. In particular:

Data sharpening relies on a bunch of data-subset queries of increasing size.
Streaming/”real-time” BI is built from a bunch of sub-queries restricted to small time slices each.

Even for not-so-micro queries, Zoomdata may find itself doing a lot of unioning, as data from different time periods may be in different stores.

Architectural choices in support of all this include:

Zoomdata ships with Spark, but can and probably in most cases should be pointed at an external Spark cluster instead. One point is that Zoomdata itself scales by user count, while the Spark cluster scales by data volume.
Zoomdata uses MongoDB off to the side as a metadata store. Except for what’s in that store, Zoomdata seems to be able to load balance rather statelessly. And Zoomdata doesn’t think that the MongoDB store is a bottleneck either.
Zoomdata uses Docker.
Zoomdata is starting to use Mesos.

When a young company has good ideas, it’s natural to wonder how established or mature this all is. Well:

Zoomdata has 86 employees.
Zoomdata has (production) customers, success stories, and so on, but can’t yet talk fluently about many production use cases.
If we recall that companies don’t always get to do (all) their own positioning, it’s fair to say that Zoomdata started out as “Cloudera’s cheap-option BI buddy”, but I don’t think that’s an accurate characterization as this point.
Zoomdata, like almost all young companies in the history of BI, favors a “land-and-expand” adoption strategy. Indeed …
… Zoomdata tells prospects it wants to be an additional BI provider to them, rather than rip-and-replacement.

As for technological maturity:

Zoomdata’s view of data seems essentially tabular, notwithstanding its facility with streams and NoSQL. It doesn’t seem to have tackled much in the way of event series analytics yet.
One of Zoomdata’s success stories is iPad-centric. (Salesperson visits prospect and shows her an informative chart; prospect opens wallet; ka-ching.) So I presume mobile BI is working.
Zoomdata is comfortable handling 10s of millions of rows of data, may be strained when handling 100s of millions of rows, and has been tested in-house up to 1 billion rows. But that’s data that lands in Spark. The underlying data being filtered can be much larger, and Zoomdata indeed cites one example of a >40 TB Impala database.
When I asked about concurrency, Zoomdata told me of in-house testing, not actual production users.
Zoomdata’s list when asked what they don’t do (except through partners, of which they have a bunch) was:
- Data wrangling.
- ETL (Extract/Transform/Load).
- Data transformation. (In a market segment with a lot of Hadoop and Spark, that’s not really redundant with the previous bullet point.)
- Data cataloguing, ala Alation or Tamr.
- Machine learning.

Related link

I wrote about multiple kinds of approximate query result capabilities, Zoomdata-like or otherwise, back in July, 2012.