Analytic technologies – DBMS 2 : DataBase Management System Services

Brittleness, Murphy’s Law, and single-impetus failures

Curt Monash — Wed, 20 Jun 2018 09:15:44 +0000

In my initial post on brittleness I suggested that a typical process is:

Build something brittle.
Strengthen it over time.

In many engineering scenarios, a fuller description could be:

Design something that works in the base cases.
Anticipate edge cases and sources of error, and design for them too.
Implement the design.
Discover which edge cases and error sources you failed to consider.
Improve your product to handle them too.
Repeat as needed.

So it’s necessary to understand what is or isn’t likely to go wrong. Unfortunately, that need isn’t always met.

Murphy’s Law and exaggerated fears

We should always bear in mind Murphy’s Law, which in its simplest form states: Anything that can go wrong, will. But also remember that Murphy’s Law is a joke; and even if it were serious, nothing concise is ever precise.

People who tend to over-believe in Murphy’s Law include but are hardly limited to:

Bureaucrats.
Worried parents, especially of only children. (Later kids tend to have it easier, as their parents have more experience.)
Any buyer or voter you believe has been over-persuaded toward fear, uncertainty and doubt.
Relational bigots who view the Ted Codd guarantee as an absolute requirement for data management.

Adversaries

The strongest scenarios for Murphy’s Law should be adversarial ones, in which somebody is actively trying to cause problems. But even there it doesn’t always apply. For example:

Information security commonly fits the Murphy model. Hackers keep outwitting defenders.
Email spam, however, does not. It’s pretty much a solved problem; the few spam emails that still get through hardly matter.
Web search is somewhere in between. Both sides are partially successful in the combat over adversarial information retrieval, as “good” and “bad” sites alike are both well-represented in search results.

Single-impetus failures

Since bad or scary things will happen — Murphy’s Law isn’t entirely wrong — a standard design practice is to avoid single points of failure. Brittleness has a lot to do with which single points of failure have been overlooked; improvement has a lot to do with belatedly cleaning them up. In adversarial scenarios, avoiding single points of failure relates closely to defense in depth.

Some of the nastiest surprises occur when failures have no obvious single point, yet wind up being possible from a single impetus.* This happens when multiple points or moments of failure are somehow correlated, or when they actually cascade. Examples vary widely, including:

The collapse of the World Trade Center buildings.
An authoritarian leader who manages to destroy a whole democratic system of government.

IT examples that are relatively big deals include:

Security breaches in which an attacker becomes able to fully impersonate a well-credentialed user.
Power outages or other whole-building breakdowns that bring down all parts of a (locally) redundant cluster.
Software bugs that bring down all parts of a supposedly redundant system at once.
Analytic failures that stem from misleading data sets. (Garbage in, garbage out.)

*I chose the phrase “single impetus” rather than “single cause” because NOTHING has a truly single cause; things only can happen when all kinds of conditions are satisfied for them to succeed. But there can indeed be an identifiable force, plan or occurrence that sets a chain of events in motion, and that’s what I’m calling the “impetus”.
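To make the arithmetic behind correlated failure concrete, here is a minimal sketch. It assumes each redundant part fails independently with some probability, and that a separate single impetus (say, a shared power feed) can take out every part at once. The function name and the numbers are illustrative assumptions, not measurements.

```python
def p_all_fail(p_node: float, n: int, p_common: float = 0.0) -> float:
    """Probability that all n redundant parts are down at once.

    p_node   -- chance that any one part fails on its own (assumed independent)
    p_common -- chance of a single impetus that takes out every part together
    """
    independent = p_node ** n
    # Either the common impetus fires, or it does not and every part
    # happens to fail independently anyway.
    return p_common + (1.0 - p_common) * independent


if __name__ == "__main__":
    # Illustrative numbers only.
    print(p_all_fail(0.01, 3))         # three truly independent replicas
    print(p_all_fail(0.01, 3, 0.001))  # same replicas behind one shared power feed
```

With these made-up inputs, the shared impetus dominates the result: adding more replicas shrinks the independent term but does nothing about the common one.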

Related link

A lot of analytics turns out to be adversarial.

Brittleness and incremental improvement

Curt Monash — Wed, 20 Jun 2018 09:13:56 +0000

Every system — computer or otherwise — needs to deal with possibilities of damage or error. If it does this well, it may be regarded as “robust”, “mature(d)”, “strengthened”, or simply “improved”.* Otherwise, it can reasonably be called “brittle”.

*It’s also common to use the word “harden(ed)”. But I think that’s a poor choice, as brittle things are often also hard.

0. As a general rule in IT:

New technologies and products are brittle.
They are strengthened incrementally over time.

There are many categories of IT strengthening. Two of the broadest are:

Bug-fixing.
Bottleneck Whack-A-Mole.

1. One of my more popular posts stated:

Developing a good DBMS requires 5-7 years and tens of millions of dollars.

The reasons I gave all spoke to brittleness/strengthening, most obviously in:

Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.

Similar things are true for other kinds of “platform software” or distributed systems.

2. The UI brittleness/improvement story starts similarly:

Graphical user interfaces can present users’ choices clearly, making them great antidotes to users’ initial lack of system knowledge or training.
Usability testing and engineering can lead to improvements and the removal of glitches.

Unfortunately, however, as systems add or change features, UI navigation can get more difficult over time rather than easier.

In at least one scenario — plane crashes due to confused-pilot error — the consequences can be literally fatal.

3. Sometimes brittleness just doesn’t get solved.

Security is perhaps the most visible example. Almost every security system can be broken, and bad actors actively do so.
Another example was 1980s-90s CASE (Computer-Aided Software Engineering), specifically in the area of generating code from specifications. The technology was only able to generate apps that performed a limited set of functions — too limited to be useful outside of certain niches — and never successfully evolved further.

4. Large organizations are riddled with screw-ups. One of the most successful large enterprises in world history was the US military of World War 2 — and that is literally the organization where the word snafu was coined.

The response is often bureaucracy. Somebody makes a mistake; procedures and rules are then instituted to ensure the mistake is never repeated. Over time, many rules and procedures build up, until organizational systems are hardened. Business processes wind up taking many steps, each of which represents both a cost and a potential for failure. And sometimes the only decisions that successfully get through the process are uninspired, uncreative or flat-out wrong.

This is the classic example of “hardening” — commonly expressed via its rough synonym “calcification” — adding even more brittleness than it removes.

Outward-facing/regulatory bureaucracies can be even worse.

Regulators generally have two constituencies — consumers/general public and businesses — with the benefits of regulation going to one and the costs going to the other. Anything regulators do will likely displease at least one constituency.
These ever-unsatisfactory regulations can often only be changed through a long administrative or legislative process.
Violating regulations has unpredictable but sometimes severe consequences.

The whole thing is a colossal mess.

5. Many of the previous points apply to enterprise applications, which facilitate business processes, have UIs, and commonly involve platform-like technology as well.

Changing enterprise apps can take both discontinuous and incremental forms; you can either replace your old apps outright or change something in (or in the implementation of) the ones you have.

If you rip-and-replace your apps, you’re likely to also do so to your business processes, and vice-versa. Discontinuous business process change is often seen as a great virtue, sometimes under the buzzname business process reengineering (BPR).
If you want to change your processes more incrementally, you likely need one or both of two things:
- App software with more features than you initially need. That may be easy to get, but it isn’t cheap.
- A nimble IT department. That one is neither cheap nor easy.

6. My biggest reason for writing about brittleness and improvement is to approach some topics around analytics and AI. As previously noted:

Artificial intelligence is facing public skepticism both for being too accurate (!) and not accurate enough.
Analytics in general is often surprisingly inaccurate.

Please stay tuned.

Related link(s)

Most of what I wrote in my December, 2015 series about artificial intelligence still holds true.

Some stuff that’s always on my mind

Curt Monash — Sun, 20 May 2018 19:27:21 +0000

I have a LOT of partially-written blog posts, but am struggling to get any of them finished (obviously). Much of the problem is that they have so many dependencies on each other. Clearly, then, I should consider refactoring my writing plans.

So let’s start with this. Here, in no particular order, is a list of some things that I’ve said in the past, and which I still think are or should be of interest today. It’s meant to be background for numerous posts I write in the near future, and indeed a few hooks for such posts are included below.

1. Data(base) management technology is progressing pretty much as I expected.

Vendors generally recognize that maturing a data store is an important, many-years-long process.
Multiple kinds of data model are viable …
… but it’s usually helpful to be able to do some kind of JOIN.
To deal with the variety of hardware/network/storage arrangements out there, layering/tiering is on the rise. (An amazing number of vendors each seem to think they invented the idea.)

2. Rightly or wrongly, enterprises are often quite sloppy about analytic accuracy.

My two central examples have long been inaccurate metrics and false-positive alerts.
In predictive analytics, it’s straightforward to quantify how much additional value you’re leaving on the table with your imperfect accuracy.
Enterprise search and other text technologies are still often terrible.
After years of “real-time” overhype, organizations have seemingly swung to under-valuing real-time analytics.

3. Outside traditional enterprises, the accuracy problem can be even worse, and the consequences of analytic inaccuracy can be severe. In some cases this is well understood; autonomous vehicle researchers, for example, seem properly attentive to the challenge of not-killing-pedestrians. But in others it’s a mess. For example, I don’t think the “fake news on social media” challenge will be resolved without new technical approaches that, to my knowledge, aren’t yet even being tried.

4. More generally, I’ve long argued that the technology industry would someday have to deal with a variety of public policy and social concerns. That day has come. In anticipation, I wrote at length about privacy/surveillance, and a little about some other areas, including net neutrality, patents, economic development, and public technology spending. Missing subjects include censorship (private and public alike), and perhaps also the efforts to tie data ownership into anti-trust policy.

5. Given all the tech-specific public policy work that’s needed, I’m pulling back from some of my broader political efforts. However, I stand by my overview opinions of last February, and I delivered on some of their IOUs in a two-part series on persuasion.

6. The ongoing rise of “edge computing” and the “Internet of Things” fit into the general trend that in 2013 I summarized as appliances, clusters and clouds.

7. I continue to think that a huge fraction of analytics is properly characterized as monitoring. That ties into a number of areas of interest. For example:

Platform technologies — including distributed data management — often compete on the maturity of their built-in monitoring.
My complaints about BI inaccuracy commonly relate to use cases in monitoring.
Privacy/surveillance issues are commonly about monitoring. It’s common to worry that such monitoring is actually too accurate.
But I also worry that privacy/surveillance monitoring isn’t accurate enough … and hence that it leads to people being discriminated against who absolutely don’t need to be.
Edge computing involves a lot of devices that need to be monitored.
Censorship obviously has a lot to do with monitoring.

8. And finally for now, my core precepts for strategic messaging haven’t changed.

Related link

As you may have already guessed, the title of this post is based on a classic song.

Notes on artificial intelligence, December 2017

Curt Monash — Tue, 12 Dec 2017 18:53:16 +0000

Most of my comments about artificial intelligence in December, 2015 still hold true. But there are a few points I’d like to add, reiterate or amplify.

1. As I wrote back then in a post about the connection between machine learning and the rest of AI,

It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response.

2. Accordingly, it can be reasonable to equate machine learning and AI.

AI based on machine learning frequently works, on more than a toy level. (Examples: Various projects by Google)
AI based on knowledge representation usually doesn’t. (Examples: IBM Watson, 1980s expert systems)
“AI” can be the sexier marketing or fund-raising term.

3. Similarly, it can be reasonable to equate AI and pattern recognition. Glitzy applications of AI include:

Understanding or translation of language (written or spoken as the case may be).
Machine vision or autonomous vehicles.
Facial recognition.
Disease diagnosis via radiology interpretation.

4. The importance of AI and of recent AI advances differs greatly according to application or data category.

Machine learning and AI have little relevance to most traditional transactional apps.
Predictive modeling is a huge deal in customer-relationship apps. The most advanced organizations developing and using those rely on machine learning. I don’t see an important distinction between machine learning and “artificial intelligence” in this area.
Voice interaction is already revolutionary in certain niches (e.g. smartphones — Siri et al.). The same will likely hold for other natural language or virtual/augmented reality interfaces if and when they go more mainstream. AI seems likely to make a huge impact on user interfaces.
AI also seems likely to have huge impact upon the understanding and reduction of machine-generated data.

5. Right now it seems as if large companies are the runaway leaders in AI commercialization. There are several reasons to think that could last.

They have deep pockets. Yes, but the same is true in any other area of technology. Small companies commonly out-innovate large ones even so.
They have access to lots of data for model training. I find this argument persuasive in some specific areas, most notably any kind of language recognition that can be informed by search engine uses.
AI technology is sometimes part of a much larger whole. That argument is not obviously persuasive. After all, software can often be developed by one company and included as a module in somebody else’s systems. Machine vision has worked that way for decades.

I’m sure there are many niches in which decision-making, decision implementation and feedback are so tightly integrated that they all need to be developed by the same organization. But every example that remotely comes to mind is indeed the kind of niche that smaller companies are commonly able to address.

6. China and Russia are both vowing to lead the world in artificial intelligence. From a privacy/surveillance standpoint, this is worrisome. China also has a reasonable path to doing so (Russia not so much), in line with the “Lots of data makes models strong” line of argument.

The fiasco of Japan’s 1980s “Fifth-Generation Computing” initiative is only partly reassuring.

7. It seems that “deep learning” and GPUs fit well for AI/machine learning uses. I see no natural barriers to that trend, assuming it holds up on its own merits.

Since silicon clock speeds stopped increasing, chip power improvements have mainly taken the form of increased on-chip parallelism.
The general move to the cloud is also not a barrier. I have little doubt major cloud providers could do a good job of providing GPU-based capacity, given that:
They build their own computer systems.
They showed similar flexibility when they adopted flash storage.
Several of them are AI research leaders themselves.

Maybe CPU vendors will co-opt GPU functionality. Maybe not. I haven’t looked into that issue. But either way, it should be OK to adopt software that calls for GPU-style parallel computation.

8. Computer chess is in the news, so of course I have to comment. The core claim is something like:

Google’s AlphaZero technology was trained for four hours playing against itself, with no human heuristic input.
It then decisively beat Stockfish, previously the strongest computer chess program in the world.

My thoughts on that start:

AlphaZero actually beat a very crippled version of Stockfish.
That’s still impressive.
Google only released a small fraction of the games. But in the ones it did release, about half had a common theme — AlphaZero seemed to place great value on what chess analysts call “space”.
This all fits my view that recent splashy AI accomplishments are focused on pattern recognition.

Imanis Data

Curt Monash — Tue, 22 Aug 2017 12:46:20 +0000

I talked recently with the folks at Imanis Data. For starters:

The point of Imanis is to make copies of your databases, for purposes such as backup/restore, test/analysis, or compliance-driven archiving. (That’s in declining order of current customer activity.) Another use is migration via restoring to a different cluster than the one that created the data in the first place.
The data can come from NoSQL database managers, from Hadoop, or from Vertica. (Again, that’s in declining order.)
As you might imagine, Imanis makes incremental backups; the only full backup is the first one you do for that database.
“Imanis” is a new name; the previous name was “Talena”.

Also:

Imanis has ~35 subscription customers, a significant majority of which are in the Fortune 1000.
Customer industries, in roughly declining order, include:
- Financial services other than insurance.
- Insurance.
- Retail.
- “Technology”.
~40% of Imanis customers are in the public cloud.
Imanis is focused on the North American market at this time.
Imanis has ~45 employees.
The Imanis product just hit Version 3.

Imanis correctly observes that there are multiple reasons you might want to recover from backup, including:

General disaster/system failure.
Bug in an application that writes data.
Malicious acts, including encryption-by-ransomware.

Imanis uses the phrase “point-in-time backup” to emphasize its flexibility in letting you choose your favorite time-version of your rolling backup.

Imanis also correctly draws the inference that the right backup strategy is some version of:

Make backups very frequently. This boils down to “Do a great job of making incremental backups (and restoring from them when necessary)”. This is where Imanis has spent the bulk of its technical effort to date.
In case recovery is needed, identify the last clean (or provably/confidently clean) version of the database and restore from that. The identification part boils down to letting the backup databases be queried directly. That’s largely a roadmap item.
- Imanis has recently added the capability to build its own functionality querying the backup database.
- JDBC/whatever general access is still in the future.

Note: When Imanis backups offer direct query access, the possibility will of course exist to use the backup data for general query processing. But while that kind of capability sounds great in theory, I’m not aware of it being a big deal (on technology stacks that already offer it) in practice.

The most technically notable other use cases Imanis mentioned are probably:

Data science dataset generation. Imanis lets you generate a partial copy of the database for analytic or test purposes.
- You can project, select or sample your data, which suggests use of the current query capabilities.
- There’s an API to let you mask Personally Identifiable Information by writing your own data transformations.
Archiving/tiering/ILM (Information Lifecycle Management). Imanis lets you divide data according to its hotness.
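As a rough illustration of the dataset-generation idea above, here is a minimal sketch that selects, samples, projects and masks rows in one pass. This is not Imanis’ API; `make_subset`, `mask_hash` and the column names are hypothetical, and a short unsalted digest is only an illustrative masking rule, not a recommendation for real PII.

```python
import hashlib
import random
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def mask_hash(value: object) -> str:
    """Replace a value with a short one-way digest (illustrative masking rule)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def make_subset(
    rows: Iterable[Row],
    columns: List[str],
    predicate: Callable[[Row], bool],
    sample_rate: float,
    maskers: Dict[str, Callable[[object], object]],
    seed: int = 0,
) -> List[Row]:
    """Build a partial copy: select rows, sample them, project columns, mask PII."""
    rng = random.Random(seed)  # seeded so the same subset can be regenerated
    out: List[Row] = []
    for row in rows:
        if not predicate(row):           # select
            continue
        if rng.random() >= sample_rate:  # sample
            continue
        projected = {c: row[c] for c in columns}  # project
        for col, fn in maskers.items():           # mask
            if col in projected:
                projected[col] = fn(projected[col])
        out.append(projected)
    return out
```

A caller might pass `maskers={"email": mask_hash}` to keep a join key that is stable across rows without exposing the underlying address.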

Imanis views its competition as:

Native utilities of the data stores.
Hand-coded scripts.
Datos.io, principally in the Cassandra market (so far).

Beyond those, the obvious comparison to Imanis is Delphix. I haven’t spoken with Delphix for a few years, but I believe that key differences between Delphix and Imanis start:

Delphix is focused on widely-installed RDBMS such as Oracle.
Delphix actually tries to have different production logical copies of your database run off of the same physical copy. Imanis, in contrast, offers technology to help you copy your databases quickly and effectively, but the copies you actually use will indeed be separate from each other.

Imanis software runs on its own cluster, based on hacked Hadoop. A lot of the hacking seems to relate to a metadata store, which supports things like:

Understanding which (incrementally backed up) blocks need to be pulled together to make a specific copy of the database.
Putting data in different places for ILM/tiering.
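To show what “pulling blocks together” might look like, here is a minimal sketch of a point-in-time restore over one full backup plus incrementals. The layout (a timestamp plus a map of changed blocks, with `None` marking a deleted block) is my own simplifying assumption, not a description of the Imanis metadata store.

```python
from typing import Dict, List, Optional, Tuple

# Assumed layout: each backup is (timestamp, {block_id: contents}).
# The earliest entry is the full backup; later entries hold only the blocks
# that changed. A value of None marks a block deleted since the prior backup.
Backup = Tuple[int, Dict[str, Optional[bytes]]]


def restore_as_of(backups: List[Backup], as_of: int) -> Dict[str, bytes]:
    """Rebuild the block map as it stood at time `as_of`."""
    state: Dict[str, bytes] = {}
    for ts, changed in sorted(backups, key=lambda b: b[0]):
        if ts > as_of:
            break  # later increments are not part of this point-in-time copy
        for block_id, contents in changed.items():
            if contents is None:
                state.pop(block_id, None)
            else:
                state[block_id] = contents
    return state
```

A real system would keep an index rather than replaying every increment, which is presumably part of what a metadata store is for.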

Another piece of Imanis tech is machine-learning-based anomaly detection.

As incrementally backed-up blocks arrive, Imanis flags anomalous ones and states a reason for each flag.
You can denounce the flag as a false alert, and hopefully similar flags won’t be raised in the future.

The technology for this seems rather basic:

Random forests for the flagging.
No drilldown w/in the Imanis system for follow-up.

But in general concept this is something a lot more systems should be doing.

Most of the rest of Imanis’ tech story is straightforward — support various alternatives for computing platforms, offer the usual security choices, etc. One exception that was new to me was the use of erasure codes, which seem to be a generalization of the concept of parity bits. Allegedly, when used in a storage context these have the near-magical property of offering 4X replication safety with only a 1.5X expansion of data volume. I won’t claim to have understood the subject well enough to see how that could make sense, or what tradeoffs it would entail.
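One way those numbers could work out is sketched below. It assumes a Reed-Solomon-style code that splits data into k fragments and adds m parity fragments, such that any k of the k+m fragments are enough to rebuild the original. The 6+3 split is my assumption for illustration; I don’t know what parameters Imanis actually uses.

```python
from typing import Tuple


def replication(copies: int) -> Tuple[float, int]:
    """Return (storage expansion, losses tolerated) for plain replication."""
    return float(copies), copies - 1


def erasure_code(k: int, m: int) -> Tuple[float, int]:
    """Return (storage expansion, losses tolerated) for a k+m code.

    Assumes any k of the k+m fragments suffice to rebuild the data,
    so up to m fragments can be lost.
    """
    return (k + m) / k, m


if __name__ == "__main__":
    print(replication(4))      # four full copies
    print(erasure_code(6, 3))  # six data fragments plus three parity fragments
```

Under that assumption, both schemes survive the loss of any three pieces, but the coded layout stores 9 fragments for every 6 of data. The tradeoffs are the usual ones for such codes: a rebuild has to read k surviving fragments and do some computation, and the fragments need to sit on enough separate devices for the loss tolerance to mean anything.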

Notes on data security

Curt Monash — Thu, 10 Aug 2017 09:15:50 +0000

1. In June I wrote about burgeoning interest in data security. I’d now like to add:

Even more than I previously thought, demand seems to be driven largely by issues of regulatory compliance.
In an exception to that general rule, many enterprises have vague mandates for data encryption.
In awkward contradiction to that general rule, there’s a general sense that it’s just security’s “turn” to be a differentiating feature, since various other “enterprise” needs are already being well-addressed.

We can reconcile these anecdata pretty well if we postulate that:

Enterprises generally agree that data security is an important need.
Exactly how they meet this need depends upon what regulators choose to require.

2. My current impressions of the legal privacy vs. surveillance tradeoffs are basically:

The freer non-English-speaking countries are more concerned about ensuring data privacy. In particular, the European Union’s upcoming GDPR (General Data Protection Regulation) seems like a massive addition to the compliance challenge.
The “Five Eyes” (US, UK, Canada, Australia, New Zealand) are more concerned about maintaining the efficacy of surveillance.
Authoritarian countries, of course, emphasize surveillance as well.

3. Multiple people have told me that security concerns include (data) lineage and (data) governance as well. I’m fairly OK with that conflation.

By citing “lineage” I think they’re referring to the point that if you don’t know where data came from, you don’t know if it’s trustworthy. This fits well with standard uses of the “data lineage” term.
By “data governance” they seem to mean policies and procedures to limit the chance of unauthorized or uncontrolled data change, or technology to support those policies. Calling that “data governance” is a bit of a stretch, but it’s not so ridiculous that we need to make a big fuss about it.

In other words: If your data transformation pipelines aren’t locked down, then your data isn’t locked down either.

4. But how seriously does that last point need to be taken? For starters, the possibility of erroneous calculations:

Is a strong threat to analytic accuracy, as has been recognized at least for the decades that “one version of the truth” has been a catchphrase.
Has some regulatory risk, e.g. in the United States around Sarbanes-Oxley.
Is not as big a deal for the core security threat of data theft/exfiltration.

Further, it’s not too hard architecturally to have a divide between:

Data transformation for operational use cases, which may need to be locked down.
Data transformation for purely investigative analytics, which can be very fluid, for transformation technologies such as Hadoop, Spark and Excel alike.

Bottom line: Data transformation security is an accessible must-have in some use cases, but an impractical nice-to-have in others.

Analytics on the edge?

Curt Monash — Fri, 30 Jun 2017 08:27:18 +0000

There’s a theory going around to the effect that:

Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.

There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.

1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.

2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.

3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration.

There are any number of situations in which decisions are made on or about remote systems, based on models or rules that should be improved over time. For example, such decisions might be made in:

Machine vision or other “recognition”-oriented areas of AI.
Detection or prediction of malfunctions.
Choices as to what data is significant enough to ship back upstream.

In the canonical case, we might envision a system in which:

Huge amounts of data are collected and are used to make real-time decisions.
The models are trained centrally, and updated remotely over time as they are improved.
The remote systems can only ship back selected or aggregated data to help train the models.

This all seems like an awkward fit for any common computing architecture I can think of.

But it’s hard to pin down important examples of that “canonical” case. The story implicitly assumes:

A model is widely deployed.
The model does a decent job but not a perfect one.
Based on its successes and failures, the model gets improved.

And now we’re begging a huge question: What exactly is there that keeps score as to when the model succeeds and fails? Mathematically speaking, I can’t imagine what a general answer would be like.

4. So when it comes to predictive models executed on real-world appliances I think that analytic workflows will:

Differ for different (categories of) applications.
Rely in most cases on simple patterns of data movement, such as:
- Stream everything to central servers and sort it out there, or if that’s not workable …
- … instrument a limited number of test nodes to store everything, and recover the data in batch for analysis.
- Update models only in timeframes when you’re doing a full app update/refresh.

And with that, much of the apparent need for fancy distributed analytic architectures evaporates.

5. Finally, and notwithstanding the previous point: Across many use cases, there’s some kind of remote log data being shipped back to a central location. It may be the complete log. It may be periodic aggregates. It may happen only for what the edge nodes regard as significant events. But something is getting shipped home.
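The three shipping choices just mentioned can be sketched as three small functions over the same edge log. The record shape, window size and threshold are all assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, List

Reading = Dict[str, float]  # assumed shape: {"t": seconds, "value": measurement}


def ship_everything(log: List[Reading]) -> List[Reading]:
    """Ship the complete log."""
    return list(log)


def ship_aggregates(log: List[Reading], window: float) -> List[Dict[str, float]]:
    """Ship one summary row per time window of `window` seconds."""
    buckets: Dict[int, List[float]] = {}
    for r in log:
        buckets.setdefault(int(r["t"] // window), []).append(r["value"])
    return [
        {"window_start": b * window, "count": len(v), "mean": mean(v), "max": max(v)}
        for b, v in sorted(buckets.items())
    ]


def ship_significant(log: List[Reading], threshold: float) -> List[Reading]:
    """Ship only readings the edge node regards as significant."""
    return [r for r in log if r["value"] > threshold]
```

The hard part in practice is not these functions but deciding, per application, which one the central side can live with.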

The architectures for shipping, receiving and analyzing such data are in many cases immature. That’s obvious if there’s any kind of streaming involved, or if analysis is done in Spark. Ditto if there’s anything we might call “non-tabular business intelligence”. As this stuff matures, it will in many cases fit very well with today’s cloud thinking. But in any case — it needs to mature.

Truth be told, even the relational case is immature, in that it can easily rely on what I called:

data warehouses (perhaps really data marts) that are updated in human real-time

That quote is from a recent post about Kudu, which:

Is designed for exactly that use case.
Went GA early this year.

As always, technology is in flux.

Related links

Interana is another example of very new technology that seems applicable to these use cases.
My 2013 post on the future of IT architectures still rings true.

Generally available Kudu

Curt Monash — Fri, 16 Jun 2017 15:52:45 +0000

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:

Security is an ever bigger deal.
There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliances (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general purpose DBMS such as Oracle or SQL Server.
Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.

Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:

A data storage system introduced by Cloudera (and subsequently open-sourced).
Columnar.
Updatable in human real-time.
Meant to serve as the data storage tier for Impala and Spark.

Kudu’s adoption and roll-out story starts:

Kudu went to general availability on January 31. I gather this spawned an uptick in trial activity.
A subsequent release with some basic security features spawned another uptick.
I don’t think Cloudera will mind my saying that there are many hundreds of active Kudu clusters.
But Cloudera believes that, this soon after GA, very few Kudu users are in actual production.

Early Kudu interest is focused on 2-3 kinds of use case. The biggest is the kind of “data warehousing” highlighted above. Cloudera characterizes the others by the kinds of data stored, specifically the overlapping categories of time series — including financial trading — and machine-generated data. A lot of early Kudu use is with Spark, even ahead of (or in conjunction with) Impala. A small amount has no relational front-end at all.

Other notes on Kudu include:

Solid-state storage is recommended, with a few terabytes per node.
You can also use spinning disk. If you do, your write-ahead logs can still go to flash.
Cloudera said Kudu compression ratios can be as low as 2-5X, or as high as 10-20X. With that broad a range, I didn’t drill down into specifics of what they meant.
There seem to be a number of Kudu clusters with 50+ nodes each. By way of contrast, a “typical” Cloudera customer has 100s of nodes overall.
As you might imagine from their newness, Kudu security features — Kerberos-based — are at the database level rather than anything more granular.

And finally, the Cloudera folks woke me up to some issues around streaming data ingest. If you stream data in, there will be retries resulting in duplicate delivery. So your system needs to deal with those one way or another. Kudu’s way is:

Primary keys will be unique. (Note: This is not obvious in a system that isn’t an entire RDBMS in itself.)
You can configure the uniqueness to be guaranteed either through an upsert mechanism or simply by rejecting duplicates.
Alternatively, you can write code to handle duplication errors, e.g. via Spark.
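
The difference between those options can be illustrated with a toy in-memory table. This is a minimal sketch and not Kudu's client API: the KeyedTable class, its insert and upsert methods, and the ingest helper are hypothetical names invented here to show why a retried delivery is harmless once primary keys are unique.

```python
class DuplicateKeyError(Exception):
    """Raised when an insert reuses an existing primary key."""


class KeyedTable:
    """Toy table keyed by primary key (hypothetical, not the Kudu API)."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, row):
        # Reject-duplicates mode: a second delivery of the same key fails.
        if key in self.rows:
            raise DuplicateKeyError(key)
        self.rows[key] = row

    def upsert(self, key, row):
        # Upsert mode: a second delivery overwrites the first,
        # so retries are idempotent.
        self.rows[key] = row


def ingest(table, events, mode="upsert"):
    """Apply a stream of (key, row) events; return the number of duplicates skipped."""
    skipped = 0
    for key, row in events:
        if mode == "upsert":
            table.upsert(key, row)
        else:
            try:
                table.insert(key, row)
            except DuplicateKeyError:
                skipped += 1  # application-level handling of the duplicate
    return skipped


# A stream in which the event with key 2 is delivered twice because of a retry.
stream = [(1, {"v": "a"}), (2, {"v": "b"}), (2, {"v": "b"}), (3, {"v": "c"})]

upsert_table = KeyedTable()
ingest(upsert_table, stream, mode="upsert")

reject_table = KeyedTable()
skipped = ingest(reject_table, stream, mode="reject")
```

Either way the table ends up with one row per key. The choice is only about whether the duplicate is silently absorbed (upsert) or surfaced so that the calling code can count or log it (reject).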

The data security mess

Curt Monash — Wed, 14 Jun 2017 13:21:11 +0000

A large fraction of my briefings this year have included a focus on data security. This is the first year in the past 35 that that’s been true.* I believe that reasons for this trend include:

Security is an important aspect of being “enterprise-grade”. Other important checkboxes have been largely filled in. Now it’s security’s turn.
A major platform shift, namely to the cloud, is underway or at least being planned for. Security is an important thing to think about as that happens.
The cloud even aside, technology trends have created new ways to lose data, which security technology needs to address.
Traditionally paranoid industries are still paranoid.
Other industries are newly (and rightfully) terrified of exposing customer data.
My clients at Cloudera thought they had a chance to get significant messaging leverage from emphasizing security. So far, it seems that they were correct.

*Not really an exception: I did once make it a project to learn about classic network security, including firewall appliances and so on.

Certain security requirements, desires or features keep coming up. These include (and as in many of my lists, these overlap):

Easy, comprehensive access control. More on this below.
Encryption. If other forms of security were perfect, encryption would never be needed. But they’re not.
Auditing. Ideally, auditing can alert you to trouble before (much) damage is done. If not, then it can at least help you do proactive damage control in the face of a breach.
Whatever regulators mandate.
Whatever is generally regarded as best practices. Security “best practices” generally keep enterprises out of legal and regulatory trouble, or at least minimize same. They also keep employees out of legal and career trouble, or minimize same. Hopefully, they even keep data safe.
Whatever the government is known to use. This is a common proxy for “best practices”.

More specific or extreme requirements include:

Security certifications.
Ways for enterprises to always hold their own encryption keys, even for cloud data.
Value/label-based security.
Isolation of audit logs onto separate (and separately-protected) systems.
Keeping data out of SaaS vendors’ control altogether.

I don’t know how widely these latter kinds of requirements will spread.

The most confusing part of all this may be access control.

Security has a concept called AAA, standing for Authentication, Authorization and Accounting/Auditing/Other things that start with “A”. Yes, even the core acronym in this area is ill-defined.
The new standard for authentication is Kerberos. Or maybe it’s SAML (Security Assertion Markup Language). But SAML is actually an old, now-fragmented standard. But it’s also particularly popular in new, cloud use cases. And Kerberos is actually even older than SAML.
Suppose we want to deny somebody authorization to access certain raw data, but let them see certain aggregated or derived information. How can we be sure they can’t really see the forbidden underlying data, except through a case-by-case analysis? And if that case-by-case analysis is needed, how can the authorization rules ever be simple?
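
A small illustration of why the aggregation question is hard: two aggregate queries that each look harmless can be subtracted to recover a single raw value. The sketch below is hypothetical. The salary figures and the total_salary function are invented for illustration, and it assumes a user who may run aggregates but may not read individual rows.

```python
# Raw data the user is NOT authorized to read directly (invented values).
salaries = {"alice": 120_000, "bob": 95_000, "carol": 143_000}


def total_salary(excluding=None):
    """An 'allowed' aggregate: total salary, optionally leaving one person out."""
    return sum(value for name, value in salaries.items() if name != excluding)


# Each query returns only an aggregate ...
everyone = total_salary()
everyone_but_carol = total_salary(excluding="carol")

# ... but their difference is exactly one forbidden raw value.
inferred_carol = everyone - everyone_but_carol
```

Blocking this particular pair of queries is easy. Ruling out every such combination in advance is the case-by-case analysis the question above is worried about.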

Further confusing matters, it is an extremely common analytic practice to extract data from somewhere and put it somewhere else to be analyzed. Such extracts are an obvious vector for data breaches, especially when the target system is managed by an individual or IT-weak department. Excel-on-laptops is probably the worst case, but even fat-client BI — both QlikView and Tableau are commonly used with local in-memory data staging — can present substantial security risks. To limit such risks, IT departments are trying to impose new standards and controls on departmental analytics. But IT has been fighting that war for many decades, and it hasn’t won yet.

And that’s all when data is controlled by a single enterprise. Inter-enterprise data sharing confuses things even more. For example, national security breaches in the US tend to come from government contractors more than government employees. (Ed Snowden is the most famous example. Chelsea Manning is the most famous exception.) And as was already acknowledged above, even putting your data under the control of a SaaS vendor opens hard-to-plug security holes.

Data security is a real mess.

Edit (July 10, 2017): Matt Asay evidently agrees with this post, specifically in the context of Hadoop.

Light-touch managed services

Curt Monash — Wed, 14 Jun 2017 13:14:06 +0000

Cloudera recently introduced Cloudera Altus, a Hadoop-in-the-cloud offering with an interesting processing model:

Altus manages jobs for you.
But you actually run them on your own cluster, and so you never have to put your data under Altus’ control.

Thus, you avoid a potential security risk (shipping your data to Cloudera’s service). I’ve tentatively named this strategy light-touch managed services, and am interested in exploring how broadly applicable it might or might not be.

For light-touch to be a good approach, there should be (sufficiently) little downside in performance, reliability and so on from having your service not actually control the data. That assumption is trivially satisfied in the case of Cloudera Altus, because it’s not an ordinary kind of app; rather, its whole function is to improve the job-running part of your stack. Most kinds of apps, however, want to operate on your data directly. For those, it is more challenging to meet acceptable SLAs (Service-Level Agreements) on a light-touch basis.

Let’s back up and consider what “light-touch” for data-interacting apps (i.e., almost all apps) would actually mean. The basics are:

The user has some kind of environment that manages data and executes programs.
The light-touch service, running outside this environment, spawns one or more app processes inside it.
Useful work ensues …
… with acceptable reliability and performance.
The environment’s security guarantees ensure that data doesn’t leak out.
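
Those basics can be sketched as a toy model. Everything here is an assumption made for illustration: CustomerEnvironment and LightTouchService are hypothetical classes, not a description of how Cloudera Altus or any other product is built. The point is only the shape of the boundary: jobs go in, and job metadata (not data) comes out.

```python
class CustomerEnvironment:
    """Stands in for the user's cluster: it holds the data and runs jobs locally."""

    def __init__(self, data):
        self._data = data   # stays inside the environment
        self.outputs = []   # job results also stay inside the environment

    def run_job(self, job):
        # The job executes inside the environment, next to the data.
        self.outputs.append(job(self._data))
        # Only job metadata crosses the boundary.
        return {"status": "succeeded", "rows_processed": len(self._data)}


class LightTouchService:
    """Stands in for the outside service: it submits jobs and sees only metadata."""

    def __init__(self, environment):
        self.environment = environment
        self.history = []

    def submit(self, job):
        report = self.environment.run_job(job)
        self.history.append(report)
        return report


env = CustomerEnvironment([3, 5, 8])
service = LightTouchService(env)
report = service.submit(sum)  # the "app process" here is just Python's built-in sum
```

In this toy nothing stops a malicious job from copying the data elsewhere, which is why the last item in the list matters: the guarantee has to come from the environment's own security, not from the service's good intentions.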

Cases where that doesn’t even make sense include but are not limited to:

Transaction-processing applications that are carefully tuned for efficient database access.
Applications that need to be carefully installed on or in connection with a particular server, DBMS, app server or whatever.

On the other hand:

A light-touch service is at least somewhat reasonable in connection with analytics-oriented data-management-plus-processing environments such as Hadoop/Spark clusters.
There are many workloads over Hadoop clusters that don’t need efficient database access. (Otherwise Hive use would not be so prevalent.)
Light-touch efforts seem more likely to be helped than hurt by abstraction environments such as the public cloud.

So we can imagine some kind of outside service that spawns analytic jobs to be run on your preferred — perhaps cloudy — Hadoop/Spark cluster. That could be a safe way to get analytics done over data that really, really, really shouldn’t be allowed to leak.

But before we anoint light-touch managed services as the NBT (Next Big Thing/Newest Bright Thought), there’s one more hurdle for it to overcome — why bother at all? What would a light-touch managed service provide that you wouldn’t also get from installing packaged software onto your cluster and running it in the usual way? The simplest answer is “The benefits of SaaS (Software as a Service)”, and so we can rephrase the challenge as “Which benefits of SaaS still apply in the light-touch managed service scenario?”

The vendor perspective might start, with special cases such as Cloudera Altus excepted:

The cost-saving benefits of multi-tenancy mostly don’t apply. Each instance winds up running on a separate cluster, namely the customer’s own. (But that’s likely to be SaaS/cloud itself.)
The benefits of controlling your execution environment apply at best in part. You may be able to assume the customer’s core cluster runs on some cloud service, but you don’t get to run the operation yourself.
The benefits of a SaaS-like product release cycle do mainly apply.
- The benefit of only having to support the current version(s) of the product is somewhat limited when you don’t wholly control your execution environment.
- Light-touch doesn’t seem to interfere with the traditional SaaS approach of a rapid, incremental product release cycle.

When we flip to the user perspective, however, the idea looks a little better.

Cloudy analytics is for folks who favor convenience of various kinds over tightly-managed, blazing performance.
Security and data privacy are ongoing (and increasing) concerns.
Light-touch services are in line with those priorities.

Bottom line: Light-touch managed services are well worth thinking about. But they’re not likely to be a big deal soon.