Memory-centric data management – DBMS 2 : DataBase Management System Services

Analytics on the edge?

Curt Monash — Fri, 30 Jun 2017 08:27:18 +0000

There’s a theory going around to the effect that:

Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.

There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.

1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.

2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.

3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration.

There are any number of situations in which decisions are made on or about remote systems, based on models or rules that should be improved over time. For example, such decisions might be made in:

Machine vision or other “recognition”-oriented areas of AI.
Detection or prediction of malfunctions.
Choices as to what data is significant enough to ship back upstream.

In the canonical case, we might envision a system in which:

Huge amounts of data are collected and are used to make real-time decisions.
The models are trained centrally, and updated remotely over time as they are improved.
The remote systems can only ship back selected or aggregated data to help train the models.

This all seems like an awkward fit for any common computing architecture I can think of.

But it’s hard to pin down important examples of that “canonical” case. The story implicitly assumes:

A model is widely deployed.
The model does a decent job but not a perfect one.
Based on its successes and failures, the model gets improved.

And now we’re begging a huge question: What exactly is there that keeps score as to when the model succeeds and fails? Mathematically speaking, I can’t imagine what a general answer would be like.

4. So when it comes to predictive models executed on real-world appliances I think that analytic workflows will:

Differ for different (categories of) applications.
Rely in most cases on simple patterns of data movement, such as:
- Stream everything to central servers and sort it out there, or if that’s not workable …
- … instrument a limited number of test nodes to store everything, and recover the data in batch for analysis.
- Update models only when you’re doing a full app update/refresh anyway.

And with that much of the apparent need for fancy distributed analytic architectures evaporates.

5. Finally, and notwithstanding the previous point: Across many use cases, there’s some kind of remote log data being shipped back to a central location. It may be the complete log. It may be periodic aggregates. It may happen only for what the edge nodes regard as significant events. But something is getting shipped home.
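As a rough illustration of the middle two options, here is a minimal sketch of an edge node that ships periodic aggregates plus whatever it regards as significant events. The class name, the window size, the threshold and the `outbox` list (which stands in for the uplink) are all assumptions made up for this example, not any particular product's design.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class EdgeShipper:
    """Hypothetical edge-side filter: buffers readings, ships aggregates and outliers."""
    window: int = 5            # readings per aggregate (assumed)
    threshold: float = 100.0   # readings above this count as "significant" (assumed)
    buffer: list = field(default_factory=list)
    outbox: list = field(default_factory=list)  # stands in for the uplink

    def observe(self, value: float) -> None:
        # Significant events are shipped immediately, in full.
        if value > self.threshold:
            self.outbox.append({"kind": "event", "value": value})
        self.buffer.append(value)
        # Everything else goes home only as a periodic aggregate.
        if len(self.buffer) == self.window:
            self.outbox.append({
                "kind": "aggregate",
                "count": len(self.buffer),
                "mean": mean(self.buffer),
                "max": max(self.buffer),
            })
            self.buffer.clear()


shipper = EdgeShipper()
for reading in [10, 12, 11, 150, 9, 10, 11]:
    shipper.observe(reading)
```

Seven readings come in, but only two messages go out: one event for the outlier and one aggregate for the first full window. The last two readings wait in the buffer for the next window to fill.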

The architectures for shipping, receiving and analyzing such data are in many cases immature. That’s obvious if there’s any kind of streaming involved, or if analysis is done in Spark. Ditto if there’s anything we might call “non-tabular business intelligence”. As this stuff matures, it will in many cases fit very well with today’s cloud thinking. But in any case — it needs to mature.

Truth be told, even the relational case is immature, in that it can easily rely on what I called:

data warehouses (perhaps really data marts) that are updated in human real-time

That quote is from a recent post about Kudu, which:

Is designed for exactly that use case.
Went GA early this year.

As always, technology is in flux.

Related links

Interana is another example of very new technology that seems applicable to these use cases.
My 2013 post on the future of IT architectures still rings true.

Analyzing the right data

Curt Monash — Thu, 13 Apr 2017 12:05:43 +0000

0. A huge fraction of what’s important in analytics amounts to making sure that you are analyzing the right data. To a large extent, “the right data” means “the right subset of your data”.

1. In line with that theme:

Relational query languages, at their core, subset data. Yes, they all also do arithmetic, and many do more math or other processing than just that. But it all starts with the set theory.
Underscoring the power of this approach, other data architectures over which analytics is done usually wind up with SQL or “SQL-like” language access as well.
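To make the “subset first, arithmetic second” point concrete, here is a small sketch using Python's built-in `sqlite3` module. The `sales` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "widget", 100.0),
        ("east", "gadget", 250.0),
        ("west", "widget", 75.0),
        ("west", "gadget", 300.0),
    ],
)

# Step 1 (set theory): the WHERE clause picks the subset of rows.
# Step 2 (arithmetic): SUM runs only over that subset.
east_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = ?", ("east",)
).fetchone()[0]
print(east_total)
```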

2. Business intelligence interfaces today don’t look that different from what we had in the 1980s or 1990s. The biggest visible* changes, in my opinion, have been in the realm of better drilldown, à la QlikView and then Tableau. Drilldown, of course, is the main UI for business analysts and end users to subset data themselves.

*I used the word “visible” on purpose. The advances at the back end have been enormous, and much of that redounds to the benefit of BI.

3. I wrote 2 1/2 years ago that sophisticated predictive modeling commonly fit the template:

Divide your data into clusters.
Model each cluster separately.

That continues to be tough work. Attempts to productize shortcuts have not caught fire.
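For readers who want the template spelled out, here is a minimal sketch. It assumes the cluster assignment already exists (a hypothetical `segment` label stands in for the output of a real clustering step), and it uses a hand-rolled least-squares line as the per-cluster model. The data is made up.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# (segment, x, y) rows; the segment label stands in for a cluster assignment.
rows = [
    ("retail", 1, 2.0), ("retail", 2, 4.0), ("retail", 3, 6.0),
    ("wholesale", 1, 10.0), ("wholesale", 2, 9.0), ("wholesale", 3, 8.0),
]

# Step 1: divide the data into clusters.
clusters = defaultdict(list)
for segment, x, y in rows:
    clusters[segment].append((x, y))

# Step 2: model each cluster separately.
models = {segment: fit_line(points) for segment, points in clusters.items()}
```

In this toy data the two segments trend in opposite directions, so a single model fit over all six rows would describe neither of them well. The hard part in practice is step 1, which this sketch simply assumes away.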

4. In an example of the previous point, anomaly management technology can, in theory, help shortcut any type of analytics, in that it tries to identify what parts of your data to focus on (and why). But it’s in its early days; none of the approaches to general anomaly management has gained much traction.

5. Marketers have vast amounts of information about us. It starts with every credit card transaction line item and a whole lot of web clicks. But it’s not clear how many of those (10s of) thousands of columns of data they actually use.

6. In some cases, the “right” amount of data to use may actually be tiny. Indeed, some statisticians claim that fewer than 10 data points may be enough to get a good model. I’m skeptical, at least as to the practical significance of such extreme figures. But on the more plausible side — if you’re hunting bad guys, it may not take very many separate facts before you have good evidence of collusion or fraud.

Internet fraud excepted, of course. Identifying that usually involves sifting through a lot of log entries.

7. All the needle-hunting in the world won’t help you unless what you seek is in the haystack somewhere.

Often, enterprises explicitly invest in getting more data.
Keeping everything you already generate is the obvious choice for most categories of data, but some of the lowest-value-per-bit logs may forever be thrown away.

8. Google is famously in the camp that there’s no such thing as too much data to analyze. For example, it famously uses >500 “signals” in judging the quality of potential search results. I don’t know how many separate data sources those signals are informed by, but surely there are a lot.

9. Few predictive modeling users demonstrate a need for vast data scaling. My support for that claim is a lot of anecdata. In particular:

Some predictive modeling techniques scale well. Some scale poorly. The level of pain around the “scale poorly” aspects of that seems to be fairly light (or “moderate” at worst). For example:
- In the previous technology generation, analytic DBMS and data warehouse appliance vendors tried hard to make statistical packages scale across their systems. Success was limited. Nobody seemed terribly upset.
- Cloudera’s Data Science Workbench messaging isn’t really scaling-centric.
Spark’s success in machine learning is rather rarely portrayed as centering on scaling. And even when it is, Spark basically runs in memory, so each Spark node isn’t processing all that much data.

10. Somewhere in this post — i.e. right here — let’s acknowledge that the right data to analyze may not be exactly what was initially stored. Data munging/wrangling/cleaning/preparation is often a big deal. Complicated forms of derived data can be important too.

11. Let’s also mention data marts. Basically, data marts subset and copy data, because the data will be easier to analyze in its copied form, or because you want to separate workloads between the original and copied data store.

If we assume the data is on spinning disks or even flash, then the need for that strategy declined long ago.
Suppose you want to keep data entirely in memory? Then you might indeed want to subset-and-copy it. But with so many memory-centric systems doing decent jobs of persistent storage too, there’s often a viable whole-dataset management alternative.

But notwithstanding the foregoing:

Security/access control can be a good reason for subset-and-copy.
So can other kinds of administrative simplification.

12. So what does this all suggest going forward? I believe:

Drilldown is and will remain central to BI. If your BI doesn’t support robust drilldown, you’re doing it wrong. “Real-time” use cases are not exceptions to this rule.
In a strong overlap with the previous point, drilldown is and will remain central to monitoring. Whatever monitoring means to you, the ability to pinpoint the specific source of interesting signals is crucial.
The previous point can be recast as saying that it’s crucial to identify, isolate and explain anomalies. Some version(s) of anomaly management will become a big deal.
SQL and “SQL-like” languages will remain integral to analytic processing for a long time.
Memory-centric analytic frameworks such as Spark will continue to win. The data size constraints imposed by memory-centric processing will rarely cause difficulties.

Related links

Other recent “unifying-theme” posts focused on monitoring and coordination.
My 2013 post on what matters in investigative analytics still holds up pretty well.

DBAs of the future

Curt Monash — Wed, 23 Nov 2016 12:02:31 +0000

After a July visit to DataStax, I wrote:

The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.

Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

That turns out to understate the core point, which is that DBAs still matter in non-RDBMS environments. Specifically, it’s too narrow in two ways.

First, it’s generally too narrow as to what DBAs do; people with DBA-like skills are also involved in other areas such as “data governance”, “information lifecycle management”, storage, or what I like to call data mustering.
Second — and more narrowly — the first bullet point of the quote is actually incorrect. In fact, the database design part of application development can be done by a specialized person up front in the NoSQL world, just as it commonly is for RDBMS apps.

My wake-up call for that latter bit was a recent MongoDB 3.4 briefing. MongoDB certainly has various efforts in administrative tools, which I won’t recapitulate here. But to my surprise, MongoDB also found a role for something resembling relational database design. The idea is simple: A database administrator defines a view against a MongoDB database, where views:

Are logical rather than materialized. (At least at this time.)
Have their permissions and so on set by the DBA.
Are the sole thing the programmer writes against.

Besides the obvious benefits in development ease and security, MongoDB says that performance can be better as well.* This is of course a new feature, without a lot of adoption at this time. Even so, it seems likely that NoSQL doesn’t obsolete any part of the traditional DBA role.

*I didn’t actually ask what a naive programmer can do to trash performance that views can forestall, but … well, I was once a naive programmer myself.
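As a rough sketch of what such a view definition might look like: MongoDB's `create` command accepts `viewOn` and `pipeline` fields. The helper below only builds that command document, so it runs without a server. The helper itself, and the collection, field and database names, are hypothetical and not taken from the briefing.

```python
def view_command(view_name, source, visible_fields, match=None):
    """Build a MongoDB `create` command document that defines a read-only view."""
    pipeline = []
    if match:
        # Restrict which documents the programmer can see.
        pipeline.append({"$match": match})
    # Restrict which fields the programmer can see.
    projection = {name: 1 for name in visible_fields}
    projection["_id"] = 0
    pipeline.append({"$project": projection})
    return {"create": view_name, "viewOn": source, "pipeline": pipeline}


cmd = view_command(
    "active_customers",           # hypothetical view name
    "customers",                  # hypothetical source collection
    ["name", "region"],
    match={"status": "active"},
)

# With a live server and pymongo installed, a DBA could run something like:
#   from pymongo import MongoClient
#   MongoClient()["crm"].command(cmd)
```

The programmer would then query `active_customers` like any other collection, without ever touching `customers` directly.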

Two trends that I think could make DBAs’ lives even more interesting and challenging in the future are:

The integration of quick data management into complex analytic processes. Here by “quick data management” I mean, for example, what you do in connection with a complex Hadoop or Spark (set of) job(s). Leaving the data management to a combination of magic and Python scripts doesn’t seem to respect how central data operations are to analytic tasks.
The integration of data management and streaming. I should probably write about this point separately, but in any case — it seems that streaming stacks will increasingly look like over-caffeinated DBMS.

Bottom line: Database administration skills will be needed for a long time to come.

Rapid analytics

Curt Monash — Fri, 21 Oct 2016 14:17:04 +0000

“Real-time” technology excites people, and has for decades. Yet the actual, useful technology to meet “real-time” requirements remains immature, especially in cases which call for rapid human decision-making. Here are some notes on that conundrum.

1. I recently posted that “real-time” is getting real. But there are multiple technology challenges involved, including:

General streaming. Some of my posts on that subject are linked at the bottom of my August post on Flink.
Low-latency ingest of data into structures from which it can be immediately analyzed. That helps drive the (re)integration of operational data stores, analytic data stores, and other analytic support — e.g. via Spark.
Business intelligence that can be used quickly enough. This is a major ongoing challenge. My clients at Zoomdata may be thinking about this area more clearly than most, but even they are still in the early stages of providing what users need.
Advanced analytics that can be done quickly enough. Answers there may come through developments in anomaly management, but that area is still in its super-early days.
Alerting, which has been under-addressed for decades. Perhaps the anomaly management vendors will finally solve it.

2. In early 2011, I coined the phrase investigative analytics, about which I said three main things:

It is meant to contrast with “operational analytics”.
It is meant to conflate “several disciplines, namely”:
- Statistics, data mining, machine learning, and/or predictive analytics.
- The more research-oriented aspects of business intelligence tools.
- Analogous technologies as applied to non-tabular data types such as text or graph.
A simple definition would be “Seeking (previously unknown) patterns in data.”

Generally, that has held up pretty well, although “exploratory” is the more widely used term. But the investigative/operational dichotomy obscures one key fact, which is the central point of this post: There’s a widespread need for very rapid data investigation.

3. This is not just a niche need. I have numerous rapid-investigation use cases in mind, some already mentioned in my recent posts on anomaly management and real-time applications.

Network operations. This is my paradigmatic example.
- Data is zooming all over the place, in many formats and structures, among many kinds of devices. That’s log data, header data and payload data alike. Many kinds of problems can arise …
- … which operators want to diagnose and correct, in as few minutes as possible.
- Interfaces commonly include real-time business intelligence, some drilldown, and a lot of command-line options.
- I’ve written about various specifics, especially in connection with the vendors Splunk and Rocana.
Security and anti-fraud. Infosec and cyberfraud, to a considerable extent, are just common problems in network operations. Much of the response is necessarily automated — but the bad guys are always trying to outwit your automation. If you think they may have succeeded, you want to figure that out very, very fast.
Consumer promotion and engagement. Consumer marketers feel a great need for speed. Some of it is even genuine.
- If an online promotion is going badly (or particularly well), they can in theory react almost instantly. So they’d like to know almost instantly, perhaps via BI tools with great drilldown.
- The same is even truer in the case of social media eruptions and the like. Obviously, the tools here are heavily text-oriented.
- Call centers and even physical stores have some of the same aspects as internet consumer operations.
Consumer internet backends, for e-commerce, publishing, gaming or whatever. These cases combine and in some cases integrate the previous three points. For example, if you get a really absurd-looking business result, that could be your first indication of network malfunctions or automated fraud.
Industrial technology, such as factory operations, power/gas/water networks, vehicle fleets or oil rigs. Much as in IT networks, these contain a diversity of equipment — each now spewing its own logs — and have multiple possible modes of failure. More often than is the case in IT networks, you can recognize danger signs, then head off failure altogether via preventive maintenance. But when you can’t, it is crucial to identify the causes of failure fast.
General IoT (Internet of Things) operation. This covers several of the examples above, as well as cases in which you sell a lot of devices, have them “phone home”, and labor to keep that whole multi-owner network working.
National security. If I told you what I meant by this one, I’d have to … [redacted].

4. And then there’s the investment industry, which obviously needs very rapid analysis. When I was a stock analyst, I could be awakened by a phone call and told news that I would need to explain to 1000s of conference call listeners 20 minutes later. This was >30 years ago. The business moves yet faster today.

The investment industry has invested greatly in high-speed supporting technology for decades. That’s how Mike Bloomberg got so rich founding a vertical market tech business. But investment-oriented technology indeed remains a very vertical sector; little of it gets more broadly applied.

I think the reason may be that investing is about guesswork, while other use cases call for more definitive answers. In particular:

If you’re wrong 49.9% of the time in investing, you might still be a big winner.
In high-frequency trading, speed is paramount; you have to be faster than your competitors. In speed/accuracy trade-offs, speed wins.

5. Of course, it’s possible to overstate these requirements. As in all real-time discussions, one needs to think hard about:

How much speed is important in meeting users’ needs.
How much additional speed, if any, is important in satisfying users’ desires.

But overall, I have little doubt that rapid analytics is a legitimate area for technology advancement and growth.

“Real-time” is getting real

Curt Monash — Tue, 06 Sep 2016 06:43:40 +0000

I’ve been an analyst for 35 years, and debates about “real-time” technology have run through my whole career. Some of those debates are by now pretty much settled. In particular:

Yes, interactive computer response is crucial.
- Into the 1980s, many apps were batch-only. Demand for such apps dried up.
- Business intelligence should occur at interactive speeds, which is a major reason that there’s a market for high-performance analytic RDBMS.
Theoretical arguments about “true” real-time vs. near-real-time are often pointless.
- What matters in most cases is human users’ perceptions of speed.
- Most of the exceptions to that rule occur when machines race other machines, for example in automated bidding (high frequency trading or otherwise) or in network security.

A big issue that does remain open is: How fresh does data need to be? My preferred summary answer is: As fresh as is needed to support the best decision-making. I think that formulation starts with several advantages:

It respects the obvious point that different use cases require different levels of data freshness.
It cautions against people who think they need fresh information but aren’t in a position to use it. (Such users have driven much bogus “real-time” demand in the past.)
It covers cases of both human and automated decision-making.

Straightforward applications of this principle include:

In “buying race” situations such as high-frequency trading, data needs to be as fresh as the other guy’s, and preferably even fresher.
Supply-chain systems generally need data that’s fresh to within a few hours; in some cases, sub-hour freshness is needed.
That’s a good standard for many desktop business intelligence scenarios as well.
Equipment-monitoring systems’ need for data freshness depends on how quickly catastrophic or cascading failures can occur or be averted.
- Different specific cases call for wildly different levels of data freshness.
- When equipment is well-instrumented with sensors, freshness requirements can be easy to meet.

E-commerce and other internet interaction scenarios can be more complicated, but it seems safe to say:

Recommenders/personalizers should take into account information from the current session.
Try very hard to give customers correct information about merchandise availability or pricing.

In meeting freshness requirements, multiple technical challenges can come into play.

Traditional batch aggregation is too slow for some analytic needs. That’s a core reason for having an analytic RDBMS.
Traditional data integration/movement pipelines can also be too slow. That’s a basis for short-request-capable data stores to also capture some analytic workloads. E.g., this is central to MemSQL’s pitch, and to some NoSQL applications as well.
Scoring models at interactive speeds is often easy. Retraining them quickly is much harder, and at this point only rarely done.
OLTP (OnLine Transaction Processing) guarantees adequate data freshness …
… except in scenarios where the transactions themselves are too slow. Questionably-consistent systems — commonly NoSQL — can usually meet performance requirements, but might have issues with the freshness of accurate data.
Older generations of streaming technology disappointed. The current generation is still maturing.

Based on all that, what technology investments should you be making, in order to meet “real-time” needs? My answers start:

Customer communications, online or telephonic as the case may be, should be based on accurate data. In particular:
- If your OLTP data is somehow siloed away from your phone support data, fix that immediately, if not sooner. (Fixing it 5-15 years ago would be ideal.)
- If your eventual consistency is so eventual that customers notice, fix it ASAP.
If you invest in predictive analytics/machine learning to support your recommenders/personalizers, then your models should at least be scored on fresh data.
- If your models don’t support that, reformulate them.
- If your data pipeline doesn’t support that, rebuild it.
- Actual high-speed retraining of models isn’t an immediate need. But if you’re going to have to transition to that anyway, consider doing so early and getting it over with.
Your BI should have great drilldown and exploration. Find the most active users of such functionality in your enterprise, even if — especially if! — they built some kind of departmental analytic system outside the enterprise mainstream. Ask them what, if anything, they need that they don’t have. Respond accordingly.
Whatever expensive and complex equipment you have, slather it with sensors. Spend a bit of research effort on seeing whether the resulting sensor logs can be made useful.
- Please note that this applies to vehicles and fixed objects (e.g. buildings, pipelines) as well as to traditional industrial machinery.
- It also applies to any products you make which draw electric power.

So yes — I think “real-time” has finally become pretty real.

Are analytic RDBMS and data warehouse appliances obsolete?

Curt Monash — Mon, 29 Aug 2016 01:28:31 +0000

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.

Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”

If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.

Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
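The ETL/ELT distinction is easy to blur, so here is a small sketch of both orderings side by side. It uses Python's built-in `sqlite3` module as a stand-in for the analytic RDBMS. The table names, the cleanup rules and the sample rows are assumptions for illustration only.

```python
import sqlite3

# Raw rows as they might arrive from a source system: messy text everywhere.
raw = [
    ("2016-08-01", " East ", "100"),
    ("2016-08-01", "west", "75"),
    ("2016-08-02", "EAST", "250"),
]


def clean(row):
    """Transformation as done outside the database, in an ETL engine."""
    day, region, amount = row
    return day, region.strip().lower(), float(amount)


# ETL: transform in the integration layer, then load the finished rows.
etl = sqlite3.connect(":memory:")
etl.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")
etl.executemany("INSERT INTO sales VALUES (?, ?, ?)", [clean(r) for r in raw])

# ELT: load the raw rows as-is, then transform with SQL inside the database.
elt = sqlite3.connect(":memory:")
elt.execute("CREATE TABLE raw_sales (day TEXT, region TEXT, amount TEXT)")
elt.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw)
elt.execute(
    "CREATE TABLE sales AS "
    "SELECT day, LOWER(TRIM(region)) AS region, CAST(amount AS REAL) AS amount "
    "FROM raw_sales"
)

query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
etl_result = etl.execute(query).fetchall()
elt_result = elt.execute(query).fetchall()
```

Both paths end with the same `sales` table. What differs is which system burns the cycles on transformation, and that is the part of the workload that can migrate elsewhere.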

And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.

My technical observations around these trends include:

Advanced analytics commonly require flexible, iterative processing.
Spark is much better at such processing than earlier Hadoop …
… which in turn is better than anything that’s been built into an analytic RDBMS.
Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.

And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.

They don’t force you into using SQL for everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.

But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:

They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
Other integrations, as noted above, can also be weak.

Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.

Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.

Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.

Introduction to data Artisans and Flink

Curt Monash — Sun, 21 Aug 2016 21:15:59 +0000

data Artisans and Flink basics start:

Flink is an Apache project sponsored by the Berlin-based company data Artisans.
Flink has been viewed in a few different ways, all of which are similar to how Spark is seen. In particular, per co-founder Kostas Tzoumas:
- Flink’s original goal was “Hadoop done right”.
- Now Flink is focused on streaming analytics, as an alternative to Spark Streaming, Samza, et al.
Kostas seems to see Flink as a batch-plus-streaming engine that’s streaming-first.

Like many open source projects, Flink seems to have been partly inspired by a Google paper.

To this point, data Artisans and Flink have less maturity and traction than Databricks and Spark. For example:

The first line of Flink code dates back to 2010.
data Artisans and the Flink open source project both started in 2014.
When I met him in late June, Kostas told me that data Artisans had raised $7 million and had 15 employees.
Flink’s current SQL support is very minor.

Per Kostas, about half of Flink committers are at data Artisans; others are at Cloudera, Hortonworks, Confluent, Intel, at least one production user, and some universities. Kostas provided about 5 examples of production Flink users, plus a couple of very big names that were sort-of-users (one was using a forked version of Flink, while another is becoming a user “soon”).

The technical story at data Artisans/Flink revolves around the assertion “We have the right architecture for streaming.” If I understood data Artisans co-founder Stephan Ewen correctly on a later call, the two key principles in support of that seem to be:

The key is to keep data “transport” running smoothly without interruptions, delays or bottlenecks, where the relevant sense of “transport” is movement from one operator/operation to the next.
In this case, the Flink folks feel that modularity supports efficiency.

In particular:

Anything that relates to consistency/recovery is kept almost entirely separate from basic processing, with minimal overhead and nothing that resembles a lock.
Windowing and so on operate separately from basic “transport” as well.
The core idea is that special markers — currently in the ~20 byte range in size — are injected into the streams. When the marker gets to an operator, the operator snapshots the then-current state of its part of the stream.
Should recovery ever be needed, consistency is achieved by assembling all the snapshots corresponding to a single marker, and replaying any processing that happened after those snapshots were taken.
- Actually, this is oversimplified, in that it assumes there’s only a single input stream.
- A lot of Flink’s cleverness, I gather, is involved in assembling a consistent snapshot despite the realities of multiple input streams.
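
The marker idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not Flink code: it handles only the single-input-stream case described above, and the names `MARKER`, `CountingOperator`, `run` and `recover` are hypothetical.

```python
from dataclasses import dataclass, field

MARKER = "MARKER"  # hypothetical stand-in for a snapshot marker in the stream


@dataclass
class CountingOperator:
    """Toy stateful operator that keeps a running count per key."""
    counts: dict = field(default_factory=dict)
    snapshots: dict = field(default_factory=dict)  # marker id -> copy of state

    def process(self, item):
        if isinstance(item, tuple) and item[0] == MARKER:
            # A marker arrived: snapshot the then-current state.
            self.snapshots[item[1]] = dict(self.counts)
        else:
            self.counts[item] = self.counts.get(item, 0) + 1
        return item  # records and markers both continue downstream


def run(operator, stream):
    """Feed every item of the stream through the operator."""
    for item in stream:
        operator.process(item)
    return operator


def recover(snapshot, stream, marker_id):
    """Restore state from a snapshot, then replay whatever followed its marker."""
    restored = CountingOperator(counts=dict(snapshot))
    position = stream.index((MARKER, marker_id))
    return run(restored, stream[position + 1:])


stream = ["a", "b", "a", (MARKER, 1), "b", "b", (MARKER, 2), "a"]
original = run(CountingOperator(), stream)
restored = recover(original.snapshots[1], stream, 1)
```

After recovery from marker 1, `restored.counts` matches `original.counts`, because the snapshot plus the replayed tail covers every record exactly once. Note that ordinary processing never pauses or takes a lock for the snapshot; the marker simply flows through like any other item.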

The upshot, Flink partisans believe, is to match the high throughput of Spark Streaming while also matching the low latency of Storm.

The Flink folks naturally have a rich set of opinions about streaming. Besides the points already noted, these include:

“Exactly once” semantics are best in almost all use cases, as opposed to “at least once”, or to turning off fault tolerance altogether. (Exceptions might arise in extreme performance scenarios, or because of legacy systems’ expectations.)
Repetitive, scheduled batch jobs are often “streaming processes in disguise”. Besides any latency benefits, reimplementing them using streaming technology might simplify certain issues that can occur around the boundaries of batch windows. (The phrase “continuous processing” could reasonably be used here.)

We discussed joins quite a bit, but this was before I realized that Flink didn’t have much SQL support. Let’s just say they sounded rather primitive even when I assumed they were done via SQL.

Our discussion of windowing was more upbeat. Flink supports windows based either on timestamps or data arrival time, and these can be combined as needed. Stephan thinks this flexibility is important.
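
To make the distinction concrete, here is a small Python sketch (not Flink's API) that assigns the same events to fixed-size windows in two ways, once by the timestamp carried in the data and once by arrival time. The field names `event_time` and `arrival_time` and the function `tumbling_windows` are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_windows(events, size, time_key):
    """Group events into fixed-size windows, keyed by window start time."""
    windows = defaultdict(list)
    for event in events:
        start = (event[time_key] // size) * size
        windows[start].append(event["value"])
    return dict(windows)


events = [
    {"value": "x", "event_time": 3, "arrival_time": 4},
    {"value": "y", "event_time": 9, "arrival_time": 12},  # delayed in transit
    {"value": "z", "event_time": 11, "arrival_time": 13},
]

by_event_time = tumbling_windows(events, 10, "event_time")
by_arrival_time = tumbling_windows(events, 10, "arrival_time")
```

The delayed event `y` lands in the first window when grouped by its own timestamp, but in the second window when grouped by arrival time. That difference is why it matters which notion of time a window uses.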

As for Flink use cases, they’re about what you’d expect:

Plenty of data transformation, because that’s how all these systems start out. Indeed, the earliest Flink adoption was for batch transformation.
Plenty of stream processing.

But Flink doesn’t have all the capabilities one would want for the kinds of investigative analytics commonly done on Spark.

Related links

My recent series of Spark posts offers comparison or background to this one.
I surveyed Spark Streaming, Storm et al. in January.
How you factor things is always important.
data Artisans has a non-obvious URL.

More about Databricks and Spark

Curt Monash — Sun, 21 Aug 2016 20:36:15 +0000

Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:

What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third tier support. Fair enough. But distributors of stacks including Spark, for whatever combination of on-premise and cloud as the case may be, may in many cases be viewed as competitors to Databricks’ cloud-only service. So why should Databricks help them?
Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.

Ali also walked me through customer use cases and adoption in wonderful detail. In general:

A large majority of Databricks customers have machine learning use cases.
Predicting and preventing user/customer churn is a huge issue across multiple market sectors.

The story on those sectors, per Ali, is:

First, Databricks penetrated ad-tech, for use cases such as ad selection.
Databricks’ second market was “mass media”.
- Disclosed examples include Viacom and NBC/Universal.
- There are “many” specific use cases. Personalization is a big one.
- Conviva-style video operations optimization is a use case for several customers, naturally including Conviva. (Reminder: Conviva was Ion Stoica’s previous company.)
Health care came third.
- Use cases here seem to be concentrated on a variety of approaches to predict patient outcomes.
- Analytic techniques often combine machine learning with traditional statistics.
- Security is a major requirement in this sector; fortunately, Databricks believes it excels at that.
Next came what he calls “industrial IT”. This group includes cool examples such as:
- Finding oil.
- Predictive maintenance of wind turbines.
- Predicting weather based on sensor data.
Finally (for now), there’s financial services. Of course, “financial services” comprises a variety of quite different business segments. Example use cases include:
- Credit card marketing.
- Investment analysis (based on expensive third-party data sets that are already in the cloud).
- Anti-fraud.

At an unspecified place in the timeline is national security, for a use case very similar to anti-fraud — identifying communities of bad people. Graph analytics plays a big role here.

And finally, of course we discussed some technical stuff, in philosophy, futures and usage as the case may be. In particular, Ali stressed that Spark 2.0 is the first that “breaks”/changes the APIs; hence the release number. It is now the case that:

There’s a single API for batch and streaming alike, and for machine learning “too”. This is DataFrames/Datasets. In this API …
… everything is a table. That said:
- Tables can be nested.
- Tables can be infinitely large, in which case you’re doing streaming.
Based on this, Ali thinks Spark 2.0 is now really a streaming engine.

Other tidbits included:

Ali said that every Databricks customer uses SQL. No exceptions.
- Indeed, a “number” of customers are using business intelligence tools. Therefore …
- … Databricks is licensing connector technology from Simba.
They’re working on model serving, with a REST API, rather than just model building. This was demoed at the recent Spark Summit, but is still in the “nascent” stage.
Ali insists that every streaming system with good performance does some kind of micro-batching under the hood. But the Spark programmers no longer need to take that directly into account. (In earlier versions, programmatic window sizes needed to be integer multiples of the low-level system’s chosen interval.)
In the future, when Databricks runs on more than just the Amazon cloud, Databricks customers will of course have cloud-to-cloud portability.
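
The micro-batching constraint mentioned above for earlier versions can be illustrated with a short Python sketch. It is not Spark code; the function names and the millisecond units are assumptions. The point is only that when results are built from whole micro-batches, a window has to cover a whole number of them.

```python
def batches_per_window(window_ms, batch_interval_ms):
    """Return how many micro-batches make up one window, or raise if it doesn't divide evenly."""
    if window_ms <= 0 or batch_interval_ms <= 0:
        raise ValueError("durations must be positive")
    count, remainder = divmod(window_ms, batch_interval_ms)
    if remainder:
        raise ValueError(
            f"window of {window_ms} ms is not a multiple of the "
            f"{batch_interval_ms} ms batch interval"
        )
    return count


def windowed_sums(batch_sums, batches_in_window):
    """Combine per-batch partial sums into non-overlapping window totals."""
    last_start = len(batch_sums) - batches_in_window
    return [
        sum(batch_sums[i:i + batches_in_window])
        for i in range(0, last_start + 1, batches_in_window)
    ]


n = batches_per_window(2000, 500)                      # 4 batches per window
totals = windowed_sums([1, 2, 3, 4, 5, 6, 7, 8], n)    # two full windows
```

A 1200 ms window over 500 ms batches would be rejected by `batches_per_window`, which is the kind of detail programmers previously had to keep in mind.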

Notes on Spark and Databricks — technology

Curt Monash — Sun, 31 Jul 2016 14:30:18 +0000

During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:

Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
A lot of this is via a focus on simplified APIs, based on DataFrames:
- Unlike similarly named APIs in R and Python, Spark DataFrames work with nested data.
- Machine learning and Spark Streaming both work with Spark DataFrames.
There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-DS queries.

The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are NDA, but it seems fair to mention at least:

Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
Databricks has been working on security, and even on the associated certifications.

Two of the technical initiatives Reynold told me about seemed particularly cool. One, on the machine learning side, was a focus on training models online as new data streams in. In most cases this seems to require new algorithms for old model types, with a core idea being that the algorithm does a mini gradient descent for each new data point.
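
As a rough illustration of "a mini gradient descent for each new data point", here is a minimal Python sketch for a linear model with squared error. It is not Databricks' algorithm; the model type, the learning rate and the function names `sgd_step` and `train_online` are all assumptions.

```python
def sgd_step(weights, bias, features, target, learning_rate=0.1):
    """Apply one gradient-descent update using a single (features, target) point."""
    prediction = sum(w * x for w, x in zip(weights, features)) + bias
    error = prediction - target
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


def train_online(stream, n_features, learning_rate=0.1):
    """Update the model once per arriving point; no data set is retained."""
    weights, bias = [0.0] * n_features, 0.0
    for features, target in stream:
        weights, bias = sgd_step(weights, bias, features, target, learning_rate)
    return weights, bias


# Synthetic stream generated from y = 2x + 1, with x cycling through 0.0 .. 0.9.
stream = (([(i % 10) / 10], 2 * ((i % 10) / 10) + 1) for i in range(5000))
weights, bias = train_online(stream, n_features=1)
```

On this noiseless stream the weight approaches 2 and the bias approaches 1. The design point is that each update touches only the newest point, so the model can keep learning as data streams in.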

The other cool idea fits the trend of alternatives to the “lambda architecture”. Under the name “structured streaming”, which seems to be a replacement for “DStreaming”, the idea is to do set-based SQL processing even though membership of the set changes over time. Result sets are extracted on a snapshot basis; you can keep either all the results from each snapshot query or just the deltas.
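
A toy Python sketch of that idea follows. It is not the Spark API: it re-evaluates a simple "count per key" query each time a batch of rows is appended to the set, and returns either the full result or only the rows whose counts changed. The function name and the mode names `complete` and `delta` are assumptions, and reading "deltas" as "changed rows" is one interpretation.

```python
from collections import Counter


def snapshot_results(batches, mode="complete"):
    """Return the count-per-key result after each batch is appended to the set."""
    if mode not in ("complete", "delta"):
        raise ValueError("mode must be 'complete' or 'delta'")
    totals = Counter()
    results = []
    for batch in batches:
        touched = Counter(batch)   # keys whose counts change in this snapshot
        totals.update(touched)
        if mode == "complete":
            results.append(dict(totals))
        else:
            results.append({key: totals[key] for key in touched})
    return results


batches = [["a", "b"], ["a"], ["c", "a"]]
complete = snapshot_results(batches, "complete")
deltas = snapshot_results(batches, "delta")
```

Here `complete` grows with every snapshot, while `deltas` carries only what changed, for example just `{"a": 2}` after the second batch.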

Despite all this, there’s some non-trivial dissatisfaction with Spark, fair or otherwise.

Some of the reason is that SparkSQL is too immature to be great.
Some is annoyance that Databricks isn’t putting everything it has into open source.
Some is that everything has its architectural trade-offs.

To the last point, I raised one of the biggest specifics with Reynold, namely Spark’s lack of a strong built-in data persistence capability. Reynold’s answer was that they’re always working to speed up reading and writing from other forms of persistent storage. E.g., he cited a figure of ~100 million rows/core/second decoded from Parquet.

Notes from a long trip, July 19, 2016

Curt Monash — Wed, 20 Jul 2016 01:34:31 +0000

For starters:

I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.

A running list of recent posts is:

As a companion to this post, I’m publishing a very long one on vendor lock-in.
Spark and Databricks are both prospering, and of course enhancing their technology as well.
Ditto DataStax.
Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.

Subjects I’d like to add to that list include:

MemSQL, Zoomdata, and Neo Technology (also prospering).
Cloudera (multiple topics, as usual).
Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
Microsoft’s reinvention (it feels real).
Metadata (it’s ever more of a thing).
Machine learning (it’s going to be a big portion of my research going forward).
Transitions to the cloud — this subject affects almost everything else.

I’ll edit these lists as appropriate when further posts go up. Last update: August 23, 2016.

Let’s cover some other subjects right here.

1. While Kafka is widely agreed to be the universal delivery mechanism for streams, the landscape for companion technologies is confused.

Back in January I wrote that the leaders were mainly Spark Streaming, followed by Storm.
I overlooked the fact that Storm creator Twitter was replacing Storm with something called Heron.*
If there’s any buzz about Confluent’s replacement for distant-third-place contender Samza, I missed it.
Opinions about Spark Streaming are mixed. Some folks want to get away from it; others like it just fine.

And of course Flink is hoping to blow everybody else in the space away.

*But that kind of thing is not necessarily a death knell. Cassandra inventor Facebook soon replaced Cassandra with HBase, yet Cassandra is doing just fine.

As for the “lambda architecture” — that has always felt like a kludge, and various outfits are trying to obsolete it in various ways. As just one example, Cloudera described that to me during my visit as one of the main points of Kudu.

2. The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

I had a moment of clarity on this point while visiting my clients at DataStax, and discussing their goal — shared by numerous companies — of being properly appreciated for the management tools they provide. In the room with me were CEO Billy Bosworth and chief evangelist Patrick McFadin — both of whom are former DBAs themselves.

3. I visited ClearStory, and Sharmila Mulligan showed me her actual sales database, as well as telling me some things about funding. The details are all confidential, but ClearStory is clearly doing better than rumor might suggest.

4. Platfora insisted on meeting under circumstances in which it was inconvenient for me to take notes. So I have no details to share. But they sounded happy.

Edit: On July 22, it was announced that Workday is acquiring Platfora. Now I understand why Platfora gave me a bit of a runaround.

5. Pneubotics — with a cool new video on its home page — has found its first excellent product/market fit. Traditional heavy metallic robots are great at painting and related tasks when they can remain stationary, or move on rigid metal rails. Neither of those options works well, however, for large curved or irregular surfaces as might be found in the aerospace industry. Customer success for the leading soft robot company has ensued.

This all seems pretty close to the inspection/maintenance/repair area that I previously suggested could be a good soft robotics fit.