Streaming and complex event processing (CEP) – DBMS 2 : DataBase Management System Services

Analytics on the edge?

Curt Monash — Fri, 30 Jun 2017 08:27:18 +0000

There’s a theory going around to the effect that:

Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.

There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.

1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.

2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.

3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration.

There are any number of situations in which decisions are made on or about remote systems, based on models or rules that should be improved over time. For example, such decisions might be made in:

Machine vision or other “recognition”-oriented areas of AI.
Detection or prediction of malfunctions.
Choices as to what data is significant enough to ship back upstream.

In the canonical case, we might envision a system in which:

Huge amounts of data are collected and are used to make real-time decisions.
The models are trained centrally, and updated remotely over time as they are improved.
The remote systems can only ship back selected or aggregated data to help train the models.

This all seems like an awkward fit for any common computing architecture I can think of.

But it’s hard to pin down important examples of that “canonical” case. The story implicitly assumes:

A model is widely deployed.
The model does a decent job but not a perfect one.
Based on its successes and failures, the model gets improved.

And now we’re begging a huge question: What exactly is there that keeps score as to when the model succeeds and fails? Mathematically speaking, I can’t imagine what a general answer would be like.

4. So when it comes to predictive models executed on real-world appliances I think that analytic workflows will:

Differ for different (categories of) applications.
Rely in most cases on simple patterns of data movement, such as:
- Stream everything to central servers and sort it out there, or if that’s not workable …
- … instrument a limited number of test nodes to store everything, and recover the data in batch for analysis.
- Update models only in timeframes that you’re doing a full app update/refresh.

And with that much of the apparent need for fancy distributed analytic architectures evaporates.

5. Finally, and notwithstanding the previous point: Across many use cases, there’s some kind of remote log data being shipped back to a central location. It may be the complete log. It may be periodic aggregates. It may happen only for what the edge nodes regard as significant events. But something is getting shipped home.
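
The three shipping patterns just listed (complete log, periodic aggregates, significant events only) can be sketched as a small policy function. This is an illustrative sketch only, not any vendor's design; `Reading`, `ship_home`, the 60-second period and the threshold value are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Reading:
    ts: float      # seconds since some arbitrary start (assumption)
    value: float   # a single sensor measurement


def ship_home(readings: List[Reading], mode: str,
              period: float = 60.0, threshold: float = 100.0) -> list:
    """Decide what an edge node sends upstream (hypothetical policy)."""
    if mode == "complete":
        # Ship the whole log unchanged.
        return list(readings)
    if mode == "aggregate":
        # Ship one summary row per time bucket of `period` seconds.
        buckets = {}
        for r in readings:
            buckets.setdefault(int(r.ts // period), []).append(r.value)
        return [
            {"bucket_start": b * period, "count": len(v),
             "mean": mean(v), "max": max(v)}
            for b, v in sorted(buckets.items())
        ]
    if mode == "significant":
        # Ship only what the edge node itself regards as significant.
        return [r for r in readings if r.value >= threshold]
    raise ValueError(f"unknown mode: {mode}")


log = [Reading(0.0, 10.0), Reading(30.0, 20.0), Reading(70.0, 150.0)]
complete = ship_home(log, "complete")
aggregates = ship_home(log, "aggregate")
significant = ship_home(log, "significant")
```

Whichever mode is chosen, the central side still has to receive, store and analyze what arrives, which is where the maturity question comes in.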

The architectures for shipping, receiving and analyzing such data are in many cases immature. That’s obvious if there’s any kind of streaming involved, or if analysis is done in Spark. Ditto if there’s anything we might call “non-tabular business intelligence”. As this stuff matures, it will in many cases fit very well with today’s cloud thinking. But in any case — it needs to mature.

Truth be told, even the relational case is immature, in that it can easily rely on what I called:

data warehouses (perhaps really data marts) that are updated in human real-time

That quote is from a recent post about Kudu, which:

Is designed for exactly that use case.
Went GA early this year.

As always, technology is in flux.

Related links

Interana is another example of very new technology that seems applicable to these use cases.
My 2013 post on the future of IT architectures still rings true.

DBAs of the future

Curt Monash — Wed, 23 Nov 2016 12:02:31 +0000

After a July visit to DataStax, I wrote:

The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.

Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

That turns out to understate the core point, which is that DBAs still matter in non-RDBMS environments. Specifically, it’s too narrow in two ways.

First, it’s generally too narrow as to what DBAs do; people with DBA-like skills are also involved in other areas such as “data governance”, “information lifecycle management”, storage, or what I like to call data mustering.
Second — and more narrowly — the first bullet point of the quote is actually incorrect. In fact, the database design part of application development can be done by a specialized person up front in the NoSQL world, just as it commonly is for RDBMS apps.

My wake-up call for that latter bit was a recent MongoDB 3.4 briefing. MongoDB certainly has various efforts in administrative tools, which I won’t recapitulate here. But to my surprise, MongoDB also found a role for something resembling relational database design. The idea is simple: A database administrator defines a view against a MongoDB database, where views:

Are logical rather than materialized. (At least at this time.)
Have their permissions and so on set by the DBA.
Are the sole thing the programmer writes against.
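
As a rough illustration of what such a view definition looks like: in MongoDB 3.4 and later, a view is created from a source collection plus an aggregation pipeline. The sketch below only builds the server command as a Python dictionary. The database, collection, view and field names are hypothetical, and the commented-out lines assume the PyMongo driver and a reachable server.

```python
def view_command(view_name, source, pipeline):
    """Build the MongoDB `create` command that defines a logical view."""
    # Key order matters: the command name must come first.
    return {"create": view_name, "viewOn": source, "pipeline": pipeline}


# Hypothetical example: expose only non-sensitive customer fields.
cmd = view_command(
    "customer_public",
    "customers",
    [{"$project": {"_id": 0, "name": 1, "region": 1}}],
)

# With a running server (assumption), a DBA could then issue:
# from pymongo import MongoClient
# db = MongoClient()["appdb"]
# db.command(cmd)                      # defines the view
# db["customer_public"].find_one()     # programmers query only the view
```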

Besides the obvious benefits in development ease and security, MongoDB says that performance can be better as well.* This is of course a new feature, without a lot of adoption at this time. Even so, it seems likely that NoSQL doesn’t obsolete any part of the traditional DBA role.

*I didn’t actually ask what a naive programmer can do to trash performance that views can forestall, but … well, I was once a naive programmer myself.

Two trends that I think could make DBAs’ lives even more interesting and challenging in the future are:

The integration of quick data management into complex analytic processes. Here by “quick data management” I mean, for example, what you do in connection with a complex Hadoop or Spark (set of) job(s). Leaving the data management to a combination of magic and Python scripts doesn’t seem to respect how central data operations are to analytic tasks.
The integration of data management and streaming. I should probably write about this point separately, but in any case — it seems that streaming stacks will increasingly look like over-caffeinated DBMS.

Bottom line: Database administration skills will be needed for a long time to come.

Rapid analytics

Curt Monash — Fri, 21 Oct 2016 14:17:04 +0000

“Real-time” technology excites people, and has for decades. Yet the actual, useful technology to meet “real-time” requirements remains immature, especially in cases which call for rapid human decision-making. Here are some notes on that conundrum.

1. I recently posted that “real-time” is getting real. But there are multiple technology challenges involved, including:

General streaming. Some of my posts on that subject are linked at the bottom of my August post on Flink.
Low-latency ingest of data into structures from which it can be immediately analyzed. That helps drive the (re)integration of operational data stores, analytic data stores, and other analytic support — e.g. via Spark.
Business intelligence that can be used quickly enough. This is a major ongoing challenge. My clients at Zoomdata may be thinking about this area more clearly than most, but even they are still in the early stages of providing what users need.
Advanced analytics that can be done quickly enough. Answers there may come through developments in anomaly management, but that area is still in its super-early days.
Alerting, which has been under-addressed for decades. Perhaps the anomaly management vendors will finally solve it.

2. In early 2011, I coined the phrase investigative analytics, about which I said three main things:

It is meant to contrast with “operational analytics”.
It is meant to conflate “several disciplines, namely”:
- Statistics, data mining, machine learning, and/or predictive analytics.
- The more research-oriented aspects of business intelligence tools.
- Analogous technologies as applied to non-tabular data types such as text or graph.
A simple definition would be “Seeking (previously unknown) patterns in data.”

Generally, that has held up pretty well, although “exploratory” is the more widely used term. But the investigative/operational dichotomy obscures one key fact, which is the central point of this post: There’s a widespread need for very rapid data investigation.

3. This is not just a niche need. I have numerous rapid-investigation use cases in mind, some already mentioned in my recent posts on anomaly management and real-time applications.

Network operations. This is my paradigmatic example.
- Data is zooming all over the place, in many formats and structures, among many kinds of devices. That’s log data, header data and payload data alike. Many kinds of problems can arise …
- … which operators want to diagnose and correct, in as few minutes as possible.
- Interfaces commonly include real-time business intelligence, some drilldown, and a lot of command-line options.
- I’ve written about various specifics, especially in connection with the vendors Splunk and Rocana.
Security and anti-fraud. Infosec and cyberfraud, to a considerable extent, are just common problems in network operations. Much of the response is necessarily automated — but the bad guys are always trying to outwit your automation. If you think they may have succeeded, you want to figure that out very, very fast.
Consumer promotion and engagement. Consumer marketers feel a great need for speed. Some of it is even genuine.
- If an online promotion is going badly (or particularly well), they can in theory react almost instantly. So they’d like to know almost instantly, perhaps via BI tools with great drilldown.
- The same is even truer in the case of social media eruptions and the like. Obviously, the tools here are heavily text-oriented.
- Call centers and even physical stores have some of the same aspects as internet consumer operations.
Consumer internet backends, for e-commerce, publishing, gaming or whatever. These cases combine and in some cases integrate the previous three points. For example, if you get a really absurd-looking business result, that could be your first indication of network malfunctions or automated fraud.
Industrial technology, such as factory operations, power/gas/water networks, vehicle fleets or oil rigs. Much as in IT networks, these contain a diversity of equipment — each now spewing its own logs — and have multiple possible modes of failure. More often than is the case in IT networks, you can recognize danger signs, then head off failure altogether via preventive maintenance. But when you can’t, it is crucial to identify the causes of failure fast.
General IoT (Internet of Things) operation. This covers several of the examples above, as well as cases in which you sell a lot of devices, have them “phone home”, and labor to keep that whole multi-owner network working.
National security. If I told you what I meant by this one, I’d have to … [redacted].

4. And then there’s the investment industry, which obviously needs very rapid analysis. When I was a stock analyst, I could be awakened by a phone call and told news that I would need to explain to 1000s of conference call listeners 20 minutes later. This was >30 years ago. The business moves yet faster today.

The investment industry has invested greatly in high-speed supporting technology for decades. That’s how Mike Bloomberg got so rich founding a vertical market tech business. But investment-oriented technology indeed remains a very vertical sector; little of it gets more broadly applied.

I think the reason may be that investing is about guesswork, while other use cases call for more definitive answers. In particular:

If you’re wrong 49.9% of the time in investing, you might still be a big winner.
In high-frequency trading, speed is paramount; you have to be faster than your competitors. In speed/accuracy trade-offs, speed wins.
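
The arithmetic behind the first point can be sketched quickly. The assumptions here are strong and purely illustrative: every bet is independent, every bet risks the same stake, and a win pays exactly what a loss costs. The stake and bet count are made-up numbers.

```python
import math


def edge_summary(p_win, stake, n_bets):
    """Expected profit and standard deviation for n independent even-money bets."""
    # Expected profit per bet: win `stake` with p_win, lose it otherwise.
    edge_per_bet = stake * (p_win - (1 - p_win))
    # Variance of a single +/- stake outcome.
    var_per_bet = stake ** 2 * (1 - (2 * p_win - 1) ** 2)
    expected = n_bets * edge_per_bet
    std_dev = math.sqrt(n_bets * var_per_bet)
    return expected, std_dev


# Right 50.1% of the time, $1,000 per bet, one million bets (all hypothetical).
expected, std_dev = edge_summary(p_win=0.501, stake=1_000.0, n_bets=1_000_000)
# expected is about 2.0 million; std_dev is about 1.0 million
```

The tiny per-bet edge only shows up clearly across a very large number of bets, which is one reason volume and speed matter so much in this sector.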

5. Of course, it’s possible to overstate these requirements. As in all real-time discussions, one needs to think hard about:

How much speed is important in meeting users’ needs.
How much additional speed, if any, is important in satisfying users’ desires.

But overall, I have little doubt that rapid analytics is a legitimate area for technology advancement and growth.

“Real-time” is getting real

Curt Monash — Tue, 06 Sep 2016 06:43:40 +0000

I’ve been an analyst for 35 years, and debates about “real-time” technology have run through my whole career. Some of those debates are by now pretty much settled. In particular:

Yes, interactive computer response is crucial.
- Into the 1980s, many apps were batch-only. Demand for such apps dried up.
- Business intelligence should occur at interactive speeds, which is a major reason that there’s a market for high-performance analytic RDBMS.
Theoretical arguments about “true” real-time vs. near-real-time are often pointless.
- What matters in most cases is human users’ perceptions of speed.
- Most of the exceptions to that rule occur when machines race other machines, for example in automated bidding (high frequency trading or otherwise) or in network security.

A big issue that does remain open is: How fresh does data need to be? My preferred summary answer is: As fresh as is needed to support the best decision-making. I think that formulation starts with several advantages:

It respects the obvious point that different use cases require different levels of data freshness.
It cautions against people who think they need fresh information but aren’t in a position to use it. (Such users have driven much bogus “real-time” demand in the past.)
It covers cases of both human and automated decision-making.

Straightforward applications of this principle include:

In “buying race” situations such as high-frequency trading, data needs to be as fresh as the other guy’s, and preferably even fresher.
Supply-chain systems generally need data that’s fresh to within a few hours; in some cases, sub-hour freshness is needed.
That’s a good standard for many desktop business intelligence scenarios as well.
Equipment-monitoring systems’ need for data freshness depends on how quickly catastrophic or cascading failures can occur or be averted.
- Different specific cases call for wildly different levels of data freshness.
- When equipment is well-instrumented with sensors, freshness requirements can be easy to meet.

E-commerce and other internet interaction scenarios can be more complicated, but it seems safe to say:

Recommenders/personalizers should take into account information from the current session.
Try very hard to give customers correct information about merchandise availability or pricing.

In meeting freshness requirements, multiple technical challenges can come into play.

Traditional batch aggregation is too slow for some analytic needs. That’s a core reason for having an analytic RDBMS.
Traditional data integration/movement pipelines can also be too slow. That’s a basis for short-request-capable data stores to also capture some analytic workloads. E.g., this is central to MemSQL’s pitch, and to some NoSQL applications as well.
Scoring models at interactive speeds is often easy. Retraining them quickly is much harder, and at this point only rarely done.
OLTP (OnLine Transaction Processing) guarantees adequate data freshness …
… except in scenarios where the transactions themselves are too slow. Questionably-consistent systems — commonly NoSQL — can usually meet performance requirements, but might have issues with the freshness of accurate data.
Older generations of streaming technology disappointed. The current generation is still maturing.

Based on all that, what technology investments should you be making, in order to meet “real-time” needs? My answers start:

Customer communications, online or telephonic as the case may be, should be based on accurate data. In particular:
- If your OLTP data is somehow siloed away from your phone support data, fix that immediately, if not sooner. (Fixing it 5-15 years ago would be ideal.)
- If your eventual consistency is so eventual that customers notice, fix it ASAP.
If you invest in predictive analytics/machine learning to support your recommenders/personalizers, then your models should at least be scored on fresh data.
- If your models don’t support that, reformulate them.
- If your data pipeline doesn’t support that, rebuild it.
- Actual high-speed retraining of models isn’t an immediate need. But if you’re going to have to transition to that anyway, consider doing so early and getting it over with.
Your BI should have great drilldown and exploration. Find the most active users of such functionality in your enterprise, even if — especially if! — they built some kind of departmental analytic system outside the enterprise mainstream. Ask them what, if anything, they need that they don’t have. Respond accordingly.
Whatever expensive and complex equipment you have, slather it with sensors. Spend a bit of research effort on seeing whether the resulting sensor logs can be made useful.
- Please note that this applies both to vehicles and to fixed objects (e.g. buildings, pipelines) as well as traditional industrial machinery.
- It also applies to any products you make which draw electric power.

So yes — I think “real-time” has finally become pretty real.

Introduction to data Artisans and Flink

Curt Monash — Sun, 21 Aug 2016 21:15:59 +0000

data Artisans and Flink basics start:

Flink is an Apache project sponsored by the Berlin-based company data Artisans.
Flink has been viewed in a few different ways, all of which are similar to how Spark is seen. In particular, per co-founder Kostas Tzoumas:
- Flink’s original goal was “Hadoop done right”.
- Now Flink is focused on streaming analytics, as an alternative to Spark Streaming, Samza, et al.
Kostas seems to see Flink as a batch-plus-streaming engine that’s streaming-first.

Like many open source projects, Flink seems to have been partly inspired by a Google paper.

To this point, data Artisans and Flink have less maturity and traction than Databricks and Spark. For example:

The first line of Flink code dates back to 2010.
data Artisans and the Flink open source project both started in 2014.
When I met him in late June, Kostas told me that data Artisans had raised $7 million and had 15 employees.
Flink’s current SQL support is very minor.

Per Kostas, about half of Flink committers are at data Artisans; others are at Cloudera, Hortonworks, Confluent, Intel, at least one production user, and some universities. Kostas provided about 5 examples of production Flink users, plus a couple of very big names that were sort-of-users (one was using a forked version of Flink, while another is becoming a user “soon”).

The technical story at data Artisans/Flink revolves around the assertion “We have the right architecture for streaming.” If I understood data Artisans co-founder Stephan Ewen correctly on a later call, the two key principles in support of that seem to be:

The key is to keep data “transport” running smoothly without interruptions, delays or bottlenecks, where the relevant sense of “transport” is movement from one operator/operation to the next.
In this case, the Flink folks feel that modularity supports efficiency.

In particular:

Anything that relates to consistency/recovery is kept almost entirely separate from basic processing, with minimal overhead and nothing that resembles a lock.
Windowing and so on operate separately from basic “transport” as well.
The core idea is that special markers — currently in the ~20 byte range in size — are injected into the streams. When the marker gets to an operator, the operator snapshots the then-current state of its part of the stream.
Should recovery ever be needed, consistency is achieved by assembling all the snapshots corresponding to a single marker, and replaying any processing that happened after those snapshots were taken.
- Actually, this is oversimplified, in that it assumes there’s only a single input stream.
- A lot of Flink’s cleverness, I gather, is involved in assembling a consistent snapshot despite the realities of multiple input streams.
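
The marker idea can be illustrated with a toy, single-threaded pipeline. This is a sketch of the general technique under the same simplifying assumption noted above (one input stream); it is not Flink's API, and `Marker`, `Operator` and `run` are hypothetical names. A real system would also copy and durably store each snapshot.

```python
class Marker:
    """A special record injected into the stream to trigger snapshots."""
    def __init__(self, marker_id):
        self.marker_id = marker_id


class Operator:
    """A stateful operator; `step` maps (state, value) -> (new_state, output)."""
    def __init__(self, init, step):
        self.state = init
        self.step = step
        self.snapshots = {}  # marker_id -> state at the moment the marker arrived

    def process(self, item):
        if isinstance(item, Marker):
            # Snapshot the then-current state, then pass the marker downstream.
            self.snapshots[item.marker_id] = self.state
            return item
        self.state, out = self.step(self.state, item)
        return out

    def restore(self, marker_id):
        self.state = self.snapshots[marker_id]


def run(operators, stream):
    """Push each item through the operators in order; collect non-marker outputs."""
    outputs = []
    for item in stream:
        for op in operators:
            item = op.process(item)
        if not isinstance(item, Marker):
            outputs.append(item)
    return outputs


pipeline = [
    Operator(0, lambda total, v: (total + v, total + v)),   # running sum
    Operator(0, lambda n, v: (n + 1, (n + 1, v))),          # numbers each output
]
first_run = run(pipeline, [1, 2, Marker(1), 3, 4])

# Simulated recovery: restore every operator to marker 1, replay what came after.
for op in pipeline:
    op.restore(1)
replayed = run(pipeline, [3, 4])
```

Because every operator is rolled back to the state it had when the same marker passed through, the replayed records produce the same results as the original run.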

The upshot, Flink partisans believe, is to match the high throughput of Spark Streaming while also matching the low latency of Storm.

The Flink folks naturally have a rich set of opinions about streaming. Besides the points already noted, these include:

“Exactly once” semantics are best in almost all use cases, as opposed to “at least once”, or to turning off fault tolerance altogether. (Exceptions might arise in extreme performance scenarios, or because of legacy systems’ expectations.)
Repetitive, scheduled batch jobs are often “streaming processes in disguise”. Besides any latency benefits, reimplementing them using streaming technology might simplify certain issues that can occur around the boundaries of batch windows. (The phrase “continuous processing” could reasonably be used here.)
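
To make the “exactly once” versus “at least once” distinction concrete, here is a small sketch of why duplicates matter. It uses deduplication by event ID at the sink, which is one common way to get an exactly-once effect; it is a stand-in for illustration, not a description of how Flink does it. The event IDs and amounts are made up.

```python
def apply_at_least_once(deliveries):
    """Apply every delivery, including any redelivered duplicates."""
    total = 0
    for _event_id, amount in deliveries:
        total += amount
    return total


def apply_effectively_once(deliveries):
    """Apply each event ID only the first time it is seen."""
    seen = set()
    total = 0
    for event_id, amount in deliveries:
        if event_id in seen:
            continue  # duplicate caused by a retry; skip it
        seen.add(event_id)
        total += amount
    return total


# Event 2 was delivered twice, e.g. after a retry (hypothetical data).
deliveries = [(1, 10), (2, 5), (2, 5), (3, 7)]
overcounted = apply_at_least_once(deliveries)
correct = apply_effectively_once(deliveries)
```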

We discussed joins quite a bit, but this was before I realized that Flink didn’t have much SQL support. Let’s just say they sounded rather primitive even when I assumed they were done via SQL.

Our discussion of windowing was more upbeat. Flink supports windows based either on timestamps or data arrival time, and these can be combined as needed. Stephan thinks this flexibility is important.
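
The difference between the two kinds of window is easy to show with a toy example. The sketch below counts events per 10-second tumbling window, once by the timestamp carried in the event and once by arrival time. The field names and data are hypothetical, and this is plain Python rather than Flink code.

```python
from collections import defaultdict


def tumbling_counts(events, size, key):
    """Count events per tumbling window of `size` seconds, bucketing on `key`."""
    counts = defaultdict(int)
    for e in events:
        window_start = (e[key] // size) * size
        counts[window_start] += 1
    return dict(counts)


events = [
    {"event_time": 1, "arrival_time": 2},
    {"event_time": 9, "arrival_time": 12},   # happened early, arrived late
    {"event_time": 11, "arrival_time": 13},
]

by_event_time = tumbling_counts(events, 10, "event_time")
by_arrival_time = tumbling_counts(events, 10, "arrival_time")
```

The late-arriving event lands in different windows under the two schemes, so the same stream yields different counts.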

As for Flink use cases, they’re about what you’d expect:

Plenty of data transformation, because that’s how all these systems start out. Indeed, the earliest Flink adoption was for batch transformation.
Plenty of stream processing.

But Flink doesn’t have all the capabilities one would want for the kinds of investigative analytics commonly done on Spark.

Related links

My recent series of Spark posts offers comparison or background to this one.
I surveyed Spark Streaming, Storm et al. in January.
How you factor things is always important.
data Artisans has a non-obvious URL.

More about Databricks and Spark

Curt Monash — Sun, 21 Aug 2016 20:36:15 +0000

Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:

What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third tier support. Fair enough. But distributors of stacks including Spark, for whatever combination of on-premise and cloud as the case may be, may in many cases be viewed as competitors to Databricks’ cloud-only service. So why should Databricks help them?
Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.

Ali also walked me through customer use cases and adoption in wonderful detail. In general:

A large majority of Databricks customers have machine learning use cases.
Predicting and preventing user/customer churn is a huge issue across multiple market sectors.

The story on those sectors, per Ali, is:

First, Databricks penetrated ad-tech, for use cases such as ad selection.
Databricks’ second market was “mass media”.
- Disclosed examples include Viacom and NBC/Universal.
- There are “many” specific use cases. Personalization is a big one.
- Conviva-style video operations optimization is a use case for several customers, naturally including Conviva. (Reminder: Conviva was Ion Stoica’s previous company.)
Health care came third.
- Use cases here seem to be concentrated on a variety of approaches to predict patient outcomes.
- Analytic techniques often combine machine learning with traditional statistics.
- Security is a major requirement in this sector; fortunately, Databricks believes it excels at that.
Next came what he calls “industrial IT”. This group includes cool examples such as:
- Finding oil.
- Predictive maintenance of wind turbines.
- Predicting weather based on sensor data.
Finally (for now), there’s financial services. Of course, “financial services” comprises a variety of quite different business segments. Example use cases include:
- Credit card marketing.
- Investment analysis (based on expensive third-party data sets that are already in the cloud).
- Anti-fraud.

At an unspecified place in the timeline is national security, for a use case very similar to anti-fraud — identifying communities of bad people. Graph analytics plays a big role here.

And finally, of course we discussed some technical stuff, in philosophy, futures and usage as the case may be. In particular, Ali stressed that Spark 2.0 is the first that “breaks”/changes the APIs; hence the release number. It is now the case that:

There’s a single API for batch and streaming alike, and for machine learning “too”. This is DataFrames/DataSets. In this API …
… everything is a table. That said:
- Tables can be nested.
- Tables can be infinitely large, in which case you’re doing streaming.
Based on this, Ali thinks Spark 2.0 is now really a streaming engine.

Other tidbits included:

Ali said that every Databricks customer uses SQL. No exceptions.
- Indeed, a “number” of customers are using business intelligence tools. Therefore …
- … Databricks is licensing connector technology from Simba.
They’re working on model serving, with a REST API, rather than just model building. This was demoed at the recent Spark Summit, but is still in the “nascent” stage.
Ali insists that every streaming system with good performance does some kind of micro-batching under the hood. But the Spark programmers no longer need to take that directly into account. (In earlier versions, programmatic window sizes needed to be integer multiples of the low-level system’s chosen interval.)
In the future, when Databricks runs on more than just the Amazon cloud, Databricks customers will of course have cloud-to-cloud portability.

Notes on Spark and Databricks — technology

Curt Monash — Sun, 31 Jul 2016 14:30:18 +0000

During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:

Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
A lot of this is via a focus on simplified APIs, based on DataFrames.
- Unlike similarly named APIs in R and Python, Spark DataFrames work with nested data.
- Machine learning and Spark Streaming both work with Spark DataFrames.
There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-DS queries.

The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are NDA, but it seems fair to mention at least:

Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
Databricks has been working on security, and even on the associated certifications.

Two of the technical initiatives Reynold told me about seemed particularly cool. One, on the machine learning side, was a focus on training models online as new data streams in. In most cases this seems to require new algorithms for old model types, with a core idea being that the algorithm does a mini gradient descent for each new data point.
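
The “mini gradient descent for each new data point” idea can be sketched with the simplest possible model. The choice of a one-feature linear model with squared error is an assumption made for illustration; the discussion above did not specify model types, and this is not Spark code. The learning rate and data are likewise made up.

```python
def online_sgd(stream, lr=0.1):
    """Fit y ~ w*x + b, taking one small gradient step per arriving point."""
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y   # prediction error on this single point
        w -= lr * err * x       # gradient of 0.5*err**2 with respect to w
        b -= lr * err           # gradient of 0.5*err**2 with respect to b
    return w, b


# Hypothetical noise-free stream generated from y = 2x + 1.
stream = (((i % 10) / 10.0, 2.0 * ((i % 10) / 10.0) + 1.0) for i in range(5000))
w, b = online_sgd(stream)
```

The model is never retrained from scratch; each point nudges the parameters and is then discarded, which is what makes the approach suitable for data that streams in.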

The other cool idea fits the trend of alternatives to the “lambda architecture”. Under the name “structured streaming”, which seems to be a replacement for “DStreaming”, the idea is to do set-based SQL processing even though membership of the set changes over time. Result sets are extracted on a snapshot basis; you can keep either all the results from each snapshot query or just the deltas.
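
A toy version of that idea: a grouped count over a table that keeps growing, where each snapshot returns either the complete result set or only the rows that changed since the previous snapshot. This is a plain-Python sketch with hypothetical names (`RunningCountQuery`, the `"complete"` and `"delta"` modes), not the Spark API.

```python
from collections import Counter


class RunningCountQuery:
    """Roughly: SELECT key, COUNT(*) ... GROUP BY key, over a growing table."""

    def __init__(self):
        self.counts = Counter()
        self.last_emitted = {}

    def append(self, rows):
        # New rows arrive; membership of the set changes over time.
        self.counts.update(rows)

    def snapshot(self, mode="complete"):
        current = dict(self.counts)
        if mode == "complete":
            result = current
        elif mode == "delta":
            # Only groups whose count differs from the previous snapshot.
            result = {k: v for k, v in current.items()
                      if self.last_emitted.get(k) != v}
        else:
            raise ValueError(f"unknown mode: {mode}")
        self.last_emitted = current
        return result


q = RunningCountQuery()
q.append(["a", "b", "a"])
first = q.snapshot("complete")
q.append(["b", "c"])
changes = q.snapshot("delta")
everything = q.snapshot("complete")
```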

Despite all this, there’s some non-trivial dissatisfaction with Spark, fair or otherwise.

Some of the reason is that SparkSQL is too immature to be great.
Some is annoyance that Databricks isn’t putting everything it has into open source.
Some is that everything has its architectural trade-offs.

To the last point, I raised one of the biggest specifics with Reynold, namely Spark’s lack of a strong built-in data persistence capability. Reynold’s answer was that they’re always working to speed up reading and writing from other forms of persistent storage. E.g., he cited a figure of ~100 million rows/core/second decoded from Parquet.

Notes from a long trip, July 19, 2016

Curt Monash — Wed, 20 Jul 2016 01:34:31 +0000

For starters:

I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.

A running list of recent posts is:

As a companion to this post, I’m publishing a very long one on vendor lock-in.
Spark and Databricks are both prospering, and of course enhancing their technology as well.
Ditto DataStax.
Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.

Subjects I’d like to add to that list include:

MemSQL, Zoomdata, and Neo Technology (also prospering).
Cloudera (multiple topics, as usual).
Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
Microsoft’s reinvention (it feels real).
Metadata (it’s ever more of a thing).
Machine learning (it’s going to be a big portion of my research going forward).
Transitions to the cloud — this subject affects almost everything else.

I’ll edit these lists as appropriate when further posts go up. Last update: August 23, 2016.

Let’s cover some other subjects right here.

1. While Kafka is widely agreed to be the universal delivery mechanism for streams, the landscape for companion technologies is confused.

Back in January I wrote that the leaders were mainly Spark Streaming, followed by Storm.
I overlooked the fact that Storm creator Twitter was replacing Storm with something called Heron.*
If there’s any buzz about Confluent’s replacement for distant-third-place contender Samza, I missed it.
Opinions about Spark Streaming are mixed. Some folks want to get away from it; others like it just fine.

And of course Flink is hoping to blow everybody else in the space away.

*But that kind of thing is not necessarily a death knell. Cassandra inventor Facebook soon replaced Cassandra with HBase, yet Cassandra is doing just fine.

As for the “lambda architecture” — that has always felt like a kludge, and various outfits are trying to obsolete it in various ways. As just one example, Cloudera described that to me during my visit as one of the main points of Kudu.

2. The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

I had a moment of clarity on this point while visiting my clients at DataStax, and discussing their goal — shared by numerous companies — of being properly appreciated for the management tools they provide. In the room with me were CEO Billy Bosworth and chief evangelist Patrick McFadin — both of whom are former DBAs themselves.

3. I visited ClearStory, and Sharmila Mulligan showed me her actual sales database, as well as telling me some things about funding. The details are all confidential, but ClearStory is clearly doing better than rumor might suggest.

4. Platfora insisted on meeting in circumstances where it was inconvenient for me to take notes. So I have no details to share. But they sounded happy.

Edit: On July 22, it was announced that Workday is acquiring Platfora. Now I understand why Platfora gave me a bit of a runaround.

5. Pneubotics — with a cool new video on its home page — has found its first excellent product/market fit. Traditional heavy metallic robots are great at painting and related tasks when they can remain stationary, or move on rigid metal rails. Neither of those options works well, however, for large curved or irregular surfaces as might be found in the aerospace industry. Customer success for the leading soft robot company has ensued.

This all seems pretty close to the inspection/maintenance/repair area that I previously suggested could be a good soft robotics fit.

Kafka and more

Curt Monash — Mon, 25 Jan 2016 11:28:02 +0000

In a companion introduction to Kafka post, I observed that Kafka at its core is remarkably simple. Confluent offers a marchitecture diagram that illustrates what else is on offer, about which I’ll note:

The red boxes — “Ops Dashboard” and “Data Flow Audit” — are the initial closed-source part. No surprise that they sound like management tools; that’s the traditional place for closed source add-ons to start.
“Schema Management”
- Is used to define fields and so on.
- Is not equivalent to what is ordinarily meant by schema validation, in that …
- … it allows schemas to change, but puts constraints on which changes are allowed.
- Is done in plug-ins that live with the producer or consumer of data.
- Is based on the Hadoop-oriented file format Avro.
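To make "allows schemas to change, but puts constraints on which changes are allowed" a bit more concrete, here is a minimal Python sketch. It is not Confluent's code and does not use the Avro library; the dict-based schema shape and the name `is_backward_compatible` are made up for illustration. It assumes one common Avro-style rule: old data stays readable under a new schema if every added field carries a default and no existing field changes type. Real Avro schema resolution is richer than this (it permits some numeric type promotions, for example).

```python
def is_backward_compatible(old_schema, new_schema):
    """Return True if data written with old_schema can be read with new_schema.

    Schemas are hypothetical dicts of the form
    {field_name: {"type": str, "default": ...}}, where "default" is optional.
    """
    for name, spec in new_schema.items():
        if name in old_schema:
            # An existing field may not change type in this simplified rule.
            if old_schema[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            # A newly added field needs a default, because old records lack it.
            return False
    # Fields dropped from old_schema are simply ignored by the new reader.
    return True


v1 = {"user_id": {"type": "long"}, "page": {"type": "string"}}
v2 = {**v1, "referrer": {"type": "string", "default": ""}}        # adds a defaulted field
v3 = {**v1, "referrer": {"type": "string"}}                        # adds a field with no default
v4 = {"user_id": {"type": "string"}, "page": {"type": "string"}}   # changes a type

print(is_backward_compatible(v1, v2))
print(is_backward_compatible(v1, v3))
print(is_backward_compatible(v1, v4))
```

In this sketch the check would live with the producer or consumer, which matches the plug-in placement described above; a change like `v3` or `v4` would be rejected before any data is written.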

Kafka offers little in the way of analytic data transformation and the like. Hence, it’s commonly used with companion products.

Per Confluent/Kafka honcho Jay Kreps, the companion is generally Spark Streaming, Storm or Samza, in declining order of popularity, with Samza running a distant third.
Jay estimates that there’s such a companion product at around 50% of Kafka installations.
Conversely, Jay estimates that around 80% of Spark Streaming, Storm or Samza users also use Kafka. On the one hand, that sounds high to me; on the other, I can’t quickly name a counterexample, unless Storm originator Twitter is one such.
Jay’s views on the Storm/Spark comparison include:
- Storm is more mature than Spark Streaming, which makes sense given their histories.
- Storm’s distributed processing capabilities are more questionable than Spark Streaming’s.
- Spark Streaming is generally used by folks in the heavily overlapping categories of:
  - Spark users.
  - Analytics types.
  - People who need to share stuff between the batch and stream processing worlds.
- Storm is generally used by people coding up more operational apps.

If we recognize that Jay’s interests are obviously streaming-centric, this distinction maps pretty well to the three use cases Cloudera recently called out.

Complicating this discussion further is Confluent 2.1, which is expected late this quarter. Confluent 2.1 will include, among other things, a stream processing layer that works differently from any of the alternatives I cited, in that:

It’s a library running in client applications that can interrogate the core Kafka server, rather than …
… a separate thing running on a separate cluster.

The library will do joins, aggregations and so on, while relying on core Kafka for information about process health and the like. Jay sees this as more of a competitor to Storm in operational use cases than to Spark Streaming in analytic ones.
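As a rough illustration of what "aggregation inside the client application" can mean, here is a small Python sketch of a tumbling-window count. It is emphatically not the Confluent library's API, which isn't described here; the function name `tumbling_counts` and the `(timestamp, key)` event shape are assumptions for the example. The point is only that the computation runs in an ordinary application process, over messages it has consumed, rather than on a separate processing cluster.

```python
from collections import defaultdict


def tumbling_counts(events, window_seconds):
    """Count events per (window_start, key).

    events: iterable of (timestamp_seconds, key) pairs, such as a consumer
    loop inside an application might yield. Hypothetical shape.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Round the timestamp down to the start of its fixed-size window.
        window_start = int(ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "login"), (12, "click"), (59, "click"), (61, "click"), (119, "login")]
result = tumbling_counts(events, 60)
print(result)
```

A real library would also have to deal with state that survives restarts, late-arriving events and rebalancing across application instances, which is presumably where the reliance on core Kafka comes in.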

We didn’t discuss other Confluent 2.1 features much, and frankly they all sounded to me like items from the “You mean you didn’t have that already??” list any young product has.

Related links

My October, 2014 post on Streaming for Hadoop is a sort of predecessor to this two-post series.

Kafka and Confluent

Curt Monash — Mon, 25 Jan 2016 11:27:13 +0000

For starters:

Kafka has gotten considerable attention and adoption in streaming.
Kafka is open source, out of LinkedIn.
Folks who built it there, led by Jay Kreps, now have a company called Confluent.
Confluent seems to be pursuing a fairly standard open source business model around Kafka.
Confluent seems to be in the low to mid teens in paying customers.
Confluent believes 1000s of Kafka clusters are in production.
Confluent reports 40 employees and $31 million raised.

At its core Kafka is very simple:

Kafka accepts streams of data in substantially any format, and then streams the data back out, potentially in a highly parallel way.
Any producer or consumer of data can connect to Kafka, via what can reasonably be called a publish/subscribe model.
Kafka handles various issues of scaling, load balancing, fault tolerance and so on.
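The publish/subscribe idea above can be sketched in a few lines of Python. This is a toy in-memory stand-in, not a Kafka client; the class name `Topic` and its methods are hypothetical. What it tries to show is the shape of the model: producers append to a shared log, and each consumer reads at its own pace by tracking its own position.

```python
class Topic:
    """Hypothetical in-memory stand-in for one stream; not a Kafka client."""

    def __init__(self):
        self._log = []       # append-only list of messages
        self._offsets = {}   # consumer name -> next position to read

    def publish(self, message):
        """Any producer can append a message to the end of the log."""
        self._log.append(message)

    def poll(self, consumer, max_messages=10):
        """Return the next unread messages for this consumer and advance it."""
        start = self._offsets.get(consumer, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


topic = Topic()
for m in ("a", "b", "c"):
    topic.publish(m)

first = topic.poll("billing", max_messages=2)   # "billing" reads two messages
second = topic.poll("billing")                  # then picks up where it left off
other = topic.poll("search")                    # "search" starts from the beginning
print(first, second, other)
```

Everything hard about the real thing (scaling, load balancing, fault tolerance) is left out, which is rather the point of the "very simple at its core" observation.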

So it seems fair to say:

Kafka offers the benefits of hub vs. point-to-point connectivity.
Kafka acts like a kind of switch, in the telecom sense. (However, this is probably not a very useful metaphor in practice.)

Jay also views Kafka as something like a file system. Kafka doesn’t actually have a file-system-like interface for managing streams, but he acknowledges that as a need and presumably a roadmap item.

The most noteworthy technical point for me was that Kafka persists data, for reasons of buffering, fault-tolerance and the like. The duration of the persistence is configurable, and can be different for different feeds, with two main options:

Guaranteed to have the last update of anything.
Complete for the past N days.
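The two retention options can be sketched as two different pruning rules over the same log. The Python below is an illustration only: the record shape `(timestamp_seconds, key, value)` and both function names are assumptions, and it says nothing about how Kafka implements either policy internally. The first rule corresponds to what the Kafka documentation calls log compaction; the second is plain time-based retention.

```python
SECONDS_PER_DAY = 86400


def keep_latest_per_key(log):
    """Option 1: keep only the last update for each key, preserving log order."""
    latest_index = {}
    for index, (_ts, key, _value) in enumerate(log):
        latest_index[key] = index            # later entries overwrite earlier ones
    return [log[i] for i in sorted(latest_index.values())]


def keep_recent(log, now, n_days):
    """Option 2: keep every record from the past n_days, regardless of key."""
    cutoff = now - n_days * SECONDS_PER_DAY
    return [record for record in log if record[0] >= cutoff]


# Hypothetical records: (timestamp_seconds, key, value), one per day.
log = [
    (0, "u1", "a"),
    (86400, "u2", "b"),
    (172800, "u1", "c"),
    (259200, "u3", "d"),
]

compacted = keep_latest_per_key(log)
recent = keep_recent(log, now=259200, n_days=1)
print(compacted)
print(recent)
```

The trade-off is visible even in the toy: the compacted log can rebuild the current state for every key but loses history, while the time-bounded log keeps full history but only for a limited window.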

Jay thinks this is a major difference vs. messaging systems that have come before. As you might expect, given that data arrives in timestamp order and then hangs around for a while:

Kafka can offer strong guarantees of delivering data in the correct order.
Persisted data is automagically broken up into partitions.
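Ordering and partitioning interact, and a short Python sketch may help show how. My understanding is that Kafka's ordering guarantee applies within a partition rather than across a whole stream, so the usual trick is to route all messages with the same key to the same partition. The functions `partition_for` and `assign` below are hypothetical, and CRC32 is used only because it is deterministic across runs; it is not claimed to be the hash Kafka uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition number with a stable hash (illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(messages, num_partitions):
    """Append each (key, value) message to its key's partition, in arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


messages = [("u1", 1), ("u2", 1), ("u1", 2), ("u3", 1), ("u2", 2), ("u1", 3)]
parts = assign(messages, 3)

for number, contents in enumerate(parts):
    print(number, contents)
```

Because each partition is itself append-only, every key's updates stay in arrival order within its partition, while different partitions can be written and read in parallel.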

Technical tidbits include:

Data is generally fresh to within 1.5 milliseconds.
100s of MB/sec/server is claimed. I didn’t ask how big a server was.
LinkedIn runs >1 trillion messages/day through Kafka.
Others in that throughput range include but are not limited to Microsoft and Netflix.
A message is commonly 1 KB or less.
At a guesstimate, 50%ish of messages are in Avro. JSON is another frequent format.
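For a sense of scale, the LinkedIn figure can be turned into per-second rates with simple arithmetic. The Python below assumes exactly 1 trillion messages per day and treats the "1 KB or less" message size as a flat 1,000 bytes, so the bandwidth number is an upper-end back-of-envelope estimate rather than a reported figure.

```python
messages_per_day = 1_000_000_000_000   # ">1 trillion messages/day", taken as exactly 1e12
seconds_per_day = 24 * 60 * 60         # 86,400
bytes_per_message = 1_000              # assumption: every message is a full 1 KB

messages_per_second = messages_per_day / seconds_per_day
bytes_per_second = messages_per_second * bytes_per_message
gigabytes_per_second = bytes_per_second / 1e9

print(f"{messages_per_second:,.0f} messages/sec")
print(f"{gigabytes_per_second:.1f} GB/sec at 1 KB per message")
```

That works out to roughly 11.6 million messages per second averaged over the day, and on the order of 10 GB per second if messages were all near the 1 KB mark. Peak rates would presumably be higher than the daily average.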

Jay’s answer to any concern about performance overhead for current or future features is usually to point out that anything other than the most basic functionality:

Runs in different processes from core Kafka …
… if it doesn’t actually run on a different cluster.

For example, connectors have their own pools of processes.

I asked the natural open source question about who contributes what to the Apache Kafka project. Jay’s quick answers were:

Perhaps 80% of Kafka code comes from Confluent.
LinkedIn has contributed most of the rest.
However, as is typical in open source, the general community has contributed some connectors.
The general community also contributes “esoteric” bug fixes, which Jay regards as evidence that Kafka is in demanding production use.

Jay has a rather erudite and wry approach to naming and so on.

Kafka got its name because it was replacing something he regarded as Kafkaesque. OK.
Samza is an associated project that has something to do with transformations. Good name. (The central character of The Metamorphosis was Gregor Samsa, and the opening sentence of the story mentions a transformation.)
In his short book about logs, Jay has a picture caption “ETL in Ancient Greece. Not much has changed.” The picture appears to be of Sisyphus. I love it.
I still don’t know why he named a key-value store Voldemort. Perhaps it was something not to be spoken of.

What he and his team do not yet have is a clear name for their product category. Difficulties in naming include:

Kafka is limited and simple. But of course Confluent has plans to broaden its capabilities.
It’s long been hard to decide whether to talk about “events”, “streams” or both.
“Streaming” has another tech meaning, in the context of video, songs, etc.
One candidate name, “event hub”, has already been grabbed by IBM and Microsoft for their specific offerings.
Naming is always hard in general.

Confluent seems to be using “stream data platform” as a placeholder. As per the link above, I once suggested Data Stream Management System, or more concisely Datastream Manager. “Event”, “event stream” or “event series” could perhaps be mixed in as well. I don’t really have an opinion yet, and probably won’t until I’ve studied the space in a little more detail.

And on that note, I’ll end this post for reasons of length, and discuss Kafka-related technology separately.

Related links

My October, 2014 post on Streaming for Hadoop is a sort of predecessor to this two-post series.