Application areas – DBMS 2 : DataBase Management System Services

Analytics on the edge?

Curt Monash — Fri, 30 Jun 2017 08:27:18 +0000

There’s a theory going around to the effect that:

Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.

There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.

1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.

2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.

3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration.

There are any number of situations in which decisions are made on or about remote systems, based on models or rules that should be improved over time. For example, such decisions might be made in:

Machine vision or other “recognition”-oriented areas of AI.
Detection or prediction of malfunctions.
Choices as to what data is significant enough to ship back upstream.

In the canonical case, we might envision a system in which:

Huge amounts of data are collected and are used to make real-time decisions.
The models are trained centrally, and updated remotely over time as they are improved.
The remote systems can only ship back selected or aggregated data to help train the models.

This all seems like an awkward fit for any common computing architecture I can think of.

But it’s hard to pin down important examples of that “canonical” case. The story implicitly assumes:

A model is widely deployed.
The model does a decent job but not a perfect one.
Based on its successes and failures, the model gets improved.

And now we’re begging a huge question: What exactly is there that keeps score as to when the model succeeds and fails? Mathematically speaking, I can’t imagine what a general answer would be like.

4. So when it comes to predictive models executed on real-world appliances I think that analytic workflows will:

Differ for different (categories) of applications.
Rely in most cases on simple patterns of data movement, such as:
- Stream everything to central servers and sort it out there, or if that’s not workable …
- … instrument a limited number of test nodes to store everything, and recover the data in batch for analysis.
- Update models only in timeframes that you’re doing a full app update/refresh.

And with that much of the apparent need for fancy distributed analytic architectures evaporates.

5. Finally, and notwithstanding the previous point: Across many use cases, there’s some kind of remote log data being shipped back to a central location. It may be the complete log. It may be periodic aggregates. It may happen only for what the edge nodes regard as significant events. But something is getting shipped home.
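To make the three shipping patterns concrete, here is a minimal sketch in plain Python. The event layout (`device`, `kind`, `value`) and the function names are hypothetical, chosen only for illustration; no particular product works exactly this way.

```python
from collections import defaultdict

# Hypothetical event record: {"device": "d1", "kind": "temp", "value": 71.5}

def ship_complete(events):
    """Complete log: every event goes upstream unchanged."""
    return list(events)

def ship_aggregates(events):
    """Periodic aggregates: one summary row per (device, kind) for the batch."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for e in events:
        key = (e["device"], e["kind"])
        counts[key] += 1
        totals[key] += e["value"]
    return [
        {"device": d, "kind": k, "count": counts[(d, k)],
         "mean": totals[(d, k)] / counts[(d, k)]}
        for (d, k) in sorted(counts)
    ]

def ship_significant(events, is_significant):
    """Significant events only: the edge node decides what matters."""
    return [e for e in events if is_significant(e)]

batch = [
    {"device": "d1", "kind": "temp", "value": 70.0},
    {"device": "d1", "kind": "temp", "value": 72.0},
    {"device": "d2", "kind": "temp", "value": 120.0},
]
summary = ship_aggregates(batch)
alerts = ship_significant(batch, lambda e: e["value"] > 100)
```

The third option is the one that moves a modeling decision to the edge, since `is_significant` is itself a rule or model that somebody has to maintain.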

The architectures for shipping, receiving and analyzing such data are in many cases immature. That’s obvious if there’s any kind of streaming involved, or if analysis is done in Spark. Ditto if there’s anything we might call “non-tabular business intelligence”. As this stuff matures, it will in many cases fit very well with today’s cloud thinking. But in any case — it needs to mature.

Truth be told, even the relational case is immature, in that it can easily rely on what I called:

data warehouses (perhaps really data marts) that are updated in human real-time

That quote is from a recent post about Kudu, which:

Is designed for exactly that use case.
Went GA early this year.

As always, technology is in flux.

Related links

Interana is another example of very new technology that seems applicable to these use cases.
My 2013 post on the future of IT architectures still rings true.

Cloudera Altus

Curt Monash — Wed, 14 Jun 2017 13:12:48 +0000

I talked with Cloudera before the recent release of Altus. In simplest terms, Cloudera’s cloud strategy aspires to:

Provide all the important advantages of on-premises Cloudera.
Provide all the important advantages of native cloud offerings such as Amazon EMR (Elastic MapReduce), or at least come sufficiently close to that goal.
Benefit from customers’ desire to have on-premises and cloud deployments that work:
- Alike in any case.
- Together, to the extent that that makes use-case sense.

In other words, Cloudera is porting its software to an important new platform.* And this port isn’t complete yet, in that Altus is geared only for certain workloads. Specifically, Altus is focused on “data pipelines”, aka data transformation, aka “data processing”, aka new-age ETL (Extract/Transform/Load). (Other kinds of workload are on the roadmap, including several different styles of Impala use.) So what about that is particularly interesting? Well, let’s drill down.

*Or, if you prefer, improving on early versions of the port.

Since so much of the Hadoop and Spark stacks is open source, competition often isn’t based on core product architecture or features, but rather on factors such as:

Ease of management. This one is nuanced in the case of cloud/Altus. For starters:
- One of Cloudera’s main areas of differentiation has always been Cloudera Manager.
- Cloudera Director was Cloudera’s first foray into cloud-specific management.
- Cloudera Altus features easier/simpler management than Cloudera Director, meant to be analogous to native Amazon management tools, and good-enough for use cases that don’t require strenuous optimization.
- Cloudera Altus also includes an optional workload analyzer, in slight conflict with other parts of the Altus story. More on that below.
Ease of development. Frankly, this rarely seems to come up as a differentiator in the Hadoop/Spark world, various “notebook” offerings such as Databricks’ or Cloudera’s notwithstanding.
Price. When price is the major determinant, Cloudera is sad.
Open source purity. Ditto. But at most enterprises — at least those with hefty IT budgets — emphasis on open source purity either is a proxy for price shopping, or else boils down to largely bogus concerns about vendor lock-in.

Of course, “core” kinds of considerations are present to some extent too, including:

Performance, concurrency, etc. I no longer hear many allegations of differences in across-the-board Hadoop performance. But the subject does arise in specific areas, most obviously in analytic SQL processing. It arises in the case of Altus as well, in that Cloudera improved in a couple of areas that it concedes were previously Amazon EMR advantages, namely:
- Interacting with S3 data stores.
- Spinning instances up and down.
Reliability and data safety. Cloudera mentioned that it did some work so as to be comfortable with S3’s eventual consistency model.

Recently, Cloudera has succeeded at blowing security up into a major competitive consideration. Of course, they’re trying that with Altus as well. Much of the Cloudera Altus story is the usual — rah-rah Cloudera security, Sentry, Kerberos everywhere, etc. But there’s one aspect that I find to be simple yet really interesting:

Cloudera Altus doesn’t manage data for you.
Rather, it launches and manages jobs on a separate Hadoop cluster.

Thus, there are very few new security risks to running Cloudera Altus, beyond whatever risks are inherent to running any version of Hadoop in the public cloud.

Where things get a bit more complicated is some features for workload analysis.

Cloudera recently introduced some capabilities for on-the-fly trouble-shooting. That’s fine.
Cloudera has also now announced an offline workload analyzer, which compares actual metrics computed from your log files to “normal” ones from well-running jobs. For that, you really do have to ship information to a separate cluster managed by Cloudera.

The information shipped is logs rather than actual query results or raw data. In theory, an attacker who had all those logs could conceivably make inferences about the data itself; but in practice, that doesn’t seem like an important security risk at all.

So is this an odd situation where that strategy works, or could what we might call light-touch managed services turn out to be widespread and important? That’s a good question to address in a separate post.

Interana

Curt Monash — Mon, 17 Apr 2017 10:10:41 +0000

Interana has an interesting story, in technology and business model alike. For starters:

Interana does ad-hoc event series analytics, which they call “interactive behavioral analytics solutions”.
Interana has a full-stack analytic offering, including:
- Its own columnar DBMS …
- … which has a non-SQL DML (Data Manipulation Language) meant to handle event series a lot more fluently than SQL does, but which the user is never expected to learn because …
- … there also are BI-like visual analytics tools that support plenty of drilldown.
Interana sells all this to “product” departments rather than marketing, because marketing doesn’t sufficiently value Interana’s ad-hoc query flexibility.
Interana boasts >40 customers, with annual subscription fees ranging from high 5 figures to low 7 figures.

And to be clear — if we leave aside any questions of marketing-name sizzle, this really is business intelligence. The closest Interana comes to helping with predictive modeling is giving its ad-hoc users inspiration as to where they should focus their modeling attention.

Interana also has an interesting twist in its business model, which I hope can be used successfully by other enterprise software startups as well.

For now, at no extra charge, Interana will operate its software for you as a managed service. (A majority of Interana’s clients run the software on Amazon or Azure, where that kind of offering makes sense.)
However, presumably in connection with greater confidence in its software’s ease of administration, Interana will move this year toward unbundling the service as an extra-charge offering on top of the software itself.

The key to understanding Interana is its DML. Notes on that include:

Interana’s DML is focused on path analytics …
- … but Interana doesn’t like to use that phrase because it sounds too math-y and difficult.
- Interana may be the first company that’s ever told me it’s focused on providing a better nPath.
Primitives in Interana’s language — notwithstanding the company’s claim that it never ever intended to sell to marketing departments — include familiar web analytics concepts such as “session”, “funnel” and so on. (However, these are being renamed to more neutral terms such as “flow” in an upcoming version of the product.)
As typical example questions or analytic subjects, Interana offered:
- “Which are the most common products in shopping carts where time-to-checkout was greater than 30 minutes?”
- “Exactly which steps in the onboarding process result in the greatest user frustration?”
The Interana folks and I agree that Splunk is the most recent example of a new DML kicking off a significant company.
The most recent example I can think of in which a vendor hung its hat on a new DML that was a “visual programming language” is StreamBase, with EventFlow. That didn’t go all that well.
To use Founder/CTO Bobby Johnson’s summary term, the real goal of the Interana language is to describe a state machine, specifically one that produces (sets of) sequences of events (and the elapsed time between them).
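As an illustration of the "state machine over event sequences" idea, the shopping-cart question above can be sketched in plain Python. This is not Interana's DML, which the post does not show; the event tuple layout and the action names `add_to_cart` and `checkout` are assumptions made for the example.

```python
from collections import Counter

def slow_checkout_products(events, threshold_seconds=30 * 60):
    """Count products in carts whose time-to-checkout exceeded the threshold.

    events: iterable of (user, timestamp_seconds, action, product) tuples.
    Each user is tracked by a tiny state machine: no cart -> cart open -> checked out.
    """
    carts = {}  # user -> (time of first add, [products added])
    slow = Counter()
    for user, ts, action, product in sorted(events, key=lambda e: e[1]):
        if action == "add_to_cart":
            _, products = carts.setdefault(user, (ts, []))
            products.append(product)
        elif action == "checkout" and user in carts:
            start, products = carts.pop(user)
            if ts - start > threshold_seconds:
                slow.update(products)
    return slow.most_common()

events = [
    ("u1", 0, "add_to_cart", "book"),
    ("u1", 600, "add_to_cart", "lamp"),
    ("u1", 2400, "checkout", None),   # 40 minutes: slow
    ("u2", 100, "add_to_cart", "book"),
    ("u2", 400, "checkout", None),    # 5 minutes: not slow
    ("u3", 50, "add_to_cart", "book"),
    ("u3", 5000, "checkout", None),   # slow
]
result = slow_checkout_products(events)
```

Expressing the same thing in SQL typically takes self-joins or window functions over the event table, which is the fluency gap a dedicated event-series language is meant to close.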

Notes on Interana speeds & feeds include:

Interana only promises data freshness up to micro-batch latencies — i.e., a few minutes. (Obviously, this shuts them out of most network monitoring and devops use cases.)
Interana thinks it’s very important for query response time to max out at a low number of seconds. If necessary, the software will return approximate results rather than exact ones so as to meet this standard.
Interana installations and workloads to date have gotten as large as:
- 1-200 nodes.
- Trillions of rows, equating to 100s of TBs of data after compression/ >1 PB uncompressed.
- Billions of rows/events received per day.
- 100s of 1000s of (very sparse) columns.
- 1000s of named users.
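The trade of exactness for bounded response time can be sketched as sampling-based estimation. This is a generic illustration of the technique, not a description of how Interana samples; the function name and parameters are hypothetical.

```python
import random

def approximate_count(rows, predicate, sample_size, seed=0):
    """Estimate how many rows satisfy `predicate` from a uniform random sample.

    Falls back to an exact count when the sample would cover every row.
    """
    rows = list(rows)
    if sample_size >= len(rows):
        return float(sum(1 for r in rows if predicate(r)))
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample = rng.sample(rows, sample_size)
    hits = sum(1 for r in sample if predicate(r))
    # Scale the sample proportion up to the full population.
    return hits * len(rows) / sample_size

estimate = approximate_count(range(100_000), lambda r: r % 5 == 0, 10_000)
```

Here the true answer is 20,000, and a 10% sample should land within a few percent of it while touching a tenth of the rows.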

Although Interana’s original design point was spinning disk, most customers store their Interana data on flash.

Interana architecture choices include:

They’re serious about micro-batching.
- If the user’s data is naturally micro-batched — e.g. a new S3 bucket every few minutes — Interana works with that.
- Even if the customer’s data is streamed — e.g. via Kafka — Interana insists on micro-batching it.
They’re casual about schemas.
- Interana assumes data arrives with some kind of recognizable structure, via JSON, CSV or whatever.
  - Interana observes, correctly, that log data often is decently structured.
    - For example, if you’re receiving “phone home” pings from products you originally manufactured, you know what data structures to expect.
    - Interana calls this “logging with intent”.
  - Interana is fine with a certain amount of JSON (for example) schema change over time.
  - If your arriving data truly is a mess, then you need to calm it down via a pass through Splunk or whatever before sending it to Interana.
- JSON hierarchies turn into multi-part column names in the usual way.
- Interana supports one level of true nesting, and one level only; column values can be “lists”, but list values can’t be lists themselves.
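A minimal sketch of the flattening described above, assuming a dot as the separator in multi-part column names (the post only says "the usual way"). The function name is hypothetical, and how objects inside lists are treated is not covered by the post, so the sketch leaves them alone.

```python
def flatten(record, prefix=""):
    """Turn a JSON-like dict into a flat {column_name: value} mapping.

    Nested objects become multi-part names; lists are kept as column values,
    but a list containing another list is rejected (one level of nesting only).
    """
    columns = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            columns.update(flatten(value, name))
        elif isinstance(value, list):
            if any(isinstance(item, list) for item in value):
                raise ValueError(f"column {name!r}: lists of lists are not supported")
            columns[name] = list(value)
        else:
            columns[name] = value
    return columns

row = flatten({"user": {"id": 7, "geo": {"country": "US"}}, "tags": ["a", "b"]})
```

For this input, `row` has the three columns `user.id`, `user.geo.country` and `tags`.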

Finally, other Interana tech notes include:

Compression is a central design consideration …
- … especially but not only compression algorithms designed to deal with great sparseness, such as run-length encoding (RLE).
- Dictionary compression, in a strategy that is rarer than I once expected it to be, uses a global rather than shard-by-shard dictionary. The data Interana expects is of low-enough cardinality for this to be the better choice.
- Column data is sorted. A big part of the reason is of course to aid compression.
- Compression strategies are chosen automatically for each segment. Wholly automatically, I gather; you can’t tune the choice manually.
As you would think, Interana technically includes multiple data stores.
- Data first hits a write-optimized store. Unlike the case of Vertica, this WOS never is involved in answering queries.
- Asynchronously, the data is broken into columns, and banged to “disk”.
- Asynchronously again, the data is sorted.
- Queries run against sorted data, sorting recent blocks on-the-fly if necessary.
Interana lets you shard different replicas of the data according to different shard keys.
Interana is proud of the random sampling it does when serving approximate query results.
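The compression notes above (a global dictionary, sorted columns, run-length encoding) combine naturally. Here is a small sketch under stated assumptions: the function names and the shard layout are hypothetical, and real column stores pick among more encodings than these two.

```python
def dictionary_encode(values, dictionary):
    """Replace each value with an integer code from a shared (global) dictionary.

    New values are appended to the dictionary as they are first seen.
    """
    return [dictionary.setdefault(v, len(dictionary)) for v in values]

def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [tuple(r) for r in runs]

global_dict = {}  # one dictionary shared by every shard, not one per shard
shard_a = run_length_encode(
    dictionary_encode(sorted(["ios", "android", "ios", "ios"]), global_dict))
shard_b = run_length_encode(
    dictionary_encode(sorted(["web", "ios", "web"]), global_dict))
```

Sorting first is what makes the runs long, and the shared dictionary means the code for "ios" is the same in both shards, so codes can be compared across shards without decoding.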

Analyzing the right data

Curt Monash — Thu, 13 Apr 2017 12:05:43 +0000

0. A huge fraction of what’s important in analytics amounts to making sure that you are analyzing the right data. To a large extent, “the right data” means “the right subset of your data”.

1. In line with that theme:

Relational query languages, at their core, subset data. Yes, they all also do arithmetic, and many do more math or other processing than just that. But it all starts with the set theory.
Underscoring the power of this approach, other data architectures over which analytics is done usually wind up with SQL or “SQL-like” language access as well.

2. Business intelligence interfaces today don’t look that different from what we had in the 1980s or 1990s. The biggest visible* changes, in my opinion, have been in the realm of better drilldown, à la QlikView and then Tableau. Drilldown, of course, is the main UI for business analysts and end users to subset data themselves.

*I used the word “visible” on purpose. The advances at the back end have been enormous, and much of that redounds to the benefit of BI.

3. I wrote 2 1/2 years ago that sophisticated predictive modeling commonly fit the template:

Divide your data into clusters.
Model each cluster separately.

That continues to be tough work. Attempts to productize shortcuts have not caught fire.
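For readers who want the template spelled out, here is a deliberately tiny sketch. It assumes the clusters are already known (a caller-supplied `cluster_of` function stands in for a real clustering step) and fits an ordinary least-squares line per cluster; all names are hypothetical.

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def fit_per_cluster(rows, cluster_of):
    """Step 1: divide rows into clusters. Step 2: model each cluster separately."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[cluster_of(row)].append((row["x"], row["y"]))
    return {label: fit_line(points) for label, points in clusters.items()}

rows = (
    [{"segment": "a", "x": x, "y": 2 * x + 1} for x in range(3)]
    + [{"segment": "b", "x": x, "y": 10 - x} for x in range(3)]
)
models = fit_per_cluster(rows, lambda r: r["segment"])
```

A single line fitted to all six points would describe neither segment well, which is the motivation for the template. The hard part in practice is the step this sketch skips: finding the clusters.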

4. In an example of the previous point, anomaly management technology can, in theory, help shortcut any type of analytics, in that it tries to identify what parts of your data to focus on (and why). But it’s in its early days; none of the approaches to general anomaly management has gained much traction.

5. Marketers have vast amounts of information about us. It starts with every credit card transaction line item and a whole lot of web clicks. But it’s not clear how many of those (10s of) thousands of columns of data they actually use.

6. In some cases, the “right” amount of data to use may actually be tiny. Indeed, some statisticians claim that fewer than 10 data points may be enough to get a good model. I’m skeptical, at least as to the practical significance of such extreme figures. But on the more plausible side — if you’re hunting bad guys, it may not take very many separate facts before you have good evidence of collusion or fraud.

Internet fraud excepted, of course. Identifying that usually involves sifting through a lot of log entries.

7. All the needle-hunting in the world won’t help you unless what you seek is in the haystack somewhere.

Often, enterprises explicitly invest in getting more data.
Keeping everything you already generate is the obvious choice for most categories of data, but some of the lowest-value-per-bit logs may forever be thrown away.

8. Google is famously in the camp that there’s no such thing as too much data to analyze. For example, it famously uses >500 “signals” in judging the quality of potential search results. I don’t know how many separate data sources those signals are informed by, but surely there are a lot.

9. Few predictive modeling users demonstrate a need for vast data scaling. My support for that claim is a lot of anecdata. In particular:

Some predictive modeling techniques scale well. Some scale poorly. The level of pain around the “scale poorly” aspects of that seems to be fairly light (or “moderate” at worst). For example:
- In the previous technology generation, analytic DBMS and data warehouse appliance vendors tried hard to make statistical packages scale across their systems. Success was limited. Nobody seemed terribly upset.
- Cloudera’s Data Science Workbench messaging isn’t really scaling-centric.
Spark’s success in machine learning is rather rarely portrayed as centering on scaling. And even when it is, Spark basically runs in memory, so each Spark node isn’t processing all that much data.

10. Somewhere in this post — i.e. right here — let’s acknowledge that the right data to analyze may not be exactly what was initially stored. Data munging/wrangling/cleaning/preparation is often a big deal. Complicated forms of derived data can be important too.

11. Let’s also mention data marts. Basically, data marts subset and copy data, because the data will be easier to analyze in its copied form, or because they want to separate workloads between the original and copied data store.

If we assume the data is on spinning disks or even flash, then the need for that strategy declined long ago.
Suppose you want to keep data entirely in memory? Then you might indeed want to subset-and-copy it. But with so many memory-centric systems doing decent jobs of persistent storage too, there’s often a viable whole-dataset management alternative.

But notwithstanding the foregoing:

Security/access control can be a good reason for subset-and-copy.
So can other kinds of administrative simplification.

12. So what does this all suggest going forward? I believe:

Drilldown is and will remain central to BI. If your BI doesn’t support robust drilldown, you’re doing it wrong. “Real-time” use cases are not exceptions to this rule.
In a strong overlap with the previous point, drilldown is and will remain central to monitoring. Whatever monitoring means to you, the ability to pinpoint the specific source of interesting signals is crucial.
The previous point can be recast as saying that it’s crucial to identify, isolate and explain anomalies. Some version(s) of anomaly management will become a big deal.
SQL and “SQL-like” languages will remain integral to analytic processing for a long time.
Memory-centric analytic frameworks such as Spark will continue to win. The data size constraints imposed by memory-centric processing will rarely cause difficulties.

Related links

Other recent “unifying-theme” posts focused on monitoring and coordination.
My 2013 post on what matters in investigative analytics still holds up pretty well.

Introduction to SequoiaDB and SequoiaCM

Curt Monash — Sun, 12 Mar 2017 18:19:49 +0000

For starters, let me say:

SequoiaDB, the company, is my client.
SequoiaDB, the product, is the main product of SequoiaDB, the company.
SequoiaDB, the company, has another product line SequoiaCM, which subsumes SequoiaDB in content management use cases.
SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
… and is usually sold for RDBMS-like use cases …
… except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
SequoiaDB’s products are open source.
SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.

Also:

SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
- SequoiaDB’s founders were Chinese nationals.
- Most of them went back to China.
- Other employees to date have been entirely Chinese.
- Sales to date have been entirely in China, but SequoiaDB has international aspirations.
SequoiaDB has >100 employees, a large majority of whom are split fairly evenly between “engineering” and “implementation and technical support”.
SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)

Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.

While I usually think that the advantages of open source are overstated, in SequoiaDB’s case open source will have* an additional benefit when SequoiaDB does go international — it addresses any concerns somebody might have about using Chinese technology.

*Edit: Actually, this claim is overstated based on SequoiaDB’s current open source practices. Please see the comment thread below.

SequoiaDB’s technology story starts:

SequoiaDB is a layered DBMS.
It manages JSON via update-in-place. MVCC (Multi-Version Concurrency Control) is on the roadmap.
Indexes are B-tree.
Transparent sharding and elasticity happen in what by now is the industry-standard/best-practices way:
- There are many (typically 4096) logical partitions, many of which are assigned to each physical partition.
- If the number of physical partitions changes, logical partitions are reassigned accordingly.
Relational OLTP (OnLine Transaction Processing) functionality is achieved by using a kind of PostgreSQL front end.
Relational batch processing is done via SparkSQL.
There also is a block/LOB (Large OBject) storage engine meant for content management applications.
SequoiaCM boils down technically to:
- SequoiaDB, which is used to store JSON metadata about the LOBs …
- … and whose generic-DBMS coordination capabilities are also used over the block/LOB engine.
- A Java library focused on content management.
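The logical/physical partition scheme mentioned above can be sketched as follows. The hash function (CRC32) and the rebalancing policy are assumptions for illustration; the post does not say which SequoiaDB uses.

```python
import zlib
from collections import Counter

LOGICAL_PARTITIONS = 4096  # the "typically 4096" figure from the post

def logical_partition(key: str) -> int:
    """Hash a record key to a logical partition; this never changes for a key."""
    return zlib.crc32(key.encode("utf-8")) % LOGICAL_PARTITIONS

def initial_assignment(nodes):
    """Spread logical partitions round-robin over the physical nodes."""
    return {lp: nodes[lp % len(nodes)] for lp in range(LOGICAL_PARTITIONS)}

def add_node(assignment, new_node):
    """Hand the new node its fair share, taking partitions from overloaded nodes."""
    node_count = len(set(assignment.values()) | {new_node})
    target = len(assignment) // node_count
    load = Counter(assignment.values())
    updated = dict(assignment)
    moved = 0
    for lp in sorted(updated):
        if moved >= target:
            break
        owner = updated[lp]
        if load[owner] > target:
            updated[lp] = new_node
            load[owner] -= 1
            moved += 1
    return updated

before = initial_assignment(["node0", "node1", "node2", "node3"])
after = add_node(before, "node4")
changed = [lp for lp in before if before[lp] != after[lp]]
```

Going from four nodes to five moves 819 of the 4096 logical partitions, all of them to the new node. Keys never need rehashing, because only the logical-to-physical map changes.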

SequoiaDB’s relationship with PostgreSQL is complicated, but as best I understand SequoiaDB’s relational operations:

SQL parsing, optimization, and so on rely mainly on PostgreSQL code. (Of course, there are some hacks, such as to the optimizer’s cost functions.)
Actual data storage is done via SequoiaDB’s JSON store, using PostgreSQL Foreign Data Wrappers. Each record goes in a separate JSON document. Locks, commits and so on — i.e. “write prevention” — are handled by the JSON store.
PostgreSQL’s own storage engine is actually part of the stack, but only to manage temp space and the like.

PostgreSQL stored procedures are already in the SequoiaDB product. Triggers and referential integrity are not. Neither, so far as I can tell, are PostgreSQL’s datatype extensibility capabilities.

I neglected to ask how much of that remains true when SparkSQL is invoked.

SequoiaDB’s use cases to date seem to fall mainly into three groups:

Content management via SequoiaCM.
“Operational data lakes”.
Pretty generic replacement of legacy RDBMS.

Internet back-ends, however — and this is somewhat counter-intuitive for an open-source JSON store — are rare, at least among paying subscription customers. But SequoiaDB did tell me of one classic IoT (Internet of Things) application, with lots of devices “phoning home” and the results immediately feeding a JSON-based dashboard.

To understand SequoiaDB’s “operational data lake” story, it helps to understand the typical state of data warehousing at SequoiaDB’s customers and prospects, which isn’t great:

2-3 years of data, and not all the data even from that time period.
Only enough processing power to support structured business intelligence …
… and hence little opportunity for ad-hoc query.

SequoiaDB operational data lakes offer multiple improvements over that scenario:

They hold as much relational data as customers choose to dump there.
That data can be simply copied from operational stores, with no transformation.
Or if data arrives via JSON — from external organizations or micro-services as the case may be — the JSON can be stored unmodified as well.
Queries can be run straight against this data soup.
Of course, views can also be set up in advance to help with querying.

Views are particularly useful with what might be called slowly changing schemas. (I didn’t check whether what SequoiaDB is talking about matches precisely with the more common term “slowly changing dimensions”.) Each time the schema changes, a new table is created in SequoiaDB to receive copies of the data. If one wants to query against the parts of the database structure that didn’t change — well, a view can be established to allow for that.
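A sketch of that pattern, using Python's built-in sqlite3 module purely as a stand-in for SequoiaDB's PostgreSQL-based front end. The table, column and view names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per schema version, each holding copies of the operational data.
CREATE TABLE accounts_v1 (id INTEGER, name TEXT, branch TEXT);
CREATE TABLE accounts_v2 (id INTEGER, name TEXT, region TEXT, risk_score REAL);
INSERT INTO accounts_v1 VALUES (1, 'Li', 'Downtown');
INSERT INTO accounts_v2 VALUES (2, 'Wang', 'East', 0.3);

-- The view exposes only the columns that did not change across versions.
CREATE VIEW accounts_stable AS
    SELECT id, name FROM accounts_v1
    UNION ALL
    SELECT id, name FROM accounts_v2;
""")
rows = conn.execute("SELECT id, name FROM accounts_stable ORDER BY id").fetchall()
```

Queries against `accounts_stable` keep working when a third schema version arrives, provided the view is extended with one more `UNION ALL` branch.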

Finally, it seems that SequoiaCM uses are concentrated in what might be called “security and checking-up” areas, such as:

Photographs as part of an authentication process.
Video of in-person banking transactions, both for fraud prevention and for general service quality assurance.
Storage of security videos (for example from automated teller machines).

SequoiaCM deals seem to be bigger than other SequoiaDB ones, surely in part because the amounts of data managed are larger.

Rapid analytics

Curt Monash — Fri, 21 Oct 2016 14:17:04 +0000

“Real-time” technology excites people, and has for decades. Yet the actual, useful technology to meet “real-time” requirements remains immature, especially in cases which call for rapid human decision-making. Here are some notes on that conundrum.

1. I recently posted that “real-time” is getting real. But there are multiple technology challenges involved, including:

General streaming. Some of my posts on that subject are linked at the bottom of my August post on Flink.
Low-latency ingest of data into structures from which it can be immediately analyzed. That helps drive the (re)integration of operational data stores, analytic data stores, and other analytic support — e.g. via Spark.
Business intelligence that can be used quickly enough. This is a major ongoing challenge. My clients at Zoomdata may be thinking about this area more clearly than most, but even they are still in the early stages of providing what users need.
Advanced analytics that can be done quickly enough. Answers there may come through developments in anomaly management, but that area is still in its super-early days.
Alerting, which has been under-addressed for decades. Perhaps the anomaly management vendors will finally solve it.

2. In early 2011, I coined the phrase investigative analytics, about which I said three main things:

It is meant to contrast with “operational analytics”.
It is meant to conflate “several disciplines, namely”:
- Statistics, data mining, machine learning, and/or predictive analytics.
- The more research-oriented aspects of business intelligence tools.
- Analogous technologies as applied to non-tabular data types such as text or graph.
A simple definition would be “Seeking (previously unknown) patterns in data.”

Generally, that has held up pretty well, although “exploratory” is the more widely used term. But the investigative/operational dichotomy obscures one key fact, which is the central point of this post: There’s a widespread need for very rapid data investigation.

3. This is not just a niche need. I have numerous rapid-investigation use cases in mind, some already mentioned in my recent posts on anomaly management and real-time applications.

Network operations. This is my paradigmatic example.
- Data is zooming all over the place, in many formats and structures, among many kinds of devices. That’s log data, header data and payload data alike. Many kinds of problems can arise …
- … which operators want to diagnose and correct, in as few minutes as possible.
- Interfaces commonly include real-time business intelligence, some drilldown, and a lot of command-line options.
- I’ve written about various specifics, especially in connection with the vendors Splunk and Rocana.
Security and anti-fraud. Infosec and cyberfraud, to a considerable extent, are just common problems in network operations. Much of the response is necessarily automated — but the bad guys are always trying to outwit your automation. If you think they may have succeeded, you want to figure that out very, very fast.
Consumer promotion and engagement. Consumer marketers feel a great need for speed. Some of it is even genuine.
- If an online promotion is going badly (or particularly well), they can in theory react almost instantly. So they’d like to know almost instantly, perhaps via BI tools with great drilldown.
- The same is even truer in the case of social media eruptions and the like. Obviously, the tools here are heavily text-oriented.
- Call centers and even physical stores have some of the same aspects as internet consumer operations.
Consumer internet backends, for e-commerce, publishing, gaming or whatever. These cases combine and in some cases integrate the previous three points. For example, if you get a really absurd-looking business result, that could be your first indication of network malfunctions or automated fraud.
Industrial technology, such as factory operations, power/gas/water networks, vehicle fleets or oil rigs. Much as in IT networks, these contain a diversity of equipment — each now spewing its own logs — and have multiple possible modes of failure. More often than is the case in IT networks, you can recognize danger signs, then head off failure altogether via preventive maintenance. But when you can’t, it is crucial to identify the causes of failure fast.
General IoT (Internet of Things) operation. This covers several of the examples above, as well as cases in which you sell a lot of devices, have them “phone home”, and labor to keep that whole multi-owner network working.
National security. If I told you what I meant by this one, I’d have to … [redacted].

4. And then there’s the investment industry, which obviously needs very rapid analysis. When I was a stock analyst, I could be awakened by a phone call and told news that I would need to explain to 1000s of conference call listeners 20 minutes later. This was >30 years ago. The business moves yet faster today.

The investment industry has invested greatly in high-speed supporting technology for decades. That’s how Mike Bloomberg got so rich founding a vertical market tech business. But investment-oriented technology indeed remains a very vertical sector; little of it gets more broadly applied.

I think the reason may be that investing is about guesswork, while other use cases call for more definitive answers. In particular:

If you’re wrong 49.9% of the time in investing, you might still be a big winner.
In high-frequency trading, speed is paramount; you have to be faster than your competitors. In speed/accuracy trade-offs, speed wins.
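The arithmetic behind the first point can be sketched as follows. This is a minimal illustration under stated assumptions, not a trading model: the function name `expected_profit`, the fixed one-unit stake, the equal-sized gains and losses, and the number of bets are all hypothetical, and the bets are assumed to be independent.

```python
def expected_profit(p_right: float, gain: float, loss: float, n_bets: int) -> float:
    """Expected total profit from n_bets independent, fixed-size bets.

    p_right is the probability of being right on any one bet; gain and loss
    are the (assumed) amounts won or lost per bet.
    """
    per_bet = p_right * gain - (1.0 - p_right) * loss
    return per_bet * n_bets


# Hypothetical numbers: wrong 49.9% of the time, winning or losing 1 unit per bet.
total = expected_profit(p_right=0.501, gain=1.0, loss=1.0, n_bets=1_000_000)
print(round(total, 2))
```

Under these assumptions the edge per bet is tiny (0.002 units), but it adds up to roughly 2,000 units over a million bets. The sketch ignores costs, bet sizing and compounding, any of which can change the picture.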

5. Of course, it’s possible to overstate these requirements. As in all real-time discussions, one needs to think hard about:

How much speed is important in meeting users’ needs.
How much additional speed, if any, is important in satisfying users’ desires.

But overall, I have little doubt that rapid analytics is a legitimate area for technology advancement and growth.

Notes on anomaly management

Curt Monash — Mon, 10 Oct 2016 07:35:10 +0000

Then felt I like some watcher of the skies
When a new planet swims into his ken

— John Keats, “On First Looking Into Chapman’s Homer”

1. In June I wrote about why anomaly management is hard. Well, not only is it hard to do; it’s hard to talk about as well. One reason, I think, is that it’s hard to define what an anomaly is. And that’s a structural problem, not just a semantic one — if something is well enough understood to be easily described, then how much of an anomaly is it after all?

Artificial intelligence is famously hard to define for similar reasons.

“Anomaly management” and similar terms are not yet in the software marketing mainstream, and may never be. But naming aside, the actual subject matter is important.

2. Anomaly analysis is clearly at the heart of several sectors, including:

IT operations
Factory and other physical-plant operations
Security
Anti-fraud
Anti-terrorism

Each of those areas features one or both of the frameworks:

Surprises are likely to be bad.
Coincidences are likely to be suspicious.

So if you want to identify, understand, avert and/or remediate bad stuff, data anomalies are the first place to look.

3. The “insights” promised by many analytics vendors — especially those who sell to marketing departments — are also often heralded by anomalies. Already in the 1970s, Walmart observed that red clothing sold particularly well in Omaha, while orange flew off the shelves in Syracuse. And so, in large college towns, they stocked their stores to the gills with clothing in the colors of the local football team. They also noticed that fancy dresses for little girls sold especially well in Hispanic communities … specifically for girls at the age of First Communion.

4. The examples in the previous point may be characterized as noteworthy correlations that surely are reflecting actual causality. (The beer/diapers story would be another example, if only it were true.) Formally, the same is probably true of most actionable anomalies. So “anomalies” are fairly similar to — or at least overlap heavily with — “statistically surprising observations”.

And I do mean “statistically”. As per my Keats quote above, we have a classical model of sudden-shock discovery — an astronomer finding a new planet, a radar operator seeing a blip on a screen, etc. But Keats’ poem is 200 years old this month. In this century, there’s a lot more number-crunching involved.

Please note: It is certainly not the case that anomalies are necessarily found via statistical techniques. But however they’re actually found, they would at least in theory score as positives via various statistical tests.
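One simple way to make "statistically surprising" concrete is a z-score: how many standard deviations an observation sits from its history. The sketch below is only an illustration under assumptions. The sales figures are invented, the function name `surprise_score` is hypothetical, and real anomaly tests are usually more sophisticated than this.

```python
import statistics


def surprise_score(history, observation):
    """Return how many sample standard deviations `observation` is from the mean of `history`."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A perfectly flat history makes any deviation infinitely surprising.
        return 0.0 if observation == mean else float("inf")
    return abs(observation - mean) / stdev


# Invented weekly unit sales for one item in one store.
weekly_sales = [102, 98, 101, 97, 103, 99, 100, 100]

typical = surprise_score(weekly_sales, 101)   # close to the usual level
unusual = surprise_score(weekly_sales, 140)   # far outside the usual range
print(typical, unusual)
```

An observation found by a human eyeballing a report would, in this framing, still show up as a high score if it were tested afterwards.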

5. There are quite a few steps to the anomaly-surfacing process, including but not limited to:

Collecting the raw data in a timely manner.
Identifying candidate signals (and differentiating them from noise).
Communicating surprising signals to the most eager consumers (and letting them do their own analysis).
Giving more tightly-curated information to a broader audience.

Hence many different kinds of vendor can have roles to play.

6. One vendor that has influenced my thinking about data anomalies is Nestlogic, an early-stage start-up with which I’m heavily involved. Here “heavily involved” includes:

I own more stock in Nestlogic than I have in any other company of which I wasn’t the principal founder.
I’m in close contact with founder/CEO David Gruzman.
I’ve personally written much of Nestlogic’s website content.

Nestlogic’s claims include:

For machine-generated data, anomalies are likely to be found in data segments, not individual records. (Here a “segment” might be all the data coming from a particular set of sources in a particular period of time.)
The more general your approach to anomaly detection, the better, for at least three reasons:
- In adversarial use cases, the hacker/fraudster/terrorist/whatever might deliberately deviate from previous patterns, so as to evade detection by previously-established filters.
- When there are multiple things to discover, one anomaly can mask another, until it is detected and adjusted for.
- (This point isn’t specific to anomaly management) More general tools can mean that an enterprise has fewer different new tools to adopt.
Anomalies boil down to surprising data profiles, so anomaly detection bears a slight resemblance to the data profiling approaches used in data quality, data integration and query optimization.
Different anomaly management users need very different kinds of UI. Less technical ones may want clear, simple alerts, with a minimum of false positives. Others may use anomaly management as a jumping-off point for investigative analytics and/or human real-time operational control.

I find these claims persuasive enough to help Nestlogic with its marketing and fund-raising, and to cite them in my post here. Still, please understand that they are Nestlogic’s and David’s assertions, not my own.
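To illustrate the segment-versus-record idea in general terms, here is a minimal sketch that profiles each segment and compares it with a baseline profile. It is not Nestlogic's method, which is not described here. The distance measure (total variation distance), the threshold of 0.2, the host names and the status codes are all assumptions chosen for illustration.

```python
from collections import Counter


def profile(records):
    """Summarize a segment as the relative frequency of each distinct value."""
    counts = Counter(records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def profile_distance(p, q):
    """Total variation distance between two categorical profiles (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical baseline: 95% of requests succeed, 5% return 404.
baseline = profile(["200"] * 95 + ["404"] * 5)

# Hypothetical segments, keyed by (source, time window).
segments = {
    ("host-a", "10:00"): ["200"] * 19 + ["404"] * 1,
    ("host-b", "10:00"): ["200"] * 12 + ["404"] * 8,
}

scores = {key: profile_distance(profile(recs), baseline) for key, recs in segments.items()}
flagged = [key for key, score in scores.items() if score > 0.2]  # assumed threshold
print(flagged)
```

No single record in the second segment is remarkable, since a 404 is an ordinary event. Only the segment's overall profile is surprising, which is the sense in which the anomaly lives in the segment rather than in any one record.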

Notes on the transition to the cloud

Curt Monash — Tue, 04 Oct 2016 02:22:21 +0000

1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:

The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
Software as a service, aka SaaS.
Co-location in off-premises data centers, aka colo.
On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.

Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.

This is a good example of Monash’s Laws of Commercial Semantics.

2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.

This fact now seems to be widely understood.

3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.

This is frequently a good reason for new — i.e. “greenfield” — apps to run in the cloud.

4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.

In connection with that point, it might be interesting to note:

In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
Other big users were hospitals and stockbrokers.
The US intelligence agencies are building out their own shared, dedicated cloud.

5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.

In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.

6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.

7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:

Does your application need to run close to your customers’ largest databases?
Do your customers still avoid the public cloud?

If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.

8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:

Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
Departments don’t necessarily care about issues that give central IT pause.
Departments sometimes buy things that are available only via remote delivery, e.g. narrowly focused SaaS applications or market data.

I discussed some of this in my recent post on vendor lock-in.

9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.

Those concerns seem to have been largely alleviated.

10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.

11. If you think about system requirements:

There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
Server-side computing, however, is more homogeneous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.

I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.

So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.

Otherwise, clusters will probably be organized by database, in the most expansive sense of the term. For example, there could be separate clusters for:

Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
Relational data warehousing, whether in an analytic RDBMS or otherwise.
Log files, perhaps managed in Hadoop or Splunk.
Your website and other internet back-ends, perhaps running over NoSQL data stores.
Text documents managed by some kind of search engine.
Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)

Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.

“Real-time” is getting real

Curt Monash — Tue, 06 Sep 2016 06:43:40 +0000

I’ve been an analyst for 35 years, and debates about “real-time” technology have run through my whole career. Some of those debates are by now pretty much settled. In particular:

Yes, interactive computer response is crucial.
- Into the 1980s, many apps were batch-only. Demand for such apps dried up.
- Business intelligence should occur at interactive speeds, which is a major reason that there’s a market for high-performance analytic RDBMS.
Theoretical arguments about “true” real-time vs. near-real-time are often pointless.
- What matters in most cases is human users’ perceptions of speed.
- Most of the exceptions to that rule occur when machines race other machines, for example in automated bidding (high frequency trading or otherwise) or in network security.

A big issue that does remain open is: How fresh does data need to be? My preferred summary answer is: As fresh as is needed to support the best decision-making. I think that formulation starts with several advantages:

It respects the obvious point that different use cases require different levels of data freshness.
It cautions against people who think they need fresh information but aren’t in a position to use it. (Such users have driven much bogus “real-time” demand in the past.)
It covers cases of both human and automated decision-making.

Straightforward applications of this principle include:

In “buying race” situations such as high-frequency trading, data needs to be as fresh as the other guy’s, and preferably even fresher.
Supply-chain systems generally need data that’s fresh to within a few hours; in some cases, sub-hour freshness is needed.
That’s a good standard for many desktop business intelligence scenarios as well.
Equipment-monitoring systems’ need for data freshness depends on how quickly catastrophic or cascading failures can occur or be averted.
- Different specific cases call for wildly different levels of data freshness.
- When equipment is well-instrumented with sensors, freshness requirements can be easy to meet.

E-commerce and other internet interaction scenarios can be more complicated, but it seems safe to say:

Recommenders/personalizers should take into account information from the current session.
Try very hard to give customers correct information about merchandise availability or pricing.

In meeting freshness requirements, multiple technical challenges can come into play.

Traditional batch aggregation is too slow for some analytic needs. That’s a core reason for having an analytic RDBMS.
Traditional data integration/movement pipelines can also be too slow. That’s a basis for short-request-capable data stores to also capture some analytic workloads. E.g., this is central to MemSQL’s pitch, and to some NoSQL applications as well.
Scoring models at interactive speeds is often easy. Retraining them quickly is much harder, and at this point only rarely done.
OLTP (OnLine Transaction Processing) guarantees adequate data freshness …
… except in scenarios where the transactions themselves are too slow. Questionably-consistent systems — commonly NoSQL — can usually meet performance requirements, but might have issues with the freshness of accurate data.
Older generations of streaming technology disappointed. The current generation is still maturing.

Based on all that, what technology investments should you be making, in order to meet “real-time” needs? My answers start:

Customer communications, online or telephonic as the case may be, should be based on accurate data. In particular:
- If your OLTP data is somehow siloed away from your phone support data, fix that immediately, if not sooner. (Fixing it 5-15 years ago would be ideal.)
- If your eventual consistency is so eventual that customers notice, fix it ASAP.
If you invest in predictive analytics/machine learning to support your recommenders/personalizers, then your models should at least be scored on fresh data.
- If your models don’t support that, reformulate them.
- If your data pipeline doesn’t support that, rebuild it.
- Actual high-speed retraining of models isn’t an immediate need. But if you’re going to have to transition to that anyway, consider doing so early and getting it over with.
Your BI should have great drilldown and exploration. Find the most active users of such functionality in your enterprise, even if — especially if! — they built some kind of departmental analytic system outside the enterprise mainstream. Ask them what, if anything, they need that they don’t have. Respond accordingly.
Whatever expensive and complex equipment you have, slather it with sensors. Spend a bit of research effort on seeing whether the resulting sensor logs can be made useful.
- Please note that this applies to vehicles and to fixed objects (e.g. buildings, pipelines), as well as to traditional industrial machinery.
- It also applies to any products you make which draw electric power.

So yes — I think “real-time” has finally become pretty real.

More about Databricks and Spark

Curt Monash — Sun, 21 Aug 2016 20:36:15 +0000

Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:

What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third tier support. Fair enough. But distributors of stacks including Spark, whether on-premises, in the cloud, or both, may in many cases be viewed as competitors to Databricks’ cloud-only service. So why should Databricks help them?
Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.

Ali also walked me through customer use cases and adoption in wonderful detail. In general:

A large majority of Databricks customers have machine learning use cases.
Predicting and preventing user/customer churn is a huge issue across multiple market sectors.

The story on those sectors, per Ali, is:

First, Databricks penetrated ad-tech, for use cases such as ad selection.
Databricks’ second market was “mass media”.
- Disclosed examples include Viacom and NBC/Universal.
- There are “many” specific use cases. Personalization is a big one.
- Conviva-style video operations optimization is a use case for several customers, naturally including Conviva. (Reminder: Conviva was Ion Stoica’s previous company.)
Health care came third.
- Use cases here seem to be concentrated on a variety of approaches to predict patient outcomes.
- Analytic techniques often combine machine learning with traditional statistics.
- Security is a major requirement in this sector; fortunately, Databricks believes it excels at that.
Next came what he calls “industrial IT”. This group includes cool examples such as:
- Finding oil.
- Predictive maintenance of wind turbines.
- Predicting weather based on sensor data.
Finally (for now), there’s financial services. Of course, “financial services” comprises a variety of quite different business segments. Example use cases include:
- Credit card marketing.
- Investment analysis (based on expensive third-party data sets that are already in the cloud).
- Anti-fraud.

At an unspecified place in the timeline is national security, for a use case very similar to anti-fraud — identifying communities of bad people. Graph analytics plays a big role here.

And finally, of course we discussed some technical stuff, in philosophy, futures and usage as the case may be. In particular, Ali stressed that Spark 2.0 is the first release that “breaks”/changes the APIs; hence the release number. It is now the case that:

There’s a single API for batch and streaming alike, and for machine learning “too”. This is DataFrames/Datasets. In this API …
… everything is a table. That said:
- Tables can be nested.
- Tables can be infinitely large, in which case you’re doing streaming.
Based on this, Ali thinks Spark 2.0 is now really a streaming engine.
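The "a stream is just an infinitely large table" idea can be illustrated without Spark at all. The sketch below is plain Python, not the Spark API: `running_counts`, `endless_clicks` and the page names are hypothetical. It shows one group-by-count routine applied unchanged to a bounded list of rows and to a generator that never ends.

```python
from collections import Counter
from itertools import count, islice


def running_counts(rows, key):
    """Yield the updated group-by count after each incoming row.

    `rows` may be a finite list (a bounded table) or an endless generator
    (an unbounded table); the logic is the same either way.
    """
    totals = Counter()
    for row in rows:
        totals[row[key]] += 1
        yield dict(totals)


# Bounded table: a "batch" job simply keeps the final state.
batch_rows = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]
*_, batch_result = running_counts(batch_rows, "page")


def endless_clicks():
    """Hypothetical unbounded source that alternates between two pages forever."""
    for i in count():
        yield {"page": "home" if i % 2 == 0 else "cart"}


# Unbounded table: there is no final state, only the state so far (here, after 5 rows).
stream_snapshot = next(islice(running_counts(endless_clicks(), "page"), 4, None))
print(batch_result, stream_snapshot)
```

The only difference between the two cases is when the answer is read: once at the end for the bounded table, and repeatedly along the way for the unbounded one.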

Other tidbits included:

Ali said that every Databricks customer uses SQL. No exceptions.
- Indeed, a “number” of customers are using business intelligence tools. Therefore …
- … Databricks is licensing connector technology from Simba.
They’re working on model serving, with a REST API, rather than just model building. This was demoed at the recent Spark Summit, but is still in the “nascent” stage.
Ali insists that every streaming system with good performance does some kind of micro-batching under the hood. But the Spark programmers no longer need to take that directly into account. (In earlier versions, programmatic window sizes needed to be integer multiples of the low-level system’s chosen interval.)
In the future, when Databricks runs on more than just the Amazon cloud, Databricks customers will of course have cloud-to-cloud portability.
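The micro-batching point above, including the older constraint that window sizes be integer multiples of the system's interval, can be sketched as follows. This is a toy illustration in plain Python and not how Spark implements anything: the function names, the one-second interval and the sample events are assumptions.

```python
def micro_batches(events, interval):
    """Group (timestamp_seconds, value) events into consecutive batches of `interval` seconds."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]


def window_is_aligned(window, interval):
    """True if a window length is an integer multiple of the batch interval."""
    return window % interval == 0


# Hypothetical events arriving over about three seconds.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
batches = micro_batches(events, interval=1)
print(batches)
print(window_is_aligned(10, 2), window_is_aligned(5, 2))
```

When windows are built by combining whole batches, a window that is not a multiple of the interval would have to split a batch, which is why the alignment check mattered in that style of API.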