RDF and graphs – DBMS 2 : DataBase Management System Services

MongoDB 3.4 and “multimodel” query

Curt Monash — Wed, 23 Nov 2016 12:01:25 +0000

“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the train as well, but they’ve taken a clear and interesting stance:

A query layer with multiple ways to query and analyze data.
A separate data storage layer in which you have a choice of data storage engines …
… each of which has the same logical (JSON-based) data structure.

When I pointed out that it would make sense to call this “multimodel query” — because the storage isn’t “multimodel” at all — they quickly agreed.

To be clear: While there are multiple ways to read data in MongoDB, there’s still only one way to write it. Letting that sink in helps clear up confusion as to what about MongoDB is or isn’t “multimodel”. To spell that out a bit further:

In query, MongoDB mixes multiple paradigms for DML (Data Manipulation Language). The main one is of course JSON.
When writing, the DML paradigm is unmixed — it’s just JSON.

Further, MongoDB query DML statements can be mixed with analytic functions rooted in Spark.

The main ways to query data in MongoDB, to my knowledge, are:

Native/JSON. Duh.
SQL.
- MongoDB has used MySQL as a guide to what SQL coverage they think the market is calling for.
- More to the point, they’re trying to provide enough SQL so that standard business intelligence tools work well (enough) against MongoDB.
- I neglected to ask why this changed from MongoDB’s adamantly non-SQL approach of 2 1/2 years ago.
Search.
- MongoDB has been adding text search features for a few releases.
- MongoDB’s newest search feature revolves around “facets”, in the Endeca sense of the term. MongoDB characterizes it as a kind of text-oriented GroupBy.
Graph. MongoDB just introduced a kind of recursive join capability, which is useful for detecting multi-hop relationships (e.g. ancestor/descendant rather than just parent/child). MongoDB declares that the “graph” box is thereby checked.
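To make the “recursive join” idea concrete, here is a minimal sketch in plain Python. It is not MongoDB’s implementation or query syntax; the `people` records, the `parent` field and the `ancestors` helper are hypothetical names made up for the illustration. The point is only that a multi-hop relationship falls out of joining a collection to itself repeatedly.

```python
from collections import deque

# Hypothetical parent/child documents; each record names its direct parent.
people = [
    {"name": "Ada", "parent": None},
    {"name": "Ben", "parent": "Ada"},
    {"name": "Cy", "parent": "Ben"},
    {"name": "Dee", "parent": "Cy"},
]


def ancestors(docs, start, max_depth=None):
    """Repeatedly join the collection to itself on parent -> name.

    Returns (name, depth) pairs, where depth 0 is the direct parent.
    """
    by_name = {doc["name"]: doc for doc in docs}
    found = []
    seen = {start}
    queue = deque([(by_name[start]["parent"], 0)])
    while queue:
        name, depth = queue.popleft()
        if name is None or name in seen:
            continue  # reached a root, or a cycle in the data
        if max_depth is not None and depth > max_depth:
            continue  # stop following hops beyond the requested depth
        seen.add(name)
        found.append((name, depth))
        queue.append((by_name[name]["parent"], depth + 1))
    return found
```

Calling `ancestors(people, "Dee")` walks parent links until it runs out of them, while `max_depth=0` degenerates to the ordinary one-hop parent/child join.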

Three years ago, in an overview of layered and multi-DML architectures, I suggested:

Layered DBMS and multimodel functionality fit well together.
Both carried performance costs.
In most cases, the costs could be affordable.

MongoDB seems to have bought strongly into that view on the query side — which is, of course, exactly the right way for them to have started.

Adversarial analytics and other topics

Curt Monash — Mon, 30 May 2016 10:15:33 +0000

Five years ago, in a taxonomy of analytic business benefits, I wrote:

A large fraction of all analytic efforts ultimately serve one or more of three purposes:

Marketing

Problem and anomaly detection and diagnosis

Planning and optimization

That continues to be true today. Now let’s add a bit of spin.

1. A large fraction of analytics is adversarial. In particular:

Many of the analytics companies I talk with tell me that they have important use cases in security, anti-fraud or both.
Click fraud steals a large fraction of the revenue in online advertising and other promotion. Combating it is a major application need.
Spam is another huge, ongoing fight.
- When Google et al. fight web spammers — which is of course a great part of what web search engine developers do — they’re engaged in adversarial information retrieval.
- Blog comment spam is still a problem, even though the vast majority of instances can now be caught.
- Ditto for email.
There’s an adversarial aspect to algorithmic trading. You’re trying to beat other investors. What’s more, they’re trying to identify your trading activity, so you’re trying to obscure it. Etc.
Unfortunately, unfree countries can deploy analytics to identify attempts to evade censorship. I plan to post much more on that point soon.
Similarly, de-anonymization can be adversarial.
Analytics supporting national security often have an adversarial aspect.
Banks deploy analytics to combat money-laundering.

Adversarial analytics are inherently difficult, because your adversary actively wants you to get the wrong answer. Approaches to overcome the difficulties include:

Deploying lots of data. Email spam was only defeated by large providers who processed lots of email and hence could see when substantially the same email was sent to many victims at once. (By the way, that’s why “spear-phishing” still works. Malicious email sent to only one or a few victims still can’t be stopped.)
Using unusual analytic approaches. For example, graph analytics are used heavily in adversarial situations, even though they have lighter adoption otherwise.
Using many analytic tests. For example, Google famously has 100s (at least) of sub-algorithms contributing to its search rankings. The idea here is that even the cleverest adversary might find it hard to perfectly simulate innocent behavior.
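The “lots of data” point can be illustrated with a toy sketch: a provider that sees many mailboxes can count how many distinct recipients got substantially the same message. Everything here is an assumption made for illustration, including the crude normalization, the `min_recipients` threshold and the sample messages; real spam filters are far more elaborate.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(body):
    """Crude normalization so trivially varied copies hash the same."""
    text = body.lower()
    text = re.sub(r"\d+", "#", text)       # mask numbers such as order IDs
    text = re.sub(r"[^a-z#]+", " ", text)  # drop punctuation, collapse spaces
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def bulk_fingerprints(messages, min_recipients=3):
    """Return fingerprints seen by at least min_recipients distinct recipients.

    messages is an iterable of (recipient, body) pairs.
    """
    recipients = defaultdict(set)
    for recipient, body in messages:
        recipients[fingerprint(body)].add(recipient)
    return {fp for fp, who in recipients.items() if len(who) >= min_recipients}


# Hypothetical traffic: three near-identical mass mailings and one one-off.
messages = [
    ("a@example.com", "You WON prize 1234! Click now."),
    ("b@example.com", "you won prize 9876 -- click now"),
    ("c@example.com", "You won prize 5555! Click now!!"),
    ("d@example.com", "Hi Dana, the Q3 invoice for your account is attached."),
]
flagged = bulk_fingerprints(messages)
```

Note that the one-off message is never flagged, which is the same reason a message sent to only one or a few victims slips past this kind of test.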

2. I was long a skeptic of “real-time” analytics, although I always made exceptions for a few use cases. (Indeed, I actually used a form of real-time business intelligence when I entered the private sector in 1981, namely stock quote machines.) Recently, however, the stuff has gotten more-or-less real. And so, in a post focused on data models, I highlighted some use cases, including:

It is increasingly common for predictive decisions to be made at [real-timeish] speeds. (That’s what recommenders and personalizers do.) Ideally, such decisions can be based on fresh and historical data alike.

The long-standing desire for business intelligence to operate on super-fresh data is, increasingly, making sense, as we get ever more stuff to monitor. However …

… most such analysis should look at historical data as well.

Streaming technology is supplying ever more fresh data.

Let’s now tie those comments into the analytic use case trichotomy above. From the standpoint of mainstream (or early-life/future-mainstream) analytic technologies, I think much of the low-latency action is in two areas:

Recommenders/personalizers.
Monitoring and troubleshooting networked equipment. This is generally an exercise in anomaly detection and interpretation.

Beyond that:

At sufficiently large online companies, there’s a role for low-latency marketing decision support.
Low-latency marketing-oriented BI can also help highlight system malfunctions.
Investments/trading has a huge low-latency aspect, but that’s somewhat apart from the analytic mainstream. (And it doesn’t fit well into my trichotomy anyway.)
Also not in the analytic mainstream are the use cases for low-latency (re)planning and optimization.

Related links

My April, 2015 post Which analytic technology problems are important to solve for whom? has a round-up of possibly relevant links.

IT-centric notes on the future of health care

Curt Monash — Tue, 26 May 2015 05:02:09 +0000

It’s difficult to project the rate of IT change in health care, because:

Health care is suffused with technology — IT, medical device and biotech alike — and hence has the potential for rapid change. However, it is also the case that …
… health care is heavily bureaucratic, political and regulated.

Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:

The human-generated part of what’s in ordinary paper health records today, but across a patient’s entire lifetime. This of course includes notes created by doctors and other care-givers.
Large amounts of machine-generated data, including:
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
  - Most tests exploit electronic technology. Progress in electronics is intense.
  - Biomedical research is itself intense.
  - In particular, most research technologies (for example gene sequencing) can be made cheap enough over time to be affordable clinically.
- The output of consumer health-monitoring devices — e.g. Fitbit and its successors. The buzzword here is “quantified self”, but what it boils down to is that every moment of our lives will be measured and recorded.

The vastly greater amounts of data cited above will allow for greatly changed analytics.

Right now, medical decisions are made based on research that looks at a few data points each for a specially-recruited sample of patients, then draws conclusions based on simplistic and questionable statistical methods.
More sophisticated analytic methods are commonly used, but almost always still to aid in the discovery and formation of hypotheses that will then be validated, if at all, using the bad old analytic techniques.
State of the art predictive modeling, applied to vastly more data, will surely yield greatly better results.

And so I believe that health care itself will be revolutionized.

Diagnosis will be much more accurate, pretty much across the board, except in those limited areas where it’s already excellent today.
Medication regimens will be much more personalized. (Pharma manufacturing may have to change greatly as a result.) So will other treatments. So will diet/fitness regimens.
The vulnerable (elderly, hospital patients) will be more accurately and comprehensively monitored. Also, their care will likely be aided by robotics.
Some of the same things will be true of infants and toddlers. (In other cases, they get such close attention today that I can’t imagine how it could be greatly increased.)

I believe that this will all happen because I believe that it will make health care vastly more successful. And if I’m right about that, no obstacles will be able to prevent it from coming into play — not cost (which will keep going down in a quasi-Moore’s-Law way), not bureaucratic inertia (although that will continue to slow things greatly), and not privacy fears (despite the challenges cited below).

So what are the IT implications of all this?

I already mentioned the need for new (or newly-used) kinds of predictive modeling.
Probably in association with those, event detection — which in many but not all cases will amount to anomaly detection — will be huge. If one goal is to let the elderly and ailing live independently, but receive help when it’s needed — well, recognizing when that help is needed will be crucial. Similar dynamics will occur in hospitals.
And in support of that, there will be a great amount of monitoring, and hence strong demands upon sensors and recognition. Potentially, all five human senses will be mimicked, among others. These technologies will become even more important in health care if I’m right that robotics will play a big role.
Data quality will be a major challenge, especially in the doctors’-notes parts of health records. Reasons start:
- Different medical professionals might evaluate the same situation differently; diagnosis is a craft rather than a dumb, repeatable skill.
- If entries are selected from a predefined set of options, none may be a perfect match to the doctor’s actual opinion.
- Doctors often say what’s needful to have their decisions (care, tests, etc.) approved, whether or not it precisely matches what they really think. Thus, there are significant incentives to enter bad data.
- Free-text data is more central to health care than to many other application areas, and text data is inherently dirty.
- Health records are decades later than many other applications in moving from paper to IT.
Data integration problems will also be and indeed already are huge, because different health care providers have addressed the tough challenges of record-keeping in different ways.

As for data management — well, almost everything discussed in this blog could come into play.

A person’s entire medical record resembles the kind of mess increasingly often dumped these days into NoSQL — typically MongoDB, Cassandra, or HBase.
There are plenty of business-transaction records in the mix, of the kind that have long been managed by RDBMS.
There are a whole lot of diverse machines in the mix, and managing the data to keep such a menagerie running is commonly the job of Splunk or streaming-enhanced Hadoop.
There’s a lot of free text in medical records. Also images, video and so on.
Since graph analytics is used in research today, it might at some point make its way into clinical use.

Finally, let me say:

Data-driven medicine cannot live up to its potential unless researchers can investigate data sets comprising private information of large numbers of people.
Researchers will not have the appropriate permissions unless privacy law moves toward a basis in data use, rather than exclusively regulating data possession.

Related links

The New York Times and Hacker News discussed the benefits of using your own medical records a couple months ago.
I wrote about the monitoring/early response aspects of health care in February, 2015.
Perhaps my most recent survey of privacy issues was in September, 2014.
A pretty good survey of the debate about statistical methods in medical research came out in December, 2013.

Data models

Curt Monash — Mon, 23 Feb 2015 03:08:10 +0000

7-10 years ago, I repeatedly argued the viewpoints:

Relational DBMS were the right choice in most cases.
Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
There were a variety of specialized use cases in which non-relational data models were best.

Since then, however:

Hadoop has flourished.
NoSQL has flourished.
Graph DBMS have matured somewhat.
Much of the action has shifted to machine-generated data, of which there are many kinds.

So it’s probably best to revisit all that in a somewhat organized way.

To make the subject somewhat manageable, I’ll focus on fielded data (i.e. data that represents values of something) rather than, for example, video or images. Fielded data always arrives as a string of bits, whose meaning boils down to a set of <name, value> pairs. Here by “string of bits” I mean mainly a single record or document (for example), although most of what I say can apply to a whole stream of data instead.

Important distinctions include:

Are the field names implicit or explicit? In relational use cases field names tend to be implicit, governed by the metadata. In some log files they may be space-savingly implicit as well. In other logs, XML streams, JSON streams and so on they are explicit.
If the field names are implicit, is any processing needed to recover them? Think Hadoop or Splunk acting on “dumb-looking” log data.
In any one record/document/whatever, are the field names unique? If not, then the current data model is not relational.
Are the field names the same from one record/document/whatever to the next? I.e., does the data fit into a consistent schema?
Is there a structure connecting the field names (and if so what kind)? E.g., hierarchical documents, or relational foreign keys.
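A few of these distinctions are easy to show in code. The sketch below is only an illustration; the column names, the sample record and the two helper functions are hypothetical. It contrasts a record whose field names are implicit (held in separate metadata) with one that carries its names explicitly, and then checks the “unique names” and “consistent schema” questions.

```python
import json

# Implicit field names: the record is bare values; names live in metadata.
table_columns = ("user_id", "country", "amount")  # hypothetical schema
row = (42, "NZ", 19.5)
row_as_pairs = dict(zip(table_columns, row))

# Explicit field names: every record carries its own names.
doc = json.loads('{"user_id": 42, "country": "NZ", "tags": ["a", "b"]}')


def duplicate_names(pairs):
    """Field names that occur more than once within a single record."""
    seen, dupes = set(), set()
    for name, _value in pairs:
        (dupes if name in seen else seen).add(name)
    return dupes


def same_schema(records):
    """True if every record has exactly the same set of field names."""
    name_sets = {frozenset(record) for record in records}
    return len(name_sets) <= 1
```

Here `same_schema([row_as_pairs, doc])` is false, since the document has a `tags` field and no `amount`, which is exactly the kind of drift a relational table would refuse.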

Some major data models can be put into a fairly strict ordering of query desirability by noting:

The best thing to query is a relational DBMS. Everything has a known field name, so SELECTs are straightforward. You also have JOINs, which are commonly very valuable. And RDBMS are a mature technology with, in many cases, great query performance.
The next-best thing to query is another kind of data store with known field names. In such data stores:
- SQL or SQL-like SELECTs will still work, or can easily be made to do.
- Useful indexing systems can be grafted on to them (although they are typically less mature than in RDBMS).
- In the (mainly) future, perhaps JOINs can be grafted on as well.
The worst thing to query is a data store in which you only have a schema on read. You have to do work to make the thing queryable in the first place.
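To show what “you have to do work to make the thing queryable” means for schema on read, here is a small sketch. The log format, the regular expression and the field names are all assumptions chosen for the example; the store itself holds nothing but raw lines, and the schema lives entirely in the reader.

```python
import re

# Hypothetical raw log lines, stored exactly as they arrived.
raw_log = [
    "2015-02-23T03:08:10 GET /index.html 200 1532",
    "2015-02-23T03:08:11 POST /login 302 0",
    "malformed line with no recognizable fields",
]

# The "schema" exists only in this reader-side pattern.
LINE = re.compile(
    r"(?P<ts>\S+) (?P<method>[A-Z]+) (?P<path>\S+) "
    r"(?P<status>\d{3}) (?P<bytes>\d+)$"
)


def read_with_schema(lines):
    """Yield one dict per parseable line; skip lines that do not fit."""
    for line in lines:
        match = LINE.match(line)
        if match:
            record = match.groupdict()
            record["status"] = int(record["status"])
            record["bytes"] = int(record["bytes"])
            yield record


# Only after parsing can we ask a SELECT-like question.
redirect_paths = [r["path"] for r in read_with_schema(raw_log) if r["status"] == 302]
```

Every query pays the parsing cost again, and lines that do not fit the pattern silently drop out, which is part of why this sits at the bottom of the query-desirability ordering.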

Unsurprisingly, that ordering is reversed when it comes to writing data.

The easiest thing to write to is a data store with no structure.
Next-easiest is to write to a data store that lets you make up the structure as you go along.
The hardest thing to write to is a relational DBMS, because of the requirements that must be obeyed, notably:
- Implicit field names, governed by metadata.
- Unique field names within any one record.
- The same (ordered) set of field names for each record — more precisely, a limited collection of such ordered sets, one per table.

And so, for starters, most large enterprises will have important use cases for data stores in all of the obvious categories. In particular:

Usually it is best to have separate brands of general-purpose/OLTP (OnLine Transaction Processing) and analytic RDBMS. Further:
- I have in the past also advocated for a mid-range — i.e. lighter-weight — general purpose RDBMS.
- SAP really, really wants you to use HANA to run SAP’s apps.
- You might want an in-memory RDBMS (MemSQL) or a particularly cloudy one or whatever.
Your website alone is reason enough to use a NoSQL DBMS, most likely MongoDB or Cassandra. And it often makes sense to have multiple NoSQL systems used for different purposes, because:
- They’re all immature right now, with advantages over each other.
- The apps you’re using them for are likely to be thrown out in a few years, so you won’t have great pain switching if you ever do decide to standardize.
Whatever else Hadoop is — and it’s a lot of things — it’s also a happy home for log files. And enterprises have lots of log files.

Beyond that:

You may want something to manage organizational hierarchies and so on, if you build enough custom systems in areas such as security, knowledge management, or MDM (Master Data Management). I’m increasingly persuaded by the argument that this should be a graph DBMS rather than an LDAP (Lightweight Directory Access Protocol) system.
Splunk is cool.
Use cases for various other kinds of data stores can often be found.
Of course you’ll be implicitly using whatever is bundled into your SaaS (Software as a Service) systems, your app-specific appliances and so on.

And finally, I think in-memory data grids:

Will be widely used and important.
Will be used to instantiate multiple data models at once.

Related links

One reason for writing this post was for some deck-clearing before I revisit the white-hot topic of data streaming. (October, 2014)
I’ve long mused about the challenges of getting by without joins. (November, 2010)
In 2013 I observed that data models will be in perpetual, rapid flux.
In 2013 I also discussed attempts to combine multiple data models (or access methods) in a single DBMS.
I surveyed data models and access methods back in 2008.

Where the innovation is

Curt Monash — Mon, 19 Jan 2015 08:27:57 +0000

I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. But if we abandon any hope that this post could be comprehensive, I can at least say:

1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.

Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.

2. Even so, there’s much room for innovation around data movement and management. I’d start with:

Product maturity is a huge issue for all the above, and will remain one for years.
Hadoop and Spark show that application execution engines:
- Have a lot of innovation ahead of them.
- Are tightly entwined with data management, and with data movement as well.
Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
There are many issues in storage that can affect data technologies as well, including but not limited to:
- Solid-state (flash or post-flash) vs. spinning disk.
- Networked vs. direct-attached.
- Virtualized vs. identifiable-physical.
- Object/file/block.
Graph analytics and data management are still confused.

3. As I suggested last year, data transformation is an important area for innovation.

MapReduce was invented for data transformation, which is still a large part of what goes on in Hadoop.
The smart data preparation crowd is deservedly getting attention.
The more different data models — NoSQL and so on — that are used, the greater are the demands on data transformation.

4. There’s a lot going on in investigative analytics. Besides the “platform” technologies already mentioned, in areas such as fast-query, data preparation, and general execution engines, there’s also great innovation higher in the stack. Most recently I’ve written about multiple examples in predictive modeling, such as:

Mathematically (more) complex models that are at once more accurate and more easily arrived at than (nearly) linear ones.
Similarly, more complex clustering.
Predictive experimentation.
The use of business intelligence and predictive modeling to inform each other.

Beyond that:

Event-series analytics is another exciting area. (At least on the BI side, I frankly expected it to sweep through the relevant vertical markets more quickly than it has.)
I’ve long been disappointed in the progress in text analytics. But sentiment analysis is doing fairly well, many more languages are analyzed than before, and I occasionally hear rumblings of text analytic sophistication inching back towards that already available in the previous decade.
While I don’t write about it much, modern BI navigation is an impressive and wonderful thing.

5. Back in 2013, in what was perhaps my previous most comprehensive post on innovation, I drew a link between innovation and refactoring, where what was being refactored was “everything”. Even so, I’ve been ignoring a biggie. Security is a mess, and I don’t see how it can ever be solved unless systems are much more modular from the ground up. By that I mean:

“Fencing” processes and resources away from each other improves system quality, in that it defends against both deliberate attacks and inadvertent error.
Fencing is costly, both in terms of context-switching and general non-optimization. Nonetheless, I suspect that …
… the cost of such process isolation may need to be borne.
Object-oriented programming and its associated contracts are good things in this context. But it’s obvious they’re not getting the job done on their own.

More specifically,

It is cheap to give single-purpose intelligent devices more computing power than they know what to do with. There is really no excuse for allowing them to be insecure.
It is rare for a modern PC to go much above 25% CPU usage, simply because most PC programs are still single-core. This illustrates that — assuming some offsetting improvements in multi-core parallelism — desktop software could take a security performance hit without much pain to users’ wallets.
On servers, we may in many cases be talking about lightweight virtual machines.

And to be clear:

What I’m talking about would do little to help the authentication/authorization aspects of security, but …
… those will never be perfect in any case (because they depend upon fallible humans) …
… which is exactly why other forms of security will always be needed.

6. You’ve probably noticed the fuss around an open letter about artificial intelligence, with some press coverage suggesting that AI is a Terminator-level threat to humanity. Underlying all that is a fairly interesting paper summarizing some needs for future research and innovation in AI. In particular, reading the paper reminded me of the previous point about security.

7. Three areas of software innovation that, even though they’re pretty much in my wheelhouse, I have little to say about right now are:

Application development technology, languages, frameworks, etc.
The integration of analytics into old-style operational apps.
The never-ending attempts to make large-enterprise-class application functionality available to outfits with small-enterprise sophistication and budgets.

8. There is, of course, tremendous innovation in robots and other kinds of device. But this post is already long enough, so I’ll address those areas some other time.

Related links

In many cases, I think that innovations will prove more valuable — or at least much easier to monetize — when presented to particular vertical markets.
Edit: I followed up on the last point with a post about soft robots.

Notes and links, December 12, 2014

Curt Monash — Fri, 12 Dec 2014 11:05:15 +0000

1. A couple years ago I wrote skeptically about integrating predictive modeling and business intelligence. I’m less skeptical now.

For starters:

The predictive experimentation I wrote about over Thanksgiving calls naturally for some BI/dashboarding to monitor how it’s going.
If you think about Nutonian’s pitch, it can be approximated as “Root-cause analysis so easy a business analyst can do it.” That could be interesting to jump to after BI has turned up anomalies. And it should be pretty easy to whip up a UI for choosing a data set and objective function to model on, since those are both things that the BI tool would know how to get to anyway.

I’ve also heard a couple of ideas about how predictive modeling can support BI. One is via my client Omer Trajman, whose startup ScalingData is still semi-stealthy, but says they’re “working at the intersection of big data and IT operations”. The idea goes something like this:

Suppose we have lots of logs about lots of things.* Machine learning can help:
- Notice what’s an anomaly.
- Group* together things that seem to be experiencing similar anomalies.
That can inform a BI-plus interface for a human to figure out what is happening.

Makes sense to me. (Edit: ScalingData subsequently launched, under the name Rocana.)

* The word “cluster” could have been used here in a couple of different ways, so I decided to avoid it altogether.
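The two machine learning steps in that idea can be sketched very roughly as follows. This is not ScalingData’s method, which I haven’t described in any detail; the metrics, the hosts, the simple standard-deviation test and the threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical per-host history and latest reading for each metric.
history = {
    "web1": {"latency_ms": [100, 102, 98, 101], "errors": [1, 0, 2, 1]},
    "web2": {"latency_ms": [99, 101, 100, 100], "errors": [0, 1, 1, 2]},
    "db1": {"latency_ms": [20, 21, 19, 20], "errors": [0, 0, 1, 0]},
}
latest = {
    "web1": {"latency_ms": 450, "errors": 1},
    "web2": {"latency_ms": 480, "errors": 2},
    "db1": {"latency_ms": 20, "errors": 0},
}


def anomalous_metrics(past, now, threshold=3.0):
    """Metrics whose latest value is far from that host's own history."""
    flagged = set()
    for metric, values in past.items():
        spread = pstdev(values) or 1.0  # avoid dividing by zero on flat history
        if abs(now[metric] - mean(values)) / spread > threshold:
            flagged.add(metric)
    return frozenset(flagged)


def group_by_anomaly(history, latest):
    """Map each non-empty anomaly signature to the hosts that share it."""
    groups = defaultdict(list)
    for host in sorted(history):
        signature = anomalous_metrics(history[host], latest[host])
        if signature:
            groups[signature].append(host)
    return dict(groups)


groups = group_by_anomaly(history, latest)
```

In this toy data the two web hosts end up together under a shared latency anomaly, which is the sort of grouping a human could then investigate through a BI-style interface.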

Finally, I’m hearing a variety of “smart ETL/data preparation” and “we recommend what columns you should join” stories. I don’t know how much machine learning there’s been in those to date, but it’s usually at least on the roadmap to make the systems (yet) smarter in the future. The end benefit is usually to facilitate BI.

2. Discussion of graph DBMS can get confusing. For example:

Use cases run the gamut from short-request to highly analytic; no graph DBMS is well-suited for all graph use cases.
Graph DBMS have huge problems scaling, because graphs are very hard to partition usefully; hence some of the more analytic use cases may not benefit from a graph DBMS at all.
The term “graph” has meanings in computer science that have little to do with the problems graph DBMS try to solve, notably directed acyclic graphs for program execution, which famously are at the heart of both Spark and Tez.
My clients at Neo Technology/Neo4j call one of their major use cases MDM (Master Data Management), without getting much acknowledgement of that from the mainstream MDM community.

I mention this in part because that “MDM” use case actually has some merit. The idea is that hierarchies such as organization charts, product hierarchies and so on often aren’t actually strict hierarchies. And even when they are, they’re usually strict only at specific points in time; if you care about their past state as well as their present one, a hierarchical model might have trouble describing them. Thus, LDAP (Lightweight Directory Access Protocol) engines may not be an ideal way to manage and reference such “hierarchies”; a graph DBMS might do better.
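Here is a small sketch of why such “hierarchies” are more naturally edges than a tree. It is plain Python rather than any graph DBMS’s API, and the people, dates and helper names are hypothetical. Each reporting relationship is an edge with a validity interval, so a person can have more than one manager, and the answer depends on the date you ask about.

```python
from datetime import date

# Hypothetical "reports to" edges with validity intervals (end=None: current).
# A person may have several managers at once, so this is a graph, not a tree.
edges = [
    ("kim", "lee", date(2012, 1, 1), date(2014, 6, 30)),
    ("kim", "pat", date(2014, 7, 1), None),
    ("kim", "sam", date(2013, 3, 1), None),  # dotted-line manager
    ("lee", "pat", date(2010, 1, 1), None),
]


def managers(person, on):
    """Direct managers of person on a given date."""
    return sorted(
        boss
        for who, boss, start, end in edges
        if who == person and start <= on and (end is None or on <= end)
    )


def chain_of_command(person, on):
    """Everyone reachable by following 'reports to' edges valid on that date."""
    reached, frontier = set(), [person]
    while frontier:
        for boss in managers(frontier.pop(), on):
            if boss not in reached:
                reached.add(boss)
                frontier.append(boss)
    return sorted(reached)
```

Asking `managers("kim", date(2013, 6, 1))` and `managers("kim", date(2015, 1, 1))` gives different answers, and both have two entries, neither of which a single strict tree can represent.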

3. There is a surprising degree of controversy among predictive modelers as to whether more data yields better results. Besides, the most common predictive modeling stacks have difficulty scaling. And so it is common to model against samples of a data set rather than the whole thing.*

*Strictly speaking, almost the whole thing — you’ll often want to hold at least a sample of the data back for model testing.
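For readers who want the sampling and hold-back distinction spelled out, here is a minimal sketch. The function name, the 20% test fraction and the sample size are assumptions for illustration, and the integers merely stand in for rows of a real data set.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold back a fraction for model testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(1000))  # stand-in for a data set
train, test = holdout_split(rows)

# Modeling "against a sample" trains on a further subset of the training data.
sample = random.Random(1).sample(train, k=100)
```

Training on `train` is the “almost the whole thing” case; training on `sample` is what stacks that have difficulty scaling end up doing.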

Well, WibiData’s couple of Very Famous Department Store customers have tested WibiData’s ability to model against an entire database vs. their alternative predictive modeling stacks’ need to sample data. WibiData says that both report significantly better results from training over the whole data set than from using just samples.

4. ScalingData is on the bandwagon for Spark Streaming and Kafka.

5. Derrick Harris and Pivotal turn out to have been earlier than me in posting about Tachyon bullishness.

6. With the Hortonworks deal now officially priced, Derrick was also free to post more about/from Hortonworks’ pitch. Of course, Hortonworks is saying Hadoop will be Big Big Big, and suggesting we should thus not be dismayed by Hortonworks’ financial performance so far. However, Derrick did not cite Hortonworks actually giving any reasons why its competitive position among Hadoop distribution vendors should improve.

Beyond that, Hortonworks says YARN is a big deal, but doesn’t seem to like Spark Streaming.

Spark and Databricks

Curt Monash — Sun, 02 Feb 2014 18:50:57 +0000

I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.

Spark is very new. All Spark adoption is recent.
Databricks was founded to commercialize Spark. It is very much in stealth mode …
… except insofar as Databricks folks are going out and trying to drum up Spark adoption.
Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated.
Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s view of Spark’s role in the world is different from the Spark team’s.

The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:

Spark is a distributed execution engine for analytic processes …
… which works well with Hadoop.
Spark is distinguished by a flexible in-memory data model …
… and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
Intended analytic use cases for Spark include:
- SQL data manipulation.
- ETL-like data manipulation.
- Streaming-like data manipulation.
- Machine learning.
- Graph analytics.

Except for certain low-latency operations,* anything you can do in Spark can also be done in straight Hadoop; Spark just can have advantages in performance and programming ease. Spark RDDs (Resilient Distributed Datasets) are immutable at this time, so Spark is not suited for short-request update workloads.

*A new Spark task requires a thread, not a whole Java Virtual Machine.
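To illustrate what immutability implies, here is a toy sketch. It is emphatically not Spark’s RDD API; `ToyDataset` and its methods are hypothetical names, and the only thing borrowed is the idea that a transformation produces a new dataset derived from the old one rather than changing anything in place.

```python
class ToyDataset:
    """A tiny stand-in for an immutable, lazily evaluated dataset.

    This is an illustration only and is not Spark's RDD API.
    """

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # never modified after construction
        self._steps = tuple(steps)    # recorded transformations

    def map(self, fn):
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        """Replay the recorded steps over the source and return a list."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


base = ToyDataset([1, 2, 3, 4])
doubled = base.map(lambda x: x * 2)
big = doubled.filter(lambda x: x > 4)
```

There is no way to change one element of `base`; the closest thing to an update is deriving a whole new dataset, which is fine for analytic passes and a poor fit for short-request update workloads.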

Everybody agrees that machine learning is a top Spark use case. In particular:

Cloudera sees machine learning as the major area of Spark adoption to date.
Ion gave me the impression machine learning is one of the major areas of Spark adoption to date.
Mike gave me the impression that machine learning was a core intended use case for Spark the first time we talked about it.
There’s a machine learning library for Spark, and also a way to use Spark to do distributed R.

I believe data transformation is a major Spark use case as well.

Ion gave me that impression, although Cloudera surprisingly did not. Edit: Actually, see Matt Brandwine’s comment below.
I have one client (ClearStory) using Spark that way, and a second that’s likely to.
It makes sense that the #1 Hadoop use case (to date), which is something Spark also is well-suited for, would be an important early Spark use case as well.

Spark Streaming is fairly new, but is already getting some adoption. Notes on that start:

The actual technology is a form of micro-batching. I plan to learn more about that in the future.
Cloudera sees streaming as one of the two big Spark use cases, and praises Spark Streaming for its fault tolerance and its great ease of coding.
Mike Franklin knows a lot about streaming.
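For readers unfamiliar with micro-batching, here is a minimal Python sketch of the general idea. It is not Spark Streaming's API; the function name `micro_batches`, the millisecond timestamps and the one-second interval are assumptions made for illustration. Events are grouped into small fixed-width time windows, and each window is then processed with ordinary batch code.

```python
from collections import defaultdict


def micro_batches(events, interval_ms):
    """Group (timestamp_ms, value) events into fixed-width time windows."""
    batches = defaultdict(list)
    for timestamp_ms, value in events:
        # Round each timestamp down to the start of its window.
        window_start = (timestamp_ms // interval_ms) * interval_ms
        batches[window_start].append(value)
    # Emit the windows in time order.
    return [(start, batches[start]) for start in sorted(batches)]


# Hypothetical event stream: (arrival time in milliseconds, click count).
events = [(200, 3), (900, 1), (1100, 4), (2500, 2), (2700, 5)]

# Each one-second window is summed as a tiny batch job.
batch_sums = [(start, sum(values)) for start, values in micro_batches(events, 1000)]
```

The trade-off is that results are only as fresh as the batch interval, in exchange for reusing batch-style code and recovery logic.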

Part of that story is a sudden decline in the reputation of Storm, whose troubles seem to include:

Project founder and Twitter employee Nathan Marz seems to be no longer associated with Storm or employed at Twitter.
I am told that in general the Storm community is not all that vibrant.
Various aspects of Storm’s technology are disappointing people.

Other notes on Spark use cases include:

Impala-loving Cloudera doesn’t plan to support Shark. Duh.
Cloudera also won’t at first support any Spark predictive modeling add-on.
Ion’s other company, Conviva, is doing some real-time decisioning in Spark.

Spark data management has been enhanced by a project called Tachyon.* The main point of Tachyon is that Spark RDDs (Resilient Distributed Datasets) now persist in memory beyond the life of a job; besides offering the RDDs to other Spark jobs, Tachyon also opens them to Hadoop via an HDFS emulator.

*If there’s ever a Spark/Tachyon management suite, I hope some aspect is named Cherenkov, i.e., the radiation that is measured to detect the passage of tachyons. :)

And finally, some metrics and so on:

Databricks has between 10 and 20 employees.
Spark has >100 individual contributors from >25 different companies.
There was a Spark Summit with >450 attendees (from >180 organizations), and an earlier Spark-mainly conference with >200 attendees.
The Spark meet-up group in San Francisco has >1500 members signed up.
Various Spark users and subprojects are identified on the Apache Spark pages.

Related link

Most of the current substance on Databricks’ website is in its blog.

Aster 6, graph analytics, and BSP

Curt Monash — Thu, 10 Oct 2013 11:42:38 +0000

Teradata Aster 6 has been preannounced (beta in Q4, general release in Q1 2014). The general architectural idea is:

There are multiple data stores, the first two of which are:
- The classic Aster relational data store.
- A file system that emulates HDFS (Hadoop Distributed File System).
There are multiple processing “engines”, where an engine is what occupies and controls a processing thread. These start with:
- Generic analytic SQL, as Aster has had all along.
- SQL-MR, the MapReduce Aster has also had all along.
- SQL-Graph aka SQL-GR, a graph analytics system.
The Aster parser and optimizer accept glorified SQL, and work across all the engines combined.

There’s much more, of course, but those are the essential pieces.

Just to be clear: Teradata Aster 6, aka the Teradata Aster Discovery Platform, includes HDFS compatibility, native MapReduce and ways of invoking Hadoop MapReduce on non-Aster nodes or clusters — but even so, you can’t run Hadoop MapReduce within Aster over Aster’s version of HDFS.

The most dramatic immediate additions are in the graph analytics area.* The new SQL-Graph is supported by something called BSP (Bulk Synchronous Parallel). I’ll start by observing (and some of this is confusing):

BSP was thought of a long time ago, as a general-purpose computing model, but recently has come to the fore specifically for graph analytics. (Think Pregel and Giraph, along with Teradata Aster.)
BSP has a kind of execution-graph metaphor, which is different from the graph data it helps analyze.
BSP is described as being a combination hardware/software technology, but Teradata Aster and everybody else I know of implements it in software only.
Aster long ago talked of adding a graph data store, but has given up that plan; rather, it wants you to do graph analytics on data stored in tables (or accessed through views) in the usual way.

Use cases suggested are a lot of marketing, plus anti-fraud.

*Pay no attention to Aster’s previous claims to do a good job on graph — and not only via nPath — in SQL-MR.

So far as I can infer from examples I’ve seen, the semantics of Teradata Aster SQL-Graph start:

Ordinary SQL except in the FROM clause.
Functions/operators that are the arguments for FROM; of course, they output tables. You can write these yourself, or use Teradata Aster’s prebuilt ones.

Within those functions, the core idea is:

Various tables are explicitly given the roles of “Vertices”, “Edges”, and so on. (It can get reasonably complicated; e.g., “Vertices_1” and “Vertices_2” for a bipartite graph.)
Those “tables” can actually instead be views, subqueries or whatever.

Specific prebuilt functions — Aster is big on prebuilt functions — include but surely aren’t limited to:

PageRank (which of course generally is a way to estimate individual vertices’ relative influence).
Various things that seem to focus on measuring which relationships are or aren’t significant. (I’m not sure whether they’re NDA or not, so to stay on the safe side I won’t spell them out.)

Truth be told, these prebuilt functions sound pretty interesting.
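For reference, the PageRank idea itself fits in a few lines. The following Python sketch is the textbook iterative calculation, not Teradata Aster's implementation or syntax; the function name, the damping factor of 0.85, the iteration count and the sample edges are all assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Estimate relative vertex influence from a list of (source, target) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for source, target in edges:
        out_links[source].append(target)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex gets a small baseline share of the total rank.
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            # A vertex with no out-links spreads its rank evenly over all vertices.
            targets = out_links[v] or vertices
            share = damping * rank[v] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical edge table: each row is one (source, target) relationship.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
most_influential = max(ranks, key=ranks.get)
```

Here vertex "c" comes out on top because three of the four vertices point at it.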

As for underpinnings — the idea behind BSP is:

You have a computing job that is both iterative and parallel.
You parallelize it among a bunch of logical vertices, which may or may not correspond to physical computing servers.
The job is broken up into “supersteps”, wherein local processing happens at each vertex.
At the end of a superstep, each vertex can send messages to other vertices. The next superstep can’t start until all the messages have arrived.

Hopefully, various problems with message latency and unreliability that arise in other models of parallel computing are obviated by BSP.
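The superstep idea can be sketched as a single-process Python simulation. This is a toy under stated assumptions, not Aster's, Pregel's or Giraph's code: the function name `bsp_hop_counts`, the choice of hop counting as the job, and the sample graph are all hypothetical. Each pass of the loop is one superstep; messages queued during a superstep only become visible after the barrier.

```python
def bsp_hop_counts(neighbors, source):
    """Compute hop distance from `source` using BSP-style supersteps.

    `neighbors` maps each vertex to the vertices it may send messages to.
    """
    distance = {v: None for v in neighbors}
    inbox = {source: [0]}  # messages delivered before the first superstep
    supersteps = 0
    while inbox:
        outbox = {}
        # Local processing: each vertex with mail works on its own messages.
        for vertex, messages in inbox.items():
            best = min(messages)
            if distance[vertex] is None or best < distance[vertex]:
                distance[vertex] = best
                # Queue messages; nobody sees them until the next superstep.
                for neighbor in neighbors[vertex]:
                    outbox.setdefault(neighbor, []).append(best + 1)
        # Barrier: the next superstep starts only once all messages are in.
        inbox = outbox
        supersteps += 1
    return distance, supersteps


# Hypothetical graph as an adjacency mapping.
neighbors = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
distances, supersteps = bsp_hop_counts(neighbors, "a")
```

The job ends when a superstep produces no messages. In a real distributed setting the inner loop would run in parallel across workers, with the barrier as the only synchronization point.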

So why use BSP for graph analytics? Well, it’s pretty obvious why BSP would be a decent model; the real question is why something that relies on classical data partitioning isn’t even better. And of course the answer to that one is that data partitioning doesn’t work for most graphs; whatever you do, there are going to be a whole lot of edges crossing partition boundaries.

Real-world graphs have short average path lengths — Six Degrees of Separation and all that. While that isn’t in itself a proof that partitioning can’t work, it should at least serve as a strong plausibility argument.
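A toy calculation makes the plausibility argument a bit more tangible. In this Python sketch, all of it hypothetical, a 12-vertex ring is split into three contiguous partitions, and then a handful of long-range "shortcut" edges are added, which is what gives a graph short average path lengths.

```python
def cut_fraction(edges, partition_of):
    """Return the fraction of edges whose endpoints sit in different partitions."""
    crossing = sum(1 for u, v in edges if partition_of(u) != partition_of(v))
    return crossing / len(edges)


n, block = 12, 4

# A ring: every vertex is linked only to its immediate neighbor.
ring = [(i, (i + 1) % n) for i in range(n)]

# Shortcuts: each vertex in the first half is linked to the one opposite it.
shortcuts = [(i, i + n // 2) for i in range(n // 2)]


def contiguous(vertex):
    # Vertices 0-3 go to partition 0, 4-7 to partition 1, 8-11 to partition 2.
    return vertex // block


ring_only = cut_fraction(ring, contiguous)
with_shortcuts = cut_fraction(ring + shortcuts, contiguous)
```

With the ring alone, a quarter of the edges cross partition boundaries. Once the shortcuts are added, half of all edges do, because every shortcut crosses. The same edges that shorten paths are the ones that defeat partitioning.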

Since this is a first release of a graph-processing capability, it’s safe to assume there’s a lot missing. For example, every SQL-GR graph operation starts by retrieving data and building a graph; there’s no reuse. I presume that some analytic operations aren’t explicitly supported yet, or are of questionable performance. (Subgraph pattern matching was mentioned as an area that was not yet optimized for.) But with all those caveats, this still feels like a pretty interesting entry into the relationship analytics market.

Cloudera Hadoop strategy and usage notes

Curt Monash — Sun, 25 Aug 2013 15:40:07 +0000

When we scheduled a call to talk about Sentry, Cloudera’s Charles Zedlewski and I found time to discuss other stuff as well. One interesting part of our discussion was around the processing “frameworks” Cloudera sees as most important.

The four biggies are:
- MapReduce. Duh.
- SQL, specifically Impala. This is as opposed to the uneasy Hive/MapReduce layering.
- Search.
- “Math”, which seems to mainly be through partnerships with SAS and Revolution Analytics. I don’t know a lot about how these work, but I presume they bypass MapReduce, in which case I could imagine them greatly outperforming Mahout.
Stream processing (Storm) is next in line.
Graph — e.g. Giraph — rises to at least the proof-of-concept level. Again, the hope would be that this well outperforms graph-on-MapReduce.
Charles is also seeing at least POC interest in Spark.
But MPI (Message Passing Interface) on Hadoop isn’t going anywhere fast, except to the extent it’s baked into SAS or other “math” frameworks. Generic MPI use cases evidently turn out to be a bad fit for Hadoop, due to factors such as:
- Low data volumes.
- Latencies in various parts of the system.

HBase was artificially omitted from this “frameworks” discussion because Cloudera sees it as a little bit more of a “storage” system than a processing one.

Another good subject was offloading work to Hadoop, in a couple different senses of “offload”:

From general-purpose data stores, mainly RDBMS, analytic or otherwise. This sounds similar to Hortonworks’ views about efficiency-oriented offloading; batch work can be moved to Hadoop, saving costs and/or getting more mileage from costs that are already sunk into expensive legacy installations. The top targets here are large, centralized systems, with Teradata being a clear #1 and IBM mainframes a probable #2, but anything from Oracle to newer parallel analytic RDBMS is fair game.
From the specialized data stores associated with fuller technology stacks. The example I had in mind was Splunk; Charles added Palantir, HP Arcsight and, in the past, Endeca. The idea here is that Hadoop is used to organize and/or index data the way those products’ native data stores would, but in higher volumes than they are (cost-)effective for.

On a pickier note, I encouraged Charles to push back against Hortonworks’ arguments for ORC vs. Parquet. His first claim was that ORC at this time only works under Hive, while Parquet can also be used for Hive, MapReduce, etc. (Edit: But see Arun Murthy’s comment below.) I suspect this is a case where Hortonworks and Cloudera should just get over themselves, and either agree on a file format or wind up each supporting both of them. There’s a lot of DBMS-like tooling in Hadoop’s future, and I have to think it will work better — or at least run faster — if it can make reliable assumptions about how data is actually stored.

Related links

In connection with Giraph’s 0.1 version, Jakob Homan of LinkedIn contrasted Giraph with MapReduce-based graph processing.
I wrote a series about graph processing in May, 2012.
MPI used to be a higher Hadoop priority (August, 2011). That’s why I’ve kept bringing it up.

How is the surveillance data used?

Curt Monash — Thu, 13 Jun 2013 15:36:45 +0000

Over the past week, discussion has exploded about US government surveillance. After summarizing, as best I could, what data the government appears to collect, now I’d like to consider what they actually do with it. More precisely, I’d like to focus on the data’s use(s) in combating US-soil terrorism. In a nutshell:

Reporting is persuasive that electronic surveillance data is helpful in following up on leads and tips obtained by other means.
Reporting is not persuasive that electronic surveillance data on its own uncovers or averts many terrorist plots.
With limited exceptions, neither evidence nor logic suggests that data mining or predictive modeling does much to prevent domestic terrorist attacks.

Consider the example of Tamerlan Tsarnaev:

In response to this 2011 request, the FBI checked U.S. government databases and other information to look for such things as derogatory telephone communications, possible use of online sites associated with the promotion of radical activity, associations with other persons of interest, travel history and plans, and education history.

While that response was unsuccessful in preventing a dramatic act of terrorism, at least they tried.

As for actual success stories — well, that’s a bit tough. In general, there are few known examples of terrorist plots being disrupted by law enforcement in the United States, except for fake plots engineered to draw terrorist-leaning individuals into committing actual crimes. One of those examples, that of Najibullah Zazi, was indeed based on an intercepted email — but the email address itself was uncovered through more ordinary anti-terrorism efforts.

As for machine learning/data mining/predictive modeling, I’ve never seen much of a hint of it being used in anti-terrorism efforts, whether in the news or in my own discussions inside the tech industry. And I think there’s a great reason for that — what would they use for a training set? Here’s what I mean.

Unless the jargon is being misused — which of course happens all too often — data mining works like this:

Data sets are collected in which outcomes are matched to (vectors of) other (independent) variables. These are called training sets.
Analytic software is run, with the training sets as inputs and algorithms as outputs. This is called training the model. The output algorithms purport to estimate which other vectors of independent variables are likely to be associated with which outcomes.

Yes, I’m saying that predictive modeling software, used at the modeling stage — as opposed to the model scoring/execution stage — has algorithms as output. Depending on details, that’s either literally true or else just true in effect.

For example, in the simplest case, namely a linear regression:

The outcome is an event such as a product sale (desirable) or equipment failure (to be avoided).
The algorithm is a weighted sum of the other variables, whose value is interpreted as the probability of that outcome.
The algorithm discovery process simply boils down to calculating the coefficients in the weighted sum.
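The three points above can be sketched in Python. This is a minimal illustration with a single input variable for brevity, using the ordinary least-squares formulas; the function names and the training data (hours of equipment use versus whether a failure occurred) are hypothetical.

```python
def train(xs, ys):
    """Training step: calculate the coefficients by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


def score(model, x):
    """Scoring step: the 'algorithm' is just a weighted sum."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical training set: hours of use, and whether failure occurred (1) or not (0).
hours = [1, 2, 3, 4, 5, 6]
failed = [0, 0, 0, 1, 1, 1]

model = train(hours, failed)          # output of modeling: the coefficients
estimate = score(model, 3.5)          # output of scoring: an estimate for new input
```

Note that the output of `train` is the pair of coefficients, which is the sense in which modeling software has an algorithm as its output. A plain weighted sum can stray below 0 or above 1, so reading it as a probability is a rough interpretation at best.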

When data mining and predictive modeling get a little more complicated than that, we still call them “statistical analysis”; when they get much more complicated, the name “machine learning” is commonly used instead.

And so my views on the application of predictive modeling to domestic US anti-terrorism start:

In most respects, there aren’t enough examples to train models to help predict or avert terror attacks.
Presumably not coincidentally, while I’ve heard of many query and visualization techniques — notably graph analytics — I haven’t heard of predictive modeling applied directly to anti-terrorism.
There’s one big exception to this rule:
- Surveillance-based anti-terrorism efforts depend heavily on natural language processing …
- … and natural language processing depends heavily on machine learning.

Perhaps there are other examples similar to the natural language one, but nothing is currently coming to mind.

Note that not all these arguments apply to all parts of the world. For example, there have been enough roadside IEDs (Improvised Explosive Devices) in Iraq and Afghanistan that looking for unusual communication patterns associated with them might bear fruit. But when it comes to fending off terrorist attacks on US soil, I believe the main use of surveillance data is for straightforward query and data visualization based on the best educated guesses of smart human analysts.