DataStax – DBMS 2 : DataBase Management System Services

Notes on DataStax and Cassandra

Curt Monash — Mon, 08 Aug 2016 01:19:06 +0000

I visited DataStax on my recent trip. That was a tipping point leading to my recent discussions of NoSQL DBAs and misplaced fear of vendor lock-in. But of course I also learned some things about DataStax and Cassandra themselves.

On the customer side:

DataStax customers still overwhelmingly use Cassandra for internet back-ends — web, mobile or otherwise as the case might be.
This includes — and “includes” might be understating the point — traditional enterprises worried about competition from internet-only ventures.

Customers in large numbers want cloud capabilities, as a potential future if not a current need.

One customer example was a large retailer, who in the past was awful at providing accurate inventory information online, but now uses Cassandra for that. DataStax brags that its queries come back in 20 milliseconds, but that strikes me as a bit beside the point; what really matters is that data accuracy has gone from “batch” to some version of real-time. Also, Microsoft is a DataStax customer, using Cassandra (and Spark) for the Office 365 backend, or at least for the associated analytics.

Per Patrick McFadin, the four biggest things in DataStax Enterprise 5 are:

Graph capabilities.
Cassandra 3.0, which includes a complete storage engine rewrite.
Tiered storage/ILM (Information Lifecycle Management).
Policy-based replication.

Some of that terminology is mine, but perhaps my clients at DataStax will adopt it too.

We didn’t go into as much technical detail as I ordinarily might, but a few notes on that tiered storage/ILM bit are:

It’s a way to have some storage that’s more expensive (e.g. flash) and some that’s cheaper (e.g. spinning disk). Duh.
Since Cassandra has a strong time-series orientation, it’s easy to imagine how those policies might be specified.
Technologically, this is tightly integrated with Cassandra’s compaction strategy.
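To make the time-series point concrete, here is a minimal sketch of what an age-based tiering policy could look like. It is not DataStax's actual configuration syntax, which we didn't discuss; the tier names, the 30-day cutoff, and the choose_tier function are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical policy: data newer than the cutoff stays on the expensive tier.
# The 30-day figure and the tier names are illustrative assumptions.
HOT_WINDOW = timedelta(days=30)
HOT_TIER = "flash"
COLD_TIER = "spinning_disk"


def choose_tier(written_at, now, hot_window=HOT_WINDOW):
    """Return the storage tier for data written at `written_at`."""
    age = now - written_at
    return HOT_TIER if age <= hot_window else COLD_TIER


now = datetime(2016, 8, 8)
print(choose_tier(datetime(2016, 8, 1), now))  # one week old
print(choose_tier(datetime(2016, 1, 1), now))  # several months old
```

A rule keyed on write time like this only needs a timestamp per chunk of data, which is presumably why a time-series-oriented store makes such policies easy to state.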

DataStax Enterprise 5 also introduced policy-based replication features, not all of which are in open source Cassandra. Data sovereignty/geo-compliance is improved, which is of particular importance in financial services. There’s also hub/spoke replication now, which seems to be of particular value in intermittently-connected use cases. DataStax said the motivating use case in that area was oilfield operations, where presumably there are Cassandra-capable servers at all ends of the wide-area network.

Related links

I wrote in detail about Cassandra architecture in December, 2013.
I wrote about intermittently-connected data management via the relational gold standard SQL Anywhere in July, 2010.

Notes from a long trip, July 19, 2016

Curt Monash — Wed, 20 Jul 2016 01:34:31 +0000

For starters:

I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.

A running list of recent posts is:

As a companion to this post, I’m publishing a very long one on vendor lock-in.
Spark and Databricks are both prospering, and of course enhancing their technology as well.
Ditto DataStax.
Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.

Subjects I’d like to add to that list include:

MemSQL, Zoomdata, and Neo Technology (also prospering).
Cloudera (multiple topics, as usual).
Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
Microsoft’s reinvention (it feels real).
Metadata (it’s ever more of a thing).
Machine learning (it’s going to be a big portion of my research going forward).
Transitions to the cloud — this subject affects almost everything else.

I’ll edit these lists as appropriate when further posts go up. Last update: August 23, 2016.

Let’s cover some other subjects right here.

1. While Kafka is widely agreed to be the universal delivery mechanism for streams, the landscape for companion technologies is confused.

Back in January I wrote that the leaders were mainly Spark Streaming, followed by Storm.
I overlooked the fact that Storm creator Twitter was replacing Storm with something called Heron.*
If there’s any buzz about Confluent’s replacement for distant-third-place contender Samza, I missed it.
Opinions about Spark Streaming are mixed. Some folks want to get away from it; others like it just fine.

And of course Flink is hoping to blow everybody else in the space away.

*But that kind of thing is not necessarily a death knell. Cassandra inventor Facebook soon replaced Cassandra with HBase, yet Cassandra is doing just fine.

As for the “lambda architecture” — that has always felt like a kludge, and various outfits are trying to obsolete it in various ways. As just one example, Cloudera described that to me during my visit as one of the main points of Kudu.

2. The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

I had a moment of clarity on this point while visiting my clients at DataStax, and discussing their goal — shared by numerous companies — of being properly appreciated for the management tools they provide. In the room with me were CEO Billy Bosworth and chief evangelist Patrick McFadin — both of whom are former DBAs themselves.

3. I visited ClearStory, and Sharmila Mulligan showed me her actual sales database, as well as telling me some things about funding. The details are all confidential, but ClearStory is clearly doing better than rumor might suggest.

4. Platfora insisted on meeting in circumstances where it was inconvenient for me to take notes. So I have no details to share. But they sounded happy.

Edit: On July 22, it was announced that Workday is acquiring Platfora. Now I understand why Platfora gave me a bit of a runaround.

5. Pneubotics — with a cool new video on its home page — has found its first excellent product/market fit. Traditional heavy metallic robots are great at painting and related tasks when they can remain stationary, or move on rigid metal rails. Neither of those options works well, however, for large curved or irregular surfaces as might be found in the aerospace industry. Customer success for the leading soft robot company has ensued.

This all seems pretty close to the inspection/maintenance/repair area that I previously suggested could be a good soft robotics fit.

Cassandra and privacy requirements

Curt Monash — Thu, 15 Oct 2015 15:18:26 +0000

For starters:

I’ve suggested in the past that multi-data-center capabilities are important for “data sovereignty”/geo-compliance.
The need for geo-compliance just got a lot stronger, with the abolition of the European Union’s Safe Harbour rule for the US. If you collect data in multiple countries, you should be at least thinking about geo-compliance.
Cassandra is an established leader in multi-data-center operation.

But when I made that connection and checked in accordingly with my client Patrick McFadin at DataStax, I discovered that I’d been a little confused about how multi-data-center Cassandra works. The basic idea holds water, but the details are not quite what I was envisioning.

The story starts:

Cassandra groups nodes into logical “data centers” (i.e. token rings).
As a best practice, each physical data center can contain one or more logical data centers, but not vice-versa.
There are two levels of replication — within a single logical data center, and between logical data centers.
Replication within a single data center is planned in the usual way, with the principal data center holding a database likely to have a replication factor of 3.
However, copies of the database held elsewhere may have different replication factors …
… and can indeed have different replication factors for different parts of the database.

In particular, a remote replication factor for Cassandra can = 0. When that happens, then you have data sitting in one geographical location that is absent from another geographical location; i.e., you can be in compliance with laws forbidding the export of certain data. To be clear (and this contradicts what I previously believed and hence also implied in this blog):

General multi-data-center operation is not what gives you geo-compliance, because the default case is that the whole database is replicated to each data center.
Instead, you get that effect by tweaking your specific replication settings.
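As a rough illustration of "tweaking your specific replication settings", the sketch below builds CQL keyspace definitions using Cassandra's NetworkTopologyStrategy, which takes a replication factor per logical data center. The keyspace and data center names are invented for the example, and the helper function is mine rather than anything from DataStax. A data center given a factor of 0 is simply left out of the replication map, so it holds no copies of that keyspace.

```python
def create_keyspace_cql(keyspace, replication_by_dc):
    """Build a CREATE KEYSPACE statement with per-data-center replication.

    `replication_by_dc` maps a logical data center name to its replication
    factor. The names used below are hypothetical.
    """
    parts = ["'class': 'NetworkTopologyStrategy'"]
    for dc, rf in replication_by_dc.items():
        if rf < 0:
            raise ValueError("replication factor cannot be negative")
        if rf > 0:  # a factor of 0 means: keep no copies in that data center
            parts.append("'%s': %d" % (dc, rf))
    return "CREATE KEYSPACE %s WITH replication = {%s};" % (
        keyspace,
        ", ".join(parts),
    )


# Data that must stay in Europe: replicated in Frankfurt only.
print(create_keyspace_cql("eu_customers", {"dc_frankfurt": 3, "dc_virginia": 0}))

# Data with no export restriction: replicated in both places.
print(create_keyspace_cql("product_catalog", {"dc_frankfurt": 3, "dc_virginia": 3}))
```

Since replication is set per keyspace, this is one way that different parts of the database can carry different replication factors.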

The most visible DataStax client using this strategy is apparently ING Bank.

If you have a geo-compliance issue, you’re probably also concerned about security. After all, the whole reason the issue arises is because one country’s government might want to look at another country’s citizens’ or businesses’ data. The DataStax security story is approximately:

Encryption in flight, for any Cassandra.
Encryption at rest, specifically with DataStax Enterprise.
No cell-level or row-level security until Cassandra 3.0 is introduced and established. (I didn’t actually ask whether something similar to HBase coprocessors is coming for Cassandra, but that would be my first guess.)
Various roles and permissions stuff.

While flexible, Cassandra’s multi-data-center features do add some complexity. Tunable-consistency choices are baked into Cassandra programs at each point data is accessed, and more data centers make for more choices. (Default best practice = write if you get a local quorum, running the slight risk of logical data centers being out of sync with each other.)
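The arithmetic behind those choices is simple enough to sketch. The code below is plain Python rather than a Cassandra driver call, and the data center names are hypothetical. It assumes the usual definition of a quorum as a majority of replicas, with a local quorum counting only the replicas in the coordinator's own logical data center.

```python
def quorum(replicas):
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def local_quorum(replication_by_dc, local_dc):
    """Acknowledgements needed when only the local data center counts."""
    return quorum(replication_by_dc[local_dc])


def cluster_quorum(replication_by_dc):
    """Acknowledgements needed when every data center's replicas count."""
    return quorum(sum(replication_by_dc.values()))


def overlap_guaranteed(replicas, write_acks, read_acks):
    """True if every read set must share a replica with every write set."""
    return write_acks + read_acks > replicas


replication = {"dc_east": 3, "dc_west": 3}  # hypothetical data centers
print(local_quorum(replication, "dc_east"))  # 2
print(cluster_quorum(replication))           # 4
print(overlap_guaranteed(3, 2, 2))           # True
```

With a local quorum, a write returns after 2 acknowledgements in the local data center instead of waiting on 4 across the wide-area network. The price is the one mentioned above: the remote data center may briefly lag.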

One way in which the whole thing does seem nice and simple is that you can have different logical data centers running on different kinds of platforms — cloud, colocation, in-house, whatever — without Cassandra caring.

I’m not going to call the DataStax Enterprise approach to geo-compliance the “gold standard”, because some of it seems pretty clunky or otherwise feature-light. On the other hand, I’m not aware of competitors who exceed it, in features or track record, so “silver standard” seems defensible.

Basho and Riak

Curt Monash — Thu, 15 Oct 2015 15:18:05 +0000

Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.

For starters:

Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
Basho’s revenue is ~90% subscription.
Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would be close to half the total, actually) comes from 2 particular deals of >$4 million each.
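For anyone checking the "close to half" remark, here is the back-of-envelope arithmetic. It treats Basho's ">200" and ">$100K" claims as if they were exact figures, which is an assumption on my part, so the result is only a rough bound.

```python
# Basho's claimed figures, treated as exact for a rough estimate.
clients = 200                  # ">200 enterprise clients"
average_contract = 100_000     # ">$100K" average contract value
two_big_deals = 9_000_000      # "$9 million" from 2 particular deals

implied_total = clients * average_contract
share_of_total = two_big_deals / implied_total

print(implied_total)             # 20000000
print(f"{share_of_total:.0%}")   # 45%
```

If the true client count or average is meaningfully above the stated floor, the two deals' share shrinks accordingly.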

Basho’s product line has gotten a bit confusing, but as best I understand things the story is:

There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
Riak TS is for time series, and just coming out now.
Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
There’s an umbrella marketing term of “Basho Data Platform”.

Technical notes on some of that include:

Riak Core doesn’t do data management. It just manages distributed operation of — well, whatever you want to operate. In part, Basho sees Riak Core as a better Apache ZooKeeper.
- That is the essence of the Riak/Spark pitch — something better than ZooKeeper for cluster management, and I presume some help in persisting Spark RDDs as well.
- The Riak/Redis pitch is even simpler — cluster management for Redis, and persistent backing as well.
- Basho’s criticisms of ZooKeeper start with “Cluster manager, manage thyself” claims about ZooKeeper availability, as in the PagerDuty ZooKeeper critique.
Riak KV has secondary indexing. Performance is somewhat questionable. It also has Solr indexing, which is fast.
At least in its 1.0 form, Riak TS assumes:
- There’s some kind of schema or record structure.
- There are explicit or else easily-inferred timestamps.
- Microsecond accuracy, perfect ordering and so on are not essential.
Thus, Riak TS 1.0 is not ideal for the classic Splunk use case where you text index/search on a lot of log emissions. It also is not ideal for financial tick storage.
Riak TS has range-based partitioning, where the range is in terms of time. Basho refers to this as “locality”.
Riak TS has a SQL subset. Evidently there’s decent flexibility as to which part of the database carries which schema.
Riak has a nice feature of allowing you to stage a change to network topology before you push it live.
Riak’s vector clock approach to wide-area synchronization is more controversial.
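For readers who haven't met vector clocks, the general idea can be sketched in a few lines. This is the textbook technique, not Riak's actual implementation, and the node names are hypothetical. Each replica keeps a counter per node, and two versions conflict when neither clock has seen everything the other has.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock `a` has seen every event recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify the relationship between two clocks."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a_newer"
    if descends(b, a):
        return "b_newer"
    return "concurrent"


def merge(a, b):
    """Clock that has seen everything either input has seen."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


base = increment({}, "dc_east")     # first write, seen everywhere
east = increment(base, "dc_east")   # update made in one data center
west = increment(base, "dc_west")   # independent update made in another

print(compare(east, base))   # a_newer
print(compare(east, west))   # concurrent
print(merge(east, west) == {"dc_east": 2, "dc_west": 1})  # True
```

The "concurrent" case is where the controversy lives: the clocks can tell you that two updates conflict, but something else (the application, or a resolution rule) still has to decide what the merged value should be.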

Finally, notes on what Basho sees as use cases and competition include:

Riak KV is generally used to store usual-suspect stuff — log data, user/profile data and so on.
Basho thinks NoSQL is a 4-horse race — Basho/Riak KV, DataStax/Cassandra, MongoDB, Couchbase. (I would be surprised if there was much agreement with that view from, for example, MongoDB, DataStax, Aerospike, MapR or the HBase community.)
Basho competes on availability, scalability (including across geography) and so on, or in simplest terms:
- “Availability and correctness”
- Simple operation
Unsurprisingly, Basho thinks its closest competitor is DataStax. (However, DataStax tells me they don’t compete much with Basho.)
Basho thinks Riak KV has ease-of-operation advantages vs. Cassandra.
An example of a mission-critical Riak app is the UK National Health Service storing prescription information.
An example of Riak S2 and Riak KV being used together is Turner Broadcasting storing video in the former and associated metadata in the latter.
Riak TS is initially pointed at two use cases:
- “Internet of Things”
- “Metrics”, which seems to mean monitoring of system metrics.
Basho sees the competition for Riak TS as starting with HBase, Cassandra, and InfluxDB.

Notes on privacy and surveillance, October 11, 2015

Curt Monash — Sun, 11 Oct 2015 10:44:38 +0000

1. European Union data sovereignty laws have long had a “Safe Harbour” rule stating it was OK to ship data to the US. Per the case Maximilian Schrems v Data Protection Commissioner, this rule is now held to be invalid. Angst has ensued, and rightly so.

The core technical issues are roughly:

Data is usually in one logical database. Data may be replicated locally, for availability and performance. It may be replicated remotely, for availability, disaster recovery, and performance. But it’s still usually logically in one database.
Now remote geographic partitioning may be required by law. Some technologies (e.g. Cassandra) support that for a single logical database. Some don’t.
Even under best circumstances, hosting and administrative costs are likely to be higher when a database is split across more geographies (especially when the count is increased from 1 to 2).

Facebook’s estimate of billions of dollars in added costs is not easy to refute.

My next set of technical thoughts starts:

This is about data storage, not data use; for example, you can analyze Austrian data in the US, but you can’t store it there.
Of course, that can be a tricky distinction to draw. We can only hope that intermediate data stores, caches and so on can be allowed to use data from other geographies.
Assuming the law is generous in this regard, scan-heavy analytics are more problematic than other kinds.
But if there are any problems in those respects — well, if analytics can be parallelized in general, then in particular one should be able to parallelize across geographies. (Of course, this could require replicating one’s whole analytic stack across geographies.)
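A minimal sketch of what parallelizing across geographies might mean in practice: each region scans its own raw data and ships only a small summary, and the summaries are combined centrally. The region names and numbers are made up, and whether shipping such summaries is legally acceptable is exactly the kind of question I can't answer here.

```python
def local_summary(values):
    """Computed inside a region; only this summary leaves the region."""
    return {"count": len(values), "total": sum(values)}


def combined_mean(summaries):
    """Computed anywhere, from the per-region summaries alone."""
    count = sum(s["count"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / count if count else None


# Hypothetical raw data, which never leaves its home region.
raw_by_region = {
    "austria": [10.0, 20.0, 30.0],
    "us": [40.0],
}

summaries = [local_summary(values) for values in raw_by_region.values()]
print(combined_mean(summaries))  # 25.0
```

This works neatly for aggregates that decompose into partial results, such as counts, sums and means. Analyses that need to join raw rows across regions are harder, which is part of why scan-heavy analytics are the problematic kind.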

2. US law enforcement is at loggerheads with major US tech companies, because it wants the right to subpoena data stored overseas. The central case here is a request to get at Microsoft’s customer data stored in Ireland. A government victory would be catastrophic for the US tech industry, but I’m hopeful that sense will — at least to some extent — prevail.

3. Ed Snowden, Glenn Greenwald and numerous other luminaries are pushing something called the Snowden Treaty, as a model for how privacy laws should be set up. I’m a huge fan of what Snowden and Greenwald have done in general, but this particular project has not started well. First, they’ve rolled the thing out while actually giving almost no details, so they haven’t really contributed anything except a bit of PR. Second, one of the few details they did provide contains a horrific error.

Specifically, they “demand”

freedom from damaging publicity, public scrutiny …

To that I can only say: “Have you guys lost your minds???????” As written, that’s a demand that can only be met by censorship laws. I’m sure this error is unintentional, because Greenwald is in fact a stunningly impassioned and articulate opponent of censorship. Even so, that’s an appallingly careless mistake, which for me casts the whole publicity campaign into serious doubt.

4. As a general rule — although the details of course depend upon where you live — it is no longer possible to move around and be confident that you won’t be tracked. This is true even if you’re not a specific target of surveillance. Ways of tracking your movements include but are not limited to:

Electronic records of you paying public transit fares or tolls, as relevant. (Ditto rental car fees, train or airplane tickets, etc.)
License plate cameras, which in the US already have billions of records on file.
Anything that may be inferred from your mobile phone.

5. The previous point illustrates that the strong form of the Snowden Treaty is a pipe dream — it calls for a prohibition on mass surveillance, and that will never happen, because:

Governments will insist on trying to prevent “terrorism” before the fact. That mass surveillance is generally lousy at doing so won’t keep them from trying.
Governments will insist on being able to do general criminal forensics after the fact. So they’ll want mass surveillance data sitting around just in case they find that they need it.
Businesses share consumers’ transaction and interaction data, and such sharing is central to the current structure of the internet industry. That genie isn’t going back into the bottle. Besides, if it did, a few large internet companies would have even more of an oligopolistic advantage vs. the others than they now do.

The huge problem with these truisms, of course, is scope creep. Once the data exists, it can be used for many more purposes than the few we’d all agree are actually OK.

6. That, in turn, leads me back to two privacy posts that I like to keep reminding people of, because they make points that aren’t commonly found elsewhere:

The essential questions of fair data use, in which I point out such a long list of legal issues that almost everybody has overlooked some of them.
Very chilling effects, in which I point out how damaging surveillance can be when there’s even a possibility of adverse consequence.

Whether or not you basically agree with me about privacy and surveillance, those two posts may help flesh out whatever your views on the subject actually are.

DataStax and Cassandra update

Curt Monash — Mon, 14 Sep 2015 06:02:59 +0000

MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, somebody with whom I’ve had strong consulting relationships at both a user and a vendor. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.

It seems fair to say that in most cases:

Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data.

Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:

DataStax trumpets British Gas’ plans for collecting a lot of sensor data and immediately offering it up for analysis.*
Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.

*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data.

While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind:

You can get any kind of data into them very fast; indeed, that’s a central part of what they were designed for.
In the general case, getting it back out for low-latency analytics is problematic …
… but there’s an increasing list of exceptions.

For DataStax Enterprise, exceptions start:

Formally, you can do almost anything in at least one of Solr or Spark/SparkSQL. So if volumes are low enough, you’re fine. In particular, Spark offers the potential to do many things at in-memory speeds.
Between Spark, the new functions, and general scripting, there are several ways to do low-latency aggregations. This can lead to “twinkling dashboards”.*
DataStax is alert to the need to stream data into Cassandra.
- That’s central to the NoSQL expectation of ingesting internet data very quickly.
- Kafka, Storm and Spark Streaming all seem to be in the mix.
Solr over Cassandra has a searchable RAM buffer, which can give the effect of real-time text indexing within a second or so of ingest.

*As much as I love the “twinkling dashboard” term — it reminds me of my stock analyst days — it does raise some concerns. In many use cases, human real-time BI should be closely integrated with the more historical kind.

DataStax Enterprise:

Is based on Cassandra 2.1.
Will probably never include Cassandra 2.2, waiting instead for …
… Cassandra 3.0, which will feature a storage engine rewrite …
… and will surely include Cassandra 2.2 features of note.

This connects to what I said previously in that Cassandra 2.2 adds some analytic features, specifically in the area of user-defined functions. Notes on Cassandra 2.2 UDFs include:

These are functions — not libraries, a programming language, or anything like that.
The “user-defined” moniker notwithstanding, the capability has been used to implement COUNT, SUM, AVG, MAX and so on.
You are meant to run user-defined functions on data in a single Cassandra partition; run them across partitions at your own performance risk.
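To show how something like AVG can be built from plain functions, here is a sketch in Python rather than in Cassandra's own function languages. As I understand it, a Cassandra user-defined aggregate is assembled from a state function, an initial state, and an optional final function; the code mimics that shape. The partition keys and readings are invented.

```python
def avg_state(state, value):
    """State function: fold one more value into (count, total)."""
    count, total = state
    return (count + 1, total + value)


def avg_final(state):
    """Final function: turn the accumulated state into the result."""
    count, total = state
    return total / count if count else None


def aggregate(values, state_fn, final_fn, initial_state):
    """Run a state/final function pair over a sequence of values."""
    state = initial_state
    for value in values:
        state = state_fn(state, value)
    return final_fn(state)


# Hypothetical sensor readings, keyed by partition.
partitions = {
    "boiler-1": [60, 62, 64],
    "boiler-2": [70],
}

# The intended use: one partition at a time.
for key, readings in partitions.items():
    print(key, aggregate(readings, avg_state, avg_final, (0, 0)))

# Possible, but every partition must be read to answer it.
all_readings = [r for readings in partitions.values() for r in readings]
print(aggregate(all_readings, avg_state, avg_final, (0, 0)))  # 64.0
```

The last call is the "at your own performance risk" case: in a distributed store, the values for different partitions live on different nodes, so something has to pull them all together first.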

And finally, some general tidbits:

A while ago, Apple said it had >75,000 Cassandra nodes. The figure is surely bigger now.
There are at least several other petabyte-range Cassandra installations, and several more half-petabyte ones.
Netflix is not one of those. Instead, it has many 10s of smaller Cassandra clusters.
There are Cassandra users with >1 million reads+writes per second.

Finally, a couple of random notes:

One of the text search use cases for Solr/Cassandra is to — in one query — get at information that originated in multiple places, e.g. for reasons of time period or geography. (I hear this about text search across lots of database technologies, relational and non-relational alike.)
As big a change as Cassandra 3.0 will be, it will not require that you take down your applications for an upgrade. That hasn’t been necessary since Cassandra 0.7.

Notes, links and comments, May 2, 2015

Curt Monash — Sat, 02 May 2015 14:36:39 +0000

I’m going to be out-of-sorts this week, due to a colonoscopy. (Between the prep, the procedure, and the recovery, that’s a multi-day disablement.) In the interim, here’s a collection of links, quick comments and the like.

1. Are you an engineer considering a start-up? This post is for you. It’s based on my long experience in and around such scenarios, and includes a section on “Deadly yet common mistakes”.

2. There seems to be a lot of confusion regarding the business model at my clients Databricks. Indeed, my own understanding of Databricks’ on-premises business has changed recently. There are no changes in my beliefs that:

Databricks does not directly license or support on-premises Spark users. Rather …
… it helps partner companies to do so, where:
- Examples of partner companies include usual-suspect Hadoop distribution vendors, and DataStax.
- “Help” commonly includes higher-level support.

However, I now get the impression that revenue from such relationships is a bigger deal to Databricks than I previously thought.

Databricks, by the way, has grown to >50 people.

3. DJ Patil and Ruslan Belkin apparently had a great session on lessons learned, covering a lot of ground. Many of the points are worth reading, but one in particular echoed something I’m hearing lots of places — “Data is super messy, and data cleanup will always be literally 80% of the work.” Actually, I’d replace the “always” by something like “very often”, and even that mainly for newish warehouses, data marts or datasets. But directionally the comment makes a whole lot of sense.

4. Of course, dirty data is a particular problem when the data is free-text.

5. In 2010 I wrote that the use of textual news information in investment algorithms had become “more common”. It’s become a bigger deal since. For example:

It seems to be quite profitable to do automated options trading based on the parsing of tweets.
In a funny example, Tesla Motors stock gyrated due to Tesla’s April Fool’s press release about a new wristwatch product.

6. Sometimes a post here gets a comment thread so rich it’s worth doubling back to see what other folks added. I think the recent geek-out on indexes is one such case. Good stuff was added by multiple smart people.

7. Finally, I’ve been banging the drum for electronic health records for a long time, arguing that the great difficulties should be solved due to the great benefits of doing so. The Hacker News/New York Times combo offers a good recent discussion of the subject.

Where the innovation is

Curt Monash — Mon, 19 Jan 2015 08:27:57 +0000

I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. But if we abandon any hope that this post could be comprehensive, I can at least say:

1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.

Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.

2. Even so, there’s much room for innovation around data movement and management. I’d start with:

Product maturity is a huge issue for all the above, and will remain one for years.
Hadoop and Spark show that application execution engines:
- Have a lot of innovation ahead of them.
- Are tightly entwined with data management, and with data movement as well.
Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
There are many issues in storage that can affect data technologies as well, including but not limited to:
- Solid-state (flash or post-flash) vs. spinning disk.
- Networked vs. direct-attached.
- Virtualized vs. identifiable-physical.
- Object/file/block.
Graph analytics and data management are still confused.

3. As I suggested last year, data transformation is an important area for innovation.

MapReduce was invented for data transformation, which is still a large part of what goes on in Hadoop.
The smart data preparation crowd is deservedly getting attention.
The more different data models — NoSQL and so on — that are used, the greater are the demands on data transformation.

4. There’s a lot going on in investigative analytics. Besides the “platform” technologies already mentioned, in areas such as fast-query, data preparation, and general execution engines, there’s also great innovation higher in the stack. Most recently I’ve written about multiple examples in predictive modeling, such as:

Mathematically (more) complex models that are at once more accurate and more easily arrived at than (nearly) linear ones.
Similarly, more complex clustering.
Predictive experimentation.
The use of business intelligence and predictive modeling to inform each other.

Beyond that:

Event-series analytics is another exciting area. (At least on the BI side, I frankly expected it to sweep through the relevant vertical markets more quickly than it has.)
I’ve long been disappointed in the progress in text analytics. But sentiment analysis is doing fairly well, many more languages are analyzed than before, and I occasionally hear rumblings of text analytic sophistication inching back towards that already available in the previous decade.
While I don’t write about it much, modern BI navigation is an impressive and wonderful thing.

5. Back in 2013, in what was perhaps my previous most comprehensive post on innovation, I drew a link between innovation and refactoring, where what was being refactored was “everything”. Even so, I’ve been ignoring a biggie. Security is a mess, and I don’t see how it can ever be solved unless systems are much more modular from the ground up. By that I mean:

“Fencing” processes and resources away from each other improves system quality, in that it defends against both deliberate attacks and inadvertent error.
Fencing is costly, both in terms of context-switching and general non-optimization. Nonetheless, I suspect that …
… the cost of such process isolation may need to be borne.
Object-oriented programming and its associated contracts are good things in this context. But it’s obvious they’re not getting the job done on their own.

More specifically,

It is cheap to give single-purpose intelligent devices more computing power than they know what to do with. There is really no excuse for allowing them to be insecure.
It is rare for a modern PC to go much above 25% CPU usage, simply because most PC programs are still single-core. This illustrates that — assuming some offsetting improvements in multi-core parallelism — desktop software could take a security performance hit without much pain to users’ wallets.
On servers, we may in many cases be talking about lightweight virtual machines.

And to be clear:

What I’m talking about would do little to help the authentication/authorization aspects of security, but …
… those will never be perfect in any case (because they depend upon fallible humans) …
… which is exactly why other forms of security will always be needed.

6. You’ve probably noticed the fuss around an open letter about artificial intelligence, with some press coverage suggesting that AI is a Terminator-level threat to humanity. Underlying all that is a fairly interesting paper summarizing some needs for future research and innovation in AI. In particular, reading the paper reminded me of the previous point about security.

7. Three areas of software innovation that, even though they’re pretty much in my wheelhouse, I have little to say about right now are:

Application development technology, languages, frameworks, etc.
The integration of analytics into old-style operational apps.
The never-ending attempts to make large-enterprise-class application functionality available to outfits with small-enterprise sophistication and budgets.

8. There is, of course, tremendous innovation in robots and other kinds of device. But this post is already long enough, so I’ll address those areas some other time.

Related links

In many cases, I think that innovations will prove more valuable — or at least much easier to monetize — when presented to particular vertical markets.
Edit: I followed up on the last point with a post about soft robots.

Notes and comments, May 6, 2014

Curt Monash — Tue, 06 May 2014 13:46:54 +0000

After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.

My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
My Cloudera/SQL/Impala/Hive post apparently was well-balanced, in that it got attacked from multiple sides via Twitter & email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
My post on the Intel/Cloudera deal garnered a comment reminding us Dell had pushed the Intel distro.
My CitusDB post picked up a few clarifying comments.

Here is a catch-all post to complete the set.

1. The recently-announced Cloudera/MongoDB relationship* is still at the Barney stage. That said, I’m optimistic that their stated intention to add substance to the relationship will eventually come to fruition. If nothing else, the two companies have high regard for each other, at least at the Mike Olson/Max Schireson level.

*That’s one of numerous deals with my fingerprints on it, but in this case only lightly. It was probably on track to happen even without my nudges.

2. Most of what I talked about when I visited MongoDB is confidential; the public stuff was mainly in my recent MongoDB technology post. But in one exception, I asked Max for an update as to MongoDB enterprise use cases. He reported a cluster in data combination, especially but not only in use cases which have both a high-volume part and dynamic-schema aspects. Specific examples Max cited included:

Tracking financial holdings from a variety of asset classes — especially if derivatives are involved, because they have a dynamic-schema aspect.
Product catalogs, including for use on web sites.
Customer information.
Patient information.

3. I didn’t ask everybody I saw in California about business trends, and much of what we did discuss was confidential. That said:

MapR was proud of its numbers.
So was DataStax.
ClearStory has a bunch of Very Big Enterprises as customers, mainly but not only in consumer sectors (e.g. retail, packaged goods).

4. Platfora is focusing a bit, starting with clickstream and security — i.e., event series stuff. And by the way, they report that the term “event series” is working well for them.

5. I gather from a variety of comments and conversations that Amazon Redshift has achieved considerable traction.

6. Something I can’t find evidence of having posted before: I think multiple businesses monitor online sales or similar business successes as a guide to network problems. eBay did this via a custom in-memory MOLAP (Multidimensional Online Analytic Processing) system years ago. Best evidence that this is hardly restricted to eBay: all the “me-too” responses I get from telling that story.
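To make the idea concrete, here is a minimal sketch of using a business metric as a proxy for infrastructure health. It is not how eBay's MOLAP system worked; it is a much simpler trailing-average check. The function name `sales_alert`, the 30-minute window, and the 50% drop threshold are all hypothetical choices made for illustration.

```python
from statistics import mean


def sales_alert(per_minute_sales, window=30, drop_ratio=0.5):
    """Flag a possible network problem when sales suddenly fall.

    per_minute_sales: list of sales counts, oldest first, one per minute.
    window:           how many preceding minutes form the baseline (assumed).
    drop_ratio:       alert if the latest minute is below this fraction
                      of the baseline (assumed).
    """
    if len(per_minute_sales) <= window:
        return False  # not enough history to form a baseline

    # Baseline is the mean of the `window` minutes before the latest one.
    baseline = mean(per_minute_sales[-window - 1:-1])
    latest = per_minute_sales[-1]
    return baseline > 0 and latest < drop_ratio * baseline


# Steady sales, then a sharp drop in the most recent minute.
history = [100] * 30 + [20]
print(sales_alert(history))
```

A real deployment would presumably slice the metric by dimensions such as region or data center, which is where a multidimensional system earns its keep; a sudden drop confined to one slice points at a localized problem.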

7. Citus Data tells me that as of PostgreSQL 9.4, Postgres will be able to return just the part of a JSON column needed for a query. This is as opposed to storing the whole thing as text and only retrieving it in its entirety.

8. In the comments to my “Spark on fire” post, Patrick McFadin pointed out that Mahout is transitioning from MapReduce to Spark. (All new work will be on Spark, although old MapReduce-based routines will continue to be supported.) It turns out that Derrick Harris wrote about that over a month ago, and I just missed the news.

9. Also in predictive analytics — there are rumblings that R could eventually be supplanted by Julia, although R’s massive libraries of algorithms still give it the advantage now.

10. Multiple vendors, fed up with the intermittent slowdowns from garbage collection, are moving some processing off the Java heap. Unfortunately, I neglected to ask any of them what the remaining differences then were between Java and C++ programming.

11. And to finish on a light note: BDAS — the project of which Spark is only a part — is pronounced “bad-ass”, something I first heard from Dave Patterson.

Notes and comments, March 17, 2014

Curt Monash — Mon, 17 Mar 2014 07:09:15 +0000

I have ever more business-advice posts up on Strategic Messaging. Recent subjects include pricing and stealth-mode marketing. Other stuff I’ve been up to includes:

The Spark buzz keeps increasing; almost everybody I talk with expects Spark to win big, probably across several use cases.

Disclosure: I’ll soon be in a substantial client relationship with Databricks, hoping to improve their stealth-mode marketing.

The “real-time analytics” gold rush I called out last year continues. A large fraction of the vendors I talk with have some variant of “real-time analytics” as a central message.

Basho had a major change in leadership. A Twitter exchange ensued. Joab Jackson offered a more sober — figuratively and literally — take.

Hadapt laid off its sales and marketing folks, and perhaps some engineers as well. In a nutshell, Hadapt’s approach to SQL-on-Hadoop wasn’t selling vs. the many alternatives, and Hadapt is doubling down on poly-structured data*/schema-on-need.

*While Hadapt doesn’t to my knowledge use the term “poly-structured data”, some other vendors do. And so I may start using it more myself, at least when the poly-structured/multi-structured distinction actually seems significant.

WibiData is partnering with DataStax. WibiData is of course pleased to get access to Cassandra’s user base, which gave me the opportunity to ask why they thought Cassandra had beaten HBase in those accounts. The answer was performance and availability, while Cassandra’s traditional lead in geo-distribution wasn’t mentioned at all.

Disclosure: My fingerprints are all over that deal.

In other news, WibiData has had some executive departures as well, but seems to be staying the course on its strategy. I continue to think that WibiData has a really interesting vision about how to do large-data-volume interactive computing, and anybody in that space would do well to talk with them or at least look into the open source projects WibiData sponsors.

I encountered another apparently-popular machine-learning term: bandit model. It seems to be glorified A/B testing. I think the point is that it tries to optimize for just how much you invest in testing unproven (for good or bad) alternatives.
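For readers who want to see the explore/exploit tradeoff in code, here is a minimal sketch of one common bandit strategy, epsilon-greedy. It is only one of several bandit algorithms, and nothing here reflects any particular vendor's implementation. The function name, the conversion rates, and the 10% exploration share are assumptions chosen for illustration.

```python
import random


def epsilon_greedy(true_rates, rounds=10_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit.

    true_rates: hidden conversion rate of each alternative (assumed values).
    epsilon:    fraction of traffic spent testing alternatives at random.
    Returns (pulls, wins): how often each alternative was shown and converted.
    """
    rng = random.Random(seed)
    n = len(true_rates)
    pulls = [0] * n
    wins = [0] * n

    def observed_rate(i):
        # Alternatives never shown yet are treated as having rate 0.
        return wins[i] / pulls[i] if pulls[i] else 0.0

    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                # explore: try any alternative
        else:
            arm = max(range(n), key=observed_rate)  # exploit: best so far
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1

    return pulls, wins


pulls, wins = epsilon_greedy([0.05, 0.10, 0.30])
print(pulls)  # most traffic should end up on the strongest alternative
```

In contrast to a classic A/B test, which splits traffic evenly for the whole test period, this approach caps the investment in unproven alternatives at roughly `epsilon` of the traffic and sends the rest to whatever currently looks best.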

I had an awkward set of interactions with Gooddata, including my longest conversations with them since 2009. Gooddata is in the early days of trying to offer an all-things-to-all-people analytic stack via SaaS (Software as a Service). I gather that Hadoop, Vertica, PostgreSQL (a cheaper Vertica alternative), Spark, Shark (as a faster version of Hive) and Cassandra (under the covers) are all in the mix — but please don’t hold me to those details.

I continue to think that computing is moving to a combination of appliances, clusters, and clouds. That said, I recently bought a new gaming-class computer, and spent many hours gaming on it just yesterday.* I.e., there’s room for general-purpose workstations as well. But otherwise, I’m not hearing anything that contradicts my core point.

*The last beta weekend for The Elder Scrolls Online; I loved Morrowind.