Pricing – DBMS 2 : DataBase Management System Services

Interana

Curt Monash — Mon, 17 Apr 2017 10:10:41 +0000

Interana has an interesting story, in technology and business model alike. For starters:

Interana does ad-hoc event series analytics, which they call “interactive behavioral analytics solutions”.
Interana has a full-stack analytic offering, include:
- Its own columnar DBMS …
- … which has a non-SQL DML (Data Manipulation Language) meant to handle event series a lot more fluently than SQL does, but which the user is never expected to learn because …
- … there also are BI-like visual analytics tools that support plenty of drilldown.
Interana sells all this to “product” departments rather than marketing, because marketing doesn’t sufficiently value Interana’s ad-hoc query flexibility.
Interana boasts >40 customers, with annual subscription fees ranging from high 5 figures to low 7 digits.

And to be clear — if we leave aside any questions of marketing-name sizzle, this really is business intelligence. The closest Interana comes to helping with predictive modeling is giving its ad-hoc users inspiration as to where they should focus their modeling attention.

Interana also has an interesting twist in its business model, which I hope can be used successfully by other enterprise software startups as well.

For now, at no extra charge, Interana will operate its software for you as a managed service. (A majority of Interana’s clients run the software on Amazon or Azure, where that kind of offering makes sense.)
However, presumably in connection with greater confidence in its software’s ease of administration, Interana will move this year toward unbundling the service as an extra-charge offering on top of the software itself.

The key to understanding Interana is its DML. Notes on that include:

Interana’s DML is focused on path analytics …
- … but Interana doesn’t like to use that phrase because it sounds too math-y and difficult.
- Interana may be the first company that’s ever told me it’s focused on providing a better nPath.
Primitives in Interana’s language — notwithstanding the company’s claim that it never ever intended to sell to marketing departments — include familiar web analytics concepts such as “session”, “funnel” and so on. (However, these are being renamed to more neutral terms such as “flow” in an upcoming version of the product.)
As typical example questions or analytic subjects, Interana offered:
- “Which are the most common products in shopping carts where time-to-checkout was greater than 30 minutes?”
- Exactly which steps in the onboarding process result in the greatest user frustration?
The Interana folks and I agree that Splunk is the most recent example of a new DML kicking off a significant company.
The most recent example I can think of in which a vendor hung its hat on a new DML that was a “visual programming language” is StreamBase, with EventFlow. That didn’t go all that well.
To use Founder/CTO Bobby Johnson’s summary term, the real goal of the Interana language is to describe a state machine, specifically one that produces (sets of) sequences of events (and the elapsed time between them).

Notes on Interana speeds & feeds include:

Interana only promises data freshness up to micro-batch latencies — i.e., a few minutes. (Obviously, this shuts them out of most networking monitoring and devops use cases.)
Interana thinks it’s very important for query response time to max out at a low number of seconds. If necessary, the software will return approximate results rather than exact ones so as to meet this standard.
Interana installations and workloads to date have gotten as large as:
- 1-200 nodes.
- Trillions of rows, equating to 100s of TBs of data after compression/ >1 PB uncompressed.
- Billions of rows/events received per day.
- 100s of 1000s of (very sparse) columns.
- 1000s of named users.

Although Interana’s original design point was spinning disk, most customers store their Interana data on flash.

Interana architecture choices include:

They’re serious about micro-batching.
- If the user’s data is naturally micro-batched — e.g. a new S3 bucket every few minutes — Interana works with that.
- Even if the customer’s data is streamed — e.g. via Kafka — Interana insists on micro-batching it.
They’re casual about schemas.
- Interana assumes data arrives with some kind of recognizable structure, via JSON, CSV or whatever.
  - Interana observes, correctly, that log data often is decently structured.
    - For example, if you’re receiving “phone home” pings from products you originally manufactured, you know what data structures to expect.
    - Interana calls this “logging with intent”.
  - Interana is fine with a certain amount of JSON (for example) schema change over time.
  - If your arriving data truly is a mess, then you need to calm it down via a pass through Splunk or whatever before sending it to Interana.
- JSON hierarchies turn into multi-part column names in the usual way.
- Interana supports one level of true nesting, and one level only; column values can be “lists”, but list values can’t be list themselves.

Finally, other Interana tech notes include:

Compression is a central design consideration …
- … especially but not only compression algorithms designed to deal with great sparseness, such as run-length encoding (RLE).
- Dictionary compression, in a strategy that is rarer than I once expected it to be, uses a global rather than shard-by-shard dictionary. The data Interana expects is of low-enough cardinality for this to be the better choice.
- Column data is sorted. A big part of the reason is of course to aid compression.
- Compression strategies are chosen automatically for each segment. Wholly automatically, I gather; you can’t tune the choice manually.
As you would think, Interana technically includes multiple data stores.
- Data first hits a write-optimized store. Unlike the case of Vertica, this WOS never is involved in answering queries.
- Asynchronously, the data is broken into columns, and banged to “disk”.
- Asynchronously again, the data is sorted.
- Queries run against sorted data, sorting recent blocks on-the-fly if necessary.
Interana lets you shard different replicas of the data according to different shard keys.
Interana is proud of the random sampling it does when serving approximate query results.

Transitioning to the cloud(s)

Curt Monash — Mon, 07 Dec 2015 17:48:53 +0000

There’s a lot of talk these days about transitioning to the cloud, by IT customers and vendors alike. Of course, I have thoughts on the subject, some of which are below.

1. The economies of scale of not running your own data centers are real. That’s the kind of non-core activity almost all enterprises should outsource. Of course, those considerations taken alone argue equally for true cloud, co-location or SaaS (Software as a Service).

2. When the (Amazon) cloud was newer, I used to hear that certain kinds of workloads didn’t map well to the architecture Amazon had chosen. In particular, shared-nothing analytic query processing was necessarily inefficient. But I’m not hearing nearly as much about that any more.

3. Notwithstanding the foregoing, not everybody loves Amazon pricing.

4. Infrastructure vendors such as Oracle would like to also offer their infrastructure to you in the cloud. As per the above, that could work. However:

Is all your computing on Oracle’s infrastructure? Probably not.
Do you want to move the Oracle part and the non-Oracle part to different clouds? Ideally, no.
Do you like the idea of being even more locked in to Oracle than you are now? [Insert BDSM joke here.]
Will Oracle do so much better of a job hosting its own infrastructure that you use its cloud anyway? Well, that’s an interesting question.

Actually, if we replace “Oracle” by “Microsoft”, the whole idea sounds better. While Microsoft doesn’t have a proprietary server hardware story like Oracle’s, many folks are content in the Microsoft walled garden. IBM has fiercely loyal customers as well, and so may a couple of Japanese computer manufacturers.

5. Even when running stuff in the cloud is otherwise a bad idea, there’s still:

Test and dev(elopment) — usually phrased that way, although the opposite order makes more sense.
Short-term projects — the most obvious examples are in investigative analytics.
Disaster recovery.

So in many software categories, almost every vendor should have a cloud option of some kind.

6. Reasons for your data to wind up in a plurality of remote data centers include:

High availability, and similarly disaster recovery. Duh.
Second-source/avoidance of lock-in.
Geo-compliance.
Particular SaaS offerings being hosted in different places.
Use of both true cloud and co-location for different parts of your business.

7. “Mostly compatible” is by no means the same as “compatible”, and confusing the two leads to tears. Even so, “mostly compatible” has stood the IT industry in good stead multiple times. My favorite examples are:

SQL
UNIX (before LINUX).
IBM-compatible PCs (or, as Ben Rosen used to joke, Compaq-compatible).
Many cases in which vendors upgrade their own products.

I raise this point for two reasons:

I think Amazon/OpenStack could be another important example.
A vendor offering both cloud and on-premises versions of their offering, with minor incompatibilities between the two, isn’t automatically crazy.

8. SaaS vendors, in many cases, will need to deploy in many different clouds. Reasons include:

If they want customers around the world, they may need to process data in customers’ home country or region.
It could be the simplest way to meet the need of offering customers an on-premises option.

That said, there are of course significant differences between, for example:

Deploying to Amazon in multiple regions around the world.
Deploying to Amazon plus a variety of OpenStack-based cloud providers around the world, e.g. some “national champions” (perhaps subsidiaries of the main telecommunications firms).*
Deploying to Amazon, to other OpenStack-based cloud providers, and also to an OpenStack-based system that resides on customer premises (or in their co-location facility).

9. The previous point, and the last bullet of the one before that, are why I wrote in a post about enterprise app history:

There’s a huge difference between designing applications to run on one particular technology stack, vs. needing them to be portable across several. As a general rule, offering an application across several different brands of almost-compatible technology — e.g. market-leading RDBMS or (before the Linux era) proprietary UNIX boxes — commonly works out well. The application vendor just has to confine itself to relying on the intersection of the various brands’ feature sets.*

*The usual term for that is the spectacularly incorrect phrase “lowest common denominator”.

Offering the “same” apps over fundamentally different platform technologies is much harder, and I struggle to think of any cases of great success.

10. Decisions on where to process and store data are of course strongly influenced by where and how the data originates. In broadest terms:

Traditional business transaction data at large enterprises is typically managed by on-premises legacy systems. So legacy issues arise in full force.
Internet interaction data — e.g. web site clicks — typically originates in systems that are hosted remotely. (Few enterprises run their websites on premises.) It is tempting to manage and analyze that data where it originates. That said:
- You often want to enhance that data with what you know from your business records …
- … which is information that you may or may not be willing to send off-premises.
“Phone-home” IoT (Internet of Things) data, from devices at — for example — many customer locations, often makes sense to receive in the cloud. Once it’s there, why not process and analyze it there as well?
Machine-generated data that originates on your premises may never need to leave them. Even if their origins are as geographically distributed as customer devices are, there’s a good chance that you won’t need other cloud features (e.g. elastic scalability) as much as in customer-device use cases.

Related link

While the nuances of my views may change over time, I continue to think that computing platforms will almost all be appliances, clusters or clouds.

Differentiation in business intelligence

Curt Monash — Mon, 26 Oct 2015 19:34:09 +0000

Parts of the business intelligence differentiation story resemble the one I just posted for data management. After all:

Both kinds of products query and aggregate data.
Both are offered by big “enterprise standard” behemoth companies and also by younger, nimbler specialists.
You really, really, really don’t want your customer data to leak via a security breach in either kind of product.

That said, insofar as BI’s competitive issues resemble those of DBMS, they are those of DBMS-lite. For example:

BI is less mission-critical than some other database uses.
BI has done a lot less than DBMS to deal with multi-structured data.
Scalability demands on BI are less than those on DBMS — indeed, they’re the ones that are left over after the DBMS has done its data crunching first.

And full-stack analytic systems — perhaps delivered via SaaS (Software as a Service) — can moot the BI/data management distinction anyway.

Of course, there are major differences between how DBMS and BI are differentiated. The biggest are in user experience. I’d say:

For many people, BI is the user experience over the underlying data store(s).
Two crucial aspects of user experience are navigational power and speed of response.
- At one extreme, people hated the old green paper reports.
- At the other, BI in the QlikView/Tableau era is one of the few kinds of enterprise software that competes on the basis of being
- This is also somewhat true with respect to snazzy BI demos, such as interactive maps or way-before-their-day touch screens.*
Features like collaboration and mobile UIs also matter.
Since BI is commonly adopted via quick departmental projects — at least as the hoped-for first-step of a “land-and-expand” campaign — administrative usability is at a premium as well.

* Computer Pictures and thus Cullinet used a touch screen over 30 years ago. Great demo, but not so useful as an actual product, due to the limitations on data structure.

Where things get tricky is in my category of accuracy. In the early 2000s, I pitched and wrote a white paper arguing that BI helps bring “integrity” to an enterprise in various ways. But I don’t think BI vendors have done a good job of living up to that promise.

They’ve moved slowly in accuracy-intensive areas such as alerting or predictive modeling.
“Single source of truth” and similar protestations turned out to be much oversold.

Indeed, it’s tempting to say that business intelligence has been much too stupid. I really like some attempts to make BI sharper, e.g. at Rocana or ClearStory, but it remains to be seen whether many customer care about their business intelligence actually being smart.

So how does all this fit into my differentiation taxonomy/framework? Referring liberally to what has already been written above, we get:

Scope:
- For traditional tabular analysis, BI products compete on a bunch of UI features.
- Non-tabular analysis is much more primitive. Event series interfaces may be the closest thing to an exception.
- Collaboration is in the mix as well.
Accuracy: I discussed this one above.
Other trustworthiness:
- Security is a big deal.
- Mission-critical robustness is usually, in truth, just a nice-to-have. But some (self-)important executives may disagree.
Speed:
- For some functionality — e.g. cross-database joins — BI tools almost have to rely on their own DBMS-like engines for performance.
- For other it’s more optional. You can do single-RDBMS query straight against the underlying system, or you can pre-position some of the data in memory.
- Please also see the adoption and administration section below.
User experience: I discussed this one above.
Adoption and administration:
- When BI is “owned” by a department, especially one that also doesn’t manage the underlying data, set-up and administration need to be super-easy.
- Sometimes, departmental BI is used as an excuse to pressure central IT into making data available.
- Much like analytic DBMS, BI adoption can sometimes be tied to huge first-time-data-warehouse building projects.
- Administration of big enterprise-standard BI is, to re-use a term, much like DBMS-lite.
Cost: The true cost of BI usage is commonly governed more by the underlying data management (and data acquisition) than by the BI software (and supporting servers) itself. That said:
- BI “hard” costs — licenses, servers, cloud fees, whatever — commonly have to fit into departmental budgets.
- So do BI people costs.
- BI people requirements also often have to fit into departmental skillets.

Differentiation in data management

Curt Monash — Mon, 26 Oct 2015 19:32:34 +0000

In the previous post I broke product differentiation into 6-8 overlapping categories, which may be abbreviated as:

Scope
Accuracy
(Other) trustworthiness
Speed
User experience
Cost

and sometimes also issues in adoption and administration.

Now let’s use this framework to examine two market categories I cover — data management and, in separate post, business intelligence.

Applying this taxonomy to data management:

Scope: Different subcategories of data management technology are suitable for different kinds of data, different scale of data, etc. To a lesser extent that may be true within a subcategory as well.
Scope: Further, products may differ in what you can do with the data, especially analytically.
Accuracy: Don’t … lose … data.
Other trustworthiness:
- Uptime, availability and so on are big deals in many data management sectors.
- Security is hugely important for data that both belongs to other people — usually your customers — and is accessible via the internet. It’s important in numerous other database use cases as well.
- Awkwardly, the CAP Theorem teaches us that there can be a bit of a trade-off between availability and (temporary) accuracy.
Speed:
- Different kinds of data management products perform differently in different use cases.
- If your use case is down the middle of what a mature data management subsector focuses on, performance may not vary much among individual leading products.
- Even then, tuning effort may be quite different for different products.
User experience:
- Users rarely interact directly with database management products.
- There can be clear differentiation in database administration UIs. (The most dramatic example was perhaps the rise of Microsoft SQL Server.)
- Data manipulation languages (DMLs) can make a huge difference in programmers’ lives.
Cost:
- License and maintenance costs can be a huge issue, especially if you’re buying from traditional vendors.
- Performance affects cost in a few ways: hardware costs for sure, tuning effort in some cases, and occasionally even vendor license/maintenance fees.
- Ongoing operations costs can vary greatly by database product in general, and by your pre-existing in-house expertise in particular.
- Ease of programming can sometimes lead to significant programming cost differences as well.
Adoption: This one is often misunderstood.
- The effort of adopting new database technology for new applications is often overrated. When projects are huge, it’s often because of what you’re doing with the technology, not because of the technology itself.
- Migration, however, is usually a bitch.

For reasons of length, I’m doing a separate post on differentiation in business intelligence.

Sources of differentiation

Curt Monash — Mon, 26 Oct 2015 19:31:38 +0000

Obviously, a large fraction of what I write about involves technical differentiation. So let’s try for a framework where differentiation claims can be placed in context. This post will get through the generalities. The sequels will apply them to specific cases.

Many buying and design considerations for IT fall into six interrelated areas:

Scope: What does the technology even purport to do? This consideration applies to pretty much everything.
- Usually, this means something like features.
- However, there’s an important special case in which the important features are the information content. (Examples: Arguably Google, and the Bloomberg service for sure.)
Accuracy: How correctly does the technology do it? This can take multiple forms.
- Sometimes, a binary right/wrong distinction pretty much suffices, with an acceptable error rate of zero. If you’re writing data, it shouldn’t get lost. If you’re doing arithmetic, it should be correct. Etc.
- Sometimes, there’s a clear right/wrong distinction, but error rates are necessarily non-zero, often with a trade-off between the rates for false positives and false negatives. (In text search and similar areas, those rates are measured respectively as precision and recall.) Security is a classic example. Many other cases arise when trying to identify problems or
- Sometimes accuracy is on a scale. Predictive modeling results are commonly of that kind. So are text search, voice recognition and so on.
Other trustworthiness.
- Reliability, availability and security are considerations in almost any IT scenario.
- Also crucial are any factors that are perceived as affecting the risk of project failure. Sometimes, these are lumped together as (part of) maturity.
Speed. There’s a great real and/or perceived “need for speed”.
- On the user level:
  - There are many advantages to quick results, “real time” or otherwise.
  - In particular, analysis is often more accurate if you have time for more iterations or intermediate steps.
  - Please recall that speed can actually have multiple kinds of benefit. For example, it can reduce costs, it can improve accuracy, it can improve user experience, or it can enable capabilities that would otherwise be wholly impractical.
- There can also be considerations of time to (initial) value, although people sometimes overrate how often this is a function of the technology itself.
- Consistency of performance can be an important aspect of product maturity.
User experience. Ideally, using a system is easy and pleasurable, or at least not unpleasant.
- Ease of use often equates to ease of (re)learning …
- … but there are exceptions, generally for what might be considered “power users”.
- Speed and performance can avoid a lot of unpleasant frustration.
- In some cases you can compel somebody — usually an employee — to use your interface. Often, however, you can’t, and that’s when user experience may matter most.
- An important category of user experience that doesn’t directly equate to ease or is Of course, the more accurate the recommendations are, the better.
- Most systems have at least two categories of user experience — one for the true users, and one for the IT folks who manage it. The IT folks’ experience often depends not just on true UI features, but on how hard or difficult the underlying system is to deal with in the first place.
Cost, or more precisely TCO (Total Cost of Ownership). Cost is always important, and especially so if there are numerous viable alternatives.
- Sometimes money paid to the vendor really is the largest component of TCO.
- Often, however, hardware or IT personnel expenditures are the lion’s share of overall cost.
- Administrators’ user experience can affect a large chunk of TCO.

Related links

This post is starting out with two sequels, on data management and business intelligence respectively.
Issues of differentiation are central to my strategic worksheet.
When thinking about differentiation, keep in mind the distinction between wants and needs.
This post fits well with my claim that every product in a category is positioned along the same set of attributes.
In a post last year about differentiation, I wrote “Your spiffy innovation is important in fewer situations than you would like to believe.”
If you think you’re a rare exception to that rule, please see my post about over-optimism.

Basho and Riak

Curt Monash — Thu, 15 Oct 2015 15:18:05 +0000

Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.

For starters:

Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
Basho’s revenue is ~90% subscription.
Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would be close to half the total, actually), comes from 2 particular deals of >$4 million each.

Basho’s product line has gotten a bit confusing, but as best I understand things the story is:

There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
Riak TS is for time series, and just coming out now.
Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
There’s an umbrella marketing term of “Basho Data Platform”.

Technical notes on some of that include:

Riak Core doesn’t do data management. It just manages distributed operation of — well, whatever you want to operate. In part, Basho sees Riak Core as a better Apache ZooKeeper.
- That is the essence of the Riak/Spark pitch — something better than ZooKeeper for cluster management, and I presume some help in persisting Spark RDDs as well.
- The Riak/Redis pitch is even simpler — cluster management for Redis, and persistent backing as well.
- Basho’s criticisms of ZooKeeper start with “Cluster manager, manage thyself” claims about ZooKeeper availability, as in the PagerDuty ZooKeeper critique.
Riak KV has secondary indexing. Performance is somewhat questionable. It also has Solr indexing, which is fast.
At least in its 1.0 form, Riak TS assumes:
- There’s some kind of schema or record structure.
- There are explicit or else easily-inferred timestamps.
- Microsecond accuracy, perfect ordering and so on are not essential.
Thus, Riak TS 1.0 is not ideal for the classic Splunk use case where you text index/search on a lot of log emissions. It also is not ideal for financial tick storage.
Riak TS has range-based partitioning, where the range is in terms of time. Basho refers to this as “locality”.
Riak TS has a SQL subset. Evidently there’s decent flexibility as to which part of the database carries which schema.
Riak has a nice feature of allowing you stage a change to network topology before you push it live.
Riak’s vector clock approach to wide-area synchronization is more controversial.

Finally, notes on what Basho sees as use cases and competition include:

Riak KV is generally used to store usual-suspect stuff — log data, user/profile data and so on.
Basho thinks NoSQL is a 4-horse race — Basho/Riak KV, DataStax/Cassandra, MongoDB, Couchbase. (I would be surprised if there was much agreement with that view from, for example, MongoDB, DataStax, Aerospike, MapR or the HBase community.)
Basho competes on availability, scalability (including across geography) and so on, or in simplest terms:
- “Availability and correctness”
- Simple operation
Unsurprisingly, Basho thinks its closest competitor is DataStax. (However, DataStax tells me they don’t compete much with Basho.)
Basho thinks Riak KV has ease-of-operation advantages vs. Cassandra.
An example of a mission-critical Riak app is the UK National Health Service storing prescription information.
An example of Riak S2 and Riak KV being used together is Turner Broadcasting storing video in the former and associated metadata in the latter.
Riak TS is initially pointed at two use cases:
- “Internet of Things”
- “Metrics”, which seems to mean monitoring of system metrics.
Basho sees the competition for Riak TS as starting with HBase, Cassandra, and InfluxDB.

Rocana’s world

Curt Monash — Thu, 17 Sep 2015 11:49:21 +0000

For starters:

My client Rocana is the renamed ScalingData, where Rocana is meant to signify ROot Cause ANAlysis.
Rocana was founded by Omer Trajman, who I’ve referenced numerous times in the past, and who I gather is a former boss of …
… cofounder Eric Sammer.
Rocana recently told me it had 35 people.
Rocana has a very small number of quite large customers.

Rocana portrays itself as offering next-generation IT operations monitoring software. As you might expect, this has two main use cases:

Actual operations — figuring out exactly what isn’t working, ASAP.
Security.

Rocana’s differentiation claims boil down to fast and accurate anomaly detection on large amounts of log data, including but not limited to:

The sort of network data you’d generally think of — “everything” except packet-inspection stuff.
Firewall output.
Database server logs.
Point-of-sale data (at a retailer).
“Application data”, whatever that means. (Edit: See Tom Yates’ clarifying comment below.)

In line with segment leader Splunk’s pricing, data volumes in this area tend to be described in terms of new data/day. Rocana seems to start around 3 TB/day, which not coincidentally is a range that would generally be thought of as:

Challenging for Splunk, and for the budgets of Splunk customers.
Not a big problem for well-implemented Hadoop.

And so part of Rocana’s pitch, familiar to followers of analytic RDBMS and Hadoop alike, is “We keep and use all your data, unlike the legacy guys who make you throw some of it away up front.”

Since Rocana wants you to keep all your data, 3 TB/day is about 1 PB/year.

But really, that’s just saying that Rocana is an analytic stack built on Hadoop, using Hadoop for what people correctly think it’s well-suited for, done by guys who know a lot about Hadoop.

The cooler side of Rocana, to my tastes, is the actual analytics. Truth be told, I find almost any well thought out event-series analytics story cool. It’s an area much less mature than relational business intelligence, and accordingly with much more scope for innovation. On the visualization side, crucial aspects start:

Charting over time (duh).
Comparing widely disparate time intervals (e.g., current vs. historical/baseline).
Whichever good features from relational BI apply to your use case as well.

Other important elements may be more data- or application-specific — and the fact that I don’t have a long list of particulars illustrates just how immature the area really is.

Even cooler is Rocana’s integration of predictive modeling and BI, about which I previously remarked:

The idea goes something like this:

Suppose we have lots of logs about lots of things. Machine learning can help:

Notice what’s an anomaly.

Group together things that seem to be experiencing similar anomalies.

That can inform a BI-plus interface for a human to figure out what is happening.

Makes sense to me.

So far as I can tell, predictive modeling is used to notice aberrant data (raw or derived). This is quickly used to define a subset of data to drill down to (e.g., certain kinds of information from certain machines in a certain period of time). Event-series BI/visualization then lets you see the flows that led to the aberrant result, which was any luck will allow you to find the exact place where the data first goes wrong. And that, one hopes, is something that the ops guys can quickly fix.

I think similar approaches could make sense in numerous application segments.

Related links

Rocana’s Hadoop stack presumably includes both Kafka and Spark Streaming.
Back when Splunk still answered my email, I wrote about its inverted-list data management architecture.
Ursula Le Guin’s debut novel Rocannon’s World has nothing to do with this post (although it does start with a really lousy bit of temporal analysis ). I just like making allusions to her work.

Hadoop generalities

Curt Monash — Wed, 10 Jun 2015 12:33:17 +0000

Occasionally I talk with an astute reporter — there are still a few left — and get led toward angles I hadn’t considered before, or at least hadn’t written up. A blog post may then ensue. This is one such post.

There is a group of questions going around that includes:

Is Hadoop overhyped?
Has Hadoop adoption stalled?
Is Hadoop adoption being delayed by skills shortages?
What is Hadoop really good for anyway?
Which adoption curves for previous technologies are the best analogies for Hadoop?

To a first approximation, my responses are:

The Hadoop hype is generally justified, but …
… what exactly constitutes “Hadoop” is trickier than one might think, in at least two ways:
- Hadoop is much more than just a few core projects.
- Even the core of Hadoop is repeatedly re-imagined.
RDBMS are a good analogy for Hadoop.
As a general rule, Hadoop adoption is happening earlier for new applications, rather than in replacement or rehosting of old ones. That kind of thing is standard for any comparable technology, both because enabling new applications can be valuable and because migration is a pain.
Data transformation, as pre-processing for analytic RDBMS use, is an exception to that general rule. That said …
… it’s been adopted quickly because it saves costs. But of course a business that’s only about cost savings may not generate a lot of revenue.
Dumping data into a Hadoop-centric “data lake” is a smart decision, even if you haven’t figured out yet what to do with it. But of course, …
… even if zero-application adoption makes sense, it isn’t exactly a high-value proposition.
I’m generally a skeptic about market numbers. Specific to Hadoop, I note that:
- The most reliable numbers about Hadoop adoption come from Hortonworks, since it is the only pure-play public company in the market. (Compare, for example, the negligible amounts of information put out by MapR.) But Hortonworks’ experiences are not necessarily identical to those of other vendors, who may compete more on the basis of value-added service and technology rather than on open source purity or price.
- Hadoop (and the same is true of NoSQL) are most widely adopted at digital companies rather than at traditional enterprises.
- That said, while all traditional enterprises have some kind of digital presence, not all have ones of the scope that would mandate a heavy investment in internet technologies. Large consumer-oriented companies probably do, but companies with more limited customer bases might not be there yet.
Concerns about skill shortages are exaggerated.
- The point of distributing processing frameworks such as Spark or MapReduce is to make distributed analytic or application programming not be much harder than any other kind.
- If a new programming language or framework needs to be adopted — well, programmers nowadays love learning that kind of stuff.
- The industry is moving quickly to make distributed systems easier to administer. Any skill shortages in operations should prove quite temporary.

MemSQL 4.0

Curt Monash — Wed, 20 May 2015 09:41:34 +0000

I talked with my clients at MemSQL about the release of MemSQL 4.0. Let’s start with the reminders:

MemSQL started out as in-memory OTLP (OnLine Transaction Processing) DBMS …
… but quickly positioned with “We also do ‘real-time’ analytic processing” …
… and backed that up by adding a flash-based column store option …
… before Gartner ever got around to popularizing the term HTAP (Hybrid Transaction and Analytic Processing).
There’s also a JSON option.

The main new aspects of MemSQL 4.0 are:

Geospatial indexing. This is for me the most interesting part.
A new optimizer and, I suppose, query planner …
… which in particular allow for serious distributed joins.
Some rather parallel-sounding connectors to Spark. Hadoop and Amazon S3.
Usual-suspect stuff including:
- More SQL coverage (I forgot to ask for details).
- Some added or enhanced administrative/tuning/whatever tools (again, I forgot to ask for details).
- Surely some general Bottleneck Whack-A-Mole.

There’s also a new free MemSQL “Community Edition”. MemSQL hopes you’ll experiment with this but not use it in production. And MemSQL pricing is now wholly based on RAM usage, so the column store is quasi-free from a licensing standpoint is as well.

Before MemSQL 4.0, distributed joins were restricted to the easy cases:

Two tables are distributed (i.e. sharded) on the same key.
One table is small enough to be broadcast to each node.

Now arbitrary tables can be joined, with data reshuffling as needed. Notes on MemSQL 4.0 joins include:

Join algorithms are currently nested-loop and hash, and in “narrow cases” also merge.
MemSQL fondly believes that its in-memory indexes work very well for nested-loop joins.
The new optimizer is fully cost-based (but I didn’t get much clarity as to the cost estimators for JSON).
MemSQL’s indexing scheme, skip lists, had histograms anyway, with the cutesy name skiplistogram.
MemSQL’s queries have always been compiled, and of course have to be planned before compilation. However, there’s a little bit of plan flexibility built in based on the specific values queried for, aka “parameter-sensitive plans” or “run-time plan choosing”.

To understand the Spark/MemSQL connector, recall that MemSQL has “leaf” nodes, which store data, and “aggregator” nodes, which combine query results and ship them back to the requesting client. The Spark/MemSQL connector manages to skip the aggregation step, instead shipping data directly from the various MemSQL leaf nodes to a Spark cluster. In the other direction, a Spark RDD can be saved into MemSQL as a table. This is also somehow parallel, and can be configured either as a batch update or as an append; intermediate “conflict resolution” policies are possible as well.

In other connectivity notes:

MemSQL’s idea of a lambda architecture involves a Kafka stream, with data likely being stored twice (in Hadoop and MemSQL).
MemSQL likes and supports the Spark DataFrame API, and says financial trading firms are already using it.

Other application areas cited for streaming/lambda kinds of architectures are — you guessed it! — ad-tech and “anomaly detection”.

And now to the geospatial stuff. I thought I heard:

A “point” is actually a square region less than 1 mm per side.
There are on the order of 2^30 such points on the surface of the Earth.

Given that Earth’s surface area is a little over 500,000,000 square meters, I’d think 2^50 would be a better figure, but fortunately that discrepancy doesn’t matter to the rest of the discussion. (Edit: As per a comment below, that’s actually square kilometers, so unless I made further errors we’re up to the 2^70 range.)

Anyhow, if the two popular alternatives for geospatial indexing are R-trees or space-filling curves, MemSQL favors the latter. (One issue MemSQL sees with R-trees is concurrency.) Notes on space-filling curves start:

In this context, a space-filling curve is a sequential numbering of points in a higher-dimensional space. (In MemSQL’s case, the dimension is two.)
Hilbert curves seem to be in vogue, including at MemSQL.
Nice properties of Hilbert space-filling curves include:
- Numbers near each other always correspond to points near each other.
- The converse is almost always true as well.*
- If you take a sequence of numbers that is simply the set of all possibilities with a particular prefix string, that will correspond to a square region. (The shorter the prefix, the larger the square.)

*You could say it’s true except in edge cases … but then you’d deserve to be punished.

Given all that, my understanding of the way MemSQL indexes geospatial stuff — specifically points and polygons — is:

Points have numbers assigned to them by the space-filling curve; those are indexed in MemSQL’s usual way. (Skip lists.)
A polygon is represented by its vertices. Take the longest prefix they share. That could be used to index them (you’d retrieve a square region that includes the polygon). But actually …
… a polygon is covered by a union of such special square regions, and indexed accordingly, and I neglected to ask exactly how the covering set of squares was chosen.

As for company metrics — MemSQL cites >50 customers and >60 employees.

Related links

I’ve posted about earlier versions of MemSQL technology, e.g. in May, 2014, April, 2013 and June, 2012.

Greenplum is being open sourced

Curt Monash — Wed, 18 Feb 2015 21:51:39 +0000

While I don’t find the Open Data Platform thing very significant, an associated piece of news seems cooler — Pivotal is open sourcing a bunch of software, with Greenplum as the crown jewel. Notes on that start:

Greenplum has been an on-again/off-again low-cost player since before its acquisition by EMC, but open source is basically a commitment to having low license cost be permanently on.
In most regards, “free like beer” is what’s important here, not “free like speech”. I doubt non-Pivotal employees are going to do much hacking on the long-closed Greenplum code base.
That said, Greenplum forked PostgreSQL a long time ago, and the general PostgreSQL community might gain ideas from some of the work Greenplum has done.
The only other bit of newly open-sourced stuff I find interesting is HAWQ. Redis was already open source, and I’ve never been persuaded to care about GemFire.

Greenplum, let us recall, is a pretty decent MPP (Massively Parallel Processing) analytic RDBMS. Various aspects of it were oversold at various times, and I’ve never heard that they actually licked concurrency. But Greenplum has long had good SQL coverage and petabyte-scale deployments and a columnar option and some in-database analytics and so on; i.e., it’s legit. When somebody asks me about open source analytic RDBMS to consider, I expect Greenplum to consistently be on the short list.

Further, the low-cost alternatives for analytic RDBMS are adding up.

Amazon Redshift has considerable traction.
Hadoop (even just with Hive) has offloaded a lot of ELT (Extract/Load/Transform) from analytic RDBMS such as Teradata.
Now Greenplum is in the mix as well.

For many analytic RDBMS use cases, at least one of those three will be an appealing possibility.

By no means do I want to suggest those are the only alternatives.

Smaller-vendor offerings, such as CitusDB or Infobright, may well be competitive too.
Larger vendors can always slash price in specific deals.
MonetDB is still around.

But the three possibilities I cited first should suffice as proof for almost all enterprises that, for most use cases not requiring high concurrency, analytic RDBMS need not cost an arm and a leg.

Related link

Greenplum revenue at EMC was problematic from the get-go.