Splunk – DBMS 2 : DataBase Management System Services

Interana

Curt Monash — Mon, 17 Apr 2017 10:10:41 +0000

Interana has an interesting story, in technology and business model alike. For starters:

Interana does ad-hoc event series analytics, which they call “interactive behavioral analytics solutions”.
Interana has a full-stack analytic offering, include:
- Its own columnar DBMS …
- … which has a non-SQL DML (Data Manipulation Language) meant to handle event series a lot more fluently than SQL does, but which the user is never expected to learn because …
- … there also are BI-like visual analytics tools that support plenty of drilldown.
Interana sells all this to “product” departments rather than marketing, because marketing doesn’t sufficiently value Interana’s ad-hoc query flexibility.
Interana boasts >40 customers, with annual subscription fees ranging from high 5 figures to low 7 digits.

And to be clear — if we leave aside any questions of marketing-name sizzle, this really is business intelligence. The closest Interana comes to helping with predictive modeling is giving its ad-hoc users inspiration as to where they should focus their modeling attention.

Interana also has an interesting twist in its business model, which I hope can be used successfully by other enterprise software startups as well.

For now, at no extra charge, Interana will operate its software for you as a managed service. (A majority of Interana’s clients run the software on Amazon or Azure, where that kind of offering makes sense.)
However, presumably in connection with greater confidence in its software’s ease of administration, Interana will move this year toward unbundling the service as an extra-charge offering on top of the software itself.

The key to understanding Interana is its DML. Notes on that include:

Interana’s DML is focused on path analytics …
- … but Interana doesn’t like to use that phrase because it sounds too math-y and difficult.
- Interana may be the first company that’s ever told me it’s focused on providing a better nPath.
Primitives in Interana’s language — notwithstanding the company’s claim that it never ever intended to sell to marketing departments — include familiar web analytics concepts such as “session”, “funnel” and so on. (However, these are being renamed to more neutral terms such as “flow” in an upcoming version of the product.)
As typical example questions or analytic subjects, Interana offered:
- “Which are the most common products in shopping carts where time-to-checkout was greater than 30 minutes?”
- Exactly which steps in the onboarding process result in the greatest user frustration?
The Interana folks and I agree that Splunk is the most recent example of a new DML kicking off a significant company.
The most recent example I can think of in which a vendor hung its hat on a new DML that was a “visual programming language” is StreamBase, with EventFlow. That didn’t go all that well.
To use Founder/CTO Bobby Johnson’s summary term, the real goal of the Interana language is to describe a state machine, specifically one that produces (sets of) sequences of events (and the elapsed time between them).

Notes on Interana speeds & feeds include:

Interana only promises data freshness up to micro-batch latencies — i.e., a few minutes. (Obviously, this shuts them out of most networking monitoring and devops use cases.)
Interana thinks it’s very important for query response time to max out at a low number of seconds. If necessary, the software will return approximate results rather than exact ones so as to meet this standard.
Interana installations and workloads to date have gotten as large as:
- 1-200 nodes.
- Trillions of rows, equating to 100s of TBs of data after compression/ >1 PB uncompressed.
- Billions of rows/events received per day.
- 100s of 1000s of (very sparse) columns.
- 1000s of named users.

Although Interana’s original design point was spinning disk, most customers store their Interana data on flash.

Interana architecture choices include:

They’re serious about micro-batching.
- If the user’s data is naturally micro-batched — e.g. a new S3 bucket every few minutes — Interana works with that.
- Even if the customer’s data is streamed — e.g. via Kafka — Interana insists on micro-batching it.
They’re casual about schemas.
- Interana assumes data arrives with some kind of recognizable structure, via JSON, CSV or whatever.
  - Interana observes, correctly, that log data often is decently structured.
    - For example, if you’re receiving “phone home” pings from products you originally manufactured, you know what data structures to expect.
    - Interana calls this “logging with intent”.
  - Interana is fine with a certain amount of JSON (for example) schema change over time.
  - If your arriving data truly is a mess, then you need to calm it down via a pass through Splunk or whatever before sending it to Interana.
- JSON hierarchies turn into multi-part column names in the usual way.
- Interana supports one level of true nesting, and one level only; column values can be “lists”, but list values can’t be list themselves.

Finally, other Interana tech notes include:

Compression is a central design consideration …
- … especially but not only compression algorithms designed to deal with great sparseness, such as run-length encoding (RLE).
- Dictionary compression, in a strategy that is rarer than I once expected it to be, uses a global rather than shard-by-shard dictionary. The data Interana expects is of low-enough cardinality for this to be the better choice.
- Column data is sorted. A big part of the reason is of course to aid compression.
- Compression strategies are chosen automatically for each segment. Wholly automatically, I gather; you can’t tune the choice manually.
As you would think, Interana technically includes multiple data stores.
- Data first hits a write-optimized store. Unlike the case of Vertica, this WOS never is involved in answering queries.
- Asynchronously, the data is broken into columns, and banged to “disk”.
- Asynchronously again, the data is sorted.
- Queries run against sorted data, sorting recent blocks on-the-fly if necessary.
Interana lets you shard different replicas of the data according to different shard keys.
Interana is proud of the random sampling it does when serving approximate query results.

Notes on the transition to the cloud

Curt Monash — Tue, 04 Oct 2016 02:22:21 +0000

1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:

The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
Software as a service, aka SaaS.
Co-location in off-premises data centers, aka colo.
On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.

Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.

This is a good example of Monash’s Laws of Commercial Semantics.

2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.

This fact now seems to be widely understood.

3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.

This is frequently a good reason for new — i.e. “greenfield” — apps to run in the cloud.

4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.

In connection to that point, it might be interesting to note:

In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
Other big users were hospitals and stockbrokers.
The US intelligence agencies are building out their own shared, dedicated cloud.

5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.

In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.

6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.

7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:

Does your application need to run close to your customers’ largest databases?
Do your customers still avoid the public cloud?

If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.

8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:

Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
Departments don’t necessarily care about issues that give central IT pause.
Departments sometimes buy things that only are available via remote delivery, e.g. narrowly focused SaaS applications or market data.

I discussed some of this in my recent post on vendor lock-in.

9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.

Those concerns seem to have been largely alleviated.

10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.

11. If you think about system requirements:

There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
Server-side computing, however, is more homogenous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.

I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.

So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.

Otherwise, clusters will probably be organized by database, in the most expansive sense of term. For example, there could be separate clusters for:

Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
Relational data warehousing, whether in an analytic RDBMS or otherwise.
Log files, perhaps managed in Hadoop or Splunk.
Your website and other internet back-ends, perhaps running over NoSQL data stores.
Text documents managed by some kind of search engine.
Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)

Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.

Splunk engages in stupid lawyer tricks

Curt Monash — Wed, 25 Nov 2015 14:14:32 +0000

Using legal threats as an extension of your marketing is a bad idea. At least, it’s a bad idea in the United States, where such tactics are unlikely to succeed, and are apt to backfire instead. Splunk seems to actually have had some limited success intimidating Sumo Logic. But it tried something similar against Rocana, and I was set up to potentially be collateral damage. I don’t think that’s working out very well for Splunk.

Specifically, Splunk sent a lawyer letter to Rocana, complaining about a couple of pieces of Rocana marketing collateral. Rocana responded publicly, and posted both the Splunk letter and Rocana’s lawyer response. The Rocana letter eviscerated Splunk’s lawyers on matters of law, clobbered them on the facts as well, exposed Splunk’s similar behavior in the past, and threw in a bit of snark at the end.

Now I’ll pile on too. In particular, I’ll note that, while Splunk wants to impose a duty of strict accuracy upon those it disagrees with, it has fewer compunctions about knowingly communicating falsehoods itself.

1. Splunk’s letter insinuates that Rocana might have paid me to say what I blogged about them. Those insinuations are of course false.

Splunk was my client for a lot longer, and at a higher level of annual retainer, than Rocana so far has been. Splunk never made similar claims about my posts about them. Indeed, Splunk complained that I did not write about them often or favorably enough, and on at least one occasion seemed to delay renewing my services for that reason.

2. Similarly, Splunk’s letter makes insinuations about quotes I gave Rocana. But I also gave at least one quote to Splunk when they were my client. As part of the process — and as is often needed — I had a frank and open discussion with them about my quote policies. So Splunk should know that their insinuations are incorrect.

3. Splunk’s letter actually included the sentences

Splunk can store data in, and analyze data across, Splunk, SQL, NoSQL, and Hadoop data repositories. Accordingly, the implication that Splunk cannot scale like Hadoop is misleading and inaccurate.

I won’t waste the time of this blog’s readers by explaining how stupid that is, except to point out that I don’t think Splunk executes queries entirely in Hadoop. If you want to consider the matter further, you might consult my posts regarding Splunk HPAS and Splunk Hunk.

4. I and many other people have heard concerns about the cost of running Splunk for high volumes of data ingest. Splunk’s letter suggests we’re all making this up. This post suggests that Splunk’s lawyers can’t have been serious.

Related links

In the US, even individuals can safely stare down libel threats. I’ve done it.
Trying to make something that’s online go away can have the extreme opposite effect.

Basho and Riak

Curt Monash — Thu, 15 Oct 2015 15:18:05 +0000

Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.

For starters:

Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
Basho’s revenue is ~90% subscription.
Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would be close to half the total, actually), comes from 2 particular deals of >$4 million each.

Basho’s product line has gotten a bit confusing, but as best I understand things the story is:

There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
Riak TS is for time series, and just coming out now.
Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
There’s an umbrella marketing term of “Basho Data Platform”.

Technical notes on some of that include:

Riak Core doesn’t do data management. It just manages distributed operation of — well, whatever you want to operate. In part, Basho sees Riak Core as a better Apache ZooKeeper.
- That is the essence of the Riak/Spark pitch — something better than ZooKeeper for cluster management, and I presume some help in persisting Spark RDDs as well.
- The Riak/Redis pitch is even simpler — cluster management for Redis, and persistent backing as well.
- Basho’s criticisms of ZooKeeper start with “Cluster manager, manage thyself” claims about ZooKeeper availability, as in the PagerDuty ZooKeeper critique.
Riak KV has secondary indexing. Performance is somewhat questionable. It also has Solr indexing, which is fast.
At least in its 1.0 form, Riak TS assumes:
- There’s some kind of schema or record structure.
- There are explicit or else easily-inferred timestamps.
- Microsecond accuracy, perfect ordering and so on are not essential.
Thus, Riak TS 1.0 is not ideal for the classic Splunk use case where you text index/search on a lot of log emissions. It also is not ideal for financial tick storage.
Riak TS has range-based partitioning, where the range is in terms of time. Basho refers to this as “locality”.
Riak TS has a SQL subset. Evidently there’s decent flexibility as to which part of the database carries which schema.
Riak has a nice feature of allowing you stage a change to network topology before you push it live.
Riak’s vector clock approach to wide-area synchronization is more controversial.

Finally, notes on what Basho sees as use cases and competition include:

Riak KV is generally used to store usual-suspect stuff — log data, user/profile data and so on.
Basho thinks NoSQL is a 4-horse race — Basho/Riak KV, DataStax/Cassandra, MongoDB, Couchbase. (I would be surprised if there was much agreement with that view from, for example, MongoDB, DataStax, Aerospike, MapR or the HBase community.)
Basho competes on availability, scalability (including across geography) and so on, or in simplest terms:
- “Availability and correctness”
- Simple operation
Unsurprisingly, Basho thinks its closest competitor is DataStax. (However, DataStax tells me they don’t compete much with Basho.)
Basho thinks Riak KV has ease-of-operation advantages vs. Cassandra.
An example of a mission-critical Riak app is the UK National Health Service storing prescription information.
An example of Riak S2 and Riak KV being used together is Turner Broadcasting storing video in the former and associated metadata in the latter.
Riak TS is initially pointed at two use cases:
- “Internet of Things”
- “Metrics”, which seems to mean monitoring of system metrics.
Basho sees the competition for Riak TS as starting with HBase, Cassandra, and InfluxDB.

Rocana’s world

Curt Monash — Thu, 17 Sep 2015 11:49:21 +0000

For starters:

My client Rocana is the renamed ScalingData, where Rocana is meant to signify ROot Cause ANAlysis.
Rocana was founded by Omer Trajman, who I’ve referenced numerous times in the past, and who I gather is a former boss of …
… cofounder Eric Sammer.
Rocana recently told me it had 35 people.
Rocana has a very small number of quite large customers.

Rocana portrays itself as offering next-generation IT operations monitoring software. As you might expect, this has two main use cases:

Actual operations — figuring out exactly what isn’t working, ASAP.
Security.

Rocana’s differentiation claims boil down to fast and accurate anomaly detection on large amounts of log data, including but not limited to:

The sort of network data you’d generally think of — “everything” except packet-inspection stuff.
Firewall output.
Database server logs.
Point-of-sale data (at a retailer).
“Application data”, whatever that means. (Edit: See Tom Yates’ clarifying comment below.)

In line with segment leader Splunk’s pricing, data volumes in this area tend to be described in terms of new data/day. Rocana seems to start around 3 TB/day, which not coincidentally is a range that would generally be thought of as:

Challenging for Splunk, and for the budgets of Splunk customers.
Not a big problem for well-implemented Hadoop.

And so part of Rocana’s pitch, familiar to followers of analytic RDBMS and Hadoop alike, is “We keep and use all your data, unlike the legacy guys who make you throw some of it away up front.”

Since Rocana wants you to keep all your data, 3 TB/day is about 1 PB/year.

But really, that’s just saying that Rocana is an analytic stack built on Hadoop, using Hadoop for what people correctly think it’s well-suited for, done by guys who know a lot about Hadoop.

The cooler side of Rocana, to my tastes, is the actual analytics. Truth be told, I find almost any well thought out event-series analytics story cool. It’s an area much less mature than relational business intelligence, and accordingly with much more scope for innovation. On the visualization side, crucial aspects start:

Charting over time (duh).
Comparing widely disparate time intervals (e.g., current vs. historical/baseline).
Whichever good features from relational BI apply to your use case as well.

Other important elements may be more data- or application-specific — and the fact that I don’t have a long list of particulars illustrates just how immature the area really is.

Even cooler is Rocana’s integration of predictive modeling and BI, about which I previously remarked:

The idea goes something like this:

Suppose we have lots of logs about lots of things. Machine learning can help:

Notice what’s an anomaly.

Group together things that seem to be experiencing similar anomalies.

That can inform a BI-plus interface for a human to figure out what is happening.

Makes sense to me.

So far as I can tell, predictive modeling is used to notice aberrant data (raw or derived). This is quickly used to define a subset of data to drill down to (e.g., certain kinds of information from certain machines in a certain period of time). Event-series BI/visualization then lets you see the flows that led to the aberrant result, which was any luck will allow you to find the exact place where the data first goes wrong. And that, one hopes, is something that the ops guys can quickly fix.

I think similar approaches could make sense in numerous application segments.

Related links

Rocana’s Hadoop stack presumably includes both Kafka and Spark Streaming.
Back when Splunk still answered my email, I wrote about its inverted-list data management architecture.
Ursula Le Guin’s debut novel Rocannon’s World has nothing to do with this post (although it does start with a really lousy bit of temporal analysis ). I just like making allusions to her work.

IT-centric notes on the future of health care

Curt Monash — Tue, 26 May 2015 05:02:09 +0000

It’s difficult to project the rate of IT change in health care, because:

Health care is suffused with technology — IT, medical device and biotech alike — and hence has the potential for rapid change. However, it is also the case that …
… health care is heavily bureaucratic, political and regulated.

Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:

The human-generated part of what’s in ordinary paper health records today, but across a patient’s entire lifetime. This of course includes notes created by doctors and other care-givers.
Large amounts of machine-generated data, including:
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
  - Most tests exploit electronic technology. Progress in electronics is intense.
  - Biomedical research is itself intense.
  - In particular, most research technologies (for example gene sequencing) can be made cheap enough over time to be affordable clinically.
- The output of consumer health-monitoring devices — e.g. Fitbit and its successors. The buzzword here is “quantified self”, but what it boils down to is that every moment of our lives will be measured and recorded.

These vastly greater amounts of data cited above will allow for greatly changed analytics.

Right now, medical decisions are made based on research that looks at a few data points each for a specially-recruited sample of patients, then draws conclusions based on simplistic and questionable statistical methods.
More sophisticated analytic methods are commonly used, but almost always still to aid in the discovery and formation of hypotheses that will then be validated, if at all, using the bad old analytic techniques.
State of the art predictive modeling, applied to vastly more data, will surely yield greatly better results.

And so I believe that health care itself will be revolutionized.

Diagnosis will be much more accurate, pretty much across the board, except in those limited areas where it’s already excellent today.
Medication regimens will be much more personalized. (Pharma manufacturing may have to change greatly as a result.) So will other treatments. So will diet/fitness regimens.
The vulnerable (elderly, hospital patients) will be more accurately and comprehensively monitored. Also, their care will likely be aided by robotics.
Some of the same things will be true of infants and toddlers. (In other cases, they get such close attention today that I can’t imagine how it could be greatly increased. )

I believe that this will all happen because I believe that it will make health care vastly more successful. And if I’m right about that, no obstacles will be able to prevent it from coming into play — not cost (which will keep going down in a quasi-Moore’s-Law way), not bureaucratic inertia (although that will continue to slow things greatly), and not privacy fears (despite the challenges cited below).

So what are the IT implications of all this?

I already mentioned the need for new (or newly-used) kinds of predictive modeling.
Probably in association with those, event detection — which in many but not all cases will amount to anomaly detection — will be huge. If one goal is to let the elderly and ailing live independently, but receive help when it’s needed — well, recognizing when that help is needed will be crucial. Similar dynamics will occur in hospitals.
And in support of that, there will be great amount of monitoring, and hence strong demands upon sensors and recognition. Potentially, all five human senses will be mimicked, among others. These technologies will become even more important in health care if I’m right that robotics will play a big role.
Data quality will be a major challenge, especially in the doctors’-notes parts of health records. Reasons start:
- Different medical professionals might evaluate the same situation differently; diagnosis is a craft rather than a dumb, repeatable skill.
- If entries are selected from a predefined set of options, none may be a perfect match to the doctor’s actual opinion.
- Doctors often say what’s needful to have their decisions (care, tests, etc.) approved, whether or not it precisely matches what they really think. Thus, there are significant incentives to enter bad data.
- Free-text data is more central to health care than to many other application areas, and text data is inherently dirty.
- Health records are decades later than many other applications in moving from paper to IT.
Data integration problems will also be and indeed already are huge, because different health care providers have addressed the tough challenges of record-keeping in different ways.

As for data management — well, almost everything discussed in this blog could come into play.

A person’s entire medical record resembles the kind of mess increasingly often dumped these days into NoSQL — typically MongoDB, Cassandra, or HBase.
There are plenty of business-transaction records in the mix, of the kind that have long been managed by RDBMS.
There are a whole lot of diverse machines in the mix, and managing the data to keep such a menagerie running is commonly the job of Splunk or streaming-enhanced Hadoop.
There’s a lot of free text in medical records. Also images, video and so on.
Since graph analytics is used in research today, it might at some point make its way into clinical use.

Finally, let me say:

Data-driven medicine cannot live up to its potential unless researchers can investigate data sets comprising private information of large numbers of people.
Researchers will not have the appropriate permissions unless privacy law moves toward a basis in data use, rather than exclusively regulating data possession.

Related links

The New York Times and Hacker News discussed the benefits of using your own medical records a couple months ago.
I wrote about the monitoring/early response aspects of health care in February, 2015.
Perhaps my most recent survey of privacy issues was in September, 2014.
A pretty good survey of the debate about statistical methods in medical research came out in December, 2013.

Cask and CDAP

Curt Monash — Thu, 05 Mar 2015 15:00:13 +0000

For starters:

Continuuity toured in 2012 and touted its “app server for Hadoop” technology.
Continuuity recently changed its name to Cask and went open source.
Cask’s product is now called CDAP (Cask Data Application Platform). It’s still basically an app server for Hadoop and other “big data” — ouch do I hate that phrase — data stores.
Cask and Cloudera partnered.
I got a more technical Cask briefing this week.

Also:

App servers are a notoriously amorphous technology. The focus of how they’re used can change greatly every couple of years.
Partly for that reason, I was unimpressed by Continuuity’s original hype-filled positioning.

So far as I can tell:

Cask’s current focus is to orchestrate job flows, with lots of data mappings.
This is supposed to provide lots of developer benefits, for fairly obvious reasons. Those are pitched in terms of an integration story, more in a “free you from the mess of a many-part stack” sense than strictly in terms of data integration.
CDAP already has a GUI to monitor what’s going on. A GUI to specify workflows is coming very soon.
CDAP doesn’t consume a lot of cycles itself, and hence isn’t a real risk for unpleasant overhead, if “overhead” is narrowly defined. Rather, performance drags could come from …
… sub-optimal choices in data mapping, database design or workflow composition.

I’d didn’t push the competition point hard (the call was generally a bit rushed due to a hard stop on my side), but:

Cask thinks it doesn’t have much in the way of exact or head-to-head competitors, but cites Spring and WibiData/Kiji as coming closest.
I’d think that data integration vendors who use Hadoop as an execution engine (Informatica, Syncsort and many more) would be in the mix as well.
Cask disclaimed competition with Teradata Revelytix, on the theory that Cask is focused on operational/”real-time” use cases, while Revelytix Loom is focused on data science/investigative analytics.

To reiterate part of that last bullet — like much else we’re hearing about these days, CDAP is focused on operational apps, perhaps with a streaming aspect.

To some extent CDAP can be viewed as restoring the programmer/DBA distinction to the non-SQL world and streaming worlds. That is:

Somebody creates a data mapping “pattern”.
Programmers (including perhaps the creator) write to that pattern.
Somebody (perhaps the creator) tweaks the mapping to optimize performance, or to reflect changes in the underlying data management.

Further notes on CDAP data access include:

Cask is proud that a pattern can literally be remapped from one data store to another, although I wonder how often that is likely to happen in practice.
Also, a single “row” can reference multiple data stores.
Cask’s demo focused on imposing a schema on a log file, something you might do incrementally as you decide to extract another field of information. This is similar to major use cases for schema-on-need and for Splunk.
For most SQL-like access and operations, CDAP relies on Hive, even to external data stores or non-tabular data. Cask is working with Cloudera on Impala access.

Examples of things that Cask supposedly makes easy include:

Chunking streaming data by time (e.g. 1 minute buckets).
Encryption.
Generating database stats (histograms and so on).

Tidbits as to how Cask perceives or CDAP plays with other technologies include:

Kafka is hot.
Spark Streaming is hot enough to be on the CDAP roadmap.
Cask believes that its administrative tools don’t conflict with Cloudera Manager or Ambari, because they’re more specific to an application, job or dataset.
CDAP is built on Twill, which is a thread-like abstraction over YARN that Cask contributed to Apache. Mesos is in the picture as well, as a YARN alternative.
Cask is seeing some interest in Flink. (Flink is basically a Spark alternative out of Germany, which I’ve been dismissing as unneeded.)

Cask has ~40 people, multiple millions of dollars in trailing revenue, and — naturally — high expectations for future growth. I neglected, however, to ask how that revenue was split between subscription, professional services and miscellaneous. Cask expects to finish 2015 with a healthy two-digit number of customers.

Cask’s customers seem concentrated in usual-suspect internet-related sectors, although Cask gave it a bit of an enterprise-y spin by specifically citing SaaS (Software as a Service) and telecom. When I asked who else seems to be a user or interested based on mailing list activity, Cask mentioned a lot of financial services and some health care as well.

Related link

Cask doesn’t have the obvious .com URL.

Data models

Curt Monash — Mon, 23 Feb 2015 03:08:10 +0000

7-10 years ago, I repeatedly argued the viewpoints:

Relational DBMS were the right choice in most cases.
Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
There were a variety of specialized use cases in which non-relational data models were best.

Since then, however:

Hadoop has flourished.
NoSQL has flourished.
Graph DBMS have matured somewhat.
Much of the action has shifted to machine-generated data, of which there are many kinds.

So it’s probably best to revisit all that in a somewhat organized way.

To make the subject somewhat manageable, I’ll focus on fielded data — i.e. data that represents values of something — rather than, for example, video or images. Fielded data always arrives as a string of bits, whose meaning boils down to a set of pairs. Here by “string of bits” I mean mainly a single record or document (for example), although most of what I say can apply to a whole stream of data instead.

Important distinctions include:

Are the field names implicit or explicit? In relational use cases field names tend to be implicit, governed by the metadata. In some log files they may be space-savingly implicit as well. In other logs, XML streams, JSON streams and so on they are explicit.
If the field names are implicit, is any processing needed to recover them? Think Hadoop or Splunk acting on “dumb-looking” log data.
In any one record/document/whatever, are the field names unique? If not, then the current data model is not relational.
Are the field names the same from one record/document/whatever to the next? I.e., does the data fit into a consistent schema?
Is there a structure connecting the field names (and if so what kind)? E.g., hierarchical documents, or relational foreign keys.

Some major data models can be put into a fairly strict ordering of query desirability by noting:

The best thing to query is a relational DBMS. Everything has a known field name, so SELECTs are straightforward. You also have JOINs, which are commonly very valuable. And RDBMS are a mature technology with in many cases great query performance.
The next-best thing to query is another kind of data store with known field names. In such data stores:
- SQL or SQL-like SELECTs will still work, or can easily be made to do.
- Useful indexing systems can be grafted on to them (although they are typically less mature than in RDBMS).
- In the (mainly) future, perhaps JOINs can be grafted on as well.
The worst thing to query is a data store in which you only have a schema on read. You have to do work to make the thing queryable in the first place.

Unsurprisingly, that ordering is reversed when it comes to writing data.

The easiest thing to write to is a data store with no structure.
Next-easiest is to write to a data store that lets you make up the structure as you go along.
The hardest thing to write to is a relational DBMS, because of the requirements that must be obeyed, notably:
- Implicit field names, governed by metadata.
- Unique field names within any one record.
- The same (ordered) set of field names for each record — more precisely, a limited collection of such ordered sets, one per table.

And so, for starters, most large enterprises will have important use cases for data stores in all of the obvious categories. In particular:

Usually it is best to have separate brands of general-purpose/OLTP (OnLine Transaction Processing) and analytic RDBMS. Further:
- I have in the past also advocated for a mid-range — i.e. lighter-weight — general purpose RDBMS.
- SAP really, really wants you to use HANA to run SAP’s apps.
- You might want an in-memory RDBMS (MemSQL) or a particularly cloudy one or whatever.
Your website alone is reason enough to use a NoSQL DBMS, most likely MongoDB or Cassandra. And it often makes sense to have multiple NoSQL systems used for different purposes, because:
- They’re all immature right now, with advantages over each other.
- The apps you’re using them for are likely to be thrown out in a few years, so you won’t have great pain switching if you ever do decide to standardize.
Whatever else Hadoop is — and it’s a lot of things — it’s also a happy home for log files. And enterprises have lots of log files.

Beyond that:

You may want something to manage organizational hierarchies and so on, if you build enough custom systems in areas such as security, knowledge management, or MDM (Master Data Management). I’m increasingly persuaded by the argument that this should be a graph DBMS rather than an LDAP (Lightweight Directory Access Protocol) system.
Splunk is cool.
Use cases for various other kinds of data stores can often be found.
Of course you’ll be implicitly using whatever is bundled into your SaaS (Software as a Service) systems, your app-specific appliances and so on.

And finally, I think in-memory data grids:

Will be widely used and important.
Will be used to instantiate multiple data models at once.

Related links

One reason for writing this post was for some deck-clearing before I revisit the white-hot topic of data streaming. (October, 2014)
I’ve long mused about the challenges of getting by without joins. (November, 2010)
In 2013 I observed that data models will be in perpetual, rapid flux.
In 2013 I also discussed attempts to combine multiple data models (or access methods) in a single DBMS.
I surveyed data models and access methods back in 2008.

Notes on machine-generated data, year-end 2014

Curt Monash — Thu, 01 Jan 2015 03:49:37 +0000

Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.

1. There are many kinds of machine-generated data. Important categories include:

Web, network and other IT logs.
Game and mobile app event data.
CDRs (telecom Call Detail Records).
“Phone-home” data from large numbers of identical electronic products (for example set-top boxes).
Sensor network output (for example from a pipeline or other utility network).
Vehicle telemetry.
Health care data, in hospitals.
Digital health data from consumer devices.
Images from public-safety camera networks.
Stock tickers (if you regard them as being machine-generated, which I do).

That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.

2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly:

Government snooping on the contents of communications.
Communication traffic analysis.
Photos and videos (airport scanners, public cameras, etc.)
Commercial ad targeting.
Traditional medical records.

Other areas, however, continue to be overlooked, with the two biggies in my opinion being:

The potential to apply marketing-like psychographic analysis in other areas, such as hiring decisions or criminal justice.
The ability to track people’s movements in great detail, which will be increased greatly yet again as the market matures — and some think this will happen soon — for consumer digital health.

My core arguments about privacy and surveillance seem as valid as ever.

3. The natural database structures for machine-generated data vary wildly. Weblog data structure is often remarkably complex. Log data from complex organizations (e.g. IT shops or hospitals) might comprise many streams, each with a different (even if individually simple) organization. But in the majority of my example categories, record structure is very simple and repeatable. Thus, there are many kinds of machine-generated data that can, at least in principle, be handled well by a relational DBMS …

4. … at least to some extent. In a further complication, much machine-generated data arrives as a kind of time series. Many (but not all) time series call for a strong commitment to event-series styles of analytics. Event series analytics are a challenge for relational DBMS, but Vertica and others have tried to step up with various kinds of temporal predicates or datatypes. Event series are also a challenge for business intelligence vendors, and a potentially significant driver for competitive rebalancing in the BI market.

5. Event series even aside, I wish I understood more about business intelligence for non-tabular data. I plan to fix that.

6. Streaming and memory-centric processing are closely related subjects. What I wrote recently about them for Hadoop still applies: Spark, Kafka, etc. is still the base streaming case going forward; Storm is still around as an alternative; Tachyon or something like it will change the game somewhat. But not all streaming machine-generated data needs to land in Hadoop at all. As noted above, relational data stores (especially memory-centric ones) can suffice. So can NoSQL. So can Splunk.

Not all these considerations are important in all use cases. For one thing, latency requirements vary greatly. For example:

High-frequency trading is an extreme race; microseconds matter.
Internet interaction applications increasingly require data freshness to the last click or other user action. Computational latency requirements can go down to the single-digit milliseconds. Real-time ad auctions have a race aspect that may drive latency lower yet.
Minute-plus response can be fine for individual remote systems. Sometimes they ping home more rarely than that.

There’s also still plenty of true batch mode, but — and I say this as part of a conversation that’s been underway for over 40 years — interactive computing is preferable whenever feasible.

7. My views about predictive analytics are still somewhat confused. For starters:

The math and technology of predictive modeling both still seem pretty simple …
… but sometimes achieve mind-blowing results even so.
There’s a lot of recent innovation in predictive modeling, but adoption of the innovative stuff is still fairly tepid.
Adoption of the simple stuff is strong in certain market sectors, especially ones connected to customer understanding, such as marketing or anti-fraud.

So I’ll mainly just link to some of my past posts on the subject, and otherwise leave discussion of predictive analytics to another day.

WibiData has some innovative ideas in predictive experimentation.
Nutonian has some innovative ideas in non-linear modeling for pattern detection/root-cause analysis.
It’s still at the anecdotal level, but there have been interesting ideas in the rapid retraining of models.
Ayasdi reminded us that there’s room for innovation in clustering.
My Thanksgiving round-up post points to a lot of my prior comments on predictive modeling.

Finally, back in 2011 I tried to broadly categorize analytics use cases. Based on that and also on some points I just raised above, I’d say that a ripe area for breakthroughs is problem and anomaly detection and diagnosis, specifically for machines and physical installations, rather than in the marketing/fraud/credit score areas that are already going strong. That’s an old discipline; the concept of statistical process control dates back before World War II. Perhaps they’re underway; the Conviva retraining example linked above is certainly imaginative. But I’d like to see a lot more in the area.

Even more important, of course, could be some kind of revolution in predictive modeling for medicine.

Streaming for Hadoop

Curt Monash — Sun, 05 Oct 2014 08:56:55 +0000

The genesis of this post is that:

Hortonworks is trying to revitalize the Apache Storm project, after Storm lost momentum; indeed, Hortonworks is referring to Storm as a component of Hadoop.
Cloudera is talking up what I would call its human real-time strategy, which includes but is not limited to Flume, Kafka, and Spark Streaming. Cloudera also sees a few use cases for Storm.
This all fits with my view that the Current Hot Subject is human real-time data freshness — for analytics, of course, since we’ve always had low latencies in short-request processing.
This also all fits with the importance I place on log analysis.
Cloudera reached out to talk to me about all this.

Of course, we should hardly assume that what the Hadoop distro vendors favor will be the be-all and end-all of streaming. But they are likely to at least be influential players in the area.

In the parts of the problem that Cloudera emphasizes, the main tasks that need to be addressed are:

Getting data into the plumbing from whatever systems it’s being generated in. This is the province of Flume, one of Cloudera’s earliest projects. I’d add that this is also one of the core competencies of Splunk.
Getting data where it needs to go. Flume can do this. Kafka, a publish/subscribe messaging system, can do it in a more general way, because streams are sent to a Kafka broker, which then re-streams them to their ultimate destination.
Processing data in flight. Storm can do this. Spark Streaming can do it more easily. Spark Streaming is or soon will be a part of every serious Hadoop distribution. Flume can do some lightweight processing as well.
Serving up data for further query. Cloudera would like you to do this via HBase or Impala. But Oracle is a fine choice too, and indeed a popular choice among Cloudera customers.

I guess there’s also a step of receiving data out of the plumbing system. Cloudera and I glossed over that aspect when we talked, but I’ll say:

Spark commonly lives over HDFS (Hadoop Distributed File System).
Flume feeds HDFS. Flume was also hacked years ago — rah-rah open source! — to feed Kafka instead, and also to be fed by it.

Cloudera has not yet decided whether to make Kafka part of CDH (which stands for Cloudera Distribution yada yada Hadoop). Considerations in that probably include:

Kafka has impressive adoption among high-profile internet companies, but not so much among conventional enterprises.
Surely not coincidentally, Kafka is missing features in areas such as security (e.g. it lacks Kerberos integration).
Kafka lacks cool capabilities to let you configure rather than code, although Cloudera thinks that in some cases you can work around this problem by marrying Kafka and Flume.

I still find it bizarre that a messaging system be named after an author famous for writing about depressingly inescapable situations. Also, I wish that:

Kafka had something to do with transformations.
The name Kafka had been used by a commercial software company, which could offer product trials.

Highlights from the Storm vs. Spark Streaming vs. Samza part of my discussion with Cloudera include:

Storm has a companion project Trident that makes Storm somewhat easier to program and/or configure. But Trident only has some of the usability advantages of Spark Streaming.
Cloudera sees no advantages to Samza, a Kafka companion project, when compared with whichever of Spark Streaming or Storm + Trident is better suited to a particular use case.
Cloudera likes the rich set of primitives that Spark Streaming inherits from Spark. Cloudera also notes that, if you learn to program over Spark for any reason, then you will in particular have learned how to program over Spark Streaming.
Spark Streaming lets you join Spark Streaming data to other data that Spark can get access to. I agree with Cloudera that this is an important advantage.
Cloudera sees Storm’s main advantages as being in latency. If you need 10-200 millisecond latency, Storm can give you that today while Spark Streaming can’t. However, Cloudera notes that to write efficiently to your persistent store — which Cloudera fondly hopes but does not insist will be HBase or Impala — you may need to micro-batch your writes anyway.

Also, Spark Streaming has a major advantage over bare Storm in whether you have to manually configure your topology, but I wasn’t clear as to how far Trident closes that particular gap.

Cloudera and I didn’t particularly talk about data-consuming technologies such as BI, predictive analytics, or analytic applications, but we did review use cases a bit. Nothing too surprising jumped out. Indeed, the discussion reminded me of a 2007 list I did of applications — other than extreme low-latency ones — for CEP (Complex Event Processing).

Top-of-mind were things that fit into one or more of the buckets “internet”, “retail”, “recommendation/personalization”, “security” or “anti-fraud”.
Transportation/logistics got mentioned, to which I replied that the CEP vendors had all seemed to have one trucking/logistics client each.
At least in theory, there are potentially huge future applications in health care.

In general, candidate application areas for streaming-to-Hadoop match those that involve large volumes of machine-generated data.

Edit: Shortly after I posted this, Storm creator Nathan Marz put up a detailed and optimistic post about the history and state of Storm.