MapReduce – DBMS 2 : DataBase Management System Services

Cloudera Altus

Curt Monash — Wed, 14 Jun 2017 13:12:48 +0000

I talked with Cloudera before the recent release of Altus. In simplest terms, Cloudera’s cloud strategy aspires to:

Provide all the important advantages of on-premises Cloudera.
Provide all the important advantages of native cloud offerings such as Amazon EMR (Elastic MapReduce, or at least come sufficiently close to that goal.
Benefit from customers’ desire to have on-premises and cloud deployments that work:
- Alike in any case.
- Together, to the extent that that makes use-case sense.

In other words, Cloudera is porting its software to an important new platform.* And this port isn’t complete yet, in that Altus is geared only for certain workloads. Specifically, Altus is focused on “data pipelines”, aka data transformation, aka “data processing”, aka new-age ETL (Extract/Transform/Load). (Other kinds of workload are on the roadmap, including several different styles of Impala use.) So what about that is particularly interesting? Well, let’s drill down.

*Or, if you prefer, improving on early versions of the port.

Since so much of the Hadoop and Spark stacks is open source, competition often isn’t based on core product architecture or features, but rather on factors such as:

Ease of management. This one is nuanced in the case of cloud/Altus. For starters:
- One of Cloudera’s main areas of differentiation has always been Cloudera Manager.
- Cloudera Director was Cloudera’s first foray into cloud-specific management.
- Cloudera Altus features easier/simpler management than Cloudera Director, meant to be analogous to native Amazon management tools, and good-enough for use cases that don’t require strenuous optimization.
- Cloudera Altus also includes an optional workload analyzer, in slight conflict with other parts of the Altus story. More on that below.
Ease of development. Frankly, this rarely seems to come up as a differentiator in the Hadoop/Spark world, various “notebook” offerings such as Databricks’ or Cloudera’s notwithstanding.
Price. When price is the major determinant, Cloudera is sad.
Open source purity. Ditto. But at most enterprises — at least those with hefty IT budgets — emphasis on open source purity either is a proxy for price shopping, or else boils down to largely bogus concerns about vendor lock-in.

Of course, “core” kinds of considerations are present to some extent too, including:

Performance, concurrency, etc. I no longer hear many allegations of differences in across-the-board Hadoop performance. But the subject does arise in specific areas, most obviously in analytic SQL processing. It arises in the case of Altus as well, in that Cloudera improved in a couple of areas that it concedes were previously Amazon EMR advantages, namely:
- Interacting with S3 data stores.
- Spinning instances up and down.
Reliability and data safety. Cloudera mentioned that it did some work so as to be comfortable with S3’s eventual consistency model.

Recently, Cloudera has succeeded at blowing security up into a major competitive consideration. Of course, they’re trying that with Altus as well. Much of the Cloudera Altus story is the usual — rah-rah Cloudera security, Sentry, Kerberos everywhere, etc. But there’s one aspect that I find to be simple yet really interesting:

Cloudera Altus doesn’t manage data for you.
Rather, it launches and manages jobs on a separate Hadoop cluster.

Thus, there are very few new security risks to running Cloudera Altus, beyond whatever risks are inherent to running any version of Hadoop in the public cloud.

Where things get a bit more complicated is some features for workload analysis.

Cloudera recently introduced some capabilities for on-the-fly trouble-shooting. That’s fine.
Cloudera has also now announced an offline workload analyzer, which compares actual metrics computed from your log files to “normal” ones from well-running jobs. For that, you really do have to ship information to a separate cluster managed by Cloudera.

The information shipped is logs rather than actual query results or raw data. In theory, an attacker who had all those logs could conceivably make inferences about the data itself; but in practice, that doesn’t seem like an important security risk at all.

So is this an odd situation where that strategy works, or could what we might call light-touch managed services turn out to be widespread and important? That’s a good question to address in a separate post.

Notes on Spark and Databricks — generalities

Curt Monash — Sun, 31 Jul 2016 14:29:46 +0000

I visited Databricks in early July to chat with Ion Stoica and Reynold Xin. Spark also comes up in a large fraction of the conversations I have. So let’s do some catch-up on Databricks and Spark. In a nutshell:

Spark is indeed the replacement for Hadoop MapReduce.
Spark is becoming the default platform for machine learning.
SparkSQL (nee’ Shark) is puttering along predictably.
Databricks reports good success in its core business of cloud-based machine learning support.
Spark Streaming has strong adoption, but its position is at risk.
Databricks, the original authority on Spark, is not keeping a tight grip on that role.

I shall explain below. I also am posting separately about Spark evolution, especially Spark 2.0. I’ll also talk a bit in that post about Databricks’ proprietary/closed-source technology.

Spark is the replacement for Hadoop MapReduce.

This point is so obvious that I don’t know what to say in its support. The trend is happening, as originally decreed by Cloudera (and me), among others. People are rightly fed up with the limitations of MapReduce, and — niches perhaps aside — there are no serious alternatives other than Spark.

The greatest use for Spark seems to be the same as the canonical first use for MapReduce: data transformation. Also in line with the Spark/MapReduce analogy:

Data-transformation-only use cases are important, but they don’t dominate.
Most other use cases typically have a data transformation element as well …
… which has to be started before any other work can be done.

And so it seems likely that, at least for as long as Spark is growing rapidly, data transformation will appear to be the biggest Spark use case.

Spark is becoming the default platform for machine learning.

Largely, this is a corollary of:

The previous point.
The fact that Spark was originally designed with machine learning as its principal use case.

To do machine learning you need two things in your software:

A collection of algorithms. Spark, I gather, is one of numerous good alternatives there.
Support for machine learning workflows. That’s where Spark evidently stands alone.

And thus I have conversations like:

“Are you doing anything with Spark?”
“We’ve gotten more serious about machine learning, so yes.”

SparkSQL (nee’ Shark) is puttering along.

SparkSQL is pretty much following the Hive trajectory.

Useful from Day One as an adjunct to other kinds of processing.
A tease and occasionally useful as a SQL engine for its own sake, but really not very good, pending years to mature.

Databricks reports good success in its core business of cloud-based machine learning support.

Databricks, to an even greater extent than I previously realized, is focused on its cloud business, for which there are well over 200 paying customers. Notes on that include:

As you might expect based on my comments above, the majority of usage is for data transformation, but a lot of that is in anticipation of doing machine learning/predictive modeling in the near future.
Databricks customers typically already have their data in the Amazon cloud.
Naturally, a lot of Databricks customers are internet companies — ad tech startups and the like. Databricks also reports “strong” traction in the segments:
- Media
- Financial services (especially but not only insurance)
- Health care/pharma
The main languages Databricks customers use are R and Python. Ion said that Python was used more on the West Coast, while R was used more in the East.

Databricks’ core marketing concept seems to be “just-in-time data platform”. I don’t know why they picked that, as opposed to something that emphasizes Spark’s flexibility and functionality.

Spark Streaming’s long-term success is not assured.

To a first approximation, things look good for Spark Streaming.

Spark Streaming is definitely the leading companion to Kafka, and perhaps also to cloud equivalents (e.g. Amazon Kinesis).
The “traditional” alternatives of Storm and Samza are pretty much done.
Newer alternatives from Twitter, Confluent and Flink aren’t yet established.
Cloudera is a big fan of Spark Streaming.
Even if Spark Streaming were to generally decline, it might keep substantial “good enough” usage, analogously to Hive and SparkSQL.
Cool new Spark Streaming technology is coming out.

But I’m also hearing rumbles and grumbles about Spark Streaming. What’s more, we know that Spark Streaming wasn’t a core part of Spark’s design; the use case just happened to emerge. Demanding streaming use cases typically involve a lot of short-request inserts (or updates/upserts/whatever). And if you were designing a system to handle those … would it really be based on Spark?

Databricks is not keeping a tight grip on Spark leadership.

For starters:

Databricks’ main business, as noted above, is its cloud service. That seems to be going well.
Databricks’ secondary business is licensing stuff to Spark distributors. That doesn’t seem to amount to much; it’s too easy to go straight to the Apache distribution and bypass Databricks. No worries; this never seemed it would be a big revenue opportunity for Databricks.

At the moment, Databricks is pretty clearly the general leader of Spark. Indeed:

If you want the story on where Spark is going, you do what I did — you ask Databricks.
Similarly, if you’re thinking of pushing the boundaries on Spark use, and you have access to the Databricks folks, that’s who you’ll probably talk to.
Databricks employs ~1/3 of Spark committers.
Databricks organizes the Spark Summit.

But overall, Databricks doesn’t seem to care much about keeping Spark leadership. Its marketing efforts in that respect are minimal. Word-of-mouth buzz paints a similar picture. My direct relationship with the company gives the same impression. Oh, I’m sure Databricks would like to remain the Spark leader. But it doesn’t seem to devote much energy toward keeping the role.

Related links

Starting with my introduction to Spark, previous overview posts include those in:

Readings in Database Systems

Curt Monash — Thu, 10 Dec 2015 12:26:40 +0000

Mike Stonebraker and Larry Ellison have numerous things in common. If nothing else:

They’re both titanic figures in the database industry.
They both gave me testimonials on the home page of my business website.
They both have been known to use the present tense when the future tense would be more accurate.

I mention the latter because there’s a new edition of Readings in Database Systems, aka the Red Book, available online, courtesy of Mike, Joe Hellerstein and Peter Bailis. Besides the recommended-reading academic papers themselves, there are 12 survey articles by the editors, and an occasional response where, for example, editors disagree. Whether or not one chooses to tackle the papers themselves — and I in fact have not dived into them — the commentary is of great interest.

But I would not take every word as the gospel truth, especially when academics describe what they see as commercial market realities. In particular, as per my quip in the first paragraph, the data warehouse market has not yet gone to the extremes that Mike suggests,* if indeed it ever will. And while Joe is close to correct when he says that the company Essbase was acquired by Oracle, what actually happened is that Arbor Software, which made Essbase, merged with Hyperion Software, and the latter was eventually indeed bought by the giant of Redwood Shores.**

*When it comes to data warehouse market assessment, Mike seems to often be ahead of the trend.

**Let me interrupt my tweaking of very smart people to confess that my own commentary on the Oracle/Hyperion deal was not, in retrospect, especially prescient.

Mike pretty much opened the discussion with a blistering attack against hierarchical data models such as JSON or XML. To a first approximation, his views might be summarized as:

Logical hierarchical models can be OK in certain cases. In particular, JSON could be a somewhat useful datatype in an RDBMS.
Physical hierarchical models are horrible.
Rather, you should implement the logical hierarchical model over a columnar RDBMS.

My responses start:

Nested data structures are more important than Mike’s discussion seems to suggest.
Native XML and JSON stores are apt to have an index on every field. If you squint, that index looks a lot like a column store.
Even NoSQL stores should and I think in most cases will have some kind of SQL-like DML (Data Manipulation Language). In particular, there should be some ability to do joins, because total denormalization is not always a good choice.

In no particular order, here are some other thoughts about or inspired by the survey articles in Readings in Database Systems, 5th Edition.

I agree that OLTP (OnLine Transaction Processing) is transitioning to main memory.
I agree with the emphasis on “data in motion”.
While I needle him for overstating the speed of the transition, Mike is right that columnar architectures are winning for analytics. (Or you could say they’ve won, if you recognize that mop-up from the victory will still take 1 or 2 decades.)
The guys seem to really hate MapReduce, which is an old story for Mike, but a bit of a reversal for Joe.
MapReduce is many things, but it’s not a data model, and it’s also not something that Hadoop 1.0 was an alternative to. Saying each of those things was sloppy writing.
The guys characterize consistency/transaction isolation as a rather ghastly mess. That part was an eye-opener.
Mike is a big fan of arrays. I suspect he’s right in general, although I also suspect he’s overrating SciDB. I also think he’s somewhat overrating the market penetration of cube stores, aka MOLAP.
The point about Hadoop (in particular) and modern technologies in general showing the way to modularization of DBMS is an excellent one.
Joe and Mike disagreed about analytics; Joe’s approach rang truer for me. My own opinion is:
- Business intelligence has been important for quite a while, and won’t stop.
- Machine learning is becoming ever more important.
- It’s still early days for the integration of the two areas, but much more will come.
The challenge of whether anybody wants to do machine learning (or other advanced analytics) over a DBMS is sidestepped in part by the previously mentioned point about the modularization of a DBMS. Hadoop, for example, can be both an OK analytic DBMS (although not fully competitive with mature, dedicated products) and of course also an advanced analytics framework.
Similarly, except in the short-term I’m not worried about the limitations of Spark’s persistence mechanisms. Almost every commercial distribution of Spark I can think of is part of a package that also contains a more mature data store.
Versatile DBMS and analytic frameworks suffer strategic contention for memory, with different parts of the system wanting to use it in different ways. Raising that as a concern about the integration of analytic DBMS with advanced analytic frameworks is valid.
I used to overrate the importance of abstract datatypes, in large part due to Mike’s influence. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. A big part of the problem is what I mentioned in the previous point — different parts of a versatile DBMS would prefer to do different things with memory.
I used to overrate the importance of user-defined functions in an analytic RDBMS. Mike had nothing to do with my error. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. Looser coupling between analytics and data management seems more flexible.
Excellent points are made about the difficulties of “First we build the perfect schema” data warehouse projects and, similarly, MDM (Master Data Management).
There’s an interesting discussion that helps explain why optimizer progress is so slow (both for the industry in general and for each individual product).

Related links

I did a deep dive into MarkLogic’s indexing strategy in 2008, which informed my comment about XML/JSON stores above.
Again with MarkLogic as the focus, in 2010 I was skeptical about document stores not offering joins. MarkLogic has since capitulated.
I’m not current on SciDB, but I did write a bit about it in 2010.
I’m surprised that I can’t find a post to point to about modularization of DBMS. I’ll leave this here as a placeholder until I can.
Edit: As promised, I’ve now posted about the object-relational/abstract datatype boom of the 1990s.

Hadoop generalities

Curt Monash — Wed, 10 Jun 2015 12:33:17 +0000

Occasionally I talk with an astute reporter — there are still a few left — and get led toward angles I hadn’t considered before, or at least hadn’t written up. A blog post may then ensue. This is one such post.

There is a group of questions going around that includes:

Is Hadoop overhyped?
Has Hadoop adoption stalled?
Is Hadoop adoption being delayed by skills shortages?
What is Hadoop really good for anyway?
Which adoption curves for previous technologies are the best analogies for Hadoop?

To a first approximation, my responses are:

The Hadoop hype is generally justified, but …
… what exactly constitutes “Hadoop” is trickier than one might think, in at least two ways:
- Hadoop is much more than just a few core projects.
- Even the core of Hadoop is repeatedly re-imagined.
RDBMS are a good analogy for Hadoop.
As a general rule, Hadoop adoption is happening earlier for new applications, rather than in replacement or rehosting of old ones. That kind of thing is standard for any comparable technology, both because enabling new applications can be valuable and because migration is a pain.
Data transformation, as pre-processing for analytic RDBMS use, is an exception to that general rule. That said …
… it’s been adopted quickly because it saves costs. But of course a business that’s only about cost savings may not generate a lot of revenue.
Dumping data into a Hadoop-centric “data lake” is a smart decision, even if you haven’t figured out yet what to do with it. But of course, …
… even if zero-application adoption makes sense, it isn’t exactly a high-value proposition.
I’m generally a skeptic about market numbers. Specific to Hadoop, I note that:
- The most reliable numbers about Hadoop adoption come from Hortonworks, since it is the only pure-play public company in the market. (Compare, for example, the negligible amounts of information put out by MapR.) But Hortonworks’ experiences are not necessarily identical to those of other vendors, who may compete more on the basis of value-added service and technology rather than on open source purity or price.
- Hadoop (and the same is true of NoSQL) are most widely adopted at digital companies rather than at traditional enterprises.
- That said, while all traditional enterprises have some kind of digital presence, not all have ones of the scope that would mandate a heavy investment in internet technologies. Large consumer-oriented companies probably do, but companies with more limited customer bases might not be there yet.
Concerns about skill shortages are exaggerated.
- The point of distributing processing frameworks such as Spark or MapReduce is to make distributed analytic or application programming not be much harder than any other kind.
- If a new programming language or framework needs to be adopted — well, programmers nowadays love learning that kind of stuff.
- The industry is moving quickly to make distributed systems easier to administer. Any skill shortages in operations should prove quite temporary.

Teradata will support Presto

Curt Monash — Mon, 08 Jun 2015 09:32:16 +0000

At the highest level:

Presto is, roughly speaking, Facebook’s replacement for Hive, at least for queries that are supposed to run at interactive speeds.
Teradata is announcing support for Presto with a classic open source pricing model.
Presto will also become, roughly speaking, Teradata’s replacement for Hive.
Teradata’s Presto efforts are being conducted by the former Hadapt.

Now let’s make that all a little more precise.

Regarding Presto (and I got most of this from Teradata)::

To a first approximation, Presto is just another way to write SQL queries against HDFS (Hadoop Distributed File System). However …
… Presto queries other data stores too, such as various kinds of RDBMS, and federates query results.
Facebook at various points in time created both Hive and now Presto.
Facebook started the Presto project in 2012 and now has 10 engineers on it.
Teradata has named 16 engineers – all from Hadapt – who will be contributing to Presto.
Known serious users of Presto include Facebook, Netflix, Groupon and Airbnb. Airbnb likes Presto well enough to have 1/3 of its employees using it, via an Airbnb-developed tool called Airpal.
Facebook is known to have a cluster cited at 300 petabytes and 4000 users where Presto is presumed to be a principal part of the workload.

Daniel Abadi said that Presto satisfies what he sees as some core architectural requirements for a modern parallel analytic RDBMS project:

Data is pipelined between operators, with no gratuitous writing to disk the way you might have in something MapReduce-based. This is different from the sense of “pipelining” in which one query might keep an intermediate result set hanging around because another query is known to need those results as well.
Presto processing is vectorized; functions don’t need to be re-invoked a tuple at a time. This is different from the sense of vectorization in which several tuples are processed at once, exploiting SIMD (Single Instruction Multiple Data). Dan thinks SIMD is useful mainly for column stores, and Presto tries to be storage-architecture-agnostic.
Presto query operators and hence query plans are dynamically compiled, down to byte code.
Although it is generally written in Java, Presto uses direct memory management rather than relying on what Java provides. Dan believes that, despite being written in Java, Presto performs as if it were written in C.

More precisely, this is a checklist for interactive-speed parallel SQL. There are some query jobs long enough that Dan thinks you need the fault-tolerance obtained from writing intermediate results to disk, ala’ HadoopDB (which was of course the MapReduce-based predecessor to Hadapt).

That said, Presto is a newish database technology effort, there’s lots of stuff missing from it, and there still will be lots of stuff missing from Presto years from now. Teradata has announced contribution plans to Presto for, give or take, the next year, in three phases:

Phase 1 (released immediately, and hence in particular already done):
- An installer.
- More documentation, especially around installation.
- Command-line monitoring and management.
Phase 2 (later in 2015)
- Integrations with YARN, Ambari and soon thereafter Cloudera Manager.
- Expanded SQL coverage.
Phase 3 (some time in 2016)
- An ODBC driver, which is of course essential for business intelligence tool connectivity.
- Other connectors (e.g. more targets for query federation).
- Security.
- Further SQL coverage.

Absent from any specific plans that were disclosed to me was anything about optimization or other performance hacks, and anything about workload management beyond what can be gotten from YARN. I also suspect that much SQL coverage will still be lacking after Phase 3.

Teradata’s basic business model for Presto is:

Teradata is selling subscriptions, for which the principal benefit is support.
Teradata reserves the right to make some of its Presto-enhancing code subscription-only, but has no immediate plans to do so.
Teradata being Teradata, it would love to sell you Presto-related professional services. But you’re absolutely welcome to consume Presto on the basis of license-plus-routine-support-only.

And of course Presto is usurping Hive’s role wherever that makes sense in Teradata’s data connectivity story, e.g. Teradata QueryGrid.

Finally, since I was on the phone with Justin Borgman and Dan Abadi, discussing a project that involved 16 former Hadapt engineers, I asked about Hadapt’s status. That may be summarized as:

There are currently no new Hadapt sales.
Only a few large Hadapt customers are still being supported by Teradata.
The former Hadapt folks would love Hadapt or Hadapt-like technology to be integrated with Presto, but no such plans have been finalized at this time.

Notes on indexes and index-like structures

Curt Monash — Thu, 16 Apr 2015 22:42:59 +0000

Indexes are central to database management.

My first-ever stock analyst report, in 1982, correctly predicted that index-based DBMS would supplant linked-list ones …
… and to this day, if one wants to retrieve a small fraction of a database, indexes are generally the most efficient way to go.
Recently, I’ve had numerous conversations in which indexing strategies played a central role.

Perhaps it’s time for a round-up post on indexing.

1. First, let’s review some basics. Classically:

An index is a DBMS data structure that you probe to discover where to find the data you really want.
Indexes make data retrieval much more selective and hence faster.
While indexes make queries cheaper, they make writes more expensive — because when you write data, you need to update your index as well.
Indexes also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for “zero-DBA” RDBMS installations.)

2. Further:

A DBMS or other system can index data it doesn’t control.
- This is common in the case of text indexing, and not just in public search engines like Google. Performance design might speak against recopying text documents. So might security.
- This capability overlaps with but isn’t exactly the same thing as an “external tables” feature in an RDBMS.
Indexes can be updated in batch mode, rather than real time.
- Most famously, this is why Google invented MapReduce.
- Indeed, in cases where you index external data, it’s almost mandatory.
Indexes written in real-time are often cleaned up in batch, or at least asynchronously with the writes.
- The most famous example is probably the rebalancing of B-trees.
- Append-only index writes call for later clean-up as well.

3. There are numerous short-request RDBMS indexing strategies, with various advantages and drawbacks. But better indexing, as a general rule, does not a major DBMS product make.

The latest example is my former clients at Tokutek, who just got sold to Percona in a presumably small deal — regrettably without having yet paid me all the money I’m owed. (By the way, the press release for that acquisition highlights TokuDB’s advantages in compression much more than it mentions straight performance.)
In a recent conversation with my clients at MemSQL, I basically heard from Nikita Shamgunov that:
- He felt that lockless indexes were essential to scale-out, and to that end …
- … he picked skip lists, not because they were the optimal lockless index, but because they were good enough and a lot easier to implement than the alternatives. (Edit: Actually, see Nikita’s comment below.)
Red-black trees are said to be better than B-trees. But they come up so rarely that I don’t really understand how they work.
solidDB did something cool with Patricia tries years ago. McObject and ScaleDB tried them too. Few people noticed or cared.

I’ll try to explain this paradox below.

4. The analytic RDBMS vendors who arose in the previous decade were generally index-averse. Netezza famously does not use indexes at all. Neither does Vertica, although the columns themselves played some of the role of indexes, especially give the flexibility in their sort orders. Others got by with much less indexing than was common in, for example, Oracle data warehouses.

Some of the reason was indexes’ drawbacks in terms of storage space and administrative overhead. Also, sequential scans can be much faster from spinning disk than more selective retrieval, so table scans often outperformed index-driven retrieval.

5. It is worth remembering that almost any data access method brings back more data than you really need, at least as an intermediate step. For starters, data is usually retrieved in whole pages, whether you need all their contents or not. But some indexing and index-alternative technologies go well beyond that.

To avoid doing true full table scans, Netezza relies on “zone maps”. These are a prominent example of what is now often called data skipping.
Bloom filters in essence hash data into a short string of bits. If there’s a hash collision, excess data is returned.
Geospatial queries often want to return data for regions that have no simple representation in the database. So instead they bring back data for a superset of the desired region, which the DBMS does know how to return.

6. Geospatial indexing is actually one of the examples that gave me the urge to write this post. There are two main geospatial indexing strategies I hear about. One is the R-tree, which basically divides things up into rectangles, rectangles within those rectangles, rectangles within those smaller rectangles, and so on. A query initially brings back the data within a set of rectangles whose union contains the desired region; that intermediate result is then checked row by row for whether it belongs in the final result set.

The other main approach to geospatial indexing is the space-filling curve. The idea behind this form of geospatial indexing is roughly:

For computational purposes, a geographic region is of course a lattice of points rather than a true 2-dimensional continuum.
So you take a lattice — perhaps in the overall shape of a square — and arrange its points in a sequence, so that each point is adjacent in some way to its predecessor.
Then regions on a plane are covered by subsequences (or unions of same).

The idea gets its name because, if you trace a path through the sequence of points, what you get is an approximation to a true space-filling curve.

7. And finally — mature DBMS use multiple indexing strategies. One of the best examples of a DBMS winning largely on the basis of its indexing approach is Sybase IQ, which popularized bitmap indexing. But when last I asked, some years ago, Sybase IQ actually used 9 different kinds of indexing. Oracle surely has yet more. This illustrates that different kinds of indexes are good in different use cases, which in turn suggests obvious reasons why clever indexing rarely gives a great competitive advantage.

Where the innovation is

Curt Monash — Mon, 19 Jan 2015 08:27:57 +0000

I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. But if we abandon any hope that this post could be comprehensive, I can at least say:

1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.

Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.

2. Even so, there’s much room for innovation around data movement and management. I’d start with:

Product maturity is a huge issue for all the above, and will remain one for years.
Hadoop and Spark show that application execution engines:
- Have a lot of innovation ahead of them.
- Are tightly entwined with data management, and with data movement as well.
Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
There are many issues in storage that can affect data technologies as well, including but not limited to:
- Solid-state (flash or post-flash) vs. spinning disk.
- Networked vs. direct-attached.
- Virtualized vs. identifiable-physical.
- Object/file/block.
Graph analytics and data management are still confused.

3. As I suggested last year, data transformation is an important area for innovation.

MapReduce was invented for data transformation, which is still a large part of what goes on in Hadoop.
The smart data preparation crowd is deservedly getting attention.
The more different data models — NoSQL and so on — that are used, the greater are the demands on data transformation.

4. There’s a lot going on in investigative analytics. Besides the “platform” technologies already mentioned, in areas such as fast-query, data preparation, and general execution engines, there’s also great innovation higher in the stack. Most recently I’ve written about multiple examples in predictive modeling, such as:

Mathematically (more) complex models that are at once more accurate and more easily arrived at than (nearly) linear ones.
Similarly, more complex clustering.
Predictive experimentation.
The use of business intelligence and predictive modeling to inform each other.

Beyond that:

Event-series analytics is another exciting area. (At least on the BI side, I frankly expected it to sweep through the relevant vertical markets more quickly than it has.)
I’ve long been disappointed in the progress in text analytics. But sentiment analysis is doing fairly well, many more languages are analyzed than before, and I occasionally hear rumblings of text analytic sophistication inching back towards that already available in the previous decade.
While I don’t write about it much, modern BI navigation is an impressive and wonderful thing.

5. Back in 2013, in what was perhaps my previous most comprehensive post on innovation, I drew a link between innovation and refactoring, where what was being refactored was “everything”. Even so, I’ve been ignoring a biggie. Security is a mess, and I don’t see how it can ever be solved unless systems are much more modular from the ground up. By that I mean:

“Fencing” processes and resources away from each other improves system quality, in that it defends against both deliberate attacks and inadvertent error.
Fencing is costly, both in terms of context-switching and general non-optimization. Nonetheless, I suspect that …
… the cost of such process isolation may need to be borne.
Object-oriented programming and its associated contracts are good things in this context. But it’s obvious they’re not getting the job done on their own.

More specifically,

It is cheap to give single-purpose intelligent devices more computing power than they know what to do with. There is really no excuse for allowing them to be insecure.
It is rare for a modern PC to go much above 25% CPU usage, simply because most PC programs are still single-core. This illustrates that — assuming some offsetting improvements in multi-core parallelism — desktop software could take a security performance hit without much pain to users’ wallets.
On servers, we may in many cases be talking about lightweight virtual machines.

And to be clear:

What I’m talking about would do little to help the authentication/authorization aspects of security, but …
… those will never be perfect in any case (because they depend upon fallible humans) …
… which is exactly why other forms of security will always be needed.

6. You’ve probably noticed the fuss around an open letter about artificial intelligence, with some press coverage suggesting that AI is a Terminator-level threat to humanity. Underlying all that is a fairly interesting paper summarizing some needs for future research and innovation in AI. In particular, reading the paper reminded me of the previous point about security.

7. Three areas of software innovation that, even though they’re pretty much in my wheelhouse, I have little to say about right now are:

Application development technology, languages, frameworks, etc.
The integration of analytics into old-style operational apps.
The never-ending attempts to make large-enterprise-class application functionality available to outfits with small-enterprise sophistication and budgets.

8. There is, of course, tremendous innovation in robots and other kinds of device. But this post is already long enough, so I’ll address those areas some other time.

Related links

In many cases, I think that innovations will prove more valuable — or at least much easier to monetize — when presented to particular vertical markets.
Edit: I followed up on the last point with a post about soft robots.

Hadoop’s next refactoring?

Curt Monash — Sun, 07 Dec 2014 14:59:45 +0000

I believe in all of the following trends:

Hadoop is a Big Deal, and here to stay.
Spark, for most practical purposes, is becoming a big part of Hadoop.
Most servers will be operated away from user premises, whether via SaaS (Software as a Service), co-location, or “true” cloud computing.

Trickier is the meme that Hadoop is “the new OS”. My thoughts on that start:

People would like this to be true, although in most cases only as one of several cluster computing platforms.
Hadoop, when viewed as an operating system, is extremely primitive.
Even so, the greatest awkwardness I’m seeing when different software shares a Hadoop cluster isn’t actually in scheduling, but rather in data interchange.

There is also a minor issue that if you distribute your Hadoop work among extra nodes you might have to pay a bit more to your Hadoop distro support vendor. Fortunately, the software industry routinely solves more difficult pricing problems than that.

Recall now that Hadoop — like much else in IT — has always been about two things: data storage and program execution. The evolution of Hadoop program execution to date has been approximately:

Originally, MapReduce and JobTracker were the way to execute programs in Hadoop, period, at least if we leave HBase out of the discussion.
In a major refactoring, YARN replaced a lot of what JobTracker did, with the result that different program execution frameworks became easier to support.
Most of the relevant program execution frameworks — such as MapReduce, Spark or Tez — have data movement and temporary storage near their core.

Meanwhile, Hadoop data storage is mainly about HDFS (Hadoop Distributed File System). Its evolution, besides general enhancement, has included the addition of file types suitable for specific kinds of processing (e.g. Parquet and ORC to accelerate analytic database queries). Also, there have long been hacks that more or less bypassed central Hadoop data management, and let data be moved in parallel on a node-by-node basis. But several signs suggest that Hadoop data storage should and will be refactored too. Three efforts in particular point in that direction:

HDFS caching, about which I know relatively little.
Tachyon, in which I gather Spark’s creators continue to believe.
Cloudera’s deal to run Hadoop against Isilon storage and its associated interest in also supporting other kinds of storage, such as the object storage commonly found in clouds.

The part of all this I find most overlooked is inter-program data exchange. If two programs both running on Hadoop want to exchange data, what do they do, other than reading and writing to HDFS, or invoking some kind of a custom connector? What’s missing is a nice, flexible distributed memory layer, which:

Works well with Hadoop execution engines (Spark, Tez, Impala …).
Works well with other software people might want to put on their Hadoop nodes.
Interfaces nicely to HDFS, Isilon, object storage, et al.
Is fully parallel any time it needs to talk with persistent or external storage.
Can be fully parallel any time it needs to talk with any other software on the Hadoop cluster.

Tachyon could, I imagine, become that. HDFS caching probably could not.

In the past, I’ve been skeptical of in-memory data grids. But now I think that a such a grid could take Hadoop to the next level of generality and adoption.

Related links

One recent trigger for this line of thinking was Eric Frenkiel’s idea of MemSQL/Spark integration.
The most recent Tachyon slide deck I’ve seen cited my April, 2012 post on many kinds of memory-centric data management.

Thoughts and notes, Thanksgiving weekend 2014

Curt Monash — Mon, 01 Dec 2014 01:48:43 +0000

I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:

1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:

Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
Replicating an entire database into different geographies, often for reasons of latency and/or availability/ disaster recovery,

The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.

What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.

2. Three years ago I posted about agile (predictive) analytics. One of the points was:

… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.

Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.

3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with:

Also, the flashiest application I know of for only-moderately-successful KXEN came when one or more large retailers decided to run separate models for each of thousands of stores.

4. MongoDB, the product, has been refactored to support pluggable storage engines. In connection with that, MongoDB does/will ship with two storage engines – the traditional one and a new one from WiredTiger (but not TokuMX). Both will be equally supported by MongoDB, the company, although there surely are some tiers of support that will get bounced back to WiredTiger.

WiredTiger has the same techie principals as SleepyKat – get the wordplay?! – which was Mike Olson’s company before Cloudera. When asked, Mike spoke of those techies in remarkably glowing terms.

I wouldn’t be shocked if WiredTiger wound up playing the role for MongoDB that InnoDB played for MySQL. What I mean is that there were a lot of use cases for which the MySQL/MyISAM combination was insufficiently serious, but InnoDB turned MySQL into a respectable DBMS.

5. Hadoop’s traditional data distribution story goes something like:

Data lives on every non-special Hadoop node that does processing.
This gives the advantage of parallel data scans.
Sometimes data locality works well; sometimes it doesn’t.
Of course, if the output of every MapReduce step is persisted to disk, as is the case with Hadoop MapReduce 1, you might create some of your own data locality …
… but Hadoop is getting away from that kind of strict, I/O-intensive processing model.

However, Cloudera has noticed that some large enterprises really, really like to have storage separate from processing. Hence its recent partnership to work with EMC Isilon. Other storage partnerships, as well as a better fit with S3/object storage kinds of environments, are sure to follow, but I have no details to offer at this time.

6. Cloudera’s count of Spark users in its customer base is currently around 60. That includes everything from playing around to full production.

7. Things still seem to be going well at MemSQL, but I didn’t press for any details that I would be free to report.

8. Speaking of MemSQL, one would think that at some point something newer would replace Oracle et al. in the general-purpose RDBMS world, much as Unix and Linux grew to overshadow the powerful, secure, reliable, cumbersome IBM mainframe operating systems. On the other hand:

IBM blew away its mainframe competitors and had pretty close to a monopoly. But Oracle has some close and somewhat newer competitors in DB2 and Microsoft SQL Server. Therefore …
… upstarts have three behemoths to outdo, not just one.
MySQL, PostgreSQL and to some extent Sybase are still around as well.

Also, perhaps no replacement will be needed. If we subdivide the database management world into multiple categories including:

General-purpose RDBMS.
Analytic RDBMS.
NoSQL.
Non-relational analytic data stores (perhaps Hadoop-based).

it’s not obvious that the general-purpose RDBMS category on its own requires any new entrants to ever supplant the current leaders.

All that said – if any of the current new entrants do pull off the feat, SAP HANA is probably the best (longshot) guess to do so, and MemSQL the second-best.

9. If you’re a PostgreSQL user with performance or scalability concerns, you might want to check what Citus Data is doing.

Some stuff on my mind, September 28, 2014

Curt Monash — Mon, 29 Sep 2014 00:21:03 +0000

1. I wish I had some good, practical ideas about how to make a political difference around privacy and surveillance. Nothing else we discuss here is remotely as important. I presumably can contribute an opinion piece to, more or less, the technology publication(s) of my choice; that can have a small bit of impact. But I’d love to do better than that. Ideas, anybody?

2. A few thoughts on cloud, colocation, etc.:

The economies of scale of colocation-or-cloud over operating your own data center are compelling. Most of the reasons you outsource hardware manufacture to Asia also apply to outsourcing data center operation within the United States. (The one exception I can think of is supply chain.)
The arguments for cloud specifically over colocation are less persuasive. Colo providers can even match cloud deployments in rapid provisioning and elastic pricing, if they so choose.
Surely not coincidentally, I am told that Rackspace is deemphasizing cloud, reemphasizing colocation, and making a big deal out of Open Compute. In connection with that, Rackspace has pulled back from its leadership role in OpenStack.
I’m hearing much more mention of Amazon Redshift than I used to. It seems to have a lot of traction as a simple and low-cost option.
I’m hearing less about Elastic MapReduce than I used to, although I imagine usage is still large and growing.
In general, I get the impression that progress is being made in overcoming the inherent difficulties in cloud (and even colo) parallel analytic processing. But it all still seems pretty vague, except for the specific claims being made for traction of Redshift, EMR, and so on.
Teradata recently told me that in colocation pricing, it is common for floor space to be everything, with power not separately metered. But I don’t think that trend is a big deal, as it is not necessarily permanent.
Cloud hype is of course still with us.
Other than the above, I stand by my previous thoughts on appliances, clusters and clouds.

3. As for the analytic DBMS industry:

Concurrency is still a challenge. But otherwise …
… great SQL query performance isn’t something to get excited about any more, especially in immature systems.
Be careful about systems that have great performance when intermediate result sets fit into RAM, but not when they spill to disk. In particular, watch for this problem in the Hadoop/Spark world.
Vendors are getting better about ANSI SQL coverage (SQL 99 Analytics, windowing, etc. …)
“Runs on Hadoop” isn’t an exciting claim unless you can mix and match SQL and generic Hadoop processing in the same jobs against the same data, even though lesser forms of SQL/Hadoop integration might also with help some aspects of TCO (Total Cost of Ownership).
More generally, what’s needed is:
- The ability to mix SQL and other kinds of analytic processing.
- The ability to mix traditional tabular data, JSON, and log data.
- The ability to mix data in place with data that’s trickling/streaming in.

4. Meanwhile, the analytic ease of use story remains popular, in business intelligence and predictive analytics/data science alike. Marketers typically oversimplify it to their own detriment, however, just as they do performance stories.

5. On the short-request side:

NoSQL is still going gangbusters.
NewSQL still isn’t, except that I haven’t talked with MemSQL for a while and they were doing well when I did.
Transparent sharding has stagnated as a business, good technology notwithstanding, and the vendors are pivoting.

6. Finally, one vendor note — Sharmila assures me by brief email that things are going gangbusters at ClearStory. This is unsurprising, as ClearStory exemplifies several trends I believe in, including robust analytic stacks, strong data navigation, Spark, and the incorporation of broad varieties of data.

And of course ClearStory also empowers business analysts to make do without IT involvement, like the other cool analytic kids also do.